[Keywords]
[Abstract]
Code comments play an important role in software quality assurance: they improve the readability of source code and make it easier to understand, reuse, and maintain. However, for various reasons, developers sometimes do not add the necessary comments, so maintainers must spend a great deal of time understanding the code, which greatly reduces the efficiency of software maintenance. In recent years, many studies have used machine learning to generate code comments automatically. These methods extract semantic and structural information from the code and feed it into a sequence-to-sequence (seq2seq) neural model to generate the corresponding comments, and they have achieved good results. However, Hybrid-DeepCom, the state-of-the-art code comment generation model, still has two shortcomings. First, its preprocessing may break the code structure, so that the input information is inconsistent across instances and the model learns poorly. Second, owing to the limitations of the seq2seq model, it cannot generate out-of-vocabulary words (OOV words) in the comment. For example, identifiers such as variable names and method names that appear very rarely in the source code are usually OOV words, and comments are difficult to understand without them. To address these problems, this study proposes a new code comment generation model named CodePtr. On the one hand, a complete source code encoder is added to solve the problem of broken code structure; on the other hand, a pointer-generator network module is introduced to switch automatically between generating a word and copying a word at each decoding step. In particular, when the model encounters an identifier that appears only rarely in the input, it can copy that identifier directly to the output, which solves the problem of being unable to generate OOV words. Finally, CodePtr and Hybrid-DeepCom are compared experimentally on a large dataset. The results show that, with a vocabulary size of 30 000, CodePtr improves the translation performance metrics by 6% on average and improves the handling of OOV words by nearly 50%, which clearly demonstrates the effectiveness of the CodePtr model.
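The switch between generating and copying described in the abstract can be illustrated with the standard pointer-generator mixing rule, P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (sum of attention weights on source positions holding w). The sketch below is a minimal illustration of that general rule, not CodePtr's actual implementation: the function name `final_distribution`, the toy vocabulary, the source tokens, and all numeric values are assumptions chosen for the example, and in a trained model `p_gen`, the vocabulary distribution, and the attention weights would be predicted by the network at every decoding step.

```python
from typing import Dict, List


def final_distribution(
    p_gen: float,
    vocab_probs: Dict[str, float],
    attention: List[float],
    source_tokens: List[str],
) -> Dict[str, float]:
    """Mix the generation and copy distributions for one decoding step.

    Hypothetical helper for illustration only. The result is defined over an
    extended vocabulary: the fixed vocabulary plus every token of the input.
    """
    if len(attention) != len(source_tokens):
        raise ValueError("need one attention weight per source token")

    # Generation mode: scale the fixed-vocabulary distribution by p_gen.
    mixed = {word: p_gen * prob for word, prob in vocab_probs.items()}

    # Copy mode: add (1 - p_gen) * attention mass to each source token.
    # Tokens missing from the fixed vocabulary (OOV words) gain an entry here.
    for token, weight in zip(source_tokens, attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


# Toy decoding step (all values are made up for the example).
vocab_probs = {"returns": 0.5, "the": 0.3, "<unk>": 0.2}
source_tokens = ["int", "getUserCnt", "(", ")"]
attention = [0.1, 0.7, 0.1, 0.1]

dist = final_distribution(0.2, vocab_probs, attention, source_tokens)
best = max(dist, key=dist.get)
print(best, round(dist[best], 2))
```

In this toy step the rare identifier `getUserCnt` is absent from the fixed vocabulary, so a plain seq2seq decoder could only emit `<unk>`. Because `p_gen` is low and the attention is concentrated on that token, the mixed distribution places most of its probability on the copied identifier.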
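[CLC number]
[Funding]
National Natural Science Foundation of China (61802167, 61972197, 61802095); Natural Science Foundation of Jiangsu Province (BK20201250); sub-project of the cooperation agreement of the Huawei-Nanjing University Next-Generation Programming Innovation Laboratory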