Translation from Python into Dafny via Large Language Model


    Abstract:

    In recent years, code generation based on large language models (LLMs) has been widely used in software development. However, due to the inherent randomness of LLM outputs, ensuring the correctness of the generated code has become a top priority. Formal verification is a powerful means of assuring software correctness, but before verification the code and the corresponding requirements must be formalized in the specification language of the verification tool. Dafny is a verification-aware programming language with built-in specification constructs, supported by automated verification tools. This study explores the ability of LLMs to translate Python code into Dafny. To this end, it proposes a prompt-based translation generation method combining static translation with dynamic repair, together with a program-testing-based method for evaluating the translation results. The static translation method considers three progressively refined prompting templates (Basic Prompting, Dafny Few-shot Prompting, and Python2Dafny Few-shot Prompting) and generates a translation in a single LLM call; the dynamic repair method invokes the LLM repeatedly to repair the translation, adding the error messages from the previous round of testing to the prompt in each iteration. For evaluation, when no existing test-case dataset is available, test cases are generated by prompting the LLM; the translation results are then run against these test cases with the Dafny testing tool to assess their correctness. Experiments were conducted with LLMs including GPT-4o and DeepSeek-V3 on the HumanEval, MBPP, and LeetCode datasets. The results show that the Python2Dafny Few-shot Prompting template and the dynamic repair method effectively improve LLM performance on the Python-to-Dafny translation task. Furthermore, this study analyzes the factors that affect translation quality based on the experimental results and discusses the evaluation methodology and criteria.
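To make the translation task concrete, here is an illustrative Python/Dafny pair of the kind the abstract describes. The Dafny method (`AbsVal`, with its `ensures` postconditions) is a plausible hand-written counterpart sketched for illustration, not output from the paper's pipeline.

```python
# The Python source to be translated: a small, fully typed function.
def abs_val(x: int) -> int:
    """Absolute value of an integer."""
    return -x if x < 0 else x

# A plausible Dafny counterpart (illustrative, not model output): the
# behavior is captured as machine-checkable postconditions.
DAFNY_ABS = """\
method AbsVal(x: int) returns (r: int)
  ensures r >= 0
  ensures r == x || r == -x
{
  if x < 0 { r := -x; } else { r := x; }
}
"""
```

The point of the example is that the Dafny version carries an explicit specification, so the Dafny verifier can check the implementation against it automatically.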
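The three progressively refined prompting templates used by the static translation method can be sketched as follows. The template wording and the `build_prompt` helper are illustrative assumptions; the paper's actual prompts are not reproduced here.

```python
# Basic Prompting: only the task instruction plus the source code.
BASIC_TEMPLATE = (
    "Translate the following Python function into an equivalent, "
    "verifiable Dafny method:\n{python_code}"
)

# Dafny Few-shot Prompting: additionally show a standalone Dafny example.
DAFNY_FEWSHOT_TEMPLATE = (
    "Example Dafny method with pre/postconditions:\n{dafny_example}\n\n"
    + BASIC_TEMPLATE
)

# Python2Dafny Few-shot Prompting: show a paired Python-to-Dafny example,
# the most refined of the three templates.
PAIRED_FEWSHOT_TEMPLATE = (
    "Example Python function:\n{python_example}\n"
    "Its Dafny translation:\n{dafny_example}\n\n"
    + BASIC_TEMPLATE
)

def build_prompt(template: str, **fields: str) -> str:
    """Fill a template's placeholders with concrete code snippets."""
    return template.format(**fields)
```

Each template strictly extends the previous one, which is what makes the comparison in the experiments "progressive": any gain can be attributed to the added example material.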
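The dynamic repair method described in the abstract can be sketched as a simple feedback loop. Here `run_tests` and `llm_repair` are hypothetical callables (e.g. a wrapper around the Dafny testing tool and an LLM API call); only the loop structure itself is shown.

```python
from typing import Callable, Tuple

def dynamic_repair(
    dafny_code: str,
    run_tests: Callable[[str], Tuple[bool, str]],
    llm_repair: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Iteratively repair a Dafny translation until its tests pass.

    Each round, the error messages from the previous test run are fed
    back to the LLM together with the current code, as the abstract
    describes.
    """
    for _ in range(max_rounds):
        ok, errors = run_tests(dafny_code)
        if ok:
            return dafny_code, True
        dafny_code = llm_repair(dafny_code, errors)
    return dafny_code, run_tests(dafny_code)[0]
```

Bounding the number of rounds keeps the cost of repeated LLM calls predictable while still letting test feedback correct most shallow translation errors.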

Cite this article:

杜奕成, 卢奕函, 朱雪阳, 张文辉. Translation from Python into Dafny via Large Language Model. 软件学报 (Journal of Software), 2026, 37(9):

History
  • Received: 2025-09-08
  • Revised: 2025-10-28
  • Published online: 2025-12-24