Translation from Python into Dafny via Large Language Model


    Abstract:

    In recent years, code generation based on large language models (LLMs) has been widely used in software development. However, due to the inherent randomness of LLM outputs, ensuring the correctness of the generated code has become a top priority. Formal verification is a powerful means of assuring software correctness, but before verification the code and the corresponding requirements must be formalized in the specification language of the verification tool. Dafny is a verification-aware programming language with built-in specification constructs, supported by automated verification tools. This study explores the ability of LLMs to translate Python code into Dafny. To this end, it proposes a prompt-based translation generation method combining static translation with dynamic repair, together with a program-testing-based method for evaluating the translation results. The static translation method considers three progressively refined prompting templates (Basic Prompting, Dafny Few-shot Prompting, and Python2Dafny Few-shot Prompting) and generates a translation in a single LLM call; the dynamic repair method invokes the LLM repeatedly to repair the translation, adding the error messages from the previous round of testing to the prompt in each iteration. For evaluation, when no existing test-case dataset is available, test cases are generated by prompting the LLM; the translation results are then run against these test cases with the Dafny testing tool to assess their correctness. Experiments were conducted with LLMs including GPT-4o and DeepSeek-V3 on the HumanEval, MBPP, and LeetCode datasets. The results show that the Python2Dafny Few-shot Prompting template and the dynamic repair method effectively improve LLM performance on the Python-to-Dafny translation task. Furthermore, this study analyzes the factors that affect translation quality based on the experimental results and discusses the evaluation methodology and criteria.
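To make the translation task concrete, here is an illustrative Python/Dafny pair of the kind the abstract describes. The Dafny method (`AbsVal`, with its `ensures` postconditions) is a plausible hand-written counterpart sketched for illustration, not output from the paper's pipeline.

```python
# The Python source to be translated: a small, fully typed function.
def abs_val(x: int) -> int:
    """Absolute value of an integer."""
    return -x if x < 0 else x

# A plausible Dafny counterpart (illustrative, not model output): the
# behavior is captured as machine-checkable postconditions.
DAFNY_ABS = """\
method AbsVal(x: int) returns (r: int)
  ensures r >= 0
  ensures r == x || r == -x
{
  if x < 0 { r := -x; } else { r := x; }
}
"""
```

The point of the example is that the Dafny version carries an explicit specification, so the Dafny verifier can check the implementation against it automatically.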
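The three progressively refined prompting templates used by the static translation method can be sketched as follows. The template wording and the `build_prompt` helper are illustrative assumptions; the paper's actual prompts are not reproduced here.

```python
# Basic Prompting: only the task instruction plus the source code.
BASIC_TEMPLATE = (
    "Translate the following Python function into an equivalent, "
    "verifiable Dafny method:\n{python_code}"
)

# Dafny Few-shot Prompting: additionally show a standalone Dafny example.
DAFNY_FEWSHOT_TEMPLATE = (
    "Example Dafny method with pre/postconditions:\n{dafny_example}\n\n"
    + BASIC_TEMPLATE
)

# Python2Dafny Few-shot Prompting: show a paired Python-to-Dafny example,
# the most refined of the three templates.
PAIRED_FEWSHOT_TEMPLATE = (
    "Example Python function:\n{python_example}\n"
    "Its Dafny translation:\n{dafny_example}\n\n"
    + BASIC_TEMPLATE
)

def build_prompt(template: str, **fields: str) -> str:
    """Fill a template's placeholders with concrete code snippets."""
    return template.format(**fields)
```

Each template strictly extends the previous one, which is what makes the comparison in the experiments "progressive": any gain can be attributed to the added example material.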
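The dynamic repair method described in the abstract can be sketched as a simple feedback loop. Here `run_tests` and `llm_repair` are hypothetical callables (e.g. a wrapper around the Dafny testing tool and an LLM API call); only the loop structure itself is shown.

```python
from typing import Callable, Tuple

def dynamic_repair(
    dafny_code: str,
    run_tests: Callable[[str], Tuple[bool, str]],
    llm_repair: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Iteratively repair a Dafny translation until its tests pass.

    Each round, the error messages from the previous test run are fed
    back to the LLM together with the current code, as the abstract
    describes.
    """
    for _ in range(max_rounds):
        ok, errors = run_tests(dafny_code)
        if ok:
            return dafny_code, True
        dafny_code = llm_repair(dafny_code, errors)
    return dafny_code, run_tests(dafny_code)[0]
```

Bounding the number of rounds keeps the cost of repeated LLM calls predictable while still letting test feedback correct most shallow translation errors.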

Cite this article:

杜奕成, 卢奕函, 朱雪阳, 张文辉. Translation from Python into Dafny via Large Language Model. 软件学报 (Journal of Software), 2026, 37(9):

History
  • Received: 2025-09-08
  • Revised: 2025-10-28
  • Published online: 2025-12-24