基于大语言模型的Python到Dafny 代码翻译
CSTR:
作者:
作者单位:

作者简介:

通讯作者:

中图分类号:

TP311

基金项目:


Translation from Python into Dafny Based on Large Language Models
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
  • |
  • 文章评论
    摘要:

    近年来, 基于大模型的代码生成被广泛应用于软件开发领域. 然而, 由于大模型生成结果具有一定的随机性, 如何保证生成代码的正确性成为重中之重. 形式化验证是保障软件正确性的一种有力手段, 但验证前需要将代码及相应需求形式化为相关工具的规约语言. Dafny是一种可验证的编程语言, 并有相应的自动化验证工具支持. 旨在探索大模型在将Python代码翻译为Dafny语言方面的能力. 为此, 提出了基于提示词的静态翻译和动态修复的翻译生成方法, 以及基于程序测试的翻译结果评估方法. 静态翻译方法考虑3种渐进的提示词模板(基础模板、Dafny示例模板、对照示例模板), 使用大模型一次性生成翻译; 动态修复方法多次调用大模型对翻译进行修复, 每次将上一轮代码测试的错误信息加入提示词. 评估时, 在没有现成测试用例集的情况下, 利用大模型生成测试用例集; 对于翻译结果, 应用Dafny 测试工具结合测试用例集进行测试, 以评估其正确性. 在GPT-4o和DeepSeek-V3等大模型上使用HumanEval、MBPP和LeetCode等数据集进行实验. 结果表明, 使用对照示例模板和动态修复方法能够有效提升大模型在Python到Dafny翻译任务上的表现. 此外, 还基于实验结果分析影响翻译质量的因素, 并对评估方法与标准进行讨论.

    Abstract:

    In recent years, code generation based on large language models has been widely applied in software development. However, due to the inherent stochasticity of model outputs, ensuring the correctness of generated code remains a critical challenge. Formal verification provides a rigorous means to guarantee software correctness, but it requires both source code and corresponding requirements to be formalized into the specification language of verification tools. Dafny is a verification-oriented programming language equipped with automated verification support. From this perspective, this study explores the ability of large language models (LLMs) to translate Python code into Dafny. To this end, a prompt-based translation generation method is proposed, which consists of static translation and dynamic repair, together with a program-testing-based evaluation method. The static translation method employs three progressively refined prompting templates-namely, a basic template, a Dafny example-based template, and a Python-to-Dafny contrastive example-based template—to generate translations in a single step. The dynamic repair method iteratively invokes LLMs to refine translations by incorporating error messages produced during the previous round of code testing into the prompt. For evaluation, when no existing test suites are available, test cases are automatically generated using LLMs. The translated Dafny programs are then verified using Dafny testing tools in conjunction with the generated test cases to assess translation correctness. Experiments are conducted on HumanEval, MBPP, and LeetCode datasets using representative large language models, including GPT-4o and DeepSeek-V3. Experimental results demonstrate that the contrastive example-based prompting template and the dynamic repair method effectively improve the performance of LLMs on the Python-to-Dafny code translation task. Furthermore, factors influencing translation quality are analyzed based on experimental observations, and the evaluation methodology and criteria are discussed.

    参考文献
    相似文献
    引证文献
引用本文

杜奕成,卢奕函,朱雪阳,张文辉.基于大语言模型的Python到Dafny 代码翻译.软件学报,2026,37(9):1-17

复制
相关视频

分享
文章指标
  • 点击次数:
  • 下载次数:
  • HTML阅读次数:
  • 引用次数:
历史
  • 收稿日期:2025-09-08
  • 最后修改日期:2025-10-28
  • 录用日期:
  • 在线发布日期: 2025-12-24
  • 出版日期: 2026-09-06
文章二维码
您是第位访问者
版权所有:中国科学院软件研究所 京ICP备05046678号-3
地址:北京市海淀区中关村南四街4号,邮政编码:100190
电话:010-62562563 传真:010-62562533 Email:jos@iscas.ac.cn
技术支持:北京勤云科技发展有限公司

京公网安备 11040202500063号