Abstract:In recent years, code generation based on large language models has been widely applied in software development. However, due to the inherent stochasticity of model outputs, ensuring the correctness of generated code remains a critical challenge. Formal verification provides a rigorous means to guarantee software correctness, but it requires both source code and corresponding requirements to be formalized into the specification language of verification tools. Dafny is a verification-oriented programming language equipped with automated verification support. From this perspective, this study explores the ability of large language models (LLMs) to translate Python code into Dafny. To this end, a prompt-based translation generation method is proposed, which consists of static translation and dynamic repair, together with a program-testing-based evaluation method. The static translation method employs three progressively refined prompting templates-namely, a basic template, a Dafny example-based template, and a Python-to-Dafny contrastive example-based template—to generate translations in a single step. The dynamic repair method iteratively invokes LLMs to refine translations by incorporating error messages produced during the previous round of code testing into the prompt. For evaluation, when no existing test suites are available, test cases are automatically generated using LLMs. The translated Dafny programs are then verified using Dafny testing tools in conjunction with the generated test cases to assess translation correctness. Experiments are conducted on HumanEval, MBPP, and LeetCode datasets using representative large language models, including GPT-4o and DeepSeek-V3. Experimental results demonstrate that the contrastive example-based prompting template and the dynamic repair method effectively improve the performance of LLMs on the Python-to-Dafny code translation task. Furthermore, factors influencing translation quality are analyzed based on experimental observations, and the evaluation methodology and criteria are discussed.