Abstract: Software programming assistants based on large language models (LLMs), such as Copilot, significantly enhance programmer productivity. However, LLMs have large computing and storage requirements and are difficult to deploy locally. Building a lightweight, small LLM can meet computing, storage, and deployment requirements, but it incurs a greater accuracy loss in code generation than large LLMs. Knowledge distillation (KD) techniques allow small LLMs (student models) to approximate the output distributions of large LLMs (teacher models) on target training datasets, thus reducing the accuracy loss in code generation. Cutting-edge KD techniques are based on the Kullback-Leibler (KL) divergence loss function, which measures and reduces the accuracy loss caused by discrepancies between the output distributions of the student and teacher models. However, student models trained with the KL divergence struggle to learn in the near-zero-probability regions of the teacher distribution, so researchers have employed the Reverse KL (RKL) divergence loss function to address this issue. This study finds that RKL in turn faces learning difficulties in high-probability regions and is therefore complementary to the KL divergence loss function. Moreover, for some datasets, low-quality teacher outputs lead to poor learning outcomes for the student model. This study proposes an adaptive knowledge distillation (AKD) method that uses prompts to improve the quality of teacher model outputs and constructs an adaptive loss function that adjusts learning priorities according to the distributional differences between the student and teacher models, ensuring that the student model learns effectively in both high-probability and near-zero-probability regions. Using the AKD method, this study trains a lightweight code generation model based on StarCoder-1B/7B (student/teacher models) and the CodeAlpaca dataset, and evaluates accuracy loss and code quality.
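The paper's exact adaptive loss is not reproduced in the abstract; the Python sketch below only illustrates the asymmetry it exploits. Forward KL penalizes the student wherever the teacher has mass (so near-zero teacher regions contribute little), while reverse KL is mode-seeking (so missed high-probability teacher modes contribute little). The `adaptive_distill_loss` weighting by underestimated teacher mass is a hypothetical illustration, not the method from the paper.

```python
import math

def forward_kl(p, q, eps=1e-12):
    """KL(p || q): terms vanish where teacher p is near zero, so the
    student under-learns those regions (the issue RKL was introduced for)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def reverse_kl(p, q, eps=1e-12):
    """KL(q || p): mode-seeking; the student is only penalized where it
    already places mass, so high-probability teacher modes it missed
    contribute little (the complementary weakness noted in the study)."""
    return sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))

def adaptive_distill_loss(p, q):
    """Hypothetical adaptive combination (illustration only): weight forward
    KL more when the student underestimates the teacher's high-probability
    regions, and reverse KL more otherwise."""
    # Total teacher probability mass the student underestimates, in [0, 1].
    alpha = sum(max(pi - qi, 0.0) for pi, qi in zip(p, q))
    return alpha * forward_kl(p, q) + (1.0 - alpha) * reverse_kl(p, q)

# Example: teacher concentrates on token 0, student is more diffuse.
teacher = [0.7, 0.2, 0.1]
student = [0.3, 0.3, 0.4]
loss = adaptive_distill_loss(teacher, student)
```

In a real distillation setup, `p` and `q` would be next-token softmax distributions over the vocabulary at each position, and the loss would be averaged over the sequence.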
Experimental results show that the lightweight model is 85.7% smaller. On the HumanEval and MBPP datasets, prompts with clear instructions improve the teacher model's code generation quality, reducing the trained student model's average accuracy loss by 6%. The AKD-trained model's average accuracy loss relative to the teacher model (StarCoder-7B) is 17.14%, a 30.6% reduction compared with the original student model, and is on average 19.9% lower than that of state-of-the-art KD and RKL methods. Regarding inference memory, the KD and RKL methods require 54.7 GB, while the AKD method requires only 3 GB more. In terms of training time, the AKD method incurs a 30% increase; however, even when the KD and RKL methods are trained for the same duration, their average performance improves by only 3%, which is 16.9% lower than that of the AKD method. Therefore, the additional training cost of the AKD method is justified. Moreover, applying the AKD method to the CodeLlama and CodeGen model series reduces accuracy loss by an average of 19.2% compared with state-of-the-art KD and RKL methods, demonstrating the generalizability of the AKD method.