Question-answering Capability Evaluation of Large Language Models on Linux Kernel Development Knowledge

doi:10.13328/j.cnki.jos.007597

CLC number: TP311
Funding: National Natural Science Foundation of China (62172099)



Abstract:

Large language models (LLMs) have shown strong potential in software development question-answering (QA) tasks, offering a new way to acquire and understand code knowledge. However, for complex system software such as the Linux kernel, their actual ability to explain code implementation, understand key mechanisms, trace evolutionary history, and analyze design decisions has not been systematically validated. Existing benchmarks mostly target general-purpose tasks and suffer from insufficient domain depth, gradually saturating difficulty, and questions that diverge from engineering practice, so they cannot guarantee an objective, accurate, and comprehensive evaluation of domain-specific development knowledge QA. To evaluate the QA capability of LLMs on complex system software objectively, this study proposes a method for constructing QA evaluation benchmark datasets, builds LKQABench, a high-quality QA benchmark for the Linux kernel, and designs MJ-CCE, a multi-judge collaborative evaluation method for code knowledge QA. LKQABench is built from real technical QA data from developer communities, refined through semantic analysis and manual review and revision, and contains 202 standard QA pairs covering the major Linux kernel modules and different cognitive dimensions. MJ-CCE defines a collaborative scoring and voting mechanism among multiple judge LLMs and evaluates answers along dimensions including key-point coverage, factual correctness, and clarity of expression. An empirical study of mainstream LLMs on LKQABench shows that current models answer single-point knowledge questions about kernel implementation fairly well, but on questions involving cross-topic knowledge integration, deep reasoning, and version-evolution links they show shortcomings such as missing key points and incomplete reasoning chains. The study delineates the capability boundaries of LLMs in software development knowledge QA and provides empirical data to support their continued optimization in this domain.
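
The abstract describes MJ-CCE only at a high level: several judge models score an answer on multiple dimensions and then vote. The Python sketch below illustrates one way such a scheme could be wired together. It is not the authors' implementation. The 1-5 score scale, the per-dimension median, the pass threshold, the majority rule, and all names (`JudgeScore`, `aggregate`, `pass_threshold`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import median

# Dimensions named in the abstract; the 1-5 scale used below is an assumption.
DIMENSIONS = ("coverage", "correctness", "clarity")


@dataclass(frozen=True)
class JudgeScore:
    """Scores that one (hypothetical) judge model assigns to one answer."""
    judge: str
    coverage: float
    correctness: float
    clarity: float

    def mean(self) -> float:
        return sum(getattr(self, d) for d in DIMENSIONS) / len(DIMENSIONS)


def aggregate(scores: list[JudgeScore], pass_threshold: float = 3.0) -> dict:
    """Combine judge scores: per-dimension median plus a majority vote.

    Both rules are illustrative assumptions, not the rules defined by MJ-CCE.
    """
    if not scores:
        raise ValueError("at least one judge score is required")
    # The median damps the effect of a single outlier judge on each dimension.
    per_dimension = {d: median(getattr(s, d) for s in scores) for d in DIMENSIONS}
    # A judge votes for the answer when its mean score reaches the threshold.
    votes_for = sum(1 for s in scores if s.mean() >= pass_threshold)
    return {
        "per_dimension": per_dimension,
        "votes_for": votes_for,
        "accepted": votes_for * 2 > len(scores),  # strict majority
    }


if __name__ == "__main__":
    example = [
        JudgeScore("judge_a", coverage=4, correctness=5, clarity=4),
        JudgeScore("judge_b", coverage=2, correctness=2, clarity=4),
        JudgeScore("judge_c", coverage=3, correctness=4, clarity=5),
    ]
    print(aggregate(example))
```

In this sketch a single harsh judge (`judge_b`) lowers neither the median scores nor the outcome of the vote, which is the usual motivation for using several judges instead of one.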


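Cite this article

欧闻毅 (Ou Wenyi), 吴毅坚 (Wu Yijian), 黄宸一 (Huang Chenyi), 彭鑫 (Peng Xin). Question-answering Capability Evaluation of Large Language Models on Linux Kernel Development Knowledge. Journal of Software (软件学报), 2026, 37(8): 1-28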


History

Received: 2025-09-08
Last revised: 2025-10-28
Published online: 2025-12-24
Publication date: 2026-08-06
