Abstract: Large language models (LLMs) have shown great potential in software development question-answering (QA) tasks, offering new ways to acquire and understand code knowledge. However, for complex system software such as the Linux kernel, the actual capabilities of LLMs in code implementation, understanding key mechanisms, tracing evolutionary history, and analyzing design decisions remain insufficiently validated. Existing benchmarks mainly target general-purpose tasks and suffer from insufficient domain depth, difficulty saturation, and misalignment with real engineering practice, so they cannot support an objective, accurate, and comprehensive evaluation of domain-specific development knowledge QA. To objectively evaluate the QA capabilities of LLMs on complex system software, this study proposes a method for constructing benchmark datasets for LLM QA capability evaluation, constructs a high-quality QA benchmark for the Linux kernel (LKQABench), and designs a multi-judge collaborative code knowledge QA evaluation method (MJ-CCE). LKQABench is built from real technical QA data from developer communities and refined through semantic analysis and human review, yielding 202 standard QA pairs that cover major Linux kernel subsystems and multiple cognitive dimensions. MJ-CCE defines a collaborative scoring and voting mechanism among multiple judge models, evaluating answers along three dimensions: key-point coverage, factual correctness, and clarity of expression. Experiments on LKQABench show that current LLMs perform satisfactorily on single-point knowledge questions about kernel implementation but exhibit significant shortcomings, such as missing key points and incomplete reasoning chains, on questions that require cross-topic integration, deep reasoning, or knowledge of version evolution.
This study not only delineates the capability boundaries of LLMs in software development knowledge QA but also provides empirical evidence to support their continuous optimization in this domain.