Benchmarking Large Language Models for Software Testing Knowledge Q&A

CLC number:

TP311

Fund project:

National Natural Science Foundation of China (62472209); Key Research and Development Program of Jiangsu Province (BE2023025-2)

    Abstract:

    Large language models (LLMs) perform remarkably well on general tasks, but their trustworthiness, robustness, and usability in specialized domains have not been systematically evaluated. Taking the writing of software testing textbooks as a representative application scenario, this study constructs 700 test questions around 100 core testing concepts and methods and selects five representative LLMs to systematically evaluate their reading comprehension, question-answering (Q&A), and text generation capabilities. The experimental results show that the LLMs perform well on most questions, reaching a high level of accuracy, completeness, and fluency in their answers. However, reliability problems such as hallucination and reasoning bias persist when questions involve the state of research or complex concepts. Further analysis shows that LLM-generated content has a fairly clear advantage over traditional textbooks in knowledge coverage and educational value, and can effectively support the revision and teaching of software testing textbooks. This study not only reveals the specific capability boundaries and typical deficiencies of LLMs in processing specialized domain knowledge, but also provides empirical evidence and a methodological reference for promoting Q&A-driven intelligent evaluation methods in professional education and applications.

Cite this article

陈煜磊, 聂钰格, 吴化尧. Benchmarking Large Language Models for Software Testing Knowledge Q&A. Journal of Software, 2026, 37(8): 1-17

History
  • Received: 2025-09-08
  • Last revised: 2025-10-28
  • Published online: 2025-12-24
  • Publication date: 2026-08-06