Logical Reasoning Testing of Intelligent Question Answering System

CLC number: TP311

Fund project: National Natural Science Foundation of China (62202324, 62322208, 62472310)


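    Abstract (Chinese version, translated):

    Intelligent question answering (QA) systems use information retrieval and natural language processing techniques to answer questions automatically. However, like other artificial intelligence software, intelligent QA systems also contain defects. A defective intelligent QA system degrades user experience, causes economic losses for enterprises, and can even trigger panic at the societal level. It is therefore essential to detect and repair defects in intelligent QA systems promptly. Current automated testing methods for intelligent QA systems fall into two categories. The first synthesizes hypothetical facts from a question and its predicted answer, then generates new questions and expected answers from those facts to reveal defects in the QA system. The second extracts knowledge fragments from existing datasets that do not affect the answer to the original question and merges them into the original test input, producing new test inputs with the same answer that are then used to detect defects. However, both categories focus on testing the model's semantic understanding and do not adequately test its logical reasoning ability. In addition, they rely, respectively, on the QA system's answer paradigm and on the model's own dataset to generate new test cases, which limits their testing effectiveness on QA systems based on large language models. To address these challenges, a logic-guided metamorphic testing technique, QALT, is proposed. QALT designs three logic-related metamorphic relations and uses techniques such as semantic similarity measurement and dependency parsing to guide the generation of high-quality test cases, enabling precise testing of intelligent QA systems. Experimental results show that QALT detects a total of 9247 defects on two types of intelligent QA systems, 3150 and 3897 more than the two current state-of-the-art techniques (QAQA and QAAskeR), respectively. Based on statistical analysis of manually sampled and labeled results, the expected total number of true-positive defects detected by QALT on the two intelligent QA systems is 8073, an expected 2142 and 4867 more than QAQA and QAAskeR, respectively. In addition, the test inputs generated by QALT are used to repair defects in the software under test through model fine-tuning. After fine-tuning, the model's error rate drops from 22.33% to 14.37%.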

    Abstract:

    Intelligent question answering (QA) systems use information retrieval and natural language processing techniques to deliver automated responses to user inquiries. Like other artificial intelligence software, intelligent QA systems are prone to bugs. These bugs can degrade user experience, cause financial losses, or even trigger social panic. Therefore, it is crucial to detect and fix bugs in intelligent QA systems promptly. Existing automated testing approaches fall into two categories. The first synthesizes hypothetical facts from questions and predicted answers, then generates new questions and expected answers to detect bugs. The second generates semantically equivalent test inputs by injecting knowledge from existing datasets, ensuring that the answer to the question remains unchanged. However, both approaches have limitations in practical use. They rely heavily on the intelligent QA system's output or training set, which results in poor testing effectiveness and generalization, especially for large-language-model-based intelligent QA systems. Moreover, they primarily assess semantic understanding while neglecting the logical reasoning capabilities of intelligent QA systems. To address this gap, a logic-guided testing technique named QALT is proposed. It designs three logic-related metamorphic relations (MRs) and uses semantic similarity measurement and dependency parsing to generate high-quality test cases. The experimental results show that QALT detects a total of 9247 bugs in two different intelligent QA systems, which is 3150 and 3897 more than the two current state-of-the-art techniques (i.e., QAQA and QAAskeR), respectively. Based on the statistical analysis of manually labeled results, QALT detects approximately 8073 true-positive bugs, which is 2142 more than QAQA and 4867 more than QAAskeR. Moreover, when the test inputs generated by QALT are used to fine-tune the intelligent QA system under test, the MR violation rate falls from 22.33% to 14.37%.
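
    The abstract does not spell out QALT's three metamorphic relations, so the following is only a minimal sketch of how a logic-related MR check and an MR violation rate could be computed. The negation relation used here (a yes/no question and its logical negation should receive opposite answers) is an illustrative assumption, not necessarily one of QALT's relations. The names `MRCase`, `violates_negation_mr`, `violation_rate`, `toy_qa`, and `negation_aware_qa` are all hypothetical, and the two QA functions are toy stand-ins for a real system under test.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a QA system maps (context, question) to an answer.
QASystem = Callable[[str, str], str]

STOPWORDS = {"is", "the", "of", "not", "a"}


@dataclass
class MRCase:
    context: str
    question: str          # source yes/no question
    negated_question: str  # follow-up question: logical negation of the source


def normalize(answer: str) -> str:
    """Lower-case the answer and drop surrounding whitespace and a final period."""
    return answer.strip().lower().rstrip(".")


def violates_negation_mr(qa: QASystem, case: MRCase) -> bool:
    """Assumed MR: negating a yes/no question must flip the answer.

    Answers other than yes/no are counted as violations in this sketch.
    """
    source = normalize(qa(case.context, case.question))
    follow_up = normalize(qa(case.context, case.negated_question))
    expected = {"yes": "no", "no": "yes"}.get(source)
    return follow_up != expected


def violation_rate(qa: QASystem, cases: List[MRCase]) -> float:
    """Fraction of test cases on which the MR is violated."""
    return sum(violates_negation_mr(qa, c) for c in cases) / len(cases)


def _keywords(question: str) -> List[str]:
    words = [w.strip("?.,").lower() for w in question.split()]
    return [w for w in words if w not in STOPWORDS]


def toy_qa(context: str, question: str) -> str:
    """Toy system that matches keywords and ignores negation (a logic bug)."""
    ctx = context.lower()
    return "Yes" if all(k in ctx for k in _keywords(question)) else "No"


def negation_aware_qa(context: str, question: str) -> str:
    """Toy system that flips the keyword-match answer when 'not' is present."""
    base = toy_qa(context, question) == "Yes"
    negated = "not" in [w.strip("?.,").lower() for w in question.split()]
    return "Yes" if base != negated else "No"


CASES = [
    MRCase("Paris is the capital of France.",
           "Is Paris the capital of France?",
           "Is Paris not the capital of France?"),
    MRCase("Paris is the capital of France.",
           "Is Berlin the capital of France?",
           "Is Berlin not the capital of France?"),
]

if __name__ == "__main__":
    print("toy_qa violation rate:", violation_rate(toy_qa, CASES))
    print("negation_aware_qa violation rate:",
          violation_rate(negation_aware_qa, CASES))
```

    The check needs no ground-truth labels: it compares only the system's own answers to the source and follow-up questions, which is what lets metamorphic testing run on unlabeled inputs. A violation rate computed this way is the kind of quantity the abstract reports as falling from 22.33% to 14.37% after fine-tuning.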

Cite this article

沈庆超, 李行健, 姜佳君, 陈俊洁, 齐一先, 王赞. Logical Reasoning Testing of Intelligent Question Answering System. 软件学报 (Journal of Software),,():1-20

History
  • Received: 2024-01-30
  • Last revised: 2024-06-30
  • Published online: 2025-07-23