Abstract: Large language models (LLMs) have demonstrated remarkable performance on general tasks. However, their trustworthiness, robustness, and applicability in specialized domains remain insufficiently assessed. Using the compilation of software testing textbooks as a representative application scenario, this study constructs 700 carefully designed test questions covering 100 core testing concepts and methods, and systematically assesses five representative LLMs on reading comprehension, question answering (Q&A), and text generation. The experimental results indicate that the LLMs perform strongly on most questions, achieving high levels of accuracy, completeness, and fluency. However, reliability issues such as hallucination and reasoning bias persist, particularly when the models address current research trends and complex concepts. Further analysis reveals that LLM-generated content provides broader knowledge coverage and greater educational value than traditional textbooks, offering effective support for revising and teaching software testing materials. This study delineates the specific capability boundaries and typical deficiencies of LLMs in processing domain knowledge, and it provides empirical evidence and methodological insights for advancing Q&A-driven intelligent evaluation in professional education and applications.