National Natural Science Foundation of China (61972184, 62272205, 62272206, 62076112)
Hierarchical topic models are an important tool for building topic hierarchies. Most existing hierarchical topic models introduce the nested Chinese restaurant process (nCRP) construction into a topic model to provide a tree-structured prior distribution over document topics, but they cannot generate a topic hierarchy with clear domain meaning, referred to here as a domain topic hierarchy. Moreover, domain topics are related not only hierarchically: sub-topics under different parent topics may also share sub-domain aspects, and no existing model of topic relationships generates a domain topic hierarchy of this kind. To mine both the hierarchical and the associative relationships of domain topics automatically and effectively from domain texts, this study makes four contributions. First, it improves the nCRP construction with a topic sharing mechanism and proposes the nCRP+ hierarchical construction method, which gives the topics of a topic model a tree-structured prior distribution with hierarchical topic aspect sharing. Second, it combines nCRP+ with the HDP model to build a reallocated hierarchical Dirichlet process and proposes the rHDP (reallocated hierarchical Dirichlet processes) hierarchical topic model. Third, it defines domain knowledge by combining domain taxonomy information, word semantics, and the domain representativeness of topic words; this knowledge comprises a voting-based domain membership degree, the semantic relevance between words and domain topics, and a hierarchical topic-word contribution degree. Finally, it uses this domain knowledge to improve how domain topics and topic words are allocated in the rHDP model, proposes rHDP_DK (rHDP with domain knowledge), a hierarchical topic model that incorporates domain knowledge, and improves the sampling process. Experimental results show that the hierarchical topic models based on nCRP+ outperform both the nCRP-based hierarchical topic models (hLDA and nHDP) and the neural topic model (TSNTM) on all evaluation metrics, and that the topic hierarchies generated by rHDP_DK have a clear domain topic hierarchy and distinct domain differences among the topic words of related sub-topics. In addition, the model will provide a general framework for automatically mining domain topic hierarchies.
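For readers unfamiliar with the tree-structured prior mentioned above, the following is a minimal sketch of how the standard nCRP assigns a document to a root-to-leaf path. It illustrates only the baseline nCRP, not the nCRP+ construction or the rHDP sampler proposed in this work. The function name `sample_ncrp_path`, the concentration parameter `gamma`, and the dictionary-based tree representation are assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_ncrp_path(child_counts, depth, gamma, rng):
    """Draw one path of `depth` levels below the root from a nested CRP.

    child_counts maps a node (the tuple of child indices leading to it)
    to a dict {child_index: number_of_documents_that_chose_it}.
    The counts are updated in place so later draws see this document.
    """
    path = ()
    for _ in range(depth):
        counts = child_counts[path]
        total = sum(counts.values())
        # Existing child c is chosen with probability n_c / (total + gamma);
        # a new child is opened with probability gamma / (total + gamma).
        r = rng.random() * (total + gamma)
        chosen = None
        for child, n in counts.items():
            if r < n:
                chosen = child
                break
            r -= n
        if chosen is None:
            chosen = len(counts)  # children are numbered 0, 1, 2, ...
        counts[chosen] = counts.get(chosen, 0) + 1
        path = path + (chosen,)
    return path


# Hypothetical usage: assign 50 documents to paths in a 3-level tree.
rng = random.Random(0)
tree = defaultdict(dict)
paths = [sample_ncrp_path(tree, depth=3, gamma=1.0, rng=rng) for _ in range(50)]
```

Because each document follows exactly one path, two sub-topics under different parents can never be linked in this baseline; that limitation is what the topic sharing mechanism described in the abstract is meant to address.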