Entity recognition is a key technology for information extraction. Compared with general-domain text, Chinese medical text contains a large number of nested entities. Previous approaches typically apply sequence labeling directly and ignore the entity nesting rules specific to medical text. To address this, a Chinese entity recognition method that incorporates entity nesting rules is proposed. During training, the method recasts entity recognition as the joint learning of two tasks: entity boundary recognition, and recognition of the head-tail relationship between boundaries. During decoding, it filters the decoded results with entity nesting rules summarized from real medical text, so that the recognized entities conform to the way inner and outer entities actually combine in such text. Experiments on a public medical text entity recognition dataset show that the proposed method significantly outperforms existing methods on nested entities and improves overall accuracy by 0.5% over the state-of-the-art method.
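The decoding idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the entity labels, the rule table `ALLOWED_NESTING`, the score threshold, and the function names `decode` and `filter_by_rules` are all hypothetical, and the boundary and head-tail scores are assumed to come from some trained model that is not shown. The sketch pairs predicted start and end boundaries whose head-tail score passes a threshold, then drops any candidate whose overlap with an already accepted entity is not a permitted (outer label, inner label) nesting.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # index of the first character (inclusive)
    end: int    # index of the last character (inclusive)
    label: str


# Hypothetical rule table: (outer label, inner label) pairs that may nest.
ALLOWED_NESTING = {("disease", "body"), ("symptom", "body")}


def overlaps(a: Span, b: Span) -> bool:
    return a.start <= b.end and b.start <= a.end


def contains(outer: Span, inner: Span) -> bool:
    same = (outer.start, outer.end) == (inner.start, inner.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same


def decode(starts, ends, link_scores, threshold=0.5):
    """Pair predicted boundaries whose head-tail score passes the threshold."""
    candidates = []
    for (s, e, label), score in link_scores.items():
        if s in starts and e in ends and s <= e and score >= threshold:
            candidates.append((score, Span(s, e, label)))
    return candidates


def filter_by_rules(candidates, allowed=ALLOWED_NESTING):
    """Accept spans greedily by score; reject overlaps the rules do not allow."""
    kept = []
    for _, span in sorted(candidates, key=lambda c: -c[0]):
        ok = True
        for other in kept:
            if not overlaps(span, other):
                continue
            if contains(other, span) and (other.label, span.label) in allowed:
                continue
            if contains(span, other) and (span.label, other.label) in allowed:
                continue
            ok = False  # partial overlap, or a nesting not in the rule table
            break
        if ok:
            kept.append(span)
    return sorted(kept)


# Toy model output for a six-character string (all values are made up).
starts, ends = {0, 3}, {2, 5}
link_scores = {
    (0, 5, "disease"): 0.9,
    (0, 2, "body"): 0.8,
    (3, 5, "drug"): 0.6,   # would nest inside "disease", which no rule allows
    (3, 5, "body"): 0.3,   # below the threshold
}
result = filter_by_rules(decode(starts, ends, link_scores))
print(result)
```

In this toy run the "body" span survives because a disease entity may contain a body-part entity under the assumed rules, while the "drug" span is filtered out. A greedy, score-ordered filter is only one possible way to apply such rules; the paper's actual filtering procedure may differ.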