Dynamically Transfer Entity Span Information for Cross-domain Chinese Named Entity Recognition
Author:
Affiliation:

Clc Number:

TP391

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Boundaries identification of Chinese named entities is a difficult problem because of no separator between Chinese texts. Furthermore, the lack of well-marked NER data makes Chinese named entity recognition (NER) tasks more challenging in vertical domains, such as clinical domain and financial domain. To address aforementioned issues, this study proposes a novel cross-domain Chinese NER model by dynamically transferring entity span information (TES-NER). The cross-domain shared entity span information is transferred from the general domain (source domain) with sufficient corpus to the Chinese NER model on the vertical domain (target domain) through a dynamic fusion layer based on the gate mechanism, where the entity span information is used to represent the scope of the Chinese named entities. Specifically, TES-NER first introduces a cross-domain shared entity span recognition module based on a bidirectional long short-term memory (BiLSTM) layer and a fully connected neural network (FCN) which are used to identify the cross-domain shared entity span information to determine the boundaries of the Chinese named entities. Then, a Chinese NER module is constructed to identify the domain-specific Chinese named entities by applying independent BiLSTM with conditional random field models (BiLSTM-CRF). Finally, a dynamic fusion layer is designed to dynamically determine the amount of the cross-domain shared entity span information extracted from the entity span recognition module, which is used to transfer the knowledge to the domain-specific NER model through the gate mechanism. This study sets the general domain (source domain) dataset as the news domain dataset (MSRA) with sufficient labeled corpus, while the vertical domain (target domain) datasets are composed of three datasets: Mixed domain (OntoNotes 5.0), financial domain (Resume), and medical domain (CCKS 2017). Among them, the mixed domain dataset (OntoNotes 5.0) is a corpus integrating six different vertical domains. The F1 values of the model proposed in this study are 2.18%, 1.68%, and 0.99% higher than BiLSTM-CRF, respectively.

    Reference
    Related
    Cited by
Get Citation

吴炳潮,邓成龙,关贝,陈晓霖,昝道广,常志军,肖尊严,曲大成,王永吉.动态迁移实体块信息的跨领域中文实体识别模型.软件学报,2022,33(10):3776-3792

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:October 16,2020
  • Revised:December 15,2020
  • Adopted:
  • Online: February 07,2021
  • Published: October 06,2022
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063