A Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text: Reading Notes

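The paper's main contribution is the dataset itself (dataset link). The annotation process it describes is also novel: it combines heuristic tagging with machine-assisted tagging, which improves accuracy and reduces the annotators' workload.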

contribution

  • provide a new dataset for joint learning of NER and RE for Chinese literature text
  • the proposed dataset is based on the discourse level which provides additional context information
  • introduce some widely used models to conduct experiments

tagging process

Two methods are used: a heuristic tagging method and a machine auxiliary tagging method.

Step 1: First Tagging Process

The first round of tagging revealed a data inconsistency problem.
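One simple way to surface this kind of inconsistency is to look for identical surface strings that received different entity types. The sketch below is only an illustration under that assumption; the function name `find_inconsistent`, the `(value, type)` pair layout, and the type names in the usage note are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def find_inconsistent(annotations):
    """Return surface strings that were tagged with more than one entity type.

    `annotations` is an iterable of (value, entity_type) pairs; this layout
    is an assumption made for illustration.
    """
    types_by_value = defaultdict(set)
    for value, entity_type in annotations:
        types_by_value[value].add(entity_type)
    # Keep only the strings whose annotations disagree on the type.
    return {
        value: sorted(types)
        for value, types in types_by_value.items()
        if len(types) > 1
    }
```

Calling `find_inconsistent([("北京", "Location"), ("北京", "Organization"), ("老人", "Person")])` would report only "北京", since it is the one string carrying two different types.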

Step 2: Heuristic Tagging Based on Generic Disambiguating Rules

  • For example, remove all adjectives and tag only the “entity header”.
  • Re-annotate all articles and correct all inconsistent entities and relations based on the heuristic rules.

Step 3: Machine Auxiliary Tagging

  • The core idea is to train a model to learn the annotation guidelines on a subset of the corpus and to produce predicted tags on the rest of the data.
  • The model used is a CRF (conditional random field).
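The train-on-a-subset, predict-on-the-rest loop can be sketched as follows. To keep the example dependency-free, it swaps in a per-character most-frequent-tag lookup in place of a CRF, so it only illustrates the workflow, not the paper's model. The BIO-style tag names and the functions `train` and `predict` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn the most frequent tag for each character in the annotated subset.

    This is a stand-in baseline, NOT a CRF: it ignores context entirely.
    Each sentence is a list of (character, tag) pairs.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for char, tag in sentence:
            counts[char][tag] += 1
    return {char: tags.most_common(1)[0][0] for char, tags in counts.items()}


def predict(model, text, default="O"):
    """Produce predicted tags for unannotated text, for annotators to review."""
    return [(char, model.get(char, default)) for char in text]
```

In a machine-assisted setup, the predicted tags would serve as suggestions that human annotators then confirm or correct, rather than as final labels.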

tagging set

Annotation Format

Entity

Each entity is identified by a T tag, which takes several attributes:

  • Id: a unique number identifying the entity within the document. It starts at 0, and is incremented every time a new entity is identified within the same document.
  • Type: one of the entity tags.
  • Begin Index: the begin index of an entity. It starts at 0, and is incremented by one for each character.
  • End Index: the end index of an entity. It starts at 0, and is incremented by one for each character.
  • Value: the words that refer to an identifiable object.
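The attributes above map naturally onto a small record type. The sketch below assumes a tab-separated line layout such as `T0<TAB>Person 0 2<TAB>老人`; that layout, the `Entity` class, and the `parse_entity` function are assumptions for illustration, and the notes do not say whether the end index is inclusive, so the code stores it without interpreting it.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: int      # unique within the document, starting at 0
    type: str    # one of the entity tags
    begin: int   # character offset where the entity begins
    end: int     # character offset where the entity ends (semantics not assumed)
    value: str   # the annotated text


def parse_entity(line: str) -> Entity:
    """Parse one entity line in the assumed 'T<id> / type begin end / value' layout."""
    tag, span, value = line.rstrip("\n").split("\t")
    if not tag.startswith("T"):
        raise ValueError(f"not an entity line: {line!r}")
    entity_type, begin, end = span.split()
    return Entity(int(tag[1:]), entity_type, int(begin), int(end), value)
```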

Relation

Each relation is identified by an R tag, which can take several attributes:

  • Id: a unique number identifying the relation within the document. It starts at 0, and is incremented every time a new relation is identified within the same document.
  • Arg1 and Arg2: two entities associated with a relation.
  • Type: one of the relation tags.
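A relation record can be read in the same spirit. The sketch below assumes a line layout such as `R0<TAB>Located Arg1:T0 Arg2:T1`, where each argument refers to an entity by its T id; this layout, the `Relation` class, and the `parse_relation` function are illustrative assumptions rather than the dataset's documented format.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    id: int    # unique within the document, starting at 0
    type: str  # one of the relation tags
    arg1: int  # id of the first entity
    arg2: int  # id of the second entity


def _entity_ref(field: str, role: str) -> int:
    """Extract the entity id from a field like 'Arg1:T0'."""
    name, _, ref = field.partition(":")
    if name != role or not ref.startswith("T"):
        raise ValueError(f"expected {role}:T<id>, got {field!r}")
    return int(ref[1:])


def parse_relation(line: str) -> Relation:
    """Parse one relation line in the assumed 'R<id> / type Arg1:T<i> Arg2:T<j>' layout."""
    tag, body = line.rstrip("\n").split("\t")
    if not tag.startswith("R"):
        raise ValueError(f"not a relation line: {line!r}")
    relation_type, first, second = body.split()
    return Relation(
        int(tag[1:]),
        relation_type,
        _entity_ref(first, "Arg1"),
        _entity_ref(second, "Arg2"),
    )
```

Storing the arguments as entity ids keeps relations compact; the referenced entities can be looked up from the parsed T records of the same document.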
