- provide a new dataset for joint learning of NER and RE for Chinese literature text
- the proposed dataset is annotated at the discourse level, which provides additional context information
- evaluate several widely used models on the dataset
Two tagging methods: a heuristic tagging method and a machine auxiliary tagging method.
A problem of data inconsistency was found in the annotations.
- For example, remove all adjectives and tag only the “entity header”.
- re-annotate all articles and correct all inconsistent entities and relations based on the heuristic rules.
- The core idea is to train a model to learn the annotation guidelines on a subset of the corpus and produce predicted tags for the rest of the data.
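
The machine auxiliary idea above can be sketched as follows. This is only an illustration: the notes do not say which model is used, so a trivial most-frequent-tag lookup stands in for a real sequence tagger, and the names `train_tagger`, `pre_tag`, and the tag `PER` are hypothetical.

```python
from collections import Counter, defaultdict


def train_tagger(annotated):
    """Learn the most frequent tag for each token from the annotated subset.

    `annotated` is a list of (tokens, tags) pairs. This lookup table is a
    stand-in for whatever model is actually trained on the subset.
    """
    counts = defaultdict(Counter)
    for tokens, tags in annotated:
        for token, tag in zip(tokens, tags):
            counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def pre_tag(model, tokens, default="O"):
    """Propose tags for an unannotated sentence; unseen tokens get `default`."""
    return [model.get(token, default) for token in tokens]


# Hypothetical annotated subset: one character-level sentence.
annotated_subset = [(["老", "人", "走"], ["PER", "PER", "O"])]
model = train_tagger(annotated_subset)

# Predicted tags for the rest of the data, to be reviewed by annotators.
suggested = pre_tag(model, ["老", "人", "笑"])
```

The predicted tags are suggestions only; under this reading, a human annotator would still confirm or correct them.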
Each entity is identified by a T tag, which takes several attributes:
- Id: a unique number identifying the entity within the document. It starts at 0, and is incremented every time a new entity is identified within the same document.
- Type: one of the entity tags.
- Begin Index: the begin index of an entity. It starts at 0, and is incremented by one for each character.
- End Index: the end index of an entity. It starts at 0, and is incremented by one for each character.
- Value: the words that refer to an identifiable object.
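
As a rough sketch, the entity attributes above could be held in a small record type. The class name `Entity`, the helper `make_entities`, and the type names `Person` and `Location` are hypothetical. The notes do not say whether End Index is inclusive or exclusive; this sketch assumes it is exclusive, so that `text[begin:end]` yields the value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    id: int      # unique within the document, starting at 0
    type: str    # one of the entity tags
    begin: int   # character offset of the first character, starting at 0
    end: int     # character offset just past the entity (assumed exclusive)
    value: str   # the words referring to the identifiable object


def make_entities(text, spans):
    """Build entities from (begin, end, type) spans, numbering ids from 0."""
    entities = []
    for begin, end, type_ in spans:
        if not 0 <= begin < end <= len(text):
            raise ValueError(f"span ({begin}, {end}) is outside the text")
        entities.append(Entity(len(entities), type_, begin, end, text[begin:end]))
    return entities


text = "老人走进了村子"
entities = make_entities(text, [(0, 2, "Person"), (5, 7, "Location")])
```

Deriving `value` from the offsets, rather than storing it independently, keeps the two from drifting apart during re-annotation.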
Each relation is identified by an R tag, which can take several attributes:
- Id: a unique number identifying the relation within the document. It starts at 0, and is incremented every time a new relation is identified within the same document.
- Arg1 and Arg2: two entities associated with a relation.
- Type: one of the relation tags.
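
The relation attributes can be sketched in the same way. Here `Relation`, `make_relations`, and the relation type `Located` are hypothetical names, and the sketch assumes that Arg1 and Arg2 are stored as the ids of entities in the same document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    id: int     # unique within the document, starting at 0
    type: str   # one of the relation tags
    arg1: int   # id of the first entity (assumed to be an entity Id)
    arg2: int   # id of the second entity


def make_relations(entity_ids, triples):
    """Build relations from (type, arg1, arg2) triples, numbering ids from 0.

    Raises ValueError if an argument does not refer to a known entity id.
    """
    known = set(entity_ids)
    relations = []
    for type_, arg1, arg2 in triples:
        if arg1 not in known or arg2 not in known:
            raise ValueError(f"unknown entity id in ({type_}, {arg1}, {arg2})")
        relations.append(Relation(len(relations), type_, arg1, arg2))
    return relations


# Entities 0 and 1 are assumed to exist in the same document.
relations = make_relations([0, 1], [("Located", 0, 1)])
```

Checking the argument ids at construction time is one way to catch relations that point at entities removed during re-annotation.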