InfoExtractor 2.0 is a relation extraction baseline system developed for DuIE 2.0.
Unlike [DuIE 1.0](http://lic2019.ccf.org.cn/kg), the 2.0 task leans more toward colloquial language and further introduces **complex relations**, which entail multiple objects in a single SPO (subject-predicate-object triple).
For detailed information about the dataset, please refer to the official website of our [competition](http://bjyz-ai.epc.baidu.com/aistudio/competition/detail/34?isFromCcf=true).
InfoExtractor 2.0 is built upon a SOTA pre-trained language model [ERNIE](https://arxiv.org/abs/1904.09223) using PaddlePaddle.
We design a structured **tagging strategy** to directly fine-tune ERNIE, through which multiple, overlapped SPOs can be extracted in **a single pass**.
The InfoExtractor 2.0 system is simple yet effective, achieving 0.554 F1 on the DuIE 2.0 demo data and 0.848 F1 on DuIE 1.0.
The hyperparameters are simply set to: BATCH_SIZE=16, LEARNING_RATE=2e-5, and EPOCH=10 (without tuning).
- - -
### Tagging Strategy
Our tagging strategy is designed to discover multiple, overlapped SPOs in the DuIE 2.0 task.
Based on the classic 'BIO' tagging scheme, we assign tags (also known as labels) to each token to indicate its position in an entity span.
The only difference is that each "B" tag is further distinguished by its predicate and by whether the span is a subject or an object.
Suppose there are N predicates. Then a "B" tag takes the form "B-predicate-subject" or "B-predicate-object",
which results in 2*N **mutually exclusive** "B" tags.
After tagging, we treat the task as token-level multi-label classification, with a total of (2*N+2) labels (2 for the “I” and “O” tags).
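
The label inventory and the multi-label encoding described above can be sketched as follows. This is a minimal illustration, not the repository's actual preprocessing code: the function names, the toy predicates (`singer`, `composer`), the English tokens, and the span convention (token indices, end exclusive) are all assumptions made for the example. It also assumes that a token starting entities in several SPOs receives several "B" tags at once.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive (assumed convention)


def build_label_map(predicates: List[str]) -> Dict[str, int]:
    """Build the 2*N + 2 labels: two "B" tags per predicate, plus "I" and "O"."""
    labels = []
    for predicate in predicates:
        labels.append(f"B-{predicate}-subject")
        labels.append(f"B-{predicate}-object")
    labels += ["I", "O"]
    return {label: index for index, label in enumerate(labels)}


def encode(
    tokens: List[str],
    spos: List[Tuple[str, Span, Span]],
    label2id: Dict[str, int],
) -> List[List[int]]:
    """Return one multi-hot label row per token.

    Each SPO is given as (predicate, subject_span, object_span).
    """
    matrix = [[0] * len(label2id) for _ in tokens]
    for predicate, subject_span, object_span in spos:
        for role, (start, end) in (("subject", subject_span), ("object", object_span)):
            # The first token of the span carries the predicate- and role-specific "B" tag.
            matrix[start][label2id[f"B-{predicate}-{role}"]] = 1
            # The remaining tokens of the span share the single "I" tag.
            for i in range(start + 1, end):
                matrix[i][label2id["I"]] = 1
    # Tokens outside every span get the "O" tag.
    for row in matrix:
        if not any(row):
            row[label2id["O"]] = 1
    return matrix


# Toy example with two overlapping SPOs that share the same subject and object spans.
tokens = ["Jay", "Chou", "sang", "and", "wrote", "Nocturne"]
label2id = build_label_map(["singer", "composer"])
spos = [
    ("singer", (5, 6), (0, 2)),    # (Nocturne, singer, Jay Chou)
    ("composer", (5, 6), (0, 2)),  # (Nocturne, composer, Jay Chou)
]
matrix = encode(tokens, spos, label2id)
```

In this toy case the token "Jay" has both `B-singer-object` and `B-composer-object` switched on, which is how a single tagging pass can represent overlapping SPOs under these assumptions.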
Below is a visual illustration of our tagging strategy:
**Accuracy** (token-level and example-level) is printed during the training procedure.
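
For orientation, the two accuracy figures could be computed roughly as below. The exact definitions used by the training script are not stated here, so this sketch assumes that a token counts as correct only when its whole multi-hot label row matches the gold row, and that an example counts as correct only when every one of its tokens is correct. The function names are hypothetical.

```python
from typing import List

LabelRows = List[List[int]]  # one multi-hot label row per token


def token_accuracy(gold: List[LabelRows], pred: List[LabelRows]) -> float:
    """Fraction of tokens whose full label row matches the gold row (assumed definition)."""
    correct = 0
    total = 0
    for gold_rows, pred_rows in zip(gold, pred):
        for gold_row, pred_row in zip(gold_rows, pred_rows):
            correct += int(gold_row == pred_row)
            total += 1
    return correct / total if total else 0.0


def example_accuracy(gold: List[LabelRows], pred: List[LabelRows]) -> float:
    """Fraction of examples in which every token row matches (assumed definition)."""
    if not gold:
        return 0.0
    matches = sum(int(g == p) for g, p in zip(gold, pred))
    return matches / len(gold)


# Two toy examples with two labels per token; one token in the second example is wrong.
gold = [[[1, 0], [0, 1]], [[0, 1], [0, 1], [1, 0]]]
pred = [[[1, 0], [0, 1]], [[0, 1], [1, 0], [1, 0]]]
token_acc = token_accuracy(gold, pred)      # 4 of 5 tokens match
example_acc = example_accuracy(gold, pred)  # 1 of 2 examples match
```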
### Prediction
Specify your checkpoint directory in the prediction script, and then run:
```
sh ./script/predict.sh
```
This writes the predictions to a JSON file in the same format as the original dataset (required for the final official evaluation). The GPU ID and batch size can be specified in the script. The final prediction file is saved in `./data/`.
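
Before predictions can be written out, the per-token tags have to be turned back into SPOs. The sketch below shows one plausible way to do that and is not necessarily what the prediction script does: it assumes each token's prediction is a set of tag strings, that a span runs from a "B" tag through the following "I" tags, and that every subject is paired with every object sharing the same predicate. The `decode` function is hypothetical, and tokens are joined with spaces only because the toy example is in English.

```python
from typing import Dict, List, Set, Tuple

SPO = Tuple[str, str, str]  # (subject, predicate, object)


def decode(tokens: List[str], tags: List[Set[str]]) -> Set[SPO]:
    """Recover SPOs from per-token tag sets (assumed decoding rule)."""
    spans: Dict[Tuple[str, str], List[str]] = {}
    for i, token_tags in enumerate(tags):
        for tag in token_tags:
            if not tag.startswith("B-"):
                continue
            # "B-predicate-role": split the role off from the right.
            predicate, role = tag[2:].rsplit("-", 1)
            # Extend the span over the following "I" tokens.
            end = i + 1
            while end < len(tokens) and "I" in tags[end]:
                end += 1
            spans.setdefault((predicate, role), []).append(" ".join(tokens[i:end]))

    spos: Set[SPO] = set()
    for (predicate, role), subjects in spans.items():
        if role != "subject":
            continue
        # Pair every subject with every object of the same predicate.
        for subject in subjects:
            for obj in spans.get((predicate, "object"), []):
                spos.add((subject, predicate, obj))
    return spos


tokens = ["Jay", "Chou", "sang", "and", "wrote", "Nocturne"]
tags = [
    {"B-singer-object", "B-composer-object"},
    {"I"},
    {"O"},
    {"O"},
    {"O"},
    {"B-singer-subject", "B-composer-subject"},
]
predicted_spos = decode(tokens, tags)
```

Pairing all subjects with all objects of a predicate is the simplest rule and can over-generate when a sentence contains several unrelated SPOs with the same predicate.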
### Official Evaluation
Zip your prediction JSON file and then run the official evaluation:
```
zip ./data/predict_test.json.zip ./data/predict_test.json
```
Precision, recall, and F1 are the official evaluation metrics used to measure the performance of participating systems. The alias file lists entities that have more than one correct mention. It is not provided for security reasons.