introduction_en.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "21e8b000",
   "metadata": {},
   "source": [
    "# KcBERT: Korean comments BERT\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "336ee0b8",
   "metadata": {},
   "source": [
    "Kaggle에 학습을 위해 정제한(아래 `clean`처리를 거친) Dataset을 공개하였습니다!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "691c1f27",
   "metadata": {},
   "source": [
    "직접 다운받으셔서 다양한 Task에 학습을 진행해보세요 :)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36ec915c",
   "metadata": {},
   "source": [
    "공개된 한국어 BERT는 대부분 한국어 위키, 뉴스 기사, 책 등 잘 정제된 데이터를 기반으로 학습한 모델입니다. 한편, 실제로 NSMC와 같은 댓글형 데이터셋은 정제되지 않았고 구어체 특징에 신조어가 많으며, 오탈자 등 공식적인 글쓰기에서 나타나지 않는 표현들이 빈번하게 등장합니다.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5b8d7d7",
   "metadata": {},
   "source": [
    "KcBERT는 위와 같은 특성의 데이터셋에 적용하기 위해, 네이버 뉴스에서 댓글과 대댓글을 수집해, 토크나이저와 BERT모델을 처음부터 학습한 Pretrained BERT 모델입니다.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b0095da8",
   "metadata": {},
   "source": [
    "KcBERT는 Huggingface의 Transformers 라이브러리를 통해 간편히 불러와 사용할 수 있습니다. (별도의 파일 다운로드가 필요하지 않습니다.)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4bf51d97",
   "metadata": {},
   "source": [
    "## KcBERT Performance\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9679c8b9",
   "metadata": {},
   "source": [
    "- Finetune 코드는 https://github.com/Beomi/KcBERT-finetune 에서 찾아보실 수 있습니다.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "486782a2",
   "metadata": {},
   "source": [
    "|                       | Size<br/>(용량)  | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **KorQuaD (Dev)**<br/>(EM/F1) |\n",
    "| :-------------------- | :---: | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: |\n",
    "| KcBERT-Base                | 417M  |       89.62        |         84.34          |       66.95        |        74.85         |           75.57           |            93.93            |         60.25 / 84.39         |\n",
    "| KcBERT-Large                | 1.2G  |       **90.68**        |         85.53          |       70.15        |        76.99         |           77.49           |            94.06            |         62.16 / 86.64          |\n",
    "| KoBERT                | 351M  |       89.63        |         86.11          |       80.65        |        79.00         |           79.64           |            93.93            |         52.81 / 80.27         |\n",
    "| XLM-Roberta-Base      | 1.03G |       89.49        |         86.26          |       82.95        |        79.92         |           79.09           |            93.53            |         64.70 / 88.94         |\n",
    "| HanBERT               | 614M  |       90.16        |       **87.31**        |       82.40        |      **80.89**       |           83.33           |            94.19            |         78.74 / 92.02         |\n",
    "| KoELECTRA-Base    | 423M  |     **90.21**      |         86.87          |       81.90        |        80.85         |           83.21           |            94.20            |         61.10 / 89.59         |\n",
    "| KoELECTRA-Base-v2 | 423M  |       89.70        |         87.02          |     **83.90**      |        80.61         |         **84.30**         |          **94.72**          |       **84.34 / 92.58**       |\n",
    "| DistilKoBERT           | 108M |       88.41        |         84.13          |       62.55        |        70.55         |           73.21           |            92.48            |         54.12 / 77.80         |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e86103a2",
   "metadata": {},
   "source": [
    "\\*HanBERT의 Size는 Bert Model과 Tokenizer DB를 합친 것입니다.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1078bc5d",
   "metadata": {},
   "source": [
    "\\***config의 세팅을 그대로 하여 돌린 결과이며, hyperparameter tuning을 추가적으로 할 시 더 좋은 성능이 나올 수 있습니다.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8ac2ee11",
   "metadata": {},
   "source": [
    "## How to use\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e171068a",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade paddlenlp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "38c7ad79",
   "metadata": {},
   "outputs": [],
   "source": [
    "import paddle\n",
    "from paddlenlp.transformers import AutoModel\n",
    "\n",
    "model = AutoModel.from_pretrained(\"beomi/kcbert-base\")\n",
    "input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
    "print(model(input_ids))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a794d15a",
   "metadata": {},
   "source": [
    "```\n",
    "@inproceedings{lee2020kcbert,\n",
    "title={KcBERT: Korean Comments BERT},\n",
    "author={Lee, Junbum},\n",
    "booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},\n",
    "pages={437--440},\n",
    "year={2020}\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0183cbe",
   "metadata": {},
   "source": [
    "- 논문집 다운로드 링크: http://hclt.kr/dwn/?v=bG5iOmNvbmZlcmVuY2U7aWR4OjMy (*혹은 http://hclt.kr/symp/?lnb=conference )\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba768b26",
   "metadata": {},
   "source": [
    "## Acknowledgement\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea148064",
   "metadata": {},
   "source": [
    "KcBERT Model을 학습하는 GCP/TPU 환경은 TFRC 프로그램의 지원을 받았습니다.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78732669",
   "metadata": {},
   "source": [
    "모델 학습 과정에서 많은 조언을 주신 [Monologg](https://github.com/monologg/) 님 감사합니다 :)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ffa9ed9",
   "metadata": {},
   "source": [
    "## Reference\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea69da89",
   "metadata": {},
   "source": [
    "### Github Repos\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d72d564c",
   "metadata": {},
   "source": [
    "- [BERT by Google](https://github.com/google-research/bert)\n",
    "- [KoBERT by SKT](https://github.com/SKTBrain/KoBERT)\n",
    "- [KoELECTRA by Monologg](https://github.com/monologg/KoELECTRA/)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "38503607",
   "metadata": {},
   "source": [
    "- [Transformers by Huggingface](https://github.com/huggingface/transformers)\n",
    "- [Tokenizers by Hugginface](https://github.com/huggingface/tokenizers)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a71a565f",
   "metadata": {},
   "source": [
    "### Papers\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9aa4d324",
   "metadata": {},
   "source": [
    "- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b1ba932",
   "metadata": {},
   "source": [
    "### Blogs\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c9e32e1",
   "metadata": {},
   "source": [
    "- [Monologg님의 KoELECTRA 학습기](https://monologg.kr/categories/NLP/ELECTRA/)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b551dcf",
   "metadata": {},
   "source": [
    "> The model introduction and model weights originate from [https://huggingface.co/beomi/kcbert-base](https://huggingface.co/beomi/kcbert-base) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}