{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "911a1be9",
   "metadata": {},
   "source": [
    "# 🤗 + 📚 dbmdz Turkish BERT model\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f09b0f1",
   "metadata": {},
   "source": [
    "In this repository the MDZ Digital Library team (dbmdz) at the Bavarian State\n",
    "Library open sources a cased model for Turkish 🎉\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa2a78a0",
   "metadata": {},
   "source": [
    "# 🇹🇷 BERTurk\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b2f8c68",
   "metadata": {},
   "source": [
    "BERTurk is a community-driven cased BERT model for Turkish.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe23e365",
   "metadata": {},
   "source": [
    "Some datasets used for pretraining and evaluation are contributed from the\n",
    "awesome Turkish NLP community, as well as the decision for the model name: BERTurk.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2da0ce24",
   "metadata": {},
   "source": [
    "## Stats\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3f6af43",
   "metadata": {},
   "source": [
    "The current version of the model is trained on a filtered and sentence\n",
    "segmented version of the Turkish [OSCAR corpus](https://traces1.inria.fr/oscar/),\n",
    "a recent Wikipedia dump, various [OPUS corpora](http://opus.nlpl.eu/) and a\n",
    "special corpus provided by [Kemal Oflazer](http://www.andrew.cmu.edu/user/ko/).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d8d60c1",
   "metadata": {},
   "source": [
    "The final training corpus has a size of 35GB and 44,04,976,662 tokens.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ce42504f",
   "metadata": {},
   "source": [
    "For this model we use a vocab size of 128k.\n"
   ]
  },
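  {
   "cell_type": "markdown",
   "id": "b7c1d2e3",
   "metadata": {},
   "source": [
    "Once `paddlenlp` is installed (see the Usage section below), the vocabulary size can be checked directly from the tokenizer. The cell below is a minimal sketch; it assumes that `AutoTokenizer` resolves the same `dbmdz/bert-base-turkish-128k-cased` identifier as the model and that the tokenizer exposes a `vocab_size` attribute, as Hugging Face-style tokenizers usually do.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8d9e0f1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from paddlenlp.transformers import AutoTokenizer\n",
    "\n",
    "# Load the WordPiece tokenizer that ships with the 128k-cased checkpoint.\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"dbmdz/bert-base-turkish-128k-cased\")\n",
    "\n",
    "# Should report a vocabulary of roughly 128k entries.\n",
    "print(tokenizer.vocab_size)"
   ]
  },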
  {
   "cell_type": "markdown",
   "id": "4815bfdb",
   "metadata": {},
   "source": [
    "## Usage\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a084604",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade paddlenlp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d041c78",
   "metadata": {},
   "outputs": [],
   "source": [
    "import paddle\n",
    "from paddlenlp.transformers import AutoModel\n",
    "\n",
    "model = AutoModel.from_pretrained(\"dbmdz/bert-base-turkish-128k-cased\")\n",
    "input_ids = paddle.randint(100, 200, shape=[1, 20])\n",
    "print(model(input_ids))"
   ]
  },
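  {
   "cell_type": "markdown",
   "id": "d2e3f4a5",
   "metadata": {},
   "source": [
    "The random-ID example above only verifies that the weights load and a forward pass runs. A more typical workflow tokenizes real Turkish text first. The cell below is a sketch under the same assumptions as above (the tokenizer is available under the same identifier); the example sentence is illustrative only, and the output pair is assumed to follow PaddleNLP's BERT convention of `(sequence_output, pooled_output)`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4f5a6b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import paddle\n",
    "from paddlenlp.transformers import AutoModel, AutoTokenizer\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(\"dbmdz/bert-base-turkish-128k-cased\")\n",
    "model = AutoModel.from_pretrained(\"dbmdz/bert-base-turkish-128k-cased\")\n",
    "model.eval()\n",
    "\n",
    "# Encode an illustrative Turkish sentence and wrap the IDs in a batch of size 1.\n",
    "encoded = tokenizer(\"Merhaba dünya, bu bir deneme cümlesidir.\")\n",
    "input_ids = paddle.to_tensor([encoded[\"input_ids\"]])\n",
    "token_type_ids = paddle.to_tensor([encoded[\"token_type_ids\"]])\n",
    "\n",
    "# The model returns per-token hidden states and a pooled [CLS] vector.\n",
    "sequence_output, pooled_output = model(input_ids, token_type_ids=token_type_ids)\n",
    "print(sequence_output.shape, pooled_output.shape)"
   ]
  },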
  {
   "cell_type": "markdown",
   "id": "da82079c",
   "metadata": {},
   "source": [
    "# Reference"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6632d46",
   "metadata": {},
   "source": [
    "> The model introduction and model weights originate from [https://huggingface.co/dbmdz/bert-base-turkish-128k-cased](https://huggingface.co/dbmdz/bert-base-turkish-128k-cased) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}