{ "cells": [ { "cell_type": "markdown", "id": "83973edc", "metadata": {}, "source": [ "## Overview\n", "\n", "This is a very small distilled version of the bert-base-multilingual-cased model for Russian and English (45 MB, 12M parameters). There is also an **updated version of this model**, rubert-tiny2, with a larger vocabulary and better quality on practically all Russian NLU tasks.\n" ] }, { "cell_type": "markdown", "id": "59944441", "metadata": {}, "source": [ "This model is useful if you want to fine-tune it for a relatively simple Russian task (e.g. NER or sentiment classification) and you care more about speed and size than about accuracy. It is approximately 10x smaller and faster than a base-sized BERT. Its `[CLS]` embeddings can be used as a sentence representation aligned between Russian and English.\n" ] }, { "cell_type": "markdown", "id": "c0e2918f", "metadata": {}, "source": [ "It was trained on the [Yandex Translate corpus](https://translate.yandex.ru/corpus), [OPUS-100](https://huggingface.co/datasets/opus100) and Tatoeba, using MLM loss distilled from bert-base-multilingual-cased, translation ranking loss, and `[CLS]` embeddings distilled from LaBSE, rubert-base-cased-sentence, Laser and USE.\n" ] }, { "cell_type": "markdown", "id": "b0c0158e", "metadata": {}, "source": [ "There is a more detailed [description in Russian](https://habr.com/ru/post/562064/).\n" ] }, { "cell_type": "markdown", "id": "d521437a", "metadata": {}, "source": [ "## How to use" ] }, { "cell_type": "markdown", "id": "28ce4026", "metadata": {}, "source": [ "Install PaddleNLP, then load the model and run it on a batch of token IDs as follows:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "da5acdb0", "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade paddlenlp" ] }, { "cell_type": "code", "execution_count": null, "id": "df2d3cc6", "metadata": {}, "outputs": [], "source": [ "import paddle\n", "from paddlenlp.transformers import AutoModel\n", "\n", "# Load the pretrained model by name\n", "model = 
AutoModel.from_pretrained(\"cointegrated/rubert-tiny\")\n", "# Random token IDs stand in for tokenized text (batch of 1, 20 tokens)\n", "input_ids = paddle.randint(100, 200, shape=[1, 20])\n", "print(model(input_ids))" ] }, { "cell_type": "markdown", "id": "065bda47", "metadata": {}, "source": [ "> This model description and its weights come from [https://huggingface.co/cointegrated/rubert-tiny](https://huggingface.co/cointegrated/rubert-tiny) and have been converted to the PaddlePaddle model format.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }