{ "cells": [ { "cell_type": "markdown", "id": "db267b71", "metadata": {}, "source": [ "## Overview\n", "\n", "This is an updated version of cointegrated/rubert-tiny: a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.\n" ] }, { "cell_type": "markdown", "id": "801acf5c", "metadata": {}, "source": [ "The differences from the previous version include:\n", "- a larger vocabulary: 83828 tokens instead of 29564;\n", "- longer supported sequences: 2048 tokens instead of 512;\n", "- sentence embeddings that approximate LaBSE more closely than before;\n", "- meaningful segment embeddings (tuned on the NLI task);\n", "- the model is focused only on Russian.\n" ] }, { "cell_type": "markdown", "id": "f2c7dbc1", "metadata": {}, "source": [ "The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.\n" ] }, { "cell_type": "markdown", "id": "9ff63df2", "metadata": {}, "source": [ "Sentence embeddings can be produced as follows:\n" ] }, { "cell_type": "markdown", "id": "2b073558", "metadata": {}, "source": [ "## How to use" ] }, { "cell_type": "code", "execution_count": null, "id": "c98c0cce", "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade paddlenlp" ] }, { "cell_type": "code", "execution_count": null, "id": "81978806", "metadata": {}, "outputs": [], "source": [ "import paddle\n", "from paddlenlp.transformers import AutoModel\n", "\n", "# Load the converted rubert-tiny2 weights\n", "model = AutoModel.from_pretrained(\"cointegrated/rubert-tiny2\")\n", "# Dummy batch standing in for tokenized text: one sequence of 20 random token IDs in [100, 200)\n", "input_ids = paddle.randint(100, 200, shape=[1, 20])\n", "print(model(input_ids))" ] }, { "cell_type": "markdown", "id": "33dbe378", "metadata": {}, "source": [ "> The model introduction and model weights originate from [https://huggingface.co/cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) and were converted to PaddlePaddle format for ease of use in PaddleNLP.\n" ] } ], 
"metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }