{ "cells": [ { "cell_type": "markdown", "id": "9eef057a", "metadata": {}, "source": [ "## Overview\n", "\n", "This is an updated version of cointegrated/rubert-tiny: a small Russian BERT-based encoder with high-quality sentence embeddings. This [post in Russian](https://habr.com/ru/post/669674/) gives more details.\n" ] }, { "cell_type": "markdown", "id": "08d9a049", "metadata": {}, "source": [ "The differences from the previous version include:\n", "- a larger vocabulary: 83828 tokens instead of 29564;\n", "- longer supported sequences: 2048 tokens instead of 512;\n", "- sentence embeddings that approximate LaBSE more closely than before;\n", "- meaningful segment embeddings (tuned on the NLI task);\n", "- a focus on Russian only.\n" ] }, { "cell_type": "markdown", "id": "8a7ba50b", "metadata": {}, "source": [ "The model should be used as is to produce sentence embeddings (e.g. for KNN classification of short texts) or fine-tuned for a downstream task.\n" ] }, { "cell_type": "markdown", "id": "a9613056", "metadata": {}, "source": [ "## How to use" ] }, { "cell_type": "markdown", "id": "184e1cc6", "metadata": {}, "source": [ "The model can be loaded with PaddleNLP and run as follows (the example feeds random token ids as a smoke test):\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d60b7b64", "metadata": {}, "outputs": [], "source": [ "!pip install --upgrade paddlenlp" ] }, { "cell_type": "code", "execution_count": null, "id": "716f2b63", "metadata": {}, "outputs": [], "source": [ "import paddle\n", "from paddlenlp.transformers import AutoModel\n", "\n", "# Load the converted rubert-tiny2 weights.\n", "model = AutoModel.from_pretrained(\"cointegrated/rubert-tiny2\")\n", "# Dummy batch: one sequence of 20 random token ids, used only to check the forward pass.\n", "input_ids = paddle.randint(100, 200, shape=[1, 20])\n", "print(model(input_ids))" ] }, { "cell_type": "markdown", "id": "0ba8c599", "metadata": {}, "source": [ "> The model description and weights come from [https://huggingface.co/cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) and have been converted to the PaddlePaddle model format.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name":
"python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" } }, "nbformat": 4, "nbformat_minor": 5 }