Unverified commit 9ec8847b, authored by 那伊抹微笑, committed by GitHub

Merge pull request #1 from apachecn/v0.1.0

update from apachecn
FastText is a library for efficient learning of word representations and sentence classification.
## Project leads
* [@片刻](https://github.com/jiangzhonglian)
* [@wnma](https://github.com/wnma3mz)
* [@Lisanaaa](https://github.com/Lisanaaa)
**Maintained by:** [@ApacheCN](https://github.com/apachecn)
## Contributors
### FastText 0.1.0 Chinese documentation contributors
| Title | Translator | Proofreader |
| ------------------------------------------------------------ | ------------------------------------------ | ---- |
| [api](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/api.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [cheatsheet](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/cheatsheet.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [crawl-vectors](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/crawl-vectors.md) | [@GMbappe](https://github.com/GMbappe) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [dataset](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/dataset.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [english-vectors](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/english-vectors.md) | [@GMbappe](https://github.com/GMbappe) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [faqs](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/faqs.md) | [@Twinkle](https://github.com/kemingzeng) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [language-identification](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/language-identification.md) | [@wnma](https://github.com/wnma3mz) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [options](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/options.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [pretrained-vectors](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/pretrained-vectors.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [references](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/references.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [supervised-models](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/supervised-models.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [supervised-tutorial](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/supervised-tutorial.md) | [@Lisanaaa](https://github.com/Lisanaaa) | [@wnma](https://github.com/wnma3mz) |
| [support](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/support.md) | [@片刻](https://github.com/jiangzhonglian) | [@Lisanaaa](https://github.com/Lisanaaa) |
| [unsupervised-tutorials](https://github.com/apachecn/fasttext-doc-zh/blob/v0.1.0/doc/zh/unsupervised-tutorials.md) | [@wnma](https://github.com/wnma3mz) | [@Lisanaaa](https://github.com/Lisanaaa) |
## Join us
If you would like to join us, see: <http://www.apachecn.org/organization/209.html>.
All enthusiastic contributors are welcome.
## Feedback and suggestions
- Contact the project leads [@wnma](https://github.com/wnma3mz) or [@Lisanaaa](https://github.com/Lisanaaa).
- Open an issue on our GitHub repository [apachecn/fasttext-doc-zh](https://github.com/apachecn/fasttext-doc-zh).
- Send an email to fasttext#apachecn.org (replace # with @).
- Contact the group owner or an admin in our [study and discussion group](http://www.apachecn.org/organization/348.html).
---
id: api
title: API
---
# API
We automatically generate our [API documentation](/docs/en/html/index.html) with doxygen.
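As a sketch of what that generation step looks like locally (assuming doxygen is installed and the repository ships a `Doxyfile`; the config file name is an assumption, not confirmed here):

```bash
# Regenerate the HTML API docs from the annotated sources (hypothetical
# Doxyfile at the repository root; the output path depends on that config).
doxygen Doxyfile
```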
---
id: cheatsheet
title: Cheatsheet
---
# Cheatsheet
## Word representation learning
In order to learn word vectors, do:
```bash
$ ./fasttext skipgram -input data.txt -output model
```
## Obtaining word vectors
Print word vectors for a text file `queries.txt` containing words.
```bash
$ ./fasttext print-word-vectors model.bin < queries.txt
```
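Since the command reads from standard input, a quick ad-hoc query also works (a sketch; the example words are arbitrary):

```bash
# Print vectors for two words without creating queries.txt first.
echo "king queen" | ./fasttext print-word-vectors model.bin
```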
## Text classification
In order to train a text classifier, do:
```bash
$ ./fasttext supervised -input train.txt -output model
```
Once the model is trained, you can evaluate it by computing precision and recall at k (P@k and R@k) on a test set using:
```bash
$ ./fasttext test model.bin test.txt 1
```
In order to obtain the k most likely labels for a piece of text, use:
```bash
$ ./fasttext predict model.bin test.txt k
```
In order to obtain the k most likely labels and their associated probabilities for a piece of text, use:
```bash
$ ./fasttext predict-prob model.bin test.txt k
```
If you want to compute vector representations of sentences or paragraphs, please use:
```bash
$ ./fasttext print-sentence-vectors model.bin < text.txt
```
## Quantization
In order to create a `.ftz` file with a smaller memory footprint, do:
```bash
$ ./fasttext quantize -output model
```
All other commands, such as test, also work with this model:
```bash
$ ./fasttext test model.ftz test.txt
```
---
id: dataset
title: Datasets
---
# Datasets
[Download YFCC100M Dataset](https://fb-public.box.com/s/htfdbrvycvroebv9ecaezaztocbcnsdn)
---
id: english-vectors
title: English word vectors
---
# English word vectors
This page gathers several pre-trained word vectors trained using fastText.
### Download pre-trained word vectors
Pre-trained word vectors learned on different sources can be downloaded below:
1. [wiki-news-300d-1M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip): 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
2. [wiki-news-300d-1M-subword.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.vec.zip): 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
3. [crawl-300d-2M.vec.zip](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip): 2 million word vectors trained on Common Crawl (600B tokens).
### Format
The first line of the file contains the number of words in the vocabulary and the size of the vectors.
Each line contains a word followed by its vector, like in the default fastText text format.
Each value is space separated. Words are ordered by descending frequency.
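A quick way to inspect that layout after unzipping (a sketch; it assumes the `.vec` file inside the archive carries the same base name as the download):

```bash
# First line: "<number of words> <vector size>"; each following line:
# a word and its space-separated vector values.
unzip wiki-news-300d-1M.vec.zip
head -n 2 wiki-news-300d-1M.vec
```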
### License
These word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).
### References
If you use these word vectors, please cite the following paper:
T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin. [*Advances in Pre-Training Distributed Word Representations*](https://arxiv.org/abs/1712.09405)
---
id: faqs
title: FAQ
---
# FAQ
## What is fastText? Are there tutorials?
FastText is a library for text classification and representation. It transforms text into continuous vectors that can later be used on any language related task. A few tutorials are available.
## Why are my fastText models that big?
fastText uses a hashtable for either word or character ngrams. The size of the hashtable directly impacts the size of a model. To reduce the size of the model, it is possible to reduce the size of this table with the option '-hash'. For example, a good value is 20000. Another option that greatly impacts the size of a model is the size of the vectors (-dim). This dimension can be reduced to save space, but doing so can significantly impact performance. If that still produces a model that is too big, one can further reduce the size of a trained model with the quantization option:
```bash
./fasttext quantize -output model
```
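Putting those knobs together, a size-reduction sketch (file names and values are illustrative; `-bucket` is the hashtable-size option from the option list in this document, which appears to be the setting the '-hash' name above refers to):

```bash
# Train with a smaller ngram hashtable and lower-dimensional vectors,
# then quantize the trained model into a compact .ftz file.
./fasttext supervised -input train.txt -output model -bucket 20000 -dim 50
./fasttext quantize -output model
```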
## What would be the best way to represent word phrases rather than words?
Currently the best approach to represent word phrases or sentences is to take a bag of words of word vectors. Additionally, for phrases like “New York”, preprocessing the data so that it becomes a single token “New_York” can greatly help.
## Why does fastText produce vectors even for unknown words?
One of the key features of fastText word representation is its ability to produce vectors for any word, even made-up ones.
Indeed, fastText word vectors are built from vectors of the character substrings contained in them.
This makes it possible to build vectors even for misspelled words or concatenations of words.
## Why is the hierarchical softmax slightly worse in performance than the full softmax?
The hierarchical softmax is an approximation of the full softmax loss that allows training on a large number of classes efficiently. This often comes at the cost of a few percent of accuracy.
Note also that this loss is intended for unbalanced classes, that is, when some classes are more frequent than others. If your dataset has a balanced number of examples per class, it is worth trying the negative sampling loss (-loss ns -neg 100).
However, negative sampling will still be very slow at test time, since the full softmax will be computed.
## Can we run the fastText program on a GPU?
fastText only works on CPU for the sake of accessibility. That being said, fastText has been implemented in the Caffe2 library, which can be run on GPU.
## Can I use fastText with python? Or other languages?
A few unofficial wrappers for Python or Lua are available on GitHub.
## Can I use fastText with continuous data?
FastText works on discrete tokens and thus cannot be directly used on continuous tokens. However, one can discretize continuous tokens to use fastText on them, for example by rounding values to a specific digit ("12.3" becomes "12").
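A preprocessing sketch along those lines (hypothetical file names; it assumes whitespace-separated numeric tokens):

```bash
# Truncate every numeric token to its integer part ("12.3" becomes "12")
# so fastText sees a discrete vocabulary, then train as usual.
awk '{ for (i = 1; i <= NF; i++) $i = int($i); print }' continuous.txt > discrete.txt
./fasttext skipgram -input discrete.txt -output model
```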
## There are misspellings in the dictionary. Should we improve text normalization?
If the words are infrequent, there is no need to worry.
## I'm encountering a NaN, why could this be?
You'll likely see this behavior because your learning rate is too high. Try reducing it until you don't see this error anymore.
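For example (a sketch; the value is arbitrary and should be tuned), retry with a smaller `-lr` than the default shown in the option list:

```bash
# Halve the default supervised learning rate if training diverges to NaN.
./fasttext supervised -input train.txt -output model -lr 0.025
```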
## My compiler / architecture can't build fastText. What should I do?
Try a newer version of your compiler. We try to maintain compatibility with older versions of gcc and many platforms; however, sometimes maintaining backwards compatibility becomes very hard. In general, compilers and tool chains that ship with LTS versions of major Linux distributions should be fair game. In any case, create an issue with your compiler version and architecture and we'll try to implement compatibility.
---
id: language-identification
title: Language identification
---
# Language identification
### Description
We distribute two models for language identification, which can recognize 176 languages (see the list of ISO codes below). These models were trained on data from [Wikipedia](https://www.wikipedia.org/), [Tatoeba](https://tatoeba.org/eng/) and [SETimes](http://nlp.ffzg.hr/resources/corpora/setimes/), used under [CC-BY-SA](http://creativecommons.org/licenses/by-sa/3.0/).
We distribute two versions of the models:
* [lid.176.bin](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.bin), which is faster and slightly more accurate, but has a file size of 126 MB;
* [lid.176.ftz](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.ftz), which is the compressed version of the model, with a file size of 917 kB.
These models were trained on UTF-8 data, and therefore expect UTF-8 as input.
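A usage sketch (it assumes a built `./fasttext` binary; passing `-` to `predict` to read standard input is an assumption based on common fastText usage):

```bash
# Fetch the compressed model and identify the language of one line of text;
# the output is a label such as __label__fr.
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/lid.176.ftz
echo "Bonjour tout le monde" | ./fasttext predict lid.176.ftz -
```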
### License
The models are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).
### List of supported languages
```
af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
```
### References
If you use these models, please cite the following papers:
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)
```markup
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
---
id: options
title: List of options
---
# List of options
Invoke a command without arguments to list available arguments and their default values:
```bash
$ ./fasttext supervised
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurences [5]
-minCountLabel minimal number of label occurences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [3]
-maxn max length of char ngram [6]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.05]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [ns]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
```
Defaults may vary by mode. (Word-representation modes `skipgram` and `cbow` use a default `-minCount` of 5.)
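As an illustration (a sketch; the values are arbitrary), any of the defaults above can be overridden on the command line:

```bash
# Train 300-dimensional skipgram vectors for 10 epochs, keeping every
# word that occurs at least twice.
./fasttext skipgram -input data.txt -output model -dim 300 -epoch 10 -minCount 2
```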
---
id: pretrained-vectors
title: Wiki word vectors
---
# Wiki word vectors
We are publishing pre-trained word vectors for 294 languages, trained on [*Wikipedia*](https://www.wikipedia.org) using fastText.
These vectors in dimension 300 were obtained using the skip-gram model described in [*Bojanowski et al. (2016)*](https://arxiv.org/abs/1607.04606) with default parameters.
Please note that a newer version of multi-lingual word vectors is available at: <https://fasttext.cc/docs/en/crawl-vectors.html>.
### Models
The models can be downloaded from:
||||
|-|-|-|
| Yiddish: [*bin+text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.yi.zip), [*text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.yi.vec) | Yoruba: [*bin+text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.yo.zip), [*text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.yo.vec) | Zazaki: [*bin+text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.diq.zip), [*text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.diq.vec) |
| Zeelandic: [*bin+text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.zea.zip), [*text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.zea.vec) | Zhuang: [*bin+text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.za.zip), [*text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.za.vec) | Zulu: [*bin+text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.zu.zip), [*text*](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.zu.vec) |
### Format
The word vectors come in both the binary and text default formats of fastText.
In the text format, each line contains a word followed by its vector. Each value is space separated.
Words are ordered by their frequency in descending order.
### License
The word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).
### References
If you use these word vectors, please cite the following paper:
P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)
---
id: references
title: References
---
# References
Please cite [1](#enriching-word-vectors-with-subword-information) if using this code for learning word representations, or [2](#bag-of-tricks-for-efficient-text-classification) if using it for text classification.
[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)
```markup
@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}
```

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

```markup
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651)
```markup
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```
(\* These authors contributed equally.)
---
id: supervised-models
title: Supervised models
---
# Supervised models
This page gathers several pre-trained supervised models on several datasets.
### Description
The regular models are trained using the procedure described in [1]. They can be reproduced using the classification-results.sh script within our github repository. The quantized models are built by using the respective supervised settings and adding the following flags to the quantize subcommand:
```bash
-qnorm -retrain -cutoff 100000
```
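For example, a reproduction sketch (the training file name is hypothetical; `-input` is passed to `quantize` because `-retrain` fine-tunes the embeddings on the training data):

```bash
# Train a regular supervised model, then quantize it with the flags above
# to obtain the compressed .ftz variant.
./fasttext supervised -input ag_news.train -output ag_news
./fasttext quantize -output ag_news -input ag_news.train -qnorm -retrain -cutoff 100000
```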
### Table of models
Each entry describes the test accuracy and size of the model. You can click on a table cell to download the corresponding model.
| dataset | ag news | amazon review full | amazon review polarity | dbpedia |
|-----------|-----------------------|-----------------------|------------------------|------------------------|
| regular | [0.924 / 387MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/ag_news.bin) | [0.603 / 462MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_full.bin) | [0.946 / 471MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_polarity.bin) | [0.986 / 427MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/dbpedia.bin) |
| compressed | [0.92 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/ag_news.ftz) | [0.599 / 1.6MB]( https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_full.ftz) | [0.93 / 1.6MB]( https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/amazon_review_polarity.ftz) | [0.984 / 1.7MB]( https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/dbpedia.ftz) |
| dataset | sogou news | yahoo answers | yelp review polarity | yelp review full |
|-----------|----------------------|------------------------|----------------------|------------------------|
| regular | [0.969 / 402MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/sogou_news.bin) | [0.724 / 494MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yahoo_answers.bin)| [0.957 / 409MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_polarity.bin)| [0.639 / 412MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_full.bin)|
| compressed | [0.968 / 1.4MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/sogou_news.ftz) | [0.717 / 1.6MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yahoo_answers.ftz) | [0.957 / 1.5MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_polarity.ftz) | [0.636 / 1.5MB](https://s3-us-west-1.amazonaws.com/fasttext-vectors/supervised_models/yelp_review_full.ftz) |
### References
If you use these models, please cite the following paper:
[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)
```markup
@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}
```
[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [*FastText.zip: Compressing text classification models*](https://arxiv.org/abs/1612.03651)
```markup
@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}
```
---
id: support
title: Get started
---
# Get started
## What is fastText?
fastText is a library for efficient learning of word representations and sentence classification.
## Requirements
fastText builds on modern Mac OS and Linux distributions.
Since it uses C++11 features, it requires a compiler with good C++11 support.
These include:
* (gcc-4.6.3 or newer) or (clang-3.3 or newer)
Compilation is carried out using a Makefile, so you will need to have a working **make**.
For the word-similarity evaluation script you will need:
* python 2.6 or newer
* numpy & scipy
## Building fastText
In order to build `fastText`, use the following:
```bash
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make
```
This will produce object files for all the classes as well as the main binary `fasttext`.
If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).
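Alternatively (a sketch; the compiler and include path are examples), those macros can be overridden on the `make` command line without editing the Makefile:

```bash
# Build with a specific compiler and an extra include directory by
# overriding the CC and INCLUDES macros defined in the Makefile.
make CC=clang++ INCLUDES="-I/usr/local/include"
```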