Unverified commit fb4e0c3b, authored by LiuChiachi, committed by GitHub

add text/datasets Chinese doc (#2628)

Parent 55bd7e03
@@ -6,36 +6,35 @@ Conll05st
.. py:class:: paddle.text.datasets.Conll05st()

Implementation of the `Conll05st <https://www.cs.upc.edu/~srlconll/soft.html>`_ test dataset.

.. note::
    Only the test set can be downloaded automatically, because only the test set of Conll05st is public.

Parameters
::::::::::
- data_file (str) - Path to the data tar file. Can be None if :attr:`download` is True. Default: None.
- word_dict_file (str) - Path to the word dictionary file. Can be None if :attr:`download` is True. Default: None.
- verb_dict_file (str) - Path to the verb dictionary file. Can be None if :attr:`download` is True. Default: None.
- target_dict_file (str) - Path to the target dictionary file. Can be None if :attr:`download` is True. Default: None.
- emb_file (str) - Path to the embedding dictionary file, used only by :code:`get_embedding`. Can be None if :attr:`download` is True. Default: None.
- download (bool) - Whether to download the dataset automatically if :attr:`data_file`, :attr:`word_dict_file`, :attr:`verb_dict_file` and :attr:`target_dict_file` are not set. Default: True.

Returns
:::::::::
``Dataset``, an instance of the Conll05st dataset.

Code Example
::::::::::::
.. code-block:: python

    import paddle
    from paddle.text.datasets import Conll05st
@@ -61,4 +60,3 @@ Conll05st
    pred_idx, mark, label = model(pred_idx, mark, label)
    print(pred_idx.numpy(), mark.numpy(), label.numpy())
\ No newline at end of file
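The file-path parameters above share one rule: a path may be None only when `download` is True. The sketch below illustrates that rule in isolation. `resolve_data_file` and its `fetch` argument are hypothetical names introduced for illustration; they are not part of the paddle API, and the real class may behave differently.

```python
from typing import Callable, Optional


def resolve_data_file(data_file: Optional[str], download: bool,
                      fetch: Callable[[], str]) -> str:
    """Return a usable path, fetching one only if none was given.

    Hypothetical helper: `fetch` stands in for whatever downloader a
    dataset class would call; it is not a paddle function.
    """
    if data_file is not None:
        # An explicit path is used as-is; nothing is downloaded.
        return data_file
    if not download:
        # No path and downloading disabled: there is nothing to load.
        raise ValueError("data_file is not set and download is False")
    return fetch()
```

With this shape, passing neither a path nor `download=True` fails early instead of producing an empty dataset.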
@@ -6,24 +6,24 @@ Imdb
.. py:class:: paddle.text.datasets.Imdb()

Implementation of the `IMDB <https://www.imdb.com/interfaces/>`_ dataset.

Parameters
::::::::::
- data_file (str) - Path to the data tar file. Can be None if :attr:`download` is True. Default: None.
- mode (str) - 'train' or 'test'. Default: 'train'.
- cutoff (int) - Cutoff number used when building the word dictionary. Default: 150.
- download (bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.

Returns
:::::::::
``Dataset``, an instance of the IMDB dataset.

Code Example
::::::::::::
.. code-block:: python

    import paddle
    from paddle.text.datasets import Imdb
@@ -48,4 +48,3 @@ Imdb
    doc, label = model(doc, label)
    print(doc.numpy().shape, label.numpy().shape)
\ No newline at end of file
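How `cutoff` shapes the word dictionary can be sketched as follows. The sketch assumes that `cutoff` is a frequency threshold (words occurring more than `cutoff` times are kept) and that an `<unk>` entry is appended for everything else. Neither detail is stated in the parameter description, and `build_word_dict` is a hypothetical helper, not the library's own function.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_word_dict(docs: Iterable[List[str]], cutoff: int) -> Dict[str, int]:
    """Map frequent words to integer ids (assumed semantics of `cutoff`)."""
    freq = Counter(word for doc in docs for word in doc)
    # Keep words seen more than `cutoff` times (assumption).
    kept = [(w, c) for w, c in freq.items() if c > cutoff]
    # Most frequent first; ties broken alphabetically for determinism.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    word_idx = {w: i for i, (w, _) in enumerate(kept)}
    word_idx["<unk>"] = len(word_idx)  # catch-all id for dropped words
    return word_idx


# Toy corpus, invented for illustration.
docs = [["good", "movie"], ["good", "plot"], ["bad", "movie", "good"]]
word_idx = build_word_dict(docs, cutoff=1)
```

A larger `cutoff` gives a smaller dictionary and sends more words to the catch-all id.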
@@ -6,26 +6,26 @@ Imikolov
.. py:class:: paddle.text.datasets.Imikolov()

Implementation of the imikolov dataset.

Parameters
::::::::::
- data_file (str) - Path to the data tar file. Can be None if :attr:`download` is True. Default: None.
- data_type (str) - 'NGRAM' or 'SEQ'. Default: 'NGRAM'.
- window_size (int) - Sliding window size for 'NGRAM' data. Default: -1.
- mode (str) - 'train' or 'test'. Default: 'train'.
- min_word_freq (int) - Minimum word frequency for building the word dictionary. Default: 50.
- download (bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.

Returns
:::::::::
``Dataset``, an instance of the imikolov dataset.

Code Example
::::::::::::
.. code-block:: python

    import paddle
    from paddle.text.datasets import Imikolov
@@ -50,4 +50,3 @@ Imikolov
    src, trg = model(src, trg)
    print(src.numpy().shape, trg.numpy().shape)
\ No newline at end of file
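For `data_type='NGRAM'`, `window_size` sets how many consecutive tokens form one sample. The sketch below slides a fixed window over one tokenized sentence. Wrapping the sentence in `<s>` and `<e>` markers, skipping sentences shorter than the window, and rejecting non-positive sizes (such as the default of -1) are all assumptions made for illustration; `ngram_windows` is a hypothetical helper.

```python
from typing import Iterator, List, Tuple


def ngram_windows(tokens: List[str], window_size: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of `window_size` consecutive tokens."""
    if window_size <= 0:
        raise ValueError("window_size must be positive for n-gram data")
    # Assumed sentence boundary markers.
    padded = ["<s>"] + tokens + ["<e>"]
    # Sentences shorter than the window produce no samples.
    for end in range(window_size, len(padded) + 1):
        yield tuple(padded[end - window_size:end])


samples = list(ngram_windows(["the", "cat", "sat"], window_size=3))
```

In a language-model setup, the first `window_size - 1` tokens of each sample would typically serve as context and the last token as the prediction target.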
@@ -6,23 +6,23 @@ MovieReviews
.. py:class:: paddle.text.datasets.MovieReviews()

Implementation of the `NLTK movie reviews <http://www.nltk.org/nltk_data/>`_ dataset.

Parameters
::::::::::
- data_file (str) - Path to the data tar file. Can be None if :attr:`download` is True. Default: None.
- mode (str) - 'train' or 'test'. Default: 'train'.
- download (bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.

Returns
:::::::::
``Dataset``, an instance of the NLTK movie reviews dataset.

Code Example
::::::::::::
.. code-block:: python

    import paddle
    from paddle.text.datasets import MovieReviews
@@ -47,4 +47,3 @@ MovieReviews
    word_list, category = model(word_list, category)
    print(word_list.numpy().shape, category.numpy())
\ No newline at end of file
@@ -6,25 +6,26 @@ Movielens
.. py:class:: paddle.text.datasets.Movielens()

Implementation of the `Movielens 1-M <https://grouplens.org/datasets/movielens/1m/>`_ dataset.

Parameters
::::::::::
- data_file (str) - Path to the data tar file. Can be None if :attr:`download` is True. Default: None.
- mode (str) - 'train' or 'test'. Default: 'train'.
- test_ratio (float) - Fraction of samples assigned to the test set. Default: 0.1.
- rand_seed (int) - Random seed. Default: 0.
- download (bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.

Returns
:::::::::
``Dataset``, an instance of the Movielens 1-M dataset.

Code Example
::::::::::::
.. code-block:: python

    import paddle
    from paddle.text.datasets import Movielens
@@ -49,5 +50,3 @@ Movielens
    model = SimpleNet()
    category, title, rating = model(category, title, rating)
    print(category.numpy().shape, title.numpy().shape, rating.numpy().shape)
\ No newline at end of file
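The interplay of `test_ratio` and `rand_seed` can be sketched as a seeded random split. This is an illustration under assumptions, not the class's actual code: `split_samples` is a hypothetical helper, and it assumes each sample is assigned independently, so the test share is only approximately `test_ratio`.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(samples: Sequence[T], test_ratio: float,
                  rand_seed: int) -> Tuple[List[T], List[T]]:
    """Assign each sample to train or test with probability `test_ratio`."""
    rng = random.Random(rand_seed)  # seeded, so the split is repeatable
    train: List[T] = []
    test: List[T] = []
    for sample in samples:
        (test if rng.random() < test_ratio else train).append(sample)
    return train, test


# Stand-in for rating records; the values are invented.
ratings = list(range(1000))
train, test = split_samples(ratings, test_ratio=0.1, rand_seed=0)
```

Reusing the same `rand_seed` for the 'train' and 'test' modes is what keeps the two subsets disjoint and complementary in a scheme like this.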
@@ -6,24 +6,24 @@ UCIHousing
.. py:class:: paddle.text.datasets.UCIHousing()

Implementation of the `UCI housing <https://archive.ics.uci.edu/ml/datasets/Housing>`_ dataset.

Parameters
::::::::::
- data_file (str) - Path to the data file. Can be None if :attr:`download` is True. Default: None.
- mode (str) - 'train' or 'test'. Default: 'train'.
- download (bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.

Returns
:::::::::
``Dataset``, an instance of the UCI housing dataset.

Code Example
::::::::::::
.. code-block:: python

    import paddle
    from paddle.text.datasets import UCIHousing
@@ -48,4 +48,3 @@ UCIHousing
    feature, target = model(feature, target)
    print(feature.numpy().shape, target.numpy())
\ No newline at end of file
@@ -6,27 +6,27 @@ WMT14
.. py:class:: paddle.text.datasets.WMT14()

Implementation of the `WMT14 <http://www.statmt.org/wmt14/>`_ test dataset.
The original WMT14 dataset is too large, so a small subset is provided instead. This class downloads the dataset from
http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz

Parameters
::::::::::
- data_file (str) - Path to the data tar file. Can be None if :attr:`download` is True. Default: None.
- mode (str) - 'train', 'test' or 'gen'. Default: 'train'.
- dict_size (int) - Word dictionary size. Default: -1.
- download (bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.

Returns
:::::::::
``Dataset``, an instance of the WMT14 dataset.

Code Example
::::::::::::
.. code-block:: python

    import paddle
    from paddle.text.datasets import WMT14
@@ -51,5 +51,3 @@ WMT14
    model = SimpleNet()
    src_ids, trg_ids, trg_ids_next = model(src_ids, trg_ids, trg_ids_next)
    print(src_ids.numpy(), trg_ids.numpy(), trg_ids_next.numpy())
\ No newline at end of file
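One plausible reading of `dict_size` is a cap on how many vocabulary entries are loaded. The sketch below assumes a vocabulary file with one token per line and treats a non-positive size (such as the default of -1) as "no limit". Both points are assumptions, the vocabulary contents are invented, and `load_dict` is a hypothetical helper.

```python
from typing import Dict, Iterable


def load_dict(lines: Iterable[str], dict_size: int) -> Dict[str, int]:
    """Map the first `dict_size` vocabulary entries to their line index."""
    word_to_id: Dict[str, int] = {}
    for index, line in enumerate(lines):
        # Assumption: a non-positive size (such as -1) means no limit.
        if dict_size > 0 and index >= dict_size:
            break
        word_to_id[line.strip()] = index
    return word_to_id


# Invented vocabulary lines standing in for a real dictionary file.
vocab_lines = ["<s>\n", "<e>\n", "<unk>\n", "the\n", "of\n"]
small = load_dict(vocab_lines, dict_size=3)
full = load_dict(vocab_lines, dict_size=-1)
```

Tokens that fall outside the truncated dictionary would then need to map to an unknown-word id at lookup time.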
@@ -6,14 +6,14 @@ WMT16
.. py:class:: paddle.text.datasets.WMT16()

Implementation of the `WMT16 <http://www.statmt.org/wmt16/>`_ test dataset.
ACL2016 Multimodal Machine Translation. For more details, see:
http://www.statmt.org/wmt16/multimodal-task.html#task1

If you use this dataset in your task, please cite the following paper:
Multi30K: Multilingual English-German Image Descriptions.

.. code-block:: text

    @article{elliott-EtAl:2016:VL16,
    author = {{Elliott}, D. and {Frank}, S. and {Sima"an}, K. and {Specia}, L.},
@@ -24,24 +24,24 @@ WMT16
    year = 2016
    }

Parameters
::::::::::
- data_file (str) - Path to the data tar file. Can be None if :attr:`download` is True. Default: None.
- mode (str) - 'train', 'test' or 'val'. Default: 'train'.
- src_dict_size (int) - Word dictionary size for the source language. Default: -1.
- trg_dict_size (int) - Word dictionary size for the target language. Default: -1.
- lang (str) - Source language, 'en' or 'de'. Default: 'en'.
- download (bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.

Returns
:::::::::
``Dataset``, an instance of the WMT16 dataset.

Code Example
::::::::::::
.. code-block:: python

    import paddle
    from paddle.text.datasets import WMT16
@@ -67,4 +67,3 @@ WMT16
    src_ids, trg_ids, trg_ids_next = model(src_ids, trg_ids, trg_ids_next)
    print(src_ids.numpy(), trg_ids.numpy(), trg_ids_next.numpy())
\ No newline at end of file
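The example above works with three sequences named `src_ids`, `trg_ids` and `trg_ids_next`. A common way to build such a triple for sequence-to-sequence training is to prefix the target with a start marker for the decoder input and to append an end marker for the shifted label sequence. The sketch below shows that convention; whether WMT16 builds its samples exactly this way is an assumption, and the marker tokens, toy dictionaries and `encode_pair` helper are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed special tokens; the real dictionaries may use other markers.
START, END, UNK = "<s>", "<e>", "<unk>"


def encode_pair(src: List[str], trg: List[str],
                src_dict: Dict[str, int],
                trg_dict: Dict[str, int]) -> Tuple[List[int], List[int], List[int]]:
    """Turn a tokenized sentence pair into (src_ids, trg_ids, trg_ids_next)."""
    # Words missing from a dictionary fall back to the unknown-word id.
    src_ids = [src_dict.get(w, src_dict[UNK]) for w in src]
    body = [trg_dict.get(w, trg_dict[UNK]) for w in trg]
    trg_ids = [trg_dict[START]] + body     # decoder input
    trg_ids_next = body + [trg_dict[END]]  # labels, shifted by one position
    return src_ids, trg_ids, trg_ids_next


# Toy dictionaries, invented for illustration.
src_dict = {START: 0, END: 1, UNK: 2, "a": 3, "dog": 4}
trg_dict = {START: 0, END: 1, UNK: 2, "ein": 3, "hund": 4}
src_ids, trg_ids, trg_ids_next = encode_pair(
    ["a", "dog"], ["ein", "kleiner", "hund"], src_dict, trg_dict)
```

Under this convention `trg_ids` and `trg_ids_next` always have the same length, so position `i` of the label sequence is the token the decoder should predict after reading position `i` of its input.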