Unverified commit aae80b4d, author: LiuChiachi, committer: GitHub

add text/datasets Chinese doc (#2598)

Parent 2eac1e60
...@@ -6,59 +6,57 @@ Conll05st ...@@ -6,59 +6,57 @@ Conll05st
.. py:class:: paddle.text.datasets.Conll05st() .. py:class:: paddle.text.datasets.Conll05st()
Implementation of `Conll05st <https://www.cs.upc.edu/~srlconll/soft.html>`_ This class implements the `Conll05st <https://www.cs.upc.edu/~srlconll/soft.html>`_
test dataset. test dataset.
Note: only support download test dataset automatically for that .. note::
only test dataset of Conll05st is public. Only the public Conll05st test dataset can be downloaded automatically.
Parameters Parameters
::::::::: :::::::::
data_file(str): path to data tar file, can be set None if - data_file(str) - Path to the data file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
word_dict_file(str): path to word dictionary file, can be set None if - word_dict_file(str) - Path to the word dictionary file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
verb_dict_file(str): path to verb dictionary file, can be set None if - verb_dict_file(str) - Path to the verb dictionary file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
target_dict_file(str): path to target dictionary file, can be set None if - target_dict_file(str) - Path to the target dictionary file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
emb_file(str): path to embedding dictionary file, only used for - emb_file(str) - Path to the word embedding dictionary file, used only by :code:`get_embedding`. Can be set to None
:code:`get_embedding` can be set None if :attr:`download` is if :attr:`download` is True.
True. Default None - download(bool) - Whether to download the dataset if :attr:`data_file`, :attr:`word_dict_file`,
download(bool): whether to download dataset automatically if :attr:`verb_dict_file` and :attr:`target_dict_file` are not set. Default: True.
:attr:`data_file` :attr:`word_dict_file` :attr:`verb_dict_file`
:attr:`target_dict_file` is not set. Default True Returns
Returns:
Dataset: instance of conll05st dataset
Code example
::::::::: :::::::::
``Dataset``, an instance of the conll05st dataset.
.. code-block:: python Code example
:::::::::
.. code-block:: python
import paddle import paddle
from paddle.text.datasets import Conll05st from paddle.text.datasets import Conll05st
class SimpleNet(paddle.nn.Layer): class SimpleNet(paddle.nn.Layer):
def __init__(self): def __init__(self):
super(SimpleNet, self).__init__() super(SimpleNet, self).__init__()
def forward(self, pred_idx, mark, label): def forward(self, pred_idx, mark, label):
return paddle.sum(pred_idx), paddle.sum(mark), paddle.sum(label) return paddle.sum(pred_idx), paddle.sum(mark), paddle.sum(label)
paddle.disable_static() paddle.disable_static()
conll05st = Conll05st() conll05st = Conll05st()
for i in range(10): for i in range(10):
pred_idx, mark, label= conll05st[i][-3:] pred_idx, mark, label= conll05st[i][-3:]
pred_idx = paddle.to_tensor(pred_idx) pred_idx = paddle.to_tensor(pred_idx)
mark = paddle.to_tensor(mark) mark = paddle.to_tensor(mark)
label = paddle.to_tensor(label) label = paddle.to_tensor(label)
model = SimpleNet() model = SimpleNet()
pred_idx, mark, label= model(pred_idx, mark, label) pred_idx, mark, label= model(pred_idx, mark, label)
print(pred_idx.numpy(), mark.numpy(), label.numpy()) print(pred_idx.numpy(), mark.numpy(), label.numpy())
\ No newline at end of file
...@@ -6,46 +6,45 @@ Imdb ...@@ -6,46 +6,45 @@ Imdb
.. py:class:: paddle.text.datasets.Imdb() .. py:class:: paddle.text.datasets.Imdb()
Implementation of `IMDB <https://www.imdb.com/interfaces/>`_ dataset. This class implements the `IMDB <https://www.imdb.com/interfaces/>`_ test dataset.
Parameters Parameters
::::::::: :::::::::
data_file(str): path to data tar file, can be set None if - data_file(str) - Path to the compressed data file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
mode(str): 'train' 'test' mode. Default 'train'. - mode(str) - 'train' or 'test' mode. Default: 'train'.
cutoff(int): cutoff number for building word dictionary. Default 150. - cutoff(int) - Cutoff value used when building the word dictionary. Default: 150.
download(bool): whether to download dataset automatically if - download(bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.
:attr:`data_file` is not set. Default True
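The ``cutoff`` parameter decides which words enter the word dictionary. The sketch below shows one plausible reading of a frequency cutoff in plain Python. The helper name ``build_word_dict``, the strict ``> cutoff`` comparison, and the trailing ``<unk>`` entry are assumptions made for illustration; none of them is confirmed by this diff.

```python
from collections import Counter


def build_word_dict(docs, cutoff):
    """Map words occurring more than `cutoff` times to integer ids.

    Hypothetical helper: the comparison and the '<unk>' convention
    are assumptions, not the library's documented behavior.
    """
    freq = Counter(word for doc in docs for word in doc)
    kept = [(word, count) for word, count in freq.items() if count > cutoff]
    # Most frequent first; ties are broken alphabetically for determinism.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    word_idx = {word: i for i, (word, _) in enumerate(kept)}
    word_idx["<unk>"] = len(word_idx)  # catch-all id for filtered words
    return word_idx


docs = [["good", "movie"], ["good", "plot"], ["bad", "movie", "good"]]
word_idx = build_word_dict(docs, cutoff=1)
```

With ``cutoff=1`` only words seen at least twice survive, so rare words all share the ``<unk>`` id.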
Returns: Returns
Dataset: instance of IMDB dataset :::::::::
``Dataset``, an instance of the IMDB dataset.
Code example Code example
::::::::: :::::::::
.. code-block:: python .. code-block:: python
import paddle import paddle
from paddle.text.datasets import Imdb from paddle.text.datasets import Imdb
class SimpleNet(paddle.nn.Layer): class SimpleNet(paddle.nn.Layer):
def __init__(self): def __init__(self):
super(SimpleNet, self).__init__() super(SimpleNet, self).__init__()
def forward(self, doc, label): def forward(self, doc, label):
return paddle.sum(doc), label return paddle.sum(doc), label
paddle.disable_static() paddle.disable_static()
imdb = Imdb(mode='train') imdb = Imdb(mode='train')
for i in range(10): for i in range(10):
doc, label = imdb[i] doc, label = imdb[i]
doc = paddle.to_tensor(doc) doc = paddle.to_tensor(doc)
label = paddle.to_tensor(label) label = paddle.to_tensor(label)
model = SimpleNet() model = SimpleNet()
image, label = model(doc, label) image, label = model(doc, label)
print(doc.numpy().shape, label.numpy().shape) print(doc.numpy().shape, label.numpy().shape)
\ No newline at end of file
...@@ -6,48 +6,47 @@ Imikolov ...@@ -6,48 +6,47 @@ Imikolov
.. py:class:: paddle.text.datasets.Imikolov() .. py:class:: paddle.text.datasets.Imikolov()
Implementation of imikolov dataset. This class implements the imikolov test dataset.
Parameters Parameters
::::::::: :::::::::
data_file(str): path to data tar file, can be set None if - data_file(str) - Path to the data file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
data_type(str): 'NGRAM' or 'SEQ'. Default 'NGRAM'. - data_type(str) - 'NGRAM' or 'SEQ'. Default: 'NGRAM'.
window_size(int): sliding window size for 'NGRAM' data. Default -1. - window_size(int) - Sliding window size for 'NGRAM' data. Default: -1.
mode(str): 'train' 'test' mode. Default 'train'. - mode(str) - 'train' or 'test' mode. Default: 'train'.
min_word_freq(int): minimal word frequence for building word dictionary. Default 50. - min_word_freq(int) - Minimum word frequency for building the word dictionary. Default: 50.
download(bool): whether to download dataset automatically if - download(bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.
:attr:`data_file` is not set. Default True
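To make the difference between the ``'NGRAM'`` and ``'SEQ'`` values of ``data_type`` concrete, here is a minimal sketch in plain Python. The function names, the ``<s>``/``<e>`` padding markers, and the shift-by-one target are assumptions chosen for illustration; the real dataset returns integer ids rather than strings.

```python
def ngram_samples(tokens, window_size):
    """Slide a window of `window_size` over a sentence padded with <s>/<e>."""
    padded = ["<s>"] + tokens + ["<e>"]
    return [
        tuple(padded[i - window_size:i])
        for i in range(window_size, len(padded) + 1)
    ]


def seq_sample(tokens):
    """Return (source, target); the target is the source shifted by one token."""
    src = ["<s>"] + tokens
    trg = tokens + ["<e>"]
    return src, trg


tokens = ["the", "cat", "sat"]
ngrams = ngram_samples(tokens, window_size=3)
src, trg = seq_sample(tokens)
```

Under these assumptions a sentence of three tokens yields three 3-grams, while ``'SEQ'`` yields one source/target pair of equal length, which is why ``window_size`` only matters for ``'NGRAM'`` data.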
Returns
Returns:
Dataset: instance of imikolov dataset
Code example
::::::::: :::::::::
``Dataset``, an instance of the imikolov dataset.
.. code-block:: python Code example
:::::::::
.. code-block:: python
import paddle import paddle
from paddle.text.datasets import Imikolov from paddle.text.datasets import Imikolov
class SimpleNet(paddle.nn.Layer): class SimpleNet(paddle.nn.Layer):
def __init__(self): def __init__(self):
super(SimpleNet, self).__init__() super(SimpleNet, self).__init__()
def forward(self, src, trg): def forward(self, src, trg):
return paddle.sum(src), paddle.sum(trg) return paddle.sum(src), paddle.sum(trg)
paddle.disable_static() paddle.disable_static()
imikolov = Imikolov(mode='train', data_type='SEQ', window_size=2) imikolov = Imikolov(mode='train', data_type='SEQ', window_size=2)
for i in range(10): for i in range(10):
src, trg = imikolov[i] src, trg = imikolov[i]
src = paddle.to_tensor(src) src = paddle.to_tensor(src)
trg = paddle.to_tensor(trg) trg = paddle.to_tensor(trg)
model = SimpleNet() model = SimpleNet()
src, trg = model(src, trg) src, trg = model(src, trg)
print(src.numpy().shape, trg.numpy().shape) print(src.numpy().shape, trg.numpy().shape)
\ No newline at end of file
...@@ -6,45 +6,44 @@ MovieReviews ...@@ -6,45 +6,44 @@ MovieReviews
.. py:class:: paddle.text.datasets.MovieReviews() .. py:class:: paddle.text.datasets.MovieReviews()
Implementation of `NLTK movie reviews <http://www.nltk.org/nltk_data/>`_ dataset. This class implements the `NLTK movie reviews <http://www.nltk.org/nltk_data/>`_ test dataset.
Parameters Parameters
::::::::: :::::::::
data_file(str): path to data tar file, can be set None if - data_file(str) - Path to the compressed data file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
mode(str): 'train' 'test' mode. Default 'train'. - mode(str) - 'train' or 'test' mode. Default: 'train'.
download(bool): whether auto download cifar dataset if - download(bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.
:attr:`data_file` unset. Default True.
Returns: Returns
Dataset: instance of movie reviews dataset :::::::::
``Dataset``, an instance of the NLTK movie reviews dataset.
Code example Code example
::::::::: :::::::::
.. code-block:: python .. code-block:: python
import paddle import paddle
from paddle.text.datasets import MovieReviews from paddle.text.datasets import MovieReviews
class SimpleNet(paddle.nn.Layer): class SimpleNet(paddle.nn.Layer):
def __init__(self): def __init__(self):
super(SimpleNet, self).__init__() super(SimpleNet, self).__init__()
def forward(self, word, category): def forward(self, word, category):
return paddle.sum(word), category return paddle.sum(word), category
paddle.disable_static() paddle.disable_static()
movie_reviews = MovieReviews(mode='train') movie_reviews = MovieReviews(mode='train')
for i in range(10): for i in range(10):
word_list, category = movie_reviews[i] word_list, category = movie_reviews[i]
word_list = paddle.to_tensor(word_list) word_list = paddle.to_tensor(word_list)
category = paddle.to_tensor(category) category = paddle.to_tensor(category)
model = SimpleNet() model = SimpleNet()
word_list, category = model(word_list, category) word_list, category = model(word_list, category)
print(word_list.numpy().shape, category.numpy()) print(word_list.numpy().shape, category.numpy())
\ No newline at end of file
...@@ -6,48 +6,47 @@ Movielens ...@@ -6,48 +6,47 @@ Movielens
.. py:class:: paddle.text.datasets.Movielens() .. py:class:: paddle.text.datasets.Movielens()
Implementation of `Movielens 1-M <https://grouplens.org/datasets/movielens/1m/>`_ dataset. This class implements the `Movielens 1-M <https://grouplens.org/datasets/movielens/1m/>`_
test dataset.
Parameters Parameters
::::::::: :::::::::
data_file(str): path to data tar file, can be set None if - data_file(str) - Path to the compressed data file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
mode(str): 'train' or 'test' mode. Default 'train'. - mode(str) - 'train' or 'test' mode. Default: 'train'.
test_ratio(float): split ratio for test sample. Default 0.1. - test_ratio(float) - Fraction of samples assigned to the test set. Default: 0.1.
rand_seed(int): random seed. Default 0. - rand_seed(int) - Random seed. Default: 0.
download(bool): whether to download dataset automatically if - download(bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.
:attr:`data_file` is not set. Default True
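The ``test_ratio`` and ``rand_seed`` parameters together describe a reproducible train/test split. The sketch below illustrates the idea with a seeded generator that assigns each sample to the test set with probability ``test_ratio``. The helper ``split_samples`` and the per-sample draw are assumptions for illustration, not a statement of how the library splits the data.

```python
import random


def split_samples(samples, test_ratio, rand_seed):
    """Assign each sample to 'test' with probability `test_ratio`.

    Hypothetical helper: a fixed seed makes the assignment repeatable.
    """
    rng = random.Random(rand_seed)
    train, test = [], []
    for sample in samples:
        if rng.random() < test_ratio:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


samples = list(range(1000))
train_a, test_a = split_samples(samples, test_ratio=0.1, rand_seed=0)
train_b, test_b = split_samples(samples, test_ratio=0.1, rand_seed=0)
```

Because the draw is per sample, the test set holds roughly, not exactly, ``test_ratio`` of the data; reusing the same seed reproduces the same partition.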
Returns
Returns:
Dataset: instance of Movielens 1-M dataset
Code example
::::::::: :::::::::
``Dataset``, an instance of the Movielens 1-M dataset.
.. code-block:: python Code example
:::::::::
import paddle .. code-block:: python
from paddle.text.datasets import Movielens
class SimpleNet(paddle.nn.Layer): import paddle
def __init__(self): from paddle.text.datasets import Movielens
super(SimpleNet, self).__init__()
def forward(self, category, title, rating): class SimpleNet(paddle.nn.Layer):
return paddle.sum(category), paddle.sum(title), paddle.sum(rating) def __init__(self):
super(SimpleNet, self).__init__()
paddle.disable_static() def forward(self, category, title, rating):
return paddle.sum(category), paddle.sum(title), paddle.sum(rating)
movielens = Movielens(mode='train') paddle.disable_static()
for i in range(10): movielens = Movielens(mode='train')
category, title, rating = movielens[i][-3:]
category = paddle.to_tensor(category)
title = paddle.to_tensor(title)
rating = paddle.to_tensor(rating)
model = SimpleNet() for i in range(10):
category, title, rating = model(category, title, rating) category, title, rating = movielens[i][-3:]
print(category.numpy().shape, title.numpy().shape, rating.numpy().shape) category = paddle.to_tensor(category)
title = paddle.to_tensor(title)
rating = paddle.to_tensor(rating)
model = SimpleNet()
\ No newline at end of file category, title, rating = model(category, title, rating)
print(category.numpy().shape, title.numpy().shape, rating.numpy().shape)
...@@ -6,46 +6,45 @@ UCIHousing ...@@ -6,46 +6,45 @@ UCIHousing
.. py:class:: paddle.text.datasets.UCIHousing() .. py:class:: paddle.text.datasets.UCIHousing()
Implementation of `UCI housing <https://archive.ics.uci.edu/ml/datasets/Housing>`_ This class implements the `UCI housing <https://archive.ics.uci.edu/ml/datasets/Housing>`_
dataset test dataset.
Parameters Parameters
::::::::: :::::::::
data_file(str): path to data file, can be set None if - data_file(str) - Path to the data file. If :attr:`download` is True,
:attr:`download` is True. Default None it can be set to None. Default: None.
mode(str): 'train' or 'test' mode. Default 'train'. - mode(str) - 'train' or 'test' mode. Default: 'train'.
download(bool): whether to download dataset automatically if - download(bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.
:attr:`data_file` is not set. Default True
Returns: Returns
Dataset: instance of UCI housing dataset. :::::::::
``Dataset``, an instance of the UCI housing dataset.
Code example Code example
::::::::: :::::::::
.. code-block:: python .. code-block:: python
import paddle import paddle
from paddle.text.datasets import UCIHousing from paddle.text.datasets import UCIHousing
class SimpleNet(paddle.nn.Layer): class SimpleNet(paddle.nn.Layer):
def __init__(self): def __init__(self):
super(SimpleNet, self).__init__() super(SimpleNet, self).__init__()
def forward(self, feature, target): def forward(self, feature, target):
return paddle.sum(feature), target return paddle.sum(feature), target
paddle.disable_static() paddle.disable_static()
uci_housing = UCIHousing(mode='train') uci_housing = UCIHousing(mode='train')
for i in range(10): for i in range(10):
feature, target = uci_housing[i] feature, target = uci_housing[i]
feature = paddle.to_tensor(feature) feature = paddle.to_tensor(feature)
target = paddle.to_tensor(target) target = paddle.to_tensor(target)
model = SimpleNet() model = SimpleNet()
feature, target = model(feature, target) feature, target = model(feature, target)
print(feature.numpy().shape, target.numpy()) print(feature.numpy().shape, target.numpy())
\ No newline at end of file
...@@ -6,50 +6,48 @@ WMT14 ...@@ -6,50 +6,48 @@ WMT14
.. py:class:: paddle.text.datasets.WMT14() .. py:class:: paddle.text.datasets.WMT14()
Implementation of `WMT14 <http://www.statmt.org/wmt14/>`_ test dataset. This class implements the `WMT14 <http://www.statmt.org/wmt14/>`_ test dataset.
The original WMT14 dataset is too large and a small set of data for set is Because the original WMT14 dataset is too large, a small subset is provided here. The dataset is downloaded from
provided. This module will download dataset from http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz
http://paddlepaddle.bj.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz by this class.
Parameters Parameters
::::::::: :::::::::
data_file(str): path to data tar file, can be set None if - data_file(str) - Path to the compressed dataset file. Can be set to None if :attr:`download` is True.
:attr:`download` is True. Default None Default: None.
mode(str): 'train', 'test' or 'gen'. Default 'train' - mode(str) - 'train', 'test' or 'gen'. Default: 'train'.
dict_size(int): word dictionary size. Default -1. - dict_size(int) - Word dictionary size. Default: -1.
download(bool): whether to download dataset automatically if - download(bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.
:attr:`data_file` is not set. Default True
Returns: Returns
Dataset: instance of WMT14 dataset
Code example
::::::::: :::::::::
``Dataset``, an instance of the WMT14 dataset.
.. code-block:: python Code example
:::::::::
import paddle .. code-block:: python
from paddle.text.datasets import WMT14
class SimpleNet(paddle.nn.Layer): import paddle
def __init__(self): from paddle.text.datasets import WMT14
super(SimpleNet, self).__init__()
def forward(self, src_ids, trg_ids, trg_ids_next): class SimpleNet(paddle.nn.Layer):
return paddle.sum(src_ids), paddle.sum(trg_ids), paddle.sum(trg_ids_next) def __init__(self):
super(SimpleNet, self).__init__()
paddle.disable_static() def forward(self, src_ids, trg_ids, trg_ids_next):
return paddle.sum(src_ids), paddle.sum(trg_ids), paddle.sum(trg_ids_next)
wmt14 = WMT14(mode='train', dict_size=50) paddle.disable_static()
for i in range(10): wmt14 = WMT14(mode='train', dict_size=50)
src_ids, trg_ids, trg_ids_next = wmt14[i]
src_ids = paddle.to_tensor(src_ids)
trg_ids = paddle.to_tensor(trg_ids)
trg_ids_next = paddle.to_tensor(trg_ids_next)
model = SimpleNet() for i in range(10):
src_ids, trg_ids, trg_ids_next = model(src_ids, trg_ids, trg_ids_next) src_ids, trg_ids, trg_ids_next = wmt14[i]
print(src_ids.numpy(), trg_ids.numpy(), trg_ids_next.numpy()) src_ids = paddle.to_tensor(src_ids)
trg_ids = paddle.to_tensor(trg_ids)
trg_ids_next = paddle.to_tensor(trg_ids_next)
model = SimpleNet()
\ No newline at end of file src_ids, trg_ids, trg_ids_next = model(src_ids, trg_ids, trg_ids_next)
print(src_ids.numpy(), trg_ids.numpy(), trg_ids_next.numpy())
...@@ -6,65 +6,64 @@ WMT16 ...@@ -6,65 +6,64 @@ WMT16
.. py:class:: paddle.text.datasets.WMT16() .. py:class:: paddle.text.datasets.WMT16()
Implementation of `WMT16 <http://www.statmt.org/wmt16/>`_ test dataset. This class implements the `WMT16 <http://www.statmt.org/wmt16/>`_ test dataset.
ACL2016 Multimodal Machine Translation. Please see this website for more ACL2016 Multimodal Machine Translation. For more details, please visit this website:
details: http://www.statmt.org/wmt16/multimodal-task.html#task1 http://www.statmt.org/wmt16/multimodal-task.html#task1
If you use the dataset created for your task, please cite the following paper: If you use this dataset in your task, please cite the following paper:
Multi30K: Multilingual English-German Image Descriptions. Multi30K: Multilingual English-German Image Descriptions.
.. code-block:: text .. code-block:: text
@article{elliott-EtAl:2016:VL16, @article{elliott-EtAl:2016:VL16,
author = {{Elliott}, D. and {Frank}, S. and {Sima"an}, K. and {Specia}, L.}, author = {{Elliott}, D. and {Frank}, S. and {Sima"an}, K. and {Specia}, L.},
title = {Multi30K: Multilingual English-German Image Descriptions}, title = {Multi30K: Multilingual English-German Image Descriptions},
booktitle = {Proceedings of the 6th Workshop on Vision and Language}, booktitle = {Proceedings of the 6th Workshop on Vision and Language},
year = {2016}, year = {2016},
pages = {70--74}, pages = {70--74},
year = 2016 year = 2016
} }
Parameters Parameters
::::::::: :::::::::
data_file(str): path to data tar file, can be set None if - data_file(str) - Path to the compressed dataset file. Can be set to None if :attr:`download` is True.
:attr:`download` is True. Default None Default: None.
mode(str): 'train', 'test' or 'val'. Default 'train' - mode(str) - 'train', 'test' or 'val'. Default: 'train'.
src_dict_size(int): word dictionary size for source language word. Default -1. - src_dict_size(int) - Source language dictionary size. Default: -1.
trg_dict_size(int): word dictionary size for target language word. Default -1. - trg_dict_size(int) - Target language dictionary size. Default: -1.
lang(str): source language, 'en' or 'de'. Default 'en'. - lang(str) - Source language, 'en' or 'de'. Default: 'en'.
download(bool): whether to download dataset automatically if - download(bool) - Whether to download the dataset automatically if :attr:`data_file` is not set. Default: True.
:attr:`data_file` is not set. Default True
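The ``src_dict_size`` and ``trg_dict_size`` parameters cap the vocabulary on each side. The following sketch shows how a size-limited dictionary might be built and applied. The reserved ``<s>``, ``<e>`` and ``<unk>`` entries, the helper names, and the reading of a non-positive size as "no limit" are all assumptions made for illustration and are not confirmed by this diff.

```python
from collections import Counter

START, END, UNK = "<s>", "<e>", "<unk>"


def build_dict(sentences, dict_size):
    """Keep the most frequent words until the dictionary has `dict_size` entries.

    Hypothetical helper: three ids are reserved for the special marks, and a
    non-positive `dict_size` is treated as 'keep every word' (an assumption).
    """
    freq = Counter(word for sentence in sentences for word in sentence)
    vocab = [START, END, UNK]
    for word, _ in sorted(freq.items(), key=lambda wc: (-wc[1], wc[0])):
        if dict_size > 0 and len(vocab) >= dict_size:
            break
        vocab.append(word)
    return {word: i for i, word in enumerate(vocab)}


def to_ids(sentence, word_dict):
    """Replace each word by its id, falling back to the <unk> id."""
    unk = word_dict[UNK]
    return [word_dict.get(word, unk) for word in sentence]


sentences = [["a", "dog", "runs"], ["a", "dog", "sleeps"], ["a", "cat"]]
word_dict = build_dict(sentences, dict_size=5)
ids = to_ids(["a", "cat", "runs"], word_dict)
```

A small size such as the ``src_dict_size=50`` used in the example maps most words to the unknown id, which keeps a demonstration fast at the cost of translation quality.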
Returns
Returns:
Dataset: instance of WMT16 dataset
Code example
::::::::: :::::::::
``Dataset``, an instance of the WMT16 dataset.
.. code-block:: python Code example
:::::::::
.. code-block:: python
import paddle import paddle
from paddle.text.datasets import WMT16 from paddle.text.datasets import WMT16
class SimpleNet(paddle.nn.Layer): class SimpleNet(paddle.nn.Layer):
def __init__(self): def __init__(self):
super(SimpleNet, self).__init__() super(SimpleNet, self).__init__()
def forward(self, src_ids, trg_ids, trg_ids_next): def forward(self, src_ids, trg_ids, trg_ids_next):
return paddle.sum(src_ids), paddle.sum(trg_ids), paddle.sum(trg_ids_next) return paddle.sum(src_ids), paddle.sum(trg_ids), paddle.sum(trg_ids_next)
paddle.disable_static() paddle.disable_static()
wmt16 = WMT16(mode='train', src_dict_size=50, trg_dict_size=50) wmt16 = WMT16(mode='train', src_dict_size=50, trg_dict_size=50)
for i in range(10): for i in range(10):
src_ids, trg_ids, trg_ids_next = wmt16[i] src_ids, trg_ids, trg_ids_next = wmt16[i]
src_ids = paddle.to_tensor(src_ids) src_ids = paddle.to_tensor(src_ids)
trg_ids = paddle.to_tensor(trg_ids) trg_ids = paddle.to_tensor(trg_ids)
trg_ids_next = paddle.to_tensor(trg_ids_next) trg_ids_next = paddle.to_tensor(trg_ids_next)
model = SimpleNet() model = SimpleNet()
src_ids, trg_ids, trg_ids_next = model(src_ids, trg_ids, trg_ids_next) src_ids, trg_ids, trg_ids_next = model(src_ids, trg_ids, trg_ids_next)
print(src_ids.numpy(), trg_ids.numpy(), trg_ids_next.numpy()) print(src_ids.numpy(), trg_ids.numpy(), trg_ids_next.numpy())
\ No newline at end of file