Dataset
Dataset package.
mnist
MNIST dataset.
This module downloads the dataset from http://yann.lecun.com/exdb/mnist/ and parses the training set and test set into paddle reader creators.
-
paddle.v2.dataset.mnist.train()
MNIST training set creator.
It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].
Returns: Training reader creator
Return type: callable
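The reader creator pattern described above can be sketched without PaddlePaddle installed. This is a minimal illustration, not the library's implementation: `toy_train` is a hypothetical stand-in for `paddle.v2.dataset.mnist.train`, and the two 4-pixel samples are made up.

```python
def toy_train():
    """Hypothetical stand-in for a creator such as mnist.train()."""
    # Made-up samples: (pixels scaled to [0, 1], label in [0, 9]).
    samples = [
        ([0.0, 0.5, 1.0, 0.25], 7),
        ([1.0, 1.0, 0.0, 0.0], 3),
    ]

    def reader():
        # Each call to reader() starts a fresh pass over the data.
        for pixels, label in samples:
            yield pixels, label

    return reader


reader = toy_train()          # the creator returns a callable
first_pass = list(reader())   # calling it yields the samples one by one
second_pass = list(reader())  # it can be called again for another pass
```

Returning a callable rather than a generator lets a training loop restart iteration at every pass over the data.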
-
paddle.v2.dataset.mnist.test()
MNIST test set creator.
It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].
Returns: Test reader creator
Return type: callable
-
paddle.v2.dataset.mnist.convert(path)
Converts the dataset to RecordIO format.
cifar
CIFAR dataset.
This module downloads the dataset from https://www.cs.toronto.edu/~kriz/cifar.html and parses the training and test sets into paddle reader creators.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The CIFAR-100 dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.
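As a quick consistency check, the image counts quoted above can be recomputed from the class counts. This snippet only restates the arithmetic from the two paragraphs above and does not touch the dataset itself.

```python
# CIFAR-10: 10 classes, 6000 images per class.
cifar10_total = 10 * 6000
# The documented split is 50000 training and 10000 test images.
cifar10_split_total = 50000 + 10000

# CIFAR-100: 100 classes, 600 images per class (500 train + 100 test).
cifar100_total = 100 * 600
cifar100_train = 100 * 500
cifar100_test = 100 * 100

print(cifar10_total, cifar10_split_total)             # 60000 60000
print(cifar100_total, cifar100_train, cifar100_test)  # 60000 50000 10000
```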
-
paddle.v2.dataset.cifar.train100()
CIFAR-100 training set creator.
It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 99].
Returns: Training reader creator
Return type: callable
-
paddle.v2.dataset.cifar.test100()
CIFAR-100 test set creator.
It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 99].
Returns: Test reader creator
Return type: callable
-
paddle.v2.dataset.cifar.train10()
CIFAR-10 training set creator.
It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].
Returns: Training reader creator
Return type: callable
-
paddle.v2.dataset.cifar.test10()
CIFAR-10 test set creator.
It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].
Returns: Test reader creator
Return type: callable
-
paddle.v2.dataset.cifar.convert(path)
Converts the dataset to RecordIO format.
conll05
Conll05 dataset. The Paddle semantic role labeling (SRL) book chapter and demo use this dataset as an example. Because the Conll05 dataset is not freely available, the default download URL points to the Conll05 test set, which is public. Users can change the URL and MD5 to point to their own copy of the Conll dataset. A pre-trained word vector model based on a Wikipedia corpus is used to initialize the SRL model.
-
paddle.v2.dataset.conll05.get_dict()
Get the word, verb, and label dictionaries of the Wikipedia corpus.
-
paddle.v2.dataset.conll05.get_embedding()
Get the trained word vectors based on the Wikipedia corpus.
-
paddle.v2.dataset.conll05.test()
Conll05 test set creator.
Because the training dataset is not free, the test dataset is used for training. It returns a reader creator. Each sample in the reader consists of nine features, including the sentence sequence, predicate, predicate context, predicate context flag, and tagged sequence.
Returns: Training reader creator
Return type: callable
imdb
IMDB dataset.
This module downloads the IMDB dataset from http://ai.stanford.edu/%7Eamaas/data/sentiment/. The dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing. The module also provides an API for building a word dictionary.
-
paddle.v2.dataset.imdb.build_dict(pattern, cutoff)
Build a word dictionary from the corpus. Keys of the dictionary are words, and values are zero-based IDs of these words.
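The idea of a word dictionary with zero-based IDs can be sketched on a toy corpus. The function below is hypothetical and is not the library's `build_dict`: it takes a list of strings instead of a `pattern`, and the tokenization, the strict `> cutoff` comparison, and the frequency-then-alphabetical ID order are all assumptions made for illustration.

```python
import re
from collections import Counter


def build_dict_sketch(documents, cutoff):
    """Map words to zero-based IDs, dropping words seen `cutoff` times or fewer."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))

    kept = [(w, c) for w, c in counts.items() if c > cutoff]
    # Most frequent first; ties broken alphabetically so IDs are deterministic.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return {word: idx for idx, (word, _) in enumerate(kept)}


docs = ["A great great movie", "a dull movie", "great acting"]  # made-up corpus
word_idx = build_dict_sketch(docs, cutoff=1)
print(word_idx)  # {'great': 0, 'a': 1, 'movie': 2}
```

A dictionary of this shape is what the `word_idx` parameter of the reader creators expects: words as keys, integer IDs as values.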
-
paddle.v2.dataset.imdb.train(word_idx)
IMDB training set creator.
It returns a reader creator. Each sample in the reader consists of a zero-based word ID sequence and a label in [0, 1].
Parameters: word_idx (dict) – word dictionary
Returns: Training reader creator
Return type: callable
-
paddle.v2.dataset.imdb.test(word_idx)
IMDB test set creator.
It returns a reader creator. Each sample in the reader consists of a zero-based word ID sequence and a label in [0, 1].
Parameters: word_idx (dict) – word dictionary
Returns: Test reader creator
Return type: callable
-
paddle.v2.dataset.imdb.convert(path)
Converts the dataset to RecordIO format.
imikolov
imikolov’s simple dataset.
This module downloads the dataset from http://www.fit.vutbr.cz/~imikolov/rnnlm/ and parses the training set and test set into paddle reader creators.
-
paddle.v2.dataset.imikolov.build_dict(min_word_freq=50)
Build a word dictionary from the corpus. Keys of the dictionary are words, and values are zero-based IDs of these words.
-
paddle.v2.dataset.imikolov.train(word_idx, n, data_type=1)
imikolov training set creator.
It returns a reader creator. Each sample in the reader is a tuple of word IDs.
Parameters: - word_idx (dict) – word dictionary
- n (int) – sliding window size if data_type is NGRAM, otherwise the maximum sequence length
- data_type (member variable of DataType (NGRAM or SEQ)) – data type (n-gram or sequence)
Returns: Training reader creator
Return type: callable
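The two roles of `n` described above can be sketched as follows. Both helper functions are hypothetical, and the boundary-marker IDs and the exact padding and length rules are assumptions rather than the library's documented behavior.

```python
def ngram_samples(ids, n, start_id, end_id):
    """Yield every window of n consecutive IDs from the padded sentence."""
    padded = [start_id] + ids + [end_id]
    for i in range(n, len(padded) + 1):
        yield tuple(padded[i - n:i])


def seq_sample(ids, n, start_id, end_id):
    """Return (input, target) sequences, or None if the sentence is too long."""
    src = [start_id] + ids
    trg = ids + [end_id]
    if n > 0 and len(src) > n:
        return None
    return src, trg


sentence = [4, 9, 2]  # made-up word IDs
START, END = 0, 1     # made-up IDs for the sentence boundary markers

windows = list(ngram_samples(sentence, 3, START, END))
# windows == [(0, 4, 9), (4, 9, 2), (9, 2, 1)]
pair = seq_sample(sentence, 5, START, END)
# pair == ([0, 4, 9, 2], [4, 9, 2, 1])
```

In the n-gram case every sample has exactly `n` IDs; in the sequence case `n` only acts as an upper bound on sentence length.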
-
paddle.v2.dataset.imikolov.test(word_idx, n, data_type=1)
imikolov test set creator.
It returns a reader creator. Each sample in the reader is a tuple of word IDs.
Parameters: - word_idx (dict) – word dictionary
- n (int) – sliding window size if data_type is NGRAM, otherwise the maximum sequence length
- data_type (member variable of DataType (NGRAM or SEQ)) – data type (n-gram or sequence)
Returns: Test reader creator
Return type: callable
-
paddle.v2.dataset.imikolov.convert(path)
Converts the dataset to RecordIO format.
movielens
Movielens 1-M dataset.
The Movielens 1-M dataset, collected by GroupLens Research, contains 1 million ratings from 6000 users on 4000 movies. This module downloads the Movielens 1-M dataset from http://files.grouplens.org/datasets/movielens/ml-1m.zip and parses the training set and test set into paddle reader creators.
-
paddle.v2.dataset.movielens.get_movie_title_dict()
Get the movie title dictionary.
-
paddle.v2.dataset.movielens.max_movie_id()
Get the maximum movie ID.
-
paddle.v2.dataset.movielens.max_user_id()
Get the maximum user ID.
-
paddle.v2.dataset.movielens.max_job_id()
Get the maximum job ID.
-
paddle.v2.dataset.movielens.movie_categories()
Get the movie categories dictionary.
-
paddle.v2.dataset.movielens.user_info()
Get the user info dictionary.
-
paddle.v2.dataset.movielens.movie_info()
Get the movie info dictionary.
-
paddle.v2.dataset.movielens.convert(path)
Converts the dataset to RecordIO format.
-
class paddle.v2.dataset.movielens.MovieInfo(index, categories, title)
Stores the movie ID, title, and categories.
-
class paddle.v2.dataset.movielens.UserInfo(index, gender, age, job_id)
Stores the user ID, gender, age, and job.
sentiment
This script fetches and preprocesses the movie_reviews dataset provided by NLTK.
TODO(yuyang18): Complete dataset.
-
paddle.v2.dataset.sentiment.get_word_dict()
Sorts the words by how frequently they occur in the samples.
Returns: words_freq_sorted
-
paddle.v2.dataset.sentiment.train()
Default training set reader creator.
-
paddle.v2.dataset.sentiment.test()
Default test set reader creator.
-
paddle.v2.dataset.sentiment.convert(path)
Converts the dataset to RecordIO format.
uci_housing
UCI Housing dataset.
This module downloads the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ and parses the training set and test set into paddle reader creators.
-
paddle.v2.dataset.uci_housing.train()
UCI_HOUSING training set creator.
It returns a reader creator. Each sample in the reader consists of normalized features and a price.
Returns: Training reader creator
Return type: callable
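The description above does not say which normalization is applied. As one plausible reading, the sketch below centers each feature on its mean and divides by its range. The function name, the formula, and the sample rows are assumptions for illustration only.

```python
def normalize_columns(rows):
    """Scale each feature column to (x - mean) / (max - min)."""
    columns = list(zip(*rows))
    normalized_columns = []
    for col in columns:
        mean = sum(col) / len(col)
        spread = max(col) - min(col)
        # Guard against constant columns to avoid division by zero.
        scale = spread if spread != 0 else 1.0
        normalized_columns.append([(x - mean) / scale for x in col])
    return [list(row) for row in zip(*normalized_columns)]


features = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]  # made-up feature rows
normalized = normalize_columns(features)
# normalized == [[-0.5, -0.5], [0.0, 0.5], [0.5, 0.0]]
```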
-
paddle.v2.dataset.uci_housing.test()
UCI_HOUSING test set creator.
It returns a reader creator. Each sample in the reader consists of normalized features and a price.
Returns: Test reader creator
Return type: callable
wmt14
WMT14 dataset. The original WMT14 dataset is too large, so a small subset of the data is provided. This module downloads the dataset from http://paddlepaddle.cdn.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz and parses the training set and test set into paddle reader creators.
-
paddle.v2.dataset.wmt14.train(dict_size)
WMT14 training set creator.
It returns a reader creator. Each sample in the reader consists of a source language word ID sequence, a target language word ID sequence, and a next word ID sequence.
Returns: Training reader creator
Return type: callable
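The three sequences in each sample can be illustrated with a toy sentence pair. The helper below is hypothetical: the marker tokens `<s>`, `<e>`, and `<unk>`, the handling of unknown words, and the tiny dictionaries are assumptions, not the library's documented behavior.

```python
def make_sample(src_words, trg_words, src_dict, trg_dict,
                start="<s>", end="<e>", unk="<unk>"):
    """Build (source IDs, target IDs, next-word IDs) for one sentence pair."""
    src_ids = [src_dict.get(w, src_dict[unk])
               for w in [start] + src_words + [end]]
    trg_core = [trg_dict.get(w, trg_dict[unk]) for w in trg_words]
    trg_ids = [trg_dict[start]] + trg_core     # decoder input
    trg_next_ids = trg_core + [trg_dict[end]]  # same words shifted by one
    return src_ids, trg_ids, trg_next_ids


# Made-up dictionaries; a real dictionary would hold up to dict_size entries.
src_dict = {"<s>": 0, "<e>": 1, "<unk>": 2, "bonjour": 3, "monde": 4}
trg_dict = {"<s>": 0, "<e>": 1, "<unk>": 2, "hello": 3, "world": 4}

sample = make_sample(["bonjour", "monde"], ["hello", "there"],
                     src_dict, trg_dict)
# sample == ([0, 3, 4, 1], [0, 3, 2], [3, 2, 1])
```

The next word sequence is the target sequence shifted left by one position, so at each step the model is trained to predict the word that follows its current input.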
-
paddle.v2.dataset.wmt14.test(dict_size)
WMT14 test set creator.
It returns a reader creator. Each sample in the reader consists of a source language word ID sequence, a target language word ID sequence, and a next word ID sequence.
Returns: Test reader creator
Return type: callable
-
paddle.v2.dataset.wmt14.convert(path)
Converts the dataset to RecordIO format.