Dataset

Dataset package.

mnist

MNIST dataset.

This module downloads the dataset from http://yann.lecun.com/exdb/mnist/ and parses the training set and test set into Paddle reader creators.

paddle.v2.dataset.mnist.train()

MNIST training set creator.

Returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns:	Training reader creator
Return type:	callable
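
To make the reader idea concrete, here is a minimal sketch that does not call Paddle at all. It assumes the returned callable takes no arguments and yields one (pixels, label) sample per iteration. The name `toy_mnist_reader` and its random data are hypothetical stand-ins for the real downloaded dataset.

```python
import random


def toy_mnist_reader(num_samples=4, seed=0):
    """Hypothetical stand-in for the callable described above."""

    def reader():
        # Re-seed on every call so each pass yields the same samples
        rng = random.Random(seed)
        for _ in range(num_samples):
            # 28 x 28 = 784 pixel intensities; random.random() is in [0, 1)
            pixels = [rng.random() for _ in range(28 * 28)]
            label = rng.randrange(10)  # digit class in [0, 9]
            yield pixels, label

    return reader


reader = toy_mnist_reader()
samples = list(reader())  # calling reader() starts a new pass over the data
for pixels, label in samples:
    print(len(pixels), label)
```

Because the data is produced lazily by a generator, a consumer can stream through a large dataset without holding all of it in memory.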

paddle.v2.dataset.mnist.test()

MNIST test set creator.

Returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns:	Test reader creator.
Return type:	callable

paddle.v2.dataset.mnist.convert(path): Converts the dataset to RecordIO format.

cifar

CIFAR dataset.

This module downloads the dataset from https://www.cs.toronto.edu/~kriz/cifar.html and parses the training and test sets into Paddle reader creators.

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The CIFAR-100 dataset is like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 test images per class.
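
The counts quoted above are mutually consistent, as the short check below shows. The per-image value count assumes three colour channels per 32x32 image, which is an assumption added here and is not stated in the text.

```python
# CIFAR-10: 10 classes x 6000 images per class
cifar10_total = 10 * 6000
cifar10_split = 50000 + 10000  # training + test images

# CIFAR-100: 100 classes x 600 images per class (500 train + 100 test)
cifar100_total = 100 * 600
cifar100_train = 100 * 500
cifar100_test = 100 * 100

# Values per image, assuming 3 colour channels (assumption)
values_per_image = 32 * 32 * 3

print(cifar10_total, cifar10_split)
print(cifar100_total, cifar100_train, cifar100_test)
print(values_per_image)
```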

paddle.v2.dataset.cifar.train100()

CIFAR-100 training set creator.

Returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 99].

Returns:	Training reader creator
Return type:	callable
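
As an illustration of what a sample with pixels in [0, 1] might look like, the sketch below scales 8-bit intensities by 255. Dividing by 255 is an assumption about the preprocessing, and `scale_pixels`, the synthetic image, and the label value are all hypothetical.

```python
def scale_pixels(raw_bytes):
    """Map 8-bit intensities (0..255) to floats in [0, 1]."""
    return [b / 255.0 for b in raw_bytes]


# Hypothetical raw image: 32 x 32 pixels x 3 colour channels
raw_image = bytes((i * 7) % 256 for i in range(32 * 32 * 3))
fine_label = 42  # hypothetical CIFAR-100 class index in [0, 99]

sample = (scale_pixels(raw_image), fine_label)
print(len(sample[0]), min(sample[0]), max(sample[0]), sample[1])
```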

paddle.v2.dataset.cifar.test100()

CIFAR-100 test set creator.

Returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 99].

Returns:	Test reader creator.
Return type:	callable

paddle.v2.dataset.cifar.train10()

CIFAR-10 training set creator.

Returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns:	Training reader creator
Return type:	callable

paddle.v2.dataset.cifar.test10()

CIFAR-10 test set creator.

Returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns:	Test reader creator.
Return type:	callable

paddle.v2.dataset.cifar.convert(path): Converts the dataset to RecordIO format.

conll05

Conll05 dataset. The Paddle semantic role labeling (SRL) book chapter and demo use this dataset as an example. Because the Conll05 dataset is not freely available, the default download URL points to the Conll05 test set, which is public. Users can change the URL and MD5 checksum to point to their own Conll dataset. A pre-trained word vector model based on a Wikipedia corpus is used to initialize the SRL model.

paddle.v2.dataset.conll05.get_dict(): Gets the word, verb, and label dictionaries of the Wikipedia corpus.

paddle.v2.dataset.conll05.get_embedding(): Gets the trained word vectors based on the Wikipedia corpus.

paddle.v2.dataset.conll05.test()

Conll05 test set creator.

Because the training dataset is not free, the test dataset is used for training. Returns a reader creator. Each sample in the reader has nine features, including the sentence sequence, predicate, predicate context, predicate context flag, and tagged sequence.

Returns:	Training reader creator
Return type:	callable

imdb

IMDB dataset.

This module downloads the IMDB dataset from http://ai.stanford.edu/%7Eamaas/data/sentiment/. The dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing. The module also provides an API for building a word dictionary.

paddle.v2.dataset.imdb.build_dict(pattern, cutoff): Build a word dictionary from the corpus. Keys of the dictionary are words, and values are zero-based IDs of these words.
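
The sketch below shows one way a word dictionary with zero-based IDs could be built. It is not the library's implementation: `build_word_dict` is a hypothetical helper that takes already-tokenized documents instead of a `pattern`, treats `cutoff` as a minimum-frequency threshold, orders words by descending frequency, and appends an `<unk>` entry. All of these are assumptions.

```python
from collections import Counter


def build_word_dict(documents, cutoff):
    """Map each word seen more than `cutoff` times to a zero-based ID."""
    freq = Counter(word for doc in documents for word in doc)
    kept = [(w, c) for w, c in freq.items() if c > cutoff]
    # Most frequent first; ties broken alphabetically for determinism
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    word_idx = {w: i for i, (w, _) in enumerate(kept)}
    word_idx["<unk>"] = len(word_idx)  # assumed catch-all for rare words
    return word_idx


docs = [["a", "great", "movie"], ["a", "dull", "movie"], ["a", "great", "cast"]]
word_idx = build_word_dict(docs, cutoff=1)
print(word_idx)
```

Words seen only once ("dull", "cast") fall below the threshold and would be represented by the `<unk>` ID.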

paddle.v2.dataset.imdb.train(word_idx)

IMDB training set creator.

Returns a reader creator. Each sample in the reader consists of a zero-based ID sequence and a label in [0, 1].

Parameters:	word_idx (dict) – word dictionary
Returns:	Training reader creator
Return type:	callable
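
The following sketch illustrates how a tokenized review could be turned into the kind of sample described above, using a word dictionary. The `encode_review` helper, the `<unk>` fallback for unseen words, and the example label are hypothetical and are not taken from the library.

```python
def encode_review(tokens, word_idx, unk_token="<unk>"):
    """Replace each token by its zero-based ID, falling back to <unk>."""
    unk_id = word_idx[unk_token]
    return [word_idx.get(token, unk_id) for token in tokens]


# Hypothetical dictionary of the kind a dictionary builder might return
word_idx = {"a": 0, "great": 1, "movie": 2, "<unk>": 3}

label = 0  # hypothetical sentiment label, either 0 or 1
sample = (encode_review(["a", "truly", "great", "movie"], word_idx), label)
print(sample)
```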

paddle.v2.dataset.imdb.test(word_idx)

IMDB test set creator.

Returns a reader creator. Each sample in the reader consists of a zero-based ID sequence and a label in [0, 1].

Parameters:	word_idx (dict) – word dictionary
Returns:	Test reader creator
Return type:	callable

paddle.v2.dataset.imdb.convert(path): Converts the dataset to RecordIO format.

imikolov

imikolov’s simple dataset.

This module downloads the dataset from http://www.fit.vutbr.cz/~imikolov/rnnlm/ and parses the training set and test set into Paddle reader creators.

paddle.v2.dataset.imikolov.build_dict(min_word_freq=50): Build a word dictionary from the corpus. Keys of the dictionary are words, and values are zero-based IDs of these words.

paddle.v2.dataset.imikolov.train(word_idx, n, data_type=1)

imikolov training set creator.

Returns a reader creator. Each sample in the reader is a word ID tuple.

Parameters:	word_idx (dict) – word dictionary
	n (int) – sliding window size if data_type is NGRAM, otherwise the maximum sequence length
	data_type (member variable of DataType (NGRAM or SEQ)) – data type (ngram or sequence)
Returns:	Training reader creator
Return type:	callable
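
The sliding-window idea behind the n-gram mode can be sketched as follows. This is an illustration only: `ngram_windows` is a hypothetical helper, the word IDs are made up, and the sketch omits any sentence boundary markers the real reader may add.

```python
def ngram_windows(word_ids, n):
    """Yield every run of n consecutive IDs as a tuple."""
    for start in range(len(word_ids) - n + 1):
        yield tuple(word_ids[start:start + n])


sentence_ids = [4, 8, 15, 16, 23]  # hypothetical word IDs for one sentence
windows = list(ngram_windows(sentence_ids, n=3))
print(windows)
```

A sentence shorter than `n` produces no windows in this sketch.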

paddle.v2.dataset.imikolov.test(word_idx, n, data_type=1)

imikolov test set creator.

Returns a reader creator. Each sample in the reader is a word ID tuple.

Parameters:	word_idx (dict) – word dictionary
	n (int) – sliding window size if data_type is NGRAM, otherwise the maximum sequence length
	data_type (member variable of DataType (NGRAM or SEQ)) – data type (ngram or sequence)
Returns:	Test reader creator
Return type:	callable

paddle.v2.dataset.imikolov.convert(path): Converts the dataset to RecordIO format.

movielens

Movielens 1-M dataset.

The Movielens 1-M dataset, collected by GroupLens Research, contains 1 million ratings from 6000 users on 4000 movies. This module downloads the Movielens 1-M dataset from http://files.grouplens.org/datasets/movielens/ml-1m.zip and parses the training set and test set into Paddle reader creators.

paddle.v2.dataset.movielens.get_movie_title_dict(): Get movie title dictionary.

paddle.v2.dataset.movielens.max_movie_id(): Get the maximum value of movie id.

paddle.v2.dataset.movielens.max_user_id(): Get the maximum value of user id.

paddle.v2.dataset.movielens.max_job_id(): Get the maximum value of job id.

paddle.v2.dataset.movielens.movie_categories(): Get movie categories dictionary.

paddle.v2.dataset.movielens.user_info(): Get user info dictionary.

paddle.v2.dataset.movielens.movie_info(): Get movie info dictionary.

paddle.v2.dataset.movielens.convert(path): Converts the dataset to RecordIO format.

class paddle.v2.dataset.movielens.MovieInfo(index, categories, title): Movie id, title and categories information are stored in MovieInfo.

class paddle.v2.dataset.movielens.UserInfo(index, gender, age, job_id): User id, gender, age, and job information are stored in UserInfo.
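
To picture the two record types and the "maximum id" helpers, the sketch below uses plain named tuples with the same field names as the constructor signatures above. `MovieRecord`, `UserRecord`, and every value in them are hypothetical. They are not the library's classes, and the real ones may carry extra behaviour.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the constructor arguments listed above
MovieRecord = namedtuple("MovieRecord", ["index", "categories", "title"])
UserRecord = namedtuple("UserRecord", ["index", "gender", "age", "job_id"])

movies = [
    MovieRecord(index=1, categories=["Comedy"], title="Example Movie A"),
    MovieRecord(index=7, categories=["Drama", "Romance"], title="Example Movie B"),
]
users = [
    UserRecord(index=3, gender="F", age=25, job_id=4),
    UserRecord(index=5, gender="M", age=35, job_id=12),
]

# The kind of values the max_*_id helpers are described as returning
max_movie_id = max(m.index for m in movies)
max_user_id = max(u.index for u in users)
max_job_id = max(u.job_id for u in users)
print(max_movie_id, max_user_id, max_job_id)
```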

sentiment

This script fetches and preprocesses the movie_reviews dataset provided by NLTK.

TODO(yuyang18): Complete dataset.

paddle.v2.dataset.sentiment.get_word_dict(): Sorts the words by their frequency of occurrence in the samples.

Returns:	words_freq_sorted

paddle.v2.dataset.sentiment.train(): Default training set reader creator

paddle.v2.dataset.sentiment.test(): Default test set reader creator

paddle.v2.dataset.sentiment.convert(path): Converts the dataset to RecordIO format.

uci_housing

UCI Housing dataset.

This module downloads the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ and parses the training set and test set into Paddle reader creators.

paddle.v2.dataset.uci_housing.train()

UCI_HOUSING training set creator.

Returns a reader creator. Each sample in the reader consists of the normalized features and the price.

Returns:	Training reader creator
Return type:	callable
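
The text does not say which normalization is applied to the features. As one plausible scheme, the sketch below centres each feature column on its mean and divides by its range. The scheme, the `normalize_columns` helper, and the tiny dataset are all assumptions for illustration.

```python
def normalize_columns(rows):
    """Centre each feature on its mean and divide by its range."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        # A constant column has zero range; map it to 0.0 instead of dividing
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, spans)]
        for row in rows
    ]


# Hypothetical data: two features per house, one price per house
features = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
prices = [24.0, 21.6, 34.7]

samples = list(zip(normalize_columns(features), prices))
for normalized, price in samples:
    print(normalized, price)
```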

paddle.v2.dataset.uci_housing.test()

UCI_HOUSING test set creator.

Returns a reader creator. Each sample in the reader consists of the normalized features and the price.

Returns:	Test reader creator
Return type:	callable

wmt14

WMT14 dataset. The original WMT14 dataset is too large, so a small subset of the data is provided. This module downloads the dataset from http://paddlepaddle.cdn.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz and parses the training set and test set into Paddle reader creators.

paddle.v2.dataset.wmt14.train(dict_size)

WMT14 training set creator.

Returns a reader creator. Each sample in the reader consists of a source language word ID sequence, a target language word ID sequence, and a next word ID sequence.

Returns:	Training reader creator
Return type:	callable
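
One way to read the three-part sample is that the next word ID sequence is the target sequence shifted by one position, as is common when training sequence-to-sequence models. The sketch below illustrates that reading. The start and end marker IDs, the `make_sample` helper, and the shift-by-one construction are assumptions, not confirmed details of this module.

```python
START_ID = 0  # hypothetical ID of a start-of-sentence marker
END_ID = 1    # hypothetical ID of an end-of-sentence marker


def make_sample(src_ids, trg_ids):
    """Build (source, target input, next-word target) from two ID lists."""
    target_input = [START_ID] + trg_ids   # what the decoder would be fed
    target_next = trg_ids + [END_ID]      # what it would be asked to predict
    return src_ids, target_input, target_next


src, trg, trg_next = make_sample([5, 9, 12], [7, 3, 8, 6])
print(src)
print(trg)
print(trg_next)
```

Under this reading, position i of the next word sequence is the word that should follow position i of the target sequence.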

paddle.v2.dataset.wmt14.test(dict_size)

WMT14 test set creator.

Returns a reader creator. Each sample in the reader consists of a source language word ID sequence, a target language word ID sequence, and a next word ID sequence.

Returns:	Test reader creator
Return type:	callable

paddle.v2.dataset.wmt14.convert(path): Converts the dataset to RecordIO format.