Data Reader Interface and DataSets

DataTypes

paddle.v2.data_type.dense_array(dim, seq_type=0)

Dense array. The input feature is a dense array of floats. For example, if the input is an image with 28*28 pixels, the input to the Paddle neural network could be a dense vector of dimension 784 or a numpy array of shape (28, 28).

For the 2-D convolution operation, every sample within one mini-batch must currently have the same size in PaddlePaddle, but the feature dimension may vary across mini-batches. For variable-dimension features, the dim parameter is not used. Instead, the data reader must yield numpy arrays, and the data feeder will set the data shape accordingly.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of input.
Returns:

An input type object.

Return type:

InputType
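
As a rough illustration of the two layouts mentioned above, the sketch below flattens a 28*28 image into a 784-element dense vector using plain Python lists. The helper name flatten_image is hypothetical and is not part of paddle.v2; a real reader would more likely yield numpy arrays.

```python
def flatten_image(image):
    """Flatten a row-major 2-D image (a list of rows) into one flat list of floats."""
    return [float(pixel) for row in image for pixel in row]

# A dummy 28*28 image whose pixel at (r, c) holds the value r * 28 + c.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
vector = flatten_image(image)
print(len(vector))  # 784
```

Under this assumption, the flat list would match a declared dimension of 784, while the nested form corresponds to the (28, 28) shape.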

paddle.v2.data_type.dense_vector(dim, seq_type=0)

Dense array. The input feature is a dense array of floats. For example, if the input is an image with 28*28 pixels, the input to the Paddle neural network could be a dense vector of dimension 784 or a numpy array of shape (28, 28).

For the 2-D convolution operation, every sample within one mini-batch must currently have the same size in PaddlePaddle, but the feature dimension may vary across mini-batches. For variable-dimension features, the dim parameter is not used. Instead, the data reader must yield numpy arrays, and the data feeder will set the data shape accordingly.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.dense_vector_sequence(dim)

Data type of a sequence of dense vectors.

Parameters: dim (int) – dimension of dense vector.
Returns: An input type object
Return type: InputType

paddle.v2.data_type.integer_value(value_range, seq_type=0)

Data type of integer.

Parameters:
  • seq_type (int) – sequence type of this input.
  • value_range (int) – range of this integer.
Returns:

An input type object

Return type:

InputType

paddle.v2.data_type.integer_value_sequence(value_range)

Data type of a sequence of integers.

Parameters: value_range (int) – range of each element.

paddle.v2.data_type.sparse_binary_vector(dim, seq_type=0)

Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.sparse_binary_vector_sequence(dim)

Data type of a sequence of sparse vectors in which every element is either zero or one.

Parameters: dim (int) – dimension of sparse vector.
Returns: An input type object
Return type: InputType

paddle.v2.data_type.sparse_float_vector(dim, seq_type=0)

Sparse vector. The input feature is a sparse vector: most of its elements are zero, and the others can be any float value.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType
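
One compact way to describe such a vector is to keep only its nonzero entries. The sketch below converts between a dense list and a list of (index, value) pairs. The pair layout and both helper names are assumptions made for illustration; this page does not state the exact sample format that the data feeder expects.

```python
def to_pairs(dense):
    """Keep only the nonzero entries of a dense vector as (index, value) pairs."""
    return [(i, float(v)) for i, v in enumerate(dense) if v != 0]

def from_pairs(pairs, dim):
    """Rebuild the dense vector of length dim from its (index, value) pairs."""
    dense = [0.0] * dim
    for i, v in pairs:
        dense[i] = v
    return dense

pairs = to_pairs([0.0, 2.5, 0.0, -1.0])
restored = from_pairs(pairs, 4)
```

Here dim plays the role of the full vector length, which cannot be recovered from the pairs alone.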

paddle.v2.data_type.sparse_float_vector_sequence(dim)

Data type of a sequence of sparse vectors in which most elements are zero and the others can be any float value.

Parameters: dim (int) – dimension of sparse vector.
Returns: An input type object
Return type: InputType

paddle.v2.data_type.sparse_non_value_slot(dim, seq_type=0)

Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.sparse_value_slot(dim, seq_type=0)

Sparse vector. The input feature is a sparse vector: most of its elements are zero, and the others can be any float value.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

class paddle.v2.data_type.InputType(dim, seq_type, tp)

InputType is the base class for paddle input types.

Note

This is a base class and should never be used directly by users.

Parameters:
  • dim (int) – dimension of input. If the input is an integer, it means the value range. Otherwise, it means the size of layer.
  • seq_type (int) – sequence type of input. 0 means it is not a sequence. 1 means it is a variable length sequence. 2 means it is a nested sequence.
  • tp (int) – data type of input.
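
To make the three seq_type values concrete, the sketch below shows what one sample of integer IDs might look like at each level. The constant names and the helper are chosen only for this illustration and are assumptions, not documented API.

```python
NO_SEQUENCE = 0   # a single item
SEQUENCE = 1      # a variable-length list of items
SUB_SEQUENCE = 2  # a list of variable-length lists (nested sequence)

def nesting_level(sample):
    """Count how many list levels wrap the scalar items (assumes non-empty lists)."""
    level = 0
    while isinstance(sample, list):
        level += 1
        sample = sample[0]
    return level

word_id = 7                      # seq_type 0
sentence = [7, 2, 9]             # seq_type 1
paragraph = [[7, 2, 9], [4, 1]]  # seq_type 2
```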

DataFeeder

Reader

At training and testing time, PaddlePaddle programs need to read data. To make data-reading code easier to write, we define the following terms:

  • A reader is a function that reads data (from file, network, random number generator, etc) and yields data items.
  • A reader creator is a function that returns a reader function.
  • A reader decorator is a function that accepts one or more readers and returns a reader.
  • A batch reader is a function that reads data (from reader, file, network, random number generator, etc) and yields a batch of data items.
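
The four terms above can be sketched in plain Python as follows. All function names here are hypothetical and serve only to show the shape of each concept; none of them is part of paddle.v2.

```python
import random

def random_number_reader():
    """Reader: takes no arguments and yields one data item at a time."""
    for _ in range(5):
        yield random.random()

def range_reader_creator(n):
    """Reader creator: returns a reader function."""
    def reader():
        for i in range(n):
            yield i
    return reader

def doubled(reader):
    """Reader decorator: accepts a reader and returns a new reader."""
    def new_reader():
        for item in reader():
            yield item * 2
    return new_reader

def pairs_batch_reader():
    """Batch reader: yields a batch (here, a list of two items) at a time."""
    items = list(range_reader_creator(4)())
    for start in range(0, len(items), 2):
        yield items[start:start + 2]
```

Because every reader is just a zero-argument callable, decorators can be stacked freely, for example doubled(doubled(range_reader_creator(3))).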

Data Reader Interface

In fact, a data reader does not have to be a function that reads and yields data items. It can be any function with no parameters that creates an iterable (anything that can be used in for x in iterable):

iterable = data_reader()

Each element produced by the iterable should be a single entry of data, not a mini-batch. That entry can be a single item or a tuple of items. Each item should be of a supported type (e.g., numpy 1d array of float32, int, list of int).

An example implementation for single item data reader creator:

import numpy

def reader_creator_random_image(width, height):
    def reader():
        # Endless stream of flattened random images with values in [-1, 1).
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader

An example implementation for multiple item data reader creator:

import numpy

def reader_creator_random_image_and_label(width, height, label):
    def reader():
        # Endless stream of (flattened random image, label) tuples.
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height), label
    return reader

TODO(yuyang18): Should we add whole design doc here?

paddle.v2.reader.map_readers(func, *readers)

Creates a data reader that outputs the return value of func, called with the outputs of each data reader as arguments.

Parameters:
  • func (callable) – function to use. The type of func should be (Sample) => Sample.
  • readers – readers whose outputs will be used as arguments of func.

Returns:

the created data reader.

Return type:

callable
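
The behavior described for map_readers can be approximated in a few lines of plain Python. This is a minimal sketch of the idea, not the library's implementation; map_readers_sketch and the two toy readers are hypothetical names.

```python
def map_readers_sketch(func, *readers):
    """Return a reader that yields func(a, b, ...) for items drawn in lockstep."""
    def reader():
        iterators = [r() for r in readers]
        # map() stops as soon as the shortest input is exhausted.
        for result in map(func, *iterators):
            yield result
    return reader

def numbers():
    return iter([1, 2, 3])

def weights():
    return iter([10, 20, 30])

weighted = map_readers_sketch(lambda x, w: x * w, numbers, weights)
print(list(weighted()))  # [10, 40, 90]
```

Each call to the returned reader creates fresh iterators, so the reader can be consumed more than once.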

paddle.v2.reader.buffered(reader, size)

Creates a buffered data reader.

The buffered data reader reads data entries and saves them into a buffer. Reading from the buffered data reader proceeds as long as the buffer is not empty.

Parameters:
  • reader (callable) – the data reader to read from.
  • size (int) – max buffer size.
Returns:

the buffered data reader.
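
One way to picture a buffered reader is a background thread that fills a bounded queue while the consumer drains it. The sketch below follows that idea under stated assumptions: buffered_sketch and source are hypothetical names, and errors raised inside the wrapped reader are not forwarded to the consumer in this simplified version.

```python
import queue
import threading

def buffered_sketch(reader, size):
    """Return a reader that prefetches up to `size` entries in a background thread."""
    end = object()  # sentinel marking the end of the underlying reader

    def fill(q):
        for entry in reader():
            q.put(entry)  # blocks while the buffer is full
        q.put(end)

    def buffered_reader():
        q = queue.Queue(maxsize=size)
        worker = threading.Thread(target=fill, args=(q,), daemon=True)
        worker.start()
        while True:
            entry = q.get()  # blocks while the buffer is empty
            if entry is end:
                break
            yield entry

    return buffered_reader

def source():
    for i in range(5):
        yield i
```

The order of entries is preserved because a single producer thread writes to a FIFO queue.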

paddle.v2.reader.compose(*readers, **kwargs)

Creates a data reader whose output is the combination of input readers.

If the input readers output the following data entries:

(1, 2)    3    (4, 5)

the composed reader will output:

(1, 2, 3, 4, 5)

Parameters:
  • readers – readers that will be composed together.
  • check_alignment (bool) – if True, will check if input readers are aligned correctly. If False, will not check alignment and trailing outputs will be discarded. Defaults to True.
Returns:

the new data reader.

Raises:

ComposeNotAligned – outputs of readers are not aligned. Will not raise when check_alignment is set to False.
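
The flattening and the alignment check described above could be sketched as follows. This is an illustrative approximation only: compose_sketch and ComposeNotAlignedSketch are hypothetical stand-ins, and check_alignment is modeled here as a keyword-only argument.

```python
from itertools import zip_longest

class ComposeNotAlignedSketch(ValueError):
    """Raised by this sketch when readers yield different numbers of entries."""

def compose_sketch(*readers, check_alignment=True):
    def as_tuple(entry):
        # Wrap single items so that every entry can be concatenated as a tuple.
        return entry if isinstance(entry, tuple) else (entry,)

    def reader():
        iterators = [r() for r in readers]
        if not check_alignment:
            # zip() stops at the shortest reader, discarding trailing outputs.
            for entries in zip(*iterators):
                yield sum((as_tuple(e) for e in entries), ())
            return
        missing = object()
        for entries in zip_longest(*iterators, fillvalue=missing):
            if any(e is missing for e in entries):
                raise ComposeNotAlignedSketch("readers are not aligned")
            yield sum((as_tuple(e) for e in entries), ())

    return reader

r1 = lambda: iter([(1, 2)])
r2 = lambda: iter([3])
r3 = lambda: iter([(4, 5)])
r_long = lambda: iter([6, 7])
```

With these toy readers, composing r1, r2 and r3 yields a single entry, while composing r2 with r_long either raises or silently drops the extra entry, depending on check_alignment.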

paddle.v2.reader.chain(*readers)

Creates a data reader whose output is the outputs of the input data readers chained together.

If the input readers output the following data entries:

[0, 0, 0]
[1, 1, 1]
[2, 2, 2]

the chained reader will output:

[0, 0, 0, 1, 1, 1, 2, 2, 2]

Parameters: readers – input readers.
Returns: the new data reader.
Return type: callable

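Chaining amounts to exhausting each reader in turn. The following is a minimal sketch of that behavior in plain Python; chain_sketch and the three toy readers are hypothetical names, not part of the library.

```python
def chain_sketch(*readers):
    """Return a reader that yields all entries of each reader, one reader after another."""
    def reader():
        for r in readers:
            yield from r()
    return reader

zeros = lambda: iter([0, 0, 0])
ones = lambda: iter([1, 1, 1])
twos = lambda: iter([2, 2, 2])
chained = chain_sketch(zeros, ones, twos)
```
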
paddle.v2.reader.shuffle(reader, buf_size)

Creates a data reader whose data output is shuffled.

Output from the iterator created by the original reader is buffered into a shuffle buffer and then shuffled. The size of the shuffle buffer is determined by the argument buf_size.

Parameters:
  • reader (callable) – the original reader whose output will be shuffled.
  • buf_size (int) – shuffle buffer size.
Returns:

the new reader whose output is shuffled.

Return type:

callable
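
A buffer-based shuffle of this kind can be sketched as below. The function name shuffle_sketch and its optional rng argument are assumptions added for illustration (the rng makes the example reproducible). Flushing the buffer whenever it is full is one plausible reading of the description, not a confirmed implementation detail.

```python
import random

def shuffle_sketch(reader, buf_size, rng=random):
    """Return a reader that shuffles entries within windows of buf_size entries."""
    def shuffled_reader():
        buf = []
        for entry in reader():
            buf.append(entry)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                yield from buf
                buf = []
        if buf:  # flush the final, partially filled buffer
            rng.shuffle(buf)
            yield from buf
    return shuffled_reader

source = lambda: iter(range(10))
out = list(shuffle_sketch(source, 4, random.Random(0))())
```

Under this reading, an entry can only move within its own buffer window, so a small buf_size gives only a local shuffle, while a buf_size at least as large as the dataset shuffles everything.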

paddle.v2.reader.firstn(reader, n)

Limits the maximum number of samples that the reader can return.

Parameters:
  • reader (callable) – the data reader to read from.
  • n (int) – the maximum number of samples to return.
Returns:

the decorated reader.

Return type:

callable
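
Limiting a reader is especially useful when the underlying reader never terminates. A minimal sketch of the idea, using the hypothetical name firstn_sketch, is:

```python
import itertools

def firstn_sketch(reader, n):
    """Return a reader that yields at most n entries from the wrapped reader."""
    def limited_reader():
        yield from itertools.islice(reader(), n)
    return limited_reader

endless = lambda: itertools.count()  # yields 0, 1, 2, ... forever
short = lambda: iter([1])
```

Wrapping endless with a limit of 3 yields three entries and then stops, and a reader with fewer than n entries is simply passed through.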

paddle.v2.reader.xmap_readers(mapper, reader, process_num, buffer_size, order=False)

Uses multiple processes to map samples from the reader through a user-defined mapper. This function also contains a buffered decorator.

Parameters:
  • mapper (callable) – a function to map a sample.
  • reader (callable) – the data reader to read from.
  • process_num (int) – number of processes used to handle the original samples.
  • buffer_size (int) – max buffer size.
  • order (bool) – whether to keep the order of the reader.
Returns:

the decorated reader.

Return type:

callable

The creator package contains some simple reader creators that can be used in user programs.

paddle.v2.reader.creator.np_array(x)

Creates a reader that yields the elements of x if it is a numpy vector, the rows of x if it is a numpy matrix, or, more generally, the sub-hyperplanes indexed by the highest dimension.

Parameters: x – the numpy array to create the reader from.
Returns: data reader created from x.

paddle.v2.reader.creator.text_file(path)

Creates a data reader that outputs text line by line from the given text file. The trailing newline (‘\n’) of each line is removed.

Parameters: path – path of the text file.
Returns: data reader of the text file.

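A line-by-line text reader of this kind can be sketched with the standard library alone. The name text_file_sketch is hypothetical, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile

def text_file_sketch(path):
    """Return a reader that yields each line of the file without its trailing newline."""
    def reader():
        with open(path, "r") as f:
            for line in f:
                yield line.rstrip("\n")
    return reader

# Write a small throwaway file, read it back, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
lines = list(text_file_sketch(tmp.name)())
os.remove(tmp.name)
```
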
paddle.v2.reader.creator.cloud_reader(paths, etcd_endpoints, timeout_sec=5, buf_size=64)

Creates a data reader that yields records one by one from the given paths.

Parameters:
  • paths – paths of the recordio files.
  • etcd_endpoints – the endpoints of the etcd cluster.
Returns:

data reader of recordio files.

minibatch

paddle.v2.minibatch.batch(reader, batch_size)

Creates a batched reader.

Parameters:
  • reader (callable) – the data reader to read from.
  • batch_size (int) – size of each mini-batch
Returns:

the batched reader.

Return type:

callable
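
Grouping single entries into mini-batches can be sketched as follows. The name batch_sketch is hypothetical, and keeping the trailing partial batch is a choice made by this sketch rather than something this page specifies.

```python
def batch_sketch(reader, batch_size):
    """Return a batch reader that yields lists of up to batch_size entries."""
    def batch_reader():
        batch = []
        for entry in reader():
            batch.append(entry)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # keeping the trailing partial batch is a choice of this sketch
            yield batch
    return batch_reader

source = lambda: iter(range(5))
batches = list(batch_sketch(source, 2)())
```

With five entries and a batch size of 2, this sketch produces two full batches followed by one batch holding the single remaining entry.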

Dataset

Dataset package.

mnist

MNIST dataset.

This module downloads the dataset from http://yann.lecun.com/exdb/mnist/ and parses the training set and test set into paddle reader creators.

paddle.v2.dataset.mnist.train()

MNIST training set creator.

It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns: Training reader creator
Return type: callable

paddle.v2.dataset.mnist.test()

MNIST test set creator.

It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns: Test reader creator.
Return type: callable

paddle.v2.dataset.mnist.convert(path)

Converts dataset to recordio format

cifar

CIFAR dataset.

This module downloads the dataset from https://www.cs.toronto.edu/~kriz/cifar.html and parses the train/test sets into paddle reader creators.

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The CIFAR-100 dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.

paddle.v2.dataset.cifar.train100()

CIFAR-100 training set creator.

It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 99].

Returns: Training reader creator
Return type: callable

paddle.v2.dataset.cifar.test100()

CIFAR-100 test set creator.

It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 99].

Returns: Test reader creator.
Return type: callable

paddle.v2.dataset.cifar.train10()

CIFAR-10 training set creator.

It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns: Training reader creator
Return type: callable

paddle.v2.dataset.cifar.test10()

CIFAR-10 test set creator.

It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns: Test reader creator.
Return type: callable

paddle.v2.dataset.cifar.convert(path)

Converts dataset to recordio format

conll05

Conll05 dataset. The Paddle semantic role labeling Book chapter and demo use this dataset as an example. Because Conll05 is not freely available to the public, the default download URL points to the Conll05 test set, which is public. Users can change the URL and MD5 to point to their own Conll dataset. A pre-trained word vector model based on a Wikipedia corpus is used to initialize the SRL model.

paddle.v2.dataset.conll05.get_dict()

Get the word, verb and label dictionary of Wikipedia corpus.

paddle.v2.dataset.conll05.get_embedding()

Get the trained word vector based on Wikipedia corpus.

paddle.v2.dataset.conll05.test()

Conll05 test set creator.

Because the training dataset is not free, the test dataset is used for training. It returns a reader creator. Each sample in the reader has nine features, including the sentence sequence, predicate, predicate context, predicate context flag and tagged sequence.

Returns: Training reader creator
Return type: callable

imdb

IMDB dataset.

This module downloads the IMDB dataset from http://ai.stanford.edu/%7Eamaas/data/sentiment/. The dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing. The module also provides an API for building a dictionary.

paddle.v2.dataset.imdb.build_dict(pattern, cutoff)

Build a word dictionary from the corpus. Keys of the dictionary are words, and values are zero-based IDs of these words.
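
The mapping from words to zero-based IDs can be pictured with the sketch below. Everything beyond "keys are words, values are zero-based IDs" is an assumption made for illustration: the sketch takes tokenized documents instead of a file pattern, keeps only words occurring more than cutoff times, and orders IDs by descending frequency. The name build_dict_sketch is hypothetical.

```python
from collections import Counter

def build_dict_sketch(documents, cutoff):
    """Map each sufficiently frequent word to a zero-based ID."""
    counts = Counter(word for doc in documents for word in doc)
    # Assumption of this sketch: drop words seen cutoff times or fewer.
    kept = [(w, c) for w, c in counts.items() if c > cutoff]
    # Most frequent words get the smallest IDs; ties are broken alphabetically.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return {word: idx for idx, (word, _) in enumerate(kept)}

docs = [["good", "movie"], ["bad", "movie"], ["good", "good"]]
word_idx = build_dict_sketch(docs, cutoff=1)
```

A dictionary of this shape is what the word_idx parameter of the training and test set creators expects to receive.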

paddle.v2.dataset.imdb.train(word_idx)

IMDB training set creator.

It returns a reader creator. Each sample in the reader consists of a zero-based ID sequence and a label in [0, 1].

Parameters: word_idx (dict) – word dictionary
Returns: Training reader creator
Return type: callable

paddle.v2.dataset.imdb.test(word_idx)

IMDB test set creator.

It returns a reader creator. Each sample in the reader consists of a zero-based ID sequence and a label in [0, 1].

Parameters: word_idx (dict) – word dictionary
Returns: Test reader creator
Return type: callable

paddle.v2.dataset.imdb.convert(path)

Converts dataset to recordio format

imikolov

imikolov’s simple dataset.

This module downloads the dataset from http://www.fit.vutbr.cz/~imikolov/rnnlm/ and parses the training set and test set into paddle reader creators.

paddle.v2.dataset.imikolov.build_dict(min_word_freq=50)

Build a word dictionary from the corpus. Keys of the dictionary are words, and values are zero-based IDs of these words.

paddle.v2.dataset.imikolov.train(word_idx, n, data_type=1)

imikolov training set creator.

It returns a reader creator. Each sample in the reader is a word ID tuple.

Parameters:
  • word_idx (dict) – word dictionary
  • n (int) – sliding window size if type is ngram, otherwise max length of sequence
  • data_type (member variable of DataType (NGRAM or SEQ)) – data type (ngram or sequence)
Returns:

Training reader creator

Return type:

callable
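
For the ngram case, the sliding window over a sentence of word IDs can be sketched as below. The helper ngram_windows is hypothetical, and the sketch ignores any sentence boundary markers that an actual reader might add.

```python
def ngram_windows(word_ids, n):
    """Yield every run of n consecutive IDs as a tuple."""
    for start in range(len(word_ids) - n + 1):
        yield tuple(word_ids[start:start + n])

sentence_ids = [4, 8, 15, 16, 23]
windows = list(ngram_windows(sentence_ids, 3))
```

A sentence shorter than n produces no windows at all in this sketch.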

paddle.v2.dataset.imikolov.test(word_idx, n, data_type=1)

imikolov test set creator.

It returns a reader creator. Each sample in the reader is a word ID tuple.

Parameters:
  • word_idx (dict) – word dictionary
  • n (int) – sliding window size if type is ngram, otherwise max length of sequence
  • data_type (member variable of DataType (NGRAM or SEQ)) – data type (ngram or sequence)
Returns:

Test reader creator

Return type:

callable

paddle.v2.dataset.imikolov.convert(path)

Converts dataset to recordio format

movielens

Movielens 1-M dataset.

The Movielens 1-M dataset contains 1 million ratings from 6000 users on 4000 movies and was collected by GroupLens Research. This module downloads the Movielens 1-M dataset from http://files.grouplens.org/datasets/movielens/ml-1m.zip and parses the training set and test set into paddle reader creators.

paddle.v2.dataset.movielens.get_movie_title_dict()

Get movie title dictionary.

paddle.v2.dataset.movielens.max_movie_id()

Get the maximum value of movie id.

paddle.v2.dataset.movielens.max_user_id()

Get the maximum value of user id.

paddle.v2.dataset.movielens.max_job_id()

Get the maximum value of job id.

paddle.v2.dataset.movielens.movie_categories()

Get movie categories dictionary.

paddle.v2.dataset.movielens.user_info()

Get user info dictionary.

paddle.v2.dataset.movielens.movie_info()

Get movie info dictionary.

paddle.v2.dataset.movielens.convert(path)

Converts dataset to recordio format

class paddle.v2.dataset.movielens.MovieInfo(index, categories, title)

Movie id, title and categories information are stored in MovieInfo.

class paddle.v2.dataset.movielens.UserInfo(index, gender, age, job_id)

User id, gender, age, and job information are stored in UserInfo.

sentiment

This script fetches and preprocesses the movie_reviews data set provided by NLTK.

TODO(yuyang18): Complete dataset.

paddle.v2.dataset.sentiment.get_word_dict()

Sorts the words by how frequently they occur in the samples.

Returns: words_freq_sorted

paddle.v2.dataset.sentiment.train()

Default training set reader creator

paddle.v2.dataset.sentiment.test()

Default test set reader creator

paddle.v2.dataset.sentiment.convert(path)

Converts dataset to recordio format

uci_housing

UCI Housing dataset.

This module downloads the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ and parses the training set and test set into paddle reader creators.

paddle.v2.dataset.uci_housing.train()

UCI_HOUSING training set creator.

It returns a reader creator. Each sample in the reader consists of the normalized features and the price.

Returns: Training reader creator
Return type: callable

paddle.v2.dataset.uci_housing.test()

UCI_HOUSING test set creator.

It returns a reader creator. Each sample in the reader consists of the normalized features and the price.

Returns: Test reader creator
Return type: callable

wmt14

WMT14 dataset. The original WMT14 dataset is too large, so a small subset of the data is provided. This module downloads the dataset from http://paddlepaddle.cdn.bcebos.com/demo/wmt_shrinked_data/wmt14.tgz and parses the training set and test set into paddle reader creators.

paddle.v2.dataset.wmt14.train(dict_size)

WMT14 training set creator.

It returns a reader creator. Each sample in the reader consists of a source language word ID sequence, a target language word ID sequence and a next word ID sequence.

Returns: Training reader creator
Return type: callable

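A common way to build such a triple for sequence-to-sequence training is to shift the target sentence by one position. The sketch below shows that layout; whether the wmt14 reader constructs its samples exactly this way is an assumption, and START_ID, END_ID and make_sample are hypothetical names and values.

```python
START_ID = 0  # hypothetical ID of the start-of-sentence token
END_ID = 1    # hypothetical ID of the end-of-sentence token

def make_sample(source_ids, target_ids):
    """Build (source, target input, next word) sequences from one sentence pair."""
    decoder_input = [START_ID] + target_ids  # what the decoder sees at each step
    next_words = target_ids + [END_ID]       # what the decoder should predict
    return source_ids, decoder_input, next_words

sample = make_sample([5, 6, 7], [8, 9])
```

In this layout the target input and next word sequences always have the same length, so position i of one lines up with position i of the other.
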
paddle.v2.dataset.wmt14.test(dict_size)

WMT14 test set creator.

It returns a reader creator. Each sample in the reader consists of a source language word ID sequence, a target language word ID sequence and a next word ID sequence.

Returns: Test reader creator
Return type: callable

paddle.v2.dataset.wmt14.convert(path)

Converts dataset to recordio format