Datasets

DataTypes

paddle.v2.data_type.dense_vector(dim, seq_type=0)

Dense vector. It means the input feature is a dense float vector. For example, if the input is an image with 28*28 pixels, the input to the Paddle neural network should be a dense vector with dimension 784.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of input.
Returns:

An input type object.

Return type:

InputType
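
For illustration, a sample for a dense_vector input is just a flat array of floats. The sketch below flattens a 28*28 image into a 784-dimensional sample using plain numpy; it shows the sample shape only, not actual Paddle API usage:

```python
import numpy as np

# A 28x28 grayscale image becomes one dense_vector(784) sample
# by flattening it into a 1-D float array.
image = np.random.rand(28, 28).astype('float32')
sample = image.flatten()
assert sample.shape == (784,)
```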

paddle.v2.data_type.dense_vector_sequence(dim)

Data type of a sequence of dense vector.

Parameters: dim (int) – dimension of dense vector.
Returns: An input type object.
Return type: InputType
paddle.v2.data_type.integer_value(value_range, seq_type=0)

Data type of integer.

Parameters:
  • seq_type (int) – sequence type of this input.
  • value_range (int) – range of this integer.
Returns:

An input type object

Return type:

InputType

paddle.v2.data_type.integer_value_sequence(value_range)

Data type of a sequence of integer.

Parameters: value_range (int) – range of each element.
paddle.v2.data_type.sparse_binary_vector(dim, seq_type=0)

Sparse binary vector. It means the input feature is a sparse vector and every element in this vector is either zero or one.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType
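
For illustration, a sparse binary sample can be encoded compactly as the indices of its non-zero positions. The exact sample format the feeder expects is an assumption here; the sketch only shows the encoding idea:

```python
# Encode a dense 0/1 vector as the list of indices whose value is one.
dense = [0, 1, 0, 0, 1, 1, 0, 0]
sparse_sample = [i for i, v in enumerate(dense) if v == 1]
print(sparse_sample)  # [1, 4, 5]
```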

paddle.v2.data_type.sparse_binary_vector_sequence(dim)

Data type of a sequence of sparse vectors, in which every element is either zero or one.

Parameters: dim (int) – dimension of sparse vector.
Returns: An input type object.
Return type: InputType
paddle.v2.data_type.sparse_non_value_slot(dim, seq_type=0)

Sparse binary vector. It means the input feature is a sparse vector and every element in this vector is either zero or one.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.sparse_value_slot(dim, seq_type=0)

Sparse vector. It means the input feature is a sparse vector. Most of the elements in this vector are zero; the others can be any float value.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.sparse_vector(dim, seq_type=0)

Sparse vector. It means the input feature is a sparse vector. Most of the elements in this vector are zero; the others can be any float value.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType
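
Similarly, a sparse float sample can be encoded as (index, value) pairs for its non-zero entries. The exact feeder format is an assumption; the sketch shows the encoding idea only:

```python
# Encode a mostly-zero float vector as (index, value) pairs.
dense = [0.0, 0.5, 0.0, 0.0, 1.25]
sparse_sample = [(i, v) for i, v in enumerate(dense) if v != 0.0]
print(sparse_sample)  # [(1, 0.5), (4, 1.25)]
```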

paddle.v2.data_type.sparse_vector_sequence(dim)

Data type of a sequence of sparse vectors, in which most elements are zero and the others can be any float value.

Parameters: dim (int) – dimension of sparse vector.
Returns: An input type object.
Return type: InputType
class paddle.v2.data_type.InputType(dim, seq_type, tp)

InputType is the base class for paddle input types.

Note

This is a base class and should never be used directly by users.

Parameters:
  • dim (int) – dimension of input. If the input is an integer, it means the value range. Otherwise, it means the size of layer.
  • seq_type (int) – sequence type of input. 0 means it is not a sequence. 1 means it is a variable length sequence. 2 means it is a nested sequence.
  • tp (int) – data type of input.

DataFeeder

class paddle.v2.data_feeder.DataFeeder(data_types, feeding=None)

DataFeeder converts the data returned by paddle.reader into the Arguments data structure defined in the API. paddle.reader usually returns a list of mini-batch data entries; each entry in the list is one sample, and each sample is a list or a tuple with one or more features. DataFeeder converts these mini-batch entries into Arguments so that they can be fed to the C++ interface.

A simple usage example is shown below:

feeding = ['image', 'label']
data_types = enumerate_data_types_of_data_layers(topology)
feeder = DataFeeder(data_types=data_types, feeding=feeding)

minibatch_data = [([1.0, 2.0, 3.0, ...], 5)]

arg = feeder(minibatch_data)

If the mini-batch data and the data layers do not map one to one, we can pass a dictionary as the feeding parameter to describe the mapping.

data_types = [('image', paddle.data_type.dense_vector(784)),
              ('label', paddle.data_type.integer_value(10))]
feeding = {'image':0, 'label':1}
feeder = DataFeeder(data_types=data_types, feeding=feeding)
minibatch_data = [
                   ( [1.0,2.0,3.0,4.0], 5, [6,7,8] ),  # first sample
                   ( [1.0,2.0,3.0,4.0], 5, [6,7,8] )   # second sample
                 ]
# or minibatch_data = [
#                       [ [1.0,2.0,3.0,4.0], 5, [6,7,8] ],  # first sample
#                       [ [1.0,2.0,3.0,4.0], 5, [6,7,8] ]   # second sample
#                     ]
arg = feeder(minibatch_data)

Note

This module is for internal use only. Users should use the reader interface.

Parameters:
  • data_types (list) – A list to specify data name and type. Each item is a tuple of (data_name, data_type).
  • feeding (dict|collections.Sequence|None) – A dictionary or a sequence to specify the position of each data in the input data.
convert(dat, argument=None)
Parameters:
  • dat (list) – A list of mini-batch data. Each sample is a list or tuple with one or more features.
  • argument (py_paddle.swig_paddle.Arguments) – An Arguments object contains this mini-batch data with one or multiple features. The Arguments definition is in the API.

Reader

At training and testing time, PaddlePaddle programs need to read data. To ease the work of writing data-reading code, we define the following:

  • A reader is a function that reads data (from file, network, random number generator, etc) and yields data items.
  • A reader creator is a function that returns a reader function.
  • A reader decorator is a function, which accepts one or more readers, and returns a reader.
  • A batch reader is a function that reads data (from reader, file, network, random number generator, etc) and yields a batch of data items.

Data Reader Interface

In fact, a data reader doesn't have to be a function that reads and yields data items. It can be any function with no parameters that creates an iterable (anything that can be used in for x in iterable):

iterable = data_reader()

Each element produced by the iterable should be a single entry of data, not a mini-batch. An entry could be a single item or a tuple of items, where each item is of a supported type (e.g., a numpy 1-D array of float32, an int, or a list of ints).

An example implementation for single item data reader creator:

import numpy

def reader_creator_random_image(width, height):
    def reader():
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader

An example implementation for multiple item data reader creator:

import numpy

def reader_creator_random_image_and_label(width, height, label):
    def reader():
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height), label
    return reader

TODO(yuyang18): Should we add whole design doc here?

paddle.v2.reader.map_readers(func, *readers)

Creates a data reader that outputs the return value of func, using the output of each input reader as the arguments of func.

Parameters:
  • func (callable) – function to use. The type of func should be (Sample) => Sample.
  • readers – readers whose outputs will be used as arguments of func.

Returns:

the created data reader.

Return type:

callable
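
The semantics can be sketched in plain Python (an illustrative re-implementation, not the actual paddle.v2 source):

```python
def map_readers(func, *readers):
    # Apply func to the parallel outputs of the input readers.
    def data_reader():
        for args in zip(*(r() for r in readers)):
            yield func(*args)
    return data_reader

xs = lambda: iter([1, 2, 3])
ys = lambda: iter([10, 20, 30])
print(list(map_readers(lambda x, y: x + y, xs, ys)()))  # [11, 22, 33]
```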

paddle.v2.reader.buffered(reader, size)

Creates a buffered data reader.

The buffered data reader will read and save data entries into a buffer. Reading from the buffered data reader will proceed as long as the buffer is not empty.

Parameters:
  • reader (callable) – the data reader to read from.
  • size (int) – max buffer size.
Returns:

the buffered data reader.
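
A minimal sketch of the buffering idea, using a background thread and a bounded queue; this illustrates the semantics and is not the actual implementation:

```python
import threading
import queue

def buffered(reader, size):
    # Prefetch up to `size` items from `reader` on a background thread.
    end = object()  # sentinel marking the end of the stream

    def data_reader():
        q = queue.Queue(maxsize=size)

        def fill():
            for item in reader():
                q.put(item)
            q.put(end)

        threading.Thread(target=fill, daemon=True).start()
        while True:
            item = q.get()
            if item is end:
                return
            yield item

    return data_reader

print(list(buffered(lambda: iter(range(5)), 2)()))  # [0, 1, 2, 3, 4]
```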

paddle.v2.reader.compose(*readers, **kwargs)

Creates a data reader whose output is the combination of input readers.

If the input readers output the following data entries: (1, 2), 3, (4, 5), then the composed reader will output: (1, 2, 3, 4, 5).

Parameters:
  • readers – readers that will be composed together.
  • check_alignment (bool) – if True, will check if input readers are aligned correctly. If False, will not check alignment and trailing outputs will be discarded. Defaults to True.
Returns:

the new data reader.

Raises:

ComposeNotAligned – raised when the outputs of the readers are not aligned. Not raised when check_alignment is set to False.
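
A plain-Python sketch of the compose semantics, ignoring the alignment check (not the actual implementation):

```python
def compose(*readers):
    # Zip the readers together and flatten tuple outputs into one tuple.
    def to_tuple(x):
        return x if isinstance(x, tuple) else (x,)

    def data_reader():
        for outputs in zip(*(r() for r in readers)):
            yield sum((to_tuple(o) for o in outputs), ())
    return data_reader

r1 = lambda: iter([(1, 2)])
r2 = lambda: iter([3])
r3 = lambda: iter([(4, 5)])
print(list(compose(r1, r2, r3)()))  # [(1, 2, 3, 4, 5)]
```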

paddle.v2.reader.chain(*readers)

Creates a data reader whose output is the outputs of input data readers chained together.

If the input readers output the following data entries: [0, 0, 0], [1, 1, 1], [2, 2, 2], then the chained reader will output: [0, 0, 0, 1, 1, 1, 2, 2, 2].

Parameters: readers – input readers.
Returns: the new data reader.
Return type: callable
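
A plain-Python sketch of the chaining semantics (not the actual implementation):

```python
import itertools

def chain(*readers):
    # Emit everything from the first reader, then the second, and so on.
    def data_reader():
        return itertools.chain(*(r() for r in readers))
    return data_reader

a = lambda: iter([0, 0, 0])
b = lambda: iter([1, 1, 1])
print(list(chain(a, b)()))  # [0, 0, 0, 1, 1, 1]
```
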
paddle.v2.reader.shuffle(reader, buf_size)

Creates a data reader whose data output is shuffled.

Output from the iterator created by the original reader will be buffered into a shuffle buffer and then shuffled. The size of the shuffle buffer is determined by the argument buf_size.

Parameters:
  • reader (callable) – the original reader whose output will be shuffled.
  • buf_size (int) – shuffle buffer size.
Returns:

the new reader whose output is shuffled.

Return type:

callable
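
A plain-Python sketch of the buffered-shuffle semantics (not the actual implementation):

```python
import random

def shuffle(reader, buf_size):
    # Fill a buffer of buf_size items, shuffle it, then drain it.
    def data_reader():
        buf = []
        for item in reader():
            buf.append(item)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                yield from buf
                buf = []
        if buf:  # shuffle and flush the final, partial buffer
            random.shuffle(buf)
            yield from buf
    return data_reader

out = list(shuffle(lambda: iter(range(10)), 4)())
assert sorted(out) == list(range(10))  # same items, reordered within buffers
```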

paddle.v2.reader.firstn(reader, n)

Limit the max number of samples that the reader can return.

Parameters:
  • reader (callable) – the data reader to read from.
  • n (int) – the max number of samples to return.
Returns:

the decorated reader.

Return type:

callable
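
The truncation semantics can be sketched with itertools.islice (an illustration, not the actual implementation):

```python
import itertools

def firstn(reader, n):
    # Truncate the reader after n samples.
    def data_reader():
        return itertools.islice(reader(), n)
    return data_reader

print(list(firstn(lambda: iter(range(100)), 3)()))  # [0, 1, 2]
```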

The creator package contains some simple reader creators, which can be used in user programs.

paddle.v2.reader.creator.np_array(x)

Creates a reader that yields the elements of x if it is a numpy vector, the rows of x if it is a numpy matrix, or, in general, the sub-hyperplanes indexed along the highest dimension.

Parameters: x – the numpy array to create the reader from.
Returns: data reader created from x.
paddle.v2.reader.creator.text_file(path)

Creates a data reader that outputs the given text file line by line. The trailing newline ('\n') of each line is removed.

Parameters: path – path of the text file.
Returns: data reader of the text file.
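
A plain-Python sketch of the behavior, shown with a temporary file (an illustration, not the actual implementation):

```python
import os
import tempfile

def text_file(path):
    # Yield each line of the file with its trailing newline removed.
    def reader():
        with open(path) as f:
            for line in f:
                yield line.rstrip('\n')
    return reader

# hypothetical usage with a temporary file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('hello\nworld\n')
    tmp = f.name
print(list(text_file(tmp)()))  # ['hello', 'world']
os.unlink(tmp)
```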

minibatch

paddle.v2.minibatch.batch(reader, batch_size)

Creates a batched reader.

Parameters:
  • reader (callable) – the data reader to read from.
  • batch_size (int) – size of each mini-batch
Returns:

the batched reader.

Return type:

callable
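
A plain-Python sketch of the batching semantics (not the actual implementation):

```python
def batch(reader, batch_size):
    # Group consecutive samples into lists of batch_size.
    def batch_reader():
        b = []
        for sample in reader():
            b.append(sample)
            if len(b) == batch_size:
                yield b
                b = []
        if b:  # a final, smaller batch
            yield b
    return batch_reader

print(list(batch(lambda: iter(range(5)), 2)()))  # [[0, 1], [2, 3], [4]]
```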

Dataset

Dataset package.

mnist

MNIST dataset.

This module will download dataset from http://yann.lecun.com/exdb/mnist/ and parse train set and test set into paddle reader creators.

paddle.v2.dataset.mnist.train()

MNIST train set creator.

It returns a reader creator; each sample in the reader is the image pixels in [0, 1] and the label in [0, 9].

Returns: Train reader creator.
Return type: callable
paddle.v2.dataset.mnist.test()

MNIST test set creator.

It returns a reader creator; each sample in the reader is the image pixels in [0, 1] and the label in [0, 9].

Returns: Test reader creator.
Return type: callable

cifar

CIFAR dataset: https://www.cs.toronto.edu/~kriz/cifar.html

TODO(yuyang18): Complete the comments.

conll05

imdb

IMDB dataset: http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz

TODO(yuyang18): Complete comments.

imikolov

imikolov’s simple dataset: http://www.fit.vutbr.cz/~imikolov/rnnlm/

TODO(yuyang18): Complete comments.

movielens

Movielens 1-M dataset.

TODO(yuyang18): Complete comments.

sentiment

This script fetches and preprocesses the movie_reviews data set provided by NLTK.

TODO(yuyang18): Complete dataset.

paddle.v2.dataset.sentiment.get_word_dict()

Sorts the words by the frequency with which they occur in the samples.

Returns: words_freq_sorted
paddle.v2.dataset.sentiment.train()

Default train set reader creator

paddle.v2.dataset.sentiment.test()

Default test set reader creator

uci_housing

UCI Housing dataset.

TODO(yuyang18): Complete comments.

wmt14

WMT14 dataset.

TODO(yuyang18): Complete comments.