Datasets¶

DataTypes¶

paddle.v2.data_type.dense_vector(dim, seq_type=0)

Dense vector. The input feature is a dense float vector. For example, if the input is an image with 28*28 pixels, the input to the Paddle neural network should be a dense vector with dimension 784.

Parameters:	dim (int) – dimension of this vector. seq_type (int) – sequence type of input.
Returns:	An input type object.
Return type:	InputType
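
As a small illustration of the 28*28 example above, the following sketch flattens a 2-D image into the 784-element list that a dense_vector(784) input would describe. The helper name flatten_image is hypothetical and the row-major order is an assumption; the sketch does not call Paddle.

```python
def flatten_image(image):
    """Flatten a 2-D list of pixel rows into one flat list of floats (row-major)."""
    return [float(pixel) for row in image for pixel in row]


# A dummy 28x28 image whose pixels are all zero (illustrative data only).
image = [[0.0] * 28 for _ in range(28)]
vector = flatten_image(image)
```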

paddle.v2.data_type.dense_vector_sequence(dim)

Data type of a sequence of dense vectors.

Parameters:	dim (int) – dimension of each dense vector.
Returns:	An input type object.
Return type:	InputType

paddle.v2.data_type.integer_value(value_range, seq_type=0)

Data type of an integer.

Parameters:	value_range (int) – range of this integer. seq_type (int) – sequence type of this input.
Returns:	An input type object.
Return type:	InputType

paddle.v2.data_type.integer_value_sequence(value_range)

Data type of a sequence of integers.

Parameters:	value_range (int) – range of each element.

paddle.v2.data_type.sparse_binary_vector(dim, seq_type=0)

Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one.

Parameters:	dim (int) – dimension of this vector. seq_type (int) – sequence type of this input.
Returns:	An input type object.
Return type:	InputType

paddle.v2.data_type.sparse_binary_vector_sequence(dim)

Data type of a sequence of sparse vectors in which every element is either zero or one.

Parameters:	dim (int) – dimension of each sparse vector.
Returns:	An input type object.
Return type:	InputType

paddle.v2.data_type.sparse_non_value_slot(dim, seq_type=0)

Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one.

Parameters:	dim (int) – dimension of this vector. seq_type (int) – sequence type of this input.
Returns:	An input type object.
Return type:	InputType

paddle.v2.data_type.sparse_value_slot(dim, seq_type=0)

Sparse vector. The input feature is a sparse vector: most of its elements are zero, and the others can be any float value.

Parameters:	dim (int) – dimension of this vector. seq_type (int) – sequence type of this input.
Returns:	An input type object.
Return type:	InputType

paddle.v2.data_type.sparse_vector(dim, seq_type=0)

Sparse vector. The input feature is a sparse vector: most of its elements are zero, and the others can be any float value.

Parameters:	dim (int) – dimension of this vector. seq_type (int) – sequence type of this input.
Returns:	An input type object.
Return type:	InputType
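
To make the difference between the binary and the float sparse types concrete, the sketch below derives two compact encodings from a dense list: the indices of the non-zero entries, and (index, value) pairs. Treating these as the sample layouts expected by sparse_binary_vector and sparse_vector is an assumption here, and both helper names are hypothetical.

```python
def to_sparse_binary(dense):
    """Return the indices of the non-zero entries (assumed binary-sparse layout)."""
    return [i for i, v in enumerate(dense) if v != 0]


def to_sparse_float(dense):
    """Return (index, value) pairs for the non-zero entries (assumed float-sparse layout)."""
    return [(i, float(v)) for i, v in enumerate(dense) if v != 0]


# Illustrative dense feature with two non-zero entries.
dense = [0, 0, 1.5, 0, 2.0]
binary_form = to_sparse_binary(dense)
float_form = to_sparse_float(dense)
```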

paddle.v2.data_type.sparse_vector_sequence(dim)

Data type of a sequence of sparse vectors in which most elements are zero and the others can be any float value.

Parameters:	dim (int) – dimension of each sparse vector.
Returns:	An input type object.
Return type:	InputType

class paddle.v2.data_type.InputType(dim, seq_type, tp)

InputType is the base class for paddle input types.

Note

This is a base class and should never be used directly by users.

Parameters:	dim (int) – dimension of input. If the input is an integer, it means the value range. Otherwise, it means the size of the layer. seq_type (int) – sequence type of input. 0 means it is not a sequence. 1 means it is a variable-length sequence. 2 means it is a nested sequence. tp (int) – data type of input.

DataFeeder¶

class paddle.v2.data_feeder.DataFeeder(data_types, feeding=None)

DataFeeder converts the data returned by paddle.reader into the Arguments data structure defined in the API. A paddle.reader usually returns a mini-batch as a list of data entries. Each entry in the list is one sample, and each sample is a list or tuple of one or more features. DataFeeder converts these mini-batch data entries into Arguments in order to feed them to the C++ interface.

A simple usage example:

feeding = ['image', 'label']
data_types = enumerate_data_types_of_data_layers(topology)
feeder = DataFeeder(data_types=data_types, feeding=feeding)

minibatch_data = [([1.0, 2.0, 3.0, ...], 5)]

arg = feeder(minibatch_data)

If the mini-batch data and the data layers do not map one to one, a dictionary can be passed as the feeding parameter to describe the mapping.

data_types = [('image', paddle.data_type.dense_vector(784)),
              ('label', paddle.data_type.integer_value(10))]
feeding = {'image':0, 'label':1}
feeder = DataFeeder(data_types=data_types, feeding=feeding)
minibatch_data = [
                   ( [1.0,2.0,3.0,4.0], 5, [6,7,8] ),  # first sample
                   ( [1.0,2.0,3.0,4.0], 5, [6,7,8] )   # second sample
                 ]
# or minibatch_data = [
#                       [ [1.0,2.0,3.0,4.0], 5, [6,7,8] ],  # first sample
#                       [ [1.0,2.0,3.0,4.0], 5, [6,7,8] ]   # second sample
#                     ]
arg = feeder(minibatch_data)
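
The effect of a feeding dictionary can be pictured with a few lines of plain Python. This is only a sketch of the position lookup, not DataFeeder's implementation: the helper pick_features is hypothetical, and it returns ordinary tuples rather than an Arguments object.

```python
def pick_features(minibatch, data_names, feeding):
    """For every sample, keep only the positions named in `feeding`, in data_names order."""
    return [tuple(sample[feeding[name]] for name in data_names)
            for sample in minibatch]


minibatch_data = [
    ([1.0, 2.0, 3.0, 4.0], 5, [6, 7, 8]),  # first sample
    ([1.0, 2.0, 3.0, 4.0], 5, [6, 7, 8]),  # second sample
]
# 'image' is read from position 0 and 'label' from position 1;
# the third item of each sample is not mapped to any data layer.
selected = pick_features(minibatch_data, ['image', 'label'],
                         {'image': 0, 'label': 1})
```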

Note

This module is for internal use only. Users should use the reader interface.

Parameters:	data_types (list) – A list specifying the name and type of each data item. Each item is a tuple of (data_name, data_type). feeding (dict|collections.Sequence|None) – A dictionary or a sequence specifying the position of each data item in the input data.

convert(dat, argument=None)

Parameters:	dat (list) – A list of mini-batch data. Each sample is a list or tuple of one or more features. argument (py_paddle.swig_paddle.Arguments) – An Arguments object that contains this mini-batch data with one or more features. The Arguments definition is in the API.

Reader¶

At training and testing time, PaddlePaddle programs need to read data. To make writing data-reading code easier, we define the following terms:

A reader is a function that reads data (from file, network, random number generator, etc) and yields data items.
A reader creator is a function that returns a reader function.
A reader decorator is a function, which accepts one or more readers, and returns a reader.
A batch reader is a function that reads data (from reader, file, network, random number generator, etc) and yields a batch of data items.
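
The first three definitions above can be sketched in plain Python as follows. The names counter_creator and doubled are hypothetical and chosen only for illustration; nothing here depends on Paddle.

```python
def counter_creator(n):
    """Reader creator: returns a reader that yields 0, 1, ..., n - 1."""
    def reader():
        for i in range(n):
            yield i
    return reader


def doubled(reader):
    """Reader decorator: accepts a reader and returns a new reader."""
    def new_reader():
        for item in reader():
            yield item * 2
    return new_reader


reader = counter_creator(3)       # a reader (a function with no parameters)
items = list(doubled(reader)())   # call the decorated reader to get an iterable
```

Because a reader is a function rather than an iterator, it can be called again to start a fresh pass over the data.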

Data Reader Interface¶

In fact, a data reader does not have to be a function that reads and yields data items. It can be any function with no parameters that creates an iterable (anything that can be used in for x in iterable):

iterable = data_reader()

Each element produced by the iterable should be a single entry of data, not a mini-batch. An entry can be a single item or a tuple of items. Each item should be of a supported type (e.g., a numpy 1-D array of float32, an int, or a list of ints).

An example implementation for single item data reader creator:

import numpy

def reader_creator_random_image(width, height):
    def reader():
        while True:
            # Yield one flattened random image with values in [-1, 1).
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader

An example implementation for multiple item data reader creator:

import numpy

def reader_creator_random_image_and_label(width, height, label):
    def reader():
        while True:
            # Yield a (flattened random image, label) tuple.
            yield numpy.random.uniform(-1, 1, size=width*height), label
    return reader

TODO(yuyang18): Should we add whole design doc here?

paddle.v2.reader.map_readers(func, *readers)

Creates a data reader that outputs the return value of func, called with the outputs of the given data readers as its arguments.

Parameters:	func (callable) – function to use. The type of func should be (Sample) => Sample. readers – readers whose outputs will be used as arguments of func.
Returns:	the created data reader.
Return type:	callable
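
A minimal pure-Python sketch of this behaviour, assuming func receives one output from each reader per step. It is an illustrative re-implementation rather than the library's source, and the readers numbers and tens are hypothetical.

```python
def map_readers(func, *readers):
    """Sketch: apply func to the aligned outputs of several readers."""
    def reader():
        iterators = [r() for r in readers]
        # map() stops as soon as the shortest iterator is exhausted.
        for result in map(func, *iterators):
            yield result
    return reader


def numbers():
    yield from [1, 2, 3]


def tens():
    yield from [10, 20, 30]


summed = map_readers(lambda a, b: a + b, numbers, tens)
```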

paddle.v2.reader.buffered(reader, size)

Creates a buffered data reader.

The buffered data reader will read and save data entries into a buffer. Reading from the buffered data reader will proceed as long as the buffer is not empty.

Parameters:	reader (callable) – the data reader to read from. size (int) – max buffer size.
Returns:	the buffered data reader.
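
One plausible way to build such a reader is a background thread that fills a bounded queue. The sketch below assumes that design; it is not taken from the library, and the small_reader helper is hypothetical.

```python
import queue
import threading


def buffered(reader, size):
    """Sketch: prefetch up to `size` entries from reader in a background thread."""
    end = object()  # sentinel that marks the end of the data

    def fill(iterator, q):
        for entry in iterator:
            q.put(entry)  # blocks while the buffer is full
        q.put(end)

    def data_reader():
        q = queue.Queue(maxsize=size)
        worker = threading.Thread(target=fill, args=(reader(), q))
        worker.daemon = True
        worker.start()
        entry = q.get()
        while entry is not end:
            yield entry
            entry = q.get()

    return data_reader


def small_reader():
    for i in range(5):
        yield i
```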

paddle.v2.reader.compose(*readers, **kwargs)

Creates a data reader whose output is the combination of input readers.

If the input readers output the data entries (1, 2), 3, and (4, 5), the composed reader will output (1, 2, 3, 4, 5).

Parameters:	readers – readers that will be composed together. check_alignment (bool) – if True, checks whether the input readers are aligned correctly. If False, alignment is not checked and trailing outputs are discarded. Defaults to True.
Returns:	the new data reader.
Raises:	ComposeNotAligned – outputs of readers are not aligned. Not raised when check_alignment is set to False.
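
The flattening and the alignment check can be sketched in plain Python as below. This is an illustrative re-implementation under assumptions: the exception class is redefined locally, check_alignment is written as a keyword-only argument instead of **kwargs, and "aligned" is taken to mean that all readers yield the same number of entries.

```python
from itertools import zip_longest


class ComposeNotAligned(ValueError):
    """Raised when the input readers yield different numbers of entries."""


_MISSING = object()  # fill value that marks an exhausted reader


def compose(*readers, check_alignment=True):
    def as_tuple(entry):
        return entry if isinstance(entry, tuple) else (entry,)

    def reader():
        iterators = [r() for r in readers]
        if check_alignment:
            rows = zip_longest(*iterators, fillvalue=_MISSING)
        else:
            rows = zip(*iterators)  # silently drops trailing outputs
        for row in rows:
            if any(entry is _MISSING for entry in row):
                raise ComposeNotAligned("readers have different lengths")
            # Concatenate the per-reader tuples into one flat tuple.
            yield sum((as_tuple(entry) for entry in row), ())

    return reader
```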

paddle.v2.reader.chain(*readers)

Creates a data reader whose output is the outputs of input data readers chained together.

If three input readers output the data entries [0, 0, 0], [1, 1, 1], and [2, 2, 2] respectively, the chained reader will output [0, 0, 0, 1, 1, 1, 2, 2, 2].

Parameters:	readers – input readers.
Returns:	the new data reader.
Return type:	callable
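
The same behaviour can be sketched with itertools.chain from the Python standard library. This is an illustrative re-implementation, and const_reader is a hypothetical helper used only to produce the example data.

```python
import itertools


def chain(*readers):
    """Sketch: read each input reader to exhaustion, one after another."""
    def reader():
        return itertools.chain.from_iterable(r() for r in readers)
    return reader


def const_reader(value, count=3):
    """Hypothetical helper: a reader that yields `value` `count` times."""
    def reader():
        for _ in range(count):
            yield value
    return reader


chained = chain(const_reader(0), const_reader(1), const_reader(2))
```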

paddle.v2.reader.shuffle(reader, buf_size)

Creates a data reader whose data output is shuffled.

Output from the iterator created by the original reader is buffered into a shuffle buffer and then shuffled. The size of the shuffle buffer is determined by the argument buf_size.

Parameters:	reader (callable) – the original reader whose output will be shuffled. buf_size (int) – shuffle buffer size.
Returns:	the new reader whose output is shuffled.
Return type:	callable
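
A minimal sketch of buffer-based shuffling, assuming the buffer is filled, shuffled, and emitted in full before the next buffer is read. Under that assumption, entries only move within their own buffer, so a small buf_size gives only a local shuffle. The function below is an illustrative re-implementation, not the library's source.

```python
import random


def shuffle(reader, buf_size):
    """Sketch: shuffle the reader's output one buffer at a time."""
    def data_reader():
        buf = []
        for entry in reader():
            buf.append(entry)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Emit whatever is left in a final, possibly smaller, buffer.
        if buf:
            random.shuffle(buf)
            for item in buf:
                yield item
    return data_reader


def seven():
    for i in range(7):
        yield i
```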

paddle.v2.reader.firstn(reader, n)

Limits the maximum number of samples that the reader can return.

Parameters:	reader (callable) – the data reader to read from. n (int) – the maximum number of samples to return.
Returns:	the decorated reader.
Return type:	callable
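
This decorator is useful for cutting an endless reader down to a finite one. A minimal sketch using itertools.islice follows; it is an illustrative re-implementation, and the endless reader is hypothetical.

```python
import itertools


def firstn(reader, n):
    """Sketch: return a reader that stops after at most n samples."""
    def firstn_reader():
        return itertools.islice(reader(), n)
    return firstn_reader


def endless():
    """Hypothetical reader that never stops on its own."""
    i = 0
    while True:
        yield i
        i += 1


limited = firstn(endless, 3)
```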

The creator package contains some simple reader creators that can be used in user programs.

paddle.v2.reader.creator.np_array(x)

Creates a reader that yields the elements of x if it is a numpy vector, the rows of x if it is a numpy matrix, or, more generally, the sub-hyperplanes indexed by the highest dimension.

Parameters:	x – the numpy array to create the reader from.
Returns:	data reader created from x.

paddle.v2.reader.creator.text_file(path)

Creates a data reader that outputs text line by line from given text file. Trailing new line (‘\n’) of each line will be removed.

Parameters:	path – path of the text file.
Returns:	data reader of the text file.
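
A minimal sketch of such a reader creator in plain Python, assuming only the trailing newline character is stripped from each line. It is an illustrative re-implementation rather than the library's source.

```python
def text_file(path):
    """Sketch: return a reader that yields the lines of a text file."""
    def reader():
        with open(path, "r") as f:
            for line in f:
                # Remove only the trailing newline, keeping other whitespace.
                yield line.rstrip("\n")
    return reader
```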

minibatch¶

paddle.v2.minibatch.batch(reader, batch_size)

Create a batched reader.

Parameters:	reader (callable) – the data reader to read from. batch_size (int) – size of each mini-batch.
Returns:	the batched reader.
Return type:	callable
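
Batching can be sketched as grouping consecutive samples into lists. The version below is an illustrative re-implementation, and emitting a final, smaller batch when the data does not divide evenly is an assumption rather than documented behaviour.

```python
def batch(reader, batch_size):
    """Sketch: group the samples of reader into lists of batch_size."""
    def batch_reader():
        current = []
        for sample in reader():
            current.append(sample)
            if len(current) == batch_size:
                yield current
                current = []
        # Assumption: a trailing partial batch is still emitted.
        if current:
            yield current
    return batch_reader


def five():
    for i in range(5):
        yield i
```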

Dataset¶

Dataset package.

mnist¶

MNIST dataset.

This module downloads the dataset from http://yann.lecun.com/exdb/mnist/ and parses the train set and test set into paddle reader creators.

paddle.v2.dataset.mnist.train()

MNIST train set creator.

It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns:	Train reader creator.
Return type:	callable

paddle.v2.dataset.mnist.test()

MNIST test set creator.

It returns a reader creator. Each sample in the reader consists of image pixels in [0, 1] and a label in [0, 9].

Returns:	Test reader creator.
Return type:	callable

cifar¶

CIFAR dataset: https://www.cs.toronto.edu/~kriz/cifar.html

TODO(yuyang18): Complete the comments.

conll05¶

imdb¶

IMDB dataset: http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz

TODO(yuyang18): Complete comments.

imikolov¶

imikolov’s simple dataset: http://www.fit.vutbr.cz/~imikolov/rnnlm/

TODO: Complete comments.

movielens¶

Movielens 1-M dataset.

TODO(yuyang18): Complete comments.

sentiment¶

This script fetches and preprocesses the movie_reviews data set provided by NLTK.

TODO(yuyang18): Complete dataset.

paddle.v2.dataset.sentiment.get_word_dict()

Sorts the words by their frequency of occurrence in the samples.

Returns:	words_freq_sorted

paddle.v2.dataset.sentiment.train()

Default train set reader creator.

paddle.v2.dataset.sentiment.test()

Default test set reader creator.

uci_housing¶

UCI Housing dataset.

TODO(yuyang18): Complete comments.

wmt14¶

WMT14 dataset.

TODO(yuyang18): Complete comments.