Data Reader Interface

DataTypes

paddle.v2.data_type.dense_array(dim, seq_type=0)

Dense array. The input feature is a dense array of floats. For example, if the input is an image of 28*28 pixels, the input to a Paddle neural network can be a dense vector of dimension 784 or a numpy array with shape (28, 28).

For the 2-D convolution operation, PaddlePaddle currently requires every sample in a mini-batch to have the same size, but the size may vary from one mini-batch to the next. For such variable-dimension features, the dim parameter is not used; the data reader must yield numpy arrays, and the data feeder sets the data shape accordingly.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of input.
Returns:

An input type object.

Return type:

InputType
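To make the 784-dimensional example concrete, the sketch below flattens a 28x28 image, stored here as a hypothetical list of rows, into one dense vector. It uses plain Python lists instead of numpy arrays and is an illustration only; the names WIDTH, HEIGHT, image and dense are not part of paddle.v2.

```python
WIDTH, HEIGHT = 28, 28

# Hypothetical image: pixel (row, col) holds the value row * WIDTH + col,
# so the flattened order is easy to inspect.
image = [[float(row * WIDTH + col) for col in range(WIDTH)]
         for row in range(HEIGHT)]

# Row-major flattening gives a dense vector with 28 * 28 = 784 entries.
dense = [pixel for row in image for pixel in row]
```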

paddle.v2.data_type.dense_vector(dim, seq_type=0)

Dense array. The input feature is a dense array of floats. For example, if the input is an image of 28*28 pixels, the input to a Paddle neural network can be a dense vector of dimension 784 or a numpy array with shape (28, 28).

For the 2-D convolution operation, PaddlePaddle currently requires every sample in a mini-batch to have the same size, but the size may vary from one mini-batch to the next. For such variable-dimension features, the dim parameter is not used; the data reader must yield numpy arrays, and the data feeder sets the data shape accordingly.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.dense_vector_sequence(dim)

Data type for a sequence of dense vectors.

Parameters: dim (int) – dimension of the dense vector.
Returns: An input type object.
Return type: InputType

paddle.v2.data_type.integer_value(value_range, seq_type=0)

Data type of integer.

Parameters:
  • seq_type (int) – sequence type of this input.
  • value_range (int) – range of this integer.
Returns:

An input type object

Return type:

InputType

paddle.v2.data_type.integer_value_sequence(value_range)

Data type for a sequence of integers.

Parameters: value_range (int) – range of each element.

paddle.v2.data_type.sparse_binary_vector(dim, seq_type=0)

Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.sparse_binary_vector_sequence(dim)

Data type for a sequence of sparse vectors in which every element is either zero or one.

Parameters: dim (int) – dimension of the sparse vector.
Returns: An input type object.
Return type: InputType

paddle.v2.data_type.sparse_float_vector(dim, seq_type=0)

Sparse float vector. The input feature is a sparse vector: most of its elements are zero, and the others can be any float value.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType
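As an illustration of the two sparse types, the sketch below converts a dense list into a compact form. It assumes that a sparse binary sample is written as the list of indices of its nonzero entries and that a sparse float sample is written as a list of (index, value) pairs; this layout is an assumption of the sketch, and the helper names are hypothetical, not part of paddle.v2.

```python
def to_sparse_binary(dense):
    """Return the indices of the nonzero entries of a 0/1 vector."""
    return [i for i, v in enumerate(dense) if v != 0]


def to_sparse_float(dense):
    """Return (index, value) pairs for the nonzero entries of a vector."""
    return [(i, v) for i, v in enumerate(dense) if v != 0.0]
```

For example, the binary vector [0, 1, 0, 0, 1] would be written as [1, 4] under this assumption.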

paddle.v2.data_type.sparse_float_vector_sequence(dim)

Data type for a sequence of sparse vectors in which most elements are zero and the others can be any float value.

Parameters: dim (int) – dimension of the sparse vector.
Returns: An input type object.
Return type: InputType

paddle.v2.data_type.sparse_non_value_slot(dim, seq_type=0)

Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

paddle.v2.data_type.sparse_value_slot(dim, seq_type=0)

Sparse float vector. The input feature is a sparse vector: most of its elements are zero, and the others can be any float value.

Parameters:
  • dim (int) – dimension of this vector.
  • seq_type (int) – sequence type of this input.
Returns:

An input type object.

Return type:

InputType

class paddle.v2.data_type.InputType(dim, seq_type, tp)

InputType is the base class for paddle input types.

Note

This is a base class and should never be used directly by users.

Parameters:
  • dim (int) – dimension of the input. If the input is an integer, this is the value range; otherwise, it is the size of the layer.
  • seq_type (int) – sequence type of the input. 0 means it is not a sequence, 1 means it is a variable-length sequence, and 2 means it is a nested sequence.
  • tp (int) – data type of the input.
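As a rough illustration of the three seq_type values, the sketch below shows what one sample of an integer input might look like at each nesting level, using plain Python lists. The sample values and the helper nesting_depth are hypothetical and only serve the illustration; they are not part of paddle.v2.

```python
# Hypothetical samples for an integer input (illustration only).
no_sequence = 3                   # seq_type 0: a single value
sequence = [3, 1, 4]              # seq_type 1: a variable-length sequence
nested_sequence = [[3, 1], [4]]   # seq_type 2: a sequence of sequences


def nesting_depth(sample):
    """Count how many list levels wrap the values in `sample`."""
    depth = 0
    while isinstance(sample, list):
        depth += 1
        if not sample:
            break
        sample = sample[0]
    return depth
```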

DataFeeder

Reader

At training and testing time, PaddlePaddle programs need to read data. To make data-reading code easier to write, we define the following terms:

  • A reader is a function that reads data (from a file, the network, a random number generator, etc.) and yields data items.
  • A reader creator is a function that returns a reader function.
  • A reader decorator is a function that accepts one or more readers and returns a reader.
  • A batch reader is a function that reads data (from a reader, a file, the network, a random number generator, etc.) and yields a batch of data items.
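The first three terms above can be sketched in plain Python as follows. All names here (random_vector_reader_creator, scaled) are hypothetical and chosen for illustration; the sketch uses the standard random module instead of numpy.

```python
import random


def random_vector_reader_creator(dim, n_samples):
    """Reader creator: returns a reader function."""
    def reader():
        # Reader: takes no arguments and yields one data item at a time.
        for _ in range(n_samples):
            yield [random.uniform(-1, 1) for _ in range(dim)]
    return reader


def scaled(reader, factor):
    """Reader decorator: accepts a reader and returns a new reader."""
    def new_reader():
        for sample in reader():
            yield [factor * x for x in sample]
    return new_reader
```

A reader is consumed by calling it and iterating over the result, for example `for sample in scaled(random_vector_reader_creator(4, 5), 2.0)(): ...`.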

Data Reader Interface

In fact, a data reader does not have to be a function that reads and yields data items. It can be any function with no parameters that creates an iterable (anything that can be used in for x in iterable):

iterable = data_reader()

Each element produced by the iterable should be a single entry of data, not a mini-batch. An entry can be a single item or a tuple of items. Each item should be of a supported type (e.g., a numpy 1-D array of float32, an int, or a list of ints).

An example implementation of a single-item data reader creator:

import numpy

def reader_creator_random_image(width, height):
    def reader():
        # Yield an endless stream of flattened random images.
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader

An example implementation of a multiple-item data reader creator:

import numpy

def reader_creator_random_image_and_label(width, height, label):
    def reader():
        # Yield an endless stream of (flattened random image, label) tuples.
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height), label
    return reader


paddle.v2.reader.map_readers(func, *readers)

Creates a data reader whose output is the return value of func, called with the outputs of the input readers as its arguments.

Parameters:
  • func (callable) – function to use. The type of func should be (Sample) => Sample.
  • readers – readers whose outputs will be used as arguments of func.
Returns:

the created data reader.

Return type:

callable
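The behavior described above can be sketched in plain Python as follows. This is a minimal illustration under the assumption that func receives one argument per input reader; map_readers_sketch is a hypothetical name, not the library function.

```python
def map_readers_sketch(func, *readers):
    """Return a reader that applies func across the readers' outputs."""
    def reader():
        iterators = [r() for r in readers]
        # map() pulls one entry from every iterator per call to func
        # and stops as soon as the shortest iterator is exhausted.
        for result in map(func, *iterators):
            yield result
    return reader
```

For instance, mapping `lambda x, y: x + y` over two readers that yield 1, 2, 3 and 10, 20, 30 produces 11, 22, 33 in this sketch.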

paddle.v2.reader.buffered(reader, size)

Creates a buffered data reader.

The buffered data reader will read and save data entries into a buffer. Reading from the buffered data reader will proceed as long as the buffer is not empty.

Parameters:
  • reader (callable) – the data reader to read from.
  • size (int) – max buffer size.
Returns:

the buffered data reader.
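One way to picture a buffered reader is a background thread that fills a bounded queue while the consumer drains it. The sketch below follows that idea with the standard queue and threading modules; it is an illustration with a hypothetical name (buffered_sketch), and the library's own implementation may differ.

```python
import queue
import threading


def buffered_sketch(reader, size):
    """Return a reader that prefetches up to `size` entries."""
    end = object()  # sentinel that marks the end of the stream

    def fill(q):
        for item in reader():
            q.put(item)  # blocks while the buffer is full
        q.put(end)

    def buffered_reader():
        q = queue.Queue(maxsize=size)
        worker = threading.Thread(target=fill, args=(q,))
        worker.daemon = True
        worker.start()
        while True:
            item = q.get()  # blocks while the buffer is empty
            if item is end:
                break
            yield item

    return buffered_reader
```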

paddle.v2.reader.compose(*readers, **kwargs)

Creates a data reader whose output is the combination of input readers.

If the input readers output the following data entries: (1, 2), 3, (4, 5), the composed reader will output: (1, 2, 3, 4, 5).

Parameters:
  • readers – readers that will be composed together.
  • check_alignment (bool) – if True, will check if input readers are aligned correctly. If False, will not check alignment and trailing outputs will be discarded. Defaults to True.
Returns:

the new data reader.

Raises:

ComposeNotAligned – outputs of readers are not aligned. Will not raise when check_alignment is set to False.
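The flattening and alignment check described above can be sketched as follows. This is an illustration only: compose_sketch and ComposeNotAlignedSketch are hypothetical stand-ins, and check_alignment is written as a keyword-only argument here for brevity.

```python
import itertools

_MISSING = object()  # marks a reader that ran out of entries early


class ComposeNotAlignedSketch(ValueError):
    """Raised by this sketch when the readers have different lengths."""


def compose_sketch(*readers, check_alignment=True):
    def as_tuple(entry):
        return entry if isinstance(entry, tuple) else (entry,)

    def reader():
        iterators = [r() for r in readers]
        if check_alignment:
            rows = itertools.zip_longest(*iterators, fillvalue=_MISSING)
        else:
            rows = zip(*iterators)  # stops at the shortest reader
        for outputs in rows:
            if any(o is _MISSING for o in outputs):
                raise ComposeNotAlignedSketch("readers are not aligned")
            # Concatenate the per-reader tuples into one flat tuple.
            yield sum((as_tuple(o) for o in outputs), ())

    return reader
```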

paddle.v2.reader.chain(*readers)

Creates a data reader whose output is the outputs of input data readers chained together.

If the input readers output the following data entries: [0, 0, 0], [1, 1, 1], [2, 2, 2], the chained reader will output: [0, 0, 0, 1, 1, 1, 2, 2, 2].

Parameters: readers – input readers.
Returns: the new data reader.
Return type: callable
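Chaining readers amounts to exhausting them one after another. A minimal sketch with itertools, using the hypothetical name chain_sketch:

```python
import itertools


def chain_sketch(*readers):
    """Return a reader that yields all entries of each reader in turn."""
    def reader():
        # Each input reader is only called when the previous one is done.
        return itertools.chain.from_iterable(r() for r in readers)
    return reader
```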
paddle.v2.reader.shuffle(reader, buf_size)

Creates a data reader whose data output is shuffled.

The output of the iterator created by the original reader is buffered into a shuffle buffer and then shuffled. The size of the shuffle buffer is determined by the argument buf_size.

Parameters:
  • reader (callable) – the original reader whose output will be shuffled.
  • buf_size (int) – shuffle buffer size.
Returns:

the new reader whose output is shuffled.

Return type:

callable
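A buffer-based shuffle can be sketched as below: entries are collected until the buffer holds buf_size of them, the buffer is shuffled and emitted, and the process repeats. Note that entries are only shuffled within each buffer, so a small buf_size gives a weak shuffle. The name shuffle_sketch is hypothetical, and flushing the final partial buffer is a choice of this sketch.

```python
import random


def shuffle_sketch(reader, buf_size):
    """Return a reader that shuffles entries within windows of buf_size."""
    def shuffled_reader():
        buf = []
        for entry in reader():
            buf.append(entry)
            if len(buf) >= buf_size:
                random.shuffle(buf)
                for item in buf:
                    yield item
                buf = []
        # Emit whatever is left in the last, possibly partial, buffer.
        if buf:
            random.shuffle(buf)
            for item in buf:
                yield item
    return shuffled_reader
```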

paddle.v2.reader.firstn(reader, n)

Limits the maximum number of samples that the reader can return.

Parameters:
  • reader (callable) – the data reader to read from.
  • n (int) – the maximum number of samples to return.
Returns:

the decorated reader.

Return type:

callable
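Truncating a reader is a thin wrapper around itertools.islice. A minimal sketch, using the hypothetical name firstn_sketch:

```python
import itertools


def firstn_sketch(reader, n):
    """Return a reader that yields at most n entries of `reader`."""
    def firstn_reader():
        return itertools.islice(reader(), n)
    return firstn_reader
```

This also makes an endless reader, such as the random image readers shown earlier, safe to iterate over in full.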

paddle.v2.reader.xmap_readers(mapper, reader, process_num, buffer_size, order=False)

Uses multiple processes to map samples from the reader with a user-defined mapper. This function also includes a buffered decorator.

Parameters:
  • mapper (callable) – a function to map a sample.
  • reader (callable) – the data reader to read from.
  • process_num (int) – number of processes used to handle the original samples.
  • buffer_size (int) – max buffer size.
  • order (bool) – whether to keep the order of the reader.
Returns:

the decorated reader.

Return type:

callable
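The idea of mapping samples with several workers can be sketched with the standard library. Note the substitutions: this sketch uses a thread pool from concurrent.futures rather than separate processes, always preserves the input order, and has no bounded buffer (the pool submits every sample up front, so it only suits finite readers). The names xmap_readers_sketch and worker_num are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def xmap_readers_sketch(mapper, reader, worker_num):
    """Return a reader that applies mapper to each sample in parallel."""
    def xreader():
        with ThreadPoolExecutor(max_workers=worker_num) as pool:
            # Executor.map yields results in the order of the inputs.
            for mapped in pool.map(mapper, reader()):
                yield mapped
    return xreader
```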

paddle.v2.reader.pipe_reader(left_cmd, parser, bufsize=8192, file_type='plain', cut_lines=True, line_break='\n')

pipe_reader reads data as a stream from a command: it takes the command's stdout into a pipe buffer, passes it to the parser for parsing, and then yields data in your desired format.

You can use a standard Linux command or call another program to read data from HDFS, Ceph, a URL, AWS S3, etc.:

cmd = "hadoop fs -cat /path/to/some/file"
cmd = "cat sample_file.tar.gz"
cmd = "curl http://someurl"
cmd = "python print_s3_bucket.py"

A sample parser:

def sample_parser(lines):
    # parse each line as one sample data,
    # return a list of samples as batches.
    ret = []
    for l in lines:
        ret.append(l.split(" ")[1:5])
    return ret

Parameters:
  • left_cmd (string) – command to execute to get stdout from.
  • parser (callable) – parser function to parse lines of data. If cut_lines is True, parser receives a list of lines. If cut_lines is False, parser receives a raw buffer each time. parser should return a list of parsed values.
  • bufsize (int) – the buffer size used for the stdout pipe.
  • file_type (string) – stream buffer data type; can be plain or gzip.
  • cut_lines (bool) – whether to pass lines instead of a raw buffer to the parser.
  • line_break (string) – line break of the file, like \n or \r.
Returns:

the reader generator.

Return type:

callable
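The interaction between bufsize, cut_lines and line_break can be illustrated without running a command. The sketch below reads fixed-size chunks from any text stream, cuts them into complete lines, carries an unfinished line over to the next chunk, and hands the lines to the parser. It is an illustration under stated assumptions: iter_parsed is a hypothetical helper, the stream stands in for the command's stdout, gzip handling is omitted, and flushing a trailing line that lacks a line break is a choice of this sketch.

```python
def iter_parsed(stream, parser, bufsize=8192, cut_lines=True, line_break="\n"):
    """Yield parsed values from `stream`, read in chunks of `bufsize`."""
    remained = ""  # incomplete line carried over between chunks
    while True:
        buff = stream.read(bufsize)
        if not buff:
            break
        if cut_lines:
            lines = (remained + buff).split(line_break)
            remained = lines.pop()  # last piece may be an unfinished line
            if not lines:
                continue
            parsed = parser(lines)
        else:
            parsed = parser(buff)
        for value in parsed:
            yield value
    if cut_lines and remained:
        for value in parser([remained]):
            yield value
```

With a small bufsize, a line split across two chunks is still delivered to the parser as one line.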

The creator package contains some simple reader creators that can be used in user programs.

paddle.v2.reader.creator.np_array(x)

Creates a reader that yields the elements of x if it is a numpy vector, or the rows of x if it is a numpy matrix, or, in general, the sub-hyperplanes indexed by the highest dimension.

Parameters: x – the numpy array to create the reader from.
Returns: data reader created from x.

paddle.v2.reader.creator.text_file(path)

Creates a data reader that outputs text line by line from the given text file. The trailing newline ('\n') of each line is removed.

Parameters: path – path of the text file.
Returns: data reader of the text file.
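A reader creator of this kind can be sketched in a few lines of plain Python. The name text_file_sketch is hypothetical; the sketch opens the file each time the reader is called, so the reader can be iterated more than once.

```python
def text_file_sketch(path):
    """Return a reader that yields the lines of the file at `path`."""
    def reader():
        with open(path, "r") as f:
            for line in f:
                # Drop only the trailing newline, keep other whitespace.
                yield line.rstrip("\n")
    return reader
```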
paddle.v2.reader.creator.cloud_reader(paths, etcd_endpoints, timeout_sec=5, buf_size=64)

Creates a data reader that yields records one by one from the given paths.

Parameters:
  • paths – path of recordio files; can be a string or a list of strings.
  • etcd_endpoints – the endpoints of the etcd cluster.
Returns:

data reader of recordio files.

minibatch

paddle.v2.minibatch.batch(reader, batch_size)

Create a batched reader.

Parameters:
  • reader (callable) – the data reader to read from.
  • batch_size (int) – size of each mini-batch
Returns:

the batched reader.

Return type:

callable
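Batching groups consecutive entries of a reader into lists of batch_size entries. A minimal sketch, using the hypothetical name batch_sketch; yielding the final, smaller batch is a choice of this sketch and may not match the library's behavior.

```python
def batch_sketch(reader, batch_size):
    """Return a batch reader that yields lists of up to batch_size entries."""
    def batch_reader():
        batch = []
        for entry in reader():
            batch.append(entry)
            if len(batch) == batch_size:
                yield batch
                batch = []
        # Emit the remaining entries as a final, smaller batch.
        if batch:
            yield batch
    return batch_reader
```

For a reader that yields 0 through 4 and batch_size=2, the sketch yields [0, 1], [2, 3] and [4].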