Data Reader Interface¶
DataTypes¶
- 
paddle.v2.data_type.dense_array(dim, seq_type=0)
- Dense array. The input feature is a dense array of floats. For example, if the input is an image with 28*28 pixels, the input to the Paddle neural network can be a dense vector of dimension 784 or a numpy array with shape (28, 28). - For the 2-D convolution operation, every sample in one mini-batch must currently have the same size, although the size may vary from one mini-batch to the next. For variable-dimension input, the parameter dim is not used; the data reader must yield numpy arrays, and the data feeder sets the data shape accordingly. - Parameters: - dim (int) – dimension of this vector.
- seq_type (int) – sequence type of input.
 - Returns: - An input type object. - Return type: - InputType 
- 
paddle.v2.data_type.dense_vector(dim, seq_type=0)
- Dense array. The input feature is a dense array of floats. For example, if the input is an image with 28*28 pixels, the input to the Paddle neural network can be a dense vector of dimension 784 or a numpy array with shape (28, 28). - For the 2-D convolution operation, every sample in one mini-batch must currently have the same size, although the size may vary from one mini-batch to the next. For variable-dimension input, the parameter dim is not used; the data reader must yield numpy arrays, and the data feeder sets the data shape accordingly. - Parameters: - dim (int) – dimension of this vector.
- seq_type (int) – sequence type of input.
 - Returns: - An input type object. - Return type: - InputType 
- 
paddle.v2.data_type.dense_vector_sequence(dim)
- Data type of a sequence of dense vectors. - Parameters: - dim (int) – dimension of each dense vector. - Returns: - An input type object - Return type: - InputType 
- 
paddle.v2.data_type.integer_value(value_range, seq_type=0)
- Data type of integer. - Parameters: - seq_type (int) – sequence type of this input.
- value_range (int) – range of this integer.
 - Returns: - An input type object - Return type: - InputType 
- 
paddle.v2.data_type.integer_value_sequence(value_range)
- Data type of a sequence of integers. - Parameters: - value_range (int) – range of each element. 
- 
paddle.v2.data_type.sparse_binary_vector(dim, seq_type=0)
- Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one. - Parameters: - dim (int) – dimension of this vector.
- seq_type (int) – sequence type of this input.
 - Returns: - An input type object. - Return type: - InputType 
- 
paddle.v2.data_type.sparse_binary_vector_sequence(dim)
- Data type of a sequence of sparse vectors in which every element is either zero or one.
 - Parameters: - dim (int) – dimension of sparse vector. - Returns: - An input type object - Return type: - InputType 
- 
paddle.v2.data_type.sparse_float_vector(dim, seq_type=0)
- Sparse vector. It means the input feature is a sparse vector. Most of the elements in this vector are zero, others could be any float value. - Parameters: - dim (int) – dimension of this vector.
- seq_type (int) – sequence type of this input.
 - Returns: - An input type object. - Return type: - InputType 
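The sparse types describe vectors that are mostly zero, so a sample normally lists only the non-zero positions. The sketch below assumes, and this page does not confirm it, that a sparse binary sample is a list of non-zero indices and a sparse float sample is a list of (index, value) pairs. The helper names are hypothetical and exist only to show what such samples would mean as dense vectors.

```python
def densify_binary(indices, dim):
    """Expand a list of non-zero indices into a dense 0/1 list of length dim."""
    dense = [0] * dim
    for i in indices:
        dense[i] = 1
    return dense


def densify_float(pairs, dim):
    """Expand (index, value) pairs into a dense float list of length dim."""
    dense = [0.0] * dim
    for i, value in pairs:
        dense[i] = value
    return dense


# Assumed sample layouts for dim = 6 (not confirmed by this page).
binary_sample = [1, 4]                 # would go with sparse_binary_vector(6)
float_sample = [(0, 0.5), (3, -2.0)]   # would go with sparse_float_vector(6)
```

Under these assumptions, a reader only has to yield the short index lists; the dense form is shown purely for illustration.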
- 
paddle.v2.data_type.sparse_float_vector_sequence(dim)
- Data type of a sequence of sparse vectors in which most elements are zero and the others can be any float value. - Parameters: - dim (int) – dimension of each sparse vector. - Returns: - An input type object - Return type: - InputType 
- 
paddle.v2.data_type.sparse_non_value_slot(dim, seq_type=0)
- Sparse binary vector. The input feature is a sparse vector in which every element is either zero or one. - Parameters: - dim (int) – dimension of this vector.
- seq_type (int) – sequence type of this input.
 - Returns: - An input type object. - Return type: - InputType 
- 
paddle.v2.data_type.sparse_value_slot(dim, seq_type=0)
- Sparse vector. It means the input feature is a sparse vector. Most of the elements in this vector are zero, others could be any float value. - Parameters: - dim (int) – dimension of this vector.
- seq_type (int) – sequence type of this input.
 - Returns: - An input type object. - Return type: - InputType 
- 
class paddle.v2.data_type.InputType(dim, seq_type, tp)
- InputType is the base class for paddle input types. - Note - this is a base class and should never be used directly by users. - Parameters: - dim (int) – dimension of input. If the input is an integer, it means the value range. Otherwise, it means the size of the layer.
- seq_type (int) – sequence type of input. 0 means it is not a sequence. 1 means it is a variable-length sequence. 2 means it is a nested sequence.
- tp (int) – data type of input.
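The three seq_type values can be pictured as levels of list nesting around a single feature. The sketch below assumes a dense input of dimension 3 and plain Python lists; the function name and the exact sample layout are illustrative assumptions, not part of the paddle API.

```python
def matches_layout(sample, dim, seq_type):
    """Check that `sample` is nested the way `seq_type` suggests (dense input)."""
    if not isinstance(sample, list):
        return False
    if seq_type == 0:
        # Not a sequence: one vector with `dim` float entries.
        return len(sample) == dim and all(isinstance(v, float) for v in sample)
    # Sequence (1) or nested sequence (2): peel off one level of nesting.
    return all(matches_layout(part, dim, seq_type - 1) for part in sample)


vector = [0.1, 0.2, 0.3]                                          # seq_type 0
sequence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]                     # seq_type 1
nested = [[[0.1, 0.2, 0.3]], [[0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]]  # seq_type 2
```

A variable-length sequence may hold any number of vectors, and a nested sequence may hold any number of such sequences, which is why only the innermost level is checked against dim.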
 
DataFeeder¶
Reader¶
At training and testing time, PaddlePaddle programs need to read data. To make writing data-reading code easier, we define the following terms:
- A reader is a function that reads data (from file, network, random number generator, etc) and yields data items.
- A reader creator is a function that returns a reader function.
- A reader decorator is a function, which accepts one or more readers, and returns a reader.
- A batch reader is a function that reads data (from reader, file, network, random number generator, etc) and yields a batch of data items.
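The four terms above can be sketched in plain Python without importing paddle. All function names below are hypothetical and chosen only to label each role.

```python
import random


def reader_creator_random_ints(count, seed=0):
    """Reader creator: a function that returns a reader."""
    def reader():
        # Reader: takes no arguments and yields one data item at a time.
        rng = random.Random(seed)
        for _ in range(count):
            yield rng.randint(0, 9)
    return reader


def scaled(reader, factor):
    """Reader decorator: accepts a reader and returns a new reader."""
    def new_reader():
        for item in reader():
            yield item * factor
    return new_reader


def batch_reader_of_ranges(total, batch_size):
    """Returns a batch reader: it yields lists of items, not single items."""
    def batch_reader():
        for start in range(0, total, batch_size):
            yield list(range(start, min(start + batch_size, total)))
    return batch_reader
```

Because every role is an ordinary function, readers can be combined freely: scaled(reader_creator_random_ints(5), 2) is itself a reader.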
Data Reader Interface¶
Strictly speaking, a data reader does not have to be a function that reads and yields data
items. It can be any function with no parameters that creates an iterable
(anything that can be used in for x in iterable):
iterable = data_reader()
Each element produced by the iterable should be a single entry of data, not a mini-batch. An entry can be a single item or a tuple of items. Each item should be of a supported type (e.g., a numpy 1-D array of float32, an int, or a list of ints).
An example implementation of a single-item data reader creator:
import numpy

def reader_creator_random_image(width, height):
    def reader():
        # Yield flattened random images forever; the consumer decides when to stop.
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height)
    return reader
An example implementation of a multiple-item data reader creator:
def reader_creator_random_image_and_label(width, height, label):
    def reader():
        # Each entry is a tuple: (flattened image, label).
        while True:
            yield numpy.random.uniform(-1, 1, size=width*height), label
    return reader
- 
paddle.v2.reader.map_readers(func, *readers)
- Creates a data reader that outputs the return value of func, called with the outputs of the given data readers as arguments. - Parameters: - func – function to use. The type of func should be (Sample) => Sample
- readers – readers whose outputs will be used as arguments of func.
 - Type: - callable - Returns: - the created data reader. - Return type: - callable 
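The behaviour described for map_readers can be approximated in a few lines. This is a minimal sketch of the semantics, not the library's implementation; the name map_readers_sketch is hypothetical.

```python
def map_readers_sketch(func, *readers):
    """Return a reader that yields func applied to one entry from each reader."""
    def reader():
        iterators = [r() for r in readers]
        # zip stops as soon as the shortest input reader is exhausted.
        for args in zip(*iterators):
            yield func(*args)
    return reader


def ones():
    return iter([1, 2, 3])


def tens():
    return iter([10, 20, 30])


summed = list(map_readers_sketch(lambda a, b: a + b, ones, tens)())
```

With a single input reader, func receives one sample at a time, which matches the (Sample) => Sample description.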
- 
paddle.v2.reader.buffered(reader, size)
- Creates a buffered data reader. - The buffered data reader will read and save data entries into a buffer. Reading from the buffered data reader will proceed as long as the buffer is not empty. - Parameters: - reader (callable) – the data reader to read from.
- size (int) – max buffer size.
 - Returns: - the buffered data reader. 
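One way to picture a buffered reader is a background thread that fills a bounded queue while the consumer drains it. The sketch below uses that design as an assumption; it is not taken from the paddle source, and buffered_sketch is a hypothetical name.

```python
import queue
import threading


def buffered_sketch(reader, size):
    """Return a reader that prefetches up to `size` entries in the background."""
    end = object()  # sentinel that marks the end of the data

    def new_reader():
        buffer = queue.Queue(maxsize=size)

        def fill():
            for entry in reader():
                buffer.put(entry)  # blocks while the buffer is full
            buffer.put(end)

        threading.Thread(target=fill, daemon=True).start()
        while True:
            entry = buffer.get()
            if entry is end:
                return
            yield entry

    return new_reader
```

The order of entries is unchanged; only the timing differs, because reading and consuming can overlap.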
- 
paddle.v2.reader.compose(*readers, **kwargs)
- Creates a data reader whose output is the combination of the outputs of the input readers. - If the input readers output the data entries (1, 2), 3, and (4, 5), the composed reader will output (1, 2, 3, 4, 5). - Parameters: - readers – readers that will be composed together.
- check_alignment (bool) – if True, will check if input readers are aligned correctly. If False, will not check alignment and trailing outputs will be discarded. Defaults to True.
 - Returns: - the new data reader. - Raises: - ComposeNotAligned – outputs of readers are not aligned. Will not raise when check_alignment is set to False. 
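The flattening and alignment rules for compose can be sketched as follows. The NotAligned class is a local stand-in for ComposeNotAligned so that the example runs on its own, and compose_sketch is a hypothetical name; the real implementation may differ in detail.

```python
class NotAligned(ValueError):
    """Stand-in for ComposeNotAligned; defined here so the sketch runs alone."""


def compose_sketch(*readers, check_alignment=True):
    def as_tuple(entry):
        return entry if isinstance(entry, tuple) else (entry,)

    def reader():
        missing = object()  # marks a reader that has run out of data
        iterators = [iter(r()) for r in readers]
        while True:
            entries = [next(it, missing) for it in iterators]
            exhausted = [entry is missing for entry in entries]
            if all(exhausted):
                return
            if any(exhausted):
                if check_alignment:
                    raise NotAligned("readers produced different numbers of entries")
                return  # discard the trailing outputs
            # Concatenate the per-reader tuples into one flat tuple.
            yield sum((as_tuple(entry) for entry in entries), ())

    return reader
```

Readers are aligned when they all produce the same number of entries; with check_alignment=False the composed reader simply stops at the shortest one.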
- 
paddle.v2.reader.chain(*readers)
- Creates a data reader whose output is the outputs of the input data readers chained together. - If the input readers output the data entries [0, 0, 0], [1, 1, 1], and [2, 2, 2], the chained reader will output [0, 0, 0, 1, 1, 1, 2, 2, 2]. - Parameters: - readers – input readers. - Returns: - the new data reader. - Return type: - callable 
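Chaining maps directly onto itertools.chain from the Python standard library. The sketch below shows the semantics under that assumption; chain_sketch is a hypothetical name.

```python
import itertools


def chain_sketch(*readers):
    """Return a reader that exhausts each input reader in turn."""
    def reader():
        # Each input reader is only started when the previous one is exhausted.
        return itertools.chain.from_iterable(r() for r in readers)
    return reader


zeros = lambda: iter([0, 0, 0])
ones = lambda: iter([1, 1, 1])
twos = lambda: iter([2, 2, 2])
chained = list(chain_sketch(zeros, ones, twos)())
```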
- 
paddle.v2.reader.shuffle(reader, buf_size)
- Creates a data reader whose output is shuffled. - Output from the iterator created by the original reader is buffered into a shuffle buffer and then shuffled. The size of the shuffle buffer is determined by the argument buf_size. - Parameters: - reader (callable) – the original reader whose output will be shuffled.
- buf_size (int) – shuffle buffer size.
 - Returns: - the new reader whose output is shuffled. - Return type: - callable 
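Buffer-based shuffling only permutes entries within one buffer, so buf_size controls how far an entry can move from its original position. A minimal sketch of that idea follows; the rng parameter and the name shuffle_sketch are additions for illustration and testing, not part of the documented signature.

```python
import random


def shuffle_sketch(reader, buf_size, rng=None):
    """Return a reader that shuffles entries within buffers of buf_size."""
    rng = rng or random.Random()

    def new_reader():
        buf = []
        for entry in reader():
            buf.append(entry)
            if len(buf) >= buf_size:
                rng.shuffle(buf)
                yield from buf
                buf = []
        if buf:
            # Shuffle and emit whatever is left in the last, partial buffer.
            rng.shuffle(buf)
            yield from buf

    return new_reader


shuffled = list(shuffle_sketch(lambda: iter(range(10)), 3, rng=random.Random(0))())
```

With buf_size=3 the first three outputs are always some ordering of 0, 1, 2; a buffer at least as large as the dataset would be needed for a full shuffle.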
- 
paddle.v2.reader.firstn(reader, n)
- Limits the maximum number of samples that the reader can return. - Parameters: - reader (callable) – the data reader to read from.
- n (int) – the maximum number of samples to return.
 - Returns: - the decorated reader. - Return type: - callable 
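Limiting a reader is especially useful when the underlying reader never terminates. The sketch below expresses the idea with itertools.islice; firstn_sketch is a hypothetical name and this is not the library's code.

```python
import itertools


def firstn_sketch(reader, n):
    """Return a reader that yields at most n entries from `reader`."""
    def new_reader():
        return itertools.islice(reader(), n)
    return new_reader


# An endless reader: 0, 1, 2, ... without a stopping condition.
endless = lambda: itertools.count()
first_three = list(firstn_sketch(endless, 3)())
```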
- 
paddle.v2.reader.xmap_readers(mapper, reader, process_num, buffer_size, order=False)
- Maps samples from reader in parallel with a user-defined mapper. This function also contains a buffered decorator. - Parameters: - mapper (callable) – a function that maps a sample.
- reader (callable) – the data reader to read from.
- process_num (int) – number of processes that handle the original samples.
- buffer_size (int) – max buffer size.
- order (bool) – whether to keep the order of the reader.
 - Returns: - the decorated reader. - Return type: - callable 
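A parallel map over a reader can be sketched with two bounded queues: one feeding samples to the workers and one collecting the mapped results. The version below uses threads as a stand-in for the workers and does not preserve order (comparable to order=False). It is an illustration under those assumptions, and xmap_readers_sketch is a hypothetical name.

```python
import queue
import threading


def xmap_readers_sketch(mapper, reader, process_num, buffer_size):
    end = object()  # sentinel that marks the end of the data

    def new_reader():
        in_queue = queue.Queue(buffer_size)
        out_queue = queue.Queue(buffer_size)

        def read_worker():
            for sample in reader():
                in_queue.put(sample)
            for _ in range(process_num):
                in_queue.put(end)  # one stop signal per mapping worker

        def map_worker():
            while True:
                sample = in_queue.get()
                if sample is end:
                    out_queue.put(end)
                    return
                out_queue.put(mapper(sample))

        workers = [threading.Thread(target=read_worker, daemon=True)]
        workers += [threading.Thread(target=map_worker, daemon=True)
                    for _ in range(process_num)]
        for worker in workers:
            worker.start()

        finished = 0
        while finished < process_num:
            result = out_queue.get()
            if result is end:
                finished += 1
            else:
                yield result

    return new_reader
```

Because results arrive in completion order, callers that need the original order would have to tag samples with an index and reorder them, which this sketch leaves out.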
- 
class paddle.v2.reader.PipeReader(command, bufsize=8192, file_type='plain')
- PipeReader reads data as a stream from a command: it takes the command's stdout into a pipe buffer, passes it to the parser, and yields data in the desired format. - You can use a standard Linux command or call another program to read data from HDFS, Ceph, a URL, AWS S3, etc. - An example:
    def example_reader():
        for f in myfiles:
            pr = PipeReader("cat %s" % f)
            for l in pr.get_line():
                sample = l.split(" ")
                yield sample
- 
get_line(cut_lines=True, line_break='\n')
- Parameters: - cut_lines (bool) – whether to cut the buffer into lines.
- line_break (string) – line break of the file, such as \n or \r.
 - Returns: - one line or a buffer of bytes. - Return type: - string 
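Reading a command's stdout in fixed-size chunks and cutting it into lines can be sketched with the standard subprocess module. This is not the PipeReader implementation; pipe_lines is a hypothetical helper, and a child Python process stands in for a command such as cat.

```python
import subprocess
import sys


def pipe_lines(command, bufsize=8192, line_break="\n"):
    """Yield the lines written to stdout by `command` (a list of arguments)."""
    separator = line_break.encode("utf-8")
    process = subprocess.Popen(command, stdout=subprocess.PIPE)
    remainder = b""
    while True:
        chunk = process.stdout.read(bufsize)
        if not chunk:
            break
        pieces = (remainder + chunk).split(separator)
        remainder = pieces.pop()  # the last piece may be an incomplete line
        for piece in pieces:
            yield piece.decode("utf-8")
    process.stdout.close()
    process.wait()
    if remainder:
        yield remainder.decode("utf-8")


# Usage: the child process writes two lines; bufsize=4 forces several chunks.
command = [sys.executable, "-c",
           "import sys; sys.stdout.buffer.write(b'1 2 3\\n4 5 6\\n')"]
samples = [line.split(" ") for line in pipe_lines(command, bufsize=4)]
```

Keeping the unfinished tail of each chunk in remainder is what lets lines span chunk boundaries without being split.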
 
 
- 
The creator package contains some simple reader creators that can be used in user programs.
- 
paddle.v2.reader.creator.np_array(x)
- Creates a reader that yields the elements of x if it is a numpy vector, the rows of x if it is a numpy matrix, or, in general, any sub-hyperplane indexed by the highest dimension. - Parameters: - x – the numpy array to create reader from. - Returns: - data reader created from x. 
- 
paddle.v2.reader.creator.text_file(path)
- Creates a data reader that outputs text line by line from the given text file. The trailing newline (‘\n’) of each line will be removed. - Parameters: - path – path of the text file. - Returns: - data reader of text file 
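The line-by-line behaviour described for text_file is easy to reproduce with the built-in open function. The sketch below is an approximation for illustration; text_file_sketch is a hypothetical name and the temporary file exists only for the demonstration.

```python
import os
import tempfile


def text_file_sketch(path):
    """Return a reader that yields the lines of `path` without trailing newlines."""
    def reader():
        with open(path, "r") as f:
            for line in f:
                yield line.rstrip("\n")
    return reader


# Usage: write a small temporary file and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
lines = list(text_file_sketch(tmp.name)())
os.remove(tmp.name)
```

The file is opened inside the reader, so each call to the reader starts again from the first line.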
- 
paddle.v2.reader.creator.cloud_reader(paths, etcd_endpoints, timeout_sec=5, buf_size=64)
- Creates a data reader that yields records one by one from the given paths.
 - Parameters: - paths – path of recordio files; can be a string or a list of strings.
- etcd_endpoints – the endpoints of the etcd cluster.
 - Returns: - data reader of recordio files. 
minibatch¶
- 
paddle.v2.minibatch.batch(reader, batch_size)
- Create a batched reader. - Parameters: - reader (callable) – the data reader to read from.
- batch_size (int) – size of each mini-batch
 - Returns: - the batched reader. - Return type: - callable 
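Batching groups consecutive entries of a reader into lists of batch_size. The sketch below shows one plausible behaviour, in which a final, smaller batch is still yielded; this page does not state how the real function treats leftover entries, so that choice is an assumption, and batch_sketch is a hypothetical name.

```python
def batch_sketch(reader, batch_size):
    """Return a batch reader that yields lists of up to batch_size entries."""
    def batch_reader():
        current = []
        for entry in reader():
            current.append(entry)
            if len(current) == batch_size:
                yield current
                current = []
        if current:
            # Assumption: keep the last, partial batch rather than dropping it.
            yield current
    return batch_reader


batches = list(batch_sketch(lambda: iter(range(5)), 2)())
```

Each yielded list is one mini-batch, so a reader of single entries becomes a batch reader in the sense defined earlier on this page.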
