PyDataProviderWrapper API

This module provides a wrapper (decorator) that wraps a data-processing method into a PyDataProvider. Some examples are shown below.

class paddle.trainer.PyDataProviderWrapper.DenseSlot(dim)

Dense Slot Type: Each item is the value of a Dense Vector.

Its yield format for provider is:

  • NonSeq: [float, float, ... ]
  • Seq: [[float, float, ...], [float, float ....], ... ]
  • SubSeq: [[[float, float, ...], [float ....], ...] , [[float, float, ...], [float ....], ...] , ...]
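As a sketch of the NonSeq format above, a process-style generator can yield one plain list of floats per sample for a DenseSlot. The line format and helper below are illustrative assumptions, not part of the API:

```python
# Sketch: yield one dense feature vector (a plain list of floats)
# per sample, matching the NonSeq format of a DenseSlot(4).
def dense_samples(lines):
    for line in lines:
        # assumed line format: 4 whitespace-separated floats
        yield [float(x) for x in line.split()]

rows = list(dense_samples(["0.1 0.2 0.3 0.4", "1.0 2.0 3.0 4.0"]))
# each row is a valid NonSeq item for DenseSlot(4)
```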
class paddle.trainer.PyDataProviderWrapper.SparseNonValueSlot(dim)

Sparse NonValue Slot Type: Each item is the id of a Sparse Vector.

Its yield format for provider is:

  • NonSeq: [int, int, ...]
  • Seq: [[int, int, ...], [int, int, ...], ... ]
  • SubSeq: [[[int, int, ...], [int, ....], ...] , [[int, int, ...], [int, ....], ...] , ...]
class paddle.trainer.PyDataProviderWrapper.SparseValueSlot(dim)

Sparse Value Slot Type: Each item is the id and value of a Sparse Vector.

Its yield format for provider is:

  • NonSeq: [(int, float), (int, float), ... ]
  • Seq: [[(int,float), (int, float), ... ], [(int, float), (int, float), ...], ... ]
  • SubSeq: [[[(int,float), ...], [(int, float), ....], ...] , [[(int,float), ...], [(int, float), ....], ...] , ...]
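To illustrate the NonSeq format above, a sample for a SparseValueSlot is a list of (index, value) pairs. The conversion helper below is a hypothetical sketch, assuming the features arrive as a dict of id to value:

```python
# Sketch: convert a dict of feature-id -> value into the
# [(int, float), (int, float), ...] NonSeq format of SparseValueSlot.
def to_sparse_value(features):
    return sorted((int(i), float(v)) for i, v in features.items())

sample = to_sparse_value({3: 0.5, 10: 1.25})
# sample is [(3, 0.5), (10, 1.25)]
```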
class paddle.trainer.PyDataProviderWrapper.IndexSlot(dim)

Index Value Slot Type: Each item is the id of a label.

Its yield format for provider is:

  • NonSeq: int
  • Seq: [int, int, ....]
  • SubSeq: [[int, int, ...], [int, int, ...], ... ]
class paddle.trainer.PyDataProviderWrapper.StringSlot(dim)

String Value Slot Type: Each item is a string for printing; it can also be used in a DataLayer.

Its yield format for provider is:

  • NonSeq: string
  • Seq: [string, string, ....]
  • SubSeq: [[string, string, ...], [string, string, ...], ... ]
class paddle.trainer.PyDataProviderWrapper.PoolSize(pool_size)

The maximum number of samples the provider holds in memory.

paddle.trainer.PyDataProviderWrapper.provider(slots=None, use_seq=False, should_shuffle=True, pool_size=1, can_over_batch_size=True, calc_batch_size=<function <lambda>>, debug=False, init_hook=<function default_init_hook>, profile_filename=None)

The decorator for PyDataProvider. Use it to create a provider class; the user only needs to care about how to read samples from a file.

So the basic usage is:

@provider(slots=[DenseSlot(9), IndexSlot(2)])  # example configuration
def process(obj, file_name):
    with open(file_name) as f:
        for line in f:
            sample = parseOneSample(line)  # user-defined parsing
            yield sample

The configuration of data provider should be setup by:

Parameters:
  • init_hook (callable) –

A callback invoked when the PyDataProvider instance is created. Its signature is (obj, *args, **kwargs).

• obj: the actual data provider instance; global objects can be stored on it as attributes (obj.xxxxx) and used by the process function.
1. obj.slots: a list of SlotType objects; it can be set in init_hook. For example, obj.slots = [DenseSlot(9), IndexSlot(2)].
      2. obj.logger: a logger object. User can invoke obj.logger.info(), obj.logger.fatal(), etc.
• args and kwargs: the data provider's __init__ parameters. For example, load_data_args will be found in **kwargs; if you want to receive it from trainer_config, it is recommended to use init_hook_wrapper.
  • pool_size (int | PoolSize) –
• int: read at most pool_size files into memory.
    • PoolSize: read at most PoolSize.size samples into memory.
    • If not set, all files are read into memory.
  • slots (list | callable) –

Specifies the SlotTypes; they can also be set in init_hook. Two formats are accepted:

    • A list of SlotType objects. For example, slots = [DenseSlot(9), IndexSlot(2)].
• A method returning a list of SlotTypes; its signature is (obj, *file_list, **kwargs).
  • use_seq (bool) –

False (default) if no sequence is used; True to use sequences:

• If a sequence has no sub-sequence: each slot returns a list of data, and this list is one sequence. So the return format looks like [[a0, a1, a2], [b1, b2, b3, b4], [c1]].
    • If a sequence has sub-sequences: each slot returns a nested list of data containing several sub-lists, each of which is one sub-sequence. So the return format looks like [[[a0, a1, a2], [a4, a5]], [[b1, b2, b3, b4], [b5, b6]], [[c1], [c2]]].
• should_shuffle (bool) – True if the data should be shuffled.
  • calc_batch_size (callable) –

A method that calculates the batch size contributed by each sample.

• By default, each sample contributes a batch size of one.
    • Users can customize it with a lambda function. For example, calc_batch_size = lambda data: len(data) counts the number of tokens in a sequence as its batch size.
  • can_over_batch_size (bool) –

Whether the actual batch size may be greater than or equal to the input batch size:

    • True (>=): the getNextBatch method may return more data than requested (default).
    • False (<): the user must ensure that each sample's batch size is smaller than the input batch size.
  • debug (bool) – True if enable debug logger and some debug check. Default is False.
• profile_filename (None | str) – None to disable profiling (default). Otherwise, the data provider dumps profiling results on each reset; the dump filename is <profile_filename>_<reset_count>.
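The use_seq return formats described above can be sketched with plain Python data. The slot layout below (two IndexSlots, one for word ids and one for a label) is an illustrative assumption:

```python
# Sketch: with use_seq=True, each yielded sample holds one list per
# slot, and every list is one sequence (cf. [[a0, a1, a2], ..., [c1]]).
def seq_samples():
    # slot 0: a word-id sequence; slot 1: a one-element label sequence
    yield [[3, 1, 4, 1], [0]]
    yield [[2, 7], [1]]

samples = list(seq_samples())
# the first sample's slot-0 sequence has 4 tokens, so
# calc_batch_size = lambda data: len(data) would count it as 4
```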
paddle.trainer.PyDataProviderWrapper.init_hook_wrapper(func)

Wrap a method to serve as PyDataProviderWrapper's init_hook. The wrapped method can receive parameters from trainer_config's load_data_args. load_data_args must be a pickle.dumps() value that decodes to a dict; its entries are passed to the wrapped method func as keyword arguments.

So an example usage is:

@init_hook_wrapper
def hook(obj, dictionary, file_list, **kwargs):
    obj.dictionary = dictionary
    obj.slots = [IndexSlot(len(obj.dictionary)),
                 IndexSlot(len(open(file_list[0], "r").readlines()))]
Parameters: func (callable) – the init_hook function
Returns: the wrapped method, which can be passed into @provider.
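The load_data_args contract above can be sketched without the trainer: the config side pickles a dict, and the wrapper side unpickles it and passes the entries as keyword arguments. The dictionary contents below are illustrative:

```python
import pickle

# Config side: load_data_args is a pickle.dumps() of a dict.
load_data_args = pickle.dumps({"dictionary": {"hello": 0, "world": 1}})

# What init_hook_wrapper effectively does before calling func:
kwargs = pickle.loads(load_data_args)
vocab_size = len(kwargs["dictionary"])  # e.g. for IndexSlot(vocab_size)
```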