PyDataProviderWrapper API
This module provides a wrapper (decorator) that turns a data-processing method into a PyDataProvider. Some examples are shown here.
class paddle.trainer.PyDataProviderWrapper.DenseSlot(dim)
Dense Slot Type: each item is the value of a dense vector. Its yield format for provider is:
- NonSeq: [float, float, ...]
- Seq: [[float, float, ...], [float, float, ...], ...]
- SubSeq: [[[float, float, ...], [float, ...], ...], [[float, float, ...], [float, ...], ...], ...]
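As a rough illustration, the three yield formats for a slot declared as DenseSlot(3) could look like the plain Python lists below. The values and the helper nesting_depth are made up for this sketch, and nothing here calls Paddle itself.

```python
# Hypothetical values for a slot declared as DenseSlot(3).
# Only the list nesting matters here; the numbers are arbitrary.
non_seq = [0.1, 0.2, 0.3]                     # one dense vector
seq = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]      # a sequence of two vectors
sub_seq = [                                   # two sub-sequences
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[0.7, 0.8, 0.9]],
]


def nesting_depth(value):
    """Return how many list levels wrap the innermost floats."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0]
    return depth
```

Each format adds one level of nesting to the previous one, so nesting_depth gives 1, 2 and 3 for the three values above.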
 
class paddle.trainer.PyDataProviderWrapper.SparseNonValueSlot(dim)
Sparse NonValue Slot Type: each item is the id of a sparse vector. Its yield format for provider is:
- NonSeq: [int, int, ...]
- Seq: [[int, int, ...], [int, int, ...], ...]
- SubSeq: [[[int, int, ...], [int, ...], ...], [[int, int, ...], [int, ...], ...], ...]
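The sketch below shows one way such id lists might be produced from dense 0/1 vectors. It assumes the ids are the zero-based positions of the non-zero entries, which the text above does not state explicitly, and to_sparse_ids is a hypothetical helper, not part of the module.

```python
def to_sparse_ids(dense):
    """Return the zero-based positions of the non-zero entries.

    Assumption: this is what "id" means for a SparseNonValueSlot.
    """
    return [i for i, v in enumerate(dense) if v != 0]


# NonSeq: one id list for a 5-dimensional binary vector.
non_seq = to_sparse_ids([0, 1, 0, 0, 1])

# Seq: one id list per element of the sequence.
seq = [to_sparse_ids(v) for v in ([1, 0, 0, 0, 0], [0, 0, 1, 1, 0])]
```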
 
class paddle.trainer.PyDataProviderWrapper.SparseValueSlot(dim)
Sparse Value Slot Type: each item is the id and value of a sparse vector. Its yield format for provider is:
- NonSeq: [(int, float), (int, float), ...]
- Seq: [[(int, float), (int, float), ...], [(int, float), (int, float), ...], ...]
- SubSeq: [[[(int, float), ...], [(int, float), ...], ...], [[(int, float), ...], [(int, float), ...], ...], ...]
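A minimal sketch of building the (id, value) pairs from a dense vector follows. As before, treating the id as the zero-based position of a non-zero entry is an assumption, and to_sparse_pairs is a hypothetical helper used only for illustration.

```python
def to_sparse_pairs(dense):
    """Return (position, value) pairs for the non-zero entries.

    Assumption: the id in each pair is the zero-based position.
    """
    return [(i, float(v)) for i, v in enumerate(dense) if v != 0]


# NonSeq: one list of pairs for a 4-dimensional vector.
non_seq = to_sparse_pairs([0.0, 2.5, 0.0, 1.0])

# Seq: one list of pairs per element of the sequence.
seq = [to_sparse_pairs(v) for v in ([0.5, 0.0], [0.0, 3.0])]
```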
 
class paddle.trainer.PyDataProviderWrapper.IndexSlot(dim)
Index Value Slot Type: each item is the id of a label. Its yield format for provider is:
- NonSeq: int
- Seq: [int, int, ...]
- SubSeq: [[int, int, ...], [int, int, ...], ...]
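For illustration, label ids for an IndexSlot(2) might be derived from label names as sketched below. The label names and the label_to_id mapping are invented for this example.

```python
# Hypothetical label set; with two labels the slot would be IndexSlot(2).
labels = ["neg", "pos"]
label_to_id = {name: i for i, name in enumerate(labels)}

# NonSeq: a single label id.
non_seq = label_to_id["pos"]

# Seq: one label id per element of the sequence.
seq = [label_to_id[name] for name in ["neg", "pos", "pos"]]

# SubSeq: one list of label ids per sub-sequence.
sub_seq = [
    [label_to_id[name] for name in sub]
    for sub in (["neg", "pos"], ["pos"])
]
```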
 
class paddle.trainer.PyDataProviderWrapper.StringSlot(dim)
String Value Slot Type: each item is a string for printout; it can also be used in DataLayer. Its yield format for provider is:
- NonSeq: string
- Seq: [string, string, ...]
- SubSeq: [[string, string, ...], [string, string, ...], ...]
 
class paddle.trainer.PyDataProviderWrapper.PoolSize(pool_size)
Maximum number of samples that the provider holds in memory.
paddle.trainer.PyDataProviderWrapper.provider(slots=None, use_seq=False, should_shuffle=True, pool_size=1, can_over_batch_size=True, calc_batch_size=<function <lambda>>, debug=False, init_hook=<function default_init_hook>, profile_filename=None)
The decorator for PyDataProvider. Use it to create a Provider class; the user only needs to describe how to read samples from a file.
So the basic usage is:

    @provider(some data provider config here...)
    def process(obj, file_name):
        while not at end of file_name:
            sample = readOneSampleFromFile(file_name)
            yield sample

The data provider is configured through the following parameters.
Parameters:
- init_hook (callable) – A callback invoked when the PyDataProvider instance is created. Its parameters are (obj, *args, **kwargs).
  - obj: the data provider instance itself. It holds global objects as obj.xxxxx attributes and is used by the process function.
    - obj.slots: a list of SlotType objects. Can be set in init. For example, obj.slots = [DenseSlot(9), IndexSlot(2)].
    - obj.logger: a logger object. The user can invoke obj.logger.info(), obj.logger.fatal(), etc.
  - args and kwargs: the parameters of the data provider's __init__. For example, load_data_args is found in **kwargs; to receive it from trainer_config, using init_hook_wrapper is recommended.
- pool_size (int | PoolSize) –
  - int: read at most pool_size files into memory.
  - PoolSize: read at most PoolSize.size samples into memory.
  - If not set, all the files are read into memory.
 
- slots (list | callable) – Specifies the SlotTypes; can also be set in init_hook. It has two formats:
  - A list of SlotType objects. For example, slots = [DenseSlot(9), IndexSlot(2)].
  - A method that returns a list of SlotTypes; its parameters are (obj, *file_list, **kwargs).
 
- use_seq (bool) – False to use no sequences (default). True to use sequences:
  - If the sequence has no sub-sequences: each slot returns a list of data, and this list is one sequence. The return format looks like [[a0, a1, a2], [b1, b2, b3, b4], [c1]].
  - If the sequence has sub-sequences: each slot returns a nested list of data. This list contains several sub-lists, and each sub-list is one sub-sequence. The return format looks like [[[a0, a1, a2], [a4, a5]], [[b1, b2, b3, b4], [b5, b6]], [[c1], [c2]]].
 
- should_shuffle (bool) – True if the data should be shuffled.
- calc_batch_size (callable) – The method that calculates each sample’s batch size.
  - By default, each sample has a batch size of one.
  - Users can customize it with a lambda function. For example, calc_batch_size = lambda data: len(data) counts the number of tokens in a sequence sample.
 
- can_over_batch_size (bool) – Whether the actual batch size may be >= the input batch size.
  - True (>=): the getNextBatch method can return more data (default).
  - False (<): the user must ensure that each sample’s batch size < the input batch size.
 
- debug (bool) – True to enable the debug logger and some debug checks. Default is False.
- profile_filename (None | Str) – None disables profiling (default). Otherwise, the data provider dumps the profile result on each reset, to a file named <profile_filename>_<reset_count>.
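The basic usage described above can be sketched as a plain generator. So that the sketch runs without Paddle, the @provider line appears only as a comment. The file layout (whitespace-separated floats followed by an integer label) and the helper demo are assumptions made for this example, as is the idea that each sample is a list with one entry per slot.

```python
import os
import tempfile


# With Paddle available, this function would be decorated, for example:
#   @provider(slots=[DenseSlot(3), IndexSlot(2)])
# The decorator is omitted here so the sketch has no Paddle dependency.
def process(obj, file_name):
    """Yield one [features, label] sample per non-empty line of file_name."""
    with open(file_name) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            features = [float(x) for x in fields[:-1]]
            label = int(fields[-1])
            yield [features, label]


# A candidate calc_batch_size for sequence data, as in the text above:
# it counts the tokens of a sample instead of counting the sample as one.
calc_batch_size = lambda data: len(data)


def demo():
    """Write a tiny hypothetical data file and read it back with process."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("0.1 0.2 0.3 1\n0.4 0.5 0.6 0\n")
    try:
        return list(process(None, tmp.name))
    finally:
        os.remove(tmp.name)
```

Calling demo() reads the two lines back as two samples. In a real configuration the framework, not user code, would call process for each file in the file list.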
 
paddle.trainer.PyDataProviderWrapper.init_hook_wrapper(func)
Wraps a method for use as PyDataProviderWrapper’s init_hook. The wrapped method can receive parameters from trainer_config’s load_data_args. load_data_args must be a pickle.dumps() value of a map, whose entries are passed on as keyword args: the wrapped method func receives them as keyword args.
So an example usage is:

    @init_hook_wrapper
    def hook(obj, dictionary, file_list, **kwargs):
        obj.dictionary = dictionary
        obj.slots = [IndexSlot(len(obj.dictionary)),
                     IndexSlot(len(open(file_list[0], "r").readlines()))]

Parameters:
- func (callable) – init_hook function
Returns:
- wrapped method, can be passed into @provider.
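To make the pickle round trip concrete, the sketch below imitates what the wrapper is described as doing: a dict is pickled on the configuration side, then unpickled and forwarded as keyword args to the hook. FakeProvider, call_hook, the num_words attribute and the argument values are all hypothetical stand-ins; this is not Paddle's implementation.

```python
import pickle

# On the trainer_config side, load_data_args would carry a pickled dict.
# The keys match the hook's parameter names; the values are made up.
load_data_args = pickle.dumps({
    "dictionary": {"a": 0, "b": 1},
    "file_list": ["train.txt"],
})


class FakeProvider(object):
    """Stand-in for the data provider instance passed as obj."""


def hook(obj, dictionary, file_list, **kwargs):
    """Store the received arguments on the provider instance."""
    obj.dictionary = dictionary
    obj.num_words = len(dictionary)


def call_hook(func, obj, pickled_args):
    """Unpickle the dict and forward its entries as keyword args.

    Illustrative only: it mimics the behaviour described for
    init_hook_wrapper without using Paddle.
    """
    func(obj, **pickle.loads(pickled_args))


obj = FakeProvider()
call_hook(hook, obj, load_data_args)
```

Because the entries arrive as keyword args, the keys of the pickled dict have to match the parameter names of the hook; extra keys end up in **kwargs.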