PyDataProviderWrapper API
This module provides a wrapper (decorator) that wraps a data processing method into a PyDataProvider. Some examples are shown here.
class paddle.trainer.PyDataProviderWrapper.DenseSlot(dim)
Dense Slot Type: Each item is the value of a Dense Vector.
Its yield format for provider is:
- NonSeq: [float, float, ...]
- Seq: [[float, float, ...], [float, float, ...], ...]
- SubSeq: [[[float, float, ...], [float, ...], ...], [[float, float, ...], [float, ...], ...], ...]
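For illustration, here are hypothetical items for a DenseSlot(3) in each format; the dimension and values are made up:

    # Hypothetical DenseSlot(3) items; values are illustrative only.
    non_seq_item = [0.1, 0.2, 0.3]                        # one dense vector
    seq_item = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]         # a sequence of dense vectors
    sub_seq_item = [[[0.1, 0.2, 0.3]],                    # first sub-sequence
                    [[0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]]   # second sub-sequence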
class paddle.trainer.PyDataProviderWrapper.SparseNonValueSlot(dim)
Sparse NonValue Slot Type: Each item contains the ids of a Sparse Vector.
Its yield format for provider is:
- NonSeq: [int, int, ...]
- Seq: [[int, int, ...], [int, int, ...], ...]
- SubSeq: [[[int, int, ...], [int, ...], ...], [[int, int, ...], [int, ...], ...], ...]
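For illustration, hypothetical SparseNonValueSlot(10000) items, e.g. the non-zero word ids of a bag-of-words vector (the ids are made up):

    # Hypothetical SparseNonValueSlot(10000) items; only the non-zero ids are listed.
    non_seq_item = [2, 17, 983]                 # one sparse binary vector
    seq_item = [[2, 17, 983], [5, 42]]          # a sequence of sparse binary vectors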
class paddle.trainer.PyDataProviderWrapper.SparseValueSlot(dim)
Sparse Value Slot Type: Each item contains the ids and values of a Sparse Vector.
Its yield format for provider is:
- NonSeq: [(int, float), (int, float), ...]
- Seq: [[(int, float), (int, float), ...], [(int, float), (int, float), ...], ...]
- SubSeq: [[[(int, float), ...], [(int, float), ...], ...], [[(int, float), ...], [(int, float), ...], ...], ...]
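For illustration, hypothetical SparseValueSlot(10000) items (ids and values are made up):

    # Hypothetical SparseValueSlot(10000) items; each entry is an (id, value) pair.
    non_seq_item = [(2, 0.5), (17, 1.25)]                  # one sparse vector
    seq_item = [[(2, 0.5), (17, 1.25)], [(5, 3.0)]]        # a sequence of sparse vectors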
class paddle.trainer.PyDataProviderWrapper.IndexSlot(dim)
Index Value Slot Type: Each item is the id of a Label.
Its yield format for provider is:
- NonSeq: int
- Seq: [int, int, ...]
- SubSeq: [[int, int, ...], [int, int, ...], ...]
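For illustration, hypothetical IndexSlot(2) items, i.e. binary class labels:

    # Hypothetical IndexSlot(2) items; label ids are assumed to lie in [0, dim).
    non_seq_item = 1                    # a single label id
    seq_item = [0, 1, 1, 0]             # a sequence of label ids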
class paddle.trainer.PyDataProviderWrapper.StringSlot(dim)
String Value Slot Type: Each item is a string for printout; it can be used in a DataLayer too.
Its yield format for provider is:
- NonSeq: string
- Seq: [string, string, ...]
- SubSeq: [[string, string, ...], [string, string, ...], ...]
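For illustration, hypothetical StringSlot items (the strings are made up):

    # Hypothetical StringSlot items; strings are illustrative only.
    non_seq_item = "a sample sentence"                    # a single string
    seq_item = ["first token", "second token"]            # a sequence of strings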
class paddle.trainer.PyDataProviderWrapper.PoolSize(pool_size)
The maximum number of samples that the provider keeps in memory.
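As a hedged sketch, a PoolSize is typically passed as the pool_size argument of the provider decorator documented below; the 10000-sample limit, the slot dimensions, and the line layout are all illustrative assumptions:

    from paddle.trainer.PyDataProviderWrapper import provider, PoolSize, DenseSlot, IndexSlot

    # Keep at most 10000 samples in memory at a time (value is illustrative).
    @provider(slots=[DenseSlot(9), IndexSlot(2)], pool_size=PoolSize(10000))
    def process(obj, file_name):
        for line in open(file_name):
            fields = line.split()
            # Hypothetical line layout: nine floats followed by one integer label.
            yield [float(x) for x in fields[:9]], int(fields[9])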
paddle.trainer.PyDataProviderWrapper.provider(slots=None, use_seq=False, should_shuffle=True, pool_size=1, can_over_batch_size=True, calc_batch_size=<function <lambda>>, debug=False, init_hook=<function default_init_hook>, profile_filename=None)
The decorator for PyDataProvider. Users should use this decorator to create a Provider class, and only need to concern themselves with how to read samples from a file.
So the basic usage is:

    @provider(some data provider config here...)
    def process(obj, file_name):
        while not at end of file_name:
            sample = readOneSampleFromFile(file_name)
            yield sample
The configuration of the data provider is set up by the following parameters (a complete sketch follows the parameter list):
Parameters:
- init_hook (callable) –
  A callback invoked when the PyDataProvider instance is created. Its parameters are (obj, *args, **kwargs).
  - obj: the actual data provider instance, which holds some global objects as obj.xxxxx and is used by the process function.
  - obj.slots: a list of SlotType objects. Can be set in init. For example, obj.slots = [DenseSlot(9), IndexSlot(2)].
  - obj.logger: a logger object. Users can invoke obj.logger.info(), obj.logger.fatal(), etc.
  - args and kwargs: the data provider __init__ parameters. For example, load_data_args will be found in **kwargs; if you want to receive it from trainer_config, it is recommended to use init_hook_wrapper.
- pool_size (int | PoolSize) –
  - int: it will read at most pool_size files into memory.
  - PoolSize: it will read at most PoolSize.size samples into memory.
  - If not set, it will read all the files into memory.
- slots (list | callable) –
  Specifies the SlotTypes; they can also be set in init_hook. It has two formats:
  - A list of SlotType objects. For example, slots = [DenseSlot(9), IndexSlot(2)].
  - A method that returns a list of SlotTypes, whose parameters are (obj, *file_list, **kwargs).
- use_seq (bool) –
  False if no sequence is used (default). True if sequences are used:
  - If the sequence has no sub-sequences: each slot returns a list of data, and this list is one sequence. So the return format looks like [[a0, a1, a2], [b1, b2, b3, b4], [c1]].
  - If the sequence has sub-sequences: each slot returns a nested list of data; this list contains several sub-lists, and each sub-list is one sub-sequence. So the return format looks like [[[a0, a1, a2], [a4, a5]], [[b1, b2, b3, b4], [b5, b6]], [[c1], [c2]]].
- should_shuffle (bool) – True if the data should be shuffled.
- calc_batch_size (callable) –
  The method that calculates each data item's batch size.
  - By default, each sample counts as a batch size of one.
  - Users can customize it with a lambda function. For example,
    calc_batch_size = lambda data: len(data)
    counts the number of tokens in a sequence as its batch size.
- can_over_batch_size (bool) –
  Whether the actual batch size may be >= the input batch size.
  - True (>=): the getNextBatch method can return more data (default).
  - False (<): the user must ensure that each data item's batch size < the input batch size.
- debug (bool) – True to enable the debug logger and some debug checks. Default is False.
- profile_filename (None | str) – None to disable profiling (default). Otherwise, the data provider dumps the profile result on each reset, and the dump filename is <profile_filename>_<reset_count>.
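Tying these parameters together, here is a minimal sketch of a provider for a bag-of-words classification task; the tab-separated file layout, the 10000-word vocabulary, the two label classes, and the assumption that each yielded sample supplies one value per slot (in slot order) are all illustrative, not a prescription:

    from paddle.trainer.PyDataProviderWrapper import provider, SparseNonValueSlot, IndexSlot

    # Hypothetical line layout: "<label>\t<word_id> <word_id> ...".
    @provider(slots=[SparseNonValueSlot(10000), IndexSlot(2)],
              should_shuffle=True)
    def process(obj, file_name):
        with open(file_name) as f:
            for line in f:
                label, words = line.rstrip('\n').split('\t')
                word_ids = [int(w) for w in words.split()]
                # One value per slot, following the NonSeq formats above.
                yield word_ids, int(label)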
paddle.trainer.PyDataProviderWrapper.init_hook_wrapper(func)
Wrap a method to be used as PyDataProviderWrapper's init_hook. The wrapped method can receive parameters from trainer_config's load_data_args. The load_data_args must be a pickle.dumps() value that encodes a map, and the wrapped method func will receive the map's entries as keyword args.
So an example usage is:
    @init_hook_wrapper
    def hook(obj, dictionary, file_list, **kwargs):
        obj.dictionary = dictionary
        obj.slots = [IndexSlot(len(obj.dictionary)),
                     IndexSlot(len(open(file_list[0], "r").readlines()))]
Parameters: func (callable) – the init_hook function.
Returns: the wrapped method, which can be passed to @provider as init_hook.
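For completeness, a minimal sketch of how a matching load_data_args could be built on the trainer_config side, based only on the pickle.dumps() requirement above; the dictionary contents are made up, and how load_data_args is wired into the trainer configuration is outside this snippet:

    import pickle

    # load_data_args must be a pickle.dumps() of a map; its keys become
    # keyword args of the wrapped hook (here, the hypothetical 'dictionary').
    dictionary = {'hello': 0, 'world': 1}
    load_data_args = pickle.dumps({'dictionary': dictionary})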