PyDataProviderWrapper API

This module provides a wrapper (decorator) that wraps a data-processing method into a PyDataProvider. Some examples are shown below.

class paddle.trainer.PyDataProviderWrapper.DenseSlot(dim)

Dense Slot Type: Each item is the value of a Dense Vector.

Its yield format for provider is:

  • NonSeq: [float, float, ... ]
  • Seq: [[float, float, ...], [float, float ....], ... ]
  • SubSeq: [[[float, float, ...], [float ....], ...] , [[float, float, ...], [float ....], ...] , ...]
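As a sketch of the NonSeq format above, a process-style generator can yield one plain list of floats per sample for a DenseSlot. The line format and helper below are illustrative assumptions, not part of the API:

```python
# Sketch: yield one dense feature vector (a plain list of floats)
# per sample, matching the NonSeq format of a DenseSlot(4).
def dense_samples(lines):
    for line in lines:
        # assumed line format: 4 whitespace-separated floats
        yield [float(x) for x in line.split()]

rows = list(dense_samples(["0.1 0.2 0.3 0.4", "1.0 2.0 3.0 4.0"]))
# each row is a valid NonSeq item for DenseSlot(4)
```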
class paddle.trainer.PyDataProviderWrapper.SparseNonValueSlot(dim)

Sparse NonValue Slot Type: Each item is the id of a Sparse Vector.

Its yield format for provider is:

  • NonSeq: [int, int, ...]
  • Seq: [[int, int, ...], [int, int, ...], ... ]
  • SubSeq: [[[int, int, ...], [int, ....], ...] , [[int, int, ...], [int, ....], ...] , ...]
class paddle.trainer.PyDataProviderWrapper.SparseValueSlot(dim)

Sparse Value Slot Type: Each item is the id and value of a Sparse Vector.

Its yield format for provider is:

  • NonSeq: [(int, float), (int, float), ... ]
  • Seq: [[(int,float), (int, float), ... ], [(int, float), (int, float), ...], ... ]
  • SubSeq: [[[(int,float), ...], [(int, float), ....], ...] , [[(int,float), ...], [(int, float), ....], ...] , ...]
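To illustrate the NonSeq format above, a sample for a SparseValueSlot is a list of (index, value) pairs. The conversion helper below is a hypothetical sketch, assuming the features arrive as a dict of id to value:

```python
# Sketch: convert a dict of feature-id -> value into the
# [(int, float), (int, float), ...] NonSeq format of SparseValueSlot.
def to_sparse_value(features):
    return sorted((int(i), float(v)) for i, v in features.items())

sample = to_sparse_value({3: 0.5, 10: 1.25})
# sample is [(3, 0.5), (10, 1.25)]
```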
class paddle.trainer.PyDataProviderWrapper.IndexSlot(dim)

Index Value Slot Type: Each item is the id of a label.

Its yield format for provider is:

  • NonSeq: int
  • Seq: [int, int, ....]
  • SubSeq: [[int, int, ...], [int, int, ...], ... ]
class paddle.trainer.PyDataProviderWrapper.StringSlot(dim)

String Value Slot Type: Each item is a string for printing; it can also be used in a DataLayer.

Its yield format for provider is:

  • NonSeq: string
  • Seq: [string, string, ....]
  • SubSeq: [[string, string, ...], [string, string, ...], ... ]
class paddle.trainer.PyDataProviderWrapper.PoolSize(pool_size)

The maximum number of samples the provider holds in memory.

paddle.trainer.PyDataProviderWrapper.provider(slots=None, use_seq=False, should_shuffle=True, pool_size=1, can_over_batch_size=True, calc_batch_size=<function <lambda>>, debug=False, init_hook=<function default_init_hook>, profile_filename=None)

The decorator for PyDataProvider. Use it to create a provider class; the user only needs to care about how to read samples from a file.

So the basic usage is:

@provider(slots=[DenseSlot(9), IndexSlot(2)])  # example configuration
def process(obj, file_name):
    with open(file_name) as f:
        for line in f:
            sample = parseOneSample(line)  # user-defined parsing
            yield sample

The configuration of data provider should be setup by:

Parameters:
  • init_hook (callable) –

A callback invoked when the PyDataProvider instance is created. Its signature is (obj, *args, **kwargs).

• obj: the actual data provider instance; global objects can be stored on it as attributes (obj.xxxxx) and used by the process function.
1. obj.slots: a list of SlotType objects; it can be set in init_hook. For example, obj.slots = [DenseSlot(9), IndexSlot(2)].
      2. obj.logger: a logger object. User can invoke obj.logger.info(), obj.logger.fatal(), etc.
• args and kwargs: the data provider's __init__ parameters. For example, load_data_args will be found in **kwargs; if you want to receive it from trainer_config, it is recommended to use init_hook_wrapper.
  • pool_size (int | PoolSize) –
• int: read at most pool_size files into memory.
    • PoolSize: read at most PoolSize.size samples into memory.
    • If not set, all files are read into memory.
  • slots (list | callable) –

Specifies the SlotTypes; they can also be set in init_hook. Two formats are accepted:

    • A list of SlotType objects. For example, slots = [DenseSlot(9), IndexSlot(2)].
• A method returning a list of SlotTypes; its signature is (obj, *file_list, **kwargs).
  • use_seq (bool) –

False (default) if no sequence is used; True to use sequences:

• If a sequence has no sub-sequence: each slot returns a list of data, and this list is one sequence. So the return format looks like [[a0, a1, a2], [b1, b2, b3, b4], [c1]].
    • If a sequence has sub-sequences: each slot returns a nested list of data containing several sub-lists, each of which is one sub-sequence. So the return format looks like [[[a0, a1, a2], [a4, a5]], [[b1, b2, b3, b4], [b5, b6]], [[c1], [c2]]].
• should_shuffle (bool) – True if the data should be shuffled.
  • calc_batch_size (callable) –

A method that calculates the batch size contributed by each sample.

• By default, each sample contributes a batch size of one.
    • Users can customize it with a lambda function. For example, calc_batch_size = lambda data: len(data) counts the number of tokens in a sequence as its batch size.
  • can_over_batch_size (bool) –

Whether the actual batch size may be greater than or equal to the input batch size:

    • True (>=): the getNextBatch method may return more data than requested (default).
    • False (<): the user must ensure that each sample's batch size is smaller than the input batch size.
  • debug (bool) – True if enable debug logger and some debug check. Default is False.
• profile_filename (None | str) – None to disable profiling (default). Otherwise, the data provider dumps profiling results on each reset; the dump filename is <profile_filename>_<reset_count>.
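The use_seq return formats described above can be sketched with plain Python data. The slot layout below (two IndexSlots, one for word ids and one for a label) is an illustrative assumption:

```python
# Sketch: with use_seq=True, each yielded sample holds one list per
# slot, and every list is one sequence (cf. [[a0, a1, a2], ..., [c1]]).
def seq_samples():
    # slot 0: a word-id sequence; slot 1: a one-element label sequence
    yield [[3, 1, 4, 1], [0]]
    yield [[2, 7], [1]]

samples = list(seq_samples())
# the first sample's slot-0 sequence has 4 tokens, so
# calc_batch_size = lambda data: len(data) would count it as 4
```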
paddle.trainer.PyDataProviderWrapper.init_hook_wrapper(func)

Wrap a method to serve as PyDataProviderWrapper's init_hook. The wrapped method can receive parameters from trainer_config's load_data_args. load_data_args must be a pickle.dumps() value that decodes to a dict; its entries are passed to the wrapped method func as keyword arguments.

So an example usage is:

@init_hook_wrapper
def hook(obj, dictionary, file_list, **kwargs):
    obj.dictionary = dictionary
    obj.slots = [IndexSlot(len(obj.dictionary)),
                 IndexSlot(len(open(file_list[0], "r").readlines()))]
Parameters: func (callable) – the init_hook function
Returns: the wrapped method, which can be passed into @provider.
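The load_data_args contract above can be sketched without the trainer: the config side pickles a dict, and the wrapper side unpickles it and passes the entries as keyword arguments. The dictionary contents below are illustrative:

```python
import pickle

# Config side: load_data_args is a pickle.dumps() of a dict.
load_data_args = pickle.dumps({"dictionary": {"hello": 0, "world": 1}})

# What init_hook_wrapper effectively does before calling func:
kwargs = pickle.loads(load_data_args)
vocab_size = len(kwargs["dictionary"])  # e.g. for IndexSlot(vocab_size)
```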