DataSources

Data Sources are helpers to define Paddle training data or testing data. The following data attributes are used by Paddle:

  • Data ProviderType: such as Python, Protobuf
  • Data File list: a single file that contains all data file paths
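A file list is typically just a plain text file with one data file path per line; for example, a train.list might look like the following (the paths are placeholders for illustration):

./data/train/part-000
./data/train/part-001
./data/train/part-002

The data provider's process function is then called once for each path in the list.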
paddle.trainer_config_helpers.data_sources.define_py_data_sources(train_list, test_list, module, obj, args=None, train_async=False, data_cls=<function PyData>)

Define Python train/test data sources in one method. If train and test use the same data provider configuration, pass a single value for module/obj/args; otherwise pass a list or tuple of two values, train first and test second. For example:

define_py_data_sources("train.list", "test.list", module="data_provider"
                       obj="process", args={"dictionary": dict_name})

Or, when train and test use different data providers:

define_py_data_sources("train.list", "test.list", module="data_provider"
                       obj=["process_train", "process_test"],
                       args=[{"dictionary": dict_train}, {"dictionary": dict_test}])

For how to write the corresponding data provider, refer to the data provider documentation.
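As a rough illustration, the data_provider module referenced in the first example might look like the sketch below. This assumes the PyDataProvider2-style @provider decorator; the on_init hook name, the input types, and the "words<TAB>label" line format are made up for illustration, not part of this API.

# data_provider.py -- a minimal sketch, not a definitive implementation.
from paddle.trainer.PyDataProvider2 import provider, integer_value, \
    integer_value_sequence

def on_init(settings, dictionary, **kwargs):
    # Receives the args dict passed to define_py_data_sources,
    # e.g. {"dictionary": dict_name}.
    settings.dictionary = dictionary
    settings.input_types = [integer_value_sequence(len(dictionary)),
                            integer_value(2)]

@provider(init_hook=on_init)
def process(settings, file_name):
    # Invoked once for every path listed in train.list / test.list.
    with open(file_name) as f:
        for line in f:
            text, label = line.rstrip('\n').split('\t')
            word_ids = [settings.dictionary.get(w, 0) for w in text.split()]
            yield word_ids, int(label)

With such a module on the Python path, obj="process" resolves to the decorated process function, and the "dictionary" entry in args is handed to the init hook.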

Parameters:
  • data_cls
  • train_list (basestring) – Train list name.
  • test_list (basestring) – Test list name.
  • module (basestring or tuple or list) – Python module name. If train and test use different providers, pass a tuple or list of two module names.
  • obj (basestring or tuple or list) – Python object name. May be a function name if using PyDataProviderWrapper. If train and test use different providers, pass a tuple or list of two object names.
  • args (string or picklable object or list or tuple) – The best practice is to pass arguments into the DataProvider as a dict() and receive them with @init_hook_wrapper. If train and test use different providers, pass a tuple or list of two argument objects.
  • train_async (bool) – Whether the training data is loaded asynchronously.
Returns:
  None

Return type:
  None