Data Providers

Base DataProvider

class paddle::DataProvider

Base class for DataProvider, which supplies data for training.

Note
It can supply multiple streams of data. For typical supervised training there are two streams: one for the input and one for the label.

Subclassed by paddle::DataProviderGroup< T >, paddle::DummyDataProvider, paddle::MultiDataProvider, paddle::ProtoDataProvider, paddle::PyDataProvider, paddle::PyDataProvider2, paddle::SimpleDataProviderBase

Public Functions

DataProvider(const DataConfig &config, bool useGpu)
virtual ~DataProvider()
const DataConfig &getConfig() const
void setSkipShuffle()
int64_t getNextBatch(int64_t size, DataBatch *batch)

Get next batch of training samples.

Return
actual size of obtained training samples
Parameters
  • size -

    size of training samples to get

  • batch -

    a batch of training samples

virtual void shuffle() = 0

Shuffle the data set.

virtual void reset()

Reset all index values.

Note
reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
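The ordering rule in the note can be illustrated with a minimal sketch. BaseProvider and FileProvider below are hypothetical stand-ins, not Paddle classes. They only show a subclass resetting its own indices first and calling the base class reset() last, so that the subclass state is already valid when the base class finishes its own setup.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a base provider; not the real paddle::DataProvider.
class BaseProvider {
public:
  virtual ~BaseProvider() = default;
  // Base reset: marks the provider as ready to serve data.
  virtual void reset() { ready_ = true; }
  bool ready() const { return ready_; }

protected:
  bool ready_ = false;
};

// Hypothetical subclass that keeps its own read position.
class FileProvider : public BaseProvider {
public:
  explicit FileProvider(std::vector<std::string> files)
      : files_(std::move(files)) {}

  void reset() override {
    // 1. Reset the subclass's own index first.
    nextFile_ = 0;
    // 2. Call the base class reset() at the end, as the note requires.
    BaseProvider::reset();
  }

  // Returns the next file name, or an empty string when exhausted.
  std::string next() {
    assert(ready() && "reset() must be called before reading");
    return nextFile_ < files_.size() ? files_[nextFile_++] : std::string();
  }

private:
  std::vector<std::string> files_;
  std::size_t nextFile_ = 0;
};
```

Calling reset() a second time rewinds the hypothetical provider to the first file, which is the behaviour a new training pass would rely on.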

virtual int64_t getSize() = 0

Get the size of training samples.

Return
the number of training samples in the data set.
Note
return -1 to indicate unlimited number of samples.

virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch) = 0

Get the next batch of training samples internally.

Return
actual size of obtained training samples
Parameters
  • size -

    size of training samples to get

  • batch -

    a batch of training samples
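A typical caller drives the interface above as a loop: reset(), then repeated getNextBatch() calls. The sketch below uses hypothetical ToyProvider and ToyBatch types rather than the real Paddle classes, and it assumes that a return value of 0 marks the end of a pass, which the text above does not state explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical batch type; the real DataBatch can hold several data streams.
struct ToyBatch {
  std::vector<int> samples;
};

// Hypothetical provider over an in-memory vector; not a Paddle class.
class ToyProvider {
public:
  explicit ToyProvider(std::vector<int> data) : data_(std::move(data)) {}

  void reset() { pos_ = 0; }

  // Number of samples in the data set.
  int64_t getSize() const { return static_cast<int64_t>(data_.size()); }

  // Fills `batch` with up to `size` samples and returns how many were obtained.
  int64_t getNextBatch(int64_t size, ToyBatch *batch) {
    assert(batch != nullptr);
    const int64_t remaining = getSize() - pos_;
    const int64_t n = std::min(size, remaining);
    batch->samples.assign(data_.begin() + pos_, data_.begin() + pos_ + n);
    pos_ += n;
    return n;
  }

private:
  std::vector<int> data_;
  int64_t pos_ = 0;
};

// Runs one pass and returns the total number of samples consumed.
// Assumption: a return value of 0 marks the end of the pass.
int64_t runOnePass(ToyProvider &provider, int64_t batchSize) {
  provider.reset();  // must precede the first getNextBatch() call
  ToyBatch batch;
  int64_t total = 0;
  while (int64_t n = provider.getNextBatch(batchSize, &batch)) {
    total += n;
  }
  return total;
}
```

The last batch of a pass may be smaller than the requested size, which is why the return value is documented as the actual number of samples obtained.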

Public Static Functions

DataProvider *create(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu = FLAGS_use_gpu)
static DataProvider *create(const DataConfig &config, bool useGpu = FLAGS_use_gpu)

This overload of create() is used only for unit tests.

Public Static Attributes

ClassRegistrar<DataProvider, DataConfig, ModelConfig, bool> registrar_

Protected Functions

int64_t getNextBatchFromBuffer(int64_t size, DataBatch *batch)

Get the next batch of training samples from the buffer.

Return
actual size of obtained training samples
Parameters
  • size -

    size of training samples to get

  • batch -

    a batch of training samples

void initAsyncLoader()

Protected Attributes

DataConfig config_
bool skipShuffle_
float usageRatio_
bool useGpu_
std::unique_ptr<DoubleBuffer> doubleBuffer_
ThreadLocal<std::vector<MatrixPtr>> constantSlots_

DataProviderGroup

template <class T>
class paddle::DataProviderGroup

Inherits from paddle::DataProvider

Public Functions

DataProviderGroup(const DataConfig &config, bool useGpu)
~DataProviderGroup()
virtual void reset()

Reset all index values.

Note
reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.

virtual void shuffle()

Shuffle the data set.

virtual int64_t getSize()

Get the size of training samples.

Return
the number of training samples in the data set.
Note
return -1 to indicate unlimited number of samples.

virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)

Get the next batch of training samples internally.

Return
actual size of obtained training samples
Parameters
  • size -

    size of training samples to get

  • batch -

    a batch of training samples

Protected Types

typedef T ProviderType
typedef std::shared_ptr<ProviderType> ProviderPtrType

Protected Attributes

ProviderPtrType provider_
std::vector<std::string> fileList_
std::mutex lock_
std::unique_ptr<MultiThreadWorker<ProviderType>> loader_

MultiDataProvider

class paddle::MultiDataProvider

Inherits from paddle::DataProvider

Public Functions

MultiDataProvider(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)
~MultiDataProvider()
virtual void reset()

Reset all index values.

Note
reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.

virtual void shuffle()

Shuffle the data set.

virtual int64_t getSize()

Get the size of training samples.

Return
the number of training samples in the data set.
Note
return -1 to indicate unlimited number of samples.

virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)

Get the next batch of training samples internally.

Return
actual size of obtained training samples
Parameters
  • size -

    size of training samples to get

  • batch -

    a batch of training samples

bool isTestMode() const

Protected Attributes

std::vector<std::unique_ptr<DataProvider>> subDataProviders_

PyDataProvider

IFieldScanner

class paddle::IFieldScanner

FieldScanner Interface.

It reads a Python object and fills each slot of the argument. There are two steps, prepare and fill: the scanner allocates memory during the prepare step and writes data into the argument during the fill step.

Subclassed by paddle::DenseScanner, paddle::IndexScanner, paddle::SequenceScanner, paddle::SparseNonValueScanner

Public Functions

DISABLE_COPY(IFieldScanner)
IFieldScanner(SlotHeader *headerPtr)

Ctor.

Parameters
  • headerPtr -

    slot header that the scanner belongs to.

virtual ~IFieldScanner()
virtual void startPrepare(Argument &argument)

Start prepare step.

virtual void prepare(Argument &argument, PyObject *obj)

Prepare step.

Note
obj can be a single timestep of a sample or a whole sample, depending on the scanner type.

virtual void finishPrepare(Argument &argument)

Finish Prepare step.

virtual void startFill(Argument &argument)

Start fill step.

virtual void fill(Argument &argument, PyObject *obj)

Fill step.

Note
obj can be a single timestep of a sample or a whole sample, depending on the scanner type.

virtual void finishFill(Argument &argument)

Finish fill step.
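The prepare/fill protocol described above amounts to two passes over the same samples: one to size the destination, one to write into it. The following is a minimal sketch under stated assumptions: ToySlot and ToyDenseScanner are hypothetical, and a std::vector<float> stands in for the PyObject* that the real scanners receive.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical destination buffer; stands in for one slot of an Argument.
struct ToySlot {
  std::vector<float> values;  // row-major, rows * dim entries
  std::size_t rows = 0;
};

// Hypothetical two-phase scanner for fixed-width dense rows.
class ToyDenseScanner {
public:
  explicit ToyDenseScanner(std::size_t dim) : dim_(dim) {}

  // Prepare phase: only count how much space is needed.
  void startPrepare(ToySlot &) { rows_ = 0; }
  void prepare(ToySlot &, const std::vector<float> &) { ++rows_; }
  void finishPrepare(ToySlot &slot) {
    slot.rows = rows_;
    slot.values.assign(rows_ * dim_, 0.0f);  // one allocation for all rows
  }

  // Fill phase: copy each row into the preallocated buffer.
  void startFill(ToySlot &) { next_ = 0; }
  void fill(ToySlot &slot, const std::vector<float> &row) {
    assert(row.size() == dim_ && next_ < slot.rows);
    for (std::size_t i = 0; i < dim_; ++i) {
      slot.values[next_ * dim_ + i] = row[i];
    }
    ++next_;
  }
  void finishFill(ToySlot &) {}

private:
  std::size_t dim_;
  std::size_t rows_ = 0;
  std::size_t next_ = 0;
};

// Drives both phases over the same samples.
ToySlot scanAll(ToyDenseScanner &scanner,
                const std::vector<std::vector<float>> &samples) {
  ToySlot slot;
  scanner.startPrepare(slot);
  for (const auto &s : samples) scanner.prepare(slot, s);
  scanner.finishPrepare(slot);
  scanner.startFill(slot);
  for (const auto &s : samples) scanner.fill(slot, s);
  scanner.finishFill(slot);
  return slot;
}
```

Separating the two phases means the buffer is allocated once at its final size, at the cost of visiting every sample twice.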

Public Static Functions

IFieldScanner *create(SlotHeader *header)

Factory method. Creates a scanner from the header. The resulting scanner may be a combination of several scanners.

Note
Fatal if the header is not supported.

Protected Attributes

SlotHeader *headerPtr_

DenseScanner

class paddle::DenseScanner

Scanner for dense slot.

Inherits from paddle::IFieldScanner

Public Functions

DenseScanner(SlotHeader *ptr)
virtual void prepare(Argument &argument, PyObject *obj)

Prepare.

Parameters
  • argument -

    target argument

  • obj -

    each timestep of a sample.

virtual void finishPrepare(Argument &argument)

Finish Prepare step.

virtual void fill(Argument &argument, PyObject *obj)

Fill argument from obj.

Parameters
  • argument -

  • obj -

IndexScanner

class paddle::IndexScanner

Scanner for index slot

Inherits from paddle::IFieldScanner

Public Functions

IndexScanner(SlotHeader *ptr)
virtual void prepare(Argument &argument, PyObject *obj)

Prepare memory space.

Note
obj is a single timestep of a sample.

virtual void finishPrepare(Argument &argument)

Finish Prepare step.

virtual void fill(Argument &argument, PyObject *obj)

Fill one index to argument.

SparseNonValueScanner

class paddle::SparseNonValueScanner

Inherits from paddle::IFieldScanner

Subclassed by paddle::SparseValueScanner

Public Functions

SparseNonValueScanner(SlotHeader *ptr)
virtual void prepare(Argument &argument, PyObject *obj)

Prepare memory space

Note
obj is a timestep of one sample.

virtual void finishPrepare(Argument &argument)

Finish Prepare step.

virtual void startFill(Argument &argument)

Start fill step.

virtual void fill(Argument &argument, PyObject *obj)

Fill one sparse vector to argument.

Note
obj is a timestep of one sample.

Protected Functions

virtual void setData(int *col, real *dat, PyObject *obj)

Set a single sparse index and value.

Parameters
  • col -

    sparse index

  • dat -

    sparse value

  • obj -

    Python object. For sparse_non_value it is a PyInt or PyLong; for sparse_value it is a tuple (int, float).
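The relationship between the two sparse scanners can be sketched as a base class that writes only the column index and a derived class that overrides setData to write the value as well. The types below (ToyEntry, ToySparseNonValue, ToySparseValue) are hypothetical, and a plain struct replaces the Python object.

```cpp
#include <cassert>
#include <vector>

// Hypothetical input element: a column index plus a value.
struct ToyEntry {
  int index;
  float value;
};

// Hypothetical base scanner: records column indices only.
class ToySparseNonValue {
public:
  virtual ~ToySparseNonValue() = default;

  // Appends every entry of one sparse row to the output arrays.
  void fillRow(const std::vector<ToyEntry> &row, std::vector<int> &cols,
               std::vector<float> &vals) {
    for (const ToyEntry &e : row) {
      cols.push_back(0);
      vals.push_back(0.0f);
      setData(&cols.back(), &vals.back(), e);
    }
  }

protected:
  virtual void setData(int *col, float *dat, const ToyEntry &e) {
    assert(col != nullptr);
    (void)dat;  // a non-value slot stores no value
    *col = e.index;
  }
};

// Hypothetical derived scanner: records both index and value.
class ToySparseValue : public ToySparseNonValue {
protected:
  void setData(int *col, float *dat, const ToyEntry &e) override {
    assert(col != nullptr && dat != nullptr);
    *col = e.index;
    *dat = e.value;
  }
};
```

Only setData differs between the two classes in this sketch, so the row-walking logic is written once in the base class.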

Protected Attributes

size_t nnz_
size_t height_

SparseValueScanner

class paddle::SparseValueScanner

Inherits from paddle::SparseNonValueScanner

Public Functions

SparseValueScanner(SlotHeader *ptr)
virtual void finishPrepare(Argument &argument)

Finish Prepare step.

Protected Functions

virtual void setData(int *col, real *dat, PyObject *obj)

Set a single sparse index and value.

Parameters
  • col -

    sparse index

  • dat -

    sparse value

  • obj -

    Python object. For sparse_non_value it is a PyInt or PyLong; for sparse_value it is a tuple (int, float).

SequenceScanner

class paddle::SequenceScanner

Inherits from paddle::IFieldScanner

IPyDataProviderCache

class paddle::IPyDataProviderCache

Py Data Provider Cache Interface.

Subclassed by paddle::CacheOnePassInMemory, paddle::NoCacheStrategy

Public Functions

virtual ~IPyDataProviderCache()
virtual bool reset() = 0

Invoked when DataProvider::reset() is called.

Return
true if data must be read from Python.

virtual void drop(std::deque<PyObjectPtr> *data) = 0

Invoked when the data has been used by the DataProvider and needs to be cleared.

Note
The implementing class must clear this data array. If you want to delete the PyObjectPtr later instead, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
Parameters
  • data -

    used data.

virtual std::deque<PyObjectPtr> *load() = 0

Return all data in the cache.

Public Static Functions

IPyDataProviderCache *create(CacheType ct)

Factory method. Converts a CacheType to an IPyDataProviderCache*.

NoCacheStrategy

class paddle::NoCacheStrategy

No Cache Strategy. Destroys old data immediately and loads data from Python on every pass.

Inherits from paddle::IPyDataProviderCache

Public Functions

virtual bool reset()

Invoked when DataProvider::reset() is called.

Return
true if data must be read from Python.

virtual void drop(std::deque<PyObjectPtr> *data)

Invoked when the data has been used by the DataProvider and needs to be cleared.

Note
The implementing class must clear this data array. If you want to delete the PyObjectPtr later instead, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
Parameters
  • data -

    used data.

virtual std::deque<PyObjectPtr> *load()

Return all data in the cache.

CacheOnePassInMemory

class paddle::CacheOnePassInMemory

Cache One Pass In Memory strategy.

In the first pass, data is loaded from Python and stored in memory. In the remaining passes, data is loaded from memory.

Inherits from paddle::IPyDataProviderCache

Public Functions

CacheOnePassInMemory()
virtual bool reset()

Invoked when DataProvider::reset() is called.

Return
true if data must be read from Python.

virtual void drop(std::deque<PyObjectPtr> *data)

Invoked when the data has been used by the DataProvider and needs to be cleared.

Note
The implementing class must clear this data array. If you want to delete the PyObjectPtr later instead, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
Parameters
  • data -

    used data.

virtual std::deque<PyObjectPtr> *load()

Return all data in the cache.
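The difference between the two strategies can be sketched with the three documented methods. Everything below is hypothetical: ToyCache, ToyNoCache and ToyCacheOnePass are illustrative names, int stands in for PyObjectPtr, and the two-pool swap is one possible way to recycle a pass, not necessarily how Paddle implements it.

```cpp
#include <cassert>
#include <deque>

// Hypothetical cache interface modelled on the three documented methods.
class ToyCache {
public:
  virtual ~ToyCache() = default;
  virtual bool reset() = 0;                      // true: read from the source again
  virtual void drop(std::deque<int> *data) = 0;  // receives data already consumed
  virtual std::deque<int> *load() = 0;           // cached data, or nullptr if none
};

// Discards consumed data and asks for a fresh read on every pass.
class ToyNoCache : public ToyCache {
public:
  bool reset() override { return true; }
  void drop(std::deque<int> *data) override {
    assert(data != nullptr);
    data->clear();
  }
  std::deque<int> *load() override { return nullptr; }
};

// Keeps the first pass in memory and serves later passes from it.
class ToyCacheOnePass : public ToyCache {
public:
  bool reset() override {
    if (pool_.empty() && dropped_.empty()) return true;  // nothing cached yet
    if (pool_.empty()) pool_.swap(dropped_);             // recycle the last pass
    return false;
  }
  void drop(std::deque<int> *data) override {
    assert(data != nullptr);
    // Keep consumed items for the next pass instead of destroying them.
    dropped_.insert(dropped_.end(), data->begin(), data->end());
    data->clear();
  }
  std::deque<int> *load() override { return &pool_; }

private:
  std::deque<int> pool_;
  std::deque<int> dropped_;
};
```

In this sketch the provider would call reset() at the start of each pass and read from the original source only when it returns true; otherwise it would take its samples from load().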

PyDataProvider2

class paddle::PyDataProvider2

PyDataProvider2.

For usage, please refer to the Python module ‘paddle.trainer.PyDataProvider2’.

A separate thread is started to read data, so reading is fully asynchronous. Cache strategies are also supported.

Inherits from paddle::DataProvider

Public Functions

PyDataProvider2(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)

Ctor

virtual ~PyDataProvider2()

Dtor

Note
The loading thread is stopped on destruction.

virtual void reset()

Reset the PyDataProvider. The reading thread may be started here.

virtual void shuffle()

Shuffle. Does nothing, because the provider already shuffles implicitly by randomly selecting data from the data pool.

virtual int64_t getSize()

The size is not limited.

virtual int64_t getNextBatchInternal(int64_t size_, DataBatch *batch)

Load a batch of data.
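The reason shuffle() can be a no-op is that drawing samples at random from a pool already randomizes their order. A minimal sketch of that idea follows; takeRandom and drawBatch are hypothetical helpers, not part of Paddle, and the real data pool is filled asynchronously by the loading thread rather than being a fixed vector.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Removes and returns one uniformly chosen element from `pool`.
// Swapping with the back keeps removal O(1); pool order is not preserved.
template <class T, class Rng>
T takeRandom(std::vector<T> &pool, Rng &rng) {
  assert(!pool.empty());
  std::uniform_int_distribution<std::size_t> pick(0, pool.size() - 1);
  std::swap(pool[pick(rng)], pool.back());
  T item = std::move(pool.back());
  pool.pop_back();
  return item;
}

// Draws up to `size` samples from the pool to form one batch.
template <class T, class Rng>
std::vector<T> drawBatch(std::vector<T> &pool, std::size_t size, Rng &rng) {
  std::vector<T> batch;
  while (batch.size() < size && !pool.empty()) {
    batch.push_back(takeRandom(pool, rng));
  }
  return batch;
}
```

Each sample is returned exactly once per pool, so the batches together form a random permutation of the pool contents.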

Proto Data Provider

ProtoDataProvider

class paddle::ProtoDataProvider

Provides data from a protobuf data file, with each sample specified by a proto message.

DataSample is defined in DataFormat.proto.

The file format is

header

sample1

sample2

...

sampleN

Note
In the data file, each message is prefixed with its length. Reading and writing of the protobuf messages are implemented in ProtoReader.h.
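Length-prefixed framing, as described in the note, can be sketched as follows. This is an illustration only: it assumes a base-128 varint length prefix, a common choice for protobuf streams, while the actual encoding is whatever ProtoReader.h implements. Plain strings stand in for serialized messages, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends `n` as a base-128 varint (7 payload bits per byte, low bits first).
void writeVarint32(uint32_t n, std::string *out) {
  assert(out != nullptr);
  while (n >= 0x80) {
    out->push_back(static_cast<char>((n & 0x7F) | 0x80));
    n >>= 7;
  }
  out->push_back(static_cast<char>(n));
}

// Reads a varint at `*pos`, advancing it. Returns false on truncated input.
bool readVarint32(const std::string &in, std::size_t *pos, uint32_t *n) {
  uint32_t result = 0;
  for (int shift = 0; shift < 35 && *pos < in.size(); shift += 7) {
    const uint8_t byte = static_cast<uint8_t>(in[(*pos)++]);
    result |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *n = result;
      return true;
    }
  }
  return false;
}

// Appends one record: length prefix followed by the payload bytes.
void writeRecord(const std::string &payload, std::string *out) {
  writeVarint32(static_cast<uint32_t>(payload.size()), out);
  out->append(payload);
}

// Splits a buffer back into records (header first, then the samples).
std::vector<std::string> readRecords(const std::string &in) {
  std::vector<std::string> records;
  std::size_t pos = 0;
  uint32_t len = 0;
  while (pos < in.size() && readVarint32(in, &pos, &len)) {
    if (in.size() - pos < len) break;  // truncated payload
    records.push_back(in.substr(pos, len));
    pos += len;
  }
  return records;
}
```

Because every record carries its own length, a reader can step through header and samples one at a time without knowing their sizes in advance.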

Inherits from paddle::DataProvider

Subclassed by paddle::ProtoSequenceDataProvider

Public Functions

ProtoDataProvider(const DataConfig &config, bool useGpu, bool loadDataAll = true)
virtual void reset()

Reset all index values.

Note
reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.

virtual int64_t getSize()

Note
this size includes the sequences which are skipped because they are longer than the batch size.

virtual void shuffle()

Shuffle the data set.

void loadData(const std::vector<std::string> &fileList)
virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)

Get the next batch of training samples internally.

Return
actual size of obtained training samples
Parameters
  • size -

    size of training samples to get

  • batch -

    a batch of training samples

Protected Functions

void loadData(const std::string &fileName)

Load protobuf data from a list of files.

Parameters
  • fileName -

    file name of a file which contains a list of file names

void loadDataFile(const std::string &fileName)

Load protobuf data from a file.

Parameters
  • fileName -

    data file name

void checkDataHeader(const DataHeader &header)

check data header of each data sample

Parameters
  • header -

    data header read from protobuf data

void fillSlots(const DataSample &sample)

Fill protobuf data into slots_, an in-memory vector of ProtoSlot.

Parameters
  • sample -

    data sample read from protobuf data

bool iidData() const

return true if each sample is one sequence, i.e., independent of other samples.

void checkSample(const DataSample &sample)

check that sample is consistent with header_

template <class Op>
int64_t sequenceLoop(Op op, int64_t size)
template <class Op>
int64_t sampleLoop(Op op, int64_t size)
template <class Op>
int64_t subSampleLoop(Op op, int64_t size, int slot)
void showDataStats()

Protected Attributes

DataHeader header_
int numVecSlots_
std::vector<ProtoSlot> slots_
size_t sampleNums_
std::vector<size_t> sequenceStartPositions_

The starting position of each sequence in the samples. The last element should be the number of samples. If empty, each sample is one sequence.

int64_t currentSequenceIndex_
std::vector<size_t> shuffledSequenceIds_
ThreadLocalD<DataBatch> cpuBatch_
ThreadLocalD<DataBatch> gpuBatch_
RWLock lock_
std::vector<StatPtr> nnzStats_
struct ProtoSlot

Public Members

SlotDef::SlotType type
int dim
std::vector<int> indexData
std::vector<real> denseData
std::vector<sparse_non_value_t> sparseNonValueData
std::vector<sparse_float_value_t> sparseFloatValueData
std::vector<int64_t> indices
std::vector<int64_t> subIndices
std::vector<ProtoVarSlot> varDenseData
std::vector<std::vector<int>> varIndices
std::vector<std::string> strData
struct ProtoVarSlot

Public Members

std::vector<real> data
std::vector<int> dims
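The convention documented for sequenceStartPositions_ (start offsets, with the total sample count as the final element, and an empty vector meaning one sample per sequence) can be made concrete with a small helper. sequenceLengths is a hypothetical function written for illustration and is not part of ProtoDataProvider.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the length of each sequence given the start positions.
// The last element of `starts` must equal the number of samples; an empty
// vector means every sample is its own sequence of length 1.
std::vector<std::size_t> sequenceLengths(const std::vector<std::size_t> &starts,
                                         std::size_t numSamples) {
  std::vector<std::size_t> lengths;
  if (starts.empty()) {
    lengths.assign(numSamples, 1);
    return lengths;
  }
  assert(starts.back() == numSamples);
  for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
    lengths.push_back(starts[i + 1] - starts[i]);
  }
  return lengths;
}
```

For example, start positions {0, 3, 5, 9} over 9 samples describe three sequences of lengths 3, 2 and 4.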

ProtoSequenceDataProvider

class paddle::ProtoSequenceDataProvider

Special use for Proto data: instances should contain sparse-non-value slots and a label.

Note
ProtoSequenceDataProvider treats each SPARSE SLOT as a SEQUENCE

Inherits from paddle::ProtoDataProvider

Public Functions

ProtoSequenceDataProvider(const DataConfig &config, bool useGpu, bool loadDataAll = true)
~ProtoSequenceDataProvider()
virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)

Get the next batch of training samples internally.

Return
actual size of obtained training samples
Parameters
  • size -

    size of training samples to get

  • batch -

    a batch of training samples