Data Providers¶

Base DataProvider¶

class paddle::DataProvider¶

Base class for DataProvider, which supplies data for training.

Note: It can supply multiple streams of data. For typical supervised training, there are two streams: one for the input and one for the label.

Subclassed by paddle::DataProviderGroup< T >, paddle::DummyDataProvider, paddle::MultiDataProvider, paddle::ProtoDataProvider, paddle::PyDataProvider, paddle::PyDataProvider2, paddle::SimpleDataProviderBase

Public Functions

DataProvider(const DataConfig &config, bool useGpu)¶

virtual ~DataProvider()¶

const DataConfig &getConfig() const¶

void setSkipShuffle()¶

int64_t getNextBatch(int64_t size, DataBatch *batch)¶

Get next batch of training samples.

Return

actual size of obtained training samples

Parameters

size -
size of training samples to get
batch -
a batch of training samples

virtual void shuffle() = 0¶: Shuffle the data set.

virtual void reset()¶

Reset all index values.

Note: reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass reset() should always call the base class reset() at the end of the function.
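The ordering rule in the note above can be illustrated with a minimal, self-contained sketch. SketchProvider and VectorProvider are hypothetical stand-ins, not the real paddle classes, and returning -1 when reset() was skipped is an assumption made only so the sketch can show the rule; the real behavior is not documented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for paddle::DataProvider; not the real class.
class SketchProvider {
public:
  virtual ~SketchProvider() = default;

  // Base reset(): marks the provider as ready to serve batches.
  virtual void reset() { ready_ = true; }

  // Public entry point that delegates to the subclass hook.
  int64_t getNextBatch(int64_t size, std::vector<int>* batch) {
    if (!ready_) return -1;  // sketch-only signal that reset() was skipped
    return getNextBatchInternal(size, batch);
  }

protected:
  virtual int64_t getNextBatchInternal(int64_t size,
                                       std::vector<int>* batch) = 0;

private:
  bool ready_ = false;
};

// Hypothetical subclass that serves samples from an in-memory vector.
class VectorProvider : public SketchProvider {
public:
  explicit VectorProvider(std::vector<int> data) : data_(std::move(data)) {}

  void reset() override {
    next_ = 0;                // reset subclass state first
    SketchProvider::reset();  // call the base class reset() last
  }

protected:
  int64_t getNextBatchInternal(int64_t size,
                               std::vector<int>* batch) override {
    batch->clear();
    while (next_ < data_.size() &&
           static_cast<int64_t>(batch->size()) < size) {
      batch->push_back(data_[next_++]);
    }
    // The actual size may be smaller than the requested size.
    return static_cast<int64_t>(batch->size());
  }

private:
  std::vector<int> data_;
  std::size_t next_ = 0;
};
```

In this sketch the returned value is the number of samples actually obtained, which can be smaller than the requested size once the data runs out.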

virtual int64_t getSize() = 0¶

Get the size of training samples.

Return: the number of training samples in the data set.
Note: return -1 to indicate unlimited number of samples.

virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch) = 0¶

Get the next batch of training samples internally.

Return

actual size of obtained training samples

Parameters

size -
size of training samples to get
batch -
a batch of training samples

Public Static Functions

DataProvider *create(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu = FLAGS_use_gpu)¶

static DataProvider *create(const DataConfig &config, bool useGpu = FLAGS_use_gpu)¶: This overload of create is used only for unit tests.

Public Static Attributes

ClassRegistrar<DataProvider, DataConfig, ModelConfig, bool> registrar_¶
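The combination of a static create() factory and a registrar_ attribute suggests a registry that maps a type name to a creator function. The following is only a sketch of that general pattern. SketchRegistrar, SketchConfig, registerClass and the lookup by a type string are all assumptions for illustration; the real ClassRegistrar interface is not described on this page.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for DataConfig; only carries a type name.
struct SketchConfig {
  std::string type;
};

// Hypothetical stand-in for the DataProvider base class.
struct SketchProviderBase {
  virtual ~SketchProviderBase() = default;
  virtual std::string name() const = 0;
};

// Minimal registry: type name -> creator function.
class SketchRegistrar {
public:
  using Creator = std::function<std::unique_ptr<SketchProviderBase>(
      const SketchConfig&, bool)>;

  void registerClass(const std::string& type, Creator creator) {
    creators_[type] = std::move(creator);
  }

  // Returns nullptr when no creator is registered for config.type.
  std::unique_ptr<SketchProviderBase> create(const SketchConfig& config,
                                             bool useGpu) const {
    auto it = creators_.find(config.type);
    if (it == creators_.end()) return nullptr;
    return it->second(config, useGpu);
  }

private:
  std::map<std::string, Creator> creators_;
};

// Hypothetical concrete provider used to exercise the registry.
struct DummySketch : SketchProviderBase {
  std::string name() const override { return "dummy"; }
};
```

A registry like this lets new provider types be added without editing the factory function itself.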

Protected Functions

int64_t getNextBatchFromBuffer(int64_t size, DataBatch *batch)¶

Get the next batch of training samples from the buffer.

Return

actual size of obtained training samples

Parameters

size -
size of training samples to get
batch -
a batch of training samples

void initAsyncLoader()¶

Protected Attributes

DataConfig config_¶

bool skipShuffle_¶

float usageRatio_¶

bool useGpu_¶

std::unique_ptr<DoubleBuffer> doubleBuffer_¶

ThreadLocal<std::vector<MatrixPtr>> constantSlots_¶

DataProviderGroup¶

template <class T>

class paddle::DataProviderGroup¶

Inherits from paddle::DataProvider

Public Functions

DataProviderGroup(const DataConfig &config, bool useGpu)¶

~DataProviderGroup()¶

virtual void reset()¶

Reset all index values.

Note: reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass reset() should always call the base class reset() at the end of the function.

virtual void shuffle()¶: Shuffle the data set.

virtual int64_t getSize()¶

Get the size of training samples.

Return: the number of training samples in the data set.
Note: return -1 to indicate unlimited number of samples.

virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)¶

Get the next batch of training samples internally.

Return

actual size of obtained training samples

Parameters

size -
size of training samples to get
batch -
a batch of training samples

Protected Types

typedef T ProviderType¶

typedef std::shared_ptr<ProviderType> ProviderPtrType¶

Protected Attributes

ProviderPtrType provider_¶

std::vector<std::string> fileList_¶

std::mutex lock_¶

std::unique_ptr<MultiThreadWorker<ProviderType>> loader_¶

MultiDataProvider¶

class paddle::MultiDataProvider¶

Inherits from paddle::DataProvider

Public Functions

MultiDataProvider(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)¶

~MultiDataProvider()¶

virtual void reset()¶

Reset all index values.

Note: reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass reset() should always call the base class reset() at the end of the function.

virtual void shuffle()¶: Shuffle the data set.

virtual int64_t getSize()¶

Get the size of training samples.

Return: the number of training samples in the data set.
Note: return -1 to indicate unlimited number of samples.

virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)¶

Get the next batch of training samples internally.

Return

actual size of obtained training samples

Parameters

size -
size of training samples to get
batch -
a batch of training samples

bool isTestMode() const¶

Protected Attributes

std::vector<std::unique_ptr<DataProvider>> subDataProviders_¶

PyDataProvider¶

IFieldScanner¶

class paddle::IFieldScanner¶

FieldScanner Interface.

It reads a Python object and fills each slot of the argument. There are two steps, prepare and fill. The scanner allocates memory during the prepare step and writes data into the argument during the fill step.

Subclassed by paddle::DenseScanner, paddle::IndexScanner, paddle::SequenceScanner, paddle::SparseNonValueScanner

Public Functions

DISABLE_COPY(IFieldScanner)¶

IFieldScanner(SlotHeader *headerPtr)¶

Ctor.

Parameters

headerPtr -
slot header that the scanner belongs to.

virtual ~IFieldScanner()¶

virtual void startPrepare(Argument &argument)¶: Start prepare step.

virtual void prepare(Argument &argument, PyObject *obj)¶

Prepare step.

Note: obj can be a single timestep of a sample or a whole sample, depending on the type of scanner.

virtual void finishPrepare(Argument &argument)¶: Finish Prepare step.

virtual void startFill(Argument &argument)¶: Start fill step.

virtual void fill(Argument &argument, PyObject *obj)¶

Fill step.

Note: obj can be a single timestep of a sample or a whole sample, depending on the type of scanner.

virtual void finishFill(Argument &argument)¶: Finish fill step.
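The two-step protocol (prepare, then fill) can be sketched as two passes over the same samples: the first pass only counts, memory is allocated once, and the second pass writes values. TwoPhaseScanner, SketchArgument and scanAll below are hypothetical, and a std::vector<float> stands in for the Python object; none of this is the real paddle code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Argument: one flat buffer of values.
struct SketchArgument {
  std::vector<float> values;
};

// Hypothetical scanner following the start/step/finish call order.
class TwoPhaseScanner {
public:
  void startPrepare(SketchArgument&) { count_ = 0; }

  // Prepare step: only count how many values will be needed.
  void prepare(SketchArgument&, const std::vector<float>& obj) {
    count_ += obj.size();
  }

  // Allocate once, after every sample has been seen.
  void finishPrepare(SketchArgument& arg) { arg.values.assign(count_, 0.0f); }

  void startFill(SketchArgument&) { offset_ = 0; }

  // Fill step: write values into the memory allocated above.
  void fill(SketchArgument& arg, const std::vector<float>& obj) {
    for (float v : obj) arg.values[offset_++] = v;
  }

  void finishFill(SketchArgument&) {}

private:
  std::size_t count_ = 0;
  std::size_t offset_ = 0;
};

// Drives both passes over the same list of samples.
inline void scanAll(TwoPhaseScanner& scanner, SketchArgument& arg,
                    const std::vector<std::vector<float>>& samples) {
  scanner.startPrepare(arg);
  for (const auto& s : samples) scanner.prepare(arg, s);
  scanner.finishPrepare(arg);

  scanner.startFill(arg);
  for (const auto& s : samples) scanner.fill(arg, s);
  scanner.finishFill(arg);
}
```

Splitting the work this way avoids repeated reallocation while filling, at the cost of visiting each sample twice.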

Public Static Functions

IFieldScanner *create(SlotHeader *header)¶

Factory method. Creates a scanner from the header. The resulting scanner may be a combination of several scanners.

Note: Fatal if the header is not supported.

Protected Attributes

SlotHeader *headerPtr_¶

DenseScanner¶

class paddle::DenseScanner¶

Scanner for dense slot.

Inherits from paddle::IFieldScanner

Public Functions

DenseScanner(SlotHeader *ptr)¶

virtual void prepare(Argument &argument, PyObject *obj)¶

Prepare.

Parameters

argument -
target argument
obj -
each timestep of a sample.

virtual void finishPrepare(Argument &argument)¶: Finish Prepare step.

virtual void fill(Argument &argument, PyObject *obj)¶

Fill argument from obj.

Parameters

argument -
target argument
obj -
each timestep of a sample.

IndexScanner¶

class paddle::IndexScanner¶

Scanner for index slot

Inherits from paddle::IFieldScanner

Public Functions

IndexScanner(SlotHeader *ptr)¶

virtual void prepare(Argument &argument, PyObject *obj)¶

Prepare memory space.

Note: obj is a single timestep of sample

virtual void finishPrepare(Argument &argument)¶: Finish Prepare step.

virtual void fill(Argument &argument, PyObject *obj)¶: Fill one index to argument.

SparseNonValueScanner¶

class paddle::SparseNonValueScanner¶

Inherits from paddle::IFieldScanner

Subclassed by paddle::SparseValueScanner

Public Functions

SparseNonValueScanner(SlotHeader *ptr)¶

virtual void prepare(Argument &argument, PyObject *obj)¶

Prepare memory space

Note: obj is a timestep of one sample.

virtual void finishPrepare(Argument &argument)¶: Finish Prepare step.

virtual void startFill(Argument &argument)¶: Start fill step.

virtual void fill(Argument &argument, PyObject *obj)¶

Fill one sparse vector to argument.

Note: obj is a timestep of one sample.

Protected Functions

virtual void setData(int *col, real *dat, PyObject *obj)¶

Set a single sparse index and value.

Parameters

col -
sparse index
dat -
sparse value
obj -
Python object. For sparse_non_value it is a PyInt or PyLong. For sparse_value it is a tuple (int, float).
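The difference between the two kinds of sparse entry can be sketched with a virtual setData that a value-carrying subclass overrides. NonValueSetter, ValueSetter and SketchEntry are hypothetical; a std::pair<int, float> stands in for the Python object, and leaving dat untouched in the non-value case is an assumption of this sketch.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the Python object: (index, value).
// In the non-value case only the index part is meaningful.
using SketchEntry = std::pair<int, float>;

// Sketch of the non-value case: only the column index is written.
class NonValueSetter {
public:
  virtual ~NonValueSetter() = default;

  virtual void setData(int* col, float* dat, const SketchEntry& obj) const {
    (void)dat;  // assumption: no value is stored for non-value slots
    *col = obj.first;
  }
};

// Sketch of the value case: both index and value are written.
class ValueSetter : public NonValueSetter {
public:
  void setData(int* col, float* dat, const SketchEntry& obj) const override {
    *col = obj.first;
    *dat = obj.second;
  }
};
```

With this shape, the code that walks over the entries can stay in the base class while only the per-entry write differs.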

Protected Attributes

size_t nnz_¶

size_t height_¶

SparseValueScanner¶

class paddle::SparseValueScanner¶

Inherits from paddle::SparseNonValueScanner

Public Functions

SparseValueScanner(SlotHeader *ptr)¶

virtual void finishPrepare(Argument &argument)¶: Finish Prepare step.

Protected Functions

virtual void setData(int *col, real *dat, PyObject *obj)¶

Set a single sparse index and value.

Parameters

col -
sparse index
dat -
sparse value
obj -
Python object. For sparse_non_value it is a PyInt or PyLong. For sparse_value it is a tuple (int, float).

SequenceScanner¶

class paddle::SequenceScanner¶

Inherits from paddle::IFieldScanner

IPyDataProviderCache¶

class paddle::IPyDataProviderCache¶

Py Data Provider Cache Interface.

Subclassed by paddle::CacheOnePassInMemory, paddle::NoCacheStrategy

Public Functions

virtual ~IPyDataProviderCache()¶

virtual bool reset() = 0¶

Invoked when DataProvider::reset() is called.

Return: true if data is read from Python.

virtual void drop(std::deque<PyObjectPtr> *data) = 0¶

Invoked when the data has been used by the DataProvider and needs to be cleared.

Note

The implementing class must clear this data array. If you want to delete the PyObjectPtr later instead, make sure that the paddle process has only one active thread calling Python code (otherwise use PyGuard).

Parameters

data -
used data.

virtual std::deque<PyObjectPtr> *load() = 0¶: Return whole data in cache.

Public Static Functions

IPyDataProviderCache *create(CacheType ct)¶: Factory method. Convert CacheType to IPyDataProviderCache*

NoCacheStrategy¶

class paddle::NoCacheStrategy¶

No Cache Strategy. Destroys old data immediately and loads data from Python on every pass.

Inherits from paddle::IPyDataProviderCache

Public Functions

virtual bool reset()¶

Invoked when DataProvider::reset() is called.

Return: true if data is read from Python.

virtual void drop(std::deque<PyObjectPtr> *data)¶

Invoked when the data has been used by the DataProvider and needs to be cleared.

Note

The implementing class must clear this data array. If you want to delete the PyObjectPtr later instead, make sure that the paddle process has only one active thread calling Python code (otherwise use PyGuard).

Parameters

data -
used data.

virtual std::deque<PyObjectPtr> *load()¶: Return whole data in cache.

CacheOnePassInMemory¶

class paddle::CacheOnePassInMemory¶

Cache One Pass In Memory strategy.

In the first pass, data is loaded from Python and stored in memory. In the remaining passes, data is loaded from memory.

Inherits from paddle::IPyDataProviderCache

Public Functions

CacheOnePassInMemory()¶

virtual bool reset()¶

Invoked when DataProvider::reset() is called.

Return: true if data is read from Python.

virtual void drop(std::deque<PyObjectPtr> *data)¶

Invoked when the data has been used by the DataProvider and needs to be cleared.

Note

The implementing class must clear this data array. If you want to delete the PyObjectPtr later instead, make sure that the paddle process has only one active thread calling Python code (otherwise use PyGuard).

Parameters

data -
used data.

virtual std::deque<PyObjectPtr> *load()¶: Return whole data in cache.
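The two cache strategies can be contrasted with a simplified sketch of the reset/drop/load interface. SketchCache, SketchNoCache and SketchOnePassCache are hypothetical, an int stands in for PyObjectPtr, and the exact conditions under which the real classes return true from reset() are assumptions here.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for PyObjectPtr.
using Item = int;

// Sketch of the cache interface described above.
class SketchCache {
public:
  virtual ~SketchCache() = default;
  virtual bool reset() = 0;                       // true: read from source
  virtual void drop(std::deque<Item>* data) = 0;  // data has been used
  virtual std::deque<Item>* load() = 0;           // whole cached data
};

// No caching: used data is discarded, every pass reads from the source.
class SketchNoCache : public SketchCache {
public:
  bool reset() override { return true; }
  void drop(std::deque<Item>* data) override { data->clear(); }
  std::deque<Item>* load() override { return nullptr; }
};

// One-pass caching: used data is kept, later passes read from memory.
class SketchOnePassCache : public SketchCache {
public:
  bool reset() override {
    if (firstPass_) {
      firstPass_ = false;
      return true;  // first pass: the caller reads from the source
    }
    return false;   // later passes: the caller uses load()
  }

  void drop(std::deque<Item>* data) override {
    // Keep the used items instead of destroying them, then clear the input.
    for (const Item& item : *data) cached_.push_back(item);
    data->clear();
  }

  std::deque<Item>* load() override { return &cached_; }

private:
  bool firstPass_ = true;
  std::deque<Item> cached_;
};
```

In both sketches drop() leaves the caller's deque empty, which matches the requirement that the implementing class clears the data array.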

IPyDataProvider¶

class paddle::PyDataProvider2¶

PyDataProvider2.

For usage, please refer to the Python module ‘paddle.trainer.PyDataProvider2’.

A separate thread is started to read the data, so reading is fully asynchronous. Cache strategies are also supported.

Inherits from paddle::DataProvider

Public Functions

PyDataProvider2(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)¶: Ctor

virtual ~PyDataProvider2()¶

Dtor

Note: will stop loading thread when destructing

virtual void reset()¶: Resetting the PyDataProvider. May start reading thread here.

virtual void shuffle()¶: Shuffle. Does nothing, because PyDataProvider shuffles implicitly by randomly selecting data from the data pool.

virtual int64_t getSize()¶: The size is not limited.

virtual int64_t getNextBatchInternal(int64_t size_, DataBatch *batch)¶: Loading a batch of data.
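The asynchronous design, in which a loading thread fills a pool while batches are taken from it, can be sketched with standard C++ threading primitives. AsyncPoolSketch is hypothetical: it produces integers instead of calling Python, and it serves samples in FIFO order, whereas the documented provider selects data from the pool at random.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch of a provider backed by a loading thread.
class AsyncPoolSketch {
public:
  explicit AsyncPoolSketch(int total) : total_(total) {}

  // The loading thread is stopped on destruction.
  ~AsyncPoolSketch() { stop(); }

  // (Re)starts the loading thread.
  void reset() {
    stop();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pool_.clear();
      done_ = false;
      exit_ = false;
    }
    loader_ = std::thread([this] { load(); });
  }

  // Blocks until `size` samples are pooled or loading has finished.
  std::size_t getNextBatch(std::size_t size, std::vector<int>* batch) {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [&] { return pool_.size() >= size || done_; });
    batch->clear();
    while (!pool_.empty() && batch->size() < size) {
      batch->push_back(pool_.front());
      pool_.pop_front();
    }
    return batch->size();
  }

private:
  // Body of the loading thread: push samples until done or asked to exit.
  void load() {
    for (int i = 0; i < total_; ++i) {
      std::lock_guard<std::mutex> lock(mutex_);
      if (exit_) break;
      pool_.push_back(i);
      ready_.notify_all();
    }
    std::lock_guard<std::mutex> lock(mutex_);
    done_ = true;
    ready_.notify_all();
  }

  void stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      exit_ = true;
    }
    if (loader_.joinable()) loader_.join();
  }

  int total_;
  std::deque<int> pool_;
  bool done_ = true;   // nothing to wait for before the first reset()
  bool exit_ = false;
  std::mutex mutex_;
  std::condition_variable ready_;
  std::thread loader_;
};
```

Because the consumer only waits on the pool, the time spent producing samples overlaps with the time spent consuming earlier batches.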

Proto Data Provider¶

ProtoDataProvider¶

class paddle::ProtoDataProvider¶

Provides data from a protobuf data file, with each sample specified by a proto message.

DataSample defined in DataFormat.proto.

The file format is

header

sample1

sample2

...

sampleN

Note: In the data file, each message is prefixed with its length. Reading and writing of the protobuf messages are implemented in ProtoReader.h.
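Length-prefixed framing, as described in the note above, can be sketched as follows. The functions writeRecord and readRecord are hypothetical, plain strings stand in for serialized proto messages, and the fixed 32-bit native-endian prefix is an assumption of this sketch; the actual encoding used by ProtoReader.h is not specified on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes one record as: 4-byte length (sketch assumption), then payload.
inline void writeRecord(std::ostream& out, const std::string& payload) {
  const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
  out.write(reinterpret_cast<const char*>(&len), sizeof(len));
  out.write(payload.data(), static_cast<std::streamsize>(len));
}

// Reads one record; returns false at end of stream or on a short read.
inline bool readRecord(std::istream& in, std::string* payload) {
  std::uint32_t len = 0;
  if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) return false;
  payload->resize(len);
  if (len > 0 &&
      !in.read(&(*payload)[0], static_cast<std::streamsize>(len))) {
    return false;
  }
  return true;
}
```

The prefix lets a reader find message boundaries without parsing the message body, so a header followed by any number of samples can share one file.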

Inherits from paddle::DataProvider

Subclassed by paddle::ProtoSequenceDataProvider

Public Functions

ProtoDataProvider(const DataConfig &config, bool useGpu, bool loadDataAll = true)¶

virtual void reset()¶

Reset all index values.

Note: reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass reset() should always call the base class reset() at the end of the function.

virtual int64_t getSize()¶

Note: this size includes the sequences which are skipped because they are longer than the batch size.

virtual void shuffle()¶: Shuffle the data set.

void loadData(const std::vector<std::string> &fileList)¶

virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)¶

Get the next batch of training samples internally.

Return

actual size of obtained training samples

Parameters

size -
size of training samples to get
batch -
a batch of training samples

Protected Functions

void loadData(const std::string &fileName)¶

Load protobuf data from a list of files.

Parameters

fileName -
name of a file that contains a list of data file names

void loadDataFile(const std::string &fileName)¶

Load protobuf data from a file.

Parameters

fileName -
data file name

void checkDataHeader(const DataHeader &header)¶

Check the data header of each data sample.

Parameters

header -
data header read from protobuf data

void fillSlots(const DataSample &sample)¶

Fill protobuf data into slots_, an in-memory vector of ProtoSlot.

Parameters

sample -
data sample read from protobuf data

bool iidData() const¶: return true if each sample is one sequence, i.e., independent of other samples.

void checkSample(const DataSample &sample)¶: check that sample is consistent with header_

template <class Op>
int64_t sequenceLoop(Op op, int64_t size)¶

template <class Op>
int64_t sampleLoop(Op op, int64_t size)¶

template <class Op>
int64_t subSampleLoop(Op op, int64_t size, int slot)¶

void showDataStats()¶

Protected Attributes

DataHeader header_¶

int numVecSlots_¶

std::vector<ProtoSlot> slots_¶

size_t sampleNums_¶

std::vector<size_t> sequenceStartPositions_¶: The starting position of each sequence in the samples. The last element should be the number of samples. If empty, each sample is one sequence.
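The start-position layout described above can be made concrete with a small helper that recovers per-sequence lengths. The function sequenceLengths is hypothetical and not part of the documented class; it only illustrates how the stored positions are interpreted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the length of each sequence given its start positions.
// The last element of `starts` is the total number of samples, so
// N + 1 positions describe N sequences. If `starts` is empty, every
// sample is treated as its own sequence of length 1.
inline std::vector<std::size_t> sequenceLengths(
    const std::vector<std::size_t>& starts, std::size_t numSamples) {
  if (starts.empty()) return std::vector<std::size_t>(numSamples, 1);
  std::vector<std::size_t> lengths;
  for (std::size_t i = 0; i + 1 < starts.size(); ++i) {
    lengths.push_back(starts[i + 1] - starts[i]);
  }
  return lengths;
}
```

For example, start positions 0, 3, 5, 9 over nine samples describe three sequences of lengths 3, 2 and 4.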

int64_t currentSequenceIndex_¶

std::vector<size_t> shuffledSequenceIds_¶

ThreadLocalD<DataBatch> cpuBatch_¶

ThreadLocalD<DataBatch> gpuBatch_¶

RWLock lock_¶

std::vector<StatPtr> nnzStats_¶

struct ProtoSlot¶

Public Members

SlotDef::SlotType type¶

int dim¶

std::vector<int> indexData¶

std::vector<real> denseData¶

std::vector<sparse_non_value_t> sparseNonValueData¶

std::vector<sparse_float_value_t> sparseFloatValueData¶

std::vector<int64_t> indices¶

std::vector<int64_t> subIndices¶

std::vector<ProtoVarSlot> varDenseData¶

std::vector<std::vector<int>> varIndices¶

std::vector<std::string> strData¶

struct ProtoVarSlot¶

Public Members

std::vector<real> data¶

std::vector<int> dims¶

ProtoSequenceDataProvider¶

class paddle::ProtoSequenceDataProvider¶

Special use for Proto data: instances should contain sparse-non-value slots and label.

Note: ProtoSequenceDataProvider treats each SPARSE SLOT as a SEQUENCE

Inherits from paddle::ProtoDataProvider

Public Functions

ProtoSequenceDataProvider(const DataConfig &config, bool useGpu, bool loadDataAll = true)¶

~ProtoSequenceDataProvider()¶

virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)¶

Get the next batch of training samples internally.

Return

actual size of obtained training samples

Parameters

size -
size of training samples to get
batch -
a batch of training samples
