Data Providers¶
Base DataProvider¶
-
class
paddle::
DataProvider
¶ Base class for DataProvider, which supplies data for training.
- Note
- It can supply multiple streams of data. For typical supervised training there are two streams: one for the input and one for the label.
Subclassed by paddle::DataProviderGroup< T >, paddle::DummyDataProvider, paddle::MultiDataProvider, paddle::ProtoDataProvider, paddle::PyDataProvider, paddle::PyDataProvider2, paddle::SimpleDataProviderBase
Public Functions
-
DataProvider
(const DataConfig &config, bool useGpu)¶
-
virtual
~DataProvider
()¶
-
const DataConfig &
getConfig
() const¶
-
void
setSkipShuffle
()¶
-
int64_t
getNextBatch
(int64_t size, DataBatch *batch)¶ Get next batch of training samples.
- Return
- actual size of obtained training samples
- Parameters
size
: size of training samples to get
batch
: a batch of training samples
-
virtual void
shuffle
() = 0¶ Shuffle the data set.
-
virtual void
reset
()¶ Reset all index values.
- Note
- reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
-
virtual int64_t
getSize
() = 0¶ Get the size of training samples.
- Return
- the number of training samples in the data set.
- Note
- return -1 to indicate unlimited number of samples.
-
virtual int64_t
getNextBatchInternal
(int64_t size, DataBatch *batch) = 0¶ Get next batch training samples internally.
- Return
- actual size of obtained training samples
- Parameters
size
: size of training samples to get
batch
: a batch of training samples
Public Static Functions
-
DataProvider *
create
(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu = FLAGS_use_gpu)¶
-
static DataProvider *
create
(const DataConfig &config, bool useGpu = FLAGS_use_gpu)¶ This overload of create is only used for unit tests.
Public Static Attributes
-
ClassRegistrar<DataProvider, DataConfig, ModelConfig, bool>
registrar_
¶
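The calling contract described above (call reset() first, then call getNextBatch() until it returns fewer samples than requested) can be sketched with a small stand-in. This does not use the real Paddle headers: `ToyBatch`, `ToyProvider`, `VectorProvider`, and `drainOnePass` are hypothetical names invented for illustration, and the forwarding done by `getNextBatch()` is an assumption about the simplest possible behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for paddle::DataBatch.
struct ToyBatch {
  std::vector<int> samples;
};

// Hypothetical stand-in that mirrors the documented DataProvider interface.
class ToyProvider {
public:
  virtual ~ToyProvider() = default;

  // Assumption: the public call simply forwards to the subclass hook.
  int64_t getNextBatch(int64_t size, ToyBatch* batch) {
    assert(resetCalled_ && "reset() must be called before getNextBatch()");
    return getNextBatchInternal(size, batch);
  }

  virtual void reset() { resetCalled_ = true; }
  virtual int64_t getSize() = 0;
  virtual int64_t getNextBatchInternal(int64_t size, ToyBatch* batch) = 0;

private:
  bool resetCalled_ = false;
};

// A provider backed by an in-memory vector.
class VectorProvider : public ToyProvider {
public:
  explicit VectorProvider(std::vector<int> data) : data_(std::move(data)) {}

  void reset() override {
    next_ = 0;             // reset subclass state first
    ToyProvider::reset();  // then call the base class reset() at the end
  }

  int64_t getSize() override { return static_cast<int64_t>(data_.size()); }

  int64_t getNextBatchInternal(int64_t size, ToyBatch* batch) override {
    const int64_t remaining = static_cast<int64_t>(data_.size()) - next_;
    const int64_t n = std::min(size, remaining);
    batch->samples.assign(data_.begin() + next_, data_.begin() + next_ + n);
    next_ += n;
    return n;  // actual number of samples obtained, possibly less than size
  }

private:
  std::vector<int> data_;
  int64_t next_ = 0;
};

// Runs one pass over the data and records the size of each batch obtained.
std::vector<int64_t> drainOnePass(ToyProvider& provider, int64_t batchSize) {
  std::vector<int64_t> sizes;
  ToyBatch batch;
  provider.reset();
  while (int64_t n = provider.getNextBatch(batchSize, &batch)) {
    sizes.push_back(n);
  }
  return sizes;
}
```

In this sketch a return value of 0 marks the end of a pass; whether the real providers signal the end of a pass in exactly this way is not stated in this reference.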
DataProviderGroup¶
- template <class T>
-
class
paddle::
DataProviderGroup
¶ Inherits from paddle::DataProvider
Public Functions
-
DataProviderGroup
(const DataConfig &config, bool useGpu)¶
-
~DataProviderGroup
()¶
-
void
reset
()¶ Reset all index values.
- Note
- reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
-
virtual void
shuffle
()¶ Shuffle the data set.
-
virtual int64_t
getSize
()¶ Get the size of training samples.
- Return
- the number of training samples in the data set.
- Note
- return -1 to indicate unlimited number of samples.
-
int64_t
getNextBatchInternal
(int64_t size, DataBatch *batch)¶ Get next batch training samples internally.
- Return
- actual size of obtained training samples
- Parameters
size
: size of training samples to get
batch
: a batch of training samples
Protected Attributes
-
ProviderPtrType
provider_
¶
-
std::vector<std::string>
fileList_
¶
-
std::mutex
lock_
¶
-
std::unique_ptr<MultiThreadWorker<ProviderType>>
loader_
¶
-
MultiDataProvider¶
-
class
paddle::
MultiDataProvider
¶ Inherits from paddle::DataProvider
Public Functions
-
MultiDataProvider
(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)¶
-
~MultiDataProvider
()¶
-
void
reset
()¶ Reset all index values.
- Note
- reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
-
void
shuffle
()¶ Shuffle the data set.
-
virtual int64_t
getSize
()¶ Get the size of training samples.
- Return
- the number of training samples in the data set.
- Note
- return -1 to indicate unlimited number of samples.
-
int64_t
getNextBatchInternal
(int64_t size, DataBatch *batch)¶ Get next batch training samples internally.
- Return
- actual size of obtained training samples
- Parameters
size
: size of training samples to get
batch
: a batch of training samples
-
bool
isTestMode
() const¶
Protected Attributes
-
std::vector<std::unique_ptr<DataProvider>>
subDataProviders_
¶
-
PyDataProvider¶
IFieldScanner¶
-
class
paddle::
IFieldScanner
¶ FieldScanner Interface.
It reads a Python object and fills each slot of the argument. There are two steps, prepare and fill: the scanner allocates memory during the prepare step and fills data into the argument during the fill step.
Subclassed by paddle::DenseScanner, paddle::IndexScanner, paddle::SequenceScanner, paddle::SparseNonValueScanner
Public Functions
-
DISABLE_COPY
(IFieldScanner)¶
-
IFieldScanner
(SlotHeader *headerPtr)¶ Ctor.
- Parameters
headerPtr
: the slot header that the scanner belongs to.
-
virtual
~IFieldScanner
()¶
-
virtual void
prepare
(Argument &argument, PyObject *obj)¶ Prepare step.
- Note
- The obj could be a single timestep of a sample or a whole sample, depending on the type of scanner.
Public Static Functions
-
IFieldScanner *
create
(SlotHeader *header)¶ Factory method. Creates a scanner from the header. The final scanner may be a combination of several scanners.
- Note
- Fatal if the header is not supported.
Protected Attributes
-
SlotHeader *
headerPtr_
¶
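The two-step prepare/fill protocol described for IFieldScanner can be sketched as follows. This is a simplified illustration, not the Paddle implementation: `ToySlot` and `ToyDenseScanner` are hypothetical names, plain `std::vector<float>` samples stand in for Python objects, and the `fill()` signature is an assumption because only `prepare()` is listed in this reference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one slot of paddle::Argument.
struct ToySlot {
  std::vector<float> values;
};

// Hypothetical dense scanner: prepare() only allocates, fill() only writes.
class ToyDenseScanner {
public:
  explicit ToyDenseScanner(std::size_t dim) : dim_(dim) {}

  // Prepare step: grow the slot so that it has room for one more sample.
  void prepare(ToySlot& slot, const std::vector<float>& sample) {
    assert(sample.size() == dim_);
    slot.values.resize(slot.values.size() + dim_);
  }

  // Fill step: copy one sample into the storage allocated by prepare().
  void fill(ToySlot& slot, const std::vector<float>& sample) {
    std::copy(sample.begin(), sample.end(),
              slot.values.begin() + static_cast<std::ptrdiff_t>(filled_ * dim_));
    ++filled_;
  }

private:
  std::size_t dim_;
  std::size_t filled_ = 0;
};

// Runs both steps over a batch: all prepare() calls first, then all fill() calls.
ToySlot scanBatch(const std::vector<std::vector<float>>& samples, std::size_t dim) {
  ToyDenseScanner scanner(dim);
  ToySlot slot;
  for (const auto& s : samples) scanner.prepare(slot, s);
  for (const auto& s : samples) scanner.fill(slot, s);
  return slot;
}
```

Separating the two steps means the fill loop never reallocates, which is presumably the motivation for the split in the interface above.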
-
DenseScanner¶
-
class
paddle::
DenseScanner
¶ Scanner for dense slot.
Inherits from paddle::IFieldScanner
IndexScanner¶
-
class
paddle::
IndexScanner
¶ Scanner for index slot
Inherits from paddle::IFieldScanner
SparseNonValueScanner¶
-
class
paddle::
SparseNonValueScanner
¶ Inherits from paddle::IFieldScanner
Subclassed by paddle::SparseValueScanner
Public Functions
-
SparseNonValueScanner
(SlotHeader *ptr)¶
Protected Functions
-
virtual void
setData
(int *col, real *dat, PyObject *obj)¶ Set a single sparse index and value.
- Parameters
col
: sparse index
dat
: sparse value
obj
: Python object. For sparse_non_value it is a PyInt or PyLong. For sparse_value it is a tuple (int, float).
-
SparseValueScanner¶
-
class
paddle::
SparseValueScanner
¶ Inherits from paddle::SparseNonValueScanner
Public Functions
-
SparseValueScanner
(SlotHeader *ptr)¶
Protected Functions
-
virtual void
setData
(int *col, real *dat, PyObject *obj)¶ Set a single sparse index and value.
- Parameters
col
: sparse index
dat
: sparse value
obj
: Python object. For sparse_non_value it is a PyInt or PyLong. For sparse_value it is a tuple (int, float).
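The relationship between the two sparse scanners (the value scanner overrides setData() to store a value in addition to the index) can be sketched as below. All names prefixed with `Toy` are hypothetical, `ToyCell` stands in for the Python object, and the alias `real = float` is an assumption made for this sketch.

```cpp
#include <cassert>

using real = float;  // assumption for this sketch

// Hypothetical stand-in for the Python object passed to setData():
// an int for sparse_non_value, an (int, float) tuple for sparse_value.
struct ToyCell {
  int index;
  real value;  // ignored by the non-value scanner
};

class ToySparseNonValueScanner {
public:
  virtual ~ToySparseNonValueScanner() = default;

  // Stores only the sparse index; dat is left untouched.
  virtual void setData(int* col, real* dat, const ToyCell& obj) {
    (void)dat;
    *col = obj.index;
  }
};

class ToySparseValueScanner : public ToySparseNonValueScanner {
public:
  // Stores both the sparse index and its value.
  void setData(int* col, real* dat, const ToyCell& obj) override {
    *col = obj.index;
    *dat = obj.value;
  }
};
```

Because setData() is virtual, the surrounding scanning logic can be shared by both scanners and only this single hook differs.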
-
SequenceScanner¶
-
class
paddle::
SparseValueScanner
¶ Inherits from paddle::SparseNonValueScanner
Public Functions
-
SparseValueScanner
(SlotHeader *ptr)¶
Protected Functions
-
virtual void
setData
(int *col, real *dat, PyObject *obj)¶ Set a single sparse index and value.
- Parameters
col
: sparse index
dat
: sparse value
obj
: Python object. For sparse_non_value it is a PyInt or PyLong. For sparse_value it is a tuple (int, float).
-
IPyDataProviderCache¶
-
class
paddle::
IPyDataProviderCache
¶ Py Data Provider Cache Interface.
Subclassed by paddle::CacheOnePassInMemory, paddle::NoCacheStrategy
Public Functions
-
virtual
~IPyDataProviderCache
()¶
-
virtual bool
reset
() = 0¶ Invoked when DataProvider::reset() is called.
- Return
- true if data is read from Python.
-
virtual void
drop
(std::deque<PyObjectPtr> *data) = 0¶ Invoked when the data has been used by the DataProvider and needs to be cleared.
- Note
- The implementing class must clear this data array. If you want to delete the PyObjectPtr later, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
- Parameters
data
: used data.
-
virtual std::deque<PyObjectPtr> *
load
() = 0¶ Return whole data in cache.
Public Static Functions
-
IPyDataProviderCache *
create
(CacheType ct)¶ Factory method. Convert CacheType to IPyDataProviderCache*
NoCacheStrategy¶
-
class
paddle::
NoCacheStrategy
¶ No-cache strategy. Destroys old data immediately and loads data from Python on every pass.
Inherits from paddle::IPyDataProviderCache
Public Functions
-
virtual bool
reset
()¶ Invoked when DataProvider::reset() is called.
- Return
- true if data is read from Python.
-
virtual void
drop
(std::deque<PyObjectPtr> *data)¶ Invoked when the data has been used by the DataProvider and needs to be cleared.
- Note
- The implementing class must clear this data array. If you want to delete the PyObjectPtr later, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
- Parameters
data
: used data.
-
virtual std::deque<PyObjectPtr> *
load
()¶ Return whole data in cache.
CacheOnePassInMemory¶
-
class
paddle::
CacheOnePassInMemory
¶ Cache One Pass In Memory strategy.
In the first pass, data is loaded from Python and stored in memory. In the remaining passes, data is loaded from memory.
Inherits from paddle::IPyDataProviderCache
Public Functions
-
CacheOnePassInMemory
()¶
-
virtual bool
reset
()¶ Invoked when DataProvider::reset() is called.
- Return
- true if data is read from Python.
-
virtual void
drop
(std::deque<PyObjectPtr> *data)¶ Invoked when the data has been used by the DataProvider and needs to be cleared.
- Note
- The implementing class must clear this data array. If you want to delete the PyObjectPtr later, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
- Parameters
data
: used data.
-
virtual std::deque<PyObjectPtr> *
load
()¶ Return whole data in cache.
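The difference between the two cache strategies can be sketched with a simplified model of the reset()/drop()/load() interface. This is a rough illustration rather than the Paddle code: the `Toy` classes are hypothetical, `std::shared_ptr<int>` stands in for `PyObjectPtr`, and the first-pass flag is an assumed way of tracking whether the cache has been populated.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <utility>

using ToyObject = std::shared_ptr<int>;  // hypothetical stand-in for PyObjectPtr

class ToyCache {
public:
  virtual ~ToyCache() = default;
  virtual bool reset() = 0;  // true: data is read from Python in this pass
  virtual void drop(std::deque<ToyObject>* data) = 0;
  virtual std::deque<ToyObject>* load() = 0;
};

// No caching: every pass reads from Python, used data is destroyed at once.
class ToyNoCache : public ToyCache {
public:
  bool reset() override { return true; }
  void drop(std::deque<ToyObject>* data) override { data->clear(); }
  std::deque<ToyObject>* load() override { return nullptr; }  // nothing cached
};

// Cache one pass: used data is kept, later passes are served from memory.
class ToyCacheOnePass : public ToyCache {
public:
  bool reset() override {
    if (firstPass_) {
      firstPass_ = false;
      return true;  // first pass: read from Python
    }
    return false;   // later passes: read from the in-memory cache
  }

  void drop(std::deque<ToyObject>* data) override {
    for (auto& obj : *data) cached_.push_back(std::move(obj));
    data->clear();  // the caller's array must be cleared in either strategy
  }

  std::deque<ToyObject>* load() override { return &cached_; }

private:
  bool firstPass_ = true;
  std::deque<ToyObject> cached_;
};
```

Both strategies leave the caller's deque empty after drop(); they differ only in whether the objects survive for the next pass.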
-
IPyDataProvider¶
-
class
paddle::
PyDataProvider2
¶ -
For usage, please refer to the Python module ‘paddle.trainer.PyDataProvider2’.
A separate thread is started to read data, so reading is fully asynchronous. Cache strategies are also supported.
Inherits from paddle::DataProvider
Public Functions
-
PyDataProvider2
(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)¶ Ctor
-
virtual
~PyDataProvider2
()¶ Dtor
- Note
- will stop loading thread when destructing
-
virtual void
reset
()¶ Reset the PyDataProvider. The reading thread may be started here.
-
void
shuffle
()¶ Shuffle. Does nothing, because PyDataProvider shuffles implicitly by randomly selecting data from the data pool.
-
int64_t
getSize
()¶ The size is unlimited.
-
int64_t
getNextBatchInternal
(int64_t size_, DataBatch *batch)¶ Loading a batch of data.
-
Proto Data Provider¶
ProtoDataProvider¶
-
class
paddle::
ProtoDataProvider
¶ Provides data from a protobuf data file, with each sample specified by a proto message.
DataSample is defined in DataFormat.proto.
The file format is:
header
sample1
sample2
...
sampleN
- Note
- In the data file, each message is prefixed with its length. Reading and writing of the protobuf messages are implemented in ProtoReader.h.
Inherits from paddle::DataProvider
Subclassed by paddle::ProtoSequenceDataProvider
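The length-prefixed layout mentioned in the note (a header followed by samples, each message preceded by its length) can be illustrated with plain byte strings. This sketch does not reproduce ProtoReader.h: the fixed 4-byte native-endian length prefix and the function names `writeRecord`, `readRecord`, and `readAll` are assumptions made for the example, and the actual on-disk encoding may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes one record as a length prefix followed by the payload bytes.
// Assumption: a fixed 4-byte native-endian length, chosen for simplicity.
void writeRecord(std::ostream& out, const std::string& payload) {
  const std::uint32_t len = static_cast<std::uint32_t>(payload.size());
  out.write(reinterpret_cast<const char*>(&len), sizeof(len));
  out.write(payload.data(), static_cast<std::streamsize>(len));
}

// Reads one record; returns false at end of stream or on a truncated record.
bool readRecord(std::istream& in, std::string* payload) {
  std::uint32_t len = 0;
  if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) return false;
  payload->resize(len);
  if (len == 0) return true;
  return static_cast<bool>(
      in.read(&(*payload)[0], static_cast<std::streamsize>(len)));
}

// Reads every record in order: the first is the header, the rest are samples.
std::vector<std::string> readAll(std::istream& in) {
  std::vector<std::string> records;
  std::string payload;
  while (readRecord(in, &payload)) records.push_back(payload);
  return records;
}
```

The prefix lets a reader find message boundaries without any delimiter, so payloads may contain arbitrary bytes.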
Public Functions
-
ProtoDataProvider
(const DataConfig &config, bool useGpu, bool loadDataAll = true)¶
-
void
reset
()¶ Reset all index values.
- Note
- reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
-
virtual int64_t
getSize
()¶ - Note
- this size includes the sequences which are skipped because they are longer than the batch size.
-
void
shuffle
()¶ Shuffle the data set.
-
void
loadData
(const std::vector<std::string> &fileList)¶
-
int64_t
getNextBatchInternal
(int64_t size, DataBatch *batch)¶ Get next batch training samples internally.
- Return
- actual size of obtained training samples
- Parameters
size
: size of training samples to get
batch
: a batch of training samples
Protected Functions
-
void
loadData
(const std::string &fileName)¶ load protobuf data from a list of files
- Parameters
fileName
: file name of a file which contains a list of file names
-
void
loadDataFile
(const std::string &fileName)¶ load protobuf data from file
- Parameters
fileName
: data file name
-
void
checkDataHeader
(const DataHeader &header)¶ check data header of each data sample
- Parameters
header
: data header read from protobuf data
-
void
fillSlots
(const DataSample &sample)¶ fill protobuf data into slot_, which is an in-memory vector of ProtoSlot.
- Parameters
sample
: data sample read from protobuf data
-
bool
iidData
() const¶ return true if each sample is one sequence, i.e., independent of other samples.
-
void
checkSample
(const DataSample &sample)¶ check that sample is consistent with header_
- template <class Op>
-
int64_t
sequenceLoop
(Op op, int64_t size)¶
- template <class Op>
-
int64_t
sampleLoop
(Op op, int64_t size)¶
- template <class Op>
-
int64_t
subSampleLoop
(Op op, int64_t size, int slot)¶
-
void
showDataStats
()¶
Protected Attributes
-
DataHeader
header_
¶
-
int
numVecSlots_
¶
-
size_t
sampleNums_
¶
-
std::vector<size_t>
sequenceStartPositions_
¶ The starting position of each sequence in samples. The last element should be the number of samples. If empty, each sample is one sequence.
-
int64_t
currentSequenceIndex_
¶
-
std::vector<size_t>
shuffledSequenceIds_
¶
-
ThreadLocalD<DataBatch>
cpuBatch_
¶
-
ThreadLocalD<DataBatch>
gpuBatch_
¶
-
std::vector<StatPtr>
nnzStats_
¶
-
struct
ProtoSlot
¶ Public Members
-
SlotDef::SlotType
type
¶
-
int
dim
¶
-
std::vector<int>
indexData
¶
-
std::vector<real>
denseData
¶
-
std::vector<sparse_non_value_t>
sparseNonValueData
¶
-
std::vector<sparse_float_value_t>
sparseFloatValueData
¶
-
std::vector<int64_t>
indices
¶
-
std::vector<int64_t>
subIndices
¶
-
std::vector<ProtoVarSlot>
varDenseData
¶
-
std::vector<std::vector<int>>
varIndices
¶
-
std::vector<std::string>
strData
¶
-
struct
ProtoVarSlot
¶
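The convention documented for sequenceStartPositions_ (start offsets with the sample count as the final element, or an empty vector when each sample is its own sequence) can be made concrete with a small helper. The function name `sequenceRanges` is hypothetical and is not part of ProtoDataProvider.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns half-open [begin, end) sample ranges, one per sequence.
std::vector<std::pair<std::size_t, std::size_t>> sequenceRanges(
    const std::vector<std::size_t>& startPositions, std::size_t numSamples) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  if (startPositions.empty()) {
    // Empty vector: every sample is a sequence of length one.
    for (std::size_t i = 0; i < numSamples; ++i) ranges.emplace_back(i, i + 1);
    return ranges;
  }
  // The last element is expected to equal the number of samples.
  assert(startPositions.back() == numSamples);
  for (std::size_t i = 0; i + 1 < startPositions.size(); ++i) {
    ranges.emplace_back(startPositions[i], startPositions[i + 1]);
  }
  return ranges;
}
```

For example, start positions {0, 3, 5} over 5 samples describe two sequences covering samples [0, 3) and [3, 5).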
ProtoSequenceDataProvider¶
-
class
paddle::
ProtoSequenceDataProvider
¶ Special use for Proto data: instances should contain sparse-non-value slots and label.
- Note
- ProtoSequenceDataProvider treats each SPARSE SLOT as a SEQUENCE
Inherits from paddle::ProtoDataProvider
Public Functions
-
ProtoSequenceDataProvider
(const DataConfig &config, bool useGpu, bool loadDataAll = true)¶
-
~ProtoSequenceDataProvider
()¶
-
int64_t
getNextBatchInternal
(int64_t size, DataBatch *batch)¶ Get next batch training samples internally.
- Return
- actual size of obtained training samples
- Parameters
size
: size of training samples to get
batch
: a batch of training samples