Data Providers¶
Base DataProvider¶
- 
class 
paddle::DataProvider¶ Base class for DataProvider, which supplies data for training.
- Note
 - It can supply multiple streams of data. For typical supervised training, there are two streams: one for input and one for label.
 
Subclassed by paddle::DataProviderGroup< T >, paddle::DummyDataProvider, paddle::MultiDataProvider, paddle::ProtoDataProvider, paddle::PyDataProvider, paddle::PyDataProvider2, paddle::SimpleDataProviderBase
Public Functions
- 
DataProvider(const DataConfig &config, bool useGpu)¶ 
- 
virtual 
~DataProvider()¶ 
- 
const DataConfig &
getConfig() const¶ 
- 
void 
setSkipShuffle()¶ 
- 
int64_t 
getNextBatch(int64_t size, DataBatch *batch)¶ Get next batch of training samples.
- Return
 - actual size of obtained training samples
 - Parameters
 size-size of training samples to get
batch-a batch of training samples
- 
virtual void 
shuffle() = 0¶ Shuffle the data set.
- 
virtual void 
reset()¶ Reset all index values.
- Note
 - reset() must be called before any calls to getNextBatch(). IMPORTANT: a subclass reset() should always call the base class reset() at the end of the function.
 
- 
virtual int64_t 
getSize() = 0¶ Get the size of training samples.
- Return
 - the number of training samples in the data set.
 - Note
 - return -1 to indicate unlimited number of samples.
 
- 
virtual int64_t 
getNextBatchInternal(int64_t size, DataBatch *batch) = 0¶ Get the next batch of training samples internally.
- Return
 - actual size of obtained training samples
 - Parameters
 size-size of training samples to get
batch-a batch of training samples
Public Static Functions
- 
DataProvider *
create(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu = FLAGS_use_gpu)¶ 
- 
static DataProvider *
create(const DataConfig &config, bool useGpu)¶ Create a DataProvider. Used only in unit tests.
Public Static Attributes
- 
ClassRegistrar<DataProvider, DataConfig, ModelConfig, bool> 
registrar_¶ 
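The calling contract described above (reset() before any getNextBatch(), a subclass reset() that finishes by calling the base class reset(), and a return value giving the actual number of samples obtained) can be illustrated with a minimal, self-contained sketch. MiniProvider, VectorProvider, and drainOnePass are hypothetical stand-ins and not part of paddle; a std::vector<int> replaces DataBatch, and no DataConfig is involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the base class; this is not the paddle API.
class MiniProvider {
public:
  virtual ~MiniProvider() = default;

  // Base reset(): marks the provider as ready to serve batches.
  virtual void reset() { ready_ = true; }

  // Number of samples in the data set.
  virtual std::int64_t getSize() = 0;

  // Public entry point: checks the contract, then delegates.
  std::int64_t getNextBatch(std::int64_t size, std::vector<int>* batch) {
    assert(ready_ && "reset() must be called before getNextBatch()");
    batch->clear();
    return getNextBatchInternal(size, batch);
  }

protected:
  virtual std::int64_t getNextBatchInternal(std::int64_t size,
                                            std::vector<int>* batch) = 0;

private:
  bool ready_ = false;
};

// Hypothetical subclass that serves samples from an in-memory vector.
class VectorProvider : public MiniProvider {
public:
  explicit VectorProvider(std::vector<int> data) : data_(std::move(data)) {}

  void reset() override {
    pos_ = 0;               // subclass state first
    MiniProvider::reset();  // base class reset() at the end
  }

  std::int64_t getSize() override {
    return static_cast<std::int64_t>(data_.size());
  }

protected:
  std::int64_t getNextBatchInternal(std::int64_t size,
                                    std::vector<int>* batch) override {
    std::int64_t remaining = static_cast<std::int64_t>(data_.size()) - pos_;
    std::int64_t n = std::min(size, remaining);
    batch->assign(data_.begin() + pos_, data_.begin() + pos_ + n);
    pos_ += n;
    return n;  // actual number of samples obtained; 0 when exhausted
  }

private:
  std::vector<int> data_;
  std::int64_t pos_ = 0;
};

// Drains one pass and returns the batch sizes that were actually obtained.
std::vector<std::int64_t> drainOnePass(MiniProvider& provider,
                                       std::int64_t batchSize) {
  std::vector<std::int64_t> sizes;
  std::vector<int> batch;
  provider.reset();
  while (std::int64_t n = provider.getNextBatch(batchSize, &batch)) {
    sizes.push_back(n);
  }
  return sizes;
}
```

In this sketch the last batch of a pass may be smaller than the requested size, which is why callers should rely on the returned count rather than on the size they asked for.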
DataProviderGroup¶
- template <class T>
 - 
class 
paddle::DataProviderGroup¶ Inherits from paddle::DataProvider
Public Functions
- 
DataProviderGroup(const DataConfig &config, bool useGpu)¶ 
- 
~DataProviderGroup()¶ 
- 
virtual void 
reset()¶ Reset all index values.
- Note
 - reset() must be called before any calls to getNextBatch(). IMPORTANT: a subclass reset() should always call the base class reset() at the end of the function.
 
- 
virtual void 
shuffle()¶ Shuffle the data set.
- 
virtual int64_t 
getSize()¶ Get the size of training samples.
- Return
 - the number of training samples in the data set.
 - Note
 - return -1 to indicate unlimited number of samples.
 
- 
virtual int64_t 
getNextBatchInternal(int64_t size, DataBatch *batch)¶ Get the next batch of training samples internally.
- Return
 - actual size of obtained training samples
 - Parameters
 size-size of training samples to get
batch-a batch of training samples
Protected Attributes
- 
ProviderPtrType 
provider_¶ 
- 
std::vector<std::string> 
fileList_¶ 
- 
std::mutex 
lock_¶ 
- 
std::unique_ptr<MultiThreadWorker<ProviderType>> 
loader_¶ 
- 
 
MultiDataProvider¶
- 
class 
paddle::MultiDataProvider¶ Inherits from paddle::DataProvider
Public Functions
- 
MultiDataProvider(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)¶ 
- 
~MultiDataProvider()¶ 
- 
virtual void 
reset()¶ Reset all index values.
- Note
 - reset() must be called before any calls to getNextBatch(). IMPORTANT: a subclass reset() should always call the base class reset() at the end of the function.
 
- 
virtual void 
shuffle()¶ Shuffle the data set.
- 
virtual int64_t 
getSize()¶ Get the size of training samples.
- Return
 - the number of training samples in the data set.
 - Note
 - return -1 to indicate unlimited number of samples.
 
- 
virtual int64_t 
getNextBatchInternal(int64_t size, DataBatch *batch)¶ Get the next batch of training samples internally.
- Return
 - actual size of obtained training samples
 - Parameters
 size-size of training samples to get
batch-a batch of training samples
- 
bool 
isTestMode() const¶ 
Protected Attributes
- 
std::vector<std::unique_ptr<DataProvider>> 
subDataProviders_¶ 
- 
 
PyDataProvider¶
IFieldScanner¶
- 
class 
paddle::IFieldScanner¶ FieldScanner Interface.
It reads a Python object and fills each slot of the argument. There are two steps, prepare and fill: the scanner allocates memory during the prepare step and fills data into the argument during the fill step.
Subclassed by paddle::DenseScanner, paddle::IndexScanner, paddle::SequenceScanner, paddle::SparseNonValueScanner
Public Functions
- 
DISABLE_COPY(IFieldScanner)¶ 
- 
IFieldScanner(SlotHeader *headerPtr)¶ Ctor.
- Parameters
 headerPtr-slot header that the scanner belongs to.
- 
virtual 
~IFieldScanner()¶ 
- 
virtual void 
prepare(Argument &argument, PyObject *obj)¶ Prepare step.
- Note
 - obj can be a single timestep of a sample or a whole sample, depending on the scanner type.
 
Public Static Functions
- 
IFieldScanner *
create(SlotHeader *header)¶ Factory method. Create a scanner from the header. The resulting scanner may be a combination of several scanners.
- Note
 - Fatal if the header is not supported.
 
Protected Attributes
- 
SlotHeader *
headerPtr_¶ 
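The two-step prepare/fill protocol described for IFieldScanner can be sketched without any Python dependency. Everything below is a hypothetical illustration: Slot stands in for one slot of an Argument, a std::vector<float> stands in for the PyObject sample, and DenseSketchScanner and scanBatch are made-up names, not paddle classes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one output slot of an Argument.
struct Slot {
  std::vector<float> values;  // flat storage, dim values per sample
};

// Hypothetical dense scanner: prepare() grows the buffer, fill() writes into it.
class DenseSketchScanner {
public:
  explicit DenseSketchScanner(std::size_t dim) : dim_(dim) {}

  // Prepare step: reserve room for one more sample, write nothing yet.
  void prepare(Slot& slot, const std::vector<float>& sample) {
    assert(sample.size() == dim_);
    slot.values.resize(slot.values.size() + dim_);
  }

  // Fill step: copy one sample into the row reserved for it.
  void fill(Slot& slot, const std::vector<float>& sample) {
    assert((filled_ + 1) * dim_ <= slot.values.size());
    std::copy(sample.begin(), sample.end(),
              slot.values.begin() + static_cast<std::ptrdiff_t>(filled_ * dim_));
    ++filled_;
  }

private:
  std::size_t dim_;
  std::size_t filled_ = 0;
};

// Runs the prepare pass over the whole batch, then the fill pass.
Slot scanBatch(const std::vector<std::vector<float>>& batch, std::size_t dim) {
  DenseSketchScanner scanner(dim);
  Slot slot;
  for (const auto& sample : batch) scanner.prepare(slot, sample);
  for (const auto& sample : batch) scanner.fill(slot, sample);
  return slot;
}
```

Separating the passes means the final buffer size is known before any data is written, so the fill pass never reallocates.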
- 
 
DenseScanner¶
- 
class 
paddle::DenseScanner¶ Scanner for dense slot.
Inherits from paddle::IFieldScanner
IndexScanner¶
- 
class 
paddle::IndexScanner¶ Scanner for index slot
Inherits from paddle::IFieldScanner
SparseNonValueScanner¶
- 
class 
paddle::SparseNonValueScanner¶ Inherits from paddle::IFieldScanner
Subclassed by paddle::SparseValueScanner
Public Functions
- 
SparseNonValueScanner(SlotHeader *ptr)¶ 
Protected Functions
- 
virtual void 
setData(int *col, real *dat, PyObject *obj)¶ Set a single sparse index and value.
- Parameters
 col-sparse index
dat-sparse value
obj-Python object. For sparse_non_value, it is a PyInt or PyLong. For sparse_value, it is a tuple (int, float).
- 
 
SparseValueScanner¶
- 
class 
paddle::SparseValueScanner¶ Inherits from paddle::SparseNonValueScanner
Public Functions
- 
SparseValueScanner(SlotHeader *ptr)¶ 
Protected Functions
- 
virtual void 
setData(int *col, real *dat, PyObject *obj)¶ Set a single sparse index and value.
- Parameters
 col-sparse index
dat-sparse value
obj-Python object. For sparse_non_value, it is a PyInt or PyLong. For sparse_value, it is a tuple (int, float).
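The difference between the two setData() overrides (index only versus index plus value) can be sketched as follows. This is an illustration under stated assumptions: SparseItem is a hypothetical replacement for the Python object, float replaces real, and NonValueSketch, ValueSketch, and fillRow are made-up names rather than paddle classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Python object passed to setData().
// hasValue == false models a bare integer; true models an (int, float) tuple.
struct SparseItem {
  int index;
  float value;
  bool hasValue;
};

// Sketch of the non-value case: only the column index is written.
class NonValueSketch {
public:
  virtual ~NonValueSketch() = default;
  virtual void setData(int* col, float* dat, const SparseItem& obj) {
    (void)dat;  // no value is stored for sparse_non_value
    *col = obj.index;
  }
};

// Sketch of the value case: both the column index and the value are written.
class ValueSketch : public NonValueSketch {
public:
  void setData(int* col, float* dat, const SparseItem& obj) override {
    assert(obj.hasValue);
    *col = obj.index;
    *dat = obj.value;
  }
};

// Fills parallel column/value arrays for one sparse row.
void fillRow(NonValueSketch& scanner, const std::vector<SparseItem>& row,
             std::vector<int>* cols, std::vector<float>* vals) {
  cols->assign(row.size(), 0);
  vals->assign(row.size(), 0.0f);
  for (std::size_t i = 0; i < row.size(); ++i) {
    scanner.setData(&(*cols)[i], &(*vals)[i], row[i]);
  }
}
```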
- 
 
SequenceScanner¶
- 
class 
paddle::SequenceScanner¶ Inherits from paddle::IFieldScanner
IPyDataProviderCache¶
- 
class 
paddle::IPyDataProviderCache¶ Py Data Provider Cache Interface.
Subclassed by paddle::CacheOnePassInMemory, paddle::NoCacheStrategy
Public Functions
- 
virtual 
~IPyDataProviderCache()¶ 
- 
virtual bool 
reset() = 0¶ Invoked when DataProvider::reset() is called.
- Return
 - true if data is read from Python.
 
- 
virtual void 
drop(std::deque<PyObjectPtr> *data) = 0¶ Invoked when the data has been used by DataProvider and needs to be cleared.
- Note
 - The implementing class must clear the data array. If you want to delete the PyObjectPtr later instead, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
 - Parameters
 data-used data.
- 
virtual std::deque<PyObjectPtr> *
load() = 0¶ Return all data in the cache.
Public Static Functions
- 
IPyDataProviderCache *
create(CacheType ct)¶ Factory method. Convert CacheType to IPyDataProviderCache*
 
NoCacheStrategy¶
- 
class 
paddle::NoCacheStrategy¶ No Cache Strategy. Destroys old data immediately and loads data from Python on every pass.
Inherits from paddle::IPyDataProviderCache
Public Functions
- 
virtual bool 
reset()¶ Invoked when DataProvider::reset() is called.
- Return
 - true if data is read from Python.
 
- 
virtual void 
drop(std::deque<PyObjectPtr> *data)¶ Invoked when the data has been used by DataProvider and needs to be cleared.
- Note
 - The implementing class must clear the data array. If you want to delete the PyObjectPtr later instead, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
 - Parameters
 data-used data.
- 
virtual std::deque<PyObjectPtr> *
load()¶ Return all data in the cache.
 
CacheOnePassInMemory¶
- 
class 
paddle::CacheOnePassInMemory¶ Cache One Pass In Memory strategy.
In the first pass, data is loaded from Python and stored in memory. In the remaining passes, data is loaded from memory.
Inherits from paddle::IPyDataProviderCache
Public Functions
- 
CacheOnePassInMemory()¶ 
- 
virtual bool 
reset()¶ Invoked when DataProvider::reset() is called.
- Return
 - true if data is read from Python.
 
- 
virtual void 
drop(std::deque<PyObjectPtr> *data)¶ Invoked when the data has been used by DataProvider and needs to be cleared.
- Note
 - The implementing class must clear the data array. If you want to delete the PyObjectPtr later instead, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
 - Parameters
 data-used data.
- 
virtual std::deque<PyObjectPtr> *
load()¶ Return all data in the cache.
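The contrast between the two cache strategies (reload on every pass versus keep the first pass in memory) can be sketched with the same reset()/drop()/load() shape. This is a simplified illustration, not the paddle implementation: int stands in for PyObjectPtr, and CacheSketch, NoCacheSketch, OnePassSketch, and runPass are hypothetical names. The two-pool layout in OnePassSketch is an assumption made so that dropped data can be served again.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in: int plays the role of PyObjectPtr.
using Item = int;

class CacheSketch {
public:
  virtual ~CacheSketch() = default;
  virtual bool reset() = 0;                       // true: read from the source
  virtual void drop(std::deque<Item>* data) = 0;  // called with used data
  virtual std::deque<Item>* load() = 0;           // cached data, if any
};

// Discards used data immediately, so every pass reads from the source.
class NoCacheSketch : public CacheSketch {
public:
  bool reset() override { return true; }
  void drop(std::deque<Item>* data) override { data->clear(); }
  std::deque<Item>* load() override { return nullptr; }
};

// Keeps used data, so only the first pass reads from the source.
class OnePassSketch : public CacheSketch {
public:
  bool reset() override {
    if (dropped_.empty()) return true;  // nothing cached yet
    pool_.swap(dropped_);               // serve what was used last pass
    dropped_.clear();
    return false;
  }
  void drop(std::deque<Item>* data) override {
    dropped_.insert(dropped_.end(), data->begin(), data->end());
    data->clear();
  }
  std::deque<Item>* load() override { return &pool_; }

private:
  std::deque<Item> pool_;
  std::deque<Item> dropped_;
};

// Simulates one pass; returns how many items were read from the source.
int runPass(CacheSketch& cache, const std::deque<Item>& source, long* sum) {
  int readsFromSource = 0;
  std::deque<Item> fresh;
  std::deque<Item>* pool = &fresh;
  if (cache.reset()) {
    fresh = source;  // stands in for calling into Python
    readsFromSource = static_cast<int>(source.size());
  } else {
    pool = cache.load();
    assert(pool != nullptr);
  }
  std::deque<Item> used;
  while (!pool->empty()) {
    *sum += pool->front();
    used.push_back(pool->front());
    pool->pop_front();
  }
  cache.drop(&used);
  return readsFromSource;
}
```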
- 
 
PyDataProvider2¶
- 
class 
paddle::PyDataProvider2¶ 
For usage, please refer to the Python module ‘paddle.trainer.PyDataProvider2’.
A separate thread is started to read data, so data loading is fully asynchronous. Cache strategies are also supported.
Inherits from paddle::DataProvider
Public Functions
- 
PyDataProvider2(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)¶ Ctor
- 
virtual 
~PyDataProvider2()¶ Dtor
- Note
 - The loading thread is stopped on destruction.
 
- 
virtual void 
reset()¶ Reset the PyDataProvider. May start the reading thread here.
- 
virtual void 
shuffle()¶ Shuffle. Does nothing, because PyDataProvider shuffles implicitly by randomly selecting data from the data pool.
- 
virtual int64_t 
getSize()¶ The size is not limited.
- 
virtual int64_t 
getNextBatchInternal(int64_t size_, DataBatch *batch)¶ Load a batch of data.
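The implicit shuffle mentioned for shuffle(), where batches are built by randomly selecting data from a pool, can be sketched as below. This is only an illustration of the idea, with no loading thread: drawBatch is a hypothetical helper, int stands in for a loaded sample, and the swap-and-pop removal is an assumption of this sketch rather than a description of the paddle implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Removes up to `size` randomly chosen items from `pool` and returns them.
// Hypothetical helper; int stands in for a loaded sample.
std::vector<int> drawBatch(std::vector<int>& pool, std::size_t size,
                           std::mt19937& rng) {
  std::vector<int> batch;
  while (batch.size() < size && !pool.empty()) {
    std::uniform_int_distribution<std::size_t> pick(0, pool.size() - 1);
    std::size_t i = pick(rng);
    std::swap(pool[i], pool.back());  // move the pick to the end
    batch.push_back(pool.back());
    pool.pop_back();                  // constant-time removal
  }
  assert(batch.size() <= size);
  return batch;
}
```

Because each item is removed from the pool when it is drawn, no sample is served twice before the pool is refilled.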
 - 
 
Proto Data Provider¶
ProtoDataProvider¶
- 
class 
paddle::ProtoDataProvider¶ Provides data from a protobuf data file, with each sample specified by a proto message.
DataSample is defined in DataFormat.proto.
The file format is
header
sample1
sample2
...
sampleN
- Note
 - In the data file, each message is prefixed with its length. Reading and writing of the protobuf messages are implemented in ProtoReader.h.
 
Inherits from paddle::DataProvider
Subclassed by paddle::ProtoSequenceDataProvider
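The file layout described above (a header followed by samples, each message prefixed with its length) can be sketched with plain byte strings. The encoding of the length prefix is not stated here, so this sketch assumes a fixed 4-byte little-endian prefix purely for illustration; the actual format in ProtoReader.h may differ. writeRecord, readRecord, and readAll are hypothetical helpers, and std::string payloads stand in for serialized proto messages.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes one record: a 4-byte little-endian length, then the payload.
// The fixed-width prefix is an assumption made for this sketch.
void writeRecord(std::ostream& out, const std::string& payload) {
  assert(payload.size() <= 0xFFFFFFFFu);
  std::uint32_t len = static_cast<std::uint32_t>(payload.size());
  unsigned char prefix[4];
  for (int i = 0; i < 4; ++i) {
    prefix[i] = static_cast<unsigned char>((len >> (8 * i)) & 0xFFu);
  }
  out.write(reinterpret_cast<const char*>(prefix), 4);
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

// Reads one record; returns false at end of stream or on truncated data.
bool readRecord(std::istream& in, std::string* payload) {
  unsigned char prefix[4];
  if (!in.read(reinterpret_cast<char*>(prefix), 4)) return false;
  std::uint32_t len = 0;
  for (int i = 0; i < 4; ++i) {
    len |= static_cast<std::uint32_t>(prefix[i]) << (8 * i);
  }
  payload->assign(len, '\0');
  if (len == 0) return true;
  return static_cast<bool>(
      in.read(&(*payload)[0], static_cast<std::streamsize>(len)));
}

// Reads "header, sample1, ..., sampleN": the first record is the header.
bool readAll(std::istream& in, std::string* header,
             std::vector<std::string>* samples) {
  if (!readRecord(in, header)) return false;
  std::string sample;
  while (readRecord(in, &sample)) samples->push_back(sample);
  return true;
}
```

The length prefix is what lets a reader find message boundaries without parsing the payload itself.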
Public Functions
- 
ProtoDataProvider(const DataConfig &config, bool useGpu, bool loadDataAll = true)¶ 
- 
virtual void 
reset()¶ Reset all index values.
- Note
 - reset() must be called before any calls to getNextBatch(). IMPORTANT: a subclass reset() should always call the base class reset() at the end of the function.
 
- 
virtual int64_t 
getSize()¶ - Note
 - this size includes the sequences which are skipped because they are longer than the batch size.
 
- 
virtual void 
shuffle()¶ Shuffle the data set.
- 
void 
loadData(const std::vector<std::string> &fileList)¶ 
- 
virtual int64_t 
getNextBatchInternal(int64_t size, DataBatch *batch)¶ Get the next batch of training samples internally.
- Return
 - actual size of obtained training samples
 - Parameters
 size-size of training samples to get
batch-a batch of training samples
Protected Functions
- 
void 
loadData(const std::string &fileName)¶ Load protobuf data from a list of files.
- Parameters
 fileName-file name of a file which contains a list of file names
- 
void 
loadDataFile(const std::string &fileName)¶ load protobuf data from file
- Parameters
 fileName-data file name
- 
void 
checkDataHeader(const DataHeader &header)¶ check data header of each data sample
- Parameters
 header-data header read from protobuf data
- 
void 
fillSlots(const DataSample &sample)¶ Fill protobuf data into slot_, which is an in-memory vector of ProtoSlot.
- Parameters
 sample-data sample read from protobuf data
- 
bool 
iidData() const¶ return true if each sample is one sequence, i.e., independent of other samples.
- 
void 
checkSample(const DataSample &sample)¶ check that sample is consistent with header_
- template <class Op>
 - 
int64_t 
sequenceLoop(Op op, int64_t size)¶ 
- template <class Op>
 - 
int64_t 
sampleLoop(Op op, int64_t size)¶ 
- template <class Op>
 - 
int64_t 
subSampleLoop(Op op, int64_t size, int slot)¶ 
- 
void 
showDataStats()¶ 
Protected Attributes
- 
DataHeader 
header_¶ 
- 
int 
numVecSlots_¶ 
- 
size_t 
sampleNums_¶ 
- 
std::vector<size_t> 
sequenceStartPositions_¶ The starting position of each sequence in samples. The last element should be the number of samples. If empty, each sample is one sequence.
- 
int64_t 
currentSequenceIndex_¶ 
- 
std::vector<size_t> 
shuffledSequenceIds_¶ 
- 
ThreadLocalD<DataBatch> 
cpuBatch_¶ 
- 
ThreadLocalD<DataBatch> 
gpuBatch_¶ 
- 
std::vector<StatPtr> 
nnzStats_¶ 
- 
struct 
ProtoSlot¶ 
Public Members
- 
SlotDef::SlotType 
type¶ 
- 
int 
dim¶ 
- 
std::vector<int> 
indexData¶ 
- 
std::vector<real> 
denseData¶ 
- 
std::vector<sparse_non_value_t> 
sparseNonValueData¶ 
- 
std::vector<sparse_float_value_t> 
sparseFloatValueData¶ 
- 
std::vector<int64_t> 
indices¶ 
- 
std::vector<int64_t> 
subIndices¶ 
- 
std::vector<ProtoVarSlot> 
varDenseData¶ 
- 
std::vector<std::vector<int>> 
varIndices¶ 
- 
std::vector<std::string> 
strData¶ 
 
- 
struct 
ProtoVarSlot¶ 
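The convention documented for sequenceStartPositions_ (start offsets with the number of samples as the last element, or an empty vector when each sample is its own sequence) can be illustrated by deriving per-sequence lengths. sequenceLengths is a hypothetical helper written for this sketch and is not a member of ProtoDataProvider.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the length of each sequence given its start positions.
// If startPositions is empty, every sample is a sequence of length 1.
std::vector<std::size_t> sequenceLengths(
    const std::vector<std::size_t>& startPositions, std::size_t numSamples) {
  if (startPositions.empty()) {
    return std::vector<std::size_t>(numSamples, 1);
  }
  // The last element is expected to equal the number of samples.
  assert(startPositions.back() == numSamples);
  std::vector<std::size_t> lengths;
  for (std::size_t i = 0; i + 1 < startPositions.size(); ++i) {
    assert(startPositions[i] <= startPositions[i + 1]);
    lengths.push_back(startPositions[i + 1] - startPositions[i]);
  }
  return lengths;
}
```

For example, start positions {0, 3, 5, 9} over 9 samples describe three sequences of lengths 3, 2, and 4.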
ProtoSequenceDataProvider¶
- 
class 
paddle::ProtoSequenceDataProvider¶ Special use for Proto data: instances should contain sparse-non-value slots and label.
- Note
 - ProtoSequenceDataProvider treats each SPARSE SLOT as a SEQUENCE
 
Inherits from paddle::ProtoDataProvider
Public Functions
- 
ProtoSequenceDataProvider(const DataConfig &config, bool useGpu, bool loadDataAll = true)¶ 
- 
~ProtoSequenceDataProvider()¶ 
- 
virtual int64_t 
getNextBatchInternal(int64_t size, DataBatch *batch)¶ Get the next batch of training samples internally.
- Return
 - actual size of obtained training samples
 - Parameters
 size-size of training samples to get
batch-a batch of training samples
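The note that each sparse slot is treated as a sequence can be pictured as flattening the ids of every sparse slot into one stream and recording where each sequence starts. This is a loose sketch of that idea only: SequenceBatch and toSequences are hypothetical names, plain int ids stand in for sparse-non-value entries, and the actual batch layout used by ProtoSequenceDataProvider is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: flattened ids plus sequence start offsets.
struct SequenceBatch {
  std::vector<int> ids;
  std::vector<std::size_t> startPositions;  // last element == ids.size()
};

// Treats each sparse slot (a list of ids) as one sequence.
SequenceBatch toSequences(const std::vector<std::vector<int>>& sparseSlots) {
  SequenceBatch out;
  out.startPositions.push_back(0);
  for (const auto& slot : sparseSlots) {
    out.ids.insert(out.ids.end(), slot.begin(), slot.end());
    out.startPositions.push_back(out.ids.size());
  }
  assert(out.startPositions.back() == out.ids.size());
  return out;
}
```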