Data Providers¶
Base DataProvider¶
- 
class paddle::DataProvider¶
- Base class for DataProvider, which supplies data for training. - Note
- It can supply multiple streams of data. For typical supervised training there are two streams: one for input and one for label.
 - Subclassed by paddle::DataProviderGroup< T >, paddle::DummyDataProvider, paddle::MultiDataProvider, paddle::ProtoDataProvider, paddle::PyDataProvider, paddle::PyDataProvider2, paddle::SimpleDataProviderBase - Public Functions - 
DataProvider(const DataConfig &config, bool useGpu)¶
 - 
virtual ~DataProvider()¶
 - 
const DataConfig &getConfig() const¶
 - 
void setSkipShuffle()¶
 - 
int64_t getNextBatch(int64_t size, DataBatch *batch)¶
- Get next batch of training samples. - Return
- actual size of obtained training samples
- Parameters
- size-- size of training samples to get 
- batch-- a batch of training samples 
 
 
 - 
virtual void shuffle() = 0¶
- Shuffle the data set. 
 - 
virtual void reset()¶
- Reset all index values. - Note
- reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
 
 - 
virtual int64_t getSize() = 0¶
- Get the size of training samples. - Return
- the number of training samples in the data set.
- Note
- return -1 to indicate unlimited number of samples.
 
 - 
virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch) = 0¶
- Get the next batch of training samples internally. - Return
- actual size of obtained training samples
- Parameters
- size-- size of training samples to get 
- batch-- a batch of training samples 
 
 
 - Public Static Functions - 
DataProvider *create(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu = FLAGS_use_gpu)¶
 - 
static DataProvider *create(const DataConfig &config, bool useGpu)¶
- This overload of create() is only used for unit tests. 
 - Public Static Attributes - 
ClassRegistrar<DataProvider, DataConfig, ModelConfig, bool> registrar_¶
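 
The calling contract described above (reset() before getNextBatch(), a subclass reset() that calls the base reset() last, and getNextBatch() delegating to getNextBatchInternal()) can be illustrated with a minimal sketch. It does not use the real paddle headers: SketchBatch, SketchProvider and VectorProvider are hypothetical stand-ins, and the assertion inside getNextBatch() is an illustrative assumption rather than documented behaviour of paddle::DataProvider.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for DataBatch; NOT the real paddle type.
struct SketchBatch {
  std::vector<int> samples;
};

// Hypothetical base class mirroring the shape of the documented interface.
class SketchProvider {
public:
  virtual ~SketchProvider() = default;

  // Base reset(): marks the provider as ready for getNextBatch().
  virtual void reset() { ready_ = true; }

  // Number of samples, or -1 to indicate an unlimited number of samples.
  virtual int64_t getSize() = 0;

  // Public entry point: checks the reset() precondition (an assumption of
  // this sketch), then delegates to the subclass.
  int64_t getNextBatch(int64_t size, SketchBatch* batch) {
    assert(ready_ && "reset() must be called before getNextBatch()");
    return getNextBatchInternal(size, batch);
  }

protected:
  virtual int64_t getNextBatchInternal(int64_t size, SketchBatch* batch) = 0;

private:
  bool ready_ = false;
};

// Hypothetical subclass that serves samples from an in-memory vector.
class VectorProvider : public SketchProvider {
public:
  explicit VectorProvider(std::vector<int> data) : data_(std::move(data)) {}

  void reset() override {
    index_ = 0;               // reset subclass state first
    SketchProvider::reset();  // call the base class reset() at the end
  }

  int64_t getSize() override { return static_cast<int64_t>(data_.size()); }

protected:
  int64_t getNextBatchInternal(int64_t size, SketchBatch* batch) override {
    const int64_t remaining = static_cast<int64_t>(data_.size()) - index_;
    const int64_t n = std::min(size, remaining);
    batch->samples.assign(data_.begin() + index_, data_.begin() + index_ + n);
    index_ += n;
    return n;  // actual number of samples obtained, possibly fewer than size
  }

private:
  std::vector<int> data_;
  int64_t index_ = 0;
};
```

A caller would construct a VectorProvider, call reset() once, and then call getNextBatch() repeatedly until it returns 0.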
 
DataProviderGroup¶
- template <class T>
- 
class paddle::DataProviderGroup¶
- Inherits from paddle::DataProvider - Public Functions - 
DataProviderGroup(const DataConfig &config, bool useGpu)¶
 - 
~DataProviderGroup()¶
 - 
virtual void reset()¶
- Reset all index values. - Note
- reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
 
 - 
virtual void shuffle()¶
- Shuffle the data set. 
 - 
virtual int64_t getSize()¶
- Get the size of training samples. - Return
- the number of training samples in the data set.
- Note
- return -1 to indicate unlimited number of samples.
 
 - 
virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)¶
- Get the next batch of training samples internally. - Return
- actual size of obtained training samples
- Parameters
- size-- size of training samples to get 
- batch-- a batch of training samples 
 
 
 - Protected Attributes - 
ProviderPtrType provider_¶
 - 
std::vector<std::string> fileList_¶
 - 
std::mutex lock_¶
 - 
std::unique_ptr<MultiThreadWorker<ProviderType>> loader_¶
 
- 
MultiDataProvider¶
- 
class paddle::MultiDataProvider¶
- Inherits from paddle::DataProvider - Public Functions - 
MultiDataProvider(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)¶
 - 
~MultiDataProvider()¶
 - 
virtual void reset()¶
- Reset all index values. - Note
- reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
 
 - 
virtual void shuffle()¶
- Shuffle the data set. 
 - 
virtual int64_t getSize()¶
- Get the size of training samples. - Return
- the number of training samples in the data set.
- Note
- return -1 to indicate unlimited number of samples.
 
 - 
virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)¶
- Get the next batch of training samples internally. - Return
- actual size of obtained training samples
- Parameters
- size-- size of training samples to get 
- batch-- a batch of training samples 
 
 
 - 
bool isTestMode() const¶
 - Protected Attributes - 
std::vector<std::unique_ptr<DataProvider>> subDataProviders_¶
 
- 
PyDataProvider¶
IFieldScanner¶
- 
class paddle::IFieldScanner¶
- FieldScanner interface. - Reads a Python object and fills each slot of the argument. There are two steps, prepare and fill: the scanner allocates memory during the prepare step and fills data into the argument during the fill step. - Subclassed by paddle::DenseScanner, paddle::IndexScanner, paddle::SequenceScanner, paddle::SparseNonValueScanner - Public Functions - 
DISABLE_COPY(IFieldScanner)¶
 - 
IFieldScanner(SlotHeader *headerPtr)¶
- Ctor. - Parameters
- headerPtr-- slot header that the scanner belongs to. 
 
 
 - 
virtual ~IFieldScanner()¶
 - 
virtual void prepare(Argument &argument, PyObject *obj)¶
- Prepare step. - Note
- obj may be a single timestep of a sample or a whole sample, depending on the scanner type.
 
 - Public Static Functions - 
IFieldScanner *create(SlotHeader *header)¶
- Factory method. Creates a scanner from a slot header. The resulting scanner may combine several scanners. - Note
- Fatal if the header is not supported.
 
 - Protected Attributes - 
SlotHeader *headerPtr_¶
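 
The two-step prepare/fill protocol described for IFieldScanner can be sketched as follows. This is a simplified illustration only: it uses std::vector<float> in place of PyObject, SketchArgument in place of paddle::Argument, and a hypothetical finishPrepare() method to mark the point where memory is allocated. None of these names come from the paddle sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the output argument; NOT paddle::Argument.
struct SketchArgument {
  std::vector<float> values;  // row-major storage, one row per sample
  std::size_t dim = 0;
};

// Hypothetical dense scanner: prepare() counts samples, finishPrepare()
// allocates once, fill() copies each sample into its row.
class SketchDenseScanner {
public:
  explicit SketchDenseScanner(std::size_t dim) : dim_(dim) {}

  // Prepare step: inspect one sample and record how much space is needed.
  void prepare(const std::vector<float>& sample) {
    assert(sample.size() == dim_);
    ++count_;
  }

  // End of the prepare step: allocate all memory in one go.
  void finishPrepare(SketchArgument& arg) {
    arg.dim = dim_;
    arg.values.assign(count_ * dim_, 0.0f);
    next_ = 0;
  }

  // Fill step: copy one sample into the next free row.
  void fill(SketchArgument& arg, const std::vector<float>& sample) {
    assert(next_ < count_);
    std::copy(sample.begin(), sample.end(), arg.values.begin() + next_ * dim_);
    ++next_;
  }

private:
  std::size_t dim_;
  std::size_t count_ = 0;
  std::size_t next_ = 0;
};
```

Separating the two passes means the output buffer is sized exactly once, instead of growing while data is being copied.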
 
- 
DenseScanner¶
- 
class paddle::DenseScanner¶
- Scanner for dense slot. - Inherits from paddle::IFieldScanner 
IndexScanner¶
- 
class paddle::IndexScanner¶
- Scanner for index slot - Inherits from paddle::IFieldScanner 
SparseNonValueScanner¶
- 
class paddle::SparseNonValueScanner¶
- Inherits from paddle::IFieldScanner - Subclassed by paddle::SparseValueScanner - Public Functions - 
SparseNonValueScanner(SlotHeader *ptr)¶
 - Protected Functions - 
virtual void setData(int *col, real *dat, PyObject *obj)¶
- Set a single sparse index and value. - Parameters
- col-- sparse index 
- dat-- sparse value 
- obj-- Python object. For sparse_non_value it is a PyInt or PyLong; for sparse_value it is a tuple (int, float). 
 
 
 
- 
SparseValueScanner¶
- 
class paddle::SparseValueScanner¶
- Inherits from paddle::SparseNonValueScanner - Public Functions - 
SparseValueScanner(SlotHeader *ptr)¶
 - Protected Functions - 
virtual void setData(int *col, real *dat, PyObject *obj)¶
- Set a single sparse index and value. - Parameters
- col-- sparse index 
- dat-- sparse value 
- obj-- Python object. For sparse_non_value it is a PyInt or PyLong; for sparse_value it is a tuple (int, float). 
 
 
 
- 
SequenceScanner¶
- 
class paddle::SequenceScanner¶
- Inherits from paddle::IFieldScanner 
- 
IPyDataProviderCache¶
- 
class paddle::IPyDataProviderCache¶
- Py Data Provider Cache Interface. - Subclassed by paddle::CacheOnePassInMemory, paddle::NoCacheStrategy - Public Functions - 
virtual ~IPyDataProviderCache()¶
 - 
virtual bool reset() = 0¶
- Invoked when DataProvider::reset() is called. - Return
- true if data should be read from Python.
 
 - 
virtual void drop(std::deque<PyObjectPtr> *data) = 0¶
- Invoked when the given data have been used by DataProvider and need to be cleared. - Note
- The implementing class must clear this data array. If you want to delete the PyObjectPtr later, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
- Parameters
- data-- used data. 
 
 
 - 
virtual std::deque<PyObjectPtr> *load() = 0¶
- Return whole data in cache. 
 - Public Static Functions - 
IPyDataProviderCache *create(CacheType ct)¶
- Factory method. Convert CacheType to IPyDataProviderCache* 
 
- 
NoCacheStrategy¶
- 
class paddle::NoCacheStrategy¶
- No Cache Strategy. Destroys old data immediately and loads data from Python on every pass. - Inherits from paddle::IPyDataProviderCache - Public Functions - 
virtual bool reset()¶
- Invoked when DataProvider::reset() is called. - Return
- true if data should be read from Python.
 
 - 
virtual void drop(std::deque<PyObjectPtr> *data)¶
- Invoked when the given data have been used by DataProvider and need to be cleared. - Note
- The implementing class must clear this data array. If you want to delete the PyObjectPtr later, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
- Parameters
- data-- used data. 
 
 
 - 
virtual std::deque<PyObjectPtr> *load()¶
- Return whole data in cache. 
 
- 
CacheOnePassInMemory¶
- 
class paddle::CacheOnePassInMemory¶
- Cache One Pass In Memory strategy. - In the first pass, data is loaded from Python and stored in memory. In the remaining passes, data is loaded from memory. - Inherits from paddle::IPyDataProviderCache - Public Functions - 
CacheOnePassInMemory()¶
 - 
virtual bool reset()¶
- Invoked when DataProvider::reset() is called. - Return
- true if data should be read from Python.
 
 - 
virtual void drop(std::deque<PyObjectPtr> *data)¶
- Invoked when the given data have been used by DataProvider and need to be cleared. - Note
- The implementing class must clear this data array. If you want to delete the PyObjectPtr later, make sure the paddle process has only one active thread calling Python code (use PyGuard otherwise).
- Parameters
- data-- used data. 
 
 
 - 
virtual std::deque<PyObjectPtr> *load()¶
- Return whole data in cache. 
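 
The difference between the two cache strategies can be sketched with a small stand-alone model. This is one possible reading of the descriptions above, not the paddle implementation: SketchItem replaces PyObjectPtr, all class names are hypothetical, and returning nullptr from the no-cache load() is an assumption of this sketch.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>

using SketchItem = std::string;  // hypothetical stand-in for PyObjectPtr

// Hypothetical interface with the same three operations as the cache API.
class SketchCache {
public:
  virtual ~SketchCache() = default;
  virtual bool reset() = 0;  // true => read data from the source again
  virtual void drop(std::deque<SketchItem>* data) = 0;
  virtual std::deque<SketchItem>* load() = 0;
};

// No caching: always re-read, and discard used data immediately.
class SketchNoCache : public SketchCache {
public:
  bool reset() override { return true; }
  void drop(std::deque<SketchItem>* data) override { data->clear(); }
  std::deque<SketchItem>* load() override { return nullptr; }  // assumption
};

// Cache one pass: read from the source in the first pass only, keep every
// used item in memory, and serve later passes from that store.
class SketchOnePassCache : public SketchCache {
public:
  bool reset() override {
    if (firstPass_) {
      firstPass_ = false;
      return true;
    }
    return false;
  }

  void drop(std::deque<SketchItem>* data) override {
    for (auto& item : *data) stored_.push_back(std::move(item));
    data->clear();  // the caller's array must end up empty
  }

  std::deque<SketchItem>* load() override { return &stored_; }

private:
  bool firstPass_ = true;
  std::deque<SketchItem> stored_;
};
```

In both strategies drop() leaves the caller's deque empty, which matches the note that the implementing class must clear the data array.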
 
- 
IPyDataProvider¶
- 
class paddle::PyDataProvider2¶
- 
For usage, please refer to the Python module ‘paddle.trainer.PyDataProvider2’. A separate thread is started to read data, so reading is fully asynchronous. Cache strategies are also supported. Inherits from paddle::DataProvider Public Functions - 
PyDataProvider2(const DataConfig &config, const ModelConfig &modelConfig, bool useGpu)¶
- Ctor 
 - 
virtual ~PyDataProvider2()¶
- Dtor - Note
- will stop loading thread when destructing
 
 - 
virtual void reset()¶
- Resetting the PyDataProvider. May start reading thread here. 
 - 
virtual void shuffle()¶
- Shuffle. Does nothing, because this provider shuffles implicitly by randomly selecting data from the data pool. 
 - 
virtual int64_t getSize()¶
- The size is not limited. 
 - 
virtual int64_t getNextBatchInternal(int64_t size_, DataBatch *batch)¶
- Loading a batch of data. 
 
- 
Proto Data Provider¶
ProtoDataProvider¶
- 
class paddle::ProtoDataProvider¶
- Provides data from a protobuf data file, with each sample specified by a proto message. - DataSample is defined in DataFormat.proto. - The file format is: header, sample1, sample2, ..., sampleN. - Note
- In the data file, each message is prefixed with its length. Reading and writing of the protobuf messages are implemented in ProtoReader.h.
 - Inherits from paddle::DataProvider - Subclassed by paddle::ProtoSequenceDataProvider - Public Functions - 
ProtoDataProvider(const DataConfig &config, bool useGpu, bool loadDataAll = true)¶
 - 
virtual void reset()¶
- Reset all index values. - Note
- reset() must be called before any call to getNextBatch(). IMPORTANT: a subclass's reset() should always call the base class reset() at the end of the function.
 
 - 
virtual int64_t getSize()¶
- Note
- this size includes the sequences which are skipped because they are longer than the batch size.
 
 - 
virtual void shuffle()¶
- Shuffle the data set. 
 - 
void loadData(const std::vector<std::string> &fileList)¶
 - 
virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)¶
- Get the next batch of training samples internally. - Return
- actual size of obtained training samples
- Parameters
- size-- size of training samples to get 
- batch-- a batch of training samples 
 
 
 - Protected Functions - 
void loadData(const std::string &fileName)¶
- Load protobuf data from a list of files. - Parameters
- fileName-- name of a file that contains a list of data file names 
 
 
 - 
void loadDataFile(const std::string &fileName)¶
- load protobuf data from file - Parameters
- fileName-- data file name 
 
 
 - 
void checkDataHeader(const DataHeader &header)¶
- check data header of each data sample - Parameters
- header-- data header read from protobuf data 
 
 
 - 
void fillSlots(const DataSample &sample)¶
- fill protobuf data into slot_, slot_ is a vector of ProtoSlot in memory. - Parameters
- sample-- data sample read from protobuf data 
 
 
 - 
bool iidData() const¶
- return true if each sample is one sequence, i.e., independent of other samples. 
 - 
void checkSample(const DataSample &sample)¶
- check that sample is consistent with header_ 
 - template <class Op>
- 
int64_t sequenceLoop(Op op, int64_t size)¶
 - template <class Op>
- 
int64_t sampleLoop(Op op, int64_t size)¶
 - template <class Op>
- 
int64_t subSampleLoop(Op op, int64_t size, int slot)¶
 - 
void showDataStats()¶
 - Protected Attributes - 
DataHeader header_¶
 - 
int numVecSlots_¶
 - 
size_t sampleNums_¶
 - 
std::vector<size_t> sequenceStartPositions_¶
- The starting position of each sequence in samples. The last element should be num of samples. If empty, each sample is one sequence. 
 - 
int64_t currentSequenceIndex_¶
 - 
std::vector<size_t> shuffledSequenceIds_¶
 - 
ThreadLocalD<DataBatch> cpuBatch_¶
 - 
ThreadLocalD<DataBatch> gpuBatch_¶
 - 
std::vector<StatPtr> nnzStats_¶
 - 
struct ProtoSlot¶
- Public Members - 
SlotDef::SlotType type¶
 - 
int dim¶
 - 
std::vector<int> indexData¶
 - 
std::vector<real> denseData¶
 - 
std::vector<sparse_non_value_t> sparseNonValueData¶
 - 
std::vector<sparse_float_value_t> sparseFloatValueData¶
 - 
std::vector<int64_t> indices¶
 - 
std::vector<int64_t> subIndices¶
 - 
std::vector<ProtoVarSlot> varDenseData¶
 - 
std::vector<std::vector<int>> varIndices¶
 - 
std::vector<std::string> strData¶
 
- 
 - 
struct ProtoVarSlot¶
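 
The data file layout described for ProtoDataProvider (a header followed by samples, each message prefixed with its length) can be illustrated with a generic length-prefixed framing sketch. It makes several assumptions: plain strings stand in for serialized protobuf messages, the prefix is a fixed 4-byte integer in native byte order, and writeFramed, readFramed and readAllFramed are hypothetical helpers. The actual encoding is the one implemented in ProtoReader.h and may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write one message preceded by its byte length (4-byte prefix is an
// assumption of this sketch, not the format used by ProtoReader.h).
inline void writeFramed(std::ostream& out, const std::string& msg) {
  const std::uint32_t len = static_cast<std::uint32_t>(msg.size());
  out.write(reinterpret_cast<const char*>(&len), sizeof(len));
  out.write(msg.data(), len);
}

// Read one length-prefixed message; returns false at end of stream or on a
// truncated message.
inline bool readFramed(std::istream& in, std::string* msg) {
  std::uint32_t len = 0;
  if (!in.read(reinterpret_cast<char*>(&len), sizeof(len))) return false;
  msg->resize(len);
  if (len > 0 && !in.read(&(*msg)[0], len)) return false;
  return true;
}

// Read every message in order: the first is the header, the rest are samples.
inline std::vector<std::string> readAllFramed(std::istream& in) {
  std::vector<std::string> messages;
  std::string msg;
  while (readFramed(in, &msg)) messages.push_back(msg);
  return messages;
}
```

Because every message carries its own length, the reader can walk the file sequentially without any separator bytes between the header and the samples.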
 
ProtoSequenceDataProvider¶
- 
class paddle::ProtoSequenceDataProvider¶
- Special use for Proto data: instances should contain sparse-non-value slots and label. - Note
- ProtoSequenceDataProvider treats each SPARSE SLOT as a SEQUENCE
 - Inherits from paddle::ProtoDataProvider - Public Functions - 
ProtoSequenceDataProvider(const DataConfig &config, bool useGpu, bool loadDataAll = true)¶
 - 
~ProtoSequenceDataProvider()¶
 - 
virtual int64_t getNextBatchInternal(int64_t size, DataBatch *batch)¶
- Get the next batch of training samples internally. - Return
- actual size of obtained training samples
- Parameters
- size-- size of training samples to get 
- batch-- a batch of training samples
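 
The idea of treating each sparse slot as a sequence can be sketched by flattening the slots into one id array plus a start-position array, in the same style as the sequenceStartPositions_ member described for ProtoDataProvider (the last element is the total count). SketchSequences and flattenSparseSlots are hypothetical names, and this is an interpretation of the note above, not the provider's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: all ids concatenated, plus sequence boundaries.
struct SketchSequences {
  std::vector<int> ids;
  std::vector<std::size_t> startPositions;  // last element == ids.size()
};

// Treat each sparse slot (a list of non-zero indices) as one sequence.
inline SketchSequences flattenSparseSlots(
    const std::vector<std::vector<int>>& slots) {
  SketchSequences out;
  for (const auto& slot : slots) {
    out.startPositions.push_back(out.ids.size());  // where this sequence starts
    out.ids.insert(out.ids.end(), slot.begin(), slot.end());
  }
  out.startPositions.push_back(out.ids.size());  // closing boundary
  return out;
}
```

With this layout, sequence i occupies ids[startPositions[i]] up to, but not including, ids[startPositions[i + 1]].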