DataProvider Tutorial
DataProvider is responsible for data management in PaddlePaddle and corresponds to the data layer.
Input Data Format
PaddlePaddle uses slots to describe the data layers of a neural network. One slot describes one data layer. Each slot stores a series of samples, and each sample contains a set of features. A slot has three attributes:
- Dimension: the dimension of the features
- SlotType: PaddlePaddle has 5 slot types. The following table compares the four most commonly used ones.
SlotType | Feature Description | Vector Description |
---|---|---|
DenseSlot | Continuous Features | Dense Vector |
SparseNonValueSlot | Discrete Features without weights | Sparse Vector with all non-zero elements equal to 1 |
SparseValueSlot | Discrete Features with weights | Sparse Vector |
IndexSlot | Mostly the same as SparseNonValueSlot, but intended for a single label | Sparse Vector with only one value in each time step |
The remaining one is StringSlot. It stores character strings and can be used for debugging or to describe the data ID for prediction, etc.
- SeqType: a sequence is a sample whose features are expanded along the time dimension. A sub-sequence is a contiguous ordered subset of a sequence. For example, (a1, a2) and (a3, a4, a5) are two sub-sequences of the sequence (a1, a2, a3, a4, a5). PaddlePaddle has 3 sequence types:
- NonSeq: the input sample is not a sequence
- Seq: the input sample is a sequence without sub-sequences
- SubSeq: the input sample is a sequence with sub-sequences
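To make the slot types and sequence types above more concrete, here is a minimal sketch in plain Python. It does not use the PaddlePaddle API. The helper names (`to_dense`, `nesting_depth`), the index/value representation of sparse samples, and the use of nested lists for sequences are all assumptions made purely for illustration.

```python
# Illustration only: plain-Python stand-ins, not PaddlePaddle classes.

def to_dense(dim, indices, values=None):
    """Expand a sparse sample into a dense list of length `dim`.

    With values=None every listed index gets weight 1.0, which mirrors the
    SparseNonValueSlot description. Otherwise indices[i] gets values[i],
    which mirrors the SparseValueSlot description.
    """
    dense = [0.0] * dim
    if values is None:
        values = [1.0] * len(indices)
    for idx, val in zip(indices, values):
        dense[idx] = val
    return dense


dim = 5
dense_sample = [0.1, 0.0, 2.5, 0.0, 1.0]  # DenseSlot: every dimension is stored
sparse_non_value = [0, 3]                 # SparseNonValueSlot: active indices only
sparse_value = ([0, 3], [0.5, 2.0])       # SparseValueSlot: indices plus weights
index_sample = 2                          # IndexSlot: one index, e.g. a class label

print(to_dense(dim, sparse_non_value))
print(to_dense(dim, *sparse_value))
print(to_dense(dim, [index_sample]))

# Sequence types, reusing the sequence (a1, ..., a5) from the text.
# Strings stand in for the per-time-step features.
non_seq = "a1"                                   # NonSeq: a single sample
seq = ["a1", "a2", "a3", "a4", "a5"]             # Seq: one level of time steps
sub_seq = [["a1", "a2"], ["a3", "a4", "a5"]]     # SubSeq: sequence of sub-sequences


def nesting_depth(sample):
    """Return 0 for NonSeq, 1 for Seq, 2 for SubSeq in this toy encoding."""
    depth = 0
    while isinstance(sample, list):
        depth += 1
        if not sample:
            break
        sample = sample[0]
    return depth


print(nesting_depth(non_seq), nesting_depth(seq), nesting_depth(sub_seq))
```

In this toy encoding, concatenating the sub-sequences of `sub_seq` in order gives back `seq`, which matches the definition of a sub-sequence as a contiguous ordered subset.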
Python DataProvider
PyDataProviderWrapper is a Python decorator in PaddlePaddle used to read a custom Python DataProvider class. It currently supports all SlotTypes and SeqTypes of input data. Users only need to be concerned with how to read samples from a file. See its Use Case and API Reference for details.
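The general pattern of a decorator that wraps a user-written sample reader can be sketched in plain Python as follows. This is not the PyDataProviderWrapper API: the decorator `toy_provider`, the `slots` argument format, the `read_samples` function, and the one-sample-per-line file layout are all hypothetical and exist only to show the division of labor, where the user writes the file-reading logic and the wrapper handles the rest.

```python
import os
import tempfile


def toy_provider(slots):
    """Hypothetical decorator: records slot declarations and checks each sample."""
    def decorate(read_fn):
        def wrapped(file_name):
            for sample in read_fn(file_name):
                # Each sample must supply exactly one entry per declared slot.
                if len(sample) != len(slots):
                    raise ValueError(
                        "expected %d slots, got %d" % (len(slots), len(sample))
                    )
                yield sample
        wrapped.slots = slots
        return wrapped
    return decorate


# Assumed declaration: a dense slot of dimension 3 and an index slot with 2 classes.
@toy_provider(slots=[("dense", 3), ("index", 2)])
def read_samples(file_name):
    # User-written part: parse one sample per line in the form "label f1 f2 f3".
    with open(file_name) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            label = int(fields[0])
            features = [float(x) for x in fields[1:]]
            yield [features, label]


def demo():
    """Write a small temporary data file and read it back through the wrapper."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("1 0.5 0.0 2.0\n0 1.5 3.0 0.25\n")
        path = tmp.name
    try:
        return list(read_samples(path))
    finally:
        os.remove(path)


print(demo())
```

The reader is a generator, so samples are produced one at a time and the whole file never has to be held in memory. For the real decorator and its slot classes, refer to the Use Case and API Reference.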