add custom_dataset_reader.md

c6fc563b · xujiaqi01 · 25fbe519 · c6fc563b

Showing with 70 addition and 0 deletion

doc/custom_dataset_reader.md doc/custom_dataset_reader.md +70 -0

--- a/doc/custom_dataset_reader.md
+++ b/doc/custom_dataset_reader.md
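+# PaddleRec Recommended Dataset Format
+If your dataset follows the [slot:feasign]* pattern, or can be preprocessed into it, you can use PaddleRec's built-in Reader directly.
+This means you do not have to write your own Reader, and all models can share one unified data format.
+## Data format
+Suppose your raw data has the following format: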
+```bash
+<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
+```
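+Here ```<label>``` indicates whether the ad was clicked: 1 means clicked and 0 means not clicked. ```<integer feature>``` denotes a numerical (continuous) feature; there are 13 continuous features,
+and each of them has exactly one value.
+```<categorical feature>``` denotes a categorical (discrete) feature; there are 26 discrete features. Adjacent features are separated by ```\t```.
+Suppose the 13 continuous features (dense slots) are named as follows: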
+```
+D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13
+```
+and the 26 discrete features (sparse slots) are named as follows:
+```
+S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20 S21 S22 S23 S24 S25 S26
+```
+Then the following sample (1 label + 13 dense values + 26 feasigns)
+```
+1 0.1 0.4 0.2 0.3 0.5 0.8 0.3 0.2 0.1 0.5 0.6 0.3 0.9 60 16 91 50 52 52 28 69 63 33 87 69 48 59 27 12 95 36 37 41 17 3 86 19 88 60
+```
+can be converted to:
+```
+label:1 D1:0.1 D2:0.4 D3:0.2 D4:0.3 D5:0.5 D6:0.8 D7:0.3 D8:0.2 D9:0.1 D10:0.5 D11:0.6 D12:0.3 D13:0.9 S1:60 S2:16 S3:91 S4:50 S5:52 S6:52 S7:28 S8:69 S9:63 S10:33 S11:87 S12:69 S13:48 S14:59 S15:27 S16:12 S17:95 S18:36 S19:37 S20:41 S21:17 S22:3 S23:86 S24:19 S25:88 S26:60
+```
+Note: the order of the slot:feasign fields above does not matter. For example, ```D1:0.1 D2:0.4``` can also be written as ```D2:0.4 D1:0.1```.
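
The conversion described above can be sketched in a few lines of Python. This is an illustrative helper, not part of PaddleRec: the function name `to_slot_feasign` and the two slot-name lists are assumptions, and the sketch assumes every raw line has all 40 fields present (empty fields would need extra handling).

```python
# Hypothetical slot names, following the D1..D13 / S1..S26 naming used in this doc.
DENSE_SLOTS = [f"D{i}" for i in range(1, 14)]
SPARSE_SLOTS = [f"S{i}" for i in range(1, 27)]


def to_slot_feasign(line: str) -> str:
    """Convert one raw sample line into space-separated slot:feasign pairs."""
    fields = line.split()  # splits on tabs as well as spaces
    names = ["label"] + DENSE_SLOTS + SPARSE_SLOTS
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    # Pair each value with its slot name, keeping the original column order.
    return " ".join(f"{name}:{value}" for name, value in zip(names, fields))


if __name__ == "__main__":
    raw = ("1 0.1 0.4 0.2 0.3 0.5 0.8 0.3 0.2 0.1 0.5 0.6 0.3 0.9 "
           "60 16 91 50 52 52 28 69 63 33 87 69 48 59 27 12 95 36 37 "
           "41 17 3 86 19 88 60")
    print(to_slot_feasign(raw))
```

Because field order is not significant in the target format, the helper simply keeps the column order of the raw data.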
+## Configuration
+The reader must be configured with ```sparse_slots``` and ```dense_slots```, for example:
+```
+  workspace: xxxx
+  reader:
+    batch_size: 2
+    train_data_path: "{workspace}/data/train_data"
+    sparse_slots: "label S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18 S19 S20 S21 S22 S23 S24 S25 S26"
+    dense_slots: "D1:1 D2:1 D3:1 D4:1 D5:1 D6:1 D7:1 D8:1 D9:1 D10:1 D11:1 D12:1 D13:1"
+  model:
+    xxxxx
+```
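+sparse_slots is the list of sparse features, separated by spaces.
+dense_slots is the list of dense features, separated by spaces. Each field has the form ```[dense_slot_name]:[dim1,dim2,dim3...]```, where ```dim1,dim2,dim3...``` gives the shape.
+Once this is configured, the variables for these slots can be accessed in the model through the following attributes: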
+```
+self._sparse_data_var
+self._dense_data_var
+```
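
To sanity-check a `dense_slots` string before training, a small parser along these lines may help. It is only a sketch of the `[dense_slot_name]:[dim1,dim2,dim3...]` format described above, not PaddleRec's own parsing code; the function name `parse_dense_slots` is hypothetical, and rejecting duplicate names is an assumption of this sketch rather than documented PaddleRec behaviour.

```python
def parse_dense_slots(spec: str) -> dict:
    """Map each dense slot name to its shape as a list of ints."""
    shapes = {}
    for field in spec.split():
        name, _, dims = field.partition(":")
        if not name or not dims:
            raise ValueError(f"malformed dense slot field: {field!r}")
        if name in shapes:
            # A repeated name is most likely a typo in the config string.
            raise ValueError(f"duplicate dense slot: {name}")
        shapes[name] = [int(d) for d in dims.split(",")]
    return shapes


if __name__ == "__main__":
    print(parse_dense_slots("D1:1 D2:1 D3:2,3"))
```

A check like this catches mistakes such as a slot name listed twice, which are easy to miss in a long config string.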
 # PaddleRec Custom Datasets and Readers
 ## Introduction to dataset and reader configuration