The number of samples actually tested during training differs greatly from the number of test samples specified in the configuration
Created by: weiyuze
The test data is specified via test_data_path; it contains a little over 30 million samples:
cluster_config(
fs_name="",
fs_ugi="",
train_data_path="",
test_data_path="",
output_path="",
)
The log from loading the test data shows 33622 samples loaded for one part, and there are 1000 parts of this size in total:
Fri Jun 16 18:48:49 2017[1,28]<stdout>:0 inst loaded
Fri Jun 16 18:48:49 2017[1,43]<stdout>:33622 all data loaded
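One way to check how much test data was really loaded is to sum the counts from every "all data loaded" line across the node logs. The following is a minimal sketch, assuming each loader prints a line in the same format as the one quoted above; the function name total_loaded and the regular expression are illustrative, not part of PaddlePaddle.

```python
import re

# Pattern inferred from the log line quoted above, e.g.
# "...<stdout>:33622 all data loaded" (assumed to be stable across nodes).
LOADED_RE = re.compile(r"<stdout>:(\d+) all data loaded")


def total_loaded(log_lines):
    """Return (matching line count, sum of reported sample counts)."""
    counts = []
    for line in log_lines:
        match = LOADED_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return len(counts), sum(counts)


sample_log = [
    "Fri Jun 16 18:48:49 2017[1,28]<stdout>:0 inst loaded",
    "Fri Jun 16 18:48:49 2017[1,43]<stdout>:33622 all data loaded",
]
parts, samples = total_loaded(sample_log)
print(parts, samples)
```

Running this over the complete logs would show whether the loaders together report roughly 30 million samples or something closer to the count the tester prints.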
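However, the log printed during training shows that only 1110000 samples took part in testing, far fewer than the specified test set. What is causing this? I checked the data under test_data_path in the workspace directory, and it matches the test data specified in the configuration.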
I0616 19:02:43.990978 28735 Tester.cpp:111] Test samples=1110000 cost=0.460042 Eval: classification_error_15min=0.0386514 positive_label=0 precision=0.974028 recall=0.992194 F1-score=0.983027 positive_label=1 precision=0.566085 recall=0.324633 F1-score=0.412633 positive_label=2 precision=0.653608 recall=0.544981 F1-score=0.594372 classification_error_30min=0.0415784 positive_label=0 precision=0.971718 recall=0.991218 F1-score=0.981371 positive_label=1 precision=0.538024 recall=0.308805 F1-score=0.392392 positive_label=2 precision=0.605387 recall=0.454104 F1-score=0.518945 classification_error_45min=0.0431766 positive_label=0 precision=0.971818 recall=0.989658 F1-score=0.980657 positive_label=1 precision=0.510215 recall=0.329872 F1-score=0.400686 positive_label=2 precision=0.586656 recall=0.41121 F1-score=0.483509 classification_error_60min=0.0428378 positive_label=0 precision=0.97073 recall=0.990619 F1-score=0.980574 positive_label=1 precision=0.529627 recall=0.300821 F1-score=0.383704 positive_label=2 precision=0.579775 recall=0.427326 F1-score=0.492012
I0616 19:02:43.991206 28735 GradientMachine.cpp:112] Saving parameters to ./output/pass-00000-001
I0616 19:02:43.996541 28735 Util.cpp:213] copy conf/trainer_config.conf to ./output/pass-00000-001