Online prediction fails when the data volume is too large
Created by: ziyenano
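When the data volume is small, prediction works with a small number of nodes (num_nodes = 5), but in the output the files rank-00000 through rank-00004 all contain identical data. With the same script and configuration, when the amount of data to predict is very large, I have hit the following failures so far: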
- Job Status Series: epilogue finished, epilogue started, user_main failed (Killed), job will be killed because agent(16) lost, job will be killed because agent(23) lost, job will be killed because agent(19) lost, user_main started, prologue finished, prologue started, Ready
- epilogue finished, epilogue started, user_main failed (Status:kTaskStatusExited), user_main started, prologue finished, prologue started, Ready
- No error is reported, but the output stays empty
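To make the duplicate-output and empty-output symptoms easier to confirm, a check along these lines could be run on a local copy of the output directory. This is only a sketch: the `rank-*` file name pattern, the directory path, and the helper names `file_digest` and `inspect_rank_outputs` are assumptions for illustration, not part of the prediction framework.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def inspect_rank_outputs(output_dir: str, pattern: str = "rank-*") -> dict:
    """Report empty rank files and groups of byte-identical rank files.

    Assumes (hypothetically) that each node writes one file matching
    `pattern` directly under `output_dir`.
    """
    empty = []
    by_digest = defaultdict(list)
    for path in sorted(Path(output_dir).glob(pattern)):
        if not path.is_file():
            continue
        if path.stat().st_size == 0:
            empty.append(path.name)
        else:
            by_digest[file_digest(path)].append(path.name)
    # Keep only digests shared by more than one file.
    duplicates = [names for names in by_digest.values() if len(names) > 1]
    return {"empty": empty, "duplicates": duplicates}
```

Calling `inspect_rank_outputs("/path/to/output")` (placeholder path) returns a dict with an `empty` list and a `duplicates` list of file-name groups, which can be pasted into the issue alongside the job status series.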