Trainers cannot fetch new tasks after some trainers have been scaled down
Created by: Yancey1989
The master logs are as follows:
t=2017-11-01T09:12:39+0000 lvl=warn msg="No more available task." todoLen=0 pendingLen=3 doneLen=3 failedLen=0 curPass=2 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:12:40+0000 lvl=warn msg="No more available task." todoLen=0 pendingLen=3 doneLen=3 failedLen=0 curPass=2 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:12:42+0000 lvl=warn msg="No more available task." pendingLen=3 doneLen=3 failedLen=0 curPass=2 todoLen=0 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:13:10+0000 lvl=warn msg="No more available task." todoLen=0 pendingLen=3 doneLen=3 failedLen=0 curPass=2 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:21:25+0000 lvl=warn msg="No more available task." todoLen=0 pendingLen=3 doneLen=3 failedLen=0 curPass=2 stack=[github.com/PaddlePaddle/Paddle/go/master/service.go:389]
t=2017-11-01T09:27:29+0000 lvl=warn msg="Task failed, re-dispatch." task="{Meta:{ID:8674665223307489707 Epoch:3} Chunks:[{Path:/data/mnist/mnist-train-00040 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00041 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00042 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00043 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00044 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00045 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00046 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00047 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00048 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}} {Path:/data/mnist/mnist-train-00049 Index:{ChunkOffsets:[0] ChunkLens:[1000] NumRecords:1000 ChunkRecords:[1000]}}]}" num failed=1 stack="[github.com/PaddlePaddle/Paddle/go/master/service.go:336 github.com/PaddlePaddle/Paddle/go/master/service.go:351]"
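It looks like scaling down leaves the tasks held by the removed trainers in the pending queue until they time out. While those tasks are pending (pendingLen=3, todoLen=0), the remaining trainers cannot fetch a new task. A task only seems to become available again once the master marks it as failed and re-dispatches it, as in the last log line.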