New checkpoint (#3540)
* flow.load/save/get_all_variables without large tensor and multi machine support
* add lazy blob cache and disable blob_cache after writing
* update checkpoint to call the potential slice_assign and read_slice_from_blob method
* reformat
* new checkpoint supports eager
* split mut bn into mutable input bn and output bn
* work in eager mode. deprecate checkpoint.init()
* slice_assign implementation
* new slice op
* check step > 0, add more tests, refine the code
* revert the initializer changes
* remove print
* set y to 0 for partialsum
* check sbp, fix incorrect attr check
* add more tests
* rename slice2->logical_slice
* update tests
* extract common python code into a function
* get_size_in_slice -> GetSizeInSlice, rm unused test file
* minor update about step > 0
* minor update on tests
* add WITH_CUDA guard
* integrate with logical slice/slice_assign
* set scope according to variable op_conf
* initial support of stream init
* read_slice_from_blob/as_numpy return nd_idx and set the cpu:0 placement for created variable
* extract a 'for_every_slice' function
* initializer registration
* one meta file per variable
* remove mis-added file
* code clean
* create model io jobs only if legacy model io enabled, update legacy api
* add legacy model io test
* slice operation optimization
* add and update tests
* barrier for multi node eager
* make sync as a cluster instruction
* update test
* fix life cycle problem
* add python api vm_util.Sync()
* make initializer receive a random_seed
* Add vm_util.Sync(), remove debug code
* resolve TODO, remove __repr__ for now
* use compiled op_conf for getting random_seed
* UserOpAttrVal -> AttrValue, remove debug code
* test another dtype
* remove mis-added )
* fix dtype error when shape[axis+1:] is empty
* add initializers to check_point
* code clean, enable a temporary default checkpoint for test
* move legacy implementation to deprecated/
* update deprecated implementation
* fix bug in eager, add eager tests and some other minor updates
* remove name field in FileBackendBlob, update Load for single variable, and some other minor updates
* remove mis-added file
* move initializer implementation, some minor changes
* disable some bn tests missing checkpoint.init()
* fix dtype conversion bug
* relax the tolerance of layer_norm test
* reformat
* minor code clean
* use new pybind11 eager sync api
* add assignment between memory test
* disable optimizers test for now
* code clean
* reuse CreateEagerVariableBlob
* remove mis-added file
* unify two read slice function
* minor code clean
* add initializer_updated to check_point.py
* fix typo
* resolve merge conflict
* restore bn tests
* add type annotations, add some comments and minor code clean
* add some comments, remove 'need_root_path' parameter
* fixup
* get parallel_conf from job_set instead of op_attribute
* disable two tests involving legacy model io in eager mode
* add InitialzierImpl
* add InitializerImpl
* support load from numpy array, add test
* rename and format
* Add necessary docs and TODO, improve warning message
* ParallelConf4InterfaceOpName->ParallelConf4LazyInterfaceOpName
* address some comments
* rename api
* fix problems
* add test_initializer.py
* remove unused initializers
* remove quantinfo, move new checkpoint to check_point_v2.py
* fix crash on checkpoint.init()
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* restore optimizer test
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* Add GetOpAttributes api
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* restore ParallelConf4LazyOp as parallel desc symbol id in op attr doesn't align with that in job set
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* Add TestResumeTraining, shrink the large model size
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* restore 2n4c ci test
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* code clean
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* add snapshot_done
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* add test_mixed_model, update test_load_numpy
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* add flow.sync_default_session in api implementation
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* change the default value of ignore_mismatch from False to True to align with existing behavior
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* fix wrong initializer in test_mseloss.py and test_bce_loss.py
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* ForEachOpNode -> ForEachNode
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>
* fix test_partially_load_numpy
  Signed-off-by: Ndaquexian <daquexian566@gmail.com>

Co-authored-by: Nwanghongsheng <2496533749@qq.com>
Co-authored-by: Ncheng cheng <472491134@qq.com>
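
Several entries above (read_slice_from_blob, slice_assign, 'for_every_slice', nd_idx) suggest that large variables are read and written one slice at a time. The following is only a rough illustration of that idea in plain Python, not code from this commit: the name `iter_axis0_slices` and the parameter `max_elems_per_slice` are hypothetical, and the sketch assumes slicing along axis 0 only, which may differ from what the actual implementation does.

```python
from typing import Iterator, Sequence, Tuple

NdIdx = Tuple[int, ...]


def iter_axis0_slices(
    shape: Sequence[int], max_elems_per_slice: int
) -> Iterator[Tuple[NdIdx, NdIdx]]:
    """Yield (start_nd_idx, stop_nd_idx) pairs that tile `shape` along axis 0.

    Hypothetical helper for illustration; not part of the OneFlow API.
    """
    if not shape:
        raise ValueError("shape must have at least one dimension")
    if max_elems_per_slice <= 0:
        raise ValueError("max_elems_per_slice must be positive")

    # Elements in one "row", i.e. the product of shape[1:].
    # This is 1 when shape[1:] is empty (a 1-D variable).
    row_elems = 1
    for dim in shape[1:]:
        row_elems *= dim

    # Always take at least one row, even if a single row exceeds the budget.
    rows_per_slice = max(1, max_elems_per_slice // max(1, row_elems))

    tail_start = (0,) * (len(shape) - 1)
    tail_stop = tuple(shape[1:])
    for row in range(0, shape[0], rows_per_slice):
        stop_row = min(row + rows_per_slice, shape[0])
        yield (row,) + tail_start, (stop_row,) + tail_stop


# A (10, 4) variable with a budget of 12 elements is covered 3 rows at a time.
slices = list(iter_axis0_slices((10, 4), max_elems_per_slice=12))
for start, stop in slices:
    print(start, stop)
```

Under these assumptions, a loader would read or assign each `(start, stop)` region in turn, so only one slice of the variable has to be held in host memory at a time.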