From 2375bf2c73dc5d796eca93978f487e24c39e2676 Mon Sep 17 00:00:00 2001
From: sandyhouse
Date: Wed, 12 Jan 2022 20:51:22 +0800
Subject: [PATCH] update

---
 .../images/data_parallel.png                  | Bin 0 -> 13429 bytes
 .../test_train_fleet_infer_python.md          | 170 ++++++++++++------
 2 files changed, 120 insertions(+), 50 deletions(-)
 create mode 100644 tutorials/tipc/train_fleet_infer_python/images/data_parallel.png

diff --git a/tutorials/tipc/train_fleet_infer_python/images/data_parallel.png b/tutorials/tipc/train_fleet_infer_python/images/data_parallel.png
new file mode 100644
index 0000000000000000000000000000000000000000..3bbf67714e53458c9bd70bd75e9391b7f91f126c
GIT binary patch
literal 13429
+
+
+As the figure above (`images/data_parallel.png`) shows, compared with ordinary single-machine, single-GPU training, PaddlePaddle distributed training code needs only three additions:
+
+1. Import the packages required for distributed training
+
+2. Initialize the distributed environment
+
+3. Wrap the user's network with DataParallel
+
 The sections below explain each step in turn.
-
--  - 2.1 Import dependencies
-
-    Import the necessary dependencies, such as the Fleet API dedicated to distributed training (paddle.distributed.fleet).
-
-    ```python
-    from paddle.distributed import fleet
-    ```
-
--  - 2.2 Initialize the distributed environment
-
-    This involves defining the default distributed strategy and then setting the is_collective parameter to True.
-
-    ```python
-    strategy = fleet.DistributedStrategy()
-    fleet.init(is_collective=True, strategy=strategy)
-    ```
-
--  - 2.3 Wrap the user's network with DataParallel
-
-
-
-    ```python
-    model = paddle.DataParallel(model)
-    ```
-
-- [3. Multi-machine multi-GPU inference development](#3---)
-
-    Because every card in data-parallel training holds a complete copy of the model, saving the model from a single card is enough for inference. Typically, the model on the first card is the one saved.
+
+- 2.1 Import dependencies
 
     ```python
-    if fleet.worker_index() == 0:
-        # save inference model
+    import paddle.distributed as dist
     ```
+
+- 2.2 Initialize the distributed environment
+    ```python
+    dist.init_parallel_env()
+    ```
+
+- 2.3 Wrap the user's network with DataParallel
+    ```python
+    model = paddle.DataParallel(model)
+    ```
 
-- [4. FAQ](#4)
+Assume the user's training script is named train.py. The following explains how to launch a distributed training job.
+
+1. Launch a single-machine multi-GPU job
+
+    On a single machine with multiple GPUs, launch the distributed training job with the following command:
+
+    ```shell
+    python -m paddle.distributed.launch --gpus="0,1" train.py
+    ```
+
+    Here, the ``--gpus`` option specifies which GPU cards the distributed training job uses.
+
+2. Launch a multi-machine multi-GPU job
+
+    We use two machines as an example to show how to launch a multi-machine multi-GPU distributed training job. Assume the IP addresses of the two machines are 192.168.0.1 and 192.168.0.2.
+
+    First, make sure the two machines can reach each other over the network. The ``ping`` command verifies connectivity between them, as shown below:
+
+    ```shell
+    # On the machine with IP address 192.168.0.1
+    ping 192.168.0.2
+    ```
+
+    Next, launch the distributed job on each of the two machines:
+
+    ```shell
+    # On the machine with IP address 192.168.0.1
+    python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
+    ```
+
+    ```shell
+    # On the machine with IP address 192.168.0.2
+    python -m paddle.distributed.launch --ips="192.168.0.1,192.168.0.2" --gpus="0,1" train.py
+    ```
+
+    After the commands above start, the console prints output similar to the following:
+
+    ```shell
+    WARNING 2021-01-04 17:59:08,725 launch.py:314] Not found distinct arguments and compiled with cuda. Default use collective mode
+    launch train in GPU mode
+    INFO 2021-01-04 17:59:08,727 launch_utils.py:472] Local start 2 processes. First process distributed environment info (Only For Debug):
+    +=======================================================================================+
+    |                        Distributed Envs                     Value                     |
+    +---------------------------------------------------------------------------------------+
+    |                 PADDLE_CURRENT_ENDPOINT                127.0.0.1:17901                |
+    |                     PADDLE_TRAINERS_NUM                       2                       |
+    |                PADDLE_TRAINER_ENDPOINTS        127.0.0.1:17901,127.0.0.1:18846        |
+    |                     FLAGS_selected_gpus                       0                       |
+    |                       PADDLE_TRAINER_ID                       0                       |
+    +=======================================================================================+
+
+    ...
+    W0104 17:59:19.018365 43338 device_context.cc:342] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
+    W0104 17:59:19.022523 43338 device_context.cc:352] device: 0, cuDNN Version: 7.4.
+    W0104 17:59:23.193490 43338 fuse_all_reduce_op_pass.cc:78] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 5.
+
+
+    ```
+
+When a distributed job is launched with the paddle.distributed.launch module, all logs are saved under the ./log directory. Log files are named workerlog.xx, where xx is an integer, and each per-card training process has its own log file.
+
+## 3. [Multi-machine multi-GPU inference development](#3)
+
+Because every card in data-parallel training holds a complete copy of the model, saving the model from a single card is enough for inference. Typically, the model on the first card is the one saved.
+
+```python
+if paddle.distributed.get_rank() == 0:  # rank 0 is the first card
+    pass  # save inference model here
+```
+
+For more information about inference, see the [Linux GPU/CPU model inference development guide](../train_infer_python/infer_python.md).
+
+## 4. [FAQ](#4)
+
+- Q: When the program reports an error, how do I troubleshoot it?
+
+- A: Start with the logs and check whether they pinpoint the error, for example insufficient GPU memory (OOM).
+
+- Q: If the program hangs, how do I find the cause?
+
+- A: Hangs are usually caused by communication problems, such as two processes falling out of sync: one process waits to synchronize data A while the other waits to synchronize data B, so the program hangs. The usual approach is to locate where the process hangs and then analyze the specific cause. To see which operator is executing when the program hangs, set the following environment variables: `export GLOG_v=3; export FLAGS_benchmark=1`.
+
+- Q: The program fails with NCCL-related errors. How do I find the cause?
+
+- A: Set the following environment variable to see the error details: `export NCCL_DEBUG=INFO`. Pay particular attention to messages marked NCCL WARN.
-- 
GitLab
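
The FAQ's first troubleshooting step, reading the per-process logs, can be sketched in plain Python. This is only an illustrative helper, not part of Paddle: the function name `find_nccl_warnings` and its parameters are assumptions made for this example, while the `./log` directory, the `workerlog.xx` file names, and the `NCCL WARN` marker come from the tutorial text.

```python
from pathlib import Path


def find_nccl_warnings(log_dir="./log", needle="NCCL WARN"):
    """Return (file name, line number, text) for every log line containing `needle`.

    Hypothetical helper: it only assumes the ./log/workerlog.N layout
    that the tutorial describes for paddle.distributed.launch.
    """
    hits = []
    # One workerlog.N file is expected per training process.
    for path in sorted(Path(log_dir).glob("workerlog.*")):
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((path.name, number, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, number, text in find_nccl_warnings():
        print(f"{name}:{number}: {text}")
```

Passing a different `needle`, such as an out-of-memory message, reuses the same scan for the other failure mode the FAQ mentions.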