diff --git a/deep_speech_2/README.md b/deep_speech_2/README.md
old mode 100644
new mode 100755
index 9c2a0872bffd70006885c79376963146fda16b27..263339415c8a119db5e61dec5aaba23c5c8ab57a
--- a/deep_speech_2/README.md
+++ b/deep_speech_2/README.md
@@ -1,4 +1,4 @@
-# Deep Speech 2 on PaddlePaddle
+# DeepSpeech2 on PaddlePaddle
 
 ## Installation
 
@@ -161,3 +161,9 @@ python demo_client.py
 On the client console, press and hold the "white-space" key on the keyboard to start talking, until you finish your speech and then release the "white-space" key. The decoding results (infered transcription) will be displayed.
 
 It could be possible to start the server and the client in two seperate machines, e.g. `demo_client.py` is usually started in a machine with a microphone hardware, while `demo_server.py` is usually started in a remote server with powerful GPUs. Please first make sure that these two machines have network access to each other, and then use `--host_ip` and `--host_port` to indicate the server machine's actual IP address (instead of the `localhost` as default) and TCP port, in both `demo_server.py` and `demo_client.py`.
+
+
+## PaddleCloud Training
+
+If you wish to train DeepSpeech2 on PaddleCloud, please refer to
+[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/models/tree/develop/deep_speech_2/cloud).
diff --git a/deep_speech_2/cloud/README.md b/deep_speech_2/cloud/README.md
old mode 100644
new mode 100755
index 8e7e49f9ea75a56d84431f80a87565fbec62bbb2..a5be1c420880d4f32d472cdd23124cbf35033094
--- a/deep_speech_2/cloud/README.md
+++ b/deep_speech_2/cloud/README.md
@@ -1,81 +1,63 @@
-# Run DS2 on PaddleCloud
+# Train DeepSpeech2 on PaddleCloud
 
 >Note:
->Make sure [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has be installed and current directory is `models/deep_speech_2/cloud/`
+>Please make sure [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and the current directory is `deep_speech_2/cloud/`
 
-## Step-1 Configure data set
+## Step 1: Upload Data
 
-Configure your input data and output path in pcloud_submit.sh:
+Given several input manifests, `pcloud_upload_data.sh` packs and uploads all the audio files they list to the PaddleCloud filesystem, and generates corresponding manifest files with updated cloud paths.
 
-- `TRAIN_MANIFEST`: Absolute path of train data manifest file in local file system.This file has format as bellow:
+Please modify the following arguments in `pcloud_upload_data.sh`:
+
+- `IN_MANIFESTS`: Paths (in local filesystem) of manifest files containing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimiter.
+- `OUT_MANIFESTS`: Paths (in local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimiter. The values of `audio_filepath` in the output manifests are updated with cloud filesystem paths.
+- `CLOUD_DATA_DIR`: Directory (in PaddleCloud filesystem) to upload the data to. Don't forget to replace `USERNAME` in the default directory and make sure that you have permission to write to it.
+- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading data. A smaller `NUM_SHARDS` requires more temporary local disk space for packing data.
+
+By running:
 
 ```
-{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac", "duration": 5.855, "text
-": "mister quilter is the ..."}
-{"audio_filepath": "/home/disk1/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac", "duration": 4.815, "text
-": "nor is mister ..."}
+sh pcloud_upload_data.sh
 ```
+all the audio files will be uploaded to the PaddleCloud filesystem, and you will get modified manifest files in `OUT_MANIFESTS`.
 
-- `TEST_MANIFEST`: Absolute path of train data manifest file in local filesystem. This file has format like `TRAIN_MANIFEST`.
-- `VOCAB_FILE`: Absolute path of vocabulary file in local filesytem.
-- `MEAN_STD_FILE`: Absolute path of normalizer's statistic file in local filesytem.
-- `CLOUD_DATA_DIR:` Absolute path in PaddleCloud filesystem. We will upload local train data to this directory.
-- `CLOUD_MODEL_DIR`: Absolute path in PaddleCloud filesystem. PaddleCloud trainer will save model to this directory.
+You only need to take this step once, the first time you run cloud training. Later on, the data is persistent on the cloud filesystem and reusable for further job submissions.
 
->Note: Upload will be skipped if target file has existed in `CLOUD_DATA_DIR`.
+## Step 2: Configure Training
 
-## Step-2 Configure computation resource
+Configure cloud training arguments in `pcloud_submit.sh`, with the following arguments:
 
-Configure computation resource in pcloud_submit.sh:
+- `TRAIN_MANIFEST`: Manifest filepath (in local filesystem) for training. Notice that the `audio_filepath` should be in cloud filesystem, like those generated by `pcloud_upload_data.sh`.
+- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation.
+- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints). Don't forget to replace `USERNAME` in the default directory and make sure that you have permission to write to it.
+- `BATCH_SIZE`: Training batch size for a single node.
+- `NUM_GPU`: Number of GPUs allocated for a single node.
+- `NUM_NODE`: Number of nodes (machines) allocated for this job.
+- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes.
 
-```
-# Configure computation resource and submit job to PaddleCloud
- paddlecloud submit \
- -image wanghaoshuang/pcloud_ds2:latest \
- -jobname ${JOB_NAME} \
- -cpu 4 \
- -gpu 4 \
- -memory 10Gi \
- -parallelism 1 \
- -pscpu 1 \
- -pservers 1 \
- -psmemory 10Gi \
- -passes 1 \
- -entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR}" \
- ${DS2_PATH}
-```
-For more information, please refer to [PaddleCloud](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务)
+Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as you would for local training.
 
-## Step-3 Configure algorithm options
-Configure algorithm options in pcloud_train.sh:
-```
-python train.py \
---use_gpu=1 \
---trainer_count=4 \
---batch_size=256 \
---mean_std_filepath=$MEAN_STD_FILE \
---train_manifest_path='./local.train.manifest' \
---dev_manifest_path='./local.test.manifest' \
---vocab_filepath=$VOCAB_PATH \
---output_model_dir=${MODEL_PATH}
-```
-You can get more information about algorithm options by follow command:
-```
-cd ..
-python train.py --help
-```
+By running:
 
-## Step-4 Submit job
 ```
-$ sh pcloud_submit.sh
+sh pcloud_submit.sh
 ```
+you submit a training job to PaddleCloud. The job name is printed when the submission is done.
+
+
+## Step 3: Get Job Logs
+Run this to list all the jobs you have submitted, as well as their running status:
 
-## Step-5 Get logs
 ```
-$ paddlecloud logs -n 10000 deepspeech20170727130129
+paddlecloud get jobs
 ```
-For more information, please refer to [PaddleCloud client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#下载并配置paddlecloud) or get help by follow command:
+
+Run this to print the logs of the corresponding job:
 ```
-paddlecloud --help
+paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
 ```
 
+
+## More Help
+
+For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
diff --git a/deep_speech_2/cloud/pcloud_submit.sh b/deep_speech_2/cloud/pcloud_submit.sh
index 3a64f32e245d90eb88af1c36cc2726551e03a9f7..a7fb42cbc4fd69d2861bff4be5a6351ff0e69b58 100644
--- a/deep_speech_2/cloud/pcloud_submit.sh
+++ b/deep_speech_2/cloud/pcloud_submit.sh
@@ -1,32 +1,11 @@
-# Configure input data set in local filesystem
-TRAIN_MANIFEST="../datasets/manifest.train"
-DEV_MANIFEST="../datasets/manifest.dev"
-VOCAB_FILE="../datasets/vocab/eng_vocab.txt"
-MEAN_STD_FILE="../mean_std.npz"
-# Configure output path in PaddleCloud filesystem
-CLOUD_DATA_DIR="/pfs/dlnel/home/sunxinghai@baidu.com/deepspeech2/data"
-CLOUD_MODEL_DIR="/pfs/dlnel/home/sunxinghai@baidu.com/deepspeech2/model"
-# Configure cloud resources
-NUM_CPU=8
+TRAIN_MANIFEST="cloud/cloud.manifest.train"
+DEV_MANIFEST="cloud/cloud.manifest.dev"
+CLOUD_MODEL_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/model"
+BATCH_SIZE=256
 NUM_GPU=8
 NUM_NODE=1
-MEMORY="10Gi"
 IS_LOCAL="True"
 
-# Pack and upload local data to PaddleCloud filesystem
-python upload_data.py \
---train_manifest_path=${TRAIN_MANIFEST} \
---dev_manifest_path=${DEV_MANIFEST} \
---vocab_file=${VOCAB_FILE} \
---mean_std_file=${MEAN_STD_FILE} \
---cloud_data_path=${CLOUD_DATA_DIR}
-if [ $? -ne 0 ]
-then
-    echo "upload data failed!"
-    exit 1
-fi
-
-# Submit job to PaddleCloud
 JOB_NAME=deepspeech-`date +%Y%m%d%H%M%S`
 DS2_PATH=${PWD%/*}
 cp -f pcloud_train.sh ${DS2_PATH}
@@ -34,15 +13,15 @@ cp -f pcloud_train.sh ${DS2_PATH}
 paddlecloud submit \
 -image bootstrapper:5000/wanghaoshuang/pcloud_ds2:latest \
 -jobname ${JOB_NAME} \
--cpu ${NUM_CPU} \
+-cpu ${NUM_GPU} \
 -gpu ${NUM_GPU} \
--memory ${MEMORY} \
+-memory 64Gi \
 -parallelism ${NUM_NODE} \
 -pscpu 1 \
 -pservers 1 \
--psmemory ${MEMORY} \
+-psmemory 64Gi \
 -passes 1 \
--entry "sh pcloud_train.sh ${CLOUD_DATA_DIR} ${CLOUD_MODEL_DIR} ${NUM_CPU} ${NUM_GPU} ${IS_LOCAL}" \
+-entry "sh pcloud_train.sh ${TRAIN_MANIFEST} ${DEV_MANIFEST} ${CLOUD_MODEL_DIR} ${NUM_GPU} ${BATCH_SIZE} ${IS_LOCAL}" \
 ${DS2_PATH}
 
 rm ${DS2_PATH}/pcloud_train.sh
diff --git a/deep_speech_2/cloud/pcloud_train.sh b/deep_speech_2/cloud/pcloud_train.sh
index 21bd43f92e3e5f3b59bd6f28ab133546349d7b7b..e42da1d629ee5876dbcfe81a1359f30e2c983d1c 100644
--- a/deep_speech_2/cloud/pcloud_train.sh
+++ b/deep_speech_2/cloud/pcloud_train.sh
@@ -1,36 +1,24 @@
-DATA_PATH=$1
-MODEL_PATH=$2
-NUM_CPU=$3
+TRAIN_MANIFEST=$1
+DEV_MANIFEST=$2
+MODEL_PATH=$3
 NUM_GPU=$4
-IS_LOCAL=$5
+BATCH_SIZE=$5
+IS_LOCAL=$6
 
-TRAIN_MANI=${DATA_PATH}/cloud.train.manifest
-DEV_MANI=${DATA_PATH}/cloud.dev.manifest
-TRAIN_TAR=${DATA_PATH}/cloud.train.tar
-DEV_TAR=${DATA_PATH}/cloud.dev.tar
-VOCAB_PATH=${DATA_PATH}/vocab.txt
-MEAN_STD_FILE=${DATA_PATH}/mean_std.npz
-
-# split train data for each pcloud node
 python ./cloud/split_data.py \
---in_manifest_path=${TRAIN_MANI} \
---data_tar_path=${TRAIN_TAR} \
---out_manifest_path='/local.train.manifest'
+--in_manifest_path=${TRAIN_MANIFEST} \
+--out_manifest_path='/local.manifest.train'
 
-# split dev data for each pcloud node
 python ./cloud/split_data.py \
---in_manifest_path=${DEV_MANI} \
---data_tar_path=${DEV_TAR} \
---out_manifest_path='/local.dev.manifest'
+--in_manifest_path=${DEV_MANIFEST} \
+--out_manifest_path='/local.manifest.dev'
 
-# run train
 python train.py \
+--batch_size=$BATCH_SIZE \
 --use_gpu=1 \
 --trainer_count=${NUM_GPU} \
---num_threads_data=${NUM_CPU} \
+--num_threads_data=${NUM_GPU} \
 --is_local=${IS_LOCAL} \
---mean_std_filepath=${MEAN_STD_FILE} \
---train_manifest_path='/local.train.manifest' \
---dev_manifest_path='/local.dev.manifest' \
---vocab_filepath=${VOCAB_PATH} \
---output_model_dir=${MODEL_PATH}
+--train_manifest_path='/local.manifest.train' \
+--dev_manifest_path='/local.manifest.dev' \
+--output_model_dir=${MODEL_PATH} \
diff --git a/deep_speech_2/cloud/pcloud_upload_data.sh b/deep_speech_2/cloud/pcloud_upload_data.sh
new file mode 100644
index 0000000000000000000000000000000000000000..97a0ab1818607745ba0e2e1192abb67060bf13c3
--- /dev/null
+++ b/deep_speech_2/cloud/pcloud_upload_data.sh
@@ -0,0 +1,17 @@
+IN_MANIFESTS="../datasets/manifest.train ../datasets/manifest.dev ../datasets/manifest.test"
+OUT_MANIFESTS="./cloud.manifest.train ./cloud.manifest.dev ./cloud.manifest.test"
+CLOUD_DATA_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/data/librispeech"
+NUM_SHARDS=50
+
+python upload_data.py \
+--in_manifest_paths ${IN_MANIFESTS} \
+--out_manifest_paths ${OUT_MANIFESTS} \
+--cloud_data_dir ${CLOUD_DATA_DIR} \
+--num_shards ${NUM_SHARDS}
+
+if [ $? -ne 0 ]
+then
+    echo "Upload Data Failed!"
+    exit 1
+fi
+echo "All Done."
diff --git a/deep_speech_2/cloud/split_data.py b/deep_speech_2/cloud/split_data.py
index 8df194a62bac052a0b49ce4c8993e640fdc9dc88..3496d52bfb5bf6c249c03dfb4df2937625bd55b5 100644
--- a/deep_speech_2/cloud/split_data.py
+++ b/deep_speech_2/cloud/split_data.py
@@ -1,7 +1,5 @@
 """This tool is used for splitting data into each node of
-paddle cloud by total trainer count and current trainer id.
-The meaning of trainer is a instance of k8s cluster.
-This script should be called in paddle cloud.
+paddlecloud. This script should be called in paddlecloud.
 """
 from __future__ import absolute_import
 from __future__ import division
@@ -14,40 +12,30 @@ import argparse
 parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument(
     "--in_manifest_path",
-    default='./cloud.train.manifest',
     type=str,
-    help="Input manifest path. (default: %(default)s)")
-parser.add_argument(
-    "--data_tar_path",
-    default='./cloud.train.tar',
-    type=str,
-    help="Data tar file path. (default: %(default)s)")
+    required=True,
+    help="Input manifest path for all nodes.")
 parser.add_argument(
     "--out_manifest_path",
-    default='./local.train.manifest',
     type=str,
-    help="Out manifest file path. (default: %(default)s)")
+    required=True,
+    help="Output manifest file path for current node.")
 args = parser.parse_args()
 
 
-def split_data(in_manifest, tar_path, out_manifest):
+def split_data(in_manifest_path, out_manifest_path):
     with open("/trainer_id", "r") as f:
         trainer_id = int(f.readline()[:-1])
     with open("/trainer_count", "r") as f:
         trainer_count = int(f.readline()[:-1])
 
-    tar_path = os.path.abspath(tar_path)
-    result = []
-    for index, json_line in enumerate(open(in_manifest)):
+    out_manifest = []
+    for index, json_line in enumerate(open(in_manifest_path, 'r')):
         if (index % trainer_count) == trainer_id:
-            json_data = json.loads(json_line)
-            json_data['audio_filepath'] = "tar:%s#%s" % (
-                tar_path, json_data['audio_filepath'])
-            result.append("%s\n" % json.dumps(json_data))
-    with open(out_manifest, 'w') as manifest:
-        manifest.writelines(result)
+            out_manifest.append("%s\n" % json_line.strip())
+    with open(out_manifest_path, 'w') as f:
+        f.writelines(out_manifest)
 
 
 if __name__ == '__main__':
-    split_data(args.in_manifest_path, args.data_tar_path,
-               args.out_manifest_path)
+    split_data(args.in_manifest_path, args.out_manifest_path)
diff --git a/deep_speech_2/cloud/upload_data.py b/deep_speech_2/cloud/upload_data.py
index efa9e77c0e1e154cc8245a4444db983a76d82510..9973f8c768410fd86a6ded6a74dac24f9f918173 100644
--- a/deep_speech_2/cloud/upload_data.py
+++ b/deep_speech_2/cloud/upload_data.py
@@ -1,12 +1,9 @@
-"""This script is used for preparing data for DeepSpeech2 trainning on paddle
-cloud.
+"""This script is for uploading data for DeepSpeech2 training on paddlecloud.
 
 Steps:
-1. Read original manifest and get the local path of sound files.
-2. Tar all local sound files into one tar file.
-3. Modify original manifest to remove the local path information.
-
-Finally, we will get a tar file and a new manifest.
+1. Read original manifests and extract local sound files.
+2. Tar all local sound files into multiple tar files and upload them.
+3. Modify original manifests with updated paths in cloud filesystem.
 """
 from __future__ import absolute_import
 from __future__ import division
@@ -22,66 +19,88 @@ from subprocess import call
 import _init_paths
 from data_utils.utils import read_manifest
 
-TRAIN_TAR = "cloud.train.tar"
-TRAIN_MANIFEST = "cloud.train.manifest"
-DEV_TAR = "cloud.dev.tar"
-DEV_MANIFEST = "cloud.dev.manifest"
-VOCAB_FILE = "vocab.txt"
-MEAN_STD_FILE = "mean_std.npz"
-
 parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument(
-    "--train_manifest_path",
-    default="../datasets/manifest.train",
-    type=str,
-    help="Manifest file path for train data. (default: %(default)s)")
-parser.add_argument(
-    "--dev_manifest_path",
-    default="../datasets/manifest.dev",
+    "--in_manifest_paths",
+    default=[
+        "../datasets/manifest.train", "../datasets/manifest.dev",
+        "../datasets/manifest.test"
+    ],
     type=str,
-    help="Manifest file path for validation data. (default: %(default)s)")
-parser.add_argument(
-    "--vocab_file",
-    default="../datasets/vocab/eng_vocab.txt",
-    type=str,
-    help="Vocabulary file to be uploaded to paddlecloud. "
+    nargs='+',
+    help="Local filepaths of input manifests to load, pack and upload. "
     "(default: %(default)s)")
 parser.add_argument(
-    "--mean_std_file",
-    default="../mean_std.npz",
+    "--out_manifest_paths",
+    default=[
+        "./cloud.manifest.train", "./cloud.manifest.dev",
+        "./cloud.manifest.test"
+    ],
     type=str,
-    help="Normalizer's statistics (mean and stddev) file to be uploaded to "
-    "paddlecloud. (default: %(default)s)")
+    nargs='+',
+    help="Local filepaths of modified manifests to write to. "
+    "(default: %(default)s)")
 parser.add_argument(
-    "--cloud_data_path",
+    "--cloud_data_dir",
     required=True,
     type=str,
-    help="Destination path on paddlecloud. (default: %(default)s)")
+    help="Destination directory on paddlecloud to upload data to.")
+parser.add_argument(
+    "--num_shards",
+    default=10,
+    type=int,
+    help="Number of parts to split data to. (default: %(default)s)")
 parser.add_argument(
-    "--local_tmp_path",
+    "--local_tmp_dir",
     default="./tmp/",
     type=str,
     help="Local directory for storing temporary data. (default: %(default)s)")
 args = parser.parse_args()
 
 
-def pack_data(manifest_path, out_tar_path, out_manifest_path):
-    """1. According to the manifest, tar sound files into out_tar_path.
-    2. Generate a new manifest for output tar file.
+def upload_data(in_manifest_path_list, out_manifest_path_list, local_tmp_dir,
+                upload_tar_dir, num_shards):
+    """Extract and pack sound files listed in the manifest files into multiple
+    tar files and upload them to paddlecloud. Besides, generate new manifest
+    files with updated paths in paddlecloud.
     """
-    out_tar = tarfile.open(out_tar_path, 'w')
-    manifest = read_manifest(manifest_path)
-    results = []
-    for json_data in manifest:
-        sound_file = json_data['audio_filepath']
-        filename = os.path.basename(sound_file)
-        out_tar.add(sound_file, arcname=filename)
-        json_data['audio_filepath'] = filename
-        results.append("%s\n" % json.dumps(json_data))
-    with open(out_manifest_path, 'w') as out_manifest:
-        out_manifest.writelines(results)
-    out_manifest.close()
-    out_tar.close()
+    # compute total audio number
+    total_line = 0
+    for manifest_path in in_manifest_path_list:
+        with open(manifest_path, 'r') as f:
+            total_line += len(f.readlines())
+    line_per_tar = (total_line // num_shards) + 1
+
+    # pack and upload shard by shard
+    line_count, tar_file = 0, None
+    for manifest_path, out_manifest_path in zip(in_manifest_path_list,
+                                                out_manifest_path_list):
+        manifest = read_manifest(manifest_path)
+        out_manifest = []
+        for json_data in manifest:
+            sound_filepath = json_data['audio_filepath']
+            sound_filename = os.path.basename(sound_filepath)
+            if line_count % line_per_tar == 0:
+                if tar_file != None:
+                    tar_file.close()
+                    pcloud_cp(tar_path, upload_tar_dir)
+                    os.remove(tar_path)
+                tar_name = 'part-%s-of-%s.tar' % (
+                    str(line_count // line_per_tar).zfill(5),
+                    str(num_shards).zfill(5))
+                tar_path = os.path.join(local_tmp_dir, tar_name)
+                tar_file = tarfile.open(tar_path, 'w')
+            tar_file.add(sound_filepath, arcname=sound_filename)
+            line_count += 1
+            json_data['audio_filepath'] = "tar:%s#%s" % (
+                os.path.join(upload_tar_dir, tar_name), sound_filename)
+            out_manifest.append("%s\n" % json.dumps(json_data))
+        with open(out_manifest_path, 'w') as f:
+            f.writelines(out_manifest)
+        pcloud_cp(out_manifest_path, upload_tar_dir)
+    tar_file.close()
+    pcloud_cp(tar_path, upload_tar_dir)
+    os.remove(tar_path)
 
 
 def pcloud_mkdir(dir):
@@ -99,44 +118,12 @@ def pcloud_cp(src, dst):
         raise IOError("PaddleCloud cp failed: from [%s] to [%s]." % (src, dst))
 
 
-def pcloud_exist(path):
-    """Check if file or directory exists in PaddleCloud filesystem.
-    """
-    ret = call(['paddlecloud', 'ls', path])
-    return ret
-
-
 if __name__ == '__main__':
-    cloud_train_manifest = os.path.join(args.cloud_data_path, TRAIN_MANIFEST)
-    cloud_train_tar = os.path.join(args.cloud_data_path, TRAIN_TAR)
-    cloud_dev_manifest = os.path.join(args.cloud_data_path, DEV_MANIFEST)
-    cloud_dev_tar = os.path.join(args.cloud_data_path, DEV_TAR)
-    cloud_vocab_file = os.path.join(args.cloud_data_path, VOCAB_FILE)
-    cloud_mean_file = os.path.join(args.cloud_data_path, MEAN_STD_FILE)
-
-    local_train_manifest = os.path.join(args.local_tmp_path, TRAIN_MANIFEST)
-    local_train_tar = os.path.join(args.local_tmp_path, TRAIN_TAR)
-    local_dev_manifest = os.path.join(args.local_tmp_path, DEV_MANIFEST)
-    local_dev_tar = os.path.join(args.local_tmp_path, DEV_TAR)
-
-    # prepare local and cloud dir
-    if os.path.exists(args.local_tmp_path):
-        shutil.rmtree(args.local_tmp_path)
-    os.makedirs(args.local_tmp_path)
-    pcloud_mkdir(args.cloud_data_path)
-
-    # pack and upload train data
-    pack_data(args.train_manifest_path, local_train_tar, local_train_manifest)
-    pcloud_cp(local_train_manifest, cloud_train_manifest)
-    pcloud_cp(local_train_tar, cloud_train_tar)
-
-    # pack and upload validation data
-    pack_data(args.dev_manifest_path, local_dev_tar, local_dev_manifest)
-    pcloud_cp(local_dev_manifest, cloud_dev_manifest)
-    pcloud_cp(local_dev_tar, cloud_dev_tar)
+    if not os.path.exists(args.local_tmp_dir):
+        os.makedirs(args.local_tmp_dir)
+    pcloud_mkdir(args.cloud_data_dir)
 
-    # upload vocab file and mean_std file
-    pcloud_cp(args.vocab_file, cloud_vocab_file)
-    pcloud_cp(args.mean_std_file, cloud_mean_file)
+    upload_data(args.in_manifest_paths, args.out_manifest_paths,
+                args.local_tmp_dir, args.cloud_data_dir, args.num_shards)
 
-    shutil.rmtree(args.local_tmp_path)
+    shutil.rmtree(args.local_tmp_dir)
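
Review note: the shard arithmetic in `upload_data.py` can be checked in isolation. The following is a minimal sketch, not part of the patch. The helper names `line_per_tar`, `shard_name`, and `count_shards` are hypothetical; only the formulas are taken from the diff above. It suggests that, because `line_per_tar` is computed as `(total_line // num_shards) + 1`, the number of tar files actually written can be smaller than `num_shards`, even though every file name ends in `-of-<num_shards>`.

```python
def line_per_tar(total_line, num_shards):
    """Manifest lines packed into each tar, using the formula from the patch."""
    return (total_line // num_shards) + 1


def shard_name(line_count, per_tar, num_shards):
    """Tar file name for the entry at 0-based position line_count.

    Hypothetical helper: it mirrors the 'part-%s-of-%s.tar' expression
    used in upload_data.py.
    """
    return 'part-%s-of-%s.tar' % (str(line_count // per_tar).zfill(5),
                                  str(num_shards).zfill(5))


def count_shards(total_line, per_tar):
    """Number of tar files opened: a new one starts every per_tar lines."""
    return (total_line + per_tar - 1) // per_tar


if __name__ == '__main__':
    total_line, num_shards = 100, 50
    per_tar = line_per_tar(total_line, num_shards)
    print(per_tar)                              # 3
    print(shard_name(0, per_tar, num_shards))   # part-00000-of-00050.tar
    print(shard_name(99, per_tar, num_shards))  # part-00033-of-00050.tar
    print(count_shards(total_line, per_tar))    # 34
```

With 100 manifest lines and `num_shards=50`, this sketch yields 34 tar files rather than 50. The manifests still resolve, since each `audio_filepath` records the tar it was actually packed into, but the `-of-00050` suffix should not be read as a file count.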