Reader support for reading data from native HDFS without downloading the data to local disk
Created by: typhoonzero
Distributed training jobs often read training data from distributed storage services such as AWS S3 or Hadoop HDFS, and HDFS in particular is widely used in companies. A simple approach is to download a subset of the files to local disk and then write a reader
that reads from the local file list, but this is not very efficient.
So we need to read data directly from HDFS instead. To do this, we can implement a base reader or reader decorator that uses the Hadoop command-line client to pipe the data:
import subprocess

ps = subprocess.Popen(('hadoop', 'fs', '-cat', your_file_path_or_glob), stdout=subprocess.PIPE)
for line in ps.stdout:
    ...  # read data from the pipe and apply the user's parse functions
ps.wait()
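As a rough sketch of how this pipe could be wrapped into a reader creator (the names hdfs_reader and parse_line below are illustrative placeholders, not an existing API), the process output can simply be turned into a generator:

import subprocess

def hdfs_reader(file_path_or_glob, parse_line):
    # hypothetical reader creator: streams records from HDFS via `hadoop fs -cat`
    def reader():
        ps = subprocess.Popen(('hadoop', 'fs', '-cat', file_path_or_glob),
                              stdout=subprocess.PIPE)
        for line in ps.stdout:
            # convert each raw line into a training sample
            yield parse_line(line)
        ps.wait()
    return reader

A reader created this way could then be used anywhere a local-file reader is used today, without staging the files on local disk first.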