Reader support for reading data from native HDFS without downloading the data to local disk
Created by: typhoonzero
Distributed training jobs often read training data from distributed storage services such as AWS S3 or Hadoop HDFS, and HDFS in particular is widely used in companies. A simple approach is to download a subset of the files to local disk and then write a reader
that reads from the local file list, but this is not very efficient.
So we need to read data directly from HDFS instead. To do this, we can implement a base reader or reader decorator that uses the Hadoop command-line client to pipe the data:
import subprocess

ps = subprocess.Popen(('hadoop', 'fs', '-cat', your_file_path_or_glob), stdout=subprocess.PIPE)
for line in ps.stdout:
    ...  # read data from the pipe and apply the user's parse functions
ps.wait()
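As a rough sketch of how this pipe could be wrapped into a reader creator (the names hdfs_reader and parse_line below are illustrative placeholders, not an existing API), the process output can simply be turned into a generator:

import subprocess

def hdfs_reader(file_path_or_glob, parse_line):
    # hypothetical reader creator: streams records from HDFS via `hadoop fs -cat`
    def reader():
        ps = subprocess.Popen(('hadoop', 'fs', '-cat', file_path_or_glob),
                              stdout=subprocess.PIPE)
        for line in ps.stdout:
            # convert each raw line into a training sample
            yield parse_line(line)
        ps.wait()
    return reader

A reader created this way could then be used anywhere a local-file reader is used today, without staging the files on local disk first.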