Created by: wangkuiyi
Fixes https://github.com/baidu/Paddle/issues/111
## Motivation
The motivation for this PR is that it took me more than 12 hours to run `preprocess.sh` in a VM on my MacBook Pro. I checked with Yi Yang, who can run it in a few minutes on his powerful CPU & GPU desktop. But a Quick Start should be quick enough to get started with, so that potential users can realistically feel the convenience that Paddle brings. Hence this PR.
## Comparison
The time cost is primarily due to the fact that the current approach downloads the full Amazon Reviews dataset, which is ~500MB gzipped and ~1.5GB unzipped. Processing the whole dataset also takes a long time. So this PR's primary goal is to download only part of the dataset. Compared with the existing approach:
- this PR uses a ~100-line Python script `preprocess_data.py` to replace `data/get_data.sh`, `preprocess.py`, and `preprocess.sh`, which add up to ~300 lines of code,
- after a short discussion with @emailweixu, we decided to use space-delimited word segmentation in place of the Moses word segmenter, so there is no need to download the Moses segmenter,
- `preprocess_data.py` can read directly from the HTTP server that hosts the data, or from a local copy of the data. In either case, it reads only until the required number of instances has been scanned, which frees it from reading the whole dataset (see the sketch after this list),
- the new script doesn't use `shuf`, which exists on Linux but not on Mac OS X, so the new script works on both Linux and Mac OS X.
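For context, here is a minimal sketch of that streaming idea. It is not the actual `preprocess_data.py`; it assumes Python 3, assumes the gzipped file contains one JSON object per line, and uses a hypothetical function name.

```python
# Minimal sketch (not the actual preprocess_data.py): stream a gzipped,
# line-delimited JSON file from a URL or a local path, and stop as soon
# as the requested number of instances has been scanned.
import gzip
import json
import urllib.request


def scan_reviews(source, num_instances):
    """Yield up to num_instances parsed review objects from `source`,
    which may be an HTTP(S) URL or a local .json.gz path."""
    if source.startswith("http://") or source.startswith("https://"):
        raw = urllib.request.urlopen(source)   # read over HTTP; no full download
    else:
        raw = open(source, "rb")               # read a local copy
    count = 0
    with gzip.GzipFile(fileobj=raw) as f:      # decompress on the fly
        for line in f:                         # one JSON object per line (assumed)
            yield json.loads(line)
            count += 1
            if count >= num_instances:         # stop early; skip the rest of the file
                return
```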
## Usage
If we get this PR merged, the initialization steps described in the Quick Start guide would change from
```
cd demo/quick_start
./data/get_data.sh
./preprocess.sh
```
into
```
cd demo/quick_start
python ./process_data.py
```
## Details
The above `./process_data.py` command reads directly from the default URL http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz, scanning JSON objects until it can generate `{train,test,pred}.txt`, which together contain 100 instances, the default dataset size.
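As an illustration only, the split into the three files could look roughly like the sketch below; the 8/1/1 ratio and the tab-separated label plus space-delimited text line format are assumptions for this example, not necessarily what the script actually does.

```python
# Illustrative sketch only: divide the scanned instances into three files
# whose sizes add up to the requested total. The 8/1/1 split and the
# "label<TAB>space-separated words" line format are assumptions.
def split_and_write(instances, out_dir="data"):
    n = len(instances)
    n_train = n * 8 // 10
    n_test = n // 10
    splits = [
        ("train.txt", instances[:n_train]),
        ("test.txt", instances[n_train:n_train + n_test]),
        ("pred.txt", instances[n_train + n_test:]),
    ]
    for name, rows in splits:
        with open("%s/%s" % (out_dir, name), "w") as f:
            for label, text in rows:
                f.write("%d\t%s\n" % (label, text))
```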
If we are going to generate a bigger dataset, say 1000 instances in total, we can run

```
python ./process_data.py -n 1000
```
Or, if we have already downloaded the `reviews_Electronics_5.json.gz` file, we can run

```
python ./process_data.py ~/Download/reviews_Electronics_5.json.gz
```
An additional command-line parameter `-t` caps the size of the dictionary. If we want to generate a 1000-instance dataset while limiting the dictionary size to 1999, we can run

```
python ./process_data.py -n 1000 -t 1999 ~/Download/reviews_Electronics_5.json.gz
```
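For completeness, the command-line interface described above could be wired up roughly as follows; the `-n` and `-t` option names and the default URL match the usage examples, while the defaults, variable names, and help strings are assumptions for this sketch.

```python
# Sketch of the command-line interface described above; the -n and -t
# options mirror the usage examples, everything else is an assumption.
import argparse

DEFAULT_URL = ("http://snap.stanford.edu/data/amazon/productGraph/"
               "categoryFiles/reviews_Electronics_5.json.gz")


def parse_args():
    parser = argparse.ArgumentParser(
        description="Prepare quick_start data from the Amazon Reviews dataset.")
    parser.add_argument("source", nargs="?", default=DEFAULT_URL,
                        help="URL or local path of reviews_Electronics_5.json.gz")
    parser.add_argument("-n", dest="num_instances", type=int, default=100,
                        help="total number of instances in train/test/pred")
    parser.add_argument("-t", dest="dict_size", type=int, default=None,
                        help="cap on the dictionary size")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(args.source, args.num_instances, args.dict_size)
```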