Simplify Quick Start !112

!112 Closed Sep 24, 2016, created by saxon_zh@saxon_zh

Created by: wangkuiyi

Fixes https://github.com/baidu/Paddle/issues/111

Motivation

This PR exists because it took me more than 12 hours to run preprocess.sh in a VM on my MacBook Pro. I checked with Yi Yang, who can run it in a few minutes on his powerful CPU & GPU desktop. But the purpose of a Quick Start is to be quick to start, so that potential clients can realistically feel the convenience Paddle brings. Hence this PR.

Comparison

The time cost is primarily due to the fact that the current approach downloads the full Amazon Reviews dataset, which is ~500MB gzipped and ~1.5GB unzipped. Processing the whole dataset also takes considerable time. So this PR's primary target is to download only part of the dataset. Compared with the existing approach,

  1. this PR uses a ~100-line Python script, process_data.py, to replace data/get_data.sh, preprocess.py, and preprocess.sh, which add up to ~300 lines of code;
  2. after a short discussion with @emailweixu , we decided to use space-delimited word segmentation in place of the Moses word segmenter, so there is no need to download the Moses segmenter;
  3. process_data.py can read directly from the HTTP server that hosts the data, or from a local copy of the data. In either case, it reads only until the required number of instances has been scanned, which frees it from reading the whole dataset (see the sketch after this list);
  4. the new script doesn't use shuf, which exists on Linux but not on Mac OS X, so the new script works on both Linux and Mac OS X.
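
For illustration, here is a minimal Python 3 sketch of that streaming idea; the names open_stream and read_instances are hypothetical, not necessarily what process_data.py defines:

import gzip
import json
from urllib.request import urlopen

def open_stream(source):
    # Open either the HTTP URL or a local .json.gz copy as a gzip stream.
    if source.startswith("http://") or source.startswith("https://"):
        raw = urlopen(source)          # file-like HTTP response
    else:
        raw = open(source, "rb")
    return gzip.GzipFile(fileobj=raw)  # decompress on the fly, no temp file

def read_instances(source, n):
    # Yield at most n parsed reviews (one JSON object per line), so the
    # full ~1.5GB dataset never has to be downloaded or scanned.
    for count, line in enumerate(open_stream(source)):
        if count >= n:
            break
        yield json.loads(line)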

Usage

If we get this PR merged, the initialization steps described in the Quick Start guide would change from

cd demo/quick_start
./data/get_data.sh
./preprocess.sh

to

cd demo/quick_start
python ./process_data.py

Details

The ./process_data.py command above reads directly from the default URL http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz, consuming JSON objects until it can generate {train,test,pred}.txt, which together add up to 100 instances, the default dataset size.
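
To make the default concrete, here is a minimal sketch of the splitting step, reusing read_instances from the sketch above. The 80/10/10 split ratio and the reviewText field are assumptions for illustration; the PR does not specify either:

URL = ("http://snap.stanford.edu/data/amazon/productGraph/"
       "categoryFiles/reviews_Electronics_5.json.gz")

n = 100                                   # default total dataset size
instances = list(read_instances(URL, n))  # stops after n JSON objects

n_train, n_test = int(n * 0.8), int(n * 0.1)  # assumed ratios
splits = {
    "train.txt": instances[:n_train],
    "test.txt": instances[n_train:n_train + n_test],
    "pred.txt": instances[n_train + n_test:],
}
for name, part in splits.items():
    with open(name, "w") as f:
        for review in part:
            f.write(review["reviewText"] + "\n")  # assumed field name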

If we are going to generate a bigger dataset, say 1000 instances in total, we can run

python ./process_data.py -n 1000

Or, if we have already downloaded the reviews_Electronics_5.json.gz file, we can run

python ./process_data.py ~/Download/reviews_Electronics_5.json.gz

An additional command-line parameter, -t, caps the size of the dictionary. If we want to generate a 1000-instance dataset while limiting the dictionary size to 1999, we can do

python ./process_data.py -n 1000 -t 1999 ~/Download/reviews_Electronics_5.json.gz
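
The dictionary cap can be pictured as keeping only the most frequent words. A minimal sketch, assuming the space-delimited segmentation decided above; build_dict is a hypothetical helper, not necessarily what process_data.py defines:

from collections import Counter

def build_dict(texts, cap):
    # Count space-delimited tokens and keep only the `cap` most frequent,
    # so the dictionary never grows beyond the -t limit (e.g. 1999).
    counter = Counter()
    for text in texts:
        counter.update(text.split())
    return {word: i for i, (word, _) in enumerate(counter.most_common(cap))}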
Reference: paddlepaddle/Paddle!112
Source branch: github/fork/wangkuiyi/simplify_quick_start