diff --git a/demo/quick_start/data/README.md b/demo/quick_start/data/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..63abcf7ebf31903213e44cf492b93e09f61db14e
--- /dev/null
+++ b/demo/quick_start/data/README.md
@@ -0,0 +1,9 @@
+This dataset consists of electronics product reviews associated with
+binary labels (positive/negative) for sentiment classification.
+
+The preprocessed data can be downloaded by script `get_data.sh`.
+The data was derived from reviews_Electronics_5.json.gz at
+
+http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+
+If you want to process the raw data, you can use the script `proc_from_raw_data/get_data.sh`.
diff --git a/demo/quick_start/data/get_data.sh b/demo/quick_start/data/get_data.sh
index f355d63225b28ab495b34e72dd3be8d237ae08f4..952de3f3c8f52a7a6f84412f9b38f16ac2503ac2 100755
--- a/demo/quick_start/data/get_data.sh
+++ b/demo/quick_start/data/get_data.sh
@@ -17,14 +17,11 @@ set -e
 DIR="$( cd "$(dirname "$0")" ; pwd -P )"
 cd $DIR
 
-echo "Downloading Amazon Electronics reviews data..."
-# http://jmcauley.ucsd.edu/data/amazon/
-wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+# Download the preprocessed data
+wget http://paddlepaddle.bj.bcebos.com/demo/quick_start_preprocessed_data/preprocessed_data.tar.gz
 
-echo "Downloading mosesdecoder..."
-#https://github.com/moses-smt/mosesdecoder
-wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
+# Extract package
+tar zxvf preprocessed_data.tar.gz
 
-unzip master.zip
-rm master.zip
-echo "Done."
+# Remove compressed package
+rm preprocessed_data.tar.gz
diff --git a/demo/quick_start/data/pred.list b/demo/quick_start/data/pred.list
deleted file mode 100644
index d88b2b63851101a8b40e706b32d8c17b5fabb201..0000000000000000000000000000000000000000
--- a/demo/quick_start/data/pred.list
+++ /dev/null
@@ -1 +0,0 @@
-./data/pred.txt
diff --git a/demo/quick_start/data/pred.txt b/demo/quick_start/data/pred.txt
deleted file mode 100644
index 6ed5f738ddaff6645448d5e606dcef1baf01b282..0000000000000000000000000000000000000000
--- a/demo/quick_start/data/pred.txt
+++ /dev/null
@@ -1,2 +0,0 @@
-the device is cute , but that 's just about all that 's good. the specs are what you 'd expect : it 's a wifi mic , with some noise filter options. the app has the option to upload your baby 's name and photo , which is a cutesy touch. but the app is otherwise unstable and useless unless you upgrade for $ 60 / year.set up involves downloading the app , turning on the mic , switching your phone to the wifi network of the mic , telling the app your wifi settings , switching your wifi back to your home router. the app is then directly connected to your mic.the app is adware ! the main screen says " cry notifications on / off : upgrade to evoz premium and receive a text message of email when your baby is crying " .but the adware points out an important limitation , this monitor is only intended to be used from your home network. if you want to access it remotely , get a webcam. this app would make a lot more sense of the premium features were included with the hardware .
-don 't be fooled by my one star rating. if there was a zero , i would have selected it. this product was a waste of my money.it has never worked like the company said it supposed to. i only have one device , an iphone 4gs. after charging the the iphone mid way , the i.sound portable power max 16,000 mah is completely drained. the led light no longer lit up. when plugging the isound portable power max into a wall outlet to charge , it would charge for about 20-30 minutes and then all four battery led indicator lit up showing a full charge. i would leave it on to charge for the full 8 hours or more but each time with the same result upon using. don 't buy this thing. put your money to good use elsewhere .
diff --git a/demo/quick_start/preprocess.sh b/demo/quick_start/data/proc_from_raw_data/get_data.sh
similarity index 65%
rename from demo/quick_start/preprocess.sh
rename to demo/quick_start/data/proc_from_raw_data/get_data.sh
index c9190e2dd2ef754bf3c7287006322b52493dc3a0..cd85e26842dfccea78e4f26bdfee938887021f03 100755
--- a/demo/quick_start/preprocess.sh
+++ b/demo/quick_start/data/proc_from_raw_data/get_data.sh
@@ -16,10 +16,26 @@
 # 1. size of pos : neg = 1:1.
 # 2. size of testing set = min(25k, len(all_data) * 0.1), others is traning set.
 # 3. distinct train set and test set.
-# 4. build dict
 
 set -e
 
+DIR="$( cd "$(dirname "$0")" ; pwd -P )"
+cd $DIR
+
+# Download data
+echo "Downloading Amazon Electronics reviews data..."
+# http://jmcauley.ucsd.edu/data/amazon/
+wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
+echo "Downloading mosesdecoder..."
+# https://github.com/moses-smt/mosesdecoder
+wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
+
+unzip master.zip
+rm master.zip
+
+##################
+# Preprocess data
+echo "Preprocess data..."
 export LC_ALL=C
 UNAME_STR=`uname`
 
@@ -29,11 +45,11 @@ else
   SHUF_PROG='gshuf'
 fi
 
-mkdir -p data/tmp
-python preprocess.py -i data/reviews_Electronics_5.json.gz
+mkdir -p tmp
+python preprocess.py -i reviews_Electronics_5.json.gz
 # uniq and shuffle
-cd data/tmp
-echo 'uniq and shuffle...'
+cd tmp
+echo 'Uniq and shuffle...'
 cat pos_*|sort|uniq|${SHUF_PROG}> pos.shuffed
 cat neg_*|sort|uniq|${SHUF_PROG}> neg.shuffed
 
@@ -53,11 +69,11 @@ cat train.pos train.neg | ${SHUF_PROG} >../train.txt
 cat test.pos test.neg | ${SHUF_PROG} >../test.txt
 cd -
 
-echo 'data/train.txt' > data/train.list
-echo 'data/test.txt' > data/test.list
+echo 'train.txt' > train.list
+echo 'test.txt' > test.list
 
 # use 30k dict
-rm -rf data/tmp
-mv data/dict.txt data/dict_all.txt
-cat data/dict_all.txt | head -n 30001 > data/dict.txt
-echo 'preprocess finished'
+rm -rf tmp
+mv dict.txt dict_all.txt
+cat dict_all.txt | head -n 30001 > dict.txt
+echo 'Done.'
diff --git a/demo/quick_start/preprocess.py b/demo/quick_start/data/proc_from_raw_data/preprocess.py
similarity index 95%
rename from demo/quick_start/preprocess.py
rename to demo/quick_start/data/proc_from_raw_data/preprocess.py
index d87fad632a7429f7d9682badabe4c72ca127354f..56c2c5f16ceb63ff88fa51ed78c2e77ea5b64592 100755
--- a/demo/quick_start/preprocess.py
+++ b/demo/quick_start/data/proc_from_raw_data/preprocess.py
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
-1. (remove HTML before or not)tokensizing
+1. Tokenize the words and punctuation
 2. pos sample : rating score 5; neg sample: rating score 1-2.
 
 Usage:
@@ -76,7 +76,11 @@ def tokenize(sentences):
     sentences : a list of input sentences.
     return: a list of processed text.
     """
-    dir = './data/mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
+    dir = './mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
+    if not os.path.exists(dir):
+        sys.exit(
+            "The ./mosesdecoder-master/scripts/tokenizer/tokenizer.perl does not exists."
+        )
     tokenizer_cmd = [dir, '-l', 'en', '-q', '-']
     assert isinstance(sentences, list)
     text = "\n".join(sentences)
@@ -104,7 +108,7 @@ def tokenize_batch(id):
         num_batch, instance, pre_fix = parse_queue.get()
         if num_batch == -1: ### parse_queue finished
             tokenize_queue.put((-1, None, None))
-            sys.stderr.write("tokenize theread %s finish\n" % (id))
+            sys.stderr.write("Thread %s finish\n" % (id))
             break
         tokenize_instance = tokenize(instance)
         tokenize_queue.put((num_batch, tokenize_instance, pre_fix))
diff --git a/doc/demo/quick_start/index_en.md b/doc/demo/quick_start/index_en.md
index 659485d9be1b6a3e9759a2fd040cb09d1f2a3005..ec548b5393d7b210d6409328c00917aeb679a451 100644
--- a/doc/demo/quick_start/index_en.md
+++ b/doc/demo/quick_start/index_en.md
@@ -59,12 +59,11 @@ To build your text classification system, your code will need to perform five st
 ## Preprocess data into standardized format
 In this example, you are going to use [Amazon electronic product review dataset](http://jmcauley.ucsd.edu/data/amazon/) to build a bunch of deep neural network models for text classification. Each text in this dataset is a product review. This dataset has two categories: “positive” and “negative”. Positive means the reviewer likes the product, while negative means the reviewer does not like the product.
 
-`demo/quick_start` in the [source code](https://github.com/baidu/Paddle) provides scripts for downloading data and preprocessing data as shown below. The data process takes several minutes (about 3 minutes in our machine).
+`demo/quick_start` in the [source code](https://github.com/PaddlePaddle/Paddle) provides script for downloading the preprocessed data as shown below. (If you want to process the raw data, you can use the script `demo/quick_start/data/proc_from_raw_data/get_data.sh`).
 
 ```bash
 cd demo/quick_start
 ./data/get_data.sh
-./preprocess.sh
 ```
 
 ## Transfer Data to Model
diff --git a/doc_cn/demo/quick_start/index.md b/doc_cn/demo/quick_start/index.md
index 4d9b24ba851a7aaaeb0d79bfbeb0703b8878b77f..4a6e07ee1ffd94cf8f781af307b53a96a78e6b93 100644
--- a/doc_cn/demo/quick_start/index.md
+++ b/doc_cn/demo/quick_start/index.md
@@ -32,13 +32,11 @@
 
 ## 数据格式准备(Data Preparation)
 在本问题中,我们使用[Amazon电子产品评论数据](http://jmcauley.ucsd.edu/data/amazon/),
-将评论分为好评(正样本)和差评(负样本)两类。[源码](https://github.com/baidu/Paddle)的`demo/quick_start`里提供了数据下载脚本
-和预处理脚本。
+将评论分为好评(正样本)和差评(负样本)两类。[源码](https://github.com/PaddlePaddle/Paddle)的`demo/quick_start`里提供了下载已经预处理数据的脚本(如果想从最原始的数据处理,可以使用脚本 `./demo/quick_start/data/proc_from_raw_data/get_data.sh`)。
 
 ```bash
 cd demo/quick_start
 ./data/get_data.sh
-./preprocess.sh
 ```
 
 ## 数据向模型传送(Transfer Data to Model)
@@ -143,7 +141,7 @@ PyDataProvider2。
 我们将以基本的逻辑回归网络作为起点,并逐渐展示更加深入的功能。更详细的网络配置
 连接请参考Layer文档。
 
-所有配置在[源码](https://github.com/baidu/Paddle)`demo/quick_start`目录,首先列举逻辑回归网络。
+所有配置在[源码](https://github.com/PaddlePaddle/Paddle)`demo/quick_start`目录,首先列举逻辑回归网络。
 
 ### 逻辑回归模型(Logistic Regression)