Commit 0561dd01, authored by dangqingqing

Update doc and proc_from_raw_data/get_data.sh

Parent 1edacf7a
...@@ -25,14 +25,17 @@ cd $DIR ...@@ -25,14 +25,17 @@ cd $DIR
# Download data # Download data
echo "Downloading Amazon Electronics reviews data..." echo "Downloading Amazon Electronics reviews data..."
# http://jmcauley.ucsd.edu/data/amazon/ # http://jmcauley.ucsd.edu/data/amazon/
#wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
echo "Downloading mosesdecoder..." echo "Downloading mosesdecoder..."
#https://github.com/moses-smt/mosesdecoder # https://github.com/moses-smt/mosesdecoder
#wget https://github.com/moses-smt/mosesdecoder/archive/master.zip wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
#unzip master.zip
#rm master.zip
echo "Done."
unzip master.zip
rm master.zip
##################
# Preprocess data
echo "Preprocess data..."
export LC_ALL=C export LC_ALL=C
UNAME_STR=`uname` UNAME_STR=`uname`
...@@ -42,12 +45,11 @@ else ...@@ -42,12 +45,11 @@ else
SHUF_PROG='gshuf' SHUF_PROG='gshuf'
fi fi
# Start preprocess
mkdir -p tmp mkdir -p tmp
python preprocess.py -i reviews_Electronics_5.json.gz python preprocess.py -i reviews_Electronics_5.json.gz
# uniq and shuffle # uniq and shuffle
cd tmp cd tmp
echo 'uniq and shuffle...' echo 'Uniq and shuffle...'
cat pos_*|sort|uniq|${SHUF_PROG}> pos.shuffed cat pos_*|sort|uniq|${SHUF_PROG}> pos.shuffed
cat neg_*|sort|uniq|${SHUF_PROG}> neg.shuffed cat neg_*|sort|uniq|${SHUF_PROG}> neg.shuffed
...@@ -74,4 +76,4 @@ echo 'test.txt' > test.list ...@@ -74,4 +76,4 @@ echo 'test.txt' > test.list
rm -rf tmp rm -rf tmp
mv dict.txt dict_all.txt mv dict.txt dict_all.txt
cat dict_all.txt | head -n 30001 > dict.txt cat dict_all.txt | head -n 30001 > dict.txt
echo 'preprocess finished' echo 'Done.'
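The deduplicate-and-shuffle and dictionary-truncation steps in `get_data.sh` can be tried on toy data with a sketch like the one below. It is not the script itself: the shard names `pos_a` and `pos_b`, the toy `dict_all.txt` contents, and the cutoff of 3 are made up for illustration (the script keeps the first 30001 lines). It assumes `shuf` from GNU coreutils is installed, and falls back to `gshuf` as the script does on other systems.

```bash
#!/bin/sh
# Minimal sketch on toy data; file names and contents are hypothetical.
set -e
export LC_ALL=C

# Pick the shuffle program: GNU shuf, or gshuf (coreutils installed under a g- prefix).
if command -v shuf >/dev/null 2>&1; then
    SHUF_PROG='shuf'
else
    SHUF_PROG='gshuf'
fi

# Work in a throwaway directory so nothing in the repo is touched.
WORK=$(mktemp -d)
cd "$WORK"

# Two toy shards that share one duplicated line.
printf 'great product\nworks well\n' > pos_a
printf 'works well\nlove it\n' > pos_b

# sort + uniq drops duplicate lines; the shuffle then randomizes line order.
cat pos_* | sort | uniq | ${SHUF_PROG} > pos.shuffed
UNIQ_COUNT=$(wc -l < pos.shuffed | tr -d ' ')

# Keep only the first N lines of the full dictionary (N is 3 here).
printf 'the\nis\ngood\nbad\nrare\n' > dict_all.txt
head -n 3 dict_all.txt > dict.txt
DICT_COUNT=$(wc -l < dict.txt | tr -d ' ')

echo "unique lines: ${UNIQ_COUNT}, dict entries: ${DICT_COUNT}"
```

Sorting before `uniq` matters because `uniq` only collapses adjacent duplicate lines; the shuffle afterwards removes the ordering that the sort introduced.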
...@@ -59,7 +59,7 @@ To build your text classification system, your code will need to perform five st ...@@ -59,7 +59,7 @@ To build your text classification system, your code will need to perform five st
## Preprocess data into standardized format ## Preprocess data into standardized format
In this example, you will use the [Amazon electronic product review dataset](http://jmcauley.ucsd.edu/data/amazon/) to build several deep neural network models for text classification. Each text in this dataset is a product review. The dataset has two categories: “positive” and “negative”. Positive means the reviewer likes the product, while negative means the reviewer does not. In this example, you will use the [Amazon electronic product review dataset](http://jmcauley.ucsd.edu/data/amazon/) to build several deep neural network models for text classification. Each text in this dataset is a product review. The dataset has two categories: “positive” and “negative”. Positive means the reviewer likes the product, while negative means the reviewer does not.
`demo/quick_start` in the [source code](https://github.com/baidu/Paddle) provides a script for downloading the preprocessed data, as shown below. (If you want to process the raw data, you can use the script `demo/quick_start/data/proc_from_raw_data/get_data.sh`.) `demo/quick_start` in the [source code](https://github.com/PaddlePaddle/Paddle) provides a script for downloading the preprocessed data, as shown below. (If you want to process the raw data, you can use the script `demo/quick_start/data/proc_from_raw_data/get_data.sh`.)
```bash ```bash
cd demo/quick_start cd demo/quick_start
......
...@@ -32,7 +32,7 @@ ...@@ -32,7 +32,7 @@
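## Data Preparation ## Data Preparation
In this problem, we use the [Amazon electronics review data](http://jmcauley.ucsd.edu/data/amazon/) In this problem, we use the [Amazon electronics review data](http://jmcauley.ucsd.edu/data/amazon/)
and classify reviews into two classes: positive reviews (positive samples) and negative reviews (negative samples). `demo/quick_start` in the [source code](https://github.com/baidu/Paddle) provides a script for downloading the preprocessed data (to process the raw data from scratch, use the script `./demo/quick_start/data/proc_from_raw_data/get_data.sh`). and classify reviews into two classes: positive reviews (positive samples) and negative reviews (negative samples). `demo/quick_start` in the [source code](https://github.com/PaddlePaddle/Paddle) provides a script for downloading the preprocessed data (to process the raw data from scratch, use the script `./demo/quick_start/data/proc_from_raw_data/get_data.sh`).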
```bash ```bash
cd demo/quick_start cd demo/quick_start
...@@ -141,7 +141,7 @@ PyDataProvider2</a>。 ...@@ -141,7 +141,7 @@ PyDataProvider2</a>。
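We start with a basic logistic regression network and gradually introduce more advanced features. For links to more detailed network configuration, We start with a basic logistic regression network and gradually introduce more advanced features. For links to more detailed network configuration,
see the <a href = "../../../doc/layer.html">Layer documentation</a> see the <a href = "../../../doc/layer.html">Layer documentation</a>
All configurations are in the `demo/quick_start` directory of the [source code](https://github.com/baidu/Paddle); the logistic regression network is listed first. All configurations are in the `demo/quick_start` directory of the [source code](https://github.com/PaddlePaddle/Paddle); the logistic regression network is listed first.
### Logistic Regression ### Logistic Regression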
......