Commit 2f602484, author: Tao Luo, committer: GitHub

Merge pull request #573 from qingqing01/quick_start

Update the data of quick start.
This dataset consists of electronics product reviews associated with
binary labels (positive/negative) for sentiment classification.
The preprocessed data can be downloaded with the script `get_data.sh`.
The data was derived from reviews_Electronics_5.json.gz at
http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
If you want to process the raw data, you can use the script `proc_from_raw_data/get_data.sh`.
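
To illustrate how binary labels can be derived from the raw reviews, here is a minimal sketch. It applies the rule stated in the `preprocess.py` docstring in this commit (rating 5 is positive, ratings 1-2 are negative). It assumes each line of the raw file is one JSON object with an `overall` (star rating) field and a `reviewText` field. The names `label_reviews` and `label_file` are hypothetical, and this is not the code in `preprocess.py`.

```python
import gzip
import json


def label_reviews(lines):
    """Yield (label, text) pairs: 1 for rating 5, 0 for ratings 1-2.

    Reviews rated 3 or 4 are skipped. The field names `overall` and
    `reviewText` are assumptions about the raw file's schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        review = json.loads(line)
        rating = float(review["overall"])
        text = review.get("reviewText", "")
        if rating == 5.0:
            yield 1, text
        elif rating <= 2.0:
            yield 0, text


def label_file(path):
    """Label every review in a gzip-compressed, one-object-per-line file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        yield from label_reviews(f)
```

For example, `list(label_file("reviews_Electronics_5.json.gz"))` would collect the labeled reviews in memory. Iterating lazily is probably preferable for a file of this size.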
...@@ -17,14 +17,11 @@ set -e ...@@ -17,14 +17,11 @@ set -e
DIR="$( cd "$(dirname "$0")" ; pwd -P )" DIR="$( cd "$(dirname "$0")" ; pwd -P )"
cd $DIR cd $DIR
echo "Downloading Amazon Electronics reviews data..." # Download the preprocessed data
# http://jmcauley.ucsd.edu/data/amazon/ wget http://paddlepaddle.bj.bcebos.com/demo/quick_start_preprocessed_data/preprocessed_data.tar.gz
wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
echo "Downloading mosesdecoder..." # Extract package
#https://github.com/moses-smt/mosesdecoder tar zxvf preprocessed_data.tar.gz
wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
unzip master.zip # Remove compressed package
rm master.zip rm preprocessed_data.tar.gz
echo "Done."
the device is cute , but that 's just about all that 's good. the specs are what you 'd expect : it 's a wifi mic , with some noise filter options. the app has the option to upload your baby 's name and photo , which is a cutesy touch. but the app is otherwise unstable and useless unless you upgrade for $ 60 / year.set up involves downloading the app , turning on the mic , switching your phone to the wifi network of the mic , telling the app your wifi settings , switching your wifi back to your home router. the app is then directly connected to your mic.the app is adware ! the main screen says " cry notifications on / off : upgrade to evoz premium and receive a text message of email when your baby is crying " .but the adware points out an important limitation , this monitor is only intended to be used from your home network. if you want to access it remotely , get a webcam. this app would make a lot more sense of the premium features were included with the hardware .
don 't be fooled by my one star rating. if there was a zero , i would have selected it. this product was a waste of my money.it has never worked like the company said it supposed to. i only have one device , an iphone 4gs. after charging the the iphone mid way , the i.sound portable power max 16,000 mah is completely drained. the led light no longer lit up. when plugging the isound portable power max into a wall outlet to charge , it would charge for about 20-30 minutes and then all four battery led indicator lit up showing a full charge. i would leave it on to charge for the full 8 hours or more but each time with the same result upon using. don 't buy this thing. put your money to good use elsewhere .
...@@ -16,10 +16,26 @@ ...@@ -16,10 +16,26 @@
# 1. size of pos : neg = 1:1. # 1. size of pos : neg = 1:1.
# 2. size of testing set = min(25k, len(all_data) * 0.1), others is traning set. # 2. size of testing set = min(25k, len(all_data) * 0.1), others is traning set.
# 3. distinct train set and test set. # 3. distinct train set and test set.
# 4. build dict
set -e set -e
DIR="$( cd "$(dirname "$0")" ; pwd -P )"
cd $DIR
# Download data
echo "Downloading Amazon Electronics reviews data..."
# http://jmcauley.ucsd.edu/data/amazon/
wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
echo "Downloading mosesdecoder..."
# https://github.com/moses-smt/mosesdecoder
wget https://github.com/moses-smt/mosesdecoder/archive/master.zip
unzip master.zip
rm master.zip
##################
# Preprocess data
echo "Preprocess data..."
export LC_ALL=C export LC_ALL=C
UNAME_STR=`uname` UNAME_STR=`uname`
...@@ -29,11 +45,11 @@ else ...@@ -29,11 +45,11 @@ else
SHUF_PROG='gshuf' SHUF_PROG='gshuf'
fi fi
mkdir -p data/tmp mkdir -p tmp
python preprocess.py -i data/reviews_Electronics_5.json.gz python preprocess.py -i reviews_Electronics_5.json.gz
# uniq and shuffle # uniq and shuffle
cd data/tmp cd tmp
echo 'uniq and shuffle...' echo 'Uniq and shuffle...'
cat pos_*|sort|uniq|${SHUF_PROG}> pos.shuffed cat pos_*|sort|uniq|${SHUF_PROG}> pos.shuffed
cat neg_*|sort|uniq|${SHUF_PROG}> neg.shuffed cat neg_*|sort|uniq|${SHUF_PROG}> neg.shuffed
...@@ -53,11 +69,11 @@ cat train.pos train.neg | ${SHUF_PROG} >../train.txt ...@@ -53,11 +69,11 @@ cat train.pos train.neg | ${SHUF_PROG} >../train.txt
cat test.pos test.neg | ${SHUF_PROG} >../test.txt cat test.pos test.neg | ${SHUF_PROG} >../test.txt
cd - cd -
echo 'data/train.txt' > data/train.list echo 'train.txt' > train.list
echo 'data/test.txt' > data/test.list echo 'test.txt' > test.list
# use 30k dict # use 30k dict
rm -rf data/tmp rm -rf tmp
mv data/dict.txt data/dict_all.txt mv dict.txt dict_all.txt
cat data/dict_all.txt | head -n 30001 > data/dict.txt cat dict_all.txt | head -n 30001 > dict.txt
echo 'preprocess finished' echo 'Done.'
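
The header comments of this script list the split rules: pos : neg = 1:1, test set size = min(25k, 10% of all data), and distinct train and test sets. Below is a minimal Python sketch of those rules. It assumes samples are plain strings and that the 1:1 ratio is reached by truncating the larger class. `balanced_split` and its parameters are hypothetical. This is not the shell pipeline itself, whose middle section is elided in the diff.

```python
import random


def balanced_split(pos, neg, max_test=25000, test_ratio=0.1, seed=0):
    """Split pos/neg samples into balanced, disjoint train and test sets."""
    rng = random.Random(seed)
    # Deduplicate (comparable to `sort | uniq`) and shuffle each class.
    pos = sorted(set(pos))
    neg = sorted(set(neg))
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Keep the classes at a 1:1 ratio by truncating the larger one.
    n = min(len(pos), len(neg))
    pos, neg = pos[:n], neg[:n]
    # Test size = min(max_test, test_ratio * all data), split evenly.
    test_total = min(max_test, int(2 * n * test_ratio))
    per_class = test_total // 2
    test = [(1, t) for t in pos[:per_class]] + [(0, t) for t in neg[:per_class]]
    train = [(1, t) for t in pos[per_class:]] + [(0, t) for t in neg[per_class:]]
    # Mix positive and negative samples within each set.
    rng.shuffle(test)
    rng.shuffle(train)
    return train, test
```

Because each class is deduplicated before slicing, a sample cannot appear in both the train and test portions of the same class.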
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
""" """
1. (remove HTML before or not)tokensizing 1. Tokenize the words and punctuation
2. pos sample : rating score 5; neg sample: rating score 1-2. 2. pos sample : rating score 5; neg sample: rating score 1-2.
Usage: Usage:
...@@ -76,7 +76,11 @@ def tokenize(sentences): ...@@ -76,7 +76,11 @@ def tokenize(sentences):
sentences : a list of input sentences. sentences : a list of input sentences.
return: a list of processed text. return: a list of processed text.
""" """
dir = './data/mosesdecoder-master/scripts/tokenizer/tokenizer.perl' dir = './mosesdecoder-master/scripts/tokenizer/tokenizer.perl'
if not os.path.exists(dir):
sys.exit(
"The ./mosesdecoder-master/scripts/tokenizer/tokenizer.perl does not exists."
)
tokenizer_cmd = [dir, '-l', 'en', '-q', '-'] tokenizer_cmd = [dir, '-l', 'en', '-q', '-']
assert isinstance(sentences, list) assert isinstance(sentences, list)
text = "\n".join(sentences) text = "\n".join(sentences)
...@@ -104,7 +108,7 @@ def tokenize_batch(id): ...@@ -104,7 +108,7 @@ def tokenize_batch(id):
num_batch, instance, pre_fix = parse_queue.get() num_batch, instance, pre_fix = parse_queue.get()
if num_batch == -1: ### parse_queue finished if num_batch == -1: ### parse_queue finished
tokenize_queue.put((-1, None, None)) tokenize_queue.put((-1, None, None))
sys.stderr.write("tokenize theread %s finish\n" % (id)) sys.stderr.write("Thread %s finish\n" % (id))
break break
tokenize_instance = tokenize(instance) tokenize_instance = tokenize(instance)
tokenize_queue.put((num_batch, tokenize_instance, pre_fix)) tokenize_queue.put((num_batch, tokenize_instance, pre_fix))
......
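
The `tokenize` hunk above builds a command list and joins the sentences with newlines, but the call that runs the tokenizer is elided. One way such a step could look is sketched below, assuming Python 3.7+ and a tokenizer that reads stdin and writes one output line per input line. `run_tokenizer` is a hypothetical helper, not the function in `preprocess.py`.

```python
import os
import subprocess
import sys


def run_tokenizer(cmd, sentences):
    """Pipe newline-joined sentences through an external command.

    `cmd` is a list such as [tokenizer_path, '-l', 'en', '-q', '-'].
    Returns the command's output split into lines.
    """
    if not isinstance(sentences, list):
        raise TypeError("sentences must be a list")
    # Fail early with a clear message if the script is missing.
    if not os.path.exists(cmd[0]):
        sys.exit("Tokenizer script %s does not exist." % cmd[0])
    text = "\n".join(sentences)
    result = subprocess.run(
        cmd, input=text, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

Passing the whole batch through a single process call avoids starting the tokenizer once per sentence, which is presumably why the original code joins the sentences before tokenizing.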
...@@ -59,12 +59,11 @@ To build your text classification system, your code will need to perform five st ...@@ -59,12 +59,11 @@ To build your text classification system, your code will need to perform five st
## Preprocess data into standardized format ## Preprocess data into standardized format
In this example, you are going to use [Amazon electronic product review dataset](http://jmcauley.ucsd.edu/data/amazon/) to build a bunch of deep neural network models for text classification. Each text in this dataset is a product review. This dataset has two categories: “positive” and “negative”. Positive means the reviewer likes the product, while negative means the reviewer does not like the product. In this example, you are going to use [Amazon electronic product review dataset](http://jmcauley.ucsd.edu/data/amazon/) to build a bunch of deep neural network models for text classification. Each text in this dataset is a product review. This dataset has two categories: “positive” and “negative”. Positive means the reviewer likes the product, while negative means the reviewer does not like the product.
`demo/quick_start` in the [source code](https://github.com/baidu/Paddle) provides scripts for downloading data and preprocessing data as shown below. The data process takes several minutes (about 3 minutes in our machine). `demo/quick_start` in the [source code](https://github.com/PaddlePaddle/Paddle) provides script for downloading the preprocessed data as shown below. (If you want to process the raw data, you can use the script `demo/quick_start/data/proc_from_raw_data/get_data.sh`).
```bash ```bash
cd demo/quick_start cd demo/quick_start
./data/get_data.sh ./data/get_data.sh
./preprocess.sh
``` ```
## Transfer Data to Model ## Transfer Data to Model
......
...@@ -32,13 +32,11 @@ ...@@ -32,13 +32,11 @@
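## Data Preparation ## Data Preparation
In this problem, we use the [Amazon electronics product review data](http://jmcauley.ucsd.edu/data/amazon/) In this problem, we use the [Amazon electronics product review data](http://jmcauley.ucsd.edu/data/amazon/)
to classify reviews into two categories: positive reviews (positive samples) and negative reviews (negative samples). `demo/quick_start` in the [source code](https://github.com/baidu/Paddle) provides a data download script to classify reviews into two categories: positive reviews (positive samples) and negative reviews (negative samples). `demo/quick_start` in the [source code](https://github.com/PaddlePaddle/Paddle) provides a script for downloading the preprocessed data (to process the raw data from scratch, use the script `./demo/quick_start/data/proc_from_raw_data/get_data.sh`).
and a preprocessing script.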
```bash ```bash
cd demo/quick_start cd demo/quick_start
./data/get_data.sh ./data/get_data.sh
./preprocess.sh
``` ```
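## Transfer Data to Model ## Transfer Data to Model
...@@ -143,7 +141,7 @@ PyDataProvider2</a>. ...@@ -143,7 +141,7 @@ PyDataProvider2</a>.
We will start with a basic logistic regression network and gradually demonstrate more advanced features. For more detailed network configuration, We will start with a basic logistic regression network and gradually demonstrate more advanced features. For more detailed network configuration,
please refer to the <a href = "../../../doc/layer.html">Layer documentation</a> please refer to the <a href = "../../../doc/layer.html">Layer documentation</a>
All configurations are in the `demo/quick_start` directory of the [source code](https://github.com/baidu/Paddle); the logistic regression network is listed first. All configurations are in the `demo/quick_start` directory of the [source code](https://github.com/PaddlePaddle/Paddle); the logistic regression network is listed first.
### Logistic Regression Model ### Logistic Regression Model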
......