# Skip-Gram Word2Vec Model

## Introduction

This example trains skip-gram word2vec word embeddings with PaddlePaddle Fluid on the 1 Billion Word Language Model Benchmark and evaluates them with an analogical-reasoning task.

## Environment
You should install PaddlePaddle Fluid first.

## Dataset
The training data is the [1 Billion Word Language Model Benchmark](http://www.statmt.org/lm-benchmark).

Download the dataset:
```bash
cd data && ./download.sh && cd ..
```
If you would like to use the supported third-party vocabulary, run:

```bash
wget http://download.tensorflow.org/models/LM_LSTM_CNN/vocab-2016-09-10.txt
```

## Model
This example implements the skip-gram model of word2vec.
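
As background, the skip-gram objective predicts surrounding context words from a center word. The sketch below illustrates how (center, context) training pairs can be generated from a tokenized sentence; the window sampling here is an assumption for illustration, not the code used in this repository:

```python
import random

def skip_gram_pairs(tokens, max_window=5):
    """Yield (center, context) pairs for the skip-gram objective.

    Conceptual sketch: for each center word, sample a window size up to
    `max_window` and emit every word inside that window as a context word.
    """
    pairs = []
    for i, center in enumerate(tokens):
        window = random.randint(1, max_window)
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skip_gram_pairs("the quick brown fox jumps".split(), max_window=2))
```

Each pair is then trained as a classification problem using hierarchical softmax or noise-contrastive estimation, which correspond to the `--with_hs` and `--with_nce` options in the training command below.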


## Data Preprocessing

Preprocess the training data to generate a word dict.

```bash
python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --is_local --dict_path data/1-billion_dict
```
If you would like to use the supported third-party vocabulary, set `--other_dict_path` to the directory where you saved the vocabulary and turn on the `--with_other_dict` flag to use it.
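
Conceptually, the word dict is a vocabulary built from token counts over the training files. The sketch below only illustrates that idea; the frequency threshold and output format here are assumptions, not the exact behavior of `preprocess.py`:

```python
import collections
import os

def build_word_dict(data_dir, min_count=5):
    """Count tokens in every file under `data_dir` and keep the frequent ones."""
    counter = collections.Counter()
    for name in os.listdir(data_dir):
        with open(os.path.join(data_dir, name), encoding="utf-8") as f:
            for line in f:
                counter.update(line.split())
    # Keep words whose frequency reaches `min_count` (threshold is assumed).
    return {w: c for w, c in counter.items() if c >= min_count}
```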

## Train
The command line options for training can be listed by `python train.py -h`.

### Local Train
We set `CPU_NUM=1` as the default number of CPUs for training:
```bash
export CPU_NUM=1 && \
python train.py \
        --train_data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled \
        --dict_path data/1-billion_dict \
        --with_hs --with_nce --is_local \
        2>&1 | tee train.log
```


### Distributed Train
Run distributed training with 2 pservers and 2 trainers on a single machine.
In the distributed training setting, the training data is split by trainer_id so that it does not overlap among trainers (a sketch of this splitting idea follows the command below).

```bash
sh cluster_train.sh
```
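
The splitting referred to above can be as simple as assigning each training file to one trainer by `trainer_id`. A minimal sketch under that assumption (the actual logic lives in the training script and may differ):

```python
import os

def files_for_trainer(data_dir, trainer_id, trainer_num):
    """Assign each trainer a disjoint subset of the training files.

    Sketch of split-by-trainer_id: file k goes to trainer k % trainer_num,
    so no two trainers read the same shard.
    """
    files = sorted(os.listdir(data_dir))
    return [os.path.join(data_dir, f)
            for i, f in enumerate(files) if i % trainer_num == trainer_id]
```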

## Infer

In `infer.py` we construct some test cases in the `build_test_case` method to evaluate the quality of the word embedding.
Each test case is an analogical-reasoning task of the form A - B = C - D: we compute the vector A - B + D and look for the nearest C by cosine distance, removing A, B, and D from the candidate set when computing accuracy. We then calculate the cosine similarity between this query and every word in the embedding and print the top K results (K is set by the `--rank_num` parameter, default 4).

For example, for `boy - girl + aunt = uncle`:
0 nearest aunt: 0.89
1 nearest uncle: 0.70
2 nearest grandmother: 0.67
3 nearest father: 0.64

You can also add your own tests by mimicking the examples given in the `build_test_case` method.
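
The core of this evaluation is vector arithmetic plus a cosine-similarity nearest-neighbor search. Below is a minimal NumPy sketch of the idea (the plain-dict embedding format and function name are illustrative, not the actual `infer.py` code):

```python
import numpy as np

def analogy_top_k(emb, a, b, d, k=4):
    """Return the k words closest (by cosine similarity) to vec(a) - vec(b) + vec(d).

    `emb` maps word -> 1-D numpy vector. The query words a, b, d are excluded
    from the candidates, mirroring the accuracy computation described above.
    """
    query = emb[a] - emb[b] + emb[d]
    query = query / np.linalg.norm(query)
    scores = []
    for word, vec in emb.items():
        if word in (a, b, d):
            continue
        scores.append((float(np.dot(query, vec / np.linalg.norm(vec))), word))
    return sorted(scores, reverse=True)[:k]
```

For the example above, a good embedding should rank `uncle` near the top of `analogy_top_k(emb, "boy", "girl", "aunt")`.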

To run test cases from test files, download the test files into the `test` directory.
Each test case in the files has the structure `word1 word2 word3 word4`, which is built into the analogy `word1 - word2 + word3 = word4`.
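
Each line of such a file can then be parsed into a test case; a small sketch, assuming every valid line contains exactly four whitespace-separated words:

```python
def load_test_cases(path):
    """Parse lines of `word1 word2 word3 word4` into (w1, w2, w3, expected w4)."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 4:
                cases.append(tuple(parts))  # word1 - word2 + word3 = word4
    return cases
```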

Prediction during training:

```bash
python infer.py --infer_during_train 2>&1 | tee infer.log
```
Use a model for offline prediction:

```bash
python infer.py --infer_once --model_output_dir ./models/[specific models file directory] 2>&1 | tee infer.log
```

## Train on Baidu Cloud
1. Please prepare some CPU machines on Baidu Cloud following the steps in [train_on_baidu_cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst)
1. Prepare the dataset using `preprocess.py`.
1. Split train.txt into trainer_num parts and put them on the machines.
1. Run the cluster training using the command in `Distributed Train` above.