
# Skip-Gram Word2Vec Model

## Introduction

This example implements the skip-gram model of word2vec with PaddlePaddle Fluid and trains word embeddings on the 1 Billion Word Language Model Benchmark.

## Environment
You should install PaddlePaddle Fluid first.
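
A quick way to verify the installation is to import the Fluid API from Python (assuming PaddlePaddle was installed, e.g. via pip):

```python
# Minimal check that the Fluid API is importable after installation.
import paddle.fluid as fluid

print("PaddlePaddle Fluid is available, CPU place:", fluid.CPUPlace())
```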

## Dataset
The training data is the 1 Billion Word Language Model Benchmark (http://www.statmt.org/lm-benchmark).

Download the dataset:
```bash
cd data && ./download.sh && cd ..
```
If you would like to use our supported third-party vocab, please run:

```bash
wget http://download.tensorflow.org/models/LM_LSTM_CNN/vocab-2016-09-10.txt
```

## Model
This example implements the skip-gram model of word2vec.
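
As a rough illustration of the skip-gram formulation (a conceptual sketch only, not the Fluid network defined in this directory): the model slides a window over each sentence and, for every center word, predicts the surrounding context words. Generating the (center, context) training pairs could look like this:

```python
import random

def skipgram_pairs(token_ids, window=5):
    """Yield (center, context) pairs for skip-gram training.

    token_ids: a sentence encoded as a list of word ids.
    window: maximum distance between center and context words.
    """
    for i, center in enumerate(token_ids):
        # word2vec commonly samples a reduced window size per center word
        w = random.randint(1, window)
        for j in range(max(0, i - w), min(len(token_ids), i + w + 1)):
            if j != i:
                yield center, token_ids[j]

# Example: token ids of a short sentence
print(list(skipgram_pairs([4, 8, 15, 16, 23], window=2)))
```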


## Data Preprocessing

Preprocess the training data to generate a word dict.

```bash
python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --dict_path data/1-billion_dict
```
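
Conceptually, this step counts word frequencies over the tokenized training files and keeps the frequent words; a simplified sketch of that idea (the real preprocess.py handles more details, and the min_count threshold here is only illustrative):

```python
import os
from collections import Counter

def build_dict(data_dir, min_count=5):
    """Count word frequencies over all files in data_dir and keep frequent words."""
    counter = Counter()
    for name in os.listdir(data_dir):
        with open(os.path.join(data_dir, name)) as f:
            for line in f:
                counter.update(line.split())
    # keep words that appear at least min_count times, most frequent first
    return {w: c for w, c in counter.most_common() if c >= min_count}

word_count = build_dict(
    "./data/1-billion-word-language-modeling-benchmark-r13output/"
    "training-monolingual.tokenized.shuffled")
print(len(word_count), "words kept")
```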
If you would like to use your own vocab, follow the format below:
```bash
<UNK>
a
b
c
```
Then set --other_dict_path to the directory where you save the vocab you will use, and turn on the --with_other_dict flag to use it.
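
For reference, a vocab file in this one-token-per-line format can be read into a word-to-id mapping as sketched below (the path is hypothetical and the id assignment is illustrative, not necessarily what this repo's readers do):

```python
def load_vocab(vocab_path):
    """Read a one-token-per-line vocab file into a word -> id dict."""
    word_to_id = {}
    with open(vocab_path) as f:
        for line in f:
            word = line.strip()
            if word and word not in word_to_id:
                word_to_id[word] = len(word_to_id)
    return word_to_id

vocab = load_vocab("my_vocab.txt")  # hypothetical path to your vocab file
print(vocab.get("<UNK>"))           # 0 if <UNK> is the first line
```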

## Train
The command line options for training can be listed by `python train.py -h`.

### Local Train
We set CPU_NUM=1 as the default number of CPUs to execute on:
```bash
export CPU_NUM=1 && \
python train.py \
        --train_data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled \
        --dict_path data/1-billion_dict \
        --with_hs --with_nce --is_local \
        2>&1 | tee train.log
```
If you would like to use our supported third-party vocab, set --other_dict_path to the directory where you save the vocab you will use, and turn on the --with_other_dict flag to use it.
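
The --with_nce and --with_hs flags correspond to noise-contrastive estimation and hierarchical softmax as training losses. As a conceptual illustration only (the actual losses are built from Fluid operators inside train.py, and full NCE also involves the noise distribution), the sampling idea behind NCE scores the true context word against a few sampled noise words:

```python
import numpy as np

def nce_style_loss(center_vec, context_vec, noise_vecs):
    """Negative-sampling style loss for one (center, context) pair.

    center_vec:  embedding of the center word, shape (dim,)
    context_vec: embedding of the true context word, shape (dim,)
    noise_vecs:  embeddings of k sampled noise words, shape (k, dim)
    """
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    pos = -np.log(sigmoid(np.dot(context_vec, center_vec)))      # true pair should score high
    neg = -np.sum(np.log(sigmoid(-noise_vecs.dot(center_vec))))  # noise words should score low
    return pos + neg

dim, k = 8, 4
rng = np.random.RandomState(0)
print(nce_style_loss(rng.rand(dim), rng.rand(dim), rng.rand(k, dim)))
```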

### Distributed Train
Run distributed training with 2 pservers and 2 trainers on a single machine.
In the distributed training setting, the training data is split by trainer_id so that the data does not overlap among trainers.

```bash
sh cluster_train.sh
```
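
A common way to obtain the non-overlapping split mentioned above is to shard the list of training files by trainer id; a minimal sketch, not necessarily the exact logic used in this repo:

```python
import os

def shard_for_trainer(data_dir, trainer_id, trainer_num):
    """Return the subset of training files this trainer should read."""
    files = sorted(os.listdir(data_dir))
    # every trainer takes every trainer_num-th file, starting at its own id
    return files[trainer_id::trainer_num]

files = shard_for_trainer(
    "./data/1-billion-word-language-modeling-benchmark-r13output/"
    "training-monolingual.tokenized.shuffled",
    trainer_id=0, trainer_num=2)
print(len(files), "files assigned to trainer 0")
```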

## Infer

In infer.py we construct some test cases in the `build_test_case` method to evaluate the quality of the word embedding.
Each test case uses the analogical-reasoning task: for a relation of the form A - B = C - D, we compute A - B + D and look for the nearest C by cosine distance, removing A, B, and D from the candidate set when computing accuracy. We then compute the cosine similarity between the resulting vector and every word in the embedding and print the top K candidates (K is set by the --rank_num parameter, default 4).

For example:

    For: boy - girl + aunt = uncle
    0 nearest aunt: 0.89
    1 nearest uncle: 0.70
    2 nearest grandmother: 0.67
    3 nearest father: 0.64

You can also add your own tests by mimicking the examples given in the `build_test_case` method.
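
As a simplified sketch of the evaluation described above (not the exact code in infer.py), the analogy A - B + D is compared against all word vectors by cosine similarity, with A, B, and D excluded from the candidates:

```python
import numpy as np

def analogy_topk(emb, word_to_id, a, b, d, k=4):
    """Top-k words closest to vec(a) - vec(b) + vec(d) by cosine similarity."""
    query = emb[word_to_id[a]] - emb[word_to_id[b]] + emb[word_to_id[d]]
    # cosine similarity between the query and every embedding row
    sims = emb.dot(query) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query) + 1e-8)
    exclude = {word_to_id[a], word_to_id[b], word_to_id[d]}
    order = [i for i in np.argsort(-sims) if i not in exclude]
    id_to_word = {i: w for w, i in word_to_id.items()}
    return [(id_to_word[i], float(sims[i])) for i in order[:k]]

# Toy usage with a random embedding table; real vectors come from the trained model.
vocab = {w: i for i, w in enumerate(["boy", "girl", "aunt", "uncle", "father"])}
emb = np.random.rand(len(vocab), 16)
print(analogy_topk(emb, vocab, "boy", "girl", "aunt"))
```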

To run test cases from test files, please download the test files into the 'test' directory.
Each test case in a file has the following structure:
        `word1 word2 word3 word4`
which we build into `word1 - word2 + word3 = word4`.

To predict during training:

```bash
python infer.py --infer_during_train 2>&1 | tee infer.log
```
Use a model for offline prediction:

```bash
python infer.py --infer_once --model_output_dir ./models/[specific models file directory] 2>&1 | tee infer.log
```

## Train on Baidu Cloud
1. Please prepare some CPU machines on Baidu Cloud following the steps in [train_on_baidu_cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst)
1. Prepare the dataset using preprocess.py.
1. Split train.txt into trainer_num parts and put them on the machines (a minimal split sketch follows this list).
1. Run distributed training using the command in `Distributed Train` above.
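
A minimal way to perform the split in step 3 (assuming a single train.txt produced by preprocessing; the helper below is only a sketch, adapt paths as needed) is to write lines round-robin into trainer_num part files:

```python
def split_train_file(path, trainer_num):
    """Split path into trainer_num part files, distributing lines round-robin."""
    outs = [open("%s.part-%d" % (path, i), "w") for i in range(trainer_num)]
    with open(path) as f:
        for i, line in enumerate(f):
            outs[i % trainer_num].write(line)
    for out in outs:
        out.close()

split_train_file("train.txt", trainer_num=2)
```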