# Skip-Gram Word2Vec Model

## Introduction

This example implements the skip-gram model of word2vec with PaddlePaddle Fluid and trains it on the 1 Billion Word Language Model Benchmark.

## Environment
You should install PaddlePaddle Fluid first.
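
For example, a CPU-only build can usually be installed with pip; treat this as a minimal sketch and refer to the official installation guide for GPU builds or the exact version matching this example:

```bash
# Minimal CPU-only install; see the official PaddlePaddle installation
# guide for GPU builds or a version pinned to this example.
pip install paddlepaddle
```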

## Dataset
The training data comes from the [1 Billion Word Language Model Benchmark](http://www.statmt.org/lm-benchmark).

Download the dataset:
```bash
cd data && ./download.sh && cd ..
```

## Model
This model implements the skip-gram model of word2vec.
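
As a rough illustration of the skip-gram formulation (a minimal sketch, not the code in this repository), each center word is paired with the words inside a fixed-size context window, and the model is trained to predict the context word from the center word:

```python
# Minimal sketch of skip-gram pair generation (illustration only, not the
# repository's implementation): pair every center word with the words that
# fall inside its context window.
def build_skipgram_pairs(token_ids, window=5):
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window)
        hi = min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, token_ids[j]))  # (center word, context word)
    return pairs

# Example: a 4-token sentence mapped to ids [0, 1, 2, 3]
print(build_skipgram_pairs([0, 1, 2, 3], window=2))
```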


## Data Preprocessing

Preprocess the training data to generate a word dict.

```bash
python preprocess.py --data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled --dict_path data/1-billion_dict
```
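
Roughly speaking, this step counts word frequencies over the tokenized training files and writes them to the dictionary file; the sketch below is an assumption about that behavior, not the actual `preprocess.py`:

```python
# Assumed behavior of the dictionary-building step (illustration only):
# count word frequencies across all training files and write "word count"
# lines to the dictionary file.
import collections
import os

def build_dict(data_path, dict_path):
    counter = collections.Counter()
    for name in sorted(os.listdir(data_path)):
        with open(os.path.join(data_path, name)) as f:
            for line in f:
                counter.update(line.split())
    with open(dict_path, "w") as f:
        for word, freq in counter.most_common():
            f.write("%s %d\n" % (word, freq))
```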

## Train
The command line options for training can be listed by `python train.py -h`.

### Local Train
```bash
python train.py \
        --train_data_path ./data/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled \
        --dict_path data/1-billion_dict \
        2>&1 | tee train.log
```
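
Because the command above tees its output to `train.log`, training progress can be followed from another shell:

```bash
tail -f train.log
```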


### Distributed Train
Run distributed training with 2 parameter servers (pservers) and 2 trainers on a single machine.
In the distributed setting, the training data is split by `trainer_id` so that the data does not overlap among trainers.

```bash
sh cluster_train.sh
```
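
The non-overlapping split mentioned above can be pictured with a small sketch (an assumption about the approach, not the script's exact logic): each trainer keeps only the files whose index matches its `trainer_id` modulo the number of trainers.

```python
# Illustration (assumption, not the repository's exact code) of partitioning
# training files by trainer id so that trainers do not share data.
def files_for_trainer(all_files, trainer_id, trainer_num):
    return [f for i, f in enumerate(sorted(all_files)) if i % trainer_num == trainer_id]
```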

## Infer

In infer.py we construct some test cases in the `build_test_case` method to evaluate the quality of the word embedding.
Each test case follows the analogical-reasoning task: given a relation of the form A - B = C - D, we compute A - B + D and look for the nearest C by cosine distance; when measuring accuracy, candidates that are themselves A, B, or D are removed. The candidate vector is compared against every word in the embedding table by cosine similarity, and the top-K results are printed (K is set by the `--rank_num` parameter, default 4).

For example, for `boy - girl + aunt = uncle`:

```
0 nearest aunt: 0.89
1 nearest uncle: 0.70
2 nearest grandmother: 0.67
3 nearest father: 0.64
```

You can also add your own tests by mimicking the examples given in the `build_test_case` method.
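
The ranking step described above amounts to scoring `A - B + D` against every embedding vector by cosine similarity and keeping the top-K candidates; here is a small numpy sketch (an assumption about the approach, not `infer.py` itself):

```python
# numpy sketch of the analogy ranking (illustration only, not infer.py):
# score A - B + D against all embeddings by cosine similarity, return top-K.
# (the actual evaluation also filters A, B and D out of the candidates)
import numpy as np

def top_k_analogy(emb, word2id, id2word, a, b, d, k=4):
    query = emb[word2id[a]] - emb[word2id[b]] + emb[word2id[d]]
    sims = emb @ query / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query) + 1e-8)
    order = np.argsort(-sims)
    return [(id2word[i], float(sims[i])) for i in order[:k]]
```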

Run inference during training:

```bash
python infer.py --infer_during_train 2>&1 | tee infer.log
```
Use a model for offline prediction:

```bash
python infer.py --infer_once --model_output_dir ./models/[specific models file directory] 2>&1 | tee infer.log
```

## Train on Baidu Cloud
1. Please prepare some CPU machines on Baidu Cloud following the steps in [train_on_baidu_cloud](https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/user_guides/howto/training/train_on_baidu_cloud_cn.rst)
1. Prepare the dataset using preprocess.py.
1. Split train.txt into trainer_num parts and put them on the machines (see the sketch below).
1. Run distributed training using the command in the `Distributed Train` section above.
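
One possible way to do this split, assuming 2 trainers and GNU coreutils `split` (an illustration, not a prescribed command):

```bash
# Split train.txt into 2 line-aligned parts with numeric suffixes
# (train.part-00, train.part-01), one per trainer.
split -n l/2 -d train.txt train.part-
```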