deleted codes and related Chinese texts

3da6b45b · wangyang59 · 04cfb4d9 · 3da6b45b

Showing 1 changed file with 3 additions and 310 deletions

word2vec/README.en.md +3 -310

--- a/word2vec/README.en.md
+++ b/word2vec/README.en.md
@@ -134,318 +134,11 @@ As illustrated in the figure above, Skip-gram model maps the word embedding of t

 ## Data Preparation

-### 数据介绍与下载
-
-本教程使用Penn Tree Bank (PTB)数据集。PTB数据集较小，训练速度快，应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下：
-
-<p align="center">
-<table>
-	<tr>
-		<td>训练数据</td>
-		<td>验证数据</td>
-		<td>测试数据</td>
-	</tr>
-	<tr>
-		<td>ptb.train.txt</td>
-		<td>ptb.valid.txt</td>
-		<td>ptb.test.txt</td>
-	</tr>
-	<tr>
-		<td>42068句</td>
-		<td>3370句</td>
-		<td>3761句</td>
-	</tr>
-</table>
-</p>
-
-执行以下命令，可下载该数据集，并分别将训练数据和验证数据输入`train.list`和`test.list`文件中，供PaddlePaddle训练时使用。
-
-```bash
-./data/getdata.sh
-```
-
-	
-### 提供数据给PaddlePaddle
-
-1. 使用initializer函数进行dataprovider的初始化，包括字典的建立（build_dict函数中）和PaddlePaddle输入字段的格式定义。注意：这里N为n-gram模型中的`n`, 本章代码中，定义$N=5$, 表示在PaddlePaddle训练时，每条数据的前4个词用来预测第5个词。大家也可以根据自己的数据和需求自行调整N，但调整的同时要在模型配置文件中加入/减少相应输入字段。
-
-    ```python
-    from paddle.trainer.PyDataProvider2 import *
-    import collections
-    import logging
-    import pdb
-    
-    logging.basicConfig(
-        format='[%(levelname)s %(asctime)s %(filename)s:%(lineno)s] %(message)s', )
-    logger = logging.getLogger('paddle')
-    logger.setLevel(logging.INFO)
-    
-    N = 5  # Ngram
-    cutoff = 50  # select words with frequency > cutoff to dictionary
-    def build_dict(ftrain, fdict):
-    	sentences = []
-        with open(ftrain) as fin:
-            for line in fin:
-                line = ['<s>'] + line.strip().split() + ['<e>']
-                sentences += line
-        wordfreq = collections.Counter(sentences)
-        wordfreq = filter(lambda x: x[1] > cutoff, wordfreq.items())
-        dictionary = sorted(wordfreq, key = lambda x: (-x[1], x[0]))
-        words, _ = list(zip(*dictionary))
-        for word in words:
-            print >> fdict, word
-        word_idx = dict(zip(words, xrange(len(words))))
-        logger.info("Dictionary size=%s" %len(words))
-        return word_idx
-    
-    def initializer(settings, srcText, dictfile, **xargs):
-        with open(dictfile, 'w') as fdict:
-            settings.dicts = build_dict(srcText, fdict)
-        input_types = []
-        for i in xrange(N):
-            input_types.append(integer_value(len(settings.dicts)))
-        settings.input_types = input_types
-    ```
-
-2. 使用process函数中将数据逐一提供给PaddlePaddle。具体来说，将每句话前面补上N-1个开始符号 `<s>`, 末尾补上一个结束符号 `<e>`，然后以N为窗口大小，从头到尾每次向右滑动窗口并生成一条数据。
-
-    ```python
-    @provider(init_hook=initializer)
-    def process(settings, filename):
-        UNKID = settings.dicts['<unk>']
-        with open(filename) as fin:
-            for line in fin:
-                line = ['<s>']*(N-1)  + line.strip().split() + ['<e>']
-                line = [settings.dicts.get(w, UNKID) for w in line]
-                for i in range(N, len(line) + 1):
-                    yield line[i-N: i]
-    ```
-    
-    如"I have a dream" 一句提供了5条数据:
-
-    > `<s> <s> <s> <s> I` <br>
-    > `<s> <s> <s> I have` <br>
-    > `<s> <s> I have a`  <br>
-    > `<s> I have a dream` <br>
-    > `I have a dream <e>` <br>
-
-
-## 模型配置说明
-
-### 数据定义
-
-通过`define_py_data_sources2`函数从dataprovider中读入数据，其中args指定了训练文本(srcText)和词汇表(dictfile)。
-
-```python
-from paddle.trainer_config_helpers import *
-import math
-
-args = {'srcText': 'data/simple-examples/data/ptb.train.txt',
-        'dictfile': 'data/vocabulary.txt'}
-		
-define_py_data_sources2(
-    train_list="data/train.list",
-    test_list="data/test.list",
-    module="dataprovider",
-    obj="process",
-    args=args)
-```
-
-### 算法配置
-
-在这里，我们指定了模型的训练参数, L2正则项系数、学习率和batch size。
-
-```python
-settings(
-    batch_size=100, regularization=L2Regularization(8e-4), learning_rate=3e-3)
-```
-
-### 模型结构
-
-本配置的模型结构如下图所示：
-
-<p align="center">	
-	<img src="image/ngram.png" width=400><br/>
-	图5. 模型配置中的N-gram神经网络模型
-</p>
-
-1. 定义参数维度和和数据输入。
-
-    ```python
-    dictsize = 1953 # 字典大小
-    embsize = 32 # 词向量维度
-    hiddensize = 256 # 隐层维度
-    
-    firstword = data_layer(name = "firstw", size = dictsize)
-    secondword = data_layer(name = "secondw", size = dictsize)
-    thirdword = data_layer(name = "thirdw", size = dictsize)
-    fourthword = data_layer(name = "fourthw", size = dictsize)
-    nextword = data_layer(name = "fifthw", size = dictsize)
-    ```
-
-2. 将$w_t$之前的$n-1$个词 $w_{t-n+1},...w_{t-1}$，通过$|V|\times D$的矩阵映射到D维词向量（本例中取D=32）。
-	
-	```python	
-	def wordemb(inlayer):
-		wordemb = table_projection(
-        input = inlayer,
-        size = embsize,
-        param_attr=ParamAttr(name = "_proj",
-            initial_std=0.001, # 参数初始化标准差
-            l2_rate= 0,))      # 词向量不需要稀疏化，因此其l2_rate设为0
-    return wordemb
-
-	Efirst = wordemb(firstword)
-	Esecond = wordemb(secondword)
-	Ethird = wordemb(thirdword)
-	Efourth = wordemb(fourthword)
-	```
-
-3. 接着，将这n-1个词向量经过concat_layer连接成一个大向量作为历史文本特征。
-
-	```python
-	contextemb = concat_layer(input = [Efirst, Esecond, Ethird, Efourth])
-	```
-4. 然后，将历史文本特征经过一个全连接得到文本隐层特征。
-
-    ```python
-	hidden1 = fc_layer(
-	        input = contextemb,
-	        size = hiddensize,
-	        act = SigmoidActivation(),
-	        layer_attr = ExtraAttr(drop_rate=0.5),
-	        bias_attr = ParamAttr(learning_rate = 2),
-	        param_attr = ParamAttr(
-	            initial_std = 1./math.sqrt(embsize*8),
-	            learning_rate = 1))
-    ```
+## Model Configuration
 	
-5. 最后，将文本隐层特征，再经过一个全连接，映射成一个$|V|$维向量，同时通过softmax归一化得到这`|V|`个词的生成概率。
-
-    ```python
-	# use context embedding to predict nextword
-	predictword = fc_layer(
-	        input = hidden1,
-	        size = dictsize,
-	        bias_attr = ParamAttr(learning_rate = 2),
-	        act = SoftmaxActivation())
-	```
-
-6. 网络的损失函数为多分类交叉熵，可直接调用`classification_cost`函数。
-
-	```python
-	cost = classification_cost(
-	        input = predictword,
-	        label = nextword)
-	# network input and output
-	outputs(cost)
-	```
-	
-##训练模型
-
-模型训练命令为`./train.sh`。脚本内容如下，其中指定了总共需要执行30个pass。
-
-```bash
-paddle train \
-       --config ngram.py \
-       --use_gpu=1 \
-       --dot_period=100 \
-       --log_period=3000 \
-       --test_period=0 \
-       --save_dir=model \
-       --num_passes=30
-```
-
-一个pass的训练日志如下所示：
-
-```text
-.............................
-I1222 09:27:16.477841 12590 TrainerInternal.cpp:162]  Batch=3000 samples=300000 AvgCost=5.36135 CurrentCost=5.36135 Eval: classification_error_evaluator=0.818653  CurrentEval: class
-ification_error_evaluator=0.818653 
-.............................
-I1222 09:27:22.416700 12590 TrainerInternal.cpp:162]  Batch=6000 samples=600000 AvgCost=5.29301 CurrentCost=5.22467 Eval: classification_error_evaluator=0.814542  CurrentEval: class
-ification_error_evaluator=0.81043 
-.............................
-I1222 09:27:28.343756 12590 TrainerInternal.cpp:162]  Batch=9000 samples=900000 AvgCost=5.22494 CurrentCost=5.08876 Eval: classification_error_evaluator=0.810088  CurrentEval: class
-ification_error_evaluator=0.80118 
-..I1222 09:27:29.128582 12590 TrainerInternal.cpp:179]  Pass=0 Batch=9296 samples=929600 AvgCost=5.21786 Eval: classification_error_evaluator=0.809647 
-I1222 09:27:29.627616 12590 Tester.cpp:111]  Test samples=73760 cost=4.9594 Eval: classification_error_evaluator=0.79676 
-I1222 09:27:29.627713 12590 GradientMachine.cpp:112] Saving parameters to model/pass-00000
-```
-经过30个pass，我们将得到平均错误率为classification_error_evaluator=0.735611。
-
-
-## 应用模型
-训练模型后，我们可以加载模型参数，用训练出来的词向量初始化其他模型，也可以将模型参数从二进制格式转换成文本格式进行后续应用。
-
-### 初始化其他模型
+## Model Training

-训练好的模型参数可以用来初始化其他模型。具体方法如下：
-在PaddlePaddle 训练命令行中，用`--init_model_path` 来定义初始化模型的位置，用`--load_missing_parameter_strategy`指定除了词向量以外的新模型其他参数的初始化策略。注意，新模型需要和原模型共享被初始化参数的参数名。
-	
-### 查看词向量
-PaddlePaddle训练出来的参数为二进制格式，存储在对应训练pass的文件夹下。这里我们提供了文件`format_convert.py`用来互转PaddlePaddle训练结果的二进制文件和文本格式特征文件。
-
-```bash
-python format_convert.py --b2t -i INPUT -o OUTPUT -d DIM
-```
-其中，INPUT是输入的（二进制）词向量模型名称，OUTPUT是输出的文本模型名称，DIM是词向量参数维度。
-
-用法如：
-
-```bash
-python format_convert.py --b2t -i model/pass-00029/_proj -o model/pass-00029/_proj.txt -d 32
-```
-转换后得到的文本文件如下：
-
-```text
-0,4,62496
-0.7444070,-0.1846171,-1.5771370,0.7070392,2.1963732,-0.0091410, ......
-0.0721337,-0.2429973,-0.0606297,0.1882059,-0.2072131,-0.7661019, ......
-......
-```
-
-其中，第一行是PaddlePaddle 输出文件的格式说明，包含3个属性：<br/>
-1) PaddlePaddle的版本号，本例中为0;<br/>
-2) 浮点数占用的字节数，本例中为4;<br/>
-3) 总计的参数个数, 本例中为62496（即1953*32）;<br/>
-第二行及之后的每一行都按顺序表示字典里一个词的特征，用逗号分隔。
-	
-### 修改词向量
-
-我们可以对词向量进行修改，并转换成PaddlePaddle参数二进制格式，方法：	
-
-```bash
-python format_convert.py --t2b -i INPUT -o OUTPUT
-```
-
-其中，INPUT是输入的输入的文本词向量模型名称，OUTPUT是输出的二进制词向量模型名称
-
-输入的文本格式如下（注意，不包含上面二进制转文本后第一行的格式说明）：
-
-```text
-0.7444070,-0.1846171,-1.5771370,0.7070392,2.1963732,-0.0091410, ......
-0.0721337,-0.2429973,-0.0606297,0.1882059,-0.2072131,-0.7661019, ......
-......
-```
-	
-	
-
-### 计算词语之间的余弦距离
-
-两个向量之间的距离可以用余弦值来表示，余弦值在$[-1,1]$的区间内，向量间余弦值越大，其距离越近。这里我们在`calculate_dis.py`中实现不同词语的距离度量。
-用法如下：
-
-```bash
-python calculate_dis.py VOCABULARY EMBEDDINGLAYER` 
-```
-
-其中，`VOCABULARY`是字典，`EMBEDDINGLAYER`是词向量模型，示例如下：
-
-```bash
-python calculate_dis.py data/vocabulary.txt model/pass-00029/_proj.txt
-```
- 
+## Model Application
 
 ## Conclusion