diff --git a/demo/quant/quant_embedding/README.md b/demo/quant/quant_embedding/README.md
new file mode 100755
index 0000000000000000000000000000000000000000..5667b19a7f27062dc508a68569ae9fb86d178b45
--- /dev/null
+++ b/demo/quant/quant_embedding/README.md
@@ -0,0 +1,240 @@
+# Embedding Quantization Demo
+
+This demo shows how to use the embedding quantization API [paddleslim.quant.quant_embedding]() . The ``quant_embedding`` interface quantizes the Embedding parameters of a network from ``float32`` to ``8-bit`` integers, reducing the storage and memory footprint of the model with almost no loss in accuracy.
+
+The interface is:
+```
+quant_embedding(program, place, config, scope=None)
+```
+
+Arguments:
+
+- program(fluid.Program) : the program to be quantized.
+- scope(fluid.Scope, optional) : the scope used to read and write ``Variable``s. If set to ``None``, ``fluid.global_scope()`` is used.
+- place(fluid.CPUPlace or fluid.CUDAPlace): the device on which the program runs.
+- config(dict) : the quantization configuration. The supported keys are:
+    - ``'params_name'`` (str, required): the name of the parameter to be quantized. This key must be set.
+    - ``'quantize_type'`` (str, optional): the quantization type. Currently only ``'abs_max'`` is supported; ``'log'`` and ``'product_quantization'`` are planned. Defaults to ``'abs_max'``.
+    - ``'quantize_bits'`` (int, optional): the number of quantization bits. Currently only 8 bits are supported. Defaults to 8.
+    - ``'dtype'`` (str, optional): the data type after quantization. Currently only ``'int8'`` is supported. Defaults to ``'int8'``.
+    - ``'threshold'`` (float, optional): if set, parameter values are clipped to this threshold before quantization. If not set, the clip step is skipped and the values are quantized directly.
+
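+Below is a minimal sketch of a typical call. The toy program and the parameter name ``'emb'`` are only for illustration and are not part of this demo:
+
+```python
+import paddle.fluid as fluid
+from paddleslim.quant import quant_embedding
+
+place = fluid.CPUPlace()
+exe = fluid.Executor(place)
+
+main_program = fluid.Program()
+startup_program = fluid.Program()
+with fluid.program_guard(main_program, startup_program):
+    ids = fluid.layers.data(name='ids', shape=[1], dtype='int64')
+    emb = fluid.layers.embedding(
+        input=ids,
+        size=[100, 8],
+        param_attr=fluid.ParamAttr(name='emb'))
+exe.run(startup_program)  # initialize the float32 'emb' parameter
+
+config = {'params_name': 'emb', 'quantize_type': 'abs_max'}
+# returns a program in which 'emb' has been quantized to an int8
+# parameter named 'emb.int8'
+quant_program = quant_embedding(main_program, place, config)
+```
+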
+Changes this interface makes to the program:
+
+Before quantization:
+
+![](./image/before.png)
+
+Figure 1: model structure before quantization
+
+
+After quantization:
+
+![](./image/after.png)
+
+Figure 2: model structure after quantization
+
+
+The following uses the ``skip-gram word2vec model`` as an example to show how to use the ``quant_embedding`` interface. We first describe the normal training and evaluation workflow of this model.
+
+## Skip-gram word2vec model
+
+The directory layout of this example is as follows:
+
+```text
+.
+├── cluster_train.py # distributed training
+├── cluster_train.sh # script that simulates multi-machine training locally
+├── train.py         # single-machine training
+├── infer.py         # inference script
+├── net.py           # network definition
+├── preprocess.py    # preprocessing script: builds the dictionary and preprocesses the corpus
+├── reader.py        # data reader for training
+└── utils.py         # common utilities
+
+```
+
+### Introduction
+This example implements the skip-gram variant of the word2vec model.
+
+We also recommend the [IPython Notebook demo](https://aistudio.baidu.com/aistudio/projectDetail/124377).
+
+### Downloading the data
+The full dataset is the [1 Billion Word Language Model Benchmark](http://www.statmt.org/lm-benchmark) dataset.
+
+```bash
+mkdir data
+wget http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
+tar xzvf 1-billion-word-language-modeling-benchmark-r13output.tar.gz
+mv 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/ data/
+```
+
+Commands for the backup download location:
+
+```bash
+mkdir data
+wget https://paddlerec.bj.bcebos.com/word2vec/1-billion-word-language-modeling-benchmark-r13output.tar
+tar xvf 1-billion-word-language-modeling-benchmark-r13output.tar
+mv 1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/ data/
+```
+
+For quick verification we also provide the classic text8 sample dataset, which contains about 17 million tokens. Download it with:
+
+```bash
+mkdir data
+wget https://paddlerec.bj.bcebos.com/word2vec/text.tar
+tar xvf text.tar
+mv text data/
+```
+
+
+### Data preprocessing
+The steps below use the sample dataset. For the full dataset, note that after extraction the directory to preprocess is training-monolingual.tokenized.shuffled, which sits alongside the text directory of the sample dataset.
+
+Dictionary format: each line is a word and its count, separated by a space. Note that low-frequency words are represented by 'UNK'.
+
+You can build your own dictionary in this format; in that case, skip step 1.
+```
+the 1061396
+of 593677
+and 416629
+one 411764
+in 372201
+a 325873
+<UNK> 324608
+to 316376
+zero 264975
+nine 250430
+```
+
+Step 1: build the dictionary from the English corpus. For a Chinese corpus, you can customize the processing by modifying the text_strip method.
+
+```bash
+python preprocess.py --build_dict --build_dict_corpus_dir data/text/ --dict_path data/test_build_dict
+```
+
+Step 2: convert the text to ids with the dictionary and downsample it, filtering frequent words probabilistically. This step also generates the word-to-id mapping file, whose name is the dictionary name plus the suffix "_word_to_id_".
+
+```bash
+python preprocess.py --filter_corpus --dict_path data/test_build_dict --input_corpus_dir data/text --output_corpus_dir data/convert_text8 --min_count 5 --downsample 0.001
+```
+
+### Training
+To see all configurable options, run:
+
+
+```bash
+python train.py -h
+```
+
+Single-machine multi-threaded training:
+```bash
+OPENBLAS_NUM_THREADS=1 CPU_NUM=5 python train.py --train_data_dir data/convert_text8 --dict_path data/test_build_dict --num_passes 10 --batch_size 100 --model_output_dir v1_cpu5_b100_lr1dir --base_lr 1.0 --print_batch 1000 --with_speed --is_sparse
+```
+
+Simulating multi-machine training locally on a single machine:
+
+```bash
+sh cluster_train.sh
+```
+
+This demo uses the single-machine multi-threaded training command above. After training finishes, the models are saved under ``v1_cpu5_b100_lr1dir`` in the current directory. Running ``ls v1_cpu5_b100_lr1dir`` shows the model files for the 10 training epochs:
+```
+pass-0 pass-1 pass-2 pass-3 pass-4 pass-5 pass-6 pass-7 pass-8 pass-9
+```
+
+### Inference
+Download the test set with:
+
+```bash
+# test set for the full dataset
+wget https://paddlerec.bj.bcebos.com/word2vec/test_dir.tar
+# test set for the sample dataset
+wget https://paddlerec.bj.bcebos.com/word2vec/test_mid_dir.tar
+```
+
+Inference command. Note that the dictionary name needs the suffix "_word_to_id_"; this file is generated during preprocessing.
+```bash
+python infer.py --infer_epoch --test_dir data/test_mid_dir --dict_path data/test_build_dict_word_to_id_ --batch_size 20000 --model_dir v1_cpu5_b100_lr1dir/ --start_index 0 --last_index 9
+```
+Running this command produces output like:
+```
+('start index: ', 0, ' last_index:', 9)
+('vocab_size:', 63642)
+step:1 249
+epoch:0 acc:0.014
+step:1 590
+epoch:1 acc:0.033
+step:1 982
+epoch:2 acc:0.055
+step:1 1338
+epoch:3 acc:0.075
+step:1 1653
+epoch:4 acc:0.093
+step:1 1914
+epoch:5 acc:0.107
+step:1 2204
+epoch:6 acc:0.124
+step:1 2416
+epoch:7 acc:0.136
+step:1 2606
+epoch:8 acc:0.146
+step:1 2722
+epoch:9 acc:0.153
+```
+
+## Quantizing the skip-gram word2vec model
+
+The quantization configuration is:
+```
+config = {
+ 'params_name': 'emb',
+ 'quantize_type': 'abs_max'
+ }
+```
+
+The command to run is:
+
+```bash
+python infer.py --infer_epoch --test_dir data/test_mid_dir --dict_path data/test_build_dict_word_to_id_ --batch_size 20000 --model_dir v1_cpu5_b100_lr1dir/ --start_index 0 --last_index 9 --emb_quant True
+```
+
+The output is:
+
+```
+('start index: ', 0, ' last_index:', 9)
+('vocab_size:', 63642)
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 253
+epoch:0 acc:0.014
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 586
+epoch:1 acc:0.033
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 970
+epoch:2 acc:0.054
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 1364
+epoch:3 acc:0.077
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 1642
+epoch:4 acc:0.092
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 1936
+epoch:5 acc:0.109
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 2216
+epoch:6 acc:0.124
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 2419
+epoch:7 acc:0.136
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 2603
+epoch:8 acc:0.146
+quant_embedding config {'quantize_type': 'abs_max', 'params_name': 'emb', 'quantize_bits': 8, 'dtype': 'int8'}
+step:1 2719
+epoch:9 acc:0.153
+```
+
+The quantized model is saved in ``./output_quant``. The quantized parameter ``'emb.int8'`` is 3.9M, while the original ``'emb'`` parameter in ``./v1_cpu5_b100_lr1dir`` is 16M.
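+
+As a quick sanity check, the two parameter files can be compared directly. This sketch assumes the default output directories used above and that each parameter was saved as a separate file named after it (the default behavior of ``fluid.io.save_params``/``save_persistables``):
+
+```python
+import os
+
+fp32_size = os.path.getsize('v1_cpu5_b100_lr1dir/pass-9/emb')
+int8_size = os.path.getsize('output_quant/pass-9/emb.int8')
+print('emb: %.1f MB, emb.int8: %.1f MB' %
+      (fp32_size / 2.0**20, int8_size / 2.0**20))
+```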
diff --git a/demo/quant/quant_embedding/cluster_train.py b/demo/quant/quant_embedding/cluster_train.py
new file mode 100755
index 0000000000000000000000000000000000000000..9b5bc2f620958fe4387c70022e2dbaee9042f625
--- /dev/null
+++ b/demo/quant/quant_embedding/cluster_train.py
@@ -0,0 +1,250 @@
+from __future__ import print_function
+import argparse
+import logging
+import os
+import time
+import math
+import random
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+import six
+import reader
+from net import skip_gram_word2vec
+
+logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger("fluid")
+logger.setLevel(logging.INFO)
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(
+ description="PaddlePaddle Word2vec example")
+ parser.add_argument(
+ '--train_data_dir',
+ type=str,
+ default='./data/text',
+        help="The path of the training dataset")
+ parser.add_argument(
+ '--base_lr',
+ type=float,
+ default=0.01,
+        help="The learning rate (default: 0.01)")
+ parser.add_argument(
+ '--save_step',
+ type=int,
+ default=500000,
+        help="Save the model every save_step batches (default: 500000)")
+ parser.add_argument(
+ '--print_batch',
+ type=int,
+ default=100,
+        help="Print the training loss every print_batch batches (default: 100)")
+ parser.add_argument(
+ '--dict_path',
+ type=str,
+ default='./data/1-billion_dict',
+ help="The path of data dict")
+ parser.add_argument(
+ '--batch_size',
+ type=int,
+ default=500,
+ help="The size of mini-batch (default:500)")
+ parser.add_argument(
+ '--num_passes',
+ type=int,
+ default=10,
+ help="The number of passes to train (default: 10)")
+ parser.add_argument(
+ '--model_output_dir',
+ type=str,
+ default='models',
+ help='The path for model to store (default: models)')
+    parser.add_argument('--nce_num', type=int, default=5, help='number of negative samples')
+ parser.add_argument(
+ '--embedding_size',
+ type=int,
+ default=64,
+        help='embedding size (default: 64)')
+ parser.add_argument(
+ '--is_sparse',
+ action='store_true',
+ required=False,
+ default=False,
+        help='whether embedding and nce use sparse updates (default: False)')
+ parser.add_argument(
+ '--with_speed',
+ action='store_true',
+ required=False,
+ default=False,
+        help='whether to print training speed (default: False)')
+ parser.add_argument(
+ '--role', type=str, default='pserver', help='trainer or pserver')
+ parser.add_argument(
+ '--endpoints',
+ type=str,
+ default='127.0.0.1:6000',
+ help='The pserver endpoints, like: 127.0.0.1:6000, 127.0.0.1:6001')
+ parser.add_argument(
+ '--current_endpoint',
+ type=str,
+ default='127.0.0.1:6000',
+ help='The current_endpoint')
+ parser.add_argument(
+ '--trainer_id',
+ type=int,
+ default=0,
+        help='trainer id; only trainer_id=0 saves the model')
+ parser.add_argument(
+ '--trainers',
+ type=int,
+ default=1,
+        help='The number of trainers (default: 1)')
+ return parser.parse_args()
+
+
+def convert_python_to_tensor(weight, batch_size, sample_reader):
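+    """Batch samples from ``sample_reader`` into fluid Tensors and append a
+    negative-sample tensor: ``args.nce_num`` negative word ids are drawn from
+    the distribution given by ``weight`` and shared across the whole batch."""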
+ def __reader__():
+ cs = np.array(weight).cumsum()
+ result = [[], []]
+ for sample in sample_reader():
+ for i, fea in enumerate(sample):
+ result[i].append(fea)
+ if len(result[0]) == batch_size:
+ tensor_result = []
+ for tensor in result:
+ t = fluid.Tensor()
+ dat = np.array(tensor, dtype='int64')
+ if len(dat.shape) > 2:
+ dat = dat.reshape((dat.shape[0], dat.shape[2]))
+ elif len(dat.shape) == 1:
+ dat = dat.reshape((-1, 1))
+ t.set(dat, fluid.CPUPlace())
+ tensor_result.append(t)
+ tt = fluid.Tensor()
+ neg_array = cs.searchsorted(np.random.sample(args.nce_num))
+ neg_array = np.tile(neg_array, batch_size)
+ tt.set(
+ neg_array.reshape((batch_size, args.nce_num)),
+ fluid.CPUPlace())
+ tensor_result.append(tt)
+ yield tensor_result
+ result = [[], []]
+
+ return __reader__
+
+
+def train_loop(args, train_program, reader, py_reader, loss, trainer_id,
+ weight):
+
+ py_reader.decorate_tensor_provider(
+ convert_python_to_tensor(weight, args.batch_size, reader.train()))
+
+ place = fluid.CPUPlace()
+ exe = fluid.Executor(place)
+ exe.run(fluid.default_startup_program())
+
+ print("CPU_NUM:" + str(os.getenv("CPU_NUM")))
+
+ train_exe = exe
+
+ for pass_id in range(args.num_passes):
+ py_reader.start()
+ time.sleep(10)
+ epoch_start = time.time()
+ batch_id = 0
+ start = time.time()
+ try:
+ while True:
+
+ loss_val = train_exe.run(fetch_list=[loss.name])
+ loss_val = np.mean(loss_val)
+
+ if batch_id % args.print_batch == 0:
+ logger.info(
+ "TRAIN --> pass: {} batch: {} loss: {} reader queue:{}".
+ format(pass_id, batch_id,
+ loss_val.mean(), py_reader.queue.size()))
+ if args.with_speed:
+ if batch_id % 500 == 0 and batch_id != 0:
+ elapsed = (time.time() - start)
+ start = time.time()
+ samples = 1001 * args.batch_size * int(
+ os.getenv("CPU_NUM"))
+ logger.info("Time used: {}, Samples/Sec: {}".format(
+ elapsed, samples / elapsed))
+
+ if batch_id % args.save_step == 0 and batch_id != 0:
+ model_dir = args.model_output_dir + '/pass-' + str(
+ pass_id) + ('/batch-' + str(batch_id))
+ if trainer_id == 0:
+ fluid.io.save_params(executor=exe, dirname=model_dir)
+ print("model saved in %s" % model_dir)
+ batch_id += 1
+
+ except fluid.core.EOFException:
+ py_reader.reset()
+ epoch_end = time.time()
+ logger.info("Epoch: {0}, Train total expend: {1} ".format(
+ pass_id, epoch_end - epoch_start))
+ model_dir = args.model_output_dir + '/pass-' + str(pass_id)
+ if trainer_id == 0:
+ fluid.io.save_params(executor=exe, dirname=model_dir)
+ print("model saved in %s" % model_dir)
+
+
+def GetFileList(data_path):
+ return os.listdir(data_path)
+
+
+def train(args):
+
+ if not os.path.isdir(args.model_output_dir) and args.trainer_id == 0:
+ os.mkdir(args.model_output_dir)
+
+ filelist = GetFileList(args.train_data_dir)
+ word2vec_reader = reader.Word2VecReader(
+ args.dict_path, args.train_data_dir, filelist, 0, 1)
+
+ logger.info("dict_size: {}".format(word2vec_reader.dict_size))
+ np_power = np.power(np.array(word2vec_reader.id_frequencys), 0.75)
+ id_frequencys_pow = np_power / np_power.sum()
+
+ loss, py_reader = skip_gram_word2vec(
+ word2vec_reader.dict_size,
+ args.embedding_size,
+ is_sparse=args.is_sparse,
+ neg_num=args.nce_num)
+
+ optimizer = fluid.optimizer.SGD(
+ learning_rate=fluid.layers.exponential_decay(
+ learning_rate=args.base_lr,
+ decay_steps=100000,
+ decay_rate=0.999,
+ staircase=True))
+
+ optimizer.minimize(loss)
+
+ logger.info("run dist training")
+
+ t = fluid.DistributeTranspiler()
+ t.transpile(
+ args.trainer_id, pservers=args.endpoints, trainers=args.trainers)
+ if args.role == "pserver":
+        print("run pserver")
+ pserver_prog = t.get_pserver_program(args.current_endpoint)
+ pserver_startup = t.get_startup_program(args.current_endpoint,
+ pserver_prog)
+ exe = fluid.Executor(fluid.CPUPlace())
+ exe.run(pserver_startup)
+ exe.run(pserver_prog)
+ elif args.role == "trainer":
+ print("run trainer")
+ train_loop(args,
+ t.get_trainer_program(), word2vec_reader, py_reader, loss,
+ args.trainer_id, id_frequencys_pow)
+
+
+if __name__ == '__main__':
+ args = parse_args()
+ train(args)
diff --git a/demo/quant/quant_embedding/cluster_train.sh b/demo/quant/quant_embedding/cluster_train.sh
new file mode 100755
index 0000000000000000000000000000000000000000..756196fd41eeb52d5f43553664c824748ac83e4e
--- /dev/null
+++ b/demo/quant/quant_embedding/cluster_train.sh
@@ -0,0 +1,68 @@
+#!/bin/bash
+
+#export GLOG_v=30
+#export GLOG_logtostderr=1
+
+# start pserver0
+export CPU_NUM=5
+export FLAGS_rpc_deadline=3000000
+python cluster_train.py \
+ --train_data_dir data/convert_text8 \
+ --dict_path data/test_build_dict \
+ --batch_size 100 \
+ --model_output_dir dis_model \
+ --base_lr 1.0 \
+ --print_batch 1 \
+ --is_sparse \
+ --with_speed \
+ --role pserver \
+ --endpoints 127.0.0.1:6000,127.0.0.1:6001 \
+ --current_endpoint 127.0.0.1:6000 \
+ --trainers 2 \
+ > pserver0.log 2>&1 &
+
+python cluster_train.py \
+ --train_data_dir data/convert_text8 \
+ --dict_path data/test_build_dict \
+ --batch_size 100 \
+ --model_output_dir dis_model \
+ --base_lr 1.0 \
+ --print_batch 1 \
+ --is_sparse \
+ --with_speed \
+ --role pserver \
+ --endpoints 127.0.0.1:6000,127.0.0.1:6001 \
+ --current_endpoint 127.0.0.1:6001 \
+ --trainers 2 \
+ > pserver1.log 2>&1 &
+
+# start trainer0
+python cluster_train.py \
+ --train_data_dir data/convert_text8 \
+ --dict_path data/test_build_dict \
+ --batch_size 100 \
+ --model_output_dir dis_model \
+ --base_lr 1.0 \
+ --print_batch 1000 \
+ --is_sparse \
+ --with_speed \
+ --role trainer \
+ --endpoints 127.0.0.1:6000,127.0.0.1:6001 \
+ --trainers 2 \
+ --trainer_id 0 \
+ > trainer0.log 2>&1 &
+# start trainer1
+python cluster_train.py \
+ --train_data_dir data/convert_text8 \
+ --dict_path data/test_build_dict \
+ --batch_size 100 \
+ --model_output_dir dis_model \
+ --base_lr 1.0 \
+ --print_batch 1000 \
+ --is_sparse \
+ --with_speed \
+ --role trainer \
+ --endpoints 127.0.0.1:6000,127.0.0.1:6001 \
+ --trainers 2 \
+ --trainer_id 1 \
+ > trainer1.log 2>&1 &
diff --git a/demo/quant/quant_embedding/image/after.png b/demo/quant/quant_embedding/image/after.png
new file mode 100644
index 0000000000000000000000000000000000000000..50a3f172fb5b34c853f27f4406458e5bd6399f06
Binary files /dev/null and b/demo/quant/quant_embedding/image/after.png differ
diff --git a/demo/quant/quant_embedding/image/before.png b/demo/quant/quant_embedding/image/before.png
new file mode 100644
index 0000000000000000000000000000000000000000..5b77f0b05a5e3bb56a80aeaf29b6624798c2eb56
Binary files /dev/null and b/demo/quant/quant_embedding/image/before.png differ
diff --git a/demo/quant/quant_embedding/infer.py b/demo/quant/quant_embedding/infer.py
new file mode 100755
index 0000000000000000000000000000000000000000..40ae2ee8c639754d24a5474c5e58d7e062a1d4d0
--- /dev/null
+++ b/demo/quant/quant_embedding/infer.py
@@ -0,0 +1,231 @@
+import argparse
+import sys
+import time
+import math
+import unittest
+import contextlib
+import numpy as np
+import six
+import paddle.fluid as fluid
+import paddle
+import net
+import utils
+sys.path.append(sys.path[0] + "/../../../")
+from paddleslim.quant import quant_embedding
+
+
+def parse_args():
+ parser = argparse.ArgumentParser("PaddlePaddle Word2vec infer example")
+ parser.add_argument(
+ '--dict_path',
+ type=str,
+ default='./data/data_c/1-billion_dict_word_to_id_',
+        help="The path of the dict")
+ parser.add_argument(
+ '--infer_epoch',
+ action='store_true',
+ required=False,
+ default=False,
+ help='infer by epoch')
+ parser.add_argument(
+ '--infer_step',
+ action='store_true',
+ required=False,
+ default=False,
+ help='infer by step')
+ parser.add_argument(
+ '--test_dir', type=str, default='test_data', help='test file address')
+ parser.add_argument(
+ '--print_step', type=int, default='500000', help='print step')
+ parser.add_argument(
+ '--start_index', type=int, default='0', help='start index')
+ parser.add_argument(
+        '--start_batch', type=int, default='1', help='start batch')
+ parser.add_argument(
+        '--end_batch', type=int, default='13', help='end batch')
+ parser.add_argument(
+ '--last_index', type=int, default='100', help='last index')
+ parser.add_argument(
+ '--model_dir', type=str, default='model', help='model dir')
+ parser.add_argument(
+ '--use_cuda', type=int, default='0', help='whether use cuda')
+ parser.add_argument(
+ '--batch_size', type=int, default='5', help='batch_size')
+ parser.add_argument(
+        '--emb_size', type=int, default='64', help='embedding size')
+ parser.add_argument(
+ '--emb_quant',
+ type=bool,
+ default=False,
+ help='whether to quant embedding parameter')
+ args = parser.parse_args()
+ return args
+
+
+def infer_epoch(args, vocab_size, test_reader, use_cuda, i2w):
+ """ inference function """
+ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+ exe = fluid.Executor(place)
+ emb_size = args.emb_size
+ batch_size = args.batch_size
+ with fluid.scope_guard(fluid.Scope()):
+ main_program = fluid.Program()
+ with fluid.program_guard(main_program):
+ values, pred = net.infer_network(vocab_size, emb_size)
+ for epoch in range(start_index, last_index + 1):
+ copy_program = main_program.clone()
+ model_path = model_dir + "/pass-" + str(epoch)
+ fluid.io.load_params(
+ executor=exe,
+ dirname=model_path,
+ main_program=copy_program)
+ if args.emb_quant:
+ config = {'params_name': 'emb', 'quantize_type': 'abs_max'}
+ copy_program = quant_embedding(copy_program, place, config)
+ fluid.io.save_persistables(
+ exe,
+ './output_quant/pass-' + str(epoch),
+ main_program=copy_program)
+
+ accum_num = 0
+ accum_num_sum = 0.0
+ t0 = time.time()
+ step_id = 0
+ for data in test_reader():
+ step_id += 1
+ b_size = len([dat[0] for dat in data])
+ wa = np.array(
+ [dat[0] for dat in data]).astype("int64").reshape(
+ b_size, 1)
+ wb = np.array(
+ [dat[1] for dat in data]).astype("int64").reshape(
+ b_size, 1)
+ wc = np.array(
+ [dat[2] for dat in data]).astype("int64").reshape(
+ b_size, 1)
+
+ label = [dat[3] for dat in data]
+ input_word = [dat[4] for dat in data]
+ para = exe.run(copy_program,
+ feed={
+ "analogy_a": wa,
+ "analogy_b": wb,
+ "analogy_c": wc,
+ "all_label":
+ np.arange(vocab_size).reshape(
+ vocab_size, 1).astype("int64"),
+ },
+ fetch_list=[pred.name, values],
+ return_numpy=False)
+ pre = np.array(para[0])
+ val = np.array(para[1])
+ for ii in range(len(label)):
+ top4 = pre[ii]
+ accum_num_sum += 1
+ for idx in top4:
+ if int(idx) in input_word[ii]:
+ continue
+ if int(idx) == int(label[ii][0]):
+ accum_num += 1
+ break
+ if step_id % 1 == 0:
+ print("step:%d %d " % (step_id, accum_num))
+
+ print("epoch:%d \t acc:%.3f " %
+ (epoch, 1.0 * accum_num / accum_num_sum))
+
+
+def infer_step(args, vocab_size, test_reader, use_cuda, i2w):
+ """ inference function """
+ place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
+ exe = fluid.Executor(place)
+ emb_size = args.emb_size
+ batch_size = args.batch_size
+ with fluid.scope_guard(fluid.Scope()):
+ main_program = fluid.Program()
+ with fluid.program_guard(main_program):
+ values, pred = net.infer_network(vocab_size, emb_size)
+ for epoch in range(start_index, last_index + 1):
+ for batchid in range(args.start_batch, args.end_batch):
+ copy_program = main_program.clone()
+ model_path = model_dir + "/pass-" + str(epoch) + (
+ '/batch-' + str(batchid * args.print_step))
+ fluid.io.load_params(
+ executor=exe,
+ dirname=model_path,
+ main_program=copy_program)
+ accum_num = 0
+ accum_num_sum = 0.0
+ t0 = time.time()
+ step_id = 0
+ for data in test_reader():
+ step_id += 1
+ b_size = len([dat[0] for dat in data])
+ wa = np.array(
+ [dat[0] for dat in data]).astype("int64").reshape(
+ b_size, 1)
+ wb = np.array(
+ [dat[1] for dat in data]).astype("int64").reshape(
+ b_size, 1)
+ wc = np.array(
+ [dat[2] for dat in data]).astype("int64").reshape(
+ b_size, 1)
+
+ label = [dat[3] for dat in data]
+ input_word = [dat[4] for dat in data]
+ para = exe.run(
+ copy_program,
+ feed={
+ "analogy_a": wa,
+ "analogy_b": wb,
+ "analogy_c": wc,
+ "all_label":
+ np.arange(vocab_size).reshape(vocab_size, 1),
+ },
+ fetch_list=[pred.name, values],
+ return_numpy=False)
+ pre = np.array(para[0])
+ val = np.array(para[1])
+ for ii in range(len(label)):
+ top4 = pre[ii]
+ accum_num_sum += 1
+ for idx in top4:
+ if int(idx) in input_word[ii]:
+ continue
+ if int(idx) == int(label[ii][0]):
+ accum_num += 1
+ break
+ if step_id % 1 == 0:
+ print("step:%d %d " % (step_id, accum_num))
+ print("epoch:%d \t acc:%.3f " %
+ (epoch, 1.0 * accum_num / accum_num_sum))
+ t1 = time.time()
+
+
+if __name__ == "__main__":
+ args = parse_args()
+ start_index = args.start_index
+ last_index = args.last_index
+ test_dir = args.test_dir
+ model_dir = args.model_dir
+ batch_size = args.batch_size
+ dict_path = args.dict_path
+ use_cuda = True if args.use_cuda else False
+ print("start index: ", start_index, " last_index:", last_index)
+ vocab_size, test_reader, id2word = utils.prepare_data(
+ test_dir, dict_path, batch_size=batch_size)
+ print("vocab_size:", vocab_size)
+ if args.infer_step:
+ infer_step(
+ args,
+ vocab_size,
+ test_reader=test_reader,
+ use_cuda=use_cuda,
+ i2w=id2word)
+ else:
+ infer_epoch(
+ args,
+ vocab_size,
+ test_reader=test_reader,
+ use_cuda=use_cuda,
+ i2w=id2word)
diff --git a/demo/quant/quant_embedding/net.py b/demo/quant/quant_embedding/net.py
new file mode 100755
index 0000000000000000000000000000000000000000..ab2abbc76bde8e03c9a6e1e0abb062aa467d2c91
--- /dev/null
+++ b/demo/quant/quant_embedding/net.py
@@ -0,0 +1,136 @@
+# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+neural network for word2vec
+"""
+from __future__ import print_function
+import math
+import numpy as np
+import paddle.fluid as fluid
+
+
+def skip_gram_word2vec(dict_size, embedding_size, is_sparse=False, neg_num=5):
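+    """Build the skip-gram word2vec network trained with negative sampling.
+
+    Returns the average loss (sigmoid cross entropy over one positive and
+    ``neg_num`` negative samples per target word) and the py_reader that
+    feeds (input_word, true_word, neg_word) batches.
+    """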
+
+ datas = []
+ input_word = fluid.layers.data(name="input_word", shape=[1], dtype='int64')
+ true_word = fluid.layers.data(name='true_label', shape=[1], dtype='int64')
+ neg_word = fluid.layers.data(
+ name="neg_label", shape=[neg_num], dtype='int64')
+
+ datas.append(input_word)
+ datas.append(true_word)
+ datas.append(neg_word)
+
+ py_reader = fluid.layers.create_py_reader_by_data(
+ capacity=64, feed_list=datas, name='py_reader', use_double_buffer=True)
+
+ words = fluid.layers.read_file(py_reader)
+ init_width = 0.5 / embedding_size
+ input_emb = fluid.layers.embedding(
+ input=words[0],
+ is_sparse=is_sparse,
+ size=[dict_size, embedding_size],
+ param_attr=fluid.ParamAttr(
+ name='emb',
+ initializer=fluid.initializer.Uniform(-init_width, init_width)))
+
+ true_emb_w = fluid.layers.embedding(
+ input=words[1],
+ is_sparse=is_sparse,
+ size=[dict_size, embedding_size],
+ param_attr=fluid.ParamAttr(
+ name='emb_w', initializer=fluid.initializer.Constant(value=0.0)))
+
+ true_emb_b = fluid.layers.embedding(
+ input=words[1],
+ is_sparse=is_sparse,
+ size=[dict_size, 1],
+ param_attr=fluid.ParamAttr(
+ name='emb_b', initializer=fluid.initializer.Constant(value=0.0)))
+ neg_word_reshape = fluid.layers.reshape(words[2], shape=[-1, 1])
+ neg_word_reshape.stop_gradient = True
+
+ neg_emb_w = fluid.layers.embedding(
+ input=neg_word_reshape,
+ is_sparse=is_sparse,
+ size=[dict_size, embedding_size],
+ param_attr=fluid.ParamAttr(
+ name='emb_w', learning_rate=1.0))
+
+ neg_emb_w_re = fluid.layers.reshape(
+ neg_emb_w, shape=[-1, neg_num, embedding_size])
+ neg_emb_b = fluid.layers.embedding(
+ input=neg_word_reshape,
+ is_sparse=is_sparse,
+ size=[dict_size, 1],
+ param_attr=fluid.ParamAttr(
+ name='emb_b', learning_rate=1.0))
+
+ neg_emb_b_vec = fluid.layers.reshape(neg_emb_b, shape=[-1, neg_num])
+ true_logits = fluid.layers.elementwise_add(
+ fluid.layers.reduce_sum(
+ fluid.layers.elementwise_mul(input_emb, true_emb_w),
+ dim=1,
+ keep_dim=True),
+ true_emb_b)
+ input_emb_re = fluid.layers.reshape(
+ input_emb, shape=[-1, 1, embedding_size])
+ neg_matmul = fluid.layers.matmul(
+ input_emb_re, neg_emb_w_re, transpose_y=True)
+ neg_matmul_re = fluid.layers.reshape(neg_matmul, shape=[-1, neg_num])
+ neg_logits = fluid.layers.elementwise_add(neg_matmul_re, neg_emb_b_vec)
+ #nce loss
+
+ label_ones = fluid.layers.fill_constant_batch_size_like(
+ true_logits, shape=[-1, 1], value=1.0, dtype='float32')
+ label_zeros = fluid.layers.fill_constant_batch_size_like(
+ true_logits, shape=[-1, neg_num], value=0.0, dtype='float32')
+
+ true_xent = fluid.layers.sigmoid_cross_entropy_with_logits(true_logits,
+ label_ones)
+ neg_xent = fluid.layers.sigmoid_cross_entropy_with_logits(neg_logits,
+ label_zeros)
+ cost = fluid.layers.elementwise_add(
+ fluid.layers.reduce_sum(
+ true_xent, dim=1),
+ fluid.layers.reduce_sum(
+ neg_xent, dim=1))
+ avg_cost = fluid.layers.reduce_mean(cost)
+ return avg_cost, py_reader
+
+
+def infer_network(vocab_size, emb_size):
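+    """Analogy inference network: for a query (a, b, c) it scores every word
+    by the dot product of emb(b) - emb(a) + emb(c) with the l2-normalized
+    embedding table and returns the top-4 scores and word indices."""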
+ analogy_a = fluid.layers.data(name="analogy_a", shape=[1], dtype='int64')
+ analogy_b = fluid.layers.data(name="analogy_b", shape=[1], dtype='int64')
+ analogy_c = fluid.layers.data(name="analogy_c", shape=[1], dtype='int64')
+ all_label = fluid.layers.data(
+ name="all_label",
+ shape=[vocab_size, 1],
+ dtype='int64',
+ append_batch_size=False)
+ emb_all_label = fluid.layers.embedding(
+ input=all_label, size=[vocab_size, emb_size], param_attr="emb")
+
+ emb_a = fluid.layers.embedding(
+ input=analogy_a, size=[vocab_size, emb_size], param_attr="emb")
+ emb_b = fluid.layers.embedding(
+ input=analogy_b, size=[vocab_size, emb_size], param_attr="emb")
+ emb_c = fluid.layers.embedding(
+ input=analogy_c, size=[vocab_size, emb_size], param_attr="emb")
+ target = fluid.layers.elementwise_add(
+ fluid.layers.elementwise_sub(emb_b, emb_a), emb_c)
+ emb_all_label_l2 = fluid.layers.l2_normalize(x=emb_all_label, axis=1)
+ dist = fluid.layers.matmul(x=target, y=emb_all_label_l2, transpose_y=True)
+ values, pred_idx = fluid.layers.topk(input=dist, k=4)
+ return values, pred_idx
diff --git a/demo/quant/quant_embedding/preprocess.py b/demo/quant/quant_embedding/preprocess.py
new file mode 100755
index 0000000000000000000000000000000000000000..db1e0994f9cb0ec5945cf018dea3c5f49309d47e
--- /dev/null
+++ b/demo/quant/quant_embedding/preprocess.py
@@ -0,0 +1,195 @@
+# -*- coding: utf-8 -*-
+import os
+import random
+import re
+import six
+import argparse
+import io
+import math
+prog = re.compile("[^a-z ]", flags=0)
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(
+        description="Paddle Fluid word2vec preprocess")
+ parser.add_argument(
+ '--build_dict_corpus_dir', type=str, help="The dir of corpus")
+ parser.add_argument(
+ '--input_corpus_dir', type=str, help="The dir of input corpus")
+ parser.add_argument(
+ '--output_corpus_dir', type=str, help="The dir of output corpus")
+ parser.add_argument(
+ '--dict_path',
+ type=str,
+ default='./dict',
+ help="The path of dictionary ")
+ parser.add_argument(
+ '--min_count',
+ type=int,
+ default=5,
+        help="If the word count is less than min_count, it will be removed from dict"
+ )
+ parser.add_argument(
+ '--downsample',
+ type=float,
+ default=0.001,
+        help="subsampling threshold used to filter frequent words")
+ parser.add_argument(
+ '--filter_corpus',
+ action='store_true',
+ default=False,
+ help='Filter corpus')
+ parser.add_argument(
+ '--build_dict',
+ action='store_true',
+ default=False,
+ help='Build dict from corpus')
+ return parser.parse_args()
+
+
+def text_strip(text):
+ #English Preprocess Rule
+ return prog.sub("", text.lower())
+
+
+# Shameless copy from Tensorflow https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py
+# Unicode utility functions that work with Python 2 and 3
+def native_to_unicode(s):
+ if _is_unicode(s):
+ return s
+ try:
+ return _to_unicode(s)
+ except UnicodeDecodeError:
+ res = _to_unicode(s, ignore_errors=True)
+ return res
+
+
+def _is_unicode(s):
+ if six.PY2:
+ if isinstance(s, unicode):
+ return True
+ else:
+ if isinstance(s, str):
+ return True
+ return False
+
+
+def _to_unicode(s, ignore_errors=False):
+ if _is_unicode(s):
+ return s
+ error_mode = "ignore" if ignore_errors else "strict"
+ return s.decode("utf-8", errors=error_mode)
+
+
+def filter_corpus(args):
+ """
+ filter corpus and convert id.
+ """
+ word_count = dict()
+ word_to_id_ = dict()
+ word_all_count = 0
+ id_counts = []
+ word_id = 0
+ #read dict
+ with io.open(args.dict_path, 'r', encoding='utf-8') as f:
+ for line in f:
+ word, count = line.split()[0], int(line.split()[1])
+ word_count[word] = count
+ word_to_id_[word] = word_id
+ word_id += 1
+ id_counts.append(count)
+ word_all_count += count
+
+ #write word2id file
+ print("write word2id file to : " + args.dict_path + "_word_to_id_")
+ with io.open(
+ args.dict_path + "_word_to_id_", 'w+', encoding='utf-8') as fid:
+ for k, v in word_to_id_.items():
+ fid.write(k + " " + str(v) + '\n')
+ #filter corpus and convert id
+ if not os.path.exists(args.output_corpus_dir):
+ os.makedirs(args.output_corpus_dir)
+ for file in os.listdir(args.input_corpus_dir):
+ with io.open(args.output_corpus_dir + '/convert_' + file, "w") as wf:
+ with io.open(
+ args.input_corpus_dir + '/' + file,
+ encoding='utf-8') as rf:
+ print(args.input_corpus_dir + '/' + file)
+ for line in rf:
+ signal = False
+ line = text_strip(line)
+ words = line.split()
+ for item in words:
+ if item in word_count:
+ idx = word_to_id_[item]
+ else:
+                            idx = word_to_id_[native_to_unicode('<UNK>')]
+ count_w = id_counts[idx]
+ corpus_size = word_all_count
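+                        # word2vec-style subsampling: a word with relative
+                        # frequency f = count_w / corpus_size is kept with
+                        # probability (sqrt(f / t) + 1) * t / f, t = downsample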
+ keep_prob = (
+ math.sqrt(count_w /
+ (args.downsample * corpus_size)) + 1
+ ) * (args.downsample * corpus_size) / count_w
+ r_value = random.random()
+ if r_value > keep_prob:
+ continue
+ wf.write(_to_unicode(str(idx) + " "))
+ signal = True
+ if signal:
+ wf.write(_to_unicode("\n"))
+
+
+def build_dict(args):
+ """
+    Preprocess the data, generate the dictionary and save it into dict_path.
+ :param corpus_dir: the input data dir.
+ :param dict_path: the generated dict path. the data in dict is "word count"
+ :param min_count:
+ :return:
+ """
+ # word to count
+
+ word_count = dict()
+
+ for file in os.listdir(args.build_dict_corpus_dir):
+ with io.open(
+ args.build_dict_corpus_dir + "/" + file,
+ encoding='utf-8') as f:
+ print("build dict : ", args.build_dict_corpus_dir + "/" + file)
+ for line in f:
+ line = text_strip(line)
+ words = line.split()
+ for item in words:
+ if item in word_count:
+ word_count[item] = word_count[item] + 1
+ else:
+ word_count[item] = 1
+
+ item_to_remove = []
+ for item in word_count:
+ if word_count[item] <= args.min_count:
+ item_to_remove.append(item)
+
+ unk_sum = 0
+ for item in item_to_remove:
+ unk_sum += word_count[item]
+ del word_count[item]
+ #sort by count
+    word_count[native_to_unicode('<UNK>')] = unk_sum
+ word_count = sorted(
+ word_count.items(), key=lambda word_count: -word_count[1])
+
+ with io.open(args.dict_path, 'w+', encoding='utf-8') as f:
+ for k, v in word_count:
+ f.write(k + " " + str(v) + '\n')
+
+
+if __name__ == "__main__":
+ args = parse_args()
+ if args.build_dict:
+ build_dict(args)
+ elif args.filter_corpus:
+ filter_corpus(args)
+ else:
+ print(
+ "error command line, please choose --build_dict or --filter_corpus")
diff --git a/demo/quant/quant_embedding/reader.py b/demo/quant/quant_embedding/reader.py
new file mode 100755
index 0000000000000000000000000000000000000000..7400e493fc78db389506f938ea510f461fdc2154
--- /dev/null
+++ b/demo/quant/quant_embedding/reader.py
@@ -0,0 +1,106 @@
+# -*- coding: utf-8 -*-
+
+import numpy as np
+import preprocess
+import logging
+import math
+import random
+import io
+
+logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger("fluid")
+logger.setLevel(logging.INFO)
+
+
+class NumpyRandomInt(object):
+ def __init__(self, a, b, buf_size=1000):
+ self.idx = 0
+ self.buffer = np.random.random_integers(a, b, buf_size)
+ self.a = a
+ self.b = b
+
+ def __call__(self):
+ if self.idx == len(self.buffer):
+ self.buffer = np.random.random_integers(self.a, self.b,
+ len(self.buffer))
+ self.idx = 0
+
+ result = self.buffer[self.idx]
+ self.idx += 1
+ return result
+
+
+class Word2VecReader(object):
+ def __init__(self,
+ dict_path,
+ data_path,
+ filelist,
+ trainer_id,
+ trainer_num,
+ window_size=5):
+ self.window_size_ = window_size
+ self.data_path_ = data_path
+ self.filelist = filelist
+ self.trainer_id = trainer_id
+ self.trainer_num = trainer_num
+
+ word_all_count = 0
+ id_counts = []
+ word_id = 0
+
+ with io.open(dict_path, 'r', encoding='utf-8') as f:
+ for line in f:
+ word, count = line.split()[0], int(line.split()[1])
+ word_id += 1
+ id_counts.append(count)
+ word_all_count += count
+
+ self.word_all_count = word_all_count
+ self.corpus_size_ = word_all_count
+ self.dict_size = len(id_counts)
+ self.id_counts_ = id_counts
+
+ print("corpus_size:", self.corpus_size_)
+ self.id_frequencys = [
+ float(count) / word_all_count for count in self.id_counts_
+ ]
+ print("dict_size = " + str(self.dict_size) + " word_all_count = " +
+ str(word_all_count))
+
+ self.random_generator = NumpyRandomInt(1, self.window_size_ + 1)
+
+ def get_context_words(self, words, idx):
+ """
+ Get the context word list of target word.
+ words: the words of the current line
+ idx: input word index
+ window_size: window size
+ """
+ target_window = self.random_generator()
+ start_point = idx - target_window # if (idx - target_window) > 0 else 0
+ if start_point < 0:
+ start_point = 0
+ end_point = idx + target_window
+ targets = words[start_point:idx] + words[idx + 1:end_point + 1]
+ return targets
+
+ def train(self):
+ def nce_reader():
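+            # Each trainer only consumes the lines whose index modulo
+            # trainer_num equals its trainer_id; for every target word, each
+            # word in its sampled context window yields one training pair.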
+ for file in self.filelist:
+ with io.open(
+ self.data_path_ + "/" + file, 'r',
+ encoding='utf-8') as f:
+ logger.info("running data in {}".format(self.data_path_ +
+ "/" + file))
+ count = 1
+ for line in f:
+ if self.trainer_id == count % self.trainer_num:
+ word_ids = [int(w) for w in line.split()]
+ for idx, target_id in enumerate(word_ids):
+ context_word_ids = self.get_context_words(
+ word_ids, idx)
+ for context_id in context_word_ids:
+ yield [target_id], [context_id]
+ count += 1
+
+ return nce_reader
diff --git a/demo/quant/quant_embedding/train.py b/demo/quant/quant_embedding/train.py
new file mode 100755
index 0000000000000000000000000000000000000000..72af09744564f0fcc83d1f43a8bd1a04d8a8fa38
--- /dev/null
+++ b/demo/quant/quant_embedding/train.py
@@ -0,0 +1,228 @@
+from __future__ import print_function
+import argparse
+import logging
+import os
+import time
+import math
+import random
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+import six
+import reader
+from net import skip_gram_word2vec
+
+logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
+logger = logging.getLogger("fluid")
+logger.setLevel(logging.INFO)
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(
+ description="PaddlePaddle Word2vec example")
+ parser.add_argument(
+ '--train_data_dir',
+ type=str,
+ default='./data/text',
+        help="The path of the training dataset")
+ parser.add_argument(
+ '--base_lr',
+ type=float,
+ default=0.01,
+        help="The learning rate (default: 0.01)")
+ parser.add_argument(
+ '--save_step',
+ type=int,
+ default=500000,
+        help="Save the model every save_step batches (default: 500000)")
+ parser.add_argument(
+ '--print_batch',
+ type=int,
+ default=10,
+        help="Print the training loss every print_batch batches (default: 10)")
+ parser.add_argument(
+ '--dict_path',
+ type=str,
+ default='./data/1-billion_dict',
+ help="The path of data dict")
+ parser.add_argument(
+ '--batch_size',
+ type=int,
+ default=500,
+ help="The size of mini-batch (default:500)")
+ parser.add_argument(
+ '--num_passes',
+ type=int,
+ default=10,
+ help="The number of passes to train (default: 10)")
+ parser.add_argument(
+ '--model_output_dir',
+ type=str,
+ default='models',
+ help='The path for model to store (default: models)')
+    parser.add_argument('--nce_num', type=int, default=5, help='number of negative samples')
+ parser.add_argument(
+ '--embedding_size',
+ type=int,
+ default=64,
+        help='embedding size (default: 64)')
+ parser.add_argument(
+ '--is_sparse',
+ action='store_true',
+ required=False,
+ default=False,
+        help='whether embedding and nce use sparse updates (default: False)')
+ parser.add_argument(
+ '--with_speed',
+ action='store_true',
+ required=False,
+ default=False,
+        help='whether to print training speed (default: False)')
+ return parser.parse_args()
+
+
+def convert_python_to_tensor(weight, batch_size, sample_reader):
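+    """Batch samples from ``sample_reader`` into fluid Tensors and append a
+    negative-sample tensor: ``args.nce_num`` negative word ids are drawn from
+    the distribution given by ``weight`` and shared across the whole batch."""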
+ def __reader__():
+ cs = np.array(weight).cumsum()
+ result = [[], []]
+ for sample in sample_reader():
+ for i, fea in enumerate(sample):
+ result[i].append(fea)
+ if len(result[0]) == batch_size:
+ tensor_result = []
+ for tensor in result:
+ t = fluid.Tensor()
+ dat = np.array(tensor, dtype='int64')
+ if len(dat.shape) > 2:
+ dat = dat.reshape((dat.shape[0], dat.shape[2]))
+ elif len(dat.shape) == 1:
+ dat = dat.reshape((-1, 1))
+ t.set(dat, fluid.CPUPlace())
+ tensor_result.append(t)
+ tt = fluid.Tensor()
+ neg_array = cs.searchsorted(np.random.sample(args.nce_num))
+ neg_array = np.tile(neg_array, batch_size)
+ tt.set(
+ neg_array.reshape((batch_size, args.nce_num)),
+ fluid.CPUPlace())
+ tensor_result.append(tt)
+ yield tensor_result
+ result = [[], []]
+
+ return __reader__
+
+
+def train_loop(args, train_program, reader, py_reader, loss, trainer_id,
+ weight):
+
+ py_reader.decorate_tensor_provider(
+ convert_python_to_tensor(weight, args.batch_size, reader.train()))
+
+ place = fluid.CPUPlace()
+ exe = fluid.Executor(place)
+ exe.run(fluid.default_startup_program())
+
+ exec_strategy = fluid.ExecutionStrategy()
+ exec_strategy.use_experimental_executor = True
+
+ print("CPU_NUM:" + str(os.getenv("CPU_NUM")))
+ exec_strategy.num_threads = int(os.getenv("CPU_NUM"))
+
+ build_strategy = fluid.BuildStrategy()
+ if int(os.getenv("CPU_NUM")) > 1:
+ build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce
+
+ train_exe = fluid.ParallelExecutor(
+ use_cuda=False,
+ loss_name=loss.name,
+ main_program=train_program,
+ build_strategy=build_strategy,
+ exec_strategy=exec_strategy)
+
+ for pass_id in range(args.num_passes):
+ py_reader.start()
+ time.sleep(10)
+ epoch_start = time.time()
+ batch_id = 0
+ start = time.time()
+ try:
+ while True:
+
+ loss_val = train_exe.run(fetch_list=[loss.name])
+ loss_val = np.mean(loss_val)
+
+ if batch_id % args.print_batch == 0:
+ logger.info(
+ "TRAIN --> pass: {} batch: {} loss: {} reader queue:{}".
+ format(pass_id, batch_id,
+ loss_val.mean(), py_reader.queue.size()))
+ if args.with_speed:
+ if batch_id % 500 == 0 and batch_id != 0:
+ elapsed = (time.time() - start)
+ start = time.time()
+ samples = 1001 * args.batch_size * int(
+ os.getenv("CPU_NUM"))
+ logger.info("Time used: {}, Samples/Sec: {}".format(
+ elapsed, samples / elapsed))
+
+ if batch_id % args.save_step == 0 and batch_id != 0:
+ model_dir = args.model_output_dir + '/pass-' + str(
+ pass_id) + ('/batch-' + str(batch_id))
+ if trainer_id == 0:
+ fluid.io.save_params(executor=exe, dirname=model_dir)
+ print("model saved in %s" % model_dir)
+ batch_id += 1
+
+ except fluid.core.EOFException:
+ py_reader.reset()
+ epoch_end = time.time()
+ logger.info("Epoch: {0}, Train total expend: {1} ".format(
+ pass_id, epoch_end - epoch_start))
+ model_dir = args.model_output_dir + '/pass-' + str(pass_id)
+ if trainer_id == 0:
+ fluid.io.save_params(executor=exe, dirname=model_dir)
+ print("model saved in %s" % model_dir)
+
+
+def GetFileList(data_path):
+ return os.listdir(data_path)
+
+
+def train(args):
+
+ if not os.path.isdir(args.model_output_dir):
+ os.mkdir(args.model_output_dir)
+
+ filelist = GetFileList(args.train_data_dir)
+ word2vec_reader = reader.Word2VecReader(
+ args.dict_path, args.train_data_dir, filelist, 0, 1)
+
+ logger.info("dict_size: {}".format(word2vec_reader.dict_size))
+ np_power = np.power(np.array(word2vec_reader.id_frequencys), 0.75)
+ id_frequencys_pow = np_power / np_power.sum()
+
+ loss, py_reader = skip_gram_word2vec(
+ word2vec_reader.dict_size,
+ args.embedding_size,
+ is_sparse=args.is_sparse,
+ neg_num=args.nce_num)
+
+ optimizer = fluid.optimizer.SGD(
+ learning_rate=fluid.layers.exponential_decay(
+ learning_rate=args.base_lr,
+ decay_steps=100000,
+ decay_rate=0.999,
+ staircase=True))
+
+ optimizer.minimize(loss)
+
+ # do local training
+ logger.info("run local training")
+ main_program = fluid.default_main_program()
+ train_loop(args, main_program, word2vec_reader, py_reader, loss, 0,
+ id_frequencys_pow)
+
+
+if __name__ == '__main__':
+ args = parse_args()
+ train(args)
diff --git a/demo/quant/quant_embedding/utils.py b/demo/quant/quant_embedding/utils.py
new file mode 100755
index 0000000000000000000000000000000000000000..01cd04e493b09e880303d7b0c87f5ed71cf86357
--- /dev/null
+++ b/demo/quant/quant_embedding/utils.py
@@ -0,0 +1,96 @@
+import sys
+import collections
+import six
+import time
+import numpy as np
+import paddle.fluid as fluid
+import paddle
+import os
+import preprocess
+
+
+def BuildWord_IdMap(dict_path):
+ word_to_id = dict()
+ id_to_word = dict()
+ with open(dict_path, 'r') as f:
+ for line in f:
+ word_to_id[line.split(' ')[0]] = int(line.split(' ')[1])
+ id_to_word[int(line.split(' ')[1])] = line.split(' ')[0]
+ return word_to_id, id_to_word
+
+
+def prepare_data(file_dir, dict_path, batch_size):
+ w2i, i2w = BuildWord_IdMap(dict_path)
+ vocab_size = len(i2w)
+ reader = paddle.batch(test(file_dir, w2i), batch_size)
+ return vocab_size, reader, i2w
+
+
+def native_to_unicode(s):
+ if _is_unicode(s):
+ return s
+ try:
+ return _to_unicode(s)
+ except UnicodeDecodeError:
+ res = _to_unicode(s, ignore_errors=True)
+ return res
+
+
+def _is_unicode(s):
+ if six.PY2:
+ if isinstance(s, unicode):
+ return True
+ else:
+ if isinstance(s, str):
+ return True
+ return False
+
+
+def _to_unicode(s, ignore_errors=False):
+ if _is_unicode(s):
+ return s
+ error_mode = "ignore" if ignore_errors else "strict"
+ return s.decode("utf-8", errors=error_mode)
+
+
+def strip_lines(line, vocab):
+ return _replace_oov(vocab, native_to_unicode(line))
+
+
+def _replace_oov(original_vocab, line):
+    """Replace out-of-vocab words with "<UNK>".
+ This maintains compatibility with published results.
+ Args:
+ original_vocab: a set of strings (The standard vocabulary for the dataset)
+ line: a unicode string - a space-delimited sequence of words.
+ Returns:
+ a unicode string - a space-delimited sequence of words.
+ """
+ return u" ".join([
+        word if word in original_vocab else u"<UNK>" for word in line.split()
+ ])
+
+
+def reader_creator(file_dir, word_to_id):
+ def reader():
+ files = os.listdir(file_dir)
+ for fi in files:
+ with open(file_dir + '/' + fi, "r") as f:
+ for line in f:
+ if ':' in line:
+ pass
+ else:
+ line = strip_lines(line.lower(), word_to_id)
+ line = line.split()
+ yield [word_to_id[line[0]]], [word_to_id[line[1]]], [
+ word_to_id[line[2]]
+ ], [word_to_id[line[3]]], [
+ word_to_id[line[0]], word_to_id[line[1]],
+ word_to_id[line[2]]
+ ]
+
+ return reader
+
+
+def test(test_dir, w2i):
+ return reader_creator(test_dir, w2i)