Commit f0016137 authored by jhjiangcs

add doc and metrics-AUC for mpc-FM demo.

Parent d5900853
## Instructions for PaddleFL-MPC Factorization Machine (FM) Demo
([简体中文](./README_CN.md)|English)
This document introduces how to run the FM demo based on Paddle-MPC. It can be run in two ways: on a single machine or on multiple machines.
### 1. Running on a Single Machine
#### (1). Prepare Data
Download the Criteo dataset using the script `./data/download.sh`, then split the dataset into training data and testing data (sample data: `./data/sample_data/train/sample_train.txt`). Generate encrypted training and testing data using `generate_encrypted_data()` in the `process_data.py` script. Users can run the script with the command `python process_data.py` to generate encrypted features and labels in a given directory, e.g., `./mpc_data/`. Different suffixes are used for these files to indicate the ownership of different computation parties. For instance, a file named `criteo_feature_idx.part0` is a feature file of party 0.
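As a minimal illustration of this step, the sketch below shows how the encryption helper could be invoked from Python. It assumes `generate_encrypted_data()` takes no arguments and writes its output under `./mpc_data/`, which is what running `python process_data.py` does in this demo; adjust to the actual script if its signature differs.

```python
# Minimal sketch of the data-preparation step; the no-argument signature of
# generate_encrypted_data() and the output directory are assumptions based on
# running `python process_data.py` as described above.
import process_data

# Writes encrypted share files such as criteo_feature_idx.part0,
# criteo_feature_value.part0 and criteo_label.part0 into ./mpc_data/.
process_data.generate_encrypted_data()
```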
#### (2). Launch Train Demo with a Shell Script
Set the following environment variables:
```
export PYTHON=/your/python
export PATH_TO_REDIS_BIN=/path/to/redis_bin
export LOCALHOST=/your/localhost
export REDIS_PORT=/your/redis/port
```
Launch the train demo with the `run_standalone.sh` script. The concrete command is:
```bash
bash run_standalone.sh train_fm.py
```
The current epoch and step are displayed on screen during training, along with the total elapsed time when training finishes. An encrypted inference model is saved after each training epoch.
#### (3). Launch Infer Demo with a Shell Script
Launch the infer demo with the `run_standalone.sh` script. The concrete command is:
```bash
bash run_standalone.sh load_model_and_infer.py
```
This loads the encrypted inference model and runs inference on the encrypted test data. The predictions, in ciphertext format, are saved in the `./mpc_infer_data/` directory (users can modify this in the Python script `load_model_and_infer.py`); the file naming format is similar to what is described in Step (1).
#### (4). Decrypt Data
Decrypt the saved prediction data and save the decrypted prediction results into a specified file using `decrypt_data_to_file()` in the `process_data.py` script. `Accuracy` and `AUC` can then be evaluated using the `evaluate_accuracy` and `evaluate_auc` interfaces in the script `evaluate_metrics.py`. (This phase is included in the infer phase; users can also run it as a separate script.)
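A rough sketch of this decrypt-and-evaluate step is shown below. The file names follow this demo's conventions, and the `BATCH_SIZE` value is only a placeholder that must match the batch size actually used for inference.

```python
# Hedged sketch: decrypt the ciphertext predictions and compute Accuracy/AUC.
# The file names and BATCH_SIZE below are assumptions to be adapted to the
# actual inference run.
import process_data
import evaluate_metrics

BATCH_SIZE = 100  # placeholder; use the batch size of the inference run
decrypted_file = "./mpc_infer_data/label_mpc"

# Reconstruct plaintext predictions from the ciphertext shares.
process_data.decrypt_data_to_file("./mpc_infer_data/prediction",
                                  (BATCH_SIZE,), decrypted_file)
# Compare decrypted predictions against the plaintext labels.
evaluate_metrics.evaluate_accuracy("./mpc_infer_data/label_criteo", decrypted_file)
evaluate_metrics.evaluate_auc("./mpc_infer_data/label_criteo", decrypted_file)
```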
### 2. Running on Multiple Machines
#### (1). Prepare Data
The data owner encrypts the data. The concrete operations are consistent with "Prepare Data" in "Running on a Single Machine".
#### (2). Distribute Encrypted Data
According to the file name suffix, distribute the encrypted data files to the `./mpc_data/` directory of each of the 3 computation parties. For example, send the encrypted data with suffix `.part0` to the `./mpc_data/` directory of party 0 with the `scp` command.
#### (3). Modify the Training and Inference Scripts `train_fm.py` and `load_model_and_infer.py`
Each computation party modifies `localhost` in the following code to the IP address of its own machine.
```python
pfl_mpc.init("aby3", int(role), "localhost", server, int(port))
```
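For example, if the party's machine IP address were 192.168.1.10 (a hypothetical address used only for illustration), the modified line would read:

```python
# Hypothetical IP address; replace with this party's real address.
pfl_mpc.init("aby3", int(role), "192.168.1.10", server, int(port))
```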
#### (4). Launch Demo on Each Party
**Note** that a Redis service is required to run the demo. Remember to clear the Redis server's cache before launching the demo on each computation party, to avoid any negative influence from records cached in Redis. The following command can be used to clear Redis, where REDIS_BIN is the redis-cli executable, and SERVER and PORT are the IP address and port of the Redis server, respectively.
```
$REDIS_BIN -h $SERVER -p $PORT flushall
```
Launch the train demo on each computation party with the following command:
```
$PYTHON_EXECUTABLE train_fm.py $PARTY_ID $SERVER $PORT
```
Launch the infer demo on each computation party with the following command:
```
$PYTHON_EXECUTABLE load_model_and_infer.py $PARTY_ID $SERVER $PORT
```
where PYTHON_EXECUTABLE is the Python interpreter with PaddleFL installed, PARTY_ID is the ID of the computation party (0, 1, or 2), and SERVER and PORT are the IP address and port of the Redis server, respectively.
Similarly, predictions in ciphertext format are saved in the `./mpc_infer_data/` directory, e.g., as a file named `prediction.part0` for party 0.
#### (5). Decrypt Prediction Data
Each computation party sends its `prediction.part?` file in the `./mpc_infer_data/` directory to the `./mpc_infer_data/` directory of the data owner. The data owner decrypts the prediction data and saves the decrypted prediction results into a specified file using `decrypt_data_to_file()` in the `process_data.py` script. For example, users can write the following code into a Python script named `decrypt_save.py`, and then run it with the command `python decrypt_save.py decrypted_file`. The decrypted prediction results are saved into the file `decrypted_file`. Users can then evaluate the accuracy and AUC metrics.
```python
import sys

import process_data

# The output file name is passed as the first command-line argument.
decrypted_file = sys.argv[1]
# BATCH_SIZE must match the batch size used during inference.
BATCH_SIZE = 100  # example value; adjust to your setting
process_data.decrypt_data_to_file("./mpc_infer_data/prediction", (BATCH_SIZE,), decrypted_file)
```
## Instructions for Running the PaddleFL-MPC FM Demo
(简体中文|[English](./README.md))
This example describes how to train and predict with an FM model on the Criteo dataset based on PaddleFL-MPC. It can be run in two ways: on a single machine or on multiple machines.
### 1. Running on a Single Machine
#### 1. Prepare Data
Download the Criteo dataset using the script `data/download.sh` and split it into a training dataset and a testing dataset (the small sample dataset `data/sample_data/train/sample_train.txt` can be used to verify model training and prediction). Use `generate_encrypted_data()` in the `process_data.py` script to generate encrypted training and testing data; users can simply run `python process_data.py` to generate the encrypted training and testing data (features and labels) in a specified directory, e.g., `./mpc_data/`. Encrypted feature and label files corresponding to the 3 computation parties are generated in that directory, with suffixes distinguishing the data belonging to different parties. For example, `criteo_feature_idx.part0` holds the feature ids of party 0, `criteo_feature_value.part0` the feature values of party 0, and `criteo_label.part0` the labels of party 0.
#### 2. Launch the Train Demo with a Shell Script
Before running the demo, set the following environment variables:
```
export PYTHON=/your/python
export PATH_TO_REDIS_BIN=/path/to/redis_bin
export LOCALHOST=/your/localhost
export REDIS_PORT=/your/redis/port
```
Then use the `run_standalone.sh` script to launch and run the train demo with the following command:
```bash
bash run_standalone.sh train_fm.py
```
During training, the current epoch and step are printed on screen, and the total training time is printed when training finishes. After each epoch, a model that can be used for encrypted prediction is saved.
#### 3. Launch the Infer Demo with a Shell Script
After training, use the `run_standalone.sh` script to launch and run the infer demo with the following command:
```bash
bash run_standalone.sh load_model_and_infer.py
```
This loads the model saved during training, runs prediction on the test data, and saves the ciphertext prediction results into files under the `./mpc_infer_data/` directory; the file naming format is similar to that described in Step 1.
#### 4. Decrypt Data
After prediction, `decrypt_data_to_file()` in the `process_data.py` script can be used to decrypt the saved ciphertext prediction results and save the decrypted plaintext results into a specified file. The `evaluate_accuracy` and `evaluate_auc` interfaces in the script `evaluate_metrics.py` can then be called to compute the prediction accuracy and AUC (Area Under Curve). (This is already included in the infer phase; users can also run it as a separate script.)
### 2. Running on Multiple Machines
#### 1. Prepare Data
The data owner encrypts the data. The concrete operations are the same as in the "Prepare Data" step of "Running on a Single Machine".
#### 2. Distribute Data
According to the file name suffix, send the encrypted data prepared in Step 1 to the `./mpc_data/` directory of the corresponding computation party. For example, use the `scp` command to send the encrypted data with suffix `part0` to the `./mpc_data/` directory of party 0.
#### 3. Modify the Training and Inference Scripts `train_fm.py` and `load_model_and_infer.py` on Each Computation Party
According to its own machine environment, each computation party modifies `localhost` in the following code to the IP address of its own machine:
```python
pfl_mpc.init("aby3", int(role), "localhost", server, int(port))
```
#### 4. Launch the Demo on Each Party
**Note**: A Redis service is required. To ensure that data already stored in Redis does not affect the demo, clear Redis with the following command before launching the demo on each computation party, where REDIS_BIN is the redis-cli executable and SERVER and PORT are the IP address and port of the Redis server, respectively.
```
$REDIS_BIN -h $SERVER -p $PORT flushall
```
Run the following command on each computation party to launch the demo:
```
$PYTHON_EXECUTABLE train_fm.py $PARTY_ID $SERVER $PORT
```
where PYTHON_EXECUTABLE is the Python interpreter with PaddleFL installed, PARTY_ID is the ID of the computation party (0, 1, or 2), and SERVER and PORT are the IP address and port of the Redis server, respectively.
Similarly, the ciphertext prediction data is saved into files under the `./mpc_infer_data/` directory, e.g., as the file `prediction.part0` on party 0.
#### 5. Decrypt Prediction Data
Each computation party sends its `prediction.part*` file in the `./mpc_infer_data/` directory to the `./mpc_infer_data/` directory of the data owner. The data owner uses `decrypt_data_to_file()` in the `process_data.py` script to decrypt the ciphertext prediction results and save the decrypted plaintext results into a specified file. Users can then evaluate the accuracy and AUC metrics.
```diff
@@ -22,34 +22,6 @@ cont_diff_ = [cont_max_[i] - cont_min_[i] for i in range(len(cont_min_))]
 continuous_range_ = range(1, 14)
 categorical_range_ = range(14, 40)
 
-line_num = 0
-def reader(hash_dim_, paddle_train_data_dir):
-    files = [str(paddle_train_data_dir) + "/%s" % x for x in os.listdir(paddle_train_data_dir)]
-    for file in files:
-        with open(file, 'r') as f:
-            for line in f:
-                features = line.rstrip('\n').split('\t')
-                feat_idx = []
-                feat_value = []
-                for idx in continuous_range_:
-                    feat_idx.append(hash('dense_feat_id' + str(idx)) % hash_dim_)
-                    if features[idx] == '':
-                        feat_value.append(0.0)
-                    else:
-                        feat_value.append(
-                            (float(features[idx]) - cont_min_[idx - 1]) /
-                            cont_diff_[idx - 1])
-                for idx in categorical_range_:
-                    if features[idx] == '':
-                        feat_idx.append(hash('sparse_feat_id' + str(idx)) % hash_dim_)
-                        feat_value.append(0.0)
-                    else:
-                        feat_idx.append(
-                            hash(str(idx) + features[idx]) % hash_dim_)
-                        feat_value.append(1.0)
-                label = [int(features[0])]
-                yield feat_idx[:], feat_value[:], label[:]
-
 def generate_sample(hash_dim_, paddle_train_data_dir):
     files = [str(paddle_train_data_dir) + "/%s" % x for x in os.listdir(paddle_train_data_dir)]
```
```diff
@@ -17,10 +17,31 @@ Evaluate accuracy.
 import numpy as np
 import logging
 
+import paddle_fl.mpc as pfl_mpc
+
 logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
 logger = logging.getLogger("fluid")
 logger.setLevel(logging.INFO)
 
+
+def evaluate_auc(file1, file2):
+    """
+    evaluate accuracy
+    """
+    auc_metric = pfl_mpc.metrics.Auc("ROC")
+    label = np.loadtxt(file1, delimiter='\n')
+    class1_preds = np.loadtxt(file2, delimiter='\n')
+    label_data = label.reshape(label.shape[0], 1)
+    class1_preds = class1_preds.reshape(label_data.shape)
+    class0_preds = 1 - class1_preds
+    preds = np.concatenate((class0_preds, class1_preds), axis=1)
+    auc_metric.update(preds=preds, labels=label_data)
+    logger.info("Evaluate AUC: ")
+    auc_value = auc_metric.eval()
+    logger.info(auc_value)
+    return auc_value
+
+
 def evaluate_accuracy(file1, file2):
     """
@@ -44,6 +65,6 @@ def evaluate_accuracy(file1, file2):
 
 if __name__ == '__main__':
-    #evaluate_accuracy("./mpc_data/label_mnist", "./mpc_infer_data/label_paddle")
-    evaluate_accuracy("./mpc_data/label_criteo", "./mpc_infer_data/label_mpc")
+    evaluate_accuracy("./mpc_infer_data/label_criteo", "./mpc_infer_data/label_mpc")
+    evaluate_auc("./mpc_infer_data/label_criteo", "./mpc_infer_data/label_mpc")
```
```diff
@@ -26,7 +26,7 @@ import paddle_fl.mpc.data_utils.aby3 as aby3
 import args
 import mpc_network
 import process_data
-import evaluate_accuracy
+import evaluate_metrics as evaluate
 
 logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s')
@@ -113,8 +113,8 @@ def infer(test_loader, role, exe, BATCH_SIZE, mpc_model_dir, mpc_model_filename)
     if os.path.exists(decrypt_file):
         os.remove(decrypt_file)
     process_data.decrypt_data_to_file(cypher_file, (BATCH_SIZE, ), decrypt_file)
-    evaluate_accuracy.evaluate_accuracy('./mpc_infer_data/label_criteo', decrypt_file)
-    os.remove(decrypt_file)
+    evaluate.evaluate_accuracy('./mpc_infer_data/label_criteo', decrypt_file)
+    evaluate.evaluate_auc('./mpc_infer_data/label_criteo', decrypt_file)
     end_time = time.time()
     logger.info('End Evaluate Accuracy...cost time: {}'.format(end_time - start_time))
```