<img src='https://github.com/PaddlePaddle/PaddleFL/blob/master/docs/source/_static/FL-logo.png' width = "400" height = "160">

[DOC](https://paddlefl.readthedocs.io/en/latest/) | [Quick Start](https://paddlefl.readthedocs.io/en/latest/instruction.html) | [中文](./README_cn.md)

PaddleFL is an open source federated learning framework based on PaddlePaddle. Researchers can easily replicate and compare different federated learning algorithms with PaddleFL, and developers can easily deploy a federated learning system in large-scale distributed clusters. PaddleFL provides several federated learning strategies with applications in computer vision, natural language processing, recommendation, and so on. Applications of traditional machine learning training strategies, such as multi-task learning and transfer learning, in federated learning settings will also be provided. Building on PaddlePaddle's large-scale distributed training and elastic scheduling of training jobs on Kubernetes, PaddleFL can be easily deployed on top of full-stack open source software.

## Federated Learning

Data is becoming increasingly expensive, and sharing raw data across organizations is very hard. Federated learning aims to solve the problem of data isolation and the secure sharing of data knowledge among organizations. The concept of federated learning was proposed by researchers at Google [1, 2, 3].

## Overview of PaddleFL

### Horizontal Federated Learning

<img src='images/FL-framework.png' width = "1000" height = "320" align="middle"/>

In PaddleFL, horizontal and vertical federated learning strategies will be implemented according to the categorization given in [4]. Application demonstrations in natural language processing, computer vision and recommendation will be provided in PaddleFL. 

#### A. Federated Learning Strategy

- **Vertical Federated Learning**: Logistic Regression with PrivC [5], Neural Network with ABY3 [11]

- **Horizontal Federated Learning**: Federated Averaging [2], Differential Privacy [6], Secure Aggregation

#### B. Training Strategy

- **Multi Task Learning** [7]

- **Transfer Learning** [8]

- **Active Learning**

### Federated Learning with MPC

<img src='images/PFM-overview.png' width = "1000" height = "446" align="middle"/>

Paddle FL MPC (PFM) is a framework for privacy-preserving deep learning based on PaddlePaddle. It follows the same running mechanism and programming paradigm as PaddlePaddle, while using secure multi-party computation (MPC) to enable secure training and prediction.

With PFM, it is easy to train models or conduct prediction over encrypted data just as on PaddlePaddle, without requiring any expertise in cryptography. Furthermore, the rich industry-oriented models and algorithms built on PaddlePaddle can be smoothly migrated to secure versions on PFM with little effort.

As a key product of PaddleFL, PFM intrinsically supports federated learning, including horizontal, vertical and federated transfer learning scenarios. It provides both provable security (semantic security) and competitive performance.

## Compilation and Installation

### Docker Installation

```sh
# Pull and run the docker image
docker pull hub.baidubce.com/paddlefl/paddle_fl:latest
docker run --name <docker_name> --net=host -it -v $PWD:/root <image id> /bin/bash

# Install paddle_fl
pip install paddle_fl
```

### Compile From Source Code

#### A. Environment preparation

* CentOS 6 or CentOS 7 (64 bit)
* Python 2.7.15+/3.5.1+/3.6/3.7 (64 bit)
* pip or pip3 9.0.1+ (64 bit)
* PaddlePaddle release 1.8
* Redis 5.0.8 (64 bit)
* GCC or G++ 4.8.3+
* cmake 3.15+

#### B. Clone the source code, compile and install

Fetch the source code and check out a stable release:
```sh
git clone https://github.com/PaddlePaddle/PaddleFL
cd /path/to/PaddleFL

# Check out a stable release (substitute the desired release tag)
git checkout <release_tag>

mkdir build && cd build
```

Execute the compile commands, where `PYTHON_EXECUTABLE` is the path to the Python binary of the interpreter in which PaddlePaddle is installed, `CMAKE_CXX_COMPILER` is the path of G++, and `PYTHON_INCLUDE_DIRS` is the corresponding Python include directory. You can get the `PYTHON_INCLUDE_DIRS` via the following command:

```sh
${PYTHON_EXECUTABLE} -c "from distutils.sysconfig import get_python_inc;print(get_python_inc())"
```
Then substitute that directory into the following commands and make:
```sh
cmake ../ -DPYTHON_EXECUTABLE=${PYTHON_EXECUTABLE} -DPYTHON_INCLUDE_DIRS=${python_include_dir} -DCMAKE_CXX_COMPILER=${g++_path}
make -j$(nproc)
```
Install the package:

```sh
make install
cd /path/to/PaddleFL/python
${PYTHON_EXECUTABLE} setup.py sdist bdist_wheel
pip install dist/*.whl -U  # or pip3 install dist/*.whl -U
```
We also provide a stable Redis package for download and installation:

```sh
wget --no-check-certificate https://paddlefl.bj.bcebos.com/redis-stable.tar
tar -xf redis-stable.tar
cd redis-stable && make
```

## Framework design of PaddleFL

### Horizontal Federated Learning
<img src='images/FL-training.png' width = "1000" height = "622" align="middle"/>

In PaddleFL, components for defining a federated learning task and training a federated learning job are as follows:

#### A. Compile Time

- **FL-Strategy**: a user can define federated learning strategies with FL-Strategy, such as Fed-Avg [2].

- **User-Defined-Program**: a PaddlePaddle program that defines the machine learning model structure and training strategies, such as multi-task learning.

- **Distributed-Config**: in federated learning, the system is deployed in a distributed setting. The Distributed-Config defines the distributed training node information.

- **FL-Job-Generator**: given an FL-Strategy, a User-Defined-Program and a Distributed-Config, FL-Jobs for the federated server and workers are generated by the FL-Job-Generator. The FL-Jobs are sent to the participating organizations and the federated parameter server for run-time execution; a minimal job-generation sketch follows this list.
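
The snippet below is a minimal compile-time sketch, modeled on the scripts in the examples directory; the module paths, the `FLStrategyFactory`/`JobGenerator` names and their parameters are assumptions that may differ across PaddleFL releases.

```python
# Minimal compile-time sketch (assumed API, modeled on the PaddleFL examples;
# module paths and signatures may differ across releases).
import paddle.fluid as fluid
from paddle_fl.paddle_fl.core.master.job_generator import JobGenerator
from paddle_fl.paddle_fl.core.strategy.fl_strategy_base import FLStrategyFactory

# User-Defined-Program: a plain PaddlePaddle model.
x = fluid.layers.data(name='x', shape=[13], dtype='float32')
y = fluid.layers.data(name='y', shape=[1], dtype='float32')
prediction = fluid.layers.fc(input=x, size=1, act=None)
loss = fluid.layers.mean(fluid.layers.square_error_cost(input=prediction, label=y))

# FL-Strategy: choose Fed-Avg [2] and the number of local steps per round.
build_strategy = FLStrategyFactory()
build_strategy.fed_avg = True
build_strategy.inner_step = 10
strategy = build_strategy.create_fl_strategy()

# Distributed-Config + FL-Job-Generator: emit FL-Jobs for server and workers.
job_generator = JobGenerator()
job_generator.set_optimizer(fluid.optimizer.SGD(learning_rate=0.1))
job_generator.set_losses([loss])
job_generator.set_startup_program(fluid.default_startup_program())
job_generator.set_infer_feed_and_target_names([x.name], [prediction.name])
job_generator.generate_fl_job(
    strategy,
    server_endpoints=["127.0.0.1:8181"],
    worker_num=2,
    output="fl_job_config")
```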

#### B. Run Time

- **FL-Server**: a federated parameter server that usually runs in the cloud or in third-party clusters.

- **FL-Worker**: each organization participating in federated learning has one or more federated workers that communicate with the federated parameter server.

- **FL-Scheduler**: decides which set of trainers can join the training before each update cycle; a minimal run-time sketch follows this list.
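
The run-time side can be sketched as follows, again modeled on the examples directory; the `FLScheduler`/`FLServer`/`FLRunTimeJob` names, endpoints and attributes are assumptions that may differ across releases. Each block would normally run as a separate process on its own node.

```python
# Minimal run-time sketch (assumed API, modeled on the PaddleFL examples).

# FL-Scheduler process: samples which workers join each update cycle.
from paddle_fl.paddle_fl.core.scheduler.agent_master import FLScheduler

worker_num, server_num = 2, 1
scheduler = FLScheduler(worker_num, server_num, port=9091)
scheduler.set_sample_worker_num(worker_num)  # trainers sampled per round
scheduler.init_env()
scheduler.start_fl_training()
```

```python
# FL-Server process: loads its generated FL-Job and serves parameters.
from paddle_fl.paddle_fl.core.server.fl_server import FLServer
from paddle_fl.paddle_fl.core.master.fl_job import FLRunTimeJob

job = FLRunTimeJob()
job.load_server_job("fl_job_config", 0)  # job dir from the compile step
job._scheduler_ep = "127.0.0.1:9091"     # scheduler endpoint
server = FLServer()
server.set_server_job(job)
server._current_ep = "127.0.0.1:8181"    # this server's endpoint
server.start()
# FL-Workers symmetrically load their jobs with job.load_trainer_job(...).
```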

For more instructions, please refer to the [examples](./python/paddle_fl/paddle_fl/examples).

### Federated Learning with MPC

Paddle FL MPC implements secure training and inference tasks based on an underlying MPC protocol such as ABY3 [11], a highly efficient three-party computation model.

In ABY3, participants are classified into the roles of Input Party (IP), Computing Party (CP) and Result Party (RP). Input Parties (e.g., the owners of training data or models) encrypt and distribute data or models to Computing Parties. Computing Parties (e.g., VMs in the cloud) conduct training or inference tasks based on specific MPC protocols; they are restricted to seeing only the encrypted data or models, which guarantees data privacy. When the computation is completed, one or more Result Parties (e.g., data owners or a specified third party) receive the encrypted results from the Computing Parties and reconstruct the plaintext results. Roles can overlap, e.g., a data owner can also act as a Computing Party.

A full training or inference process in PFM consists mainly of three phases: data preparation, training/inference, and result reconstruction.

#### A. Data preparation

##### 1. Private data alignment

PFM enables data owners (IPs) to find records with identical keys (such as a UUID) without revealing private data to each other. This is especially useful in vertical learning cases, where segmented features with the same keys need to be identified and aligned across all owners in a private manner before training. Using an OT-based PSI (Private Set Intersection) algorithm, PFM can perform private alignment at a speed of up to 60k records per second; a sketch of this step is given below.
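
For illustration only, the snippet below sketches the alignment step; the `paddle_fl.mpc.data_utils.alignment` module path, the `align` signature and the endpoint format are assumptions based on one release of PaddleFL and may differ in yours.

```python
# Hypothetical sketch of private set alignment (assumed module path and
# signature; check the mpc examples for the exact API in your release).
import paddle_fl.mpc.data_utils.alignment as alignment

# Each party runs this with its own party_id and its own set of record keys.
party_id = 0
endpoints = "0:127.0.0.1:11111,1:127.0.0.1:22222,2:127.0.0.1:33333"
input_set = {"uuid-001", "uuid-002", "uuid-003"}

# Returns the intersection of keys held by all parties, computed privately.
result = alignment.align(input_set=input_set,
                         party_id=party_id,
                         endpoints=endpoints,
                         is_receiver=(party_id == 0))
print(result)
```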

##### 2. Encryption and distribution

In PFM, data and models from IPs are encrypted using secret sharing [10] and then sent to CPs via direct transmission or distributed storage such as HDFS. Each CP can obtain only one share of each piece of data, and is thus unable to recover the original value under the semi-honest model; a sketch of share generation is given below.
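
For illustration, the following sketch turns a plaintext dataset into ABY3 secret shares; the `paddle_fl.mpc.data_utils.aby3` helpers (`make_shares`, `save_aby3_shares`) are modeled on the UCI housing demo in the mpc examples, and their exact signatures are assumptions.

```python
# Hypothetical sketch of encrypting data into ABY3 secret shares
# (helper names modeled on the mpc examples; verify against your release).
import paddle
import paddle_fl.mpc.data_utils.aby3 as aby3

sample_reader = paddle.dataset.uci_housing.train()

def encrypted_housing_features():
    for instance in sample_reader():
        # Split each plaintext feature vector into three secret shares.
        yield aby3.make_shares(instance[0])

def encrypted_housing_labels():
    for instance in sample_reader():
        yield aby3.make_shares(instance[1])

# Persist one share file per Computing Party.
aby3.save_aby3_shares(encrypted_housing_features, "/tmp/house_feature")
aby3.save_aby3_shares(encrypted_housing_labels, "/tmp/house_label")
```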

#### B. Training/inference

<img src='images/PFM-design.png' width = "1000" height = "400" align="middle"/>

As in PaddlePaddle, a training or inference job can be separated into the compile-time phase and the run-time phase:

##### 1. Compile time

* **MPC environment specification**: a user needs to choose an MPC protocol and configure the network settings. In the current version, PFM provides only the ABY3 protocol; more protocol implementations will be provided in the future.
* **User-defined job program**: a user can define the machine learning model structure and the training strategies (or inference task) in a PFM program, using the secure operators; a minimal sketch follows this list.
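
For illustration, here is a minimal compile-time sketch of a PFM linear-regression program, modeled on the UCI housing demo in the mpc examples; the `pfl_mpc.init` arguments and layer names are assumptions that may differ across releases.

```python
# Hypothetical compile-time sketch of a PFM program (modeled on the
# mpc examples; names and signatures may differ across releases).
import paddle.fluid as fluid
import paddle_fl.mpc as pfl_mpc

# MPC environment: protocol name, this party's role (0/1/2), and the
# address/port of the Redis service used to exchange network configuration.
role, redis_server, redis_port = 0, "127.0.0.1", 9000
pfl_mpc.init("aby3", role, "localhost", redis_server, redis_port)

# Model definition over encrypted (secret-shared) inputs.
BATCH_SIZE = 10
x = pfl_mpc.data(name='x', shape=[BATCH_SIZE, 13], dtype='int64')
y = pfl_mpc.data(name='y', shape=[BATCH_SIZE, 1], dtype='int64')
y_pre = pfl_mpc.layers.fc(input=x, size=1)
cost = pfl_mpc.layers.square_error_cost(input=y_pre, label=y)
avg_loss = pfl_mpc.layers.mean(cost)
optimizer = pfl_mpc.optimizer.SGD(learning_rate=0.001)
optimizer.minimize(avg_loss)
```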

##### 2. Run time

A PFM program is exactly a PaddlePaddle program and is executed like any normal PaddlePaddle program. For example, at run time a PFM program is transpiled into a ProgramDesc, which is then passed to and run by the Executor. The main concepts in the run-time phase are as follows:

* **Computing nodes**: a computing node is an entity corresponding to a Computing Party. In real deployment, it can be a bare-metal machine, a cloud VM, a Docker container or even a process. PFM requires exactly three computing nodes in each run, as determined by the underlying ABY3 protocol. A PFM program is deployed and run in parallel on all three computing nodes.
* **Operators using MPC**: PFM provides typical machine learning operators in `paddle_fl.mpc` over encrypted data. Such operators are implemented on top of the PaddlePaddle framework, based on MPC protocols like ABY3. Like other PaddlePaddle operators, at run time, instances of PFM operators are created and run in order by the Executor; a training-loop sketch follows this list.
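
Continuing the compile-time sketch above, the run-time side might look as follows (again modeled on the mpc examples; the reader and loader details are assumptions):

```python
# Hypothetical run-time sketch: execute the compiled PFM program with the
# ordinary PaddlePaddle Executor. Assumes x, y and avg_loss from the
# compile-time sketch above; details may differ across releases.
import paddle
import paddle.fluid as fluid
import paddle_fl.mpc.data_utils.aby3 as aby3

role, BATCH_SIZE, epoch_num = 0, 10, 20

# Each computing party loads only its own shares.
feature_reader = aby3.load_aby3_shares("/tmp/house_feature", id=role, shape=(13,))
label_reader = aby3.load_aby3_shares("/tmp/house_label", id=role, shape=(1,))
batch_sample = paddle.reader.compose(
    aby3.batch(feature_reader, BATCH_SIZE, drop_last=True),
    aby3.batch(label_reader, BATCH_SIZE, drop_last=True))

place = fluid.CPUPlace()
loader = fluid.io.DataLoader.from_generator(feed_list=[x, y], capacity=BATCH_SIZE)
loader.set_batch_generator(batch_sample, places=place)

exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

for epoch_id in range(epoch_num):
    for mpc_sample in loader():
        # The fetched loss is still in secret-shared form on each party.
        mpc_loss = exe.run(feed=mpc_sample, fetch_list=[avg_loss])
```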

#### C. Result reconstruction

Upon completion of a secure training (or inference) job, the models (or prediction results) are output by the CPs in encrypted form. The Result Parties collect the encrypted results, decrypt them using the tools in PFM, and deliver the plaintext results to users, as sketched below.
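
A sketch of reconstruction at a Result Party, with `load_aby3_shares` and `reconstruct` modeled on the mpc examples (their exact signatures are assumptions):

```python
# Hypothetical sketch of result reconstruction at a Result Party
# (helper names modeled on the mpc examples; verify against your release).
import numpy as np
import paddle
import paddle_fl.mpc.data_utils.aby3 as aby3

# Collect the encrypted prediction shares produced by the three CPs.
part_readers = [
    aby3.load_aby3_shares("/tmp/house_predict", id=i, shape=(1,))
    for i in range(3)
]
share_reader = paddle.reader.compose(*part_readers)

# Combining all three shares recovers the plaintext result.
for shares in share_reader():
    plaintext = aby3.reconstruct(np.array(shares))
    print(plaintext)
```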

For more instructions, please refer to the [MPC examples](./python/paddle_fl/mpc/examples).
## Easy deployment with Kubernetes

### Horizontal Federated Learning
```sh
kubectl apply -f ./python/paddle_fl/paddle_fl/examples/k8s_deployment/master.yaml
```
Please refer to the [K8S deployment example](./python/paddle_fl/paddle_fl/examples/k8s_deployment/README.md) for details.

You can also refer to [K8S cluster application and kubectl installation](./python/paddle_fl/paddle_fl/examples/k8s_deployment/deploy_instruction.md) to deploy your K8S cluster.

### Federated Learning with MPC

To be added.

## Benchmark task

### Horizontal Federated Learning

Gru4Rec [9] introduces a recurrent neural network model for session-based recommendation. PaddlePaddle's Gru4Rec implementation can be found at https://github.com/PaddlePaddle/models/tree/develop/PaddleRec/gru4rec. An example is given in [Gru4Rec in Federated Learning](https://paddlefl.readthedocs.io/en/latest/examples/gru4rec_examples.html).

### Federated Learning with MPC 

#### A. Convergence of paddle_fl.mpc vs paddle

##### 1. Training Parameters
- Dataset: Boston house price dataset
- Number of Epochs: 20
- Batch Size: 10

##### 2. Experiment Results

| Epoch/Step | paddle_fl.mpc loss | Paddle loss |
| ---------- | ------------------ | ----------- |
| Epoch=0, Step=0  | 738.39491 | 738.46204 |
| Epoch=1, Step=0  | 630.68834 | 629.9071 |
| Epoch=2, Step=0  | 539.54683 | 538.1757 |
| Epoch=3, Step=0  | 462.41159 | 460.64722 |
| Epoch=4, Step=0  | 397.11516 | 395.11017 |
| Epoch=5, Step=0  | 341.83102 | 339.69815 |
| Epoch=6, Step=0  | 295.01114 | 292.83597 |
| Epoch=7, Step=0  | 255.35141 | 253.19429 |
| Epoch=8, Step=0  | 221.74739 | 219.65132 |
| Epoch=9, Step=0  | 193.26459 | 191.25981 |
| Epoch=10, Step=0  | 169.11423 | 167.2204 |
| Epoch=11, Step=0  | 148.63138 | 146.85835 |
| Epoch=12, Step=0  | 131.25081 | 129.60391 |
| Epoch=13, Step=0  | 116.49708 | 114.97599 |
| Epoch=14, Step=0  | 103.96669 | 102.56854 |
| Epoch=15, Step=0  | 93.31706 | 92.03858 |
| Epoch=16, Step=0  | 84.26219 | 83.09653 |
| Epoch=17, Step=0  | 76.55664 | 75.49785 |
| Epoch=18, Step=0  | 69.99673 | 69.03561 |
| Epoch=19, Step=0  | 64.40562 | 63.53539 |

## Ongoing and Future Work

- Vertical Federated Learning will support more algorithms.

- Add a K8S deployment scheme for Federated Learning with MPC.

## References

[1]. Jakub Konečný, H. Brendan McMahan, Daniel Ramage, Peter Richtárik. **Federated Optimization: Distributed Machine Learning for On-Device Intelligence.** 2016

[2]. H. Brendan McMahan, Eider Moore, Daniel Ramage, Blaise Agüera y Arcas. **Federated Learning of Deep Networks using Model Averaging.** 2017

[3]. Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, Dave Bacon. **Federated Learning: Strategies for Improving Communication Efficiency.** 2016

[4]. Qiang Yang, Yang Liu, Tianjian Chen, Yongxin Tong. **Federated Machine Learning: Concept and Applications.** 2019

[5]. Kai He, Liu Yang, Jue Hong, Jinghua Jiang, Jieming Wu, Xu Dong et al. **PrivC - A Framework for Efficient Secure Two-Party Computation.** In Proc. of SecureComm 2019

[6]. Martín Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, Li Zhang. **Deep Learning with Differential Privacy.** 2016

[7]. Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, Ameet Talwalkar. **Federated Multi-Task Learning.** 2016

[8]. Yang Liu, Tianjian Chen, Qiang Yang. **Secure Federated Transfer Learning.** 2018

[9]. Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, Domonkos Tikk. **Session-based Recommendations with Recurrent Neural Networks.** 2016

[10]. https://en.wikipedia.org/wiki/Secret_sharing

[11]. Payman Mohassel and Peter Rindal. **ABY3: A Mixed Protocol Framework for Machine Learning.** In Proc. of CCS 2018