The following code snippet shows how to download checkpoints from HDFS:
```python
from plsc import Entry

if __name__ == "__main__":
    ins = Entry()
    ins.set_checkpoint_dir('./saved_model')
    ins.set_hdfs_info("your_hdfs_addr",
                      "name,passwd",
                      fs_checkpoint_dir="some_dir")
```
Using the above code, the checkpoints stored in the "some_dir" directory on the HDFS file system will be downloaded to the local "./saved_model" directory. Please make sure the local directory "./saved_model" exists.
## Pre-processing for images in base64 format
In practice, base64 is a common format to store images. All image data is stored in a single file, and each line of the file contains one image in base64 format together with its corresponding label.
The following shows an example dataset structure:
```shell script
dataset
|-- file_list.txt
|-- dataset.part1
|-- dataset.part2
| ....
`-- dataset.part10
```
The file file_list.txt records all data files; each of its lines gives the name of one data file, e.g., dataset.part1.
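For illustration, one line of a data file could be parsed as in the following sketch. Note that the field separator and field order here are assumptions and may differ for your data:
```python
import base64

def parse_line(line):
    """Split one line into raw image bytes and an integer label."""
    img_b64, label = line.rstrip('\n').split('\t')  # assumed tab-separated
    return base64.b64decode(img_b64), int(label)

with open('dataset/dataset.part1') as f:
    for line in f:
        img_bytes, label = parse_line(line)
        # img_bytes now holds the encoded image, ready for decoding.
```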
For distributed training, every GPU card has to process the same number of samples, and a global shuffle over all images is usually applied.
The provided pre-processing tool performs a global shuffle over all training images and splits them evenly into groups. The number of groups equals the number of GPU cards used.
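Conceptually, the shuffle-and-split step works as in the following minimal sketch. This is illustration only: the real tool relies on sqlite3 and operates on the data files, and dropping the remainder samples here is an assumption made to keep all groups the same size:
```python
import random

def shuffle_and_split(samples, num_gpus, seed=0):
    """Globally shuffle the samples, then split them evenly into num_gpus groups."""
    random.Random(seed).shuffle(samples)
    per_group = len(samples) // num_gpus
    # Truncate the tail so every group receives exactly the same number
    # of samples, as distributed training requires.
    return [samples[i * per_group:(i + 1) * per_group]
            for i in range(num_gpus)]
```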
### How to use
The pre-processing tool is located in the directory "tools". It requires the sqlite3 module, which ships with the Python standard library; you can verify that it is available with the following command:
```shell script
python -c "import sqlite3"
```
Use the following command to show the help message:
```shell script
python tools/process_base64_files.py --help
```
The tool provides the following options:
- data_dir: the root directory for datasets
- file_list: the file that lists all data files, e.g., file_list.txt
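For example, for the dataset layout above, an invocation could look like the following (the values are placeholders; see the help message for the remaining options):
```shell script
python tools/process_base64_files.py \
    --data_dir=./dataset \
    --file_list=file_list.txt
```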
With PLSC, we assume the dataset is organized in the following structure:
```shell script
train_data/
|-- images
`-- label.txt
```
All images are stored in the directory 'images', and the file 'label.txt' is used to record the index of all images; each of its lines gives the relative path of an image and its corresponding label.
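For illustration, 'label.txt' might look as follows (the file names are placeholders, and the whitespace separator between path and label is an assumption):
```shell script
images/00001.jpg 0
images/00002.jpg 0
images/00003.jpg 1
```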
If your dataset is organized in a different structure, take the following steps to use it for training:
1. Define a generator, which pre-processes your images (e.g., resizing) and generates samples one by one using *yield*;
    * A sample is a tuple of (data, label), where data represents an image after decoding and pre-processing
2. Use paddle.batch to wrap the above generator and obtain a batched reader;
3. Assign the batched reader to the 'train_reader' member of plsc.Entry, as sketched below.
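The following is a minimal sketch of how steps 2 and 3 fit together; the generator body and the sample shape below are placeholders, and a real generator for the layout above is defined in the walkthrough that follows:
```python
import numpy as np
import paddle
from plsc import Entry

def sample_generator():
    # Step 1: yield (data, label) samples one by one. The dummy sample
    # below is a placeholder; see the worked generator definition below.
    yield np.zeros((3, 112, 112), dtype='float32'), 0

if __name__ == "__main__":
    ins = Entry()
    # Step 2: wrap the sample generator to obtain a batched reader.
    batched_reader = paddle.batch(sample_generator, batch_size=128)
    # Step 3: assign the batched reader to the train_reader member of Entry.
    ins.train_reader = batched_reader
```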
We assume your dataset is organized as follows:
```shell script
train_data/
|-- images
`-- label.txt
```
First, use the following code to define a generator: