{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [
 "# Automatic Hyperparameter Search with PaddleHub Auto Fine-tune\n",
 "\n",
 "## 1. Introduction\n",
 "\n",
 "Tuning is an unavoidable part of training a machine learning model. A model's values fall into two classes: parameters, which the model learns from data during training, and hyperparameters, which must be set by hand from experience (for example the learning rate, dropout_rate, and batch_size) to improve training results. Current models tend to have a large hyperparameter space, so manual tuning is time-consuming and each trial is costly. PaddleHub Auto Fine-tune adjusts hyperparameters automatically.\n",
 "\n",
 "PaddleHub Auto Fine-tune provides two hyperparameter search strategies:\n",
 "\n",
 "* HAZero: The core idea is to handle dependencies between variables, and their scaling, by adjusting the covariance matrix of a normal distribution. The algorithm has three basic steps: sample new candidate solutions, compute the objective function value for each, and update the parameters of the normal distribution. The update is chosen so that the probability of sampling good solutions gradually increases.\n",
 "\n",
 "* PSHE2: Uses particle swarm optimization (PSO), where the optimal hyperparameter combination is the solution being sought. Finding it comes down to deciding how to update each hyperparameter combination so that the algorithm converges to the optimum faster and more reliably. PSO chooses the next update direction from the historical best of each hyperparameter combination, with some random perturbation.\n",
 "\n",
 "\n",
 "PaddleHub Auto Fine-tune provides two hyperparameter evaluation strategies:\n",
 "\n",
 "* FullTrail: Given a set of hyperparameters, fine-tune a new model from scratch with them, then evaluate that model on the dev split of the dataset.\n",
 "\n",
 "* ModelBased: Given a set of hyperparameters, if the set comes from the first search round, fine-tune a new model from scratch. Otherwise the program loads a better-performing model saved after fine-tuning in earlier rounds and continues fine-tuning it under the current hyperparameter combination. Which saved model counts as better is judged by its performance on the dev split of the dataset.\n",
 "\n",
 "## 2. Preparation\n",
 "\n",
 "PaddleHub Auto Fine-tune requires two files, each written in a prescribed format: the Python script to be fine-tuned, finetunee.py, and a YAML file describing the hyperparameters to search, hparam.yaml.\n",
 "\n",
 "The example below fine-tunes a Chinese sentiment classification task to show how to search hyperparameters automatically with PaddleHub Auto Fine-tune.\n",
 "\n",
 "The following hparam.yaml lists the hyperparameters to search, including each one's name, type, and range. Only the float and int types are supported.\n",
 "```\n",
 "param_list:\n",
 "- name : learning_rate\n",
 "  init_value : 0.001\n",
 "  type : float\n",
 "  lower_than : 0.05\n",
 "  greater_than : 0.000005\n",
 "- name : weight_decay\n",
 "  init_value : 0.1\n",
 "  type : float\n",
 "  lower_than : 1\n",
 "  greater_than : 0.0\n",
 "- name : batch_size\n",
 "  init_value : 32\n",
 "  type : int\n",
 "  lower_than : 40\n",
 "  greater_than : 30\n",
 "- name : warmup_prop\n",
 "  init_value : 0.1\n",
 "  type : float\n",
 "  lower_than : 0.2\n",
 "  greater_than : 0.0\n",
 "```\n",
 "\n",
 "**NOTE:** The outermost key of this YAML file must be param_list.\n",
 "\n",
 "\n",
 "The following is finetunee.py for Chinese sentiment classification:\n",
 "\n",
 "```python\n",
 "from __future__ import absolute_import\n",
 "from __future__ import division\n",
 "from __future__ import print_function\n",
 "\n",
 "import argparse\n",
 "import ast\n",
 "\n",
 "import paddle.fluid as fluid\n",
 "import paddlehub as hub\n",
 "import os\n",
 "from paddlehub.common.logger import logger\n",
 "\n",
 "# yapf: disable\n",
 "parser = argparse.ArgumentParser(__doc__)\n",
 "parser.add_argument(\"--epochs\", type=int, default=3, help=\"epochs.\")\n",
 "parser.add_argument(\"--batch_size\", type=int, default=32, help=\"batch_size.\")\n",
 "parser.add_argument(\"--learning_rate\", type=float, default=5e-5, help=\"learning_rate.\")\n",
 "parser.add_argument(\"--warmup_prop\", type=float, default=0.1, help=\"warmup_prop.\")\n",
 "parser.add_argument(\"--weight_decay\", type=float, default=0.01, help=\"weight_decay.\")\n",
 "parser.add_argument(\"--max_seq_len\", type=int, default=128, help=\"Number of words of the longest sequence.\")\n",
 "parser.add_argument(\"--checkpoint_dir\", type=str, default=None, help=\"Directory to model checkpoint\")\n",
 "parser.add_argument(\"--model_path\", type=str, default=\"\", help=\"load model path\")\n",
 "args = parser.parse_args()\n",
 "# yapf: enable.\n",
 "\n",
 "\n",
 "if __name__ == '__main__':\n",
 "    # Load Paddlehub ERNIE pretrained model\n",
 "    module = hub.Module(name=\"ernie\")\n",
 "    inputs, outputs, program = module.context(\n",
 "        trainable=True, max_seq_len=args.max_seq_len)\n",
 "\n",
 "    # Download dataset and use ClassifyReader to read dataset\n",
 "    dataset = hub.dataset.ChnSentiCorp()\n",
 "    metrics_choices = [\"acc\"]\n",
 "\n",
 "    reader = hub.reader.ClassifyReader(\n",
 "        dataset=dataset,\n",
 "        vocab_path=module.get_vocab_path(),\n",
 "        max_seq_len=args.max_seq_len)\n",
 "\n",
 "    # Construct transfer learning network\n",
 "    # Use \"pooled_output\" for classification tasks on an entire sentence.\n",
 "    pooled_output = outputs[\"pooled_output\"]\n",
 "\n",
 "    # Setup feed list for data feeder\n",
 "    # Must feed all the tensors that ERNIE's module needs\n",
 "    feed_list = [\n",
 "        inputs[\"input_ids\"].name,\n",
 "        inputs[\"position_ids\"].name,\n",
 "        inputs[\"segment_ids\"].name,\n",
 "        inputs[\"input_mask\"].name,\n",
 "    ]\n",
 "\n",
 "    # Select finetune strategy, setup config and finetune\n",
 "    strategy = hub.AdamWeightDecayStrategy(\n",
 "        warmup_proportion=args.warmup_prop,\n",
 "        learning_rate=args.learning_rate,\n",
 "        weight_decay=args.weight_decay,\n",
 "        lr_scheduler=\"linear_decay\")\n",
 "\n",
 "    # Setup running config for PaddleHub Finetune API\n",
 "    config = hub.RunConfig(\n",
 "        checkpoint_dir=args.checkpoint_dir,\n",
 "        use_cuda=True,\n",
 "        num_epoch=args.epochs,\n",
 "        batch_size=args.batch_size,\n",
 "        enable_memory_optim=True,\n",
 "        strategy=strategy)\n",
 "\n",
 "    # Define a classification finetune task by PaddleHub's API\n",
 "    cls_task = hub.TextClassifierTask(\n",
 "        data_reader=reader,\n",
 "        feature=pooled_output,\n",
 "        feed_list=feed_list,\n",
 "        num_classes=dataset.num_labels,\n",
 "        config=config,\n",
 "        metrics_choices=metrics_choices)\n",
 "\n",
 "    # Finetune and evaluate by PaddleHub's API\n",
 "    # If a model path is given (ModelBased evaluation), resume from that model\n",
 "    if args.model_path != \"\":\n",
 "        with cls_task.phase_guard(phase=\"train\"):\n",
 "            cls_task.init_if_necessary()\n",
 "            cls_task.load_parameters(args.model_path)\n",
 "            logger.info(\"PaddleHub has loaded model from %s\" % args.model_path)\n",
 "\n",
 "    run_states = cls_task.finetune()\n",
 "    train_avg_score, train_avg_loss, train_run_speed = cls_task._calculate_metrics(run_states)\n",
 "\n",
 "    run_states = cls_task.eval()\n",
 "    eval_avg_score, eval_avg_loss, eval_run_speed = cls_task._calculate_metrics(run_states)\n",
 "\n",
 "print(eval_avg_score[\"acc\"], end=\"\")\n",
 "```\n",
 "**Note**: The script above shows how finetunee.py should be written. It must meet the following requirements:\n",
 "> finetunee.py must accept each hyperparameter being searched as a command-line option, and the option names must match the hyperparameter names in the YAML file.\n",
 "\n",
 "> finetunee.py must have a checkpoint_dir option.\n",
 "\n",
 "> When the PaddleHub Auto Fine-tune evaluation strategy is ModelBased, finetunee.py must have a model_path option.\n",
 "\n",
 "> When the PaddleHub Auto Fine-tune search strategy is hazero, two or more hyperparameters to search must be provided.\n",
 "\n",
 "> The last output of finetunee.py must be the model's evaluation result on the dev split of the dataset, printed with an empty end string (end=\"\"), for example print(eval_avg_score[\"acc\"], end=\"\").
\n",
 "\n",
 "\n",
 "\n",
 "## 3. Launching\n",
 "\n",
 "**Make sure the installed PaddleHub version is 1.2.0 or later.**\n",
 "\n",
 "Run the following command:\n",
 "```shell\n",
 "$ OUTPUT=result/\n",
 "$ hub autofinetune finetunee.py --param_file=hparam.yaml --cuda=['1','2'] --popsize=5 --round=10 \\\n",
 "    --output_dir=${OUTPUT} --evaluate_choice=fulltrail --tuning_strategy=hazero\n",
 "```\n",
 "\n",
 "The options are:\n",
 "\n",
 "> `--param_file`: The YAML file describing the hyperparameters to search.\n",
 "\n",
 "> `--cuda`: The GPU card IDs available to the program, given as a list separated by commas with no spaces. Defaults to ['0'].\n",
 "\n",
 "> `--popsize`: The number of hyperparameter combinations generated in each round. Defaults to 5.\n",
 "\n",
 "> `--round`: The number of rounds to run. Defaults to 10.\n",
 "\n",
 "> `--output_dir`: The directory where the run's output is stored. Optional; if omitted, a folder for the output is created under the current working directory.\n",
 "\n",
 "> `--evaluate_choice`: The evaluation strategy for the hyperparameter search, either fulltrail or modelbased. Defaults to fulltrail.\n",
 "\n",
 "> `--tuning_strategy`: The hyperparameter search strategy, either hazero or pshe2. Defaults to hazero.\n"
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" } }, "nbformat": 4, "nbformat_minor": 2 }