提交 7caad04c 编写于 作者: M Mimee 提交者: GitHub

Fixing anchors (#430)

* Fixes anchor issues, except for ch1.

* Remove old files.
上级 c5dbde89
<!DOCTYPE html><html><head><meta charset="utf-8"><style>body {
width: 45em;
border: 1px solid #ddd;
outline: 1300px solid #fff;
margin: 16px auto;
}
body .markdown-body
{
padding: 30px;
}
@font-face {
font-family: fontawesome-mini;
src: url(data:font/woff;charset=utf-8;base64,d09GRgABAAAAAAzUABAAAAAAFNgAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABGRlRNAAABbAAAABwAAAAcZMzaOEdERUYAAAGIAAAAHQAAACAAOQAET1MvMgAAAagAAAA+AAAAYHqhde9jbWFwAAAB6AAAAFIAAAFa4azkLWN2dCAAAAI8AAAAKAAAACgFgwioZnBnbQAAAmQAAAGxAAACZVO0L6dnYXNwAAAEGAAAAAgAAAAIAAAAEGdseWYAAAQgAAAFDgAACMz7eroHaGVhZAAACTAAAAAwAAAANgWEOEloaGVhAAAJYAAAAB0AAAAkDGEGa2htdHgAAAmAAAAAEwAAADBEgAAQbG9jYQAACZQAAAAaAAAAGgsICJBtYXhwAAAJsAAAACAAAAAgASgBD25hbWUAAAnQAAACZwAABOD4no+3cG9zdAAADDgAAABsAAAAmF+yXM9wcmVwAAAMpAAAAC4AAAAusPIrFAAAAAEAAAAAyYlvMQAAAADLVHQgAAAAAM/u9uZ4nGNgZGBg4ANiCQYQYGJgBEJuIGYB8xgABMMAPgAAAHicY2Bm42OcwMDKwMLSw2LMwMDQBqGZihmiwHycoKCyqJjB4YPDh4NsDP+BfNb3DIuAFCOSEgUGRgAKDgt4AAB4nGNgYGBmgGAZBkYGEAgB8hjBfBYGCyDNxcDBwMTA9MHhQ9SHrA8H//9nYACyQyFs/sP86/kX8HtB9UIBIxsDXICRCUgwMaACRoZhDwA3fxKSAAAAAAHyAHABJQB/AIEAdAFGAOsBIwC/ALgAxACGAGYAugBNACcA/wCIeJxdUbtOW0EQ3Q0PA4HE2CA52hSzmZDGe6EFCcTVjWJkO4XlCGk3cpGLcQEfQIFEDdqvGaChpEibBiEXSHxCPiESM2uIojQ7O7NzzpkzS8qRqnfpa89T5ySQwt0GzTb9Tki1swD3pOvrjYy0gwdabGb0ynX7/gsGm9GUO2oA5T1vKQ8ZTTuBWrSn/tH8Cob7/B/zOxi0NNP01DoJ6SEE5ptxS4PvGc26yw/6gtXhYjAwpJim4i4/plL+tzTnasuwtZHRvIMzEfnJNEBTa20Emv7UIdXzcRRLkMumsTaYmLL+JBPBhcl0VVO1zPjawV2ys+hggyrNgQfYw1Z5DB4ODyYU0rckyiwNEfZiq8QIEZMcCjnl3Mn+pED5SBLGvElKO+OGtQbGkdfAoDZPs/88m01tbx3C+FkcwXe/GUs6+MiG2hgRYjtiKYAJREJGVfmGGs+9LAbkUvvPQJSA5fGPf50ItO7YRDyXtXUOMVYIen7b3PLLirtWuc6LQndvqmqo0inN+17OvscDnh4Lw0FjwZvP+/5Kgfo8LK40aA4EQ3o3ev+iteqIq7wXPrIn07+xWgAAAAABAAH//wAPeJyFlctvG1UUh+/12DPN1B7P3JnYjj2Ox4/MuDHxJH5N3UdaEUQLqBIkfQQioJWQ6AMEQkIqsPGCPwA1otuWSmTBhjtps2ADWbJg3EpIXbGouqSbCraJw7kzNo2dRN1cnXN1ZvT7zuuiMEI7ncizyA0URofRBJpCdbQuIFShYY+GZRrxMDVtih5TwQPHtXDFFSIKoWIbuREBjLH27Ny4MsbVx+uOJThavebgVrNRLAiYx06rXsvhxLgWx9xpfHdrs/ekc2Pl2cpPCVEITQpwbj8VQhfXSq2m+Wxqaq2D73Kne5e3NjHqQNj3CRYlJlgUl/jRNP+2Gs2pNYRQiOnmUaQDqm30KqKiTTWPWjboxnTWpvgxjXo0KrtZXAHt7hwIz0YVcj88JnKlJKi3NPAwLyDwZudSmJSMMJFDYaOkaol6XtESx3Gt1VTytdZJ3DCLeaVhVnCBH1fycHTxFXwPX+l2e3d6H/TufGGmMTLTnbSJUdo00zuBswMO/nl3YLeL/wnu9/limCuD3vC54h5NBVz6Li414AI8Vx3iiosKcQXUbrvhFFiYb++HN4DaF4XzFW0fIN4XDWJ3a3XQoq9V8WiyRmdsatV9xUcHims1JloH0YUa090G3Tro3mC6c01f+YwCPquINr1PTaCP6rVTOOmf0GE2dBc7zWIhji3/5MchSuBHgDbU99RMWt3YUNMZMJmx92YP6NsHx/5/M1yvInpnkIOM3Z8fA3JQ2lW1RFC1KaBPDFXNAHYYvGy73aYZZZ3HifbeuiVZCpwA3oQBs0wGPYJbJfg60xrKEbKiNtTe1adwrpBRwlAuQ3q3VRaX0QmQ9a49BTSCuF1MLfQ6+tinOubRBZuWPNoMevGMT+V41KitO1is3D/tpMcq1JHZqDHGs8DoYGDkxJgKjHROeTCmhZvzPm9pod+ltKm4PN7Dyvvldlpsg8D+4AUJZ3F/JBstZz7cbFRxsaAGV6yX/dkcycWf8eS3QlQea+YLjdm3yrOnrhFpUyKVvFE4lpv4bO3Svx/6F/4xmiDu/RT5iI++lko18mY1oX+5UGKR6kmVjM/Zb76yfHtxy+h/SyQ0lLdpdKy/lWB6szatetQJ8nZ80A2Qt6ift6gJeavU3BO4gtxs/KCtNPVibCtYCWY3SIlSBPKXZALXiIR9oZeJ1AuMyxLpHIy/yO7vSiSE+kZvk0ihJ30HgHfzZtEMmvV58x6dtqns0XTAW7Vdm4HJ04OCp/crOO7rd9SGxQAE/mVA9xRN+kVSMRFF6S9JFGUtthkjBA5tFCWc2l4V43Ex9GmUP3SI37Jjmir9KqlaDJ4S4JB3vuM/jzyH1+8MuoZ+QGzfnvPoJb96cZlWjMcKLfgDwB7E634JTY+asjsPzS5CiVnEWY+KsrsIN5rn3mAPjqmQBxGjcGKB9f9ZxY3mYC2L85CJ2FXIxKKyHk+dg0FHbuEc7D5NzWUX32WxFcWNGRAbvwSx0RmIXVDuYySafluQBmzA/ssqJAMLnli+WIC90Gw4lm85wcp0qjArEDPJJV/sSx4P9ungTpgMw5gVC1XO4uULq0s3v1rqLi0vX/z65vlH50f8T/RHmSPTk5xxWBWOluMT6WiOy+tdvWxlV/XQb3o3c6Ssr+r6I708GsX9/nzp1tKFh0s3v7m4vAy/Hnb/KMOvc1wump6Il48K6mGDy02X9Yd65pa+nQIjk76lWxCkG8NBCP0HQS9IpAAAeJxjYGRgYGBhcCrq214Qz2/zlUGenQEEzr/77oug/zewFbB+AHI5GJhAogBwKQ0qeJxjYGRgYH3/P46BgZ0BBNgKGBgZUAEPAE/7At0AAAB4nGNngAB2IGYjhBsYBAAIYADVAAAAAAAAAAAAAFwAyAEeAaACCgKmAx4DggRmAAAAAQAAAAwAagAEAAAAAAACAAEAAgAWAAABAAChAAAAAHiclZI7bxQxFIWPd/JkUYQChEhIyAVKgdBMskm1QkKrRETpQiLRUczueB/K7HhlOxttg8LvoKPgP9DxFxANDR0tHRWi4NjrPIBEgh1p/dm+vufcawNYFWsQmP6e4jSyQB2fI9cwj++RE9wTjyPP4LYoI89iWbyLPIe6+Bh5Hs9rryMv4GbtW+RF3EhuRa7jbrIbeQkPkjdUETOLnL0Kip4FVvAhco1RXyMnSPEz8gzWxE7kWTwUp5HnsCLeR57HW/El8gJWa58iL+JO7UfkOh4l9yMv4UnyEtvQGGECgwF66MNBooF1bGCL1ELB/TYU+ZBRlvsKQ44Se6jQ4a7hef+fh72Crv25kp+8lNWGmeKoOI5jJLb1aGIGvb6TjfWNLdkqdFvJw4l1amjlXtXRZqRN7lSRylZZyhBqpVFWmTEXgWfUrpi/hZOQXdOd4rKuXOtEWT3k5IArPRzTUU5tHKjecZkTpnVbNOnt6jzN8240GD4xtikvZW56043rPMg/dS+dlOceXoR+WPbJ55Dsekq1lJpnypsMUsYOdCW30o103Ytu/lvh+5RWFLfBjm9/N8hJntPhvx92rnoE/kyHdGasGy754kw36vsVf/lFeBi+0COu+cfgQr42G3CRpeLoZ53gmfe3X6rcKt5oVxnptHR9JS8ehVUd5wvvahN2uqxOOpMXapibI5k7Zwbt4xBSaTfoKBufhAnO/uqNcfK8OTs0OQ6l7JIqFjDhYj5WcjevCnI/1DDiI8j4ndWb/5YzDZWh79yomWXeXj7Nnw70/2TIeFPTrlSh89k1ObOSRVZWZfgF0r/zJQB4nG2JUQuCQBCEd07TTg36fb2IyBaLd3vWaUh/vmSJnvpgmG8YcmS8X3Shf3R7QA4OBUocUKHGER5NNbOOEvwc1txnuWkTRb/aPjimJ5vXabI+3VfOiyS15UWvyezM2xiGOPyuMohOH8O8JiO4Af+FsAGNAEuwCFBYsQEBjlmxRgYrWCGwEFlLsBRSWCGwgFkdsAYrXFhZsBQrAAA=) format('woff');
}
@font-face {
font-family: octicons-anchor;
src: url(data:font/woff;charset=utf-8;base64,d09GRgABAAAAAAYcAA0AAAAACjQAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABGRlRNAAABMAAAABwAAAAca8vGTk9TLzIAAAFMAAAARAAAAFZG1VHVY21hcAAAAZAAAAA+AAABQgAP9AdjdnQgAAAB0AAAAAQAAAAEACICiGdhc3AAAAHUAAAACAAAAAj//wADZ2x5ZgAAAdwAAADRAAABEKyikaNoZWFkAAACsAAAAC0AAAA2AtXoA2hoZWEAAALgAAAAHAAAACQHngNFaG10eAAAAvwAAAAQAAAAEAwAACJsb2NhAAADDAAAAAoAAAAKALIAVG1heHAAAAMYAAAAHwAAACABEAB2bmFtZQAAAzgAAALBAAAFu3I9x/Nwb3N0AAAF/AAAAB0AAAAvaoFvbwAAAAEAAAAAzBdyYwAAAADP2IQvAAAAAM/bz7t4nGNgZGFgnMDAysDB1Ml0hoGBoR9CM75mMGLkYGBgYmBlZsAKAtJcUxgcPsR8iGF2+O/AEMPsznAYKMwIkgMA5REMOXicY2BgYGaAYBkGRgYQsAHyGMF8FgYFIM0ChED+h5j//yEk/3KoSgZGNgYYk4GRCUgwMaACRoZhDwCs7QgGAAAAIgKIAAAAAf//AAJ4nHWMMQrCQBBF/0zWrCCIKUQsTDCL2EXMohYGSSmorScInsRGL2DOYJe0Ntp7BK+gJ1BxF1stZvjz/v8DRghQzEc4kIgKwiAppcA9LtzKLSkdNhKFY3HF4lK69ExKslx7Xa+vPRVS43G98vG1DnkDMIBUgFN0MDXflU8tbaZOUkXUH0+U27RoRpOIyCKjbMCVejwypzJJG4jIwb43rfl6wbwanocrJm9XFYfskuVC5K/TPyczNU7b84CXcbxks1Un6H6tLH9vf2LRnn8Ax7A5WQAAAHicY2BkYGAA4teL1+yI57f5ysDNwgAC529f0kOmWRiYVgEpDgYmEA8AUzEKsQAAAHicY2BkYGB2+O/AEMPCAAJAkpEBFbAAADgKAe0EAAAiAAAAAAQAAAAEAAAAAAAAKgAqACoAiAAAeJxjYGRgYGBhsGFgYgABEMkFhAwM/xn0QAIAD6YBhwB4nI1Ty07cMBS9QwKlQapQW3VXySvEqDCZGbGaHULiIQ1FKgjWMxknMfLEke2A+IJu+wntrt/QbVf9gG75jK577Lg8K1qQPCfnnnt8fX1NRC/pmjrk/zprC+8D7tBy9DHgBXoWfQ44Av8t4Bj4Z8CLtBL9CniJluPXASf0Lm4CXqFX8Q84dOLnMB17N4c7tBo1AS/Qi+hTwBH4rwHHwN8DXqQ30XXAS7QaLwSc0Gn8NuAVWou/gFmnjLrEaEh9GmDdDGgL3B4JsrRPDU2hTOiMSuJUIdKQQayiAth69r6akSSFqIJuA19TrzCIaY8sIoxyrNIrL//pw7A2iMygkX5vDj+G+kuoLdX4GlGK/8Lnlz6/h9MpmoO9rafrz7ILXEHHaAx95s9lsI7AHNMBWEZHULnfAXwG9/ZqdzLI08iuwRloXE8kfhXYAvE23+23DU3t626rbs8/8adv+9DWknsHp3E17oCf+Z48rvEQNZ78paYM38qfk3v/u3l3u3GXN2Dmvmvpf1Srwk3pB/VSsp512bA/GG5i2WJ7wu430yQ5K3nFGiOqgtmSB5pJVSizwaacmUZzZhXLlZTq8qGGFY2YcSkqbth6aW1tRmlaCFs2016m5qn36SbJrqosG4uMV4aP2PHBmB3tjtmgN2izkGQyLWprekbIntJFing32a5rKWCN/SdSoga45EJykyQ7asZvHQ8PTm6cslIpwyeyjbVltNikc2HTR7YKh9LBl9DADC0U/jLcBZDKrMhUBfQBvXRzLtFtjU9eNHKin0x5InTqb8lNpfKv1s1xHzTXRqgKzek/mb7nB8RZTCDhGEX3kK/8Q75AmUM/eLkfA+0Hi908Kx4eNsMgudg5GLdRD7a84npi+YxNr5i5KIbW5izXas7cHXIMAau1OueZhfj+cOcP3P8MNIWLyYOBuxL6DRylJ4cAAAB4nGNgYoAALjDJyIAOWMCiTIxMLDmZedkABtIBygAAAA==) format('woff');
}
.markdown-body {
font-family: sans-serif;
-ms-text-size-adjust: 100%;
-webkit-text-size-adjust: 100%;
color: #333333;
overflow: hidden;
font-family: "Helvetica Neue", Helvetica, "Segoe UI", Arial, freesans, sans-serif;
font-size: 16px;
line-height: 1.6;
word-wrap: break-word;
}
.markdown-body a {
background: transparent;
}
.markdown-body a:active,
.markdown-body a:hover {
outline: 0;
}
.markdown-body b,
.markdown-body strong {
font-weight: bold;
}
.markdown-body mark {
background: #ff0;
color: #000;
font-style: italic;
font-weight: bold;
}
.markdown-body sub,
.markdown-body sup {
font-size: 75%;
line-height: 0;
position: relative;
vertical-align: baseline;
}
.markdown-body sup {
top: -0.5em;
}
.markdown-body sub {
bottom: -0.25em;
}
.markdown-body h1 {
font-size: 2em;
margin: 0.67em 0;
}
.markdown-body img {
border: 0;
}
.markdown-body hr {
-moz-box-sizing: content-box;
box-sizing: content-box;
height: 0;
}
.markdown-body pre {
overflow: auto;
}
.markdown-body code,
.markdown-body kbd,
.markdown-body pre,
.markdown-body samp {
font-family: monospace, monospace;
font-size: 1em;
}
.markdown-body input {
color: inherit;
font: inherit;
margin: 0;
}
.markdown-body html input[disabled] {
cursor: default;
}
.markdown-body input {
line-height: normal;
}
.markdown-body input[type="checkbox"] {
box-sizing: border-box;
padding: 0;
}
.markdown-body table {
border-collapse: collapse;
border-spacing: 0;
}
.markdown-body td,
.markdown-body th {
padding: 0;
}
.markdown-body .codehilitetable {
border: 0;
border-spacing: 0;
}
.markdown-body .codehilitetable tr {
border: 0;
}
.markdown-body .codehilitetable pre,
.markdown-body .codehilitetable div.codehilite {
margin: 0;
}
.markdown-body .linenos,
.markdown-body .code,
.markdown-body .codehilitetable td {
border: 0;
padding: 0;
}
.markdown-body td:not(.linenos) .linenodiv {
padding: 0 !important;
}
.markdown-body .code {
width: 100%;
}
.markdown-body .linenos div pre,
.markdown-body .linenodiv pre,
.markdown-body .linenodiv {
border: 0;
-webkit-border-radius: 0;
-moz-border-radius: 0;
border-radius: 0;
-webkit-border-top-left-radius: 3px;
-webkit-border-bottom-left-radius: 3px;
-moz-border-radius-topleft: 3px;
-moz-border-radius-bottomleft: 3px;
border-top-left-radius: 3px;
border-bottom-left-radius: 3px;
}
.markdown-body .code div pre,
.markdown-body .code div {
border: 0;
-webkit-border-radius: 0;
-moz-border-radius: 0;
border-radius: 0;
-webkit-border-top-right-radius: 3px;
-webkit-border-bottom-right-radius: 3px;
-moz-border-radius-topright: 3px;
-moz-border-radius-bottomright: 3px;
border-top-right-radius: 3px;
border-bottom-right-radius: 3px;
}
.markdown-body * {
-moz-box-sizing: border-box;
box-sizing: border-box;
}
.markdown-body input {
font: 13px Helvetica, arial, freesans, clean, sans-serif, "Segoe UI Emoji", "Segoe UI Symbol";
line-height: 1.4;
}
.markdown-body a {
color: #4183c4;
text-decoration: none;
}
.markdown-body a:hover,
.markdown-body a:focus,
.markdown-body a:active {
text-decoration: underline;
}
.markdown-body hr {
height: 0;
margin: 15px 0;
overflow: hidden;
background: transparent;
border: 0;
border-bottom: 1px solid #ddd;
}
.markdown-body hr:before,
.markdown-body hr:after {
display: table;
content: " ";
}
.markdown-body hr:after {
clear: both;
}
.markdown-body h1,
.markdown-body h2,
.markdown-body h3,
.markdown-body h4,
.markdown-body h5,
.markdown-body h6 {
margin-top: 15px;
margin-bottom: 15px;
line-height: 1.1;
}
.markdown-body h1 {
font-size: 30px;
}
.markdown-body h2 {
font-size: 21px;
}
.markdown-body h3 {
font-size: 16px;
}
.markdown-body h4 {
font-size: 14px;
}
.markdown-body h5 {
font-size: 12px;
}
.markdown-body h6 {
font-size: 11px;
}
.markdown-body blockquote {
margin: 0;
}
.markdown-body ul,
.markdown-body ol {
padding: 0;
margin-top: 0;
margin-bottom: 0;
}
.markdown-body ol ol,
.markdown-body ul ol {
list-style-type: lower-roman;
}
.markdown-body ul ul ol,
.markdown-body ul ol ol,
.markdown-body ol ul ol,
.markdown-body ol ol ol {
list-style-type: lower-alpha;
}
.markdown-body dd {
margin-left: 0;
}
.markdown-body code,
.markdown-body pre,
.markdown-body samp {
font-family: Consolas, "Liberation Mono", Menlo, Courier, monospace;
font-size: 12px;
}
.markdown-body pre {
margin-top: 0;
margin-bottom: 0;
}
.markdown-body kbd {
background-color: #e7e7e7;
background-image: -moz-linear-gradient(#fefefe, #e7e7e7);
background-image: -webkit-linear-gradient(#fefefe, #e7e7e7);
background-image: linear-gradient(#fefefe, #e7e7e7);
background-repeat: repeat-x;
border-radius: 2px;
border: 1px solid #cfcfcf;
color: #000;
padding: 3px 5px;
line-height: 10px;
font: 11px Consolas, "Liberation Mono", Menlo, Courier, monospace;
display: inline-block;
}
.markdown-body>*:first-child {
margin-top: 0 !important;
}
.markdown-body>*:last-child {
margin-bottom: 0 !important;
}
.markdown-body .headeranchor-link {
position: absolute;
top: 0;
bottom: 0;
left: 0;
display: block;
padding-right: 6px;
padding-left: 30px;
margin-left: -30px;
}
.markdown-body .headeranchor-link:focus {
outline: none;
}
.markdown-body h1,
.markdown-body h2,
.markdown-body h3,
.markdown-body h4,
.markdown-body h5,
.markdown-body h6 {
position: relative;
margin-top: 1em;
margin-bottom: 16px;
font-weight: bold;
line-height: 1.4;
}
.markdown-body h1 .headeranchor,
.markdown-body h2 .headeranchor,
.markdown-body h3 .headeranchor,
.markdown-body h4 .headeranchor,
.markdown-body h5 .headeranchor,
.markdown-body h6 .headeranchor {
display: none;
color: #000;
vertical-align: middle;
}
.markdown-body h1:hover .headeranchor-link,
.markdown-body h2:hover .headeranchor-link,
.markdown-body h3:hover .headeranchor-link,
.markdown-body h4:hover .headeranchor-link,
.markdown-body h5:hover .headeranchor-link,
.markdown-body h6:hover .headeranchor-link {
height: 1em;
padding-left: 8px;
margin-left: -30px;
line-height: 1;
text-decoration: none;
}
.markdown-body h1:hover .headeranchor-link .headeranchor,
.markdown-body h2:hover .headeranchor-link .headeranchor,
.markdown-body h3:hover .headeranchor-link .headeranchor,
.markdown-body h4:hover .headeranchor-link .headeranchor,
.markdown-body h5:hover .headeranchor-link .headeranchor,
.markdown-body h6:hover .headeranchor-link .headeranchor {
display: inline-block;
}
.markdown-body h1 {
padding-bottom: 0.3em;
font-size: 2.25em;
line-height: 1.2;
border-bottom: 1px solid #eee;
}
.markdown-body h2 {
padding-bottom: 0.3em;
font-size: 1.75em;
line-height: 1.225;
border-bottom: 1px solid #eee;
}
.markdown-body h3 {
font-size: 1.5em;
line-height: 1.43;
}
.markdown-body h4 {
font-size: 1.25em;
}
.markdown-body h5 {
font-size: 1em;
}
.markdown-body h6 {
font-size: 1em;
color: #777;
}
.markdown-body p,
.markdown-body blockquote,
.markdown-body ul,
.markdown-body ol,
.markdown-body dl,
.markdown-body table,
.markdown-body pre,
.markdown-body .admonition {
margin-top: 0;
margin-bottom: 16px;
}
.markdown-body hr {
height: 4px;
padding: 0;
margin: 16px 0;
background-color: #e7e7e7;
border: 0 none;
}
.markdown-body ul,
.markdown-body ol {
padding-left: 2em;
}
.markdown-body ul ul,
.markdown-body ul ol,
.markdown-body ol ol,
.markdown-body ol ul {
margin-top: 0;
margin-bottom: 0;
}
.markdown-body li>p {
margin-top: 16px;
}
.markdown-body dl {
padding: 0;
}
.markdown-body dl dt {
padding: 0;
margin-top: 16px;
font-size: 1em;
font-style: italic;
font-weight: bold;
}
.markdown-body dl dd {
padding: 0 16px;
margin-bottom: 16px;
}
.markdown-body blockquote {
padding: 0 15px;
color: #777;
border-left: 4px solid #ddd;
}
.markdown-body blockquote>:first-child {
margin-top: 0;
}
.markdown-body blockquote>:last-child {
margin-bottom: 0;
}
.markdown-body table {
display: block;
width: 100%;
overflow: auto;
word-break: normal;
word-break: keep-all;
}
.markdown-body table th {
font-weight: bold;
}
.markdown-body table th,
.markdown-body table td {
padding: 6px 13px;
border: 1px solid #ddd;
}
.markdown-body table tr {
background-color: #fff;
border-top: 1px solid #ccc;
}
.markdown-body table tr:nth-child(2n) {
background-color: #f8f8f8;
}
.markdown-body img {
max-width: 100%;
-moz-box-sizing: border-box;
box-sizing: border-box;
}
.markdown-body code,
.markdown-body samp {
padding: 0;
padding-top: 0.2em;
padding-bottom: 0.2em;
margin: 0;
font-size: 85%;
background-color: rgba(0,0,0,0.04);
border-radius: 3px;
}
.markdown-body code:before,
.markdown-body code:after {
letter-spacing: -0.2em;
content: "\00a0";
}
.markdown-body pre>code {
padding: 0;
margin: 0;
font-size: 100%;
word-break: normal;
white-space: pre;
background: transparent;
border: 0;
}
.markdown-body .codehilite {
margin-bottom: 16px;
}
.markdown-body .codehilite pre,
.markdown-body pre {
padding: 16px;
overflow: auto;
font-size: 85%;
line-height: 1.45;
background-color: #f7f7f7;
border-radius: 3px;
}
.markdown-body .codehilite pre {
margin-bottom: 0;
word-break: normal;
}
.markdown-body pre {
word-wrap: normal;
}
.markdown-body pre code {
display: inline;
max-width: initial;
padding: 0;
margin: 0;
overflow: initial;
line-height: inherit;
word-wrap: normal;
background-color: transparent;
border: 0;
}
.markdown-body pre code:before,
.markdown-body pre code:after {
content: normal;
}
/* Admonition */
.markdown-body .admonition {
-webkit-border-radius: 3px;
-moz-border-radius: 3px;
position: relative;
border-radius: 3px;
border: 1px solid #e0e0e0;
border-left: 6px solid #333;
padding: 10px 10px 10px 30px;
}
.markdown-body .admonition table {
color: #333;
}
.markdown-body .admonition p {
padding: 0;
}
.markdown-body .admonition-title {
font-weight: bold;
margin: 0;
}
.markdown-body .admonition>.admonition-title {
color: #333;
}
.markdown-body .attention>.admonition-title {
color: #a6d796;
}
.markdown-body .caution>.admonition-title {
color: #d7a796;
}
.markdown-body .hint>.admonition-title {
color: #96c6d7;
}
.markdown-body .danger>.admonition-title {
color: #c25f77;
}
.markdown-body .question>.admonition-title {
color: #96a6d7;
}
.markdown-body .note>.admonition-title {
color: #d7c896;
}
.markdown-body .admonition:before,
.markdown-body .attention:before,
.markdown-body .caution:before,
.markdown-body .hint:before,
.markdown-body .danger:before,
.markdown-body .question:before,
.markdown-body .note:before {
font: normal normal 16px fontawesome-mini;
-moz-osx-font-smoothing: grayscale;
-webkit-user-select: none;
-moz-user-select: none;
-ms-user-select: none;
user-select: none;
line-height: 1.5;
color: #333;
position: absolute;
left: 0;
top: 0;
padding-top: 10px;
padding-left: 10px;
}
.markdown-body .admonition:before {
content: "\f056\00a0";
color: 333;
}
.markdown-body .attention:before {
content: "\f058\00a0";
color: #a6d796;
}
.markdown-body .caution:before {
content: "\f06a\00a0";
color: #d7a796;
}
.markdown-body .hint:before {
content: "\f05a\00a0";
color: #96c6d7;
}
.markdown-body .danger:before {
content: "\f057\00a0";
color: #c25f77;
}
.markdown-body .question:before {
content: "\f059\00a0";
color: #96a6d7;
}
.markdown-body .note:before {
content: "\f040\00a0";
color: #d7c896;
}
.markdown-body .admonition::after {
content: normal;
}
.markdown-body .attention {
border-left: 6px solid #a6d796;
}
.markdown-body .caution {
border-left: 6px solid #d7a796;
}
.markdown-body .hint {
border-left: 6px solid #96c6d7;
}
.markdown-body .danger {
border-left: 6px solid #c25f77;
}
.markdown-body .question {
border-left: 6px solid #96a6d7;
}
.markdown-body .note {
border-left: 6px solid #d7c896;
}
.markdown-body .admonition>*:first-child {
margin-top: 0 !important;
}
.markdown-body .admonition>*:last-child {
margin-bottom: 0 !important;
}
/* progress bar*/
.markdown-body .progress {
display: block;
width: 300px;
margin: 10px 0;
height: 24px;
-webkit-border-radius: 3px;
-moz-border-radius: 3px;
border-radius: 3px;
background-color: #ededed;
position: relative;
box-shadow: inset -1px 1px 3px rgba(0, 0, 0, .1);
}
.markdown-body .progress-label {
position: absolute;
text-align: center;
font-weight: bold;
width: 100%; margin: 0;
line-height: 24px;
color: #333;
text-shadow: 1px 1px 0 #fefefe, -1px -1px 0 #fefefe, -1px 1px 0 #fefefe, 1px -1px 0 #fefefe, 0 1px 0 #fefefe, 0 -1px 0 #fefefe, 1px 0 0 #fefefe, -1px 0 0 #fefefe, 1px 1px 2px #000;
-webkit-font-smoothing: antialiased !important;
white-space: nowrap;
overflow: hidden;
}
.markdown-body .progress-bar {
height: 24px;
float: left;
-webkit-border-radius: 3px;
-moz-border-radius: 3px;
border-radius: 3px;
background-color: #96c6d7;
box-shadow: inset 0 1px 0 rgba(255, 255, 255, .5), inset 0 -1px 0 rgba(0, 0, 0, .1);
background-size: 30px 30px;
background-image: -webkit-linear-gradient(
135deg, rgba(255, 255, 255, .4) 27%,
transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%,
transparent 77%, transparent
);
background-image: -moz-linear-gradient(
135deg,
rgba(255, 255, 255, .4) 27%, transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%, transparent 77%,
transparent
);
background-image: -ms-linear-gradient(
135deg,
rgba(255, 255, 255, .4) 27%, transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%, transparent 77%,
transparent
);
background-image: -o-linear-gradient(
135deg,
rgba(255, 255, 255, .4) 27%, transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%, transparent 77%,
transparent
);
background-image: linear-gradient(
135deg,
rgba(255, 255, 255, .4) 27%, transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%, transparent 77%,
transparent
);
}
.markdown-body .progress-100plus .progress-bar {
background-color: #a6d796;
}
.markdown-body .progress-80plus .progress-bar {
background-color: #c6d796;
}
.markdown-body .progress-60plus .progress-bar {
background-color: #d7c896;
}
.markdown-body .progress-40plus .progress-bar {
background-color: #d7a796;
}
.markdown-body .progress-20plus .progress-bar {
background-color: #d796a6;
}
.markdown-body .progress-0plus .progress-bar {
background-color: #c25f77;
}
.markdown-body .candystripe-animate .progress-bar{
-webkit-animation: animate-stripes 3s linear infinite;
-moz-animation: animate-stripes 3s linear infinite;
animation: animate-stripes 3s linear infinite;
}
@-webkit-keyframes animate-stripes {
0% {
background-position: 0 0;
}
100% {
background-position: 60px 0;
}
}
@-moz-keyframes animate-stripes {
0% {
background-position: 0 0;
}
100% {
background-position: 60px 0;
}
}
@keyframes animate-stripes {
0% {
background-position: 0 0;
}
100% {
background-position: 60px 0;
}
}
.markdown-body .gloss .progress-bar {
box-shadow:
inset 0 4px 12px rgba(255, 255, 255, .7),
inset 0 -12px 0 rgba(0, 0, 0, .05);
}
/* Multimarkdown Critic Blocks */
.markdown-body .critic_mark {
background: #ff0;
}
.markdown-body .critic_delete {
color: #c82829;
text-decoration: line-through;
}
.markdown-body .critic_insert {
color: #718c00 ;
text-decoration: underline;
}
.markdown-body .critic_comment {
color: #8e908c;
font-style: italic;
}
.markdown-body .headeranchor {
font: normal normal 16px octicons-anchor;
line-height: 1;
display: inline-block;
text-decoration: none;
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
-webkit-user-select: none;
-moz-user-select: none;
-ms-user-select: none;
user-select: none;
}
.headeranchor:before {
content: '\f05c';
}
.markdown-body .task-list-item {
list-style-type: none;
}
.markdown-body .task-list-item+.task-list-item {
margin-top: 3px;
}
.markdown-body .task-list-item input {
margin: 0 4px 0.25em -20px;
vertical-align: middle;
}
/* Media */
@media only screen and (min-width: 480px) {
.markdown-body {
font-size:14px;
}
}
@media only screen and (min-width: 768px) {
.markdown-body {
font-size:16px;
}
}
@media print {
.markdown-body * {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
.markdown-body {
font-size:12pt;
max-width:100%;
outline:none;
border: 0;
}
.markdown-body a,
.markdown-body a:visited {
text-decoration: underline;
}
.markdown-body .headeranchor-link {
display: none;
}
.markdown-body a[href]:after {
content: " (" attr(href) ")";
}
.markdown-body abbr[title]:after {
content: " (" attr(title) ")";
}
.markdown-body .ir a:after,
.markdown-body a[href^="javascript:"]:after,
.markdown-body a[href^="#"]:after {
content: "";
}
.markdown-body pre {
white-space: pre;
white-space: pre-wrap;
word-wrap: break-word;
}
.markdown-body pre,
.markdown-body blockquote {
border: 1px solid #999;
padding-right: 1em;
page-break-inside: avoid;
}
.markdown-body .progress,
.markdown-body .progress-bar {
-moz-box-shadow: none;
-webkit-box-shadow: none;
box-shadow: none;
}
.markdown-body .progress {
border: 1px solid #ddd;
}
.markdown-body .progress-bar {
height: 22px;
border-right: 1px solid #ddd;
}
.markdown-body tr,
.markdown-body img {
page-break-inside: avoid;
}
.markdown-body img {
max-width: 100% !important;
}
.markdown-body p,
.markdown-body h2,
.markdown-body h3 {
orphans: 3;
widows: 3;
}
.markdown-body h2,
.markdown-body h3 {
page-break-after: avoid;
}
}
</style><title>README.en</title></head><body><article class="markdown-body"><h1 id="recognize-digits"><a name="user-content-recognize-digits" href="#recognize-digits" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Recognize Digits</h1>
<p>The source code for this tutorial is live at <a href="https://github.com/PaddlePaddle/book/tree/develop/02.recognize_digits">book/recognize_digits</a>. For instructions on getting started with Paddle, please refer to <a href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst">installation instructions</a>.</p>
<h2 id="introduction"><a name="user-content-introduction" href="#introduction" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Introduction</h2>
<p>When one learns to program, the first task is usually to write a program that prints &ldquo;Hello World!&rdquo;. In Machine Learning or Deep Learning, the equivalent task is to train a model to recognize hand-written digits on the dataset <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a>. Handwriting recognition is a classic image classification problem. The problem is relatively easy and MNIST is a complete dataset. As a simple Computer Vision dataset, MNIST contains images of handwritten digits and their corresponding labels (Fig. 1). The input image is a $28\times28$ matrix, and the label is one of the digits from $0$ to $9$. All images are normalized, meaning that they are both rescaled and centered.</p>
<p align="center">
<img src="/Users/xuxinlei02/Documents/forks/book/02.recognize_digits/image/mnist_example_image.png" width="400"><br/>
Fig. 1. Examples of MNIST images
</p>
<p>The MNIST dataset is created from the <a href="https://www.nist.gov/srd/nist-special-database-19">NIST</a> Special Database 3 (SD-3) and the Special Database 1 (SD-1). The SD-3 is labeled by the staff of the U.S. Census Bureau, while SD-1 is labeled by high school students the in U.S. Therefore the SD-3 is cleaner and easier to recognize than the SD-1 dataset. Yann LeCun et al. used half of the samples from each of SD-1 and SD-3 to create the MNIST training set (60,000 samples) and test set (10,000 samples), where training set was labeled by 250 different annotators, and it was guaranteed that there wasn&rsquo;t a complete overlap of annotators of training set and test set.</p>
<p>Yann LeCun, one of the founders of Deep Learning, have previously made tremendous contributions to handwritten character recognition and proposed the <strong>Convolutional Neural Network</strong> (CNN), which drastically improved recognition capability for handwritten characters. CNNs are now a critical concept in Deep Learning. From the LeNet proposal by Yann LeCun, to those winning models in ImageNet competitions, such as VGGNet, GoogLeNet, and ResNet (See <a href="https://github.com/PaddlePaddle/book/tree/develop/03.image_classification">Image Classification</a> tutorial), CNNs have achieved a series of impressive results in Image Classification tasks.</p>
<p>Many algorithms are tested on MNIST. In 1998, LeCun experimented with single layer linear classifier, Multilayer Perceptron (MLP) and Multilayer CNN LeNet. These algorithms quickly reduced test error from 12% to 0.7% [<a href="#References">1</a>]. Since then, researchers have worked on many algorithms such as <strong>K-Nearest Neighbors</strong> (k-NN) [<a href="#References">2</a>], <strong>Support Vector Machine</strong> (SVM) [<a href="#References">3</a>], <strong>Neural Networks</strong> [<a href="#References">4-7</a>] and <strong>Boosting</strong> [<a href="#References">8</a>]. Various preprocessing methods like distortion removal, noise removal, and blurring, have also been applied to increase recognition accuracy.</p>
<p>In this tutorial, we tackle the task of handwritten character recognition. We start with a simple <strong>softmax</strong> regression model and guide our readers step-by-step to improve this model&rsquo;s performance on the task of recognition.</p>
<h2 id="model-overview"><a name="user-content-model-overview" href="#model-overview" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Model Overview</h2>
<p>Before introducing classification algorithms and training procedure, we provide some definitions:<br />
- $X$ is the input: Input is a $28\times 28$ MNIST image. It is flattened to a $784$ dimensional vector. $X=\left (x_0, x_1, \dots, x_{783} \right )$.<br />
- $Y$ is the output: Output of the classifier is 1 of the 10 classes (digits from 0 to 9). $Y=\left (y_0, y_1, \dots, y_9 \right )$. Each dimension $y_i$ represents the probability that the input image belongs to class $i$.<br />
- $L$ is the ground truth label: $L=\left ( l_0, l_1, \dots, l_9 \right )$. It is also 10 dimensional, but only one dimension is 1 and all others are all 0.</p>
<h3 id="softmax-regression"><a name="user-content-softmax-regression" href="#softmax-regression" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Softmax Regression</h3>
<p>In a simple softmax regression model, the input is fed to fully connected layers and a softmax function is applied to get probabilities of multiple output classes[<a href="#References">9</a>].</p>
<p>The input $X$ is multiplied by weights $W$, and bias $b$ is added to generate activations.</p>
<p>$$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$</p>
<p>where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $</p>
<p>For an $N$-class classification problem with $N$ output nodes, an $N$ dimensional vector is normalized to $N$ real values in the range $[0,1]$, each representing the probability that the sample belongs to a certain class. Here $y_i$ is the prediction probability that an image is digit $i$.</p>
<p>In such a classification problem, we usually use the cross entropy loss function:</p>
<p>$$ \text{crossentropy}(label, y) = -\sum_i label_ilog(y_i) $$</p>
<p>Fig. 2 shows a softmax regression network, with weights in blue, and bias in red. +1 indicates bias is 1.</p>
<p align="center">
<img src="image/softmax_regression_en.png" width=400><br/>
Fig. 2. Softmax regression network architecture<br/>
</p>
<h3 id="multilayer-perceptron"><a name="user-content-multilayer-perceptron" href="#multilayer-perceptron" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Multilayer Perceptron</h3>
<p>The softmax regression model described above uses the simplest two-layer neural network. That is, it only contains an input layer and an output layer. So its regression ability is limited. To achieve better recognition results, consider adding several hidden layers [<a href="#References">10</a>] between the input layer and the output layer.</p>
<ol>
<li>After the first hidden layer, we get $ H_1 = \phi(W_1X + b_1) $, where $\phi$ is the activation function. Some common ones are sigmoid, tanh and ReLU.</li>
<li>After the second hidden layer, we get $ H_2 = \phi(W_2H_1 + b_2) $.</li>
<li>Finally, the output layer outputs $Y=\text{softmax}(W_3H_2 + b_3)$, the final classification result vector.</li>
</ol>
<p>Fig. 3. shows a Multilayer Perceptron network, with the weights in blue, and the bias in red. +1 indicates that the bias is $1$.</p>
<p align="center">
<img src="image/mlp_en.png" width=500><br/>
Fig. 3. Multilayer Perceptron network architecture<br/>
</p>
<h3 id="convolutional-neural-network"><a name="user-content-convolutional-neural-network" href="#convolutional-neural-network" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Convolutional Neural Network</h3>
<h4 id="convolutional-layer"><a name="user-content-convolutional-layer" href="#convolutional-layer" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Convolutional Layer</h4>
<p align="center">
<img src="/Users/xuxinlei02/Documents/forks/book/02.recognize_digits/image/conv_layer.png" width='750'><br/>
Fig. 4. Convolutional layer<br/>
</p>
<p>The Convolutional layer is the core of a Convolutional Neural Network. The parameters in this layer are composed of a set of filters or kernels. In the forward step, each kernel moves horizontally and vertically, we compute a dot product of the kernel and the input at the corresponding positions. Then, we add the bias and apply an activation function. The result is a two-dimensional activation map. For example, some kernel may recognize corners, and some may recognize circles. These convolution kernels may respond strongly to the corresponding features.</p>
<p>Fig. 4 is a dynamic graph of a convolutional layer, where depths are not shown for simplicity. Input is $W_1=5, H_1=5, D_1=3$. In fact, this is a common representation for colored images. $W_1$ and $H_1$ of a colored image correspond to the width and height respectively. $D_1$ corresponds to the 3 color channels for RGB. The parameters of the convolutional layer are $K=2, F=3, S=2, P=1$. $K$ is the number of kernels. Here, $Filter W_0$ and $Filter W_1$ are two kernels. $F$ is kernel size. $W0$ and $W1$ are both $3\times3$ matrix in all depths. $S$ is the stride. Kernels move leftwards or downwards by 2 units each time. $P$ is padding, an extension of the input. The gray area in the figure shows zero padding with size 1.</p>
<h4 id="pooling-layer"><a name="user-content-pooling-layer" href="#pooling-layer" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Pooling Layer</h4>
<p align="center">
<img src="/Users/xuxinlei02/Documents/forks/book/02.recognize_digits/image/max_pooling_en.png" width="400px"><br/>
Fig. 5 Pooling layer<br/>
</p>
<p>A Pooling layer performs downsampling. The main functionality of this layer is to reduce computation by reducing the network parameters. It also prevents overfitting to some extent. Usually, a pooling layer is added after a convolutional layer. Pooling layer can be of various types like max pooling, average pooling, etc. Max pooling uses rectangles to segment the input layer into several parts and computes the maximum value in each part as the output (Fig. 5.)</p>
<h4 id="lenet-5-network"><a name="user-content-lenet-5-network" href="#lenet-5-network" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>LeNet-5 Network</h4>
<p align="center">
<img src="/Users/xuxinlei02/Documents/forks/book/02.recognize_digits/image/cnn_en.png"><br/>
Fig. 6. LeNet-5 Convolutional Neural Network architecture<br/>
</p>
<p><a href="http://yann.lecun.com/exdb/lenet/">LeNet-5</a> is one of the simplest Convolutional Neural Networks. Fig. 6. shows its architecture: A 2-dimensional input image is fed into two sets of convolutional layers and pooling layers, this output is then fed to a fully connected layer and a softmax classifier. The following three properties of convolution enable LeNet-5 to better recognize images than Multilayer fully connected perceptrons:</p>
<ul>
<li>3D properties of neurons: a convolutional layer is organized by width, height and depth. Neurons in each layer are connected to only a small region in the previous layer. This region is called the receptive field.</li>
<li>Local connection: A CNN utilizes the local space correlation by connecting local neurons. This design guarantees that the learned filter has a strong response to local input features. Stacking many such layers generates a non-linear filter that is more global. This enables the network to first obtain good representation for small parts of input and then combine them to represent a larger region.</li>
<li>Sharing weights: In a CNN, computation is iterated on shared parameters (weights and bias) to form a feature map. This means all neurons in the same depth of the output respond to the same feature. This allows detecting a feature regardless of its position in the input and enables translation equivariance.</li>
</ul>
<p>For more details on Convolutional Neural Networks, please refer to <a href="http://cs231n.github.io/convolutional-networks/">this Stanford open course</a> and <a href="https://github.com/PaddlePaddle/book/blob/develop/image_classification/README.md">this Image Classification</a> tutorial.</p>
<h3 id="list-of-common-activation-functions"><a name="user-content-list-of-common-activation-functions" href="#list-of-common-activation-functions" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>List of Common Activation Functions</h3>
<ul>
<li>
<p>Sigmoid activation function: $ f(x) = sigmoid(x) = \frac{1}{1+e^{-x}} $</p>
</li>
<li>
<p>Tanh activation function: $ f(x) = tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}} $</p>
</li>
</ul>
<p>In fact, tanh function is just a rescaled version of the sigmoid function. It is obtained by magnifying the value of the sigmoid function and moving it downwards by 1.</p>
<ul>
<li>ReLU activation function: $ f(x) = max(0, x) $</li>
</ul>
<p>For more information, please refer to <a href="https://en.wikipedia.org/wiki/Activation_function">Activation functions on Wikipedia</a>.</p>
<h2 id="data-preparation"><a name="user-content-data-preparation" href="#data-preparation" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Data Preparation</h2>
<p>PaddlePaddle provides a Python module, <code>paddle.dataset.mnist</code>, which downloads and caches the <a href="http://yann.lecun.com/exdb/mnist/">MNIST dataset</a>. The cache is under <code>/home/username/.cache/paddle/dataset/mnist</code>:</p>
<table>
<thead>
<tr>
<th>File name</th>
<th>Description</th>
<th>Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>train-images-idx3-ubyte</td>
<td>Training images</td>
<td>60,000</td>
</tr>
<tr>
<td>train-labels-idx1-ubyte</td>
<td>Training labels</td>
<td>60,000</td>
</tr>
<tr>
<td>t10k-images-idx3-ubyte</td>
<td>Evaluation images</td>
<td>10,000</td>
</tr>
<tr>
<td>t10k-labels-idx1-ubyte</td>
<td>Evaluation labels</td>
<td>10,000</td>
</tr>
</tbody>
</table>
<h2 id="model-configuration"><a name="user-content-model-configuration" href="#model-configuration" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Model Configuration</h2>
<p>A PaddlePaddle program starts from importing the API package:</p>
<pre><code class="python">import paddle.v2 as paddle
</code></pre>
<p>We want to use this program to demonstrate three different classifiers, each defined as a Python function:</p>
<ul>
<li>Softmax regression: the network has a fully-connection layer with softmax activation:</li>
</ul>
<pre><code class="python">def softmax_regression(img):
predict = paddle.layer.fc(input=img,
size=10,
act=paddle.activation.Softmax())
return predict
</code></pre>
<ul>
<li>Multi-Layer Perceptron: this network has two hidden fully-connected layers, one with ReLU and the other with softmax activation:</li>
</ul>
<pre><code class="python">def multilayer_perceptron(img):
hidden1 = paddle.layer.fc(input=img, size=128, act=paddle.activation.Relu())
hidden2 = paddle.layer.fc(input=hidden1,
size=64,
act=paddle.activation.Relu())
predict = paddle.layer.fc(input=hidden2,
size=10,
act=paddle.activation.Softmax())
return predict
</code></pre>
<ul>
<li>Convolution network LeNet-5: the input image is fed through two convolution-pooling layers, a fully-connected layer, and the softmax output layer:</li>
</ul>
<pre><code class="python">def convolutional_neural_network(img):
conv_pool_1 = paddle.networks.simple_img_conv_pool(
input=img,
filter_size=5,
num_filters=20,
num_channel=1,
pool_size=2,
pool_stride=2,
act=paddle.activation.Tanh())
conv_pool_2 = paddle.networks.simple_img_conv_pool(
input=conv_pool_1,
filter_size=5,
num_filters=50,
num_channel=20,
pool_size=2,
pool_stride=2,
act=paddle.activation.Tanh())
fc1 = paddle.layer.fc(input=conv_pool_2,
size=128,
act=paddle.activation.Tanh())
predict = paddle.layer.fc(input=fc1,
size=10,
act=paddle.activation.Softmax())
return predict
</code></pre>
<p>PaddlePaddle provides a special layer <code>layer.data</code> for reading data. Let us create a data layer for reading images and connect it to a classification network created using one of above three functions. We also need a cost layer for training the model.</p>
<pre><code class="python">paddle.init(use_gpu=False, trainer_count=1)
images = paddle.layer.data(
name='pixel', type=paddle.data_type.dense_vector(784))
label = paddle.layer.data(
name='label', type=paddle.data_type.integer_value(10))
predict = softmax_regression(images)
#predict = multilayer_perceptron(images) # uncomment for MLP
#predict = convolutional_neural_network(images) # uncomment for LeNet5
cost = paddle.layer.classification_cost(input=predict, label=label)
</code></pre>
<p>Now, it is time to specify training parameters. In the following <code>Momentum</code> optimizer, <code>momentum=0.9</code> means that 90% of the current momentum comes from that of the previous iteration. The learning rate relates to the speed at which the network training converges. Regularization is meant to prevent over-fitting; here we use the L2 regularization.</p>
<pre><code class="python">parameters = paddle.parameters.create(cost)
optimizer = paddle.optimizer.Momentum(
learning_rate=0.1 / 128.0,
momentum=0.9,
regularization=paddle.optimizer.L2Regularization(rate=0.0005 * 128))
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=optimizer)
</code></pre>
<p>Then we specify the training data <code>paddle.dataset.movielens.train()</code> and testing data <code>paddle.dataset.movielens.test()</code>. These two methods are <em>reader creators</em>. Once called, a reader creator returns a <em>reader</em>. A reader is a Python method, which, once called, returns a Python generator, which yields instances of data.</p>
<p><code>shuffle</code> is a reader decorator. It takes in a reader A as input and returns a new reader B. Under the hood, B calls A to read data in the following fashion: it copies in <code>buffer_size</code> instances at a time into a buffer, shuffles the data, and yields the shuffled instances one at a time. A large buffer size would yield very shuffled data.</p>
<p><code>batch</code> is a special decorator, which takes in reader and outputs a <em>batch reader</em>, which doesn&rsquo;t yield an instance, but a minibatch at a time.</p>
<pre><code class="python">lists = []
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print &quot;Pass %d, Batch %d, Cost %f, %s&quot; % (
event.pass_id, event.batch_id, event.cost, event.metrics)
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=paddle.batch(
paddle.dataset.mnist.test(), batch_size=128))
print &quot;Test with Pass %d, Cost %f, %s\n&quot; % (
event.pass_id, result.cost, result.metrics)
lists.append((event.pass_id, result.cost,
result.metrics['classification_error_evaluator']))
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
paddle.dataset.mnist.train(), buf_size=8192),
batch_size=128),
event_handler=event_handler,
num_passes=100)
</code></pre>
<p>During training, <code>trainer.train</code> invokes <code>event_handler</code> for certain events. This gives us a chance to print the training progress.</p>
<pre><code># Pass 0, Batch 0, Cost 2.780790, {'classification_error_evaluator': 0.9453125}
# Pass 0, Batch 100, Cost 0.635356, {'classification_error_evaluator': 0.2109375}
# Pass 0, Batch 200, Cost 0.326094, {'classification_error_evaluator': 0.1328125}
# Pass 0, Batch 300, Cost 0.361920, {'classification_error_evaluator': 0.1015625}
# Pass 0, Batch 400, Cost 0.410101, {'classification_error_evaluator': 0.125}
# Test with Pass 0, Cost 0.326659, {'classification_error_evaluator': 0.09470000118017197}
</code></pre>
<p>After the training, we can check the model&rsquo;s prediction accuracy.</p>
<pre><code># find the best pass
best = sorted(lists, key=lambda list: float(list[1]))[0]
print 'Best pass is %s, testing Avgcost is %s' % (best[0], best[1])
print 'The classification accuracy is %.2f%%' % (100 - float(best[2]) * 100)
</code></pre>
<p>Usually, with MNIST data, the softmax regression model achieves an accuracy around 92.34%, the MLP 97.66%, and the convolution network around 99.20%. Convolution layers have been widely considered a great invention for image processing.</p>
<h2 id="conclusion"><a name="user-content-conclusion" href="#conclusion" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Conclusion</h2>
<p>This tutorial describes a few basic Deep Learning models using Softmax regression, Multilayer Perceptron Network, and Convolutional Neural Network. The subsequent tutorials will derive more sophisticated models from these. It is crucial to understand these models for future learning. When our model evolves from a simple softmax regression to slightly complex Convolutional Neural Network, the recognition accuracy on the MNIST data set achieves large improvement in accuracy. This is due to the Convolutional layers&rsquo; local connections and parameter sharing. While learning new models in the future, we encourage the readers to understand the key ideas that lead a new model to improve the results of an old one. Moreover, this tutorial introduces the basic flow of PaddlePaddle model design, which starts with a <em>dataprovider</em>, a model layer construction, and finally training and prediction. Readers can leverage the flow used in this MNIST handwritten digit classification example and experiment with different data and network architectures to train models for classification tasks of their choice.</p>
<h2 id="references"><a name="user-content-references" href="#references" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>References</h2>
<ol>
<li>LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. <a href="http://ieeexplore.ieee.org/abstract/document/726791/">&ldquo;Gradient-based learning applied to document recognition.&rdquo;</a> Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.</li>
<li>Wejéus, Samuel. <a href="http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A753279&amp;dswid=-434">&ldquo;A Neural Network Approach to Arbitrary SymbolRecognition on Modern Smartphones.&rdquo;</a> (2014).</li>
<li>Decoste, Dennis, and Bernhard Schölkopf. <a href="http://link.springer.com/article/10.1023/A:1012454411458">&ldquo;Training invariant support vector machines.&rdquo;</a> Machine learning 46, no. 1-3 (2002): 161-190.</li>
<li>Simard, Patrice Y., David Steinkraus, and John C. Platt. <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.160.8494&amp;rep=rep1&amp;type=pdf">&ldquo;Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis.&rdquo;</a> In ICDAR, vol. 3, pp. 958-962. 2003.</li>
<li>Salakhutdinov, Ruslan, and Geoffrey E. Hinton. <a href="http://www.jmlr.org/proceedings/papers/v2/salakhutdinov07a/salakhutdinov07a.pdf">&ldquo;Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure.&rdquo;</a> In AISTATS, vol. 11. 2007.</li>
<li>Cireşan, Dan Claudiu, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. <a href="http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00052">&ldquo;Deep, big, simple neural nets for handwritten digit recognition.&rdquo;</a> Neural computation 22, no. 12 (2010): 3207-3220.</li>
<li>Deng, Li, Michael L. Seltzer, Dong Yu, Alex Acero, Abdel-rahman Mohamed, and Geoffrey E. Hinton. <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.185.1908&amp;rep=rep1&amp;type=pdf">&ldquo;Binary coding of speech spectrograms using a deep auto-encoder.&rdquo;</a> In Interspeech, pp. 1692-1695. 2010.</li>
<li>Kégl, Balázs, and Róbert Busa-Fekete. <a href="http://dl.acm.org/citation.cfm?id=1553439">&ldquo;Boosting products of base classifiers.&rdquo;</a> In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 497-504. ACM, 2009.</li>
<li>Rosenblatt, Frank. <a href="http://psycnet.apa.org/journals/rev/65/6/386/">&ldquo;The perceptron: A probabilistic model for information storage and organization in the brain.&rdquo;</a> Psychological review 65, no. 6 (1958): 386.</li>
<li>Bishop, Christopher M. <a href="http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf">&ldquo;Pattern recognition.&rdquo;</a> Machine Learning 128 (2006): 1-58.</li>
</ol>
<p><br/><br />
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</p></article></body></html>
\ No newline at end of file
...@@ -522,7 +522,7 @@ print "Label of image/dog.png is: %d" % lab[0][0] ...@@ -522,7 +522,7 @@ print "Label of image/dog.png is: %d" % lab[0][0]
Traditional image classification methods involve multiple stages of processing, which has to utilize complex frameworks. Contrarily, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduced three models -- VGG, GoogleNet, ResNet and provided PaddlePaddle config files for training VGG and ResNet on CIFAR10. We also explained how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for config and training are the same and you are welcome to give it a try. Traditional image classification methods involve multiple stages of processing, which has to utilize complex frameworks. Contrarily, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduced three models -- VGG, GoogleNet, ResNet and provided PaddlePaddle config files for training VGG and ResNet on CIFAR10. We also explained how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for config and training are the same and you are welcome to give it a try.
## Reference ## References
[1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004. [1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004.
......
...@@ -564,7 +564,7 @@ print "Label of image/dog.png is: %d" % lab[0][0] ...@@ -564,7 +564,7 @@ print "Label of image/dog.png is: %d" % lab[0][0]
Traditional image classification methods involve multiple stages of processing, which has to utilize complex frameworks. Contrarily, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduced three models -- VGG, GoogleNet, ResNet and provided PaddlePaddle config files for training VGG and ResNet on CIFAR10. We also explained how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for config and training are the same and you are welcome to give it a try. Traditional image classification methods involve multiple stages of processing, which has to utilize complex frameworks. Contrarily, CNN models can be trained end-to-end with a significant increase in classification accuracy. In this chapter, we introduced three models -- VGG, GoogleNet, ResNet and provided PaddlePaddle config files for training VGG and ResNet on CIFAR10. We also explained how to perform prediction and feature extraction using the PaddlePaddle API. For other datasets such as ImageNet, the procedure for config and training are the same and you are welcome to give it a try.
## Reference ## References
[1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004. [1] D. G. Lowe, [Distinctive image features from scale-invariant keypoints](http://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf). IJCV, 60(2):91-110, 2004.
......
...@@ -18,7 +18,7 @@ One-hot vectors are intuitive, yet they have limited usefulness. Take the exampl ...@@ -18,7 +18,7 @@ One-hot vectors are intuitive, yet they have limited usefulness. Take the exampl
Like many machine learning models, word embeddings can represent knowledge in various ways. Another model may project an one-hot vector to an embedding vector of lower dimension e.g. $embedding(mother's day) = [0.3, 4.2, -1.5, ...], embedding(carnations) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the corresponding vectors for words like "Mother's Day" and "carnations" are no longer zero. Like many machine learning models, word embeddings can represent knowledge in various ways. Another model may project an one-hot vector to an embedding vector of lower dimension e.g. $embedding(mother's day) = [0.3, 4.2, -1.5, ...], embedding(carnations) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the corresponding vectors for words like "Mother's Day" and "carnations" are no longer zero.
A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before people started using neural networks to generate word embedding, the traditional method was to calculate a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ represents the co-occurrence times of the $i$th and $j$th words in the vocabulary `V` within all corpus, and $|V|$ is the size of the vocabulary. By performing matrix decomposition on $X$ e.g. Singular Value Decomposition \[[5](#References)\] A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before people started using neural networks to generate word embedding, the traditional method was to calculate a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ represents the co-occurrence times of the $i$th and $j$th words in the vocabulary `V` within all corpus, and $|V|$ is the size of the vocabulary. By performing matrix decomposition on $X$ e.g. Singular Value Decomposition \[[5](#references)\]
$$X = USV^T$$ $$X = USV^T$$
...@@ -33,7 +33,7 @@ The neural network based model does not require storing huge hash tables of stat ...@@ -33,7 +33,7 @@ The neural network based model does not require storing huge hash tables of stat
## Results Demonstration ## Results Demonstration
In this section, we use the $t-$SNE\[[4](#reference)\] data visualization algorithm to draw the word embedding vectors after projecting them onto a two-dimensional space (see figure below). From the figure we can see that the semantically relevant words -- *a*, *the*, and *these* or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business* or *decision* and *japan* -- are far from each other. In this section, we use the $t-$SNE\[[4](#references)\] data visualization algorithm to draw the word embedding vectors after projecting them onto a two-dimensional space (see figure below). From the figure we can see that the semantically relevant words -- *a*, *the*, and *these* or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business* or *decision* and *japan* -- are far from each other.
<p align="center"> <p align="center">
<img src = "image/2d_similarity.png" width=400><br/> <img src = "image/2d_similarity.png" width=400><br/>
...@@ -61,7 +61,7 @@ In this section, we will introduce three word embedding models: N-gram model, CB ...@@ -61,7 +61,7 @@ In this section, we will introduce three word embedding models: N-gram model, CB
For N-gram model, we will first introduce the concept of language model, and implement it using PaddlePaddle in section [Training](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application). For N-gram model, we will first introduce the concept of language model, and implement it using PaddlePaddle in section [Training](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application).
The latter two models, which became popular recently, are neural word embedding model developed by Tomas Mikolov at Google \[[3](#reference)\]. Despite their apparent simplicity, these models train very well. The latter two models, which became popular recently, are neural word embedding model developed by Tomas Mikolov at Google \[[3](#references)\]. Despite their apparent simplicity, these models train very well.
### Language Model ### Language Model
...@@ -83,7 +83,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$ ...@@ -83,7 +83,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
In computational linguistics, n-gram is an important method to represent text. An n-gram represents a contiguous sequence of n consecutive items given a text. Based on the desired application scenario, each item could be a letter, a syllable or a word. The N-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*th word. In computational linguistics, n-gram is an important method to represent text. An n-gram represents a contiguous sequence of n consecutive items given a text. Based on the desired application scenario, each item could be a letter, a syllable or a word. The N-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*th word.
Yoshua Bengio and other scientists describe how to train a word embedding model using neural network in the famous paper of Neural Probabilistic Language Models \[[1](#reference)\] published in 2003. The Neural Network Language Model (NNLM) described in the paper learns the language model and word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on large amounts of corpus, the model learns the word embedding; then, it computes the probability of the whole sentence, using the embedding. This type of language model can overcome the **curse of dimensionality** i.e. model inaccuracy caused by the difference in dimensionality between training and testing data. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM but only refer to it as *N-gram neural model* in this section. Yoshua Bengio and other scientists describe how to train a word embedding model using neural network in the famous paper of Neural Probabilistic Language Models \[[1](#references)\] published in 2003. The Neural Network Language Model (NNLM) described in the paper learns the language model and word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on large amounts of corpus, the model learns the word embedding; then, it computes the probability of the whole sentence, using the embedding. This type of language model can overcome the **curse of dimensionality** i.e. model inaccuracy caused by the difference in dimensionality between training and testing data. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM but only refer to it as *N-gram neural model* in this section.
We have previously described language model using conditional probability, where the probability of the *t*-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words further prior have less impact on a word, and every word within an n-gram is only effected by its previous n-1 words, we have: We have previously described language model using conditional probability, where the probability of the *t*-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words further prior have less impact on a word, and every word within an n-gram is only effected by its previous n-1 words, we have:
...@@ -151,7 +151,7 @@ As illustrated in the figure above, skip-gram model maps the word embedding of t ...@@ -151,7 +151,7 @@ As illustrated in the figure above, skip-gram model maps the word embedding of t
## Dataset ## Dataset
We will use Penn Treebank (PTB) (Tomas Mikolov's pre-processed version) dataset. PTB is a small dataset, used in Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows: We will use Penn Treebank (PTB) (Tomas Mikolov's pre-processed version) dataset. PTB is a small dataset, used in Recurrent Neural Network Language Modeling Toolkit\[[2](#references)\]. Its statistics are as follows:
<p align="center"> <p align="center">
<table> <table>
...@@ -439,7 +439,7 @@ This chapter introduces word embeddings, the relationship between language model ...@@ -439,7 +439,7 @@ This chapter introduces word embeddings, the relationship between language model
In information retrieval, the relevance between the query and document keyword can be computed through the cosine similarity of their word embeddings. In grammar analysis and semantic analysis, a previously trained word embedding can initialize models for better performance. In document classification, clustering the word embedding can group synonyms in the documents. We hope that readers can use word embedding models in their work after reading this chapter. In information retrieval, the relevance between the query and document keyword can be computed through the cosine similarity of their word embeddings. In grammar analysis and semantic analysis, a previously trained word embedding can initialize models for better performance. In document classification, clustering the word embedding can group synonyms in the documents. We hope that readers can use word embedding models in their work after reading this chapter.
## Referenes ## References
1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. journal of machine learning research, 2003, 3(Feb): 1137-1155. 1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. journal of machine learning research, 2003, 3(Feb): 1137-1155.
2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201. 2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013. 3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013.
......
...@@ -60,7 +60,7 @@ One-hot vectors are intuitive, yet they have limited usefulness. Take the exampl ...@@ -60,7 +60,7 @@ One-hot vectors are intuitive, yet they have limited usefulness. Take the exampl
Like many machine learning models, word embeddings can represent knowledge in various ways. Another model may project an one-hot vector to an embedding vector of lower dimension e.g. $embedding(mother's day) = [0.3, 4.2, -1.5, ...], embedding(carnations) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the corresponding vectors for words like "Mother's Day" and "carnations" are no longer zero. Like many machine learning models, word embeddings can represent knowledge in various ways. Another model may project an one-hot vector to an embedding vector of lower dimension e.g. $embedding(mother's day) = [0.3, 4.2, -1.5, ...], embedding(carnations) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the corresponding vectors for words like "Mother's Day" and "carnations" are no longer zero.
A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before people started using neural networks to generate word embedding, the traditional method was to calculate a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ represents the co-occurrence times of the $i$th and $j$th words in the vocabulary `V` within all corpus, and $|V|$ is the size of the vocabulary. By performing matrix decomposition on $X$ e.g. Singular Value Decomposition \[[5](#References)\] A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before people started using neural networks to generate word embedding, the traditional method was to calculate a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ represents the co-occurrence times of the $i$th and $j$th words in the vocabulary `V` within all corpus, and $|V|$ is the size of the vocabulary. By performing matrix decomposition on $X$ e.g. Singular Value Decomposition \[[5](#references)\]
$$X = USV^T$$ $$X = USV^T$$
...@@ -75,7 +75,7 @@ The neural network based model does not require storing huge hash tables of stat ...@@ -75,7 +75,7 @@ The neural network based model does not require storing huge hash tables of stat
## Results Demonstration ## Results Demonstration
In this section, we use the $t-$SNE\[[4](#reference)\] data visualization algorithm to draw the word embedding vectors after projecting them onto a two-dimensional space (see figure below). From the figure we can see that the semantically relevant words -- *a*, *the*, and *these* or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business* or *decision* and *japan* -- are far from each other. In this section, we use the $t-$SNE\[[4](#references)\] data visualization algorithm to draw the word embedding vectors after projecting them onto a two-dimensional space (see figure below). From the figure we can see that the semantically relevant words -- *a*, *the*, and *these* or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business* or *decision* and *japan* -- are far from each other.
<p align="center"> <p align="center">
<img src = "image/2d_similarity.png" width=400><br/> <img src = "image/2d_similarity.png" width=400><br/>
...@@ -103,7 +103,7 @@ In this section, we will introduce three word embedding models: N-gram model, CB ...@@ -103,7 +103,7 @@ In this section, we will introduce three word embedding models: N-gram model, CB
For N-gram model, we will first introduce the concept of language model, and implement it using PaddlePaddle in section [Training](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application). For N-gram model, we will first introduce the concept of language model, and implement it using PaddlePaddle in section [Training](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application).
The latter two models, which became popular recently, are neural word embedding model developed by Tomas Mikolov at Google \[[3](#reference)\]. Despite their apparent simplicity, these models train very well. The latter two models, which became popular recently, are neural word embedding model developed by Tomas Mikolov at Google \[[3](#references)\]. Despite their apparent simplicity, these models train very well.
### Language Model ### Language Model
...@@ -125,7 +125,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$ ...@@ -125,7 +125,7 @@ $$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$
In computational linguistics, n-gram is an important method to represent text. An n-gram represents a contiguous sequence of n consecutive items given a text. Based on the desired application scenario, each item could be a letter, a syllable or a word. The N-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*th word. In computational linguistics, n-gram is an important method to represent text. An n-gram represents a contiguous sequence of n consecutive items given a text. Based on the desired application scenario, each item could be a letter, a syllable or a word. The N-gram model is also an important method in statistical language modeling. When training language models with n-grams, the first (n-1) words of an n-gram are used to predict the *n*th word.
Yoshua Bengio and other scientists describe how to train a word embedding model using neural network in the famous paper of Neural Probabilistic Language Models \[[1](#reference)\] published in 2003. The Neural Network Language Model (NNLM) described in the paper learns the language model and word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on large amounts of corpus, the model learns the word embedding; then, it computes the probability of the whole sentence, using the embedding. This type of language model can overcome the **curse of dimensionality** i.e. model inaccuracy caused by the difference in dimensionality between training and testing data. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM but only refer to it as *N-gram neural model* in this section. Yoshua Bengio and other scientists describe how to train a word embedding model using neural network in the famous paper of Neural Probabilistic Language Models \[[1](#references)\] published in 2003. The Neural Network Language Model (NNLM) described in the paper learns the language model and word embedding simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on large amounts of corpus, the model learns the word embedding; then, it computes the probability of the whole sentence, using the embedding. This type of language model can overcome the **curse of dimensionality** i.e. model inaccuracy caused by the difference in dimensionality between training and testing data. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM but only refer to it as *N-gram neural model* in this section.
We have previously described language model using conditional probability, where the probability of the *t*-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words further prior have less impact on a word, and every word within an n-gram is only effected by its previous n-1 words, we have: We have previously described language model using conditional probability, where the probability of the *t*-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words further prior have less impact on a word, and every word within an n-gram is only effected by its previous n-1 words, we have:
...@@ -193,7 +193,7 @@ As illustrated in the figure above, skip-gram model maps the word embedding of t ...@@ -193,7 +193,7 @@ As illustrated in the figure above, skip-gram model maps the word embedding of t
## Dataset ## Dataset
We will use Penn Treebank (PTB) (Tomas Mikolov's pre-processed version) dataset. PTB is a small dataset, used in Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows: We will use Penn Treebank (PTB) (Tomas Mikolov's pre-processed version) dataset. PTB is a small dataset, used in Recurrent Neural Network Language Modeling Toolkit\[[2](#references)\]. Its statistics are as follows:
<p align="center"> <p align="center">
<table> <table>
...@@ -481,7 +481,7 @@ This chapter introduces word embeddings, the relationship between language model ...@@ -481,7 +481,7 @@ This chapter introduces word embeddings, the relationship between language model
In information retrieval, the relevance between the query and document keyword can be computed through the cosine similarity of their word embeddings. In grammar analysis and semantic analysis, a previously trained word embedding can initialize models for better performance. In document classification, clustering the word embedding can group synonyms in the documents. We hope that readers can use word embedding models in their work after reading this chapter. In information retrieval, the relevance between the query and document keyword can be computed through the cosine similarity of their word embeddings. In grammar analysis and semantic analysis, a previously trained word embedding can initialize models for better performance. In document classification, clustering the word embedding can group synonyms in the documents. We hope that readers can use word embedding models in their work after reading this chapter.
## Referenes ## References
1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. journal of machine learning research, 2003, 3(Feb): 1137-1155. 1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. journal of machine learning research, 2003, 3(Feb): 1137-1155.
2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201. 2. Mikolov T, Kombrink S, Deoras A, et al. [Rnnlm-recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013. 3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013.
......
...@@ -7,18 +7,18 @@ The source code from this tutorial is at [here](https://github.com/PaddlePaddle/ ...@@ -7,18 +7,18 @@ The source code from this tutorial is at [here](https://github.com/PaddlePaddle/
The recommender system is a component of e-commerce, online videos, and online reading services. There are several different approaches for recommender systems to learn from user behavior and product properties and to understand users' interests. The recommender system is a component of e-commerce, online videos, and online reading services. There are several different approaches for recommender systems to learn from user behavior and product properties and to understand users' interests.
- User behavior-based approach. A well-known method of this approach is collaborative filtering, which assumes that if two users made similar purchases, they share common interests and would likely go on making the same decision. Some variants of collaborative filtering are user-based[[3](#reference)], item-based [[4](#reference)], social network based[[5](#reference)], and model-based. - User behavior-based approach. A well-known method of this approach is collaborative filtering, which assumes that if two users made similar purchases, they share common interests and would likely go on making the same decision. Some variants of collaborative filtering are user-based[[3](#references)], item-based [[4](#references)], social network based[[5](#references)], and model-based.
- Content-based approach[[1](#reference)]. This approach represents product properties and user interests as feature vectors of the same space so that it could measure how much a user is interested in a product by the distance between two feature vectors. - Content-based approach[[1](#references)]. This approach represents product properties and user interests as feature vectors of the same space so that it could measure how much a user is interested in a product by the distance between two feature vectors.
- Hybrid approach[[2](#reference)]: This one combines above two to help with each other about the data sparsity problem[[6](#reference)]. - Hybrid approach[[2](#references)]: This one combines above two to help with each other about the data sparsity problem[[6](#references)].
This tutorial explains a deep learning based hybrid approach and its implement in PaddlePaddle. We are going to train a model using a dataset that includes user information, movie information, and ratings. Once we train the model, we will be able to get a predicted rating given a pair of user and movie IDs. This tutorial explains a deep learning based hybrid approach and its implement in PaddlePaddle. We are going to train a model using a dataset that includes user information, movie information, and ratings. Once we train the model, we will be able to get a predicted rating given a pair of user and movie IDs.
## Model Overview ## Model Overview
To know more about deep learning based recommendation, let us start from going over the Youtube recommender system[[7](#reference)] before introducing our hybrid model. To know more about deep learning based recommendation, let us start from going over the Youtube recommender system[[7](#references)] before introducing our hybrid model.
### YouTube's Deep Learning Recommendation Model ### YouTube's Deep Learning Recommendation Model
...@@ -32,14 +32,14 @@ Figure 1. YouTube recommender system overview. ...@@ -32,14 +32,14 @@ Figure 1. YouTube recommender system overview.
#### Candidate Generation Network #### Candidate Generation Network
Youtube models candidate generation as a multiclass classification problem with a huge number of classes equal to the number of videos. The architecture of the model is as follows: YouTube models candidate generation as a multi-class classification problem with a huge number of classes equal to the number of videos. The architecture of the model is as follows:
<p align="center"> <p align="center">
<img src="image/Deep_candidate_generation_model_architecture.en.png" width="70%" ><br/> <img src="image/Deep_candidate_generation_model_architecture.en.png" width="70%" ><br/>
Figure 2. Deep candidate generation model. Figure 2. Deep candidate generation model.
</p> </p>
The first stage of this model maps watching history and search queries into fixed-length representative features. Then, an MLP (multi-layer perceptron, as described in the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial) takes the concatenation of all representative vectors. The output of the MLP represents the user' *intrinsic interests*. At training time, it is used together with a softmax output layer for minimizing the classification error. At serving time, it is used to compute the relevance of the user with all movies. The first stage of this model maps watching history and search queries into fixed-length representative features. Then, an MLP (multi-layer Perceptron, as described in the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial) takes the concatenation of all representative vectors. The output of the MLP represents the user' *intrinsic interests*. At training time, it is used together with a softmax output layer for minimizing the classification error. At serving time, it is used to compute the relevance of the user with all movies.
For a user $U$, the predicted watching probability of video $i$ is For a user $U$, the predicted watching probability of video $i$ is
...@@ -378,7 +378,7 @@ trainer.train( ...@@ -378,7 +378,7 @@ trainer.train(
This tutorial goes over traditional approaches in recommender system and a deep learning based approach. We also show that how to train and use the model with PaddlePaddle. Deep learning has been well used in computer vision and NLP, we look forward to its new successes in recommender systems. This tutorial goes over traditional approaches in recommender system and a deep learning based approach. We also show that how to train and use the model with PaddlePaddle. Deep learning has been well used in computer vision and NLP, we look forward to its new successes in recommender systems.
## Reference ## References
1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325. 1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2. 2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
......
...@@ -49,18 +49,18 @@ The source code from this tutorial is at [here](https://github.com/PaddlePaddle/ ...@@ -49,18 +49,18 @@ The source code from this tutorial is at [here](https://github.com/PaddlePaddle/
The recommender system is a component of e-commerce, online videos, and online reading services. There are several different approaches for recommender systems to learn from user behavior and product properties and to understand users' interests. The recommender system is a component of e-commerce, online videos, and online reading services. There are several different approaches for recommender systems to learn from user behavior and product properties and to understand users' interests.
- User behavior-based approach. A well-known method of this approach is collaborative filtering, which assumes that if two users made similar purchases, they share common interests and would likely go on making the same decision. Some variants of collaborative filtering are user-based[[3](#reference)], item-based [[4](#reference)], social network based[[5](#reference)], and model-based. - User behavior-based approach. A well-known method of this approach is collaborative filtering, which assumes that if two users made similar purchases, they share common interests and would likely go on making the same decision. Some variants of collaborative filtering are user-based[[3](#references)], item-based [[4](#references)], social network based[[5](#references)], and model-based.
- Content-based approach[[1](#reference)]. This approach represents product properties and user interests as feature vectors of the same space so that it could measure how much a user is interested in a product by the distance between two feature vectors. - Content-based approach[[1](#references)]. This approach represents product properties and user interests as feature vectors of the same space so that it could measure how much a user is interested in a product by the distance between two feature vectors.
- Hybrid approach[[2](#reference)]: This one combines above two to help with each other about the data sparsity problem[[6](#reference)]. - Hybrid approach[[2](#references)]: This one combines above two to help with each other about the data sparsity problem[[6](#references)].
This tutorial explains a deep learning based hybrid approach and its implement in PaddlePaddle. We are going to train a model using a dataset that includes user information, movie information, and ratings. Once we train the model, we will be able to get a predicted rating given a pair of user and movie IDs. This tutorial explains a deep learning based hybrid approach and its implement in PaddlePaddle. We are going to train a model using a dataset that includes user information, movie information, and ratings. Once we train the model, we will be able to get a predicted rating given a pair of user and movie IDs.
## Model Overview ## Model Overview
To know more about deep learning based recommendation, let us start from going over the Youtube recommender system[[7](#reference)] before introducing our hybrid model. To know more about deep learning based recommendation, let us start from going over the Youtube recommender system[[7](#references)] before introducing our hybrid model.
### YouTube's Deep Learning Recommendation Model ### YouTube's Deep Learning Recommendation Model
...@@ -74,14 +74,14 @@ Figure 1. YouTube recommender system overview. ...@@ -74,14 +74,14 @@ Figure 1. YouTube recommender system overview.
#### Candidate Generation Network #### Candidate Generation Network
Youtube models candidate generation as a multiclass classification problem with a huge number of classes equal to the number of videos. The architecture of the model is as follows: YouTube models candidate generation as a multi-class classification problem with a huge number of classes equal to the number of videos. The architecture of the model is as follows:
<p align="center"> <p align="center">
<img src="image/Deep_candidate_generation_model_architecture.en.png" width="70%" ><br/> <img src="image/Deep_candidate_generation_model_architecture.en.png" width="70%" ><br/>
Figure 2. Deep candidate generation model. Figure 2. Deep candidate generation model.
</p> </p>
The first stage of this model maps watching history and search queries into fixed-length representative features. Then, an MLP (multi-layer perceptron, as described in the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial) takes the concatenation of all representative vectors. The output of the MLP represents the user' *intrinsic interests*. At training time, it is used together with a softmax output layer for minimizing the classification error. At serving time, it is used to compute the relevance of the user with all movies. The first stage of this model maps watching history and search queries into fixed-length representative features. Then, an MLP (multi-layer Perceptron, as described in the [Recognize Digits](https://github.com/PaddlePaddle/book/blob/develop/recognize_digits/README.md) tutorial) takes the concatenation of all representative vectors. The output of the MLP represents the user' *intrinsic interests*. At training time, it is used together with a softmax output layer for minimizing the classification error. At serving time, it is used to compute the relevance of the user with all movies.
For a user $U$, the predicted watching probability of video $i$ is For a user $U$, the predicted watching probability of video $i$ is
...@@ -420,7 +420,7 @@ trainer.train( ...@@ -420,7 +420,7 @@ trainer.train(
This tutorial goes over traditional approaches in recommender system and a deep learning based approach. We also show that how to train and use the model with PaddlePaddle. Deep learning has been well used in computer vision and NLP, we look forward to its new successes in recommender systems. This tutorial goes over traditional approaches in recommender system and a deep learning based approach. We also show that how to train and use the model with PaddlePaddle. Deep learning has been well used in computer vision and NLP, we look forward to its new successes in recommender systems.
## Reference ## References
1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325. 1. [Peter Brusilovsky](https://en.wikipedia.org/wiki/Peter_Brusilovsky) (2007). *The Adaptive Web*. p. 325.
2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2. 2. Robin Burke , [Hybrid Web Recommender Systems](http://www.dcs.warwick.ac.uk/~acristea/courses/CS411/2010/Book%20-%20The%20Adaptive%20Web/HybridWebRecommenderSystems.pdf), pp. 377-408, The Adaptive Web, Peter Brusilovsky, Alfred Kobsa, Wolfgang Nejdl (Ed.), Lecture Notes in Computer Science, Springer-Verlag, Berlin, Germany, Lecture Notes in Computer Science, Vol. 4321, May 2007, 978-3-540-72078-2.
......
<!DOCTYPE html><html><head><meta charset="utf-8"><style>body {
width: 45em;
border: 1px solid #ddd;
outline: 1300px solid #fff;
margin: 16px auto;
}
body .markdown-body
{
padding: 30px;
}
@font-face {
font-family: fontawesome-mini;
src: url(data:font/woff;charset=utf-8;base64,d09GRgABAAAAAAzUABAAAAAAFNgAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABGRlRNAAABbAAAABwAAAAcZMzaOEdERUYAAAGIAAAAHQAAACAAOQAET1MvMgAAAagAAAA+AAAAYHqhde9jbWFwAAAB6AAAAFIAAAFa4azkLWN2dCAAAAI8AAAAKAAAACgFgwioZnBnbQAAAmQAAAGxAAACZVO0L6dnYXNwAAAEGAAAAAgAAAAIAAAAEGdseWYAAAQgAAAFDgAACMz7eroHaGVhZAAACTAAAAAwAAAANgWEOEloaGVhAAAJYAAAAB0AAAAkDGEGa2htdHgAAAmAAAAAEwAAADBEgAAQbG9jYQAACZQAAAAaAAAAGgsICJBtYXhwAAAJsAAAACAAAAAgASgBD25hbWUAAAnQAAACZwAABOD4no+3cG9zdAAADDgAAABsAAAAmF+yXM9wcmVwAAAMpAAAAC4AAAAusPIrFAAAAAEAAAAAyYlvMQAAAADLVHQgAAAAAM/u9uZ4nGNgZGBg4ANiCQYQYGJgBEJuIGYB8xgABMMAPgAAAHicY2Bm42OcwMDKwMLSw2LMwMDQBqGZihmiwHycoKCyqJjB4YPDh4NsDP+BfNb3DIuAFCOSEgUGRgAKDgt4AAB4nGNgYGBmgGAZBkYGEAgB8hjBfBYGCyDNxcDBwMTA9MHhQ9SHrA8H//9nYACyQyFs/sP86/kX8HtB9UIBIxsDXICRCUgwMaACRoZhDwA3fxKSAAAAAAHyAHABJQB/AIEAdAFGAOsBIwC/ALgAxACGAGYAugBNACcA/wCIeJxdUbtOW0EQ3Q0PA4HE2CA52hSzmZDGe6EFCcTVjWJkO4XlCGk3cpGLcQEfQIFEDdqvGaChpEibBiEXSHxCPiESM2uIojQ7O7NzzpkzS8qRqnfpa89T5ySQwt0GzTb9Tki1swD3pOvrjYy0gwdabGb0ynX7/gsGm9GUO2oA5T1vKQ8ZTTuBWrSn/tH8Cob7/B/zOxi0NNP01DoJ6SEE5ptxS4PvGc26yw/6gtXhYjAwpJim4i4/plL+tzTnasuwtZHRvIMzEfnJNEBTa20Emv7UIdXzcRRLkMumsTaYmLL+JBPBhcl0VVO1zPjawV2ys+hggyrNgQfYw1Z5DB4ODyYU0rckyiwNEfZiq8QIEZMcCjnl3Mn+pED5SBLGvElKO+OGtQbGkdfAoDZPs/88m01tbx3C+FkcwXe/GUs6+MiG2hgRYjtiKYAJREJGVfmGGs+9LAbkUvvPQJSA5fGPf50ItO7YRDyXtXUOMVYIen7b3PLLirtWuc6LQndvqmqo0inN+17OvscDnh4Lw0FjwZvP+/5Kgfo8LK40aA4EQ3o3ev+iteqIq7wXPrIn07+xWgAAAAABAAH//wAPeJyFlctvG1UUh+/12DPN1B7P3JnYjj2Ox4/MuDHxJH5N3UdaEUQLqBIkfQQioJWQ6AMEQkIqsPGCPwA1otuWSmTBhjtps2ADWbJg3EpIXbGouqSbCraJw7kzNo2dRN1cnXN1ZvT7zuuiMEI7ncizyA0URofRBJpCdbQuIFShYY+GZRrxMDVtih5TwQPHtXDFFSIKoWIbuREBjLH27Ny4MsbVx+uOJThavebgVrNRLAiYx06rXsvhxLgWx9xpfHdrs/ekc2Pl2cpPCVEITQpwbj8VQhfXSq2m+Wxqaq2D73Kne5e3NjHqQNj3CRYlJlgUl/jRNP+2Gs2pNYRQiOnmUaQDqm30KqKiTTWPWjboxnTWpvgxjXo0KrtZXAHt7hwIz0YVcj88JnKlJKi3NPAwLyDwZudSmJSMMJFDYaOkaol6XtESx3Gt1VTytdZJ3DCLeaVhVnCBH1fycHTxFXwPX+l2e3d6H/TufGGmMTLTnbSJUdo00zuBswMO/nl3YLeL/wnu9/limCuD3vC54h5NBVz6Li414AI8Vx3iiosKcQXUbrvhFFiYb++HN4DaF4XzFW0fIN4XDWJ3a3XQoq9V8WiyRmdsatV9xUcHims1JloH0YUa090G3Tro3mC6c01f+YwCPquINr1PTaCP6rVTOOmf0GE2dBc7zWIhji3/5MchSuBHgDbU99RMWt3YUNMZMJmx92YP6NsHx/5/M1yvInpnkIOM3Z8fA3JQ2lW1RFC1KaBPDFXNAHYYvGy73aYZZZ3HifbeuiVZCpwA3oQBs0wGPYJbJfg60xrKEbKiNtTe1adwrpBRwlAuQ3q3VRaX0QmQ9a49BTSCuF1MLfQ6+tinOubRBZuWPNoMevGMT+V41KitO1is3D/tpMcq1JHZqDHGs8DoYGDkxJgKjHROeTCmhZvzPm9pod+ltKm4PN7Dyvvldlpsg8D+4AUJZ3F/JBstZz7cbFRxsaAGV6yX/dkcycWf8eS3QlQea+YLjdm3yrOnrhFpUyKVvFE4lpv4bO3Svx/6F/4xmiDu/RT5iI++lko18mY1oX+5UGKR6kmVjM/Zb76yfHtxy+h/SyQ0lLdpdKy/lWB6szatetQJ8nZ80A2Qt6ift6gJeavU3BO4gtxs/KCtNPVibCtYCWY3SIlSBPKXZALXiIR9oZeJ1AuMyxLpHIy/yO7vSiSE+kZvk0ihJ30HgHfzZtEMmvV58x6dtqns0XTAW7Vdm4HJ04OCp/crOO7rd9SGxQAE/mVA9xRN+kVSMRFF6S9JFGUtthkjBA5tFCWc2l4V43Ex9GmUP3SI37Jjmir9KqlaDJ4S4JB3vuM/jzyH1+8MuoZ+QGzfnvPoJb96cZlWjMcKLfgDwB7E634JTY+asjsPzS5CiVnEWY+KsrsIN5rn3mAPjqmQBxGjcGKB9f9ZxY3mYC2L85CJ2FXIxKKyHk+dg0FHbuEc7D5NzWUX32WxFcWNGRAbvwSx0RmIXVDuYySafluQBmzA/ssqJAMLnli+WIC90Gw4lm85wcp0qjArEDPJJV/sSx4P9ungTpgMw5gVC1XO4uULq0s3v1rqLi0vX/z65vlH50f8T/RHmSPTk5xxWBWOluMT6WiOy+tdvWxlV/XQb3o3c6Ssr+r6I708GsX9/nzp1tKFh0s3v7m4vAy/Hnb/KMOvc1wump6Il48K6mGDy02X9Yd65pa+nQIjk76lWxCkG8NBCP0HQS9IpAAAeJxjYGRgYGBhcCrq214Qz2/zlUGenQEEzr/77oug/zewFbB+AHI5GJhAogBwKQ0qeJxjYGRgYH3/P46BgZ0BBNgKGBgZUAEPAE/7At0AAAB4nGNngAB2IGYjhBsYBAAIYADVAAAAAAAAAAAAAFwAyAEeAaACCgKmAx4DggRmAAAAAQAAAAwAagAEAAAAAAACAAEAAgAWAAABAAChAAAAAHiclZI7bxQxFIWPd/JkUYQChEhIyAVKgdBMskm1QkKrRETpQiLRUczueB/K7HhlOxttg8LvoKPgP9DxFxANDR0tHRWi4NjrPIBEgh1p/dm+vufcawNYFWsQmP6e4jSyQB2fI9cwj++RE9wTjyPP4LYoI89iWbyLPIe6+Bh5Hs9rryMv4GbtW+RF3EhuRa7jbrIbeQkPkjdUETOLnL0Kip4FVvAhco1RXyMnSPEz8gzWxE7kWTwUp5HnsCLeR57HW/El8gJWa58iL+JO7UfkOh4l9yMv4UnyEtvQGGECgwF66MNBooF1bGCL1ELB/TYU+ZBRlvsKQ44Se6jQ4a7hef+fh72Crv25kp+8lNWGmeKoOI5jJLb1aGIGvb6TjfWNLdkqdFvJw4l1amjlXtXRZqRN7lSRylZZyhBqpVFWmTEXgWfUrpi/hZOQXdOd4rKuXOtEWT3k5IArPRzTUU5tHKjecZkTpnVbNOnt6jzN8240GD4xtikvZW56043rPMg/dS+dlOceXoR+WPbJ55Dsekq1lJpnypsMUsYOdCW30o103Ytu/lvh+5RWFLfBjm9/N8hJntPhvx92rnoE/kyHdGasGy754kw36vsVf/lFeBi+0COu+cfgQr42G3CRpeLoZ53gmfe3X6rcKt5oVxnptHR9JS8ehVUd5wvvahN2uqxOOpMXapibI5k7Zwbt4xBSaTfoKBufhAnO/uqNcfK8OTs0OQ6l7JIqFjDhYj5WcjevCnI/1DDiI8j4ndWb/5YzDZWh79yomWXeXj7Nnw70/2TIeFPTrlSh89k1ObOSRVZWZfgF0r/zJQB4nG2JUQuCQBCEd07TTg36fb2IyBaLd3vWaUh/vmSJnvpgmG8YcmS8X3Shf3R7QA4OBUocUKHGER5NNbOOEvwc1txnuWkTRb/aPjimJ5vXabI+3VfOiyS15UWvyezM2xiGOPyuMohOH8O8JiO4Af+FsAGNAEuwCFBYsQEBjlmxRgYrWCGwEFlLsBRSWCGwgFkdsAYrXFhZsBQrAAA=) format('woff');
}
@font-face {
font-family: octicons-anchor;
src: url(data:font/woff;charset=utf-8;base64,d09GRgABAAAAAAYcAA0AAAAACjQAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABGRlRNAAABMAAAABwAAAAca8vGTk9TLzIAAAFMAAAARAAAAFZG1VHVY21hcAAAAZAAAAA+AAABQgAP9AdjdnQgAAAB0AAAAAQAAAAEACICiGdhc3AAAAHUAAAACAAAAAj//wADZ2x5ZgAAAdwAAADRAAABEKyikaNoZWFkAAACsAAAAC0AAAA2AtXoA2hoZWEAAALgAAAAHAAAACQHngNFaG10eAAAAvwAAAAQAAAAEAwAACJsb2NhAAADDAAAAAoAAAAKALIAVG1heHAAAAMYAAAAHwAAACABEAB2bmFtZQAAAzgAAALBAAAFu3I9x/Nwb3N0AAAF/AAAAB0AAAAvaoFvbwAAAAEAAAAAzBdyYwAAAADP2IQvAAAAAM/bz7t4nGNgZGFgnMDAysDB1Ml0hoGBoR9CM75mMGLkYGBgYmBlZsAKAtJcUxgcPsR8iGF2+O/AEMPsznAYKMwIkgMA5REMOXicY2BgYGaAYBkGRgYQsAHyGMF8FgYFIM0ChED+h5j//yEk/3KoSgZGNgYYk4GRCUgwMaACRoZhDwCs7QgGAAAAIgKIAAAAAf//AAJ4nHWMMQrCQBBF/0zWrCCIKUQsTDCL2EXMohYGSSmorScInsRGL2DOYJe0Ntp7BK+gJ1BxF1stZvjz/v8DRghQzEc4kIgKwiAppcA9LtzKLSkdNhKFY3HF4lK69ExKslx7Xa+vPRVS43G98vG1DnkDMIBUgFN0MDXflU8tbaZOUkXUH0+U27RoRpOIyCKjbMCVejwypzJJG4jIwb43rfl6wbwanocrJm9XFYfskuVC5K/TPyczNU7b84CXcbxks1Un6H6tLH9vf2LRnn8Ax7A5WQAAAHicY2BkYGAA4teL1+yI57f5ysDNwgAC529f0kOmWRiYVgEpDgYmEA8AUzEKsQAAAHicY2BkYGB2+O/AEMPCAAJAkpEBFbAAADgKAe0EAAAiAAAAAAQAAAAEAAAAAAAAKgAqACoAiAAAeJxjYGRgYGBhsGFgYgABEMkFhAwM/xn0QAIAD6YBhwB4nI1Ty07cMBS9QwKlQapQW3VXySvEqDCZGbGaHULiIQ1FKgjWMxknMfLEke2A+IJu+wntrt/QbVf9gG75jK577Lg8K1qQPCfnnnt8fX1NRC/pmjrk/zprC+8D7tBy9DHgBXoWfQ44Av8t4Bj4Z8CLtBL9CniJluPXASf0Lm4CXqFX8Q84dOLnMB17N4c7tBo1AS/Qi+hTwBH4rwHHwN8DXqQ30XXAS7QaLwSc0Gn8NuAVWou/gFmnjLrEaEh9GmDdDGgL3B4JsrRPDU2hTOiMSuJUIdKQQayiAth69r6akSSFqIJuA19TrzCIaY8sIoxyrNIrL//pw7A2iMygkX5vDj+G+kuoLdX4GlGK/8Lnlz6/h9MpmoO9rafrz7ILXEHHaAx95s9lsI7AHNMBWEZHULnfAXwG9/ZqdzLI08iuwRloXE8kfhXYAvE23+23DU3t626rbs8/8adv+9DWknsHp3E17oCf+Z48rvEQNZ78paYM38qfk3v/u3l3u3GXN2Dmvmvpf1Srwk3pB/VSsp512bA/GG5i2WJ7wu430yQ5K3nFGiOqgtmSB5pJVSizwaacmUZzZhXLlZTq8qGGFY2YcSkqbth6aW1tRmlaCFs2016m5qn36SbJrqosG4uMV4aP2PHBmB3tjtmgN2izkGQyLWprekbIntJFing32a5rKWCN/SdSoga45EJykyQ7asZvHQ8PTm6cslIpwyeyjbVltNikc2HTR7YKh9LBl9DADC0U/jLcBZDKrMhUBfQBvXRzLtFtjU9eNHKin0x5InTqb8lNpfKv1s1xHzTXRqgKzek/mb7nB8RZTCDhGEX3kK/8Q75AmUM/eLkfA+0Hi908Kx4eNsMgudg5GLdRD7a84npi+YxNr5i5KIbW5izXas7cHXIMAau1OueZhfj+cOcP3P8MNIWLyYOBuxL6DRylJ4cAAAB4nGNgYoAALjDJyIAOWMCiTIxMLDmZedkABtIBygAAAA==) format('woff');
}
.markdown-body {
font-family: sans-serif;
-ms-text-size-adjust: 100%;
-webkit-text-size-adjust: 100%;
color: #333333;
overflow: hidden;
font-family: "Helvetica Neue", Helvetica, "Segoe UI", Arial, freesans, sans-serif;
font-size: 16px;
line-height: 1.6;
word-wrap: break-word;
}
.markdown-body a {
background: transparent;
}
.markdown-body a:active,
.markdown-body a:hover {
outline: 0;
}
.markdown-body b,
.markdown-body strong {
font-weight: bold;
}
.markdown-body mark {
background: #ff0;
color: #000;
font-style: italic;
font-weight: bold;
}
.markdown-body sub,
.markdown-body sup {
font-size: 75%;
line-height: 0;
position: relative;
vertical-align: baseline;
}
.markdown-body sup {
top: -0.5em;
}
.markdown-body sub {
bottom: -0.25em;
}
.markdown-body h1 {
font-size: 2em;
margin: 0.67em 0;
}
.markdown-body img {
border: 0;
}
.markdown-body hr {
-moz-box-sizing: content-box;
box-sizing: content-box;
height: 0;
}
.markdown-body pre {
overflow: auto;
}
.markdown-body code,
.markdown-body kbd,
.markdown-body pre,
.markdown-body samp {
font-family: monospace, monospace;
font-size: 1em;
}
.markdown-body input {
color: inherit;
font: inherit;
margin: 0;
}
.markdown-body html input[disabled] {
cursor: default;
}
.markdown-body input {
line-height: normal;
}
.markdown-body input[type="checkbox"] {
box-sizing: border-box;
padding: 0;
}
.markdown-body table {
border-collapse: collapse;
border-spacing: 0;
}
.markdown-body td,
.markdown-body th {
padding: 0;
}
.markdown-body .codehilitetable {
border: 0;
border-spacing: 0;
}
.markdown-body .codehilitetable tr {
border: 0;
}
.markdown-body .codehilitetable pre,
.markdown-body .codehilitetable div.codehilite {
margin: 0;
}
.markdown-body .linenos,
.markdown-body .code,
.markdown-body .codehilitetable td {
border: 0;
padding: 0;
}
.markdown-body td:not(.linenos) .linenodiv {
padding: 0 !important;
}
.markdown-body .code {
width: 100%;
}
.markdown-body .linenos div pre,
.markdown-body .linenodiv pre,
.markdown-body .linenodiv {
border: 0;
-webkit-border-radius: 0;
-moz-border-radius: 0;
border-radius: 0;
-webkit-border-top-left-radius: 3px;
-webkit-border-bottom-left-radius: 3px;
-moz-border-radius-topleft: 3px;
-moz-border-radius-bottomleft: 3px;
border-top-left-radius: 3px;
border-bottom-left-radius: 3px;
}
.markdown-body .code div pre,
.markdown-body .code div {
border: 0;
-webkit-border-radius: 0;
-moz-border-radius: 0;
border-radius: 0;
-webkit-border-top-right-radius: 3px;
-webkit-border-bottom-right-radius: 3px;
-moz-border-radius-topright: 3px;
-moz-border-radius-bottomright: 3px;
border-top-right-radius: 3px;
border-bottom-right-radius: 3px;
}
.markdown-body * {
-moz-box-sizing: border-box;
box-sizing: border-box;
}
.markdown-body input {
font: 13px Helvetica, arial, freesans, clean, sans-serif, "Segoe UI Emoji", "Segoe UI Symbol";
line-height: 1.4;
}
.markdown-body a {
color: #4183c4;
text-decoration: none;
}
.markdown-body a:hover,
.markdown-body a:focus,
.markdown-body a:active {
text-decoration: underline;
}
.markdown-body hr {
height: 0;
margin: 15px 0;
overflow: hidden;
background: transparent;
border: 0;
border-bottom: 1px solid #ddd;
}
.markdown-body hr:before,
.markdown-body hr:after {
display: table;
content: " ";
}
.markdown-body hr:after {
clear: both;
}
.markdown-body h1,
.markdown-body h2,
.markdown-body h3,
.markdown-body h4,
.markdown-body h5,
.markdown-body h6 {
margin-top: 15px;
margin-bottom: 15px;
line-height: 1.1;
}
.markdown-body h1 {
font-size: 30px;
}
.markdown-body h2 {
font-size: 21px;
}
.markdown-body h3 {
font-size: 16px;
}
.markdown-body h4 {
font-size: 14px;
}
.markdown-body h5 {
font-size: 12px;
}
.markdown-body h6 {
font-size: 11px;
}
.markdown-body blockquote {
margin: 0;
}
.markdown-body ul,
.markdown-body ol {
padding: 0;
margin-top: 0;
margin-bottom: 0;
}
.markdown-body ol ol,
.markdown-body ul ol {
list-style-type: lower-roman;
}
.markdown-body ul ul ol,
.markdown-body ul ol ol,
.markdown-body ol ul ol,
.markdown-body ol ol ol {
list-style-type: lower-alpha;
}
.markdown-body dd {
margin-left: 0;
}
.markdown-body code,
.markdown-body pre,
.markdown-body samp {
font-family: Consolas, "Liberation Mono", Menlo, Courier, monospace;
font-size: 12px;
}
.markdown-body pre {
margin-top: 0;
margin-bottom: 0;
}
.markdown-body kbd {
background-color: #e7e7e7;
background-image: -moz-linear-gradient(#fefefe, #e7e7e7);
background-image: -webkit-linear-gradient(#fefefe, #e7e7e7);
background-image: linear-gradient(#fefefe, #e7e7e7);
background-repeat: repeat-x;
border-radius: 2px;
border: 1px solid #cfcfcf;
color: #000;
padding: 3px 5px;
line-height: 10px;
font: 11px Consolas, "Liberation Mono", Menlo, Courier, monospace;
display: inline-block;
}
.markdown-body>*:first-child {
margin-top: 0 !important;
}
.markdown-body>*:last-child {
margin-bottom: 0 !important;
}
.markdown-body .headeranchor-link {
position: absolute;
top: 0;
bottom: 0;
left: 0;
display: block;
padding-right: 6px;
padding-left: 30px;
margin-left: -30px;
}
.markdown-body .headeranchor-link:focus {
outline: none;
}
.markdown-body h1,
.markdown-body h2,
.markdown-body h3,
.markdown-body h4,
.markdown-body h5,
.markdown-body h6 {
position: relative;
margin-top: 1em;
margin-bottom: 16px;
font-weight: bold;
line-height: 1.4;
}
.markdown-body h1 .headeranchor,
.markdown-body h2 .headeranchor,
.markdown-body h3 .headeranchor,
.markdown-body h4 .headeranchor,
.markdown-body h5 .headeranchor,
.markdown-body h6 .headeranchor {
display: none;
color: #000;
vertical-align: middle;
}
.markdown-body h1:hover .headeranchor-link,
.markdown-body h2:hover .headeranchor-link,
.markdown-body h3:hover .headeranchor-link,
.markdown-body h4:hover .headeranchor-link,
.markdown-body h5:hover .headeranchor-link,
.markdown-body h6:hover .headeranchor-link {
height: 1em;
padding-left: 8px;
margin-left: -30px;
line-height: 1;
text-decoration: none;
}
.markdown-body h1:hover .headeranchor-link .headeranchor,
.markdown-body h2:hover .headeranchor-link .headeranchor,
.markdown-body h3:hover .headeranchor-link .headeranchor,
.markdown-body h4:hover .headeranchor-link .headeranchor,
.markdown-body h5:hover .headeranchor-link .headeranchor,
.markdown-body h6:hover .headeranchor-link .headeranchor {
display: inline-block;
}
.markdown-body h1 {
padding-bottom: 0.3em;
font-size: 2.25em;
line-height: 1.2;
border-bottom: 1px solid #eee;
}
.markdown-body h2 {
padding-bottom: 0.3em;
font-size: 1.75em;
line-height: 1.225;
border-bottom: 1px solid #eee;
}
.markdown-body h3 {
font-size: 1.5em;
line-height: 1.43;
}
.markdown-body h4 {
font-size: 1.25em;
}
.markdown-body h5 {
font-size: 1em;
}
.markdown-body h6 {
font-size: 1em;
color: #777;
}
.markdown-body p,
.markdown-body blockquote,
.markdown-body ul,
.markdown-body ol,
.markdown-body dl,
.markdown-body table,
.markdown-body pre,
.markdown-body .admonition {
margin-top: 0;
margin-bottom: 16px;
}
.markdown-body hr {
height: 4px;
padding: 0;
margin: 16px 0;
background-color: #e7e7e7;
border: 0 none;
}
.markdown-body ul,
.markdown-body ol {
padding-left: 2em;
}
.markdown-body ul ul,
.markdown-body ul ol,
.markdown-body ol ol,
.markdown-body ol ul {
margin-top: 0;
margin-bottom: 0;
}
.markdown-body li>p {
margin-top: 16px;
}
.markdown-body dl {
padding: 0;
}
.markdown-body dl dt {
padding: 0;
margin-top: 16px;
font-size: 1em;
font-style: italic;
font-weight: bold;
}
.markdown-body dl dd {
padding: 0 16px;
margin-bottom: 16px;
}
.markdown-body blockquote {
padding: 0 15px;
color: #777;
border-left: 4px solid #ddd;
}
.markdown-body blockquote>:first-child {
margin-top: 0;
}
.markdown-body blockquote>:last-child {
margin-bottom: 0;
}
.markdown-body table {
display: block;
width: 100%;
overflow: auto;
word-break: normal;
word-break: keep-all;
}
.markdown-body table th {
font-weight: bold;
}
.markdown-body table th,
.markdown-body table td {
padding: 6px 13px;
border: 1px solid #ddd;
}
.markdown-body table tr {
background-color: #fff;
border-top: 1px solid #ccc;
}
.markdown-body table tr:nth-child(2n) {
background-color: #f8f8f8;
}
.markdown-body img {
max-width: 100%;
-moz-box-sizing: border-box;
box-sizing: border-box;
}
.markdown-body code,
.markdown-body samp {
padding: 0;
padding-top: 0.2em;
padding-bottom: 0.2em;
margin: 0;
font-size: 85%;
background-color: rgba(0,0,0,0.04);
border-radius: 3px;
}
.markdown-body code:before,
.markdown-body code:after {
letter-spacing: -0.2em;
content: "\00a0";
}
.markdown-body pre>code {
padding: 0;
margin: 0;
font-size: 100%;
word-break: normal;
white-space: pre;
background: transparent;
border: 0;
}
.markdown-body .codehilite {
margin-bottom: 16px;
}
.markdown-body .codehilite pre,
.markdown-body pre {
padding: 16px;
overflow: auto;
font-size: 85%;
line-height: 1.45;
background-color: #f7f7f7;
border-radius: 3px;
}
.markdown-body .codehilite pre {
margin-bottom: 0;
word-break: normal;
}
.markdown-body pre {
word-wrap: normal;
}
.markdown-body pre code {
display: inline;
max-width: initial;
padding: 0;
margin: 0;
overflow: initial;
line-height: inherit;
word-wrap: normal;
background-color: transparent;
border: 0;
}
.markdown-body pre code:before,
.markdown-body pre code:after {
content: normal;
}
/* Admonition */
.markdown-body .admonition {
-webkit-border-radius: 3px;
-moz-border-radius: 3px;
position: relative;
border-radius: 3px;
border: 1px solid #e0e0e0;
border-left: 6px solid #333;
padding: 10px 10px 10px 30px;
}
.markdown-body .admonition table {
color: #333;
}
.markdown-body .admonition p {
padding: 0;
}
.markdown-body .admonition-title {
font-weight: bold;
margin: 0;
}
.markdown-body .admonition>.admonition-title {
color: #333;
}
.markdown-body .attention>.admonition-title {
color: #a6d796;
}
.markdown-body .caution>.admonition-title {
color: #d7a796;
}
.markdown-body .hint>.admonition-title {
color: #96c6d7;
}
.markdown-body .danger>.admonition-title {
color: #c25f77;
}
.markdown-body .question>.admonition-title {
color: #96a6d7;
}
.markdown-body .note>.admonition-title {
color: #d7c896;
}
.markdown-body .admonition:before,
.markdown-body .attention:before,
.markdown-body .caution:before,
.markdown-body .hint:before,
.markdown-body .danger:before,
.markdown-body .question:before,
.markdown-body .note:before {
font: normal normal 16px fontawesome-mini;
-moz-osx-font-smoothing: grayscale;
-webkit-user-select: none;
-moz-user-select: none;
-ms-user-select: none;
user-select: none;
line-height: 1.5;
color: #333;
position: absolute;
left: 0;
top: 0;
padding-top: 10px;
padding-left: 10px;
}
.markdown-body .admonition:before {
content: "\f056\00a0";
color: 333;
}
.markdown-body .attention:before {
content: "\f058\00a0";
color: #a6d796;
}
.markdown-body .caution:before {
content: "\f06a\00a0";
color: #d7a796;
}
.markdown-body .hint:before {
content: "\f05a\00a0";
color: #96c6d7;
}
.markdown-body .danger:before {
content: "\f057\00a0";
color: #c25f77;
}
.markdown-body .question:before {
content: "\f059\00a0";
color: #96a6d7;
}
.markdown-body .note:before {
content: "\f040\00a0";
color: #d7c896;
}
.markdown-body .admonition::after {
content: normal;
}
.markdown-body .attention {
border-left: 6px solid #a6d796;
}
.markdown-body .caution {
border-left: 6px solid #d7a796;
}
.markdown-body .hint {
border-left: 6px solid #96c6d7;
}
.markdown-body .danger {
border-left: 6px solid #c25f77;
}
.markdown-body .question {
border-left: 6px solid #96a6d7;
}
.markdown-body .note {
border-left: 6px solid #d7c896;
}
.markdown-body .admonition>*:first-child {
margin-top: 0 !important;
}
.markdown-body .admonition>*:last-child {
margin-bottom: 0 !important;
}
/* progress bar*/
.markdown-body .progress {
display: block;
width: 300px;
margin: 10px 0;
height: 24px;
-webkit-border-radius: 3px;
-moz-border-radius: 3px;
border-radius: 3px;
background-color: #ededed;
position: relative;
box-shadow: inset -1px 1px 3px rgba(0, 0, 0, .1);
}
.markdown-body .progress-label {
position: absolute;
text-align: center;
font-weight: bold;
width: 100%; margin: 0;
line-height: 24px;
color: #333;
text-shadow: 1px 1px 0 #fefefe, -1px -1px 0 #fefefe, -1px 1px 0 #fefefe, 1px -1px 0 #fefefe, 0 1px 0 #fefefe, 0 -1px 0 #fefefe, 1px 0 0 #fefefe, -1px 0 0 #fefefe, 1px 1px 2px #000;
-webkit-font-smoothing: antialiased !important;
white-space: nowrap;
overflow: hidden;
}
.markdown-body .progress-bar {
height: 24px;
float: left;
-webkit-border-radius: 3px;
-moz-border-radius: 3px;
border-radius: 3px;
background-color: #96c6d7;
box-shadow: inset 0 1px 0 rgba(255, 255, 255, .5), inset 0 -1px 0 rgba(0, 0, 0, .1);
background-size: 30px 30px;
background-image: -webkit-linear-gradient(
135deg, rgba(255, 255, 255, .4) 27%,
transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%,
transparent 77%, transparent
);
background-image: -moz-linear-gradient(
135deg,
rgba(255, 255, 255, .4) 27%, transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%, transparent 77%,
transparent
);
background-image: -ms-linear-gradient(
135deg,
rgba(255, 255, 255, .4) 27%, transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%, transparent 77%,
transparent
);
background-image: -o-linear-gradient(
135deg,
rgba(255, 255, 255, .4) 27%, transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%, transparent 77%,
transparent
);
background-image: linear-gradient(
135deg,
rgba(255, 255, 255, .4) 27%, transparent 27%,
transparent 52%, rgba(255, 255, 255, .4) 52%,
rgba(255, 255, 255, .4) 77%, transparent 77%,
transparent
);
}
.markdown-body .progress-100plus .progress-bar {
background-color: #a6d796;
}
.markdown-body .progress-80plus .progress-bar {
background-color: #c6d796;
}
.markdown-body .progress-60plus .progress-bar {
background-color: #d7c896;
}
.markdown-body .progress-40plus .progress-bar {
background-color: #d7a796;
}
.markdown-body .progress-20plus .progress-bar {
background-color: #d796a6;
}
.markdown-body .progress-0plus .progress-bar {
background-color: #c25f77;
}
.markdown-body .candystripe-animate .progress-bar{
-webkit-animation: animate-stripes 3s linear infinite;
-moz-animation: animate-stripes 3s linear infinite;
animation: animate-stripes 3s linear infinite;
}
@-webkit-keyframes animate-stripes {
0% {
background-position: 0 0;
}
100% {
background-position: 60px 0;
}
}
@-moz-keyframes animate-stripes {
0% {
background-position: 0 0;
}
100% {
background-position: 60px 0;
}
}
@keyframes animate-stripes {
0% {
background-position: 0 0;
}
100% {
background-position: 60px 0;
}
}
.markdown-body .gloss .progress-bar {
box-shadow:
inset 0 4px 12px rgba(255, 255, 255, .7),
inset 0 -12px 0 rgba(0, 0, 0, .05);
}
/* Multimarkdown Critic Blocks */
.markdown-body .critic_mark {
background: #ff0;
}
.markdown-body .critic_delete {
color: #c82829;
text-decoration: line-through;
}
.markdown-body .critic_insert {
color: #718c00 ;
text-decoration: underline;
}
.markdown-body .critic_comment {
color: #8e908c;
font-style: italic;
}
.markdown-body .headeranchor {
font: normal normal 16px octicons-anchor;
line-height: 1;
display: inline-block;
text-decoration: none;
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
-webkit-user-select: none;
-moz-user-select: none;
-ms-user-select: none;
user-select: none;
}
.headeranchor:before {
content: '\f05c';
}
.markdown-body .task-list-item {
list-style-type: none;
}
.markdown-body .task-list-item+.task-list-item {
margin-top: 3px;
}
.markdown-body .task-list-item input {
margin: 0 4px 0.25em -20px;
vertical-align: middle;
}
/* Media */
@media only screen and (min-width: 480px) {
.markdown-body {
font-size:14px;
}
}
@media only screen and (min-width: 768px) {
.markdown-body {
font-size:16px;
}
}
@media print {
.markdown-body * {
background: transparent !important;
color: black !important;
filter:none !important;
-ms-filter: none !important;
}
.markdown-body {
font-size:12pt;
max-width:100%;
outline:none;
border: 0;
}
.markdown-body a,
.markdown-body a:visited {
text-decoration: underline;
}
.markdown-body .headeranchor-link {
display: none;
}
.markdown-body a[href]:after {
content: " (" attr(href) ")";
}
.markdown-body abbr[title]:after {
content: " (" attr(title) ")";
}
.markdown-body .ir a:after,
.markdown-body a[href^="javascript:"]:after,
.markdown-body a[href^="#"]:after {
content: "";
}
.markdown-body pre {
white-space: pre;
white-space: pre-wrap;
word-wrap: break-word;
}
.markdown-body pre,
.markdown-body blockquote {
border: 1px solid #999;
padding-right: 1em;
page-break-inside: avoid;
}
.markdown-body .progress,
.markdown-body .progress-bar {
-moz-box-shadow: none;
-webkit-box-shadow: none;
box-shadow: none;
}
.markdown-body .progress {
border: 1px solid #ddd;
}
.markdown-body .progress-bar {
height: 22px;
border-right: 1px solid #ddd;
}
.markdown-body tr,
.markdown-body img {
page-break-inside: avoid;
}
.markdown-body img {
max-width: 100% !important;
}
.markdown-body p,
.markdown-body h2,
.markdown-body h3 {
orphans: 3;
widows: 3;
}
.markdown-body h2,
.markdown-body h3 {
page-break-after: avoid;
}
}
</style><title>README.en</title></head><body><article class="markdown-body"><h1 id="sentiment-analysis"><a name="user-content-sentiment-analysis" href="#sentiment-analysis" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Sentiment Analysis</h1>
<p>The source codes of this section can be located at <a href="https://github.com/PaddlePaddle/book/tree/develop/understand_sentiment">book/understand_sentiment</a>. First-time users may refer to PaddlePaddle for <a href="https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/build_and_install/docker_install_en.rst">Installation guide</a>.</p>
<h2 id="background"><a name="user-content-background" href="#background" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Background</h2>
<p>In natural language processing, sentiment analysis refers to determining the emotion expressed in a piece of text. The text can be a sentence, a paragraph, or a document. Emotion categorization can be binary &ndash; positive/negative or happy/sad &ndash; or in three classes &ndash; positive/neutral/negative. Sentiment analysis is applicable in a wide range of services, such as e-commerce sites like Amazon and Taobao, hospitality services like Airbnb and hotels.com, and movie rating sites like Rotten Tomatoes and IMDB. It can be used to gauge from the reviews how the customers feel about the product. Table 1 illustrates an example of sentiment analysis in movie reviews:</p>
<table>
<thead>
<tr>
<th>Movie Review</th>
<th>Category</th>
</tr>
</thead>
<tbody>
<tr>
<td>Best movie of Xiaogang Feng in recent years!</td>
<td>Positive</td>
</tr>
<tr>
<td>Pretty bad. Feels like a tv-series from a local TV-channel</td>
<td>Negative</td>
</tr>
<tr>
<td>Politically correct version of Taken &hellip; and boring as Heck</td>
<td>Negative</td>
</tr>
<tr>
<td>delightful, mesmerizing, and completely unexpected. The plot is nicely designed.</td>
<td>Positive</td>
</tr>
</tbody>
</table>
<p align="center">Table 1 Sentiment Analysis in Movie Reviews</p>
<p>In natural language processing, sentiment analysis can be categorized as a <strong>Text Classification problem</strong>, i.e., to categorize a piece of text to a specific class. It involves two related tasks: text representation and classification. Before the emergence of deep learning techniques, the mainstream methods for text representation include BOW (<em>bag of words</em>) and topic modeling, while the latter contain SVM (<em>support vector machine</em>) and LR (<em>logistic regression</em>).</p>
<p>The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad“ and “boring, dull, and empty work” describe very similar semantic meaning, yet their BOW representations have with little similarity. Furthermore, “the movie is bad“ and “the movie is not bad“ have high similarity with BOW features, but they express completely opposite semantics.</p>
<p>This chapter introduces a deep learning model that handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and it has large performance improvement over traditional methods [<a href="#Reference">1</a>].</p>
<h2 id="model-overview"><a name="user-content-model-overview" href="#model-overview" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Model Overview</h2>
<p>The model we used in this chapter uses <strong>Convolutional Neural Networks</strong> (<strong>CNNs</strong>) and <strong>Recurrent Neural Networks</strong> (<strong>RNNs</strong>) with some specific extensions.</p>
<h3 id="convolutional-neural-networks-for-texts-cnn"><a name="user-content-convolutional-neural-networks-for-texts-cnn" href="#convolutional-neural-networks-for-texts-cnn" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Convolutional Neural Networks for Texts (CNN)</h3>
<p><strong>Convolutional Neural Networks</strong> are frequently applied to data with grid-like topology such as two-dimensional images and one-dimensional texts. A CNN can extract multiple local features, combine them, and produce high-level abstractions, which correspond to semantic understanding. Empirically, CNN is shown to be efficient for image and text modeling.</p>
<p>CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. Here, we briefly describe a CNN used to classify texts[<a href="#Refernce">1</a>], as shown in Figure 1.</p>
<p align="center">
<img src="/Users/xuxinlei02/Documents/forks/book/05.understand_sentiment/image/text_cnn_en.png" width = "80%" align="center"/><br/>
Figure 1. CNN for text modeling.
</p>
<p>Let $n$ be the length of the sentence to process, and the $i$-th word has embedding as $x_i\in\mathbb{R}^k$,where $k$ is the embedding dimensionality.</p>
<p>First, we concatenate the words by piecing together every $h$ words, each as a window of length $h$. This window is denoted as $x_{i:i+h-1}$, consisting of $x_{i},x_{i+1},\ldots,x_{i+h-1}$, where $x_i$ is the first word in the window and $i$ takes value ranging from $1$ to $n-h+1$: $x_{i:i+h-1}\in\mathbb{R}^{hk}$.</p>
<p>Next, we apply the convolution operation: we apply the kernel $w\in\mathbb{R}^{hk}$ in each window, extracting features $c_i=f(w\cdot x_{i:i+h-1}+b)$, where $b\in\mathbb{R}$ is the bias and $f$ is a non-linear activation function such as $sigmoid$. Convolving by the kernel at every window ${x_{1:h},x_{2:h+1},\ldots,x_{n-h+1:n}}$ produces a feature map in the following form:</p>
<p>$$c=[c_1,c_2,\ldots,c_{n-h+1}], c \in \mathbb{R}^{n-h+1}$$</p>
<p>Next, we apply <em>max pooling</em> over time to represent the whole sentence $\hat c$, which is the maximum element across the feature map:</p>
<p>$$\hat c=max(c)$$</p>
<p>In real applications, we will apply multiple CNN kernels on the sentences. It can be implemented efficiently by concatenating the kernels together as a matrix. Also, we can use CNN kernels with different kernel size (as shown in Figure 1 in different colors).</p>
<p>Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax to form the model for the sentiment analysis problem.</p>
<p>For short texts, the aforementioned CNN model can achieve very high accuracy [<a href="#Reference">1</a>]. If we want to extract more abstract representations, we may apply a deeper CNN model [<a href="#Reference">2</a>,<a href="#Reference">3</a>].</p>
<h3 id="recurrent-neural-network-rnn"><a name="user-content-recurrent-neural-network-rnn" href="#recurrent-neural-network-rnn" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Recurrent Neural Network (RNN)</h3>
<p>RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete [<a href="#Reference">4</a>]. Since NLP is a classical problem on sequential data, the RNN, especially its variant LSTM[<a href="#Reference">5</a>]), achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth.</p>
<p align="center">
<img src="/Users/xuxinlei02/Documents/forks/book/05.understand_sentiment/image/rnn.png" width = "60%" align="center"/><br/>
Figure 2. An illustration of an unfolded RNN in time.
</p>
<p>As shown in Figure 2, we unfold an RNN: at the $t$-th time step, the network takes two inputs: the $t$-th input vector $\vec{x_t}$ and the latent state from the last time-step $\vec{h_{t-1}}$. From those, it computes the latent state of the current step $\vec{h_t}$. This process is repeated until all inputs are consumed. Denoting the RNN as function $f$, it can be formulated as follows:</p>
<p>$$\vec{h_t}=f(\vec{x_t},\vec{h_{t-1}})=\sigma(W_{xh}\vec{x_t}+W_{hh}\vec{h_{h-1}}+\vec{b_h})$$</p>
<p>where $W_{xh}$ is the weight matrix to feed into the latent layer; $W_{hh}$ is the latent-to-latent matrix; $b_h$ is the latent bias and $\sigma$ refers to the $sigmoid$ function.</p>
<p>In NLP, words are often represented as a one-hot vectors and then mapped to an embedding. The embedded feature goes through an RNN as input $x_t$ at every time step. Moreover, we can add other layers on top of RNN, such as a deep or stacked RNN. Finally, the last latent state may be used as a feature for sentence classification.</p>
<h3 id="long-short-term-memory-lstm"><a name="user-content-long-short-term-memory-lstm" href="#long-short-term-memory-lstm" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Long-Short Term Memory (LSTM)</h3>
<p>Training an RNN on long sequential data sometimes leads to the gradient vanishing or exploding[<a href="#">6</a>]. To solve this problem Hochreiter S, Schmidhuber J. (1997) proposed <strong>Long Short Term Memory</strong> (LSTM)[<a href="#Reference">5</a>]). </p>
<p>Compared to the structure of a simple RNN, an LSTM includes memory cell $c$, input gate $i$, forget gate $f$ and output gate $o$. These gates and memory cells dramatically improve the ability for the network to handle long sequences. We can formulate the <strong>LSTM-RNN</strong>, denoted as a function $F$, as follows:</p>
<p>$$ h_t=F(x_t,h_{t-1})$$</p>
<p>$F$ contains following formulations[<a href="#Reference">7</a>]:<br />
\begin{align}<br />
i_t &amp; = \sigma(W_{xi}x_t+W_{hi}h_{h-1}+W_{ci}c_{t-1}+b_i)\\<br />
f_t &amp; = \sigma(W_{xf}x_t+W_{hf}h_{h-1}+W_{cf}c_{t-1}+b_f)\\<br />
c_t &amp; = f_t\odot c_{t-1}+i_t\odot \tanh(W_{xc}x_t+W_{hc}h_{h-1}+b_c)\\<br />
o_t &amp; = \sigma(W_{xo}x_t+W_{ho}h_{h-1}+W_{co}c_{t}+b_o)\\<br />
h_t &amp; = o_t\odot \tanh(c_t)\\<br />
\end{align}</p>
<p>In the equation,$i_t, f_t, c_t, o_t$ stand for input gate, forget gate, memory cell and output gate, respectively; $W$ and $b$ are model parameters. The $tanh$ is a hyperbolic tangent, and $\odot$ denotes an element-wise product operation. Input gate controls the magnitude of new input into the memory cell $c$; forget gate controls memory propagated from the last time step; output gate controls output magnitude. The three gates are computed similarly with different parameters, and they influence memory cell $c$ separately, as shown in Figure 3:</p>
<p align="center">
<img src="/Users/xuxinlei02/Documents/forks/book/05.understand_sentiment/image/lstm_en.png" width = "65%" align="center"/><br/>
Figure 3. LSTM at time step $t$ [7].
</p>
<p>LSTM enhances the ability of considering long-term reliance, with the help of memory cell and gate. Similar structures are also proposed in Gated Recurrent Unit (GRU)[<a href="Reference">8</a>] with simpler design. <strong>The structures are still similar to RNN, though with some modifications (As shown in Figure 2), i.e., latent status depends on input as well as the latent status of last time-step, and the process goes on recurrently until all input are consumed:</strong></p>
<p>$$ h_t=Recrurent(x_t,h_{t-1})$$<br />
where $Recrurent$ is a simple RNN, GRU or LSTM.</p>
<h3 id="stacked-bidirectional-lstm"><a name="user-content-stacked-bidirectional-lstm" href="#stacked-bidirectional-lstm" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Stacked Bidirectional LSTM</h3>
<p>For vanilla LSTM, $h_t$ contains input information from previous time-step $1..t-1$ context. We can also apply an RNN with reverse-direction to take successive context $t+1…n$ into consideration. Combining constructing deep RNN (deeper RNN can contain more abstract and higher level semantic), we can design structures with deep stacked bidirectional LSTM to model sequential data[<a href="#Reference">9</a>].</p>
<p>As shown in Figure 4 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification.</p>
<p align="center">
<img src="image/stacked_lstm_en.png" width=450><br/>
Figure 4. Stacked Bidirectional LSTM for NLP modeling.
</p>
<h2 id="dataset"><a name="user-content-dataset" href="#dataset" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Dataset</h2>
<p>We use <a href="http://ai.stanford.edu/%7Eamaas/data/sentiment/">IMDB</a> dataset for sentiment analysis in this tutorial, which consists of 50,000 movie reviews split evenly into 25k train and 25k test sets. In the labeled train/test sets, a negative review has a score &lt;= 4 out of 10, and a positive review has a score &gt;= 7 out of 10.</p>
<p><code>paddle.datasets</code> package encapsulates multiple public datasets, including <code>cifar</code>, <code>imdb</code>, <code>mnist</code>, <code>moivelens</code>, and <code>wmt14</code>, etc. There&rsquo;s no need for us to manually download and preprocess IMDB.</p>
<p>After issuing a command <code>python train.py</code>, training will start immediately. The details will be unpacked by the following sessions to see how it works.</p>
<h2 id="model-structure"><a name="user-content-model-structure" href="#model-structure" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Model Structure</h2>
<h3 id="initialize-paddlepaddle"><a name="user-content-initialize-paddlepaddle" href="#initialize-paddlepaddle" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Initialize PaddlePaddle</h3>
<p>We must import and initialize PaddlePaddle (enable/disable GPU, set the number of trainers, etc).</p>
<pre><code class="python">import sys
import paddle.v2 as paddle
# PaddlePaddle init
paddle.init(use_gpu=False, trainer_count=1)
</code></pre>
<p>As alluded to in section <a href="#model-overview">Model Overview</a>, here we provide the implementations of both Text CNN and Stacked-bidirectional LSTM models.</p>
<h3 id="text-convolution-neural-network-text-cnn"><a name="user-content-text-convolution-neural-network-text-cnn" href="#text-convolution-neural-network-text-cnn" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Text Convolution Neural Network (Text CNN)</h3>
<p>We create a neural network <code>convolution_net</code> as the following snippet code.</p>
<p>Note: <code>paddle.networks.sequence_conv_pool</code> includes both convolution and pooling layer operations.</p>
<pre><code class="python">def convolution_net(input_dim, class_dim=2, emb_dim=128, hid_dim=128):
data = paddle.layer.data(&quot;word&quot;,
paddle.data_type.integer_value_sequence(input_dim))
emb = paddle.layer.embedding(input=data, size=emb_dim)
conv_3 = paddle.networks.sequence_conv_pool(
input=emb, context_len=3, hidden_size=hid_dim)
conv_4 = paddle.networks.sequence_conv_pool(
input=emb, context_len=4, hidden_size=hid_dim)
output = paddle.layer.fc(input=[conv_3, conv_4],
size=class_dim,
act=paddle.activation.Softmax())
lbl = paddle.layer.data(&quot;label&quot;, paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
</code></pre>
<ol>
<li>
<p>Define input data and its dimension</p>
<p>Parameter <code>input_dim</code> denotes the dictionary size, and <code>class_dim</code> is the number of categories. In <code>convolution_net</code>, the input to the network is defined in <code>paddle.layer.data</code>.</p>
</li>
<li>
<p>Define Classifier</p>
<p>The above Text CNN network extracts high-level features and maps them to a vector of the same size as the categories. <code>paddle.activation.Softmax</code> function or classifier is then used for calculating the probability of the sentence belonging to each category.</p>
</li>
<li>
<p>Define Loss Function</p>
<p>In the context of supervised learning, labels of the training set are defined in <code>paddle.layer.data</code>, too. During training, cross-entropy is used as loss function in <code>paddle.layer.classification_cost</code> and as the output of the network; During testing, the outputs are the probabilities calculated in the classifier.</p>
</li>
</ol>
<h4 id="stacked-bidirectional-lstm_1"><a name="user-content-stacked-bidirectional-lstm_1" href="#stacked-bidirectional-lstm_1" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Stacked bidirectional LSTM</h4>
<p>We create a neural network <code>stacked_lstm_net</code> as below.</p>
<pre><code class="python">def stacked_lstm_net(input_dim,
class_dim=2,
emb_dim=128,
hid_dim=512,
stacked_num=3):
&quot;&quot;&quot;
A Wrapper for sentiment classification task.
This network uses bi-directional recurrent network,
consisting three LSTM layers. This configure is referred to
the paper as following url, but use fewer layrs.
http://www.aclweb.org/anthology/P15-1109
input_dim: here is word dictionary dimension.
class_dim: number of categories.
emb_dim: dimension of word embedding.
hid_dim: dimension of hidden layer.
stacked_num: number of stacked lstm-hidden layer.
&quot;&quot;&quot;
assert stacked_num % 2 == 1
layer_attr = paddle.attr.Extra(drop_rate=0.5)
fc_para_attr = paddle.attr.Param(learning_rate=1e-3)
lstm_para_attr = paddle.attr.Param(initial_std=0., learning_rate=1.)
para_attr = [fc_para_attr, lstm_para_attr]
bias_attr = paddle.attr.Param(initial_std=0., l2_rate=0.)
relu = paddle.activation.Relu()
linear = paddle.activation.Linear()
data = paddle.layer.data(&quot;word&quot;,
paddle.data_type.integer_value_sequence(input_dim))
emb = paddle.layer.embedding(input=data, size=emb_dim)
fc1 = paddle.layer.fc(input=emb,
size=hid_dim,
act=linear,
bias_attr=bias_attr)
lstm1 = paddle.layer.lstmemory(
input=fc1, act=relu, bias_attr=bias_attr, layer_attr=layer_attr)
inputs = [fc1, lstm1]
for i in range(2, stacked_num + 1):
fc = paddle.layer.fc(input=inputs,
size=hid_dim,
act=linear,
param_attr=para_attr,
bias_attr=bias_attr)
lstm = paddle.layer.lstmemory(
input=fc,
reverse=(i % 2) == 0,
act=relu,
bias_attr=bias_attr,
layer_attr=layer_attr)
inputs = [fc, lstm]
fc_last = paddle.layer.pooling(
input=inputs[0], pooling_type=paddle.pooling.Max())
lstm_last = paddle.layer.pooling(
input=inputs[1], pooling_type=paddle.pooling.Max())
output = paddle.layer.fc(input=[fc_last, lstm_last],
size=class_dim,
act=paddle.activation.Softmax(),
bias_attr=bias_attr,
param_attr=para_attr)
lbl = paddle.layer.data(&quot;label&quot;, paddle.data_type.integer_value(2))
cost = paddle.layer.classification_cost(input=output, label=lbl)
return cost
</code></pre>
<ol>
<li>
<p>Define input data and its dimension</p>
<p>Parameter <code>input_dim</code> denotes the dictionary size, and <code>class_dim</code> is the number of categories. In <code>stacked_lstm_net</code>, the input to the network is defined in <code>paddle.layer.data</code>.</p>
</li>
<li>
<p>Define Classifier</p>
<p>The above stacked bidirectional LSTM network extracts high-level features and maps them to a vector of the same size as the categories. <code>paddle.activation.Softmax</code> function or classifier is then used for calculating the probability of the sentence belonging to each category.</p>
</li>
<li>
<p>Define Loss Function</p>
<p>In the context of supervised learning, labels of the training set are defined in <code>paddle.layer.data</code>, too. During training, cross-entropy is used as loss function in <code>paddle.layer.classification_cost</code> and as the output of the network; During testing, the outputs are the probabilities calculated in the classifier.</p>
</li>
</ol>
<p>To reiterate, we can either invoke <code>convolution_net</code> or <code>stacked_lstm_net</code>.</p>
<pre><code class="python">word_dict = paddle.dataset.imdb.word_dict()
dict_dim = len(word_dict)
class_dim = 2
# option 1
cost = convolution_net(dict_dim, class_dim=class_dim)
# option 2
# cost = stacked_lstm_net(dict_dim, class_dim=class_dim, stacked_num=3)
</code></pre>
<h2 id="model-training"><a name="user-content-model-training" href="#model-training" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Model Training</h2>
<h3 id="define-parameters"><a name="user-content-define-parameters" href="#define-parameters" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Define Parameters</h3>
<p>First, we create the model parameters according to the previous model configuration <code>cost</code>.</p>
<pre><code class="python"># create parameters
parameters = paddle.parameters.create(cost)
</code></pre>
<h3 id="create-trainer"><a name="user-content-create-trainer" href="#create-trainer" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Create Trainer</h3>
<p>Before jumping into creating a training module, algorithm setting is also necessary.<br />
Here we specified <code>Adam</code> optimization algorithm via <code>paddle.optimizer</code>.</p>
<pre><code class="python"># create optimizer
adam_optimizer = paddle.optimizer.Adam(
learning_rate=2e-3,
regularization=paddle.optimizer.L2Regularization(rate=8e-4),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))
# create trainer
trainer = paddle.trainer.SGD(cost=cost,
parameters=parameters,
update_equation=adam_optimizer)
</code></pre>
<h3 id="training"><a name="user-content-training" href="#training" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Training</h3>
<p><code>paddle.dataset.imdb.train()</code> will yield records during each pass, after shuffling, a batch input is generated for training.</p>
<pre><code class="python">train_reader = paddle.batch(
paddle.reader.shuffle(
lambda: paddle.dataset.imdb.train(word_dict), buf_size=1000),
batch_size=100)
test_reader = paddle.batch(
lambda: paddle.dataset.imdb.test(word_dict), batch_size=100)
</code></pre>
<p><code>feeding</code> is devoted to specifying the correspondence between each yield record and <code>paddle.layer.data</code>. For instance, the first column of data generated by <code>paddle.dataset.imdb.train()</code> corresponds to <code>word</code> feature.</p>
<pre><code class="python">feeding = {'word': 0, 'label': 1}
</code></pre>
<p>Callback function <code>event_handler</code> will be invoked to track training progress when a pre-defined event happens.</p>
<pre><code class="python">def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
if event.batch_id % 100 == 0:
print &quot;\nPass %d, Batch %d, Cost %f, %s&quot; % (
event.pass_id, event.batch_id, event.cost, event.metrics)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.EndPass):
result = trainer.test(reader=test_reader, feeding=feeding)
print &quot;\nTest with Pass %d, %s&quot; % (event.pass_id, result.metrics)
</code></pre>
<p>Finally, we can invoke <code>trainer.train</code> to start training:</p>
<pre><code class="python">trainer.train(
reader=train_reader,
event_handler=event_handler,
feeding=feeding,
num_passes=10)
</code></pre>
<h2 id="conclusion"><a name="user-content-conclusion" href="#conclusion" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Conclusion</h2>
<p>In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters, we will see how these models can be applied in other tasks.</p>
<h2 id="reference"><a name="user-content-reference" href="#reference" class="headeranchor-link" aria-hidden="true"><span class="headeranchor"></span></a>Reference</h2>
<ol>
<li>Kim Y. <a href="http://arxiv.org/pdf/1408.5882">Convolutional neural networks for sentence classification</a>[J]. arXiv preprint arXiv:1408.5882, 2014.</li>
<li>Kalchbrenner N, Grefenstette E, Blunsom P. <a href="http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&amp;utm_source=PourOver">A convolutional neural network for modelling sentences</a>[J]. arXiv preprint arXiv:1404.2188, 2014.</li>
<li>Yann N. Dauphin, et al. <a href="https://arxiv.org/pdf/1612.08083v1.pdf">Language Modeling with Gated Convolutional Networks</a>[J] arXiv preprint arXiv:1612.08083, 2016.</li>
<li>Siegelmann H T, Sontag E D. <a href="http://research.cs.queensu.ca/home/akl/cisc879/papers/SELECTED_PAPERS_FROM_VARIOUS_SOURCES/05070215382317071.pdf">On the computational power of neural nets</a>[C]//Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992: 440-449.</li>
<li>Hochreiter S, Schmidhuber J. <a href="http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf">Long short-term memory</a>[J]. Neural computation, 1997, 9(8): 1735-1780.</li>
<li>Bengio Y, Simard P, Frasconi P. <a href="http://www-dsi.ing.unifi.it/~paolo/ps/tnn-94-gradient.pdf">Learning long-term dependencies with gradient descent is difficult</a>[J]. IEEE transactions on neural networks, 1994, 5(2): 157-166.</li>
<li>Graves A. <a href="http://arxiv.org/pdf/1308.0850">Generating sequences with recurrent neural networks</a>[J]. arXiv preprint arXiv:1308.0850, 2013.</li>
<li>Cho K, Van Merriënboer B, Gulcehre C, et al. <a href="http://arxiv.org/pdf/1406.1078">Learning phrase representations using RNN encoder-decoder for statistical machine translation</a>[J]. arXiv preprint arXiv:1406.1078, 2014.</li>
<li>Zhou J, Xu W. <a href="http://www.aclweb.org/anthology/P/P15/P15-1109.pdf">End-to-end learning of semantic role labeling using recurrent neural networks</a>[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2015.</li>
</ol>
<p><br/><br />
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</p></article></body></html>
\ No newline at end of file
...@@ -19,7 +19,7 @@ In natural language processing, sentiment analysis can be categorized as a **Tex ...@@ -19,7 +19,7 @@ In natural language processing, sentiment analysis can be categorized as a **Tex
The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad“ and “boring, dull, and empty work” describe very similar semantic meaning, yet their BOW representations have very little similarity. Furthermore, “the movie is bad“ and “the movie is not bad“ have high similarity with BOW features, but they express completely opposite semantics. The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad“ and “boring, dull, and empty work” describe very similar semantic meaning, yet their BOW representations have very little similarity. Furthermore, “the movie is bad“ and “the movie is not bad“ have high similarity with BOW features, but they express completely opposite semantics.
This chapter introduces a deep learning model that handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and it has large performance improvement over traditional methods \[[1](#Reference)\]. This chapter introduces a deep learning model that handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and it has large performance improvement over traditional methods \[[1](#references)\].
## Model Overview ## Model Overview
...@@ -32,11 +32,11 @@ The convolutional neural network for texts is introduced in chapter [recommender ...@@ -32,11 +32,11 @@ The convolutional neural network for texts is introduced in chapter [recommender
CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. We firstly apply the convolution operation: we apply the kernel in each window, extracting features. Convolving by the kernel at every window produces a feature map. Next, we apply *max pooling* over time to represent the whole sentence, which is the maximum element across the feature map. In real applications, we will apply multiple CNN kernels on the sentences. It can be implemented efficiently by concatenating the kernels together as a matrix. Also, we can use CNN kernels with different kernel size. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax to form the model for the sentiment analysis problem. CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. We firstly apply the convolution operation: we apply the kernel in each window, extracting features. Convolving by the kernel at every window produces a feature map. Next, we apply *max pooling* over time to represent the whole sentence, which is the maximum element across the feature map. In real applications, we will apply multiple CNN kernels on the sentences. It can be implemented efficiently by concatenating the kernels together as a matrix. Also, we can use CNN kernels with different kernel size. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax to form the model for the sentiment analysis problem.
For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#Reference)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#Reference),[3](#Reference)\]. For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#references)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#references),[3](#references)\].
### Recurrent Neural Network (RNN) ### Recurrent Neural Network (RNN)
RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#Reference)\]. Since NLP is a classical problem of sequential data, the RNN, especially its variant LSTM\[[5](#Reference)\]), achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth. RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#references)\]. Since NLP is a classical problem of sequential data, the RNN, especially its variant LSTM\[[5](#references)\]), achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth.
<p align="center"> <p align="center">
<img src="image/rnn.png" width = "60%" align="center"/><br/> <img src="image/rnn.png" width = "60%" align="center"/><br/>
...@@ -53,13 +53,13 @@ In NLP, words are often represented as one-hot vectors and then mapped to an emb ...@@ -53,13 +53,13 @@ In NLP, words are often represented as one-hot vectors and then mapped to an emb
### Long-Short Term Memory (LSTM) ### Long-Short Term Memory (LSTM)
Training an RNN on long sequential data sometimes leads to the gradient vanishing or exploding\[[6](#)\]. To solve this problem Hochreiter S, Schmidhuber J. (1997) proposed **Long Short Term Memory** (LSTM)\[[5](#Reference)\]). Training an RNN on long sequential data sometimes leads to the gradient vanishing or exploding\[[6](#references)\]. To solve this problem Hochreiter S, Schmidhuber J. (1997) proposed **Long Short Term Memory** (LSTM)\[[5](#references)\]).
Compared to the structure of a simple RNN, an LSTM includes memory cell $c$, input gate $i$, forget gate $f$ and output gate $o$. These gates and memory cells dramatically improve the ability for the network to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows: Compared to the structure of a simple RNN, an LSTM includes memory cell $c$, input gate $i$, forget gate $f$ and output gate $o$. These gates and memory cells dramatically improve the ability for the network to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows:
$$ h_t=F(x_t,h_{t-1})$$ $$ h_t=F(x_t,h_{t-1})$$
$F$ contains following formulations\[[7](#Reference)\] $F$ contains following formulations\[[7](#references)\]
\begin{align} \begin{align}
i_t & = \sigma(W_{xi}x_t+W_{hi}h_{h-1}+W_{ci}c_{t-1}+b_i)\\\\ i_t & = \sigma(W_{xi}x_t+W_{hi}h_{h-1}+W_{ci}c_{t-1}+b_i)\\\\
f_t & = \sigma(W_{xf}x_t+W_{hf}h_{h-1}+W_{cf}c_{t-1}+b_f)\\\\ f_t & = \sigma(W_{xf}x_t+W_{hf}h_{h-1}+W_{cf}c_{t-1}+b_f)\\\\
...@@ -82,7 +82,7 @@ where $Recrurent$ is a simple RNN, GRU or LSTM. ...@@ -82,7 +82,7 @@ where $Recrurent$ is a simple RNN, GRU or LSTM.
### Stacked Bidirectional LSTM ### Stacked Bidirectional LSTM
For vanilla LSTM, $h_t$ contains input information from previous time-step $1..t-1$ context. We can also apply an RNN with reverse-direction to take successive context $t+1…n$ into consideration. Combining constructing deep RNN (deeper RNN can contain more abstract and higher level semantic), we can design structures with deep stacked bidirectional LSTM to model sequential data\[[9](#Reference)\]. For vanilla LSTM, $h_t$ contains input information from previous time-step $1..t-1$ context. We can also apply an RNN with reverse-direction to take successive context $t+1…n$ into consideration. Combining constructing deep RNN (deeper RNN can contain more abstract and higher level semantic), we can design structures with deep stacked bidirectional LSTM to model sequential data\[[9](#references)\].
As shown in Figure 3 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification. As shown in Figure 3 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification.
...@@ -331,7 +331,7 @@ trainer.train( ...@@ -331,7 +331,7 @@ trainer.train(
In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters, we will see how these models can be applied in other tasks. In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters, we will see how these models can be applied in other tasks.
## Reference ## References
1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014. 1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modeling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014. 2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modeling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
......
...@@ -61,7 +61,7 @@ In natural language processing, sentiment analysis can be categorized as a **Tex ...@@ -61,7 +61,7 @@ In natural language processing, sentiment analysis can be categorized as a **Tex
The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad“ and “boring, dull, and empty work” describe very similar semantic meaning, yet their BOW representations have very little similarity. Furthermore, “the movie is bad“ and “the movie is not bad“ have high similarity with BOW features, but they express completely opposite semantics. The BOW model does not capture all the information in a piece of text, as it ignores syntax and grammar and just treats the text as a set of words. For example, “this movie is extremely bad“ and “boring, dull, and empty work” describe very similar semantic meaning, yet their BOW representations have very little similarity. Furthermore, “the movie is bad“ and “the movie is not bad“ have high similarity with BOW features, but they express completely opposite semantics.
This chapter introduces a deep learning model that handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and it has large performance improvement over traditional methods \[[1](#Reference)\]. This chapter introduces a deep learning model that handles these issues in BOW. Our model embeds texts into a low-dimensional space and takes word order into consideration. It is an end-to-end framework and it has large performance improvement over traditional methods \[[1](#references)\].
## Model Overview ## Model Overview
...@@ -74,11 +74,11 @@ The convolutional neural network for texts is introduced in chapter [recommender ...@@ -74,11 +74,11 @@ The convolutional neural network for texts is introduced in chapter [recommender
CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. We firstly apply the convolution operation: we apply the kernel in each window, extracting features. Convolving by the kernel at every window produces a feature map. Next, we apply *max pooling* over time to represent the whole sentence, which is the maximum element across the feature map. In real applications, we will apply multiple CNN kernels on the sentences. It can be implemented efficiently by concatenating the kernels together as a matrix. Also, we can use CNN kernels with different kernel size. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax to form the model for the sentiment analysis problem. CNN mainly contains convolution and pooling operation, with versatile combinations in various applications. We firstly apply the convolution operation: we apply the kernel in each window, extracting features. Convolving by the kernel at every window produces a feature map. Next, we apply *max pooling* over time to represent the whole sentence, which is the maximum element across the feature map. In real applications, we will apply multiple CNN kernels on the sentences. It can be implemented efficiently by concatenating the kernels together as a matrix. Also, we can use CNN kernels with different kernel size. Finally, concatenating the resulting features produces a fixed-length representation, which can be combined with a softmax to form the model for the sentiment analysis problem.
For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#Reference)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#Reference),[3](#Reference)\]. For short texts, the aforementioned CNN model can achieve very high accuracy \[[1](#references)\]. If we want to extract more abstract representations, we may apply a deeper CNN model \[[2](#references),[3](#references)\].
### Recurrent Neural Network (RNN) ### Recurrent Neural Network (RNN)
RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#Reference)\]. Since NLP is a classical problem of sequential data, the RNN, especially its variant LSTM\[[5](#Reference)\]), achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth. RNN is an effective model for sequential data. In terms of computability, the RNN is Turing-complete \[[4](#references)\]. Since NLP is a classical problem of sequential data, the RNN, especially its variant LSTM\[[5](#references)\]), achieves state-of-the-art performance on various NLP tasks, such as language modeling, syntax parsing, POS-tagging, image captioning, dialog, machine translation, and so forth.
<p align="center"> <p align="center">
<img src="image/rnn.png" width = "60%" align="center"/><br/> <img src="image/rnn.png" width = "60%" align="center"/><br/>
...@@ -95,13 +95,13 @@ In NLP, words are often represented as one-hot vectors and then mapped to an emb ...@@ -95,13 +95,13 @@ In NLP, words are often represented as one-hot vectors and then mapped to an emb
### Long-Short Term Memory (LSTM) ### Long-Short Term Memory (LSTM)
Training an RNN on long sequential data sometimes leads to the gradient vanishing or exploding\[[6](#)\]. To solve this problem Hochreiter S, Schmidhuber J. (1997) proposed **Long Short Term Memory** (LSTM)\[[5](#Reference)\]). Training an RNN on long sequential data sometimes leads to the gradient vanishing or exploding\[[6](#references)\]. To solve this problem Hochreiter S, Schmidhuber J. (1997) proposed **Long Short Term Memory** (LSTM)\[[5](#references)\]).
Compared to the structure of a simple RNN, an LSTM includes memory cell $c$, input gate $i$, forget gate $f$ and output gate $o$. These gates and memory cells dramatically improve the ability for the network to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows: Compared to the structure of a simple RNN, an LSTM includes memory cell $c$, input gate $i$, forget gate $f$ and output gate $o$. These gates and memory cells dramatically improve the ability for the network to handle long sequences. We can formulate the **LSTM-RNN**, denoted as a function $F$, as follows:
$$ h_t=F(x_t,h_{t-1})$$ $$ h_t=F(x_t,h_{t-1})$$
$F$ contains following formulations\[[7](#Reference)\]: $F$ contains following formulations\[[7](#references)\]:
\begin{align} \begin{align}
i_t & = \sigma(W_{xi}x_t+W_{hi}h_{h-1}+W_{ci}c_{t-1}+b_i)\\\\ i_t & = \sigma(W_{xi}x_t+W_{hi}h_{h-1}+W_{ci}c_{t-1}+b_i)\\\\
f_t & = \sigma(W_{xf}x_t+W_{hf}h_{h-1}+W_{cf}c_{t-1}+b_f)\\\\ f_t & = \sigma(W_{xf}x_t+W_{hf}h_{h-1}+W_{cf}c_{t-1}+b_f)\\\\
...@@ -124,7 +124,7 @@ where $Recrurent$ is a simple RNN, GRU or LSTM. ...@@ -124,7 +124,7 @@ where $Recrurent$ is a simple RNN, GRU or LSTM.
### Stacked Bidirectional LSTM ### Stacked Bidirectional LSTM
For vanilla LSTM, $h_t$ contains input information from previous time-step $1..t-1$ context. We can also apply an RNN with reverse-direction to take successive context $t+1…n$ into consideration. Combining constructing deep RNN (deeper RNN can contain more abstract and higher level semantic), we can design structures with deep stacked bidirectional LSTM to model sequential data\[[9](#Reference)\]. For vanilla LSTM, $h_t$ contains input information from previous time-step $1..t-1$ context. We can also apply an RNN with reverse-direction to take successive context $t+1…n$ into consideration. Combining constructing deep RNN (deeper RNN can contain more abstract and higher level semantic), we can design structures with deep stacked bidirectional LSTM to model sequential data\[[9](#references)\].
As shown in Figure 3 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification. As shown in Figure 3 (3-layer RNN), odd/even layers are forward/reverse LSTM. Higher layers of LSTM take lower-layers LSTM as input, and the top-layer LSTM produces a fixed length vector by max-pooling (this representation considers contexts from previous and successive words for higher-level abstractions). Finally, we concatenate the output to a softmax layer for classification.
...@@ -373,7 +373,7 @@ trainer.train( ...@@ -373,7 +373,7 @@ trainer.train(
In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters, we will see how these models can be applied in other tasks. In this chapter, we use sentiment analysis as an example to introduce applying deep learning models on end-to-end short text classification, as well as how to use PaddlePaddle to implement the model. Meanwhile, we briefly introduce two models for text processing: CNN and RNN. In following chapters, we will see how these models can be applied in other tasks.
## Reference ## References
1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014. 1. Kim Y. [Convolutional neural networks for sentence classification](http://arxiv.org/pdf/1408.5882)[J]. arXiv preprint arXiv:1408.5882, 2014.
2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modeling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014. 2. Kalchbrenner N, Grefenstette E, Blunsom P. [A convolutional neural network for modeling sentences](http://arxiv.org/pdf/1404.2188.pdf?utm_medium=App.net&utm_source=PourOver)[J]. arXiv preprint arXiv:1404.2188, 2014.
......
...@@ -31,7 +31,7 @@ Fig 1. Syntax tree ...@@ -31,7 +31,7 @@ Fig 1. Syntax tree
</div> </div>
However, a complete syntactic analysis requires identifying the relationship among all constituents. Thus, the accuracy of SRL is sensitive to the preciseness of the syntactic analysis, making SRL challenging. To reduce its complexity and obtain some information on the syntactic structures, we often use *shallow syntactic analysis* a.k.a. partial parsing or chunking. Unlike complete syntactic analysis, which requires the construction of the complete parsing tree, *Shallow Syntactic Analysis* only requires identifying some independent constituents with relatively simple structures, such as verb phrases (chunk). To avoid difficulties in constructing a syntax tree with high accuracy, some work\[[1](#Reference)\] proposed semantic chunking-based SRL methods, which reduces SRL into a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using **BIO representation**. For syntactic chunks forming role A, its first chunk receives the B-A tag (Begin) and the remaining ones receive the tag I-A (Inside); in the end, the chunks left out will receive the tag O. However, a complete syntactic analysis requires identifying the relationship among all constituents. Thus, the accuracy of SRL is sensitive to the preciseness of the syntactic analysis, making SRL challenging. To reduce its complexity and obtain some information on the syntactic structures, we often use *shallow syntactic analysis* a.k.a. partial parsing or chunking. Unlike complete syntactic analysis, which requires the construction of the complete parsing tree, *Shallow Syntactic Analysis* only requires identifying some independent constituents with relatively simple structures, such as verb phrases (chunk). To avoid difficulties in constructing a syntax tree with high accuracy, some work\[[1](#reference)\] proposed semantic chunking-based SRL methods, which reduces SRL into a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using **BIO representation**. For syntactic chunks forming role A, its first chunk receives the B-A tag (Begin) and the remaining ones receive the tag I-A (Inside); in the end, the chunks left out will receive the tag O.
The BIO representation of above example is shown in Fig.1. The BIO representation of above example is shown in Fig.1.
...@@ -54,7 +54,7 @@ In this tutorial, our SRL system is built as an end-to-end system via a neural n ...@@ -54,7 +54,7 @@ In this tutorial, our SRL system is built as an end-to-end system via a neural n
### Stacked Recurrent Neural Network ### Stacked Recurrent Neural Network
*Deep Neural Networks* can extract hierarchical representations. The higher layers can form relatively abstract/complex representations, based on primitive features discovered through the lower layers. Unfolding LSTMs through time results in a deep feed-forward neural network. This is because any computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. On the other hand, due to parameter sharing over time, LSTMs are also *shallow*; that is, the computation carried out at each time-step is just a linear transformation. Deep LSTM networks are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be efficient at representing some functions and modeling varying-length dependencies\[[2](#Reference)\]. *Deep Neural Networks* can extract hierarchical representations. The higher layers can form relatively abstract/complex representations, based on primitive features discovered through the lower layers. Unfolding LSTMs through time results in a deep feed-forward neural network. This is because any computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. On the other hand, due to parameter sharing over time, LSTMs are also *shallow*; that is, the computation carried out at each time-step is just a linear transformation. Deep LSTM networks are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be efficient at representing some functions and modeling varying-length dependencies\[[2](#reference)\].
However, in a deep LSTM network, any gradient propagated back in depth needs to traverse a large number of nonlinear steps. As a result, while LSTMs of 4 layers can be trained properly, those with 4-8 have much worse performance. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well. However, in a deep LSTM network, any gradient propagated back in depth needs to traverse a large number of nonlinear steps. As a result, while LSTMs of 4 layers can be trained properly, those with 4-8 have much worse performance. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well.
...@@ -70,7 +70,7 @@ Based on the stacked LSTMs, we add shortcut connections: take the input-to-hidde ...@@ -70,7 +70,7 @@ Based on the stacked LSTMs, we add shortcut connections: take the input-to-hidde
Fig.3 illustrates the final stacked recurrent neural networks. Fig.3 illustrates the final stacked recurrent neural networks.
<p align="center"> <p align="center">
<img src="./image/stacked_lstm_en.png" width = "40%" align=center><br> <img src="./image/stacked_lstm_en.png" width = "40%" align=center><br>
Fig 3. Stacked Recurrent Neural Networks Fig 3. Stacked Recurrent Neural Networks
</p> </p>
...@@ -82,28 +82,28 @@ While LSTMs can summarize the history, they can not see the future. Because most ...@@ -82,28 +82,28 @@ While LSTMs can summarize the history, they can not see the future. Because most
To address this, we can design a bidirectional recurrent neural network by making a minor modification. A higher LSTM layer can process the sequence in reversed direction with regards to its immediate lower LSTM layer, i.e., deep LSTM layers take turns to train on input sequences from left-to-right and right-to-left. Therefore, LSTM layers at time-step $t$ can see both histories and the future, starting from the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks. To address this, we can design a bidirectional recurrent neural network by making a minor modification. A higher LSTM layer can process the sequence in reversed direction with regards to its immediate lower LSTM layer, i.e., deep LSTM layers take turns to train on input sequences from left-to-right and right-to-left. Therefore, LSTM layers at time-step $t$ can see both histories and the future, starting from the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks.
<p align="center"> <p align="center">
<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br> <img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs Fig 4. Bidirectional LSTMs
</p> </p>
Note that, this bidirectional RNNs is different from the one proposed by Bengio et al. in machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce another bidirectional RNNs in the following chapter [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) Note that, this bidirectional RNNs is different from the one proposed by Bengio et al. in machine translation tasks \[[3](#reference), [4](#reference)\]. We will introduce another bidirectional RNNs in the following chapter [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md)
### Conditional Random Field (CRF) ### Conditional Random Field (CRF)
Typically, a neural network's lower layers learn representations while its very top layer accomplishs the final task. These principles can guide our problem-solving approaches. In SRL tasks, a **Conditional Random Field** (*CRF*) is built on top of the network in order to perform the final prediction to tag sequences. It takes representations provided by the last LSTM layer as input. Typically, a neural network's lower layers learn representations while its very top layer accomplishes the final task. These principles can guide our problem-solving approaches. In SRL tasks, a **Conditional Random Field** (*CRF*) is built on top of the network in order to perform the final prediction to tag sequences. It takes representations provided by the last LSTM layer as input.
The CRF is an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. In essence, CRFs learn the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input and $Y = (y_1, y_2, ... , y_n)$ are label sequences; to decode, simply search through $Y$ for a sequence that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$。 The CRF is an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. In essence, CRFs learn the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input and $Y = (y_1, y_2, ... , y_n)$ are label sequences; to decode, simply search through $Y$ for a sequence that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$。
Sequence tagging tasks do not assume a lot of conditional independence, because they only concern about the input and the output being linear sequences. Thus, the graph model of sequence tagging tasks is usually a simple chain or line, which results in a **Linear-Chain Conditional Random Field**, shown in Fig.5. Sequence tagging tasks do not assume a lot of conditional independence, because they only concern about the input and the output being linear sequences. Thus, the graph model of sequence tagging tasks is usually a simple chain or line, which results in a **Linear-Chain Conditional Random Field**, shown in Fig.5.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
Fig 5. Linear Chain Conditional Random Field used in SRL tasks Fig 5. Linear Chain Conditional Random Field used in SRL tasks
</p> </p>
By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form: By the fundamental theorem of random fields \[[5](#reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$ $$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
...@@ -147,7 +147,7 @@ After these modifications, the model is as follows, as illustrated in Figure 6: ...@@ -147,7 +147,7 @@ After these modifications, the model is as follows, as illustrated in Figure 6:
4. Take the representation from step 3 as input to CRF, label sequence as a supervisory signal, and complete sequence tagging tasks. 4. Take the representation from step 3 as input to CRF, label sequence as a supervisory signal, and complete sequence tagging tasks.
<div align="center"> <div align="center">
<img src="image/db_lstm_network_en.png" width = "60%" align=center /><br> <img src="image/db_lstm_network_en.png" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks Fig 6. DB-LSTM for SRL tasks
</div> </div>
...@@ -165,7 +165,7 @@ conll05st-release/ ...@@ -165,7 +165,7 @@ conll05st-release/
└── words # text sequence └── words # text sequence
``` ```
The annotation information is derived from the results of Penn TreeBank\[[7](#references)\] and PropBank \[[8](# references)\]. The labeling of the PropBank is different from the labeling methods mentioned before, but shares with it the same underlying principle. For descriptions of the labeling, please refer to the paper \[[9](#references)\]. The annotation information is derived from the results of Penn TreeBank\[[7](#references)\] and PropBank \[[8](#references)\]. The labeling of the PropBank is different from the labeling methods mentioned before, but shares with it the same underlying principle. For descriptions of the labeling, please refer to the paper \[[9](#references)\].
The raw data needs to be preprocessed into formats that PaddlePaddle can handle. The preprocessing consists of the following steps: The raw data needs to be preprocessed into formats that PaddlePaddle can handle. The preprocessing consists of the following steps:
...@@ -266,7 +266,7 @@ Note that `hidden_dim = 512` means a LSTM hidden vector of 128 dimension (512/4) ...@@ -266,7 +266,7 @@ Note that `hidden_dim = 512` means a LSTM hidden vector of 128 dimension (512/4)
- Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences. - Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences.
```python ```python
# Since word vectorlookup table is pre-trained, we won't update it this time. # Since word vectorlookup table is pre-trained, we won't update it this time.
# is_static being True prevents updating the lookup table during training. # is_static being True prevents updating the lookup table during training.
...@@ -295,7 +295,7 @@ emb_layers.append(mark_embedding) ...@@ -295,7 +295,7 @@ emb_layers.append(mark_embedding)
- 8 LSTM units are trained through alternating left-to-right / right-to-left order denoted by the variable `reverse`. - 8 LSTM units are trained through alternating left-to-right / right-to-left order denoted by the variable `reverse`.
```python ```python
hidden_0 = paddle.layer.mixed( hidden_0 = paddle.layer.mixed(
size=hidden_dim, size=hidden_dim,
bias_attr=std_default, bias_attr=std_default,
...@@ -522,7 +522,7 @@ print pre_lab ...@@ -522,7 +522,7 @@ print pre_lab
Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to illustrate using PaddlePaddle to do sequence tagging tasks. The models proposed are from our published paper\[[10](#Reference)\]. We only use test data for illustration since the training data on the CoNLL 2005 dataset is not completely public. This aims to propose an end-to-end neural network model with fewer dependencies on natural language processing tools but is comparable, or even better than traditional models in terms of performance. Please check out our paper for more information and discussions. Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to illustrate using PaddlePaddle to do sequence tagging tasks. The models proposed are from our published paper\[[10](#Reference)\]. We only use test data for illustration since the training data on the CoNLL 2005 dataset is not completely public. This aims to propose an end-to-end neural network model with fewer dependencies on natural language processing tools but is comparable, or even better than traditional models in terms of performance. Please check out our paper for more information and discussions.
## Reference ## References
1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483. 1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013. 2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014. 3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
......
...@@ -73,7 +73,7 @@ Fig 1. Syntax tree ...@@ -73,7 +73,7 @@ Fig 1. Syntax tree
</div> </div>
However, a complete syntactic analysis requires identifying the relationship among all constituents. Thus, the accuracy of SRL is sensitive to the preciseness of the syntactic analysis, making SRL challenging. To reduce its complexity and obtain some information on the syntactic structures, we often use *shallow syntactic analysis* a.k.a. partial parsing or chunking. Unlike complete syntactic analysis, which requires the construction of the complete parsing tree, *Shallow Syntactic Analysis* only requires identifying some independent constituents with relatively simple structures, such as verb phrases (chunk). To avoid difficulties in constructing a syntax tree with high accuracy, some work\[[1](#Reference)\] proposed semantic chunking-based SRL methods, which reduces SRL into a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using **BIO representation**. For syntactic chunks forming role A, its first chunk receives the B-A tag (Begin) and the remaining ones receive the tag I-A (Inside); in the end, the chunks left out will receive the tag O. However, a complete syntactic analysis requires identifying the relationship among all constituents. Thus, the accuracy of SRL is sensitive to the preciseness of the syntactic analysis, making SRL challenging. To reduce its complexity and obtain some information on the syntactic structures, we often use *shallow syntactic analysis* a.k.a. partial parsing or chunking. Unlike complete syntactic analysis, which requires the construction of the complete parsing tree, *Shallow Syntactic Analysis* only requires identifying some independent constituents with relatively simple structures, such as verb phrases (chunk). To avoid difficulties in constructing a syntax tree with high accuracy, some work\[[1](#reference)\] proposed semantic chunking-based SRL methods, which reduces SRL into a sequence tagging problem. Sequence tagging tasks classify syntactic chunks using **BIO representation**. For syntactic chunks forming role A, its first chunk receives the B-A tag (Begin) and the remaining ones receive the tag I-A (Inside); in the end, the chunks left out will receive the tag O.
The BIO representation of above example is shown in Fig.1. The BIO representation of above example is shown in Fig.1.
...@@ -96,7 +96,7 @@ In this tutorial, our SRL system is built as an end-to-end system via a neural n ...@@ -96,7 +96,7 @@ In this tutorial, our SRL system is built as an end-to-end system via a neural n
### Stacked Recurrent Neural Network ### Stacked Recurrent Neural Network
*Deep Neural Networks* can extract hierarchical representations. The higher layers can form relatively abstract/complex representations, based on primitive features discovered through the lower layers. Unfolding LSTMs through time results in a deep feed-forward neural network. This is because any computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. On the other hand, due to parameter sharing over time, LSTMs are also *shallow*; that is, the computation carried out at each time-step is just a linear transformation. Deep LSTM networks are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be efficient at representing some functions and modeling varying-length dependencies\[[2](#Reference)\]. *Deep Neural Networks* can extract hierarchical representations. The higher layers can form relatively abstract/complex representations, based on primitive features discovered through the lower layers. Unfolding LSTMs through time results in a deep feed-forward neural network. This is because any computational path between the input at time $k < t$ to the output at time $t$ crosses several nonlinear layers. On the other hand, due to parameter sharing over time, LSTMs are also *shallow*; that is, the computation carried out at each time-step is just a linear transformation. Deep LSTM networks are typically constructed by stacking multiple LSTM layers on top of each other and taking the output from lower LSTM layer at time $t$ as the input of upper LSTM layer at time $t$. Deep, hierarchical neural networks can be efficient at representing some functions and modeling varying-length dependencies\[[2](#reference)\].
However, in a deep LSTM network, any gradient propagated back in depth needs to traverse a large number of nonlinear steps. As a result, while LSTMs of 4 layers can be trained properly, those with 4-8 have much worse performance. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well. However, in a deep LSTM network, any gradient propagated back in depth needs to traverse a large number of nonlinear steps. As a result, while LSTMs of 4 layers can be trained properly, those with 4-8 have much worse performance. Conventional LSTMs prevent back-propagated errors from vanishing or exploding by introducing shortcut connections to skip the intermediate nonlinear layers. Therefore, deep LSTMs can consider shortcut connections in depth as well.
...@@ -112,7 +112,7 @@ Based on the stacked LSTMs, we add shortcut connections: take the input-to-hidde ...@@ -112,7 +112,7 @@ Based on the stacked LSTMs, we add shortcut connections: take the input-to-hidde
Fig.3 illustrates the final stacked recurrent neural networks. Fig.3 illustrates the final stacked recurrent neural networks.
<p align="center"> <p align="center">
<img src="./image/stacked_lstm_en.png" width = "40%" align=center><br> <img src="./image/stacked_lstm_en.png" width = "40%" align=center><br>
Fig 3. Stacked Recurrent Neural Networks Fig 3. Stacked Recurrent Neural Networks
</p> </p>
...@@ -124,28 +124,28 @@ While LSTMs can summarize the history, they can not see the future. Because most ...@@ -124,28 +124,28 @@ While LSTMs can summarize the history, they can not see the future. Because most
To address this, we can design a bidirectional recurrent neural network by making a minor modification. A higher LSTM layer can process the sequence in reversed direction with regards to its immediate lower LSTM layer, i.e., deep LSTM layers take turns to train on input sequences from left-to-right and right-to-left. Therefore, LSTM layers at time-step $t$ can see both histories and the future, starting from the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks. To address this, we can design a bidirectional recurrent neural network by making a minor modification. A higher LSTM layer can process the sequence in reversed direction with regards to its immediate lower LSTM layer, i.e., deep LSTM layers take turns to train on input sequences from left-to-right and right-to-left. Therefore, LSTM layers at time-step $t$ can see both histories and the future, starting from the second layer. Fig. 4 illustrates the bidirectional recurrent neural networks.
<p align="center"> <p align="center">
<img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br> <img src="./image/bidirectional_stacked_lstm_en.png" width = "60%" align=center><br>
Fig 4. Bidirectional LSTMs Fig 4. Bidirectional LSTMs
</p> </p>
Note that, this bidirectional RNNs is different from the one proposed by Bengio et al. in machine translation tasks \[[3](#Reference), [4](#Reference)\]. We will introduce another bidirectional RNNs in the following chapter [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md) Note that, this bidirectional RNNs is different from the one proposed by Bengio et al. in machine translation tasks \[[3](#reference), [4](#reference)\]. We will introduce another bidirectional RNNs in the following chapter [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md)
### Conditional Random Field (CRF) ### Conditional Random Field (CRF)
Typically, a neural network's lower layers learn representations while its very top layer accomplishs the final task. These principles can guide our problem-solving approaches. In SRL tasks, a **Conditional Random Field** (*CRF*) is built on top of the network in order to perform the final prediction to tag sequences. It takes representations provided by the last LSTM layer as input. Typically, a neural network's lower layers learn representations while its very top layer accomplishes the final task. These principles can guide our problem-solving approaches. In SRL tasks, a **Conditional Random Field** (*CRF*) is built on top of the network in order to perform the final prediction to tag sequences. It takes representations provided by the last LSTM layer as input.
The CRF is an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. In essence, CRFs learn the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input and $Y = (y_1, y_2, ... , y_n)$ are label sequences; to decode, simply search through $Y$ for a sequence that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$。 The CRF is an undirected probabilistic graph with nodes denoting random variables and edges denoting dependencies between these variables. In essence, CRFs learn the conditional probability $P(Y|X)$, where $X = (x_1, x_2, ... , x_n)$ are sequences of input and $Y = (y_1, y_2, ... , y_n)$ are label sequences; to decode, simply search through $Y$ for a sequence that maximizes the conditional probability $P(Y|X)$, i.e., $Y^* = \mbox{arg max}_{Y} P(Y | X)$。
Sequence tagging tasks do not assume a lot of conditional independence, because they only concern about the input and the output being linear sequences. Thus, the graph model of sequence tagging tasks is usually a simple chain or line, which results in a **Linear-Chain Conditional Random Field**, shown in Fig.5. Sequence tagging tasks do not assume a lot of conditional independence, because they only concern about the input and the output being linear sequences. Thus, the graph model of sequence tagging tasks is usually a simple chain or line, which results in a **Linear-Chain Conditional Random Field**, shown in Fig.5.
<p align="center"> <p align="center">
<img src="./image/linear_chain_crf.png" width = "35%" align=center><br> <img src="./image/linear_chain_crf.png" width = "35%" align=center><br>
Fig 5. Linear Chain Conditional Random Field used in SRL tasks Fig 5. Linear Chain Conditional Random Field used in SRL tasks
</p> </p>
By the fundamental theorem of random fields \[[5](#Reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form: By the fundamental theorem of random fields \[[5](#reference)\], the joint distribution over the label sequence $Y$ given $X$ has the form:
$$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$ $$p(Y | X) = \frac{1}{Z(X)} \text{exp}\left(\sum_{i=1}^{n}\left(\sum_{j}\lambda_{j}t_{j} (y_{i - 1}, y_{i}, X, i) + \sum_{k} \mu_k s_k (y_i, X, i)\right)\right)$$
...@@ -189,7 +189,7 @@ After these modifications, the model is as follows, as illustrated in Figure 6: ...@@ -189,7 +189,7 @@ After these modifications, the model is as follows, as illustrated in Figure 6:
4. Take the representation from step 3 as input to CRF, label sequence as a supervisory signal, and complete sequence tagging tasks. 4. Take the representation from step 3 as input to CRF, label sequence as a supervisory signal, and complete sequence tagging tasks.
<div align="center"> <div align="center">
<img src="image/db_lstm_network_en.png" width = "60%" align=center /><br> <img src="image/db_lstm_network_en.png" width = "60%" align=center /><br>
Fig 6. DB-LSTM for SRL tasks Fig 6. DB-LSTM for SRL tasks
</div> </div>
...@@ -207,7 +207,7 @@ conll05st-release/ ...@@ -207,7 +207,7 @@ conll05st-release/
└── words # text sequence └── words # text sequence
``` ```
The annotation information is derived from the results of Penn TreeBank\[[7](#references)\] and PropBank \[[8](# references)\]. The labeling of the PropBank is different from the labeling methods mentioned before, but shares with it the same underlying principle. For descriptions of the labeling, please refer to the paper \[[9](#references)\]. The annotation information is derived from the results of Penn TreeBank\[[7](#references)\] and PropBank \[[8](#references)\]. The labeling of the PropBank is different from the labeling methods mentioned before, but shares with it the same underlying principle. For descriptions of the labeling, please refer to the paper \[[9](#references)\].
The raw data needs to be preprocessed into formats that PaddlePaddle can handle. The preprocessing consists of the following steps: The raw data needs to be preprocessed into formats that PaddlePaddle can handle. The preprocessing consists of the following steps:
...@@ -308,7 +308,7 @@ Note that `hidden_dim = 512` means a LSTM hidden vector of 128 dimension (512/4) ...@@ -308,7 +308,7 @@ Note that `hidden_dim = 512` means a LSTM hidden vector of 128 dimension (512/4)
- Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences. - Transform the word sequence itself, the predicate, the predicate context, and the region mark sequence into embedded vector sequences.
```python ```python
# Since word vectorlookup table is pre-trained, we won't update it this time. # Since word vectorlookup table is pre-trained, we won't update it this time.
# is_static being True prevents updating the lookup table during training. # is_static being True prevents updating the lookup table during training.
...@@ -337,7 +337,7 @@ emb_layers.append(mark_embedding) ...@@ -337,7 +337,7 @@ emb_layers.append(mark_embedding)
- 8 LSTM units are trained through alternating left-to-right / right-to-left order denoted by the variable `reverse`. - 8 LSTM units are trained through alternating left-to-right / right-to-left order denoted by the variable `reverse`.
```python ```python
hidden_0 = paddle.layer.mixed( hidden_0 = paddle.layer.mixed(
size=hidden_dim, size=hidden_dim,
bias_attr=std_default, bias_attr=std_default,
...@@ -564,7 +564,7 @@ print pre_lab ...@@ -564,7 +564,7 @@ print pre_lab
Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to illustrate using PaddlePaddle to do sequence tagging tasks. The models proposed are from our published paper\[[10](#Reference)\]. We only use test data for illustration since the training data on the CoNLL 2005 dataset is not completely public. This aims to propose an end-to-end neural network model with fewer dependencies on natural language processing tools but is comparable, or even better than traditional models in terms of performance. Please check out our paper for more information and discussions. Semantic Role Labeling is an important intermediate step in a wide range of natural language processing tasks. In this tutorial, we use SRL as an example to illustrate using PaddlePaddle to do sequence tagging tasks. The models proposed are from our published paper\[[10](#Reference)\]. We only use test data for illustration since the training data on the CoNLL 2005 dataset is not completely public. This aims to propose an end-to-end neural network model with fewer dependencies on natural language processing tools but is comparable, or even better than traditional models in terms of performance. Please check out our paper for more information and discussions.
## Reference ## References
1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483. 1. Sun W, Sui Z, Wang M, et al. [Chinese semantic role labeling with shallow parsing](http://www.aclweb.org/anthology/D09-1#page=1513)[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3-Volume 3. Association for Computational Linguistics, 2009: 1475-1483.
2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013. 2. Pascanu R, Gulcehre C, Cho K, et al. [How to construct deep recurrent neural networks](https://arxiv.org/abs/1312.6026)[J]. arXiv preprint arXiv:1312.6026, 2013.
3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014. 3. Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](https://arxiv.org/abs/1406.1078)[J]. arXiv preprint arXiv:1406.1078, 2014.
......
...@@ -6,6 +6,7 @@ The source code of this tutorial is live at [book/machine_translation](https://g ...@@ -6,6 +6,7 @@ The source code of this tutorial is live at [book/machine_translation](https://g
Machine translation (MT) leverages computers to translate from one language to another. The language to be translated is referred to as the source language, while the language to be translated into is referred to as the target language. Thus, Machine translation is the process of translating from the source language to the target language. It is one of the most important research topics in the field of natural language processing. Machine translation (MT) leverages computers to translate from one language to another. The language to be translated is referred to as the source language, while the language to be translated into is referred to as the target language. Thus, Machine translation is the process of translating from the source language to the target language. It is one of the most important research topics in the field of natural language processing.
Early machine translation systems are mainly rule-based i.e. they rely on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one language. So it is quite a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#references)\]. Early machine translation systems are mainly rule-based i.e. they rely on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one language. So it is quite a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#references)\].
...@@ -57,7 +58,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent ...@@ -57,7 +58,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent
We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter. We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
Compared to a simple RNN, the LSTM added memory cell, input gate, forget gate and output gate. These gates combined with the memory cell greatly improve the ability to handle long-term dependencies. Compared to a simple RNN, the LSTM added memory cell, input gate, forget gate and output gate. These gates combined with the memory cell greatly improve the ability to handle long-term dependencies.
GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below. GRU\[[2](#references)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
A GRU unit has only two gates: A GRU unit has only two gates:
- reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output. - reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output.
- update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1. - update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1.
...@@ -68,11 +69,11 @@ Figure 2. A GRU Gate ...@@ -68,11 +69,11 @@ Figure 2. A GRU Gate
</p> </p>
Generally speaking, sequences with short distance dependencies will have an active reset gate while sequences with long distance dependency will have an active update date. Generally speaking, sequences with short distance dependencies will have an active reset gate while sequences with long distance dependency will have an active update date.
In addition, Chung et al.\[[3](#References)\] have empirically shown that although GRU has less parameters, it has similar performance to LSTM on several different tasks. In addition, Chung et al.\[[3](#references)\] have empirically shown that although GRU has less parameters, it has similar performance to LSTM on several different tasks.
### Bi-directional Recurrent Neural Network ### Bi-directional Recurrent Neural Network
We already introduced an instance of bi-directional RNN in the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md) chapter. Here we present another bi-directional RNN model with a different architecture proposed by Bengio et al. in \[[2](#References),[4](#References)\]. This model takes a sequence as input and outputs a fixed dimensional feature vector at each step, encoding the context information at the corresponding time step. We already introduced an instance of bi-directional RNN in the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md) chapter. Here we present another bi-directional RNN model with a different architecture proposed by Bengio et al. in \[[2](#references),[4](#references)\]. This model takes a sequence as input and outputs a fixed dimensional feature vector at each step, encoding the context information at the corresponding time step.
Specifically, this bi-directional RNN processes the input sequence in the original and reverse order respectively, and then concatenates the output feature vectors at each time step as the final output. Thus the output node at each time step contains information from the past and future as context. The figure below shows an unrolled bi-directional RNN. This network contains a forward RNN and backward RNN with six weight matrices: weight matrices from input to forward hidden layer and backward hidden ($W_1, W_3$), weight matrices from hidden to itself ($W_2, W_5$), matrices from forward hidden and backward hidden to output layer ($W_4, W_6$). Note that there are no connections between forward hidden and backward hidden layers. Specifically, this bi-directional RNN processes the input sequence in the original and reverse order respectively, and then concatenates the output feature vectors at each time step as the final output. Thus the output node at each time step contains information from the past and future as context. The figure below shows an unrolled bi-directional RNN. This network contains a forward RNN and backward RNN with six weight matrices: weight matrices from input to forward hidden layer and backward hidden ($W_1, W_3$), weight matrices from hidden to itself ($W_2, W_5$), matrices from forward hidden and backward hidden to output layer ($W_4, W_6$). Note that there are no connections between forward hidden and backward hidden layers.
...@@ -83,7 +84,7 @@ Figure 3. Temporally unrolled bi-directional RNN ...@@ -83,7 +84,7 @@ Figure 3. Temporally unrolled bi-directional RNN
### Encoder-Decoder Framework ### Encoder-Decoder Framework
The Encoder-Decoder\[[2](#References)\] framework aims to solve the mapping of a sequence to another sequence, for sequences with arbitrary lengths. The source sequence is encoded into a vector via an encoder, which is then decoded to a target sequence via a decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented via RNN. The Encoder-Decoder\[[2](#references)\] framework aims to solve the mapping of a sequence to another sequence, for sequences with arbitrary lengths. The source sequence is encoded into a vector via an encoder, which is then decoded to a target sequence via a decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented via RNN.
<p align="center"> <p align="center">
<img src="image/encoder_decoder_en.png" width=700><br/> <img src="image/encoder_decoder_en.png" width=700><br/>
...@@ -144,7 +145,7 @@ The generation process of machine translation is to translate the source sentenc ...@@ -144,7 +145,7 @@ The generation process of machine translation is to translate the source sentenc
There are a few problems with the fixed dimensional vector representation from the encoding stage: There are a few problems with the fixed dimensional vector representation from the encoding stage:
* It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence. * It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence.
* Intuitively, when translating a sentence, we typically pay more attention to the parts in the source sentence more relevant to the current translation. Moreover, the focus changes along the process of the translation. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention. This is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. Decoder with attention will be explained in the following. * Intuitively, when translating a sentence, we typically pay more attention to the parts in the source sentence more relevant to the current translation. Moreover, the focus changes along the process of the translation. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention. This is not reasonable. Therefore, Bahdanau et al. \[[4](#references)\] introduced attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. Decoder with attention will be explained in the following.
Different from the simple decoder, $z_i$ is computed as: Different from the simple decoder, $z_i$ is computed as:
...@@ -400,7 +401,7 @@ is_generating = False ...@@ -400,7 +401,7 @@ is_generating = False
max_length=max_length) max_length=max_length)
``` ```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details. Note: Our configuration is based on Bahdanau et al. \[[4](#references)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
## Model Training ## Model Training
......
...@@ -48,6 +48,7 @@ The source code of this tutorial is live at [book/machine_translation](https://g ...@@ -48,6 +48,7 @@ The source code of this tutorial is live at [book/machine_translation](https://g
Machine translation (MT) leverages computers to translate from one language to another. The language to be translated is referred to as the source language, while the language to be translated into is referred to as the target language. Thus, Machine translation is the process of translating from the source language to the target language. It is one of the most important research topics in the field of natural language processing. Machine translation (MT) leverages computers to translate from one language to another. The language to be translated is referred to as the source language, while the language to be translated into is referred to as the target language. Thus, Machine translation is the process of translating from the source language to the target language. It is one of the most important research topics in the field of natural language processing.
Early machine translation systems are mainly rule-based i.e. they rely on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one language. So it is quite a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#references)\]. Early machine translation systems are mainly rule-based i.e. they rely on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in one language. So it is quite a challenge for language experts to specify all possible rules in two or more different languages. Hence, a major challenge in conventional machine translation has been the difficulty in obtaining a complete rule set \[[1](#references)\].
...@@ -99,7 +100,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent ...@@ -99,7 +100,7 @@ This section will introduce Gated Recurrent Unit (GRU), Bi-directional Recurrent
We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter. We already introduced RNN and LSTM in the [Sentiment Analysis](https://github.com/PaddlePaddle/book/blob/develop/understand_sentiment/README.md) chapter.
Compared to a simple RNN, the LSTM added memory cell, input gate, forget gate and output gate. These gates combined with the memory cell greatly improve the ability to handle long-term dependencies. Compared to a simple RNN, the LSTM added memory cell, input gate, forget gate and output gate. These gates combined with the memory cell greatly improve the ability to handle long-term dependencies.
GRU\[[2](#References)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below. GRU\[[2](#references)\] proposed by Cho et al is a simplified LSTM and an extension of a simple RNN. It is shown in the figure below.
A GRU unit has only two gates: A GRU unit has only two gates:
- reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output. - reset gate: when this gate is closed, the history information is discarded, i.e., the irrelevant historical information has no effect on the future output.
- update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1. - update gate: it combines the input gate and the forget gate and is used to control the impact of historical information on the hidden output. The historical information is passed over when the update gate is close to 1.
...@@ -110,11 +111,11 @@ Figure 2. A GRU Gate ...@@ -110,11 +111,11 @@ Figure 2. A GRU Gate
</p> </p>
Generally speaking, sequences with short distance dependencies will have an active reset gate while sequences with long distance dependency will have an active update date. Generally speaking, sequences with short distance dependencies will have an active reset gate while sequences with long distance dependency will have an active update date.
In addition, Chung et al.\[[3](#References)\] have empirically shown that although GRU has less parameters, it has similar performance to LSTM on several different tasks. In addition, Chung et al.\[[3](#references)\] have empirically shown that although GRU has less parameters, it has similar performance to LSTM on several different tasks.
### Bi-directional Recurrent Neural Network ### Bi-directional Recurrent Neural Network
We already introduced an instance of bi-directional RNN in the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md) chapter. Here we present another bi-directional RNN model with a different architecture proposed by Bengio et al. in \[[2](#References),[4](#References)\]. This model takes a sequence as input and outputs a fixed dimensional feature vector at each step, encoding the context information at the corresponding time step. We already introduced an instance of bi-directional RNN in the [Semantic Role Labeling](https://github.com/PaddlePaddle/book/blob/develop/label_semantic_roles/README.md) chapter. Here we present another bi-directional RNN model with a different architecture proposed by Bengio et al. in \[[2](#references),[4](#references)\]. This model takes a sequence as input and outputs a fixed dimensional feature vector at each step, encoding the context information at the corresponding time step.
Specifically, this bi-directional RNN processes the input sequence in the original and reverse order respectively, and then concatenates the output feature vectors at each time step as the final output. Thus the output node at each time step contains information from the past and future as context. The figure below shows an unrolled bi-directional RNN. This network contains a forward RNN and backward RNN with six weight matrices: weight matrices from input to forward hidden layer and backward hidden ($W_1, W_3$), weight matrices from hidden to itself ($W_2, W_5$), matrices from forward hidden and backward hidden to output layer ($W_4, W_6$). Note that there are no connections between forward hidden and backward hidden layers. Specifically, this bi-directional RNN processes the input sequence in the original and reverse order respectively, and then concatenates the output feature vectors at each time step as the final output. Thus the output node at each time step contains information from the past and future as context. The figure below shows an unrolled bi-directional RNN. This network contains a forward RNN and backward RNN with six weight matrices: weight matrices from input to forward hidden layer and backward hidden ($W_1, W_3$), weight matrices from hidden to itself ($W_2, W_5$), matrices from forward hidden and backward hidden to output layer ($W_4, W_6$). Note that there are no connections between forward hidden and backward hidden layers.
...@@ -125,7 +126,7 @@ Figure 3. Temporally unrolled bi-directional RNN ...@@ -125,7 +126,7 @@ Figure 3. Temporally unrolled bi-directional RNN
### Encoder-Decoder Framework ### Encoder-Decoder Framework
The Encoder-Decoder\[[2](#References)\] framework aims to solve the mapping of a sequence to another sequence, for sequences with arbitrary lengths. The source sequence is encoded into a vector via an encoder, which is then decoded to a target sequence via a decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented via RNN. The Encoder-Decoder\[[2](#references)\] framework aims to solve the mapping of a sequence to another sequence, for sequences with arbitrary lengths. The source sequence is encoded into a vector via an encoder, which is then decoded to a target sequence via a decoder by maximizing the predictive probability. Both the encoder and the decoder are typically implemented via RNN.
<p align="center"> <p align="center">
<img src="image/encoder_decoder_en.png" width=700><br/> <img src="image/encoder_decoder_en.png" width=700><br/>
...@@ -186,7 +187,7 @@ The generation process of machine translation is to translate the source sentenc ...@@ -186,7 +187,7 @@ The generation process of machine translation is to translate the source sentenc
There are a few problems with the fixed dimensional vector representation from the encoding stage: There are a few problems with the fixed dimensional vector representation from the encoding stage:
* It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence. * It is very challenging to encode both the semantic and syntactic information a sentence with a fixed dimensional vector regardless of the length of the sentence.
* Intuitively, when translating a sentence, we typically pay more attention to the parts in the source sentence more relevant to the current translation. Moreover, the focus changes along the process of the translation. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention. This is not reasonable. Therefore, Bahdanau et al. \[[4](#References)\] introduced attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. Decoder with attention will be explained in the following. * Intuitively, when translating a sentence, we typically pay more attention to the parts in the source sentence more relevant to the current translation. Moreover, the focus changes along the process of the translation. With a fixed dimensional vector, all the information from the source sentence is treated equally in terms of attention. This is not reasonable. Therefore, Bahdanau et al. \[[4](#references)\] introduced attention mechanism, which can decode based on different fragments of the context sequence in order to address the difficulty of feature learning for long sentences. Decoder with attention will be explained in the following.
Different from the simple decoder, $z_i$ is computed as: Different from the simple decoder, $z_i$ is computed as:
...@@ -442,7 +443,7 @@ is_generating = False ...@@ -442,7 +443,7 @@ is_generating = False
max_length=max_length) max_length=max_length)
``` ```
Note: Our configuration is based on Bahdanau et al. \[[4](#Reference)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details. Note: Our configuration is based on Bahdanau et al. \[[4](#references)\] but with a few simplifications. Please refer to [issue #1133](https://github.com/PaddlePaddle/Paddle/issues/1133) for more details.
## Model Training ## Model Training
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册