Let us begin the tutorial with a classical problem called Linear Regression \[[1](#References)\].
The source code for this tutorial lives on [book/fit_a_line](https://github.com/PaddlePaddle/book/tree/develop/01.fit_a_line). For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).
## Problem Setup
Suppose we have a dataset of $n$ real estate properties. These properties will be referred to as *homes* in this chapter for clarity.
Each home is associated with $d$ attributes. The attributes describe characteristics such as the number of rooms in the home, the number of schools or hospitals in the neighborhood, and the traffic conditions nearby.
The price of a home is modeled as a linear combination of its attributes:

$$y = \omega_1 x_1 + \omega_2 x_2 + \ldots + \omega_d x_d + b = \vec{\omega}^T \vec{x} + b$$

where $\vec{\omega}$ and $b$ are the model parameters we want to estimate. Once they are learned, we will be able to predict the price of a home, given the attributes associated with it. We call this model **Linear Regression**. In other words, we want to regress a value against several values linearly. In practice, a linear model is often too simplistic to capture the real relationships between the variables. Yet, because Linear Regression is easy to train and analyze, it has been applied to a large number of real problems. As a result, it is an important topic in many classic Statistical Learning and Machine Learning textbooks \[[2,3,4](#References)\].
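As a quick, self-contained illustration of this model (separate from the tutorial's PaddlePaddle code), the sketch below fits $\vec{\omega}$ and $b$ by ordinary least squares with NumPy; the synthetic data and all variable names are assumptions made for the example.

```python
import numpy as np

# Synthetic data: n homes, each with d attributes (illustrative only).
n, d = 100, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                         # one row of attributes per home
true_w, true_b = np.array([2.0, -1.0, 0.5, 3.0]), 4.0
y = X @ true_w + true_b + 0.1 * rng.normal(size=n)  # noisy "prices"

# Append a column of ones so the bias b is estimated as the last weight.
X1 = np.hstack([X, np.ones((n, 1))])
params, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = params[:-1], params[-1]
print("estimated w:", w, "estimated b:", b)         # close to true_w and true_b
```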
## Results Demonstration
We first show the result of our model. The dataset [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) is used to train a linear model to predict home prices in Boston. The figure below shows the predictions the model makes for some home prices. The $X$-axis represents the median value of the prices of similar homes within a bin, while the $Y$-axis represents the home value our linear model predicts. The dotted line represents points where $X=Y$. The more precise the model's predictions, the closer its points lie to the dotted line.
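A figure of this kind takes only a few lines of matplotlib to produce. Below is a minimal sketch, assuming `y_true` holds the binned median prices and `y_pred` the model's predictions; the placeholder numbers are not from the actual model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed placeholder data standing in for real predictions.
y_true = np.array([10.0, 15.0, 21.0, 24.0, 30.0])  # median value per bin
y_pred = np.array([11.0, 14.0, 22.0, 27.0, 29.0])  # model predictions

plt.scatter(y_true, y_pred)
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
plt.plot(lims, lims, linestyle=":")                # the dotted X = Y line
plt.xlabel("median value of similar homes")
plt.ylabel("predicted home value")
plt.show()
```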
The source code is located at [book/machine_translation](https://github.com/PaddlePaddle/book/tree/develop/08.machine_translation).
Machine translation (MT) leverages computers to translate from one language to another. The language to be translated is referred to as the source language, while the language to be translated into is referred to as the target language. Thus, machine translation is the process of translating from the source language to the target language. It is one of the most important research topics in the field of natural language processing.
Early machine translation systems were mainly rule-based, i.e., they relied on a language expert to specify the translation rules between the two languages. It is quite difficult to cover all the rules used in even one language, so it is a major challenge for language experts to specify all possible rules across two or more languages. Hence, a central difficulty of conventional machine translation has been obtaining a complete rule set \[[1](#References)\].
To address the aforementioned problems, statistical machine translation techniques were developed. These techniques learn the translation rules from a large corpus, instead of relying on rules designed by a language expert. While they overcome the bottleneck of knowledge acquisition, many challenges remain, for example:
1. Human-designed features cannot cover all possible linguistic variations;
2. It is difficult to use global features;
3. The techniques rely heavily on pre-processing steps such as word alignment, word segmentation, tokenization, rule extraction, and syntactic parsing, and errors introduced in any of these steps can accumulate and degrade translation quality.
The recent development of deep learning provides new solutions to these challenges. The two main categories of deep-learning-based machine translation techniques are:
1. Techniques based on a statistical machine translation system, with key components such as the language model or the reordering model improved with neural networks (see the left part of Figure 1);
2. Techniques that map from the source language to the target language directly using a neural network, i.e., end-to-end neural machine translation (NMT).
<p align="center">
<img src="image/nmt_en.png" width=400><br/>
...
There are three steps for encoding a sentence:

1. One-hot vector representation of a word: each word of the dictionary is represented as a vector whose entries are all zero except for a single one at the index of that word.
2. Word embedding as a representation in the low-dimensional semantic space: there are two problems with the one-hot vector representation:

   * The dimensionality of the vector is typically large, leading to the curse of dimensionality;
   * It is hard to capture the relationships between words, i.e., semantic similarities. Therefore, it is useful to project the one-hot vector into a low-dimensional semantic space as a dense vector with fixed dimensions, i.e., $s_i=Cw_i$ for the $i$-th word, where $C\in \mathbb{R}^{K\times \left | V \right |}$ is the projection matrix and $K$ is the dimensionality of the word embedding vector (a minimal sketch of this projection appears after this list).
3. Encoding of the source sequence via RNN: This can be described mathematically as:
...
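To make steps 2 and 3 concrete, here is a minimal NumPy sketch with assumed toy dimensions, using a plain tanh recurrence rather than the GRU the tutorial actually employs: the lookup computes $s_i = Cw_i$ (for a one-hot $w_i$ this is just selecting a column of $C$), and the recurrent pass then produces one hidden state per source word.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, H = 1000, 32, 64              # vocabulary size |V|, embedding dim K, hidden dim (assumed)
C = rng.normal(size=(K, V))         # projection matrix C in R^{K x |V|}
W_h = 0.1 * rng.normal(size=(H, H))
W_s = 0.1 * rng.normal(size=(H, K))

word_ids = [5, 42, 7]               # a toy source sentence as dictionary indices

# Step 2: s_i = C w_i.  With w_i one-hot, the product reduces to a column lookup.
S = [C[:, i] for i in word_ids]

# Step 3 (simplified): h_i = tanh(W_h h_{i-1} + W_s s_i), one state per word.
h = np.zeros(H)
hidden_states = []
for s in S:
    h = np.tanh(W_h @ h + W_s @ s)
    hidden_states.append(h)
# hidden_states now encodes the source sentence.
```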
5. Training mode:

   - The word embedding of the target language, `trg_embedding`, is passed to `gru_decoder_with_attention` as `current_word`.
   - `recurrent_group` calls `gru_decoder_with_attention` in a recurrent way.
   - The sequence of next words from the target language is used as the label (`lbl`).
   - Multi-class cross-entropy (`classification_cost`) is used to calculate the cost.
```python
if not is_generating:
    ...
```
6. Generating mode:

   - The decoder predicts the next target word based on the last generated target word; the embedding of the last generated word is automatically obtained by GeneratedInputs.
   - `beam_search` calls `gru_decoder_with_attention` in a recurrent way to predict sequence ids.
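Conceptually, beam search keeps the `beam_size` most probable partial translations at each step and expands each with the decoder's next-word distribution. The following plain-Python sketch illustrates that algorithm only; `step_log_probs` is a hypothetical stand-in for the decoder, and none of this is PaddlePaddle's `beam_search` implementation.

```python
import math

def step_log_probs(prefix):
    # Hypothetical stand-in for the decoder: in the real model these
    # log-probabilities would come from gru_decoder_with_attention.
    probs = [0.1, 0.2, 0.3, 0.4]    # fixed toy distribution over 4 word ids
    return [math.log(p) for p in probs]

def beam_search(beam_size=2, max_len=5, eos_id=0):
    beams = [([], 0.0)]             # (generated word ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos_id:       # keep finished hypotheses as-is
                candidates.append((seq, score))
                continue
            for word_id, lp in enumerate(step_log_probs(seq)):
                candidates.append((seq + [word_id], score + lp))
        # prune to the beam_size best partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

print(beam_search())                # e.g. [([3, 3, 3, 3, 3], ...), ([3, 3, 3, 3, 2], ...)]
```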