Commit 282823b8 authored by Yifei Feng, committed by GitHub

Merge pull request #5503 from yifeif/r0.11

Reformat markdown.
@@ -436,35 +436,35 @@ you a desirable model size.
Finally, let's take a minute to talk about what the Logistic Regression model
actually looks like in case you're not already familiar with it. We'll denote
the label as \\(Y\\), and the set of observed features as a feature vector
\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). We define \\(Y=1\\) if an individual earned >
50,000 dollars and \\(Y=0\\) otherwise. In Logistic Regression, the probability of
the label being positive (\\(Y=1\\)) given the features \\(\mathbf{x}\\) is given
as:

$$ P(Y=1|\mathbf{x}) = \frac{1}{1+\exp(-(\mathbf{w}^T\mathbf{x}+b))}$$

where \\(\mathbf{w}=[w_1, w_2, ..., w_d]\\) are the model weights for the features
\\(\mathbf{x}=[x_1, x_2, ..., x_d]\\). \\(b\\) is a constant that is often called
the **bias** of the model. The equation consists of two parts, a linear model and
a logistic function:
* **Linear Model**: First, we can see that \\(\mathbf{w}^T\mathbf{x}+b = b +
  w_1x_1 + ... + w_dx_d\\) is a linear model where the output is a linear
  function of the input features \\(\mathbf{x}\\). The bias \\(b\\) is the
  prediction one would make without observing any features. The model weight
  \\(w_i\\) reflects how the feature \\(x_i\\) is correlated with the positive
  label. If \\(x_i\\) is positively correlated with the positive label, the
  weight \\(w_i\\) increases, and the probability \\(P(Y=1|\mathbf{x})\\) will be
  closer to 1. On the other hand, if \\(x_i\\) is negatively correlated with the
  positive label, then the weight \\(w_i\\) decreases and the probability
  \\(P(Y=1|\mathbf{x})\\) will be closer to 0.
* **Logistic Function**: Second, we can see that there's a logistic function
  (also known as the sigmoid function) \\(S(t) = 1/(1+\exp(-t))\\) being applied
  to the linear model. The logistic function is used to convert the output of
  the linear model \\(\mathbf{w}^T\mathbf{x}+b\\) from any real number into the
  range of \\([0, 1]\\), which can be interpreted as a probability (see the
  sketch after this list).
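To make the two pieces concrete, here is a minimal NumPy sketch that evaluates
the formula above for a single example. It is not part of the tutorial code:
the weights `w`, bias `b`, and feature vector `x` below are made-up values
purely for illustration; in the tutorial they are learned from the data.

```python
import numpy as np

def predict_proba(x, w, b):
    """Compute P(Y=1|x) = sigmoid(w^T x + b) for one feature vector."""
    linear = np.dot(w, x) + b             # linear model: w^T x + b
    return 1.0 / (1.0 + np.exp(-linear))  # logistic (sigmoid) function

# Hypothetical weights and features, for illustration only.
w = np.array([0.8, -1.2, 0.3])
b = -0.5
x = np.array([1.0, 0.0, 2.0])

print(predict_proba(x, w, b))  # a value in (0, 1), read as P(Y=1|x)
```
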
Model training is an optimization problem: The goal is to find a set of model
weights (i.e. model parameters) to minimize a **loss function** defined over the
...
@@ -157,8 +157,8 @@ The higher the `dimension` of the embedding is, the more degrees of freedom the
model will have to learn the representations of the features. For simplicity, we
set the dimension to 8 for all feature columns here. Empirically, a more
informed decision for the number of dimensions is to start with a value on the
order of \\(k\log_2(n)\\) or \\(k\sqrt[4]{n}\\), where \\(n\\) is the number of unique
features in a feature column and \\(k\\) is a small constant (usually smaller than
10).
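As a rough illustration of this rule of thumb, the small Python sketch below
computes both candidate values; the helper name and the choice of \\(k = 5\\)
are made up for this example and are not part of the tutorial.

```python
import math

def embedding_dim_candidates(n, k=5):
    """Rule-of-thumb embedding sizes k*log2(n) and k*n**0.25 for a feature
    column with n unique values (hypothetical helper, for illustration)."""
    return k * math.log2(n), k * n ** 0.25

# e.g. a column with 10,000 unique feature values and k = 5:
print(embedding_dim_candidates(10000))  # roughly (66.4, 50.0); round as needed
```

Either value is only a starting point; the dimension that works best for a
given column is ultimately decided empirically.
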
Through dense embeddings, deep models can generalize better and make predictions
...