diff --git a/paddle/operators/linear_chain_crf_op.cc b/paddle/operators/linear_chain_crf_op.cc
index 066bdf67aa037e9c25cfdfaff7ec8771eb59cde8..8e079a14e0a15e8ff803b6087e6b0b02083479ef 100644
--- a/paddle/operators/linear_chain_crf_op.cc
+++ b/paddle/operators/linear_chain_crf_op.cc
@@ -32,19 +32,19 @@ class LinearChainCRFOpMaker : public framework::OpProtoAndCheckerMaker {
         "[(D + 2) x D]. The learnable parameter for the linear_chain_crf "
         "operator. See more details in the operator's comments.");
     AddInput("Label",
-             "(LoDTensor, default LoDTensor) A LoDTensor with shape "
+             "(LoDTensor, default LoDTensor) A LoDTensor with shape "
              "[N x 1], where N is the total element number in a mini-batch. "
              "The ground truth.");
     AddOutput(
         "Alpha",
         "(Tensor, default Tensor) A 2-D Tensor with shape [N x D]. "
-        "The forward vectors for the entire batch. Denote it as \f$\alpha\f$. "
-        "\f$\alpha$\f is a memo table used to calculate the normalization "
-        "factor in CRF. \f$\alpha[k, v]$\f stores the unnormalized "
-        "probabilites of all possible unfinished sequences of tags that end at "
-        "position \f$k$\f with tag \f$v$\f. For each \f$k$\f, "
-        "\f$\alpha[k, v]$\f is a vector of length \f$D$\f with a component for "
-        "each tag value \f$v$\f. This vector is called a forward vecotr and "
+        "The forward vectors for the entire batch. Denote it as $\alpha$. "
+        "$\alpha$ is a memo table used to calculate the normalization "
+        "factor in CRF. $\alpha[k, v]$ stores the unnormalized "
+        "probabilities of all possible unfinished sequences of tags that end at "
+        "position $k$ with tag $v$. For each $k$, "
+        "$\alpha[k, v]$ is a vector of length $D$ with a component for "
+        "each tag value $v$. This vector is called a forward vector and "
         "will also be used in backward computations.")
         .AsIntermediate();
     AddOutput(
@@ -73,9 +73,9 @@ LinearChainCRF Operator.
 
 Conditional Random Field defines an undirected probabilistic graph with nodes
 denoting random variables and edges denoting dependencies between these
-variables. CRF learns the conditional probability \f$P(Y|X)\f$, where
-\f$X = (x_1, x_2, ... , x_n)\f$ are structured inputs and
-\f$Y = (y_1, y_2, ... , y_n)\f$ are labels for the inputs.
+variables. CRF learns the conditional probability $P(Y|X)$, where
+$X = (x_1, x_2, ... , x_n)$ are structured inputs and
+$Y = (y_1, y_2, ... , y_n)$ are labels for the inputs.
 
 Linear chain CRF is a special case of CRF that is useful for sequence labeling
-task. Sequence labeling tasks do not assume a lot of conditional
+tasks. Sequence labeling tasks do not assume a lot of conditional
@@ -88,21 +88,22 @@ CRF. Please refer to http://www.cs.columbia.edu/~mcollins/fb.pdf and
 http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf for details.
 
 Equation:
-1. Denote Input(Emission) to this operator as \f$x\f$ here.
+1. Denote Input(Emission) to this operator as $x$ here.
 2. The first D values of Input(Transition) to this operator are for starting
-weights, denoted as \f$a\f$ here.
+weights, denoted as $a$ here.
 3. The next D values of Input(Transition) of this operator are for ending
-weights, denoted as \f$b\f$ here.
-4. The remaning values of Input(Transition) are for transition weights,
-denoted as \f$w\f$ here.
-5. Denote Input(Label) as \f$s\f$ here.
-
-The probability of a sequence \f$s\f$ of length \f$L\f$ is defined as:
-\f$P(s) = (1/Z) \exp(a_{s_1} + b_{s_L}
-                 + \sum_{l=1}^L x_{s_l}
-                 + \sum_{l=2}^L w_{s_{l-1},s_l})\f$
-where \f$Z\f$ is a normalization value so that the sum of \f$P(s)\f$ over
-all possible sequences is \f$1\f$, and \f$x\f$ is the emission feature weight
+weights, denoted as $b$ here.
+4. The remaining values of Input(Transition) are for transition weights,
+denoted as $w$ here.
+5. Denote Input(Label) as $s$ here.
+
+The probability of a sequence $s$ of length $L$ is defined as:
+$$P(s) = (1/Z) \exp(a_{s_1} + b_{s_L}
+       + \sum_{l=1}^L x_{s_l}
+       + \sum_{l=2}^L w_{s_{l-1},s_l})$$
+
+where $Z$ is a normalization value so that the sum of $P(s)$ over
+all possible sequences is 1, and $x$ is the emission feature weight
 to the linear chain CRF.
 
 Finally, the linear chain CRF operator outputs the logarithm of the conditional
diff --git a/paddle/operators/softmax_op.cc b/paddle/operators/softmax_op.cc
index 93f89e33a73c5f4c6c0e5a8793a0abe7c692b656..93e0525badc26808f0dca70cc1153ac728f1fe9c 100644
--- a/paddle/operators/softmax_op.cc
+++ b/paddle/operators/softmax_op.cc
@@ -59,7 +59,7 @@ Then the ratio of the exponential of the given dimension and the sum of
 exponential values of all the other dimensions is the output of the softmax
 operator.
-For each row `i` and each column `j` in input X, we have:
+For each row $i$ and each column $j$ in Input(X), we have:
 
-$$Y[i, j] = \frac{\exp(X[i, j])}{\sum_j(exp(X[i, j])}$$
+$$Y[i, j] = \frac{\exp(X[i, j])}{\sum_j \exp(X[i, j])}$$
 
 )DOC");
diff --git a/paddle/operators/softmax_with_cross_entropy_op.cc b/paddle/operators/softmax_with_cross_entropy_op.cc
index 3dbb62d2e571eb92025c1b3fc0a6653c7cda007a..fc027d6f95cdbc24af59ef1188b6f16f6a93e85c 100644
--- a/paddle/operators/softmax_with_cross_entropy_op.cc
+++ b/paddle/operators/softmax_with_cross_entropy_op.cc
@@ -67,15 +67,15 @@ The equation is as follows:
 
 1) Hard label (one-hot label, so every sample has exactly one class)
 
-$$Loss_j = \f$ -\text{Logit}_{Label_j} +
-\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right),
-j = 1, ..., K $\f$$
+$$Loss_j = -\text{Logit}_{Label_j} +
+\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right),
+j = 1, ..., K$$
 
 2) Soft label (each sample can have a distribution over all classes)
 
-$$Loss_j = \f$ -\sum_{i=0}^{K}\text{Label}_i\left(\text{Logit}_i -
-\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right),
-j = 1,...,K $\f$$
+$$Loss_j = -\sum_{i=0}^{K}\text{Label}_i \left(\text{Logit}_i -
+\log\left(\sum_{i=0}^{K}\exp(\text{Logit}_i)\right)\right),
+j = 1, ..., K$$
 
 )DOC");
 }
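
As a quick numerical sanity check of the equations touched by this patch, here is a minimal brute-force C++ sketch of the linear-chain CRF probability defined in the linear_chain_crf docstring. It is an illustration of the formula only, not the operator's kernel; the tiny sizes (D = 2, L = 3) and all weight values are made up, and enumerating all D^L tag sequences is only feasible at toy scale. The point is that the P(s) values over all sequences sum to 1, exactly as the docstring's definition of Z requires:

```cpp
// Brute-force check of P(s); illustration only, not the Paddle kernel.
// Names follow the docstring: a = start weights, b = end weights,
// w = transition weights, x = emission weights; all values are toy data.
#include <cmath>
#include <cstdio>
#include <vector>

// Unnormalized log-score of tag sequence s:
// a_{s_1} + b_{s_L} + sum_l x[l][s_l] + sum_{l>=2} w[s_{l-1}][s_l].
double Score(const std::vector<std::vector<double>>& x,
             const std::vector<double>& a, const std::vector<double>& b,
             const std::vector<std::vector<double>>& w,
             const std::vector<int>& s) {
  double total = a[s.front()] + b[s.back()];
  for (size_t l = 0; l < s.size(); ++l) total += x[l][s[l]];
  for (size_t l = 1; l < s.size(); ++l) total += w[s[l - 1]][s[l]];
  return total;
}

int main() {
  const int D = 2, L = 3;  // D tag values, sequence length L
  const std::vector<std::vector<double>> x = {
      {0.1, 0.4}, {0.3, 0.2}, {0.5, 0.1}};
  const std::vector<double> a = {0.2, 0.1}, b = {0.3, 0.2};
  const std::vector<std::vector<double>> w = {{0.1, 0.3}, {0.2, 0.1}};

  int num = 1;  // number of possible tag sequences, D^L
  for (int l = 0; l < L; ++l) num *= D;

  // Z sums exp(Score) over all D^L sequences, so sum_s P(s) must be 1.
  double Z = 0.0;
  for (int code = 0; code < num; ++code) {
    std::vector<int> s(L);
    for (int l = 0, c = code; l < L; ++l, c /= D) s[l] = c % D;
    Z += std::exp(Score(x, a, b, w, s));
  }
  double sum_p = 0.0;
  for (int code = 0; code < num; ++code) {
    std::vector<int> s(L);
    for (int l = 0, c = code; l < L; ++l, c /= D) s[l] = c % D;
    sum_p += std::exp(Score(x, a, b, w, s)) / Z;
  }
  std::printf("sum of P(s) over all %d sequences = %.6f\n", num, sum_p);
}
```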
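The corrected softmax formula can be checked the same way. The sketch below is a plain single-row implementation; the max-subtraction guard is a standard numerical-stability trick added here for illustration (it is not part of the documented formula, and it leaves the result unchanged because the common factor cancels in the ratio):

```cpp
// Single-row softmax per the docstring's Y[i, j] formula; illustrative only.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

std::vector<double> SoftmaxRow(const std::vector<double>& x) {
  // Subtract the row max before exponentiating to avoid overflow;
  // the ratio Y[i, j] = exp(X[i, j]) / sum_j exp(X[i, j]) is unchanged.
  const double m = *std::max_element(x.begin(), x.end());
  double sum = 0.0;
  std::vector<double> y(x.size());
  for (size_t j = 0; j < x.size(); ++j) sum += (y[j] = std::exp(x[j] - m));
  for (double& v : y) v /= sum;
  return y;
}

int main() {
  // The outputs are positive and sum to 1 across the row.
  for (double v : SoftmaxRow({1.0, 2.0, 3.0})) std::printf("%f ", v);
  std::printf("\n");
}
```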
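Likewise for the two softmax_with_cross_entropy loss variants. The helper names below (LogSumExp, HardLabelLoss, SoftLabelLoss) are illustrative, not Paddle API, and the logits are toy values. A one-hot soft label should reproduce the hard-label loss for the same class, which the sketch prints as a consistency check between the two documented equations:

```cpp
// Hard-label and soft-label cross-entropy per the docstring equations;
// a sketch, not the fused softmax_with_cross_entropy kernel.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Numerically stable log(sum_i exp(logit_i)).
double LogSumExp(const std::vector<double>& logit) {
  const double m = *std::max_element(logit.begin(), logit.end());
  double sum = 0.0;
  for (double v : logit) sum += std::exp(v - m);
  return m + std::log(sum);
}

// Hard label: Loss = -Logit_{label} + log(sum_i exp(Logit_i)).
double HardLabelLoss(const std::vector<double>& logit, int label) {
  return -logit[label] + LogSumExp(logit);
}

// Soft label: Loss = -sum_i Label_i * (Logit_i - log(sum_i exp(Logit_i))).
double SoftLabelLoss(const std::vector<double>& logit,
                     const std::vector<double>& label) {
  const double lse = LogSumExp(logit);
  double loss = 0.0;
  for (size_t i = 0; i < logit.size(); ++i)
    loss -= label[i] * (logit[i] - lse);
  return loss;
}

int main() {
  const std::vector<double> logit = {2.0, 0.5, -1.0};
  // With a one-hot soft label, both variants give the same value.
  std::printf("hard: %f  soft(one-hot): %f\n", HardLabelLoss(logit, 0),
              SoftLabelLoss(logit, {1.0, 0.0, 0.0}));
}
```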