Commit 15b428a8, authored by Tomasz Surowiec, committed by labuladong

[English] Correct Edit Distance

(I feel a need to preface this by saying that the changes I made were based solely on the English version, and the reading experience, for I do not speak Chinese (I'm not even sure if it's Mandarin, Cantonese or which one, nor if there's any difference in writing). Should you object to any changes, please let me know)

1. Changed the apostrophe character from *`* to *'*. The former is not used as an apostrophe, and is actually an accent mark (for preferred characters, refer to [this](http://snowball.tartarus.org/texts/apostrophe.html) (U+0027 ', U+2019 ’, U+201B ‛)).
2. Changed certain verb forms. For instance: *"The last question is that **writing** a function to calculate [...]"* to *"The last question is **to write** a function which calculates [...]"* (also *to* -> *which* in order to avoid the repetition).
3. Changed the capitalization where necessary. For example: *Helpless* to *helpless* (since it's not a proper name); *We* to *we* (middle of a sentence).
4. Changed a few verbs to nouns, and vice versa: *"explain"* (on its own serves as an imperative, which would demand that the reader explain it) to *"explanation"*; *"storage"* to "store [the least editing distance of s1 and s2; L226]".
5. Fixed redundant spaces: *"[...] `s2`  in [...]"* to *"[...] `s2` in [...]"*; *"s1[i]!=s2[j]"* to *"s1[i] != s2[j]"*
6. Changed a few odd phrases: *"a violent solution"* to *"a brute force solution"* ([most](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiolLjsxK3xAhUspIsKHU-9BwEQFjADegQIBBAD&url=https%3A%2F%2Fgsdrc.org%2Fwp-content%2Fuploads%2F2019%2F11%2F671_P-CVE_Programming_on_Men_Women_Boys_and_Girls.pdf&usg=AOvVaw2QjJ_-yjxxw2oWYtG77sgE), albeit [not all](https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiolLjsxK3xAhUspIsKHU-9BwEQFjABegQIAhAE&url=https%3A%2F%2Fwww.programmersought.com%2Farticle%2F27011223160%2F&usg=AOvVaw1f6OCbf1r7M8GAfmGotBDg), uses of the former did not refer to programming); *"from the bottom to up"* to *"from the bottom up"* (or *"bottom-up"*; see: [from the bottom up](https://www.lexico.com/definition/from_the_bottom_up), and [bottom-up](https://dictionary.cambridge.org/dictionary/english/bottom-up)); *"operation number"* to *"the number of operations"* (this one might have been more subjective).
7. Removed latex dollar signs (since github markdown does not support it): *"$O(min(M, N))$"* to *"O(min(M, N))"* (as used in other articles).
8. Pluralized certain nouns after *"any"* (["We use any for indefinite quantities in questions and negative sentences"](https://dictionary.cambridge.org/grammar/british-grammar/any)): *"any operation"* to *"any operations"*.
9. Added some formatting: "dp(i, j)" to "`dp(i, j)`" (L98).

I'm also not sure if *"[...] I wrote some words out of place by mistake"* refers to mistyping, but it's still more than clear, albeit slightly wordy.

To reiterate, should any of these be of concern, let me know, especially since it is my second language.
Best regards
Parent ac3abb66
......@@ -4,17 +4,17 @@
**Author: [labuladong](https://github.com/labuladong)**
few days ago, I saw an interview paper of Tencent. In this paper, most of the algorithm problems are Dynamic programming. The last question is that writing a function to calculate the shortest editing Distance. Today I wrote an article specifically to discuss this problem.
A few days ago, I saw an interview paper from Tencent. In this paper, most of the algorithm problems are dynamic programming problems. The last question is to write a function which calculates the shortest editing distance. Today I wrote an article specifically to discuss this problem.
I personally like this problem because it looks very hard, the solution is Surprisingly simple and beautiful and it`s a rare algorithm which is very useful.(yech, I recognized that many algorithm problems are not very useful.)Following is the problem:
I personally like this problem because it looks very hard, yet the solution is surprisingly simple and beautiful. It's also one of the rare algorithms that is genuinely useful (yeah, I recognize that many algorithm problems are not very useful). The problem is as follows:
![](../pictures/editDistance/title.png)
Why I say this problem is hard? Obviously, it`s just hard, making people Helpless and frightened.
Why did I say this problem is hard? Obviously, it's just hard, making people helpless and frightened.
And why I say this problem is useful? Because days ago I used the algorithm in my daily life. I had a article in my ‎Wechat Official Account and I wrote some words out of place by mistake. So I decided to modify this part to make the logic suitable. However, the Wechat Official Account article can only be modified 20 words at most, and it only supports addition, deletion and replacement(exactly same as the editing distance problem.) So I used the algorithm to find a best way to solve the problem in just 16 steps.
And why did I say this problem is useful? Because a few days ago I used the algorithm in my daily life. I had an article in my WeChat Official Account and I wrote some words out of place by mistake, so I decided to modify that part to make the logic read correctly. However, a WeChat Official Account article can be modified by at most 20 words, and only addition, deletion and replacement are supported (exactly the same as the editing distance problem). So I used the algorithm to find the best way to solve the problem in just 16 steps.
Another advanced example is that the edit distance can be used to measure the similarity of two DNA sequences. The DNA sequence is a sequence included of A, G, C and T, which is similar to a string. The less editing distance is, The more similar the two DNA are. Maybe the owner of these DNAs were ancient relatives.
Another advanced example is that the edit distance can be used to measure the similarity of two DNA sequences. A DNA sequence consists of A, G, C and T, which makes it similar to a string. The shorter the editing distance is, the more similar the two DNA sequences are. Maybe the owners of these sequences were ancient relatives.
Let's get to the point. I will explain the editing distance problem in detail, and I hope you get something useful out of it.
......@@ -22,9 +22,9 @@ Let's get to the point, I will explain you how to edit the distance in detail, a
### 1. train of thought
The editing distance is a problem that give us two strings `s1` and `s2` with only three operations and let\`s change `s1` to `s2` in least steps. The first thing to be sure of is that the result of `s1` to `s2` and `s2` to `s1` is the same. So we will use `s1` to `s2` as an example.
The editing distance is a problem where we are given two strings `s1` and `s2` with only three operations, and we have to transform `s1` into `s2` in the fewest steps. The first thing to be sure of is that the results of `s1` to `s2` and `s2` to `s1` are the same. So we will use `s1` to `s2` as an example.
Mentioned in the early paper "The longest common subsequence", **I said that to solve the dynamic programming problem of two strings, We normally use two pointers `i`, `j` to point to the end of the two strings, and then go forward step by step to reduce the size of the problem.**
In the earlier article "The longest common subsequence", **I said that to solve a dynamic programming problem on two strings, we normally use two pointers `i`, `j` to point to the ends of the two strings, and then move them toward the front step by step to reduce the size of the problem.**
Assuming that the two strings are "rad" and "apple", in order to change `s1` to `s2`, the algorithm works like this:
......@@ -35,11 +35,11 @@ Remember this gif in order to solve the editing distance problem. The key is how
the right operation which I will discuss later.
According to the above gif, we can figure out that there are not only three operations, in fact there is the fourth operation which is skip. For example:
According to the above gif, we can figure out that there are not only three operations; in fact, there is a fourth operation which is skip. For example:
![](../pictures/editDistance/2.jpg)
As the two strings are same, obviously there should be no operation to minimize the distance. Just move `i`, `j`.
As the two characters are the same, obviously there should be no operation to minimize the distance. Just move `i`, `j`.
Another simple situation is when `j` has finished `s2` but `i` has not finished `s1`; then you can only delete the remaining characters of `s1` to make them the same. For example:
......@@ -47,17 +47,17 @@ Another simple situation is when `j` has finished `s2`, if `i` has not finished
![](../pictures/editDistance/3.jpg)
Similarly, if `i` finished `s1` and `j` has not finished `s2`, you can only insert all the remaining characters of `s2` into `s1` by inserting. As you see, the two cases are the **base case** of the algorithm.
Similarly, if `i` has finished `s1` and `j` has not finished `s2`, you can only insert all the remaining characters of `s2` into `s1`. As you see, these two cases are the **base cases** of the algorithm.
Let\`s look at how to change your ideas into code. Sit tight, it's time to go.
Let's look at how to turn these ideas into code. Sit tight, it's time to go.
### 2. code in detail
First we sort out our ideas:
The base case is when `i` finished `s1` or `j` finished `s2`, we can return the remaining length of another string directly.
The base case is when `i` has finished `s1` or `j` has finished `s2`; then we can directly return the remaining length of the other string.
For each pair characters, `s1[i]` and `s2[j]`, there are four operations:
For each pair of characters, `s1[i]` and `s2[j]`, there are four operations:
```python
if s1[i] == s2[j]:
......@@ -70,7 +70,7 @@ else:
replace
```
With this framework, the problem has been solved. Maybe you will ask, how to chose the "three choices"? It\`s very simple, try it all, and chose the smallest one. we need some recursive skills here.Look at the code:
With this framework, the problem has been solved. Maybe you will ask, how do we choose among the "three choices"? It's very simple: try them all, and choose the smallest one. We need some recursion skills here. Look at the code:
```python
def minDistance(s1, s2) -> int:
......@@ -93,62 +93,62 @@ def minDistance(s1, s2) -> int:
return dp(len(s1) - 1, len(s2) - 1)
```
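Since the listing above is cut off by the diff hunk, here is a minimal self-contained sketch of the brute force recursion as the surrounding text describes it. The name `min_distance` and the type hints are assumptions made for illustration; the base cases and the three choices follow the description in this article.

```python
def min_distance(s1: str, s2: str) -> int:
    """Brute force edit distance (exponential time, fine for short strings)."""

    def dp(i: int, j: int) -> int:
        # Base cases: one pointer has run off the front of its string,
        # so the remaining prefix of the other string must be
        # inserted or deleted in full.
        if i == -1:
            return j + 1
        if j == -1:
            return i + 1

        if s1[i] == s2[j]:
            return dp(i - 1, j - 1)  # skip: characters already match

        return min(
            dp(i, j - 1) + 1,      # insert
            dp(i - 1, j) + 1,      # delete
            dp(i - 1, j - 1) + 1,  # replace
        )

    # i and j start at the last characters of s1 and s2
    return dp(len(s1) - 1, len(s2) - 1)
```

For example, `min_distance("rad", "apple")` evaluates to 5 under this sketch.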
Let\`s explain this recursive code in detail. There is no need to explain the base case, so I mainly explain the recursive part.
Let's explain this recursive code in detail. There is no need to explain the base case, so I'll mainly explain the recursive part.
It is said that recursive code is very interpretable. It does make sense. As long as you understand the definition of a function, you can clearly understand the logic of the algorithm. The function dp(i, j) is defined like this:
It is said that recursive code is very interpretable. It does make sense. As long as you understand the definition of a function, you can clearly understand the logic of the algorithm. The function `dp(i, j)` is defined like this:
```python
def dp(i, j) -> int:
    # return the least editing distance of s1[0..i] and s2[0..j]
```
**Remember this definition**, let\`s look at the code:
**Remember this definition**; now let's look at the code:
```python
if s1[i] == s2[j]:
return dp(i - 1, j - 1) # skip
# explain:
# already the same, no need any operation
# explanation:
# already the same, no need for any operations
# the least editing distance of s1[0..i] and s2[0..j] equals
# the least distance of s1[0..i-1] s2[0..j-1]
# the least distance of s1[0..i-1] and s2[0..j-1]
# It means that dp(i, j) equals dp(i-1, j-1)
```
if `s1[i]!=s2[j]`, we should recurse the three operations which needs a bit of thing:
If `s1[i] != s2[j]`, we should recurse on the three operations, which takes a bit of thought:
```python
dp(i, j - 1) + 1, # insert
# explain:
# I Directly insert a character same as s2[j] at s1[i]
# then s2[j] are matched,move forward j,and continue compareed with i
# Don`t forget to add one to the operation number
# explanation:
# Directly insert the same character as s2[j] at s1[i]
# then s2[j] is matched, move j forward, and continue comparing with i
# Don't forget to add one to the number of operations
```
![](../pictures/editDistance/insert.gif)
```python
dp(i - 1, j) + 1, # delete
# explain:
# I directly delete s[i]
# move forward i,continue to compared with j
# add one to the operation number
# explanation:
# Directly delete s1[i]
# move i forward, continue comparing with j
# add one to the number of operations
```
![](../pictures/editDistance/delete.gif)
```python
dp(i - 1, j - 1) + 1 # replace
# explain:
# I directly replace s1[i] with s2[j], then they are matched
# move forward i,j and continue to compare
# add one to operation number
# explanation:
# Directly replace s1[i] with s2[j], then they are matched
# move i, j forward and continue comparing
# add one to the number of operations
```
![](../pictures/editDistance/replace.gif)
Now, you should fully understand this short and clever code. Another small problem is that this is a violent solution. There are many overlapping subproblems, which should to be optimized by dynamic programming techniques.
By now you should fully understand this short and clever code. One remaining problem is that this is a brute force solution: there are many overlapping subproblems, which should be optimized with dynamic programming techniques.
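One common way to remove the repeated work is to cache each `dp(i, j)` result. The sketch below does this with Python's `functools.lru_cache`. It is an illustration of the memoization idea, not necessarily the code used in the article, and the name `min_distance_memo` is hypothetical.

```python
from functools import lru_cache


def min_distance_memo(s1: str, s2: str) -> int:
    """Top-down edit distance with memoization: O(M * N) distinct states."""

    @lru_cache(maxsize=None)  # each (i, j) pair is computed only once
    def dp(i: int, j: int) -> int:
        if i == -1:
            return j + 1
        if j == -1:
            return i + 1
        if s1[i] == s2[j]:
            return dp(i - 1, j - 1)  # skip
        return min(
            dp(i, j - 1) + 1,      # insert
            dp(i - 1, j) + 1,      # delete
            dp(i - 1, j - 1) + 1,  # replace
        )

    return dp(len(s1) - 1, len(s2) - 1)
```

Because the recursion depth grows with `len(s1) + len(s2)`, very long inputs may hit Python's recursion limit; the bottom-up table avoids that.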
**How can we see the overlapping subproblems at a glance?** As mentioned in the previous article "Regular Expressions for Dynamic Programming", we need to abstract the recursive framework of the algorithm in this article:
**How can we see the overlapping subproblems at a glance?** As mentioned in the previous article "Regular Expressions for Dynamic Programming", we need to abstract the recursive framework of the algorithm in this article:
```python
def dp(i, j):
......@@ -189,19 +189,19 @@ First, we declare the meaning of the dp array. The dp array is a two-dimensional
![](../pictures/editDistance/dp.jpg)
With the foundation of the previous recursive solution, it\`s easy to understand. `dp [..][0]` and `dp [0][..]` correspond to the base case. The meaning of `dp [i][j]` is similar to the previous dp function:
With the foundation of the previous recursive solution, it's easy to understand. `dp[..][0]` and `dp[0][..]` correspond to the base cases. The meaning of `dp[i][j]` is similar to that of the previous dp function:
```python
def dp(i, j) -> int
# return the least editing distance of s1[0..i] and s2[0..j]
dp[i+1][j+1]
# storage the least editing distance of s1[0..i] and s2[0..j]
# store the least editing distance of s1[0..i] and s2[0..j]
```
The base case of the dp function is that `i, j` is equal to -1. However the array index is at least 0, the dp array is offset by one position.
The base case of the dp function is when `i` or `j` equals -1. However, an array index is at least 0, so the dp array is offset by one position.
Since the dp array has the same meaning as the recursive dp function, you can directly apply the previous ideas to write code. **The only difference is that the DP table is solved from the bottom to up, and the recursive solution is solved from the top to down**:
Since the dp array has the same meaning as the recursive dp function, you can directly apply the previous ideas to write code. **The only difference is that the DP table is solved bottom-up, and the recursive solution is solved top-down**:
```java
int minDistance(String s1, String s2) {
......@@ -212,7 +212,7 @@ int minDistance(String s1, String s2) {
dp[i][0] = i;
for (int j = 1; j <= n; j++)
dp[0][j] = j;
// from the bottom to up
// from the bottom up
for (int i = 1; i <= m; i++)
for (int j = 1; j <= n; j++)
if (s1.charAt(i-1) == s2.charAt(j-1))
......@@ -223,7 +223,7 @@ int minDistance(String s1, String s2) {
dp[i][j - 1] + 1,
dp[i-1][j-1] + 1
);
// storage the least editing distance of s1 and s2
// store the least editing distance of s1 and s2
return dp[m][n];
}
......@@ -234,15 +234,15 @@ int min(int a, int b, int c) {
### 4. Extension
Generally speaking, when dealing with the dynamic programming of two strings, we just follow the ideas of this article, making the DP table. Why? Because it\`s easy to find out the relationship of the state transitions, such as the DP table of the edit distance:
Generally speaking, when dealing with a dynamic programming problem on two strings, we follow the ideas of this article and build a DP table. Why? Because the table makes the state transitions easy to see, as in the DP table for the edit distance:
![](../pictures/editDistance/4.jpg)
There is another detail: since every `dp[i][j]` is only related to the three status, the space complexity can be reduced to $O(min(M, N))$ (M, N is the length of the two strings). It\`s not very difficult but the code is harder to read. You can try to optimize it by yourself.
There is another detail: since every `dp[i][j]` depends only on the three neighboring states, the space complexity can be reduced to O(min(M, N)) (M, N are the lengths of the two strings). It's not very difficult, but the code is harder to read. You can try to optimize it yourself.
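As one possible take on that exercise, the sketch below keeps a single row of the table and puts the shorter string along it, which is safe because the distance from `s1` to `s2` equals the distance from `s2` to `s1`. The function name `min_distance_rolling` and the variable names are assumptions made for illustration.

```python
def min_distance_rolling(s1: str, s2: str) -> int:
    """Bottom-up edit distance using O(min(M, N)) extra space."""
    # Put the shorter string along the row; the result is unchanged
    # because edit distance is symmetric.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    n = len(s2)

    row = list(range(n + 1))  # row 0 of the table: dp[0][j] = j
    for i in range(1, len(s1) + 1):
        diag = row[0]  # dp[i-1][0], the diagonal neighbor for j = 1
        row[0] = i     # dp[i][0] = i
        for j in range(1, n + 1):
            up = row[j]  # dp[i-1][j], saved before it is overwritten
            if s1[i - 1] == s2[j - 1]:
                row[j] = diag  # skip
            else:
                row[j] = 1 + min(
                    row[j - 1],  # insert:  dp[i][j-1]
                    up,          # delete:  dp[i-1][j]
                    diag,        # replace: dp[i-1][j-1]
                )
            diag = up  # becomes dp[i-1][j-1] for the next column
    return row[n]
```

The trade-off is the one the text mentions: the `diag` and `up` bookkeeping saves memory but is harder to read than the full table.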
Maybe you will also ask, **As we only found the minimum editing distance, how can we know the every step?** In the example of modifying the article you mentioned earlier, only a editing distance is definitely not enough. You must know how to modify it.
You may also ask, **since we only found the minimum editing distance, how can we know each step?** In the example of modifying the article that I mentioned earlier, the editing distance alone is definitely not enough. You must know how to modify it.
Actually, it\`s very simple, just slightly modified the code and add additional information to the dp array:
Actually, it's very simple: just slightly modify the code and add extra information to the dp array:
```java
// int[][] dp;
......@@ -258,13 +258,13 @@ class Node {
}
```
The `val` attribute is the value of the previous dp array, and the` choice` attribute represents the operation. When making the best choice, record the operation and then infer the specific operation from the result.
The `val` attribute is the value of the previous dp array, and the `choice` attribute represents the operation. When making the best choice, record the operation and then infer the specific operation from the result.
Our final result is `dp [m] [n]`, where `val` holds the minimum edit distance, and` choice` holds the last operation, such as the insert operation, then you can move one space to the left:
Our final result is `dp[m][n]`, where `val` holds the minimum edit distance and `choice` holds the last operation. If it is the insert operation, for example, you can move one cell to the left:
![](../pictures/editDistance/5.jpg)
Repeat this process, you can return to the starting point `dp [0] [0]` step by step to form a path. Editing according to the operations on this path is the best solution.
Repeating this process, you can return to the starting point `dp[0][0]` step by step to form a path. Editing according to the operations on this path is the best solution.
![](../pictures/editDistance/6.jpg)
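The backtracking idea can be sketched in Python as follows. Instead of a `Node` class, each cell here is a `(val, choice)` tuple, and the operation labels and the function name `min_distance_with_ops` are assumptions made for illustration. When several choices tie, this sketch simply takes whichever tuple compares smallest, so it returns one optimal path among possibly many.

```python
def min_distance_with_ops(s1: str, s2: str):
    """Return (distance, operations) that transform s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = (val, choice) for s1[0..i-1] and s2[0..j-1]
    dp = [[(0, "")] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, "delete")
    for j in range(1, n + 1):
        dp[0][j] = (j, "insert")

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = (dp[i - 1][j - 1][0], "skip")
            else:
                # Tuples compare by val first; ties fall back to the label.
                dp[i][j] = min(
                    (dp[i][j - 1][0] + 1, "insert"),
                    (dp[i - 1][j][0] + 1, "delete"),
                    (dp[i - 1][j - 1][0] + 1, "replace"),
                )

    # Walk back from dp[m][n] to dp[0][0], collecting the operations.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        choice = dp[i][j][1]
        if choice == "skip":
            i, j = i - 1, j - 1
        elif choice == "insert":
            ops.append(f"insert {s2[j - 1]!r}")
            j -= 1  # move one cell to the left
        elif choice == "delete":
            ops.append(f"delete {s1[i - 1]!r}")
            i -= 1  # move one cell up
        else:
            ops.append(f"replace {s1[i - 1]!r} with {s2[j - 1]!r}")
            i, j = i - 1, j - 1
    ops.reverse()  # collected from the end of the strings to the start
    return dp[m][n][0], ops
```

The number of recorded operations always equals the returned distance, since every non-skip step on the path adds exactly one to `val`.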
......