Commit 767fd05d, author: qq_43092590

Translation completed

Parent 8f2686d0
# Edit Distance
**Translator: [Master-cai](https://github.com/Master-cai)**
**Author: [labuladong](https://github.com/labuladong)**
A few days ago I saw a set of Tencent interview questions. Most of the algorithm problems were dynamic programming, and the last one asked for a function that computes the minimum edit distance. So today I wrote an article specifically to discuss this problem.
I personally like this problem because it looks very hard, yet the solution is surprisingly simple and elegant, and it is one of the few algorithms that is actually useful. (Yes, I admit that many algorithm problems are not very practical.) Here is the problem:
![](../pictures/editDistance/title.png)
Why do I say this problem is hard? Because it obviously is: it leaves people helpless and intimidated.
And why do I say it is useful? Because a few days ago I used the algorithm in my daily life. An article on my WeChat Official Account had a passage written in the wrong place by mistake, so I decided to revise that part to make the logic flow. However, a WeChat Official Account article can be modified by at most 20 characters, and only insertion, deletion, and replacement are supported (exactly the same as the edit distance problem). So I used the algorithm to find an optimal plan and finished the revision in just 16 steps.
A more advanced application: edit distance can measure the similarity of two DNA sequences. A DNA sequence is made up of A, G, C, and T, so it can be treated like a string. The smaller the edit distance, the more similar the two DNA sequences are. Maybe the owners of these DNA sequences were ancient relatives.
Let's get to the point. I will explain in detail how to compute the edit distance, and I hope you get something out of it.
### 1. Train of thought
The edit distance problem gives us two strings `s1` and `s2` and only three operations (insert, delete, replace), and asks us to turn `s1` into `s2` in the fewest steps. The first thing to note is that the result is the same whether we turn `s1` into `s2` or `s2` into `s1`, so we will use `s1` to `s2` as the example.
As I mentioned in the earlier article "The longest common subsequence", **to solve a dynamic programming problem on two strings, we normally use two pointers `i` and `j` that point to the ends of the two strings and then move toward the front step by step to shrink the problem.**
Suppose the two strings are "rad" and "apple". To change `s1` to `s2`, the algorithm works like this:
![](../pictures/editDistance/edit.gif)
![](../pictures/editDistance/1.jpg)
Keep this GIF in mind, because it shows how the edit distance is computed. The key is how to choose the right operation, which I will discuss later.
From the GIF above we can see that there are not just three operations; there is actually a fourth one, which is to do nothing (skip). For example:
![](../pictures/editDistance/2.jpg)
Since the two characters are already the same, obviously no operation should be applied to them if we want to minimize the distance. Just move `i` and `j` forward.
Another simple case: when `j` has finished `s2` but `i` has not finished `s1`, the only option is to delete the remaining characters of `s1` so that it becomes `s2`. For example:
![](../pictures/editDistance/3.jpg)
Similarly, if `i` has finished `s1` but `j` has not finished `s2`, the only option is to insert all the remaining characters of `s2` into `s1`. As you will see, these two cases are the **base cases** of the algorithm.
Let's look at how to turn these ideas into code. Sit tight, here we go.
### 2. Code in detail
First, let's sort out our ideas:
The base case is when `i` has finished `s1` or `j` has finished `s2`; we can directly return the remaining length of the other string.
For each pair of characters `s1[i]` and `s2[j]`, there are four possible operations:
```python
if s1[i] == s2[j]:
    # skip: do nothing
    # i, j move forward
    pass
else:
    # choose one of three:
    #   insert
    #   delete
    #   replace
    pass
```
With this framework, the problem is essentially solved. You may ask: how do we choose among the three options? It's very simple: try them all and pick whichever gives the smallest edit distance. This requires recursion. Look at the code:
```python
def minDistance(s1, s2) -> int:

    def dp(i, j):
        # base case
        if i == -1: return j + 1
        if j == -1: return i + 1

        if s1[i] == s2[j]:
            return dp(i - 1, j - 1)  # skip
        else:
            return min(
                dp(i, j - 1) + 1,     # insert
                dp(i - 1, j) + 1,     # delete
                dp(i - 1, j - 1) + 1  # replace
            )

    # i, j start at the last index of each string
    return dp(len(s1) - 1, len(s2) - 1)
```
Let's explain this recursive code in detail. The base case needs no explanation, so I will focus on the recursive part.
People say recursive code is very interpretable, and that makes sense: as long as you understand the definition of the function, you can clearly understand the logic of the algorithm. Here the function `dp(i, j)` is defined like this:
```python
def dp(i, j) -> int:
    # returns the minimum edit distance of s1[0..i] and s2[0..j]
    ...
```
**Remember this definition**, then look at this code:
```python
if s1[i] == s2[j]:
    return dp(i - 1, j - 1)  # skip
# explanation:
# the characters are already equal, so no operation is needed
# the minimum edit distance of s1[0..i] and s2[0..j] equals
# the minimum edit distance of s1[0..i-1] and s2[0..j-1]
# that is, dp(i, j) equals dp(i-1, j-1)
```
If `s1[i] != s2[j]`, we have to recurse on the three operations, which takes a bit of thought:
```python
dp(i, j - 1) + 1,    # insert
# explanation:
# insert a character equal to s2[j] right after s1[i]
# now s2[j] is matched, so move j forward and keep comparing with i
# don't forget to add one to the operation count
```
![](../pictures/editDistance/insert.gif)
```python
dp(i - 1, j) + 1,    # delete
# explanation:
# delete the character s1[i]
# move i forward and keep comparing with j
# add one to the operation count
```
![](../pictures/editDistance/delete.gif)
```python
dp(i - 1, j - 1) + 1 # replace
# explanation:
# replace s1[i] with s2[j], so the two characters now match
# move both i and j forward and keep comparing
# add one to the operation count
```
![](../pictures/editDistance/replace.gif)
By now you should fully understand this short and clever code. One remaining issue is that this is a brute-force solution: it has overlapping subproblems and should be optimized with dynamic programming techniques.
**How can we see the overlapping subproblems at a glance?** As mentioned in the earlier article "Regular Expressions for Dynamic Programming", we need to abstract the recursive framework of this algorithm:
```python
def dp(i, j):
    dp(i - 1, j - 1)  #1
    dp(i, j - 1)      #2
    dp(i - 1, j)      #3
```
For the subproblem `dp(i-1, j-1)`, how can we reach it from the original problem `dp(i, j)`? There is more than one path, for example `dp(i, j) -> #1` and `dp(i, j) -> #2 -> #3`. Once we find one repeated path, there must be a huge number of repeated paths, which means overlapping subproblems.
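To make the overlap concrete, here is a small sketch that instruments the brute-force recursion with a counter. The function name `count_calls` and the use of `collections.Counter` are choices made for this illustration only; the recursion itself is the same as the one above.

```python
from collections import Counter

def count_calls(s1: str, s2: str):
    """Run the brute-force recursion and count visits to each (i, j)."""
    calls = Counter()

    def dp(i, j):
        calls[(i, j)] += 1
        # base case: one of the strings is exhausted
        if i == -1:
            return j + 1
        if j == -1:
            return i + 1
        if s1[i] == s2[j]:
            return dp(i - 1, j - 1)   # skip
        return min(
            dp(i, j - 1) + 1,      # insert
            dp(i - 1, j) + 1,      # delete
            dp(i - 1, j - 1) + 1,  # replace
        )

    distance = dp(len(s1) - 1, len(s2) - 1)
    return distance, calls

distance, calls = count_calls("rad", "apple")
print(distance)                          # minimum edit distance
print(len(calls), sum(calls.values()))   # distinct subproblems vs. total calls
```

If the total number of calls is larger than the number of distinct `(i, j)` pairs, some subproblems were solved more than once, which is exactly the waste a memo or DP table removes.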
### 3. Optimization with dynamic programming
Overlapping subproblems were covered in detail in the earlier article "Detailed Explanation of Dynamic Programming". The optimization is nothing more than a memo or a DP table.
The memo is easy to add; just modify the original code slightly:
```python
def minDistance(s1, s2) -> int:

    memo = dict()  # memo
    def dp(i, j):
        if (i, j) in memo:
            return memo[(i, j)]
        ...

        if s1[i] == s2[j]:
            memo[(i, j)] = ...
        else:
            memo[(i, j)] = ...
        return memo[(i, j)]

    return dp(len(s1) - 1, len(s2) - 1)
```
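A complete version of the memoized recursion might look like the sketch below. It keeps the same `dp(i, j)` definition and a dictionary memo; the name `min_distance_memo` and the type hints are assumptions of this sketch, and for very long strings Python's recursion limit would need attention.

```python
def min_distance_memo(s1: str, s2: str) -> int:
    memo = {}  # (i, j) -> minimum edit distance of s1[0..i] and s2[0..j]

    def dp(i: int, j: int) -> int:
        if (i, j) in memo:
            return memo[(i, j)]
        # base case: one prefix is empty, so only inserts or deletes remain
        if i == -1:
            return j + 1
        if j == -1:
            return i + 1
        if s1[i] == s2[j]:
            memo[(i, j)] = dp(i - 1, j - 1)   # skip
        else:
            memo[(i, j)] = min(
                dp(i, j - 1) + 1,      # insert
                dp(i - 1, j) + 1,      # delete
                dp(i - 1, j - 1) + 1,  # replace
            )
        return memo[(i, j)]

    return dp(len(s1) - 1, len(s2) - 1)

print(min_distance_memo("rad", "apple"))  # 5
```

Each `(i, j)` pair is now computed at most once, so the running time is proportional to `len(s1) * len(s2)`.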
**We will mainly explain the DP table solution.**
First, let's make the meaning of the dp array clear. It is a two-dimensional array that looks like this:
![](../pictures/editDistance/dp.jpg)
With the recursive solution as groundwork, this should be easy to understand. `dp[..][0]` and `dp[0][..]` correspond to the base cases. The meaning of `dp[i][j]` is similar to the earlier dp function:
```python
def dp(i, j) -> int: ...
# returns the minimum edit distance of s1[0..i] and s2[0..j]

dp[i+1][j+1]
# stores the minimum edit distance of s1[0..i] and s2[0..j]
```
The base case of the dp function is `i` or `j` equal to -1, but array indices start at 0, so the dp array is offset by one position.
Since the dp array has the same meaning as the recursive dp function, you can directly apply the previous ideas to write the code. **The only difference is that the DP table is solved bottom-up, while the recursive solution is solved top-down**:
```java
int minDistance(String s1, String s2) {
    int m = s1.length(), n = s2.length();
    int[][] dp = new int[m + 1][n + 1];
    // base case
    for (int i = 1; i <= m; i++)
        dp[i][0] = i;
    for (int j = 1; j <= n; j++)
        dp[0][j] = j;
    // solve bottom-up
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            if (s1.charAt(i - 1) == s2.charAt(j - 1))
                dp[i][j] = dp[i - 1][j - 1];
            else
                dp[i][j] = min(
                    dp[i - 1][j] + 1,     // delete
                    dp[i][j - 1] + 1,     // insert
                    dp[i - 1][j - 1] + 1  // replace
                );
    // dp[m][n] stores the minimum edit distance of the whole s1 and s2
    return dp[m][n];
}

int min(int a, int b, int c) {
    return Math.min(a, Math.min(b, c));
}
```
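For readers following the Python snippets earlier in the article, the same table can be sketched in Python. This is a direct port of the Java code under the same index offset; the name `min_distance_table` is an assumption of this sketch.

```python
def min_distance_table(s1: str, s2: str) -> int:
    m, n = len(s1), len(s2)
    # dp[i][j] = minimum edit distance of the first i chars of s1
    # and the first j chars of s2 (offset by one from dp(i, j))
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # base case: turning a prefix into the empty string, or the reverse
    for i in range(1, m + 1):
        dp[i][0] = i
    for j in range(1, n + 1):
        dp[0][j] = j
    # solve bottom-up
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # skip
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],      # delete
                    dp[i][j - 1],      # insert
                    dp[i - 1][j - 1],  # replace
                )
    return dp[m][n]

print(min_distance_table("rad", "apple"))  # 5
print(min_distance_table("apple", "rad"))  # 5
```

The two calls also illustrate the earlier remark that turning `s1` into `s2` costs the same as turning `s2` into `s1`.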
### 4. Extension
Generally speaking, dynamic programming problems on two strings are handled along the lines of this article, by building a DP table. Why? Because it makes the state transitions easy to find, as in the DP table for edit distance:
![](../pictures/editDistance/4.jpg)
One more detail: since each `dp[i][j]` depends only on three neighboring states, the space complexity can be reduced to $O(\min(M, N))$ (where M and N are the lengths of the two strings). It is not difficult, but the code becomes much harder to read. You can try to optimize it yourself.
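One possible way to do this compression is sketched below. It keeps a single row of the table plus one saved diagonal value, and swaps the strings so the row follows the shorter one. The name `min_distance_compact` and the variable names `diag` and `up` are assumptions of this sketch, not the author's code.

```python
def min_distance_compact(s1: str, s2: str) -> int:
    # Make s2 the shorter string so the row has min(M, N) + 1 entries.
    # Swapping is safe because the edit distance is symmetric.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    n = len(s2)
    row = list(range(n + 1))       # dp[0][j] = j
    for i in range(1, len(s1) + 1):
        diag = row[0]              # dp[i-1][0]
        row[0] = i                 # dp[i][0] = i
        for j in range(1, n + 1):
            up = row[j]            # dp[i-1][j], read before it is overwritten
            if s1[i - 1] == s2[j - 1]:
                row[j] = diag      # skip: dp[i-1][j-1]
            else:
                # row[j-1] already holds dp[i][j-1]
                row[j] = 1 + min(up, row[j - 1], diag)
            diag = up              # becomes dp[i-1][j-1] for the next column
    return row[n]

print(min_distance_compact("rad", "apple"))  # 5
```

The trade-off mentioned above is visible here: the three neighbors are no longer named by their indices, so the reader has to track which value each variable currently holds.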
You may also ask: **we only found the minimum edit distance, so what are the actual operations?** In the example of revising the article mentioned earlier, knowing just the minimum edit distance is definitely not enough; you also need to know exactly how to make the changes.
It's actually very simple: modify the code slightly and add extra information to the dp array:
```java
// int[][] dp;
Node[][] dp;

class Node {
    int val;
    int choice;
    // 0: skip
    // 1: insert
    // 2: delete
    // 3: replace
}
```
The `val` field is the value from the previous dp array, and the `choice` field records the operation. When making the best choice, record the operation as well, then work backward from the result to recover the specific operations.
Our final result is `dp[m][n]`, where `val` holds the minimum edit distance and `choice` holds the last operation. If it is an insert, for example, you move one cell to the left:
![](../pictures/editDistance/5.jpg)
Repeating this process, you can walk back step by step to the starting point `dp[0][0]`, forming a path. Editing according to the operations on this path is the optimal plan.
![](../pictures/editDistance/6.jpg)
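The backtracking described above can be sketched in Python as follows. Instead of a `Node` class, this sketch keeps two parallel tables, `val` and `choice`, which carry the same information; the function name `edit_script`, the constants, and the text format of the returned operations are all assumptions made for illustration.

```python
from typing import List, Tuple

SKIP, INSERT, DELETE, REPLACE = 0, 1, 2, 3

def edit_script(s1: str, s2: str) -> Tuple[int, List[str]]:
    m, n = len(s1), len(s2)
    val = [[0] * (n + 1) for _ in range(m + 1)]
    choice = [[SKIP] * (n + 1) for _ in range(m + 1)]
    # base case: only deletes down the first column, only inserts along the first row
    for i in range(1, m + 1):
        val[i][0], choice[i][0] = i, DELETE
    for j in range(1, n + 1):
        val[0][j], choice[0][j] = j, INSERT
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                val[i][j], choice[i][j] = val[i - 1][j - 1], SKIP
            else:
                # pick the cheapest option and remember which one it was
                val[i][j], choice[i][j] = min(
                    (val[i][j - 1] + 1, INSERT),
                    (val[i - 1][j] + 1, DELETE),
                    (val[i - 1][j - 1] + 1, REPLACE),
                )
    # walk back from dp[m][n] to dp[0][0]
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        c = choice[i][j]
        if c == SKIP:
            ops.append(f"skip {s1[i - 1]!r}")
            i, j = i - 1, j - 1      # move diagonally
        elif c == INSERT:
            ops.append(f"insert {s2[j - 1]!r}")
            j -= 1                   # move one cell left
        elif c == DELETE:
            ops.append(f"delete {s1[i - 1]!r}")
            i -= 1                   # move one cell up
        else:
            ops.append(f"replace {s1[i - 1]!r} with {s2[j - 1]!r}")
            i, j = i - 1, j - 1      # move diagonally
    return val[m][n], ops

distance, ops = edit_script("rad", "apple")
print(distance)
for op in ops:
    print(op)
```

The operations come out in the order the pointers move, from the ends of the strings toward the front. When several operations tie for the minimum, this sketch simply takes the one with the smallest code, so other equally short edit scripts may exist.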
The above is the entire content of the edit distance algorithm.