String subsequences - Extracting the LCS →
In the previous post we saw how to compute the length of the longest common subsequence (LCS). The algorithm returned an integer value indicating the length of the LCS. How do we go about extracting the LCS? With a simple extension to the previous algorithm, we can also extract the string sequence that is the LCS. In particular, we will traverse the populated lookup table and extract the LCS from that.
Let us consider a simple example, so that we can work it out by hand. Let the string s1 be ABCD and string s2 be BAD. Once we populate the lookup table as shown previously, it would look something like the one below
Populating the lookup table is the same as in our previous post, with just a reference to our lookup table passed to our function.
Note that when either of the strings is empty, LCS is also empty and hence the length of the LCS is 0.
If one observes the LCS algorithm carefully, when a character of the strings match, in order to compute the value at (i, j) we look at the smaller subproblem (diagonally above and to the left - at coordinates (i-1, j-1) and add 1 to it.
This tells us that if there is a match, the character in the string s1[m-1] (or s2[n-1]) would be a part of the LCS and hence we collect it in our result string and call our processing (the lookup table) function on the rest of the table, starting at (m-1, n-1).
If there is no match, then we compute the max of the subsequence lengths from (m-1, n) and (m, n-1). Potentially these could be two subsequences with the last character (that matched) the same.
Essentially, for the value of the cell (i, j) we were computing the max of the values in the top cell (i-1, j) and the left cell (i, j-1). In order to extract the LCS, we hence need to move in the direction of the max (i.e. whether the current cell value was influenced by the top cell or the left cell).
Putting this together,
The traversal of the lookup table extracting the LCS ad is shown below:
But LCS is not unique. There can be many LCS for a pair of strings. How do we collect all of the longest common subsequences?