Strings are just sequences of characters but their applications go beyond literature and CS. They are particularly popular in Bioinformatics. All proteins, DNA can infact be represented as a sequence of alphabets. For example, DNA sequences are usually represented using sequences of A, T, G, C alphabets which indicate adenine, thymine, guanine and cytosine nucleobases. Proteins usually consist of 20 different letters indicating 20 different amino acids. Comparing DNA or protein sequences is an important part of molecular biology research. It has a wide array of applications including:

  • identifying common motifs using common subsequences
  • predicting the function of proteins
  • understanding similarities for drug discovery and design

How does one compare sequences? They might be millions of alphabets long. One of the useful ways to compare sequences is to find the longest common subsequence between two or more sequences. Note that subsequence is different from substring in that the alphabets are not required to be in consecutive positions. For example, DNA sequences TAGTCACG and AGACTGTC have a longest common subsequence of length 5 - AGACG. There can be many such subsequences.

Let us look at a simple recursive algorithm to find the longest common subsequence among two strings. Suppose we have two input strings s1 and s2 of lengths m and n and we want to return the length of the longest common subsequence. For any recursive algorithm, the recipe is to find the base case and recursive step.

  • The base case is simple - if one of the input strings is of length 0, the longest common subsequence (LCS) is also of length 0 and we just return 0.
if (m == 0 || n == 0)
   return 0; 
  • For the recursive step, we can match the characters in the string one by one. If the character we are looking at matches, we recursively call the algorithm on the rest of the string and increment our length count by 1.
lcs_r(s1, s2, m - 1, n - 1) + 1;

If the character we are looking at does not match, we want to find the maximum of LCS of the two possibilities (s1 without the current character and s2 without the current character).

std::max(lcs_r(s1, s2, m, n - 1),
         lcs_r(s1, s2, m - 1, n));

Putting it all together,

int lcs_r(std::string s1, std::string s2, int m, int n)
{
  // base case
  if (m == 0 || n == 0)
    return 0;
  
  // recursion
  // last char match
  if (s1[m - 1] == s2[n - 1])
    return lcs_r(s1, s2, m - 1, n - 1) + 1;

  // last char mismatch
  return std::max(lcs_r(s1, s2, m, n - 1),
                  lcs_r(s1, s2, m - 1, n));
}

Note that the time complexity of the above solution is $O(2^{m+n})$.