• Languages, mind and our way of thinking... →

    Allez les Bleus!!

    Just happened to listen to this week’s episode of one of the recent NPR podcast series titled Hidden Brain and it got me thinking a lot. I speak almost 5 languages (not fluently but to different degrees of expertise) and I am not sure how my mind thinks. One thing for sure though is that I happen to think in my mother tongue and my mind translates it in (almost) real-time while I communicate, be it talking or writing down my thoughts as I do while writing this blog post. If languages do influence our thinking and understanding and more importantly the biases we cultivate, what does it mean for language translators - do they also convey the biases that are inherently influenced by one’s language? Or as a technologist I am also inclined to think what would it mean for AI systems (say chat-bots) that are meant to facilitate communication? I now tend to believe that they have to be designed with the intricacies of languages and the influences, biases that it carries along encoded into the system in order for it to be more acceptable. Even if not as a chat-bot, it would be a great system to train people to work with people of different cultures so that we understand different cultures better and comfortably communicate taking into account their nuances.

    If you are interested in listening to it, here it is:

  • String subsequences - memoization →

    Since we remember the past, we know that we can try and remember intermediate computations of our recursive Longest Common Subsequence (LCS) algorithm we developed previously in order to reduce the cost of making several recursive calls. We can use a hash-table to remember the intermediate results.

    Since we are looking at the lengths of two substrings while matching the characters, our intermediate results would be of the form int_result[i][j] for different lengths i and j of the two strings. Before we make our recursive calls, we just check the stored intermediate result to see if we have computed the LCS for the substring lengths.

    In particular, we use an unordered_map and form a key (here, we use a string key) using the different lengths of strings m and n:

    std::unordered_map<std::string, int> lookup;
    std::string key = std::to_string(m) + ":" + std::to_string(n);

    If the character we are analyzing matches, we update the key and increment our length counter.

    lookup[key] = lcs_r(s1, s2, m - 1, n - 1) + 1;

    If not, we make recursive calls including and excluding characters from the two strings (very similar to incl and excl we saw earlier) and compute the maximum of the two returned values (as we are computing the longest common subsequence).

    lookup[key] = std::max(lcs_r(s1, s2, m - 1, n),
                           lcs_r(s1, s2, m, n - 1));

    Putting it all together:

    std::unordered_map<std::string, int> lookup;
    int lcs_r(std::string s1, std::string s2, int m, int n)
    {
      // base case
      if (m == 0 || n == 0)
        return 0;
      std::string key = std::to_string(m) + ":" + std::to_string(n);
      if (lookup.find(key) == lookup.end())
      {
        if (s1[m - 1] == s2[n - 1])
          lookup[key] = lcs_r(s1, s2, m - 1, n - 1) + 1;
        else
          lookup[key] = std::max(lcs_r(s1, s2, m - 1, n),
                                 lcs_r(s1, s2, m, n - 1));
      }
      return lookup[key];
    }

    The time complexity of this solution is O(mn) with a space complexity of O(mn) to store the lookup table, where m and n are the lengths of the input strings.

  • String subsequences - Basics →

    Strings are just sequences of characters but their applications go beyond literature and CS. They are particularly popular in Bioinformatics. All proteins, DNA can infact be represented as a sequence of alphabets. For example, DNA sequences are usually represented using sequences of A, T, G, C alphabets which indicate adenine, thymine, guanine and cytosine nucleobases. Proteins usually consist of 20 different letters indicating 20 different amino acids. Comparing DNA or protein sequences is an important part of molecular biology research. It has a wide array of applications including:

    • identifying common motifs using common subsequences
    • predicting the function of proteins
    • understanding similarities for drug discovery and design

    How does one compare sequences? They might be millions of alphabets long. One of the useful ways to compare sequences is to find the longest common subsequence between two or more sequences. Note that subsequence is different from substring in that the alphabets are not required to be in consecutive positions. For example, DNA sequences TAGTCACG and AGACTGTC have a longest common subsequence of length 5 - AGACG. There can be many such subsequences.

    Let us look at a simple recursive algorithm to find the longest common subsequence among two strings. Suppose we have two input strings s1 and s2 of lengths m and n and we want to return the length of the longest common subsequence. For any recursive algorithm, the recipe is to find the base case and recursive step.

    • The base case is simple - if one of the input strings is of length 0, the longest common subsequence (LCS) is also of length 0 and we just return 0.
    if (m == 0 || n == 0)
       return 0; 
    • For the recursive step, we can match the characters in the string one by one. If the character we are looking at matches, we recursively call the algorithm on the rest of the string and increment our length count by 1.
    lcs_r(s1, s2, m - 1, n - 1) + 1;

    If the character we are looking at does not match, we want to find the maximum of LCS of the two possibilities (s1 without the current character and s2 without the current character).

    std::max(lcs_r(s1, s2, m, n - 1),
             lcs_r(s1, s2, m - 1, n));

    Putting it all together,

    int lcs_r(std::string s1, std::string s2, int m, int n)
    {
      // base case
      if (m == 0 || n == 0)
        return 0;
      
      // recursion
      // last char match
      if (s1[m - 1] == s2[n - 1])
        return lcs_r(s1, s2, m - 1, n - 1) + 1;
    
      // last char mismatch
      return std::max(lcs_r(s1, s2, m, n - 1),
                      lcs_r(s1, s2, m - 1, n));
    }

    Note that the time complexity of the above solution is $O(2^{m+n})$.

  • Generating Optimal Value from Shipping →

    Last week your old CS friend helped you refine your algorithm. Now, he has offered to send one of his students to do an internship in your shipping business (your business never had interns) to help you refine your algorithms. You invite the student to show your code and the refined version and he says apart from refining this algorithm, he would also gather data on daily shipping and use data science and machine learning to help you predict your returns from shipping. You seem impressed and let him do his work.

    He revises your algorithm and adds the following comments:

    This is a typical dynamic programming problem. It has:
    - optimal substructure: the optimal solution can be built successively from optimal solutions of the subproblems
    - overlapping subproblems : the problem can be broken down into subproblems of smaller size
    

    He goes on to write that it could also be solved in a bottom-up fashion rather than a top-down manner as your lecturer friend did. That is, instead of starting with the big problem and breaking it down into smaller subproblems, you could start with the smallest subproblem you can imagine (say your ship’s capacity is 0) and build up solutions to larger subproblems progressively.

    The idea here is to store the results of intermediate subproblems in some data structure (very much like the hash-table we saw earlier) and refer to the values in the data structure as we build solutions to larger subproblems. Since we look at the number of items left to decide (whether to accept or not) and the threshold (capacity) of our boat, we use a 2D vector max_val to store the result.

    std::vector<std::vector<int>> max_val(n+1, std::vector<int>(t+1));

    where n is the number of elements and t is the threshold (capacity).

    If the capacity of our ship is 0 we cannot carry anything and hence the value we would obtain is 0.

    // if the threshold is 0, max_val is 0
    // as you can't carry anything
    for (int i = 0; i <= n; i++)
      max_val[i][0] = 0;

    Similarly, if there are no items to carry, we don’t get any value out of our business.

    // if there are 0 items, max_val is 0
    for (int j = 0; j <= t; j++)
      max_val[0][j] = 0;

    We can now build solutions for other values of i (1..n) and j (1..t) by iterating over the possible values and using the solutions to the smallest subproblems computed above. Now again, we need to check if an item i would fit in our boat (it would fit only if it is within the ships remaining capacity). If it would fit, we need to check if it is a good choice to carry it, for which we do a comparision of the value we obtain with and without the item i, very similar to incl and excl earlier. If j is the remaining threshold, and w[i-1] is the weight of the current item,

    // if item i fits in the knapsack, add its value
    if (j - w[i-1] >= 0)
      max_val[i][j] = std::max(v[i-1] + max_val[i-1][j - w[i-1]], max_val[i-1][j]);

    Putting it all together,

    int knapsack(std::vector<int>& w, std::vector<int>& v, int n, int t)
    {
      std::vector<std::vector<int>> max_val(n+1, std::vector<int>(t+1));
    
      for (int j = 0; j <= t; j++)
        max_val[0][j] = 0;
      
      for (int i = 0; i <= n; i++)
        max_val[i][0] = 0;
    
      // for each item in the set
      for (int i = 1; i <= n; i++)
      {
        // for every weight upto max threshold
        for (int j = 1; j <= t; j++)
        {
          if (j - w[i-1] >= 0)
            max_val[i][j] = std::max(v[i-1] + max_val[i-1][j - w[i-1]], max_val[i-1][j]);
          else 
            max_val[i][j] = max_val[i-1][j];
        }
      }
      return max_val[n][t];
    }
  • Making value from shipping products efficiently →

    Last week, you wrote a recursive algorithm to choose which products to accept for your shipping. Brexit is all over the news and with different kinds of cross border levies that might come into place, your shipping business has become even more chaotic. You get lots of items that need to be shipped but most of them low value and you don’t want to pass up on those. So you want to make your algorithm scale up to large number of items and still create the best value for you. Your CS skills are rusty at this point and you are busy reading and discussing Brexit news. You decide to meet an old classmate in a pub just to rant about Brexit. He had gone to school with you studying CS and is now a lecturer in a small college nearby.

    You meet him in the pub, and after getting tired of Brexit discussions, you start reminiscing about your college days and the teachers you had. You recall the algorithm you wrote the previous week and start telling your friend about it. You friend is asking you why you had to write a recursive algorithm and how do you plan to scale it up when you have lots of inputs. You don’t want to think hard on the problem and you just order another drink. You show your algorithm to your friend in your chromebook. Your friend, being a lecturer, opens starts editing the program and tells you:

    Those who cannot remember the past are condemned to repeat it

    You think he is talking about one of the many previous trips to the bar when you got heavily drunk. You don’t remember those vividly anyway. He was actually talking about remembering intermediate results as you make recursive calls. He shows you how to quickly put a hash-table and store the intermediate results as you make recursive calls.

    You declare an unordered map lookup and populate it when we compute max(incl, excl).

    std::unordered_map<std::string, int> lookup;
    // initially, we have n items and `t` threshold
    std::string key = std::to_string(n) + ":" + std::to_string(t);

    For every item i we have to decide whether or not to carry, we look at the value of the item and also the remaining threshold t when we make recursive calls to knapsack() function. We can check if we have a result already computed and stored (as a result of a previous call).

    if (lookup.find(key) == lookup.end())
    {
      ...
      // make recursive calls; compute incl, excl
      // compute max(incl, excl)
      lookup[key] = std::max(incl, excl);
      ...
    }
    std::unordered_map<std::string, int> lookup;
    
    int knapsack(std::vector<int>& w, std::vector<int>& v, int n, int t)
    {
      // base case
      if (t < 0)
        return INT8_MIN;
    
      if (n < 0 || t == 0)
        return 0;
    
      std::string key = std::to_string(n) + ":" + std::to_string(t);
    
      if (lookup.find(key) == lookup.end())
      {
        // include the current element in the solution
        int incl = v[n] + knapsack(w, v, n-1, t - w[n]);
        // exclude the current element
        int excl = knapsack(w, v, n-1, t);
        lookup[key] = std::max(incl, excl);
      }
      return lookup[key];
    }

    The next day, you realize that you are able to use the algorithm modified by your friend on much larger number of items. Remembering is not that bad after all.