• Reinforcement Learning - Theoretical Foundations →

    In the previous post, we saw the basic concepts and terminology used in Reinforcement Learning (RL). In RL, Markov Decision Processes (MDPs) are used to describe the environment.

    The future is independent of the past given the present

    A state is considered Markov if it captures all the relevant information from history. The system might have started from some initial state and gone through 100 transitions, but one does not need to remember the 100 intermediate states in order to describe the system going forward; the current state must capture all of the relevant information. For example, if I say my age is 30 years (state), it fully captures what my age was in each of the last 30 years (the state in each year of history). More formally,

    \[\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \cdots, S_t]\]

    We also talk about state transitions. More often than not, these transitions are probabilistic. If I am a 30 year old engineer (my current state), there is a relatively high probability that I can start afresh, learn machine learning from YouTube and become a machine learning engineer (new state), but there is a very low probability that I can start afresh and become a gymnast (new state). Given a current state s, there is a probabilistic transition to the next state s'. More formally,

    \[\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]\]
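    As a hypothetical illustration (the numbers are mine, not from any real data): suppose that from the state engineer the system moves to the state ml-engineer with probability 0.7, to gymnast with probability 0.05, and stays engineer with probability 0.25. The transition probabilities out of any state must sum to 1:

    \[\mathcal{P}_{\text{engineer},\,\text{ml-engineer}} = 0.7, \quad \mathcal{P}_{\text{engineer},\,\text{gymnast}} = 0.05, \quad \mathcal{P}_{\text{engineer},\,\text{engineer}} = 0.25\]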

    A reward is a function on states. Given a state s, it tells you the expected immediate reward received at the next time step when the current state is s. The total reward one could accumulate going forward is a different quantity, the return, which we will define shortly. More formally,

    \[\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]\]

    If I am a young graduate student, I might be more focussed on long-term rewards (higher pay later on) at the cost of sacrificing short-term rewards (early employment). If I were a trader in a hedge fund, I might be more focussed on short-term bets (short selling) rather than long-term growth companies. It is fair to say that the rewards in each of the above cases accrue over different horizons. One could be myopic and be interested in short-term rewards, or one could be far-sighted and wait for long-term rewards. In order to accommodate this, we introduce a discount factor $\gamma \in [0, 1]$, which when close to $0$ indicates interest in short-term rewards and when close to $1$ indicates far-sightedness. The expected total reward, also called the return in RL literature, is the total discounted reward from a given point in time. More formally,

    \[G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^{i} R_{t+i+1}\]
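    As a quick illustration (the numbers here are made up): if the rewards received from time $t+1$ onwards are $1$, $2$, $3$ and then $0$ forever, and $\gamma = 0.9$, the return works out to

    \[G_t = 1 + 0.9 \times 2 + 0.9^2 \times 3 = 1 + 1.8 + 2.43 = 5.23\]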

    How does one evaluate a state s? One needs a metric to compare states, to say that one state is better than another. The state value function v(s) gives the long-term value of a state s, i.e. the value accumulated into the future if we start from state s. More formally,

    \[v(s) = \mathbb{E}[G_t \mid S_t = s]\]

    It is the expectation of the return starting from state s. Since the return itself is a discounted sum of rewards, the value function can be written recursively:

    \[v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]\]
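    To see where this recursive form comes from, we can unroll the definition of the return given earlier:

    \[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) = R_{t+1} + \gamma G_{t+1}\]

    Taking the expectation conditioned on $S_t = s$, and noting that the expected return from the next state $S_{t+1}$ is its value $v(S_{t+1})$, recovers the recursive equation above.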

    Since we are trying to program systems that have agency, the agent can take actions in order to optimize its reward. We therefore introduce a set of actions, which makes the reward a function not only of the state s but also of the action a that we take while in state s. I could take an action $a_1$ of spending 300 on a Nintendo console, or I could take an alternative action $a_2$ of investing 300 in TSLA. These two actions would, most likely, yield very different rewards, starting from the same state (of having 300 to spend). We can rewrite our transition probabilities and rewards by incorporating the action.

    \[\mathcal{P}_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]\]

    and

    \[\mathcal{R}_{s}^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]\]
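    To make the notation concrete, here is a minimal sketch in C (my own toy example with made-up states, actions, probabilities and rewards, not something from the original course or post). It simulates a two-state, two-action system using $\mathcal{P}_{ss'}^{a}$ and $\mathcal{R}_{s}^{a}$, always takes the same action, and accumulates the discounted return of one sampled episode.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy example (hypothetical numbers): 2 states, 2 actions.
     * P[s][a][s'] is the transition probability P(s' | s, a),
     * R[s][a]     is the immediate reward for taking action a in state s. */
    #define NS 2
    #define NA 2

    static const double P[NS][NA][NS] = {
      /* state 0 */ { {0.9, 0.1},     /* action 0 */
                      {0.2, 0.8} },   /* action 1 */
      /* state 1 */ { {0.5, 0.5},
                      {0.1, 0.9} },
    };
    static const double R[NS][NA] = {
      { 1.0, 0.0 },   /* rewards for (state 0, action 0/1) */
      { 0.0, 2.0 },   /* rewards for (state 1, action 0/1) */
    };

    /* Sample the next state s' according to P[s][a][.]. */
    static int step(int s, int a)
    {
      double u = (double) rand() / RAND_MAX, c = 0.0;
      for (int sp = 0; sp < NS; sp++) {
        c += P[s][a][sp];
        if (u <= c)
          return sp;
      }
      return NS - 1;
    }

    int main(void)
    {
      double gamma = 0.9, G = 0.0, discount = 1.0;
      int s = 0;

      /* Always take action 1 for 20 steps and accumulate the discounted
       * sum of rewards G_t = R_{t+1} + gamma * R_{t+2} + ...            */
      for (int t = 0; t < 20; t++) {
        int a = 1;
        G += discount * R[s][a];
        discount *= gamma;
        s = step(s, a);
      }
      printf("discounted return of this episode: %.3f\n", G);
      return 0;
    }

    Running it several times gives slightly different returns, because the state transitions are sampled probabilistically.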

    That is quite a bit of formalism for one post. We will look at more terminology and formalism before we dive into some examples to concretize our understanding.

  • Reinforcement Learning - Fundamentals →

    When we were toddlers, we did not know a lot of things. Slowly, we learnt through our own experiences or the experiences of others around us. We may not have known that a gas stove is hot, or that a container filled to the brim is difficult to carry without spilling, etc. We might still be in the process of learning certain things. Some of these are taught in schools through repetition and exams, in order to get feedback, correct our course and reinforce our knowledge.

    Even in everyday activities, we continuously gather feedback and correct our actions. If we start early or start before time, we reach our destination on time. If we study well we get good grades, if we develop unhealthy habits, we have negative effects on our health, and so on. The environment gives us the feedback. The environment includes all other entities apart from us. We can, to some extent, observe the environment but we do not know the full behavior of the environment. We can read weather forecasts before making a trip, but we may not know the predicted rain will cause flash floods, for example.

    There are some fundamental concepts and terminologies in reinforcement learning (RL) that make mapping it to real-life experiences easier. An agent is someone who has agency - makes decisions and takes actions. In real life, it refers to us - human beings (or our minds). We learn and develop RL algorithms to program an automated system, with a physical embodiment (e.g. a robot) or even without one (e.g. just a software API); in that case, the agent is the automated system that we are trying to build and make better using RL. The environment is what the agent acts on - everything beyond the agent. We do not completely know the environment. We just observe how the environment behaves and calibrate our actions. The environment in turn provides rewards which help us learn and calibrate. If we get drenched in rain (negative reward), we may carry an umbrella (action) the next day and try to reduce the negative reward. If we study well (action), we get good grades (positive reward), which would in turn motivate us to study better in order to increase our reward.


    The state of the system is defined by the valuation of the state variables. If I am a trader in the stock market, my goal is to take actions (buy/sell stocks) in order to increase my reward (profit). My state would be the value of the money I have, and a better state would be one in which that value is higher (built up by aggregating profits over a sequence of actions). We talk about a sequence of states (i.e. we take an action from a state, move to another state, receive a reward and take another action, and so on). Our goal is to take profitable actions so that we reach a state from which the value we obtain, going forward, is maximized.

    In future posts, we will formalize many of these concepts and introduce some more.

    These are all inspired by the wonderful course on Reinforcement Learning by Prof. David Silver.

  • How About This Program? Is This Correct? →

    Most system software is written in C. Besides being one of the oldest programming languages, C offers arbitrary memory access and pointer arithmetic, which made it attractive for several applications. Further, C's small memory footprint and runtime made it the preferred language for embedded applications. It is no surprise that most of the software that we encounter in our everyday life has at least some (or in many cases most) part of it written in C.

    Over time, several libraries were developed for C, which made programming in C fast and easy. Let us look at a small C function copy_data() that uses one such library function, memcpy(dest, src, n), which copies n bytes from the memory area src to the memory area dest.

    #include <stdio.h>
    #include <string.h>

    int copy_data(char *buf, int len)
    {
      char tmp[80];
      if (len > (int) sizeof(tmp))  // [A]
        return -1;

      memcpy(tmp, buf, len);        // [B]
      printf("copied data: %s\n", tmp);
      return 0;
    }
    

    Our tiny C function copy_data() takes a pointer to a character buffer buf and an integer length len. We intend to copy len bytes from buf to tmp. We call the memcpy(...) function, passing tmp, buf and len. Looks simple and clean. But is it correct? Think about it.

    Although this function looks simple enough to just visually review for correctness, there is a subtle bug. A user of the function could pass a negative value for len. Because sizeof(tmp) is cast to int, the comparison at [A] is performed on signed integers, so a negative len passes the check. At [B] is where this gets manifested: memcpy() takes its length parameter as an unsigned size_t, so a negative len is converted to a huge unsigned value, causing the memory beyond tmp to be overwritten.
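    As a sketch of one possible way to harden the function (the name copy_data_safe, the unsigned size_t length, and the explicit room for a terminating '\0' are my additions, not from the original post):

    #include <stdio.h>
    #include <string.h>

    /* Hardened sketch: take the length as an unsigned size_t so a negative
     * value cannot be passed, and reject anything that does not fit in tmp
     * with room for a terminating '\0'.                                    */
    int copy_data_safe(const char *buf, size_t len)
    {
      char tmp[80];
      if (len >= sizeof(tmp))
        return -1;

      memcpy(tmp, buf, len);
      tmp[len] = '\0';               // make tmp a valid C string
      printf("copied data: %s\n", tmp);
      return 0;
    }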

  • Is This Program Correct? →

    Over the last couple of decades we have come to rely more and more on software. It has occupied more of our lives than any other artifact, like television or automobiles (which themselves are beginning to be run more by software). We entrust it with our passwords, bank account numbers, personal details, email gossip, etc. But is it correct? Can software also be imperfect, like us humans, and consequently harm those who trusted it? We will explore how small and silly bugs in programs can lead to severe consequences. Since most safety-critical software is written in C, we will stick to examples using the C programming language. But most of these could be extended to Java (though perhaps not to Rust or other languages).

    Is this program correct?

    Below is a simple implementation of Binary Search as found in several textbooks, production software, etc. The input parameters are an array arr of size n and a target element t that we want to search for using binary search. If the target element is found, we return its index in the array; otherwise we return -1.

    int bsearch(int arr[], int n, int t)
    {
      int lo = 0;
      int hi = n - 1;
      int mid;
      while (lo <= hi)
      {
        mid = (lo + hi)/2;
        if (arr[mid] == t)
          return mid;
        if (arr[mid] < t)
          lo = mid + 1;
        if (arr[mid] > t)
          hi = mid - 1;
      }
      return -1;
    }

    What could go wrong in this program? One could write a bunch of unit tests and they would, most likely, pass just fine. And if you think there is some very corner-case scenario at play here, you are not alone. This bug went undiscovered for decades - and the binary search algorithm is not something new. When it was discovered, it turned out that not only this binary search routine, but many other algorithms that use binary search as a subroutine, were also wrong. Google's blog post on this goes into interesting detail, and it turns out the bug also existed in the JDK.

    The bug is an arithmetic overflow when computing mid. If lo + hi exceeds INT_MAX (which can happen for arrays with more than about 2^30, roughly a billion, elements), the sum overflows and mid becomes garbage (typically negative), leading to an out-of-bounds access in:

    mid = (lo + hi)/2;

    A potential fix for this would be to rewrite the above as:

    mid = lo + ((hi - lo)/2);
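    To see why the original expression is dangerous, here is a small sketch (the values of lo and hi are hypothetical, chosen so that their sum exceeds INT_MAX, as could happen when searching an array of more than a billion elements):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
      /* Hypothetical values lo and hi could reach while searching an
       * array of more than a billion elements.                         */
      int lo = 0x40000000;
      int hi = 0x40000000;

      /* Their true sum (computed in 64 bits) exceeds INT_MAX ...       */
      long long sum = (long long) lo + hi;
      printf("lo + hi = %lld, INT_MAX = %d\n", sum, INT_MAX);

      /* ... so evaluating (lo + hi) in int arithmetic overflows, which
       * is undefined behaviour in C; on typical machines it wraps to a
       * negative value, and mid would index far outside the array.
       * The rewritten form keeps every intermediate value in range:    */
      printf("lo + (hi - lo)/2 = %d\n", lo + (hi - lo) / 2);
      return 0;
    }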

    After all, we should not trust our programs as much as we do before analyzing them.

  • The Edge Effect →

    On this rainy Sunday in California, with not much motivation to do anything serious, and with the last Test of the India-Australia series, underway in Sydney, delayed due to rain, I started scrolling through the library of my favourite podcasts and this particular one caught my attention: The Edge Effect - an episode from NPR's Hidden Brain series.

    The episode talks about the importance of encouraging and developing diversity of thoughts and ideas. It draws examples from science and music and makes compelling arguments supporting diversity. In particular, the experiments by Adam Galinsky and the life story of Cristina Pato were very impressive.

    For those interested, here is a link to the podcast:

    I should take time to appreciate the diversity my parents exposed me to, while growing up in India.