Hi. My name is John Langford, and I want to tell you about contextual bandits for real-world reinforcement learning. This stems from a long-term project I've been working on for more than a decade, resulting in many real-world deployments, and in general, contextual bandits are the way that reinforcement learning is deployed in the real world these days.

Let's start with why we want to think about the real world, and how reinforcement learning is often different from it. Typically, right now, reinforcement learning works with a simulator. A simulator provides observations, and then a learning algorithm has a policy which chooses an action. The simulator then processes the action and returns a reward. This is what happens when you do reinforcement learning for game play, for example. Now, when you want to apply this to the real world, there's a real question about how these things align. For example, when an observation comes from the real world, is it like an observation from a simulator? Typically, it differs in various ways. When an action is taken based on an observation, is it the same? Often it differs, because the nature of the observation is different, which leads to a different action even given the same policy. Then, is the reward that you get back from the real world similar to what you get in the simulator? The answer is no, it is not. So there is a gap between the simulator and reality. While you can try to learn in simulators, the applicability of what you've learned in simulators to real-world applications is unclear, and in many cases maybe not even possible. So given this gap between simulator-based reinforcement learning, which is where much of reinforcement learning is, and real-world reinforcement learning, how do you do real-world reinforcement learning? I think the key answer is a shift in priorities.
So you want to shift your priorities to support real-world reinforcement learning applications. As an example, temporal credit assignment, in other words figuring out which move won the game, is really important to reinforcement learning. But maybe it's a little less important here; maybe generalization across different observations is more important. In a simulator, it's easy for the reinforcement learning algorithm to control the environment: you can say "step forward one step, please." But in the real world, typically the environment controls you. This is fundamentally important for what your interfaces need to be when working with the real world. Computational efficiency is the key limiting concern in a simulator, because you have as many samples as you can compute. But in the real world, statistical efficiency is the greater concern, because you only have the samples that the real world gives you, and so you must use those samples to achieve the greatest impact that you can. In simulator-based reinforcement learning, we have to think about state, where state is the fundamental information necessary in order to make a decision. But in the real world, often you have some very complex observation which may have much more information than is necessary to make a decision. If you have a megapixel camera, maybe you don't need all those pixels to make a decision, and it may not be important to distinguish between all of those different possible pixel settings as distinct states when making a decision. When you're in the real world, it suddenly becomes important to be able to do off-policy evaluation. In contextual bandits, which I'll talk about in a moment, there are algorithms which just do learning, and algorithms which do learning but also, as a byproduct, produce data that you can do off-policy evaluation with. Naturally, these policy-evaluation-supporting algorithms are preferred for actual applications.
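The off-policy evaluation idea here can be sketched with inverse propensity scoring (IPS): if the logged data records the probability with which the deployed policy chose each action, you can estimate the value of a different policy offline. This is a minimal sketch, assuming a simple log format of (context, action, reward, probability) tuples; the format and the policy function are illustrative, not any particular system's API:

```python
def ips_estimate(logged, target_policy):
    """Inverse propensity scoring: estimate the average reward the
    target policy would have received on the logged interactions.

    logged: list of (context, action, reward, prob) tuples, where
            prob is the probability the logging policy chose `action`.
    target_policy: function mapping a context to an action.
    """
    total = 0.0
    for context, action, reward, prob in logged:
        if target_policy(context) == action:
            # Reweight by 1/prob so the estimate is unbiased.
            total += reward / prob
        # Rounds where the target policy disagrees contribute 0.
    return total / len(logged)
```

Dividing by the logged probability reweights the rounds where the target policy agrees with the logged action, which keeps the estimate unbiased as long as every action had nonzero probability of being logged.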
Another distinction has to do with which policy you care about. In a simulator, maybe you run for a long time and it's the last policy you care about. But in the real world, every data point you gather involves some interaction with the world, so you want performance along the way to be pretty good. So you really care about the entire trajectory of policies, the sequence of policies, as you're learning in the real world.

Let's think about personalized news. This is a very natural application for reinforcement learning in the real world. Maybe you have a set of possible news articles which are the ones of interest today. Some user comes to a website. You would like to choose some news article that they are probably interested in. Then you get some feedback about whether or not they actually read the article. So this is a very natural reinforcement learning structure, and it is in fact even simpler: it's a contextual bandit structure. So what are contextual bandits? In contextual bandits, what happens is you observe some features. Maybe it's the geolocation of the user. Maybe it's some profile based on the way the user has behaved in the past. Maybe it's features of the possible actions which are available as well. Based on this, you choose an action, and then you get some reward. Your goal is to maximize the reward in this setting. OK, now, this sounds like full reinforcement learning, and it is to a large extent. But there's one severe caveat, which is that we're imagining there's no credit assignment problem. So the news article that is displayed to you does not influence the way I will behave when a news article is displayed to me. The history of this is actually pretty recent, something where the world has changed a great deal over the last decade. In 2007, there was the first contextual bandit algorithm, called Epoch-Greedy, which is essentially a version of epsilon-greedy that varies the epsilon depending on how much you know.
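The interaction loop described here, observe features, choose an action, receive a reward, can be sketched with a simple epsilon-greedy learner, a fixed-epsilon cousin of the Epoch-Greedy idea. The per-(context, action) average-reward table is an illustrative simplification; real learners generalize across contexts with a learned policy:

```python
import random
from collections import defaultdict

class EpsilonGreedyBandit:
    """Tabular epsilon-greedy contextual bandit.

    This per-(context, action) average-reward table just illustrates
    the observe -> act -> reward loop; production systems learn a
    policy that generalizes across contexts.
    """

    def __init__(self, n_actions, epsilon=0.1, seed=0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.sums = defaultdict(float)   # (context, action) -> total reward
        self.counts = defaultdict(int)   # (context, action) -> times chosen

    def _mean(self, context, action):
        return self.sums[(context, action)] / max(self.counts[(context, action)], 1)

    def act(self, context):
        """Choose an action; also return the probability it was chosen,
        which is exactly what needs to be logged for off-policy evaluation."""
        greedy = max(range(self.n_actions), key=lambda a: self._mean(context, a))
        if self.rng.random() < self.epsilon:
            action = self.rng.randrange(self.n_actions)  # explore uniformly
        else:
            action = greedy                              # exploit
        prob = self.epsilon / self.n_actions
        if action == greedy:
            prob += 1 - self.epsilon
        return action, prob

    def learn(self, context, action, reward):
        self.sums[(context, action)] += reward
        self.counts[(context, action)] += 1
```

In the personalized-news picture, the context is the user's features, the actions are today's candidate articles, and the reward is whether the article was read; logging the returned probability alongside each decision is what later enables off-policy evaluation of new policies.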
We rapidly discovered there's an even earlier paper, the EXP4 algorithm, which in some sense is even more efficient statistically, but it's horrifically bad computationally. In 2010, we had our first application of personalized news deployed in the real world, when I was at Yahoo Research. In 2011, we had the first marriage of Epoch-Greedy-style learning and EXP4-style learning, to achieve an algorithm that was essentially both computationally and statistically efficient. In 2014, we came up with an even better algorithm which you can actually use today, and then in 2016, we created the first version of the Decision Service, which is a general cloud service you can use to create your own contextual bandit learner. In 2019, this eventually led to a reinforcement learning service product, the Azure Cognitive Services Personalizer. Based upon the service, you can go and personalize your website or its layout or many other things. It's a general service: you can feed in arbitrary features, choose an action, and then send in an arbitrary reward, with it learning from that. Based upon this system, we actually won a 2019 AI system of the year award. I don't have enough time to go into great detail here, but if you're interested in more details on contextual bandits, there's a tutorial out [inaudible] in 2018, which I recommend taking a look at. If you're interested in the Personalizer service, that's available at this URL, and if you're interested in the algorithms behind the Personalizer service and in internal implementations of contextual bandit algorithms, Vowpal Wabbit has become a repository for a large number of different contextual bandit algorithms. Thank you.