In this section, we will review what RNNs can and can't capture from their inputs. One way to understand what RNNs are and are not capable of learning is to train a sequence-to-sequence model and ask it to make predictions given some random input. Ordinarily, examining the predictions of a model for a random point in feature space wouldn't be informative. For example, if we look at this random point in feature space, the model's output here really doesn't tell us much about what this classification model learned. However, with one-to-sequence and sequence-to-sequence RNNs, the output is much more nuanced than simply a class or a label. It's an entire sequence. If you take such a model and train it to predict the next word in the sequence, then you have a mapping between its inputs and the domain from which its training data came. That means that you can assess its output for random input based on how likely it is to be found within the corpus on which the model was trained. The better the model understands the domain, the more plausible its phantom predictions will be.
As we review these examples, imagine that you had the same job as the model does: to create a sample from this domain. Instead of using machine learning though, you'd use traditional programming. How would you write such a program? What sort of state would it have?
For our examples, we'll use a character sequence-to-sequence model. Unlike word-level language models, which accept sequences of words, character models accept sequences of characters. We use two very different domains to train our models, each with a set of rules that varies in complexity. The first domain is the complete works of William Shakespeare, and there are lots of different rules in Shakespeare. There are rules that govern all natural language, which are so complicated that linguists still haven't figured them all out. They do agree on some very basic ones though, like subject-verb agreement.
There are also rules that are more drama-specific. For example, all plays have titles and consist of a certain number of acts that are labeled in ascending order. Within a scene, actors enter and do things.
Now, think about your program and its state, and remember that all we have to work with for state is a fixed-size vector of floats. Perhaps your program would have some sort of nested loops for the acts and scenes, and loops require maintaining some sort of counter. To generate a scene, we'd need to know who's there. So, maybe we need a set of all English names to sample from. Then, within a generate-scene function, we'd need to know which characters are actually in the current scene, because characters enter and leave all the time. So, perhaps we could maintain a set that we update with the characters currently on stage. We then need to give the characters things to say. So, perhaps we'd use a Markov model compiled from word co-occurrence statistics, although that requires a very large table of numbers.
So, now let's summon our imaginations a bit and think about this program that we've been writing. Which parts of its state and code are more complicated than others? Control structures seem pretty simple. That's just a few counters. A set, like the set of characters, is more expensive because it could have unlimited size. Remembering which characters have died already, so we don't inadvertently add ghosts to our play, that's really complicated.
Let's look at what the model is able to produce. The model generated a plausible title, and even knows that after titles come acts. But then you see the first mistake: the scene numbering is wrong. Our naive program used counters to loop, but RNNs don't actually have counters. It did remember to put a number for act and scene though. After the scene heading, which has a plausible location, two characters enter, and note how the stage directions are correctly bracketed.
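To make the thought experiment a bit more concrete, here's a rough sketch of the kind of hand-written generator we've been imagining. Everything in it is hypothetical and made up for illustration: the names `NAMES`, `CORPUS`, `build_markov_table`, `generate_line`, and `generate_play`, the tiny toy corpus, and the title format. It isn't how the RNN works. It's just a way to see where the counters, the set, and the big table live in a traditional program.

```python
import random
from collections import defaultdict

# Hypothetical stand-ins for the resources imagined in the narration.
NAMES = ["ORSINO", "VIOLA", "PORTIA", "LUCIO"]  # assumed sample of names
CORPUS = "to be or not to be that is the question".split()  # toy corpus


def build_markov_table(words):
    """Map each word to the list of words observed right after it."""
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table


def generate_line(table, rng, length=6):
    """Walk the co-occurrence table to produce one line of dialog."""
    word = rng.choice(sorted(table))
    line = [word]
    for _ in range(length - 1):
        followers = table.get(word)
        if followers:
            word = rng.choice(followers)
        else:  # dead end: restart from a random known word
            word = rng.choice(sorted(table))
        line.append(word)
    return " ".join(line)


def generate_play(n_acts=2, n_scenes=2, seed=0):
    """Return the play as a list of text lines."""
    rng = random.Random(seed)
    table = build_markov_table(CORPUS)  # the "very large table" in miniature
    out = ["THE TRAGEDY OF " + rng.choice(NAMES)]
    for act in range(1, n_acts + 1):          # counter state
        for scene in range(1, n_scenes + 1):  # counter state
            out.append(f"ACT {act}. SCENE {scene}.")
            on_stage = set(rng.sample(NAMES, 2))  # set state
            out.append("[Enter " + " and ".join(sorted(on_stage)) + "]")
            for speaker in sorted(on_stage):  # only characters on stage speak
                out.append(f"{speaker}. {generate_line(table, rng)}")
            out.append("[Exeunt]")
    return out
```

Calling `print("\n".join(generate_play()))` prints a tiny play. Notice that the two loop counters are trivially cheap, the `on_stage` set grows with the cast, and the Markov table grows with the vocabulary, which is the ordering of difficulty we just talked through.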
The state for remembering to close brackets could be as simple as a bit or a counter. Now, I'm no Shakespearean scholar, but the dialog here seems pretty reasonable, and that's notable because our naive method of generating text was to compile co-occurrence frequency statistics in a table that was far bigger than our memory. This is even more impressive given that this is actually a character RNN, not a word RNN, which means it doesn't even have direct access to words. But there's still a problem with the output. The characters in the play who speak never actually entered the scene.
We see a similar story when we train the model on the TensorFlow library, which is written in Python. The model's capable of remarkable amounts of memorization. It learned the entire Apache license, though keep in mind there was overwhelming evidence for this because it's at the top of every file. The model learned to correctly triply nest the NumPy arrays and make indentations, which, like the stage directions, might be some sort of counter or Boolean in our naive implementation. But much like with Shakespeare, the model doesn't learn the things that require a much more complicated representation. For example, it doesn't understand variable scope, and uses variables that haven't yet been declared, which makes sense because variable scope requires a much more complicated bit of state in your computer to correctly maintain.
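The difference between cheap state and expensive state can be sketched in a few lines. This is only an illustration under simplifying assumptions: `brackets_balanced` and `undefined_names` are hypothetical helper names, and the scope check only handles plain top-level assignments, ignoring functions, imports, and nested scopes, which would need even more state.

```python
import ast
import builtins


def brackets_balanced(text):
    """Check square brackets using a single integer of state."""
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:  # closed a bracket that was never opened
                return False
    return depth == 0


def undefined_names(source):
    """List names read before being assigned (top-level statements only).

    Simplified on purpose: the state is a set that grows with every new
    variable, unlike the single counter above.
    """
    defined = set(dir(builtins))  # names available without assignment
    problems = []
    for stmt in ast.parse(source).body:  # statements in source order
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        for node in names:
            if isinstance(node.ctx, ast.Load) and node.id not in defined:
                problems.append(node.id)
        # Assignments take effect only after the statement has been read.
        defined.update(n.id for n in names if isinstance(n.ctx, ast.Store))
    return problems
```

For example, `undefined_names("y = x + 1\nx = 2\n")` reports `x`, because `x` is read one statement before it is assigned. The bracket check never needs more than one integer, while the scope check has to remember every name it has seen so far.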