[MUSIC] So this is a slide I've taken from Christian Grant, where he argues that, look, if you consider databases versus statistical packages such as SAS, Matlab, R, or SPSS, this is what people are doing now: they're downloading data out of the database to use in their favorite statistical package, frequently under the assumption that, well, of course I have to, right? Of course that's the only tool that could possibly express this analysis.

Well, look, with most of these stat packages the first thing you do is read the data off disk, load it into memory, and then start calling functions on it. But increasingly, data sets simply don't fit in memory on a single machine, and certainly not on your laptop. So you have a couple of choices here. Either you shift to some kind of fancy cluster version of the tools, which exist for things like SPSS and Matlab although they're quite expensive, or you sample the data so that you only work with a subset that actually does fit in memory. And you'll see this is very, very common: it's just par for the course to take a sample of the data in order to be able to work with it efficiently.

Right, but the point here is that this isn't really required if you use different tools. In this case, the argument is for databases: if you can figure out how to perform your task in the database, you'll get the scalability for free.

Moreover, these toolkits don't necessarily have any notion of parallelism, right? So even if the data does fit in memory, every machine you buy nowadays has at least four cores in it, probably more like eight, and soon 12 and 16. So taking advantage of all those cores on your problem is something you're gonna be looking for in a package, and this is something that databases can do automatically. Most databases, not all. In fact the ones you may be familiar with, MySQL and Postgres, typically do not. But other databases will, and we'll talk more about this. Okay.
So you get parallelism for free if you can use a database, and you get scalability beyond the size of main memory for free, if you can use a database. That's perhaps a big if, and we'll talk about it. Okay.

So I'll give you an example, and you'll actually do this as part of a homework assignment: can you express matrix multiplication in SQL? And if you can, then I'd argue, well hey, now any formula that you can express using matrix multiplication you can perhaps express in SQL by doing this over and over again. Okay, and the answer is yes. And in fact the simplest version of this is pretty straightforward. So if you haven't ever seen SQL before, we'll talk a little bit more about this, and if you have, bear with me.

Imagine you have two matrices, A and B. Oops, I'm using the wrong device here. Two matrices, A and B. The representation of each matrix here is as a row ID, a column ID, and a value. Right, so that's a relation.

Now, this is a very inefficient representation if your matrix is dense, and I'll tell you why, and you can think about it a little more as well. Let's say you have five rows and six columns. An implicit representation, where the position of each value tells you its row and column, only needs the 30 values: five times six. But here you have to store 30 row IDs plus 30 column IDs plus 30 values. So you've sort of tripled the size of your data relative to efficient main-memory representations.

So why would you do that? Well, it turns out that a lot of matrices in practice are sparse, and I put that word right up here at the top. In a sparse matrix, not all the cells actually have a nonzero value, so you don't need to store the rest. And so this representation, explicitly having a row ID, a column ID, and a value, turns out to be pretty efficient. Okay. And in fact, this is exactly the kind of representation that sparse matrix solvers use internally.
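To make the storage arithmetic concrete, here is a minimal Python sketch of the row ID, column ID, value representation. The helper name `to_triples` and the sample matrices are made up for illustration and are not from the lecture. It counts how many numbers the triple form needs for a dense 5 by 6 matrix versus a sparse one.

```python
# Hypothetical helper (not from the lecture): convert a dense matrix, given
# as a list of rows, into (row_id, col_id, value) triples, keeping only the
# nonzero cells.
def to_triples(dense):
    return [
        (i, j, v)
        for i, row in enumerate(dense)
        for j, v in enumerate(row)
        if v != 0
    ]

# Dense 5 x 6 matrix: every cell holds a nonzero value (1 through 30).
dense = [[i * 6 + j + 1 for j in range(6)] for i in range(5)]
dense_triples = to_triples(dense)
dense_numbers_implicit = 5 * 6                   # 30 values, position is implied
dense_numbers_triples = 3 * len(dense_triples)   # 30 row IDs + 30 column IDs + 30 values

# Sparse 5 x 6 matrix: only three cells are nonzero.
sparse = [[0] * 6 for _ in range(5)]
sparse[0][2] = 7
sparse[3][1] = 4
sparse[4][5] = 9
sparse_triples = to_triples(sparse)
sparse_numbers_triples = 3 * len(sparse_triples)  # 3 numbers per stored cell

print(dense_numbers_implicit, dense_numbers_triples)
print(sparse_triples)
print(sparse_numbers_triples)
```

For the dense matrix the triple form stores three times as many numbers as the implicit layout, while for the sparse matrix it stores only the three nonzero cells instead of all 30 positions.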
All right, so if you have a sparse matrix and you represent it in a database, then expressing matrix multiply is not too bad. What you want to do is, for each column number in matrix A, find the corresponding row number in matrix B, multiply the paired values, and then add up all the contributions to each cell of the new matrix. And I'll show a diagram of this after. In fact, you know what, let me skip going into too much detail about this right now, because I'm gonna talk about this in detail in preparation for the homework where you'll do this.

So right now, I guess what I want you to take away is that representing matrices inside of a database sounds very unusual, but it's actually not the world's worst idea, and in part of the readings, from the MAD Skills paper, you'll see why. So right now I just want you to take away that it can be done and it's not necessarily a terrible idea. [MUSIC]
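The join-and-sum idea described above can be sketched as follows. This is only an illustration, not necessarily the formulation used in the homework. It assumes two tables named `A` and `B` with columns `row_num`, `col_num`, and `value` (all hypothetical names), and it uses an in-memory SQLite database through Python's standard `sqlite3` module as a convenient stand-in for whatever database you actually run.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Table and column names are assumptions made for this sketch.
cur.execute("CREATE TABLE A (row_num INTEGER, col_num INTEGER, value INTEGER)")
cur.execute("CREATE TABLE B (row_num INTEGER, col_num INTEGER, value INTEGER)")

# A = [[1, 2],      B = [[4, 0],
#      [0, 3]]           [5, 6]]
# Only the nonzero cells are stored.
cur.executemany("INSERT INTO A VALUES (?, ?, ?)", [(0, 0, 1), (0, 1, 2), (1, 1, 3)])
cur.executemany("INSERT INTO B VALUES (?, ?, ?)", [(0, 0, 4), (1, 0, 5), (1, 1, 6)])

# Pair each cell of A with the cells of B whose row number equals A's column
# number, multiply the paired values, and sum the products per output cell.
cur.execute("""
    SELECT A.row_num, B.col_num, SUM(A.value * B.value) AS value
    FROM A
    JOIN B ON A.col_num = B.row_num
    GROUP BY A.row_num, B.col_num
    ORDER BY A.row_num, B.col_num
""")
product = cur.fetchall()
conn.close()

print(product)
```

The result comes back in the same row, column, value form, so it could be stored as a table and multiplied again, which is what makes repeated matrix multiplication expressible this way.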