Hey, here's one of the modules that everybody loves: how to make your queries run faster and save on data processing cost. In this performance optimization module, we'll cover the key types of work that BigQuery does in the back end, and how you can avoid falling into performance traps. Then we'll cover one of the most common data traps, which is having too much data skewed to too few values, and how filtering can be your best friend here. Last up is your best tool for analyzing and debugging performance issues: the query explanation map.

Now, there are four key areas that govern a lot of the characteristics of BigQuery performance. Let's take a look at each of them in turn and see some examples. First and foremost is how much data you're reading into and writing out of BigQuery: your input and output. Second is the communication between your workers, or slots, which is done via the shuffle mechanism that we talked about a little bit in the architecture lecture. Third is highly computational work, like user-defined functions (UDFs) and heavy mathematical operations; that's how much of the CPU you're using. And the fourth, which is my favorite, is the SQL syntax itself. Are there more efficient ways to write your queries? What are some of the pitfalls we can avoid when we're writing that SQL? Let's look at each of those.

First stop is input and output: how much data we're reading, and then how much data we're writing. As you might have expected, SELECT * makes the list here. We only want to select the columns we're actually going to use, and if you just want a quick look at a dataset, use the preview mechanic that's built right into the BigQuery UI instead.

Second, BigQuery performs best when your data is denormalized. That means instead of having a lot of smaller tables and doing joins to bring the data back together, combine those tables into one massive, denormalized table, as we saw in the architecture lecture. And use things like nested and repeated fields: instead of having orders in one table and customers in another, consider actually bringing those together in a parent-child relationship through that nested and repeated structure we looked at a little bit earlier (there's a quick sketch of this at the end of this section).

Lastly, as you saw when merging and unioning datasets, try to use really granular suffixes in your table wildcards to get the most specificity out of your data. As an example, if you remember the GSOD weather tables, something like gsod* is a little bit less specific than saying gsod20*, which pulls in only the years from 2000 onward (there's a sketch of this at the end of this section too).

Next, let's talk a little bit about BigQuery native storage versus the external data connections that we discussed back when we were ingesting datasets, and cover some of the downsides of external data connections. First, they can never be cached, because caching is a function of BigQuery managed storage itself. Second, live edits to the underlying data source can bite you. If there's a spreadsheet you're relying on and quickly pulling into BigQuery, and someone is editing that spreadsheet live at the same time, you can create potential race conditions, and BigQuery isn't going to know which version of that spreadsheet is the latest one to bring in. Lastly, BigQuery has a lot of performance tweaks under the hood that you might not even realize. For example, when you write a long query and filter your dataset all the way at the end, BigQuery may recognize that the filter can be performed earlier and do something called predicate pushdown: without you even knowing, it bumps your WHERE clause up much earlier in the query plan and filters the data before the heavier processing happens. Now, keep in mind there are particular use cases where you do want an external data connection, like grabbing ad hoc data and performing a one-time analysis on it, much like an extract, transform, and load step, and then ultimately storing the result in BigQuery so you don't have to continuously do that. But keep in mind the performance trade-offs you're making there.
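Here's the first of those two sketches: a minimal example, using hypothetical project, dataset, and column names, of denormalizing a separate orders table into a nested and repeated field on each customer row with ARRAY_AGG and STRUCT.

    -- Hypothetical tables and columns; swap in your own names.
    -- Each customer row carries its orders as a repeated, nested field,
    -- so downstream queries can skip the customers-to-orders join entirely.
    CREATE OR REPLACE TABLE `your-project.sales.customers_denormalized` AS
    SELECT
      c.customer_id,
      c.customer_name,
      ARRAY_AGG(STRUCT(o.order_id, o.order_date, o.order_total)) AS orders
    FROM `your-project.sales.customers` AS c
    JOIN `your-project.sales.orders` AS o
      ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name;

Once the data is stored this way, later queries can read just this one table and use UNNEST(orders) whenever they need to fan the child records back out.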
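And here's the second sketch, pulling together the column-pruning and granular-wildcard ideas against the public NOAA GSOD weather tables. The column names follow the usual GSOD schema, and the _TABLE_SUFFIX filter is just one illustrative way to narrow the scan further.

    -- Name only the columns you need instead of SELECT *,
    -- and keep the wildcard granular: gsod20* scans only tables
    -- from the year 2000 onward rather than every year.
    SELECT
      stn,
      year,
      mo,
      da,
      temp
    FROM `bigquery-public-data.noaa_gsod.gsod20*`
    WHERE _TABLE_SUFFIX BETWEEN '10' AND '15';  -- optionally narrow to 2010 through 2015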
Next up is the shuffle mechanic. One of the most interesting things under the hood is how those workers communicate with each other: BigQuery breaks apart your query and assigns that work to a lot of different workers, or slots, behind the scenes. What you can do to make their lives easier is pre-filter your data before doing massive JOINs. So if 80 percent of your records are NULL, go ahead and lop off those rows even before passing that work to the workers. That saves a lot of reads, and it saves a lot of communication time passing data between each of the different slots (there's a quick sketch of this at the end of this section). One other thing I'll highlight here is that you can identify potential data skew when one of your workers gets particularly bottlenecked, and we'll take a look at how you can recognize that a little bit down the road.

Okay, here are a few fun querying examples. When we talked about the WITH clause, we said use it liberally to help break apart complex queries. One of the caveats, or drawbacks, is that you don't want to use a bunch of WITH clauses in place of materializing the results of those queries into permanent storage. Here you see, in the query on the left, we're using a WITH clause to define a join of the IRS 2015 filings data against the organizational details table. If you ran this query more than once, you'd be doing that join over and over and over again. It's much more performant to just store lines 6 through 13 as a permanent table and then refer to that later (there's a sketch of this at the end of this section too). The main takeaway with WITH clauses is that the results of those named subqueries, like the ones on line 6 and line 15 here, are not materialized; they're actually re-queried every single time you reference them in one of your long queries. So it's a trade-off between the readability of your code and the performance, or speed, you get when you store those results as permanent tables.

Here's another example that demonstrates the same concept. If you're only interested in looking at the charities that are schools in the IRS 2015 dataset, and you're commonly applying that filter over and over and over again, it's better to actually store those results as a permanent table and then query that directly.
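Here's the pre-filtering sketch, with hypothetical table and column names loosely modeled on that IRS data: the NULL rows get dropped inside a subquery before the join ever sees them, so there's far less data to shuffle between slots.

    -- Hypothetical tables and columns; the point is that the filter
    -- runs before the join, so far fewer rows get shuffled.
    SELECT
      f.ein,
      f.total_revenue,
      d.organization_name
    FROM (
      SELECT ein, total_revenue
      FROM `your-project.irs.filings_2015`
      WHERE total_revenue IS NOT NULL   -- lop off the NULL rows up front
    ) AS f
    JOIN `your-project.irs.org_details` AS d
      ON f.ein = d.ein;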
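And here's the materialization sketch, again with hypothetical names: the join that used to live in the WITH clause gets stored once as a permanent table, and every later query just reads that table instead of redoing the join.

    -- One-time materialization of the join from the WITH clause.
    CREATE OR REPLACE TABLE `your-project.irs.filings_with_org_details_2015` AS
    SELECT
      f.ein,
      f.total_revenue,
      d.organization_name,
      d.state
    FROM `your-project.irs.filings_2015` AS f
    JOIN `your-project.irs.org_details` AS d
      ON f.ein = d.ein;

    -- Later queries read the stored result; no repeated join.
    SELECT organization_name, total_revenue
    FROM `your-project.irs.filings_with_org_details_2015`
    WHERE state = 'CA';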
Now let's revisit our friend, the GROUP BY. If you remember, the GROUP BY aggregates values over a subset of your data. In this particular case, we're looking at Wikipedia contributors and grouping the number of edits they made by contributor ID. If you can imagine, there are a ton of different contributors, and doing such a large GROUP BY potentially creates many of what we call forced shuffles, or forced communication steps, between your underlying slots or workers. One of the ways you can get around this is by performing another, higher level of aggregation on your dimension, the contributor here, to bucket them, so you have fewer buckets that you're grouping by. We'll look at this example in greater depth in just a few slides.

Moving on to the computational aspect of BigQuery: if you want to optimize CPU usage, and you're getting "resources exceeded" errors, take a look at any JavaScript user-defined functions you might have and see if there's a way to rewrite them with native SQL functions or, potentially, SQL user-defined functions. The reason is that BigQuery actually has to fire up a separate sub-process to run your JavaScript code. That doesn't mean you shouldn't use UDFs; UDFs have their place for logic that's outside the normal scope of SQL's capabilities. Just keep in mind the performance trade-off there.

Now, as we discussed a lot in the JOINs and UNIONs lecture, you want to fundamentally master your data model and how your data tables are related to each other. The relationship between customers and orders, where one customer can have many orders but one order can only belong to one customer, and going through that mental exercise like we did with the weather stations and the temperature data, will help you determine whether or not the result sets you get make sense. One specific example we looked at earlier was uncovering the insight that one particular charity filed more than once for the 2015 tax year, so doing something like a SUM of revenue across all of those rows would bring in data that was potentially duplicative. So keep in mind, understanding that data model, or working with the subject matter experts to determine what the unique row conditions or fields should be, is critical.

Next, if you're combining data historically, be sure to use _TABLE_SUFFIX and be as granular as possible, so you avoid reading more data than you intend to use. And if you've used SQL a bunch before, you might have run into self-joins; consider using window functions instead for optimal performance (there's a quick sketch of this after this section).

Another great performance tip is to keep intensive operations like ORDER BYs as the very last thing you do in your queries. For example, as you can see here, we have a named subquery, and on line 12 we're ordering it, but we don't actually need an ORDER BY there, because we're not doing anything with the ordered results. One of the easiest ways to spot this in your queries is if you're doing more than one ORDER BY. So if you have an ORDER BY like you see in the named subquery on line 12, since we're not actually doing anything with the order of the data at that point, we can move that ORDER BY all the way to the end, where we want to do something like order by charity name. That lets you take advantage of any filtering that happens beforehand, so we're only ordering the results that actually matter to us at the very end.
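Going back to the self-join point for a second, here's a rough sketch, with hypothetical table and column names, of what swapping one out for a window function might look like: instead of joining a filings table against itself to compare each tax year with the prior year, LAG() does it in a single pass.

    -- Hypothetical table; a window function replaces a self-join of the
    -- table against itself on the previous tax year.
    SELECT
      ein,
      tax_year,
      total_revenue,
      total_revenue
        - LAG(total_revenue) OVER (PARTITION BY ein ORDER BY tax_year) AS revenue_change
    FROM `your-project.irs.filings_all_years`;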
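And finally, here's a quick sketch of that last ORDER BY tip, again with hypothetical names: the ORDER BY is dropped from the named subquery and applied only once, on the outside, after all of the filtering has already happened.

    -- Hypothetical tables and columns; the only ORDER BY is the outermost one.
    WITH school_charities AS (
      SELECT ein, name, total_revenue
      FROM `your-project.irs.filings_2015`
      WHERE is_school = TRUE          -- filter here, but no ORDER BY needed yet
    )
    SELECT name, total_revenue
    FROM school_charities
    WHERE total_revenue > 1000000
    ORDER BY name;                    -- sort only the final, filtered result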