This blog was never meant to go from point A to point Z in terms of difficulty or knowledge requirements. I just wanted to blog about the things I run into that could help people who are already trying to tune their queries.
...but then I figured that I could at least do a post about how MY process goes when I attempt to performance tune something.
So let's say your customers are complaining about this one query, and while you usually ignore them, as they are always complaining about something, this one catches your eye.
They want to get the users that have ten thousand times more upvotes than downvotes, but who at the same time have also made a post or two with a low score.
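The exact query isn't reproduced here, but based on that description it would look roughly like this sketch against the StackOverflow sample database (the NULLIF guard against users with zero downvotes is my own addition):

```sql
-- Roughly the shape of the query in question
SELECT u.Id, u.DisplayName, u.UpVotes, u.DownVotes
FROM dbo.Users AS u
WHERE u.UpVotes / NULLIF(u.DownVotes, 0) > 10000  -- "ten thousand times more upvotes"
  AND EXISTS (SELECT 1
              FROM dbo.Posts AS p
              WHERE p.OwnerUserId = u.Id
                AND p.Score < 10);                -- "a post or two with a low score"
```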
When I start to performance tune a query, I usually first execute "set statistics io, time on" and enable the actual execution plan.
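In SSMS that looks like this (the actual execution plan toggle is Ctrl+M, or Query > Include Actual Execution Plan):

```sql
-- Report logical reads plus CPU/elapsed time in the Messages tab
SET STATISTICS IO, TIME ON;
```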
After that is done, I run my query to get a feel for the current situation. After it finishes, I usually copy everything in the Messages tab and paste it into http://statisticsparser.com/, which is a great online tool for easily visualizing how much data your query read.
Generally speaking, the more logical reads, the worse the query is. This one isn't actually "that" bad, but it does take a bit of time to run (especially CPU time), so let's proceed to the Execution plan tab and take a look at what is causing this and how we can improve it.
When reading an execution plan, you read from right to left, top to bottom. Here we can see that it scans (yeah, it says seek on one, but it's just seeking to a certain value and scanning from there) two different indexes on the Posts table first, combining them into one list, and then scans the Users table and merges the results together. What I am looking for here is where the estimated versus actual row counts start taking a wrong turn. In this case, it's at the Users table.
It thought it would find 739,714 rows, but it actually found 10. Wow.
If we look back at the original query, I can make an educated guess that this is due to that fun formula where we divide the upvotes by the downvotes... SQL Server makes up these estimates, using statistics about the actual data, before the query is even run. For instance, if we have an index on the upvotes column, SQL Server holds a small amount of data about upvotes that is accessible when compiling the query plan, before the query is actually run. This small amount of data can give an indication of how much data will be returned when the query runs. There is a ton of information out there on how statistics work, and it's too complex to explain here, but generally, if the estimate versus actual (739,714 vs 10) is off by a large amount, that's usually the issue. In this situation, SQL Server might have statistics about both upvotes and downvotes, but not about them used together like this.
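If you want to peek at what the optimizer sees, you can dump the header, density vector and histogram behind an index or statistics object (the statistics name here is hypothetical; yours will differ):

```sql
-- Show the statistics that drive the cardinality estimates
DBCC SHOW_STATISTICS ('dbo.Users', 'IX_Users_UpVotes');
```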
To fix this, I know that I just need to make this estimate better.
I also happen to know of a trick with computed columns, where the sheer existence of one can produce statistics not on the column itself, but on the formula it contains.
So if I add...
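...something like this (a minimal sketch; the column name VoteRatio is my own invention, and the important part is that the expression matches the formula in the query exactly)...

```sql
-- The mere existence of this column lets SQL Server auto-create
-- statistics on the expression itself
ALTER TABLE dbo.Users
    ADD VoteRatio AS (UpVotes / NULLIF(DownVotes, 0));
```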
...then SQL Server will automatically get better estimates from the newly created statistics that come with this column, even though the column isn't actually used in the query!
So if we just go ahead and run that same query again, weeeeeeee geeeeeeet...
...a whooooole lot better results!
And the plan looks absolutely fantastic! Look at those estimates!
...wait a second, we are still scanning the Users table. And sure, less than a second is fine, but...
What if we add an index to that newly created computed column!?
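Something along these lines (the index name is, again, my own):

```sql
-- Indexing the computed column; the expression is deterministic
-- and precise, so this is allowed
CREATE NONCLUSTERED INDEX IX_Users_VoteRatio
    ON dbo.Users (VoteRatio);
```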
...and run it once again!
AND THE CROWD GOES WILD!! That's so fast that I can't even finish this sent
For some reason it now thought this would yield 2020 results instead of 14. But that's fine, 2020 never hurt anyone. Nope. 2020 is absolutely, totally and completely fine. Nothing wrong ever came out of the number 2020. Ship it.
...or did it?
Actually...
Before we go...
Let's just take a look at how this query is actually run.
Hold on...
That's odd...
It looks like those 10000 and 10 values in the query aren't actually fixed numbers... they're values that are put in by some person at your customer's company...
I wonder what would happen if we wrapped this (with everything we've done) inside a stored procedure and ran it once with 10000 and then again with a much lower number... say... 10? Because that would probably yield a lot more results, right?
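In sketch form (procedure and parameter names are mine, not from the original):

```sql
CREATE OR ALTER PROCEDURE dbo.GetControversialUsers
    @MinVoteRatio int,
    @MaxScore     int
AS
BEGIN
    SELECT u.Id, u.DisplayName, u.UpVotes, u.DownVotes
    FROM dbo.Users AS u
    WHERE u.UpVotes / NULLIF(u.DownVotes, 0) > @MinVoteRatio
      AND EXISTS (SELECT 1
                  FROM dbo.Posts AS p
                  WHERE p.OwnerUserId = u.Id
                    AND p.Score < @MaxScore);
END;
GO

-- First call: compiles and caches a plan for these parameter values
EXEC dbo.GetControversialUsers @MinVoteRatio = 10000, @MaxScore = 10;
-- Second call: a much less selective value
EXEC dbo.GetControversialUsers @MinVoteRatio = 10, @MaxScore = 10;
```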
...and then the second one would re-use the first one's query plan...
...oh shit...
...double shit...
Well, there you have it. SQL is hard, and no matter how awesome you are, you'll probably end up making the problem worse than it was before.
Stay tuned for the second part in this series, where we attempt to tackle the common issue of Parameter Sniffing and get this thing working properly!