Wednesday, January 27, 2016

Count all records in a table

That is very easy and well know topic, but I still have to do it because there are still a lot of people who do not really know how counting really works.

Just recently I've met a person, who argued with me on the way of counting records in a table.
He was irritated by me using "COUNT(*)" and suggested using "COUNT(1)" instead.

I've tried to explain that result would be the same anyway with the same expenses.
Unfortunately there was no time for an explanation and I want to put it in that post for everybody.

At first we can run following script against AdventureWorks Database (which you can download here):
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT COUNT(*) FROM Person.Person;
SELECT COUNT(1) FROM Person.Person;
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
If you run that script multiple times and see in "Messages" tab, you'll notice that both statements execute equally quick and require absolutely same amount of IO to count the records.


The way of thinking that "COUNT(*)" is bad starts from the fact that in most cases "SELECT * " is bad and you have to avoid asterisk when you can.
But then, what is the difference between "COUNT(*)" and "COUNT(1)"?

"*" Means that we "want to count all records in the table"
"1" Means that we "want to count all records in the table substituted by number 1"

You Might be surprised, but all following statements produce equal results and spend the same amount of resources:
SELECT COUNT(*) FROM Person.Person;
SELECT COUNT(1) FROM Person.Person;
SELECT COUNT(1000000) FROM Person.Person;
SELECT COUNT('A') FROM Person.Person;
SELECT COUNT('Any Substitution String') FROM Person.Person;
If, before the execution, you press "Ctrl-M" in your SQL Server Management Studio you can see the query execution times for all these queries:

All query plans for these queries are the same and cost exactly the same 20%

Look at the most expensive part of all queries: Index Scan, which consumes 86% of a query.
It produces 19972 output rows and all these rows have the same size 9 bytes, even in the case of crazy long sentence:



So, what is the difference of what we put inside of COUNT() function?

The difference starts when we try to count by specific field. For example by Firs and Middle name:
SET STATISTICS IO ON
SELECT COUNT(FirstName) FROM Person.Person;
SELECT COUNT(MiddleName) FROM Person.Person;
SET STATISTICS IO OFF

The result will be little surprising:





Why counting on the same table is different?
Because Middle name has NULL values, which were ignored by COUNT() function.




If we rewrite the second query like this:
SELECT COUNT(IsNull(MiddleName,'')) FROM Person.Person;
We will get the perfect result of 19972 records.

That means that using Column names for counting should be used in cases when you want to count items ONLY in the specific field and do not care about overall number of records.

You also might use "DISTINCT" statement to count only unique records:
SELECT COUNT(DISTINCT FirstName) FROM Person.Person;
That query returns only 1018 counted First names.

"DISTINCT" is pretty expensive operation. Use it only when it is absolutely necessary.

At this point we established that "COUNT(*)" and "COUNT(1)" are basically equivalent and produce the same result, but do we have a faster way to count the rows?
Of cause we have!

We can use "sys.dm_db_partition_stats":
SET STATISTICS IO ON
SET STATISTICS TIME ON

SELECT COUNT(*) FROM Person.Person;

SELECT row_count FROM sys.dm_db_partition_stats
WHERE index_id < 2 and object_id = OBJECT_ID('Person.Person');


SET STATISTICS TIME OFF
SET STATISTICS IO OFF

You can see that second query takes only 12 read operations to get the row count in a table.
For the small table difference is not significant, but for tables with billions of rows you can prefer that method over anything else.


WARNING!
Use "sys.dm_db_partition_stats" only in cases when you do not need the exact record count.
MSDN says that it provides "The approximate number of rows in the partition.".
That means for small tables you can safely use "COUNT(*)" and for large ones you can get an approximation in very quick manner.

No comments:

Post a Comment