
What are Statistics in SQL Server and why do they matter to me?

Written by RKL Team | Aug 13, 2020 10:30:00 AM

Executive summary – You want to ensure that your databases are configured to maintain statistics properly. This means AUTO_CREATE_STATISTICS is set to ON, AUTO_UPDATE_STATISTICS is set to ON, and AUTO_UPDATE_STATISTICS_ASYNC is set to ON. For databases running on SQL Server 2014 or earlier, you will also want to enable trace flag 2371 for better updating of statistics on large tables. We also highly recommend a weekly maintenance job that rebuilds indexes (which updates their related statistics) as well as a weekly maintenance job that refreshes out-of-date statistics.


Sage X3, Sage 500 and Sage 100 Premium all run their relational database on Microsoft SQL Server. (Sage X3 can run on Oracle, but most installs we work with are on SQL Server.) For Sage X3 and Sage 100, the primary use of SQL Server is to store and retrieve data for the transactional workload and to retrieve data for the reporting workload. Sage 500 does the same; however, around 60% of the business logic workload in Sage 500 is also executed within SQL Server. For all three, having SQL Server running optimally has a real bearing on performance for end users. Statistics in SQL Server play an important and often ignored role in this performance.

So what are statistics and where do they fit in?

I’ll back up a bit and explain how the SQL Server database engine executes a query. When the database engine receives a query, it needs to decide how to retrieve the data and return it to the calling program in an acceptably efficient way. The engine has a query optimizer: logic that looks at the tables in the query and the predicates to determine which tables to pull data from first and how to join the data together. It also chooses which indexes to use, based on the columns needed in the query. The optimizer compares the costs of different query plans and selects a plan that it deems acceptable. This is called compiling the plan.

Statistics are used by the query optimizer to estimate how many rows will be returned from a table based on the predicates in the query. A predicate is a logic expression, in either the WHERE clause or the ON expression of a join, that returns true or false.
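To make that concrete, here is a small query against hypothetical Customers and Orders tables (the table and column names are illustrative only). Both the WHERE clause and the join's ON expression are predicates the optimizer must estimate row counts for:

```sql
-- Customers and Orders are hypothetical tables used for illustration.
SELECT c.CustomerName, o.OrderDate
FROM Customers AS c
JOIN Orders AS o
    ON o.CustomerID = c.CustomerID   -- join predicate (ON expression)
WHERE c.FirstName = 'Joe';           -- filter predicate (WHERE clause)
```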

So what is a statistic?

A statistic is a histogram of a column of data in a table. It can cover multiple columns, but it is easiest to think about as a single column. The histogram can have up to 200 buckets, and the goal is to count the number of rows that fall into each bucket. For instance, if the column holds first names, you might get 26 buckets, one for each letter of the alphabet: the histogram counts the rows that start with the letter A and records that in the A bucket, and does the same for each letter. When the query optimizer has a predicate on this column, for instance WHERE FirstName = 'Joe', it estimates how many rows in the table will have the name Joe. It checks the row count in the J bucket, and if there are 50 rows, it estimates that retrieving the rows for this table will return 50 rows.
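If you want to see one of these histograms yourself, DBCC SHOW_STATISTICS will display it. This is a minimal sketch assuming a dbo.Customers table and a statistic named after an index; the real statistic names for your tables are listed in sys.stats:

```sql
-- List the statistics that exist on a table (dbo.Customers is a placeholder).
SELECT name FROM sys.stats WHERE object_id = OBJECT_ID('dbo.Customers');

-- Show the histogram buckets behind one of them (statistic name is assumed).
DBCC SHOW_STATISTICS ('dbo.Customers', 'IX_Customers_FirstName') WITH HISTOGRAM;
```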

Getting the row count right is important because it allows the query optimizer to better decide which data to retrieve first. The optimizer tends to pick the smallest data sets first and uses those smaller data sets to limit what has to be returned from the larger ones. If data is not going to be used in the query, why read it? If I have 1,000 customers, 10 of whom have the first name Joe, and a million orders, I would rather return the 10 records for Joe and the roughly 1,000 orders tied to those Joes than read the whole million records.
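You can watch these estimates without running the query. SET SHOWPLAN_ALL returns the plan, including an EstimateRows value per operator, instead of executing the statement (the tables are again the hypothetical ones from above):

```sql
-- Return the estimated plan rather than executing the query.
SET SHOWPLAN_ALL ON;
GO
SELECT c.CustomerID, o.OrderID
FROM Customers AS c
JOIN Orders AS o ON o.CustomerID = c.CustomerID
WHERE c.FirstName = 'Joe';
GO
SET SHOWPLAN_ALL OFF;
GO
```

The EstimateRows column in the output comes straight from the statistics described above; if the histogram says 10 Joes, the plan will show roughly 10.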

So the short version on statistics: good statistics make queries go faster, and bad statistics can make queries go much slower.

There are three configuration settings, a trace flag, and two maintenance tasks that will help ensure your statistics are working properly.

AUTO_CREATE_STATISTICS – This is a configuration setting at the database level. When it is set to ON, SQL Server will create column statistics when it thinks it needs them. I haven’t encountered a case where you would want to set this to OFF. Statistics are created when you create an index, and they are also created when SQL Server finds a column in a predicate that does not yet have a statistic.
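Checking and setting this takes one query and one command; [YourDatabase] is a placeholder for your own database name:

```sql
-- Check all three statistics settings for every database on the instance.
SELECT name,
       is_auto_create_stats_on,
       is_auto_update_stats_on,
       is_auto_update_stats_async_on
FROM sys.databases;

-- Enable auto-create statistics if it is off.
ALTER DATABASE [YourDatabase] SET AUTO_CREATE_STATISTICS ON;
```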

AUTO_UPDATE_STATISTICS – This configuration tells SQL Server to go ahead and update statistics when it thinks they are out of date. The decision uses an algorithm based on the table size and the number of rows modified since the statistic was last updated. For instance, if the table has a million rows and there have been 200,000 modifications since the last statistics calculation, SQL Server will update the statistic – meaning it updates the row counts in the histogram buckets. Ensure that your database has this set to ON.
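The command follows the same pattern, and sys.dm_db_stats_properties lets you see the modification count the algorithm works from (dbo.Orders is a placeholder table name):

```sql
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS ON;

-- modification_counter = rows changed since the statistic was last updated.
SELECT s.name, sp.last_updated, sp.rows, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.Orders');
```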

AUTO_UPDATE_STATISTICS_ASYNC – This configuration tells SQL Server that it is OK to update statistics asynchronously. Surprising, isn’t it? When retrieving a statistic, SQL Server checks whether it needs an update by comparing the row count to the modified row count. If this configuration is set to OFF, SQL Server updates the statistic before returning the estimated row count to the query optimizer, which can slow down the execution of the optimizer. If it is set to ON, SQL Server returns the row count immediately and marks the statistic for an asynchronous update; a separate process looks for statistics flagged this way and updates them. Ensure that your database has this setting set to ON.
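Again, [YourDatabase] is a placeholder. Note that the async setting only takes effect when AUTO_UPDATE_STATISTICS is also ON:

```sql
-- Async updates require AUTO_UPDATE_STATISTICS to be ON as well.
ALTER DATABASE [YourDatabase] SET AUTO_UPDATE_STATISTICS_ASYNC ON;
```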

TRACE FLAG 2371 – This trace flag changes the algorithm used to update statistics on large tables. The logic from the earliest versions of SQL Server was to check whether 20% or more of the rows had been modified (it was a little more complex than that, but for large tables this was the case). For large tables this is a problem: a million-row table takes 200,000 modifications before the auto update runs. Trace flag 2371, introduced in SQL Server 2008 R2, replaces that fixed 20% threshold with a dynamic one that shrinks as the row count grows, so a million-row table has a threshold of roughly 32,000 row modifications. For compatibility level 130 (SQL Server 2016) and above this behavior is enabled by default, so for compatibility level 120 or below, enable trace flag 2371.
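To turn the trace flag on for the running instance, use DBCC TRACEON with the -1 argument for global scope; to make it survive a service restart, add -T2371 as a SQL Server startup parameter instead:

```sql
-- Enable trace flag 2371 globally (lasts until the service restarts).
DBCC TRACEON (2371, -1);

-- Confirm which trace flags are currently active.
DBCC TRACESTATUS (-1);
```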

Maintenance Jobs – Ensure that your databases have a weekly maintenance job that rebuilds fragmented indexes and then updates stale statistics. When you rebuild indexes, the associated statistics are also updated. Since not all statistics are based on an index, a subsequent step should update the stale statistics that remain.
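A minimal sketch of those two steps is below. Real maintenance jobs typically loop over sys.indexes and check fragmentation rather than rebuilding everything; dbo.Orders is a placeholder:

```sql
-- Step 1: rebuild the indexes on a table; this also refreshes their
-- statistics with a full scan as a side effect.
ALTER INDEX ALL ON dbo.Orders REBUILD;

-- Step 2: refresh remaining stale statistics across the database.
-- sp_updatestats only touches statistics with row modifications since
-- their last update.
EXEC sp_updatestats;
```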

For the vast majority of Sage X3, Sage 100 and Sage 500 implementations, the above configuration settings will keep your statistics happy, healthy and performing as expected. 

Related Articles


Index Advantage, Surprise this is Relative
More SQL Blogs by John Puttman