This week I was at the Detroit MTC for a SQL Server 2012 release event where Rick Brewis (Microsoft) and my friend Josh Fennessy (BlueGranite) presented on the new functionality of SQL Server 2012. One of the highlights Rick’s presentation was the new Columnstore Index, and the huge performance increases it yields over existing models. After seeing the performance increases I started wondering how the Columnstore Index compared to other common indexing strategies available in the SQL Server engine in a data warehouse.
In SQL Server 2012 there are three main types of indexes that are commonly used to improve performance in a data warehouse:
A clustered index is used to store and sort rows of a table in order based on the clustered index. The clustered index is stored as a B-Tree which allows for quick retrieval of rows (stored as leaves in the tree) based on the key values of the clustered index. In a data warehouse fact table it is important we apply the clustered index to the most commonly used predicate. For example, a sales fact table typically is analyzed by sale date therefor the clustered index would be applied to the sale date column. Without a clustered index on your table it is stored as a heap. More information on clustered indexes can be found at SQLServerCentral.com’s Introduction to Indexes: Part 2
Non-Clustered indexes are very similar to clustered indexes in the sense they are stored as a B-Tree, however the rows themselves are not leaves of the unclustered index, rather only the data included in the index and a pointer to the physical row is stored at the leaf level. In earlier versions of SQL one of the easiest ways to speed up a query was to create a covering non-clustered index (a non-clustered index which included all the query’s columns) at the cost of disk space. More information on non-clustered indexes can be found at SQLServerCentral.com’s Introduction to Indexes: Part 3
Columnstore indexes are a new type of index which have shipped as part of SQL Server 2012. Many individuals are familiar with this technology as PowerPivot. Columnstore indexes data is grouped and compressed one column at a time and stored in memory. For small tables you can include all columns in the Columnstore index, however, since Columnstore Indexes are memory bound I’ve found the best compromise to be including all dimension keys in the index. There are also a few limitations with Columnstore indexes; the most notable limitations are these indexes can only be created as non-clustered (for the best performance it is important to also have a clustered index on your table) and tables cannot be updated once there is a Columnstore index without first dropping the index. Additional information on Columnstore Indexes can be found at MSDN.
Now that we have an understanding of which indexes are available to us, we can now begin testing the various types of indexes. For my testing I utilized an existing fact table on my laptop with 120 columns and 30 million data records. For the sake of simplicity I am going to illustrate my examples using a simplified version of the table and it’s indexes.
CREATE TABLE [dbo].[FactSales] ( [TimeKey] [int] NOT NULL, [ProductKey] [int] NOT NULL, [FinancialOrgKey] [int] NOT NULL, [SalesQty] [int] NULL, [SalesAmount] [money] NULL ) ON PRIMARY
Once the table was created and loaded I created different indexing scenarios and using a common query, returning around 325K rows, tested the performance of the different scenarios.
SELECT * FROM [dbo].[FactSales] WHERE ProductKey IN ( SELECT ProductKey FROM [dbo].[FactSales] GROUP BY ProductKey HAVING COUNT(*) > 25000 ) AND FinancialOrgKey IN ( SELECT FinancialOrgKey FROM [dbo].[FactSales] GROUP BY FinancialOrgKey HAVING COUNT(*) > 5000 ) AND TimeKey BETWEEN 20110101 AND 20110301
The first scenario I wanted to test was how does the query perform with no indexing at all to get a baseline of how the query performed with no performance optimizations. Surprisingly it only took 1 minute 30 seconds to return all 325K rows.
Clustered Index Only
Next I wanted to see how adding a clustered index to the table affected the performance. Since I am working with sales I added the clustered index to the integer time key and executed my test query. This time the results returned in 1 minute 14 seconds adding 10.2MB to the database in the 6 minutes 11 seconds to build the index.
CREATE CLUSTERED INDEX [ClusteredIndex] ON [dbo].[FactSales] ( [TimeKey] ASC )
Clustered Index & Non-Clustered Indexes on Dimension Keys
The third test I performed was applying indexes in the manner the is suggested in The Microsoft Data Warehouse Toolkit. For those not familiar with this approach single non-clustered indexes are added to each dimension key on the fact table in addition to the clustered index on the time key. This approach should in theory increase performance by replacing table scans with more efficient key lookups. This time the query only took 45 seconds but with a substantial data growth of 1396.6MB in the 7 minutes 57 seconds needed to build the indexes.
CREATE CLUSTERED INDEX [ClusteredIndex] ON [dbo].[FactSales] ( [TimeKey] ASC ) CREATE NONCLUSTERED INDEX [NonClusteredFinancialOrgKeyIndex] ON [dbo].[FactSales] ( [FinancialOrgKey] ASC ) CREATE NONCLUSTERED INDEX [NonClusteredProductKeyIndex] ON [dbo].[FactSales] ( [ProductKey] ASC )
Now that we have tested the existing index types available in earlier versions of SQL Server, I moved on to the new SQL Server 2012 Columnstore Index. First, I removed all existing indexes from the FactSales table and created the Columnstore Index. The existing indexes were removed to accurately gauge how the Columnstore index performs when referencing items in a heap. Surprisingly the test query only took 46 seconds to complete and only grew the database by 323MB in the 40 seconds it took to build the index.
CREATE NONCLUSTERED COLUMNSTORE INDEX [NonClusteredColumnStore] ON [dbo].[FactSales] ( [TimeKey], [ProductKey], [FinancialOrgKey] )
Columnstore Index With Clustered Index
After seeing the impressive performance improvements the Columnstore Index I didn’t think adding a clustered index to the table would yield any further improvement, but to my surprise after adding the clustered index the test query performed even better and returned all 325K rows in only 36 seconds and only added 333.2MB to the database in the 8 minutes and 22 seconds it took to build the indexes.
CREATE NONCLUSTERED COLUMNSTORE INDEX [NonClusteredColumnStore] ON [dbo].[FactSales] ( [TimeKey], [ProductKey], [FinancialOrgKey] ) CREATE CLUSTERED INDEX [ClusteredIndex] ON [dbo].[FactSales] ( [TimeKey] ASC )
Columnstore Index With Clustered Index and Non-Clustered Indexes on Dimension Keys
With the first two Columnstore Index tests returning surprisingly good results I decided to try to squeeze a little more performance out by adding Non-Clustered Indexes to the dimension keys to eliminate one remaining table scan in the query plan. The new indexes allowed the test query to run in only 31 seconds, however this small improvement came at the much higher cost of adding 1719MB to to the database in the 9 minutes and 30 seconds used to build the indexes.
CREATE NONCLUSTERED COLUMNSTORE INDEX [NonClusteredColumnStore] ON [dbo].[FactSales] ( [TimeKey], [ProductKey], [FinancialOrgKey] ) CREATE CLUSTERED INDEX [ClusteredIndex] ON [dbo].[FactSales] ( [TimeKey] ASC ) CREATE NONCLUSTERED INDEX [NonClusteredFinancialOrgKeyIndex] ON [dbo].[FactSales] ( [FinancialOrgKey] ASC ) CREATE NONCLUSTERED INDEX [NonClusteredProductKeyIndex] ON [dbo].[FactSales] ( [ProductKey] ASC )
In closing, SQL 2012 offers some significant improvements when indexing data warehouses. As always your mileage may vary with these indexing strategies, so I definitely suggest performing similar tests in your environment prior to making any sweeping changes to your indexing strategy.