Vertical Partitioning


Introduction

Data density was hinted at in the last three parts of this series, but never fully discussed. Put simply, data density is how many rows fit on each page and how much space is left over on each page. As discussed previously, a page can hold at most 8,060 bytes of row data. This means that if each row is a fixed 5,000 bytes wide, only one row fits per page, and essentially 3,060 bytes are wasted on every page. Across a large table, this can add up to a huge amount of wasted space in the data files.
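To make that arithmetic concrete, here is a quick back-of-the-envelope query (plain integer math, not tied to any real table) for the 5,000-byte-row example above:

```sql
-- With an 8,060-byte page and a fixed 5,000-byte row:
SELECT
  8060 / 5000 AS [Rows per page]        -- integer division: only 1 row fits
, 8060 % 5000 AS [Bytes wasted per page] -- remainder: 3,060 bytes unused
```

Running this returns 1 row per page and 3,060 wasted bytes, matching the numbers above.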

A good practice to avoid wasting space is vertical partitioning. This practice has two main categories:

  • Normalization
  • Row splitting

This article focuses on row splitting. Put simply, this is the process of dividing out columns that are rarely used, or that are large, and storing them in another table.

Because those values now live in another table, queries that need them become a bit more complicated to write, and every such query pays the cost of a join. That cost matters if the column is large and frequently used. If the column is rarely used, however, row splitting can be very useful not only for storage but also for query performance.

Demo Setup

To start off, we are going to demo a table that has not been vertically partitioned.

CREATE TABLE [NOVerticalPartitioning]
(
[Column1] INT IDENTITY,
[Column2] VARCHAR (100),
[Column3] VARCHAR (20),
[Column4] VARCHAR (1000)
)

From here we will insert one million rows; I believe this to be a realistic size for a person or product table.

INSERT INTO [NOVerticalPartitioning]
(
[Column2],
[Column3],
[Column4]
)
VALUES
(
REPLICATE ('2', 50),
REPLICATE ('3', 20),
REPLICATE ('4', 1000)
)
-- GO <n> re-runs the batch above n times (an SSMS/sqlcmd feature)
GO 1000000

Now, before we analyze this data, let's see how it would look if we move the large VARCHAR (1000) column, which is always populated, off to another table.

CREATE TABLE [WITHVerticalPartitioning1]
(
[Column1] INT IDENTITY PRIMARY KEY,
[Column2] VARCHAR (100),
[Column3] VARCHAR (20)
)

CREATE TABLE [WITHVerticalPartitioning2]
(
[Column1] INT IDENTITY FOREIGN KEY REFERENCES
[WITHVerticalPartitioning1]([Column1]),
[Column4] VARCHAR (1000)
)

And let's also insert the same one million rows as before.

BEGIN TRANSACTION
GO
INSERT INTO [WITHVerticalPartitioning1]
(
[Column2],
[Column3]
)
VALUES
(
REPLICATE ('2', 50),
REPLICATE ('3', 20)
)

INSERT INTO [WITHVerticalPartitioning2]
(
[Column4]
)
VALUES
(
REPLICATE ('4', 1000)
)
-- the two IDENTITY columns stay in sync: each batch adds one row to each table
GO 1000000
COMMIT TRANSACTION

Now let's use the same query from the previous post to view the row structure of the table with no vertical partitioning.

SELECT
[alloc_unit_type_desc] AS [Data Structure]
, [page_count] AS [pages]
, [record_count] AS [Rows]
, [min_record_size_in_bytes] AS [min row]
, [max_record_size_in_bytes] AS [Max Row]
FROM sys.dm_db_index_physical_stats
(DB_ID()
, OBJECT_ID(N'NOVerticalPartitioning')
, NULL
, NULL
, N'sampled')

Note that the row count is 1,000,200. This is because we took only a "sampled" view of the pages; we will need the "detailed" mode to get an exact count. As previously noted, since the table has more than 10,000 pages, the sampled mode will not automatically upgrade to a detailed scan.

SELECT
[alloc_unit_type_desc] AS [Data Structure]
, [page_count] AS [pages]
, [record_count] AS [Rows]
, [min_record_size_in_bytes] AS [min row]
, [max_record_size_in_bytes] AS [Max Row]
FROM sys.dm_db_index_physical_stats
(DB_ID()
, OBJECT_ID(N'NOVerticalPartitioning')
, NULL
, NULL
, N'detailed')

There we go: exactly 1,000,000 rows.

Now let's see how our performance looks. First, to make sure we start from a clean state, let's clear the buffer cache with the following script.

CHECKPOINT; -- flush dirty pages to disk
GO
DBCC DROPCLEANBUFFERS; -- remove clean pages from the buffer pool
GO

Now let's run a query that returns only the even-numbered rows, based on our identity column value. Before running it, let's also turn on STATISTICS IO; its output shows what actually happened during execution, which helps you understand how the query performed.

SET STATISTICS IO ON
GO
SELECT * FROM [NOVerticalPartitioning]
WHERE Column1 % 2 = 0
GO
SET STATISTICS IO OFF

Once we click on the Messages tab, we will see the following:

(500000 row(s) affected) Table ‘NOVerticalPartitioning’. Scan count 1, logical reads 166674, physical reads 0, read-ahead reads 166473, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
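Each logical read is one 8 KB page, so those numbers translate into a surprisingly large amount of data scanned. A rough conversion, again just arithmetic on the figures above:

```sql
-- 166,674 logical reads x 8 KB per page, converted to MB
SELECT 166674 * 8 / 1024 AS [MB scanned]  -- roughly 1.3 GB for a single query
```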

Now, for the vertically partitioned tables, let's first look at the row structures.

SELECT
[alloc_unit_type_desc] AS [Data Structure]
, [page_count] AS [pages]
, [record_count] AS [Rows]
, [min_record_size_in_bytes] AS [min row]
, [max_record_size_in_bytes] AS [Max Row]
FROM sys.dm_db_index_physical_stats
(DB_ID()
, OBJECT_ID(N'WITHVerticalPartitioning1')
, NULL
, NULL
, N'detailed')

SELECT
[alloc_unit_type_desc] AS [Data Structure]
, [page_count] AS [pages]
, [record_count] AS [Rows]
, [min_record_size_in_bytes] AS [min row]
, [max_record_size_in_bytes] AS [Max Row]
FROM sys.dm_db_index_physical_stats
(DB_ID()
, OBJECT_ID(N'WITHVerticalPartitioning2')
, NULL
, NULL
, N'detailed')

Finally, let's see how our query performs against the partitioned tables.

SET STATISTICS IO ON
GO
SELECT * FROM [WITHVerticalPartitioning1] VP1
INNER JOIN [WITHVerticalPartitioning2] VP2 ON VP1.Column1 = VP2.Column1
WHERE VP1.Column1 % 2 = 0
GO
SET STATISTICS IO OFF
GO

Results Comparison

Now let's compare what we have gathered so far:

| Metric | Vertically Partitioned | NOT Vertically Partitioned |
| --- | --- | --- |
| Total pages | 154018 | 166674 |
| Selecting all columns - Scan Count | 10 | 1 |
| Selecting all columns - Logical Reads | 193764 | 166674 |
| Selecting all columns - Physical Reads | 4659 | 0 |
| Selecting all columns - Read-Ahead Reads | 193764 | 166674 |

Well, it looks like we have saved 12,656 pages, which is excellent: at 8 KB per page, that works out to 101,248 KB, roughly 99 MB. However, if we were always returning the last column, the join overhead would definitely make this not worth it for performance reasons. But what if we weren't always selecting all of the columns? Would that make the difference and make it worth our time?
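The page savings can be double-checked with the same page-size arithmetic, using the totals from the comparison table:

```sql
-- page counts from the comparison table, at 8 KB per page
SELECT
  166674 - 154018       AS [Pages saved]  -- 12,656 pages
, (166674 - 154018) * 8 AS [KB saved]     -- 101,248 KB, roughly 99 MB
```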

To find out, let's run the following queries and analyze the outputs.

SET STATISTICS IO ON
GO
SELECT Column2,Column3 FROM [WITHVerticalPartitioning1] VP1
WHERE vp1.Column1 % 2 = 0
SELECT Column2,Column3 FROM [NOVerticalPartitioning]
WHERE Column1 % 2 = 0
GO
SET STATISTICS IO OFF
GO

| Metric | Vertically Partitioned | NOT Vertically Partitioned |
| --- | --- | --- |
| Selecting Column2 and Column3 - Scan Count | 1 | 1 |
| Selecting Column2 and Column3 - Logical Reads | 11154 | 166674 |
| Selecting Column2 and Column3 - Physical Reads | 3 | 0 |
| Selecting Column2 and Column3 - Read-Ahead Reads | 11150 | 166674 |

That made a huge difference for performance: because we no longer had to read the large Column4 data, logical reads dropped from 166,674 to 11,154, making the query almost 15 times cheaper.

Conclusion

In summary, vertical partitioning is a good design practice, but whether to use it depends on your workload. Do not move columns into another table just because they are large. Instead, take the time to evaluate how your data is actually queried and make an informed choice about which table each column belongs in.
