Friday, April 18, 2008

SQL Waste Land

Welcome to Waste Land, the place where your server is full of char and nchar columns instead of varchar and nvarchar columns. In case you're not familiar with those, the difference between the two sets of data types is that the ones that aren't varying (don't have the letters "var" in the data type name) always consume a fixed amount of space. So if you define a column in a table as char(2) and then insert the value 'a', the database actually stores 'a ' with an extra space.

In SQL Server 2008, when you turn on rowlevel and page compression, it will treat char as varchar and will also compress values. But until then, char columns waste space.

Inspired by the wastefulness of a large database I work with, I wrote some code to find out: just how much of the database is wasted space?

First, some prerequisite functions:

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[TokenItem]') and xtype in (N'FN', N'IF', N'TF'))
drop function [dbo].[TokenItem]
GO
/* Function TokenItem
Author: Erik E
Returns the [TokenNum]th item from a string
For example
PRINT dbo.TokenItem('The quick brown fox jumped over the lazy dog', 4, ' ')
will return the fourth token of the string, with tokens identified as delimited by spaces: fox
*/
CREATE FUNCTION TokenItem (
   @String varchar(8000),
   @TokenNum int,
   @Delimiter varchar(8000) = ','
)
RETURNS varchar(8000)
AS
BEGIN
   DECLARE @DelimLen int
   SET @DelimLen = Datalength(@Delimiter)
   IF @TokenNum < 1 OR @TokenNum > 1 + (Datalength(@String) - Datalength(Replace(@String, @Delimiter, ''))) / @DelimLen
      RETURN ''

   DECLARE @Pos int
   SET @Pos = 1

   WHILE @TokenNum > 1 BEGIN
      SET @Pos = CharIndex(@Delimiter, @String, @Pos) + @DelimLen
      SET @TokenNum = @TokenNum - 1
   END

   RETURN SubString(@String, @Pos, CharIndex(@Delimiter, @String + @Delimiter, @Pos) - @Pos)
END

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[TokenItemRev]') and xtype in (N'FN', N'IF', N'TF'))
drop function [dbo].[TokenItemRev]
GO
CREATE FUNCTION TokenItemRev (
   @String varchar(8000),
   @TokenNum int,
   @Delimiter varchar(1) = ','
)
RETURNS varchar(8000)
/*
Returns the @TokenNum-th token as separated by @Delimiter, counting from the last token
Returns an empty string if no token by that number exists (< 1 or > tokencount)
Example:
PRINT dbo.TokenItemRev('a,b,c,d,e', 2, ',')
returns 'd'
*/
AS
BEGIN
   RETURN Reverse(dbo.TokenItem(Reverse(@String), @TokenNum, @Delimiter))
END

Now the real meat of the tool.

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[SpaceCharInfoCollectBackup]') and OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[SpaceCharInfoCollect]
GO
/*************************************************************************************
Procedure:   SpaceCharInfoCollect
Version:      1.2
Date:         8/11/2006
Author:      Erik E

Description:
   Find out how much wasted space there is in a database,
   because of the use of char or nchar columns instead of varchar or nvarchar.
   Doesn't support object names containing periods.

Parameters:
   TableName      Table to analyze for space information. Since it defaults to the
                  database the stored procedure is in, include the database name
                  if the stored procedure is being run from a different database,
                  as in database.owner.table. Can also specify owner.
   DestTableName  Table to receive the result information, can be in another database
                  on the same server. Use database.owner.table if desiring a table
                  in a different database than the stored procedure is located in or
                  if you want a different owner.
   DateStamp      Time to store as the data collection time. Defaults to GetDate().

Planned Updates:
   A version that runs against all databases on a server, without using undocumented
   stored procedures (See SP SpaceInfoCharCollectAllDB).

Update History:
   1.0 <8/11/2006>:   Original version
   1.1 <10/13/2006>:  Updated to accept destination table, and database name for
                      table being reported on. Added database name to results table.
                      Added object owner logic and column to results table.
   1.2 <4/18/2007>:   Used numbers instead of names for column selection to reduce
                      script size, to handle more char columns in a table without errors

Examples:
USE ThisDatabase
EXECUTE SpaceInfoCharCollectAllDB

USE DesiredDatabase
EXECUTE sp_msforeachtable 'EXECUTE DatabaseOfThisSP.dbo.SpaceCharInfoCollect ''DesiredDatabase.?'', ''DesiredDatabase.dbo.SpaceCharInfo'''
*************************************************************************************/

CREATE PROCEDURE SpaceCharInfoCollect
   @TableName varchar(8000), -- table to analyze space information on, include databasename, use brackets if name has weird characters
   @DestTableName varchar(8000), -- table to receive the result information, can be in another database on the same server
   @DateStamp datetime = NULL, -- defaults to GetDate(),
   @Debug bit = 0
AS
DECLARE
   @NSQL nvarchar(4000),
   @PreSQL varchar(8000),
   @SQL1 varchar(8000),
   @MidSQL1 varchar(8000),
   @SQL2 varchar(8000),
   @MidSQL2 varchar(8000),
   @SQL3 varchar(8000),
   @PostSQL varchar(8000),

   @DB varchar(8000),
   @Owner varchar(8000),
   @Table varchar(8000),

   @DestDB varchar(8000),
   @DestOwner varchar(8000),
   @DestTable varchar(8000)

SET NOCOUNT ON

PRINT @TableName

IF @DateStamp IS NULL SET @DateStamp = GetDate()

SET @TableName = Replace(@TableName, '''', '''''')
SET @Table = WhatDatabase.dbo.TokenItemRev(@TableName, 1, '.')
SET @Owner = WhatDatabase.dbo.TokenItemRev(@TableName, 2, '.')
IF @Owner = '' SET @Owner = user
IF @Owner LIKE '[[]%]' SET @Owner = SubString(@Owner, 2, Len(@Owner) - 2)
SET @DB = WhatDatabase.dbo.TokenItemRev(@TableName, 3, '.')
IF @DB = '' SET @DB = Replace(DB_Name(), '''', '''''')
IF @DB LIKE '[[]%]' SET @DB = SubString(@DB, 2, Len(@DB) - 2)

IF @Table LIKE '[[]%]' SET @Table = SubString(@Table, 2, Len(@Table) - 2)

SET @DestTableName = Replace(@DestTableName, '''', '''''')
SET @DestTable = WhatDatabase.dbo.TokenItemRev(@DestTableName, 1, '.')
SET @DestOwner = WhatDatabase.dbo.TokenItemRev(@DestTableName, 2, '.')
IF @DestOwner = '' SET @DestOwner = user
SET @DestDB = WhatDatabase.dbo.TokenItemRev(@DestTableName, 3, '.')

SET @SQL1 = 'IF NOT EXISTS (SELECT * FROM '
   + @DestDB
   + '.Information_Schema.Tables WHERE Table_Name = ''' + @DestTable + ''' AND Table_Schema = ''' + @DestOwner + ''')
   SELECT
      SpaceDate = convert(datetime, NULL),
      DatabaseName = convert(varchar(128), NULL),
      OwnerName = convert(varchar(128), NULL),
      TableName = convert(varchar(128), NULL),
      ColumnName = convert(varchar(128), NULL),
      ExtraSpaces = convert(bigint, 0),
      RowCnt = convert(bigint, 0)
   INTO ' + @DestTableName + '
   WHERE 1 = 0'

EXECUTE (@SQL1)

SET @PreSQL = 'INSERT ' + @DestTableName + ' (SpaceDate, DatabaseName, OwnerName, TableName, ColumnName, ExtraSpaces, RowCnt)
SELECT
   SpaceDate = ''' + Convert(varchar(50), @DateStamp, 121) + ''',
   DatabaseName = ''' + @DB + ''',
   OwnerName = ''' + @Owner + ''',
   TableName = ''' + Replace(@Table, '''', '''''') + ''',
   ColumnName,
   ExtraSpaces = CASE ColumnNum'
SET @MidSQL1 = '
      END,
   RowCnt
FROM
   (
      SELECT'
SET @MidSQL2 = ',
         RowCnt = Count(*)
      FROM ' + Replace(@TableName, '''', '''''') + ' WITH (NOLOCK)
   ) X CROSS JOIN ('
SET @PostSQL = '
   ) Y (ColumnNum, ColumnName)'

SET @SQL1 = ''

SET @NSQL =
N'SELECT
   @SQL1 = @SQL1 + ''
      WHEN '' + Convert(varchar(11), Ordinal_Position) + '' THEN ['' + Replace(Column_Name, '''''''', '''''''''''') + '']''
      + CASE WHEN Data_type = ''nchar'' THEN ''* 2'' ELSE '''' END,
   @SQL2 = IsNull(@SQL2 + '','', '''') + ''
         ['' + Replace(Column_Name, '''''''', '''''''''''') + ''] = Sum(IsNull('' + LTrim(Str(Character_Maximum_Length)) + '' - Len(['' + Replace(Column_Name, '''''''', '''''''''''') + '']), 0))'',
   @SQL3 = IsNull(@SQL3, '''') + ''
      '' + CASE WHEN @SQL3 IS NULL THEN '''' ELSE ''UNION ALL '' END + ''SELECT ''
      + Convert(varchar(11), Ordinal_Position) + '', ''
      + '''''''' + Replace(Column_Name, '''''''', '''''''''''') + ''''''''
FROM
   ' + QuoteName(@DB) + '.Information_Schema.[Columns] C
   INNER JOIN ' + QuoteName(@DB) + '.Information_Schema.Tables T ON C.Table_Name = T.Table_Name and Table_Type = ''base table''
WHERE
   Data_Type in (''char'', ''nchar'')
   AND T.Table_Name = ''' + @Table + '''
   AND T.Table_Schema = ''' + @Owner + ''''

IF @Debug = 1 PRINT @NSQL

EXECUTE sp_executesql @NSQL, N'@SQL1 varchar(8000) OUTPUT, @SQL2 varchar(8000) OUTPUT, @SQL3 varchar(8000) OUTPUT', @SQL1 OUTPUT, @SQL2 OUTPUT, @SQL3 OUTPUT
IF @SQL3 IS NULL RETURN 0 --no columns found

IF @Debug = 1
   PRINT (@PreSQL + @SQL1 + @MidSQL1 + @SQL2 + @MidSQL2 + @SQL3 + @PostSQL)
   ELSE EXECUTE (@PreSQL + @SQL1 + @MidSQL1 + @SQL2 + @MidSQL2 + @SQL3 + @PostSQL)

RETURN 0

/*
--Compare before and after values from most recent two run dates:
SELECT
   DatabaseName = IsNull(x.databasename, y.databasename),
   OwnerName = IsNull(x.ownername, y.ownername),
   TableName = IsNull(x.tablename, y.ownername),
   ColumnName = IsNull(x.columnname, y.columnname),
   ExtraSpaces1 = Sum(x.extraspaces),
   ExtraSpaces2 = Sum(y.extraspaces),
   RowCnt1 = Min(x.rowcnt), -- / count(distinct IsNull(x.databasename, y.databasename) + IsNull(x.ownername, y.ownername) + IsNull(x.tablename, y.ownername)),
   RowCnt2 = Min(y.rowcnt), --/ count(distinct IsNull(x.databasename, y.databasename) + IsNull(x.ownername, y.ownername) + IsNull(x.tablename, y.ownername))
   CollectionTime1 = Min(x.Spacedate),
   CollectionTime2 = Min(y.Spacedate)
FROM
   (
      SELECT * FROM dbo.spacecharinfo d1 WHERE spacedate = (select max(spacedate) from dbo.spacecharinfo where spacedate < (select max(spacedate) from dbo.spacecharinfo))
   ) x
   FULL JOIN (
      SELECT * FROM dbo.spacecharinfo d1 WHERE spacedate = (select max(spacedate) from dbo.spacecharinfo)
   ) y ON x.databasename = y.databasename AND x.ownername = y.ownername AND x.tablename = y.tablename AND x.columnname = y.columnname
GROUP BY
   IsNull(x.databasename, y.databasename),
   IsNull(x.ownername, y.ownername),
   IsNull(x.tablename, y.ownername),
   IsNull(x.columnname, y.columnname)
WITH ROLLUP
ORDER BY
   IsNull(x.databasename, y.databasename),
   IsNull(x.ownername, y.ownername),
   IsNull(x.tablename, y.ownername),
   IsNull(x.columnname, y.columnname)

select
   spacedate,
   databasename,
   ExtraSpaces = sum(extraspaces),
   [Rows] = sum(rowcnt),
   [Average spaces per row] = convert(float, sum(extraspaces)) / convert(float, sum(rowcnt)),
   [Average spaces per column] = convert(float, avg(extraspaces))
from spacecharinfo
group by spacedate, databasename
with rollup
having grouping(spacedate) = 0
*/

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[SpaceInfoCollectAllDB]') and OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[SpaceInfoCollectAllDB]
GO
-- runs SpaceInfoCollect against all databases and tables, see that SP for more information
CREATE PROCEDURE SpaceInfoCollectAllDB(@DestTableName nvarchar(4000) = NULL)
AS
SET NOCOUNT ON
IF @DestTableName IS NULL SET @DestTableName = QuoteName(DB_Name()) + '.' + QuoteName(User) + '.SpaceCharInfo'
DECLARE
   @SQL nvarchar(4000),
   @DBName nvarchar(128),
   @Dt datetime
SET @Dt = GetDate()
SET @SQL =
'SELECT ''USE '' + [name] + '' EXECUTE sp_msforeachtable ''''EXECUTE ''
   + QuoteName(DB_Name()) + ''.dbo.SpaceCharInfoCollect '''''''''' + QuoteName(Replace([name], '''''''', '''''''''''')) + ''.''
   + Char(63) + '''''''''', ''''''''YourDesiredDefaultDatabase.dbo.SpaceCharInfo'''''''', ''''''''' + Convert(varchar(50), @Dt, 121) + '''''''''''''''
FROM master.dbo.sysdatabases
WHERE [name] NOT IN (''master'', ''model'', ''tempdb'', ''pubs'', ''msdb'')'

SET @DBName = DB_Name()

EXECUTE sp_executesql N'EXECUTE master.dbo.xp_execresultset @SQL, @DBName', N'@SQL nvarchar(4000), @DBName nvarchar(128)', @SQL, @DBName
SET @SQL = 'SELECT * FROM ' + @DestTableName + ' WHERE SpaceDate = ''' + Convert(varchar(50), @Dt, 121) + ''''
SET NOCOUNT OFF
EXECUTE (@SQL)

Whew! Now how do you use all that? Easy.

EXECUTE SpaceInfoCollectAllDB

When you're done, take a look at the code at the end of the SpaceCharInfoCollect to see some final analysis tools for the output table. The database I ran this on had a whopping 10 GB of extra spaces, which is something like 14% of the entire database size.

Yes, the dynamic SQL is really difficult to work with. Making major changes directly in the script is nearly impossible: it's much easier to grab the section of script you want to change, paste it into a new window, and replace doubled single-quotes with single single-quotes. Which of course doesn't work for embedded values from the full outer script, but it helps immensely. Then when you have what you need, you double them back up and paste in just the parts you changed. At times you have to go two levels of dynamic SQL deep (or more) and undouble/double your quotes at each step. Use the power of search and replace to help you.

In some cases for small changes it does work to use the debug mode to output the final script and then examine or try to run THAT, and iteratively modify the dynamic SQL and see what you get.

If you find any bugs, please let me know. I haven't tested this on nchar columns, but it should multiply the space used by 2 since we care about bytes rather than characters.

P.S. There are 15 single quotes in a row in this script. Wheeeeeeeee.

2 comments:

BH said...

You could help yourself with quotes and dynamic SQL if you used " instead of ' to delimit your strings, then do a Replace(@sql, '"', '''') before executing the SQL.

Erik Eckhardt said...

Thanks for the idea. I will experiment with it. You've given me the idea to create each successive layer of SQL with a SET statement and then incorporate it into the next layer as I go.

Something like

SET @SQL = 'SELECT * FROM {TableName}'

SELECT @SQL = Replace(@SQL, '{TableName}', Table_Name) FROM Information_Schema.Tables

The idea is to make each sub-level SQL statement to be self-contained and (nearly) unescaped, as opposed to the way I've done it in the past that each next sub-level must be escaped again.

Though, I seem to do okay with this. Coincidentally, I just reworked this SP into one that searches tables for values this week, and it wasn't too bad. Though if the SQL had been more plain, it probably would have taken less time.