Friday, April 18, 2008

SQL Waste Land

Welcome to Waste Land, the place where your server is full of char and nchar columns instead of varchar and nvarchar columns. In case you're not familiar with those, the difference between the two sets of data types is that the ones that aren't varying (don't have the letters "var" in the data type name) always consume a fixed amount of space. So if you define a column in a table as char(2) and then insert the value 'a', the database actually stores 'a ' with an extra space.

In SQL Server 2008, when you turn on rowlevel and page compression, it will treat char as varchar and will also compress values. But until then, char columns waste space.

Inspired by the wastefulness of a large database I work with, I wrote some code to find out: just how much of the database is wasted space?

First, some prerequisite functions:

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[TokenItem]') and xtype in (N'FN', N'IF', N'TF'))
drop function [dbo].[TokenItem]
/* Function TokenItem
Author: Erik E
Returns the [TokenNum]th item from a string
For example
PRINT dbo.TokenItem('The quick brown fox jumped over the lazy dog', 4, ' ')
will return the fourth token of the string, with tokens identified as delimited by spaces: fox
   @String varchar(8000),
   @TokenNum int,
   @Delimiter varchar(8000) = ','
RETURNS varchar(8000)
   DECLARE @DelimLen int
   SET @DelimLen = Datalength(@Delimiter)
   IF @TokenNum < 1 OR @TokenNum > 1 + (Datalength(@String) - Datalength(Replace(@String, @Delimiter, ''))) / @DelimLen
      RETURN ''

   DECLARE @Pos int
   SET @Pos = 1

   WHILE @TokenNum > 1 BEGIN
      SET @Pos = CharIndex(@Delimiter, @String, @Pos) + @DelimLen
      SET @TokenNum = @TokenNum - 1

   RETURN SubString(@String, @Pos, CharIndex(@Delimiter, @String + @Delimiter, @Pos) - @Pos)

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[TokenItemRev]') and xtype in (N'FN', N'IF', N'TF'))
drop function [dbo].[TokenItemRev]
   @String varchar(8000),
   @TokenNum int,
   @Delimiter varchar(1) = ','
RETURNS varchar(8000)
Returns the @TokenNum-th token as separated by @Delimiter, counting from the last token
Returns an empty string if no token by that number exists (< 1 or > tokencount)
PRINT dbo.TokenItemRev('a,b,c,d,e', 2, ',')
returns 'd'
   RETURN Reverse(dbo.TokenItem(Reverse(@String), @TokenNum, @Delimiter))

Now the real meat of the tool.

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[SpaceCharInfoCollectBackup]') and OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[SpaceCharInfoCollect]
Procedure:   SpaceCharInfoCollect
Version:      1.2
Date:         8/11/2006
Author:      Erik E

   Find out how much wasted space there is in a database,
   because of the use of char or nchar columns instead of varchar or nvarchar.
   Doesn't support object names containing periods.

   TableName      Table to analyze for space information. Since it defaults to the
                  database the stored procedure is in, include the database name
                  if the stored procedure is being run from a different database,
                  as in database.owner.table. Can also specify owner.
   DestTableName  Table to receive the result information, can be in another database
                  on the same server. Use database.owner.table if desiring a table
                  in a different database than the stored procedure is located in or
                  if you want a different owner.
   DateStamp      Time to store as the data collection time. Defaults to GetDate().

Planned Updates:
   A version that runs against all databases on a server, without using undocumented
   stored procedures (See SP SpaceInfoCharCollectAllDB).

Update History:
   1.0 <8/11/2006>:   Original version
   1.1 <10/13/2006>:  Updated to accept destination table, and database name for
                      table being reported on. Added database name to results table.
                      Added object owner logic and column to results table.
   1.2 <4/18/2007>:   Used numbers instead of names for column selection to reduce
                      script size, to handle more char columns in a table without errors

USE ThisDatabase
EXECUTE SpaceInfoCharCollectAllDB

USE DesiredDatabase
EXECUTE sp_msforeachtable 'EXECUTE DatabaseOfThisSP.dbo.SpaceCharInfoCollect ''DesiredDatabase.?'', ''DesiredDatabase.dbo.SpaceCharInfo'''

   @TableName varchar(8000), -- table to analyze space information on, include databasename, use brackets if name has weird characters
   @DestTableName varchar(8000), -- table to receive the result information, can be in another database on the same server
   @DateStamp datetime = NULL, -- defaults to GetDate(),
   @Debug bit = 0
   @NSQL nvarchar(4000),
   @PreSQL varchar(8000),
   @SQL1 varchar(8000),
   @MidSQL1 varchar(8000),
   @SQL2 varchar(8000),
   @MidSQL2 varchar(8000),
   @SQL3 varchar(8000),
   @PostSQL varchar(8000),

   @DB varchar(8000),
   @Owner varchar(8000),
   @Table varchar(8000),

   @DestDB varchar(8000),
   @DestOwner varchar(8000),
   @DestTable varchar(8000)


PRINT @TableName

IF @DateStamp IS NULL SET @DateStamp = GetDate()

SET @TableName = Replace(@TableName, '''', '''''')
SET @Table = WhatDatabase.dbo.TokenItemRev(@TableName, 1, '.')
SET @Owner = WhatDatabase.dbo.TokenItemRev(@TableName, 2, '.')
IF @Owner = '' SET @Owner = user
IF @Owner LIKE '[[]%]' SET @Owner = SubString(@Owner, 2, Len(@Owner) - 2)
SET @DB = WhatDatabase.dbo.TokenItemRev(@TableName, 3, '.')
IF @DB = '' SET @DB = Replace(DB_Name(), '''', '''''')
IF @DB LIKE '[[]%]' SET @DB = SubString(@DB, 2, Len(@DB) - 2)

IF @Table LIKE '[[]%]' SET @Table = SubString(@Table, 2, Len(@Table) - 2)

SET @DestTableName = Replace(@DestTableName, '''', '''''')
SET @DestTable = WhatDatabase.dbo.TokenItemRev(@DestTableName, 1, '.')
SET @DestOwner = WhatDatabase.dbo.TokenItemRev(@DestTableName, 2, '.')
IF @DestOwner = '' SET @DestOwner = user
SET @DestDB = WhatDatabase.dbo.TokenItemRev(@DestTableName, 3, '.')

   + @DestDB
   + '.Information_Schema.Tables WHERE Table_Name = ''' + @DestTable + ''' AND Table_Schema = ''' + @DestOwner + ''')
      SpaceDate = convert(datetime, NULL),
      DatabaseName = convert(varchar(128), NULL),
      OwnerName = convert(varchar(128), NULL),
      TableName = convert(varchar(128), NULL),
      ColumnName = convert(varchar(128), NULL),
      ExtraSpaces = convert(bigint, 0),
      RowCnt = convert(bigint, 0)
   INTO ' + @DestTableName + '
   WHERE 1 = 0'


SET @PreSQL = 'INSERT ' + @DestTableName + ' (SpaceDate, DatabaseName, OwnerName, TableName, ColumnName, ExtraSpaces, RowCnt)
   SpaceDate = ''' + Convert(varchar(50), @DateStamp, 121) + ''',
   DatabaseName = ''' + @DB + ''',
   OwnerName = ''' + @Owner + ''',
   TableName = ''' + Replace(@Table, '''', '''''') + ''',
   ExtraSpaces = CASE ColumnNum'
SET @MidSQL1 = '
SET @MidSQL2 = ',
         RowCnt = Count(*)
      FROM ' + Replace(@TableName, '''', '''''') + ' WITH (NOLOCK)
   ) X CROSS JOIN ('
SET @PostSQL = '
   ) Y (ColumnNum, ColumnName)'

SET @SQL1 = ''

   @SQL1 = @SQL1 + ''
      WHEN '' + Convert(varchar(11), Ordinal_Position) + '' THEN ['' + Replace(Column_Name, '''''''', '''''''''''') + '']''
      + CASE WHEN Data_type = ''nchar'' THEN ''* 2'' ELSE '''' END,
   @SQL2 = IsNull(@SQL2 + '','', '''') + ''
         ['' + Replace(Column_Name, '''''''', '''''''''''') + ''] = Sum(IsNull('' + LTrim(Str(Character_Maximum_Length)) + '' - Len(['' + Replace(Column_Name, '''''''', '''''''''''') + '']), 0))'',
   @SQL3 = IsNull(@SQL3, '''') + ''
      + Convert(varchar(11), Ordinal_Position) + '', ''
      + '''''''' + Replace(Column_Name, '''''''', '''''''''''') + ''''''''
   ' + QuoteName(@DB) + '.Information_Schema.[Columns] C
   INNER JOIN ' + QuoteName(@DB) + '.Information_Schema.Tables T ON C.Table_Name = T.Table_Name and Table_Type = ''base table''
   Data_Type in (''char'', ''nchar'')
   AND T.Table_Name = ''' + @Table + '''
   AND T.Table_Schema = ''' + @Owner + ''''

IF @Debug = 1 PRINT @NSQL

EXECUTE sp_executesql @NSQL, N'@SQL1 varchar(8000) OUTPUT, @SQL2 varchar(8000) OUTPUT, @SQL3 varchar(8000) OUTPUT', @SQL1 OUTPUT, @SQL2 OUTPUT, @SQL3 OUTPUT
IF @SQL3 IS NULL RETURN 0 --no columns found

IF @Debug = 1
   PRINT (@PreSQL + @SQL1 + @MidSQL1 + @SQL2 + @MidSQL2 + @SQL3 + @PostSQL)
   ELSE EXECUTE (@PreSQL + @SQL1 + @MidSQL1 + @SQL2 + @MidSQL2 + @SQL3 + @PostSQL)


--Compare before and after values from most recent two run dates:
   DatabaseName = IsNull(x.databasename, y.databasename),
   OwnerName = IsNull(x.ownername, y.ownername),
   TableName = IsNull(x.tablename, y.ownername),
   ColumnName = IsNull(x.columnname, y.columnname),
   ExtraSpaces1 = Sum(x.extraspaces),
   ExtraSpaces2 = Sum(y.extraspaces),
   RowCnt1 = Min(x.rowcnt), -- / count(distinct IsNull(x.databasename, y.databasename) + IsNull(x.ownername, y.ownername) + IsNull(x.tablename, y.ownername)),
   RowCnt2 = Min(y.rowcnt), --/ count(distinct IsNull(x.databasename, y.databasename) + IsNull(x.ownername, y.ownername) + IsNull(x.tablename, y.ownername))
   CollectionTime1 = Min(x.Spacedate),
   CollectionTime2 = Min(y.Spacedate)
      SELECT * FROM dbo.spacecharinfo d1 WHERE spacedate = (select max(spacedate) from dbo.spacecharinfo where spacedate < (select max(spacedate) from dbo.spacecharinfo))
   ) x
      SELECT * FROM dbo.spacecharinfo d1 WHERE spacedate = (select max(spacedate) from dbo.spacecharinfo)
   ) y ON x.databasename = y.databasename AND x.ownername = y.ownername AND x.tablename = y.tablename AND x.columnname = y.columnname
   IsNull(x.databasename, y.databasename),
   IsNull(x.ownername, y.ownername),
   IsNull(x.tablename, y.ownername),
   IsNull(x.columnname, y.columnname)
   IsNull(x.databasename, y.databasename),
   IsNull(x.ownername, y.ownername),
   IsNull(x.tablename, y.ownername),
   IsNull(x.columnname, y.columnname)

   ExtraSpaces = sum(extraspaces),
   [Rows] = sum(rowcnt),
   [Average spaces per row] = convert(float, sum(extraspaces)) / convert(float, sum(rowcnt)),
   [Average spaces per column] = convert(float, avg(extraspaces))
from spacecharinfo
group by spacedate, databasename
with rollup
having grouping(spacedate) = 0

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].[SpaceInfoCollectAllDB]') and OBJECTPROPERTY(id, N'IsProcedure') = 1)
drop procedure [dbo].[SpaceInfoCollectAllDB]
-- runs SpaceInfoCollect against all databases and tables, see that SP for more information
CREATE PROCEDURE SpaceInfoCollectAllDB(@DestTableName nvarchar(4000) = NULL)
IF @DestTableName IS NULL SET @DestTableName = QuoteName(DB_Name()) + '.' + QuoteName(User) + '.SpaceCharInfo'
   @SQL nvarchar(4000),
   @DBName nvarchar(128),
   @Dt datetime
SET @Dt = GetDate()
'SELECT ''USE '' + [name] + '' EXECUTE sp_msforeachtable ''''EXECUTE ''
   + QuoteName(DB_Name()) + ''.dbo.SpaceCharInfoCollect '''''''''' + QuoteName(Replace([name], '''''''', '''''''''''')) + ''.''
   + Char(63) + '''''''''', ''''''''YourDesiredDefaultDatabase.dbo.SpaceCharInfo'''''''', ''''''''' + Convert(varchar(50), @Dt, 121) + '''''''''''''''
FROM master.dbo.sysdatabases
WHERE [name] NOT IN (''master'', ''model'', ''tempdb'', ''pubs'', ''msdb'')'

SET @DBName = DB_Name()

EXECUTE sp_executesql N'EXECUTE master.dbo.xp_execresultset @SQL, @DBName', N'@SQL nvarchar(4000), @DBName nvarchar(128)', @SQL, @DBName
SET @SQL = 'SELECT * FROM ' + @DestTableName + ' WHERE SpaceDate = ''' + Convert(varchar(50), @Dt, 121) + ''''

Whew! Now how do you use all that? Easy.

EXECUTE SpaceInfoCollectAllDB

When you're done, take a look at the code at the end of the SpaceCharInfoCollect to see some final analysis tools for the output table. The database I ran this on had a whopping 10 GB of extra spaces, which is something like 14% of the entire database size.

Yes, the dynamic SQL is really difficult to work with. Making major changes directly in the script is nearly impossible: it's much easier to grab the section of script you want to change, paste it into a new window, and replace doubled single-quotes with single single-quotes. Which of course doesn't work for embedded values from the full outer script, but it helps immensely. Then when you have what you need, you double them back up and paste in just the parts you changed. At times you have to go two levels of dynamic SQL deep (or more) and undouble/double your quotes at each step. Use the power of search and replace to help you.

In some cases for small changes it does work to use the debug mode to output the final script and then examine or try to run THAT, and iteratively modify the dynamic SQL and see what you get.

If you find any bugs, please let me know. I haven't tested this on nchar columns, but it should multiply the space used by 2 since we care about bytes rather than characters.

P.S. There are 15 single quotes in a row in this script. Wheeeeeeeee.

Friday, April 4, 2008

Database Time Storage and Display Problems

After seeing a post about UTC time in a forum, I did some more searching and reading on the web, and found an article, UTC datetime values in SQL Server 2000.

Earlier today I had the same basic idea as the one presented in that blog, though a little more simply:

SELECT UTCDateColumn + Getdate() - Getutcdate()

In the comments on the blog entry is a function someone tried to build that would automatically adjust for daylight savings time as well. This immediately struck me as problematic, and this was borne out by further comments such as that the rules are different for Arizona. And to not be overly self-centric, what about the majority of the people in the world who do not observe DST at all? Follow the link in this paragraph to see a map of the world usage of DST.

There were additional problems. One commenter said, "Say it's currently summer and I'm in daylight savings time, my difference will include DST. Now I try to convert a date from 6 months ago, I'll be an hour out."

So here are my rambling thoughts on the subject for your amusement (at me or with me, your choice).

  1. So you want to know the time. The time where and in what time reference? Is it at the client? Is it at the server? Is it at the place the time event corresponds to like the cash machine where a withdrawal was made?

  2. If you have web clients around the world, using the database to calculate "The Time" is pointless.

  3. Even if your client location(s) only span one time zone (and can you guarantee this forever) are you sure that client will always be in the same time zone as the server?

  4. The various methods to automatically calculate the UTC differential on the server are clever but with DST are exposed for being completely inadequate, because all one is answering is, what is the current UTC differential, including DST, at the server's location? One cannot reasonably expect that to apply to all clients and for all date ranges in the database.
In my mind, the correct answer is to let the user choose his UTC differential in the client, or let the client autodiscover the differential (through a server getutcdate(), or connecting to a time server, or reading the offset out of the registry, for example).

Handling older dates that were stored during a different phase of DST is an interesting problem. Are we answering the question "what was the time then in terms of our time now" or "what was the time then in terms of the time then?" And the time now/then where?

Before you answer that seemingly easy question overquickly, think about the 1 hour period just after DST adjustment makes the time jump back and the same hour is repeated. Imagine you're reviewing security logs for a crime event. Your system reports:

  • 2:05 am - last person to exit the facility was researcher John Smith

  • 2:20 am - critical research samples verified to be secure

  • 2:35 am - later investigation pinpoints this as the time critical research samples were known to be missing
If the system is adjusting times for "what was the time then" then you can't know whether John Smith left the building before the samples were verified secure, or if he left after they were stolen, because there were in fact two segments of time that night covering 2 am - 2:59:59 am, and he could have cleverly left the building during the second hour in order to fool the computer systems.

Anyway, if you set a concern like this aside and still wish to adjust for the time things were when recorded, I can't see how this can be accomplished except on the client because only the client can know the actual location of the client and his desired DST rules. Unless you keep a comprehensive and carefully-updated catalog of the DST rules everywhere your clients can possibly be, this sounds like a challenge.

Maybe the answer is to store both dates in the database: the client time AND UTC time, or UTC time and client offset at that time. This data doesn't even have to be in the individual rows: it could be captured or updated in a UTC-offset history table for each client location every time a client connects. It has its own problems like: does the client know his location, UTC offset, and DST rules & current phase correctly? If so, can the client always be trusted to present this information accurately or could the user purposefully taint the data?

There is even more complication:

What if you are in some place that doesn't observe DST and you want to compare transactions from months ago that were recorded for two different places: your current location and a location that is in a different time zone and does observe DST.

What consistent method of adjustment do you plan to use to display all the times in the same list? Clearly they have to be adjusted the same for any sort of sensical ordering or comparison, and it makes sense to adjust them for the observer so he can view them in his own time. Yet now, adjusting the historical date of the time recorded for the observing-DST location has to be done in terms of the viewing location's DST rules, which can also lead to incorrect assumptions.

So yet again it seems that time data needs to either be stored in UTC and viewed in UTC with an adjustment for the CURRENT offset only and the user knowing this information and making allowances for it, or time data needs to be stored in UTC and ALSO the client's time and the client's location, in some format or other, so that true reconstruction of "the time" can be answered from every perspective needed: any location, observing any DST rules, during any DST phase. At historical review time, there is a huge list of possible dates to display for a single recorded UTC time:

  • The UTC time of the event (the one place where all the madness is overcome!)

  • The time of the event at the event location, using the DST offset of the event location then

  • The time of the event at the event location, using the DST offest of the event location now

  • The time of the event at the user's location, using the DST offset of the user location then

  • The time of the event at the user's location, using the DST offset of the user location now

  • The time of the event at the event location, using the DST offset of the user location then

  • The time of the event at the event location, using the DST offset of the user location now

In closing, the getdate() - getutcdate() idea might be useful in conjunction with the client's stated known offset to determine if the client is observing DST *now*, but by itself can never solve the problem of fully-sensible historical time display in any time zone observing any DST rules (that themselves change over time).