Thursday, December 27, 2007

SQL Overelaboration

Random forum posting: a function that detects whether a string is made up only of digits:

CREATE function fn_numeric
-- input is varchar string.
   (@string varchar(12) )
returns bit -- 0 = false, 1 = true
as
begin
declare @x smallint, @y smallint, @z bit
set @string = replace(@string, ' ', '@') -- so embedded spaces fail the digit test
set @x = len(@string)
set @y = 0
set @z = 1
while @y < @x
begin
    set @y = @y + 1
    if substring(@string, @y, 1) not between '0' and '9'
        set @z = 0
end
return (@z)
end

My response:

WHERE ColumnToTest NOT LIKE '%[^0-9]%'
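The same pattern test works inline anywhere a yes/no value is needed. A sketch (note one edge case: an empty string contains no offending character, so it passes the NOT LIKE test; guard for it if that matters):

```sql
SELECT CASE WHEN @string NOT LIKE '%[^0-9]%' AND @string <> '' THEN 1 ELSE 0 END
```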

Monday, October 29, 2007

Sometimes Optional Keywords Shouldn't Be

I always use the CLUSTERED or NONCLUSTERED keywords in my constraint scripting. I used to not do it. But then I realized the following problem:

You have a table with some stuff in it.
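The setup script wasn't preserved here; a minimal sketch consistent with the rest of the post (a table Test with a column a for the primary key, and a second column b that will later carry the clustered index named in the error message):

```sql
CREATE TABLE Test (a int NOT NULL, b int NOT NULL)
INSERT Test (a, b) VALUES (1, 1)
```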


You go and add a primary key to it:

ALTER TABLE Test ADD CONSTRAINT PK_Test PRIMARY KEY (a) --<-- Clustered, right? Perhaps...

And now you have a clustered index, right? Yes. But suppose this table had different needs than most: its primary key wasn't supposed to be the clustered index, and there was already a non-primary-key clustered index. So drop the constraint you just made, make our test table correspond with the "real" table we're imagining, then re-add the constraint:
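The rearrangement script isn't shown here; a sketch of what it would be, using the index name IX_Test_b from the error message later in the post:

```sql
ALTER TABLE Test DROP CONSTRAINT PK_Test
CREATE CLUSTERED INDEX IX_Test_b ON Test (b)
```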

ALTER TABLE Test ADD CONSTRAINT PK_Test PRIMARY KEY (a) --<-- Not clustered this time.

So now your primary key constraint is NOT clustered, even though the default for primary keys IS clustered. And it was the same DDL script each time, performing something very different. So I concluded that it's best to always include the keyword CLUSTERED or NONCLUSTERED so that you and other developers know that's what was intended. Plus now you won't go your merry way thinking that you did something you didn't. Let's try to re-add that PK but using the CLUSTERED keyword (which we are expecting to happen by default without using it):
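A sketch of that attempt (drop the nonclustered PK first, then re-add it with the explicit keyword):

```sql
ALTER TABLE Test DROP CONSTRAINT PK_Test
ALTER TABLE Test ADD CONSTRAINT PK_Test PRIMARY KEY CLUSTERED (a)
```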


And now you get the information you needed to do your job properly:

Server: Msg 1902, Level 16, State 3, Line 1
Cannot create more than one clustered index on table 'Test'. Drop the existing clustered index 'IX_Test_b' before creating another.
Server: Msg 1750, Level 16, State 1, Line 1
Could not create constraint. See previous errors.

UNIQUE is another keyword that I've seen people forget to include in their NONCLUSTERED index creation (not primary keys, of course), so don't forget to put it in when you are expecting uniqueness. You might think that checking for uniqueness on every update or insert would be a performance hit, but that's not the problem: the real performance hit comes from the extra bytes the server must add to a non-unique index as a "uniquifier," bloating the index's size and reducing its speed.

I should add that having extra unused indexes is a performance hit all by itself. Here's a good discussion about that. (This guy has a WAY better blog than I will ever have. If you've gotten anything from reading my blog, then you should forget about mine and go read his. When you're done reading all 50,000 of his posts, then come back here, where there's still a teensy tiny chance remaining you'll learn something from me that you didn't from him.)

Thursday, October 25, 2007

Table Variables Use Tempdb

create table #t(a int)
select * from tempdb.dbo.sysobjects where name not like 'sys%'
drop table #t
go
declare @t table(a int)
select * from tempdb.dbo.sysobjects where name not like 'sys%'

Run the two batches and note that the table variable's object shows up in tempdb. Now remove the GO and run it again. Notice that the table variable exists in tempdb during the first select, before the DECLARE has even executed! In computer programming this is generically called hoisting.

The query engine decides how much to put in tempdb and how much to keep in memory. Some reports I've read say that the data always lives in memory, but others say that tempdb can be used for some of the data. What I'm taking away is that, while there is always the possibility that your table variable will live only in memory, it is not guaranteed. But using a table at all will always incur a performance hit. And consuming too much RAM can be as much a detriment to a server as consuming too much tempdb space.

Create a correlated subquery and put it in the SELECT list of a simple query against a many-row table that has a clustered index, and put a literal derived table in the subquery that UNION ALL SELECTS the numbers 1 through 10. It doesn't matter what the subquery does as long as it logically runs once for each row in the main query's table (it has a reference to a column from the main table).

Now create a table variable and insert the same numbers 1 through 10 and modify the correlated subquery to use the table variable instead of the UNION list. Compare execution plans and traces. Note that the first query has 10 constant scans in it, repeated as many times as there are rows, but the number of reads are close to the number of pages in the physical table.

The second query has no constant scans in it, but the number of reads is now closer to the number of rows in the physical table.
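A sketch of the two versions described above, with Orders standing in for any many-row table with a clustered index (table and column names are hypothetical):

```sql
-- Version 1: literal derived table; the plan shows ten constant scans per outer row
SELECT o.OrderID,
   (SELECT Max(n.Number)
    FROM (SELECT 1 AS Number UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
          UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
          UNION ALL SELECT 9 UNION ALL SELECT 10) n
    WHERE n.Number <= o.OrderID) AS MaxN
FROM Orders o

-- Version 2: the same numbers in a table variable; no constant scans, but more reads
DECLARE @n table (Number int)
INSERT @n
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10

SELECT o.OrderID,
   (SELECT Max(n.Number) FROM @n n WHERE n.Number <= o.OrderID) AS MaxN
FROM Orders o
```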

You cannot think of table variables as no-cost, memory only, super-fast objects with no side effects. They are not. They have side effects that can very nearly be main effects.

Monday, September 24, 2007

SQL Confusion

It is bizarre how often I find laughable and silly code in production databases that are part of software which costs millions of dollars.

I'm sure there will be more of these to come. Here's my contribution for today.

SELECT distinct
(select DATEADD(mi,-DatePart(mi, GetDate()), DATEADD(ss,-(DatePart(ss, GetDate())), GetDate()))) as orecordeddtm

WHOA! Run that statement and you'll see it's not only convoluted but it's wrong, too: it leaves the milliseconds portion of the time still on there (it's a datetime column). Here's my proposed simple and working alternative:

DateAdd(hh, DateDiff(hh, 0, GetDate()), 0)

The query is DISTINCT, which isn't a problem in this case because the expression is GetDate(). But what if the author of this used the same logic for a column? The left-in milliseconds would be messing everything up. Plus, why is the expression inside its very own select clause? Bizarre. Which of the following two statements makes more sense to YOU?

SELECT 1 + 1
SELECT (SELECT 1 + 1)

"Interesting" is about the kindest description I can come up with for this.

Thursday, September 6, 2007

Mixed-Up Statement Types

I am going to start trying to put together the basic "What you need to know to get started with SQL Server." I see the same exact problems over and over again in SQL help forums. FAQs are good, but there are dozens and dozens of them, each generally solving a specific problem. I see a need for something before that. People are getting so confused they don't even know what to ask, let alone know what kind of FAQ is going to help them with their problem.

So, one at a time, I'll make some posts about different conceptual issues I see, and then finally I'll pull it all together as the "Top 10 Concepts" for beginners. Or whatever name seems right at the time. :)

So today I'm going to start with the different types of statements in T-SQL, and particularly the difference between expressive, procedural, and manipulative statements. Other kinds of statements we won't concern ourselves with now are data definition, data control, and transaction control. People generally don't get as confused about those.

Expressive Statements

An expression consists of one or more literal values or functions tied together with operators, which when evaluated in the correct order result in a value or collection of values.

In most cases an expression returns only a single value. In computing there is a special term for a thing that is only a single value: a scalar. A composite or many-valued expression may be a list, array, or record. In SQL Server there are some cases where a rowset can be treated as a list and used in an expression.

Here are some example expressions in T-SQL:

Power(2, 10)
-- The literal values 2 and 10 are operated on by the built-in function power which returns a single value.

(18 + 23) * 7 % 5
-- The literal values 18, 23, 7, and 5 are combined with operators: addition, grouping (parentheses), multiplication, and modulus, to return a single value.

CASE ... WHEN ... <WHEN ...> ELSE ... END

These are all scalar expressions. I'll give an example of a multi-valued expression later because to use it one has to mix some of the statement types.

Let me call attention specifically to the fact that, with only one exception, all parts of an expression are evaluated. CASE expressions do support "short-circuit" logic: evaluation stops at the first WHEN condition that is true. But in general, you should think of expressions as an all-or-nothing proposition. Nothing can be done with any part of an expression until the whole thing is evaluated into a final, single scalar.
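That one short-circuit exception is genuinely useful. A sketch that uses it to guard a division (variable names are made up):

```sql
DECLARE @Amount money, @Divisor int
SET @Amount = 100
SET @Divisor = 0
-- The ELSE branch is never evaluated when @Divisor is 0, so no divide-by-zero error.
SELECT CASE WHEN @Divisor = 0 THEN NULL ELSE @Amount / @Divisor END
```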

Procedural Statements

Procedural statements are called that because there is some procedure that must be followed. It isn't a simple case of order-of-operations resulting in a single value. There is in fact no value expressed at all. Here are some examples:

IF <true-or-false expression>
WHILE <true-or-false expression>
RETURN <integer expression>

These all tell the query parser what statement it should proceed to next. Note that some of these procedural statements expect expressions immediately following. The server, while executing the statements, has to evaluate the expression in order to know what it is supposed to do. Let me call attention specifically to the fact that not every statement may be executed, and they may be executed repeatedly or out of order.
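A tiny sketch showing these working together: the WHILE decides how many times the body runs, and the IF decides whether anything prints on each pass.

```sql
DECLARE @i int
SET @i = 0
WHILE @i < 5
BEGIN
   SET @i = @i + 1
   IF @i % 2 = 0
      PRINT @i  -- only the even values are printed
END
```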

Manipulative statements

These are SELECT, INSERT, UPDATE, and DELETE. They are normally called queries (though some of them do things instead of just asking questions, as the word query implies).


The Point Of All This

Now that you know the three main kinds of statements (and I won't rule out the possibility of there being more or of there being subclassifications of these) the key concept you must know to get along well with SQL Server is that when a certain kind of statement is expected, you can't use a different one in its place.

Here are some examples of confusing one kind of statement with another.

-- Example 1
IF (IF @b = 2 THEN @i ELSE @j) = 1 PRINT 'True' ELSE PRINT 'False'

IF expects a true-false expression after it. You can't put another procedural statement there.
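The expressive way to say if-then-else is a CASE, which can sit inside the IF's condition. A sketch of Example 1 repaired:

```sql
IF (CASE WHEN @b = 2 THEN @i ELSE @j END) = 1 PRINT 'True' ELSE PRINT 'False'
```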

-- Example 2
EXECUTE CASE WHEN @Action = 'Kill' THEN MyKillingSP ELSE MyNonKillingSP END

EXECUTE isn't looking for an expression. It wants an SP name. Instead, do

IF @Action = 'Kill' EXECUTE MyKillingSP ELSE EXECUTE MyNonKillingSP

There. EXECUTE is happy now because it has an SP name immediately following in each case.

-- Example 3 - An actual example from online
      IF @LINEID = '123'
            select * FROM TABLE1
            SELECT * FROM TABLE2

This comes from inside a user-defined function. In a stored procedure, RETURN expects an integer; in a user-defined function it expects an expression, a rowset, or nothing, depending on its type. You can't put procedural statements in it. This is obviously meant to be a rowset-returning user-defined function, so it will have to be a table-variable-returning one: use procedural logic to insert the right rows into the table variable, and then simply RETURN.
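A sketch of that pattern as a multistatement table-valued function (the function name and column list are invented for illustration):

```sql
CREATE FUNCTION dbo.GetLineRows (@LineID varchar(10))
RETURNS @result TABLE (Col1 int, Col2 varchar(50))
AS
BEGIN
   -- Procedural logic fills the table variable...
   IF @LineID = '123'
      INSERT @result SELECT Col1, Col2 FROM TABLE1
   ELSE
      INSERT @result SELECT Col1, Col2 FROM TABLE2
   -- ...and then a bare RETURN hands the rowset back.
   RETURN
END
```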

-- Example 4
SELECT *
FROM MyTable
WHERE Status = (IF @Status IS NULL THEN '' ELSE @Status)

IF is a procedural statement. You can't just embed it in the middle of an expression. And notice that what comes after the condition switches back again: it must be another statement, not a bare expression like ''. What's the server supposed to do with that? The thing after the condition has to be procedural or manipulative: SET or INSERT or GOTO or PRINT. If you ran the following all by itself as its own batch, you'd get an error:

''

So '' can't be the first thing after THEN.

The way to do an expressive if-then-else is with a CASE statement. I think it's worth spending a moment to explain that it has two forms:

--Simple equivalency
CASE CaseExpression WHEN TestEquivalentExpression THEN ResultExpression <WHEN ... THEN ...> ELSE ... END

CASE @i
   WHEN 1 THEN 'One'
   WHEN 2 THEN 'Two'
   WHEN 3 THEN 'Three'
 -- WHEN NULL THEN 'Unknown'
 -- This doesn't work because "@i = NULL" will always fail.
 -- Use the next construction to test for this.
   ELSE 'Invalid'
END

-- fully expressive
CASE WHEN TrueOrFalseExpression THEN ResultExpression <WHEN ... THEN ...> ELSE ... END

CASE
   WHEN @i = 1 THEN 'One'
   WHEN @i = 2 THEN 'Two'
   WHEN @i = 3 THEN 'Three'
   WHEN @i IS NULL THEN 'Unknown'
   ELSE 'Invalid'
END

Either form can have many branches, but it resolves to a single final value. I believe (though I need confirmation on this) that the query engine expands the first syntax into the second. So be wary of using nondeterministic functions like Rand() in CASE expressions. (NullIf() definitely gets expanded to a CASE expression, so it has the same issue.)

SELECT *
FROM MyTable
WHERE Status = CASE WHEN @Status IS NULL THEN '' ELSE @Status END

This is just an example and may not be best coding practice.

Manipulative Statements As Expressions

Finally, there are some special cases where manipulative statements can be used in expressions or as expressions.

1. A manipulative statement that returns a single column and a single row can be used in most any place that an expression can be used. This is done by placing it in parentheses:

IF (
   SELECT OrderGroup FROM OrderStatuses WHERE OrderStatus = @OrderStatus
) = 2 ...

DECLARE @c int
SET @c = (SELECT Count(*) FROM MyTable)
-- (Not quite ANY place, though: PRINT (SELECT Count(*) FROM MyTable) is a syntax error.)

2. A manipulative statement that returns a single column and many rows can be used in an expressive statement immediately following certain operators, again in parentheses. Those keywords are IN, EXISTS (which actually allows any number of columns), and, in conjunction with comparison operators, ANY, SOME, and ALL:

SELECT *
FROM Orders
WHERE OrderStatus IN (SELECT OrderStatus FROM OrderStatuses WHERE OrderGroup = 2)
-- Note this is better done with an INNER JOIN.

IF EXISTS (
   SELECT *
   FROM Orders
   WHERE ExpectedShipDate >= GetDate()
      AND OrderStatus = 1
) ...
-- note that because of EXISTS, no columns are returned, and the query engine stops as soon as it finds one row meeting the criteria.

SELECT *
FROM Orders
WHERE OrderStatus > ALL (
   SELECT OrderStatus
   FROM OrderStatuses
   WHERE OrderGroup = 2
)
-- can use ANY and SOME as well as ALL

Note that some other SQL languages may allow multiple columns in some of these kinds of expressions.

That should be enough to get people started on understanding the different kinds of statements and where each one belongs.

Note: I might have mistakes or be missing some details in this blog entry. If I am, I'd appreciate you pointing it out so I can make the final version accurate and complete.

Wednesday, August 29, 2007

SQL Server 2005

I've only recently really started to get into some of the new T-SQL stuff in SQL Server 2005. I am really digging them. And of course, at the same time, I am still wanting yet even more!

Today I used ROW_NUMBER() for the first time, and it was easy. It will take a few tries to remember the syntax exactly, but it makes perfect sense. You have to be able to say when to begin counting at 1 again (PARTITION BY) and what order to count in (ORDER BY). RANK, DENSE_RANK, and NTILE all make sense, too.
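A sketch of the syntax (table and column names are hypothetical): number each customer's orders by date, restarting at 1 for each customer.

```sql
SELECT CustomerID, OrderID, OrderDate,
   ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY OrderDate) AS OrderSeq
FROM Orders
```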

I'm looking forward to getting into the OUTPUT clause. Before I had even learned about it in SQL 2005, I wanted this functionality for INSERT, UPDATE, and DELETE. I imagined its syntax slightly differently: any of these statements could be made to return a rowset. I actually overlooked this part when I was originally thinking about it, but it's fairly obvious once you realize its necessity. Here's one possible way:


Then this query could be used any place a SELECT could be used:

SET Column = NewValue
   StuffHistory H
   ) S ON H.StuffID = S.StuffID

This would allow you to chain together any number of statements, causing each INSERT, UPDATE, or DELETE to return a rowset, which would either go to a client or could be used again in the next statement. I realize now that there are some issues with the syntax I envisioned, but I think it's interesting that I "invented" this on my own before learning about it from SQL 2005.

(The issues are that for an update there are two potential rowsets: the Inserted and Deleted meta-tables, so the way Microsoft implemented it makes a certain sort of sense, though I find the syntax a little weird, and I don't think you can nest these over and over again.)
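For comparison, a sketch of the syntax Microsoft actually shipped (table and column names invented): the Inserted and Deleted meta-tables expose both rowsets at once.

```sql
UPDATE Stuff
SET Column1 = @NewValue
OUTPUT Deleted.StuffID, Deleted.Column1 AS OldValue, Inserted.Column1 AS NewValue
   INTO StuffHistory (StuffID, OldValue, NewValue)
WHERE OtherColumn = 1
```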

Perhaps instead of nesting things this way it could be more like a common table expression, with many queries using the thing just like a table:

   SET Column = NewValue
      StuffHistory H
      INNER JOIN S ON H.StuffID = S.StuffID

   SELECT NewValue FROM H WHERE OtherColumn = 1

There are also a bunch of gotchas surrounding views, triggers, remote tables, and limitations on the target of an OUTPUT INTO clause (target table can't have triggers, foreign keys, CHECK constraints, or enabled rules).

Here's one example from BOL 2005. Do you understand it on the first try? I didn't.
    If the OUTPUT clause is specified without also specifying the INTO keyword, the target of the DML operation cannot have any enabled trigger defined on it for the given DML action. For example, if the OUTPUT clause is defined in an UPDATE statement, the target table cannot have any enabled UPDATE triggers.
Anyway, that's enough for now. I'll talk another time about alternate syntax I wish one could use for UPDATEs and INSERTs.

P.S. If you are trying to insert HTML in your own blog post, stay away from the "Compose" mode and stay in "Edit Html" mode. They try to parse the HTML, fail, and end up eating your stuff.

Thursday, August 23, 2007

Am I Thinking About The Right Thing?

I have thought for a long time about deliberately ordering my thought processes. It's something that I do quite frequently. I'd like to share that with people and maybe spur others to do it more, but ironically I'm finding it difficult to express the concept clearly.

I think I'll start for now with an example. If I come back to this topic later, I can build on that example and maybe come up with something that helps me communicate what I'm talking about more effectively.

Sometimes I wonder if the way I think is substantially different from the way many people think. I mean, the areas where my brain works well (or doesn't work well) may not coincide with many people's. So I'm not really sure about the validity of some of my thoughts on how one might order one's own thought processes to yield better results. For example, I like to use spatial representations to figure out problems. There are many times when working on a query that I imagine little squares (representing tables or groups or some kind of data) touching each other, splitting, moving, folding, and so on. But for someone who doesn't think well spatially, that's probably not going to work. There are other modes and ways, though. So even if you can't relate directly to what I'm talking about, maybe you can find some comparable process that does work for you, that helps you be more deliberate in your thinking process.

On the other hand, maybe I'm way out in left field. Maybe my brain is stuffed full of cheese-filled rotten rutabagas! So take what I say here with a grain of salt.

When one is doing analysis of a query, one should check each thing along the way to see if it's giving the right value. It's kind of like when you have a problem with a computer peripheral, say, a printer. There is sort of a list, roughly from lowest to highest: Is it turned on? Is the power cable plugged in? Is the network/parallel cable connected to the right location? Does it have paper? Are error codes showing, is there a paper jam, are there any obvious latches or doors that need to be closed, is there a ready mode that it needs to enter? Can you generate a test page? If it's a network printer, can you ping it? Can anyone else print to it? Can you print to it with a different OS? Can you print to it with a different program? Can you print a smaller document? Can you print a document without images in it? Can you print a document with a different font?

If you find yourself troubleshooting at a certain level and it's not working, then back up and go to the more basic questions. You might even start at the beginning. Doing this helps you check your assumptions and guides you into the next correct troubleshooting step. If the extent of the description of the problem is "I can't print," you don't know much at all. But if the problem is, "I can print from Joe's computer but not from mine," then you can be directed to the level of problem-solving you ought to be at. Since now you know that it's plugged in, turned on, ready, has paper, is connectable for Joe and is responding to print requests, what level are you at? You're ready to verify your own connectivity. Only then deal with access rights. Then drivers. Then different programs. Then things affecting printer memory.

Don't waste your time with printing from different programs until you're sure you can even connect to that thing. Maybe the network cable on your computer came unplugged—you'd just be wasting your time messing around with printing from different programs. Maybe it's a private printer connected directly to his computer with a USB or parallel cable and it's not shared or you don't have access to it—you'd be wasting your time messing around with trying to ping it or using different drivers.

Going back to databases, here's a real-world example of getting stuck too high up in the chain. Several experienced SQL people were helping a poster with a query where he was trying to show summary data from several tables at once, but he was getting sums that were multiples of the correct sums. Everything was complicated because he wanted some pivoting, but was doing it by joining to the tables multiple times instead of using case statement mini-machines to calculate the correct sums as the stream of rows passed by. (Here I go with a spatial representation again. I tell you, it really helps me.)

The reason he couldn't figure it out was that he was thinking of the query in terms of its structure after grouping instead of before. When there was more than one row to be grouped by for one table, and then he joined to another table (whether or not it had multiple rows per group), everything was getting cross-joined among the groups. He needed to step back and run his query without the group by clause, and see if the rows made sense before being collapsed into their defined buckets (there's more spatial framing again). If he'd done that, he'd have seen the cross-joining behavior and realized what the problem was. All the experts, myself included, forgot about that at first and started fiddling with the summing/aggregating stuff!

That was too high of a level. We hadn't even checked to see if the data we were trying to manipulate was in the correct form. And it wasn't.

So. Deliberately ordering one's own thought processes. The next time I am about to think about a problem, I'm going to stop thinking. Then I'm going to think about what I'm thinking about for a minute. Is this the right thing to think about? Am I making assumptions? Can I test those assumptions?

I'm going to back off from solving the problem and think about whether it's even the right problem to start with.

Tuesday, August 21, 2007

The 3 Millisecond Datetime Trick Is Dead

At times I've used my knowledge that SQL Server 2000 stores datetimes to a precision of 1/300th of a second (I won't go into full details here). Here are some examples:

If @StartDate and @EndDate have already had their time portions removed (or guaranteed to have none) I might do this:

WHERE DateCol BETWEEN @StartDate AND @EndDate - '0:00:00.003'

or perhaps

WHERE DateCol BETWEEN @TheDate AND @TheDate + '23:59:59.997'

But I realized today that these tricks, though cool and interesting, and while perhaps reducing the amount of code or increasing readability, assume that the data type of DateCol will never change. That doesn't sound so bad since the only alternative is smalldatetime, right? Not for long.

In SQL Server 2008, the TIME, datetimeoffset, and datetime2 data types will have 100 nanosecond precision. It's not a bad bet that somewhere, some time, someone's going to be converting one of the datetime columns I was querying against to one of those. And then all my code written in this sort of style is going to break big-time.

And this is the kind of nasty break that doesn't bring all systems to a screeching halt (which has the benefit of notifying you that something is wrong) but lets you keep running your processes for days or weeks or months, until eventually something slips through that 3 millisecond hole or accumulates enough to be noticeable. Something critical, something important, something I should have thought about before using this kind of technique.
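The 1/300th-of-a-second granularity is easy to see for yourself: datetime rounds milliseconds to the nearest .000, .003, or .007, so .999 rounds up, here all the way into the next year.

```sql
SELECT CAST('20071231 23:59:59.999' AS datetime) -- rounds up to 2008-01-01 00:00:00.000
SELECT CAST('20071231 23:59:59.998' AS datetime) -- rounds down to 23:59:59.997
```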

From now on I will be a good camper and deal with the pain of

WHERE DateCol >= @StartDate AND DateCol < @EndDate
WHERE DateCol >= @TheDate AND DateCol < (@TheDate + 1)

since they work no matter what the data type of DateCol is.

Naming Objects In Your Database

This is something I posted first in a SQL Server forum. Here it is again, touched up a bit.

So you want to name the objects in your database: your tables, your views, your functions, and anything else. And there are a lot of people out there who name them with what the object type is. Maybe even with more information. For example:

    tblOrder (prefixing the object type)
    sp_OrderSend (NEVER prefix SPs with "sp_" because the server will look for a system SP first!)

and even worse:

    tinOrder and tlkOrderStatus (encoding the table's supposed role, not just its type)

Then there are column names:

    keyOrderID and frnkeyOrderID

So does this stuff help at all? Is it useful to put the type or relationship in the name of objects? Can you leave the table name off the ID column in your tables? Should you put more information in there? What does it matter, anyway, as these are mere naming conventions, right? Just pick something, and stick with it!

I agree that consistency is important. But consistently helpful is a far improvement over consistently something-less-than-that. Consistently unnecessary is extra work and frustration. Consistently interfering impairs your ability and speed. Consistently mediocre is not the greatest.

Overall, I think it is best to take a minimalistic approach. I hope my reasons make sense to you, but whatever you do pick, you should have clear and compelling reasons for what you do. Notice the compelling part. Your default should be to avoid clutter and only add things if there is a materially demonstrable reason to violate that rule.

First of all, much of my advice is meaningless if you're using a GUI where one performs the less taxing task of recognition instead of the recall required to write queries in text. If you are writing queries in text, more power to you, and if you keep doing it I predict you'll eventually see the sense of my recommendations.

But if you're using a GUI to write queries you'll only reach a certain level of ability, never stepping into the next level of professional skill and seasoned experience in the SQL realm. I haven't used a GUI to write a serious query in a very long time (I might have started a couple in an Access database but even then switched to SQL view). I recently helped someone who posted his query that was obviously built in a GUI and it had about 10 left joins. I seriously doubt they all needed to be left joins, which means he was killing performance for no reason.

So if you are writing queries in a text editor, here are some thoughts for you:

• Lose the underscores. They clutter up the screen. They take the shift key to type, plus your pinky finger has to move two rows up from the home row. CamelCase is just as clear, saves space, and spares the pinky traversing. And if you're really feeling lazy you don't even have to hit the shift key. Compare:

Customer_Order_Line_Item
CustomerOrderLineItem

Look at it this way: if you can train your eyes to scan either kind of naming convention just as efficiently, why not use the one that involves less typing and takes less space on the screen?

• The prefix/suffix naming convention is not always understood or followed. In one shop I worked in they named tables with prefixes: tbl for base tables, tlk for lookup tables, and tin for "intersect" tables. But no one really knew what they were for sure. "tin" tables were supposed to be transactional tables that had constantly changing data such as orders and transactions. "tbl" tables were supposed to be things like customers and products. But the problem was that many tables didn't fit a clear definition, and people were careless, or came along years after the creator of the database (who yes used a GUI database creation program), and they got named wrong. So now after all that silliness and extra typing, there was a "tin" table that was really a "tbl" table and a "tlk" table that was really a "tin" table after some changes, and ... so on and so forth. Just leave off that junk.

• You'll eventually become completely sick of typing the extra characters over and over again for NO GOOD REASON. That is, if you care about speed. If you don't care about speed, type your prefixes merrily. If you do get tired of them, though, you'll have these prefixes so wrapped up in your code and database that ripping them out becomes impossible. All those extra letters will come to be so much useless froofroo to you. At the same shop as tbl/tin/tlk they really named columns like keyOrderID and frnkeyOrderID. They added nine characters of typing to EVERY SINGLE JOIN I ever had to write. After a year of working there, the pain was no less. It took me about 2 weeks to learn the primary and foreign keys of that database. The other fifty weeks I wished the database designer had left well enough alone.

• Naming is something that is so much more important than most people seem to think. What does it matter what we call things, as long as we're "consistent," right? I disagree. My experience is that the naming in a database is the single most important thing structuring how programmers relate to the data, and it determines their final speed and facility working with it in the long run.

Mislabeling things is like putting up signs at the cliff edge that say "keep going" and "this is the right way" when it's not. Label things appropriately.

• Scanning and comprehension speed suffer when extra things are tacked on. When you look at a stored procedure or query, if you yourself weren't just working on it yesterday, then you have to digest it to understand it. The more prefix_blob_description_gorp you have in there, the harder it is to see what's going on. The assumption here is, you know your database like the back of your hand. Will that extra stuff speed you up now or slow you down? Who should you code for, the least common denominator or a reasonable level of facility? In fact, this brings up another important point:

• You do not want to give that new developer, who is unfamiliar with your database, a prematurely quick sense of competence and ability in your database. By doing this, you increase the chances that he or she will make some serious mistake. This is the opposite mistake of mislabeling. Now the path over the edge of the cliff is littered with so many friendly and distracting signals that the poor new person gets the same wrong signal to keep going when the correct response ought to be, "wait, I'm not sure I am doing the right thing." I'm all for helping new developers get going as fast as possible. But they ought to do this by studying the schema and practicing with it. Don't make it so apparently easy (but not necessarily materially easy) that even an ignorant person will believe he knows all about your database.

• The information is unnecessary and ultimately makes developers remember more stuff, not less. It sounds at first like a great way to expose information about the tables and objects. But the fact is that the period of time where a new developer isn't familiar with the database is actually very short. So for the benefit of two months of instant answers to the beginner, everyone else in the shop is suffering through unnecessary baggage.

• Don't label things with each developer's name. Why would you do that? Do all the developers write different versions of the same SPs? Are they not production SPs but simply a sort of learning environment where each developer can write whatever he wants, so it's important to keep the names straight? In that case, why do the objects have the same owner? Just use a different owner name and keep your name out of the object. But wait, why are they even in the same database? Why does it possibly matter, to the function of the database as a whole, who wrote something? Things should be named according to what they are or do, not who worked on them first or last.

If I were offered a consulting job on a database where every stored procedure and object was named after whoever had written it, I would refuse the work for fear of going absolutely nuts. If I were forced by my situation to take the work, I would curse, every day, the people who decided that putting their names on their stuff would somehow have long-term value, and I would quit the moment I found something else. Long after Shannon and Victor and Felix are gone from the company, poor developers are toiling away trying to remember if that was a felix function or a victor view, or perhaps it was a shannon function and a felix view! GAHHH!

• Spend careful and deliberate time naming things. I often use a thesaurus when designing a database. I have at times spent (cumulatively over the project) hours working out the correct naming of things. But when I was done, WOW how things flowed. I've worked with databases where the detail table wasn't, and the parent table wasn't, and the order table wasn't, and the Determination column wasn't (it should have been called DeterminationMethod, the actual determination was called something else entirely). Every single one of those is a roadblock to proper usage, even to the developer who's been working with that database for years. Make it intuitive instead of notated.

The way I do it is to never add anything to table names, and to use carefully chosen words for my functions and SPs that make it clear what they are. For example, all my SPs are either SubjectVerb or VerbSubject, consistently throughout the database (still experimenting with which I like better--I'm leaning toward SubjectVerb). Functions have words that make it clear what they do. NumberToHex is a function. OrderNumberToOrderID is a function. ProductDetailCount is a scalar function, ProductDetail is a table, and ProductCreate is a stored procedure. Develop your own pattern. I can go through my own code so fast when I've been careful like this. The meaning nearly leaps out at me.
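As a hedged sketch of that pattern (the Product schema and object bodies here are my own illustration, not taken from any real database), the names alone tell you what each object is:

```sql
-- Illustrative only: suffix-free tables, a scalar function, and a
-- SubjectVerb stored procedure, distinguished purely by careful naming.
CREATE TABLE Product (
    ProductID int IDENTITY PRIMARY KEY,
    Name      varchar(50) NOT NULL
);
GO
CREATE TABLE ProductDetail (
    ProductDetailID int IDENTITY PRIMARY KEY,
    ProductID       int NOT NULL REFERENCES Product (ProductID)
);
GO
-- ProductDetailCount: obviously a scalar function, from the name alone.
CREATE FUNCTION ProductDetailCount (@ProductID int)
RETURNS int
AS
BEGIN
    RETURN (SELECT COUNT(*) FROM ProductDetail WHERE ProductID = @ProductID);
END;
GO
-- ProductCreate: SubjectVerb, obviously a procedure that makes a Product.
CREATE PROCEDURE ProductCreate @Name varchar(50)
AS
    INSERT Product (Name) VALUES (@Name);
```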

• If you absolutely must use a prefix or a suffix (I lean toward suffixes myself because I like things to group by subject instead of function) why not use a single character instead of three? Tables can still do without any fluff at all. Then your other objects are still distinguishable:
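For instance (the suffix letters below are my own hypothetical choice, not a prescription):

```sql
-- Tables carry no decoration at all; other objects get one trailing letter:
--   OrderDetail            -- table: nothing added
--   OrderSummaryV          -- view
--   OrderNumberToOrderIDF  -- scalar function
--   OrderCreateP           -- stored procedure
CREATE VIEW OrderSummaryV
AS
    SELECT OrderID, OrderNumber FROM [Order];
```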


(Before you criticize why one might have order numbers and order IDs, consider a system that has to handle orders from disparate sources which use overlapping order number schemes. So you either have to have a composite key--source,ordernumber--or use a global and internal order ID that can be unique per order.)
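A hedged sketch of that design (column names invented for illustration):

```sql
-- Order numbers from disparate sources can collide, so the natural key is
-- composite (source + number), while OrderID is the internal surrogate key.
CREATE TABLE [Order] (
    OrderID     int IDENTITY PRIMARY KEY,   -- globally unique, internal
    SourceID    int NOT NULL,               -- which external system sent it
    OrderNumber varchar(20) NOT NULL,       -- unique only within one source
    CONSTRAINT UQ_Order_SourceNumber UNIQUE (SourceID, OrderNumber)
);
```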

And think about that suffix part. In what possible situation would you see all your objects mixed together and need to sort them by type, without having a column that specifies the type? That is the only way I can think of that a type prefix might be helpful, and even then only if the objects aren't already named intuitively.

• The object type can change. What if you decide to change the structure of one of your tables, and to support older versions of your software you build the new table under a new name and convert the old table to a view? Now you have tbl_Whatever that's a view. Whoops. And what if you need that really important sp, and you name SPs after their creator (shudder), only you can't remember who wrote it? Now you have to look it up. What a time waster. What if you can't remember whether it's an inline table-valued function or a multi-statement table-valued function? More wheels spinning. If these seem farfetched, they are not. I have worked with databases that had "tbl" objects that were views, and believe me, it was a nightmare remembering which was which. Who cares what it is! Views and tables are used in exactly the same way, so why call them something different? If you're worried about the execution plan and query performance (and you ought to be), then you're already at a level that demands familiarity with the objects you're working with. If you absolutely need to note that something is a view, put it in a comment.
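A sketch of how a "tbl" view comes about (names invented; assumes SQL Server's sp_rename):

```sql
-- The restructured table gets a new name...
EXEC sp_rename 'tbl_Customer', 'Customer2';
GO
-- ...and the old name lives on as a compatibility view for old software.
CREATE VIEW tbl_Customer
AS
    SELECT CustomerID, CustomerName FROM Customer2;
-- From now on, tbl_Customer's prefix lies about what it is.
```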

• Think of it this way: you're trying to build a race car--a sleek, elegant, efficient thing that can go as fast as possible and can be interfaced with cleanly. What good is a race car that can theoretically go 200mph if the steering wheel has sharp spikes all over it and the driver has to spend extra time avoiding them? He won't spend as much time at top speed as he ought to.

• Name a column the same thing everywhere. Don't call it different things in different tables. Especially don't just call it "ID." Call it BlahID everywhere, including the Blah table that it originates from.

Calling all identity columns ID is certainly a kind of standardization. But it's a standardization of seriously doubtful value, because it makes different columns in different tables share the same name! This is the opposite error: conflating things that one ought to be able to tell apart. If you've been trained to differentiate each object by putting its type in the name, why now depart radically from that philosophy by calling all the IDs the same thing? From the beginner/new-developer perspective it seems to make things easier in some ways--no looking up column names when selecting, right? But it's not easier, it's harder. If you ever get the experience of doing it both ways in a large and busy production database (that is, ID vs. TableNameID), I suspect you'll eventually agree (unless you've become dogmatic about it by then and don't really care about actual function or efficiency anymore).

I prefer to standardize this way: if a table has an ID column, it's named after the table, plus ID. Now there's no confusion about what the name is, and in a query with many joins you always know exactly which ID you're working with.
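In a join (a hypothetical Product/ProductDetail pair), the key then reads the same on both sides:

```sql
-- ProductID keeps its name in every table it appears in,
-- so the join condition documents itself.
SELECT P.Name, D.ProductDetailID
FROM Product P
JOIN ProductDetail D ON D.ProductID = P.ProductID;
```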

• Another reason to avoid columns named "ID" is that you will eventually put the wrong alias on some join. You may not even notice, because the query parses fine and you save your SP and off you go--but whoops, you got an effective Cartesian product because your join condition referenced the wrong ID. Or just the wrong resultset. What a confusing mess. Much better, when you accidentally put D.ProductID instead of P.ProductID, for the compiler to complain that the table aliased by D has no such column. The alternative "D.ID", though, would have compiled just fine and given wrong results farther down the path, maybe even after release. It's much better to catch errors early.
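A sketch of the failure mode (schema hypothetical):

```sql
-- With every key named ID, this typo parses and runs:
SELECT *
FROM Product P
JOIN ProductDetail D ON P.ID = P.ID;   -- meant D.ID; always true, so you get
                                       -- every detail row against every product

-- With TableNameID naming, pointing an alias at a column the aliased table
-- lacks (say, D.CustomerID when ProductDetail has no CustomerID column)
-- fails at compile time, so the mistake surfaces immediately.
```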

• Selecting is not where the problem is. Yes, SELECT ID FROM Table is shorter. Fewer characters. But the real problems are in JOINs; in old resultsets saved to Excel that you can't trace back to any table because the column so unhelpfully says ID; and in the error message a user reports when all he remembered was the column in the violated constraint: "there was a foreign key constraint on column ID!" Great. That's every table in your database.