Wednesday, August 29, 2007

SQL Server 2005

I've only recently really started to get into some of the new T-SQL stuff in SQL Server 2005. I am really digging them. And of course, at the same time, I am still wanting yet even more!

Today I used ROW_NUMBER() for the first time, and it was easy. It will take a few tries to remember the syntax exactly, but it makes perfect sense. You have to be able to say when to begin counting at 1 again (PARTITION OVER) and what order to count in (ORDER BY). RANK, DENSE_RANK, and TILE_N (if I'm remembering that right) all make sense, too.

I'm looking forward to getting into the OUTPUT clause. Before I had even learned about it in SQL 2005 I wanted this functionality for INSERT, UPDATE, and DELETE. I imagined its syntax slightly differently. First, any of these statement could be made to have a rowset. I actually overlooked this part when I was originally thinking about it, but it's fairly obvious once you realize its necessity. Here's one possible way:

RETURN_ROWSET INSERT Stuff SELECT Blah FROM Gorp

Then this query could be used any place a SELECT could be used:

UPDATE H
SET Column = NewValue
FROM
   StuffHistory H
   INNER JOIN (
      RETURN_ROWSET INSERT Stuff SELECT Blah FROM Gorp
   ) S ON H.StuffID = S.StuffID

This would allow you to chain together any number of statements, causing each INSERT, UPDATE, or DELETE to return a rowset, which would either go to a client or could be used again in the next statement. I realize now that there are some issues with the syntax I envisioned, but I think it's interesting that I "invented" this on my own before learning about it from SQL 2005.

(The issues are that for an update there are two potential rowsets: the Inserted and Deleted meta-tables, so the way Microsoft implemented it makes a certain sort of sense, though I find the syntax a little weird, and I don't think you can nest these over and over again.)

Perhaps instead of nesting things this way it could be more like a common table expression, with many queries using the thing just like a table:

;WITH_ROWSET S AS (INSERT Stuff SELECT Blah FROM Gorp)
BEGIN
   UPDATE H
   SET Column = NewValue
   FROM
      StuffHistory H
      INNER JOIN S ON H.StuffID = S.StuffID

   SELECT NewValue FROM H WHERE OtherColumn = 1
END

There are also a bunch of gotchas surrounding views, triggers, remote tables, and limitations on the target of an OUTPUT INTO clause (target table can't have triggers, foreign keys, CHECK constraints, or enabled rules).

Here's one example from BOL 2005. Do you understand it on the first try? I didn't.
    If the OUTPUT clause is specified without also specifying the INTO keyword, the target of the DML operation cannot have any enabled trigger defined on it for the given DML action. For example, if the OUTPUT clause is defined in an UPDATE statement, the target table cannot have any enabled UPDATE triggers.
Anyway, that's enough for now. I'll talk another time about alternate syntax I wish one could use for UPDATEs and INSERTs.

P.S. If you are trying to insert HTML in your own blog on blogspot.com (blogger.com) stay away from the "Compose" mode and stay in "Edit Html" mode. They try to parse the HTML and fail and end up eating your stuff.

Thursday, August 23, 2007

Am I Thinking About The Right Thing?

I have thought for a long time about deliberately ordering my thought processes. It's something that I do quite frequently. I'd like to share that with people and maybe spur others to do it more, but ironically I'm finding it difficult to express the concept clearly.

I think I'll start for now with an example. If I come back to this topic later, I can build on that example and maybe come up with something that helps me communicate what I'm talking about more effectively.

Sometimes I wonder if the way I think is substantially different from the way many people think. I mean, the areas where my brain works well (or doesn't work well) may not coincide with many people's. So I'm not really sure about the validity of some of my thoughts on how one might order one's own thought processes to yield better results. For example, I like to use spatial representations to figure out problems. There are many times when working on a query that I imagine little squares (representing tables or groups or some kind of data) touching each other, splitting, moving, folding, and so on. But for someone who doesn't think well spatially, that's probably not going to work. There are other modes and ways, though. So even if you can't relate directly to what I'm talking about, maybe you can find some comparable process that does work for you, that helps you be more deliberate in your thinking process.

On the other hand, maybe I'm way out in left field. Maybe my brain is stuffed full of cheese-filled rotten rutabagas! So take what I say here with a grain of salt.

When one is doing analysis of a query, one should check each thing along the way to see if it's giving the right value. It's kind of like when you have a problem with a computer peripheral, say, a printer. There is sort of a list, roughly from lowest to highest: Is it turned on? Is the power cable plugged in? Is the network/parallel cable connected to the right location? Does it have paper? Are error codes showing, is there a paper jam, are there any obvious latches or doors that need to be closed, is there a ready mode that it needs to enter? Can you generate a test page? If it's a network printer, can you ping it? Can anyone else print to it? Can you print to it with a different OS? Can you print to it with a different program? Can you print a smaller document? Can you print a document without images in it? Can you print a document with a different font?

If you find yourself troubleshooting at a certain level and it's not working, then back up and go to the more basic questions. You might even start at the beginning. Doing this helps you check your assumptions and guides you into the next correct troubleshooting step. If the extent of the description of the problem is "I can't print," you don't know much at all. But if the problem is, "I can print from Joe's computer but not from mine," then you can be directed to the level of problem-solving you ought to be at. Since now you know that it's plugged in, turned on, ready, has paper, is connectable for Joe and is responding to print requests, what level are you at? You're ready to verify your own connectivity. Only then deal with access rights. Then drivers. Then different programs. Then things affecting printer memory.

Don't waste your time with printing from different programs until you're sure you can even connect to that thing. Maybe the network cable on your computer came unplugged—you'd just be wasting your time messing around with printing from different programs. Maybe it's a private printer connected directly to his computer with a USB or parallel cable and it's not shared or you don't have access to it—you'd be wasting your time messing around with trying to ping it or using different drivers.

Going back to databases, here's a real-world example of getting stuck too high up in the chain. Several experienced SQL people were helping a poster with a query where he was trying to show summary data from several tables at once, but he was getting sums that were multiples of the correct sums. Everything was complicated because he wanted some pivoting, but was doing it by joining to the tables multiple times instead of using case statement mini-machines to calculate the correct sums as the stream of rows passed by. (Here I go with a spatial representation again. I tell you, it really helps me.)

The reason he couldn't figure it out was that he was thinking of the query in terms of its structure after grouping instead of before. When there was more than one row to be grouped by for one table, and then he joined to another table (whether or not it had multiple rows per group), everything was getting cross-joined among the groups. He needed to step back and run his query without the group by clause, and see if the rows made sense before being collapsed into their defined buckets (there's more spatial framing again). If he'd done that, he'd have seen the cross-joining behavior and realized what the problem was. All the experts, myself included, forgot about that at first and started fiddling with the summing/aggregating stuff!

That was too high of a level. We hadn't even checked to see if the data we were trying to manipulate was in the correct form. And it wasn't.

So. Deliberately ordering one's own thought processes. The next time I am about to think about a problem, I'm going to stop thinking. Then I'm going to think about what I'm thinking about for a minute. Is this the right thing to think about? Am I making assumptions? Can I test those assumptions?

I'm going to back off from solving the problem and think about whether it's even the right problem to start with.

Tuesday, August 21, 2007

The 3 Millisecond Datetime Trick Is Dead

At times I've used my knowledge that SQL Server 2000 stores datetimes to a precision of 1/300th of a millisecond (I won't go into full details here). Here are some examples:

If @StartDate and @EndDate have already had their time portions removed (or guaranteed to have none) I might do this:

WHERE DateCol BETWEEN @StartDate AND @EndDate - '0:00:00.003'

or perhaps

WHERE DateCol BETWEEN @TheDate AND @TheDate + '23:59:59.997'

But I realized today that these tricks, though cool and interesting, and while perhaps reducing the amount of code or increasing readability, assume that the data type of DateCol will never change. That doesn't sound so bad since the only alternative is smalldatetime, right? Not for long.

In SQL Server 2008, the TIME, datetimeoffset, and datetime2 data types will have 100 nanosecond precision. It's not a bad bet that somewhere, some time, someone's going to be converting one of the datetime columns I was querying against to one of those. And then all my code written in this sort of style is going to break big-time.

And this is the kind of nasty break that doesn't bring all systems to a screeching halt (which has the benefit of notifying you that something is wrong) but lets you keep running your processes for days or weeks or months, until eventually something slips through that 3 millisecond hole or accumulates enough to be noticeable. Something critical, something important, something I should have thought about before using this kind of technique.

From now on I will be a good camper and deal with the pain of

WHERE DateCol >= @StartDate AND DateCol < @EndDate
--AND
WHERE DateCol >= @TheDate AND DateCol < (@TheDate + 1)

since they work no matter what the data type of DateCol is.

Naming Objects In Your Database

This is something I posted first in a SQL Server forum. Here it is again, touched up a bit.



So you want to name the objects in your database: your tables, your views, your functions, and anything else. And there are a lot of people out there who name them with what the object type is. Maybe even with more information. For example:

    tblOrders
    Table_Orders
    sp_OrderSend (NEVER prefix SPs with "sp_" because the server will look for a system SP first!)
    Proc_OrderSend
    OrderStatusFunc
    fn_OrderStatus
    fn_t_OrderStatus
and even worse:

    joe_sp_OrderSend
    mike_sp_OrderCancel
    david_fn_OrderStatus
Then there are column names:
    ID
    ProductID
    keyProductID
    pkeyProductID
    pkProductID
    pkProductIDCode
    frnkeyProductID
    fkProductID
So does this stuff help at all? Is it useful to put the type or relationship in the name of objects? Can you leave the name off the ID column in your tables? Should you put more information in there? What does it matter, anyway, as these are mere naming conventions, right? Just pick something, and stick with it!

I agree that consistency is important. But consistently helpful is a far improvement over consistently something-less-than-that. Consistently unnecessary is extra work and frustration. Consistently interfering impairs your ability and speed. Consistently mediocre is not the greatest.

Overall, I think it is best to take a minimalistic approach. I hope my reasons make sense to you, but whatever you do pick, you should have clear and compelling reasons for what you do. Notice the compelling part. Your default should be to avoid clutter and only add things if there is a materially demonstrable reason to violate that rule.

First of all, much of my advice is meaningless if you're using a GUI where one performs the less taxing task of recognition instead of the recall required to write queries in text. If you are writing queries in text, more power to you, and if you keep doing it I predict you'll eventually see the sense of my recommendations.

But if you're using a GUI to write queries you'll only reach a certain level of ability, never stepping into the next level of professional skill and seasoned experience in the SQL realm. I haven't used a GUI to write a serious query in a very long time (I might have started a couple in an Access database but even then switched to SQL view). I recently helped someone who posted his query that was obviously built in a GUI and it had about 10 left joins. I seriously doubt they all needed to be left joins, which means he was killing performance for no reason.

So if you are writing queries in a text editor, here are some thoughts for you:

• Lose the underscores. They clutter up the screen. They take the shift key to type and also your pinky finger which has to move two rows up from the home row. And using CamelCase is just as clear and saves space and that pinky traversing. And if you're really feeling lazy you don't even have to hit the shift key.

order_id
customer_name
device_code_reason

OrderID
CustomerName
DeviceCodeReason

Look at it this way: if you can train your eyes to scan either kind of naming convention just as efficiently, why not use the one that involves less typing and takes less space on the screen?

• The prefix/suffix naming convention is not always understood or followed. In one shop I worked in they named tables with prefixes: tbl for base tables, tlk for lookup tables, and tin for "intersect" tables. But no one really knew what they were for sure. "tin" tables were supposed to be transactional tables that had constantly changing data such as orders and transactions. "tbl" tables were supposed to be things like customers and products. But the problem was that many tables didn't fit a clear definition, and people were careless, or came along years after the creator of the database (who yes used a GUI database creation program), and they got named wrong. So now after all that silliness and extra typing, there was a "tin" table that was really a "tbl" table and a "tlk" table that was really a "tin" table after some changes, and ... so on and so forth. Just leave off that junk.

• You'll eventually become completely sick of typing the extra characters over and over again for NO GOOD REASON. That is, if you care about speed. If you don't care about speed, type your prefixes merrily. If you do get tired of them, though, you'll have these prefixes so wrapped up in your code and database that ripping them out becomes impossible. All those extra letters will come to be so much useless froofroo to you. At the same shop as tbl/tin/tlk they really named columns like keyOrderID and frnkeyOrderID. They added nine characters of typing to EVERY SINGLE JOIN I ever had to write. After a year of working there, the pain was no less. It took me about 2 weeks to learn the primary and foreign keys of that database. The other fifty weeks I wished the database designer had left well enough alone.

• Naming is something that is so much more important than most people seem to think. What does it matter what we call things, as long as we're "consistent," right? I disagree. My experience is that the naming in a database the single most important thing that structures how programmers relate to the data and that determines their final speed and facility working with it in the long run.

Mislabeling things is like putting up signs at the cliff edge that say "keep going" and "this is the right way" when it's not. Label things appropriately.

• Scanning and comprehension speed suffer when extra things are tacked on. When you look at a stored procedure or query, if you yourself weren't just working on it yesterday, then you have to digest it to understand it. The more prefix_blob_description_gorp you have in there, the harder it is to see what's going on. The assumption here is, you know your database like the back of your hand. Will that extra stuff speed you up now or slow you down? Who should you code for, the least common denominator or a reasonable level of facility? In fact, this brings up another important point:

• You do not want to give that new developer, who is unfamiliar with your database, a prematurely quick sense of competence and ability in your database. By doing this, you increase the chances that he or she will make some serious mistake. This is the opposite mistake of mislabeling. Now the path over the edge of the cliff is littered with so many friendly and distracting signals that the poor new person gets the same wrong signal to keep going when the correct response ought to be, "wait, I'm not sure I am doing the right thing." I'm all for helping new developers get going as fast as possible. But they ought to do this by studying the schema and practicing with it. Don't make it so apparently easy (but not necessarily materially easy) that even an ignorant person will believe he knows all about your database.

• The information is unnecessary and ultimately makes developers remember more stuff, not less. It sounds at first like a great way to expose information about the tables and objects. But the fact is that the period of time where a new developer isn't familiar with the database is actually very short. So for the benefit of two months of instant answers to the beginner, everyone else in the shop is suffering through unnecessary baggage.

• Don't label things with each developer's name. Why would you do that? Do all the developers write different versions of the same SPs? Are they not production SPs but simply a sort of learning environment where each can write whatever he wants so it's important to keep the names straight? In that case, why do the objects have the same owners? Just use a different owner name and don't put your name in the object. But wait, why are they even in the same database? Why does it possibly matter who wrote something in the function of the database as a whole? Things should be named according to what they are or do, not who worked on it first or last.

If I was offered a consulting job on a database that had every stored procedure and object using different names based on who had written them, I would refuse the work for fear of going absolutely nuts. If I was forced by my situation to take the work, I would every day curse the people who decided that putting their names on their stuff would somehow have any kind of long-term value and I would quit the moment I found something else. Long after Shannon and Victor and Felix are gone from the company, poor developers are toiling away trying to remember if that was a felix function or a victor view or perhaps it was a shannon function and a felix view! GAHHH!

• Spend careful and deliberate time naming things. I often use a thesaurus when designing a database. I have at times spent (cumulatively over the project) hours working out the correct naming of things. But when I was done, WOW how things flowed. I've worked with databases where the detail table wasn't, and the parent table wasn't, and the order table wasn't, and the Determination column wasn't (it should have been called DeterminationMethod, the actual determination was called something else entirely). Every single one of those is a roadblock to proper usage, even to the developer who's been working with that database for years. Make it intuitive instead of notated.

The way I do it is to never add anything to table names, and to use carefully chosen words for my functions and SPs that make it clear what they are. For example, all my SPs are either SubjectVerb or VerbSubject, consistently throughout the database (still experimenting with which I like better--I'm leaning toward SubjectVerb). Functions have words that make it clear what they do. NumberToHex is a function. OrderNumberToOrderID is a function. ProductDetailCount is a scalar function, ProductDetail is a table, and ProductCreate is a stored procedure. Develop your own pattern. I can go through my own code so fast when I've been careful like this. The meaning nearly leaps out at me.

• If you absolutely must use a prefix or a suffix (I lean toward suffixes myself because I like things to group by subject instead of function) why not use a single character instead of three? Tables can still do without any fluff at all. Then your other objects are still distinguishable:

OrderNumber_OrderIDF

(Before you criticize why one might have order numbers and order IDs, consider a system that has to handle orders from disparate sources which use overlapping order number schemes. So you either have to have a composite key--source,ordernumber--or use a global and internal order ID that can be unique per order.)

And think about that suffix part. In what possible situation would you see all your objects mixed together, and you need to ordered them by type but don't have a column that specifies type? This is the only way I can think of that a prefix with the type might be helpful. If it was actually needed at all because the objects aren't named intuitively.

• The object type can change. What if you decide to change the structure of one of your tables, and to support your older versions of software, you build the new table with a new name and convert your old table to a view? Now you have tbl_Whatever that's a view. Whoops. And what if you need to use that really important sp, and you name SPs with their creator (shudder), only you can't remember who wrote it? Now you have to look it up. What a time waster. What if you can't remember if it's an inline table function or a table-variable function? More wheels spinning. If these seem farfetched, they are not. I have worked with databases that had "tbl" objects that were views and believe me it was a nightmare remembering which was which. Who cares what it is! Views and tables are used in exactly the same way so why call them something different? If you're worried about the execution plan and query performance (and you ought to be) then you're already at a level that demands familiarity with the type of objects you're working with. If you absolutely need to notate that something is a view, put it in comments.

• Think of it this way: you're trying to build a race car--a sleek, elegant, efficient thing that can go as fast as possible and be interfaced with well. What good is a racecar that can theoretically go 200mph if the steering wheel has sharp spikes all over it and the driver has to spend extra time avoiding them? He won't be spending as much time at top speed as he ought to be.

• Name a column the same thing everywhere. Don't call it different things in different tables. Especially don't just call it "ID." Call it BlahID everywhere, including the Blah table that it originates from.

Calling all identity columns ID is certainly a kind of standardization. But it's a standardization of seriously doubtful value that makes different columns in different tables have the same name! This is now the opposite error: conflating things that one ought to be able to tell apart. If you've been trained to differentiate each thing by putting the type of it in the name, why now depart radically from that philosophy by calling all the IDs the same thing? From the beginner/new developer perspective, it seems to make things easier in some ways--no looking up column names when selecting, right? But it's not easier, it's harder. If you ever get the experience of doing it both ways in a large and busy production database (that is, ID vs. TableNameID), I suspect you'll eventually agree (unless you've become dogmatic about it by then and don't really care about the actual function or efficency any more).

I prefer to standardize things in this way: If there is an ID column, it shares the same name as the table, + ID. Now there's no confusion about what the name is. Now if you have a query with many joins and you need to use ID you're not getting confused.

• Another reason to avoid columns named "ID" is that you will eventually put the wrong alias on some join. You may not even notice because the query parses fine and you save your SP and off you go, but whoops, you got an effective cartesian product because your join condition referenced the wrong ID. Or just the wrong resultset. What a confusing mess. Much better, when you accidentally put D.ProductID instead of P.ProductID, for the compiler to complain that the table aliased by D has no such column. The alternate "D.ID" though would have compiled just fine and given the wrong results farther down the path, maybe even after release. It's much better to catch errors early.

• Selecting is not where the problem is. Yes, it's shorter to say SELECT ID FROM Table. Fewer characters. But the real problem is in JOINs. The problem is in old resultsets you saved in Excel that you don't know what table they came from because they so unhelpfully say ID. You don't know what some error message means that a user reported to you because all he remembered was the main column that the constraint violated and he says "there was a foreign key constraint in column ID!" Great. That's every table in your database.