Saturday, December 27, 2008

SQL joins

A relational database is so named because the data it stores relates to each other. In SQL, the relationships between data is expressed in a query by what is known as a join.

There are a number of different types of joins. Most joins can be either defined as an implicit join or an explicit join. An implicit join is one where there is no explicit JOIN statement yet there is a join implicit in the WHERE clause of the query. An explicit join is one where there is actually a JOIN statement in the FROM clause of the query.

An explicit join is also known as a qualified join, and uses the syntax

SELECT <columns>
FROM <table>
<join type> JOIN <table>
ON <join condition>


Cross joins

Given two tables, each row in the inner table joins each row of the outer table. I can think of almost no examples where you'd want to use this, but the implicit join syntax is:

SELECT * FROM
OrgUnit, Person

The explicit join syntax is:

SELECT * FROM OrgUnit
CROSS JOIN Person

A cross join is also known as a Cartesian product or a Cartesian join. I have never needed to use it! Note that this is the only join that does not use the ON syntax, as it's not needed and if you needed to restrict the data set then you'd need to filter in the WHERE clause of the query. Of course, if your filter does a comparison against two related columns of the datasets, then you have an INNER join, described below.

Inner joins

An "inner" join is named because given two tables, the query will return each row in the inner table that has a match in the outer table.

To show an example of this, I'll have another look at my two tables I introduced in the cross join section. They are OrgUnit and Person.




























OrgUnit
OrgUnitID OrgUnitName
1Manufacturing
2Accounting
3Human Resources
4Tiger team
5IT
6Consultancy
7Special projects






































Person
PersonID FirstName LastName PhoneNumber Address OrgUnitID
1 Chris Sasquatch 03 9998 7656 5 Curumbah St, Gladstone, VIC, 3145 5
2 Jill Anderson 03 9765 3975 23 Sesqatonia Rd, Smallville, VIC, 3087 6
3 Robert Manaheim 03 7659 8765 18a Xyz St, Whoknowswhere, VIC, 3009 2
4 Mysterious Guy 02 8765 4321 78 Unknown St, NSW, 2000 1
5 Captain Obvious 02 1111 1111 54 Duh Lane, NSW, 2119 4
6 Mickey Mouse 07 7654 6665 1 Disney Rd, QLD, 7654 7
7 Donald Duck 07 7654 6665 5 Drewery Lane, WA, 8097 NULL
8 Daffy Duck 08 7654 3214 8 St Kilda Street, WA, 8008 7
9 Bugs Bunny 08 7654 3214 1004 Killawee Rd, NT, 6543 7



If I wanted to find out all the Organization Units that each invididual resides in, I would run the following explicit query:

SELECT
p.FirstName,
p.LastName,
p.Address,
p.PhoneNumber,
o.OrgUnitName

FROM Person p
INNER JOIN OrgUnit
ON p.OrgUnitID=o.OrgUnitID

The implicit query is:

SELECT
p.FirstName,
p.LastName,
p.Address,
p.PhoneNumber,
o.OrgUnitName

FROM Person p, OrgUnit o
WHERE p.OrgUnitID=o.OrgUnitID


This gives the following results:































FirstName LastName PhoneNumber Address OrgUnitName
Chris Sasquatch 03 9998 7656 5 Curumbah St, Gladstone, VIC, 3145 IT
Jill Anderson 03 9765 3975 23 Sesqatonia Rd, Smallville, VIC, 3087 Consultancy
Robert Manaheim 03 7659 8765 18a Xyz St, Whoknowswhere, VIC, 3009 Accounting
Mysterious Guy 02 8765 4321 78 Unknown St, NSW, 2000 Manufacturing
Captain Obvious 02 1111 1111 54 Duh Lane, NSW, 2119 Tiger team
Mickey Mouse 07 7654 6665 1 Disney Rd, QLD, 7654 Special projects
Daffy Duck 08 7654 3214 8 St Kilda Street, WA, 8008 Special projects
Bugs Bunny 08 7654 3214 1004 Killawee Rd, NT, 6543 Special projects



Outer joins

An "outer" join is so named because given two tables, each row in the outer table is returned even if it does not meet the ON condition. There is no implicit join syntax for this sort of join - you must use the ANSI syntax.

There are a number of forms of outer joins: left outer join, right outer join and full outer join. Let's explore each of these.

Left outer joins

A left outer join returns the "left" rows of an outer join regardless of whether they met the ON condition or not. If the right hand table does not meet the join condition, then a NULL is displayed.

Let's say that I wanted to find all the people in my database and if they are part of an organization unit. I would do this with the following query:

SELECT
p.FirstName,
p.LastName,
p.Address,
p.PhoneNumber,
o.OrgUnitName

FROM Person p
LEFT OUTER JOIN OrgUnit
ON p.OrgUnitID=o.OrgUnitID

This would return the following results:

































FirstName LastName PhoneNumber Address OrgUnitName
Chris Sasquatch 03 9998 7656 5 Curumbah St, Gladstone, VIC, 3145 IT
Jill Anderson 03 9765 3975 23 Sesqatonia Rd, Smallville, VIC, 3087 Consultancy
Robert Manaheim 03 7659 8765 18a Xyz St, Whoknowswhere, VIC, 3009 Accounting
Mysterious Guy 02 8765 4321 78 Unknown St, NSW, 2000 Manufacturing
Captain Obvious 02 1111 1111 54 Duh Lane, NSW, 2119 Tiger team
Mickey Mouse 07 7654 6665 1 Disney Rd, QLD, 7654 Special projects
Daffy Duck 08 7654 3214 8 St Kilda Street, WA, 8008 Special projects
Bugs Bunny 08 7654 3214 1004 Killawee Rd, NT, 6543 Special projects
Donald Duck 07 7654 6665 5 Drewery Lane, WA, 8097 NULL






I have highlighted in bold red the row that is returned in the left outer join that was not returned in the inner join. As you can see, Donald Duck exists in the left hand table, but is not part of a organization unit. With the left outer join he is still returned in the output, however as he is not part of any org unit it displays a NULL value.

Right outer joins

A right outer join is the counterpart of the left outer join. It returns the "right" rows of an outer join regardless of whether they met the ON condition or not. If the left hand table does not meet the join condition, then a NULL is displayed.

Let's say that I wanted to find all org units and if they have any members show who they are. I would run the following query:

SELECT
p.FirstName, p.LastName, o.OrgUnitName
FROM Person p
RIGHT OUTER JOIN OrgUnit
ON p.OrgUnitID=o.OrgUnitID

This returns the following results:




































FirstName LastName OrgUnitName
Mysterious Guy Manufacturing
Robert Manaheim Accounting
NULL NULL Human Resources
Captain Obvious Tiger team
Chris Sasquatch IT
Jill Anderson Consultancy
Mickey Mouse Special projects
Daffy Duck Special projects
Bugs Bunny Special projects





Again I have highlighted in bold red the important row that is returned by the right outer join. The human resources org unit of this company has nobody in it (silly rabbits!) but we still want to know all the org units of the company. Thus as the table OrgUnits is on the right hand side of the join we return the organization but NULLs for the left hand side table rows.

Full outer joins

A full outer join allows us to display each row in both left and right tables. If there is a row in the right hand table that does not match the ON join condition, then it displays the right hand table row and NULL values for the left hand table. If there is a row in the left hand table that does not match the ON join condition, then it displays the left hand table row and NULL values for the right hand table.

This may seem a bit like a cross join, but it is not. A cross join matches every row for the left and right hand tables regardless of the relationship between the tables. The full outer join maintains the relationships between the tables, it's just if either side does not have a matching row then it displays the row on the right/left hand side regardless.

If I wanted to see all the OrgUnits and each member of the OrgUnit, but I also wanted to see all the people in the database regardless of whether they belong to an OrgUnit, I would run the following query:

SELECT
o.OrgUnitName,
p.FirstName,
p.LastName,
p.PhoneNumber,
p.Address,
o.OrgUnitName

FROM Person p
FULL OUTER JOIN OrgUnit
ON p.OrgUnitID=o.OrgUnitID




































FirstName LastName PhoneNumber Address OrgUnitName
Chris Sasquatch 03 9998 7656 5 Curumbah St, Gladstone, VIC, 3145 IT
Jill Anderson 03 9765 3975 23 Sesqatonia Rd, Smallville, VIC, 3087 Consultancy
NULL NULL NULL NULL Human Resources
Robert Manaheim 03 7659 8765 18a Xyz St, Whoknowswhere, VIC, 3009 Accounting
Mysterious Guy 02 8765 4321 78 Unknown St, NSW, 2000 Manufacturing
Captain Obvious 02 1111 1111 54 Duh Lane, NSW, 2119 Tiger team
Mickey Mouse 07 7654 6665 1 Disney Rd, QLD, 7654 Special projects
Daffy Duck 08 7654 3214 8 St Kilda Street, WA, 8008 Special projects
Bugs Bunny 08 7654 3214 1004 Killawee Rd, NT, 6543 Special projects
Donald Duck 07 7654 6665 5 Drewery Lane, WA, 8097 NULL




Semi and anti-semi joins

The last sort of joins I want to talk about are semi-joins and anti-semi joins.

A semi-join will return rows from a table when there is one or more corresponding rows in the joining table. Interestingly, there isn't actually any explicit syntax for semi-joins, instead you must use the EXISTS statement.

For instance, let's say that I want to find all OrgUnits that have at least one Person. We could do the following:

SELECT o.OrgUnitName
FROM OrgUnit o
INNER JOIN Person p

However, this will return more than one record for the OrgUnit "Special projects".

Instead, a semi-join will do this for you. The syntax is:

SELECT o.OrgUnitID
FROM OrgUnit AS o
WHERE EXISTS
(
SELECT p.OrgUnitID FROM Person AS p

WHERE p.OrgUnitID = o.OrgUnitID

)

You can also do this by using an IN statement:

SELECT o.OrgUnitName
FROM OrgUnit AS o
WHERE o.OrgUnitID IN
(
SELECT p.OrgUnitID FROM Person AS p
WHERE p.OrgUnitID = o.OrgUnitID
)

An anti-semi join however, does the exact opposite. It will find any rows that do not have at least one matching row in the second table. To do this, use the NOT EXISTS statement.

Going back to our previous example, if we wanted to find all the departments that do not have any members, we would do the following:

SELECT o.OrgUnitID
FROM OrgUnit AS o
WHERE NOT EXISTS
(
SELECT p.OrgUnitID FROM Person AS p

WHERE p.OrgUnitID = o.OrgUnitID

)

With the NOT IN statement it would be:

SELECT o.OrgUnitName
FROM OrgUnit AS o
WHERE o.OrgUnitID NOT IN
(
SELECT p.OrgUnitID FROM Person AS p
WHERE p.OrgUnitID = o.OrgUnitID
)

Note 1: I picked this IN trick up at the following blog.
Note 2: I asked over at SQL Server Central whether they thought it would be a good idea to add a TOP 1 clause to the query. i.e. the correlated subquery becomes:

SELECT TOP 1 p.OrgUnitID FROM Person AS p
WHERE p.OrgUnitID = o.OrgUnitID

However, it was politely suggested I test myself. Fair enough I suppose :-)

Another way of doing a FULL OUTER join

Interestingly, we can get a FULL OUTER join by doing a left outer join of Table1 to Table2, then do a UNION ALL of a left anti-semi join of Table2 to Table1. Or in other words:

Table1 LEFT OUTER JOIN Table2
UNION ALL
Table2 LEFT ANTI-SEMI JOIN Table1

I picked this up from the book Inside Microsoft SQL Server 2005: Query Tuning and Optimization. It's useful because a nested loop can't do a FULL OUTER join, except that you can transform the FULL OUTER join into joins that the nested loop can handle. Which I thought was quite neat, in a totally geeky way of course :-)

Friday, December 26, 2008

Thursday, December 25, 2008

C warning flags

A short blog post today. Just found an excellent post explaining the different gcc warning flags and what they do, by GNOME hacker Benjamin Otte.

Tuesday, December 23, 2008

Query tuning

I think that I've found the best blog post on SQL tuning I've ever read.

It's by Jeff Moden, and it's entitled More RBAR and "Tuning" UPDATEs.

I had a similar issue the other day, where I had a query that was taking almost 7 minutes to retrieve 6,000 records. For some reason it was doing a UNION of two SELECT statements, yet the second SELECT statement was the same as the first - so I thought that this was what was causing the issue.

And sure, when we took out the second SELECT statement the query reduced to about 3 and a half minutes.

Yet that was absolutely still too long. When I looked at the statement a bit further, I noticed that it was doing a LEFT OUTER join against a view that used a count function, a correlated subquery and a CASE statement, and not only that but when the actual SELECT statement found a NULL yet another CASE statement was used to convert the NULL into a value. The query plan was hideous - for 6,000 records in the final query output the LEFT OUTER join to the view was bringing back 1.6 million rows.

Hello RBAR!

So I did exactly what Jeff did - read the view and the query to work out what it was trying to do, then I rewrote the view to be more sensible. All of a sudden I went from about 3.5 minutes to run the query to 7 seconds.

So lesson learned - RBAR bad, set-based good.

Friday, December 12, 2008

Memory in Windows

During memory performance troubleshooting on Windows servers, I've often had to go to the task manager - and at times I've needed to use the Performance Monitor. The biggest problem I had when I first used these tools was understanding the different types of memory counters/columns in these tools. It has taken me a reasonable amount of research to understand the difference between such counters as virtual bytes, private bytes and working set.

In case anyone else has wondered what the different memory counters are in the Performance Monitor, or have been confused by the column labels in the task manager, the following is a brief summary.

Virtual memory


Before I explain, a quick explanation is probably due as to the meaning of the term "Virtual Memory". Probably the best explanation I have ever read is a 1992 Microsoft technical article entitled The Virtual-Memory Manager in Windows NT. However, in a nutshell, Windows uses a 32-bit contiguous virtual address space which can address up to 4 GB of memory. Don't get too excited about this, because by default 2GB of virtual address space is reserved for the operating system, though this can be changed to just 1GB with the /3GB bootup switch.

Each virtual address is actually just a placeholder to a physical location of memory in the machine, and when accessing memory Windows must jump through a few hoops to translate the virtual address to the physical address. In essence it must look up a 3 level btree to get to the address, first looking up a page directory, which refers to an entry in a page table, which in turn refers to a page frame. Note that for faster memory access Windows (and in fact all modern operating systems) relies on the processor's translatable lookaside buffer (TLB) to gain direct access to the memory.

The interesting thing about virtual memory is that each page table entry may either refer to an area of physical memory or to a pagefile on disk. This allows Windows to address more memory than is physically available via the system's RAM. When a program looks up an area of memory that is backed by a pagefile, the process is suspended by the operating system and the data is retrieved from the disk and placed into physical memory. This is known as a page fault. When the page has been retrieved from the disk and into memory, the operating system unsuspends and returns control back to the process.

This is often a point of confusion for people, the best example of which was recently on Slashdot under an "Ask Slashdot" question "Why Use Virtual Memory in Modern Systems?":
"I have a system with Windows Vista Ultimate (64-bit) installed on it, and it has 4GB of RAM. However when I've been watching system performance, my system seems to divide the work between the physical RAM and the virtual memory, so I have 2GB of data in the virtual memory and another 2GB in the physical memory. Is there a reason why my system should even be using the virtual memory anymore? I would think the computer would run better if it based everything off of RAM instead of virtual memory. Any thoughts on this matter or could you explain why the system is acting this way?"
The confusion here is that the poster is asking quite a different question, which is "why use a pagefile in modern systems?", which I think for workstations is actually a fairly reasonable question!

I also find it interesting that in Windows Vista Microsoft have tried to address this in their description of the virtual bytes counter of Performance Monitor, as they now state in the counter's description that "Use of virtual address space does not necessarily imply corresponding use of either disk or main memory pages." Obviously Microsoft tech support got heartily sick of having to explain this concept to the many people who misunderstand it and call them about it!

Performance monitor

The performance monitor (perfmon.exe, I'll call it perfmon from now on) is a tool in all versions of Windows since Windows 2000 that can access the various counters that are built into Windows, as well as other third party counters for products such as SQL Server.

If you had a process that you believed was leaking memory, then you can run perfmon to find out if this is the case. You would simply create a new log, then you'd add a number of process counters for the process you feel is causing the problem. Alternatively, if you didn't know what process was causing issues you could just add the process object to capture all counters for every process, of course you would then need to remove hundreds of counters when you loaded the file... but sometimes you've just got to do these sort of things when it's not clear what's going on. Painful, but true. :-)

The problem, then, becomes a matter of knowing what counters to actually look at. The following are the important ones:
  • Virtual bytes are the number of bytes of virtual memory that the process is using in total

  • Private bytes are the number of bytes of virtual memory that is being used by the process alone and that it can't share with other processes. We must distinguish between this counter and the virtual bytes counter because the process can actually load memory-mapped files, EXEs or DLLs into its virtual address space. Private bytes is a counter that does not include these things. As such, you will almost always see that virtual bytes is significantly larger than private bytes.

  • Working set bytes are the set of virtual addresses used by the application that are actually resident in memory (i.e. are NOT paged to the disk)

  • Page file bytes are, fairly obviously, the set of virtual addresses that reference a page in the pagefile. When memory is low, incidently, you can also add in the counter Page faults/sec to see how often the memory is being swapped from disk into memory. The more page faults per second, the slower the process will run because suspending a process while a page is retreived from a disk can be quite an expensive operation!

  • Paged pool bytes/Paged non-pool bytes - I had to look this one up! The best explanation is actually on Microsoft's Ask the Performance Team where they have a whole article on understanding pool resources. Rather than reinvent the wheel, I'll just quote them:
    When a machine boots up, the Memory Manager creates two dynamically sized memory pools that kernel-mode components use to allocate system memory. These two pools are known as the Paged Pool and NonPaged Pool. Each of these pools start at an initial size that is based upon the amount of physical memory present in the system. Pool memory is a subset of available memory and is not necessarily contiguous. If necessary, these pools can grow up to a maximum size that is determined by the system at boot time.
    For more info on this area of memory, the linked article explains it all very well so I won't explain it any further here.
I should note that after you capture these counters, when you view them again you will probably need to set the scale to something more sensible like 0.000001, or else you just won't see a line on the graph!

Task Manager

When something is going slowly, most people who know enough about Windows look at the processes tab of the Windows Task Manager.

I have to say before I talk about the task manager that though I've really only just started to use Windows Vista (well, I hardly use it at all, I prefer Ubuntu Linux for home use) despite its shortcomings Microsoft really put some care in the ancillary utilities like the Performance Monitor (now part of the Reliability and Performance Monitor) and the task manager. The task manager under Vista is really quite impressive now!

However, I'm afraid that I've only just started looking at the Vista task manager, so I won't talk about it much here. I might edit the article later and rewrite this section. :-)

The main areas to look at in the Task manager are the mem usage column and the VM size column. While it might seem very straight forward to understand what these columns do, given the discussion above, the Task manager programmers must have been on crack as they don't report what you might think.
  • Mem usage seems very vague for a description - after all, does this refer to the amount of virtual memory being used, the amount of actual unpaged memory, perhaps the total amount of memory exclusively used by the process? Well, it is actually the working set, which if you remember is all the virtual memory addresses that are resident in physical memory.

  • VM size is really a very odd name to give a column. After all, VM size is actually 4GB on 32-bit CPUs as virtual memory is all the memory that Windows can possibly address. Not surprisingly, confusion reigns supreme about this column and it is often confused as either the virtual memory that is paged to disk or the amount of virtual memory being used by a process. It is neither of these things, instead it is actually the amount of private memory being used by the process (remembering that private memory is memory that is being used exclusively by the process and doesn't take into account things like dlls, etc).
Anyway, I do hope that this is somewhat enlightening. Researching this took me some time and while searching google I found it frustrating seeing so much contradictory and confusing information about how to answer a question that at first glance seem to be straight forward but isn't: "how much memory is my program using?"

Wednesday, December 10, 2008

varchar(MAX) truncating

I was reading Karen Delaney's excellent Inside SQL Server 2005 book about the SQL Server 2005 storage engine, and she explained quite a neat feature: varchar(MAX).

varchar(MAX) is a funny sort of varchar, only you aren't restricted to only 8000 bytes (or characters - to varchar it's all the same) but in fact the maximum length of the column is the maximum length of any LOB type (2^31). Basically, if a column is set to varchar(MAX) and you have less than 8000 bytes then SQL Server internally stores the column as a varchar column, and if you exceed 8000 bytes then it stores it as LOB data.

The interesting thing about varchar(max) is that it appears that unlike TEXT columns you can actually apply normal string manipulating SQL on it (i.e. replicate, left, right, etc). Yes, that's right, no more mucking about with TEXTPTR manipulation!

The trap to watch out for is that if you decide to do something like join a varchar(MAX) value to a varchar value in order to create a new varchar(MAX) value, if the new value becomes greater than 8000 characters then the new value will be truncated to 8000 bytes.

e.g.

declare @theText varchar(MAX)
set @theText = replicate('x', 7095) + 'the end bit'
select len(@theText)

(shamelessly stolen from this blog)

While you might think that the length of the string will show as 8006, the length is actually 8000.

Why does this occur?

The answer is actually found in SQL Server books online, under the section Precision, Scale and Length. This says that:

When two char, varchar, binary, or varbinary expressions are concatenated, the length of the resulting expression is the sum of the lengths of the two source expressions or 8,000 characters, whichever is less.

When two nchar or nvarchar expressions are concatenated, the length of the resulting expression is the sum of the lengths of the two source expressions or 4,000 characters, whichever is less.
Therefore, be warned! unless you cast the string to a varchar(MAX), then you might not get the results you expect!

For the above example, incidently, you would do the following:

declare @theText varchar(MAX)
set @theText = replicate(cast 'x' as varchar(MAX)), 7095) + cast('the end bit' as varchar(MAX))
select len(@theText)