Saturday, December 27, 2008

SQL joins

A relational database is so named because the data it stores relates to each other. In SQL, the relationships between data is expressed in a query by what is known as a join.

There are a number of different types of joins. Most joins can be either defined as an implicit join or an explicit join. An implicit join is one where there is no explicit JOIN statement yet there is a join implicit in the WHERE clause of the query. An explicit join is one where there is actually a JOIN statement in the FROM clause of the query.

An explicit join is also known as a qualified join, and uses the syntax

SELECT <columns>
FROM <table>
<join type> JOIN <table>
ON <join condition>


Cross joins

Given two tables, each row in the inner table joins each row of the outer table. I can think of almost no examples where you'd want to use this, but the implicit join syntax is:

SELECT * FROM
OrgUnit, Person

The explicit join syntax is:

SELECT * FROM OrgUnit
CROSS JOIN Person

A cross join is also known as a Cartesian product or a Cartesian join. I have never needed to use it! Note that this is the only join that does not use the ON syntax, as it's not needed and if you needed to restrict the data set then you'd need to filter in the WHERE clause of the query. Of course, if your filter does a comparison against two related columns of the datasets, then you have an INNER join, described below.

Inner joins

An "inner" join is named because given two tables, the query will return each row in the inner table that has a match in the outer table.

To show an example of this, I'll have another look at my two tables I introduced in the cross join section. They are OrgUnit and Person.




























OrgUnit
OrgUnitID OrgUnitName
1Manufacturing
2Accounting
3Human Resources
4Tiger team
5IT
6Consultancy
7Special projects






































Person
PersonID FirstName LastName PhoneNumber Address OrgUnitID
1 Chris Sasquatch 03 9998 7656 5 Curumbah St, Gladstone, VIC, 3145 5
2 Jill Anderson 03 9765 3975 23 Sesqatonia Rd, Smallville, VIC, 3087 6
3 Robert Manaheim 03 7659 8765 18a Xyz St, Whoknowswhere, VIC, 3009 2
4 Mysterious Guy 02 8765 4321 78 Unknown St, NSW, 2000 1
5 Captain Obvious 02 1111 1111 54 Duh Lane, NSW, 2119 4
6 Mickey Mouse 07 7654 6665 1 Disney Rd, QLD, 7654 7
7 Donald Duck 07 7654 6665 5 Drewery Lane, WA, 8097 NULL
8 Daffy Duck 08 7654 3214 8 St Kilda Street, WA, 8008 7
9 Bugs Bunny 08 7654 3214 1004 Killawee Rd, NT, 6543 7



If I wanted to find out all the Organization Units that each invididual resides in, I would run the following explicit query:

SELECT
p.FirstName,
p.LastName,
p.Address,
p.PhoneNumber,
o.OrgUnitName

FROM Person p
INNER JOIN OrgUnit
ON p.OrgUnitID=o.OrgUnitID

The implicit query is:

SELECT
p.FirstName,
p.LastName,
p.Address,
p.PhoneNumber,
o.OrgUnitName

FROM Person p, OrgUnit o
WHERE p.OrgUnitID=o.OrgUnitID


This gives the following results:































FirstName LastName PhoneNumber Address OrgUnitName
Chris Sasquatch 03 9998 7656 5 Curumbah St, Gladstone, VIC, 3145 IT
Jill Anderson 03 9765 3975 23 Sesqatonia Rd, Smallville, VIC, 3087 Consultancy
Robert Manaheim 03 7659 8765 18a Xyz St, Whoknowswhere, VIC, 3009 Accounting
Mysterious Guy 02 8765 4321 78 Unknown St, NSW, 2000 Manufacturing
Captain Obvious 02 1111 1111 54 Duh Lane, NSW, 2119 Tiger team
Mickey Mouse 07 7654 6665 1 Disney Rd, QLD, 7654 Special projects
Daffy Duck 08 7654 3214 8 St Kilda Street, WA, 8008 Special projects
Bugs Bunny 08 7654 3214 1004 Killawee Rd, NT, 6543 Special projects



Outer joins

An "outer" join is so named because given two tables, each row in the outer table is returned even if it does not meet the ON condition. There is no implicit join syntax for this sort of join - you must use the ANSI syntax.

There are a number of forms of outer joins: left outer join, right outer join and full outer join. Let's explore each of these.

Left outer joins

A left outer join returns the "left" rows of an outer join regardless of whether they met the ON condition or not. If the right hand table does not meet the join condition, then a NULL is displayed.

Let's say that I wanted to find all the people in my database and if they are part of an organization unit. I would do this with the following query:

SELECT
p.FirstName,
p.LastName,
p.Address,
p.PhoneNumber,
o.OrgUnitName

FROM Person p
LEFT OUTER JOIN OrgUnit
ON p.OrgUnitID=o.OrgUnitID

This would return the following results:

































FirstName LastName PhoneNumber Address OrgUnitName
Chris Sasquatch 03 9998 7656 5 Curumbah St, Gladstone, VIC, 3145 IT
Jill Anderson 03 9765 3975 23 Sesqatonia Rd, Smallville, VIC, 3087 Consultancy
Robert Manaheim 03 7659 8765 18a Xyz St, Whoknowswhere, VIC, 3009 Accounting
Mysterious Guy 02 8765 4321 78 Unknown St, NSW, 2000 Manufacturing
Captain Obvious 02 1111 1111 54 Duh Lane, NSW, 2119 Tiger team
Mickey Mouse 07 7654 6665 1 Disney Rd, QLD, 7654 Special projects
Daffy Duck 08 7654 3214 8 St Kilda Street, WA, 8008 Special projects
Bugs Bunny 08 7654 3214 1004 Killawee Rd, NT, 6543 Special projects
Donald Duck 07 7654 6665 5 Drewery Lane, WA, 8097 NULL






I have highlighted in bold red the row that is returned in the left outer join that was not returned in the inner join. As you can see, Donald Duck exists in the left hand table, but is not part of a organization unit. With the left outer join he is still returned in the output, however as he is not part of any org unit it displays a NULL value.

Right outer joins

A right outer join is the counterpart of the left outer join. It returns the "right" rows of an outer join regardless of whether they met the ON condition or not. If the left hand table does not meet the join condition, then a NULL is displayed.

Let's say that I wanted to find all org units and if they have any members show who they are. I would run the following query:

SELECT
p.FirstName, p.LastName, o.OrgUnitName
FROM Person p
RIGHT OUTER JOIN OrgUnit
ON p.OrgUnitID=o.OrgUnitID

This returns the following results:




































FirstName LastName OrgUnitName
Mysterious Guy Manufacturing
Robert Manaheim Accounting
NULL NULL Human Resources
Captain Obvious Tiger team
Chris Sasquatch IT
Jill Anderson Consultancy
Mickey Mouse Special projects
Daffy Duck Special projects
Bugs Bunny Special projects





Again I have highlighted in bold red the important row that is returned by the right outer join. The human resources org unit of this company has nobody in it (silly rabbits!) but we still want to know all the org units of the company. Thus as the table OrgUnits is on the right hand side of the join we return the organization but NULLs for the left hand side table rows.

Full outer joins

A full outer join allows us to display each row in both left and right tables. If there is a row in the right hand table that does not match the ON join condition, then it displays the right hand table row and NULL values for the left hand table. If there is a row in the left hand table that does not match the ON join condition, then it displays the left hand table row and NULL values for the right hand table.

This may seem a bit like a cross join, but it is not. A cross join matches every row for the left and right hand tables regardless of the relationship between the tables. The full outer join maintains the relationships between the tables, it's just if either side does not have a matching row then it displays the row on the right/left hand side regardless.

If I wanted to see all the OrgUnits and each member of the OrgUnit, but I also wanted to see all the people in the database regardless of whether they belong to an OrgUnit, I would run the following query:

SELECT
o.OrgUnitName,
p.FirstName,
p.LastName,
p.PhoneNumber,
p.Address,
o.OrgUnitName

FROM Person p
FULL OUTER JOIN OrgUnit
ON p.OrgUnitID=o.OrgUnitID




































FirstName LastName PhoneNumber Address OrgUnitName
Chris Sasquatch 03 9998 7656 5 Curumbah St, Gladstone, VIC, 3145 IT
Jill Anderson 03 9765 3975 23 Sesqatonia Rd, Smallville, VIC, 3087 Consultancy
NULL NULL NULL NULL Human Resources
Robert Manaheim 03 7659 8765 18a Xyz St, Whoknowswhere, VIC, 3009 Accounting
Mysterious Guy 02 8765 4321 78 Unknown St, NSW, 2000 Manufacturing
Captain Obvious 02 1111 1111 54 Duh Lane, NSW, 2119 Tiger team
Mickey Mouse 07 7654 6665 1 Disney Rd, QLD, 7654 Special projects
Daffy Duck 08 7654 3214 8 St Kilda Street, WA, 8008 Special projects
Bugs Bunny 08 7654 3214 1004 Killawee Rd, NT, 6543 Special projects
Donald Duck 07 7654 6665 5 Drewery Lane, WA, 8097 NULL




Semi and anti-semi joins

The last sort of joins I want to talk about are semi-joins and anti-semi joins.

A semi-join will return rows from a table when there is one or more corresponding rows in the joining table. Interestingly, there isn't actually any explicit syntax for semi-joins, instead you must use the EXISTS statement.

For instance, let's say that I want to find all OrgUnits that have at least one Person. We could do the following:

SELECT o.OrgUnitName
FROM OrgUnit o
INNER JOIN Person p

However, this will return more than one record for the OrgUnit "Special projects".

Instead, a semi-join will do this for you. The syntax is:

SELECT o.OrgUnitID
FROM OrgUnit AS o
WHERE EXISTS
(
SELECT p.OrgUnitID FROM Person AS p

WHERE p.OrgUnitID = o.OrgUnitID

)

You can also do this by using an IN statement:

SELECT o.OrgUnitName
FROM OrgUnit AS o
WHERE o.OrgUnitID IN
(
SELECT p.OrgUnitID FROM Person AS p
WHERE p.OrgUnitID = o.OrgUnitID
)

An anti-semi join however, does the exact opposite. It will find any rows that do not have at least one matching row in the second table. To do this, use the NOT EXISTS statement.

Going back to our previous example, if we wanted to find all the departments that do not have any members, we would do the following:

SELECT o.OrgUnitID
FROM OrgUnit AS o
WHERE NOT EXISTS
(
SELECT p.OrgUnitID FROM Person AS p

WHERE p.OrgUnitID = o.OrgUnitID

)

With the NOT IN statement it would be:

SELECT o.OrgUnitName
FROM OrgUnit AS o
WHERE o.OrgUnitID NOT IN
(
SELECT p.OrgUnitID FROM Person AS p
WHERE p.OrgUnitID = o.OrgUnitID
)

Note 1: I picked this IN trick up at the following blog.
Note 2: I asked over at SQL Server Central whether they thought it would be a good idea to add a TOP 1 clause to the query. i.e. the correlated subquery becomes:

SELECT TOP 1 p.OrgUnitID FROM Person AS p
WHERE p.OrgUnitID = o.OrgUnitID

However, it was politely suggested I test myself. Fair enough I suppose :-)

Another way of doing a FULL OUTER join

Interestingly, we can get a FULL OUTER join by doing a left outer join of Table1 to Table2, then do a UNION ALL of a left anti-semi join of Table2 to Table1. Or in other words:

Table1 LEFT OUTER JOIN Table2
UNION ALL
Table2 LEFT ANTI-SEMI JOIN Table1

I picked this up from the book Inside Microsoft SQL Server 2005: Query Tuning and Optimization. It's useful because a nested loop can't do a FULL OUTER join, except that you can transform the FULL OUTER join into joins that the nested loop can handle. Which I thought was quite neat, in a totally geeky way of course :-)

1 comment:

  1. Just realised I have a problem in one of the queries. In the semi-join example, it reads:

    SELECT o.OrgUnitName
    FROM OrgUnit o
    INNER JOIN Person p

    That should be:

    SELECT o.OrgUnitName
    FROM OrgUnit o
    INNER JOIN Person p
    on p.OrgUnitID = o.OrgUnitID

    Don't want to edit the article as it might muck up the formatting!

    ReplyDelete