Wednesday, December 17, 2008

SQL queries for creating reports across multiple tables

I often run into cases where I need to extract fields from multiple tables and show them all in a report. And I need to get all the fields with just one SQL query. I want to limit it to a single query for two reasons: one, I want the database to return the records in the user-specified order (e.g., ordered on a particular field or two); and two, using a single query usually is very efficient because exploits the capabilities of the database engine's query planner.

But designing the right query is sometimes difficult. In particular, showing the values referred to by foreign keys and picklists are a bit of a challenge, particularly when the database schema allows foreign keys or picklist references to be null. So I felt it worthwhile to document this problem as best I can, along with a couple of approaches for solving it. Your comments, of course, are welcome.

A simple example


The first is a simple example with just two tables, a Person entity table containing information about a person, and a Profession picklist table that maintains the list of professions. Here are the relevant fields in these tables.

Profession
id (primary key)
display_string
display_rank (integer)

Person
person_id (primary key)
full_name
email_address
profession_id (nullable foreign key reference to Profession.id)

Our problem is to produce a single query that retrieves the full name, email address and display_string field of the profession for every Person record in the table, ordered by the Profession's display_string. The output might look like this:

My Cousin Willie will@goodfor.not
Nelson Mandela nelson@gov.za Politics
Winston Churchill churchill@gov.uk Politics
Anatoly Karpov mike@jo.com Sports
Mother Teresa mo@heaven.net Social service

The first attempted solution is this one, using a simple join:
select Person.full_name as name, Person.email_address as email, Profession.display_string as profession from Person, Profession where Person.profession_id = Profession.id order by profession;
This works, but produces only those Person records that have a non-null value in their profession_id column. Since we want to also produce the Person records that have a null profession id, we have to resort to a left join, like this:
select Person.full_name as name, Person.email_address as email, Profession.display_string as profession
from Person left join Profession on Person.profession = Profession.id
order by profession;
Alternatively, we could use a correlated subquery like this:
select Person.full_name as name, Person.email_address as email,
(select Profession.display_string from Profession where Person.profession_id = Profession.id) as profession from Person order by profession;


A more complex example


Real-world situations are rarely limited to just two tables. So to make our problem more realistic, we'll extend it to where it spans two more tables. Person and Activity tables are major entities, and Profession and ActivityStatus tables are picklist containers. In addition, Activity has a nullable foreign key reference to Person, so that we can optionally identify a contact with whom the activity is associated.

So here is a representation of the two additional tables:

ActivityStatus
id (primary key)
display_string

Activity
activity_id
subject
time
status (foreign key to ActivityStatus,id, may not be null)
person_id (foreign key to Person, may be null)


This time, we are required to produce a single report containing all Activity records, along with the contact information (full name, email address and profession) of the person with whom the activity is associated. If there is no person associated with an activity, we want those columns to show empty in the report. And just for fun, suppose we want the report ordered by the name of the person associated with the activity. So our report might look like this:

Subject AtTime Status WithWhom DoesWhat
Golf game 2 pm Completed Anatoly Karpov Sports
Uno session 1 am Pending My cousin Willie
Breakfast meeting 8 am Pending Nelson Mandela Politics
Dinner meeting 3 pm Completed Nelson Mandela Politics
Dinner meeting 7 pm Completed Winston Churchill Politics

To create this query, we can try grouping the fields we want according to the tables in which they occur, and hooking them together using joins and subqueries, like this:

select Activity.subject, Activity.time as AtTime, ActivityStatus.status as Status,
p.full_name as WithWhom,
(select Profession.display_string from Profession where Profession.id = p.profession) as DoesWhat
from Activity
left join (select full_name, profession_id from Person) as p on p.person_id = Activity.person_id,
ActivityStatus
where ActivityStatus.id = Activity.status
order by WithWhom;
Here we have used a left join to extract the fields of the Person records, and a correlated subquery to extract the picklist value of the profession. We could just as easily have used a left join for the picklist value as well. In this sense the left join and the correlated subquery are equivalent in expressive power, but their performance may vary depending on the database engine's query planner.

One other thing worth noting: The query above is Activity-centric, in the sense that it produces Activity records as the basis, coupled with information about associated Persons. This is the reason why the Activity fields occur at the main level of the select query, and the Person fields occur in the join portion. If we had instead wanted a Person-centric report, with the Person records along with any associated activities -- which would likely yield a much larger result set -- we could have written a query that puts the Person fields in the outer level and the Activity fields in the left join, like this:

select Person.full_name, Person.email_address,
(select display_string from Profession where Profession.id = Person.profession_id) as DoesWhat,
ac.subject, ac.AtTime
from Person
left join
(select Activity.subject, Activity.time as AtTime, Activity.status from Activity) as ac
on ac.person_id = Person.person_id;


Summary


Here are a few guidelines I've found useful when creating a multi-table query:
  • Figure out which of the main entities is the "base" entity, and use its fields at the top level of the query.
  • Joins are expensive, and left joins even more so. Therefore, use a left join only if (a) the foreign key involved in the join is nullable and (b) you want the null values to show in your results. Otherwise, use an equi-join.
  • A correlated subquery provides a simple means to extract a single foreign-key dereference value.


Comments welcome!

No comments: