Recursive SQL for querying hierarchical data: Part 1

Recently, I was working on an application that required reading hierarchically structured data. And I thought it might be useful to document multiple ways to store and query such hierarchical data (ex. Org. chart, File-system layout, or Set of tasks in a project) in a database. So, let’s jump right in.

Definitions first – what is hierarchical data?

Hierarchical data is a specific kind of data, characterized by a hierarchical relationship between the data sets.

Think about data sets having multiple levels: something above, something below, and a few at the same level. A typical example of such hierarchical model is an organizational chart like the one below.

Illustration – Hierarchical organizational chart

Storing hierarchical data in a relational database

In a relational system, hierarchical data can be represented as a parent-child relationship. Typically, this implies a rule that a child data has only one parent, while the parent data can have one or more ‘children’.

Now a problem with hierarchical data usually surfaces when you try to save this data in a database. To do that, you need to fit all those multi level parent-child relationships into a relatively flat format i.e. a database table.

A simple solution to this problem is to use a column (field) that relates to another column in the same table. Let’s understand this using an example:

-- Create employee table
CREATE TABLE employees (
   employee_id serial PRIMARY KEY,
   full_name VARCHAR NOT NULL,
   title VARCHAR NOT null,
   manager_id INT  -- this col relates to employee_id 
);
-- Insert some sample rows
INSERT INTO employees (
   employee_id,
   full_name,
   title,
   manager_id
)
VALUES
   (1, 'Smith Ross', 'ceo', NULL),
   (2, 'Mike Pearl', 'cio', 1),
   (3, 'Green Field', 'coo', 1),
   (4, 'Dewane Paul', 'svp', 2),
   (5, 'Matts Sh', 'assistant', 2),
   (6, 'Plank Pto', 'assistant', 3),
   (7, 'kayling Metcalfe', 'svp', 3),
   (8, 'blaze Mills', 'dir', 4),
   (9, 'clare Glover', 'vp', 4),
   (10, 'jonas Henderson','dir',7),
   (11, 'scarlet Kelly','dir', 7),
   (12, 'frank Climo','assistant', 9);   

As you can see in the sample data, manager id corresponds to employee id (every manager is also an employee). Example – Mike Pearl (employee_id = 2) is the manager for Dewane Paul (employee_id = 4) and Matts Sh (employee_id = 5). Also Mike reports to Smith Ross (employee_id = 1).

Note: for CEO Smith Ross (employee_id = 1) the manager ID NULL, as a CEO does not reports to anyone. There are some perks of being at the top 😛

Querying hierarchical data

Self-Join

Self-join offers a simple approach to query hierarchical data. Remember, in a hierarchical data table we have one column that relates to another column in the same table. In this example, to get the direct subordinates from the table we will leverage employee_id column.

Use Self-Join to list all direct report for employee id 1

SELECT 
	sub.employee_id,
	sup.manager_id,
	sub.full_name AS employee_name
FROM employees sub
	JOIN employees sup
		ON sub.employee_id = sup.employee_id
		where sup.manager_id = 1
			ORDER BY manager_id;
employee_id manager_id employee_name
2 1 Mike Pearl
3 1 Green Field

This returns all the direct reports for CEO Smith Ross i.e. Mike and Green. So its easy to fetch manager and subordinates one level down (or above) using self-joins.

However, hierarchies can go deep and can have an immense number of levels (think file folders on your laptop). What if we are now asked to return list of all employees (direct and indirect) reporting to the CEO?

And that’s where we need Recursive queries!

So, what’s a recursive query?

A recursive query is a query that refers to a recursive Common Table Expression (CTE). Recursive CTEs allow themselves to be called until a condition is met.

Let’s jump back to to our org. chart and list all employees reporting to CEO using a recursive CTE — and then try dissect the recursive CTE a bit.

WITH RECURSIVE subordinates AS (
   SELECT
      employee_id,
      manager_id,
      full_name AS employee_name
   FROM
      employees
   WHERE
      employee_id = 1
   UNION
      SELECT
         e.employee_id,
         e.manager_id,
         e.full_name
      FROM
         employees e
      INNER JOIN subordinates s ON s.employee_id = e.manager_id
) SELECT
   *
FROM
   subordinates;

How recursive CTE works:

  • First a recursive CTE specifies non-recursive term and recursive term.
  • Here, the non-recursive term returns the base result set R0 that is the employee with the employee_id 1 (remember CEO does not has a manager)
  • In the first iteration, recursive query returns the direct subordinate(s) of the CEO i.e.  employee_ids 2 and 3.
employee_id manager_id employee_name
2 1 Mike Pearl
3 1 Green Field
  • The second iteration of the recursive member uses the result set above as the input value, and returns direct reports for employee_ids 2 and 3.
employee_id manager_id employee_name
4 2 Dewane Paul
5 2 Matts Sh
6 3 Plank Pto
7 3 kayling Metcalfe
  • The third iteration again uses the above result set as the input value, and returns direct reports employee id from step 2.
employee_id manager_id employee_name
8 4 blaze Mills
9 4 clare Glover
10 7 jonas Henderson
11 7 scarlet Kelly
  • In the last iteration, we get employee at the bottom of org chart.
employee_id manager_id employee_name
12 9 frank Climo

PostgreSQL returns the final result set that is the union of all intermediate result sets from above iterations generated by the non-recursive and recursive terms. Here we get everyone who is reporting to CEO (directly or indirectly).

employee_id manager_id employee_name
1 Smith Ross
2 1 Mike Pearl
3 1 Green Field
4 2 Dewane Paul
5 2 Matts Sh
6 3 Plank Pto
7 3 kayling Metcalfe
8 4 blaze Mills
9 4 clare Glover
10 7 jonas Henderson
11 7 scarlet Kelly
12 9 frank Climo

Conclusion

This post has given you an introduction to storing hierarchical data in relational database and awe-inspiring power of recursive CTEs. CTE’s are not that hard to understand, and they provide the simplest way to do recursive computation across your dataset containing hierarchies and graphs. So next time you have a need to do some recursive computation, consider doing it natively in SQL as opposed to writing a procedural code in your application. For further reading consider taking a look at some of these helpful resources:

I hope you learned something new today.

Like what I write? Please join my mailing list, and I’ll let you know whenever I write another post. No spam, I promise! 👨‍💻

Author: Varun Dhawan

Hello dear reader, I'm a Product Manager @Microsoft based in MN, US (beautiful land of 10,000 lakes). I am perpetually curious and always willing to learn and engineer systems that can help solve complex problems using data. When I am not engineering or blogging, you’ll find me cooking and spending time with my family. Varun.Dhawan@gmail.com

One thought on “Recursive SQL for querying hierarchical data: Part 1”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: