Recently, I was working on an application that required reading hierarchically structured data. And I thought it might be useful to document multiple ways to store and query such hierarchical data (ex. Org. chart, File-system layout, or Set of tasks in a project) in a database. So, let’s jump right in.
Definitions first – what is hierarchical data?
Hierarchical data is a specific kind of data, characterized by a hierarchical relationship between the data sets.
Think about data sets having multiple levels: something above, something below, and a few at the same level. A typical example of such hierarchical model is an organizational chart like the one below.
Storing hierarchical data in a relational database
In a relational system, hierarchical data can be represented as a parent-child relationship. Typically, this implies a rule that a child data has only one parent, while the parent data can have one or more ‘children’.
Now a problem with hierarchical data usually surfaces when you try to save this data in a database. To do that, you need to fit all those multi level parent-child relationships into a relatively flat format i.e. a database table.
A simple solution to this problem is to use a column (field) that relates to another column in the same table. Let’s understand this using an example:
-- Create employee table CREATE TABLE employees ( employee_id serial PRIMARY KEY, full_name VARCHAR NOT NULL, title VARCHAR NOT null, manager_id INT -- this col relates to employee_id ); -- Insert some sample rows INSERT INTO employees ( employee_id, full_name, title, manager_id ) VALUES (1, 'Smith Ross', 'ceo', NULL), (2, 'Mike Pearl', 'cio', 1), (3, 'Green Field', 'coo', 1), (4, 'Dewane Paul', 'svp', 2), (5, 'Matts Sh', 'assistant', 2), (6, 'Plank Pto', 'assistant', 3), (7, 'kayling Metcalfe', 'svp', 3), (8, 'blaze Mills', 'dir', 4), (9, 'clare Glover', 'vp', 4), (10, 'jonas Henderson','dir',7), (11, 'scarlet Kelly','dir', 7), (12, 'frank Climo','assistant', 9);
As you can see in the sample data, manager id corresponds to employee id (every manager is also an employee). Example – Mike Pearl (employee_id = 2) is the manager for Dewane Paul (employee_id = 4) and Matts Sh (employee_id = 5). Also Mike reports to Smith Ross (employee_id = 1).
Note: for CEO Smith Ross (employee_id = 1) the manager ID
NULL, as a CEO does not reports to anyone. There are some perks of being at the top 😛
Querying hierarchical data
Self-join offers a simple approach to query hierarchical data. Remember, in a hierarchical data table we have one column that relates to another column in the same table. In this example, to get the direct subordinates from the table we will leverage employee_id column.
Use Self-Join to list all direct report for employee id 1
SELECT sub.employee_id, sup.manager_id, sub.full_name AS employee_name FROM employees sub JOIN employees sup ON sub.employee_id = sup.employee_id where sup.manager_id = 1 ORDER BY manager_id;
This returns all the direct reports for CEO Smith Ross i.e. Mike and Green. So its easy to fetch manager and subordinates one level down (or above) using self-joins.
However, hierarchies can go deep and can have an immense number of levels (think file folders on your laptop). What if we are now asked to return list of all employees (direct and indirect) reporting to the CEO?
And that’s where we need Recursive queries!
So, what’s a recursive query?
A recursive query is a query that refers to a recursive Common Table Expression (CTE). Recursive CTEs allow themselves to be called until a condition is met.
Let’s jump back to to our org. chart and list all employees reporting to CEO using a recursive CTE — and then try dissect the recursive CTE a bit.
WITH RECURSIVE subordinates AS ( SELECT employee_id, manager_id, full_name AS employee_name FROM employees WHERE employee_id = 1 UNION SELECT e.employee_id, e.manager_id, e.full_name FROM employees e INNER JOIN subordinates s ON s.employee_id = e.manager_id ) SELECT * FROM subordinates;
How recursive CTE works:
- First a recursive CTE specifies non-recursive term and recursive term.
- Here, the non-recursive term returns the base result set R0 that is the employee with the employee_id 1 (remember CEO does not has a manager)
- In the
,recursive query returns the direct subordinate(s) of the CEO i.e. employee_ids 2 and 3.
second iterationof the recursive member uses the result set above as the input value, and returns direct reports for employee_ids 2 and 3.
third iterationagain uses the above result set as the input value, and returns direct reports employee id from step 2.
- In the
last iteration, we get employee at the bottom of org chart.
PostgreSQL returns the final result set that is the
union of all intermediate result sets from above iterations generated by the non-recursive and recursive terms. Here we get everyone who is reporting to CEO (directly or indirectly).
This post has given you an introduction to storing hierarchical data in relational database and awe-inspiring power of recursive CTEs. CTE’s are not that hard to understand, and they provide the simplest way to do recursive computation across your dataset containing hierarchies and graphs. So next time you have a need to do some recursive computation, consider doing it natively in SQL as opposed to writing a procedural code in your application. For further reading consider taking a look at some of these helpful resources:
I hope you learned something new today.
Like what I write? Please join my mailing list, and I’ll let you know whenever I write another post. No spam, I promise! 👨💻