Data Science with PostgreSQL – Window Functions Basics

Window functions are a powerful tool that helps to leverage the power of PostgreSQL for Data Analysis. In this blog series, I will explain what window functions are, why you should use them, types of window functions and finally will introduce you to some basic window functions in PostgreSQL. In the next few post, I’ll go through more advanced window functions and demo some scenarios. So let’s get going.

Definitions first – What are window functions?

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculations that can be done with an aggregate function. However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would. Instead, the rows retain their separate identities.

Source: PostgreSQL Docs 3.5. Window Functions

So how are they different from GROUP BY ?

Window functions do aggregation similar to GROUP BY clause. However, rows are not grouped into a single row, instead, each row retains its separate identity. That is, a window function may return a single value for each row. Here’s a simple visualization to understand this.

On left-hand side is GROUP BY , that collapse all 3 rows in one single row based on type of aggregation (SUM(), AVG(), COUNT() etc.). On Right-hand side is Window Function that output all 3 rows with an aggregation value. The example below explain this further.

Comparing GROUP BY and Window Function

Let’s use a simple example to compare and contrast GROUP BY and Window Function. We have an Employee table with columns as employee name, job title, salary, etc. We want to calculate the average salary for each job title.

As you can notice, GROUP BY aggregation on the left-hand side, collapse everything in four rows (i.e. 4 distinct job title) to display the average salary. In contrast, the Window Function retains all original rows. So we can see employees ‘Job Title’, ‘Salary’ and also ‘Average Salary’. This makes it easy to compare each employee salary to the average salary for their job title.

One can argue that it does not make much sense to repeat the same value Avg Salary for every single row. However it becomes quite useful if you were to compute Percentage Salary for all employees compared to their peers with same job title. I’ll show you this later in post, for now lets start with a super simple example.

Let’s explore with a simple example

Create a simple result set of numbers from 1 to 10 using generate_series()

SELECT * 
FROM generate_series(1, 10) AS f(series);

series
------
1
2
3
4
5
6
7
8
9
10

Using window function OVER() display SUM of series against each row.

SELECT series, SUM(series) OVER () 
FROM generate_series(1, 10) AS f(series);

series	sum
------	------
1	    55
2	    55
3	    55
4	    55
5	    55
6	    55
7	    55
8	    55
9	    55
10	    55

Using three OVER() display COUNT, SUM, AVERAGE of series against each row.

SELECT series
	, COUNT(series) OVER ()
	, AVG(series) OVER ()
	, SUM(series) OVER ()
FROM generate_series(1, 10) AS f(series);

series count   avg    sum
----- ------ ------ ------
1	   10	   5.5	   55
2	   10	   5.5	   55
3	   10	   5.5	   55
4	   10	   5.5	   55
5	   10	   5.5	   55
6	   10	   5.5	   55
7	   10	   5.5	   55
8	   10	   5.5	   55
9	   10	   5.5	   55
10	   10	   5.5	   55

At this time, there is no need to memorize the syntax. The only concept I want you to understand is that the window function computes a value for each row in the “window” or “partition“.

Now a “window” can be all rows (like our example above), or an only subset of rows as specified in the PARTITION BY clause. For our employee table example, we may partition by Job Title. In the example below, you can see the job title in a different color. Here, each color represents a different “window” or a “partition”. Let’s see this in real-world scenario.

Display Sal, Avg Sal and % Sal for all employees, group by Job Title

Coming back to employee, job title and salary example. Let’s first check the our base tables here i.e. “employee” and “jobs”.

SELECT j.job_title,
e.salary,
AVG(e.salary) OVER (PARTITION BY j.job_title) as "avg sal",
round(e.salary / AVG(e.salary) OVER (PARTITION BY j.job_title) * 100,2) as "sal percent"
FROM employees e 
INNER JOIN jobs j 
	ON e.job_id = j.job_id
	GROUP BY j.job_title, e.salary;

Types of window functions?

There are multiple different window functions in PostgreSQL, that can all be grouped in three broad categories based on their use: Value, Ranking and Aggregate. Here’s the complete list.


You can check the PostgreSQL docs for the full list.

Conclusion

Hopefully you’ve an overview of window functions. These are powerful tools for analyzing the structured data. In the subsequent posts, I’ll deep dive into each of these Window Function categories with real-world some scenarios. Stay tuned!

Like what I write? Join my mailing list, and I’ll let you know whenever I write another post. No spam, I promise! 👨‍💻

Author: Varun Dhawan

Hello dear reader, I'm a DevOps Engineer based in MN, US (beautiful land of 10,000 lakes). I am perpetually curious and always willing to learn and engineer systems that can help solve complex problems using data. When I am not engineering or blogging, you’ll find me cooking and spending time with my family. Varun.Dhawan@gmail.com

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: