Monday, 7 November 2011

Writable Common Table Expressions

There have previously been various posts on Writable Common Table Expressions (wCTEs) which demonstrate either performance gains or simple examples of how they work, but recently when giving some training, I wanted to come up with some new material to show folk that demonstrates an example of when you might wish to use it. Note that writable common table expressions are only possible in PostgreSQL, and only from version 9.1 onwards. But because 9.1 has been out for almost 2 months now, that means you can go and use this feature straight away!

So in case you don't know what a writable common table expression is, in summary it's the same as a normal common table expression, except that you can also use UPDATEs, DELETEs and INSERTs with it. If you don't know what a normal common table expression is, it effectively let's you make the results of queries available to other queries in the same statement.

What follows is a rather basic case, but the data is kept simple so as to make it easier to follow the examples.

First, we create a company payroll table:

CREATE TABLE company_payroll
(
department text PRIMARY KEY,
staff_count int NOT NULL,
staff_cost numeric(16,2) NOT NULL
);

Which we'll populate with some data:

INSERT INTO company_payroll VALUES
('development', 45, 2839281.33),
('marketing', 8, 412424.45),
('management', 4, 839432.53);

Then we want an employees table:

CREATE TABLE employees
(
id serial PRIMARY KEY,
department text REFERENCES company_payroll (department),
salary numeric(16,2),
consecutive_days_late int NOT NULL DEFAULT 0
);

Then stick some values in it:

INSERT INTO employees (department, salary, consecutive_days_late) VALUES
('development', 12834.23, 0),
('development', 37853.55, 6),
('marketing', 24488.23, 2),
('development', 91824.29, 14),
('marketing', 24829.23, 16),
('marketing', 24488.23, 2),
('management', 100293.11, 14),
('management', 40287.11, 18),
('development', 72843.22, 11);

It doesn't matter that the employees don't add up to the details provided in the company payroll. Just pretend there are many other rows, but no-one else has been late.

So we now have 2 tables looking like this:

test=# select * from company_payroll;
department | staff_count | staff_cost
-------------+-------------+------------
development | 45 | 2839281.33
marketing | 8 | 412424.45
management | 4 | 839432.53
(3 rows)

test=# select * from employees;
id | department | salary | consecutive_days_late
----+-------------+-----------+-----------------------
1 | development | 12834.23 | 0
2 | development | 37853.55 | 6
3 | marketing | 24488.23 | 2
4 | development | 91824.29 | 14
5 | marketing | 24829.23 | 16
6 | marketing | 24488.23 | 2
7 | management | 100293.11 | 14
8 | management | 40287.11 | 18
9 | development | 72843.22 | 11
(9 rows)

Okay, so we have 9 employees, but quite a few of them keep turning up late for work. Management have decided to put their foot down and fire anyone who's been late for more than 10 days in a row (and probably break a few employment laws in the process). We want to delete these employees from the employees table, then update the company payroll to reflect these changes. Then finally we want to see some before and after figures to compare the new figures against the old.

So let's present a writable common table expression to solve the problem, then we'll break it down.

WITH fired_employees AS (
DELETE FROM employees
WHERE consecutive_days_late > 10
RETURNING *),
summarise_fired_staff AS (
SELECT department, COUNT(*) AS fired_count, SUM(salary) AS salary_sum
FROM fired_employees
GROUP BY department),
updated_financials AS (
UPDATE company_payroll
SET staff_count = staff_count - fired_count,
staff_cost = staff_cost - salary_sum
FROM summarise_fired_staff
WHERE company_payroll.department = summarise_fired_staff.department
RETURNING company_payroll.*
)
SELECT
cp.department, cp.staff_count AS old_staff_count,
uf.staff_count AS new_staff_count,
cp.staff_cost AS old_staff_cost,
uf.staff_cost AS new_staff_cost
FROM company_payroll cp
INNER JOIN updated_financials uf ON uf.department = cp.department;

So let's take this one step at a time. The first thing we're doing is deleting all employees from the employees table where they've been late for more then 10 consecutive days:

DELETE FROM employees
WHERE consecutive_days_late > 10
RETURNING *

Okay, so that's simple enough. But what we've done here is put it in a WITH clause and labelled the result as fired_employees. This can now be treated like a temporary table, but one which only exists for the duration of the query. Note that we're also using the RETURNING clause of the DELETE statement to have it return all rows that were deleted by the statement. It's the output of this which is now represented by the fired_employees table expression name. We can now use this for other purposes, and we will. So fired_employees now contains:

id | department | salary | consecutive_days_late
----+-------------+-----------+-----------------------
4 | development | 91824.29 | 14
5 | marketing | 24829.23 | 16
7 | management | 100293.11 | 14
8 | management | 40287.11 | 18
9 | development | 72843.22 | 11
(5 rows)

Next we summarise how many people we've fired from each department, and how much they cost in total per department:

SELECT department, COUNT(*) AS fired_count, SUM(salary) AS salary_sum
FROM fired_employees
GROUP BY department

Again, very straightforward. Note that we're referring to the fired_employees expression which we can treat like a regular table, and the output of this has been named summarise_fired_staff. This contains:

department | fired_count | salary_sum
-------------+-------------+------------
development | 2 | 164667.51
management | 2 | 140580.22
marketing | 1 | 24829.23
(3 rows)

As you can see, we fired 2 developers worth 164667.51 in whatever currency you want to pretend this represents. We've also fired 2 from management and 1 from marketing.

Now we want to update the figures we maintain in the company payroll table with this information:

UPDATE company_payroll
SET staff_count = staff_count - fired_count,
staff_cost = staff_cost - salary_sum
FROM summarise_fired_staff
WHERE company_payroll.department = summarise_fired_staff.department
RETURNING company_payroll.*

Here we're joining the company_payroll to the summarise_fired_staff expression based on the department name. The update involves decrementing the staff_count by the fired_count returned in the summarise_fired_staff result, and the same for the staff_cost, which is decremented by the salary_sum. Now we want to return the updated rows from this action, so we use "RETURNING company_payroll.*". And the output of this part of the query we've labelled updated_financials. As you can see, it's similar to how bash commands are used; you pass the output from one to another, and so on. The result of each one gets passed to the next. And now updated_financials contains:

department | staff_count | staff_cost
-------------+-------------+------------
development | 43 | 2674613.82
management | 2 | 698852.31
marketing | 7 | 387595.22
(3 rows)

So finally we want to see a side-by-side comparison of before and after figures:

SELECT
cp.department, cp.staff_count AS previous_staff_count,
uf.staff_count AS new_staff_count,
cp.staff_cost AS previous_staff_cost,
uf.staff_cost AS new_staff_cost
FROM company_payroll cp
INNER JOIN updated_financials uf ON uf.department = cp.department

So we've joined the still yet-to-be-actually-updated company_payroll table with the updated version (which we named updated_financials) based on department name. We then select the department, followed by the original staff_count from the company_payroll table (labelling it previous_staff_count for clarity), and then the updated_financials own staff_count colum (labelled new_staff_count). We also do the same for the staff cost.

So running all of this in one go returns the following:

department | old_staff_count | new_staff_count | old_staff_cost | new_staff_cost
-------------+-----------------+-----------------+----------------+----------------
development | 45 | 43 | 2839281.33 | 2674613.82
management | 4 | 2 | 839432.53 | 698852.31
marketing | 8 | 7 | 412424.45 | 387595.22
(3 rows)

If you're curious to know what the explain plan for this query is, it's as follows:

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=38.33..39.40 rows=3 width=80) (actual time=0.256..0.258 rows=3 loops=1)
Hash Cond: (cp.department = uf.department)
CTE fired_employees
-> Delete on employees (cost=0.00..21.75 rows=313 width=6) (actual time=0.059..0.066 rows=4 loops=1)
-> Seq Scan on employees (cost=0.00..21.75 rows=313 width=6) (actual time=0.022..0.023 rows=4 loops=1)
Filter: (consecutive_days_late > 10)
Rows Removed by Filter: 12
CTE summarise_fired_staff
-> HashAggregate (cost=8.61..10.61 rows=200 width=48) (actual time=0.107..0.108 rows=3 loops=1)
-> CTE Scan on fired_employees (cost=0.00..6.26 rows=313 width=48) (actual time=0.062..0.071 rows=4 loops=1)
CTE updated_financials
-> Update on company_payroll (cost=1.07..5.88 rows=3 width=198) (actual time=0.184..0.198 rows=3 loops=1)
-> Hash Join (cost=1.07..5.88 rows=3 width=198) (actual time=0.159..0.167 rows=3 loops=1)
Hash Cond: (summarise_fired_staff.department = company_payroll.department)
-> CTE Scan on summarise_fired_staff (cost=0.00..4.00 rows=200 width=168) (actual time=0.127..0.131 rows=3 loops=1)
-> Hash (cost=1.03..1.03 rows=3 width=62) (actual time=0.008..0.008 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Seq Scan on company_payroll (cost=0.00..1.03 rows=3 width=62) (actual time=0.002..0.003 rows=3 loops=1)
-> Seq Scan on company_payroll cp (cost=0.00..1.03 rows=3 width=56) (actual time=0.029..0.030 rows=3 loops=1)
-> Hash (cost=0.06..0.06 rows=3 width=56) (actual time=0.205..0.205 rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> CTE Scan on updated_financials uf (cost=0.00..0.06 rows=3 width=56) (actual time=0.188..0.203 rows=3 loops=1)
Total runtime: 0.514 ms
(23 rows)

You could also easily add in delta (or diff) columns to show how much they've changed by, because you have 2 sets of columns to reference and do a straight-forward calculation, then label those columns accordingly.

So what initially appears as a complicated behemoth of a statement is actually fairly simple once you break it down. And while the example shown isn't necessarily realistic, it should hopefully demonstrate its potential applications.

And you can try out this example right away if you're running PostgreSQL 9.1. Just run the CREATE TABLE and INSERT statements above, then run the big writable common table expression query and see the results for yourself. Note that because the wCTE query actually makes changes, running it more than once will do nothing since the employees that were fired were removed by the query the first time it was run.

You can read other blog entries on wCTEs at these links: