Tuesday 16 November 2010

count_if (expression)


Someone just posted a problem on a PostgreSQL mailing list. They had several boolean columns, but they wanted to get how many values were true for each column, but in 1 query. There are several ways to do this, but none of which are particularly elegant.

Say we have the following table definition:

CREATE TABLE subscribers
(
subscriber text PRIMARY KEY,
news boolean DEFAULT false,
sport boolean DEFAULT false,
promotions boolean DEFAULT false,
horrible_spam boolean DEFAULT false
);

INSERT INTO subscribers (subscriber, news, sport, promotions, horrible_spam)
VALUES
('person_a', true, false, false, false),
('person_b', true, true, true, false),
('person_c', true, false, true, true),
('person_d', false, false, false, false),
('person_e', true, true, false, false),
('person_f', true, false, true, true),
('person_g', true, false, false, false);


Now we want to get a count of how many of each subscription category has someone subscribing to it. We can use a case statement for each column:

SELECT sum(CASE news WHEN true THEN 1 END) AS "news count",
sum(CASE sport WHEN true THEN 1 END) as "sport count",
sum(CASE promotions WHEN true THEN 1 END) as "promotions count",
sum(CASE horrible_spam WHEN true THEN 1 END) as "horrible_spam count"
FROM subscribers;

Which produces:

 news count | sport count | promotions count | horrible_spam count 
------------+-------------+------------------+---------------------
6 | 2 | 3 | 2
(1 row)

This looks almost unreadable and is very cumbersome. There's another way however, although it's no better. We use 4 subqueries:


SELECT (SELECT count(news) FROM subscribers WHERE news = true) AS "news count",
(SELECT count(sport) FROM subscribers WHERE sport = true) AS "sport count",
(SELECT count(promotions) FROM subscribers WHERE promotions = true) AS "promotions count",
(SELECT count(horrible_spam) FROM subscribers WHERE horrible_spam = true) AS "horrible_spam count";
 news count | sport count | promotions count | horrible_spam count 
------------+-------------+------------------+---------------------
6 | 2 | 3 | 2
(1 row)

Arghhh.. my eyes! Okay, we can try casting booleans to integers:

SELECT count(news::int) AS "news count",
count(sport::int) AS "sport count",
count(promotions::int) AS "promotions count",
count(horrible_spam::int) AS "horrible_spam count"
FROM subscribers;
 news count | sport count | promotions count | horrible_spam count 
------------+-------------+------------------+---------------------
7 | 7 | 7 | 7
(1 row)

Hmmm.. obviously not.

Surely there's a better way? Well no, there isn't... unless we create an aggregate function which will count based on an expression.

CREATE OR REPLACE FUNCTION countif_add(current_count int, expression bool)
RETURNS int AS
$BODY$
SELECT CASE expression
WHEN true THEN
current_count + 1
ELSE
current_count
END;
$BODY$
LANGUAGE SQL IMMUTABLE;
CREATE AGGREGATE count_if (boolean)
(
sfunc = countif_add,
stype = int,
initcond = 0
);

Now we can use a far tidier query.

SELECT count_if(news) AS "news count",
count_if(sport) AS "sport count",
count_if(promotions) AS "promotions count",
count_if(horrible_spam) AS "horrible_spam count"
FROM subscribers;
 news count | sport count | promotions count | horrible_spam count 
------------+-------------+------------------+---------------------
6 | 2 | 3 | 2
(1 row)

But this function lets us do more than count true values in boolean columns. It's evaluating expressions. So you can have queries like:

SELECT count_if(my_column > 7) AS "things bigger than 7"
count_if(this_column = that_column) AS "stuff that matches"
FROM my_table;