Tuesday 16 November 2010

count_if (expression)


Someone just posted a problem on a PostgreSQL mailing list. They had several boolean columns, but they wanted to get how many values were true for each column, but in 1 query. There are several ways to do this, but none of which are particularly elegant.

Say we have the following table definition:

CREATE TABLE subscribers
(
subscriber text PRIMARY KEY,
news boolean DEFAULT false,
sport boolean DEFAULT false,
promotions boolean DEFAULT false,
horrible_spam boolean DEFAULT false
);

INSERT INTO subscribers (subscriber, news, sport, promotions, horrible_spam)
VALUES
('person_a', true, false, false, false),
('person_b', true, true, true, false),
('person_c', true, false, true, true),
('person_d', false, false, false, false),
('person_e', true, true, false, false),
('person_f', true, false, true, true),
('person_g', true, false, false, false);


Now we want to get a count of how many of each subscription category has someone subscribing to it. We can use a case statement for each column:

SELECT sum(CASE news WHEN true THEN 1 END) AS "news count",
sum(CASE sport WHEN true THEN 1 END) as "sport count",
sum(CASE promotions WHEN true THEN 1 END) as "promotions count",
sum(CASE horrible_spam WHEN true THEN 1 END) as "horrible_spam count"
FROM subscribers;

Which produces:

 news count | sport count | promotions count | horrible_spam count 
------------+-------------+------------------+---------------------
6 | 2 | 3 | 2
(1 row)

This looks almost unreadable and is very cumbersome. There's another way however, although it's no better. We use 4 subqueries:


SELECT (SELECT count(news) FROM subscribers WHERE news = true) AS "news count",
(SELECT count(sport) FROM subscribers WHERE sport = true) AS "sport count",
(SELECT count(promotions) FROM subscribers WHERE promotions = true) AS "promotions count",
(SELECT count(horrible_spam) FROM subscribers WHERE horrible_spam = true) AS "horrible_spam count";
 news count | sport count | promotions count | horrible_spam count 
------------+-------------+------------------+---------------------
6 | 2 | 3 | 2
(1 row)

Arghhh.. my eyes! Okay, we can try casting booleans to integers:

SELECT count(news::int) AS "news count",
count(sport::int) AS "sport count",
count(promotions::int) AS "promotions count",
count(horrible_spam::int) AS "horrible_spam count"
FROM subscribers;
 news count | sport count | promotions count | horrible_spam count 
------------+-------------+------------------+---------------------
7 | 7 | 7 | 7
(1 row)

Hmmm.. obviously not.

Surely there's a better way? Well no, there isn't... unless we create an aggregate function which will count based on an expression.

CREATE OR REPLACE FUNCTION countif_add(current_count int, expression bool)
RETURNS int AS
$BODY$
SELECT CASE expression
WHEN true THEN
current_count + 1
ELSE
current_count
END;
$BODY$
LANGUAGE SQL IMMUTABLE;
CREATE AGGREGATE count_if (boolean)
(
sfunc = countif_add,
stype = int,
initcond = 0
);

Now we can use a far tidier query.

SELECT count_if(news) AS "news count",
count_if(sport) AS "sport count",
count_if(promotions) AS "promotions count",
count_if(horrible_spam) AS "horrible_spam count"
FROM subscribers;
 news count | sport count | promotions count | horrible_spam count 
------------+-------------+------------------+---------------------
6 | 2 | 3 | 2
(1 row)

But this function lets us do more than count true values in boolean columns. It's evaluating expressions. So you can have queries like:

SELECT count_if(my_column > 7) AS "things bigger than 7"
count_if(this_column = that_column) AS "stuff that matches"
FROM my_table;

4 comments:

depesz said...

This can be also done using count(nullif(..., ...)) approach, which is nice - as it's built-in, but has the drawback of not being really nice on eyes.

Thom said...

Hey Depesz,

Yes, you're right. :) And there's probably a couple other ways too. However, I would like something like this sort of count_if function to be built in too as it's more intuitive.

Writing...

SELECT count(nullif(bool_column,false)) FROM my_table;

...looks a bit weird. The user has to realise that the 2 values mustn't match one another in order for it to be counted. There's also the problem of counting nulls with that method. But for the purposes of the original problem, that would have worked fine.

Joe said...

It seems to me the third try should have used sum, as did the first, so at least you get the correct results:

SELECT sum(news::int) AS news_count,
sum(sport::int) AS sport_count,
sum(promotions::int) AS proms_count,
sum(horrible_spam::int) AS spam_count
FROM subscribers;

I don't find that so unattractive (if properly indented--which Blogger doesn't allow me to do), but I guess you do.

Thom said...

You are indeed correct. I obviously had a brain failure at that moment.