Thom's memory leak: 03/01/2016

In my previous blog post about horizontal scalability changes in PostgreSQL 9.6, I covered pushing sorts and joins to the remote server, but Robert Haas (EnterpriseDB) has just committed a change that allows DML (that's UPDATEs and DELETEs) to be pushed to the remote server too. This is thanks to the work of Etsuro Fujita (NTT) with advice and testing from Rushabh Lathia (EnterpriseDB), Shigeru Hanada, Robert Haas (EnterpriseDB), Kyotaro Horiguchi (NTT), Albe Laurenz and myself (EnterpriseDB). So this calls for an appendix to my original post!

DML pushdown

The problem previously was that an update or a delete meant fetching rows from the remote server in batches, and then sending an individual UPDATE or DELETE command back to the remote server for each row.

So if we ran this on the local server (where remote.big_table is a remote table which resides on the remote server):

UPDATE remote.big_table SET content = content || '.';

The remote server would receive the following request from the local server:

START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
DECLARE c1 CURSOR FOR
  SELECT id, content, ctid FROM public.big_table FOR UPDATE;
FETCH 100 FROM c1;
UPDATE public.big_table SET content = $2 WHERE ctid = $1

So it starts a transaction, opens a cursor to iterate over the whole table, then fetches a batch of 100 rows. The local server prepares an UPDATE statement and executes it once for each row it gets back. So if you have 1 million rows, there will be 1 million UPDATE statements.

This is pretty inefficient: the local server has to keep fetching batches and issuing an UPDATE statement for each row, while the remote server has to keep sending batches and executing every UPDATE statement it receives. All of those round trips add to network traffic too.

This will no longer be the case in 9.6. Instead, the local server will ask the remote server to do the whole UPDATE on its side:

START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
UPDATE public.big_table SET content = (content || '.'::text);

Very straightforward activity here. The local server told the remote server to handle the whole update itself, and the remote server just does it. Obviously, there will be cases where individual UPDATEs need to be sent if joining the remote table to a local one, but this is an enormous win over the old behaviour in simple cases like this. Note that, at the moment, if an update targets a remote table and also joins to another remote table, it will push the join down, but only to fetch batches of rows to then issue individual UPDATEs for. So those cases will still behave much like they did before.

The examples I have given are all UPDATE statements, but this applies to DELETE too. So again, rather than fetching a batch of tuples from the remote server, and issuing individual DELETE statements for each one, it will send the whole statement across.
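One way to check which path a given statement will take is to ask the local server for its plan. This is only a sketch: it reuses the remote.big_table foreign table from the example above, assumes that table is defined through postgres_fdw, and the filter on id in the DELETE is made up for illustration.

```sql
-- Run on the local server. EXPLAIN without ANALYZE only plans the
-- statement, so nothing is modified on either side.
EXPLAIN (VERBOSE)
UPDATE remote.big_table SET content = content || '.';

-- The same check for a DELETE (the id filter is hypothetical).
EXPLAIN (VERBOSE)
DELETE FROM remote.big_table WHERE id < 1000;
```

With VERBOSE, the plan includes the SQL that will be sent to the remote server in a "Remote SQL" line. If that line holds the complete UPDATE or DELETE, the statement is being pushed down. If it shows a SELECT ... FOR UPDATE alongside a parameterised statement ending in WHERE ctid = $1, you're still on the row-by-row path.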

There is a downside to this new commit, and I haven't yet checked whether there's a plan to rectify it. Previously, if you ran an UPDATE or DELETE statement and then cancelled it on the local side, the remote side would cancel and roll back: instead of continuing to send individual UPDATE or DELETE statements, the local server would issue an abort. But with the current changes, the abort doesn't reach the remote server, or at least not in time.

This is part 2 in a 4-part blog series on improvements coming in PostgreSQL 9.6.

Part 1 - Horizontal Scalability
- Join Pushdown
- Sort Pushdown
- "Walk and chew gum at the same time"
Part 2 - Monitoring
- VACUUM progress monitoring
- Wait monitoring
Part 3 - Parallelism
Part 4 - Vertical Scalability

VACUUM progress monitoring

When you run VACUUM on a large table, do you know how much it's done after 5 minutes? Do you know how much more there is to go? The answer is no. You can't know, at least not accurately or easily. In PostgreSQL 9.6, you'll be able to monitor just this (although not VACUUM FULL), thanks to the work of Amit Langote (NTT) and Robert Haas (EnterpriseDB), which was built upon the work done by Rahila Syed (EnterpriseDB) and Vinayak Pokale (NTT). We now have a system view called pg_stat_progress_vacuum. This is, hopefully, the first of several progress monitoring views to come in future. Here's an example of what it shows you (shown in expanded output format so that it fits on this blog page):

-[ RECORD 1 ]------+--------------
pid                | 13612
datid              | 16384
datname            | test
relid              | 16385
phase              | scanning heap
heap_blks_total    | 335141
heap_blks_scanned  | 186055
heap_blks_vacuumed | 0
index_vacuum_count | 0
max_dead_tuples    | 44739242
num_dead_tuples    | 3849285

The PostgreSQL docs explain what each of these columns contains: http://www.postgresql.org/docs/devel/static/progress-reporting.html#PG-STAT-PROGRESS-VACUUM

In the example above, we can see we have 335141 heap blocks in the table being scanned. Each block is 8kB in size, so this represents a 2618MB table. Next, we see that 186055 blocks have been scanned, so 1454MB. That means we have approximately 1164MB left to scan, and we're 55.5% into the scan phase of the VACUUM. You can see that we haven't actually VACUUMed any of the table yet (heap_blks_vacuumed), and the current phase reports "scanning heap". There are 7 phases for VACUUMs that will be visible in this view:

initializing
scanning heap
vacuuming indexes
vacuuming heap
cleaning up indexes
truncating heap
performing final cleanup

These are all described in more detail in the documentation: http://www.postgresql.org/docs/devel/static/progress-reporting.html#VACUUM-PHASES
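The arithmetic in the worked example above doesn't have to be done by hand. Here's a minimal sketch that lets the view do it, assuming PostgreSQL 9.6 and a session connected to the database being vacuumed; the output column names (table_name, table_size, scanned, pct_scanned) are my own.

```sql
SELECT p.pid,
       p.relid::regclass AS table_name,
       p.phase,
       -- block_size is reported in bytes (8192 unless the server was built differently)
       pg_size_pretty(p.heap_blks_total * current_setting('block_size')::bigint) AS table_size,
       pg_size_pretty(p.heap_blks_scanned * current_setting('block_size')::bigint) AS scanned,
       -- NULLIF avoids a division by zero for an empty table
       round(100.0 * p.heap_blks_scanned / NULLIF(p.heap_blks_total, 0), 1) AS pct_scanned
FROM pg_stat_progress_vacuum AS p
-- relid can only be resolved to a table name in the database it belongs to
WHERE p.datname = current_database();
```

For the record shown earlier, pct_scanned would come out at 55.5, matching the figure worked out above.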

What we have here is a very useful tool for monitoring VACUUM progress, which was previously not possible. DBAs used to have to wait until a VACUUM finished. Note that this new feature doesn't give us an estimate of how much VACUUM time remains. This is because a VACUUM may still need to wait, or it may skip pages that it knows don't need VACUUMing, so a time estimate would be unreliable. You could craft a query that links pg_stat_progress_vacuum with pg_stat_activity to work out how long it's taken so far and, based on how many blocks it got through in that time, how long it would take to complete the remaining blocks. But that would still only apply to an individual phase, and wouldn't give you a trustworthy ETA.
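For what it's worth, such a query might look something like the sketch below. It only covers the "scanning heap" phase, it measures elapsed time from query_start (so it includes whatever happened before the scan began), and it assumes the remaining blocks will be scanned at the same rate as the ones already done, which is exactly the assumption that makes the result untrustworthy. The column names elapsed and naive_scan_time_left are mine.

```sql
SELECT p.pid,
       p.phase,
       now() - a.query_start AS elapsed,
       -- naive extrapolation: time so far, scaled by blocks left over blocks done
       (now() - a.query_start)
         * ((p.heap_blks_total - p.heap_blks_scanned)::float8
            / NULLIF(p.heap_blks_scanned, 0)) AS naive_scan_time_left
FROM pg_stat_progress_vacuum AS p
JOIN pg_stat_activity AS a ON a.pid = p.pid
WHERE p.phase = 'scanning heap';
```

Treat the last column as a rough guide at best. It says nothing about the index and heap vacuuming phases that follow the scan.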

I did say that this was the first progress monitoring view, and that more may come in future. These may include progress on REINDEX, ANALYZE, VACUUM FULL/CLUSTER and ALTER TABLE (where it results in a table rewrite).

Wait monitoring

DBAs have relied on pg_stat_activity to provide information about what's currently happening on the database. This returns information about each connection in the cluster, including the queries being run, which users are connected to which database, when the connection was made, whether it's active or idle, and various other bits of information.

But now it's been updated to include 2 new columns: wait_event_type and wait_event. This is thanks to the work of Amit Kapila (EnterpriseDB) and Ildus Kurbangaliev (PostgresPro). These provide introspection on a level not previously possible, and return information about what that connection is currently waiting on (if anything). This can be a huge help in debugging performance issues, or working out what's holding a query up. For example, if there's a hold-up waiting to write to the Write-Ahead Log, we'll see wait_event_type set to 'LWLockNamed' and wait_event set to 'WALWriteLock'.

Here's a real example:

-[ RECORD 81 ]---+---------------------------------------------------------------------------------------------------------------
datid            | 16422
datname          | pgbench
pid              | 5972
usesysid         | 10
usename          | thom
application_name | pgbench
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2016-03-15 19:37:47.544535+00
xact_start       | 2016-03-15 19:38:06.277451+00
query_start      | 2016-03-15 19:38:06.281068+00
state_change     | 2016-03-15 19:38:06.28107+00
wait_event_type  | Lock
wait_event       | tuple
state            | active
backend_xid      | 176332
backend_xmin     | 175895
query            | UPDATE pgbench_branches SET bbalance = bbalance + 4430 WHERE bid = 77;

As you can see, here we have wait_event_type showing 'Lock' and wait_event showing 'tuple'. So this backend is waiting to acquire a lock on a tuple, and the UPDATE won't progress until it gets that lock.

Here's another example:

-[ RECORD 47 ]---+------------------------------
datid            | 16384
datname          | pgbench
pid              | 6052
usesysid         | 10
usename          | thom
application_name | pgbench
client_addr      | 
client_hostname  | 
client_port      | -1
backend_start    | 2016-03-17 14:43:37.424887+00
xact_start       | 2016-03-17 14:43:37.426224+00
query_start      | 2016-03-17 14:43:37.428431+00
state_change     | 2016-03-17 14:43:37.428432+00
wait_event_type  | LWLockNamed
wait_event       | WALWriteLock
state            | active
backend_xid      | 8460
backend_xmin     | 
query            | END;

So, given the information shown, we can determine that it's waiting on a Write-Ahead Log write lock. A full list of the wait event types and wait events is available in the documentation: http://www.postgresql.org/docs/devel/static/monitoring-stats.html#WAIT-EVENT-TABLE

This should be a huge help for cases where queries or commands get stuck waiting for something, where previously we wouldn't have known what they were waiting for.
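To pick the waiting sessions out of a busy pg_stat_activity, you can filter on the new columns. A quick sketch, assuming the 9.6 column names described above (the sessions alias is my own):

```sql
-- Sessions that are waiting on something right now.
SELECT pid, usename, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;

-- How many sessions are waiting on each event at this instant.
SELECT wait_event_type, wait_event, count(*) AS sessions
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;
```

Bear in mind that each run is just a snapshot of a single moment, so the second query needs to be sampled repeatedly before it tells you much about where time is really going.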

Friday 18 March 2016

PostgreSQL 9.6 - Part 1.1 - Horizontal Scalability revisited

Thursday 17 March 2016

PostgreSQL 9.6 - Part 2 - Monitoring
