Tuesday, 4 June 2013

1,001 chained streaming-only replication instances

On my laptop I've managed to create 1,001 local chained (one-to-one) streaming-only (meaning no archive directory) asynchronous replication instances: one primary followed by 1,000 standbys, each streaming from the one before it. The status output for the full list is here: https://gist.github.com/darkixion/5694200
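To make the "chained (one-to-one)" layout concrete, here is a minimal sketch of the per-standby settings such a chain would need. It is not how pg_rep_test does it. It assumes a PostgreSQL release of this era that is configured through recovery.conf, and the base port, the localhost host and the helper names are all made up for illustration. Each standby simply points at the instance one step up the chain, and there is no restore_command because nothing is archived.

```python
# Hypothetical port layout: the primary listens on BASE_PORT and
# standby n listens on BASE_PORT + n (these numbers are assumptions).
BASE_PORT = 5432


def upstream_port(n: int) -> int:
    """Port of the instance that standby n (1-based) streams from."""
    if n < 1:
        raise ValueError("standbys are numbered from 1")
    # Standby 1 streams from the primary, standby 2 from standby 1, etc.
    return BASE_PORT + n - 1


def recovery_conf(n: int) -> str:
    """recovery.conf contents for standby n in a streaming-only chain."""
    return (
        "standby_mode = 'on'\n"
        f"primary_conninfo = 'host=localhost port={upstream_port(n)}'\n"
        # Follow a timeline switch when an upstream standby is promoted.
        "recovery_target_timeline = 'latest'\n"
    )


if __name__ == "__main__":
    print(recovery_conf(1))
    print(recovery_conf(1000))
```

Every standby would also need its own data directory and its own port setting, which is a large part of why doing this by hand is so tedious.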

I also tested the promotion of the 1st standby to see if it would cope with propagation to the final 1,000th standby, and it worked flawlessly. None of this worked on my copy of Linux Mint until I'd made some adjustments to the kernel semaphore values, and it does take a while for all the standbys in the chain to reach full recovery. However, promotion propagation is very fast.

Try it for yourself (if you have enough RAM, that is). You may find it quicker to use my pg_rep_test tool. Just don't do this manually... it'll take far too long.

Thanks to Heikki for putting in the changes that made this archiveless cascading replication possible. :)

Update: some figures

So looking at the logs, it's clear why it takes so long for all 1,000 standbys to come online: each standby tries to connect to its replication host every 5 seconds, so the delay between the host coming online and the standby coming online is up to 5 seconds. In the worst case this amounts to 5,000 seconds (about 83 minutes) before they're all online and receiving a streaming replication connection. A test of this shows it taking 46 minutes 25 seconds.
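The arithmetic behind those figures can be sketched as follows. The only inputs are the numbers above; the idea that each standby's wait is spread evenly between 0 and 5 seconds (so about 2.5 seconds on average) is my assumption rather than something I measured.

```python
RETRY_INTERVAL_S = 5   # seconds between connection attempts (from the logs)
STANDBYS = 1000

# Worst case: every standby waits a full retry interval after its host is up.
worst_case_s = STANDBYS * RETRY_INTERVAL_S

# Assumption: waits are uniformly spread over 0..5 s, so half an interval each.
expected_s = STANDBYS * RETRY_INTERVAL_S / 2

# Observed: 46 minutes 25 seconds for the whole chain.
observed_s = 46 * 60 + 25
per_standby_s = observed_s / STANDBYS

if __name__ == "__main__":
    print(f"worst case: {worst_case_s} s ({worst_case_s / 60:.1f} min)")
    print(f"expected:   {expected_s:.0f} s ({expected_s / 60:.1f} min)")
    print(f"observed:   {observed_s} s ({per_standby_s:.3f} s per standby)")
```

The observed average of just under 2.8 seconds per standby sits a little above the 2.5 seconds that assumption predicts, which would leave a bit of room for each instance's own startup time.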

And as requested by Jonathan Katz (@jkatz05) I can tell you that the amount of time it takes for the promotion of the 1st standby to cause the 1,000th standby to switch to the new timeline (at least on my laptop with an SSD) is 1 minute 46 seconds, so a rate of roughly 9.4 promoted instances per second (1,000 instances in 106 seconds). As for actual data changes (in the case of my test, the creation of a table), it took about 6 seconds to reach the 1,000th standby. I re-tested with an insert of a row, and it's about the same again.

4 comments:

vincent said...

Interesting, but I can't help but wonder... why? :-)

Thom said...

Because I can. But mainly I do this stuff to see if I can find breaking points. I haven't yet tried promoting 999 standbys to see if the 1,000th standby switches through all 1,000 timelines, but that simple action takes up a lot of disk space.

Greg Smith said...

You mentioned needing enough RAM, but didn't say how much memory you have in your laptop where this was successful. The output from something like "free" after all the servers are going would be very informative if you run this again.

Thom said...

Ah yes, I mentioned it on Twitter, but forgot to here. My laptop has 16GB RAM.