Very high latency, low bandwidth replication
Hi
I have been grappling with a problem for some time and would appreciate some
advice. We have a web-based public health application with a postgresql
backing store, designed for use by public sector ministries of health in a
significant number of African, Asian and other countries
(http://dhis2.org). "Traditionally" it is hosted as a national data
warehouse application, with users dispersed amongst district offices and
sometimes health facilities around the country.
Particularly in many countries in Africa, the public sector typically has
limited data centre infrastructure to reliably host the application
in-house, so a good number have opted to use a global cloud service
(infrastructure as a service) to ensure maximum availability of the
application. Others have managed to make use of in-country resources such
as national ISPs and mobile companies. There are many cost-benefit and
governance considerations behind these decisions which I don't need to go
into here.
While ministries have been prepared to do this, there are important
reasons to ensure that a backup of the database is also maintained at the
ministry. So we attempt to grab the nightly snapshot backups from the
hosted database. In the past I have attempted this somewhat
simplistically with rsync over ssh, but it is a very inefficient approach,
particularly over weak internet connections.
What are people's thoughts about a more optimal solution? I would like to
use a more incremental approach to replication. This does not have to be a
"live" replication .. asynchronously triggering once every 24 hours is
sufficient. Also, only a subset of tables is required (the rest consist
of generated data).
Appreciate any advice.
Regards
Bob
Hi Bob.
On Mon, Jun 30, 2014 at 10:05 AM, Bob Jolliffe <bobjolliffe@gmail.com> wrote:
What are people's thoughts about a more optimal solution? I would like to
use a more incremental approach to replication. This does not have to be a
"live" replication .. asynchronously triggering once every 24 hours is
sufficient. Also there are only a subset of tables which are required (the
rest consist of data which is generated).
If you only need to replicate once every 24 hours, which means you can
tolerate lag, you could try log shipping. Instead of sending the WAL
records from master to standby directly, just spool them, compress them
as much as you can (I would try pglesslog plus xz on its output),
and send them once a day. That covers the 'incremental' part. For the
subset of tables, you could try to set up a local partial mirror using
any of the trigger-based replication products and then do log shipping
of that.
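A minimal sketch of the spool-and-compress step (pglesslog is omitted here; the segment name, contents and paths are stand-ins for what archive_command would produce):

```shell
# Stand-in for a spooled WAL segment. In reality archive_command would
# have copied 16 MB segments into this directory, and pglesslog would
# strip unneeded records from them before compression.
mkdir -p /tmp/walspool
printf 'fake wal segment' > /tmp/walspool/000000010000000000000001

# Compress aggressively before the once-a-day transfer...
xz -9 -c /tmp/walspool/000000010000000000000001 \
   > /tmp/walspool/000000010000000000000001.xz

# ...and decompress on the receiving side before replaying on the standby.
xz -dc /tmp/walspool/000000010000000000000001.xz
```

The daily batch transfer itself can then be a single rsync or scp of the .xz files, which restarts cheaply if the link drops.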
Also, the logical replication slot support added to the latest version
seems really promising for this kind of thing, but I'm not familiar
enough with it to recommend anything.
Also, depending on your data update patterns, database sizes and
other factors, a trigger-based replication approach can save a lot of
traffic. I mean, if you have records which are heavily updated but
only replicate once a day, you can collapse a whole day's changes into a
single update. I once did a similar thing to transmit deltas over a
2400 bps modem by making daily sorted dumps and diffing each against the
previous day's (it needed a bit of coding, about a couple of hundred
lines, but produced ridiculously small deltas, and with a bit of care
their application was idempotent, which simplified recovery on
errors).
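The sorted-dump delta idea can be sketched with plain coreutils (file names and rows here are illustrative): `comm` on two sorted dumps splits out additions and removals, and applying those as set operations is naturally idempotent.

```shell
cd /tmp
# Yesterday's and today's sorted table dumps (toy rows, pipe-delimited).
printf '1|alice|10\n2|bob|20\n3|carol|30\n' > yesterday.dump
printf '1|alice|10\n2|bob|25\n4|dave|40\n'  > today.dump

# comm requires sorted input. Lines only in today's dump are additions;
# lines only in yesterday's dump are removals. An UPDATE shows up as one
# removal plus one addition, so many intra-day updates collapse into one.
comm -13 yesterday.dump today.dump > added.delta
comm -23 yesterday.dump today.dump > removed.delta

cat added.delta     # prints: 2|bob|25 and 4|dave|40
cat removed.delta   # prints: 2|bob|20 and 3|carol|30
```

Only the two small delta files cross the wire; the receiving side removes the removed lines and adds the added ones to reconstruct today's dump.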
Francisco Olarte.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 30 June 2014 15:05, Bob Jolliffe <bobjolliffe@gmail.com> wrote:
What are people's thoughts about a more optimal solution? I would like to
use a more incremental approach to replication. This does not have to be a
"live" replication .. asynchronously triggering once every 24 hours is
sufficient. Also there are only a subset of tables which are required (the
rest consist of data which is generated).
WAL shipping is probably best here. Configure an archive_command on
the master to compress and push logs to cloud storage, and configure
a hot standby on site to pull and decompress the logs. The wal-e tool
may simplify pushing to cloud storage, or just follow the
PostgreSQL documentation to archive the WAL files to a filesystem.
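Concretely, a minimal configuration along these lines might look like the following (the archive paths are placeholders, and this assumes a version with hot standby, i.e. 9.0 or later):

```
# postgresql.conf on the cloud-hosted master:
archive_mode    = on
archive_command = 'xz -c %p > /var/lib/postgresql/wal_archive/%f.xz'

# recovery.conf on the hot standby at the ministry (the archive
# directory is pulled down out of band, e.g. by a nightly rsync):
standby_mode    = 'on'
restore_command = 'xz -dc /var/lib/postgresql/wal_archive/%f.xz > %p'
```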
If that isn't good enough, you can look at more esoteric approaches
(e.g. nightly plaintext dumps to a git repository, pushing changes to
disk on site).
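A hedged sketch of the dump-to-git idea (repository path, database name and committer identity are all placeholders): git keeps every nightly snapshot, and a push transfers only a compressed delta over the wire.

```shell
# Placeholder repository for the nightly snapshots.
REPO=/tmp/dhis2-backup
mkdir -p "$REPO"
cd "$REPO"
git init -q .

# In production this line would be something like:
#   pg_dump --no-owner dhis2 > dump.sql
printf 'CREATE TABLE t (id integer);\n' > dump.sql

git add dump.sql
git -c user.email=backup@example.org -c user.name=backup \
    commit -q -m "nightly snapshot"
git log --format=%s -1
```

A `git push` to an on-site mirror would then complete the transfer; because git packs objects, an almost-unchanged dump costs very little bandwidth.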
--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/
Thanks Francisco for these inputs. I hadn't considered log shipping as I
knew I didn't want to track changes to all tables (and databases). Setting
up a local partial mirror is an interesting thought which hadn't crossed my
mind .. I'll give that some consideration.
Though currently I am thinking to address the problem of generating deltas
at the application level rather than using postgresql features which are
largely optimized for a slightly different set of circumstances and
requirements.
Impressive what can be done with a 2400 baud modem when you set your mind to
it. Fortunately those days are mostly behind us :-)
Thanks Stuart. I'll do some measurements on plaintext dump to git.
Hi Bob:
On Sat, Jul 5, 2014 at 3:34 PM, Bob Jolliffe <bobjolliffe@gmail.com> wrote:
Though currently I am thinking to address the problem of generating deltas
at the application level rather than to use postgresql features which are
largely optimized for a slightly different set of circumstances and
requirements.
Yeah, special purpose can always be more efficient, especially when you
can tune for your particular circumstances, like non-full replication,
replicating only once a day, and probably being able to minimize the
changesets by knowing the data dependencies.
Impressive what can be done with a 2400 baud modem when you set your mind to
it. Fortunately those days are mostly behind us :-)
Not so behind. Back then you did that kind of thing for 100 MB data sets;
now you can use the same algorithms for 1 TB data sets. Line speeds have
gone up more than data set sizes, but that advantage has been eaten by
customer delay expectations (in my case, they now want at most 30 minutes
of delay, not 1440).
Not knowing your exact data structure or size I cannot recommend
anything, but I've been able to make really good deltas using the
same algorithms that were used on tapes: dump the table, sort it, diff
it with the previous day's. The throughput of current machines is amazing;
you can easily achieve full disk speed.
Regards.
Francisco Olarte.