Streaming replication and sharding

Started by Tiemo Kieftabout 13 years ago6 messagesgeneral

t.kieft@infinitesimal.nl

about 13 years ago

Hi,

We are developing an application that uses various web analytics packages
(like Google Analytics) to run analyses on. We are currently in closed beta
stadium where we don't have a lot of data in the database, but at some
point it will grow considerably.

We basically have two different sets of data, on the one hand we have raw
metrics from the datasource, and on the other hand we have account and meta
information. The former can be re-downloaded at any time, and will grow to
quite large sizes. The latter set is the one that we really care about, and
don't want to risk losing.

Currently we plan on using streaming replication to replicate all data to
at least one slave, for the near future this will do, since we can run some
of the large (read-only) aggregation queries on the slave database. In the
future the dataset might grow to the point where we need to start thinking
about sharding. The analytics data can be sharded on a per-customer basis,
and doesn't have to be replicated.

Since Postgres doesn't support per-table streaming replication (as far as I
can tell), the only solution would be to run two separate instances of
postgres per server. One instance is replicated to all servers, and will
contain account and other important information. The other instance is used
to store analytics data. Is this a viable way of solving this problem, or
are we overlooking something?

The problem is not really immediate, as the dataset is currently small
enough to fit on one machine (and replicated to a second), just want to be
future proof, and get this solved before the problems start.

--
- Tiemo

Laurenz Albe

laurenz.albe@cybertec.at

about 13 years ago

In reply to: Tiemo Kieft (#1)

Re: Streaming replication and sharding

Tiemo Kieft wrote:

We are developing an application that uses various web analytics packages (like Google Analytics) to
run analyses on. We are currently in closed beta stadium where we don't have a lot of data in the
database, but at some point it will grow considerably.

We basically have two different sets of data, on the one hand we have raw metrics from the datasource,
and on the other hand we have account and meta information. The former can be re-downloaded at any
time, and will grow to quite large sizes. The latter set is the one that we really care about, and
don't want to risk losing.

Currently we plan on using streaming replication to replicate all data to at least one slave, for the
near future this will do, since we can run some of the large (read-only) aggregation queries on the
slave database. In the future the dataset might grow to the point where we need to start thinking
about sharding. The analytics data can be sharded on a per-customer basis, and doesn't have to be
replicated.

Since Postgres doesn't support per-table streaming replication (as far as I can tell), the only
solution would be to run two separate instances of postgres per server. One instance is replicated to
all servers, and will contain account and other important information. The other instance is used to
store analytics data. Is this a viable way of solving this problem, or are we overlooking something?

The problem is not really immediate, as the dataset is currently small enough to fit on one machine
(and replicated to a second), just want to be future proof, and get this solved before the problems
start.

The problems I see with distributing your data across
several PostgreSQL clusters is that they become disconnected.

It will become much more difficult to keep them consistent:

You cannot have referential integrity, and if a database restore
is needed, you have to make extra provisions that you can restore
your system to a consistent state across all clusters.

You also lose the ability to join between tables that are
distributed across different databases. This can be a
perdormance problem, particularly in an OLAP scenario.

That's all I can think of at the moment.

Yours,
Laurenz Albe

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

René Romero Benavides

ichbinrene@gmail.com

about 13 years ago

In reply to: Laurenz Albe (#2)

Re: Streaming replication and sharding

I'm also interested in leveraging something like that, these are my thoughts:

- Have a master server with all the data
- Have the data partitioned vertically (inheritance, exclusion constrains,
etc)
- One synchronous slave
- N asynchronous slaves that feed from the synchronous slave (cascading
replication)
- Use plproxy to enforce that one instance serves queries of only one portion
of the data (pseudo-sharding). Also with plproxy is possible to parallelize
some queries.

Advantages:
	+ Highly redundant
	+ Lends itself to automatic failover to the syncrhonous slave
	+ Read scalable
	+ No synchronization conflicts among "shards"
	+ Some queries could be parallelized

Disadvantages:
	+ Plproxy works only with functions
	+ Writes can be performed only in the master ( not write scalable)
	+ Needs good communication infrastructure
	+ Table structures need to be simple.
	+ At some point (failover), you'll need to implement plproxy automatic re-
configuration

On Monday, February 18, 2013 08:00:33 AM Albe Laurenz wrote:

Tiemo Kieft wrote:

We are developing an application that uses various web analytics packages
(like Google Analytics) to

run analyses on. We are currently in closed

beta stadium where we don't have a lot of data in the database, but at
some point it will grow considerably.

We basically have two different sets of data, on the one hand we have raw
metrics from the datasource,

and on the other hand we have account and

meta information. The former can be re-downloaded at any time, and will
grow to quite large sizes. The latter set is the one that we really care
about, and don't want to risk losing.

Currently we plan on using streaming replication to replicate all data to
at least one slave, for the

near future this will do, since we can run

some of the large (read-only) aggregation queries on the slave database.
In the future the dataset might grow to the point where we need to start
thinking about sharding. The analytics data can be sharded on a
per-customer basis, and doesn't have to be replicated.

Since Postgres doesn't support per-table streaming replication (as far as
I can tell), the only

solution would be to run two separate instances of

postgres per server. One instance is replicated to all servers, and will
contain account and other important information. The other instance is
used to store analytics data. Is this a viable way of solving this
problem, or are we overlooking something?
The problem is not really immediate, as the dataset is currently small
enough to fit on one machine

(and replicated to a second), just want to

be future proof, and get this solved before the problems start.

The problems I see with distributing your data across
several PostgreSQL clusters is that they become disconnected.

It will become much more difficult to keep them consistent:

You cannot have referential integrity, and if a database restore
is needed, you have to make extra provisions that you can restore
your system to a consistent state across all clusters.

You also lose the ability to join between tables that are
distributed across different databases. This can be a
perdormance problem, particularly in an OLAP scenario.

That's all I can think of at the moment.

Yours,
Laurenz Albe

--
René Romero Benavides @iCodeiExist @PgsqlMx

Postgresql Tips en español para la comunidad de México e Hispanoamérica.
http://postgresql.org.mx

René Romero Benavides

ichbinrene@gmail.com

about 13 years ago

In reply to: Laurenz Albe (#2)

Re: Streaming replication and sharding

There's also one project called Postgres-XC which seems very promising, but
I'm kind of wary about its level of maturity.

On Monday, February 18, 2013 08:00:33 AM Albe Laurenz wrote:

Tiemo Kieft wrote:

We are developing an application that uses various web analytics packages
(like Google Analytics) to

run analyses on. We are currently in closed

beta stadium where we don't have a lot of data in the database, but at
some point it will grow considerably.

We basically have two different sets of data, on the one hand we have raw
metrics from the datasource,

and on the other hand we have account and

meta information. The former can be re-downloaded at any time, and will
grow to quite large sizes. The latter set is the one that we really care
about, and don't want to risk losing.

Currently we plan on using streaming replication to replicate all data to
at least one slave, for the

near future this will do, since we can run

some of the large (read-only) aggregation queries on the slave database.
In the future the dataset might grow to the point where we need to start
thinking about sharding. The analytics data can be sharded on a
per-customer basis, and doesn't have to be replicated.

Since Postgres doesn't support per-table streaming replication (as far as
I can tell), the only

solution would be to run two separate instances of

postgres per server. One instance is replicated to all servers, and will
contain account and other important information. The other instance is
used to store analytics data. Is this a viable way of solving this
problem, or are we overlooking something?
The problem is not really immediate, as the dataset is currently small
enough to fit on one machine

(and replicated to a second), just want to

be future proof, and get this solved before the problems start.

The problems I see with distributing your data across
several PostgreSQL clusters is that they become disconnected.

It will become much more difficult to keep them consistent:

You cannot have referential integrity, and if a database restore
is needed, you have to make extra provisions that you can restore
your system to a consistent state across all clusters.

You also lose the ability to join between tables that are
distributed across different databases. This can be a
perdormance problem, particularly in an OLAP scenario.

That's all I can think of at the moment.

Yours,
Laurenz Albe

--
René Romero Benavides @iCodeiExist @PgsqlMx

Postgresql Tips en español para la comunidad de México e Hispanoamérica.
http://postgresql.org.mx

Michael Paquier

michael@paquier.xyz

about 13 years ago

In reply to: René Romero Benavides (#4)

Re: Streaming replication and sharding

On Tue, Feb 19, 2013 at 6:01 AM, René Romero Benavides <ichbinrene@gmail.com

wrote:

**

There's also one project called Postgres-XC which seems very promising,
but I'm kind of wary about its level of maturity.

Project page is here: https://sourceforge.net/projects/postgres-xc/
The latest version 1.0 released last year in June lacks of a couple of
features widely used these days like returning or triggers, but the team is
working hard in implementing that for 1.1 planned for April-June this year
with many other things.
--
Michael

René Romero Benavides

ichbinrene@gmail.com

about 13 years ago

In reply to: Michael Paquier (#5)

Re: Streaming replication and sharding

Looking forward to that release and best wishes for the project in general.

On Tuesday, February 19, 2013 01:02:34 PM Michael Paquier wrote:

On Tue, Feb 19, 2013 at 6:01 AM, René Romero Benavides <ichbinrene@gmail.com>
wrote:

There's also one project called Postgres-XC which seems very promising, but
I'm kind of wary about its level of maturity.
Project page is here: https://sourceforge.net/projects/postgres-xc/
The latest version 1.0 released last year in June lacks of a couple of
features widely used these days like returning or triggers, but the team is
working hard in implementing that for 1.1 planned for April-June this year
with many other things.

--
Michael

--
René Romero Benavides @iCodeiExist @PgsqlMx

Postgresql Tips en español para la comunidad de México e Hispanoamérica.
http://postgresql.org.mx