PostgreSQL Documentation of High Availability and Load Balancing

Started by Markus Wannerover 19 years ago25 messagesdocs

markus@bluegap.ch

over 19 years ago

Hi,

We are about to extend the PostgreSQL documentation for high
availability and load balancing. We've had some discussions so far, but
only few people have spoken up until now. I'd really like to get some
more input from all the different projects providing solutions in this area.

You can find the current draft here:
http://momjian.us/main/writings/pgsql/sgml/high-availability.html

Please have a look and check if your project fits somewhere and if the
description is correct. I'll happily forward your inputs if you are not
subscribed to the (AFAIK closed) PostgreSQL docs mailing list.

Regards

Markus

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Markus Wanner (#1)

Re: PostgreSQL Documentation of High Availability and Load Balancing

Hello Jose,

Josï¿½ Orlando Pereira wrote:

Hi,

Thanks for the heads up. I mostly agree with the text (except maybe the
paragraph on Oracle RAC and 2PC).

Hm, what's wrong with that? Okay, we should better not mention Oracle
RAC there, but it is a product doing 'Multi-Master Replication Using
Clustering', isn't it?

Regarding GORDA, I would not try to fit it into existing bullets, as our
current prototype implements several variations of query broadcasting,
multi-master and master/slave, although none with production quality.

AFAIK, we want to help the users to get a general understanding about
replication, high availability and load balancing. So as to give them a
good starting point on their search for a solution to their problem(s).

Thus it's okay if you say GORDA is prototyping several of the algorithms
mentioned. I just would like to make sure that the descriptions are
general enough to be appropriate for most implementations.

For example, recently Josh Berkus proposed, that 'statement-based
replication' was the more common term for 'query broadcasting'. Do you
agree with that? What did you call it? Does the description fit?

Instead, I'd point out PostgreSQL's friendliness to research and the resulting
prototypes that stem from academia. This can help to draw community attention
to our efforts.

Thank you for your suggestions. And I'm glad you're seeing PostgreSQL
that way. But I think your additions don't quite fit into the
documentation because they are too promotional.

I should probably have given some more details about previous
discussions. One consensus we've reached was, that we don't want to keep
a quickly changing list of open source projects in our documentation. So
probably nether GORDA nor Postgres-R will get mentioned there.

But for sure we'll compile a list of available replication solutions
*somewhere* on the website, where we can change it as often as we want.
I'm sure GORDA will get mentioned there.

@pgsql-docs: BTW, what's the state on that one? Or do I have to ask that
on -www?

Regards

Markus

Import Notes

Reply to msg id not found: 200611201325.18896.jop@di.uminho.pt

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Markus Wanner (#1)

Re: [Pgcluster-general] PostgreSQL Documentation of High Availability

Hi,

a.mitani@sra-europe.com wrote:

Current generation of PGCluster is a Shared-Nothing type of multi-master
and syncronous replication system.

Thank you for pointing us to yet another very common distinction in the
clustering world: shared-nothing vs. shared-disk or even
shared-everything. We don't touch that in the current documentation. Do
we want or need to do so?

I think that the feature of this type of replication system is as the
'Multi-Master Replication Using Clustering' chapter of your document.

Most probably, yes. Please note that it's not *my* document :-) Bruce
Momjian wrote most of it, with only some hints and annoying nit-picking
from my side.

However, Oracle RAC is a Shared-Everything type of multi-master clustering
system. If it set up appropriately, most of these limitations would be
improved.

Shared-Everything, really? I thought they did their own distributed
shared memory or distributed locking stuff, so it would be shared-disk.
And together with their OCFS, they would reach shared-nothing. But I
don't really know.

@pgsql-docs: I'd strongly vote for not mentioning Oracle if we don't
event want to mention proprietary products for PostgreSQL. There are
enough research or ongoing projects (even some ongoing reserch projects
;-) ) to mention. PgCluster-II, GORDA, Slony-II or Postgres-R come to mind.

Next generation of PGCluster (I named PGCluster-II) will be a
Shared-Everything type of multi-master clustering system as demonstrated
in Toronto.

Yeah, I remember that demonstration. Do you think PGCluster-II fits
what's described under 'Multi-Master Replication Using Clustering'? Do
you think we should explain Shared-Nothing vs. Shared-Disk vs.
Shared-Everything there?

Regards

Markus

Import Notes

Reply to msg id not found: 39022.194.26.187.1.1164038087.squirrel@82.192.72.130

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Markus Wanner (#1)

Re: [Pgcluster-general] PostgreSQL Documentation of High Availability

Hi,

a.mitani@sra-europe.com wrote:

I believe that shared-everything and shared-disk are cmpletely different
things.

No. AFAIK, shared everything is basically a marketing term. Some vendors
(especially Oracle) use it to mean 'shared disk', which is really
confusing. But that's probably intentional, as it sounds good to share
everything... much better than sharing nothing.

Others, like GreenPlum use the term to mean 'shared memory' for example
in [1]GreenPlum about Shared Nothing vs Shared Everything: http://www.greenplum.com/products/sharedNothing.php, which is much more appropriate.

Anyway, the term 'Shared Nothing' seems to go back to Stonebraker's
paper 'The Case for Shared Nothing Architecture' [2]Michael Stonebraker, The Case for Shared Nothing Architecture: http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf. There he defined
these three terms:

shared memory (SM): multiple processors share a common central memory
shared disk (SD): multiple processors each with private memory share
a common collection of disks
shared nothing (SN): neither memory nor peripheral storage is shared
among processors

As it makes no sense to have shared memory but individual disks, no term
for such a thing got defined.

Oracle has said the RAC as Shared-Everything.
Probably, the definition of the words would differ from them.

It's hard to find architectural details within their documents, but
AFAICS, they do something like distributed shared memory or distributed
locking, which they call "Cache Fusion", see [3]A random Oracle Paper about "Cache Fusion" (tm): http://www.oracle.com/technology/products/oracle9i/pdf/cache_fusion_rel2.pdf.

In the very same paper, they define what a Cluster is for them: "A
cluster is a group of independent servers that cooperate as a single
system. The primary cluster components are processor nodes, a cluster
interconnect, and a shared disk subsystem. The clusters share disk
access and resources that manage data, but the distinct hardware cluster
nodes do not share memory."

Thus, Oracle RAC seems to be a shared disk solution. Only in conjunction
with their OCFS, you could probably call it a shared-nothing solution,
but it's certainly not a shared-memory thing.

I'd vote for explaining these terms in the PostgreSQL documentation, as
there seems to be a lot of confusion regarding these terms.

Regards

Markus

[1]: GreenPlum about Shared Nothing vs Shared Everything: http://www.greenplum.com/products/sharedNothing.php
http://www.greenplum.com/products/sharedNothing.php

[2]: Michael Stonebraker, The Case for Shared Nothing Architecture: http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf
http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf

[3]: A random Oracle Paper about "Cache Fusion" (tm): http://www.oracle.com/technology/products/oracle9i/pdf/cache_fusion_rel2.pdf
http://www.oracle.com/technology/products/oracle9i/pdf/cache_fusion_rel2.pdf

Import Notes

Reply to msg id not found: 52419.194.26.187.1.1164041161.squirrel@82.192.72.130

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Markus Wanner (#1)

Re: PostgreSQL Documentation of High Availability and

Nice idea.

---------------------------------------------------------------------------

Markus Schiltknecht wrote:

Hi,

We are about to extend the PostgreSQL documentation for high
availability and load balancing. We've had some discussions so far, but
only few people have spoken up until now. I'd really like to get some
more input from all the different projects providing solutions in this area.

You can find the current draft here:
http://momjian.us/main/writings/pgsql/sgml/high-availability.html

Please have a look and check if your project fits somewhere and if the
description is correct. I'll happily forward your inputs if you are not
subscribed to the (AFAIK closed) PostgreSQL docs mailing list.

Regards

Markus

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Markus Wanner (#3)

Re: [Pgcluster-general] PostgreSQL Documentation of High

Markus Schiltknecht wrote:

Hi,

a.mitani@sra-europe.com wrote:

Current generation of PGCluster is a Shared-Nothing type of multi-master
and syncronous replication system.

Thank you for pointing us to yet another very common distinction in the
clustering world: shared-nothing vs. shared-disk or even
shared-everything. We don't touch that in the current documentation. Do
we want or need to do so?

I feel the shared-* issue splits us up like master/slave and
multi-master splits up --- it added more confusion than clarity, because
many solutions fell in the middle.

I think that the feature of this type of replication system is as the
'Multi-Master Replication Using Clustering' chapter of your document.

Most probably, yes. Please note that it's not *my* document :-) Bruce
Momjian wrote most of it, with only some hints and annoying nit-picking
from my side.

However, Oracle RAC is a Shared-Everything type of multi-master clustering
system. If it set up appropriately, most of these limitations would be
improved.

Shared-Everything, really? I thought they did their own distributed
shared memory or distributed locking stuff, so it would be shared-disk.
And together with their OCFS, they would reach shared-nothing. But I
don't really know.

Yea, gets confusing.

@pgsql-docs: I'd strongly vote for not mentioning Oracle if we don't
event want to mention proprietary products for PostgreSQL. There are
enough research or ongoing projects (even some ongoing reserch projects
;-) ) to mention. PgCluster-II, GORDA, Slony-II or Postgres-R come to mind.

Good point. I mentioned Oracle RAC only because it seems to be an
industry standard, so by mentioning it, people know exactly what we are
talking about. Is there a better way? And people do ask for Oracle
RAC, so in a way we are telling them we don't have something similar.
As sad as that is, it is true currently.

Next generation of PGCluster (I named PGCluster-II) will be a
Shared-Everything type of multi-master clustering system as demonstrated
in Toronto.

Yeah, I remember that demonstration. Do you think PGCluster-II fits
what's described under 'Multi-Master Replication Using Clustering'? Do
you think we should explain Shared-Nothing vs. Shared-Disk vs.
Shared-Everything there?

pgcluster is must closer to Oracle RAC, but I haven't mentioned it
because I am unsure where it is in terms of usability and stability.
Comments?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Markus Wanner (#4)

Re: [Pgcluster-general] PostgreSQL Documentation of

Markus Schiltknecht wrote:

In the very same paper, they define what a Cluster is for them: "A
cluster is a group of independent servers that cooperate as a single
system. The primary cluster components are processor nodes, a cluster
interconnect, and a shared disk subsystem. The clusters share disk
access and resources that manage data, but the distinct hardware cluster
nodes do not share memory."

Thus, Oracle RAC seems to be a shared disk solution. Only in conjunction
with their OCFS, you could probably call it a shared-nothing solution,
but it's certainly not a shared-memory thing.

I'd vote for explaining these terms in the PostgreSQL documentation, as
there seems to be a lot of confusion regarding these terms.

OK, but how does explaining the terms help our users?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Robert Treat

xzilla@users.sourceforge.net

over 19 years ago

In reply to: Markus Wanner (#2)

Re: PostgreSQL Documentation of High Availability and Load Balancing

On Monday 20 November 2006 09:52, Markus Schiltknecht wrote:

But for sure we'll compile a list of available replication solutions
*somewhere* on the website, where we can change it as often as we want.
I'm sure GORDA will get mentioned there.

@pgsql-docs: BTW, what's the state on that one? Or do I have to ask that
on -www?

Normally the www list is the best place to ask www related questions, but the
answer to this is to take a look at the techdocs section of the website
(http://www.postgresql.org/docs/techdocs/), it is a wiki like system that
allows any community member to put online guides and documentation into
place, made specifically for the purpose of things like what your looking
for.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Bruce Momjian (#6)

Re: [Pgcluster-general] PostgreSQL Documentation of High Availability

Hi,

Bruce Momjian wrote:

I feel the shared-* issue splits us up like master/slave and
multi-master splits up

No, not quite. To sum up, I'd say the following combinations make sense:

sync, multi-master replication on shared-memory cluster (which is much
like a super-computer. With shared memory distributing locks does not
cost much - beside marketing, there is probably not much sense in
calling this a cluster at all).

sync, multi-master replication on shared-disk cluster (where locks and
memory-caches have to be synchronized. OracleRAC and PgCluster-II fit in
here.)

(Probably running an async replication on a shared-disk cluster would
make sense with MVCC and in some corner cases, but I don't see much
benefits in that.)

sync, multi-master replication on shared-nothing cluster (where locks,
caches and data needs to be synchronized over an interconnect.
Postgres-R, PgCluster, PgPool)

(sync, single-master replication does not make much sense, because if
you go sync at all, you could as well use the nodes which run in sync).

async, multi-master replication on shared-nothing cluster (i.e. Slony-I)

async, single-master replication on shared-nothing cluster (mainly for
failover purpose, you mention solutions for that)

For me these categorizations are important and help a good deal to
ensure what I'm talking about with somebody. The documentation is much
more focused on individual solutions, sometimes avoiding to categorize
them. I would love to get others opinions, but as not many others speak
up, I just accept it that way.

Yea, gets confusing.

Well, Oracle also does a good deal in making it confusing, IMO.

Good point. I mentioned Oracle RAC only because it seems to be an
industry standard, so by mentioning it, people know exactly what we are
talking about.

That's a point, even if I don't really know how much of an industry
standard it is. But given how badly Oracle does in explaining basics of
replication and clustering, I think it's not very beneficial.

Is there a better way? And people do ask for Oracle
RAC, so in a way we are telling them we don't have something similar.
As sad as that is, it is true currently.

How far is PGCluster-II? Does it make sense to mention it? Can
PGCluster-II be used with network filesystems like NFS, OCFS2 or the like?

pgcluster is must closer to Oracle RAC,

Why do you think so? Oracle RAC is mainly based on a shared disk
cluster, where PGCluster bases on a shared nothing architecture.
PGCluster-II seems closer to Oracle RAC, for me.

but I haven't mentioned it
because I am unsure where it is in terms of usability and stability.
Comments?

Did you work on it since Toronto, Mitani-San?

Regards

Markus

#10

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Bruce Momjian (#7)

Re: [Pgcluster-general] PostgreSQL Documentation of High Availability

Hello Bruce,

Bruce Momjian wrote:

OK, but how does explaining the terms help our users?

As we even have on sort-of-a solution for shared disk clusters (the
Shared Disk Failover part), we should explain this term (as you already
do there).

Clarifying that all other solutions are for shared nothing clusters
makes sense, IMO. We don't necessarily need to go into shared memory and
the confusion which shared everything introduced. OTOH, where else to
enlighten people about that if not in such a documentation?

To answer your question: by explaining these terms, they are
demystified. The users will understand the experts better and have some
fundamental terms which they can base their discussion on. Of course
it's questionable how far to go, and we are debating just that now, I think.

But I have no doubt in the OSS tradition of good documentation. Long
live the saying 'RTFM'! :-)

Regards

Markus

#11

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Markus Wanner (#1)

Re: [Sequoia] PostgreSQL Documentation of High Availability and Load

Hello Emmanuel,

Emmanuel Cecchet wrote:

I just quickly went through the classification.
I don't think that your description of 'multi-master replication using
clustering' is correct. Oracle RAC is a shared disk approach and just
send cache invalidations to other nodes but not actual data. As the disk
is shared, data is only commited once to disk and there is a distributed
locking protocol to make nodes agree on a serializable transactional order.

I agree with that. AFAI understand, we are not entirely sure how much of
that we want to cover in our documentation.

You can have a look at the last ApacheCon presentation I gave
(http://www.continuent.org/uploads/sequoia/Resources/2006-08-15Cecchet_ApacheConAsia2006.pdf),

thank you for that link, I'll read through it ASAP.

there is a comparison of most solutions. I think that it would be cool
to have a comparison matrix with the different features that you may
need in an HA solution and see how the different solutions can provide
an answer to these problems.

We've already come to the consensus, that we don't want to list all
products in our documentation, but better do that on the website, where
it can easily be changed at any time. The documentation should explain
fundamentals.

What you describe as 'Statement-Based Replication Middleware' which is
in fact multi-master replication at the middleware level, is usually
more than just a simple proxy that broadcasts all queries. It must
provide 1-copy-serializability if you want the cluster to be consistent.

Yes, the current description is very much targeted at pgpool. Thank you
very much for your input, that helps to clarify things.

This means that write queries must be sent in the same serializable
order to every node. As SQL is interpreted, macros can be replaced on
the fly by the middleware with cluster-wide values. About sequences,
calls to sequences must be broadcast as if it were writes. This is
needed to update sequences the same way on all nodes. Note that
sequences are parts of unrollbackable changes of the database. This
means that even if a transaction rollbacks at one node, its sequence
will not be rollbacked meaning that all other nodes must also play
rollbacked transactions (at runtime or recovery time) if they were
accessing sequences or involving other operations that the database
cannot rollback.
From a performance standpoint, the middleware approach can balance
queries that come from the same client connection to different nodes
which is not the case when you are directly connected to a database
instance. Also failover can be made fully transparent at the middleware
level (any node failure can be completely hidden to the client). With
Sequoia, we even hide middleware failures by providing transparent
failover code in our driver.
The discussion should add a section describing real HA features (not
just load balancing):
- Do you lose data on failure?
- Will my transactions fail on a node failure?
- Is failover transparent? Does it need special support in the client
application?
- Is failback a manual process?
- How long does it take to restart/resynchronize a node?
- Can I add nodes on the fly to the cluster?
- Can I do maintenance operations without stopping the cluster?
- Can I upgrade the cluster without interruptions (e.g. migrate from
PostgreSQL 7.4 to 8.0 without interruption) ?
- WAN support? How network partitions are handled?

These are good questions to analyze a certain solution. As far as our
documentation is concerned, I think giving rough estimates for
categories of replication algorithms is sufficient (i.e. stating that
Multi Master Replication scales very good for reading transactions, but
not very well for writing ones).

When configured with RAIDb-1,

I know RAID-1, but what's a RAIBd-1?

Sequoia provide a multi-master (full)
replication at the middleware level. Note that Sequoia also allows for
partial replication or data partitioning (but the granularity is the
table meaning that you can only distribute tables on different nodes but
not slice a table and split it on different nodes as you describe in
Data Partitioning).

Thanks again for this great work and I hope this will help improve the
documentation.

Sure, thank you very much for your input. I'm glad we have the
possibility to cover sequoia, too.

Note that there is also Postgres-R that is another approach to
middleware-based replication.

Middleware? Mrs. Kemme also did some research with middleware
replication, but I wouldn't exactly call Postgres-R a middleware-based
replication solution. I've ported Postgres-R to PostgreSQL 8.2 and can
assure you that it's very well embedded into the backend of PostgreSQL.
Probably too well for some ;-) You might want to check out my (slightly
outdated) website www.postgres-r.org.

Regards

Markus

Import Notes

Reply to msg id not found: 4562CBC9.2010208@continuent.com

#12

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Markus Wanner (#1)

Re: PostgreSQL Documentation of High Availability and Load Balancing

Hi,

Josï¿½ Orlando Pereira wrote:

Hm, what's wrong with that? Okay, we should better not mention Oracle
RAC there, but it is a product doing 'Multi-Master Replication Using
Clustering', isn't it?

AFAIK, RAC uses a shared disk, thus it does not provide replication.

Oh, that's right. Hm... thus there are no such things as Multi-Master
Replication for shared-disk or shared-memory machines, because that's
not replication. My fault, sorry.

And I
don't think RAC can be emulated at all at the application level with 2PC.

No, that would not make sense. The paragraph is about Multi Master
Replication, which I thought Oracle RAC would be in. But I agree that
Oracle RAC should not be considered replication at all.

What do you think about sharing disks by the means of network file
systems, like OCFS2? I was under the impression that Oracle built that
one to run RAC on top of it. That combination would run on a shared
nothing cluster, but does that make it replication?

According to you, what category does Oracle RAC (and PGCluster-II)
belong to? Shared Disk Clusters?

Classifying replication protocols is indeed a hard problem. Besides my issues
with the multi-master replication using clustering category, I miss a
reference to multi-master asynchronous replication (and thus, to
reconciliation), which is a big issue in Oracle, MS SQL, etc literature.

Yeah, I'm missing that, too.

Well, the docu talks about async and sync, but IMO, it's somewhat sloppy
in that it only covers one aspect of synchronous replication (namely
that a failover will not loose data).

The other statement, that 'servers will return consistent results with
no propagation delay' is somewhat uncorrect, as there certainly is a
delay of propagation before the commit. And in that the individual
databases are very well consistent, just not synchronous.

Emmanuel Cecchet listed some questions one might use to categorize or
further specify aspects of synchronous replication in [1]Emmanuel Cecchet: https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html.

The current paragraph doesn't even clearly state that it's talking about
synchronous replication. Maybe we want to have only one paragraph for
Multi-Master replication and cover sync as well as async there?

Coming from a fault-tolerant distributed systems background, we'd call
that "replicated state machine" or "active replication". I don't think
however that using those names in this context would be helpful.

Wikipedia has a definition of replicated state machine in [2]Wikipedia definition of replicated state machine: http://en.wikipedia.org/wiki/State_machine_replication. I'm not
keen to use that term.

Thank you for your suggestions. And I'm glad you're seeing PostgreSQL
that way. But I think your additions don't quite fit into the
documentation because they are too promotional.

Hey, you can't blame me for trying... ;)

No, it's more that I'm sorry for not having explained better what we need.

Ok, I understand your motivations. I agree with the listing replication
solutions somewhere on the website. I'd still add the research and innovation
bullet,

Yes, pointing to that surely won't hurt.

instead of trying to squeeze group-communication based stuff in
existing bullets.

I see 2PC, shared memory and locking and using a GCS as implementation
details of sync, multi-master replication. I'd even put statement-based
replication in there, but one can reasonably argue about that. Anyway,
if at all, those should only be quickly mentioned as possible
implementations. But I don't think it helps to go that far. Having a
good description of sync MM and async MM replication is certainly
sufficient there.

Again, thank you very much for your inputs.

Regards

Markus

[1]: Emmanuel Cecchet: https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

[2]: Wikipedia definition of replicated state machine: http://en.wikipedia.org/wiki/State_machine_replication
http://en.wikipedia.org/wiki/State_machine_replication

Import Notes

Reply to msg id not found: 200611211030.46180.jop@di.uminho.pt

#13

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Markus Wanner (#1)

Re: [Sequoia] PostgreSQL Documentation of High Availability and Load

Hi,

Emmanuel Cecchet wrote:

I think that you can still have a matrix with the main
features(performance, data loss/failover/failback on node failure,
Disaster recovery, WAN, ...) and how each approach (master/slave, shared
disk, multi-master, ...) addresses the issue.

Yes, I certainly agree with that.

These are good questions to analyze a certain solution. As far as our
documentation is concerned, I think giving rough estimates for
categories of replication algorithms is sufficient (i.e. stating that
Multi Master Replication scales very good for reading transactions,
but not very well for writing ones).

Even here I think that there is a common misconception between
performance and scalability. Most people think that by having multiple
nodes their query will run faster which is obviously wrong if your
original workload does not saturate a single node.

Sure. Do you think that should be made clearer?

The replication
mechanisms are even adding overhead (usually perceived as increased
latency) to the query execution. It is ONLY when the workload increases
that you can see throughput going up (ideally somewhat close to the
workload increase) and query latency remaining stable. Unless you really
have a parallel query execution (that is only efficient for big queries
anyway), you will never see a performance improvement on a single query
execution since this is always the same database engine that executes
the query in the end.

I don't quite agree with that statement, but probably I'm just
misreading it. If you have enough concurrent transactions you can spread
among the nodes, you'll certainly note an improvement. After all, it's a
huge difference, if your single node is processing only ten or hundreds
of concurrent transactions.

Of course, the amount of concurrent transactions limits how far a
replication solution can scale. Having more nodes than concurrent
transactions does not make sense. (Of course with the exception of
parallel query execution.)

But you are right that full replication (in shared nothing environments)
does not perform with write heavy workload. At best it will go to the
speed of the fastest node in the cluster, but it will usually degrade
quickly. A good replication implementation will have a constant overhead
on query execution time (let's say few millliseconds). Therefore the
impact will be quite different if this is a small query or a
long-running query. Adding few milliseconds to a query that takes
seconds to execute is negligible but adding the same time to a
sub-millisecond query will be a tremendous slowdown (in term of latency).
To summarize, clustering solutions provide performance scalability
(stable latency, throughput increasing almost linearly with load) but
not performance improvement on individual query execution time.

Yes, for writing transactions, no for read-only ones (queries?). Or why
do you have to add overhead to read-only queries?

If the
client application is not multithreaded it is very unlikely that any
solution will improve the application performance.

Ehm.. I wouldn't refer to threading here. You can very well have
multiple single-process programs running on different nodes...

I'd keep referring to concurrency of transactions.

As an additional point, transactions including calls such as 'select
nextval' should be considered as write transactions with PostgreSQL.

Sure.

That might not be obvious for most users.

Agreed.

When configured with RAIDb-1,

I know RAID-1, but what's a RAIBd-1?

RAIDb is an acronym for Redundant Array of Inexpensive Databases.
You can find an article on this at
http://c-jdbc.objectweb.org/current/doc/RR-C-JDBC.pdf

Aha, thank you.

That's great that the work was revived in 8.2. Yes, Postgres-R is much
more embedded in Postgres but I was confused with Middle-R that was done
later on with Bettina and Ricardo using a similar technique at the
middlware level.

Yeah, I thought you meant that one. I don't know Middle-R at all, sorry.
Seems similar to sequoia. Did you base your work on Middle-R?

What are your development plans for Postgres-R?

To make it work and production ready as soon as possible. ;-) I'm
currently working on initialization and recovery.

Regards

Markus

Import Notes

Reply to msg id not found: 4562E138.2060808@continuent.com

#14

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Markus Wanner (#12)

Re: PostgreSQL Documentation of High Availability and

Markus Schiltknecht wrote:

Hi,

Jos? Orlando Pereira wrote:

Hm, what's wrong with that? Okay, we should better not mention Oracle
RAC there, but it is a product doing 'Multi-Master Replication Using
Clustering', isn't it?

AFAIK, RAC uses a shared disk, thus it does not provide replication.

Oh, that's right. Hm... thus there are no such things as Multi-Master
Replication for shared-disk or shared-memory machines, because that's
not replication. My fault, sorry.

OK, title now is "Multi-Master Clustering".

And I
don't think RAC can be emulated at all at the application level with 2PC.

No, that would not make sense. The paragraph is about Multi Master
Replication, which I thought Oracle RAC would be in. But I agree that
Oracle RAC should not be considered replication at all.

What do you think about sharing disks by the means of network file
systems, like OCFS2? I was under the impression that Oracle built that
one to run RAC on top of it. That combination would run on a shared
nothing cluster, but does that make it replication?

According to you, what category does Oracle RAC (and PGCluster-II)
belong to? Shared Disk Clusters?

Classifying replication protocols is indeed a hard problem. Besides my issues
with the multi-master replication using clustering category, I miss a
reference to multi-master asynchronous replication (and thus, to
reconciliation), which is a big issue in Oracle, MS SQL, etc literature.

Yeah, I'm missing that, too.

I added async multi-master:

<varlistentry>
<term>Multi-Master With Conflict Resolution</term>
<listitem>

<para>
For servers that are not regularly connected, like laptops or
remote servers, keeping data consistent among servers is a
challenge. One simple solution is to allow each server to
modify the data, and have periodic communication compare
databases and ask users to resolve any conflicts.
</para>
</listitem>
</varlistentry>

Well, the docu talks about async and sync, but IMO, it's somewhat sloppy
in that it only covers one aspect of synchronous replication (namely
that a failover will not loose data).

The other statement, that 'servers will return consistent results with
no propagation delay' is somewhat uncorrect, as there certainly is a
delay of propagation before the commit. And in that the individual
databases are very well consistent, just not synchronous.

OK, updated to add "little" delay, and removed "small" from async case:

load-balanced servers will return consistent results with little
propagation delay. Asynchronous updating has a delay between the

Emmanuel Cecchet listed some questions one might use to categorize or
further specify aspects of synchronous replication in [1].

The current paragraph doesn't even clearly state that it's talking about
synchronous replication. Maybe we want to have only one paragraph for
Multi-Master replication and cover sync as well as async there?

Does the new conflict resolution section help that?

Coming from a fault-tolerant distributed systems background, we'd call
that "replicated state machine" or "active replication". I don't think
however that using those names in this context would be helpful.

Wikipedia has a definition of replicated state machine in [2]. I'm not
keen to use that term.

Thank you for your suggestions. And I'm glad you're seeing PostgreSQL
that way. But I think your additions don't quite fit into the
documentation because they are too promotional.

Hey, you can't blame me for trying... ;)

No, it's more that I'm sorry for not having explained better what we need.

I was originally worried no one commented on my initial version of this
chapter. I am not worried any more. ;-) Actually, I think we all
understand 60% of this topic, but a different 60%, so when we are done,
it will cover 100%.

Ok, I understand your motivations. I agree with the listing replication
solutions somewhere on the website. I'd still add the research and innovation
bullet,

Yes, pointing to that surely won't hurt.

instead of trying to squeeze group-communication based stuff in
existing bullets.

I see 2PC, shared memory and locking and using a GCS as implementation
details of sync, multi-master replication. I'd even put statement-based
replication in there, but one can reasonably argue about that. Anyway,
if at all, those should only be quickly mentioned as possible
implementations. But I don't think it helps to go that far. Having a
good description of sync MM and async MM replication is certainly
sufficient there.

[1]: Emmanuel Cecchet:
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

Ah, good read. I didn't realize the shared disk aspect of Oracle RAC,
and have removed mention of RAC from our documentation. Oracle RAC
seems like an interesting hybrid solution. They use shared disk so they
don't have to send the data to all the nodes, but send cache
invalidation information to all nodes so they know when something has
changed. I have added the Oracle RAC details as an SGML comment in case
we ever need to mention it.

As far as going into the other details of what features each replication
solution has, e.g. adding nodes, etc, it is beyond the scope of this
chapter, though perhaps some of the items are appropriate.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#15

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Markus Wanner (#9)

Re: [Pgcluster-general] PostgreSQL Documentation of

Markus Schiltknecht wrote:

Hi,

Bruce Momjian wrote:

I feel the shared-* issue splits us up like master/slave and
multi-master splits up

No, not quite. To sum up, I'd say the following combinations make sense:

sync, multi-master replication on shared-memory cluster (which is much
like a super-computer. With shared memory distributing locks does not
cost much - beside marketing, there is probably not much sense in
calling this a cluster at all).

Wow, how is that different than an multi-CPU server? I guess I don't
see the point to it. The only value I see to it would be failover if
one of the servers fails, but it seems the failed server would be
holding locks that would make failover difficult to do without
restarting all the servers.

sync, multi-master replication on shared-disk cluster (where locks and
memory-caches have to be synchronized. OracleRAC and PgCluster-II fit in
here.)

OK. I didn't think pgcluster was shared disk. I thought all the
synchronization was via the network.

(Probably running an async replication on a shared-disk cluster would
make sense with MVCC and in some corner cases, but I don't see much
benefits in that.)

sync, multi-master replication on shared-nothing cluster (where locks,
caches and data needs to be synchronized over an interconnect.
Postgres-R, PgCluster, PgPool)

Yes, I think we have that one covered.

(sync, single-master replication does not make much sense, because if
you go sync at all, you could as well use the nodes which run in sync).

async, multi-master replication on shared-nothing cluster (i.e. Slony-I)

Covered.

async, single-master replication on shared-nothing cluster (mainly for
failover purpose, you mention solutions for that)

Added as a new entry calld Conflict Resolution.

For me these categorizations are important and help a good deal to
ensure what I'm talking about with somebody. The documentation is much
more focused on individual solutions, sometimes avoiding to categorize
them. I would love to get others opinions, but as not many others speak
up, I just accept it that way.

One problem I have is that we we have shared disk failover, but no other
shared case with a PostgreSQL implementation, and people don't want to
mention Oracle RAC, so why do we mention it if we have no
implementations even in the works.

Yea, gets confusing.

Well, Oracle also does a good deal in making it confusing, IMO.

Good point. I mentioned Oracle RAC only because it seems to be an
industry standard, so by mentioning it, people know exactly what we are
talking about.

That's a point, even if I don't really know how much of an industry
standard it is. But given how badly Oracle does in explaining basics of
replication and clustering, I think it's not very beneficial.

OK, agreed, removed.

Is there a better way? And people do ask for Oracle
RAC, so in a way we are telling them we don't have something similar.
As sad as that is, it is true currently.

How far is PGCluster-II? Does it make sense to mention it? Can
PGCluster-II be used with network filesystems like NFS, OCFS2 or the like?

I am waiting for email from Mitani-san, the pgcluster author.

pgcluster is must closer to Oracle RAC,

Why do you think so? Oracle RAC is mainly based on a shared disk
cluster, where PGCluster bases on a shared nothing architecture.
PGCluster-II seems closer to Oracle RAC, for me.

Oh, I am not aware of pgcluster-II. Did you mean pgpool-II? I think
so. I have mentioned pgpool-II now as part of Clustering For Parallel
Query Execution. Is that OK?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#16

Joshua D. Drake

jd@commandprompt.com

over 19 years ago

In reply to: Bruce Momjian (#15)

Re: [Pgcluster-general] PostgreSQL Documentation of

On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote:

Markus Schiltknecht wrote:

Hi,

Bruce Momjian wrote:

I feel the shared-* issue splits us up like master/slave and
multi-master splits up

No, not quite. To sum up, I'd say the following combinations make sense:

sync, multi-master replication on shared-memory cluster (which is much
like a super-computer. With shared memory distributing locks does not
cost much - beside marketing, there is probably not much sense in
calling this a cluster at all).

Wow, how is that different than an multi-CPU server?

You can't have 1000 cpus :).. You can have 1000 dual core servers.

Joshua D. Drake

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

#17

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Joshua D. Drake (#16)

Re: [Pgcluster-general] PostgreSQL Documentation of

Joshua D. Drake wrote:

On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote:

Markus Schiltknecht wrote:

Hi,

Bruce Momjian wrote:

I feel the shared-* issue splits us up like master/slave and
multi-master splits up

No, not quite. To sum up, I'd say the following combinations make sense:

sync, multi-master replication on shared-memory cluster (which is much
like a super-computer. With shared memory distributing locks does not
cost much - beside marketing, there is probably not much sense in
calling this a cluster at all).

Wow, how is that different than an multi-CPU server?

You can't have 1000 cpus :).. You can have 1000 dual core servers.

Have them share all their memory is the challenge, though. Ask IBM, they
certainly do big-irons like that.

As I said, it's like a super-computer, but you can also see it as a
cluster of CPUs with shared memory and shared disks. It's all just a
matter of your point of view.

Regards

Markus

#18

Markus Wanner

markus@bluegap.ch

over 19 years ago

In reply to: Bruce Momjian (#15)

Re: [Pgcluster-general] PostgreSQL Documentation of

Hi,

Bruce Momjian wrote:

Oh, I am not aware of pgcluster-II. Did you mean pgpool-II? I think
so.

No, I really mean PGCluster-II. Didn't you attend A. Mitani's speech
about PGCluster? Check his slides:

http://conference.postgresql.org/download/TFCKUpload/62.pdf

Regards

Markus

#19

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Joshua D. Drake (#16)

Re: [Pgcluster-general] PostgreSQL Documentation of

Joshua D. Drake wrote:

On Tue, 2006-11-21 at 16:51 -0500, Bruce Momjian wrote:

Markus Schiltknecht wrote:

Hi,

Bruce Momjian wrote:

I feel the shared-* issue splits us up like master/slave and
multi-master splits up

No, not quite. To sum up, I'd say the following combinations make sense:

sync, multi-master replication on shared-memory cluster (which is much
like a super-computer. With shared memory distributing locks does not
cost much - beside marketing, there is probably not much sense in
calling this a cluster at all).

Wow, how is that different than an multi-CPU server?

You can't have 1000 cpus :).. You can have 1000 dual core servers.

But does anyone make a shared-memory cluster that can do 1000 cpu's?
Sounds like Sequent, but I didn't think anyone was doing this anymore.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#20

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Markus Wanner (#10)

Re: [Pgcluster-general] PostgreSQL Documentation of

Markus Schiltknecht wrote:

Hello Bruce,

Bruce Momjian wrote:

OK, but how does explaining the terms help our users?

As we even have on sort-of-a solution for shared disk clusters (the
Shared Disk Failover part), we should explain this term (as you already
do there).

Clarifying that all other solutions are for shared nothing clusters
makes sense, IMO. We don't necessarily need to go into shared memory and
the confusion which shared everything introduced. OTOH, where else to
enlighten people about that if not in such a documentation?

To answer your question: by explaining these terms, they are
demystified. The users will understand the experts better and have some
fundamental terms which they can base their discussion on. Of course
it's questionable how far to go, and we are debating just that now, I think.

But I have no doubt in the OSS tradition of good documentation. Long
live the saying 'RTFM'! :-)

I figured that shared-disk/memory only really makes sense for
multi-master clustering, so I mentioned it in that paragraph:

<term>Multi-Master Clustering</term>
<listitem>

<para>
In clustering, each server can accept write requests, and
modified data is transmitted from the original server to every
other server before each transaction commits. Heavy write
activity can cause excessive locking, leading to poor performance.
In fact, write performance is often worse than that of a single
-> server. Read requests can be sent to any server. Some
-> implementations use cluster-wide shared memory or shared disk
-> to reduce the communication overhead. Clustering is best for
mostly read workloads, though its big advantage is that any
server can accept write requests — there is no need to
partition workloads between master and slave servers, and
because the data changes are sent from one server to another,
there is no problem with non-deterministic functions like
<function>random()</>.

Is that enought?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#21

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Markus Wanner (#18)

#22

Markus Wanner

markus@bluegap.ch

over 19 years ago