replication docs: split single vs. multi-master

Started by Markus Schiltknecht about 19 years ago. 16 messages.
#1 Markus Schiltknecht
markus@bluegap.ch
4 attachment(s)

Hi,

as promised on -docs, here comes my proposal on how to improve the
replication documentation. The patches are split as follows and have to
be applied in order:

replication_doku_1.diff:

Smallest possible one-word change, to warm up...

replication_doku_2.diff:

Moves down "Clustering For Parallel Query Execution", because
it's not a replication type, but a feature, see explanation below.

replication_doku_3.diff:

This is the most important part, splitting all replication types
into single- and multi-master replication. I'm new to SGML, so
please bear with me if this is not the right way to do it...

"Shared-Disk Failover" IMO does not fall into a replication category.
Should we mention there that 'sharing' a disk using NFS or some
such is not recommended? (And, more importantly, that it does not
work as a multi-master replication solution.)

I've added a general paragraph describing Single-Master Replication.
I'm stating that 'Single-Master Replication is always asynchronous'.
Can anybody think of a counter example? Or a use case for sync
Single-Master Replication? The argument to put down is: if you go
sync, why don't you do Multi-Master right away?

Most of the "Clustering for Load Balancing" text applies to all
synchronous, Multi-Master Replication algorithms, even to
"Query Broadcasting". Thus it became the general description
of Multi-Master Replication. The section "Clustering for
Load Balancing" has been removed.

replication_doku_4.diff:

These are the text modifications I did to adjust to the new structure.
I've adjusted the Multi-Master Replication text to really be
appropriate for all existing solutions.

"Query Broadcasting" has some corrections, mainly to stick to describing
that algorithm there and not the general properties of
Multi-Master Replication.
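For illustration, the query-broadcasting pitfall with non-deterministic functions can be sketched like this (the Server class and the per-server RNG seeds are purely illustrative devices to make the divergence reproducible, not any real pgpool mechanism):

```python
import random

# A toy model of query broadcasting: each server executes the same SQL
# text independently, so a non-deterministic function like random()
# yields different values on different nodes, while deterministic
# literals stay in sync.

class Server:
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.rows = []

    def execute(self, stmt):
        if stmt == "INSERT random()":
            # stands in for: INSERT INTO t VALUES (random())
            self.rows.append(self.rng.random())
        else:
            # a deterministic literal gives the same result everywhere
            self.rows.append(float(stmt.split()[1]))

london, paris = Server(seed=1), Server(seed=2)
for stmt in ("INSERT 42", "INSERT random()"):
    london.execute(stmt)   # broadcast: the same statement text...
    paris.execute(stmt)    # ...executed independently on each node
```

After this, the first row agrees on both servers, but the second has already diverged, which is exactly why some solutions evaluate such functions on one node and ship the result.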

I've added two sections describing the 2PC and Distributed SHMEM
algorithms, which belong in that category, and cover all of the
previous text, except that I've removed the mention of Oracle RAC
in favor of Pgpool-II.
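The 2PC-based commit flow that the new section describes could be sketched roughly like this (the Node class is a hypothetical stand-in; a real setup would issue PREPARE TRANSACTION and COMMIT PREPARED over actual database connections):

```python
# Sketch of synchronous multi-master commit via two-phase commit:
# the transaction is first prepared on every node, and only if all
# nodes vote "yes" is it committed everywhere; otherwise it is rolled
# back everywhere.

class Node:
    def __init__(self, name):
        self.name = name
        self.committed = []
        self.prepared = {}

    def prepare(self, gid, changes):
        # phase 1: persist the transaction so it can survive a crash
        self.prepared[gid] = changes
        return True  # vote "yes"

    def commit_prepared(self, gid):
        # phase 2: make the prepared transaction visible
        self.committed.append(self.prepared.pop(gid))

    def rollback_prepared(self, gid):
        self.prepared.pop(gid, None)


def broadcast_commit(nodes, gid, changes):
    """Commit `changes` on all nodes, or on none of them."""
    voted = [n for n in nodes if n.prepare(gid, changes)]
    if len(voted) == len(nodes):
        for n in nodes:
            n.commit_prepared(gid)
        return True
    for n in voted:
        n.rollback_prepared(gid)
    return False


nodes = [Node("london"), Node("paris")]
broadcast_commit(nodes, "txn-1", {"id": 1, "city": "London"})
```

As the section notes, the same caveat about non-deterministic functions applies here as with query broadcasting.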

IMO this makes it clearer what replication types exist and how to
categorize them. I'm tempted to mention the Postgres-R algorithm as a
fourth sub-section of Multi-Master Replication, as it's quite different
from all the others in many aspects. But I urgently need to go to
work now... besides, I'm heavily biased regarding Postgres-R, so
probably someone else should write that paragraph. :-)

The only downside of the structure I'm proposing here is: the
non-replication algorithms fall off somewhat. Namely: "Shared-Disk
Failover", "Data Partitioning", "Parallel Query Execution" and
"Commercial Solutions".

For me, "Data Partitioning" as well as "Parallel Query Execution" are
possible optimizations which can be run on top of replicated data. They
don't replicate data and are thus not replication solutions. But
grouping those two together would make sense.

So. I really have to go to work now!

Regards

Markus

Attachments:

replication_doku_1.diff (text/x-patch; charset=iso-8859-1)
*** doc/src/sgml/failover.sgml	2006-11-15 08:52:25.000000000 +0100
--- doc/src/sgml/failover.sgml	2006-11-15 08:58:34.000000000 +0100
***************
*** 126,132 ****
    </para>
  
    <para>
!    Such partitioning implements both failover and load balancing.  Failover
     is achieved because the data resides on both servers, and this is an
     ideal way to enable failover if the servers share a slow communication
     channel. Load balancing is possible because read requests can go to any
--- 126,132 ----
    </para>
  
    <para>
!    Such partitioning provides both failover and load balancing.  Failover
     is achieved because the data resides on both servers, and this is an
     ideal way to enable failover if the servers share a slow communication
     channel. Load balancing is possible because read requests can go to any
replication_doku_2.diff (text/x-patch; charset=iso-8859-1)
*** doc/src/sgml/failover.sgml	2006-11-15 08:58:34.000000000 +0100
--- doc/src/sgml/failover.sgml	2006-11-15 09:10:53.000000000 +0100
***************
*** 114,150 ****
    </para>
   </sect1>
  
-  <sect1 id="data-partitioning">
-   <title>Data Partitioning</title>
- 
-   <para>
-    Data partitioning splits tables into data sets.  Each set can only be
-    modified by one server.  For example, data can be partitioned by
-    offices, e.g. London and Paris.  While London and Paris servers have all
-    data records, only London can modify London records, and Paris can only
-    modify Paris records.
-   </para>
- 
-   <para>
-    Such partitioning provides both failover and load balancing.  Failover
-    is achieved because the data resides on both servers, and this is an
-    ideal way to enable failover if the servers share a slow communication
-    channel. Load balancing is possible because read requests can go to any
-    of the servers, and write requests are split among the servers.  Of
-    course, the communication to keep all the servers up-to-date adds
-    overhead, so ideally the write load should be low, or localized as in
-    the London/Paris example above.
-   </para>
- 
-   <para>
-    Data partitioning is usually handled by application code, though rules
-    and triggers can be used to keep the read-only data sets current.  Slony-I
-    can also be used in such a setup.  While Slony-I replicates only entire
-    tables, London and Paris can be placed in separate tables, and
-    inheritance can be used to access both tables using a single table name.
-   </para>
-  </sect1>
- 
   <sect1 id="query-broadcast-load-balancing">
    <title>Query Broadcast Load Balancing</title>
  
--- 114,119 ----
***************
*** 198,203 ****
--- 167,203 ----
    </para>
   </sect1>
  
+  <sect1 id="data-partitioning">
+   <title>Data Partitioning</title>
+ 
+   <para>
+    Data partitioning splits tables into data sets.  Each set can only be
+    modified by one server.  For example, data can be partitioned by
+    offices, e.g. London and Paris.  While London and Paris servers have all
+    data records, only London can modify London records, and Paris can only
+    modify Paris records.
+   </para>
+ 
+   <para>
+    Such partitioning provides both failover and load balancing.  Failover
+    is achieved because the data resides on both servers, and this is an
+    ideal way to enable failover if the servers share a slow communication
+    channel. Load balancing is possible because read requests can go to any
+    of the servers, and write requests are split among the servers.  Of
+    course, the communication to keep all the servers up-to-date adds
+    overhead, so ideally the write load should be low, or localized as in
+    the London/Paris example above.
+   </para>
+ 
+   <para>
+    Data partitioning is usually handled by application code, though rules
+    and triggers can be used to keep the read-only data sets current.  Slony-I
+    can also be used in such a setup.  While Slony-I replicates only entire
+    tables, London and Paris can be placed in separate tables, and
+    inheritance can be used to access both tables using a single table name.
+   </para>
+  </sect1>
+ 
   <sect1 id="clustering-for-parallel-query-execution">
    <title>Clustering For Parallel Query Execution</title>
  
replication_doku_3.diff (text/x-patch; charset=iso-8859-1)
*** doc/src/sgml/failover.sgml	2006-11-15 09:10:53.000000000 +0100
--- doc/src/sgml/failover.sgml	2006-11-15 11:11:12.000000000 +0100
***************
*** 83,89 ****
    </para>
   </sect1>
  
!  <sect1 id="warm-standby-using-point-in-time-recovery">
    <title>Warm Standby Using Point-In-Time Recovery</title>
  
    <para>
--- 83,107 ----
    </para>
   </sect1>
  
!  <sect1 id="single-master-replication">
!   <indexterm><primary>single-master</></>
! 
!   <title>Single-Master Replication</title>
! 
!    <para>
!     When using only one server to process data-modifying transactions we
!     speak of a master server and Single-Master Replication.  It is used
!     to allow one or more slave servers to stay current and therefore
!     provide a quick failover for high availability.
!    </para>
! 
!    <para>
!     Single-Master Replication is always asynchronous to allow the master
!     to process data-modifying transactions at full speed, but with the
!     downside of possible data loss during failover.
!    </para>
! 
!  <sect2 id="warm-standby-using-point-in-time-recovery">
    <title>Warm Standby Using Point-In-Time Recovery</title>
  
    <para>
***************
*** 94,102 ****
     made the new master database server.  This is asynchronous and
     can only be done for the entire database server.
    </para>
!  </sect1>
  
!  <sect1 id="continuously-running-replication-server">
    <title>Continuously Running Replication Server</title>
  
    <para>
--- 112,120 ----
     made the new master database server.  This is asynchronous and
     can only be done for the entire database server.
    </para>
!  </sect2>
  
!  <sect2 id="continuously-running-replication-server">
    <title>Continuously Running Replication Server</title>
  
    <para>
***************
*** 112,121 ****
     granularity.  It updates the backup server in batches, so the replication
     is asynchronous and might lose data during a fail over.
    </para>
   </sect1>
  
!  <sect1 id="query-broadcast-load-balancing">
!   <title>Query Broadcast Load Balancing</title>
  
    <para>
     Query broadcast load balancing is accomplished by having a
--- 130,160 ----
     granularity.  It updates the backup server in batches, so the replication
     is asynchronous and might lose data during a fail over.
    </para>
+  </sect2>
+ 
   </sect1>
  
! 
!  <sect1 id="Multi-Master Replication">
! 
!   <indexterm><primary>multi-master</></>
! 
!   <title>Multi-Master Replication</title>
! 
!   <para>
!    In clustering, each server can accept write requests, and modified
!    data is transmitted from the original server to every other
!    server before each transaction commits.  Heavy write activity
!    can cause excessive locking, leading to poor performance.  In
!    fact, write performance is often worse than that of a single
!    server.  Read requests can be sent to any server.  Clustering
!    is best for mostly read workloads, though its big advantage is
!    that any server can accept write requests &mdash; there is no need
!    to partition workloads between read/write and read-only servers.
!   </para>
! 
!  <sect2 id="query-broadcast">
!   <title>Query Broadcasting</title>
  
    <para>
     Query broadcast load balancing is accomplished by having a
***************
*** 138,170 ****
     care must be taken that all transactions either commit or abort
     on all servers  Pgpool is an example of this type of replication.
    </para>
!  </sect1>
! 
!  <sect1 id="clustering-for-load-balancing">
!   <title>Clustering For Load Balancing</title>
! 
!   <para>
!    In clustering, each server can accept write requests, and modified
!    data is transmitted from the original server to every other
!    server before each transaction commits.  Heavy write activity
!    can cause excessive locking, leading to poor performance.  In
!    fact, write performance is often worse than that of a single
!    server.  Read requests can be sent to any server.  Clustering
!    is best for mostly read workloads, though its big advantage is
!    that any server can accept write requests &mdash; there is no need
!    to partition workloads between read/write and read-only servers.
!   </para>
  
-   <para>
-    Clustering is implemented by <productname>Oracle</> in their
-    <productname><acronym>RAC</></> product.  <productname>PostgreSQL</>
-    does not offer this type of load balancing, though
-    <productname>PostgreSQL</> two-phase commit (<xref
-    linkend="sql-prepare-transaction"
-    endterm="sql-prepare-transaction-title"> and <xref
-    linkend="sql-commit-prepared" endterm="sql-commit-prepared-title">)
-    can be used to implement this in application code or middleware.
-   </para>
   </sect1>
  
   <sect1 id="data-partitioning">
--- 177,184 ----
     care must be taken that all transactions either commit or abort
     on all servers  Pgpool is an example of this type of replication.
    </para>
!  </sect2>
  
   </sect1>
  
   <sect1 id="data-partitioning">
replication_doku_4.diff (text/x-patch; charset=iso-8859-1)
*** doc/src/sgml/failover.sgml	2006-11-15 11:11:12.000000000 +0100
--- doc/src/sgml/failover.sgml	2006-11-15 11:05:58.000000000 +0100
***************
*** 141,181 ****
  
    <title>Multi-Master Replication</title>
  
!   <para>
!    In clustering, each server can accept write requests, and modified
!    data is transmitted from the original server to every other
!    server before each transaction commits.  Heavy write activity
!    can cause excessive locking, leading to poor performance.  In
!    fact, write performance is often worse than that of a single
!    server.  Read requests can be sent to any server.  Clustering
!    is best for mostly read workloads, though its big advantage is
!    that any server can accept write requests &mdash; there is no need
!    to partition workloads between read/write and read-only servers.
!   </para>
  
   <sect2 id="query-broadcast">
    <title>Query Broadcasting</title>
  
    <para>
!    Query broadcast load balancing is accomplished by having a
!    program intercept every SQL query and send it to all servers.
!    This is unique because most replication solutions have the write
!    server propagate its changes to the other servers.  With query
!    broadcasting, each server operates independently.  Read-only
!    queries can be sent to a single server because there is no need
!    for all servers to process it.
    </para>
  
    <para>
!    One limitation of this solution is that functions like
!    <function>random()</>, <function>CURRENT_TIMESTAMP</>, and
!    sequences can have different values on different servers.  This
!    is because each server operates independently, and because SQL
!    queries are broadcast (and not actual modified rows).  If this
!    is unacceptable, applications must query such values from a
!    single server and then use those values in write queries.  Also,
!    care must be taken that all transactions either commit or abort
!    on all servers  Pgpool is an example of this type of replication.
    </para>
   </sect2>
  
--- 141,215 ----
  
    <title>Multi-Master Replication</title>
  
!    <para>
!     With Multi-Master Replication, each server can accept write requests,
!     and modified data is transmitted to all servers.  Heavy write activity
!     causes network traffic and excessive locking, leading to poor performance
!     especially for synchronous Multi-Master Replication.  The write
!     performance can therefore be worse than that of a single server. Thus
!     Multi-Master Replication works best for mostly read workloads, though
!     its big advantage is that any server can accept write requests &mdash;
!     there is no need to partition workloads between read/write and read-only
!     servers.
!    </para>
! 
!    <para>
!     Because every server has a consistent copy (replica) of the data, read
!     requests can be sent to any server.  In asynchronous Multi-Master
!     Replication the servers replica diverge until re-synchronized.
!     Likewise, conflicts in committed transactions can only be detected during
!     re-synchronization.  So either the application has to ensure not to send
!     conflicting transactions or the conflicts have to be resolved somehow
!     during re-synchronization of the servers.
!    </para>
  
   <sect2 id="query-broadcast">
    <title>Query Broadcasting</title>
  
    <para>
!    Query broadcasting is often accomplished by having a program intercept
!    data-modifying SQL queries and send them to all servers.  With query
!    broadcasting, each server operates independently, thus every
!    data-modifying transaction has to be processed on every server.  Pgpool
!    is an example of this type of replication.
!   </para>
! 
!   <para>
!    One challenge in implementing this solution is that functions like
!    <function>random()</> or <function>CURRENT_TIMESTAMP</> can return
!    different values on different servers.  So if only plain SQL gets
!    transmitted, the application has to cope with these differences.
!    Some solutions try to correctly handle these situations by executing
!    the non-deterministic functions on only one server and propagating
!    the result together with the rest of the SQL to the others.
!   </para>
!  </sect2>
! 
!  <sect2 id="two-phase-commit-replication">
!   <title>Using Two Phase Commit</title>
! 
!   <para>
!    <productname>PostgreSQL</productname> offers two-phase commit
!    (<xref linkend="sql-prepare-transaction"
!    endterm="sql-prepare-transaction-title"> and <xref
!    linkend="sql-commit-prepared" endterm="sql-commit-prepared-title">
!    which can be used to implement synchronous Multi-Master replication
!    in application code or middleware.  As with Query Broadcasting, care
!    has to be taken with non-deterministic functions.
    </para>
+  </sect2>
+ 
+  <sect2 id="distributed-shared-memory-replication">
+   <title>Using Distributed Shared Memory</title>
  
    <para>
!    A different approach is to use Distributed Shared Memory to propagate
!    the modified data and all the locks of data-modifying transactions.
!    This can be done on the level of the operating system, i.e. with
!    OpenMosix or within the database itself.  However, in both methods
!    generate a lot of network traffic and do not scale very well for that
!    reason.  Pgpool-II is an effort to implement Distributed Shared Memory
!    Replication for PostgreSQL.
    </para>
   </sect2>
  
#2 Bruce Momjian
bruce@momjian.us
In reply to: Markus Schiltknecht (#1)
Re: replication docs: split single vs. multi-master

Markus Schiltknecht wrote:

Hi,

as promised on -docs, here comes my proposal on how to improve the
replication documentation. The patches are split as follows and have to
be applied in order:

replication_doku_1.diff:

Smallest possible one-word change, to warm up...

Done.

replication_doku_2.diff:

Moves down "Clustering For Parallel Query Execution", because
it's not a replication type, but a feature, see explanation below.

Actually the patch moves down data partitioning. I am confused.

replication_doku_3.diff:

This is the most important part, splitting all replication types
into single- and multi-master replication. I'm new to SGML, so
please bear with me if this is not the right way to do it...

"Shared-Disk Failover" IMO does not fall into a replication category.
Should we mention there that 'sharing' a disk using NFS or some
such is not recommended? (And, more importantly, that it does not
work as a multi-master replication solution.)

I've added a general paragraph describing Single-Master Replication.
I'm stating that 'Single-Master Replication is always asynchronous'.
Can anybody think of a counter example? Or a use case for sync
Single-Master Replication? The argument to put down is: if you go
sync, why don't you do Multi-Master right away?

Most of the "Clustering for Load Balancing" text applies to all
synchronous, Multi-Master Replication algorithms, even to
"Query Broadcasting". Thus it became the general description
of Multi-Master Replication. The section "Clustering for
Load Balancing" has been removed.

I thought a long time about this. I have always liked splitting the
solutions up into single and multi-master, but in doing this
documentation section, I realized that the split isn't all that helpful,
and can be confusing. For example, Slony is clearly single-master, but
what about data partitioning? That is multi-master, in that there is
more than one master, but only one master per data set. And for
multi-master, Oracle RAC is clearly multi master, and I can see pgpool
as multi-master, or as several single-master systems, in that they
operate independently. After much thought, it seems that putting things
into single/multi-master categories just adds more confusion, because
several solutions just aren't clear, or fall into neither, e.g. Shared
Disk Failover. Another issue is that you mentioned heavy locking for
multi-master, when in fact pgpool doesn't do any special inter-server
locking, so it just doesn't apply.

In summary, it just seemed clearer to talk about each item and how it
works, rather than try to categorize them. The categorization just
seems to do more harm than good.

Of course, I might be totally wrong, and am still looking for feedback,
but these are my current thoughts. Feedback?

I didn't mention distributed shared memory as a separate item because I
felt it was an implementation detail of clustering, rather than
something separate. I kept two-phase in the cluster item for the same
reason.

Current version at:

http://momjian.us/main/writings/pgsql/sgml/failover.html

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#3 Markus Schiltknecht
markus@bluegap.ch
In reply to: Bruce Momjian (#2)
Re: replication docs: split single vs. multi-master

Hello Bruce,

Bruce Momjian wrote:

Actually the patch moves down data partitioning. I am confused.

Uh.. yeah, sorry, that's what I meant.

I thought a long time about this. I have always liked splitting the
solutions up into single and multi-master, but in doing this
documentation section, I realized that the split isn't all that helpful,
and can be confusing.

Not mentioning that categorization doesn't help in clearing up the
confusion. Just look around: most people use these terms. They're used
by MySQL and Oracle. Even Microsoft's Active Directory seems to have a
multi-master operation mode.

For example, Slony is clearly single-master,

Agreed.

but
what about data partitioning? That is multi-master, in that there is
more than one master, but only one master per data set.

Data Partitioning is a way to work around the trouble of database
replication in the application layer. Instead of trying to categorize it
like a replication algorithm, we should explain that working around the
trouble may be worthwhile in many cases.
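To illustrate what I mean by working around replication in the application layer, here's a toy sketch (the server names and the routing table are made up for illustration):

```python
# Application-level data partitioning: writes for a record are routed
# to the single server that "masters" that office's data set, while
# reads can be load-balanced across all replicas.

MASTER_FOR = {"London": "server-london", "Paris": "server-paris"}

def route_write(record):
    """Only one server may modify a given office's records."""
    return MASTER_FOR[record["office"]]

def route_read(servers, request_no):
    # reads can go to any replica; round-robin is the simplest policy
    return servers[request_no % len(servers)]
```

The replication of the read-only copies still has to happen underneath (e.g. via Slony-I, as the docs mention); this only shows the routing decision the application makes.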

And for
multi-master, Oracle RAC is clearly multi master,

Yes.

and I can see pgpool
as multi-master, or as several single-master systems, in that they
operate independently.

Several single-master systems? C'mon! Pgpool simply implements the most
simplistic form of multi-master replication. Just because you can access
the individual databases inside the cluster doesn't make it any less
Multi-Master, does it?

After much thought, it seems that putting things
into single/multi-master categories just adds more confusion, because
several solutions just aren't clear

Agreed, I'm not saying you must categorize all solutions you describe.
But please do categorize the ones which can be (and have so often been)
categorized.

or fall into neither, e.g. Shared Disk Failover.

Oh, yes, this reminds me of Brad Nicholson's suggestion in [1] to add a
warning "about the risk of having two postmaster come up...".

What about other means of sharing disks or filesystems? NBDs or even
worse: NFS?

Another issue is that you mentioned heavy locking for
multi-master, when in fact pgpool doesn't do any special inter-server
locking, so it just doesn't apply.

Sure it does apply, in the sense that *every* single lock is granted and
released on *every* node. The total number of locks scales linearly with
the number of nodes in the cluster.
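To put numbers on it (the figures are purely illustrative):

```python
# Back-of-the-envelope model: with query broadcasting, every lock a
# transaction takes is granted and released on every node, so total
# lock operations grow linearly with cluster size.

def total_lock_ops(locks_per_txn, txns, nodes):
    # each lock is acquired once on each of the `nodes` servers
    return locks_per_txn * txns * nodes

single_server = total_lock_ops(locks_per_txn=5, txns=100, nodes=1)
four_nodes = total_lock_ops(locks_per_txn=5, txns=100, nodes=4)
```

So a four-node cluster does four times the locking work for the same write load, even though no single lock is "special" or inter-server.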

In summary, it just seemed clearer to talk about each item and how it
works, rather than try to categorize them. The categorization just
seems to do more harm than good.

Of course, I might be totally wrong, and am still looking for feedback,
but these are my current thoughts. Feedback?

AFAICT, the categorization into Single- and Multi-Master Replication is
very common. I think that's partly because it's focused on the solution.
One can ask: do I want to write on all nodes, or is a failover solution
sufficient? Or can I perhaps get away with a read-only slave?

It's a categorization the user does, often before having a glimpse about
how complicated database replication really is. Thus, IMO, it would make
sense to help the user and allow him to quickly find answers. (And we
can still tell them that it's not easy or even possible to categorize
all the solutions.)

I didn't mention distributed shared memory as a separate item because I
felt it was an implementation detail of clustering, rather than
something separate. I kept two-phase in the cluster item for the same
reason.

Why is pgpool not an implementation detail of clustering, then?

Current version at:

http://momjian.us/main/writings/pgsql/sgml/failover.html

That somehow doesn't work for me:

--- momjian.us ping statistics ---
15 packets transmitted, 0 received, 100% packet loss, time 14011ms

Just my 2 cents, in the hope of being helpful.

Regards

Markus

[1]: http://archives.postgresql.org/pgsql-admin/2006-11/msg00154.php

#4 Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#2)
1 attachment(s)
Re: replication docs: split single vs. multi-master

Bruce Momjian wrote:

I didn't mention distributed shared memory as a separate item because I
felt it was an implementation detail of clustering, rather than
something separate. I kept two-phase in the cluster item for the same
reason.

Current version at:

http://momjian.us/main/writings/pgsql/sgml/failover.html

I am now attaching the additional text I added based on your comments.
I have also changed the markup so all the solutions appear on the same
web page. I think seeing it all together might give us new ideas for
improvement.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachments:

/rtmp/diff (text/x-diff)
Index: doc/src/sgml/failover.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/failover.sgml,v
retrieving revision 1.6
diff -c -c -r1.6 failover.sgml
*** doc/src/sgml/failover.sgml	15 Nov 2006 01:09:08 -0000	1.6
--- doc/src/sgml/failover.sgml	16 Nov 2006 17:12:49 -0000
***************
*** 33,38 ****
--- 33,50 ----
   </para>
  
   <para>
+   Some solutions deal with synchronization by allowing only one
+   server to modify the data.  Servers that can modify data are
+   called read/write or "master" server.  Servers with read-only
+   data are called backup or "slave" servers.  As you will see below,
+   these terms cover a variety of implementations.  Some servers
+   are masters of some data sets, and slave of others.  Some slaves
+   cannot be accessed until they are changed to master servers,
+   while other slaves can reply to read-only queries while they are
+   slaves.
+  </para>
+ 
+  <para>
    Some failover and load balancing solutions are synchronous, meaning that
    a data-modifying transaction is not considered committed until all
    servers have committed the transaction.  This guarantees that a failover
***************
*** 118,132 ****
    <title>Data Partitioning</title>
  
    <para>
!    Data partitioning splits tables into data sets.  Each set can only be
!    modified by one server.  For example, data can be partitioned by
!    offices, e.g. London and Paris.  While London and Paris servers have all
!    data records, only London can modify London records, and Paris can only
!    modify Paris records.
    </para>
  
    <para>
!    Such partitioning implements both failover and load balancing.  Failover
     is achieved because the data resides on both servers, and this is an
     ideal way to enable failover if the servers share a slow communication
     channel. Load balancing is possible because read requests can go to any
--- 130,149 ----
    <title>Data Partitioning</title>
  
    <para>
!    Data partitioning splits tables into data sets.  Each set can
!    be modified by only one server.  For example, data can be
!    partitioned by offices, e.g. London and Paris.  While London
!    and Paris servers have all data records, only London can modify
!    London records, and Paris can only modify Paris records.  This
!    is similar to section <xref
!    linkend="continuously-running-replication-server"> above, except
!    that instead of having a read/write server and a read-only server,
!    each server has a read/write data set and a read-only data
!    set.
    </para>
  
    <para>
!    Such partitioning provides both failover and load balancing.  Failover
     is achieved because the data resides on both servers, and this is an
     ideal way to enable failover if the servers share a slow communication
     channel. Load balancing is possible because read requests can go to any
#5 Markus Schiltknecht
markus@bluegap.ch
In reply to: Bruce Momjian (#4)
Re: replication docs: split single vs. multi-master

Bruce Momjian wrote:

I am now attaching the additional text I added based on your comments.
I have also changed the markup so all the solutions appear on the same
web page. I think seeing it all together might give us new ideas for
improvement.

Good, it's definitely better to have it all on one page.

I just thought about the words 'master' and 'slave', which are
admittedly quite unfortunate. I remember reading about efforts to remove
them from geek speech and to introduce better names. At least the old
IDE master and slave drives seem to be disappearing now...

Regards

Markus

#6 Bruce Momjian
bruce@momjian.us
In reply to: Markus Schiltknecht (#3)
Re: [PATCHES] replication docs: split single vs. multi-master

Markus Schiltknecht wrote:

Not mentioning that categorization doesn't help in clearing up the
confusion. Just look around: most people use these terms. They're used
by MySQL and Oracle. Even Microsoft's Active Directory seems to have a
multi-master operation mode.

OK.

For example, Slony is clearly single-master,

Agreed.

but
what about data partitioning? That is multi-master, in that there is
more than one master, but only one master per data set.

Data Partitioning is a way to work around the trouble of database
replication in the application layer. Instead of trying to categorize it
like a replication algorithm, we should explain that working around the
trouble may be worthwhile in many cases.

OK. I still feel that data partitioning is like master/slave
replication, because you have to get that read-only copy to the other
server. If you split things up so data sets reside on only one
machine, you are right that it would not be replication, but do people
do that? If so, it is almost another solution.

And for
multi-master, Oracle RAC is clearly multi master,

Yes.

and I can see pgpool
as multi-master, or as several single-master systems, in that they
operate independently.

Several single-master systems? C'mon! Pgpool simply implements the most
simplistic form of multi-master replication. Just because you can access
the single databases inside the cluster doesn't make it less
Multi-Master, does it?

OK, changed to "Multi-Master Replication Using Query Broadcasting".

After much thought, it seems that putting things
into single/multi-master categories just adds more confusion, because
several solutions just aren't clear

Agreed, I'm not saying you must categorize all solutions you describe.
But please do categorize the ones which can be (and have so often been)
categorized.

OK.

or fall into neither, e.g. Shared Disk Failover.

Oh, yes, this reminds me of Brad Nicholson's suggestion in [1] to add a
warning "about the risk of having two postmaster come up...".

Added.

What about other means of sharing disks or filesystems? NBDs or even
worse: NFS?

Added.

Another issue is that you mentioned heavy locking for
multi-master, when in fact pgpool doesn't do any special inter-server
locking, so it just doesn't apply.

Sure it does apply, in the sense that *every* single lock is granted and
released on *every* node. The total amount of locks scales linearly with
the amount of nodes in the cluster.

Uh, but the locks are the same on each machine as if it was a single
server, while in a cluster, the locks are more intertwined with other
things that are happening on the server, no?

In summary, it just seemed clearer to talk about each item and how it
works, rather than try to categorize them. The categorization just
seems to do more harm than good.

Of course, I might be totally wrong, and am still looking for feedback,
but these are my current thoughts. Feedback?

AFAICT, the categorization in Single- and Multi-Master replication is
very common. I think that's partly because it's focused on the solution.
One can ask: do I want to write on all nodes or is a failover solution
sufficient? Or can I probably get away with a read-only Slave?

OK.

It's a categorization the user does, often before having a glimpse of
how complicated database replication really is. Thus, IMO, it would make
sense to help the user and allow him to quickly find answers. (And we
can still tell them that it's not easy or even possible to categorize
all the solutions.)

I didn't mention distributed shared memory as a separate item because I
felt it was an implementation detail of clustering, rather than
something separate. I kept two-phase in the cluster item for the same
reason.

Why is pgpool not an implementation detail of clustering, then?

Current version at:

http://momjian.us/main/writings/pgsql/sgml/failover.html

That somehow doesn't work for me:

I lost power for a few hours. I am back online. I have updated the
docs at that URL. Please check and let me know.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#7Hannu Krosing
hannu@skype.net
In reply to: Bruce Momjian (#6)
Re: [PATCHES] replication docs: split single vs.

On Fri, 2006-11-17 at 00:01, Bruce Momjian wrote:

Markus Schiltknecht wrote:

Not mentioning that categorization doesn't help in clearing the
confusion. Just look around, most people use these terms. They're used
by MySQL and Oracle. Even Microsoft's Active Directory seems to have a
multi-master operation mode.

OK.

For example, Slony is clearly single-master,

Agreed.

but
what about data partitioning? That is multi-master, in that there is
more than one master, but only one master per data set.

Data Partitioning is a way to work around the trouble of database
replication in the application layer. Instead of trying to categorize it
like a replication algorithm, we should explain that working around the
trouble may be worthwhile in many cases.

OK. I am still feeling that data partitioning is like master/slave
replication because you have to get that read-only copy to the other
server. If you split things up so data sets resided on only one
machine, you are right that would not be replication, but do people do
that? If so, it is almost another solution.

People do that in cases where there is high write loads ("high" as in
"not 10+ times less than reads") and just replicating the RO copies
would be prohibitively expensive in either network, cpu or memory terms.

pl/proxy is one tool for doing it. You can get latest stable version
from https://developer.skype.com/SkypeGarage/DbProjects .

And for
multi-master, Oracle RAC is clearly multi master,

Yes.

and I can see pgpool
as multi-master, or as several single-master systems, in that they
operate independently.

Several single-master systems? C'mon! Pgpool simply implements the most
simplistic form of multi-master replication.

In what way is pgpool multi-master? Last time I looked it did nothing
but apply DML to several databases, i.e. it is not replication at all,
or at least it is masterless, unless we think of the pgpool process
itself as the _single_ master :)

Just because you can access

the single databases inside the cluster doesn't make it less
Multi-Master, does it?

OK, changed to "Multi-Master Replication Using Query Broadcasting".

I think this gives completely wrong picture of what pgpool does.

How about just "Query Broadcasting"?
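
The broadcasting pattern being argued about here can be sketched in a
few lines. This is a toy model, not pgpool's actual implementation: the
Backend and BroadcastProxy classes and their in-memory tables are
hypothetical, purely to show writes fanning out to every node while
each read hits only one.

```python
# Toy sketch of query broadcasting: the proxy forwards every write to
# all backends and balances reads across them.  Backend/BroadcastProxy
# are hypothetical stand-ins, not pgpool code.

class Backend:
    """Stands in for one PostgreSQL server; stores rows per table."""
    def __init__(self, name):
        self.name = name
        self.tables = {}

    def insert(self, table, row):
        self.tables.setdefault(table, []).append(row)

    def select(self, table):
        return list(self.tables.get(table, []))

class BroadcastProxy:
    """Writes go to every backend; reads are round-robin balanced."""
    def __init__(self, backends):
        self.backends = backends
        self._next = 0

    def write(self, table, row):
        for b in self.backends:        # broadcast the DML to all nodes
            b.insert(table, row)

    def read(self, table):
        b = self.backends[self._next % len(self.backends)]
        self._next += 1                # each read may hit another node
        return b.select(table)

proxy = BroadcastProxy([Backend("node1"), Backend("node2")])
proxy.write("accounts", {"id": 1})
rows = proxy.read("accounts")          # identical on every backend
```

Because every node sees every write, any node can serve a read, which
is exactly why the thread argues over whether to call the result
single- or multi-master.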

After much thought, it seems that putting things
into single/multi-master categories just adds more confusion, because
several solutions just aren't clear

Agreed, I'm not saying you must categorize all solutions you describe.
But please do categorize the ones which can be (and have so often been)
categorized.

OK.

or fall into neither, e.g. Shared Disk Failover.

Oh, yes, this reminds me of Brad Nicholson's suggestion in [1] to add a
warning "about the risk of having two postmaster come up...".

Added.

What about other means of sharing disks or filesystems? NBDs or even
worse: NFS?

Added.

Another issue is that you mentioned heavy locking for
multi-master, when in fact pgpool doesn't do any special inter-server
locking, so it just doesn't apply.

Sure it does apply, in the sense that *every* single lock is granted and
released on *every* node. The total amount of locks scales linearly with
the amount of nodes in the cluster.

Uh, but the locks are the same on each machine as if it was a single
server, while in a cluster, the locks are more intertwined with other
things that are happening on the server, no?

In summary, it just seemed clearer to talk about each item and how it
works, rather than try to categorize them. The categorization just
seems to do more harm than good.

Of course, I might be totally wrong, and am still looking for feedback,
but these are my current thoughts. Feedback?

AFAICT, the categorization in Single- and Multi-Master replication is
very common. I think that's partly because it's focused on the solution.
One can ask: do I want to write on all nodes or is a failover solution
sufficient? Or can I probably get away with a read-only Slave?

OK.

It's a categorization the user does, often before having a glimpse of
how complicated database replication really is. Thus, IMO, it would make
sense to help the user and allow him to quickly find answers. (And we
can still tell them that it's not easy or even possible to categorize
all the solutions.)

I didn't mention distributed shared memory as a separate item because I
felt it was an implementation detail of clustering, rather than
something separate. I kept two-phase in the cluster item for the same
reason.

Why is pgpool not an implementation detail of clustering, then?

Current version at:

http://momjian.us/main/writings/pgsql/sgml/failover.html

That somehow doesn't work for me:

I lost power for a few hours. I am back online. I have updated the
docs at that URL. Please check and let me know.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#8Hannu Krosing
hannu@skype.net
In reply to: Bruce Momjian (#6)
Re: [PATCHES] replication docs: split single vs.

On Fri, 2006-11-17 at 00:01, Bruce Momjian wrote:

Current version at:

http://momjian.us/main/writings/pgsql/sgml/failover.html

it refers to "Warm Standby Using Point-In-Time Recovery"
(http://momjian.us/main/writings/pgsql/sgml/warm-standby.html); maybe
it's a good idea to give pointers to SkyTools (description:
https://developer.skype.com/SkypeGarage/DbProjects/SkyTools, code:
http://pgfoundry.org/projects/skytools/), which includes a walmgr.py
script that sets up and manages WAL-based standby servers.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#9Markus Schiltknecht
markus@bluegap.ch
In reply to: Bruce Momjian (#6)
Re: [PATCHES] replication docs: split single vs. multi-master

Hello Bruce,

You wrote:

I am still feeling that data partitioning is like master/slave
replication because you have to get that read-only copy to the other
server.

Yes, that's where replication comes into play. But data partitioning per
se has nothing to do with replication, has it? You can partition your
data however you want: among tablespaces, among databases or among
multiple servers. Data partitioning solves different problems than
replication. I think it's important to keep them separate. Why do you
mix in Slony-I in the Data Partitioning section? One can use any other
replication solution to "get that read-only copy to the other server".

If you split things up so data sets resided on only one
machine, you are right that would not be replication, but do people do
that? If so, it is almost another solution.

Yes, as I say: Data Partitioning solves another problem.

And for
multi-master, Oracle RAC is clearly multi master,

Yes.

and I can see pgpool
as multi-master, or as several single-master systems, in that they
operate independently.

Several single-master systems? C'mon! Pgpool simply implements the most
simplistic form of multi-master replication. Just because you can access
the single databases inside the cluster doesn't make it less
Multi-Master, does it?

OK, changed to "Multi-Master Replication Using Query Broadcasting".

Good. That reads already better for me. ;-)

As Jim Nasby pointed out in [1], not all solutions are as simplistic as
pgpool and do not necessarily have the same disadvantages - while using
the very same algorithm: Query Broadcasting.

I suggest we make sure to clarify that and better point out some of the
aspects all Multi-Master Replication have in common (see
replication_doku_4.diff of my patches).

Added.

Added.

(the additions to "Shared Disk Failover")

Good. Short and clear. (Except perhaps: how can I find out if NFS has
full POSIX behavior? Do we have to go into more detail there? I dunno.)

Uh, but the locks are the same on each machine as if it was a single
server, while in a cluster, the locks are more intertwined with other
things that are happening on the server, no?

Sure.

Maybe you are right and we should better not use the term locking there.
It seems confusing because it's not clear what a 'lock' is for some
replication systems (i.e. also Postgres-R; how do you compare its
"amount of locks"?).

Regards

Markus

#10Markus Schiltknecht
markus@bluegap.ch
In reply to: Hannu Krosing (#7)
Re: [PATCHES] replication docs: split single vs. multi-master

Good morning Hannu,

Hannu Krosing wrote:

People do that in cases where there is high write loads ("high" as in
"not 10+ times less than reads") and just replicating the RO copies
would be prohibitively expensive in either network, cpu or memory terms.

Okay. In that case it's even less like any type of replication.

IMO, Data Partitioning is the simplest method of Load Balancing. It's
like saying: hey, if your database server is overloaded, simply split
your data over multiple servers.

Which is not always possible and can lead to other problems. Some of
which can be solved by replication solutions.
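
The routing idea behind data partitioning can be shown concretely. A
minimal sketch, assuming a hash-based router similar in spirit to what
pl/proxy does; the server names and key are made up for illustration:

```python
# Toy sketch of data partitioning: each key is owned by exactly one
# server, so nothing is replicated -- queries are merely routed.  The
# server names are hypothetical; pl/proxy uses a similar hash scheme.
from zlib import crc32

SERVERS = ["part0", "part1", "part2", "part3"]

def server_for(key: str) -> str:
    """Route a key to its single owning partition via a stable hash."""
    return SERVERS[crc32(key.encode()) % len(SERVERS)]

# The same key always lands on the same server, so each data set
# lives on exactly one machine and writes never need broadcasting.
owner = server_for("alice")
```

Since each data set has a single owner, there is no copy to keep in
sync, which is the sense in which partitioning sidesteps replication
rather than implementing it.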

In what way is pgpool multi-master? Last time I looked it did nothing
but apply DML to several databases, i.e. it is not replication at all,

Please give your definition of replication.

Wikipedia gives us [1]: "Replication refers to the use of redundant
resources, such as software or hardware components, to improve
reliability, fault-tolerance, or performance."

Pgpool does that by Query Broadcasting, no?

or at least it is masterless, unless we think of the pgpool process
itself as the _single_ master :)

Hm. That's a good point. Pgpool allows writing to only one master (the
pgpool process) but reading from multiple, synchronous masters. I admit
that makes it a little hard to split into Single- or Multi-Master.

Doesn't Sequoia support multiple Query Broadcasting processes? Would it
qualify as Multi-Master *Replication*, then?

In an ideal implementation, every Master could broadcast queries to all
other masters. Thus giving a *real* Multi-Master solution. Postgres-R
(6.4) did fall back into that mode for transactions which change a lot
of tuples, so that the writeset didn't exceed a certain size limit.

I think this gives completely wrong picture of what pgpool does.

As I see it, that's because pgpool is a very limited implementation of
Query Broadcasting. But pgpool is not the only solution implementing
that algorithm. Do we want to describe the general algorithm or pgpool here?

Regards

Markus

[1]: Wikipedia about Replication (Computer Science):
http://en.wikipedia.org/wiki/Replication_%28computer_science%29

#11Bruce Momjian
bruce@momjian.us
In reply to: Hannu Krosing (#7)
Re: [PATCHES] replication docs: split single vs.

Hannu Krosing wrote:

OK. I am still feeling that data partitioning is like master/slave
replication because you have to get that read-only copy to the other
server. If you split things up so data sets resided on only one
machine, you are right that would not be replication, but do people do
that? If so, it is almost another solution.

People do that in cases where there is high write loads ("high" as in
"not 10+ times less than reads") and just replicating the RO copies
would be prohibitively expensive in either network, cpu or memory terms.

OK, as Markus suggested, I have moved Data Partitioning down to the
bottom, and mentioned it as only optionally keeping a read-only copy on
each server. Is this better?

Several single-master systems? C'mon! Pgpool simply implements the most
simplistic form of multi-master replication.

In what way is pgpool multi-master? Last time I looked it did nothing
but apply DML to several databases, i.e. it is not replication at all,
or at least it is masterless, unless we think of the pgpool process
itself as the _single_ master :)

I have removed the mention of "multi-master" from query broadcast.

Just because you can access

the single databases inside the cluster doesn't make it less
Multi-Master, does it?

OK, changed to "Multi-Master Replication Using Query Broadcasting".

I think this gives completely wrong picture of what pgpool does.

How about just "Query Broadcasting"?

Done.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#12Bruce Momjian
bruce@momjian.us
In reply to: Hannu Krosing (#8)
Re: [PATCHES] replication docs: split single vs.

Hannu Krosing wrote:

On Fri, 2006-11-17 at 00:01, Bruce Momjian wrote:

Current version at:

http://momjian.us/main/writings/pgsql/sgml/failover.html

it refers to "Warm Standby Using Point-In-Time Recovery"
(http://momjian.us/main/writings/pgsql/sgml/warm-standby.html); maybe
it's a good idea to give pointers to SkyTools (description:
https://developer.skype.com/SkypeGarage/DbProjects/SkyTools, code:
http://pgfoundry.org/projects/skytools/), which includes a walmgr.py
script that sets up and manages WAL-based standby servers.

Isn't that functionality included in 8.2, which is what this
documentation is being included with?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#13Bruce Momjian
bruce@momjian.us
In reply to: Markus Schiltknecht (#9)
Re: [PATCHES] replication docs: split single vs.

Markus Schiltknecht wrote:

Hello Bruce,

You wrote:

I am still feeling that data partitioning is like master/slave
replication because you have to get that read-only copy to the other
server.

Yes, that's where replication comes into play. But data partitioning per
se has nothing to do with replication, has it? You can partition your
data however you want: among tablespaces, among databases or among
multiple servers. Data partitioning solves different problems than
replication. I think it's important to keep them separate. Why do you
mix in Slony-I in the Data Partitioning section? One can use any other
replication solution to "get that read-only copy to the other server".

Yes, updated.

and I can see pgpool
as multi-master, or as several single-master systems, in that they
operate independently.

Several single-master systems? C'mon! Pgpool simply implements the most
simplistic form of multi-master replication. Just because you can access
the single databases inside the cluster doesn't make it less
Multi-Master, does it?

OK, changed to "Multi-Master Replication Using Query Broadcasting".

Good. That reads already better for me. ;-)

Oops, now modified to just "Query Broadcasting".

As Jim Nasby pointed out in [1], not all solutions are as simplistic as
pgpool and do not necessarily have the same disadvantages - while using
the very same algorithm: Query Broadcasting.

I suggest we make sure to clarify that and better point out some of the
aspects all Multi-Master Replication have in common (see
replication_doku_4.diff of my patches).

Added.

Added.

(the additions to "Shared Disk Failover")

Good. Short and clear. (Except perhaps: how can I find out if NFS has
full POSIX behavior? Do we have to go into more detail there? I dunno.)

Uh, I am unclear on that myself. I think NFS3 or NFS4 is OK, but am
unsure.

Uh, but the locks are the same on each machine as if it was a single
server, while in a cluster, the locks are more intertwined with other
things that are happening on the server, no?

Sure.

Maybe you are right and we should better not use the term locking there.
It seems confusing because it's not clear what a 'lock' is for some
replication systems (i.e. also Postgres-R; how do you compare its
"amount of locks"?).

OK, locks are currently mentioned only for clustering.

URL updated:

http://momjian.us/main/writings/pgsql/sgml/failover.html

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#14Bruce Momjian
bruce@momjian.us
In reply to: Markus Schiltknecht (#10)
Re: [PATCHES] replication docs: split single vs.

I have renamed the documentation section "High Availability and Load
Balancing". I think the current version takes many of your comments
below into account. Please let me know.

---------------------------------------------------------------------------

Markus Schiltknecht wrote:

Good morning Hannu,

Hannu Krosing wrote:

People do that in cases where there is high write loads ("high" as in
"not 10+ times less than reads") and just replicating the RO copies
would be prohibitively expensive in either network, cpu or memory terms.

Okay. In that case it's even less like any type of replication.

IMO, Data Partitioning is the simplest method of Load Balancing. It's
like saying: hey, if your database server is overloaded, simply split
your data over multiple servers.

Which is not always possible and can lead to other problems. Some of
which can be solved by replication solutions.

In what way is pgpool multi-master? Last time I looked it did nothing
but apply DML to several databases, i.e. it is not replication at all,

Please give your definition of replication.

Wikipedia gives us [1]: "Replication refers to the use of redundant
resources, such as software or hardware components, to improve
reliability, fault-tolerance, or performance."

Pgpool does that by Query Broadcasting, no?

or at least it is masterless, unless we think of the pgpool process
itself as the _single_ master :)

Hm. That's a good point. Pgpool allows writing to only one master (the
pgpool process) but reading from multiple, synchronous masters. I admit
that makes it a little hard to split into Single- or Multi-Master.

Doesn't Sequoia support multiple Query Broadcasting processes? Would it
qualify as Multi-Master *Replication*, then?

In an ideal implementation, every Master could broadcast queries to all
other masters. Thus giving a *real* Multi-Master solution. Postgres-R
(6.4) did fall back into that mode for transactions which change a lot
of tuples, so that the writeset didn't exceed a certain size limit.

I think this gives completely wrong picture of what pgpool does.

As I see it, that's because pgpool is a very limited implementation of
Query Broadcasting. But pgpool is not the only solution implementing
that algorithm. Do we want to describe the general algorithm or pgpool here?

Regards

Markus

[1]: Wikipedia about Replication (Computer Science):
http://en.wikipedia.org/wiki/Replication_%28computer_science%29

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#15Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Bruce Momjian (#14)
Re: [PATCHES] replication docs: split single vs.

From high-availability.sgml:

Clustering For Parallel Query Execution

This allows multiple servers to work concurrently on a single
query. One possible way this could work is for the data to be
split among servers and for each server to execute its part of the
query and results sent to a central server to be combined and
returned to the user. There currently is no PostgreSQL open source
solution for this.

I think pgpool-II can do this.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

#16Bruce Momjian
bruce@momjian.us
In reply to: Tatsuo Ishii (#15)
Re: [PATCHES] replication docs: split single vs.

Tatsuo Ishii wrote:

From high-availability.sgml:

Clustering For Parallel Query Execution

This allows multiple servers to work concurrently on a single
query. One possible way this could work is for the data to be
split among servers and for each server to execute its part of the
query and results sent to a central server to be combined and
returned to the user. There currently is no PostgreSQL open source
solution for this.

I think pgpool-II can do this.

Thanks, I suspected it could, added:

This allows multiple servers to work concurrently on a single
query. One possible way this could work is for the data to be
split among servers and for each server to execute its part of
the query and results sent to a central server to be combined
and returned to the user. Pgpool-II has this capability.
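
The scatter/gather scheme described above can be sketched as follows.
This is a toy illustration of the general technique, not how Pgpool-II
actually implements it; the shard lists stand in for per-server data:

```python
# Toy sketch of parallel query execution: data is split among servers,
# each computes a partial result, and a central node combines them.
# The shard lists stand in for per-server tables.

def partial_sum(shard):
    """One server's share of SELECT sum(x): aggregate its shard only."""
    return sum(shard)

def parallel_sum(shards):
    """Central node: scatter the query, then gather and combine."""
    return sum(partial_sum(s) for s in shards)

shards = [[1, 2, 3], [4, 5], [6]]   # data partitioned over 3 servers
total = parallel_sum(shards)        # same answer as a single server
```

The pattern only works this cleanly for aggregates that decompose into
per-shard parts (sum, count, min, max); joins across shards need more
machinery.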

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +