Big 7.4 items

Started by Bruce Momjian, about 23 years ago, 68 messages
#1Bruce Momjian
pgman@candle.pha.pa.us

I wanted to outline some of the big items we are looking at for 7.4:

Win32 Port:

Katie Ward and Jan are working on contributing their Win32
port for 7.4. They plan to have a patch available by the end of
December.

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

Replication

I have talked to Darren Johnson and I believe 7.4 is the time to
merge the Postgres-R source tree into our main CVS. Most of the
replication code will be in its own directory, with only minor
changes to our existing tree. They have single-master
replication working now, so we may have that feature in some
capacity for 7.4. I know others are working on replication
solutions. This is probably the time to decide for certain if
this is the direction we want to go for replication. Most who
have studied Postgres-R feel it is the most promising
multi-master replication solution for reliably networked hosts.

Comments?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#2Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Bruce Momjian (#1)
Re: Big 7.4 items

On 13 Dec 2002 at 1:22, Bruce Momjian wrote:

Replication

I have talked to Darren Johnson and I believe 7.4 is the time to
merge the Postgres-R source tree into our main CVS. Most of the
replication code will be in its own directory, with only minor
changes to our existing tree. They have single-master
replication working now, so we may have that feature in some
capacity for 7.4. I know others are working on replication
solutions. This is probably the time to decide for certain if
this is the direction we want to go for replication. Most who
have studied Postgres-R feel it is the most promising
multi-master replication solution for reliably networked hosts.

Comments?

Some.

1) What kind of replication are we looking at? Log file replay/synchronous etc.?
If it is real time, like usogres (I hope I am in line with things here), that
would be really good. Choice is always good.

2) If we are going to have replication, can we have built-in load balancing? Is
it a good idea to have it in PostgreSQL, or would a separate application be the
way to go?

And where are nested transactions?

Bye
Shridhar

--
Booker's Law: An ounce of application is worth a ton of abstraction.

#3Hannu Krosing
hannu@tm.ee
In reply to: Bruce Momjian (#1)
Re: Big 7.4 items

On Fri, 2002-12-13 at 06:22, Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:
Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

How hard would it be to extend PITR for master-slave (hot backup)
replication, which should then amount to continuously shipping logs to
slave and doing nonstop PITR there :)

It will never be usable for multi-master replication, but somehow it
feels that for master-slave replication a simple log replay would be the
simplest and most robust solution.
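Conceptually, the continuous-log-shipping scheme suggested here can be sketched as a toy simulation (all names are invented for illustration; PostgreSQL's real WAL is a binary format replayed by the server itself, not by application code like this):

```python
# Toy sketch of master/slave replication via continuous log replay.
# The master logs each change before applying it; the slave replays
# whatever tail of the log it has not yet seen ("nonstop PITR").

class Master:
    def __init__(self):
        self.data = {}
        self.log = []                    # append-only list of (key, value)

    def update(self, key, value):
        self.log.append((key, value))    # write-ahead: log first
        self.data[key] = value           # then apply locally

class Slave:
    def __init__(self):
        self.data = {}
        self.applied = 0                 # how far into the log we have replayed

    def catch_up(self, log):
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)

master, slave = Master(), Slave()
master.update("a", 1)
master.update("b", 2)
slave.catch_up(master.log)               # ship and replay the log
master.update("a", 3)
slave.catch_up(master.log)               # incremental replay of the new tail
print(slave.data == master.data)         # True: the slave converges
```

The appeal is exactly what is claimed above: the slave needs no knowledge of the transactions themselves, only the ability to replay the log stream in order.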

--
Hannu Krosing <hannu@tm.ee>

#4Mike Mascari
mascarm@mascari.com
In reply to: Bruce Momjian (#1)
Re: Big 7.4 items

Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:

Win32 Port:

Katie Ward and Jan are working on contributing their Win32
port for 7.4. They plan to have a patch available by the end of
December.

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

Replication

I have talked to Darren Johnson and I believe 7.4 is the time to
merge the Postgres-R source tree into our main CVS. Most of the
replication code will be in its own directory, with only minor
changes to our existing tree. They have single-master
replication working now, so we may have that feature in some
capacity for 7.4. I know others are working on replication
solutions. This is probably the time to decide for certain if
this is the direction we want to go for replication. Most who
have studied Postgres-R feel it is the most promising
multi-master replication solution for reliably networked hosts.

Comments?

What about distributed TX support:

http://groups.google.com/groups?hl=en&amp;lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;threadm=20021106111554.69ae1dcd.pgsql%40snaga.org&amp;rnum=2&amp;prev=/groups%3Fq%3DNAGAYASU%2BSatoshi%26ie%3DUTF-8%26oe%3DUTF-8%26hl%3Den

Mike Mascari
mascarm@mascari.com

#5Noname
darren@up.hrcoxmail.com
In reply to: Mike Mascari (#4)
Re: Big 7.4 items

How hard would it be to extend PITR for master-slave (hot backup)
replication, which should then amount to continuously shipping logs to
slave and doing nonstop PITR there :)

I have not looked at the PITR patch yet, but it might be
possible to use the same PITR format to queue/log write sets
with Postgres-R, so we can have multi-master replication
and PITR from the same mechanism.

Darren

#6Joe Conway
mail@joeconway.com
In reply to: Bruce Momjian (#1)
Re: Big 7.4 items

Bruce Momjian wrote:

Win32 Port:

Katie Ward and Jan are working on contributing their Win32
port for 7.4. They plan to have a patch available by the end of
December.

I have .Net Studio available to me, so if you need help in merging or testing
or whatever, let me know.

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

As Hannu asked (and related to your question below), is there any thought of
extending this to allow simple log based replication? In many important
scenarios that would be more than adequate, and simpler to set up.

Replication

I have talked to Darren Johnson and I believe 7.4 is the time to
merge the Postgres-R source tree into our main CVS. Most of the
replication code will be in its own directory, with only minor
changes to our existing tree. They have single-master
replication working now, so we may have that feature in some
capacity for 7.4. I know others are working on replication
solutions. This is probably the time to decide for certain if
this is the direction we want to go for replication. Most who
have studied Postgres-R feel it is the most promising
multi-master replication solution for reliably networked hosts.

I'd question if we would want the one-and-only builtin replication method to
be dependent on an external communication library (Spread). I would like to
see Postgres-R merged, but I'd also like to see a simple log-based option.

Comments?

I'd also second Mike Mascari's question -- whatever happened to the person
working on two-phase commit? Is that likely to be done for 7.4? Did he ever
send in a patch?

Joe

#7Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Shridhar Daithankar (#2)
Re: Big 7.4 items

Shridhar Daithankar wrote:

And where are nested transactions?

I didn't mention nested transactions because it didn't seem to be a
_big_ item like the others.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#8snpe
snpe@snpe.co.yu
In reply to: Bruce Momjian (#7)
Re: Big 7.4 items

On Friday 13 December 2002 17:51, Bruce Momjian wrote:

Shridhar Daithankar wrote:

And where are nested transactions?

I didn't mention nested transactions because it didn't seem to be a
_big_ item like the others.

This is a big item.

regards
Haris Peco

#9Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Shridhar Daithankar (#2)
Re: Big 7.4 items

Shridhar Daithankar wrote:

1) What kind of replication are we looking at? Log file
replay/synchronous etc.? If it is real time, like usogres (I
hope I am in line with things here), that would be really good.
Choice is always good.

Good. This is the discussion we need. Let me quote the TODO list
replication section first:

* Add replication of distributed databases [replication]
o automatic failover
o load balancing
o master/slave replication
o multi-master replication
o partition data across servers
o sample implementation in contrib/rserv
o queries across databases or servers (two-phase commit)
o allow replication over unreliable or non-persistent links
o http://gborg.postgresql.org/project/pgreplication/projdisplay.php

OK, the first thing is that there isn't any one replication solution
that will behave optimally in all situations.

Now, let me describe Postgres-R and then the other replication
solutions. Postgres-R is multi-master, meaning you can send SELECT and
UPDATE/DELETE queries to any of the servers in the cluster, and get the
same result. It is also synchronous, meaning it doesn't update the
local copy until it is sure the other nodes agree to the change. It
allows failover, because if one node goes down, the others keep going.

Now, let me contrast:

rserv and dbmirror do master/slave. There is no mechanism to allow you
to do updates on the slave, and have them propagate to the master. You
can, however, send SELECT queries to the slave, and in fact that's how
usogres does load balancing.

Two-phase commit is probably the most popular commercial replication
solution. While it works for multi-master, it suffers from poor
performance and doesn't handle cases where one node disappears very
well.

Another replication need is for asynchronous replication, most
traditionally for traveling salesmen who need to update their databases
periodically. The only solution I know for that is PeerDirect's
PostgreSQL commercial offering at http://www.peerdirect.com. It is
possible PITR may help with this, but we need to handle propagating
changes made by the salesmen back up into the server, and to do that, we
will need a mechanism to handle conflicts that occur when two people
update the same records. This is always a problem for asynchronous
replication.
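To make that conflict problem concrete, here is a toy sketch (invented names, not any real PostgreSQL or PeerDirect code) of the simplest possible conflict rule, last-writer-wins by timestamp, applied when a disconnected copy syncs back to the server:

```python
# Toy last-writer-wins merge for asynchronous replication.
# Each side records (value, timestamp) per row; on sync, the newer
# timestamp wins. Real systems need tie-breaking, and may instead
# queue conflicts for manual resolution.

def merge(server, laptop):
    merged = dict(server)
    for key, (value, ts) in laptop.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

server = {"row1": ("office edit", 100), "row2": ("untouched", 50)}
laptop = {"row1": ("road edit", 90), "row3": ("new row", 120)}

result = merge(server, laptop)
# row1: the office edit wins (ts 100 > 90); row3 is simply added.
print(result["row1"][0])   # office edit
print(result["row3"][0])   # new row
```

The point is that some rule like this must exist; unlike synchronous replication, the conflicting update has already committed on the laptop by the time the server sees it.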

2) If we are going to have replication, can we have built-in load
balancing? Is it a good idea to have it in PostgreSQL, or would a
separate application be the way to go?

Well, because Postgres-R is multi-master, it has automatic load
balancing. You can have your users point to whatever node you want.
You can implement this "pointing" by using dns IP address cycling, or
have a router that auto-load balances, though you would need to keep a
db session on the same node, of course.
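That session-pinned round-robin "pointing" can be sketched like this (a toy illustration with invented names, standing in for DNS cycling or a load-balancing router):

```python
# Toy round-robin pointing: hand out cluster nodes in rotation, but
# pin each db session to one node for its lifetime, since (as noted
# above) a session must stay on the same node.
import itertools

nodes = ["node1", "node2", "node3"]
rotation = itertools.cycle(nodes)
sessions = {}

def node_for(session_id):
    if session_id not in sessions:       # new session: next node in rotation
        sessions[session_id] = next(rotation)
    return sessions[session_id]          # existing session: stay pinned

print(node_for("alice"))   # node1
print(node_for("bob"))     # node2
print(node_for("alice"))   # node1 (pinned)
```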

So, in summary, I think we will eventually have two directions for
replication: Postgres-R for multi-master, synchronous
replication, and PITR for asynchronous replication. I don't think
there is any value in using PITR for synchronous replication because, by
definition, you don't _store_ the changes for later use when it is
synchronous. In synchronous replication, you communicate your changes to
all the nodes involved, then commit them.

I will describe the use of 'spread' and the Postgres-R internal issues
in my next email.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#10Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Hannu Krosing (#3)
Re: Big 7.4 items

Hannu Krosing wrote:

On Fri, 2002-12-13 at 06:22, Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:
Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

How hard would it be to extend PITR for master-slave (hot backup)
replication, which should then amount to continuously shipping logs to
slave and doing nonstop PITR there :)

It will never be usable for multi-master replication, but somehow it
feels that for master-slave replication simple log replay would be most
simple and robust solution.

Exactly. See my previous email. We will eventually have two replication
solutions: one, Postgres-R for multi-master, and PITR used for
asynchronous master/slave.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#11Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mike Mascari (#4)
Re: Big 7.4 items

Mike Mascari wrote:

What about distributed TX support:

http://groups.google.com/groups?hl=en&amp;lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;threadm=20021106111554.69ae1dcd.pgsql%40snaga.org&amp;rnum=2&amp;prev=/groups%3Fq%3DNAGAYASU%2BSatoshi%26ie%3DUTF-8%26oe%3DUTF-8%26hl%3Den

OK, yes, that is Satoshi's 2-phase commit implementation. I will
address 2-phase commit vs Postgres-R in my next email about spread.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#12Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#5)
Re: Big 7.4 items

darren@up.hrcoxmail.com wrote:

How hard would it be to extend PITR for master-slave (hot backup)
replication, which should then amount to continuously shipping logs to
slave and doing nonstop PITR there :)

I have not looked at the PITR patch yet, but it might be possible
to use the same PITR format to queue/log write sets with Postgres-R,
so we can have multi-master replication and PITR from the same
mechanism.

Yes, we do need a method to send write sets to the various nodes, and
PITR may be a help in getting those write sets. However, it should be
clear that we really aren't archiving-replaying them like you would
think for PITR. We are only grabbing stuff from the PITR to send to
other nodes. We may also be able to use PITR to bring nodes back up to
date if they have fallen out of communication.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#13Mike Mascari
mascarm@mascari.com
In reply to: Bruce Momjian (#11)
Re: Big 7.4 items

Okay. But please keep in mind that a 2-phase commit implementation is used for more than just replication. Any distributed TX will require a 2PC protocol. As an example, for the DBLINK implementation to ultimately be transaction safe (at least amongst multiple PostgreSQL installations), the players in the distributed transaction must all be participants in a 2PC exchange. And a participant whose communications link is dropped needs to be able to recover by asking the coordinator whether or not to complete or abort the distributed TX.

I am 100% ignorant of the distributed TX standard Tom referenced earlier, but I'd guess there might be an assumption of 2PC support in the implementation. In other words, I think we still need 2PC, regardless of the method of replication. And if Satoshi Nagayasu has an implementation ready, why not investigate its possibilities?

Mike Mascari
mascarm@mascari.com

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>


Mike Mascari wrote:

What about distributed TX support:

OK, yes, that is Satoshi's 2-phase commit implementation. I will
address 2-phase commit vs Postgres-R in my next email about spread.

#14Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Joe Conway (#6)
Re: Big 7.4 items

Joe Conway wrote:

Bruce Momjian wrote:

Win32 Port:

Katie Ward and Jan are working on contributing their Win32
port for 7.4. They plan to have a patch available by the end of
December.

I have .Net Studio available to me, so if you need help in merging or testing
or whatever, let me know.

OK, Jan, let him know how he can help.

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

As Hannu asked (and related to your question below), is there any thought of
extending this to allow simple log based replication? In many important
scenarios that would be more than adequate, and simpler to set up.

Yes, see previous email.

I'd question if we would want the one-and-only builtin replication method to
be dependent on an external communication library (Spread). I would like to
see Postgres-R merged, but I'd also like to see a simple log-based option.

OK, let me reiterate: I think we will have two replication solutions in
the end --- one, Postgres-R for multi-master/synchronous, and PITR for
master/slave asynchronous replication.

Let me address the Spread issue and two-phase commit. (Spread is an
open source piece of software used by Postgres-R.)

In two-phase commit, when one node is about to commit, it gets a lock
from all the other nodes, does its commit, then releases the lock. (Of
course, this is simplified.) It is called two-phase because it says to
all the other nodes "I am about to do something, is that OK?", then when
it gets all OK's, it does the commit and says "I did the commit".
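That exchange can be sketched as a toy coordinator (invented names, a deliberately simplified illustration, not any real implementation):

```python
# Toy two-phase commit: phase 1 asks every participant to prepare;
# only if all vote yes does phase 2 tell them to commit, else abort.

class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):          # phase 1: "I am about to do something, OK?"
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):           # phase 2: "I did the commit"
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant(), Participant()]))        # committed
print(two_phase_commit([Participant(), Participant(False)]))   # aborted
```

The performance problem mentioned earlier is visible even in the sketch: every commit pays a full round of prepare messages, and a node that vanishes between the two phases leaves the others blocked or in doubt.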

Postgres-R uses a different mechanism. This method is shown on pages 22
and 24 and following of:

ftp://gborg.postgresql.org/pub/pgreplication/stable/PostgreSQLReplication.pdf.gz

The basic difference is that Spread groups all the write sets into a
queue whose ordering is the same on all the nodes. Instead of asking
for approval for a commit, a node puts its commit in the Spread queue,
and then waits for its own commit to come back in the queue, meaning all
the other nodes saw its commit too.

The only tricky part is that while reading the other nodes' write sets
before its own arrives, it has to check to see if any of these conflict
with its own write set. If it conflicts, it has to assume the earlier
write set succeeded and its own failed. It also has to check the write
set stream and apply only those changes that don't conflict.

As stated before in Postgres-R discussion, this mechanism hinges on
being able to determine which write sets conflict because there is no
explicit "I aborted", only a stream of write sets, and each node has to
accept the non-conflicting ones and reject the conflicting ones.
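A minimal model of that rule (invented names; the real Postgres-R write sets are far richer than key/value pairs): every node applies the same totally ordered queue, and a write set commits only if nothing delivered after its sender's snapshot touched the same rows.

```python
# Toy total-order write-set delivery, as in the Postgres-R/Spread scheme.
# Each write set records which queue position its sender had seen when
# it broadcast (its "snapshot"). On delivery it commits only if no write
# set applied since that snapshot touched the same keys.

def deliver(queue):
    applied = []                       # (position, keys) of committed sets
    data, outcomes = {}, []
    for pos, (snapshot, writes) in enumerate(queue):
        conflict = any(p >= snapshot and keys & writes.keys()
                       for p, keys in applied)
        if conflict:
            outcomes.append("abort")   # the earlier delivery wins
        else:
            data.update(writes)
            applied.append((pos, set(writes)))
            outcomes.append("commit")
    return data, outcomes

# Two nodes concurrently update key "a" (both saw snapshot 0); total
# order means the first delivered wins and the second aborts, on every
# node, with no explicit abort message ever sent.
queue = [(0, {"a": "node1"}), (0, {"a": "node2"}), (2, {"b": "node2"})]
data, outcomes = deliver(queue)
print(outcomes)    # ['commit', 'abort', 'commit']
print(data["a"])   # node1
```

Because every node runs the same deterministic rule over the same queue, they all reach the same verdict without a voting round, which is where the performance win over two-phase commit comes from.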

I'd also second Mike Mascari's question -- whatever happened to the person
working on two-phase commit? Is that likely to be done for 7.4? Did he ever
send in a patch?

I have not seen a patch from him, but it is very likely he could have
one for 7.4. This is why it is good we discuss this now and figure out
where we want to go for 7.4 so we can get started.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#15Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mike Mascari (#13)
Re: Big 7.4 items

Mike Mascari wrote:

Okay. But please keep in mind that a 2-phase commit implementation
is used for more than just replication. Any distributed TX will
require a 2PC protocol. As an example, for the DBLINK implementation
to ultimately be transaction safe (at least amongst multiple
PostgreSQL installations), the players in the distributed
transaction must all be participants in a 2PC exchange. And a
participant whose communications link is dropped needs to be
able to recover by asking the coordinator whether or not to
complete or abort the distributed TX. I am 100% ignorant of the
distributed TX standard Tom referenced earlier, but I'd guess
there might be an assumption of 2PC support in the implementation.
In other words, I think we still need 2PC, regardless of the
method of replication. And if Satoshi Nagayasu has an implementation
ready, why not investigate its possibilities?

This is a good point. I don't want to push Postgres-R as our solution.
Rather, I have looked at both and like Postgres-R, but others need to
look at both and decide so we are all in agreement when we move forward.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#16Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#9)
Re: Big 7.4 items

Bruce Momjian wrote:

OK, the first thing is that there isn't any one replication solution
that will behave optimally in all situations.

Right

Now, let me describe Postgres-R and then the other replication
solutions. Postgres-R is multi-master, meaning you can send SELECT and
UPDATE/DELETE queries to any of the servers in the cluster, and get the
same result. It is also synchronous, meaning it doesn't update the
local copy until it is sure the other nodes agree to the change. It
allows failover, because if one node goes down, the others keep going.

Wrong

It is asynchronous without the need for two-phase commit. It is group
communication based and requires the group communication system to
guarantee total order. The tricky part is that the local transaction
must be held until its own commit message comes back without a prior
lock conflict from a replication transaction. If such a lock conflict
occurs, the replication transaction wins and the local transaction rolls
back.

Now, let me contrast:

rserv and dbmirror do master/slave. There is no mechanism to allow you
to do updates on the slave, and have them propagate to the master. You
can, however, send SELECT queries to the slave, and in fact that's how
usogres does load balancing.

But you cannot use the result of such a SELECT to update anything. So
you can only offload complete read-only transactions to the slaves.
This requires support from the application, since the load-balancing
system cannot automatically know which transactions will be read-only
and which will not.

Two-phase commit is probably the most popular commercial replication
solution. While it works for multi-master, it suffers from poor
performance and doesn't handle cases where one node disappears very
well.

Another replication need is for asynchronous replication, most
traditionally for traveling salesmen who need to update their databases
periodically. The only solution I know for that is PeerDirect's
PostgreSQL commercial offering at http://www.peerdirect.com. It is
possible PITR may help with this, but we need to handle propagating
changes made by the salesmen back up into the server, and to do that, we
will need a mechanism to handle conflicts that occur when two people
update the same records. This is always a problem for asynchronous
replication.

PITR doesn't help here at all, since PeerDirect's replication is trigger
and control table based. What makes our replication system unique is
that it works bidirectionally in a heterogeneous world.

I will describe the use of 'spread' and the Postgres-R internal issues
in my next email.

The last time I was playing with Spread (that was at Great Bridge in
Norfolk), it was IMHO useless (for Postgres-R) because it sometimes
dropped messages when the network load got too high. This occurred
without any indication, no error, nothing. That is not exactly what I
understand as total order. I hope they have made some substantial
progress on that.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#17Mike Mascari
mascarm@mascari.com
In reply to: Bruce Momjian (#15)
Re: Big 7.4 items

----- Original Message -----
From: "Bruce Momjian" <pgman@candle.pha.pa.us>

Mike Mascari wrote:

Okay. But please keep in mind that a 2-phase commit implementation
is used for more than just replication.

This is a good point. I don't want to push Postgres-R as our solution.
Rather, I have looked at both and like Postgres-R, but others need to
look at both and decide so we are all in agreement when we move forward.

After having read your post regarding Spread, I see that it is an alternative to 2PC as a distributed TX protocol. If I understand you correctly, a DBLINK implementation built atop Spread would also be possible. Correct?

The question then is, do other RDBMS expose a 2PC implementation which could then be leveraged at a later time? For example, imagine:

1. 7.4 includes a native 2PC protocol with:

CREATE DATABASE LINK accounting
CONNECT TO accounting.acme.com:5432
IDENTIFIED BY mascarm/mascarm;

SELECT *
FROM employees@accounting;

INSERT INTO employees@accounting
VALUES (1, 'Mike', 'Mascari');

That would be great, allowing PostgreSQL servers running in different departments to participate in a distributed tx.

2. 7.5 includes a DBLINK which supports PostgreSQL participating in a heterogenous distributed transaction (with say, an Oracle database):

CREATE DATABASE LINK finance
CONNECT TO <oracle names entry>
IDENTIFIED BY mascarm/mascarm
USING INTERFACE 'pg2oracle.so';

INSERT INTO employees@finance
VALUES (1, 'Mike', 'Mascari');

I guess I'm basically asking:

1) Is it necessary to *choose* between support for 2PC and Spread (Postgres-R) or can't we have both? Spread for Replication, 2PC for non-replicating distributed TX?

2) Do major SQL DBMS vendors which support distributed options expose a callable interface into a 2PC protocol that would allow PostgreSQL to participate? I could check on this...

3) Are there any standards (besides ODBC, which, the last time I looked just had COMMIT/ABORT APIs), that have been defined and adopted by the industry for distributed tx?

Again, I'd guess most people want:

1) High performance Master/Master replication (r.e. Postgres-R) *and*
2) Ability to participate in distributed tx's (r.e. 2PC?)

Mike Mascari
mascarm@mascari.com

#18Neil Conway
neilc@samurai.com
In reply to: Bruce Momjian (#14)
Re: Big 7.4 items

On Fri, 2002-12-13 at 13:20, Bruce Momjian wrote:

Let me address the Spread issue and two-phase commit. (Spread is an
open source piece of software used by Postgres-R.)

Note that while Spread is open source in the sense that "the source is
available", it's license is significantly more restrictive than
PostgreSQL's:

http://www.spread.org/license/

Just FYI...

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#19Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Neil Conway (#18)
Re: Big 7.4 items

Neil Conway wrote:

On Fri, 2002-12-13 at 13:20, Bruce Momjian wrote:

Let me address the Spread issue and two-phase commit. (Spread is an
open source piece of software used by Postgres-R.)

Note that while Spread is open source in the sense that "the source is
available", its license is significantly more restrictive than
PostgreSQL's:

http://www.spread.org/license/

Interesting. It looks like a modified version of the old BSD license
where you are required to mention you are using Spread. I believe we
can get that reduced. (I think Darren already addressed this with
them.) We certainly are not going to accept software that requires all
PostgreSQL user sites to mention Spread.

The whole "mention" aspect of the old BSD license was pretty ambiguous,
and I assume this is similar.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#20Mike Mascari
mascarm@mascari.com
In reply to: Bruce Momjian (#15)
Re: Big 7.4 items

I wrote:

I guess I'm basically asking:

1) Is it necessary to *choose* between support for 2PC and Spread (Postgres-R) or can't we have both? Spread for Replication, 2PC for non-replicating distributed TX?

2) Do major SQL DBMS vendors which support distributed options expose a callable interface into a 2PC protocol that would allow PostgreSQL to participate? I could check on this...

3) Are there any standards (besides ODBC, which, the last time I looked just had COMMIT/ABORT APIs), that have been defined and adopted by the industry for distributed tx?

Answer:

The Open Group's Open/XA C193 specification of an API for distributed transactions:

http://www.opengroup.org/public/pubs/catalog/c193.htm

I couldn't find any draft copies on the web, but there is a good description at the Sybase site:

http://manuals.sybase.com/onlinebooks/group-xs/xsg1111e/xatuxedo/@ebt-link;pt=61?target=%25N%13_446_START_RESTART_N%25

The standard is 2PC based.

Mike Mascari
mascarm@mascari.com

#21Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#14)
Re: Big 7.4 items

Bruce Momjian wrote:

Joe Conway wrote:

Bruce Momjian wrote:

Win32 Port:

Katie Ward and Jan are working on contributing their Win32
port for 7.4. They plan to have a patch available by the end of
December.

I have .Net Studio available to me, so if you need help in merging or testing
or whatever, let me know.

OK, Jan, let him know how he can help.

My current plan is to comb out just the Win32 port from what we've done
altogether against 7.2.1. The result should be a clean patch that,
applied against 7.2.1, builds a native Windows port.

From there, this patch must be lifted up to 7.4.

I have the original context diff now down from 160,000 lines to 80,000
lines. I think I will have the clean diff against 7.2.1 somewhere next
week. That would IMHO be a good time for Tom to start complaining so
that we can work in the required changes during the 7.4 lifting. ;-)

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#22Noname
darren@up.hrcoxmail.com
In reply to: Jan Wieck (#21)
Re: Big 7.4 items

Note that while Spread is open source in the sense that "the source is
available", its license is significantly more restrictive than
PostgreSQL's:

http://www.spread.org/license/

Interesting. It looks like a modified version of the old BSD license
where you are required to mention you are using Spread. I believe we
can get that reduced. (I think Darren already addressed this with
them.) We certainly are not going to accept software that requires all
PostgreSQL user sites to mention Spread.

I don't think this is the case. We don't redistribute Spread
from the pg-replication site. There are links to the download
area. I don't think this should be any different if
Postgres-R is merged with the main PostgreSQL tree. If
Spread is the group communication we choose to use for
PostgreSQL replication, then I would think some Spread
information would be in order on the advocacy site, and
in any setup documentation for replication.

I have spoken to Yair Amir from the Spread camp on
several occasions, and they are very excited about the
replication project. I'm sure it won't be an issue, but
I will forward this message to him.

Darren

#23Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#22)
Re: Big 7.4 items

darren@up.hrcoxmail.com wrote:

Note that while Spread is open source in the sense that "the source is
available", its license is significantly more restrictive than
PostgreSQL's:

http://www.spread.org/license/

Interesting. It looks like a modified version of the old BSD license
where you are required to mention you are using Spread. I believe we
can get that reduced. (I think Darren already addressed this with
them.) We certainly are not going to accept software that requires all
PostgreSQL user sites to mention Spread.

I don't think this is the case. We don't redistribute Spread
from the pg-replication site. There are links to the download
area. I don't think this should be any different if
Postgres-R is merged with the main PostgreSQL tree. If
Spread is the group communication we choose to use for
PostgreSQL replication, then I would think some Spread
information would be in order on the advocacy site, and
in any setup documentation for replication.

Yes, the question is whether we will ship the Spread code inside our
tarball? I doubt we would ever have replication running by default, but
we may want to ship a binary that was replication-enabled. I am
especially thinking of commercial vendors. Can you imagine Red Hat DB
being required to mention Spread on their web page? I don't think that
will fly.

Of course we will want to mention Spread on our web site and in our
documentation, but we don't want to be forced to, and we don't want that
burden to "spread" out to other users.

I have spoken to Yair Amir from the Spread camp on
several occasions, and they are very excited about the
replication project. I'm sure it won't be an issue, but
I will forward this message to him.

Good.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#24Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jan Wieck (#16)
Re: Big 7.4 items

Jan Wieck wrote:

Bruce Momjian wrote:

OK, the first thing is that there isn't any one replication solution
that will behave optimally in all situations.

Right

Now, let me describe Postgres-R and then the other replication
solutions. Postgres-R is multi-master, meaning you can send SELECT and
UPDATE/DELETE queries to any of the servers in the cluster, and get the
same result. It is also synchronous, meaning it doesn't update the
local copy until it is sure the other nodes agree to the change. It
allows failover, because if one node goes down, the others keep going.

Wrong

It is asynchronous without the need of 2 phase commit. It is group

Well, Darren's PDF at:

ftp://gborg.postgresql.org/pub/pgreplication/stable/PostgreSQLReplication.pdf.gz

calls Postgres-R "Type: Embedded, Peer-to-Peer, Sync". I don't know
enough about replication so I will let you fight it out with him. ;-)

communication based and requires the group communication system to
guarantee total order. The tricky part is that the local transaction
must be on hold until the own commit message comes back without a prior
lock conflict by a replication transaction. If such a lock conflict
occurs, the replication transaction wins and the local transaction rolls
back.

Yep, that's the tricky part.

Now, let me contrast:

rserv and dbmirror do master/slave. There is no mechanism to allow you
to do updates on the slave, and have them propagate to the master. You
can, however, send SELECT queries to the slave, and in fact that's how
usogres does load balancing.

But you cannot use the result of such a SELECT to update anything. So
you can only farm out complete read-only transactions to the slaves.
This requires support from the application, since the load balancing
system cannot know automatically what will be a read-only transaction
and what will not.

Good point. It has to be a read-only session because you can't jump
nodes during a session. That definitely limits its usefulness.

Two-phase commit is probably the most popular commercial replication
solution. While it works for multi-master, it suffers from poor
performance and doesn't handle cases where one node disappears very
well.

Another replication need is for asynchronous replication, most
traditionally for traveling salesmen who need to update their databases
periodically. The only solution I know for that is PeerDirect's
PostgreSQL commercial offering at http://www.peerdirect.com. It is
possible PITR may help with this, but we need to handle propagating
changes made by the salesmen back up into the server, and to do that, we
will need a mechanism to handle conflicts that occur when two people
update the same records. This is always a problem for asynchronous
replication.

PITR doesn't help here at all, since PeerDirect's replication is trigger
and control table based. What makes our replication system unique is
that it works bidirectionally in a heterogeneous world.

I was only suggesting that PITR _may_ help as an archive method for the
changes. PeerDirect stores those changes in a table?

I will describe the use of 'spread' and the Postgres-R internal issues
in my next email.

The last time I was playing with Spread (that was at Great Bridge in
Norfolk), it was IMHO useless (for Postgres-R) because it sometimes
dropped messages when the network load got too high. This occurred
without any indication, no error, nothing. This is not exactly what I
understand as total order. I hope they have made some substantial
progress on that.

That's a serious problem, clearly. Hopefully it is either fixed or it
will get fixed. We can't use it without reliability.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#25Noname
darren@up.hrcoxmail.com
In reply to: Bruce Momjian (#24)
Re: Big 7.4 items

It is asynchronous without the need of 2 phase commit. It is group
communication based and requires the group communication system to
guarantee total order. The tricky part is, that the local transaction
must be on hold until the own commit message comes back without a prior

No, it holds until its own writeset comes back, commits, and
then sends a commit message on the simple channel, so
commits don't wait for ordered writesets.

Remember total order guarantees if no changes in front of
the local changes conflict, the local changes can commit.

lock conflict by a replication transaction. If such a lock conflict
occurs, the replication transaction wins and the local transaction rolls
back.

Correct.

The last time I was playing with Spread (that was at Great Bridge in
Norfolk), it was IMHO useless (for Postgres-R) because it sometimes
dropped messages when the network load got too high. This occurred
without any indication, no error, nothing. This is not exactly what I
understand as total order. I hope they have made some substantial
progress on that.

I remember the TCL tester you set up, and having problems,
but I don't recall investigating what the problems were.
If you still have the code I can try to reproduce the
problem and investigate it on the Spread list.

Darren

#26Noname
darren@up.hrcoxmail.com
In reply to: Noname (#25)
Re: Big 7.4 items

It is asynchronous without the need of 2 phase commit. It is group

Well, Darren's PDF at:

ftp://gborg.postgresql.org/pub/pgreplication/stable/PostgreSQLReplication.pdf.gz

calls Postgres-R "Type: Embedded, Peer-to-Peer, Sync". I don't know
enough about replication so I will let you fight it out with him. ;-)

If we're still defining synchronous as pre-commit, then
Postgres-R is synchronous.

Darren

#27Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#25)
Re: Big 7.4 items

darren@up.hrcoxmail.com wrote:

It is asynchronous without the need of 2 phase commit. It is group
communication based and requires the group communication system to
guarantee total order. The tricky part is, that the local transaction
must be on hold until the own commit message comes back without a prior

No, it holds until its own writeset comes back, commits, and
then sends a commit message on the simple channel, so
commits don't wait for ordered writesets.

Darren, can you clarify this? Why does it send that message? How does
it allow commits not to wait for ordered writesets?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#28Jan Wieck
JanWieck@Yahoo.com
In reply to: Noname (#25)
Re: Big 7.4 items

darren@up.hrcoxmail.com wrote:

It is asynchronous without the need of 2 phase commit. It is group
communication based and requires the group communication system to
guarantee total order. The tricky part is, that the local transaction
must be on hold until the own commit message comes back without a prior

No, it holds until its own writeset comes back, commits, and
then sends a commit message on the simple channel, so
commits don't wait for ordered writesets.

Remember total order guarantees if no changes in front of
the local changes conflict, the local changes can commit.

Right, it's the writeset ... that gets sent just before you flip bits
in the clog, then wait until it comes back and flip 'em.

The last time I was playing with Spread (that was at Great Bridge in
Norfolk), it was IMHO useless (for Postgres-R) because it sometimes
dropped messages when the network load got too high. This occurred
without any indication, no error, nothing. This is not exactly what I
understand as total order. I hope they have made some substantial
progress on that.

I remember the TCL tester you set up, and having problems,
but I don't recall investigating what the problems were.
If you still have the code I can try and reproduce the
problem, and investigate it on the spread list.

Maybe you heard about it, there was this funny conversation while
walking down the hallway:

"Did that German guy ever turn in his notebook?"

"You mean THIS German guy?"

"Yes, did he turn it in?"

"He is here. Right BEHIND YOU!!!"

"Hummmpf ... er ..."

The stuff was on that notebook. Sorry.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#29Noname
darren@up.hrcoxmail.com
In reply to: Jan Wieck (#28)
Re: Big 7.4 items

Darren, can you clarify this? Why does it send that message? How does
it allow commits not to wait for ordered writesets?

There are two channels. One for total order writesets
(changes to the DB). The other is simple order for
aborts, commits, joins (systems joining the replica), etc.
The simple channel is necessary, because we don't want to
wait for total ordered changes to get an abort message and
so forth. In some cases you might get an abort or a commit
message before you get the writeset it refers to.

Let's say we have systems A, B and C. Each one has some
changes and sends a writeset to the group communication
system (GCS). The total order dictates WS(A), WS(B), and
WS(C), and the writesets are received in that order at
each system. Now C gets WS(A), no conflict, gets WS(B), no
conflict, and receives WS(C). Now C can commit WS(C) even
before the commit messages C(A) or C(B), because there is no
conflict.
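The A/B/C scenario above can be sketched as a toy simulation. This is purely illustrative Python, not Postgres-R code; the `Writeset` structure and the row-overlap conflict rule are invented to show the idea that a node may commit its own writeset as soon as it comes back in the total order with no conflicting writeset ahead of it.

```python
from collections import namedtuple

# A writeset records which node produced it and which rows it touches.
Writeset = namedtuple("Writeset", ["origin", "rows"])

def fate_of_own_writeset(node, total_order):
    """Walk the total order as seen by `node` and decide whether the
    node's own writeset commits or aborts.  Total order guarantees
    every node sees the same sequence, so the decision is identical
    everywhere."""
    seen = []
    for ws in total_order:
        if ws.origin == node:
            # Our writeset came back.  If no earlier writeset in the
            # total order touched the same rows, we can commit now --
            # without waiting for commit messages on the simple channel.
            if any(ws.rows & earlier.rows for earlier in seen):
                return "abort"
            return "commit"
        seen.append(ws)
    return "pending"

# The scenario from the mail: disjoint rows, so C commits WS(C) at once.
order = [Writeset("A", {1}), Writeset("B", {2}), Writeset("C", {3})]
print(fate_of_own_writeset("C", order))    # commit

# If WS(A) had also touched row 3, WS(C) would lose the conflict.
order2 = [Writeset("A", {1, 3}), Writeset("B", {2}), Writeset("C", {3})]
print(fate_of_own_writeset("C", order2))   # abort
```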

Hope that helps,

Darren

#30Jan Wieck
JanWieck@Yahoo.com
In reply to: Noname (#29)
Re: Big 7.4 items

darren@up.hrcoxmail.com wrote:

Darren, can you clarify this? Why does it send that message? How does
it allow commits not to wait for ordered writesets?

There are two channels. One for total order writesets
(changes to the DB). The other is simple order for
aborts, commits, joins (systems joining the replica), etc.
The simple channel is necessary, because we don't want to
wait for total ordered changes to get an abort message and
so forth. In some cases you might get an abort or a commit
message before you get the writeset it refers to.

Let's say we have systems A, B and C. Each one has some
changes and sends a writeset to the group communication
system (GCS). The total order dictates WS(A), WS(B), and
WS(C), and the writesets are received in that order at
each system. Now C gets WS(A), no conflict, gets WS(B), no
conflict, and receives WS(C). Now C can commit WS(C) even
before the commit messages C(A) or C(B), because there is no
conflict.

And that is IMHO not synchronous. C does not have to wait for A and B to
finish the same tasks. If now at this very moment two new transactions
query system A and system C (assuming A has not yet committed WS(C)
while C has), they will get different data back (thanks to non-blocking
reads). I think this is pretty asynchronous.

It doesn't lead to inconsistencies, because the transaction on A cannot
do something that is in conflict with the changes made by WS(C), since
its WS(A)2 will come back after WS(C) arrived at A, and thus WS(C)
arriving at A will cause WS(A)2 to roll back (WS used synonymously with
Xact in this context).
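A minimal sketch of that visibility gap, with invented state purely for illustration: C has applied its own writeset, A has not yet, so two simultaneous reads disagree for a moment, yet the states converge once A applies the same writeset.

```python
# Invented toy state, just to picture the window being described.
a_state = {"x": 0}        # A: WS(C) delivered but not yet applied
c_state = {"x": 0}
ws_c = {"x": 1}           # C's writeset

c_state.update(ws_c)      # C commits as soon as WS(C) comes back
reads_disagree = a_state["x"] != c_state["x"]   # True at this moment

# No lasting divergence: once A applies WS(C) the states converge, and
# a conflicting local WS(A)2 would have rolled back because its own
# writeset is ordered after WS(C).
a_state.update(ws_c)
converged = a_state == c_state                  # True
```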

Hope that helps,

Darren

Hope this doesn't add too much confusion :-)

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#31Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#29)
Re: Big 7.4 items

darren@up.hrcoxmail.com wrote:

Darren, can you clarify this? Why does it send that message? How does
it allow commits not to wait for ordered writesets?

There are two channels. One for total order writesets
(changes to the DB). The other is simple order for
aborts, commits, joins (systems joining the replica), etc.
The simple channel is necessary, because we don't want to
wait for total ordered changes to get an abort message and
so forth. In some cases you might get an abort or a commit
message before you get the writeset it refers to.

Let's say we have systems A, B and C. Each one has some
changes and sends a writeset to the group communication
system (GCS). The total order dictates WS(A), WS(B), and
WS(C), and the writesets are received in that order at
each system. Now C gets WS(A), no conflict, gets WS(B), no
conflict, and receives WS(C). Now C can commit WS(C) even
before the commit messages C(A) or C(B), because there is no
conflict.

Oh, so C doesn't apply A's changes until it sees A's commit, but it can
continue with its own changes because there is no conflict?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#32Neil Conway
neilc@samurai.com
In reply to: Jan Wieck (#16)
Re: Big 7.4 items

On Fri, 2002-12-13 at 13:36, Jan Wieck wrote:

But you cannot use the result of such a SELECT to update anything. So
you can only farm out complete read-only transactions to the slaves.
This requires support from the application, since the load balancing
system cannot know automatically what will be a read-only transaction
and what will not.

Interesting -- SQL contains the concept of "read only" and "read write"
transactions (the default is RW). If we implemented that (which
shouldn't be too difficult[1]), it might make differentiating between
classes of transactions a little easier. Client applications would still
need to be modified, but not nearly as much.

Does this sound like it's worth doing?

[1] -- AFAICS, the only tricky implementation detail is deciding exactly
which database operations are "writes". Does nextval() count, for
example?
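Neil's distinction could feed a load balancer roughly like this. The routing helper below is hypothetical (no such PostgreSQL API exists); it also respects the constraint that a session cannot jump nodes, so the session's declared access mode picks one backend for the whole session.

```python
# Hypothetical session router (illustrative only).  A session that
# declares READ ONLY up front can live on a slave; the default
# READ WRITE must go to the master.  Deciding which operations count
# as "writes" (does nextval()?) remains the server's enforcement
# problem, as the footnote notes.

MASTER = "master:5432"
SLAVES = ["slave1:5432", "slave2:5432"]

def route_session(access_mode, session_id):
    """Pick one backend for an entire session; sessions cannot migrate
    between nodes, so only fully read-only sessions use a slave."""
    if access_mode == "READ ONLY":
        return SLAVES[session_id % len(SLAVES)]
    return MASTER
```

For example, `route_session("READ ONLY", 7)` lands on a slave, while the same session declared `READ WRITE` always goes to the master.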

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#33Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Neil Conway (#32)
Re: Big 7.4 items

Neil Conway wrote:

On Fri, 2002-12-13 at 13:36, Jan Wieck wrote:

But you cannot use the result of such a SELECT to update anything. So
you can only farm out complete read-only transactions to the slaves.
This requires support from the application, since the load balancing
system cannot know automatically what will be a read-only transaction
and what will not.

Interesting -- SQL contains the concept of "read only" and "read write"
transactions (the default is RW). If we implemented that (which
shouldn't be too difficult[1]), it might make differentiating between
classes of transactions a little easier. Client applications would still
need to be modified, but not nearly as much.

Does this sound like it's worth doing?

[1] -- AFAICS, the only tricky implementation detail is deciding exactly
which database operations are "writes". Does nextval() count, for
example?

You can't migrate a session between nodes, so the entire _session_ has
to be read-only.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#34Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Bruce Momjian (#15)
Re: Big 7.4 items

This is a good point. I don't want to push Postgres-R as our solution.
Rather, I have looked at both and like Postgres-R, but others need to
look at both and decide so we are all in agreement when we move forward.

I think either way, it's clear that they need to be in the main CVS, in
order for it to get up to speed.

Chris

#35Darren Johnson
darren@up.hrcoxmail.com
In reply to: Noname (#29)
Re: Big 7.4 items

Let's say we have systems A, B and C. Each one has some
changes and sends a writeset to the group communication
system (GCS). The total order dictates WS(A), WS(B), and
WS(C), and the writesets are received in that order at
each system. Now C gets WS(A), no conflict, gets WS(B), no
conflict, and receives WS(C). Now C can commit WS(C) even
before the commit messages C(A) or C(B), because there is no
conflict.

And that is IMHO not synchronous. C does not have to wait for A and B to
finish the same tasks. If now at this very moment two new transactions
query system A and system C (assuming A has not yet committed WS(C)
while C has), they will get different data back (thanks to non-blocking
reads). I think this is pretty asynchronous.

So if we hold WS(C) until we receive commit messages for WS(A) and
WS(B), will that meet your synchronous expectations, or do all the
systems need to commit the WS in the same order and at the same exact
time?

It doesn't lead to inconsistencies, because the transaction on A cannot
do something that is in conflict with the changes made by WS(C), since
its WS(A)2 will come back after WS(C) arrived at A, and thus WS(C)
arriving at A will cause WS(A)2 to roll back (WS used synonymously with
Xact in this context).

Right

Hope this doesn't add too much confusion :-)

No, however I guess I need to adjust my slides to include your
definition of synchronous replication. ;-)

Darren

#36Neil Conway
neilc@samurai.com
In reply to: Christopher Kings-Lynne (#34)
Re: Big 7.4 items

On Fri, 2002-12-13 at 20:20, Christopher Kings-Lynne wrote:

This is a good point. I don't want to push Postgres-R as our solution.
Rather, I have looked at both and like Postgres-R, but others need to
look at both and decide so we are all in agreement when we move forward.

I think either way, it's clear that they need to be in the main CVS, in
order for it to get up to speed.

Why's that?

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#37Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Bruce Momjian (#9)
Re: Big 7.4 items

On Friday 13 December 2002 11:01 pm, you wrote:

Good. This is the discussion we need. Let me quote the TODO list
replication section first:

* Add replication of distributed databases [replication]
o automatic failover

Very good. We need that for HA.

o load balancing

o master/slave replication
o multi-master replication
o partition data across servers

I am interested in this for a multitude of reasons. Scalability is obviously one
of them. But I was just wondering about some things (after going through all the
messages on this).

Once we have partitioning and replication, that effectively means the database
cache can span multiple machines and is no longer restricted by shared memory.
So will it work on Mosix now? Just a thought.

OK, the first thing is that there isn't any one replication solution
that will behave optimally in all situations.

Now, let me describe Postgres-R and then the other replication
solutions. Postgres-R is multi-master, meaning you can send SELECT and
UPDATE/DELETE queries to any of the servers in the cluster, and get the
same result. It is also synchronous, meaning it doesn't update the
local copy until it is sure the other nodes agree to the change. It
allows failover, because if one node goes down, the others keep going.

Now, let me contrast:

rserv and dbmirror do master/slave. There is no mechanism to allow you
to do updates on the slave, and have them propagate to the master. You
can, however, send SELECT queries to the slave, and in fact that's how
usogres does load balancing.

Seems like mirroring vs. striping to me. Can we have both combined in either
fashion, just like RAID?

Most importantly, will it be able to resize the cluster on the fly? Are we
looking at network management of the database like Oracle does? (OK, the tools
are unwarranted in many situations, but it has to offer them.)

Most importantly I would like to see this thing easy to set up from a one
point-of-administration view.

Something like declare a cluster of these n1 machines as database partitions
and have these other n2 machines do a slave sync with them for handling
loads. If these kinds of command-line options are there, adding easy tools on
top of them should be a snap.

And please, in-place database upgrades. Otherwise it will be a huge pain to
maintain such a cluster over long periods of time.

Another replication need is for asynchronous replication, most
traditionally for traveling salesmen who need to update their databases
periodically. The only solution I know for that is PeerDirect's
PostgreSQL commercial offering at http://www.peerdirect.com. It is
possible PITR may help with this, but we need to handle propagating
changes made by the salesmen back up into the server, and to do that, we
will need a mechanism to handle conflicts that occur when two people
update the same records. This is always a problem for asynchronous
replication.

We need not offer full asynchronous replication all at once. We can have
levels of asynchronous replication like read-only (essentially syncing) and
read-write. Even if we get slave sync only at first, that will be a huge plus.

2 If we are going to have replication, can we have built in load
balancing? Is it a good idea to have it in postgresql or a
separate application would be way to go?

Well, because Postgres-R is multi-master, it has automatic load
balancing. You can have your users point to whatever node you want.
You can implement this "pointing" by using dns IP address cycling, or
have a router that auto-load balances, though you would need to keep a
db session on the same node, of course.

Umm, w.r.t. the above point, i.e. combining data partitioning and slave sync,
will it treat a partitioned cluster as a single server, or can that cluster
take care of itself in such situations?

So, in summary, I think we will eventually have two directions for
replication. One is Postgres-R for multi-master, synchronous
replication, and PITR, for asynchronous replication. I don't think

I would put that as two options rather than directions. We need to be able to
deploy them both if required.

Imagine PostgreSQL running over a 500-machine cluster... ;-)

Shridhar

#38Justin Clift
justin@postgresql.org
In reply to: Noname (#25)
Re: Big 7.4 items

darren@up.hrcoxmail.com wrote:

It is asynchronous without the need of 2 phase commit. It is group
communication based and requires the group communication system to
guarantee total order. The tricky part is, that the local transaction
must be on hold until the own commit message comes back without a prior

No, it holds until its own writeset comes back, commits, and
then sends a commit message on the simple channel, so
commits don't wait for ordered writesets.

Remember total order guarantees if no changes in front of
the local changes conflict, the local changes can commit.

Do people have to be careful about how they use sequences, as they don't normally roll back?

Regards and best wishes,

Justin Clift

<snip>

Darren

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

#39Justin Clift
justin@postgresql.org
In reply to: Bruce Momjian (#14)
Re: Big 7.4 items

Bruce Momjian wrote:

Joe Conway wrote:

<snip>

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

As Hannu asked (and related to your question below), is there any thought of
extending this to allow simple log based replication? In many important
scenarios that would be more than adequate, and simpler to set up.

<snip>

For PITR-log-based-replication, how much data would be required to be pushed out to each slave system in order to bring
it up to date?

I'm having visions of a 16MB WAL file being pushed out to slave systems in order to update them with a few rows of data...

:-/

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

#40Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Justin Clift (#39)
Re: Big 7.4 items

On 14 Dec 2002 at 18:02, Justin Clift wrote:

For PITR-log-based-replication, how much data would be required to be pushed out to each slave system in order to bring
it up to date?

I'm having visions of a 16MB WAL file being pushed out to slave systems in order to update them with a few rows of data...

I was under the impression that data is pushed to the slave after a checkpoint
is complete, i.e. a 16MB WAL file has been recycled.

Conversely, a slave would contain accurate data up to the last WAL checkpoint.

I think a tunable WAL size should be of some help in such a scenario. Otherwise
the system designer has to use async replication for granularity down to a
transaction.
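The segment-at-a-time behaviour described above can be sketched as follows. The function, directory names, and segment naming are invented for illustration (this is not an actual PostgreSQL mechanism); the point is simply that a slave fed whole segments lags by up to one full segment of changes.

```python
# Illustrative sketch of segment-based log shipping (all names invented
# for the example).  Only segments that have reached full size -- and
# so are no longer being written -- are copied out, which means a slave
# fed this way is accurate only up to the last full segment: exactly
# the 16MB-granularity concern above.
import os
import shutil

def ship_completed_segments(wal_dir, archive_dir, shipped,
                            segment_size=16 * 1024 * 1024):
    """Copy every full, not-yet-shipped WAL segment to archive_dir and
    return the updated set of shipped segment names."""
    for name in sorted(os.listdir(wal_dir)):
        path = os.path.join(wal_dir, name)
        if name not in shipped and os.path.getsize(path) >= segment_size:
            shutil.copy(path, os.path.join(archive_dir, name))
            shipped.add(name)
    return shipped
```

A tunable `segment_size`, as suggested above, directly shrinks how far the slave can lag behind.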

Bye
Shridhar

--
Conference, n.: A special meeting in which the boss gathers subordinates to
hear what they have to say, so long as it doesn't conflict with what he's
already decided to do.

#41Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Neil Conway (#36)
Re: Big 7.4 items

I think either way, it's clear that they need to be in the main CVS, in
order for it to get up to speed.

Why's that?

Because until replication is in CVS, it won't be used, tested, improved,
and developed as fast...

Chris

#42Al Sutton
al@alsutton.com
In reply to: Noname (#29)
Re: Big 7.4 items - Replication

For live replication could I propose that we consider the systems A, B, and C
connected to each other independently (i.e. A has links to B and C, B has
links to A and C, and C has links to A and B), and that replication is
handled by the node receiving the write-based transaction.

If we consider a write transaction that arrives at A (called WT(A)), system
A will then send WT(A) to systems B and C via its direct connections.
System A will receive back either an OK response if there are no conflicts,
a NOT_OK response if there are conflicts, or no response if the system is
unavailable.

If system A receives a NOT_OK response from any other node, it begins the
process of rolling back the transaction from all nodes which previously
issued an OK, and the transaction returns a failure code to the client which
submitted WT(A). The other systems (B and C) would track recent transactions,
and there would be a specified timeout after which the transaction is
considered safe and could not be rolled back.

Any system not returning an OK or NOT_OK state is assumed to be down, and
error messages are logged to state that the transaction could not be sent to
the system due to its unavailability, and any monitoring system would
alert the administrator that a replicant is faulty.

There would also need to be code developed to ensure that a system could be
brought into sync with the current state of other systems within the group,
in order to allow new databases to be added, and faulty databases to be
re-entered into the group. This code could also be used for non-realtime
replication to allow databases to be synchronised with the live master.

This would give a multi-master solution whereby a write transaction to any
one node would guarantee that all available replicants would also hold the
data once it is completed, and would also provide the code to handle
scenarios where non-realtime data replication is required.

This system assumes that a majority of transactions will be successful (which
should be the case for a well designed system).

Comments?
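A toy coordinator for the proposed scheme might look like this. The peer objects and their `apply()`/`rollback()` methods are invented for illustration: a NOT_OK from any peer rolls the transaction back on every peer that had already accepted, and a silent peer is logged as down without blocking the commit.

```python
# Toy coordinator for the proposed OK/NOT_OK scheme (all interfaces
# invented for illustration, not an existing PostgreSQL API).

def replicate_write(txn, peers, log=print):
    """Send txn to every peer; commit only if no peer objects.
    A NOT_OK rolls the transaction back on peers that already accepted;
    a missing reply marks the peer down but does not block commit."""
    acked = []
    for peer in peers:
        reply = peer.apply(txn)           # "OK", "NOT_OK", or None
        if reply == "OK":
            acked.append(peer)
        elif reply == "NOT_OK":
            for p in acked:               # undo on everyone who accepted
                p.rollback(txn)
            return "FAILED"               # failure code to the client
        else:
            log(f"{peer.name} unavailable; {txn} not replicated there")
    return "COMMITTED"
```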

Al.

----- Original Message -----
From: "Darren Johnson" <darren@up.hrcoxmail.com>
To: "Jan Wieck" <JanWieck@Yahoo.com>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>;
<shridhar_daithankar@persistent.co.in>; "PostgreSQL-development"
<pgsql-hackers@postgresql.org>
Sent: Saturday, December 14, 2002 1:28 AM
Subject: [mail] Re: [HACKERS] Big 7.4 items


Let's say we have systems A, B and C. Each one has some
changes and sends a writeset to the group communication
system (GCS). The total order dictates WS(A), WS(B), and
WS(C), and the writesets are received in that order at
each system. Now C gets WS(A), no conflict, gets WS(B), no
conflict, and receives WS(C). Now C can commit WS(C) even
before the commit messages C(A) or C(B), because there is no
conflict.

And that is IMHO not synchronous. C does not have to wait for A and B to
finish the same tasks. If now at this very moment two new transactions
query system A and system C (assuming A has not yet committed WS(C)
while C has), they will get different data back (thanks to non-blocking
reads). I think this is pretty asynchronous.

So if we hold WS(C) until we receive commit messages for WS(A) and
WS(B), will that meet your synchronous expectations, or do all the
systems need to commit the WS in the same order and at the same exact
time?

It doesn't lead to inconsistencies, because the transaction on A cannot
do something that is in conflict with the changes made by WS(C), since
its WS(A)2 will come back after WS(C) arrived at A, and thus WS(C)
arriving at A will cause WS(A)2 to roll back (WS used synonymously with
Xact in this context).

Right

Hope this doesn't add too much confusion :-)

No, however I guess I need to adjust my slides to include your
definition of synchronous replication. ;-)

Darren


#43Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Al Sutton (#42)
Re: Big 7.4 items - Replication

This sounds like two-phase commit. While it will work, it is probably
slower than Postgres-R's method.

---------------------------------------------------------------------------

Al Sutton wrote:

For live replication could I propose that we consider the systems A, B, and C
connected to each other independently (i.e. A has links to B and C, B has
links to A and C, and C has links to A and B), and that replication is
handled by the node receiving the write-based transaction.

If we consider a write transaction that arrives at A (called WT(A)), system
A will then send WT(A) to systems B and C via its direct connections.
System A will receive back either an OK response if there are no conflicts,
a NOT_OK response if there are conflicts, or no response if the system is
unavailable.

If system A receives a NOT_OK response from any other node it begins the
process of rolling back the transaction from all nodes which previously
issued an OK, and the transaction returns a failure code to the client which
submitted WT(A). The other systems (B and C) would track recent transactions
and there would be a specified timeout after which the transaction is
considered safe and could not be rolled back.
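The accept/rollback flow described above might be sketched roughly like this (a hypothetical illustration; `send`, `replicate`, and the response codes are stand-ins, not PostgreSQL APIs):

```python
# Hypothetical sketch of the OK/NOT_OK flow above. `send` and the
# response codes are illustrative stand-ins, not PostgreSQL APIs.

OK, NOT_OK, DOWN = "OK", "NOT_OK", "DOWN"

def replicate(transaction, peers, send):
    """Send a write transaction to every peer; commit only if none object."""
    acks = {peer: send(peer, transaction) for peer in peers}

    if any(resp == NOT_OK for resp in acks.values()):
        # A conflict somewhere: roll the transaction back on every node
        # that already accepted it, and report failure to the client.
        for peer, resp in acks.items():
            if resp == OK:
                send(peer, ("ROLLBACK", transaction))
        return False

    for peer, resp in acks.items():
        if resp == DOWN:
            # Peer assumed down; log it for the monitoring system.
            print(f"could not send {transaction!r} to {peer}: unavailable")
    return True
```

The client only sees success once every reachable replicant has acknowledged the write, matching the two-part flow described in the message.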

Any system not returning an OK or NOT_OK state is assumed to be down, and
error messages are logged to state that the transaction could not be sent to
the system due to its unavailability, and any monitoring system would
alert the administrator that a replicant is faulty.

There would also need to be code developed to ensure that a system could be
brought into sync with the current state of other systems within the group,
in order to allow new databases to be added and faulty databases to be
re-entered to the group. This code could also be used for non-realtime
replication to allow databases to be synchronised with the live master.

This would give a multi-master solution whereby a write transaction to any
one node would guarantee that all available replicants would also hold the
data once it is completed, and would also provide the code to handle
scenarios where non-realtime data replication is required.

This system assumes that a majority of transactions will be successful (which
should be the case for a well designed system).

Comments?

Al.


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#44 Mathieu Arnold
In reply to: Bruce Momjian (#43)
Re: Big 7.4 items - Replication

--On this fine Saturday, 14 December 2002, at 11:59 -0500,
-- Bruce Momjian wrote with his little fingers:

This sounds like two-phase commit. While it will work, it is probably
slower than Postgres-R's method.

What exactly is Postgres-R's method ?

--
Mathieu Arnold

#45Al Sutton
al@alsutton.com
In reply to: Bruce Momjian (#43)
Re: [mail] Re: Big 7.4 items - Replication

I see it as very difficult to avoid a two-stage process because there will
be the following two parts to any transaction:

1) All databases must agree upon the acceptability of a transaction before
the client can be informed of its success.

2) All databases must be informed as to whether or not the transaction was
accepted by the entire replicant set, and thus whether it should be written
to the database.

If stage 1 is missed then the client application may be informed of a
successful transaction which may fail when it is replicated to other
databases.

If stage 2 is missed then databases may become out of sync because they have
accepted transactions that were rejected by other databases.

From reading the PDF on Postgres-R I can see that one of two things
will occur:

a) There will be a central point of synchronization where conflicts will be
tested and dealt with. This is not desirable because it will leave the
synchronization and replication processing load concentrated in one place,
which will limit scalability as well as leaving a single point of failure.

or

b) The Group Communication blob will consist of a number of processes which
need to talk to all of the others to interrogate them for changes which may
conflict with the current write being handled, and then issue the
transaction response. This is basically the two-phase commit solution with
the phases moved into the group communication process.

I can see the possibility of using solution b and having fewer group
communication processes than databases as an attempt to simplify things, but
this would mean the loss of a number of databases if the machine running the
group communication process for the set of databases is lost.

Al.

#46Darren Johnson
darren@up.hrcoxmail.com
In reply to: Bruce Momjian (#43)
Re: [mail] Re: Big 7.4 items - Replication

b) The Group Communication blob will consist of a number of processes which
need to talk to all of the others to interrogate them for changes which may
conflict with the current write being handled, and then issue the
transaction response. This is basically the two-phase commit solution with
the phases moved into the group communication process.

I can see the possibility of using solution b and having fewer group
communication processes than databases as an attempt to simplify things, but
this would mean the loss of a number of databases if the machine running the
group communication process for the set of databases is lost.

The group communication system doesn't just run on one system. For
Postgres-R using Spread, there is actually a Spread daemon that runs on each
database server. It has nothing to do with detecting the conflicts. Its job
is to deliver messages in a total order for writesets, or simple order for
commits, aborts, joins, etc.

The detection of conflicts will be done at the database level, by a backend
process. The basic concept is: if all databases get the writesets (changes)
in the exact same order, apply them in a consistent order, and avoid
conflicts, then one-copy serialization is achieved (one copy of the database
replicated across all databases in the replica).

I hope that explains the group communication system's responsibility.
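A toy model of that rule: every replica applies writesets in the delivered total order, so every replica reaches the same commit/abort decision (hypothetical code, not Postgres-R's; the conflict test on overlapping row sets is an assumption for illustration):

```python
# Toy model: replicas applying writesets in the same total order.
# A writeset aborts if a conflicting writeset was ordered before it
# since its transaction began. Purely illustrative.

def apply_in_total_order(writesets):
    """writesets: list of (txn_id, start_index, rows_touched) in total order."""
    committed = []          # (order_index, rows) of applied writesets
    outcome = {}
    for index, (txn, started_at, rows) in enumerate(writesets):
        conflict = any(
            applied_at >= started_at and applied_rows & rows
            for applied_at, applied_rows in committed
        )
        if conflict:
            outcome[txn] = "ABORT"
        else:
            committed.append((index, rows))
            outcome[txn] = "COMMIT"
    return outcome

# Both T1 and T2 update row "X"; whichever the total order delivers
# first commits on every replica, and the other aborts on every replica.
order = [("T1", 0, {"X"}), ("T2", 0, {"X"})]
print(apply_in_total_order(order))   # {'T1': 'COMMIT', 'T2': 'ABORT'}
```

Since every daemon delivers the same order, running this loop on each replica independently yields identical outcomes without any extra round of voting.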

Darren

#47Al Sutton
al@alsutton.com
In reply to: Bruce Momjian (#43)
Re: [mail] Re: Big 7.4 items - Replication

Many thanks for the explanation. Could you explain to me how the order of
the writesets is determined for the following scenario?

If a transaction takes 50ms to reach one database from another, for a
specific data element (called X), the following timeline occurs:

at 0ms, T1(X) is written to system A.
at 10ms, T2(X) is written to system B.

Where T1(X) and T2(X) conflict.

My concern is that if the Group Communication Daemon (gcd) is operating on
each database, a successful result for T1(X) will be returned to the client
talking to database A because T2(X) has not reached it, and thus no conflict
is known about, and a successful result is returned to the client submitting
T2(X) to database B because it is not aware of T1(X). This would mean that
the two clients believe both T1(X) and T2(X) completed successfully, yet
they cannot both do so due to the conflict.

Thanks,

Al.


#48David Walker
pgsql@grax.com
In reply to: Al Sutton (#47)
Re: [MLIST] Re: [mail] Re: Big 7.4 items - Replication

Another concern I have with multi-master systems is what happens if the
network splits in two, so that two master systems are taking commits for two
separate sets of clients. It seems to me that re-syncing the two databases
upon the network healing would be a very complex or impossible task.


#49Al Sutton
al@alsutton.com
In reply to: Bruce Momjian (#43)
Re: [MLIST] Re: [mail] Re: Big 7.4 items - Replication

David,

This can be resolved by requiring that, for any transaction to succeed, the
entrypoint database must receive acknowledgements from n/2 + 0.5 (rounded up
to the nearest integer) databases, where n is the total number in the
replicant set. The following cases are shown as an example:

Total Number of databases: 2
Number required to accept transaction: 2

Total Number of databases: 3
Number required to accept transaction: 2

Total Number of databases: 4
Number required to accept transaction: 3

Total Number of databases: 5
Number required to accept transaction: 3

Total Number of databases: 6
Number required to accept transaction: 4

Total Number of databases: 7
Number required to accept transaction: 4

Total Number of databases: 8
Number required to accept transaction: 5

This would prevent two replicant sub-sets forming, because it is impossible
for both sets to have over 50% of the databases.
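The majority rule above (equivalent to floor(n/2) + 1, a strict majority) can be checked in a few lines:

```python
import math

def required_acks(n):
    """Acknowledgements needed for a transaction to succeed:
    ceil(n/2 + 0.5), i.e. a strict majority of the replicant set."""
    return math.ceil(n / 2 + 0.5)

for n in range(2, 9):
    print(n, required_acks(n))   # matches the table above
```

Because two disjoint subsets cannot both contain a strict majority, at most one partition can continue accepting writes after a split.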

Applications could be able to detect when a database has dropped out of the
replicant set because the database could report a state of "Unable to obtain
majority consensus". This would allow applications to differentiate between a
database out of the set, where writing to other databases in the set could
yield a successful result, and "Unable to commit due to conflict", where
trying other databases is pointless.

Al


#50Al Sutton
al@alsutton.com
In reply to: Bruce Momjian (#43)
Re: [mail] Re: Big 7.4 items - Replication

Jonathan,

How do the group communication daemons on system A and B agree that T2 is
after T1?

As I understand it the operation is performed locally before being passed on
to the group for replication, so when T2 arrives at system B, system B has no
knowledge of T1 and so can perform T2 successfully.

I am guessing that system B performs T2 locally, sends it to the group
communication daemon for ordering, and then receives it back from the group
communication order queue, after its position in the order queue has been
decided, before it is written to the database.

This would indicate to me that there is a single central point which decides
that T2 is after T1.

Is this true?

Al.

----- Original Message -----
From: "Jonathan Stanton" <jonathan@cnds.jhu.edu>
To: "Al Sutton" <al@alsutton.com>
Cc: "Darren Johnson" <darren@up.hrcoxmail.com>; "Bruce Momjian"
<pgman@candle.pha.pa.us>; "Jan Wieck" <JanWieck@Yahoo.com>;
<shridhar_daithankar@persistent.co.in>; "PostgreSQL-development"
<pgsql-hackers@postgresql.org>
Sent: Sunday, December 15, 2002 5:00 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication

The total order provided by the group communication daemons guarantees
that every member will see the transactions/writesets in the same order.
So both A and B will see that T1 is ordered before T2 BEFORE writing
anything back to the client. So for both servers T1 will be completed
successfully, and T2 will be aborted because of conflicting writesets.

Jonathan


--
-------------------------------------------------------
Jonathan R. Stanton jonathan@cs.jhu.edu
Dept. of Computer Science
Johns Hopkins University
-------------------------------------------------------

#51Al Sutton
al@alsutton.com
In reply to: Bruce Momjian (#43)
Re: [mail] Re: Big 7.4 items - Replication

Jonathan,

Many thanks for clarifying the situation some more. With token passing, I
have the following concerns:

1) What happens if a server holding the token should die whilst it is in
possession of the token.

2) If I have n servers, and the time to pass the token between each server
is x milliseconds, I may have to wait for up to n times x milliseconds in
order for a transaction to be processed. If a server is limited to a single
transaction per possession of the token (in order to ensure no system hogs
the token), and the server develops a queue of length y, I will have to wait
n times x times y for the transaction to be processed. I believe both
scenarios would not scale well beyond a small subset of servers with low
network latency between them.

If we consider the following situation I can illustrate why I'm still in
favour of a two phase commit;

Imagine, for example, credit card details about the status of an account
replicated in real time between databases in London, Moscow, Singapore,
Sydney, and New York. If any server can talk to any other server with a
guaranteed packet transfer time of 150ms, a two-phase commit could complete
in 600ms in its worst case (assuming that the two phases consist of
request/response pairs, and that each server talks to all the others in
parallel). A token passing system may have to wait for the token to pass
through every other server before reaching the one that has the transaction
committed to it, which could take about 750ms.

If you then expand the network to allow for a primary and disaster recovery
database at each location, the two-phase commit still maintains its 600ms
response time, but the token passing system doubles to 1500ms.
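The worst-case arithmetic above, under the assumptions stated in the text (150ms links, two request/response phases run against all peers in parallel, and the token visiting every server once), works out as:

```python
# Reproduce the worst-case figures above. All values are the
# assumptions stated in the text, not measurements.
link_ms = 150                      # guaranteed one-way transfer time

# Two-phase commit: two request/response round trips, with peers
# contacted in parallel, so the node count does not matter.
two_phase = 2 * 2 * link_ms        # 600 ms

# Token passing: worst case, the token traverses the whole ring
# before the waiting transaction can be ordered.
def token_worst_case(servers):
    return servers * link_ms

print(two_phase, token_worst_case(5), token_worst_case(10))   # 600 750 1500
```

Doubling the number of sites leaves the parallel two-phase figure unchanged but doubles the token circulation time, which is the scaling concern raised here.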

Allowing disjointed segments to continue executing is also a concern,
because any split in the replication group could effectively double the
accepted card limit for any card holder should they purchase items from
various locations around the globe.

I can see an idea that the token may be passed to the system with the most
transactions in a wait state, but this would cause low volume databases to
lose out on response times to higher volume ones, which is, again,
undesirable.

Al.

----- Original Message -----
From: "Jonathan Stanton" <jonathan@cnds.jhu.edu>
To: "Al Sutton" <al@alsutton.com>
Cc: "Darren Johnson" <darren@up.hrcoxmail.com>; "Bruce Momjian"
<pgman@candle.pha.pa.us>; "Jan Wieck" <JanWieck@Yahoo.com>;
<shridhar_daithankar@persistent.co.in>; "PostgreSQL-development"
<pgsql-hackers@postgresql.org>
Sent: Sunday, December 15, 2002 9:17 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication

On Sun, Dec 15, 2002 at 07:42:35PM -0000, Al Sutton wrote:

Jonathan,

How do the group communication daemons on system A and B agree that T2 is
after T1?

Let's split this into two separate problems:

1) How do the daemons totally order a set of messages (abstract
messages)

2) How do database transactions get split into writesets that are sent
as messages through the group communication system.

As to question 1, the set of daemons (usually one running on each
participating server) run a distributed ordering algorithm, as well as
distributed algorithms to provide message reliability, fault-detection,
and membership services. These are completely distributed algorithms, no
"central" controller node exists, so even if network partitions occur
the group communication system keeps running and providing ordering and
reliability guarantees to messages.

A number of different algorithms exist for providing a total order on
messages. Spread currently uses a token algorithm that involves passing a
token between the daemons, with a counter attached to each message, but
other algorithms exist and we have implemented some other ones in our
research. You can find lots of details in the papers at
www.cnds.jhu.edu/publications/ and www.spread.org.
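As a rough illustration of the token-plus-counter idea (a toy sketch, not
Spread's actual algorithm; `Daemon` and `run_ring` are invented names):

```python
from collections import deque

class Daemon:
    """One group-communication daemon; in Spread, one runs per server."""
    def __init__(self, name):
        self.name = name
        self.outbox = deque()   # messages waiting to be ordered
        self.delivered = []     # messages seen, in total order

def run_ring(daemons, rounds=2):
    """Circulate a token carrying a global counter around the ring.
    The token holder stamps its pending messages with consecutive
    sequence numbers and 'broadcasts' them, so every daemon delivers
    the identical sequence without any central controller."""
    counter = 0
    for _ in range(rounds):
        for holder in daemons:            # token moves around the ring
            while holder.outbox:
                msg = holder.outbox.popleft()
                counter += 1              # sequence number rides the token
                for d in daemons:
                    d.delivered.append((counter, holder.name, msg))

a, b = Daemon("A"), Daemon("B")
a.outbox.append("T1")
b.outbox.append("T2")
run_ring([a, b])
assert a.delivered == b.delivered         # same total order everywhere
```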

As to question 2, there are several different approaches to how to use
such a total order for actual database replication. They all use the GCS
total order to establish a single sequence of "events" that all the
databases see. Then each database can act on the events as they are
delivered by the GCS and be guaranteed that no other database will see a
different order.

In the postgres-R case, the action received from a client is performed
partially at the originating postgres server; the writesets are then
sent through the GCS to order them and determine conflicts. Once they
are delivered back, if no conflicts occurred in the meantime, the
original transaction is completed and the result returned to the client.
If a conflict occurred, the original transaction is rolled back and
aborted, and the abort is returned to the client.
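The commit/abort decision just described can be sketched as follows
(hypothetical names, not postgres-R's actual code; conflict detection is
deliberately simplified to overlapping write sets within one delivered
batch):

```python
def apply_writesets(ordered_writesets):
    """ordered_writesets: (origin_server, keys_written) pairs as delivered
    by the group communication system -- identical on every server.
    A writeset aborts if an earlier one in the delivered order already
    wrote any of the same keys (a simplification: it treats every earlier
    writeset in the batch as concurrent)."""
    written = set()
    outcomes = []
    for origin, keys in ordered_writesets:
        if written & keys:
            outcomes.append((origin, "ABORT"))    # conflict: roll back
        else:
            written |= keys
            outcomes.append((origin, "COMMIT"))
    return outcomes

# T1 from server A and a conflicting T2 from server B both write key X.
# Since every server sees the same delivered order, every server reaches
# the same verdict: T1 commits, T2 aborts.
order = [("A", {"X"}), ("B", {"X"})]
assert apply_writesets(order) == [("A", "COMMIT"), ("B", "ABORT")]
```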

As I understand it the operation is performed locally before being passed
on to the group for replication, so when T2 arrives at system B, system B
has no knowledge of T1 and so can perform T2 successfully.

I am guessing that system B performs T2 locally, sends it to the group
communication daemon for ordering, and then receives it back from the
group communication order queue, after its position in the order queue has
been decided, before it is written to the database.

If I understand the above correctly, yes, that is the same as I describe
above.

This would indicate to me that there is a single central point which
decides that T2 is after T1.

No, there is a distributed algorithm that determines the order. The
distributed algorithm "emulates" a central controller who decides the
order, but no single controller actually exists.

Jonathan

----- Original Message -----
From: "Jonathan Stanton" <jonathan@cnds.jhu.edu>
To: "Al Sutton" <al@alsutton.com>
Cc: "Darren Johnson" <darren@up.hrcoxmail.com>; "Bruce Momjian"
<pgman@candle.pha.pa.us>; "Jan Wieck" <JanWieck@Yahoo.com>;
<shridhar_daithankar@persistent.co.in>; "PostgreSQL-development"
<pgsql-hackers@postgresql.org>
Sent: Sunday, December 15, 2002 5:00 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication

The total order provided by the group communication daemons guarantees
that every member will see the transactions/writesets in the same order.
So both A and B will see that T1 is ordered before T2 BEFORE writing
anything back to the client. So for both servers T1 will be completed
successfully, and T2 will be aborted because of conflicting writesets.

Jonathan

On Sun, Dec 15, 2002 at 10:16:22AM -0000, Al Sutton wrote:

Many thanks for the explanation. Could you explain to me the ordering of
the writesets for the following scenario:

If a transaction takes 50ms to reach one database from another, then for a
specific data element (called X), the following timeline occurs:

at 0ms, T1(X) is written to system A.
at 10ms, T2(X) is written to system B.

Where T1(X) and T2(X) conflict.

My concern is that if the Group Communication Daemon (gcd) is operating on
each database, a successful result for T1(X) will be returned to the client
talking to database A because T2(X) has not reached it, and thus no conflict
is known about, and a successful result is returned to the client submitting
T2(X) to database B because it is not aware of T1(X). This would mean that
the two clients believe both T1(X) and T2(X) completed successfully, yet
they cannot both do so due to the conflict.

Thanks,

Al.

----- Original Message -----
From: "Darren Johnson" <darren@up.hrcoxmail.com>
To: "Al Sutton" <al@alsutton.com>
Cc: "Bruce Momjian" <pgman@candle.pha.pa.us>; "Jan Wieck"
<JanWieck@Yahoo.com>; <shridhar_daithankar@persistent.co.in>;
"PostgreSQL-development" <pgsql-hackers@postgresql.org>
Sent: Saturday, December 14, 2002 6:48 PM
Subject: Re: [mail] Re: [HACKERS] Big 7.4 items - Replication

b) The Group Communication blob will consist of a number of processes
which need to talk to all of the others to interrogate them for changes
which may conflict with the current write being handled, and then issue
the transaction response. This is basically the two-phase commit solution
with the phases moved into the group communication process.

I can see the possibility of using solution b and having fewer group
communication processes than databases as an attempt to simplify things,
but this would mean the loss of a number of databases if the machine
running the group communication process for the set of databases is lost.

The group communication system doesn't just run on one system. For
postgres-r using spread, there is actually a spread daemon that runs on
each database server. It has nothing to do with detecting the conflicts.
Its job is to deliver messages in a total order for writesets, or simple
order for commits, aborts, joins, etc.

The detection of conflicts will be done at the database level, by backend
processes. The basic concept is: if all databases get the writesets
(changes) in the exact same order and apply them in a consistent order,
avoiding conflicts, then one-copy serialization is achieved (one copy of
the database replicated across all databases in the replica).

I hope that explains the group communication system's responsibility.

Darren

---------------------------(end of broadcast)---------------------------

TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html

---------------------------(end of broadcast)---------------------------

TIP 6: Have you searched our list archives?

http://archives.postgresql.org

--
-------------------------------------------------------
Jonathan R. Stanton jonathan@cs.jhu.edu
Dept. of Computer Science
Johns Hopkins University
-------------------------------------------------------


#52Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#43)
Re: [mail] Re: Big 7.4 items - Replication

Darren Johnson wrote:

The group communication system doesn't just run on one system. For
postgres-r using spread

The reason why group communication software is used is simply because
this software is designed with two goals in mind:

1) optimize bandwidth usage

2) make many-to-many communication easy

Number one is done by utilizing things like multicasting where
available.

Number two is done by using global scoped queues.

I add this only to avoid reading that pushing some PITR log snippets via
FTP, or worse, over a network would do the same. It did not in the past,
it does not now, and it will not in the future.

Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#53Greg Copeland
greg@CopelandConsulting.Net
In reply to: Hannu Krosing (#3)
Re: Big 7.4 items

On Fri, 2002-12-13 at 04:53, Hannu Krosing wrote:

On Fri, 2002-12-13 at 06:22, Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:
Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

How hard would it be to extend PITR for master-slave (hot backup)
replication, which should then amount to continuously shipping logs to the
slave and doing nonstop PITR there :)

It will never be usable for multi-master replication, but somehow it feels
that for master-slave replication simple log replay would be the simplest
and most robust solution.

I'm curious, what would be the recovery strategy for PITR master-slave
replication should the master fail (assuming hot fail-over/backup)? A
simple dump/restore? Are there any facilities in PostgreSQL for PITR
archival which prevent PITR logs from being recycled (or perhaps simply
archive them off)? What about PITR streaming to networked and/or
removable media?

--
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting

#54Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Greg Copeland (#53)
Re: Big 7.4 items

On Monday 16 December 2002 07:26 pm, you wrote:

I'm curious, what would be the recovery strategy for PITR master-slave
replication should the master fail (assuming hot fail-over/backup)? A
simple dump/restore? Are there any facilities in PostgreSQL for PITR
archival which prevent PITR logs from being recycled (or perhaps simply
archive them off)? What about PITR streaming to networked and/or
removable media?

In asynchronous replication, WAL log records are fed to another host, which
replays those transactions to sync the data. This way it does not matter if
the WAL log is recycled, as it is already replicated someplace else.
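The shipping side of that loop might look something like this in outline (a
sketch with invented paths and function names; PostgreSQL of this era had
no built-in hook for it):

```python
import os
import shutil
import tempfile

def ship_new_segments(master_wal_dir, slave_inbox_dir, shipped):
    """Copy any WAL segment not yet shipped into the slave's inbox, where
    a standby process would replay it; returns the names shipped this
    pass. A real shipper would loop with a sleep and skip the segment
    still being written."""
    new = []
    for seg in sorted(os.listdir(master_wal_dir)):
        if seg not in shipped:
            shutil.copy(os.path.join(master_wal_dir, seg),
                        os.path.join(slave_inbox_dir, seg))
            shipped.add(seg)
            new.append(seg)
    return new

# Demo with throwaway directories standing in for pg_xlog and the inbox:
master = tempfile.mkdtemp()
slave = tempfile.mkdtemp()
open(os.path.join(master, "0000000100000042"), "w").close()
assert ship_new_segments(master, slave, set()) == ["0000000100000042"]
assert os.listdir(slave) == ["0000000100000042"]
```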

HTH

Shridhar

#55Greg Copeland
greg@CopelandConsulting.Net
In reply to: Shridhar Daithankar (#54)
Re: Big 7.4 items

I must have miscommunicated here, as you're describing PITR replication.
I'm asking about a master failing and the slave picking up. Now, some
n-time later, how do you recover your master system to be back in sync
with the slave? Obviously, I'm assuming some level of manual recovery.
I'm wondering what the general approach would be.

Consider that on the slave, which is now the active server (master dead),
it's possible that the slave's PITR logs will be recycled before the master
can come back up. As such, unless there is (a) an archival process for
PITR or (b) a method of streaming PITR logs off for archival, the odds of
using PITR to recover the master (resync, if you will) seem greatly
reduced, as you will be unable to replay PITR on the master for
synchronization.

Greg

On Mon, 2002-12-16 at 08:02, Shridhar Daithankar wrote:

On Monday 16 December 2002 07:26 pm, you wrote:

I'm curious, what would be the recovery strategy for PITR master-slave
replication should the master fail (assuming hot fail-over/backup)? A
simple dump/restore? Are there any facilities in PostgreSQL for PITR
archival which prevent PITR logs from being recycled (or perhaps simply
archive them off)? What about PITR streaming to networked and/or
removable media?

In asynchronous replication, WAL log records are fed to another host, which
replays those transactions to sync the data. This way it does not matter if
the WAL log is recycled, as it is already replicated someplace else.

HTH

Shridhar

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

--
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting

#56Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Greg Copeland (#55)
Re: Big 7.4 items

On Monday 16 December 2002 07:43 pm, you wrote:

Consider that on the slave, which is now the active server (master dead),
it's possible that the slave's PITR logs will be recycled before the master
can come back up. As such, unless there is (a) an archival process for
PITR or (b) a method of streaming PITR logs off for archival, the odds of
using PITR to recover the master (resync, if you will) seem greatly
reduced, as you will be unable to replay PITR on the master for
synchronization.

I agree. Since we are talking about features in a future release, I think
it should be added to the TODO if not already there.

I don't know about WAL numbering, but AFAIU, it increments and old files
are removed once there are enough WAL files, as specified in
postgresql.conf. IIRC there are some Perl-based replication projects that
already exist which use this feature.

Shridhar

#57Greg Copeland
greg@CopelandConsulting.Net
In reply to: Shridhar Daithankar (#56)
Re: Big 7.4 items

On Mon, 2002-12-16 at 08:20, Shridhar Daithankar wrote:

On Monday 16 December 2002 07:43 pm, you wrote:

Consider that on the slave, which is now the active server (master dead),
it's possible that the slave's PITR logs will be recycled before the master
can come back up. As such, unless there is (a) an archival process for
PITR or (b) a method of streaming PITR logs off for archival, the odds of
using PITR to recover the master (resync, if you will) seem greatly
reduced, as you will be unable to replay PITR on the master for
synchronization.

I agree. Since we are talking about features in future release, I think it
should be added to TODO if not already there.

I don't know about WAL numbering, but AFAIU, it increments and old files
are removed once there are enough WAL files, as specified in
postgresql.conf. IIRC there are some Perl-based replication projects that
already exist which use this feature.

The problem with this is that most people, AFAICT, are going to size the
WAL based on their performance/sizing requirements and not based on
theoretical estimates which someone might make to allow for a window of
failure. That is, I don't believe increasing the number of WALs is going
to satisfactorily address the issue.

--
Greg Copeland <greg@copelandconsulting.net>
Copeland Computer Consulting

#58Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Greg Copeland (#57)
Re: Big 7.4 items

On Monday 16 December 2002 08:07 pm, you wrote:

On Mon, 2002-12-16 at 08:20, Shridhar Daithankar wrote:

I don't know about WAL numbering, but AFAIU, it increments and old files
are removed once there are enough WAL files, as specified in
postgresql.conf. IIRC there are some Perl-based replication projects that
already exist which use this feature.

The problem with this is that most people, AFAICT, are going to size the
WAL based on their performance/sizing requirements and not based on
theoretical estimates which someone might make to allow for a window of
failure. That is, I don't believe increasing the number of WALs is going
to satisfactorily address the issue.

Sorry for not being clear. When I said "WAL numbering", I meant the WAL
naming conventions where numbers are used to mark WAL files.

It is not the number of WAL files. That is entirely up to the
installation and IIRC, even in the replication project (sorry, I forgot
the exact name), you can set the number of WAL files that it can have.

Shridhar

#59Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Shridhar Daithankar (#58)
Re: Big 7.4 items

Shridhar Daithankar wrote:

On Monday 16 December 2002 08:07 pm, you wrote:

On Mon, 2002-12-16 at 08:20, Shridhar Daithankar wrote:

I don't know about WAL numbering, but AFAIU, it increments and old files
are removed once there are enough WAL files, as specified in
postgresql.conf. IIRC there are some Perl-based replication projects that
already exist which use this feature.

The problem with this is that most people, AFAICT, are going to size the
WAL based on their performance/sizing requirements and not based on
theoretical estimates which someone might make to allow for a window of
failure. That is, I don't believe increasing the number of WALs is going
to satisfactorily address the issue.

Sorry for not being clear. When I said "WAL numbering", I meant the WAL
naming conventions where numbers are used to mark WAL files.

It is not the number of WAL files. That is entirely up to the
installation and IIRC, even in the replication project (sorry, I forgot
the exact name), you can set the number of WAL files that it can have.

Basically, PITR is going to have a way to archive off a log of database
changes, either from WAL or from somewhere else. At some point, there
is going to have to be administrative action which says, "I have a
master down for three days. I am going to have to save my PITR logs for
that period." So, PITR will probably be used for recovery of a failed
master, and such recovery is going to have to require some administrative
action _if_ the automatic expiration of PITR logs is shorter than the
duration the master is down.
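The administrative hold being described could look like this in outline (a
sketch only; the paths, names, and the hold flag are invented, not a
planned 7.4 interface):

```python
import os
import tempfile

def expire_pitr_logs(archive_dir, keep_last, master_down_hold=False):
    """Delete all but the newest keep_last archived segments, unless an
    administrative hold (e.g. 'master down for three days') suspends
    expiration until the failed master has been resynced."""
    if master_down_hold:
        return []                  # keep everything while the hold is set
    segments = sorted(os.listdir(archive_dir))
    doomed = segments[:-keep_last] if keep_last else segments
    for seg in doomed:
        os.remove(os.path.join(archive_dir, seg))
    return doomed

# Demo with a throwaway directory standing in for the PITR archive:
archive = tempfile.mkdtemp()
for name in ("seg_001", "seg_002", "seg_003"):
    open(os.path.join(archive, name), "w").close()

assert expire_pitr_logs(archive, keep_last=2, master_down_hold=True) == []
assert expire_pitr_logs(archive, keep_last=2) == ["seg_001"]
```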

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#60Patrick Macdonald
patrickm@redhat.com
In reply to: Bruce Momjian (#1)
Re: Big 7.4 items

Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:

[snip]

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

Neil Conway and I will be working on this starting the beginning
of January. By the middle of January, we hope to have a handle on
an ETA.

Cheers,
Patrick
--
Patrick Macdonald
Red Hat Database Development

#61Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Patrick Macdonald (#60)
Re: Big 7.4 items

Patrick Macdonald wrote:

Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:

[snip]

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

Neil Conway and I will be working on this starting the beginning
of January. By the middle of January, we hope to have a handle on
an ETA.

Ewe, that is later than I was hoping. I have put J.R's PITR patch up
at:

ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz

(I have tried to contact J.R. several times over the past few months,
with no reply.)

J.R. felt it was ready to go. I would like to have an evaluation of the
patch to know what it does and what is missing. I would like that
sooner rather than later because:

o I want to avoid too much code drift
o I don't want to find there are major pieces missing and to
not have enough time to implement them in 7.4
o It is a big feature so I would like sufficient testing before beta

OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation. So, it seems PITR is moving forward. Neil, can
you comment on where you are with this, and what still needs to be done?
Do we need to start looking at log archiving options? How are the PITR
log contents different from the WAL log contents, except of course having
no pre-write page images?

If we need to discuss things, perhaps we can do it now and get folks
working on other pieces, or at least thinking about them.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#62Neil Conway
neilc@samurai.com
In reply to: Bruce Momjian (#61)
Re: Big 7.4 items

On Mon, 2002-12-16 at 13:38, Bruce Momjian wrote:

OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation. So, it seems PITR is moving forward. Neil, can
you comment on where you are with this, and what still needs to be done?

Well, I should have a preliminary merge of the old PITR patch with CVS
HEAD done by Wednesday or Thursday. It took me a while to merge because
(a) I've got final exams at university at the moment (b) I had to merge
most of it by hand, as much of the diff is a single hunk (!), for some
reason.

As for the status of the code, I haven't really had a chance to evaluate
it; as Patrick noted, I think we should be able to give you an ETA by
the middle of January or so (I'll be offline starting Thursday until the
first week of January).

Cheers,

Neil
--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC

#63Patrick Macdonald
patrickm@redhat.com
In reply to: Bruce Momjian (#61)
Re: Big 7.4 items

Bruce Momjian wrote:

Patrick Macdonald wrote:

Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:

[snip]

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

Neil Conway and I will be working on this starting the beginning
of January. By the middle of January, we hope to have a handle on
an ETA.

Ewe, that is later than I was hoping. I have put J.R's PITR patch up
at:

ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz

(I have tried to contact J.R. several times over the past few months,
with no reply.)

J.R. felt it was ready to go. I would like to have an evaluation of the
patch to know what it does and what is missing. I would like that
sooner rather than later because:

o I want to avoid too much code drift
o I don't want to find there are major pieces missing and to
not have enough time to implement them in 7.4
o It is a big feature so I would like sufficient testing before beta

OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation. So, it seems PITR is moving forward.

Well, sort of. I stated that Neil was already working on merging the
patch into the CVS tip. I also mentioned that there are missing
pieces but have no idea if Neil is currently working on them.

Cheers,
Patrick
--
Patrick Macdonald
Red Hat Database Development

#64Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Patrick Macdonald (#63)
Re: Big 7.4 items

Patrick Macdonald wrote:

OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation. So, it seems PITR is moving forward.

Well, sort of. I stated that Neil was already working on merging the
patch into the CVS tip. I also mentioned that there are missing
pieces but have no idea if Neil is currently working on them.

Oh, OK. What I would like to do is find out what actually needs to be
done so we can get folks started on it. If we can get a 7.3 merge,
maybe we should get it into CVS and then list the items needing
attention and folks can submit patches to implement those.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#65Janardhan
jana-reddy@mediaring.com.sg
In reply to: Bruce Momjian (#61)
Re: Big 7.4 items

The file "ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz"
does not have read permissions, so it cannot be copied. Please provide
read permissions.

Regards
jana


Patrick Macdonald wrote:

Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:

[snip]

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

Neil Conway and I will be working on this starting the beginning
of January. By the middle of January, we hope to have a handle on
an ETA.

Ewe, that is later than I was hoping. I have put J.R's PITR patch up
at:

ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz

(I have tried to contact J.R. several times over the past few months,
with no reply.)

J.R. felt it was ready to go. I would like to have an evaluation of the
patch to know what it does and what is missing. I would like that
sooner rather than later because:

o I want to avoid too much code drift
o I don't want to find there are major pieces missing and to
not have enough time to implement them in 7.4
o It is a big feature so I would like sufficient testing before beta

OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation. So, it seems PITR is moving forward. Neil, can
you comment on where you are with this, and what still needs to be done?
Do we need to start looking at log archiving options? How is the PITR
log contents different from the WAL log contents, except of course no
pre-write page images?

If we need to discuss things, perhaps we can do it now and get folks
working on other pieces, or at least thinking about them.

#66Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Janardhan (#65)
Re: Big 7.4 items

Oops, sorry. Permissions fixed.

---------------------------------------------------------------------------

Janardhan wrote:

The file "ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz"
does not have read permissions, so it cannot be copied. Please provide
read permissions.

Regards
jana

Patrick Macdonald wrote:

Bruce Momjian wrote:

I wanted to outline some of the big items we are looking at for 7.4:

[snip]

Point-In-Time Recovery (PITR)

J. R. Nield did a PITR patch late in 7.3 development, and Patrick
MacDonald from Red Hat is working on merging it into CVS and
adding any missing pieces. Patrick, do you have an ETA on that?

Neil Conway and I will be working on this starting the beginning
of January. By the middle of January, we hope to have a handle on
an ETA.

Ewe, that is later than I was hoping. I have put J.R's PITR patch up
at:

ftp://candle.pha.pa.us/pub/postgresql/PITR_20020822_02.gz

(I have tried to contact J.R. several times over the past few months,
with no reply.)

J.R. felt it was ready to go. I would like to have an evaluation of the
patch to know what it does and what is missing. I would like that
sooner rather than later because:

o I want to avoid too much code drift
o I don't want to find there are major pieces missing and to
not have enough time to implement them in 7.4
o It is a big feature so I would like sufficient testing before beta

OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation. So, it seems PITR is moving forward. Neil, can
you comment on where you are with this, and what still needs to be done?
Do we need to start looking at log archiving options? How is the PITR
log contents different from the WAL log contents, except of course no
pre-write page images?

If we need to discuss things, perhaps we can do it now and get folks
working on other pieces, or at least thinking about them.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#67Thomas O'Connell
tfo@monsterlabs.com
In reply to: Bruce Momjian (#61)
Re: Big 7.4 items

So if this gets added to the 7.3 branch, will there be documentation
accompanying it?

-tfo

In article <200212161838.gBGIcb717436@candle.pha.pa.us>,
pgman@candle.pha.pa.us (Bruce Momjian) wrote:


OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation. So, it seems PITR is moving forward.

#68Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Thomas O'Connell (#67)
Re: Big 7.4 items

I meant he is merging it into HEAD, not the 7.3 CVS. Sorry for the
confusion.

---------------------------------------------------------------------------

Thomas O'Connell wrote:

So if this gets added to the 7.3 branch, will there be documentation
accompanying it?

-tfo

In article <200212161838.gBGIcb717436@candle.pha.pa.us>,
pgman@candle.pha.pa.us (Bruce Momjian) wrote:

OK, I just talked to Patrick on the phone, and he says Neil Conway is
working on merging the code into 7.3, and adding missing pieces like
logging table creation. So, it seems PITR is moving forward.

---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073