Inconsistent DB data in Streaming Replication

Started by Samrat Revagade · about 13 years ago · 52 messages · pgsql-hackers
#1 Samrat Revagade
revagade.samrat@gmail.com

Hello,

We have been trying to figure out possible solutions to the following
problem in streaming replication. Consider the following scenario:

When the master receives a commit command, it writes and flushes the commit
WAL records to disk. It also writes and flushes the data pages related to
this transaction.

The master then sends WAL records to the standby, up to the commit WAL
record. If failover happens before these records are sent, the old master
is ahead of the standby, which is now the new master, in terms of DB data,
leading to inconsistent data.

One solution to avoid this situation is to have the master send WAL records
to the standby and wait for an ACK confirming that the standby has flushed
them to disk, and only after that write the data pages related to this
transaction on the master.

The main drawback would be increased wait time for the client, due to the
extra round trip to the standby before the master sends an ACK to the
client. Are there any other issues with this approach?
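To make the proposal concrete, the extra gate might look like the following minimal sketch in C (all names are hypothetical; this is not actual PostgreSQL code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the proposed rule: a data page may only be
 * written out once all WAL that modified it (up to the page's LSN) has
 * been flushed locally AND acknowledged as flushed by the standby. */
typedef uint64_t lsn_t;

bool
can_write_page(lsn_t page_lsn, lsn_t local_flush_lsn, lsn_t standby_flush_lsn)
{
    /* WAL-before-data must hold on the master and on the standby. */
    return page_lsn <= local_flush_lsn && page_lsn <= standby_flush_lsn;
}
```

Under this check, a page whose LSN is past the standby's acknowledged flush position simply waits, instead of being written out locally.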

Thank you,

Samrat

#2 Shaun Thomas
sthomas@optionshouse.com
In reply to: Samrat Revagade (#1)
Re: Inconsistent DB data in Streaming Replication

On 04/08/2013 05:34 AM, Samrat Revagade wrote:

> One solution to avoid this situation is have the master send WAL
> records to standby and wait for ACK from standby committing WAL files
> to disk and only after that commit data page related to this
> transaction on master.

Isn't this basically what synchronous replication does in PG 9.1+?

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Samrat Revagade (#1)
Re: Inconsistent DB data in Streaming Replication

Samrat Revagade <revagade.samrat@gmail.com> writes:

> We have been trying to figure out possible solutions to the following
> problem in streaming replication Consider following scenario:
>
> If master receives commit command, it writes and flushes commit WAL records
> to the disk, It also writes and flushes data page related to this
> transaction.
>
> The master then sends WAL records to standby up to the commit WAL record.
> But before sending these records if failover happens then, old master is
> ahead of standby which is now the new master in terms of DB data leading
> to inconsistent data.

I don't exactly see the problem ... unless you're imagining that master
and slave share the same data storage or something like that. That's
not going to work for a ton of reasons besides this one.

regards, tom lane


#4 Ants Aasma
ants.aasma@cybertec.at
In reply to: Shaun Thomas (#2)
Re: Inconsistent DB data in Streaming Replication

On Mon, Apr 8, 2013 at 6:50 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:

> On 04/08/2013 05:34 AM, Samrat Revagade wrote:
>
>> One solution to avoid this situation is have the master send WAL
>> records to standby and wait for ACK from standby committing WAL files
>> to disk and only after that commit data page related to this
>> transaction on master.
>
> Isn't this basically what synchronous replication does in PG 9.1+?

Not exactly. Sync-rep ensures that commit success is not sent to the
client before a synchronous replica acks the commit record. What
Samrat is proposing here is that WAL is not flushed to the OS before
it is acked by a synchronous replica so recovery won't go past the
timeline change made in failover, making it necessary to take a new
base backup to resync with the new master. I seem to remember this
being discussed when sync rep was committed. I don't recall if the
idea was discarded only on performance grounds or whether there were
other issues too.

Thinking about it now, the requirement is that after a crash and
failover to a sync replica we should be able to reuse the datadir to
replicate from the new master without inconsistency. We should be able
to achieve that by ensuring that we don't write out pages until we
have received an ack from the sync replica, and that we check for
possible timeline switches before recovering local WAL. For the first,
it seems to me that it should be enough to rework the updating of
XlogCtl->LogwrtResult.Flush so it accounts for the sync replica. For
the second part, I think Heikki's work on enabling timeline switches
over streaming connections already ensures this (I haven't checked it
out in detail), but if not, it shouldn't be too hard to add.
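As a rough illustration of the rework suggested above (invented names, not the real XLogCtl structures), the flush pointer that gates data-page writes would become the minimum of the local flush position and the position acked by the sync replica:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: the effective "safe to write data pages up to
 * here" pointer is the lower of the local WAL flush LSN and the LSN
 * the synchronous replica has confirmed flushing. */
typedef uint64_t lsn_t;

lsn_t
effective_flush_lsn(lsn_t local_flush, lsn_t sync_replica_flush)
{
    return (local_flush < sync_replica_flush) ? local_flush
                                              : sync_replica_flush;
}
```

Whichever side lags, the master's background writer and checkpointer would treat that lower LSN as the write horizon for data pages.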

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#5 Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#4)
Re: Inconsistent DB data in Streaming Replication

On 2013-04-08 19:26:33 +0300, Ants Aasma wrote:

> On Mon, Apr 8, 2013 at 6:50 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
>
>> On 04/08/2013 05:34 AM, Samrat Revagade wrote:
>>
>>> One solution to avoid this situation is have the master send WAL
>>> records to standby and wait for ACK from standby committing WAL files
>>> to disk and only after that commit data page related to this
>>> transaction on master.
>>
>> Isn't this basically what synchronous replication does in PG 9.1+?
>
> Not exactly. Sync-rep ensures that commit success is not sent to the
> client before a synchronous replica acks the commit record. What
> Samrat is proposing here is that WAL is not flushed to the OS before
> it is acked by a synchronous replica so recovery won't go past the
> timeline change made in failover, making it necessary to take a new
> base backup to resync with the new master. I seem to remember this
> being discussed when sync rep was committed. I don't recall if the
> idea was discarded only on performance grounds or whether there were
> other issues too.

That's not going to work for a fair number of reasons:
* WAL is streamed *from disk*, not from memory
* what if the local node crashes/restarts immediately? Then the standby
is farther ahead than the master.
* the performance implications of never writing data before flushing it
are pretty severe
* ...

So this doesn't seem to solve anything.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#5)
Re: Inconsistent DB data in Streaming Replication

On Mon, Apr 8, 2013 at 7:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> On 2013-04-08 19:26:33 +0300, Ants Aasma wrote:
>
>> Not exactly. Sync-rep ensures that commit success is not sent to the
>> client before a synchronous replica acks the commit record. What
>> Samrat is proposing here is that WAL is not flushed to the OS before
>> it is acked by a synchronous replica so recovery won't go past the
>> timeline change made in failover, making it necessary to take a new
>> base backup to resync with the new master. I seem to remember this
>> being discussed when sync rep was committed. I don't recall if the
>> idea was discarded only on performance grounds or whether there were
>> other issues too.
>
> Thats not going to work for a fair number of reasons:
> * wal is streamed *from disk* not from memory

Yeah, this one alone makes the do-not-flush-before-replicating
approach impractical.

> * what if the local node crashes/restarts immediately? Then the standby
> is farther ahead than the master.
> * the performance implications of never writing data before flushing it
> are pretty severe
> * ...
>
> So this doesn't seem to solve anything.

Yeah, delaying WAL writes until replication is successful seems
impractical, but I don't see why we couldn't optionally take into
account walsender write pointers when considering whether we can write
out a page. Sure, there will be some performance hit for waiting to
replicate WAL, but on the other hand having to rsync a huge database
isn't too good for performance either.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#7 Fujii Masao
masao.fujii@gmail.com
In reply to: Samrat Revagade (#1)
Re: Inconsistent DB data in Streaming Replication

On Mon, Apr 8, 2013 at 7:34 PM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:

> Hello,
>
> We have been trying to figure out possible solutions to the following problem in streaming replication Consider following scenario:
>
> If master receives commit command, it writes and flushes commit WAL records to the disk, It also writes and flushes data page related to this transaction.
>
> The master then sends WAL records to standby up to the commit WAL record. But before sending these records if failover happens then, old master is ahead of standby which is now the new master in terms of DB data leading to inconsistent data.

Why do you think that the inconsistent data after failover is a problem?
Because it's one of the reasons why a fresh base backup is required when
starting the old master as a new standby? If yes, I agree with you. I've
often heard complaints about the backup needed when restarting the new
standby. That's a really big problem.

The timeline mismatch after failover was one of the reasons why a backup
is required. But, thanks to Heikki's recent work, that's solved, i.e., the
timeline mismatch would be automatically resolved when starting replication
in 9.3. So, the remaining problem is an inconsistent database.

> One solution to avoid this situation is have the master send WAL records to standby and wait for ACK from standby committing WAL files to disk and only after that commit data page related to this transaction on master.

You mean to make the master delay the data page write until the WAL has been
not only flushed to disk but also replicated to the standby?

> The main drawback would be increased wait time for the client due to extra round trip to standby before master sends ACK to client. Are there any other issues with this approach?

I think that you can introduce a GUC specifying whether this extra check
is required, to avoid a backup at failback.

Regards,

--
Fujii Masao


#8 Samrat Revagade
revagade.samrat@gmail.com
In reply to: Fujii Masao (#7)
Re: Inconsistent DB data in Streaming Replication

> What Samrat is proposing here is that WAL is not flushed to the OS before
> it is acked by a synchronous replica so recovery won't go past the
> timeline change made in failover, making it necessary to take a new
> base backup to resync with the new master.

Actually we are proposing that the data page on the master is not written
out until the master receives an ACK from the standby. The WAL files can be
flushed to disk on both the master and the standby before the standby sends
the ACK to the master. The end objective is the same: avoiding a base
backup of the old master to resync with the new master.

> Why do you think that the inconsistent data after failover happens is
> problem? Because it's one of the reasons why a fresh base backup is
> required when starting old master as new standby? If yes, I agree with
> you. I've often heard the complaints about a backup when restarting new
> standby. That's really big problem.

Yes, taking a backup is a major problem when the database size is several
TB or more. It would take a very long time to ship the backup data over a
slow WAN.

>> One solution to avoid this situation is have the master send WAL records
>> to standby and wait for ACK from standby committing WAL files to disk and
>> only after that commit data page related to this transaction on master.
>
> You mean to make the master wait the data page write until WAL has been
> not only flushed to disk but also replicated to the standby?

Yes. The master should not write the data page before the corresponding WAL
records have been replicated to the standby, and the WAL records have been
flushed to disk on both master and standby.

>> The main drawback would be increased wait time for the client due to
>> extra round trip to standby before master sends ACK to client. Are there
>> any other issues with this approach?
>
> I think that you can introduce GUC specifying whether this extra check
> is required to avoid a backup when failback

That would be a better idea. We can disable it whenever taking a fresh
backup is not a problem.

Regards,

Samrat


#9 Ants Aasma
ants.aasma@cybertec.at
In reply to: Samrat Revagade (#8)
Re: Inconsistent DB data in Streaming Replication

On Tue, Apr 9, 2013 at 9:42 AM, Samrat Revagade
<revagade.samrat@gmail.com> wrote:

>> What Samrat is proposing here is that WAL is not flushed to the OS before
>> it is acked by a synchronous replica so recovery won't go past the
>> timeline change made in failover, making it necessary to take a new
>> base backup to resync with the new master.
>
> Actually we are proposing that the data page on the master is not committed
> till master receives ACK from the standby. The WAL files can be flushed to
> the disk on both the master and standby, before standby generates ACK to
> master. The end objective is the same of avoiding to take base backup of old
> master to resync with new master.

Sorry for misreading your e-mail. It seems like we are on the same
page here. I too have found this an annoying limitation in using
replication in an unreliable environment.

> Yes, taking backup is major problem when the database size is more than
> several TB. It would take very long time to ship backup data over the slow
> WAN network.

For a WAN environment rsync can be a good enough answer, since only a tiny
number of pages will actually be transferred. This assumes a smallish
database and low bandwidth. For larger databases, avoiding the need to read
the whole database looking for differences is an obvious win.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#10 Hannu Krosing
hannu@tm.ee
In reply to: Samrat Revagade (#1)
Re: Inconsistent DB data in Streaming Replication

On 04/08/2013 12:34 PM, Samrat Revagade wrote:

> Hello,
>
> We have been trying to figure out possible solutions to the following
> problem in streaming replication Consider following scenario:
>
> If master receives commit command, it writes and flushes commit WAL
> records to the disk, It also writes and flushes data page related to
> this transaction.

No data page flushing takes place at commit. All data page writing is
deferred to the bgwriter and/or checkpoints.

> The master then sends WAL records to standby up to the commit WAL
> record. But before sending these records if failover happens then,
> old master is ahead of standby which is now the new master in terms
> of DB data leading to inconsistent data.

The master sends WAL records to standby continuously, not "upon commit
wal record".

In case of syncrep the master just waits for confirmation from standby
before returning to client on commit.

> One solution to avoid this situation is have the master send WAL
> records to standby and wait for ACK from standby committing WAL files
> to disk and only after that commit data page related to this
> transaction on master.

Not just commit: you must stop any *writing* of the WAL records,
effectively killing any parallelism.

> The main drawback would be increased wait time for the client due to
> extra round trip to standby before master sends ACK to client. Are
> there any other issues with this approach?

Main issue is that it will make *all* backends dependent on each sync
commit, essentially serialising all backends' commits, with the
serialisation *including* the latency of the roundtrip to the standby.

With current sync streaming the other backends can continue to write WAL;
with the proposed approach you cannot write any records after the one
waiting for an ACK from the standby.

> Thank you,
>
> Samrat


#11 Samrat Revagade
revagade.samrat@gmail.com
In reply to: Hannu Krosing (#10)
Re: Inconsistent DB data in Streaming Replication

> it's one of the reasons why a fresh base backup is required when starting
> old master as new standby? If yes, I agree with you. I've often heard the
> complaints about a backup when restarting new standby. That's really big
> problem.

I think Fujii Masao is on the same page.

> In case of syncrep the master just waits for confirmation from standby
> before returning to client on commit.
>
> Not just commit, you must stop any *writing* of the wal records
> effectively killing any parallelism.
>
> Min issue is that it will make *all* backends dependant on each sync
> commit, essentially serialising all backends commits, with the
> serialisation *including* the latency of roundtrip to client. With current
> sync streaming the other backends can continue to write wal, with proposed
> approach you cannot write any records after the one waiting an ACK from
> standby.

Let me rephrase the proposal in a more accurate manner:

Consider following scenario:

(1) A client sends the "COMMIT" command to the master server.

(2) The master writes the WAL records to disk.

(3) The master writes the data pages related to this transaction, i.e. via
checkpoint or bgwriter.

(4) The master sends WAL records continuously to the standby, up to the
commit WAL record.

(5) The standby receives the WAL records, writes them to disk, and then
replies with an ACK.

(6) The master returns a success indication to the client after it receives
the ACK.

If failover happens between (3) and (4), the WAL and DB data on the old
master are ahead of those on the new master. After failover, the new master
continues running new transactions independently of the old master. The WAL
records and DB data then become inconsistent between the two servers. To
resolve these inconsistencies, a backup of the new master needs to be taken
onto the new standby.

But taking a backup is not feasible for a database of several TB over a
slow WAN.

So, to avoid this type of inconsistency without taking a fresh backup, we
are thinking of doing the following:

> I think that you can introduce GUC specifying whether this extra check
> is required to avoid a backup when failback.

Approach:

Introduce a new GUC option specifying whether to prevent PostgreSQL from
writing DB data before the corresponding WAL records have been replicated
to the standby. That is, if this GUC option is enabled, PostgreSQL waits
for the corresponding WAL records to be not only written to disk but also
replicated to the standby before writing DB data.

So the process becomes as follows:

(1) A client sends the "COMMIT" command to the master server.

(2) The master writes the commit WAL record to the disk.

(3) The master sends WAL records continuously to standby up to the commit
WAL record.

(4) The standby receives the WAL records, writes them to disk, and then
replies with an ACK.

(5) The master then forces a write of the data pages related to this
transaction.

(6) The master returns a success indication to the client after it receives
the ACK.

While the master is waiting to force a write (point 5) of these data pages,
streaming replication continues. Also, other data page writes are not
dependent on this particular page write, so data page writes are not
serialized.
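A toy simulation of this revised sequence (purely illustrative; all names are invented) shows the invariant it is meant to preserve: the old master's data files never get ahead of WAL the standby has flushed, so a failover at any point leaves the two servers reconcilable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lsn_t;

typedef struct
{
    lsn_t master_wal_flushed;   /* step (2): WAL flushed locally */
    lsn_t standby_wal_flushed;  /* steps (3)-(4): WAL acked by standby */
    lsn_t master_page_lsn;      /* step (5): newest LSN in the data files */
} cluster_state;

/* The master writes a data page only if the standby has already flushed
 * WAL up to that page's LSN; otherwise the write is postponed. */
bool
try_write_page(cluster_state *c, lsn_t page_lsn)
{
    if (page_lsn > c->standby_wal_flushed)
        return false;
    c->master_page_lsn = page_lsn;
    return true;
}

/* Invariant the proposal aims for: at any failover point, the new
 * master's WAL already covers everything in the old master's data
 * files, so no fresh base backup is needed to resync. */
bool
safe_after_failover(const cluster_state *c)
{
    return c->master_page_lsn <= c->standby_wal_flushed;
}
```

In this model a page write attempted before the standby's ACK simply returns false and is retried later, which is why the invariant holds at every step.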

Regards,

Samrat

#12 Samrat Revagade
revagade.samrat@gmail.com
In reply to: Samrat Revagade (#11)
Re: Inconsistent DB data in Streaming Replication

> (5) The master then forces a write of the data page related to this
> transaction.

Sorry, this is incorrect. Whenever the master writes a data page, it checks
that the WAL records up to that page's LSN have been written on the standby.

> While master is waiting to force a write (point 5) for this data page,
> streaming replication continuous.
>
> Also other data page writes are not dependent on this particular page
> write. So the commit of data pages are not serialized.

Sorry, this is incorrect. Streaming replication continues and the master is
not waiting; whenever the master writes a data page, it checks that the WAL
records up to that page's LSN have been written on the standby.

Regards,

Samrat

#13 Amit Kapila
amit.kapila16@gmail.com
In reply to: Samrat Revagade (#12)
Re: Inconsistent DB data in Streaming Replication

On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:

> (5) The master then forces a write of the data page related to this
> transaction.
>
> Sorry, this is incorrect. Whenever the master writes the data page it
> checks that the WAL record is written in standby till that LSN.
>
> While master is waiting to force a write (point 5) for this data page,
> streaming replication continuous.
>
> Also other data page writes are not dependent on this particular page
> write. So the commit of data pages are not serialized.
>
> Sorry, this is incorrect. Streaming replication continuous, master is not
> waiting, whenever the master writes the data page it checks that the WAL
> record is written in standby till that LSN.

I am not sure it will resolve the problem completely, as your old master
can have some extra WAL beyond the new master for the same timeline. I
don't remember exactly whether the timeline switch feature takes care of
this extra WAL; Heikki can confirm this point. Also, I think this can
serialize the flush of data pages in checkpoint/bgwriter, which is
currently not the case.

With Regards,
Amit Kapila.


#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#13)
Re: Inconsistent DB data in Streaming Replication

Amit Kapila <amit.kapila@huawei.com> writes:

> On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
>
>> Sorry, this is incorrect. Streaming replication continuous, master is not
>> waiting, whenever the master writes the data page it checks that the WAL
>> record is written in standby till that LSN.
>
> I am not sure it will resolve the problem completely as your old-master can
> have some WAL extra then new-master for same timeline. I don't remember
> exactly will timeline switch feature
> take care of this extra WAL, Heikki can confirm this point?
> Also I think this can serialize flush of data pages in checkpoint/bgwriter
> which is currently not the case.

Yeah. TBH this entire discussion seems to be "let's cripple performance
in the normal case so that we can skip doing an rsync when resurrecting
a crashed, failed-over master". This is not merely optimizing for the
wrong thing, it's positively hazardous. After a fail-over, you should
be wondering whether it's safe to resurrect the old master at all, not
about how fast you can bring it back up without validating its data.
IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.

regards, tom lane


#15 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#14)
Re: Inconsistent DB data in Streaming Replication

On 2013-04-10 10:10:31 -0400, Tom Lane wrote:

> Amit Kapila <amit.kapila@huawei.com> writes:
>
>> On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
>>
>>> Sorry, this is incorrect. Streaming replication continuous, master is not
>>> waiting, whenever the master writes the data page it checks that the WAL
>>> record is written in standby till that LSN.
>>
>> I am not sure it will resolve the problem completely as your old-master can
>> have some WAL extra then new-master for same timeline. I don't remember
>> exactly will timeline switch feature
>> take care of this extra WAL, Heikki can confirm this point?
>> Also I think this can serialize flush of data pages in checkpoint/bgwriter
>> which is currently not the case.
>
> Yeah. TBH this entire discussion seems to be "let's cripple performance
> in the normal case so that we can skip doing an rsync when resurrecting
> a crashed, failed-over master". This is not merely optimizing for the
> wrong thing, it's positively hazardous. After a fail-over, you should
> be wondering whether it's safe to resurrect the old master at all, not
> about how fast you can bring it back up without validating its data.
> IOW, I wouldn't consider skipping the rsync even if I had a feature
> like this.

Agreed. Especially as in situations where you fail over in a planned
way, e.g. for a hardware upgrade, you can avoid the need to resync with
a little bit of care. So it's mostly in catastrophic situations that this
becomes a problem, and in those you really should resync - and it's a good
idea not to use a normal rsync but rsync --checksum or similar.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#16 Shaun Thomas
sthomas@optionshouse.com
In reply to: Tom Lane (#14)
Re: Inconsistent DB data in Streaming Replication

On 04/10/2013 09:10 AM, Tom Lane wrote:

> IOW, I wouldn't consider skipping the rsync even if I had a feature
> like this.

Totally. Out in the field, we consider the "old" database corrupt the
moment we fail over. There is literally no way to verify the safety of
any data along the broken chain, given race conditions and multiple
potential failure points.

The only potential use case for this that I can see, would be for system
maintenance and a controlled failover. I agree: that's a major PITA when
doing DR testing, but I personally don't think this is the way to fix
that particular edge case.

Maybe checksums will fix this in the long run... I don't know. DRBD has
a handy block-level verify function for things like this, and it can
re-sync master/slave data by comparing the commit log across the servers
if you tell it one node should be considered incorrect.

The thing is... we have clogs, and we have WAL. If we can assume
bidirectional communication and verification (checksum comparison?) of
both of those components, the database *should* be able to re-sync itself.

Even if that were possible given the internals, I can't see anyone
jumping on this before 9.4 or 9.5 unless someone sponsors the feature.

Automatic re-sync would (within available WALs) be an awesome feature,
though...

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



#17 Fujii Masao
masao.fujii@gmail.com
In reply to: Shaun Thomas (#16)
Re: Inconsistent DB data in Streaming Replication

On Wed, Apr 10, 2013 at 11:26 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:

On 04/10/2013 09:10 AM, Tom Lane wrote:

IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.

Totally. Out in the field, we consider the "old" database corrupt the moment
we fail over.

Strange. If this is really true, the shared-disk failover solution is
fundamentally broken, because the standby needs to start up with the shared
"corrupted" database at failover. Also, we cannot trust crash recovery at
all if we adopt the same logic as you suggest. I think that there are cases
where we can replay and reuse the old database even after PostgreSQL
crashes.

Regards,

--
Fujii Masao


#18 Shaun Thomas
sthomas@optionshouse.com
In reply to: Fujii Masao (#17)
Re: Inconsistent DB data in Streaming Replication

On 04/10/2013 11:40 AM, Fujii Masao wrote:

> Strange. If this is really true, shared disk failover solution is
> fundamentally broken because the standby needs to start up with the
> shared "corrupted" database at the failover.

How so? Shared disk doesn't use replication. The point I was trying to
make is that replication requires synchronization between two disparate
servers, and verifying they have exactly the same data is a non-trivial
exercise. Even a single transaction after a failover (effectively)
negates the old server because there's no easy "catch up" mechanism yet.

Even if this isn't necessarily true, it's the safest approach IMO.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



#19 Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#15)
Re: Inconsistent DB data in Streaming Replication

On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-04-10 10:10:31 -0400, Tom Lane wrote:

Amit Kapila <amit.kapila@huawei.com> writes:

On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:

Sorry, this is incorrect. Streaming replication is continuous; the master is
not waiting. Whenever the master writes a data page, it checks that the WAL
record up to that page's LSN has been written on the standby.

I am not sure it will resolve the problem completely, as the old master can
have some WAL beyond the new master for the same timeline. I don't remember
exactly whether the timeline-switch feature takes care of this extra WAL;
Heikki can confirm this point. Also, I think this can serialize the flushing
of data pages in checkpoint/bgwriter, which is currently not the case.

Yeah. TBH this entire discussion seems to be "let's cripple performance
in the normal case so that we can skip doing an rsync when resurrecting
a crashed, failed-over master". This is not merely optimizing for the
wrong thing, it's positively hazardous. After a fail-over, you should
be wondering whether it's safe to resurrect the old master at all, not
about how fast you can bring it back up without validating its data.
IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.

Agreed. Especially as in situations where you fail over in a planned
way, e.g. for a hardware upgrade, you can avoid the need to resync with
a little bit of care.

It's really worth documenting that way.

So it's mostly in catastrophic situations that this
becomes a problem, and in those you really should resync - and it's a good
idea not to use a normal rsync but rsync --checksum or similar.

If the database is very large, rsync --checksum takes very long. And I'm
concerned that most data pages on the master will have different checksums
from those on the standby because of commit hint bits. I'm not sure how
rsync --checksum can speed up the backup after failover.
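The hint-bit concern can be illustrated with a short sketch. The page
contents and the byte offset used here are invented for illustration; only
the 8 kB block size matches PostgreSQL's default. The point is that a
single-bit difference changes the whole-file checksum, so rsync --checksum
still has to transfer the page:

```python
import hashlib

PAGE_SIZE = 8192  # PostgreSQL's default block size

# Two copies of the "same" page: the standby never set the commit hint
# bit (modeled here as flipping one bit at an arbitrary offset), so the
# byte contents differ even though the logical data is identical.
master_page = bytearray(PAGE_SIZE)
standby_page = bytearray(master_page)
master_page[100] |= 0x01  # hypothetical hint bit, set only on the master

# rsync --checksum compares per-file checksums; any single-bit
# difference forces the file (here, the page) to be re-transferred.
same = hashlib.md5(master_page).digest() == hashlib.md5(standby_page).digest()
print(same)  # False: the checksums differ, so --checksum saves nothing
```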

Regards,

--
Fujii Masao


#20Fujii Masao
masao.fujii@gmail.com
In reply to: Shaun Thomas (#18)
Re: Inconsistent DB data in Streaming Replication

On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:

On 04/10/2013 11:40 AM, Fujii Masao wrote:

Strange. If this is really true, shared disk failover solution is
fundamentally broken because the standby needs to start up with the
shared "corrupted" database at the failover.

How so? Shared disk doesn't use replication. The point I was trying to make
is that replication requires synchronization between two disparate servers,
and verifying they have exactly the same data is a non-trivial exercise.
Even a single transaction after a failover (effectively) negates the old
server because there's no easy "catch up" mechanism yet.

Hmm... ISTM what Samrat is proposing can resolve the problem. That is,
if we can assume that any data page whose WAL has not been replicated to the
standby is never written on the master, the new standby (i.e., the old master)
can safely catch up with the new master (i.e., the old standby). In this
approach, of course, the new standby might have some WAL records which the new
master doesn't have, so before starting up the new standby we need to remove
all the WAL files on the new standby and retrieve the WAL files from the new
master. But what's the problem with his approach?
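The proposed rule can be sketched as a toy model: the master may flush a
data page only after the standby has acknowledged WAL up to that page's LSN.
All names here (flush_page, standby_ack_lsn, and so on) are illustrative,
not PostgreSQL internals:

```python
# Toy model of the proposal: a data page is flushed on the master only
# once the standby has confirmed WAL replication up to the page's LSN.

class Master:
    def __init__(self):
        self.standby_ack_lsn = 0   # highest LSN the standby has confirmed
        self.flushed_pages = {}    # page id -> LSN stamped on the flushed page

    def receive_ack(self, lsn):
        self.standby_ack_lsn = max(self.standby_ack_lsn, lsn)

    def flush_page(self, page_id, page_lsn):
        # The proposed invariant: refuse (i.e., wait) until the standby
        # has replicated WAL up to page_lsn.
        if page_lsn > self.standby_ack_lsn:
            return False  # would have to wait for the standby
        self.flushed_pages[page_id] = page_lsn
        return True

m = Master()
assert not m.flush_page("pg1", page_lsn=50)  # standby hasn't acked yet
m.receive_ack(60)
assert m.flush_page("pg1", page_lsn=50)      # safe: WAL <= 60 is on standby

# Consequence after failover: every page on the old master's disk has an
# LSN at or below the standby's acked position, so the new master's WAL
# history covers all of them and the old master can catch up (after its
# extra, unreplicated WAL files are discarded).
assert all(lsn <= m.standby_ack_lsn for lsn in m.flushed_pages.values())
```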

Regards,

--
Fujii Masao


#21Ants Aasma
ants.aasma@cybertec.at
In reply to: Shaun Thomas (#18)
#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ants Aasma (#21)
#23Boszormenyi Zoltan
zb@cybertec.at
In reply to: Fujii Masao (#19)
#24Andres Freund
andres@anarazel.de
In reply to: Boszormenyi Zoltan (#23)
#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Fujii Masao (#20)
#26Ants Aasma
ants.aasma@cybertec.at
In reply to: Samrat Revagade (#1)
#27Sameer Thakur
samthakur74@gmail.com
In reply to: Ants Aasma (#26)
#28Hannu Krosing
hannu@tm.ee
In reply to: Sameer Thakur (#27)
#29Ants Aasma
ants.aasma@cybertec.at
In reply to: Hannu Krosing (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ants Aasma (#29)
#31Hannu Krosing
hannu@tm.ee
In reply to: Ants Aasma (#29)
#32Ants Aasma
ants.aasma@cybertec.at
In reply to: Hannu Krosing (#31)
#33Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#22)
#34Fujii Masao
masao.fujii@gmail.com
In reply to: Hannu Krosing (#28)
#35Fujii Masao
masao.fujii@gmail.com
In reply to: Ants Aasma (#32)
#36Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Ants Aasma (#32)
#37Hannu Krosing
hannu@tm.ee
In reply to: Fujii Masao (#34)
#38Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#34)
#39Andres Freund
andres@anarazel.de
In reply to: Pavan Deolasee (#36)
#40Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Andres Freund (#39)
#41Andres Freund
andres@anarazel.de
In reply to: Pavan Deolasee (#40)
#42Fujii Masao
masao.fujii@gmail.com
In reply to: Hannu Krosing (#37)
#43Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#38)
#44Hannu Krosing
hannu@tm.ee
In reply to: Fujii Masao (#43)
#45Florian Pflug
fgp@phlo.org
In reply to: Fujii Masao (#43)
#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Florian Pflug (#45)
#47Florian Pflug
fgp@phlo.org
In reply to: Amit Kapila (#46)
#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Florian Pflug (#47)
#49Martijn van Oosterhout
kleptog@svana.org
In reply to: Florian Pflug (#47)
#50Florian Pflug
fgp@phlo.org
In reply to: Martijn van Oosterhout (#49)
#51Fujii Masao
masao.fujii@gmail.com
In reply to: Tom Lane (#14)
#52Fujii Masao
masao.fujii@gmail.com
In reply to: Florian Pflug (#47)