Patch for fail-back without fresh backup

Started by Samrat Revagade almost 13 years ago, 122 messages, hackers
#1Samrat Revagade
revagade.samrat@gmail.com

Hello,

We have already started a discussion on pgsql-hackers about the problem of
taking a fresh backup during the failback operation; here is the link:

/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

Let me again summarize the problem we are trying to address.

When the master fails, the last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to local storage. So the master contains some file
system level changes that the standby does not have. At this point, the
data directory of the master is ahead of the standby's data directory.

Subsequently, the standby is promoted to be the new master. Later, when the
old master wants to become a standby of the new master, it can't just join
the setup, since the two servers are inconsistent. We need to take a fresh
backup from the new master. This can happen with both synchronous and
asynchronous replication.

A fresh backup is also needed in the case of a clean switch-over, because in
current HEAD the master does not wait for the standby to receive all the
WAL up to the shutdown checkpoint record before shutting down the
connection. Fujii Masao has already submitted a patch to handle the clean
switch-over case, but the problem remains for the failback case.

Taking a fresh backup is very time consuming when the databases are very
large, say several TB, and when the servers are connected over a relatively
slow link. This would break the service level agreement of a disaster
recovery system. So there is a need to improve the process of disaster
recovery in PostgreSQL. One way to achieve this is to maintain consistency
between master and standby, which avoids the need for a fresh backup.

So our proposal is that the master must not make any file system level
changes without confirming that the corresponding WAL record has been
replicated to the standby.

There were many suggestions and objections on pgsql-hackers about this
problem. A brief summary follows:

1. The main objection, raised by Tom and others, is that we should not add
this feature and should stay with the traditional way of taking a fresh
backup using rsync, because of concerns about the additional complexity of
the patch and the performance overhead during normal operations.

2. Tom and others were also worried about inconsistencies in the crashed
master and suggested that it's better to start with a fresh backup. Fujii
Masao and others correctly countered that we trust WAL recovery to clear
all such inconsistencies and there is no reason why we can't do the same
here.

3. Someone suggested using rsync with checksums, but many pages on the two
servers may differ in their binary values because of hint bits etc.

4. The major objection to the failback-without-fresh-backup idea was that it
may introduce performance overhead and complexity to the code. Looking at
the patch, I must say that it is not too complex. For performance impact, I
tested the patch with pgbench, which shows that it has very small
performance overhead. Please refer to the test results included at the end
of this mail.

*Proposal to solve the problem*

The proposal is based on the concept that the master should not make any
file system level change until the corresponding WAL record is replicated
to the standby.

There are many places in the code which need to be handled to support the
proposed solution. The following cases explain why a fresh backup is needed
at the time of failover, and how our approach avoids that need.

1. We must not write any heap pages to disk before the WAL records
corresponding to those changes are received by the standby. Otherwise, if
the standby failed to receive the WAL corresponding to those heap pages,
there will be an inconsistency.

2. When a CHECKPOINT happens on the master, the control file of the master
is updated and the last checkpoint record is written to it. Suppose
failover happens and the standby fails to receive the WAL record
corresponding to the CHECKPOINT; then master and standby have inconsistent
copies of the control file, which leads to a mismatch in the redo record,
and recovery will not start normally. To avoid this situation we must not
update the control file of the master before the corresponding checkpoint
WAL record is received by the standby.

3. Also, when we truncate any of the physical files on the master and the
standby fails to receive the corresponding WAL, that physical file is
truncated on the master but still available on the standby, causing an
inconsistency. To avoid this we must not truncate physical files on the
master before the WAL record corresponding to that operation is received by
the standby.

4. The same applies to CLOG pages. If a CLOG page is written to disk and
the corresponding WAL record is not replicated to the standby, that leads
to an inconsistency. So we must not write CLOG pages (and maybe other SLRU
pages too) to disk before the corresponding WAL records are received by
the standby.

5. The same problem applies to the commit hint bits. But it is more
complicated than the other problems, because no WAL records are generated
for them, so we cannot apply the same method of waiting for the
corresponding WAL record to be replicated to the standby. So we delay
updating the commit hint bits, similar to what is done for asynchronous
commits. In other words, we need to check whether the WAL corresponding to
the transaction commit has been received by the failback safe standby, and
only then allow hint bit updates.
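
The common rule behind cases 1 to 4 can be sketched as a small model. This
is only an illustration in Python, not code from the patch: the class
Master, the field standby_confirmed_lsn and the method names are
hypothetical, and LSNs are simplified to plain integers.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    # Highest LSN the failback safe standby has confirmed (hypothetical field).
    standby_confirmed_lsn: int = 0
    # Stand-in for the master's data directory: page id -> (lsn, contents).
    disk: dict = field(default_factory=dict)

    def can_write(self, page_lsn: int) -> bool:
        """A page may reach disk only if the standby already has its WAL."""
        return page_lsn <= self.standby_confirmed_lsn

    def try_flush_page(self, page_id: str, page_lsn: int, contents: bytes) -> bool:
        """Write the page if allowed; otherwise leave the disk untouched."""
        if not self.can_write(page_lsn):
            return False
        self.disk[page_id] = (page_lsn, contents)
        return True


master = Master()
blocked = master.try_flush_page("heap:1", 100, b"new tuple")  # standby is behind
master.standby_confirmed_lsn = 100                            # standby catches up
written = master.try_flush_page("heap:1", 100, b"new tuple")
```

In this model the master's disk never contains a change whose WAL the
standby lacks, which is the property the proposal relies on to skip the
fresh backup.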

*Patch explanation:*

The initial work on this patch was done by Pavan Deolasee. I tested it and
will make further enhancements based on the community feedback.

This patch is not complete yet, but I plan to complete it with the help of
this community. At this point, the primary purpose is to understand the
complexities and get some initial performance numbers to alleviate some of
the concerns raised by the community.

There are two GUC parameters which support this failsafe standby:

1. failback_safe_standby_name [ name of the failsafe standby ] This is the
name of the failsafe standby. The master will not make any file system
level change before the corresponding WAL is replicated to this failsafe
standby.

2. failback_safe_standby_mode [ off/remote_write/remote_flush ] This
parameter specifies the behavior of the master, i.e. whether it should wait
for WAL to be written on the standby or for WAL to be flushed on the
standby. We should turn it off when we do not want the failsafe standby.
This failsafe mode can be combined with synchronous as well as asynchronous
streaming replication.

Most of the changes are done in syncrep.c. This is a slight misnomer,
because that file deals with synchronous standbys, and a failback standby
could, and most likely will, be an async standby. But keeping the changes
this way has ensured that the patch is easy to read. Once we have
acceptance of the approach, the patch can be modified to reorganize the
code in a more logical way.

The patch adds a new state SYNC_REP_WAITING_FOR_FAILBACK_SAFETY to the sync
standby states. A backend which is waiting for a failback safe standby to
receive WAL records will wait in this state. The failback safe mechanism
can work in two different modes, that is, wait for WAL to be written or
flushed on the failsafe standby. These are represented by two new modes,
SYNC_REP_WAIT_FAILBACK_SAFE_WRITE and SYNC_REP_WAIT_FAILBACK_SAFE_FLUSH
respectively.

Also, SyncRepWaitForLSN() is changed to support a conditional wait, so that
we can delay hint bit updates on the master instead of blocking while
waiting for the failback safe standby to receive WAL.
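
The idea of a conditional wait for hint bits might look roughly like the
sketch below. It is a Python illustration under assumptions, not the
patch's C code: both function names are hypothetical, LSNs are plain
integers, and the string "HEAP_XMIN_COMMITTED" is used only as a label for
the hint bit.

```python
def standby_has_received(commit_lsn: int, standby_confirmed_lsn: int) -> bool:
    """Non-blocking check that stands in for the conditional wait."""
    return commit_lsn <= standby_confirmed_lsn


def try_set_commit_hint(hint_bits: set, commit_lsn: int,
                        standby_confirmed_lsn: int) -> bool:
    """Set the hint only if the commit's WAL is already on the standby.

    Returns False without blocking when the standby is behind; the caller
    simply skips the hint and a later visitor of the tuple can retry.
    """
    if not standby_has_received(commit_lsn, standby_confirmed_lsn):
        return False
    hint_bits.add("HEAP_XMIN_COMMITTED")
    return True


bits = set()
skipped = try_set_commit_hint(bits, commit_lsn=50, standby_confirmed_lsn=40)
applied = try_set_commit_hint(bits, commit_lsn=50, standby_confirmed_lsn=50)
```

Skipping the hint is always safe, since hint bits only cache information
that can be looked up again from CLOG.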

*Benchmark tests*

*PostgreSQL versions:* PostgreSQL 9.3beta1

*Usage:* To operate in failsafe mode you need to configure the following
two GUC parameters:

1. failback_safe_standby_name

2. failback_safe_standby_mode

*Performance impact:*

The tests were performed on servers having 32 GB RAM. checkpoint_timeout
was set to 10 minutes so that checkpoints happen more frequently. A
checkpoint involves flushing all the dirty blocks to disk, and we
primarily wanted to test that code path.

pgbench settings:

Transaction type: TPC-B

Scaling factor: 100

Query mode: simple

Number of clients: 100

Number of threads: 1

Duration: 1800 s

The following shows the average TPS measured for each scenario. We
conducted 3 tests for each scenario.

1) Synchronous Replication - 947 tps

2) Synchronous Replication + Failsafe standby (off) - 934 tps

3) Synchronous Replication + Failsafe standby (remote_flush) - 931 tps

4) Asynchronous Replication - 1369 tps

5) Asynchronous Replication + Failsafe standby (off) - 1349 tps

6) Asynchronous Replication + Failsafe standby (remote_flush) - 1350 tps

From these results we can conclude the following:

1. Streaming replication + failback safe:

a) On average, synchronous replication combined with failsafe standby
(remote_flush) causes 1.68 % performance overhead.

b) On average, asynchronous streaming replication combined with failsafe
standby (remote_flush) causes 1.38 % performance degradation.

2. Streaming replication + failback safe (turned off):

a) On average, synchronous replication combined with failsafe standby
(off) causes 1.37 % performance overhead.

b) On average, asynchronous streaming replication combined with failsafe
standby (off) causes 1.46 % performance degradation.

So the patch shows a 1-2% performance overhead.
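
For reference, the percentages can be recomputed from the TPS figures
listed above as (baseline - measured) / baseline. The helper name
overhead_percent is just for this illustration. Rounded to two decimals,
the remote_flush cases come out as 1.69 and 1.39, so the 1.68 and 1.38
quoted above appear to be truncated rather than rounded.

```python
def overhead_percent(baseline_tps: float, measured_tps: float) -> float:
    """Relative throughput loss against the baseline, in percent."""
    return (baseline_tps - measured_tps) / baseline_tps * 100.0


SYNC_BASELINE = 947    # synchronous replication, tps
ASYNC_BASELINE = 1369  # asynchronous replication, tps

scenarios = {
    "sync + failsafe (off)": overhead_percent(SYNC_BASELINE, 934),
    "sync + failsafe (remote_flush)": overhead_percent(SYNC_BASELINE, 931),
    "async + failsafe (off)": overhead_percent(ASYNC_BASELINE, 1349),
    "async + failsafe (remote_flush)": overhead_percent(ASYNC_BASELINE, 1350),
}

for name, pct in scenarios.items():
    print(f"{name}: {pct:.2f} %")
```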

Please let me know if there is a need to perform tests for other
scenarios.

*Improvements (To-do):*

1. Currently this patch supports only one failback safe standby. It can be
either a synchronous or an asynchronous standby. We probably need to
discuss whether it needs to be changed to support multiple failsafe
standbys.

2. The current design of the patch will wait forever for the failback safe
standby. Streaming replication also has the same limitation. We probably
need to discuss whether it needs to be changed.

There are a couple more places that probably need some attention, and I
have marked them with XXX.

Thank you,

Samrat

Attachments:

failback_safe_standby.patch (application/octet-stream), +300 -49

#2Benedikt Grundmann
bgrundmann@janestreet.com
In reply to: Samrat Revagade (#1)
Re: Patch for fail-back without fresh backup

On Fri, Jun 14, 2013 at 10:11 AM, Samrat Revagade <revagade.samrat@gmail.com> wrote:

Hello,

We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:

/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

Let me again summarize the problem we are trying to address.

When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage. So master contains some file
system level changes that standby does not have. At this point, the data
directory of master is ahead of standby's data directory.

Subsequently, the standby will be promoted as new master. Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need to
take the fresh backup from the new master. This can happen in both the
synchronous as well as asynchronous replication.

Fresh backup is also needed in case of clean switch-over because in the
current HEAD, the master does not wait for the standby to receive all the
WAL up to the shutdown checkpoint record before shutting down the
connection. Fujii Masao has already submitted a patch to handle clean
switch-over case, but the problem is still remaining for failback case.

The process of taking fresh backup is very time consuming when databases
are of very big sizes, say several TB's, and when the servers are connected
over a relatively slower link. This would break the service level
agreement of disaster recovery system. So there is need to improve the
process of disaster recovery in PostgreSQL. One way to achieve this is to
maintain consistency between master and standby which helps to avoid need
of fresh backup.

So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the
corresponding WAL record is replicated to the standby.

An alternative proposal (which will probably just reveal my lack of
understanding about what is or isn't possible with WAL): provide a way to
restart the master so that it rolls back the WAL changes that the slave
hasn't seen.

There are many suggestions and objections pgsql-hackers about this problem
The brief summary is as follows:

#3Samrat Revagade
revagade.samrat@gmail.com
In reply to: Benedikt Grundmann (#2)
Re: Patch for fail-back without fresh backup

That will not happen if there is an inconsistency between the two servers.

Please refer to the discussions on the link provided in the first post:

/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

Regards,

Samrat Revagade

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Samrat Revagade (#1)
Re: Patch for fail-back without fresh backup

On 14.06.2013 12:11, Samrat Revagade wrote:

We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:

/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

Let me again summarize the problem we are trying to address.

When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage. So master contains some file
system level changes that standby does not have. At this point, the data
directory of master is ahead of standby's data directory.

Subsequently, the standby will be promoted as new master. Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need to
take the fresh backup from the new master. This can happen in both the
synchronous as well as asynchronous replication.

Did you see the thread on the little tool I wrote called pg_rewind?

/messages/by-id/519DF910.4020609@vmware.com

It solves that problem, for both clean and unexpected shutdown. It needs
some more work and a lot more testing, but requires no changes to the
backend. Robert Haas pointed out in that thread that it has a problem
with hint bits that are not WAL-logged, but it will still work if you
also enable the new checksums feature, which forces hint bit updates to
be WAL-logged. Perhaps we could add a GUC to enable hint bits to be
WAL-logged, regardless of checksums, to make pg_rewind work.

I think that's a more flexible approach to solve this problem. It
doesn't require an online feedback loop from the standby to master, for
starters.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#4)
Re: Patch for fail-back without fresh backup

On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Robert Haas pointed out in that thread that it has a problem with hint
bits that are not WAL-logged,

I liked that tool a lot until Robert pointed out the above problem. I
thought this is a show stopper because I can't really see any way to
circumvent it unless we enable checksums or explicitly WAL log hint bits.

but it will still work if you also enable the new checksums feature, which
forces hint bit updates to be WAL-logged.

Are we expecting a lot of people to run their clusters with checksums on ?
Sorry, I haven't followed the checksum discussions and don't know how much
overhead it causes. But if the general expectation is that checksums will
be turned on most often, I agree pg_rewind is probably good enough.

Perhaps we could add a GUC to enable hint bits to be WAL-logged,
regardless of checksums, to make pg_rewind work.

Wouldn't that be too costly ? I mean, in the worst case every hint bit on a
page may get updated separately. If each such update is WAL logged, we are
looking for a lot more unnecessary WAL traffic.

I think that's a more flexible approach to solve this problem. It doesn't
require an online feedback loop from the standby to master, for starters.

I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
with 9.3 and below because of the hint bits issue, otherwise it would have
been even more cool.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

#6Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Benedikt Grundmann (#2)
Re: Patch for fail-back without fresh backup

On Fri, Jun 14, 2013 at 2:51 PM, Benedikt Grundmann <
bgrundmann@janestreet.com> wrote:

An alternative proposal (which will probably just reveal my lack of
understanding about what is or isn't possible with WAL). Provide a way to
restart the master so that it rolls back the WAL changes that the slave
hasn't seen.

WAL records in PostgreSQL can only be used for physical redo. They can not
be used for undo. So what you're suggesting is not possible though I am
sure a few other databases do that.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Pavan Deolasee (#5)
Re: Patch for fail-back without fresh backup

On 14.06.2013 14:06, Pavan Deolasee wrote:

On Fri, Jun 14, 2013 at 4:12 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

Robert Haas pointed out in that thread that it has a problem with hint
bits that are not WAL-logged,

I liked that tool a lot until Robert pointed out the above problem. I
thought this is a show stopper because I can't really see any way to
circumvent it unless we enable checksums or explicitly WAL log hint bits.

but it will still work if you also enable the new checksums feature, which
forces hint bit updates to be WAL-logged.

Are we expecting a lot of people to run their clusters with checksums on ?
Sorry, I haven't followed the checksum discussions and don't know how much
overhead it causes. But if the general expectation is that checksums will
be turned on most often, I agree pg_rewind is probably good enough.

Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.

Perhaps we could add a GUC to enable hint bits to be WAL-logged,
regardless of checksums, to make pg_rewind work.

Wouldn't that be too costly ? I mean, in the worst case every hint bit on a
page may get updated separately. If each such update is WAL logged, we are
looking for a lot more unnecessary WAL traffic.

Yep, same as with checksums. I was not very enthusiastic about the
checksums patch because of that, but a lot of people are willing to pay
that price. Maybe we can figure out a way to reduce that cost in 9.4.
It'd benefit the checksums greatly.

For pg_rewind, we wouldn't actually need a full-page image for hint bit
updates, just a small record saying "hey, I touched this page". And
you'd only need to write that the first time a page is touched after a
checkpoint.
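
As a rough illustration of that idea, the bookkeeping could look like the
Python sketch below. Everything here is hypothetical (the class
TouchLogger, the "TOUCHED" record type, integer LSNs); it only models
"write one small record per page per checkpoint cycle".

```python
class TouchLogger:
    """Emits at most one small 'touched' record per page per checkpoint."""

    def __init__(self):
        self.redo_ptr = 0       # LSN at which the latest checkpoint started
        self.next_lsn = 1       # next LSN to hand out
        self.wal = []           # list of (record_type, page_id, lsn)
        self.last_touch = {}    # page_id -> LSN of its latest touch record

    def checkpoint(self):
        """Start a new checkpoint cycle; earlier touch records expire."""
        self.redo_ptr = self.next_lsn
        self.next_lsn += 1

    def set_hint_bit(self, page_id):
        """Log a touch record only on the first touch since the checkpoint."""
        if self.last_touch.get(page_id, -1) < self.redo_ptr:
            self.wal.append(("TOUCHED", page_id, self.next_lsn))
            self.last_touch[page_id] = self.next_lsn
            self.next_lsn += 1


log = TouchLogger()
log.set_hint_bit("heap:7")   # first touch: logged
log.set_hint_bit("heap:7")   # same cycle: not logged again
log.checkpoint()
log.set_hint_bit("heap:7")   # first touch after the checkpoint: logged
```

A tool like pg_rewind would then only need to scan for such records to
learn which pages changed, without needing their contents in the WAL.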

I think that's a more flexible approach to solve this problem. It doesn't
require an online feedback loop from the standby to master, for starters.

I agree. That's a big advantage of pg_rewind. Unfortunately, it can't work
with 9.3 and below because of the hint bits issue, otherwise it would have
been even more cool.

The proposed patch is clearly not 9.3 material either. If anything,
there's a much better chance that we could still sneak in a GUC to allow
hint bits to be WAL-logged without checksums in 9.3. All the code is
there, it'd just be a new guc to control it separately from checksums.

- Heikki

#8Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#7)
Re: Patch for fail-back without fresh backup

On Fri, Jun 14, 2013 at 12:20 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

For pg_rewind, we wouldn't actually need a full-page image for hint bit
updates, just a small record saying "hey, I touched this page". And you'd
only need to write that the first time a page is touched after a checkpoint.

I would expect that to be about the same cost though. The latency for
the fsync on the wal record before being able to flush the buffer is
the biggest cost.

The proposed patch is clearly not 9.3 material either. If anything, there's
a much better chance that we could still sneak in a GUC to allow hint bits
to be WAL-logged without checksums in 9.3. All the code is there, it'd just
be a new guc to control it separately from checksums.

On the other hand if you're going to wal log the hint bits why not
enable checksums?

Do we allow turning off checksums after a database is initdb'd? IIRC
we can't turn it on later but I don't see why we couldn't turn them
off.

--
greg

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#7)
Re: Patch for fail-back without fresh backup

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.

Refresh my memory as to why we need to WAL-log hints for checksumming?
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page. Given that we're
paying that cost, I don't see why we'd need to do any extra WAL-logging
(above and beyond the log-when-freeze cost that we have to pay already).
But I've not absorbed any caffeine yet today, so maybe I'm just missing
it.

regards, tom lane

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Samrat Revagade (#1)
Re: Patch for fail-back without fresh backup

On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:

Hello,

We have already started a discussion on pgsql-hackers for the problem of
taking fresh backup during the failback operation here is the link for that:

/messages/by-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbJgWrFu513s+Q@mail.gmail.com

Let me again summarize the problem we are trying to address.

When the master fails, last few WAL files may not reach the standby. But
the master may have gone ahead and made changes to its local file system
after flushing WAL to the local storage. So master contains some file
system level changes that standby does not have. At this point, the data
directory of master is ahead of standby's data directory.

Subsequently, the standby will be promoted as new master. Later when the
old master wants to be a standby of the new master, it can't just join the
setup since there is inconsistency in between these two servers. We need to
take the fresh backup from the new master. This can happen in both the
synchronous as well as asynchronous replication.

Fresh backup is also needed in case of clean switch-over because in the
current HEAD, the master does not wait for the standby to receive all the
WAL up to the shutdown checkpoint record before shutting down the
connection. Fujii Masao has already submitted a patch to handle clean
switch-over case, but the problem is still remaining for failback case.

The process of taking fresh backup is very time consuming when databases
are of very big sizes, say several TB's, and when the servers are connected
over a relatively slower link. This would break the service level
agreement of disaster recovery system. So there is need to improve the
process of disaster recovery in PostgreSQL. One way to achieve this is to
maintain consistency between master and standby which helps to avoid need
of fresh backup.

So our proposal on this problem is that we must ensure that master should
not make any file system level changes without confirming that the
corresponding WAL record is replicated to the standby.

How will you take care of the extra WAL on the old master during recovery?
If it replays WAL which has not reached the new master, it can be a problem.

With Regards,
Amit Kapila.

#11Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#9)
Re: Patch for fail-back without fresh backup

On 2013-06-14 09:08:15 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.

Refresh my memory as to why we need to WAL-log hints for checksumming?
I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.

I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (The one aborting WAL
replay too early...). So I am very happy that you are looking at it.

Jeff Davis and I were talking about whether the usage of
PGXAC->delayChkpt makes the whole thing sufficiently safe at pgcon - we
couldn't find any real danger but...

Given that we're
paying that cost, I don't see why we'd need to do any extra WAL-logging
(above and beyond the log-when-freeze cost that we have to pay already).
But I've not absorbed any caffeine yet today, so maybe I'm just missing
it.

The usual torn page spiel I think. If we crash while only one half of
the page made it to disk we would get spurious checksum failures from
there on.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#9)
Re: Patch for fail-back without fresh backup

On 14.06.2013 16:08, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

Well, time will tell I guess. The biggest overhead with the checksums is
exactly the WAL-logging of hint bits.

Refresh my memory as to why we need to WAL-log hints for checksumming?

Torn pages:

1. Backend sets a hint bit, dirtying the buffer.
2. Checksum is calculated, and buffer is written out to disk.
3. <crash>

If the page is torn, the checksum won't match. Without checksums, a torn
page is not a problem with hint bits, as a single bit can't be torn and
the page is otherwise intact. But with checksums, it causes a checksum
failure.
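
The three steps above can be reproduced in a toy model. This is only a
sketch in Python under assumptions: CRC-32 from zlib stands in for the real
page checksum algorithm, the checksum is stored in the first 4 bytes of the
page, and the disk is assumed to write 512-byte sectors atomically. All
function names are made up for the illustration.

```python
import zlib

PAGE_SIZE = 8192    # PostgreSQL's default block size
SECTOR_SIZE = 512   # assumed unit of atomic disk writes


def build_page(body: bytes) -> bytes:
    """Prefix the body with a 4-byte checksum (CRC-32 as a stand-in)."""
    assert len(body) == PAGE_SIZE - 4
    return zlib.crc32(body).to_bytes(4, "big") + body


def checksum_ok(page: bytes) -> bool:
    """Verify the stored checksum against the page body."""
    return int.from_bytes(page[:4], "big") == zlib.crc32(page[4:])


def with_hint_bit(page: bytes) -> bytes:
    """Step 1 and 2: set one bit near the end, then recompute the checksum."""
    body = bytearray(page[4:])
    body[-1] |= 0x01
    return build_page(bytes(body))


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Step 3: only the first sectors of the new page reach disk."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


old_page = build_page(b"tuple data".ljust(PAGE_SIZE - 4, b"\x00"))
new_page = with_hint_bit(old_page)
on_disk = torn_write(old_page, new_page, sectors_written=1)

print(checksum_ok(old_page), checksum_ok(new_page), checksum_ok(on_disk))
```

The torn page carries the new checksum in its first sector but the old hint
bit in its last one, so verification fails even though the tuple data
itself is intact.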

- Heikki

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#12)
Re: Patch for fail-back without fresh backup

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 14.06.2013 16:08, Tom Lane wrote:

Refresh my memory as to why we need to WAL-log hints for checksumming?

Torn pages:

So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?

regards, tom lane

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#13)
Re: Patch for fail-back without fresh backup

On 14.06.2013 16:21, Tom Lane wrote:

Heikki Linnakangas<hlinnakangas@vmware.com> writes:

On 14.06.2013 16:08, Tom Lane wrote:

Refresh my memory as to why we need to WAL-log hints for checksumming?

Torn pages:

So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?

Correct. We're doing the latter, see XLogSaveBufferForHint().

- Heikki

#15Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Patch for fail-back without fresh backup

On 2013-06-14 09:21:52 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 14.06.2013 16:08, Tom Lane wrote:

Refresh my memory as to why we need to WAL-log hints for checksumming?

Torn pages:

So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?

MarkBufferDirtyHint() logs an FPI (just not via a BKP block) via
XLogSaveBufferForHint() iff XLogCheckBuffer() says we need to by
comparing GetRedoRecPtr() with the page's lsn.
Otherwise we don't do anything besides marking the buffer dirty.
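
A simplified model of that decision, written in Python rather than the
backend's C: the function names below only mirror the description above
and are not the real signatures, LSNs are plain integers, and the "page
LSN not newer than the redo pointer" comparison is my reading of the
description, so treat it as an assumption.

```python
def needs_full_page_image(page_lsn: int, redo_ptr: int) -> bool:
    """True if the page has not been WAL-logged since the last checkpoint."""
    return page_lsn <= redo_ptr


def mark_buffer_dirty_hint(page_lsn: int, redo_ptr: int,
                           checksums_enabled: bool) -> list:
    """Return the actions taken for a hint-bit-only change to a page."""
    actions = []
    if checksums_enabled and needs_full_page_image(page_lsn, redo_ptr):
        # First change after the checkpoint: protect against torn pages.
        actions.append("log_full_page_image")
    actions.append("mark_dirty")
    return actions


first_touch = mark_buffer_dirty_hint(page_lsn=90, redo_ptr=100,
                                     checksums_enabled=True)
later_touch = mark_buffer_dirty_hint(page_lsn=120, redo_ptr=100,
                                     checksums_enabled=True)
```

Only the first hint-bit change after a checkpoint pays for a full-page
image in this model; later changes in the same cycle just dirty the buffer.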

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#11)
Re: Patch for fail-back without fresh backup

On 14.06.2013 16:15, Andres Freund wrote:

On 2013-06-14 09:08:15 -0400, Tom Lane wrote:

I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.

I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (The one aborting WAL
replay too early...). So I am very happy that you are looking at it.

Hmm. In XLogSaveBufferForHint():

* Note that this only works for buffers that fit the standard page model,
* i.e. those for which buffer_std == true

The free-space-map uses non-standard pages, and MarkBufferDirtyHint().
Isn't that completely broken for the FSM? If I'm reading it correctly,
what will happen is that replay will completely zero out all FSM pages
that have been touched. All the FSM data is between pd_lower and
pd_upper, which on standard pages is the "hole".
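
To make the concern concrete, here is a toy model in Python of saving a
page image with the "hole" left out and restoring it with zeros. It is a
sketch under assumptions, not the backend code: the function names are
invented, and the FSM page is assumed to have pd_lower right after a
24-byte header and pd_upper at the end of the page.

```python
PAGE_SIZE = 8192
HEADER_SIZE = 24   # assumed size of the page header


def save_image(page: bytes, pd_lower: int, pd_upper: int):
    """Store a page image, omitting the pd_lower..pd_upper range."""
    return pd_lower, pd_upper, page[:pd_lower] + page[pd_upper:]


def restore_image(image) -> bytes:
    """Rebuild the page at replay, filling the omitted range with zeros."""
    pd_lower, pd_upper, data = image
    hole = bytes(pd_upper - pd_lower)
    return data[:pd_lower] + hole + data[pd_lower:]


# An FSM-like page: all payload lives between pd_lower and pd_upper.
fsm_page = bytes(HEADER_SIZE) + bytes([7]) * (PAGE_SIZE - HEADER_SIZE)
restored = restore_image(save_image(fsm_page, HEADER_SIZE, PAGE_SIZE))

print(restored[HEADER_SIZE:] == bytes(PAGE_SIZE - HEADER_SIZE))
```

On a standard heap page the omitted range really is free space, so zeroing
it loses nothing. On the FSM-like page in this model it is the entire
payload.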

- Heikki

#17Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#13)
Re: Patch for fail-back without fresh backup

On 2013-06-14 09:21:52 -0400, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

On 14.06.2013 16:08, Tom Lane wrote:

Refresh my memory as to why we need to WAL-log hints for checksumming?

Torn pages:

So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?

From quickly looking at the code again I think the MarkBufferDirtyHint()
code makes at least one assumption that isn't correct in the face of
checksums.

It tests for the need to dirty the page with:
if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))

*before* taking a lock. A comment explains why that is safe:

* Since we make this test unlocked, there's a chance we
* might fail to notice that the flags have just been cleared, and failed
* to reset them, due to memory-ordering issues.

That's fine for the classical use case without checksums, but what about
the following scenario:

1) page is dirtied, FPI is logged
2) SetHintBits gets called on the same page, holding only a share lock
3) checkpointer/bgwriter/... writes out the page, clearing the dirty
flag
4) checkpoint finishes, updates redo ptr
5) SetHintBits actually modifies the hint bits
6) SetHintBits calls MarkBufferDirtyHint which doesn't notice that the
page isn't dirty anymore and thus doesn't check whether something
needs to get logged.

At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?
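
The interleaving above can be replayed deterministically in a small single-threaded sketch. This is not the bufmgr code: SketchBuffer, the SK_BM_* flags, sketch_mark_dirty_hint, sketch_replay_scenario, and the LSN values are all hypothetical, and the stale unlocked read is modelled by passing in a snapshot of the flags taken before the write-out. The slow path stands in for "take the lock, compare the page's LSN with the redo pointer, log an FPI if needed, set the dirty flags".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_BM_DIRTY        (1u << 0)
#define SK_BM_JUST_DIRTIED (1u << 1)
#define SK_BM_BOTH         (SK_BM_DIRTY | SK_BM_JUST_DIRTIED)

/* Hypothetical, simplified buffer state. */
typedef struct {
    uint32_t flags;      /* current buffer flags */
    uint64_t page_lsn;   /* LSN of the last WAL record covering the page */
    int      fpi_count;  /* number of full-page images logged so far */
    int      hint_set;   /* 1 once the hint bit has been modified */
} SketchBuffer;

/*
 * Simplified stand-in for the hint-dirtying path. observed_flags is what
 * the unlocked test sees, which may be stale. Returns 1 if the slow path
 * ran, 0 if the unlocked test short-circuited.
 */
int sketch_mark_dirty_hint(SketchBuffer *buf, uint32_t observed_flags,
                           uint64_t redo_ptr)
{
    assert(buf != NULL);
    if ((observed_flags & SK_BM_BOTH) != SK_BM_BOTH)
    {
        /* First change since the redo pointer: an FPI is needed. */
        if (buf->page_lsn <= redo_ptr)
        {
            buf->fpi_count++;
            buf->page_lsn = redo_ptr + 1;   /* made-up LSN of the new FPI */
        }
        buf->flags |= SK_BM_BOTH;
        return 1;
    }
    return 0;
}

/* Replay steps 1-6; stale_read selects which flags the unlocked test sees. */
SketchBuffer sketch_replay_scenario(int stale_read)
{
    SketchBuffer buf = {0, 0, 0, 0};
    uint64_t redo_ptr = 100;
    uint32_t stale;

    /* 1) page is dirtied, FPI is logged */
    buf.page_lsn = 150;
    buf.fpi_count = 1;
    buf.flags = SK_BM_BOTH;

    /* 2) SetHintBits starts under a share lock; remember what it could see */
    stale = buf.flags;

    /* 3) the page is written out, clearing the dirty flags */
    buf.flags &= ~SK_BM_BOTH;

    /* 4) checkpoint finishes; the redo pointer moves past the page's LSN */
    redo_ptr = 200;

    /* 5) the hint bit is actually modified */
    buf.hint_set = 1;

    /* 6) the unlocked dirty test runs against stale or fresh flags */
    sketch_mark_dirty_hint(&buf, stale_read ? stale : buf.flags, redo_ptr);
    return buf;
}
```

With the stale snapshot, the sketch ends with a modified page, no new FPI, and a clean buffer, which is the state described above. With a fresh read of the flags the slow path runs, a second FPI is logged, and the buffer is dirty again.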

Since

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#17)
Re: Patch for fail-back without fresh backup

On 14.06.2013 17:01, Andres Freund wrote:

At this point we have a page that has been modified without an FPI. But
it's not marked dirty, so it won't be written out without further
cause. Which might be fine since there's no cause to write out the page
and there probably won't be anyone doing that without logging an FPI
independently.
Can anybody see a scenario where this is actually dangerous?

The code also relies on that being safe during recovery:

* If we're in recovery we cannot dirty a page because of a hint.
* We can set the hint, just not dirty the page as a result so the
* hint is lost when we evict the page or shutdown.
*
* See src/backend/storage/page/README for longer discussion.
*/
if (RecoveryInProgress())
return;

I can't immediately see a problem with that.

- Heikki


#19Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#16)
Re: Patch for fail-back without fresh backup

On 2013-06-14 16:58:38 +0300, Heikki Linnakangas wrote:

On 14.06.2013 16:15, Andres Freund wrote:

On 2013-06-14 09:08:15 -0400, Tom Lane wrote:

I just had my nose in the part of the checksum patch that tediously
copies entire pages out of shared buffers to avoid possible instability
of the hint bits while we checksum and write the page.

I am really rather uncomfortable with that piece of code, and I hacked
it up after Jeff Janes had reported a bug there (The one aborting WAL
replay too early...). So I am very happy that you are looking at it.

Hmm. In XLogSaveBufferForHint():

* Note that this only works for buffers that fit the standard page model,
* i.e. those for which buffer_std == true

The free-space-map uses non-standard pages, and MarkBufferDirtyHint(). Isn't
that completely broken for the FSM? If I'm reading it correctly, what will
happen is that replay will completely zero out all FSM pages that have been
touched. All the FSM data is between pd_lower and pd_upper, which on
standard pages is the "hole".

Jeff Davis has a patch pending
(1365493015.7580.3240.camel@sussancws0025) that passes the buffer_std
flag down to MarkBufferDirtyHint() for exactly that reason. I thought we
were on track to commit that, but rereading the thread it doesn't look
that way.

Jeff, care to update that patch?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#20Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#13)
Re: Patch for fail-back without fresh backup

On Fri, Jun 14, 2013 at 2:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

So it's not that we actually need to log the individual hint bit
changes, it's that we need to WAL-log a full page image on the first
update after a checkpoint, so as to recover from torn-page cases.
Which one are we doing?

WAL-logging a full page image after a checkpoint wouldn't actually be
enough, since subsequent hint-bit updates will dirty the page without
WAL-logging anything, creating a new torn-page risk. FPIs are only useful
if all the subsequent updates are WAL-logged.

--
greg


#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#20)
#22Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#19)
#23Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#22)
#24Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Samrat Revagade (#1)
#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#24)
#26Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#25)
#27Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#23)
#28Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#27)
#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#26)
#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Davis (#22)
#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Samrat Revagade (#1)
#32Samrat Revagade
revagade.samrat@gmail.com
In reply to: Simon Riggs (#31)
#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Samrat Revagade (#32)
#34Samrat Revagade
revagade.samrat@gmail.com
In reply to: Simon Riggs (#33)
#35Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Simon Riggs (#31)
#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Pavan Deolasee (#35)
#37Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#29)
#38Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#37)
#39Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Samrat Revagade (#1)
#40Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Samrat Revagade (#1)
#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#40)
#42Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Simon Riggs (#36)
#43Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Masahiko Sawada (#42)
#44Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Masahiko Sawada (#42)
#45Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Samrat Revagade (#1)
#46Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Pavan Deolasee (#44)
#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#45)
#48Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#42)
#49Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#36)
#50Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Samrat Revagade (#1)
#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#50)
#52Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Samrat Revagade (#1)
#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#51)
#54Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Simon Riggs (#36)
#55Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#54)
#56Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#55)
#57Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#56)
#58Samrat Revagade
revagade.samrat@gmail.com
In reply to: Masahiko Sawada (#54)
#59Peter Eisentraut
peter_e@gmx.net
In reply to: Masahiko Sawada (#57)
#60Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Eisentraut (#59)
#61Samrat Revagade
revagade.samrat@gmail.com
In reply to: Masahiko Sawada (#60)
#62Peter Eisentraut
peter_e@gmx.net
In reply to: Samrat Revagade (#61)
#63Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Peter Eisentraut (#62)
#64Samrat Revagade
revagade.samrat@gmail.com
In reply to: Masahiko Sawada (#63)
#65Fujii Masao
masao.fujii@gmail.com
In reply to: Samrat Revagade (#64)
#66Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#65)
#67Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiko Sawada (#66)
#68Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#67)
#69Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#68)
#70Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#67)
#71Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiko Sawada (#70)
#72Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#71)
#73Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiko Sawada (#72)
#74Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#73)
#75Sameer Thakur
samthakur74@gmail.com
In reply to: Samrat Revagade (#64)
#76Samrat Revagade
revagade.samrat@gmail.com
In reply to: Sameer Thakur (#75)
#77Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Samrat Revagade (#76)
#78Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Fujii Masao (#73)
#79Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Pavan Deolasee (#78)
#80Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Masahiko Sawada (#79)
#81Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Pavan Deolasee (#80)
#82Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#81)
#83Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiko Sawada (#82)
#84Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Fujii Masao (#83)
#85Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Masahiko Sawada (#84)
#86Andres Freund
andres@anarazel.de
In reply to: Pavan Deolasee (#85)
#87Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Andres Freund (#86)
#88Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Pavan Deolasee (#87)
#89Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Pavan Deolasee (#85)
#90Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#86)
#91Samrat Revagade
revagade.samrat@gmail.com
In reply to: Andres Freund (#86)
#92Robert Haas
robertmhaas@gmail.com
In reply to: Samrat Revagade (#91)
#93Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#88)
#94Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Pavan Deolasee (#93)
#95Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Masahiko Sawada (#94)
#96Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Pavan Deolasee (#95)
#97Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#96)
#98Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Pavan Deolasee (#97)
#99Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Pavan Deolasee (#97)
#100Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#99)
#101Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Pavan Deolasee (#100)
#102Josh Berkus
josh@agliodbs.com
In reply to: Masahiko Sawada (#72)
#103Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#102)
#104Josh Berkus
josh@agliodbs.com
In reply to: Masahiko Sawada (#72)
#105Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#104)
#106Josh Berkus
josh@agliodbs.com
In reply to: Masahiko Sawada (#72)
#107Magnus Hagander
magnus@hagander.net
In reply to: Josh Berkus (#106)
#108Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Magnus Hagander (#107)
#109Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#107)
#110Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#106)
#111Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#107)
#112Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#110)
#113Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#105)
#114Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#113)
#115Jeff Janes
jeff.janes@gmail.com
In reply to: Alvaro Herrera (#114)
#116Andres Freund
andres@anarazel.de
In reply to: Jeff Janes (#115)
#117Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#116)
#118Jeff Janes
jeff.janes@gmail.com
In reply to: Andres Freund (#116)
#119Andres Freund
andres@anarazel.de
In reply to: Jeff Janes (#118)
#120Jeff Janes
jeff.janes@gmail.com
In reply to: Andres Freund (#119)
#121Andres Freund
andres@anarazel.de
In reply to: Jeff Janes (#120)
#122Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeff Janes (#120)