Online verification of checksums

Started by Michael Banck, over 7 years ago · 186 messages · pgsql-hackers
#1 Michael Banck
michael.banck@credativ.de

Hi,

v10 almost added online activation of checksums, but all we've got is
pg_verify_checksums, i.e. offline verification of checksums.

However, we also got (online) checksum verification during base backups,
and I have ported/adapted David Steele's recheck code to my personal
fork of pg_checksums[1], removed the online check (for verification) and
that seems to work fine.

I've now forward-ported this change to pg_verify_checksums, in order to
make this application useful for online clusters, see attached patch.

I've tested this in a tight loop (while true; do pg_verify_checksums -D
data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
done", which I already used to develop the original code in the fork and
which brought up a few bugs.

I got one checksum verification failure this way; all others were
caught by the recheck (I've introduced a 500ms delay for the first ten
failures) like this:

|pg_verify_checksums: checksum verification failed on first attempt in
|file "data1/base/16837/16850", block 7770: calculated checksum 785 but
|expected 5063
|pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
|verified ok on recheck

However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
failures like this:

|pg_verify_checksums: short read of block 2644 in file
|"data1/base/16637/16650", got only 4096 bytes

This is not strictly a verification failure; should we do anything about
this? In my fork, I am also rechecking on this[3] (and I am happy to
extend the patch that way), but that makes the code and the patch more
complicated and I wanted to check the general opinion on this case
first.

Michael

[1]: https://github.com/credativ/pg_checksums/commit/dc052f0d6f1282d3c8215b0eb28b8e7c4e74f9e5
[2]: while patching out the somewhat unhelpful (in regular operation,
anyway) debug message for every successful checksum verification
[3]: https://github.com/credativ/pg_checksums/blob/master/pg_checksums.c#L160
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

Attachments:

online-verification-of-checksums_V1.patch (text/x-patch; +58 -12)
#2 Peter Eisentraut
peter_e@gmx.net
In reply to: Michael Banck (#1)
Re: Online verification of checksums

On 26/07/2018 13:59, Michael Banck wrote:

I've now forward-ported this change to pg_verify_checksums, in order to
make this application useful for online clusters, see attached patch.

Why not provide this functionality as a server function or command.
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3 Magnus Hagander
magnus@hagander.net
In reply to: Peter Eisentraut (#2)
Re: Online verification of checksums

On Thu, Aug 30, 2018 at 8:06 PM, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:

On 26/07/2018 13:59, Michael Banck wrote:

I've now forward-ported this change to pg_verify_checksums, in order to
make this application useful for online clusters, see attached patch.

Why not provide this functionality as a server function or command.
Then you can access blocks with proper locks and don't have to do this
rather ad hoc retry logic on concurrent access.

I think it would make sense to provide this functionality in the "checksum
worker" infrastructure suggested in the online checksum enabling patch. But
I think being able to run it from the outside would also be useful,
particularly when it's this simple.

But why do we need a sleep in it? AFAICT this is basically the same code
that we have in basebackup.c, and that one does not need the sleep?
Certainly 500ms would be very long since we're just protecting against a
torn page, but the comment is wrong I think, and we're actually sleeping
0.5ms?

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#4 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Banck (#1)
Re: Online verification of checksums

Hallo Michael,

I've now forward-ported this change to pg_verify_checksums, in order to
make this application useful for online clusters, see attached patch.

Patch does not seem to apply anymore, could you rebase it?

--
Fabien.

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Banck (#1)
Re: Online verification of checksums

Hi,

The patch is mostly copying the verification / retry logic from
basebackup.c, but I think it omitted a rather important detail that
makes it incorrect in the presence of concurrent writes.

The very first thing basebackup does is this:

startptr = do_pg_start_backup(...);

i.e. it waits for a checkpoint, remembering the LSN. And then when
checking a page it does this:

    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
    {
        ... verify the page checksum
    }

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance. But the startptr
detail is pretty important because it supports this retry reasoning:

    /*
     * Retry the block on the first failure. It's
     * possible that we read the first 4K page of the
     * block just before postgres updated the entire block
     * so it ends up looking torn to us. We only need to
     * retry once because the LSN should be updated to
     * something we can ignore on the next pass. If the
     * error happens again then it is a true validation
     * failure.
     */

Imagine the 8kB page as two 4kB pages, with the initial state being
[A1,A2] and another process over-writing it with [B1,B2]. If you read
the 8kB page, what states can you see?

I don't think POSIX provides any guarantees about atomicity of the write
calls (and even if it does, the filesystems on Linux don't seem to). So
you may observe either [A1,B2] or [B1,A2], or various inconsistent mixes
of the two versions, depending on timing. Well, torn pages ...

Pretty much the only thing you can rely on is that when one process does

write([B1,B2])

the other process may first read [A1,B2], but the next read will return
[B1,B2] (or possibly newer data, if there was another write). It will
not read the "stale" A1 again.

The basebackup relies on this kinda implicitly - on the retry it'll
notice the LSN changed (thanks to the startptr check), and the page will
be skipped entirely. This is pretty important, because the new page
might be torn in some other way.

The pg_verify_checksums patch apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).

FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Michael Banck
michael.banck@credativ.de
In reply to: Tomas Vondra (#5)
Re: Online verification of checksums

Hi,

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:

The patch is mostly copying the verification / retry logic from
basebackup.c, but I think it omitted a rather important detail that
makes it incorrect in the presence of concurrent writes.

The very first thing basebackup does is this:

startptr = do_pg_start_backup(...);

i.e. it waits for a checkpoint, remembering the LSN. And then when
checking a page it does this:

    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
    {
        ... verify the page checksum
    }

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance.

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

But the startptr detail is pretty important because it supports this
retry reasoning:

    /*
     * Retry the block on the first failure. It's
     * possible that we read the first 4K page of the
     * block just before postgres updated the entire block
     * so it ends up looking torn to us. We only need to
     * retry once because the LSN should be updated to
     * something we can ignore on the next pass. If the
     * error happens again then it is a true validation
     * failure.
     */

Imagine the 8kB page as two 4kB pages, with the initial state being
[A1,A2] and another process over-writing it with [B1,B2]. If you read
the 8kB page, what states can you see?

I don't think POSIX provides any guarantees about atomicity of the write
calls (and even if it does, the filesystems on Linux don't seem to). So
you may observe either [A1,B2] or [B1,A2], or various inconsistent mixes
of the two versions, depending on timing. Well, torn pages ...

Pretty much the only thing you can rely on is that when one process does

write([B1,B2])

the other process may first read [A1,B2], but the next read will return
[B1,B2] (or possibly newer data, if there was another write). It will
not read the "stale" A1 again.

The basebackup relies on this kinda implicitly - on the retry it'll
notice the LSN changed (thanks to the startptr check), and the page will
be skipped entirely. This is pretty important, because the new page
might be torn in some other way.

The pg_verify_checksum apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

Well, ok.

So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
or some such file which pg_verify_checksums gets confused on, I guess we
should skip that as well. Can we assume that all files that start with
the ones in skip[] are safe to skip or should we have an exception for
files starting with pg_internal.init?

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

#7 Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#6)
Re: Online verification of checksums

Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance.

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

Right, basebackup doesn't need to look at other pages.

The pg_verify_checksum apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

Well, ok.

The newly read page will have an updated LSN though then on the re-read,
in which case basebackup can know that what happened was a rewrite of
the page and it no longer has to care about the page and can skip it.

I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken, but I believe it does (at least,
that's the algorithm we came up with for pgBackRest, and I know David
shared that when the basebackup code was being written).

So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.

I'm really not sure how you expect to be able to do something different
here. Even if we started poking into shared buffers, all you'd be able
to see is that there's a bunch of dirty pages- and we don't maintain the
checksums in shared buffers, so it's not like you could verify them
there.

You could possibly have an option that says "force a checkpoint" but,
honestly, that's really not all that interesting either- all you'd be
doing is forcing all the pages to be written out from shared buffers
into the kernel cache and then reading them from there instead, it's not
like you'd actually be able to tell if there was a disk/storage error
because you'll only be looking at the kernel cache.

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

I thought that's what basebackup did- if it doesn't do that today, then
it really should.

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

Not sure what the point of this is. If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there. That's something
we've talked about doing with pgBackRest, but don't currently.

FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread..

Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
or some such file which pg_verify_checksums gets confused on, I guess we
should skip that as well. Can we assume that all files that start with
the ones in skip[] are safe to skip or should we have an exception for
files starting with pg_internal.init?

Everything listed in skip is safe to skip on a restore.. I've not
really thought too much about if they're all safe to skip when checking
checksums for an online system, but I would generally think so..

Thanks!

Stephen

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#7)
Re: Online verification of checksums

On 09/17/2018 04:46 PM, Stephen Frost wrote:

Greetings,

* Michael Banck (michael.banck@credativ.de) wrote:

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance.

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

Right, basebackup doesn't need to look at other pages.

The pg_verify_checksum apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

Well, ok.

The newly read page will have an updated LSN though then on the re-read,
in which case basebackup can know that what happened was a rewrite of
the page and it no longer has to care about the page and can skip it.

I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken, but I believe it does (at least,
that's the algorithm we came up with for pgBackRest, and I know David
shared that when the basebackup code was being written).

Yes, basebackup does check the LSN on re-read, and skips the page if it
changed on re-read (because it eliminates the consistency guarantees
provided by the checkpoint).

So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.

I'm really not sure how you expect to be able to do something different
here. Even if we started poking into shared buffers, all you'd be able
to see is that there's a bunch of dirty pages- and we don't maintain the
checksums in shared buffers, so it's not like you could verify them
there.

You could possibly have an option that says "force a checkpoint" but,
honestly, that's really not all that interesting either- all you'd be
doing is forcing all the pages to be written out from shared buffers
into the kernel cache and then reading them from there instead, it's not
like you'd actually be able to tell if there was a disk/storage error
because you'll only be looking at the kernel cache.

Yeah.

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

I thought that's what basebackup did- if it doesn't do that today, then
it really should.

The crucial distinction here is that the trick is not in comparing LSNs
from the two page reads, but comparing it to the checkpoint LSN. If it's
greater, the page may be torn or broken, and there's no way to know
which case it is - so basebackup simply skips it.

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

Not sure what the point of this is. If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there. That's something
we've talked about doing with pgBackRest, but don't currently.

I agree simply printing the page numbers seems rather useless. What we
could do is remember which pages we skipped and then try checking them
after another checkpoint. Or something like that.

FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread..

Again, it's not enough to check the LSN against the preceding read. We
need a checkpoint LSN or something like that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Banck (#6)
Re: Online verification of checksums

On 09/17/2018 04:04 PM, Michael Banck wrote:

Hi,

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:

The patch is mostly copying the verification / retry logic from
basebackup.c, but I think it omitted a rather important detail that
makes it incorrect in the presence of concurrent writes.

The very first thing basebackup does is this:

startptr = do_pg_start_backup(...);

i.e. it waits for a checkpoint, remembering the LSN. And then when
checking a page it does this:

    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
    {
        ... verify the page checksum
    }

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance.

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

I suppose we might refresh the checkpoint LSN regularly, and use the
most recent one. On large/busy databases that would allow checking
larger part of the database.

But the startptr detail is pretty important because it supports this
retry reasoning:

    /*
     * Retry the block on the first failure. It's
     * possible that we read the first 4K page of the
     * block just before postgres updated the entire block
     * so it ends up looking torn to us. We only need to
     * retry once because the LSN should be updated to
     * something we can ignore on the next pass. If the
     * error happens again then it is a true validation
     * failure.
     */

Imagine the 8kB page as two 4kB pages, with the initial state being
[A1,A2] and another process over-writing it with [B1,B2]. If you read
the 8kB page, what states can you see?

I don't think POSIX provides any guarantees about atomicity of the write
calls (and even if it does, the filesystems on Linux don't seem to). So
you may observe either [A1,B2] or [B1,A2], or various inconsistent mixes
of the two versions, depending on timing. Well, torn pages ...

Pretty much the only thing you can rely on is that when one process does

write([B1,B2])

the other process may first read [A1,B2], but the next read will return
[B1,B2] (or possibly newer data, if there was another write). It will
not read the "stale" A1 again.

The basebackup relies on this kinda implicitly - on the retry it'll
notice the LSN changed (thanks to the startptr check), and the page will
be skipped entirely. This is pretty important, because the new page
might be torn in some other way.

The pg_verify_checksum apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

Well, ok.

So IMHO this should do something similar to basebackup - check the page
LSN, and if it changed then skip the page.

I'm afraid this requires using the last checkpoint LSN, the way startptr
is used in basebackup. In particular we can't simply remember LSN from
the first read, because we might actually read [B1,A2] on the first try,
and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
torn in various other ways, not necessarily at the 4kB boundary - it
might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.

But in that case the checksums are verified when reading the buffer into
shared buffers, it's not like we don't notice the checksum error at all.
We are interested in the pages that have not been read/written for an
extended period of time. So I think this is not a problem.

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

Hmmm, maybe.

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

I agree it might be useful to know how many pages were skipped, and how
many actually passed the checksum check.

FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

My guess is this happened because both the read and re-read completed
during the same write.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#8)
Re: Online verification of checksums

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 04:46 PM, Stephen Frost wrote:

* Michael Banck (michael.banck@credativ.de) wrote:

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance.

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

Right, basebackup doesn't need to look at other pages.

The pg_verify_checksum apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

Well, ok.

The newly read page will have an updated LSN though then on the re-read,
in which case basebackup can know that what happened was a rewrite of
the page and it no longer has to care about the page and can skip it.

I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken, but I believe it does (at least,
that's the algorithm we came up with for pgBackRest, and I know David
shared that when the basebackup code was being written).

Yes, basebackup does check the LSN on re-read, and skips the page if it
changed on re-read (because it eliminates the consistency guarantees
provided by the checkpoint).

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'. The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

I thought that's what basebackup did- if it doesn't do that today, then
it really should.

The crucial distinction here is that the trick is not in comparing LSNs
from the two page reads, but comparing it to the checkpoint LSN. If it's
greater, the page may be torn or broken, and there's no way to know
which case it is - so basebackup simply skips it.

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it. IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

Not sure what the point of this is. If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there. That's something
we've talked about doing with pgBackRest, but don't currently.

I agree simply printing the page numbers seems rather useless. What we
could do is remember which pages we skipped and then try checking them
after another checkpoint. Or something like that.

I'm still not sure I'm seeing the point of that. They're still going to
almost certainly be in the kernel cache. The reason for checking
against the WAL would be to detect errors in PG where we aren't putting
a page into the WAL when it really should be, or something similar,
which seems like it at least could be useful.

Maybe to put it another way- there's very little point in checking the
checksum of a page which we know must be re-written during recovery to
get to a consistent point. I don't think it hurts in the general case,
but I wouldn't write a lot of code which then needs to be tested to
handle it. I also don't think that we really need to make
pg_verify_checksum spend lots of extra cycles trying to verify that
*every* page had its checksum validated when we know that lots of pages
are going to be in memory marked dirty and our checking of them will be
ultimately pointless as they'll either be written out by the
checkpointer or some other process, or we'll replay them from the WAL if
we crash.

FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread..

Again, it's not enough to check the LSN against the preceding read. We
need a checkpoint LSN or something like that.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

Thanks!

Stephen

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#10)
Re: Online verification of checksums

Hi,

On 09/17/2018 06:42 PM, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 04:46 PM, Stephen Frost wrote:

* Michael Banck (michael.banck@credativ.de) wrote:

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:

Obviously, pg_verify_checksums can't do that easily because it's
supposed to run from outside the database instance.

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

Right, basebackup doesn't need to look at other pages.

The pg_verify_checksum apparently ignores this skip logic, because on
the retry it simply re-reads the page again, verifies the checksum and
reports an error. Which is broken, because the newly read page might be
torn again due to a concurrent write.

Well, ok.

The newly read page will have an updated LSN though then on the re-read,
in which case basebackup can know that what happened was a rewrite of
the page and it no longer has to care about the page and can skip it.

I haven't looked, but if basebackup isn't checking the LSN again for the
newly read page then that'd be broken, but I believe it does (at least,
that's the algorithm we came up with for pgBackRest, and I know David
shared that when the basebackup code was being written).

Yes, basebackup does check the LSN on re-read, and skips the page if it
changed on re-read (because it eliminates the consistency guarantees
provided by the checkpoint).

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'. The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN.

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails

So how about we do check every page, but if one fails on retry, and the
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

I thought that's what basebackup did- if it doesn't do that today, then
it really should.

The crucial distinction here is that the trick is not in comparing LSNs
from the two page reads, but comparing it to the checkpoint LSN. If it's
greater, the page may be torn or broken, and there's no way to know
which case it is - so basebackup simply skips it.

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it. IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454
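For reference, the decision at that spot boils down to something like the following standalone sketch (function name and types are stand-ins, not the actual basebackup.c code): a page's checksum is only verified when its LSN precedes the backup-start checkpoint LSN, since anything newer will be rewritten by WAL replay during restore.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Stand-in for the basebackup condition: verify only pages whose LSN
 * precedes the backup-start LSN (startptr); newer pages are skipped
 * because WAL replay overwrites them on restore anyway. */
static bool
should_verify_checksum(XLogRecPtr page_lsn, XLogRecPtr startptr)
{
    return page_lsn < startptr;
}
```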

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

Not sure what the point of this is. If we wanted to really do something
to cross-check here, we'd track the pages that were skipped and then
look through the WAL to make sure that they're there. That's something
we've talked about doing with pgBackRest, but don't currently.

I agree simply printing the page numbers seems rather useless. What we
could do is remember which pages we skipped and then try checking them
after another checkpoint. Or something like that.

I'm still not sure I'm seeing the point of that. They're still going to
almost certainly be in the kernel cache. The reason for checking
against the WAL would be to detect errors in PG where we aren't putting
a page into the WAL when it really should be, or something similar,
which seems like it at least could be useful.

Maybe to put it another way- there's very little point in checking the
checksum of a page which we know must be re-written during recovery to
get to a consistent point. I don't think it hurts in the general case,
but I wouldn't write a lot of code which then needs to be tested to
handle it. I also don't think that we really need to make
pg_verify_checksum spend lots of extra cycles trying to verify that
*every* page had its checksum validated when we know that lots of pages
are going to be in memory marked dirty and our checking of them will be
ultimately pointless as they'll either be written out by the
checkpointer or some other process, or we'll replay them from the WAL if
we crash.

Yeah, I agree.

FWIW I also don't understand the purpose of pg_sleep(), it does not seem
to protect against anything, really.

Well, I've noticed that without it I get sporadic checksum failures on
reread, so I've added it to make them go away. It was certainly a
phenomenological decision that I am happy to trade for a better one.

That then sounds like we really aren't re-checking the LSN, and we
really should be, to avoid getting these sporadic checksum failures on
reread..

Again, it's not enough to check the LSN against the preceding read. We
need a checkpoint LSN or something like that.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory and/or memory pressure, etc.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#11)
Re: Online verification of checksums

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 06:42 PM, Stephen Frost wrote:

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'. The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN.

Well, you know that the first part was written out at some prior point,
but you could end up reading the first part of a page with an older LSN
while also reading the second part with new data.

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails

So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely. Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it. IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

I agree that there's some risk there, but it's certainly much less
likely.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory and/or memory pressure, etc.

If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level. It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.

I'll chat w/ David shortly about this again though and get his thoughts
on it. This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.

Thanks!

Stephen

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#12)
Re: Online verification of checksums

On 09/17/2018 07:11 PM, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 06:42 PM, Stephen Frost wrote:

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'. The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN.

Well, you know that the first part was written out at some prior point,
but you could end up reading the first part of a page with an older LSN
while also reading the second part with new data.

Doesn't the checkpoint fsync pretty much guarantee this can't happen?

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails

So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely. Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it. IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.

Yep.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

I agree that there's some risk there, but it's certainly much less
likely.

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory and/or memory pressure, etc.

If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level. It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.

I'll chat w/ David shortly about this again though and get his thoughts
on it. This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.

Why not simply look at the last checkpoint LSN and use it the same way
basebackup does? AFAICS that should make the pg_sleep() unnecessary.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#13)
Re: Online verification of checksums

Greetings,

On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 09/17/2018 07:11 PM, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 06:42 PM, Stephen Frost wrote:

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'. The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN.

Well, you know that the first part was written out at some prior point,
but you could end up reading the first part of a page with an older LSN
while also reading the second part with new data.

Doesn't the checkpoint fsync pretty much guarantee this can't happen?

How? Either it’s possible for the latter half of a page to be updated
before the first half (where the LSN lives), or it isn’t. If it’s possible
then that LSN could be ancient and it wouldn’t matter.

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails

So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely. Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

The question is whether it’s possible to catch a torn page where the second
half is updated *before* the first half of the page in a read (and then have
that state persist across subsequent reads). I’m skeptical that this can
really happen in the first place- having an interrupted system call stalled
across two more system calls just seems terribly unlikely, and this all rests
on the assumption that the kernel might write the second half of a write to
the kernel cache before the first in the first place.

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it. IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.

Yep.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

I agree that there's some risk there, but it's certainly much less
likely.

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

Yes, I definitely agree that we don’t want to mis-report checksum failures
if we can avoid it.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory and/or memory pressure, etc.

If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level. It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.

I'll chat w/ David shortly about this again though and get his thoughts
on it. This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.

Why not simply look at the last checkpoint LSN and use it the same way
basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Use that to compare to what? The LSN in the first half of the page could
be from well before the checkpoint or even the backup started.

Thanks!

Stephen

#15Michael Banck
michael.banck@credativ.de
In reply to: Tomas Vondra (#13)
Re: Online verification of checksums

Hi,

so, trying some intermediate summary here, sorry for (also) top-posting:

1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup. 

2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.

3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new. (Maybe it should read new checkpoint LSNs
as they come in during its runtime as well? See below).

4. The pg_sleep should go.

5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.
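Points 3 and 4 could be sketched roughly as follows (illustrative names only; this simulates the proposed control flow, not the actual patch): read the checkpoint LSN from pg_control once at startup, and on a checksum mismatch re-read the block once; if the re-read page's LSN is newer than the checkpoint LSN, treat it as a concurrent write and skip it rather than report corruption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef enum
{
    BLOCK_OK,       /* checksum verified */
    BLOCK_SKIPPED,  /* concurrently modified; WAL replay covers it */
    BLOCK_CORRUPT   /* stable checksum failure with a pre-checkpoint LSN */
} VerifyResult;

/* Hypothetical re-check rule for online verification: on a mismatch,
 * re-read once; a re-read page whose LSN is newer than the checkpoint
 * LSN (taken from pg_control at startup) is treated as an in-flight
 * write and skipped instead of being reported as corruption. */
static VerifyResult
recheck_block(bool first_read_ok, bool reread_ok,
              XLogRecPtr reread_lsn, XLogRecPtr checkpoint_lsn)
{
    if (first_read_ok || reread_ok)
        return BLOCK_OK;        /* torn write completed between reads */
    if (reread_lsn > checkpoint_lsn)
        return BLOCK_SKIPPED;   /* page is being rewritten right now */
    return BLOCK_CORRUPT;       /* old LSN + persistent failure */
}
```

Note the asymmetry this buys over the pg_sleep(): a genuinely corrupt page keeps its old (pre-checkpoint) LSN and is still reported, while a torn page necessarily carries a post-checkpoint LSN and is skipped, with no timing dependence.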

Further comments:

Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:

On 09/17/2018 07:11 PM, Stephen Frost wrote:

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 06:42 PM, Stephen Frost wrote:
Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails

So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely. Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it. IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.

Yep.

Right, see point 1 above.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

I agree that there's some risk there, but it's certainly much less
likely.

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

I agree.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than what we could get from this, so I would prefer not to
do this.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory pressure, etc.

I swapped out the pg_sleep earlier today for the check-against-
checkpoint-LSN-on-reread, and that seems to work just as fine, at least
in the tests I ran.

If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level. It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.

I'll chat w/ David shortly about this again though and get his thoughts
on it. This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.

Why not simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Right.

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz

#16Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#15)
Re: Online verification of checksums

Greetings,

On Mon, Sep 17, 2018 at 13:38 Michael Banck <michael.banck@credativ.de>
wrote:

so, trying some intermediate summary here, sorry for (also) top-posting:

1. the basebackup checksum verification logic only checks pages not
changed since the checkpoint, which makes sense for the basebackup.

Right. I’m tending towards the idea that this also be adopted for
pg_verify_checksums.

2. However, it would be desirable to go further for pg_verify_checksums
and (re-)check all pages.

Maybe. I’m not entirely convinced that it’s all that useful.

3. pg_verify_checksums should read the checkpoint LSN on startup and
compare the page LSN against it on re-read, and discard pages which have
checksum failures but are new. (Maybe it should read new checkpoint LSNs
as they come in during its runtime as well? See below).

I’m not sure that we really need to but I’m not against it either- but in
that case you’re definitely going to see checksum failures on torn pages.

4. The pg_sleep should go.

I know that pgbackrest does not have a sleep currently and we’ve not yet
seen or been able to reproduce this case where, on a reread, we still see
an older LSN, but we check the LSN first also. If it’s possible that the
LSN still hasn’t changed on the reread then maybe we do need to have a
sleep to force ourselves off CPU to allow the other process to finish
writing, or maybe finish the file and come back around to these pages
later, but we have yet to see this behavior in the wild anywhere, nor have
we been able to reproduce it.

5. There seems to be no consensus on whether the number of skipped pages
should be summarized at the end.

I agree with printing the number of skipped pages, that does seem like a
nice to have. I don’t know that actually printing the pages themselves is
all that useful though.

Further comments:

Am Montag, den 17.09.2018, 19:19 +0200 schrieb Tomas Vondra:

On 09/17/2018 07:11 PM, Stephen Frost wrote:

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 06:42 PM, Stephen Frost wrote:
Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails

So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely. Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

Sure, because we don't care about it any longer- that page isn't
interesting because the WAL will replay over it. IIRC it actually goes
something like: check the checksum, if it failed then check if the LSN
is greater than the checkpoint (of the backup start..), if not, then
re-read, if the LSN is now newer than the checkpoint then skip, if the
LSN is the same then throw an error.

Nope, we only verify the checksum if its LSN precedes the checkpoint:

https://github.com/postgres/postgres/blob/master/src/backend/replication/basebackup.c#L1454

That seems like it's leaving something on the table, but, to be fair, we
know that all of those pages should be rewritten by WAL anyway so they
aren't all that interesting to us, particularly in the basebackup case.

Yep.

Right, see point 1 above.

I actually tend to disagree with you that, for this purpose, it's
actually necessary to check against the checkpoint LSN- if the LSN
changed and everything is operating correctly then the new LSN must be
more recent than the last checkpoint location or things are broken
badly.

I don't follow. Are you suggesting we don't need the checkpoint LSN?

I'm pretty sure that's not the case. The thing is - the LSN may not
change between the two reads, but that's not a guarantee the page was
not torn. The example I posted earlier in this message illustrates that.

I agree that there's some risk there, but it's certainly much less
likely.

Well. If we're going to report a checksum failure, we better be sure it
actually is a broken page. I don't want users to start chasing bogus
data corruption issues.

I agree.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

Are you suggesting here that we always check against the current
checkpoint, or is checking against the checkpoint that we saw at startup
enough? I think re-reading pg_control all the time might be more
error-prone than what we could get from this, so I would prefer not to
do this.

I don’t follow why rereading pg_control would be error-prone. That said, I
don’t have a particularly strong opinion either way on this.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory pressure, etc.

I swapped out the pg_sleep earlier today for the check-against-
checkpoint-LSN-on-reread, and that seems to work just as fine, at least
in the tests I ran.

Ok, this sounds like you were probably seeing normal forward torn pages,
and we have certainly seen that before.

If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level. It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.

I'll chat w/ David shortly about this again though and get his thoughts
on it. This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.

Why not simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Right.

This is fine if you know the kernel will always write the first page first,
or you accept that a reread of a page which isn’t valid will always result
in seeing a completely updated page.

We’ve made the assumption that a reread on a failure where the LSN on the
first read was older than the backup-start LSN will give us an updated
first-half of the page which we then check the LSN, of, but we have yet to
prove that this is actually possible.

Thanks!

Stephen

#17Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#14)
Re: Online verification of checksums

On 09/17/2018 07:35 PM, Stephen Frost wrote:

Greetings,

On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>> wrote:

On 09/17/2018 07:11 PM, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com

<mailto:tomas.vondra@2ndquadrant.com>) wrote:

On 09/17/2018 06:42 PM, Stephen Frost wrote:

Ok, good, though I'm not sure what you mean by 'eliminates the
consistency guarantees provided by the checkpoint'. The point is that
the page will be in the WAL and the WAL will be replayed during the
restore of the backup.

The checkpoint guarantees that the whole page was written and flushed to
disk with an LSN before the checkpoint LSN. So when you read a page with
that LSN, you know the whole write already completed and a read won't
return data from before the LSN.

Well, you know that the first part was written out at some prior point,
but you could end up reading the first part of a page with an older LSN
while also reading the second part with new data.

Doesn't the checkpoint fsync pretty much guarantee this can't happen?

How? Either it’s possible for the latter half of a page to be updated
before the first half (where the LSN lives), or it isn’t. If it’s
possible then that LSN could be ancient and it wouldn’t matter. 

I'm not sure I understand what you're saying here.

It is not about the latter half of the page being updated before the
first half. I don't think that's quite possible, because write() into
page cache does in fact write the data sequentially.

The problem is that the write is not atomic, and AFAIK it happens in
sectors (which are either 512B or 4K these days). And it may arbitrarily
interleave with reads.

So you may do write(8k), but it actually happens in 512B chunks and a
concurrent read may observe some mix of those.

But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.

Without the checkpoint we risk seeing the same write() both in read and
re-read, just in a different stage - so the LSN would not change, making
the check futile.

But by waiting for the checkpoint we know that the original write is no
longer in progress, so if we saw a partial write we're guaranteed to see
a new LSN on re-read.

This is what I mean by the checkpoint / fsync guarantee.

Without the checkpoint that's not guaranteed, and simply re-reading the
page and rechecking it vs. the first read does not help:

1) write the first 512B of the page (sector), which includes the LSN

2) read the whole page, which will be a mix [new 512B, ... old ... ]

3) the checksum verification fails

4) read the page again (possibly reading a bit more new data)

5) the LSN did not change compared to the first read, yet the checksum
still fails

So, I agree with all of the above though I've found it to be extremely
rare to get a single read which you've managed to catch part-way through
a write, getting multiple of them over a period of time strikes me as
even more unlikely. Still, if we can come up with a solution to solve
all of this, great, but I'm not sure that I'm hearing one.

I don't recall claiming catching many such torn pages - I'm sure it's
not very common in most workloads. But I suspect constructing workloads
hitting them regularly is not very difficult either (something with a
lot of churn in shared buffers should do the trick).

The question is if it’s possible to catch a torn page where the second
half is updated *before* the first half of the page in a read (and then
in subsequent reads having that state be maintained).  I have some
skepticism that it’s really possible to happen in the first place but
having an interrupted system call be stalled across two more system
calls just seems terribly unlikely, and this is all based on the
assumption that the kernel might write the second half of a write before
the first to the kernel cache in the first place.

Yes, if that was possible, the explanation about the checkpoint fsync
guarantee would be bogus, obviously.

I've spent quite a bit of time looking into how write() is handled, and
I believe seeing only the second half is not possible. You may observe a
page torn in various ways (not necessarily in half), e.g.

[old,new,old]

but then the re-read you should be guaranteed to see new data up until
the last "new" chunk:

[new,new,old]

At least that's my understanding. I failed to deduce what POSIX says
about this, or how it behaves on various OS/filesystems.

The one thing I've done was writing a simple stress test that writes a
single 8kB page in a loop, reads it concurrently and checks the
behavior. And it seems consistent with my understanding.

Now, that said, I do think it's a good *idea* to check against the
checkpoint LSN (presuming this is for online checking of checksums- for
basebackup, we could just check against the backup-start LSN as anything
after that point will be rewritten by WAL anyway). The reason that I
think it's a good idea to check against the checkpoint LSN is that we'd
want to throw a big warning if the kernel is just feeding us random
garbage on reads and only finding a difference between two reads isn't
really doing any kind of validation, whereas checking against the
checkpoint-LSN would at least give us some idea that the value being
read isn't completely ridiculous.

When it comes to if the pg_sleep() is necessary or not, I have to admit
to being unsure about that.. I could see how it might be but it seems a
bit surprising- I'd probably want to see exactly what the page was at
the time of the failure and at the time of the second (no-sleep) re-read
and then after a delay and convince myself that it was just an unlucky
case of being scheduled in twice to read that page before the process
writing it out got a chance to finish the write.

I think the pg_sleep() is a pretty strong sign there's something broken.
At the very least, it's likely to misbehave on machines with different
timings, machines under memory pressure, etc.

If we assume that what you've outlined above is a serious enough issue
that we have to address it, and do so without a pg_sleep(), then I think
we have to bake into this a way for the process to check with PG as to
what the page's current LSN is, in shared buffers, because that's the
only place where we've got the locking required to ensure that we don't
end up with a read of a partially written page, and I'm really not
entirely convinced that we need to go to that level. It'd certainly add
a huge amount of additional complexity for what appears to be a quite
unlikely gain.

I'll chat w/ David shortly about this again though and get his thoughts
on it. This is certainly an area we've spent time thinking about but
are obviously also open to finding a better solution.

Why not simply look at the last checkpoint LSN and use that the same
way basebackup does? AFAICS that should make the pg_sleep() unnecessary.

Use that to compare to what?  The LSN in the first half of the page
could be from well before the checkpoint or even the backup started.

Not sure I follow. If the LSN in the page header is old, and the
checksum check failed, then on re-read we either find a new LSN (in
which case we skip the page) or consider this to be a checksum failure.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#18Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#17)
Re: Online verification of checksums

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 07:35 PM, Stephen Frost wrote:

On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>> wrote:
Doesn't the checkpoint fsync pretty much guarantee this can't happen?

How? Either it’s possible for the latter half of a page to be updated
before the first half (where the LSN lives), or it isn’t. If it’s
possible then that LSN could be ancient and it wouldn’t matter. 

I'm not sure I understand what you're saying here.

It is not about the latter half of the page being updated before the
first half. I don't think that's quite possible, because write() into
page cache does in fact write the data sequentially.

Well, maybe 'updated before' wasn't quite the right way to talk about
it, but consider if a read(8K) gets only half-way through the copy
before having to go do something else and by the time it gets back, a
write has come in and rewritten the page, such that the read(8K)
returns half-old and half-new data.

The problem is that the write is not atomic, and AFAIK it happens in
sectors (which are either 512B or 4K these days). And it may arbitrarily
interleave with reads.

Yes, of course the write isn't atomic, that's clear.

So you may do write(8k), but it actually happens in 512B chunks and a
concurrent read may observe some mix of those.

Right, I'm not sure that we really need to worry about sub-4K writes
though I suppose they're technically possible, but it doesn't much
matter in this case since the LSN is early on in the page, of course.

But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.

If that's guaranteed then we can just check the LSN and be done.

Without the checkpoint we risk seeing the same write() both in read and
re-read, just in a different stage - so the LSN would not change, making
the check futile.

This is the part that isn't making much sense to me. If we are
guaranteed that writes into the kernel cache are always in order and
always at least 512B in size, then if we check the LSN first and
discover it's "old", and then read the rest of the page and calculate
the checksum, discover it's a bad checksum, and then go back and re-read
the page then we *must* see that the LSN has changed OR conclude that
the checksum is invalidated.

The reason this can happen in the first place is that our 8K read might
only get half-way done before getting scheduled off and a 8K write
happened on the page before our read(8K) gets back to finishing the
read, but if what you're saying is true, then we can't ever have a case
where such a thing would happen and a re-read would still see the "old"
LSN.

If we check the LSN first and discover it's "new" (as in, more recent
than our last checkpoint, or the checkpoint where the backup started)
then, sure, there's going to be a risk that the page is currently being
written right that moment and isn't yet completely valid.

The problem that we aren't solving for is if, somehow, we do a read(8K)
and get the first half/second half mixup and then on a subsequent
read(8K) we see that *again*, implying that somehow the kernel's copy
has the latter-half of the page updated consistently but not the first
half. That's a problem that I haven't got a solution to today. I'd
love to have a guarantee that it's not possible- we've certainly never
seen it but it's been a concern and I thought Michael was suggesting
he'd seen that, but it sounds like there wasn't a check on the LSN in
the first read, in which case it could have just been a 'regular' torn
page case.

But by waiting for the checkpoint we know that the original write is no
longer in progress, so if we saw a partial write we're guaranteed to see
a new LSN on re-read.

This is what I mean by the checkpoint / fsync guarantee.

I don't think any of this really has anything to do with either fsync
being called or with the actual checkpointing process (except to the
extent that the checkpointer is the thing doing the writing, and that we
should be checking the LSN against the LSN of the last checkpoint when
we started, or against the start of the backup LSN if we're talking
about doing a backup).

The question is if it’s possible to catch a torn page where the second
half is updated *before* the first half of the page in a read (and then
in subsequent reads having that state be maintained).  I have some
skepticism that it’s really possible to happen in the first place but
having an interrupted system call be stalled across two more system
calls just seems terribly unlikely, and this is all based on the
assumption that the kernel might write the second half of a write before
the first to the kernel cache in the first place.

Yes, if that was possible, the explanation about the checkpoint fsync
guarantee would be bogus, obviously.

I've spent quite a bit of time looking into how write() is handled, and
I believe seeing only the second half is not possible. You may observe a
page torn in various ways (not necessarily in half), e.g.

[old,new,old]

but then the re-read you should be guaranteed to see new data up until
the last "new" chunk:

[new,new,old]

At least that's my understanding. I failed to deduce what POSIX says
about this, or how it behaves on various OS/filesystems.

The one thing I've done was writing a simple stress test that writes a
single 8kB page in a loop, reads it concurrently and checks the
behavior. And it seems consistent with my understanding.

Good.

Use that to compare to what?  The LSN in the first half of the page
could be from well before the checkpoint or even the backup started.

Not sure I follow. If the LSN in the page header is old, and the
checksum check failed, then on re-read we either find a new LSN (in
which case we skip the page) or consider this to be a checksum failure.

Right, I'm in agreement with doing that and it's what is done in
pgbasebackup and pgBackRest.

Thanks!

Stephen

#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#18)
Re: Online verification of checksums

On 09/18/2018 12:01 AM, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 07:35 PM, Stephen Frost wrote:

On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com
<mailto:tomas.vondra@2ndquadrant.com>> wrote:
Doesn't the checkpoint fsync pretty much guarantee this can't happen?

How? Either it’s possible for the latter half of a page to be updated
before the first half (where the LSN lives), or it isn’t. If it’s
possible then that LSN could be ancient and it wouldn’t matter. 

I'm not sure I understand what you're saying here.

It is not about the latter half of the page being updated before the
first half. I don't think that's quite possible, because write() into
page cache does in fact write the data sequentially.

Well, maybe 'updated before' wasn't quite the right way to talk about
it, but consider if a read(8K) gets only half-way through the copy
before having to go do something else and by the time it gets back, a
write has come in and rewritten the page, such that the read(8K)
returns half-old and half-new data.

The problem is that the write is not atomic, and AFAIK it happens in
sectors (which are either 512B or 4K these days). And it may arbitrarily
interleave with reads.

Yes, of course the write isn't atomic, that's clear.

So you may do write(8k), but it actually happens in 512B chunks and a
concurrent read may observe some mix of those.

Right, I'm not sure that we really need to worry about sub-4K writes
though I suppose they're technically possible, but it doesn't much
matter in this case since the LSN is early on in the page, of course.

But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.

If that's guaranteed then we can just check the LSN and be done.

What do you mean by "check the LSN"? Compare it to LSN from the first
read? You don't know if the first read already saw the new LSN or not
(see the next example).

Without the checkpoint we risk seeing the same write() both in read and
re-read, just in a different stage - so the LSN would not change, making
the check futile.

This is the part that isn't making much sense to me. If we are
guaranteed that writes into the kernel cache are always in order and
always at least 512B in size, then if we check the LSN first and
discover it's "old", and then read the rest of the page and calculate
the checksum, discover it's a bad checksum, and then go back and re-read
the page then we *must* see that the LSN has changed OR conclude that
the checksum is invalidated.

Even if the writes are in order and in 512B chunks, you don't know how
they are interleaved with the reads.

Let's assume we're doing a write(), which splits the 8kB page into 512B
chunks. A concurrent read may observe a random mix of old and new data,
depending on timing.

So let's say a read sees the first 2kB of data like this:

[new, new, new, old, new, old, new, old]

OK, the page is obviously torn, checksum fails, and we try reading it
again. We should see new data at least until the last 'new' chunk in the
first read, so let's say we got this:

[new, new, new, new, new, new, new, old]

Obviously, this page is also torn (there are old data at the end), but
we've read the new data in both cases, which includes the LSN. So the
LSN is the same in both cases, and your detection fails.

Comparing the page LSN to the last checkpoint LSN solves this, because
if the LSN is older than the checkpoint LSN, that write must have been
completed by now, and so we're not in danger of seeing only incomplete
effects of it. And newer write will update the LSN.

The reason this can happen in the first place is that our 8K read might
only get half-way done before getting scheduled off and a 8K write
happened on the page before our read(8K) gets back to finishing the
read, but if what you're saying is true, then we can't ever have a case
where such a thing would happen and a re-read would still see the "old"
LSN.

If we check the LSN first and discover it's "new" (as in, more recent
than our last checkpoint, or the checkpoint where the backup started)
then, sure, there's going to be a risk that the page is currently being
written right that moment and isn't yet completely valid.

Right.

The problem that we aren't solving for is if, somehow, we do a read(8K)
and get the first half/second half mixup and then on a subsequent
read(8K) we see that *again*, implying that somehow the kernel's copy
has the latter-half of the page updated consistently but not the first
half. That's a problem that I haven't got a solution to today. I'd
love to have a guarantee that it's not possible- we've certainly never
seen it but it's been a concern and I thought Michael was suggesting
he'd seen that, but it sounds like there wasn't a check on the LSN in
the first read, in which case it could have just been a 'regular' torn
page case.

Well, yeah. If that would be possible, we'd be in serious trouble. I've
done quite a bit of experimentation with concurrent reads and writes and
I have not observed such behavior. Of course, that's hardly a proof it
can't happen, and it wouldn't be the first surprise with respect to
kernel I/O this year ...

But by waiting for the checkpoint we know that the original write is no
longer in progress, so if we saw a partial write we're guaranteed to see
a new LSN on re-read.

This is what I mean by the checkpoint / fsync guarantee.

I don't think any of this really has anything to do with either fsync
being called or with the actual checkpointing process (except to the
extent that the checkpointer is the thing doing the writing, and that we
should be checking the LSN against the LSN of the last checkpoint when
we started, or against the start of the backup LSN if we're talking
about doing a backup).

You're right it's not about the fsync, sorry for the confusion. My point
is that using the checkpoint LSN gives us a guarantee that write is no
longer in progress, and so we can't see a page torn because of it. And
if we see a partial write due to a new write, it's guaranteed to update
the page LSN (and we'll notice it).

The question is whether it's possible to catch a torn page where the
second half is updated *before* the first half of the page in a read
(and then have subsequent reads maintain that state). I'm skeptical
that it's really possible in the first place: having an interrupted
system call stay stalled across two more system calls seems terribly
unlikely, and this all rests on the assumption that the kernel might
write the second half of a write to the kernel cache before the first
half at all.

Yes, if that were possible, the explanation about the checkpoint fsync
guarantee would be bogus, obviously.

I've spent quite a bit of time looking into how write() is handled, and
I believe seeing only the second half is not possible. You may observe a
page torn in various ways (not necessarily in half), e.g.

[old,new,old]

but then on re-read you should be guaranteed to see new data up until
the last "new" chunk:

[new,new,old]

At least that's my understanding. I failed to deduce what POSIX says
about this, or how it behaves on various OS/filesystems.

The one thing I did was write a simple stress test that writes a
single 8kB page in a loop, reads it concurrently, and checks the
behavior. And it seems consistent with my understanding.
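The invariant the stress test checks (if one read saw the new data up to
some chunk, a later read must see new data at least that far) can be
captured as a small model. This is an illustrative sketch of the
invariant only, not the actual test program; the chunk count and the
function name are invented for the example:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model an 8kB page as NCHUNKS chunks, each stamped with the sequence
 * number of the write() that last filled it.  The claimed invariant: if
 * one read observed the new write's data in some chunk, a later read
 * must observe the new data in every chunk up to and including that
 * one, because write() fills the page cache front to back.  So
 * [old,new,old] followed by [new,new,old] is fine, but [old,new,old]
 * followed by [old,new,old] again would be a violation.
 */
#define NCHUNKS 8

static bool
reread_is_consistent(const int prev[NCHUNKS], const int curr[NCHUNKS],
                     int new_seq)
{
    int last_new = -1;
    int i;

    /* find the last chunk where the previous read saw the new data */
    for (i = 0; i < NCHUNKS; i++)
        if (prev[i] == new_seq)
            last_new = i;

    /* every chunk up to that point must now also be new */
    for (i = 0; i <= last_new; i++)
        if (curr[i] != new_seq)
            return false;
    return true;
}
```

The [old,new,old] / [new,new,old] example from above maps directly onto
two calls with those chunk patterns.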

Good.

Use that to compare to what?  The LSN in the first half of the page
could be from well before the checkpoint or even the backup started.

Not sure I follow. If the LSN in the page header is old, and the
checksum check failed, then on re-read we either find a new LSN (in
which case we skip the page) or consider this to be a checksum failure.

Right, I'm in agreement with doing that and it's what is done in
pg_basebackup and pgBackRest.

OK. All I'm saying is pg_verify_checksums should probably do the same
thing, i.e. grab checkpoint LSN and roll with that.
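The recheck being described might look roughly like the following
sketch. The type and function names here are invented for illustration
and are not taken from pg_verify_checksums or the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's WAL position type */

typedef enum
{
    BLOCK_OK,           /* checksum matched */
    BLOCK_IN_FLIGHT,    /* page written after the scan started: skip it */
    BLOCK_CORRUPT       /* genuine checksum failure */
} BlockVerdict;

/*
 * Classify a block after a failed first check and a re-read.
 * checkpoint_lsn is the redo location of the last checkpoint, read once
 * from pg_control before the scan starts.  A page LSN newer than that
 * means the page was (re)written while we were scanning, so a checksum
 * mismatch there is expected and the block is skipped rather than
 * reported as corrupt.
 */
static BlockVerdict
classify_block(bool checksum_ok, XLogRecPtr page_lsn, XLogRecPtr checkpoint_lsn)
{
    if (checksum_ok)
        return BLOCK_OK;
    if (page_lsn > checkpoint_lsn)
        return BLOCK_IN_FLIGHT;
    return BLOCK_CORRUPT;
}
```

The key property, per the discussion above, is that the checkpoint LSN
is taken once before the scan, so any page whose header LSN exceeds it
must have been written after the original (possibly torn) write had
already completed.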

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#19)
Re: Online verification of checksums

Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/18/2018 12:01 AM, Stephen Frost wrote:

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:

On 09/17/2018 07:35 PM, Stephen Frost wrote:
But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.

If that's guaranteed then we can just check the LSN and be done.

What do you mean by "check the LSN"? Compare it to the LSN from the
first read? You don't know if the first read already saw the new LSN or
not (see the next example).

Hmm, ok, I can see your point there. I've been going back and forth
between checking against what the prior LSN was on the page and checking
it against an independent source (like the last checkpoint's LSN), but...

[...]

Comparing the page LSN to the last checkpoint LSN solves this, because
if the LSN is older than the checkpoint LSN, that write must have been
completed by now, and so we're not in danger of seeing only incomplete
effects of it. And a newer write will update the LSN.

Yeah, that makes sense; we need to be looking at something which only
gets updated once the write has actually completed, and the last
checkpoint's LSN gives us that guarantee.

The problem that we aren't solving for is if, somehow, we do a read(8K)
and get the first half/second half mixup and then on a subsequent
read(8K) we see that *again*, implying that somehow the kernel's copy
has the latter-half of the page updated consistently but not the first
half. That's a problem that I haven't got a solution to today. I'd
love to have a guarantee that it's not possible; we've certainly never
seen it but it's been a concern and I thought Michael was suggesting
he'd seen that, but it sounds like there wasn't a check on the LSN in
the first read, in which case it could have just been a 'regular' torn
page case.

Well, yeah. If that were possible, we'd be in serious trouble. I've
done quite a bit of experimentation with concurrent reads and writes and
I have not observed such behavior. Of course, that's hardly a proof it
can't happen, and it wouldn't be the first surprise with respect to
kernel I/O this year ...

I'm glad to hear that you've done a lot of experimentation in this area
and haven't seen such strange behavior happen- we've got quite a few
people running pgBackRest with checksum-checking and haven't seen it
either, but it's always been a bit of a concern.

You're right it's not about the fsync, sorry for the confusion. My point
is that using the checkpoint LSN gives us a guarantee that write is no
longer in progress, and so we can't see a page torn because of it. And
if we see a partial write due to a new write, it's guaranteed to update
the page LSN (and we'll notice it).

Right, no worries about the confusion. I hadn't been fully thinking
through the LSN bit either; what we really need is some external
confirmation of a write having *completed* (not just started), and that
makes a definite difference.

Right, I'm in agreement with doing that and it's what is done in
pg_basebackup and pgBackRest.

OK. All I'm saying is pg_verify_checksums should probably do the same
thing, i.e. grab checkpoint LSN and roll with that.

Agreed.

Thanks!

Stephen

#21Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#16)
#22Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#20)
#23Michael Banck
michael.banck@credativ.de
In reply to: Michael Banck (#1)
#24Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#23)
#25David Steele
david@pgmasters.net
In reply to: Stephen Frost (#24)
#26Michael Banck
michael.banck@credativ.de
In reply to: David Steele (#25)
#27Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Banck (#26)
#28Michael Banck
michael.banck@credativ.de
In reply to: Fabien COELHO (#27)
#29Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#28)
#30Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Banck (#28)
#31Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#29)
#32Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Stephen Frost (#29)
#33Stephen Frost
sfrost@snowman.net
In reply to: Fabien COELHO (#32)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Banck (#31)
#35Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Banck (#31)
#36Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#35)
#37Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#36)
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#37)
#39Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Banck (#31)
#40Michael Banck
michael.banck@credativ.de
In reply to: Fabien COELHO (#39)
#41Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Banck (#40)
#42Michael Banck
michael.banck@credativ.de
In reply to: Fabien COELHO (#41)
#43Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#42)
#44Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#43)
#45Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#44)
#46Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Stephen Frost (#45)
#47Michael Paquier
michael@paquier.xyz
In reply to: Dmitry Dolgov (#46)
#48Michael Banck
michael.banck@credativ.de
In reply to: Michael Paquier (#47)
#49Michael Banck
michael.banck@credativ.de
In reply to: Michael Banck (#48)
#50Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Banck (#48)
#51Andres Freund
andres@anarazel.de
In reply to: Fabien COELHO (#50)
#52Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#51)
#53Michael Banck
michael.banck@credativ.de
In reply to: Andres Freund (#51)
#54Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Banck (#53)
#55Andres Freund
andres@anarazel.de
In reply to: Fabien COELHO (#54)
#56Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#55)
#57Michael Banck
michael.banck@credativ.de
In reply to: Tomas Vondra (#56)
#58Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#57)
#59Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#58)
#60Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Banck (#59)
#61Michael Banck
michael.banck@credativ.de
In reply to: Fabien COELHO (#60)
#62Robert Haas
robertmhaas@gmail.com
In reply to: Michael Banck (#23)
#63Michael Banck
michael.banck@credativ.de
In reply to: Robert Haas (#62)
#64Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#63)
#65Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#62)
#66Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Stephen Frost (#64)
#67Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#66)
#68Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#67)
#69Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#68)
#70Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Paquier (#68)
#71Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#64)
#72Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#69)
#73Michael Paquier
michael@paquier.xyz
In reply to: Michael Banck (#71)
#74Michael Paquier
michael@paquier.xyz
In reply to: Fabien COELHO (#70)
#75Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Michael Paquier (#74)
#76Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#74)
#77Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#74)
#78Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#72)
#79Magnus Hagander
magnus@hagander.net
In reply to: Tomas Vondra (#77)
#80Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#78)
#81Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Paquier (#80)
#82Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#81)
#83Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#82)
#84Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#65)
#85Robert Haas
robertmhaas@gmail.com
In reply to: Michael Banck (#63)
#86Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#85)
#87Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#84)
#88Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#86)
#89Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#88)
#90Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#89)
#91Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#90)
#92Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#86)
#93Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#92)
#94Michael Banck
michael.banck@credativ.de
In reply to: Michael Banck (#71)
#95Julien Rouhaud
rjuju123@gmail.com
In reply to: Andres Freund (#93)
#96Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Julien Rouhaud (#95)
#97Julien Rouhaud
rjuju123@gmail.com
In reply to: Tomas Vondra (#96)
#98Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#65)
#99Stephen Frost
sfrost@snowman.net
In reply to: Tomas Vondra (#90)
#100Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#98)
#101Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#100)
#102Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#101)
#103Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#102)
#104Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#101)
#105Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#104)
#106Michael Banck
michael.banck@credativ.de
In reply to: Michael Banck (#102)
#107Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#103)
#108Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#107)
#109Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#100)
#110Michael Banck
michael.banck@credativ.de
In reply to: Stephen Frost (#108)
#111Stephen Frost
sfrost@snowman.net
In reply to: Michael Banck (#110)
#112Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#101)
#113Michael Banck
michael.banck@credativ.de
In reply to: Robert Haas (#112)
#114Andres Freund
andres@anarazel.de
In reply to: Michael Banck (#113)
#115Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#114)
#116Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#115)
#117Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#116)
#118Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#117)
#119Michael Banck
michael.banck@credativ.de
In reply to: Andres Freund (#116)
#120Andres Freund
andres@anarazel.de
In reply to: Michael Banck (#119)
#121Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#120)
#122Michael Banck
michael.banck@credativ.de
In reply to: Andres Freund (#116)
#123Andres Freund
andres@anarazel.de
In reply to: Michael Banck (#122)
#124Michael Banck
michael.banck@credativ.de
In reply to: Andres Freund (#123)
#125Michael Banck
michael.banck@credativ.de
In reply to: Michael Banck (#124)
#126Michael Banck
michael.banck@credativ.de
In reply to: Michael Banck (#110)
#127Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Michael Banck (#126)
#128Michael Banck
michael.banck@credativ.de
In reply to: Tomas Vondra (#127)
#129Andres Freund
andres@anarazel.de
In reply to: Michael Banck (#128)
#130Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#129)
#131Magnus Hagander
magnus@hagander.net
In reply to: Tomas Vondra (#130)
#132Stephen Frost
sfrost@snowman.net
In reply to: Magnus Hagander (#131)
#133Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#132)
#134Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#133)
#135Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#134)
#136Magnus Hagander
magnus@hagander.net
In reply to: Stephen Frost (#132)
#137Michael Banck
michael.banck@credativ.de
In reply to: Magnus Hagander (#136)
#138Magnus Hagander
magnus@hagander.net
In reply to: Michael Banck (#137)
#139Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#138)
#140Michael Banck
michael.banck@credativ.de
In reply to: Michael Banck (#125)
#141Michael Paquier
michael@paquier.xyz
In reply to: Michael Banck (#140)
#142Michael Banck
michael.banck@credativ.de
In reply to: Michael Paquier (#141)
#143Asif Rehman
asifr.rehman@gmail.com
In reply to: Michael Banck (#142)
#144Michael Banck
michael.banck@credativ.de
In reply to: Asif Rehman (#143)
#145Asif Rehman
asifr.rehman@gmail.com
In reply to: Michael Banck (#144)
#146Michael Banck
michael.banck@credativ.de
In reply to: Asif Rehman (#145)
#147Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Banck (#146)
#148Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#147)
#149Michael Banck
michael.banck@credativ.de
In reply to: Tom Lane (#148)
#150Daniel Gustafsson
daniel@yesql.se
In reply to: Michael Banck (#149)
#151Daniel Gustafsson
daniel@yesql.se
In reply to: Daniel Gustafsson (#150)
#152Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#148)
#153Michael Banck
michael.banck@credativ.de
In reply to: Michael Paquier (#152)
#154Michael Paquier
michael@paquier.xyz
In reply to: Michael Banck (#153)
#155Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#154)
#156Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#155)
#157Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#156)
#158Michael Banck
michael.banck@credativ.de
In reply to: Michael Paquier (#157)
#159Michael Paquier
michael@paquier.xyz
In reply to: Michael Banck (#158)
#160Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#159)
#161Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#160)
#162Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#161)
#163Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#162)
#164Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#163)
#165David Steele
david@pgmasters.net
In reply to: Michael Paquier (#164)
#166Stephen Frost
sfrost@snowman.net
In reply to: David Steele (#165)
#167Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#166)
#168Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Michael Paquier (#167)
#169Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#167)
#170Stephen Frost
sfrost@snowman.net
In reply to: Anastasia Lubennikova (#168)
#171Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Stephen Frost (#170)
#172Stephen Frost
sfrost@snowman.net
In reply to: Anastasia Lubennikova (#171)
#173Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#170)
#174Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#172)
#175Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#174)
#176David Steele
david@pgmasters.net
In reply to: Michael Paquier (#173)
#177Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#176)
#178Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#177)
#179Stephen Frost
sfrost@snowman.net
In reply to: Magnus Hagander (#178)
#180Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#179)
#181Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#180)
#182David Steele
david@pgmasters.net
In reply to: Stephen Frost (#181)
#183David Steele
david@pgmasters.net
In reply to: David Steele (#182)
#184Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: David Steele (#183)
#185Daniel Gustafsson
daniel@yesql.se
In reply to: Ibrar Ahmed (#184)
#186Daniel Gustafsson
daniel@yesql.se
In reply to: Daniel Gustafsson (#185)