WAL replay bugs

Started by Heikki Linnakangas about 12 years ago. 47 messages. pgsql-hackers.
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

I've been playing with a little hack that records a before and after
image of every page modification that is WAL-logged, and writes the
images to a file along with the LSN of the corresponding WAL record. I
set up a master-standby replication with that hack in place in both
servers, and ran the regression suite. Then I compared the after images
after every WAL record, as written on master, and as replayed by the
standby.

The idea is that the page content in the standby after replaying a WAL
record should be identical to the page in the master, when the WAL
record was generated. There are some known cases where that doesn't
hold, but it's a useful sanity check. To reduce noise, I've been
focusing on one access method at a time, filtering out others.
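In outline, the comparison step can be sketched like this (a simplified Python illustration; the in-memory representation and names are assumptions, not the actual hack, which writes the images to files):

```python
# Sketch: compare per-WAL-record "after" page images captured on the master
# against the images produced by replay on the standby, keyed by the LSN of
# the corresponding WAL record.

def compare_captures(master, standby):
    """master/standby: dicts mapping LSN -> raw page bytes.
    Returns (lsn, first_differing_byte_offset) for every diverging image."""
    mismatches = []
    for lsn, master_img in master.items():
        standby_img = standby.get(lsn)
        if standby_img is None:
            continue  # record not captured on the standby
        if master_img != standby_img:
            # report the first differing byte for easier debugging
            offset = next(i for i, (a, b) in
                          enumerate(zip(master_img, standby_img)) if a != b)
            mismatches.append((lsn, offset))
    return mismatches

master = {0x1000: b"\x01\x02\x03", 0x1008: b"\x0a\x0b"}
standby = {0x1000: b"\x01\x02\x03", 0x1008: b"\x0a\xff"}
print(compare_captures(master, standby))  # -> [(4104, 1)]
```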

I did that for GIN first, and indeed found a bug in my new
incomplete-split code, see commit 594bac42. After fixing that, and
zeroing some padding bytes (38a2b95c), I'm now getting a clean run with
that.

Next, I took on GiST, and lo and behold, found a bug there pretty quickly
as well. This one has been there ever since we got Hot Standby: the redo
of a page update (e.g. an insertion) resets the right-link of the page.
If there is a concurrent scan, in a hot standby server, that scan might
still need the rightlink, and will hence miss some tuples. This can be
reproduced like this:

1. in master, create test table.

CREATE TABLE gisttest (id int4);
CREATE INDEX gisttest_idx ON gisttest USING gist (id);
INSERT INTO gisttest SELECT g * 1000 from generate_series(1, 100000) g;

-- Test function. Starts a scan, fetches one row from it, then waits 10
-- seconds before fetching the rest of the rows.
-- Returns the number of rows scanned. Should be 100000 if you follow
-- these test instructions.
CREATE OR REPLACE FUNCTION gisttestfunc() RETURNS int AS
$$
declare
i int4;
t text;
cur CURSOR FOR SELECT 'foo' FROM gisttest WHERE id >= 0;
begin
set enable_seqscan=off; set enable_bitmapscan=off;

i = 0;
OPEN cur;
FETCH cur INTO t;

perform pg_sleep(10);

LOOP
EXIT WHEN NOT FOUND; -- this is bogus on first iteration
i = i + 1;
FETCH cur INTO t;
END LOOP;
CLOSE cur;
RETURN i;
END;
$$ LANGUAGE plpgsql;

2. in standby

SELECT gisttestfunc();
<blocks>

3. Quickly, before the scan in standby continues, cause some page splits:

INSERT INTO gisttest SELECT g * 1000+1 from generate_series(1, 100000) g;

4. The scan in standby finishes. It should return 100000, but will
return a lower number if you hit the bug.

At a quick glance, I think fixing that is just a matter of not resetting
the right-link. I'll take a closer look tomorrow, but for now I just
wanted to report what I've been doing. I'll post the scripts I've been
using later too - nag me if I don't.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#1)
Re: WAL replay bugs

On 04/07/2014 02:16 PM, Heikki Linnakangas wrote:

I've been playing with a little hack that records a before and after
image of every page modification that is WAL-logged, and writes the
images to a file along with the LSN of the corresponding WAL record. I
set up a master-standby replication with that hack in place in both
servers, and ran the regression suite. Then I compared the after images
after every WAL record, as written on master, and as replayed by the
standby.

This is awesome ... thank you for doing this.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#3 Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#1)
Re: WAL replay bugs

On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I've been playing with a little hack that records a before and after image
of every page modification that is WAL-logged, and writes the images to a
file along with the LSN of the corresponding WAL record. I set up a
master-standby replication with that hack in place in both servers, and ran
the regression suite. Then I compared the after images after every WAL
record, as written on master, and as replayed by the standby.

Assuming that adding some dedicated hooks in core, able to perform
actions before and after a page modification occurs, is not *that*
costly (well, I imagine it is not acceptable in terms of performance),
could it be possible to get that in the shape of an extension that
could be used to test WAL record consistency? This may be an idea to
think about...
--
Michael


#4 Sachin Kotwal
kotsachin@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: WAL replay bugs

I executed the given steps many times to reproduce this bug,
but I am still unable to hit it.
I used the attached script to try to reproduce it.

Can I get the scripts you used to reproduce this bug?

wal_replay_bug.sh
<http://postgresql.1045698.n5.nabble.com/file/n5799512/wal_replay_bug.sh>

-----
Thanks and Regards,

Sachin Kotwal
NTT-DATA-OSS Center (Pune)
--
View this message in context: http://postgresql.1045698.n5.nabble.com/WAL-replay-bugs-tp5799053p5799512.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


#5 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Sachin Kotwal (#4)
Re: WAL replay bugs

On 04/10/2014 10:52 AM, sachin kotwal wrote:

I executed given steps many times to produce this bug.
But still I unable to hit this bug.
I used attached scripts to produce this bug.

Can I get scripts to produce this bug?

wal_replay_bug.sh
<http://postgresql.1045698.n5.nabble.com/file/n5799512/wal_replay_bug.sh>

Oh, I can't reproduce it using that script either. I must've used some
variation of it and posted the wrong script.

The attached seems to do the trick. I changed the INSERT statements
slightly, so that all the new rows have the same key.

Thanks for verifying this!

- Heikki

Attachments:

wal_replay_bug.sh (application/x-shellscript)
#6 Sachin Kotwal
kotsachin@gmail.com
In reply to: Heikki Linnakangas (#5)
Re: WAL replay bugs

On Thu, Apr 10, 2014 at 6:21 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 04/10/2014 10:52 AM, sachin kotwal wrote:

I executed given steps many times to produce this bug.
But still I unable to hit this bug.
I used attached scripts to produce this bug.

Can I get scripts to produce this bug?

Oh, I can't reproduce it using that script either. I must've used some
variation of it, and posted wrong script.

The attached seems to do the trick. I changed the INSERT statements
slightly, so that all the new rows have the same key.

Thanks for verifying this!

Thanks for explaining how to produce this bug.
I am able to reproduce it using the latest script from your last mail.
I applied the patch submitted for this bug and re-ran the scripts.
Now it gives the correct result.

Thanks and Regards,

Sachin Kotwal

#7 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#3)
Re: WAL replay bugs

On 04/08/2014 06:41 AM, Michael Paquier wrote:

On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I've been playing with a little hack that records a before and after image
of every page modification that is WAL-logged, and writes the images to a
file along with the LSN of the corresponding WAL record. I set up a
master-standby replication with that hack in place in both servers, and ran
the regression suite. Then I compared the after images after every WAL
record, as written on master, and as replayed by the standby.

Assuming that adding some dedicated hooks in the core able to do
actions before and after a page modification occur is not *that*
costly (well I imagine that it is not acceptable in terms of
performance), could it be possible to get that in the shape of a
extension that could be used to test WAL record consistency? This may
be an idea to think about...

Yeah, working on it. It can live as a patch set if nothing else.

This has been very fruitful; I just committed another fix for a bug I
found with this earlier today.

There are quite a few things that cause differences between master and
standby. We have hint bits in many places, unused space that isn't
zeroed etc.

Two things that are not bugs, but I'd like to change just to make this
tool easier to maintain, and to generally clean things up:

1. When creating a sequence, we first use simple_heap_insert() to insert
the sequence tuple, which creates a WAL record. Then we write a new
sequence RM WAL record about the same thing. The reason is that the WAL
record written by regular heap_insert is bogus for a sequence tuple.
After replaying just the heap insertion, but not the other record, the
page doesn't have the magic value indicating that it's a sequence, i.e.
it's broken as a sequence page. That's OK because we only do this when
creating a new sequence, so if we crash between those two records, the
whole relation is not visible to anyone. Nevertheless, I'd like to fix
that by using PageAddItem directly to insert the tuple, instead of
simple_heap_insert. We have to override the xmin field of the tuple
anyway, and we don't need any of the other services like finding the
insert location, toasting, visibility map or freespace map updates, that
simple_heap_insert() provides.

2. _bt_restore_page, when restoring a B-tree page split record. It adds
tuples to the page in reverse order compared to how it's done in master.
There is a comment noting that, and it asks "Is it worth changing just
on general principles?". Yes, I think it is.
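The layout effect behind point 2 can be shown with a toy model (Python; this loosely mimics how tuple data is placed from pd_upper downward, but it is not the real B-tree page format):

```python
# Toy model: "restore" the same items onto a page in forward vs. reverse
# order.  The pages hold the same data, yet differ byte-for-byte, which is
# exactly the kind of difference the comparison tool flags.

def build_page(items):
    """Place each item's data top-down in a fixed-size data area,
    loosely mimicking PageAddItem filling space from pd_upper downward."""
    area = bytearray(16)
    upper = len(area)
    for item in items:
        upper -= len(item)
        area[upper:upper + len(item)] = item
    return bytes(area)

forward = build_page([b"aa", b"bb", b"cc"])
backward = build_page([b"cc", b"bb", b"aa"])
assert forward != backward                   # physical layout differs
assert sorted(forward) == sorted(backward)   # same content overall
```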

Any objections to changing those two?

- Heikki


#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#7)
Re: WAL replay bugs

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

Two things that are not bugs, but I'd like to change just to make this
tool easier to maintain, and to generally clean things up:

1. When creating a sequence, we first use simple_heap_insert() to insert
the sequence tuple, which creates a WAL record. Then we write a new
sequence RM WAL record about the same thing. The reason is that the WAL
record written by regular heap_insert is bogus for a sequence tuple.
After replaying just the heap insertion, but not the other record, the
page doesn't have the magic value indicating that it's a sequence, i.e.
it's broken as a sequence page. That's OK because we only do this when
creating a new sequence, so if we crash between those two records, the
whole relation is not visible to anyone. Nevertheless, I'd like to fix
that by using PageAddItem directly to insert the tuple, instead of
simple_heap_insert. We have to override the xmin field of the tuple
anyway, and we don't need any of the other services like finding the
insert location, toasting, visibility map or freespace map updates, that
simple_heap_insert() provides.

2. _bt_restore_page, when restoring a B-tree page split record. It adds
tuples to the page in reverse order compared to how it's done in master.
There is a comment noting that, and it asks "Is it worth changing just
on general principles?". Yes, I think it is.

Any objections to changing those two?

Not here. I've always suspected #2 was going to bite us someday anyway.

regards, tom lane


#9 Peter Geoghegan

In reply to: Tom Lane (#8)
Re: WAL replay bugs

On Thu, Apr 17, 2014 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Any objections to changing those two?

Not here. I've always suspected #2 was going to bite us someday anyway.

+1

--
Peter Geoghegan


#10 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#7)
Re: WAL replay bugs

On 04/17/2014 07:59 PM, Heikki Linnakangas wrote:

On 04/08/2014 06:41 AM, Michael Paquier wrote:

On Tue, Apr 8, 2014 at 3:16 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I've been playing with a little hack that records a before and after image
of every page modification that is WAL-logged, and writes the images to a
file along with the LSN of the corresponding WAL record. I set up a
master-standby replication with that hack in place in both servers, and ran
the regression suite. Then I compared the after images after every WAL
record, as written on master, and as replayed by the standby.

Assuming that adding some dedicated hooks in the core able to do
actions before and after a page modification occur is not *that*
costly (well I imagine that it is not acceptable in terms of
performance), could it be possible to get that in the shape of a
extension that could be used to test WAL record consistency? This may
be an idea to think about...

Yeah, working on it. It can live as a patch set if nothing else.

This has been very fruitful, I just committed another fix for a bug I
found with this earlier today.

There are quite a few things that cause differences between master and
standby. We have hint bits in many places, unused space that isn't
zeroed etc.

[a few more fixed bugs later]

Ok, I'm now getting clean output when running the regression suite with
this tool.

And here is the tool itself. It consists of two parts:

1. Modifications to the backend to write the page images
2. A post-processing tool to compare the logged images between master
and standby.

The attached diff contains both parts. The postprocessing tool is in
contrib/page_image_logging. See contrib/page_image_logging/README for
instructions. Let me know if you have any questions or need further help
running the tool.

I've also pushed this to my git repository at
git://git.postgresql.org/git/users/heikki/postgres.git, branch
"page_image_logging". I intend to keep it up-to-date with current master.

This is a pretty ugly hack, so I'm not proposing to commit this in the
current state. But perhaps this could be done more cleanly, by adding
some hooks in the backend as Michael suggested.

- Heikki

Attachments:

page_image_logging-1.patch (text/x-diff, +857/-6)
#11 Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#10)
Re: WAL replay bugs

On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

And here is the tool itself. It consists of two parts:

1. Modifications to the backend to write the page images
2. A post-processing tool to compare the logged images between master and
standby.

Having that in Postgres, at the disposal of developers, would be
great, and I believe that it would greatly reduce the occurrence of
bugs caused by WAL replay during recovery. So, with the permission of
the author, I have been looking at integrating this facility more
cleanly into Postgres.

Roughly, this utility is made of three parts:
1) A set of masking functions that can be used on page images to
normalize them. These put magic numbers or enforce flag values to make
page content consistent across nodes: for example the free space
between pd_lower and pd_upper, pd_flags, etc. Of course this depends
on the type of page (btree, heap, etc.).
2) A facility to capture page images, detect whether they have been
modified, and flush them to a dedicated file. This interacts mainly
with the buffer manager.
3) A facility to reorder page images within the same WAL record, as a
master and a standby may not write them in the same order (due, for
example, to locks being released in a different order). This is part
of the binary that analyzes the diffs between master and standby.

As of now, 2) is integrated in the backend, 1) and 3) are part of the
contrib module. However I am thinking that 1) and 2) should be done in
core using an ifdef similar to CLOBBER_FREED_MEMORY, to mask the page
images and write them in a dedicated file (in global/ ?), while 3)
would be fine as a separate binary in contrib/. An essential thing to
add would be to have a set of regression tests that developers and
buildfarm machines could directly use.

Perhaps there are parts of what is proposed here that could be made
more generalized, like the masking functions. So do not hesitate if
you have any opinion on the matter.

Regards,
--
Michael


#12 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#11)
Re: WAL replay bugs

On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Perhaps there are parts of what is proposed here that could be made
more generalized, like the masking functions. So do not hesitate if
you have any opinion on the matter.

OK, attached is the result of this hacking:

Buffer capture facility: check WAL replay consistency

This is a tool aimed at developers and buildfarm machines, used to
check consistency at page level when replaying WAL files among several
nodes of a cluster (generally a master and a standby node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and appended to a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across nodes consistent (flags like hint bits, for example), and each
captured buffer is then written as a single line of the output file in
the following format:
LSN: %08X/%08X page: PAGE_IN_HEXA
The hexadecimal format makes it easier to detect differences between
pages, and the layout is chosen to facilitate comparison between
buffer entries.
- A client part, located in contrib/buffer_capture_cmp, that can be
used to compare buffer captures between nodes.
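A client-side parser for a line in that format might look like this (Python sketch; only the "LSN: %08X/%08X page: PAGE_IN_HEXA" layout comes from the description above, the rest is assumed):

```python
def parse_capture_line(line: str):
    """Split one buffer_captures line into (lsn, page bytes).
    Expected layout: 'LSN: XXXXXXXX/XXXXXXXX page: <page in hex>'."""
    head, hexpage = line.split(" page: ", 1)
    hi, lo = head[len("LSN: "):].split("/")
    lsn = (int(hi, 16) << 32) | int(lo, 16)  # combine the two 32-bit halves
    return lsn, bytes.fromhex(hexpage.strip())

lsn, page = parse_capture_line("LSN: 00000001/0000002D page: deadbeef")
assert lsn == (1 << 32) | 0x2D
assert page == b"\xde\xad\xbe\xef"
```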

The footprint on core code is minimal and is controlled by a symbol
called BUFFER_CAPTURE that needs to be set at build time to enable the
buffer capture at server level. If this symbol is not enabled, both
server and client parts are idle and generate nothing.

Note that this facility can generate a lot of output (11 GB when
running the regression tests, and double that when counting both
master and standby).

contrib/buffer_capture_cmp contains a regression test facility that
makes testing with buffer captures easy. The user just needs to run
"make check" in this folder... There is a default set of tests saved
in test-default.sh, but the user is free to set up custom tests by
creating a file called test-custom.sh, which the test facility runs
instead of the defaults if present.

The patch will be added to the first commitfest as well. Note that
the footprint on core code is limited, so even though there are more
than 1k lines of code, the review is simpler than it looks.

A couple of things to note though:
1) In order to detect if a page is used for a sequence, SEQ_MAGIC
needs to be exposed in sequence.h. This is included in the patch
attached but perhaps this should be changed as a separate patch
2) The regression test facility uses some useful parts taken from
pg_upgrade. I think that we should gather those parts in a common
place (contrib/common?). This would facilitate the integration of
other modules whose regression tests are based on bash scripts.
3) While hacking on this facility, I noticed that some ItemId entries
in btree pages could be inconsistent between master and standby. Those
items are masked in the current patch, but this looks like a bug in
Postgres itself.

Documentation is added in the code itself; I didn't feel any need to
expose this facility to ordinary users in doc/src/sgml...
Regards,
--
Michael

Attachments:

0001-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/plain, +1266/-10)
#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#12)
Re: WAL replay bugs

On 06/13/2014 10:14 AM, Michael Paquier wrote:

On Mon, Jun 2, 2014 at 9:55 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Perhaps there are parts of what is proposed here that could be made
more generalized, like the masking functions. So do not hesitate if
you have any opinion on the matter.

OK, attached is the result of this hacking:

Buffer capture facility: check WAL replay consistency

It is a tool aimed to be used by developers and buildfarm machines
that can be used to check for consistency at page level when replaying
WAL files among several nodes of a cluster (generally master and
standby node).

This facility is made of two parts:
- A server part, where all the changes happening at page level are
captured and inserted in a file called buffer_captures located at the
root of PGDATA. Each buffer entry is masked to make the comparison
across node consistent (flags like hint bits for example) and then
each buffer is captured is with the following format as a single line
of the output file:
LSN: %08X/%08X page: PAGE_IN_HEXA
Hexadecimal format makes it easier to detect differences between
pages, and format is chosen to facilitate comparison between buffer
entries.
- A client part, located in contrib/buffer_capture_cmp, that can be
used to compare buffer captures between nodes.

Oh, you moved the masking code from the client tool to the backend. Why?
When debugging, it's useful to have the genuine, non-masked page image
available.

- Heikki


#14 Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#13)
Re: WAL replay bugs

On Fri, Jun 13, 2014 at 4:48 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:


Oh, you moved the masking code from the client tool to the backend. Why?
When debugging, it's useful to have the genuine, non-masked page image
available.

My thought was to spread the CPU effort of masking across the backends...
It's not a big deal to move it back to the client tool, though.
--
Michael


#15 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#14)
Re: WAL replay bugs

On Fri, Jun 13, 2014 at 4:50 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Jun 13, 2014 at 4:48 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:


Oh, you moved the masking code from the client tool to the backend. Why?
When debugging, it's useful to have the genuine, non-masked page image
available.

My thought is to share the CPU effort of masking between backends...

And having a set of APIs to do page masking on the server side would
be useful for extensions as well. Now that I recall, this was one of
the first things that came to my mind when looking at this facility:
that it would be useful to have them in a separate file, with a
dedicated header.
--
Michael


#16 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#12)
Re: WAL replay bugs

On Fri, Jun 13, 2014 at 4:14 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

A couple of things to note though:
1) In order to detect if a page is used for a sequence, SEQ_MAGIC
needs to be exposed in sequence.h. This is included in the patch
attached but perhaps this should be changed as a separate patch
2) Regression test facility uses some useful parts taken from
pg_upgrade. I think that we should gather those parts in a common
place (contrib/common?). This can facilitate the integration of other
modules using regression based on bash scripts.
3) While hacking this facility, I noticed that some ItemId entries in
btree pages could be inconsistent between master and standby. Those
items are masked in the current patch, but it looks like a bug of
Postgres itself.

Attached are 3 patches doing exactly this separation, for readability.
Regards,
--
Michael

Attachments:

0001-Move-SEQ_MAGIC-to-sequence_h.patch (text/plain, +4/-6)
0002-Extract-generic-bash-initialization-process-from-pg_upgrade.patch (text/plain, +64/-44)
0003-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/plain, +1215/-5)
#17 Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#11)
Re: WAL replay bugs

On Mon, Jun 2, 2014 at 8:55 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Apr 23, 2014 at 9:43 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

And here is the tool itself. It consists of two parts:

1. Modifications to the backend to write the page images
2. A post-processing tool to compare the logged images between master and
standby.

Having that into Postgres at the disposition of developers would be
great, and I believe that it would greatly reduce the occurrence of
bugs caused by WAL replay during recovery. So, with the permission of
the author, I have been looking at this facility for a cleaner
integration into Postgres.

I'm not sure if this is reasonably possible, but one thing that would
make this tool a whole lot easier to use would be if you could make
all the magic happen in a single server. For example, suppose you had
a background process that somehow got access to the pre and post
images for every buffer change, and the associated WAL record, and
tried applying the WAL record to the pre-image to see whether it got
the corresponding post-image. Then you could run 'make check' or so
and afterwards do something like psql -c 'SELECT * FROM
wal_replay_problems()' and hopefully get no rows back.
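The checking loop Robert describes could be sketched like this (Python; `redo` stands in for a resource-manager redo routine, and the toy record format is entirely hypothetical):

```python
# Sketch of the single-server idea: keep the pre-image of each page change,
# replay the WAL record against it, and compare the result with the
# post-image that was actually produced.

def check_replay(changes, redo):
    """changes: iterable of (wal_record, pre_image, post_image) triples.
    Returns the records whose redo result differs from the post-image."""
    return [record for record, pre, post in changes
            if redo(record, pre) != post]

# toy redo routine: a "record" patches a single byte of the page
def toy_redo(record, pre):
    offset, value = record
    img = bytearray(pre)
    img[offset] = value
    return bytes(img)

changes = [((0, 7), b"\x00\x01", b"\x07\x01"),  # replays consistently
           ((1, 9), b"\x00\x01", b"\x00\x08")]  # post-image doesn't match
print(check_replay(changes, toy_redo))  # -> [(1, 9)]
```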

Don't get me wrong, having this tool at all sounds great. But I think
to really get the full benefit out of it we need to be able to run it
in the buildfarm, so that if people break stuff it gets noticed
quickly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#18 Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#17)
Re: WAL replay bugs

On Wed, Jun 18, 2014 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I'm not sure if this is reasonably possible, but one thing that would
make this tool a whole lot easier to use would be if you could make
all the magic happen in a single server. For example, suppose you had
a background process that somehow got access to the pre and post
images for every buffer change, and the associated WAL record, and
tried applying the WAL record to the pre-image to see whether it got
the corresponding post-image. Then you could run 'make check' or so
and afterwards do something like psql -c 'SELECT * FROM
wal_replay_problems()' and hopefully get no rows back.

So your point is to have a 3rd independent server in the process that
would compare images taken from a master and its standby? Seems to
complicate the machinery.

Don't get me wrong, having this tool at all sounds great. But I think
to really get the full benefit out of it we need to be able to run it
in the buildfarm, so that if people break stuff it gets noticed
quickly.

The patch I sent includes a regression test suite that makes testing
quite easy: it's only a matter of running "make check" in the contrib
directory containing the binary that compares buffer captures between
a master and a standby.

Thanks,
--
Michael


#19 Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#18)
Re: WAL replay bugs

On Tue, Jun 17, 2014 at 5:40 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Jun 18, 2014 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Jun 2, 2014 at 8:55 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> [...]
>> I'm not sure if this is reasonably possible, but one thing that would
>> make this tool a whole lot easier to use would be if you could make
>> all the magic happen in a single server. For example, suppose you had
>> a background process that somehow got access to the pre and post
>> images for every buffer change, and the associated WAL record, and
>> tried applying the WAL record to the pre-image to see whether it got
>> the corresponding post-image. Then you could run 'make check' or so
>> and afterwards do something like psql -c 'SELECT * FROM
>> wal_replay_problems()' and hopefully get no rows back.
>
> So your point is to have a 3rd independent server in the process that
> would compare images taken from a master and its standby? Seems to
> complicate the machinery.

No, I was trying to get it down from 2 servers to 1, not from 2 servers up to 3.

>> Don't get me wrong, having this tool at all sounds great. But I think
>> to really get the full benefit out of it we need to be able to run it
>> in the buildfarm, so that if people break stuff it gets noticed
>> quickly.
>
> The patch I sent has included a regression test suite making the tests
> rather facilitated: that's only a matter of running actually "make
> check" in the contrib repository containing the binary able to compare
> buffer captures between a master and a standby.

Cool!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#20Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#16)
Re: WAL replay bugs

On Mon, Jun 16, 2014 at 12:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Jun 13, 2014 at 4:14 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> A couple of things to note though:
>> 1) In order to detect if a page is used for a sequence, SEQ_MAGIC
>> needs to be exposed in sequence.h. This is included in the patch
>> attached, but perhaps this should be changed as a separate patch.
>> 2) The regression test facility uses some useful parts taken from
>> pg_upgrade. I think that we should gather those parts in a common
>> place (contrib/common?). This would facilitate the integration of other
>> modules whose regression tests are based on bash scripts.
>> 3) While hacking on this facility, I noticed that some ItemId entries in
>> btree pages could be inconsistent between master and standby. Those
>> items are masked in the current patch, but it looks like a bug in
>> Postgres itself.
>
> Attached are 3 patches doing exactly this separation, for readability.

Here are rebased patches; there was a conflict with a recent commit in
contrib/pg_upgrade.
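[Editor's note: a minimal sketch of the "masking" idea from point 3 above. The spans and page sizes are illustrative assumptions, not the patch's actual masking rules; it only shows zeroing page regions that may legitimately differ (e.g. unused ItemId slots or hint bits) before comparing raw images.]

```python
# Hypothetical sketch: mask ignorable (offset, length) spans in two page
# images, then compare the masked copies byte-for-byte.

def mask_page(image, spans):
    """Return a copy of the page with the given (offset, length) spans zeroed."""
    page = bytearray(image)
    for offset, length in spans:
        page[offset:offset + length] = b"\x00" * length
    return bytes(page)

def pages_match(master_image, standby_image, spans):
    """Compare two page images after masking the ignorable spans."""
    return mask_page(master_image, spans) == mask_page(standby_image, spans)

# Two 16-byte "pages" differing only inside a masked span:
a = bytes(range(16))
b = bytearray(a)
b[4] = 0xFF
print(pages_match(a, bytes(b), [(4, 1)]))  # True: the difference is masked out
print(pages_match(a, bytes(b), []))        # False: raw images differ
```

The real facility has to derive the maskable spans from the page's access method (which is why point 1 exposes SEQ_MAGIC: to recognize sequence pages); here they are simply passed in.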
--
Michael

Attachments:

- 0001-Move-SEQ_MAGIC-to-sequence.h.patch (text/x-diff, +4/-6)
- 0002-Extract-generic-bash-initialization-process-from-pg_.patch (text/x-diff, +67/-13)
- 0003-Buffer-capture-facility-check-WAL-replay-consistency.patch (text/x-diff, +1215/-5)
Remaining messages in this thread (bodies not captured here):

#21 Michael Paquier, in reply to #20
#22 Kyotaro Horiguchi <horikyota.ntt@gmail.com>, in reply to #21
#23 Michael Paquier, in reply to #22
#24 Kyotaro Horiguchi, in reply to #23
#25 Michael Paquier, in reply to #24
#26 Tom Lane <tgl@sss.pgh.pa.us>, in reply to #25
#27 Michael Paquier, in reply to #26
#28 Kyotaro Horiguchi, in reply to #24
#29 Michael Paquier, in reply to #28
#30 Kyotaro Horiguchi, in reply to #29
#31 Kyotaro Horiguchi, in reply to #25
#32 Kyotaro Horiguchi, in reply to #30
#33 Michael Paquier, in reply to #32
#34 Kyotaro Horiguchi, in reply to #33
#35 Michael Paquier, in reply to #34
#36 Kyotaro Horiguchi, in reply to #35
#37 Kyotaro Horiguchi, in reply to #36
#38 Alvaro Herrera <alvherre@2ndquadrant.com>, in reply to #35
#39 Alvaro Herrera, in reply to #35
#40 Peter Eisentraut <peter_e@gmx.net>, in reply to #39
#41 Michael Paquier, in reply to #39
#42 Michael Paquier, in reply to #40
#43 Alvaro Herrera, in reply to #41
#44 Peter Eisentraut, in reply to #42
#45 Michael Paquier, in reply to #44
#46 Alvaro Herrera, in reply to #35
#47 Michael Paquier, in reply to #46