011_crash_recovery.pl intermittently fails

Started by Kyotaro Horiguchi · about 5 years ago · 32 messages · hackers
#1 Kyotaro Horiguchi
horikyota.ntt@gmail.com

Hello.

I noticed that 011_crash_recovery.pl intermittently fails in the second
test (that is, about one run in three in my environment).

t/011_crash_recovery.pl .. 2/3
# Failed test 'new xid after restart is greater'
# at t/011_crash_recovery.pl line 56.
# '539'
# >
# '539'

# Failed test 'xid is aborted after crash'
# at t/011_crash_recovery.pl line 60.
# got: 'committed'
# expected: 'aborted'
# Looks like you failed 2 tests of 3.
t/011_crash_recovery.pl .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/3 subtests

Test Summary Report
-------------------
t/011_crash_recovery.pl (Wstat: 512 Tests: 3 Failed: 2)
Failed tests: 2-3
Non-zero exit status: 2
Files=1, Tests=3, 3 wallclock secs ( 0.03 usr 0.01 sys + 1.90 cusr 0.39 csys = 2.33 CPU)
Result: FAIL

If the server crashed before emitting any WAL records for the transaction
just started, the restarted server cannot know that the xid was ever
assigned. I'm not sure whether that is the intention of the test, but we
must make sure the WAL is emitted before crashing. CHECKPOINT ensures
that.
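
To make the race concrete, here is a toy model in plain Python (not PostgreSQL code; the class and method names are invented for illustration). If the record assigning the xid never reaches durable WAL before the crash, recovery has no trace of it, so the restarted server hands out the same xid again, which is why the "new xid after restart is greater" check can fail:

```python
# Toy model of the race described above (invented names, not PG code).

class WalSim:
    def __init__(self):
        self.buffer = []      # WAL records not yet flushed to "disk"
        self.disk = []        # durable WAL
        self.next_xid = 539

    def begin_xact(self):
        xid = self.next_xid
        self.next_xid += 1
        self.buffer.append(("assign", xid))   # record is only buffered
        return xid

    def flush(self):
        # What a CHECKPOINT (among other things) forces to happen.
        self.disk.extend(self.buffer)
        self.buffer.clear()

    def crash_and_recover(self):
        self.buffer.clear()   # unflushed WAL is lost in the crash
        seen = [xid for op, xid in self.disk if op == "assign"]
        # Recovery restarts xid allocation after the last durable xid.
        self.next_xid = (max(seen) + 1) if seen else 539

# Without a flush, the restarted server re-issues the same xid:
wal = WalSim()
old_xid = wal.begin_xact()        # client observes xid 539
wal.crash_and_recover()
assert wal.begin_xact() == old_xid    # 539 again: test 2 fails

# With a flush before the crash, the xid is not reused:
wal = WalSim()
old_xid = wal.begin_xact()
wal.flush()
wal.crash_and_recover()
assert wal.begin_xact() > old_xid
```

The reused xid then belongs to a different, later transaction; once that one commits, asking about the old xid reports 'committed', which matches the third failure above.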

Thoughts? The attached seems to stabilize the test for me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

stabilize_011_crash_recovery_pl.patch (text/x-patch; charset=us-ascii)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#1)
Re: 011_crash_recovery.pl intermittently fails

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

I noticed that 011_crash_recovery.pl intermittently fails in the second
test (that is, about one run in three in my environment).

Hmmm ... what environment is that? This test script hasn't changed
meaningfully in several years, and we have not seen any real issues
with it up to now.

If the server crashed before emitting any WAL records for the transaction
just started, the restarted server cannot know that the xid was ever
assigned. I'm not sure whether that is the intention of the test, but we
must make sure the WAL is emitted before crashing. CHECKPOINT ensures
that.

The original commit for this test says

----
commit 857ee8e391ff6654ef9dcc5dd8b658d7709d0a3c
Author: Robert Haas <rhaas@postgresql.org>
Date: Fri Mar 24 12:00:53 2017 -0400

Add a txid_status function.

If your connection to the database server is lost while a COMMIT is
in progress, it may be difficult to figure out whether the COMMIT was
successful or not. This function will tell you, provided that you
don't wait too long to ask. It may be useful in other situations,
too.

Craig Ringer, reviewed by Simon Riggs and by me

Discussion: /messages/by-id/CAMsr+YHQiWNEi0daCTboS40T+V5s_+dst3PYv_8v2wNVH+Xx4g@mail.gmail.com
----

If the function needs a CHECKPOINT to give a reliable answer,
is it actually good for the claimed purpose?

Independently of that, I doubt that adding a checkpoint call
after the pg_current_xact_id() call is going to help. The
Perl script is able to move on as soon as it's read the
function result. If we need this hack, it has to be put
before that SELECT, AFAICS.

regards, tom lane

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#2)
Re: 011_crash_recovery.pl intermittently fails

On Thu, Mar 4, 2021 at 7:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmmm ... what environment is that? This test script hasn't changed
meaningfully in several years, and we have not seen any real issues
with it up to now.

Did you see this recent thread?

/messages/by-id/20210208215206.mqmrnpkaqrdtm7fj@alap3.anarazel.de

--
Peter Geoghegan

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#3)
Re: 011_crash_recovery.pl intermittently fails

Peter Geoghegan <pg@bowt.ie> writes:

On Thu, Mar 4, 2021 at 7:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmmm ... what environment is that? This test script hasn't changed
meaningfully in several years, and we have not seen any real issues
with it up to now.

Did you see this recent thread?
/messages/by-id/20210208215206.mqmrnpkaqrdtm7fj@alap3.anarazel.de

Hadn't paid much attention at the time, but yeah, it looks like Andres
tripped over some variant of this.

I'd be kind of inclined to remove this test script altogether, on the
grounds that it's wasting cycles on a function that doesn't really
do what is claimed (and we should remove the documentation claim, too).

Having said that, it's still true that this test has been stable in
the buildfarm. Andres explained what he had to perturb to make it
fail, so I'm still interested in what Horiguchi-san did to break it.

regards, tom lane

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#4)
Re: 011_crash_recovery.pl intermittently fails

I wrote:

I'd be kind of inclined to remove this test script altogether, on the
grounds that it's wasting cycles on a function that doesn't really
do what is claimed (and we should remove the documentation claim, too).

Alternatively, maybe we can salvage the function's usefulness by making it
flush WAL before returning?

If we go that route, then we have the opposite problem with respect
to the test script: rather than trying to make it paper over the
function's problems, we ought to try to make it reliably fail with
the code as it stands.

regards, tom lane

#6 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#4)
Re: 011_crash_recovery.pl intermittently fails

At Thu, 04 Mar 2021 23:02:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Peter Geoghegan <pg@bowt.ie> writes:

On Thu, Mar 4, 2021 at 7:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmmm ... what environment is that? This test script hasn't changed
meaningfully in several years, and we have not seen any real issues
with it up to now.

Did you see this recent thread?
/messages/by-id/20210208215206.mqmrnpkaqrdtm7fj@alap3.anarazel.de

Hadn't paid much attention at the time, but yeah, it looks like Andres
tripped over some variant of this.

I'd be kind of inclined to remove this test script altogether, on the
grounds that it's wasting cycles on a function that doesn't really
do what is claimed (and we should remove the documentation claim, too).

Having said that, it's still true that this test has been stable in
the buildfarm. Andres explained what he had to perturb to make it
fail, so I'm still interested in what Horiguchi-san did to break it.

CONFIGURE = '--enable-debug' '--enable-cassert' '--enable-tap-tests' '--enable-depend' '--enable-nls' '--with-icu' '--with-openssl' '--with-libxml' '--with-uuid=e2fs' '--with-tcl' '--with-llvm' '--prefix=/home/horiguti/bin/pgsql_work' 'LLVM_CONFIG=/usr/bin/llvm-config-64' 'CC=/usr/lib64/ccache/gcc' 'CLANG=/usr/lib64/ccache/clang' 'CFLAGS=-O0' '--with-wal-blocksize=16'

The WAL block size might have had an effect. I'll recheck without it.

FWIW: xfs / CentOS 8 / VirtualBox / Windows 10

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#6)
Re: 011_crash_recovery.pl intermittently fails

At Fri, 05 Mar 2021 13:13:04 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 04 Mar 2021 23:02:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Peter Geoghegan <pg@bowt.ie> writes:

On Thu, Mar 4, 2021 at 7:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmmm ... what environment is that? This test script hasn't changed
meaningfully in several years, and we have not seen any real issues
with it up to now.

Did you see this recent thread?
/messages/by-id/20210208215206.mqmrnpkaqrdtm7fj@alap3.anarazel.de

Hadn't paid much attention at the time, but yeah, it looks like Andres
tripped over some variant of this.

I'd be kind of inclined to remove this test script altogether, on the
grounds that it's wasting cycles on a function that doesn't really
do what is claimed (and we should remove the documentation claim, too).

Having said that, it's still true that this test has been stable in
the buildfarm. Andres explained what he had to perturb to make it
fail, so I'm still interested in what Horiguchi-san did to break it.

CONFIGURE = '--enable-debug' '--enable-cassert' '--enable-tap-tests' '--enable-depend' '--enable-nls' '--with-icu' '--with-openssl' '--with-libxml' '--with-uuid=e2fs' '--with-tcl' '--with-llvm' '--prefix=/home/horiguti/bin/pgsql_work' 'LLVM_CONFIG=/usr/bin/llvm-config-64' 'CC=/usr/lib64/ccache/gcc' 'CLANG=/usr/lib64/ccache/clang' 'CFLAGS=-O0' '--with-wal-blocksize=16'

The WAL block size might have had an effect. I'll recheck without it.

OK, I don't see the failure. I guess that the WAL records for the
last transaction cross a block boundary with 8kB WAL blocks, but not
with 16kB blocks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#5)
Re: 011_crash_recovery.pl intermittently fails

On Fri, Mar 5, 2021 at 5:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I'd be kind of inclined to remove this test script altogether, on the
grounds that it's wasting cycles on a function that doesn't really
do what is claimed (and we should remove the documentation claim, too).

Alternatively, maybe we can salvage the function's usefulness by making it
flush WAL before returning?

To make pg_xact_status()'s result reliable, don't you need to make
pg_current_xact_id() flush? In other words, isn't it at the point
that you *observe* the transaction that you have to make sure that
this transaction ID won't be reused after crash recovery? Before
that, it's simultaneously allocated and unallocated, like the cat.

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#8)
Re: 011_crash_recovery.pl intermittently fails

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 5, 2021 at 5:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alternatively, maybe we can salvage the function's usefulness by making it
flush WAL before returning?

To make pg_xact_status()'s result reliable, don't you need to make
pg_current_xact_id() flush? In other words, isn't it at the point
that you *observe* the transaction that you have to make sure that
this transaction ID won't be reused after crash recovery? Before
that, it's simultaneously allocated and unallocated, like the cat.

We need to be sure that the XID is written out to WAL before we
let the client see it, yeah. I've not looked to see exactly
where in the code would be the best place.

BTW, I tried simply removing the "allows_streaming" option from
the test, and it failed ten times out of ten tries for me.
So Andres is right that that makes it pretty reproducible in
a stock build.

regards, tom lane

#10 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: 011_crash_recovery.pl intermittently fails

At Fri, 05 Mar 2021 13:21:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 05 Mar 2021 13:13:04 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 04 Mar 2021 23:02:09 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Having said that, it's still true that this test has been stable in
the buildfarm. Andres explained what he had to perturb to make it
fail, so I'm still interested in what Horiguchi-san did to break it.

CONFIGURE = '--enable-debug' '--enable-cassert' '--enable-tap-tests' '--enable-depend' '--enable-nls' '--with-icu' '--with-openssl' '--with-libxml' '--with-uuid=e2fs' '--with-tcl' '--with-llvm' '--prefix=/home/horiguti/bin/pgsql_work' 'LLVM_CONFIG=/usr/bin/llvm-config-64' 'CC=/usr/lib64/ccache/gcc' 'CLANG=/usr/lib64/ccache/clang' 'CFLAGS=-O0' '--with-wal-blocksize=16'

The WAL block size might have had an effect. I'll recheck without it.

OK, I don't see the failure. I guess that the WAL records for the
last transaction cross a block boundary with 8kB WAL blocks, but not
with 16kB blocks.

In the failure case with 16kB WAL blocks, tx538 ends with a commit
record at 0/01648B98 and nothing follows (other than the records added
after restart).

In the successful case, tx538 ends at the same LSN and is followed by
INSERT@tx539 at 0/1648CE0 up to INSERT_LEAF at 0/0165BFD8-1. So tx539
just fits inside the block (0x1648000-0x164BFFF). That amount of WAL
must cross an 8kB boundary.

Actually, with 8kB blocks, tx538 ends at 0/0164B1A8 and tx539 starts at
0/0164B2A8 and ends at 0/0165E7C8, crossing boundaries at 0/0164C000
and 0/016E000.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#11 Andrey Borodin
amborodin@acm.org
In reply to: Tom Lane (#2)
Re: 011_crash_recovery.pl intermittently fails

5 марта 2021 г., в 08:32, Tom Lane <tgl@sss.pgh.pa.us> написал(а):

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

I noticed that 011_crash_recovery.pl intermittently (that being said,
one out of three or so on my environment) fails in the second test.

Hmmm ... what environment is that? This test script hasn't changed
meaningfully in several years, and we have not seen any real issues
with it up to now.

Maybe it's off-topic here, but anyway...
While working on "lz4 for FPIs" I noticed that this test fails with wal_compression = on.
I did not investigate it at the time, but I think it would be good to run the recovery regression tests with compression as well.

Best regards, Andrey Borodin.

#12 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#9)
Re: 011_crash_recovery.pl intermittently fails

At Thu, 04 Mar 2021 23:40:34 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in

BTW, I tried simply removing the "allows_streaming" option from
the test, and it failed ten times out of ten tries for me.
So Andres is right that that makes it pretty reproducible in
a stock build.

The difference comes from the difference in shared_buffers. In the
"allows_streaming" case, PostgresNode::init() *reduces* it to
'1MB' (128 blocks), which leads to only 8 XLOG buffers, which wrap
around very soon, which leads to XLogWrite.

In the allows_streaming=0 case, the default shared_buffers is
128MB (16384 blocks). The WAL buffers (512) don't wrap around
during the test, and no WAL buffer is written out in that case.
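
The buffer counts follow from the documented wal_buffers default of 1/32 of shared_buffers, clamped between 64kB and one 16MB WAL segment. A quick back-of-the-envelope check, assuming 8kB WAL pages:

```python
# Arithmetic behind the XLOG-buffer counts quoted above, assuming 8kB
# pages and the documented wal_buffers default (shared_buffers/32,
# clamped between 64kB and one 16MB WAL segment).

PAGE = 8 * 1024

def default_wal_buffers(shared_buffers_bytes):
    raw = shared_buffers_bytes // 32
    clamped = min(max(raw, 64 * 1024), 16 * 1024 * 1024)
    return clamped // PAGE          # number of XLOG buffer pages

assert default_wal_buffers(1 * 1024 * 1024) == 8      # allows_streaming: 1MB
assert default_wal_buffers(128 * 1024 * 1024) == 512  # stock default: 128MB
```

So with allows_streaming the 1MB/32 = 32kB figure falls below the 64kB floor, leaving only 8 buffers that are quickly recycled (forcing writes), while the stock 128MB gives 512 buffers that the test never fills.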

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrey Borodin (#11)
Re: 011_crash_recovery.pl intermittently fails

At Fri, 5 Mar 2021 10:07:06 +0500, Andrey Borodin <x4mmm@yandex-team.ru> wrote in

Maybe it's off-topic here, but anyway...
While working on "lz4 for FPIs" I noticed that this test fails with wal_compression = on.
I did not investigate it at the time, but I think it would be good to run the recovery regression tests with compression as well.

The problem records has 15 pages of FPIs. Reducing of its size may
prevent WAL-buffer wrap around and wal writes. If no wal retten the
test fails.

regrds.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#14 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrey Borodin (#11)
Re: 011_crash_recovery.pl intermittently fails

(Sorry for my slippery fingers.)

At Fri, 5 Mar 2021 10:07:06 +0500, Andrey Borodin <x4mmm@yandex-team.ru> wrote in

Maybe it's off-topic here, but anyway...
While working on "lz4 for FPIs" I noticed that this test fails with wal_compression = on.
I did not investigate it at the time, but I think it would be good to run the recovery regression tests with compression as well.

Best regards, Andrey Borodin.

The problem records have 15 pages of FPIs. Reducing their size may
prevent the WAL buffers from wrapping around, and thus prevent WAL
writes. If no WAL is written, the test fails.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15 Andrey Borodin
amborodin@acm.org
In reply to: Kyotaro Horiguchi (#14)
Re: 011_crash_recovery.pl intermittently fails

5 марта 2021 г., в 13:00, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а):

The problem records have 15 pages of FPIs. Reducing their size may
prevent the WAL buffers from wrapping around, and thus prevent WAL
writes. If no WAL is written, the test fails.

Thanks, I've finally understood the root cause.
So the test verifies a guarantee that is not provided (durability of aborted transactions)?
Maybe flip it to test that the transaction's effects are not committed/visible?

Best regards, Andrey Borodin.

#16 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andrey Borodin (#15)
Re: 011_crash_recovery.pl intermittently fails

At Fri, 5 Mar 2021 13:20:53 +0500, Andrey Borodin <x4mmm@yandex-team.ru> wrote in

5 марта 2021 г., в 13:00, Kyotaro Horiguchi <horikyota.ntt@gmail.com> написал(а):

The problem records have 15 pages of FPIs. Reducing their size may
prevent the WAL buffers from wrapping around, and thus prevent WAL
writes. If no WAL is written, the test fails.

Thanks, I've finally understood the root cause.
So the test verifies a guarantee that is not provided (durability of aborted transactions)?

I think that's right.

Maybe flip it to test that the transaction's effects are not committed/visible?

Maybe not. The objective of the test, I think, is to check whether a
transaction that might have committed at the time of the crash ends up
committed or aborted, without directly examining the resulting data.
And that feature has been found not to work as expected.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#9)
Re: 011_crash_recovery.pl intermittently fails

On Fri, Mar 5, 2021 at 5:40 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

On Fri, Mar 5, 2021 at 5:10 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alternatively, maybe we can salvage the function's usefulness by making it
flush WAL before returning?

To make pg_xact_status()'s result reliable, don't you need to make
pg_current_xact_id() flush? In other words, isn't it at the point
that you *observe* the transaction that you have to make sure that
this transaction ID won't be reused after crash recovery? Before
that, it's simultaneously allocated and unallocated, like the cat.

We need to be sure that the XID is written out to WAL before we
let the client see it, yeah. I've not looked to see exactly
where in the code would be the best place.

One idea would be for TransactionStateData's bool member didLogXid to
become an LSN, initially invalid, that points one past the first
record logged for this transaction, maintained by
MarkCurrentTransactionIdLoggedIfAny(). Then, pg_current_xact_id()
(and any similar xid-reporting function that we deem to be an
officially sanctioned way for a client to "observe" xids) could check
if it's valid yet; if not, it could go ahead and log something
containing the xid to make it valid. Then it could flush the log up
to that LSN.
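
The proposed state machine can be sketched as a toy model in plain Python (invented names; the real change would of course be C inside PostgreSQL): track the LSN one past the transaction's first logged record instead of a boolean, and have the xid-reporting function log something itself if nothing has been logged yet, then flush up to that LSN before returning.

```python
# Toy sketch of the proposal (invented names, not PostgreSQL code).

INVALID_LSN = 0

class XactState:
    def __init__(self, xid):
        self.xid = xid
        # LSN one past the first record logged for this transaction,
        # replacing the didLogXid boolean; invalid until something
        # is logged.
        self.first_logged_end_lsn = INVALID_LSN

class ToyServer:
    def __init__(self):
        self.insert_lsn = 1000    # pretend current WAL insert position
        self.flushed_lsn = 1000

    def log_record(self, xact):
        self.insert_lsn += 100    # pretend the record is 100 bytes
        if xact.first_logged_end_lsn == INVALID_LSN:
            xact.first_logged_end_lsn = self.insert_lsn
        return self.insert_lsn

    def flush_up_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

    def current_xact_id(self, xact):
        # If nothing containing the xid has been logged yet, log
        # something now...
        if xact.first_logged_end_lsn == INVALID_LSN:
            self.log_record(xact)
        # ...then flush up to that LSN, so the client never observes
        # an xid that could be reused after crash recovery.
        self.flush_up_to(xact.first_logged_end_lsn)
        return xact.xid

srv = ToyServer()
xact = XactState(539)
assert srv.current_xact_id(xact) == 539
assert srv.flushed_lsn >= xact.first_logged_end_lsn   # durable before observed
```

The key property is that the flush happens at observation time, matching the point Thomas makes above: the xid is only "real" to the client once it cannot be reused after crash recovery.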

#18 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: 011_crash_recovery.pl intermittently fails

At Fri, 05 Mar 2021 16:51:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

The difference comes from the difference in shared_buffers. In the
"allows_streaming" case, PostgresNode::init() *reduces* it to
'1MB' (128 blocks), which leads to only 8 XLOG buffers, which wrap
around very soon, which leads to XLogWrite.

In the allows_streaming=0 case, the default shared_buffers is
128MB (16384 blocks). The WAL buffers (512) don't wrap around
during the test, and no WAL buffer is written out in that case.

So I think we need to remove the shared_buffers setting for the
allows_streaming case in PostgresNode.pm.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kyotaro Horiguchi (#18)
Re: 011_crash_recovery.pl intermittently fails

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:

So I think we need to remove the shared_buffers setting for the
allows_streaming case in PostgresNode.pm.

That would have uncertain effects on other TAP tests, so I'm disinclined
to do it that way. 011_crash_recovery.pl doesn't actually use a standby
server, so just removing its use of the allows_streaming option seems
sufficient.

But, of course, first we need a fix for the bug we now know exists.
Was anyone volunteering to make the patch?

regards, tom lane

#20 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#19)
Re: 011_crash_recovery.pl intermittently fails

On Fri, Mar 05, 2021 at 11:16:55AM -0500, Tom Lane wrote:

That would have uncertain effects on other TAP tests, so I'm disinclined
to do it that way.

+1.  There may be out-of-core tests that rely on this value as the
default.
--
Michael
#21 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#20)

#22 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#19)

#23 Thomas Munro
thomas.munro@gmail.com
In reply to: Kyotaro Horiguchi (#22)

#24 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#23)

#25 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#24)

#26 Thomas Munro
thomas.munro@gmail.com
In reply to: Kyotaro Horiguchi (#25)

#27 Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#26)

#28 Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#27)

#29 Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#28)

#30 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#29)

#31 Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#30)

#32 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#31)