XLogReadRecord() error in XlogReadTwoPhaseData()
On Sun, Oct 24, 2021 at 04:35:02PM -0700, Noah Misch wrote:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-10-24%2012%3A01%3A10
got an interesting v9.6 failure [...]:
2021-10-24 14:25:29.263 CEST [34569:175] pgbench ERROR: could not read two-phase state from xlog at 0/158F4E0
2021-10-24 14:25:29.263 CEST [34569:176] pgbench STATEMENT: COMMIT PREPARED 'c1';
As a first step, let's report the actual XLogReadRecord() error message.
Attached. All the other sites that expect no error already do this.
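Roughly, the change is a sketch of appending XLogReadRecord()'s errormsg to the existing ereport() in XlogReadTwoPhaseData(); see the attached patch for the exact code, since the shape below is approximate:

    record = XLogReadRecord(xlogreader, &errormsg);
    if (record == NULL)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read two-phase state from WAL at %X/%X: %s",
                        (uint32) (lsn >> 32), (uint32) lsn,
                        errormsg ? errormsg : "(no error message)")));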
Attachments:
could-not-read-two-phase-state-v1.patch (text/plain, +12 -4)
On Sat, Nov 06, 2021 at 06:31:57PM -0700, Noah Misch wrote:
As a first step, let's report the actual XLogReadRecord() error message.
Attached.
Good catch! This looks good.
All the other sites that expect no error already do this.
Indeed. Looking closer, I think that we'd better improve
DecodingContextFindStartpoint(),
pg_logical_replication_slot_advance(), XLogSendLogical() as well as
pg_logical_slot_get_changes_guts() to follow a format closer to what
you have in your patch, with an error message that describes the
context where the problem has been found, instead of just elog()'ing
what XLogReadRecord() returns.
--
Michael
On 7 Nov 2021, at 06:31, Noah Misch <noah@leadboat.com> wrote:
As a first step, let's report the actual XLogReadRecord() error message.
Attached. All the other sites that expect no error already do this.
BTW, some time ago I spotted a good number of related unreported errors [0].
[0]: /messages/by-id/A0A36AEE-3476-4326-B877-EE2B55BAEEED@yandex-team.ru
On Mon, Nov 08, 2021 at 01:42:46PM +0900, Michael Paquier wrote:
Indeed. Looking closer, I think that we'd better improve
DecodingContextFindStartpoint(),
pg_logical_replication_slot_advance(), XLogSendLogical() as well as
pg_logical_slot_get_changes_guts() to follow a format closer to what
you have in your patch, with an error message that describes the
context where the problem has been found, instead of just elog()'ing
what XLogReadRecord() returns.
FYI, I have just begun a new thread about those ones:
/messages/by-id/YYnTH6OyOwQcAdkw@paquier.xyz
--
Michael
On Mon, Nov 08, 2021 at 01:42:46PM +0900, Michael Paquier wrote:
On Sat, Nov 06, 2021 at 06:31:57PM -0700, Noah Misch wrote:
On Sun, Oct 24, 2021 at 04:35:02PM -0700, Noah Misch wrote:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-10-24%2012%3A01%3A10
got an interesting v9.6 failure [...]:
2021-10-24 14:25:29.263 CEST [34569:175] pgbench ERROR: could not read two-phase state from xlog at 0/158F4E0
2021-10-24 14:25:29.263 CEST [34569:176] pgbench STATEMENT: COMMIT PREPARED 'c1';
Tom Lane reported another instance today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-11%2013%3A29%3A58
Each of the three failures happened on a sparc64 Debian+gcc machine. I had
tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
without reproducing this.
As a first step, let's report the actual XLogReadRecord() error message.
Attached.
Good catch! This looks good.
Pushed.
Noah Misch <noah@leadboat.com> writes:
Tom Lane reported another instance today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-11%2013%3A29%3A58
Each of the three failures happened on a sparc64 Debian+gcc machine. I had
tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
without reproducing this.
As a first step, let's report the actual XLogReadRecord() error message.
Attached.
Good catch! This looks good.
Pushed.
Well, we didn't have to wait too long [1]:
# at t/003_cic_2pc.pl line 143.
# 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
# pgbench: error: client 2 script 3 aborted in command 2 query 0: ERROR: canceling statement due to lock timeout
# pgbench: fatal: Run was aborted; the above results are incomplete.
I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
written out yet. Have we got any synchronization around that?
regards, tom lane
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24
On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
Each of the three failures happened on a sparc64 Debian+gcc machine. I had
tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
without reproducing this.
# 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24
Two others:
ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15
ERROR: could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at 0/1668020
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52
I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
written out yet. Have we got any synchronization around that?
If the WAL address isn't on disk, that error doesn't happen. Instead,
read_local_xlog_page() blocks waiting for the WAL to become available. It's
still possible that we make the WAL region exist, but it somehow doesn't
contain the right data until shortly later. FinishPreparedTransaction() takes
TwoPhaseStateLock and looks for an entry having gxact->valid. EndPrepare()
fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid
under TwoPhaseStateLock. All freelist (freeGXacts) interaction holds
TwoPhaseStateLock. I'm not seeing a gap in that synchronization.
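To spell out the reasoning, here is a tiny standalone model of that handshake (the names are stand-ins, not the real twophase.c code): the prepare side fills in the LSN before marking the entry valid under the lock, and the reader only ever considers valid entries under the same lock, so it cannot observe an unset LSN.

#include <pthread.h>
#include <stdio.h>

/* stand-ins for TwoPhaseStateLock and one GlobalTransaction entry */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct { unsigned long prepare_end_lsn; int valid; } gxact;

static void *
end_prepare(void *arg)
{
    (void) arg;
    gxact.prepare_end_lsn = 0x158F4E0;  /* EndPrepare(): fill the LSN first ... */
    pthread_mutex_lock(&lock);
    gxact.valid = 1;                    /* ... then MarkAsPrepared() under the lock */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int
main(void)
{
    pthread_t   t;
    int         done = 0;

    pthread_create(&t, NULL, end_prepare, NULL);
    while (!done)
    {
        /* FinishPreparedTransaction(): only valid entries are candidates */
        pthread_mutex_lock(&lock);
        if (gxact.valid)
        {
            printf("would read WAL at %lX\n", gxact.prepare_end_lsn);
            done = 1;
        }
        pthread_mutex_unlock(&lock);
    }
    pthread_join(t, NULL);
    return 0;
}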
I don't have a great theory, but here are candidates to examine next:
- Run with wal_debug=on to confirm logged write location matches read location.
- Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
pg_waldump -s 0/01000000" at the end of the test.
- Dump WAL page binary image at the point of failure.
- Log which branches in XLogReadRecord() are taken.
What else might help?
On 18 Nov 2021, at 12:05, Noah Misch <noah@leadboat.com> wrote:
What else might help?
Let's add more tests that check survival of 2PC through crash recovery. We currently do only one restart; maybe it's worth doing 4 or 8?
Best regards, Andrey Borodin.
Andrey Borodin <x4mmm@yandex-team.ru> writes:
Let's add more tests that check survival of 2PC through crash recovery. We currently do only one restart; maybe it's worth doing 4 or 8?
That seems a little premature when we can't explain the failure
we have. Also, buildfarm cycles aren't free.
regards, tom lane
On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote:
On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
Each of the three failures happened on a sparc64 Debian+gcc machine. I had
tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
without reproducing this.
# 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24
Two others:
ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15
ERROR: could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at 0/1668020
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52
I suppose "unexpected pageaddr 0/0" is most easily explained by supposing
that XlogReadTwoPhaseData tried to read a WAL page that hadn't been
written out yet. Have we got any synchronization around that?
If the WAL address isn't on disk, that error doesn't happen. Instead,
read_local_xlog_page() blocks waiting for the WAL to become available. It's
still possible that we make the WAL region exist, but it somehow doesn't
contain the right data until shortly later. FinishPreparedTransaction() takes
TwoPhaseStateLock and looks for an entry having gxact->valid. EndPrepare()
fills gxact->prepare_end_lsn, then calls MarkAsPrepared() to set gxact->valid
under TwoPhaseStateLock. All freelist (freeGXacts) interaction holds
TwoPhaseStateLock. I'm not seeing a gap in that synchronization.
I don't have a great theory, but here are candidates to examine next:
- Run with wal_debug=on to confirm logged write location matches read location.
- Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
pg_waldump -s 0/01000000" at the end of the test.
- Dump WAL page binary image at the point of failure.
- Log which branches in XLogReadRecord() are taken.
Tom Turelinckx, are you able to provide remote access to kittiwake or
tadarida? I'd use it to attempt the above things. All else being equal,
kittiwake is more relevant since it's still supported upstream.
The setup of your buildfarm animals provides a clue. I understand kittiwake
is the same as ibisbill except for build options, and tadarida is the same as
mussurana except for build options. ibisbill and mussurana haven't failed, so
one of these is likely needed to reproduce:
absence of --enable-cassert
CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security '
CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2'
LDFLAGS='-Wl,-z,relro -Wl,-z,now'
However, I can't reproduce this on thorntail, even if I use tadarida's
./configure options, CFLAGS, CPPFLAGS, and LDFLAGS.
Noah Misch <noah@leadboat.com> writes:
Tom Turelinckx, are you able to provide remote access to kittiwake or
tadarida? I'd use it to attempt the above things. All else being equal,
kittiwake is more relevant since it's still supported upstream.
snapper just exhibited the same failure, too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2021-11-18%2016%3A09%3A49
regards, tom lane
I wrote:
snapper just exhibited the same failure, too:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snapper&dt=2021-11-18%2016%3A09%3A49
I grepped the buildfarm logs for all recent (last 3 months) occurrences of
'could not read two-phase state'. Here's the results:
sysname | branch | snapshot | stage | l
-----------+---------------+---------------------+-------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
kittiwake | REL9_6_STABLE | 2021-10-24 12:01:10 | pgbenchCheck | # 'client 1 aborted in state 3: ERROR: could not read two-phase state from xlog at 0/158F4E0
kittiwake | REL_13_STABLE | 2021-10-26 12:51:11 | ContribCheck-en_US.utf8 | # 'pgbench: error: client 3 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/168C8D8
kittiwake | REL_14_STABLE | 2021-11-08 15:42:35 | ContribCheck-en_US.utf8 | # 'pgbench: error: client 0 script 0 aborted in command 3 query 0: ERROR: could not read two-phase state from WAL at 0/17ABF48
kittiwake | REL_13_STABLE | 2021-11-16 15:00:52 | ContribCheck-en_US.utf8 | # 'pgbench: error: client 3 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at 0/1668020
snapper | REL_14_STABLE | 2021-11-18 16:09:49 | contrib-amcheckCheck | # 'pgbench: error: client 3 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/1770328: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 7798784
tadarida | REL_11_STABLE | 2021-11-11 13:29:58 | pgbenchCheck | # 'client 3 aborted in command 3 (SQL) of script 0; ERROR: could not read two-phase state from WAL at 0/1716C68
tadarida | REL_10_STABLE | 2021-11-12 13:01:15 | pgbenchCheck | # 'client 4 aborted in command 3 of script 0; ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0
tadarida | HEAD | 2021-11-17 13:01:24 | contrib-amcheckCheck | # 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
So not all are exactly 'unexpected pageaddr 0/0', but they do all
look like we read garbage data.
regards, tom lane
On Fri, Nov 19, 2021 at 09:18:23PM -0800, Noah Misch wrote:
On Wed, Nov 17, 2021 at 11:05:06PM -0800, Noah Misch wrote:
On Wed, Nov 17, 2021 at 05:47:10PM -0500, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
Each of the three failures happened on a sparc64 Debian+gcc machine. I had
tried ~8000 iterations on thorntail, another sparc64 Debian+gcc animal,
without reproducing this.
# 'pgbench: error: client 0 script 1 aborted in command 4 query 0: ERROR: could not read two-phase state from WAL at 0/159EF88: unexpected pageaddr 0/0 in log segment 000000010000000000000001, offset 5890048
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-17%2013%3A01%3A24
Two others:
ERROR: could not read two-phase state from WAL at 0/16F1850: invalid record length at 0/16F1850: wanted 24, got 0
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tadarida&dt=2021-11-12%2013%3A01%3A15
ERROR: could not read two-phase state from WAL at 0/1668020: incorrect resource manager data checksum in record at 0/1668020
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=kittiwake&dt=2021-11-16%2015%3A00%3A52
I don't have a great theory, but here are candidates to examine next:
- Run with wal_debug=on to confirm logged write location matches read location.
- Run "PGDATA=contrib/amcheck/tmp_check/t_003_cic_2pc_CIC_2PC_test_data/pgdata
pg_waldump -s 0/01000000" at the end of the test.
- Dump WAL page binary image at the point of failure.
- Log which branches in XLogReadRecord() are taken.
Tom Turelinckx, are you able to provide remote access to kittiwake or
tadarida? I'd use it to attempt the above things. All else being equal,
kittiwake is more relevant since it's still supported upstream.
Thanks for setting up access. Summary: this kernel has a bug in I/O syscalls.
How practical is it to update that kernel? (Userland differs across these
animals, but the kernel does not.) The kernel on buildfarm member thorntail
doesn't exhibit the bug.
For specifics of the kernel bug, see the attached test program. In brief, the
bug arises if one process is write()ing or pwrite()ing a file at about the
same time that another process is read()ing or pread()ing the same. POSIX
says the reader should see the data as it existed before the write or the
newly-written data. On this kernel, the reader can see zeros instead. That
leads to the $SUBJECT failure. PostgreSQL processes write out a given WAL
block 20-30 times in ~10ms, and COMMIT PREPARED reads that block. The writers
aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
the kernel bug yield the failure. We could opt to work around that by writing
only the not-already-written portion of a WAL block, but I doubt that's
worthwhile unless it happens to be a performance win anyway.
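To make the failure mode concrete, here is a stripped-down sketch of the scenario (the attached io-rectitude.c is the actual test program; this only shows its shape): one process keeps rewriting bytes that never change while a second process reads them back, and on the affected kernel the reader can intermittently observe zeros.

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define COUNT 256               /* bytes per write/read */

int
main(void)
{
    char    buf[COUNT];
    int     fd = open("zeros-test", O_RDWR | O_CREAT | O_TRUNC, 0600);
    pid_t   writer;
    long    i;

    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 'x', COUNT);
    if (pwrite(fd, buf, COUNT, 0) != COUNT) { perror("pwrite"); return 1; }

    writer = fork();
    if (writer == 0)
    {
        /* writer: rewrite the same, unchanged bytes over and over */
        for (;;)
            if (pwrite(fd, buf, COUNT, 0) != COUNT)
                _exit(1);
    }

    /* reader: the file contents never change, so a zero byte means the
     * kernel exposed transient state of an overlapping, in-progress write */
    for (i = 0; i < 10 * 1000 * 1000; i++)
    {
        char    rbuf[COUNT];

        if (pread(fd, rbuf, COUNT, 0) != COUNT) { perror("pread"); break; }
        if (memchr(rbuf, 0, COUNT) != NULL)
        {
            fprintf(stderr, "iteration %ld: read transient zeros\n", i);
            break;
        }
    }

    kill(writer, SIGKILL);
    waitpid(writer, NULL, 0);
    return 0;
}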
Separately, while I don't know of relevance to PostgreSQL, I was interested to
see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.
The setup of your buildfarm animals provides a clue. I understand kittiwake
is the same as ibisbill except for build options, and tadarida is the same as
mussurana except for build options. ibisbill and mussurana haven't failed, so
one of these is likely needed to reproduce:
absence of --enable-cassert
CFLAGS='-g -O2 -fstack-protector -Wformat -Werror=format-security '
CPPFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2'
LDFLAGS='-Wl,-z,relro -Wl,-z,now'
That was a red herring. ibisbill and mussurana don't use --with-tap-tests.
Adding --with-tap-tests has been enough to make their configurations witness
the same kinds of failures.
nm
Attachments:
io-rectitude.c (text/plain)
On Sun, Jan 16, 2022 at 8:12 PM Noah Misch <noah@leadboat.com> wrote:
For specifics of the kernel bug, see the attached test program. In brief, the
bug arises if one process is write()ing or pwrite()ing a file at about the
same time that another process is read()ing or pread()ing the same. POSIX
says the reader should see the data as it existed before the write or the
newly-written data. On this kernel, the reader can see zeros instead. That
leads to the $SUBJECT failure. PostgreSQL processes write out a given WAL
block 20-30 times in ~10ms, and COMMIT PREPARED reads that block. The writers
aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
the kernel bug yield the failure. We could opt to work around that by writing
only the not-already-written portion of a WAL block, but I doubt that's
worthwhile unless it happens to be a performance win anyway.
Separately, while I don't know of relevance to PostgreSQL, I was interested to
see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.
FWIW there was some related discussion over here:
Cancel that kernel upgrade idea. I no longer expect it to help...
On Sun, Jan 16, 2022 at 10:19:30PM +1300, Thomas Munro wrote:
On Sun, Jan 16, 2022 at 8:12 PM Noah Misch <noah@leadboat.com> wrote:
For specifics of the kernel bug, see the attached test program. In brief, the
bug arises if one process is write()ing or pwrite()ing a file at about the
same time that another process is read()ing or pread()ing the same. POSIX
says the reader should see the data as it existed before the write or the
newly-written data. On this kernel, the reader can see zeros instead. That
leads to the $SUBJECT failure. PostgreSQL processes write out a given WAL
block 20-30 times in ~10ms, and COMMIT PREPARED reads that block. The writers
aren't changing the bytes of interest to COMMIT PREPARED, but the zeros from
the kernel bug yield the failure.
The difference between kittiwake and thorntail comes from thorntail using xfs
and kittiwake using ext4. Running the io-rectitude.c tests on an ext4
partition on thorntail, I see the zeros bug just like I do on kittiwake. I
don't see the zeros bug on ppc64 or x86_64, just sparc64 so far:
* ext4, Linux 3.10.0-1160.49.1.el7.x86_64 (CentOS 7.9.2009):
* pwrite/pread is non-atomic if count>16 (no -D switches)
* write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
* pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
* write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
*
* ext4, Linux 4.9.0-13-sparc64-smp (Debian):
* pwrite/pread is non-atomic if count>4 (no -D switches)
* write/read is non-atomic if count>4 (-DUSE_SEEK)
* write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
* pwrite/pread has zeros bug for count>127 (-DCHANGE_CONTENT=0)
* pwrite/pread w/ O_SYNC has zeros bug (-DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC)
* far less frequent w/ O_SYNC, but it still happens
* pwrite/pread w/o REOPEN also has zeros bug for count>127 (-DCHANGE_CONTENT=0 -DREOPEN=0)
* write/read has zeros bug for count>127 (-DUSE_SEEK -DCHANGE_CONTENT=0)
* write/read w/ O_SYNC has zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 -DOPEN_FLAGS=O_SYNC)
* write/read w/o REOPEN is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0 -DREOPEN=0)
*
* ext4, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
* [behaviors match the previous kernel exactly]
*
* ext4, Linux 5.15.0-2-powerpc64 (Debian bookworm/sid):
* [atomicity matches previous kernel, but zeros bug does not]
* pwrite/pread is non-atomic if count>4 (no -D switches)
* write/read is non-atomic if count>4 (-DUSE_SEEK)
* write/read IS atomic w/o REOPEN (-DUSE_SEEK -DREOPEN=0 -DXLOG_BLCKSZ=8192000)
* pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
* write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
*
* ext4, Linux 5.15.5-0-virt x86_64 (Alpine):
* [behaviors match the previous kernel exactly]
*
* xfs, Linux 5.15.0-2-sparc64-smp (Debian bookworm/sid):
* pwrite/pread is atomic (-DXLOG_BLCKSZ=8192000)
* write/read is atomic (-DUSE_SEEK -DXLOG_BLCKSZ=8192000)
* pwrite/pread is free from zeros bug (-DCHANGE_CONTENT=0)
* write/read is free from zeros bug (-DUSE_SEEK -DCHANGE_CONTENT=0)
We could opt to work around that by writing
only the not-already-written portion of a WAL block, but I doubt that's
worthwhile unless it happens to be a performance win anyway.
My next steps:
- Report a Debian bug for the sparc64+ext4 zeros problem.
- Try to falsify the idea that "write only the not-already-written portion of
a WAL block" is an effective workaround. Specifically, modify the test
program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k]
while the reader process reads offset N. If the reader sees a zero, that
workaround is ineffective.
- Implement the workaround, if I didn't falsify its effectiveness. If it
doesn't hurt performance on x86_64, we can use it unconditionally.
Otherwise, limit its use to sparc64 Linux.
Separately, while I don't know of relevance to PostgreSQL, I was interested to
see that CentOS 7 pwrite()/pread() fail to have the POSIX-required atomicity.
FWIW there was some related discussion over here:
That gave me the idea to test different filesystems. Thanks. Incidentally, I
find https://utcc.utoronto.ca/~cks/space/blog/unix/WriteNotVeryAtomic is
mistaken about POSIX requirements. There's no precedent for POSIX writing
"two threads" when it means "two threads of the same process". Moreover, the
part about "shall also apply whenever a file descriptor is successfully
closed, however caused (for example [...] process termination)" would be
superfluous in a requirement specific to threads of one process. Having said
that, if the most-prominent POSIX regular file implementation (ext4 on x86_64)
doesn't implement a POSIX requirement, that has the same practical
consequences for PostgreSQL as POSIX not requiring it.
I now see newer Linux ext4 has drifted further away from POSIX atomicity,
compared to CentOS 7. In CentOS 7 ext4, plain write()/read() was still
atomic. By Linux 5.15.5, those abandoned atomicity.
On Mon, Jan 17, 2022 at 10:02 AM Noah Misch <noah@leadboat.com> wrote:
- Report a Debian bug for the sparc64+ext4 zeros problem.
I suspect that 027_stream_regress.pl hits this kernel bug with high
probability [1]. I wonder if the owner of kittiwake and tadarida would
consider setting up an xfs file system? Or alternatively, since ext4
didn't support concurrent writes until recently, I wonder if there is
an option somewhere to turn the new concurrency stuff off, or failing
that, if we could temporarily downgrade the kernel to an older version
that does inode-level read/write locking.
[1]: /messages/by-id/CA+hUKG+euZ=dc27ZB=s74x0q=zU=2=vs8+6TkJoTUiCPUd2dQA@mail.gmail.com
On Fri, Jan 21, 2022 at 08:34:22AM +1300, Thomas Munro wrote:
On Mon, Jan 17, 2022 at 10:02 AM Noah Misch <noah@leadboat.com> wrote:
- Report a Debian bug for the sparc64+ext4 zeros problem.
I suspect that 027_stream_regress.pl hits this kernel bug with high
probability[1]. I wonder if the owner of kittiwake and tadarida would
consider setting up an xfs file system? Or alternatively, since ext4
didn't support concurrent writes until recently, I wonder if there is
an option somewhere to turn the new concurrency stuff off, or failing
that, if we could temporarily downgrade the kernel to an older version
that does inode-level read/write locking.
If the write-only-new-bytes approach works, I think we'd want to revert those
changes. Perhaps a cheaper stopgap is to make the affected tests skip on
sparc Linux. Is that worth doing? (Could even limit the skip to ext4,
e.g. by testing "df -x ext4 . >/dev/null".)
[1] /messages/by-id/CA+hUKG+euZ=dc27ZB=s74x0q=zU=2=vs8+6TkJoTUiCPUd2dQA@mail.gmail.com
On Sun, Jan 16, 2022 at 01:02:41PM -0800, Noah Misch wrote:
My next steps:
- Report a Debian bug for the sparc64+ext4 zeros problem.
(Not done yet.)
- Try to falsify the idea that "write only the not-already-written portion of
a WAL block" is an effective workaround. Specifically, modify the test
program to have the writer process mutate offsets [N-k,N-1] and [N+1,N+k]
while the reader process reads offset N. If the reader sees a zero, that
workaround is ineffective.
The reader did not see a zero. In addition to bytes outside the write being
immune to the zeros bug, the first and last forty bytes of a write were immune
to the zeros bug.
- Implement the workaround, if I didn't falsify its effectiveness. If it
doesn't hurt performance on x86_64, we can use it unconditionally.
Otherwise, limit its use to sparc64 Linux.
Attached. With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl. Without
the patch, it failed many times, always within 1.3hr. For easier review, this
patch uses the new behavior on all platforms. Before commit and back-patch, I
plan to limit use of the new behavior to sparc Linux. Future work can
benchmark the new behavior and, if it performs well, make it unconditional in
v15+. I would expect performance to be unchanged or slightly better, because
the new behavior requests less futile work from the OS.
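For readers who don't open the patch, the idea looks roughly like this standalone sketch (not the patch itself; the real change lives in the WAL write path): track how much of the current block is already on disk and write only the new tail.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

int
main(void)
{
    char    block[BLCKSZ] = {0};    /* in-memory copy of one WAL block */
    size_t  written_upto = 0;       /* how much of it is already on disk */
    int     fd = open("wal-block-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    int     step;

    if (fd < 0) { perror("open"); return 1; }

    /* simulate the block's tail growing as records are appended */
    for (step = 1; step <= 4; step++)
    {
        size_t  filled_upto = step * (BLCKSZ / 4);
        size_t  nbytes = filled_upto - written_upto;

        memset(block + written_upto, 'a' + step, nbytes);   /* "new records" */

        /* the old behavior rewrites [0, filled_upto); this writes only the
         * tail, so already-written bytes are never overwritten in place */
        if (pwrite(fd, block + written_upto, nbytes, written_upto) != (ssize_t) nbytes)
        {
            perror("pwrite");
            return 1;
        }
        written_upto = filled_upto;
    }
    close(fd);
    return 0;
}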
Attachments:
xlog-write-once-v1.patch (text/plain, +33 -10)
On Sun, Jan 23, 2022 at 7:52 AM Noah Misch <noah@leadboat.com> wrote:
Attached. With this, kittiwake has survived 8.5hr of 003_cic_2pc.pl. Without
the patch, it failed many times, always within 1.3hr. For easier review, this
patch uses the new behavior on all platforms. Before commit and back-patch, I
plan to limit use of the new behavior to sparc Linux. Future work can
benchmark the new behavior and, if it performs well, make it unconditional in
v15+. I would expect performance to be unchanged or slightly better, because
the new behavior requests less futile work from the OS.
One detail is that wal_sync_method=open_datasync, max_wal_senders=0,
wal_level=minimal will panic, because O_DIRECT requires fs
page-aligned access (and fails in various other ways on other OSes, eg
expensive read-before-write every time). That's an ultra-niche
concern likely affecting nobody, especially when multiplied by the
odds that anyone is using that stack at all (considering that
streaming rep has apparently been borked for years on linux/sparc/ext4
and nobody told us).
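That constraint is easy to see with a standalone toy like the one below (exact alignment requirements vary by filesystem and device, so the failing size here is only typical):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("direct-demo", O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0600);
    void   *buf;

    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }   /* e.g. tmpfs refuses O_DIRECT */
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 'x', 4096);

    /* aligned buffer, offset, and length: normally succeeds */
    if (pwrite(fd, buf, 4096, 0) != 4096)
        perror("aligned pwrite");

    /* unaligned length: normally fails with EINVAL under O_DIRECT, which is
     * why writing an arbitrary sub-range of a block is not an option there */
    if (pwrite(fd, buf, 100, 0) < 0)
        printf("unaligned pwrite: %s\n", strerror(errno));

    free(buf);
    close(fd);
    return 0;
}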
I was +1 for the control file locking change in that other thread (I
view the atomicity stuff as Linux-realpolitik-vs-POSIX, not to mention
that we run Windows too, which requires separate analysis). I'm less
sure it makes sense to do anything to support the presumed bogus
zeroes bug for (probably) no real users, especially before we've even
reported it and heard some analysis, for example acceptance that it's
broken and confirmation that this really is just a sparc problem.
Hi,
On 2022-01-24 09:42:13 +1300, Thomas Munro wrote:
On Sun, Jan 23, 2022 at 7:52 AM Noah Misch <noah@leadboat.com> wrote:
Future work can benchmark the new behavior and, if it performs well, make
it unconditional in v15+. I would expect performance to be unchanged or
slightly better, because the new behavior requests less futile work from
the OS.
I doubt it'll be generally applicable. Turning a write operation into a
read-write isn't free. Yes, often enough it's likely that the prior page is
still in cache, but I don't think we can rely on that in general.
It also just fundamentally locks us into never using O_DIRECT in anger. I
don't think that's a good direction.
One detail is that wal_sync_method=open_datasync, max_wal_senders=0,
wal_level=minimal will panic, because O_DIRECT requires fs
page-aligned access (and fails in various other ways on other OSes, eg
expensive read-before-write every time). That's an ultra-niche
concern likely affecting nobody, especially when multiplied by the
odds that anyone is using that stack at all (considering that
streaming rep has apparently been borked for years on linux/sparc/ext4
and nobody told us).
Seems like the patch should at least make this error out?
Greetings,
Andres Freund