Failures with wal_consistency_checking and 13~

Started by Michael Paquierover 5 years ago7 messages
#1Michael Paquier
michael@paquier.xyz

Hi all,

I have begun my annual study of WAL consistency across replays, and
wal_consistency_checking = 'all' is pointing out at some issues with
at least VACUUM and SPGist:
FATAL: inconsistent page found, rel 1663/16385/22133, forknum 0,
blkno 15
CONTEXT: WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
XID 4619

It may be possible that there are other failures, I have just run
installcheck and this is the first failure I saw after replaying all
the generated WAL on a standby. Please note that I have also checked
12, and installcheck passes.

Thanks,
--
Michael

#2Ranier Vilela
ranier.vf@gmail.com
In reply to: Michael Paquier (#1)
Re: Failures with wal_consistency_checking and 13~

Em seg., 15 de jun. de 2020 às 10:14, Michael Paquier <michael@paquier.xyz>
escreveu:

Hi all,

I have begun my annual study of WAL consistency across replays, and
wal_consistency_checking = 'all' is pointing out at some issues with
at least VACUUM and SPGist:
FATAL: inconsistent page found, rel 1663/16385/22133, forknum 0,
blkno 15
CONTEXT: WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
XID 4619

It may be possible that there are other failures, I have just run
installcheck and this is the first failure I saw after replaying all
the generated WAL on a standby. Please note that I have also checked
12, and installcheck passes.

With Postgres 13, Windows 10 (home), msvc 2019 64 bits,
Shutting down, without interrupting the database, consistently, this log
has appeared.

2020-06-15 21:40:35.998 -03 [3380] LOG: database system shutdown was
interrupted; last known up at 2020-06-15 21:36:01 -03
2020-06-15 21:40:37.978 -03 [3380] LOG: database system was not properly
shut down; automatic recovery in progress
2020-06-15 21:40:37.992 -03 [3380] LOG: redo starts at 0/8A809C78
2020-06-15 21:40:37.992 -03 [3380] LOG: invalid record length at
0/8A809CB0: wanted 24, got 0
2020-06-15 21:40:37.992 -03 [3380] LOG: redo done at 0/8A809C78

regards,
Ranier Vilela

#3Michael Paquier
michael@paquier.xyz
In reply to: Ranier Vilela (#2)
Re: Failures with wal_consistency_checking and 13~

On Mon, Jun 15, 2020 at 11:33:42PM -0300, Ranier Vilela wrote:

With Postgres 13, Windows 10 (home), msvc 2019 64 bits,
Shutting down, without interrupting the database, consistently, this log
has appeared.

2020-06-15 21:40:35.998 -03 [3380] LOG: database system shutdown was
interrupted; last known up at 2020-06-15 21:36:01 -03
2020-06-15 21:40:37.978 -03 [3380] LOG: database system was not properly
shut down; automatic recovery in progress
2020-06-15 21:40:37.992 -03 [3380] LOG: redo starts at 0/8A809C78
2020-06-15 21:40:37.992 -03 [3380] LOG: invalid record length at
0/8A809CB0: wanted 24, got 0
2020-06-15 21:40:37.992 -03 [3380] LOG: redo done at 0/8A809C78

Could you please keep discussions on their own thread please? After
stopping Postgres in a sudden way (immediate mode or just SIGKILL), it
is normal to see crash recovery happen, as well as it is normal to see
an "invalid record length" in the logs when the end of WAL is reached,
meaning the end of recovery.

Thanks.
--
Michael

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#1)
Re: Failures with wal_consistency_checking and 13~

On 2020-Jun-15, Michael Paquier wrote:

I have begun my annual study of WAL consistency across replays, and
wal_consistency_checking = 'all' is pointing out at some issues with
at least VACUUM and SPGist:
FATAL: inconsistent page found, rel 1663/16385/22133, forknum 0,
blkno 15
CONTEXT: WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
XID 4619

It may be possible that there are other failures, I have just run
installcheck and this is the first failure I saw after replaying all
the generated WAL on a standby. Please note that I have also checked
12, and installcheck passes.

Umm. Alexander, do you an idea of what this is about?

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Alvaro Herrera (#4)
Re: Failures with wal_consistency_checking and 13~

On Fri, Jun 19, 2020 at 10:34 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Jun-15, Michael Paquier wrote:

I have begun my annual study of WAL consistency across replays, and
wal_consistency_checking = 'all' is pointing out at some issues with
at least VACUUM and SPGist:
FATAL: inconsistent page found, rel 1663/16385/22133, forknum 0,
blkno 15
CONTEXT: WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
XID 4619

It may be possible that there are other failures, I have just run
installcheck and this is the first failure I saw after replaying all
the generated WAL on a standby. Please note that I have also checked
12, and installcheck passes.

Umm. Alexander, do you an idea of what this is about?

I don't have idea yet, but I'll check this out

------
Regards,
Alexander Korotkov

#6Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#5)
Re: Failures with wal_consistency_checking and 13~

On Sat, Jun 20, 2020 at 1:16 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

On Fri, Jun 19, 2020 at 10:34 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Jun-15, Michael Paquier wrote:

I have begun my annual study of WAL consistency across replays, and
wal_consistency_checking = 'all' is pointing out at some issues with
at least VACUUM and SPGist:
FATAL: inconsistent page found, rel 1663/16385/22133, forknum 0,
blkno 15
CONTEXT: WAL redo at 0/739CEDE8 for SPGist/VACUUM_REDIRECT: newest
XID 4619

It may be possible that there are other failures, I have just run
installcheck and this is the first failure I saw after replaying all
the generated WAL on a standby. Please note that I have also checked
12, and installcheck passes.

Umm. Alexander, do you an idea of what this is about?

I don't have idea yet, but I'll check this out

I have discovered and fixed the issue in a44dd932ff. spg_mask()
masked unused space only when pagehdr->pd_lower >
SizeOfPageHeaderData. But during the vacuum regression tests, one
page has been erased completely and pagehdr->pd_lower was set to
SizeOfPageHeaderData. Actually, 13 didn't introduce any issue, it
just added a test that spotted the issue. The issue is here since
a507b86900.

------
Regards,
Alexander Korotkov

#7Michael Paquier
michael@paquier.xyz
In reply to: Alexander Korotkov (#6)
Re: Failures with wal_consistency_checking and 13~

On Sat, Jun 20, 2020 at 05:43:19PM +0300, Alexander Korotkov wrote:

I have discovered and fixed the issue in a44dd932ff. spg_mask()
masked unused space only when pagehdr->pd_lower >
SizeOfPageHeaderData. But during the vacuum regression tests, one
page has been erased completely and pagehdr->pd_lower was set to
SizeOfPageHeaderData. Actually, 13 didn't introduce any issue, it
just added a test that spotted the issue. The issue is here since
a507b86900.

Thanks Alexander for looking at this issue! My runs with
wal_consistency_checking are all clear now.
--
Michael