Streaming replica failure

Started by Aleš Zelenýalmost 5 years ago2 messagesgeneral
Jump to latest
#1Aleš Zelený
zeleny.ales@gmail.com

Hello,

we are using PostgreSQL 12.4 on CentOS 7. The hot standby failed:

2021-04-24 09:19:27 CEST [20956]: [747-1] user=,db=,host=,app= LOG:
recovery restart point at 3D8C/352B4CE8
2021-04-24 09:19:27 CEST [20956]: [748-1] user=,db=,host=,app= DETAIL:
Last completed transaction was at log time 2021-04-24 09:19:27.221313+02.
2021-04-24 09:20:57 CEST [20956]: [749-1] user=,db=,host=,app= LOG:
restartpoint starting: time
2021-04-24 09:24:27 CEST [20956]: [750-1] user=,db=,host=,app= LOG:
restartpoint complete: wrote 171233 buffers (13.4%); 0 WAL file(s) added,
68 removed, 0 recycled; write
=209.663 s, sync=0.012 s, total=209.963 s; sync files=283, longest=0.001 s,
average=0.000 s; distance=684762 kB, estimate=684762 kB
2021-04-24 09:24:27 CEST [20956]: [751-1] user=,db=,host=,app= LOG:
recovery restart point at 3D8C/5EF6B858
2021-04-24 09:24:27 CEST [20956]: [752-1] user=,db=,host=,app= DETAIL:
Last completed transaction was at log time 2021-04-24 09:24:27.288115+02.
2021-04-24 09:25:15 CEST [20955]: [11-1] user=,db=,host=,app= WARNING:
page 366603 of relation base/20955/10143636 is uninitialized
2021-04-24 09:25:15 CEST [20955]: [12-1] user=,db=,host=,app= CONTEXT: WAL
redo at 3D8C/D79F6500 for Heap2/VISIBLE: cutoff xid 3806260577 flags 0x01
2021-04-24 09:25:15 CEST [20955]: [13-1] user=,db=,host=,app= PANIC: WAL
contains references to invalid pages
2021-04-24 09:25:15 CEST [20955]: [14-1] user=,db=,host=,app= CONTEXT: WAL
redo at 3D8C/D79F6500 for Heap2/VISIBLE: cutoff xid 3806260577 flags 0x01
2021-04-24 09:25:16 CEST [20953]: [11-1] user=,db=,host=,app= LOG: startup
process (PID 20955) was terminated by signal 6: Neúspěšně ukončen (SIGABRT)
2021-04-24 09:25:16 CEST [20953]: [12-1] user=,db=,host=,app= LOG:
terminating any other active server processes

The relation base/20955/10143636 is a standard table.

Version details:
postgres=# select version();
version

---------------------------------------------------------------------------------------------------------
PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)

Settings:
postgres=# select name, setting, category from pg_settings where category =
'Write-Ahead Log / Settings';
name | setting | category
------------------------+-----------+----------------------------
commit_delay | 0 | Write-Ahead Log / Settings
commit_siblings | 5 | Write-Ahead Log / Settings
fsync | on | Write-Ahead Log / Settings
full_page_writes | on | Write-Ahead Log / Settings
synchronous_commit | on | Write-Ahead Log / Settings
wal_buffers | 2048 | Write-Ahead Log / Settings
wal_compression | off | Write-Ahead Log / Settings
wal_init_zero | on | Write-Ahead Log / Settings
wal_level | logical | Write-Ahead Log / Settings
wal_log_hints | off | Write-Ahead Log / Settings
wal_recycle | on | Write-Ahead Log / Settings
wal_sync_method | fdatasync | Write-Ahead Log / Settings
wal_writer_delay | 200 | Write-Ahead Log / Settings
wal_writer_flush_after | 128 | Write-Ahead Log / Settings
(14 rows)

pg_waldump output:
-bash-4.2$ /usr/pgsql-12/bin/pg_waldump 0000000100003D8C000000D7 2>&1 | tail
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5A98, prev 3D8C/D7CF5A58, desc: INSERT_LEAF off 360, blkref #0:
rel 1663/20955/11280092 blk 19
509
rmgr: Transaction len (rec/tot): 46/ 46, tx: 3812802557, lsn:
3D8C/D7CF5AD8, prev 3D8C/D7CF5A98, desc: COMMIT 2021-04-24 09:25:16.160687
CEST
rmgr: Heap len (rec/tot): 159/ 159, tx: 3812802559, lsn:
3D8C/D7CF5B08, prev 3D8C/D7CF5AD8, desc: INSERT off 8 flags 0x08, blkref
#0: rel 1663/20955/11280066 bl
k 165603
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5BA8, prev 3D8C/D7CF5B08, desc: INSERT_LEAF off 317, blkref #0:
rel 1663/20955/11280073 blk 15
340
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5BE8, prev 3D8C/D7CF5BA8, desc: INSERT_LEAF off 130, blkref #0:
rel 1663/20955/11280091 blk 22
003
rmgr: Heap len (rec/tot): 80/ 80, tx: 3812802559, lsn:
3D8C/D7CF5C28, prev 3D8C/D7CF5BE8, desc: INSERT off 61 flags 0x08, blkref
#0: rel 1663/20955/11280082 b
lk 38917
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5C78, prev 3D8C/D7CF5C28, desc: INSERT_LEAF off 73, blkref #0: rel
1663/20955/11280088 blk 421
25
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5CB8, prev 3D8C/D7CF5C78, desc: INSERT_LEAF off 361, blkref #0:
rel 1663/20955/11280092 blk 19
509
rmgr: Transaction len (rec/tot): 46/ 46, tx: 3812802559, lsn:
3D8C/D7CF5CF8, prev 3D8C/D7CF5CB8, desc: COMMIT 2021-04-24 09:25:16.161036
CEST
pg_waldump: fatal: error in WAL record at 3D8C/D7CF5CF8: invalid record
length at 3D8C/D7CF5D28: wanted 24, got 0
-bash-4.2$

Emails like "BUG #13822: Slave terminated - WAL contains references to
invalid page" in the archive are quite old (2015).

While we can create a new replica, ios there something we can do to
diagnose the root cause of this failure?

Thanks for any hints or advice.

Kind regards Ales Zeleny

#2Vijaykumar Jain
vijaykumarjain.github@gmail.com
In reply to: Aleš Zelený (#1)
Re: Streaming replica failure

Were there any issues with hardware ?
Memory/storage ?
I am not sure but it look like data loss to me.

If you have admin access?

Can you run dmesg -a, system logs, hyperion logs etc and see if there are
any errors related to memory corruption.
Coz if there are hardware issues, this will happen again IMHO.
But if this is not a hardware issue, then I guess one of the experts may
have to chime in.

Thanks,
Vijay

On Tue, Apr 27, 2021, 3:31 PM Aleš Zelený <zeleny.ales@gmail.com> wrote:

Show quoted text

Hello,

we are using PostgreSQL 12.4 on CentOS 7. The hot standby failed:

2021-04-24 09:19:27 CEST [20956]: [747-1] user=,db=,host=,app= LOG:
recovery restart point at 3D8C/352B4CE8
2021-04-24 09:19:27 CEST [20956]: [748-1] user=,db=,host=,app= DETAIL:
Last completed transaction was at log time 2021-04-24 09:19:27.221313+02.
2021-04-24 09:20:57 CEST [20956]: [749-1] user=,db=,host=,app= LOG:
restartpoint starting: time
2021-04-24 09:24:27 CEST [20956]: [750-1] user=,db=,host=,app= LOG:
restartpoint complete: wrote 171233 buffers (13.4%); 0 WAL file(s) added,
68 removed, 0 recycled; write
=209.663 s, sync=0.012 s, total=209.963 s; sync files=283, longest=0.001
s, average=0.000 s; distance=684762 kB, estimate=684762 kB
2021-04-24 09:24:27 CEST [20956]: [751-1] user=,db=,host=,app= LOG:
recovery restart point at 3D8C/5EF6B858
2021-04-24 09:24:27 CEST [20956]: [752-1] user=,db=,host=,app= DETAIL:
Last completed transaction was at log time 2021-04-24 09:24:27.288115+02.
2021-04-24 09:25:15 CEST [20955]: [11-1] user=,db=,host=,app= WARNING:
page 366603 of relation base/20955/10143636 is uninitialized
2021-04-24 09:25:15 CEST [20955]: [12-1] user=,db=,host=,app= CONTEXT:
WAL redo at 3D8C/D79F6500 for Heap2/VISIBLE: cutoff xid 3806260577 flags
0x01
2021-04-24 09:25:15 CEST [20955]: [13-1] user=,db=,host=,app= PANIC: WAL
contains references to invalid pages
2021-04-24 09:25:15 CEST [20955]: [14-1] user=,db=,host=,app= CONTEXT:
WAL redo at 3D8C/D79F6500 for Heap2/VISIBLE: cutoff xid 3806260577 flags
0x01
2021-04-24 09:25:16 CEST [20953]: [11-1] user=,db=,host=,app= LOG:
startup process (PID 20955) was terminated by signal 6: Neúspěšně ukončen
(SIGABRT)
2021-04-24 09:25:16 CEST [20953]: [12-1] user=,db=,host=,app= LOG:
terminating any other active server processes

The relation base/20955/10143636 is a standard table.

Version details:
postgres=# select version();
version

---------------------------------------------------------------------------------------------------------
PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-39), 64-bit
(1 row)

Settings:
postgres=# select name, setting, category from pg_settings where category
= 'Write-Ahead Log / Settings';
name | setting | category
------------------------+-----------+----------------------------
commit_delay | 0 | Write-Ahead Log / Settings
commit_siblings | 5 | Write-Ahead Log / Settings
fsync | on | Write-Ahead Log / Settings
full_page_writes | on | Write-Ahead Log / Settings
synchronous_commit | on | Write-Ahead Log / Settings
wal_buffers | 2048 | Write-Ahead Log / Settings
wal_compression | off | Write-Ahead Log / Settings
wal_init_zero | on | Write-Ahead Log / Settings
wal_level | logical | Write-Ahead Log / Settings
wal_log_hints | off | Write-Ahead Log / Settings
wal_recycle | on | Write-Ahead Log / Settings
wal_sync_method | fdatasync | Write-Ahead Log / Settings
wal_writer_delay | 200 | Write-Ahead Log / Settings
wal_writer_flush_after | 128 | Write-Ahead Log / Settings
(14 rows)

pg_waldump output:
-bash-4.2$ /usr/pgsql-12/bin/pg_waldump 0000000100003D8C000000D7 2>&1 |
tail
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5A98, prev 3D8C/D7CF5A58, desc: INSERT_LEAF off 360, blkref #0:
rel 1663/20955/11280092 blk 19
509
rmgr: Transaction len (rec/tot): 46/ 46, tx: 3812802557, lsn:
3D8C/D7CF5AD8, prev 3D8C/D7CF5A98, desc: COMMIT 2021-04-24 09:25:16.160687
CEST
rmgr: Heap len (rec/tot): 159/ 159, tx: 3812802559, lsn:
3D8C/D7CF5B08, prev 3D8C/D7CF5AD8, desc: INSERT off 8 flags 0x08, blkref
#0: rel 1663/20955/11280066 bl
k 165603
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5BA8, prev 3D8C/D7CF5B08, desc: INSERT_LEAF off 317, blkref #0:
rel 1663/20955/11280073 blk 15
340
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5BE8, prev 3D8C/D7CF5BA8, desc: INSERT_LEAF off 130, blkref #0:
rel 1663/20955/11280091 blk 22
003
rmgr: Heap len (rec/tot): 80/ 80, tx: 3812802559, lsn:
3D8C/D7CF5C28, prev 3D8C/D7CF5BE8, desc: INSERT off 61 flags 0x08, blkref
#0: rel 1663/20955/11280082 b
lk 38917
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5C78, prev 3D8C/D7CF5C28, desc: INSERT_LEAF off 73, blkref #0: rel
1663/20955/11280088 blk 421
25
rmgr: Btree len (rec/tot): 64/ 64, tx: 3812802559, lsn:
3D8C/D7CF5CB8, prev 3D8C/D7CF5C78, desc: INSERT_LEAF off 361, blkref #0:
rel 1663/20955/11280092 blk 19
509
rmgr: Transaction len (rec/tot): 46/ 46, tx: 3812802559, lsn:
3D8C/D7CF5CF8, prev 3D8C/D7CF5CB8, desc: COMMIT 2021-04-24 09:25:16.161036
CEST
pg_waldump: fatal: error in WAL record at 3D8C/D7CF5CF8: invalid record
length at 3D8C/D7CF5D28: wanted 24, got 0
-bash-4.2$

Emails like "BUG #13822: Slave terminated - WAL contains references to
invalid page" in the archive are quite old (2015).

While we can create a new replica, ios there something we can do to
diagnose the root cause of this failure?

Thanks for any hints or advice.

Kind regards Ales Zeleny