BUG #15745: WAL References Invalid Pages...that eventually resolves
The following bug has been logged on the website:
Bug reference: 15745
Logged by: Daniel Farina
Email address: daniel@citusdata.com
PostgreSQL version: 11.1
Operating system: Amazon Linux 2018.03
Description:
We have here an interesting case of a panic during recovery from backup with
"WAL References invalid pages." Normally, this is viewed as an issue of
corruption.
But, for serendipitous reasons, I let this one run for a while. As it turns
out, with each crash, it would make *slightly* more progress than the time
before....and then eventually, it suffered no more faults and caught up
normally. Included is a log that shows how sparse these faults were,
relative to all the traffic going on....: roughly two per segment on this
workload, with large gaps between problematic segments, and not necessarily
repetition in a problematic relation or filenode.
The fact the standby eventually came up made me suspicious, so I ran amcheck
with a heap re-check, and, no tuples were in violation.
Included is a log, which shows how the system recovered over and over,
making slight progress each time. This is the entire inventory after such
crashes: after these, the system passed amcheck and appears to work
normally.
postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
WARNING: page 162136064 of relation base/16385/21372 does not exist
postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [131-1]
CONTEXT: WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items
postgresql-Mon.log:2019-04-08 00:08:22.619 UTC [3323][1/0] : [132-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [133-1]
CONTEXT: WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items
--
postgresql-Mon.log-2019-04-08 00:17:31.686 UTC [3277][1/0] : [106-1] LOG:
restored log file "00000006000009AA000000E2" from archive
postgresql-Mon.log-2019-04-08 00:17:31.741 UTC [3277][1/0] : [107-1]
WARNING: page 162136064 of relation base/16385/22606 does not exist
postgresql-Mon.log-2019-04-08 00:17:31.741 UTC [3277][1/0] : [108-1]
CONTEXT: WAL redo at 9AA/E269B0F0 for Btree/DELETE: 1 items
postgresql-Mon.log:2019-04-08 00:17:31.742 UTC [3277][1/0] : [109-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:17:31.742 UTC [3277][1/0] : [110-1]
CONTEXT: WAL redo at 9AA/E269B0F0 for Btree/DELETE: 1 items
postgresql-Mon.log-2019-04-08 00:17:31.815 UTC [3274][] : [11-1] LOG:
startup process (PID 3277) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 00:17:31.815 UTC [3274][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [149-1]
WARNING: page 1936013626 of relation base/16385/22536 does not exist
postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [150-1]
CONTEXT: WAL redo at 9AB/C21CA50 for Btree/DELETE: 35 items
postgresql-Mon.log:2019-04-08 00:26:17.865 UTC [3288][1/0] : [151-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:26:17.865 UTC [3288][1/0] : [152-1]
CONTEXT: WAL redo at 9AB/C21CA50 for Btree/DELETE: 35 items
postgresql-Mon.log-2019-04-08 00:26:17.943 UTC [3285][] : [11-1] LOG:
startup process (PID 3288) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 00:26:17.943 UTC [3285][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 00:36:01.466 UTC [3330][1/0] : [118-1] LOG:
restored log file "00000006000009AA000000EE" from archive
postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [119-1]
WARNING: page 1432945756 of relation base/16385/21365 does not exist
postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [120-1]
CONTEXT: WAL redo at 9AA/EE183EC0 for Btree/DELETE: 43 items
postgresql-Mon.log:2019-04-08 00:36:01.497 UTC [3330][1/0] : [121-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:36:01.497 UTC [3330][1/0] : [122-1]
CONTEXT: WAL redo at 9AA/EE183EC0 for Btree/DELETE: 43 items
postgresql-Mon.log-2019-04-08 00:36:01.536 UTC [5403][] : [1-1]
[app=[unknown]] LOG: connection received: host=172.16.101.249 port=48250
postgresql-Mon.log-2019-04-08 00:36:01.574 UTC [3327][] : [11-1] LOG:
startup process (PID 3330) was terminated by signal 6: Aborted
--
postgresql-Mon.log-2019-04-08 00:43:37.086 UTC [3338][1/0] : [121-1] LOG:
restored log file "00000006000009AA000000F1" from archive
postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [122-1]
WARNING: page 825766448 of relation base/16385/21358 does not exist
postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [123-1]
CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Mon.log:2019-04-08 00:43:37.134 UTC [3338][1/0] : [124-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:43:37.134 UTC [3338][1/0] : [125-1]
CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Mon.log-2019-04-08 00:43:37.211 UTC [3335][] : [11-1] LOG:
startup process (PID 3338) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 00:43:37.211 UTC [3335][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [143-1]
WARNING: page 162201600 of relation base/16385/22456 does not exist
postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [144-1]
CONTEXT: WAL redo at 9AB/6F53290 for Btree/DELETE: 3 items
postgresql-Mon.log:2019-04-08 00:52:30.100 UTC [3249][1/0] : [145-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 00:52:30.100 UTC [3249][1/0] : [146-1]
CONTEXT: WAL redo at 9AB/6F53290 for Btree/DELETE: 3 items
postgresql-Mon.log-2019-04-08 00:52:30.173 UTC [3246][] : [11-1] LOG:
startup process (PID 3249) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 00:52:30.173 UTC [3246][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 01:01:56.487 UTC [3315][1/0] : [125-1] LOG:
restored log file "00000006000009AA000000F5" from archive
postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [126-1]
WARNING: page 162136064 of relation base/16385/22456 does not exist
postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [127-1]
CONTEXT: WAL redo at 9AA/F57AF340 for Btree/DELETE: 1 items
postgresql-Mon.log:2019-04-08 01:01:56.545 UTC [3315][1/0] : [128-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 01:01:56.545 UTC [3315][1/0] : [129-1]
CONTEXT: WAL redo at 9AA/F57AF340 for Btree/DELETE: 1 items
--
postgresql-Mon.log-2019-04-08 01:10:55.964 UTC [3274][1/0] : [98-1] LOG:
restored log file "00000006000009AA000000DA" from archive
postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [99-1] WARNING:
page 162201600 of relation base/16385/21379 does not exist
postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [100-1]
CONTEXT: WAL redo at 9AA/DA735378 for Btree/DELETE: 1 items
postgresql-Mon.log:2019-04-08 01:10:56.026 UTC [3274][1/0] : [101-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 01:10:56.026 UTC [3274][1/0] : [102-1]
CONTEXT: WAL redo at 9AA/DA735378 for Btree/DELETE: 1 items
postgresql-Mon.log-2019-04-08 01:10:56.099 UTC [3271][] : [11-1] LOG:
startup process (PID 3274) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 01:10:56.099 UTC [3271][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 01:21:03.013 UTC [3326][1/0] : [266-1] LOG:
restored log file "00000006000009AB00000082" from archive
postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [267-1]
WARNING: page 1886284143 of relation base/16385/21291 does not exist
postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [268-1]
CONTEXT: WAL redo at 9AB/823B9360 for Btree/DELETE: 12 items
postgresql-Mon.log:2019-04-08 01:21:03.053 UTC [3326][1/0] : [269-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 01:21:03.053 UTC [3326][1/0] : [270-1]
CONTEXT: WAL redo at 9AB/823B9360 for Btree/DELETE: 12 items
postgresql-Mon.log-2019-04-08 01:21:03.134 UTC [3323][] : [11-1] LOG:
startup process (PID 3326) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 01:21:03.134 UTC [3323][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [95-1] WARNING:
page 1934703203 of relation base/16385/21306 does not exist
postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [96-1] CONTEXT:
WAL redo at 9AA/D6404470 for Btree/DELETE: 9 items
postgresql-Mon.log:2019-04-08 01:30:46.438 UTC [3333][1/0] : [97-1] PANIC:
WAL contains references to invalid pages
postgresql-Mon.log-2019-04-08 01:30:46.438 UTC [3333][1/0] : [98-1] CONTEXT:
WAL redo at 9AA/D6404470 for Btree/DELETE: 9 items
postgresql-Mon.log-2019-04-08 01:30:46.519 UTC [3330][] : [11-1] LOG:
startup process (PID 3333) was terminated by signal 6: Aborted
postgresql-Mon.log-2019-04-08 01:30:46.519 UTC [3330][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 22:43:15.774 UTC [17013][1/0] : [121-1] LOG:
restored log file "00000006000009AA000000F1" from archive
postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [122-1]
WARNING: page 825766448 of relation base/16385/21358 does not exist
postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [123-1]
CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Sun.log:2019-04-07 22:43:15.824 UTC [17013][1/0] : [124-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 22:43:15.824 UTC [17013][1/0] : [125-1]
CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Sun.log-2019-04-07 22:43:15.904 UTC [17010][] : [11-1] LOG:
startup process (PID 17013) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 22:43:15.904 UTC [17010][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 22:44:15.699 UTC [19068][1/0] : [144-1] LOG:
restored log file "00000006000009AB00000008" from archive
postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [145-1]
WARNING: page 162201600 of relation base/16385/21365 does not exist
postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [146-1]
CONTEXT: WAL redo at 9AB/854FF48 for Btree/DELETE: 2 items
postgresql-Sun.log:2019-04-07 22:44:15.737 UTC [19068][1/0] : [147-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 22:44:15.737 UTC [19068][1/0] : [148-1]
CONTEXT: WAL redo at 9AB/854FF48 for Btree/DELETE: 2 items
postgresql-Sun.log-2019-04-07 22:44:15.827 UTC [19065][] : [11-1] LOG:
startup process (PID 19068) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 22:44:15.827 UTC [19065][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 22:51:58.298 UTC [3306][1/0] : [121-1] LOG:
restored log file "00000006000009AA000000F1" from archive
postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [122-1]
WARNING: page 825766448 of relation base/16385/21358 does not exist
postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [123-1]
CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Sun.log:2019-04-07 22:51:58.342 UTC [3306][1/0] : [124-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 22:51:58.342 UTC [3306][1/0] : [125-1]
CONTEXT: WAL redo at 9AA/F12FC1D0 for Btree/DELETE: 25 items
postgresql-Sun.log-2019-04-07 22:51:58.420 UTC [3303][] : [11-1] LOG:
startup process (PID 3306) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 22:51:58.421 UTC [3303][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:02:20.964 UTC [5948][] : [1-1]
[app=[unknown]] LOG: connection received: host=3.88.80.230 port=57984
postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [155-1]
WARNING: page 825439802 of relation base/16385/21231 does not exist
postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [156-1]
CONTEXT: WAL redo at 9AB/12A04DA0 for Btree/DELETE: 14 items
postgresql-Sun.log:2019-04-07 23:02:21.041 UTC [3298][1/0] : [157-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:02:21.041 UTC [3298][1/0] : [158-1]
CONTEXT: WAL redo at 9AB/12A04DA0 for Btree/DELETE: 14 items
postgresql-Sun.log-2019-04-07 23:02:21.117 UTC [3295][] : [11-1] LOG:
startup process (PID 3298) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:02:21.117 UTC [3295][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [112-1]
WARNING: page 162136064 of relation base/16385/21351 does not exist
postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [113-1]
CONTEXT: WAL redo at 9AA/E7AF0838 for Btree/DELETE: 4 items
postgresql-Sun.log:2019-04-07 23:12:15.077 UTC [3305][1/0] : [114-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:12:15.077 UTC [3305][1/0] : [115-1]
CONTEXT: WAL redo at 9AA/E7AF0838 for Btree/DELETE: 4 items
postgresql-Sun.log-2019-04-07 23:12:15.133 UTC [5203][] : [1-1]
[app=[unknown]] LOG: connection received: host=172.16.101.249 port=60556
postgresql-Sun.log-2019-04-07 23:12:15.149 UTC [3302][] : [11-1] LOG:
startup process (PID 3305) was terminated by signal 6: Aborted
--
postgresql-Sun.log-2019-04-07 23:21:02.554 UTC [5377][] : [1-1]
[app=[unknown]] LOG: connection received: host=172.16.101.249 port=49380
postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [130-1]
WARNING: page 162136064 of relation base/16385/21372 does not exist
postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [131-1]
CONTEXT: WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items
postgresql-Sun.log:2019-04-07 23:21:02.588 UTC [3258][1/0] : [132-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:21:02.588 UTC [3258][1/0] : [133-1]
CONTEXT: WAL redo at 9AA/F94F2700 for Btree/DELETE: 2 items
postgresql-Sun.log-2019-04-07 23:21:02.663 UTC [3255][] : [11-1] LOG:
startup process (PID 3258) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:21:02.663 UTC [3255][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [135-1]
WARNING: page 162136064 of relation base/16385/21365 does not exist
postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [136-1]
CONTEXT: WAL redo at 9AA/FE2A8BE8 for Btree/DELETE: 2 items
postgresql-Sun.log:2019-04-07 23:30:34.035 UTC [3272][1/0] : [137-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:30:34.035 UTC [3272][1/0] : [138-1]
CONTEXT: WAL redo at 9AA/FE2A8BE8 for Btree/DELETE: 2 items
postgresql-Sun.log-2019-04-07 23:30:34.111 UTC [3269][] : [11-1] LOG:
startup process (PID 3272) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:30:34.111 UTC [3269][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [187-1]
WARNING: page 2710832221 of relation base/16385/21393 does not exist
postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [188-1]
CONTEXT: WAL redo at 9AB/32811B38 for Btree/DELETE: 31 items
postgresql-Sun.log:2019-04-07 23:40:31.513 UTC [3323][1/0] : [189-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:40:31.513 UTC [3323][1/0] : [190-1]
CONTEXT: WAL redo at 9AB/32811B38 for Btree/DELETE: 31 items
postgresql-Sun.log-2019-04-07 23:40:31.589 UTC [3320][] : [11-1] LOG:
startup process (PID 3323) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:40:31.589 UTC [3320][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:48:15.992 UTC [3290][1/0] : [89-1] LOG:
restored log file "00000006000009AA000000D1" from archive
postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [90-1] WARNING:
page 162136064 of relation base/16385/21351 does not exist
postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [91-1] CONTEXT:
WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items
postgresql-Sun.log:2019-04-07 23:48:16.099 UTC [3290][1/0] : [92-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:48:16.099 UTC [3290][1/0] : [93-1] CONTEXT:
WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items
postgresql-Sun.log-2019-04-07 23:48:16.176 UTC [3287][] : [11-1] LOG:
startup process (PID 3290) was terminated by signal 6: Aborted
postgresql-Sun.log-2019-04-07 23:48:16.176 UTC [3287][] : [12-1] LOG:
terminating any other active server processes
--
postgresql-Sun.log-2019-04-07 23:58:31.375 UTC [3307][1/0] : [89-1] LOG:
restored log file "00000006000009AA000000D1" from archive
postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [90-1] WARNING:
page 162136064 of relation base/16385/21351 does not exist
postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [91-1] CONTEXT:
WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items
postgresql-Sun.log:2019-04-07 23:58:31.476 UTC [3307][1/0] : [92-1] PANIC:
WAL contains references to invalid pages
postgresql-Sun.log-2019-04-07 23:58:31.476 UTC [3307][1/0] : [93-1] CONTEXT:
WAL redo at 9AA/D1D92508 for Btree/DELETE: 8 items
postgresql-Sun.log-2019-04-07 23:58:31.552 UTC [3304][] : [11-1] LOG:
startup process (PID 3307) was terminated by signal 6: Aborted
Hi Daniel,
On Tue, Apr 9, 2019 at 1:30 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
But, for serendipitous reasons, I let this one run for a while. As it turns
out, with each crash, it would make *slightly* more progress than the time
before....and then eventually, it suffered no more faults and caught up
normally. Included is a log that shows how sparse these faults were,
relative to all the traffic going on....: roughly two per segment on this
workload, with large gaps between problematic segments, and not necessarily
repetition in a problematic relation or filenode.
That sounds weird.
The fact the standby eventually came up made me suspicious, so I ran amcheck
with a heap re-check, and, no tuples were in violation.Included is a log, which shows how the system recovered over and over,
making slight progress each time. This is the entire inventory after such
crashes: after these, the system passed amcheck and appears to work
normally.
Did you try bt_index_parent_check('rel', true)? You might want to make
sure that work_mem is set sufficiently high so that the
downlink-block-is-present check is definitely effective; work_mem
bounds the size of a Bloom filter used by the implementation (the heap
verification option has its own Bloom filter, bound by
maintenance_work_mem). Suggest that you "set
client_min_messages=debug1" before running amcheck this way, just in
case that shows something interesting.
postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
WARNING: page 162136064 of relation base/16385/21372 does not exist
These WARNING messages all reference block numbers that look like
32-bits of random garbage, but could be from a very large relation.
The relevant WAL record is from B-Tree's opportunistic LP_DEAD garbage
collection (not VACUUM). Note that Andres changed this mechanism for
v12, so that latestRemovedXid was calculated on the primary, rather
than on the standby. I think that this error comes from
btree_xlog_delete_get_latestRemovedXid(), which is in 11 but not
master/12.
I wonder, is "base/16385/21351" the index or the table? Is it possible
to run pg_waldump? I think it's the table.
If the problem is in btree_xlog_delete_get_latestRemovedXid(), then it
is perhaps unsurprising that there isn't evidence of any lasting
corruption.
--
Peter Geoghegan
On Sat, Apr 27, 2019 at 8:28 PM Peter Geoghegan <pg@bowt.ie> wrote:
Hi Daniel,
On Tue, Apr 9, 2019 at 1:30 PM PG Bug reporting form
<noreply@postgresql.org> wrote:But, for serendipitous reasons, I let this one run for a while. As it
turns
out, with each crash, it would make *slightly* more progress than the
time
before....and then eventually, it suffered no more faults and caught up
normally. Included is a log that shows how sparse these faults were,
relative to all the traffic going on....: roughly two per segment on this
workload, with large gaps between problematic segments, and notnecessarily
repetition in a problematic relation or filenode.
That sounds weird.
Yeah. It was.
The fact the standby eventually came up made me suspicious, so I ran
amcheck
with a heap re-check, and, no tuples were in violation.
Included is a log, which shows how the system recovered over and over,
making slight progress each time. This is the entire inventory after such
crashes: after these, the system passed amcheck and appears to work
normally.Did you try bt_index_parent_check('rel', true)? You might want to make
sure that work_mem is set sufficiently high so that the
downlink-block-is-present check is definitely effective; work_mem
bounds the size of a Bloom filter used by the implementation (the heap
verification option has its own Bloom filter, bound by
maintenance_work_mem). Suggest that you "set
client_min_messages=debug1" before running amcheck this way, just in
case that shows something interesting.
No, I didn't want to take that lock...but I'll keep in it mind for next
time, though I'll have to make arrangements.
postgresql-Mon.log-2019-04-08 00:08:22.619 UTC [3323][1/0] : [130-1]
WARNING: page 162136064 of relation base/16385/21372 does not exist
These WARNING messages all reference block numbers that look like
32-bits of random garbage, but could be from a very large relation.
Definitely not a 1.3TiB relation. Good eye.
The relevant WAL record is from B-Tree's opportunistic LP_DEAD garbage
collection (not VACUUM). Note that Andres changed this mechanism for
v12, so that latestRemovedXid was calculated on the primary, rather
than on the standby. I think that this error comes from
btree_xlog_delete_get_latestRemovedXid(), which is in 11 but not
master/12.I wonder, is "base/16385/21351" the index or the table? Is it possible
to run pg_waldump? I think it's the table.If the problem is in btree_xlog_delete_get_latestRemovedXid(), then it
is perhaps unsurprising that there isn't evidence of any lasting
corruption.
Regrettably, it's too late, but I'll have my notes for next time.
On Sun, May 05, 2019 at 08:37:02PM -0700, Daniel Farina wrote:
Regrettably, it's too late, but I'll have my notes for next time.
Here is a guess: the base backup which has been used for this
instance's recovery has been taken from a standby, where the
minimum recovery LSN registered in its control file did not actually
map to the last physical LSN where a page has been flushed by either
the checkpointer, the startup process or a read-only backend.
This reminds me of the problems between 3c64dcb and c186ba1 I was at
the origin of, but this refers to 11.1, which was released after the
fact. Or perhaps a beta version of 11 has been used?
--
Michael