BUG #15346: Replica fails to start after the crash
The following bug has been logged on the website:
Bug reference: 15346
Logged by: Alexander Kukushkin
Email address: cyberdemn@gmail.com
PostgreSQL version: 9.6.10
Operating system: Ubuntu
Description:
We run quite a lot of clusters, and from time to time EC2 instances die,
but so far we have never had the problem I describe here.
We experienced it for the first time last week while upgrading to the
next minor version. A few replicas failed to start with very similar
symptoms, and I just reinitialized them because I was in a hurry.
But this morning one of the EC2 instances running a replica crashed.
After the crash, a new instance was spawned and the existing EBS volume
(with the existing PGDATA) was attached to it.
Postgres logs before the crash:
2018-08-22 05:14:43.036 UTC,,,23838,,5b78e9a0.5d1e,2636,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"restartpoint complete: wrote 82957 buffers (17.0%); 0
transaction log file(s) added, 3 removed, 7 recycled; write=269.887 s,
sync=0.012 s, total=269.926 s; sync files=105, longe
st=0.002 s, average=0.000 s; distance=168429 kB, estimate=173142
kB",,,,,,,,,""
2018-08-22 05:14:43.036 UTC,,,23838,,5b78e9a0.5d1e,2637,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"recovery restart point at AB3/37BD3C60","last completed
transaction was at log time 2018-08-22 05:14:43.022752+00",,,,,,,,""
2018-08-22 05:15:13.065 UTC,,,23838,,5b78e9a0.5d1e,2638,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"restartpoint starting: time",,,,,,,,,""
2018-08-22 05:19:43.093 UTC,,,23838,,5b78e9a0.5d1e,2639,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"restartpoint complete: wrote 86588 buffers (17.7%); 0
transaction log file(s) added, 2 removed, 8 recycled; write=269.983 s,
sync=0.017 s, total=270.028 s; sync files=112, longe
st=0.002 s, average=0.000 s; distance=156445 kB, estimate=171472
kB",,,,,,,,,""
2018-08-22 05:19:43.093 UTC,,,23838,,5b78e9a0.5d1e,2640,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"recovery restart point at AB3/4149B340","last completed
transaction was at log time 2018-08-22 05:19:43.077943+00",,,,,,,,""
2018-08-22 05:20:13.121 UTC,,,23838,,5b78e9a0.5d1e,2641,,2018-08-19 03:53:04
UTC,,0,LOG,00000,"restartpoint starting: time",,,,,,,,,""
Here it crashed.
Now postgres fails to start with the following logs:
2018-08-22 06:22:12.994 UTC,,,51,,5b7d0114.33,1,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go
to log destination ""csvlog"".",,,,,,,""
2018-08-22 06:22:12.996 UTC,,,54,,5b7d0114.36,1,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"database system was interrupted while in recovery at log
time 2018-08-22 05:10:29 UTC",,"If this has occurred more than once some
data might be corrupted and you might need to choose an
earlier recovery target.",,,,,,,""
2018-08-22 06:22:13.232 UTC,,,54,,5b7d0114.36,2,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2018-08-22 06:22:13.420 UTC,,,54,,5b7d0114.36,3,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"restored log file ""00000005.history"" from
archive",,,,,,,,,""
2018-08-22 06:22:13.430 UTC,,,147,"[local]",5b7d0115.93,1,"",2018-08-22
06:22:13 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-22 06:22:13.430
UTC,"postgres","postgres",147,"[local]",5b7d0115.93,2,"",2018-08-22 06:22:13
UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2018-08-22 06:22:13.435 UTC,,,149,"[local]",5b7d0115.95,1,"",2018-08-22
06:22:13 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-22 06:22:13.435
UTC,"postgres","postgres",149,"[local]",5b7d0115.95,2,"",2018-08-22 06:22:13
UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2018-08-22 06:22:13.912 UTC,,,54,,5b7d0114.36,4,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"restored log file ""0000000500000AB30000004A"" from
archive",,,,,,,,,""
... restores WAL files from S3
2018-08-22 06:22:23.489 UTC,,,54,,5b7d0114.36,22,,2018-08-22 06:22:12
UTC,1/0,0,LOG,00000,"restored log file ""0000000500000AB300000050"" from
archive",,,,,,,,,""
2018-08-22 06:22:23.633 UTC,,,54,,5b7d0114.36,23,,2018-08-22 06:22:12
UTC,1/0,0,WARNING,01000,"page 179503104 of relation base/18055/212875 does
not exist",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
items",,,,""
2018-08-22 06:22:23.634 UTC,,,54,,5b7d0114.36,24,,2018-08-22 06:22:12
UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog
redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
2018-08-22 06:22:23.846 UTC,,,51,,5b7d0114.33,3,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"startup process (PID 54) was terminated by signal 6:
Aborted",,,,,,,,,""
2018-08-22 06:22:23.846 UTC,,,51,,5b7d0114.33,4,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
2018-08-22 06:22:23.847
UTC,"postgres","postgres",842,"[local]",5b7d011c.34a,3,"idle",2018-08-22
06:22:20 UTC,3/0,0,WARNING,57P02,"terminating connection because of crash of
another server process","The postmaster has commanded this server process to
roll back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.","In a moment you
should be able to reconnect to the database and repeat your
command.",,,,,,,"Patroni"
2018-08-22 06:22:23.945 UTC,,,51,,5b7d0114.33,5,,2018-08-22 06:22:12
UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""
2018-08-22 06:22:35.294 UTC,,,1135,,5b7d012b.46f,1,,2018-08-22 06:22:35
UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go
to log destination ""csvlog"".",,,,,,,""
2018-08-22 06:22:35.295 UTC,,,1138,,5b7d012b.472,1,,2018-08-22 06:22:35
UTC,,0,LOG,00000,"database system was interrupted while in recovery at log
time 2018-08-22 05:10:29 UTC",,"If this has occurred more than once some
data might be corrupted and you might need to choose an earlier recovery
target.",,,,,,,""
Here are a few log lines from pg_xlogdump 0000000500000AB300000050 around
LSN AB3/50323E78:
rmgr: Heap        len (rec/tot): 54/54, tx: 518902809, lsn: AB3/50323D30, prev AB3/50323CE8, desc: LOCK off 18: xid 518902809: flags 0 LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/18053/213284 blk 137366
rmgr: Btree       len (rec/tot): 88/88, tx: 518902810, lsn: AB3/50323D68, prev AB3/50323D30, desc: INSERT_LEAF off 66, blkref #0: rel 1663/18055/212936 blk 32947
rmgr: Btree       len (rec/tot): 96/96, tx: 518902810, lsn: AB3/50323DC0, prev AB3/50323D68, desc: INSERT_LEAF off 3, blkref #0: rel 1663/18055/212935 blk 6
rmgr: Btree       len (rec/tot): 88/88, tx: 518902810, lsn: AB3/50323E20, prev AB3/50323DC0, desc: INSERT_LEAF off 12, blkref #0: rel 1663/18055/212930 blk 25152
rmgr: Btree       len (rec/tot): 436/547, tx: 518902810, lsn: AB3/50323E78, prev AB3/50323E20, desc: DELETE 182 items, blkref #0: rel 1663/18055/212907 blk 72478 FPW
rmgr: Btree       len (rec/tot): 88/88, tx: 518902810, lsn: AB3/503240B8, prev AB3/50323E78, desc: INSERT_LEAF off 2, blkref #0: rel 1663/18055/212907 blk 72478
rmgr: Gist        len (rec/tot): 55/163, tx: 518902810, lsn: AB3/50324110, prev AB3/503240B8, desc: PAGE_UPDATE , blkref #0: rel 1663/18055/212906 blk 273288 FPW
rmgr: Btree       len (rec/tot): 160/160, tx: 518902810, lsn: AB3/503241B8, prev AB3/50324110, desc: INSERT_LEAF off 8, blkref #0: rel 1663/18055/212892 blk 85620
rmgr: Gist        len (rec/tot): 57/201, tx: 518902812, lsn: AB3/50324258, prev AB3/503241B8, desc: PAGE_UPDATE , blkref #0: rel 1663/18053/213364 blk 99265 FPW
rmgr: Btree       len (rec/tot): 55/2403, tx: 518902810, lsn: AB3/50324328, prev AB3/50324258, desc: INSERT_LEAF off 13, blkref #0: rel 1663/18055/212893 blk 53544 FPW
rmgr: Btree       len (rec/tot): 55/1464, tx: 518902812, lsn: AB3/50324C90, prev AB3/50324328, desc: INSERT_LEAF off 83, blkref #0: rel 1663/18053/213369 blk 41494 FPW
rmgr: Transaction len (rec/tot): 34/34, tx: 518902810, lsn: AB3/50325248, prev AB3/50324C90, desc: COMMIT 2018-08-22 05:18:03.670434 UTC
And the list of files for a given relation:
$ ls -al base/18055/212875*
-rw------- 1 postgres postgres 1073741824 Aug 22 05:20 base/18055/212875
-rw------- 1 postgres postgres 1014947840 Aug 22 05:21 base/18055/212875.1
-rw------- 1 postgres postgres     532480 Aug 22 05:21 base/18055/212875_fsm
-rw------- 1 postgres postgres      65536 Aug 22 05:21 base/18055/212875_vm
What I find really strange is the huge difference between the page number
in the postgres log and the one in the pg_xlogdump output:
"page 179503104 of relation base/18055/212875 does not exist" vs "rel
1663/18055/212907 blk 72478 FPW"
I will keep this instance around for further investigation and would be
happy to provide some more details if you need.
On Wed, Aug 22, 2018 at 07:36:58AM +0000, PG Bug reporting form wrote:
2018-08-22 06:22:23.633 UTC,,,54,,5b7d0114.36,23,,2018-08-22 06:22:12
UTC,1/0,0,WARNING,01000,"page 179503104 of relation base/18055/212875 does
not exist",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
items",,,,""
2018-08-22 06:22:23.634 UTC,,,54,,5b7d0114.36,24,,2018-08-22 06:22:12
UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog
redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
Once recovery has reached a consistent state, the startup process
looks at whether there are any invalid pages tracked in a dedicated hash
table and complains loudly about them. It is not the last record or its
surroundings which matter in this case, but whether this page has been
referenced by one of the records replayed during recovery up to the
consistent point. Do any of the records in the WAL segments fetched
contain a reference to this page? A page is 8kB, and the page number is
179503104, which is definitely weird as that would require the relation
file to be more than 1000GB. If the record itself is in bad shape, this
may be a corrupted segment. As far as I can see you only have one
incorrect page reference (see XLogCheckInvalidPages in xlog.c).
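For readers following along, here is a small standalone sketch of the bookkeeping described above. It is a toy model only, not the actual xlog.c code; the names note_invalid_page and check_invalid_pages are invented for illustration.

/*
 * Toy model of the invalid-page tracking done by the startup process
 * during WAL replay (the real thing lives in log_invalid_page() and
 * XLogCheckInvalidPages() in src/backend/access/transam/xlogutils.c).
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct InvalidPage
{
	uint32_t	relfilenode;	/* relation file the record pointed at */
	uint32_t	blkno;			/* block number that could not be found */
} InvalidPage;

static InvalidPage	invalid_pages[64];
static int			n_invalid = 0;
static bool			reached_consistency = false;

/* Remember a bad reference; abort at once if we are already consistent. */
static void
note_invalid_page(uint32_t relfilenode, uint32_t blkno)
{
	if (reached_consistency)
	{
		fprintf(stderr, "WAL contains references to invalid pages\n");
		abort();				/* the backend elog(PANIC)s here instead */
	}
	if (n_invalid < (int) (sizeof(invalid_pages) / sizeof(invalid_pages[0])))
		invalid_pages[n_invalid++] = (InvalidPage) {relfilenode, blkno};
}

/* Run once the minimum recovery point has been replayed. */
static void
check_invalid_pages(void)
{
	if (n_invalid > 0)
	{
		fprintf(stderr, "%d invalid page reference(s) during recovery\n",
				n_invalid);
		abort();
	}
	reached_consistency = true;	/* any later hit is immediately fatal */
}

int
main(void)
{
	/* Before consistency, bad references would merely be recorded ... */
	check_invalid_pages();
	/* ... afterwards a single bad reference aborts replay, as in this bug. */
	note_invalid_page(212875, 179503104);
	return 0;
}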
I will keep this instance around for further investigation and would be
happy to provide some more details if you need.
That would be nice!
--
Michael
Hi,
I've figured out what is going on.
On this server we have a background worker, which is started from
shared_preload_libraries.
In order to debug and reproduce it, I removed everything from the
background worker code except _PG_init, worker_main and a couple of
signal handler functions.
Here is the code:
void
worker_main(Datum main_arg)
{
	pqsignal(SIGHUP, bg_mon_sighup);
	pqsignal(SIGTERM, bg_mon_sigterm);

	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
		proc_exit(1);

	BackgroundWorkerUnblockSignals();

	BackgroundWorkerInitializeConnection("postgres", NULL);

	while (!got_sigterm)
	{
		int rc = WaitLatch(MyLatch,
						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
						   naptime * 1000L);

		ResetLatch(MyLatch);

		if (rc & WL_POSTMASTER_DEATH)
			proc_exit(1);
	}

	proc_exit(1);
}

void
_PG_init(void)
{
	BackgroundWorker worker;

	if (!process_shared_preload_libraries_in_progress)
		return;

	worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
		BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = 1;
	worker.bgw_main = worker_main;
	worker.bgw_notify_pid = 0;
	snprintf(worker.bgw_name, BGW_MAXLEN, "my_worker");

	RegisterBackgroundWorker(&worker);
}
Most of this code is taken from "worker_spi.c".
Basically, it just initializes a connection to the postgres database and
then sleeps all the time.
If I comment out the 'BackgroundWorkerInitializeConnection("postgres",
NULL);' call, postgres starts without any problem.
Which is very strange, because the background worker itself is not doing anything...
And one more thing: if I add sleep(15) before calling
BackgroundWorkerInitializeConnection, postgres manages to start
successfully.
Is there a very strange race condition here?
Regards,
--
Alexander Kukushkin
On 2018-Aug-22, Alexander Kukushkin wrote:
If I comment out the 'BackgroundWorkerInitializeConnection("postgres",
NULL);' call, postgres starts without any problem.
Which is very strange, because the background worker itself is not doing anything...
And one more thing: if I add sleep(15) before calling
BackgroundWorkerInitializeConnection, postgres manages to start
successfully.
Is there a very strange race condition here?
Sounds likely. I suggest to have a look at what's going on inside the
postmaster process when it gets stuck.
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
2018-08-22 16:44 GMT+02:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
Sounds likely. I suggest to have a look at what's going on inside the
postmaster process when it gets stuck.
Well, it doesn't get stuck; it aborts startup with the following message:
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,23,,2018-08-22
14:26:10 UTC,1/0,0,WARNING,01000,"page 179503104 of relation
base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for
Btree/DELETE: 182 items",,,,""
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22
14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid
pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
items",,,,""
2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,3,,2018-08-22
14:26:10 UTC,,0,LOG,00000,"startup process (PID 28485) was terminated
by signal 6: Aborted",,,,,,,,,""
2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,4,,2018-08-22
14:26:10 UTC,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
2018-08-22 14:26:42.290 UTC,,,28483,,5b7d7282.6f43,5,,2018-08-22
14:26:10 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""
Regards,
--
Alexander Kukushkin
On 2018-Aug-22, Alexander Kukushkin wrote:
2018-08-22 16:44 GMT+02:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
Sounds likely. I suggest to have a look at what's going on inside the
postmaster process when it gets stuck.
Well, it doesn't get stuck; it aborts startup with the following message:
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,23,,2018-08-22
14:26:10 UTC,1/0,0,WARNING,01000,"page 179503104 of relation
base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for
Btree/DELETE: 182 items",,,,""
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22
14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid
pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
items",,,,""
2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,3,,2018-08-22
14:26:10 UTC,,0,LOG,00000,"startup process (PID 28485) was terminated
by signal 6: Aborted",,,,,,,,,""
Oh, that's weird ... sounds like the fact that the bgworker starts
somehow manages to corrupt the list of invalid pages in the startup
process. That doesn't make any sense ...
ENOTIME for a closer look ATM, though, sorry. Maybe you could try
running under valgrind?
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 22 Aug 2018 at 17:08, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2018-Aug-22, Alexander Kukushkin wrote:
2018-08-22 16:44 GMT+02:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
Sounds likely. I suggest to have a look at what's going on inside the
postmaster process when it gets stuck.
Well, it doesn't get stuck; it aborts startup with the following message:
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,23,,2018-08-22
14:26:10 UTC,1/0,0,WARNING,01000,"page 179503104 of relation
base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for
Btree/DELETE: 182 items",,,,""
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22
14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid
pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
items",,,,""
2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,3,,2018-08-22
14:26:10 UTC,,0,LOG,00000,"startup process (PID 28485) was terminated
by signal 6: Aborted",,,,,,,,,""
Oh, that's weird ... sounds like the fact that the bgworker starts
somehow manages to corrupt the list of invalid pages in the startup
process. That doesn't make any sense ...
We can see that the crash itself happened because in XLogReadBufferExtended, at
`if (PageIsNew(page))` (xlogutils.c:512), we got a page that apparently
wasn't initialized yet, and, since we've reached a consistent state,
log_invalid_page panics.
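For context, both ways XLogReadBufferExtended() can end up in log_invalid_page() in RBM_NORMAL mode look roughly like this (condensed and paraphrased from 9.6's xlogutils.c, not a verbatim excerpt); either branch becomes a PANIC once consistency has been reached:

	/* Condensed paraphrase of XLogReadBufferExtended() for RBM_NORMAL */
	lastblock = smgrnblocks(smgr, forknum);

	if (blkno < lastblock)
	{
		/* the block is within the file: read it into shared buffers */
		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode, NULL);
	}
	else
	{
		/* the block is beyond the current end of the file */
		log_invalid_page(rnode, forknum, blkno, false);	/* "does not exist" */
		return InvalidBuffer;
	}

	/* the block exists on disk but may never have been initialized */
	page = (Page) BufferGetPage(buffer);
	if (PageIsNew(page))
	{
		ReleaseBuffer(buffer);
		log_invalid_page(rnode, forknum, blkno, true);	/* "is uninitialized" */
		return InvalidBuffer;
	}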
ENOTIME for a closer look ATM, though, sorry. Maybe you could try
running under valgrind?
Could you please elaborate on what we could find using valgrind in this case,
some memory leaks? In any case there is a chance that everything will look
fine, since even slow tracing under gdb makes this race condition
disappear.
On 2018-Aug-22, Dmitry Dolgov wrote:
ENOTIME for a closer look ATM, though, sorry. Maybe you could try
running under valgrind?
Could you elaborate please, what can we find using valgrind in this case, some
memory leaks? In any way there is a chance that everything will be ok, since
even just a slow tracing under gdb leads to disappearing of this race
condition.
I don't know. I was thinking it could detect some invalid write in
shared memory.
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 22, 2018 at 02:50:34PM -0300, Alvaro Herrera wrote:
On 2018-Aug-22, Dmitry Dolgov wrote:
I don't know. I was thinking it could detect some invalid write in
shared memory.
invalid_page_tab does not use HASH_SHARED_MEM. What did you have in
mind?
On my side, I have been letting a primary/standby set run for a while
with the following:
- pgbench load on the primary.
- a background worker connecting to a database, just running in a loop and
sleeping. It uses BgWorkerStart_ConsistentState as its start mode.
- the standby gets stopped in immediate mode, and restarted after some
time to let some activity happen after the recovery point has been
reached.
After close to an hour of repeated tests, I am not able to reproduce
that. Maybe there are some specifics in your schema causing certain
types of records to be created. Could you check whether the set of WAL
segments replayed contains references to the block the hash table is
complaining about?
--
Michael
Hi Michael,
It aborts not right after reaching a consistent state, but a little bit later:
"consistent recovery state reached at AB3/4A1B3118"
"WAL contains references to invalid pages",,,,,"xlog redo at
AB3/50323E78 for Btree/DELETE: 182 items"
These are in WAL segments 0000000500000AB30000004A and
0000000500000AB300000050 respectively.
Also I managed to get a backtrace:
Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fd3f1deb801 in __GI_abort () at abort.c:79
#2 0x00005644f7094aaf in errfinish (dummy=dummy@entry=0) at
./build/../src/backend/utils/error/elog.c:557
#3 0x00005644f7096a49 in elog_finish (elevel=elevel@entry=22,
fmt=fmt@entry=0x5644f70e5998 "WAL contains references to invalid
pages") at ./build/../src/backend/utils/error/elog.c:1378
#4 0x00005644f6d8ec18 in log_invalid_page (node=...,
forkno=forkno@entry=MAIN_FORKNUM, blkno=blkno@entry=179503104,
present=<optimized out>) at
./build/../src/backend/access/transam/xlogutils.c:95
#5 0x00005644f6d8f168 in XLogReadBufferExtended (rnode=...,
forknum=forknum@entry=MAIN_FORKNUM, blkno=179503104,
mode=mode@entry=RBM_NORMAL) at
./build/../src/backend/access/transam/xlogutils.c:515
#6 0x00005644f6d59448 in btree_xlog_delete_get_latestRemovedXid
(record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:593
#7 btree_xlog_delete (record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:682
#8 btree_redo (record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:1012
#9 0x00005644f6d83c34 in StartupXLOG () at
./build/../src/backend/access/transam/xlog.c:6967
#10 0x00005644f6f27953 in StartupProcessMain () at
./build/../src/backend/postmaster/startup.c:216
#11 0x00005644f6d925a7 in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7fff1f893a80) at
./build/../src/backend/bootstrap/bootstrap.c:419
#12 0x00005644f6f24983 in StartChildProcess (type=StartupProcess) at
./build/../src/backend/postmaster/postmaster.c:5300
#13 0x00005644f6f27437 in PostmasterMain (argc=16,
argv=0x5644f7d07f10) at
./build/../src/backend/postmaster/postmaster.c:1321
#14 0x00005644f6d08b5b in main (argc=16, argv=0x5644f7d07f10) at
./build/../src/backend/main/main.c:228
From looking at xlogutils.c I actually figured out that this backtrace
is a bit misleading, because execution can't reach line 515.
So I checked out the REL9_6_10 tag and built it with -O0; now it makes more sense.
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f9ae0777801 in __GI_abort () at abort.c:79
#2 0x000055c9f285f56d in errfinish (dummy=0) at elog.c:557
#3 0x000055c9f2861d7c in elog_finish (elevel=22, fmt=0x55c9f28cbff0
"WAL contains references to invalid pages") at elog.c:1378
#4 0x000055c9f2424e76 in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5 0x000055c9f24258ce in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6 0x000055c9f23de8ce in btree_xlog_delete_get_latestRemovedXid
(record=0x55c9f3cbf438) at nbtxlog.c:593
#7 0x000055c9f23dea88 in btree_xlog_delete (record=0x55c9f3cbf438) at
nbtxlog.c:682
#8 0x000055c9f23df857 in btree_redo (record=0x55c9f3cbf438) at nbtxlog.c:1012
#9 0x000055c9f24160e8 in StartupXLOG () at xlog.c:6967
#10 0x000055c9f267d9f6 in StartupProcessMain () at startup.c:216
#11 0x000055c9f242a276 in AuxiliaryProcessMain (argc=2,
argv=0x7ffea746c260) at bootstrap.c:419
#12 0x000055c9f267c93e in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055c9f2677469 in PostmasterMain (argc=16,
argv=0x55c9f3c91690) at postmaster.c:1321
#14 0x000055c9f25c1515 in main (argc=16, argv=0x55c9f3c91690) at main.c:228
After close to an hour of repeated tests, I am not able to reproduce
that. Maybe there are some specifics in your schema causing more
certain types of records to be created. Could you check if in the set
of WAL segments replayed you have references to the block the hash table
is complaining about?
I am sorry, but I don't really know where to look :(
Regards,
--
Alexander Kukushkin
Hi,
I think I am getting closer:
If I start postgres without the background worker, then the following code is reached:
// https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtxlog.c;h=c536e224321dbc46574aa76876b9d49fa4b5e9a4;hb=REL9_6_10#l549
if (CountDBBackends(InvalidOid) == 0)
    return latestRemovedXid;
And it returns early from the btree_xlog_delete_get_latestRemovedXid function.
If there is a background worker, then CountDBBackends(InvalidOid) = 1,
and as a result it aborts.
The comment before the if statement clearly mentions a possible race condition:
/*
 * If there's nothing running on the standby we don't need to derive a
 * full latestRemovedXid value, so use a fast path out of here.  This
 * returns InvalidTransactionId, and so will conflict with all HS
 * transactions; but since we just worked out that that's zero people,
 * it's OK.
 *
 * XXX There is a race condition here, which is that a new backend might
 * start just after we look.  If so, it cannot need to conflict, but this
 * coding will result in throwing a conflict anyway.
 */
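Putting Alexander's findings together, the path being exercised looks roughly like this (condensed from 9.6's btree_xlog_delete_get_latestRemovedXid() in nbtxlog.c; declarations, locking and error handling trimmed):

	/* Condensed flow of btree_xlog_delete_get_latestRemovedXid() in 9.6 */
	xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);

	/*
	 * Fast path: with zero backends connected there is nobody to conflict
	 * with, so the heap lookups below are skipped entirely.  A connected
	 * background worker (or any client) makes CountDBBackends() return a
	 * value greater than zero, and the slow path below runs instead.
	 */
	if (CountDBBackends(InvalidOid) == 0)
		return latestRemovedXid;

	/* Read the index page the WAL record refers to ... */
	XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
	ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
	ipage = (Page) BufferGetPage(ibuffer);

	/* ... and chase each deleted index tuple to its heap block. */
	unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
	for (i = 0; i < xlrec->nitems; i++)
	{
		iitemid = PageGetItemId(ipage, unused[i]);
		itup = (IndexTuple) PageGetItem(ipage, iitemid);
		hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
		hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
										 hblkno, RBM_NORMAL);
		/* a zeroed line pointer here yields a garbage hblkno, as the
		 * follow-up message in this thread shows */
	}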
Regards,
--
Alexander Kukushkin
Hi,
btree_xlog_delete_get_latestRemovedXid is called only when postgres is
already running InHotStandby.
What this function actually does is read the index page from disk and
iterate over its IndexTuple entries to find the corresponding heap pages.
The problem is that xlrec->nitems = 182 in my case, but on the index
page there are only 56 tuples and the rest of the page is filled with
zeros:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007f07a2be2801 in __GI_abort () at abort.c:79
#2 0x000055938000f586 in errfinish (dummy=0) at elog.c:557
#3 0x0000559380011d95 in elog_finish (elevel=22, fmt=0x55938007c010
"WAL contains references to invalid pages") at elog.c:1378
#4 0x000055937fbd4e8f in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5 0x000055937fbd58e7 in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6 0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
#7 0x000055937fb8eaa1 in btree_xlog_delete (record=0x55938151b438) at
nbtxlog.c:688
#8 0x000055937fb8f870 in btree_redo (record=0x55938151b438) at nbtxlog.c:1018
#9 0x000055937fbc6101 in StartupXLOG () at xlog.c:6967
#10 0x000055937fe2da0f in StartupProcessMain () at startup.c:216
#11 0x000055937fbda28f in AuxiliaryProcessMain (argc=2,
argv=0x7ffe88f10df0) at bootstrap.c:419
#12 0x000055937fe2c957 in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055937fe27482 in PostmasterMain (argc=16,
argv=0x5593814ed690) at postmaster.c:1321
#14 0x000055937fd7152e in main (argc=16, argv=0x5593814ed690) at main.c:228
(gdb) frame 6
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid (record=0x55938151b438) at nbtxlog.c:599
599			hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM, hblkno, RBM_NORMAL);
(gdb) l 587
582			 * Loop through the deleted index items to obtain the TransactionId from
583			 * the heap items they point to.
584			 */
585			unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
586
587			for (i = 0; i < xlrec->nitems; i++)
588			{
589				/*
590				 * Identify the index tuple about to be deleted
591				 */
(gdb) l
592				iitemid = PageGetItemId(ipage, unused[i]);
593				itup = (IndexTuple) PageGetItem(ipage, iitemid);
594
595				/*
596				 * Locate the heap page that the index tuple points at
597				 */
598				hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
599				hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM, hblkno, RBM_NORMAL);
600				if (!BufferIsValid(hbuffer))
601				{
(gdb) p i
$45 = 53
(gdb) p unused[i]
$46 = 57
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}
Therefore iitemid points to the beginning of the index page, which
doesn't make sense.
Why is the number of tuples in the xlog record greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!
(gdb) p *(PageHeader)ipage
$49 = {pd_lsn = {xlogid = 2739, xrecoff = 1455373160}, pd_checksum = 3042, pd_flags = 0, pd_lower = 248, pd_upper = 5936, pd_special = 8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp = 0x7f06ed9e9598}
Index page LSN: AB3/56BF3B68
Current LSN: AB3/50323E78
Is there a way to recover from such a situation? Should postgres
in such a case compare the LSNs and, if the LSN on the page is
higher than the current LSN, simply return InvalidTransactionId?
Apparently, if there are no connections open, postgres simply doesn't
run this code and everything seems ok.
Regards,
--
Alexander Kukushkin
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
Why the number of tuples in the xlog is greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!
That's annoying. Because that means that the control file of your
server maps to a consistent point which is older than some of the
relation pages. How was the base backup of this node created? Please
remember that when taking a base backup from a standby, you should
back up the control file last, as there is no end-of-backup record
available in that case. So it seems to me that the origin of your
problem comes from an incorrect base backup procedure?
Is there a way to recover from such a situation? Should the postgres
in such case do comparison of LSNs and if the LSN on the page is
higher than the current LSN simply return InvalidTransactionId?
Apparently, if there are no connections open postgres simply is not
running this code and it seems ok.
One idea I have would be to copy all the WAL segments up to the point
where the pages to-be-updated are, and let Postgres replay all the local
WALs first. However it is hard to say if that would be enough, as you
could have more references to pages even newer than the btree one you
just found.
--
Michael
On August 28, 2018 11:44:09 AM GMT+09:00, Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
Why the number of tuples in the xlog is greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!
That's annoying. Because that means that the control file of your
server maps to a consistent point which is older than some of the
relation pages. How was the base backup of this node created? Please
remember that when taking a base backup from a standby, you should
backup the control file last, as there is no control of end backup with
records available. So it seems to me that the origin of your problem
comes from an incorrect base backup expectation?
Uh, where is that "control file last" bit coming from?
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hi Michael,
That's annoying. Because that means that the control file of your
server maps to a consistent point which is older than some of the
relation pages. How was the base backup of this node created? Please
remember that when taking a base backup from a standby, you should
backup the control file last, as there is no control of end backup with
records available. So it seems to me that the origin of your problem
comes from an incorrect base backup expectation?
We are running a cluster of 3 nodes (m4.large + EBS volume for
PGDATA) on AWS. The replicas were initialized about a year ago with
pg_basebackup and have been working absolutely fine. In the past year I
did a few minor upgrades with switchover (first upgrade the replicas,
switch over, and then upgrade the former primary). The last switchover
was done on August 19th. This instance was working as a replica for
about three days until the sudden crash of the EC2 instance. On the new
instance we attached the existing EBS volume with the existing PGDATA and
tried to start postgres. You can see the consequences in the very first
email.
One idea I have would be to copy all the WAL segments up to the point
where the pages to-be-updated are, and let Postgres replay all the local
WALs first. However it is hard to say if that would be enough, as you
could have more references to pages even newer than the btree one you
just found.
Well, I did some experiments, among them the approach you suggest,
i.e. I commented out restore_command in recovery.conf and copied
quite a few WAL segments into pg_xlog. The results are the same: it
aborts as long as there are connections open :(
Regards,
--
Alexander Kukushkin
Greetings,
* Michael Paquier (michael@paquier.xyz) wrote:
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
Why the number of tuples in the xlog is greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!
That's annoying. Because that means that the control file of your
server maps to a consistent point which is older than some of the
relation pages. How was the base backup of this node created? Please
remember that when taking a base backup from a standby, you should
backup the control file last, as there is no control of end backup with
records available. So it seems to me that the origin of your problem
comes from an incorrect base backup expectation?
Yeah, we really need to know how this system was built. If it was built
using the low-level pg_start/stop_backup API, then which one was used: the
old one that creates a backup_label file, or the new API which requires
the user to create the backup_label file? I haven't followed this
thread very closely, but I wonder if maybe the new API was used but no
backup_label file was created, making PG think it was doing crash
recovery instead of restoring from a file-level backup...
Thanks!
Stephen
Greetings,
* Andres Freund (andres@anarazel.de) wrote:
On August 28, 2018 11:44:09 AM GMT+09:00, Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
Why the number of tuples in the xlog is greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!
That's annoying. Because that means that the control file of your
server maps to a consistent point which is older than some of the
relation pages. How was the base backup of this node created? Please
remember that when taking a base backup from a standby, you should
backup the control file last, as there is no control of end backup with
records available. So it seems to me that the origin of your problem
comes from an incorrect base backup expectation?
Uh, where is that "control file last" bit coming from?
pg_basebackup copies it last. The comments should probably be improved
as to *why* but my recollection is that it's, at least in part, to
ensure the new cluster can't be used until it's actually a complete
backup.
Thanks!
Stephen
On Tue, Aug 28, 2018 at 08:23:11AM -0400, Stephen Frost wrote:
* Andres Freund (andres@anarazel.de) wrote:
Uh, where is that "control file last" bit coming from?
pg_basebackup copies it last. The comments should probably be improved
as to *why* but my recollection is that it's, at least in part, to
ensure the new cluster can't be used until it's actually a complete
backup.
What we have now is mainly in basebackup.c. See 8366c780 which
introduced that. Stephen has that right: as we cannot rely on an
end-of-backup record when taking a backup from a standby, copying the
control file last ensures that the consistent point is late
enough that no other pages are inconsistent. Even with that, I think
that there is still a small race condition, but I cannot put my finger on
it now. I agree that the current comments do a bad job of explaining why
this happens. That's actually something I discovered when discussing what
resulted in f267c1c2.
--
Michael
Hi Michael,
it looks like you missed my answer:
/messages/by-id/CAFh8B=m0Bht-BfKmyzfxcivzjcqRd7BbNHeWthDveWwZ+DrV2A@mail.gmail.com
Regards,
--
Alexander
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
Is there a way to recover from such a situation? Should the postgres
in such case do comparison of LSNs and if the LSN on the page is
higher than the current LSN simply return InvalidTransactionId?
Hmm. That does not sound right to me. If the page has a header LSN
higher than the one being replayed, we should not see that, as consistency
has normally been reached. XLogReadBufferExtended() seems to complain in
your case about a page which should not exist per the information in
your backtrace. What's the length of the relation file at this point? If
the relation is considered as having fewer blocks, shouldn't we just
ignore it if we're trying to delete items on it and return
InvalidTransactionId? I have to admit that I am not the best specialist
with this code.
hblkno also looks insanely high to me if you look at the other blkno
references you mention upthread.
Apparently, if there are no connections open postgres simply is not
running this code and it seems ok.
Yeah, that's used for standby conflicts.
--
Michael
Hi,
2018-08-29 8:17 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
Hmm. That does not sound right to me. If the page has a header LSN
higher than the one replayed, we should not see it as consistency has
normally been reached. XLogReadBufferExtended() seems to complain in
your case about a page which should not exist per the information of
your backtrace. What's the length of relation file at this point? If
the relation is considered as having less blocks, shouldn't we just
ignore it if we're trying to delete items on it and return
InvalidTransactionId? I have to admit that I am not the best specialist
with this code.
This is what pg_controldata reports:
Latest checkpoint location: AB3/4A1B30A8
Prior checkpoint location: AB3/4A1B30A8
Latest checkpoint's REDO location: AB3/4149B340
Latest checkpoint's REDO WAL file: 0000000500000AB300000041
Latest checkpoint's TimeLineID: 5
Latest checkpoint's PrevTimeLineID: 5
Latest checkpoint's full_page_writes: on
Minimum recovery ending location: AB3/4A1B3118
Min recovery ending loc's timeline: 5
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
Therefore it reaches consistency at AB3/4A1B3118 and starts accepting
connections.
A few moments later it starts replaying 0000000500000AB300000050,
where "AB3/50323E78" contains "Btree/DELETE: 182".
This record in the pg_xlogdump looks like:
rmgr: Btree       len (rec/tot): 436/547, tx: 518902810, lsn: AB3/50323E78, prev AB3/50323E20, desc: DELETE 182 items, blkref #0: rel 1663/18055/212907 blk 72478 FPW
212907 -- btree index:
$ ls -al data/base/18055/212907*
-rw------- 1 akukushkin akukushkin 594812928 Aug 22 07:22 data/base/18055/212907
-rw------- 1 akukushkin akukushkin     163840 Aug 22 07:22 data/base/18055/212907_fsm
212875 -- the table:
$ ls -al data/base/18055/212875*
-rw------- 1 akukushkin akukushkin 1073741824 Aug 22 07:20 data/base/18055/212875
-rw------- 1 akukushkin akukushkin 1014947840 Aug 22 07:21 data/base/18055/212875.1
-rw------- 1 akukushkin akukushkin     532480 Aug 22 07:21 data/base/18055/212875_fsm
-rw------- 1 akukushkin akukushkin      65536 Aug 22 07:21 data/base/18055/212875_vm
As you can see its size is only about 2GB.
hblkno looks also unsanely high to me if you look at the other blkno
references you are mentioning upthread.
hblkno is calculated from the data which postgres reads from indexfile
data/base/18055/212907, block 72478.
This block has only 56 index tuples, while there expected to be at
least 182, therefore starting from tuple 57 the tuple pointer starts
looking as:
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}
Basically it points to the beginning of the page, what doesn't make
sense at all.
Why the block 72478 of index relfile doesn't meet our expectations
(contains so few tuples)?
The answer to this question is in the page header. LSN, written in the
indexpage header is AB3/56BF3B68.
That has only one meaning, while the postgres was working before the
crash it managed to apply WAL stream til at least AB3/56BF3B68, what
is far ahead of "Minimum recovery ending location: AB3/4A1B3118".
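To check which on-disk pages are ahead of the minimum recovery point without attaching gdb, one can read pd_lsn straight from the relation file. The following is a hypothetical standalone diagnostic (not part of PostgreSQL); it assumes the default 8 kB block size and a little-endian host, and the file path in the usage line is only an example.

/*
 * Hypothetical diagnostic: print the pd_lsn stored in the header of one
 * 8 kB block of a relation file, so it can be compared against the
 * "Minimum recovery ending location" reported by pg_controldata.
 *
 * Usage: ./page_lsn base/18055/212907 72478
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192             /* default PostgreSQL block size */

int
main(int argc, char **argv)
{
	if (argc != 3)
	{
		fprintf(stderr, "usage: %s <relation-file> <block-number>\n", argv[0]);
		return 1;
	}

	FILE *f = fopen(argv[1], "rb");
	if (f == NULL)
	{
		perror("fopen");
		return 1;
	}

	long blkno = strtol(argv[2], NULL, 10);
	unsigned char page[BLCKSZ];

	if (fseek(f, blkno * (long) BLCKSZ, SEEK_SET) != 0 ||
		fread(page, 1, BLCKSZ, f) != BLCKSZ)
	{
		fprintf(stderr, "could not read block %ld\n", blkno);
		fclose(f);
		return 1;
	}
	fclose(f);

	/*
	 * The first 8 bytes of the page header are PageXLogRecPtr: a uint32
	 * xlogid (high half) followed by a uint32 xrecoff (low half), stored
	 * in host byte order.  Assumes the file was written on a little-endian
	 * machine, as in this report.
	 */
	uint32_t xlogid, xrecoff;
	memcpy(&xlogid, page, 4);
	memcpy(&xrecoff, page + 4, 4);

	/* Same %X/%X notation as pg_controldata and pg_xlogdump */
	printf("block %ld: pd_lsn = %X/%X\n", blkno, xlogid, xrecoff);
	return 0;
}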
Regards,
--
Alexander Kukushkin
On Wed, Aug 29, 2018 at 08:59:16AM +0200, Alexander Kukushkin wrote:
Why the block 72478 of index relfile doesn't meet our expectations
(contains so few tuples)?
The answer to this question is in the page header. LSN, written in the
indexpage header is AB3/56BF3B68.
That has only one meaning, while the postgres was working before the
crash it managed to apply WAL stream til at least AB3/56BF3B68, what
is far ahead of "Minimum recovery ending location: AB3/4A1B3118".
Yeah, that's the pinpoint. Do you know by chance what the content
of the control file was for each standby you upgraded to 9.6.10 before
starting them with the new binaries? You mentioned a cluster of three
nodes, so I guess that you have two standbys, and that one of them did
not see the symptoms discussed here, while the other did. Do you
still have the logs of the recovery just after starting the other
standby with 9.4.10 which did not see the symptom? All your standbys
are using the background worker which would cause the btree deletion
code to be exercised, right?
I am trying to work on a reproducer with a bgworker starting once the
consistent recovery point has been reached, without success yet. Does
your cluster generate some XLOG_PARAMETER_CHANGE records? In some cases,
9.4.8 could have updated minRecoveryPoint to go backward, which is
something that 8d68ee6 has been working on addressing.
Did you also try to use local WAL segments up to where AB3/56BF3B68 is
applied, and also have a restore_command so that extra WAL segments
would be fetched from the archive?
--
Michael
Hi,
2018-08-29 14:10 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
Yeah, that's the pinpoint. Do you know by chance what was the content
of the control file for each standby you have upgraded to 9.6.10 before
starting them with the new binaries? You mentioned a cluster of three
No, I don't. Right after the upgrade they started normally and have
been working for a few days. I believe the control file was overwritten
a few hundred times before the instance crashed.
nodes, so I guess that you have two standbys, and that one of them did
not see the symptoms discussed here, while the other saw them. Do you
The other node didn't crash and is still working.
still have the logs of the recovery just after starting the other
standby with 9.4.10 which did not see the symptom? All your standbys
I don't think it is really related to the minor upgrade. After the
upgrade the whole cluster was running for about 3 days.
Every day it generates about 2000 WAL segments; the total volume of
daily WAL is very close to the size of the cluster, which is 38GB.
are using the background worker which would cause the btree deletion
code to be scanned, right?
Well, any open connection to the database will produce the same
result. In our case we are using Patroni for automatic failover, which
connects immediately after postgres has started and keeps this
connection permanently open. The background worker just happened to be
faster than anything else.
I am trying to work on a reproducer with a bgworker starting once
recovery has been reached, without success yet. Does your cluster
generate some XLOG_PARAMETER_CHANGE records? In some cases, 9.4.8 could
have updated minRecoveryPoint to go backward, which is something that
8d68ee6 has been working on addressing.
No, it doesn't.
Did you also try to use local WAL segments up where AB3/56BF3B68 is
applied, and also have a restore_command so as extra WAL segment fetches
from the archive would happen?
If there are no connections open, it applies the necessary amount of WAL
segments (with the help of restore_command, of course) and reaches
real consistency. After that, it is possible to connect and the
startup process doesn't abort anymore.
BTW, I am thinking that we should return InvalidTransactionId from
btree_xlog_delete_get_latestRemovedXid if the index page we read from
disk is newer than the xlog record we are currently processing. Please
see the attached patch.
--
Alexander Kukushkin
Attachment: nbtxlog.c.patch (text/x-patch)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index c536e22432..8bc6dede34 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -572,6 +572,10 @@ btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)
 	LockBuffer(ibuffer, BT_READ);
 	ipage = (Page) BufferGetPage(ibuffer);
 
+	/* Index page is newer than xlog record */
+	if (record->EndRecPtr <= PageGetLSN(ipage))
+		return InvalidTransactionId;
+
 	/*
 	 * Loop through the deleted index items to obtain the TransactionId from
 	 * the heap items they point to.
On Wed, Aug 29, 2018 at 03:05:25PM +0200, Alexander Kukushkin wrote:
BTW, I am thinking that we should return InvalidTransactionId from the
btree_xlog_delete_get_latestRemovedXid if the index page we read from
disk is newer then xlog record we are currently processing. Please see
the patch attached.
This is not a solution in my opinion, as you could invalidate the
activities of backends connected to the database when the incorrect
consistent point allows connections to come in too early.
What happens if you replay with hot_standby = on up to the latest point,
without any concurrent connections, then issue a checkpoint on the
standby once you get to a point newer than the complaint, and finally
restart the standby with the bgworker?
Another idea I have would be to make the standby promote, issue a
checkpoint on it, and then use pg_rewind as a trick to update the
control file to a point newer than the inconsistency. As PG < 9.6.10
could make the minimum recovery point go backwards, applying the upgrade
after the consistent point got to an incorrect state would trigger the
failure.
--
Michael
Hello hackers!
It seems the bgwriter running on replicas is broken by commit
8d68ee6, and as a result the bgwriter never updates minRecoveryPoint in
pg_control. Please see a detailed explanation below.
2018-08-29 22:54 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
This is not a solution in my opinion, as you could invalidate activities
of backends connected to the database when the incorrect consistent
point allows connections to come in too early.
That's true, but I still think it is better than aborting the startup process...
What happens if you replay with hot_standby = on up to the latest point,
without any concurrent connections, then issue a checkpoint on the
standby once you got to a point newer than the complain, and finally
restart the standby with the bgworker?
Another idea I have would be to make the standby promote, issue a
checkpoint on it, and then use pg_rewind as a trick to update the
control file to a point newer than the inconsistency. As PG < 9.6.10
could make the minimum recovery point go backwards, applying the upgrade
after the consistent point got to an incorrect state would trigger the
failure.
Well, all these actions would probably help relieve the symptoms and
replay WAL up to the point where it becomes really consistent.
I was trying for a long time to figure out how it could happen that some
of the pages were written to disk but pg_control wasn't updated, and
I think I managed to put all the pieces of the puzzle into a nice picture:
static void
UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
{
	/* Quick check using our local copy of the variable */
	if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
		return;

	/*
	 * An invalid minRecoveryPoint means that we need to recover all the WAL,
	 * i.e., we're doing crash recovery.  We never modify the control file's
	 * value in that case, so we can short-circuit future checks here too. The
	 * local values of minRecoveryPoint and minRecoveryPointTLI should not be
	 * updated until crash recovery finishes.
	 */
	if (XLogRecPtrIsInvalid(minRecoveryPoint))
	{
		updateMinRecoveryPoint = false;
		return;
	}
This code was changed by commit 8d68ee6, and it broke the bgwriter. Now
the bgwriter never updates pg_control when it flushes dirty pages to disk.
How it happens:
When the bgwriter starts, minRecoveryPoint is not initialized; if I
attach with gdb, it shows that the value of minRecoveryPoint is 0,
therefore it is invalid.
As a result, updateMinRecoveryPoint is set to false, and on the next
call of UpdateMinRecoveryPoint from the bgwriter it returns from the
function after the very first if.
The bgwriter itself never changes updateMinRecoveryPoint back to true,
and boom: we can get a lot of pages written to disk, but minRecoveryPoint
in pg_control is never updated!
If the replica happens to crash in such conditions, it reaches
consistency much earlier than it should!
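A toy trace of the sequence described above may help: the bgwriter's local copy of minRecoveryPoint starts out invalid, so its very first call disables all later control-file updates from that process. This is a model only, with invented names, not backend code.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr	0

/* value written by the startup process at the last restartpoint */
static XLogRecPtr	controlfile_min_recovery_point = 0x0AB34A1B3118ULL;

/* per-process state, as in the backend */
static XLogRecPtr	local_min_recovery_point = InvalidXLogRecPtr;
static bool			update_min_recovery_point = true;

static void
update_min_recovery_point_model(XLogRecPtr lsn)
{
	/* quick exit using the local copy, control file untouched */
	if (!update_min_recovery_point || lsn <= local_min_recovery_point)
		return;

	/* pre-fix behaviour in any non-startup process: give up forever */
	if (local_min_recovery_point == InvalidXLogRecPtr)
	{
		update_min_recovery_point = false;
		return;
	}

	controlfile_min_recovery_point = lsn;	/* never reached by the bgwriter */
}

int
main(void)
{
	/* the bgwriter flushes pages at ever newer LSNs ... */
	update_min_recovery_point_model(0x0AB350000000ULL);
	update_min_recovery_point_model(0x0AB356BF3B68ULL);

	/* ... yet the control file still claims the old consistent point */
	printf("control file minRecoveryPoint = %X/%X\n",
		   (unsigned int) (controlfile_min_recovery_point >> 32),
		   (unsigned int) controlfile_min_recovery_point);
	return 0;
}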
Regards,
--
Alexander Kukushkin
On Thu, Aug 30, 2018 at 10:55:23AM +0200, Alexander Kukushkin wrote:
Bgwriter itself never changes updateMinRecoveryPoint to true and boom,
we can get a lot of pages written to disk, but minRecoveryPoint in the
pg_control won't be updated!
That's indeed obvious from reading the code. The bgwriter is
started only once a consistent point has been reached, so the startup
process would normally have already updated the control file to the
consistent point. Something like the attached should take care of the
problem. As the updates of the local copy of minRecoveryPoint strongly
rely on whether the startup process is used, I think that we should use
InRecovery for the sanity checks.
I'd also like to add a TAP test for that, which should be easy enough if
we do sanity checks by looking at the output of the control file.
I'll try to put more thought into that.
Does it take care of the problem?
--
Michael
Attachment: min-consistent-update-v1.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..f9c52fb9d8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,14 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	 * i.e., we're doing crash recovery. We never modify the control file's
 	 * value in that case, so we can short-circuit future checks here too. The
 	 * local values of minRecoveryPoint and minRecoveryPointTLI should not be
-	 * updated until crash recovery finishes.
+	 * updated until crash recovery finishes. We only do this for the startup
+	 * process replaying WAL which needs to keep clean references of
+	 * minRecoveryPoint until it is made sure that it has replayed all WAL
+	 * available for crash recovery. Other processes would be started after
+	 * the consistency point has been reached on a standby, so it would be
+	 * actually safe to update the control file in this case.
 	 */
-	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+	if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 	{
 		updateMinRecoveryPoint = false;
 		return;
@@ -2737,7 +2742,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	minRecoveryPoint = ControlFile->minRecoveryPoint;
 	minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-	if (force || minRecoveryPoint < lsn)
+	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		updateMinRecoveryPoint = false;
+	else if (force || minRecoveryPoint < lsn)
 	{
 		XLogRecPtr	newMinRecoveryPoint;
 		TimeLineID	newMinRecoveryPointTLI;
Hi,
2018-08-30 15:39 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
That's indeed obvious by reading the code. The bgwriter would be
started only once a consistent point has been reached, so the startup
process would have normally already updated the control file to the
consistent point. Something like the attached should take care of the
problem. As the updates of the local copy of minRecoveryPoint strongly
rely on if the startup process is used, I think that we should use
InRecovery for the sanity checks.
I'd like to also add a TAP test for that, which should be easy enough if
we do sanity checks by looking up at the output of the control file.
I'll try to put more thoughts on that.
Does it take care of the problem?
Yep, with the patch applied bgwriter acts as expected!
Breakpoint 1, UpdateControlFile () at xlog.c:4536
4536 INIT_CRC32C(ControlFile->crc);
(gdb) bt
#0 UpdateControlFile () at xlog.c:4536
#1 0x00005646d071ddb2 in UpdateMinRecoveryPoint (lsn=26341965784,
force=0 '\000') at xlog.c:2597
#2 0x00005646d071de65 in XLogFlush (record=26341965784) at xlog.c:2632
#3 0x00005646d09d693a in FlushBuffer (buf=0x7f8e1ca523c0,
reln=0x5646d2e86028) at bufmgr.c:2729
#4 0x00005646d09d63d6 in SyncOneBuffer (buf_id=99693,
skip_recently_used=1 '\001', wb_context=0x7ffd07757380) at
bufmgr.c:2394
#5 0x00005646d09d6172 in BgBufferSync (wb_context=0x7ffd07757380) at
bufmgr.c:2270
#6 0x00005646d097c266 in BackgroundWriterMain () at bgwriter.c:279
#7 0x00005646d073b38c in AuxiliaryProcessMain (argc=2,
argv=0x7ffd07758840) at bootstrap.c:424
#8 0x00005646d098dc4a in StartChildProcess (type=BgWriterProcess) at
postmaster.c:5300
#9 0x00005646d098d672 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:4999
#10 <signal handler called>
#11 0x00007f8e5f5a6ff7 in __GI___select (nfds=5,
readfds=0x7ffd07759060, writefds=0x0, exceptfds=0x0,
timeout=0x7ffd07758fd0) at ../sysdeps/unix/sysv/linux/select.c:41
#12 0x00005646d09890ca in ServerLoop () at postmaster.c:1685
#13 0x00005646d0988799 in PostmasterMain (argc=17,
argv=0x5646d2e53390) at postmaster.c:1329
#14 0x00005646d08d2880 in main (argc=17, argv=0x5646d2e53390) at main.c:228
Regards,
--
Alexander Kukushkin
On Thu, Aug 30, 2018 at 04:03:43PM +0200, Alexander Kukushkin wrote:
2018-08-30 15:39 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
Does it take care of the problem?
Yep, with the patch applied bgwriter acts as expected!
Thanks for double-checking.
I have been struggling for a couple of hours to get a deterministic test
case out of my pocket, and I did not get one, as you would need to get
the bgwriter to flush a page before crash recovery finishes. We could do
that easily with a dedicated bgworker, but pg_ctl start expects the
standby to finish recovery first, which makes it hard to trigger
reliably, particularly on slow machines. Anyway, I did more code
review and I think that I found another issue with XLogNeedsFlush(),
which could set updateMinRecoveryPoint to false if called before
XLogFlush during crash recovery from a process other than the startup
process, so if it got called before XLogFlush() we'd still have the same
issue for a process doing both operations. Hence, I have come up with
the attached, which actually brings the code back to what it was before
8d68ee6 for those routines, except that we have fast-exit paths for the
startup process, so it is still able to replay all WAL available and
avoid page reference issues post-promotion, deciding when to update its
own copy of minRecoveryPoint when it finishes crash recovery. This also
saves a couple of locks on the control file taken by the startup
process.
If you apply the patch and try it on your standby, are you able to get
things up and working?
--
Michael
Attachment: min-consistent-update-v2.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..65db2e48d8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,13 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	 * i.e., we're doing crash recovery. We never modify the control file's
 	 * value in that case, so we can short-circuit future checks here too. The
 	 * local values of minRecoveryPoint and minRecoveryPointTLI should not be
-	 * updated until crash recovery finishes.
+	 * updated until crash recovery finishes. We only do this for the startup
+	 * process as it should not update its own reference of minRecoveryPoint
+	 * until it has finished crash recovery to make sure that all WAL
+	 * available is replayed in this case. This also saves from extra locks
+	 * taken on the control file from the startup process.
 	 */
-	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+	if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 	{
 		updateMinRecoveryPoint = false;
 		return;
@@ -2737,7 +2741,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
 	minRecoveryPoint = ControlFile->minRecoveryPoint;
 	minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-	if (force || minRecoveryPoint < lsn)
+	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		updateMinRecoveryPoint = false;
+	else if (force || minRecoveryPoint < lsn)
 	{
 		XLogRecPtr	newMinRecoveryPoint;
 		TimeLineID	newMinRecoveryPointTLI;
@@ -3127,9 +3133,11 @@ XLogNeedsFlush(XLogRecPtr record)
 	 * An invalid minRecoveryPoint means that we need to recover all the
 	 * WAL, i.e., we're doing crash recovery. We never modify the control
 	 * file's value in that case, so we can short-circuit future checks
-	 * here too.
+	 * here too. This triggers a quick exit path for the startup process,
+	 * which cannot update its local copy of minRecoveryPoint as long as
+	 * it has not replayed all WAL available when doing crash recovery.
 	 */
-	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+	if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
 		updateMinRecoveryPoint = false;
 
 	/* Quick exit if already known to be updated or cannot be updated */
@@ -3146,8 +3154,19 @@ XLogNeedsFlush(XLogRecPtr record)
 	minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * Check minRecoveryPoint for any other process than the startup
+	 * process doing crash recovery, which should not update the control
+	 * file value if crash recovery is still running.
+	 */
+	if (XLogRecPtrIsInvalid(minRecoveryPoint))
+		updateMinRecoveryPoint = false;
+
 	/* check again */
-	return record > minRecoveryPoint;
+	if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
+		return false;
+	else
+		return true;
 }
 
 	/* Quick exit if already known flushed */
2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
I have been struggling for a couple of hours to get a deterministic test
case out of my pocket, and I did not get one as you would need to get
the bgwriter to flush a page before crash recovery finishes, we could do
In my case the active standby server crashed; it wasn't in
crash recovery mode.
the time, particularly for slow machines . Anyway, I did more code
review and I think that I found another issue with XLogNeedsFlush(),
which could enforce updateMinRecoveryPoint to false if called before
XLogFlush during crash recovery from another process than the startup
process, so if it got called before XLogFlush() we'd still have the same
issue for a process doing both operations. Hence, I have come up with
At least XLogNeedsFlush() is called just from a couple of places and
doesn't break bgwriter, but anyway, thanks for finding it.
the attached, which actually brings back the code to what it was before
8d68ee6 for those routines, except that we have fast-exit paths for the
startup process so as it is still able to replay all WAL available and
avoid page reference issues post-promotion, deciding when to update its
own copy of minRecoveryPoint when it finishes crash recovery. This also
spares the startup process a couple of lock acquisitions on the control
file.
Sounds good.
If you apply the patch and try it on your standby, are you able to get
things up and working?
Nope, minRecoveryPoint in pg_control is still wrong and therefore
startup still aborts in the same place if there are connections open.
I think there is no way to fix it other than to let it replay a
sufficient amount of WAL without open connections.
Just judging from the timestamps of WAL files in pg_xlog, it is
obvious that a few moments before the hardware crashed postgres was
replaying 0000000500000AB300000057, because the next file has a smaller
mtime (it was recycled).
-rw------- 1 akukushkin akukushkin 16777216 Aug 22 07:22
0000000500000AB300000057
-rw------- 1 akukushkin akukushkin 16777216 Aug 22 07:01
0000000500000AB300000058
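With the default 16 MB segment size, any LSN printed by pg_controldata can be mapped back to one of these file names (timeline, then the segment number split in two). A rough standalone sketch; the timeline 5 and the minimum recovery ending location quoted in the next paragraph come from this report, the rest is illustration:

/*
 * lsnseg.c -- map an LSN, as printed by pg_controldata, to the pg_xlog
 * file name that contains it.  Assumes the default 16 MB segment size.
 */
#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE	(16 * 1024 * 1024)
#define SEGS_PER_XLOGID	(UINT64_C(0x100000000) / WAL_SEG_SIZE)

int
main(void)
{
	unsigned int tli = 5;			/* timeline, from the file names above */
	uint64_t	hi = 0xAB3;			/* minimum recovery ending location ... */
	uint64_t	lo = 0x4A1B3118;	/* ... AB3/4A1B3118 from this report */
	uint64_t	segno = ((hi << 32) | lo) / WAL_SEG_SIZE;

	/* same naming scheme as the files listed above */
	printf("%08X%08X%08X\n",
		   tli,
		   (unsigned int) (segno / SEGS_PER_XLOGID),
		   (unsigned int) (segno % SEGS_PER_XLOGID));
	return 0;
}

This prints 0000000500000AB30000004A, i.e. the recorded minimum recovery point sits well behind the 0000000500000AB300000057 file that was still being replayed.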
Minimum recovery ending location is AB3/4A1B3118, but at the same time
I managed to find pages from 0000000500000AB300000053 on disk (at
least in the index files). That could only mean that the bgwriter was
flushing dirty pages but pg_control wasn't properly updated, and it
happened not during recovery after the hardware crash, but while
postgres was running before the hardware crash.
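That kind of check can be scripted. A rough standalone diagnostic sketch that scans a relation file and reports the highest page LSN it finds, assuming the default 8 kB block size and the same byte order as the server that wrote the file; pd_lsn is the first eight bytes of every page header:

/*
 * pagelsn.c -- print the highest page LSN found in a relation file.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

int
main(int argc, char **argv)
{
	FILE	   *f;
	unsigned char page[BLCKSZ];
	uint32_t	max_hi = 0;
	uint32_t	max_lo = 0;
	long		blocks = 0;

	if (argc != 2)
	{
		fprintf(stderr, "usage: %s <relation file>\n", argv[0]);
		return 1;
	}
	if ((f = fopen(argv[1], "rb")) == NULL)
	{
		perror(argv[1]);
		return 1;
	}

	while (fread(page, 1, BLCKSZ, f) == BLCKSZ)
	{
		uint32_t	hi;
		uint32_t	lo;

		/* pd_lsn is the first page header field: two native-order uint32s */
		memcpy(&hi, page, sizeof(hi));
		memcpy(&lo, page + sizeof(hi), sizeof(lo));

		if (hi > max_hi || (hi == max_hi && lo > max_lo))
		{
			max_hi = hi;
			max_lo = lo;
		}
		blocks++;
	}
	fclose(f);

	/* same %X/%X format as "Minimum recovery ending location" above */
	printf("%ld blocks, highest page LSN %X/%X\n",
		   blocks, (unsigned int) max_hi, (unsigned int) max_lo);
	return 0;
}

Any result at or beyond AB3/53000000 (the start of segment 0000000500000AB300000053) would indicate a page flushed past the recorded minimum recovery ending location of AB3/4A1B3118, which is exactly the inconsistency described above.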
The only possible way to recover such a standby is to cut off all
connections and let it replay all the WAL files it managed to write to
disk before the first crash.
Regards,
--
Alexander Kukushkin
On Thu, Aug 30, 2018 at 08:31:36PM +0200, Alexander Kukushkin wrote:
2018-08-30 19:34 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
I have been struggling for a couple of hours to get a deterministic test
case out of my pocket, and I did not get one as you would need to get
the bgwriter to flush a page before crash recovery finishes, we could do
In my case the active standby server has crashed; it wasn't in
crash recovery mode.
That's what I meant, a standby crashed and then restarted, doing crash
recovery before moving on with archive recovery once it was done with
all its local WAL.
Minimum recovery ending location is AB3/4A1B3118, but at the same time
I managed to find pages from 0000000500000AB300000053 on disk (at
least in the index files). That could only mean that the bgwriter was
flushing dirty pages but pg_control wasn't properly updated, and it
happened not during recovery after the hardware crash, but while
postgres was running before the hardware crash.
Exactly, that would explain the incorrect reference.
The only possible way to recover such a standby is to cut off all
connections and let it replay all the WAL files it managed to write to
disk before the first crash.
Yeah... I am going to apply the patch after another lookup, that will
fix the problem moving forward. Thanks for checking by the way.
--
Michael
At Thu, 30 Aug 2018 11:57:05 -0700, Michael Paquier <michael@paquier.xyz> wrote in <20180830185705.GF15446@paquier.xyz>
Yeah... I am going to apply the patch after another lookup, that will
fix the problem moving forward. Thanks for checking by the way.
Please wait a bit.. I have a concern about this.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
Please wait a bit.. I have a concern about this.
Sure, please feel free.
--
Michael
At Thu, 30 Aug 2018 18:48:55 -0700, Michael Paquier <michael@paquier.xyz> wrote in <20180831014855.GJ15446@paquier.xyz>
On Fri, Aug 31, 2018 at 09:48:46AM +0900, Kyotaro HORIGUCHI wrote:
Please wait a bit.. I have a concern about this.
Sure, please feel free.
Thanks.
I looked through the patch and related code and have a concern.
The patch inhibits turning off updateMinRecoveryPoint in processes other
than the startup process running crash recovery, except at the end
of XLogNeedsFlush().
/*
* Check minRecoveryPoint for any other process than the startup
* process doing crash recovery, which should not update the control
* file value if crash recovery is still running.
*/
if (XLogRecPtrIsInvalid(minRecoveryPoint))
updateMinRecoveryPoint = false;
Even if any process other than the startup calls the function
during crash recovery, it doesn't have a means to turn
updateMinRecoveryPoint on again. Actually, the only process that
calls the function during crash recovery is the startup; the bgwriter
and checkpointer don't, and hot-standby backends come in only after
crash recovery ends. After all, we just won't have an invalid
minRecoveryPoint value there, and updateMinRecoveryPoint must be
true.
Other than that I didn't find a problem. Please find the attached
patch. I also have not come up with a way to check this case
automatically.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
min-consistent-update-v3.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..74692a128d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2723,9 +2723,13 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
* i.e., we're doing crash recovery. We never modify the control file's
* value in that case, so we can short-circuit future checks here too. The
* local values of minRecoveryPoint and minRecoveryPointTLI should not be
- * updated until crash recovery finishes.
+ * updated until crash recovery finishes. We only do this for the startup
+ * process as it should not update its own reference of minRecoveryPoint
+ * until it has finished crash recovery to make sure that all WAL
+ * available is replayed in this case. This also saves from extra locks
+ * taken on the control file from the startup process.
*/
- if (XLogRecPtrIsInvalid(minRecoveryPoint))
+ if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
{
updateMinRecoveryPoint = false;
return;
@@ -2737,7 +2741,9 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
minRecoveryPoint = ControlFile->minRecoveryPoint;
minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
- if (force || minRecoveryPoint < lsn)
+ if (XLogRecPtrIsInvalid(minRecoveryPoint))
+ updateMinRecoveryPoint = false;
+ else if (force || minRecoveryPoint < lsn)
{
XLogRecPtr newMinRecoveryPoint;
TimeLineID newMinRecoveryPointTLI;
@@ -3124,12 +3130,15 @@ XLogNeedsFlush(XLogRecPtr record)
if (RecoveryInProgress())
{
/*
- * An invalid minRecoveryPoint means that we need to recover all the
- * WAL, i.e., we're doing crash recovery. We never modify the control
- * file's value in that case, so we can short-circuit future checks
- * here too.
+ * An invalid minRecoveryPoint on the startup process means that we
+ * need to recover all the WAL, i.e., we're doing crash recovery. We
+ * never modify the control file's value in that case, so we can
+ * short-circuit future checks here too. This triggers a quick exit
+ * path for the startup process, which cannot update its local copy of
+ * minRecoveryPoint as long as it has not replayed all WAL available
+ * when doing crash recovery.
*/
- if (XLogRecPtrIsInvalid(minRecoveryPoint))
+ if (XLogRecPtrIsInvalid(minRecoveryPoint) && InRecovery)
updateMinRecoveryPoint = false;
/* Quick exit if already known to be updated or cannot be updated */
@@ -3146,6 +3155,12 @@ XLogNeedsFlush(XLogRecPtr record)
minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
LWLockRelease(ControlFileLock);
+ /*
+ * No process other than the startup reaches here before crash
+ * recovery ends. So minRecoveryPoint must have a valid value here.
+ */
+ Assert(minRecoveryPoint != InvalidXLogRecPtr);
+
/* check again */
return record > minRecoveryPoint;
}
On Fri, Aug 31, 2018 at 02:52:06PM +0900, Kyotaro HORIGUCHI wrote:
The patch inhibits turning off updateMinRecoveryPoint in processes other
than the startup process running crash recovery, except at the end
of XLogNeedsFlush().
Yes, that's a matter of safety, so as to cover any modules which
may use XLogFlush(). And that matches the old code, so there are no
surprises.
Even if any process other than the startup calls the function
during crash recovery, it doesn't have a means to turn
updateMinRecoveryPoint on again. Actually, the only process that
calls the function during crash recovery is the startup; the bgwriter
and checkpointer don't, and hot-standby backends come in only after
crash recovery ends. After all, we just won't have an invalid
minRecoveryPoint value there, and updateMinRecoveryPoint must be
true.
Yes, until a recovery point is reached only the startup process could
call that. Now I would imagine that we could also have a background
worker which is spawned when the postmaster starts and calls those
code paths...
+ /*
+ * No process other than the startup reaches here before crash
+ * recovery ends. So minRecoveryPoint must have a valid value here.
+ */
+ Assert(minRecoveryPoint != InvalidXLogRecPtr);
... And that would invalidate your assertion here.
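To make that counter-example concrete, here is a minimal sketch of such a module (the name early_flush and its body are hypothetical; it would be loaded through shared_preload_libraries). A worker registered with BgWorkerStart_PostmasterStart is launched as soon as the postmaster starts, so nothing prevents it from reaching XLogNeedsFlush() while the startup process is still performing crash recovery, which is what would trip the proposed assertion:

#include "postgres.h"
#include "fmgr.h"
#include "access/xlog.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"

PG_MODULE_MAGIC;

void		_PG_init(void);
void		early_flush_main(Datum main_arg);

void
early_flush_main(Datum main_arg)
{
	/* Runs while the startup process may still be in crash recovery. */
	BackgroundWorkerUnblockSignals();

	/*
	 * Anything here that reaches XLogNeedsFlush() or XLogFlush() exercises
	 * the non-startup code paths discussed above before a consistent state
	 * is reached; the call below is only a placeholder to show the point.
	 */
	(void) XLogNeedsFlush(GetXLogReplayRecPtr(NULL));

	proc_exit(0);
}

void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
	worker.bgw_start_time = BgWorkerStart_PostmasterStart;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_name, BGW_MAXLEN, "early_flush");
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "early_flush");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "early_flush_main");
	RegisterBackgroundWorker(&worker);
}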
--
Michael
On Thu, Aug 30, 2018 at 11:23:54PM -0700, Michael Paquier wrote:
Yes, that's a matter of safety, so as to cover any modules which
may use XLogFlush(). And that matches the old code, so there are no
surprises.
Okay, I have pushed my previous version as that's the safest approach.
If you have any idea of improvements or clarifications, let's discuss
those on a different thread, and only for HEAD.
Many thanks Alexander for the whole work and Horiguchi-san for the
review.
--
Michael