Checkpointer crashes on slave in 9.4 on windows
During internal tests, we observed that the checkpointer
crashes on the slave on Windows, with the following log on
the slave:
LOG: checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT: See C include file "ntstatus.h" for a description of the hexadecimal
value.
LOG: terminating any other active server processes
I debugged this and found that it happens when the checkpointer
tries to update the shared memory config. The call stack is
below.
postgres.exe!LWLockAcquireCommon(LWLock * l=0x0000000000000000, LWLockMode mode=LW_EXCLUSIVE, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615) Line 579 + 0x14 bytes C
postgres.exe!LWLockAcquireWithVar(LWLock * l=0x0000000000000000, unsigned __int64 * valptr=0x0000000000000020, unsigned __int64 val=18446744073709551615) Line 510 C
postgres.exe!WALInsertLockAcquireExclusive() Line 1627 C
postgres.exe!UpdateFullPageWrites() Line 9037 C
postgres.exe!UpdateSharedMemoryConfig() Line 1364 C
postgres.exe!CheckpointerMain() Line 359 C
postgres.exe!AuxiliaryProcessMain(int argc=2, char * * argv=0x00000000007d2180) Line 427 C
postgres.exe!SubPostmasterMain(int argc=4, char * * argv=0x00000000007d2170) Line 4635 C
postgres.exe!main(int argc=4, char * * argv=0x00000000007d2170) Line 207 C
Basically, the issue is that during startup, when the
checkpointer tries to acquire the WAL insertion locks to
update the value of fullPageWrites, it crashes because
the locks are not yet initialized in that process. They are
initialized in InitXLOGAccess(), which, in case recovery is in
progress, gets called via RecoveryInProgress() before doing the
actual checkpoint. However, we try to access them before that,
which leads to the crash.
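The l=0x0 and valptr=0x20 values in the call stack are what one would expect if the address of a lock slot and of its variable were computed from a NULL array base. The following is only a sketch of that arithmetic: DemoLWLock and DemoWALInsertLock are hypothetical types, and the 32-byte lock size is an assumption chosen to match the trace, not the real struct layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an LWLock. The 32-byte size is an assumption
 * chosen to match valptr=0x20 in the trace, not the real layout.
 */
typedef struct DemoLWLock
{
    char opaque[32];
} DemoLWLock;

/* Hypothetical WAL insertion lock slot: a lock followed by its variable. */
typedef struct DemoWALInsertLock
{
    DemoLWLock lock;
    uint64_t   insertingAt;
} DemoWALInsertLock;

/* Address of slot i's lock, computed from the array base address. */
static uintptr_t
demo_lock_address(uintptr_t base, size_t i)
{
    return base + i * sizeof(DemoWALInsertLock)
        + offsetof(DemoWALInsertLock, lock);
}

/* Address of slot i's insertingAt field, computed the same way. */
static uintptr_t
demo_valptr_address(uintptr_t base, size_t i)
{
    return base + i * sizeof(DemoWALInsertLock)
        + offsetof(DemoWALInsertLock, insertingAt);
}

/* Usage: with a zero (NULL) base, slot 0 gives the addresses in the trace. */
static void
demo_check_trace(void)
{
    assert(demo_lock_address(0, 0) == 0x0);
    assert(demo_valptr_address(0, 0) == 0x20);
}
```

The addresses are computed as integers (uintptr_t) so the sketch never performs arithmetic on a real null pointer.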
I think the reason it occurs only on Windows is that on
Linux, fork() ensures that the WAL insertion locks are
initialized with the same values as in the postmaster.
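To illustrate the difference, here is a toy model (not PostgreSQL code) of how a process-local pointer into shared memory behaves under the two ways of starting a child. DemoProcess, demo_spawn_fork() and demo_spawn_exec() are hypothetical names; real processes are replaced by plain struct copies so that the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the shared memory area holding the WAL insertion locks. */
static int demo_shared_locks[8];

/* Hypothetical model of one process's private (non-shared) globals. */
typedef struct DemoProcess
{
    int *wal_insert_locks;      /* process-local pointer into shared memory */
} DemoProcess;

/* The postmaster creates shared memory and sets its own local pointer. */
static DemoProcess
demo_postmaster_start(void)
{
    DemoProcess p;

    p.wal_insert_locks = demo_shared_locks;
    return p;
}

/* fork(): the child starts as a copy of the parent's private memory. */
static DemoProcess
demo_spawn_fork(const DemoProcess *parent)
{
    return *parent;
}

/* exec-style start: a fresh image, so globals hold their initial values. */
static DemoProcess
demo_spawn_exec(const DemoProcess *parent)
{
    DemoProcess child = { NULL };

    (void) parent;              /* nothing is inherited from the parent */
    return child;
}

/* Usage: only the fork-style child sees an initialized pointer. */
static void
demo_compare_spawn_models(void)
{
    DemoProcess postmaster = demo_postmaster_start();
    DemoProcess forked = demo_spawn_fork(&postmaster);
    DemoProcess execed = demo_spawn_exec(&postmaster);

    assert(forked.wal_insert_locks == demo_shared_locks);
    assert(execed.wal_insert_locks == NULL);
}
```

In the exec-style case the child has to set the pointer again itself before first use, which is the step that is missing in the crash above.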
To fix this issue, we need to ensure that the WAL insertion
locks are initialized before we use them. One way is to call
InitXLOGAccess() before calling CheckpointerMain(), as I have
done in the attached patch; another would be to call
RecoveryInProgress() much earlier in the code path than we do
now.
Steps to reproduce the issue
-------------------------------------------
On Master
a. Change the following parameters in postgresql.conf:
wal_level = archive
archive_mode = on
archive_command = 'copy "%p" "c:\\Users\\PostgreSQL\\9.4\\bin\\archive\\%f"'
archive_timeout = 10
b. Change pg_hba.conf to accept connections from the slave
c. Start the server
d. Connect to the server and start an online backup:
psql.exe -p 5432 -c "select pg_start_backup('label-1');" postgres
e. Create the slave directory by copying everything from the master
f. Remove postmaster.pid from the slave directory
g. Change the port on the slave
h. Create recovery.conf with the following parameters on the slave:
standby_mode = on
restore_command = 'copy "c:\\Users\\PostgreSQL\\9.4\\bin\\archive\\%f" "%p"'
i. Stop the online backup on the master:
psql.exe -p 5432 -c "select pg_stop_backup();" postgres
j. Start the slave; you can observe the following in the log:
LOG: checkpointer process (PID 4040) was terminated by exception 0xC0000005
HINT: See C include file "ntstatus.h" for a description of the hexadecimal
value.
Comments?
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_checkpointer_crash_on_slave_v1.patch
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4a542e6..31c3775 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -423,6 +423,7 @@ AuxiliaryProcessMain(int argc, char *argv[])
case CheckpointerProcess:
/* don't set signals, checkpointer has its own agenda */
+ InitXLOGAccess();
CheckpointerMain();
proc_exit(1); /* should never return */
On Mon, Jul 21, 2014 at 4:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [...]
> To fix this issue, we need to ensure that WAL Insertion
> Locks should get initialized before we use them, so one of
> the ways is to call InitXLOGAccess() before calling
> CheckpointerMain() as I have done in attached patch, other
> could be to call RecoveryInProgress() much earlier in path
> than now.
So, this problem was introduced by Heikki's commit,
68a2e52bbaf98f136a96b3a0d734ca52ca440a95, to replace XLogInsert slots
with regular LWLocks. I think the problem here is that the
initialization code here really doesn't belong in InitXLOGAccess at
all:
1. I think WALInsertLocks is just another global variable that needs
to be saved and restored in EXEC_BACKEND mode and that it therefore
ought to participate in the save_backend_variables() mechanism instead
of having its own special-purpose mechanism to save and restore the
value.
2. And I think that the LWLockRegisterTranche call belongs in
XLOGShmemInit(), so that it's parallel to the other call in
CreateLWLocks.
I think that would be more robust, because while your fix will
definitely work, we could easily reintroduce a similar
platform-specific bug for some other auxiliary process. Using the
mechanisms described above will mean that this is set up properly for
everything that's attached to shared memory at all.
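As a rough illustration of the first alternative, here is a minimal sketch of passing a process-local variable from the parent to an exec-style child through a parameter struct. DemoBackendParameters and the demo_* functions are hypothetical and far simpler than the real save_backend_variables() machinery; the sketch also assumes shared memory is mapped at the same address in the child, so a saved pointer stays valid.

```c
#include <assert.h>
#include <stddef.h>

/* Process-local global (hypothetical stand-in for the one in xlog.c). */
static void *demo_WALInsertLocks = NULL;

/* Hypothetical parameter block handed from the parent to the child. */
typedef struct DemoBackendParameters
{
    void *wal_insert_locks;
} DemoBackendParameters;

/* Parent side: copy the process-local value into the parameter block. */
static void
demo_save_variables(DemoBackendParameters *param)
{
    param->wal_insert_locks = demo_WALInsertLocks;
}

/* Child side: restore the process-local value from the parameter block. */
static void
demo_restore_variables(const DemoBackendParameters *param)
{
    demo_WALInsertLocks = param->wal_insert_locks;
}

/* Simulate an exec-style start, where globals return to initial values. */
static void
demo_simulate_exec(void)
{
    demo_WALInsertLocks = NULL;
}

/* Usage: the value survives the simulated exec only via the block. */
static void
demo_round_trip(void)
{
    static int shared_locks[8];     /* stand-in for shared memory */
    DemoBackendParameters param;

    demo_WALInsertLocks = shared_locks;
    demo_save_variables(&param);

    demo_simulate_exec();
    assert(demo_WALInsertLocks == NULL);

    demo_restore_variables(&param);
    assert(demo_WALInsertLocks == (void *) shared_locks);
}
```

Either approach sets the variable before any auxiliary process can use it; they differ only in whether the value is carried over from the parent or derived again from shared memory.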
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Thu, Jul 24, 2014 at 12:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, this problem was introduced by Heikki's commit,
> 68a2e52bbaf98f136a96b3a0d734ca52ca440a95, to replace XLogInsert slots
> with regular LWLocks. I think the problem here is that the
> initialization code here really doesn't belong in InitXLOGAccess at
> all:
>
> 1. I think WALInsertLocks is just another global variable that needs
> to be saved and restored in EXEC_BACKEND mode and that it therefore
> ought to participate in the save_backend_variables() mechanism instead
> of having its own special-purpose mechanism to save and restore the
> value.
It seems to me that we don't need to save/restore variables that point
to shared memory and that we initialize during process startup. We
do initialize shared memory during each process start in the call below:
SubPostmasterMain()
{
    ..
    ..
    CreateSharedMemoryAndSemaphores(false, 0);
}
A few other examples of similar variables are below:
MultiXactShmemInit()
{
    ..
    OldestMemberMXactId = MultiXactState->perBackendXactIds;
    OldestVisibleMXactId = OldestMemberMXactId + MaxOldestSlot;
}
CreateSharedProcArray()
{
    ..
    allProcs = ProcGlobal->allProcs;
    allPgXact = ProcGlobal->allPgXact;
}
However, I think it is better to initialize WALInsertLocks in
XLOGShmemInit(), as we do for other cases and as you suggested
in point 2.
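The pattern I am referring to can be sketched as follows. This is a toy model rather than the actual xlog.c code: demo_shmem_init_struct() is a hypothetical stand-in for a shared-memory lookup that reports whether the structure already existed, and a static struct plays the role of the shared segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared control structure. */
typedef struct DemoXLogCtl
{
    int locks[4];
} DemoXLogCtl;

/* Stand-in for shared memory; it survives simulated process starts. */
static DemoXLogCtl demo_segment;
static bool demo_segment_created = false;

/* Process-local pointer; lost at every simulated exec-style start. */
static int *demo_local_locks = NULL;

/* Hypothetical lookup that reports whether the struct already existed. */
static DemoXLogCtl *
demo_shmem_init_struct(bool *found)
{
    *found = demo_segment_created;
    demo_segment_created = true;
    return &demo_segment;
}

/* Runs at every process start, for both the creator and later attachers. */
static void
demo_xlog_shmem_init(void)
{
    bool         found;
    DemoXLogCtl *ctl = demo_shmem_init_struct(&found);

    if (found)
    {
        /* Attach path: only set the process-local pointer again. */
        demo_local_locks = ctl->locks;
        return;
    }

    /* Create path: initialize shared contents, then set the pointer. */
    memset(ctl, 0, sizeof(DemoXLogCtl));
    demo_local_locks = ctl->locks;
}

/* Usage: a simulated exec-style child gets its pointer from the init call. */
static void
demo_attach_example(void)
{
    demo_xlog_shmem_init();         /* postmaster creates the struct */
    assert(demo_local_locks == demo_segment.locks);
    demo_local_locks[0] = 7;        /* state written to "shared memory" */

    demo_local_locks = NULL;        /* simulate an exec-style child */
    demo_xlog_shmem_init();         /* child attaches */
    assert(demo_local_locks == demo_segment.locks);
    assert(demo_local_locks[0] == 7);
}
```

Because the attach path runs in every process that connects to shared memory, no auxiliary process needs its own special-case initialization call.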
> 2. And I think that the LWLockRegisterTranche call belongs in
> XLOGShmemInit(), so that it's parallel to the other call in
> CreateLWLocks.
Agreed.
The revised patch initializes WALInsertLocks and calls
LWLockRegisterTranche() in XLOGShmemInit(), which makes their
initialization similar to what we do in other places.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix_checkpointer_crash_on_slave_v2.patch
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e4f9595..93f3ca2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4841,6 +4841,11 @@ XLOGShmemInit(void)
{
/* both should be present or neither */
Assert(foundCFile && foundXLog);
+
+ /* Initialize local copy of WALInsertLocks and register the tranche */
+ WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
+ LWLockRegisterTranche(XLogCtl->Insert.WALInsertLockTrancheId,
+ &XLogCtl->Insert.WALInsertLockTranche);
return;
}
memset(XLogCtl, 0, sizeof(XLogCtlData));
@@ -7619,11 +7624,6 @@ InitXLOGAccess(void)
ThisTimeLineID = XLogCtl->ThisTimeLineID;
Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());
- /* Initialize our copy of WALInsertLocks and register the tranche */
- WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
- LWLockRegisterTranche(XLogCtl->Insert.WALInsertLockTrancheId,
- &XLogCtl->Insert.WALInsertLockTranche);
-
/* Use GetRedoRecPtr to copy the RedoRecPtr safely */
(void) GetRedoRecPtr();
}
On Thu, Jul 24, 2014 at 5:38 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> The revised patch initializes WALInsertLocks and calls
> LWLockRegisterTranche() in XLOGShmemInit(), which makes their
> initialization similar to what we do in other places.
OK, that seems all right. Will commit and back-patch.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company