[PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
Greetings,
Recently, we hit a few occurrences of a deadlock when an IO failure (including disk full and random remote-disk IO failures) happens in single user mode. We found the issue on both Linux and Windows, in multiple PostgreSQL versions.
Here are the steps to repro on Linux (the Windows repro is similar):
1. Get latest PostgreSQL code, build and install the executables.
$ git clone https://git.postgresql.org/git/postgresql.git
$ cd postgresql
$ PGROOT=$(pwd)
$ git checkout REL_11_STABLE
$ mkdir build
$ cd build
$ ../configure --prefix=/path/to/postgres
$ make && make install
2. Run initdb to initialize a PG database folder.
$ /path/to/postgres/bin/initdb -D /path/to/data
3. Because the "unable to write relation data" scenario is difficult to hit naturally, even when reserved space is turned off, I have prepared a small patch (see attachment "emulate-error.patch") that forces an error when PG tries to write data to relation files. We can just apply the patch; there is no need to spend effort flooding the disk with data any more.
$ cd $PGROOT
$ git apply /path/to/emulate-error.patch
$ cd build
$ make && make install
4. Connect to the newly initialized database cluster in single user mode, create a table, insert some data into it, and then either do a checkpoint or directly give EOF. Either way we hit the deadlock issue, and the process will not exit until we kill it.
Do a checkpoint explicitly:
$ /path/to/postgres/bin/postgres --single -D /path/to/data/ postgres -c exit_on_error=true <<EOF
create table t1(a int);
insert into t1 values (1), (2), (3);
checkpoint;
EOF
PostgreSQL stand-alone backend 11.1
backend> backend> backend> 2018-11-29 02:45:27.891 UTC [18806] FATAL: Emulate exception in mdwrite() when writing to disk
2018-11-29 02:55:27.891 UTC [18806] CONTEXT: writing block 8 of relation base/12368/1247
2018-11-29 02:55:27.891 UTC [18806] STATEMENT: checkpoint;
2018-11-29 02:55:27.900 UTC [18806] FATAL: Emulate exception in mdwrite() when writing to disk
2018-11-29 02:55:27.900 UTC [18806] CONTEXT: writing block 8 of relation base/12368/1247
Or directly give an EOF:
$ /path/to/postgres/bin/postgres --single -D /path/to/data/ postgres -c exit_on_error=true <<EOF
create table t1(a int);
insert into t1 values (1), (2), (3);
EOF
PostgreSQL stand-alone backend 11.1
backend> backend> backend> 2018-11-29 02:55:24.438 UTC [18149] FATAL: Emulate exception in mdwrite() when writing to disk
2018-11-29 02:45:24.438 UTC [18149] CONTEXT: writing block 8 of relation base/12368/1247
5. Moreover, when we try to recover the database in single user mode, we hit the issue again: the process neither brings up the database nor exits.
$ /path/to/postgres/bin/postgres --single -D /path/to/data/ postgres -c exit_on_error=true
2018-11-29 02:59:33.257 UTC [19058] LOG: database system shutdown was interrupted; last known up at 2018-11-29 02:58:49 UTC
2018-11-29 02:59:33.485 UTC [19058] LOG: database system was not properly shut down; automatic recovery in progress
2018-11-29 02:59:33.500 UTC [19058] LOG: redo starts at 0/1672E40
2018-11-29 02:59:33.500 UTC [19058] LOG: invalid record length at 0/1684B90: wanted 24, got 0
2018-11-29 02:59:33.500 UTC [19058] LOG: redo done at 0/1684B68
2018-11-29 02:59:33.500 UTC [19058] LOG: last completed transaction was at log time 2018-11-29 02:58:49.856663+00
2018-11-29 02:59:33.547 UTC [19058] FATAL: Emulate exception in mdwrite() when writing to disk
2018-11-29 02:59:33.547 UTC [19058] CONTEXT: writing block 8 of relation base/12368/1247
Analysis:
So, what happened? There are actually 2 types of deadlock here, both with the same root cause. Let's first look at the scenario in step #5. Here, the deadlock happens when a disk IO failure occurs inside StartupXLOG(). If we attach a debugger to the PG process, we see it is stuck acquiring the buffer's lw-lock in AbortBufferIO().
void
AbortBufferIO(void)
{
    BufferDesc *buf = InProgressBuf;

    if (buf)
    {
        uint32      buf_state;

        /*
         * Since LWLockReleaseAll has already been called, we're not holding
         * the buffer's io_in_progress_lock. We have to re-acquire it so that
         * we can use TerminateBufferIO. Anyone who's executing WaitIO on the
         * buffer will be in a busy spin until we succeed in doing this.
         */
        LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
This is because the same lock was already acquired, in StartBufferIO(), before the buffer manager attempted to flush the buffer page.
static bool
StartBufferIO(BufferDesc *buf, bool forInput)
{
    uint32      buf_state;

    Assert(!InProgressBuf);

    for (;;)
    {
        /*
         * Grab the io_in_progress lock so that other processes can wait for
         * me to finish the I/O.
         */
        LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);

        buf_state = LockBufHdr(buf);

        if (!(buf_state & BM_IO_IN_PROGRESS))
            break;
Reading the code, AtProcExit_Buffers() assumes all lw-locks have already been released. However, in single user mode, at the time StartupXLOG() is executed there is no before_shmem_exit/on_shmem_exit callback registered yet that would release the lw-locks.
And since lw-locks are non-reentrant, the process gets stuck re-acquiring the same lock.
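To illustrate the self-deadlock, here is a standalone analogy only (not PostgreSQL code, just a sketch of the mechanism): an lw-lock is ultimately backed by a process semaphore, and a process that tries to acquire a lock it already holds simply waits on itself.
/*
 * Standalone analogy only (not PostgreSQL code): a non-reentrant lock backed
 * by a binary semaphore, acquired twice by the same process.
 */
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>

int
main(void)
{
    sem_t       lock;

    sem_init(&lock, 0, 1);      /* binary semaphore, like a PGSemaphore */

    sem_wait(&lock);            /* "StartBufferIO": take the IO lock */

    /* "AbortBufferIO": the same process tries to take the same lock again */
    if (sem_trywait(&lock) == -1 && errno == EAGAIN)
        printf("second acquire would block forever: self-deadlock\n");
    /* the real code uses a blocking acquire here, hence the stuck backend */

    sem_destroy(&lock);
    return 0;
}
(The real LWLockAcquire() ends up blocking on the process's semaphore via PGSemaphoreLock(), which is exactly frame #3 in the stack below.)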
Here is the call stack:
(gdb) bt
#0 0x00007f0fdb7cb6d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f0fd14c81b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7f0fd14c81b8, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007f0fdb7cb7c8 in __new_sem_wait_slow (sem=0x7f0fd14c81b8, abstime=0x0) at sem_waitcommon.c:181
#3 0x00005630d475658a in PGSemaphoreLock (sema=0x7f0fd14c81b8) at pg_sema.c:316
#4 0x00005630d47f689e in LWLockAcquire (lock=0x7f0fd9ae9c00, mode=LW_EXCLUSIVE) at /path/to/postgres/source/build/../src/backend/storage/lmgr/lwlock.c:1243
#5 0x00005630d47cd087 in AbortBufferIO () at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:3988
#6 0x00005630d47cb3f9 in AtProcExit_Buffers (code=1, arg=0) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2473
#7 0x00005630d47dbc32 in shmem_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272
#8 0x00005630d47dba5e in proc_exit_prepare (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194
#9 0x00005630d47db9c6 in proc_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107
#10 0x00005630d49811bc in errfinish (dummy=0) at /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541
#11 0x00005630d4801f1f in mdwrite (reln=0x5630d6588c68, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843
#12 0x00005630d4804716 in smgrwrite (reln=0x5630d6588c68, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650
#13 0x00005630d47cb824 in FlushBuffer (buf=0x7f0fd19e9c00, reln=0x5630d6588c68) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751
#14 0x00005630d47cb219 in SyncOneBuffer (buf_id=0, skip_recently_used=false, wb_context=0x7ffccc371a70) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394
#15 0x00005630d47cab00 in BufferSync (flags=6) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984
#16 0x00005630d47cb57f in CheckPointBuffers (flags=6) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578
#17 0x00005630d44a685b in CheckPointGuts (checkPointRedo=23612304, flags=6) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149
#18 0x00005630d44a62cf in CreateCheckPoint (flags=6) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937
#19 0x00005630d44a45e3 in StartupXLOG () at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:7723
#20 0x00005630d4995f88 in InitPostgres (in_dbname=0x5630d65582b0 "postgres", dboid=0, username=0x5630d653d7d0 "chengyu", useroid=0, out_dbname=0x0, override_allow_connections=false)
at /path/to/postgres/source/build/../src/backend/utils/init/postinit.c:636
#21 0x00005630d480b68b in PostgresMain (argc=7, argv=0x5630d6534d20, dbname=0x5630d65582b0 "postgres", username=0x5630d653d7d0 "chengyu") at /path/to/postgres/source/build/../src/backend/tcop/postgres.c:3810
#22 0x00005630d4695e8b in main (argc=7, argv=0x5630d6534d20) at /path/to/postgres/source/build/../src/backend/main/main.c:224
(gdb) p on_shmem_exit_list
$1 = {{function = 0x55801cc68f5f <AnonymousShmemDetach>, arg = 0},
{function = 0x55801cc68a4f <IpcMemoryDelete>, arg = 2490396},
{function = 0x55801cc689f8 <IpcMemoryDetach>, arg = 140602018975744},
{function = 0x55801cc6842e <ReleaseSemaphores>, arg = 0},
{function = 0x55801ccec48a <dsm_postmaster_shutdown>, arg = 140602018975744},
{function = 0x55801cd04053 <ProcKill>, arg = 0},
{function = 0x55801cd0402d <RemoveProcFromArray>, arg = 0},
{function = 0x55801ccf74e8 <CleanupInvalidationState>, arg = 140601991817088},
{function = 0x55801ccf446f <CleanupProcSignalState>, arg = 1},
{function = 0x55801ccdd3e5 <AtProcExit_Buffers>, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}}
(gdb) p before_shmem_exit_list
$2 = {{function = 0x0, arg = 0} <repeats 20 times>}
The second type is the one in step #4. By the time the "checkpoint" SQL command is executed, PG has already registered the before_shmem_exit callback ShutdownPostgres(), which releases all lw-locks as long as a transaction or sub-transaction is ongoing. So after the first IO error, the buffer page's lw-lock is released successfully. However, ShutdownXLOG() is then invoked from the on_shmem_exit list, and PG tries to flush the buffer pages again, which triggers a second IO error. Unlike the first time, the before/on_shmem_exit callbacks that have already run are not invoked again, because shmem_exit() decrements the list indexes as it goes. So the lw-locks for the buffer pages are not released, PG tries to acquire the same buffer lock in AbortBufferIO(), and the process gets stuck.
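To make the control flow concrete, here is a heavily simplified, self-contained toy model of this second scenario (not PostgreSQL code; the names are only stand-ins for the real callbacks):
/*
 * Toy model only (not PostgreSQL code): exit-callback lists whose indexes are
 * decremented as callbacks run, so a nested exit resumes where the outer one
 * stopped instead of re-running earlier callbacks.
 */
#include <stdbool.h>
#include <stdio.h>

typedef void (*cb_fn) (void);

static cb_fn before_list[20];
static int   before_index = 0;
static cb_fn on_list[20];
static int   on_index = 0;
static bool  io_lock_held = false;

static void
my_shmem_exit(void)
{
    while (--before_index >= 0)
        before_list[before_index]();
    while (--on_index >= 0)
        on_list[on_index]();
}

/* stand-in for ShutdownPostgres(): aborts the transaction, releases lw-locks */
static void
shutdown_postgres(void)
{
    io_lock_held = false;
    puts("ShutdownPostgres: lw-locks released");
}

/* stand-in for ShutdownXLOG(): the shutdown checkpoint hits the IO error too */
static void
shutdown_xlog(void)
{
    puts("ShutdownXLOG: mdwrite fails with the io lock held -> nested proc_exit");
    io_lock_held = true;
    my_shmem_exit();            /* errfinish() -> proc_exit() re-enters here */
}

/* stand-in for AtProcExit_Buffers(): AbortBufferIO re-acquires the io lock */
static void
at_proc_exit_buffers(void)
{
    printf("AtProcExit_Buffers: io lock %s\n",
           io_lock_held ? "still held -> blocks forever" : "free -> exits cleanly");
}

int
main(void)
{
    on_list[on_index++] = at_proc_exit_buffers; /* registered early */
    on_list[on_index++] = shutdown_xlog;        /* registered later, runs first */
    before_list[before_index++] = shutdown_postgres;

    io_lock_held = true;        /* the checkpoint command failed with the io lock held */
    puts("checkpoint command fails with the first IO error -> proc_exit");
    my_shmem_exit();
    return 0;
}
Running it prints ShutdownPostgres, then ShutdownXLOG, then AtProcExit_Buffers with the lock still held, which matches the nested call stack below.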
Here is the call stack:
(gdb) bt
#0 0x00007ff0c0c036d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7ff0b69001b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem@entry=0x7ff0b69001b8, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007ff0c0c037c8 in __new_sem_wait_slow (sem=0x7ff0b69001b8, abstime=0x0) at sem_waitcommon.c:181
#3 0x0000562077cc258a in PGSemaphoreLock (sema=0x7ff0b69001b8) at pg_sema.c:316
#4 0x0000562077d6289e in LWLockAcquire (lock=0x7ff0bef225c0, mode=LW_EXCLUSIVE) at /path/to/postgres/source/build/../src/backend/storage/lmgr/lwlock.c:1243
#5 0x0000562077d39087 in AbortBufferIO () at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:3988
#6 0x0000562077d373f9 in AtProcExit_Buffers (code=1, arg=0) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2473
#7 0x0000562077d47c32 in shmem_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272
#8 0x0000562077d47a5e in proc_exit_prepare (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194
#9 0x0000562077d479c6 in proc_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107
#10 0x0000562077eed1bc in errfinish (dummy=0) at /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541
#11 0x0000562077d6df1f in mdwrite (reln=0x562078a12a18, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7ff0b6fbdc00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843
#12 0x0000562077d70716 in smgrwrite (reln=0x562078a12a18, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7ff0b6fbdc00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650
#13 0x0000562077d37824 in FlushBuffer (buf=0x7ff0b6e22f80, reln=0x562078a12a18) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751
#14 0x0000562077d37219 in SyncOneBuffer (buf_id=78, skip_recently_used=false, wb_context=0x7fffb0e3e230) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394
#15 0x0000562077d36b00 in BufferSync (flags=5) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984
#16 0x0000562077d3757f in CheckPointBuffers (flags=5) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578
#17 0x0000562077a1285b in CheckPointGuts (checkPointRedo=24049152, flags=5) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149
#18 0x0000562077a122cf in CreateCheckPoint (flags=5) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937
#19 0x0000562077a1164f in ShutdownXLOG (code=1, arg=0) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8485
#20 0x0000562077d47c32 in shmem_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272
#21 0x0000562077d47a5e in proc_exit_prepare (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194
#22 0x0000562077d479c6 in proc_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107
#23 0x0000562077eed1bc in errfinish (dummy=0) at /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541
#24 0x0000562077d6df1f in mdwrite (reln=0x562078a12a18, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7ff0b6fbdc00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843
#25 0x0000562077d70716 in smgrwrite (reln=0x562078a12a18, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7ff0b6fbdc00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650
#26 0x0000562077d37824 in FlushBuffer (buf=0x7ff0b6e22f80, reln=0x562078a12a18) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751
#27 0x0000562077d37219 in SyncOneBuffer (buf_id=78, skip_recently_used=false, wb_context=0x7fffb0e3fb10) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394
#28 0x0000562077d36b00 in BufferSync (flags=44) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984
#29 0x0000562077d3757f in CheckPointBuffers (flags=44) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578
#30 0x0000562077a1285b in CheckPointGuts (checkPointRedo=24049152, flags=44) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149
#31 0x0000562077a122cf in CreateCheckPoint (flags=44) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937
#32 0x0000562077cca792 in RequestCheckpoint (flags=44) at /path/to/postgres/source/build/../src/backend/postmaster/checkpointer.c:976
#33 0x0000562077d7bce4 in standard_ProcessUtility (pstmt=0x562078a00b50, queryString=0x562078a00100 "checkpoint;\n", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x5620783ac5e0 <debugtupDR>, completionTag=0x7fffb0e41520 "") at /path/to/postgres/source/build/../src/backend/tcop/utility.c:769
#34 0x0000562077d7b204 in ProcessUtility (pstmt=0x562078a00b50, queryString=0x562078a00100 "checkpoint;\n", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5620783ac5e0 <debugtupDR>,
completionTag=0x7fffb0e41520 "") at /path/to/postgres/source/build/../src/backend/tcop/utility.c:360
#35 0x0000562077d7a347 in PortalRunUtility (portal=0x5620789f20c0, pstmt=0x562078a00b50, isTopLevel=true, setHoldSnapshot=false, dest=0x5620783ac5e0 <debugtupDR>, completionTag=0x7fffb0e41520 "")
at /path/to/postgres/source/build/../src/backend/tcop/pquery.c:1178
#36 0x0000562077d7a534 in PortalRunMulti (portal=0x5620789f20c0, isTopLevel=true, setHoldSnapshot=false, dest=0x5620783ac5e0 <debugtupDR>, altdest=0x5620783ac5e0 <debugtupDR>,
completionTag=0x7fffb0e41520 "") at /path/to/postgres/source/build/../src/backend/tcop/pquery.c:1324
#37 0x0000562077d79a61 in PortalRun (portal=0x5620789f20c0, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x5620783ac5e0 <debugtupDR>, altdest=0x5620783ac5e0 <debugtupDR>,
completionTag=0x7fffb0e41520 "") at /path/to/postgres/source/build/../src/backend/tcop/pquery.c:799
#38 0x0000562077d734c5 in exec_simple_query (query_string=0x562078a00100 "checkpoint;\n") at /path/to/postgres/source/build/../src/backend/tcop/postgres.c:1145
#39 0x0000562077d77bd5 in PostgresMain (argc=7, argv=0x562078980d20, dbname=0x5620789a42b0 "postgres", username=0x5620789897d0 "chengyu") at /path/to/postgres/source/build/../src/backend/tcop/postgres.c:4182
#40 0x0000562077c01e8b in main (argc=7, argv=0x562078980d20) at /path/to/postgres/source/build/../src/backend/main/main.c:224
(gdb) p on_shmem_exit_list
$9 = {{function = 0x562077cc2f5f <AnonymousShmemDetach>, arg = 0},
{function = 0x562077cc2a4f <IpcMemoryDelete>, arg = 2457627},
{function = 0x562077cc29f8 <IpcMemoryDetach>, arg = 140672005165056},
{function = 0x562077cc242e <ReleaseSemaphores>, arg = 0},
{function = 0x562077d4648a <dsm_postmaster_shutdown>, arg = 140672005165056},
{function = 0x562077d5e053 <ProcKill>, arg = 0},
{function = 0x562077d5e02d <RemoveProcFromArray>, arg = 0},
{function = 0x562077d514e8 <CleanupInvalidationState>, arg = 140671978006400},
{function = 0x562077d4e46f <CleanupProcSignalState>, arg = 1},
{function = 0x562077d373e5 <AtProcExit_Buffers>, arg = 0},
{function = 0x562077a1159d <ShutdownXLOG>, arg = 0},
{function = 0x562077cd0637 <pgstat_beshutdown_hook>, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}}
(gdb) p before_shmem_exit_list
$10 = {{function = 0x562077f02caa <ShutdownPostgres>, arg = 0}, {function = 0x0, arg = 0} <repeats 19 times>}
OK, now we understand the deadlock issue in single user mode. But does it also affect multi-user mode (i.e., processes running under the postmaster)? There are 3 cases to consider:
1. Startup process: by the time StartupXLOG() is invoked, ShutdownAuxiliaryProcess(), which releases all lw-locks, has already been registered in before_shmem_exit_list[]. So this case is safe.
2. Checkpointer process: ShutdownXLOG() is not registered as a before/on_shmem_exit callback; instead, it is only invoked from the main loop. So there is no chance of hitting an IO error a second time during the shared-memory exit callbacks. Also, as with the startup process, ShutdownAuxiliaryProcess() has been registered. So this case is safe as well.
3. Other backend/background processes: these processes do not handle XLOG startup or shutdown, and are protected by ShutdownAuxiliaryProcess(). So they are safe to exit too.
In addition, we have done multiple experiments to confirm these cases.
Affected versions: we found this issue in 9.5, 9.6, 10, 11 and 12devel.
Fix proposal:
Given the 2 types of deadlock in single user mode discussed above, there might be multiple ways to fix this issue. In the proposal we would like to present, we register a new callback that releases all the lw-locks (just like ShutdownAuxiliaryProcess() does), placed after AtProcExit_Buffers() and before ShutdownXLOG() in the registration order. It is also registered before PG enters StartupXLOG(), so it covers the case where ShutdownPostgres() has not been registered yet (see the small sketch after the patch below for the resulting callback order). Here is the content of the proposal:
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 62baaf0ab3..d74e8aa1d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -71,6 +71,7 @@ static HeapTuple GetDatabaseTupleByOid(Oid dboid);
static void PerformAuthentication(Port *port);
static void CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connections);
static void InitCommunication(void);
+static void ReleaseLWLocks(int code, Datum arg);
static void ShutdownPostgres(int code, Datum arg);
static void StatementTimeoutHandler(void);
static void LockTimeoutHandler(void);
@@ -653,6 +654,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* way, start up the XLOG machinery, and register to have it closed
* down at exit.
*/
+ on_shmem_exit(ReleaseLWLocks, 0);
StartupXLOG();
on_shmem_exit(ShutdownXLOG, 0);
}
@@ -1214,6 +1216,23 @@ process_settings(Oid databaseid, Oid roleid)
heap_close(relsetting, AccessShareLock);
}
+/*
+ * There are 2 cases where buffer locks are still held when AtProcExit_Buffers() is
+ * invoked in a bootstrap process or a standalone backend:
+ * (1) Exceptions thrown during StartupXLOG()
+ * (2) Exceptions thrown during exception-handling in ShutdownXLOG()
+ * So we need this on_shmem_exit callback for single user mode.
+ * For processes under postmaster, ShutdownAuxiliaryProcess() will release
+ * the lw-locks and ShutdownXLOG() is not registered as a callback, so there
+ * is no such issue. Also, please note this callback should be registered in
+ * the order after AtProcExit_Buffers() and before ShutdownXLOG().
+ */
+static void
+ReleaseLWLocks(int code, Datum arg)
+{
+ LWLockReleaseAll();
+}
+
/*
* Backend-shutdown callback. Do cleanup that we want to be sure happens
* before all the supporting modules begin to nail their doors shut via
The fix proposal is also attached to this email in file "fix-deadlock.patch".
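To sanity-check the ordering argument, here is a self-contained toy model of the step #5 recovery case with the proposed callback in place (again heavily simplified, not PostgreSQL code; the names are only stand-ins): the error is raised inside StartupXLOG(), before ShutdownXLOG() or ShutdownPostgres() has been registered, yet the new callback still runs before AtProcExit_Buffers().
/*
 * Toy model only (not PostgreSQL code): with a lock-releasing callback
 * registered just before StartupXLOG(), a failure during recovery no longer
 * leaves the io lock held when AtProcExit_Buffers() runs.
 */
#include <stdbool.h>
#include <stdio.h>

typedef void (*cb_fn) (void);

static cb_fn on_list[20];
static int   on_index = 0;
static bool  io_lock_held = false;

static void
my_shmem_exit(void)
{
    while (--on_index >= 0)
        on_list[on_index]();
}

/* stand-in for AtProcExit_Buffers()/AbortBufferIO() */
static void
at_proc_exit_buffers(void)
{
    printf("AtProcExit_Buffers: io lock %s\n",
           io_lock_held ? "still held -> blocks forever" : "free -> exits cleanly");
}

/* stand-in for the proposed ReleaseLWLocks() callback */
static void
release_lwlocks(void)
{
    io_lock_held = false;
    puts("ReleaseLWLocks: lw-locks released");
}

int
main(void)
{
    on_list[on_index++] = at_proc_exit_buffers; /* registered during buffer pool init */
    on_list[on_index++] = release_lwlocks;      /* registered just before StartupXLOG() */

    /* StartupXLOG(): the end-of-recovery checkpoint fails in mdwrite() */
    io_lock_held = true;
    puts("StartupXLOG: mdwrite fails with the io lock held -> proc_exit");
    my_shmem_exit();
    return 0;
}
Because ShutdownXLOG() is registered after the new callback, the same ordering also covers the checkpoint case in step #4: on the nested exit, the lock release runs between ShutdownXLOG() and AtProcExit_Buffers().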
Please let us know should you have suggestions on this issue and the fix.
Thank you!
Best regards,
--
Chengchao Yu
Software Engineer | Microsoft | Azure Database for PostgreSQL
Attachments:
emulate-error.patch:
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 10e0271473..a9b71058fe 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -840,6 +840,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
nbytes,
BLCKSZ);
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("Emulate exception in mdwrite() when writing to disk")));
+
if (nbytes != BLCKSZ)
{
if (nbytes < 0)
fix-deadlock.patch:
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 62baaf0ab3..d74e8aa1d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -71,6 +71,7 @@ static HeapTuple GetDatabaseTupleByOid(Oid dboid);
static void PerformAuthentication(Port *port);
static void CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connections);
static void InitCommunication(void);
+static void ReleaseLWLocks(int code, Datum arg);
static void ShutdownPostgres(int code, Datum arg);
static void StatementTimeoutHandler(void);
static void LockTimeoutHandler(void);
@@ -653,6 +654,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* way, start up the XLOG machinery, and register to have it closed
* down at exit.
*/
+ on_shmem_exit(ReleaseLWLocks, 0);
StartupXLOG();
on_shmem_exit(ShutdownXLOG, 0);
}
@@ -1214,6 +1216,23 @@ process_settings(Oid databaseid, Oid roleid)
heap_close(relsetting, AccessShareLock);
}
+/*
+ * There are 2 cases where buffer locks are still held when AtProcExit_Buffers() is
+ * invoked in a bootstrap process or a standalone backend:
+ * (1) Exceptions thrown during StartupXLOG()
+ * (2) Exceptions thrown during exception-handling in ShutdownXLOG()
+ * So we need this on_shmem_exit callback for single user mode.
+ * For processes under postmaster, ShutdownAuxiliaryProcess() will release
+ * the lw-locks and ShutdownXLOG() is not registered as a callback, so there
+ * is no such issue. Also, please note this callback should be registered in
+ * the order after AtProcExit_Buffers() and before ShutdownXLOG().
+ */
+static void
+ReleaseLWLocks(int code, Datum arg)
+{
+ LWLockReleaseAll();
+}
+
/*
* Backend-shutdown callback. Do cleanup that we want to be sure happens
* before all the supporting modules begin to nail their doors shut via
Hey Chengyu,
How did you set up your GDB to get "p" to pretty-print a Postgres list?
Cheers,
Jesse
On Fri, Nov 30, 2018 at 1:00 PM Chengchao Yu <chengyu@microsoft.com> wrote:
Hi Jesse,
Actually, before_shmem_exit_list and on_shmem_exit_list are both plain arrays; they are of type "struct ONEXIT [20]".
For the pretty-printing, I just edited the output manually to make it easier for the reader to follow.
But since you asked, there does exist a GDB command, "set print array on". After that, the "p" command will print one array entry per line.
Thanks,
Chengchao
From: Jesse Zhang <sbjesse@gmail.com>
Sent: Saturday, December 1, 2018 9:55 AM
To: Chengchao Yu <chengyu@microsoft.com>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
#26 0x0000562077d37824 in FlushBuffer (buf=0x7ff0b6e22f80, reln=0x562078a12a18) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751
#27 0x0000562077d37219 in SyncOneBuffer (buf_id=78, skip_recently_used=false, wb_context=0x7fffb0e3fb10) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394
#28 0x0000562077d36b00 in BufferSync (flags=44) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984
#29 0x0000562077d3757f in CheckPointBuffers (flags=44) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578
#30 0x0000562077a1285b in CheckPointGuts (checkPointRedo=24049152, flags=44) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149
#31 0x0000562077a122cf in CreateCheckPoint (flags=44) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937
#32 0x0000562077cca792 in RequestCheckpoint (flags=44) at /path/to/postgres/source/build/../src/backend/postmaster/checkpointer.c:976
#33 0x0000562077d7bce4 in standard_ProcessUtility (pstmt=0x562078a00b50, queryString=0x562078a00100 "checkpoint;\n", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x5620783ac5e0 <debugtupDR>, completionTag=0x7fffb0e41520 "") at /path/to/postgres/source/build/../src/backend/tcop/utility.c:769
#34 0x0000562077d7b204 in ProcessUtility (pstmt=0x562078a00b50, queryString=0x562078a00100 "checkpoint;\n", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x5620783ac5e0 <debugtupDR>,
completionTag=0x7fffb0e41520 "") at /path/to/postgres/source/build/../src/backend/tcop/utility.c:360
#35 0x0000562077d7a347 in PortalRunUtility (portal=0x5620789f20c0, pstmt=0x562078a00b50, isTopLevel=true, setHoldSnapshot=false, dest=0x5620783ac5e0 <debugtupDR>, completionTag=0x7fffb0e41520 "")
at /path/to/postgres/source/build/../src/backend/tcop/pquery.c:1178
#36 0x0000562077d7a534 in PortalRunMulti (portal=0x5620789f20c0, isTopLevel=true, setHoldSnapshot=false, dest=0x5620783ac5e0 <debugtupDR>, altdest=0x5620783ac5e0 <debugtupDR>,
completionTag=0x7fffb0e41520 "") at /path/to/postgres/source/build/../src/backend/tcop/pquery.c:1324
#37 0x0000562077d79a61 in PortalRun (portal=0x5620789f20c0, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x5620783ac5e0 <debugtupDR>, altdest=0x5620783ac5e0 <debugtupDR>,
completionTag=0x7fffb0e41520 "") at /path/to/postgres/source/build/../src/backend/tcop/pquery.c:799
#38 0x0000562077d734c5 in exec_simple_query (query_string=0x562078a00100 "checkpoint;\n") at /path/to/postgres/source/build/../src/backend/tcop/postgres.c:1145
#39 0x0000562077d77bd5 in PostgresMain (argc=7, argv=0x562078980d20, dbname=0x5620789a42b0 "postgres", username=0x5620789897d0 "chengyu") at /path/to/postgres/source/build/../src/backend/tcop/postgres.c:4182
#40 0x0000562077c01e8b in main (argc=7, argv=0x562078980d20) at /path/to/postgres/source/build/../src/backend/main/main.c:224
(gdb) p on_shmem_exit_list
$9 = {{function = 0x562077cc2f5f <AnonymousShmemDetach>, arg = 0},
{function = 0x562077cc2a4f <IpcMemoryDelete>, arg = 2457627},
{function = 0x562077cc29f8 <IpcMemoryDetach>, arg = 140672005165056},
{function = 0x562077cc242e <ReleaseSemaphores>, arg = 0},
{function = 0x562077d4648a <dsm_postmaster_shutdown>, arg = 140672005165056},
{function = 0x562077d5e053 <ProcKill>, arg = 0},
{function = 0x562077d5e02d <RemoveProcFromArray>, arg = 0},
{function = 0x562077d514e8 <CleanupInvalidationState>, arg = 140671978006400},
{function = 0x562077d4e46f <CleanupProcSignalState>, arg = 1},
{function = 0x562077d373e5 <AtProcExit_Buffers>, arg = 0},
{function = 0x562077a1159d <ShutdownXLOG>, arg = 0},
{function = 0x562077cd0637 <pgstat_beshutdown_hook>, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}, {function = 0x0, arg = 0}}
(gdb) p before_shmem_exit_list
$10 = {{function = 0x562077f02caa <ShutdownPostgres>, arg = 0}, {function = 0x0, arg = 0} <repeats 19 times>}
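To make the callback behaviour above concrete, here is a small standalone C sketch (illustration only, not PostgreSQL source; names such as fake_shmem_exit are made up) of the pop-before-call pattern used by the exit-callback lists: the list index is decremented before each callback runs, so a nested exit triggered from inside a callback never re-runs the callbacks that were already popped, including the one that released the lw-locks.
/*
 * Standalone sketch of the pop-before-call pattern (illustration only,
 * not PostgreSQL source; all names here are made up).
 */
#include <stdio.h>
#include <stdlib.h>

typedef void (*exit_cb) (int code);

static exit_cb cb_list[8];
static int cb_index = 0;

static void fake_shmem_exit(int code);

static void
release_locks_cb(int code)
{
	printf("releasing lw-locks (runs only once)\n");
}

static void
shutdown_xlog_cb(int code)
{
	printf("flushing buffers again ... second IO error!\n");
	fake_shmem_exit(code);		/* nested exit, like errfinish() -> proc_exit() */
}

static void
abort_buffer_io_cb(int code)
{
	printf("AbortBufferIO-like step: the lock was never released again -> would block\n");
}

static void
fake_shmem_exit(int code)
{
	/*
	 * The index is decremented before each call, so a nested exit never
	 * re-runs callbacks that were already popped (e.g. the lock release).
	 */
	while (--cb_index >= 0)
		cb_list[cb_index] (code);
	exit(code);
}

int
main(void)
{
	cb_list[cb_index++] = abort_buffer_io_cb;	/* registered first, runs last */
	cb_list[cb_index++] = shutdown_xlog_cb;
	cb_list[cb_index++] = release_locks_cb;		/* registered last, runs first */
	fake_shmem_exit(1);
	return 0;
}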
OK, now we understand the deadlock issue in single user mode. However, does this issue affect multi-user mode (i.e. processes under the postmaster)? There are 3 cases to consider:
a. Startup process: by the time StartupXLOG() is invoked, ShutdownAuxiliaryProcess(), which releases all the lw-locks, has already been registered in before_shmem_exit_list[]. So this case is safe.
b. Checkpointer process: ShutdownXLOG() is not registered as a before/on_shmem_exit callback; instead, it is only invoked in the main loop. So there is no chance to hit an IO error a second time during the shared-memory exit callbacks. Also, as with the startup process, ShutdownAuxiliaryProcess() has been registered. So this case is also safe.
c. Other backend/background processes: these processes do not handle XLOG startup or shutdown, and are protected by ShutdownAuxiliaryProcess(). So they are safe to exit too.
In addition, we have done multiple experiments to confirm these cases.
Affected versions: we found this issue in 9.5, 9.6, 10, 11 and 12devel.
Fix proposal:
Given the 2 types of deadlock in single user mode discussed above, there may be multiple ways to fix this issue. In the fix proposal we would like to present, we register a new callback that releases all the lw-locks (just like ShutdownAuxiliaryProcess() does), ordered after AtProcExit_Buffers() and before ShutdownXLOG(). Also, it is registered before PG enters StartupXLOG(), so it covers the case where ShutdownPostgres() has not yet been registered. Here is the content of the proposal:
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 62baaf0ab3..d74e8aa1d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -71,6 +71,7 @@ static HeapTuple GetDatabaseTupleByOid(Oid dboid);
static void PerformAuthentication(Port *port);
static void CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connections);
static void InitCommunication(void);
+static void ReleaseLWLocks(int code, Datum arg);
static void ShutdownPostgres(int code, Datum arg);
static void StatementTimeoutHandler(void);
static void LockTimeoutHandler(void);
@@ -653,6 +654,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* way, start up the XLOG machinery, and register to have it closed
* down at exit.
*/
+ on_shmem_exit(ReleaseLWLocks, 0);
StartupXLOG();
on_shmem_exit(ShutdownXLOG, 0);
}
@@ -1214,6 +1216,23 @@ process_settings(Oid databaseid, Oid roleid)
heap_close(relsetting, AccessShareLock);
}
+/*
+ * There are 2 types of buffer locks on-holding when AtProcExit_Buffers() is
+ * invoked in a bootstrap process or a standalone backend:
+ * (1) Exceptions thrown during StartupXLOG()
+ * (2) Exceptions thrown during exception-handling in ShutdownXLOG()
+ * So we need this on_shmem_exit callback for single user mode.
+ * For processes under postmaster, ShutdownAuxiliaryProcess() will release
+ * the lw-locks and ShutdownXLOG() is not registered as a callback, so there
+ * is no such issue. Also, please note this callback should be registered in
+ * the order after AtProcExit_buffers() and before ShutdownXLOG().
+ */
+static void
+ReleaseLWLocks(int code, Datum arg)
+{
+ LWLockReleaseAll();
+}
+
/*
* Backend-shutdown callback. Do cleanup that we want to be sure happens
* before all the supporting modules begin to nail their doors shut via
The fix proposal is also attached to this email in file “fix-deadlock.patch”.
Please let us know should you have suggestions on this issue and the fix.
Thank you!
Best regards,
--
Chengchao Yu
Software Engineer | Microsoft | Azure Database for PostgreSQL
On Sat, Dec 1, 2018 at 2:30 AM Chengchao Yu <chengyu@microsoft.com> wrote:
Recently, we hit a few occurrences of deadlock when IO failure (including disk full, random remote disk IO failures) happens in single user mode. We found the issue exists on both Linux and Windows in multiple postgres versions.
3. Because the unable to write relation data scenario is difficult to hit naturally even reserved space is turned off, I have prepared a small patch (see attachment “emulate-error.patch”) to force an error when PG tries to write data to relation files. We can just apply the patch and there is no need to put efforts flooding data to disk any more.
I have one question related to the way you have tried to emulate the error.
@@ -840,6 +840,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
nbytes,
BLCKSZ);
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("Emulate exception in mdwrite() when writing to disk")));
+
We generally reserve the space in a relation before attempting to
write, so not sure how you are able to hit the disk full situation via
mdwrite. If you see the description of the function, that also
indicates same.
/*
* mdwrite() -- Write the supplied block at the appropriate location.
*
* This is to be used only for updating already-existing blocks of a
* relation (ie, those before the current EOF). To extend a relation,
* use mdextend().
*/
I am not telling that mdwrite can never lead to error, but just trying
to understand the issue you actually faced. I haven't read your
proposed solution yet, let's first try to establish the problem you
are facing.
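For illustration only, here is a minimal standalone sketch of that extend-then-overwrite discipline using plain POSIX file I/O (this is not md.c, and the file name is made up): space is normally claimed when a block is first extended with zeroes, so a later overwrite of that block usually needs no new space, although the overwrite can still fail with EIO, or with ENOSPC on copy-on-write filesystems.
/* Standalone illustration only (not md.c): extend with a zero-filled block
 * first, then overwrite the already-allocated block later. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define DEMO_BLCKSZ 8192

int
main(void)
{
	char		block[DEMO_BLCKSZ];
	int			fd = open("demo_relation_segment", O_RDWR | O_CREAT, 0600);	/* made-up file */

	if (fd < 0)
	{
		perror("open");
		return 1;
	}

	/* "extend"-like step: append a zeroed block; a disk-full error would
	 * normally surface here, while space is being allocated. */
	memset(block, 0, sizeof(block));
	if (pwrite(fd, block, DEMO_BLCKSZ, 0) != DEMO_BLCKSZ)
	{
		perror("extend");
		return 1;
	}

	/* "write"-like step: overwrite the existing block; on most filesystems
	 * this needs no new space, but EIO (or ENOSPC on COW filesystems) can
	 * still make it fail. */
	memset(block, 'x', sizeof(block));
	if (pwrite(fd, block, DEMO_BLCKSZ, 0) != DEMO_BLCKSZ)
	{
		perror("overwrite");
		return 1;
	}

	close(fd);
	return 0;
}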
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 20, 2019 at 4:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Dec 1, 2018 at 2:30 AM Chengchao Yu <chengyu@microsoft.com> wrote:
Recently, we hit a few occurrences of deadlock when IO failure (including disk full, random remote disk IO failures) happens in single user mode. We found the issue exists on both Linux and Windows in multiple postgres versions.
3. Because the unable to write relation data scenario is difficult to hit naturally even reserved space is turned off, I have prepared a small patch (see attachment “emulate-error.patch”) to force an error when PG tries to write data to relation files. We can just apply the patch and there is no need to put efforts flooding data to disk any more.
I have one question related to the way you have tried to emulate the error.
@@ -840,6 +840,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
nbytes,
BLCKSZ);
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("Emulate exception in mdwrite() when writing to disk")));
+
We generally reserve the space in a relation before attempting to
write, so not sure how you are able to hit the disk full situation via
mdwrite. If you see the description of the function, that also
indicates same.
Presumably ZFS or BTRFS or something more exotic could still get
ENOSPC here, and of course any filesystem could give us EIO here
(because the disk is on fire or the remote NFS server is rebooting due
to an automatic Windows update).
--
Thomas Munro
http://www.enterprisedb.com
Hi Amit, Thomas,
Thank you very much for your feedback! Apologies, but I just saw both messages.
We generally reserve the space in a relation before attempting to write, so not sure how you are able to hit the disk full situation via mdwrite. If you see the description of the function, that also indicates same.
Absolutely agree, this isn’t a PG issue. The issue manifests for us at Microsoft due to our own storage layer, which treats mdextend() actions as only setting the length of the file. We have a workaround, and no change is needed in Postgres.
I am not telling that mdwrite can never lead to error, but just trying to understand the issue you actually faced. I haven't read your proposed solution yet, let's first try to establish the problem you are facing.
We see transient IO errors reading a block, after which the PG instance deadlocks in single user mode until we kill it. The stack trace below shows the behavior when mdread() fails while the buffer’s lw-lock is held. This happens because in single user mode there is no callback to release the lock, so when AbortBufferIO() tries to acquire the same lock again, it waits for the lock indefinitely.
Here is the stack trace:
0a 00000004`8080cc30 00000004`80dcf917 postgres!PGSemaphoreLock+0x65 [d:\orcasqlagsea10\14\s\src\backend\port\win32_sema.c @ 158]
0b 00000004`8080cc90 00000004`80db025c postgres!LWLockAcquire+0x137 [d:\orcasqlagsea10\14\s\src\backend\storage\lmgr\lwlock.c @ 1234]
0c 00000004`8080ccd0 00000004`80db25db postgres!AbortBufferIO+0x2c [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 3995]
0d 00000004`8080cd20 00000004`80dbce36 postgres!AtProcExit_Buffers+0xb [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 2479]
0e 00000004`8080cd50 00000004`80dbd1bd postgres!shmem_exit+0xf6 [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 262]
0f 00000004`8080cd80 00000004`80dbccfd postgres!proc_exit_prepare+0x4d [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 188]
10 00000004`8080cdb0 00000004`80ef9e74 postgres!proc_exit+0xd [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 141]
11 00000004`8080cde0 00000004`80ddb6ef postgres!errfinish+0x204 [d:\orcasqlagsea10\14\s\src\backend\utils\error\elog.c @ 624]
12 00000004`8080ce50 00000004`80db0f59 postgres!mdread+0x12f [d:\orcasqlagsea10\14\s\src\backend\storage\smgr\md.c @ 806]
13 00000004`8080cea0 00000004`80daeb70 postgres!ReadBuffer_common+0x2c9 [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 897]
14 00000004`8080cf30 00000004`80b81322 postgres!ReadBufferWithoutRelcache+0x60 [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 694]
15 00000004`8080cf90 00000004`80db9cbb postgres!XLogReadBufferExtended+0x142 [d:\orcasqlagsea10\14\s\src\backend\access\transam\xlogutils.c @ 513]
16 00000004`8080cff0 00000004`80b2f53a postgres!XLogRecordPageWithFreeSpace+0xbb [d:\orcasqlagsea10\14\s\src\backend\storage\freespace\freespace.c @ 254]
17 00000004`8080d030 00000004`80b6eb94 postgres!heap_xlog_insert+0x36a [d:\orcasqlagsea10\14\s\src\backend\access\heap\heapam.c @ 8491]
18 00000004`8080f0d0 00000004`80f0a13f postgres!StartupXLOG+0x1f84 [d:\orcasqlagsea10\14\s\src\backend\access\transam\xlog.c @ 7480]
19 00000004`8080fbf0 00000004`80de121e postgres!InitPostgres+0x12f [d:\orcasqlagsea10\14\s\src\backend\utils\init\postinit.c @ 656]
1a 00000004`8080fcd0 00000004`80c92c31 postgres!PostgresMain+0x25e [d:\orcasqlagsea10\14\s\src\backend\tcop\postgres.c @ 3881]
1b 00000004`8080fed0 00000004`80f51df3 postgres!main+0x491 [d:\orcasqlagsea10\14\s\src\backend\main\main.c @ 235]
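The trace boils down to the process re-acquiring, at exit time, a non-reentrant lock it already holds. Here is a rough standalone sketch of that pattern (illustration only, not PostgreSQL code; sem_trywait is used so the demo reports the problem instead of actually hanging):
/* Standalone sketch (illustration only): a non-reentrant lock held at error
 * time is re-acquired by the same process during exit cleanup. */
#include <errno.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t io_lock;			/* models a buffer's non-reentrant lw-lock */

static void
cleanup_like_abort_buffer_io(void)
{
	/*
	 * The real cleanup does a blocking acquire; with the lock still held by
	 * this very process, and nobody else around in single user mode, that
	 * wait would never return.
	 */
	if (sem_trywait(&io_lock) != 0 && errno == EAGAIN)
		printf("lock still held by this process: a blocking acquire would never return\n");
}

int
main(void)
{
	sem_init(&io_lock, 0, 1);

	sem_wait(&io_lock);			/* IO starts: lock taken */
	/* ... the read/write fails here; the error path runs exit cleanup
	 * without anything having released the lock ... */
	cleanup_like_abort_buffer_io();

	sem_destroy(&io_lock);
	return 0;
}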
Please let us know should you have more feedback. Thank you!
Best regards,
--
Chengchao Yu
Software Engineer | Microsoft | Azure Database for PostgreSQL
https://azure.microsoft.com/en-us/services/postgresql/
-----Original Message-----
From: Thomas Munro <thomas.munro@enterprisedb.com>
Sent: Thursday, January 24, 2019 2:32 PM
To: Amit Kapila <amit.kapila16@gmail.com>
Cc: Chengchao Yu <chengyu@microsoft.com>; Pg Hackers <pgsql-hackers@postgresql.org>; Prabhat Tripathi <ptrip@microsoft.com>; Sunil Kamath <Sunil.Kamath@microsoft.com>; Michal Primke <mprimke@microsoft.com>
Subject: Re: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
On Sun, Jan 20, 2019 at 4:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Dec 1, 2018 at 2:30 AM Chengchao Yu <chengyu@microsoft.com> wrote:
Recently, we hit a few occurrences of deadlock when IO failure (including disk full, random remote disk IO failures) happens in single user mode. We found the issue exists on both Linux and Windows in multiple postgres versions.
3. Because the unable to write relation data scenario is difficult to hit naturally even reserved space is turned off, I have prepared a small patch (see attachment “emulate-error.patch”) to force an error when PG tries to write data to relation files. We can just apply the patch and there is no need to put efforts flooding data to disk any more.
I have one question related to the way you have tried to emulate the error.
@@ -840,6 +840,10 @@ mdwrite(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
nbytes,
BLCKSZ);
+ ereport(ERROR,
+ (errcode(ERRCODE_INTERNAL_ERROR),
+ errmsg("Emulate exception in mdwrite() when writing to disk")));
+
We generally reserve the space in a relation before attempting to
write, so not sure how you are able to hit the disk full situation via
mdwrite. If you see the description of the function, that also
indicates same.
Presumably ZFS or BTRFS or something more exotic could still get ENOSPC here, and of course any filesystem could give us EIO here (because the disk is on fire or the remote NFS server is rebooting due to an automatic Windows update).
On Sat, Feb 2, 2019 at 4:42 AM Chengchao Yu <chengyu@microsoft.com> wrote:
Hi Amit, Thomas,
Thank you very much for your feedback! Apologies, but I just saw both messages.
We generally reserve the space in a relation before attempting to write, so not sure how you are able to hit the disk full situation via mdwrite. If you see the description of the function, that also indicates same.
Absolutely agree, this isn’t a PG issue. The issue manifests for us at Microsoft due to our own storage layer, which treats mdextend() actions as only setting the length of the file. We have a workaround, and no change is needed in Postgres.
I am not telling that mdwrite can never lead to error, but just trying to understand the issue you actually faced. I haven't read your proposed solution yet, let's first try to establish the problem you are facing.
We see transient IO errors reading a block, after which the PG instance deadlocks in single user mode until we kill it. The stack trace below shows the behavior when mdread() fails while the buffer’s lw-lock is held. This happens because in single user mode there is no callback to release the lock, so when AbortBufferIO() tries to acquire the same lock again, it waits for the lock indefinitely.
I think you can register your patch for next CF [1] so that we don't
forget about it.
[1]: https://commitfest.postgresql.org/22/
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thank you so much Amit! I have created the patch below:
https://commitfest.postgresql.org/22/2003/
Please let me know should you have more suggestions. Thank you!
Best regards,
--
Chengchao Yu
Software Engineer | Microsoft | Azure Database for PostgreSQL
https://azure.microsoft.com/en-us/services/postgresql/
-----Original Message-----
From: Amit Kapila <amit.kapila16@gmail.com>
Sent: Friday, February 1, 2019 6:58 PM
To: Chengchao Yu <chengyu@microsoft.com>
Cc: Thomas Munro <thomas.munro@enterprisedb.com>; Pg Hackers <pgsql-hackers@postgresql.org>; Prabhat Tripathi <ptrip@microsoft.com>; Sunil Kamath <Sunil.Kamath@microsoft.com>; Michal Primke <mprimke@microsoft.com>; TEJA Mupparti <Tejeswar.Mupparti@microsoft.com>
Subject: Re: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
On Sat, Feb 2, 2019 at 4:42 AM Chengchao Yu <chengyu@microsoft.com> wrote:
Hi Amit, Thomas,
Thank you very much for your feedback! Apologies, but I just saw both messages.
We generally reserve the space in a relation before attempting to write, so not sure how you are able to hit the disk full situation via mdwrite. If you see the description of the function, that also indicates same.
Absolutely agree, this isn’t a PG issue. The issue manifests for us at Microsoft due to our own storage layer, which treats mdextend() actions as only setting the length of the file. We have a workaround, and no change is needed in Postgres.
I am not telling that mdwrite can never lead to error, but just trying to understand the issue you actually faced. I haven't read your proposed solution yet, let's first try to establish the problem you are facing.
We see transient IO errors reading a block, after which the PG instance deadlocks in single user mode until we kill it. The stack trace below shows the behavior when mdread() fails while the buffer’s lw-lock is held. This happens because in single user mode there is no callback to release the lock, so when AbortBufferIO() tries to acquire the same lock again, it waits for the lock indefinitely.
I think you can register your patch for next CF [1] https://commitfest.postgresql.org/22/ so that we don't forget about it.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments:
fix-deadlock.patch (application/octet-stream)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 62baaf0ab3..d74e8aa1d5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -71,6 +71,7 @@ static HeapTuple GetDatabaseTupleByOid(Oid dboid);
static void PerformAuthentication(Port *port);
static void CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connections);
static void InitCommunication(void);
+static void ReleaseLWLocks(int code, Datum arg);
static void ShutdownPostgres(int code, Datum arg);
static void StatementTimeoutHandler(void);
static void LockTimeoutHandler(void);
@@ -653,6 +654,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
* way, start up the XLOG machinery, and register to have it closed
* down at exit.
*/
+ on_shmem_exit(ReleaseLWLocks, 0);
StartupXLOG();
on_shmem_exit(ShutdownXLOG, 0);
}
@@ -1214,6 +1216,23 @@ process_settings(Oid databaseid, Oid roleid)
heap_close(relsetting, AccessShareLock);
}
+/*
+ * There are 2 types of buffer locks on-holding when AtProcExit_Buffers() is
+ * invoked in a bootstrap process or a standalone backend:
+ * (1) Exceptions thrown during StartupXLOG()
+ * (2) Exceptions thrown during exception-handling in ShutdownXLOG()
+ * So we need this on_shmem_exit callback for single user mode.
+ * For processes under postmaster, ShutdownAuxiliaryProcess() will release
+ * the lw-locks and ShutdownXLOG() is not registered as a callback, so there
+ * is no such issue. Also, please note this callback should be registered in
+ * the order after AtProcExit_buffers() and before ShutdownXLOG().
+ */
+static void
+ReleaseLWLocks(int code, Datum arg)
+{
+ LWLockReleaseAll();
+}
+
/*
* Backend-shutdown callback. Do cleanup that we want to be sure happens
* before all the supporting modules begin to nail their doors shut via
Hi Amit,
Greetings! Thank you so much for your previous feedback!
It seems the old patch no longer applies to the latest master branch. So I have rebased the patch; it now applies to the latest master without conflicts.
Btw, since the commitfest entry was created (https://commitfest.postgresql.org/22/2003/),
have you seen any places that could be improved? Could you give some suggestions? Thank you!
Best regards,
--
Chengchao Yu
Software Engineer | Microsoft | Azure Database for PostgreSQL
https://azure.microsoft.com/en-us/services/postgresql/
-----Original Message-----
From: Chengchao Yu
Sent: Monday, February 18, 2019 6:08 PM
To: Amit Kapila <amit.kapila16@gmail.com>
Cc: Thomas Munro <thomas.munro@enterprisedb.com>; Pg Hackers <pgsql-hackers@postgresql.org>; Prabhat Tripathi <ptrip@microsoft.com>; Sunil Kamath <Sunil.Kamath@microsoft.com>; Michal Primke <mprimke@microsoft.com>; TEJA Mupparti <Tejeswar.Mupparti@microsoft.com>
Subject: RE: [PATCH] Fix Proposal - Deadlock Issue in Single User Mode When IO Failure Occurs
Thank you so much Amit! I have created the patch below:
https://commitfest.postgresql.org/22/2003/
Please let me know should you have more suggestions. Thank you!
Best regards,
--
Chengchao Yu
Software Engineer | Microsoft | Azure Database for PostgreSQL https://azure.microsoft.com/en-us/services/postgresql/
Attachments:
fix-deadlock-v2.patch (application/octet-stream)
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..2f826bc9b5 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -68,6 +68,7 @@ static HeapTuple GetDatabaseTupleByOid(Oid dboid);
static void PerformAuthentication(Port *port);
static void CheckMyDatabase(const char *name, bool am_superuser, bool override_allow_connections);
static void InitCommunication(void);
+static void ReleaseLWLocks(int code, Datum arg);
static void ShutdownPostgres(int code, Datum arg);
static void StatementTimeoutHandler(void);
static void LockTimeoutHandler(void);
@@ -661,6 +662,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
*/
CreateAuxProcessResourceOwner();
+ on_shmem_exit(ReleaseLWLocks, 0);
StartupXLOG();
/* Release (and warn about) any buffer pins leaked in StartupXLOG */
ReleaseAuxProcessResources(true);
@@ -1171,6 +1173,23 @@ process_settings(Oid databaseid, Oid roleid)
table_close(relsetting, AccessShareLock);
}
+/*
+ * There are 2 types of buffer locks on-holding when AtProcExit_Buffers() is
+ * invoked in a bootstrap process or a standalone backend:
+ * (1) Exceptions thrown during StartupXLOG()
+ * (2) Exceptions thrown during exception-handling in ShutdownXLOG()
+ * So we need this on_shmem_exit callback for single user mode.
+ * For processes under postmaster, ShutdownAuxiliaryProcess() will release
+ * the lw-locks and ShutdownXLOG() is not registered as a callback, so there
+ * is no such issue. Also, please note this callback should be registered in
+ * the order after AtProcExit_buffers() and before ShutdownXLOG().
+ */
+static void
+ReleaseLWLocks(int code, Datum arg)
+{
+ LWLockReleaseAll();
+}
+
/*
* Backend-shutdown callback. Do cleanup that we want to be sure happens
* before all the supporting modules begin to nail their doors shut via
Hi Kyotaro,
Thank you so much for your valued feedback and suggestions!
I assume that we have reached a consensus about the problem we are trying to fix here.
0a 00000004`8080cc30 00000004`80dcf917 postgres!PGSemaphoreLock+0x65 [d:\orcasqlagsea10\14\s\src\backend\port\win32_sema.c @ 158]
0b 00000004`8080cc90 00000004`80db025c postgres!LWLockAcquire+0x137 [d:\orcasqlagsea10\14\s\src\backend\storage\lmgr\lwlock.c @ 1234]
0c 00000004`8080ccd0 00000004`80db25db postgres!AbortBufferIO+0x2c [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 3995]
0d 00000004`8080cd20 00000004`80dbce36 postgres!AtProcExit_Buffers+0xb [d:\orcasqlagsea10\14\s\src\backend\storage\buffer\bufmgr.c @ 2479]
0e 00000004`8080cd50 00000004`80dbd1bd postgres!shmem_exit+0xf6 [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 262]
0f 00000004`8080cd80 00000004`80dbccfd postgres!proc_exit_prepare+0x4d [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 188]
10 00000004`8080cdb0 00000004`80ef9e74 postgres!proc_exit+0xd [d:\orcasqlagsea10\14\s\src\backend\storage\ipc\ipc.c @ 141]
11 00000004`8080cde0 00000004`80ddb6ef postgres!errfinish+0x204 [d:\orcasqlagsea10\14\s\src\backend\utils\error\elog.c @ 624]
12 00000004`8080ce50 00000004`80db0f59 postgres!mdread+0x12f [d:\orcasqlagsea10\14\s\src\backend\storage\smgr\md.c @ 806]
Yes, this is one of the two situations we want to fix. The other situation is a cascade exception case like the following.
#0 0x00007f0fdb7cb6d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7f0fd14c81b8) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1 do_futex_wait (sem=sem(at)entry=0x7f0fd14c81b8, abstime=0x0) at sem_waitcommon.c:111
#2 0x00007f0fdb7cb7c8 in __new_sem_wait_slow (sem=0x7f0fd14c81b8, abstime=0x0) at sem_waitcommon.c:181
#3 0x00005630d475658a in PGSemaphoreLock (sema=0x7f0fd14c81b8) at pg_sema.c:316
#4 0x00005630d47f689e in LWLockAcquire (lock=0x7f0fd9ae9c00, mode=LW_EXCLUSIVE) at /path/to/postgres/source/build/../src/backend/storage/lmgr/lwlock.c:1243
#5 0x00005630d47cd087 in AbortBufferIO () at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:3988
#6 0x00005630d47cb3f9 in AtProcExit_Buffers (code=1, arg=0) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2473
#7 0x00005630d47dbc32 in shmem_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:272
#8 0x00005630d47dba5e in proc_exit_prepare (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:194
#9 0x00005630d47db9c6 in proc_exit (code=1) at /path/to/postgres/source/build/../src/backend/storage/ipc/ipc.c:107
#10 0x00005630d49811bc in errfinish (dummy=0) at /path/to/postgres/source/build/../src/backend/utils/error/elog.c:541
#11 0x00005630d4801f1f in mdwrite (reln=0x5630d6588c68, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/md.c:843
#12 0x00005630d4804716 in smgrwrite (reln=0x5630d6588c68, forknum=MAIN_FORKNUM, blocknum=8, buffer=0x7f0fd1ae9c00 "", skipFsync=false) at /path/to/postgres/source/build/../src/backend/storage/smgr/smgr.c:650
#13 0x00005630d47cb824 in FlushBuffer (buf=0x7f0fd19e9c00, reln=0x5630d6588c68) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2751
#14 0x00005630d47cb219 in SyncOneBuffer (buf_id=0, skip_recently_used=false, wb_context=0x7ffccc371a70) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2394
#15 0x00005630d47cab00 in BufferSync (flags=6) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:1984
#16 0x00005630d47cb57f in CheckPointBuffers (flags=6) at /path/to/postgres/source/build/../src/backend/storage/buffer/bufmgr.c:2578
#17 0x00005630d44a685b in CheckPointGuts (checkPointRedo=23612304, flags=6) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:9149
#18 0x00005630d44a62cf in CreateCheckPoint (flags=6) at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:8937
#19 0x00005630d44a45e3 in StartupXLOG () at /path/to/postgres/source/build/../src/backend/access/transam/xlog.c:7723
#20 0x00005630d4995f88 in InitPostgres (in_dbname=0x5630d65582b0 "postgres", dboid=0, username=0x5630d653d7d0 "chengyu", useroid=0, out_dbname=0x0, override_allow_connections=false)
at /path/to/postgres/source/build/../src/backend/utils/init/postinit.c:636
#21 0x00005630d480b68b in PostgresMain (argc=7, argv=0x5630d6534d20, dbname=0x5630d65582b0 "postgres", username=0x5630d653d7d0 "chengyu") at /path/to/postgres/source/build/../src/backend/tcop/postgres.c:3810
#22 0x00005630d4695e8b in main (argc=7, argv=0x5630d6534d20) at /path/to/postgres/source/build/../src/backend/main/main.c:224
Though ENOSPC is avoided by reservation in PG, another error code could be returned from the OS and produce this stack.
Ok, we are fixing this. The proposed patch lets LWLockReleaseAll() be called before
InitBufferPoolBackend() by registering the former after the latter in the on_shmem_exit
list. Even if it works, I think it's neither clean nor safe to register multiple
order-sensitive callbacks.
Actually, I think the order of the callbacks mirrors the order in which the components were initialized. In patch v2, the specific registration location was required so that the cascade exception case works as well.
However, I think we can discuss this, as we just want to ensure lw-locks are released before AbortBufferIO().
And the only caller of it is shmem_exit. More than that, all other call sites call
LWLockReleaseAll() just before calling it. If that's the case, why don't we just release
all LWLocks in shmem_exit or in AtProcExit_Buffers before calling AbortBufferIO()? I think
it's sufficient that AtProcExit_Buffers calls it at the beginning. (The comment for the
function needs editing.)
Putting LWLockReleaseAll() in AtProcExit_Buffers() is OK; however, putting it in shmem_exit() does not work for the cascade exception case.
Indeed, putting LWLockReleaseAll() in AtProcExit_Buffers() was considered first, but as other parts of the PG code base prefer putting it in separate callbacks (e.g. the ShutdownAuxiliaryProcess() callback when IsUnderPostmaster is true), I just followed the same style in patch v2.
But after revisiting the decision, I think I agree with you, because:
1. Yes, it looks cleaner in the code.
2. We can avoid the pain of people forgetting to register, or wrongly registering, the additional callback.
3. Calling LWLockReleaseAll() a second time is quite fast, so it will not add noticeable overhead to AtProcExit_Buffers() (see the small sketch below).
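Regarding point 3, a rough standalone sketch (illustration only, not lwlock.c; the names are made up) of why a second release-all is nearly free: it walks a small array of held locks, and the first call already emptied that array.
/* Standalone sketch (illustration only): a release-all that walks a small
 * "held locks" array is nearly a no-op the second time around. */
#include <stdio.h>

#define MAX_HELD 200

static int	held_locks[MAX_HELD];
static int	num_held_locks = 0;

static void
release_all(void)
{
	int			released = 0;

	while (num_held_locks > 0)
	{
		num_held_locks--;
		/* the real code would unlock held_locks[num_held_locks] here */
		released++;
	}
	printf("released %d lock(s)\n", released);
}

int
main(void)
{
	held_locks[num_held_locks++] = 42;	/* pretend we hold one buffer lock */

	release_all();				/* first call: does the real work */
	release_all();				/* second call: empty array, effectively free */
	return 0;
}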
Thus, I have updated the patch to v3 according to your suggestions. Could you help review it again?
Please let me know should you have more suggestions or feedback.
Thank you!
Best regards,
--
Chengchao Yu
Software Engineer | Microsoft | Azure Database for PostgreSQL
https://azure.microsoft.com/en-us/services/postgresql/
Attachments:
fix-deadlock-v3.patch (application/octet-stream)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f3a402854..90f600c5fa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2472,12 +2472,20 @@ InitBufferPoolBackend(void)
}
/*
- * During backend exit, ensure that we released all shared-buffer locks and
- * assert that we have no remaining pins.
+ * During backend exit, we release all shared-buffer locks before
+ * AbortBufferIO() and assert that we have no remaining pins.
*/
static void
AtProcExit_Buffers(int code, Datum arg)
{
+ /*
+ * AtProcExit_Buffers() could be called in a cascade exception when
+ * the previous callback re-acquired some lw-lock again.
+ * Here we make sure that every time AbortBufferIO() is called the
+ * lw-locks have been already released.
+ */
+ LWLockReleaseAll();
+
AbortBufferIO();
UnlockBuffers();
On Sat, Jul 27, 2019 at 6:22 AM Chengchao Yu <chengyu@microsoft.com> wrote:
Thus, I have updated the patch to v3 according to your suggestions. Could you help review it again?
Please let me know should you have more suggestions or feedback.
I have tried to look into this patch and I don't think it fixes the
problem. Basically, I tried the commands you suggested in single-user
mode: create table, insert, and then checkpoint. What I see is almost
the same behavior as you explained in one of the emails above, with a
slight difference that makes me think the fix you are proposing is not
correct. Below is what you said:
"The second type is in Step #4. At the time when “checkpoint” SQL
command is being executed, PG has already set up the before_shmem_exit
callbackShutdownPostgres(), which releases all lw-locks given
transaction or sub-transaction is on-going. So after the first IO
error, the buffer page’s lw-lock gets released successfully. However,
later ShutdownXLOG() is invoked, and PG tries to flush buffer pages
again, which results in the second IO error. Different from the first
time, this time, all the previous executed before/on_shmem_exit
callbacks are not invoked again due to the decrease of the indexes. So
lw-locks for buffer pages are not released when PG tries to get the
same buffer lock in AbortBufferIO(), and then PG process gets stuck."
The only difference is in the last line where for me it gives
assertion failure when trying to do ReleaseAuxProcessResources. Below
is the callstack:
postgres.exe!ExceptionalCondition(const char * conditionName=0x00db0c78, const char * errorType=0x00db0c68, const char * fileName=0x00db0c18, int lineNumber=1722) Line 55 C
postgres.exe!UnpinBuffer(BufferDesc * buf=0x052a104c, bool fixOwner=true) Line 1722 + 0x2f bytes C
postgres.exe!ReleaseBuffer(int buffer=96) Line 3367 + 0x17 bytes C
postgres.exe!ResourceOwnerReleaseInternal(ResourceOwnerData * owner=0x0141f6e8, <unnamed-enum-RESOURCE_RELEASE_BEFORE_LOCKS> phase=RESOURCE_RELEASE_BEFORE_LOCKS, bool isCommit=false, bool isTopLevel=true) Line 526 + 0x9 bytes C
postgres.exe!ResourceOwnerRelease(ResourceOwnerData * owner=0x0141f6e8, <unnamed-enum-RESOURCE_RELEASE_BEFORE_LOCKS> phase=RESOURCE_RELEASE_BEFORE_LOCKS, bool isCommit=false, bool isTopLevel=true) Line 484 + 0x17 bytes C
postgres.exe!ReleaseAuxProcessResources(bool isCommit=false) Line 861 + 0x15 bytes C
postgres.exe!ReleaseAuxProcessResourcesCallback(int code=1, unsigned int arg=0) Line 881 + 0xa bytes C
postgres.exe!shmem_exit(int code=1) Line 272 + 0x1f bytes C
postgres.exe!proc_exit_prepare(int code=1) Line 194 + 0x9 bytes C
postgres.exe!proc_exit(int code=1) Line 107 + 0x9 bytes C
postgres.exe!errfinish(int dummy=0, ...) Line 538 + 0x7 bytes C
postgres.exe!mdwrite(SMgrRelationData * reln=0x0147e140, ForkNumber forknum=MAIN_FORKNUM, unsigned int blocknum=7, char * buffer=0x0542dd00, bool skipFsync=false) Line 713 + 0x4c bytes C
postgres.exe!smgrwrite(SMgrRelationData * reln=0x0147e140, ForkNumber forknum=MAIN_FORKNUM, unsigned int blocknum=7, char * buffer=0x0542dd00, bool skipFsync=false) Line 587 + 0x24 bytes C
postgres.exe!FlushBuffer(BufferDesc * buf=0x052a104c, SMgrRelationData * reln=0x0147e140) Line 2759 + 0x1d bytes C
postgres.exe!SyncOneBuffer(int buf_id=95, bool skip_recently_used=false, WritebackContext * wb_context=0x012ccea0) Line 2402 + 0xb bytes C
postgres.exe!BufferSync(int flags=5) Line 1992 + 0x15 bytes C
postgres.exe!CheckPointBuffers(int flags=5) Line 2586 + 0x9 bytes C
postgres.exe!CheckPointGuts(unsigned __int64 checkPointRedo=22933176, int flags=5) Line 8991 + 0x9 bytes C
postgres.exe!CreateCheckPoint(int flags=5) Line 8780 + 0x11 bytes C
postgres.exe!ShutdownXLOG(int code=1, unsigned int arg=0) Line 8333 + 0x7 bytes C
postgres.exe!shmem_exit(int code=1) Line 272 + 0x1f bytes C
postgres.exe!proc_exit_prepare(int code=1) Line 194 + 0x9 bytes C
postgres.exe!proc_exit(int code=1) Line 107 + 0x9 bytes C
postgres.exe!errfinish(int dummy=0, ...) Line 538 + 0x7 bytes C
postgres.exe!mdwrite(SMgrRelationData * reln=0x0147e140, ForkNumber forknum=MAIN_FORKNUM, unsigned int blocknum=7, char * buffer=0x0542dd00, bool skipFsync=false) Line 713 + 0x4c bytes C
postgres.exe!smgrwrite(SMgrRelationData * reln=0x0147e140, ForkNumber forknum=MAIN_FORKNUM, unsigned int blocknum=7, char * buffer=0x0542dd00, bool skipFsync=false) Line 587 + 0x24 bytes C
postgres.exe!FlushBuffer(BufferDesc * buf=0x052a104c, SMgrRelationData * reln=0x0147e140) Line 2759 + 0x1d bytes C
postgres.exe!SyncOneBuffer(int buf_id=95, bool skip_recently_used=false, WritebackContext * wb_context=0x012ce580) Line 2402 + 0xb bytes C
postgres.exe!BufferSync(int flags=44) Line 1992 + 0x15 bytes C
postgres.exe!CheckPointBuffers(int flags=44) Line 2586 + 0x9 bytes C
postgres.exe!CheckPointGuts(unsigned __int64 checkPointRedo=22933176, int flags=44) Line 8991 + 0x9 bytes C
postgres.exe!CreateCheckPoint(int flags=44) Line 8780 + 0x11 bytes C
postgres.exe!RequestCheckpoint(int flags=44) Line 967 + 0xc bytes C
postgres.exe!standard_ProcessUtility(PlannedStmt * pstmt=0x0146b738, const char * queryString=0x0146ad98, <unnamed-enum-PROCESS_UTILITY_TOPLEVEL> context=PROCESS_UTILITY_TOPLEVEL, ParamListInfoData * params=0x00000000, QueryEnvironment * queryEnv=0x00000000, _DestReceiver * dest=0x00adc1d8, char * completionTag=0x012cfdbc) Line 769 + 0x28 bytes C
It seems to me there are other things, like
ReleaseAuxProcessResources() before AbortBufferIO(), which expect
LWLocks to be released. I didn't get much time to debug this further,
but I think some more analysis is required for this issue.
I guess you didn't encounter this problem because you are not using an
asserts-enabled build, but there could be some other reason as well.
I have marked this CF entry as "Waiting on Author".
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 09, 2019 at 05:34:43PM +0530, Amit Kapila wrote:
The only difference is in the last line where for me it gives
assertion failure when trying to do ReleaseAuxProcessResources. Below
is the callstack:
No need for Windows on this one and I have reproduced easily the same
trace as Amit. The patch has been moved to next CF. Chengchao, could
you provide an update please?
--
Michael
On Mon, Nov 25, 2019 at 1:17 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Sep 09, 2019 at 05:34:43PM +0530, Amit Kapila wrote:
The only difference is in the last line where for me it gives
assertion failure when trying to do ReleaseAuxProcessResources. Below
is the callstack:
No need for Windows on this one and I have reproduced easily the same
trace as Amit. The patch has been moved to next CF. Chengchao, could
you provide an update please?
I have marked this patch as "Returned with feedback" as it has been a
long time since the author responded. Feel free to provide a new patch for
the next CF.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com