self-deadlock at FATAL exit of bootstrap process on read error

Started by Qingqing Zhou over 19 years ago, 4 messages
#1 Qingqing Zhou
zhouqq@cs.toronto.edu

I encountered a situation where the server can't shut down when a bootstrap
process does ReadBuffer() but gets a read error. I guess the problem may be
like this: the bootstrap process fails to read at this line:

smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);

So it does a FATAL exit and shmem_exit() is called:

while (--on_shmem_exit_index >= 0)
    (*on_shmem_exit_list[on_shmem_exit_index].function) (code,
        on_shmem_exit_list[on_shmem_exit_index].arg);

where

on_shmem_exit_list[0] = DummyProcKill
on_shmem_exit_list[1] = AtProcExit_Buffers

The above callbacks are called in stack (last-in, first-out) order, so
AtProcExit_Buffers() runs first and calls AbortBufferIO(), which blocks on an
"io_in_progress_lock" that the process itself still holds. The comment there
says "since LWLockReleaseAll has already been called, we're not holding the
buffer's io_in_progress_lock", but that is not the case here.
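To make the ordering concrete, here is a minimal toy model of the situation described above. It is not PostgreSQL source: every name in it (toy_on_exit, toy_run_exit, toy_simulate_fatal_exit, the boolean standing in for the lock) is hypothetical, and instead of really blocking it only records that a non-reentrant lock acquire would hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: all names here are hypothetical, not PostgreSQL APIs. */
#define MAX_EXIT_CALLBACKS 8

typedef void (*exit_callback) (int code);

static exit_callback exit_list[MAX_EXIT_CALLBACKS];
static int  exit_index = 0;

static bool io_lock_held = false;        /* stands in for io_in_progress_lock */
static bool would_self_deadlock = false; /* set where a real acquire would hang */
static char call_order[MAX_EXIT_CALLBACKS + 1];
static int  call_count = 0;

/* Register a callback; later registrations run earlier (stack order). */
static void
toy_on_exit(exit_callback fn)
{
    assert(exit_index < MAX_EXIT_CALLBACKS);
    exit_list[exit_index++] = fn;
}

/* Run callbacks last-in, first-out, mirroring the loop quoted above. */
static void
toy_run_exit(int code)
{
    while (--exit_index >= 0)
        exit_list[exit_index] (code);
    exit_index = 0;
}

static void
toy_proc_kill(int code)
{
    (void) code;
    call_order[call_count++] = 'K';
}

/* Assumes all locks were released already, like the comment quoted above. */
static void
toy_buffer_cleanup(int code)
{
    (void) code;
    call_order[call_count++] = 'B';
    if (io_lock_held)
        would_self_deadlock = true; /* a blocking acquire would wait on itself */
    else
    {
        io_lock_held = true;        /* acquire */
        io_lock_held = false;       /* release */
    }
}

/* Returns true if the exit path would block on a lock the process holds. */
bool
toy_simulate_fatal_exit(bool lock_held_at_exit)
{
    exit_index = 0;
    call_count = 0;
    would_self_deadlock = false;
    memset(call_order, 0, sizeof(call_order));

    toy_on_exit(toy_proc_kill);      /* slot 0, runs last */
    toy_on_exit(toy_buffer_cleanup); /* slot 1, runs first */

    io_lock_held = lock_held_at_exit; /* the read fails mid-I/O */
    toy_run_exit(1);
    return would_self_deadlock;
}
```

Calling toy_simulate_fatal_exit(true) models the failed read with the lock still held; toy_simulate_fatal_exit(false) models the case the comment assumes, where the locks have already been released.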

There may be other similar problems for the bootstrap process, so I am
not sure what the best fix is ...

Regards,
Qingqing

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#1)
Re: self-deadlock at FATAL exit of bootstrap process on read error

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

I encountered a situation where the server can't shut down when a bootstrap
process does ReadBuffer() but gets a read error.

Hm, AtProcExit_Buffers is assuming that we've done AbortTransaction,
but the WAL-replay process doesn't do that because it's not running a
transaction. Seems like we need to stack another on-proc-exit function
to do the appropriate subset of AbortTransaction ... LWLockReleaseAll at
least, not sure what else.
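As a rough illustration of that idea (not the actual patch, which this thread does not show), the toy model below registers a lock-release hook after the buffer cleanup hook, so that stack ordering runs it first. All names (push_cleanup, release_all_locks, buffer_cleanup, exit_with_locks_held) are hypothetical stand-ins, and the "lock" is just a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not PostgreSQL functions. */
#define MAX_CALLBACKS 8

typedef void (*cleanup_fn) (void);

static cleanup_fn cleanup_stack[MAX_CALLBACKS];
static int  cleanup_top = 0;
static int  locks_held = 0;
static bool cleanup_blocked = false;

static void
push_cleanup(cleanup_fn fn)
{
    assert(cleanup_top < MAX_CALLBACKS);
    cleanup_stack[cleanup_top++] = fn;
}

/* Stand-in for the "appropriate subset of AbortTransaction". */
static void
release_all_locks(void)
{
    locks_held = 0;
}

/* Stand-in for buffer cleanup that assumes no locks are still held. */
static void
buffer_cleanup(void)
{
    if (locks_held > 0)
        cleanup_blocked = true; /* a real acquire would hang here */
}

/* Run hooks in stack order: the most recently pushed hook runs first. */
static void
run_cleanups(void)
{
    while (cleanup_top > 0)
        cleanup_stack[--cleanup_top] ();
}

/* Returns true if buffer cleanup would block on a still-held lock. */
bool
exit_with_locks_held(bool register_release_hook)
{
    cleanup_top = 0;
    cleanup_blocked = false;

    push_cleanup(buffer_cleanup);
    if (register_release_hook)
        push_cleanup(release_all_locks); /* pushed later, so it runs earlier */

    locks_held = 1;     /* an error strikes while a lock is held */
    run_cleanups();
    return cleanup_blocked;
}
```

Without the extra hook, exit_with_locks_held(false) reports that cleanup would block; with it, exit_with_locks_held(true) does not, because the locks are dropped before the buffer cleanup runs.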

Do you have a test case to reproduce this problem?

regards, tom lane

#3 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Qingqing Zhou (#1)
Re: self-deadlock at FATAL exit of bootstrap process on read error

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

Do you have a test case to reproduce this problem?

According to the error message, the problem happens while reading
pg_database. I just tried plugging these lines into mdread():

+        /* pretend there is an error reading pg_database */
+        if (reln->smgr_rnode.relNode == 1262)
+        {
+                fprintf(stderr, "Ooops \n");
+                return false;
+        }

v = _mdfd_getseg(reln, blocknum, false);

And it works: with this change the problem reproduces.

Regards,
Qingqing

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#3)
Re: self-deadlock at FATAL exit of bootstrap process on read error

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> wrote

Do you have a test case to reproduce this problem?

According to the error message, the problem happens while reading
pg_database. I just tried plugging these lines into mdread():

OK, patch applied for this.

regards, tom lane