A couple of fishy-looking critical sections

Started by Tom Lanealmost 25 years ago3 messages

tgl@sss.pgh.pa.us

almost 25 years ago

Vadim,

I have committed changes to separate the notions of critical sections
and sections that just want to hold off cancel/die interrupts, as we
discussed. While I was doing that, I noticed a couple of places that
I think you should take a second look at:

1. src/backend/access/nbtree/nbtinsert.c, line 867: shouldn't this
END_CRIT_SECTION be moved up to before the _bt_wrtbuf call? It seems
to me that an elog during the wrtbuf is not a critical failure. If
this code is correct, then all the other crit sections are wrong,
because all of them release the crit section before writing buffers,
not after.

2. src/backend/commands/vacuum.c, line 1907: does this
START_CRIT_SECTION really have to be here, and not down at line 1935,
just before PageRepairFragmentation()? I really don't like the idea of
turning those elogs that are inside the loop into reasons to force a
system-wide restart.

3. src/backend/access/transam/xlog.c, routine CreateCheckPoint:
does this *entire* routine need to be a critical section? Again,
I fear a shotgun approach will mean a net decrease in reliability,
not an improvement. How much of this code really has to be critical?
Do you really want a failure in, say, MoveOfflineLogs to take down the
whole database?

regards, tom lane

Mikheev, Vadim

vmikheev@SECTORBASE.COM

almost 25 years ago

In reply to: Tom Lane (#1)

RE: A couple of fishy-looking critical sections

1. src/backend/access/nbtree/nbtinsert.c, line 867: shouldn't this
END_CRIT_SECTION be moved up to before the _bt_wrtbuf call? It seems
to me that an elog during the wrtbuf is not a critical failure. If
this code is correct, then all the other crit sections are wrong,
because all of them release the crit section before writing buffers,
not after.

Ok to move it up.

2. src/backend/commands/vacuum.c, line 1907: does this
START_CRIT_SECTION really have to be here, and not down at line 1935,
just before PageRepairFragmentation()? I really don't like the idea
of turning those elogs that are inside the loop into reasons to force a
system-wide restart.

Agreed.

3. src/backend/access/transam/xlog.c, routine CreateCheckPoint:
does this *entire* routine need to be a critical section? Again,
I fear a shotgun approach will mean a net decrease in reliability,
not an improvement. How much of this code really has to be critical?

When postmaster has to create Checkpoint this routine is called from
bootstrap.c:BootstrapMain() - ie without normal initialization, so
I don't know result of elog(ERROR) in this case -:(
Probably I should spend more time in this area in attempt to make
Checkpoints rollback-able, but ...
Anyway it seems that the real source of elog(ERROR) there is
FlushBufferPool(). Maybe we could just initialize Warn_restart in
that point? All other places are mostly related to WAL itself
- they are more critical and less likely subject to elog(ERROR).

Do you really want a failure in, say, MoveOfflineLogs to take down the
whole database?

Well, this one could be changed at least (ie STOP --> LOG).

Vadim

Import Notes

Resolved by subject fallback

Tom Lane

tgl@sss.pgh.pa.us

almost 25 years ago

In reply to: Mikheev, Vadim (#2)

Re: A couple of fishy-looking critical sections

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

3. src/backend/access/transam/xlog.c, routine CreateCheckPoint:
does this *entire* routine need to be a critical section? Again,
I fear a shotgun approach will mean a net decrease in reliability,
not an improvement. How much of this code really has to be critical?

When postmaster has to create Checkpoint this routine is called from
bootstrap.c:BootstrapMain() - ie without normal initialization, so
I don't know result of elog(ERROR) in this case -:(

I believe elog(ERROR) will be treated like FATAL in this case (because
Warn_restart isn't set). So the checkpoint process will clean up and
exit, but there wouldn't be a system-wide restart were it not for the
critical section.

The question that's bothering me is whether a system-wide restart is
actually going to make things better, rather than worse, if the
checkpoint process has a problem ...

regards, tom lane