Concurrent-update problem in bufmgr.c

Started by Tom Lane over 25 years ago · 4 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

The comments for bufmgr.c's BufferSync routine point out that it's a
bad thing for some other backend to be modifying a page while it is
written out. The following text has gone unchanged since Postgres95:

* Also, we need to be sure that no other transaction is
* modifying the page as we flush it. This is only a problem for objects
* that use a non-two-phase locking protocol, like btree indices. For
* those objects, we would like to set a write lock for the duration of
* our IO. Another possibility is to code updates to btree pages
* carefully, so that writing them out out of order cannot cause
* any unrecoverable errors.
*
* I don't want to think hard about this right now, so I will try
* to come back to it later.

Unfortunately, the comment is wrong about this being a problem only for
indexes. It's possible for an invalid state of a heap page to be
written out, as well. PageAddItem() sets the item pointer for an added
tuple before it copies the tuple onto the page, so if it is recycling an
existing item pointer slot, there is a state where a valid item pointer
is pointing at a tuple that's wholly or partly not valid. This doesn't
matter as far as active backends are concerned because we should be
holding BUFFER_LOCK_EXCLUSIVE on the page while modifying it. But some
other backend could be in process of writing out the page (if it had
previously dirtied the page), and so it's possible for this invalid
state to reach disk. If the database is shut down before the new
update of the page can be written out, then we have a problem.
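The torn-state window can be sketched in miniature. This is a hypothetical, much-simplified layout (`MiniPage`, `mini_page_add_item` are illustration names, not PostgreSQL's actual Page/ItemIdData/PageAddItem code), but it preserves the ordering that matters:

```c
#include <string.h>

/* Hypothetical, much-simplified page layout for illustration only;
 * PostgreSQL's real Page, ItemIdData, and PageAddItem() are more
 * involved. */
typedef struct
{
    int     lp_off;             /* data offset; -1 means unused slot */
    int     lp_len;             /* data length */
    char    data[64];           /* page data area */
} MiniPage;

/*
 * Mirrors PageAddItem's ordering when it recycles an item pointer
 * slot: the line pointer is made valid *before* the tuple bytes are
 * copied in.  A backend flushing the page without a content lock
 * could write it out in the window between the two steps, leaving a
 * valid-looking pointer to garbage on disk.
 */
void
mini_page_add_item(MiniPage *pg, const char *tuple, int len)
{
    pg->lp_off = 0;             /* step 1: slot now claims to be valid */
    pg->lp_len = len;
    /* <-- window: a concurrent smgrwrite() here writes a torn page */
    memcpy(pg->data, tuple, (size_t) len);      /* step 2: actual data */
}
```

Active backends never see the intermediate state because the modifier holds BUFFER_LOCK_EXCLUSIVE; only the unlocked flush path can expose it.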

Normally, the new page state will be written out at transaction commit,
but what happens if the current transaction aborts? In that case, the
dirty page just sits in shared memory. It will get written the next
time a transaction modifies the page (and commits), or when some backend
decides to recycle the buffer to hold another page. But if the
postmaster gets shut down before that happens, we lose; the dirty page
is never written at all, and when it's re-read after database restart,
the corrupted page state becomes visible.

The window of vulnerability is considerably wider in 7.0 than in prior
releases, because in prior releases *any* transaction commit will write
all dirty pages. In 7.0 the dirtied page will not get written out until
we commit a transaction that modified that particular page (or decide to
recycle the buffer). The odds of seeing a problem are still pretty
small, but the risk is definitely there.

I believe the correct fix for this problem is for bufmgr.c to grab
a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out.
A read lock is sufficient since there's no need to prevent other
backends from reading the page, we just need to prevent them from
changing it during the I/O.

Comments anyone?

regards, tom lane

#2 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#1)
Re: Concurrent-update problem in bufmgr.c

Tom Lane wrote:

[snip]

> The window of vulnerability is considerably wider in 7.0 than in prior
> releases, because in prior releases *any* transaction commit will write
> all dirty pages. In 7.0 the dirtied page will not get written out until
> we commit a transaction that modified that particular page (or decide to
> recycle the buffer). The odds of seeing a problem are still pretty
> small, but the risk is definitely there.
>
> I believe the correct fix for this problem is for bufmgr.c to grab
> a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out.
> A read lock is sufficient since there's no need to prevent other
> backends from reading the page, we just need to prevent them from
> changing it during the I/O.
>
> Comments anyone?

This seems to be almost the same as what I posted 4 months ago
(smgrwrite() without LockBuffer (was RE: ...)).
Maybe Vadim would take care of it in the implementation of WAL.
The following was Vadim's reply to you and me.

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

As far as I see,PostgreSQL doesn't call LockBuffer() before
calling smgrwrite(). This seems to mean that smgrwrite()
could write buffers to disk which are being changed by
another backend. If the(another) backend was aborted by
some reason the buffer page would remain half-changed.

Hmm ... looks fishy to me too. Seems like we ought to hold
BUFFER_LOCK_SHARE on the buffer while dumping it out. It
wouldn't matter under normal circumstances, but as you say
there could be trouble if the other backend crashed before
it could mark the buffer dirty again, or if we had a system
crash before the dirtied page got written again.

Well, known issue. Buffer latches were implemented in 6.5 only
and there was no time to think about subj hard -:)
Yes, we have to shlock buffer before writing and this is what
bufmgr will must do for WAL anyway (to ensure that last buffer
changes already logged)... but please check that buffer is not
exc-locked by backend itself somewhere before smgrwrite()...

Vadim

Regards.

Hiroshi Inoue

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#2)
Re: Concurrent-update problem in bufmgr.c

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

> This seems to be almost the same as what I posted 4 months ago
> (smgrwrite() without LockBuffer (was RE: ...)).

You are right, this was already a known issue (and I had it buried in
my to-do list, in fact). I rediscovered it while puzzling over some
of the corrupted-data reports we've seen lately.

I'll go ahead and make the change in current sources. Does anyone
have a strong feeling about whether or not to back-patch it into
REL7_0 branch for the upcoming 7.0.3 release?

regards, tom lane

#4 Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#3)
RE: Concurrent-update problem in bufmgr.c

> I believe the correct fix for this problem is for bufmgr.c to grab
> a read lock (BUFFER_LOCK_SHARED) on any page that it is writing out.
> A read lock is sufficient since there's no need to prevent other
> backends from reading the page, we just need to prevent them from
> changing it during the I/O.
>
> Comments anyone?

Do it.

Vadim