Why we still see some reports of "could not access transaction status"

Started by Tom Lane, about 21 years ago, 13 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

Having seen a couple recent reports of "could not access status of
transaction" for old, not-obviously-corrupt transaction numbers, I went
looking to see if I could find a way that the system could truncate CLOG
before it's really marked all occurrences of old transaction numbers as
known-dead or known-good.

I found one.

The problem is that there are several places where a tqual.c routine is
called without checking to see if it changed the tuple's commit hint
bits, and without necessarily writing the page immediately after. One
example is the code path in heap_update where we decide that we can't
update the tuple because a concurrent transaction did so. If
HeapTupleSatisfiesUpdate had set the XMIN_COMMITTED or XMAX_COMMITTED
bits, those bits would remain set in the shared buffer, but *the buffer
would not get marked dirty*.

Before PG 7.2 this was not a bug, because the hint bits could always be
set again later. But now, consider this scenario: while the buffer
remains in memory, VACUUM passes over the table. It doesn't find any
changes needed in that page, so it doesn't write the page either. At
completion of the vacuum, we check whether we can truncate CLOG,
discover we can, and do so. At some later point, the in-memory buffer
is discarded, still without having been written. When next read in,
the page contains an un-hinted transaction status that could easily
point to a transaction before the new CLOG boundary. Ooops.
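To make the sequence concrete, here is a toy model in C. It is an
illustration only, not code from the PostgreSQL tree: every name in it
(ToyBuffer, ToyClog, toy_check_xmin, the flag value, the XIDs 100/150/200)
is hypothetical, XID wraparound is ignored, and VACUUM's decision to
truncate is simply assumed rather than modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value; stands in for the XMIN_COMMITTED hint bit. */
#define TOY_XMIN_COMMITTED 0x0100

typedef enum {
    TOY_CLOG_ERROR = -1,    /* "could not access status of transaction" */
    TOY_NOT_COMMITTED = 0,
    TOY_COMMITTED = 1
} ToyStatus;

typedef struct { uint32_t xmin; uint16_t infomask; } ToyTuple;

typedef struct {
    ToyTuple on_disk;       /* what a fresh read of the page returns */
    ToyTuple in_memory;     /* copy held in the shared buffer */
    bool     dirty;         /* written back at eviction only if set */
} ToyBuffer;

typedef struct { uint32_t oldest_kept_xid; } ToyClog;

/* Toy CLOG: every retained XID counts as committed; older ones are gone. */
static ToyStatus toy_clog_lookup(const ToyClog *clog, uint32_t xid)
{
    if (xid < clog->oldest_kept_xid)
        return TOY_CLOG_ERROR;
    return TOY_COMMITTED;
}

/* Visibility-style check: sets the hint bit but never marks the buffer. */
static ToyStatus toy_check_xmin(ToyBuffer *buf, const ToyClog *clog)
{
    ToyStatus st;

    if (buf->in_memory.infomask & TOY_XMIN_COMMITTED)
        return TOY_COMMITTED;           /* hint present, no CLOG access */
    st = toy_clog_lookup(clog, buf->in_memory.xmin);
    if (st == TOY_COMMITTED)
        buf->in_memory.infomask |= TOY_XMIN_COMMITTED;
    return st;
}

/* Evict the buffer (writing it only if dirty), then read the page again. */
static void toy_evict_and_reload(ToyBuffer *buf)
{
    if (buf->dirty) {
        buf->on_disk = buf->in_memory;
        buf->dirty = false;
    }
    buf->in_memory = buf->on_disk;
}

/* Runs the whole sequence; returns the status seen after the reload. */
static ToyStatus toy_scenario(bool caller_marks_dirty)
{
    ToyClog   clog = { 100 };
    ToyBuffer buf = { { 150, 0 }, { 150, 0 }, false };
    uint16_t  before = buf.in_memory.infomask;

    (void) toy_check_xmin(&buf, &clog);         /* hint set in memory */
    if (caller_marks_dirty && buf.in_memory.infomask != before)
        buf.dirty = true;                       /* the check callers may forget */

    clog.oldest_kept_xid = 200;     /* after VACUUM: CLOG truncated past 150 */

    toy_evict_and_reload(&buf);
    return toy_check_xmin(&buf, &clog);
}
```

In this model, toy_scenario(false) ends with TOY_CLOG_ERROR because the
hint never reached disk, while toy_scenario(true) ends with TOY_COMMITTED
because the hinted page was written before the buffer was discarded.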

The odds of such a problem seem exceedingly small ... in other words,
just about right to explain the small numbers of reports we get.

I think what we ought to do to solve this problem permanently is to stop
making the callers of the HeapTupleSatisfiesFoo() routines responsible
for checking for hint bit updates. It would be a lot safer, and AFAICS
not noticeably less efficient, for those routines to call
SetBufferCommitInfoNeedsSave for themselves. This would require adding
to their parameter lists, because they aren't currently told which
buffer the tuple is in, but that's no big deal considering we get to
simplify the calling logic in all the places that are faithfully doing
the t_infomask update check.
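A rough sketch of the difference in calling convention, again in toy C.
Only SetBufferCommitInfoNeedsSave and the HeapTupleSatisfiesFoo family
are named in this message; the types, the flag value, and the toy_*
functions below are assumptions made for illustration and do not
reproduce the real signatures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag value; stands in for the XMIN_COMMITTED hint bit. */
#define TOY_XMIN_COMMITTED 0x0100

typedef struct { uint32_t xmin; uint16_t infomask; } ToyTuple;
typedef struct { bool needs_save; } ToyBuffer;

/* Stand-in for SetBufferCommitInfoNeedsSave. */
static void toy_set_needs_save(ToyBuffer *buf)
{
    buf->needs_save = true;
}

/* Current style: the routine may set a hint bit but knows of no buffer. */
static bool toy_satisfies_old(ToyTuple *tup)
{
    tup->infomask |= TOY_XMIN_COMMITTED;
    return true;
}

/* A caller that faithfully does the t_infomask update check. */
static bool toy_careful_caller(ToyTuple *tup, ToyBuffer *buf)
{
    uint16_t before = tup->infomask;
    bool     ok = toy_satisfies_old(tup);

    if (tup->infomask != before)
        toy_set_needs_save(buf);
    return ok;
}

/* A caller that skips the check: the hint is set, the buffer is not marked. */
static bool toy_forgetful_caller(ToyTuple *tup, ToyBuffer *buf)
{
    (void) buf;
    return toy_satisfies_old(tup);
}

/* Proposed style: the routine is told the buffer and marks it itself. */
static bool toy_satisfies_new(ToyTuple *tup, ToyBuffer *buf)
{
    if (!(tup->infomask & TOY_XMIN_COMMITTED)) {
        tup->infomask |= TOY_XMIN_COMMITTED;
        toy_set_needs_save(buf);
    }
    return true;
}
```

With the buffer passed in, no caller can end up in the state that
toy_forgetful_caller produces, and the before/after comparison in
toy_careful_caller is no longer needed.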

Comments?

regards, tom lane

#2 Michael Paesold
mpaesold@gmx.at
In reply to: Tom Lane (#1)
Re: Why we still see some reports of "could not access transaction status"

Tom Lane wrote:

Having seen a couple recent reports of "could not access status of
transaction" for old, not-obviously-corrupt transaction numbers, I went
looking to see if I could find a way that the system could truncate CLOG
before it's really marked all occurrences of old transaction numbers as
known-dead or known-good.

I found one.

I was starting to wonder about those reports, too. Actually I was thinking
about bringing this up as soon as I found the time. So I am glad you picked
it up yourself -- and have found a problem already.

I think what we ought to do to solve this problem permanently is to stop

...

Comments?

Well, I am not able to comment here, but I can say I usually trust your
judgement.

Best Regards,
Michael Paesold

#3 Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Tom Lane (#1)
Re: Why we still see some reports of "could not access transaction status"

On Wed, Oct 13, 2004 at 12:18:08PM -0400, Tom Lane wrote:

I think what we ought to do to solve this problem permanently is to stop
making the callers of the HeapTupleSatisfiesFoo() routines responsible
for checking for hint bit updates. It would be a lot safer, and AFAICS
not noticeably less efficient, for those routines to call
SetBufferCommitInfoNeedsSave for themselves. This would require adding
to their parameter lists, because they aren't currently told which
buffer the tuple is in, but that's no big deal considering we get to
simplify the calling logic in all the places that are faithfully doing
the t_infomask update check.

Comments?

I remember seeing this code when coding the phantom Xid idea and
wondering why such an error-prone style was used. It never occurred to
me to change it (or maybe I never had the guts to do it), but now that
you mention it, it certainly seems a good idea.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
Tulio: oh, what could this button be for, Juan Carlos?
Policarpo: No, stay away, don't touch the console!
Juan Carlos: I'll press it again and again.

#4 Gaetano Mendola
mendola@bigfoot.com
In reply to: Tom Lane (#1)
Re: Why we still see some reports of "could not access transaction

Tom Lane wrote:

Having seen a couple recent reports of "could not access status of
transaction" for old, not-obviously-corrupt transaction numbers, I went
looking to see if I could find a way that the system could truncate CLOG
before it's really marked all occurrences of old transaction numbers as
known-dead or known-good.

I found one.

Are you going to fix it for 8.0 and/or back-patch it?

Regards
Gaetano Mendola

#5 Neil Conway
neilc@samurai.com
In reply to: Gaetano Mendola (#4)
Re: Why we still see some reports of "could not access

Gaetano Mendola wrote:

Are you going to fix it for 8.0 and/or back-patch it?

http://archives.postgresql.org/pgsql-committers/2004-10/msg00229.php
http://archives.postgresql.org/pgsql-committers/2004-10/msg00191.php

plus backpatches to older branches (REL7_3_STABLE, REL7_2_STABLE).

Has there been any thought about putting out another 7.4 release with
this fix?

-Neil

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#5)
Re: Why we still see some reports of "could not access

Neil Conway <neilc@samurai.com> writes:

Has there been any thought about putting out another 7.4 release with
this fix?

There has, but there are some other open issues I'd like to deal with
first.

If anyone has any pending 7.4 fixes, getting them in in the next
few days would be a Good Plan.

regards, tom lane

#7 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#6)
Re: 7.4 changes

Tom Lane wrote:

If anyone has any pending 7.4 fixes, getting them in in the next
few days would be a Good Plan.

Do we want to backport tighter security for plperl? In particular,
insisting on Safe.pm >= 2.09 and removing the :base_io set of ops?

cheers

andrew

#8 Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#7)
Re: 7.4 changes

Andrew Dunstan wrote:

Tom Lane wrote:

If anyone has any pending 7.4 fixes, getting them in in the next
few days would be a Good Plan.

Do we want to backport tighter security for plperl? In particular,
insisting on Safe.pm >= 2.09 and removing the :base_io set of ops?

And it would also be nice if we could add
contrib/cube/expected/cube_1.out to the 7.4 branch, I think, so that
more platforms could pass the contrib installcheck tests.

cheers

andrew

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#7)
Re: 7.4 changes

Andrew Dunstan <andrew@dunslane.net> writes:

Do we want to backport tighter security for plperl? In particular,
insisting on Safe.pm >= 2.09 and removing the :base_io set of ops?

I'd vote not: 7.4.5 => 7.4.6 is not an update that people would expect
to break their plperl code ...

regards, tom lane

#10 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#9)
Re: 7.4 changes

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

Do we want to backport tighter security for plperl? In particular,
insisting on Safe.pm >= 2.09 and removing the :base_io set of ops?

I'd vote not: 7.4.5 => 7.4.6 is not an update that people would expect
to break their plperl code ...

*shrug* OK. Then plperl should probably not be regarded as being as
"trusted" as we would like. Note that old versions of Safe.pm have been
the subject of security advisories such as this one
http://www.securityfocus.com/bid/6111/info/ for some time.

cheers

andrew

#11 Neil Conway
neilc@samurai.com
In reply to: Andrew Dunstan (#10)
Re: 7.4 changes

On Tue, 2004-10-19 at 02:45, Andrew Dunstan wrote:

*shrug* OK. Then plperl should probably not be regarded as being as
"trusted" as we would like. Note that old versions of Safe.pm have been
the subject of security advisories such as this one
http://www.securityfocus.com/bid/6111/info/ for some time.

Perhaps a compromise would be to require the newer version of Safe.pm,
but leave the other changes for 8.0. Upgrading Safe.pm can presumably be
done without needing any changes to the rest of one's pl/perl code.

-Neil

#12 Andrew Dunstan
andrew@dunslane.net
In reply to: Neil Conway (#11)
Re: 7.4 changes

Neil Conway wrote:

On Tue, 2004-10-19 at 02:45, Andrew Dunstan wrote:

*shrug* OK. Then plperl should probably not be regarded as being as
"trusted" as we would like. Note that old versions of Safe.pm have been
the subject of security advisories such as this one
http://www.securityfocus.com/bid/6111/info/ for some time.

Perhaps a compromise would be to require the newer version of Safe.pm,
but leave the other changes for 8.0. Upgrading Safe.pm can presumably be
done without needing any changes to the rest of one's pl/perl code.

s/the rest of/any of/

Indeed it can.

The other thing I suggested was removing the :base_io set of ops - I
would regard plperl functions that did things like printing to STDOUT as
broken to start with.

But maybe we can just live with what we have and advertise that 8.0's
plperl is more secure.

cheers

andrew

#13 Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Andrew Dunstan (#12)
Re: 7.4 changes

On Tue, Oct 19, 2004 at 08:47:20AM -0400, Andrew Dunstan wrote:

But maybe we can just live with what we have and advertise that 8.0's
plperl is more secure.

The release notes should point out that 7.4's plperl is insecure unless
the correct version of Safe.pm is installed. Maybe it would work to make
it croak if an unsafe version of Safe.pm is found?

I'm not sure about "living with" known security vulnerabilities. What
about ISPs which give Pg hosting with plperl installed? They surely
will want to know about this.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
One man's impedance mismatch is another man's layer of abstraction.
(Lincoln Yeoh)