pgsql: Fix a couple of bugs in MultiXactId freezing
Fix a couple of bugs in MultiXactId freezing
Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
into a multixact to check the members against cutoff_xid. This means
that a very old Xid could survive hidden within a multi, possibly
outliving its CLOG storage. In the distant future, this would cause
clog lookup failures:
ERROR: could not access status of transaction 3883960912
DETAIL: Could not open file "pg_clog/0E78": No such file or directory.
This mostly was problematic when the updating transaction aborted, since
in that case the row wouldn't get pruned away earlier in vacuum and the
multixact could possibly survive for a long time. In many cases, data
that is inaccessible for this reason can be brought back
heuristically.
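(Editor's illustration.) The clog failure above stems from the circular 32-bit transaction ID space: an XID left unfrozen for more than ~2 billion transactions flips from "past" to "future" in comparisons, while its CLOG segment may already be truncated. A simplified sketch, not the actual PostgreSQL source, of wraparound-aware comparison in the spirit of TransactionIdPrecedes():

```c
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Wraparound-aware comparison on the circular 32-bit XID space.  Each
 * XID "precedes" roughly the 2 billion XIDs logically after it; an
 * unfrozen XID drifting further behind than that inverts comparisons,
 * and its CLOG storage may already be gone -- hence the "could not
 * access status of transaction" error.
 */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff < 0;
}
```

Freezing replaces such old XIDs before they can fall that far behind, which is exactly what the buggy paths failed to do for XIDs hidden inside a multixact.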
As a second bug, heap_freeze_tuple() didn't properly handle multixacts
that need to be frozen according to cutoff_multi, but whose updater xid
is still alive. Instead of preserving the update Xid, it just set Xmax
invalid, which leads to both old and new tuple versions becoming
visible. This is pretty rare in practice, but a real threat
nonetheless. Existing corrupted rows, unfortunately, cannot be repaired
in an automated fashion.
Existing physical replicas might have already incorrectly frozen tuples
because of different behavior than in master, which might only become
apparent in the future once pg_multixact/ is truncated; it is
recommended that all clones be rebuilt after upgrading.
Following code analysis prompted by a bug report by J Smith in message
CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
and by a private report from F-Secure.
Backpatch to 9.3, where freezing of MultiXactIds was introduced.
Analysis and patch by Andres Freund, with some tweaks by Álvaro.
Branch
------
REL9_3_STABLE
Details
-------
http://git.postgresql.org/pg/commitdiff/8e53ae025de90b8f7d935ce0eb4d0551178a4caf
Modified Files
--------------
src/backend/access/heap/heapam.c | 160 ++++++++++++++++++++++++++++----
src/backend/access/transam/multixact.c | 14 ++-
2 files changed, 151 insertions(+), 23 deletions(-)
--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers
On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote:
Fix a couple of bugs in MultiXactId freezing
Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
into a multixact to check the members against cutoff_xid.
! /*
! * This is a multixact which is not marked LOCK_ONLY, but which
! * is newer than the cutoff_multi. If the update_xid is below the
! * cutoff_xid point, then we can just freeze the Xmax in the
! * tuple, removing it altogether. This seems simple, but there
! * are several underlying assumptions:
! *
! * 1. A tuple marked by an multixact containing a very old
! * committed update Xid would have been pruned away by vacuum; we
! * wouldn't be freezing this tuple at all.
! *
! * 2. There cannot possibly be any live locking members remaining
! * in the multixact. This is because if they were alive, the
! * update's Xid would had been considered, via the lockers'
! * snapshot's Xmin, as part the cutoff_xid.
READ COMMITTED transactions can reset MyPgXact->xmin between commands,
defeating that assumption; see SnapshotResetXmin(). I have attached an
isolationtester spec demonstrating the problem. The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
! *
! * 3. We don't create new MultiXacts via MultiXactIdExpand() that
! * include a very old aborted update Xid: in that function we only
! * include update Xids corresponding to transactions that are
! * committed or in-progress.
! */
! update_xid = HeapTupleGetUpdateXid(tuple);
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! freeze_xmax = true;
That was the only concrete runtime problem I found during a study of the
newest heap_freeze_tuple() and heap_tuple_needs_freeze() code. One thing that
leaves me unsure is the fact that vacuum_set_xid_limits() does no locking to
ensure a consistent result between GetOldestXmin() and GetOldestMultiXactId().
Transactions may start or end between those calls, making the
GetOldestMultiXactId() result represent a later set of transactions than the
GetOldestXmin() result. I suspect that's fine. New transactions have no
immediate effect on either cutoff, and transaction end can only increase a
cutoff. Using a slightly-lower cutoff than the maximum safe cutoff is always
okay; consider vacuum_defer_cleanup_age.
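The safety argument can be made concrete with a toy helper (hypothetical name, only loosely modeled on how vacuum_defer_cleanup_age is applied to GetOldestXmin()'s result): pushing a cutoff backward by any amount keeps freezing conservative, because a lower cutoff only means freezing less.

```c
#include <stdint.h>

#define FIRST_NORMAL_XID 3u     /* XIDs 0-2 are reserved special values */

/*
 * Move a freeze cutoff backward by defer_age transactions, clamping so
 * it never leaves the normal-XID range.  (A simplified clamp; real XIDs
 * live on a circle and the real code differs.)  Using a cutoff lower
 * than the maximum safe one is always okay: vacuum just freezes less.
 */
static uint32_t
deferred_cutoff(uint32_t oldest_xmin, uint32_t defer_age)
{
    if (defer_age >= oldest_xmin - FIRST_NORMAL_XID)
        return FIRST_NORMAL_XID;
    return oldest_xmin - defer_age;
}
```

The same reasoning covers the unlocked GetOldestXmin()/GetOldestMultiXactId() pair: an interleaving transaction can only make one of the two cutoffs slightly lower than strictly necessary, never unsafely higher.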
Thanks,
nm
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote:
Fix a couple of bugs in MultiXactId freezing
Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
into a multixact to check the members against cutoff_xid.
! /*
! * This is a multixact which is not marked LOCK_ONLY, but which
! * is newer than the cutoff_multi. If the update_xid is below the
! * cutoff_xid point, then we can just freeze the Xmax in the
! * tuple, removing it altogether. This seems simple, but there
! * are several underlying assumptions:
! *
! * 1. A tuple marked by an multixact containing a very old
! * committed update Xid would have been pruned away by vacuum; we
! * wouldn't be freezing this tuple at all.
! *
! * 2. There cannot possibly be any live locking members remaining
! * in the multixact. This is because if they were alive, the
! * update's Xid would had been considered, via the lockers'
! * snapshot's Xmin, as part the cutoff_xid.
READ COMMITTED transactions can reset MyPgXact->xmin between commands,
defeating that assumption; see SnapshotResetXmin(). I have attached an
isolationtester spec demonstrating the problem.
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
Too bad it's too late to do anything about it for 9.3.2. :(. At least the
last seems actually unrelated, I am not sure why it's 9.3.2
only. Alvaro, are you looking?
That was the only concrete runtime problem I found during a study of the
newest heap_freeze_tuple() and heap_tuple_needs_freeze() code.
I'd even be interested in fuzzy problems ;). If 9.3 hadn't been
released, the interactions between cutoff_xid/multi would have caused me
to say "back to the drawing board"... I'm not surprised if further things
are lurking there.
One thing that
leaves me unsure is the fact that vacuum_set_xid_limits() does no locking to
ensure a consistent result between GetOldestXmin() and GetOldestMultiXactId().
Transactions may start or end between those calls, making the
GetOldestMultiXactId() result represent a later set of transactions than the
GetOldestXmin() result. I suspect that's fine. New transactions have no
immediate effect on either cutoff, and transaction end can only increase a
cutoff. Using a slightly-lower cutoff than the maximum safe cutoff is always
okay; consider vacuum_defer_cleanup_age.
Yes, that seems fine to me, with the same reasoning.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Dec 03, 2013 at 11:56:07AM +0100, Andres Freund wrote:
On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote:
Fix a couple of bugs in MultiXactId freezing
Both heap_freeze_tuple() and heap_tuple_needs_freeze() neglected to look
into a multixact to check the members against cutoff_xid.
! /*
! * This is a multixact which is not marked LOCK_ONLY, but which
! * is newer than the cutoff_multi. If the update_xid is below the
! * cutoff_xid point, then we can just freeze the Xmax in the
! * tuple, removing it altogether. This seems simple, but there
! * are several underlying assumptions:
! *
! * 1. A tuple marked by an multixact containing a very old
! * committed update Xid would have been pruned away by vacuum; we
! * wouldn't be freezing this tuple at all.
! *
! * 2. There cannot possibly be any live locking members remaining
! * in the multixact. This is because if they were alive, the
! * update's Xid would had been considered, via the lockers'
! * snapshot's Xmin, as part the cutoff_xid.
READ COMMITTED transactions can reset MyPgXact->xmin between commands,
defeating that assumption; see SnapshotResetXmin(). I have attached an
isolationtester spec demonstrating the problem.
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
Perhaps set HEAP_XMAX_LOCK_ONLY on the tuple? We'd then ensure all update XID
consumers check HEAP_XMAX_IS_LOCKED_ONLY() first, much like xmax consumers are
already expected to check HEAP_XMAX_INVALID first. Seems doable, albeit yet
another injection of complexity.
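The checking order Noah proposes can be sketched as follows (bit values are illustrative only; the real infomask constants live in htup_details.h and the real test is the HEAP_XMAX_IS_LOCKED_ONLY() macro):

```c
#include <stdint.h>

/* Illustrative flag values, not the real infomask constants. */
#define XMAX_INVALID    0x0800  /* xmax is invalid: ignore it entirely */
#define XMAX_LOCK_ONLY  0x0080  /* xmax holds locker(s) only, no deletion */

/*
 * Sketch of the proposed discipline: before treating xmax as a deleting
 * or updating transaction, consumers check INVALID first, then
 * LOCK_ONLY.  Under the proposal, freezing could set LOCK_ONLY on a
 * tuple whose multixact still carries a doomed update XID, and every
 * consumer obeying this order would correctly ignore that XID.
 */
static int
xmax_is_deleter(uint16_t infomask)
{
    if (infomask & XMAX_INVALID)
        return 0;
    if (infomask & XMAX_LOCK_ONLY)
        return 0;
    return 1;
}
```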
The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
Too bad it's too late to do anything about it for 9.3.2. :(. At least the
last seems actually unrelated, I am not sure why it's 9.3.2
only. Alvaro, are you looking?
(For clarity, the other problem demonstrated by the test spec is also a 9.3.2
regression.)
That was the only concrete runtime problem I found during a study of the
newest heap_freeze_tuple() and heap_tuple_needs_freeze() code.
I'd even be interested in fuzzy problems ;). If 9.3 hadn't been
released, the interactions between cutoff_xid/multi would have caused me
to say "back to the drawing board"... I'm not surprised if further things
are lurking there.
heap_freeze_tuple() of 9.2 had an XXX comment about the possibility of getting
spurious lock contention due to wraparound of the multixact space. The
comment is gone, and that mechanism no longer poses a threat. However, a
non-wrapped multixact containing wrapped locker XIDs (we don't freeze locker
XIDs, just updater XIDs) may cause similar spurious contention.
+ /*
+ * The multixact has an update hidden within. Get rid of it.
+ *
+ * If the update_xid is below the cutoff_xid, it necessarily
+ * must be an aborted transaction. In a primary server, such
+ * an Xmax would have gotten marked invalid by
+ * HeapTupleSatisfiesVacuum, but in a replica that is not
+ * called before we are, so deal with it in the same way.
+ *
+ * If not below the cutoff_xid, then the tuple would have been
+ * pruned by vacuum, if the update committed long enough ago,
+ * and we wouldn't be freezing it; so it's either recently
+ * committed, or in-progress. Deal with this by setting the
+ * Xmax to the update Xid directly and remove the IS_MULTI
+ * bit. (We know there cannot be running lockers in this
+ * multi, because it's below the cutoff_multi value.)
+ */
+
+ if (TransactionIdPrecedes(update_xid, cutoff_xid))
+ {
+     Assert(InRecovery || TransactionIdDidAbort(update_xid));
+     freeze_xmax = true;
+ }
+ else
+ {
+     Assert(InRecovery || !TransactionIdIsInProgress(update_xid));
This assertion is at odds with the comment, but the assertion is okay for now.
If the updater is still in progress, its OldestMemberMXactId[] entry will have
held back cutoff_multi, and we won't be here. Therefore, if we get here, the
tuple will always be HEAPTUPLE_RECENTLY_DEAD (recently-committed updater) or
HEAPTUPLE_LIVE (aborted updater, recent or not).
Numerous comments in the vicinity (e.g. ones at MultiXactStateData) reflect a
pre-9.3 world. Most or all of that isn't new with the patch at hand, but it
does complicate study.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
Too bad it's too late to do anything about it for 9.3.2. :(. At least the
last seems actually unrelated, I am not sure why it's 9.3.2
only. Alvaro, are you looking?
Is this bad enough that we need to re-wrap the release?
regards, tom lane
On 2013-12-03 09:48:23 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
Too bad it's too late to do anything about it for 9.3.2. :(. At least the
last seems actually unrelated, I am not sure why it's 9.3.2
only. Alvaro, are you looking?
Is this bad enough that we need to re-wrap the release?
Tentatively I'd say no, the only risk is losing locks afaics. That's
much better than corrupting rows as in 9.3.1. But I'll look into it in
a bit more detail as soon as I am off the phone call I am on.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 09:16:18 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 11:56:07AM +0100, Andres Freund wrote:
On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote:
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
Perhaps set HEAP_XMAX_LOCK_ONLY on the tuple? We'd then ensure all update XID
consumers check HEAP_XMAX_IS_LOCKED_ONLY() first, much like xmax consumers are
already expected to check HEAP_XMAX_INVALID first. Seems doable, albeit yet
another injection of complexity.
I think it's pretty much checked that way already, but the problem seems
to be how to avoid checks on xid commit/abort in that case. I've
complained in 20131121200517.GM7240@alap2.anarazel.de that the old
pre-condition that multixacts aren't checked when they can't be relevant
(via OldestVisibleM*) isn't observed anymore.
So, if we re-introduce that condition again, we should be on the safe
side with that, right?
The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
Too bad it's too late to do anything about it for 9.3.2. :(. At least the
last seems actually unrelated, I am not sure why it's 9.3.2
only. Alvaro, are you looking?
(For clarity, the other problem demonstrated by the test spec is also a 9.3.2
regression.)
Yea, I just don't see why yet... Looking now.
heap_freeze_tuple() of 9.2 had an XXX comment about the possibility of getting
spurious lock contention due to wraparound of the multixact space. The
comment is gone, and that mechanism no longer poses a threat. However, a
non-wrapped multixact containing wrapped locker XIDs (we don't freeze locker
XIDs, just updater XIDs) may cause similar spurious contention.
Yea, I noticed that that comment was missing as well. I think what we
should do now is to rework freezing in HEAD to make all this more
reasonable.
Numerous comments in the vicinity (e.g. ones at MultiXactStateData) reflect a
pre-9.3 world. Most or all of that isn't new with the patch at hand, but it
does complicate study.
Yea, Alvaro sent a patch for that somewhere, it seems a patch in the
series got lost when foreign key locks were originally applied.
I think we seriously need to put a good amount of work into the
multixact.c stuff in the next months. Otherwise it will be a maintenance
nightmare for a fair bit more time.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 09:16:18 -0500, Noah Misch wrote:
The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
Too bad it's too late to do anything about it for 9.3.2. :(. At least the
last seems actually unrelated, I am not sure why it's 9.3.2
only. Alvaro, are you looking?
(For clarity, the other problem demonstrated by the test spec is also a 9.3.2
regression.)
The backtrace for the Assert() you found is:
#4 0x00000000004f1da5 in CreateMultiXactId (nmembers=2, members=0x1ce17d8)
at /home/andres/src/postgresql/src/backend/access/transam/multixact.c:708
#5 0x00000000004f1aeb in MultiXactIdExpand (multi=6241831, xid=6019366, status=MultiXactStatusUpdate)
at /home/andres/src/postgresql/src/backend/access/transam/multixact.c:462
#6 0x00000000004a5d8e in compute_new_xmax_infomask (xmax=6241831, old_infomask=4416, old_infomask2=16386, add_to_xmax=6019366,
mode=LockTupleExclusive, is_update=1 '\001', result_xmax=0x7fffca02a700, result_infomask=0x7fffca02a6fe,
result_infomask2=0x7fffca02a6fc) at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:4651
#7 0x00000000004a2d27 in heap_update (relation=0x7f9fc45cc828, otid=0x7fffca02a8d0, newtup=0x1ce1740, cid=0, crosscheck=0x0,
wait=1 '\001', hufd=0x7fffca02a850, lockmode=0x7fffca02a82c) at /home/andres/src/postgresql/src/backend/access/heap/heapam.c:3304
#8 0x0000000000646f04 in ExecUpdate (tupleid=0x7fffca02a8d0, oldtuple=0x0, slot=0x1ce12c0, planSlot=0x1ce0740, epqstate=0x1ce0120,
estate=0x1cdfe98, canSetTag=1 '\001') at /home/andres/src/postgresql/src/backend/executor/nodeModifyTable.c:690
So imo it isn't really a new problem, it existed all along :/. We only
didn't hit it in your test case before because we spuriously thought that
a tuple was in-progress if *any* member of the old multi were still
running in some cases instead of just the updater. But I am pretty sure
it can also be reproduced in 9.3.1.
Imo the MultiXactIdSetOldestMember() call in heap_update() needs to be
moved outside of the if (satisfies_key). Everything else is vastly more
complex.
Alvaro, correct?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 03, 2013 at 04:08:23PM +0100, Andres Freund wrote:
On 2013-12-03 09:16:18 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 11:56:07AM +0100, Andres Freund wrote:
On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote:
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
Perhaps set HEAP_XMAX_LOCK_ONLY on the tuple? We'd then ensure all update XID
consumers check HEAP_XMAX_IS_LOCKED_ONLY() first, much like xmax consumers are
already expected to check HEAP_XMAX_INVALID first. Seems doable, albeit yet
another injection of complexity.
I think it's pretty much checked that way already, but the problem seems
to be how to avoid checks on xid commit/abort in that case. I've
complained in 20131121200517.GM7240@alap2.anarazel.de that the old
pre-condition that multixacts aren't checked when they can't be relevant
(via OldestVisibleM*) isn't observed anymore.
So, if we re-introduce that condition again, we should be on the safe
side with that, right?
What specific commit/abort checks do you have in mind?
The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
Too bad it's too late to do anything about it for 9.3.2. :(. At least the
last seems actually unrelated, I am not sure why it's 9.3.2
only. Alvaro, are you looking?
(For clarity, the other problem demonstrated by the test spec is also a 9.3.2
regression.)
Yea, I just don't see why yet... Looking now.
Sorry, my original report was rather terse. I speak of the scenario exercised
by the second permutation in that isolationtester spec. The multixact is
later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2
does freeze it to InvalidTransactionId per the code I cited in my first
response on this thread, which wrongly removes a key lock.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On 2013-12-03 10:29:54 -0500, Noah Misch wrote:
Sorry, my original report was rather terse. I speak of the scenario exercised
by the second permutation in that isolationtester spec. The multixact is
later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2
does freeze it to InvalidTransactionId per the code I cited in my first
response on this thread, which wrongly removes a key lock.
That one is clear, I was only confused about the Assert() you
reported. But I think I've explained that elsewhere.
I currently don't see fixing the erroneous freezing of lockers (not the
updater though) without changing the WAL format or synchronously waiting
for all lockers to end. Which both seem like a no-go?
While it's still a major bug, it seems much better than the
previous case of either inaccessible or reappearing rows.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
On 2013-12-03 10:29:54 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 04:08:23PM +0100, Andres Freund wrote:
On 2013-12-03 09:16:18 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 11:56:07AM +0100, Andres Freund wrote:
On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
On Sat, Nov 30, 2013 at 01:06:09AM +0000, Alvaro Herrera wrote:
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
Perhaps set HEAP_XMAX_LOCK_ONLY on the tuple? We'd then ensure all update XID
consumers check HEAP_XMAX_IS_LOCKED_ONLY() first, much like xmax consumers are
already expected to check HEAP_XMAX_INVALID first. Seems doable, albeit yet
another injection of complexity.
I think it's pretty much checked that way already, but the problem seems
to be how to avoid checks on xid commit/abort in that case. I've
complained in 20131121200517.GM7240@alap2.anarazel.de that the old
pre-condition that multixacts aren't checked when they can't be relevant
(via OldestVisibleM*) isn't observed anymore.
So, if we re-introduce that condition again, we should be on the safe
side with that, right?
What specific commit/abort checks do you have in mind?
MultiXactIdIsRunning() does a TransactionIdIsInProgress() for each
member which in turn does TransactionIdDidCommit(). Similar when locking
a tuple that's locked/updated without a multixact where we go for a
TransactionIdIsInProgress() in XactLockTableWait().
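The cost Andres describes follows from the shape of the check: a multixact is "running" iff any member is still in progress, so a single lookup fans out into per-member liveness checks, each of which may consult commit/abort state. A minimal sketch with hypothetical names (the real logic is MultiXactIdIsRunning() in multixact.c):

```c
#include <stddef.h>

typedef unsigned int Xid;

/*
 * A multixact counts as running iff any member transaction is still in
 * progress.  The liveness test is passed in as a callback to keep the
 * sketch self-contained; in PostgreSQL that role is played by
 * TransactionIdIsInProgress(), which is the expensive part.
 */
static int
multi_is_running(const Xid *members, size_t nmembers,
                 int (*xid_in_progress)(Xid))
{
    for (size_t i = 0; i < nmembers; i++)
    {
        if (xid_in_progress(members[i]))
            return 1;
    }
    return 0;
}

/* Example liveness stub: pretend only XID 7 is still in progress. */
static int
only_seven_running(Xid xid)
{
    return xid == 7;
}
```

This is why skipping the check entirely when the multixact cannot be relevant (the old OldestVisibleM* precondition) matters for both correctness and cost.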
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi Alvaro, Noah,
On 2013-12-03 15:57:10 +0100, Andres Freund wrote:
On 2013-12-03 09:48:23 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-12-03 00:47:07 -0500, Noah Misch wrote:
The test spec additionally
covers a (probably-related) assertion failure, new in 9.3.2.
Too bad it's too late to do anything about it for 9.3.2. :(. At least the
last seems actually unrelated, I am not sure why it's 9.3.2
only. Alvaro, are you looking?
Is this bad enough that we need to re-wrap the release?
Tentatively I'd say no, the only risk is losing locks afaics. That's
much better than corrupting rows as in 9.3.1. But I'll look into it in
a bit more detail as soon as I am off the phone call I am on.
After looking, I think I am revising my opinion. The broken locking
behaviour (outside of freeze, which I don't see how we can fix in time)
is actually bad.
Would that stop us from making the release date with packages?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-12-03 09:48:23 -0500, Tom Lane wrote:
Is this bad enough that we need to re-wrap the release?
After looking, I think I am revising my opinion. The broken locking
behaviour (outside of freeze, which I don't see how we can fix in time)
is actually bad.
Would that stop us from making the release date with packages?
That's hardly answerable when you haven't specified how long you
think it'd take to fix.
In general, though, I'm going to be exceedingly unhappy if this release
introduces new regressions. If we have to put off the release to fix
something, maybe we'd better do so. And we'd damn well better get it
right this time.
regards, tom lane
On 2013-12-03 12:22:33 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
On 2013-12-03 09:48:23 -0500, Tom Lane wrote:
Is this bad enough that we need to re-wrap the release?
After looking, I think I am revising my opinion. The broken locking
behaviour (outside of freeze, which I don't see how we can fix in time)
is actually bad.
Would that stop us from making the release date with packages?
That's hardly answerable when you haven't specified how long you
think it'd take to fix.
There are two things that are broken as-is:
1) the freezing of multixacts: The new state is much better than the old
one because the old one corrupted data, while the new one sometimes removes
locks when you explicitly FREEZE.
2) The broken locking behaviour in Noah's test without the
FREEZE.
I don't see a realistic chance of fixing 1) in 9.3. Not even sure if it
can be done without changing the freeze WAL format. But 2) should be fixed
and basically is a one-liner + comments + test. Alvaro?
In general, though, I'm going to be exceedingly unhappy if this release
introduces new regressions. If we have to put off the release to fix
something, maybe we'd better do so. And we'd damn well better get it
right this time.
I think that's really hard for the multixacts stuff. There's lots of
stuff not really okay in there, but we can't do much about that now :(
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
Maybe we should just bite the bullet and change the WAL format for
heap_freeze (inventing an all-new record type, not repurposing the old
one, and allowing WAL replay to continue to accept the old one). The
implication for users would be that they'd have to update slave servers
before the master when installing the update; which is unpleasant, but
better than living with a known data corruption case.
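In redo terms, Tom's proposal amounts to dispatching on the record type while keeping the legacy handler alive; a sketch with hypothetical record codes (the actual xlog info bits and redo routines in heapam.c are named differently):

```c
/* Hypothetical record-type codes; not the actual xlog info bits. */
enum freeze_rec_type
{
    FREEZE_REC_LEGACY = 0,      /* old format: still accepted on replay */
    FREEZE_REC_LOCKAWARE = 1    /* new format: preserves locker information */
};

/*
 * Sketch of a redo dispatcher that accepts both formats.  Returns 1 if
 * the record was replayed, 0 if the type is unrecognized.  Keeping the
 * legacy branch is what lets a patched standby replay WAL produced by a
 * not-yet-patched primary, which is why slaves must be updated first.
 */
static int
redo_freeze(enum freeze_rec_type type)
{
    switch (type)
    {
        case FREEZE_REC_LEGACY:
            /* ... apply old-style freeze: clear xmax outright ... */
            return 1;
        case FREEZE_REC_LOCKAWARE:
            /* ... apply new-style freeze: keep locker-only xmax ... */
            return 1;
        default:
            return 0;
    }
}
```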
regards, tom lane
On Tue, Dec 3, 2013 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andres Freund <andres@2ndquadrant.com> writes:
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
Maybe we should just bite the bullet and change the WAL format for
heap_freeze (inventing an all-new record type, not repurposing the old
one, and allowing WAL replay to continue to accept the old one). The
implication for users would be that they'd have to update slave servers
before the master when installing the update; which is unpleasant, but
better than living with a known data corruption case.
Agreed. It may suck, but it sucks less.
How badly will it break if they do the upgrade in the wrong order, though?
Will the slaves just stop (I assume this?) or is there a risk of a
wrong-order upgrade causing extra breakage? And if they do shut down, would
just upgrading the slave fix it, or would they then have to rebuild the
slave? (actually, don't we recommend they always rebuild the slave
*anyway*? In which case the problem is even smaller..)
I think we've always told people to upgrade the slave first, and it's the
logical thing that AFAIK most other systems require as well, so that's not
an unreasonable requirement at all.
I assume we'd then get rid of the old record type completely in 9.4, right?
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote:
On 2013-12-03 10:29:54 -0500, Noah Misch wrote:
Sorry, my original report was rather terse. I speak of the scenario exercised
by the second permutation in that isolationtester spec. The multixact is
later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2
does freeze it to InvalidTransactionId per the code I cited in my first
response on this thread, which wrongly removes a key lock.
That one is clear, I was only confused about the Assert() you
reported. But I think I've explained that elsewhere.
I currently don't see fixing the erroneous freezing of lockers (not the
updater though) without changing the WAL format or synchronously waiting
for all lockers to end. Which both seem like a no-go?
Not fixing it at all is the real no-go. We'd take both of those undesirables
before just tolerating the lost locks in 9.3.
The attached patch illustrates the approach I was describing earlier. It
fixes the test case discussed above. I haven't verified that everything else
in the system is ready for it, so this is just for illustration purposes.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
Attachments:
freeze-multi-lockonly-v1.patch (text/plain)
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 5384,5412 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
TransactionId update_xid;
/*
! * This is a multixact which is not marked LOCK_ONLY, but which
! * is newer than the cutoff_multi. If the update_xid is below the
! * cutoff_xid point, then we can just freeze the Xmax in the
! * tuple, removing it altogether. This seems simple, but there
! * are several underlying assumptions:
! *
! * 1. A tuple marked by a multixact containing a very old
! * committed update Xid would have been pruned away by vacuum; we
! * wouldn't be freezing this tuple at all.
! *
! * 2. There cannot possibly be any live locking members remaining
! * in the multixact. This is because if they were alive, the
! * update's Xid would have been considered, via the lockers'
! * snapshot's Xmin, as part of the cutoff_xid.
! *
! * 3. We don't create new MultiXacts via MultiXactIdExpand() that
! * include a very old aborted update Xid: in that function we only
! * include update Xids corresponding to transactions that are
! * committed or in-progress.
*/
update_xid = HeapTupleGetUpdateXid(tuple);
if (TransactionIdPrecedes(update_xid, cutoff_xid))
! freeze_xmax = true;
}
}
else if (TransactionIdIsNormal(xid) &&
--- 5384,5404 ----
TransactionId update_xid;
/*
! * This is a multixact which is not marked LOCK_ONLY, but which is
! * newer than the cutoff_multi. To advance relfrozenxid, we must
! * eliminate any updater XID that falls below cutoff_xid.
! * Affected multixacts may also contain outstanding lockers, which
! * we must preserve. Forming a new multixact is tempting, but
! * solving it that way requires a WAL format change. Instead,
! * mark the tuple LOCK_ONLY. The updater XID is still present,
! * but consuming code will notice LOCK_ONLY and assume it's gone.
*/
update_xid = HeapTupleGetUpdateXid(tuple);
if (TransactionIdPrecedes(update_xid, cutoff_xid))
! {
! tuple->t_infomask |= HEAP_XMAX_LOCK_ONLY;
! changed = true;
! }
}
}
else if (TransactionIdIsNormal(xid) &&
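The decision the new hunk makes can be restated as a minimal standalone sketch. The types and the helper names here are simplified stand-ins, not PostgreSQL's real definitions (the HEAP_XMAX_LOCK_ONLY bit value is copied from htup_details.h, but everything else is illustrative): if the updater XID in a non-LOCK_ONLY multi falls below cutoff_xid, the updater must have aborted, so the tuple is merely marked LOCK_ONLY rather than having its Xmax invalidated, preserving any lockers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins; real definitions live in c.h and htup_details.h. */
typedef uint32_t TransactionId;

#define HEAP_XMAX_LOCK_ONLY 0x0080 /* value borrowed from htup_details.h */

/* Wraparound-aware comparison as in transam.c: id1 precedes id2 if the
 * signed 32-bit difference is negative. */
static bool
transaction_id_precedes(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff < 0;
}

/* Sketch of the fixed freeze decision for a multi carrying an update:
 * an updater XID older than cutoff_xid must have aborted, so keep any
 * lockers and only set LOCK_ONLY instead of invalidating Xmax. */
static bool
mark_lock_only_if_old_updater(TransactionId update_xid,
                              TransactionId cutoff_xid,
                              uint16_t *infomask)
{
    if (transaction_id_precedes(update_xid, cutoff_xid))
    {
        *infomask |= HEAP_XMAX_LOCK_ONLY;
        return true;            /* tuple changed */
    }
    return false;
}
```

The point of the comment in the hunk is exactly this: the updater XID physically remains in the multi, but consuming code sees LOCK_ONLY and treats the update as gone.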
Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
Maybe we should just bite the bullet and change the WAL format for
heap_freeze (inventing an all-new record type, not repurposing the old
one, and allowing WAL replay to continue to accept the old one). The
implication for users would be that they'd have to update slave servers
before the master when installing the update; which is unpleasant, but
better than living with a known data corruption case.
That was my suggestion too (modulo, I admit, the bit about it being a
new, separate record type.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Magnus Hagander <magnus@hagander.net> writes:
On Tue, Dec 3, 2013 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe we should just bite the bullet and change the WAL format for
heap_freeze (inventing an all-new record type, not repurposing the old
one, and allowing WAL replay to continue to accept the old one). The
implication for users would be that they'd have to update slave servers
before the master when installing the update; which is unpleasant, but
better than living with a known data corruption case.
Agreed. It may suck, but it sucks less.
How badly will it break if they do the upgrade in the wrong order, though?
Will the slaves just stop (I assume this?) or is there a risk of a
wrong-order upgrade causing extra breakage?
I assume what would happen is the slave would PANIC upon seeing a WAL
record code it didn't recognize. Installing the updated version should
allow it to resume functioning. Would be good to test this, but if it
doesn't work like that, that'd be another bug to fix IMO. We've always
foreseen the possible need to do something like this, so it ought to
work reasonably cleanly.
I assume we'd then get rid of the old record type completely in 9.4, right?
Yeah, we should be able to drop it in 9.4, since we'll surely have other
WAL format changes anyway.
regards, tom lane
On Tue, Dec 3, 2013 at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
On Tue, Dec 3, 2013 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe we should just bite the bullet and change the WAL format for
heap_freeze (inventing an all-new record type, not repurposing the old
one, and allowing WAL replay to continue to accept the old one). The
implication for users would be that they'd have to update slave servers
before the master when installing the update; which is unpleasant, but
better than living with a known data corruption case.
Agreed. It may suck, but it sucks less.
How badly will it break if they do the upgrade in the wrong order, though?
Will the slaves just stop (I assume this?) or is there a risk of a
wrong-order upgrade causing extra breakage?
I assume what would happen is the slave would PANIC upon seeing a WAL
record code it didn't recognize. Installing the updated version should
allow it to resume functioning. Would be good to test this, but if it
doesn't work like that, that'd be another bug to fix IMO. We've always
foreseen the possible need to do something like this, so it ought to
work reasonably cleanly.
Yeah, as long as that's tested and actually works, that sounds like an
acceptable thing to deal with.
I assume we'd then get rid of the old record type completely in 9.4,
right?
Yeah, we should be able to drop it in 9.4, since we'll surely have other
WAL format changes anyway.
And even if not, there's no point in keeping it unless we actually support
replication from 9.3 -> 9.4, I think, and I don't believe anybody has even
considered working on that yet :)
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 2013-12-03 13:14:38 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote:
On 2013-12-03 10:29:54 -0500, Noah Misch wrote:
Sorry, my original report was rather terse. I speak of the scenario exercised
by the second permutation in that isolationtester spec. The multixact is
later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2
does freeze it to InvalidTransactionId per the code I cited in my first
response on this thread, which wrongly removes a key lock.
That one is clear, I was only confused about the Assert() you
reported. But I think I've explained that elsewhere.
I currently don't see fixing the erroneous freezing of lockers (not the
updater though) without changing the WAL format or synchronously waiting
for all lockers to end. Both of which seem like no-gos?
Not fixing it at all is the real no-go. We'd take both of those undesirables
before just tolerating the lost locks in 9.3.
I think it's changing the wal format then.
The attached patch illustrates the approach I was describing earlier. It
fixes the test case discussed above. I haven't verified that everything else
in the system is ready for it, so this is just for illustration purposes.
That might be better than the current state because the potential damage
from such not frozen locks would be to get "could not access status of
transaction ..." errors (*).
But the problem I see with it is that a FOR UPDATE/NO KEY UPDATE lock taken out by
UPDATE is different from the respective lock taken out by SELECT FOR
SHARE:
typedef enum
{
MultiXactStatusForKeyShare = 0x00,
MultiXactStatusForShare = 0x01,
MultiXactStatusForNoKeyUpdate = 0x02,
MultiXactStatusForUpdate = 0x03,
/* an update that doesn't touch "key" columns */
MultiXactStatusNoKeyUpdate = 0x04,
/* other updates, and delete */
MultiXactStatusUpdate = 0x05
} MultiXactStatus;
Ignoring the difference that way isn't going to fly nicely.
*: which already are possible because we do not check multis properly
against OldestVisibleMXactId[] anymore.
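The distinction Andres is pointing at can be captured in a tiny predicate. The enum values are restated from the quote above so the sketch is self-contained; the real code expresses this with the ISUPDATE_from_mxstatus() macro in multixact.h, and the function name here is illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

/* MultiXactStatus values as quoted above (real home: multixact.h). */
typedef enum
{
    MultiXactStatusForKeyShare = 0x00,
    MultiXactStatusForShare = 0x01,
    MultiXactStatusForNoKeyUpdate = 0x02,
    MultiXactStatusForUpdate = 0x03,
    /* an update that doesn't touch "key" columns */
    MultiXactStatusNoKeyUpdate = 0x04,
    /* other updates, and delete */
    MultiXactStatusUpdate = 0x05
} MultiXactStatus;

/* Only the last two statuses denote a real update; the first four are
 * lock strengths.  Demoting an aborted (NoKey)Update member to a plain
 * For(NoKey)Update lock would conflate these two ranges, which is the
 * objection raised above. */
static bool
status_is_update(MultiXactStatus status)
{
    return status > MultiXactStatusForUpdate;
}
```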
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch wrote:
On 2013-12-03 10:29:54 -0500, Noah Misch wrote:
Sorry, my original report was rather terse. I speak of the scenario exercised
by the second permutation in that isolationtester spec. The multixact is
later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2
does freeze it to InvalidTransactionId per the code I cited in my first
response on this thread, which wrongly removes a key lock.
That one is clear, I was only confused about the Assert() you
reported. But I think I've explained that elsewhere.
I currently don't see fixing the erroneous freezing of lockers (not the
updater though) without changing the WAL format or synchronously waiting
for all lockers to end. Both of which seem like no-gos?
Not fixing it at all is the real no-go. We'd take both of those undesirables
before just tolerating the lost locks in 9.3.
Well, unless I misunderstand, this is only a problem in the case that
cutoff_multi is not yet past but cutoff_xid is; and that there are
locker transactions still running. So it's really a corner case. Not
saying it's impossible to hit, mind.
The attached patch illustrates the approach I was describing earlier. It
fixes the test case discussed above. I haven't verified that everything else
in the system is ready for it, so this is just for illustration purposes.
Wow, this is scary. I don't oppose it in principle, but we'd better go
over the whole thing once more to ensure every place that cares is
prepared to deal with this.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 13:11:13 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
Any idea how to cheat our way out of that one given the current way
heap_freeze_tuple() works (running on both primary and standby)? My only
idea was to MultiXactIdWait() if !InRecovery but that's extremely grotty.
We can't even realistically create a new multixact with fewer members
with the current format of xl_heap_freeze.
Maybe we should just bite the bullet and change the WAL format for
heap_freeze (inventing an all-new record type, not repurposing the old
one, and allowing WAL replay to continue to accept the old one). The
implication for users would be that they'd have to update slave servers
before the master when installing the update; which is unpleasant, but
better than living with a known data corruption case.
I wondered about that myself. How would you suggest the format to look
like?
ISTM per tuple we'd need:
* OffsetNumber off
* uint16 infomask
* TransactionId xmin
* TransactionId xmax
I don't see why we'd need infomask2, but so far being overly skimpy in
that place hasn't shown itself to be the greatest idea?
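Laid out as a struct, the per-tuple payload Andres proposes might look like the following. The type name and field comments are hypothetical, for illustration only; the design that was eventually committed (xl_heap_freeze_tuple) differs in detail, e.g. by using a frozen-xmin flag as Álvaro suggests below:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's basic types. */
typedef uint16_t OffsetNumber;
typedef uint32_t TransactionId;

/* Hypothetical per-tuple payload for a new-style freeze WAL record,
 * following the four fields listed above. */
typedef struct
{
    OffsetNumber  off;         /* tuple's line-pointer offset in the page */
    uint16_t      t_infomask;  /* full infomask to install on replay */
    TransactionId xmin;        /* replacement xmin (or a frozen flag) */
    TransactionId xmax;        /* replacement xmax, e.g. the update XID */
} freeze_tuple_sketch;
```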
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
I wondered about that myself. How would you suggest the format to look
like?
ISTM per tuple we'd need:
* OffsetNumber off
* uint16 infomask
* TransactionId xmin
* TransactionId xmax
I don't see why we'd need infomask2, but so far being overly skimpy in
that place hasn't shown itself to be the greatest idea?
I don't see that we need the xmin; a simple bit flag indicating whether
the Xmin was frozen should be enough.
For xmax we need more detail, as you propose. In infomask, are you
proposing to store the complete infomask, or just the flags being
changed? Note we have a set of bits used in other wal records,
XLHL_XMAX_IS_MULTI and friends, which perhaps we can use here.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote:
On 2013-12-03 13:14:38 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote:
I currently don't see fixing the erroneous freezing of lockers (not the
updater though) without changing the WAL format or synchronously waiting
for all lockers to end. Both of which seem like no-gos?
Not fixing it at all is the real no-go. We'd take both of those undesirables
before just tolerating the lost locks in 9.3.
I think it's changing the WAL format then.
I'd rather have a readily-verifiable fix that changes the WAL format than a
tricky fix that avoids doing so. So, modulo not having seen the change, +1.
The attached patch illustrates the approach I was describing earlier. It
fixes the test case discussed above. I haven't verified that everything else
in the system is ready for it, so this is just for illustration purposes.
That might be better than the current state because the potential damage
from such not frozen locks would be to get "could not access status of
transaction ..." errors (*).
*: which already are possible because we do not check multis properly
against OldestVisibleMXactId[] anymore.
Separate issue. That patch adds to the ways we can end up with an extant
multixact referencing a locker XID no longer found in CLOG, but it doesn't
introduce that problem.
--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
On 2013-12-03 13:49:49 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote:
On 2013-12-03 13:14:38 -0500, Noah Misch wrote:
Not fixing it at all is the real no-go. We'd take both of those undesirables
before just tolerating the lost locks in 9.3.
I think it's changing the WAL format then.
I'd rather have a readily-verifiable fix that changes the WAL format than a
tricky fix that avoids doing so. So, modulo not having seen the change, +1.
Well, who's going to write that then? I can write something up, but I
really would not like to be solely responsible for it.
That means we cannot release 9.3 on schedule anyway, right?
The attached patch illustrates the approach I was describing earlier. It
fixes the test case discussed above. I haven't verified that everything else
in the system is ready for it, so this is just for illustration purposes.
That might be better than the current state because the potential damage
from such not-frozen locks would be to get "could not access status of
transaction ..." errors (*).
*: which already are possible because we do not check multis properly
against OldestVisibleMXactId[] anymore.
Separate issue. That patch adds to the ways we can end up with an extant
multixact referencing a locker XID no longer found in CLOG, but it doesn't
introduce that problem.
Sure, that was an argument in favor of your idea, not against it ;).
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 15:40:44 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
I wondered about that myself. How would you suggest the format to look
like?
ISTM per tuple we'd need:
* OffsetNumber off
* uint16 infomask
* TransactionId xmin
* TransactionId xmax
I don't see why we'd need infomask2, but so far being overly skimpy in
that place hasn't shown itself to be the greatest idea?
I don't see that we need the xmin; a simple bit flag indicating whether
the Xmin was frozen should be enough.
Yea, that would work as well.
For xmax we need more detail, as you propose. In infomask, are you
proposing to store the complete infomask, or just the flags being
changed? Note we have a set of bits used in other wal records,
XLHL_XMAX_IS_MULTI and friends, which perhaps we can use here.
Tentatively the complete one. I don't think we'd win enough by using
compute_infobits/fix_infomask_from_infobits and we'd need to extend the
bits stored in there unless we are willing to live with not transporting
XMIN/XMAX_COMMITTED which doesn't seem like a good idea.
Btw, why is it currently ok to modify the tuple in heap_freeze_tuple()
without being in a critical section?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
On 2013-12-03 13:49:49 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote:
On 2013-12-03 13:14:38 -0500, Noah Misch wrote:
Not fixing it at all is the real no-go. We'd take both of those undesirables
before just tolerating the lost locks in 9.3.
I think it's changing the WAL format then.
I'd rather have a readily-verifiable fix that changes the WAL format than a
tricky fix that avoids doing so. So, modulo not having seen the change, +1.
Well, who's going to write that then? I can write something up, but I
really would not like to be solely responsible for it.
I will give this a shot.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@leadboat.com> writes:
On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote:
On 2013-12-03 13:14:38 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote:
I currently don't see fixing the erroneous freezing of lockers (not the
updater though) without changing the WAL format or synchronously waiting
for all lockers to end. Both of which seem like no-gos?
Not fixing it at all is the real no-go. We'd take both of those undesirables
before just tolerating the lost locks in 9.3.
I think it's changing the WAL format then.
I'd rather have a readily-verifiable fix that changes the WAL format than a
tricky fix that avoids doing so. So, modulo not having seen the change, +1.
tricky fix that avoids doing so. So, modulo not having seen the change, +1.
Yeah, same here.
After some discussion, the core committee has concluded that we should go
ahead with the already-wrapped releases. 9.2.6 and below are good anyway,
and despite this issue 9.3.2 is an improvement over 9.3.1. We'll plan to
do a 9.3.3 as soon as the multixact situation can be straightened out;
but let's learn from experience and not try to fix it in a panic.
regards, tom lane
Tom Lane wrote:
After some discussion, the core committee has concluded that we should go
ahead with the already-wrapped releases. 9.2.6 and below are good anyway,
and despite this issue 9.3.2 is an improvement over 9.3.1. We'll plan to
do a 9.3.3 as soon as the multixact situation can be straightened out;
but let's learn from experience and not try to fix it in a panic.
I would suggest we include this one fix in 9.3.2a. This bug is more
serious than the others because it is triggered merely by checking HTSU on a
tuple containing running locker-only transactions and an aborted update.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
tqual-fix.patch (text/x-diff)
commit 4be26fe1ebc3b198d093a0334e033bb70516fa60
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue Dec 3 15:06:40 2013 -0300
Fixup "Don't TransactionIdDidAbort in ..." patch
Noah Misch reported a failure in the new freezing logic for MultiXactIds,
providing a test case that demonstrated it; further investigation revealed
that the HeapTupleSatisfiesUpdate routine could ignore in-progress locker
transactions for a tuple whose Xmax is a MultiXactId that contains an aborted
update. In short, the lock would suddenly be ignored by other transactions.
This change reverts the change to the delete-abort-savept isolation test;
turns out changing it was in error.
Andres Freund and Álvaro Herrera
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4d63b1c..f787f2c 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -789,13 +789,26 @@ HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
if (TransactionIdDidCommit(xmax))
return HeapTupleUpdated;
- /* no member, even just a locker, alive anymore */
+ /*
+ * By here, the update in the Xmax is either aborted or crashed, but
+ * what about the other members?
+ */
+
if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)))
+ {
+ /*
+ * There's no member, even just a locker, alive anymore, so we can
+ * mark the Xmax as invalid.
+ */
SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
InvalidTransactionId);
-
- /* it must have aborted or crashed */
- return HeapTupleMayBeUpdated;
+ return HeapTupleMayBeUpdated;
+ }
+ else
+ {
+ /* There are lockers running */
+ return HeapTupleBeingUpdated;
+ }
}
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
diff --git a/src/test/isolation/expected/delete-abort-savept.out b/src/test/isolation/expected/delete-abort-savept.out
index 5b8c444..3420cf4 100644
--- a/src/test/isolation/expected/delete-abort-savept.out
+++ b/src/test/isolation/expected/delete-abort-savept.out
@@ -23,11 +23,12 @@ key value
step s1svp: SAVEPOINT f;
step s1d: DELETE FROM foo;
step s1r: ROLLBACK TO f;
-step s2l: SELECT * FROM foo FOR UPDATE;
+step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
+step s1c: COMMIT;
+step s2l: <... completed>
key value
1 1
-step s1c: COMMIT;
step s2c: COMMIT;
starting permutation: s1l s1svp s1d s1r s2l s2c s1c
@@ -38,12 +39,8 @@ key value
step s1svp: SAVEPOINT f;
step s1d: DELETE FROM foo;
step s1r: ROLLBACK TO f;
-step s2l: SELECT * FROM foo FOR UPDATE;
-key value
-
-1 1
-step s2c: COMMIT;
-step s1c: COMMIT;
+step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
+invalid permutation detected
starting permutation: s1l s1svp s1d s2l s1r s1c s2c
step s1l: SELECT * FROM foo FOR KEY SHARE;
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
Tom Lane wrote:
After some discussion, the core committee has concluded that we should go
ahead with the already-wrapped releases. 9.2.6 and below are good anyway,
and despite this issue 9.3.2 is an improvement over 9.3.1. We'll plan to
do a 9.3.3 as soon as the multixact situation can be straightened out;
but let's learn from experience and not try to fix it in a panic.
I would suggest we include this one fix in 9.3.2a. This bug is more
serious than the others because it is triggered merely by checking HTSU on a
tuple containing running locker-only transactions and an aborted update.
The effect is just that the lockers could lose their locks early, right?
While that's annoying, it's not *directly* a data corruption problem.
And I've lost any enthusiasm I might've had for quick fixes in this area.
I think it'd be better to wait a few days, think this over, and get the
other problem fixed as well.
In any case, I think we're already past the point where we could re-wrap
9.3.2; the tarballs have been in the hands of packagers for > 24 hours.
We'd have to call it 9.3.3.
regards, tom lane
Noah Misch wrote:
On Tue, Dec 03, 2013 at 04:08:23PM +0100, Andres Freund wrote:
(For clarity, the other problem demonstrated by the test spec is also a 9.3.2
regression.)
Yea, I just don't see why yet... Looking now.
Sorry, my original report was rather terse. I speak of the scenario exercised
by the second permutation in that isolationtester spec. The multixact is
later than VACUUM's cutoff_multi, so 9.3.1 does not freeze it at all. 9.3.2
does freeze it to InvalidTransactionId per the code I cited in my first
response on this thread, which wrongly removes a key lock.
Attached is a patch to fix it. It's a simple fix, really, but it
reverts the delete-abort-savept test changes we did in 1ce150b7bb.
(This is a more complete version of a patch I posted elsewhere in this
thread as a reply to Tom.)
I added a new isolation spec to test this specific case, and noticed
something that seems curious to me when that test is run in REPEATABLE
READ mode: when the UPDATE is aborted, the concurrent FOR UPDATE gets a
"can't serialize due to concurrent update", but when the UPDATE is
committed, FOR UPDATE works fine. Shouldn't it happen pretty much
exactly the other way around?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-03 15:46:09 -0500, Tom Lane wrote:
Noah Misch <noah@leadboat.com> writes:
I'd rather have a readily-verifiable fix that changes the WAL format than a
tricky fix that avoids doing so. So, modulo not having seen the change, +1.
Yeah, same here.
I am afraid it won't be *that* simple. We still need code to look into
multis, check whether all members are ok wrt. cutoff_xid and replace
them, either by the contained xid, or by a new multi with the still
living members. Ugly.
There's currently also the issue that heap_freeze_tuple() modifies the
tuple inplace without a critical section. We're executing a
HeapTupleSatisfiesVacuum() before we get to WAL logging things, that has
plenty of rope to hang itself on. So that doesn't really seem ok to me?
Attached is a pre-pre alpha patch for this. To fix the issue with the
missing critical section it splits freezing into
heap_prepare_freeze_tuple() and heap_execute_freeze_tuple(). The former
doesn't touch tuples and is executed on the primary, and the second
actually performs the modifications and is executed both during normal
processing and recovery.
Needs a fair bit of work:
* Should move parts of the multixact processing into multixact.c,
specifically it shouldn't require CreateMultiXactId() to be exported.
* it relies on forward-declaring a struct in heapam.h that's actually
defined in heapam_xlog.h - that's pretty ugly.
* any form of testing beyond make check/isolationcheck across SR.
* lots of the surrounding comments need to be added/reworked
* has a simpler version of Alvaro's patch to HTSV in there
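The prepare/execute split described above can be illustrated with a toy model. The struct and function names are hypothetical and the tuple header is heavily simplified (the real functions operate on HeapTupleHeader, and the prepared state is what gets WAL-logged between the two steps); the wraparound-unaware xmin comparison is a deliberate simplification:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define FrozenTransactionId ((TransactionId) 2)

/* State computed by the prepare step; the real equivalent is the
 * xl_heap_freeze_tuple WAL payload. */
typedef struct
{
    uint16_t      t_infomask;  /* infomask to install */
    TransactionId xmax;        /* xmax to install */
    bool          freeze_xmin; /* set xmin to FrozenTransactionId */
} freeze_plan;

/* Heavily simplified tuple header. */
typedef struct
{
    TransactionId xmin;
    TransactionId xmax;
    uint16_t      t_infomask;
} tuple_header_sketch;

/* Step 1: decide what to do without touching the tuple; safe to run
 * outside a critical section, on the primary only. */
static bool
prepare_freeze(const tuple_header_sketch *tup, TransactionId cutoff_xid,
               freeze_plan *plan)
{
    plan->t_infomask = tup->t_infomask;
    plan->xmax = tup->xmax;
    /* wraparound-unaware on purpose; see transam.c for the real test */
    plan->freeze_xmin = (tup->xmin != 0 && tup->xmin < cutoff_xid);
    return plan->freeze_xmin;  /* "changed" in this simplified sketch */
}

/* Step 2: apply the plan; runs inside a critical section during normal
 * processing and again, verbatim, during WAL replay. */
static void
execute_freeze(tuple_header_sketch *tup, const freeze_plan *plan)
{
    if (plan->freeze_xmin)
        tup->xmin = FrozenTransactionId;
    tup->xmax = plan->xmax;
    tup->t_infomask = plan->t_infomask;
}
```

The design point is that the execute step is pure mechanism: because it makes no decisions, replay on a standby applies exactly the modifications the primary prepared and logged.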
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-WIP-new-freezing-format.patch (text/x-patch)
From 330d128665fcf8633e60a42e8e4a497e2975dac0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 4 Dec 2013 00:11:20 +0100
Subject: [PATCH] WIP: new freezing format
---
src/backend/access/heap/heapam.c | 476 +++++++++++++++++++++++++--------
src/backend/access/rmgrdesc/heapdesc.c | 9 +
src/backend/access/transam/multixact.c | 3 +-
src/backend/commands/vacuumlazy.c | 28 +-
src/backend/utils/time/tqual.c | 6 +-
src/include/access/heapam.h | 7 +-
src/include/access/heapam_xlog.h | 32 ++-
src/include/access/multixact.h | 1 +
8 files changed, 443 insertions(+), 119 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c13f87c..b80fa5b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5242,14 +5242,17 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
-
/*
- * heap_freeze_tuple
+ * heap_prepare_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID. If so, replace them with
- * FrozenTransactionId or InvalidTransactionId as appropriate, and return
- * TRUE. Return FALSE if nothing was changed.
+ * are older than the specified cutoff XID. If so, return enough state to
+ * later execute and WAL log replacing them with FrozenTransactionId or
+ * InvalidTransactionId as appropriate, and return TRUE. Return FALSE if
+ * nothing was changed.
+ *
+ * The 'off' field of the freeze state has to be set by the caller, not here,
+ * if required.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
@@ -5258,54 +5261,44 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
- * anyone's idea of the tuple state. Also, since we assume the tuple is
- * not HEAPTUPLE_DEAD, the fact that an XID is not still running allows us
- * to assume that it is either committed good or aborted, as appropriate;
- * so we need no external state checks to decide what to do. (This is good
- * because this function is applied during WAL recovery, when we don't have
- * access to any such state, and can't depend on the hint bits to be set.)
- * There is an exception we make which is to assume GetMultiXactIdMembers can
- * be called during recovery.
- *
+ * anyone's idea of the tuple state.
* Similarly, cutoff_multi must be less than or equal to the smallest
* MultiXactId used by any transaction currently open.
*
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
- * Note: it might seem we could make the changes without exclusive lock, since
- * TransactionId read/write is assumed atomic anyway. However there is a race
- * condition: someone who just fetched an old XID that we overwrite here could
- * conceivably not finish checking the XID against pg_clog before we finish
- * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
- * exclusive lock ensures no other backend is in process of checking the
- * tuple status. Also, getting exclusive lock makes it safe to adjust the
- * infomask bits.
- *
- * NB: Cannot rely on hint bits here, they might not be set after a crash or
- * on a standby.
+ * NB: It is not enough that hint bits indicate something is committed/invalid
+ * - they might not be set on a standby/after crash recovery. So we really
+ * need to remove old xids.
*/
bool
-heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ TransactionId cutoff_multi, xl_heap_freeze_tuple *frz)
+
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
+ frz->freeze_xmin = false;
+ frz->invalid_xvac = false;
+ frz->freeze_xvac = false;
+ frz->t_infomask2 = tuple->t_infomask2;
+ frz->t_infomask = tuple->t_infomask;
+ frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
+
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, cutoff_xid))
{
- HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
-
+ frz->freeze_xmin = true;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED will
* already be set here, but there's a small chance not.
*/
- Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
- tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
@@ -5332,81 +5325,139 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
/*
* This old multi cannot possibly be running. If it was a locker
* only, it can be removed without much further thought; but if it
- * contained an update, we need to preserve it.
+ * contained an update, we might need to preserve it.
*/
if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
+ {
freeze_xmax = true;
+ }
else
{
- TransactionId update_xid;
+ /* replace multi by update xid */
+ frz->xmax = HeapTupleGetUpdateXid(tuple);
+ frz->t_infomask &= ~HEAP_XMAX_BITS;
+ frz->t_infomask &= ~HEAP_XMAX_IS_MULTI;
- update_xid = HeapTupleGetUpdateXid(tuple);
+ /* wasn't only a lock, xid needs to be valid */
+ Assert(TransactionIdIsValid(frz->xmax));
/*
- * The multixact has an update hidden within. Get rid of it.
- *
- * If the update_xid is below the cutoff_xid, it necessarily
- * must be an aborted transaction. In a primary server, such
- * an Xmax would have gotten marked invalid by
- * HeapTupleSatisfiesVacuum, but in a replica that is not
- * called before we are, so deal with it in the same way.
- *
- * If not below the cutoff_xid, then the tuple would have been
- * pruned by vacuum, if the update committed long enough ago,
- * and we wouldn't be freezing it; so it's either recently
- * committed, or in-progress. Deal with this by setting the
- * Xmax to the update Xid directly and remove the IS_MULTI
- * bit. (We know there cannot be running lockers in this
- * multi, because it's below the cutoff_multi value.)
+ * If the xid is older than the cutoff, it has to have
+ * aborted, otherwise it would have gotten pruned away.
*/
-
- if (TransactionIdPrecedes(update_xid, cutoff_xid))
+ if (TransactionIdPrecedes(frz->xmax, cutoff_xid))
{
- Assert(InRecovery || TransactionIdDidAbort(update_xid));
+ Assert(!TransactionIdDidCommit(frz->xmax));
freeze_xmax = true;
}
else
{
- Assert(InRecovery || !TransactionIdIsInProgress(update_xid));
- tuple->t_infomask &= ~HEAP_XMAX_BITS;
- HeapTupleHeaderSetXmax(tuple, update_xid);
- changed = true;
+ /* preserve xmax */
}
+ changed = true;
}
}
- else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
+ else if (MultiXactIdIsRunning(xid))
{
- /* newer than the cutoff, so don't touch it */
- ;
+ /* cannot be below cutoff */
}
else
{
- TransactionId update_xid;
+ TransactionId update_xid = InvalidTransactionId;
+ MultiXactMember *members = NULL;
+ MultiXactMember *newmembers = NULL;
+ int nmembers;
+ int nnewmembers = 0;
+ bool has_live_members = false;
+ bool mxact_needs_freeze = false;
+ int i;
/*
- * This is a multixact which is not marked LOCK_ONLY, but which
- * is newer than the cutoff_multi. If the update_xid is below the
- * cutoff_xid point, then we can just freeze the Xmax in the
- * tuple, removing it altogether. This seems simple, but there
- * are several underlying assumptions:
- *
- * 1. A tuple marked by an multixact containing a very old
- * committed update Xid would have been pruned away by vacuum; we
- * wouldn't be freezing this tuple at all.
- *
- * 2. There cannot possibly be any live locking members remaining
- * in the multixact. This is because if they were alive, the
- * update's Xid would had been considered, via the lockers'
- * snapshot's Xmin, as part the cutoff_xid.
- *
- * 3. We don't create new MultiXacts via MultiXactIdExpand() that
- * include a very old aborted update Xid: in that function we only
- * include update Xids corresponding to transactions that are
- * committed or in-progress.
+ * For MultiXacts that are not below the cutoff, we need to check
+ * whether any of the members are too old.
*/
- update_xid = HeapTupleGetUpdateXid(tuple);
- if (TransactionIdPrecedes(update_xid, cutoff_xid))
+ nmembers = GetMultiXactIdMembers(xid, &members, false);
+
+ if (nmembers <= 0)
+ {
+ /* pg_upgrade'd multi, just freeze away */
+ freeze_xmax = true;
+ }
+ else
+ {
+ newmembers = (MultiXactMember *)
+ palloc(sizeof(MultiXactMember) * (nmembers + 1));
+
+ for (i = 0; i < nmembers; i++)
+ {
+ bool keep = false;
+ bool isupdate = ISUPDATE_from_mxstatus(members[i].status);
+
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ {
+ /*
+ * A potential updater could not have committed, tuple
+ * would have gotten vacuumed away already.
+ */
+ Assert(!isupdate || !TransactionIdDidCommit(members[i].xid));
+ mxact_needs_freeze = true;
+ }
+ else if (TransactionIdIsInProgress(members[i].xid))
+ {
+ keep = true;
+ if (isupdate)
+ update_xid = members[i].xid;
+ }
+ else if (TransactionIdDidCommit(members[i].xid) && isupdate)
+ {
+ /*
+ * Only updates need to be preserved when they have
+ * committed, locks aren't interesting anymore.
+ */
+ keep = true;
+ update_xid = members[i].xid;
+ }
+
+ if (keep)
+ {
+ newmembers[nnewmembers++] = members[i];
+ has_live_members = true;
+ }
+ }
+ }
+
+ if (!mxact_needs_freeze)
+ {
+ /* nothing to do */;
+ }
+ else if (has_live_members &&
+ TransactionIdIsValid(update_xid) &&
+ nnewmembers == 1)
+ {
+ /* only the updater is still alive, replace multixact by xid */
+ frz->xmax = update_xid;
+ frz->t_infomask &= ~HEAP_XMAX_BITS;
+ frz->t_infomask |= HEAP_XMAX_COMMITTED;
+ changed = true;
+ /* do not clear HEAP_HOT_UPDATED, HEAP_KEYS_UPDATED just yet */
+ }
+ else if (has_live_members)
+ {
+ frz->xmax = CreateMultiXactId(nnewmembers, newmembers);
+ changed = true;
+ }
+ else
+ {
freeze_xmax = true;
+ }
+
+ /* cleanup memory we might have allocated */
+ if (nmembers > 0)
+ pfree(members);
+ if (newmembers != NULL)
+ pfree(newmembers);
}
}
else if (TransactionIdIsNormal(xid) &&
@@ -5417,20 +5468,21 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (freeze_xmax)
{
- HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
+ frz->xmax = InvalidTransactionId;
/*
* The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
* LOCKED. Normalize to INVALID just to be sure no one gets confused.
* Also get rid of the HEAP_KEYS_UPDATED bit.
*/
- tuple->t_infomask &= ~HEAP_XMAX_BITS;
- tuple->t_infomask |= HEAP_XMAX_INVALID;
- HeapTupleHeaderClearHotUpdated(tuple);
- tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ frz->t_infomask &= ~HEAP_XMAX_BITS;
+ frz->t_infomask |= HEAP_XMAX_INVALID;
+ frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
+ frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}
+
/*
* Old-style VACUUM FULL is gone, but we have to keep this code as long as
* we support having MOVED_OFF/MOVED_IN tuples in the database.
@@ -5447,16 +5499,16 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
- HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+ frz->freeze_xvac = true;
else
- HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+ frz->invalid_xvac = true;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED
* will already be set here, but there's a small chance not.
*/
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
- tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
}
@@ -5464,6 +5516,59 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
return changed;
}
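
The member scan above can be distilled into a standalone sketch. This is not the PostgreSQL code - toy types and a toy wraparound comparison stand in for MultiXactMember and TransactionIdPrecedes - but it shows the policy: flag the multi for freezing when any member is older than the cutoff, and keep only in-progress members plus a committed updater:

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { MEMBER_LOCKER, MEMBER_UPDATER } ToyKind;
typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } ToyState;

typedef struct ToyMember
{
	uint32_t	xid;
	ToyKind		kind;
	ToyState	state;
} ToyMember;

/*
 * Filter multixact members against cutoff_xid. Copies the members worth
 * keeping into newmembers and returns how many were kept; *needs_freeze is
 * set when any member is older than the cutoff (compared modulo 2^32, in
 * the spirit of TransactionIdPrecedes).
 */
static int
toy_filter_members(const ToyMember *members, int n, uint32_t cutoff_xid,
				   ToyMember *newmembers, bool *needs_freeze)
{
	int			nkept = 0;

	*needs_freeze = false;
	for (int i = 0; i < n; i++)
	{
		bool		keep = false;

		if ((int32_t) (members[i].xid - cutoff_xid) < 0)
			*needs_freeze = true;	/* too old; must not survive freezing */
		else if (members[i].state == XACT_IN_PROGRESS)
			keep = true;			/* live locker or updater */
		else if (members[i].state == XACT_COMMITTED &&
				 members[i].kind == MEMBER_UPDATER)
			keep = true;			/* committed update must be preserved */

		if (keep)
			newmembers[nkept++] = members[i];
	}
	return nkept;
}
```

If nothing is kept, the Xmax can simply be invalidated; if only the updater survives, the multi can be replaced by that single xid, matching the branches in the patch.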
+
+/*
+ * heap_freeze_tuple - freeze tuple inplace without WAL logging.
+ *
+ * Useful for callers like CLUSTER that perform their own WAL logging.
+ */
+bool
+heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ TransactionId cutoff_multi)
+{
+ xl_heap_freeze_tuple frz;
+ bool do_freeze;
+
+ do_freeze = heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi, &frz);
+ if (do_freeze)
+ heap_execute_freeze_tuple(tuple, &frz);
+ return do_freeze;
+}
+
+/*
+ * heap_execute_freeze_tuple
+ *
+ * Execute the prepared freezing of a tuple.
+ *
+ * Note: it might seem we could make the changes without exclusive lock, since
+ * TransactionId read/write is assumed atomic anyway. However there is a race
+ * condition: someone who just fetched an old XID that we overwrite here could
+ * conceivably not finish checking the XID against pg_clog before we finish
+ * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
+ * exclusive lock ensures no other backend is in process of checking the
+ * tuple status. Also, getting exclusive lock makes it safe to adjust the
+ * infomask bits.
+ *
+ * NB: All code in here must be safe to execute during crash recovery!
+ */
+void
+heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+{
+ if (frz->freeze_xmin)
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ HeapTupleHeaderSetXmax(tuple, frz->xmax);
+
+ if (frz->freeze_xvac)
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ if (frz->invalid_xvac)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+
+ tuple->t_infomask = frz->t_infomask;
+ tuple->t_infomask2 = frz->t_infomask2;
+}
+
/*
* For a given MultiXactId, return the hint bits that should be set in the
* tuple's infomask.
@@ -5767,16 +5872,26 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
}
else if (MultiXactIdPrecedes(multi, cutoff_multi))
return true;
- else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
- {
- /* only-locker multis don't need internal examination */
- ;
- }
else
{
- if (TransactionIdPrecedes(HeapTupleGetUpdateXid(tuple),
- cutoff_xid))
- return true;
+ MultiXactMember *members;
+ int nmembers;
+ int i;
+
+ /* need to check whether any member of the mxact is too old */
+
+ nmembers = GetMultiXactIdMembers(multi, &members, false);
+
+ for (i = 0; i < nmembers; i++)
+ {
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ {
+ pfree(members);
+ return true;
+ }
+ }
+ if (nmembers > 0)
+ pfree(members);
}
}
else
@@ -6031,22 +6146,22 @@ log_heap_clean(Relation reln, Buffer buffer,
*/
XLogRecPtr
log_heap_freeze(Relation reln, Buffer buffer,
- TransactionId cutoff_xid, MultiXactId cutoff_multi,
- OffsetNumber *offsets, int offcnt)
+ TransactionId cutoff_xid,
+ xl_heap_freeze_tuple *tuples, int ntuples)
{
- xl_heap_freeze xlrec;
+ xl_heap_freeze_page xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
/* nor when there are no tuples to freeze */
- Assert(offcnt > 0);
+ Assert(ntuples > 0);
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.cutoff_xid = cutoff_xid;
- xlrec.cutoff_multi = cutoff_multi;
+ xlrec.ntuples = ntuples;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapFreeze;
@@ -6058,13 +6173,13 @@ log_heap_freeze(Relation reln, Buffer buffer,
* it is. When XLogInsert stores the whole buffer, the offsets array need
* not be stored too.
*/
- rdata[1].data = (char *) offsets;
- rdata[1].len = offcnt * sizeof(OffsetNumber);
+ rdata[1].data = (char *) tuples;
+ rdata[1].len = ntuples * SizeOfHeapFreezeTuple;
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
- recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE, rdata);
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE, rdata);
return recptr;
}
@@ -6406,6 +6521,99 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
}
+/*
+ * Freeze a single tuple for XLOG_HEAP2_FREEZE
+ *
+ * NB: Records of this type are no longer generated, since the bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * code is kept around to be able to perform PITR.
+ */
+static bool
+heap_xlog_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi)
+{
+ bool changed = false;
+ TransactionId xid;
+
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ /*
+ * Might as well fix the hint bits too; usually XMIN_COMMITTED will
+ * already be set here, but there's a small chance not.
+ */
+ Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+ tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ changed = true;
+ }
+
+ /*
+ * Note that this code handles IS_MULTI Xmax values, too, but only to mark
+ * the tuple as not updated if the multixact is below the cutoff Multixact
+ * given; it doesn't remove dead members of a very old multixact.
+ */
+ xid = HeapTupleHeaderGetRawXmax(tuple);
+ if ((tuple->t_infomask & HEAP_XMAX_IS_MULTI) ?
+ (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, cutoff_multi)) :
+ (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid)))
+ {
+ HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
+
+ /*
+ * The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
+ * LOCKED. Normalize to INVALID just to be sure no one gets confused.
+ * Also get rid of the HEAP_KEYS_UPDATED bit.
+ */
+ tuple->t_infomask &= ~HEAP_XMAX_BITS;
+ tuple->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderClearHotUpdated(tuple);
+ tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ changed = true;
+ }
+
+ /*
+ * Old-style VACUUM FULL is gone, but we have to keep this code as long as
+ * we support having MOVED_OFF/MOVED_IN tuples in the database.
+ */
+ if (tuple->t_infomask & HEAP_MOVED)
+ {
+ xid = HeapTupleHeaderGetXvac(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ /*
+ * If a MOVED_OFF tuple is not dead, the xvac transaction must
+ * have failed; whereas a non-dead MOVED_IN tuple must mean the
+ * xvac transaction succeeded.
+ */
+ if (tuple->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ /*
+ * Might as well fix the hint bits too; usually XMIN_COMMITTED
+ * will already be set here, but there's a small chance not.
+ */
+ Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+ tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ changed = true;
+ }
+ }
+
+ return changed;
+}
+
+/*
+ * NB: Records of this type are no longer generated, since the bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * code is kept around to be able to perform PITR.
+ */
static void
heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
{
@@ -6454,7 +6662,7 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
ItemId lp = PageGetItemId(page, *offsets);
HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
- (void) heap_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
+ (void) heap_xlog_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
offsets++;
}
}
@@ -6578,6 +6786,59 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
}
}
+/*
+ * Replay XLOG_HEAP2_FREEZE_PAGE records
+ */
+static void
+heap_xlog_freeze_page(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
+ TransactionId cutoff_xid = xlrec->cutoff_xid;
+ Buffer buffer;
+ Page page;
+ int ntup;
+
+ /*
+ * In Hot Standby mode, ensure that there's no queries running which still
+ * consider the frozen xids as running.
+ */
+ if (InHotStandby)
+ ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ /* now execute freeze plan for each frozen tuple */
+ for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ {
+ xl_heap_freeze_tuple *xlrec_tp = &xlrec->tuples[ntup];
+ /* offsets are one-based */
+ ItemId lp = PageGetItemId(page, xlrec_tp->off);
+ HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
+ heap_execute_freeze_tuple(tuple, xlrec_tp);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+}
+
static void
heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
{
@@ -7433,6 +7694,9 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
case XLOG_HEAP2_CLEAN:
heap_xlog_clean(lsn, record);
break;
+ case XLOG_HEAP2_FREEZE_PAGE:
+ heap_xlog_freeze_page(lsn, record);
+ break;
case XLOG_HEAP2_CLEANUP_INFO:
heap_xlog_cleanup_info(lsn, record);
break;
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index e14c053..1b244b1 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -149,6 +149,15 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->node.relNode, xlrec->block,
xlrec->latestRemovedXid);
}
+ if (info == XLOG_HEAP2_FREEZE_PAGE)
+ {
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
+
+ appendStringInfo(buf, "freeze_page: rel %u/%u/%u; blk %u; cutoff xid %u ntuples %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block,
+ xlrec->cutoff_xid, xlrec->ntuples);
+ }
else if (info == XLOG_HEAP2_CLEANUP_INFO)
{
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 2081470..2a1bf6f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -286,7 +286,6 @@ static MemoryContext MXactContext = NULL;
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
-static MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int nmembers, MultiXactMember *members);
static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
@@ -672,7 +671,7 @@ ReadNextMultiXactId(void)
*
* NB: the passed members[] array will be sorted in-place.
*/
-static MultiXactId
+MultiXactId
CreateMultiXactId(int nmembers, MultiXactMember *members)
{
MultiXactId multi;
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index fe2d9e7..538f3b8 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -500,13 +500,14 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- OffsetNumber frozen[MaxOffsetNumber];
+ xl_heap_freeze_tuple frozen[MaxOffsetNumber]; /* FIXME: stack ok? */
int nfrozen;
Size freespace;
bool all_visible_according_to_vm;
bool all_visible;
bool has_dead_tuples;
TransactionId visibility_cutoff_xid = InvalidTransactionId;
+ int i;
if (blkno == next_not_all_visible_block)
{
@@ -894,9 +895,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
*/
- if (heap_freeze_tuple(tuple.t_data, FreezeLimit,
- MultiXactCutoff))
- frozen[nfrozen++] = offnum;
+ if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
+ MultiXactCutoff, &frozen[nfrozen]))
+ frozen[nfrozen++].off = offnum;
}
} /* scan along page */
@@ -907,15 +908,32 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
*/
if (nfrozen > 0)
{
+ START_CRIT_SECTION();
+
MarkBufferDirty(buf);
+
+ /* execute collected freezes */
+ for (i = 0; i < nfrozen; i++)
+ {
+ ItemId itemid;
+ HeapTupleHeader htup;
+
+ itemid = PageGetItemId(page, frozen[i].off);
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ heap_execute_freeze_tuple(htup, &frozen[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
recptr = log_heap_freeze(onerel, buf, FreezeLimit,
- MultiXactCutoff, frozen, nfrozen);
+ frozen, nfrozen);
PageSetLSN(page, recptr);
}
+ END_CRIT_SECTION();
}
/*
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 1ebc5ff..67d5fec 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -598,11 +598,13 @@ HeapTupleSatisfiesUpdate(HeapTuple htup, CommandId curcid,
/* no member, even just a locker, alive anymore */
if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)))
+ {
SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
InvalidTransactionId);
+ return HeapTupleMayBeUpdated;
+ }
- /* it must have aborted or crashed */
- return HeapTupleMayBeUpdated;
+ return HeapTupleBeingUpdated;
}
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 0d40398..e5864bb 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -148,8 +148,13 @@ extern HTSU_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
bool follow_update,
Buffer *buffer, HeapUpdateFailureData *hufd);
extern void heap_inplace_update(Relation relation, HeapTuple tuple);
+
+struct xl_heap_freeze_tuple;
extern bool heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
- TransactionId cutoff_multi);
+ TransactionId cutoff_multi);
+extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ TransactionId cutoff_multi, struct xl_heap_freeze_tuple *frz);
+extern void heap_execute_freeze_tuple(HeapTupleHeader tuple, struct xl_heap_freeze_tuple *xlrec_tp);
extern bool heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
MultiXactId cutoff_multi, Buffer buf);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..d2baa5b 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -50,7 +50,7 @@
*/
#define XLOG_HEAP2_FREEZE 0x00
#define XLOG_HEAP2_CLEAN 0x10
-/* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
+#define XLOG_HEAP2_FREEZE_PAGE 0x20
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
@@ -251,6 +251,33 @@ typedef struct xl_heap_freeze
#define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_multi) + sizeof(MultiXactId))
+/* This is what we need to know about tuple freezing during vacuum */
+typedef struct xl_heap_freeze_tuple
+{
+ OffsetNumber off;
+ bool freeze_xmin;
+ bool invalid_xvac;
+ bool freeze_xvac;
+ TransactionId xmax;
+ uint16 t_infomask2;
+ uint16 t_infomask;
+} xl_heap_freeze_tuple;
+
+#define SizeOfHeapFreezeTuple sizeof(xl_heap_freeze_tuple)
+
+/* This is what we need to know about a page being frozen during vacuum */
+typedef struct xl_heap_freeze_block
+{
+ RelFileNode node;
+ BlockNumber block;
+ TransactionId cutoff_xid;
+ uint16 ntuples;
+ xl_heap_freeze_tuple tuples[1];
+} xl_heap_freeze_page;
+
+#define MinSizeOfHeapFreezeBlock (offsetof(xl_heap_freeze_block, tuples))
+
+
/* This is what we need to know about setting a visibility map bit */
typedef struct xl_heap_visible
{
@@ -277,8 +304,7 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
- TransactionId cutoff_xid, MultiXactId cutoff_multi,
- OffsetNumber *offsets, int offcnt);
+ TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples, int ntuples);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid);
extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6085ea3..cad91e2 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -79,6 +79,7 @@ typedef struct xl_multixact_create
extern MultiXactId MultiXactIdCreate(TransactionId xid1,
MultiXactStatus status1, TransactionId xid2,
MultiXactStatus status2);
+extern MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
MultiXactStatus status);
extern MultiXactId ReadNextMultiXactId(void);
--
1.8.5.rc2.dirty
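
The heart of the patch is the split of heap_freeze_tuple() into a prepare step, which computes a freeze plan without touching the tuple, and an execute step, which applies the plan and is also what WAL replay runs - so master and standby make identical changes. A minimal standalone sketch of that pattern, using toy types and deliberately simplified rules (not the real freezing logic):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the real structs; NOT the PostgreSQL definitions. */
typedef struct ToyTuple
{
	uint32_t	xmin;
	uint32_t	xmax;
	uint16_t	infomask;
} ToyTuple;

typedef struct ToyFreezePlan
{
	bool		freeze_xmin;
	uint32_t	xmax;
	uint16_t	infomask;
} ToyFreezePlan;

#define TOY_FROZEN_XID		2u
#define TOY_INVALID_XID		0u
#define TOY_XMAX_INVALID	0x0800

/*
 * Wraparound-aware "a precedes b", in the spirit of TransactionIdPrecedes:
 * xids live on a modulo-2^32 circle, so compare via the signed difference.
 */
static bool
toy_xid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}

/* Phase 1: decide what to change; mutate only the plan, never the tuple. */
static bool
toy_prepare_freeze(const ToyTuple *tup, uint32_t cutoff_xid,
				   ToyFreezePlan *plan)
{
	bool		changed = false;

	plan->freeze_xmin = false;
	plan->xmax = tup->xmax;
	plan->infomask = tup->infomask;

	if (tup->xmin != TOY_INVALID_XID &&
		toy_xid_precedes(tup->xmin, cutoff_xid))
	{
		plan->freeze_xmin = true;
		changed = true;
	}
	if (tup->xmax != TOY_INVALID_XID &&
		toy_xid_precedes(tup->xmax, cutoff_xid))
	{
		plan->xmax = TOY_INVALID_XID;
		plan->infomask |= TOY_XMAX_INVALID;
		changed = true;
	}
	return changed;
}

/* Phase 2: apply the plan; the same code path can run during WAL replay. */
static void
toy_execute_freeze(ToyTuple *tup, const ToyFreezePlan *plan)
{
	if (plan->freeze_xmin)
		tup->xmin = TOY_FROZEN_XID;
	tup->xmax = plan->xmax;
	tup->infomask = plan->infomask;
}
```

Because the plan struct is what gets WAL-logged, replay does not have to rerun any clog or multixact lookups - it just applies the recorded decisions, which is exactly what made this approach robust enough to fix the multixact bugs.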
Alvaro Herrera wrote:
Attached is a patch to fix it.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
tqual-fix.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/utils/time/tqual.c
--- b/src/backend/utils/time/tqual.c
***************
*** 789,801 **** HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
if (TransactionIdDidCommit(xmax))
return HeapTupleUpdated;
! /* no member, even just a locker, alive anymore */
if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)))
SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
InvalidTransactionId);
!
! /* it must have aborted or crashed */
! return HeapTupleMayBeUpdated;
}
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
--- 789,814 ----
if (TransactionIdDidCommit(xmax))
return HeapTupleUpdated;
! /*
! * By here, the update in the Xmax is either aborted or crashed, but
! * what about the other members?
! */
!
if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)))
+ {
+ /*
+ * There's no member, even just a locker, alive anymore, so we can
+ * mark the Xmax as invalid.
+ */
SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
InvalidTransactionId);
! return HeapTupleMayBeUpdated;
! }
! else
! {
! /* There are lockers running */
! return HeapTupleBeingUpdated;
! }
}
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
*** a/src/test/isolation/expected/delete-abort-savept.out
--- b/src/test/isolation/expected/delete-abort-savept.out
***************
*** 23,33 **** key value
step s1svp: SAVEPOINT f;
step s1d: DELETE FROM foo;
step s1r: ROLLBACK TO f;
! step s2l: SELECT * FROM foo FOR UPDATE;
key value
1 1
- step s1c: COMMIT;
step s2c: COMMIT;
starting permutation: s1l s1svp s1d s1r s2l s2c s1c
--- 23,34 ----
step s1svp: SAVEPOINT f;
step s1d: DELETE FROM foo;
step s1r: ROLLBACK TO f;
! step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
! step s1c: COMMIT;
! step s2l: <... completed>
key value
1 1
step s2c: COMMIT;
starting permutation: s1l s1svp s1d s1r s2l s2c s1c
***************
*** 38,49 **** key value
step s1svp: SAVEPOINT f;
step s1d: DELETE FROM foo;
step s1r: ROLLBACK TO f;
! step s2l: SELECT * FROM foo FOR UPDATE;
! key value
!
! 1 1
! step s2c: COMMIT;
! step s1c: COMMIT;
starting permutation: s1l s1svp s1d s2l s1r s1c s2c
step s1l: SELECT * FROM foo FOR KEY SHARE;
--- 39,46 ----
step s1svp: SAVEPOINT f;
step s1d: DELETE FROM foo;
step s1r: ROLLBACK TO f;
! step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
! invalid permutation detected
starting permutation: s1l s1svp s1d s2l s1r s1c s2c
step s1l: SELECT * FROM foo FOR KEY SHARE;
*** /dev/null
--- b/src/test/isolation/expected/multixact-no-forget.out
***************
*** 0 ****
--- 1,28 ----
+ Parsed test spec with 3 sessions
+
+ starting permutation: s1_lock s2_update s2_abort s3_lock s1_commit
+ step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+ value
+
+ 1
+ step s2_update: UPDATE dont_forget SET value = 2;
+ step s2_abort: ROLLBACK;
+ step s3_lock: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+ step s1_commit: COMMIT;
+ step s3_lock: <... completed>
+ value
+
+
+ starting permutation: s1_lock s2_update s2_commit s3_lock s1_commit
+ step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+ value
+
+ 1
+ step s2_update: UPDATE dont_forget SET value = 2;
+ step s2_commit: COMMIT;
+ step s3_lock: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+ step s1_commit: COMMIT;
+ step s3_lock: <... completed>
+ value
+
+ 2
*** /dev/null
--- b/src/test/isolation/expected/multixact-no-forget_1.out
***************
*** 0 ****
--- 1,27 ----
+ Parsed test spec with 3 sessions
+
+ starting permutation: s1_lock s2_update s2_abort s3_lock s1_commit
+ step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+ value
+
+ 1
+ step s2_update: UPDATE dont_forget SET value = 2;
+ step s2_abort: ROLLBACK;
+ step s3_lock: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+ step s1_commit: COMMIT;
+ step s3_lock: <... completed>
+ error in steps s1_commit s3_lock: ERROR: could not serialize access due to concurrent update
+
+ starting permutation: s1_lock s2_update s2_commit s3_lock s1_commit
+ step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+ value
+
+ 1
+ step s2_update: UPDATE dont_forget SET value = 2;
+ step s2_commit: COMMIT;
+ step s3_lock: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+ step s1_commit: COMMIT;
+ step s3_lock: <... completed>
+ value
+
+ 2
*** a/src/test/isolation/isolation_schedule
--- b/src/test/isolation/isolation_schedule
***************
*** 20,24 **** test: delete-abort-savept
--- 20,25 ----
test: delete-abort-savept-2
test: aborted-keyrevoke
test: multixact-no-deadlock
+ test: multixact-no-forget
test: drop-index-concurrently-1
test: timeouts
*** /dev/null
--- b/src/test/isolation/specs/multixact-no-forget.spec
***************
*** 0 ****
--- 1,32 ----
+ # If transaction A holds a lock, and transaction B does an update,
+ # make sure we don't forget the lock if B aborts.
+ setup
+ {
+ CREATE TABLE dont_forget (
+ value int
+ );
+
+ INSERT INTO dont_forget VALUES (1);
+ }
+
+ teardown
+ {
+ DROP TABLE dont_forget;
+ }
+
+ session "s1"
+ setup { BEGIN; }
+ step "s1_lock" { SELECT * FROM dont_forget FOR KEY SHARE; }
+ step "s1_commit" { COMMIT; }
+
+ session "s2"
+ setup { BEGIN; }
+ step "s2_update" { UPDATE dont_forget SET value = 2; }
+ step "s2_abort" { ROLLBACK; }
+ step "s2_commit" { COMMIT; }
+
+ session "s3"
+ step "s3_lock" { SELECT * FROM dont_forget FOR UPDATE; }
+
+ permutation "s1_lock" "s2_update" "s2_abort" "s3_lock" "s1_commit"
+ permutation "s1_lock" "s2_update" "s2_commit" "s3_lock" "s1_commit"
On 2013-12-03 19:55:40 -0300, Alvaro Herrera wrote:
I added a new isolation spec to test this specific case, and noticed
something that seems curious to me when that test is run in REPEATABLE
READ mode: when the UPDATE is aborted, the concurrent FOR UPDATE gets a
"can't serialize due to concurrent update", but when the UPDATE is
committed, FOR UPDATE works fine. Shouldn't it happen pretty much
exactly the other way around?
That's 247c76a989097f1b4ab6fae898f24e75aa27fc1b. Specifically the
DidCommit() branch in test_lockmode_for_conflict(). You forgot something
akin to
/* locker has finished, all good to go */
if (!ISUPDATE_from_mxstatus(status))
return HeapTupleMayBeUpdated;
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Dec 3, 2013 at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
On Tue, Dec 3, 2013 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe we should just bite the bullet and change the WAL format for
heap_freeze (inventing an all-new record type, not repurposing the old
one, and allowing WAL replay to continue to accept the old one). The
implication for users would be that they'd have to update slave servers
before the master when installing the update; which is unpleasant, but
better than living with a known data corruption case.Agreed. It may suck, but it sucks less.
How badly will it break if they do the upgrade in the wrong order though.
Will the slaves just stop (I assume this?) or is there a risk of a
wrong-order upgrade causing extra breakage?

I assume what would happen is the slave would PANIC upon seeing a WAL
record code it didn't recognize. Installing the updated version should
allow it to resume functioning. Would be good to test this, but if it
doesn't work like that, that'd be another bug to fix IMO. We've always
foreseen the possible need to do something like this, so it ought to
work reasonably cleanly.
I wonder if we should for the future have the START_REPLICATION command (or
the IDENTIFY_SYSTEM would probably make more sense - or even adding a new
command like IDENTIFY_CLIENT. The point is, something in the replication
protocol) have the walreceiver send its version to the master. That
way we could have the walsender identify a walreceiver that's too old and
disconnect it right away - with a much nicer error message than a PANIC.
Right now, walreceiver knows the version of the walsender (through
pqserverversion), but AFAICT there is no way for the walsender to know
which version of the receiver is connected.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Magnus Hagander <magnus@hagander.net> writes:
On Tue, Dec 3, 2013 at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I assume what would happen is the slave would PANIC upon seeing a WAL
record code it didn't recognize.
I wonder if we should for the future have the START_REPLICATION command (or
the IDENTIFY_SYSTEM would probably make more sense - or even adding a new
command like IDENTIFY_CLIENT. The point is, something in the replication
protocol) have the walreceiver send its version to the master. That
way we could have the walsender identify a walreceiver that's too old and
disconnect it right away - with a much nicer error message than a PANIC.
Meh. That only helps for the case of streaming replication, and not for
the thirty-seven other ways that some WAL might arrive at something that
wants to replay it.
It might be worth doing anyway, but I can't get excited about it for this
scenario.
regards, tom lane
On Wed, Dec 4, 2013 at 8:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
On Tue, Dec 3, 2013 at 7:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I assume what would happen is the slave would PANIC upon seeing a WAL
record code it didn't recognize.
I wonder if we should for the future have the START_REPLICATION command
(or
the IDENTIFY_SYSTEM would probably make more sense - or even adding a new
command like IDENTIFY_CLIENT. The point is, something in the replication
protocol) have the walreceiver send its version to the master. That
way we could have the walsender identify a walreceiver that's too old and
disconnect it right away - with a much nicer error message than a PANIC.
Meh. That only helps for the case of streaming replication, and not for
the thirty-seven other ways that some WAL might arrive at something that
wants to replay it.
It might be worth doing anyway, but I can't get excited about it for this
scenario.
It does, but I bet it's by far one of the most common cases. I'd say it's
that one and restore-from-backup that would cover a huge majority of all
cases. If we can cover those, we don't have to be perfect - so unless it
turns out to be ridiculously complicated, I think it would be worthwhile
having.
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
Andres Freund wrote:
On 2013-12-03 19:55:40 -0300, Alvaro Herrera wrote:
I added a new isolation spec to test this specific case, and noticed
something that seems curious to me when that test is run in REPEATABLE
READ mode: when the UPDATE is aborted, the concurrent FOR UPDATE gets a
"can't serialize due to concurrent update", but when the UPDATE is
committed, FOR UPDATE works fine. Shouldn't it happen pretty much
exactly the other way around?
That's 247c76a989097f1b4ab6fae898f24e75aa27fc1b. Specifically the
DidCommit() branch in test_lockmode_for_conflict(). You forgot something
akin to
/* locker has finished, all good to go */
if (!ISUPDATE_from_mxstatus(status))
return HeapTupleMayBeUpdated;
So I did. Here are two patches, one to fix this issue, and the other to
fix the issue above. I intend to apply these two to 9.3 and master, and
then apply your freeze fix on top (which I'm cleaning up a bit -- will
resend later.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Avoid-resetting-Xmax-when-it-s-a-multi-with-an-abort.patch (text/x-diff; charset=us-ascii)
From 48b6f5eead73a41880dd311b7a55d2b88794ca3d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 4 Dec 2013 17:20:14 -0300
Subject: [PATCH 1/4] Avoid resetting Xmax when it's a multi with an aborted
update
HeapTupleSatisfiesUpdate can very easily "forget" tuple locks while
checking the contents of a multixact and finding it contains an aborted
update, by setting the HEAP_XMAX_INVALID bit. This would lead to
concurrent transactions not noticing any previous locks held by
transactions that might still be running, and thus being able to acquire
subsequent locks they wouldn't be normally able to acquire.
This bug was introduced in commit 1ce150b7bb; backpatch this fix to 9.3,
like that commit.
This change reverts the change to the delete-abort-savept isolation test
in 1ce150b7bb, because that behavior change was caused by this bug.
Noticed by Andres Freund while investigating a different issue reported
by Noah Misch.
---
src/backend/utils/time/tqual.c | 21 ++++++++++++++++----
.../isolation/expected/delete-abort-savept.out | 13 +++++-------
2 files changed, 22 insertions(+), 12 deletions(-)
diff --git a/src/backend/utils/time/tqual.c b/src/backend/utils/time/tqual.c
index 4d63b1c..f787f2c 100644
--- a/src/backend/utils/time/tqual.c
+++ b/src/backend/utils/time/tqual.c
@@ -789,13 +789,26 @@ HeapTupleSatisfiesUpdate(HeapTupleHeader tuple, CommandId curcid,
if (TransactionIdDidCommit(xmax))
return HeapTupleUpdated;
- /* no member, even just a locker, alive anymore */
+ /*
+ * By here, the update in the Xmax is either aborted or crashed, but
+ * what about the other members?
+ */
+
if (!MultiXactIdIsRunning(HeapTupleHeaderGetRawXmax(tuple)))
+ {
+ /*
+ * There's no member, even just a locker, alive anymore, so we can
+ * mark the Xmax as invalid.
+ */
SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
InvalidTransactionId);
-
- /* it must have aborted or crashed */
- return HeapTupleMayBeUpdated;
+ return HeapTupleMayBeUpdated;
+ }
+ else
+ {
+ /* There are lockers running */
+ return HeapTupleBeingUpdated;
+ }
}
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
diff --git a/src/test/isolation/expected/delete-abort-savept.out b/src/test/isolation/expected/delete-abort-savept.out
index 5b8c444..3420cf4 100644
--- a/src/test/isolation/expected/delete-abort-savept.out
+++ b/src/test/isolation/expected/delete-abort-savept.out
@@ -23,11 +23,12 @@ key value
step s1svp: SAVEPOINT f;
step s1d: DELETE FROM foo;
step s1r: ROLLBACK TO f;
-step s2l: SELECT * FROM foo FOR UPDATE;
+step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
+step s1c: COMMIT;
+step s2l: <... completed>
key value
1 1
-step s1c: COMMIT;
step s2c: COMMIT;
starting permutation: s1l s1svp s1d s1r s2l s2c s1c
@@ -38,12 +39,8 @@ key value
step s1svp: SAVEPOINT f;
step s1d: DELETE FROM foo;
step s1r: ROLLBACK TO f;
-step s2l: SELECT * FROM foo FOR UPDATE;
-key value
-
-1 1
-step s2c: COMMIT;
-step s1c: COMMIT;
+step s2l: SELECT * FROM foo FOR UPDATE; <waiting ...>
+invalid permutation detected
starting permutation: s1l s1svp s1d s2l s1r s1c s2c
step s1l: SELECT * FROM foo FOR KEY SHARE;
--
1.7.10.4
0002-Fix-improper-abort-during-update-chain-locking.patch (text/x-diff; charset=us-ascii)
From fbe641eeb7eb535c1cac422e3f5962928e3b0362 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 4 Dec 2013 17:26:07 -0300
Subject: [PATCH 2/4] Fix improper abort during update chain locking
In 247c76a98909, I added some code to do fine-grained checking of
MultiXact status of locking/updating transactions when traversing an
update chain. There was a thinko in that patch which would have the
traversing abort, that is return HeapTupleUpdated, when the other
transaction is a committed lock-only. In this case we should ignore it
and return success instead. Of course, in the case where there is a
committed update, HeapTupleUpdated is the correct return value.
A user-visible symptom of this bug is that in REPEATABLE READ and
SERIALIZABLE transaction isolation modes spurious serializability errors
can occur:
ERROR: could not serialize access due to concurrent update
In order for this to happen, there needs to be a tuple that's key-share-
locked and also updated, and the update must abort; a subsequent
transaction trying to acquire a new lock on that tuple would abort with
the above error. The reason is that the initial FOR KEY SHARE is seen
as committed by the new locking transaction, which triggers this bug.
The isolation test added by this commit illustrates the desired
behavior.
Backpatch to 9.3.
---
src/backend/access/heap/heapam.c | 19 ++-
.../isolation/expected/multixact-no-forget.out | 122 ++++++++++++++++++++
.../isolation/expected/multixact-no-forget_1.out | 118 +++++++++++++++++++
src/test/isolation/specs/multixact-no-forget.spec | 42 +++++++
4 files changed, 298 insertions(+), 3 deletions(-)
create mode 100644 src/test/isolation/expected/multixact-no-forget.out
create mode 100644 src/test/isolation/expected/multixact-no-forget_1.out
create mode 100644 src/test/isolation/specs/multixact-no-forget.spec
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 64174b6..1a0dd21 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -4842,10 +4842,23 @@ test_lockmode_for_conflict(MultiXactStatus status, TransactionId xid,
else if (TransactionIdDidCommit(xid))
{
/*
- * If the updating transaction committed, what we do depends on whether
- * the lock modes conflict: if they do, then we must report error to
- * caller. But if they don't, we can fall through to lock it.
+ * The other transaction committed. If it was only a locker, then the
+ * lock is completely gone now and we can return success; but if it
+ * was an update, then what we do depends on whether the two lock
+ * modes conflict. If they conflict, then we must report error to
+ * caller. But if they don't, we can fall through to allow the current
+ * transaction to lock the tuple.
+ *
+ * Note: the reason we worry about ISUPDATE here is because as soon as
+ * a transaction ends, all its locks are gone and meaningless, and
+ * thus we can ignore them; whereas its updates persist. In the
+ * TransactionIdIsInProgress case, above, we don't need to check
+ * because we know the lock is still "alive" and thus a conflict needs
+ * always be checked.
*/
+ if (!ISUPDATE_from_mxstatus(status))
+ return HeapTupleMayBeUpdated;
+
if (DoLockModesConflict(LOCKMODE_from_mxstatus(status),
LOCKMODE_from_mxstatus(wantedstatus)))
/* bummer */
diff --git a/src/test/isolation/expected/multixact-no-forget.out b/src/test/isolation/expected/multixact-no-forget.out
new file mode 100644
index 0000000..20f01a6
--- /dev/null
+++ b/src/test/isolation/expected/multixact-no-forget.out
@@ -0,0 +1,122 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_lock s2_update s2_abort s3_forkeyshr s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_abort: ROLLBACK;
+step s3_forkeyshr: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s1_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s2_commit s3_forkeyshr s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_commit: COMMIT;
+step s3_forkeyshr: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+2
+step s1_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s1_commit s3_forkeyshr s2_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s1_commit: COMMIT;
+step s3_forkeyshr: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s2_abort s3_fornokeyupd s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_abort: ROLLBACK;
+step s3_fornokeyupd: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+step s1_commit: COMMIT;
+step s3_fornokeyupd: <... completed>
+value
+
+1
+
+starting permutation: s1_lock s2_update s2_commit s3_fornokeyupd s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_commit: COMMIT;
+step s3_fornokeyupd: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+step s1_commit: COMMIT;
+step s3_fornokeyupd: <... completed>
+value
+
+2
+
+starting permutation: s1_lock s2_update s1_commit s3_fornokeyupd s2_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s1_commit: COMMIT;
+step s3_fornokeyupd: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+step s2_commit: COMMIT;
+step s3_fornokeyupd: <... completed>
+value
+
+2
+
+starting permutation: s1_lock s2_update s2_abort s3_forupd s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_abort: ROLLBACK;
+step s3_forupd: SELECT * FROM dont_forget FOR NO KEY UPDATE;
+value
+
+1
+step s1_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s2_commit s3_forupd s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_commit: COMMIT;
+step s3_forupd: SELECT * FROM dont_forget FOR NO KEY UPDATE;
+value
+
+2
+step s1_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s1_commit s3_forupd s2_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s1_commit: COMMIT;
+step s3_forupd: SELECT * FROM dont_forget FOR NO KEY UPDATE; <waiting ...>
+step s2_commit: COMMIT;
+step s3_forupd: <... completed>
+value
+
+2
diff --git a/src/test/isolation/expected/multixact-no-forget_1.out b/src/test/isolation/expected/multixact-no-forget_1.out
new file mode 100644
index 0000000..7c0adcf
--- /dev/null
+++ b/src/test/isolation/expected/multixact-no-forget_1.out
@@ -0,0 +1,118 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_lock s2_update s2_abort s3_forkeyshr s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_abort: ROLLBACK;
+step s3_forkeyshr: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s1_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s2_commit s3_forkeyshr s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_commit: COMMIT;
+step s3_forkeyshr: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+2
+step s1_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s1_commit s3_forkeyshr s2_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s1_commit: COMMIT;
+step s3_forkeyshr: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s2_abort s3_fornokeyupd s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_abort: ROLLBACK;
+step s3_fornokeyupd: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+step s1_commit: COMMIT;
+step s3_fornokeyupd: <... completed>
+value
+
+1
+
+starting permutation: s1_lock s2_update s2_commit s3_fornokeyupd s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_commit: COMMIT;
+step s3_fornokeyupd: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+step s1_commit: COMMIT;
+step s3_fornokeyupd: <... completed>
+value
+
+2
+
+starting permutation: s1_lock s2_update s1_commit s3_fornokeyupd s2_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s1_commit: COMMIT;
+step s3_fornokeyupd: SELECT * FROM dont_forget FOR UPDATE; <waiting ...>
+step s2_commit: COMMIT;
+step s3_fornokeyupd: <... completed>
+error in steps s2_commit s3_fornokeyupd: ERROR: could not serialize access due to concurrent update
+
+starting permutation: s1_lock s2_update s2_abort s3_forupd s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_abort: ROLLBACK;
+step s3_forupd: SELECT * FROM dont_forget FOR NO KEY UPDATE;
+value
+
+1
+step s1_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s2_commit s3_forupd s1_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s2_commit: COMMIT;
+step s3_forupd: SELECT * FROM dont_forget FOR NO KEY UPDATE;
+value
+
+2
+step s1_commit: COMMIT;
+
+starting permutation: s1_lock s2_update s1_commit s3_forupd s2_commit
+step s1_lock: SELECT * FROM dont_forget FOR KEY SHARE;
+value
+
+1
+step s2_update: UPDATE dont_forget SET value = 2;
+step s1_commit: COMMIT;
+step s3_forupd: SELECT * FROM dont_forget FOR NO KEY UPDATE; <waiting ...>
+step s2_commit: COMMIT;
+step s3_forupd: <... completed>
+error in steps s2_commit s3_forupd: ERROR: could not serialize access due to concurrent update
diff --git a/src/test/isolation/specs/multixact-no-forget.spec b/src/test/isolation/specs/multixact-no-forget.spec
new file mode 100644
index 0000000..c1cfb7c
--- /dev/null
+++ b/src/test/isolation/specs/multixact-no-forget.spec
@@ -0,0 +1,42 @@
+# If transaction A holds a lock, and transaction B does an update,
+# make sure we don't forget the lock if B aborts.
+setup
+{
+ CREATE TABLE dont_forget (
+ value int
+ );
+
+ INSERT INTO dont_forget VALUES (1);
+}
+
+teardown
+{
+ DROP TABLE dont_forget;
+}
+
+session "s1"
+setup { BEGIN; }
+step "s1_lock" { SELECT * FROM dont_forget FOR KEY SHARE; }
+step "s1_commit" { COMMIT; }
+
+session "s2"
+setup { BEGIN; }
+step "s2_update" { UPDATE dont_forget SET value = 2; }
+step "s2_abort" { ROLLBACK; }
+step "s2_commit" { COMMIT; }
+
+session "s3"
+# try cases with both a non-conflicting lock with s1's and a conflicting one
+step "s3_forkeyshr" { SELECT * FROM dont_forget FOR KEY SHARE; }
+step "s3_fornokeyupd" { SELECT * FROM dont_forget FOR UPDATE; }
+step "s3_forupd" { SELECT * FROM dont_forget FOR NO KEY UPDATE; }
+
+permutation "s1_lock" "s2_update" "s2_abort" "s3_forkeyshr" "s1_commit"
+permutation "s1_lock" "s2_update" "s2_commit" "s3_forkeyshr" "s1_commit"
+permutation "s1_lock" "s2_update" "s1_commit" "s3_forkeyshr" "s2_commit"
+permutation "s1_lock" "s2_update" "s2_abort" "s3_fornokeyupd" "s1_commit"
+permutation "s1_lock" "s2_update" "s2_commit" "s3_fornokeyupd" "s1_commit"
+permutation "s1_lock" "s2_update" "s1_commit" "s3_fornokeyupd" "s2_commit"
+permutation "s1_lock" "s2_update" "s2_abort" "s3_forupd" "s1_commit"
+permutation "s1_lock" "s2_update" "s2_commit" "s3_forupd" "s1_commit"
+permutation "s1_lock" "s2_update" "s1_commit" "s3_forupd" "s2_commit"
--
1.7.10.4
Hi,
On 2013-12-05 10:42:35 -0300, Alvaro Herrera wrote:
I intend to apply these two to 9.3 and master, and
then apply your freeze fix on top (which I'm cleaning up a bit -- will
resend later.)
I sure hope it gets cleaned up - it's an evening's hack, with a good
glass of wine on top. Do you agree with the general direction?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Here's a revamped version of this patch. One thing I didn't do here is
revert the exporting of CreateMultiXactId, but I don't see any way to
avoid that.
Andres mentioned the idea of sharing some code between
heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't
explored that.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
freezing-multis.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 5238,5251 **** heap_inplace_update(Relation relation, HeapTuple tuple)
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
/*
! * heap_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
! * are older than the specified cutoff XID. If so, replace them with
! * FrozenTransactionId or InvalidTransactionId as appropriate, and return
! * TRUE. Return FALSE if nothing was changed.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
--- 5238,5448 ----
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
+ #define FRM_NOOP 0x0001
+ #define FRM_INVALIDATE_XMAX 0x0002
+ #define FRM_RETURN_IS_XID 0x0004
+ #define FRM_RETURN_IS_MULTI 0x0008
+ #define FRM_MARK_COMMITTED 0x0010
/*
! * FreezeMultiXactId
! * Determine what to do during freezing when a tuple is marked by a
! * MultiXactId.
! *
! * "flags" is an output value; it's used to tell caller what to do on return.
! *
! * Possible flags are:
! * FRM_NOOP
! * don't do anything -- keep existing Xmax
! * FRM_INVALIDATE_XMAX
! * mark Xmax as InvalidTransactionId and set XMAX_INVALID flag.
! * FRM_RETURN_IS_XID
! * The Xid return value is a single update Xid to set as xmax.
! * FRM_MARK_COMMITTED
! * Xmax can be marked as HEAP_XMAX_COMMITTED
! * FRM_RETURN_IS_MULTI
! * The return value is a new MultiXactId to set as new Xmax.
! * (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
! */
! static TransactionId
! FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! uint16 *flags)
! {
! TransactionId xid = InvalidTransactionId;
! int i;
! MultiXactMember *members;
! int nmembers;
! bool need_replace;
! int nnewmembers;
! MultiXactMember *newmembers;
! bool has_lockers;
! TransactionId update_xid;
! bool update_committed;
!
! *flags = 0;
!
! if (!MultiXactIdIsValid(multi))
! {
! /* Ensure infomask bits are appropriately set/reset */
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }
! else if (MultiXactIdPrecedes(multi, cutoff_multi))
! {
! /*
! * This old multi cannot possibly have members still running. If it
! * was a locker only, it can be removed without any further
! * consideration; but if it contained an update, we might need to
! * preserve it.
! */
! if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
! {
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }
! else
! {
! /* replace multi by update xid */
! xid = MultiXactIdGetUpdateXid(multi, t_infomask);
!
! /* wasn't only a lock, xid needs to be valid */
! Assert(TransactionIdIsValid(xid));
!
! /*
! * If the xid is older than the cutoff, it has to have aborted,
! * otherwise the tuple would have gotten pruned away.
! */
! if (TransactionIdPrecedes(xid, cutoff_xid))
! {
! Assert(!TransactionIdDidCommit(xid));
! *flags |= FRM_INVALIDATE_XMAX;
! /* xid = InvalidTransactionId; */
! }
! else
! {
! *flags |= FRM_RETURN_IS_XID;
! }
! }
! }
!
! /*
! * This multixact might have or might not have members still running,
! * but we know it's valid and is newer than the cutoff point for
! * multis. However, some member(s) of it may be below the cutoff for
! * Xids, so we need to walk the whole members array to figure out what
! * to do, if anything.
! */
!
! nmembers = GetMultiXactIdMembers(multi, &members, false);
! if (nmembers <= 0)
! {
! /* Nothing worth keeping */
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }
!
! /* is there anything older than the cutoff? */
! need_replace = false;
! for (i = 0; i < nmembers; i++)
! {
! if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
! {
! need_replace = true;
! break;
! }
! }
!
! /*
! * In the simplest case, there is no member older than the cutoff; we can
! * keep the existing MultiXactId as is.
! */
! if (!need_replace)
! {
! *flags |= FRM_NOOP;
! pfree(members);
! return InvalidTransactionId;
! }
!
! /*
! * If the multi needs to be updated, figure out which members do we need
! * to keep.
! */
! nnewmembers = 0;
! newmembers = palloc(sizeof(MultiXactMember) * nmembers);
! has_lockers = false;
! update_xid = InvalidTransactionId;
! update_committed = false;
!
! for (i = 0; i < nmembers; i++)
! {
! if (ISUPDATE_from_mxstatus(members[i].status) &&
! !TransactionIdDidAbort(members[i].xid))
! {
! /* if it's an update, we must keep unless it aborted */
! newmembers[nnewmembers++] = members[i];
! Assert(!TransactionIdIsValid(update_xid));
! update_xid = members[i].xid;
! /* tell caller to set hint while we have the Xid in cache */
! if (TransactionIdDidCommit(update_xid))
! update_committed = true;
! }
!
! /* We only keep lockers if they are still running */
! if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
! TransactionIdIsInProgress(members[i].xid))
! {
! newmembers[nnewmembers++] = members[i];
! has_lockers = true;
! }
! }
!
! pfree(members);
!
! if (nnewmembers == 0)
! {
! /* nothing worth keeping!? Tell caller to remove the whole thing */
! *flags |= FRM_INVALIDATE_XMAX;
! xid = InvalidTransactionId;
! }
! else if (TransactionIdIsValid(update_xid) && !has_lockers)
! {
! /*
! * If there's a single member and it's an update, pass it back alone
! * without creating a new Multi. (XXX we could do this when there's a
! * single remaining locker, too, but that would complicate the API too
! * much; moreover, the case with the single updater is more
! * interesting, because those are longer-lived.)
! */
! Assert(nnewmembers == 1);
! *flags |= FRM_RETURN_IS_XID;
! if (update_committed)
! *flags |= FRM_MARK_COMMITTED;
! xid = update_xid;
! }
! else
! {
! /* Note this is WAL-logged */
! xid = CreateMultiXactId(nnewmembers, newmembers);
! *flags |= FRM_RETURN_IS_MULTI;
! }
!
! pfree(newmembers);
!
! return xid;
! }
!
! /*
! * heap_prepare_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
! * are older than the specified cutoff XID and cutoff MultiXactId. If so,
! * setup enough state (in the *frz output argument) to later execute and
! * WAL-log what we would need to do, and return TRUE. Return FALSE if nothing
! * is to be changed.
! *
! * Caller is responsible for setting the offset field, if appropriate. This
! * is only necessary if the freeze is to be WAL-logged.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
***************
*** 5254,5307 **** heap_inplace_update(Relation relation, HeapTuple tuple)
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
! * anyone's idea of the tuple state. Also, since we assume the tuple is
! * not HEAPTUPLE_DEAD, the fact that an XID is not still running allows us
! * to assume that it is either committed good or aborted, as appropriate;
! * so we need no external state checks to decide what to do. (This is good
! * because this function is applied during WAL recovery, when we don't have
! * access to any such state, and can't depend on the hint bits to be set.)
! * There is an exception we make which is to assume GetMultiXactIdMembers can
! * be called during recovery.
! *
* Similarly, cutoff_multi must be less than or equal to the smallest
* MultiXactId used by any transaction currently open.
*
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
! * Note: it might seem we could make the changes without exclusive lock, since
! * TransactionId read/write is assumed atomic anyway. However there is a race
! * condition: someone who just fetched an old XID that we overwrite here could
! * conceivably not finish checking the XID against pg_clog before we finish
! * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
! * exclusive lock ensures no other backend is in process of checking the
! * tuple status. Also, getting exclusive lock makes it safe to adjust the
! * infomask bits.
! *
! * NB: Cannot rely on hint bits here, they might not be set after a crash or
! * on a standby.
*/
bool
! heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
! MultiXactId cutoff_multi)
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, cutoff_xid))
{
! HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
!
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED will
* already be set here, but there's a small chance not.
*/
! Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! tuple->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
--- 5451,5492 ----
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
! * anyone's idea of the tuple state.
* Similarly, cutoff_multi must be less than or equal to the smallest
* MultiXactId used by any transaction currently open.
*
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
! * NB: It is not enough to set hint bits to indicate something is
! * committed/invalid -- they might not be set on a standby, or after crash
! * recovery. We really need to remove old xids.
*/
bool
! heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
! TransactionId cutoff_multi, xl_heap_freeze_tuple *frz)
!
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
+ frz->frzflags = 0;
+ frz->t_infomask2 = tuple->t_infomask2;
+ frz->t_infomask = tuple->t_infomask;
+ frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
+
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, cutoff_xid))
{
! frz->frzflags |= XLH_FREEZE_XMIN;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED will
* already be set here, but there's a small chance not.
*/
! frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
***************
*** 5318,5408 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
! if (!MultiXactIdIsValid(xid))
! {
! /* no xmax set, ignore */
! ;
! }
! else if (MultiXactIdPrecedes(xid, cutoff_multi))
! {
! /*
! * This old multi cannot possibly be running. If it was a locker
! * only, it can be removed without much further thought; but if it
! * contained an update, we need to preserve it.
! */
! if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
! freeze_xmax = true;
! else
! {
! TransactionId update_xid;
! update_xid = HeapTupleGetUpdateXid(tuple);
! /*
! * The multixact has an update hidden within. Get rid of it.
! *
! * If the update_xid is below the cutoff_xid, it necessarily
! * must be an aborted transaction. In a primary server, such
! * an Xmax would have gotten marked invalid by
! * HeapTupleSatisfiesVacuum, but in a replica that is not
! * called before we are, so deal with it in the same way.
! *
! * If not below the cutoff_xid, then the tuple would have been
! * pruned by vacuum, if the update committed long enough ago,
! * and we wouldn't be freezing it; so it's either recently
! * committed, or in-progress. Deal with this by setting the
! * Xmax to the update Xid directly and remove the IS_MULTI
! * bit. (We know there cannot be running lockers in this
! * multi, because it's below the cutoff_multi value.)
! */
!
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! {
! Assert(InRecovery || TransactionIdDidAbort(update_xid));
! freeze_xmax = true;
! }
! else
! {
! Assert(InRecovery || !TransactionIdIsInProgress(update_xid));
! tuple->t_infomask &= ~HEAP_XMAX_BITS;
! HeapTupleHeaderSetXmax(tuple, update_xid);
! changed = true;
! }
! }
! }
! else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
{
! /* newer than the cutoff, so don't touch it */
;
}
! else
{
! TransactionId update_xid;
! /*
! * This is a multixact which is not marked LOCK_ONLY, but which
! * is newer than the cutoff_multi. If the update_xid is below the
! * cutoff_xid point, then we can just freeze the Xmax in the
! * tuple, removing it altogether. This seems simple, but there
! * are several underlying assumptions:
! *
! * 1. A tuple marked by an multixact containing a very old
! * committed update Xid would have been pruned away by vacuum; we
! * wouldn't be freezing this tuple at all.
! *
! * 2. There cannot possibly be any live locking members remaining
! * in the multixact. This is because if they were alive, the
! * update's Xid would had been considered, via the lockers'
! * snapshot's Xmin, as part the cutoff_xid.
! *
! * 3. We don't create new MultiXacts via MultiXactIdExpand() that
! * include a very old aborted update Xid: in that function we only
! * include update Xids corresponding to transactions that are
! * committed or in-progress.
! */
! update_xid = HeapTupleGetUpdateXid(tuple);
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! freeze_xmax = true;
}
}
else if (TransactionIdIsNormal(xid) &&
--- 5503,5536 ----
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
! TransactionId newxmax;
! uint16 flags;
! newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
! cutoff_xid, cutoff_multi, &flags);
! if (flags & FRM_NOOP)
{
! /* nothing to do in this case */
;
}
! if (flags & FRM_INVALIDATE_XMAX)
! freeze_xmax = true;
! else if (flags & FRM_RETURN_IS_XID)
{
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->xmax = newxmax;
! if (flags & FRM_MARK_COMMITTED)
! 							frz->t_infomask |= HEAP_XMAX_COMMITTED;
! }
! else if (flags & FRM_RETURN_IS_MULTI)
! {
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->xmax = newxmax;
! GetMultiXactIdHintBits(newxmax,
! &frz->t_infomask,
! &frz->t_infomask2);
}
}
else if (TransactionIdIsNormal(xid) &&
***************
*** 5413,5429 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (freeze_xmax)
{
! HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
/*
* The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
* LOCKED. Normalize to INVALID just to be sure no one gets confused.
* Also get rid of the HEAP_KEYS_UPDATED bit.
*/
! tuple->t_infomask &= ~HEAP_XMAX_BITS;
! tuple->t_infomask |= HEAP_XMAX_INVALID;
! HeapTupleHeaderClearHotUpdated(tuple);
! tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}
--- 5541,5557 ----
if (freeze_xmax)
{
! frz->xmax = InvalidTransactionId;
/*
* The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
* LOCKED. Normalize to INVALID just to be sure no one gets confused.
* Also get rid of the HEAP_KEYS_UPDATED bit.
*/
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->t_infomask |= HEAP_XMAX_INVALID;
! frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
! frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}
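As an aside for readers following the freeze_xmax branch above: the clear-then-set idiom (mask out every xmax-related bit, then set only the INVALID bit) can be sketched in isolation. The bit values below are invented for illustration; the real constants live in htup_details.h:

```c
#include <stdint.h>

/* Invented bit values for illustration; the real ones are in htup_details.h. */
#define X_XMAX_COMMITTED	0x0400
#define X_XMAX_INVALID		0x0800
#define X_XMAX_IS_MULTI		0x1000
#define X_XMAX_BITS			(X_XMAX_COMMITTED | X_XMAX_INVALID | X_XMAX_IS_MULTI)

/*
 * Normalize the infomask of a tuple whose xmax is being frozen: first clear
 * all xmax-related bits, then set only the INVALID bit.  Clearing first is
 * what guarantees that no stale combination (e.g. COMMITTED + LOCKED)
 * survives the freeze.
 */
static uint16_t
normalize_frozen_xmax(uint16_t infomask)
{
	infomask &= ~X_XMAX_BITS;
	infomask |= X_XMAX_INVALID;
	return infomask;
}
```

Unrelated bits outside X_XMAX_BITS pass through untouched, which is why the order (clear, then set) matters.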
***************
*** 5443,5458 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
! HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
else
! HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED
* will already be set here, but there's a small chance not.
*/
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! tuple->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
}
--- 5571,5586 ----
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
! frz->frzflags |= XLH_FREEZE_XVAC;
else
! frz->frzflags |= XLH_INVALID_XVAC;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED
* will already be set here, but there's a small chance not.
*/
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
}
***************
*** 5461,5466 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
--- 5589,5656 ----
}
/*
+ * heap_execute_freeze_tuple
+ * Execute the prepared freezing of a tuple.
+ *
+ * Caller is responsible for ensuring that no other backend can access the
+ * storage underlying this tuple, either by holding an exclusive lock on the
+ * buffer containing it (which is what lazy VACUUM does), or by having it
+ * in private storage (which is what CLUSTER and friends do).
+ *
+ * Note: it might seem we could make the changes without exclusive lock, since
+ * TransactionId read/write is assumed atomic anyway. However there is a race
+ * condition: someone who just fetched an old XID that we overwrite here could
+ * conceivably not finish checking the XID against pg_clog before we finish
+ * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
+ * exclusive lock ensures no other backend is in process of checking the
+ * tuple status. Also, getting exclusive lock makes it safe to adjust the
+ * infomask bits.
+ *
+ * NB: All code in here must be safe to execute during crash recovery!
+ */
+ void
+ heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+ {
+ if (frz->frzflags & XLH_FREEZE_XMIN)
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ HeapTupleHeaderSetXmax(tuple, frz->xmax);
+
+ if (frz->frzflags & XLH_FREEZE_XVAC)
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ if (frz->frzflags & XLH_INVALID_XVAC)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+
+ tuple->t_infomask = frz->t_infomask;
+ tuple->t_infomask2 = frz->t_infomask2;
+ }
+
+ /*
+ * heap_freeze_tuple - freeze tuple inplace without WAL logging.
+ *
+ * Useful for callers like CLUSTER that perform their own WAL logging.
+ */
+ bool
+ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ TransactionId cutoff_multi)
+ {
+ xl_heap_freeze_tuple frz;
+ bool do_freeze;
+
+ do_freeze = heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi, &frz);
+
+ /*
+ * Note that because this is not a WAL-logged operation, we don't need
+ * to fill in the offset in the freeze record.
+ */
+
+ if (do_freeze)
+ heap_execute_freeze_tuple(tuple, &frz);
+ return do_freeze;
+ }
+
+ /*
* For a given MultiXactId, return the hint bits that should be set in the
* tuple's infomask.
*
***************
*** 5763,5778 **** heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
}
else if (MultiXactIdPrecedes(multi, cutoff_multi))
return true;
- else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
- {
- /* only-locker multis don't need internal examination */
- ;
- }
else
{
! if (TransactionIdPrecedes(HeapTupleGetUpdateXid(tuple),
! cutoff_xid))
! return true;
}
}
else
--- 5953,5978 ----
}
else if (MultiXactIdPrecedes(multi, cutoff_multi))
return true;
else
{
! MultiXactMember *members;
! int nmembers;
! int i;
!
! /* need to check whether any member of the mxact is too old */
!
! nmembers = GetMultiXactIdMembers(multi, &members, false);
!
! for (i = 0; i < nmembers; i++)
! {
! if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
! {
! pfree(members);
! return true;
! }
! }
! if (nmembers > 0)
! pfree(members);
}
}
else
***************
*** 6022,6048 **** log_heap_clean(Relation reln, Buffer buffer,
}
/*
! * Perform XLogInsert for a heap-freeze operation. Caller must already
! * have modified the buffer and marked it dirty.
*/
XLogRecPtr
! log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! OffsetNumber *offsets, int offcnt)
{
! xl_heap_freeze xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
/* nor when there are no tuples to freeze */
! Assert(offcnt > 0);
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.cutoff_xid = cutoff_xid;
! xlrec.cutoff_multi = cutoff_multi;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapFreeze;
--- 6222,6247 ----
}
/*
! * Perform XLogInsert for a heap-freeze operation. Caller must have already
! * modified the buffer and marked it dirty.
*/
XLogRecPtr
! log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
! xl_heap_freeze_tuple *tuples, int ntuples)
{
! xl_heap_freeze_page xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
/* nor when there are no tuples to freeze */
! Assert(ntuples > 0);
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.cutoff_xid = cutoff_xid;
! xlrec.ntuples = ntuples;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapFreeze;
***************
*** 6050,6066 **** log_heap_freeze(Relation reln, Buffer buffer,
rdata[0].next = &(rdata[1]);
/*
! * The tuple-offsets array is not actually in the buffer, but pretend that
! * it is. When XLogInsert stores the whole buffer, the offsets array need
* not be stored too.
*/
! rdata[1].data = (char *) offsets;
! rdata[1].len = offcnt * sizeof(OffsetNumber);
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE, rdata);
return recptr;
}
--- 6249,6265 ----
rdata[0].next = &(rdata[1]);
/*
! * The freeze plan array is not actually in the buffer, but pretend that
! * it is. When XLogInsert stores the whole buffer, the freeze plan need
* not be stored too.
*/
! rdata[1].data = (char *) tuples;
! rdata[1].len = ntuples * SizeOfHeapFreezeTuple;
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE, rdata);
return recptr;
}
***************
*** 6402,6407 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
--- 6601,6699 ----
XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
}
+ /*
+ * Freeze a single tuple for XLOG_HEAP2_FREEZE
+ *
+ * NB: This type of record isn't generated anymore, since bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * is kept around to be able to perform PITR.
+ */
+ static bool
+ heap_xlog_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi)
+ {
+ bool changed = false;
+ TransactionId xid;
+
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ /*
+ * Might as well fix the hint bits too; usually XMIN_COMMITTED will
+ * already be set here, but there's a small chance not.
+ */
+ Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+ tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ changed = true;
+ }
+
+ /*
+ * Note that this code handles IS_MULTI Xmax values, too, but only to mark
+ * the tuple as not updated if the multixact is below the cutoff Multixact
+ * given; it doesn't remove dead members of a very old multixact.
+ */
+ xid = HeapTupleHeaderGetRawXmax(tuple);
+ if ((tuple->t_infomask & HEAP_XMAX_IS_MULTI) ?
+ (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, cutoff_multi)) :
+ (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid)))
+ {
+ HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
+
+ /*
+ * The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
+ * LOCKED. Normalize to INVALID just to be sure no one gets confused.
+ * Also get rid of the HEAP_KEYS_UPDATED bit.
+ */
+ tuple->t_infomask &= ~HEAP_XMAX_BITS;
+ tuple->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderClearHotUpdated(tuple);
+ tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ changed = true;
+ }
+
+ /*
+ * Old-style VACUUM FULL is gone, but we have to keep this code as long as
+ * we support having MOVED_OFF/MOVED_IN tuples in the database.
+ */
+ if (tuple->t_infomask & HEAP_MOVED)
+ {
+ xid = HeapTupleHeaderGetXvac(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ /*
+ * If a MOVED_OFF tuple is not dead, the xvac transaction must
+ * have failed; whereas a non-dead MOVED_IN tuple must mean the
+ * xvac transaction succeeded.
+ */
+ if (tuple->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ /*
+ * Might as well fix the hint bits too; usually XMIN_COMMITTED
+ * will already be set here, but there's a small chance not.
+ */
+ Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+ tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ changed = true;
+ }
+ }
+
+ return changed;
+ }
+
+ /*
+ * NB: This type of record isn't generated anymore, since bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * is kept around to be able to perform PITR.
+ */
static void
heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
{
***************
*** 6450,6456 **** heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
ItemId lp = PageGetItemId(page, *offsets);
HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
! (void) heap_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
offsets++;
}
}
--- 6742,6748 ----
ItemId lp = PageGetItemId(page, *offsets);
HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
! (void) heap_xlog_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
offsets++;
}
}
***************
*** 6574,6579 **** heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
--- 6866,6928 ----
}
}
+ /*
+ * Replay XLOG_HEAP2_FREEZE_PAGE records
+ */
+ static void
+ heap_xlog_freeze_page(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
+ TransactionId cutoff_xid = xlrec->cutoff_xid;
+ Buffer buffer;
+ Page page;
+ int ntup;
+
+ /*
+ * In Hot Standby mode, ensure that there are no queries running which still
+ * consider the frozen xids as running.
+ */
+ if (InHotStandby)
+ ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ /* now execute freeze plan for each frozen tuple */
+ for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ {
+ xl_heap_freeze_tuple *xlrec_tp;
+ ItemId lp;
+ HeapTupleHeader tuple;
+
+ xlrec_tp = &xlrec->tuples[ntup];
+ lp = PageGetItemId(page, xlrec_tp->offset); /* offsets are one-based */
+ tuple = (HeapTupleHeader) PageGetItem(page, lp);
+
+ heap_execute_freeze_tuple(tuple, xlrec_tp);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
static void
heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
{
***************
*** 7429,7434 **** heap2_redo(XLogRecPtr lsn, XLogRecord *record)
--- 7778,7786 ----
case XLOG_HEAP2_CLEAN:
heap_xlog_clean(lsn, record);
break;
+ case XLOG_HEAP2_FREEZE_PAGE:
+ heap_xlog_freeze_page(lsn, record);
+ break;
case XLOG_HEAP2_CLEANUP_INFO:
heap_xlog_cleanup_info(lsn, record);
break;
*** a/src/backend/access/rmgrdesc/heapdesc.c
--- b/src/backend/access/rmgrdesc/heapdesc.c
***************
*** 149,154 **** heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 149,163 ----
xlrec->node.relNode, xlrec->block,
xlrec->latestRemovedXid);
}
+ if (info == XLOG_HEAP2_FREEZE_PAGE)
+ {
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
+
+ appendStringInfo(buf, "freeze_page: rel %u/%u/%u; blk %u; cutoff xid %u ntuples %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block,
+ xlrec->cutoff_xid, xlrec->ntuples);
+ }
else if (info == XLOG_HEAP2_CLEANUP_INFO)
{
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
*** a/src/backend/access/rmgrdesc/mxactdesc.c
--- b/src/backend/access/rmgrdesc/mxactdesc.c
***************
*** 41,47 **** out_member(StringInfo buf, MultiXactMember *member)
appendStringInfoString(buf, "(upd) ");
break;
default:
! appendStringInfoString(buf, "(unk) ");
break;
}
}
--- 41,47 ----
appendStringInfoString(buf, "(upd) ");
break;
default:
! appendStringInfo(buf, "(unk) ", member->status);
break;
}
}
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
***************
*** 286,292 **** static MemoryContext MXactContext = NULL;
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
- static MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int nmembers, MultiXactMember *members);
static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
--- 286,291 ----
***************
*** 672,678 **** ReadNextMultiXactId(void)
*
* NB: the passed members[] array will be sorted in-place.
*/
! static MultiXactId
CreateMultiXactId(int nmembers, MultiXactMember *members)
{
MultiXactId multi;
--- 671,677 ----
*
* NB: the passed members[] array will be sorted in-place.
*/
! MultiXactId
CreateMultiXactId(int nmembers, MultiXactMember *members)
{
MultiXactId multi;
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 424,429 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 424,430 ----
Buffer vmbuffer = InvalidBuffer;
BlockNumber next_not_all_visible_block;
bool skipping_all_visible_blocks;
+ xl_heap_freeze_tuple *frozen;
pg_rusage_init(&ru0);
***************
*** 446,451 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 447,453 ----
vacrelstats->latestRemovedXid = InvalidTransactionId;
lazy_space_alloc(vacrelstats, nblocks);
+ frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
/*
* We want to skip pages that don't require vacuuming according to the
***************
*** 500,506 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- OffsetNumber frozen[MaxOffsetNumber];
int nfrozen;
Size freespace;
bool all_visible_according_to_vm;
--- 502,507 ----
***************
*** 893,901 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
*/
! if (heap_freeze_tuple(tuple.t_data, FreezeLimit,
! MultiXactCutoff))
! frozen[nfrozen++] = offnum;
}
} /* scan along page */
--- 894,902 ----
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
*/
! if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
! MultiXactCutoff, &frozen[nfrozen]))
! frozen[nfrozen++].offset = offnum;
}
} /* scan along page */
***************
*** 906,920 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
*/
if (nfrozen > 0)
{
MarkBufferDirty(buf);
if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
recptr = log_heap_freeze(onerel, buf, FreezeLimit,
! MultiXactCutoff, frozen, nfrozen);
PageSetLSN(page, recptr);
}
}
/*
--- 907,939 ----
*/
if (nfrozen > 0)
{
+ START_CRIT_SECTION();
+
MarkBufferDirty(buf);
+
+ /* execute collected freezes */
+ for (i = 0; i < nfrozen; i++)
+ {
+ ItemId itemid;
+ HeapTupleHeader htup;
+
+ itemid = PageGetItemId(page, frozen[i].offset);
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ heap_execute_freeze_tuple(htup, &frozen[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
recptr = log_heap_freeze(onerel, buf, FreezeLimit,
! frozen, nfrozen);
PageSetLSN(page, recptr);
}
+
+ END_CRIT_SECTION();
}
/*
***************
*** 1015,1020 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 1034,1041 ----
RecordPageWithFreeSpace(onerel, blkno, freespace);
}
+ pfree(frozen);
+
/* save stats for use later */
vacrelstats->scanned_tuples = num_tuples;
vacrelstats->tuples_deleted = tups_vacuumed;
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 50,56 ****
*/
#define XLOG_HEAP2_FREEZE 0x00
#define XLOG_HEAP2_CLEAN 0x10
! /* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
--- 50,56 ----
*/
#define XLOG_HEAP2_FREEZE 0x00
#define XLOG_HEAP2_CLEAN 0x10
! #define XLOG_HEAP2_FREEZE_PAGE 0x20
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
***************
*** 239,245 **** typedef struct xl_heap_inplace
#define SizeOfHeapInplace (offsetof(xl_heap_inplace, target) + SizeOfHeapTid)
! /* This is what we need to know about tuple freezing during vacuum */
typedef struct xl_heap_freeze
{
RelFileNode node;
--- 239,245 ----
#define SizeOfHeapInplace (offsetof(xl_heap_inplace, target) + SizeOfHeapTid)
! /* This is what we need to know about tuple freezing during vacuum (legacy) */
typedef struct xl_heap_freeze
{
RelFileNode node;
***************
*** 251,256 **** typedef struct xl_heap_freeze
--- 251,289 ----
#define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_multi) + sizeof(MultiXactId))
+ /*
+ * a 'freeze plan' struct that represents what we need to know about a single
+ * tuple being frozen during vacuum
+ */
+ #define XLH_FREEZE_XMIN 0x01
+ #define XLH_FREEZE_XVAC 0x02
+ #define XLH_INVALID_XVAC 0x04
+
+ typedef struct xl_heap_freeze_tuple
+ {
+ TransactionId xmax;
+ OffsetNumber offset;
+ uint16 t_infomask2;
+ uint16 t_infomask;
+ uint8 frzflags;
+ } xl_heap_freeze_tuple;
+
+ /* XXX we could define size as offsetof(struct, frzflags) and save some
+ * padding, but then the array below wouldn't work properly ... */
+ #define SizeOfHeapFreezeTuple sizeof(xl_heap_freeze_tuple)
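The padding trade-off that the XXX comment refers to can be seen with a toy struct of the same field shape (the name and the fixed-width stdint types are mine; exact sizes depend on the ABI):

```c
#include <stddef.h>
#include <stdint.h>

/* Same field shape as xl_heap_freeze_tuple in the patch (toy copy). */
typedef struct
{
	uint32_t	xmax;			/* TransactionId */
	uint16_t	offset;			/* OffsetNumber */
	uint16_t	t_infomask2;
	uint16_t	t_infomask;
	uint8_t		frzflags;
} toy_freeze_tuple;
```

On a typical ABI, sizeof(toy_freeze_tuple) is 12 while offsetof(frzflags) + 1 is 11: the trailing byte is alignment padding. Defining the record size via offsetof would shave that byte off on disk, but then an array of these structs would no longer stride the same way in memory and in the WAL record, which is what the comment warns about.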
+
+ /*
+ * This is what we need to know about a block being frozen during vacuum
+ */
+ typedef struct xl_heap_freeze_page
+ {
+ RelFileNode node;
+ BlockNumber block;
+ TransactionId cutoff_xid;
+ uint16 ntuples;
+ xl_heap_freeze_tuple tuples[FLEXIBLE_ARRAY_MEMBER];
+ } xl_heap_freeze_page;
+
/* This is what we need to know about setting a visibility map bit */
typedef struct xl_heap_visible
{
***************
*** 277,284 **** extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! OffsetNumber *offsets, int offcnt);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid);
extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
--- 310,321 ----
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples, int ntuples);
! extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
! TransactionId cutoff_xid, TransactionId cutoff_multi,
! xl_heap_freeze_tuple *frz);
! extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
! xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid);
extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
*** a/src/include/access/multixact.h
--- b/src/include/access/multixact.h
***************
*** 81,86 **** extern MultiXactId MultiXactIdCreate(TransactionId xid1,
--- 81,87 ----
MultiXactStatus status2);
extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
MultiXactStatus status);
+ extern MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
extern MultiXactId ReadNextMultiXactId(void);
extern bool MultiXactIdIsRunning(MultiXactId multi);
extern void MultiXactIdSetOldestMember(void);
On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote:
Here's a revamped version of this patch. One thing I didn't do here is
revert the exporting of CreateMultiXactId, but I don't see any way to
avoid that.
I don't so much have a problem with exporting CreateMultiXactId(), just
with exporting it under its current name. It's already quirky to have
both MultiXactIdCreate and CreateMultiXactId() in multixact.c but
exporting it imo goes too far.
Andres mentioned the idea of sharing some code between
heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't
explored that.
My idea would just be to have heap_tuple_needs_freeze() call
heap_prepare_freeze_tuple() and check whether it returns true. Yes,
that's slightly more expensive than the current
heap_tuple_needs_freeze(), but it's only called when we couldn't get a
cleanup lock on a page, so that seems ok.
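That suggestion amounts to a thin wrapper. A toy sketch with simplified stand-in types and invented names (the real functions operate on HeapTupleHeader and take a multixact cutoff as well):

```c
#include <stdbool.h>

/* Simplified stand-ins for the real heap structures (illustration only). */
typedef struct
{
	unsigned	xmin;
	unsigned	xmax;
} FakeTupleHeader;

typedef struct
{
	unsigned	xmax;
	unsigned	frzflags;
} FakeFreezePlan;

/* Toy "prepare" step: reports whether anything is older than the cutoff. */
static bool
fake_prepare_freeze_tuple(FakeTupleHeader *tup, unsigned cutoff_xid,
						  FakeFreezePlan *frz)
{
	bool		changed = false;

	if (tup->xmin != 0 && tup->xmin < cutoff_xid)
	{
		frz->frzflags |= 1;		/* pretend this means "freeze xmin" */
		changed = true;
	}
	if (tup->xmax != 0 && tup->xmax < cutoff_xid)
	{
		frz->xmax = 0;
		changed = true;
	}
	return changed;
}

/*
 * The proposed needs_freeze: call prepare, throw the plan away, and keep
 * only the boolean answer.
 */
static bool
fake_tuple_needs_freeze(FakeTupleHeader *tup, unsigned cutoff_xid)
{
	FakeFreezePlan frz = {0, 0};

	return fake_prepare_freeze_tuple(tup, cutoff_xid, &frz);
}
```

The cost of the discarded plan is the "slightly more expensive" part mentioned above; the benefit is a single code path deciding what needs freezing.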
! static TransactionId
! FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! uint16 *flags)
! {
! if (!MultiXactIdIsValid(multi))
! {
! /* Ensure infomask bits are appropriately set/reset */
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }
Maybe comment that we're sure to be only called when IS_MULTI is set,
which will be unset by FRM_INVALIDATE_XMAX? I wondered twice whether we
wouldn't just continually mark the buffer dirty this way.
! else if (MultiXactIdPrecedes(multi, cutoff_multi))
! {
! /*
! * This old multi cannot possibly have members still running. If it
! * was a locker only, it can be removed without any further
! * consideration; but if it contained an update, we might need to
! * preserve it.
! */
! if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
! {
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
Can you place an Assert(!MultiXactIdIsRunning(multi)) here?
! if (ISUPDATE_from_mxstatus(members[i].status) &&
! !TransactionIdDidAbort(members[i].xid))
It makes me wary to see a DidAbort() without a previous InProgress()
call.
Also, after we crashed, doesn't DidAbort() possibly return false for
transactions that were in progress before we crashed? At least that's
how I always understood it, and how tqual.c is written.
! /* We only keep lockers if they are still running */
! if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
I don't think there's a need to check for
TransactionIdIsCurrentTransactionId() - vacuum can explicitly *not* be
run inside a transaction.
***************
*** 5443,5458 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
! HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
else
!     HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
      /*
       * Might as well fix the hint bits too; usually XMIN_COMMITTED
       * will already be set here, but there's a small chance not.
       */
      Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
!     tuple->t_infomask |= HEAP_XMIN_COMMITTED;
      changed = true;
    }
  }
--- 5571,5586 ----
     * xvac transaction succeeded.
     */
    if (tuple->t_infomask & HEAP_MOVED_OFF)
!     frz->frzflags |= XLH_FREEZE_XVAC;
    else
!     frz->frzflags |= XLH_INVALID_XVAC;
Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and
XLH_INVALID_XVAC exactly the other way round? I really don't understand
the moved in/off, since the code has been gone longer than I've followed
the code...
*** a/src/backend/access/rmgrdesc/mxactdesc.c
--- b/src/backend/access/rmgrdesc/mxactdesc.c
***************
*** 41,47 **** out_member(StringInfo buf, MultiXactMember *member)
  			appendStringInfoString(buf, "(upd) ");
  			break;
  		default:
! 			appendStringInfoString(buf, "(unk) ");
  			break;
  	}
  }
--- 41,47 ----
  			appendStringInfoString(buf, "(upd) ");
  			break;
  		default:
! 			appendStringInfo(buf, "(unk) ", member->status);
  			break;
  	}
  }
That change doesn't look like it will do anything?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund wrote:
On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote:
Here's a revamped version of this patch. One thing I didn't do here is
revert the exporting of CreateMultiXactId, but I don't see any way to
avoid that.

I don't so much have a problem with exporting CreateMultiXactId(), just
with exporting it under its current name. It's already quirky to have
both MultiXactIdCreate and CreateMultiXactId() in multixact.c but
exporting it imo goes too far.
MultiXactidCreateFromMembers(int, MultiXactMembers *) ?
Andres mentioned the idea of sharing some code between
heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't
explored that.

My idea would just be to have heap_tuple_needs_freeze() call
heap_prepare_freeze_tuple() and check whether it returns true. Yes,
that's slightly more expensive than the current
heap_tuple_needs_freeze(), but it's only called when we couldn't get a
cleanup lock on a page, so that seems ok.
Doesn't seem a completely bad idea, but let's leave it for a separate
patch. This should be changed in master only IMV anyway, while the rest
of this patch is to be backpatched to 9.3.
! if (!MultiXactIdIsValid(multi))
! {
! /* Ensure infomask bits are appropriately set/reset */
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }

Maybe comment that we're sure to be only called when IS_MULTI is set,
which will be unset by FRM_INVALIDATE_XMAX? I wondered twice whether we
wouldn't just continually mark the buffer dirty this way.
Done.
! else if (MultiXactIdPrecedes(multi, cutoff_multi))
! {
! /*
! * This old multi cannot possibly have members still running. If it
Can you place an Assert(!MultiXactIdIsRunning(multi)) here?
Done.
! if (ISUPDATE_from_mxstatus(members[i].status) &&
! !TransactionIdDidAbort(members[i].xid))

It makes me wary to see a DidAbort() without a previous InProgress()
call. Also, after we crashed, doesn't DidAbort() possibly return
false for transactions that were in progress before we crashed? At
least that's how I always understood it, and how tqual.c is written.
Yes, that's correct. But note that here we're not doing a tuple
liveliness test, which is what tqual.c is doing. What we do with this
info is to keep the Xid as part of the multi if it's still running or
committed. We also keep it if the xact crashed, which is fine because
the Xid will be removed by some later step. If we know for certain that
the update Xid is aborted, then we can ignore it, but this is just an
optimization and not needed for correctness.
That loop had a bug, so I restructured it. (If the update xact had
aborted we wouldn't get to the "continue" and thus would treat it as a
locker-only. I don't think that behavior would cause any visible
misbehavior but it's wrong nonetheless.)
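The corrected classification can be modeled as a small predicate per member (a toy model with invented names, not the real FreezeMultiXactId loop):

```c
#include <stdbool.h>

typedef enum { M_LOCK, M_UPDATE } MemberKind;
typedef enum { S_RUNNING, S_COMMITTED, S_ABORTED } MemberState;

typedef struct
{
	MemberKind	kind;
	MemberState state;
} ToyMember;

/*
 * Which members survive freezing: an updater is kept unless it is known to
 * have aborted; a locker is kept only while it is still running.  The bug
 * described above amounted to sometimes classifying an updater as a locker.
 */
static bool
keep_member(ToyMember m)
{
	if (m.kind == M_UPDATE)
		return m.state != S_ABORTED;
	return m.state == S_RUNNING;
}
```

Under this rule a committed updater is preserved while a committed (i.e. finished) locker is dropped, which is the asymmetry the restructured loop enforces.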
One interesting bit is that we might end up creating singleton
MultiXactIds when freezing, if there's no updater and there's a running
locker. We could avoid this (i.e. mark the tuple as locked by a single
Xid) but it would complicate FreezeMultiXactId's API and it's unlikely
to occur with any frequency anyway.
! /* We only keep lockers if they are still running */
! if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
I don't think there's a need to check for
TransactionIdIsCurrentTransactionId() - vacuum can explicitly *not* be
run inside a transaction.
Keep in mind that freezing can also happen for tuples handled during a
table-rewrite operation such as CLUSTER. I wouldn't place a bet that
you can't have a multi created by a transaction and later run cluster in
the same table in the same transaction. Maybe this is fine because of
the fact that at that point we're holding an exclusive lock in the
table, but it seems fragile. And the test is cheap anyway.
--- 5571,5586 ----
  			 * xvac transaction succeeded.
  			 */
  			if (tuple->t_infomask & HEAP_MOVED_OFF)
! 				frz->frzflags |= XLH_FREEZE_XVAC;
  			else
! 				frz->frzflags |= XLH_INVALID_XVAC;

Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and
XLH_INVALID_XVAC exactly the other way round? I really don't understand
the moved in/off, since the code has been gone longer than I've followed
the code...
Yep, fixed.
--- b/src/backend/access/rmgrdesc/mxactdesc.c
***************
*** 41,47 **** out_member(StringInfo buf, MultiXactMember *member)
  			appendStringInfoString(buf, "(upd) ");
  			break;
  		default:
! 			appendStringInfoString(buf, "(unk) ");
  			break;
  	}
  }
--- 41,47 ----
  			appendStringInfoString(buf, "(upd) ");
  			break;
  		default:
! 			appendStringInfo(buf, "(unk) ", member->status);
  			break;
  	}
  }

That change doesn't look like it will do anything?
Meh. That was a leftover --- removed. (I was toying with the "desc"
code because it misbehaves when applied on records as they are created,
as opposed to being applied on records as they are replayed. I'm pretty
sure everyone already knows about this, and it's the reason why
everybody has skimped from examining arrays of things stored in followup
data records. I was naive enough to write code that tries to decode the
followup record that contains the members of the multixact we're
creating, which works fine during replay but gets them completely wrong
during regular operation. This is the third time I'm surprised by this
misbehavior; blame my bad memory for not remembering that it's not
supposed to work in the first place.)
Right now there is one case in this code that returns
FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would also
work to keep the Multi as is and return FRM_NOOP instead; and it also
returns FRM_NOOP in one case when we could return FRM_INVALIDATE_XMAX
instead. Neither does any great damage, but there is a consideration
that future examiners of the tuple would have to resolve the MultiXact
by themselves (==> performance hit). On the other hand, returning
INVALIDATE causes the block to be dirtied, which is undesirable if not
already dirty. Maybe this can be optimized so that we return a separate
flag from FreezeMultiXactId when Xmax invalidation is optional, so that
we execute all such operations if and only if the block is already dirty
or being dirtied for other reasons. That would provide the cleanup for
later onlookers, while not causing an unnecessary dirtying.
Attached are patches for this, for both 9.3 and master. The 9.3 patch
keeps the original FREEZE record; I have tested that an unpatched
replica dies with:
PANIC: heap2_redo: unknown op code 32
CONTEXT: xlog redo UNKNOWN
LOG: startup process (PID 316) was terminated by signal 6: Aborted
when the master is running the new code. The message is ugly, but I
don't see any way to fix that.
For the master branch, I have removed the original FREEZE record
definition completely and bumped XLOG_PAGE_MAGIC. This doesn't pose a
problem given that we have no replication between different major
versions.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
fix-freeze-93.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 5238,5251 **** heap_inplace_update(Relation relation, HeapTuple tuple)
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
/*
! * heap_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
! * are older than the specified cutoff XID. If so, replace them with
! * FrozenTransactionId or InvalidTransactionId as appropriate, and return
! * TRUE. Return FALSE if nothing was changed.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
--- 5238,5498 ----
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
+ #define FRM_NOOP 0x0001
+ #define FRM_INVALIDATE_XMAX 0x0002
+ #define FRM_RETURN_IS_XID 0x0004
+ #define FRM_RETURN_IS_MULTI 0x0008
+ #define FRM_MARK_COMMITTED 0x0010
/*
! * FreezeMultiXactId
! * Determine what to do during freezing when a tuple is marked by a
! * MultiXactId.
! *
! * NB -- this might have the side-effect of creating a new MultiXactId!
! *
! * "flags" is an output value; it's used to tell caller what to do on return.
! * Possible flags are:
! * FRM_NOOP
! * don't do anything -- keep existing Xmax
! * FRM_INVALIDATE_XMAX
! * mark Xmax as InvalidTransactionId and set XMAX_INVALID flag.
! * FRM_RETURN_IS_XID
! * The Xid return value is a single update Xid to set as xmax.
! * FRM_MARK_COMMITTED
! * Xmax can be marked as HEAP_XMAX_COMMITTED
! * FRM_RETURN_IS_MULTI
! * The return value is a new MultiXactId to set as new Xmax.
! * (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
! */
! static TransactionId
! FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! uint16 *flags)
! {
! TransactionId xid = InvalidTransactionId;
! int i;
! MultiXactMember *members;
! int nmembers;
! bool need_replace;
! int nnewmembers;
! MultiXactMember *newmembers;
! bool has_lockers;
! TransactionId update_xid;
! bool update_committed;
!
! *flags = 0;
!
! /* We should only be called in Multis */
! Assert(t_infomask & HEAP_XMAX_IS_MULTI);
!
! if (!MultiXactIdIsValid(multi))
! {
! /* Ensure infomask bits are appropriately set/reset */
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }
! else if (MultiXactIdPrecedes(multi, cutoff_multi))
! {
! /*
! * This old multi cannot possibly have members still running. If it
! * was a locker only, it can be removed without any further
! * consideration; but if it contained an update, we might need to
! * preserve it.
! */
! Assert(!MultiXactIdIsRunning(multi));
! if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
! {
! *flags |= FRM_INVALIDATE_XMAX;
! xid = InvalidTransactionId; /* not strictly necessary */
! }
! else
! {
! /* replace multi by update xid */
! xid = MultiXactIdGetUpdateXid(multi, t_infomask);
!
! /* wasn't only a lock, xid needs to be valid */
! Assert(TransactionIdIsValid(xid));
!
! /*
! * If the xid is older than the cutoff, it has to have aborted,
! * otherwise the tuple would have gotten pruned away.
! */
! if (TransactionIdPrecedes(xid, cutoff_xid))
! {
! Assert(!TransactionIdDidCommit(xid));
! *flags |= FRM_INVALIDATE_XMAX;
! xid = InvalidTransactionId; /* not strictly necessary */
! }
! else
! *flags |= FRM_RETURN_IS_XID;
! }
!
! return xid;
! }
!
! /*
! * This multixact might have or might not have members still running, but
! * we know it's valid and is newer than the cutoff point for multis.
! * However, some member(s) of it may be below the cutoff for Xids, so we
! * need to walk the whole members array to figure out what to do, if
! * anything.
! */
!
! nmembers = GetMultiXactIdMembers(multi, &members, false);
! if (nmembers <= 0)
! {
! /* Nothing worth keeping */
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }
!
! /* is there anything older than the cutoff? */
! need_replace = false;
! for (i = 0; i < nmembers; i++)
! {
! if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
! {
! need_replace = true;
! break;
! }
! }
!
! /*
! * In the simplest case, there is no member older than the cutoff; we can
! * keep the existing MultiXactId as is.
! */
! if (!need_replace)
! {
! *flags |= FRM_NOOP;
! pfree(members);
! return InvalidTransactionId;
! }
!
! /*
! * If the multi needs to be updated, figure out which members do we need
! * to keep.
! */
! nnewmembers = 0;
! newmembers = palloc(sizeof(MultiXactMember) * nmembers);
! has_lockers = false;
! update_xid = InvalidTransactionId;
! update_committed = false;
!
! for (i = 0; i < nmembers; i++)
! {
! if (ISUPDATE_from_mxstatus(members[i].status))
! {
! /*
! * It's an update; should we keep it? If the transaction is known
! * aborted then it's okay to ignore it, otherwise not. (Note this
! * is just an optimization and not needed for correctness, so it's
! * okay to get this test wrong; for example, in case an updater is
! * crashed, or a running transaction in the process of aborting.)
! */
! if (!TransactionIdDidAbort(members[i].xid))
! {
! newmembers[nnewmembers++] = members[i];
! Assert(!TransactionIdIsValid(update_xid));
!
! /*
! * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
! * the Xid in cache. Again, this is just an optimization, so
! * it's not a problem if the transaction is still running and
! * in the process of committing.
! */
! if (TransactionIdDidCommit(members[i].xid))
! update_committed = true;
!
! update_xid = members[i].xid;
! }
!
! /*
! * Checking for very old update Xids is critical: if the update
! * member of the multi is older than cutoff_xid, we must remove
! * it, because otherwise a later liveliness check could attempt
! * pg_clog access for a page that was truncated away by the
! * current vacuum. Note that if the update had committed, we
! * wouldn't be freezing this tuple because it would have gotten
! * removed (HEAPTUPLE_DEAD) by HeapTupleSatisfiesVacuum; so it
! * either aborted or crashed. Therefore, ignore this update_xid.
! */
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! {
! update_xid = InvalidTransactionId;
! update_committed = false;
!
! }
! }
! else
! {
! /* We only keep lockers if they are still running */
! if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
! TransactionIdIsInProgress(members[i].xid))
! {
! /* running locker cannot possibly be older than the cutoff */
! Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
! newmembers[nnewmembers++] = members[i];
! has_lockers = true;
! }
! }
! }
!
! pfree(members);
!
! if (nnewmembers == 0)
! {
! /* nothing worth keeping!? Tell caller to remove the whole thing */
! *flags |= FRM_INVALIDATE_XMAX;
! xid = InvalidTransactionId;
! }
! else if (TransactionIdIsValid(update_xid) && !has_lockers)
! {
! /*
! * If there's a single member and it's an update, pass it back alone
! * without creating a new Multi. (XXX we could do this when there's a
! * single remaining locker, too, but that would complicate the API too
! * much; moreover, the case with the single updater is more
! * interesting, because those are longer-lived.)
! */
! Assert(nnewmembers == 1);
! *flags |= FRM_RETURN_IS_XID;
! if (update_committed)
! *flags |= FRM_MARK_COMMITTED;
! xid = update_xid;
! }
! else
! {
! /*
! * Create a new multixact with the surviving members of the previous
! * one, to set as new Xmax in the tuple.
! *
! * If this is the first possibly-multixact-able operation in the
! * current transaction, set my per-backend OldestMemberMXactId
! * setting. We can be certain that the transaction will never become a
! * member of any older MultiXactIds than that.
! */
! MultiXactIdSetOldestMember();
! xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
! *flags |= FRM_RETURN_IS_MULTI;
! }
!
! pfree(newmembers);
!
! return xid;
! }
!
! /*
! * heap_prepare_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
! * are older than the specified cutoff XID and cutoff MultiXactId. If so,
! * setup enough state (in the *frz output argument) to later execute and
! * WAL-log what we would need to do, and return TRUE. Return FALSE if nothing
! * is to be changed.
! *
! * Caller is responsible for setting the offset field, if appropriate. This
! * is only necessary if the freeze is to be WAL-logged.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
***************
*** 5254,5307 **** heap_inplace_update(Relation relation, HeapTuple tuple)
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
! * anyone's idea of the tuple state. Also, since we assume the tuple is
! * not HEAPTUPLE_DEAD, the fact that an XID is not still running allows us
! * to assume that it is either committed good or aborted, as appropriate;
! * so we need no external state checks to decide what to do. (This is good
! * because this function is applied during WAL recovery, when we don't have
! * access to any such state, and can't depend on the hint bits to be set.)
! * There is an exception we make which is to assume GetMultiXactIdMembers can
! * be called during recovery.
! *
* Similarly, cutoff_multi must be less than or equal to the smallest
* MultiXactId used by any transaction currently open.
*
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
! * Note: it might seem we could make the changes without exclusive lock, since
! * TransactionId read/write is assumed atomic anyway. However there is a race
! * condition: someone who just fetched an old XID that we overwrite here could
! * conceivably not finish checking the XID against pg_clog before we finish
! * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
! * exclusive lock ensures no other backend is in process of checking the
! * tuple status. Also, getting exclusive lock makes it safe to adjust the
! * infomask bits.
! *
! * NB: Cannot rely on hint bits here, they might not be set after a crash or
! * on a standby.
*/
bool
! heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
! MultiXactId cutoff_multi)
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, cutoff_xid))
{
! HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED will
* already be set here, but there's a small chance not.
*/
! Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! tuple->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
--- 5501,5544 ----
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
! * anyone's idea of the tuple state.
* Similarly, cutoff_multi must be less than or equal to the smallest
* MultiXactId used by any transaction currently open.
*
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
! * NB: It is not enough to set hint bits to indicate something is
! * committed/invalid -- they might not be set on a standby, or after crash
! * recovery. We really need to remove old xids.
*/
bool
! heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
! TransactionId cutoff_multi,
! xl_heap_freeze_tuple *frz)
!
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
+ frz->frzflags = 0;
+ frz->t_infomask2 = tuple->t_infomask2;
+ frz->t_infomask = tuple->t_infomask;
+ frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
+
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, cutoff_xid))
{
! frz->frzflags |= XLH_FREEZE_XMIN;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED will
* already be set here, but there's a small chance not.
*/
! frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
***************
*** 5318,5408 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
! if (!MultiXactIdIsValid(xid))
! {
! /* no xmax set, ignore */
! ;
! }
! else if (MultiXactIdPrecedes(xid, cutoff_multi))
! {
! /*
! * This old multi cannot possibly be running. If it was a locker
! * only, it can be removed without much further thought; but if it
! * contained an update, we need to preserve it.
! */
! if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
! freeze_xmax = true;
! else
! {
! TransactionId update_xid;
! update_xid = HeapTupleGetUpdateXid(tuple);
!
! /*
! * The multixact has an update hidden within. Get rid of it.
! *
! * If the update_xid is below the cutoff_xid, it necessarily
! * must be an aborted transaction. In a primary server, such
! * an Xmax would have gotten marked invalid by
! * HeapTupleSatisfiesVacuum, but in a replica that is not
! * called before we are, so deal with it in the same way.
! *
! * If not below the cutoff_xid, then the tuple would have been
! * pruned by vacuum, if the update committed long enough ago,
! * and we wouldn't be freezing it; so it's either recently
! * committed, or in-progress. Deal with this by setting the
! * Xmax to the update Xid directly and remove the IS_MULTI
! * bit. (We know there cannot be running lockers in this
! * multi, because it's below the cutoff_multi value.)
! */
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! {
! Assert(InRecovery || TransactionIdDidAbort(update_xid));
! freeze_xmax = true;
! }
! else
! {
! Assert(InRecovery || !TransactionIdIsInProgress(update_xid));
! tuple->t_infomask &= ~HEAP_XMAX_BITS;
! HeapTupleHeaderSetXmax(tuple, update_xid);
! changed = true;
! }
! }
}
! else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
{
! /* newer than the cutoff, so don't touch it */
! ;
}
else
{
! TransactionId update_xid;
!
! /*
! * This is a multixact which is not marked LOCK_ONLY, but which
! * is newer than the cutoff_multi. If the update_xid is below the
! * cutoff_xid point, then we can just freeze the Xmax in the
! * tuple, removing it altogether. This seems simple, but there
! * are several underlying assumptions:
! *
! * 1. A tuple marked by an multixact containing a very old
! * committed update Xid would have been pruned away by vacuum; we
! * wouldn't be freezing this tuple at all.
! *
! * 2. There cannot possibly be any live locking members remaining
! * in the multixact. This is because if they were alive, the
! * update's Xid would had been considered, via the lockers'
! * snapshot's Xmin, as part the cutoff_xid.
! *
! * 3. We don't create new MultiXacts via MultiXactIdExpand() that
! * include a very old aborted update Xid: in that function we only
! * include update Xids corresponding to transactions that are
! * committed or in-progress.
! */
! update_xid = HeapTupleGetUpdateXid(tuple);
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! freeze_xmax = true;
}
}
else if (TransactionIdIsNormal(xid) &&
--- 5555,5589 ----
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
! TransactionId newxmax;
! uint16 flags;
! newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
! cutoff_xid, cutoff_multi, &flags);
! if (flags & FRM_INVALIDATE_XMAX)
! freeze_xmax = true;
! else if (flags & FRM_RETURN_IS_XID)
! {
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->xmax = newxmax;
! if (flags & FRM_MARK_COMMITTED)
! frz->t_infomask |= HEAP_XMAX_COMMITTED;
! changed = true;
}
! else if (flags & FRM_RETURN_IS_MULTI)
{
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->xmax = newxmax;
!
! GetMultiXactIdHintBits(newxmax,
! &frz->t_infomask,
! &frz->t_infomask2);
! changed = true;
}
else
{
! Assert(flags & FRM_NOOP);
}
}
else if (TransactionIdIsNormal(xid) &&
***************
*** 5413,5429 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (freeze_xmax)
{
! HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
/*
* The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
* LOCKED. Normalize to INVALID just to be sure no one gets confused.
* Also get rid of the HEAP_KEYS_UPDATED bit.
*/
! tuple->t_infomask &= ~HEAP_XMAX_BITS;
! tuple->t_infomask |= HEAP_XMAX_INVALID;
! HeapTupleHeaderClearHotUpdated(tuple);
! tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}
--- 5594,5610 ----
if (freeze_xmax)
{
! frz->xmax = InvalidTransactionId;
/*
* The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
* LOCKED. Normalize to INVALID just to be sure no one gets confused.
* Also get rid of the HEAP_KEYS_UPDATED bit.
*/
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->t_infomask |= HEAP_XMAX_INVALID;
! frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
! frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}
***************
*** 5443,5458 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
! HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
else
! HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED
* will already be set here, but there's a small chance not.
*/
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! tuple->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
}
--- 5624,5639 ----
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
! frz->frzflags |= XLH_INVALID_XVAC;
else
! frz->frzflags |= XLH_FREEZE_XVAC;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED
* will already be set here, but there's a small chance not.
*/
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
}
***************
*** 5461,5466 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
--- 5642,5709 ----
}
/*
+ * heap_execute_freeze_tuple
+ * Execute the prepared freezing of a tuple.
+ *
+ * Caller is responsible for ensuring that no other backend can access the
+ * storage underlying this tuple, either by holding an exclusive lock on the
+ * buffer containing it (which is what lazy VACUUM does), or by having it by
+ * in private storage (which is what CLUSTER and friends do).
+ *
+ * Note: it might seem we could make the changes without exclusive lock, since
+ * TransactionId read/write is assumed atomic anyway. However there is a race
+ * condition: someone who just fetched an old XID that we overwrite here could
+ * conceivably not finish checking the XID against pg_clog before we finish
+ * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
+ * exclusive lock ensures no other backend is in process of checking the
+ * tuple status. Also, getting exclusive lock makes it safe to adjust the
+ * infomask bits.
+ *
+ * NB: All code in here must be safe to execute during crash recovery!
+ */
+ void
+ heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+ {
+ if (frz->frzflags & XLH_FREEZE_XMIN)
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ HeapTupleHeaderSetXmax(tuple, frz->xmax);
+
+ if (frz->frzflags & XLH_FREEZE_XVAC)
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ if (frz->frzflags & XLH_INVALID_XVAC)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+
+ tuple->t_infomask = frz->t_infomask;
+ tuple->t_infomask2 = frz->t_infomask2;
+ }
+
+ /*
+ * heap_freeze_tuple - freeze tuple inplace without WAL logging.
+ *
+ * Useful for callers like CLUSTER that perform their own WAL logging.
+ */
+ bool
+ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ TransactionId cutoff_multi)
+ {
+ xl_heap_freeze_tuple frz;
+ bool do_freeze;
+
+ do_freeze = heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi, &frz);
+
+ /*
+ * Note that because this is not a WAL-logged operation, we don't need to
+ * fill in the offset in the freeze record.
+ */
+
+ if (do_freeze)
+ heap_execute_freeze_tuple(tuple, &frz);
+ return do_freeze;
+ }
+
+ /*
* For a given MultiXactId, return the hint bits that should be set in the
* tuple's infomask.
*
***************
*** 5763,5778 **** heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
}
else if (MultiXactIdPrecedes(multi, cutoff_multi))
return true;
- else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
- {
- /* only-locker multis don't need internal examination */
- ;
- }
else
{
! if (TransactionIdPrecedes(HeapTupleGetUpdateXid(tuple),
! cutoff_xid))
! return true;
}
}
else
--- 6006,6031 ----
}
else if (MultiXactIdPrecedes(multi, cutoff_multi))
return true;
else
{
! MultiXactMember *members;
! int nmembers;
! int i;
!
! /* need to check whether any member of the mxact is too old */
!
! nmembers = GetMultiXactIdMembers(multi, &members, false);
!
! for (i = 0; i < nmembers; i++)
! {
! if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
! {
! pfree(members);
! return true;
! }
! }
! if (nmembers > 0)
! pfree(members);
}
}
else
***************
*** 6022,6066 **** log_heap_clean(Relation reln, Buffer buffer,
}
/*
! * Perform XLogInsert for a heap-freeze operation. Caller must already
! * have modified the buffer and marked it dirty.
*/
XLogRecPtr
! log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! OffsetNumber *offsets, int offcnt)
{
! xl_heap_freeze xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
/* nor when there are no tuples to freeze */
! Assert(offcnt > 0);
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.cutoff_xid = cutoff_xid;
! xlrec.cutoff_multi = cutoff_multi;
rdata[0].data = (char *) &xlrec;
! rdata[0].len = SizeOfHeapFreeze;
rdata[0].buffer = InvalidBuffer;
rdata[0].next = &(rdata[1]);
/*
! * The tuple-offsets array is not actually in the buffer, but pretend that
! * it is. When XLogInsert stores the whole buffer, the offsets array need
* not be stored too.
*/
! rdata[1].data = (char *) offsets;
! rdata[1].len = offcnt * sizeof(OffsetNumber);
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE, rdata);
return recptr;
}
--- 6275,6318 ----
}
/*
! * Perform XLogInsert for a heap-freeze operation. Caller must have already
! * modified the buffer and marked it dirty.
*/
XLogRecPtr
! log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
! xl_heap_freeze_tuple *tuples, int ntuples)
{
! xl_heap_freeze_page xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
/* nor when there are no tuples to freeze */
! Assert(ntuples > 0);
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.cutoff_xid = cutoff_xid;
! xlrec.ntuples = ntuples;
rdata[0].data = (char *) &xlrec;
! rdata[0].len = SizeOfHeapFreezePage;
rdata[0].buffer = InvalidBuffer;
rdata[0].next = &(rdata[1]);
/*
! * The freeze plan array is not actually in the buffer, but pretend that
! * it is. When XLogInsert stores the whole buffer, the freeze plan need
* not be stored too.
*/
! rdata[1].data = (char *) tuples;
! rdata[1].len = ntuples * sizeof(xl_heap_freeze_tuple);
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE, rdata);
return recptr;
}
***************
*** 6402,6407 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
--- 6654,6752 ----
XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
}
+ /*
+ * Freeze a single tuple for XLOG_HEAP2_FREEZE
+ *
+ * NB: This type of record isn't generated anymore, since bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * is kept around to be able to perform PITR.
+ */
+ static bool
+ heap_xlog_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi)
+ {
+ bool changed = false;
+ TransactionId xid;
+
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ /*
+ * Might as well fix the hint bits too; usually XMIN_COMMITTED will
+ * already be set here, but there's a small chance not.
+ */
+ Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+ tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ changed = true;
+ }
+
+ /*
+ * Note that this code handles IS_MULTI Xmax values, too, but only to mark
+ * the tuple as not updated if the multixact is below the cutoff Multixact
+ * given; it doesn't remove dead members of a very old multixact.
+ */
+ xid = HeapTupleHeaderGetRawXmax(tuple);
+ if ((tuple->t_infomask & HEAP_XMAX_IS_MULTI) ?
+ (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, cutoff_multi)) :
+ (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid)))
+ {
+ HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
+
+ /*
+ * The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
+ * LOCKED. Normalize to INVALID just to be sure no one gets confused.
+ * Also get rid of the HEAP_KEYS_UPDATED bit.
+ */
+ tuple->t_infomask &= ~HEAP_XMAX_BITS;
+ tuple->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderClearHotUpdated(tuple);
+ tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ changed = true;
+ }
+
+ /*
+ * Old-style VACUUM FULL is gone, but we have to keep this code as long as
+ * we support having MOVED_OFF/MOVED_IN tuples in the database.
+ */
+ if (tuple->t_infomask & HEAP_MOVED)
+ {
+ xid = HeapTupleHeaderGetXvac(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ /*
+ * If a MOVED_OFF tuple is not dead, the xvac transaction must
+ * have failed; whereas a non-dead MOVED_IN tuple must mean the
+ * xvac transaction succeeded.
+ */
+ if (tuple->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ /*
+ * Might as well fix the hint bits too; usually XMIN_COMMITTED
+ * will already be set here, but there's a small chance not.
+ */
+ Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+ tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ changed = true;
+ }
+ }
+
+ return changed;
+ }
+
+ /*
+ * NB: This type of record isn't generated anymore, since bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * is kept around to be able to perform PITR.
+ */
static void
heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
{
***************
*** 6450,6456 **** heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
ItemId lp = PageGetItemId(page, *offsets);
HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
! (void) heap_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
offsets++;
}
}
--- 6795,6801 ----
ItemId lp = PageGetItemId(page, *offsets);
HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
! (void) heap_xlog_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
offsets++;
}
}
***************
*** 6574,6579 **** heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
--- 6919,6981 ----
}
}
+ /*
+ * Replay XLOG_HEAP2_FREEZE_PAGE records
+ */
+ static void
+ heap_xlog_freeze_page(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
+ TransactionId cutoff_xid = xlrec->cutoff_xid;
+ Buffer buffer;
+ Page page;
+ int ntup;
+
+ /*
+ * In Hot Standby mode, ensure that there's no queries running which still
+ * consider the frozen xids as running.
+ */
+ if (InHotStandby)
+ ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ /* now execute freeze plan for each frozen tuple */
+ for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ {
+ xl_heap_freeze_tuple *xlrec_tp;
+ ItemId lp;
+ HeapTupleHeader tuple;
+
+ xlrec_tp = &xlrec->tuples[ntup];
+ lp = PageGetItemId(page, xlrec_tp->offset); /* offsets are one-based */
+ tuple = (HeapTupleHeader) PageGetItem(page, lp);
+
+ heap_execute_freeze_tuple(tuple, xlrec_tp);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
static void
heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
{
***************
*** 7429,7434 **** heap2_redo(XLogRecPtr lsn, XLogRecord *record)
--- 7831,7839 ----
case XLOG_HEAP2_CLEAN:
heap_xlog_clean(lsn, record);
break;
+ case XLOG_HEAP2_FREEZE_PAGE:
+ heap_xlog_freeze_page(lsn, record);
+ break;
case XLOG_HEAP2_CLEANUP_INFO:
heap_xlog_cleanup_info(lsn, record);
break;
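The replay routine above follows the standard redo pattern: resolve standby conflicts, restore a full-page image if one exists, and otherwise apply the record only when the page's LSN shows it has not already seen this record's effects. A minimal stand-alone sketch of that idempotence guard (MiniPage and mini_redo_freeze are hypothetical stand-ins, not PostgreSQL APIs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;          /* stand-in for the real typedef */

typedef struct
{
    XLogRecPtr  lsn;                  /* what PageGetLSN/PageSetLSN track */
    int         nfrozen;              /* stand-in for the page's tuple state */
} MiniPage;

/*
 * Replay one freeze-page record against a page; returns true if the page
 * was modified.  The "record_lsn <= page->lsn" test is the redo guard: if
 * the page already carries this record's LSN (or a later one), its effects
 * are on disk and replay must be skipped to stay idempotent.
 */
bool
mini_redo_freeze(MiniPage *page, XLogRecPtr record_lsn, int ntuples)
{
    if (record_lsn <= page->lsn)
        return false;                 /* effects already applied; skip */

    page->nfrozen += ntuples;         /* "execute freeze plan" stand-in */
    page->lsn = record_lsn;           /* PageSetLSN */
    return true;
}
```

Replaying the same record twice leaves the page unchanged the second time, which is what makes crash-restart replay safe.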
*** a/src/backend/access/rmgrdesc/heapdesc.c
--- b/src/backend/access/rmgrdesc/heapdesc.c
***************
*** 149,154 **** heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 149,163 ----
xlrec->node.relNode, xlrec->block,
xlrec->latestRemovedXid);
}
+ else if (info == XLOG_HEAP2_FREEZE_PAGE)
+ {
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
+
+ appendStringInfo(buf, "freeze_page: rel %u/%u/%u; blk %u; cutoff xid %u ntuples %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block,
+ xlrec->cutoff_xid, xlrec->ntuples);
+ }
else if (info == XLOG_HEAP2_CLEANUP_INFO)
{
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
***************
*** 286,292 **** static MemoryContext MXactContext = NULL;
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
- static MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int nmembers, MultiXactMember *members);
static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
--- 286,291 ----
***************
*** 344,350 **** MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
members[1].xid = xid2;
members[1].status = status2;
! newMulti = CreateMultiXactId(2, members);
debug_elog3(DEBUG2, "Create: %s",
mxid_to_string(newMulti, 2, members));
--- 343,349 ----
members[1].xid = xid2;
members[1].status = status2;
! newMulti = MultiXactIdCreateFromMembers(2, members);
debug_elog3(DEBUG2, "Create: %s",
mxid_to_string(newMulti, 2, members));
***************
*** 407,413 **** MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
*/
member.xid = xid;
member.status = status;
! newMulti = CreateMultiXactId(1, &member);
debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
multi, newMulti);
--- 406,412 ----
*/
member.xid = xid;
member.status = status;
! newMulti = MultiXactIdCreateFromMembers(1, &member);
debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
multi, newMulti);
***************
*** 459,465 **** MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
newMembers[j].xid = xid;
newMembers[j++].status = status;
! newMulti = CreateMultiXactId(j, newMembers);
pfree(members);
pfree(newMembers);
--- 458,464 ----
newMembers[j].xid = xid;
newMembers[j++].status = status;
! newMulti = MultiXactIdCreateFromMembers(j, newMembers);
pfree(members);
pfree(newMembers);
***************
*** 664,679 **** ReadNextMultiXactId(void)
}
/*
! * CreateMultiXactId
! * Make a new MultiXactId
*
* Make XLOG, SLRU and cache entries for a new MultiXactId, recording the
* given TransactionIds as members. Returns the newly created MultiXactId.
*
* NB: the passed members[] array will be sorted in-place.
*/
! static MultiXactId
! CreateMultiXactId(int nmembers, MultiXactMember *members)
{
MultiXactId multi;
MultiXactOffset offset;
--- 663,678 ----
}
/*
! * MultiXactIdCreateFromMembers
! * Make a new MultiXactId from the specified set of members
*
* Make XLOG, SLRU and cache entries for a new MultiXactId, recording the
* given TransactionIds as members. Returns the newly created MultiXactId.
*
* NB: the passed members[] array will be sorted in-place.
*/
! MultiXactId
! MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
{
MultiXactId multi;
MultiXactOffset offset;
***************
*** 760,766 **** CreateMultiXactId(int nmembers, MultiXactMember *members)
* RecordNewMultiXact
* Write info about a new multixact into the offsets and members files
*
! * This is broken out of CreateMultiXactId so that xlog replay can use it.
*/
static void
RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
--- 759,766 ----
* RecordNewMultiXact
* Write info about a new multixact into the offsets and members files
*
! * This is broken out of MultiXactIdCreateFromMembers so that xlog replay can
! * use it.
*/
static void
RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
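The rename exports MultiXactIdCreateFromMembers so that the freezing code can build a replacement multi from only the surviving members of an old one. A toy sketch of that filtering step, with every type and predicate simplified to a flag (none of these names exist in PostgreSQL):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { MEMBER_LOCKER, MEMBER_UPDATE } MiniStatus;

typedef struct
{
    uint32_t   xid;
    MiniStatus status;
    bool       running;   /* stand-in for TransactionIdIsInProgress() */
    bool       aborted;   /* stand-in for TransactionIdDidAbort() */
} MiniMember;

/*
 * Copy the members worth keeping into "out" and return how many there are:
 * lockers survive only while still running; an update survives unless it is
 * known aborted.  The survivors would then be handed to
 * MultiXactIdCreateFromMembers to mint a fresh multi.
 */
int
mini_filter_members(const MiniMember *in, int n, MiniMember *out)
{
    int nnew = 0;

    for (int i = 0; i < n; i++)
    {
        if (in[i].status == MEMBER_UPDATE)
        {
            if (!in[i].aborted)
                out[nnew++] = in[i];
        }
        else if (in[i].running)
            out[nnew++] = in[i];
    }
    return nnew;
}
```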
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 424,429 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 424,430 ----
Buffer vmbuffer = InvalidBuffer;
BlockNumber next_not_all_visible_block;
bool skipping_all_visible_blocks;
+ xl_heap_freeze_tuple *frozen;
pg_rusage_init(&ru0);
***************
*** 446,451 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 447,453 ----
vacrelstats->latestRemovedXid = InvalidTransactionId;
lazy_space_alloc(vacrelstats, nblocks);
+ frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
/*
* We want to skip pages that don't require vacuuming according to the
***************
*** 500,506 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- OffsetNumber frozen[MaxOffsetNumber];
int nfrozen;
Size freespace;
bool all_visible_according_to_vm;
--- 502,507 ----
***************
*** 893,901 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
*/
! if (heap_freeze_tuple(tuple.t_data, FreezeLimit,
! MultiXactCutoff))
! frozen[nfrozen++] = offnum;
}
} /* scan along page */
--- 894,902 ----
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
*/
! if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
! MultiXactCutoff, &frozen[nfrozen]))
! frozen[nfrozen++].offset = offnum;
}
} /* scan along page */
***************
*** 906,920 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
*/
if (nfrozen > 0)
{
MarkBufferDirty(buf);
if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
recptr = log_heap_freeze(onerel, buf, FreezeLimit,
! MultiXactCutoff, frozen, nfrozen);
PageSetLSN(page, recptr);
}
}
/*
--- 907,939 ----
*/
if (nfrozen > 0)
{
+ START_CRIT_SECTION();
+
MarkBufferDirty(buf);
+
+ /* execute collected freezes */
+ for (i = 0; i < nfrozen; i++)
+ {
+ ItemId itemid;
+ HeapTupleHeader htup;
+
+ itemid = PageGetItemId(page, frozen[i].offset);
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ heap_execute_freeze_tuple(htup, &frozen[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
recptr = log_heap_freeze(onerel, buf, FreezeLimit,
! frozen, nfrozen);
PageSetLSN(page, recptr);
}
+
+ END_CRIT_SECTION();
}
/*
***************
*** 1015,1020 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 1034,1041 ----
RecordPageWithFreeSpace(onerel, blkno, freespace);
}
+ pfree(frozen);
+
/* save stats for use later */
vacrelstats->scanned_tuples = num_tuples;
vacrelstats->tuples_deleted = tups_vacuumed;
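The vacuumlazy.c changes split freezing into two phases: heap_prepare_freeze_tuple collects plans while scanning, and only inside the critical section are the plans executed and WAL-logged. A miniature of that shape (MiniTuple, MiniFreezePlan and the functions below are simplified stand-ins, not the real structures):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t xmin; bool frozen; } MiniTuple;
typedef struct { int offset; } MiniFreezePlan;

/* Phase 1: decide, outside any critical section.  Record a plan if the
 * tuple's xmin is valid and below the cutoff. */
static bool
mini_prepare_freeze(const MiniTuple *tup, uint32_t cutoff_xid,
                    MiniFreezePlan *plan, int offset)
{
    if (tup->xmin != 0 && tup->xmin < cutoff_xid)
    {
        plan->offset = offset;
        return true;
    }
    return false;
}

/* Phase 2: mutate the page.  In the real code this runs between
 * START_CRIT_SECTION() and END_CRIT_SECTION(), followed by
 * log_heap_freeze() of the same plan array. */
static void
mini_execute_freeze(MiniTuple *page, const MiniFreezePlan *plan)
{
    page[plan->offset].frozen = true;
}

int
mini_freeze_page(MiniTuple *page, int ntuples, uint32_t cutoff_xid)
{
    MiniFreezePlan plans[64];     /* small fixed cap, sketch only */
    int nfrozen = 0;

    for (int off = 0; off < ntuples; off++)
        if (mini_prepare_freeze(&page[off], cutoff_xid,
                                &plans[nfrozen], off))
            nfrozen++;

    for (int i = 0; i < nfrozen; i++)
        mini_execute_freeze(page, &plans[i]);

    return nfrozen;
}
```

Keeping the decision phase out of the critical section matters because clog/multixact lookups there can fail or block, and a failure inside a critical section becomes a PANIC.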
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 50,56 ****
*/
#define XLOG_HEAP2_FREEZE 0x00
#define XLOG_HEAP2_CLEAN 0x10
! /* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
--- 50,56 ----
*/
#define XLOG_HEAP2_FREEZE 0x00
#define XLOG_HEAP2_CLEAN 0x10
! #define XLOG_HEAP2_FREEZE_PAGE 0x20
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
***************
*** 239,245 **** typedef struct xl_heap_inplace
#define SizeOfHeapInplace (offsetof(xl_heap_inplace, target) + SizeOfHeapTid)
! /* This is what we need to know about tuple freezing during vacuum */
typedef struct xl_heap_freeze
{
RelFileNode node;
--- 239,245 ----
#define SizeOfHeapInplace (offsetof(xl_heap_inplace, target) + SizeOfHeapTid)
! /* This is what we need to know about tuple freezing during vacuum (legacy) */
typedef struct xl_heap_freeze
{
RelFileNode node;
***************
*** 251,256 **** typedef struct xl_heap_freeze
--- 251,287 ----
#define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_multi) + sizeof(MultiXactId))
+ /*
+ * a 'freeze plan' struct that represents what we need to know about a single
+ * tuple being frozen during vacuum
+ */
+ #define XLH_FREEZE_XMIN 0x01
+ #define XLH_FREEZE_XVAC 0x02
+ #define XLH_INVALID_XVAC 0x04
+
+ typedef struct xl_heap_freeze_tuple
+ {
+ TransactionId xmax;
+ OffsetNumber offset;
+ uint16 t_infomask2;
+ uint16 t_infomask;
+ uint8 frzflags;
+ } xl_heap_freeze_tuple;
+
+ /*
+ * This is what we need to know about a block being frozen during vacuum
+ */
+ typedef struct xl_heap_freeze_page
+ {
+ RelFileNode node;
+ BlockNumber block;
+ TransactionId cutoff_xid;
+ uint16 ntuples;
+ xl_heap_freeze_tuple tuples[FLEXIBLE_ARRAY_MEMBER];
+ } xl_heap_freeze_page;
+
+ #define SizeOfHeapFreezePage offsetof(xl_heap_freeze_page, tuples)
+
/* This is what we need to know about setting a visibility map bit */
typedef struct xl_heap_visible
{
***************
*** 277,284 **** extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! OffsetNumber *offsets, int offcnt);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid);
extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
--- 308,321 ----
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
! int ntuples);
! extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
! TransactionId cutoff_xid,
! TransactionId cutoff_multi,
! xl_heap_freeze_tuple *frz);
! extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
! xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid);
extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
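The new xl_heap_freeze_page record ends in a flexible array of per-tuple freeze plans, so its on-WAL size is the fixed header plus one plan entry per tuple. A stand-alone sketch of that sizing arithmetic (the field widths mirror the struct in the patch; the typedefs are local stand-ins, not the real headers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint16_t OffsetNumber;
typedef uint32_t BlockNumber;
typedef struct { uint32_t spcNode, dbNode, relNode; } RelFileNode;

typedef struct xl_heap_freeze_tuple
{
    TransactionId xmax;
    OffsetNumber  offset;
    uint16_t      t_infomask2;
    uint16_t      t_infomask;
    uint8_t       frzflags;
} xl_heap_freeze_tuple;

typedef struct xl_heap_freeze_page
{
    RelFileNode   node;
    BlockNumber   block;
    TransactionId cutoff_xid;
    uint16_t      ntuples;
    xl_heap_freeze_tuple tuples[];   /* FLEXIBLE_ARRAY_MEMBER */
} xl_heap_freeze_page;

/* Size of the fixed part only, exactly as the patch defines it. */
#define SizeOfHeapFreezePage offsetof(xl_heap_freeze_page, tuples)

/* Total bytes needed for a record carrying ntuples freeze plans. */
size_t
freeze_page_record_size(int ntuples)
{
    return SizeOfHeapFreezePage
        + (size_t) ntuples * sizeof(xl_heap_freeze_tuple);
}
```

Using offsetof of the flexible member, rather than sizeof the struct, avoids counting trailing padding twice when the variable part is appended.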
*** a/src/include/access/multixact.h
--- b/src/include/access/multixact.h
***************
*** 81,86 **** extern MultiXactId MultiXactIdCreate(TransactionId xid1,
--- 81,89 ----
MultiXactStatus status2);
extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
MultiXactStatus status);
+ extern MultiXactId MultiXactIdCreateFromMembers(int nmembers,
+ MultiXactMember *members);
+
extern MultiXactId ReadNextMultiXactId(void);
extern bool MultiXactIdIsRunning(MultiXactId multi);
extern void MultiXactIdSetOldestMember(void);
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 212,217 **** extern int wal_level;
--- 212,218 ----
/* Do we need to WAL-log information required only for Hot Standby? */
#define XLogStandbyInfoActive() (wal_level >= WAL_LEVEL_HOT_STANDBY)
+ #define WAL_DEBUG
#ifdef WAL_DEBUG
extern bool XLOG_DEBUG;
#endif
Attachment: fix-freeze-master.patch (text/x-diff; charset=us-ascii)
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 5409,5422 **** heap_inplace_update(Relation relation, HeapTuple tuple)
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
/*
! * heap_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
! * are older than the specified cutoff XID. If so, replace them with
! * FrozenTransactionId or InvalidTransactionId as appropriate, and return
! * TRUE. Return FALSE if nothing was changed.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
--- 5409,5669 ----
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
+ #define FRM_NOOP 0x0001
+ #define FRM_INVALIDATE_XMAX 0x0002
+ #define FRM_RETURN_IS_XID 0x0004
+ #define FRM_RETURN_IS_MULTI 0x0008
+ #define FRM_MARK_COMMITTED 0x0010
/*
! * FreezeMultiXactId
! * Determine what to do during freezing when a tuple is marked by a
! * MultiXactId.
! *
! * NB -- this might have the side-effect of creating a new MultiXactId!
! *
! * "flags" is an output value; it's used to tell caller what to do on return.
! * Possible flags are:
! * FRM_NOOP
! * don't do anything -- keep existing Xmax
! * FRM_INVALIDATE_XMAX
! * mark Xmax as InvalidTransactionId and set XMAX_INVALID flag.
! * FRM_RETURN_IS_XID
! * The Xid return value is a single update Xid to set as xmax.
! * FRM_MARK_COMMITTED
! * Xmax can be marked as HEAP_XMAX_COMMITTED
! * FRM_RETURN_IS_MULTI
! * The return value is a new MultiXactId to set as new Xmax.
! * (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
! */
! static TransactionId
! FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! uint16 *flags)
! {
! TransactionId xid = InvalidTransactionId;
! int i;
! MultiXactMember *members;
! int nmembers;
! bool need_replace;
! int nnewmembers;
! MultiXactMember *newmembers;
! bool has_lockers;
! TransactionId update_xid;
! bool update_committed;
!
! *flags = 0;
!
! /* We should only be called in Multis */
! Assert(t_infomask & HEAP_XMAX_IS_MULTI);
!
! if (!MultiXactIdIsValid(multi))
! {
! /* Ensure infomask bits are appropriately set/reset */
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }
! else if (MultiXactIdPrecedes(multi, cutoff_multi))
! {
! /*
! * This old multi cannot possibly have members still running. If it
! * was a locker only, it can be removed without any further
! * consideration; but if it contained an update, we might need to
! * preserve it.
! */
! Assert(!MultiXactIdIsRunning(multi));
! if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
! {
! *flags |= FRM_INVALIDATE_XMAX;
! xid = InvalidTransactionId; /* not strictly necessary */
! }
! else
! {
! /* replace multi by update xid */
! xid = MultiXactIdGetUpdateXid(multi, t_infomask);
!
! /* wasn't only a lock, xid needs to be valid */
! Assert(TransactionIdIsValid(xid));
!
! /*
! * If the xid is older than the cutoff, it has to have aborted,
! * otherwise the tuple would have gotten pruned away.
! */
! if (TransactionIdPrecedes(xid, cutoff_xid))
! {
! Assert(!TransactionIdDidCommit(xid));
! *flags |= FRM_INVALIDATE_XMAX;
! xid = InvalidTransactionId; /* not strictly necessary */
! }
! else
! *flags |= FRM_RETURN_IS_XID;
! }
!
! return xid;
! }
!
! /*
! * This multixact might or might not have members still running, but
! * we know it's valid and is newer than the cutoff point for multis.
! * However, some member(s) of it may be below the cutoff for Xids, so we
! * need to walk the whole members array to figure out what to do, if
! * anything.
! */
!
! nmembers = GetMultiXactIdMembers(multi, &members, false);
! if (nmembers <= 0)
! {
! /* Nothing worth keeping */
! *flags |= FRM_INVALIDATE_XMAX;
! return InvalidTransactionId;
! }
!
! /* is there anything older than the cutoff? */
! need_replace = false;
! for (i = 0; i < nmembers; i++)
! {
! if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
! {
! need_replace = true;
! break;
! }
! }
!
! /*
! * In the simplest case, there is no member older than the cutoff; we can
! * keep the existing MultiXactId as is.
! */
! if (!need_replace)
! {
! *flags |= FRM_NOOP;
! pfree(members);
! return InvalidTransactionId;
! }
!
! /*
! * If the multi needs to be updated, figure out which members we need
! * to keep.
! */
! nnewmembers = 0;
! newmembers = palloc(sizeof(MultiXactMember) * nmembers);
! has_lockers = false;
! update_xid = InvalidTransactionId;
! update_committed = false;
!
! for (i = 0; i < nmembers; i++)
! {
! if (ISUPDATE_from_mxstatus(members[i].status))
! {
! /*
! * It's an update; should we keep it? If the transaction is known
! * aborted then it's okay to ignore it, otherwise not. (Note this
! * is just an optimization and not needed for correctness, so it's
! * okay to get this test wrong; for example, if the updater crashed,
! * or if a running transaction is in the process of aborting.)
! */
! if (!TransactionIdDidAbort(members[i].xid))
! {
! newmembers[nnewmembers++] = members[i];
! Assert(!TransactionIdIsValid(update_xid));
!
! update_xid = members[i].xid;
!
! /*
! * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
! * the Xid in cache. Again, this is just an optimization, so
! * it's not a problem if the transaction is still running and
! * in the process of committing.
! */
! if (TransactionIdDidCommit(update_xid))
! update_committed = true;
! }
!
! /*
! * Checking for very old update Xids is critical: if the update
! * member of the multi is older than cutoff_xid, we must remove
! * it, because otherwise a later liveness check could attempt
! * pg_clog access for a page that was truncated away by the
! * current vacuum. Note that if the update had committed, we
! * wouldn't be freezing this tuple because it would have gotten
! * removed (HEAPTUPLE_DEAD) by HeapTupleSatisfiesVacuum; so it
! * either aborted or crashed. Therefore, ignore this update_xid.
! */
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! {
! update_xid = InvalidTransactionId;
! update_committed = false;
! }
! }
! else
! {
! /* We only keep lockers if they are still running */
! if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
! TransactionIdIsInProgress(members[i].xid))
! {
! /* running locker cannot possibly be older than the cutoff */
! Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
! newmembers[nnewmembers++] = members[i];
! has_lockers = true;
! }
! }
! }
!
! pfree(members);
!
! if (nnewmembers == 0)
! {
! /* nothing worth keeping!? Tell caller to remove the whole thing */
! *flags |= FRM_INVALIDATE_XMAX;
! xid = InvalidTransactionId;
! }
! else if (TransactionIdIsValid(update_xid) && !has_lockers)
! {
! /*
! * If there's a single member and it's an update, pass it back alone
! * without creating a new Multi. (XXX we could do this when there's a
! * single remaining locker, too, but that would complicate the API too
! * much; moreover, the case with the single updater is more
! * interesting, because those are longer-lived.)
! */
! Assert(nnewmembers == 1);
! *flags |= FRM_RETURN_IS_XID;
! if (update_committed)
! *flags |= FRM_MARK_COMMITTED;
! xid = update_xid;
! }
! else
! {
! /*
! * Create a new multixact with the surviving members of the previous
! * one, to set as new Xmax in the tuple.
! *
! * If this is the first possibly-multixact-able operation in the
! * current transaction, set my per-backend OldestMemberMXactId
! * setting. We can be certain that the transaction will never become a
! * member of any older MultiXactIds than that.
! */
! MultiXactIdSetOldestMember();
! xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
! *flags |= FRM_RETURN_IS_MULTI;
! }
!
! pfree(newmembers);
!
! return xid;
! }
!
! /*
! * heap_prepare_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
! * are older than the specified cutoff XID and cutoff MultiXactId. If so,
! * set up enough state (in the *frz output argument) to later execute and
! * WAL-log what we would need to do, and return TRUE. Return FALSE if nothing
! * is to be changed.
! *
! * Caller is responsible for setting the offset field, if appropriate. This
! * is only necessary if the freeze is to be WAL-logged.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
***************
*** 5425,5478 **** heap_inplace_update(Relation relation, HeapTuple tuple)
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
! * anyone's idea of the tuple state. Also, since we assume the tuple is
! * not HEAPTUPLE_DEAD, the fact that an XID is not still running allows us
! * to assume that it is either committed good or aborted, as appropriate;
! * so we need no external state checks to decide what to do. (This is good
! * because this function is applied during WAL recovery, when we don't have
! * access to any such state, and can't depend on the hint bits to be set.)
! * There is an exception we make which is to assume GetMultiXactIdMembers can
! * be called during recovery.
! *
* Similarly, cutoff_multi must be less than or equal to the smallest
* MultiXactId used by any transaction currently open.
*
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
! * Note: it might seem we could make the changes without exclusive lock, since
! * TransactionId read/write is assumed atomic anyway. However there is a race
! * condition: someone who just fetched an old XID that we overwrite here could
! * conceivably not finish checking the XID against pg_clog before we finish
! * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
! * exclusive lock ensures no other backend is in process of checking the
! * tuple status. Also, getting exclusive lock makes it safe to adjust the
! * infomask bits.
! *
! * NB: Cannot rely on hint bits here, they might not be set after a crash or
! * on a standby.
*/
bool
! heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
! MultiXactId cutoff_multi)
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, cutoff_xid))
{
! HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED will
* already be set here, but there's a small chance not.
*/
! Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! tuple->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
--- 5672,5715 ----
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
! * anyone's idea of the tuple state.
* Similarly, cutoff_multi must be less than or equal to the smallest
* MultiXactId used by any transaction currently open.
*
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
! * NB: It is not enough to set hint bits to indicate something is
! * committed/invalid -- they might not be set on a standby, or after crash
! * recovery. We really need to remove old xids.
*/
bool
! heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
! TransactionId cutoff_multi,
! xl_heap_freeze_tuple *frz)
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
+ frz->frzflags = 0;
+ frz->t_infomask2 = tuple->t_infomask2;
+ frz->t_infomask = tuple->t_infomask;
+ frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
+
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, cutoff_xid))
{
! frz->frzflags |= XLH_FREEZE_XMIN;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED will
* already be set here, but there's a small chance not.
*/
! frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
***************
*** 5489,5579 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
! if (!MultiXactIdIsValid(xid))
! {
! /* no xmax set, ignore */
! ;
! }
! else if (MultiXactIdPrecedes(xid, cutoff_multi))
! {
! /*
! * This old multi cannot possibly be running. If it was a locker
! * only, it can be removed without much further thought; but if it
! * contained an update, we need to preserve it.
! */
! if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
! freeze_xmax = true;
! else
! {
! TransactionId update_xid;
!
! update_xid = HeapTupleGetUpdateXid(tuple);
! /*
! * The multixact has an update hidden within. Get rid of it.
! *
! * If the update_xid is below the cutoff_xid, it necessarily
! * must be an aborted transaction. In a primary server, such
! * an Xmax would have gotten marked invalid by
! * HeapTupleSatisfiesVacuum, but in a replica that is not
! * called before we are, so deal with it in the same way.
! *
! * If not below the cutoff_xid, then the tuple would have been
! * pruned by vacuum, if the update committed long enough ago,
! * and we wouldn't be freezing it; so it's either recently
! * committed, or in-progress. Deal with this by setting the
! * Xmax to the update Xid directly and remove the IS_MULTI
! * bit. (We know there cannot be running lockers in this
! * multi, because it's below the cutoff_multi value.)
! */
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! {
! Assert(InRecovery || TransactionIdDidAbort(update_xid));
! freeze_xmax = true;
! }
! else
! {
! Assert(InRecovery || !TransactionIdIsInProgress(update_xid));
! tuple->t_infomask &= ~HEAP_XMAX_BITS;
! HeapTupleHeaderSetXmax(tuple, update_xid);
! changed = true;
! }
! }
}
! else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
{
! /* newer than the cutoff, so don't touch it */
! ;
}
else
{
! TransactionId update_xid;
!
! /*
! * This is a multixact which is not marked LOCK_ONLY, but which
! * is newer than the cutoff_multi. If the update_xid is below the
! * cutoff_xid point, then we can just freeze the Xmax in the
! * tuple, removing it altogether. This seems simple, but there
! * are several underlying assumptions:
! *
! * 1. A tuple marked by an multixact containing a very old
! * committed update Xid would have been pruned away by vacuum; we
! * wouldn't be freezing this tuple at all.
! *
! * 2. There cannot possibly be any live locking members remaining
! * in the multixact. This is because if they were alive, the
! * update's Xid would had been considered, via the lockers'
! * snapshot's Xmin, as part the cutoff_xid.
! *
! * 3. We don't create new MultiXacts via MultiXactIdExpand() that
! * include a very old aborted update Xid: in that function we only
! * include update Xids corresponding to transactions that are
! * committed or in-progress.
! */
! update_xid = HeapTupleGetUpdateXid(tuple);
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! freeze_xmax = true;
}
}
else if (TransactionIdIsNormal(xid) &&
--- 5726,5760 ----
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
! TransactionId newxmax;
! uint16 flags;
! newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
! cutoff_xid, cutoff_multi, &flags);
! if (flags & FRM_INVALIDATE_XMAX)
! freeze_xmax = true;
! else if (flags & FRM_RETURN_IS_XID)
! {
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->xmax = newxmax;
! if (flags & FRM_MARK_COMMITTED)
! frz->t_infomask |= HEAP_XMAX_COMMITTED;
! changed = true;
}
! else if (flags & FRM_RETURN_IS_MULTI)
{
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->xmax = newxmax;
!
! GetMultiXactIdHintBits(newxmax,
! &frz->t_infomask,
! &frz->t_infomask2);
! changed = true;
}
else
{
! Assert(flags & FRM_NOOP);
}
}
else if (TransactionIdIsNormal(xid) &&
***************
*** 5584,5600 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (freeze_xmax)
{
! HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
/*
* The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
* LOCKED. Normalize to INVALID just to be sure no one gets confused.
* Also get rid of the HEAP_KEYS_UPDATED bit.
*/
! tuple->t_infomask &= ~HEAP_XMAX_BITS;
! tuple->t_infomask |= HEAP_XMAX_INVALID;
! HeapTupleHeaderClearHotUpdated(tuple);
! tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}
--- 5765,5781 ----
if (freeze_xmax)
{
! frz->xmax = InvalidTransactionId;
/*
* The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
* LOCKED. Normalize to INVALID just to be sure no one gets confused.
* Also get rid of the HEAP_KEYS_UPDATED bit.
*/
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->t_infomask |= HEAP_XMAX_INVALID;
! frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
! frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}
***************
*** 5614,5629 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
! HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
else
! HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED
* will already be set here, but there's a small chance not.
*/
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! tuple->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
}
--- 5795,5810 ----
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
! frz->frzflags |= XLH_INVALID_XVAC;
else
! frz->frzflags |= XLH_FREEZE_XVAC;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED
* will already be set here, but there's a small chance not.
*/
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
! frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
}
***************
*** 5632,5637 **** heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
--- 5813,5880 ----
}
/*
+ * heap_execute_freeze_tuple
+ * Execute the prepared freezing of a tuple.
+ *
+ * Caller is responsible for ensuring that no other backend can access the
+ * storage underlying this tuple, either by holding an exclusive lock on the
+ * buffer containing it (which is what lazy VACUUM does), or by having it
+ * in private storage (which is what CLUSTER and friends do).
+ *
+ * Note: it might seem we could make the changes without exclusive lock, since
+ * TransactionId read/write is assumed atomic anyway. However there is a race
+ * condition: someone who just fetched an old XID that we overwrite here could
+ * conceivably not finish checking the XID against pg_clog before we finish
+ * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
+ * exclusive lock ensures no other backend is in process of checking the
+ * tuple status. Also, getting exclusive lock makes it safe to adjust the
+ * infomask bits.
+ *
+ * NB: All code in here must be safe to execute during crash recovery!
+ */
+ void
+ heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+ {
+ if (frz->frzflags & XLH_FREEZE_XMIN)
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ HeapTupleHeaderSetXmax(tuple, frz->xmax);
+
+ if (frz->frzflags & XLH_FREEZE_XVAC)
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ if (frz->frzflags & XLH_INVALID_XVAC)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+
+ tuple->t_infomask = frz->t_infomask;
+ tuple->t_infomask2 = frz->t_infomask2;
+ }
+
+ /*
+ * heap_freeze_tuple - freeze a tuple in place, without WAL logging.
+ *
+ * Useful for callers like CLUSTER that perform their own WAL logging.
+ */
+ bool
+ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ TransactionId cutoff_multi)
+ {
+ xl_heap_freeze_tuple frz;
+ bool do_freeze;
+
+ do_freeze = heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi, &frz);
+
+ /*
+ * Note that because this is not a WAL-logged operation, we don't need to
+ * fill in the offset in the freeze record.
+ */
+
+ if (do_freeze)
+ heap_execute_freeze_tuple(tuple, &frz);
+ return do_freeze;
+ }
+
+ /*
* For a given MultiXactId, return the hint bits that should be set in the
* tuple's infomask.
*
***************
*** 5934,5949 **** heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
}
else if (MultiXactIdPrecedes(multi, cutoff_multi))
return true;
- else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
- {
- /* only-locker multis don't need internal examination */
- ;
- }
else
{
! if (TransactionIdPrecedes(HeapTupleGetUpdateXid(tuple),
! cutoff_xid))
! return true;
}
}
else
--- 6177,6202 ----
}
else if (MultiXactIdPrecedes(multi, cutoff_multi))
return true;
else
{
! MultiXactMember *members;
! int nmembers;
! int i;
!
! /* need to check whether any member of the mxact is too old */
!
! nmembers = GetMultiXactIdMembers(multi, &members, false);
!
! for (i = 0; i < nmembers; i++)
! {
! if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
! {
! pfree(members);
! return true;
! }
! }
! if (nmembers > 0)
! pfree(members);
}
}
else
***************
*** 6193,6237 **** log_heap_clean(Relation reln, Buffer buffer,
}
/*
! * Perform XLogInsert for a heap-freeze operation. Caller must already
! * have modified the buffer and marked it dirty.
*/
XLogRecPtr
! log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! OffsetNumber *offsets, int offcnt)
{
! xl_heap_freeze xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
/* nor when there are no tuples to freeze */
! Assert(offcnt > 0);
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.cutoff_xid = cutoff_xid;
! xlrec.cutoff_multi = cutoff_multi;
rdata[0].data = (char *) &xlrec;
! rdata[0].len = SizeOfHeapFreeze;
rdata[0].buffer = InvalidBuffer;
rdata[0].next = &(rdata[1]);
/*
! * The tuple-offsets array is not actually in the buffer, but pretend that
! * it is. When XLogInsert stores the whole buffer, the offsets array need
* not be stored too.
*/
! rdata[1].data = (char *) offsets;
! rdata[1].len = offcnt * sizeof(OffsetNumber);
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE, rdata);
return recptr;
}
--- 6446,6489 ----
}
/*
! * Perform XLogInsert for a heap-freeze operation. Caller must have already
! * modified the buffer and marked it dirty.
*/
XLogRecPtr
! log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
! xl_heap_freeze_tuple *tuples, int ntuples)
{
! xl_heap_freeze_page xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
/* nor when there are no tuples to freeze */
! Assert(ntuples > 0);
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.cutoff_xid = cutoff_xid;
! xlrec.ntuples = ntuples;
rdata[0].data = (char *) &xlrec;
! rdata[0].len = SizeOfHeapFreezePage;
rdata[0].buffer = InvalidBuffer;
rdata[0].next = &(rdata[1]);
/*
! * The freeze plan array is not actually in the buffer, but pretend that
! * it is. When XLogInsert stores the whole buffer, the freeze plan need
* not be stored too.
*/
! rdata[1].data = (char *) tuples;
! rdata[1].len = ntuples * sizeof(xl_heap_freeze_tuple);
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
! recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE, rdata);
return recptr;
}
***************
*** 6839,6902 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
}
- static void
- heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
- {
- xl_heap_freeze *xlrec = (xl_heap_freeze *) XLogRecGetData(record);
- TransactionId cutoff_xid = xlrec->cutoff_xid;
- MultiXactId cutoff_multi = xlrec->cutoff_multi;
- Buffer buffer;
- Page page;
-
- /*
- * In Hot Standby mode, ensure that there's no queries running which still
- * consider the frozen xids as running.
- */
- if (InHotStandby)
- ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
-
- /* If we have a full-page image, restore it and we're done */
- if (record->xl_info & XLR_BKP_BLOCK(0))
- {
- (void) RestoreBackupBlock(lsn, record, 0, false, false);
- return;
- }
-
- buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
- if (!BufferIsValid(buffer))
- return;
- page = (Page) BufferGetPage(buffer);
-
- if (lsn <= PageGetLSN(page))
- {
- UnlockReleaseBuffer(buffer);
- return;
- }
-
- if (record->xl_len > SizeOfHeapFreeze)
- {
- OffsetNumber *offsets;
- OffsetNumber *offsets_end;
-
- offsets = (OffsetNumber *) ((char *) xlrec + SizeOfHeapFreeze);
- offsets_end = (OffsetNumber *) ((char *) xlrec + record->xl_len);
-
- while (offsets < offsets_end)
- {
- /* offsets[] entries are one-based */
- ItemId lp = PageGetItemId(page, *offsets);
- HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
-
- (void) heap_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
- offsets++;
- }
- }
-
- PageSetLSN(page, lsn);
- MarkBufferDirty(buffer);
- UnlockReleaseBuffer(buffer);
- }
-
/*
* Replay XLOG_HEAP2_VISIBLE record.
*
--- 7091,7096 ----
***************
*** 7011,7016 **** heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
--- 7205,7267 ----
}
}
+ /*
+ * Replay XLOG_HEAP2_FREEZE_PAGE records
+ */
+ static void
+ heap_xlog_freeze_page(XLogRecPtr lsn, XLogRecord *record)
+ {
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
+ TransactionId cutoff_xid = xlrec->cutoff_xid;
+ Buffer buffer;
+ Page page;
+ int ntup;
+
+ /*
+ * In Hot Standby mode, ensure that there are no queries running which still
+ * consider the frozen xids as running.
+ */
+ if (InHotStandby)
+ ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ /* now execute freeze plan for each frozen tuple */
+ for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ {
+ xl_heap_freeze_tuple *xlrec_tp;
+ ItemId lp;
+ HeapTupleHeader tuple;
+
+ xlrec_tp = &xlrec->tuples[ntup];
+ lp = PageGetItemId(page, xlrec_tp->offset); /* offsets are one-based */
+ tuple = (HeapTupleHeader) PageGetItem(page, lp);
+
+ heap_execute_freeze_tuple(tuple, xlrec_tp);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
+
static void
heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
{
***************
*** 7874,7885 **** heap2_redo(XLogRecPtr lsn, XLogRecord *record)
switch (info & XLOG_HEAP_OPMASK)
{
- case XLOG_HEAP2_FREEZE:
- heap_xlog_freeze(lsn, record);
- break;
case XLOG_HEAP2_CLEAN:
heap_xlog_clean(lsn, record);
break;
case XLOG_HEAP2_CLEANUP_INFO:
heap_xlog_cleanup_info(lsn, record);
break;
--- 8125,8136 ----
switch (info & XLOG_HEAP_OPMASK)
{
case XLOG_HEAP2_CLEAN:
heap_xlog_clean(lsn, record);
break;
+ case XLOG_HEAP2_FREEZE_PAGE:
+ heap_xlog_freeze_page(lsn, record);
+ break;
case XLOG_HEAP2_CLEANUP_INFO:
heap_xlog_cleanup_info(lsn, record);
break;
*** a/src/backend/access/rmgrdesc/heapdesc.c
--- b/src/backend/access/rmgrdesc/heapdesc.c
***************
*** 131,153 **** heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
uint8 info = xl_info & ~XLR_INFO_MASK;
info &= XLOG_HEAP_OPMASK;
! if (info == XLOG_HEAP2_FREEZE)
{
! xl_heap_freeze *xlrec = (xl_heap_freeze *) rec;
! appendStringInfo(buf, "freeze: rel %u/%u/%u; blk %u; cutoff xid %u multi %u",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->block,
! xlrec->cutoff_xid, xlrec->cutoff_multi);
}
! else if (info == XLOG_HEAP2_CLEAN)
{
! xl_heap_clean *xlrec = (xl_heap_clean *) rec;
! appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u remxid %u",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->block,
! xlrec->latestRemovedXid);
}
else if (info == XLOG_HEAP2_CLEANUP_INFO)
{
--- 131,153 ----
uint8 info = xl_info & ~XLR_INFO_MASK;
info &= XLOG_HEAP_OPMASK;
! if (info == XLOG_HEAP2_CLEAN)
{
! xl_heap_clean *xlrec = (xl_heap_clean *) rec;
! appendStringInfo(buf, "clean: rel %u/%u/%u; blk %u remxid %u",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->block,
! xlrec->latestRemovedXid);
}
! else if (info == XLOG_HEAP2_FREEZE_PAGE)
{
! xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
! appendStringInfo(buf, "freeze_page: rel %u/%u/%u; blk %u; cutoff xid %u ntuples %u",
xlrec->node.spcNode, xlrec->node.dbNode,
xlrec->node.relNode, xlrec->block,
! xlrec->cutoff_xid, xlrec->ntuples);
}
else if (info == XLOG_HEAP2_CLEANUP_INFO)
{
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
***************
*** 286,292 **** static MemoryContext MXactContext = NULL;
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
- static MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int nmembers, MultiXactMember *members);
static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
--- 286,291 ----
***************
*** 344,350 **** MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
members[1].xid = xid2;
members[1].status = status2;
! newMulti = CreateMultiXactId(2, members);
debug_elog3(DEBUG2, "Create: %s",
mxid_to_string(newMulti, 2, members));
--- 343,349 ----
members[1].xid = xid2;
members[1].status = status2;
! newMulti = MultiXactIdCreateFromMembers(2, members);
debug_elog3(DEBUG2, "Create: %s",
mxid_to_string(newMulti, 2, members));
***************
*** 407,413 **** MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
*/
member.xid = xid;
member.status = status;
! newMulti = CreateMultiXactId(1, &member);
debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
multi, newMulti);
--- 406,412 ----
*/
member.xid = xid;
member.status = status;
! newMulti = MultiXactIdCreateFromMembers(1, &member);
debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
multi, newMulti);
***************
*** 459,465 **** MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
newMembers[j].xid = xid;
newMembers[j++].status = status;
! newMulti = CreateMultiXactId(j, newMembers);
pfree(members);
pfree(newMembers);
--- 458,464 ----
newMembers[j].xid = xid;
newMembers[j++].status = status;
! newMulti = MultiXactIdCreateFromMembers(j, newMembers);
pfree(members);
pfree(newMembers);
***************
*** 664,679 **** ReadNextMultiXactId(void)
}
/*
! * CreateMultiXactId
! * Make a new MultiXactId
*
* Make XLOG, SLRU and cache entries for a new MultiXactId, recording the
* given TransactionIds as members. Returns the newly created MultiXactId.
*
* NB: the passed members[] array will be sorted in-place.
*/
! static MultiXactId
! CreateMultiXactId(int nmembers, MultiXactMember *members)
{
MultiXactId multi;
MultiXactOffset offset;
--- 663,678 ----
}
/*
! * MultiXactIdCreateFromMembers
! * Make a new MultiXactId from the specified set of members
*
* Make XLOG, SLRU and cache entries for a new MultiXactId, recording the
* given TransactionIds as members. Returns the newly created MultiXactId.
*
* NB: the passed members[] array will be sorted in-place.
*/
! MultiXactId
! MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
{
MultiXactId multi;
MultiXactOffset offset;
***************
*** 760,766 **** CreateMultiXactId(int nmembers, MultiXactMember *members)
* RecordNewMultiXact
* Write info about a new multixact into the offsets and members files
*
! * This is broken out of CreateMultiXactId so that xlog replay can use it.
*/
static void
RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
--- 759,766 ----
* RecordNewMultiXact
* Write info about a new multixact into the offsets and members files
*
! * This is broken out of MultiXactIdCreateFromMembers so that xlog replay can
! * use it.
*/
static void
RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 424,429 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 424,430 ----
Buffer vmbuffer = InvalidBuffer;
BlockNumber next_not_all_visible_block;
bool skipping_all_visible_blocks;
+ xl_heap_freeze_tuple *frozen;
pg_rusage_init(&ru0);
***************
*** 446,451 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 447,453 ----
vacrelstats->latestRemovedXid = InvalidTransactionId;
lazy_space_alloc(vacrelstats, nblocks);
+ frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
/*
* We want to skip pages that don't require vacuuming according to the
***************
*** 500,506 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- OffsetNumber frozen[MaxOffsetNumber];
int nfrozen;
Size freespace;
bool all_visible_according_to_vm;
--- 502,507 ----
***************
*** 890,898 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
*/
! if (heap_freeze_tuple(tuple.t_data, FreezeLimit,
! MultiXactCutoff))
! frozen[nfrozen++] = offnum;
}
} /* scan along page */
--- 891,899 ----
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
*/
! if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
! MultiXactCutoff, &frozen[nfrozen]))
! frozen[nfrozen++].offset = offnum;
}
} /* scan along page */
***************
*** 903,917 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
*/
if (nfrozen > 0)
{
MarkBufferDirty(buf);
if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
recptr = log_heap_freeze(onerel, buf, FreezeLimit,
! MultiXactCutoff, frozen, nfrozen);
PageSetLSN(page, recptr);
}
}
/*
--- 904,936 ----
*/
if (nfrozen > 0)
{
+ START_CRIT_SECTION();
+
MarkBufferDirty(buf);
+
+ /* execute collected freezes */
+ for (i = 0; i < nfrozen; i++)
+ {
+ ItemId itemid;
+ HeapTupleHeader htup;
+
+ itemid = PageGetItemId(page, frozen[i].offset);
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ heap_execute_freeze_tuple(htup, &frozen[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
recptr = log_heap_freeze(onerel, buf, FreezeLimit,
! frozen, nfrozen);
PageSetLSN(page, recptr);
}
+
+ END_CRIT_SECTION();
}
/*
***************
*** 1012,1017 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
--- 1031,1038 ----
RecordPageWithFreeSpace(onerel, blkno, freespace);
}
+ pfree(frozen);
+
/* save stats for use later */
vacrelstats->scanned_tuples = num_tuples;
vacrelstats->tuples_deleted = tups_vacuumed;
*** a/src/include/access/heapam_xlog.h
--- b/src/include/access/heapam_xlog.h
***************
*** 48,56 ****
* the ones above associated with RM_HEAP_ID. XLOG_HEAP_OPMASK applies to
* these, too.
*/
! #define XLOG_HEAP2_FREEZE 0x00
#define XLOG_HEAP2_CLEAN 0x10
! /* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
--- 48,56 ----
* the ones above associated with RM_HEAP_ID. XLOG_HEAP_OPMASK applies to
* these, too.
*/
! /* 0x00 is free, was XLOG_HEAP2_FREEZE */
#define XLOG_HEAP2_CLEAN 0x10
! #define XLOG_HEAP2_FREEZE_PAGE 0x20
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
***************
*** 270,286 **** typedef struct xl_heap_inplace
#define SizeOfHeapInplace (offsetof(xl_heap_inplace, target) + SizeOfHeapTid)
! /* This is what we need to know about tuple freezing during vacuum */
! typedef struct xl_heap_freeze
{
RelFileNode node;
BlockNumber block;
TransactionId cutoff_xid;
! MultiXactId cutoff_multi;
! /* TUPLE OFFSET NUMBERS FOLLOW AT THE END */
! } xl_heap_freeze;
! #define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_multi) + sizeof(MultiXactId))
/* This is what we need to know about setting a visibility map bit */
typedef struct xl_heap_visible
--- 270,305 ----
#define SizeOfHeapInplace (offsetof(xl_heap_inplace, target) + SizeOfHeapTid)
! /*
! * a 'freeze plan' struct that represents what we need to know about a single
! * tuple being frozen during vacuum
! */
! #define XLH_FREEZE_XMIN 0x01
! #define XLH_FREEZE_XVAC 0x02
! #define XLH_INVALID_XVAC 0x04
!
! typedef struct xl_heap_freeze_tuple
! {
! TransactionId xmax;
! OffsetNumber offset;
! uint16 t_infomask2;
! uint16 t_infomask;
! uint8 frzflags;
! } xl_heap_freeze_tuple;
!
! /*
! * This is what we need to know about a block being frozen during vacuum
! */
! typedef struct xl_heap_freeze_page
{
RelFileNode node;
BlockNumber block;
TransactionId cutoff_xid;
! uint16 ntuples;
! xl_heap_freeze_tuple tuples[FLEXIBLE_ARRAY_MEMBER];
! } xl_heap_freeze_page;
! #define SizeOfHeapFreezePage offsetof(xl_heap_freeze_page, tuples)
/* This is what we need to know about setting a visibility map bit */
typedef struct xl_heap_visible
***************
*** 331,338 **** extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, MultiXactId cutoff_multi,
! OffsetNumber *offsets, int offcnt);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid);
extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
--- 350,363 ----
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
! TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
! int ntuples);
! extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
! TransactionId cutoff_xid,
! TransactionId cutoff_multi,
! xl_heap_freeze_tuple *frz);
! extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
! xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid);
extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
*** a/src/include/access/multixact.h
--- b/src/include/access/multixact.h
***************
*** 81,86 **** extern MultiXactId MultiXactIdCreate(TransactionId xid1,
--- 81,89 ----
MultiXactStatus status2);
extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
MultiXactStatus status);
+ extern MultiXactId MultiXactIdCreateFromMembers(int nmembers,
+ MultiXactMember *members);
+
extern MultiXactId ReadNextMultiXactId(void);
extern bool MultiXactIdIsRunning(MultiXactId multi);
extern void MultiXactIdSetOldestMember(void);
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 55,61 **** typedef struct BkpBlock
/*
* Each page of XLOG file has a header like this:
*/
! #define XLOG_PAGE_MAGIC 0xD079 /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
--- 55,61 ----
/*
* Each page of XLOG file has a header like this:
*/
! #define XLOG_PAGE_MAGIC 0xD07A /* can be used as WAL version indicator */
typedef struct XLogPageHeaderData
{
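The heart of the patch above is the split of tuple freezing into a "prepare" step (heap_prepare_freeze_tuple, which only records what would change) and an "execute" step (heap_execute_freeze_tuple, which applies the recorded plan), so that WAL replay can run the exact same apply step the primary ran. A minimal standalone sketch of that split; the types and names here are simplified stand-ins, not the real PostgreSQL structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the PostgreSQL types; names are illustrative. */
typedef uint32_t TransactionId;

#define FrozenTransactionId  ((TransactionId) 2)
#define XLH_FREEZE_XMIN      0x01

typedef struct {
    TransactionId xmin;
    TransactionId xmax;
    uint16_t      t_infomask;
} TupleHeader;

typedef struct {
    TransactionId xmax;
    uint16_t      t_infomask;
    uint8_t       frzflags;
} FreezePlan;

/* "Prepare" step: decide what would change and record it in the plan,
 * without touching the tuple.  Returns true if anything is to be done. */
static bool
prepare_freeze(const TupleHeader *tup, TransactionId cutoff_xid,
               FreezePlan *frz)
{
    bool changed = false;

    frz->frzflags = 0;
    frz->xmax = tup->xmax;
    frz->t_infomask = tup->t_infomask;

    /* the real code uses a wraparound-aware XID comparison here */
    if (tup->xmin != FrozenTransactionId && tup->xmin < cutoff_xid)
    {
        frz->frzflags |= XLH_FREEZE_XMIN;
        changed = true;
    }
    return changed;
}

/* "Execute" step: apply a recorded plan.  Because it takes no decisions of
 * its own, the same routine can run in normal operation and in WAL replay. */
static void
execute_freeze(TupleHeader *tup, const FreezePlan *frz)
{
    if (frz->frzflags & XLH_FREEZE_XMIN)
        tup->xmin = FrozenTransactionId;
    tup->xmax = frz->xmax;
    tup->t_infomask = frz->t_infomask;
}
```

Because the execute step only replays recorded bits, a standby applies exactly what the primary computed, which is what makes the new XLOG_HEAP2_FREEZE_PAGE record safe during crash recovery.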
On 2013-12-11 14:00:05 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote:
I don't so much have a problem with exporting CreateMultiXactId(), just
with exporting it under its current name. It's already quirky to have
both MultiXactIdCreate and CreateMultiXactId() in multixact.c but
exporting it imo goes too far.

MultiXactIdCreateFromMembers(int, MultiXactMember *)?
Works for me.
Andres mentioned the idea of sharing some code between
heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't
explored that.

My idea would just be to have heap_tuple_needs_freeze() call
heap_prepare_freeze_tuple() and check whether it returns true. Yes,
that's slightly more expensive than the current
heap_tuple_needs_freeze(), but it's only called when we couldn't get a
cleanup lock on a page, so that seems ok.

Doesn't seem a completely bad idea, but let's leave it for a separate
patch. This should be changed in master only IMV anyway, while the rest
of this patch is to be backpatched to 9.3.
I am not so sure it shouldn't be backpatched together with this. We now
have similar complex logic in both functions.
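The logic now duplicated in both functions is a walk over the multixact's members, comparing each against cutoff_xid with a wraparound-aware test. A self-contained sketch of that check (hypothetical helper names; the real code uses GetMultiXactIdMembers() and TransactionIdPrecedes(), and the latter additionally special-cases permanent XIDs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Wraparound-aware "precedes", after PostgreSQL's TransactionIdPrecedes():
 * normal XIDs live on a 2^32 circle, so compare via the signed difference. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* The check the fix adds: does any member of an (already fetched)
 * multixact predate the freeze cutoff? */
static bool
any_member_too_old(const TransactionId *members, size_t nmembers,
                   TransactionId cutoff_xid)
{
    for (size_t i = 0; i < nmembers; i++)
    {
        if (xid_precedes(members[i], cutoff_xid))
            return true;
    }
    return false;
}
```

This is the member scan that heap_tuple_needs_freeze() previously skipped for update multis, letting an old member XID outlive its pg_clog storage.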
! if (ISUPDATE_from_mxstatus(members[i].status) &&
! !TransactionIdDidAbort(members[i].xid))

It makes me wary to see a DidAbort() without a previous InProgress()
call. Also, after we crashed, doesn't DidAbort() possibly return
false for transactions that were in progress before we crashed? At
least that's how I always understood it, and how tqual.c is written.

Yes, that's correct. But note that here we're not doing a tuple
liveliness test, which is what tqual.c is doing. What we do with this
info is to keep the Xid as part of the multi if it's still running or
committed. We also keep it if the xact crashed, which is fine because
the Xid will be removed by some later step. If we know for certain that
the update Xid is aborted, then we can ignore it, but this is just an
optimization and not needed for correctness.
But why deviate that way? It doesn't seem to save us much?
One interesting bit is that we might end up creating singleton
MultiXactIds when freezing, if there's no updater and there's a running
locker. We could avoid this (i.e. mark the tuple as locked by a single
Xid) but it would complicate FreezeMultiXactId's API and it's unlikely
to occur with any frequency anyway.
Yea, that seems completely fine.
I don't think there's a need to check for
TransactionIdIsCurrentTransactionId() - vacuum can explicitly *not* be
run inside a transaction.

Keep in mind that freezing can also happen for tuples handled during a
table-rewrite operation such as CLUSTER.
Good point.
if (tuple->t_infomask & HEAP_MOVED_OFF)
! frz->frzflags |= XLH_FREEZE_XVAC;
else
! frz->frzflags |= XLH_INVALID_XVAC;

Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and
XLH_INVALID_XVAC exactly the other way round? I really don't understand
the moved in/off, since the code has been gone longer than I've followed
the code...

Yep, fixed.
I wonder how many of the HEAP_MOVED_* cases around are actually
correct... What was the last version those were generated? 8.4?
(I was toying with the "desc"
code because it misbehaves when applied on records as they are created,
as opposed to being applied on records as they are replayed. I'm pretty
sure everyone already knows about this, and it's the reason why
everybody has skimped from examining arrays of things stored in followup
data records. I was naive enough to write code that tries to decode the
followup record that contains the members of the multixact we're
creating, which works fine during replay but gets them completely wrong
during regular operation. This is the third time I'm surprised by this
misbehavior; blame my bad memory for not remembering that it's not
supposed to work in the first place.)
I am not really sure what you are talking about. That you cannot
properly decode records before they have been processed by XLogInsert()?
If so, yes, that's pretty clear and I am pretty sure it will break in
lots of places if you try?
Right now there is one case in this code that returns
FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would also
work to keep the Multi as is and return FRM_NOOP instead; and it also
returns FRM_NOOP in one case when we could return FRM_INVALIDATE_XMAX
instead. Neither does any great damage, but there is a consideration
that future examiners of the tuple would have to resolve the MultiXact
by themselves (==> performance hit). On the other hand, returning
INVALIDATE causes the block to be dirtied, which is undesirable if not
already dirty.
Otherwise it will be marked dirty the next time somebody reads the page, so I
don't think this is problematic.
! {
! if (ISUPDATE_from_mxstatus(members[i].status))
! {
! /*
! * It's an update; should we keep it? If the transaction is known
! * aborted then it's okay to ignore it, otherwise not. (Note this
! * is just an optimization and not needed for correctness, so it's
! * okay to get this test wrong; for example, in case an updater is
! * crashed, or a running transaction in the process of aborting.)
! */
! if (!TransactionIdDidAbort(members[i].xid))
! {
! newmembers[nnewmembers++] = members[i];
! Assert(!TransactionIdIsValid(update_xid));
!
! /*
! * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
! * the Xid in cache. Again, this is just an optimization, so
! * it's not a problem if the transaction is still running and
! * in the process of committing.
! */
! if (TransactionIdDidCommit(update_xid))
! update_committed = true;
!
! update_xid = newmembers[i].xid;
! }
I don't think the conclusions here are correct - we might be setting
HEAP_XMAX_COMMITTED a smidge too early that way. If the updating
transaction is in progress, there's the situation that we have updated
the clog, but have not yet removed ourselves from the procarray. I.e. a
situation in which TransactionIdIsInProgress() and
TransactionIdDidCommit() both return true. Afaik it is only correct to
set HEAP_XMAX_COMMITTED once TransactionIdIsInProgress() returns false.
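The race described here can be made concrete with a tiny model: during commit there is a window in which pg_clog already says "committed" but the backend has not yet removed itself from the procarray, so both observations are true at once. A hedged sketch of the check order being argued for (mocked state, not the real procarray/clog API):

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the two facts a backend can observe about another transaction.
 * In the window between a committer updating pg_clog and removing itself
 * from the procarray, both can be true at once. */
typedef struct {
    bool in_progress;  /* still listed in the procarray */
    bool did_commit;   /* pg_clog already says committed */
} TxnState;

/* The ordering the review asks for: a commit hint bit may only be
 * trusted once the transaction is no longer in progress. */
static bool
may_set_committed_hint(const TxnState *s)
{
    if (s->in_progress)       /* must be checked first */
        return false;
    return s->did_commit;
}
```

Checking DidCommit() alone would return true inside the race window, which is exactly the premature hint-bit setting the review warns against.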
! /*
! * Checking for very old update Xids is critical: if the update
! * member of the multi is older than cutoff_xid, we must remove
! * it, because otherwise a later liveliness check could attempt
! * pg_clog access for a page that was truncated away by the
! * current vacuum. Note that if the update had committed, we
! * wouldn't be freezing this tuple because it would have gotten
! * removed (HEAPTUPLE_DEAD) by HeapTupleSatisfiesVacuum; so it
! * either aborted or crashed. Therefore, ignore this update_xid.
! */
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! {
! update_xid = InvalidTransactionId;
! update_committed = false;
I vote for an Assert(!TransactionIdDidCommit(update_xid)) here.
! else
! {
! /*
! * Create a new multixact with the surviving members of the previous
! * one, to set as new Xmax in the tuple.
! *
! * If this is the first possibly-multixact-able operation in the
! * current transaction, set my per-backend OldestMemberMXactId
! * setting. We can be certain that the transaction will never become a
! * member of any older MultiXactIds than that.
! */
! MultiXactIdSetOldestMember();
! xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
! *flags |= FRM_RETURN_IS_MULTI;
! }
I worry that this MultiXactIdSetOldestMember() will be problematic in
longrunning vacuums like a anti-wraparound vacuum of a huge
table. There's no real need to set MultiXactIdSetOldestMember() here,
since we will not become the member of a multi. So I think you should
either move the Assert() in MultiXactIdCreateFromMembers() to its other
callers, or add a parameter to skip it.
! /*
! * heap_prepare_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
! * are older than the specified cutoff XID and cutoff MultiXactId. If so,
! * setup enough state (in the *frz output argument) to later execute and
! * WAL-log what we would need to do, and return TRUE. Return FALSE if nothing
! * is to be changed.
! *
! * Caller is responsible for setting the offset field, if appropriate. This
! * is only necessary if the freeze is to be WAL-logged.
I'd leave off that second sentence, if you want to freeze a whole page
but not WAL log it, you'd need to set offset as well...
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
! else if (flags & FRM_RETURN_IS_MULTI)
{
! frz->t_infomask &= ~HEAP_XMAX_BITS;
! frz->xmax = newxmax;
!
! GetMultiXactIdHintBits(newxmax,
! &frz->t_infomask,
! &frz->t_infomask2);
! changed = true;
}
I worry that all these multixact accesses will create huge performance
problems due to the inefficiency of the multixactid cache. If you scan a
huge table there very well might be millions of different multis we
touch and afaics most of them will end up in the multixactid cache. That
can't end well.
I think we need to either regularly delete that cache when it goes past,
say, 100 entries, or just bypass it entirely.
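The cap-and-discard idea can be sketched as follows. This is an illustrative model only: the real per-backend cache in multixact.c is a memory-context-backed list, and CACHE_CAP here merely names the "100 entries" threshold suggested in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_CAP 100   /* the "100 entries" threshold floated above */

typedef struct {
    uint32_t keys[CACHE_CAP];
    size_t   n;
} MultiCache;

/* Remember a multixact id; once the cache would overflow the cap,
 * throw the whole thing away rather than letting it grow without bound
 * during a long scan that touches millions of distinct multis. */
static void
cache_put(MultiCache *c, uint32_t key)
{
    if (c->n >= CACHE_CAP)
        memset(c, 0, sizeof(*c));   /* wholesale reset */
    c->keys[c->n++] = key;
}
```

Wholesale reset trades occasional re-fetches from the SLRU for a hard bound on per-backend memory and lookup cost, which matters for anti-wraparound vacuums of huge tables.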
Greetings,
Andres Freund
Andres Freund wrote:
On 2013-12-11 14:00:05 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote:
Andres mentioned the idea of sharing some code between
heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't
explored that.

My idea would just be to have heap_tuple_needs_freeze() call
heap_prepare_freeze_tuple() and check whether it returns true. Yes,
that's slightly more expensive than the current
heap_tuple_needs_freeze(), but it's only called when we couldn't get a
cleanup lock on a page, so that seems ok.

Doesn't seem a completely bad idea, but let's leave it for a separate
patch. This should be changed in master only IMV anyway, while the rest
of this patch is to be backpatched to 9.3.

I am not so sure it shouldn't be backpatched together with this. We now
have similar complex logic in both functions.
Any other opinions on this?
! if (ISUPDATE_from_mxstatus(members[i].status) &&
! !TransactionIdDidAbort(members[i].xid))

It makes me wary to see a DidAbort() without a previous InProgress()
call. Also, after we crashed, doesn't DidAbort() possibly return
false for transactions that were in progress before we crashed? At
least that's how I always understood it, and how tqual.c is written.

Yes, that's correct. But note that here we're not doing a tuple
liveliness test, which is what tqual.c is doing. What we do with this
info is to keep the Xid as part of the multi if it's still running or
committed. We also keep it if the xact crashed, which is fine because
the Xid will be removed by some later step. If we know for certain that
the update Xid is aborted, then we can ignore it, but this is just an
optimization and not needed for correctness.

But why deviate that way? It doesn't seem to save us much?
Well, it does save something -- there not being a live update means we
are likely to be able to invalidate the Xmax completely if there are no
lockers; and even in the case where there are lockers, we will be able
to set LOCK_ONLY which means faster access in several places.
if (tuple->t_infomask & HEAP_MOVED_OFF)
! frz->frzflags |= XLH_FREEZE_XVAC;
else
! frz->frzflags |= XLH_INVALID_XVAC;

Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and
XLH_INVALID_XVAC exactly the other way round? I really don't understand
the moved in/off, since the code has been gone longer than I've followed
the code...

Yep, fixed.
I wonder how many of the HEAP_MOVED_* cases around are actually
correct... What was the last version those were generated? 8.4?
8.4, yeah, before VACUUM FULL got rewritten. I don't think anybody
tests these code paths, because it involves databases that were upgraded
straight from 8.4 and which in their 8.4 time saw VACUUM FULL executed.
I think we should be considering removing these things, or at least have
some mechanism to ensure they don't survive from pre-9.0 installs.
(I was toying with the "desc"
code because it misbehaves when applied on records as they are created,
as opposed to being applied on records as they are replayed. I'm pretty
sure everyone already knows about this, and it's the reason why
everybody has skimped from examining arrays of things stored in followup
data records. I was naive enough to write code that tries to decode the
followup record that contains the members of the multixact we're
creating, which works fine during replay but gets them completely wrong
during regular operation. This is the third time I'm surprised by this
misbehavior; blame my bad memory for not remembering that it's not
supposed to work in the first place.)

I am not really sure what you are talking about. That you cannot
properly decode records before they have been processed by XLogInsert()?
If so, yes, that's pretty clear and I am pretty sure it will break in
lots of places if you try?
Well, not sure about "lots of places". The only misbehavior I have seen
is in those desc routines. Of course, the redo routines might also
fail, but then there's no way for them to be running ...
Right now there is one case in this code that returns
FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would also
work to keep the Multi as is and return FRM_NOOP instead; and it also
returns FRM_NOOP in one case when we could return FRM_INVALIDATE_XMAX
instead. Neither does any great damage, but there is a consideration
that future examiners of the tuple would have to resolve the MultiXact
by themselves (==> performance hit). On the other hand, returning
INVALIDATE causes the block to be dirtied, which is undesirable if not
already dirty.

Otherwise it will be marked dirty the next time something reads the page, so I
don't think this is problematic.
Not necessarily. I mean, if somebody sees a multi, they might just
resolve it to its members and otherwise leave the page alone. Or in
some cases not even resolve to members (if it's LOCK_ONLY and old enough
to be behind the oldest visible multi).
! {
! if (ISUPDATE_from_mxstatus(members[i].status))
! {
! /*
! * It's an update; should we keep it? If the transaction is known
! * aborted then it's okay to ignore it, otherwise not. (Note this
! * is just an optimization and not needed for correctness, so it's
! * okay to get this test wrong; for example, in case an updater is
! * crashed, or a running transaction in the process of aborting.)
! */
! if (!TransactionIdDidAbort(members[i].xid))
! {
! newmembers[nnewmembers++] = members[i];
! Assert(!TransactionIdIsValid(update_xid));
!
! /*
! * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
! * the Xid in cache. Again, this is just an optimization, so
! * it's not a problem if the transaction is still running and
! * in the process of committing.
! */
! if (TransactionIdDidCommit(update_xid))
! update_committed = true;
!
! update_xid = newmembers[i].xid;
! }

I don't think the conclusions here are correct - we might be setting
HEAP_XMAX_COMMITTED a smudge too early that way. If the updating
transaction is in progress, there's the situation that we have updated
the clog, but have not yet removed ourselves from the procarray. I.e. a
situation in which TransactionIdIsInProgress() and
TransactionIdDidCommit() both return true. Afaik it is only correct to
set HEAP_XMAX_COMMITTED once TransactionIdIsInProgress() returns false.
Hmm ... Is there an actual difference? I mean, a transaction that
marked itself as committed in pg_clog cannot return to any other state,
regardless of what happens elsewhere.
! /*
! * Checking for very old update Xids is critical: if the update
! * member of the multi is older than cutoff_xid, we must remove
! * it, because otherwise a later liveliness check could attempt
! * pg_clog access for a page that was truncated away by the
! * current vacuum. Note that if the update had committed, we
! * wouldn't be freezing this tuple because it would have gotten
! * removed (HEAPTUPLE_DEAD) by HeapTupleSatisfiesVacuum; so it
! * either aborted or crashed. Therefore, ignore this update_xid.
! */
! if (TransactionIdPrecedes(update_xid, cutoff_xid))
! {
! update_xid = InvalidTransactionId;
! update_committed = false;

I vote for an Assert(!TransactionIdDidCommit(update_xid)) here.
Will add.
! else
! {
! /*
! * Create a new multixact with the surviving members of the previous
! * one, to set as new Xmax in the tuple.
! *
! * If this is the first possibly-multixact-able operation in the
! * current transaction, set my per-backend OldestMemberMXactId
! * setting. We can be certain that the transaction will never become a
! * member of any older MultiXactIds than that.
! */
! MultiXactIdSetOldestMember();
! xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
! *flags |= FRM_RETURN_IS_MULTI;
! }

I worry that this MultiXactIdSetOldestMember() will be problematic in
long-running vacuums like an anti-wraparound vacuum of a huge
table. There's no real need to call MultiXactIdSetOldestMember() here,
since we will not become the member of a multi. So I think you should
either move the Assert() in MultiXactIdCreateFromMembers() to its other
callers, or add a parameter to skip it.
I would like to have the Assert() work automatically, that is, check the
PROC_IN_VACUUM flag in MyProc->vacuumflags ... but this probably won't
work with CLUSTER. That said, I think we *should* call SetOldestMember
in CLUSTER. So maybe both things should be conditional on
PROC_IN_VACUUM.
(Either way it will be ugly.)
! /*
! * heap_prepare_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
! * are older than the specified cutoff XID and cutoff MultiXactId. If so,
! * setup enough state (in the *frz output argument) to later execute and
! * WAL-log what we would need to do, and return TRUE. Return FALSE if nothing
! * is to be changed.
! *
! * Caller is responsible for setting the offset field, if appropriate. This
! * is only necessary if the freeze is to be WAL-logged.

I'd leave off that second sentence; if you want to freeze a whole page
but not WAL log it, you'd need to set offset as well...
I can buy that.
I worry that all these multixact accesses will create huge performance
problems due to the inefficiency of the multixactid cache. If you scan a
huge table there very well might be millions of different multis we
touch and afaics most of them will end up in the multixactid cache. That
can't end well.
I think we need to either regularly delete that cache when it goes past,
say, 100 entries, or just bypass it entirely.
Delete the whole cache, or just prune it of the least recently used
entries? Maybe the cache should be a dlist instead of the open-coded
stuff that's there now; that would enable pruning of the oldest entries.
I think a blanket deletion might be a cure worse than the disease. I
see your point anyhow.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
On 2013-12-11 14:00:05 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote:
! if (ISUPDATE_from_mxstatus(members[i].status) &&
! !TransactionIdDidAbort(members[i].xid))

It makes me wary to see a DidAbort() without a previous InProgress()
call. Also, after we crashed, doesn't DidAbort() possibly return
false for transactions that were in progress before we crashed? At
least that's how I always understood it, and how tqual.c is written.

Yes, that's correct. But note that here we're not doing a tuple
liveliness test, which is what tqual.c is doing. What we do with this
info is to keep the Xid as part of the multi if it's still running or
committed. We also keep it if the xact crashed, which is fine because
the Xid will be removed by some later step. If we know for certain that
the update Xid is aborted, then we can ignore it, but this is just an
optimization and not needed for correctness.

But why deviate that way? It doesn't seem to save us much?
Well, it does save something -- there not being a live update means we
are likely to be able to invalidate the Xmax completely if there are no
lockers; and even in the case where there are lockers, we will be able
to set LOCK_ONLY which means faster access in several places.
What I mean is that we could just query TransactionIdIsInProgress() like
usual. In most of the cases it will be very cheap because of the
RecentXmin() check at its beginning.
I am not really sure what you are talking about. That you cannot
properly decode records before they have been processed by XLogInsert()?
If so, yes, that's pretty clear and I am pretty sure it will break in
lots of places if you try?

Well, not sure about "lots of places". The only misbehavior I have seen
is in those desc routines. Of course, the redo routines might also
fail, but then there's no way for them to be running ...
Hm. I would guess that e.g. displaying xl_xact_commit fails majorly.
Right now there is one case in this code that returns
FRM_INVALIDATE_XMAX when it's not strictly necessary, i.e. it would also
work to keep the Multi as is and return FRM_NOOP instead; and it also
returns FRM_NOOP in one case when we could return FRM_INVALIDATE_XMAX
instead. Neither does any great damage, but there is a consideration
that future examiners of the tuple would have to resolve the MultiXact
by themselves (==> performance hit). On the other hand, returning
INVALIDATE causes the block to be dirtied, which is undesirable if not
already dirty.

Otherwise it will be marked dirty the next time something reads the page, so I
don't think this is problematic.

Not necessarily. I mean, if somebody sees a multi, they might just
resolve it to its members and otherwise leave the page alone. Or in
some cases not even resolve to members (if it's LOCK_ONLY and old enough
to be behind the oldest visible multi).
But the work has to be done anyway, even if possibly slightly later?
! * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
! * the Xid in cache. Again, this is just an optimization, so
! * it's not a problem if the transaction is still running and
! * in the process of committing.
! */
! if (TransactionIdDidCommit(update_xid))
! update_committed = true;
!
! update_xid = newmembers[i].xid;
! }

I don't think the conclusions here are correct - we might be setting
HEAP_XMAX_COMMITTED a smudge too early that way. If the updating
transaction is in progress, there's the situation that we have updated
the clog, but have not yet removed ourselves from the procarray. I.e. a
situation in which TransactionIdIsInProgress() and
TransactionIdDidCommit() both return true. Afaik it is only correct to
set HEAP_XMAX_COMMITTED once TransactionIdIsInProgress() returns false.

Hmm ... Is there an actual difference? I mean, a transaction that
marked itself as committed in pg_clog cannot return to any other state,
regardless of what happens elsewhere.
But it could lead to other transactions seeing the row as committed, even
though it isn't really yet.
tqual.c sayeth:
* NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
* before TransactionIdDidCommit/TransactionIdDidAbort (which look in
* pg_clog). Otherwise we have a race condition: we might decide that a
* just-committed transaction crashed, because none of the tests succeed.
* xact.c is careful to record commit/abort in pg_clog before it unsets
* MyPgXact->xid in PGXACT array. That fixes that problem, but it also
* means there is a window where TransactionIdIsInProgress and
* TransactionIdDidCommit will both return true. If we check only
* TransactionIdDidCommit, we could consider a tuple committed when a
* later GetSnapshotData call will still think the originating transaction
* is in progress, which leads to application-level inconsistency. The
* upshot is that we gotta check TransactionIdIsInProgress first in all
* code paths, except for a few cases where we are looking at
* subtransactions of our own main transaction and so there can't be any
* race condition.
I don't think there's any reason to deviate from this pattern here. For
old xids TransactionIdIsInProgress() should be really cheap.
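The window Andres describes can be modelled with a toy commit sequence. The names below are illustrative stand-ins for the PGXACT/pg_clog machinery, not real APIs: the clog is updated first, the procarray entry is cleared second, so only a checker that consults in-progress state first is race-free.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the commit sequence: "pg_clog" is updated before the
 * backend clears itself from the "procarray", so there is a window in
 * which both "did commit" and "in progress" are true.
 */
typedef struct
{
    bool clog_committed;   /* stands in for TransactionIdDidCommit() */
    bool in_procarray;     /* stands in for TransactionIdIsInProgress() */
} ToyXact;

static void begin_xact(ToyXact *x)   { x->clog_committed = false; x->in_procarray = true; }
static void commit_step1(ToyXact *x) { x->clog_committed = true; }   /* record in clog */
static void commit_step2(ToyXact *x) { x->in_procarray = false; }    /* leave procarray */

/* Wrong: consults only the clog, so it reports "committed" inside the window. */
static bool seen_committed_wrong(const ToyXact *x)
{
    return x->clog_committed;
}

/* Right: check in-progress first, as the tqual.c comment prescribes. */
static bool seen_committed_right(const ToyXact *x)
{
    if (x->in_procarray)
        return false;
    return x->clog_committed;
}
```

Between commit_step1 and commit_step2 the wrong checker already says "committed" while a concurrent GetSnapshotData-style view would still see the transaction as running — exactly the inconsistency the quoted comment warns about.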
I worry that this MultiXactIdSetOldestMember() will be problematic in
long-running vacuums like an anti-wraparound vacuum of a huge
table. There's no real need to call MultiXactIdSetOldestMember() here,
since we will not become the member of a multi. So I think you should
either move the Assert() in MultiXactIdCreateFromMembers() to its other
callers, or add a parameter to skip it.

I would like to have the Assert() work automatically, that is, check the
PROC_IN_VACUUM flag in MyProc->vacuumflags ... but this probably won't
work with CLUSTER. That said, I think we *should* call SetOldestMember
in CLUSTER. So maybe both things should be conditional on
PROC_IN_VACUUM.
Why should it be dependent on cluster? SetOldestMember() defines the
oldest multi we can be a member of. Even in cluster, the freezing will
not make us a member of a multi. If the transaction does something else
requiring SetOldestMember(), that will do it?
I worry that all these multixact accesses will create huge performance
problems due to the inefficiency of the multixactid cache. If you scan a
huge table there very well might be millions of different multis we
touch and afaics most of them will end up in the multixactid cache. That
can't end well.
I think we need to either regularly delete that cache when it goes past,
say, 100 entries, or just bypass it entirely.

Delete the whole cache, or just prune it of the least recently used
entries? Maybe the cache should be a dlist instead of the open-coded
stuff that's there now; that would enable pruning of the oldest entries.
I think a blanket deletion might be a cure worse than the disease. I
see your point anyhow.
I was thinking of just deleting the whole thing. Revamping the cache
mechanism to be more efficient is an important goal, but it imo
shouldn't be lumped together with this. Now you could argue that purging
the cache shouldn't be either - but in 9.3.2+ the worst case essentially is
O(n^2) in the number of rows in a table. Don't think that can be
acceptable.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
I worry that all these multixact accesses will create huge performance
problems due to the inefficiency of the multixactid cache. If you scan a
huge table there very well might be millions of different multis we
touch and afaics most of them will end up in the multixactid cache. That
can't end well.
I think we need to either regularly delete that cache when it goes past,
say, 100 entries, or just bypass it entirely.

Delete the whole cache, or just prune it of the least recently used
entries? Maybe the cache should be a dlist instead of the open-coded
stuff that's there now; that would enable pruning of the oldest entries.
I think a blanket deletion might be a cure worse than the disease. I
see your point anyhow.

I was thinking of just deleting the whole thing. Revamping the cache
mechanism to be more efficient is an important goal, but it imo
shouldn't be lumped together with this. Now you could argue that purging
the cache shouldn't be either - but in 9.3.2+ the worst case essentially is
O(n^2) in the number of rows in a table. Don't think that can be
acceptable.
So I think this is the only remaining issue to make this patch
committable (I will address the other points in Andres' email.) Since
there has been no other feedback on this thread, Andres and I discussed
the cache issue a bit over IM and we seem to agree that a patch to
revamp the cache should be a fairly localized change that could be
applied on both 9.3 and master, separately from this fix. Doing cache
deletion seems more invasive, and would not provide better performance anyway.
Since having a potentially O(n^2) cache behavior but with working freeze
seems better than no O(n^2) but broken freeze, I'm going to apply this
patch shortly and then work on reworking the cache.
Are there other opinions?
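For illustration, the "100 entries" idea floated earlier could look roughly like a fixed-size ring that overwrites its oldest slot. This is a hypothetical sketch of the pruning policy, not the actual multixact cache code (which is an open-coded list in multixact.c):

```c
#include <assert.h>
#include <string.h>

#define CACHE_CAP 100   /* the "100 entries" threshold floated above */

/*
 * Hypothetical fixed-size cache that drops the oldest entry once full --
 * a stand-in for pruning the multixact cache instead of letting it grow
 * without bound during a table-wide vacuum.
 */
typedef struct
{
    unsigned ids[CACHE_CAP];
    int      n;       /* number of live entries */
    int      next;    /* ring position that holds the oldest entry */
} MxCache;

static void cache_init(MxCache *c)
{
    memset(c, 0, sizeof(*c));
}

static void cache_put(MxCache *c, unsigned id)
{
    c->ids[c->next] = id;                 /* overwrite the oldest slot */
    c->next = (c->next + 1) % CACHE_CAP;
    if (c->n < CACHE_CAP)
        c->n++;
}

static int cache_lookup(const MxCache *c, unsigned id)
{
    for (int i = 0; i < c->n; i++)
        if (c->ids[i] == id)
            return 1;
    return 0;
}
```

With a cap like this, a vacuum touching millions of distinct multis pays a bounded lookup cost per tuple instead of scanning an ever-growing list.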
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
I worry that this MultiXactIdSetOldestMember() will be problematic in
long-running vacuums like an anti-wraparound vacuum of a huge
table. There's no real need to call MultiXactIdSetOldestMember() here,
since we will not become the member of a multi. So I think you should
either move the Assert() in MultiXactIdCreateFromMembers() to its other
callers, or add a parameter to skip it.

I would like to have the Assert() work automatically, that is, check the
PROC_IN_VACUUM flag in MyProc->vacuumflags ... but this probably won't
work with CLUSTER. That said, I think we *should* call SetOldestMember
in CLUSTER. So maybe both things should be conditional on
PROC_IN_VACUUM.

Why should it be dependent on cluster? SetOldestMember() defines the
oldest multi we can be a member of. Even in cluster, the freezing will
not make us a member of a multi. If the transaction does something else
requiring SetOldestMember(), that will do it?
One last thing (I hope). It's not real easy to disable this check,
because it actually lives in GetNewMultiXactId. It would uglify the API
a lot if we were to pass a flag down two layers of routines; and moving
it to higher-level routines doesn't seem all that appropriate either.
I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so
heap_prepare_freeze_tuple does this:
PG_TRY();
{
/* set flag to let multixact.c know what we're doing */
MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
cutoff_xid, cutoff_multi, &flags);
}
PG_CATCH();
{
MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI;
PG_RE_THROW();
}
PG_END_TRY();
MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI;
and GetNewMultiXactId tests it to avoid the assert:
/*
* MultiXactIdSetOldestMember() must have been called already, but don't
* check while freezing MultiXactIds.
*/
Assert((MyPgXact->vacuumFlags & PROC_FREEZING_MULTI) ||
MultiXactIdIsValid(OldestMemberMXactId[MyBackendId]));
This avoids the API uglification issues, but introduces a setjmp call
for every tuple to be frozen. I don't think this is an excessive cost
to pay; after all, this is going to happen only for tuples for which
heap_tuple_needs_freeze already returned true; and for those we're
already going to do a lot of other work.
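The set-flag/clear-flag contract in the PG_TRY sketch above — clear on both the normal and the error exit — can be modelled outside the backend with plain setjmp/longjmp standing in for PG_TRY/PG_CATCH/PG_RE_THROW. Nothing here is real PostgreSQL API:

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/*
 * Toy re-creation of the PG_TRY dance above: the flag must be cleared
 * on both the normal and the error exit.  setjmp/longjmp stand in for
 * PostgreSQL's PG_TRY/PG_CATCH/PG_RE_THROW machinery.
 */
static bool freezing_multi_flag = false;   /* stands in for PROC_FREEZING_MULTI */
static jmp_buf catch_buf;

static void maybe_error(bool fail)
{
    if (fail)
        longjmp(catch_buf, 1);   /* simulated elog(ERROR) */
}

/* Returns true on success, false if the "error" was caught. */
static bool freeze_with_flag(bool fail)
{
    if (setjmp(catch_buf) != 0)
    {
        /* the CATCH block: clear the flag before "re-throwing" */
        freezing_multi_flag = false;
        return false;
    }
    freezing_multi_flag = true;    /* let the callee skip its Assert */
    maybe_error(fail);
    freezing_multi_flag = false;   /* normal exit */
    return true;
}
```

Either way the function exits, the flag is back to false — which is the invariant the real PG_CATCH/PG_RE_THROW wrapper has to maintain so a later GetNewMultiXactId sees consistent state.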
Attached is the whole series of patches for 9.3. (master is the same,
only with an additional patch that removes the legacy WAL record.)
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
0001-Fix-freezing-of-multixacts.patch (text/x-diff; charset=iso-8859-1)
From d266b3cef8598c3383d3ba17105d17fc1f384f7d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue, 10 Dec 2013 17:56:02 -0300
Subject: [PATCH 1/4] Fix freezing of multixacts
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Andres Freund and Álvaro
---
src/backend/access/heap/heapam.c | 679 +++++++++++++++++++++++++-------
src/backend/access/rmgrdesc/heapdesc.c | 9 +
src/backend/access/transam/multixact.c | 18 +-
src/backend/commands/vacuumlazy.c | 31 +-
src/include/access/heapam_xlog.h | 43 +-
src/include/access/multixact.h | 3 +
6 files changed, 628 insertions(+), 155 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1a0dd21..24d843a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5238,14 +5238,261 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
+#define FRM_NOOP 0x0001
+#define FRM_INVALIDATE_XMAX 0x0002
+#define FRM_RETURN_IS_XID 0x0004
+#define FRM_RETURN_IS_MULTI 0x0008
+#define FRM_MARK_COMMITTED 0x0010
/*
- * heap_freeze_tuple
+ * FreezeMultiXactId
+ * Determine what to do during freezing when a tuple is marked by a
+ * MultiXactId.
+ *
+ * NB -- this might have the side-effect of creating a new MultiXactId!
+ *
+ * "flags" is an output value; it's used to tell caller what to do on return.
+ * Possible flags are:
+ * FRM_NOOP
+ * don't do anything -- keep existing Xmax
+ * FRM_INVALIDATE_XMAX
+ * mark Xmax as InvalidTransactionId and set XMAX_INVALID flag.
+ * FRM_RETURN_IS_XID
+ * The Xid return value is a single update Xid to set as xmax.
+ * FRM_MARK_COMMITTED
+ * Xmax can be marked as HEAP_XMAX_COMMITTED
+ * FRM_RETURN_IS_MULTI
+ * The return value is a new MultiXactId to set as new Xmax.
+ * (caller must obtain proper infomask bits using GetMultiXactIdHintBits)
+ */
+static TransactionId
+FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
+ TransactionId cutoff_xid, MultiXactId cutoff_multi,
+ uint16 *flags)
+{
+ TransactionId xid = InvalidTransactionId;
+ int i;
+ MultiXactMember *members;
+ int nmembers;
+ bool need_replace;
+ int nnewmembers;
+ MultiXactMember *newmembers;
+ bool has_lockers;
+ TransactionId update_xid;
+ bool update_committed;
+
+ *flags = 0;
+
+ /* We should only be called in Multis */
+ Assert(t_infomask & HEAP_XMAX_IS_MULTI);
+
+ if (!MultiXactIdIsValid(multi))
+ {
+ /* Ensure infomask bits are appropriately set/reset */
+ *flags |= FRM_INVALIDATE_XMAX;
+ return InvalidTransactionId;
+ }
+ else if (MultiXactIdPrecedes(multi, cutoff_multi))
+ {
+ /*
+ * This old multi cannot possibly have members still running. If it
+ * was a locker only, it can be removed without any further
+ * consideration; but if it contained an update, we might need to
+ * preserve it.
+ */
+ Assert(!MultiXactIdIsRunning(multi));
+ if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask))
+ {
+ *flags |= FRM_INVALIDATE_XMAX;
+ xid = InvalidTransactionId; /* not strictly necessary */
+ }
+ else
+ {
+ /* replace multi by update xid */
+ xid = MultiXactIdGetUpdateXid(multi, t_infomask);
+
+ /* wasn't only a lock, xid needs to be valid */
+ Assert(TransactionIdIsValid(xid));
+
+ /*
+ * If the xid is older than the cutoff, it has to have aborted,
+ * otherwise the tuple would have gotten pruned away.
+ */
+ if (TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ Assert(!TransactionIdDidCommit(xid));
+ *flags |= FRM_INVALIDATE_XMAX;
+ xid = InvalidTransactionId; /* not strictly necessary */
+ }
+ else
+ *flags |= FRM_RETURN_IS_XID;
+ }
+
+ return xid;
+ }
+
+ /*
+ * This multixact might have or might not have members still running, but
+ * we know it's valid and is newer than the cutoff point for multis.
+ * However, some member(s) of it may be below the cutoff for Xids, so we
+ * need to walk the whole members array to figure out what to do, if
+ * anything.
+ */
+
+ nmembers = GetMultiXactIdMembers(multi, &members, false);
+ if (nmembers <= 0)
+ {
+ /* Nothing worth keeping */
+ *flags |= FRM_INVALIDATE_XMAX;
+ return InvalidTransactionId;
+ }
+
+ /* is there anything older than the cutoff? */
+ need_replace = false;
+ for (i = 0; i < nmembers; i++)
+ {
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ {
+ need_replace = true;
+ break;
+ }
+ }
+
+ /*
+ * In the simplest case, there is no member older than the cutoff; we can
+ * keep the existing MultiXactId as is.
+ */
+ if (!need_replace)
+ {
+ *flags |= FRM_NOOP;
+ pfree(members);
+ return InvalidTransactionId;
+ }
+
+ /*
+ * If the multi needs to be updated, figure out which members do we need
+ * to keep.
+ */
+ nnewmembers = 0;
+ newmembers = palloc(sizeof(MultiXactMember) * nmembers);
+ has_lockers = false;
+ update_xid = InvalidTransactionId;
+ update_committed = false;
+
+ for (i = 0; i < nmembers; i++)
+ {
+ if (ISUPDATE_from_mxstatus(members[i].status))
+ {
+ /*
+ * It's an update; should we keep it? If the transaction is known
+ * aborted then it's okay to ignore it, otherwise not. (Note this
+ * is just an optimization and not needed for correctness, so it's
+ * okay to get this test wrong; for example, in case an updater is
+ * crashed, or a running transaction in the process of aborting.)
+ */
+ if (!TransactionIdDidAbort(members[i].xid))
+ {
+ newmembers[nnewmembers++] = members[i];
+ Assert(!TransactionIdIsValid(update_xid));
+ update_xid = members[i].xid;
+
+ /*
+ * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
+ * the Xid in cache. Again, this is just an optimization, so
+ * it's not a problem if the transaction is still running and
+ * in the process of committing.
+ */
+ if (TransactionIdDidCommit(update_xid))
+ update_committed = true;
+ }
+
+ /*
+ * Checking for very old update Xids is critical: if the update
+ * member of the multi is older than cutoff_xid, we must remove
+ * it, because otherwise a later liveliness check could attempt
+ * pg_clog access for a page that was truncated away by the
+ * current vacuum. Note that if the update had committed, we
+ * wouldn't be freezing this tuple because it would have gotten
+ * removed (HEAPTUPLE_DEAD) by HeapTupleSatisfiesVacuum; so it
+ * either aborted or crashed. Therefore, ignore this update_xid.
+ */
+ if (TransactionIdPrecedes(update_xid, cutoff_xid))
+ {
+ Assert(!TransactionIdIsValid(update_xid) ||
+ !TransactionIdDidCommit(update_xid));
+ update_xid = InvalidTransactionId;
+ update_committed = false;
+ }
+ }
+ else
+ {
+ /* We only keep lockers if they are still running */
+ if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
+ TransactionIdIsInProgress(members[i].xid))
+ {
+ /* running locker cannot possibly be older than the cutoff */
+ Assert(!TransactionIdPrecedes(members[i].xid, cutoff_xid));
+ newmembers[nnewmembers++] = members[i];
+ has_lockers = true;
+ }
+ }
+ }
+
+ pfree(members);
+
+ if (nnewmembers == 0)
+ {
+ /* nothing worth keeping!? Tell caller to remove the whole thing */
+ *flags |= FRM_INVALIDATE_XMAX;
+ xid = InvalidTransactionId;
+ }
+ else if (TransactionIdIsValid(update_xid) && !has_lockers)
+ {
+ /*
+ * If there's a single member and it's an update, pass it back alone
+ * without creating a new Multi. (XXX we could do this when there's a
+ * single remaining locker, too, but that would complicate the API too
+ * much; moreover, the case with the single updater is more
+ * interesting, because those are longer-lived.)
+ */
+ Assert(nnewmembers == 1);
+ *flags |= FRM_RETURN_IS_XID;
+ if (update_committed)
+ *flags |= FRM_MARK_COMMITTED;
+ xid = update_xid;
+ }
+ else
+ {
+ /*
+ * Create a new multixact with the surviving members of the previous
+ * one, to set as new Xmax in the tuple.
+ *
+ * If this is the first possibly-multixact-able operation in the
+ * current transaction, set my per-backend OldestMemberMXactId
+ * setting. We can be certain that the transaction will never become a
+ * member of any older MultiXactIds than that.
+ */
+ MultiXactIdSetOldestMember();
+ xid = MultiXactIdCreateFromMembers(nnewmembers, newmembers);
+ *flags |= FRM_RETURN_IS_MULTI;
+ }
+
+ pfree(newmembers);
+
+ return xid;
+}
+
+/*
+ * heap_prepare_freeze_tuple
*
* Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac)
- * are older than the specified cutoff XID. If so, replace them with
- * FrozenTransactionId or InvalidTransactionId as appropriate, and return
- * TRUE. Return FALSE if nothing was changed.
+ * are older than the specified cutoff XID and cutoff MultiXactId. If so,
+ * setup enough state (in the *frz output argument) to later execute and
+ * WAL-log what we would need to do, and return TRUE. Return FALSE if nothing
+ * is to be changed.
+ *
+ * Caller is responsible for setting the offset field, if appropriate. This
+ * is only necessary if the freeze is to be WAL-logged.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
@@ -5254,54 +5501,44 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
* NB: cutoff_xid *must* be <= the current global xmin, to ensure that any
* XID older than it could neither be running nor seen as running by any
* open transaction. This ensures that the replacement will not change
- * anyone's idea of the tuple state. Also, since we assume the tuple is
- * not HEAPTUPLE_DEAD, the fact that an XID is not still running allows us
- * to assume that it is either committed good or aborted, as appropriate;
- * so we need no external state checks to decide what to do. (This is good
- * because this function is applied during WAL recovery, when we don't have
- * access to any such state, and can't depend on the hint bits to be set.)
- * There is an exception we make which is to assume GetMultiXactIdMembers can
- * be called during recovery.
- *
+ * anyone's idea of the tuple state.
* Similarly, cutoff_multi must be less than or equal to the smallest
* MultiXactId used by any transaction currently open.
*
* If the tuple is in a shared buffer, caller must hold an exclusive lock on
* that buffer.
*
- * Note: it might seem we could make the changes without exclusive lock, since
- * TransactionId read/write is assumed atomic anyway. However there is a race
- * condition: someone who just fetched an old XID that we overwrite here could
- * conceivably not finish checking the XID against pg_clog before we finish
- * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
- * exclusive lock ensures no other backend is in process of checking the
- * tuple status. Also, getting exclusive lock makes it safe to adjust the
- * infomask bits.
- *
- * NB: Cannot rely on hint bits here, they might not be set after a crash or
- * on a standby.
+ * NB: It is not enough to set hint bits to indicate something is
+ * committed/invalid -- they might not be set on a standby, or after crash
+ * recovery. We really need to remove old xids.
*/
bool
-heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
- MultiXactId cutoff_multi)
+heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ TransactionId cutoff_multi,
+ xl_heap_freeze_tuple *frz)
+
{
bool changed = false;
bool freeze_xmax = false;
TransactionId xid;
+ frz->frzflags = 0;
+ frz->t_infomask2 = tuple->t_infomask2;
+ frz->t_infomask = tuple->t_infomask;
+ frz->xmax = HeapTupleHeaderGetRawXmax(tuple);
+
/* Process xmin */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) &&
TransactionIdPrecedes(xid, cutoff_xid))
{
- HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+ frz->frzflags |= XLH_FREEZE_XMIN;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED will
* already be set here, but there's a small chance not.
*/
- Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
- tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
@@ -5318,91 +5555,35 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
{
- if (!MultiXactIdIsValid(xid))
- {
- /* no xmax set, ignore */
- ;
- }
- else if (MultiXactIdPrecedes(xid, cutoff_multi))
- {
- /*
- * This old multi cannot possibly be running. If it was a locker
- * only, it can be removed without much further thought; but if it
- * contained an update, we need to preserve it.
- */
- if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
- freeze_xmax = true;
- else
- {
- TransactionId update_xid;
+ TransactionId newxmax;
+ uint16 flags;
- update_xid = HeapTupleGetUpdateXid(tuple);
-
- /*
- * The multixact has an update hidden within. Get rid of it.
- *
- * If the update_xid is below the cutoff_xid, it necessarily
- * must be an aborted transaction. In a primary server, such
- * an Xmax would have gotten marked invalid by
- * HeapTupleSatisfiesVacuum, but in a replica that is not
- * called before we are, so deal with it in the same way.
- *
- * If not below the cutoff_xid, then the tuple would have been
- * pruned by vacuum, if the update committed long enough ago,
- * and we wouldn't be freezing it; so it's either recently
- * committed, or in-progress. Deal with this by setting the
- * Xmax to the update Xid directly and remove the IS_MULTI
- * bit. (We know there cannot be running lockers in this
- * multi, because it's below the cutoff_multi value.)
- */
+ newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
+ cutoff_xid, cutoff_multi, &flags);
- if (TransactionIdPrecedes(update_xid, cutoff_xid))
- {
- Assert(InRecovery || TransactionIdDidAbort(update_xid));
- freeze_xmax = true;
- }
- else
- {
- Assert(InRecovery || !TransactionIdIsInProgress(update_xid));
- tuple->t_infomask &= ~HEAP_XMAX_BITS;
- HeapTupleHeaderSetXmax(tuple, update_xid);
- changed = true;
- }
- }
+ if (flags & FRM_INVALIDATE_XMAX)
+ freeze_xmax = true;
+ else if (flags & FRM_RETURN_IS_XID)
+ {
+ frz->t_infomask &= ~HEAP_XMAX_BITS;
+ frz->xmax = newxmax;
+ if (flags & FRM_MARK_COMMITTED)
+ frz->t_infomask |= HEAP_XMAX_COMMITTED;
+ changed = true;
}
- else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
+ else if (flags & FRM_RETURN_IS_MULTI)
{
- /* newer than the cutoff, so don't touch it */
- ;
+ frz->t_infomask &= ~HEAP_XMAX_BITS;
+ frz->xmax = newxmax;
+
+ GetMultiXactIdHintBits(newxmax,
+ &frz->t_infomask,
+ &frz->t_infomask2);
+ changed = true;
}
else
{
- TransactionId update_xid;
-
- /*
- * This is a multixact which is not marked LOCK_ONLY, but which
- * is newer than the cutoff_multi. If the update_xid is below the
- * cutoff_xid point, then we can just freeze the Xmax in the
- * tuple, removing it altogether. This seems simple, but there
- * are several underlying assumptions:
- *
- * 1. A tuple marked by an multixact containing a very old
- * committed update Xid would have been pruned away by vacuum; we
- * wouldn't be freezing this tuple at all.
- *
- * 2. There cannot possibly be any live locking members remaining
- * in the multixact. This is because if they were alive, the
- * update's Xid would had been considered, via the lockers'
- * snapshot's Xmin, as part the cutoff_xid.
- *
- * 3. We don't create new MultiXacts via MultiXactIdExpand() that
- * include a very old aborted update Xid: in that function we only
- * include update Xids corresponding to transactions that are
- * committed or in-progress.
- */
- update_xid = HeapTupleGetUpdateXid(tuple);
- if (TransactionIdPrecedes(update_xid, cutoff_xid))
- freeze_xmax = true;
+ Assert(flags & FRM_NOOP);
}
}
else if (TransactionIdIsNormal(xid) &&
@@ -5413,17 +5594,17 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
if (freeze_xmax)
{
- HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
+ frz->xmax = InvalidTransactionId;
/*
* The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
* LOCKED. Normalize to INVALID just to be sure no one gets confused.
* Also get rid of the HEAP_KEYS_UPDATED bit.
*/
- tuple->t_infomask &= ~HEAP_XMAX_BITS;
- tuple->t_infomask |= HEAP_XMAX_INVALID;
- HeapTupleHeaderClearHotUpdated(tuple);
- tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ frz->t_infomask &= ~HEAP_XMAX_BITS;
+ frz->t_infomask |= HEAP_XMAX_INVALID;
+ frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
+ frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;
changed = true;
}
@@ -5443,16 +5624,16 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
* xvac transaction succeeded.
*/
if (tuple->t_infomask & HEAP_MOVED_OFF)
- HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+ frz->frzflags |= XLH_INVALID_XVAC;
else
- HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+ frz->frzflags |= XLH_FREEZE_XVAC;
/*
* Might as well fix the hint bits too; usually XMIN_COMMITTED
* will already be set here, but there's a small chance not.
*/
Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
- tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ frz->t_infomask |= HEAP_XMIN_COMMITTED;
changed = true;
}
}
@@ -5461,6 +5642,68 @@ heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
}
/*
+ * heap_execute_freeze_tuple
+ * Execute the prepared freezing of a tuple.
+ *
+ * Caller is responsible for ensuring that no other backend can access the
+ * storage underlying this tuple, either by holding an exclusive lock on the
+ * buffer containing it (which is what lazy VACUUM does), or by having it be
+ * in private storage (which is what CLUSTER and friends do).
+ *
+ * Note: it might seem we could make the changes without exclusive lock, since
+ * TransactionId read/write is assumed atomic anyway. However there is a race
+ * condition: someone who just fetched an old XID that we overwrite here could
+ * conceivably not finish checking the XID against pg_clog before we finish
+ * the VACUUM and perhaps truncate off the part of pg_clog he needs. Getting
+ * exclusive lock ensures no other backend is in process of checking the
+ * tuple status. Also, getting exclusive lock makes it safe to adjust the
+ * infomask bits.
+ *
+ * NB: All code in here must be safe to execute during crash recovery!
+ */
+void
+heap_execute_freeze_tuple(HeapTupleHeader tuple, xl_heap_freeze_tuple *frz)
+{
+ if (frz->frzflags & XLH_FREEZE_XMIN)
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ HeapTupleHeaderSetXmax(tuple, frz->xmax);
+
+ if (frz->frzflags & XLH_FREEZE_XVAC)
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ if (frz->frzflags & XLH_INVALID_XVAC)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+
+ tuple->t_infomask = frz->t_infomask;
+ tuple->t_infomask2 = frz->t_infomask2;
+}
+
+/*
+ * heap_freeze_tuple - freeze tuple inplace without WAL logging.
+ *
+ * Useful for callers like CLUSTER that perform their own WAL logging.
+ */
+bool
+heap_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ TransactionId cutoff_multi)
+{
+ xl_heap_freeze_tuple frz;
+ bool do_freeze;
+
+ do_freeze = heap_prepare_freeze_tuple(tuple, cutoff_xid, cutoff_multi, &frz);
+
+ /*
+ * Note that because this is not a WAL-logged operation, we don't need to
+ * fill in the offset in the freeze record.
+ */
+
+ if (do_freeze)
+ heap_execute_freeze_tuple(tuple, &frz);
+ return do_freeze;
+}
+
+/*
* For a given MultiXactId, return the hint bits that should be set in the
* tuple's infomask.
*
@@ -5763,16 +6006,26 @@ heap_tuple_needs_freeze(HeapTupleHeader tuple, TransactionId cutoff_xid,
}
else if (MultiXactIdPrecedes(multi, cutoff_multi))
return true;
- else if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
- {
- /* only-locker multis don't need internal examination */
- ;
- }
else
{
- if (TransactionIdPrecedes(HeapTupleGetUpdateXid(tuple),
- cutoff_xid))
- return true;
+ MultiXactMember *members;
+ int nmembers;
+ int i;
+
+ /* need to check whether any member of the mxact is too old */
+
+ nmembers = GetMultiXactIdMembers(multi, &members, false);
+
+ for (i = 0; i < nmembers; i++)
+ {
+ if (TransactionIdPrecedes(members[i].xid, cutoff_xid))
+ {
+ pfree(members);
+ return true;
+ }
+ }
+ if (nmembers > 0)
+ pfree(members);
}
}
else
@@ -6022,27 +6275,26 @@ log_heap_clean(Relation reln, Buffer buffer,
}
/*
- * Perform XLogInsert for a heap-freeze operation. Caller must already
- * have modified the buffer and marked it dirty.
+ * Perform XLogInsert for a heap-freeze operation. Caller must have already
+ * modified the buffer and marked it dirty.
*/
XLogRecPtr
-log_heap_freeze(Relation reln, Buffer buffer,
- TransactionId cutoff_xid, MultiXactId cutoff_multi,
- OffsetNumber *offsets, int offcnt)
+log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
+ xl_heap_freeze_tuple *tuples, int ntuples)
{
- xl_heap_freeze xlrec;
+ xl_heap_freeze_page xlrec;
XLogRecPtr recptr;
XLogRecData rdata[2];
/* Caller should not call me on a non-WAL-logged relation */
Assert(RelationNeedsWAL(reln));
/* nor when there are no tuples to freeze */
- Assert(offcnt > 0);
+ Assert(ntuples > 0);
xlrec.node = reln->rd_node;
xlrec.block = BufferGetBlockNumber(buffer);
xlrec.cutoff_xid = cutoff_xid;
- xlrec.cutoff_multi = cutoff_multi;
+ xlrec.ntuples = ntuples;
rdata[0].data = (char *) &xlrec;
rdata[0].len = SizeOfHeapFreeze;
@@ -6050,17 +6302,17 @@ log_heap_freeze(Relation reln, Buffer buffer,
rdata[0].next = &(rdata[1]);
/*
- * The tuple-offsets array is not actually in the buffer, but pretend that
- * it is. When XLogInsert stores the whole buffer, the offsets array need
+ * The freeze plan array is not actually in the buffer, but pretend that
+ * it is. When XLogInsert stores the whole buffer, the freeze plan need
* not be stored too.
*/
- rdata[1].data = (char *) offsets;
- rdata[1].len = offcnt * sizeof(OffsetNumber);
+ rdata[1].data = (char *) tuples;
+ rdata[1].len = ntuples * sizeof(xl_heap_freeze_tuple);
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;
- recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE, rdata);
+ recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_FREEZE_PAGE, rdata);
return recptr;
}
@@ -6402,6 +6654,99 @@ heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record)
XLogRecordPageWithFreeSpace(xlrec->node, xlrec->block, freespace);
}
+/*
+ * Freeze a single tuple for XLOG_HEAP2_FREEZE
+ *
+ * NB: This type of record isn't generated anymore, since bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * is kept around to be able to perform PITR.
+ */
+static bool
+heap_xlog_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
+ MultiXactId cutoff_multi)
+{
+ bool changed = false;
+ TransactionId xid;
+
+ xid = HeapTupleHeaderGetXmin(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ HeapTupleHeaderSetXmin(tuple, FrozenTransactionId);
+
+ /*
+ * Might as well fix the hint bits too; usually XMIN_COMMITTED will
+ * already be set here, but there's a small chance not.
+ */
+ Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+ tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ changed = true;
+ }
+
+ /*
+ * Note that this code handles IS_MULTI Xmax values, too, but only to mark
+ * the tuple as not updated if the multixact is below the cutoff Multixact
+ * given; it doesn't remove dead members of a very old multixact.
+ */
+ xid = HeapTupleHeaderGetRawXmax(tuple);
+ if ((tuple->t_infomask & HEAP_XMAX_IS_MULTI) ?
+ (MultiXactIdIsValid(xid) &&
+ MultiXactIdPrecedes(xid, cutoff_multi)) :
+ (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid)))
+ {
+ HeapTupleHeaderSetXmax(tuple, InvalidTransactionId);
+
+ /*
+ * The tuple might be marked either XMAX_INVALID or XMAX_COMMITTED +
+ * LOCKED. Normalize to INVALID just to be sure no one gets confused.
+ * Also get rid of the HEAP_KEYS_UPDATED bit.
+ */
+ tuple->t_infomask &= ~HEAP_XMAX_BITS;
+ tuple->t_infomask |= HEAP_XMAX_INVALID;
+ HeapTupleHeaderClearHotUpdated(tuple);
+ tuple->t_infomask2 &= ~HEAP_KEYS_UPDATED;
+ changed = true;
+ }
+
+ /*
+ * Old-style VACUUM FULL is gone, but we have to keep this code as long as
+ * we support having MOVED_OFF/MOVED_IN tuples in the database.
+ */
+ if (tuple->t_infomask & HEAP_MOVED)
+ {
+ xid = HeapTupleHeaderGetXvac(tuple);
+ if (TransactionIdIsNormal(xid) &&
+ TransactionIdPrecedes(xid, cutoff_xid))
+ {
+ /*
+ * If a MOVED_OFF tuple is not dead, the xvac transaction must
+ * have failed; whereas a non-dead MOVED_IN tuple must mean the
+ * xvac transaction succeeded.
+ */
+ if (tuple->t_infomask & HEAP_MOVED_OFF)
+ HeapTupleHeaderSetXvac(tuple, InvalidTransactionId);
+ else
+ HeapTupleHeaderSetXvac(tuple, FrozenTransactionId);
+
+ /*
+ * Might as well fix the hint bits too; usually XMIN_COMMITTED
+ * will already be set here, but there's a small chance not.
+ */
+ Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID));
+ tuple->t_infomask |= HEAP_XMIN_COMMITTED;
+ changed = true;
+ }
+ }
+
+ return changed;
+}
+
+/*
+ * NB: This type of record isn't generated anymore, since bugs around
+ * multixacts couldn't be fixed without a more robust type of freezing. This
+ * is kept around to be able to perform PITR.
+ */
static void
heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
{
@@ -6450,7 +6795,7 @@ heap_xlog_freeze(XLogRecPtr lsn, XLogRecord *record)
ItemId lp = PageGetItemId(page, *offsets);
HeapTupleHeader tuple = (HeapTupleHeader) PageGetItem(page, lp);
- (void) heap_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
+ (void) heap_xlog_freeze_tuple(tuple, cutoff_xid, cutoff_multi);
offsets++;
}
}
@@ -6574,6 +6919,63 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
}
}
+/*
+ * Replay XLOG_HEAP2_FREEZE_PAGE records
+ */
+static void
+heap_xlog_freeze_page(XLogRecPtr lsn, XLogRecord *record)
+{
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) XLogRecGetData(record);
+ TransactionId cutoff_xid = xlrec->cutoff_xid;
+ Buffer buffer;
+ Page page;
+ int ntup;
+
+ /*
+ * In Hot Standby mode, ensure that there are no queries running that
+ * still consider the frozen xids as running.
+ */
+ if (InHotStandby)
+ ResolveRecoveryConflictWithSnapshot(cutoff_xid, xlrec->node);
+
+ /* If we have a full-page image, restore it and we're done */
+ if (record->xl_info & XLR_BKP_BLOCK(0))
+ {
+ (void) RestoreBackupBlock(lsn, record, 0, false, false);
+ return;
+ }
+
+ buffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ if (!BufferIsValid(buffer))
+ return;
+
+ page = (Page) BufferGetPage(buffer);
+
+ if (lsn <= PageGetLSN(page))
+ {
+ UnlockReleaseBuffer(buffer);
+ return;
+ }
+
+ /* now execute freeze plan for each frozen tuple */
+ for (ntup = 0; ntup < xlrec->ntuples; ntup++)
+ {
+ xl_heap_freeze_tuple *xlrec_tp;
+ ItemId lp;
+ HeapTupleHeader tuple;
+
+ xlrec_tp = &xlrec->tuples[ntup];
+ lp = PageGetItemId(page, xlrec_tp->offset); /* offsets are one-based */
+ tuple = (HeapTupleHeader) PageGetItem(page, lp);
+
+ heap_execute_freeze_tuple(tuple, xlrec_tp);
+ }
+
+ PageSetLSN(page, lsn);
+ MarkBufferDirty(buffer);
+ UnlockReleaseBuffer(buffer);
+}
+
static void
heap_xlog_newpage(XLogRecPtr lsn, XLogRecord *record)
{
@@ -7429,6 +7831,9 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
case XLOG_HEAP2_CLEAN:
heap_xlog_clean(lsn, record);
break;
+ case XLOG_HEAP2_FREEZE_PAGE:
+ heap_xlog_freeze_page(lsn, record);
+ break;
case XLOG_HEAP2_CLEANUP_INFO:
heap_xlog_cleanup_info(lsn, record);
break;
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index bc8b985..d527aa6 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -149,6 +149,15 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
xlrec->node.relNode, xlrec->block,
xlrec->latestRemovedXid);
}
+ else if (info == XLOG_HEAP2_FREEZE_PAGE)
+ {
+ xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
+
+ appendStringInfo(buf, "freeze_page: rel %u/%u/%u; blk %u; cutoff xid %u ntuples %u",
+ xlrec->node.spcNode, xlrec->node.dbNode,
+ xlrec->node.relNode, xlrec->block,
+ xlrec->cutoff_xid, xlrec->ntuples);
+ }
else if (info == XLOG_HEAP2_CLEANUP_INFO)
{
xl_heap_cleanup_info *xlrec = (xl_heap_cleanup_info *) rec;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 2081470..ed7101f 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -286,7 +286,6 @@ static MemoryContext MXactContext = NULL;
/* internal MultiXactId management */
static void MultiXactIdSetOldestVisible(void);
-static MultiXactId CreateMultiXactId(int nmembers, MultiXactMember *members);
static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
int nmembers, MultiXactMember *members);
static MultiXactId GetNewMultiXactId(int nmembers, MultiXactOffset *offset);
@@ -344,7 +343,7 @@ MultiXactIdCreate(TransactionId xid1, MultiXactStatus status1,
members[1].xid = xid2;
members[1].status = status2;
- newMulti = CreateMultiXactId(2, members);
+ newMulti = MultiXactIdCreateFromMembers(2, members);
debug_elog3(DEBUG2, "Create: %s",
mxid_to_string(newMulti, 2, members));
@@ -407,7 +406,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
*/
member.xid = xid;
member.status = status;
- newMulti = CreateMultiXactId(1, &member);
+ newMulti = MultiXactIdCreateFromMembers(1, &member);
debug_elog4(DEBUG2, "Expand: %u has no members, create singleton %u",
multi, newMulti);
@@ -459,7 +458,7 @@ MultiXactIdExpand(MultiXactId multi, TransactionId xid, MultiXactStatus status)
newMembers[j].xid = xid;
newMembers[j++].status = status;
- newMulti = CreateMultiXactId(j, newMembers);
+ newMulti = MultiXactIdCreateFromMembers(j, newMembers);
pfree(members);
pfree(newMembers);
@@ -664,16 +663,16 @@ ReadNextMultiXactId(void)
}
/*
- * CreateMultiXactId
- * Make a new MultiXactId
+ * MultiXactIdCreateFromMembers
+ * Make a new MultiXactId from the specified set of members
*
* Make XLOG, SLRU and cache entries for a new MultiXactId, recording the
* given TransactionIds as members. Returns the newly created MultiXactId.
*
* NB: the passed members[] array will be sorted in-place.
*/
-static MultiXactId
-CreateMultiXactId(int nmembers, MultiXactMember *members)
+MultiXactId
+MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
{
MultiXactId multi;
MultiXactOffset offset;
@@ -760,7 +759,8 @@ CreateMultiXactId(int nmembers, MultiXactMember *members)
* RecordNewMultiXact
* Write info about a new multixact into the offsets and members files
*
- * This is broken out of CreateMultiXactId so that xlog replay can use it.
+ * This is broken out of MultiXactIdCreateFromMembers so that xlog replay can
+ * use it.
*/
static void
RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index ff6bd8e..01b6f46 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -424,6 +424,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
Buffer vmbuffer = InvalidBuffer;
BlockNumber next_not_all_visible_block;
bool skipping_all_visible_blocks;
+ xl_heap_freeze_tuple *frozen;
pg_rusage_init(&ru0);
@@ -446,6 +447,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
vacrelstats->latestRemovedXid = InvalidTransactionId;
lazy_space_alloc(vacrelstats, nblocks);
+ frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
/*
* We want to skip pages that don't require vacuuming according to the
@@ -500,7 +502,6 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
bool tupgone,
hastup;
int prev_dead_count;
- OffsetNumber frozen[MaxOffsetNumber];
int nfrozen;
Size freespace;
bool all_visible_according_to_vm;
@@ -893,9 +894,9 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
* Each non-removable tuple must be checked to see if it needs
* freezing. Note we already have exclusive buffer lock.
*/
- if (heap_freeze_tuple(tuple.t_data, FreezeLimit,
- MultiXactCutoff))
- frozen[nfrozen++] = offnum;
+ if (heap_prepare_freeze_tuple(tuple.t_data, FreezeLimit,
+ MultiXactCutoff, &frozen[nfrozen]))
+ frozen[nfrozen++].offset = offnum;
}
} /* scan along page */
@@ -906,15 +907,33 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
*/
if (nfrozen > 0)
{
+ START_CRIT_SECTION();
+
MarkBufferDirty(buf);
+
+ /* execute collected freezes */
+ for (i = 0; i < nfrozen; i++)
+ {
+ ItemId itemid;
+ HeapTupleHeader htup;
+
+ itemid = PageGetItemId(page, frozen[i].offset);
+ htup = (HeapTupleHeader) PageGetItem(page, itemid);
+
+ heap_execute_freeze_tuple(htup, &frozen[i]);
+ }
+
+ /* Now WAL-log freezing if necessary */
if (RelationNeedsWAL(onerel))
{
XLogRecPtr recptr;
recptr = log_heap_freeze(onerel, buf, FreezeLimit,
- MultiXactCutoff, frozen, nfrozen);
+ frozen, nfrozen);
PageSetLSN(page, recptr);
}
+
+ END_CRIT_SECTION();
}
/*
@@ -1015,6 +1034,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
RecordPageWithFreeSpace(onerel, blkno, freespace);
}
+ pfree(frozen);
+
/* save stats for use later */
vacrelstats->scanned_tuples = num_tuples;
vacrelstats->tuples_deleted = tups_vacuumed;
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 4381778..138b879 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -50,7 +50,7 @@
*/
#define XLOG_HEAP2_FREEZE 0x00
#define XLOG_HEAP2_CLEAN 0x10
-/* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
+#define XLOG_HEAP2_FREEZE_PAGE 0x20
#define XLOG_HEAP2_CLEANUP_INFO 0x30
#define XLOG_HEAP2_VISIBLE 0x40
#define XLOG_HEAP2_MULTI_INSERT 0x50
@@ -239,7 +239,7 @@ typedef struct xl_heap_inplace
#define SizeOfHeapInplace (offsetof(xl_heap_inplace, target) + SizeOfHeapTid)
-/* This is what we need to know about tuple freezing during vacuum */
+/* This is what we need to know about tuple freezing during vacuum (legacy) */
typedef struct xl_heap_freeze
{
RelFileNode node;
@@ -251,6 +251,35 @@ typedef struct xl_heap_freeze
#define SizeOfHeapFreeze (offsetof(xl_heap_freeze, cutoff_multi) + sizeof(MultiXactId))
+/*
+ * a 'freeze plan' struct that represents what we need to know about a single
+ * tuple being frozen during vacuum
+ */
+#define XLH_FREEZE_XMIN 0x01
+#define XLH_FREEZE_XVAC 0x02
+#define XLH_INVALID_XVAC 0x04
+
+typedef struct xl_heap_freeze_tuple
+{
+ TransactionId xmax;
+ OffsetNumber offset;
+ uint16 t_infomask2;
+ uint16 t_infomask;
+ uint8 frzflags;
+} xl_heap_freeze_tuple;
+
+/*
+ * This is what we need to know about a block being frozen during vacuum
+ */
+typedef struct xl_heap_freeze_block
+{
+ RelFileNode node;
+ BlockNumber block;
+ TransactionId cutoff_xid;
+ uint16 ntuples;
+ xl_heap_freeze_tuple tuples[FLEXIBLE_ARRAY_MEMBER];
+} xl_heap_freeze_page;
+
/* This is what we need to know about setting a visibility map bit */
typedef struct xl_heap_visible
{
@@ -277,8 +306,14 @@ extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
OffsetNumber *nowunused, int nunused,
TransactionId latestRemovedXid);
extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
- TransactionId cutoff_xid, MultiXactId cutoff_multi,
- OffsetNumber *offsets, int offcnt);
+ TransactionId cutoff_xid, xl_heap_freeze_tuple *tuples,
+ int ntuples);
+extern bool heap_prepare_freeze_tuple(HeapTupleHeader tuple,
+ TransactionId cutoff_xid,
+ TransactionId cutoff_multi,
+ xl_heap_freeze_tuple *frz);
+extern void heap_execute_freeze_tuple(HeapTupleHeader tuple,
+ xl_heap_freeze_tuple *xlrec_tp);
extern XLogRecPtr log_heap_visible(RelFileNode rnode, Buffer heap_buffer,
Buffer vm_buffer, TransactionId cutoff_xid);
extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6085ea3..0e3b273 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -81,6 +81,9 @@ extern MultiXactId MultiXactIdCreate(TransactionId xid1,
MultiXactStatus status2);
extern MultiXactId MultiXactIdExpand(MultiXactId multi, TransactionId xid,
MultiXactStatus status);
+extern MultiXactId MultiXactIdCreateFromMembers(int nmembers,
+ MultiXactMember *members);
+
extern MultiXactId ReadNextMultiXactId(void);
extern bool MultiXactIdIsRunning(MultiXactId multi);
extern void MultiXactIdSetOldestMember(void);
--
1.7.10.4
0002-fixups-for-9.3.patch (text/x-diff)
From 22f88d30be9f9d97febb335599f2235918685278 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Wed, 11 Dec 2013 13:44:12 -0300
Subject: [PATCH 2/4] fixups for 9.3
---
src/backend/access/heap/heapam.c | 2 +-
src/include/access/heapam_xlog.h | 4 +++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 24d843a..9509480 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6297,7 +6297,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
xlrec.ntuples = ntuples;
rdata[0].data = (char *) &xlrec;
- rdata[0].len = SizeOfHeapFreeze;
+ rdata[0].len = SizeOfHeapFreezePage;
rdata[0].buffer = InvalidBuffer;
rdata[0].next = &(rdata[1]);
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 138b879..8d25245 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -271,7 +271,7 @@ typedef struct xl_heap_freeze_tuple
/*
* This is what we need to know about a block being frozen during vacuum
*/
-typedef struct xl_heap_freeze_block
+typedef struct xl_heap_freeze_page
{
RelFileNode node;
BlockNumber block;
@@ -280,6 +280,8 @@ typedef struct xl_heap_freeze_block
xl_heap_freeze_tuple tuples[FLEXIBLE_ARRAY_MEMBER];
} xl_heap_freeze_page;
+#define SizeOfHeapFreezePage offsetof(xl_heap_freeze_page, tuples)
+
/* This is what we need to know about setting a visibility map bit */
typedef struct xl_heap_visible
{
--
1.7.10.4
0003-fixup-XidIsInProgress-before-transam.c-tests.patch (text/x-diff)
From 35fdab06ceefcd0b4399550d76f9e134a1ae5880 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Dec 2013 17:27:54 -0300
Subject: [PATCH 3/4] fixup XidIsInProgress before transam.c tests
---
src/backend/access/heap/heapam.c | 81 ++++++++++++++++++++++++++------------
1 file changed, 55 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9509480..9616c18 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -5381,47 +5381,76 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
for (i = 0; i < nmembers; i++)
{
+ /*
+ * Determine whether to keep this member or ignore it.
+ */
if (ISUPDATE_from_mxstatus(members[i].status))
{
+ TransactionId xid = members[i].xid;
+
/*
* It's an update; should we keep it? If the transaction is known
- * aborted then it's okay to ignore it, otherwise not. (Note this
- * is just an optimization and not needed for correctness, so it's
- * okay to get this test wrong; for example, in case an updater is
- * crashed, or a running transaction in the process of aborting.)
+ * aborted then it's okay to ignore it, otherwise not. However,
+ * if the Xid is older than the cutoff_xid, we must remove it,
+ * because otherwise we might allow a very old Xid to persist
+ * which would later cause pg_clog lookup problems, because the
+ * corresponding SLRU segment might be about to be truncated away.
+ * (Note that such an old updater cannot possibly be committed,
+ * because HeapTupleSatisfiesVacuum would have returned
+ * HEAPTUPLE_DEAD and we would not be trying to freeze the tuple.)
+ *
+ * Note the TransactionIdDidAbort() test is just an optimization
+ * and not strictly necessary for correctness.
+ *
+ * As with all tuple visibility routines, it's critical to test
+ * TransactionIdIsInProgress before the transam.c routines,
+ * because of race conditions explained in detail in tqual.c.
*/
- if (!TransactionIdDidAbort(members[i].xid))
+ if (TransactionIdIsCurrentTransactionId(xid) ||
+ TransactionIdIsInProgress(xid))
{
- newmembers[nnewmembers++] = members[i];
Assert(!TransactionIdIsValid(update_xid));
-
+ update_xid = xid;
+ }
+ else if (!TransactionIdDidAbort(xid))
+ {
/*
- * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
- * the Xid in cache. Again, this is just an optimization, so
- * it's not a problem if the transaction is still running and
- * in the process of committing.
+ * Test whether to tell caller to set HEAP_XMAX_COMMITTED
+ * while we have the Xid still in cache. Note this can only
+ * be done if the transaction is known not running.
*/
- if (TransactionIdDidCommit(update_xid))
+ if (TransactionIdDidCommit(xid))
update_committed = true;
-
- update_xid = newmembers[i].xid;
+ Assert(!TransactionIdIsValid(update_xid));
+ update_xid = xid;
}
/*
- * Checking for very old update Xids is critical: if the update
- * member of the multi is older than cutoff_xid, we must remove
- * it, because otherwise a later liveliness check could attempt
- * pg_clog access for a page that was truncated away by the
- * current vacuum. Note that if the update had committed, we
- * wouldn't be freezing this tuple because it would have gotten
- * removed (HEAPTUPLE_DEAD) by HeapTupleSatisfiesVacuum; so it
- * either aborted or crashed. Therefore, ignore this update_xid.
+ * If we determined that it's an Xid corresponding to an update
+ * that must be retained, additionally add it to the list of
+ * members of the new Multis, in case we end up using that. (We
+ * might still decide to use only an update Xid and not a multi,
+ * but it's easier to maintain the list as we walk the old members
+ * list.)
+ *
+ * It is possible to end up with a very old updater Xid that
+ * crashed and thus did not mark itself as aborted in pg_clog.
+ * That would manifest as a pre-cutoff Xid. Make sure to ignore
+ * it.
*/
- if (TransactionIdPrecedes(update_xid, cutoff_xid))
+ if (TransactionIdIsValid(update_xid))
{
- update_xid = InvalidTransactionId;
- update_committed = false;
-
+ if (!TransactionIdPrecedes(update_xid, cutoff_xid))
+ {
+ newmembers[nnewmembers++] = members[i];
+ }
+ else
+ {
+ /* cannot have committed: would be HEAPTUPLE_DEAD */
+ Assert(!TransactionIdDidCommit(update_xid));
+ update_xid = InvalidTransactionId;
+ update_committed = false;
+ }
}
}
else
--
1.7.10.4
0004-set-the-PROC_FREEZING_MULTI-flag-to-avoid-assert.patch (text/x-diff)
From fdfd992ff907e75d4c1c009b9b191748678d5daf Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Thu, 12 Dec 2013 18:18:36 -0300
Subject: [PATCH 4/4] set the PROC_FREEZING_MULTI flag to avoid assert
---
src/backend/access/heap/heapam.c | 20 ++++++++++++++++----
src/backend/access/transam/multixact.c | 9 +++++++--
src/include/storage/proc.h | 3 ++-
3 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9616c18..bf9300d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -60,6 +60,7 @@
#include "storage/freespace.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "storage/standby.h"
@@ -5520,8 +5521,7 @@ FreezeMultiXactId(MultiXactId multi, uint16 t_infomask,
* WAL-log what we would need to do, and return TRUE. Return FALSE if nothing
* is to be changed.
*
- * Caller is responsible for setting the offset field, if appropriate. This
- * is only necessary if the freeze is to be WAL-logged.
+ * Caller is responsible for setting the offset field, if appropriate.
*
* It is assumed that the caller has checked the tuple with
* HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
@@ -5587,8 +5587,20 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple, TransactionId cutoff_xid,
TransactionId newxmax;
uint16 flags;
- newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
- cutoff_xid, cutoff_multi, &flags);
+ PG_TRY();
+ {
+ /* set flag to let multixact.c know what we're doing */
+ MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI;
+ newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
+ cutoff_xid, cutoff_multi, &flags);
+ }
+ PG_CATCH();
+ {
+ MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI;
+ PG_RE_THROW();
+ }
+ PG_END_TRY();
+ MyPgXact->vacuumFlags &= ~PROC_FREEZING_MULTI;
if (flags & FRM_INVALIDATE_XMAX)
freeze_xmax = true;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ed7101f..0166978 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -77,6 +77,7 @@
#include "postmaster/autovacuum.h"
#include "storage/lmgr.h"
#include "storage/pmsignal.h"
+#include "storage/proc.h"
#include "storage/procarray.h"
#include "utils/builtins.h"
#include "utils/memutils.h"
@@ -864,8 +865,12 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
debug_elog3(DEBUG2, "GetNew: for %d xids", nmembers);
- /* MultiXactIdSetOldestMember() must have been called already */
- Assert(MultiXactIdIsValid(OldestMemberMXactId[MyBackendId]));
+ /*
+ * MultiXactIdSetOldestMember() must have been called already, but don't
+ * check while freezing MultiXactIds.
+ */
+ Assert((MyPgXact->vacuumFlags & PROC_FREEZING_MULTI) ||
+ MultiXactIdIsValid(OldestMemberMXactId[MyBackendId]));
/* safety check, we should never get this far in a HS slave */
if (RecoveryInProgress())
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3b04d3c..7e53a3a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -42,9 +42,10 @@ struct XidCache
#define PROC_IN_VACUUM 0x02 /* currently running lazy vacuum */
#define PROC_IN_ANALYZE 0x04 /* currently running analyze */
#define PROC_VACUUM_FOR_WRAPAROUND 0x08 /* set by autovac only */
+#define PROC_FREEZING_MULTI 0x10 /* set while freezing multis */
/* flags reset at EOXact */
-#define PROC_VACUUM_STATE_MASK (0x0E)
+#define PROC_VACUUM_STATE_MASK (0x1E)
/*
* We allow a small number of "weak" relation locks (AccesShareLock,
--
1.7.10.4
Alvaro Herrera wrote:
One last thing (I hope). It's not real easy to disable this check,
because it actually lives in GetNewMultiXactId. It would uglify the API
a lot if we were to pass a flag down two layers of routines; and moving
it to higher-level routines doesn't seem all that appropriate either.
I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so
heap_prepare_freeze_tuple does this:

PG_TRY();
{
/* set flag to let multixact.c know what we're doing */
MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
cutoff_xid, cutoff_multi, &flags);
}
Uhm, actually we don't need a PG_TRY block at all for this to work: we
can rely on the flag being reset at transaction abort, if anything wrong
happens and we lose control. So just set the flag, call
FreezeMultiXactId, reset flag.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hi,
On 2013-12-12 18:39:43 -0300, Alvaro Herrera wrote:
Alvaro Herrera wrote:
One last thing (I hope). It's not real easy to disable this check,
because it actually lives in GetNewMultiXactId. It would uglify the API
a lot if we were to pass a flag down two layers of routines; and moving
it to higher-level routines doesn't seem all that appropriate
either.
Unfortunately I find that too ugly. Adding a flag in the procarray
because of an Assert() in a lowlevel routine? That's overkill.
What's the problem with moving the check to MultiXactIdCreate() and
MultiXactIdExpand() instead? Since those are the ones where it's
required to have called SetOldest() before, I don't see why that would
be inappropriate?
I'm thinking we can have a new flag in MyPgXact->vacuumFlags, so
heap_prepare_freeze_tuple does this:

PG_TRY();
{
/* set flag to let multixact.c know what we're doing */
MyPgXact->vacuumFlags |= PROC_FREEZING_MULTI;
newxmax = FreezeMultiXactId(xid, tuple->t_infomask,
cutoff_xid, cutoff_multi, &flags);
}

Uhm, actually we don't need a PG_TRY block at all for this to work: we
can rely on the flag being reset at transaction abort, if anything wrong
happens and we lose control. So just set the flag, call
FreezeMultiXactId, reset flag.
I don't think that'd be correct for a CLUSTER in a subtransaction? A
subtransaction's abort afaics doesn't call ProcArrayEndTransaction() and
thus doesn't clear vacuumFlags...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
Unfortunately I find that too ugly. Adding a flag in the procarray
because of an Assert() in a lowlevel routine? That's overkill.
If this flag doesn't need to be visible to other backends, it absolutely
does not belong in the procarray.
regards, tom lane
On 2013-12-12 18:24:34 -0300, Alvaro Herrera wrote:
+			/*
+			 * It's an update; should we keep it?  If the transaction is known
+			 * aborted then it's okay to ignore it, otherwise not.  (Note this
+			 * is just an optimization and not needed for correctness, so it's
+			 * okay to get this test wrong; for example, in case an updater is
+			 * crashed, or a running transaction in the process of aborting.)
+			 */
+			if (!TransactionIdDidAbort(members[i].xid))
+			{
+				newmembers[nnewmembers++] = members[i];
+				Assert(!TransactionIdIsValid(update_xid));
+
+				/*
+				 * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
+				 * the Xid in cache.  Again, this is just an optimization, so
+				 * it's not a problem if the transaction is still running and
+				 * in the process of committing.
+				 */
+				if (TransactionIdDidCommit(update_xid))
+					update_committed = true;
+
+				update_xid = newmembers[i].xid;
+			}
I still don't think this is ok. Freezing shouldn't set hint bits earlier
than tqual.c does. What's the problem with adding a
!TransactionIdIsInProgress()?
You also wrote:
On 2013-12-11 22:08:41 -0300, Alvaro Herrera wrote:
Hmm ... Is there an actual difference? I mean, a transaction that
marked itself as committed in pg_clog cannot return to any other state,
regardless of what happens elsewhere.
Hm, that's not actually true, I missed that so far: Think of async
commits and what we do in tqual.c:SetHintBits(). But I think we're safe
in this scenario, at least for the current callers. vacuumlazy.c will
WAL log the freezing and set the LSN while holding an exclusive lock,
therefor providing an LSN interlock. VACUUM FULL/CLUSTER will be safe,
even with wal_level=minimal, because the relation won't be visible until
it commits and it will contain a smgr pending delete forcing a
synchronous commit. But that should be documented.
+		if (TransactionIdPrecedes(update_xid, cutoff_xid))
+		{
+			update_xid = InvalidTransactionId;
+			update_committed = false;
+
+		}
Deserves an Assert().
+	else if (TransactionIdIsValid(update_xid) && !has_lockers)
+	{
+		/*
+		 * If there's a single member and it's an update, pass it back alone
+		 * without creating a new Multi.  (XXX we could do this when there's a
+		 * single remaining locker, too, but that would complicate the API too
+		 * much; moreover, the case with the single updater is more
+		 * interesting, because those are longer-lived.)
+		 */
+		Assert(nnewmembers == 1);
+		*flags |= FRM_RETURN_IS_XID;
+		if (update_committed)
+			*flags |= FRM_MARK_COMMITTED;
+		xid = update_xid;
+	}
Afaics this will cause HEAP_KEYS_UPDATED to be reset, is that
problematic? I don't really remember what it's needed for TBH...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
On 2013-12-12 18:24:34 -0300, Alvaro Herrera wrote:
+			/*
+			 * It's an update; should we keep it?  If the transaction is known
+			 * aborted then it's okay to ignore it, otherwise not.  (Note this
+			 * is just an optimization and not needed for correctness, so it's
+			 * okay to get this test wrong; for example, in case an updater is
+			 * crashed, or a running transaction in the process of aborting.)
+			 */
+			if (!TransactionIdDidAbort(members[i].xid))
+			{
+				newmembers[nnewmembers++] = members[i];
+				Assert(!TransactionIdIsValid(update_xid));
+
+				/*
+				 * Tell caller to set HEAP_XMAX_COMMITTED hint while we have
+				 * the Xid in cache.  Again, this is just an optimization, so
+				 * it's not a problem if the transaction is still running and
+				 * in the process of committing.
+				 */
+				if (TransactionIdDidCommit(update_xid))
+					update_committed = true;
+
+				update_xid = newmembers[i].xid;
+			}

I still don't think this is ok. Freezing shouldn't set hint bits earlier
than tqual.c does. What's the problem with adding a
!TransactionIdIsInProgress()?
I was supposed to tell you, and evidently forgot, that patch 0001 was
the same as previously submitted, and was modified by the subsequent
patches per the review comments. These comments should already be
handled in the later patches in the series I just posted. The idea was
to spare you reading the whole thing all over again, but evidently that
backfired. I think the new code doesn't suffer from the problem you
mention, nor does the other one that I trimmed out.
+	else if (TransactionIdIsValid(update_xid) && !has_lockers)
+	{
+		/*
+		 * If there's a single member and it's an update, pass it back alone
+		 * without creating a new Multi.  (XXX we could do this when there's a
+		 * single remaining locker, too, but that would complicate the API too
+		 * much; moreover, the case with the single updater is more
+		 * interesting, because those are longer-lived.)
+		 */
+		Assert(nnewmembers == 1);
+		*flags |= FRM_RETURN_IS_XID;
+		if (update_committed)
+			*flags |= FRM_MARK_COMMITTED;
+		xid = update_xid;
+	}

Afaics this will cause HEAP_KEYS_UPDATED to be reset, is that
problematic? I don't really remember what it's needed for TBH...
Hmm, will check that out.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera wrote:
So I think this is the only remaining issue to make this patch
committable (I will address the other points in Andres' email.) Since
there has been no other feedback on this thread, Andres and I discussed
the cache issue a bit over IM and we seem to agree that a patch to
revamp the cache should be a fairly localized change that could be
applied on both 9.3 and master, separately from this fix. Doing cache
deletion seems more invasive, and wouldn't provide better performance anyway.
Here's cache code with LRU superpowers (ahem.)
I settled on 256 as number of entries because it's in the same ballpark
as MaxHeapTuplesPerPage which seems a reasonable guideline to follow.
I considered the idea of avoiding palloc/pfree for cache entries
entirely, instead storing them in a static array which is referenced
from the dlist; unfortunately that doesn't work because each cache entry
is variable size, depending on number of members. We could try to work
around that and allocate a large shared array for members, but that
starts to smell of over-engineering, so I punted.
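The move-to-front-on-hit plus evict-from-tail scheme described above can be sketched outside PostgreSQL. The following is an illustrative, self-contained C sketch; the names (`Cache`, `CacheEnt`, `cache_get`, `cache_put`) are hypothetical and stand in for the ilist-based code in multixact.c:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical fixed-capacity LRU cache keyed by an integer id. */
typedef struct CacheEnt
{
	struct CacheEnt *prev, *next;
	int			key;
} CacheEnt;

typedef struct Cache
{
	CacheEnt   *head, *tail;
	int			nentries;
	int			max_entries;	/* e.g. 256, cf. MAX_CACHE_ENTRIES */
} Cache;

static void
cache_unlink(Cache *c, CacheEnt *e)
{
	if (e->prev) e->prev->next = e->next; else c->head = e->next;
	if (e->next) e->next->prev = e->prev; else c->tail = e->prev;
}

static void
cache_push_head(Cache *c, CacheEnt *e)
{
	e->prev = NULL;
	e->next = c->head;
	if (c->head) c->head->prev = e;
	c->head = e;
	if (!c->tail) c->tail = e;
}

/* On a hit, move the entry to the head so recently-used entries survive. */
static CacheEnt *
cache_get(Cache *c, int key)
{
	for (CacheEnt *e = c->head; e != NULL; e = e->next)
	{
		if (e->key == key)
		{
			/* safe only because we stop iterating here */
			cache_unlink(c, e);
			cache_push_head(c, e);
			return e;
		}
	}
	return NULL;
}

/* On insert past capacity, evict the tail (least recently used). */
static void
cache_put(Cache *c, int key)
{
	CacheEnt   *e = malloc(sizeof(CacheEnt));

	e->key = key;
	cache_push_head(c, e);
	if (c->nentries++ >= c->max_entries)
	{
		CacheEnt   *victim = c->tail;

		cache_unlink(c, victim);
		c->nentries--;
		free(victim);
	}
}
```

The eviction test mirrors the patch's `MXactCacheMembers++ >= MAX_CACHE_ENTRIES` shape; the real code additionally carries a variable-length members array per entry, which is why a static entry array was rejected above.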
I was going to 'perf' this, but then found out that I need to compile my
own linux-tools package for a home-compiled kernel ATM.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
revamp-multixact-cache.patchtext/x-diff; charset=us-asciiDownload
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
***************
*** 72,77 ****
--- 72,78 ----
#include "catalog/pg_type.h"
#include "commands/dbcommands.h"
#include "funcapi.h"
+ #include "lib/ilist.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "postmaster/autovacuum.h"
***************
*** 262,274 **** static MultiXactId *OldestVisibleMXactId;
*/
typedef struct mXactCacheEnt
{
- struct mXactCacheEnt *next;
MultiXactId multi;
int nmembers;
MultiXactMember members[FLEXIBLE_ARRAY_MEMBER];
} mXactCacheEnt;
! static mXactCacheEnt *MXactCache = NULL;
static MemoryContext MXactContext = NULL;
#ifdef MULTIXACT_DEBUG
--- 263,277 ----
*/
typedef struct mXactCacheEnt
{
MultiXactId multi;
int nmembers;
+ dlist_node node;
MultiXactMember members[FLEXIBLE_ARRAY_MEMBER];
} mXactCacheEnt;
! #define MAX_CACHE_ENTRIES 256
! static dlist_head MXactCache = DLIST_STATIC_INIT(MXactCache);
! static int MXactCacheMembers = 0;
static MemoryContext MXactContext = NULL;
#ifdef MULTIXACT_DEBUG
***************
*** 1306,1312 **** mxactMemberComparator(const void *arg1, const void *arg2)
static MultiXactId
mXactCacheGetBySet(int nmembers, MultiXactMember *members)
{
! mXactCacheEnt *entry;
debug_elog3(DEBUG2, "CacheGet: looking for %s",
mxid_to_string(InvalidMultiXactId, nmembers, members));
--- 1309,1315 ----
static MultiXactId
mXactCacheGetBySet(int nmembers, MultiXactMember *members)
{
! dlist_iter iter;
debug_elog3(DEBUG2, "CacheGet: looking for %s",
mxid_to_string(InvalidMultiXactId, nmembers, members));
***************
*** 1314,1321 **** mXactCacheGetBySet(int nmembers, MultiXactMember *members)
/* sort the array so comparison is easy */
qsort(members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
! for (entry = MXactCache; entry != NULL; entry = entry->next)
{
if (entry->nmembers != nmembers)
continue;
--- 1317,1326 ----
/* sort the array so comparison is easy */
qsort(members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
! dlist_foreach(iter, &MXactCache)
{
+ mXactCacheEnt *entry = dlist_container(mXactCacheEnt, node, iter.cur);
+
if (entry->nmembers != nmembers)
continue;
***************
*** 1326,1331 **** mXactCacheGetBySet(int nmembers, MultiXactMember *members)
--- 1331,1337 ----
if (memcmp(members, entry->members, nmembers * sizeof(MultiXactMember)) == 0)
{
debug_elog3(DEBUG2, "CacheGet: found %u", entry->multi);
+ dlist_move_head(&MXactCache, iter.cur);
return entry->multi;
}
}
***************
*** 1345,1356 **** mXactCacheGetBySet(int nmembers, MultiXactMember *members)
static int
mXactCacheGetById(MultiXactId multi, MultiXactMember **members)
{
! mXactCacheEnt *entry;
debug_elog3(DEBUG2, "CacheGet: looking for %u", multi);
! for (entry = MXactCache; entry != NULL; entry = entry->next)
{
if (entry->multi == multi)
{
MultiXactMember *ptr;
--- 1351,1364 ----
static int
mXactCacheGetById(MultiXactId multi, MultiXactMember **members)
{
! dlist_iter iter;
debug_elog3(DEBUG2, "CacheGet: looking for %u", multi);
! dlist_foreach(iter, &MXactCache)
{
+ mXactCacheEnt *entry = dlist_container(mXactCacheEnt, node, iter.cur);
+
if (entry->multi == multi)
{
MultiXactMember *ptr;
***************
*** 1366,1371 **** mXactCacheGetById(MultiXactId multi, MultiXactMember **members)
--- 1374,1382 ----
mxid_to_string(multi,
entry->nmembers,
entry->members));
+
+ dlist_move_head(&MXactCache, iter.cur);
+
return entry->nmembers;
}
}
***************
*** 1409,1416 **** mXactCachePut(MultiXactId multi, int nmembers, MultiXactMember *members)
/* mXactCacheGetBySet assumes the entries are sorted, so sort them */
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
! entry->next = MXactCache;
! MXactCache = entry;
}
static char *
--- 1420,1435 ----
/* mXactCacheGetBySet assumes the entries are sorted, so sort them */
qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);
! dlist_push_head(&MXactCache, &entry->node);
! if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
! {
! dlist_node *node;
!
! node = dlist_tail_node(&MXactCache);
! dlist_delete(dlist_tail_node(&MXactCache));
! MXactCacheMembers--;
! pfree(dlist_container(mXactCacheEnt, node, node));
! }
}
static char *
***************
*** 1485,1491 **** AtEOXact_MultiXact(void)
* a child of TopTransactionContext, we needn't delete it explicitly.
*/
MXactContext = NULL;
! MXactCache = NULL;
}
/*
--- 1504,1511 ----
* a child of TopTransactionContext, we needn't delete it explicitly.
*/
MXactContext = NULL;
! dlist_init(&MXactCache);
! MXactCacheMembers = 0;
}
/*
***************
*** 1551,1557 **** PostPrepare_MultiXact(TransactionId xid)
* Discard the local MultiXactId cache like in AtEOX_MultiXact
*/
MXactContext = NULL;
! MXactCache = NULL;
}
/*
--- 1571,1578 ----
* Discard the local MultiXactId cache like in AtEOX_MultiXact
*/
MXactContext = NULL;
! dlist_init(&MXactCache);
! MXactCacheMembers = 0;
}
/*
On 2013-12-13 13:39:20 -0300, Alvaro Herrera wrote:
Here's cache code with LRU superpowers (ahem.)
Heh.
I settled on 256 as number of entries because it's in the same ballpark
as MaxHeapTuplesPerPage which seems a reasonable guideline to follow.
Sounds ok.
I considered the idea of avoiding palloc/pfree for cache entries
entirely, instead storing them in a static array which is referenced
from the dlist; unfortunately that doesn't work because each cache entry
is variable size, depending on number of members. We could try to work
around that and allocate a large shared array for members, but that
starts to smell of over-engineering, so I punted.
Good plan imo.
*** 1326,1331 **** mXactCacheGetBySet(int nmembers, MultiXactMember *members)
--- 1331,1337 ----
  		if (memcmp(members, entry->members, nmembers * sizeof(MultiXactMember)) == 0)
  		{
  			debug_elog3(DEBUG2, "CacheGet: found %u", entry->multi);
+ 			dlist_move_head(&MXactCache, iter.cur);
  			return entry->multi;
  		}
  	}
That's only possible because we immediately abort the loop, otherwise
we'd corrupt the iterator. Maybe that deserves a comment.
+
+ 			dlist_move_head(&MXactCache, iter.cur);
+
Heh. I forgot that we already had that bit; I was wondering whether you
had forgotten to include it in the patch ;)
  static char *
--- 1420,1435 ----
  	/* mXactCacheGetBySet assumes the entries are sorted, so sort them */
  	qsort(entry->members, nmembers, sizeof(MultiXactMember), mxactMemberComparator);

! 	dlist_push_head(&MXactCache, &entry->node);
! if (MXactCacheMembers++ >= MAX_CACHE_ENTRIES)
! {
! dlist_node *node;
!
! node = dlist_tail_node(&MXactCache);
! dlist_delete(dlist_tail_node(&MXactCache));
! MXactCacheMembers--;
! pfree(dlist_container(mXactCacheEnt, node, node));
! }
}
Duplicate dlist_tail_node(). Maybe add a debug_elog3(.. "CacheGet:
pruning %u from cache")?
I wondered before if we shouldn't introduce a layer above dlists, that
support keeping track of the number of elements, and maybe also have
support for LRU behaviour. Not as a part this patch, just generally.
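Such a counted layer above an intrusive list might look roughly as follows. This is a hedged sketch of the idea only, with illustrative names (`CList`, `CNode`), not PostgreSQL's actual ilist API; the point is that the wrapper, not the caller, maintains the element count (compare MXactCacheMembers above):

```c
#include <assert.h>
#include <stddef.h>

/* Intrusive node: embed this inside the containing struct. */
typedef struct CNode
{
	struct CNode *prev, *next;
} CNode;

/* List head that also tracks the number of elements. */
typedef struct CList
{
	CNode	   *head;
	CNode	   *tail;
	size_t		count;		/* maintained here, not by each caller */
} CList;

static void
clist_push_head(CList *l, CNode *n)
{
	n->prev = NULL;
	n->next = l->head;
	if (l->head) l->head->prev = n; else l->tail = n;
	l->head = n;
	l->count++;
}

static void
clist_delete(CList *l, CNode *n)
{
	if (n->prev) n->prev->next = n->next; else l->head = n->next;
	if (n->next) n->next->prev = n->prev; else l->tail = n->prev;
	l->count--;
}

/* With the count kept here, LRU eviction past a cap becomes trivial. */
static CNode *
clist_pop_tail(CList *l)
{
	CNode	   *n = l->tail;

	if (n)
		clist_delete(l, n);
	return n;
}
```

A caller implementing an LRU cap would then just test `l->count` after each push and call `clist_pop_tail` to evict, instead of carrying its own counter alongside the list.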
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote:
On 2013-12-12 18:24:34 -0300, Alvaro Herrera wrote:
+	else if (TransactionIdIsValid(update_xid) && !has_lockers)
+	{
+		/*
+		 * If there's a single member and it's an update, pass it back alone
+		 * without creating a new Multi.  (XXX we could do this when there's a
+		 * single remaining locker, too, but that would complicate the API too
+		 * much; moreover, the case with the single updater is more
+		 * interesting, because those are longer-lived.)
+		 */
+		Assert(nnewmembers == 1);
+		*flags |= FRM_RETURN_IS_XID;
+		if (update_committed)
+			*flags |= FRM_MARK_COMMITTED;
+		xid = update_xid;
+	}

Afaics this will cause HEAP_KEYS_UPDATED to be reset, is that
problematic? I don't really remember what it's needed for TBH...
So we do reset HEAP_KEYS_UPDATED, and in general that bit seems critical
for several things. So it should be kept when a Xmax is carried over
from the pre-frozen version of the tuple. But while reading through
that, I realize that we should also be keeping HEAP_HOT_UPDATED in that
case. And particularly we should never clear HEAP_ONLY_TUPLE.
So I think heap_execute_freeze_tuple() is wrong, because it's resetting
the whole infomask to zero, and then setting it to only the bits that
heap_prepare_freeze_tuple decided that it needed set. That seems bogus
to me. heap_execute_freeze_tuple() should only clear a certain number
of bits, and then possibly set some of the same bits; but the remaining
flags should remain untouched. So HEAP_KEYS_UPDATED, HEAP_UPDATED and
HEAP_HOT_UPDATED should be untouched by heap_execute_freeze_tuple;
heap_prepare_freeze_tuple needn't worry about querying those bits at
all.
Only when FreezeMultiXactId returns FRM_INVALIDATE_XMAX, and when the
Xmax is not a multi and it gets removed, should those three flags be
removed completely.
HEAP_ONLY_TUPLE should be untouched by the freezing protocol.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2013-12-13 17:08:46 -0300, Alvaro Herrera wrote:
Andres Freund wrote:
Afaics this will cause HEAP_KEYS_UPDATED to be reset, is that
problematic? I don't really remember what it's needed for TBH...

So we do reset HEAP_KEYS_UPDATED, and in general that bit seems critical
for several things. So it should be kept when a Xmax is carried over
from the pre-frozen version of the tuple. But while reading through
that, I realize that we should also be keeping HEAP_HOT_UPDATED in that
case. And particularly we should never clear HEAP_ONLY_TUPLE.
That's only for the multi->plain xid case tho, right?
So I think heap_execute_freeze_tuple() is wrong, because it's resetting
the whole infomask to zero, and then setting it to only the bits that
heap_prepare_freeze_tuple decided that it needed set. That seems bogus
to me. heap_execute_freeze_tuple() should only clear a certain number
of bits, and then possibly set some of the same bits; but the remaining
flags should remain untouched.
Uh, my version and the latest you've sent initially copy the original
infomask to the freeze plan and then manipulate those. That seems fine
to me. Am I missing something?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch wrote:
On Tue, Dec 03, 2013 at 07:26:38PM +0100, Andres Freund wrote:
On 2013-12-03 13:14:38 -0500, Noah Misch wrote:
On Tue, Dec 03, 2013 at 04:37:58PM +0100, Andres Freund wrote:
I currently don't see fixing the erroneous freezing of lockers (not the
updater though) without changing the WAL format or synchronously waiting
for all lockers to end. Which both seem like a no-go?

Not fixing it at all is the real no-go. We'd take both of those undesirables
before just tolerating the lost locks in 9.3.

I think it's changing the WAL format then.
I'd rather have a readily-verifiable fix that changes the WAL format than a
tricky fix that avoids doing so. So, modulo not having seen the change, +1.
I've committed a patch which hopefully fixes the problem using this
approach. Thanks, Noah, for noticing the issue, and thanks, Andres, for
collaboration in getting the code in the right state.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
BTW, there are a a couple of spec files floating around which perhaps we
should consider getting into the source repo (in some cleaned up form).
Noah published one, Andres shared a couple more with me, and I think I
have two more. They can't be made to work in normal circumstances,
because they depend on concurrent server activity. But perhaps we
should add them anyway and perhaps list them in a separate schedule
file, so that any developer interested in messing with this stuff has
them readily available for testing.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services