pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Started by Nonamealmost 16 years ago27 messages
#1Noname
sriggs@postgresql.org

Log Message:
-----------
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.

Modified Files:
--------------
pgsql/doc/src/sgml:
config.sgml (r1.246 -> r1.247)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/config.sgml?r1=1.246&r2=1.247)
pgsql/src/backend/access/nbtree:
nbtpage.c (r1.115 -> r1.116)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/nbtree/nbtpage.c?r1=1.115&r2=1.116)
pgsql/src/backend/access/transam:
xlog.c (r1.364 -> r1.365)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/xlog.c?r1=1.364&r2=1.365)
pgsql/src/backend/utils/misc:
guc.c (r1.536 -> r1.537)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/misc/guc.c?r1=1.536&r2=1.537)
postgresql.conf.sample (r1.274 -> r1.275)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/misc/postgresql.conf.sample?r1=1.274&r2=1.275)
pgsql/src/include/access:
xlog.h (r1.99 -> r1.100)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/access/xlog.h?r1=1.99&r2=1.100)

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#1)
Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

sriggs@postgresql.org (Simon Riggs) writes:

Log Message:
-----------
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.

WTF? Simon, this seems to be getting way way beyond anything the
community has agreed to. Particularly, inventing GUCs is not something
to be doing without consensus.

regards, tom lane

#3Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Tom Lane (#2)
Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Tom Lane wrote:

sriggs@postgresql.org (Simon Riggs) writes:

Log Message:
-----------
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.

WTF? Simon, this seems to be getting way way beyond anything the
community has agreed to. Particularly, inventing GUCs is not something
to be doing without consensus.

yeah actually both of the last commits seem rather strange given the
current discussion on this list I must say...

Stefan

#4Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#2)
Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Fri, 2010-01-29 at 14:04 -0500, Tom Lane wrote:

sriggs@postgresql.org (Simon Riggs) writes:

Log Message:
-----------
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.

WTF? Simon, this seems to be getting way way beyond anything the
community has agreed to. Particularly, inventing GUCs is not something
to be doing without consensus.

If you or anybody else would like me to revoke it then I am happy to do
that, with no problem or argument. I agree it hasn't had discussion,
though am happy to have such a discussion.

The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.

I imagined such a minor addition would pass without comment. This is not
the same patch not-being-discussed on the other thread, which is too
complex for a commit without review. It is a separate change altogether,
although it does relate to conflict processing. In any case I would
never commit a patch while discussion on it continues.

--
Simon Riggs www.2ndQuadrant.com

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#4)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Simon Riggs <simon@2ndQuadrant.com> writes:

On Fri, 2010-01-29 at 14:04 -0500, Tom Lane wrote:

WTF? Simon, this seems to be getting way way beyond anything the
community has agreed to. Particularly, inventing GUCs is not something
to be doing without consensus.

If you or anybody else would like me to revoke it then I am happy to do
that, with no problem or argument. I agree it hasn't had discussion,
though am happy to have such a discussion.

The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.

Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.

regards, tom lane

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#5)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:

The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.

Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.

I will remove the parameter then, keeping the augmentation. That OK?

--
Simon Riggs www.2ndQuadrant.com

#7Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Simon Riggs (#6)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Simon Riggs wrote:

On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:

The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.

Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.

I will remove the parameter then, keeping the augmentation. That OK?

Well how much is the actual hit with this on the master for different
workloads do we have realistic numbers on that? Also how much of an
actual win is it in the other direction - as in under what circumstances
and workloads does it help in avoiding superflous cancelations on the
standby?

Stefan

#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Stefan Kaltenbrunner (#7)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sun, 2010-01-31 at 20:48 +0100, Stefan Kaltenbrunner wrote:

Simon Riggs wrote:

On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:

The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.

Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.

I will remove the parameter then, keeping the augmentation. That OK?

Well how much is the actual hit with this on the master for different
workloads do we have realistic numbers on that? Also how much of an
actual win is it in the other direction - as in under what circumstances
and workloads does it help in avoiding superflous cancelations on the
standby?

At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)

This change improves case (1) in that it will only remove queries that
are older than the oldest snapshot on the primary, should
max_standby_delay be exceeded. Case (2) would have been improved by my
other proposed patch should max_standby_delay be exceeded.

The cost of improving case (1) is that we do the equivalent of taking a
snapshot of the procarray while holding locks on the btree block being
cleaned. That will increase index contention somewhat in applications
that do btree deletes, i.e. non-hot index updates or deletes.

Greg Stark's comments have been that he wants to see max_standby_delay =
-1 put back, which would effecively prevent both (1) and (2) from
occurring. I am working on that now.

--
Simon Riggs www.2ndQuadrant.com

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#8)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Simon Riggs <simon@2ndQuadrant.com> writes:

At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)

That seems seriously horrid. What is the rationale for #2 in
particular? I would hope that at worst this would affect sessions
that are actively competing for the index being cleaned.

regards, tom lane

#10Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#9)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sun, 2010-01-31 at 15:41 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)

That seems seriously horrid. What is the rationale for #2 in
particular? I would hope that at worst this would affect sessions
that are actively competing for the index being cleaned.

That is exactly the feedback I received from many other people and why I
prioritised the relation-specific conflict patch.

It's worse that that because point 2 effects WAL cleanup records for the
heap also.

The rationale is that a session *might* in the future access a table,
and if it did so it would receive the wrong answer *potentially*. This
is the issue I have been discussing for a long time now, in various
forms, starting on-list in Aug 2008.

--
Simon Riggs www.2ndQuadrant.com

#11Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Simon Riggs (#8)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Simon Riggs wrote:

On Sun, 2010-01-31 at 20:48 +0100, Stefan Kaltenbrunner wrote:

Simon Riggs wrote:

On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:

The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.

Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.

I will remove the parameter then, keeping the augmentation. That OK?

Well how much is the actual hit with this on the master for different
workloads do we have realistic numbers on that? Also how much of an
actual win is it in the other direction - as in under what circumstances
and workloads does it help in avoiding superflous cancelations on the
standby?

At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)

This change improves case (1) in that it will only remove queries that
are older than the oldest snapshot on the primary, should
max_standby_delay be exceeded. Case (2) would have been improved by my
other proposed patch should max_standby_delay be exceeded.

The cost of improving case (1) is that we do the equivalent of taking a
snapshot of the procarray while holding locks on the btree block being
cleaned. That will increase index contention somewhat in applications
that do btree deletes, i.e. non-hot index updates or deletes.

hmm makes sense from the HS side - but I think to make a case for a GUC
it would help a lot to see actual numbers(as in benchmarks) showing how
much of a hit it is to the master.

Stefan

#12Simon Riggs
simon@2ndQuadrant.com
In reply to: Stefan Kaltenbrunner (#11)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sun, 2010-01-31 at 22:05 +0100, Stefan Kaltenbrunner wrote:

hmm makes sense from the HS side - but I think to make a case for a GUC
it would help a lot to see actual numbers(as in benchmarks) showing how
much of a hit it is to the master.

Agreed, though my approach to benchmarking was to provide the mechanism
by which it was possible to benchmark. I didn't presume to be able to
cover wide area with benchmarking tests just for this.

--
Simon Riggs www.2ndQuadrant.com

#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#12)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Simon Riggs wrote:

On Sun, 2010-01-31 at 22:05 +0100, Stefan Kaltenbrunner wrote:

hmm makes sense from the HS side - but I think to make a case for a GUC
it would help a lot to see actual numbers(as in benchmarks) showing how
much of a hit it is to the master.

Agreed, though my approach to benchmarking was to provide the mechanism
by which it was possible to benchmark. I didn't presume to be able to
cover wide area with benchmarking tests just for this.

I don't think this patch helps much though. It's setting lastRemovedXid
to GetOldestXmin(), which is still a very pessimistic estimate. In fact,
if there's only one transaction running in the master, it's not any
better than just setting it to InvalidTransactionId and killing all
active queries in the standby.

What we'd want to set it to is the xmin/xmax of the oldest heap tuple
whose pointer was removed from the index page. That could be much much
older than GetOldestXmin(), allowing many more read-only queries to live
in the standby.

IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the standby.
When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one was
newest. That can be pretty expensive, involving random I/O, but it gives
an exact value, and doesn't stress the master at all. And we could skip
it if there's no potentially conflicting read-only queries running.

That seems like the most user-friendly approach to me. Even though
random I/O is expensive, surely it's better than killing queries.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#13)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the standby.
When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one was
newest.

This assumes that the standby can tell which heap tuples are being
removed, which I'm not sure is true --- consider the case where the
deletion record is carrying a page image instead of a list of deleted
tuples. But maybe we could rejigger the representation to make sure
that the info is always available. In general it sounds like a much
nicer answer.

regards, tom lane

#15Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#13)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:

IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the
standby. When replaying the deletion record, the standby could look at
all the heap tuples whose index pointers are being removed, to see
which one was newest. That can be pretty expensive, involving random
I/O, but it gives an exact value, and doesn't stress the master at
all. And we could skip it if there's no potentially conflicting
read-only queries running.

That seems like the most user-friendly approach to me. Even though
random I/O is expensive, surely it's better than killing queries.

Best solution, no time to do it.

Should I revoke this change?

--
Simon Riggs www.2ndQuadrant.com

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#15)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Simon Riggs <simon@2ndQuadrant.com> writes:

Should I revoke this change?

Please. We can always put it back later if nothing better gets
implemented.

regards, tom lane

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#16)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sun, 2010-01-31 at 17:10 -0500, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Should I revoke this change?

Please.

Will do, in morning.

--
Simon Riggs www.2ndQuadrant.com

#18Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#14)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sun, 2010-01-31 at 16:53 -0500, Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the standby.
When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one was
newest.

This assumes that the standby can tell which heap tuples are being
removed, which I'm not sure is true --- consider the case where the
deletion record is carrying a page image instead of a list of deleted
tuples. But maybe we could rejigger the representation to make sure
that the info is always available. In general it sounds like a much
nicer answer.

I'm looking at this now since we definitely need either this or the
previous patch.

XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
be calculated by inspection of heap tuples. XLOG_BTREE_DELETE would have
a heap relid, to allow direct access to the heap without a Relation. We
would check xmin, xmax and xvac. Code will look a little like
heap_fetch() though disimilar enough not to use directly. We would only
look at the tuple header and info.

--
Simon Riggs www.2ndQuadrant.com

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#18)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Simon Riggs wrote:

XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
be calculated by inspection of heap tuples.

You might still want to keep the conservative latestRemovedXid value in
the WAL record to avoid the extra work when latestRemovedXid alone allows.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#19)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Tue, 2010-03-09 at 11:20 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
be calculated by inspection of heap tuples.

You might still want to keep the conservative latestRemovedXid value in
the WAL record to avoid the extra work when latestRemovedXid alone allows.

OK will do.

--
Simon Riggs www.2ndQuadrant.com

#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#13)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:

When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one
was newest.

Long after coding this, I now realise this really is a dumb-ass idea.

There is no spoon. The index tuples did once point at valid heap tuples.
1. heap tuples are deleted
2. heap tuples become dead
3. index tuples can now be marked killed
4. index tuples removed
Heap tuples can be removed at step 2, index tuples can't be removed
until step 4. so the dead index tuples can't be followed reliably to
read the xids. They might be the correct ones, might not, and no way to
tell.

So that puts this back to exactly the place we were on 31 Jan:

On Sun, 2010-01-31 at 17:10 -0500, Tom Lane wrote:

We can always put it back later if nothing better gets
implemented.

So, barring huge injections of insight, I'll be putting back the
generation of OldestXmin() into the btree delete path.

--
Simon Riggs www.2ndQuadrant.com

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#21)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

Simon Riggs <simon@2ndQuadrant.com> writes:

On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:

When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one
was newest.

Long after coding this, I now realise this really is a dumb-ass idea.

There is no spoon. The index tuples did once point at valid heap tuples.
1. heap tuples are deleted
2. heap tuples become dead
3. index tuples can now be marked killed
4. index tuples removed
Heap tuples can be removed at step 2, index tuples can't be removed
until step 4.

Uh, no, heap tuples can't be removed until after all index entries that
are pointing at them have been removed. Please tell me you have not
broken this.

So, barring huge injections of insight, I'll be putting back the
generation of OldestXmin() into the btree delete path.

I stand by my previous comments about where this is going to end up.

regards, tom lane

#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#22)
1 attachment(s)
Re: Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Fri, 2010-03-26 at 16:16 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:

When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one
was newest.

Long after coding this, I now realise this really is a dumb-ass idea.

There is no spoon. The index tuples did once point at valid heap tuples.
1. heap tuples are deleted
2. heap tuples become dead
3. index tuples can now be marked killed
4. index tuples removed
Heap tuples can be removed at step 2, index tuples can't be removed
until step 4.

Uh, no, heap tuples can't be removed until after all index entries that
are pointing at them have been removed. Please tell me you have not
broken this.

Nothing broken.

It appears that in practice many of the index items point to heap items
that are LP_DEAD. So for the purposes of accessing a heap tuple's xmin,
then we're both right. To the current purpose the tuple has been
removed, though you are also right: its stub remains.

So how do I calculate xmin and xmax for an LP_DEAD tuple? Clearly
nothing can be done directly. Is there another way?

A conjecture: if the index items point to a heap tuple that is LP_NORMAL
then we can get the xmin/xmax from there. The xmin/xmax of LP_DEAD items
will always be *earlier* than the latest LP_NORMAL tuple that is being
removed. So as long as I have at least 1 LP_NORMAL heap tuple, then I
can use the latestRemovedXid from that and simply discard the LP_DEAD
items (for the purposes of this calculation). The idea is that whatever
marked those heap tuples LP_DEAD would also have marked the others, if
they were the same or earlier than the LP_DEAD ones.

Do you agree with this conjecture? If you do, then attached patch is
complete.

--
Simon Riggs www.2ndQuadrant.com

Attachments:

derive_latestRemovedXid_from_heap.patchtext/x-patch; charset=UTF-8; name=derive_latestRemovedXid_from_heap.patchDownload
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 57,63 **** static void _bt_findinsertloc(Relation rel,
  				  OffsetNumber *offsetptr,
  				  int keysz,
  				  ScanKey scankey,
! 				  IndexTuple newtup);
  static void _bt_insertonpg(Relation rel, Buffer buf,
  			   BTStack stack,
  			   IndexTuple itup,
--- 57,64 ----
  				  OffsetNumber *offsetptr,
  				  int keysz,
  				  ScanKey scankey,
! 				  IndexTuple newtup,
! 				  Relation heapRel);
  static void _bt_insertonpg(Relation rel, Buffer buf,
  			   BTStack stack,
  			   IndexTuple itup,
***************
*** 78,84 **** static void _bt_pgaddtup(Relation rel, Page page,
  			 OffsetNumber itup_off, const char *where);
  static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
  			int keysz, ScanKey scankey);
! static void _bt_vacuum_one_page(Relation rel, Buffer buffer);
  
  
  /*
--- 79,85 ----
  			 OffsetNumber itup_off, const char *where);
  static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
  			int keysz, ScanKey scankey);
! static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
  
  
  /*
***************
*** 175,181 **** top:
  	if (checkUnique != UNIQUE_CHECK_EXISTING)
  	{
  		/* do the insertion */
! 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup);
  		_bt_insertonpg(rel, buf, stack, itup, offset, false);
  	}
  	else
--- 176,182 ----
  	if (checkUnique != UNIQUE_CHECK_EXISTING)
  	{
  		/* do the insertion */
! 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
  		_bt_insertonpg(rel, buf, stack, itup, offset, false);
  	}
  	else
***************
*** 491,497 **** _bt_findinsertloc(Relation rel,
  				  OffsetNumber *offsetptr,
  				  int keysz,
  				  ScanKey scankey,
! 				  IndexTuple newtup)
  {
  	Buffer		buf = *bufptr;
  	Page		page = BufferGetPage(buf);
--- 492,499 ----
  				  OffsetNumber *offsetptr,
  				  int keysz,
  				  ScanKey scankey,
! 				  IndexTuple newtup,
! 				  Relation heapRel)
  {
  	Buffer		buf = *bufptr;
  	Page		page = BufferGetPage(buf);
***************
*** 556,562 **** _bt_findinsertloc(Relation rel,
  		 */
  		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
  		{
! 			_bt_vacuum_one_page(rel, buf);
  
  			/*
  			 * remember that we vacuumed this page, because that makes the
--- 558,564 ----
  		 */
  		if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
  		{
! 			_bt_vacuum_one_page(rel, buf, heapRel);
  
  			/*
  			 * remember that we vacuumed this page, because that makes the
***************
*** 1998,2004 **** _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
   * super-exclusive "cleanup" lock (see nbtree/README).
   */
  static void
! _bt_vacuum_one_page(Relation rel, Buffer buffer)
  {
  	OffsetNumber deletable[MaxOffsetNumber];
  	int			ndeletable = 0;
--- 2000,2006 ----
   * super-exclusive "cleanup" lock (see nbtree/README).
   */
  static void
! _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
  {
  	OffsetNumber deletable[MaxOffsetNumber];
  	int			ndeletable = 0;
***************
*** 2025,2031 **** _bt_vacuum_one_page(Relation rel, Buffer buffer)
  	}
  
  	if (ndeletable > 0)
! 		_bt_delitems(rel, buffer, deletable, ndeletable, false, 0);
  
  	/*
  	 * Note: if we didn't find any LP_DEAD items, then the page's
--- 2027,2033 ----
  	}
  
  	if (ndeletable > 0)
! 		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
  
  	/*
  	 * Note: if we didn't find any LP_DEAD items, then the page's
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 719,733 **** _bt_page_recyclable(Page page)
   * ensure correct locking.
   */
  void
! _bt_delitems(Relation rel, Buffer buf,
! 			 OffsetNumber *itemnos, int nitems, bool isVacuum,
! 			 BlockNumber lastBlockVacuumed)
  {
  	Page		page = BufferGetPage(buf);
  	BTPageOpaque opaque;
  
- 	Assert(isVacuum || lastBlockVacuumed == 0);
- 
  	/* No ereport(ERROR) until changes are logged */
  	START_CRIT_SECTION();
  
--- 719,730 ----
   * ensure correct locking.
   */
  void
! _bt_delitems_vacuum(Relation rel, Buffer buf,
! 			 OffsetNumber *itemnos, int nitems, BlockNumber lastBlockVacuumed)
  {
  	Page		page = BufferGetPage(buf);
  	BTPageOpaque opaque;
  
  	/* No ereport(ERROR) until changes are logged */
  	START_CRIT_SECTION();
  
***************
*** 759,793 **** _bt_delitems(Relation rel, Buffer buf,
  		XLogRecPtr	recptr;
  		XLogRecData rdata[2];
  
! 		if (isVacuum)
! 		{
! 			xl_btree_vacuum xlrec_vacuum;
! 
! 			xlrec_vacuum.node = rel->rd_node;
! 			xlrec_vacuum.block = BufferGetBlockNumber(buf);
! 
! 			xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
! 			rdata[0].data = (char *) &xlrec_vacuum;
! 			rdata[0].len = SizeOfBtreeVacuum;
! 		}
! 		else
! 		{
! 			xl_btree_delete xlrec_delete;
! 
! 			xlrec_delete.node = rel->rd_node;
! 			xlrec_delete.block = BufferGetBlockNumber(buf);
  
! 			/*
! 			 * XXX: We would like to set an accurate latestRemovedXid, but
! 			 * there is no easy way of obtaining a useful value. So we punt
! 			 * and store InvalidTransactionId, which forces the standby to
! 			 * wait for/cancel all currently running transactions.
! 			 */
! 			xlrec_delete.latestRemovedXid = InvalidTransactionId;
! 			rdata[0].data = (char *) &xlrec_delete;
! 			rdata[0].len = SizeOfBtreeDelete;
! 		}
  
  		rdata[0].buffer = InvalidBuffer;
  		rdata[0].next = &(rdata[1]);
  
--- 756,769 ----
  		XLogRecPtr	recptr;
  		XLogRecData rdata[2];
  
! 		xl_btree_vacuum xlrec_vacuum;
  
! 		xlrec_vacuum.node = rel->rd_node;
! 		xlrec_vacuum.block = BufferGetBlockNumber(buf);
  
+ 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+ 		rdata[0].data = (char *) &xlrec_vacuum;
+ 		rdata[0].len = SizeOfBtreeVacuum;
  		rdata[0].buffer = InvalidBuffer;
  		rdata[0].next = &(rdata[1]);
  
***************
*** 810,819 **** _bt_delitems(Relation rel, Buffer buf,
  		rdata[1].buffer_std = true;
  		rdata[1].next = NULL;
  
! 		if (isVacuum)
! 			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM, rdata);
! 		else
! 			recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);
  
  		PageSetLSN(page, recptr);
  		PageSetTLI(page, ThisTimeLineID);
--- 786,867 ----
  		rdata[1].buffer_std = true;
  		rdata[1].next = NULL;
  
! 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM, rdata);
! 
! 		PageSetLSN(page, recptr);
! 		PageSetTLI(page, ThisTimeLineID);
! 	}
! 
! 	END_CRIT_SECTION();
! }
! 
! void
! _bt_delitems_delete(Relation rel, Buffer buf,
! 			 OffsetNumber *itemnos, int nitems, Relation heapRel)
! {
! 	Page		page = BufferGetPage(buf);
! 	BTPageOpaque opaque;
! 
! 	Assert(nitems > 0);
! 
! 	/* No ereport(ERROR) until changes are logged */
! 	START_CRIT_SECTION();
! 
! 	/* Fix the page */
! 	PageIndexMultiDelete(page, itemnos, nitems);
! 
! 	/*
! 	 * We can clear the vacuum cycle ID since this page has certainly been
! 	 * processed by the current vacuum scan.
! 	 */
! 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
! 	opaque->btpo_cycleid = 0;
! 
! 	/*
! 	 * Mark the page as not containing any LP_DEAD items.  This is not
! 	 * certainly true (there might be some that have recently been marked, but
! 	 * weren't included in our target-item list), but it will almost always be
! 	 * true and it doesn't seem worth an additional page scan to check it.
! 	 * Remember that BTP_HAS_GARBAGE is only a hint anyway.
! 	 */
! 	opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
! 
! 	MarkBufferDirty(buf);
! 
! 	/* XLOG stuff */
! 	if (!rel->rd_istemp)
! 	{
! 		XLogRecPtr	recptr;
! 		XLogRecData rdata[3];
! 
! 		xl_btree_delete xlrec_delete;
! 
! 		xlrec_delete.node = rel->rd_node;
! 		xlrec_delete.hnode = heapRel->rd_node;
! 		xlrec_delete.block = BufferGetBlockNumber(buf);
! 		xlrec_delete.nitems = nitems;
! 
! 		rdata[0].data = (char *) &xlrec_delete;
! 		rdata[0].len = SizeOfBtreeDelete;
! 		rdata[0].buffer = InvalidBuffer;
! 		rdata[0].next = &(rdata[1]);
! 
! 		/*
! 		 * We need the target-offsets array whether or not we store the
! 		 * to allow us to find the latestRemovedXid on a standby server.
! 		 */
! 		rdata[1].data = (char *) itemnos;
! 		rdata[1].len = nitems * sizeof(OffsetNumber);
! 		rdata[1].buffer = InvalidBuffer;
! 		rdata[1].next = &(rdata[2]);
! 
! 		rdata[2].data = NULL;
! 		rdata[2].len = 0;
! 		rdata[2].buffer = buf;
! 		rdata[2].buffer_std = true;
! 		rdata[2].next = NULL;
! 
! 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DELETE, rdata);
  
  		PageSetLSN(page, recptr);
  		PageSetTLI(page, ThisTimeLineID);
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 708,714 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
  		buf = ReadBufferExtended(rel, MAIN_FORKNUM, num_pages - 1, RBM_NORMAL,
  								 info->strategy);
  		LockBufferForCleanup(buf);
! 		_bt_delitems(rel, buf, NULL, 0, true, vstate.lastBlockVacuumed);
  		_bt_relbuf(rel, buf);
  	}
  
--- 708,714 ----
  		buf = ReadBufferExtended(rel, MAIN_FORKNUM, num_pages - 1, RBM_NORMAL,
  								 info->strategy);
  		LockBufferForCleanup(buf);
! 		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
  		_bt_relbuf(rel, buf);
  	}
  
***************
*** 889,895 **** restart:
  		{
  			BlockNumber lastBlockVacuumed = BufferGetBlockNumber(buf);
  
! 			_bt_delitems(rel, buf, deletable, ndeletable, true, vstate->lastBlockVacuumed);
  
  			/*
  			 * Keep track of the block number of the lastBlockVacuumed, so we
--- 889,895 ----
  		{
  			BlockNumber lastBlockVacuumed = BufferGetBlockNumber(buf);
  
! 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable, vstate->lastBlockVacuumed);
  
  			/*
  			 * Keep track of the block number of the lastBlockVacuumed, so we
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
***************
*** 553,558 **** btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
--- 553,672 ----
  	UnlockReleaseBuffer(buffer);
  }
  
+ static TransactionId
+ btree_xlog_delete_get_latestRemovedXid(XLogRecord *record)
+ {
+ 	OffsetNumber 	*unused;
+ 	Buffer			ibuffer, hbuffer;
+ 	Page			ipage, hpage;
+ 	ItemId			iitemid, hitemid;
+ 	IndexTuple		itup;
+ 	HeapTupleHeader htuphdr;
+ 	BlockNumber 	hblkno;
+ 	OffsetNumber 	hoffnum;
+ 	TransactionId	latestRemovedXid = InvalidTransactionId;
+ 	TransactionId	htupxid = InvalidTransactionId;
+ 	int i;
+ 	int num_unused, num_redirect, num_dead;
+ 
+ 	xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
+ 
+ 	/*
+ 	 * Get index page
+ 	 */
+ 	ibuffer = XLogReadBuffer(xlrec->node, xlrec->block, false);
+ 	if (!BufferIsValid(ibuffer))
+ 		return InvalidTransactionId;
+ 	ipage = (Page) BufferGetPage(ibuffer);
+ 
+ 	/*
+ 	 * Loop through the deleted index items to obtain the TransactionId
+ 	 * from the heap items they point to.
+ 	 */
+ 	unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
+ 
+ 	for (i = 0; i < xlrec->nitems; i++)
+ 	{
+ 		/*
+ 		 * Identify the index tuple about to be deleted
+ 		 */
+ 		iitemid = PageGetItemId(ipage, unused[i]);
+ 		itup = (IndexTuple) PageGetItem(ipage, iitemid);
+ 
+ 		hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
+ 		hbuffer = XLogReadBuffer(xlrec->hnode, hblkno, false);
+ 		if (!BufferIsValid(hbuffer))
+ 		{
+ 			UnlockReleaseBuffer(ibuffer);
+ 			return InvalidTransactionId;
+ 		}
+ 		hpage = (Page) BufferGetPage(hbuffer);
+ 
+ 		/*
+ 		 * Look up the heap tuple header that the index tuple points at
+ 		 * by using the heap node supplied with the xlrec. We can't use
+ 		 * heap_fetch, since it uses ReadBuffer rather than XLogReadBuffer.
+ 		 */
+ 		hoffnum = ItemPointerGetOffsetNumber(&(itup->t_tid));
+ 		hitemid = PageGetItemId(hpage, hoffnum);
+ 
+ 		/*
+ 		 * Follow any redirections until we find something useful.
+ 		 */
+ 		while (ItemIdIsRedirected(hitemid))
+ 		{
+ 			num_redirect++;
+ 			hoffnum = ItemIdGetRedirect(hitemid);
+ 			hitemid = PageGetItemId(hpage, hoffnum);
+ 			CHECK_FOR_INTERRUPTS();
+ 		}
+ 
+ 		if (ItemIdHasStorage(hitemid))
+ 		{
+ 			htuphdr = (HeapTupleHeader) PageGetItem(hpage, hitemid);
+ 
+ 			/*
+ 			 * Get the heap tuple's xmin/xmax and ratchet up the latestRemovedXid.
+ 			 */
+ 			htupxid = HeapTupleHeaderGetXmin(htuphdr);
+ 			if (TransactionIdFollows(htupxid, latestRemovedXid))
+ 				latestRemovedXid = htupxid;
+ 
+ 			htupxid = HeapTupleHeaderGetXmax(htuphdr);
+ 			if (TransactionIdFollows(htupxid, latestRemovedXid))
+ 				latestRemovedXid = htupxid;
+ 		}
+ 		else if (ItemIdIsDead(hitemid))
+ 		{
+ 			/*
+ 			 * Conjecture: if hitemid is dead then it had xids before the xids
+ 			 * marked on LP_NORMAL items. So we just ignore this item and move
+ 			 * onto the next, for the purposes of calculating latestRemovedxids.
+ 			 */
+ 			num_dead++;
+ 		}
+ 		else
+ 		{
+ 			Assert(!ItemIdIsUsed(hitemid));
+ 			num_unused++;
+ 		}
+ 
+ 		UnlockReleaseBuffer(hbuffer);
+ 	}
+ 
+ 	UnlockReleaseBuffer(ibuffer);
+ 
+ 	/* debug */
+ 	elog(LOG, "nbtree latestRemovedXid %u nitems %u normal %u dead %u unused %u redirected %u",
+ 					latestRemovedXid,
+ 					xlrec->nitems,
+ 					xlrec->nitems - num_dead - num_redirect - num_unused,
+ 					num_dead, num_unused, num_redirect);
+ 	Assert(num_unused == 0);
+ 
+ 	return latestRemovedXid;
+ }
+ 
  static void
  btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
  {
***************
*** 584,595 **** btree_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
  	if (record->xl_len > SizeOfBtreeDelete)
  	{
  		OffsetNumber *unused;
- 		OffsetNumber *unend;
  
  		unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
- 		unend = (OffsetNumber *) ((char *) xlrec + record->xl_len);
  
! 		PageIndexMultiDelete(page, unused, unend - unused);
  	}
  
  	/*
--- 698,707 ----
  	if (record->xl_len > SizeOfBtreeDelete)
  	{
  		OffsetNumber *unused;
  
  		unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
  
! 		PageIndexMultiDelete(page, unused, xlrec->nitems);
  	}
  
  	/*
***************
*** 830,835 **** btree_redo(XLogRecPtr lsn, XLogRecord *record)
--- 942,948 ----
  				 * from individual btree vacuum records on that index.
  				 */
  				{
+ 					TransactionId latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
  					xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
  
  					/*
***************
*** 839,845 **** btree_redo(XLogRecPtr lsn, XLogRecord *record)
  					 * here is worth some thought and possibly some effort to
  					 * improve.
  					 */
! 					ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node);
  				}
  				break;
  
--- 952,958 ----
  					 * here is worth some thought and possibly some effort to
  					 * improve.
  					 */
! 					ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->node);
  				}
  				break;
  
***************
*** 1012,1021 **** btree_desc(StringInfo buf, uint8 xl_info, char *rec)
  			{
  				xl_btree_delete *xlrec = (xl_btree_delete *) rec;
  
! 				appendStringInfo(buf, "delete: rel %u/%u/%u; blk %u, latestRemovedXid %u",
! 								 xlrec->node.spcNode, xlrec->node.dbNode,
! 								 xlrec->node.relNode, xlrec->block,
! 								 xlrec->latestRemovedXid);
  				break;
  			}
  		case XLOG_BTREE_DELETE_PAGE:
--- 1125,1134 ----
  			{
  				xl_btree_delete *xlrec = (xl_btree_delete *) rec;
  
! 				appendStringInfo(buf, "delete: index %u/%u/%u; iblk %u, heap %u/%u/%u;",
! 								 xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode,
! 								 xlrec->block,
! 								 xlrec->hnode.spcNode, xlrec->hnode.dbNode, xlrec->hnode.relNode);
  				break;
  			}
  		case XLOG_BTREE_DELETE_PAGE:
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 314,327 **** typedef struct xl_btree_split
   */
  typedef struct xl_btree_delete
  {
! 	RelFileNode node;
  	BlockNumber block;
! 	TransactionId latestRemovedXid;
  
  	/* TARGET OFFSET NUMBERS FOLLOW AT THE END */
  } xl_btree_delete;
  
! #define SizeOfBtreeDelete	(offsetof(xl_btree_delete, latestRemovedXid) + sizeof(TransactionId))
  
  /*
   * This is what we need to know about page reuse within btree.
--- 314,328 ----
   */
  typedef struct xl_btree_delete
  {
! 	RelFileNode node;		/* RelFileNode of the index */
  	BlockNumber block;
! 	RelFileNode hnode;		/* RelFileNode of the heap the index currently points at */
! 	int			nitems;
  
  	/* TARGET OFFSET NUMBERS FOLLOW AT THE END */
  } xl_btree_delete;
  
! #define SizeOfBtreeDelete	(offsetof(xl_btree_delete, nitems) + sizeof(int))
  
  /*
   * This is what we need to know about page reuse within btree.
***************
*** 349,361 **** typedef struct xl_btree_reuse_page
   * heap tuples.
   *
   * Any changes to any one block are registered on just one WAL record. All
!  * blocks that we need to run EnsureBlockUnpinned() before we touch the changed
!  * block are also given on this record as a variable length array. The array
!  * is compressed by way of storing an array of block ranges, rather than an
!  * actual array of blockids.
   *
   * Note that the *last* WAL record in any vacuum of an index is allowed to
!  * have numItems == 0. All other WAL records must have numItems > 0.
   */
  typedef struct xl_btree_vacuum
  {
--- 350,361 ----
   * heap tuples.
   *
   * Any changes to any one block are registered on just one WAL record. All
!  * blocks that we need to run EnsureBlockUnpinned() are listed as a block range
!  * starting from the last block vacuumed through until this one. Individual
!  * block numbers aren't given.
   *
   * Note that the *last* WAL record in any vacuum of an index is allowed to
!  * have a zero length array of offsets. Earlier records must have at least one.
   */
  typedef struct xl_btree_vacuum
  {
***************
*** 588,596 **** extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
  extern void _bt_relbuf(Relation rel, Buffer buf);
  extern void _bt_pageinit(Page page, Size size);
  extern bool _bt_page_recyclable(Page page);
! extern void _bt_delitems(Relation rel, Buffer buf,
! 			 OffsetNumber *itemnos, int nitems, bool isVacuum,
! 			 BlockNumber lastBlockVacuumed);
  extern int	_bt_pagedel(Relation rel, Buffer buf, BTStack stack);
  
  /*
--- 588,597 ----
  extern void _bt_relbuf(Relation rel, Buffer buf);
  extern void _bt_pageinit(Page page, Size size);
  extern bool _bt_page_recyclable(Page page);
! extern void _bt_delitems_delete(Relation rel, Buffer buf,
! 			 OffsetNumber *itemnos, int nitems, Relation heapRel);
! extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
! 			 OffsetNumber *itemnos, int nitems, BlockNumber lastBlockVacuumed);
  extern int	_bt_pagedel(Relation rel, Buffer buf, BTStack stack);
  
  /*
#24Greg Stark
gsstark@mit.edu
In reply to: Simon Riggs (#23)
Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sat, Mar 27, 2010 at 10:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

It appears that in practice many of the index items point to heap items
that are LP_DEAD. So for the purposes of accessing a heap tuple's xmin,
then we're both right. To the current purpose the tuple has been
removed, though you are also right: its stub remains.

If we're pruning an index entry to a heap tuple that has been HOT
pruned wouldn't the HOT pruning record have already conflicted with
any queries that could see it?

--
greg

#25Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#24)
Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sat, 2010-03-27 at 19:15 +0000, Greg Stark wrote:

On Sat, Mar 27, 2010 at 10:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

It appears that in practice many of the index items point to heap items
that are LP_DEAD. So for the purposes of accessing a heap tuple's xmin,
then we're both right. To the current purpose the tuple has been
removed, though you are also right: its stub remains.

If we're pruning an index entry to a heap tuple that has been HOT
pruned wouldn't the HOT pruning record have already conflicted with
any queries that could see it?

Quite probably, but a query that started after that record arrived might
slip through. We have to treat each WAL record separately.

Do you agree with the conjecture? That LP_DEAD items can be ignored
because their xid would have been earlier than the latest LP_NORMAL
tuple we find? (on any page).

Or is a slightly less strong condition true: we can ignore LP_DEAD items
on a page that is also referenced by an LP_NORMAL item.

--
Simon Riggs www.2ndQuadrant.com

#26Greg Stark
gsstark@mit.edu
In reply to: Simon Riggs (#25)
Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sat, Mar 27, 2010 at 7:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Sat, 2010-03-27 at 19:15 +0000, Greg Stark wrote:

If we're pruning an index entry to a heap tuple that has been HOT
pruned wouldn't the HOT pruning record have already conflicted with
any queries that could see it?

Quite probably, but a query that started after that record arrived might
slip through. We have to treat each WAL record separately.

Slip through? I'm not following you.

Do you agree with the conjecture? That LP_DEAD items can be ignored
because their xid would have been earlier than the latest LP_NORMAL
tuple we find? (on any page).

Or is a slightly less strong condition true: we can ignore LP_DEAD items
on a page that is also referenced by an LP_NORMAL item.

I don't like having dependencies on the precise logic in vacuum rather
than only on the guarantees that vacuum provides. We want to improve
the logic in vacuum and hot pruning to cover more cases and that will
be harder if there's code elsewhere depending on its simple-minded xid
<= globalxmin test.

--
greg

#27Simon Riggs
simon@2ndQuadrant.com
In reply to: Greg Stark (#26)
Re: [COMMITTERS] pgsql: Augment WAL records for btree delete with GetOldestXmin() to

On Sat, 2010-03-27 at 22:39 +0000, Greg Stark wrote:

On Sat, Mar 27, 2010 at 7:36 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Sat, 2010-03-27 at 19:15 +0000, Greg Stark wrote:

If we're pruning an index entry to a heap tuple that has been HOT
pruned wouldn't the HOT pruning record have already conflicted with
any queries that could see it?

Quite probably, but a query that started after that record arrived might
slip through. We have to treat each WAL record separately.

Slip through? I'm not following you.

No, there is no possibility for it to slip through, you're right. (After
much thinking).

Do you agree with the conjecture? That LP_DEAD items can be ignored
because their xid would have been earlier than the latest LP_NORMAL
tuple we find? (on any page).

Or is a slightly less strong condition true: we can ignore LP_DEAD items
on a page that is also referenced by an LP_NORMAL item.

I don't like having dependencies on the precise logic in vacuum rather
than only on the guarantees that vacuum provides. We want to improve
the logic in vacuum and hot pruning to cover more cases and that will
be harder if there's code elsewhere depending on its simple-minded xid
<= globalxmin test.

Agreed

--
Simon Riggs www.2ndQuadrant.com