HOT WIP Patch - version 1

Started by Pavan Deolaseealmost 19 years ago16 messages
#1Pavan Deolasee
pavan.deolasee@gmail.com
1 attachment(s)

This is a WIP patch based on the recent posting by Simon and discussions
thereafter. We are trying to do one piece at a time and intention is to post
the work ASAP so that we could get early and continuous feedback from
the community. We could then incorporate those suggestions in the next
WIP patch.

To start with, this patch implements HOT-update for a simple case
when there is enough free space in the same block so that it can
accommodate the new version of the tuple. A necessary condition for
doing HOT-update is that none of the index columns is changed.
The old version is marked as HEAP_UPDATE_ROOT and the new
version is marked as HEAP_ONLY_TUPLE. If a tuple is HOT-updated,
no new index entry is added.

When fetching a tuple using an index, if the root tuple is not visible to
the given snapshot, the ctid chain is followed until a visible tuple is
found or
end of HOT-update chain is reached. The prior_xmax/next_xmin chain
is validated while following the ctid chain.

This patch is generated on the current CVS head. It passes all the
regression
tests, but I haven't measured any performance impact since thats not the
goal for posting this early version. There are several things that are not
yet implemented and there are few unresolved issues for which I am looking
for community help and feedback.

Open Issues:
------------------

- CREATE INDEX needs more work in the HOT context. The existing HOT
tuples may require chilling for the CREATE INDEX to work correctly. There
are concerns about the crash-safety on chilling operation. Few suggestions
were posted in this regard. We need to conclude that and post a working
design/patch.

- We need to find a way to handle DEAD root tuples, either convert them into
stubs or overwrite them with a new version. We can also perform pointer
swinging from the index. Again there are concerns about crash-safety and
concurrent index-scans working properly. We don't have a community
consensus on any of the suggestions in this regard. But hopefully we
would converge on some design soon.

- Retail VACUUM. We need to implement the block-level vacuum for
UPDATEs to find enough free space in the block to do HOT-update.
Though we are still discussing how to handle the dead root tuples, we
should be able to remove any intermediate dead tuples in the HOT-update
chain safely. If we do so without fixing the root tuple, the
prior_xmax/next_xmin chain would be broken. A similar problem exists
with freezing HOT tuples.

Whats Next:
-----------------

In the current implementation, an HOT-updated tuple can not be vacuumed
because it might be in the middle of the access path to the heap-only
visible tuple.
This can cause the table to grow rapidly even if autovacuum is turned on.
The
HOT-update chain also keeps growing if there is enough free space in the
block.
I am thinking of implementing some sort of HOT-update chain squeezing logic
so that intermediate dead tuples can be retired and vacuumed away. This
would
also help us keep the HOT-update chain small enough so that the chain
following
does not become unduly costly.

I am thinking of squeezing the HOT-update chain while following it in the
index fetch.
If the root tuple is dead, we follow the chain until the first LIVE or
RECENTLY_DEAD tuple is found. The ctid pointer in the root tuple is made
point to the first LIVE or RECENTLY_DEAD tuple. All the intermediate
DEAD tuples are marked ~HEAP_UPDATE_ROOT so that they can be vacuumed
in the next cycle. We hold an exclusive lock on the page while doing so.
That should
avoid any race conditions. This infrastructure should also help us retail
vacuum the
block later.

Please let me know your comments.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com

Attachments:

NewHOT-v1.1-pgsql-head.patch.gzapplication/x-gzip; name=NewHOT-v1.1-pgsql-head.patch.gzDownload
#2Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Pavan Deolasee (#1)
Re: HOT WIP Patch - version 1

Pavan Deolasee wrote:

- We need to find a way to handle DEAD root tuples, either convert them
into
stubs or overwrite them with a new version. We can also perform pointer
swinging from the index. Again there are concerns about crash-safety and
concurrent index-scans working properly. We don't have a community
consensus on any of the suggestions in this regard. But hopefully we
would converge on some design soon.

This seems to be the most fundamental problem we have at the moment. If
we stick to the basic rule we have now that a live tuple's ctid doesn't
change, the only way to get rid of a dead tuple at the root of the
update chain is by changing the index pointer. The backward-pointers
Hannu suggested or the "scan the whole page to find the previous tuple"
would allow reuse of those dead tuples for new tuples in the chain, but
even those methods wouldn't completely eliminate the problem.

We could say that in some scenarios we just leave behind some dead
tuples/stubs that can never be reclaimed. What do you guys think, if we
can bring it down to just an extra line pointer, would that be
acceptable? We could also do that for now and implement the
pointer-swinging later if it turns out to be a problem in practice.

What's the verdict on relaxing the "live tuple's ctid doesn't change
rule"? If we did allow that within a page, what would we need to change?
Inside the backend we'd have to make sure that whenever a ctid is
used, the page is kept pinned. How likely is it that it would brake any
external projects, and how difficult would it be to detect the broken
usage pattern? It would mean that the ctid system column could change
within a transaction, unless we change it so that it returns the ctid of
the root tuple + xmin + cmin, but it'd be a user-visible change anyhow.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#2)
Re: HOT WIP Patch - version 1

Heikki Linnakangas <heikki@enterprisedb.com> writes:

What's the verdict on relaxing the "live tuple's ctid doesn't change
rule"?

I think that's unacceptable; it is known that that will break the ODBC
and JDBC drivers, as well as any other programs that make use of the
ctid for re-finding a tuple they read earlier in the same transaction.
We have not only never deprecated client-side use of ctid for this, but
actively encouraged it, for instance by going out of our way to support
fast access for queries "WHERE ctid = 'constant'".

What's more, your proposal would break plain old UPDATE and DELETE,
as well as SELECT FOR UPDATE, none of which promise to hold a pin
continuously on every page containing a tuple they might decide to
revisit (by ctid) later. Are you prepared to disallow hash join and
sort/merge join in all such queries?

regards, tom lane

#4Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Tom Lane (#3)
Re: HOT WIP Patch - version 1

Tom Lane wrote:

Heikki Linnakangas <heikki@enterprisedb.com> writes:

What's the verdict on relaxing the "live tuple's ctid doesn't change
rule"?

I think that's unacceptable; it is known that that will break the ODBC
and JDBC drivers, as well as any other programs that make use of the
ctid for re-finding a tuple they read earlier in the same transaction.

AFAIK the JDBC driver doesn't use ctid. But ODBC and other programs do.

We have not only never deprecated client-side use of ctid for this, but
actively encouraged it, for instance by going out of our way to support
fast access for queries "WHERE ctid = 'constant'".

The idea I had was to change what the ctid system column returns to
root ctid + xmin + cmin. As long as programs treat the ctid as an opaque
string, it should work. Tid scan would use that to locate the original
tuple.

What's more, your proposal would break plain old UPDATE and DELETE,
as well as SELECT FOR UPDATE, none of which promise to hold a pin
continuously on every page containing a tuple they might decide to
revisit (by ctid) later. Are you prepared to disallow hash join and
sort/merge join in all such queries?

No, of course not. We'd have to do the same thing here; use root tid +
xmin + cmin instead of just ctid.

But now that I think of it, how do we get the root tid of a tuple? I
suppose we'd be back to having backpointers or scanning the whole
page... I guess pointer-swinging it is, then.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#3)
Re: HOT WIP Patch - version 1

On 2/14/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <heikki@enterprisedb.com> writes:

What's the verdict on relaxing the "live tuple's ctid doesn't change
rule"?

I think that's unacceptable; it is known that that will break the ODBC
and JDBC drivers, as well as any other programs that make use of the
ctid for re-finding a tuple they read earlier in the same transaction.
We have not only never deprecated client-side use of ctid for this, but
actively encouraged it, for instance by going out of our way to support
fast access for queries "WHERE ctid = 'constant'".

What's more, your proposal would break plain old UPDATE and DELETE,
as well as SELECT FOR UPDATE, none of which promise to hold a pin
continuously on every page containing a tuple they might decide to
revisit (by ctid) later. Are you prepared to disallow hash join and
sort/merge join in all such queries?

Not that I am suggesting we do this, but I believe we had some solution
to this problem in the earlier version of HOT. The live tuple when
copied-back
to the root tuple, the tuple is marked with a HEAP_COPIED_BACK flag.

HeapTupleSatisfiesUpdate() checks for this flag and if set returns a new
return code, HeapTupleCopiedBack. heap_update() returns the same
to ExecUpdate along with the ctid of the root tuple. The UPDATE/DELETE
operation then retried on the root tuple, very similar to read-committed
update/delete. The xmax of the copied-back tuple is set so that its
not vacuumed away until all the current transactions are completed.

Though I have tested this patch several times and it seems to work fine,
I probably don''t have insight into the code as much others on this list
has. So if someone wants to take a look and see if it would work fine,
I would be more than happy to post the latest HOT patch (older design).

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com

#6Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Heikki Linnakangas (#2)
Re: HOT WIP Patch - version 1

What's the verdict on relaxing the "live tuple's ctid doesn't
change rule"? If we did allow that within a page, what would
we need to change?

I already said this, but why would this need to be visible from the
outside ?

A few assumptions:
no back pointers
indexes only point at slots marked as roots (and non hot tuples)

During vacuum, you swap the tuples and keep a stub at the slot that the
user's ctid might be pointing at. You mark the stub to detect this
situation.
When a select/update by ctid comes along it needs to do one step to the
root
and use that tuple instead.

It needs a second vacuum (or a per page vacuum during update) to remove
the
extra stub when it is dead and not recently dead.

I fail to see the hole.

Andreas

#7Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Heikki Linnakangas (#4)
Re: HOT WIP Patch - version 1

But now that I think of it, how do we get the root tid of a
tuple? I suppose we'd be back to having backpointers or
scanning the whole page... I guess pointer-swinging it is, then.

During vacuum you see a root [stub] not recently dead. You follow
the chain to detect if you find a live tuple that can replace
the root. You replace the root. You replace the original with a stub
that points at the root and mark it recently dead (and HEAP_COPIED_BACK
aka Pavan). ... (see prev post)
No need for anyone but vacuum to find a root.

Andreas

#8Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#2)
Re: HOT WIP Patch - version 1

Just to summarize:

o Every tuple gets a heap ctid
o Only the root tuple gets an index entry
o We can easily remove dead tuples that aren't the root because
by definition, nothing points to them, including backends and
indexes

The problem is that a dead root tuple has to stay around because while
no backends can see it, the index does. We could move a live tuple into
root ctid slot, but if we do that, the live tuple changes its ctid while
it is visible.

Could we insert index tuples for the live tuple and then remove the root
tuple, perhaps later? So basically we break the chain at that time.
The problem there is that we basically have nothing better than what we
have now --- we are just delaying the index insert, and I don't see what
that buys us.

Could a _new_ tuple take over the root tuple slot? It is new, so it
doesn't have a ctid yet to change. I think that means the index walking
could go forward or backward, but on the same page. To illustrate,
with ctid slot numbers:

[1]: root UPDATE 2
[2]: UPDATE 1
[3]: ---------------------------------------------------------------------------

[1]: root UPDATE 2
[2]: UPDATE 1
[3]: ---------------------------------------------------------------------------

[1]: root UPDATE 2
[2]: UPDATE 1
[3]: ---------------------------------------------------------------------------

[1]: root UPDATE 2
[2]: UPDATE 1
[3]: ---------------------------------------------------------------------------

---------------------------------------------------------------------------

Heikki Linnakangas wrote:

Pavan Deolasee wrote:

- We need to find a way to handle DEAD root tuples, either convert them
into
stubs or overwrite them with a new version. We can also perform pointer
swinging from the index. Again there are concerns about crash-safety and
concurrent index-scans working properly. We don't have a community
consensus on any of the suggestions in this regard. But hopefully we
would converge on some design soon.

This seems to be the most fundamental problem we have at the moment. If
we stick to the basic rule we have now that a live tuple's ctid doesn't
change, the only way to get rid of a dead tuple at the root of the
update chain is by changing the index pointer. The backward-pointers
Hannu suggested or the "scan the whole page to find the previous tuple"
would allow reuse of those dead tuples for new tuples in the chain, but
even those methods wouldn't completely eliminate the problem.

We could say that in some scenarios we just leave behind some dead
tuples/stubs that can never be reclaimed. What do you guys think, if we
can bring it down to just an extra line pointer, would that be
acceptable? We could also do that for now and implement the
pointer-swinging later if it turns out to be a problem in practice.

What's the verdict on relaxing the "live tuple's ctid doesn't change
rule"? If we did allow that within a page, what would we need to change?
Inside the backend we'd have to make sure that whenever a ctid is
used, the page is kept pinned. How likely is it that it would brake any
external projects, and how difficult would it be to detect the broken
usage pattern? It would mean that the ctid system column could change
within a transaction, unless we change it so that it returns the ctid of
the root tuple + xmin + cmin, but it'd be a user-visible change anyhow.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#9Noname
mark@mark.mielke.cc
In reply to: Bruce Momjian (#8)
Re: HOT WIP Patch - version 1

On Wed, Feb 14, 2007 at 01:56:03PM -0500, Bruce Momjian wrote:

Could we insert index tuples for the live tuple and then remove the root
tuple, perhaps later? So basically we break the chain at that time.
The problem there is that we basically have nothing better than what we
have now --- we are just delaying the index insert, and I don't see what
that buys us.

At some point - inserting into the block would not be possible, as
there is no free space. Would that be a good time to do the index
insert? Then, a later vacuum would eventually prune out the whole old
chain. As long as vacuuming the intermediate entries in the chain
keeps the block with free space, there is no need to remove the root
tuple. If space ever runs out (vacuum not running frequently enough -
many updates performed in the same interval) - fall back to the
mechanism that is being used today.

I see it buying increased performance for rows that are frequently
updated. If it can delay modifying the indices to only once every 10
or more updates, it seems to me that the improvement should be
significant. Perhaps PostgreSQL could be used for page hit counters
again... :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#10Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Zeugswetter Andreas ADI SD (#6)
Re: HOT WIP Patch - version 1

Zeugswetter Andreas ADI SD wrote:

A few assumptions:
no back pointers
indexes only point at slots marked as roots (and non hot tuples)

During vacuum, you swap the tuples and keep a stub at the slot that the
user's ctid might be pointing at. You mark the stub to detect this
situation.
When a select/update by ctid comes along it needs to do one step to the
root
and use that tuple instead.

As Pavan pointed out, that's more or less what he ended up doing
originally. You need to mark the stub with the current most recent xid,
and wait until that's no longer running. Only after that you can remove
the stub.

It needs a second vacuum (or a per page vacuum during update) to remove
the
extra stub when it is dead and not recently dead.

Requiring two vacuums to remove the tuple sounds bad at first, but it's
actually not so bad since both steps could by done by retail vacuum, or
even normal scans while.

I fail to see the hole.

The only potential problem I can see is how to make sure that a heap
scan or a bitmap heap scan doesn't visit the tuple twice. If we make
sure that the page is scanned in one go while keeping the buffer pinned,
we're good. We already do that except for system catalogs, so I believe
we'd have to forbid hot updates on system tables, like we forbid bitmap
scans.

To me this sounds like the best idea this far.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Zeugswetter Andreas ADI SD
ZeugswetterA@spardat.at
In reply to: Bruce Momjian (#8)
Re: HOT WIP Patch - version 1

Bruce Momjian wrote:

Just to summarize:

o Every tuple gets a heap ctid
o Only the root tuple gets an index entry

o We can easily remove dead tuples that aren't the root because
by definition, nothing points to them, including backends and
indexes

I am still wondering about the "easily" here. Basically this
needs some kind of wal entry to be crash safe.

Else some later tx might reuse the slot:
- some update on page produces page image in wal
- slot removed
- slot reused and comitted
- page not written
- crash
- wal fullpage restores the page to the version before slot
removed
(- would need a wal replay for slot removed from hot chain here)
- wal restores slot reuse, but the slot is now part of a wrong
hot chain
and the chain is broken (unless we have the above step)

Do we have this wal entry ?

The problem is that a dead root tuple has to stay around
because while no backends can see it, the index does. We
could move a live tuple into root ctid slot, but if we do
that, the live tuple changes its ctid while it is visible.

Could we insert index tuples for the live tuple and then
remove the root tuple, perhaps later? So basically we break
the chain at that time.
The problem there is that we basically have nothing better
than what we have now --- we are just delaying the index
insert, and I don't see what that buys us.

yes, not really promising.

Could a _new_ tuple take over the root tuple slot? It is
new, so it doesn't have a ctid yet to change. I think that
means the index walking
could go forward or backward, but on the same page. To illustrate,
with ctid slot numbers:

[1] root INSERT
[2]
[3]

[1] root INSERT
[2] UPDATE
[3]

[1] root INSERT (dead)
[2] UPDATE 1
[3]

[1] root UPDATE 2
[2] UPDATE 1
[3]

Imho no, because that would need a back pointer in [1] to [2] so that
readers arriving at [1] (e.g. through index) can see [2] until [1] is
visible.
I think we don't want backpointers. (rather go the tuple swapping route)

Andreas

#12Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Zeugswetter Andreas ADI SD (#11)
Re: HOT WIP Patch - version 1

Zeugswetter Andreas ADI SD wrote:

I am still wondering about the "easily" here. Basically this
needs some kind of wal entry to be crash safe.

Else some later tx might reuse the slot:
- some update on page produces page image in wal
- slot removed
- slot reused and comitted
- page not written
- crash
- wal fullpage restores the page to the version before slot
removed
(- would need a wal replay for slot removed from hot chain here)
- wal restores slot reuse, but the slot is now part of a wrong
hot chain
and the chain is broken (unless we have the above step)

Do we have this wal entry ?

We already log tuple removals by normal vacuums. We can't use that wal
entry as it is: if a dead tuple is in the middle of an update chain, it
needs to be unlinked from the chain. But I don't see any particular
problem with that, it just needs to be wal logged like every other data
changing operation.

Do we actually ever want to remove dead tuples from the middle of the
chain? If a tuple in the middle of the chain is dead, surely every tuple
before it in the chain is dead as well, and we want to remove them as
well. I'm thinking, removing tuples from the middle of the chain can be
problematic, because we'd need to fiddle with the xmin/xmax of the other
tuples to make them match. Or change the tuple-following logic to not do
the xmin=xmax check, but it's a nice robustness feature.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#12)
Re: HOT WIP Patch - version 1

On 2/15/07, Heikki Linnakangas <heikki@enterprisedb.com> wrote:

Do we actually ever want to remove dead tuples from the middle of the
chain? If a tuple in the middle of the chain is dead, surely every tuple
before it in the chain is dead as well, and we want to remove them as
well. I'm thinking, removing tuples from the middle of the chain can be
problematic, because we'd need to fiddle with the xmin/xmax of the other
tuples to make them match. Or change the tuple-following logic to not do
the xmin=xmax check, but it's a nice robustness feature.

I am precisely working on this right now. In the next patch version that I
intend to send shortly, I am thinking of removing the dead tuples in the
middle of the chain. We don't have agreement on how to deal with the
root tuple, but we can safely remove the intermediate dead tuples and
vacuum them. Also when all the tuples in the chain are dead because the
last tuple is either deleted or COLD updated, the entire chain along with
the root tuple and the index entry can be vacuumed.

The operation must be WAL logged and you caught the xmin/xmax
problem very rightly. One option is to change the xmax of root tuple
to the xmin of the first live/recently-dead tuple, if we remove a set of
intermediate dead tuples. This xmin of the first live/recently-dead tuple
is also the xmax of the last dead tuple we removed and hence must
be older than the oldtestXmin. So assigning that to the root tuple
should not break any visibility rules for the root tuple (it would still
be dead).

Do we see any problem with this ?

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com

#14Hannu Krosing
hannu@skype.net
In reply to: Heikki Linnakangas (#12)
Re: HOT WIP Patch - version 1

Ühel kenal päeval, N, 2007-02-15 kell 10:49, kirjutas Heikki
Linnakangas:

We already log tuple removals by normal vacuums. We can't use that wal
entry as it is: if a dead tuple is in the middle of an update chain, it
needs to be unlinked from the chain. But I don't see any particular
problem with that, it just needs to be wal logged like every other data
changing operation.

Do we actually ever want to remove dead tuples from the middle of the
chain? If a tuple in the middle of the chain is dead, surely every tuple
before it in the chain is dead as well, and we want to remove them as
well. I'm thinking, removing tuples from the middle of the chain can be
problematic, because we'd need to fiddle with the xmin/xmax of the other
tuples to make them match. Or change the tuple-following logic to not do
the xmin=xmax check, but it's a nice robustness feature.

What kind of robustness does it provide ? In other words - what failure
scenario does this guard against ?

I can't see the case where the xmin=xmax check can not succeed, at least
not for same page tuples.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#15Simon Riggs
simon@2ndquadrant.com
In reply to: Hannu Krosing (#14)
Re: HOT WIP Patch - version 1

On Fri, 2007-02-16 at 09:36 +0200, Hannu Krosing wrote:

Ühel kenal päeval, N, 2007-02-15 kell 10:49, kirjutas Heikki
Linnakangas:

We already log tuple removals by normal vacuums. We can't use that wal
entry as it is: if a dead tuple is in the middle of an update chain, it
needs to be unlinked from the chain. But I don't see any particular
problem with that, it just needs to be wal logged like every other data
changing operation.

Do we actually ever want to remove dead tuples from the middle of the
chain? If a tuple in the middle of the chain is dead, surely every tuple
before it in the chain is dead as well, and we want to remove them as
well. I'm thinking, removing tuples from the middle of the chain can be
problematic, because we'd need to fiddle with the xmin/xmax of the other
tuples to make them match. Or change the tuple-following logic to not do
the xmin=xmax check, but it's a nice robustness feature.

What kind of robustness does it provide ? In other words - what failure
scenario does this guard against ?

I can't see the case where the xmin=xmax check can not succeed, at least
not for same page tuples.

The xmin == xmax case was required to support a complex VACUUM FULL
failure case discovered some time ago. I'll post the details.

With HOT, ISTM that testing this becomes even more important in some
form because of the extra complexity we're putting into that part of the
code. We need some way of detecting a corruption introduced by some
weird use case we haven't thought of.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#16Simon Riggs
simon@2ndquadrant.com
In reply to: Heikki Linnakangas (#12)
Re: HOT WIP Patch - version 1

On Thu, 2007-02-15 at 10:49 +0000, Heikki Linnakangas wrote:

Do we actually ever want to remove dead tuples from the middle of the
chain? If a tuple in the middle of the chain is dead, surely every tuple
before it in the chain is dead as well, and we want to remove them as
well. I'm thinking, removing tuples from the middle of the chain can be
problematic, because we'd need to fiddle with the xmin/xmax of the other
tuples to make them match. Or change the tuple-following logic to not do
the xmin=xmax check, but it's a nice robustness feature.

I think the phrase "middle of the chain" is being used in two ways here.

If a tuple in the chain is dead, then all prior tuples are dead also.
That gives us a live portion of the chain and a dead portion of the
chain.

We will only ever need to follow the chain along the live portion, so
breaking the live portion of the chain is not allowable. Breaking the
chain in the dead portion *is* allowable and that is what is being
proposed here. We break the chain rather than simply remove all of it
because there may still be index pointers to some of the dead rows.

Pavan's code does this now.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com