Lowering the ever-growing heap->pd_lower

Started by Matthias van de Meent · about 5 years ago · 55 messages · pgsql-hackers
#1Matthias van de Meent
boekewurm+postgres@gmail.com

Hi,

The heap AMs' pages only grow their pd_linp array, and never shrink
when trailing entries are marked unused. This means that up to 14% of
free space (=291 unused line pointers) on a page could be unusable for
data storage, which I think is a shame. With a patch in the works that
allows the line pointer array to grow up to one third of the size of
the page [0], it would be quite catastrophic for the available data
space on old-and-often-used pages if this could not ever be reused for
data.

The shrinking of the line pointer array is already common practice in
indexes (in which all LP_UNUSED items are removed), but this specific
implementation cannot be used for heap pages due to ItemId
invalidation. One available implementation, however, is that we
truncate the end of this array, as mentioned in [1]. There was a
warning at the top of PageRepairFragmentation about not removing
unused line pointers, but I believe that was about not removing
_intermediate_ unused line pointers (which would imply moving in-use
line pointers); as far as I know there is nothing that relies on only
growing page->pd_lower, and nothing keeping us from shrinking it
whilst holding a pin on the page.

Please find attached a fairly trivial patch which detects the
trailing unused entries on a page and truncates the pd_linp array
accordingly, effectively freeing 4 bytes per line pointer truncated
away (up to 1164 bytes for pages with MaxHeapTuplesPerPage LP_UNUSED
line pointers).

One unexpected benefit of this patch is that the PD_HAS_FREE_LINES
hint bit can now be false more often, increasing the chance that we
need not check the whole array to find an empty slot.

Note: This does _not_ move valid ItemIds; it only removes invalid
(unused) ItemIds from the end of the space reserved for ItemIds on a
page, keeping valid line pointers intact.

Enjoy,

Matthias van de Meent

[0]: /messages/by-id/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
[1]: /messages/by-id/CAEze2Wjf42g8Ho=YsC_OvyNE_ziM0ZkXg6wd9u5KVc2nTbbYXw@mail.gmail.com

Attachments:

v1-0001-Truncate-a-pages-line-pointer-array-when-it-has-t.patch (text/x-patch; +12 -2)
#2Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Matthias van de Meent (#1)
Re: Lowering the ever-growing heap->pd_lower


For a prior discussion on this topic:

/messages/by-id/2e78013d0709130606l56539755wb9dbe17225ffe90a@mail.gmail.com


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Mark Dilger (#2)
Re: Lowering the ever-growing heap->pd_lower

On Tue, 9 Mar 2021 at 17:21, Mark Dilger <mark.dilger@enterprisedb.com> wrote:

For a prior discussion on this topic:

/messages/by-id/2e78013d0709130606l56539755wb9dbe17225ffe90a@mail.gmail.com

Thanks for the reference! I note that that thread mentions the
old-style VACUUM FULL as a reason why this would be unsafe; that
mechanism was removed quite a few versions ago (in 9.0).

The only two existing mechanisms I could find (in the access/heap
directory) that could possibly fail on shrunken line pointer arrays
are xlog recovery (I do not have enough knowledge of recovery to
determine whether it may touch pages that have shrunken line pointer
arrays, or whether those situations cannot occur because recovery
never uses dirtied pages) and backwards table scans on non-MVCC
snapshots (which would be fixed by the attached patch).

With regards,

Matthias van de Meent

Attachments:

v2-0001-Truncate-a-pages-line-pointer-array-when-it-has-t.patch (text/x-patch; +20 -3)
#4Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#1)
Re: Lowering the ever-growing heap->pd_lower

On Tue, Mar 9, 2021 at 7:13 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

> The shrinking of the line pointer array is already common practice in
> indexes (in which all LP_UNUSED items are removed), but this specific
> implementation cannot be used for heap pages due to ItemId
> invalidation. One available implementation, however, is that we
> truncate the end of this array, as mentioned in [1]. There was a
> warning at the top of PageRepairFragmentation about not removing
> unused line pointers, but I believe that was about not removing
> _intermediate_ unused line pointers (which would imply moving in-use
> line pointers); as far as I know there is nothing that relies on only
> growing page->pd_lower, and nothing keeping us from shrinking it
> whilst holding a pin on the page.

Sounds like a good idea to me.

--
Peter Geoghegan

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matthias van de Meent (#3)
Re: Lowering the ever-growing heap->pd_lower

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

> The only two existing mechanisms that I could find (in the access/heap
> directory) that possibly could fail on shrunken line pointer arrays;
> being xlog recovery (I do not have enough knowledge on recovery to
> determine if that may touch pages that have shrunken line pointer
> arrays, or if those situations won't exist due to never using dirtied
> pages in recovery) and backwards table scans on non-MVCC snapshots
> (which would be fixed in the attached patch).

I think you're not visualizing the problem properly.

The case I was concerned about back when is that there are various bits of
code that may visit a page with a predetermined TID in mind to look at.
An index lookup is an obvious example, and another one is chasing an
update chain's t_ctid link. You might argue that if the tuple was dead
enough to be removed, there should be no such in-flight references to
worry about, but I think such an assumption is unsafe. There is not, for
example, any interlock that ensures that a process that has obtained a TID
from an index will have completed its heap visit before a VACUUM that
subsequently removed that index entry also removes the heap tuple.

So, to accept a patch that shortens the line pointer array, what we need
to do is verify that every such code path checks for an out-of-range
offset before trying to fetch the target line pointer. I believed
back in 2007 that there were, or once had been, code paths that omitted
such a range check, assuming that they could trust the TID they had
gotten from $wherever to point at an extant line pointer array entry.
Maybe things are all good now, but I think you should run around and
examine every place that checks for tuple deadness to see if the offset
it used is known to be within the current page bounds.

regards, tom lane

#6Mark Dilger
mark.dilger@enterprisedb.com
In reply to: Tom Lane (#5)
Re: Lowering the ever-growing heap->pd_lower

On Mar 9, 2021, at 1:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> So, to accept a patch that shortens the line pointer array, what we need
> to do is verify that every such code path checks for an out-of-range
> offset before trying to fetch the target line pointer. I believed
> back in 2007 that there were, or once had been, code paths that omitted
> such a range check, assuming that they could trust the TID they had
> gotten from $wherever to point at an extant line pointer array entry.
> Maybe things are all good now, but I think you should run around and
> examine every place that checks for tuple deadness to see if the offset
> it used is known to be within the current page bounds.

Much as Pavan asked [1], I'm curious how we wouldn't already be in trouble if such code exists? In such a scenario, what stops a dead line pointer from being reused (rather than garbage collected by this patch) prior to such hypothetical code using an outdated TID?

I'm not expressing a view here, just asking questions.

[1]: /messages/by-id/2e78013d0709130832t31244e79k9488a3e4eb00d64c@mail.gmail.com


Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#5)
Re: Lowering the ever-growing heap->pd_lower


It occurs to me that we should also mark the hole in the middle of the
page (which includes the would-be LP_UNUSED line pointers at the end
of the original line pointer array space) as undefined to Valgrind
within PageRepairFragmentation(). This is not to be confused with
marking them inaccessible to Valgrind, which just poisons the bytes.
Rather, it represents that the bytes in question are considered safe
to copy around but not safe to rely on being any particular value. So
Valgrind will complain if the bytes in question influence control
flow, directly or indirectly.

Obviously the code should also be audited. Even then, there may still
be bugs. I think that we need to bite the bullet here -- line pointer
bloat is a significant problem.

--
Peter Geoghegan

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Dilger (#6)
Re: Lowering the ever-growing heap->pd_lower

Mark Dilger <mark.dilger@enterprisedb.com> writes:

> On Mar 9, 2021, at 1:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So, to accept a patch that shortens the line pointer array, what we need
>> to do is verify that every such code path checks for an out-of-range
>> offset before trying to fetch the target line pointer.
>
> Much as Pavan asked [1], I'm curious how we wouldn't already be in trouble if such code exists? In such a scenario, what stops a dead line pointer from being reused (rather than garbage collected by this patch) prior to such hypothetical code using an outdated TID?

The line pointer very well *could* be re-used before the in-flight
reference gets to it. That's okay though, because whatever tuple now
occupies the TID would have to have xmin too new to match the snapshot
that such a reference is scanning with.

(Back when we had non-MVCC snapshots to contend with, a bunch of
additional arm-waving was needed to argue that such situations were
safe. Possibly the proposed change wouldn't have flown back then.)

regards, tom lane

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#7)
Re: Lowering the ever-growing heap->pd_lower

Peter Geoghegan <pg@bowt.ie> writes:

> It occurs to me that we should also mark the hole in the middle of the
> page (which includes the would-be LP_UNUSED line pointers at the end
> of the original line pointer array space) as undefined to Valgrind
> within PageRepairFragmentation().

+1

regards, tom lane

#10Peter Geoghegan
pg@bowt.ie
In reply to: Peter Geoghegan (#7)
Re: Lowering the ever-growing heap->pd_lower

On Tue, Mar 9, 2021 at 1:54 PM Peter Geoghegan <pg@bowt.ie> wrote:

> It occurs to me that we should also mark the hole in the middle of the
> page (which includes the would-be LP_UNUSED line pointers at the end
> of the original line pointer array space) as undefined to Valgrind
> within PageRepairFragmentation().

It would probably also make sense to memset() the space in question to
a sequence of 0x7F bytes in CLOBBER_FREED_MEMORY builds. That isn't
quite as good as what Valgrind will do in some ways, but it has a
major advantage: it will usually visibly break code where the
PageRepairFragmentation() calls made by VACUUM happen to take place
inside another backend.

Valgrind instrumentation of PageRepairFragmentation() along the lines
I've described won't recognize the "hole in the middle of the page"
area as undefined when it was marked undefined in another backend.
It's as if shared memory is private to each process as far as the
memory poisoning/undefined-to-Valgrind stuff is concerned. In other
words, it deals with memory mappings, not memory.

--
Peter Geoghegan

#11Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tom Lane (#5)
Re: Lowering the ever-growing heap->pd_lower

On Tue, 9 Mar 2021 at 22:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Matthias van de Meent <boekewurm+postgres@gmail.com> writes:
>> The only two existing mechanisms that I could find (in the access/heap
>> directory) that possibly could fail on shrunken line pointer arrays;
>> being xlog recovery (I do not have enough knowledge on recovery to
>> determine if that may touch pages that have shrunken line pointer
>> arrays, or if those situations won't exist due to never using dirtied
>> pages in recovery) and backwards table scans on non-MVCC snapshots
>> (which would be fixed in the attached patch).

> I think you're not visualizing the problem properly.
>
> The case I was concerned about back when is that there are various bits of
> code that may visit a page with a predetermined TID in mind to look at.
> An index lookup is an obvious example, and another one is chasing an
> update chain's t_ctid link. You might argue that if the tuple was dead
> enough to be removed, there should be no such in-flight references to
> worry about, but I think such an assumption is unsafe. There is not, for
> example, any interlock that ensures that a process that has obtained a TID
> from an index will have completed its heap visit before a VACUUM that
> subsequently removed that index entry also removes the heap tuple.

I am aware of this problem. I will admit that I may not have detected
all potentially problematic accesses, so I'll show you my work.

> So, to accept a patch that shortens the line pointer array, what we need
> to do is verify that every such code path checks for an out-of-range
> offset before trying to fetch the target line pointer. I believed
> back in 2007 that there were, or once had been, code paths that omitted
> such a range check, assuming that they could trust the TID they had
> gotten from $wherever to point at an extant line pointer array entry.
> Maybe things are all good now, but I think you should run around and
> examine every place that checks for tuple deadness to see if the offset
> it used is known to be within the current page bounds.

In my search for problematic accesses, I made the following assumptions:
* PageRepairFragmentation as found in bufpage is only applicable to
heap pages; this function is not applied to other pages in core
postgres. So, any problems that occur are due to access with an
OffsetNumber > PageGetMaxOffsetNumber.
* Items [on heap pages] are only extracted after using PageGetItemId
for that item on the page, whilst holding a lock.

Under those assumptions, I ran "grep PageGetItemId" over the src
directory. For all 227 results (as of 68b34b23) I checked if the page
accessed (or item accessed thereafter) was a heap page or heap tuple.
After analysis of the relevant references, I had the following results
(attached full report, containing a line with the file & line number,
and code line of the call, followed by a line containing the usage
type):

Count of usage type - usage type
4 - Not a call (comment)
7 - Callsite guarantees bounds
8 - Has assertion ItemIdIsNormal (asserts item is not removed; i.e.
concurrent vacuum should not have been able to remove this item)
39 - Has bound checks
6 - Not a heap page (brin)
1 - Not a heap page (generic index)
24 - Not a heap page (gin)
21 - Not a heap page (gist)
14 - Not a heap page (hash)
60 - Not a heap page (nbtree)
1 - Not a heap page (sequence)
36 - Not a heap page (spgist)
2 - OffsetNumber is generated by PageAddItem
2 - Problem case 1 (table scan)
1 - Problem case 2 (xlog)
1 - Problem case 3 (invalid redirect pointers)

The 3 problem cases were classified based on the origin of the
potentially invalid pointer.

Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup

The table scan maintains a state which contains a page-bound
OffsetNumber, which it uses as a cursor whilst working through the
pages of the relation. In forward scans, the bounds of the page are
re-validated at the start of the 'while (linesleft > 0)' loop at 681,
but for backwards scans this check is invalid, because it incorrectly
assumes that the last OffsetNumber is guaranteed to still exist.

For MVCC snapshots, this is true (the previously returned value must
have been visible in its snapshot, therefore cannot have been vacuumed
because the snapshot is still alive), but non-mvcc snapshots may have
returned a dead tuple, which is now vacuumed and truncated away.

The first occurrence of this issue is easily fixed with the changes as
submitted in patch v2.

The second problem case (on line 811) is for forward scans, where the
line pointer array could have been truncated to 0 length. As the code
uses a hardcoded offset of FirstOffsetNumber (=1), that might point
into arbitrary data. Reading this arbitrary data is prevented by the
'while (linesleft > 0)' check, because at that point linesleft will
equal PageGetMaxOffsetNumber, which is then 0.

Problem case 2: xlog; heapam.c line 8796, in heap_xlog_freeze_page

This is in the replay of transaction logs, in heap_xlog_freeze_page.
As I am unaware whether or not pages to which these transaction logs
are applied can contain changes from the xlog-generating instance, I
flagged this as a potential problem. The application of the xlogs is
probably safe (it assumes the existence of a HeapTupleHeader for that
ItemId), but my limited knowledge put this on the 'potential problem'
list.

Please advise on this; I cannot find the right documentation.

Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
heap_get_root_tuples

The heap pruning mechanism currently assumes that all redirects are
valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
out of the loop, but doesn't actually fail. This code is already
potentially problematic because it has no bounds or sanity checking
for the nextoffnum, but with shrinking pd_linp it would also add the
failure mode of HOT tuples pointing into what is now arbitrary data.

This failure mode is now accompanied by an Assert, which fails when
the redirect is to an invalid OffsetNumber. This does not prevent
arbitrary behaviour when accessing corrupted data in non-assert
builds, but I think that is fine; we already have no guaranteed
behaviour for this failure mode.

I have also searched the contrib modules using the same method; and
all 18 usages seem to be validated correctly.

With regards,

Matthias van de Meent.

Attachments:

heap_page_itemid_analysis.txt (text/plain)
v3-0001-Truncate-a-pages-line-pointer-array-when-it-has-t.patch (application/x-patch; +21 -3)
#12Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#11)
Re: Lowering the ever-growing heap->pd_lower

On Wed, Mar 10, 2021 at 6:01 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

>> The case I was concerned about back when is that there are various bits of
>> code that may visit a page with a predetermined TID in mind to look at.
>> An index lookup is an obvious example, and another one is chasing an
>> update chain's t_ctid link. You might argue that if the tuple was dead
>> enough to be removed, there should be no such in-flight references to
>> worry about, but I think such an assumption is unsafe. There is not, for
>> example, any interlock that ensures that a process that has obtained a TID
>> from an index will have completed its heap visit before a VACUUM that
>> subsequently removed that index entry also removes the heap tuple.
>
> I am aware of this problem. I will admit that I did not detected all
> potentially problematic accesses, so I'll show you my work.

Attached is a trivial rebase of your v3, which I've called v4. I am
interested in getting this patch into Postgres 14.

> In my search for problematic accesses, I make the following assumptions:
> * PageRepairFragmentation as found in bufpage is only applicable to
> heap pages; this function is not applied to other pages in core
> postgres. So, any problems that occur are with due to access with an
> OffsetNumber > PageGetMaxOffsetNumber.
> * Items [on heap pages] are only extracted after using PageGetItemId
> for that item on the page, whilst holding a lock.
>
> The 3 problem cases were classified based on the origin of the
> potentially invalid pointer.
>
> Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup

I think that it boils down to this: 100% of the cases where this could
be a problem involve old TIDs or old line pointers -- in principle
these could be invalidated in some way, like your backwards scan
example. But that's it. Bear in mind that we always call
PageRepairFragmentation() with a super-exclusive lock.

> This is in the replay of transaction logs, in heap_xlog_freeze_page.
> As I am unaware whether or not pages to which these transaction logs
> are applied can contain changes from the xlog-generating instance, I
> flagged this as a potential problem. The application of the xlogs is
> probably safe (it assumes the existence of a HeapTupleHeader for that
> ItemId), but my limited knowledge put this on the 'potential problem'
> list.
>
> Please advise on this; I cannot find the right documentation

Are you aware of wal_consistency_checking?

I think that this should be fine. There are differences that are
possible between a replica and primary, but they're very minor and
don't seem relevant.

> Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
> heap_get_root_tuples
>
> The heap pruning mechanism currently assumes that all redirects are
> valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
> out of the loop, but doesn't actually fail. This code is already
> potentially problematic because it has no bounds or sanity checking
> for the nextoffnum, but with shrinking pd_linp it would also add the
> failure mode of HOT tuples pointing into what is now arbitrary data.

heap_prune_chain() is less trusting than heap_get_root_tuples(),
though -- it doesn't trust redirects (because there is a generic
offnum sanity check at the start of its loop). I think that the
inconsistency between these two functions probably isn't hugely
significant.

Ideally it would be 100% clear which of the defenses in code like this
is merely extra hardening. The assumptions should be formalized. There
is nothing wrong with hardening, but we should know it when we see it.

> This failure mode is now accompanied by an Assert, which fails when
> the redirect is to an invalid OffsetNumber. This is not enough to not
> exhibit arbitrary behaviour when accessing corrupted data with
> non-assert builds, but I think that that is fine; we already do not
> have a guaranteed behaviour for this failure mode.

What about my "set would-be LP_UNUSED space to all-0x7F bytes in
CLOBBER_FREED_MEMORY builds" idea? Did you think about that?

--
Peter Geoghegan

Attachments:

v4-0001-Truncate-a-pages-line-pointer-array-when-it-has-t.patch (application/octet-stream; +23 -3)
#13Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Peter Geoghegan (#12)
Re: Lowering the ever-growing heap->pd_lower

On Wed, 31 Mar 2021 at 05:35, Peter Geoghegan <pg@bowt.ie> wrote:

> On Wed, Mar 10, 2021 at 6:01 AM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
>>> The case I was concerned about back when is that there are various bits of
>>> code that may visit a page with a predetermined TID in mind to look at.
>>> An index lookup is an obvious example, and another one is chasing an
>>> update chain's t_ctid link. You might argue that if the tuple was dead
>>> enough to be removed, there should be no such in-flight references to
>>> worry about, but I think such an assumption is unsafe. There is not, for
>>> example, any interlock that ensures that a process that has obtained a TID
>>> from an index will have completed its heap visit before a VACUUM that
>>> subsequently removed that index entry also removes the heap tuple.
>>
>> I am aware of this problem. I will admit that I did not detected all
>> potentially problematic accesses, so I'll show you my work.
>
> Attached is a trivial rebase of your v3, which I've called v4. I am
> interested in getting this patch into Postgres 14.

Thanks, and that'd be great! PFA v5, which fixes 1 issue later
mentioned, and adds some comments on existing checks that are now in a
critical path.

>> In my search for problematic accesses, I make the following assumptions:
>> * PageRepairFragmentation as found in bufpage is only applicable to
>> heap pages; this function is not applied to other pages in core
>> postgres. So, any problems that occur are with due to access with an
>> OffsetNumber > PageGetMaxOffsetNumber.
>> * Items [on heap pages] are only extracted after using PageGetItemId
>> for that item on the page, whilst holding a lock.
>>
>> The 3 problem cases were classified based on the origin of the
>> potentially invalid pointer.
>>
>> Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup
>
> I think that it boils down to this: 100% of the cases where this could
> be a problem all either involve old TIDs, or old line pointer -- in
> principle these could be invalidated in some way, like your backwards
> scan example. But that's it. Bear in mind that we always call
> PageRepairFragmentation() with a super-exclusive lock.

Yeah, that's the gist of what I found out. All accesses using old line
pointers need revalidation, and there were some cases in which this
was not yet done correctly.

>> This is in the replay of transaction logs, in heap_xlog_freeze_page.
>> As I am unaware whether or not pages to which these transaction logs
>> are applied can contain changes from the xlog-generating instance, I
>> flagged this as a potential problem. The application of the xlogs is
>> probably safe (it assumes the existence of a HeapTupleHeader for that
>> ItemId), but my limited knowledge put this on the 'potential problem'
>> list.
>>
>> Please advise on this; I cannot find the right documentation
>
> Are you aware of wal_consistency_checking?

I was vaguely aware that an option with that name exists, but that was
about the extent. Thanks for pointing me in that direction.

> I think that this should be fine. There are differences that are
> possible between a replica and primary, but they're very minor and
> don't seem relevant.

OK, then I'll assume that WAL replay shouldn't cause problems here.

>> Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
>> heap_get_root_tuples
>>
>> The heap pruning mechanism currently assumes that all redirects are
>> valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
>> out of the loop, but doesn't actually fail. This code is already
>> potentially problematic because it has no bounds or sanity checking
>> for the nextoffnum, but with shrinking pd_linp it would also add the
>> failure mode of HOT tuples pointing into what is now arbitrary data.
>
> heap_prune_chain() is less trusting than heap_get_root_tuples(),
> though -- it doesn't trust redirects (because there is a generic
> offnum sanity check at the start of its loop). I think that the
> inconsistency between these two functions probably isn't hugely
> significant.
>
> Ideally it would be 100% clear which of the defenses in code like this
> is merely extra hardening. The assumptions should be formalized. There
> is nothing wrong with hardening, but we should know it when we see it.

I realized one of my Assert()s was incorrectly asserting an actually
valid page state, so I've updated and documented that case.

>> This failure mode is now accompanied by an Assert, which fails when
>> the redirect is to an invalid OffsetNumber. This is not enough to not
>> exhibit arbitrary behaviour when accessing corrupted data with
>> non-assert builds, but I think that that is fine; we already do not
>> have a guaranteed behaviour for this failure mode.
>
> What about my "set would-be LP_UNUSED space to all-0x7F bytes in
> CLOBBER_FREED_MEMORY builds" idea? Did you think about that?

I had implemented it locally, but was waiting for more feedback
before posting it, and have been busy with other things since; it's
now attached.

I've also played around with marking the free space on the page as
undefined for valgrind, but later realized that that would make the
test for defined memory in PageAddItemExtended fail. This is
documented in the commit message of the attached patch 0002.

With regards,

Matthias van de Meent

Attachments:

v5-0001-Truncate-a-pages-line-pointer-array-when-it-has-t.patch (text/x-patch, +35-3)
v5-0002-Clobber-free-page-space-in-PageRepairFragmentatio.patch (text/x-patch, +6-1)

#14Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#13)
Re: Lowering the ever-growing heap->pd_lower

On Wed, Mar 31, 2021 at 2:49 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

I had implemented it locally, but was waiting for more feedback
before posting it, and have been busy with other things since; it's
now attached.

I've also played around with marking the free space on the page as
undefined for valgrind, but later realized that that would make the
test for defined memory in PageAddItemExtended fail. This is
documented in the commit message of the attached patch 0002.

I would like to deal with this work within the scope of the project
we're discussing over on the "New IndexAM API controlling index vacuum
strategies" thread. The latest revision of that patch series includes
a modified version of your patch:

/messages/by-id/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com

Please take discussion around this project over to that other thread.
There are a variety of issues that can only really be discussed in
that context.

Note that I've revised the patch so that it runs during VACUUM's
second heap pass only -- not during pruning/defragmentation. This
means that the line pointer array truncation mechanism will only ever
kick-in during a VACUUM operation.

--
Peter Geoghegan

#15John Naylor
john.naylor@enterprisedb.com
In reply to: Peter Geoghegan (#14)
Re: Lowering the ever-growing heap->pd_lower

On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan <pg@bowt.ie> wrote:

I would like to deal with this work within the scope of the project
we're discussing over on the "New IndexAM API controlling index vacuum
strategies" thread. The latest revision of that patch series includes
a modified version of your patch:

/messages/by-id/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com

Please take discussion around this project over to that other thread.
There are a variety of issues that can only really be discussed in
that context.

Since that work has been committed as of 3c3b8a4b2689, I've marked this CF
entry as committed.

--
John Naylor
EDB: http://www.enterprisedb.com

#16Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: John Naylor (#15)
Re: Lowering the ever-growing heap->pd_lower

On Mon, 3 May 2021 at 16:26, John Naylor <john.naylor@enterprisedb.com> wrote:

On Sat, Apr 3, 2021 at 10:07 PM Peter Geoghegan <pg@bowt.ie> wrote:

I would like to deal with this work within the scope of the project
we're discussing over on the "New IndexAM API controlling index vacuum
strategies" thread. The latest revision of that patch series includes
a modified version of your patch:

/messages/by-id/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com

Please take discussion around this project over to that other thread.
There are a variety of issues that can only really be discussed in
that context.

Since that work has been committed as of 3c3b8a4b2689, I've marked this CF entry as committed.

I disagree that this work has been fully committed. A derivative was
committed that would solve part of the problem, but it doesn't cover
all problem cases. I believe that I voiced such concern in the other
thread as well. As such, I am planning on fixing this patch sometime
before the next commit fest so that we can truncate the LP array
during hot pruning as well, instead of only doing so in the 2nd VACUUM
pass. This is especially relevant on pages where HOT is highly
effective, but vacuum can't keep up and many unused (former HOT) line
pointers now exist on the page.

With regards,

Matthias van de Meent

#17Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Matthias van de Meent (#16)
Re: Lowering the ever-growing heap->pd_lower

On Mon, 3 May 2021 at 16:39, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

I am planning on fixing this patch sometime
before the next commit fest so that we can truncate the LP array
during hot pruning as well, instead of only doing so in the 2nd VACUUM
pass.

PFA the updated version of this patch. Apart from adding line pointer
truncation in PageRepairFragmentation (as in the earlier patches), I
also altered PageTruncateLinePointerArray to clean up all trailing
line pointers, even if it was the last item on the page.

This means that for 32-bit systems, pages that have once had tuples
(but have been cleared since) can now be used again for
MaxHeapTupleSize insertions. Without this patch, an emptied page would
always have at least one line pointer left, which equates to
MaxHeapTupleSize actual free space, but PageGetFreeSpace always
subtracts sizeof(ItemIdData), leaving the perceived free space as
reported to the FSM less than MaxHeapTupleSize if the page has any
line pointers.

For 64-bit systems, this is not as much of a problem, because
MaxHeapTupleSize is 4 bytes smaller on those systems, which leaves us
with 1 line pointer as margin for the FSM to recognise the page as
free enough for one MaxHeapTupleSize item.

With regards,

Matthias van de Meent

Attachments:

v6-0001-Improve-usage-of-line-pointer-array-truncation-in.patch (text/x-patch, +23-11)

#18Peter Geoghegan
pg@bowt.ie
In reply to: Matthias van de Meent (#17)
Re: Lowering the ever-growing heap->pd_lower

On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

PFA the updated version of this patch. Apart from adding line pointer
truncation in PageRepairFragmentation (as in the earlier patches), I
also altered PageTruncateLinePointerArray to clean up all trailing
line pointers, even if it was the last item on the page.

Can you show a practical benefit to this patch, such as an improvement
in throughput or in efficiency for a given workload?

It was easy to see that having something was better than having
nothing at all. But things are of course different now that we have
PageTruncateLinePointerArray().

--
Peter Geoghegan

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Peter Geoghegan (#18)
Re: Lowering the ever-growing heap->pd_lower

On Tue, 18 May 2021 at 20:33, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

PFA the updated version of this patch. Apart from adding line pointer
truncation in PageRepairFragmentation (as in the earlier patches), I
also altered PageTruncateLinePointerArray to clean up all trailing
line pointers, even if it was the last item on the page.

Can you show a practical benefit to this patch, such as an improvement
in throughput or in efficiency for a given workload?

It was easy to see that having something was better than having
nothing at all. But things are of course different now that we have
PageTruncateLinePointerArray().

There does seem to be utility in Matthias' patch, which currently does
two things:
1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
That is going to have a clear benefit for HOT workloads, which by
their nature will use a lot of line pointers.
Many applications are updated much more frequently than they are vacuumed.
Peter - what is your concern about doing this more frequently? Why
would we *not* do this?

2. Reduce number of line pointers to 0 in some cases.
Matthias - I don't think you've made a full case for doing this, nor
looked at the implications.
The comment clearly says "it seems like a good idea to avoid leaving
a PageIsEmpty() page behind".
So I would be inclined to remove that from the patch and consider that
as a separate issue, or close this.

--
Simon Riggs http://www.EnterpriseDB.com/

#20Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Simon Riggs (#19)
Re: Lowering the ever-growing heap->pd_lower

On Tue, 3 Aug 2021 at 08:57, Simon Riggs <simon.riggs@enterprisedb.com> wrote:

On Tue, 18 May 2021 at 20:33, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

PFA the updated version of this patch. Apart from adding line pointer
truncation in PageRepairFragmentation (as in the earlier patches), I
also altered PageTruncateLinePointerArray to clean up all trailing
line pointers, even if it was the last item on the page.

Can you show a practical benefit to this patch, such as an improvement
in throughput or in efficiency for a given workload?

It was easy to see that having something was better than having
nothing at all. But things are of course different now that we have
PageTruncateLinePointerArray().

There does seem to be utility in Matthias' patch, which currently does
two things:
1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
That is going to have a clear benefit for HOT workloads, which by
their nature will use a lot of line pointers.
Many applications are updated much more frequently than they are vacuumed.
Peter - what is your concern about doing this more frequently? Why
would we *not* do this?

One clear reason why we _do_ want this is that the current shrinking
only happens in the second phase of vacuum. Shrinking the LP array in
heap_page_prune decreases the chance that a tuple which could fit on
the page (thanks to removed HOT chain items) currently doesn't fit
only because vacuum hasn't gotten to the page yet, while adding only
little overhead. Additionally, heap_page_prune is also executed when
more empty space is required on the page for a new tuple that
currently doesn't fit, and in such cases I think clearing as much
space as possible is useful.

2. Reduce number of line pointers to 0 in some cases.
Matthias - I don't think you've made a full case for doing this, nor
looked at the implications.

I have looked at the implications (see upthread), and I haven't found
any implications other than those mentioned below.

The comment clearly says "it seems like a good idea to avoid leaving
a PageIsEmpty() page behind".

Do note that that comment is based on (to the best of my knowledge)
unmeasured, but somewhat informed, guesswork ('it seems like a good
idea'), which I also commented on in the thread discussing the patch
that resulted in that commit [0].

If I recall correctly, the decision to keep at least 1 line pointer on
the page was because this feature was to be committed late in the
development cycle of pg14, and as such there would be little time to
check the impact of fully clearing pages. To go forward with the
feature in pg14 at that point, it was safer to not completely empty
pages, so that we'd not be changing the paths we were hitting during
e.g. vacuum too significantly, reducing the chance of significant
bugs that would require the patch to be reverted [1].

I agreed at that point that that was a safer bet, but right now it's
early in the pg15 development cycle, and I've had the time to get more
experience around the vacuum and line pointer machinery. That being
the case, I consider this a re-visit of the topic 'is it OK to
truncate the LP-array to 0', where previously the answer was 'we don't
know, and it's late in the release cycle', and after looking through
the code base now I argue that the answer is Yes.

One more point for going to 0 is that for 32-bit systems, a single
line pointer is enough to block a page from being 'empty' enough to
fit a MaxHeapTupleSize-sized tuple (when requesting pages through the
FSM).

Additionally, there are some other optimizations we can only apply to
empty pages:

- vacuum (with disable_page_skipping = on) will process these empty
pages faster, as it won't need to do any pruning on that page. With
page skipping enabled this won't matter because empty pages are
all_visible and therefore vacuum won't access that page.
- the pgstattuple contrib extension processes empty pages (slightly)
faster in pgstattuple_approx
- various loops won't need to check whether the remaining item is
unused, saving some cycles when the page is accessed.

and further future optimizations might include

- Full-page WAL logging of empty pages could potentially be optimized
to log only 'this is an empty page' instead of writing out the full
8kB page, which would help reduce WAL volume. Previously this
optimization could never apply to heapam pages, because pages could
not become empty again, but with this patch it has real potential.

Kind regards,

Matthias van de Meent

[0]: /messages/by-id/CAEze2Wh-nXjkp0bLN_vQwgHttC8CRH=1ewcrWk+7RX5B93YQPQ@mail.gmail.com
[1]: /messages/by-id/CAH2-WznCxtWL4B995y2KJWj-+jrjahH4n6gD2R74SyQJo6Y63w@mail.gmail.com
