Remaining case where reltuples can become distorted across multiple VACUUM operations
My bugfix commit 74388a1a (which was pushed back in February) added
heuristics to VACUUM's reltuples calculation/estimate. This prevented
VACUUM from distorting our estimate of reltuples over time, across
successive VACUUM operations run against the same table. The problem
was that VACUUM could scan the same single heap page again and again,
while believing it saw a random sample each time. This eventually
leads to a pg_class.reltuples value that is based on the assumption
that every single heap page in the table is just like the heap page
that gets "sampled" again and again. This was always the last heap
page (due to implementation details related to the work done by commit
e8429082), which in practice tend to be particularly poor
representations of the overall reltuples density of tables.
I have discovered a gap in these heuristics: there are remaining cases
where the percentage threshold doesn't prevent reltuples distortion as
intended. It can still happen with tables that are small enough that a
cutoff of 2% of rel_pages is less than a single page, yet still large
enough that vacuumlazy.c will consider it worth its while to skip some
pages using the visibility map. It will typically skip all but the
final heap page from the relation (same as the first time around).
Here is a test case that shows how this can still happen on HEAD (and
in Postgres 15):
regression=# create table foo(bar int);
CREATE TABLE
regression=# insert into foo select i from generate_series(1, 10000) i;
INSERT 0 10000
Now run vacuum verbose against the table several times:
regression=# vacuum verbose foo;
*** SNIP ***
regression=# vacuum verbose foo;
The first vacuum shows "tuples: 0 removed, 10000 remain...", which is
correct. However, each subsequent vacuum verbose revises the estimate
downwards, eventually making pg_class.reltuples significantly
underestimate tuple density (same as the first time around).
Attached patch closes the remaining gap. With the patch applied,
the second and subsequent vacuum verbose operations from the test case
will show that reltuples is still 10000 (it won't ever change). The
patch just extends an old behavior that was applied when scanned_pages
== 0 to cases where scanned_pages <= 1 (unless we happened to scan all
of the relation's pages, of course). It doesn't remove the original
test from commit 74388a1a, which still seems like a good idea to me.
--
Peter Geoghegan
Attachments:
v1-0001-Avoid-reltuples-distortion-in-very-small-tables.patch
From 379ec8259cc3b2f79e397d83370cad44aee5a70c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 5 Aug 2022 16:40:08 -0700
Subject: [PATCH v1] Avoid reltuples distortion in very small tables.
Consistently avoid trusting a sample of only one page at the point that
VACUUM determines a new reltuples for the target table (though not when
the table is <= 1 page in size, since then it's not merely a sample).
This is follow-up work to commit 74388a1a, which added heuristics that
prevented reltuples from becoming distorted by successive VACUUM
operations that only scan one page. The earlier commit failed to
account for remaining cases where VACUUM scans only one page of a table
that is small enough that a single page is greater than 2% of its total
size, yet big enough that VACUUM will skip most of its pages using the
visibility map (vacuumlazy.c won't skip any page when rel_pages is below
the SKIP_PAGES_THRESHOLD threshold for skipping).
---
src/backend/commands/vacuum.c | 26 ++++++++++----------------
1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 8df25f59d..dbdfe8bd2 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1234,31 +1234,25 @@ vac_estimate_reltuples(Relation relation,
if (scanned_pages >= total_pages)
return scanned_tuples;
- /*
- * If scanned_pages is zero but total_pages isn't, keep the existing value
- * of reltuples. (Note: we might be returning -1 in this case.)
- */
- if (scanned_pages == 0)
- return old_rel_tuples;
-
/*
* When successive VACUUM commands scan the same few pages again and
* again, without anything from the table really changing, there is a risk
* that our beliefs about tuple density will gradually become distorted.
- * It's particularly important to avoid becoming confused in this way due
- * to vacuumlazy.c implementation details. For example, the tendency for
- * our caller to always scan the last heap page should not ever cause us
- * to believe that every page in the table must be just like the last
- * page.
+ * This might be caused by vacuumlazy.c implementation details, such as
+ * its tendency to always scan the last heap page. Handle that here.
*
- * We apply a heuristic to avoid these problems: if the relation is
- * exactly the same size as it was at the end of the last VACUUM, and only
- * a few of its pages (less than a quasi-arbitrary threshold of 2%) were
- * scanned by this VACUUM, assume that reltuples has not changed at all.
+ * If the relation is _exactly_ the same size according to the existing
+ * pg_class entry, and only a few of its pages (less than 2%) were
+ * scanned, keep the existing value of reltuples. Also keep the existing
+ * value when only a subset of rel's pages <= a single page were scanned.
+ *
+ * (Note: we might be returning -1 here.)
*/
if (old_rel_pages == total_pages &&
scanned_pages < (double) total_pages * 0.02)
return old_rel_tuples;
+ if (scanned_pages <= 1)
+ return old_rel_tuples;
/*
* If old density is unknown, we can't do much except scale up
--
2.32.0
On Fri, Aug 5, 2022 at 5:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached patch closes the remaining gap. With the patch applied,
the second and subsequent vacuum verbose operations from the test case
will show that reltuples is still 10000 (it won't ever change). The
patch just extends an old behavior that was applied when scanned_pages
== 0 to cases where scanned_pages <= 1 (unless we happened to scan all
of the relation's pages, of course).
My plan is to commit this later in the week, barring objections. Maybe
on Thursday.
--
Peter Geoghegan
On Mon, 8 Aug 2022 at 16:52, Peter Geoghegan <pg@bowt.ie> wrote:
On Fri, Aug 5, 2022 at 5:39 PM Peter Geoghegan <pg@bowt.ie> wrote:
Attached patch closes the remaining gap. With the patch applied,
the second and subsequent vacuum verbose operations from the test case
will show that reltuples is still 10000 (it won't ever change). The
patch just extends an old behavior that was applied when scanned_pages
== 0 to cases where scanned_pages <= 1 (unless we happened to scan all
of the relation's pages, of course).

My plan is to commit this later in the week, barring objections. Maybe
on Thursday.
I do not have intimate knowledge of this code, but shouldn't we also
add some safety guarantees like the following in these blocks? Right
now, we'll keep underestimating the table size even when we know that
the count is incorrect.
if (scanned_tuples > old_rel_tuples)
return some_weighted_scanned_tuples;
Kind regards,
Matthias van de Meent
On Mon, Aug 8, 2022 at 8:14 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I do not have intimate knowledge of this code, but shouldn't we also
add some safety guarantees like the following in these blocks? Right
now, we'll keep underestimating the table size even when we know that
the count is incorrect.

if (scanned_tuples > old_rel_tuples)
    return some_weighted_scanned_tuples;
Not sure what you mean -- we do something very much like that already.
We take the existing tuple density, and assume that that hasn't
changed for any unscanned pages -- that is used to build a total
number of tuples for the unscanned pages. Then we add the number of
live tuples/scanned_tuples that the vacuumlazy.c caller just
encountered on scanned_pages. That's often where the final reltuples
value comes from.
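To illustrate, the extrapolation described above can be sketched like
this (a simplified Python model, not the actual backend C code; the
function name and flooring behavior are illustrative):

```python
def estimate_reltuples(total_pages, scanned_pages, scanned_tuples,
                       old_rel_pages, old_rel_tuples):
    """Simplified sketch of how VACUUM extrapolates a new reltuples."""
    if scanned_pages >= total_pages:
        # Whole table was scanned, so the count is authoritative.
        return scanned_tuples
    # Assume the unscanned pages still have the old tuple density.
    old_density = old_rel_tuples / old_rel_pages
    unscanned_pages = total_pages - scanned_pages
    # Extrapolated total: old density applied to unscanned pages, plus
    # the live tuples the caller just counted on the scanned pages.
    return int(old_density * unscanned_pages + scanned_tuples)
```

For example, a 100-page table previously estimated at 10000 tuples,
where this VACUUM scanned 2 pages containing 180 tuples, would come out
as 98 * 100 + 180 = 9980.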
--
Peter Geoghegan
On Mon, 8 Aug 2022 at 17:26, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Aug 8, 2022 at 8:14 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

I do not have intimate knowledge of this code, but shouldn't we also
add some safety guarantees like the following in these blocks? Right
now, we'll keep underestimating the table size even when we know that
the count is incorrect.

if (scanned_tuples > old_rel_tuples)
    return some_weighted_scanned_tuples;

Not sure what you mean -- we do something very much like that already.
We take the existing tuple density, and assume that that hasn't
changed for any unscanned pages -- that is used to build a total
number of tuples for the unscanned pages. Then we add the number of
live tuples/scanned_tuples that the vacuumlazy.c caller just
encountered on scanned_pages. That's often where the final reltuples
value comes from.
Indeed we often apply this, but not always. It is the default case,
but never applied in the special cases.
For example, if currently the measured 2% of the pages contains more
than 100% of the previous count of tuples, or with your patch the last
page contains more than 100% of the previous count of the tuples, that
new count is ignored, which seems silly considering that the vacuum
count is supposed to be authoritative.
Kind regards,
Matthias van de Meent
On Mon, Aug 8, 2022 at 8:33 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
For example, if currently the measured 2% of the pages contains more
than 100% of the previous count of tuples, or with your patch the last
page contains more than 100% of the previous count of the tuples, that
new count is ignored, which seems silly considering that the vacuum
count is supposed to be authoritative.
The 2% thing is conditioned on the new relpages value precisely
matching the existing relpages from pg_class -- which makes it very
targeted. I don't see why scanned_tuples greatly exceeding the
existing reltuples from pg_class is interesting (any more interesting
than the other way around).
We'll always accept scanned_tuples as authoritative when VACUUM
actually scans all pages, no matter what. Currently it isn't possible
for VACUUM to skip pages in a table that is 32 pages or less in size.
So even the new "single page" thing from the patch cannot matter
there.
--
Peter Geoghegan
On Mon, 8 Aug 2022 at 17:49, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Aug 8, 2022 at 8:33 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

For example, if currently the measured 2% of the pages contains more
than 100% of the previous count of tuples, or with your patch the last
page contains more than 100% of the previous count of the tuples, that
new count is ignored, which seems silly considering that the vacuum
count is supposed to be authoritative.

The 2% thing is conditioned on the new relpages value precisely
matching the existing relpages from pg_class -- which makes it very
matching the existing relpages from pg_class -- which makes it very
targeted. I don't see why scanned_tuples greatly exceeding the
existing reltuples from pg_class is interesting (any more interesting
than the other way around).
Because if a subset of the pages of a relation contains more tuples
than your current total expected tuples in the table, you should
update your expectations regardless of which blocks or which number of
blocks you've scanned - the previous stored value is a strictly worse
estimation than your last measurement.
We'll always accept scanned_tuples as authoritative when VACUUM
actually scans all pages, no matter what. Currently it isn't possible
for VACUUM to skip pages in a table that is 32 pages or less in size.
So even the new "single page" thing from the patch cannot matter
there.
A 33-block relation whose first 32 pages hold 1 tuple each can still
have a last page with 250 tuples, which would be ignored in that
scheme, leaving a total tuple count of 33 or so. Sure, this is an
artificial sample, but you can construct similarly wrong vacuum
samples: Two classes of tuples that have distinct update regimes, one
with 32B-tuples and one with MaxFreeSpaceRequestSize-d tuples. When
you start running VACUUM against these separate classes of updated
blocks you'll see that the relation tuple count will also tend to
1*nblocks, due to the disjoint nature of these updates and the
tendency to ignore all updates to densely packed blocks.
With current code, we ignore the high counts of those often-updated
blocks and expect low density in the relation, precisely because we
ignore areas that are extremely dense and updated in VACUUM cycles
that are different from bloated blocks.
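To put rough numbers on the 33-block example (a hypothetical
illustration of the skew, reusing the same extrapolation arithmetic
that the estimator would otherwise apply):

```python
# 33-page table: 32 pages with 1 tuple each, last page with 250 tuples.
true_total = 32 * 1 + 1 * 250  # 282 tuples actually in the table

# Suppose pg_class currently says 33 tuples over 33 pages, and VACUUM
# scans only the last page.  Under a "keep the old value when only one
# page was scanned" rule, reltuples stays at 33 -- off by a factor of
# roughly 8.5 from the true count.
old_rel_tuples = 33
kept_estimate = old_rel_tuples  # scanned_pages <= 1 -> keep old value

# Extrapolating instead would give: 32 unscanned pages at the old
# density (1 tuple/page) plus the 250 tuples just counted.
extrapolated = int((33 / 33) * 32 + 250)  # 282
```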
Kind regards,
Matthias van de Meent
On Mon, Aug 8, 2022 at 9:17 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Because if a subset of the pages of a relation contains more tuples
than your current total expected tuples in the table, you should
update your expectations regardless of which blocks or which number of
blocks you've scanned - the previous stored value is a strictly worse
estimation than your last measurement.
The previous stored value could be -1, which represents the idea that
we don't know the tuple density yet. So it doesn't necessarily follow
that the new estimate is strictly better, even in this exact scenario.
A 33-block relation with first 32 1-tuple pages is still enough to
have a last page with 250 tuples, which would be ignored in that
scheme and have a total tuple count of 33 or so.
The simple fact is that there is only so much we can do with the
limited information/context that we have. Heuristics are not usually
free of all bias. Often the bias is the whole point -- the goal can be
to make sure that we have the bias that we know we can live with, and
not the opposite bias, which is much worse. Details of which are
usually very domain specific.
I presented my patch with a very simple test case -- a very clear
problem. Can you do the same for this scenario?
I accept that it is possible that we'll keep an old reltuples which is
provably less accurate than doing something with the latest
information from vacuumlazy.c. But the conditions under which this can
happen are *very* narrow. I am not inclined to do anything about it
for that reason.
--
Peter Geoghegan
On Mon, 8 Aug 2022 at 18:48, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Aug 8, 2022 at 9:17 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Because if a subset of the pages of a relation contains more tuples
than your current total expected tuples in the table, you should
update your expectations regardless of which blocks or which number of
blocks you've scanned - the previous stored value is a strictly worse
estimation than your last measurement.

The previous stored value could be -1, which represents the idea that
we don't know the tuple density yet. So it doesn't necessarily follow
that the new estimate is strictly better, even in this exact scenario.

A 33-block relation with first 32 1-tuple pages is still enough to
have a last page with 250 tuples, which would be ignored in that
scheme and have a total tuple count of 33 or so.

The simple fact is that there is only so much we can do with the
limited information/context that we have. Heuristics are not usually
free of all bias. Often the bias is the whole point -- the goal can be
to make sure that we have the bias that we know we can live with, and
not the opposite bias, which is much worse. Details of which are
usually very domain specific.

I presented my patch with a very simple test case -- a very clear
problem. Can you do the same for this scenario?
CREATE TABLE tst (id int primary key generated by default as identity,
payload text) with (fillfactor 50); -- fillfactor to make pages fill
up fast
INSERT INTO tst (payload) select repeat('a', 5000) from
generate_series(1, 32); -- 32 pages filled with large tuples
INSERT INTO tst (payload) select repeat('a', 4); -- small tuple at last page
vacuum (verbose, freeze) tst; -- 33 tuples on 33 pages, with lots of
space left on last page
INSERT INTO tst(payload) select repeat('a', 4) from
generate_series(1,63); -- now, we have 64 tuples on the last page
vacuum verbose tst; -- with your patch it reports only 33 tuples
total, while the single page that was scanned contains 64 tuples, and
the table contains 96 tuples.
I accept that it is possible that we'll keep an old reltuples which is
provably less accurate than doing something with the latest
information from vacuumlazy.c. But the conditions under which this can
happen are *very* narrow. I am not inclined to do anything about it
for that reason.
I think I understand your reasoning, but I don't agree with the
conclusion. The attached patch 0002 does fix that skew too, at what I
consider negligible cost. 0001 is your patch with a new version
number.
I'm fine with your patch as is, but would appreciate it if known
estimation mistakes were also fixed.
An alternative solution could be doing double-vetting, where we ignore
tuples_scanned if <2% of pages AND <2% of previously estimated tuples
were scanned.
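For illustration, that double-vetting condition might look something
like the following sketch (hypothetical; keep_old_estimate and its
signature are made up for this example):

```python
def keep_old_estimate(scanned_pages, total_pages,
                      scanned_tuples, old_rel_tuples):
    """Ignore this VACUUM's count only when the scan covered both
    less than 2% of the pages AND less than 2% of the previously
    estimated tuple count."""
    few_pages = scanned_pages < total_pages * 0.02
    few_tuples = scanned_tuples < old_rel_tuples * 0.02
    return few_pages and few_tuples
```

Under this rule, scanning one page of a 100-page table would still
update reltuples whenever that page held an outsized share of the
previously estimated tuples.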
Kind regards,
Matthias van de Meent
Attachments:
v2-0001-Avoid-reltuples-distortion-in-very-small-tables.patch
From 379ec8259cc3b2f79e397d83370cad44aee5a70c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 5 Aug 2022 16:40:08 -0700
Subject: [PATCH v1] Avoid reltuples distortion in very small tables.
Consistently avoid trusting a sample of only one page at the point that
VACUUM determines a new reltuples for the target table (though not when
the table is <= 1 page in size, since then it's not merely a sample).
This is follow-up work to commit 74388a1a, which added heuristics that
prevented reltuples from becoming distorted by successive VACUUM
operations that only scan one page. The earlier commit failed to
account for remaining cases where VACUUM scans only one page of a table
that is small enough that a single page is greater than 2% of its total
size, yet big enough that VACUUM will skip most of its pages using the
visibility map (vacuumlazy.c won't skip any page when rel_pages is below
the SKIP_PAGES_THRESHOLD threshold for skipping).
---
src/backend/commands/vacuum.c | 26 ++++++++++----------------
1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 8df25f59d..dbdfe8bd2 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1234,31 +1234,25 @@ vac_estimate_reltuples(Relation relation,
if (scanned_pages >= total_pages)
return scanned_tuples;
- /*
- * If scanned_pages is zero but total_pages isn't, keep the existing value
- * of reltuples. (Note: we might be returning -1 in this case.)
- */
- if (scanned_pages == 0)
- return old_rel_tuples;
-
/*
* When successive VACUUM commands scan the same few pages again and
* again, without anything from the table really changing, there is a risk
* that our beliefs about tuple density will gradually become distorted.
- * It's particularly important to avoid becoming confused in this way due
- * to vacuumlazy.c implementation details. For example, the tendency for
- * our caller to always scan the last heap page should not ever cause us
- * to believe that every page in the table must be just like the last
- * page.
+ * This might be caused by vacuumlazy.c implementation details, such as
+ * its tendency to always scan the last heap page. Handle that here.
*
- * We apply a heuristic to avoid these problems: if the relation is
- * exactly the same size as it was at the end of the last VACUUM, and only
- * a few of its pages (less than a quasi-arbitrary threshold of 2%) were
- * scanned by this VACUUM, assume that reltuples has not changed at all.
+ * If the relation is _exactly_ the same size according to the existing
+ * pg_class entry, and only a few of its pages (less than 2%) were
+ * scanned, keep the existing value of reltuples. Also keep the existing
+ * value when only a subset of rel's pages <= a single page were scanned.
+ *
+ * (Note: we might be returning -1 here.)
*/
if (old_rel_pages == total_pages &&
scanned_pages < (double) total_pages * 0.02)
return old_rel_tuples;
+ if (scanned_pages <= 1)
+ return old_rel_tuples;
/*
* If old density is unknown, we can't do much except scale up
--
2.32.0
v2-0002-Avoid-reltuples-distortion-in-very-small-tables.patch
From 0e7b7f2a4189cb98722e0a6b0d4e76e18d980900 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 10 Aug 2022 20:36:25 +0200
Subject: [PATCH v2] Avoid reltuples distortion in very small tables.
It is possible that a small subset of pages contains the overwhelming
majority of all tuples. This leads to skews in gathered statistics if these
dense and bloated pages are vacuumed independently: the 2%-barrier is not
reached when vacuuming dense pages, such that these pages would not
contribute to the tuple count estimate.
This commit fixes (part of) that issue by always updating the tuple count
estimate whenever we encounter more tuples during vacuum than the current
row count estimate.
---
src/backend/commands/vacuum.c | 45 ++++++++++++++++++++++-------------
1 file changed, 29 insertions(+), 16 deletions(-)
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index dbdfe8bd2d..3ae7a89549 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1235,24 +1235,37 @@ vac_estimate_reltuples(Relation relation,
return scanned_tuples;
/*
- * When successive VACUUM commands scan the same few pages again and
- * again, without anything from the table really changing, there is a risk
- * that our beliefs about tuple density will gradually become distorted.
- * This might be caused by vacuumlazy.c implementation details, such as
- * its tendency to always scan the last heap page. Handle that here.
+ * The following block may decide to not update the estimated
+ * count of tuples in the table, most of which are due to skew
+ * in which blocks are scanned in which situations.
*
- * If the relation is _exactly_ the same size according to the existing
- * pg_class entry, and only a few of its pages (less than 2%) were
- * scanned, keep the existing value of reltuples. Also keep the existing
- * value when only a subset of rel's pages <= a single page were scanned.
- *
- * (Note: we might be returning -1 here.)
+ * However, whenever we have a previous estimated count, and
+ * the count in scanned_tuples is strictly larger, then we
+ * must update our estimate, because the previous estimate is
+ * strictly worse than the new count.
*/
- if (old_rel_pages == total_pages &&
- scanned_pages < (double) total_pages * 0.02)
- return old_rel_tuples;
- if (scanned_pages <= 1)
- return old_rel_tuples;
+ if (old_rel_tuples > scanned_tuples || old_rel_tuples < 0)
+ {
+ /*
+ * When successive VACUUM commands scan the same few pages again and
+ * again, without anything from the table really changing, there is a risk
+ * that our beliefs about tuple density will gradually become distorted.
+ * This might be caused by vacuumlazy.c implementation details, such as
+ * its tendency to always scan the last heap page. Handle that here.
+ *
+ * If the relation is _exactly_ the same size according to the existing
+ * pg_class entry, and only a few of its pages (less than 2%) were
+ * scanned, keep the existing value of reltuples. Also keep the existing
+ * value when only a subset of rel's pages <= a single page were scanned.
+ *
+ * (Note: we might be returning -1 here.)
+ */
+ if (old_rel_pages == total_pages &&
+ scanned_pages < (double) total_pages * 0.02)
+ return old_rel_tuples;
+ if (scanned_pages <= 1 && allowSystemTableMods)
+ return old_rel_tuples;
+ }
/*
* If old density is unknown, we can't do much except scale up
--
2.30.2
On Thu, Aug 11, 2022 at 1:48 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
I think I understand your reasoning, but I don't agree with the
conclusion. The attached patch 0002 does fix that skew too, at what I
consider negligible cost. 0001 is your patch with a new version
number.
Your patch added allowSystemTableMods to one of the tests. I guess
that this was an oversight?
I'm fine with your patch as is, but would appreciate it if known
estimate mistakes would also be fixed.
Why do you think that this particular scenario/example deserves
special attention? As I've acknowledged already, it is true that your
scenario is one in which we provably give a less accurate estimate,
based on already-available information. But other than that, I don't
see any underlying principle that would be violated by my original
patch (any kind of principle, held by anybody). reltuples is just an
estimate.
I was thinking of going your way on this, purely because it didn't
seem like there'd be much harm in it (why not just handle your case
and be done with it?). But I don't think that it's a good idea now.
reltuples is usually derived by ANALYZE using a random sample, so the
idea that tuple density can be derived accurately enough from a random
sample is pretty baked in. You're talking about a case where ignoring
just one page ("sampling" all but one of the pages) *isn't* good
enough. It just doesn't seem like something that needs to be addressed
-- it's quite awkward to do so.
Barring any further objections, I plan on committing the original
version tomorrow.
An alternative solution could be doing double-vetting, where we ignore
tuples_scanned if <2% of pages AND <2% of previous estimated tuples
was scanned.
I'm not sure that I've understood you, but I think that you're talking
about remembering more information (in pg_class), which is surely out
of scope for a bug fix.
--
Peter Geoghegan