Relstats after VACUUM FULL and CLUSTER
Hi all,
I noticed a potential issue with the heap cluster code used by VACUUM FULL
and CLUSTER, but I am not sure so I thought I'd post the question to the
list.
The code in question counts the number of tuples it processes and uses that
count to update reltuples in pg_class. However, the tuple count seems off
sometimes because it includes recently dead tuples (due to updates and
deletes). However, the wrong reltuples count is set, AFAICT, only on tables
that don't have indexes because the cluster code also later rebuilds
indexes which then updates reltuples to the "correct" value.
Does this seem like a bug or is it intentional?
Best regards,
Erik
--
Database Architect, Timescale
Does this seem like a bug or is it intentional?
pg_class.reltuples/relpages are only an estimate as per documentation.
However, I cannot reproduce the situation you are talking about on HEAD.
In the below example, I create a table without indexes, then insert and
delete some rows. run vacuum to update the pg_class.reltuples, then run
another delete to generate some more "recent" dead tuples.
The result shows pg_class.reltuples with the expected value,
but maybe I did not repro the same way you did?
( I am surprised that n_live_tup, n_dead_tup is off and also that
VACUUM FULL does not appear to update the stats in pg_stat_all_tables)
```
postgres=# drop table if exists t;
create table t ( id int );
alter table t set (autovacuum_enabled = off);
insert into t select n from generate_series(1, 1000000) as n;
delete from t where id between 1 and 5000;
vacuum t;
delete from t where id between 5001 and 10000;
select reltuples::int from pg_class where relname = 't';
-- might take a bit of time for n_dead_tup to be set
select n_dead_tup, n_live_tup from pg_stat_all_tables where relname = 't';
DROP TABLE
CREATE TABLE
ALTER TABLE
INSERT 0 1000000
DELETE 5000
VACUUM
DELETE 5000
reltuples
-----------
995000
(1 row)
n_dead_tup | n_live_tup
------------+------------
10000 | 985000
(1 row)
postgres=# VACUUM (verbose, full) t;
INFO: vacuuming "public.t"
INFO: "public.t": found 5000 removable, 990000 nonremovable row
versions in 4425 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 0.79 s, system: 0.02 s, elapsed: 0.86 s.
VACUUM
select reltuples::int from pg_class where relname = 't';
select n_dead_tup from pg_stat_all_tables where relname = 't';
postgres=# select reltuples::int from pg_class where relname = 't';
select n_dead_tup, n_live_tup from pg_stat_all_tables where relname = 't';
reltuples
-----------
990000
(1 row)
postgres=# select n_dead_tup, n_live_tup from pg_stat_all_tables where
relname = 't';
n_dead_tup | n_live_tup
------------+------------
10000 | 985000
(1 row)
--
Sami Imseih
Amazon Web Services (AWS)
Hi Sami,
You need a concurrent transaction to recreate the situation. I am attaching
an isolation test to show the behavior, along with its output file. I ran
it on PostgreSQL 17.4.
The test has two permutations, the first one runs on a table without an
index and the second permutation with an index added. In the output file
you can see that the VACUUM FULL on the index-less table produces reltuples
count that includes all updated tuples + the old/garbage tuples. In other
words, it counts all tuples visible to any currently ongoing transaction.
If the table has an index the behavior is different because the reindex
that happens as the last step of vacuum full overwrites the first reltuples
count with the "correct" number (as visible by the transaction snapshot).
Best,
Erik
On Thu, May 22, 2025 at 5:04 PM Sami Imseih <samimseih@gmail.com> wrote:
Show quoted text
Does this seem like a bug or is it intentional?
pg_class.reltuples/relpages are only an estimate as per documentation.
However, I cannot reproduce the situation you are talking about on HEAD.
In the below example, I create a table without indexes, then insert and
delete some rows. run vacuum to update the pg_class.reltuples, then run
another delete to generate some more "recent" dead tuples.The result shows pg_class.reltuples with the expected value,
but maybe I did not repro the same way you did?( I am surprised that n_live_tup, n_dead_tup is off and also that
VACUUM FULL does not appear to update the stats in pg_stat_all_tables)```
postgres=# drop table if exists t;
create table t ( id int );
alter table t set (autovacuum_enabled = off);
insert into t select n from generate_series(1, 1000000) as n;
delete from t where id between 1 and 5000;
vacuum t;
delete from t where id between 5001 and 10000;
select reltuples::int from pg_class where relname = 't';
-- might take a bit of time for n_dead_tup to be set
select n_dead_tup, n_live_tup from pg_stat_all_tables where relname = 't';
DROP TABLE
CREATE TABLE
ALTER TABLE
INSERT 0 1000000
DELETE 5000
VACUUM
DELETE 5000
reltuples
-----------
995000
(1 row)n_dead_tup | n_live_tup
------------+------------
10000 | 985000
(1 row)postgres=# VACUUM (verbose, full) t;
INFO: vacuuming "public.t"
INFO: "public.t": found 5000 removable, 990000 nonremovable row
versions in 4425 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 0.79 s, system: 0.02 s, elapsed: 0.86 s.
VACUUMselect reltuples::int from pg_class where relname = 't';
select n_dead_tup from pg_stat_all_tables where relname = 't';postgres=# select reltuples::int from pg_class where relname = 't';
select n_dead_tup, n_live_tup from pg_stat_all_tables where relname = 't';
reltuples
-----------
990000
(1 row)postgres=# select n_dead_tup, n_live_tup from pg_stat_all_tables where
relname = 't';
n_dead_tup | n_live_tup
------------+------------
10000 | 985000
(1 row)--
Sami Imseih
Amazon Web Services (AWS)
You need a concurrent transaction to recreate the situation. I am attaching an isolation test to show the behavior,
Thanks! That helps.
Indeed heapam_relation_copy_for_cluster and heapam_index_build_range_scan
are counting HEAPTUPLE_RECENTLY_DEAD ( tuples removed but cannot be removed )
different. In heapam_relation_copy_for_cluster they are considered live and in
heapam_index_build_range_scan they are considered dead.
in heapam_relation_copy_for_cluster
```
case HEAPTUPLE_RECENTLY_DEAD:
*tups_recently_dead += 1;
/* fall through */
case HEAPTUPLE_LIVE:
/* Live or recently dead, must copy it */
isdead = false;
break;
```
In both cases, he recently dead tuples must be copied to the table or index, but
they should not be counted towards reltuples. So, I think we need to fix this in
heapam_relation_copy_for_cluster by probably subtracting
tups_recently_dead from num_tuples ( which is the value set in
pg_class.reltuples )
after we process all the tuples, which looks like the best fix to me.
--
Sami Imseih
Amazon Web Services (AWS)
In both cases, he recently dead tuples must be copied to the table or index, but
they should not be counted towards reltuples. So, I think we need to fix this in
heapam_relation_copy_for_cluster by probably subtracting
tups_recently_dead from num_tuples ( which is the value set in
pg_class.reltuples )
after we process all the tuples, which looks like the best fix to me.
something like the attached.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
v1-0001-Correct-reltuples-count-after-a-VACUUM-CLUSTER-op.patchapplication/octet-stream; name=v1-0001-Correct-reltuples-count-after-a-VACUUM-CLUSTER-op.patchDownload
From c19b7b82d7b0e690f3f4f55af4cbd0d6327cece1 Mon Sep 17 00:00:00 2001
From: Ubuntu <ubuntu@ip-172-31-38-230.ec2.internal>
Date: Thu, 22 May 2025 20:35:06 +0000
Subject: [PATCH v1 1/1] Correct reltuples count after a VACUUM/CLUSTER
operation
During a VACUUM FULL or CLUSTER operation, a recently dead row
is counted as a live tuple, which causes the value of pg_class.reltuples
to be higher than it should be. This differs from an index build or a
regular vacuum, which does not count recently dead tuples as live.
Discussion: https://www.postgresql.org/message-id/flat/CACAa4VKmyz51BFAFZyKGBt9-yt8Vyk09s_26%2B%2BFHy2ngfb9k0Q%40mail.gmail.com
---
src/backend/access/heap/heapam_handler.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..a072bcdfaa1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -936,6 +936,13 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
}
}
+ /*
+ * We want to advertise a recently dead tuple as having been scanned,
+ * but it should not be counted as a live tuple in pg_class.reltuples,
+ * so we adjust the value here.
+ */
+ *num_tuples = *num_tuples - *tups_recently_dead;
+
if (indexScan != NULL)
index_endscan(indexScan);
if (tableScan != NULL)
--
2.43.0
Hi Sami,
Your patch should correct the problem. However, given that this function is
part of the tableam API, I am wondering if the fix shouldn't be outside
heap's copy_for_cluster implementation? I guess it depends on the semantics
of num_tuples, but the cluster code seems to allude to interpreting
num_tuples as the number of non-removable tuples. If you subtract recently
dead from that number within the heap implementation, then it will no
longer reflect non-removable tuples and the log message in the cluster
function "found %.0f removable, %.0f nonremovable row versions" will no
longer be correct.
Surprisingly, tableam.h does not document the num_tuples parameter in the
table_relation_copy_for_cluster() function although all other output
parameters are documented. So, it is not clear what the intended semantics
are. Maybe other hackers on the mailing list have opinions on how to
interpret num_tuples?
In any case, assuming num_tuples is supposed to return non-removable
tuples, then the fix should be to subtract recently dead tuples when
updating pg_class.reltuples. Other TAM's need to treat num_tuples as
non-removable tuples as well, and update recently dead if applicable.
I am attaching a patch with these changes, while also including the
isolation test in that patch.
Regards,
Erik
On Thu, May 22, 2025 at 10:42 PM Sami Imseih <samimseih@gmail.com> wrote:
Show quoted text
In both cases, he recently dead tuples must be copied to the table or
index, but
they should not be counted towards reltuples. So, I think we need to fix
this in
heapam_relation_copy_for_cluster by probably subtracting
tups_recently_dead from num_tuples ( which is the value set in
pg_class.reltuples )
after we process all the tuples, which looks like the best fix to me.something like the attached.
--
Sami Imseih
Amazon Web Services (AWS)
Attachments:
v1-0001-Fix-reltuples-stats-after-VACUUM-FULL-and-CLUSTER.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Fix-reltuples-stats-after-VACUUM-FULL-and-CLUSTER.patchDownload
From 1a5814f794c71d6a6c0b100e71289dece7136a89 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Erik=20Nordstr=C3=B6m?= <erik.nordstrom@gmail.com>
Date: Fri, 23 May 2025 14:54:56 +0200
Subject: [PATCH v1] Fix reltuples stats after VACUUM FULL and CLUSTER
The tableam API's function for clustering a table, used by both
CLUSTER and VACUUM FULL, returns the number of non-removable tuples in
a num_tuples variable. This number includes dead tuples that cannot be
removed because they are still visible to concurrent
transactions. Previously, num_tuples was used to update
pg_class.reltuples, but this could lead to widely misleading numbers
if many tuples had been deleted or updated concurrently with ongoing
transactions. The fix is to not include the recently dead tuples.
It should be noted that this issue only occurred on tables that didn't
have any indexes, since the reindexing that happens as part of the
cluster code also updates reltuples, but using the more accurate
number based on currently visible tuples.
---
src/backend/commands/cluster.c | 2 +-
src/include/access/tableam.h | 1 +
.../isolation/expected/vacuum-full-stats.out | 72 +++++++++++++++++++
src/test/isolation/isolation_schedule | 1 +
.../isolation/specs/vacuum-full-stats.spec | 42 +++++++++++
5 files changed, 117 insertions(+), 1 deletion(-)
create mode 100644 src/test/isolation/expected/vacuum-full-stats.out
create mode 100644 src/test/isolation/specs/vacuum-full-stats.spec
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 54a08e4102e..74e445d9a67 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1017,7 +1017,7 @@ copy_table_data(Relation NewHeap, Relation OldHeap, Relation OldIndex, bool verb
relform = (Form_pg_class) GETSTRUCT(reltup);
relform->relpages = num_pages;
- relform->reltuples = num_tuples;
+ relform->reltuples = num_tuples - tups_recently_dead;
/* Don't update the stats for pg_class. See swap_relation_files. */
if (RelationGetRelid(OldHeap) != RelationRelationId)
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..d259a7b46fa 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1631,6 +1631,7 @@ table_relation_copy_data(Relation rel, const RelFileLocator *newrlocator)
* Output parameters:
* - *xid_cutoff - rel's new relfrozenxid value, may be invalid
* - *multi_cutoff - rel's new relminmxid value, may be invalid
+ * - *num_tuples - stats, non-removable tuples for logging
* - *tups_vacuumed - stats, for logging, if appropriate for AM
* - *tups_recently_dead - stats, for logging, if appropriate for AM
*/
diff --git a/src/test/isolation/expected/vacuum-full-stats.out b/src/test/isolation/expected/vacuum-full-stats.out
new file mode 100644
index 00000000000..7b027711f85
--- /dev/null
+++ b/src/test/isolation/expected/vacuum-full-stats.out
@@ -0,0 +1,72 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s2_vac_full s2_reltuples s1_query s2_update s2_reltuples s2_vac_full s1_commit s2_reltuples
+step s2_vac_full: VACUUM FULL stats;
+step s2_reltuples: SELECT reltuples FROM pg_class WHERE relname = 'stats';
+reltuples
+---------
+ 10
+(1 row)
+
+step s1_query:
+ START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+
+?column?
+--------
+ 1
+(1 row)
+
+step s2_update: UPDATE stats SET a = 1 WHERE a > 6;
+step s2_reltuples: SELECT reltuples FROM pg_class WHERE relname = 'stats';
+reltuples
+---------
+ 10
+(1 row)
+
+step s2_vac_full: VACUUM FULL stats;
+step s1_commit:
+ COMMIT;
+
+step s2_reltuples: SELECT reltuples FROM pg_class WHERE relname = 'stats';
+reltuples
+---------
+ 10
+(1 row)
+
+
+starting permutation: s2_create_index s2_vac_full s2_reltuples s1_query s2_update s2_reltuples s2_vac_full s1_commit s2_reltuples
+step s2_create_index: CREATE INDEX ON stats (a);
+step s2_vac_full: VACUUM FULL stats;
+step s2_reltuples: SELECT reltuples FROM pg_class WHERE relname = 'stats';
+reltuples
+---------
+ 10
+(1 row)
+
+step s1_query:
+ START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+
+?column?
+--------
+ 1
+(1 row)
+
+step s2_update: UPDATE stats SET a = 1 WHERE a > 6;
+step s2_reltuples: SELECT reltuples FROM pg_class WHERE relname = 'stats';
+reltuples
+---------
+ 10
+(1 row)
+
+step s2_vac_full: VACUUM FULL stats;
+step s1_commit:
+ COMMIT;
+
+step s2_reltuples: SELECT reltuples FROM pg_class WHERE relname = 'stats';
+reltuples
+---------
+ 10
+(1 row)
+
diff --git a/src/test/isolation/isolation_schedule b/src/test/isolation/isolation_schedule
index e3c669a29c7..71358fa42c5 100644
--- a/src/test/isolation/isolation_schedule
+++ b/src/test/isolation/isolation_schedule
@@ -95,6 +95,7 @@ test: vacuum-no-cleanup-lock
test: timeouts
test: vacuum-concurrent-drop
test: vacuum-conflict
+test: vacuum-full-stats
test: vacuum-skip-locked
test: stats
test: horizons
diff --git a/src/test/isolation/specs/vacuum-full-stats.spec b/src/test/isolation/specs/vacuum-full-stats.spec
new file mode 100644
index 00000000000..1280b0dd975
--- /dev/null
+++ b/src/test/isolation/specs/vacuum-full-stats.spec
@@ -0,0 +1,42 @@
+# Test that pg_class.reltuples is updated with the correct value after
+# a VACUUM FULL when there are concurrent transactions preventing
+# removal of some garbage tuples. The reltuples count should not
+# include these garbage tuples.
+
+setup
+{
+ CREATE TABLE stats (a INT);
+ INSERT INTO stats SELECT generate_series(1, 10);
+}
+
+teardown
+{
+ DROP TABLE IF EXISTS stats;
+}
+
+# Session 1 (s1) keeps an open transaction in repeatable read to
+# prevent VACUUM FULL in session 2 (s2) from removing some dead tuples
+# that session 1 should still see in its snapshot. Note that s1 needs
+# to run a query to initiate a snapshot.
+session s1
+step s1_query
+{
+ START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+ SELECT 1;
+}
+step s1_commit
+{
+ COMMIT;
+}
+
+session s2
+step s2_update { UPDATE stats SET a = 1 WHERE a > 6; }
+step s2_vac_full { VACUUM FULL stats; }
+step s2_reltuples { SELECT reltuples FROM pg_class WHERE relname = 'stats'; }
+step s2_create_index { CREATE INDEX ON stats (a); }
+
+permutation s2_vac_full s2_reltuples s1_query s2_update s2_reltuples s2_vac_full s1_commit s2_reltuples
+
+# If the table has at least one index, then the index rebuild will set
+# reltuples after VACUUM FULL, so run a test with that too.
+permutation s2_create_index s2_vac_full s2_reltuples s1_query s2_update s2_reltuples s2_vac_full s1_commit s2_reltuples
--
2.48.1
If you subtract recently dead from that number within the heap implementation, then it will no longer
reflect non-removable tuples and the log message in the cluster
function "found %.0f removable, %.0f nonremovable row versions" will no longer be correct.
Yes, that's correct. I did not pay attention to the logging aspect.
Here is a test with and without the patch. While there is a long-running
serializable transaction in another session, running an update followed
by a normal vacuum sets the reltuples value correctly to 10. A follow-up
VACUUM FULL then sets it incorrectly to 14. The patch, which sets
the num_tuples after the logging, ensures that the logging is correct
and that pg_class.reltuples matches the actual number of live tuples.
-- without the patch
test=# UPDATE stats SET a=1 WHERE a > 6;
UPDATE 4
test=# SELECT reltuples FROM pg_class WHERE relname = 'stats';
reltuples
-----------
-1
(1 row)
test=# vacuum verbose stats;
INFO: vacuuming "test.public.stats"
INFO: finished vacuuming "test.public.stats": index scans: 0
pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly scanned
tuples: 0 removed, 14 remain, 4 are dead but not yet removable
removable cutoff: 789, which was 1 XIDs old when operation ended
new relfrozenxid: 788, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0
were all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 9.159 MB/s
buffer usage: 14 hits, 0 reads, 3 dirtied
WAL usage: 4 records, 3 full page images, 24981 bytes, 0 buffers full
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
test=# SELECT reltuples FROM pg_class WHERE relname = 'stats';
reltuples
-----------
10
(1 row)
test=# VACUUM (verbose, FULL) stats;
INFO: vacuuming "public.stats"
INFO: "public.stats": found 0 removable, 14 nonremovable row versions
in 1 pages
DETAIL: 4 dead row versions cannot be removed yet.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
test=# SELECT reltuples FROM pg_class WHERE relname = 'stats';
reltuples
-----------
14
(1 row)
-- with the patch
test=# UPDATE stats SET a=1 WHERE a > 6;
UPDATE 4
test=#
test=# SELECT reltuples FROM pg_class WHERE relname = 'stats';
reltuples
-----------
-1
(1 row)
test=# VACUUM verbose stats;
INFO: vacuuming "test.public.stats"
INFO: finished vacuuming "test.public.stats": index scans: 0
pages: 0 removed, 1 remain, 1 scanned (100.00% of total), 0 eagerly scanned
tuples: 0 removed, 14 remain, 4 are dead but not yet removable
removable cutoff: 794, which was 1 XIDs old when operation ended
new relfrozenxid: 793, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen
visibility map: 0 pages set all-visible, 0 pages set all-frozen (0
were all-visible)
index scan not needed: 0 pages from table (0.00% of total) had 0 dead
item identifiers removed
avg read rate: 0.000 MB/s, avg write rate: 9.195 MB/s
buffer usage: 17 hits, 0 reads, 3 dirtied
WAL usage: 4 records, 3 full page images, 24981 bytes, 0 buffers full
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
VACUUM
test=# SELECT reltuples FROM pg_class WHERE relname = 'stats';
reltuples
-----------
10
(1 row)
test=# VACUUM (verbose, FULL) stats;
INFO: vacuuming "public.stats"
INFO: "public.stats": found 0 removable, 14 nonremovable row versions
in 1 pages
DETAIL: 4 dead row versions cannot be removed yet.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
test=# SELECT reltuples FROM pg_class WHERE relname = 'stats';
reltuples
-----------
10
(1 row)
--
Sami