BUG #19439: pg_stat_xact_user_tables stat not currect during the transaction

Started by PG Bug reporting form2 months ago8 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 19439
Logged by: klemen kobau
Email address: klemen.kobau@gmail.com
PostgreSQL version: 18.0
Operating system: Linux (EndeavorOS)
Description:

I am running postgres:18.0 in a docker container, the configuration is as
follows:

postgres:
image: postgres:18.0
command: [
"postgres",
"-N", "200",
"-c", "max_prepared_transactions=100",
"-c", "wal_level=logical",
"-c", "max_wal_senders=10",
"-c", "max_replication_slots=20",
"-c", "wal_keep_size=10",
"-c", "max_slot_wal_keep_size=1024"
]
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
TZ: UTC
PGTZ: UTC
ports:
- 5432:5432
volumes:
- postgres_data:/var/lib/postgresql

I use psql version 18.3.

I run the following:

psql -h localhost -p 5432 -U postgres -d postgres -c "
-- Transaction 1: insert 1 row, check stats, commit
BEGIN;
CREATE TABLE IF NOT EXISTS temp.xact_test (id serial PRIMARY KEY, val text);
INSERT INTO temp.xact_test (val) VALUES ('a');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
COMMIT;

-- Transaction 2: insert 1 row, check stats
BEGIN;
INSERT INTO temp.xact_test (val) VALUES ('b');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
ROLLBACK;

-- Cleanup
DROP TABLE temp.xact_test;
"

And the output is

BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

This is not what I would expect from reading
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS,
where it states

A transaction can also see its own statistics (not yet flushed out to the

shared memory statistics) in the views pg_stat_xact_all_tables,
pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and
pg_stat_xact_user_functions. These numbers do not act as stated above;
instead they update continuously throughout the transaction.

based on this, I would expect that the numbers are updated each time an
insert happens and that they are isolated per transaction.

Kind regards

Klemen Kobau

#2Xuneng Zhou
xunengzhou@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #19439: pg_stat_xact_user_tables stat not currect during the transaction

Hi klemen,

Thanks for the report.

On Fri, Mar 27, 2026 at 5:36 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 19439
Logged by: klemen kobau
Email address: klemen.kobau@gmail.com
PostgreSQL version: 18.0
Operating system: Linux (EndeavorOS)
Description:

I am running postgres:18.0 in a docker container, the configuration is as
follows:

postgres:
image: postgres:18.0
command: [
"postgres",
"-N", "200",
"-c", "max_prepared_transactions=100",
"-c", "wal_level=logical",
"-c", "max_wal_senders=10",
"-c", "max_replication_slots=20",
"-c", "wal_keep_size=10",
"-c", "max_slot_wal_keep_size=1024"
]
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
TZ: UTC
PGTZ: UTC
ports:
- 5432:5432
volumes:
- postgres_data:/var/lib/postgresql

I use psql version 18.3.

I run the following:

psql -h localhost -p 5432 -U postgres -d postgres -c "
-- Transaction 1: insert 1 row, check stats, commit
BEGIN;
CREATE TABLE IF NOT EXISTS temp.xact_test (id serial PRIMARY KEY, val text);
INSERT INTO temp.xact_test (val) VALUES ('a');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
COMMIT;

-- Transaction 2: insert 1 row, check stats
BEGIN;
INSERT INTO temp.xact_test (val) VALUES ('b');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
ROLLBACK;

-- Cleanup
DROP TABLE temp.xact_test;
"

And the output is

BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior on HEAD.

This is not what I would expect from reading
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS,
where it states

A transaction can also see its own statistics (not yet flushed out to the

shared memory statistics) in the views pg_stat_xact_all_tables,
pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and
pg_stat_xact_user_functions. These numbers do not act as stated above;
instead they update continuously throughout the transaction.

based on this, I would expect that the numbers are updated each time an
insert happens and that they are isolated per transaction.

This snippet of the doc feels somewhat ambiguous. The current behavior
does not seem fully consistent with wording such as “throughout the
transaction.” It also seems more appropriate for these table views to
reflect stats within their own scopes, rather than cumulative values
spanning txn boundaries.

--
Best,
Xuneng

#3Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#2)
Re: BUG #19439: pg_stat_xact_user_tables stat not currect during the transaction

On Fri, Mar 27, 2026 at 10:15 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi klemen,

Thanks for the report.

On Fri, Mar 27, 2026 at 5:36 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 19439
Logged by: klemen kobau
Email address: klemen.kobau@gmail.com
PostgreSQL version: 18.0
Operating system: Linux (EndeavorOS)
Description:

I am running postgres:18.0 in a docker container, the configuration is as
follows:

postgres:
image: postgres:18.0
command: [
"postgres",
"-N", "200",
"-c", "max_prepared_transactions=100",
"-c", "wal_level=logical",
"-c", "max_wal_senders=10",
"-c", "max_replication_slots=20",
"-c", "wal_keep_size=10",
"-c", "max_slot_wal_keep_size=1024"
]
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
TZ: UTC
PGTZ: UTC
ports:
- 5432:5432
volumes:
- postgres_data:/var/lib/postgresql

I use psql version 18.3.

I run the following:

psql -h localhost -p 5432 -U postgres -d postgres -c "
-- Transaction 1: insert 1 row, check stats, commit
BEGIN;
CREATE TABLE IF NOT EXISTS temp.xact_test (id serial PRIMARY KEY, val text);
INSERT INTO temp.xact_test (val) VALUES ('a');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
COMMIT;

-- Transaction 2: insert 1 row, check stats
BEGIN;
INSERT INTO temp.xact_test (val) VALUES ('b');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
ROLLBACK;

-- Cleanup
DROP TABLE temp.xact_test;
"

And the output is

BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior on HEAD.

This is not what I would expect from reading
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS,
where it states

A transaction can also see its own statistics (not yet flushed out to the

shared memory statistics) in the views pg_stat_xact_all_tables,
pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and
pg_stat_xact_user_functions. These numbers do not act as stated above;
instead they update continuously throughout the transaction.

based on this, I would expect that the numbers are updated each time an
insert happens and that they are isolated per transaction.

This snippet of the doc feels somewhat ambiguous. The current behavior
does not seem fully consistent with wording such as “throughout the
transaction.” It also seems more appropriate for these table views to
reflect stats within their own scopes, rather than cumulative values
spanning txn boundaries.

I’ve looked into this issue and 'd like to propose a patch to address
it. Feedback is very welcome.

--- Root cause

Since the stats subsystem was rewritten (commit 5891c7a8), each
backend keeps a hash of pending stats entries that persist across
transaction boundaries. Entries that flush successfully are deleted
from the backend-local pending list; entries not flushed remain
pending with their accumulated counters. Flushing is driven by
pgstat_report_stat(), called from the ReadyForQuery path, and subject
to a rate limiter.

The pg_stat_xact_* views read these pending entries directly via
find_tabstat_entry() / find_funcstat_entry(). Both functions return
the raw accumulated counters without per-transaction scoping.

The bug is deterministic when multiple top-level transactions are
processed within a single simple-query message, because there is no
ReadyForQuery boundary between the transactions and therefore no
opportunity to flush and remove the pending entry.

--- Fix

The patch introduces a per-entry "transaction baseline" — a lazy
snapshot of the counters taken the first time an entry is touched in
each new top-level transaction. The accessor functions
(find_tabstat_entry, find_funcstat_entry) then subtract the baseline
from the current counters, yielding only the current transaction's
delta.

The baseline is keyed by MyProc->vxid.lxid. A static inline helper,
pgstat_ensure_xact_baseline(), is called at every nontransactional
counter-increment site (the 7 event-counter macros in pgstat.h and the
4 non-inline counter functions in pgstat_relation.c). After the first
call per entry per transaction, the check reduces to a single integer
comparison with a predictably not-taken branch.

For function stats, a new PgStat_FunctionPending wrapper struct embeds
PgStat_FunctionCounts at offset 0 with the baseline fields appended,
so the flush callback requires only a trivial cast change.

--- Testing

The regression test is a TAP test under src/bin/psql/t/ that uses psql
-c to send multi-statement strings as single simple-query messages.
src/bin/psql/t look like the right existing harness for the primary
regression because psql -c sends the whole multi-statement string as
one simple-query message via simple_action_list dispatch. By contrast,
ordinary pg_regress SQL files are executed by psql in file/stdin mode,
which runs through MainLoop() and dispatches top-level statements one
at a time via SendQuery(), creating ReadyForQuery boundaries that
could mask the bug.

The test covers three scenarios: table n_tup_ins and seq_scan
counters, and function calls — each verifying isolation across
consecutive top-level transactions within a single message.

--
Best,
Xuneng

Attachments:

v1-0001-Fix-pg_stat_xact_-views-leaking-stats-across-tran.patchapplication/octet-stream; name=v1-0001-Fix-pg_stat_xact_-views-leaking-stats-across-tran.patchDownload+249-10
#4Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#3)
Re: BUG #19439: pg_stat_xact_user_tables stat not currect during the transaction

On Sat, Mar 28, 2026 at 1:47 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Fri, Mar 27, 2026 at 10:15 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi klemen,

Thanks for the report.

On Fri, Mar 27, 2026 at 5:36 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 19439
Logged by: klemen kobau
Email address: klemen.kobau@gmail.com
PostgreSQL version: 18.0
Operating system: Linux (EndeavorOS)
Description:

I am running postgres:18.0 in a docker container, the configuration is as
follows:

postgres:
image: postgres:18.0
command: [
"postgres",
"-N", "200",
"-c", "max_prepared_transactions=100",
"-c", "wal_level=logical",
"-c", "max_wal_senders=10",
"-c", "max_replication_slots=20",
"-c", "wal_keep_size=10",
"-c", "max_slot_wal_keep_size=1024"
]
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
TZ: UTC
PGTZ: UTC
ports:
- 5432:5432
volumes:
- postgres_data:/var/lib/postgresql

I use psql version 18.3.

I run the following:

psql -h localhost -p 5432 -U postgres -d postgres -c "
-- Transaction 1: insert 1 row, check stats, commit
BEGIN;
CREATE TABLE IF NOT EXISTS temp.xact_test (id serial PRIMARY KEY, val text);
INSERT INTO temp.xact_test (val) VALUES ('a');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
COMMIT;

-- Transaction 2: insert 1 row, check stats
BEGIN;
INSERT INTO temp.xact_test (val) VALUES ('b');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
ROLLBACK;

-- Cleanup
DROP TABLE temp.xact_test;
"

And the output is

BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior on HEAD.

This is not what I would expect from reading
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS,
where it states

A transaction can also see its own statistics (not yet flushed out to the

shared memory statistics) in the views pg_stat_xact_all_tables,
pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and
pg_stat_xact_user_functions. These numbers do not act as stated above;
instead they update continuously throughout the transaction.

based on this, I would expect that the numbers are updated each time an
insert happens and that they are isolated per transaction.

This snippet of the doc feels somewhat ambiguous. The current behavior
does not seem fully consistent with wording such as “throughout the
transaction.” It also seems more appropriate for these table views to
reflect stats within their own scopes, rather than cumulative values
spanning txn boundaries.

I’ve looked into this issue and 'd like to propose a patch to address
it. Feedback is very welcome.

--- Root cause

Since the stats subsystem was rewritten (commit 5891c7a8), each
backend keeps a hash of pending stats entries that persist across
transaction boundaries. Entries that flush successfully are deleted
from the backend-local pending list; entries not flushed remain
pending with their accumulated counters. Flushing is driven by
pgstat_report_stat(), called from the ReadyForQuery path, and subject
to a rate limiter.

The pg_stat_xact_* views read these pending entries directly via
find_tabstat_entry() / find_funcstat_entry(). Both functions return
the raw accumulated counters without per-transaction scoping.

The bug is deterministic when multiple top-level transactions are
processed within a single simple-query message, because there is no
ReadyForQuery boundary between the transactions and therefore no
opportunity to flush and remove the pending entry.

--- Fix

The patch introduces a per-entry "transaction baseline" — a lazy
snapshot of the counters taken the first time an entry is touched in
each new top-level transaction. The accessor functions
(find_tabstat_entry, find_funcstat_entry) then subtract the baseline
from the current counters, yielding only the current transaction's
delta.

The baseline is keyed by MyProc->vxid.lxid. A static inline helper,
pgstat_ensure_xact_baseline(), is called at every nontransactional
counter-increment site (the 7 event-counter macros in pgstat.h and the
4 non-inline counter functions in pgstat_relation.c). After the first
call per entry per transaction, the check reduces to a single integer
comparison with a predictably not-taken branch.

For function stats, a new PgStat_FunctionPending wrapper struct embeds
PgStat_FunctionCounts at offset 0 with the baseline fields appended,
so the flush callback requires only a trivial cast change.

--- Testing

The regression test is a TAP test under src/bin/psql/t/ that uses psql
-c to send multi-statement strings as single simple-query messages.
src/bin/psql/t look like the right existing harness for the primary
regression because psql -c sends the whole multi-statement string as
one simple-query message via simple_action_list dispatch. By contrast,
ordinary pg_regress SQL files are executed by psql in file/stdin mode,
which runs through MainLoop() and dispatches top-level statements one
at a time via SendQuery(), creating ReadyForQuery boundaries that
could mask the bug.

The test covers three scenarios: table n_tup_ins and seq_scan
counters, and function calls — each verifying isolation across
consecutive top-level transactions within a single message.

postgres % cat output.txt
BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior in v13. So this issue pre-existed
the major refactor 5891c7a8.

--
Best,
Xuneng

#5Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#4)
Re: BUG #19439: pg_stat_xact_user_tables stat not currect during the transaction

On Sun, Mar 29, 2026 at 11:17 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sat, Mar 28, 2026 at 1:47 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Fri, Mar 27, 2026 at 10:15 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi klemen,

Thanks for the report.

On Fri, Mar 27, 2026 at 5:36 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 19439
Logged by: klemen kobau
Email address: klemen.kobau@gmail.com
PostgreSQL version: 18.0
Operating system: Linux (EndeavorOS)
Description:

I am running postgres:18.0 in a docker container, the configuration is as
follows:

postgres:
image: postgres:18.0
command: [
"postgres",
"-N", "200",
"-c", "max_prepared_transactions=100",
"-c", "wal_level=logical",
"-c", "max_wal_senders=10",
"-c", "max_replication_slots=20",
"-c", "wal_keep_size=10",
"-c", "max_slot_wal_keep_size=1024"
]
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
TZ: UTC
PGTZ: UTC
ports:
- 5432:5432
volumes:
- postgres_data:/var/lib/postgresql

I use psql version 18.3.

I run the following:

psql -h localhost -p 5432 -U postgres -d postgres -c "
-- Transaction 1: insert 1 row, check stats, commit
BEGIN;
CREATE TABLE IF NOT EXISTS temp.xact_test (id serial PRIMARY KEY, val text);
INSERT INTO temp.xact_test (val) VALUES ('a');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
COMMIT;

-- Transaction 2: insert 1 row, check stats
BEGIN;
INSERT INTO temp.xact_test (val) VALUES ('b');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
ROLLBACK;

-- Cleanup
DROP TABLE temp.xact_test;
"

And the output is

BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior on HEAD.

This is not what I would expect from reading
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS,
where it states

A transaction can also see its own statistics (not yet flushed out to the

shared memory statistics) in the views pg_stat_xact_all_tables,
pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and
pg_stat_xact_user_functions. These numbers do not act as stated above;
instead they update continuously throughout the transaction.

based on this, I would expect that the numbers are updated each time an
insert happens and that they are isolated per transaction.

This snippet of the doc feels somewhat ambiguous. The current behavior
does not seem fully consistent with wording such as “throughout the
transaction.” It also seems more appropriate for these table views to
reflect stats within their own scopes, rather than cumulative values
spanning txn boundaries.

I’ve looked into this issue and 'd like to propose a patch to address
it. Feedback is very welcome.

--- Root cause

Since the stats subsystem was rewritten (commit 5891c7a8), each
backend keeps a hash of pending stats entries that persist across
transaction boundaries. Entries that flush successfully are deleted
from the backend-local pending list; entries not flushed remain
pending with their accumulated counters. Flushing is driven by
pgstat_report_stat(), called from the ReadyForQuery path, and subject
to a rate limiter.

The pg_stat_xact_* views read these pending entries directly via
find_tabstat_entry() / find_funcstat_entry(). Both functions return
the raw accumulated counters without per-transaction scoping.

The bug is deterministic when multiple top-level transactions are
processed within a single simple-query message, because there is no
ReadyForQuery boundary between the transactions and therefore no
opportunity to flush and remove the pending entry.

--- Fix

The patch introduces a per-entry "transaction baseline" — a lazy
snapshot of the counters taken the first time an entry is touched in
each new top-level transaction. The accessor functions
(find_tabstat_entry, find_funcstat_entry) then subtract the baseline
from the current counters, yielding only the current transaction's
delta.

The baseline is keyed by MyProc->vxid.lxid. A static inline helper,
pgstat_ensure_xact_baseline(), is called at every nontransactional
counter-increment site (the 7 event-counter macros in pgstat.h and the
4 non-inline counter functions in pgstat_relation.c). After the first
call per entry per transaction, the check reduces to a single integer
comparison with a predictably not-taken branch.

For function stats, a new PgStat_FunctionPending wrapper struct embeds
PgStat_FunctionCounts at offset 0 with the baseline fields appended,
so the flush callback requires only a trivial cast change.

--- Testing

The regression test is a TAP test under src/bin/psql/t/ that uses psql
-c to send multi-statement strings as single simple-query messages.
src/bin/psql/t look like the right existing harness for the primary
regression because psql -c sends the whole multi-statement string as
one simple-query message via simple_action_list dispatch. By contrast,
ordinary pg_regress SQL files are executed by psql in file/stdin mode,
which runs through MainLoop() and dispatches top-level statements one
at a time via SendQuery(), creating ReadyForQuery boundaries that
could mask the bug.

The test covers three scenarios: table n_tup_ins and seq_scan
counters, and function calls — each verifying isolation across
consecutive top-level transactions within a single message.

postgres % cat output.txt
BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior in v13. So this issue pre-existed
the major refactor 5891c7a8.

After further thought of v1, I'd like to also propose an alternative
approach. The root cause and testing rationale are the same as
described upthread; what changes is how the baseline is established.

--- Problems with the lazy (v1) approach
The v1 patch snapshots the baseline lazily -- on every
nontransactional counter increment, it checks whether baseline_lxid
matches the current
transaction and snapshots if not. This has two drawbacks:

1) Hot-path overhead
Every pgstat_count_heap_scan(), pgstat_count_buffer_read_block(), etc.
acquires a branch. After the first call per entry per transaction, the
branch is predictably not-taken, but it could still present on some of
the most frequently executed paths in the backend.

2) Maintenance burden
Every new counter added to PgStat_TableCounts requires a corresponding
baseline snapshot at its increment site. Miss one, and that counter
silently leaks across transactions

--- eager baseline sweep
The attached patch records the baseline eagerly at transaction
boundaries instead of lazily at counter-increment sites.
pgstat_set_pending_baselines() iterates the pgStatPending list and
snapshots each entry's current counts into an xact_baseline field via
struct assignment. It is called from AtEOXact_PgStat() (after folding
transactional counts and removing dropped entries) and from
PostPrepare_PgStat() (after relation cleanup), covering commit, abort,
and PREPARE TRANSACTION. The view accessors unconditionally subtract
the baseline. For entries created in the current transaction,
xact_baseline is zero-initialized, so the subtraction is a no-op.

I don’t have a clear preference between the two approaches; both are
presented for review.

--
Best,
Xuneng

Attachments:

v1-0001-Fix-pg_stat_xact_-views-leaking-stats-across-tran.patchapplication/octet-stream; name=v1-0001-Fix-pg_stat_xact_-views-leaking-stats-across-tran.patchDownload+332-11
#6klemen kobau
klemen.kobau@gmail.com
In reply to: Xuneng Zhou (#5)
Re: BUG #19439: pg_stat_xact_user_tables stat not currect during the transaction

Hi Xuneng,

Thanks for the analysis and the patch proposals. As the original reporter,
I wanted to follow up since the thread has been quiet for over a month.

I am new to this process, do I need to do anything to help the process?

Kind regards

Klemen Kobau

On Wed, 6 May 2026 at 09:56, Xuneng Zhou <xunengzhou@gmail.com> wrote:

Show quoted text

On Sun, Mar 29, 2026 at 11:17 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sat, Mar 28, 2026 at 1:47 PM Xuneng Zhou <xunengzhou@gmail.com>

wrote:

On Fri, Mar 27, 2026 at 10:15 AM Xuneng Zhou <xunengzhou@gmail.com>

wrote:

Hi klemen,

Thanks for the report.

On Fri, Mar 27, 2026 at 5:36 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 19439
Logged by: klemen kobau
Email address: klemen.kobau@gmail.com
PostgreSQL version: 18.0
Operating system: Linux (EndeavorOS)
Description:

I am running postgres:18.0 in a docker container, the

configuration is as

follows:

postgres:
image: postgres:18.0
command: [
"postgres",
"-N", "200",
"-c", "max_prepared_transactions=100",
"-c", "wal_level=logical",
"-c", "max_wal_senders=10",
"-c", "max_replication_slots=20",
"-c", "wal_keep_size=10",
"-c", "max_slot_wal_keep_size=1024"
]
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
TZ: UTC
PGTZ: UTC
ports:
- 5432:5432
volumes:
- postgres_data:/var/lib/postgresql

I use psql version 18.3.

I run the following:

psql -h localhost -p 5432 -U postgres -d postgres -c "
-- Transaction 1: insert 1 row, check stats, commit
BEGIN;
CREATE TABLE IF NOT EXISTS temp.xact_test (id serial PRIMARY KEY,

val text);

INSERT INTO temp.xact_test (val) VALUES ('a');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE

relname =

'xact_test';
COMMIT;

-- Transaction 2: insert 1 row, check stats
BEGIN;
INSERT INTO temp.xact_test (val) VALUES ('b');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE

relname =

'xact_test';
ROLLBACK;

-- Cleanup
DROP TABLE temp.xact_test;
"

And the output is

BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior on HEAD.

This is not what I would expect from reading

https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS
,

where it states

A transaction can also see its own statistics (not yet flushed

out to the

shared memory statistics) in the views pg_stat_xact_all_tables,
pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and
pg_stat_xact_user_functions. These numbers do not act as stated

above;

instead they update continuously throughout the transaction.

based on this, I would expect that the numbers are updated each

time an

insert happens and that they are isolated per transaction.

This snippet of the doc feels somewhat ambiguous. The current

behavior

does not seem fully consistent with wording such as “throughout the
transaction.” It also seems more appropriate for these table views to
reflect stats within their own scopes, rather than cumulative values
spanning txn boundaries.

I’ve looked into this issue and 'd like to propose a patch to address
it. Feedback is very welcome.

--- Root cause

Since the stats subsystem was rewritten (commit 5891c7a8), each
backend keeps a hash of pending stats entries that persist across
transaction boundaries. Entries that flush successfully are deleted
from the backend-local pending list; entries not flushed remain
pending with their accumulated counters. Flushing is driven by
pgstat_report_stat(), called from the ReadyForQuery path, and subject
to a rate limiter.

The pg_stat_xact_* views read these pending entries directly via
find_tabstat_entry() / find_funcstat_entry(). Both functions return
the raw accumulated counters without per-transaction scoping.

The bug is deterministic when multiple top-level transactions are
processed within a single simple-query message, because there is no
ReadyForQuery boundary between the transactions and therefore no
opportunity to flush and remove the pending entry.

--- Fix

The patch introduces a per-entry "transaction baseline" — a lazy
snapshot of the counters taken the first time an entry is touched in
each new top-level transaction. The accessor functions
(find_tabstat_entry, find_funcstat_entry) then subtract the baseline
from the current counters, yielding only the current transaction's
delta.

The baseline is keyed by MyProc->vxid.lxid. A static inline helper,
pgstat_ensure_xact_baseline(), is called at every nontransactional
counter-increment site (the 7 event-counter macros in pgstat.h and the
4 non-inline counter functions in pgstat_relation.c). After the first
call per entry per transaction, the check reduces to a single integer
comparison with a predictably not-taken branch.

For function stats, a new PgStat_FunctionPending wrapper struct embeds
PgStat_FunctionCounts at offset 0 with the baseline fields appended,
so the flush callback requires only a trivial cast change.

--- Testing

The regression test is a TAP test under src/bin/psql/t/ that uses psql
-c to send multi-statement strings as single simple-query messages.
src/bin/psql/t look like the right existing harness for the primary
regression because psql -c sends the whole multi-statement string as
one simple-query message via simple_action_list dispatch. By contrast,
ordinary pg_regress SQL files are executed by psql in file/stdin mode,
which runs through MainLoop() and dispatches top-level statements one
at a time via SendQuery(), creating ReadyForQuery boundaries that
could mask the bug.

The test covers three scenarios: table n_tup_ins and seq_scan
counters, and function calls — each verifying isolation across
consecutive top-level transactions within a single message.

postgres % cat output.txt
BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior in v13. So this issue pre-existed
the major refactor 5891c7a8.

After further thought of v1, I'd like to also propose an alternative
approach. The root cause and testing rationale are the same as
described upthread; what changes is how the baseline is established.

--- Problems with the lazy (v1) approach
The v1 patch snapshots the baseline lazily -- on every
nontransactional counter increment, it checks whether baseline_lxid
matches the current
transaction and snapshots if not. This has two drawbacks:

1) Hot-path overhead
Every pgstat_count_heap_scan(), pgstat_count_buffer_read_block(), etc.
acquires a branch. After the first call per entry per transaction, the
branch is predictably not-taken, but it could still present on some of
the most frequently executed paths in the backend.

2) Maintenance burden
Every new counter added to PgStat_TableCounts requires a corresponding
baseline snapshot at its increment site. Miss one, and that counter
silently leaks across transactions

--- eager baseline sweep
The attached patch records the baseline eagerly at transaction
boundaries instead of lazily at counter-increment sites.
pgstat_set_pending_baselines() iterates the pgStatPending list and
snapshots each entry's current counts into an xact_baseline field via
struct assignment. It is called from AtEOXact_PgStat() (after folding
transactional counts and removing dropped entries) and from
PostPrepare_PgStat() (after relation cleanup), covering commit, abort,
and PREPARE TRANSACTION. The view accessors unconditionally subtract
the baseline. For entries created in the current transaction,
xact_baseline is zero-initialized, so the subtraction is a no-op.

I don’t have a clear preference between the two approaches; both are
presented for review.

--
Best,
Xuneng

#7Xuneng Zhou
xunengzhou@gmail.com
In reply to: klemen kobau (#6)
Re: BUG #19439: pg_stat_xact_user_tables stat not currect during the transaction

Hi klemen,

On Wed, May 6, 2026 at 4:03 PM klemen kobau <klemen.kobau@gmail.com> wrote:

Hi Xuneng,

Thanks for the analysis and the patch proposals. As the original reporter, I wanted to follow up since the thread has been quiet for over a month.

I am new to this process, do I need to do anything to help the process?

Kind regards

Klemen Kobau

On Wed, 6 May 2026 at 09:56, Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sun, Mar 29, 2026 at 11:17 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Sat, Mar 28, 2026 at 1:47 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:

On Fri, Mar 27, 2026 at 10:15 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:

Hi klemen,

Thanks for the report.

On Fri, Mar 27, 2026 at 5:36 AM PG Bug reporting form
<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 19439
Logged by: klemen kobau
Email address: klemen.kobau@gmail.com
PostgreSQL version: 18.0
Operating system: Linux (EndeavorOS)
Description:

I am running postgres:18.0 in a docker container, the configuration is as
follows:

postgres:
image: postgres:18.0
command: [
"postgres",
"-N", "200",
"-c", "max_prepared_transactions=100",
"-c", "wal_level=logical",
"-c", "max_wal_senders=10",
"-c", "max_replication_slots=20",
"-c", "wal_keep_size=10",
"-c", "max_slot_wal_keep_size=1024"
]
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
TZ: UTC
PGTZ: UTC
ports:
- 5432:5432
volumes:
- postgres_data:/var/lib/postgresql

I use psql version 18.3.

I run the following:

psql -h localhost -p 5432 -U postgres -d postgres -c "
-- Transaction 1: insert 1 row, check stats, commit
BEGIN;
CREATE TABLE IF NOT EXISTS temp.xact_test (id serial PRIMARY KEY, val text);
INSERT INTO temp.xact_test (val) VALUES ('a');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
COMMIT;

-- Transaction 2: insert 1 row, check stats
BEGIN;
INSERT INTO temp.xact_test (val) VALUES ('b');
SELECT relname, n_tup_ins FROM pg_stat_xact_user_tables WHERE relname =
'xact_test';
ROLLBACK;

-- Cleanup
DROP TABLE temp.xact_test;
"

And the output is

BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior on HEAD.

This is not what I would expect from reading
https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-STATS-VIEWS,
where it states

A transaction can also see its own statistics (not yet flushed out to the

shared memory statistics) in the views pg_stat_xact_all_tables,
pg_stat_xact_sys_tables, pg_stat_xact_user_tables, and
pg_stat_xact_user_functions. These numbers do not act as stated above;
instead they update continuously throughout the transaction.

based on this, I would expect that the numbers are updated each time an
insert happens and that they are isolated per transaction.

This snippet of the doc feels somewhat ambiguous. The current behavior
does not seem fully consistent with wording such as “throughout the
transaction.” It also seems more appropriate for these table views to
reflect stats within their own scopes, rather than cumulative values
spanning txn boundaries.

I’ve looked into this issue and 'd like to propose a patch to address
it. Feedback is very welcome.

--- Root cause

Since the stats subsystem was rewritten (commit 5891c7a8), each
backend keeps a hash of pending stats entries that persist across
transaction boundaries. Entries that flush successfully are deleted
from the backend-local pending list; entries not flushed remain
pending with their accumulated counters. Flushing is driven by
pgstat_report_stat(), called from the ReadyForQuery path, and subject
to a rate limiter.

The pg_stat_xact_* views read these pending entries directly via
find_tabstat_entry() / find_funcstat_entry(). Both functions return
the raw accumulated counters without per-transaction scoping.

The bug is deterministic when multiple top-level transactions are
processed within a single simple-query message, because there is no
ReadyForQuery boundary between the transactions and therefore no
opportunity to flush and remove the pending entry.

--- Fix

The patch introduces a per-entry "transaction baseline" — a lazy
snapshot of the counters taken the first time an entry is touched in
each new top-level transaction. The accessor functions
(find_tabstat_entry, find_funcstat_entry) then subtract the baseline
from the current counters, yielding only the current transaction's
delta.

The baseline is keyed by MyProc->vxid.lxid. A static inline helper,
pgstat_ensure_xact_baseline(), is called at every nontransactional
counter-increment site (the 7 event-counter macros in pgstat.h and the
4 non-inline counter functions in pgstat_relation.c). After the first
call per entry per transaction, the check reduces to a single integer
comparison with a predictably not-taken branch.

For function stats, a new PgStat_FunctionPending wrapper struct embeds
PgStat_FunctionCounts at offset 0 with the baseline fields appended,
so the flush callback requires only a trivial cast change.

--- Testing

The regression test is a TAP test under src/bin/psql/t/ that uses psql
-c to send multi-statement strings as single simple-query messages.
src/bin/psql/t look like the right existing harness for the primary
regression because psql -c sends the whole multi-statement string as
one simple-query message via simple_action_list dispatch. By contrast,
ordinary pg_regress SQL files are executed by psql in file/stdin mode,
which runs through MainLoop() and dispatches top-level statements one
at a time via SendQuery(), creating ReadyForQuery boundaries that
could mask the bug.

The test covers three scenarios: table n_tup_ins and seq_scan
counters, and function calls — each verifying isolation across
consecutive top-level transactions within a single message.

postgres % cat output.txt
BEGIN
CREATE TABLE
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 1
(1 row)

COMMIT
BEGIN
INSERT 0 1
relname | n_tup_ins
-----------+-----------
xact_test | 2
(1 row)

ROLLBACK
DROP TABLE

I can also reproduce this behavior in v13. So this issue pre-existed
the major refactor 5891c7a8.

After further thought of v1, I'd like to also propose an alternative
approach. The root cause and testing rationale are the same as
described upthread; what changes is how the baseline is established.

--- Problems with the lazy (v1) approach
The v1 patch snapshots the baseline lazily -- on every
nontransactional counter increment, it checks whether baseline_lxid
matches the current
transaction and snapshots if not. This has two drawbacks:

1) Hot-path overhead
Every pgstat_count_heap_scan(), pgstat_count_buffer_read_block(), etc.
acquires a branch. After the first call per entry per transaction, the
branch is predictably not-taken, but it could still present on some of
the most frequently executed paths in the backend.

2) Maintenance burden
Every new counter added to PgStat_TableCounts requires a corresponding
baseline snapshot at its increment site. Miss one, and that counter
silently leaks across transactions

--- eager baseline sweep
The attached patch records the baseline eagerly at transaction
boundaries instead of lazily at counter-increment sites.
pgstat_set_pending_baselines() iterates the pgStatPending list and
snapshots each entry's current counts into an xact_baseline field via
struct assignment. It is called from AtEOXact_PgStat() (after folding
transactional counts and removing dropped entries) and from
PostPrepare_PgStat() (after relation cleanup), covering commit, abort,
and PREPARE TRANSACTION. The view accessors unconditionally subtract
the baseline. For entries created in the current transaction,
xact_baseline is zero-initialized, so the subtraction is a no-op.

I don’t have a clear preference between the two approaches; both are
presented for review.

It would be useful to verify the fix by manually applying the patch
and building the instance. Additionally, a few issues surfaced after
looking at it again, which I will update later.

--
Best,
Xuneng

#8Xuneng Zhou
xunengzhou@gmail.com
In reply to: Xuneng Zhou (#7)
Re: BUG #19439: pg_stat_xact_user_tables stat not currect during the transaction
--- eager baseline sweep
The attached patch records the baseline eagerly at transaction
boundaries instead of lazily at counter-increment sites.
pgstat_set_pending_baselines() iterates the pgStatPending list and
snapshots each entry's current counts into an xact_baseline field via
struct assignment. It is called from AtEOXact_PgStat() (after folding
transactional counts and removing dropped entries) and from
PostPrepare_PgStat() (after relation cleanup), covering commit, abort,
and PREPARE TRANSACTION. The view accessors unconditionally subtract
the baseline. For entries created in the current transaction,
xact_baseline is zero-initialized, so the subtraction is a no-op.

I don’t have a clear preference between the two approaches; both are
presented for review.

It would be useful to verify the fix by manually applying the patch
and building the instance. Additionally, a few issues surfaced after
looking at it again, which I will update later.

Here is the updated version using eager baseline refresh, i.e. sweeping all
backend-local pending pgstat entries at each top-level transaction boundary.

I tested the eager-baseline approach at both micro and macro levels. The
results show the same cost shape in both cases: the patch cost scales with the
number of backend-local pending pgstat entries that must be swept at each
top-level transaction boundary.

The microbenchmark isolates the transaction-boundary cost. It first creates a
controlled number of pending pgstat entries in one backend, then times 10000
tiny BEGIN/COMMIT boundaries in one simple-query message:

BEGIN; COMMIT; BEGIN; COMMIT; ...

The results below use matching -O0 --enable-debug --enable-cassert builds for
both installs.

pending entries unpatched us/xact patched us/xact patch delta
---------------------------------------------------------------------------
0 713.526 703.357 -10.169
100 740.954 754.298 +13.344
1000 1183.302 / 1177.393 1211.533 / 1213.089 about +32 us
5000 2978.008 / 2967.674 3236.365 / 3081.945
about +186 us median

Machine: Mac mini, M4 Pro, 48GB mem

This is intentionally hostile to eager sweeping: one backend accumulates many
pending entries, then repeatedly crosses top-level transaction boundaries while
doing almost no useful work inside the transactions. The added cost is small at
100 pending entries, where noise matters, but becomes clear at 1000 and 5000
entries. In this debug/cassert build, the patch adds roughly 0.03-0.04 us per
pending entry per transaction in the 1000-5000 entry range. Absolute numbers
are inflated by the build profile, but the linear shape is the relevant signal.

The macro benchmark shows when that same boundary cost is visible in a more
query-shaped workload:

workload base tps patched tps change
------------------------------------------------------------------
5000 tables, 1 row/table 11216 6231 -44%
1000 tables, 1000 rows/table 7683 7113 -7%
100 tables, 10000 rows/table 2152 2162 ~0%

The latency breakdown explains the TPS pattern. For the 5000-table/1-row case:

base: avg_lat=0.089 ms, select_avg=0.051 ms
patched: avg_lat=0.160 ms, select_avg=0.062 ms

Machine: intel xeon server, 40 cores, 128GB mem

The SELECT itself barely changes. Most of the regression appears outside the
SELECT, where the patch does the baseline sweep at transaction end. This is
the worst case for the eager design: the transaction does very little real table
work, but the backend has thousands of pending relation stats entries.

As the number of pending entries drops, or as the query does more real scan
work, the same fixed boundary cost is diluted. With 1000 tables and 1000 rows
per table, the regression falls to about 7%. With 100 tables and 10000 rows per
table, the scan dominates and the sweep over about 100 pending entries is lost
in noise.

Taken together, the benchmarks confirm the expected implementation cost model:

eager baseline refresh cost ~= O(number of pending pgstat entries per backend)

Row count does not directly drive the cost; it only hides or exposes the fixed
transaction-boundary work.

These results suggest that the eager-sweeping approach has an unfavorable
cost model for long-lived sessions that accumulate many pending stats entries
and then execute small transactions. A lazy baseline appraoch, where
each pending
entry records the current transaction generation only when that entry is first
touched, should avoid the transaction-boundary sweep and make the cost scale
with the transaction's actual working set instead. However, it still
suffers from the
potential overhead of additional comparisons on hot paths, as well as increased
maintenance pain.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.

Attachments:

v2-0001-Fix-pg_stat_xact_-views-leaking-across-xact-bound.patchapplication/octet-stream; name=v2-0001-Fix-pg_stat_xact_-views-leaking-across-xact-bound.patchDownload+338-11
pgstat_xact_macro_bench.shtext/x-sh; charset=US-ASCII; name=pgstat_xact_macro_bench.shDownload
pgstat_xact_micro_bench.shtext/x-sh; charset=US-ASCII; name=pgstat_xact_micro_bench.shDownload