xact_rollback spikes when logical walsender exits
Hi hackers,
There is a bug on logical-replication publishers where every decoded
committed transaction bumps pg_stat_database.xact_rollback.
ReorderBufferProcessTXN() ends each decoded transaction with
AbortCurrentTransaction() for catalog cleanup; in the walsender that
is a top-level abort, so AtEOXact_PgStat_Database(isCommit=false)
increments the backend-local pgStatXactRollback.
The counts are flushed to shared stats on walsender exit, producing
an acute spike. Result: for production systems with SREs on call and tight
alerting on xact_rollback, this turns routine logical-replication operations
(disabling a subscription, dropping a slot, walsender restart) into
false-positive pages.
Reported in [1]; also experienced at GitLab [2][3][4].
Attaching a simple patch that adds a backend-local flag pgStatXactSkipCounters
in pgstat_database.c that AtEOXact_PgStat_Database() honors to skip
the counter bump.
Added TAP test that fails on master with 5/0 and passes with the patch.
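For readers who have not opened the attachment, the general shape of a skip flag can be sketched as follows. This is not the patch itself: the `flag_*` names are hypothetical, and the save/restore around the cleanup abort is an assumption made for illustration, not something confirmed by the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical backend-local counters and flag (toy model only). */
static uint64_t flag_commits = 0;
static uint64_t flag_rollbacks = 0;
static bool flag_skip_counters = false; /* plays the role of pgStatXactSkipCounters */

/* Models an end-of-transaction hook that honors the skip flag. */
static void
flag_at_eoxact(bool is_commit)
{
    if (flag_skip_counters)
        return;
    if (is_commit)
        flag_commits++;
    else
        flag_rollbacks++;
}

/* Models the decoding cleanup abort: set the flag, abort, restore the flag. */
static void
flag_decoding_cleanup_abort(void)
{
    bool     saved = flag_skip_counters;
    uint64_t before = flag_rollbacks;

    flag_skip_counters = true;
    flag_at_eoxact(false);
    flag_skip_counters = saved;

    /* The cleanup abort must not be visible in the rollback counter. */
    assert(flag_rollbacks == before);
}
```

With this shape, ordinary commits and aborts are still counted, while any number of cleanup aborts leave both counters unchanged.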
If there is agreement on this shape, happy to send patches for all
supported branches. Let me know what you think.
[1]: /messages/by-id/CAG0ozMo_xWQn+Avv8jzbbhePGp5OnhdO+YWTkdg4faWSXz0Jzg@mail.gmail.com
[2]: https://gitlab.com/gitlab-com/gl-infra/production/-/work_items/8290
[3]: https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/work_items/39
[4]: https://gitlab.com/gitlab-org/orbit/knowledge-graph/-/work_items/406
Nik
Attachment: v1-logical-rollback-spike.patch (application/octet-stream, +115 -15)
On Sat, Apr 18, 2026 at 12:15 AM Nikolay Samokhvalov <nik@postgres.ai> wrote:
> Hi hackers,
> There is a bug on logical-replication publishers where every decoded
> committed transaction bumps pg_stat_database.xact_rollback.
> ReorderBufferProcessTXN() ends each decoded transaction with
> AbortCurrentTransaction() for catalog cleanup; in the walsender that
> is a top-level abort, so AtEOXact_PgStat_Database(isCommit=false)
> increments the backend-local pgStatXactRollback.
> The counts are flushed to shared stats on walsender exit, producing
> an acute spike. Result: for production systems with SREs on call and tight
> alerting on xact_rollback, this turns routine logical-replication operations
> (disabling a subscription, dropping a slot, walsender restart) into
> false-positive pages.
> Reported in [1]; also experienced at GitLab [2][3][4].
> Attaching a simple patch that adds a backend-local flag pgStatXactSkipCounters
> in pgstat_database.c that AtEOXact_PgStat_Database() honors to skip
> the counter bump.
> Added TAP test that fails on master with 5/0 and passes with the patch.
> If there is agreement on this shape, happy to send patches for all
> supported branches. Let me know what you think.
Thanks for the report and patch!
How to implement a solution depends on what xact_rollback in pg_stat_database
is intended to mean. So first we should consider which rollbacks it should
count. The documentation does not currently give an explicit definition.
At present, xact_rollback appears to count all rollbacks, explicit or implicit,
by any process connected to the database, including regular backends,
autovacuum workers, and logical walsenders. If that is the intended definition,
then rollbacks implicitly performed by logical walsenders during logical
replication should also be counted. Of course, even if we keep that definition,
the sudden increase in xact_rollback might still be a problem, so we might
need to call pgstat_report_stat() immediately after pgstat_flush_io() in
walsender, so the counters continue to be updated periodically during
logical replication.
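The effect of periodic reporting can be sketched with a toy model. This is an illustration only: `periodic_largest_jump()` and its parameters are hypothetical, and flushing "every N transactions" is a simplification of reporting on a timer inside the walsender loop.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns the largest single jump of a modeled shared xact_rollback counter
 * when total_xacts decoded transactions are counted locally and flushed
 * every flush_every transactions, plus once at exit.
 * flush_every == 0 models the current behavior: flush only at exit.
 */
static uint64_t
periodic_largest_jump(uint64_t total_xacts, uint64_t flush_every)
{
    uint64_t pending = 0;
    uint64_t largest = 0;

    for (uint64_t i = 0; i < total_xacts; i++)
    {
        pending++;
        if (flush_every > 0 && pending == flush_every)
        {
            /* Periodic flush: the shared counter rises by "pending". */
            if (pending > largest)
                largest = pending;
            pending = 0;
        }
    }

    /* Final flush at exit covers whatever is still pending. */
    if (pending > largest)
        largest = pending;

    assert(largest <= total_xacts);
    return largest;
}
```

In this model the total added to the shared counter is the same either way; only the shape changes. With 10000 decoded transactions the jump is 10000 when flushing only at exit, and at most 100 when flushing every 100 transactions.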
On the other hand, your patch seems to assume a different definition: that
xact_rollback should count all explicit and implicit rollbacks, except those
performed by logical walsenders during logical replication. That would be
one possible approach, although it seems a bit odd to exclude only one subset
of rollbacks.
A third option would be to define xact_rollback more narrowly, counting only
rollbacks by regular backends, and excluding rollbacks by processes such as
autovacuum or walsender. At least in my view, xact_commit and xact_rollback
in pg_stat_database are typically used by DBAs to check whether
client transactions are committing or rolling back as expected. From
that perspective, it seems intuitive for xact_rollback to count only rollbacks
by regular backends. But others may reasonably see it differently.
Regards,
--
Fujii Masao
On Fri, 17 Apr 2026 at 20:45, Nikolay Samokhvalov <nik@postgres.ai> wrote:
> Hi hackers,
> There is a bug on logical-replication publishers where every decoded
> committed transaction bumps pg_stat_database.xact_rollback.
> ReorderBufferProcessTXN() ends each decoded transaction with
> AbortCurrentTransaction() for catalog cleanup; in the walsender that
> is a top-level abort, so AtEOXact_PgStat_Database(isCommit=false)
> increments the backend-local pgStatXactRollback.
> The counts are flushed to shared stats on walsender exit, producing
> an acute spike. Result: for production systems with SREs on call and tight
> alerting on xact_rollback, this turns routine logical-replication operations
> (disabling a subscription, dropping a slot, walsender restart) into
> false-positive pages.
> Reported in [1]; also experienced at GitLab [2][3][4].
> Attaching a simple patch that adds a backend-local flag pgStatXactSkipCounters
> in pgstat_database.c that AtEOXact_PgStat_Database() honors to skip
> the counter bump.
> Added TAP test that fails on master with 5/0 and passes with the patch.
> If there is agreement on this shape, happy to send patches for all
> supported branches. Let me know what you think.
Thanks for reporting this and for the patch. The problem description
matches what I've observed as well. The current behavior could be
misleading, since these rollbacks correspond to internal decoding
cleanup rather than actual user-visible transaction aborts.
Another approach could be to introduce a wrapper around
AbortCurrentTransaction(), for example
AbortCurrentTransactionWithoutUpdateStats(), that skips the
AtEOXact_PgStat() call in this case.
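A toy model of the wrapper idea, to make the contrast with a global flag concrete. Everything here is hypothetical: the `wrap_*` names stand in for the real functions, and the assumption is that both entry points share one abort path that differs only in whether the stats step runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model state. */
static uint64_t wrap_rollbacks = 0;    /* models the backend-local rollback count */
static uint64_t wrap_cleanup_runs = 0; /* models resource cleanup done by an abort */

/* Shared abort path: cleanup always runs, the stats update is optional. */
static void
wrap_abort_internal(bool update_stats)
{
    uint64_t before = wrap_rollbacks;

    wrap_cleanup_runs++;
    if (update_stats)
        wrap_rollbacks++;

    /* Skipping stats must leave the rollback counter untouched. */
    assert(update_stats || wrap_rollbacks == before);
}

/* Models the ordinary abort entry point. */
static void
wrap_abort_current_transaction(void)
{
    wrap_abort_internal(true);
}

/* Models the proposed wrapper that skips the stats update. */
static void
wrap_abort_without_update_stats(void)
{
    wrap_abort_internal(false);
}
```

Compared with a backend-local flag, this shape selects the behavior at the call site, so there is no state to set and restore around the abort.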
Thoughts?
Regards,
Vignesh