MultiXactId wraparound and last aggressive vacuum time
Hi,
My application is getting the following exception:
InternalError: (psycopg2.InternalError) MultiXactId 808263738 has not been
created yet -- apparent wraparound
I read over
https://www.postgresql.org/docs/9.6/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND
and I suspect the issue was caused by not vacuuming often enough. I can
query the last_autovacuum time from the pg_stat_all_tables view. However,
I found this in the documentation:
VACUUM uses the visibility map to determine which pages of a table must be
scanned. Normally, it will skip pages that don't have any dead row versions
even if those pages might still have row versions with old XID values.
Therefore, normal VACUUMs won't always freeze every old row version in the
table. Periodically, VACUUM will perform an aggressive vacuum, skipping
only those pages which contain neither dead rows nor any unfrozen XID or
MXID values. vacuum_freeze_table_age controls when VACUUM does that:
all-visible but not all-frozen pages are scanned if the number of
transactions that have passed since the last such scan is greater than
vacuum_freeze_table_age minus vacuum_freeze_min_age. Setting
vacuum_freeze_table_age to 0 forces VACUUM to use this more aggressive
strategy for all scans.
How can I tell when the last "aggressive" auto-vacuum was performed? The
vacuum_freeze_table_age parameter is at the default value (200000000).
I am using PostgreSQL 9.6.
Thanks,
Mike
I don't believe you can determine a date/time for when that happened, but
querying the age of the table is simple to do, and you can then compare it
with the freeze-age parameters. A periodic manual VACUUM FREEZE may
prevent the system from ever needing to perform the emergency autovacuum
freeze. As an added benefit, index-only scans would be more likely to be
chosen since the visibility map would be kept up to date, and regular
autovacuum runs would be faster since they can skip frozen pages. I think
that applies back to 9.6 at least; it's possible it got implemented in 10.
My caffeine hasn't fully kicked in yet and I am still a bit junior on this
list, but I would hope and expect senior people to correct me if I have
misunderstood or misrepresented things.
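The age check described above could look something like the query below (a sketch only; age() and mxid_age() report how far relfrozenxid and relminmxid lag behind the current XID/MultiXactId counters, and both functions exist in 9.6):

```sql
-- Per-table XID and MultiXactId age, alongside the freeze threshold,
-- so you can see which tables are approaching an aggressive vacuum.
SELECT c.relname,
       age(c.relfrozenxid)    AS xid_age,
       mxid_age(c.relminmxid) AS mxid_age,
       current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY mxid_age(c.relminmxid) DESC
LIMIT 10;
```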
After a closer reading of the documentation, it appears the database should
stop accepting new transactions before a wraparound would ever occur. If
so, then the only possible explanations for this MultiXactId wraparound
error would be data corruption or a bug in PostgreSQL. The exact version
I'm using is 9.6.10, which is quite a few versions behind the latest
(9.6.21), but I skimmed through the release notes of the later versions and
did not see any bugfixes in this area. That would leave data corruption as
the only explanation. Is my reasoning correct here? I'm willing to
upgrade but I would need to justify it somehow, so if I am missing
something please let me know.
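One way to sanity-check the wraparound-vs-corruption question (an assumption on my part, not something I've verified on this cluster) is to compare the MultiXactId from the error, 808263738, against the range the system considers valid:

```sql
-- datminmxid is the oldest MultiXactId that may still be referenced in
-- each database; a sane value from a tuple header should fall between
-- this and the NextMultiXactId reported by pg_controldata.
SELECT datname, datminmxid, mxid_age(datminmxid) AS mxid_age
FROM pg_database
ORDER BY mxid_age DESC;
```

If the failing value lies outside that range, that would point to a corrupted tuple header rather than a genuine wraparound.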
On Tue, Mar 23, 2021 at 11:54 AM Michael Lewis <mlewis@entrata.com> wrote:
On Mon, 5 Apr 2021 17:28:06 -0400
Michael Schanne <michael.schanne@gmail.com> wrote:
I didn't check the changelog, and you should definitely run 9.6.21, but I
believe your reasoning is correct anyway. A bug is possible, but I would
bet a coin on corruption.
You might want to compare this number with the value reported by:
pg_controldata $PGDATA|grep NextMultiXactId
Back up your cluster, then try to isolate the table(s) and block(s) where the
corruption occurs and inspect them using, e.g., pageinspect.
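If the pageinspect extension is available, the inspection could look roughly like this (the table name and block number are placeholders; 0x1000 is the HEAP_XMAX_IS_MULTI bit of t_infomask):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Show tuple headers for one block of the suspect table; rows whose
-- t_infomask has 0x1000 set carry a MultiXactId in t_xmax.
SELECT lp, t_xmax, t_infomask,
       (t_infomask & x'1000'::int) <> 0 AS xmax_is_multi
FROM heap_page_items(get_raw_page('your_table', 0));
```

Any row where xmax_is_multi is true and t_xmax is at or beyond NextMultiXactId would be a damaged tuple.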
I'm willing to upgrade but I would need to justify it somehow, so if I am
missing something please let me know.
You can justify the upgrade with this long list of reasons:
https://why-upgrade.depesz.com/show?from=9.6.10&to=9.6.21
Regards,