BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Hello, hackers.
It seems like PG 14 works incorrectly with vacuum_defer_cleanup_age
(or perhaps rows are just not being cleaned up, I'm not sure) and
SELECT FOR UPDATE + UPDATE.
I am not certain, but hot_standby_feedback is probably able to cause the
same issues.
Steps to reproduce:
1) Start Postgres like this:
docker run -it -p 5432:5432 --name pg -e
POSTGRES_PASSWORD=postgres -e LANG=C.UTF-8 -d postgres:14.6 -c
vacuum_defer_cleanup_age=1000000
2) Prepare scheme:
CREATE TABLE something_is_wrong_here (id bigserial PRIMARY KEY,
value numeric(15,4) DEFAULT 0 NOT NULL);
INSERT INTO something_is_wrong_here (value) (SELECT 10000 from
generate_series(0, 100));
3) Prepare file for pgbench:
BEGIN;
SELECT omg.*
FROM something_is_wrong_here AS omg
ORDER BY random()
LIMIT 1
FOR UPDATE
\gset
UPDATE something_is_wrong_here SET value = :value + 1 WHERE id = :id;
END;
4) Run pgbench:
pgbench -c 50 -j 2 -n -f reproduce.bench 'port= 5432
host=localhost user=postgres dbname=postgres password=postgres' -T 100
-P 1
I was able to see the following set of errors (they look scary):
ERROR: MultiXactId 30818104 has not been created yet -- apparent wraparound
ERROR: could not open file "base/13757/16385.1" (target block
39591744): previous segment is only 24 blocks
ERROR: attempted to lock invisible tuple
ERROR: could not access status of transaction 38195704
DETAIL: Could not open file "pg_subtrans/0246": No such file or directory.
Best regards,
Michail.
Hello, Andres.
I apologize for the direct ping, but I think your snapshot scalability
work in PG14 could be related to the issue.
The TransactionIdRetreatedBy implementation looks correct... But with
txid_current=212195 I see errors like "could not access status of
transaction 58643736"...
So, maybe vacuum_defer_cleanup_age just highlights some special case
(something with "previous" xids on the left side of zero?)....
Thanks,
Michail.
On Thu, 5 Jan 2023 at 14:12, Michail Nikolaev <michail.nikolaev@gmail.com>
wrote:
Hello, hackers.
It seems like PG 14 works incorrectly with vacuum_defer_cleanup_age
(or perhaps rows are just not being cleaned up, I'm not sure) and
SELECT FOR UPDATE + UPDATE.
I am not certain, but hot_standby_feedback is probably able to cause the
same issues.
Steps to reproduce:
[steps]
I was able to see the following set of errors (they look scary):
ERROR: MultiXactId 30818104 has not been created yet -- apparent
wraparound
ERROR: could not open file "base/13757/16385.1" (target block
39591744): previous segment is only 24 blocks
This looks quite suspicious too - it wants to access a block at 296GB of
data, where only 196kB exist.
ERROR: attempted to lock invisible tuple
ERROR: could not access status of transaction 38195704
DETAIL: Could not open file "pg_subtrans/0246": No such file or
directory.
I just saw two instances of this "attempted to lock invisible tuple" error
for the 15.1 image (run on Docker in Ubuntu in WSL) with your reproducer
script, so this does not seem to be specific to PG14 (.6).
And, after some vacuum and restarting the process, I got the following:
client 29 script 0 aborted in command 2 query 0: ERROR: heap tid from
index tuple (111,1) points past end of heap page line pointer array at
offset 262 of block 1 in index "something_is_wrong_here_pkey"
There is indeed something wrong there; the page can't be read by
pageinspect:
$ select get_raw_page('public.something_is_wrong_here', 111)::bytea;
ERROR: invalid page in block 111 of relation base/5/16385
I don't have access to the v14 data anymore (I tried a restart, which
dropped the data :-( ), but will retain my v15 instance for some time to
help any debugging.
Kind regards,
Matthias van de Meent
On Thu, Jan 5, 2023 at 1:49 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
client 29 script 0 aborted in command 2 query 0: ERROR: heap tid from index tuple (111,1) points past end of heap page line pointer array at offset 262 of block 1 in index "something_is_wrong_here_pkey"
This particular error message is from the hardening added to Postgres
15 in commit e7428a99. So it's not surprising that Michail didn't see
the same error on 14.
--
Peter Geoghegan
On Thu, Jan 5, 2023 at 3:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
This particular error message is from the hardening added to Postgres
15 in commit e7428a99. So it's not surprising that Michail didn't see
the same error on 14.
Reproduced this on HEAD locally (no docker), without any difficulty.
FWIW, I find that the "Assert(ItemIdIsNormal(lp));" at the top of
heap_lock_tuple() is the first thing that fails on my assert-enabled
build.
--
Peter Geoghegan
Hello!
Thanks for checking the issue!
FWIW, I find that the "Assert(ItemIdIsNormal(lp));" at the top of
heap_lock_tuple() is the first thing that fails on my assert-enabled
build.
Yes, the same for me:
TRAP: failed Assert("ItemIdIsNormal(lp)"), File: "heapam.c",
Line: 4292, PID: 33416
Reproduced this on HEAD locally (no docker), without any difficulty.
It is a little bit harder to reproduce without Docker in my case; I
needed to adjust the number of connections and threads:
pgbench -c 90 -j 8 -n -f reproduce.bench 'port= 5432
host=localhost user=postgres dbname=postgres password=postgres' -T
2000 -P 1
Best regards,
Michail.
Hello.
The few things I have got so far:
1) It is not required to order by random() to reproduce the issue - it
could be done using queries like:
BEGIN;
SELECT omg.*
FROM something_is_wrong_here AS omg
ORDER BY value -- change is here
LIMIT 1
FOR UPDATE
\gset
UPDATE something_is_wrong_here SET value = :value + 1 WHERE id = :id;
COMMIT;
But for some reason it is harder to reproduce without random() in my
case (typically I need to wait for about a minute with 100 connections).
2) It is not an issue at table creation time. The issue is reproducible if
vacuum_defer_cleanup_age is set after table preparation.
3) To reproduce the issue, vacuum_defer_cleanup_age should flip the xid
over zero (i.e. be >= txid_current()).
And it is stable: for example, I am unable to reproduce with a value of
733, but 734 gives the error each time.
Just a single additional txid_current() call (after the data is filled)
avoids the crash... It looks like the first SELECT FOR UPDATE + UPDATE
silently poisons everything somehow.
You could use a psql script like this:
DROP TABLE IF EXISTS something_is_wrong_here;
CREATE TABLE something_is_wrong_here (id bigserial PRIMARY KEY,
value numeric(15,4) DEFAULT 0 NOT NULL);
INSERT INTO something_is_wrong_here (value) (SELECT 10000 from
generate_series(0, 100));
SELECT txid_current() \gset
SELECT :txid_current + 1 as txid \gset
ALTER SYSTEM SET vacuum_defer_cleanup_age TO :txid;
SELECT pg_reload_conf();
I have attached some scripts in case someone wants to reproduce it.
Best regards,
Michail.
Hi,
Thomas, CCing you because of the 64bit xid representation aspect.
On 2023-01-06 00:39:44 +0300, Michail Nikolaev wrote:
I apologize for the direct ping, but I think your snapshot scalability
work in PG14 could be related to the issue.
Good call!
The TransactionIdRetreatedBy implementation looks correct... But with
txid_current=212195 I see errors like "could not access status of
transaction 58643736"...
So, maybe vacuum_defer_cleanup_age just highlights some special case
(something with "previous" xids on the left side of zero?)....
I think the bug is close to TransactionIdRetreatedBy(). Arguably in
FullXidRelativeTo(). Or a more fundamental data representation issue with
64bit xids.
To explain, here's a trace to the bottom of GetSnapshotData() leading to the
problem:
In the case I'm looking at here we end up with 720 for:
oldestfxid = FullXidRelativeTo(latest_completed, oldestxid);
and xmin is 255271; both are correct.
Then in TransactionIdRetreatedBy:
/* apply vacuum_defer_cleanup_age */
def_vis_xid_data =
TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age);
things start to be iffy. Because we retreat by vacuum_defer_cleanup_age, which
was set to txid_current() in scheme.sql, and the xmin above is that xid,
TransactionIdRetreatedBy() first ends up with 0. It then backtracks further to
the highest 32bit xid (i.e. 4294967295). So far so good.
We could obviously end up with values further in the past as well, if
vacuum_defer_cleanup_age were larger.
Things start to seriously go off the rails when we convert that 32bit xid to
64 bit with:
def_vis_fxid = FullXidRelativeTo(latest_completed, def_vis_xid);
which returns {value = 18446744073709551615}, i.e. 0-1 in 64bit arithmetic.
However, as 64bit xids are not supposed to wrap around, we're in trouble -
it's an xid *very* far into the future, allowing things to be pruned that
shouldn't be, because everything is below that horizon.
I don't quite know how to best fix this. 4294967295 is the correct result for
TransactionIdRetreatedBy() in this case, and it'd not cause problems for
FullXidRelativeTo() if we actually had wrapped around the xid counter before
(since then we'd just represent it as a fxid "of the proper epoch").
Basically, in contrast to 32bit xids, 64bit xids cannot represent xids from
before the start of the universe, whereas with the modulo arithmetic of 32bit
that's not a real problem.
It's probably not too hard to fix specifically in this one place - we could
just clamp vacuum_defer_cleanup_age to be smaller than latest_completed, but
it strikes me as a somewhat larger issue for the 64bit xid infrastructure. I
suspect this might not be the only place running into problems with such
"before the universe" xids.
For a bit I was thinking that we should introduce the notion that a
FullTransactionId can be from the past. Specifically that when the upper 32bit
are all set, we treat the lower 32bit as a value from before xid 0 using the
normal 32bit xid arithmetic. But it sucks to add overhead for that
everywhere.
It might be a bit more palatable to designate one individual value,
e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far
before the start of the universe an xid points to...
FullXidRelativeTo() did assert that the input 32bit xid is in a reasonable
range, but unfortunately didn't do a similar check for the 64bit case.
Greetings,
Andres Freund
Hi,
On 2023-01-07 21:06:06 +0300, Michail Nikolaev wrote:
2) It is not an issue at table creation time. The issue is reproducible if
vacuum_defer_cleanup_age is set after table preparation.
3) To reproduce the issue, vacuum_defer_cleanup_age should flip the xid
over zero (be >= txid_current()).
And it is stable.... So, for example - unable to reproduce with 733
value, but 734 gives error each time.
Just a single additional txid_current() (after data is filled) fixes a
crash... It looks like the first SELECT FOR UPDATE + UPDATE silently
poisons everything somehow.
You could use such PSQL script:
FWIW, the concrete value for vacuum_defer_cleanup_age is not crucial to
encounter the problem. It needs to be a value that, when compared to the xid
that did the "INSERT INTO something_is_wrong_here", results in value <= 0.
Setting vacuum_defer_cleanup_age to a value much larger than the xid allows
the crash to be encountered repeatedly.
Greetings,
Andres Freund
Hi,
On 2023-01-07 16:29:23 -0800, Andres Freund wrote:
It's probably not too hard to fix specifically in this one place - we could
just clamp vacuum_defer_cleanup_age to be smaller than latest_completed, but
it strikes me as a somewhat larger issue for the 64bit xid infrastructure. I
suspect this might not be the only place running into problems with such
"before the universe" xids.
I haven't found other problematic places in HEAD, but did end up finding a less
serious version of this bug in < 14: GetFullRecentGlobalXmin(). I did verify
that with vacuum_defer_cleanup_age set GetFullRecentGlobalXmin() returns
values that look likely to cause problems. It's "just" used in gist, luckily.
It's hard to find places that do this kind of arithmetic; we traditionally
haven't had a helper for it, so it's open-coded in various ways.
xidStopLimit = xidWrapLimit - 3000000;
if (xidStopLimit < FirstNormalTransactionId)
xidStopLimit -= FirstNormalTransactionId;
and oddly:
xidVacLimit = oldest_datfrozenxid + autovacuum_freeze_max_age;
if (xidVacLimit < FirstNormalTransactionId)
xidVacLimit += FirstNormalTransactionId;
or (in < 14):
RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
if (!TransactionIdIsNormal(RecentGlobalXmin))
RecentGlobalXmin = FirstNormalTransactionId;
The currently existing places I found, other than the ones in procarray.c,
luckily don't seem to convert the xids to 64bit xids.
For a bit I was thinking that we should introduce the notion that a
FullTransactionId can be from the past. Specifically that when the upper 32bit
are all set, we treat the lower 32bit as a value from before xid 0 using the
normal 32bit xid arithmetic. But it sucks to add overhead for that
everywhere.
It might be a bit more palatable to designate one individual value,
e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far
before the start of the universe an xid points to...
On IM Thomas suggested we could reserve the 2^32-1 epoch for invalid values. I
hacked up a patch that converts various fxid functions to inline functions
with such assertions, and it indeed quickly catches the problem this thread
reported, close to the source of the use.
One issue with that is that it'd reduce what can be input for the xid8
type. But it's hard to believe that'd be a real issue?
It's quite unfortunate that we don't have a test for vacuum_defer_cleanup_age
yet :(.
Greetings,
Andres Freund
On Sun, 8 Jan 2023 at 04:09, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-01-07 16:29:23 -0800, Andres Freund wrote:
It's probably not too hard to fix specifically in this one place - we could
just clamp vacuum_defer_cleanup_age to be smaller than latest_completed, but
it strikes me as as a somewhat larger issue for the 64it xid infrastructure. I
suspect this might not be the only place running into problems with such
"before the universe" xids.I haven't found other problematic places in HEAD, but did end up find a less
serious version of this bug in < 14: GetFullRecentGlobalXmin(). I did verify
that with vacuum_defer_cleanup_age set GetFullRecentGlobalXmin() returns
values that look likely to cause problems. It's "just" used in gist, luckily.
It's hard to find places that do this kind of arithmetic; we traditionally
haven't had a helper for it. So it's open-coded in various ways.
[...]
The currently existing places I found, other than the ones in procarray.c,
luckily don't seem to convert the xids to 64bit xids.
That's good to know.
For a bit I was thinking that we should introduce the notion that a
FullTransactionId can be from the past. Specifically that when the upper 32bit
are all set, we treat the lower 32bit as a value from before xid 0 using the
normal 32bit xid arithmetic. But it sucks to add overhead for that
everywhere.
It might be a bit more palatable to designate one individual value,
e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far
before the start of the universe an xid points to...
On IM Thomas suggested we could reserve the 2^32-1 epoch for invalid values. I
hacked up a patch that converts various fxid functions to inline functions
with such assertions, and it indeed quickly catches the problem this thread
reported, close to the source of the use.
Wouldn't it be enough to only fix the constructions in
FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
not occur with the patch), and (optionally) bump the first XID
available for any cluster to (FirstNormalXid + 1) to retain the 'older
than any running transaction' property?
The change only fixes the issue for FullTransactionId, which IMO is OK
- I don't think we need to keep xid->xid8->xid symmetric in cases of
xid8 wraparound.
One issue with that is that it'd reduce what can be input for the xid8
type. But it's hard to believe that'd be a real issue?
Yes, it's unlikely anyone would ever hit that with our current WAL
format - we use 24 bytes per xid just to log its use, so we'd use at
most epoch 0x1000_0000 in unrealistic scenarios. In addition,
technically, we already have (3*2^32 - 3) "invalid" xid8 values that
can never be produced in FullXidRelativeTo - those few extra invalid
values don't matter much to me except "even more special casing".
Kind regards,
Matthias van de Meent.
Attachments:
v1-0001-Prevent-underflow-of-xid8-epoch.patch (application/octet-stream)
Hi,
Robert, Mark, CCing you because of the amcheck aspect (see below).
On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
On Sun, 8 Jan 2023 at 04:09, Andres Freund <andres@anarazel.de> wrote:
For a bit I was thinking that we should introduce the notion that a
FullTransactionId can be from the past. Specifically that when the upper 32bit
are all set, we treat the lower 32bit as a value from before xid 0 using the
normal 32bit xid arithmetic. But it sucks to add overhead for that
everywhere.
It might be a bit more palatable to designate one individual value,
e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far
before the start of the universe an xid points to...
On IM Thomas suggested we could reserve the 2^32-1 epoch for invalid values. I
hacked up a patch that converts various fxid functions to inline functions
with such assertions, and it indeed quickly catches the problem this thread
reported, close to the source of the use.
Wouldn't it be enough to only fix the constructions in
FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
not occur with the patch), and (optionally) bump the first XID
available for any cluster to (FirstNormalXid + 1) to retain the 'older
than any running transaction' property?
It's not too hard to fix in individual places, but I suspect that we'll
introduce the bug in future places without some more fundamental protection.
Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
in ComputeXidHorizons() and GetSnapshotData().
Fixing it in FullXidRelativeTo() doesn't seem quite right: while it's ok to
just return FirstNormalTransactionId in the case of vacuum_defer_cleanup_age,
it doesn't seem necessarily correct for other cases.
The change only fixes the issue for FullTransactionId, which IMO is OK
- I don't think we need to keep xid->xid8->xid symmetric in cases of
xid8 wraparound.
I think we should keep that symmetric, it just gets too confusing / easy to
miss bugs otherwise.
One issue with that is that it'd reduce what can be input for the xid8
type. But it's hard to believe that'd be a real issue?
Yes, it's unlikely anyone would ever hit that with our current WAL
format - we use 24 bytes per xid just to log its use, so we'd use at
most epoch 0x1000_0000 in unrealistic scenarios. In addition,
technically, we already have (3*2^32 - 3) "invalid" xid8 values that
can never be produced in FullXidRelativeTo - those few extra invalid
values don't matter much to me except "even more special casing".
Yep. The attached 0002 is a first implementation of this.
The new assertions found at least one bug in amcheck, and one further example
of the problem of representing past 32bit xids in 64bit:
1) Because ctx->next_xid is set after the XidFromFullTransactionId() call in
update_cached_xid_range(), we end up using the xid 0 (or an outdated value in
subsequent calls) to determine whether epoch needs to be reduced.
2) One test includes an xid from the future (4026531839), which
causes the epoch to wrap around (via the epoch--) in
FullTransactionIdFromXidAndCtx(). I've hackily fixed that by just representing
it as an xid from the future instead. But I'm not sure that's a good answer.
A different approach would be to represent fxids as *signed* 64bit
integers. That'd of course lose more range, but could represent everything
accurately, and would have a compatible on-disk representation on two's
complement platforms (all our platforms). I think the only place that'd need
special treatment is U64FromFullTransactionId() / its callers. I think this
might be the most robust approach.
Greetings,
Andres Freund
Attachments:
v2-0001-Fix-corruption-due-to-vacuum_defer_cleanup_age-un.patch (text/x-diff)
v2-0002-wip-amcheck-Fix-bugs-relating-to-64bit-xids.patch (text/x-diff)
v2-0003-wip-stricter-validation-for-64bit-xids.patch (text/x-diff)
On Tue, Jan 10, 2023 at 8:34 AM Andres Freund <andres@anarazel.de> wrote:
A different approach would be to represent fxids as *signed* 64bit
integers. That'd of course lose more range, but could represent everything
accurately, and would have a compatible on-disk representation on two's
complement platforms (all our platforms). I think the only place that'd need
special treatment is U64FromFullTransactionId() / its callers. I think this
might be the most robust approach.
It does sound like an interesting approach; it means you are free to
retreat arbitrarily without ever thinking about it, and by the
arguments given (LSN space required to consume fxids) it's still
'enough'. Essentially all these bugs are places where the author
already believed it worked that way.
(Two's complement is required in the C23 draft.)
On Jan 9, 2023, at 11:34 AM, Andres Freund <andres@anarazel.de> wrote:
1) Because ctx->next_xid is set after the XidFromFullTransactionId() call in
update_cached_xid_range(), we end up using the xid 0 (or an outdated value in
subsequent calls) to determine whether epoch needs to be reduced.
Can you say a bit more about your analysis here, preferably with pointers to the lines of code you are analyzing? Does the problem exist in amcheck as currently committed, or are you thinking about a problem that arises only after applying your patch? I'm a bit fuzzy on where xid 0 gets used.
Thanks
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2023-01-09 13:55:02 -0800, Mark Dilger wrote:
On Jan 9, 2023, at 11:34 AM, Andres Freund <andres@anarazel.de> wrote:
1) Because ctx->next_xid is set after the XidFromFullTransactionId() call in
update_cached_xid_range(), we end up using the xid 0 (or an outdated value in
subsequent calls) to determine whether epoch needs to be reduced.
Can you say a bit more about your analysis here, preferably with pointers to
the lines of code you are analyzing? Does the problem exist in amcheck as
currently committed, or are you thinking about a problem that arises only
after applying your patch? I'm a bit fuzzy on where xid 0 gets used.
The problems exist in the code as currently committed. I'm not sure what
exactly the consequences are; the result is that oldest_fxid will be, at
least temporarily, bogus.
Consider the first call to update_cached_xid_range():
/*
* Update our cached range of valid transaction IDs.
*/
static void
update_cached_xid_range(HeapCheckContext *ctx)
{
/* Make cached copies */
LWLockAcquire(XidGenLock, LW_SHARED);
ctx->next_fxid = ShmemVariableCache->nextXid;
ctx->oldest_xid = ShmemVariableCache->oldestXid;
LWLockRelease(XidGenLock);
/* And compute alternate versions of the same */
ctx->oldest_fxid = FullTransactionIdFromXidAndCtx(ctx->oldest_xid, ctx);
ctx->next_xid = XidFromFullTransactionId(ctx->next_fxid);
}
The problem is that the call to FullTransactionIdFromXidAndCtx() happens
before ctx->next_xid is assigned, even though FullTransactionIdFromXidAndCtx()
uses ctx->next_xid.
static FullTransactionId
FullTransactionIdFromXidAndCtx(TransactionId xid, const HeapCheckContext *ctx)
{
uint32 epoch;
if (!TransactionIdIsNormal(xid))
return FullTransactionIdFromEpochAndXid(0, xid);
epoch = EpochFromFullTransactionId(ctx->next_fxid);
if (xid > ctx->next_xid)
epoch--;
return FullTransactionIdFromEpochAndXid(epoch, xid);
}
Because ctx->next_xid is 0, due to not having been set yet, "xid > ctx->next_xid"
will always be true, leading to epoch being reduced by one.
In the common case of there never having been an xid wraparound, we'll thus
underflow epoch, generating an xid far into the future.
The tests encounter the issue today. If you add
Assert(TransactionIdIsValid(ctx->next_xid));
Assert(FullTransactionIdIsValid(ctx->next_fxid));
early in FullTransactionIdFromXidAndCtx() it'll be hit in the
amcheck/pg_amcheck tests.
Greetings,
Andres Freund
On Jan 9, 2023, at 2:07 PM, Andres Freund <andres@anarazel.de> wrote:
The tests encounter the issue today. If you add
Assert(TransactionIdIsValid(ctx->next_xid));
Assert(FullTransactionIdIsValid(ctx->next_fxid));
early in FullTransactionIdFromXidAndCtx() it'll be hit in the
amcheck/pg_amcheck tests.
Ok, I can confirm that. I find the assertion
Assert(epoch != (uint32)-1);
a bit simpler to reason about, but either way, I agree it is a bug. Thanks for finding this.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 9 Jan 2023 at 20:34, Andres Freund <andres@anarazel.de> wrote:
On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
Wouldn't it be enough to only fix the constructions in
FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
not occur with the patch), and (optionally) bump the first XID
available for any cluster to (FirstNormalXid + 1) to retain the 'older
than any running transaction' property?
It's not too hard to fix in individual places, but I suspect that we'll
introduce the bug in future places without some more fundamental protection.
Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
in ComputeXidHorizons() and GetSnapshotData().
I don't think that clamping the value with oldestXid (as seen in patch
0001, in GetSnapshotData) is right.
It would clamp the value relative to the oldest frozen xid of all
databases, which can be millions of transactions behind oldestXmin,
and thus severely skew the amount of transaction's changes you keep on
disk (that is, until oldestXid moves past 1000_000).
A similar case can be made for the changes in ComputeXidHorizons - for
the described behaviour of vacuum_defer_cleanup_age we would need to
clamp the used offset separately for each of the fields in the horizon
result to retain all transaction data for the first 1 million
transactions, and the ones that may still see these transactions.
Fixing it in FullXidRelativeTo() doesn't seem quite right: while it's ok to
just return FirstNormalTransactionId in the case of vacuum_defer_cleanup_age,
it doesn't seem necessarily correct for other cases.
Understood.
Kind regards,
Matthias van de Meent
Hi,
On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
On Mon, 9 Jan 2023 at 20:34, Andres Freund <andres@anarazel.de> wrote:
On 2023-01-09 17:50:10 +0100, Matthias van de Meent wrote:
Wouldn't it be enough to only fix the constructions in
FullXidRelativeTo() and widen_snapshot_xid() (as attached, $topic does
not occur with the patch), and (optionally) bump the first XID
available for any cluster to (FirstNormalXid + 1) to retain the 'older
than any running transaction' property?
It's not too hard to fix in individual places, but I suspect that we'll
introduce the bug in future places without some more fundamental protection.
Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
in ComputeXidHorizons() and GetSnapshotData().
I don't think that clamping the value with oldestXid (as seen in patch
0001, in GetSnapshotData) is right.
I agree that using oldestXid to clamp is problematic.
It would clamp the value relative to the oldest frozen xid of all
databases, which can be millions of transactions behind oldestXmin,
and thus severely skew the amount of transaction's changes you keep on
disk (that is, until oldestXid moves past 1000_000).
What precisely do you mean with "skew" here? Do you just mean that it'd take a
long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
you might mean more than that?
I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
seems like a mighty invasive change to backpatch.
Greetings,
Andres Freund
On Tue, 10 Jan 2023 at 20:14, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-01-10 15:03:42 +0100, Matthias van de Meent wrote:
On Mon, 9 Jan 2023 at 20:34, Andres Freund <andres@anarazel.de> wrote:
It's not too hard to fix in individual places, but I suspect that we'll
introduce the bug in future places without some more fundamental protection.
Locally I fixed it by clamping vacuum_defer_cleanup_age to a reasonable value
in ComputeXidHorizons() and GetSnapshotData().
I don't think that clamping the value with oldestXid (as seen in patch
0001, in GetSnapshotData) is right.
I agree that using oldestXid to clamp is problematic.
It would clamp the value relative to the oldest frozen xid of all
databases, which can be millions of transactions behind oldestXmin,
and thus severely skew the amount of transaction's changes you keep on
disk (that is, until oldestXid moves past 1000_000).
What precisely do you mean with "skew" here? Do you just mean that it'd take a
long time until vacuum_defer_cleanup_age takes effect? Somehow it sounds like
you might mean more than that?
h->oldest_considered_running can be extremely old due to the global
nature of the value and the potential existence of a snapshot in
another database that started in parallel to a very old running
transaction.
Example: With vacuum_defer_cleanup_age set to 1000000, it is possible
that a snapshot in another database (thus another backend) would
result in a local intermediate status result of h->o_c_r = 20,
h->s_o_n = 20, h->d_o_n = 10030. The clamped offset would then be 20
(clamped using h->o_c_r), which updates h->data_oldest_nonremovable to
10010. The obvious result is that all but the last 20 transactions
from this database's data files are available for cleanup, which
contradicts the intention of the vacuum_defer_cleanup_age GUC.
I'm tempted to go with reinterpreting 64bit xids as signed. Except that it
seems like a mighty invasive change to backpatch.
I'm not sure either. Protecting against underflow by halving the
effective valid value space is quite the intervention, but if it is
necessary to make this work in a performant manner, it would be worth
it. Maybe someone else with more experience can provide their opinion
here.
Kind regards,
Matthias van de Meent
Hello.
I have registered it as a patch in the commitfest:
https://commitfest.postgresql.org/42/4138/
Best regards,
Michail.