snapshot too old issues, first around wraparound and then more.

Started by Andres Freund · about 6 years ago · 79 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi,

Sorry, this mail is somewhat long. But I think it's important that at
least a few committers read it, since I think we're going to have to
make some sort of call about what to do.

I am trying to change the snapshot too old infrastructure so it
cooperates with my snapshot scalability patch. While trying to
understand the code sufficiently, I think I found a fairly serious
issue:

To map the time-based old_snapshot_threshold to an xid that can be used
as a cutoff for heap_page_prune(), we maintain a ringbuffer of
old_snapshot_threshold + 10 entries in
oldSnapshotControlData->xid_by_minute[]. TransactionIdLimitedForOldSnapshots()
uses that to (at least that's the goal) increase the horizon used for
pruning.

The problem is that there's no protection against the xids in the
ringbuffer getting old enough to wrap around. Given that practical uses
of old_snapshot_threshold are likely to be several hours to several
days, that's not particularly hard to hit.

That then has the consequence that we can use an xid that's either
"from the future" (if bigger than the current xid), or more recent than
appropriate (if it wrapped far enough to be below nextxid, but not yet
older than OldestXmin) as the OldestXmin argument to heap_page_prune().

Which in turn means that we can end up pruning much more recently
removed rows than intended.
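To make the wraparound hazard concrete, here is a standalone sketch (illustrative code, not from PostgreSQL; xid_precedes mimics the modulo-2^32 comparison that TransactionIdPrecedes performs for normal xids). Once the xid counter has advanced more than 2^31 past a stale ringbuffer entry, that entry compares as being "from the future":

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Circular comparison of 32-bit xids, as done for normal xids in
 * PostgreSQL: "a precedes b" iff the signed 32-bit difference is
 * negative.  This only gives sensible answers while a and b are
 * less than 2^31 transactions apart.
 */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}
```

With stale = 1000, xid_precedes(stale, stale + 1000000) is true as expected; but xid_precedes(stale, stale + 0x80000001) is false, i.e. the stale entry now appears to *follow* the current xid — exactly the "xid from the future" failure mode described above.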

While that'd be bad on its own, the big problem is that we won't detect
that case on access, in contrast to the way old_snapshot_threshold is
intended to work. The reason for that is detecting these errors happens
on the basis of timestamps - which obviously do not wrap around.

This made me want to try to reproduce the problem to at least some
degree. But I hit another wall: I can't make heads or tails of the
values in the xid_by_minute[] mapping.

I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

And the output turns out to be something like:

WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 644169619
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

WARNING: head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382

WARNING: old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 666899382
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

It's pretty obvious that the xids don't make a ton of sense, I think:
They're not monotonically ordered. The same values exist multiple times,
despite xids being constantly used. Also, despite the ringbuffer
supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
always add), and the workload having run for 14min, we only have 8
entries.

Then a bit later we get:

WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 666899382
entry 5 (ring 5): min 12: xid 666899382
entry 6 (ring 6): min 13: xid 666899382
entry 7 (ring 7): min 14: xid 666899382

WARNING: head 420 s: filling 8 buckets starting at 0 for sec 900 with xmin 666899382
WARNING: old snapshot mapping at "after update" with head ts: 15, current entries: 15 max entries: 15, offset: 1
entry 0 (ring 1): min 15: xid 654154155
entry 1 (ring 2): min 16: xid 661972949
entry 2 (ring 3): min 17: xid 666899382
entry 3 (ring 4): min 18: xid 666899382
entry 4 (ring 5): min 19: xid 666899382
entry 5 (ring 6): min 20: xid 666899382
entry 6 (ring 7): min 21: xid 666899382
entry 7 (ring 8): min 22: xid 666899382
entry 8 (ring 9): min 23: xid 666899382
entry 9 (ring 10): min 24: xid 666899382
entry 10 (ring 11): min 25: xid 666899382
entry 11 (ring 12): min 26: xid 666899382
entry 12 (ring 13): min 27: xid 666899382
entry 13 (ring 14): min 28: xid 666899382
entry 14 (ring 0): min 29: xid 666899382

At a later point we then enter the "Advance is so far that all old data
is junk; start over." branch, and just reset the whole mapping:
entry 0 (ring 0): min 30: xid 866711525

The problem, as far as I can tell, is that
oldSnapshotControl->head_timestamp appears to be intended to be the
oldest value in the ring. But we update it unconditionally in the "need
a new bucket, but it might not be the very next one" branch of
MaintainOldSnapshotTimeMapping().

While there's not really a clear-cut comment explaining whether
head_timestamp is intended to be the oldest or the newest timestamp,
it seems to me that the rest of the code treats it as the "oldest"
timestamp.

TransactionId
TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
									Relation relation)
...
	ts = AlignTimestampToMinuteBoundary(ts)
		- (old_snapshot_threshold * USECS_PER_MINUTE);
...
	LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);

	if (oldSnapshotControl->count_used > 0
		&& ts >= oldSnapshotControl->head_timestamp)
	{
		int			offset;

		offset = ((ts - oldSnapshotControl->head_timestamp)
				  / USECS_PER_MINUTE);
		if (offset > oldSnapshotControl->count_used - 1)
			offset = oldSnapshotControl->count_used - 1;
		offset = (oldSnapshotControl->head_offset + offset)
			% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
		xlimit = oldSnapshotControl->xid_by_minute[offset];

		if (NormalTransactionIdFollows(xlimit, recentXmin))
			SetOldSnapshotThresholdTimestamp(ts, xlimit);
	}

	LWLockRelease(OldSnapshotTimeMapLock);

So we wind ts back by old_snapshot_threshold minutes. Then check that
that's still newer than oldSnapshotControl->head_timestamp - which
clearly can't be the case if it were the newest ts. And as far as I can
tell the indexing code also only makes sense if head_timestamp is the
oldest timestamp.
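A minimal model of that lookup (hypothetical simplified names; MAP_ENTRIES stands in for OLD_SNAPSHOT_TIME_MAP_ENTRIES) shows the arithmetic only lands on a valid bucket if head holds the oldest mapped minute:

```c
#include <assert.h>
#include <stdint.h>

#define MAP_ENTRIES 15

/*
 * Simplified model of the xid_by_minute[] lookup: head_min is the minute
 * stored at ring[head_off], with later minutes following in ring order.
 * The subtraction only yields a usable index if head_min is the OLDEST
 * mapped minute; if head were the newest, target_min - head_min would be
 * negative for every lookup of an older minute.
 */
static uint32_t
lookup_xid_for_minute(const uint32_t *ring, int head_off, int head_min,
					  int count_used, int target_min)
{
	int			offset = target_min - head_min;

	if (offset > count_used - 1)
		offset = count_used - 1;	/* clamp to the newest entry */
	return ring[(head_off + offset) % MAP_ENTRIES];
}
```

For a ring holding minutes 7..14 in slots 0..7 (as in the debug output above), a lookup for minute 9 returns slot 2, and anything past minute 14 clamps to slot 7.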

This would mean that in most cases where the old_snapshot_threshold
feature is active it would cause corruption: we'd not trigger errors on
access, because the timestamp set with SetOldSnapshotThresholdTimestamp()
would not actually match the xids used to limit. But:

It turns out to be somewhat hard to get
TransactionIdLimitedForOldSnapshots() to actually do something. Because
oldSnapshotControl->head_timestamp is updated much more often than it
should, the ts >= oldSnapshotControl->head_timestamp condition will
often prevent the limiting code from being hit.

But it's not very unlikely either. Due to the update of head_timestamp
to the current timestamp, we'll enter the "existing mapping; advance xid
if possible" branch for up to OLD_SNAPSHOT_TIME_MAP_ENTRIES times. Which
means we can hit it for
/*
* The structure used to map times to TransactionId values for the "snapshot
* too old" feature must have a few entries at the tail to hold old values;
* otherwise the lookup will often fail and the expected early pruning or
* vacuum will not usually occur. It is best if this padding is for a number
* of minutes greater than a thread would normally be stalled, but it's OK if
* early vacuum opportunities are occasionally missed, so there's no need to
* use an extreme value or get too fancy. 10 minutes seems plenty.
*/
#define OLD_SNAPSHOT_PADDING_ENTRIES 10
#define OLD_SNAPSHOT_TIME_MAP_ENTRIES (old_snapshot_threshold + OLD_SNAPSHOT_PADDING_ENTRIES)

10 minutes, I think. There's some other ways too, but they're much less
likely.

Note that once the issue has been hit once, future
SetOldSnapshotThresholdTimestamp() calls that don't hit those 10 minutes
will also return a corrupted horizon, because
oldSnapshotControl->threshold_xid will have the wrong value, which then
will be used:
/*
* Failsafe protection against vacuuming work of active transaction.
*
* This is not an assertion because we avoid the spinlock for
* performance, leaving open the possibility that xlimit could advance
* and be more current; but it seems prudent to apply this limit. It
* might make pruning a tiny bit less aggressive than it could be, but
* protects against data loss bugs.
*/
	if (TransactionIdIsNormal(latest_xmin)
		&& TransactionIdPrecedes(latest_xmin, xlimit))
		xlimit = latest_xmin;

	if (NormalTransactionIdFollows(xlimit, recentXmin))
		return xlimit;

As far as I can tell, this code has been wrong since the feature has
been committed. The tests don't show a problem, because none of this
code is reached when old_snapshot_threshold = 0 (which has no real world
use, it's purely for testing).

I really don't know what to do here. The feature never worked and will
silently cause wrong query results. Fixing it seems like a pretty large
task - there's a lot more bugs. But ripping out a feature in stable
branches is pretty bad too.

Before figuring out the above, I spent the last several days trying to
make this feature work with my snapshot scalability patch. Trying to
avoid regressing old_snapshot_threshold behaviour / performance. But now
it seems to me that there's no actual working feature that can be
preserved.

I am really tired.

Andres.

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:

The problem is that there's no protection against the xids in the
ringbuffer getting old enough to wrap around. Given that practical uses
of old_snapshot_threshold are likely to be several hours to several
days, that's not particularly hard to hit.

Presumably this could be fixed by changing it to use FullTransactionId.

The problem, as far as I can tell, is that
oldSnapshotControl->head_timestamp appears to be intended to be the
oldest value in the ring. But we update it unconditionally in the "need
a new bucket, but it might not be the very next one" branch of
MaintainOldSnapshotTimeMapping().

I agree, that doesn't look right. It's correct, I think, for the "if
(advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
"else" case. In the "else" case, it should advance by 1 (wrapping if
needed) each time we take the "if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
the "else" branch for that if statement.

As far as I can tell, this code has been wrong since the feature has
been committed. The tests don't show a problem, because none of this
code is reached when old_snapshot_threshold = 0 (which has no real world
use, it's purely for testing).

I'm pretty sure I complained about the fact that only the
old_snapshot_threshold = 0 case was tested at the time this went in,
but I don't think Kevin was too convinced that we needed to do
anything else, and realistically, if he'd tried for a regression test
that ran for 15 minutes, Tom would've gone ballistic.

I really don't know what to do here. The feature never worked and will
silently cause wrong query results. Fixing it seems like a pretty large
task - there's a lot more bugs. But ripping out a feature in stable
branches is pretty bad too.

I don't know what other bugs there are, but the two you mention above
look fixable. Even if we decide that the feature can't be salvaged, I
would vote against ripping it out in back branches. I would instead
argue for telling people not to use it and ripping it out in master.
However, much as I'm not in love with all of the complexity this
feature adds, I don't see the problems you've reported here as serious
enough to justify ripping it out.

What exactly is the interaction of this patch with your snapshot
scalability work?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-03-31 23:40:08 -0700, Andres Freund wrote:

I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

Now attached.

Greetings,

Andres Freund

Attachments:

debug.diff (text/x-diff; charset=us-ascii), +57 −3
#4 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#2)
Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 10:01:07 -0400, Robert Haas wrote:

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:

The problem is that there's no protection against the xids in the
ringbuffer getting old enough to wrap around. Given that practical uses
of old_snapshot_threshold are likely to be several hours to several
days, that's not particularly hard to hit.

Presumably this could be fixed by changing it to use FullTransactionId.

That doesn't exist in all the back branches. Think it'd be easier to add
code to explicitly prune it during MaintainOldSnapshotTimeMapping().

The problem, as far as I can tell, is that
oldSnapshotControl->head_timestamp appears to be intended to be the
oldest value in the ring. But we update it unconditionally in the "need
a new bucket, but it might not be the very next one" branch of
MaintainOldSnapshotTimeMapping().

I agree, that doesn't look right. It's correct, I think, for the "if
(advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
"else" case. In the "else" case, it should advance by 1 (wrapping if
needed) each time we take the "if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
the "else" branch for that if statement.

Yea.

As far as I can tell, this code has been wrong since the feature has
been committed. The tests don't show a problem, because none of this
code is reached when old_snapshot_threshold = 0 (which has no real world
use, it's purely for testing).

I'm pretty sure I complained about the fact that only the
old_snapshot_threshold = 0 case was tested at the time this went in,
but I don't think Kevin was too convinced that we needed to do
anything else, and realistically, if he'd tried for a regression test
that ran for 15 minutes, Tom would've gone ballistic.

I think it's not just Tom that'd have gone ballistic. I think it's the
reason why, as I think is pretty clear, the feature was *never* actually
tested. The results of what's being removed are not quite random, but
it's not far from it. And there's long stretches of time where it never
removes things.

It's also a completely self-made problem.

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have divided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.

I really don't know what to do here. The feature never worked and will
silently cause wrong query results. Fixing it seems like a pretty large
task - there's a lot more bugs. But ripping out a feature in stable
branches is pretty bad too.

I don't know what other bugs there are, but the two you mention above
look fixable.

They probably are fixable. But there's a lot more, I think:

Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
update_ts threshold actually needs to be ts >= update_ts, right now we
don't handle being newer than the newest bin correctly afaict (mitigated
by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
to say, because there's no comments.

The whole lock nesting is very hazardous. Most (all?)
TestForOldSnapshot() calls happen with locks on buffers held, and can
acquire lwlocks itself. In some older branches we do entire *catalog
searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?

Even if we decide that the feature can't be salvaged, I would vote
against ripping it out in back branches. I would instead argue for
telling people not to use it and ripping it out in master.

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

I'm sure we can fix individual instances of problems. But I don't know
how one is supposed to verify that the fixes actually work. There's
currently no tests for the actual feature. And manual tests are painful
due to the multi-minute thresholds needed, and it's really hard to
manually verify that only the right rows are removed due to the feature,
and that all necessary errors are thrown. Given e.g. the bugs in my
email upthread, there's periods of several minutes where we'd not see
any row removed and then periods where the wrong ones would be removed,
so the manual tests would have to be repeated numerous times to actually
ensure anything.

If somebody wants to step up to the plate and fix these, it'd perhaps be
more realistic to say that we'll keep the feature. But even if somebody
does, I think it'd require a lot of development in the back branches. On
a feature whose purpose is to eat data that is still required.

I think even if we decide that we do not want to rip the feature out, we
should seriously consider hard disabling it in the backbranches. At
least I don't see how the fixed code is tested enough to be entrusted
with users data.

Do we actually have any evidence of this feature ever being used? I
didn't find much evidence for that in the archives (except Thomas
finding a problem). Given that it currently will switch between not
preventing bloat and causing wrong query results, without that being
noticed...

(*) perhaps I just am not understanding the protection however. To me
it's not at all clear what:

/*
* Failsafe protection against vacuuming work of active transaction.
*
* This is not an assertion because we avoid the spinlock for
* performance, leaving open the possibility that xlimit could advance
* and be more current; but it seems prudent to apply this limit. It
* might make pruning a tiny bit less aggressive than it could be, but
* protects against data loss bugs.
*/
	if (TransactionIdIsNormal(latest_xmin)
		&& TransactionIdPrecedes(latest_xmin, xlimit))
		xlimit = latest_xmin;

	if (NormalTransactionIdFollows(xlimit, recentXmin))
		return xlimit;

actually provides in the way of a protection.

However,
much as I'm not in love with all of the complexity this feature adds,
I don't see the problems you've reported here as serious enough to
justify ripping it out.

What exactly is the interaction of this patch with your snapshot
scalability work?

Post my work there's no precise RecentOldestXmin anymore (since
accessing the frequently changing xmin of other backends is what causes
a good chunk of the scalability issues). But heap_page_prune_opt() has
to determine what to use as the threshold for being able to prune dead
rows. Without snapshot_too_old we can initially rely on the known
boundaries to determine whether we can prune, and only determine an
"accurate" boundary when encountering a prune xid (or a tuple, but
that's an optimization) that falls in the range where we don't know for
certain we can prune. But that's not easy to do with the way the
old_snapshot_threshold stuff currently works.

It's not too hard to implement a crude version that just determines an
accurate xmin horizon whenever pruning with old_snapshot_threshold
set. But that seems like gimping performance for old_snapshot_threshold,
which didn't seem nice.

Additionally, the current implementation of snapshot_too_old is pretty
terrible about causing unnecessary conflicts when hot pruning. Even if
there was no need at all for the horizon to be limited to be able to
prune the page, or if there was nothing to prune on the page (note that
the limiting happens before checking if the space on the page even makes
pruning useful), we still cause a conflict for future accesses, because
TransactionIdLimitedForOldSnapshots() will
SetOldSnapshotThresholdTimestamp() to a recent timestamp.

Greetings,

Andres Freund

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:

I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

And the output turns out to be something like:

WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 644169619
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

WARNING: head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382

WARNING: old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 666899382
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

It's pretty obvious that the xids don't make a ton of sense, I think:
They're not monotonically ordered. The same values exist multiple times,
despite xids being constantly used. Also, despite the ringbuffer
supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
always add), and the workload having run for 14min, we only have 8
entries.

The function header comment for MaintainOldSnapshotTimeMapping could
hardly be more vague, as it's little more than a restatement of the
function name. However, it looks to me like the idea is that this
function might get called multiple times for the same or similar
values of whenTaken. I suppose that's the point of this code:

	else if (ts <= (oldSnapshotControl->head_timestamp +
					((oldSnapshotControl->count_used - 1)
					 * USECS_PER_MINUTE)))
	{
		/* existing mapping; advance xid if possible */
		int			bucket = (oldSnapshotControl->head_offset
							  + ((ts - oldSnapshotControl->head_timestamp)
								 / USECS_PER_MINUTE))
			% OLD_SNAPSHOT_TIME_MAP_ENTRIES;

		if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket],
								  xmin))
			oldSnapshotControl->xid_by_minute[bucket] = xmin;
	}

What I interpret this to be doing is saying - if we got a new call to
this function with a rounded-to-the-minute timestamp that we've seen
previously and for which we still have an entry, and if the XID passed
to this function is newer than the one passed by the previous call,
then advance the xid_by_minute[] bucket to the newer value. Now that
begs the question - what does this XID actually represent? The
comments don't seem to answer that question, not even the comments for
OldSnapshotControlData, which say that we should "Keep one xid per
minute for old snapshot error handling." but don't say which XIDs we
should keep or how they'll be used. However, the only call to
MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
that we call this function each time a new snapshot is taken and pass
the current time (modulo some fiddling) and snapshot xmin. Given that,
one would expect that any updates to the map would be tight races,
i.e. a bunch of processes that all took their snapshots right around
the same time would all update the same map entry in quick succession,
with the newest value winning.

And that makes the debugging output which I quoted from your message
above really confusing. At this point, the "head timestamp" is 7
minutes after this facility started up. The first entry we have is
for minute 7, and the last is for minute 14. But the one we're
updating is for minute 11. How the heck can that happen? I might
suspect that you'd stopped a process inside GetSnapshotData() with a
debugger, but that can't explain it either, because GetSnapshotData()
gets the xmin first and only afterwards gets the timestamp - so if
you'd stopped it for ~3 minutes just before the call to
MaintainOldSnapshotTimeMapping(), it would've been updating the map
with an *old* XID. In reality, though, it changed the XID from
644169619 to 666899382, advancing over 22 million XIDs. I don't
understand what's going on there. How is this function getting called
with a 4-minute old value of whenTaken?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#5)
Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 11:15:14 -0400, Robert Haas wrote:

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:

I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

And the output turns out to be something like:

WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 644169619
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

WARNING: head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382

WARNING: old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 666899382
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

It's pretty obvious that the xids don't make a ton of sense, I think:
They're not monotonically ordered. The same values exist multiple times,
despite xids being constantly used. Also, despite the ringbuffer
supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
always add), and the workload having run for 14min, we only have 8
entries.

The function header comment for MaintainOldSnapshotTimeMapping could
hardly be more vague, as it's little more than a restatement of the
function name. However, it looks to me like the idea is that this
function might get called multiple times for the same or similar
values of whenTaken. I suppose that's the point of this code:

Right. We enforce whenTaken to be monotonic
(cf. GetSnapshotCurrentTimestamp()), but since
GetSnapshotCurrentTimestamp() reduces the granularity of the timestamp
to one minute (the AlignTimestampToMinuteBoundary() call), it's
obviously possible to end up in the same bin as a previous snapshot.

What I interpret this to be doing is saying - if we got a new call to
this function with a rounded-to-the-minute timestamp that we've seen
previously and for which we still have an entry, and if the XID passed
to this function is newer than the one passed by the previous call,
then advance the xid_by_minute[] bucket to the newer value. Now that
begs the question - what does this XID actually represent? The
comments don't seem to answer that question, not even the comments for
OldSnapshotControlData, which say that we should "Keep one xid per
minute for old snapshot error handling." but don't say which XIDs we
should keep or how they'll be used. However, the only call to
MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
that we call this function each time a new snapshot is taken and pass
the current time (modulo some fiddling) and snapshot xmin. Given that,
one would expect that any updates to the map would be tight races,
i.e. a bunch of processes that all took their snapshots right around
the same time would all update the same map entry in quick succession,
with the newest value winning.

Right.

And that make the debugging output which I quoted from your message
above really confusing. At this point, the "head timestamp" is 7
minutes after this facility started up. The first we entry we have is
for minute 7, and the last is for minute 14. But the one we're
updating is for minute 11. How the heck can that happen?

If I understand what you're referencing correctly, I think that is
because, due to the bug, the "need a new bucket" branch doesn't just
extend the mapping by one bucket, it extends it by many in common
cases. Basically it fills buckets "into the future".

the advance = ... variable in the branch will not always be 1, even when
we continually call Maintain*. Here's some debug output showing that
(slightly modified from the patch I previously sent):

WARNING: old snapshot mapping at "before update" with head ts: 1, current entries: 2 max entries: 15, offset: 0
entry 0 (ring 0): min 1: xid 1089371384
entry 1 (ring 1): min 2: xid 1099553206

WARNING: head 1 min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 1109840204

WARNING: old snapshot mapping at "after update" with head ts: 3, current entries: 4 max entries: 15, offset: 0
entry 0 (ring 0): min 3: xid 1089371384
entry 1 (ring 1): min 4: xid 1099553206
entry 2 (ring 2): min 5: xid 1109840204
entry 3 (ring 3): min 6: xid 1109840204

Note how the two new buckets have the same xid, and how we're inserting
for "whenTaken 3 min", but we've filled the mapping up to minute 6.

I don't think the calculation of the 'advance' variable is correct as
is, even if we ignore the wrong setting of the head_timestamp variable.
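The effect is reproducible with a toy model of the buggy branch (hypothetical, simplified: minutes in place of timestamps, MAP_ENTRIES for OLD_SNAPSHOT_TIME_MAP_ENTRIES). Because head_min is overwritten with the incoming minute while 'advance' buckets are appended at the tail, the mapping ends up claiming minutes well past whenTaken:

```c
#include <assert.h>
#include <stdint.h>

#define MAP_ENTRIES 15

struct time_map
{
	uint32_t	xid_by_minute[MAP_ENTRIES];
	int			head_offset;
	int			head_min;		/* stands in for head_timestamp */
	int			count_used;
};

/*
 * Model of the buggy "need a new bucket" branch: head_min is set to the
 * incoming minute unconditionally, and 'advance' buckets are all filled
 * with the same xmin at the tail.
 */
static void
extend_map_buggy(struct time_map *map, int ts_min, uint32_t xmin)
{
	int			advance = ts_min - map->head_min;
	int			i;

	map->head_min = ts_min;		/* the unconditional update at issue */

	for (i = 0; i < advance; i++)
	{
		if (map->count_used == MAP_ENTRIES)
		{
			int			old_head = map->head_offset;

			map->head_offset = (old_head + 1) % MAP_ENTRIES;
			map->xid_by_minute[old_head] = xmin;
		}
		else
		{
			int			new_tail = (map->head_offset + map->count_used)
				% MAP_ENTRIES;

			map->count_used++;
			map->xid_by_minute[new_tail] = xmin;
		}
	}
}
```

Replaying the debug output above (head at minute 1, two entries, insert for minute 3) yields a map whose head claims minute 3 with four entries, i.e. minutes 3..6 — two duplicated xmin buckets "into the future", exactly as shown.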

Greetings,

Andres Freund

#7 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#4)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 11:09 AM Andres Freund <andres@anarazel.de> wrote:

That doesn't exist in all the back branches. Think it'd be easier to add
code to explicitly prune it during MaintainOldSnapshotTimeMapping().

That's reasonable.

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have divided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.

I am very doubtful that this approach would have been adequate. It
would mean that, with old_snapshot_threshold set to a week, the
threshold for declaring a snapshot "old" would jump forward 16.8 hours
at a time. It's hard for me to make a coherent argument right now as
to exactly what problems that would create, but it's not very
granular, and a lot of bloat-related things really benefit from more
granularity. I also don't really see what the problem with keeping a
bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
= ~10k buckets, isn't it? That's not really insane for an in-memory
data structure. I agree that the code that does that maintenance being
buggy is a problem to whatever extent that is the case, but writing
the code to have fewer buckets wouldn't necessarily have made it any
less buggy.

They probably are fixable. But there's a lot more, I think:

Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
update_ts threshold actually needs to be ts >= update_ts, right now we
don't handle being newer than the newest bin correctly afaict (mitigated
by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
to say, because there's no comments.

That test and the following one for "if (ts == update_ts)" both make
me nervous too. If only two of <, >, and = are expected, there should
be an Assert() to that effect, at least. If all three values are
expected then we need an explanation of why we're only checking for
equality.

The whole lock nesting is very hazardous. Most (all?)
TestForOldSnapshot() calls happen with locks on buffers held, and can
acquire lwlocks itself. In some older branches we do entire *catalog
searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

The catalog searches are clearly super-bad, but I'm not sure that the
other ones have a deadlock risk or anything. They might, but I think
we'd need some evidence of that.

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?

How so?

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

Sure, and what if you break more stuff ripping it out? Ripping this
volume of code out in a supposedly-stable branch is totally insane
almost no matter how broken the feature is. I also think, and we've
had this disagreement before, that you're far too willing to say
"well, that's wrong so we need to hit it with a nuke." I complained
when you added those error checks to vacuum in back-branches, and
since that release went out people are regularly tripping those checks
and taking prolonged outages for a problem that wasn't making them
unhappy before. I know that in theory those people are better off
because their database was always corrupted and now they know. But for
some of them, those prolonged outages are worse than the problem they
had before. I believe it was irresponsible to decide on behalf of our
entire user base that they were better off with such a behavior change
in a supposedly-stable branch, and I believe the same thing here.

I have no objection to the idea that *if* the feature is hopelessly
broken, it should be removed. But I don't have confidence at this
point that you've established that, and I think ripping out thousands
of lines of codes in the back-branches is terrible. Even
hard-disabling the feature in the back-branches without actually
removing the code is an awfully strong reaction, but it could be
justified if we find out that things are actually super-bad and not
really fixable. Actually removing the code is unnecessary, protects
nobody, and has risk.

Do we actually have any evidence of this feature ever being used? I
didn't find much evidence for that in the archives (except Thomas
finding a problem). Given that it currently will switch between not
preventing bloat and causing wrong query results, without that being
noticed...

I believe that at least one EnterpriseDB customer used it, and
possibly more than one. I am not sure how extensively, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#7)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 9:02 AM Robert Haas <robertmhaas@gmail.com> wrote:

I complained
when you added those error checks to vacuum in back-branches, and
since that release went out people are regularly tripping those checks
and taking prolonged outages for a problem that wasn't making them
unhappy before. I know that in theory those people are better off
because their database was always corrupted and now they know. But for
some of them, those prolonged outages are worse than the problem they
had before. I believe it was irresponsible to decide on behalf of our
entire user base that they were better off with such a behavior change
in a supposedly-stable branch, and I believe the same thing here.

I agreed with that decision, FWIW. Though I don't deny that there is
some merit in what you say. This is the kind of high level
philosophical question where large differences of opinion are quite
normal.

I don't think that it's fair to characterize Andres' actions in that
situation as in any way irresponsible. We had an extremely complicated
data corruption bug that he went to great lengths to fix, following
two other incorrect fixes. He was jet lagged from travelling to India
at the time. He went to huge lengths to make sure that the bug was
correctly squashed.

Actually removing the code is unnecessary, protects
nobody, and has risk.

Every possible approach has risk. We are deciding among several
unpleasant and risky alternatives here, no?

--
Peter Geoghegan

#9Robert Haas
robertmhaas@gmail.com
In reply to: Peter Geoghegan (#8)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 1:03 PM Peter Geoghegan <pg@bowt.ie> wrote:

I don't think that it's fair to characterize Andres' actions in that
situation as in any way irresponsible. We had an extremely complicated
data corruption bug that he went to great lengths to fix, following
two other incorrect fixes. He was jet lagged from travelling to India
at the time. He went to huge lengths to make sure that the bug was
correctly squashed.

I don't mean it as a personal attack on Andres, and I know and am glad
that he worked hard on the problem, but I don't agree that it was the
right decision. Perhaps "irresponsible" is the wrong word, but it's
certainly caused problems for multiple EnterpriseDB customers, and in
my view, those problems weren't necessary. Either a WARNING or an
ERROR would have shown up in the log, but an ERROR terminates VACUUM
for that table and thus basically causes autovacuum to be completely
broken. That is a really big problem. Perhaps you will want to argue,
as Andres did, that the value of having ERROR rather than WARNING in
the log justifies that outcome, but I sure don't agree.

Actually removing the code is unnecessary, protects
nobody, and has risk.

Every possible approach has risk. We are deciding among several
unpleasant and risky alternatives here, no?

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump.

In this case, I think it is pretty clear that hard-disabling the
feature by always setting old_snapshot_threshold to -1 carries less
risk of breaking unrelated things than removing code that caters to
the feature all over the code base. Perhaps it is not quite as
dramatic as my parachute example, but I think it is pretty clear all
the same that one is a lot more likely to introduce new bugs than the
other. A carefully targeted modification of a few lines of code in 1
file just about has to carry less risk than ~1k lines of code spread
across 40 or so files.

However, I still think that without some more analysis, it's not clear
whether we should go this direction at all. Andres's results suggest
that there are some bugs here, but I think we need more senior hackers
to study the situation before we make a decision about what to do
about them. I certainly haven't had enough time to even fully
understand the problems yet, and nobody else has posted on that topic
at all. I have the highest respect for Andres and his technical
ability, and if he says this stuff has problems, I'm sure it does. Yet
I'm not willing to conclude that because he's tired and frustrated
with this stuff right now, it's unsalvageable. For the benefit of the
whole community, such a claim deserves scrutiny from multiple people.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#7)
Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:

On Wed, Apr 1, 2020 at 11:09 AM Andres Freund <andres@anarazel.de> wrote:

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have divided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.

I am very doubtful that this approach would have been adequate. It
would mean that, with old_snapshot_threshold set to a week, the
threshold for declaring a snapshot "old" would jump forward 1.68 hours
at a time. It's hard for me to make a coherent argument right now as
to exactly what problems that would create, but it's not very
granular, and a lot of bloat-related things really benefit from more
granularity. I also don't really see what the problem with keeping a
bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
= ~10k buckets, isn't it? That's not really insane for an in-memory
data structure. I agree that the code that does that maintenance being
buggy is a problem to whatever extent that is the case, but writing
the code to have fewer buckets wouldn't necessarily have made it any
less buggy.

My issue isn't really that it's too many buckets right now, but that it
doesn't scale down to smaller thresholds. I think to be able to develop
this reasonably, it'd need to be able to support thresholds in the
sub-second range. And I don't see how you can have the same binning for
such small thresholds, and for multi-day thresholds - we'd quickly go to
millions of buckets for longer thresholds.

I really think we'd need to support millisecond resolution to make this
properly testable.

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?

How so?

Because it returns the end of the reserved WAL, not how far we've
actually inserted. I.e. there can be in-progress, but not finished,
modifications that will have an LSN < GetXLogInsertRecPtr(). But the
whenTaken timestamp could reflect one that should throw an error for
these in-progress modifications (since the transaction limiting happens
before the WAL logging).

I am not 100% sure, but I suspect that that could lead to errors not being
thrown that should be, because TestForOldSnapshot() will not see these
in-progress modifications as conflicting.

Hm, also, shouldn't
&& PageGetLSN(page) > (snapshot)->lsn)
in TestForOldSnapshot() be an >=?

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

Sure, and what if you break more stuff ripping it out? Ripping this
volume of code out in a supposedly-stable branch is totally insane
almost no matter how broken the feature is.

For the backbranches I was just thinking of forcing the GUC to be off
(either by disallowing it to be set to on, or just warning when its set
to true, but not propagating the value).

I have no objection to the idea that *if* the feature is hopelessly
broken, it should be removed.

I would be a lot less inclined to go that way if old_snapshot_threshold

a) weren't explicitly about removing still-needed data - in contrast to
a lot of other features, where the effects of bugs is temporary, here
it can be much larger.
b) were a previously working feature, but as far as I can tell, it never really did
c) had tests that verify that my fixes actually do the right thing. As
it stands, I'd not just have to fix the bugs, I'd also have to develop
a test framework that can test this

While I wish I had been more forceful, and reviewed more of the code to
point out more of the quality issues, I did argue hard against the
feature going in. On account of it being architecturally bad and
impactful. Which I think it has proven to be several times over by
now. And now I'm kind of on the hook to fix it, it seems?

I also think, and we've had this disagreement before, that you're far
too willing to say "well, that's wrong so we need to hit it with a
nuke." I complained when you added those error checks to vacuum in
back-branches, and since that release went out people are regularly
tripping those checks and taking prolonged outages for a problem that
wasn't making them unhappy before. I know that in theory those people
are better off because their database was always corrupted and now
they know. But for some of them, those prolonged outages are worse
than the problem they had before.

I think this is somewhat revisionist. Sure, the errors were added
after like the 10th data corruption bug around freezing that we didn't
find for a long time, because of the lack of errors being thrown. But
the error checks weren't primarily added to find further bugs, but to
prevent data loss due to the fixed bug. Of which we had field reports.

I'd asked over *weeks* for reviews of the bug fixes. Not a single person
expressed concerns about throwing new errors at that time. First version
of the patches with the errors:
/messages/by-id/20171114030341.movhteyakqeqx5pm@alap3.anarazel.de
I pushed them over a month later
/messages/by-id/20171215023059.oeyindn57oeis5um@alap3.anarazel.de

There also wasn't (and isn't) a way to just report back that we can't
currently freeze the individual page, without doing major surgery. And
even if there were, what are we supposed to do other than throw an error?
We need to remove tuples below relfrozenxid, or we corrupt the table.

As I've first asked before when you complained about those errors: What
was the alternative? Just have invisible tuples reappear? Delete them? I
don't think you've ever answered that.

You brought this up as an example for me being over-eager with errors
checks before. But I don't see how that meshes with the history visible
in the thread referenced above.

The more general issue, about throwing errors, is not just about the
people that don't give a hoot about whether their data evolves on its
own (perhaps a good tradeoff for them). Not throwing errors affects
*everyone*. Some people do care about their data. Without errors we
never figure out that we screwed up. And long-term, even the people
that care much more about availability than data loss, benefit from the
whole system getting more robust.

We've since found numerous further data corrupting bugs because of the
relfrozenxid checks. Some of very long standing vintage. Some in newly
added code.

Yes, hypothetically, I'd argue for introducing the checks solely for the
sake of finding bugs. Even if I were prescient enough to foresee the number of
issues caused (although I'd add block numbers to the error message from
the get go, knowing that). But I'd definitely not do so in the back
branches.

I believe it was irresponsible to decide on behalf of our entire user
base that they were better off with such a behavior change in a
supposedly-stable branch, and I believe the same thing here.

As I explained above, I don't think that's fair with regard to the
relfrozenxid errors. Setting that aside:

In these discussions you always seem to only argue for the people that
don't care about their data. But, uh, a lot of people do - should we
just silently eat their data? And the long-term quality of the project
gets a lot better by throwing errors, because it actually allows us to
fix them.

As far as I can tell we couldn't even have added the checks to master,
back then, if we follow your logic: A lot of the reports about hitting
the errors were with 11+ (partially due to pg_upgrade, partially because
they detected other bugs).

The likelihood of hurting people by adding checks at a later point would
be a lot lower, if we stopped adding code that ignores errors silently
and hoping for the best. But we keep adding such "lenient" code.

We just found another long-standing cause of data corruption, which
should have been found earlier if we had errors, or at least warnings,
btw. The locking around vac_update_datfrozenxid() has been broken for a
long long time, but the silent 'if (bogus) return' made it very hard to
find.
/messages/by-id/20200323235036.6pje6usrjjx22zv3@alap3.anarazel.de

Also, I've recently seen a number of databases being eaten because we
just ignore our own WAL logging rules to avoid throwing hard enough
errors (RelationTruncate() WAL logging the truncation outside of a
critical section - oops if you hit it, your primary and replicas/backups
diverge, among many other bad consequences).

Greetings,

Andres Freund

#11Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#9)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote:

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump.

In this case, I think it is pretty clear that hard-disabling the
feature by always setting old_snapshot_threshold to -1 carries less
risk of breaking unrelated things than removing code that caters to
the feature all over the code base. Perhaps it is not quite as
dramatic as my parachute example, but I think it is pretty clear all
the same that one is a lot more likely to introduce new bugs than the
other. A carefully targeted modification of a few lines of code in 1
file just about has to carry less risk than ~1k lines of code spread
across 40 or so files.

Yeah, that's certainly true. But is that fine point really what
anybody disagrees about? I didn't think that Andres was focussed on
literally ripping it out over just disabling it.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

I had a few other things that I was going to work on this week, but
those seem less urgent. I'll take a look into it, and report back
what I find.

--
Peter Geoghegan

#12Andres Freund
andres@anarazel.de
In reply to: Peter Geoghegan (#11)
Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 11:04:43 -0700, Peter Geoghegan wrote:

On Wed, Apr 1, 2020 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote:

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

+1

This definitely needs more eyes. I am not even close to understanding
the code fully.

I had a few other things that I was going to work on this week, but
those seem less urgent. I'll take a look into it, and report back
what I find.

Thank you!

I attached a slightly evolved version of my debugging patch.

Greetings,

Andres Freund

Attachments:

v2-debug.diff (text/x-diff; charset=us-ascii), +76/-3
#13Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#9)
Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 13:27:56 -0400, Robert Haas wrote:

Perhaps "irresponsible" is the wrong word, but it's certainly caused
problems for multiple EnterpriseDB customers, and in my view, those
problems weren't necessary. Either a WARNING or an ERROR would have
shown up in the log, but an ERROR terminates VACUUM for that table and
thus basically causes autovacuum to be completely broken. That is a
really big problem. Perhaps you will want to argue, as Andres did,
that the value of having ERROR rather than WARNING in the log
justifies that outcome, but I sure don't agree.

If that had been a really viable option, I would have done so. At the
very least in the back branches, but quite possibly also in master. Or
if somebody had brought them up as an issue at the time.

What is heap_prepare_freeze_tuple/FreezeMultiXactId supposed to do after
issuing a WARNING in these cases? Without the ERROR, e.g.,

    if (!TransactionIdDidCommit(xid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
                                 xid, cutoff_xid)));

would make a deleted tuple visible.

    if (TransactionIdPrecedes(xid, relfrozenxid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found xmin %u from before relfrozenxid %u",
                                 xid, relfrozenxid)));

would go on to replace the xmin of a potentially uncommitted tuple with
relfrozenxid, making it appear visible.

    if (TransactionIdPrecedes(xid, relfrozenxid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found xmax %u from before relfrozenxid %u",
                                 xid, relfrozenxid)));

would replace the xmax indicating a potentially deleted tuple with ?,
either making the tuple, potentially wrongly, visible or invisible,

or
    else if (MultiXactIdPrecedes(multi, relminmxid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found multixact %u from before relminmxid %u",
                                 multi, relminmxid)));
or ...

Just continuing is easier said than done. Especially with the background
of knowing that several users had hit the bug that allowed all of the
above to be hit, and that advancing relfrozenxid further would make it
worse.

Greetings,

Andres Freund

#14Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#13)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 2:37 PM Andres Freund <andres@anarazel.de> wrote:

Just continuing is easier said than done. Especially with the background
of knowing that several users had hit the bug that allowed all of the
above to be hit, and that advancing relfrozenxid further would make it
worse.

Fair point, but it seems we're arguing over nothing here, or at least
nothing relevant to this thread, because it sounds like if we are
going to disable it, you're OK with doing that by just shutting off
the code rather than trying to remove it all. I had the opposite
impression from your first email.

Sorry to have derailed the thread, and for my poor choice of words.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andres Freund (#4)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:

First off, many thanks to Andres for investigating this, and apologies for
the bugs. Also thanks to Michael for making sure I saw the thread. I must
also apologize for not being able to track the community lists
consistently due to health issues that are exacerbated by stress, and the
fact that these lists often push past my current limits. I'll try to help
in this as best I can.

Do we actually have any evidence of this feature ever being used? I
didn't find much evidence for that in the archives (except Thomas
finding a problem).

This was added because a very large company trying to convert from Oracle
had a test that started to show some slowdown on PostgreSQL after 8 hours,
serious slowdown by 24 hours, and crashed hard before it could get to 48
hours -- due to lingering WITH HOLD cursors left by ODBC code. They had
millions of lines of code that would need to be rewritten without this
feature. With this feature (set to 20 minutes, if I recall correctly),
their unmodified code ran successfully for at least three months solid
without failure or corruption. Last I heard, they were converting a large
number of instances from Oracle to PostgreSQL, and those would all fail
hard within days of running with this feature removed or disabled.

Also, VMware is using PostgreSQL as an embedded part of many products, and
this feature was enabled to deal with similar failures due to ODBC cursors;
so the number of instances running 24/7 under high load which have shown a
clear benefit from enabling this feature has a lot of zeros.

Perhaps the lack of evidence for usage in the archives indicates a low
frequency of real-world failures due to the feature, rather than lack of
use? I'm not doubting that Andres found real issues that should be fixed,
but perhaps not very many people who are using the feature have more than
two billion transactions within the time threshold, and perhaps the other
problems are not as big as the problems solved by use of the feature -- at
least in some cases.

To save readers who have not yet done the math some effort, at the 20
minute threshold used by the initial user, they would need to have a
sustained rate of consumption of transaction IDs of over 66 million per
second to experience wraparound problems, and at the longest threshold I
have seen it would need to exceed an average of 461,893 TPS for three days
solid to hit wraparound. Those aren't impossible rates to hit, but in
practice it might not be a frequent occurrence yet on modern hardware with
some real-world applications. Hopefully we can find a way to fix this
before those rates become common.

I am reviewing the issue and patches now, and hope I can make some useful
contribution to the discussion.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

#16Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#15)
Re: snapshot too old issues, first around wraparound and then more.

Hi,

Nice to have you back for a bit! Even if the circumstances aren't
great...

It's very understandable that the lists are past your limits, I barely
keep up these days. Without any health issues.

On 2020-04-01 14:10:09 -0500, Kevin Grittner wrote:

Perhaps the lack of evidence for usage in the archives indicates a low
frequency of real-world failures due to the feature, rather than lack of
use? I'm not doubting that Andres found real issues that should be fixed,
but perhaps not very many people who are using the feature have more than
two billion transactions within the time threshold, and perhaps the other
problems are not as big as the problems solved by use of the feature -- at
least in some cases.

To save readers who have not yet done the math some effort, at the 20
minute threshold used by the initial user, they would need to have a
sustained rate of consumption of transaction IDs of over 66 million per
second to experience wraparound problems, and at the longest threshold I
have seen it would need to exceed an average of 461,893 TPS for three days
solid to hit wraparound. Those aren't impossible rates to hit, but in
practice it might not be a frequent occurrence yet on modern hardware with
some real-world applications. Hopefully we can find a way to fix this
before those rates become common.

The wraparound issue on its own wouldn't be that bad - when I found it
I did play around with a few ideas for how to fix it. The most practical
would probably be to have MaintainOldSnapshotTimeMapping() scan all
buckets when a new oldSnapshotControl->oldest_xid is older than
RecentGlobalXmin. There's no benefit in the contents of those buckets
anyway, since we know that we can freeze those independent of
old_snapshot_threshold.

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error). And I've
reproduced rows not getting removed for much longer than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
long-running queries/transactions querying immutable or insert only
data. Those would not notice that, on other tables, rows are getting
removed, where access would not trigger the required error.

b) I observe long-ish phases where no cleanup is happening (due to
oldSnapshotControl->head_timestamp getting updated more often than
correct). But if old_snapshot_threshold is small enough in relation to
the time the generated bloat becomes problematic, there will still be
occasions to actually perform cleanup.

Greetings,

Andres Freund

#17Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andres Freund (#16)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error).

That is a very big problem. On the sort-of bright side (ironic in light of
the fact that I'm a big proponent of using serializable transactions), none
of the uses that I have personally seen of this feature use anything other
than the default READ COMMITTED isolation level. That might help explain
the lack of complaints for those using the feature. But yeah, I REALLY
want to see a solid fix for that!

And I've
reproduced rows not getting removed for much longer than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
long-running queries/transactions querying immutable or insert only
data. Those would not notice that, on other tables, rows are getting
removed, where access would not trigger the required error.

b) I observe long-ish phases where no cleanup is happening (due to
oldSnapshotControl->head_timestamp getting updated more often than
correct). But if old_snapshot_threshold is small enough in relation to
the time the generated bloat becomes problematic, there will still be
occasions to actually perform cleanup.

Keep in mind that the real goal of this feature is not to eagerly _see_
"snapshot too old" errors, but to prevent accidental debilitating bloat due
to one misbehaving user connection. This is particularly easy to see (and
therefore unnervingly common) for those using ODBC, which in my experience
tends to correspond to the largest companies which are using PostgreSQL.
In some cases, the snapshot which is preventing removal of the rows will
never be used again; removal of the rows will not actually affect the
result of any query, but only the size and performance of the database.
This is a "soft limit" -- kinda like max_wal_size. Where there was a
trade-off between accuracy of the limit and performance, the less accurate
way was intentionally chosen. I apologize for not making that more clear
in comments.

While occasional "snapshot too old" errors are an inconvenient side effect
of achieving the primary goal, it might be of interest to know that the
initial (very large corporate) user of this feature had, under Oracle,
intentionally used a cursor that would be held open as long as a user chose
to leave a list open for scrolling around. They used cursor features for
as long as the cursor allowed. This could be left open for days or weeks
(or longer?). Their query ordered by a unique index, and tracked the ends
of the currently displayed portion of the list so that if they happened to
hit the "snapshot too old" error they could deallocate and restart the
cursor and reposition before moving forward or back to the newly requested
rows. They were not willing to convert to PostgreSQL unless this approach
continued to work.

In Summary:
(1) It's not urgent that rows always be removed as soon as possible after
the threshold is crossed as long as they don't often linger too awfully far
past that limit and allow debilitating bloat.
(2) It _is_ a problem if results inconsistent with the snapshot are
returned -- a "snapshot too old" error is necessary.
(3) Obviously, wraparound problems need to be solved.

I hope this is helpful.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

#18Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#16)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error). And I've
reproduced rows not getting removed for much longer than they
should, according to old_snapshot_threshold.

I think it would be a good idea to add a system view that shows the
contents of the mapping. We could make it a contrib module, if you
like, so that it can even be installed on back branches. We'd need to
move the structure definition from snapmgr.c to a header file, but
that doesn't seem like such a big deal.

Maybe that contrib module could even have some functions to simulate
aging without the passage of any real time. Like, say you have a
function or procedure old_snapshot_pretend_time_has_passed(integer),
and it moves oldSnapshotControl->head_timestamp backwards by that
amount. Maybe that would require updating some other fields in
oldSnapshotControl too but it doesn't seem like we'd need to do a
whole lot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#17)
Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 15:11:52 -0500, Kevin Grittner wrote:

On Wed, Apr 1, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error).

That is a very big problem. On the sort-of bright side (ironic in light of
the fact that I'm a big proponent of using serializable transactions), none
of the uses that I have personally seen of this feature use anything other
than the default READ COMMITTED isolation level. That might help explain
the lack of complaints for those using the feature. But yeah, I REALLY
want to see a solid fix for that!

I don't think it's dependent on RR - it's just a bit easier to verify
that the query results are wrong that way.

And I've
reproduced rows not getting removed for much longer than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
long-running queries/transactions querying immutable or insert only
data. Those would not notice that, on other tables, rows are getting
removed, where access would not trigger the required error.

b) I observe long-ish phases where no cleanup is happening (due to
oldSnapshotControl->head_timestamp getting updated more often than
correct). But if old_snapshot_threshold is small enough in relation to
the time the generated bloat becomes problematic, there will still be
occasions to actually perform cleanup.

Keep in mind that the real goal of this feature is not to eagerly _see_
"snapshot too old" errors, but to prevent accidental debilitating bloat due
to one misbehaving user connection.

I don't think it's an "intentional" inaccuracy issue leading to
this. The map contents are just wrong; in particular, the head_timestamp
most of the time is so new that TransactionIdLimitedForOldSnapshots()
never even consults the mapping. When filling a new bucket,
MaintainOldSnapshotThreshold() unconditionally updates
oldSnapshotControl->head_timestamp to be the current minute, which means
it'll take old_snapshot_threshold minutes till
TransactionIdLimitedForOldSnapshots() even looks at the mapping again.

As far as I can tell, with a large old_snapshot_threshold, it can take a
very long time to get to a head_timestamp that's old enough for
TransactionIdLimitedForOldSnapshots() to do anything. Look at this
trace of a pgbench run with old_snapshot_threshold enabled, showing some of
the debugging output added in the patch upthread.

This is with a threshold of 10min, in a freshly started database:

2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING: head 0 min: filling 1 buckets starting at 0 for whenTaken 1 min, with xmin 2068881994
2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING: old snapshot mapping at "after update" with head ts: 1, current entries: 2 max entries: 20, offset: 0
entry 0 (ring 0): min 1: xid 2068447214
entry 1 (ring 1): min 2: xid 2068881994

2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: old snapshot mapping at "before update" with head ts: 1, current entries: 2 max entries: 20, offset: 0
entry 0 (ring 0): min 1: xid 2068447214
entry 1 (ring 1): min 2: xid 2068881994

2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: head 1 min: updating existing bucket 1 for whenTaken 2 min, with xmin 2069199511
2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: old snapshot mapping at "after update" with head ts: 1, current entries: 2 max entries: 20, offset: 0
entry 0 (ring 0): min 1: xid 2068447214
entry 1 (ring 1): min 2: xid 2069199511

2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: old snapshot mapping at "before update" with head ts: 1, current entries: 2 max entries: 20, offset: 0
entry 0 (ring 0): min 1: xid 2068447214
entry 1 (ring 1): min 2: xid 2069199511

2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: head 1 min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 2069516499
2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: old snapshot mapping at "after update" with head ts: 3, current entries: 4 max entries: 20, offset: 0
entry 0 (ring 0): min 3: xid 2068447214
entry 1 (ring 1): min 4: xid 2069199511
entry 2 (ring 2): min 5: xid 2069516499
entry 3 (ring 3): min 6: xid 2069516499
...
2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 20, offset: 0
entry 0 (ring 0): min 7: xid 2068447214
entry 1 (ring 1): min 8: xid 2071112480
entry 2 (ring 2): min 9: xid 2071434473
entry 3 (ring 3): min 10: xid 2071755177
entry 4 (ring 4): min 11: xid 2072075827
entry 5 (ring 5): min 12: xid 2072395700
entry 6 (ring 6): min 13: xid 2072715464
entry 7 (ring 7): min 14: xid 2073035816

Before the mapping change the database had been running for 15
minutes, but the mapping starts only at 7 minutes from start. It is
then updated to

2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: head 7 min: filling 8 buckets starting at 0 for whenTaken 15 min, with xmin 2075918093
2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: old snapshot mapping at "after update" with head ts: 15, current entries: 16 max entries: 20, offset: 0
entry 0 (ring 0): min 15: xid 2068447214
entry 1 (ring 1): min 16: xid 2071112480
entry 2 (ring 2): min 17: xid 2071434473
entry 3 (ring 3): min 18: xid 2071755177
entry 4 (ring 4): min 19: xid 2072075827
entry 5 (ring 5): min 20: xid 2072395700
entry 6 (ring 6): min 21: xid 2072715464
entry 7 (ring 7): min 22: xid 2073035816
entry 8 (ring 8): min 23: xid 2075918093
entry 9 (ring 9): min 24: xid 2075918093
entry 10 (ring 10): min 25: xid 2075918093
entry 11 (ring 11): min 26: xid 2075918093
entry 12 (ring 12): min 27: xid 2075918093
entry 13 (ring 13): min 28: xid 2075918093
entry 14 (ring 14): min 29: xid 2075918093
entry 15 (ring 15): min 30: xid 2075918093

be considered as having started in that moment. And we expand the size
of the mapping by 8 buckets at the same time, filling the new buckets
with the same xid, despite there being a continuous workload.

After a few more minutes we get:

2020-04-01 14:07:00.000 PDT [1268503][3/1473617:2077202085] WARNING: head 15 min: updating existing bucket 4 for whenTaken 19 min, with xmin 2077202085
2020-04-01 14:07:00.000 PDT [1268503][3/1473617:2077202085] WARNING: old snapshot mapping at "after update" with head ts: 15, current entries: 16 max entries: 20, offset: 0
entry 0 (ring 0): min 15: xid 2068447214
entry 1 (ring 1): min 16: xid 2076238895
entry 2 (ring 2): min 17: xid 2076559154
entry 3 (ring 3): min 18: xid 2076880731
entry 4 (ring 4): min 19: xid 2077202085
entry 5 (ring 5): min 20: xid 2072395700
entry 6 (ring 6): min 21: xid 2072715464
entry 7 (ring 7): min 22: xid 2073035816
entry 8 (ring 8): min 23: xid 2075918093
entry 9 (ring 9): min 24: xid 2075918093
entry 10 (ring 10): min 25: xid 2075918093
entry 11 (ring 11): min 26: xid 2075918093
entry 12 (ring 12): min 27: xid 2075918093
entry 13 (ring 13): min 28: xid 2075918093
entry 14 (ring 14): min 29: xid 2075918093
entry 15 (ring 15): min 30: xid 2075918093

Note how the xids are not monotonically ordered. And how IsLimited still
won't be able to make use of the mapping, as the head timestamp is only
4 minutes old (whenTaken == 19 min, head == 15 min).

Greetings,

Andres Freund

#20Peter Geoghegan
pg@bowt.ie
In reply to: Robert Haas (#18)
Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote:

Maybe that contrib module could even have some functions to simulate
aging without the passage of any real time. Like, say you have a
function or procedure old_snapshot_pretend_time_has_passed(integer),
and it moves oldSnapshotControl->head_timestamp backwards by that
amount. Maybe that would require updating some other fields in
oldSnapshotControl too but it doesn't seem like we'd need to do a
whole lot.

I like that idea. I think that I've spotted what may be an independent
bug, but I have to wait around for a minute or two to reproduce it
each time. Makes it hard to get to a minimal test case.

--
Peter Geoghegan
