snapshot too old issues, first around wraparound and then more.

robertmhaas@gmail.com

almost 6 years ago

In reply to: Andres Freund (#1)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:

The problem is that there's no protection again the xids in the
ringbuffer getting old enough to wrap around. Given that practical uses
of old_snapshot_threshold are likely to be several hours to several
days, that's not particularly hard to hit.

Presumably this could be fixed by changing it to use FullTransactionId.

The problem, as far as I can tell, is that
oldSnapshotControl->head_timestamp appears to be intended to be the
oldest value in the ring. But we update it unconditionally in the "need
a new bucket, but it might not be the very next one" branch of
MaintainOldSnapshotTimeMapping().

I agree, that doesn't look right. It's correct, I think, for the "if
(advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
"else" case. In the "else" case, it should advance by 1 (wrapping if
needed) each time we take the "if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
the "else" branch for that if statement.

As far as I can tell, this code has been wrong since the feature has
been committed. The tests don't show a problem, because none of this
code is reached when old_snapshot_threshold = 0 (which has no real world
use, it's purely for testing).

I'm pretty sure I complained about the fact that only the
old_snapshot_threshold = 0 case was tested at the time this went in,
but I don't think Kevin was too convinced that we needed to do
anything else, and realistically, if he'd tried for a regression test
that ran for 15 minutes, Tom would've gone ballistic.

I really don't know what to do here. The feature never worked and will
silently cause wrong query results. Fixing it seems like a pretty large
task - there's a lot more bugs. But ripping out a feature in stable
branches is pretty bad too.

I don't know what other bugs there are, but the two you mention above
look fixable. Even if we decide that the feature can't be salvaged, I
would vote against ripping it out in back branches. I would instead
argue for telling people not to use it and ripping it out in master.
However, much as I'm not in love with all of the complexity this
feature adds, I don't see the problems you've reported here as serious
enough to justify ripping it out.

What exactly is the interaction of this patch with your snapshot
scalability work?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

andres@anarazel.de

almost 6 years ago

In reply to: Andres Freund (#1)

1 attachment(s)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-03-31 23:40:08 -0700, Andres Freund wrote:

I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

Now attached.

Greetings,

Andres Freund

Attachments:

debug.difftext/x-diff; charset=us-asciiDownload

diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 1c063c592ce..cbfc462ee26 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@ typedef struct OldSnapshotControlData
 	int			head_offset;	/* subscript of oldest tracked time */
 	TimestampTz head_timestamp; /* time corresponding to head xid */
 	int			count_used;		/* how many slots are in use */
+	TimestampTz starttime; // rounded to minute
 	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
 } OldSnapshotControlData;
 
@@ -290,6 +291,8 @@ SnapMgrInit(void)
 		oldSnapshotControl->head_offset = 0;
 		oldSnapshotControl->head_timestamp = 0;
 		oldSnapshotControl->count_used = 0;
+		oldSnapshotControl->starttime =
+			AlignTimestampToMinuteBoundary(GetCurrentTimestamp());
 	}
 }
 
@@ -1762,6 +1765,39 @@ SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+static void PrintOldSnapshotMapping(const char *head, bool already_locked)
+{
+	StringInfoData s;
+
+	initStringInfo(&s);
+
+	if (!already_locked)
+		LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+
+	appendStringInfo(&s, "old snapshot mapping at \"%s\" with head ts: %lu, current entries: %d max entries: %d, offset: %d\n",
+					 head,
+					 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+					 oldSnapshotControl->count_used,
+					 OLD_SNAPSHOT_TIME_MAP_ENTRIES,
+					 oldSnapshotControl->head_offset);
+
+	for (int off = 0; off < oldSnapshotControl->count_used; off++)
+	{
+		int ringoff = (off + oldSnapshotControl->head_offset) % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		appendStringInfo(&s, "  entry %d (ring %d): min %ld: xid %d\n",
+						 off, ringoff,
+						 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE + off,
+						 oldSnapshotControl->xid_by_minute[ringoff]);
+	}
+
+	if (!already_locked)
+		LWLockRelease(OldSnapshotTimeMapLock);
+
+	elog(WARNING, "%s", s.data);
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
@@ -1824,6 +1860,7 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 
 		if (!same_ts_as_threshold)
 		{
+			PrintOldSnapshotMapping("non cached limit", false);
 			if (ts == update_ts)
 			{
 				xlimit = latest_xmin;
@@ -1923,14 +1960,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	 */
 	if (whenTaken < 0)
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with negative whenTaken = %ld",
 			 (long) whenTaken);
 		return;
 	}
 	if (!TransactionIdIsNormal(xmin))
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with xmin = %lu",
 			 (unsigned long) xmin);
 		return;
@@ -1938,6 +1975,8 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 
 	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
 
+	PrintOldSnapshotMapping("before update", true);
+
 	Assert(oldSnapshotControl->head_offset >= 0);
 	Assert(oldSnapshotControl->head_offset < OLD_SNAPSHOT_TIME_MAP_ENTRIES);
 	Assert((oldSnapshotControl->head_timestamp % USECS_PER_MINUTE) == 0);
@@ -1956,7 +1995,7 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	{
 		/* old ts; log it at DEBUG */
 		LWLockRelease(OldSnapshotTimeMapLock);
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with old whenTaken = %ld",
 			 (long) whenTaken);
 		return;
@@ -1971,6 +2010,12 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 								 / USECS_PER_MINUTE))
 		% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
 
+		elog(WARNING, "head %ld s: updating existing bucket %d for sec %ld with xmin %u",
+			 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 bucket,
+			 (ts - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 xmin);
+
 		if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket], xmin))
 			oldSnapshotControl->xid_by_minute[bucket] = xmin;
 	}
@@ -1980,6 +2025,13 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 		int			advance = ((ts - oldSnapshotControl->head_timestamp)
 							   / USECS_PER_MINUTE);
 
+		elog(WARNING, "head %ld s: filling %d buckets starting at %d for sec %ld with xmin %u",
+			 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 advance,
+			 oldSnapshotControl->head_offset,
+			 (ts - oldSnapshotControl->starttime) / USECS_PER_SEC,
+			 xmin);
+
 		oldSnapshotControl->head_timestamp = ts;
 
 		if (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)
@@ -2021,6 +2073,8 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 		}
 	}
 
+	PrintOldSnapshotMapping("after update", true);
+
 	LWLockRelease(OldSnapshotTimeMapLock);
 }

andres@anarazel.de

almost 6 years ago

In reply to: Robert Haas (#2)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 10:01:07 -0400, Robert Haas wrote:

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:

The problem is that there's no protection again the xids in the
ringbuffer getting old enough to wrap around. Given that practical uses
of old_snapshot_threshold are likely to be several hours to several
days, that's not particularly hard to hit.

Presumably this could be fixed by changing it to use FullTransactionId.

That doesn't exist in all the back branches. Think it'd be easier to add
code to explicitly prune it during MaintainOldSnapshotTimeMapping().

The problem, as far as I can tell, is that
oldSnapshotControl->head_timestamp appears to be intended to be the
oldest value in the ring. But we update it unconditionally in the "need
a new bucket, but it might not be the very next one" branch of
MaintainOldSnapshotTimeMapping().

I agree, that doesn't look right. It's correct, I think, for the "if
(advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
"else" case. In the "else" case, it should advance by 1 (wrapping if
needed) each time we take the "if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
the "else" branch for that if statement.

Yea.

As far as I can tell, this code has been wrong since the feature has
been committed. The tests don't show a problem, because none of this
code is reached when old_snapshot_threshold = 0 (which has no real world
use, it's purely for testing).

I'm pretty sure I complained about the fact that only the
old_snapshot_threshold = 0 case was tested at the time this went in,
but I don't think Kevin was too convinced that we needed to do
anything else, and realistically, if he'd tried for a regression test
that ran for 15 minutes, Tom would've gone ballistic.

I think it's not just Tom that'd have gone ballistic. I think it's the
reason why, as I think is pretty clear, the feature was *never* actually
tested. The results of whats being removed are not quite random, but
it's not far from it. And there's long stretches of time where it never
removes things.

It's also a completely self-made problem.

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have didided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.

I really don't know what to do here. The feature never worked and will
silently cause wrong query results. Fixing it seems like a pretty large
task - there's a lot more bugs. But ripping out a feature in stable
branches is pretty bad too.

I don't know what other bugs there are, but the two you mention above
look fixable.

They probably are fixable. But there's a lot more, I think:

Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
update_ts threshold actually needs to be ts >= update_ts, right now we
don't handle being newer than the newest bin correctly afaict (mitigated
by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
to say, because there's no comments.

The whole lock nesting is very hazardous. Most (all?)
TestForOldSnapshot() calls happen with locks on on buffers held, and can
acquire lwlocks itself. In some older branches we do entire *catalog
searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?

Even if we decide that the feature can't be salvaged, I would vote
against ripping it out in back branches. I would instead argue for
telling people not to use it and ripping it out in master.

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

I'm sure we can fix individual instances of problems. But I don't know
how one is supposed to verify that the fixes actually work. There's
currently no tests for the actual feature. And manual tests are painful
due to the multi-minute thresholds needed, and it's really hard to
manually verify that only the right rows are removed due to the feature,
and that all necessary errors are thrown. Given e.g. the bugs in my
email upthread, there's periods of several minutes where we'd not see
any row removed and then periods where the wrong ones would be removed,
so the manual tests would have to be repeated numerous times to actually
ensure anything.

If somebody wants to step up to the plate and fix these, it'd perhaps be
more realistic to say that we'll keep the feature. But even if somebody
does, I think it'd require a lot of development in the back branches. On
a feature whose purpose is to eat data that is still required.

I think even if we decide that we do not want to rip the feature out, we
should seriously consider hard disabling it in the backbranches. At
least I don't see how the fixed code is tested enough to be entrusted
with users data.

Do we actually have any evidence of this feature ever beeing used? I
didn't find much evidence for that in the archives (except Thomas
finding a problem). Given that it currently will switch between not
preventing bloat and causing wrong query results, without that being
noticed...

(*) perhaps I just am not understanding the protection however. To me
it's not at all clear what:

/*
* Failsafe protection against vacuuming work of active transaction.
*
* This is not an assertion because we avoid the spinlock for
* performance, leaving open the possibility that xlimit could advance
* and be more current; but it seems prudent to apply this limit. It
* might make pruning a tiny bit less aggressive than it could be, but
* protects against data loss bugs.
*/
if (TransactionIdIsNormal(latest_xmin)
&& TransactionIdPrecedes(latest_xmin, xlimit))
xlimit = latest_xmin;

if (NormalTransactionIdFollows(xlimit, recentXmin))
return xlimit;

actually provides in the way of a protection.

However,
much as I'm not in love with all of the complexity this feature adds,
I don't see the problems you've reported here as serious enough to
justify ripping it out.

What exactly is the interaction of this patch with your snapshot
scalability work?

Post my work there's no precise RecentOldestXmin anymore (since
accessing the frequently changing xmin of other backends is what causes
a good chunk of the scalability issues). But heap_page_prune_opt() has
to determine what to use as the threshold for being able to prune dead
rows. Without snapshot_too_old we can initially rely on the known
boundaries to determine whether we can prune, and only determine an
"accurate" boundary when encountering a prune xid (or a tuple, but
that's an optimization) that falls in the range where we don't know for
certain we can prune. But that's not easy to do with the way the
old_snapshot_threshold stuff currently works.

It's not too hard to implement a crude version that just determines an
accurate xmin horizon whenever pruning with old_snapshot_threshold
set. But that seems like gimping performance for old_snapshot_threshold,
which didn't seem nice.

Additionally, the current implementation of snapshot_too_old is pretty
terrible about causing unnecessary conflicts when hot pruning. Even if
there was no need at all for the horizon to be limited to be able to
prune the page, or if there was nothing to prune on the page (note that
the limiting happens before checking if the space on the page even makes
pruning useful), we still cause a conflict for future accesses, because
TransactionIdLimitedForOldSnapshots() will
SetOldSnapshotThresholdTimestamp() to a recent timestamp.

Greetings,

Andres Freund

robertmhaas@gmail.com

almost 6 years ago

In reply to: Andres Freund (#1)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:

I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

And the output turns out to be something like:

WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 644169619
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

WARNING: head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382

WARNING: old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 666899382
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

It's pretty obvious that the xids don't make a ton of sense, I think:
They're not monotonically ordered. The same values exist multiple times,
despite xids being constantly used. Also, despite the ringbuffer
supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
always add), and the workload having run for 14min, we only have 8
entries.

The function header comment for MaintainOldSnapshotTimeMapping could
hardly be more vague, as it's little more than a restatement of the
function name. However, it looks to me like the idea is that this
function might get called multiple times for the same or similar
values of whenTaken. I suppose that's the point of this code:

else if (ts <= (oldSnapshotControl->head_timestamp +
((oldSnapshotControl->count_used - 1)
* USECS_PER_MINUTE)))
{
/* existing mapping; advance xid if possible */
int bucket = (oldSnapshotControl->head_offset
+ ((ts - oldSnapshotControl->head_timestamp)
/ USECS_PER_MINUTE))
% OLD_SNAPSHOT_TIME_MAP_ENTRIES;

if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket],
xmin))
oldSnapshotControl->xid_by_minute[bucket] = xmin;
}

What I interpret this to be doing is saying - if we got a new call to
this function with a rounded-to-the-minute timestamp that we've seen
previously and for which we still have an entry, and if the XID passed
to this function is newer than the one passed by the previous call,
then advance the xid_by_minute[] bucket to the newer value. Now that
begs the question - what does this XID actually represent? The
comments don't seem to answer that question, not even the comments for
OldSnapshotControlData, which say that we should "Keep one xid per
minute for old snapshot error handling." but don't say which XIDs we
should keep or how they'll be used. However, the only call to
MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
that we call this function each time a new snapshot is taken and pass
the current time (modulo some fiddling) and snapshot xmin. Given that,
one would expect that any updates to the map would be tight races,
i.e. a bunch of processes that all took their snapshots right around
the same time would all update the same map entry in quick succession,
with the newest value winning.

And that make the debugging output which I quoted from your message
above really confusing. At this point, the "head timestamp" is 7
minutes after this facility started up. The first we entry we have is
for minute 7, and the last is for minute 14. But the one we're
updating is for minute 11. How the heck can that happen? I might
suspect that you'd stopped a process inside GetSnapshotData() with a
debugger, but that can't explain it either, because GetSnapshotData()
gets the xmin first and only afterwards gets the timestamp - so if
you'd stopped for ~3 minutes it just before the call to
MaintainOldSnapshotTimeMapping(), it would've been updating the map
with an *old* XID. In reality, though, it changed the XID from
644169619 to 666899382, advancing over 22 million XIDs. I don't
understand what's going on there. How is this function getting called
with a 4-minute old value of whenTaken?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

andres@anarazel.de

almost 6 years ago

In reply to: Robert Haas (#5)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 11:15:14 -0400, Robert Haas wrote:

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:

I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

And the output turns out to be something like:

WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 644169619
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

WARNING: head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382

WARNING: old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 666899382
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619

It's pretty obvious that the xids don't make a ton of sense, I think:
They're not monotonically ordered. The same values exist multiple times,
despite xids being constantly used. Also, despite the ringbuffer
supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
always add), and the workload having run for 14min, we only have 8
entries.

The function header comment for MaintainOldSnapshotTimeMapping could
hardly be more vague, as it's little more than a restatement of the
function name. However, it looks to me like the idea is that this
function might get called multiple times for the same or similar
values of whenTaken. I suppose that's the point of this code:

Right. We enforce whenTaken to be monotonic
(cf. GetSnapshotCurrentTimestamp()), but since
GetSnapshotCurrentTimestamp() reduces the granularity of the timestamp
to one-minute (the AlignTimestampToMinuteBoundary() call), it's
obviously possible to end up in the same bin as a previous

What I interpret this to be doing is saying - if we got a new call to
this function with a rounded-to-the-minute timestamp that we've seen
previously and for which we still have an entry, and if the XID passed
to this function is newer than the one passed by the previous call,
then advance the xid_by_minute[] bucket to the newer value. Now that
begs the question - what does this XID actually represent? The
comments don't seem to answer that question, not even the comments for
OldSnapshotControlData, which say that we should "Keep one xid per
minute for old snapshot error handling." but don't say which XIDs we
should keep or how they'll be used. However, the only call to
MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
that we call this function each time a new snapshot is taken and pass
the current time (modulo some fiddling) and snapshot xmin. Given that,
one would expect that any updates to the map would be tight races,
i.e. a bunch of processes that all took their snapshots right around
the same time would all update the same map entry in quick succession,
with the newest value winning.

Right.

And that make the debugging output which I quoted from your message
above really confusing. At this point, the "head timestamp" is 7
minutes after this facility started up. The first we entry we have is
for minute 7, and the last is for minute 14. But the one we're
updating is for minute 11. How the heck can that happen?

If I undestand what your reference correctly, I think that is because,
due to the bug, the "need a new bucket" branch doesn't just extend by
one bucket, it extends it by many in common cases. Basically filling
buckets "into the future".

the advance = ... variable in the branch will not always be 1, even when
we continually call Maintain*. Here's some debug output showing that
(slightly modified from the patch I previously sent):

WARNING: old snapshot mapping at "before update" with head ts: 1, current entries: 2 max entries: 15, offset: 0
entry 0 (ring 0): min 1: xid 1089371384
entry 1 (ring 1): min 2: xid 1099553206

WARNING: head 1 min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 1109840204

WARNING: old snapshot mapping at "after update" with head ts: 3, current entries: 4 max entries: 15, offset: 0
entry 0 (ring 0): min 3: xid 1089371384
entry 1 (ring 1): min 4: xid 1099553206
entry 2 (ring 2): min 5: xid 1109840204
entry 3 (ring 3): min 6: xid 1109840204

Note how the two new buckets have the same xid, and how we're inserting
for "whenTaken 3 min", but we've filled the mapping up to minute 6.

I don't think the calculation of the 'advance' variable is correct as
is, even if we ignore the wrong setting of the head_timestamp variable.

Greetings,

Andres Freund

robertmhaas@gmail.com

almost 6 years ago

In reply to: Andres Freund (#4)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 11:09 AM Andres Freund <andres@anarazel.de> wrote:

That doesn't exist in all the back branches. Think it'd be easier to add
code to explicitly prune it during MaintainOldSnapshotTimeMapping().

That's reasonable.

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have didided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.

I am very doubtful that this approach would have been adequate. It
would mean that, with old_snapshot_threshold set to a week, the
threshold for declaring a snapshot "old" would jump forward 16.8 hours
at a time. It's hard for me to make a coherent argument right now as
to exactly what problems that would create, but it's not very
granular, and a lot of bloat-related things really benefit from more
granularity. I also don't really see what the problem with keeping a
bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
= ~10k buckets, isn't it? That's not really insane for an in-memory
data structure. I agree that the code that does that maintenance being
buggy is a problem to whatever extent that is the case, but writing
the code to have fewer buckets wouldn't necessarily have made it any
less buggy.

They probably are fixable. But there's a lot more, I think:

Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
update_ts threshold actually needs to be ts >= update_ts, right now we
don't handle being newer than the newest bin correctly afaict (mitigated
by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
to say, because there's no comments.

That test and the following one for "if (ts == update_ts)" both make
me nervous too. If only two of <, >, and = are expected, there should
be an Assert() to that effect, at least. If all three values are
expected then we need an explanation of why we're only checking for
equality.

The whole lock nesting is very hazardous. Most (all?)
TestForOldSnapshot() calls happen with locks on on buffers held, and can
acquire lwlocks itself. In some older branches we do entire *catalog
searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

The catalog searches are clearly super-bad, but I'm not sure that the
other ones have a deadlock risk or anything. They might, but I think
we'd need some evidence of that.

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?

How so?

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

Sure, and what if you break more stuff ripping it out? Ripping this
volume of code out in a supposedly-stable branch is totally insane
almost no matter how broken the feature is. I also think, and we've
had this disagreement before, that you're far too willing to say
"well, that's wrong so we need to hit it with a nuke." I complained
when you added those error checks to vacuum in back-branches, and
since that release went out people are regularly tripping those checks
and taking prolonged outages for a problem that wasn't making them
unhappy before. I know that in theory those people are better off
because their database was always corrupted and now they know. But for
some of them, those prolonged outages are worse than the problem they
had before. I believe it was irresponsible to decide on behalf of our
entire user base that they were better off with such a behavior change
in a supposedly-stable branch, and I believe the same thing here.

I have no objection to the idea that *if* the feature is hopelessly
broken, it should be removed. But I don't have confidence at this
point that you've established that, and I think ripping out thousands
of lines of codes in the back-branches is terrible. Even
hard-disabling the feature in the back-branches without actually
removing the code is an awfully strong reaction, but it could be
justified if we find out that things are actually super-bad and not
really fixable. Actually removing the code is unnecessary, protects
nobody, and has risk.

Do we actually have any evidence of this feature ever beeing used? I
didn't find much evidence for that in the archives (except Thomas
finding a problem). Given that it currently will switch between not
preventing bloat and causing wrong query results, without that being
noticed...

I believe that at least one EnterpriseDB customer used it, and
possibly more than one. I am not sure how extensively, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pg@bowt.ie

almost 6 years ago

In reply to: Robert Haas (#7)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 9:02 AM Robert Haas <robertmhaas@gmail.com> wrote:

I complained
when you added those error checks to vacuum in back-branches, and
since that release went out people are regularly tripping those checks
and taking prolonged outages for a problem that wasn't making them
unhappy before. I know that in theory those people are better off
because their database was always corrupted and now they know. But for
some of them, those prolonged outages are worse than the problem they
had before. I believe it was irresponsible to decide on behalf of our
entire user base that they were better off with such a behavior change
in a supposedly-stable branch, and I believe the same thing here.

I agreed with that decision, FWIW. Though I don't deny that there is
some merit in what you say. This is the kind of high level
philosophical question where large differences of opinion are quite
normal.

I don't think that it's fair to characterize Andres' actions in that
situation as in any way irresponsible. We had an extremely complicated
data corruption bug that he went to great lengths to fix, following
two other incorrect fixes. He was jet lagged from travelling to India
at the time. He went to huge lengths to make sure that the bug was
correctly squashed.

Actually removing the code is unnecessary, protects
nobody, and has risk.

Every possible approach has risk. We are deciding among several
unpleasant and risky alternatives here, no?

--
Peter Geoghegan

robertmhaas@gmail.com

almost 6 years ago

In reply to: Peter Geoghegan (#8)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 1:03 PM Peter Geoghegan <pg@bowt.ie> wrote:

I don't think that it's fair to characterize Andres' actions in that
situation as in any way irresponsible. We had an extremely complicated
data corruption bug that he went to great lengths to fix, following
two other incorrect fixes. He was jet lagged from travelling to India
at the time. He went to huge lengths to make sure that the bug was
correctly squashed.

I don't mean it as a personal attack on Andres, and I know and am glad
that he worked hard on the problem, but I don't agree that it was the
right decision. Perhaps "irresponsible" is the wrong word, but it's
certainly caused problems for multiple EnterpriseDB customers, and in
my view, those problems weren't necessary. Either a WARNING or an
ERROR would have shown up in the log, but an ERROR terminates VACUUM
for that table and thus basically causes autovacuum to be completely
broken. That is a really big problem. Perhaps you will want to argue,
as Andres did, that the value of having ERROR rather than WARNING in
the log justifies that outcome, but I sure don't agree.

Actually removing the code is unnecessary, protects
nobody, and has risk.

Every possible approach has risk. We are deciding among several
unpleasant and risky alternatives here, no?

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump.

In this case, I think it is pretty clear that hard-disabling the
feature by always setting old_snapshot_threshold to -1 carries less
risk of breaking unrelated things than removing code that caters to
the feature all over the code base. Perhaps it is not quite as
dramatic as my parachute example, but I think it is pretty clear all
the same that one is a lot more likely to introduce new bugs than the
other. A carefully targeted modification of a few lines of code in 1
file just about has to carry less risk than ~1k lines of code spread
across 40 or so files.

However, I still think that without some more analysis, it's not clear
whether we should go this direction at all. Andres's results suggest
that there are some bugs here, but I think we need more senior hackers
to study the situation before we make a decision about what to do
about them. I certainly haven't had enough time to even fully
understand the problems yet, and nobody else has posted on that topic
at all. I have the highest respect for Andres and his technical
ability, and if he says this stuff has problems, I'm sure it does. Yet
I'm not willing to conclude that because he's tired and frustrated
with this stuff right now, it's unsalvageable. For the benefit of the
whole community, such a claim deserves scrutiny from multiple people.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10

andres@anarazel.de

almost 6 years ago

In reply to: Robert Haas (#7)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:

On Wed, Apr 1, 2020 at 11:09 AM Andres Freund <andres@anarazel.de> wrote:

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have didided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.

I am very doubtful that this approach would have been adequate. It
would mean that, with old_snapshot_threshold set to a week, the
threshold for declaring a snapshot "old" would jump forward 16.8 hours
at a time. It's hard for me to make a coherent argument right now as
to exactly what problems that would create, but it's not very
granular, and a lot of bloat-related things really benefit from more
granularity. I also don't really see what the problem with keeping a
bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
= ~10k buckets, isn't it? That's not really insane for an in-memory
data structure. I agree that the code that does that maintenance being
buggy is a problem to whatever extent that is the case, but writing
the code to have fewer buckets wouldn't necessarily have made it any
less buggy.

My issue isn't really that it's too many buckets right now, but that it
doesn't scale down to smaller thresholds. I think to be able to develop
this reasonably, it'd need to be able to support thresholds in the
sub-second range. And I don't see how you can have the same binning for
such small thresholds, and for multi-day thresholds - we'd quickly go to
millions of buckets for longer thresholds.

I really think we'd need to support millisecond resolution to make this
properly testable.

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?

How so?

Because it returns the end of the reserved WAL, not how far we've
actually inserted. I.e. there can be in-progress, but not finished,
modifications that will have an LSN < GetXLogInsertRecPtr(). But the
whenTaken timestamp could reflect one that should throw an error for
these in-progress modifications (since the transaction limiting happens
before the WAL logging).

I am not 100%, but I suspect that that could lead to errors not being
thrown that should, because TestForOldSnapshot() will not see these
in-progress modifications as conflicting.

Hm, also, shouldn't
&& PageGetLSN(page) > (snapshot)->lsn)
in TestForOldSnapshot() be an >=?

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

Sure, and what if you break more stuff ripping it out? Ripping this
volume of code out in a supposedly-stable branch is totally insane
almost no matter how broken the feature is.

For the backbranches I was just thinking of forcing the GUC to be off
(either by disallowing it to be set to on, or just warning when its set
to true, but not propagating the value).

I have no objection to the idea that *if* the feature is hopelessly
broken, it should be removed.

I would be a lot less inclined to go that way if old_snapshot_threshold

a) weren't explicitly about removing still-needed data - in contrast to
a lot of other features, where the effects of bugs is temporary, here
it can be much larger.
b) were a previously working feature, but as far as I can tell, it never really did
c) had tests that verify that my fixes actually do the right thing. As
it stands, I'd not just have to fix the bugs, I'd also have to develop
a test framework that can test this

While I wish I had been more forceful, and reviewed more of the code to
point out more of the quality issues, I did argue hard against the
feature going in. On account of it being architecturally bad and
impactful. Which I think it has proven to be several times over by
now. And now I'm kind of on the hook to fix it, it seems?

I also think, and we've had this disagreement before, that you're far
too willing to say "well, that's wrong so we need to hit it with a
nuke." I complained when you added those error checks to vacuum in
back-branches, and since that release went out people are regularly
tripping those checks and taking prolonged outages for a problem that
wasn't making them unhappy before. I know that in theory those people
are better off because their database was always corrupted and now
they know. But for some of them, those prolonged outages are worse
than the problem they had before.

I think this is a somewhat revisionist. Sure, the errors were added
after like the 10th data corruption bug around freezing that we didn't
find for a long time, because of the lack of errors being thrown. But
the error checks weren't primarily added to find further bugs, but to
prevent data loss due to the fixed bug. Of which we had field reports.

I'd asked over *weeks* for reviews of the bug fixes. Not a single person
expressed concerns about throwing new errors at that time. First version
of the patches with the errors:
/messages/by-id/20171114030341.movhteyakqeqx5pm@alap3.anarazel.de
I pushed them over a month later
/messages/by-id/20171215023059.oeyindn57oeis5um@alap3.anarazel.de

There also wasn't (and isn't) a way to just report back that we can't
currently freeze the individual page, without doing major surgery. And
even if there were, what are supposed to do other than throw an error?
We need to remove tuples below relfrozenxid, or we corrupt the table.

As I've first asked before when you complained about those errors: What
was the alternative? Just have invisible tuples reappear? Delete them? I
don't think you've ever answered that.

You brought this up as an example for me being over-eager with errors
checks before. But I don't see how that meshes with the history visible
in the thread referenced above.

The more general issue, about throwing errors, is not just about the
people that don't give a hoot about whether their data evolves on its
own (perhaps a good tradeoff for them). Not throwing errors affects
*everyone*. Some people do care about their data. Without errors we
never figure out that we screwed up. And long-term, even the people
that care much more about availability than data loss, benefit from the
whole system getting more robust.

We've since found numerous further data corrupting bugs because of the
relfrozenxid checks. Some of very long standing vintage. Some in newly
added code.

Yes, hypothetically, I'd argue for introducing the checks solely for the
sake of finding bugs. Even if I were prescient to forsee the number of
issues caused (although I'd add block numbers to the error message from
the get go, knowing that). But I'd definitely not do so in the back
branches.

I believe it was irresponsible to decide on behalf of our entire user
base that they were better off with such a behavior change in a
supposedly-stable branch, and I believe the same thing here.

As I explained above, I don't think that's fair with regard to the
relfrozenxid errors. Setting that aside:

In these discussions you always seem to only argue for the people that
don't care about their data. But, uh, a lot of people do - should we
just silently eat their data? And the long-term quality of the project
gets a lot better by throwing errors, because it actually allows us to
fix them.

As far as I can tell we couldn't even have added the checks to master,
back then, if we follow your logic: A lot of the reports about hitting
the errors were with 11+ (partially due to pg_upgrade, partially because
they detected other bugs).

The likelihood of hurting people by adding checks at a later point would
be a lot lower, if we stopped adding code that ignores errors silently
and hoping for the best. But we keep adding such "lenient" code.

We just found another long-standing cause of data corrupting, that
should have been found earlier if we had errors, or at least warnings,
btw. The locking around vac_udpate_datfrozenxid() has been broken for a
long long time, but the silent 'if (bogus) return' made it very hard to
find.
/messages/by-id/20200323235036.6pje6usrjjx22zv3@alap3.anarazel.de

Also, I've recently seen a number of databases beeing eaten because we
just ignore our own WAL logging rules to avoid throwing hard enough
errors (RelationTruncate() WAL logging the truncation outside of a
critical section - oops if you hit it, your primary and replicas/backups
diverge, among many other bad consequences).

Greetings,

Andres Freund

#11

pg@bowt.ie

almost 6 years ago

In reply to: Robert Haas (#9)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote:

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump.

In this case, I think it is pretty clear that hard-disabling the
feature by always setting old_snapshot_threshold to -1 carries less
risk of breaking unrelated things than removing code that caters to
the feature all over the code base. Perhaps it is not quite as
dramatic as my parachute example, but I think it is pretty clear all
the same that one is a lot more likely to introduce new bugs than the
other. A carefully targeted modification of a few lines of code in 1
file just about has to carry less risk than ~1k lines of code spread
across 40 or so files.

Yeah, that's certainly true. But is that fine point really what
anybody disagrees about? I didn't think that Andres was focussed on
literally ripping it out over just disabling it.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

I had a few other things that I was going to work on this week, but
those seems less urgent. I'll take a look into it, and report back
what I find.

--
Peter Geoghegan

#12

andres@anarazel.de

almost 6 years ago

In reply to: Peter Geoghegan (#11)

1 attachment(s)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 11:04:43 -0700, Peter Geoghegan wrote:

On Wed, Apr 1, 2020 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote:

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

This definitely needs more eyes. I am not even close to understanding
the code fully.

I had a few other things that I was going to work on this week, but
those seems less urgent. I'll take a look into it, and report back
what I find.

Thanks you!

I attached a slightly evolved version of my debugging patch.

Greetings,

Andres Freund

Attachments:

v2-debug.difftext/x-diff; charset=us-asciiDownload

diff --git i/src/backend/utils/time/snapmgr.c w/src/backend/utils/time/snapmgr.c
index 1c063c592ce..6a722863bcf 100644
--- i/src/backend/utils/time/snapmgr.c
+++ w/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@ typedef struct OldSnapshotControlData
 	int			head_offset;	/* subscript of oldest tracked time */
 	TimestampTz head_timestamp; /* time corresponding to head xid */
 	int			count_used;		/* how many slots are in use */
+	TimestampTz starttime; // rounded to minute
 	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
 } OldSnapshotControlData;
 
@@ -290,6 +291,8 @@ SnapMgrInit(void)
 		oldSnapshotControl->head_offset = 0;
 		oldSnapshotControl->head_timestamp = 0;
 		oldSnapshotControl->count_used = 0;
+		oldSnapshotControl->starttime =
+			AlignTimestampToMinuteBoundary(GetCurrentTimestamp());
 	}
 }
 
@@ -1762,6 +1765,39 @@ SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
 	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
 }
 
+static void PrintOldSnapshotMapping(const char *head, bool already_locked)
+{
+	StringInfoData s;
+
+	initStringInfo(&s);
+
+	if (!already_locked)
+		LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+
+
+	appendStringInfo(&s, "old snapshot mapping at \"%s\" with head ts: %lu, current entries: %d max entries: %d, offset: %d\n",
+					 head,
+					 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+					 oldSnapshotControl->count_used,
+					 OLD_SNAPSHOT_TIME_MAP_ENTRIES,
+					 oldSnapshotControl->head_offset);
+
+	for (int off = 0; off < oldSnapshotControl->count_used; off++)
+	{
+		int ringoff = (off + oldSnapshotControl->head_offset) % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+		appendStringInfo(&s, "  entry %d (ring %d): min %ld: xid %d\n",
+						 off, ringoff,
+						 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE + off,
+						 oldSnapshotControl->xid_by_minute[ringoff]);
+	}
+
+	if (!already_locked)
+		LWLockRelease(OldSnapshotTimeMapLock);
+
+	elog(WARNING, "%s", s.data);
+}
+
 /*
  * TransactionIdLimitedForOldSnapshots
  *
@@ -1826,9 +1862,15 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 		{
 			if (ts == update_ts)
 			{
+				PrintOldSnapshotMapping("non cached limit via update_ts", false);
+
 				xlimit = latest_xmin;
 				if (NormalTransactionIdFollows(xlimit, recentXmin))
+				{
+					elog(LOG, "increasing threshold from %u to %u (via update_ts)",
+						 recentXmin, xlimit);
 					SetOldSnapshotThresholdTimestamp(ts, xlimit);
+				}
 			}
 			else
 			{
@@ -1839,6 +1881,8 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 				{
 					int			offset;
 
+					PrintOldSnapshotMapping("non cached limit via bins", true);
+
 					offset = ((ts - oldSnapshotControl->head_timestamp)
 							  / USECS_PER_MINUTE);
 					if (offset > oldSnapshotControl->count_used - 1)
@@ -1848,7 +1892,15 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 					xlimit = oldSnapshotControl->xid_by_minute[offset];
 
 					if (NormalTransactionIdFollows(xlimit, recentXmin))
+					{
+						elog(LOG, "increasing threshold from %u to %u (via bins)",
+							 recentXmin, xlimit);
 						SetOldSnapshotThresholdTimestamp(ts, xlimit);
+					}
+				}
+				else
+				{
+					// currently debugging output here is pretty darn verbose
 				}
 
 				LWLockRelease(OldSnapshotTimeMapLock);
@@ -1869,7 +1921,11 @@ TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
 			xlimit = latest_xmin;
 
 		if (NormalTransactionIdFollows(xlimit, recentXmin))
+		{
+			elog(LOG, "increasing prune threshold from %u to %u",
+				 recentXmin, xlimit);
 			return xlimit;
+		}
 	}
 
 	return recentXmin;
@@ -1923,14 +1979,14 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	 */
 	if (whenTaken < 0)
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with negative whenTaken = %ld",
 			 (long) whenTaken);
 		return;
 	}
 	if (!TransactionIdIsNormal(xmin))
 	{
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with xmin = %lu",
 			 (unsigned long) xmin);
 		return;
@@ -1938,6 +1994,8 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 
 	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
 
+	PrintOldSnapshotMapping("before update", true);
+
 	Assert(oldSnapshotControl->head_offset >= 0);
 	Assert(oldSnapshotControl->head_offset < OLD_SNAPSHOT_TIME_MAP_ENTRIES);
 	Assert((oldSnapshotControl->head_timestamp % USECS_PER_MINUTE) == 0);
@@ -1956,7 +2014,7 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	{
 		/* old ts; log it at DEBUG */
 		LWLockRelease(OldSnapshotTimeMapLock);
-		elog(DEBUG1,
+		elog(PANIC,
 			 "MaintainOldSnapshotTimeMapping called with old whenTaken = %ld",
 			 (long) whenTaken);
 		return;
@@ -1971,6 +2029,12 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 								 / USECS_PER_MINUTE))
 		% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
 
+		elog(WARNING, "head %ld min: updating existing bucket %d for whenTaken %ld min, with xmin %u",
+			 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+			 bucket,
+			 (ts - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+			 xmin);
+
 		if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket], xmin))
 			oldSnapshotControl->xid_by_minute[bucket] = xmin;
 	}
@@ -1980,6 +2044,13 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 		int			advance = ((ts - oldSnapshotControl->head_timestamp)
 							   / USECS_PER_MINUTE);
 
+		elog(WARNING, "head %ld min: filling %d buckets starting at %d for whenTaken %ld min, with xmin %u",
+			 (oldSnapshotControl->head_timestamp - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+			 advance,
+			 oldSnapshotControl->head_offset,
+			 (ts - oldSnapshotControl->starttime) / USECS_PER_MINUTE,
+			 xmin);
+
 		oldSnapshotControl->head_timestamp = ts;
 
 		if (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)
@@ -2021,6 +2092,8 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 		}
 	}
 
+	PrintOldSnapshotMapping("after update", true);
+
 	LWLockRelease(OldSnapshotTimeMapLock);
 }

#13

andres@anarazel.de

almost 6 years ago

In reply to: Robert Haas (#9)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 13:27:56 -0400, Robert Haas wrote:

Perhaps "irresponsible" is the wrong word, but it's certainly caused
problems for multiple EnterpriseDB customers, and in my view, those
problems weren't necessary. Either a WARNING or an ERROR would have
shown up in the log, but an ERROR terminates VACUUM for that table and
thus basically causes autovacuum to be completely broken. That is a
really big problem. Perhaps you will want to argue, as Andres did,
that the value of having ERROR rather than WARNING in the log
justifies that outcome, but I sure don't agree.

If that had been a really viable option, I would have done so. At the
very least in the back branches, but quite possibly also in master. Or
if somebody had brought them up as an issue at the time.

What is heap_prepare_freeze_tuple/FreezeMultiXactId supposed to do after
issuing a WARNING in these cases. Without the ERROR, e.g.,
if (!TransactionIdDidCommit(xid))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
xid, cutoff_xid)));
would make a deleted tuple visible.

if (TransactionIdPrecedes(xid, relfrozenxid))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("found xmin %u from before relfrozenxid %u",
xid, relfrozenxid)));
would go on replace xmin of a potentially uncommitted tuple with
relfrozenxid, making it appear visible.

if (TransactionIdPrecedes(xid, relfrozenxid))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("found xmax %u from before relfrozenxid %u",
xid, relfrozenxid)));
would replace the xmax indicating a potentially deleted tuple with ?, either
making the tuple become, potentially wrongly, visible/invisible

or
else if (MultiXactIdPrecedes(multi, relminmxid))
ereport(ERROR,
(errcode(ERRCODE_DATA_CORRUPTED),
errmsg_internal("found multixact %u from before relminmxid %u",
multi, relminmxid)));
or ...

Just continuing is easier said than done. Especially with the background
of knowing that several users had hit the bug that allowed all of the
above to be hit, and that advancing relfrozenxid further would make it
worse.

Greetings,

Andres Freund

#14

robertmhaas@gmail.com

almost 6 years ago

In reply to: Andres Freund (#13)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 2:37 PM Andres Freund <andres@anarazel.de> wrote:

Just continuing is easier said than done. Especially with the background
of knowing that several users had hit the bug that allowed all of the
above to be hit, and that advancing relfrozenxid further would make it
worse.

Fair point, but it seems we're arguing over nothing here, or at least
nothing relevant to this thread, because it sounds like if we are
going to disable that you're OK with doing that by just shutting it
off the code rather than trying to remove it all. I had the opposite
impression from your first email.

Sorry to have derailed the thread, and for my poor choice of words.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15

kgrittn@gmail.com

almost 6 years ago

In reply to: Andres Freund (#4)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:

First off, many thanks to Andres for investigating this, and apologies for
the bugs. Also thanks to Michael for making sure I saw the thread. I must
also apologize that for not being able to track the community lists
consistently due to health issues that are exacerbated by stress, and the
fact that these lists often push past my current limits. I'll try to help
in this as best I can.

Do we actually have any evidence of this feature ever beeing used? I

didn't find much evidence for that in the archives (except Thomas
finding a problem).

This was added because a very large company trying to convert from Oracle
had a test that started to show some slowdown on PostgreSQL after 8 hours,
serious slowdown by 24 hours, and crashed hard before it could get to 48
hours -- due to lingering WITH HOLD cursors left by ODBC code. They had
millions of lines of code that would need to be rewritten without this
feature. With this feature (set to 20 minutes, if I recall correctly),
their unmodified code ran successfully for at least three months solid
without failure or corruption. Last I heard, they were converting a large
number of instances from Oracle to PostgreSQL, and those would all fail
hard within days of running with this feature removed or disabled.

Also, VMware is using PostgreSQL as an embedded part of many products, and
this feature was enabled to deal with similar failures due to ODBC cursors;
so the number of instances running 24/7 under high load which have shown a
clear benefit from enabling this feature has a lot of zeros.

Perhaps the lack of evidence for usage in the archives indicates a low
frequency of real-world failures due to the feature, rather than lack of
use? I'm not doubting that Andres found real issues that should be fixed,
but perhaps not very many people who are using the feature have more than
two billion transactions within the time threshold, and perhaps the other
problems are not as big as the problems solved by use of the feature -- at
least in some cases.

To save readers who have not yet done the math some effort, at the 20
minute threshold used by the initial user, they would need to have a
sustained rate of consumption of transaction IDs of over 66 million per
second to experience wraparound problems, and at the longest threshold I
have seen it would need to exceed an average of 461,893 TPS for three days
solid to hit wraparound. Those aren't impossible rates to hit, but in
practice it might not be a frequent occurrence yet on modern hardware with
some real-world applications. Hopefully we can find a way to fix this
before those rates become common.

I am reviewing the issue and patches now, and hope I can make some useful
contribution to the discussion.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

#16

andres@anarazel.de

almost 6 years ago

In reply to: Kevin Grittner (#15)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

Nice to have you back for a bit! Even if the circumstances aren't
great...

It's very understandable that the lists are past your limits, I barely
keep up these days. Without any health issues.

On 2020-04-01 14:10:09 -0500, Kevin Grittner wrote:

Perhaps the lack of evidence for usage in the archives indicates a low
frequency of real-world failures due to the feature, rather than lack of
use? I'm not doubting that Andres found real issues that should be fixed,
but perhaps not very many people who are using the feature have more than
two billion transactions within the time threshold, and perhaps the other
problems are not as big as the problems solved by use of the feature -- at
least in some cases.

To save readers who have not yet done the math some effort, at the 20
minute threshold used by the initial user, they would need to have a
sustained rate of consumption of transaction IDs of over 66 million per
second to experience wraparound problems, and at the longest threshold I
have seen it would need to exceed an average of 461,893 TPS for three days
solid to hit wraparound. Those aren't impossible rates to hit, but in
practice it might not be a frequent occurrence yet on modern hardware with
some real-world applications. Hopefully we can find a way to fix this
before those rates become common.

The wraparound issue on their own wouldn't be that bad - when I found it
I did play around with a few ideas for how to fix it. The most practical
would probably be to have MaintainOldSnapshotTimeMapping() scan all
buckets when a new oldSnapshotControl->oldest_xid is older than
RecentGlobalXmin. There's no benefit in the contents of those buckets
anyway, since we know that we can freeze those independent of
old_snapshot_threshold.

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error). And I've
reproduced rows not getting removed for much longer than than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
long-running queries/transactions querying immutable or insert only
data. Those would not notice that, on other tables, rows are getting
removed, where access would not trigger the required error.

b) I observe long-ish phases were no cleanup is happening (due to
oldSnapshotControl->head_timestamp getting updated more often than
correct). But if old_snapshot_threshold is small enough in relation to
the time the generated bloat becomes problematic, there will still be
occasions to actually perform cleanup.

Greetings,

Andres Freund

#17

kgrittn@gmail.com

almost 6 years ago

In reply to: Andres Freund (#16)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error).

That is a very big problem. On the sort-of bright side (ironic in light of
the fact that I'm a big proponent of using serializable transactions), none
of the uses that I have personally seen of this feature use anything other
than the default READ COMMITTED isolation level. That might help explain
the lack of complaints for those using the feature. But yeah, I REALLY
want to see a solid fix for that!

And I've
reproduced rows not getting removed for much longer than than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
long-running queries/transactions querying immutable or insert only
data. Those would not notice that, on other tables, rows are getting
removed, where access would not trigger the required error.

b) I observe long-ish phases were no cleanup is happening (due to
oldSnapshotControl->head_timestamp getting updated more often than
correct). But if old_snapshot_threshold is small enough in relation to
the time the generated bloat becomes problematic, there will still be
occasions to actually perform cleanup.

Keep in mind that the real goal of this feature is not to eagerly _see_
"snapshot too old" errors, but to prevent accidental debilitating bloat due
to one misbehaving user connection. This is particularly easy to see (and
therefore unnervingly common) for those using ODBC, which in my experience
tends to correspond to the largest companies which are using PostgreSQL.
In some cases, the snapshot which is preventing removal of the rows will
never be used again; removal of the rows will not actually affect the
result of any query, but only the size and performance of the database.
This is a "soft limit" -- kinda like max_wal_size. Where there was a
trade-off between accuracy of the limit and performance, the less accurate
way was intentionally chosen. I apologize for not making that more clear
in comments.

While occasional "snapshot too old" errors are an inconvenient side effect
of achieving the primary goal, it might be of interest to know that the
initial (very large corporate) user of this feature had, under Oracle,
intentionally used a cursor that would be held open as long as a user chose
to leave a list open for scrolling around. They used cursor features for
as long as the cursor allowed. This could be left open for days or weeks
(or longer?). Their query ordered by a unique index, and tracked the ends
of the currently displayed portion of the list so that if they happened to
hit the "snapshot too old" error they could deallocate and restart the
cursor and reposition before moving forward or back to the newly requested
rows. They were not willing to convert to PostgreSQL unless this approach
continued to work.

In Summary:
(1) It's not urgent that rows always be removed as soon as possible after
the threshold is crossed as long as they don't often linger too awfully far
past that limit and allow debilitating bloat.
(2) It _is_ a problem if results inconsistent with the snapshot are
returned -- a "snapshot too old" error is necessary.
(3) Obviously, wraparound problems need to be solved.

I hope this is helpful.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

#18

robertmhaas@gmail.com

almost 6 years ago

In reply to: Andres Freund (#16)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error). And I've
reproduced rows not getting removed for much longer than than they
should, according to old_snapshot_threshold.

I think it would be a good idea to add a system view that shows the
contents of the mapping. We could make it a contrib module, if you
like, so that it can even be installed on back branches. We'd need to
move the structure definition from snapmgr.c to a header file, but
that doesn't seem like such a big deal.

Maybe that contrib module could even have some functions to simulate
aging without the passage of any real time. Like, say you have a
function or procedure old_snapshot_pretend_time_has_passed(integer),
and it moves oldSnapshotControl->head_timestamp backwards by that
amount. Maybe that would require updating some other fields in
oldSnapshotControl too but it doesn't seem like we'd need to do a
whole lot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19

andres@anarazel.de

almost 6 years ago

In reply to: Kevin Grittner (#17)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 15:11:52 -0500, Kevin Grittner wrote:

On Wed, Apr 1, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error).

That is a very big problem. On the sort-of bright side (ironic in light of
the fact that I'm a big proponent of using serializable transactions), none
of the uses that I have personally seen of this feature use anything other
than the default READ COMMITTED isolation level. That might help explain
the lack of complaints for those using the feature. But yeah, I REALLY
want to see a solid fix for that!

I don't think it's dependent on RR - it's just a bit easier to verify
that the query results are wrong that way.

And I've
reproduced rows not getting removed for much longer than than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
long-running queries/transactions querying immutable or insert only
data. Those would not notice that, on other tables, rows are getting
removed, where access would not trigger the required error.

b) I observe long-ish phases were no cleanup is happening (due to
oldSnapshotControl->head_timestamp getting updated more often than
correct). But if old_snapshot_threshold is small enough in relation to
the time the generated bloat becomes problematic, there will still be
occasions to actually perform cleanup.

Keep in mind that the real goal of this feature is not to eagerly _see_
"snapshot too old" errors, but to prevent accidental debilitating bloat due
to one misbehaving user connection.

I don't think it's an "intentional" inaccuracy issue leading to
this. The map contents are just wrong, in particular the head_timestamp
most of the time is so new that
TransactionIdLimitedForOldSnapshots(). When filling a new bucket,
MaintainOldSnapshotThreshold() unconditionally updates
oldSnapshotControl->head_timestamp to be the current minute, which means
it'll take old_snapshot_threshold minutes till
TransactionIdLimitedForOldSnapshots() even looks at the mapping again.

As far as I can tell, with a large old_snapshot_threshold, it can take a
very long time to get to a head_timestamp that's old enough for
TransactionIdLimitedForOldSnapshots() to do anything. Look at this
trace of a pgbench run with old_snapshot_threshold enabled, showing some of
the debugging output added in the patch upthread.

This is with a threshold of 10min, in a freshly started database:

2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING: head 0 min: filling 1 buckets starting at 0 for whenTaken 1 min, with xmin 2068881994
2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING: old snapshot mapping at "after update" with head ts: 1, current entries: 2 max entries: 20, offset: 0
entry 0 (ring 0): min 1: xid 2068447214
entry 1 (ring 1): min 2: xid 2068881994

2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: old snapshot mapping at "before update" with head ts: 1, current entries: 2 max entries: 20, offset: 0
entry 0 (ring 0): min 1: xid 2068447214
entry 1 (ring 1): min 2: xid 2068881994

2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: head 1 min: updating existing bucket 1 for whenTaken 2 min, with xmin 2069199511
2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING: old snapshot mapping at "after update" with head ts: 1, current entries: 2 max entries: 20, offset: 0
entry 0 (ring 0): min 1: xid 2068447214
entry 1 (ring 1): min 2: xid 2069199511

2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: old snapshot mapping at "before update" with head ts: 1, current entries: 2 max entries: 20, offset: 0
entry 0 (ring 0): min 1: xid 2068447214
entry 1 (ring 1): min 2: xid 2069199511

2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: head 1 min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 2069516499
2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING: old snapshot mapping at "after update" with head ts: 3, current entries: 4 max entries: 20, offset: 0
entry 0 (ring 0): min 3: xid 2068447214
entry 1 (ring 1): min 4: xid 2069199511
entry 2 (ring 2): min 5: xid 2069516499
entry 3 (ring 3): min 6: xid 2069516499
...
2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 20, offset: 0
entry 0 (ring 0): min 7: xid 2068447214
entry 1 (ring 1): min 8: xid 2071112480
entry 2 (ring 2): min 9: xid 2071434473
entry 3 (ring 3): min 10: xid 2071755177
entry 4 (ring 4): min 11: xid 2072075827
entry 5 (ring 5): min 12: xid 2072395700
entry 6 (ring 6): min 13: xid 2072715464
entry 7 (ring 7): min 14: xid 2073035816

Before the mapping change the database had been running for 15
minutes. But the mapping starts only at 7 minutes from start. And then
is updated to

2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: head 7 min: filling 8 buckets starting at 0 for whenTaken 15 min, with xmin 2075918093
2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING: old snapshot mapping at "after update" with head ts: 15, current entries: 16 max entries: 20, offset: 0
entry 0 (ring 0): min 15: xid 2068447214
entry 1 (ring 1): min 16: xid 2071112480
entry 2 (ring 2): min 17: xid 2071434473
entry 3 (ring 3): min 18: xid 2071755177
entry 4 (ring 4): min 19: xid 2072075827
entry 5 (ring 5): min 20: xid 2072395700
entry 6 (ring 6): min 21: xid 2072715464
entry 7 (ring 7): min 22: xid 2073035816
entry 8 (ring 8): min 23: xid 2075918093
entry 9 (ring 9): min 24: xid 2075918093
entry 10 (ring 10): min 25: xid 2075918093
entry 11 (ring 11): min 26: xid 2075918093
entry 12 (ring 12): min 27: xid 2075918093
entry 13 (ring 13): min 28: xid 2075918093
entry 14 (ring 14): min 29: xid 2075918093
entry 15 (ring 15): min 30: xid 2075918093

be considered having started in that moment. And we expand the size of
the mapping by 8 at the same time, filling the new buckets with the same
xid. Despite there being a continuous workload.

After a few more minutes we get:

2020-04-01 14:07:00.000 PDT [1268503][3/1473617:2077202085] WARNING: head 15 min: updating existing bucket 4 for whenTaken 19 min, with xmin 2077202085
2020-04-01 14:07:00.000 PDT [1268503][3/1473617:2077202085] WARNING: old snapshot mapping at "after update" with head ts: 15, current entries: 16 max entries: 20, offset: 0
entry 0 (ring 0): min 15: xid 2068447214
entry 1 (ring 1): min 16: xid 2076238895
entry 2 (ring 2): min 17: xid 2076559154
entry 3 (ring 3): min 18: xid 2076880731
entry 4 (ring 4): min 19: xid 2077202085
entry 5 (ring 5): min 20: xid 2072395700
entry 6 (ring 6): min 21: xid 2072715464
entry 7 (ring 7): min 22: xid 2073035816
entry 8 (ring 8): min 23: xid 2075918093
entry 9 (ring 9): min 24: xid 2075918093
entry 10 (ring 10): min 25: xid 2075918093
entry 11 (ring 11): min 26: xid 2075918093
entry 12 (ring 12): min 27: xid 2075918093
entry 13 (ring 13): min 28: xid 2075918093
entry 14 (ring 14): min 29: xid 2075918093
entry 15 (ring 15): min 30: xid 2075918093

Note how the xids are not monotonically ordered. And how IsLimited still
won't be able to make use of the mapping, as the head timestamp is only
4 minutes old (whenTaken == 19 min, head == 15min).

Greetings,

Andres Freund

#20

pg@bowt.ie

almost 6 years ago

In reply to: Robert Haas (#18)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote:

Maybe that contrib module could even have some functions to simulate
aging without the passage of any real time. Like, say you have a
function or procedure old_snapshot_pretend_time_has_passed(integer),
and it moves oldSnapshotControl->head_timestamp backwards by that
amount. Maybe that would require updating some other fields in
oldSnapshotControl too but it doesn't seem like we'd need to do a
whole lot.

I like that idea. I think that I've spotted what may be an independent
bug, but I have to wait around for a minute or two to reproduce it
each time. Makes it hard to get to a minimal test case.

--
Peter Geoghegan

#21

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#20)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

I like that idea. I think that I've spotted what may be an independent
bug, but I have to wait around for a minute or two to reproduce it
each time. Makes it hard to get to a minimal test case.

I now have simple steps to reproduce a bug when I start Postgres
master with "--old_snapshot_threshold=1" (1 minute).

This example shows wrong answers to queries in session 2:

Session 1:

pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
CREATE TABLE
pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
CREATE INDEX
pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
generate_series(1, 500) i;
INSERT 0 500

Session 2 starts, and views the data in a serializable transaction:

pg@regression:5432 [1444124]=# begin isolation level serializable ;
BEGIN
pg@regression:5432 [1444124]=*# select col from snapshot_bug where col

= 0 order by col limit 14;

┌─────┐
│ col │
├─────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
│ 10 │
│ 11 │
│ 12 │
│ 13 │
│ 14 │
└─────┘
(14 rows)

So far so good. Now session 2 continues:

pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
DELETE 14

Session 1:

(repeats the same "select col from snapshot_bug where col >= 0 order
by col limit 14" query every 100 ms using psql's \watch 0.1)

Session 2:

pg@regression:5432 [1444078]=# vacuum snapshot_bug ;
VACUUM

Before too long, we see the following over in session 2 -- the answer
the query gives changes, even though this is a serializable
transaction:

Wed 01 Apr 2020 03:12:59 PM PDT (every 0.1s)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│ 15 │
│ 16 │
│ 17 │
│ 18 │
│ 19 │
│ 20 │
│ 21 │
│ 22 │
│ 23 │
│ 24 │
│ 25 │
│ 26 │
│ 27 │
│ 28 │
└─────┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

We continue to get this wrong answer for almost another minute (at
least on this occasion). Eventually we get "snapshot too old". Note
that the answer changes when we cross the "minute threshold"

Andres didn't explain anything to me that contributed to finding the
bug (though it could be a known bug, I don't think that it is). It
took me a surprisingly short amount of time to stumble upon this bug
-- I didn't find it because I have good intuitions about how to break
the feature.

--
Peter Geoghegan

#22

andres@anarazel.de

almost 6 years ago

In reply to: Andres Freund (#19)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 14:11:11 -0700, Andres Freund wrote:

As far as I can tell, with a large old_snapshot_threshold, it can take a
very long time to get to a head_timestamp that's old enough for
TransactionIdLimitedForOldSnapshots() to do anything. Look at this
trace of a pgbench run with old_snapshot_threshold enabled, showing some of
the debugging output added in the patch upthread.

This is with a threshold of 10min, in a freshly started database:
[...]

I took a lot longer till the queries started to be cancelled. The last
mapping update before that was:

2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING: old snapshot mapping at "before update" with head ts: 31, current entries: 20 max entries: 20, offset: 12
entry 0 (ring 12): min 31: xid 2078468128
entry 1 (ring 13): min 32: xid 2078642746
entry 2 (ring 14): min 33: xid 2078672303
entry 3 (ring 15): min 34: xid 2078700941
entry 4 (ring 16): min 35: xid 2078728729
entry 5 (ring 17): min 36: xid 2078755425
entry 6 (ring 18): min 37: xid 2078781089
entry 7 (ring 19): min 38: xid 2078805567
entry 8 (ring 0): min 39: xid 2078830065
entry 9 (ring 1): min 40: xid 2078611853
entry 10 (ring 2): min 41: xid 2078611853
entry 11 (ring 3): min 42: xid 2078611853
entry 12 (ring 4): min 43: xid 2078611853
entry 13 (ring 5): min 44: xid 2078611853
entry 14 (ring 6): min 45: xid 2078611853
entry 15 (ring 7): min 46: xid 2078611853
entry 16 (ring 8): min 47: xid 2078611853
entry 17 (ring 9): min 48: xid 2078611853
entry 18 (ring 10): min 49: xid 2078611853
entry 19 (ring 11): min 50: xid 2078611853

2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING: head 31 min: updating existing bucket 1 for whenTaken 40 min, with xmin 2078853870
2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING: old snapshot mapping at "after update" with head ts: 31, current entries: 20 max entries: 20, offset: 12
entry 0 (ring 12): min 31: xid 2078468128
entry 1 (ring 13): min 32: xid 2078642746
entry 2 (ring 14): min 33: xid 2078672303
entry 3 (ring 15): min 34: xid 2078700941
entry 4 (ring 16): min 35: xid 2078728729
entry 5 (ring 17): min 36: xid 2078755425
entry 6 (ring 18): min 37: xid 2078781089
entry 7 (ring 19): min 38: xid 2078805567
entry 8 (ring 0 ): min 39: xid 2078830065
entry 9 (ring 1 ): min 40: xid 2078853870
entry 10 (ring 2 ): min 41: xid 2078611853
entry 11 (ring 3 ): min 42: xid 2078611853
entry 12 (ring 4 ): min 43: xid 2078611853
entry 13 (ring 5 ): min 44: xid 2078611853
entry 14 (ring 6 ): min 45: xid 2078611853
entry 15 (ring 7 ): min 46: xid 2078611853
entry 16 (ring 8 ): min 47: xid 2078611853
entry 17 (ring 9 ): min 48: xid 2078611853
entry 18 (ring 10): min 49: xid 2078611853
entry 19 (ring 11): min 50: xid 2078611853

A query ran for fourty minutes during this, without getting cancelled.

A good while later this happens:

2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING: old snapshot mapping at "before update" with head ts: 82, current entries: 20 max entries: 20, offset: 12
entry 0 (ring 12): min 82: xid 2080333207
entry 1 (ring 13): min 83: xid 2080527298
entry 2 (ring 14): min 84: xid 2080566990
entry 3 (ring 15): min 85: xid 2080605960
entry 4 (ring 16): min 86: xid 2080644554
entry 5 (ring 17): min 87: xid 2080682957
entry 6 (ring 18): min 88: xid 2080721936
entry 7 (ring 19): min 89: xid 2080760947
entry 8 (ring 0): min 90: xid 2080799843
entry 9 (ring 1): min 91: xid 2080838696
entry 10 (ring 2): min 92: xid 2080877550
entry 11 (ring 3): min 93: xid 2080915870
entry 12 (ring 4): min 94: xid 2080954151
entry 13 (ring 5): min 95: xid 2080992556
entry 14 (ring 6): min 96: xid 2081030980
entry 15 (ring 7): min 97: xid 2081069403
entry 16 (ring 8): min 98: xid 2081107811
entry 17 (ring 9): min 99: xid 2081146322
entry 18 (ring 10): min 100: xid 2081185023
entry 19 (ring 11): min 101: xid 2081223632

2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING: head 82 min: filling 20 buckets starting at 12 for whenTaken 102 min, with xmin 2081262046
2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING: old snapshot mapping at "after update" with head ts: 102, current entries: 1 max entries: 20, offset: 0
entry 0 (ring 0): min 102: xid 2081262046

The entire mapping reset, i.e. it'll take another fourty minutes for
cancellations to happen.

Greetings,

Andres Freund

#23

andres@anarazel.de

almost 6 years ago

In reply to: Peter Geoghegan (#21)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 15:30:39 -0700, Peter Geoghegan wrote:

On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan <pg@bowt.ie> wrote:

I like that idea. I think that I've spotted what may be an independent
bug, but I have to wait around for a minute or two to reproduce it
each time. Makes it hard to get to a minimal test case.

I now have simple steps to reproduce a bug when I start Postgres
master with "--old_snapshot_threshold=1" (1 minute).

Thanks, that's super helpful.

This example shows wrong answers to queries in session 2:

Session 1:

pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
CREATE TABLE
pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
CREATE INDEX
pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
generate_series(1, 500) i;
INSERT 0 500

Session 2 starts, and views the data in a serializable transaction:

pg@regression:5432 [1444124]=# begin isolation level serializable ;
BEGIN
pg@regression:5432 [1444124]=*# select col from snapshot_bug where col

= 0 order by col limit 14;

┌─────┐
│ col │
├─────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
│ 10 │
│ 11 │
│ 12 │
│ 13 │
│ 14 │
└─────┘
(14 rows)

So far so good. Now session 2 continues:

pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
DELETE 14

I got a bit confused here - you seemed to have switched session 1 and 2
around? Doesn't seem to matter much though, I was able to reproduce this.

This indeed seems a separate bug.

The backtrace to the point where the xmin horizon is affected by
TransactionIdLimitedForOldSnapshots() is:

#0 TransactionIdLimitedForOldSnapshots (recentXmin=2082816071, relation=0x7f52ff3b56f8) at /home/andres/src/postgresql/src/backend/utils/time/snapmgr.c:1870
#1 0x00005567f4cd1a55 in heap_page_prune_opt (relation=0x7f52ff3b56f8, buffer=175) at /home/andres/src/postgresql/src/backend/access/heap/pruneheap.c:106
#2 0x00005567f4cc70e2 in heapam_index_fetch_tuple (scan=0x5567f6db3028, tid=0x5567f6db2e40, snapshot=0x5567f6d67d68, slot=0x5567f6db1b60,
call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de) at /home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:137
#3 0x00005567f4cdf5e6 in table_index_fetch_tuple (scan=0x5567f6db3028, tid=0x5567f6db2e40, snapshot=0x5567f6d67d68, slot=0x5567f6db1b60,
call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de) at /home/andres/src/postgresql/src/include/access/tableam.h:1020
#4 0x00005567f4ce0767 in index_fetch_heap (scan=0x5567f6db2de0, slot=0x5567f6db1b60) at /home/andres/src/postgresql/src/backend/access/index/indexam.c:577
#5 0x00005567f4f19191 in IndexOnlyNext (node=0x5567f6db16a0) at /home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:169
#6 0x00005567f4ef4bc4 in ExecScanFetch (node=0x5567f6db16a0, accessMtd=0x5567f4f18f20 <IndexOnlyNext>, recheckMtd=0x5567f4f1951c <IndexOnlyRecheck>)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:133
#7 0x00005567f4ef4c39 in ExecScan (node=0x5567f6db16a0, accessMtd=0x5567f4f18f20 <IndexOnlyNext>, recheckMtd=0x5567f4f1951c <IndexOnlyRecheck>)
at /home/andres/src/postgresql/src/backend/executor/execScan.c:182
#8 0x00005567f4f195d4 in ExecIndexOnlyScan (pstate=0x5567f6db16a0) at /home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:317
#9 0x00005567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db16a0) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#10 0x00005567f4f1d694 in ExecProcNode (node=0x5567f6db16a0) at /home/andres/src/postgresql/src/include/executor/executor.h:245
#11 0x00005567f4f1d7d2 in ExecLimit (pstate=0x5567f6db14b8) at /home/andres/src/postgresql/src/backend/executor/nodeLimit.c:95
#12 0x00005567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db14b8) at /home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#13 0x00005567f4ee57c3 in ExecProcNode (node=0x5567f6db14b8) at /home/andres/src/postgresql/src/include/executor/executor.h:245
#14 0x00005567f4ee83dd in ExecutePlan (estate=0x5567f6db1280, planstate=0x5567f6db14b8, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true,
numberTuples=0, direction=ForwardScanDirection, dest=0x5567f6db3c78, execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:1646
#15 0x00005567f4ee5e23 in standard_ExecutorRun (queryDesc=0x5567f6d0c490, direction=ForwardScanDirection, count=0, execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:364
#16 0x00005567f4ee5c35 in ExecutorRun (queryDesc=0x5567f6d0c490, direction=ForwardScanDirection, count=0, execute_once=true)
at /home/andres/src/postgresql/src/backend/executor/execMain.c:308
#17 0x00005567f510c4de in PortalRunSelect (portal=0x5567f6d49260, forward=true, count=0, dest=0x5567f6db3c78)
at /home/andres/src/postgresql/src/backend/tcop/pquery.c:912
#18 0x00005567f510c191 in PortalRun (portal=0x5567f6d49260, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x5567f6db3c78,
altdest=0x5567f6db3c78, qc=0x7ffce13d7de0) at /home/andres/src/postgresql/src/backend/tcop/pquery.c:756
#19 0x00005567f5106015 in exec_simple_query (query_string=0x5567f6cdd7a0 "select col from snapshot_bug where col >= 0 order by col limit 14;")
at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1239

which in my tree is the elog() in the block below:
if (!same_ts_as_threshold)
{
if (ts == update_ts)
{
PrintOldSnapshotMapping("non cached limit via update_ts", false);

xlimit = latest_xmin;
if (NormalTransactionIdFollows(xlimit, recentXmin))
{
elog(LOG, "increasing threshold from %u to %u (via update_ts)",
recentXmin, xlimit);
SetOldSnapshotThresholdTimestamp(ts, xlimit);
}
}

the mapping at that point is:

2020-04-01 16:14:00.025 PDT [1272381][4/2:0] WARNING: old snapshot mapping at "non cached limit via update_ts" with head ts: 1, current entries: 2 max entries: 11, offset: 0
entry 0 (ring 0): min 1: xid 2082816067
entry 1 (ring 1): min 2: xid 2082816071

and the xmin changed is:
2020-04-01 16:14:00.026 PDT [1272381][4/2:0] LOG: increasing threshold from 2082816071 to 2082816072 (via update_ts)

in the frame of heap_prune_page_opt():
(rr) p snapshot->whenTaken
$5 = 639097973135655
(rr) p snapshot->lsn
$6 = 133951784192
(rr) p MyPgXact->xmin
$7 = 2082816071
(rr) p BufferGetBlockNumber(buffer)
$11 = 0

in the frame for TransactionIdLimitedForOldSnapshots:
(rr) p ts
$8 = 639098040000000
(rr) p latest_xmin
$9 = 2082816072
(rr) p update_ts
$10 = 639098040000000

The primary issue here is that there is no TestForOldSnapshot() in
heap_hot_search_buffer(). Therefore index fetches will never even try to
detect that tuples it needs actually have already been pruned away.

The wrong queries I saw took longer to reproduce, so I've not been able
to debug the precise reasons.

Greetings,

Andres Freund

#24

andres@anarazel.de

almost 6 years ago

In reply to: Andres Freund (#23)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 16:59:51 -0700, Andres Freund wrote:

The primary issue here is that there is no TestForOldSnapshot() in
heap_hot_search_buffer(). Therefore index fetches will never even try to
detect that tuples it needs actually have already been pruned away.

FWIW, with autovacuum=off the query does not get killed until a manual
vacuum, nor if fewer rows are deleted and the table has previously been
vacuumed.

The vacuum in the second session isn't required. There just needs to be
something consuming an xid, so that oldSnapshotControl->latest_xmin is
increased. A single SELECT txid_current(); or such in a separate session
is sufficient.

Greetings,

Andres Freund

#25

pg@bowt.ie

almost 6 years ago

In reply to: Andres Freund (#23)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 4:59 PM Andres Freund <andres@anarazel.de> wrote:

Thanks, that's super helpful.

Glad I could help.

I got a bit confused here - you seemed to have switched session 1 and 2
around? Doesn't seem to matter much though, I was able to reproduce this.

Yeah, I switched the session numbers because I was in a hurry. Sorry about that.

As you have already worked out, one session does all the DDL and
initial loading of data, while the other session queries the data
repeatedly in a serializable (or RR) xact. The latter session exhibits
the bug.

This indeed seems a separate bug.

The primary issue here is that there is no TestForOldSnapshot() in
heap_hot_search_buffer(). Therefore index fetches will never even try to
detect that tuples it needs actually have already been pruned away.

I suspected that heap_hot_search_buffer() was missing something.

The wrong queries I saw took longer to reproduce, so I've not been able
to debug the precise reasons.

How hard would it be to write a debug patch that reduces the quantum
used in places like TransactionIdLimitedForOldSnapshots() to something
much less than the current 1 minute quantum? That made reproducing the
bug *very* tedious.

--
Peter Geoghegan

#26

andres@anarazel.de

almost 6 years ago

In reply to: Andres Freund (#23)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 16:59:51 -0700, Andres Freund wrote:

The primary issue here is that there is no TestForOldSnapshot() in
heap_hot_search_buffer(). Therefore index fetches will never even try to
detect that tuples it needs actually have already been pruned away.

bitmap heap scan doesn't have the necessary checks either. In the
non-lossy case it uses heap_hot_search_buffer, for the lossy case it has
an open coded access without the check (that's bitgetpage() before v12,
and heapam_scan_bitmap_next_block() after that).

Nor do sample scans, but that was "at least" introduced later.

As far as I can tell there's not sufficient in-tree explanation of when
code needs to test for an old snapshot. There's just the following
comment above TestForOldSnapshot():
* Check whether the given snapshot is too old to have safely read the given
* page from the given table. If so, throw a "snapshot too old" error.
*
* This test generally needs to be performed after every BufferGetPage() call
* that is executed as part of a scan. It is not needed for calls made for
* modifying the page (for example, to position to the right place to insert a
* new index tuple or for vacuuming). It may also be omitted where calls to
* lower-level functions will have already performed the test.

But I don't find "as part of a scan" very informative. I mean, it
was explicitly not called from with the executor back then (for a while
the check was embedded in BufferGetPage()):

static void
bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
...
Page dp = BufferGetPage(buffer, NULL, NULL, BGP_NO_SNAPSHOT_TEST);

I am more than a bit dumbfounded here.

Greetings,

Andres Freund

#27

pg@bowt.ie

almost 6 years ago

In reply to: Andres Freund (#26)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:

As far as I can tell there's not sufficient in-tree explanation of when
code needs to test for an old snapshot. There's just the following
comment above TestForOldSnapshot():
* Check whether the given snapshot is too old to have safely read the given
* page from the given table. If so, throw a "snapshot too old" error.
*
* This test generally needs to be performed after every BufferGetPage() call
* that is executed as part of a scan. It is not needed for calls made for
* modifying the page (for example, to position to the right place to insert a
* new index tuple or for vacuuming). It may also be omitted where calls to
* lower-level functions will have already performed the test.

But I don't find "as part of a scan" very informative.

I also find it strange that _bt_search() calls TestForOldSnapshot() on
every level on the tree (actually, it calls _bt_moveright() which
calls it on every level of the tree). At least with reads (see the
comments at the top of _bt_moveright()).

Why do we need to do the test on internal pages? We only ever call
PredicateLockPage() on a leaf nbtree page. Why the inconsistency
between the two similar-seeming cases?

--
Peter Geoghegan

#28

andres@anarazel.de

almost 6 years ago

In reply to: Andres Freund (#26)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 17:54:06 -0700, Andres Freund wrote:

* Check whether the given snapshot is too old to have safely read the given
* page from the given table. If so, throw a "snapshot too old" error.
*
* This test generally needs to be performed after every BufferGetPage() call
* that is executed as part of a scan. It is not needed for calls made for
* modifying the page (for example, to position to the right place to insert a
* new index tuple or for vacuuming). It may also be omitted where calls to
* lower-level functions will have already performed the test.

To me this sounds like we'd not need to check for an old snapshot in
heap_delete/update/lock_tuple. And they were explictly not testing for
old snapshots. But I don't understand why that'd be correct?

In a lot of UPDATE/DELETE queries there's no danger that the target
tuple will be pruned away, because the underlying scan node will hold a
pin. But I don't think that's guaranteed. E.g. if a tidscan is below the
ModifyTable node, it will not hold a pin by the time we heap_update,
because there's no scan holding a pin, and the slot will have been
materialized before updating.

There are number of other ways, I think.

So it's possible to get to heap_update/delete (and probably lock_tuple
as well) with a tid that's already been pruned away. Neither contains a
non-assert check ensuring the tid still is normal.

With assertions we'd fail with an assertion in PageGetItem(). But
without it looks like we'll interpret the page header as a tuple. Which
can't be good.

Greetings,

Andres Freund

#29

kgrittn@gmail.com

almost 6 years ago

In reply to: Andres Freund (#23)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:

index fetches will never even try to
detect that tuples it needs actually have already been pruned away.

I looked at this flavor of problem today and from what I saw:

(1) This has been a problem all the way back to 9.6.0.
(2) The behavior is correct if the index creation is skipped or if
enable_indexscan is turned off in the transaction, confirming Andres'
analysis.
(3) Pruning seems to happen as intended; the bug found by Peter seems to be
entirely about failing to TestForOldSnapshot() where needed.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

#30

kgrittn@gmail.com

almost 6 years ago

In reply to: Andres Freund (#24)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 7:17 PM Andres Freund <andres@anarazel.de> wrote:

FWIW, with autovacuum=off the query does not get killed until a manual
vacuum, nor if fewer rows are deleted and the table has previously been
vacuumed.

The vacuum in the second session isn't required. There just needs to be
something consuming an xid, so that oldSnapshotControl->latest_xmin is
increased. A single SELECT txid_current(); or such in a separate session
is sufficient.

Agreed. I don't see that part as a problem; if no xids are being consumed,
it's hard to see how we could be heading into debilitating levels of bloat,
so there is no need to perform the early pruning. It would not be worth
consuming any cycles to ensure that pruning happens sooner than it does in
this case. It's OK for it to happen any time past the moment that the
snapshot hits the threshold, but it's also OK for it to wait until a vacuum
of the table or until some activity consumes an xid.

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

#31

andres@anarazel.de

almost 6 years ago

In reply to: Kevin Grittner (#30)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On April 2, 2020 9:36:32 AM PDT, Kevin Grittner <kgrittn@gmail.com> wrote:

On Wed, Apr 1, 2020 at 7:17 PM Andres Freund <andres@anarazel.de>
wrote:

FWIW, with autovacuum=off the query does not get killed until a

manual

vacuum, nor if fewer rows are deleted and the table has previously

been

vacuumed.

The vacuum in the second session isn't required. There just needs to

be

something consuming an xid, so that oldSnapshotControl->latest_xmin

is

increased. A single SELECT txid_current(); or such in a separate

session

is sufficient.

Agreed. I don't see that part as a problem; if no xids are being
consumed,
it's hard to see how we could be heading into debilitating levels of
bloat,
so there is no need to perform the early pruning. It would not be
worth
consuming any cycles to ensure that pruning happens sooner than it does
in
this case. It's OK for it to happen any time past the moment that the
snapshot hits the threshold, but it's also OK for it to wait until a
vacuum
of the table or until some activity consumes an xid.

The point about txid being sufficient was just about simplifying the reproducer for wrong query results.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#32

pg@bowt.ie

almost 6 years ago

In reply to: Andres Freund (#1)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Mar 31, 2020 at 11:40 PM Andres Freund <andres@anarazel.de> wrote:

The problem, as far as I can tell, is that
oldSnapshotControl->head_timestamp appears to be intended to be the
oldest value in the ring. But we update it unconditionally in the "need
a new bucket, but it might not be the very next one" branch of
MaintainOldSnapshotTimeMapping().

While there's not really a clear-cut comment explaining whether
head_timestamp() is intended to be the oldest or the newest timestamp,
it seems to me that the rest of the code treats it as the "oldest"
timestamp.

At first, I was almost certain that it's supposed to be the oldest
based only on the OldSnapshotControlData struct fields themselves. It
seemed pretty unambiguous:

int head_offset; /* subscript of oldest tracked time */
TimestampTz head_timestamp; /* time corresponding to head xid */

(Another thing that supports this interpretation is the fact that
there is a separate current_timestamp latest timestamp field in
OldSnapshotControlData.)

But then I took another look at the "We need a new bucket, but it
might not be the very next one" branch. It does indeed seem to
directly contradict the OldSnapshotControlData comments/documentation.
Note just the code itself, either. Even comments from this "new
bucket" branch disagree with the OldSnapshotControlData comments:

if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)
{
/* Map full and new value replaces old head. */
int old_head = oldSnapshotControl->head_offset;

if (old_head == (OLD_SNAPSHOT_TIME_MAP_ENTRIES - 1))
oldSnapshotControl->head_offset = 0;
else
oldSnapshotControl->head_offset = old_head + 1;
oldSnapshotControl->xid_by_minute[old_head] = xmin;
}

Here, the comment says the map (circular buffer) is full, and that we
must replace the current head with a *new* value/timestamp (the one we
just got in GetSnapshotData()). It looks as if the design of the data
structure changed during the development of the original patch, but
this entire branch was totally overlooked.

In conclusion, I share Andres' concerns here. There are glaring
problems with how we manipulate the data structure that controls the
effective horizon for pruning. Maybe they can be fixed while leaving
the code that manages the OldSnapshotControl circular buffer in
something resembling its current form, but I doubt it. In my opinion,
there is no approach to fixing "snapshot too old" that won't have some
serious downside.

--
Peter Geoghegan

#33

pg@bowt.ie

almost 6 years ago

In reply to: Peter Geoghegan (#32)

Re: snapshot too old issues, first around wraparound and then more.

On Thu, Apr 2, 2020 at 11:28 AM Peter Geoghegan <pg@bowt.ie> wrote:

In conclusion, I share Andres' concerns here. There are glaring
problems with how we manipulate the data structure that controls the
effective horizon for pruning. Maybe they can be fixed while leaving
the code that manages the OldSnapshotControl circular buffer in
something resembling its current form, but I doubt it. In my opinion,
there is no approach to fixing "snapshot too old" that won't have some
serious downside.

I'll add something that might be constructive: It would probably be a
good idea to introduce a function like syncrep.c's
SyncRepQueueIsOrderedByLSN() function, which is designed to be called
by assertions only. That would both clearly document and actually
verify the circular buffer/OldSnapshotControl data structure's
invariants.

--
Peter Geoghegan

#34

andres@anarazel.de

almost 6 years ago

In reply to: Robert Haas (#7)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

I just spend a good bit more time improving my snapshot patch, so it
could work well with a fixed version of the old_snapshot_threshold
feature. Mostly so there's no unnecessary dependency on the resolution
of the issues in that patch.

When testing my changes, for quite a while, I could not get
src/test/modules/snapshot_too_old/ to trigger a single too-old error.

It turns out, that's because there's not a single tuple removed due to
old_snapshot_threshold in src/test/modules/snapshot_too_old/. The only
reason the current code triggers any such errors is that

a) TransactionIdLimitedForOldSnapshots() is always called in
heap_page_prune_opt(), even if the "not limited" horizon
(i.e. RecentGlobalDataXmin) is more than old enough to allow for
pruning. That includes pages without a pd_prune_xid.

b) TransactionIdLimitedForOldSnapshots(), in the old_snapshot_threshold
== 0 branch, always calls
SetOldSnapshotThresholdTimestamp(ts, xlimit)
even if the horizon has not changed due to snapshot_too_old (xlimit
is initially set tot the "input" horizon, and only increased if
between (recentXmin, MyProc->xmin)).

To benefit from the snapshot scalability improvements in my patchset, it
is important to avoid unnecessarily determining the "accurate" xmin
horizon, if it's clear from the "lower boundary" horizon that pruning
can happen. Therefore I changed heap_page_prune_opt() and
heap_page_prune() to only limit when we couldn't prune.

In the course of that I separated getting the horizon from
TransactionIdLimitedForOldSnapshots() and triggering errors when an
already removed tuple would be needed via
TransactionIdLimitedForOldSnapshots().

Because there are no occasions to actually remove tuples in the entire
test, there now were no TransactionIdLimitedForOldSnapshots() calls. And
thus no errors. My code turns out to actually work.

Thus, if I change the code in master from:
TransactionId xlimit = recentXmin;
...
if (old_snapshot_threshold == 0)
{
if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
&& TransactionIdFollows(latest_xmin, xlimit))
xlimit = latest_xmin;

ts -= 5 * USECS_PER_SEC;
SetOldSnapshotThresholdTimestamp(ts, xlimit);

return xlimit;
}

to
...
if (old_snapshot_threshold == 0)
{
if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
&& TransactionIdFollows(latest_xmin, xlimit))
{
xlimit = latest_xmin;
SetOldSnapshotThresholdTimestamp(ts, xlimit);
}

ts -= 5 * USECS_PER_SEC;

return xlimit;
}

there's not a single error raised in the existing tests. Not a *single*
tuple removal is caused by old_snapshot_threshold. We just test the
order of SetOldSnapshotThresholdTimestamp() calls. We have code in the
backend to support testing old_snapshot_threshold, but we don't test
anything meaningful in the feature. We basically test a oddly behaving
version version of "transaction_timeout = 5s". I can't emphasize enough
how baffling I find this.

Greetings,

Andres Freund

#35

[1]: /messages/by-id/20200403001235.e6jfdll3gh2ygbuc@alap3.anarazel.de

andres@anarazel.de

almost 6 years ago

In reply to: Robert Haas (#7)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:

I have no objection to the idea that *if* the feature is hopelessly
broken, it should be removed.

I don't think we have a real choice here at this point, at least for the
back branches.

Just about nothing around old_snapshot_threshold works correctly:

* There are basically no tests (see [1]/messages/by-id/20200403001235.e6jfdll3gh2ygbuc@alap3.anarazel.de I jsut sent, and also
old_snapshot_threshold bypassing a lot of the relevant code).

* We don't detect errors after hot pruning (to allow that is a major
point of the feature) when access is via any sort of index
scans. Wrong query results.

* The time->xid mapping is is entirely broken. We don't prevent bloat
for many multiples of old_snapshot_threshold (if above 1min).

It's possible, but harder, to have this cause wrong query results.

* In read-mostly workloads it can trigger errors in sessions that are
much younger than old_snapshot_threshold, if the transactionid is not
advancing.

I've not tried to reproduce, but I suspect this can also cause wrong
query results. Because a later snapshot can have the same xmin as
older transactions, it sure looks like we can end up with data from an
older xmin getting removed, but the newer snapshot's whenTaken will
prevent TestForOldSnapshot_impl() from raising an error.

* I am fairly sure that it can cause crashes (or even data corruption),
because it assumes that DML never needs to check for old snapshots
(with no meaningful justificiation). Leading to heap_update/delete to
assume the page header is a tuple.

* There's obviously also the wraparound issue that made me start this
thread initially.

Since this is a feature that can result in wrong query results (and
quite possibly crashes / data corruption), I don't think we can just
leave this unfixed. But given the amount of code / infrastructure
changes required to get this into a working feature, I don't see how we
can unleash those changes onto the stable branches.

There's quite a few issues in here that require not just local bugfixes,
but some design changes too. And it's pretty clear that the feature
didn't go through enough review before getting committed. I see quite
some merit in removing the code in master, and having a potential
reimplementation go through a normal feature integration process.

I don't really know what to do here. Causing problems by neutering a
feature in the back branch *sucks*. While not quite as bad, removing a
feature without a replacement in a major release is pretty harsh
too. But I don't really see any other realistic path forward.

FWIW, I've now worked around the interdependency of s_t_o my snapshot
scalability patch (only took like 10 days). I have manually confirmed it
works with 0/1 minute thresholds. I can make the tests pass unmodified
if I just add SetOldSnapshotThresholdTimestamp() calls when not
necessary (which obviously makes no sense). Lead to some decent
improvements around pruning that are independent of s_t_o (with more
possibilities "opened"). But I still think we need to do something
here.

Greetings,

Andres Freund

#36

pg@bowt.ie

almost 6 years ago

In reply to: Andres Freund (#35)

Re: snapshot too old issues, first around wraparound and then more.

On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres@anarazel.de> wrote:

Since this is a feature that can result in wrong query results (and
quite possibly crashes / data corruption), I don't think we can just
leave this unfixed. But given the amount of code / infrastructure
changes required to get this into a working feature, I don't see how we
can unleash those changes onto the stable branches.

I don't think that the feature can be allowed to remain in anything
like its current form. The current design is fundamentally unsound.

I don't really know what to do here. Causing problems by neutering a
feature in the back branch *sucks*. While not quite as bad, removing a
feature without a replacement in a major release is pretty harsh
too. But I don't really see any other realistic path forward.

I have an idea that might allow us to insulate some users from the
problem caused by a full revert (or disabling the feature) in the
backbranches. I wouldn't usually make such a radical suggestion, but
the current situation is exceptional. Anything that avoids serious
pain for users deserves to be considered.

Kevin said this about the feature very recently:

"""
Keep in mind that the real goal of this feature is not to eagerly
_see_ "snapshot too old" errors, but to prevent accidental
debilitating bloat due to one misbehaving user connection. This is
particularly easy to see (and therefore unnervingly common) for those
using ODBC, which in my experience tends to correspond to the largest
companies which are using PostgreSQL. In some cases, the snapshot
which is preventing removal of the rows will never be used again;
removal of the rows will not actually affect the result of any query,
but only the size and performance of the database. This is a "soft
limit" -- kinda like max_wal_size. Where there was a trade-off
between accuracy of the limit and performance, the less accurate way
was intentionally chosen. I apologize for not making that more clear
in comments.
"""

ODBC uses cursors in rather strange ways, often to implement a kind of
ODBC-level cache. See the description of "Use Declare/Fetch" from
https://odbc.postgresql.org/docs/config.html to get some idea of what
this can look like.

I think that it's worth considering whether or not there are a
significant number of "snapshot too old" users that rarely or never
rely on old snapshots used by new queries. Kevin said that this
happens "in some cases", but how many cases? Might it be that many
"snapshot too old" users could get by with a version of the feature
that makes the most conservative possible assumptions, totally giving
up on the idea of differentiating which blocks are truly safe to
access with an "old" snapshot? (In other words, one that assumes that
they're *all* unsafe for an "old" snapshot.)

I'm thinking of a version of "snapshot too old" that amounts to a
statement timeout that gets applied for xmin horizon type purposes in
the conventional way, while only showing an error to the client if and
when they access literally any buffer (though not when the relation is
a system catalog). Is it possible that something along those lines is
appreciably better than nothing to users? If it is, and if we can find
a way to manage the transition, then maybe we could tolerate
supporting this greatly simplified implementation of "snapshot too
old".

I feel slightly silly for even suggesting this. I have to ask. Maybe
nobody noticed a problem with the feature before now (at least in
part) because they didn't truly care about old snapshots anyway. They
just wanted to avoid a significant impact from buggy code that leaks
cursors and things like that. Or, they were happy as long as they
could still access ODBC's "100 rows in a cache" through the cursor.
The docs say that a old_snapshot_threshold setting in the hours is
about the lowest reasonable setting for production use, which seems
rather high to me. It almost seems as if the feature specifically
targets misbehaving applications already.

--
Peter Geoghegan

#37

Amit Kapila

amit.kapila16@gmail.com

almost 6 years ago

In reply to: Peter Geoghegan (#36)

Re: snapshot too old issues, first around wraparound and then more.

On Fri, Apr 3, 2020 at 6:52 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres@anarazel.de> wrote:

Since this is a feature that can result in wrong query results (and
quite possibly crashes / data corruption), I don't think we can just
leave this unfixed. But given the amount of code / infrastructure
changes required to get this into a working feature, I don't see how we
can unleash those changes onto the stable branches.

As per my initial understanding, the changes required are (a) There
seem to be multiple places where TestForOldSnapshot is missing, (b)
TestForOldSnapshot itself need to be reviewed carefully to see if it
has problems, (c) Some of the members of OldSnapshotControlData like
head_timestamp and xid_by_minute are not maintained accurately, (d)
handling of wraparound for xids in the in-memory data-structure for
this feature is required, (e) test infrastructure is not good enough
to catch bugs or improve this feature.

Now, this sounds like a quite of work but OTOH, if we see most of the
critical changes required will be in only a few functions like
TransactionIdLimitedForOldSnapshots(),
MaintainOldSnapshotTimeMapping(), TestForOldSnapshot(). I don't deny
the possibility that we might need much more work or we need to come
up with quite a different design to address all these problems but
unless Kevin or someone else doesn't come up with a solution to
address all of these problems, we can't be sure of that.

I don't think that the feature can be allowed to remain in anything
like its current form. The current design is fundamentally unsound.

Agreed, but OTOH, not giving time to Kevin or others who might be
interested to support this work is also not fair. I think once
somebody comes up with patches for problems we can decide whether this
feature can be salvaged in back-branches or we need to disable it in a
hard-way. Now, if Kevin himself is not interested in fixing or nobody
shows up to help, then surely we can take the decision sooner but
giving time for a couple of weeks (or even till we are near for PG13
release) in this case doesn't seem like a bad idea.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#38

andres@anarazel.de

almost 6 years ago

In reply to: Amit Kapila (#37)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-03 14:32:09 +0530, Amit Kapila wrote:

On Fri, Apr 3, 2020 at 6:52 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres@anarazel.de> wrote:

Since this is a feature that can result in wrong query results (and
quite possibly crashes / data corruption), I don't think we can just
leave this unfixed. But given the amount of code / infrastructure
changes required to get this into a working feature, I don't see how we
can unleash those changes onto the stable branches.

As per my initial understanding, the changes required are (a) There
seem to be multiple places where TestForOldSnapshot is missing, (b)
TestForOldSnapshot itself need to be reviewed carefully to see if it
has problems, (c) Some of the members of OldSnapshotControlData like
head_timestamp and xid_by_minute are not maintained accurately, (d)
handling of wraparound for xids in the in-memory data-structure for
this feature is required, (e) test infrastructure is not good enough
to catch bugs or improve this feature.

And a bunch more correctness issues. But basically, yes.

When you say "(c) Some of the members of OldSnapshotControlData like
head_timestamp and xid_by_minute are not maintained accurately)" - note
that that's the core state for the whole feature.

With regards to test: "not good enough" is somewhat of an
understatement. Not a *single* tuple is removed in the tests due to
old_snapshot_threshold - and removing tuples is the entire point.

Now, this sounds like a quite of work but OTOH, if we see most of the
critical changes required will be in only a few functions like
TransactionIdLimitedForOldSnapshots(),
MaintainOldSnapshotTimeMapping(), TestForOldSnapshot().

I don't think that's really the case. Every place reading a buffer needs
to be inspected, and new calls added. They aren't free, and I'm not sure
all of them have the relevant snapshot available. To fix the issue of
spurious errors, we'd likely need changes outside of those, and it'd
quite possibly have performance / bloat implications.

I don't deny the possibility that we might need much more work or we
need to come up with quite a different design to address all these
problems but unless Kevin or someone else doesn't come up with a
solution to address all of these problems, we can't be sure of that.

I don't think that the feature can be allowed to remain in anything
like its current form. The current design is fundamentally unsound.

Agreed, but OTOH, not giving time to Kevin or others who might be
interested to support this work is also not fair. I think once
somebody comes up with patches for problems we can decide whether this
feature can be salvaged in back-branches or we need to disable it in a
hard-way. Now, if Kevin himself is not interested in fixing or nobody
shows up to help, then surely we can take the decision sooner but
giving time for a couple of weeks (or even till we are near for PG13
release) in this case doesn't seem like a bad idea.

It'd certainly be great if somebody came up with fixes, yes. Even if we
had to disable it in the back branches, that'd allow us to keep the
feature around, at least.

The likelihood of regressions even when the feature is not on does not
seem that low. But you're right, we'll be able to better judge it with
fixes to look at.

Greetings,

Andres Freund

#39

Amit Kapila

amit.kapila16@gmail.com

almost 6 years ago

In reply to: Andres Freund (#38)

Re: snapshot too old issues, first around wraparound and then more.

On Sat, Apr 4, 2020 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:

On 2020-04-03 14:32:09 +0530, Amit Kapila wrote:

Agreed, but OTOH, not giving time to Kevin or others who might be
interested to support this work is also not fair. I think once
somebody comes up with patches for problems we can decide whether this
feature can be salvaged in back-branches or we need to disable it in a
hard-way. Now, if Kevin himself is not interested in fixing or nobody
shows up to help, then surely we can take the decision sooner but
giving time for a couple of weeks (or even till we are near for PG13
release) in this case doesn't seem like a bad idea.

It'd certainly be great if somebody came up with fixes, yes. Even if we
had to disable it in the back branches, that'd allow us to keep the
feature around, at least.

The likelihood of regressions even when the feature is not on does not
seem that low.

Yeah, that is the key point. IIRC, when this feature got added Kevin
and others spent a lot of effort to ensure that point.

But you're right, we'll be able to better judge it with
fixes to look at.

I am hoping Kevin would take the lead and then others also can help.
Kevin, please do let us know if you are *not* planning to work on the
issues raised in this thread so that we can think of an alternative?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#40

thomas.munro@gmail.com

almost 6 years ago

In reply to: Peter Geoghegan (#36)

Re: snapshot too old issues, first around wraparound and then more.

On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that it's worth considering whether or not there are a
significant number of "snapshot too old" users that rarely or never
rely on old snapshots used by new queries. Kevin said that this
happens "in some cases", but how many cases? Might it be that many
"snapshot too old" users could get by with a version of the feature
that makes the most conservative possible assumptions, totally giving
up on the idea of differentiating which blocks are truly safe to
access with an "old" snapshot? (In other words, one that assumes that
they're *all* unsafe for an "old" snapshot.)

I'm thinking of a version of "snapshot too old" that amounts to a
statement timeout that gets applied for xmin horizon type purposes in
the conventional way, while only showing an error to the client if and
when they access literally any buffer (though not when the relation is
a system catalog). Is it possible that something along those lines is
appreciably better than nothing to users? If it is, and if we can find
a way to manage the transition, then maybe we could tolerate
supporting this greatly simplified implementation of "snapshot too
old".

Hi Peter,

Interesting idea. I'm keen to try prototyping it to see how well it
works out it practice. Let me know soon if you already have designs
on that and I'll get out of your way, otherwise I'll give it a try and
share what I come up with.

#41

andres@anarazel.de

almost 6 years ago

In reply to: Thomas Munro (#40)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2020-04-13 14:58:34 +1200, Thomas Munro wrote:

On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that it's worth considering whether or not there are a
significant number of "snapshot too old" users that rarely or never
rely on old snapshots used by new queries. Kevin said that this
happens "in some cases", but how many cases? Might it be that many
"snapshot too old" users could get by with a version of the feature
that makes the most conservative possible assumptions, totally giving
up on the idea of differentiating which blocks are truly safe to
access with an "old" snapshot? (In other words, one that assumes that
they're *all* unsafe for an "old" snapshot.)

I'm thinking of a version of "snapshot too old" that amounts to a
statement timeout that gets applied for xmin horizon type purposes in
the conventional way, while only showing an error to the client if and
when they access literally any buffer (though not when the relation is
a system catalog). Is it possible that something along those lines is
appreciably better than nothing to users? If it is, and if we can find
a way to manage the transition, then maybe we could tolerate
supporting this greatly simplified implementation of "snapshot too
old".

Hi Peter,

Interesting idea. I'm keen to try prototyping it to see how well it
works out it practice. Let me know soon if you already have designs
on that and I'll get out of your way, otherwise I'll give it a try and
share what I come up with.

FWIW, I think the part that is currently harder to fix is the time->xmin
mapping and some related pieces. Second comes the test
infrastructure. Compared to those, adding additional checks for old
snapshots wouldn't be too hard - although I'd argue that the approach of
sprinkling these tests everywhere isn't that scalable...

Greetings,

Andres Freund

#42

thomas.munro@gmail.com

over 5 years ago

In reply to: Thomas Munro (#40)

1 attachment(s)

Re: snapshot too old issues, first around wraparound and then more.

On Mon, Apr 13, 2020 at 2:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:

I think that it's worth considering whether or not there are a
significant number of "snapshot too old" users that rarely or never
rely on old snapshots used by new queries. Kevin said that this
happens "in some cases", but how many cases? Might it be that many
"snapshot too old" users could get by with a version of the feature
that makes the most conservative possible assumptions, totally giving
up on the idea of differentiating which blocks are truly safe to
access with an "old" snapshot? (In other words, one that assumes that
they're *all* unsafe for an "old" snapshot.)

I'm thinking of a version of "snapshot too old" that amounts to a
statement timeout that gets applied for xmin horizon type purposes in
the conventional way, while only showing an error to the client if and
when they access literally any buffer (though not when the relation is
a system catalog). Is it possible that something along those lines is
appreciably better than nothing to users? If it is, and if we can find
a way to manage the transition, then maybe we could tolerate
supporting this greatly simplified implementation of "snapshot too
old".

Interesting idea. I'm keen to try prototyping it to see how well it
works out it practice. Let me know soon if you already have designs
on that and I'll get out of your way, otherwise I'll give it a try and
share what I come up with.

Here's a quick and dirty test patch of that idea (or my understanding
of it), just for experiments. It introduces snapshot->expire_time and
a new timer SNAPSHOT_TIMEOUT to cause the next CHECK_FOR_INTERRUPTS()
to set snapshot->too_old on any active or registered snapshots whose
time has come, and then try to advance MyPgXact->xmin, without
considering the ones marked too old. That gets rid of the concept of
"early pruning". You can use just regular pruning, because the
snapshot is no longer holding the regular xmin back. Then
TestForOldSnapshot() becomes simply if (snapshot->too_old)
ereport(...).

There are certainly some rough edges, missed details and bugs in here,
not least the fact (pointed out to me by Andres in an off-list chat)
that we sometimes use short-lived snapshots without registering them;
we'd have to fix that. It also does nothing to ensure that
TestForOldSnapshot() is actually called at all the right places, which
is still required for correct results.

If those problems can be fixed, you'd have a situation where
snapshot-too-old is a coarse grained, blunt instrument that
effectively aborts your transaction even if the whole cluster is
read-only. I am not sure if that's really truly useful to anyone (ie
if these ODBC cursor users would be satisfied; I'm not sure I
understand that use case).

Hmm. I suppose it must be possible to put the LSN check back: if
(snapshot->too_old && PageGetLSN(page) > snapshot->lsn) ereport(...).
Then the granularity would be the same as today -- block level -- but
the complexity is transferred from the pruning side (has to deal with
xid time map) to the snapshot-owning side (has to deal with timers,
CFI() and make sure all snapshots are registered). Maybe not a great
deal, and maybe not easier than fixing the existing bugs.

One problem is all the new setitimer() syscalls. I feel like that
could be improved, as could statement_timeout, by letting existing
timers run rather than repeatedly rescheduling eagerly, so that eg a 1
minute timeout never gets rescheduled more than once per minute. I
haven't looked into that, but I guess it's no worse than the existing
implement's overheads anyway.

PS in the patch the GUC is interpreted as milliseconds, which is more
fun for testing but it should really be minutes like before.

Attachments:

0001-Implement-snapshot_too_old-using-a-timer.patchtext/x-patch; charset=US-ASCII; name=0001-Implement-snapshot_too_old-using-a-timer.patchDownload

From 8a5455c7e376bd6ceddf956f789cfdede0396f3f Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 15 Apr 2020 12:35:51 +1200
Subject: [PATCH] Implement snapshot_too_old using a timer.

Remove the previous implementation of "snapshot too old", and replace it
with a very conservative implementation based only on time passing.

Use a timer to mark active and registered snapshots as too old and
potentially advance the backend's xmin.  Snapshots that reach this state
can no longer be used to read buffers, because data they need to see may
be vacuumed away.

XXX Incomplete experimental code!
---
 src/backend/access/common/toast_internals.c |  12 +-
 src/backend/access/heap/pruneheap.c         |   4 +-
 src/backend/catalog/index.c                 |  15 +-
 src/backend/commands/vacuum.c               |   3 +-
 src/backend/storage/buffer/bufmgr.c         |  17 -
 src/backend/storage/ipc/ipci.c              |   2 -
 src/backend/storage/ipc/procarray.c         |  23 +-
 src/backend/tcop/postgres.c                 |   6 +
 src/backend/utils/init/globals.c            |   1 +
 src/backend/utils/init/postinit.c           |  10 +
 src/backend/utils/time/snapmgr.c            | 610 ++++----------------
 src/include/miscadmin.h                     |   1 +
 src/include/storage/bufmgr.h                |  16 +-
 src/include/utils/snapmgr.h                 |  47 +-
 src/include/utils/snapshot.h                |   4 +-
 src/include/utils/timeout.h                 |   1 +
 16 files changed, 170 insertions(+), 602 deletions(-)

diff --git a/src/backend/access/common/toast_internals.c b/src/backend/access/common/toast_internals.c
index 25a81e5ec6..66189df0cb 100644
--- a/src/backend/access/common/toast_internals.c
+++ b/src/backend/access/common/toast_internals.c
@@ -615,18 +615,10 @@ toast_close_indexes(Relation *toastidxs, int num_indexes, LOCKMODE lock)
 /* ----------
  * init_toast_snapshot
  *
- *	Initialize an appropriate TOAST snapshot.  We must use an MVCC snapshot
- *	to initialize the TOAST snapshot; since we don't know which one to use,
- *	just use the oldest one.  This is safe: at worst, we will get a "snapshot
- *	too old" error that might have been avoided otherwise.
+ *	Initialize an appropriate TOAST snapshot.
  */
 void
 init_toast_snapshot(Snapshot toast_snapshot)
 {
-	Snapshot	snapshot = GetOldestSnapshot();
-
-	if (snapshot == NULL)
-		elog(ERROR, "no known snapshots");
-
-	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
+	InitToastSnapshot(*toast_snapshot);
 }
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 1794cfd8d9..c2d1ddb092 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -102,9 +102,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 		RelationIsAccessibleInLogicalDecoding(relation))
 		OldestXmin = RecentGlobalXmin;
 	else
-		OldestXmin =
-			TransactionIdLimitedForOldSnapshots(RecentGlobalDataXmin,
-												relation);
+		OldestXmin = RecentGlobalDataXmin;
 
 	Assert(TransactionIdIsValid(OldestXmin));
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bd7ec923e9..da40f4dbcf 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2919,8 +2919,7 @@ index_build(Relation heapRelation,
 	/*
 	 * If we found any potentially broken HOT chains, mark the index as not
 	 * being usable until the current transaction is below the event horizon.
-	 * See src/backend/access/heap/README.HOT for discussion.  Also set this
-	 * if early pruning/vacuuming is enabled for the heap relation.  While it
+	 * See src/backend/access/heap/README.HOT for discussion.  While it
 	 * might become safe to use the index earlier based on actual cleanup
 	 * activity and other active transactions, the test for that would be much
 	 * more complex and would require some form of blocking, so keep it simple
@@ -2936,7 +2935,7 @@ index_build(Relation heapRelation,
 	 *
 	 * We also need not set indcheckxmin during a concurrent index build,
 	 * because we won't set indisvalid true until all transactions that care
-	 * about the broken HOT chains or early pruning/vacuuming are gone.
+	 * about the broken HOT chains are gone.
 	 *
 	 * Therefore, this code path can only be taken during non-concurrent
 	 * CREATE INDEX.  Thus the fact that heap_update will set the pg_index
@@ -2945,7 +2944,7 @@ index_build(Relation heapRelation,
 	 * about any concurrent readers of the tuple; no other transaction can see
 	 * it yet.
 	 */
-	if ((indexInfo->ii_BrokenHotChain || EarlyPruningEnabled(heapRelation)) &&
+	if ((indexInfo->ii_BrokenHotChain) &&
 		!isreindex &&
 		!indexInfo->ii_Concurrent)
 	{
@@ -3584,7 +3583,6 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 		HeapTuple	indexTuple;
 		Form_pg_index indexForm;
 		bool		index_bad;
-		bool		early_pruning_enabled = EarlyPruningEnabled(heapRelation);
 
 		pg_index = table_open(IndexRelationId, RowExclusiveLock);
 
@@ -3598,12 +3596,11 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 					 !indexForm->indisready ||
 					 !indexForm->indislive);
 		if (index_bad ||
-			(indexForm->indcheckxmin && !indexInfo->ii_BrokenHotChain) ||
-			early_pruning_enabled)
+			(indexForm->indcheckxmin && !indexInfo->ii_BrokenHotChain))
 		{
-			if (!indexInfo->ii_BrokenHotChain && !early_pruning_enabled)
+			if (!indexInfo->ii_BrokenHotChain)
 				indexForm->indcheckxmin = false;
-			else if (index_bad || early_pruning_enabled)
+			else if (index_bad)
 				indexForm->indcheckxmin = true;
 			indexForm->indisvalid = true;
 			indexForm->indisready = true;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3a89f8fe1e..8b3db22ca5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -957,8 +957,7 @@ vacuum_set_xid_limits(Relation rel,
 	 * working on a particular table at any time, and that each vacuum is
 	 * always an independent transaction.
 	 */
-	*oldestXmin =
-		TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
+	*oldestXmin = GetOldestXmin(rel, true);
 
 	Assert(TransactionIdIsNormal(*oldestXmin));
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f9980cf80c..5f017a19fd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4538,20 +4538,3 @@ IssuePendingWritebacks(WritebackContext *context)
 
 	context->nr_pending = 0;
 }
-
-
-/*
- * Implement slower/larger portions of TestForOldSnapshot
- *
- * Smaller/faster portions are put inline, but the entire set of logic is too
- * big for that.
- */
-void
-TestForOldSnapshot_impl(Snapshot snapshot, Relation relation)
-{
-	if (RelationAllowsEarlyPruning(relation)
-		&& (snapshot)->whenTaken < GetOldSnapshotThresholdTimestamp())
-		ereport(ERROR,
-				(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
-				 errmsg("snapshot too old")));
-}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 427b0d59cd..eb18ae36cb 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -143,7 +143,6 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, WalSndShmemSize());
 		size = add_size(size, WalRcvShmemSize());
 		size = add_size(size, ApplyLauncherShmemSize());
-		size = add_size(size, SnapMgrShmemSize());
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
@@ -259,7 +258,6 @@ CreateSharedMemoryAndSemaphores(void)
 	/*
 	 * Set up other modules that need some shared memory space
 	 */
-	SnapMgrInit();
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 363000670b..3d6cfce5bf 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1758,27 +1758,8 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
-
-	if (old_snapshot_threshold < 0)
-	{
-		/*
-		 * If not using "snapshot too old" feature, fill related fields with
-		 * dummy values that don't require any locking.
-		 */
-		snapshot->lsn = InvalidXLogRecPtr;
-		snapshot->whenTaken = 0;
-	}
-	else
-	{
-		/*
-		 * Capture the current time and WAL stream location in case this
-		 * snapshot becomes old enough to need to fall back on the special
-		 * "old snapshot" logic.
-		 */
-		snapshot->lsn = GetXLogInsertRecPtr();
-		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
-		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
-	}
+	snapshot->too_old = false;
+	snapshot->expire_time = 0;
 
 	return snapshot;
 }
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8958ec8103..856fb2c5a7 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3204,6 +3204,12 @@ ProcessInterrupts(void)
 
 	}
 
+	if (SnapshotTimeoutPending)
+	{
+		HandleSnapshotTimeout();
+		SnapshotTimeoutPending = false;
+	}
+
 	if (ProcSignalBarrierPending)
 		ProcessProcSignalBarrier();
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index eb19644419..0bf0dafac5 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -32,6 +32,7 @@ volatile sig_atomic_t QueryCancelPending = false;
 volatile sig_atomic_t ProcDiePending = false;
 volatile sig_atomic_t ClientConnectionLost = false;
 volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false;
+volatile sig_atomic_t SnapshotTimeoutPending = false;
 volatile sig_atomic_t ProcSignalBarrierPending = false;
 volatile uint32 InterruptHoldoffCount = 0;
 volatile uint32 QueryCancelHoldoffCount = 0;
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index f4247ea70d..4b9f1128e0 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -73,6 +73,7 @@ static void ShutdownPostgres(int code, Datum arg);
 static void StatementTimeoutHandler(void);
 static void LockTimeoutHandler(void);
 static void IdleInTransactionSessionTimeoutHandler(void);
+static void SnapshotTimeoutHandler(void);
 static bool ThereIsAtLeastOneRole(void);
 static void process_startup_options(Port *port, bool am_superuser);
 static void process_settings(Oid databaseid, Oid roleid);
@@ -631,6 +632,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 		RegisterTimeout(LOCK_TIMEOUT, LockTimeoutHandler);
 		RegisterTimeout(IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
 						IdleInTransactionSessionTimeoutHandler);
+		RegisterTimeout(SNAPSHOT_TIMEOUT, SnapshotTimeoutHandler);
 	}
 
 	/*
@@ -1241,6 +1243,14 @@ IdleInTransactionSessionTimeoutHandler(void)
 	SetLatch(MyLatch);
 }
 
+static void
+SnapshotTimeoutHandler(void)
+{
+	SnapshotTimeoutPending = true;
+	InterruptPending = true;
+	SetLatch(MyLatch);
+}
+
 /*
  * Returns true if at least one role is defined in this database cluster.
  */
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..4b40f596e0 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -67,6 +67,7 @@
 #include "utils/resowner_private.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/timeout.h"
 
 
 /*
@@ -74,61 +75,6 @@
  */
 int			old_snapshot_threshold; /* number of minutes, -1 disables */
 
-/*
- * Structure for dealing with old_snapshot_threshold implementation.
- */
-typedef struct OldSnapshotControlData
-{
-	/*
-	 * Variables for old snapshot handling are shared among processes and are
-	 * only allowed to move forward.
-	 */
-	slock_t		mutex_current;	/* protect current_timestamp */
-	TimestampTz current_timestamp;	/* latest snapshot timestamp */
-	slock_t		mutex_latest_xmin;	/* protect latest_xmin and next_map_update */
-	TransactionId latest_xmin;	/* latest snapshot xmin */
-	TimestampTz next_map_update;	/* latest snapshot valid up to */
-	slock_t		mutex_threshold;	/* protect threshold fields */
-	TimestampTz threshold_timestamp;	/* earlier snapshot is old */
-	TransactionId threshold_xid;	/* earlier xid may be gone */
-
-	/*
-	 * Keep one xid per minute for old snapshot error handling.
-	 *
-	 * Use a circular buffer with a head offset, a count of entries currently
-	 * used, and a timestamp corresponding to the xid at the head offset.  A
-	 * count_used value of zero means that there are no times stored; a
-	 * count_used value of OLD_SNAPSHOT_TIME_MAP_ENTRIES means that the buffer
-	 * is full and the head must be advanced to add new entries.  Use
-	 * timestamps aligned to minute boundaries, since that seems less
-	 * surprising than aligning based on the first usage timestamp.  The
-	 * latest bucket is effectively stored within latest_xmin.  The circular
-	 * buffer is updated when we get a new xmin value that doesn't fall into
-	 * the same interval.
-	 *
-	 * It is OK if the xid for a given time slot is from earlier than
-	 * calculated by adding the number of minutes corresponding to the
-	 * (possibly wrapped) distance from the head offset to the time of the
-	 * head entry, since that just results in the vacuuming of old tuples
-	 * being slightly less aggressive.  It would not be OK for it to be off in
-	 * the other direction, since it might result in vacuuming tuples that are
-	 * still expected to be there.
-	 *
-	 * Use of an SLRU was considered but not chosen because it is more
-	 * heavyweight than is needed for this, and would probably not be any less
-	 * code to implement.
-	 *
-	 * Persistence is not needed.
-	 */
-	int			head_offset;	/* subscript of oldest tracked time */
-	TimestampTz head_timestamp; /* time corresponding to head xid */
-	int			count_used;		/* how many slots are in use */
-	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
-} OldSnapshotControlData;
-
-static volatile OldSnapshotControlData *oldSnapshotControl;
-
-
 /*
  * CurrentSnapshot points to the only snapshot taken in transaction-snapshot
  * mode, and to the latest one taken in a read-committed transaction.
@@ -189,9 +135,6 @@ typedef struct ActiveSnapshotElt
 /* Top of the stack of active snapshots */
 static ActiveSnapshotElt *ActiveSnapshot = NULL;
 
-/* Bottom of the stack of active snapshots */
-static ActiveSnapshotElt *OldestActiveSnapshot = NULL;
-
 /*
  * Currently registered Snapshots.  Ordered in a heap by xmin, so that we can
  * quickly find the one with lowest xmin, to advance our MyPgXact->xmin.
@@ -225,10 +168,10 @@ typedef struct ExportedSnapshot
 static List *exportedSnapshots = NIL;
 
 /* Prototypes for local functions */
-static TimestampTz AlignTimestampToMinuteBoundary(TimestampTz ts);
 static Snapshot CopySnapshot(Snapshot snapshot);
 static void FreeSnapshot(Snapshot snapshot);
 static void SnapshotResetXmin(void);
+static void SetSnapshotExpireTime(Snapshot snapshot);
 
 /*
  * Snapshot fields to be serialized.
@@ -245,54 +188,8 @@ typedef struct SerializedSnapshotData
 	bool		suboverflowed;
 	bool		takenDuringRecovery;
 	CommandId	curcid;
-	TimestampTz whenTaken;
-	XLogRecPtr	lsn;
 } SerializedSnapshotData;
 
-Size
-SnapMgrShmemSize(void)
-{
-	Size		size;
-
-	size = offsetof(OldSnapshotControlData, xid_by_minute);
-	if (old_snapshot_threshold > 0)
-		size = add_size(size, mul_size(sizeof(TransactionId),
-									   OLD_SNAPSHOT_TIME_MAP_ENTRIES));
-
-	return size;
-}
-
-/*
- * Initialize for managing old snapshot detection.
- */
-void
-SnapMgrInit(void)
-{
-	bool		found;
-
-	/*
-	 * Create or attach to the OldSnapshotControlData structure.
-	 */
-	oldSnapshotControl = (volatile OldSnapshotControlData *)
-		ShmemInitStruct("OldSnapshotControlData",
-						SnapMgrShmemSize(), &found);
-
-	if (!found)
-	{
-		SpinLockInit(&oldSnapshotControl->mutex_current);
-		oldSnapshotControl->current_timestamp = 0;
-		SpinLockInit(&oldSnapshotControl->mutex_latest_xmin);
-		oldSnapshotControl->latest_xmin = InvalidTransactionId;
-		oldSnapshotControl->next_map_update = 0;
-		SpinLockInit(&oldSnapshotControl->mutex_threshold);
-		oldSnapshotControl->threshold_timestamp = 0;
-		oldSnapshotControl->threshold_xid = InvalidTransactionId;
-		oldSnapshotControl->head_offset = 0;
-		oldSnapshotControl->head_timestamp = 0;
-		oldSnapshotControl->count_used = 0;
-	}
-}
-
 /*
  * GetTransactionSnapshot
  *		Get the appropriate snapshot for a new query in a transaction.
@@ -353,6 +250,7 @@ GetTransactionSnapshot(void)
 			/* Mark it as "registered" in FirstXactSnapshot */
 			FirstXactSnapshot->regd_count++;
 			pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
+			SetSnapshotExpireTime(FirstXactSnapshot);
 		}
 		else
 			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
@@ -403,36 +301,6 @@ GetLatestSnapshot(void)
 	return SecondarySnapshot;
 }
 
-/*
- * GetOldestSnapshot
- *
- *		Get the transaction's oldest known snapshot, as judged by the LSN.
- *		Will return NULL if there are no active or registered snapshots.
- */
-Snapshot
-GetOldestSnapshot(void)
-{
-	Snapshot	OldestRegisteredSnapshot = NULL;
-	XLogRecPtr	RegisteredLSN = InvalidXLogRecPtr;
-
-	if (!pairingheap_is_empty(&RegisteredSnapshots))
-	{
-		OldestRegisteredSnapshot = pairingheap_container(SnapshotData, ph_node,
-														 pairingheap_first(&RegisteredSnapshots));
-		RegisteredLSN = OldestRegisteredSnapshot->lsn;
-	}
-
-	if (OldestActiveSnapshot != NULL)
-	{
-		XLogRecPtr	ActiveLSN = OldestActiveSnapshot->as_snap->lsn;
-
-		if (XLogRecPtrIsInvalid(RegisteredLSN) || RegisteredLSN > ActiveLSN)
-			return OldestActiveSnapshot->as_snap;
-	}
-
-	return OldestRegisteredSnapshot;
-}
-
 /*
  * GetCatalogSnapshot
  *		Get a snapshot that is sufficiently up-to-date for scan of the
@@ -493,7 +361,8 @@ GetNonHistoricCatalogSnapshot(Oid relid)
 		 * CatalogSnapshot pointer is already valid.
 		 */
 		pairingheap_add(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
-	}
+		SetSnapshotExpireTime(CatalogSnapshot);
+	 }
 
 	return CatalogSnapshot;
 }
@@ -513,7 +382,8 @@ InvalidateCatalogSnapshot(void)
 {
 	if (CatalogSnapshot)
 	{
-		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
+		if (!CatalogSnapshot->too_old)
+			pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
 	}
@@ -646,6 +516,7 @@ SetTransactionSnapshot(Snapshot sourcesnap, VirtualTransactionId *sourcevxid,
 		/* Mark it as "registered" in FirstXactSnapshot */
 		FirstXactSnapshot->regd_count++;
 		pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
+		SetSnapshotExpireTime(FirstXactSnapshot);
 	}
 
 	FirstSnapshotSet = true;
@@ -680,6 +551,9 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->active_count = 0;
 	newsnap->copied = true;
 
+	newsnap->too_old = snapshot->too_old;
+	newsnap->expire_time = snapshot->expire_time;
+
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
 	{
@@ -755,8 +629,8 @@ PushActiveSnapshot(Snapshot snap)
 	newactive->as_snap->active_count++;
 
 	ActiveSnapshot = newactive;
-	if (OldestActiveSnapshot == NULL)
-		OldestActiveSnapshot = ActiveSnapshot;
+
+	SetSnapshotExpireTime(newactive->as_snap);
 }
 
 /*
@@ -827,8 +701,6 @@ PopActiveSnapshot(void)
 
 	pfree(ActiveSnapshot);
 	ActiveSnapshot = newstack;
-	if (ActiveSnapshot == NULL)
-		OldestActiveSnapshot = NULL;
 
 	SnapshotResetXmin();
 }
@@ -891,7 +763,10 @@ RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner)
 	ResourceOwnerRememberSnapshot(owner, snap);
 
 	if (snap->regd_count == 1)
+	{
 		pairingheap_add(&RegisteredSnapshots, &snap->ph_node);
+		SetSnapshotExpireTime(snap);
+	}
 
 	return snap;
 }
@@ -928,7 +803,7 @@ UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner)
 	ResourceOwnerForgetSnapshot(owner, snapshot);
 
 	snapshot->regd_count--;
-	if (snapshot->regd_count == 0)
+	if (snapshot->regd_count == 0 && !snapshot->too_old)
 		pairingheap_remove(&RegisteredSnapshots, &snapshot->ph_node);
 
 	if (snapshot->regd_count == 0 && snapshot->active_count == 0)
@@ -998,13 +873,6 @@ GetFullRecentGlobalXmin(void)
  * dropped.  For efficiency, we only consider recomputing PGXACT->xmin when
  * the active snapshot stack is empty; this allows us not to need to track
  * which active snapshot is oldest.
- *
- * Note: it's tempting to use GetOldestSnapshot() here so that we can include
- * active snapshots in the calculation.  However, that compares by LSN not
- * xmin so it's not entirely clear that it's the same thing.  Also, we'd be
- * critically dependent on the assumption that the bottommost active snapshot
- * stack entry has the oldest xmin.  (Current uses of GetOldestSnapshot() are
- * not actually critical, but this would be.)
  */
 static void
 SnapshotResetXmin(void)
@@ -1076,8 +944,6 @@ AtSubAbort_Snapshot(int level)
 		pfree(ActiveSnapshot);
 
 		ActiveSnapshot = next;
-		if (ActiveSnapshot == NULL)
-			OldestActiveSnapshot = NULL;
 	}
 
 	SnapshotResetXmin();
@@ -1099,7 +965,7 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * stacked as active, we don't want the code below to be chasing through a
 	 * dangling pointer.
 	 */
-	if (FirstXactSnapshot != NULL)
+	if (FirstXactSnapshot != NULL && !FirstXactSnapshot->too_old)
 	{
 		Assert(FirstXactSnapshot->regd_count > 0);
 		Assert(!pairingheap_is_empty(&RegisteredSnapshots));
@@ -1133,6 +999,9 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 				elog(WARNING, "could not unlink file \"%s\": %m",
 					 esnap->snapfile);
 
+			if (esnap->snapshot->too_old)
+				continue;
+
 			pairingheap_remove(&RegisteredSnapshots,
 							   &esnap->snapshot->ph_node);
 		}
@@ -1161,7 +1030,6 @@ AtEOXact_Snapshot(bool isCommit, bool resetXmin)
 	 * it'll go away with TopTransactionContext.
 	 */
 	ActiveSnapshot = NULL;
-	OldestActiveSnapshot = NULL;
 	pairingheap_reset(&RegisteredSnapshots);
 
 	CurrentSnapshot = NULL;
@@ -1262,6 +1130,7 @@ ExportSnapshot(Snapshot snapshot)
 
 	snapshot->regd_count++;
 	pairingheap_add(&RegisteredSnapshots, &snapshot->ph_node);
+	SetSnapshotExpireTime(snapshot);
 
 	/*
 	 * Fill buf with a text serialization of the snapshot, plus identification
@@ -1696,335 +1565,6 @@ ThereAreNoPriorRegisteredSnapshots(void)
 }
 
 
-/*
- * Return a timestamp that is exactly on a minute boundary.
- *
- * If the argument is already aligned, return that value, otherwise move to
- * the next minute boundary following the given time.
- */
-static TimestampTz
-AlignTimestampToMinuteBoundary(TimestampTz ts)
-{
-	TimestampTz retval = ts + (USECS_PER_MINUTE - 1);
-
-	return retval - (retval % USECS_PER_MINUTE);
-}
-
-/*
- * Get current timestamp for snapshots
- *
- * This is basically GetCurrentTimestamp(), but with a guarantee that
- * the result never moves backward.
- */
-TimestampTz
-GetSnapshotCurrentTimestamp(void)
-{
-	TimestampTz now = GetCurrentTimestamp();
-
-	/*
-	 * Don't let time move backward; if it hasn't advanced, use the old value.
-	 */
-	SpinLockAcquire(&oldSnapshotControl->mutex_current);
-	if (now <= oldSnapshotControl->current_timestamp)
-		now = oldSnapshotControl->current_timestamp;
-	else
-		oldSnapshotControl->current_timestamp = now;
-	SpinLockRelease(&oldSnapshotControl->mutex_current);
-
-	return now;
-}
-
-/*
- * Get timestamp through which vacuum may have processed based on last stored
- * value for threshold_timestamp.
- *
- * XXX: So far, we never trust that a 64-bit value can be read atomically; if
- * that ever changes, we could get rid of the spinlock here.
- */
-TimestampTz
-GetOldSnapshotThresholdTimestamp(void)
-{
-	TimestampTz threshold_timestamp;
-
-	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
-	threshold_timestamp = oldSnapshotControl->threshold_timestamp;
-	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
-
-	return threshold_timestamp;
-}
-
-static void
-SetOldSnapshotThresholdTimestamp(TimestampTz ts, TransactionId xlimit)
-{
-	SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
-	oldSnapshotControl->threshold_timestamp = ts;
-	oldSnapshotControl->threshold_xid = xlimit;
-	SpinLockRelease(&oldSnapshotControl->mutex_threshold);
-}
-
-/*
- * TransactionIdLimitedForOldSnapshots
- *
- * Apply old snapshot limit, if any.  This is intended to be called for page
- * pruning and table vacuuming, to allow old_snapshot_threshold to override
- * the normal global xmin value.  Actual testing for snapshot too old will be
- * based on whether a snapshot timestamp is prior to the threshold timestamp
- * set in this function.
- */
-TransactionId
-TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
-									Relation relation)
-{
-	if (TransactionIdIsNormal(recentXmin)
-		&& old_snapshot_threshold >= 0
-		&& RelationAllowsEarlyPruning(relation))
-	{
-		TimestampTz ts = GetSnapshotCurrentTimestamp();
-		TransactionId xlimit = recentXmin;
-		TransactionId latest_xmin;
-		TimestampTz update_ts;
-		bool		same_ts_as_threshold = false;
-
-		SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
-		latest_xmin = oldSnapshotControl->latest_xmin;
-		update_ts = oldSnapshotControl->next_map_update;
-		SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
-
-		/*
-		 * Zero threshold always overrides to latest xmin, if valid.  Without
-		 * some heuristic it will find its own snapshot too old on, for
-		 * example, a simple UPDATE -- which would make it useless for most
-		 * testing, but there is no principled way to ensure that it doesn't
-		 * fail in this way.  Use a five-second delay to try to get useful
-		 * testing behavior, but this may need adjustment.
-		 */
-		if (old_snapshot_threshold == 0)
-		{
-			if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
-				&& TransactionIdFollows(latest_xmin, xlimit))
-				xlimit = latest_xmin;
-
-			ts -= 5 * USECS_PER_SEC;
-			SetOldSnapshotThresholdTimestamp(ts, xlimit);
-
-			return xlimit;
-		}
-
-		ts = AlignTimestampToMinuteBoundary(ts)
-			- (old_snapshot_threshold * USECS_PER_MINUTE);
-
-		/* Check for fast exit without LW locking. */
-		SpinLockAcquire(&oldSnapshotControl->mutex_threshold);
-		if (ts == oldSnapshotControl->threshold_timestamp)
-		{
-			xlimit = oldSnapshotControl->threshold_xid;
-			same_ts_as_threshold = true;
-		}
-		SpinLockRelease(&oldSnapshotControl->mutex_threshold);
-
-		if (!same_ts_as_threshold)
-		{
-			if (ts == update_ts)
-			{
-				xlimit = latest_xmin;
-				if (NormalTransactionIdFollows(xlimit, recentXmin))
-					SetOldSnapshotThresholdTimestamp(ts, xlimit);
-			}
-			else
-			{
-				LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
-
-				if (oldSnapshotControl->count_used > 0
-					&& ts >= oldSnapshotControl->head_timestamp)
-				{
-					int			offset;
-
-					offset = ((ts - oldSnapshotControl->head_timestamp)
-							  / USECS_PER_MINUTE);
-					if (offset > oldSnapshotControl->count_used - 1)
-						offset = oldSnapshotControl->count_used - 1;
-					offset = (oldSnapshotControl->head_offset + offset)
-						% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
-					xlimit = oldSnapshotControl->xid_by_minute[offset];
-
-					if (NormalTransactionIdFollows(xlimit, recentXmin))
-						SetOldSnapshotThresholdTimestamp(ts, xlimit);
-				}
-
-				LWLockRelease(OldSnapshotTimeMapLock);
-			}
-		}
-
-		/*
-		 * Failsafe protection against vacuuming work of active transaction.
-		 *
-		 * This is not an assertion because we avoid the spinlock for
-		 * performance, leaving open the possibility that xlimit could advance
-		 * and be more current; but it seems prudent to apply this limit.  It
-		 * might make pruning a tiny bit less aggressive than it could be, but
-		 * protects against data loss bugs.
-		 */
-		if (TransactionIdIsNormal(latest_xmin)
-			&& TransactionIdPrecedes(latest_xmin, xlimit))
-			xlimit = latest_xmin;
-
-		if (NormalTransactionIdFollows(xlimit, recentXmin))
-			return xlimit;
-	}
-
-	return recentXmin;
-}
-
-/*
- * Take care of the circular buffer that maps time to xid.
- */
-void
-MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
-{
-	TimestampTz ts;
-	TransactionId latest_xmin;
-	TimestampTz update_ts;
-	bool		map_update_required = false;
-
-	/* Never call this function when old snapshot checking is disabled. */
-	Assert(old_snapshot_threshold >= 0);
-
-	ts = AlignTimestampToMinuteBoundary(whenTaken);
-
-	/*
-	 * Keep track of the latest xmin seen by any process. Update mapping with
-	 * a new value when we have crossed a bucket boundary.
-	 */
-	SpinLockAcquire(&oldSnapshotControl->mutex_latest_xmin);
-	latest_xmin = oldSnapshotControl->latest_xmin;
-	update_ts = oldSnapshotControl->next_map_update;
-	if (ts > update_ts)
-	{
-		oldSnapshotControl->next_map_update = ts;
-		map_update_required = true;
-	}
-	if (TransactionIdFollows(xmin, latest_xmin))
-		oldSnapshotControl->latest_xmin = xmin;
-	SpinLockRelease(&oldSnapshotControl->mutex_latest_xmin);
-
-	/* We only needed to update the most recent xmin value. */
-	if (!map_update_required)
-		return;
-
-	/* No further tracking needed for 0 (used for testing). */
-	if (old_snapshot_threshold == 0)
-		return;
-
-	/*
-	 * We don't want to do something stupid with unusual values, but we don't
-	 * want to litter the log with warnings or break otherwise normal
-	 * processing for this feature; so if something seems unreasonable, just
-	 * log at DEBUG level and return without doing anything.
-	 */
-	if (whenTaken < 0)
-	{
-		elog(DEBUG1,
-			 "MaintainOldSnapshotTimeMapping called with negative whenTaken = %ld",
-			 (long) whenTaken);
-		return;
-	}
-	if (!TransactionIdIsNormal(xmin))
-	{
-		elog(DEBUG1,
-			 "MaintainOldSnapshotTimeMapping called with xmin = %lu",
-			 (unsigned long) xmin);
-		return;
-	}
-
-	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
-
-	Assert(oldSnapshotControl->head_offset >= 0);
-	Assert(oldSnapshotControl->head_offset < OLD_SNAPSHOT_TIME_MAP_ENTRIES);
-	Assert((oldSnapshotControl->head_timestamp % USECS_PER_MINUTE) == 0);
-	Assert(oldSnapshotControl->count_used >= 0);
-	Assert(oldSnapshotControl->count_used <= OLD_SNAPSHOT_TIME_MAP_ENTRIES);
-
-	if (oldSnapshotControl->count_used == 0)
-	{
-		/* set up first entry for empty mapping */
-		oldSnapshotControl->head_offset = 0;
-		oldSnapshotControl->head_timestamp = ts;
-		oldSnapshotControl->count_used = 1;
-		oldSnapshotControl->xid_by_minute[0] = xmin;
-	}
-	else if (ts < oldSnapshotControl->head_timestamp)
-	{
-		/* old ts; log it at DEBUG */
-		LWLockRelease(OldSnapshotTimeMapLock);
-		elog(DEBUG1,
-			 "MaintainOldSnapshotTimeMapping called with old whenTaken = %ld",
-			 (long) whenTaken);
-		return;
-	}
-	else if (ts <= (oldSnapshotControl->head_timestamp +
-					((oldSnapshotControl->count_used - 1)
-					 * USECS_PER_MINUTE)))
-	{
-		/* existing mapping; advance xid if possible */
-		int			bucket = (oldSnapshotControl->head_offset
-							  + ((ts - oldSnapshotControl->head_timestamp)
-								 / USECS_PER_MINUTE))
-		% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
-
-		if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket], xmin))
-			oldSnapshotControl->xid_by_minute[bucket] = xmin;
-	}
-	else
-	{
-		/* We need a new bucket, but it might not be the very next one. */
-		int			advance = ((ts - oldSnapshotControl->head_timestamp)
-							   / USECS_PER_MINUTE);
-
-		oldSnapshotControl->head_timestamp = ts;
-
-		if (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)
-		{
-			/* Advance is so far that all old data is junk; start over. */
-			oldSnapshotControl->head_offset = 0;
-			oldSnapshotControl->count_used = 1;
-			oldSnapshotControl->xid_by_minute[0] = xmin;
-		}
-		else
-		{
-			/* Store the new value in one or more buckets. */
-			int			i;
-
-			for (i = 0; i < advance; i++)
-			{
-				if (oldSnapshotControl->count_used == OLD_SNAPSHOT_TIME_MAP_ENTRIES)
-				{
-					/* Map full and new value replaces old head. */
-					int			old_head = oldSnapshotControl->head_offset;
-
-					if (old_head == (OLD_SNAPSHOT_TIME_MAP_ENTRIES - 1))
-						oldSnapshotControl->head_offset = 0;
-					else
-						oldSnapshotControl->head_offset = old_head + 1;
-					oldSnapshotControl->xid_by_minute[old_head] = xmin;
-				}
-				else
-				{
-					/* Extend map to unused entry. */
-					int			new_tail = (oldSnapshotControl->head_offset
-											+ oldSnapshotControl->count_used)
-					% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
-
-					oldSnapshotControl->count_used++;
-					oldSnapshotControl->xid_by_minute[new_tail] = xmin;
-				}
-			}
-		}
-	}
-
-	LWLockRelease(OldSnapshotTimeMapLock);
-}
-
-
 /*
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
@@ -2113,8 +1653,6 @@ SerializeSnapshot(Snapshot snapshot, char *start_address)
 	serialized_snapshot.suboverflowed = snapshot->suboverflowed;
 	serialized_snapshot.takenDuringRecovery = snapshot->takenDuringRecovery;
 	serialized_snapshot.curcid = snapshot->curcid;
-	serialized_snapshot.whenTaken = snapshot->whenTaken;
-	serialized_snapshot.lsn = snapshot->lsn;
 
 	/*
 	 * Ignore the SubXID array if it has overflowed, unless the snapshot was
@@ -2187,8 +1725,6 @@ RestoreSnapshot(char *start_address)
 	snapshot->suboverflowed = serialized_snapshot.suboverflowed;
 	snapshot->takenDuringRecovery = serialized_snapshot.takenDuringRecovery;
 	snapshot->curcid = serialized_snapshot.curcid;
-	snapshot->whenTaken = serialized_snapshot.whenTaken;
-	snapshot->lsn = serialized_snapshot.lsn;
 
 	/* Copy XIDs, if present. */
 	if (serialized_snapshot.xcnt > 0)
@@ -2212,6 +1748,9 @@ RestoreSnapshot(char *start_address)
 	snapshot->active_count = 0;
 	snapshot->copied = true;
 
+	snapshot->too_old = false;
+	snapshot->expire_time = 0;
+
 	return snapshot;
 }
 
@@ -2348,3 +1887,102 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 
 	return false;
 }
+
+/*
+ * Find any snapshots that are too old and mark them as such, and see if we can
+ * advance this backend's xmin.
+ */
+void HandleSnapshotTimeout(void)
+{
+	TransactionId		xmin = InvalidTransactionId;
+	TimestampTz			now = GetCurrentTimestamp();
+	TimestampTz			next_expire_time = 0;
+
+	/*
+	 * See if we can mark any registered snapshots as too old, and unregister
+	 * them.
+	 *
+	 * XXX This assumes that two registered snapshots with same xmin are sorted
+	 * in expire_time order, which isn't true right now.
+	 */
+	while (!pairingheap_is_empty(&RegisteredSnapshots))
+	{
+		Snapshot snapshot;
+
+		snapshot = pairingheap_container(SnapshotData, ph_node,
+										 pairingheap_first(&RegisteredSnapshots));
+		if (snapshot->expire_time == 0 || snapshot->expire_time > now)
+		{
+			xmin = snapshot->xmin;
+			next_expire_time = snapshot->expire_time;
+			break;
+		}
+
+		/* Forcibly deregister the snapshot so that it no longer holds up xmin. */
+		snapshot->too_old = true;
+		pairingheap_remove_first(&RegisteredSnapshots);
+	}
+
+	/*
+	 * Also scan the active snapshots, seeing if any need to be marked too old,
+	 * and find the new xmin.
+	 */
+	for (ActiveSnapshotElt *s = ActiveSnapshot; s; s = s->as_next)
+	{
+		if (s->as_snap->too_old)
+			continue;
+		else if (s->as_snap->expire_time > 0 && s->as_snap->expire_time <= now)
+			 s->as_snap->too_old = true;
+		else
+		{
+			/* See if we have a new lowest xmin. */
+			if (xmin == InvalidTransactionId ||
+				TransactionIdPrecedes(s->as_snap->xmin, xmin))
+				xmin = s->as_snap->xmin;
+			/* See if we have a new earliest expire_time. */
+			if (next_expire_time == 0 ||
+				s->as_snap->expire_time < next_expire_time)
+				next_expire_time = s->as_snap->expire_time;
+		}
+	}
+
+	/* Set timer for the next snapshot that will expire. */
+	if (next_expire_time > 0)
+		enable_timeout_after(SNAPSHOT_TIMEOUT, (now - next_expire_time) / 1000);
+
+	/* Advertise new xmin. */
+	MyPgXact->xmin = xmin;
+}
+
+/*
+ * SetSnapshotExpireTime
+ *		Configure the time at which a snapshot becomes too old.
+ *
+ * This should be called whenever a snapshot is made active or registered.
+ */
+static void
+SetSnapshotExpireTime(Snapshot snapshot)
+{
+	/*
+	 * If this snapshot was already marked as too old (perhaps because it was
+	 * copied from one that was too old), then don't clobber the flag.
+	 */
+	if (snapshot->too_old)
+		return;
+
+	snapshot->too_old = false;
+	if (old_snapshot_threshold < 0)
+	{
+		/* Feature not enabled. */
+		snapshot->expire_time = 0;
+	}
+	else
+	{
+		TimestampTz		now = GetCurrentTimestamp();
+
+		/* Set up a timer that will call HandleSnapshotTimeout. */
+		snapshot->expire_time = now + old_snapshot_threshold * 1000;
+		enable_timeout_after(SNAPSHOT_TIMEOUT,
+							 (snapshot->expire_time - now) / 1000);
+	}
+}
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 14fa127ab1..1f866302b6 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -82,6 +82,7 @@ extern PGDLLIMPORT volatile sig_atomic_t InterruptPending;
 extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending;
 extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending;
+extern PGDLLIMPORT volatile sig_atomic_t SnapshotTimeoutPending;
 extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending;
 
 extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost;
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ee91b8fa26..d7fb0dbb6f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -238,8 +238,6 @@ extern bool BgBufferSync(struct WritebackContext *wb_context);
 
 extern void AtProcExit_LocalBuffers(void);
 
-extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
-
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
@@ -269,22 +267,16 @@ extern void FreeAccessStrategy(BufferAccessStrategy strategy);
  * Note that a NULL snapshot argument is allowed and causes a fast return
  * without error; this is to support call sites which can be called from
  * either scans or index modification areas.
- *
- * For best performance, keep the tests that are fastest and/or most likely to
- * exclude a page from old snapshot testing near the front.
  */
 static inline void
 TestForOldSnapshot(Snapshot snapshot, Relation relation, Page page)
 {
 	Assert(relation != NULL);
 
-	if (old_snapshot_threshold >= 0
-		&& (snapshot) != NULL
-		&& ((snapshot)->snapshot_type == SNAPSHOT_MVCC
-			|| (snapshot)->snapshot_type == SNAPSHOT_TOAST)
-		&& !XLogRecPtrIsInvalid((snapshot)->lsn)
-		&& PageGetLSN(page) > (snapshot)->lsn)
-		TestForOldSnapshot_impl(snapshot, relation);
+	if (snapshot && snapshot->too_old)
+		ereport(ERROR,
+				(errcode(ERRCODE_SNAPSHOT_TOO_OLD),
+				 errmsg("snapshot too old")));
 }
 
 #endif							/* FRONTEND */
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..cf7cd197e9 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -19,40 +19,10 @@
 #include "utils/snapshot.h"
 
 
-/*
- * The structure used to map times to TransactionId values for the "snapshot
- * too old" feature must have a few entries at the tail to hold old values;
- * otherwise the lookup will often fail and the expected early pruning or
- * vacuum will not usually occur.  It is best if this padding is for a number
- * of minutes greater than a thread would normally be stalled, but it's OK if
- * early vacuum opportunities are occasionally missed, so there's no need to
- * use an extreme value or get too fancy.  10 minutes seems plenty.
- */
-#define OLD_SNAPSHOT_PADDING_ENTRIES 10
-#define OLD_SNAPSHOT_TIME_MAP_ENTRIES (old_snapshot_threshold + OLD_SNAPSHOT_PADDING_ENTRIES)
-
-/*
- * Common definition of relation properties that allow early pruning/vacuuming
- * when old_snapshot_threshold >= 0.
- */
-#define RelationAllowsEarlyPruning(rel) \
-( \
-	 RelationNeedsWAL(rel) \
-  && !IsCatalogRelation(rel) \
-  && !RelationIsAccessibleInLogicalDecoding(rel) \
-)
-
-#define EarlyPruningEnabled(rel) (old_snapshot_threshold >= 0 && RelationAllowsEarlyPruning(rel))
-
 /* GUC variables */
 extern PGDLLIMPORT int old_snapshot_threshold;
 
 
-extern Size SnapMgrShmemSize(void);
-extern void SnapMgrInit(void);
-extern TimestampTz GetSnapshotCurrentTimestamp(void);
-extern TimestampTz GetOldSnapshotThresholdTimestamp(void);
-
 extern bool FirstSnapshotSet;
 
 extern PGDLLIMPORT TransactionId TransactionXmin;
@@ -74,7 +44,8 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
  * local variable of type SnapshotData, and initialize it with this macro.
  */
 #define InitDirtySnapshot(snapshotdata)  \
-	((snapshotdata).snapshot_type = SNAPSHOT_DIRTY)
+	((snapshotdata).snapshot_type = SNAPSHOT_DIRTY, \
+	 (snapshotdata).too_old = false)
 
 /*
  * Similarly, some initialization is required for a NonVacuumable snapshot.
@@ -82,16 +53,15 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
  */
 #define InitNonVacuumableSnapshot(snapshotdata, xmin_horizon)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_NON_VACUUMABLE, \
-	 (snapshotdata).xmin = (xmin_horizon))
+	 (snapshotdata).xmin = (xmin_horizon), \
+	 (snapshotdata).too_old = false)
 
 /*
- * Similarly, some initialization is required for SnapshotToast.  We need
- * to set lsn and whenTaken correctly to support snapshot_too_old.
+ * Similarly, some initialization is required for SnapshotToast.
  */
-#define InitToastSnapshot(snapshotdata, l, w)  \
+#define InitToastSnapshot(snapshotdata)  \
 	((snapshotdata).snapshot_type = SNAPSHOT_TOAST, \
-	 (snapshotdata).lsn = (l),					\
-	 (snapshotdata).whenTaken = (w))
+	 (snapshotdata).too_old = false)
 
 /* This macro encodes the knowledge of which snapshots are MVCC-safe */
 #define IsMVCCSnapshot(snapshot)  \
@@ -102,7 +72,8 @@ extern PGDLLIMPORT SnapshotData CatalogSnapshotData;
 extern Snapshot GetTransactionSnapshot(void);
 extern Snapshot GetLatestSnapshot(void);
 extern void SnapshotSetCommandId(CommandId curcid);
-extern Snapshot GetOldestSnapshot(void);
+
+extern void HandleSnapshotTimeout(void);
 
 extern Snapshot GetCatalogSnapshot(Oid relid);
 extern Snapshot GetNonHistoricCatalogSnapshot(Oid relid);
diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 4796edb63a..e701c8b9b1 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -199,8 +199,8 @@ typedef struct SnapshotData
 	uint32		regd_count;		/* refcount on RegisteredSnapshots */
 	pairingheap_node ph_node;	/* link in the RegisteredSnapshots heap */
 
-	TimestampTz whenTaken;		/* timestamp when snapshot was taken */
-	XLogRecPtr	lsn;			/* position in the WAL stream when taken */
+	TimestampTz	expire_time;	/* when will this snapshot become too old? */
+	bool		too_old;
 } SnapshotData;
 
 #endif							/* SNAPSHOT_H */
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index 83a15f6795..eb705a3f8f 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -31,6 +31,7 @@ typedef enum TimeoutId
 	STANDBY_TIMEOUT,
 	STANDBY_LOCK_TIMEOUT,
 	IDLE_IN_TRANSACTION_SESSION_TIMEOUT,
+	SNAPSHOT_TIMEOUT,
 	/* First user-definable timeout reason */
 	USER_TIMEOUT,
 	/* Maximum number of timeout reasons */
-- 
2.20.1

#43

thomas.munro@gmail.com

over 5 years ago

In reply to: Andres Freund (#41)

1 attachment(s)

Re: snapshot too old issues, first around wraparound and then more.

On Mon, Apr 13, 2020 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:

FWIW, I think the part that is currently harder to fix is the time->xmin
mapping and some related pieces. Second comes the test
infrastructure. Compared to those, adding additional checks for old
snapshots wouldn't be too hard - although I'd argue that the approach of
sprinkling these tests everywhere isn't that scalable...

Just trying out some ideas here... I suppose the wrapping problem
just requires something along the lines of the attached, but now I'm
wondering how to write decent tests for it. Using the
pg_clobber_current_snapshot_timestamp() function I mentioned in
Robert's time->xmin thread, it's easy to build up a time map without
resorting to sleeping etc, with something like:

select pg_clobber_current_snapshot_timestamp('3000-01-01 00:00:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:01:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:02:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:03:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:04:00Z');

Then of course frozenXID can be advanced with eg update pg_database
set datallowconn = 't' where datname = 'template0', then vacuumdb
--freeze --all, and checked before and after with Robert's
pg_old_snapshot_time_mapping() SRF to see that it's truncated. But
it's not really the level of stuff we'd ideally mess with in
pg_regress tests and I don't see any precent, so I guess maybe I'll
need to go and figure out how to write some perl.

Attachments:

0001-Truncate-old-snapshot-XIDs-before-truncating-CLOG.patchtext/x-patch; charset=US-ASCII; name=0001-Truncate-old-snapshot-XIDs-before-truncating-CLOG.patchDownload

From 69d0f7d843a8145fc8cdbd6a56b948ee04d486b9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 17 Apr 2020 15:18:49 +1200
Subject: [PATCH] Truncate old snapshot XIDs before truncating CLOG.

---
 src/backend/commands/vacuum.c    |  3 +++
 src/backend/utils/time/snapmgr.c | 21 +++++++++++++++++++++
 src/include/utils/snapmgr.h      |  1 +
 3 files changed, 25 insertions(+)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5a110edb07..37ead45fa5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1627,6 +1627,9 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	AdvanceOldestCommitTsXid(frozenXID);
 
+	/* Make sure snapshot_too_old drops old XIDs. */
+	TruncateOldSnapshotTimeMapping(frozenXID);
+
 	/*
 	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
 	 */
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 72b2c61a07..d604e69270 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1998,6 +1998,27 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 }
 
 
+/*
+ * Remove old xids from the timing map, so the CLOG can be truncated.
+ */
+void
+TruncateOldSnapshotTimeMapping(TransactionId frozenXID)
+{
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
+	while (oldSnapshotControl->count_used > 0 &&
+		   TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[oldSnapshotControl->head_offset],
+								 frozenXID))
+	{
+		oldSnapshotControl->head_timestamp += USECS_PER_MINUTE;
+		oldSnapshotControl->head_offset =
+			(oldSnapshotControl->head_offset + 1) %
+			OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+		oldSnapshotControl->count_used--;
+	}
+	LWLockRelease(OldSnapshotTimeMapLock);
+}
+
+
 /*
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..4f53aad956 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -135,6 +135,7 @@ extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmi
 														 Relation relation);
 extern void MaintainOldSnapshotTimeMapping(TimestampTz whenTaken,
 										   TransactionId xmin);
+extern void TruncateOldSnapshotTimeMapping(TransactionId frozenXID);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
-- 
2.20.1

#44

thomas.munro@gmail.com

over 5 years ago

In reply to: Thomas Munro (#43)

5 attachment(s)

Re: snapshot too old issues, first around wraparound and then more.

On Fri, Apr 17, 2020 at 3:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Mon, Apr 13, 2020 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:

FWIW, I think the part that is currently harder to fix is the time->xmin
mapping and some related pieces. Second comes the test
infrastructure. Compared to those, adding additional checks for old
snapshots wouldn't be too hard - although I'd argue that the approach of
sprinkling these tests everywhere isn't that scalable...

Just trying out some ideas here...
... so I guess maybe I'll
need to go and figure out how to write some perl.

Here's a very rough sketch of what I mean. Patches 0001-0003 are
stolen directly from Robert. I think 0005's t/001_truncate.pl
demonstrates that the map is purged of old xids as appropriate. I
suppose this style of testing based on manually advancing the hands of
time should also allow for testing early pruning, but it may be Monday
before I can try that so I'm sharing what I have so far in case it's
useful... I think this really wants to be in src/test/modules, not
contrib, but I just bolted it on top of what Robert posted.

Attachments:

v1-0001-Expose-oldSnapshotControl.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Expose-oldSnapshotControl.patchDownload

From b78644a0f9580934b136ca8413366de91198203f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Apr 2020 09:37:31 -0400
Subject: [PATCH v1 1/5] Expose oldSnapshotControl.

---
 src/backend/utils/time/snapmgr.c | 55 +----------------------
 src/include/utils/old_snapshot.h | 75 ++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+), 53 deletions(-)
 create mode 100644 src/include/utils/old_snapshot.h

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 1c063c592c..abaaea569a 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -63,6 +63,7 @@
 #include "storage/spin.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/old_snapshot.h"
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/snapmgr.h"
@@ -74,59 +75,7 @@
  */
 int			old_snapshot_threshold; /* number of minutes, -1 disables */
 
-/*
- * Structure for dealing with old_snapshot_threshold implementation.
- */
-typedef struct OldSnapshotControlData
-{
-	/*
-	 * Variables for old snapshot handling are shared among processes and are
-	 * only allowed to move forward.
-	 */
-	slock_t		mutex_current;	/* protect current_timestamp */
-	TimestampTz current_timestamp;	/* latest snapshot timestamp */
-	slock_t		mutex_latest_xmin;	/* protect latest_xmin and next_map_update */
-	TransactionId latest_xmin;	/* latest snapshot xmin */
-	TimestampTz next_map_update;	/* latest snapshot valid up to */
-	slock_t		mutex_threshold;	/* protect threshold fields */
-	TimestampTz threshold_timestamp;	/* earlier snapshot is old */
-	TransactionId threshold_xid;	/* earlier xid may be gone */
-
-	/*
-	 * Keep one xid per minute for old snapshot error handling.
-	 *
-	 * Use a circular buffer with a head offset, a count of entries currently
-	 * used, and a timestamp corresponding to the xid at the head offset.  A
-	 * count_used value of zero means that there are no times stored; a
-	 * count_used value of OLD_SNAPSHOT_TIME_MAP_ENTRIES means that the buffer
-	 * is full and the head must be advanced to add new entries.  Use
-	 * timestamps aligned to minute boundaries, since that seems less
-	 * surprising than aligning based on the first usage timestamp.  The
-	 * latest bucket is effectively stored within latest_xmin.  The circular
-	 * buffer is updated when we get a new xmin value that doesn't fall into
-	 * the same interval.
-	 *
-	 * It is OK if the xid for a given time slot is from earlier than
-	 * calculated by adding the number of minutes corresponding to the
-	 * (possibly wrapped) distance from the head offset to the time of the
-	 * head entry, since that just results in the vacuuming of old tuples
-	 * being slightly less aggressive.  It would not be OK for it to be off in
-	 * the other direction, since it might result in vacuuming tuples that are
-	 * still expected to be there.
-	 *
-	 * Use of an SLRU was considered but not chosen because it is more
-	 * heavyweight than is needed for this, and would probably not be any less
-	 * code to implement.
-	 *
-	 * Persistence is not needed.
-	 */
-	int			head_offset;	/* subscript of oldest tracked time */
-	TimestampTz head_timestamp; /* time corresponding to head xid */
-	int			count_used;		/* how many slots are in use */
-	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
-} OldSnapshotControlData;
-
-static volatile OldSnapshotControlData *oldSnapshotControl;
+volatile OldSnapshotControlData *oldSnapshotControl;
 
 
 /*
diff --git a/src/include/utils/old_snapshot.h b/src/include/utils/old_snapshot.h
new file mode 100644
index 0000000000..284af7d508
--- /dev/null
+++ b/src/include/utils/old_snapshot.h
@@ -0,0 +1,75 @@
+/*-------------------------------------------------------------------------
+ *
+ * old_snapshot.h
+ *		Data structures for 'snapshot too old'
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/utils/old_snapshot.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef OLD_SNAPSHOT_H
+#define OLD_SNAPSHOT_H
+
+#include "datatype/timestamp.h"
+#include "storage/s_lock.h"
+
+/*
+ * Structure for dealing with old_snapshot_threshold implementation.
+ */
+typedef struct OldSnapshotControlData
+{
+	/*
+	 * Variables for old snapshot handling are shared among processes and are
+	 * only allowed to move forward.
+	 */
+	slock_t		mutex_current;	/* protect current_timestamp */
+	TimestampTz current_timestamp;	/* latest snapshot timestamp */
+	slock_t		mutex_latest_xmin;	/* protect latest_xmin and next_map_update */
+	TransactionId latest_xmin;	/* latest snapshot xmin */
+	TimestampTz next_map_update;	/* latest snapshot valid up to */
+	slock_t		mutex_threshold;	/* protect threshold fields */
+	TimestampTz threshold_timestamp;	/* earlier snapshot is old */
+	TransactionId threshold_xid;	/* earlier xid may be gone */
+
+	/*
+	 * Keep one xid per minute for old snapshot error handling.
+	 *
+	 * Use a circular buffer with a head offset, a count of entries currently
+	 * used, and a timestamp corresponding to the xid at the head offset.  A
+	 * count_used value of zero means that there are no times stored; a
+	 * count_used value of OLD_SNAPSHOT_TIME_MAP_ENTRIES means that the buffer
+	 * is full and the head must be advanced to add new entries.  Use
+	 * timestamps aligned to minute boundaries, since that seems less
+	 * surprising than aligning based on the first usage timestamp.  The
+	 * latest bucket is effectively stored within latest_xmin.  The circular
+	 * buffer is updated when we get a new xmin value that doesn't fall into
+	 * the same interval.
+	 *
+	 * It is OK if the xid for a given time slot is from earlier than
+	 * calculated by adding the number of minutes corresponding to the
+	 * (possibly wrapped) distance from the head offset to the time of the
+	 * head entry, since that just results in the vacuuming of old tuples
+	 * being slightly less aggressive.  It would not be OK for it to be off in
+	 * the other direction, since it might result in vacuuming tuples that are
+	 * still expected to be there.
+	 *
+	 * Use of an SLRU was considered but not chosen because it is more
+	 * heavyweight than is needed for this, and would probably not be any less
+	 * code to implement.
+	 *
+	 * Persistence is not needed.
+	 */
+	int			head_offset;	/* subscript of oldest tracked time */
+	TimestampTz head_timestamp; /* time corresponding to head xid */
+	int			count_used;		/* how many slots are in use */
+	TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
+} OldSnapshotControlData;
+
+extern volatile OldSnapshotControlData *oldSnapshotControl;
+
+#endif
-- 
2.20.1

v1-0002-contrib-old_snapshot-time-xid-mapping.patchtext/x-patch; charset=US-ASCII; name=v1-0002-contrib-old_snapshot-time-xid-mapping.patchDownload

From e23d46f8482561fc2369525d5625058dad0ded5c Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Apr 2020 12:14:32 -0400
Subject: [PATCH v1 2/5] contrib/old_snapshot: time->xid mapping.

---
 contrib/Makefile                           |   1 +
 contrib/old_snapshot/Makefile              |  24 ++++
 contrib/old_snapshot/old_snapshot--1.0.sql |  14 ++
 contrib/old_snapshot/old_snapshot.control  |   5 +
 contrib/old_snapshot/time_mapping.c        | 159 +++++++++++++++++++++
 5 files changed, 203 insertions(+)
 create mode 100644 contrib/old_snapshot/Makefile
 create mode 100644 contrib/old_snapshot/old_snapshot--1.0.sql
 create mode 100644 contrib/old_snapshot/old_snapshot.control
 create mode 100644 contrib/old_snapshot/time_mapping.c

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d415b6..452ade0782 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
 		lo		\
 		ltree		\
 		oid2name	\
+		old_snapshot	\
 		pageinspect	\
 		passwordcheck	\
 		pg_buffercache	\
diff --git a/contrib/old_snapshot/Makefile b/contrib/old_snapshot/Makefile
new file mode 100644
index 0000000000..091231f25f
--- /dev/null
+++ b/contrib/old_snapshot/Makefile
@@ -0,0 +1,24 @@
+# contrib/old_snapshot/Makefile
+
+MODULE_big = old_snapshot
+OBJS = \
+	$(WIN32RES) \
+	time_mapping.o
+PG_CPPFLAGS = -I$(libpq_srcdir)
+
+EXTENSION = old_snapshot
+DATA = old_snapshot--1.0.sql
+PGFILEDESC = "old_snapshot - utilities in support of old_snapshot_threshold"
+
+REGRESS = old_snapshot
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/old_snapshot
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/old_snapshot/old_snapshot--1.0.sql b/contrib/old_snapshot/old_snapshot--1.0.sql
new file mode 100644
index 0000000000..9ebb8829e3
--- /dev/null
+++ b/contrib/old_snapshot/old_snapshot--1.0.sql
@@ -0,0 +1,14 @@
+/* contrib/old_snapshot/old_snapshot--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION old_snapshot" to load this file. \quit
+
+-- Show visibility map and page-level visibility information for each block.
+CREATE FUNCTION pg_old_snapshot_time_mapping(array_offset OUT int4,
+											 end_timestamp OUT timestamptz,
+											 newest_xmin OUT xid)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'pg_old_snapshot_time_mapping'
+LANGUAGE C STRICT;
+
+-- XXX. Do we want REVOKE commands here?
diff --git a/contrib/old_snapshot/old_snapshot.control b/contrib/old_snapshot/old_snapshot.control
new file mode 100644
index 0000000000..491eec536c
--- /dev/null
+++ b/contrib/old_snapshot/old_snapshot.control
@@ -0,0 +1,5 @@
+# old_snapshot extension
+comment = 'utilities in support of old_snapshot_threshold'
+default_version = '1.0'
+module_pathname = '$libdir/old_snapshot'
+relocatable = true
diff --git a/contrib/old_snapshot/time_mapping.c b/contrib/old_snapshot/time_mapping.c
new file mode 100644
index 0000000000..37e0055a00
--- /dev/null
+++ b/contrib/old_snapshot/time_mapping.c
@@ -0,0 +1,159 @@
+/*-------------------------------------------------------------------------
+ *
+ * time_mapping.c
+ *	  time to XID mapping information
+ *
+ * Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ *	  contrib/old_snapshot/time_mapping.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "storage/lwlock.h"
+#include "utils/old_snapshot.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+
+/*
+ * Backend-private copy of the information from oldSnapshotControl which relates
+ * to the time to XID mapping, plus an index so that we can iterate.
+ *
+ * Note that the length of the xid_by_minute array is given by
+ * OLD_SNAPSHOT_TIME_MAP_ENTRIES (which is not a compile-time constant).
+ */
+typedef struct
+{
+	int				current_index;
+	int				head_offset;
+	TimestampTz		head_timestamp;
+	int				count_used;
+	TransactionId	xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
+} OldSnapshotTimeMapping;
+
+#define NUM_TIME_MAPPING_COLUMNS 3
+
+PG_MODULE_MAGIC;
+PG_FUNCTION_INFO_V1(pg_old_snapshot_time_mapping);
+
+static OldSnapshotTimeMapping *GetOldSnapshotTimeMapping(void);
+static TupleDesc MakeOldSnapshotTimeMappingTupleDesc(void);
+static HeapTuple MakeOldSnapshotTimeMappingTuple(TupleDesc tupdesc,
+												 OldSnapshotTimeMapping *mapping);
+
+/*
+ * SQL-callable set-returning function.
+ */
+Datum
+pg_old_snapshot_time_mapping(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	OldSnapshotTimeMapping *mapping;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		MemoryContext	oldcontext;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+		mapping = GetOldSnapshotTimeMapping();
+		funcctx->user_fctx = mapping;
+		funcctx->tuple_desc = MakeOldSnapshotTimeMappingTupleDesc();
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	mapping = (OldSnapshotTimeMapping *) funcctx->user_fctx;
+
+	while (mapping->current_index < mapping->count_used)
+	{
+		HeapTuple	tuple;
+
+		tuple = MakeOldSnapshotTimeMappingTuple(funcctx->tuple_desc, mapping);
+		++mapping->current_index;
+		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Get the old snapshot time mapping data from shared memory.
+ */
+static OldSnapshotTimeMapping *
+GetOldSnapshotTimeMapping(void)
+{
+	OldSnapshotTimeMapping *mapping;
+
+	mapping = palloc(offsetof(OldSnapshotTimeMapping, xid_by_minute)
+					 + sizeof(TransactionId) * OLD_SNAPSHOT_TIME_MAP_ENTRIES);
+	mapping->current_index = 0;
+
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);
+	mapping->head_offset = oldSnapshotControl->head_offset;
+	mapping->head_timestamp = oldSnapshotControl->head_timestamp;
+	mapping->count_used = oldSnapshotControl->count_used;
+	for (int i = 0; i < OLD_SNAPSHOT_TIME_MAP_ENTRIES; ++i)
+		mapping->xid_by_minute[i] = oldSnapshotControl->xid_by_minute[i];
+	LWLockRelease(OldSnapshotTimeMapLock);
+
+	return mapping;
+}
+
+/*
+ * Build a tuple descriptor for the pg_old_snapshot_time_mapping() SRF.
+ */
+static TupleDesc
+MakeOldSnapshotTimeMappingTupleDesc(void)
+{
+	TupleDesc	tupdesc;
+
+	tupdesc = CreateTemplateTupleDesc(NUM_TIME_MAPPING_COLUMNS);
+
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "array_offset",
+					   INT4OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "end_timestamp",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 3, "newest_xmin",
+					   XIDOID, -1, 0);
+
+	return BlessTupleDesc(tupdesc);
+}
+
+/*
+ * Convert one entry from the old snapshot time mapping to a HeapTuple.
+ */
+static HeapTuple
+MakeOldSnapshotTimeMappingTuple(TupleDesc tupdesc, OldSnapshotTimeMapping *mapping)
+{
+	Datum	values[NUM_TIME_MAPPING_COLUMNS];
+	bool	nulls[NUM_TIME_MAPPING_COLUMNS];
+	int		array_position;
+	TimestampTz	timestamp;
+
+	/*
+	 * Figure out the array position corresponding to the current index.
+	 *
+	 * Index 0 means the oldest entry in the mapping, which is stored at
+	 * mapping->head_offset. Index 1 means the next-oldest entry, which is a the
+	 * following index, and so on. We wrap around when we reach the end of the array.
+	 */
+	array_position = (mapping->head_offset + mapping->current_index)
+		% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+
+	/*
+	 * No explicit timestamp is stored for any entry other than the oldest one,
+	 * but each entry corresponds to 1-minute period, so we can just add.
+	 */
+	timestamp = TimestampTzPlusMilliseconds(mapping->head_timestamp,
+											mapping->current_index * 60000);
+
+	/* Initialize nulls and values arrays. */
+	memset(nulls, 0, sizeof(nulls));
+	values[0] = Int32GetDatum(array_position);
+	values[1] = TimestampTzGetDatum(timestamp);
+	values[2] = TransactionIdGetDatum(mapping->xid_by_minute[array_position]);
+
+	return heap_form_tuple(tupdesc, values, nulls);
+}
-- 
2.20.1

v1-0003-Fix-bugs-in-MaintainOldSnapshotTimeMapping.patchtext/x-patch; charset=US-ASCII; name=v1-0003-Fix-bugs-in-MaintainOldSnapshotTimeMapping.patchDownload

From 76ed484a4c7975d3bc507fab4d387b54c3ac152d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Thu, 16 Apr 2020 12:15:57 -0400
Subject: [PATCH v1 3/5] Fix bugs in MaintainOldSnapshotTimeMapping.

---
 src/backend/utils/time/snapmgr.c | 30 +++++++++++++++++++++++++++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index abaaea569a..72b2c61a07 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1926,10 +1926,32 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 	else
 	{
 		/* We need a new bucket, but it might not be the very next one. */
-		int			advance = ((ts - oldSnapshotControl->head_timestamp)
-							   / USECS_PER_MINUTE);
+		int			distance_to_new_tail;
+		int			distance_to_current_tail;
+		int			advance;
 
-		oldSnapshotControl->head_timestamp = ts;
+		/*
+		 * Our goal is for the new "tail" of the mapping, that is, the entry
+		 * which is newest and thus furthest from the "head" entry, to
+		 * correspond to "ts". Since there's one entry per minute, the
+		 * distance between the current head and the new tail is just the
+		 * number of minutes of difference between ts and the current
+		 * head_timestamp.
+		 *
+		 * The distance from the current head to the current tail is one
+		 * less than the number of entries in the mapping, because the
+		 * entry at the head_offset is for 0 minutes after head_timestamp.
+		 *
+		 * The difference between these two values is the number of minutes
+		 * by which we need to advance the mapping, either adding new entries
+		 * or rotating old ones out.
+		 */
+		distance_to_new_tail =
+			(ts - oldSnapshotControl->head_timestamp) / USECS_PER_MINUTE;
+		distance_to_current_tail =
+			oldSnapshotControl->count_used - 1;
+		advance = distance_to_new_tail - distance_to_current_tail;
+		Assert(advance > 0);
 
 		if (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)
 		{
@@ -1937,6 +1959,7 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 			oldSnapshotControl->head_offset = 0;
 			oldSnapshotControl->count_used = 1;
 			oldSnapshotControl->xid_by_minute[0] = xmin;
+			oldSnapshotControl->head_timestamp = ts;
 		}
 		else
 		{
@@ -1955,6 +1978,7 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 					else
 						oldSnapshotControl->head_offset = old_head + 1;
 					oldSnapshotControl->xid_by_minute[old_head] = xmin;
+					oldSnapshotControl->head_timestamp += USECS_PER_MINUTE;
 				}
 				else
 				{
-- 
2.20.1

v1-0004-Add-pg_clobber_current_snapshot_timestamp.patchtext/x-patch; charset=US-ASCII; name=v1-0004-Add-pg_clobber_current_snapshot_timestamp.patchDownload

From ed965c21b4e5841a4afc249f6365b774440ac8cf Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 17 Apr 2020 14:10:35 +1200
Subject: [PATCH v1 4/5] Add pg_clobber_current_snapshot_timestamp().

---
 contrib/old_snapshot/old_snapshot--1.0.sql |  5 +++++
 contrib/old_snapshot/time_mapping.c        | 13 +++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/contrib/old_snapshot/old_snapshot--1.0.sql b/contrib/old_snapshot/old_snapshot--1.0.sql
index 9ebb8829e3..aacf1704b5 100644
--- a/contrib/old_snapshot/old_snapshot--1.0.sql
+++ b/contrib/old_snapshot/old_snapshot--1.0.sql
@@ -11,4 +11,9 @@ RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'pg_old_snapshot_time_mapping'
 LANGUAGE C STRICT;
 
+CREATE FUNCTION pg_clobber_current_snapshot_timestamp(now timestamptz)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'pg_clobber_current_snapshot_timestamp'
+LANGUAGE C STRICT;
+
 -- XXX. Do we want REVOKE commands here?
diff --git a/contrib/old_snapshot/time_mapping.c b/contrib/old_snapshot/time_mapping.c
index 37e0055a00..8728c4ddb5 100644
--- a/contrib/old_snapshot/time_mapping.c
+++ b/contrib/old_snapshot/time_mapping.c
@@ -36,6 +36,7 @@ typedef struct
 
 PG_MODULE_MAGIC;
 PG_FUNCTION_INFO_V1(pg_old_snapshot_time_mapping);
+PG_FUNCTION_INFO_V1(pg_clobber_current_snapshot_timestamp);
 
 static OldSnapshotTimeMapping *GetOldSnapshotTimeMapping(void);
 static TupleDesc MakeOldSnapshotTimeMappingTupleDesc(void);
@@ -157,3 +158,15 @@ MakeOldSnapshotTimeMappingTuple(TupleDesc tupdesc, OldSnapshotTimeMapping *mappi
 
 	return heap_form_tuple(tupdesc, values, nulls);
 }
+
+Datum
+pg_clobber_current_snapshot_timestamp(PG_FUNCTION_ARGS)
+{
+	TimestampTz new_current_timestamp = PG_GETARG_TIMESTAMPTZ(0);
+
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
+	oldSnapshotControl->current_timestamp = new_current_timestamp;
+	LWLockRelease(OldSnapshotTimeMapLock);
+
+	PG_RETURN_NULL();
+}
-- 
2.20.1

v1-0005-Truncate-old-snapshot-XIDs-before-truncating-CLOG.patchtext/x-patch; charset=US-ASCII; name=v1-0005-Truncate-old-snapshot-XIDs-before-truncating-CLOG.patchDownload

From 7babd9e2bf24063485334a59ab115f6bfdc33db2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 17 Apr 2020 15:18:49 +1200
Subject: [PATCH v1 5/5] Truncate old snapshot XIDs before truncating CLOG.

---
 contrib/old_snapshot/Makefile          |  2 +-
 contrib/old_snapshot/t/001_truncate.pl | 80 ++++++++++++++++++++++++++
 src/backend/commands/vacuum.c          |  3 +
 src/backend/utils/time/snapmgr.c       | 21 +++++++
 src/include/utils/snapmgr.h            |  1 +
 5 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 contrib/old_snapshot/t/001_truncate.pl

diff --git a/contrib/old_snapshot/Makefile b/contrib/old_snapshot/Makefile
index 091231f25f..c839d30346 100644
--- a/contrib/old_snapshot/Makefile
+++ b/contrib/old_snapshot/Makefile
@@ -10,7 +10,7 @@ EXTENSION = old_snapshot
 DATA = old_snapshot--1.0.sql
 PGFILEDESC = "old_snapshot - utilities in support of old_snapshot_threshold"
 
-REGRESS = old_snapshot
+TAP_TESTS = 1
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/old_snapshot/t/001_truncate.pl b/contrib/old_snapshot/t/001_truncate.pl
new file mode 100644
index 0000000000..d6c0def00f
--- /dev/null
+++ b/contrib/old_snapshot/t/001_truncate.pl
@@ -0,0 +1,80 @@
+# Test truncation of the old snapshot time mapping, to check
+# that we can't get into trouble when xids wrap around.
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 6;
+
+my $node = get_new_node('master');
+$node->init;
+$node->append_conf("postgresql.conf", "timezone = UTC");
+$node->append_conf("postgresql.conf", "old_snapshot_threshold=10");
+$node->append_conf("postgresql.conf", "max_prepared_transactions=10");
+$node->start;
+$node->psql('postgres', 'update pg_database set datallowconn = true');
+$node->psql('postgres', 'create extension old_snapshot');
+
+note "check time map is truncated when CLOG is";
+
+# build up a time map with 4 entries
+$node->psql('postgres', "select pg_clobber_current_snapshot_timestamp('3000-01-01 00:00:00Z')");
+$node->psql('postgres', "select pg_current_xact_id()");
+$node->psql('postgres', "select pg_clobber_current_snapshot_timestamp('3000-01-01 00:01:00Z')");
+$node->psql('postgres', "select pg_current_xact_id()");
+$node->psql('postgres', "select pg_clobber_current_snapshot_timestamp('3000-01-01 00:02:00Z')");
+$node->psql('postgres', "select pg_current_xact_id()");
+$node->psql('postgres', "select pg_clobber_current_snapshot_timestamp('3000-01-01 00:03:00Z')");
+$node->psql('postgres', "select pg_current_xact_id()");
+my $count;
+$node->psql('postgres', "select count(*) from pg_old_snapshot_time_mapping()", stdout => \$count);
+is($count, 4, "expected to have 4 entries in the old snapshot time map");
+
+# now cause frozen XID to advance
+$node->psql('postgres', 'vacuum freeze');
+$node->psql('template0', 'vacuum freeze');
+$node->psql('template1', 'vacuum freeze');
+
+# we expect all XIDs to have been truncated
+$node->psql('postgres', "select count(*) from pg_old_snapshot_time_mapping()", stdout => \$count);
+is($count, 0, "expected to have 0 entries in the old snapshot time map");
+
+# put two more in the map
+$node->psql('postgres', "select pg_clobber_current_snapshot_timestamp('3000-01-01 00:04:00Z')");
+$node->psql('postgres', "select pg_current_xact_id()");
+$node->psql('postgres', "select pg_clobber_current_snapshot_timestamp('3000-01-01 00:05:00Z')");
+$node->psql('postgres', "select pg_current_xact_id()");
+$node->psql('postgres', "select count(*) from pg_old_snapshot_time_mapping()", stdout => \$count);
+is($count, 2, "expected to have 2 entries in the old snapshot time map");
+
+# prepare a transaction, to stop xmin from getting further ahead
+$node->psql('postgres', "begin; select pg_current_xact_id(); prepare transaction 'tx1'");
+
+# add 16 more minutes (this tests wrapping around the mapping array, which is of size 10 + 10)...
+$node->psql('postgres', "select pg_clobber_current_snapshot_timestamp('3000-01-01 00:21:00Z')");
+$node->psql('postgres', "select pg_current_xact_id()");
+$node->psql('postgres', "select count(*) from pg_old_snapshot_time_mapping()", stdout => \$count);
+is($count, 18, "expected to have 18 entries in the old snapshot time map");
+
+# now cause frozen XID to advance
+$node->psql('postgres', 'vacuum freeze');
+$node->psql('template0', 'vacuum freeze');
+$node->psql('template1', 'vacuum freeze');
+
+# this should leave just 16
+$node->psql('postgres', "select count(*) from pg_old_snapshot_time_mapping()", stdout => \$count);
+is($count, 16, "expected to have 16 entries in the old snapshot time map");
+
+# commit tx1, and then freeze again to get rid of all of them
+$node->psql('postgres', "commit prepared 'tx1'");
+
+# now cause frozen XID to advance
+$node->psql('postgres', 'vacuum freeze');
+$node->psql('template0', 'vacuum freeze');
+$node->psql('template1', 'vacuum freeze');
+
+# we should now be back to empty
+$node->psql('postgres', "select count(*) from pg_old_snapshot_time_mapping()", stdout => \$count);
+is($count, 0, "expected to have 0 entries in the old snapshot time map");
+
+$node->stop;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 5a110edb07..37ead45fa5 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1627,6 +1627,9 @@ vac_truncate_clog(TransactionId frozenXID,
 	 */
 	AdvanceOldestCommitTsXid(frozenXID);
 
+	/* Make sure snapshot_too_old drops old XIDs. */
+	TruncateOldSnapshotTimeMapping(frozenXID);
+
 	/*
 	 * Truncate CLOG, multixact and CommitTs to the oldest computed value.
 	 */
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 72b2c61a07..d604e69270 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1998,6 +1998,27 @@ MaintainOldSnapshotTimeMapping(TimestampTz whenTaken, TransactionId xmin)
 }
 
 
+/*
+ * Remove old xids from the timing map, so the CLOG can be truncated.
+ */
+void
+TruncateOldSnapshotTimeMapping(TransactionId frozenXID)
+{
+	LWLockAcquire(OldSnapshotTimeMapLock, LW_EXCLUSIVE);
+	while (oldSnapshotControl->count_used > 0 &&
+		   TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[oldSnapshotControl->head_offset],
+								 frozenXID))
+	{
+		oldSnapshotControl->head_timestamp += USECS_PER_MINUTE;
+		oldSnapshotControl->head_offset =
+			(oldSnapshotControl->head_offset + 1) %
+			OLD_SNAPSHOT_TIME_MAP_ENTRIES;
+		oldSnapshotControl->count_used--;
+	}
+	LWLockRelease(OldSnapshotTimeMapLock);
+}
+
+
 /*
  * Setup a snapshot that replaces normal catalog snapshots that allows catalog
  * access to behave just like it did at a certain point in the past.
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index b28d13ce84..4f53aad956 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -135,6 +135,7 @@ extern TransactionId TransactionIdLimitedForOldSnapshots(TransactionId recentXmi
 														 Relation relation);
 extern void MaintainOldSnapshotTimeMapping(TimestampTz whenTaken,
 										   TransactionId xmin);
+extern void TruncateOldSnapshotTimeMapping(TransactionId frozenXID);
 
 extern char *ExportSnapshot(Snapshot snapshot);
 
-- 
2.20.1

#45

robertmhaas@gmail.com

over 5 years ago

In reply to: Thomas Munro (#43)

Re: snapshot too old issues, first around wraparound and then more.

On Thu, Apr 16, 2020 at 11:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Then of course frozenXID can be advanced with eg update pg_database
set datallowconn = 't' where datname = 'template0', then vacuumdb
--freeze --all, and checked before and after with Robert's
pg_old_snapshot_time_mapping() SRF to see that it's truncated. But
it's not really the level of stuff we'd ideally mess with in
pg_regress tests and I don't see any precent, so I guess maybe I'll
need to go and figure out how to write some perl.

The reason I put it in contrib is because I thought it would possibly
be useful to anyone who is actually using this feature to be able to
look at this information. It's unclear to me that there's any less
reason to provide introspection here than there is for, say, pg_locks.

It's sorta unclear to me why you continued the discussion of this on
this thread rather than the new one I started. Seems like doing it
over there might be clearer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#46

thomas.munro@gmail.com

over 5 years ago

In reply to: Robert Haas (#45)

Re: snapshot too old issues, first around wraparound and then more.

On Sat, Apr 18, 2020 at 12:19 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 16, 2020 at 11:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:

Then of course frozenXID can be advanced with eg update pg_database
set datallowconn = 't' where datname = 'template0', then vacuumdb
--freeze --all, and checked before and after with Robert's
pg_old_snapshot_time_mapping() SRF to see that it's truncated. But
it's not really the level of stuff we'd ideally mess with in
pg_regress tests and I don't see any precent, so I guess maybe I'll
need to go and figure out how to write some perl.

The reason I put it in contrib is because I thought it would possibly
be useful to anyone who is actually using this feature to be able to
look at this information. It's unclear to me that there's any less
reason to provide introspection here than there is for, say, pg_locks.

Makes sense. I was talking more about the
pg_clobber_snapshot_timestamp() function I showed, which is for use by
tests, not end users, since it does weird stuff to internal state.

It's sorta unclear to me why you continued the discussion of this on
this thread rather than the new one I started. Seems like doing it
over there might be clearer.

I understood that you'd forked a new thread to discuss one particular
problem among the many that Andres nailed to the door, namely the xid
map's failure to be monotonic, and here I was responding to other
things from his list, namely the lack of defences against wrap-around
and the lack of testing. Apparently I misunderstood. I will move to
the new thread for the next version I post, once I figure out if I can
use pg_clobber_snapshot_timestamp() in a TAP test to check early
vacuum/pruning behaviour.

#47

robertmhaas@gmail.com

over 5 years ago

In reply to: Thomas Munro (#46)

Re: snapshot too old issues, first around wraparound and then more.

On Fri, Apr 17, 2020 at 4:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:

I understood that you'd forked a new thread to discuss one particular
problem among the many that Andres nailed to the door, namely the xid
map's failure to be monotonic, and here I was responding to other
things from his list, namely the lack of defences against wrap-around
and the lack of testing. Apparently I misunderstood.

Oh, maybe I'm the one who misunderstood...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#48

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Robert Haas (#47)

Re: snapshot too old issues, first around wraparound and then more.

Robert Haas <robertmhaas@gmail.com> writes:

Oh, maybe I'm the one who misunderstood...

So, it's well over a year later, and so far as I can see exactly
nothing has been done about snapshot_too_old's problems.

I never liked that feature to begin with, and I would be very
glad to undertake the task of ripping it out. If someone thinks
this should not happen, please commit to fixing it ... and not
"eventually".

regards, tom lane

#49

andres@anarazel.de

over 4 years ago

In reply to: Tom Lane (#48)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2021-06-15 12:51:28 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Oh, maybe I'm the one who misunderstood...

So, it's well over a year later, and so far as I can see exactly
nothing has been done about snapshot_too_old's problems.

I never liked that feature to begin with, and I would be very
glad to undertake the task of ripping it out. If someone thinks
this should not happen, please commit to fixing it ... and not
"eventually".

I still think that's the most reasonable course. I actually like the
feature, but I don't think a better implementation of it would share
much if any of the current infrastructure.

Greetings,

Andres Freund

#50

pg@bowt.ie

over 4 years ago

In reply to: Tom Lane (#48)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 9:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So, it's well over a year later, and so far as I can see exactly
nothing has been done about snapshot_too_old's problems.

FWIW I think that the concept itself is basically reasonable. The
implementation is very flawed, though, so it hardly enters into it.

I never liked that feature to begin with, and I would be very
glad to undertake the task of ripping it out. If someone thinks
this should not happen, please commit to fixing it ... and not
"eventually".

ISTM that this is currently everybody's responsibility, and therefore
nobody's responsibility. That's probably why the problems haven't been
resolved yet.

I propose that the revert question be explicitly timeboxed. If the
issues haven't been fixed by some date, then "snapshot too old"
automatically gets reverted without further discussion. This gives
qualified hackers the opportunity to save the feature if they feel
strongly about it, and are actually willing to take responsibility for
its ongoing maintenance.

--
Peter Geoghegan

#51

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Peter Geoghegan (#50)

Re: snapshot too old issues, first around wraparound and then more.

Peter Geoghegan <pg@bowt.ie> writes:

On Tue, Jun 15, 2021 at 9:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So, it's well over a year later, and so far as I can see exactly
nothing has been done about snapshot_too_old's problems.

I propose that the revert question be explicitly timeboxed. If the
issues haven't been fixed by some date, then "snapshot too old"
automatically gets reverted without further discussion. This gives
qualified hackers the opportunity to save the feature if they feel
strongly about it, and are actually willing to take responsibility for
its ongoing maintenance.

The goal I have in mind is for snapshot_too_old to be fixed or gone
in v15. I don't feel a need to force the issue sooner than that, so
there's plenty of time for someone to step up, if anyone wishes to.

I imagine that we should just ignore the question of whether anything
can be done for it in the back branches. Given the problems
identified upthread, fixing it in a non-back-patchable way would be
challenging enough.

regards, tom lane

#52

pg@bowt.ie

over 4 years ago

In reply to: Tom Lane (#51)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 11:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

The goal I have in mind is for snapshot_too_old to be fixed or gone
in v15. I don't feel a need to force the issue sooner than that, so
there's plenty of time for someone to step up, if anyone wishes to.

Seems more than reasonable to me. A year ought to be plenty of time if
the feature truly is salvageable.

What do other people think? Ideally we could commit to that hard
deadline now. To me the important thing is to actually have a real
deadline that forces the issue one way or another. This situation must
not be allowed to drag on forever.

--
Peter Geoghegan

#53

robertmhaas@gmail.com

over 4 years ago

In reply to: Tom Lane (#48)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

So, it's well over a year later, and so far as I can see exactly
nothing has been done about snapshot_too_old's problems.

Progress has been pretty limited, but not altogether nonexistent.
55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 fixed, or at least seemed to
fix, the time->XID mapping, which is one of the main things that
Andres said was broken originally. Also, there are patches on this
thread from Thomas Munro to add some test coverage for that case,
another problem Andres noted in his original email. I guess it
wouldn't be too hard to get something committed there, and I'm willing
to do it if Thomas doesn't want to and if there's any prospect of
salvaging the feature.

But that's not clear to me. I'm not clear how exactly how many
problems we know about and need to fix in order to keep the feature,
and I'm also not clear how deep the hole goes. Like, if we need to get
a certain number of specific bugs fixed, I might be willing to do
that. If we need to commit to a major rewrite of the current
implementation, that's more than I can do. But I guess I don't
understand exactly how bad the current problems are. Reviewing
complaints from Andres from this thread:

Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
update_ts threshold actually needs to be ts >= update_ts, right now we
don't handle being newer than the newest bin correctly afaict (mitigated
by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
to say, because there's no comments.

This seems specific enough to be analyzed and anything that is broken
can be fixed.

The whole lock nesting is very hazardous. Most (all?)
TestForOldSnapshot() calls happen with locks on on buffers held, and can
acquire lwlocks itself. In some older branches we do entire *catalog
searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

I think it's unclear whether there are live problems in master in this area.

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?

Discussion on this point trailed off. Upon rereading, I think Andres
is correct that there's an issue; the snapshot's LSN needs to be set
to a value not older than the last xlog insertion that has been
completed rather than, as now, the last one that is started. I guess
to get that value we would need to do something like
WaitXLogInsertionsToFinish(), or some approximation of it e.g.
GetXLogWriteRecPtr() at the risk of unnecessary snapshot-too-old
errors.

* In read-mostly workloads it can trigger errors in sessions that are
much younger than old_snapshot_threshold, if the transactionid is not
advancing.

I've not tried to reproduce, but I suspect this can also cause wrong
query results. Because a later snapshot can have the same xmin as
older transactions, it sure looks like we can end up with data from an
older xmin getting removed, but the newer snapshot's whenTaken will
prevent TestForOldSnapshot_impl() from raising an error.

I haven't really wrapped my head around this one, but it seems
amenable to a localized fix. It basically amounts to a complaint that
GetOldSnapshotThresholdTimestamp() is returning a newer value than it
should. I don't know exactly what's required to make it not do that,
but it doesn't seem intractable.

* I am fairly sure that it can cause crashes (or even data corruption),
because it assumes that DML never needs to check for old snapshots
(with no meaningful justificiation). Leading to heap_update/delete to
assume the page header is a tuple.

I don't understand the issue here, really. I assume there might be a
wrong word here, because assuming that the page header is a tuple
doesn't sound like a thing that would actually happen. I think one of
the key problems for this feature is figuring out whether you've got
snapshot-too-old checks in all the right places. I think what is being
alleged here is that heap_update() and heap_delete() need them, and
that it's not good enough to rely on the scan that found the tuple to
be updated or deleted having already performed those checks. It is not
clear to me whether that is true, or how it could cause crashes.
Andres may have explained this to me at some point, but if he did I
have unfortunately forgotten.

My general point here is that I would like to know whether we have a
finite number of reasonably localized bugs or a three-ring disaster
that is unrecoverable no matter what we do. Andres seems to think it
is the latter, and I *think* Peter Geoghegan agrees, but I think that
the point might be worth a little more discussion. I'm unclear whether
Tom's dislike for the feature represents hostility to the concept -
with which I would have to disagree - or a judgement on the quality of
the implementation - which might be justified. For the record, and to
Peter's point, I think it's reasonable to set v15 feature freeze as a
drop-dead date for getting this feature into acceptable shape, but I
would like to try to nail down what we think "acceptable" means in
this context.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

#54

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Robert Haas (#53)

Re: snapshot too old issues, first around wraparound and then more.

Robert Haas <robertmhaas@gmail.com> writes:

My general point here is that I would like to know whether we have a
finite number of reasonably localized bugs or a three-ring disaster
that is unrecoverable no matter what we do. Andres seems to think it
is the latter, and I *think* Peter Geoghegan agrees, but I think that
the point might be worth a little more discussion.

TBH, I am not clear on that either.

I'm unclear whether
Tom's dislike for the feature represents hostility to the concept -
with which I would have to disagree - or a judgement on the quality of
the implementation - which might be justified.

I think it's a klugy, unprincipled solution to a valid real-world
problem. I suspect the implementation issues are not unrelated to
the kluginess of the concept. Thus, I would really like to see us
throw this away and find something better. I admit I have nothing
to offer about what a better solution to the problem would look like.
But I would really like it to not involve random-seeming query failures.

regards, tom lane

#55

pg@bowt.ie

over 4 years ago

In reply to: Tom Lane (#54)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 12:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

My general point here is that I would like to know whether we have a
finite number of reasonably localized bugs or a three-ring disaster
that is unrecoverable no matter what we do. Andres seems to think it
is the latter, and I *think* Peter Geoghegan agrees, but I think that
the point might be worth a little more discussion.

TBH, I am not clear on that either.

I don't know for sure which it is, but that in itself isn't actually
what matters to me. The most concerning thing is that I don't really
know how to *assess* the design now. The clear presence of at least
several very severe bugs doesn't necessarily prove anything (it just
*hints* at major design problems).

If I could make a very clear definitive statement on this then I'd
probably have to do ~1/3 of the total required work -- that'd be my
guess. If it was easy to be quite sure here then we wouldn't still be
here 12 months later. In any case I don't think that the feature
deserves to be treated all that differently to something that was
committed much more recently, given what we know. Frankly it took me
about 5 minutes to find a very serious bug in the feature, pretty much
without giving it any thought. That is not a good sign.

I think it's a klugy, unprincipled solution to a valid real-world
problem. I suspect the implementation issues are not unrelated to
the kluginess of the concept. Thus, I would really like to see us
throw this away and find something better. I admit I have nothing
to offer about what a better solution to the problem would look like.
But I would really like it to not involve random-seeming query failures.

I would be very happy to see somebody take this up, because it is
important. The reality is that anybody that undertakes this task
should start with the assumption that they're starting from scratch,
at least until they learn otherwise. So ISTM that it might as well be
true that it needs a total rewrite, even if it turns out to not be
strictly true.

--
Peter Geoghegan

#56

pg@bowt.ie

over 4 years ago

In reply to: Robert Haas (#53)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 12:17 PM Robert Haas <robertmhaas@gmail.com> wrote:

My general point here is that I would like to know whether we have a
finite number of reasonably localized bugs or a three-ring disaster
that is unrecoverable no matter what we do. Andres seems to think it
is the latter, and I *think* Peter Geoghegan agrees, but I think that
the point might be worth a little more discussion. I'm unclear whether
Tom's dislike for the feature represents hostility to the concept -
with which I would have to disagree - or a judgement on the quality of
the implementation - which might be justified. For the record, and to
Peter's point, I think it's reasonable to set v15 feature freeze as a
drop-dead date for getting this feature into acceptable shape, but I
would like to try to nail down what we think "acceptable" means in
this context.

What I had in mind was this: a committer adopting the feature
themselves. The committer would be morally obligated to maintain the
feature on an ongoing basis, just as if they were the original
committer. This seems like the only sensible way of resolving this
issue once and for all.

If it really is incredibly important that we keep this feature, or one
like it, then I have to imagine that somebody will step forward --
there is still ample opportunity. But if nobody steps forward, I'll be
forced to conclude that perhaps it wasn't quite as important as I
first thought. Anybody can agree that it's important in an abstract
sense -- that's easy. What we need is a committer willing to sign on
the dotted line, which we're no closer to today than we were a year
ago. Actions speak louder than words.

--
Peter Geoghegan

#57

pg@bowt.ie

over 4 years ago

In reply to: Andres Freund (#23)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Apr 1, 2020 at 4:59 PM Andres Freund <andres@anarazel.de> wrote:

The primary issue here is that there is no TestForOldSnapshot() in
heap_hot_search_buffer(). Therefore index fetches will never even try to
detect that tuples it needs actually have already been pruned away.

This is still true, right? Nobody fixed this bug after 14 months? Even
though we're talking about returning rows that are not visible to the
xact's snapshot?

--
Peter Geoghegan

#58

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Peter Geoghegan (#56)

Re: snapshot too old issues, first around wraparound and then more.

Peter Geoghegan <pg@bowt.ie> writes:

What I had in mind was this: a committer adopting the feature
themselves. The committer would be morally obligated to maintain the
feature on an ongoing basis, just as if they were the original
committer. This seems like the only sensible way of resolving this
issue once and for all.

Yeah, it seems clear that we need somebody to do that, given that
Kevin Grittner has been inactive for awhile. Even if the known
problems can be resolved by drive-by patches, I think this area
needs an ongoing commitment from someone.

regards, tom lane

#59

andres@anarazel.de

over 4 years ago

In reply to: Robert Haas (#53)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2021-06-15 15:17:05 -0400, Robert Haas wrote:

But that's not clear to me. I'm not clear how exactly how many
problems we know about and need to fix in order to keep the feature,
and I'm also not clear how deep the hole goes. Like, if we need to get
a certain number of specific bugs fixed, I might be willing to do
that. If we need to commit to a major rewrite of the current
implementation, that's more than I can do. But I guess I don't
understand exactly how bad the current problems are. Reviewing
complaints from Andres from this thread:

One important complaints I think your (useful!) list missed is that there's
missing *read side* checks that demonstrably lead to wrong query results:
/messages/by-id/CAH2-Wz=FQ9rbBKkt1nXvz27kmd4A8i1+7dcLTNqpCYibxX83VQ@mail.gmail.com
and that it's currently very hard to figure out where they need to be, because
there's no real explained model of what needs to be checked and what not.

* I am fairly sure that it can cause crashes (or even data corruption),
because it assumes that DML never needs to check for old snapshots
(with no meaningful justificiation). Leading to heap_update/delete to
assume the page header is a tuple.

I don't understand the issue here, really. I assume there might be a
wrong word here, because assuming that the page header is a tuple
doesn't sound like a thing that would actually happen.

I suspect what I was thinking of is that a tuple could get pruned away due to
s_t_o, which would leave a LP_DEAD item around. As heap_update/delete neither
checks s_t_o, nor balks at targetting LP_DEAD items, we'd use the offset from
a the LP_DEAD item. ItemIdSetDead() sets lp_off to 0 - which would mean that
the page header is interpreted as a tuple. Right?

I think one of the key problems for this feature is figuring out
whether you've got snapshot-too-old checks in all the right places. I
think what is being alleged here is that heap_update() and
heap_delete() need them, and that it's not good enough to rely on the
scan that found the tuple to be updated or deleted having already
performed those checks. It is not clear to me whether that is true, or
how it could cause crashes. Andres may have explained this to me at
some point, but if he did I have unfortunately forgotten.

I don't think it is sufficient to rely on the scan. That works only as long as
the page with the to-be-modified tuple is pinned (since that'd prevent pruning
/ vacuuming from working on the page), but I am fairly sure that there are
plans where the target tuple is not pinned from the point it was scanned until
it is modified. In which case it is entirely possible that the u/d target can
be pruned away due to s_t_o between the scan checking s_t_o and the u/d
executing.

My general point here is that I would like to know whether we have a
finite number of reasonably localized bugs or a three-ring disaster
that is unrecoverable no matter what we do. Andres seems to think it
is the latter

Correct. I think there's numerous architectural issues the way the feature is
implemented right now, and that it'd be a substantial project to address them.

For the record, and to Peter's point, I think it's reasonable to set
v15 feature freeze as a drop-dead date for getting this feature into
acceptable shape, but I would like to try to nail down what we think
"acceptable" means in this context.

I think the absolute minimum would be to have
- actually working tests
- a halfway thorough code review of the feature
- added documentation explaining where exactly s_t_o tests need to be
- bugfixes obviously

If I were to work on the feature, I cannot imagine being sufficient confident
the feature works as long as the xid->time mapping granularity is a
minute. It's just not possible to write reasonable tests with the granularity
being that high. Or even to do manual tests of it - I'm not that patient. But
I "can accept" if somebody actually doing the work differs on this.

Greetings,

Andres Freund

#60

[1]: /messages/by-id/CA+hUKGJyw=uJ4eL1x=+vKm16fLaxNPvKUYtnChnRkSKi024u_A@mail.gmail.com

thomas.munro@gmail.com

over 4 years ago

In reply to: Robert Haas (#53)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Jun 16, 2021 at 7:17 AM Robert Haas <robertmhaas@gmail.com> wrote:

Progress has been pretty limited, but not altogether nonexistent.
55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 fixed, or at least seemed to
fix, the time->XID mapping, which is one of the main things that
Andres said was broken originally. Also, there are patches on this
thread from Thomas Munro to add some test coverage for that case,
another problem Andres noted in his original email. I guess it
wouldn't be too hard to get something committed there, and I'm willing
to do it if Thomas doesn't want to and if there's any prospect of
salvaging the feature.

FTR the latest patches are on a different thread[1]/messages/by-id/CA+hUKGJyw=uJ4eL1x=+vKm16fLaxNPvKUYtnChnRkSKi024u_A@mail.gmail.com. I lost steam on
that stuff because I couldn't find a systematic way to deal with the
lack of checks all over the place, or really understand how the whole
system fits together with confidence. Those patches to fix an xid
wraparound bug and make the testing work better may be useful and I'll
be happy to rebase them, depending on how this discussion goes, but it
seems a bit like the proverbial deckchairs on the Titanic from what
I'm reading... I think the technique I showed for exercising some
basic STO mechanisms and scenarios is probably useful, but I currently
have no idea how to prove much of anything about the whole system and
am not personally in a position to dive into that rabbit hole in a
PG15 time scale.

#61

Noah Misch

noah@leadboat.com

over 4 years ago

In reply to: Peter Geoghegan (#56)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 02:32:11PM -0700, Peter Geoghegan wrote:

What I had in mind was this: a committer adopting the feature
themselves. The committer would be morally obligated to maintain the
feature on an ongoing basis, just as if they were the original
committer. This seems like the only sensible way of resolving this
issue once and for all.

If it really is incredibly important that we keep this feature, or one
like it, then I have to imagine that somebody will step forward --
there is still ample opportunity. But if nobody steps forward, I'll be
forced to conclude that perhaps it wasn't quite as important as I
first thought.

Hackers are rather wise, but the variety of PostgreSQL use is enormous. We
see that, among other ways, when regression fixes spike in each vN.1. The
$SUBJECT feature was born in response to a user experience; a lack of hacker
interest doesn't invalidate that user experience. We face these competing
interests, at least:

1) Some users want the feature kept so their application can use a certain
pattern of long-running, snapshot-bearing transactions.

2) (a) Some hackers want the feature gone so they can implement changes
without making those changes cooperate with this feature. (b) Bugs in this
feature make such cooperation materially harder.

3) Some users want the feature gone because (2) is slowing the progress of
features they do want.

4) Some users want the feature kept because they don't use it but will worry
what else is vulnerable to removal. PostgreSQL has infrequent history of
removing released features. Normally, PostgreSQL lets some bugs languish
indefinitely, e.g. in
https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Live_issues

5) Some users want the feature gone because they try it, find a bug, and
regret trying it or fear trying other features.

A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
essentially. What other interests are relevant?

#62

pg@bowt.ie

over 4 years ago

In reply to: Noah Misch (#61)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 9:59 PM Noah Misch <noah@leadboat.com> wrote:

Hackers are rather wise, but the variety of PostgreSQL use is enormous. We
see that, among other ways, when regression fixes spike in each vN.1. The
$SUBJECT feature was born in response to a user experience; a lack of hacker
interest doesn't invalidate that user experience.

I agree that it would be good to hear from some users about this. If a
less painful workaround is possible at all, then users may be able to
help -- maybe it'll be possible to cut scope.

We face these competing
interests, at least:

1) Some users want the feature kept so their application can use a certain
pattern of long-running, snapshot-bearing transactions.

Undoubtedly true.

2) (a) Some hackers want the feature gone so they can implement changes
without making those changes cooperate with this feature. (b) Bugs in this
feature make such cooperation materially harder.

Is that really true? Though it was probably true back when this thread
was started last year, things have changed. Andres found a way to work
around the problems he had with snapshot too old, AFAIK.

A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
essentially. What other interests are relevant?

The code simply isn't up to snuff. If the code was in a niche contrib
module then maybe it would be okay to let this slide. But the fact is
that it touches critical parts of the system. This cannot be allowed
to drag on forever. It's as simple as that.

I admit that I think that the most likely outcome is that it gets
reverted. I don't feel great about that. What else can be done about
it that will really help the situation? No qualified person is likely
to have the time to commit to fixing snapshot too old. Isn't that the
real problem here?

--
Peter Geoghegan

#63

Noah Misch

noah@leadboat.com

over 4 years ago

In reply to: Peter Geoghegan (#62)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 10:47:45PM -0700, Peter Geoghegan wrote:

On Tue, Jun 15, 2021 at 9:59 PM Noah Misch <noah@leadboat.com> wrote:

Hackers are rather wise, but the variety of PostgreSQL use is enormous. We
see that, among other ways, when regression fixes spike in each vN.1. The
$SUBJECT feature was born in response to a user experience; a lack of hacker
interest doesn't invalidate that user experience.

I agree that it would be good to hear from some users about this. If a
less painful workaround is possible at all, then users may be able to
help -- maybe it'll be possible to cut scope.

It would be good. But if we don't hear from users in 2021 or 2022, that
doesn't invalidate what users already said in 2016.

We face these competing
interests, at least:

1) Some users want the feature kept so their application can use a certain
pattern of long-running, snapshot-bearing transactions.

Undoubtedly true.

2) (a) Some hackers want the feature gone so they can implement changes
without making those changes cooperate with this feature. (b) Bugs in this
feature make such cooperation materially harder.

Is that really true? Though it was probably true back when this thread
was started last year, things have changed. Andres found a way to work
around the problems he had with snapshot too old, AFAIK.

When I say "some hackers", I don't mean that specific people think such
thoughts right now. I'm saying that the expected cost of future cooperation
with the feature is nonzero, and bugs in the feature raise that cost. Perhaps
(5) has more weight than (2). (If (2), (3) and (5) all have little weight,
then PostgreSQL should just keep the feature with its bugs.)

A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
essentially. What other interests are relevant?

The code simply isn't up to snuff. If the code was in a niche contrib
module then maybe it would be okay to let this slide. But the fact is
that it touches critical parts of the system. This cannot be allowed
to drag on forever. It's as simple as that.

Even if we were to stipulate that this feature "isn't up to snuff", purging
PostgreSQL of substandard features may or may not add sufficient value to
compensate for (1) and (4).

#64

pg@bowt.ie

over 4 years ago

In reply to: Noah Misch (#63)

Re: snapshot too old issues, first around wraparound and then more.

On Tue, Jun 15, 2021 at 11:24 PM Noah Misch <noah@leadboat.com> wrote:

When I say "some hackers", I don't mean that specific people think such
thoughts right now. I'm saying that the expected cost of future cooperation
with the feature is nonzero, and bugs in the feature raise that cost.

I see.

A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
essentially. What other interests are relevant?

The code simply isn't up to snuff. If the code was in a niche contrib
module then maybe it would be okay to let this slide. But the fact is
that it touches critical parts of the system. This cannot be allowed
to drag on forever. It's as simple as that.

Even if we were to stipulate that this feature "isn't up to snuff", purging
PostgreSQL of substandard features may or may not add sufficient value to
compensate for (1) and (4).

I'm more concerned about 1 (compatibility) than about 4 (perception
that we deprecate things when we shouldn't), FWIW.

It's not that this is a substandard feature in the same way that (say)
contrib/ISN is a substandard feature -- it's not about the quality
level per se. Nor is it the absolute number of bugs. The real issue is
that this is a substandard feature that affects crucial areas of the
system. Strategically important things that we really cannot afford to
break.

--
Peter Geoghegan

#65

Greg Stark

stark@mit.edu

over 4 years ago

In reply to: Peter Geoghegan (#64)

Re: snapshot too old issues, first around wraparound and then more.

I think Andres's point earlier is the one that stands out the most for me:

I still think that's the most reasonable course. I actually like the
feature, but I don't think a better implementation of it would share
much if any of the current infrastructure.

That makes me wonder whether ripping the code out early in the v15
cycle wouldn't be a better choice. It would make it easier for someone
to start work on a new implementation.

There is the risk that the code would still be out and no new
implementation would have appeared by the release of v15 but it sounds
like that's people are leaning towards ripping it out at that point
anyways.

Fwiw I too think the basic idea of the feature is actually awesome.
There are tons of use cases where you might have one long-lived
transaction working on a dedicated table (or even a schema) that will
never look at the rapidly mutating tables in another schema and never
trigger the error even though those tables have been vacuumed many
times over during its run-time.

#66

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Greg Stark (#65)

Re: snapshot too old issues, first around wraparound and then more.

Greg Stark <stark@mit.edu> writes:

Fwiw I too think the basic idea of the feature is actually awesome.
There are tons of use cases where you might have one long-lived
transaction working on a dedicated table (or even a schema) that will
never look at the rapidly mutating tables in another schema and never
trigger the error even though those tables have been vacuumed many
times over during its run-time.

I agree that's a great use-case. I don't like this implementation though.
I think if you want to set things up like that, you should draw a line
between the tables it's okay for the long transaction to touch and those
it isn't, and then any access to the latter should predictably draw an
error. I really do not like the idea that it might work anyway, because
then if you accidentally break the rule, you have an application that just
fails randomly ... probably only on the days when the boss wants that
report *now* not later.

regards, tom lane

#67

Stephen Frost

sfrost@snowman.net

over 4 years ago

In reply to: Greg Stark (#65)

Re: snapshot too old issues, first around wraparound and then more.

Greetings,

* Greg Stark (stark@mit.edu) wrote:

I think Andres's point earlier is the one that stands out the most for me:

I still think that's the most reasonable course. I actually like the
feature, but I don't think a better implementation of it would share
much if any of the current infrastructure.

That makes me wonder whether ripping the code out early in the v15
cycle wouldn't be a better choice. It would make it easier for someone
to start work on a new implementation.

There is the risk that the code would still be out and no new
implementation would have appeared by the release of v15 but it sounds
like that's people are leaning towards ripping it out at that point
anyways.

Fwiw I too think the basic idea of the feature is actually awesome.
There are tons of use cases where you might have one long-lived
transaction working on a dedicated table (or even a schema) that will
never look at the rapidly mutating tables in another schema and never
trigger the error even though those tables have been vacuumed many
times over during its run-time.

I've long felt that the appropriate approach to addressing that is to
improve on VACUUM and find a way to do better than just having the
conditional of 'xmax < global min' drive if we can mark a given tuple as
no longer visible to anyone.

Not sure that all of the snapshot-too-old use cases could be solved that
way, nor am I even sure it's actually possible to make VACUUM smarter in
that way without introducing other problems or having to track much more
information than we do today, but it'd sure be nice if we could address
the use-case you outline above while also not introducing query
failures if that transaction does happen to decide to go look at some
other table (naturally, the tuples which are in that rapidly mutating
table that *would* be visible to the long-running transaction would have
to be kept around to make things work, but if it's rapidly mutating then
there's very likely lots of tuples that the long-running transaction
can't see in it, and which nothing else can either, and therefore could
be vacuumed).

Thanks,

Stephen

#68

tgl@sss.pgh.pa.us

over 4 years ago

In reply to: Stephen Frost (#67)

Re: snapshot too old issues, first around wraparound and then more.

Stephen Frost <sfrost@snowman.net> writes:

I've long felt that the appropriate approach to addressing that is to
improve on VACUUM and find a way to do better than just having the
conditional of 'xmax < global min' drive if we can mark a given tuple as
no longer visible to anyone.

Yeah, I think this scenario of a few transactions with old snapshots
and the rest with very new ones could be improved greatly if we exposed
more info about backends' snapshot state than just "oldest xmin". But
that might be expensive to do.

I remember that Heikki was fooling with a patch that reduced snapshots
to LSNs. If we got that done, it'd be practical to expose complete
info about backends' snapshot state in a lot of cases (i.e., anytime
you had less than N live snapshots).

Of course, there's still the question of how VACUUM could cheaply
apply such info to decide what could be purged.

regards, tom lane

#69

pg@bowt.ie

over 4 years ago

In reply to: Tom Lane (#68)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Jun 16, 2021 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I remember that Heikki was fooling with a patch that reduced snapshots
to LSNs. If we got that done, it'd be practical to expose complete
info about backends' snapshot state in a lot of cases (i.e., anytime
you had less than N live snapshots).

Of course, there's still the question of how VACUUM could cheaply
apply such info to decide what could be purged.

I would think that it wouldn't really matter inside VACUUM -- it would
only really need to be either an opportunistic pruning or an
opportunistic index deletion thing -- probably both. Most of the time
VACUUM doesn't seem to end up doing most of the work of removing
garbage versions. It's mostly useful for "floating garbage", to use
the proper GC memory management term.

It's not just because opportunistic techniques are where the real work
of removing garbage is usually done these days. It's also because
opportunistic techniques are triggered in response to an immediate
problem, like an overfull heap page or an imminent page split that
we'd like to avoid -- they can actually see what's going on at the
local level in a way that doesn't really work inside VACUUM.

This also means that they can cycle through strategies for a page,
starting with the cheapest and coarsest grained cleanup, progressing
to finer grained cleanup. You only really need the finer grained
cleanup when the coarse grained cleanup (simple OldestXmin style
cutoff) fails. And even then you only need to use the slowpath when
you have a pretty good idea that it'll actually be useful -- you at
least know up front that there are a bunch of RECENTLY_DEAD tuples
that very well might be freeable once you use the slow path.

We can leave the floating garbage inside heap pages that hardly ever
get opportunistic pruning behind for VACUUM. We might even find that
an advanced strategy that does clever things in order to cleanup
intermediate versions isn't actually needed all that often (it's
executed perhaps orders of magnitude less frequently than simple
opportunistic pruning is executed) -- even when the clever technique
really helps the workload. Technically opportunistic pruning might be
99%+ effective, even when it doesn't look like it's effective to
users. The costs in this area are often very nonlinear. It can be very
counterintuitive.

--
Peter Geoghegan

#70

andres@anarazel.de

over 4 years ago

In reply to: Noah Misch (#61)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2021-06-15 21:59:45 -0700, Noah Misch wrote:

Hackers are rather wise, but the variety of PostgreSQL use is enormous. We
see that, among other ways, when regression fixes spike in each vN.1. The
$SUBJECT feature was born in response to a user experience; a lack of hacker
interest doesn't invalidate that user experience. We face these competing
interests, at least:

1) Some users want the feature kept so their application can use a certain
pattern of long-running, snapshot-bearing transactions.

This is obviously true. However, given that the feature practically did
not work at all before 55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 and
still cannot really be described to work (e..g index scans returning
wrong query results), and there have been two complaints about it as far
as I know leads me to believe that it does not have a great many
features.

2) (a) Some hackers want the feature gone so they can implement changes
without making those changes cooperate with this feature. (b) Bugs in this
feature make such cooperation materially harder.

I think the a) part is a large problem. Primarily because it's so
unclear what one exactly has to do where (no docs/comments explaining
that) and because there's no usable test framework.

Greetings,

Andres Freund

#71

andres@anarazel.de

over 4 years ago

In reply to: Tom Lane (#68)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2021-06-16 13:04:07 -0400, Tom Lane wrote:

Yeah, I think this scenario of a few transactions with old snapshots
and the rest with very new ones could be improved greatly if we exposed
more info about backends' snapshot state than just "oldest xmin". But
that might be expensive to do.

I think it'd be pretty doable now. The snapshot scalability changes
separated out information needed to do vacuuming / pruning (i.e. xmin)
from the information needed to build a snapshot (xid, flags, subxids
etc). Because xmin is not frequently accessed from other backends
anymore, it is not important anymore to touch it as rarely as
possible. From the cross-backend POV I think it'd be practically free to
track a backend's xmax now.

It's not quite as obvious that it'd essentially free to track a
backend's xmax across all the snapshots it uses. I think we'd basically
need a second pairingheap in snapmgr.c to track the "most advanced"
xmax? That's *probably* fine, but I'm not 100% - Heikki wrote a faster
heap implementation for snapmgr.c for a reason I assume.

I think the hard part of this would be much more on the pruning / vacuum
side of things. There's two difficulties:

1) Keeping it cheap to determine whether a tuple can be vacuumed,
particularly while doing on-access pruning. This likely means that
we'd only assemble the information to do visibility determination for
rows above the "dead for everybody" horizon when encountering a
sufficiently old tuple. And then we need a decent datastructure for
checking whether an xid is in one of the "not needed" xid ranges.

This seems solvable.

2) Modeling when it is safe to remove row versions. It is easy to remove
a tuple that was inserted and deleted within one "not needed" xid
range, but it's far less obvious when it is safe to remove row
versions where prior/later row versions are outside of such a gap.

Consider e.g. an update chain where the oldest snapshot can see one
row version, then there is a chain of rows that could be vacuumed
except for the old snapshot, and then there's a live version. If the
old session updates the row version that is visible to it, it needs
to be able to follow the xid chain.

This seems hard to solve in general.

It perhaps is sufficiently effective to remove row version chains
entirely within one removable xid range. And it'd probably doable to
also address the case where a chain is larger than one range, as long
as all the relevant row versions are within one page: We can fix up
the ctids of older still visible row versions to point to the
successor of pruned row versions.

But I have a hard time seeing a realistic approach to removing chains
that span xid ranges and multiple pages. The locking and efficiency
issues seem substantial.

Greetings,

Andres

#72

pg@bowt.ie

over 4 years ago

In reply to: Andres Freund (#70)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Jun 16, 2021 at 11:06 AM Andres Freund <andres@anarazel.de> wrote:

2) (a) Some hackers want the feature gone so they can implement changes
without making those changes cooperate with this feature. (b) Bugs in this
feature make such cooperation materially harder.

I think the a) part is a large problem. Primarily because it's so
unclear what one exactly has to do where (no docs/comments explaining
that) and because there's no usable test framework.

Right. This is what I meant yesterday, when talking about design
issues. It's really not about the bugs so much. We probably could go
through them one by one until things stopped being visibly broken,
without going to a huge amount of trouble -- it's not that hard to
paper over these things without anybody noticing. This is clear just
when you look at how long it took anybody to notice the problems we do
have. Whether or not that amounts to "just fixing the bugs" is perhaps
open to interpretation. Either way I would not be comfortable with
even claiming that "fixing the bugs" in this way actually makes the
situation better overall -- it might make it even worse. So in a more
fundamental sense it would actually be really hard to fix these bugs.
I would never have confidence in a fix like that.

I really don't see a way around it -- we have to declare technical
debt bankruptcy here. Whether or not that means reverting the feature
or rewriting it from scratch remains to be seen. That's another
question entirely, and has everything to do with somebody's
willingness to adopt the project and little to do with how any
individual feels about it -- just like with a new feature. It does us
no good to put off the question indefinitely.

--
Peter Geoghegan

#73

andres@anarazel.de

over 4 years ago

In reply to: Peter Geoghegan (#69)

Re: snapshot too old issues, first around wraparound and then more.

Hi,

On 2021-06-16 10:44:49 -0700, Peter Geoghegan wrote:

On Wed, Jun 16, 2021 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Of course, there's still the question of how VACUUM could cheaply
apply such info to decide what could be purged.

I would think that it wouldn't really matter inside VACUUM -- it would
only really need to be either an opportunistic pruning or an
opportunistic index deletion thing -- probably both. Most of the time
VACUUM doesn't seem to end up doing most of the work of removing
garbage versions. It's mostly useful for "floating garbage", to use
the proper GC memory management term.

I don't fully agree with this. For one, there are workloads where VACUUM
removes the bulk of the dead tuples. For another, slowing down VACUUM
can cause a slew of follow-on problems, so being careful to not
introduce new bottlenecks is important. And I don't think just doing
this optimization as part of on-access pruning is reasonable
solution. And it's not like making on-access pruning slower is
unproblematic either.

But as I said nearby, I think the hardest part is figuring out how to
deal with ctid chains, not the efficiency of the xid->visibility lookup
(or the collection of data necessary for that lookup).

Greetings,

Andres Freund

#74

pg@bowt.ie

over 4 years ago

In reply to: Andres Freund (#71)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Jun 16, 2021 at 11:27 AM Andres Freund <andres@anarazel.de> wrote:

2) Modeling when it is safe to remove row versions. It is easy to remove
a tuple that was inserted and deleted within one "not needed" xid
range, but it's far less obvious when it is safe to remove row
versions where prior/later row versions are outside of such a gap.

Consider e.g. an update chain where the oldest snapshot can see one
row version, then there is a chain of rows that could be vacuumed
except for the old snapshot, and then there's a live version. If the
old session updates the row version that is visible to it, it needs
to be able to follow the xid chain.

This seems hard to solve in general.

As I've said to you before, I think that it would make sense to solve
the problem inside heap_index_delete_tuples() first (for index tuple
deletion) -- implement and advanced version for heap pruning later.
That gives users a significant benefit without requiring that you
solve this hard problem with xmin/xmax and update chains.

I don't think that it matters that index AMs still only have LP_DEAD
bits set when tuples are dead to all snapshots including the oldest.
Now that we can batch TIDs within each call to
heap_index_delete_tuples() to pick up "extra" deletable TIDs from the
same heap blocks, we'll often be able to delete a significant number
of extra index tuples whose TIDs are in a "not needed" range. Whereas
today, without the "not needed" range mechanism in place, we just
delete the index tuples that are LP_DEAD-set already, plus maybe a few
others ("extra index tuples" that are not even needed by the oldest
snapshot) -- but that's it.

We might miss our chance to ever delete the nearby index tuples
forever, just because we missed the opportunity once. Recall that the
LP_DEAD bit being set for an index tuple isn't just information about
the index tuple in Postgres 14+ -- it also suggests that the *heap
block* has many more index tuples that we can delete that aren't
LP_DEAD set in the index. And so nbtree will check those extra nearby
TIDs out in passing within heap_index_delete_tuples(). We currently
lose this valuable hint about the heap block forever if we delete the
LP_DEAD-set index tuples, unless we get lucky and somebody sets a few
more index tuples for the same heap blocks before the next time the
leaf page fills up (and heap_index_delete_tuples() must be called).

--
Peter Geoghegan

#75

pg@bowt.ie

over 4 years ago

In reply to: Andres Freund (#73)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Jun 16, 2021 at 12:06 PM Andres Freund <andres@anarazel.de> wrote:

I would think that it wouldn't really matter inside VACUUM -- it would
only really need to be either an opportunistic pruning or an
opportunistic index deletion thing -- probably both. Most of the time
VACUUM doesn't seem to end up doing most of the work of removing
garbage versions. It's mostly useful for "floating garbage", to use
the proper GC memory management term.

I don't fully agree with this. For one, there are workloads where VACUUM
removes the bulk of the dead tuples.

It's definitely much more important that VACUUM run often when non-HOT
updates are the norm, and there are lots of them. But that's probably
not going to be helped all that much by this technique anyway.

Mostly I'm just saying I'd start elsewhere and do heapam later. And
probably do VACUUM itself last of all, if that usefully cut scope.

For another, slowing down VACUUM
can cause a slew of follow-on problems, so being careful to not
introduce new bottlenecks is important. And I don't think just doing
this optimization as part of on-access pruning is reasonable
solution. And it's not like making on-access pruning slower is
unproblematic either.

I think that on-access pruning is much more important because it's the
only hope we have of keeping the original heap page intact, in the
sense that there are no non-HOT updates over time, though there may be
many HOT updates. And no LP_DEAD items ever accumulate. It's not so
much about cleaning up bloat as it is about *preserving* the heap
pages in this sense.

If in the long run it's impossible to keep the page intact in this
sense then we will still have most of our current problems. It might
not make that much practical difference if we simply delay the problem
-- we kinda have to prevent it entirely, at least for a given
workload. So I'm mostly concerned about keeping things stable over
time, at the level of individual pages.

But as I said nearby, I think the hardest part is figuring out how to
deal with ctid chains, not the efficiency of the xid->visibility lookup
(or the collection of data necessary for that lookup).

Definitely true.

--
Peter Geoghegan

#76

Stephen Frost

sfrost@snowman.net

over 4 years ago

In reply to: Peter Geoghegan (#75)

Re: snapshot too old issues, first around wraparound and then more.

Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:

On Wed, Jun 16, 2021 at 12:06 PM Andres Freund <andres@anarazel.de> wrote:

I would think that it wouldn't really matter inside VACUUM -- it would
only really need to be either an opportunistic pruning or an
opportunistic index deletion thing -- probably both. Most of the time
VACUUM doesn't seem to end up doing most of the work of removing
garbage versions. It's mostly useful for "floating garbage", to use
the proper GC memory management term.

I don't fully agree with this. For one, there are workloads where VACUUM
removes the bulk of the dead tuples.

It's definitely much more important that VACUUM run often when non-HOT
updates are the norm, and there are lots of them. But that's probably
not going to be helped all that much by this technique anyway.

I don't follow this argument. Surely there are many, many cases out
there where there's very few HOT updates but lots of non-HOT updates
which create lots of dead rows that can't currently be cleaned up if
there's a long running transaction hanging around.

Mostly I'm just saying I'd start elsewhere and do heapam later. And
probably do VACUUM itself last of all, if that usefully cut scope.

Not quite following what 'elsewhere' means here or what it would entail
if it involves cleaning up dead tuples but doesn't involve heapam. I
can sort of follow the idea of working on the routine page-level cleanup
of tuples rather than VACUUM, except that would seem to require one to
deal with the complexities of ctid chains discussed below and therefore
be a larger and more complicated effort than if one were to tackle
VACUUM and perhaps in the first round cut scope by explicitly ignoring
ctid chains.

For another, slowing down VACUUM
can cause a slew of follow-on problems, so being careful to not
introduce new bottlenecks is important. And I don't think just doing
this optimization as part of on-access pruning is reasonable
solution. And it's not like making on-access pruning slower is
unproblematic either.

I don't know that slowing down VACUUM, which already goes purposefully
slow by default when run out of autovacuum, needs to really be stressed
over, particularly when what we're talking about here are CPU cycles. I
do think it'd make sense to have a heuristic which decides if we're
going to put in the effort to try to do this kind of pruning. That is-
if the global Xmin and the current transaction are only a few thousand
apart or something along those lines then don't bother, but if there's
been 100s of thousands of transactions then enable it (perhaps allowing
control over this or allowing users to explicitly ask VACUUM to 'work
harder' or such).

I think that on-access pruning is much more important because it's the
only hope we have of keeping the original heap page intact, in the
sense that there are no non-HOT updates over time, though there may be
many HOT updates. And no LP_DEAD items ever accumulate. It's not so
much about cleaning up bloat as it is about *preserving* the heap
pages in this sense.

If in the long run it's impossible to keep the page intact in this
sense then we will still have most of our current problems. It might
not make that much practical difference if we simply delay the problem
-- we kinda have to prevent it entirely, at least for a given
workload. So I'm mostly concerned about keeping things stable over
time, at the level of individual pages.

I do think that's a worthwhile goal, but if we could get some kind of
cleanup happening, that strikes me as better than the nothing that we
have today. Which side makes sense to tackle first is certainly a
discussion that could be had but I'd go for "do the simple thing first".

But as I said nearby, I think the hardest part is figuring out how to
deal with ctid chains, not the efficiency of the xid->visibility lookup
(or the collection of data necessary for that lookup).

Definitely true.

It strikes me that stressing over ctid chains, while certainly something
to consider, at this point is putting the cart before the horse in this
discussion- there's not much sense in it if we haven't actually got the
data collection piece figured out and working (and hopefully in a manner
that minimizes the overhead from it) and then worked out the logic to
figure out if a given tuple is actually visible to any running
transaction. As I say above, it seems like it'd be a great win even if
it was initially only able to deal with 'routine'/non-chained cases and
only with VACUUM.

The kind of queue tables that I'm thinking of, at least, are ones like
what PgQ uses: https://github.com/pgq/pgq

Now, that already works around our lacking here by using TRUNCATE and
table rotation, but if we improved here then it'd potentially be able to
be rewritten to use routine DELETE's instead of TRUNCATE. Even the
UPDATEs which are done to process a batch for a subscriber look to be
non-HOT due to updating indexed fields anyway (in
pgq.next_batch_custom(), it's setting subscription.sub_batch which has a
UNIQUE btree on it). Looks like there's a HOT UPDATE for the queue
table when a table swap happens, but that UPDATE wouldn't actually be
necessary if we'd fix the issue with just routine INSERT/DELETE leading
to tons of dead tuples that can't be VACUUM'd if a long running
transaction is running, and I doubt that UPDATE was actually
intentionally designed to take advantage of HOT, it just happened to
work that way.

The gist of what I'm trying to get at here is that the use-cases I've
seen, and where people have put in the effort to work around the long
running transaction vs. VACUUM issue by using hacks like table swapping
and TRUNCATE, aren't cases where there's a lot of HOT updating happening
on the tables that are getting bloated due to VACUUM being unable to
clean up tuples. So, if that's actually the easier thing to tackle,
fantastic, let's do it and then figure out how to improve on it to
handle the more complicated cases later. (This presumes that it's
actually possible to essentially 'skip' the hard cases and still have a
working implementation, of course).

Thanks,

Stephen

#77

Noah Misch

noah@leadboat.com

over 4 years ago

In reply to: Tom Lane (#66)

Re: snapshot too old issues, first around wraparound and then more.

On Wed, Jun 16, 2021 at 12:00:57PM -0400, Tom Lane wrote:

Greg Stark <stark@mit.edu> writes:

I think Andres's point earlier is the one that stands out the most for me:

I still think that's the most reasonable course. I actually like the
feature, but I don't think a better implementation of it would share
much if any of the current infrastructure.

That makes me wonder whether ripping the code out early in the v15
cycle wouldn't be a better choice. It would make it easier for someone
to start work on a new implementation.

Deleting the feature early is better than deleting the feature late,
certainly. (That doesn't tell us about the relative utility of deleting the
feature early versus never deleting the feature.)

Fwiw I too think the basic idea of the feature is actually awesome.
There are tons of use cases where you might have one long-lived
transaction working on a dedicated table (or even a schema) that will
never look at the rapidly mutating tables in another schema and never
trigger the error even though those tables have been vacuumed many
times over during its run-time.

I agree that's a great use-case. I don't like this implementation though.
I think if you want to set things up like that, you should draw a line
between the tables it's okay for the long transaction to touch and those
it isn't, and then any access to the latter should predictably draw an
error.

I agree that would be a useful capability, but it solves a different problem.

I really do not like the idea that it might work anyway, because
then if you accidentally break the rule, you have an application that just
fails randomly ... probably only on the days when the boss wants that
report *now* not later.

Every site adopting SERIALIZABLE learns that transactions can fail due to
mostly-unrelated concurrent activity. ERRCODE_SNAPSHOT_TOO_OLD is another
kind of serialization failure, essentially. Moreover, one can opt for an
old_snapshot_threshold value longer than the runtime of the boss's favorite
report. Of course, nobody would reject a replacement that has all the
advantages of old_snapshot_threshold and fewer transaction failures. Once
your feature rewrite starts taking away advantages to achieve fewer
transaction failures, that rewrite gets a lot more speculative.

#78