Allow "snapshot too old" error, to prevent bloat

Started by Kevin Grittner · about 11 years ago · 43 messages · pgsql-hackers
#1 Kevin Grittner
Kevin.Grittner@wicourts.gov

This patch is related to the "Reduce pinning in btree indexes"
patch submitted here:

/messages/by-id/721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com

That describes how they evolved and how they relate; I won't
duplicate that here.

Unlike the other patch, this one is more at the "proof of concept"
phase, because it requires support in the heap and each index AM to
work correctly; so far I have only had time to cover the heap and
btree indexes. In spite of that, I have thrown the worst test
cases I could think of at it (and only succeeded in uncovering a
bug which was already out there in production), and it has shown
its value in a two-day test simulating a 300-user load with
complex real-world applications (although the only indexes it used
were btree indexes). Without the patches the database growth was
39GB per day; with the patches it was 28.5GB per day. (The test
does involve more inserts than deletes, so some growth is
expected.) At the end of the tests, pgstattuple reported eight
times as many dead tuples in the database without the patches.
More importantly, without the patches the CPU load started at 60%
and showed linear growth to 92% over the course of the first day;
with the patches it stayed at a stable 60% throughout the test.

What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed. It also
saves the current insertion LSN into every snapshot when it is
created. When reading from the heap or any index, if the snapshot
is vulnerable to showing incorrect data because the threshold has
been crossed since it was generated, reading any page with an LSN
past the snapshot LSN causes a "snapshot too old" error to be
thrown. Since this is LSN-based, the new logic is not used for any
relation which is not WAL-logged.

Note that if you don't read data from a page which has been
modified after your snapshot was taken, the threshold doesn't
matter.
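The rule described above can be sketched as follows. This is a hypothetical Python simplification for illustration, not the patch's C code; all names are made up:

```python
class Snapshot:
    """Simplified snapshot carrying the insertion LSN saved at creation."""
    def __init__(self, lsn, xid_at_creation):
        self.lsn = lsn                      # insertion LSN when the snapshot was taken
        self.xid_at_creation = xid_at_creation

def check_page(page_lsn, snapshot, current_xid, old_snapshot_threshold):
    """Raise 'snapshot too old' only when BOTH hold: the threshold of
    consumed transaction IDs has been crossed since the snapshot was
    taken, AND the page being read was modified after the snapshot."""
    if old_snapshot_threshold < 0:
        return                              # -1: legacy behavior, never error
    snapshot_too_old = (current_xid - snapshot.xid_at_creation) > old_snapshot_threshold
    page_newer_than_snapshot = page_lsn > snapshot.lsn
    if snapshot_too_old and page_newer_than_snapshot:
        raise RuntimeError("snapshot too old")
```

Note that reading an unmodified page (page_lsn <= snapshot.lsn) never errors, no matter how old the snapshot is, which is the property the paragraph above emphasizes.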

All `make installcheck` tests succeed with any setting. With a
setting of 0 (the most extreme), `make installcheck-world` sees
four isolation tests fail. Those all pass if you raise the
setting to 2. The postgres_fdw test needs a setting of 4 to
succeed. I would expect most shops would want to tune this to
something in the six-digit to eight-digit range. In the tests
mentioned above it was set to 150000 (which corresponded to just
under 4 minutes of txid consumption) and there were no "snapshot
too old" errors, even though some cursors were left open for the
entire two-day run.

The patch still lacks (as mentioned above) support for index AMs
other than btree, and lacks documentation for the new GUC. I'm
sure that there are some comments and README files that need
adjustment, too. As I said, this is still POC.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

snapshot-too-old-v1.patch (text/x-diff, +164 -48)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#1)
Re: Allow "snapshot too old" error, to prevent bloat

Kevin Grittner <kgrittn@ymail.com> writes:

What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed.

TBH, I'm not sure why we'd wish to emulate Oracle's single worst
operational feature.

Unlike the other patch, this one is more at the "proof of concept"
phase, because it requires support in the heap and each index AM to
work correctly; so far I have only had time to cover the heap and
btree indexes.

But, having said that, why would the index AMs care? Seems like what
you are describing should be strictly a matter for VACUUM's removal
rules. If we're going to have something as ugly as this, I would much
rather it had a very small code footprint.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#2)
Re: Allow "snapshot too old" error, to prevent bloat

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kevin Grittner <kgrittn@ymail.com> writes:

What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed.

TBH, I'm not sure why we'd wish to emulate Oracle's single worst
operational feature.

I've run into cases where people have suffered horribly bloated
databases because of one ill-behaved connection. Some companies
don't want to be vulnerable to that and the disruption that
recovery from that bloat causes.

Note that the default is to maintain legacy PostgreSQL behavior;
the postgresql.conf file must be modified to see this error.

Unlike the other patch, this one is more at the "proof of concept"
phase, because it requires support in the heap and each index AM to
work correctly; so far I have only had time to cover the heap and
btree indexes.

But, having said that, why would the index AMs care? Seems like what
you are describing should be strictly a matter for VACUUM's removal
rules. If we're going to have something as ugly as this, I would much
rather it had a very small code footprint.

If a tuple is vacuumed away, you would not notice that if you were
running an index scan because you would not visit the page it used
to be on to see that the LSN had changed. We tried to sell the
idea that *any* scan using a snapshot past the threshold should
error out (which is the only way I can see to avoid making the
index AMs aware of this), but they insisted that there was no
reason to have scans fail when reading an unmodified table using an
old snapshot, or even an unmodified page of a modified table.

Due to the cost to this company of modifying their software to deal
with different behavior, they will not move without the behavior
I'm shooting for with this patch. Due to the value of this
customer to us, we will put this (or something to accomplish this
behavior) into our fork. We're happy to share it with the
community. If the community does not feel they need this behavior,
or wants to generate the error more aggressively to avoid touching
the index AMs, I understand.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#3)
Re: Allow "snapshot too old" error, to prevent bloat

Kevin Grittner <kgrittn@ymail.com> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kevin Grittner <kgrittn@ymail.com> writes:

What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed.

TBH, I'm not sure why we'd wish to emulate Oracle's single worst
operational feature.

I've run into cases where people have suffered horribly bloated
databases because of one ill-behaved connection. Some companies
don't want to be vulnerable to that and the disruption that
recovery from that bloat causes.

No doubt, preventing bloat is a good thing, but that doesn't mean this
is the best API we could create for the issue. The proposition this
patch offers to DBAs is: "You can turn this knob to reduce bloat by some
hard-to-quantify factor. The cost is that some long-running transactions
might fail. You won't know which ones are at risk, the failures won't be
the same from time to time, and you won't be able to do anything to spare
high-value transactions from that fate except by turning that knob back
again globally (which requires a database restart)." Maybe refugees from
Oracle will think that sounds good, but nobody else will.

I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option. That would be far
easier for users to understand and manage, it would be trivial to allow
specific high-value transactions to run with a laxer setting, it does not
create any implementation-detail-dependent behavior that we'd be having to
explain to users forevermore, and (not incidentally) it would be a lot
simpler and more trustworthy to implement. There's no well-defined
correlation between your setting and the net effect on database bloat,
but that's true with the "snapshot too old" approach as well.

regards, tom lane


#5 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#4)
Re: Allow "snapshot too old" error, to prevent bloat

On 2/15/15 10:36 AM, Tom Lane wrote:

Kevin Grittner <kgrittn@ymail.com> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Kevin Grittner <kgrittn@ymail.com> writes:

What this patch does is add a GUC called old_snapshot_threshold. It
defaults to -1, which leaves behavior matching unpatched code.
Above that it allows tuples to be vacuumed away after the number of
transaction IDs specified by the GUC have been consumed.

TBH, I'm not sure why we'd wish to emulate Oracle's single worst
operational feature.

I've run into cases where people have suffered horribly bloated
databases because of one ill-behaved connection. Some companies
don't want to be vulnerable to that and the disruption that
recovery from that bloat causes.

No doubt, preventing bloat is a good thing, but that doesn't mean this
is the best API we could create for the issue. The proposition this
patch offers to DBAs is: "You can turn this knob to reduce bloat by some
hard-to-quantify factor. The cost is that some long-running transactions
might fail. You won't know which ones are at risk, the failures won't be
the same from time to time, and you won't be able to do anything to spare
high-value transactions from that fate except by turning that knob back
again globally (which requires a database restart)." Maybe refugees from
Oracle will think that sounds good, but nobody else will.

I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option. That would be far
easier for users to understand and manage, it would be trivial to allow
specific high-value transactions to run with a laxer setting, it does not
create any implementation-detail-dependent behavior that we'd be having to
explain to users forevermore, and (not incidentally) it would be a lot
simpler and more trustworthy to implement. There's no well-defined
correlation between your setting and the net effect on database bloat,
but that's true with the "snapshot too old" approach as well.

A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.

An idea that I've had on this would be some way to "lock down" the
tables that a long-running transaction could access. That would allow
vacuum to ignore any snapshots that transaction had for tables it wasn't
accessing. That's something that would be deterministic.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#5)
Re: Allow "snapshot too old" error, to prevent bloat

Jim Nasby <Jim.Nasby@bluetreble.com> writes:

On 2/15/15 10:36 AM, Tom Lane wrote:

I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option.

A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.

Agreed, but Kevin's proposal has exactly the same problem only worse,
because (a) the reporting transactions might or might not fail (per
Murphy, they'll fail consistently only when you're on deadline), and
(b) if they do fail, there's nothing you can do short of increasing the
slack db-wide.

An idea that I've had on this would be some way to "lock down" the
tables that a long-running transaction could access. That would allow
vacuum to ignore any snapshots that transaction had for tables it wasn't
accessing. That's something that would be deterministic.

There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad, especially when you're talking about long-running, expensive-to-
retry transactions.

regards, tom lane


#7 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Tom Lane (#6)
Re: Allow "snapshot too old" error, to prevent bloat

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jim Nasby <Jim.Nasby@bluetreble.com> writes:

On 2/15/15 10:36 AM, Tom Lane wrote:

I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option.

We suggested this to our customer and got out of the meeting with
it looking like it *might* fly. In the next meeting, however, they
said they had run it by others and reviewed the code and it was
completely unacceptable -- they would not consider pg with this as
the solution.

A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.

Agreed, but Kevin's proposal has exactly the same problem only worse,
because (a) the reporting transactions might or might not fail (per
Murphy, they'll fail consistently only when you're on deadline), and
(b) if they do fail, there's nothing you can do short of increasing the
slack db-wide.

These they were comfortable with, and did *not* consider to be
unpredictable or something they could not do something about.
I really don't feel I can say more than that, though, without
disclosing more than I should.

An idea that I've had on this would be some way to "lock down" the
tables that a long-running transaction could access. That would allow
vacuum to ignore any snapshots that transaction had for tables it wasn't
accessing. That's something that would be deterministic.

While this option was not specifically suggested, based on the
reasons they gave for rejecting numerous other options we suggested,
I feel sure that this would not be acceptable.

I think Tom hit the nail on the head when he said "Maybe refugees
from Oracle will think that sounds good..." It is precisely those
with very large code bases which have been modified over long
periods of time to work with Oracle that would find this solution
attractive, or perhaps *necessary* to consider a move to Postgres.
That's a potential market we don't care to write off.

There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad, especially when you're talking about long-running, expensive-to-
retry transactions.

What the customer most doesn't want to be "nondeterministic" is
whether the error is generated only when the snapshot has been used
to read a page which has been modified since the snapshot was
taken. If tables or materialized views are set up before a query
and then not subsequently modified during the execution of the
query, that query must not be canceled even if it runs for days,
but it must not cause unconstrained bloat during the run. So far I
don't see any way to avoid including the LSN with the snapshot or
modifying the index AMs. Let's be clear on the footprint for that;
for the btree implementation it is:

src/backend/access/nbtree/nbtinsert.c | 7 ++++---
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
src/include/access/nbtree.h | 7 ++++---
4 files changed, 43 insertions(+), 16 deletions(-)

What this discussion has made me reconsider is the metric for
considering a transaction "too old". The number of transaction IDs
consumed seems inferior as the unit of measure for that to LSN or
time.

It looks to me to be pretty trivial (on the order of maybe 30 lines
of code) to specify this GUC in minutes rather than transaction
IDs. At first glance this seems like it would be vulnerable to the
usual complaints about mismanaged clocks, but those are easily
answered by using a cached time in shared memory that we populate
in such a way that it can never move backward. Done right, this
could even allow the GUC to be changeable on reload rather than
only at restart. A badly mismanaged system clock could not cause a
query to generate incorrect results; the worst that could happen is
that this feature would fail to control bloat as much as expected
or reads of modified data could generate the "snapshot too old"
error around the time of the clock adjustment.
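The never-moves-backward cached time could be sketched like this (a hypothetical illustration of the idea, not actual PostgreSQL code; the module-level variable stands in for the value kept in shared memory):

```python
_cached_time = 0.0  # stands in for the timestamp kept in shared memory

def observe_clock(now):
    """Fold a (possibly misbehaving) system clock reading into a cached
    timestamp that never moves backward.  Snapshot ages are computed
    against this value, so a clock stepping backward can delay bloat
    control but can never cause incorrect query results."""
    global _cached_time
    if now > _cached_time:
        _cached_time = now
    return _cached_time
```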

As before, this would default to a magic value to mean that you
want the historical PostgreSQL behavior.

If that makes the idea more palatable to anyone, I can submit a
patch to that effect within the next 24 hours.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#8 Andres Freund
andres@anarazel.de
In reply to: Kevin Grittner (#7)
Re: Allow "snapshot too old" error, to prevent bloat

On 2015-02-16 05:19:11 +0000, Kevin Grittner wrote:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jim Nasby <Jim.Nasby@bluetreble.com> writes:

On 2/15/15 10:36 AM, Tom Lane wrote:

I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option.

We suggested this to our customer and got out of the meeting with
it looking like it *might* fly. In the next meeting, however, they
said they had run it by others and reviewed the code and it was
completely unacceptable -- they would not consider pg with this as
the solution.

A common use-case is long-running reports hitting relatively stable data
in a database that also has tables with a high churn rate (ie: queue
tables). In those scenarios your only options right now are to suffer
huge amounts of bloat in the high-churn tables or not do your reporting. A
simple transaction timeout only "solves" this by denying you reporting
queries.

Agreed, but Kevin's proposal has exactly the same problem only worse,
because (a) the reporting transactions might or might not fail (per
Murphy, they'll fail consistently only when you're on deadline), and
(b) if they do fail, there's nothing you can do short of increasing the
slack db-wide.

These they were comfortable with, and did *not* consider to be
unpredictable or something they could not do something about.
I really don't feel I can say more than that, though, without
disclosing more than I should.

I agree that we need to do something about the dangers of long-running
snapshots, but I'm not sure this is it. I'm not sure of the contrary either.

What I'm definitely not a fan of though, is this implementation. Having
to add checks to a large number of places is a recipe for silently wrong
query results.

One thing I was wondering about recently was introducing an optimization
for repeatedly updated rows into the vacuum code: A row that has xmin =
xmax where these have committed can be removed, even if the xid is above
the xmin horizon - no other transaction is ever going to see it. While
there's some hairy-ness implementing that, it doesn't seem too hard. And
there's a fair number of cases where that'd allow less bloat to
accumulate. Obviously it'd be better if we had logic to do that for
other patterns as well (where the updates aren't in the same xact), but
it seems like a good start.
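The removal rule being proposed can be sketched as follows, using bare integer xids for simplicity (a hypothetical illustration of the idea, not vacuum code):

```python
def removable_despite_horizon(xmin, xmax, committed_xids, global_xmin):
    """A tuple inserted and deleted by the same committed transaction
    (xmin == xmax) was never visible to any other transaction, so it
    can be removed even when its xid is above the global xmin horizon."""
    if xmin == xmax and xmin in committed_xids:
        return True
    # otherwise fall back to the usual rule: deleted by a committed
    # transaction that is below every snapshot's xmin horizon
    return xmax != 0 and xmax in committed_xids and xmax < global_xmin
```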

There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad, especially when you're talking about long-running, expensive-to-
retry transactions.

What the customer most doesn't want to be "nondeterministic" is
whether the error is generated only when the snapshot has been used
to read a page which has been modified since the snapshot was
taken. If tables or materialized views are set up before a query
and then not subsequently modified during the execution of the
query, that query must not be canceled even if it runs for days,
but it must not cause unconstrained bloat during the run. So far I
don't see any way to avoid including the LSN with the snapshot or
modifying the index AMs. Let's be clear on the footprint for that;
for the btree implementation it is:

IIUC you never would want cancellations when accessing the tables
these long-running backends actually access. The errors about too old
snapshots are just a stopgap because we can't compute the xmin per table
in a meaningful way atm. Right?

Is the bloat caused by rows these transactions actually can see or are
the not-removed rows newer than the transaction's xmax?

Since you actually don't want cancellations for the long-running
reporting queries it very much might be sensible to switch to a more
complicated version of HeapTupleSatisfiesVacuum() if there are
long-running queries. One that can recognize if rows are actually
visible to any backend, or if there are just snapshots that see older
rows. I've previously wondered how hard this would be, but I don't
think it's *that* hard. And it'd be a much more general solution.
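The more general visibility test described here amounts to checking, for a deleted tuple, whether any registered snapshot falls inside the tuple's lifetime. A sketch using simplified integer "as-of" xids (hypothetical; not HeapTupleSatisfiesVacuum itself):

```python
def dead_to_every_snapshot(xmin, xmax, snapshot_xids):
    """A deleted tuple (created at xmin, deleted at xmax, both committed)
    is reclaimable iff no snapshot was taken during its lifetime: every
    snapshot either predates the insert or postdates the delete."""
    return all(s < xmin or s >= xmax for s in snapshot_xids)
```

This is what would let vacuum reclaim rows that fall in the gaps between old and new snapshots, rather than being limited by the single oldest xmin.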

Greetings,

Andres Freund


#9 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#8)
Re: Allow "snapshot too old" error, to prevent bloat

On 2/16/15 2:41 AM, Andres Freund wrote:

Since you actually don't want cancellations for the long running
reporting queries it very much might be sensible to switch to a more
complicated version of HeapTupleSatisfiesVacuum() if there are long-running
queries. One that can recognize if rows are actually visible to any
backend, or if there are just snapshots that see older rows. I've
previously wondered how hard this would be, but I don't think it's
*that* hard. And it'd be a much more general solution.

Josh Berkus mentioned on IRC that they're working on improving vacuum,
and I think it was something along these lines.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#10 Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#4)
Re: Allow "snapshot too old" error, to prevent bloat

On Sun, Feb 15, 2015 at 11:36:50AM -0500, Tom Lane wrote:

I wonder if we couldn't achieve largely the same positive effects through
adding a simple transaction-level timeout option. That would be far
easier for users to understand and manage, it would be trivial to allow
specific high-value transactions to run with a laxer setting, it does not
create any implementation-detail-dependent behavior that we'd be having to
explain to users forevermore, and (not incidentally) it would be a lot
simpler and more trustworthy to implement. There's no well-defined
correlation between your setting and the net effect on database bloat,
but that's true with the "snapshot too old" approach as well.

While we have prided ourselves on not generating query cancel messages
due to snapshot-too-old, there is the downside of bloat. I understand
the need to allow users to make the trade-off between bloat and query
cancellation.

It seems we already have a mechanism in place that allows tuning of
query cancel on standbys vs. preventing standby queries from seeing old
data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We obsessed
about how users were going to react to these odd variables, but there
has been little negative feedback.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


#11 Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#10)
Re: Allow "snapshot too old" error, to prevent bloat

On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:

It seems we already have a mechanism in place that allows tuning of
query cancel on standbys vs. preventing standby queries from seeing old
data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We obsessed
about how users were going to react to these odd variables, but there
has been little negative feedback.

FWIW, I think that's a somewhat skewed perception. I think it was right to
introduce those, because we didn't really have any alternatives.

But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for questions
and complaints that I/we see.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#12 Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#11)
Re: Allow "snapshot too old" error, to prevent bloat

On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:

On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:

It seems we already have a mechanism in place that allows tuning of
query cancel on standbys vs. preventing standby queries from seeing old
data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We obsessed
about how users were going to react to these odd variables, but there
has been little negative feedback.

FWIW, I think that's a somewhat skewed perception. I think it was right to
introduce those, because we didn't really have any alternatives.

But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for questions
and complaints that I/we see.

Agreed.

And a really bad one used to be vacuum_defer_cleanup_age, because of
confusing units amongst other things. Which in turn seems fairly close to
Kevin's suggestion, unfortunately.

/Magnus

#13 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Magnus Hagander (#12)
Re: Allow "snapshot too old" error, to prevent bloat

Magnus Hagander <magnus@hagander.net> wrote:

On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:

On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:

It seems we already have a mechanism in place that allows
tuning of query cancel on standbys vs. preventing standby
queries from seeing old data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We
obsessed about how users were going to react to these odd
variables, but there has been little negative feedback.

FWIW, I think that's a somewhat skewed perception. I think it
was right to introduce those, because we didn't really have any
alternatives.

As far as I can see, the "alternatives" suggested so far are all
about causing heap bloat to progress more slowly, but still without
limit. I suggest, based on a lot of user feedback (from the
customer I've talked about at some length on this thread, as well
as numerous others), that unlimited bloat based on the activity of
one connection is a serious deficiency in the product; and that
there is no real alternative to something like a "snapshot too old"
error if we want to fix that deficiency. Enhancements to associate
a snapshot with a database and using a separate vacuum xmin per
database, not limiting vacuum of a particular object by snapshots
that cannot see that object, etc., are all Very Good Things and I
hope those changes are made, but none of that fixes a very
fundamental flaw.

But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for
questions and complaints that I/we see.

Agreed.
And a really bad one used to be vacuum_defer_cleanup_age, because
of confusing units amongst other things. Which in turn seems
fairly close to Kevin's suggestion, unfortunately.

Particularly my initial suggestion, which was to base snapshot
"age" on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.

Unless this idea gets some +1 votes Real Soon Now, I will mark the
patch as Rejected and move on.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14 Stephen Frost
sfrost@snowman.net
In reply to: Kevin Grittner (#13)
Re: Allow "snapshot too old" error, to prevent bloat

Kevin,

* Kevin Grittner (kgrittn@ymail.com) wrote:

Magnus Hagander <magnus@hagander.net> wrote:

On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:

On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:

It seems we already have a mechanism in place that allows
tuning of query cancel on standbys vs. preventing standby
queries from seeing old data, specifically
max_standby_streaming_delay/max_standby_archive_delay. We
obsessed about how users were going to react to these odd
variables, but there has been little negative feedback.

FWIW, I think that's a somewhat skewed perception. I think it
was right to introduce those, because we didn't really have any
alternatives.

As far as I can see, the "alternatives" suggested so far are all
about causing heap bloat to progress more slowly, but still without
limit. I suggest, based on a lot of user feedback (from the
customer I've talked about at some length on this thread, as well
as numerous others), that unlimited bloat based on the activity of
one connection is a serious deficiency in the product; and that
there is no real alternative to something like a "snapshot too old"
error if we want to fix that deficiency. Enhancements to associate
a snapshot with a database and using a separate vacuum xmin per
database, not limiting vacuum of a particular object by snapshots
that cannot see that snapshot, etc., are all Very Good Things and I
hope those changes are made, but none of that fixes a very
fundamental flaw.

I also agree with the general idea that it makes sense to provide a way
to control bloat, but I think you've missed what Andres was getting at
with his suggestion (if I understand correctly, apologies if I don't).

The problem is that we're only looking at the overall xmin / xmax
horizon when it comes to deciding if a given tuple is dead. That's
not quite right- the tuple might actually be dead to all *current*
transactions by being newer than the oldest transaction but dead for all
later transactions. Basically, there exist gaps between our cluster
wide xmin / xmax where we might find actually dead rows. Marking those
rows dead and reusable *would* stop the bloat, not just slow it down.

In the end, with a single long-running transaction, the worst bloat you
would have is double the size of the system at the time the long-running
transaction started. Another way of thinking about it is with this
timeline:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
/\                   /\                /\               /\
Long running         row created       row deleted      vacuum
transaction starts

Now, if all transactions currently running started either before the row
was created or after the row was deleted, then the row is dead and
vacuum can reclaim it. Finding those gaps in the transaction space for
all of the currently running processes might not be cheap, but for this
kind of a use-case, it might be worth the effort. On a high-churn
system where the actual set of rows being copied over constantly isn't
very high then you could end up with some amount of duplication of rows-
those to satisfy the ancient transaction and the current working set,
but there wouldn't be any further bloat and it'd almost certainly be a
much better situation than what is being seen now.

Particularly my initial suggestion, which was to base snapshot
"age" it on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.

Unless this idea gets some +1 votes Real Soon Now, I will mark the
patch as Rejected and move on.

I'm not against having a knob like this, which is defaulted to off,
but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up. If this knob's default is off then I don't think
we'd be likely to get the complaints which are being discussed (or, if
we did, we could point the individual at the admin who set the knob...).

Thanks!

Stephen

#15Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Stephen Frost (#14)
Re: Allow "snapshot too old" error, to prevent bloat

Stephen Frost <sfrost@snowman.net> wrote:

I also agree with the general idea that it makes sense to provide a way
to control bloat, but I think you've missed what Andres was getting at
with his suggestion (if I understand correctly, apologies if I don't).

The problem is that we're only looking at the overall xmin / xmax
horizon when it comes to deciding if a given tuple is dead. That's
not quite right- the tuple might actually be dead to all *current*
transactions by being newer than the oldest transaction but dead for all
later transactions. Basically, there exist gaps between our cluster
wide xmin / xmax where we might find actually dead rows. Marking those
rows dead and reusable *would* stop the bloat, not just slow it down.

In the end, with a single long-running transaction, the worst bloat you
would have is double the size of the system at the time the long-running
transaction started.

I agree that limiting bloat to one dead tuple for every live one
for each old snapshot is a limit that has value, and it was unfair
of me to characterize that as not being a limit. Sorry for that.

This possible solution was discussed with the user whose feedback
caused me to write this patch, but there were several reasons they
dismissed that as a viable total solution for them, two of which I
can share:

(1) They have a pool of connections each of which can have several
long-running cursors, so the limit from that isn't just doubling
the size of their database, it is limiting it to some two-or-three
digit multiple of the necessary size.

(2) They are already prepared to deal with "snapshot too old"
errors on queries that run more than about 20 minutes and which
access tables which are modified. They would rather do that than
suffer the bloat beyond that point.

IMO all of these changes people are working on are very valuable, and
complement each other. This particular approach is likely to be
especially appealing to those moving from Oracle because it is a
familiar mechanism, and one which some of them have written their
software to be aware of and deal with gracefully.

I'm not against having a knob like this, which is defaulted to off,

Thanks! I'm not sure that amounts to a +1, but at least it doesn't
sound like a -1. :-)

but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up.

+1

But they are not mutually exclusive; I see them as complementary.

If this knob's default is off then I don't think
we'd be likely to get the complaints which are being discussed (or, if
we did, we could point the individual at the admin who set the knob...).

That's how I see it, too. I would not suggest making the default
anything other than "off", but there are situations where it would
be a nice tool to have in the toolbox.

Thanks for the feedback!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#6)
Re: Allow "snapshot too old" error, to prevent bloat

On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There might be something in that, but again it's not much like this patch.
The key point I think we're both making is that nondeterministic failures
are bad, especially when you're talking about long-running, expensive-to-
retry transactions.

But I think Kevin would agree with you there. That's why he's interested in
having the errors not occur if you don't read from the volatile tables. Ie,
your reporting query reading from read-only tables would be guaranteed to
succeed even if some other table had had some rows vacuumed away.

I'm not sure that's really going to work because of things like hint bit
updates or hot pruning. That is, say your table is read-only now but last
week some of the records were updated. You might reasonably expect those
rows to be safe and indeed the rows themselves will still be in your
report. But the earlier versions of the rows could still be sitting on some
pages invisible to every query but not vacuumable quite yet and then
suddenly they get vacuumed away and the LSN on the page gets bumped.

Fwiw I would strongly suggest that instead of having a number of
transactions to have a time difference. I haven't been following the
"commit timestamp" thread but I'm hoping that patch has the infrastructure
needed to look up an lsn and find out the timestamp and vice versa. So when
you take a snapshot you could look up the effective cut-off LSN based on
the time difference specified in the GUC. As a DBA I would find it
infinitely more comforting specifying "old_snapshot_threshold=4h" than
specifying some arbitrary large number of transactions and hoping I
calculated the transaction rate reasonably accurately.

--
greg

#17Stephen Frost
sfrost@snowman.net
In reply to: Kevin Grittner (#15)
Re: Allow "snapshot too old" error, to prevent bloat

* Kevin Grittner (kgrittn@ymail.com) wrote:

Stephen Frost <sfrost@snowman.net> wrote:

I also agree with the general idea that it makes sense to provide a way
to control bloat, but I think you've missed what Andres was getting at
with his suggestion (if I understand correctly, apologies if I don't).

The problem is that we're only looking at the overall xmin / xmax
horizon when it comes to deciding if a given tuple is dead. That's
not quite right- the tuple might actually be dead to all *current*
transactions by being newer than the oldest transaction but dead for all
later transactions. Basically, there exist gaps between our cluster
wide xmin / xmax where we might find actually dead rows. Marking those
rows dead and reusable *would* stop the bloat, not just slow it down.

In the end, with a single long-running transaction, the worst bloat you
would have is double the size of the system at the time the long-running
transaction started.

I agree that limiting bloat to one dead tuple for every live one
for each old snapshot is a limit that has value, and it was unfair
of me to characterize that as not being a limit. Sorry for that.

This possible solution was discussed with the user whose feedback
caused me to write this patch, but there were several reasons they
dismissed that as a viable total solution for them, two of which I
can share:

(1) They have a pool of connections each of which can have several
long-running cursors, so the limit from that isn't just doubling
the size of their database, it is limiting it to some two-or-three
digit multiple of the necessary size.

This strikes me as a bit off-the-cuff; was an analysis done which
determined that would be the result? If there is overlap between the
long-running cursors then there would be less bloat, and most systems
which I'm familiar with don't turn the entire database over in 20
minutes, 20 hours, or even 20 days except in pretty specific cases.
Perhaps this is one of those, and if so then I'm all wet, but the
feeling I get is that this is a way to dismiss this solution because
it's not what's wanted, which is "what Oracle did."

(2) They are already prepared to deal with "snapshot too old"
errors on queries that run more than about 20 minutes and which
access tables which are modified. They would rather do that than
suffer the bloat beyond that point.

That, really, is the crux here- they've already got support for dealing
with it the way Oracle did and they'd like PG to do that too.
Unfortunately, that, by itself, isn't a good reason for a particular
capability (we certainly aren't going to be trying to duplicate PL/SQL
in PG any time soon). That said, there are capabilities in other
RDBMS's which are valuable and which we *do* want, so the fact that
Oracle does this also isn't a reason to not include it.

IMO all of these changes people are working on are very valuable, and
complement each other. This particular approach is likely to be
especially appealing to those moving from Oracle because it is a
familiar mechanism, and one which some of them have written their
software to be aware of and deal with gracefully.

For my 2c, I'd much rather provide them with a system where they don't
have to deal with broken snapshots than give them a way to have them the
way Oracle provided them. :) That said, even the approach Andres
outlined will cause bloat and it may be beyond what's acceptable in some
environments, and it's certainly more complicated and unlikely to get
done in the short term.

I'm not against having a knob like this, which is defaulted to off,

Thanks! I'm not sure that amounts to a +1, but at least it doesn't
sound like a -1. :-)

So, at the time I wrote that, I wasn't sure if it was a +1 or not
myself. I've been thinking about it since then, however, and I'm
leaning more towards having the capability than not, so perhaps that's a
+1, but it doesn't excuse the need to come up with an implementation
that everyone can be happy with and what you've come up with so far
doesn't have a lot of appeal, based on the feedback (I've only glanced
through it myself, but I agree with Andres and Tom that it's a larger
footprint than we'd want for this and the number of places having to be
touched is concerning as it could lead to future bugs).

A lot of that would go away if there was a way to avoid having to mess
with the index AMs, I'd think, but I wonder if we'd actually need more
done there- it's not immediately obvious to me how an index-only scan is
safe with this. Whenever an actual page is visited, we can check the
LSN, and the relation can't be truncated by vacuum since the transaction
will still have a lock on the table which prevents it, but does the
visibility-map update check make sure to never mark pages all-visible
when one of these old transactions is running around? On what basis?

but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up.

+1

But they are not mutually exclusive; I see them as complementary.

I can see how they would be, provided we can be confident that we're
going to actually throw an error when the snapshot is out of date and
not end up returning incorrect results. We need to be darn sure of
that, both now and in a few years from now when many of us may have
forgotten about this knob.. ;)

If this knob's default is off then I don't think
we'd be likely to get the complaints which are being discussed (or, if
we did, we could point the individual at the admin who set the knob...).

That's how I see it, too. I would not suggest making the default
anything other than "off", but there are situations where it would
be a nice tool to have in the toolbox.

Agreed.

Thanks!

Stephen

#18Magnus Hagander
magnus@hagander.net
In reply to: Kevin Grittner (#13)
Re: Allow "snapshot too old" error, to prevent bloat

On Wed, Feb 18, 2015 at 10:57 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Magnus Hagander <magnus@hagander.net> wrote:

On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com>

wrote:

On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
But max_standby_streaming_delay, max_standby_archive_delay and
hot_standby_feedback are among the most frequent triggers for
questions and complaints that I/we see.

Agreed.
And a really bad one used to be vacuum_defer_cleanup_age, because
of confusing units amongst other things. Which in turn seems
fairly close to Kevin's suggestions, unfortunately.

Particularly my initial suggestion, which was to base snapshot
"age" it on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.

Yes, it would definitely look much better. My reference per above was
exactly that - having a setting in the unit "number of xids" confused a lot
of users and made it really hard to tune. Having something in time units is
a lot easier to understand and tune for most people.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#19Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Bruce Momjian (#16)
Re: Allow "snapshot too old" error, to prevent bloat

Greg Stark <stark@mit.edu> wrote:

On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

There might be something in that, but again it's not much like
this patch. The key point I think we're both making is that
nondeterministic failures are bad, especially when you're
talking about long-running, expensive-to-retry transactions.

But I think Kevin would agree with you there. That's why he's
interested in having the errors not occur if you don't read from
the volatile tables. Ie, your reporting query reading from
read-only tables would be guaranteed to succeed even if some
other table had had some rows vacuumed away.

I'm not sure that's really going to work because of things like
hint bit updates or hot pruning. That is, say your table is
read-only now but last week some of the records were updated. You
might reasonably expect those rows to be safe and indeed the rows
themselves will still be in your report. But the earlier versions
of the rows could still be sitting on some pages invisible to
every query but not vacuumable quite yet and then suddenly they
get vacuumed away and the LSN on the page gets bumped.

That's a very interesting point. In the case that the reporting
tables are like that (versus being created right before the run,
for the report), it would argue for either very aggressive
autovacuum settings or an explicit vacuum on those tables before
starting the report.

Fwiw I would strongly suggest that instead of having a number of
transactions to have a time difference.

On the 15th I said this:

| What this discussion has made me reconsider is the metric for
| considering a transaction "too old". The number of transaction IDs
| consumed seems inferior as the unit of measure for that to LSN or
| time.
|
| It looks to me to be pretty trivial (on the order of maybe 30 lines
| of code) to specify this GUC in minutes rather than transaction
| IDs. At first glance this seems like it would be vulnerable to the
| usual complaints about mismanaged clocks, but those are easily
| answered by using a cached time in shared memory that we populate
| in such a way that it can never move backward. Done right, this
| could even allow the GUC to be changeable on reload rather than
| only at restart. A badly mismanaged system clock could not cause a
| query to generate incorrect results; the worst that could happen is
| that this feature would fail to control bloat as much as expected
| or reads of modified data could generate the "snapshot too old"
| error around the time of the clock adjustment.
|
| As before, this would default to a magic value to mean that you
| want the historical PostgreSQL behavior.
|
| If that makes the idea more palatable to anyone, I can submit a
| patch to that effect within the next 24 hours.

Until yesterday I didn't get any feedback suggesting that such a
change would make the patch more palatable. Now that I have had,
I'll try to post a patch to that effect today.

Thanks!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Stephen Frost (#17)
Re: Allow "snapshot too old" error, to prevent bloat

Stephen Frost <sfrost@snowman.net> wrote:

Kevin Grittner (kgrittn@ymail.com) wrote:

Stephen Frost <sfrost@snowman.net> wrote:

In the end, with a single long-running transaction, the worst bloat you
would have is double the size of the system at the time the long-running
transaction started.

I agree that limiting bloat to one dead tuple for every live one
for each old snapshot is a limit that has value, and it was unfair
of me to characterize that as not being a limit. Sorry for that.

This possible solution was discussed with the user whose feedback
caused me to write this patch, but there were several reasons they
dismissed that as a viable total solution for them, two of which I
can share:

(1) They have a pool of connections each of which can have several
long-running cursors, so the limit from that isn't just doubling
the size of their database, it is limiting it to some two-or-three
digit multiple of the necessary size.

This strikes me as a bit off-the-cuff; was an analysis done which
determined that would be the result?

If it sounds like off-the-cuff it is because I am summarizing the
mountain of data which has been accumulated over weeks of
discussions and many rounds of multi-day tests using their actual
software driven by a software test driver that simulates users,
with "think time" and all.

To be usable, we need to run on a particular set of hardware with a
simulated load of 3000 users. In a two-day test without the
patches I'm proposing, their actual, working production software
running against PostgreSQL bloated the database by 37% and was on a
linear rate of increase. Unpatched, CPU time consumed at a linear
rate due to the bloat until in the second day it saturated the CPUs
and the driver was unable to get acceptable user performance.
Because of that we haven't moved on to test what the bloat rate
would be like with 3000 users, but I assume it would be higher than
with 300 users.

With the two patches I submitted, bloat stabilized at less than 5%
except for some btree indexes which followed a pattern of inserting
at the end and deleting most (but not all) of the entries over
time. I've been working on a separate patch for that, but it's not
ready to present, so I probably won't post that until the first CF
of the next release -- unless someone's interested enough to want
to discuss it before then.

(2) They are already prepared to deal with "snapshot too old"
errors on queries that run more than about 20 minutes and which
access tables which are modified. They would rather do that than
suffer the bloat beyond that point.

That, really, is the crux here- they've already got support for dealing
with it the way Oracle did and they'd like PG to do that too.
Unfortunately, that, by itself, isn't a good reason for a particular
capability (we certainly aren't going to be trying to duplicate PL/SQL
in PG any time soon).

I understand that, and if the community doesn't want this style of
limitation on bloat we can make it part of PPAS (which *does*
duplicate PL/SQL, among other things). We are offering it to the
community first.

That said, there are capabilities in other RDBMS's which are
valuable and which we *do* want, so the fact that Oracle does this
also isn't a reason to not include it.

:-)

I'm not against having a knob like this, which is defaulted to off,

Thanks! I'm not sure that amounts to a +1, but at least it doesn't
sound like a -1. :-)

So, at the time I wrote that, I wasn't sure if it was a +1 or not
myself. I've been thinking about it since then, however, and I'm
leaning more towards having the capability than not, so perhaps that's a
+1, but it doesn't excuse the need to come up with an implementation
that everyone can be happy with and what you've come up with so far
doesn't have a lot of appeal, based on the feedback (I've only glanced
through it myself, but I agree with Andres and Tom that it's a larger
footprint than we'd want for this and the number of places having to be
touched is concerning as it could lead to future bugs).

This conversation has gotten a little confusing because some of my
posts seem to have been held up in the email systems for some
reason, and are arriving out-of-order (and at least one I don't see
yet). But I have been responding to these points. For one thing I
stated four days ago that I now feel that the metric for
determining that a snapshot is "old" should be *time*, not
transaction IDs consumed, and offered to modify that patch in that
direction if people agreed. I now see one person agreeing and
several people arguing that I should do just that (as though they
had not seen me propose it), so I'll try to get a new version out
today like that. In any event, I am sure it is possible and feel
that it would be better. (Having posted that now at least 4 times,
hopefully people will pick up on it.)

For another thing, as mentioned earlier, this is the btree footprint:

src/backend/access/nbtree/nbtinsert.c | 7 ++++---
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
src/include/access/nbtree.h | 7 ++++---
4 files changed, 43 insertions(+), 16 deletions(-)

That consists mainly of two things. First, there are 13 places
(all in nbtsearch.c) immediately following a call to
BufferGetPage() that a line like this was added:

TestForOldSnapshot(snapshot, rel, page);

Second, some functions that were not previously passed a snapshot
had that added to the function signature and the snapshot was
passed from the caller where needed. C files other than nbtsearch
were only modified to pass a NULL in to calls to the functions with
modified signatures -- in a total of four places.

That is it.

To be sure I had to initially spend some time figuring out which
BufferGetPage() calls needed to be followed with this test, but
it's not rocket science.

A lot of that would go away if there was a way to avoid having to mess
with the index AMs, I'd think, but I wonder if we'd actually need more
done there- it's not immediately obvious to me how an index-only scan is
safe with this. Whenever an actual page is visited, we can check the
LSN, and the relation can't be truncated by vacuum since the transaction
will still have a lock on the table which prevents it, but does the
visibility-map update check make sure to never mark pages all-visible
when one of these old transactions is running around? On what basis?

I will review that; it may indeed need some attention. The index
entry is removed before the line pointer can be freed, but the
question is whether there is a window where pruning or vacuum has
removed the tuple and changed the page visibility map bit to say
that everything on the page is visible to all before the index
entry is removed. There may be a need to, as you say, suppress
setting the all-visible flag in some particular cases.

but I do think we'd be a lot better off with a system that could
realize when rows are not visible to any currently running transaction
and clean them up.

+1

But they are not mutually exclusive; I see them as complementary.

I can see how they would be, provided we can be confident that we're
going to actually throw an error when the snapshot is out of date and
not end up returning incorrect results. We need to be darn sure of
that, both now and in a few years from now when many of us may have
forgotten about this knob.. ;)

I get that this is not zero-cost to maintain, and that it might not
be of interest to the community largely for that reason. If you
have any ideas about how to improve that, I'm all ears. But do
consider the actual scope of what was needed for the btree coverage
(above). (Note that the index-only scan issue doesn't look like
anything AM-specific; if something is needed for that it would be
in the pruning/vacuum area.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21Andrew Dunstan
andrew@dunslane.net
In reply to: Kevin Grittner (#19)
#22Rod Taylor
rbt@rbt.ca
In reply to: Kevin Grittner (#13)
#23Ants Aasma
ants.aasma@cybertec.at
In reply to: Kevin Grittner (#20)
#24Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Andrew Dunstan (#21)
#25Claudio Freire
klaussfreire@gmail.com
In reply to: Kevin Grittner (#24)
#26Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Rod Taylor (#22)
#27Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Ants Aasma (#23)
#28Andrew Dunstan
andrew@dunslane.net
In reply to: Kevin Grittner (#27)
#29Ants Aasma
ants.aasma@cybertec.at
In reply to: Kevin Grittner (#27)
#30Stephen Frost
sfrost@snowman.net
In reply to: Kevin Grittner (#20)
#31David Steele
david@pgmasters.net
In reply to: Kevin Grittner (#26)
#32Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Stephen Frost (#30)
#33Amit Kapila
amit.kapila16@gmail.com
In reply to: Kevin Grittner (#7)
#34Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Amit Kapila (#33)
#35Andrew Dunstan
andrew@dunslane.net
In reply to: Kevin Grittner (#34)
#36Amit Kapila
amit.kapila16@gmail.com
In reply to: Kevin Grittner (#34)
#37Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Amit Kapila (#36)
#38Simon Riggs
simon@2ndQuadrant.com
In reply to: Magnus Hagander (#12)
#39Rui DeSousa
rui@crazybean.net
In reply to: Simon Riggs (#38)
#40Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Rui DeSousa (#39)
#41Robert Haas
robertmhaas@gmail.com
In reply to: Rui DeSousa (#39)
#42Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#41)
#43Andy Fan
zhihui.fan1213@gmail.com
In reply to: Tom Lane (#6)