Turning off HOT/Cleanup sometimes

Started by Simon Riggs about 12 years ago · 119 messages
#1 Simon Riggs
simon@2ndQuadrant.com

VACUUM cleans up blocks, which is nice because it happens offline in a
lazy manner.

We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

Effects of that are that long running statements often run much longer
than we want, increasing bloat as a result. It also produces wildly
varying response times, depending upon extent of cleanup required.

It is a simple task to make that behaviour optional on the master.

I propose a USERSET parameter, prune_cost_limit (<---insert better name here)
which will make the behaviour optional, default -1, in normal user
processes. VACUUM will ignore this parameter and so its actions will
never be deferred.

In detail, this parameter would disable pruning for any scan larger
than the cost limit. So large scans will disable the behaviour. The
default, -1, means never disable pruning, which is the current
behaviour.

We track the number of pages dirtied by the current statement. When
this reaches prune_cost_limit, we will apply these behaviours to all
shared_buffer block accesses...

(1) avoid running heap_page_prune_opt()

(2) avoid dirtying the buffer for hints. (This is safe because the
hinted changes will either be lost or will be part of the full page
image when we make a logged-change).

(i.e. doesn't apply to temp tables)

For example, if we set prune_cost_limit = 4 this behaviour allows
small index lookups via bitmapheapscan to continue to cleanup, while
larger index and seq scans will avoid cleanup.

There would be a postgresql.conf parameter prune_cost_limit, as well
as a table level parameter that would prevent pruning except via
VACUUM.
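A minimal sketch of the proposed accounting, assuming the semantics described above (a statement-scoped counter of dirtied pages, with -1 preserving current behaviour). The class and method names here are hypothetical illustrations, not PostgreSQL code:

```python
# Hypothetical simulation of the proposed prune_cost_limit accounting.
# Only the parameter name and the -1 default come from the proposal;
# the structure is invented for illustration.

class StatementScan:
    def __init__(self, prune_cost_limit=-1):
        self.prune_cost_limit = prune_cost_limit
        self.pages_dirtied = 0   # pages dirtied by the current statement

    def may_prune(self):
        # -1 preserves current behaviour: always attempt pruning.
        if self.prune_cost_limit == -1:
            return True
        # Once the statement has dirtied prune_cost_limit pages, stop
        # opportunistic pruning (and hint-bit dirtying) for the rest of it.
        return self.pages_dirtied < self.prune_cost_limit

    def visit_page(self, page_needs_prune):
        if page_needs_prune and self.may_prune():
            self.pages_dirtied += 1   # pruning dirties the buffer
            return "pruned"
        return "skipped"
```

Under this model a small bitmap heap scan touching a handful of pages still cleans all of them, while a large seq scan stops pruning after the first `prune_cost_limit` dirtied pages.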

This will help in these ways
* Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries
* Allow finer grained control over Hot Standby conflicts
* Potentially allow diagnostic inspection of older data via SeqScan

Prototype patch shows this is possible and simple enough for 9.4.
Major objections? Or should I polish up and submit?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#1)
Re: Turning off HOT/Cleanup sometimes

On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

VACUUM cleans up blocks, which is nice because it happens offline in a
lazy manner.

We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

On a pgbench workload, though, essentially all page cleanup happens as
a result of HOT cleanups, like >99.9%. It might be OK to have that
happen for write operations, but it would be a performance disaster if
updates didn't try to HOT-prune. Our usual argument for doing HOT
pruning even on SELECT cleanups is that not doing so pessimizes
repeated scans, but there are clearly cases that end up worse off as a
result of that decision.

I'm not entirely wild about adding a parameter in this area because it
seems that we're increasingly choosing to further expose what arguably
ought to be internal implementation details. The recent wal_log_hints
parameter is another recent example of this that I'm not thrilled
with, but in that case, as in this one, I can see the value of it.
Still, I think it'd be loads better to restrict what you're talking
about here to the SELECT-only case; I have a strong feeling that this
will be a disaster on write workloads.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#2)
Re: Turning off HOT/Cleanup sometimes

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

On a pgbench workload, though, essentially all page cleanup happens as
a result of HOT cleanups, like >99.9%. It might be OK to have that
happen for write operations, but it would be a performance disaster if
updates didn't try to HOT-prune. Our usual argument for doing HOT
pruning even on SELECT cleanups is that not doing so pessimizes
repeated scans, but there are clearly cases that end up worse off as a
result of that decision.

My recollection of the discussion when HOT was developed is that it works
that way not because anyone thought it was beneficial, but simply because
we didn't see an easy way to know when first fetching a page whether we're
going to try to UPDATE some tuple on the page. (And we can't postpone the
pruning, because the query will have tuple pointers into the page later.)
Maybe we should work a little harder on passing that information down.
It seems reasonable to me that SELECTs shouldn't be tasked with doing
HOT pruning.

I'm not entirely wild about adding a parameter in this area because it
seems that we're increasingly choosing to further expose what arguably
ought to be internal implementation details.

I'm -1 for a parameter as well, but I think that just stopping SELECTs
from doing pruning at all might well be a win. It's at least worthy
of some investigation.

regards, tom lane


#4 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#3)
Re: Turning off HOT/Cleanup sometimes

On Thu, Jan 9, 2014 at 12:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

On a pgbench workload, though, essentially all page cleanup happens as
a result of HOT cleanups, like >99.9%. It might be OK to have that
happen for write operations, but it would be a performance disaster if
updates didn't try to HOT-prune. Our usual argument for doing HOT
pruning even on SELECT cleanups is that not doing so pessimizes
repeated scans, but there are clearly cases that end up worse off as a
result of that decision.

My recollection of the discussion when HOT was developed is that it works
that way not because anyone thought it was beneficial, but simply because
we didn't see an easy way to know when first fetching a page whether we're
going to try to UPDATE some tuple on the page. (And we can't postpone the
pruning, because the query will have tuple pointers into the page later.)
Maybe we should work a little harder on passing that information down.
It seems reasonable to me that SELECTs shouldn't be tasked with doing
HOT pruning.

I'm not entirely wild about adding a parameter in this area because it
seems that we're increasingly choosing to further expose what arguably
ought to be internal implementation details.

I'm -1 for a parameter as well, but I think that just stopping SELECTs
from doing pruning at all might well be a win. It's at least worthy
of some investigation.

Unfortunately, there's no categorical answer. You can come up with
workloads where HOT pruning on selects is a win; just create a bunch
of junk and then read the same pages lots of times in a row. And you
can also come up with workloads where it's a loss; create a bunch of
junk and then read them just once. I don't know how easy it's going
to be to set that parameter in a useful way for some particular
environment, and I think that's possibly an argument against having
it. But the argument that we don't need a parameter because one
behavior is best for everyone is not going to fly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Robert Haas (#4)
Re: Turning off HOT/Cleanup sometimes

Robert Haas wrote:

Unfortunately, there's no categorical answer. You can come up with
workloads where HOT pruning on selects is a win; just create a bunch
of junk and then read the same pages lots of times in a row. And you
can also come up with workloads where it's a loss; create a bunch of
junk and then read them just once. I don't know how easy it's going
to be to set that parameter in a useful way for some particular
environment, and I think that's possibly an argument against having
it. But the argument that we don't need a parameter because one
behavior is best for everyone is not going to fly.

In the above, there's the underlying assumption that it doesn't matter
*what* we do with the page after doing or not doing pruning. But this
is not necessarily the case: in the case of an UPDATE, having the space
be freed beforehand is beneficial because there's the option of putting
the new version of the tuple in the same page, potentially saving lots
of I/O (bring up another destination page for the new tuple, write the
new tuple there, end up dirtying two pages instead of one). But in a
SELECT, the effect is only that you will have to skip less dead tuples,
which is not as exciting.
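The dirty-page arithmetic behind this point can be made concrete with a small illustration (the numbers and function are invented for the sketch; this is not PostgreSQL code):

```python
# Illustrative model of Alvaro's argument: if pruning has freed enough
# space on the page, an UPDATE can put the new tuple version on the same
# page and dirty one page; otherwise it must bring up a destination page
# and ends up dirtying two.

def pages_dirtied_by_update(free_space_on_page, new_tuple_size):
    if free_space_on_page >= new_tuple_size:
        return 1   # new version fits on the same page
    return 2       # old page (chain update) plus a new destination page
```

For a SELECT there is no such payoff: the only benefit of pruning is skipping fewer dead tuples on later visits.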

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#3)
Re: Turning off HOT/Cleanup sometimes

On 9 January 2014 17:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

On a pgbench workload, though, essentially all page cleanup happens as
a result of HOT cleanups, like >99.9%. It might be OK to have that
happen for write operations, but it would be a performance disaster if
updates didn't try to HOT-prune. Our usual argument for doing HOT
pruning even on SELECT cleanups is that not doing so pessimizes
repeated scans, but there are clearly cases that end up worse off as a
result of that decision.

My recollection of the discussion when HOT was developed is that it works
that way not because anyone thought it was beneficial, but simply because
we didn't see an easy way to know when first fetching a page whether we're
going to try to UPDATE some tuple on the page. (And we can't postpone the
pruning, because the query will have tuple pointers into the page later.)
Maybe we should work a little harder on passing that information down.
It seems reasonable to me that SELECTs shouldn't be tasked with doing
HOT pruning.

I'm not entirely wild about adding a parameter in this area because it
seems that we're increasingly choosing to further expose what arguably
ought to be internal implementation details.

I'm -1 for a parameter as well, but I think that just stopping SELECTs
from doing pruning at all might well be a win. It's at least worthy
of some investigation.

Turning HOT off completely would be an absolute disaster for OLTP on
high update use cases against medium-large tables. That scenario is
well represented by pgbench and TPC-C. I am *not* suggesting we
recommend that and would look for very large caveats in the docs.
(That may not have been clear, I guess I just assumed people would
know I was heavily involved in the HOT project and understood its
benefits).

As stated, I am interested in turning off HOT in isolated, user
specified situations, perhaps just for isolated tables.

I'm not crazy about exposing magic parameters either but then I'm not
crazy about either automatic settings or deferring things because we
don't know how to set it. In general, I prefer the idea of having a
user settable parameter in one release then automating it in a later
release if clear settings emerge from usage. I'll submit a patch with
parameter, to allow experimentation, for possible removal at commit or
beta.

If I had to suggest a value for an internal parameter, I would say
that each SELECT statement should clean no more than 4 blocks. That
way current OLTP behaviour is mostly preserved while the big queries
and pg_dump don't suck in unpredictable ways.

I'll submit the patch and we can talk some more.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#5)
Re: Turning off HOT/Cleanup sometimes

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

Robert Haas wrote:

But the argument that we don't need a parameter because one
behavior is best for everyone is not going to fly.

In the above, there's the underlying assumption that it doesn't matter
*what* we do with the page after doing or not doing pruning. But this
is not necessarily the case: in the case of an UPDATE, having the space
be freed beforehand is beneficial because there's the option of putting
the new version of the tuple in the same page, potentially saving lots
of I/O (bring up another destination page for the new tuple, write the
new tuple there, end up dirtying two pages instead of one). But in a
SELECT, the effect is only that you will have to skip less dead tuples,
which is not as exciting.

Yeah. Once they're hinted dead, it doesn't cost that much to skip over
them. Not to mention that you might well never visit them at all, if
this is an indexscan that knows which TIDs it needs to look at.

It's possible that it can be shown that different use-cases have
sufficiently different behaviors that we really do need a user-visible
parameter. I don't want to start from that position though. If we
did have a simple GUC parameter, it'd likely end up in the same boat
as, say, enable_seqscan, which is way too blunt an instrument for real
world use --- so I'm afraid this would soon bloat into a request for
per-table settings, planner hints, or god knows what to try to confine
the effects to the queries where it's appropriate. Let's not go there
without proof that we have to. It's a much better thing if we can get
the system's native behavior to be tuned well enough by depending on
things it already knows.

regards, tom lane


#8 Stephen Frost
sfrost@snowman.net
In reply to: Alvaro Herrera (#5)
Re: Turning off HOT/Cleanup sometimes

* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:

But in a
SELECT, the effect is only that you will have to skip less dead tuples,
which is not as exciting.

Agreed. There's also the option to have it be done based on some
expectation of future work- that is, if we have to traverse X number of
dead tuples during a select, then don't bother with HOT pruning, but if
we get up to X+Y dead tuples, then do HOT pruning.
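That X / X+Y hysteresis could be sketched as follows (the thresholds are hypothetical tunables, not real settings):

```python
# Sketch of the proposed hysteresis: tolerate traversing up to X dead
# tuples during a SELECT without pruning, but once a scan has had to
# step over X + Y of them, do the HOT pruning after all.

def should_prune(dead_tuples_traversed, x=10, y=5):
    return dead_tuples_traversed >= x + y
```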

That said, I'm not entirely convinced that traversing these dead tuples
is all *that* painful during SELECT. If there are that many levels then
hopefully it's not long until an UPDATE comes along and cleans them up.

Thanks,

Stephen

#9 Stephen Frost
sfrost@snowman.net
In reply to: Simon Riggs (#6)
Re: Turning off HOT/Cleanup sometimes

* Simon Riggs (simon@2ndQuadrant.com) wrote:

I'm -1 for a parameter as well, but I think that just stopping SELECTs
from doing pruning at all might well be a win. It's at least worthy
of some investigation.

Turning HOT off completely would be an absolute disaster for OLTP on
high update use cases against medium-large tables. That scenario is
well represented by pgbench and TPC-C. I am *not* suggesting we
recommend that and would look for very large caveats in the docs.

This is true even if we're only talking about turning it off for the
SELECT case...? That's what's under discussion here, after all.
Certainly, we wouldn't turn it off completely...

(That may not have been clear, I guess I just assumed people would
know I was heavily involved in the HOT project and understood its
benefits).

I'm certainly aware that you were heavily involved in HOT but I don't
think anyone is arguing to turn it off for everything.

As stated, I am interested in turning off HOT in isolated, user
specified situations, perhaps just for isolated tables.

I tend to agree w/ Tom on this point- having this be a per-table
configurable doesn't sound very appealing to me and it wouldn't address
the case you mentioned around pg_dump, but I'm sure that'd be the next
step for this and a per-session GUC wouldn't be sufficient.

I'm not crazy about exposing magic parameters either but then I'm not
crazy about either automatic settings or deferring things because we
don't know how to set it. In general, I prefer the idea of having a
user settable parameter in one release then automating it in a later
release if clear settings emerge from usage. I'll submit a patch with
parameter, to allow experimentation, for possible removal at commit or
beta.

Ugh, adding GUCs is bad *because* we end up never being able to remove
them.

If I had to suggest a value for an internal parameter, I would say
that each SELECT statement should clean no more than 4 blocks. That
way current OLTP behaviour is mostly preserved while the big queries
and pg_dump don't suck in unpredictable ways.

Right, this was one idea that I had also, as noted in the other
subthread. I'm not convinced that it's a great idea and it'd probably
be good to do a bit of testing to see just what the cost is; perhaps
even just come up with a "worst-case" example to see the difference
between a "clean" table and one with HOT chains as deep as they can go.

I'll submit the patch and we can talk some more.

Neat.

Thanks!

Stephen

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#8)
Re: Turning off HOT/Cleanup sometimes

Stephen Frost <sfrost@snowman.net> writes:

That said, I'm not entirely convinced that traversing these dead tuples
is all *that* painful during SELECT. If there are that many levels then
hopefully it's not long until an UPDATE comes along and cleans them up.

There's always VACUUM ;-)

If you take about ten steps back, what's happening here is that
maintenance work that we'd originally delegated to VACUUM, precisely so
that it wouldn't have to be done by foreground queries, is now being done
by foreground queries. And oddly enough, people don't like that.

There is a reasonable argument for forcing UPDATE queries to do it anyway,
to improve the odds they can do same-page updates (whether HOT or
otherwise). And probably an INSERT should do it on a page that it's
selected as an insertion target. But I think the argument that the
original do-maintenance-in-background-whenever-possible design was wrong
is a lot harder to sustain for SELECT or even DELETE queries. As I said
upthread, I think the current behavior was *not* chosen for performance
reasons but just to limit the scope of what we had to change for HOT.

regards, tom lane


#11 Jim Nasby
jim@nasby.net
In reply to: Tom Lane (#10)
Re: Turning off HOT/Cleanup sometimes

On 1/9/14, 12:54 PM, Tom Lane wrote:

Stephen Frost <sfrost@snowman.net> writes:

That said, I'm not entirely convinced that traversing these dead tuples
is all *that* painful during SELECT. If there are that many levels then
hopefully it's not long until an UPDATE comes along and cleans them up.

There's always VACUUM ;-)

If you take about ten steps back, what's happening here is that
maintenance work that we'd originally delegated to VACUUM, precisely so
that it wouldn't have to be done by foreground queries, is now being done
by foreground queries. And oddly enough, people don't like that.

There is a reasonable argument for forcing UPDATE queries to do it anyway,
to improve the odds they can do same-page updates (whether HOT or
otherwise). And probably an INSERT should do it on a page that it's
selected as an insertion target. But I think the argument that the
original do-maintenance-in-background-whenever-possible design was wrong
is a lot harder to sustain for SELECT or even DELETE queries. As I said
upthread, I think the current behavior was *not* chosen for performance
reasons but just to limit the scope of what we had to change for HOT.

Instead of looking at how to avoid this work in SELECTs, maybe it'd be more useful to look at how we can get it done more quickly in the background. The visibility map is already a step in the right direction, but it seems the big use case here is when some bulk operation comes through and touches a sizeable number of blocks (but perhaps not enough to hit autovac thresholds).

ISTM it wouldn't be too difficult for a backend to track how many blocks in a relation it's dirtied (keep in mind that count doesn't have to be perfect). If we tracked that info, it could be put into a maintenance queue (LISTEN/NOTIFY?) along with our XID. That gives us a list of relations to vacuum and exactly when to vacuum them. Thanks to the visibility map we wouldn't need to track individual pages (though it might be useful to track the minimum and maximum block IDs we hit, per relation).
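A rough sketch of this tracking scheme, with a simple in-process queue standing in for the LISTEN/NOTIFY channel; all names and structure here are hypothetical:

```python
# Sketch of the idea above: each backend keeps an approximate per-relation
# count of blocks it dirtied, plus the min/max block numbers it touched,
# and at commit pushes that summary onto a maintenance queue that a
# background worker could consume.

from collections import defaultdict
from queue import Queue

maintenance_queue = Queue()   # stands in for LISTEN/NOTIFY

class BackendTracker:
    def __init__(self, xid):
        self.xid = xid
        # rel -> [count, min_block, max_block]; the count needn't be exact
        self.dirtied = defaultdict(lambda: [0, None, None])

    def note_dirty(self, rel, block):
        entry = self.dirtied[rel]
        entry[0] += 1
        entry[1] = block if entry[1] is None else min(entry[1], block)
        entry[2] = block if entry[2] is None else max(entry[2], block)

    def commit(self):
        # Publish one summary per relation, tagged with our XID, so a
        # background worker knows what to vacuum and when it became safe.
        for rel, (count, lo, hi) in self.dirtied.items():
            maintenance_queue.put((self.xid, rel, count, lo, hi))
```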
--
Jim C. Nasby, Data Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net


#12 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: Turning off HOT/Cleanup sometimes

On Thu, Jan 9, 2014 at 1:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Stephen Frost <sfrost@snowman.net> writes:

That said, I'm not entirely convinced that traversing these dead tuples
is all *that* painful during SELECT. If there are that many levels then
hopefully it's not long until an UPDATE comes along and cleans them up.

There's always VACUUM ;-)

If you take about ten steps back, what's happening here is that
maintenance work that we'd originally delegated to VACUUM, precisely so
that it wouldn't have to be done by foreground queries, is now being done
by foreground queries. And oddly enough, people don't like that.

People *think* they don't like that, because that's the way it works
right now. If it worked some other way, there's a good chance people
would be complaining about that behavior, too. I submitted a patch a
few years back to limit the setting of hint bits by foreground
processes to approximately 5% of the buffers they touched in a large
scan, so that no single scan would incur all the cost of setting the
hint bits; instead, the cost would be amortized over the first 20 or
so scans. However, nobody was very enthusiastic about that patch,
because while it greatly softened the blow for the first scan,
subsequent scans were slower, because now they had to carry part of
the burden, too. And you know what? People didn't like *that*
either.
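The amortization scheme Robert describes can be modelled roughly like this; the 5% budget and the roughly-20-scans figure come from his description, while the function and its structure are invented for illustration:

```python
# Model of limiting hint-bit setting to ~5% of the buffers a scan
# touches: each pass hints at most that budget, so the full cost is
# spread over about the first 20 repeated scans instead of the first one.

def scans_until_hinted(total_buffers, per_scan_fraction=0.05):
    unhinted = total_buffers
    scans = 0
    while unhinted > 0:
        budget = max(1, int(total_buffers * per_scan_fraction))
        unhinted -= min(budget, unhinted)
        scans += 1
    return scans
```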

The problem with saying that we should let VACUUM do this work is the
same as the problem with saying that if you're late for your Concorde
flight, you should go running across the tarmac and try to catch it.
The cost of dead tuples is related in a linear fashion to the rate at
which pages are accessed. Not coincidentally, the number of
opportunities for HOT pruning is *also* related in a linear fashion to
the rate at which pages are accessed. This is why it works so well.
The rate at which vacuuming happens does not ramp up in the same way;
it's limited by autovacuum cost settings (which people tend not to have
set correctly, and don't adjust themselves on the fly) or by their
hardware capabilities. If autovacuum can't keep up, foreground
activity doesn't slow down to compensate; instead, the system just
bloats out of control. While people may not like having this
maintenance activity in the foreground, they like not having it at all
even less.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#13 Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#12)
Re: Turning off HOT/Cleanup sometimes

On 2014-01-09 16:27:23 -0500, Robert Haas wrote:

People *think* they don't like that, because that's the way it works
right now. If it worked some other way, there's a good chance people
would be complaining about that behavior, too.

I think one of the primary reasons it's causing huge slowdowns is that
the ring buffer of scan strategies causes dirty buffer writes pretty
much immediately, when a buffer is reused.

Not that delaying the writeout would work all that effectively right
now, with the current bgwriter...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#14 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: Turning off HOT/Cleanup sometimes

Robert Haas <robertmhaas@gmail.com> writes:

The problem with saying that we should let VACUUM do this work is the
same as the problem with saying that if you're late for your Concorde
flight, you should go running across the tarmac and try to catch it.
The cost of dead tuples is related in a linear fashion to the rate at
which pages are accessed. Not coincidentally, the number of
opportunities for HOT pruning is *also* related in a linear fashion to
the rate at which pages are accessed. This is why it works so well.

That seems like a large oversimplification. Some (most?) of the costs of
dead tuples are proportional to the rate of dead tuple creation. I grant
that there are also some costs proportional to the rate at which scans
visit dead tuples, but I really don't believe that the latter are
dominant. So I think it's bogus to claim that the current behavior is
somehow optimal.

One more time: the sole reason it works the way it does now is that that
was the path of least resistance back in 2007, and we never yet got around
to trying to optimize that. I'm glad to see someone wanting to revisit
the issue, but I don't think that we necessarily have to go as far as
creating user-visible knobs in order to make it better.

The rate at which vacuuming happens does not ramp up in the same way;
it's limited by autovacuum cost settings (which people tend not to have
set correctly, and don't adjust themselves on the fly)

True, but that seems like a pretty well-defined improvement project right
there (as well as an argument against user-visible knobs in general ;-)).
Nasby's speculations just upthread could be useful here, too.

regards, tom lane


#15 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#14)
Re: Turning off HOT/Cleanup sometimes

On Thu, Jan 9, 2014 at 4:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

The problem with saying that we should let VACUUM do this work is the
same as the problem with saying that if you're late for your Concorde
flight, you should go running across the tarmac and try to catch it.
The cost of dead tuples is related in a linear fashion to the rate at
which pages are accessed. Not coincidentally, the number of
opportunities for HOT pruning is *also* related in a linear fashion to
the rate at which pages are accessed. This is why it works so well.

That seems like a large oversimplification. Some (most?) of the costs of
dead tuples are proportional to the rate of dead tuple creation. I grant
that there are also some costs proportional to the rate at which scans
visit dead tuples, but I really don't believe that the latter are
dominant. So I think it's bogus to claim that the current behavior is
somehow optimal.

This doesn't make any sense to me at all. What costs are proportional
to the rate of dead tuple creation? I'm referring specifically to the
performance penalty that scans incur for having to skip over dead
tuples, and those costs aren't incurred when the tuples are created,
but rather when you try to access the still-live data afterwards. If
anything, our system is a big WIN at the time tuples are created,
precisely because we leave the old tuples around to be cleaned up
later rather than getting rid of them at once. That's why, for
example, we tend to win delete-heavy benchmarks vs. other database
systems.

One more time: the sole reason it works the way it does now is that that
was the path of least resistance back in 2007, and we never yet got around
to trying to optimize that. I'm glad to see someone wanting to revisit
the issue, but I don't think that we necessarily have to go as far as
creating user-visible knobs in order to make it better.

Sure, I'm not denying that. The fact that it was the path of least
resistance doesn't mean it was a bad idea. I'm happy to see it
improved, too, but I think it's important to understand what happens
now. And at least on the pgbench tests I've done, what happens is
that VACUUM makes no significant contribution to pruning; IIRC, it
would have to visit pages at least 1000 times more often to be
relevant. So when somebody says "relying on vacuum instead of doing
HOT pruning" what I hear is "flush performance down the toilet"... but
of course the real way to resolve this is to test whatever patch Simon
or someone else eventually posts, not to speculate without data.

The rate at which vacuuming happens does not ramp up in the same way;
it's limited by autovacuum cost settings (which people tend not to have
set correctly, and don't adjust themselves on the fly)

True, but that seems like a pretty well-defined improvement project right
there (as well as an argument against user-visible knobs in general ;-)).
Nasby's speculations just upthread could be useful here, too.

Fair point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#15)
Re: Turning off HOT/Cleanup sometimes

* Robert Haas (robertmhaas@gmail.com) wrote:

So when somebody says "relying on vacuum instead of doing
HOT pruning" what I hear is "flush performance down the toilet"... but
of course the real way to resolve this is to test whatever patch Simon
or someone else eventually posts, not to speculate without data.

I don't think anyone was seriously proposing that (certainly not with
today's VACUUM). What I've heard speculated about is doing HOT pruning
during UPDATE and/or INSERT but specifically not during SELECT. I
concur that we need data to really understand the difference, hopefully
there'll be a patch posted which we can play with.

Thanks,

Stephen

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#1)
1 attachment(s)
Re: Turning off HOT/Cleanup sometimes

On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote:

VACUUM cleans up blocks, which is nice because it happens offline in a
lazy manner.

We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

Effects of that are that long running statements often run much longer
than we want, increasing bloat as a result. It also produces wildly
varying response times, depending upon extent of cleanup required.

It is a simple task to make that behaviour optional on the master.

I propose a USERSET parameter, prune_cost_limit (<---insert better name here)
which will make the behaviour optional, default -1, in normal user
processes. VACUUM will ignore this parameter and so its actions will
never be deferred.

In detail, this parameter would disable pruning for any scan larger
than the cost limit. So large scans will disable the behaviour. The
default, -1, means never disable pruning, which is the current
behaviour.

We track the number of pages dirtied by the current statement. When
this reaches prune_cost_limit, we will apply these behaviours to all
shared_buffer block accesses...

(1) avoid running heap_page_prune_opt()

(2) avoid dirtying the buffer for hints. (This is safe because the
hinted changes will either be lost or will be part of the full page
image when we make a logged-change).

(i.e. doesn't apply to temp tables)

For example, if we set prune_cost_limit = 4 this behaviour allows
small index lookups via bitmapheapscan to continue to cleanup, while
larger index and seq scans will avoid cleanup.

There would be a postgresql.conf parameter prune_cost_limit, as well
as a table level parameter that would prevent pruning except via
VACUUM.

This will help in these ways
* Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries
* Allow finer grained control over Hot Standby conflicts
* Potentially allow diagnostic inspection of older data via SeqScan

Prototype patch shows this is possible and simple enough for 9.4.
Major objections? Or should I polish up and submit?

Patch attached, implemented to reduce writes by SELECTs only.
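For anyone wanting to try it, usage would look roughly like this (parameter
and reloption names taken from the attached patch; the exact syntax is a
sketch, not a tested excerpt):

```sql
-- Session level: stop pruning and hint-dirtying once the current
-- SELECT statement has dirtied 4 pages.
SET prune_page_dirty_limit = 4;

-- Restore the pre-9.4 behaviour (no limit on foreground cleanup).
SET prune_page_dirty_limit = -1;

-- Table level: never let plain user SQL prune this table;
-- leave all cleanup to VACUUM.
ALTER TABLE big_history SET (allow_buffer_cleanup = off);
```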

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

prune_page_limit.v3.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0f2f2bf..e0251b7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1592,6 +1592,34 @@ include 'filename'
      <title>Asynchronous Behavior</title>
 
      <variablelist>
+      <varlistentry id="guc-prune-page-dirty-limit" xreflabel="prune_page_dirty_limit">
+       <term><varname>prune_page_dirty_limit</varname> (<type>integer</type>)</term>
+       <indexterm>
+        <primary><varname>prune_page_dirty_limit</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Sets the limit on the number of pages dirtied by page pruning during
+         foreground processing for plain SELECT queries. Maintenance operations
+         and write operations are not limited by this parameter, nor are
+         SELECT queries that require writes to occur, such as row locking or
+         writable common table expressions.
+        </para>
+        <para>
+         This allows larger SELECT queries and pg_dump to avoid generating
+         significant write I/O as they execute, aiding query performance during
+         mixed workload processing. The limit is reset for each new statement.
+        </para>
+        <para>
+         Prior to 9.4 there was no limit; a setting of -1 restores that behaviour.
+        </para>
+        <para>
+         The limit is overridden and set to zero for any table where pruning is
+         disabled.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="guc-effective-io-concurrency" xreflabel="effective_io_concurrency">
        <term><varname>effective_io_concurrency</varname> (<type>integer</type>)</term>
        <indexterm>
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index a422edd..61915ff 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -895,6 +895,20 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
    </varlistentry>
 
    <varlistentry>
+    <term><literal>allow_buffer_cleanup</> (<type>boolean</>)</term>
+    <listitem>
+     <para>
+      Determines whether user SQL will perform cleanup of shared buffers
+      for this table. Does not prevent setting of tuple hint bits.
+      See also prune_page_dirty_limit.
+      Disabling this will prevent HOT from working effectively, which
+      will in most cases be a performance degradation for writes to this
+      table. This parameter cannot be set for TOAST tables.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
     <term><literal>autovacuum_enabled</>, <literal>toast.autovacuum_enabled</literal> (<type>boolean</>)</term>
     <listitem>
      <para>
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index fa08c45..d466819 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -85,6 +85,14 @@ static relopt_bool boolRelOpts[] =
 		},
 		false
 	},
+	{
+		{
+			"allow_buffer_cleanup",
+			"Allow cleanup of shared buffers by foreground processes, allowing later cleanup by VACUUM",
+			RELOPT_KIND_HEAP
+		},
+		true
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1175,7 +1183,9 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 		{"check_option", RELOPT_TYPE_STRING,
 		offsetof(StdRdOptions, check_option_offset)},
 		{"user_catalog_table", RELOPT_TYPE_BOOL,
-		 offsetof(StdRdOptions, user_catalog_table)}
+		 offsetof(StdRdOptions, user_catalog_table)},
+		{"allow_buffer_cleanup", RELOPT_TYPE_BOOL,
+		 offsetof(StdRdOptions, allow_prune)}
 	};
 
 	options = parseRelOptions(reloptions, validate, kind, &numoptions);
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a771ccb..5ddd07d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -273,6 +273,7 @@ initscan(HeapScanDesc scan, ScanKey key, bool is_rescan)
 		scan->rs_startblock = 0;
 	}
 
+	scan->rs_allow_prune = RelationAllowPrune(scan->rs_rd);
 	scan->rs_inited = false;
 	scan->rs_ctup.t_data = NULL;
 	ItemPointerSetInvalid(&scan->rs_ctup.t_self);
@@ -348,7 +349,10 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	 * Prune and repair fragmentation for the whole page, if possible.
 	 */
 	Assert(TransactionIdIsValid(RecentGlobalXmin));
-	heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
+	if (scan->rs_allow_prune &&
+		(PrunePageDirtyLimit == -1 ||
+		 PrunePageDirtyLimit > PrunePageDirty))
+		heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 1aba2f0..6b1f04e 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -67,7 +67,9 @@
 
 #include "access/relscan.h"
 #include "access/transam.h"
+#include "access/xact.h"
 #include "catalog/index.h"
+#include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -249,6 +251,7 @@ index_beginscan(Relation heapRelation,
 	 * up by RelationGetIndexScan.
 	 */
 	scan->heapRelation = heapRelation;
+	scan->xs_allow_prune = RelationAllowPrune(heapRelation);
 	scan->xs_snapshot = snapshot;
 
 	return scan;
@@ -275,6 +278,12 @@ index_beginscan_bitmap(Relation indexRelation,
 	 */
 	scan->xs_snapshot = snapshot;
 
+	/*
+	 * Bitmap index scans don't touch the heap, so this is effectively
+	 * ignored. We set pruning correctly when we do the bitmap heap scan.
+	 */
+	scan->xs_allow_prune = true;
+
 	return scan;
 }
 
@@ -519,7 +528,10 @@ index_fetch_heap(IndexScanDesc scan)
 		/*
 		 * Prune page, but only if we weren't already on this page
 		 */
-		if (prev_buf != scan->xs_cbuf)
+		if (prev_buf != scan->xs_cbuf &&
+			scan->xs_allow_prune &&
+			(PrunePageDirtyLimit == -1 ||
+			 PrunePageDirtyLimit > PrunePageDirty))
 			heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf,
 								RecentGlobalXmin);
 	}
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3455a0b..993cb70 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -219,6 +219,8 @@ vacuum(VacuumStmt *vacstmt, Oid relid, bool do_toast,
 		VacuumPageHit = 0;
 		VacuumPageMiss = 0;
 		VacuumPageDirty = 0;
+		PrunePageDirty = 0;
+		PrunePageDirtyLimit = -1;	/* VACUUM forces pruning always */
 
 		/*
 		 * Loop to process each selected relation.
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index e0000e6..f6424af 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -320,6 +320,8 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 			TimestampDifferenceExceeds(starttime, endtime,
 									   Log_autovacuum_min_duration))
 		{
+			int	NumPageDirty = PrunePageDirty + VacuumPageDirty;
+
 			TimestampDifference(starttime, endtime, &secs, &usecs);
 
 			read_rate = 0;
@@ -328,7 +330,7 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 			{
 				read_rate = (double) BLCKSZ *VacuumPageMiss / (1024 * 1024) /
 							(secs + usecs / 1000000.0);
-				write_rate = (double) BLCKSZ *VacuumPageDirty / (1024 * 1024) /
+				write_rate = (double) BLCKSZ *NumPageDirty / (1024 * 1024) /
 							(secs + usecs / 1000000.0);
 			}
 			ereport(LOG,
@@ -348,7 +350,7 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt,
 							vacrelstats->new_rel_tuples,
 							VacuumPageHit,
 							VacuumPageMiss,
-							VacuumPageDirty,
+							NumPageDirty,
 							read_rate, write_rate,
 							pg_rusage_show(&ru0))));
 		}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 6b5f198..4075f4c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -158,6 +158,11 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 			palloc0(queryDesc->plannedstmt->nParamExec * sizeof(ParamExecData));
 
 	/*
+	 * Initialise the page pruning limit to "prune all pages"
+	 */
+	PrunePageDirtyLimit = -1;
+
+	/*
 	 * If non-read-only query, set the command ID to mark output tuples with
 	 */
 	switch (queryDesc->operation)
@@ -171,6 +176,14 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 			if (queryDesc->plannedstmt->rowMarks != NIL ||
 				queryDesc->plannedstmt->hasModifyingCTE)
 				estate->es_output_cid = GetCurrentCommandId(true);
+			else
+			{
+				/*
+				 * If we aren't marking tuples then we apply the
+				 * page pruning policy for this query.
+				 */
+				PrunePageDirtyLimit = PrunePageDirtyLimitPolicy;
+			}
 
 			/*
 			 * A SELECT without modifying CTEs can't possibly queue triggers,
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 1a8d4e5..5585851 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -39,6 +39,7 @@
 #include "access/transam.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
+#include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
@@ -337,7 +338,10 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
 	 * Prune and repair fragmentation for the whole page, if possible.
 	 */
 	Assert(TransactionIdIsValid(RecentGlobalXmin));
-	heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
+	if (scan->rs_allow_prune &&
+		(PrunePageDirtyLimit == -1 ||
+		 PrunePageDirtyLimit > PrunePageDirty))
+		heap_page_prune_opt(scan->rs_rd, buffer, RecentGlobalXmin);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c6bae12..51f07ec 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1009,7 +1009,16 @@ MarkBufferDirty(Buffer buffer)
 	 */
 	if (!(bufHdr->flags & BM_DIRTY))
 	{
-		VacuumPageDirty++;
+		/*
+		 * If we are tracking pruning in SELECTs then we can only get
+		 * here by heap_page_prune_opt() call that cleans a block,
+		 * so in that case, register it as a pruning operation.
+		 * Make sure we don't double count during VACUUMs.
+		 */
+		if (PrunePageDirtyLimit > -1)
+			PrunePageDirty++;
+		else
+			VacuumPageDirty++;
 		pgBufferUsage.shared_blks_dirtied++;
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageDirty;
@@ -2601,6 +2610,12 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		return;
 	}
 
+	/*
+	 * Skip the hint once we have reached the limit; a limit of -1 means no limit.
+	 */
+	if (PrunePageDirtyLimit <= PrunePageDirty && PrunePageDirtyLimit > -1)
+		return;
+
 	bufHdr = &BufferDescriptors[buffer - 1];
 
 	Assert(PrivateRefCount[buffer - 1] > 0);
@@ -2703,7 +2718,11 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 
 		if (dirtied)
 		{
-			VacuumPageDirty++;
+			/*
+			 * Keep counter accurate even if we aren't limiting pruning
+			 */
+			PrunePageDirty++;
+
 			if (VacuumCostActive)
 				VacuumCostBalance += VacuumCostPageDirty;
 		}
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index f85bd03..1af2a44 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -732,6 +732,12 @@ PortalRun(Portal portal, long count, bool isTopLevel,
 	}
 
 	/*
+	 * Reset execution counters for each new top-level statement
+	 */
+	if (isTopLevel)
+		PrunePageDirty = 0;
+
+	/*
 	 * Check for improper portal use, and mark portal active.
 	 */
 	if (portal->status != PORTAL_READY)
@@ -1246,6 +1252,12 @@ PortalRunMulti(Portal portal, bool isTopLevel,
 		 */
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * Reset execution counters for each new top-level statement
+		 */
+		if (isTopLevel)
+			PrunePageDirty = 0;
+
 		if (IsA(stmt, PlannedStmt) &&
 			((PlannedStmt *) stmt)->utilityStmt == NULL)
 		{
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 63c951e..738a3bb 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -119,6 +119,9 @@ int			VacuumCostDelay = 0;
 int			VacuumPageHit = 0;
 int			VacuumPageMiss = 0;
 int			VacuumPageDirty = 0;
+int			PrunePageDirty = 0;
+int			PrunePageDirtyLimit = 0;
+int			PrunePageDirtyLimitPolicy = 0;
 
 int			VacuumCostBalance = 0;		/* working state for vacuum */
 bool		VacuumCostActive = false;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1217098..5a88f07 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1791,6 +1791,16 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"prune_page_dirty_limit", PGC_USERSET, RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Limit of foreground page cleaning by read-only statements."),
+			NULL
+		},
+		&PrunePageDirtyLimitPolicy,
+		4, -1, 10000,			/* -1 is equivalent to the pre-9.4 default */
+		NULL, NULL, NULL
+	},
+
+	{
 		{"vacuum_cost_page_hit", PGC_USERSET, RESOURCES_VACUUM_DELAY,
 			gettext_noop("Vacuum cost for a page found in the buffer cache."),
 			NULL
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8a57698..341326b 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -33,6 +33,7 @@ typedef struct HeapScanDescData
 	bool		rs_allow_strat; /* allow or disallow use of access strategy */
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
+	bool		rs_allow_prune;	/* allow pruning during page-at-a-time access */
 
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* number of blocks to scan */
@@ -77,6 +78,7 @@ typedef struct IndexScanDescData
 	bool		ignore_killed_tuples;	/* do not return killed entries */
 	bool		xactStartedInRecovery;	/* prevents killing/seeing killed
 										 * tuples */
+	bool		xs_allow_prune;	/* allow pruning during index access */
 
 	/* index access method's private state */
 	void	   *opaque;			/* access-method-specific info */
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index b145a19..4f5605a 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -235,6 +235,9 @@ extern int	VacuumCostDelay;
 extern int	VacuumPageHit;
 extern int	VacuumPageMiss;
 extern int	VacuumPageDirty;
+extern int	PrunePageDirty;
+extern int	PrunePageDirtyLimit;
+extern int	PrunePageDirtyLimitPolicy;
 
 extern int	VacuumCostBalance;
 extern bool VacuumCostActive;
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 9b8a4c9..e0e00a6 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -218,6 +218,7 @@ typedef struct StdRdOptions
 	bool		security_barrier;		/* for views */
 	int			check_option_offset;	/* for views */
 	bool		user_catalog_table;		/* use as an additional catalog relation */
+	bool		allow_prune;	/* whether user pruning is enabled on relation */
 } StdRdOptions;
 
 #define HEAP_MIN_FILLFACTOR			10
@@ -232,6 +233,14 @@ typedef struct StdRdOptions
 	 ((StdRdOptions *) (relation)->rd_options)->fillfactor : (defaultff))
 
 /*
+ * RelationAllowPrune
+ *		Returns the relation's prune state.  Note multiple eval of argument!
+ */
+#define RelationAllowPrune(relation) \
+	((relation)->rd_options ? \
+	 ((StdRdOptions *) (relation)->rd_options)->allow_prune : true)
+
+/*
  * RelationGetTargetPageUsage
  *		Returns the relation's desired space usage per page in bytes.
  */
#18Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#17)
Re: Turning off HOT/Cleanup sometimes

On Tue, Jan 14, 2014 at 4:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote:

VACUUM cleans up blocks, which is nice because it happens offline in a
lazy manner.

We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

Effects of that are that long running statements often run much longer
than we want, increasing bloat as a result. It also produces wildly
varying response times, depending upon extent of cleanup required.

It is a simple task to make that behaviour optional on the master.

I propose a USERSET parameter, prune_cost_limit (<---insert better name here)
which will make the behaviour optional, default -1, in normal user
processes. VACUUM will ignore this parameter and so its actions will
never be deferred.

In detail, this parameter would disable pruning for any scan larger
than the cost limit. So large scans will disable the behaviour. The
default, -1, means never disable pruning, which is the current
behaviour.

We track the number of pages dirtied by the current statement. When
this reaches prune_cost_limit, we will apply these behaviours to all
shared_buffer block accesses...

(1) avoid running heap_page_prune_opt()

(2) avoid dirtying the buffer for hints. (This is safe because the
hinted changes will either be lost or will be part of the full page
image when we make a logged-change).

(i.e. doesn't apply to temp tables)

For example, if we set prune_cost_limit = 4 this behaviour allows
small index lookups via bitmapheapscan to continue to cleanup, while
larger index and seq scans will avoid cleanup.

There would be a postgresql.conf parameter prune_cost_limit, as well
as a table level parameter that would prevent pruning except via
VACUUM.

This will help in these ways
* Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries
* Allow finer grained control over Hot Standby conflicts
* Potentially allow diagnostic inspection of older data via SeqScan

Prototype patch shows this is possible and simple enough for 9.4.
Major objections? Or should I polish up and submit?

Patch attached, implemented to reduce writes by SELECTs only.

I am still not sure whether we want this, but I think it's definitely
an improvement over the previous version. Assorted comments:

- Naming consistency seems to me to dictate that there should be more
similarity between the reloption name (allow_buffer_cleanup) and the
GUC (prune_page_dirty_limit).

- The documentation doesn't describe the use case where suppressing
cleanup on a per-table basis would be desirable, and I can't think of
one, either.

- There are a variety of ways to limit pruning; here, you've chosen to
limit it to a particular number of pruning operations per executor
invocation. But the flag is global, not part of the executor state,
so a query that calls a PL/pgsql function during execution will reset
the counter for the parent query also, which doesn't seem very
principled.

In a patch I posted a few years ago to set hint bits only sometimes, I
settled on an algorithm where I dirtied the first 50 pages per scan
and then skipped the next 950, or something like that. The idea was
that you wanted the pages that did get dirtied to be clustered
together to avoid random I/O; and also that you wanted a table of
arbitrary size to get hinted within a certain number of scans (e.g.
20). The limiting here is much more aggressive, so on large tables it
will amount to basically no pruning at all. I dunno whether that's a
good idea or not. But if the idea of making this an integer rather
than a boolean is to allow some pruning to still happen while keeping
it checked within reasonable bounds, I'm not sure it will succeed.
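The clustered scheme described above might be sketched like this (a
hypothetical illustration of the "first 50 of every 1000 pages" idea, not
code from any posted patch; constant names are invented):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of a clustered limiting scheme: within each scan, dirty the
 * first DIRTY_RUN pages of every CYCLE-page window and skip the rest.
 * The values 50 and 1000 are the illustrative numbers mentioned above;
 * dirtied pages land in contiguous runs, avoiding random I/O.
 */
#define DIRTY_RUN	50
#define CYCLE		1000

static bool
hint_this_page(long pages_seen_in_scan)
{
	return (pages_seen_in_scan % CYCLE) < DIRTY_RUN;
}
```

With these constants about 5% of each scan's pages get dirtied; a real
implementation would also rotate the starting offset between scans so
that a table of arbitrary size gets fully hinted within a bounded number
of scans rather than hitting the same windows every time.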

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#18)
Re: Turning off HOT/Cleanup sometimes

On 15 January 2014 16:47, Robert Haas <robertmhaas@gmail.com> wrote:

There would be a postgresql.conf parameter prune_cost_limit, as well
as a table level parameter that would prevent pruning except via
VACUUM.

This will help in these ways
* Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries
* Allow finer grained control over Hot Standby conflicts
* Potentially allow diagnostic inspection of older data via SeqScan

Patch attached, implemented to reduce writes by SELECTs only.

I am still not sure whether we want this, but I think it's definitely
an improvement over the previous version. Assorted comments:

- Naming consistency seems to me to dictate that there should be more
similarity between the reloption name (allow_buffer_cleanup) and the
GUC (prune_page_dirty_limit).

Now that I've written the patch, I'm seeing those as two different
things, but YMMV and I am very open to naming suggestions.

- The documentation doesn't describe the use case where suppressing
cleanup on a per-table basis would be desirable, and I can't think of
one, either.

We already know that HOT is ineffective in areas of high contention
(previous thread by me). Prior experience was that smaller tables
didn't show much apparent benefit from using HOT either; its
effectiveness was limited to medium and large tables being updated.
The two already stated use cases that would apply are these ones

* Allow finer grained control over Hot Standby conflicts
* Potentially allow diagnostic inspection of older data via SeqScan

So the use cases for the two parameters seem quite different and we
may decide we want one but not the other.

- There are a variety of ways to limit pruning; here, you've chosen to
limit it to a particular number of pruning operations per executor
invocation. But the flag is global, not part of the executor state,
so a query that calls a PL/pgsql function during execution will reset
the counter for the parent query also, which doesn't seem very
principled.

That is a subtle thing in this patch and I agree that the potential
problem exists. The current limit is set according to the currently
executing statement, but the running total is not reset until the start
of the top-level statement. So the behaviour is not reset during
statements executed within a PL/pgSQL function.

In a patch I posted a few years ago to set hint bits only sometimes, I
settled on an algorithm where I dirtied the first 50 pages per scan
and then skipped the next 950, or something like that. The idea was
that you wanted the pages that did get dirtied to be clustered
together to avoid random I/O; and also that you wanted table of
arbitrary size to get hinted within a certain number of scans (e.g.
20). The limiting here is much more aggressive, so on large tables it
will amount to basically no pruning at all. I dunno whether that's a
good idea or not. But if the idea of making this an integer rather
than a boolean is to allow some pruning to still happen while keeping
it checked within reasonable bounds, I'm not sure it will succeed.

It sounds like you're in favour of the overall concept of limiting
writes, which is good.

The behaviour I think we need, based on listening to everybody so far is

* OLTP is unaffected
* Large SELECTs and pg_dump don't cause lots of write I/O.

which is why "prune_page_dirty_limit" offers a change in behaviour at
a certain point.

Reducing cleanup to "only 5%" just reduces but doesn't remove the
problem. If the data is stored on very poor I/O infrastructure, any
significant volume of writes can adversely affect performance. As we
reduce the percentage, we also reduce the benefit from inducing writes
in the first place and so I would question why bother at all using a
percentage. For me, a parameter that gives you absolute rather than
relative control is more desirable.
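To make the distinction concrete, here is a toy model (hypothetical,
loosely mirroring the patch's PrunePageDirty/PrunePageDirtyLimit gate)
of how write I/O scales under an absolute limit:

```c
#include <assert.h>
#include <stdbool.h>

/* Per-statement count of pages dirtied (PrunePageDirty analogue). */
static int pages_dirtied = 0;

/* Gate modelled on the patch: -1 means unlimited pruning. */
static bool
may_prune(int limit)
{
	return limit == -1 || pages_dirtied < limit;
}

/*
 * Worst-case model: every page visited by the scan would need pruning.
 * Returns the number of page writes the statement generates.
 */
static long
writes_for_scan(long scan_pages, int limit)
{
	long		writes = 0;

	pages_dirtied = 0;			/* reset at statement start */
	for (long p = 0; p < scan_pages; p++)
	{
		if (may_prune(limit))
		{
			pages_dirtied++;
			writes++;
		}
	}
	return writes;
}
```

Under this model a million-page seq scan with a limit of 4 generates 4
page writes regardless of table size, whereas a "5% of pages" rule would
still dirty 50,000 pages; that is the absolute-versus-relative point.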

The current behaviour assumes it is OK for the first/next user to
touch the data to be the one that won't mind re-writing everything. In
time critical applications, the first/next user could well have a very
urgent need to access the data quickly and doesn't want to have to pay
this price. In seldom-accessed data applications, VACUUM has lots of
time to run out of hours, so users are OK to defer this work. Some
applications exist where we literally want zero I/O.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#20Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#19)
Re: Turning off HOT/Cleanup sometimes

On Wed, Jan 15, 2014 at 5:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

We already know that HOT is ineffective in areas of high contention
(previous thread by me). Prior experience was that smaller tables
didn't show much apparent benefit from using HOT either; its
effectiveness was limited to medium and large tables being updated.
The two already stated use cases that would apply are these ones

Do you have a link to that previous thread? I don't happen to recall
that conversation.

I've found that HOT can be very important on smaller tables, so I'm
skeptical of that as a general conclusion. What I think might be true
is that if VACUUM is going to hit the table often enough to make you
happy, then you don't really need HOT. In other words, if the update
rate is non-zero but low, not too much cruft will accumulate before
the table gets vacuumed, and you may be OK. If the update rate is
high, though, I think disabling HOT will be painful on a table of any
size. There might be exceptions, but I can't think of what they are
off-hand.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Simon Riggs (#17)
Re: Turning off HOT/Cleanup sometimes

On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote:

Patch attached, implemented to reduce writes by SELECTs only.

This is really a valuable improvement over the current SELECT behaviour
with respect to writes.

While going through the patch, I observed a few points, and thought of
sharing them with you:

+ /*
+ * If we are tracking pruning in SELECTs then we can only get
+ * here by heap_page_prune_opt() call that cleans a block,
+ * so in that case, register it as a pruning operation.
+ * Make sure we don't double count during VACUUMs.
+ */
+ if (PrunePageDirtyLimit > -1)
+ PrunePageDirty++;

a. As the PrunePageDirtyLimit variable is not initialized for the DDL
flow, any statement like CREATE FUNCTION ... will see the default
PrunePageDirtyLimit of 4, and in such cases MarkBufferDirty()
will increment the wrong counter.

b. For DDL statements like CREATE MATERIALIZED VIEW, it will behave as
for a SELECT statement.
Ex.
Create Materialized view mv1 as select * from t1;

Now here I think it might not be a problem: for t1 there will be no
writes anyway, so skipping pruning should not hurt, and for
materialized views there will be no dead rows, so skipping should also
be okay. However, it does not strictly adhere to the statement "to
reduce writes by SELECTs only", nor to the purpose of the patch, which
is to avoid pruning only when the top-level statement is a SELECT.
Do you think it is better to consider such cases and optimize for them,
or should we avoid that by following the rule of thumb that pruning is
skipped only for top-level SELECTs?

2. + "Allow cleanup of shared buffers by foreground processes, allowing
later cleanup by VACUUM",
This line is not clear. What do you mean by "allowing later cleanup
by VACUUM"? If the foreground process has already done the cleanup, it
should save VACUUM the effort.

In general, though both of the optimisations in the patch
(allow_buffer_cleanup and prune_page_dirty_limit) are similar in the
sense that both avoid pruning, I feel they target different cases
(read-only operations versus writes on small tables). Also, as more
people are inclined to do this only for SELECT operations, do you think
it would be a good idea to make them separate patches?

I think there can be some applications or use cases that would benefit
from avoiding pruning for writes on small tables, but the case for
SELECT is more general and more applications can benefit from that
optimisation, so it would be better if we first try to accomplish that
case.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Kapila (#21)
Re: Turning off HOT/Cleanup sometimes

On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote:

Patch attached, implemented to reduce writes by SELECTs only.

This patch is registered in this CF. It does not apply anymore and
needs a rebase. Robert and Amit have provided as well some comments
but they have not been addressed. Is it fair to mark it as "returned
with feedback" even if it has not been reviewed within the last month?
--
Michael


#23Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#22)
Re: Turning off HOT/Cleanup sometimes

On Fri, Sep 12, 2014 at 3:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote:

Patch attached, implemented to reduce writes by SELECTs only.

This patch is registered in this CF. It does not apply anymore and
needs a rebase. Robert and Amit have provided as well some comments
but they have not been addressed. Is it fair to mark it as "returned
with feedback" even if it has not been reviewed within the last month?

For the time being, status has been changed to "waiting on author".
--
Michael


#24Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#23)
1 attachment(s)
Re: Turning off HOT/Cleanup sometimes

Michael Paquier wrote:

On Fri, Sep 12, 2014 at 3:19 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote:

Patch attached, implemented to reduce writes by SELECTs only.

This patch is registered in this CF. It does not apply anymore and
needs a rebase. Robert and Amit have provided as well some comments
but they have not been addressed. Is it fair to mark it as "returned
with feedback" even if it has not been reviewed within the last month?

For the time being, status has been changed to "waiting on author".

As it happens, I was studying this patch yesterday on the flight back
home. I gave it a quick look; I noticed it was in the commitfest and
hadn't seen any review activity for many months, which seemed odd.

Anyway I first read the whole thread to know what to focus on, before
going over the patch itself. Once I finished reading the emails, I had
a vague idea of how I thought it would work: my thinking was that
heap/index scans would either call heap_page_prune_opt, or not,
depending on whether they were part of a read-only executor node. So if
you have a query that updates a certain table, and while doing so scans
another table in read-only mode, then the HOT updates would be enabled
for the table being written, but disabled for the one being read.

As it turns out, the patch as written is nothing at all like that, and
TBH I don't think I like it very much.

My idea is that we would have a new executor flag, say
EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be
read-only, and reset it on those that aren't, such as LockRows and
ModifyTable (obviously we need to pass it down correctly from parent to
children). Then in ExecInitSeqScan and ExecInitIndexScan, if we see the
flag set, we call heap/index_set_allow_prune(false) for the heap scan;
same thing in index scans. (I envisioned it as a boolean rather than
enabling a certain number of cleanups per scan.)

I tried to code this but I think it doesn't work correctly, and no time
for debug currently. Anyway let me know what you think of this general
idea.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

hot-disable.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2b336b0..b859f89 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -342,9 +342,11 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	snapshot = scan->rs_snapshot;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * enabled.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_allow_prune)
+		heap_page_prune_opt(scan->rs_rd, buffer);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -1349,6 +1351,9 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
 
+	/* HOT pruning is initially allowed; caller can turn it off if desired */
+	scan->rs_allow_prune = true;
+
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
 	 */
@@ -1440,6 +1445,12 @@ heap_endscan(HeapScanDesc scan)
 	pfree(scan);
 }
 
+void
+heap_set_allow_prune(HeapScanDesc scan, bool allow)
+{
+	scan->rs_allow_prune = allow;
+}
+
 /* ----------------
  *		heap_getnext	- retrieve next tuple in scan
  *
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..cb2f075 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -253,6 +253,7 @@ index_beginscan(Relation heapRelation,
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->xs_allow_prune = true;
 
 	return scan;
 }
@@ -387,6 +388,12 @@ index_endscan(IndexScanDesc scan)
 	IndexScanEnd(scan);
 }
 
+void
+index_set_allow_prune(IndexScanDesc scan, bool allow_prune)
+{
+	scan->xs_allow_prune = allow_prune;
+}
+
 /* ----------------
  *		index_markpos  - mark a scan position
  * ----------------
@@ -520,9 +527,9 @@ index_fetch_heap(IndexScanDesc scan)
 											 ItemPointerGetBlockNumber(tid));
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page if enabled, but only if we weren't already on this page
 		 */
-		if (prev_buf != scan->xs_cbuf)
+		if (prev_buf != scan->xs_cbuf && scan->xs_allow_prune)
 			heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
 	}
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index fbd7492..8f4964c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1351,7 +1351,7 @@ BeginCopy(bool is_from,
 		 *
 		 * ExecutorStart computes a result tupdesc for us
 		 */
-		ExecutorStart(cstate->queryDesc, 0);
+		ExecutorStart(cstate->queryDesc, EXEC_FLAG_READ_ONLY);
 
 		tupDesc = cstate->queryDesc->tupDesc;
 	}
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 5245171..5213f4b 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -232,6 +232,9 @@ GetIntoRelEFlags(IntoClause *intoClause)
 	if (intoClause->skipData)
 		flags |= EXEC_FLAG_WITH_NO_DATA;
 
+	/* XXX this is probably wrong at least for some callers */
+	flags |= EXEC_FLAG_READ_ONLY;
+
 	return flags;
 }
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a718c0b..bbc9e6e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -170,7 +170,10 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 			 */
 			if (queryDesc->plannedstmt->rowMarks != NIL ||
 				queryDesc->plannedstmt->hasModifyingCTE)
+			{
 				estate->es_output_cid = GetCurrentCommandId(true);
+				eflags &= ~EXEC_FLAG_READ_ONLY;
+			}
 
 			/*
 			 * A SELECT without modifying CTEs can't possibly queue triggers,
@@ -185,6 +188,7 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
 		case CMD_INSERT:
 		case CMD_DELETE:
 		case CMD_UPDATE:
+			eflags &= ~EXEC_FLAG_READ_ONLY;
 			estate->es_output_cid = GetCurrentCommandId(true);
 			break;
 
@@ -871,6 +875,8 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 		if (bms_is_member(i, plannedstmt->rewindPlanIDs))
 			sp_eflags |= EXEC_FLAG_REWIND;
 
+		/* XXX can we set EXEC_FLAG_READ_ONLY here? */
+
 		subplanstate = ExecInitNode(subplan, estate, sp_eflags);
 
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9b1e975..a43bdb8 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -334,9 +334,11 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
 	ntup = 0;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * allowed.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_allow_prune)
+		heap_page_prune_opt(scan->rs_rd, buffer);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 2b89dc6..5ff08e5 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -608,6 +608,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 											   estate->es_snapshot,
 											   indexstate->iss_NumScanKeys,
 											 indexstate->iss_NumOrderByKeys);
+	if (eflags & EXEC_FLAG_READ_ONLY)
+		index_set_allow_prune(indexstate->iss_ScanDesc, false);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 990240b..fb270cb 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -293,6 +293,8 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	/* check for unsupported flags */
 	Assert(!(eflags & EXEC_FLAG_MARK));
 
+	eflags &= ~EXEC_FLAG_READ_ONLY;
+
 	/*
 	 * create state structure
 	 */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 8ac6047..94c33a6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1078,9 +1078,12 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	ListCell   *l;
 	int			i;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
 
+	/* turn off read-only mode, if set */
+	eflags &= ~EXEC_FLAG_READ_ONLY;
+
 	/*
 	 * create state structure
 	 */
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ab13e47..f06b049 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -140,6 +140,9 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 									 estate->es_snapshot,
 									 0,
 									 NULL);
+	/* disable pruning for read-only scans */
+	if (eflags & EXEC_FLAG_READ_ONLY)
+		heap_set_allow_prune(currentScanDesc, false);
 
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index f9ed266..83e7f97 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -177,7 +177,7 @@ ProcessQuery(PlannedStmt *plan,
 	/*
 	 * Call ExecutorStart to prepare the plan for execution
 	 */
-	ExecutorStart(queryDesc, 0);
+	ExecutorStart(queryDesc, EXEC_FLAG_READ_ONLY);
 
 	/*
 	 * Run the plan to completion.
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index d99158f..0b607ee 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -142,6 +142,7 @@ extern void index_rescan(IndexScanDesc scan,
 			 ScanKey keys, int nkeys,
 			 ScanKey orderbys, int norderbys);
 extern void index_endscan(IndexScanDesc scan);
+extern void index_set_allow_prune(IndexScanDesc scan, bool allow_prune);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7f7166d..daa687c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -113,6 +113,7 @@ extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 bool allow_strat, bool allow_sync);
 extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
 				  int nkeys, ScanKey key);
+extern void heap_set_allow_prune(HeapScanDesc scan, bool allow_prune);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8a57698..48b7bfe 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -33,6 +33,7 @@ typedef struct HeapScanDescData
 	bool		rs_allow_strat; /* allow or disallow use of access strategy */
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
+	bool		rs_allow_prune;	/* allow HOT-pruning during this scan */
 
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* number of blocks to scan */
@@ -71,6 +72,7 @@ typedef struct IndexScanDescData
 	ScanKey		keyData;		/* array of index qualifier descriptors */
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
+	bool		xs_allow_prune;	/* allow HOT-pruning during this scan */
 
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d167b49..42955eb 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -54,6 +54,9 @@
  * WITH/WITHOUT_OIDS tell the executor to emit tuples with or without space
  * for OIDs, respectively.  These are currently used only for CREATE TABLE AS.
  * If neither is set, the plan may or may not produce tuples including OIDs.
+ *
+ * READ_ONLY tells the executor that the scan requires read-only access.
+ * (we use this to disable HOT pruning on certain nodes)
  */
 #define EXEC_FLAG_EXPLAIN_ONLY	0x0001	/* EXPLAIN, no ANALYZE */
 #define EXEC_FLAG_REWIND		0x0002	/* need efficient rescan */
@@ -63,6 +66,7 @@
 #define EXEC_FLAG_WITH_OIDS		0x0020	/* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS	0x0040	/* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA	0x0080	/* rel scannability doesn't matter */
+#define EXEC_FLAG_READ_ONLY		0x0100	/* read-only node */
 
 
 /*
#25Simon Riggs
simon@2ndquadrant.com
In reply to: Alvaro Herrera (#24)
Re: Turning off HOT/Cleanup sometimes

On 12 September 2014 14:54, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

My idea is that we would have a new executor flag, say
EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be
read-only, and reset it on those that aren't, such as LockRows and
ModifyTable (obviously we need to pass it down correctly from parent to
children). Then in ExecInitSeqScan and ExecInitIndexScan, if we see the
flag set, we call heap/index_set_allow_prune(false) for the heap scan;
same thing in index scans. (I envisioned it as a boolean rather than
enabling a certain number of cleanups per scan.)

I tried to code this but I think it doesn't work correctly, and no time
for debug currently. Anyway let me know what you think of this general
idea.

Thanks for looking at this.

My concern was to ensure that UPDATEs and DELETEs continue to call
heap_page_prune_opt while larger SELECTs do not.

This is achieved without a counter, so after some thought I like it
better; simple is good. Happy to progress from here, or would you like to?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#26Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#25)
Re: Turning off HOT/Cleanup sometimes

Simon Riggs wrote:

On 12 September 2014 14:54, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

My idea is that we would have a new executor flag, say
EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be
read-only, and reset it on those that aren't, such as LockRows and
ModifyTable (obviously we need to pass it down correctly from parent to
children). Then in ExecInitSeqScan and ExecInitIndexScan, if we see the
flag set, we call heap/index_set_allow_prune(false) for the heap scan;
same thing in index scans. (I envisioned it as a boolean rather than
enabling a certain number of cleanups per scan.)

I tried to code this but I think it doesn't work correctly, and no time
for debug currently. Anyway let me know what you think of this general
idea.

Thanks for looking at this.

My concern was to ensure that UPDATEs and DELETEs continue to call
heap_page_prune_opt while larger SELECTs do not.

This is achieved without a counter, so after some thought I like it
better; simple is good. Happy to progress from here, or would you like to?

Please feel free to take over.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#25)
Re: Turning off HOT/Cleanup sometimes

Simon Riggs <simon@2ndquadrant.com> writes:

On 12 September 2014 14:54, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

My idea is that we would have a new executor flag, say
EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be
read-only, and reset it on those that aren't, such as LockRows and
ModifyTable (obviously we need to pass it down correctly from parent to
children). Then in ExecInitSeqScan and ExecInitIndexScan, if we see the
flag set, we call heap/index_set_allow_prune(false) for the heap scan;
same thing in index scans. (I envisioned it as a boolean rather than
enabling a certain number of cleanups per scan.)

I tried to code this but I think it doesn't work correctly, and no time
for debug currently. Anyway let me know what you think of this general
idea.

Thanks for looking at this.

My concern was to ensure that UPDATEs and DELETEs continue to call
heap_page_prune_opt while larger SELECTs do not.

I think there's another way to think about it: what about saying that
the query's target relation(s) are subject to pruning, while others
are not? Then you do not need an executor flag, you just need to
look at the estate->es_result_relations array (or maybe even only at
estate->es_result_relation_info). This would have the advantage of
doing what-I-think-is-the-right-thing for updates/deletes involving
joins to other tables. The mechanism Alvaro describes would probably
have to prune all tables involved in such a query; do we really want
that?

regards, tom lane


#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#27)
Re: Turning off HOT/Cleanup sometimes

I wrote:

I think there's another way to think about it: what about saying that
the query's target relation(s) are subject to pruning, while others
are not? Then you do not need an executor flag, you just need to
look at the estate->es_result_relations array (or maybe even only at
estate->es_result_relation_info).

After a little bit I remembered there was already a function for this.
So specifically, I'd suggest using ExecRelationIsTargetRelation()
to decide whether to mark the scan as requiring pruning.

regards, tom lane


#29Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#28)
Re: Turning off HOT/Cleanup sometimes

On 12 September 2014 15:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wrote:

I think there's another way to think about it: what about saying that
the query's target relation(s) are subject to pruning, while others
are not? Then you do not need an executor flag, you just need to
look at the estate->es_result_relations array (or maybe even only at
estate->es_result_relation_info).

After a little bit I remembered there was already a function for this.
So specifically, I'd suggest using ExecRelationIsTargetRelation()
to decide whether to mark the scan as requiring pruning.

Sounds cool. Thanks both, this is sounding like a viable route now.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#30Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#29)
1 attachment(s)
Re: Turning off HOT/Cleanup sometimes

On 12 September 2014 18:19, Simon Riggs <simon@2ndquadrant.com> wrote:

On 12 September 2014 15:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After a little bit I remembered there was already a function for this.
So specifically, I'd suggest using ExecRelationIsTargetRelation()
to decide whether to mark the scan as requiring pruning.

Sounds cool. Thanks both, this is sounding like a viable route now.

Yes, this is viable.

Patch attached, using Alvaro's idea of use-case-specific pruning and
Tom's idea of aiming at target relations. The patch uses or extends
existing infrastructure, so it's shorter than it might have been, yet
with all that bufmgr yuck removed.

This is very, very good, because while going through this I noticed the
dozen or more places where we were pruning blocks in annoying spots I
didn't even know about, such as some 4-5 constraint checks. In more
than a few DDL commands, like ALTER TABLE and CLUSTER, we were even
pruning the old relation prior to the rewrite.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

hot_disable.v5.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ae15c0b..8377a9c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -80,7 +80,8 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap);
+						bool is_bitmapscan, bool temp_snap,
+						bool allow_prune);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -342,9 +343,11 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	snapshot = scan->rs_snapshot;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * enabled.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_allow_prune)
+		heap_page_prune_opt(scan->rs_rd, buffer);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -1275,7 +1278,8 @@ heap_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
  * heap_beginscan_strat offers an extended API that lets the caller control
  * whether a nondefault buffer access strategy can be used, and whether
  * syncscan can be chosen (possibly resulting in the scan not starting from
- * block zero).  Both of these default to TRUE with plain heap_beginscan.
+ * block zero).  Both of these default to TRUE with plain heap_beginscan,
+ * while allow_prune defaults to FALSE.
  *
  * heap_beginscan_bm is an alternative entry point for setting up a
  * HeapScanDesc for a bitmap heap scan.  Although that scan technology is
@@ -1288,7 +1292,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, false);
+								   true, true, false, false, false);
 }
 
 HeapScanDesc
@@ -1298,31 +1302,33 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, true);
+								   true, true, false, true, false);
 }
 
 HeapScanDesc
 heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync)
+					 bool allow_strat, bool allow_sync, bool allow_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   allow_strat, allow_sync, false, false);
+								   allow_strat, allow_sync, false, false,
+								   allow_prune);
 }
 
 HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key)
+				  int nkeys, ScanKey key, bool allow_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   false, false, true, false);
+								   false, false, true, false, allow_prune);
 }
 
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap)
+						bool is_bitmapscan, bool temp_snap,
+						bool allow_prune)
 {
 	HeapScanDesc scan;
 
@@ -1348,6 +1354,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_allow_prune = allow_prune;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 850008b..2163056 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -317,7 +317,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
 											 nkeys, key,
-											 true, false);
+											 true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..52d716b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -249,10 +249,39 @@ index_beginscan(Relation heapRelation,
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan. Disable pruning in all cases.
+	 */
+	scan->heapRelation = heapRelation;
+	scan->xs_snapshot = snapshot;
+	scan->xs_allow_prune = false;
+
+	return scan;
+}
+
+/*
+ * index_beginscan_strat - start a scan of an index with amgettuple
+ * allowing caller to specify additional scan strategies.
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys,
+				bool allow_prune)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
 	 * up by RelationGetIndexScan.
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->xs_allow_prune = allow_prune;
 
 	return scan;
 }
@@ -278,6 +307,9 @@ index_beginscan_bitmap(Relation indexRelation,
 	 */
 	scan->xs_snapshot = snapshot;
 
+	/* BitmapHeapScan does pruning if required */
+	scan->xs_allow_prune = false;
+
 	return scan;
 }
 
@@ -520,9 +552,9 @@ index_fetch_heap(IndexScanDesc scan)
 											 ItemPointerGetBlockNumber(tid));
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page if enabled, but only if we weren't already on this page
 		 */
-		if (prev_buf != scan->xs_cbuf)
+		if (prev_buf != scan->xs_cbuf && scan->xs_allow_prune)
 			heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
 	}
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index ee10594..beb7fc1 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2165,7 +2165,8 @@ IndexBuildHeapScan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								allow_sync);	/* syncscan OK? */
+								allow_sync,	/* syncscan OK? */
+								false);		/* disallow pruning */
 
 	reltuples = 0;
 
@@ -2543,7 +2544,8 @@ IndexCheckExclusion(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								true);	/* syncscan OK */
+								true,	/* syncscan OK */
+								false);		/* disallow pruning */
 
 	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
 	{
@@ -2817,7 +2819,8 @@ validate_index_heapscan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								false); /* syncscan not OK */
+								false, /* syncscan not OK */
+								false);		/* disallow pruning */
 
 	/*
 	 * Scan all tuples matching the snapshot.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9b1e975..8f9b122 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -334,9 +334,11 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
 	ntup = 0;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * allowed.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_allow_prune)
+		heap_page_prune_opt(scan->rs_rd, buffer);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -537,6 +539,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		relistarget;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -597,6 +600,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
 
 	scanstate->ss.ss_currentRelation = currentRelation;
+	relistarget = ExecRelationIsTargetRelation(estate, node->scan.scanrelid);
 
 	/*
 	 * Even though we aren't going to do a conventional seqscan, it is useful
@@ -605,7 +609,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ss_currentScanDesc = heap_beginscan_bm(currentRelation,
 														 estate->es_snapshot,
 														 0,
-														 NULL);
+														 NULL,
+														 relistarget);
 
 	/*
 	 * get the scan type from the relation descriptor.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index afcd1ff..b154823 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -523,11 +523,12 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->ioss_ScanDesc = index_beginscan_strat(currentRelation,
 												indexstate->ioss_RelationDesc,
 												estate->es_snapshot,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+											indexstate->ioss_NumOrderByKeys,
+											relistarget);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 2b89dc6..3512f50 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -603,11 +603,12 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->iss_ScanDesc = index_beginscan_strat(currentRelation,
 											   indexstate->iss_RelationDesc,
 											   estate->es_snapshot,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+											   indexstate->iss_NumOrderByKeys,
+											   relistarget);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ab13e47..7240413 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -126,20 +126,30 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 {
 	Relation	currentRelation;
 	HeapScanDesc currentScanDesc;
+	Oid			currentRelid;
+	bool		relistarget;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table
+	 */
+	currentRelid = ((SeqScan *) node->ps.plan)->scanrelid;
+	relistarget = ExecRelationIsTargetRelation(estate, currentRelid);
 
 	/*
-	 * get the relation object id from the relid'th entry in the range table,
 	 * open that relation and acquire appropriate lock on it.
 	 */
 	currentRelation = ExecOpenScanRelation(estate,
-									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   currentRelid,
 										   eflags);
 
 	/* initialize a heapscan */
-	currentScanDesc = heap_beginscan(currentRelation,
+	currentScanDesc = heap_beginscan_strat(currentRelation,
 									 estate->es_snapshot,
 									 0,
-									 NULL);
+									 NULL,
+									 true,
+									 true,
+									 relistarget);
 
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index d99158f..eb993ae 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -135,6 +135,10 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
 				int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys, bool allow_prune);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 493839f..3eea50c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -109,9 +109,9 @@ extern HeapScanDesc heap_beginscan_catalog(Relation relation, int nkeys,
 					   ScanKey key);
 extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync);
+					 bool allow_strat, bool allow_sync, bool allow_prune);
 extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key);
+				  int nkeys, ScanKey key, bool allow_prune);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8a57698..48b7bfe 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -33,6 +33,7 @@ typedef struct HeapScanDescData
 	bool		rs_allow_strat; /* allow or disallow use of access strategy */
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
+	bool		rs_allow_prune;	/* allow HOT-pruning during this scan */
 
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* number of blocks to scan */
@@ -71,6 +72,7 @@ typedef struct IndexScanDescData
 	ScanKey		keyData;		/* array of index qualifier descriptors */
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
+	bool		xs_allow_prune;	/* allow HOT-pruning during this scan */
 
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
#31Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#30)
Re: Turning off HOT/Cleanup sometimes

On Sun, Sep 14, 2014 at 4:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 12 September 2014 18:19, Simon Riggs <simon@2ndquadrant.com> wrote:

On 12 September 2014 15:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After a little bit I remembered there was already a function for this.
So specifically, I'd suggest using ExecRelationIsTargetRelation()
to decide whether to mark the scan as requiring pruning.

Sounds cool. Thanks both, this is sounding like a viable route now.

Yes, this is viable.

Patch attached, using Alvaro's idea of use-case specific pruning and
Tom's idea of aiming at target relations. Patch uses or extends
existing infrastructure, so it's shorter than it might have been, yet
with all that bufmgr yuck removed.

This is very, very good because while going through this I noticed the
dozen or more places where we were pruning blocks in annoying places I
didn't even know about, such as 4-5 constraint checks. In more
than a few DDL commands like ALTER TABLE and CLUSTER we were even
pruning the old relation prior to rewrite.

Do we really want to disable HOT for all catalog scans?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#32Simon Riggs
simon@2ndquadrant.com
In reply to: Robert Haas (#31)
Re: Turning off HOT/Cleanup sometimes

On 15 September 2014 17:09, Robert Haas <robertmhaas@gmail.com> wrote:

Do we really want to disable HOT for all catalog scans?

The intention of the patch is that catalog scans are treated
identically to non-catalog scans. The idea here is that HOT cleanup
only occurs on scans on target relations, so only INSERT, UPDATE and
DELETE do HOT cleanup.

It's possible that many catalog scans don't follow the normal target
relation logic, so we might argue we should use HOT every time. OTOH,
since we now have separate catalog xmins we may find that using HOT on
catalogs is no longer effective. So I could go either way on how to
proceed; it's an easy change either way.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#33Emanuel Calvo
3manuek@esdebian.org
In reply to: Simon Riggs (#30)
Re: Turning off HOT/Cleanup sometimes

El 14/09/14 17:37, Simon Riggs escribió:

On 12 September 2014 18:19, Simon Riggs <simon@2ndquadrant.com> wrote:

On 12 September 2014 15:30, Tom Lane <tgl@sss.pgh.pa.us> wrote:

After a little bit I remembered there was already a function for this.
So specifically, I'd suggest using ExecRelationIsTargetRelation()
to decide whether to mark the scan as requiring pruning.

Sounds cool. Thanks both, this is sounding like a viable route now.

Yes, this is viable.

Patch attached, using Alvaro's idea of use-case specific pruning and
Tom's idea of aiming at target relations. Patch uses or extends
existing infrastructure, so it's shorter than it might have been, yet
with all that bufmgr yuck removed.

This is very, very good because while going through this I noticed the
dozen or more places where we were pruning blocks in annoying places I
didn't even know about, such as 4-5 constraint checks. In more
than a few DDL commands like ALTER TABLE and CLUSTER we were even
pruning the old relation prior to rewrite.

A simple performance test with the following variables:
LOOP=50
CONN=60
TXSS=500
SCALE=30
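The exact commands weren't posted; a harness along these lines would match those variables (the pgbench flags are standard, but the loop, database name, and the averaging helper are an assumed reconstruction):

```shell
# Assumed reconstruction of the benchmark loop; the pgbench calls are
# commented out so the averaging helper can run on its own.
LOOP=50; CONN=60; TXSS=500; SCALE=30
# pgbench -i -s "$SCALE" bench                # one-time initialization
# for i in $(seq "$LOOP"); do
#     pgbench -c "$CONN" -t "$TXSS" -S bench  # -S = select-only
# done | awk '/excluding connections/ {print $3}' > tps.log

# Average a column of tps figures, as in the summaries below:
avg() { awk '{ s += $1; n++ } END { printf "%.1f\n", s / n }'; }
printf '20716.1\n19141.7\n' | avg    # mean of two sample tps values
```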

Select only:
WITH PATCH
Average: 20716.1 tps

NO PATCH
Average: 19141.7 tps

With writes:
WITH PATCH
Average: 2602.65

NO PATCH
Average: 2565.32

TODO:
- Consistency check.
- ALTER and CLUSTER test.

#34Emanuel Calvo
emanuel.calvo@2ndquadrant.com
In reply to: Simon Riggs (#32)
Re: Turning off HOT/Cleanup sometimes

El 15/09/14 18:13, Simon Riggs escribió:

On 15 September 2014 17:09, Robert Haas <robertmhaas@gmail.com> wrote:

Do we really want to disable HOT for all catalog scans?

The intention of the patch is that catalog scans are treated
identically to non-catalog scans. The idea here is that HOT cleanup
only occurs on scans on target relations, so only INSERT, UPDATE and
DELETE do HOT cleanup.

It's possible that many catalog scans don't follow the normal target
relation logic, so we might argue we should use HOT every time. OTOH,
since we now have separate catalog xmins we may find that using HOT on
catalogs is no longer effective. So I could go either way on how to
proceed; it's an easy change either way.

I set up a more concurrent scenario and the difference is quite a bit larger:

Without patch:
1st concurrent with writes:
tps = 5705.261620 (including connections establishing)
tps = 5945.338557 (excluding connections establishing)
2nd no writes being executed:
tps = 9988.792575 (including connections establishing)
tps = 11059.671351 (excluding connections establishing)

Patched version:
1st concurrent with writes:
tps = 9476.741707 (including connections establishing)
tps = 10274.831185 (excluding connections establishing)
2nd no writes being executed:
tps = 12993.644808 (including connections establishing)
tps = 15171.214744 (excluding connections establishing)

Stats (writes have been run with a time limit, not by tx):

hotcleanup=# select relname ,n_live_tup, n_dead_tup, n_tup_hot_upd from
pg_stat_user_tables where relname ~ 'pgbench';
relname | n_live_tup | n_dead_tup | n_tup_hot_upd
------------------+------------+------------+---------------
pgbench_tellers | 500 | 0 | 2044192
pgbench_accounts | 5109728 | 310842 | 1969264
pgbench_history | 2265882 | 0 | 0
pgbench_branches | 50 | 0 | 2237167
(4 rows)

hotcleanup=# select relname ,n_live_tup, n_dead_tup, n_tup_hot_upd from
pg_stat_user_tables where relname ~ 'pgbench';
relname | n_live_tup | n_dead_tup | n_tup_hot_upd
------------------+------------+------------+---------------
pgbench_history | 2148946 | 0 | 0
pgbench_tellers | 500 | 0 | 1969675
pgbench_branches | 50 | 0 | 2150655
pgbench_accounts | 5098774 | 300123 | 1897484
(4 rows)

I ran the regression tests over the patched version and they passed ok.

--
Emanuel Calvo http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#35Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#32)
Re: Turning off HOT/Cleanup sometimes

On 15 September 2014 22:13, Simon Riggs <simon@2ndquadrant.com> wrote:

On 15 September 2014 17:09, Robert Haas <robertmhaas@gmail.com> wrote:

Do we really want to disable HOT for all catalog scans?

The intention of the patch is that catalog scans are treated
identically to non-catalog scans. The idea here is that HOT cleanup
only occurs on scans on target relations, so only INSERT, UPDATE and
DELETE do HOT cleanup.

Since INSERT, UPDATE and DELETE can only be called when
!RecoveryInProgress(), we can completely avoid making this test at the
top of each heap_page_prune_opt() call.

I very much like the simplicity of saying "no target, no cleanup".
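The rule can be modeled in isolation like this (a standalone sketch; in the patch the actual check is ExecRelationIsTargetRelation() against the EState's result-relation info, and the names below are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * "No target, no cleanup": allow opportunistic pruning only when the
 * relation being scanned is also a result relation of the statement,
 * i.e. an INSERT/UPDATE/DELETE target.  Simplified standalone model,
 * not the patch itself.
 */
static bool
scan_allows_prune(const int *result_relids, int nresults, int scanrelid)
{
	int		i;

	for (i = 0; i < nresults; i++)
		if (result_relids[i] == scanrelid)
			return true;	/* target relation: prune as before */
	return false;			/* plain read: leave cleanup to VACUUM */
}
```

A plain SELECT never lists its scanned relation as a result relation, so under this rule its scans end up with pruning disabled.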

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#36Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#32)
Re: Turning off HOT/Cleanup sometimes

On Mon, Sep 15, 2014 at 5:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 15 September 2014 17:09, Robert Haas <robertmhaas@gmail.com> wrote:

Do we really want to disable HOT for all catalog scans?

The intention of the patch is that catalog scans are treated
identically to non-catalog scans. The idea here is that HOT cleanup
only occurs on scans on target relations, so only INSERT, UPDATE and
DELETE do HOT cleanup.

It's possible that many catalog scans don't follow the normal target
relation logic, so we might argue we should use HOT every time. OTOH,
since we now have separate catalog xmins we may find that using HOT on
catalogs is no longer effective. So I could go either way on how to
proceed; it's an easy change either way.

What I'm thinking about is that the smarts to enable pruning is all in
the executor nodes. So anything that updates the catalog without
going through the executor will never be subject to pruning. That
includes nearly all catalog-modifying code throughout the backend.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37Simon Riggs
simon@2ndquadrant.com
In reply to: Robert Haas (#36)
Re: Turning off HOT/Cleanup sometimes

On 19 September 2014 13:04, Robert Haas <robertmhaas@gmail.com> wrote:

What I'm thinking about is that the smarts to enable pruning is all in
the executor nodes. So anything that updates the catalog without
going through the executor will never be subject to pruning. That
includes nearly all catalog-modifying code throughout the backend.

Are you saying this is a problem or a benefit? (and please explain why).

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#38Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#37)
Re: Turning off HOT/Cleanup sometimes

On September 19, 2014 10:16:35 PM CEST, Simon Riggs <simon@2ndquadrant.com> wrote:

On 19 September 2014 13:04, Robert Haas <robertmhaas@gmail.com> wrote:

What I'm thinking about is that the smarts to enable pruning is all

in

the executor nodes. So anything that updates the catalog without
going through the executor will never be subject to pruning. That
includes nearly all catalog-modifying code throughout the backend.

Are you saying this is a problem or a benefit? (and please explain
why).

I have no idea what Robert is thinking of, but I'd imagine it's horrible for workloads with catalog bloat. Like ones involving temp tables.

I generally have serious doubts about disabling it generally for read workloads. I imagine it e.g. will significantly penalize workloads where it's likely that a cleanup lock can't be acquired every time...

Andres

--- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#38)
Re: Turning off HOT/Cleanup sometimes

Andres Freund <andres@anarazel.de> writes:

On September 19, 2014 10:16:35 PM CEST, Simon Riggs <simon@2ndquadrant.com> wrote:

Are you saying this is a problem or a benefit? (and please explain
why).

I have no idea what Robert is thinking of, but I'd imagine it's horrible for workloads with catalog bloat. Like ones involving temp tables.

Yeah. But it's also the case that we know a good deal more about the
access patterns for system-driven catalog updates than we do about user
queries. ISTM we could probably suppress HOT pruning during catalog
*scans* and instead try to do it when a system-driven heap_update
occurs.

Having said that, this could reasonably be considered outside the scope
of a patch that's trying to improve the behavior for user queries.
But if the patch author doesn't want to expand the scope like that,
ISTM he ought to ensure that the behavior *doesn't* change for system
accesses, rather than trying to convince us that disabling HOT for
system updates is a good idea.

regards, tom lane

#40Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#38)
Re: Turning off HOT/Cleanup sometimes

On Fri, Sep 19, 2014 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:

On September 19, 2014 10:16:35 PM CEST, Simon Riggs <simon@2ndquadrant.com> wrote:

On 19 September 2014 13:04, Robert Haas <robertmhaas@gmail.com> wrote:

What I'm thinking about is that the smarts to enable pruning is all

in

the executor nodes. So anything that updates the catalog without
going through the executor will never be subject to pruning. That
includes nearly all catalog-modifying code throughout the backend.

Are you saying this is a problem or a benefit? (and please explain
why).

I have no idea what Robert is thinking of, but I'd imagine it's horrible for workloads with catalog bloat. Like ones involving temp tables.

Right, that's what I was going for.

I generally have serious doubts about disabling it generally for read workloads. I imagine it e.g. will significantly penalize workloads where it's likely that a cleanup lock can't be acquired every time...

I share that doubt. But I understand why Simon wants to do something,
too, because the current situation is not great either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#40)
Re: Turning off HOT/Cleanup sometimes

On 2014-09-19 17:29:08 -0400, Robert Haas wrote:

I generally have serious doubts about disabling it generally for
read workloads. I imagine it e.g. will significantly penalize
workloads where it's likely that a cleanup lock can't be acquired
every time...

I share that doubt. But I understand why Simon wants to do something,
too, because the current situation is not great either.

Right, I totally agree. I doubt a simple approach like this will work in
the general case, but I think something needs to be done.

I think limiting the amount of HOT cleanup for readonly queries is a
good idea, but I think it has to be gradual. Say after a single cleaned
up page at least another 500 pages need to have been touched till the
next hot cleanup. That way a single query won't be penalized with
cleaning up everything, but there'll be some progress.
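A minimal sketch of that throttle, with invented names (the 500-page interval is Andres's example figure, not anything in the patch):

```c
#include <assert.h>
#include <stdbool.h>

#define PRUNE_INTERVAL	500		/* pages touched between cleanups */

typedef struct ScanPruneState
{
	int		pages_since_prune;	/* pages touched since last cleanup */
} ScanPruneState;

/*
 * Gradual limit: after pruning one page, skip pruning until another
 * PRUNE_INTERVAL pages have been touched by this scan, so a single
 * query pays for a bounded share of the cleanup.
 */
static bool
prune_allowed(ScanPruneState *state)
{
	if (state->pages_since_prune >= PRUNE_INTERVAL)
	{
		state->pages_since_prune = 0;	/* prune this page */
		return true;
	}
	state->pages_since_prune++;			/* skip, but count the touch */
	return false;
}
```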

The other thing I think might be quite worthwhile would be to abort hot
cleanup when the gain is only minimal. If e.g. only 1 small tuple is
removed from a half full page it's not worth the cost of the wal logging
et al.
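That check might look something like this; the 2% threshold and all names here are guesses for illustration, nothing in the thread specifies them:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Skip a prune (and its WAL record) when the expected gain is tiny:
 * a handful of bytes reclaimed from a page that isn't full anyway.
 */
static bool
prune_is_worthwhile(int removable_tuples, size_t reclaimable_bytes,
					size_t page_size)
{
	if (removable_tuples == 0)
		return false;			/* nothing to do at all */
	/* don't bother WAL-logging a prune that frees < ~2% of the page */
	return reclaimable_bytes * 50 >= page_size;
}
```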

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#42Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#39)
Re: Turning off HOT/Cleanup sometimes

On 19 September 2014 15:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Having said that, this could reasonably be considered outside the scope
of a patch that's trying to improve the behavior for user queries.
But if the patch author doesn't want to expand the scope like that,
ISTM he ought to ensure that the behavior *doesn't* change for system
accesses, rather than trying to convince us that disabling HOT for
system updates is a good idea.

As I said, I could make an argument to go either way, so I was unsure.

I'm happy to avoid changing behaviour for catalog scans in this patch.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#43Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#39)
Re: Turning off HOT/Cleanup sometimes

On 2014-09-19 16:35:19 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On September 19, 2014 10:16:35 PM CEST, Simon Riggs <simon@2ndquadrant.com> wrote:

Are you saying this is a problem or a benefit? (and please explain
why).

I have no idea what Robert is thinking of, but I'd imagine it's horrible for workloads with catalog bloat. Like ones involving temp tables.

Yeah. But it's also the case that we know a good deal more about the
access patterns for system-driven catalog updates than we do about user
queries. ISTM we could probably suppress HOT pruning during catalog
*scans* and instead try to do it when a system-driven heap_update
occurs.

Having said that, this could reasonably be considered outside the scope
of a patch that's trying to improve the behavior for user queries.
But if the patch author doesn't want to expand the scope like that,
ISTM he ought to ensure that the behavior *doesn't* change for system
accesses, rather than trying to convince us that disabling HOT for
system updates is a good idea.

I think it'd have to change for anything not done via the
executor. There definitely is user defined code out there doing manual
heap_* stuff. I know because I've written some. And I know I'm not the
only one.

If such paths suddenly stop doing HOT cleanup we'll cause a noticeable
amount of pain.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#44Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#41)
Re: Turning off HOT/Cleanup sometimes

On Fri, Sep 19, 2014 at 5:42 PM, Andres Freund <andres@anarazel.de> wrote:

On 2014-09-19 17:29:08 -0400, Robert Haas wrote:

I generally have serious doubts about disabling it generally for
read workloads. I imagine it e.g. will significantly penalize
workloads where it's likely that a cleanup lock can't be acquired
every time...

I share that doubt. But I understand why Simon wants to do something,
too, because the current situation is not great either.

Right, I totally agree. I doubt a simple approach like this will work in
the general case, but I think something needs to be done.

I think limiting the amount of HOT cleanup for readonly queries is a
good idea, but I think it has to be gradual. Say after a single cleaned
up page at least another 500 pages need to have been touched till the
next hot cleanup. That way a single query won't be penalized with
cleaning up everything, but there'll be some progress.

I tried this kind of thing several years ago with hint-bit-setting and
was unimpressed by the results.

/messages/by-id/AANLkTik5QzR8wTs0MqCWwmNp-qHGrdKY5Av5aOB7W4Dp@mail.gmail.com
/messages/by-id/AANLkTimGKaG7wdu-x77GNV2Gh6_Qo5Ss1u5b6Q1MsPUy@mail.gmail.com

Granted, I never tried a ratio as low as 500:1, and HOT pruning is not
the same thing as setting hint bits, but I think the basic problems
are similar, namely:

1. You can't know how many times the page is going to be referenced in
the future before it again gets modified. If that number is small,
then you shouldn't bother with hint bits, or HOT-pruning, or freezing.
But if it's big, you should do all of those things as soon as possible
because the benefits are quite significant. Therefore, any change in
this area is guaranteed to lose on some easy-to-construct workload,
because I just described two of them that want opposing things.

2. Dirtying every N'th page is a great way to generate lots of random
I/O that will quite possibly make your disk almost as sad - or even
sadder - than dirtying all of them, but without anywhere as near as
much performance benefit.

Variations on this idea have been proposed so many times over the
years that I'm tempted to give some credence to the theory that we
ought to adopt one of them. But there will certainly be losers, as
well as winners.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#45Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Robert Haas (#44)
Re: Turning off HOT/Cleanup sometimes

This patch has gotten a fair amount of review, and has been rewritten
once during the commitfest. I think it's pretty close to being
committable, the only remaining question seems to be what to do with
system catalogs. I'm marking this as "Returned with feedback", I take it
that Simon can proceed from here, outside the commitfest.

- Heikki

#46Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#45)
Re: Turning off HOT/Cleanup sometimes

On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote:

This patch has gotten a fair amount of review, and has been rewritten once
during the commitfest. I think it's pretty close to being committable, the
only remaining question seems to be what to do with system catalogs. I'm
marking this as "Returned with feedback", I take it that Simon can proceed
from here, outside the commitfest.

FWIW, I don't think it is, even with that. As is it seems very likely
that it's going to regress a fair share of workloads. At the very least
it needs a fair amount of benchmarking beforehand.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#47Simon Riggs
simon@2ndquadrant.com
In reply to: Andres Freund (#46)
1 attachment(s)
Re: Turning off HOT/Cleanup sometimes

On 27 September 2014 09:29, Andres Freund <andres@anarazel.de> wrote:

On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote:

This patch has gotten a fair amount of review, and has been rewritten once
during the commitfest. I think it's pretty close to being committable, the
only remaining question seems to be what to do with system catalogs. I'm
marking this as "Returned with feedback", I take it that Simon can proceed
from here, outside the commitfest.

FWIW, I don't think it is, even with that. As is it seems very likely
that it's going to regress a fair share of workloads. At the very least
it needs a fair amount of benchmarking beforehand.

There is some doubt there. We've not seen a workload that does
actually exhibit a negative behaviour. I'm not saying one doesn't
exist, but it does matter how common/likely it is. If anyone can
present a performance test case that demonstrates a regression, I
think it will make it easier to discuss how wide that case is and what
we should do about it. Discussing whether to do various kinds of
limited pruning is moot until that is clear.

My memory was that it took months for people to understand the
frequent update use case, since catching it in flagrante delicto was
hard. That may be the case here, or not, but negative-benefit
experimental results very welcome.

Updated patch attached to address earlier comments.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

hot_disable.v6.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ae15c0b..dbf4efb 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -80,7 +80,8 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap);
+						bool is_bitmapscan, bool temp_snap,
+						bool allow_prune);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -342,9 +343,11 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	snapshot = scan->rs_snapshot;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * enabled.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_allow_prune)
+		heap_page_prune_opt(scan->rs_rd, buffer);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -1275,7 +1278,8 @@ heap_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
  * heap_beginscan_strat offers an extended API that lets the caller control
  * whether a nondefault buffer access strategy can be used, and whether
  * syncscan can be chosen (possibly resulting in the scan not starting from
- * block zero).  Both of these default to TRUE with plain heap_beginscan.
+ * block zero).  Both of these default to TRUE with plain heap_beginscan,
+ * while allow_prune defaults to FALSE.
  *
  * heap_beginscan_bm is an alternative entry point for setting up a
  * HeapScanDesc for a bitmap heap scan.  Although that scan technology is
@@ -1288,7 +1292,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, false);
+								   true, true, false, false, false);
 }
 
 HeapScanDesc
@@ -1298,31 +1302,33 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, true);
+								   true, true, false, true, !RecoveryInProgress());
 }
 
 HeapScanDesc
 heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync)
+					 bool allow_strat, bool allow_sync, bool allow_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   allow_strat, allow_sync, false, false);
+								   allow_strat, allow_sync, false, false,
+								   allow_prune);
 }
 
 HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key)
+				  int nkeys, ScanKey key, bool allow_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   false, false, true, false);
+								   false, false, true, false, allow_prune);
 }
 
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap)
+						bool is_bitmapscan, bool temp_snap,
+						bool allow_prune)
 {
 	HeapScanDesc scan;
 
@@ -1348,6 +1354,7 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_strat = allow_strat;
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
+	scan->rs_allow_prune = allow_prune;
 
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 06b5488..8c244db 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -82,8 +82,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * clean the page. The master will likely issue a cleaning WAL record soon
 	 * anyway, so this is no particular loss.
 	 */
-	if (RecoveryInProgress())
-		return;
+	Assert(!RecoveryInProgress());
 
 	/*
 	 * Use the appropriate xmin horizon for this relation. If it's a proper
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 850008b..2163056 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -317,7 +317,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
 											 nkeys, key,
-											 true, false);
+											 true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..52d716b 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -249,10 +249,39 @@ index_beginscan(Relation heapRelation,
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan. Disable pruning in all cases.
+	 */
+	scan->heapRelation = heapRelation;
+	scan->xs_snapshot = snapshot;
+	scan->xs_allow_prune = false;
+
+	return scan;
+}
+
+/*
+ * index_beginscan_strat - start a scan of an index with amgettuple
+ * allowing caller to specify additional scan strategies.
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys,
+				bool allow_prune)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
 	 * up by RelationGetIndexScan.
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->xs_allow_prune = allow_prune;
 
 	return scan;
 }
@@ -278,6 +307,9 @@ index_beginscan_bitmap(Relation indexRelation,
 	 */
 	scan->xs_snapshot = snapshot;
 
+	/* BitmapHeapScan does pruning if required */
+	scan->xs_allow_prune = false;
+
 	return scan;
 }
 
@@ -520,9 +552,9 @@ index_fetch_heap(IndexScanDesc scan)
 											 ItemPointerGetBlockNumber(tid));
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page if enabled, but only if we weren't already on this page
 		 */
-		if (prev_buf != scan->xs_cbuf)
+		if (prev_buf != scan->xs_cbuf && scan->xs_allow_prune)
 			heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
 	}
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index ee10594..beb7fc1 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2165,7 +2165,8 @@ IndexBuildHeapScan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								allow_sync);	/* syncscan OK? */
+								allow_sync,	/* syncscan OK? */
+								false);		/* disallow pruning */
 
 	reltuples = 0;
 
@@ -2543,7 +2544,8 @@ IndexCheckExclusion(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								true);	/* syncscan OK */
+								true,	/* syncscan OK */
+								false);		/* disallow pruning */
 
 	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
 	{
@@ -2817,7 +2819,8 @@ validate_index_heapscan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								false); /* syncscan not OK */
+								false, /* syncscan not OK */
+								false);		/* disallow pruning */
 
 	/*
 	 * Scan all tuples matching the snapshot.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9b1e975..8f9b122 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -334,9 +334,11 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
 	ntup = 0;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * allowed.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_allow_prune)
+		heap_page_prune_opt(scan->rs_rd, buffer);
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -537,6 +539,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		relistarget;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -597,6 +600,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
 
 	scanstate->ss.ss_currentRelation = currentRelation;
+	relistarget = ExecRelationIsTargetRelation(estate, node->scan.scanrelid);
 
 	/*
 	 * Even though we aren't going to do a conventional seqscan, it is useful
@@ -605,7 +609,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ss_currentScanDesc = heap_beginscan_bm(currentRelation,
 														 estate->es_snapshot,
 														 0,
-														 NULL);
+														 NULL,
+														 relistarget);
 
 	/*
 	 * get the scan type from the relation descriptor.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index afcd1ff..b154823 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -523,11 +523,12 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->ioss_ScanDesc = index_beginscan_strat(currentRelation,
 												indexstate->ioss_RelationDesc,
 												estate->es_snapshot,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+											indexstate->ioss_NumOrderByKeys,
+											relistarget);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 2b89dc6..3512f50 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -603,11 +603,12 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->iss_ScanDesc = index_beginscan_strat(currentRelation,
 											   indexstate->iss_RelationDesc,
 											   estate->es_snapshot,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+											   indexstate->iss_NumOrderByKeys,
+											   relistarget);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ab13e47..7240413 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -126,20 +126,30 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 {
 	Relation	currentRelation;
 	HeapScanDesc currentScanDesc;
+	Oid			currentRelid;
+	bool		relistarget;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table
+	 */
+	currentRelid = ((SeqScan *) node->ps.plan)->scanrelid;
+	relistarget = ExecRelationIsTargetRelation(estate, currentRelid);
 
 	/*
-	 * get the relation object id from the relid'th entry in the range table,
 	 * open that relation and acquire appropriate lock on it.
 	 */
 	currentRelation = ExecOpenScanRelation(estate,
-									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   currentRelid,
 										   eflags);
 
 	/* initialize a heapscan */
-	currentScanDesc = heap_beginscan(currentRelation,
+	currentScanDesc = heap_beginscan_strat(currentRelation,
 									 estate->es_snapshot,
 									 0,
-									 NULL);
+									 NULL,
+									 true,
+									 true,
+									 relistarget);
 
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index d99158f..eb993ae 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -135,6 +135,10 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
 				int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys, bool allow_prune);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 493839f..3eea50c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -109,9 +109,9 @@ extern HeapScanDesc heap_beginscan_catalog(Relation relation, int nkeys,
 					   ScanKey key);
 extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync);
+					 bool allow_strat, bool allow_sync, bool allow_prune);
 extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key);
+				  int nkeys, ScanKey key, bool allow_prune);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8a57698..815768f 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -33,6 +33,7 @@ typedef struct HeapScanDescData
 	bool		rs_allow_strat; /* allow or disallow use of access strategy */
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
+	bool		rs_allow_prune;	/* HOT-prune during this scan; do not use in Hot Standby */
 
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* number of blocks to scan */
@@ -71,6 +72,7 @@ typedef struct IndexScanDescData
 	ScanKey		keyData;		/* array of index qualifier descriptors */
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
+	bool		xs_allow_prune;	/* HOT-prune during this scan; do not use in Hot Standby */
 
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
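The control-flow change the diff makes in heapgetpage() and bitgetpage() boils down to a flag check at page-access time: scans opened with allow_prune = false simply skip heap_page_prune_opt(). A minimal standalone sketch of that gating (the Mock* types here are invented simplifications, not the real PostgreSQL structs):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the patched scan descriptor; mirrors the
 * rs_allow_prune field added by the diff, nothing more. */
typedef struct
{
	bool		rs_allow_prune; /* set when the scan is begun */
	int			pages_pruned;	/* how often the (mock) pruner ran */
} MockHeapScan;

static void
mock_heap_page_prune_opt(MockHeapScan *scan)
{
	scan->pages_pruned++;
}

/* Mirrors the patched heapgetpage(): prune only if the scan allows it. */
static void
mock_heapgetpage(MockHeapScan *scan)
{
	if (scan->rs_allow_prune)
		mock_heap_page_prune_opt(scan);
}
```

With this shape, a target relation of an UPDATE (relistarget = true) keeps pruning on every page visit, while a plain read-only scan never dirties pages for cleanup.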
#48Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#47)
Re: Turning off HOT/Cleanup sometimes

On 2014-09-28 19:51:36 +0100, Simon Riggs wrote:

On 27 September 2014 09:29, Andres Freund <andres@anarazel.de> wrote:

On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote:

This patch has gotten a fair amount of review, and has been rewritten once
during the commitfest. I think it's pretty close to being committable, the
only remaining question seems to be what to do with system catalogs. I'm
marking this as "Returned with feedback", I take it that Simon can proceed
from here, outside the commitfest.

FWIW, I don't think it is, even with that. As is it seems very likely
that it's going to regress a fair share of workloads. At the very least
it needs a fair amount of benchmarking beforehand.

There is some doubt there. We've not seen a workload that actually
exhibits negative behaviour.

Neither is there much data about the magnitude of positive effect the
patch has...

I'm not saying one doesn't exist, but it does matter how common/likely
it is. If anyone can present a performance test case that demonstrates
a regression, I think it will make it easier to discuss how wide that
case is and what we should do about it. Discussing whether to do
various kinds of limited pruning are moot until that is clear.

I doubt it'll be hard to construct a case where it'll show. My first try
of using a pgbench scale 100, -M prepared, -cj8 with a custom file with
1 write and 5 read transaction yielded the following on my laptop:

Baseline:
relname | pgbench_tellers
pg_total_relation_size | 458752
relname | pgbench_accounts
pg_total_relation_size | 1590337536
relname | pgbench_branches
pg_total_relation_size | 286720
relname | pgbench_history
pg_total_relation_size | 49979392
Patched:
relname | pgbench_tellers
pg_total_relation_size | 516096
relname | pgbench_accounts
pg_total_relation_size | 1590337536
relname | pgbench_branches
pg_total_relation_size | 360448
relname | pgbench_history
pg_total_relation_size | 49528832

So, there's a noticeable increase in size. Mostly on the smaller tables,
so probably HOT cleanup was sometimes skipped during UPDATEs due to
locks.

Baseline was:
tps = 9655.486532 (excluding connections establishing)
Patched was:
tps = 9466.158701 (including connections establishing)

That's not an unrealistic testcase.

I'm pretty sure this could be made quite a bit more pronounced by not
using a uniform distribution in the pgbench runs. And selecting a test
that's more vulnerable to the change (e.g. using a wider distribution
for the read only statements than the modifying ones) would make the
CPU overhead of the additional heap_hot_search_buffer() calls
heavier.

My memory was that it took months for people to understand the
frequent update use case, since catching it in flagrante delicto was
hard. That may be the case here, or not, but negative-benefit
experimental results very welcome.

Updated patch attached to address earlier comments.

contrib (at least pgstattuple) doesn't currently compile with
this... Easily patched up tho.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#49Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#48)
Re: Turning off HOT/Cleanup sometimes

On 2014-09-29 11:13:43 +0200, Andres Freund wrote:

I doubt it'll be hard to construct a case where it'll show. My first try
of using a pgbench scale 100, -M prepared, -cj8 with a custom file with
1 write and 5 read transaction yielded the following on my laptop:

Hm. On second thought that testcase probably *decreased* the bad effects
of this because pgbench's readonly statements only touch one table...

patched:

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#50Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#49)
1 attachment(s)
Re: Turning off HOT/Cleanup sometimes

On 2014-09-29 11:31:11 +0200, Andres Freund wrote:

On 2014-09-29 11:13:43 +0200, Andres Freund wrote:

I doubt it'll be hard to construct a case where it'll show. My first try
of using a pgbench scale 100, -M prepared, -cj8 with a custom file with
1 write and 5 read transaction yielded the following on my laptop:

Hm. On second thought that testcase probably *decreased* the bad effects
of this because pgbench's readonly statements only touch one table...

I've attached the file nonetheless, for posterities sake.

Note it has the scale hardcoded...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

trans_read5_write1.sql (text/plain; charset=us-ascii)
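The attachment itself is not reproduced in the archive. A hypothetical reconstruction of a 1-write/5-read pgbench custom file of the kind described (account range assumes the hardcoded scale 100, i.e. 10,000,000 rows; the real script may well differ):

```sql
-- hypothetical sketch of trans_read5_write1.sql, not the actual attachment
\setrandom aid 1 10000000
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 10000000
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 10000000
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 10000000
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 10000000
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
\setrandom aid 1 10000000
\setrandom delta -5000 5000
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
```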
#51Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#50)
Re: Turning off HOT/Cleanup sometimes

What happened to this patch? I'm going over something that could use
the concept of "clean some stuff up when reading this page, but only if
we're already writing" or similar.

I see some cases were presented that had a performance decrease. Did we
get any numbers for the increase in performance in some other
interesting cases?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#52Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#51)
Re: Turning off HOT/Cleanup sometimes

On 17 November 2014 21:09, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

What happened to this patch? I'm going over something that could use
the concept of "clean some stuff up when reading this page, but only if
we're already writing" or similar.

I see some cases were presented that had a performance decrease. Did we
get any numbers for the increase in performance in some other
interesting cases?

It's not dead; it just needs more work. Maybe for the next CF, or you can pick it up now.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#53Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#52)
1 attachment(s)
Re: Turning off HOT/Cleanup sometimes

On 17 November 2014 at 22:08, Simon Riggs <simon@2ndquadrant.com> wrote:

On 17 November 2014 21:09, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

What happened to this patch? I'm going over something that could use
the concept of "clean some stuff up when reading this page, but only if
we're already writing" or similar.

I see some cases were presented that had a performance decrease. Did we
get any numbers for the increase in performance in some other
interesting cases?

It's not dead; it just needs more work. Maybe for next CF, or you can now.

Latest version attached for next CF

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

hot_disable.v7.patch (application/octet-stream)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index df4853b..164339f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -80,7 +80,8 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap);
+						bool is_bitmapscan, bool temp_snap,
+						bool always_prune);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -349,9 +350,15 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	snapshot = scan->rs_snapshot;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * within limits, if any.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+		scan->rs_always_prune)
+	{
+		if (heap_page_prune_opt(scan->rs_rd, buffer) && !scan->rs_always_prune)
+			scan->rs_prune_count++;
+	}
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -1286,7 +1293,8 @@ heap_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
  * heap_beginscan_strat offers an extended API that lets the caller control
  * whether a nondefault buffer access strategy can be used, and whether
  * syncscan can be chosen (possibly resulting in the scan not starting from
- * block zero).  Both of these default to TRUE with plain heap_beginscan.
+ * block zero).  Both of these default to TRUE with plain heap_beginscan,
+ * while always_prune defaults to FALSE for regular scans.
  *
  * heap_beginscan_bm is an alternative entry point for setting up a
  * HeapScanDesc for a bitmap heap scan.  Although that scan technology is
@@ -1299,7 +1307,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, false);
+								   true, true, false, false, false);
 }
 
 HeapScanDesc
@@ -1309,31 +1317,33 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, true);
+								   true, true, false, true, true);
 }
 
 HeapScanDesc
 heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync)
+					 bool allow_strat, bool allow_sync, bool always_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   allow_strat, allow_sync, false, false);
+								   allow_strat, allow_sync, false, false,
+								   always_prune);
 }
 
 HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key)
+				  int nkeys, ScanKey key, bool always_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   false, false, true, false);
+								   false, false, true, false, always_prune);
 }
 
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap)
+						bool is_bitmapscan, bool temp_snap,
+						bool always_prune)
 {
 	HeapScanDesc scan;
 
@@ -1360,6 +1370,9 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
 
+	scan->rs_prune_count = 0;
+	scan->rs_always_prune = always_prune;
+
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
 	 */
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index e4561bc..b6fcfd7 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -70,8 +70,10 @@ static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
  *
  * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
  * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ *
+ * Returns true if the block was cleaned.
  */
-void
+bool
 heap_page_prune_opt(Relation relation, Buffer buffer)
 {
 	Page		page = BufferGetPage(buffer);
@@ -84,7 +86,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * anyway, so this is no particular loss.
 	 */
 	if (RecoveryInProgress())
-		return;
+		return false;
 
 	/*
 	 * Use the appropriate xmin horizon for this relation. If it's a proper
@@ -108,7 +110,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * older than OldestXmin.
 	 */
 	if (!PageIsPrunable(page, OldestXmin))
-		return;
+		return false;
 
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
@@ -130,7 +132,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	{
 		/* OK, try to get exclusive buffer lock */
 		if (!ConditionalLockBufferForCleanup(buffer))
-			return;
+			return false;
 
 		/*
 		 * Now that we have buffer lock, get accurate information about the
@@ -149,7 +151,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 
 		/* And release buffer lock */
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+		return true;
 	}
+
+	return false;
 }
 
 
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 8849c08..85fa12c 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -318,7 +318,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
 											 nkeys, key,
-											 true, false);
+											 true, false, true);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..a5c1c19 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -249,10 +249,41 @@ index_beginscan(Relation heapRelation,
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan. Disable pruning in all cases.
+	 */
+	scan->heapRelation = heapRelation;
+	scan->xs_snapshot = snapshot;
+	scan->xs_always_prune = true;
+	scan->xs_prune_count = 0;
+
+	return scan;
+}
+
+/*
+ * index_beginscan_strat - start a scan of an index with amgettuple
+ * allowing caller to specify additional scan strategies.
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys,
+				bool always_prune)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
 	 * up by RelationGetIndexScan.
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->xs_always_prune = always_prune;
+	scan->xs_prune_count = 0;
 
 	return scan;
 }
@@ -278,6 +309,10 @@ index_beginscan_bitmap(Relation indexRelation,
 	 */
 	scan->xs_snapshot = snapshot;
 
+	/* BitmapHeapScan does pruning if required */
+	scan->xs_always_prune = false;
+	scan->xs_prune_count = 0;
+
 	return scan;
 }
 
@@ -520,10 +555,15 @@ index_fetch_heap(IndexScanDesc scan)
 											 ItemPointerGetBlockNumber(tid));
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page if enabled, but only if we weren't already on this page
 		 */
-		if (prev_buf != scan->xs_cbuf)
-			heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+		if (prev_buf != scan->xs_cbuf &&
+			(scan->xs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+			 scan->xs_always_prune))
+		{
+			if (heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf) && !scan->xs_always_prune)
+				scan->xs_prune_count++;
+		}
 	}
 
 	/* Obtain share-lock on the buffer so we can examine visibility */
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 844d413..504aaab 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2193,7 +2193,8 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								allow_sync);	/* syncscan OK? */
+								allow_sync,	/* syncscan OK? */
+								false);		/* pruning optional */
 
 	/* set our scan endpoints */
 	heap_setscanlimits(scan, start_blockno, numblocks);
@@ -2574,7 +2575,8 @@ IndexCheckExclusion(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								true);	/* syncscan OK */
+								true,	/* syncscan OK */
+								false);		/* pruning optional */
 
 	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
 	{
@@ -2848,7 +2850,8 @@ validate_index_heapscan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								false); /* syncscan not OK */
+								false, /* syncscan not OK */
+								false);		/* pruning optional */
 
 	/*
 	 * Scan all tuples matching the snapshot.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9b1e975..5d8e537 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -334,9 +334,15 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
 	ntup = 0;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible
+	 * and within limits, if any.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+		scan->rs_always_prune)
+	{
+		if (heap_page_prune_opt(scan->rs_rd, buffer) && !scan->rs_always_prune)
+			scan->rs_prune_count++;
+	}
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -537,6 +543,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		relistarget;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -597,6 +604,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
 
 	scanstate->ss.ss_currentRelation = currentRelation;
+	relistarget = ExecRelationIsTargetRelation(estate, node->scan.scanrelid);
 
 	/*
 	 * Even though we aren't going to do a conventional seqscan, it is useful
@@ -605,7 +613,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ss_currentScanDesc = heap_beginscan_bm(currentRelation,
 														 estate->es_snapshot,
 														 0,
-														 NULL);
+														 NULL,
+														 relistarget);
 
 	/*
 	 * get the scan type from the relation descriptor.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index afcd1ff..b154823 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -523,11 +523,12 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->ioss_ScanDesc = index_beginscan_strat(currentRelation,
 												indexstate->ioss_RelationDesc,
 												estate->es_snapshot,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+											indexstate->ioss_NumOrderByKeys,
+											relistarget);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 2b89dc6..3512f50 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -603,11 +603,12 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->iss_ScanDesc = index_beginscan_strat(currentRelation,
 											   indexstate->iss_RelationDesc,
 											   estate->es_snapshot,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+											   indexstate->iss_NumOrderByKeys,
+											   relistarget);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 53cfda5..321820b 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -124,20 +124,30 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 {
 	Relation	currentRelation;
 	HeapScanDesc currentScanDesc;
+	Oid			currentRelid;
+	bool		relistarget;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table
+	 */
+	currentRelid = ((SeqScan *) node->ps.plan)->scanrelid;
+	relistarget = ExecRelationIsTargetRelation(estate, currentRelid);
 
 	/*
-	 * get the relation object id from the relid'th entry in the range table,
 	 * open that relation and acquire appropriate lock on it.
 	 */
 	currentRelation = ExecOpenScanRelation(estate,
-									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   currentRelid,
 										   eflags);
 
 	/* initialize a heapscan */
-	currentScanDesc = heap_beginscan(currentRelation,
+	currentScanDesc = heap_beginscan_strat(currentRelation,
 									 estate->es_snapshot,
 									 0,
-									 NULL);
+									 NULL,
+									 true,
+									 true,
+									 relistarget);
 
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index d99158f..bc3885d 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -135,6 +135,10 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
 				int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys, bool always_prune);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f43b482..7144b61 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -110,9 +110,9 @@ extern HeapScanDesc heap_beginscan_catalog(Relation relation, int nkeys,
 					   ScanKey key);
 extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync);
+					 bool allow_strat, bool allow_sync, bool always_prune);
 extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key);
+				  int nkeys, ScanKey key, bool always_prune);
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
@@ -164,7 +164,7 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
 extern void heap_sync(Relation relation);
 
 /* in heap/pruneheap.c */
-extern void heap_page_prune_opt(Relation relation, Buffer buffer);
+extern bool heap_page_prune_opt(Relation relation, Buffer buffer);
 extern int heap_page_prune(Relation relation, Buffer buffer,
 				TransactionId OldestXmin,
 				bool report_stats, TransactionId *latestRemovedXid);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index f2c7ca1..c5b57f3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,7 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+#define PRUNE_LIMIT_PER_SCAN	4
 
 typedef struct HeapScanDescData
 {
@@ -34,6 +35,9 @@ typedef struct HeapScanDescData
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
 
+	bool		rs_always_prune;/* Unlimited pruning allowed */
+	int			rs_prune_count;	/* Number of blocks pruned during this scan */
+
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* total number of blocks in rel */
 	BlockNumber rs_startblock;	/* block # to start at */
@@ -72,6 +76,9 @@ typedef struct IndexScanDescData
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
 
+	bool		xs_always_prune;/* Unlimited pruning allowed */
+	int			xs_prune_count;	/* Number of blocks pruned during this scan */
+
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
 	bool		ignore_killed_tuples;	/* do not return killed entries */
#54 Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#53)
Re: Turning off HOT/Cleanup sometimes

On Thu, Dec 11, 2014 at 7:24 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 17 November 2014 at 22:08, Simon Riggs <simon@2ndquadrant.com> wrote:

On 17 November 2014 21:09, Alvaro Herrera <alvherre@2ndquadrant.com>

wrote:

What happened to this patch? I'm going over something that could use
the concept of "clean some stuff up when reading this page, but only if
we're already writing" or similar.

I see some cases were presented that had a performance decrease. Did we
get any numbers for the increase in performance in some other
interesting cases?

It's not dead; it just needs more work. Maybe for next CF, or you can now.

Latest version attached for next CF

I still get the compiler error in contrib:

pgstattuple.c: In function 'pgstat_heap':
pgstattuple.c:279: error: too few arguments to function
'heap_beginscan_strat'

Should it pass false for the always_prune?

Cheers,

Jeff

#55 Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#54)
1 attachment(s)
Re: Turning off HOT/Cleanup sometimes

On 15 December 2014 at 20:26, Jeff Janes <jeff.janes@gmail.com> wrote:

I still get the compiler error in contrib:

pgstattuple.c: In function 'pgstat_heap':
pgstattuple.c:279: error: too few arguments to function
'heap_beginscan_strat'

Should it pass false for the always_prune?

Yes.

New version attached.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

hot_disable.v8.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c3a8b1d..9e06334 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -276,7 +276,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 	SnapshotData SnapshotDirty;
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = heap_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = heap_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	InitDirtySnapshot(SnapshotDirty);
 
 	nblocks = scan->rs_nblocks; /* # blocks to be scanned */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index df4853b..164339f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -80,7 +80,8 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap);
+						bool is_bitmapscan, bool temp_snap,
+						bool always_prune);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -349,9 +350,15 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	snapshot = scan->rs_snapshot;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * within limits, if any.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+		scan->rs_always_prune)
+	{
+		if (heap_page_prune_opt(scan->rs_rd, buffer) && !scan->rs_always_prune)
+			scan->rs_prune_count++;
+	}
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -1286,7 +1293,8 @@ heap_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
  * heap_beginscan_strat offers an extended API that lets the caller control
  * whether a nondefault buffer access strategy can be used, and whether
  * syncscan can be chosen (possibly resulting in the scan not starting from
- * block zero).  Both of these default to TRUE with plain heap_beginscan.
+ * block zero).  Both of these default to TRUE with plain heap_beginscan,
+ * while always_prune defaults to FALSE for regular scans.
  *
  * heap_beginscan_bm is an alternative entry point for setting up a
  * HeapScanDesc for a bitmap heap scan.  Although that scan technology is
@@ -1299,7 +1307,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, false);
+								   true, true, false, false, false);
 }
 
 HeapScanDesc
@@ -1309,31 +1317,33 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, true);
+								   true, true, false, true, true);
 }
 
 HeapScanDesc
 heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync)
+					 bool allow_strat, bool allow_sync, bool always_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   allow_strat, allow_sync, false, false);
+								   allow_strat, allow_sync, false, false,
+								   always_prune);
 }
 
 HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key)
+				  int nkeys, ScanKey key, bool always_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   false, false, true, false);
+								   false, false, true, false, always_prune);
 }
 
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap)
+						bool is_bitmapscan, bool temp_snap,
+						bool always_prune)
 {
 	HeapScanDesc scan;
 
@@ -1360,6 +1370,9 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
 
+	scan->rs_prune_count = 0;
+	scan->rs_always_prune = always_prune;
+
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
 	 */
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 2c09128..8538e5a 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -70,8 +70,10 @@ static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
  *
  * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
  * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ *
+ * Returns true if the block was cleaned.
  */
-void
+bool
 heap_page_prune_opt(Relation relation, Buffer buffer)
 {
 	Page		page = BufferGetPage(buffer);
@@ -84,7 +86,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * anyway, so this is no particular loss.
 	 */
 	if (RecoveryInProgress())
-		return;
+		return false;
 
 	/*
 	 * Use the appropriate xmin horizon for this relation. If it's a proper
@@ -108,7 +110,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * older than OldestXmin.
 	 */
 	if (!PageIsPrunable(page, OldestXmin))
-		return;
+		return false;
 
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
@@ -130,7 +132,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	{
 		/* OK, try to get exclusive buffer lock */
 		if (!ConditionalLockBufferForCleanup(buffer))
-			return;
+			return false;
 
 		/*
 		 * Now that we have buffer lock, get accurate information about the
@@ -149,7 +151,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 
 		/* And release buffer lock */
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+		return true;
 	}
+
+	return false;
 }
 
 
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 8849c08..85fa12c 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -318,7 +318,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
 											 nkeys, key,
-											 true, false);
+											 true, false, true);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 53cf96f..a5c1c19 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -249,10 +249,41 @@ index_beginscan(Relation heapRelation,
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan. Pruning is never limited for these scans.
+	 */
+	scan->heapRelation = heapRelation;
+	scan->xs_snapshot = snapshot;
+	scan->xs_always_prune = true;
+	scan->xs_prune_count = 0;
+
+	return scan;
+}
+
+/*
+ * index_beginscan_strat - start a scan of an index with amgettuple
+ * allowing caller to specify additional scan strategies.
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys,
+				bool always_prune)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
 	 * up by RelationGetIndexScan.
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->xs_always_prune = always_prune;
+	scan->xs_prune_count = 0;
 
 	return scan;
 }
@@ -278,6 +309,10 @@ index_beginscan_bitmap(Relation indexRelation,
 	 */
 	scan->xs_snapshot = snapshot;
 
+	/* BitmapHeapScan does pruning if required */
+	scan->xs_always_prune = false;
+	scan->xs_prune_count = 0;
+
 	return scan;
 }
 
@@ -520,10 +555,15 @@ index_fetch_heap(IndexScanDesc scan)
 											 ItemPointerGetBlockNumber(tid));
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page if enabled, but only if we weren't already on this page
 		 */
-		if (prev_buf != scan->xs_cbuf)
-			heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+		if (prev_buf != scan->xs_cbuf &&
+			(scan->xs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+			 scan->xs_always_prune))
+		{
+			if (heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf) && !scan->xs_always_prune)
+				scan->xs_prune_count++;
+		}
 	}
 
 	/* Obtain share-lock on the buffer so we can examine visibility */
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 844d413..504aaab 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2193,7 +2193,8 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								allow_sync);	/* syncscan OK? */
+								allow_sync,	/* syncscan OK? */
+								false);		/* pruning optional */
 
 	/* set our scan endpoints */
 	heap_setscanlimits(scan, start_blockno, numblocks);
@@ -2574,7 +2575,8 @@ IndexCheckExclusion(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								true);	/* syncscan OK */
+								true,	/* syncscan OK */
+								false);		/* pruning optional */
 
 	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
 	{
@@ -2848,7 +2850,8 @@ validate_index_heapscan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								false); /* syncscan not OK */
+								false, /* syncscan not OK */
+								false);		/* pruning optional */
 
 	/*
 	 * Scan all tuples matching the snapshot.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 9b1e975..5d8e537 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -334,9 +334,15 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
 	ntup = 0;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible
+	 * and within limits, if any.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+		scan->rs_always_prune)
+	{
+		if (heap_page_prune_opt(scan->rs_rd, buffer) && !scan->rs_always_prune)
+			scan->rs_prune_count++;
+	}
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -537,6 +543,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		relistarget;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -597,6 +604,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
 
 	scanstate->ss.ss_currentRelation = currentRelation;
+	relistarget = ExecRelationIsTargetRelation(estate, node->scan.scanrelid);
 
 	/*
 	 * Even though we aren't going to do a conventional seqscan, it is useful
@@ -605,7 +613,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ss_currentScanDesc = heap_beginscan_bm(currentRelation,
 														 estate->es_snapshot,
 														 0,
-														 NULL);
+														 NULL,
+														 relistarget);
 
 	/*
 	 * get the scan type from the relation descriptor.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index afcd1ff..b154823 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -523,11 +523,12 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->ioss_ScanDesc = index_beginscan_strat(currentRelation,
 												indexstate->ioss_RelationDesc,
 												estate->es_snapshot,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+											indexstate->ioss_NumOrderByKeys,
+											relistarget);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 2b89dc6..3512f50 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -603,11 +603,12 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->iss_ScanDesc = index_beginscan_strat(currentRelation,
 											   indexstate->iss_RelationDesc,
 											   estate->es_snapshot,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+											   indexstate->iss_NumOrderByKeys,
+											   relistarget);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 53cfda5..321820b 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -124,20 +124,30 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 {
 	Relation	currentRelation;
 	HeapScanDesc currentScanDesc;
+	Oid			currentRelid;
+	bool		relistarget;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table
+	 */
+	currentRelid = ((SeqScan *) node->ps.plan)->scanrelid;
+	relistarget = ExecRelationIsTargetRelation(estate, currentRelid);
 
 	/*
-	 * get the relation object id from the relid'th entry in the range table,
 	 * open that relation and acquire appropriate lock on it.
 	 */
 	currentRelation = ExecOpenScanRelation(estate,
-									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   currentRelid,
 										   eflags);
 
 	/* initialize a heapscan */
-	currentScanDesc = heap_beginscan(currentRelation,
+	currentScanDesc = heap_beginscan_strat(currentRelation,
 									 estate->es_snapshot,
 									 0,
-									 NULL);
+									 NULL,
+									 true,
+									 true,
+									 relistarget);
 
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index d99158f..bc3885d 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -135,6 +135,10 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
 				int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys, bool always_prune);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f43b482..7144b61 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -110,9 +110,9 @@ extern HeapScanDesc heap_beginscan_catalog(Relation relation, int nkeys,
 					   ScanKey key);
 extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync);
+					 bool allow_strat, bool allow_sync, bool always_prune);
 extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key);
+				  int nkeys, ScanKey key, bool always_prune);
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
@@ -164,7 +164,7 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
 extern void heap_sync(Relation relation);
 
 /* in heap/pruneheap.c */
-extern void heap_page_prune_opt(Relation relation, Buffer buffer);
+extern bool heap_page_prune_opt(Relation relation, Buffer buffer);
 extern int heap_page_prune(Relation relation, Buffer buffer,
 				TransactionId OldestXmin,
 				bool report_stats, TransactionId *latestRemovedXid);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index f2c7ca1..c5b57f3 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,7 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+#define PRUNE_LIMIT_PER_SCAN	4
 
 typedef struct HeapScanDescData
 {
@@ -34,6 +35,9 @@ typedef struct HeapScanDescData
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
 
+	bool		rs_always_prune;/* Unlimited pruning allowed */
+	int			rs_prune_count;	/* Number of blocks pruned during this scan */
+
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* total number of blocks in rel */
 	BlockNumber rs_startblock;	/* block # to start at */
@@ -72,6 +76,9 @@ typedef struct IndexScanDescData
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
 
+	bool		xs_always_prune;/* Unlimited pruning allowed */
+	int			xs_prune_count;	/* Number of blocks pruned during this scan */
+
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
 	bool		ignore_killed_tuples;	/* do not return killed entries */
#56 Michael Paquier
michael.paquier@gmail.com
In reply to: Simon Riggs (#55)
Re: Turning off HOT/Cleanup sometimes

On Wed, Dec 17, 2014 at 5:39 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 15 December 2014 at 20:26, Jeff Janes <jeff.janes@gmail.com> wrote:

I still get the compiler error in contrib:

pgstattuple.c: In function 'pgstat_heap':
pgstattuple.c:279: error: too few arguments to function
'heap_beginscan_strat'

Should it pass false for the always_prune?

Yes.

New version attached.

Moved patch to CF 2015-02 with same status "Needs review". It visibly needs
more work, and numbers showing an increase in performance: so far only cases
showing a performance decrease have been presented.
--
Michael

#57 Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#55)
Re: Turning off HOT/Cleanup sometimes

On Wed, Dec 17, 2014 at 12:39 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 15 December 2014 at 20:26, Jeff Janes <jeff.janes@gmail.com> wrote:

I still get the compiler error in contrib:

pgstattuple.c: In function 'pgstat_heap':
pgstattuple.c:279: error: too few arguments to function
'heap_beginscan_strat'

Should it pass false for the always_prune?

Yes.

New version attached.

This no longer applies directly against head, but if I apply to an older
checkout and then do "git checkout -m origin" it rolls forward cleanly.

Did versions 7 and 8 of this patch address Andres' concern about
performance regressions?

Thanks,

Jeff

#58 Peter Eisentraut
peter_e@gmx.net
In reply to: Simon Riggs (#55)
Re: Turning off HOT/Cleanup sometimes

On 12/17/14 3:39 AM, Simon Riggs wrote:

On 15 December 2014 at 20:26, Jeff Janes <jeff.janes@gmail.com> wrote:

I still get the compiler error in contrib:

pgstattuple.c: In function 'pgstat_heap':
pgstattuple.c:279: error: too few arguments to function
'heap_beginscan_strat'

Should it pass false for the always_prune?

Yes.

New version attached.

README.HOT section "When can/should we prune or defragment?" needs a
major update as a result of this patch.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#59 Peter Eisentraut
peter_e@gmx.net
In reply to: Jeff Janes (#57)
Re: Turning off HOT/Cleanup sometimes

On 3/9/15 1:36 PM, Jeff Janes wrote:

Did versions 7 and 8 of this patch address Andres' concern about
performance regressions?

I don't think so. Andres basically wanted a nontrivial algorithm to
determine how much pruning to do during a read-only scan. And Robert
basically said, that's not really possible.

The presented patch actually has a hardcoded prune limit of 4 per scan,
which I don't see mentioned in the discussion anywhere (except in very
early versions, where this was exposed as a knob).

I think most people were of the opinion that scans on system catalogs
should not be affected by this behavior change. Makes sense to me:
System catalog bloat is likely a bigger problem than speeding up queries
on catalogs with large live data.

And then there is still some disagreement whether just turning this on
is tolerable for all uses. Andres mentioned workloads that have trouble
getting a cleanup lock. README.HOT seems to think that cleaning up
during reads is important because skipping over dead tuples is
expensive. Nobody seems to like the idea of (implicitly) pushing more
responsibility on VACUUM. We have seen some benchmarks that show
significant improvements. We have seen some (constructed ones) that
show problems.

I don't know how to move forward. We could give users a knob: This
might make your queries faster or not -- good luck. But of course
nobody will like that either.

#60 Andres Freund
andres@anarazel.de
In reply to: Peter Eisentraut (#59)
Re: Turning off HOT/Cleanup sometimes

On 2015-03-11 20:55:18 -0400, Peter Eisentraut wrote:

I don't think so. Andres basically wanted a nontrivial algorithm to
determine how much pruning to do during a read-only scan. And Robert
basically said, that's not really possible.

I don't think either of us made really strong statements.

We have seen some benchmarks that show significant improvements. We
have seen some (constructed ones) that show problems.

FWIW, it's not that constructed. It's just a mixture of read and write
load.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#61 Peter Eisentraut
peter_e@gmx.net
In reply to: Andres Freund (#60)
Re: Turning off HOT/Cleanup sometimes

On 3/12/15 5:41 AM, Andres Freund wrote:

On 2015-03-11 20:55:18 -0400, Peter Eisentraut wrote:

I don't think so. Andres basically wanted a nontrivial algorithm to
determine how much pruning to do during a read-only scan. And Robert
basically said, that's not really possible.

I don't think either of us made really strong statements.

I didn't mean to put words in your mouth. I just wanted to summarize
the thread as, Andres wanted more fine-tuning on the behavior, Robert
expressed serious doubts that that will lead to an acceptable result.

#62 Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#61)
Re: Turning off HOT/Cleanup sometimes

On Thu, Mar 12, 2015 at 3:48 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On 3/12/15 5:41 AM, Andres Freund wrote:

On 2015-03-11 20:55:18 -0400, Peter Eisentraut wrote:

I don't think so. Andres basically wanted a nontrivial algorithm to
determine how much pruning to do during a read-only scan. And Robert
basically said, that's not really possible.

I don't think either of us made really strong statements.

I didn't mean to put words in your mouth. I just wanted to summarize
the thread as, Andres wanted more fine-tuning on the behavior, Robert
expressed serious doubts that that will lead to an acceptable result.

Or to put that another way, I'm not sure there's one behavior here
that will please everybody.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#63 Simon Riggs
simon@2ndQuadrant.com
In reply to: Peter Eisentraut (#59)
1 attachment(s)
Re: Turning off HOT/Cleanup sometimes

On 11 March 2015 at 20:55, Peter Eisentraut <peter_e@gmx.net> wrote:

I don't know how to move forward. We could give users a knob: This
might make your queries faster or not -- good luck. But of course
nobody will like that either.

What is clear is that large SELECT queries are doing the work VACUUM
should do. We should not be doing large background tasks (block
cleanup) during long running foreground tasks. But there is no need
for changing behaviour during small SELECTs. So the setting of 4 gives
current behaviour for small SELECTs and new behaviour for larger
SELECTs.

The OP said this...
<op>
We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

Effects of that are that long running statements often run much longer
than we want, increasing bloat as a result. It also produces wildly
varying response times, depending upon extent of cleanup required.
</op>

This is not a performance patch. This is about one user doing the
cleanup work for another. People running large SELECTs should not be
penalised. The patch has been shown to avoid that and no further
discussion should be required.

I don't really care whether we have a parameter for this or not. As
long as we have the main feature.

It's trivial to add/remove a parameter to control this. Currently
there isn't one.

I'd like to commit this.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services

Attachments:

hot_disable.v9.patch (application/octet-stream)
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index c3a8b1d..9e06334 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -276,7 +276,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 	SnapshotData SnapshotDirty;
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = heap_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = heap_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	InitDirtySnapshot(SnapshotDirty);
 
 	nblocks = scan->rs_nblocks; /* # blocks to be scanned */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 457cd70..19675da 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -80,7 +80,8 @@ static HeapScanDesc heap_beginscan_internal(Relation relation,
 						Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap);
+						bool is_bitmapscan, bool temp_snap,
+						bool always_prune);
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
 					TransactionId xid, CommandId cid, int options);
 static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
@@ -354,9 +355,15 @@ heapgetpage(HeapScanDesc scan, BlockNumber page)
 	snapshot = scan->rs_snapshot;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible and
+	 * within limits, if any.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+		scan->rs_always_prune)
+	{
+		if (heap_page_prune_opt(scan->rs_rd, buffer) && !scan->rs_always_prune)
+			scan->rs_prune_count++;
+	}
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -1291,7 +1298,8 @@ heap_openrv_extended(const RangeVar *relation, LOCKMODE lockmode,
  * heap_beginscan_strat offers an extended API that lets the caller control
  * whether a nondefault buffer access strategy can be used, and whether
  * syncscan can be chosen (possibly resulting in the scan not starting from
- * block zero).  Both of these default to TRUE with plain heap_beginscan.
+ * block zero).  Both of these default to TRUE with plain heap_beginscan,
+ * while always_prune defaults to FALSE for regular scans.
  *
  * heap_beginscan_bm is an alternative entry point for setting up a
  * HeapScanDesc for a bitmap heap scan.  Although that scan technology is
@@ -1304,7 +1312,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 			   int nkeys, ScanKey key)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, false);
+								   true, true, false, false, false);
 }
 
 HeapScanDesc
@@ -1314,31 +1322,33 @@ heap_beginscan_catalog(Relation relation, int nkeys, ScanKey key)
 	Snapshot	snapshot = RegisterSnapshot(GetCatalogSnapshot(relid));
 
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   true, true, false, true);
+								   true, true, false, true, true);
 }
 
 HeapScanDesc
 heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync)
+					 bool allow_strat, bool allow_sync, bool always_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   allow_strat, allow_sync, false, false);
+								   allow_strat, allow_sync, false, false,
+								   always_prune);
 }
 
 HeapScanDesc
 heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key)
+				  int nkeys, ScanKey key, bool always_prune)
 {
 	return heap_beginscan_internal(relation, snapshot, nkeys, key,
-								   false, false, true, false);
+								   false, false, true, false, always_prune);
 }
 
 static HeapScanDesc
 heap_beginscan_internal(Relation relation, Snapshot snapshot,
 						int nkeys, ScanKey key,
 						bool allow_strat, bool allow_sync,
-						bool is_bitmapscan, bool temp_snap)
+						bool is_bitmapscan, bool temp_snap,
+						bool always_prune)
 {
 	HeapScanDesc scan;
 
@@ -1365,6 +1375,9 @@ heap_beginscan_internal(Relation relation, Snapshot snapshot,
 	scan->rs_allow_sync = allow_sync;
 	scan->rs_temp_snap = temp_snap;
 
+	scan->rs_prune_count = 0;
+	scan->rs_always_prune = always_prune;
+
 	/*
 	 * we can use page-at-a-time mode if it's an MVCC-safe snapshot
 	 */
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 563e5c3..579070f 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -70,8 +70,10 @@ static void heap_prune_record_unused(PruneState *prstate, OffsetNumber offnum);
  *
  * OldestXmin is the cutoff XID used to distinguish whether tuples are DEAD
  * or RECENTLY_DEAD (see HeapTupleSatisfiesVacuum).
+ *
+ * Returns true if the block was cleaned.
  */
-void
+bool
 heap_page_prune_opt(Relation relation, Buffer buffer)
 {
 	Page		page = BufferGetPage(buffer);
@@ -84,7 +86,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * anyway, so this is no particular loss.
 	 */
 	if (RecoveryInProgress())
-		return;
+		return false;
 
 	/*
 	 * Use the appropriate xmin horizon for this relation. If it's a proper
@@ -108,7 +110,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	 * older than OldestXmin.
 	 */
 	if (!PageIsPrunable(page, OldestXmin))
-		return;
+		return false;
 
 	/*
 	 * We prune when a previous UPDATE failed to find enough space on the page
@@ -130,7 +132,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 	{
 		/* OK, try to get exclusive buffer lock */
 		if (!ConditionalLockBufferForCleanup(buffer))
-			return;
+			return false;
 
 		/*
 		 * Now that we have buffer lock, get accurate information about the
@@ -149,7 +151,11 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
 
 		/* And release buffer lock */
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+		return true;
 	}
+
+	return false;
 }
 
 
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index e6e4d28..3e826b5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -390,7 +390,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = heap_beginscan_strat(heapRelation, snapshot,
 											 nkeys, key,
-											 true, false);
+											 true, false, true);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 2b27e73..5a97d94 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -249,10 +249,41 @@ index_beginscan(Relation heapRelation,
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan. Disable pruning in all cases.
+	 */
+	scan->heapRelation = heapRelation;
+	scan->xs_snapshot = snapshot;
+	scan->xs_always_prune = true;
+	scan->xs_prune_count = 0;
+
+	return scan;
+}
+
+/*
+ * index_beginscan_strat - start a scan of an index with amgettuple
+ * allowing caller to specify additional scan strategies.
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys,
+				bool always_prune)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
 	 * up by RelationGetIndexScan.
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->xs_always_prune = always_prune;
+	scan->xs_prune_count = 0;
 
 	return scan;
 }
@@ -278,6 +309,10 @@ index_beginscan_bitmap(Relation indexRelation,
 	 */
 	scan->xs_snapshot = snapshot;
 
+	/* BitmapHeapScan does pruning if required */
+	scan->xs_always_prune = false;
+	scan->xs_prune_count = 0;
+
 	return scan;
 }
 
@@ -520,10 +555,15 @@ index_fetch_heap(IndexScanDesc scan)
 											 ItemPointerGetBlockNumber(tid));
 
 		/*
-		 * Prune page, but only if we weren't already on this page
+		 * Prune page if enabled, but only if we weren't already on this page
 		 */
-		if (prev_buf != scan->xs_cbuf)
-			heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf);
+		if (prev_buf != scan->xs_cbuf &&
+			(scan->xs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+			 scan->xs_always_prune))
+		{
+			if (heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf) && !scan->xs_always_prune)
+				scan->xs_prune_count++;
+		}
 	}
 
 	/* Obtain share-lock on the buffer so we can examine visibility */
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d8ff554..fa74b0a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2196,7 +2196,8 @@ IndexBuildHeapRangeScan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								allow_sync);	/* syncscan OK? */
+								allow_sync,	/* syncscan OK? */
+								false);		/* pruning optional */
 
 	/* set our scan endpoints */
 	heap_setscanlimits(scan, start_blockno, numblocks);
@@ -2577,7 +2578,8 @@ IndexCheckExclusion(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								true);	/* syncscan OK */
+								true,	/* syncscan OK */
+								true,	/* syncscan OK */
+								false);		/* pruning optional */
 
 	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
 	{
@@ -2851,7 +2853,8 @@ validate_index_heapscan(Relation heapRelation,
 								0,		/* number of keys */
 								NULL,	/* scan key */
 								true,	/* buffer access strategy OK */
-								false); /* syncscan not OK */
+								false, /* syncscan not OK */
+								false);		/* pruning optional */
 
 	/*
 	 * Scan all tuples matching the snapshot.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 8ea8b9f..9dd4c50 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -334,9 +334,15 @@ bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
 	ntup = 0;
 
 	/*
-	 * Prune and repair fragmentation for the whole page, if possible.
+	 * Prune and repair fragmentation for the whole page, if possible
+	 * and within limits, if any.
 	 */
-	heap_page_prune_opt(scan->rs_rd, buffer);
+	if (scan->rs_prune_count < PRUNE_LIMIT_PER_SCAN ||
+		scan->rs_always_prune)
+	{
+		if (heap_page_prune_opt(scan->rs_rd, buffer) && !scan->rs_always_prune)
+			scan->rs_prune_count++;
+	}
 
 	/*
 	 * We must hold share lock on the buffer content while examining tuple
@@ -537,6 +543,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 {
 	BitmapHeapScanState *scanstate;
 	Relation	currentRelation;
+	bool		relistarget;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -597,6 +604,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	currentRelation = ExecOpenScanRelation(estate, node->scan.scanrelid, eflags);
 
 	scanstate->ss.ss_currentRelation = currentRelation;
+	relistarget = ExecRelationIsTargetRelation(estate, node->scan.scanrelid);
 
 	/*
 	 * Even though we aren't going to do a conventional seqscan, it is useful
@@ -605,7 +613,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate->ss.ss_currentScanDesc = heap_beginscan_bm(currentRelation,
 														 estate->es_snapshot,
 														 0,
-														 NULL);
+														 NULL,
+														 relistarget);
 
 	/*
 	 * get the scan type from the relation descriptor.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 06b7c3c..d8045dc 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -523,11 +523,12 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->ioss_ScanDesc = index_beginscan_strat(currentRelation,
 												indexstate->ioss_RelationDesc,
 												estate->es_snapshot,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+											indexstate->ioss_NumOrderByKeys,
+											relistarget);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 48fa919..e2ec902 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -603,11 +603,12 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+	indexstate->iss_ScanDesc = index_beginscan_strat(currentRelation,
 											   indexstate->iss_RelationDesc,
 											   estate->es_snapshot,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+											   indexstate->iss_NumOrderByKeys,
+											   relistarget);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 3cb81fc..eadf0b3 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -124,20 +124,30 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 {
 	Relation	currentRelation;
 	HeapScanDesc currentScanDesc;
+	Oid			currentRelid;
+	bool		relistarget;
+
+	/*
+	 * get the relation object id from the relid'th entry in the range table
+	 */
+	currentRelid = ((SeqScan *) node->ps.plan)->scanrelid;
+	relistarget = ExecRelationIsTargetRelation(estate, currentRelid);
 
 	/*
-	 * get the relation object id from the relid'th entry in the range table,
 	 * open that relation and acquire appropriate lock on it.
 	 */
 	currentRelation = ExecOpenScanRelation(estate,
-									  ((SeqScan *) node->ps.plan)->scanrelid,
+										   currentRelid,
 										   eflags);
 
 	/* initialize a heapscan */
-	currentScanDesc = heap_beginscan(currentRelation,
+	currentScanDesc = heap_beginscan_strat(currentRelation,
 									 estate->es_snapshot,
 									 0,
-									 NULL);
+									 NULL,
+									 true,
+									 true,
+									 relistarget);
 
 	node->ss_currentRelation = currentRelation;
 	node->ss_currentScanDesc = currentScanDesc;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index d86590a..738bd35 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -135,6 +135,10 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
 				int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_strat(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys, bool always_prune);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 888cce7..e907eeb 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -110,9 +110,9 @@ extern HeapScanDesc heap_beginscan_catalog(Relation relation, int nkeys,
 					   ScanKey key);
 extern HeapScanDesc heap_beginscan_strat(Relation relation, Snapshot snapshot,
 					 int nkeys, ScanKey key,
-					 bool allow_strat, bool allow_sync);
+					 bool allow_strat, bool allow_sync, bool always_prune);
 extern HeapScanDesc heap_beginscan_bm(Relation relation, Snapshot snapshot,
-				  int nkeys, ScanKey key);
+				  int nkeys, ScanKey key, bool always_prune);
 extern void heap_setscanlimits(HeapScanDesc scan, BlockNumber startBlk,
 		   BlockNumber endBlk);
 extern void heap_rescan(HeapScanDesc scan, ScanKey key);
@@ -164,7 +164,7 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
 extern void heap_sync(Relation relation);
 
 /* in heap/pruneheap.c */
-extern void heap_page_prune_opt(Relation relation, Buffer buffer);
+extern bool heap_page_prune_opt(Relation relation, Buffer buffer);
 extern int heap_page_prune(Relation relation, Buffer buffer,
 				TransactionId OldestXmin,
 				bool report_stats, TransactionId *latestRemovedXid);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 9bb6362..a745876 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -20,6 +20,7 @@
 #include "access/itup.h"
 #include "access/tupdesc.h"
 
+#define PRUNE_LIMIT_PER_SCAN	4
 
 typedef struct HeapScanDescData
 {
@@ -34,6 +35,9 @@ typedef struct HeapScanDescData
 	bool		rs_allow_sync;	/* allow or disallow use of syncscan */
 	bool		rs_temp_snap;	/* unregister snapshot at scan end? */
 
+	bool		rs_always_prune;/* Unlimited pruning allowed */
+	int			rs_prune_count;	/* Number of blocks pruned during this scan */
+
 	/* state set up at initscan time */
 	BlockNumber rs_nblocks;		/* total number of blocks in rel */
 	BlockNumber rs_startblock;	/* block # to start at */
@@ -72,6 +76,9 @@ typedef struct IndexScanDescData
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
 
+	bool		xs_always_prune;/* Unlimited pruning allowed */
+	int			xs_prune_count;	/* Number of blocks pruned during this scan */
+
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
 	bool		ignore_killed_tuples;	/* do not return killed entries */
#64David Steele
david@pgmasters.net
In reply to: Simon Riggs (#63)
Re: Turning off HOT/Cleanup sometimes

On 4/14/15 6:07 PM, Simon Riggs wrote:

On 11 March 2015 at 20:55, Peter Eisentraut <peter_e@gmx.net> wrote:

I don't know how to move forward. We could give users a knob: This
might make your queries faster or not -- good luck. But of course
nobody will like that either.

What is clear is that large SELECT queries are doing the work VACUUM
should do. We should not be doing large background tasks (block
cleanup) during long running foreground tasks. But there is no need
for changing behaviour during small SELECTs. So the setting of 4 gives
current behaviour for small SELECTs and new behaviour for larger
SELECTs.

The OP said this...
<op>
We also make SELECT clean up blocks as it goes. That is useful in OLTP
workloads, but it means that large SQL queries and pg_dump effectively
do much the same work as VACUUM, generating huge amounts of I/O and
WAL on the master, the cost and annoyance of which is experienced
directly by the user. That is avoided on standbys.

Effects of that are that long running statements often run much longer
than we want, increasing bloat as a result. It also produces wildly
varying response times, depending upon extent of cleanup required.
</op>

This is not a performance patch. This is about one user doing the
cleanup work for another. People running large SELECTs should not be
penalised. The patch has been shown to avoid that and no further
discussion should be required.

I don't really care whether we have a parameter for this or not. As
long as we have the main feature.

It's trivial to add/remove a parameter to control this. Currently
there isn't one.

I'd like to commit this.

+1 from me. One of the last databases I worked on had big raw
partitions that were written to and then sequentially scanned exactly
once before being dropped. It was painful to see all those writes
happening for nothing.

In other cases there were sequential scans that happened directly after
the main writes, but then the next read might be days in the future (if
ever) and the system was basically idle for a while which would have
allowed vacuum to come in and do the job without affecting performance
of the main job.

I think that in batch-oriented databases this patch will definitely be a
boon to performance.

--
- David Steele
david@pgmasters.net

#65Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#63)
Re: Turning off HOT/Cleanup sometimes

On Tue, Apr 14, 2015 at 6:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 11 March 2015 at 20:55, Peter Eisentraut <peter_e@gmx.net> wrote:

I don't know how to move forward. We could give users a knob: This
might make your queries faster or not -- good luck. But of course
nobody will like that either.

What is clear is that large SELECT queries are doing the work VACUUM
should do. We should not be doing large background tasks (block
cleanup) during long running foreground tasks. But there is no need
for changing behaviour during small SELECTs. So the setting of 4 gives
current behaviour for small SELECTs and new behaviour for larger
SELECTs.

Peter commented previously that README.HOT should get an update. The
relevant section seems to be "When can/should we prune or
defragment?".

I wonder if it would be a useful heuristic to still prune pages if
those pages are already dirty.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#66Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#65)
Re: Turning off HOT/Cleanup sometimes

On 14 April 2015 at 21:53, Robert Haas <robertmhaas@gmail.com> wrote:

Peter commented previously that README.HOT should get an update. The
relevant section seems to be "When can/should we prune or
defragment?".

That's easy enough to change once we agree to commit.

I wonder if it would be a useful heuristic to still prune pages if
those pages are already dirty.

Useful for who? This is about responsibility. Why should someone
performing a large SELECT take the responsibility for cleaning pages?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services

#67Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#66)
Re: Turning off HOT/Cleanup sometimes

On Wed, Apr 15, 2015 at 3:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 14 April 2015 at 21:53, Robert Haas <robertmhaas@gmail.com> wrote:

Peter commented previously that README.HOT should get an update. The
relevant section seems to be "When can/should we prune or
defragment?".

That's easy enough to change once we agree to commit.

I wonder if it would be a useful heuristic to still prune pages if
those pages are already dirty.

Useful for who? This is about responsibility. Why should someone
performing a large SELECT take the responsibility for cleaning pages?

Because it makes subsequent accesses to the page cheaper. Of
course, that applies in all cases, but when the page is already dirty,
the cost of pruning it is probably quite small - we're going to have
to write the page anyway, and pruning it before it gets evicted
(perhaps even by our scan) will be cheaper than writing it now and
writing it again after it's pruned. When the page is clean, the cost
of pruning is significantly higher.

I won't take responsibility for paying my neighbor's tax bill, but I
might take responsibility for picking up his mail while he's on
holiday.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#68Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#67)
Re: Turning off HOT/Cleanup sometimes

On 15 April 2015 at 08:04, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 15, 2015 at 3:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 14 April 2015 at 21:53, Robert Haas <robertmhaas@gmail.com> wrote:

Peter commented previously that README.HOT should get an update. The
relevant section seems to be "When can/should we prune or
defragment?".

That's easy enough to change once we agree to commit.

I wonder if it would be a useful heuristic to still prune pages if
those pages are already dirty.

Useful for who? This is about responsibility. Why should someone
performing a large SELECT take the responsibility for cleaning pages?

Because it makes subsequent accesses to the page cheaper.

Cheaper for whom?

Of
course, that applies in all cases, but when the page is already dirty,
the cost of pruning it is probably quite small - we're going to have
to write the page anyway, and pruning it before it gets evicted
(perhaps even by our scan) will be cheaper than writing it now and
writing it again after it's pruned. When the page is clean, the cost
of pruning is significantly higher.

"We" aren't going to have to write the page, but someone will.

In a single workload, the mix of actions can be useful. In separate
workloads, where some guy just wants to run a report or a backup, it's
not right that we slow them down because of someone else's actions.

I won't take responsibility for paying my neighbor's tax bill, but I
might take responsibility for picking up his mail while he's on
holiday.

That makes it sound like this is an occasional, non-annoying thing.

It's more like, whoever fetches the mail needs to fetch it for
everybody. So we are slowing down one person disproportionately, while
others fly through without penalty. There is no argument that one
workload necessarily needs to perform that on behalf of the other
workload.

The actions you suggest are reasonable and should ideally be the role
of a background process. But that doesn't mean in the absence of that
we should pay the cost in the foreground.

Let me apply this patch.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services

#69Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#68)
Re: Turning off HOT/Cleanup sometimes

On 2015-04-15 08:42:33 -0400, Simon Riggs wrote:

Because it makes subsequent accesses to the page cheaper.

Cheaper for whom?

Everyone, including further readers. Following HOT chains in read-mostly
workloads can be really expensive. If you have a 'hot' value range
that's frequently updated, but that range moves, you can easily end up
with heavily chained tuples which won't be touched by a writer again any
time soon.

And writers will often not yet be able to prune the page because there's
still live readers for the older versions (like other updaters).

Of
course, that applies in all cases, but when the page is already dirty,
the cost of pruning it is probably quite small - we're going to have
to write the page anyway, and pruning it before it gets evicted
(perhaps even by our scan) will be cheaper than writing it now and
writing it again after it's pruned. When the page is clean, the cost
of pruning is significantly higher.

"We" aren't going to have to write the page, but someone will.

If it's already dirty, that doesn't change at all. *Not* pruning in that
moment actually will often *increase* the total amount of writes to the
OS. Because now the pruning will happen on the next write access or
vacuum - when the page already might have been undirtied.

I don't really see the downside to this suggestion.

The actions you suggest are reasonable and should ideally be the role
of a background process. But that doesn't mean in the absence of that
we should pay the cost in the foreground.

I'm not sure that's true. A background process will either cause
additional read IO to find worthwhile pages, or it'll not find
worthwhile pages because they're already paged out.

Greetings,

Andres Freund

#70Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#69)
Re: Turning off HOT/Cleanup sometimes

On 15 April 2015 at 09:10, Andres Freund <andres@anarazel.de> wrote:

On 2015-04-15 08:42:33 -0400, Simon Riggs wrote:

Because it makes subsequent accesses to the page cheaper.

Cheaper for whom?

Everyone.

I think what you mean is "Everyone else". It is demonstrably quicker
and more consistent for a process to limit the amount of pruning it
does, and pruning also generates additional WAL traffic, causing
replication lag.

I love it when someone cleans up for me. I just don't think they'll
accept the argument that they should clean up for me because it makes
their life easier. Certainly doesn't work with my kids.

I don't really see the downside to this suggestion.

The suggestion makes things better than they are now but is still less
than I have proposed.

If what you both mean is "IMHO this is an acceptable compromise", I
can accept it also, at this point in the CF.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services

#71Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#70)
Re: Turning off HOT/Cleanup sometimes

Simon Riggs wrote:

On 15 April 2015 at 09:10, Andres Freund <andres@anarazel.de> wrote:

I don't really see the downside to this suggestion.

The suggestion makes things better than they are now but is still less
than I have proposed.

If what you both mean is "IMHO this is an acceptable compromise", I
can accept it also, at this point in the CF.

Let me see if I understand things.

What we have now is: when reading a page, we also HOT-clean it. This
runs HOT-cleanup a large number of times, and causes many pages to
become dirty.

Your patch is "when reading a page, HOT-clean it, but only 5 times in
each scan". This runs HOT-cleanup at most 5 times, and causes at most 5
pages to become dirty.

Robert's proposal is "when reading a page, if dirty HOT-clean it; if not
dirty, also HOT-clean it but only 5 times in each scan". This runs
HOT-cleanup some number of times (as many as there are dirty), and
causes at most 5 pages to become dirty.

Am I right in thinking that HOT-clean in a dirty page is something that
runs completely within CPU cache? If so, it would be damn fast and
would have benefits for future readers, for very little cost.

Dirtying a page is very different; if buffer reads are common, the
system is later bogged down trying to find clean pages to read uncached
buffers (including the read-only scan itself, so it becomes slower.)

If I have understood things correctly, then I stand behind Robert's
suggestion.
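If I'm reading it right, the rule could be sketched like so (the names
and the per-scan budget of 5 are illustrative, not taken from the actual
patch):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of Robert's rule (hypothetical names; the real patch's data
 * structures differ): always prune a page that is already dirty, but
 * dirty at most a fixed number of clean pages per scan.
 */
#define MAX_CLEAN_PRUNES_PER_SCAN 5     /* illustrative budget */

typedef struct ScanPruneState
{
    int clean_pages_dirtied;    /* clean pages this scan has pruned */
} ScanPruneState;

static bool
scan_should_prune(ScanPruneState *state, bool page_is_dirty)
{
    if (page_is_dirty)
        return true;            /* no extra data-file write incurred */

    if (state->clean_pages_dirtied < MAX_CLEAN_PRUNES_PER_SCAN)
    {
        state->clean_pages_dirtied++;
        return true;            /* spend one unit of the clean-page budget */
    }

    return false;               /* budget exhausted: leave the page clean */
}
```

So a scan prunes every dirty page it visits, but dirties at most 5 clean
pages, which matches "runs HOT-cleanup some number of times (as many as
there are dirty), and causes at most 5 pages to become dirty."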

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#72Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alvaro Herrera (#71)
Re: Turning off HOT/Cleanup sometimes

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:

Simon Riggs wrote:

On 15 April 2015 at 09:10, Andres Freund <andres@anarazel.de> wrote:

I don't really see the downside to this suggestion.

The suggestion makes things better than they are now but is still less
than I have proposed.

If what you both mean is "IMHO this is an acceptable compromise", I
can accept it also, at this point in the CF.

Let me see if I understand things.

What we have now is: when reading a page, we also HOT-clean it. This
runs HOT-cleanup a large number of times, and causes many pages to
become dirty.

Your patch is "when reading a page, HOT-clean it, but only 5 times in
each scan". This runs HOT-cleanup at most 5 times, and causes at most 5
pages to become dirty.

Robert's proposal is "when reading a page, if dirty HOT-clean it; if not
dirty, also HOT-clean it but only 5 times in each scan". This runs
HOT-cleanup some number of times (as many as there are dirty), and
causes at most 5 pages to become dirty.

Am I right in thinking that HOT-clean in a dirty page is something that
runs completely within CPU cache? If so, it would be damn fast and
would have benefits for future readers, for very little cost.

If there are many tuples on the page, it takes some CPU effort to scan
all the HOT chains and move tuples around. Also, it creates a WAL
record, which isn't free.

Another question is whether the patch can reliably detect whether it's
doing a "read-only" scan or not. I haven't tested, but I suspect it'd
not do pruning when you do something like "INSERT INTO foo SELECT * FROM
foo WHERE blah". I.e. when the target relation is referenced twice in
the same statement: once as the target, and second time as a source.
Maybe that's OK, though.

- Heikki


#73Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#72)
Re: Turning off HOT/Cleanup sometimes

Heikki Linnakangas wrote:

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:

Robert's proposal is "when reading a page, if dirty HOT-clean it; if not
dirty, also HOT-clean it but only 5 times in each scan". This runs
HOT-cleanup some number of times (as many as there are dirty), and
causes at most 5 pages to become dirty.

Am I right in thinking that HOT-clean in a dirty page is something that
runs completely within CPU cache? If so, it would be damn fast and
would have benefits for future readers, for very little cost.

If there are many tuples on the page, it takes some CPU effort to scan all
the HOT chains and move tuples around. Also, it creates a WAL record, which
isn't free.

But if the page is in CPU cache, the CPU effort shouldn't be all that
noticeable, should it? That's my point, but then maybe I'm wrong. Now,
the WAL logging is annoying, so let's limit that too -- do it at most
for, say, 20 dirty pages and at most 5 clean pages.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#74Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Alvaro Herrera (#73)
Re: Turning off HOT/Cleanup sometimes

On 04/15/2015 07:11 PM, Alvaro Herrera wrote:

Heikki Linnakangas wrote:

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:

Robert's proposal is "when reading a page, if dirty HOT-clean it; if not
dirty, also HOT-clean it but only 5 times in each scan". This runs
HOT-cleanup some number of times (as many as there are dirty), and
causes at most 5 pages to become dirty.

Am I right in thinking that HOT-clean in a dirty page is something that
runs completely within CPU cache? If so, it would be damn fast and
would have benefits for future readers, for very little cost.

If there are many tuples on the page, it takes some CPU effort to scan all
the HOT chains and move tuples around. Also, it creates a WAL record, which
isn't free.

But if the page is in CPU cache, the CPU effort shouldn't be all that
noticeable, should it? That's my point, but then maybe I'm wrong. Now,
the WAL logging is annoying, so let's limit that too -- do it at most
for, say, 20 dirty pages and at most 5 clean pages.

There isn't much difference between that and just doing it on first 5
pages. Both of those numbers were pulled out of thin air, anyway. I'd
rather just keep it simple.

- Heikki


#75Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#68)
Re: Turning off HOT/Cleanup sometimes

On Wed, Apr 15, 2015 at 8:42 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

I won't take responsibility for paying my neighbor's tax bill, but I
might take responsibility for picking up his mail while he's on
holiday.

That makes it sound like this is an occasional, non-annoying thing.

It's more like, whoever fetches the mail needs to fetch it for
everybody. So we are slowing down one person disproportionately, while
others fly through without penalty. There is no argument that one
workload necessarily needs to perform that on behalf of the other
workload.

Sure there is. It's called a tragedy of the commons - everybody acts
in their own selfish interest (it's not *my* responsibility to limit
grazing on public land, or prune this page that I'm not modifying) and
as a result some resource that everybody cares about (grass,
system-wide I/O) gets trashed to everyone's detriment. Purely selfish
behavior can only be justified here if we assume that the selfish
actor intends to participate in the system only once: I'm going to run
one big reporting query which must run as fast as possible, and then
I'm getting on a space ship to Mars. So if my refusal to do any
pruning during that reporting query causes lots of extra I/O on the
system ten minutes from now, I don't care, because I'll have left the
playing field forever at that point.

As Heikki points out, any HOT pruning operation generates WAL and has
a CPU cost. However, pruning a page that is currently dirty
*decreases* the total volume of writes to the data files, whereas
pruning a page that is currently clean *increases* the total volume of
writes to the data files. In the first case, if we prune the page
right now while it's still dirty, we can't possibly cause any
additional data-file writes, and we may save one, because it's
possible that someone else would later have pruned it when it was
clean and there was no other reason to dirty it. In the second case,
if we prune the page that is currently clean, it will become dirty.
That will cost us no additional I/O if the page is again modified
before it's written out, but otherwise it costs an additional data
file write. I think there's a big difference between those two cases.
Sure, from the narrow point of view of how much work it takes this
scan to process this page, it's always better not to prune. But if
you make the more realistic assumption that you will keep on issuing
queries on the system, then what you're doing to the overall system
I/O load is pretty important.
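To make that accounting concrete, here's a toy sketch (purely
illustrative; the real buffer manager is far more involved):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the argument above: the number of *extra* data-file
 * writes caused by pruning a page now rather than leaving it alone.
 */
static int
extra_writes_from_pruning_now(bool page_is_dirty,
                              bool modified_again_before_writeout)
{
    if (page_is_dirty)
        return 0;   /* page will be written anyway; the prune rides along,
                     * and may even save someone a later clean-page prune */

    if (modified_again_before_writeout)
        return 0;   /* our dirtying merges with the later modification */

    return 1;       /* we turned a clean page into one extra write */
}
```

The asymmetry is visible immediately: pruning a dirty page is free in
this model, while pruning a clean page costs a data-file write unless
the page happens to be modified again before write-out.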

By the way, was anything ever done about this:

/messages/by-id/20140929091343.GA4716@alap3.anarazel.de

That's just a workload that is 5/6th pgbench -S and 1/6th pgbench,
which is in no way an unrealistic workload, and showed a significant
regression with an earlier version of the patch. You seem very eager
to commit this patch after four months of inactivity, but I think this
is a pretty massive behavior change that deserves careful scrutiny
before it goes in. If we push something that changes longstanding
behavior and can't even be turned off, and it regresses behavior for a
use case that common, our users are going to come after us with
pitchforks. That's not to say some people won't be happy, but in my
experience it takes a lot of happy users to make up for getting
stabbed with even one pitchfork.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#76Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#75)
Re: Turning off HOT/Cleanup sometimes

On 15 April 2015 at 12:39, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 15, 2015 at 8:42 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

I won't take responsibility for paying my neighbor's tax bill, but I
might take responsibility for picking up his mail while he's on
holiday.

That makes it sound like this is an occasional, non-annoying thing.

It's more like, whoever fetches the mail needs to fetch it for
everybody. So we are slowing down one person disproportionately, while
others fly through without penalty. There is no argument that one
workload necessarily needs to perform that on behalf of the other
workload.

Sure there is. It's called a tragedy of the commons - everybody acts
in their own selfish interest (it's not *my* responsibility to limit
grazing on public land, or prune this page that I'm not modifying) and
as a result some resource that everybody cares about (grass,
system-wide I/O) gets trashed to everyone's detriment. Purely selfish
behavior can only be justified here if we assume that the selfish
actor intends to participate in the system only once: I'm going to run
one big reporting query which must run as fast as possible, and then
I'm getting on a space ship to Mars. So if my refusal to do any
pruning during that reporting query causes lots of extra I/O on the
system ten minutes from now, I don't care, because I'll have left the
playing field forever at that point.

It all depends upon who is being selfish. Why is a user "selfish" for
not wanting to clean every single block they scan, when the people
that made the mess do nothing and go faster 10 minutes from now?
Randomly and massively penalising large SELECTs makes no sense. Some
cleanup is OK, with reasonable limits, which is why that is proposed.

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:

Robert's proposal is "when reading a page, if dirty HOT-clean it; if not
dirty, also HOT-clean it but only 5 times in each scan". This runs
HOT-cleanup some number of times (as many as there are dirty), and
causes at most 5 pages to become dirty.

My understanding of Robert's proposal was "when reading a page,
HOT-clean it, but only do this up to 5 times on clean pages, and
continue to do it indefinitely when the page is already dirty".
Andres said that was the only way and I have agreed to it.

Are you now saying not to commit your proposal at all?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services


#77Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#76)
Re: Turning off HOT/Cleanup sometimes

Simon Riggs wrote:

On 15 April 2015 at 12:39, Robert Haas <robertmhaas@gmail.com> wrote:

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:

Robert's proposal is "when reading a page, if dirty HOT-clean it; if not
dirty, also HOT-clean it but only 5 times in each scan". This runs
HOT-cleanup some number of times (as many as there are dirty), and
causes at most 5 pages to become dirty.

My understanding of Robert's proposal was "when reading a page,
HOT-clean it, but only do this up to 5 times on clean pages, and
continue to do it indefinitely when the page is already dirty".

To me, both statements look identical.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#78Simon Riggs
simon@2ndQuadrant.com
In reply to: Alvaro Herrera (#77)
Re: Turning off HOT/Cleanup sometimes

On 15 April 2015 at 16:01, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Simon Riggs wrote:

On 15 April 2015 at 12:39, Robert Haas <robertmhaas@gmail.com> wrote:

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:

Robert's proposal is "when reading a page, if dirty HOT-clean it; if not
dirty, also HOT-clean it but only 5 times in each scan". This runs
HOT-cleanup some number of times (as many as there are dirty), and
causes at most 5 pages to become dirty.

My understanding of Robert's proposal was "when reading a page,
HOT-clean it, but only do this up to 5 times on clean pages, and
continue to do it indefinitely when the page is already dirty".

To me, both statements look identical.

I didn't read it that way, apologies for any confusion. But that is good news.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services


#79Peter Geoghegan
pg@heroku.com
In reply to: Andres Freund (#69)
Re: Turning off HOT/Cleanup sometimes

On Wed, Apr 15, 2015 at 6:10 AM, Andres Freund <andres@anarazel.de> wrote:

Of course, that applies in all cases, but when the page is already dirty,
the cost of pruning it is probably quite small - we're going to have
to write the page anyway, and pruning it before it gets evicted
(perhaps even by our scan) will be cheaper than writing it now and
writing it again after it's pruned. When the page is clean, the cost
of pruning is significantly higher.

"We" aren't going to have to write the page, but someone will.

If it's already dirty that doesn't change at all. *Not* pruning in that
moment actually will often *increase* the total amount of writes to the
OS. Because now the pruning will happen on the next write access or
vacuum - when the page already might have been undirtied.

I don't really see the downside to this suggestion.

+1. I think, in general, the opportunity cost of not pruning when a
page is already dirty is likely to be rather high, so pruning there is
likely to be worth it.

--
Peter Geoghegan


#80Greg Stark
stark@mit.edu
In reply to: Simon Riggs (#76)
Re: Turning off HOT/Cleanup sometimes

On 15 Apr 2015 15:43, "Simon Riggs" <simon@2ndquadrant.com> wrote:

It all depends upon who is being selfish. Why is a user "selfish" for
not wanting to clean every single block they scan, when the people
that made the mess do nothing and go faster 10 minutes from now?
Randomly and massively penalising large SELECTs makes no sense. Some
cleanup is OK, with reasonable limits, which is why that is proposed.

I don't think it's productive to think of a query as a different actor with
only an interest in its own performance and no interest in overall system
performance.

From a holistic point of view the question is how many times a given HOT
chain is going to need to be followed before it's pruned. Or to put it
another way, how expensive is creating a HOT chain? Does it cause a single
prune? A fixed number of chain readers followed by a prune? Does the amount
of work depend on the workload or is it consistent?

My intuition is that a fixed cutoff like "five pages" is dangerous because
if you update many pages there's no limit to the number of times they'll be
read before they're all pruned. The steady state could easily be that every
query is having to read hot chains forever.

My intuition, again, is that what we need is a percentage such as "do 10
prunes then ignore the next 1000 clean pages with HOT chains". That
guarantees that after 100 selects the HOT chains will all be pruned but
each select will only prune 1% of the clean pages it sees.
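In code, that duty cycle might look something like this (a sketch only;
the 10/1000 constants are just the examples above, everything else is
hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sketch of a duty-cycle limiter for pruning clean pages: prune a
 * burst of them, then skip a long run, so repeated scans eventually
 * prune everything while each individual scan only dirties about 1%
 * of the clean pages it visits.
 */
#define PRUNE_BURST 10      /* clean-page prunes per burst */
#define SKIP_RUN    1000    /* clean pages to pass over between bursts */

typedef struct DutyCycle
{
    int pruned_in_burst;    /* prunes done in the current burst */
    int skipped_in_run;     /* clean pages skipped since the burst ended */
} DutyCycle;

/* Called for each clean page with a prunable HOT chain. */
static bool
duty_cycle_prune(DutyCycle *dc)
{
    if (dc->pruned_in_burst < PRUNE_BURST)
    {
        dc->pruned_in_burst++;
        return true;
    }

    if (++dc->skipped_in_run >= SKIP_RUN)
    {
        /* run exhausted: this page starts a new burst */
        dc->pruned_in_burst = 1;
        dc->skipped_in_run = 0;
        return true;
    }

    return false;
}
```

Unlike a flat "first 5 pages" cutoff, the limiter keeps firing
periodically for the whole length of a large scan.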

#81Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Greg Stark (#80)
Re: Turning off HOT/Cleanup sometimes

On Thu, Apr 16, 2015 at 2:47 PM, Greg Stark <stark@mit.edu> wrote:

On 15 Apr 2015 15:43, "Simon Riggs" <simon@2ndquadrant.com> wrote:

It all depends upon who is being selfish. Why is a user "selfish" for
not wanting to clean every single block they scan, when the people
that made the mess do nothing and go faster 10 minutes from now?
Randomly and massively penalising large SELECTs makes no sense. Some
cleanup is OK, with reasonable limits, which is why that is proposed.

I don't think it's productive to think of a query as a different actor
with only an interest in its own performance and no interest in overall
system performance.

From a holistic point of view the question is how many times a given
HOT chain is going to need to be followed before it's pruned. Or to put it
another way, how expensive is creating a HOT chain? Does it cause a single
prune? A fixed number of chain readers followed by a prune? Does the amount
of work depend on the workload or is it consistent?

IMO the size or traversal of the HOT chain is not that expensive compared
to the cost of either pruning too frequently, which generates WAL as well
as makes buffers dirty. OTOH cost of less frequent pruning could also be
very high. It can cause severe table bloat which may just stay for a very
long time. Even if dead space is recovered within a page, truncating a
bloated heap is not always possible. In such cases, even SELECTs would be
slowed down just because they need to read/scan far more pages than they
otherwise would have. IOW its probably wrong to assume that not-pruning
quickly enough will have impact only on the non-SELECT queries.

I also concur with arguments upthread that this change needs to be
carefully calibrated because it can lead to significant degradation for
certain workloads.

My intuition, again, is that what we need is a percentage such as "do 10
prunes then ignore the next 1000 clean pages with HOT chains". That
guarantees that after 100 selects the HOT chains will all be pruned but
each select will only prune 1% of the clean pages it sees.

I think some such proposal was made in the past. There could be a knob to
control how much a read-only query (or may be a read-only transaction)
should do HOT cleanup, say as a percentage of pages it looks at. The
default can be left at 100% in the first release so that the current
behaviour is not suddenly disrupted. But it will allow others to play with
the percentages and then based on field reports, we can change defaults in
the next releases.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

#82Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Pavan Deolasee (#81)
Re: Turning off HOT/Cleanup sometimes

Pavan Deolasee wrote:

On Thu, Apr 16, 2015 at 2:47 PM, Greg Stark <stark@mit.edu> wrote:

From a holistic point of view the question is how many times is a given
hit chain going to need to be followed before it's pruned. Or to put it
another way, how expensive is creating a hot chain. Does it cause a single
prune? a fixed number of chain readers followed by a prune? Does the amount
of work depend on the workload or is it consistent?

IMO the size or traversal of the HOT chain is not that expensive compared
to the cost of either pruning too frequently, which generates WAL as well
as makes buffers dirty. OTOH cost of less frequent pruning could also be
very high. It can cause severe table bloat which may just stay for a very
long time. Even if dead space is recovered within a page, truncating a
bloated heap is not always possible.

I think you're failing to consider that in the patch there is a
distinction between read-only page accesses and page updates. During a
page update, HOT cleanup is always done even with the patch, so there
won't be any additional bloat that would not be there without the patch.
It's only the read-only accesses to the page that skip the HOT pruning.

Of course, as Greg says there will be some additional scans of the HOT
chain by read-only processes.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#83Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Alvaro Herrera (#82)
Re: Turning off HOT/Cleanup sometimes

On Thu, Apr 16, 2015 at 6:50 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Pavan Deolasee wrote:

On Thu, Apr 16, 2015 at 2:47 PM, Greg Stark <stark@mit.edu> wrote:

From a holistic point of view the question is how many times a given
HOT chain is going to need to be followed before it's pruned. Or to put it
another way, how expensive is creating a HOT chain? Does it cause a single
prune? A fixed number of chain readers followed by a prune? Does the amount
of work depend on the workload or is it consistent?

IMO the size or traversal of the HOT chain is not that expensive compared
to the cost of either pruning too frequently, which generates WAL as well
as makes buffers dirty. OTOH cost of less frequent pruning could also be
very high. It can cause severe table bloat which may just stay for a very
long time. Even if dead space is recovered within a page, truncating a
bloated heap is not always possible.

I think you're failing to consider that in the patch there is a
distinction between read-only page accesses and page updates. During a
page update, HOT cleanup is always done even with the patch, so there
won't be any additional bloat that would not be there without the patch.
It's only the read-only accesses to the page that skip the HOT pruning.

Ah, Ok. I'd not read the patch. But now that I do, I feel much more
comfortable with the change. In fact, I wonder if its just enough to either
do full HOT prune for target relations and not at all for all other
relations involved in the query. My apologies if this is done based on
discussions upthread. I haven't read the entire thread yet.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

#84Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#82)
Re: Turning off HOT/Cleanup sometimes

On 2015-04-16 10:20:20 -0300, Alvaro Herrera wrote:

I think you're failing to consider that in the patch there is a
distinction between read-only page accesses and page updates. During a
page update, HOT cleanup is always done even with the patch, so there
won't be any additional bloat that would not be there without the
patch.

That's not really true (and my benchmark upthread proves it). The fact
that hot pruning only happens when we can get a cleanup lock means that
we can end up with more pages that are full, if we prune on select less
often. Especially if SELECTs are more frequent than write accesses -
pretty darn common - the likelihood of SELECTs getting the lock is
correspondingly higher.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#85Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Andres Freund (#84)
Re: Turning off HOT/Cleanup sometimes

On 16 April 2015 at 15:21, Andres Freund <andres@anarazel.de> wrote:

On 2015-04-16 10:20:20 -0300, Alvaro Herrera wrote:

I think you're failing to consider that in the patch there is a
distinction between read-only page accesses and page updates. During a
page update, HOT cleanup is always done even with the patch, so there
won't be any additional bloat that would not be there without the
patch.

That's not really true (and my benchmark upthread proves it). The fact
that hot pruning only happens when we can get a cleanup lock means that
we can end up with more pages that are full, if we prune on select less
often. Especially if SELECTs are more frequent than write accesses -
pretty darn common - the likelihood of SELECTs getting the lock is
correspondingly higher.

Your point that we *must* do *some* HOT cleanup on SELECTs is proven beyond
question. Alvaro has not disputed that, ISTM you misread that. Pavan has
questioned that point but the results upthread are there, he explains he
hasn't read that yet.

The only question is "how much cleanup on SELECT"? Having one SELECT hit
10,000 cleanups while another hits 0 creates an unfairness and
unpredictability in the way we work. Maybe some people running a backup
actually like the fact it cleans the database; others think that is a bad
thing. Few people issuing large queries think it is good behaviour. Anybody
running replication also knows that this causes a huge slam of WAL which
can increase replication delay, which is a concern for HA.

That is how we arrive at the idea of a cleanup limit, further enhanced by a
limit that applies only to dirtying clean blocks, for which we have 4(?)
recent votes in favour.

I would personally be in favour of a parameter to control the limit, since
whatever we choose is right/wrong depending upon circumstances. I am however
comfortable with not having a parameter if people think it would be hard to
tune, which I agree it would be, hence no parameter in the patch.

#86Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#84)
Re: Turning off HOT/Cleanup sometimes

Andres Freund wrote:

On 2015-04-16 10:20:20 -0300, Alvaro Herrera wrote:

I think you're failing to consider that in the patch there is a
distinction between read-only page accesses and page updates. During a
page update, HOT cleanup is always done even with the patch, so there
won't be any additional bloat that would not be there without the
patch.

That's not really true (and my benchmark upthread proves it). The fact
that hot pruning only happens when we can get a cleanup lock means that
we can end up with more pages that are full, if we prune on select less
often. Especially if SELECTs are more frequent than write accesses -
pretty darn common - the likelihood of SELECTs getting the lock is
correspondingly higher.

Interesting point. Of course, this code should count HOT cleanups
against the total limit when they are effectively carried out, and
ignore those that are skipped because of inability to acquire the
cleanup lock. Not sure whether the submitted code does that.

Can we keep stats on how many pages we don't clean in the updating
process due to failure to acquire cleanup lock? My intuition says that
it should be similar to the number of backends running concurrently,
but that might be wrong.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#87Jeff Janes
jeff.janes@gmail.com
In reply to: Andres Freund (#48)
Re: Turning off HOT/Cleanup sometimes

On Mon, Sep 29, 2014 at 2:13 AM, Andres Freund <andres@anarazel.de> wrote:

On 2014-09-28 19:51:36 +0100, Simon Riggs wrote:

On 27 September 2014 09:29, Andres Freund <andres@anarazel.de> wrote:

On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote:

This patch has gotten a fair amount of review, and has been

rewritten once

during the commitfest. I think it's pretty close to being

committable, the

only remaining question seems to be what to do with system catalogs.

I'm

marking this as "Returned with feedback", I take it that Simon can

proceed

from here, outside the commitfest.

FWIW, I don't think it is, even with that. As is it seems very likely
that it's going to regress a fair share of workloads. At the very

least

it needs a fair amount of benchmarking beforehand.

There is some doubt there. We've not seen a workload that does
actually exhibit a negative behaviour.

Neither is there much data about the magnitude of positive effect the
patch has...

I'm not saying one doesn't exist, but it does matter how common/likely
it is. If anyone can present a performance test case that demonstrates
a regression, I think it will make it easier to discuss how wide that
case is and what we should do about it. Discussing whether to do
various kinds of limited pruning are moot until that is clear.

I doubt it'll be hard to construct a case where it'll show. My first try
of using a pgbench scale 100, -M prepared, -cj8 with a custom file with
1 write and 5 read transaction yielded the following on my laptop:

Baseline:
relname | pgbench_tellers
pg_total_relation_size | 458752
relname | pgbench_accounts
pg_total_relation_size | 1590337536
relname | pgbench_branches
pg_total_relation_size | 286720
relname | pgbench_history
pg_total_relation_size | 49979392
Patched:
relname | pgbench_tellers
pg_total_relation_size | 516096
relname | pgbench_accounts
pg_total_relation_size | 1590337536
relname | pgbench_branches
pg_total_relation_size | 360448
relname | pgbench_history
pg_total_relation_size | 49528832

So, there's a noticeable increase in size. Mostly on the smaller tables,
so probably HOT cleanup was sometimes skipped during UPDATEs due to
locks.

Baseline was:
tps = 9655.486532 (excluding connections establishing)
Patched was:
tps = 9466.158701 (including connections establishing)

Was this reproducible? I've run your custom sql file with 4 clients (that
is how many CPUs I have) on a machine
with a BBU. I had wal_level = hot_standby, but the archive_command just
returned true without archiving anything. And using the latest patch.

The size of the pgbench_tellers and pgbench_branches relations were
surprisingly variable in both patched and unpatched, but there was no
reliable difference between them, just within them.

On the TPS front, there was a hint that patched one was slightly slower but
the within sample variation was also high, and the p-val for difference was
only 0.214 on n of 66.

test case attached.

That's not an unrealistic testcase.

I'm pretty sure this could be made quite a bit more pronounced by not
using a uniform distribution in the pgbench runs. And selecting a test
that's more vulnerable to the change (e.g. using a wider distribution
for the read-only statements than the modifying ones) would make the
CPU overhead of the additional heap_hot_search_buffer() calls heavier.

Sorry I don't understand this description. Why would queries selecting
data that is not changing have any extra overhead?

Is the idea that the hot part of the table for updates would move around
over time, but the hot part for selects would be even throughout? I'm not
sure how to put that to the test.

Cheers,

Jeff

#88Andres Freund
andres@anarazel.de
In reply to: Jeff Janes (#87)
Re: Turning off HOT/Cleanup sometimes

On 2015-04-20 01:04:18 -0700, Jeff Janes wrote:

Was this reproducible?

Yes, at least with an old version of the patch.

I don't think you could see a difference using exactly that with the
newer versions which have the 5 page limit. After all it'll pretty much
never reach it.

That's not an unrealistic testcase.

I'm pretty sure this could be made quite a bit more pronounced by not
using a uniform distribution in the pgbench runs. And selecting a test
that's more vulnerable to the change (e.g. using a wider distribution
for the read only statements than the modifying ones) would make the
CPU overhead of the additional heap_hot_search_buffer() calls
heavier.

Sorry I don't understand this description. Why would queries selecting
data that is not changing have any extra overhead?

The idea, I think, was that by having a uniform (or just wider)
distribution of the reads they'd be more likely to land on values that
have been updated at some point, but not been pruned since (because at
that point the patch IIRC didn't prune during reads at all). I.e. ones
wer

Is the idea that the hot part of the table for updates would move around
over time, but the hot part for selects would be even throughout?

Pretty much.

I'm not sure how to put that to the test.

That pretty much was what I'd tried to model, yea. I guess it'd be
possible to model this by inserting NOW()/updating values NOW() - 5 and
selecting values up to NOW() - 60. That'd roughly model some realistic
insert/update/select patterns I've seen.

To possibly see any difference with the new patch this would have to be
done in a way that regularly a couple of pages would be touched, with
not that many selected tuples on each.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#89Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#85)
Re: Turning off HOT/Cleanup sometimes

On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote:

That is how we arrive at the idea of a cleanup limit, further enhanced by a
limit that applies only to dirtying clean blocks, which we have 4? recent votes
in favour of.

I would personally be in favour of a parameter to control the limit, since
whatever we chose is right/wrong depending upon circumstances. I am however
comfortable with not having a parameter if people think it is hard to tune
that, which I agree it would be, hence no parameter in the patch.

I think the limit has to be in terms of a percentage of the table size.
For example, if we do one SELECT on a table with all non-dirty pages, it
would be good to know that 5% of the pages were pruned --- that tells me
that another 19 SELECTs will totally prune the table, assuming no future
writes. If there are future writes, they would dirty the pages and
cause even more pruning, but the 5% gives me the maximum pruning number
of SELECTs. If there aren't another 19 SELECTs, do I care if the table
is pruned or not? Probably not. Measuring in page count doesn't do
that, and a large table could receive millions of selects before being
fully cleaned.
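
The arithmetic here can be made concrete; in this sketch the 5% figure and
the 5-page limit come from the thread, while the table size is an arbitrary
assumption:

```python
# Compare the two proposed cleanup limits: a percentage of the table
# vs. a fixed page count per statement. The table size is a made-up example.
table_pages = 1_000_000

# Percentage limit: each SELECT may prune up to 5% of the table, so a
# bounded number of SELECTs fully prunes it, regardless of table size.
pct_limit = 5  # percent
selects_needed_pct = 100 // pct_limit             # 20 SELECTs

# Fixed page-count limit: each SELECT prunes at most 5 pages, so the
# number of SELECTs needed grows linearly with table size.
page_limit = 5
selects_needed_fixed = table_pages // page_limit  # 200,000 SELECTs
```

With a fixed page count, the bigger the table, the longer the dead tuples
linger, which is exactly the "millions of selects" concern above.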

Also, I am also not sure we should be designing features at this stage
in our release process.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#90Simon Riggs
simon.riggs@2ndquadrant.com
In reply to: Bruce Momjian (#89)
Re: Turning off HOT/Cleanup sometimes

On 20 April 2015 at 18:33, Bruce Momjian <bruce@momjian.us> wrote:

Also, I am also not sure we should be designing features at this stage
in our release process.

I see this more as a process of gaining approval. I don't think patches at
the back of the queue should get the "its too late treatment" just because
they are at the back of the queue. They are logically all at the same
stage. There should be a way to allow people that show patience and respect
for the process to get the same timeshare as those that push their patches
daily.

Anyway, in this case, the patch conflicts with other things going in now,
so changing things isn't really sensible for this release, in terms of my
time.

We clearly need to do a better job of piggybacking actions on a dirty
block. The discussion has been about whether to do early cleanup, but we
should also consider post-access cleanup if we set hint bits or dirtied the
block in other ways.

Since we have many votes in favour of change in this area I'll post a new
version and look for an early review/commit for next release.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#91Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#89)
Re: Turning off HOT/Cleanup sometimes

Bruce Momjian wrote:

I think the limit has to be in terms of a percentage of the table size.
For example, if we do one SELECT on a table with all non-dirty pages, it
would be good to know that 5% of the pages were pruned --- that tells me
that another 19 SELECTs will totally prune the table, assuming no future
writes.

This seems simple to implement: keep two counters, where the second one
is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
reset the first counter so that further 5 pages will get HOT pruned. 5%
seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
essentially +infinity.)
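
A minimal sketch of that two-counter scheme, with hypothetical names
(prune_limit is the 5-page figure from the thread; skip_max stands in for
SOME_MAX_VALUE, set to an arbitrary 100 here):

```python
def prune_decisions(num_pages, prune_limit=5, skip_max=100):
    """Simulate the two-counter policy: HOT-prune up to prune_limit pages,
    then skip cleanup; once skip_max pages have been skipped, reset the
    counters so the next prune_limit pages get pruned again."""
    pruned = 0    # pages pruned since the last reset
    skipped = 0   # pages where cleanup was skipped since the last reset
    decisions = []
    for _ in range(num_pages):
        if pruned < prune_limit:
            decisions.append(True)   # prune this page
            pruned += 1
        else:
            decisions.append(False)  # skip cleanup on this page
            skipped += 1
            if skipped >= skip_max:
                pruned = skipped = 0
    return decisions
```

On a 210-page scan this prunes pages 0-4 and 105-109 and skips the rest,
spreading cleanup across the table rather than front-loading it.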

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#92Jeff Janes
jeff.janes@gmail.com
In reply to: Bruce Momjian (#89)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 10:33 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote:

That is how we arrive at the idea of a cleanup limit, further enhanced by a
limit that applies only to dirtying clean blocks, which we have 4? recent votes
in favour of.

I would personally be in favour of a parameter to control the limit, since
whatever we chose is right/wrong depending upon circumstances. I am however
comfortable with not having a parameter if people think it is hard to tune
that, which I agree it would be, hence no parameter in the patch.

I think the limit has to be in terms of a percentage of the table size.
For example, if we do one SELECT on a table with all non-dirty pages, it
would be good to know that 5% of the pages were pruned --- that tells me
that another 19 SELECTs will totally prune the table, assuming no future
writes.

But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job,
while the user waits, which is fundamentally VACUUM's duty to do in the
background? If there are a handful of very hot pages, then it makes sense
not to wait for vacuum to get to them. And that is what a block-count
limit does.

But if the entire table is very hot, I think that that is just another
way of saying that autovacuum is horribly misconfigured. I think the
purpose of this patch is to fix something that can't be fixed through
configuration alone.

If there are future writes, they would dirty the pages and
cause even more pruning, but the 5% gives me the maximum pruning number
of SELECTs. If there aren't another 19 SELECTs, do I care if the table
is pruned or not?

The use case I see for this is when there is a mixed workload. There is
one select which reads the entire table, and hundreds of thousands of
selects/updates/inserts that don't, and of course vacuum comes along every
now and then and does its thing. Why should the one massive SELECT have
horrible performance just because it was run right before autovacuum would
have kicked in instead of right after it finished?

Cheers,

Jeff

#93Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#91)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:

Bruce Momjian wrote:

I think the limit has to be in terms of a percentage of the table size.
For example, if we do one SELECT on a table with all non-dirty pages, it
would be good to know that 5% of the pages were pruned --- that tells me
that another 19 SELECTs will totally prune the table, assuming no future
writes.

This seems simple to implement: keep two counters, where the second one
is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
reset the first counter so that further 5 pages will get HOT pruned. 5%
seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
essentially +infinity.)

Oh, I pulled 5% out of the air. Thinking of a SELECT-only workload,
which would be our worst case, I was thinking how many SELECTs running
through HOT update chains would it take to be slower than generating the
WAL to prune the page. I see the percentage as something that we could
reasonably balance, while a fixed page count couldn't be analyzed in
that way.
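
That balancing act can be modelled crudely; the sketch below uses invented
cost units purely to show the shape of the break-even, not measured numbers:

```python
# Toy break-even model for the balance described above. The costs are
# made-up illustrative units, not measurements.
CHAIN_WALK_COST = 1.0   # extra cost per SELECT that follows HOT chains
PRUNE_COST = 15.0       # one-time cost to prune: dirty the page + WAL record

def cheaper_to_prune(expected_future_selects):
    """Pruning pays off once the summed chain-walking cost of the
    expected future SELECTs exceeds the one-time pruning cost."""
    return expected_future_selects * CHAIN_WALK_COST > PRUNE_COST
```

With these numbers, a page read fewer than 15 more times is cheaper left
alone; read more often than that, pruning wins.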

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#94Bruce Momjian
bruce@momjian.us
In reply to: Jeff Janes (#92)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 12:28:11PM -0700, Jeff Janes wrote:

But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while
the user waits, which is fundamentally VACUUM's duty to do in the background?
If there are a handful of very hot pages, then it makes sense not to wait for
vacuum to get to them. And that is what a block-count limit does.

But if the entire table is very hot, I think that that is just another way
of saying that autovacuum is horribly misconfigured. I think the purpose of

Well, we have to assume there are many misconfigured configurations ---
autovacuum isn't super-easy to configure, so we can't just blame the
user if this makes things worse. In fact, page pruning was designed
specifically for cases where autovacuum wasn't running or couldn't keep
up.

this patch is to fix something that can't be fixed through configuration alone.

If there are future writes, they would dirty the pages and
cause even more pruning, but the 5% gives me the maximum pruning number
of SELECTs. If there aren't another 19 SELECTs, do I care if the table
is pruned or not?

The use case I see for this is when there is a mixed workload. There is one
select which reads the entire table, and hundreds of thousands of selects/
updates/inserts that don't, and of course vacuum comes along every now and then
and does its thing. Why should the one massive SELECT have horrible performance
just because it was run right before autovacuum would have kicked in instead of
right after it finished?

I see your point, but what about the read-only workload after a big
update? Do we leave large tables to be non-pruned for a long time?
Also, consider cases where you did a big update, the autovacuum
threshold was not met, so autovacuum doesn't run on that table ---
again, do we keep those non-pruned rows around for millions of scans?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#95Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#92)
Re: Turning off HOT/Cleanup sometimes

On 20 April 2015 at 20:28, Jeff Janes <jeff.janes@gmail.com> wrote:

But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job,
while the user waits, which is fundamentally VACUUM's duty to do in the
background?

Agreed. I don't see a % as giving us anything at all.

The idea is that we want to turn an O(N) problem for one query into an O(1)
task.

The use case I see for this is when there is a mixed workload. There is
one select which reads the entire table, and hundreds of thousands of
selects/updates/inserts that don't, and of course vacuum comes along every
now and then and does its thing. Why should the one massive SELECT have
horrible performance just because it was run right before autovacuum would
have kicked in instead of right after it finished?

+1

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#96Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Janes (#92)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 3:28 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while
the user waits, which is fundamentally VACUUM's duty to do in the
background? If there are a handful of very hot pages, then it makes sense
not to wait for vacuum to get to them. And that is what a block-count limit
does.

I think that's a fundamental mischaracterization of the problem. As
soon as you define this as "vacuum's problem", then of course it makes
no sense to prune in the foreground, ever. But if you define the
problem as "get the best overall system performance", then it clearly
DOES sometimes make sense to prune in the foreground, as benchmark
results upthread demonstrate.

The fact is that on a workload like pgbench - and it doesn't have to
be exactly pgbench, just any kind of workload where there are lots of
changes to the table - vacuum can at any given time be pruning at most
one page of the table. That is because only one vacuum process can be
running in a given table at one time, and it can't be doing two things
at once. But there can be many processes doing inserts, updates, or
deletes on that table, as many as whatever you have max_connections
set to. There can easily be dozens even on a well-configured system;
on a poorly configured system, there could be hundreds. It seems
obvious that if you can have dozens or hundreds of processes creating
garbage and at most one process cleaning it up, there will be cases
where you get further and further behind.

Now, it might well be that the right solution to that problem is to
allow multiple vacuum processes in the same database, or add
background workers to help with opportunistic HOT-pruning of pages so
it doesn't get done in the foreground. Fine. But as of today, on a
heavily-modified table, the ONLY way that we can possibly remove junk
from the table as fast as we're creating junk is if the backends
touching the table do some of the work. Now, Simon is making the
argument that it should be good enough to have people *modifying* the
table help with the cleanup rather than imposing that load on the
people who are only *reading* it, and that's not a dumb argument, but
there are still cases where that strategy loses - specifically, where
the table churn has stopped or paused, but autovacuum hasn't run yet.
If you're going to do 1 sequential scan of the table and then go home
for the day, HOT-pruning is dumb even in that case. If you're going
to do 1000 sequential scans of that table in a row, HOT-pruning may
very well be smart. There's no guarantee that the table has met the
autovacuum threshold, but HOT-pruning it could well be a win anyway.
Or it might be a loss. You can make any policy here look smart or
dumb by picking a particular workload, and you don't even have to
invent crazy things that will never happen in real life to do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#97Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#91)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:

Bruce Momjian wrote:

I think the limit has to be in terms of a percentage of the table size.
For example, if we do one SELECT on a table with all non-dirty pages, it
would be good to know that 5% of the pages were pruned --- that tells me
that another 19 SELECTs will totally prune the table, assuming no future
writes.

This seems simple to implement: keep two counters, where the second one
is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
reset the first counter so that further 5 pages will get HOT pruned. 5%
seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to
just clean as many as we are supposed to, then skip the rest, so we can
write sequential dirty pages to storage.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#98Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#95)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 09:56:20PM +0100, Simon Riggs wrote:

On 20 April 2015 at 20:28, Jeff Janes <jeff.janes@gmail.com> wrote:
But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job,
while the user waits, which is fundamentally VACUUM's duty to do in the
background?

Agreed. I don't see a % as giving us anything at all.

The idea is that we want to turn an O(N) problem for one query into an O(1)
task.

The use case I see for this is when there is a mixed workload. There is
one select which reads the entire table, and hundreds of thousands of
selects/updates/inserts that don't, and of course vacuum comes along every
now and then and does its thing. Why should the one massive SELECT have
horrible performance just because it was run right before autovacuum would
have kicked in instead of right after it finished?

+1

You can +1 all you want, but if you ignore the specific workloads I
mentioned, you are not going to get much traction.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#99Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#97)
Re: Turning off HOT/Cleanup sometimes

Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:

Bruce Momjian wrote:

This seems simple to implement: keep two counters, where the second one
is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
reset the first counter so that further 5 pages will get HOT pruned. 5%
seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to
just clean as many as we are supposed to, then skip the rest, so we can
write sequential dirty pages to storage.

Keep in mind there's a disconnect between dirtying a page and writing it
to storage. A page could remain dirty for a long time in the buffer
cache. This writing of sequential pages would occur at checkpoint time
only, which seems the wrong thing to optimize. If some other process
needs to evict pages to make room to read some other page in, surely
it's going to try one page at a time, not write "many sequential dirty
pages."

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#100Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#99)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 07:13:38PM -0300, Alvaro Herrera wrote:

Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:

Bruce Momjian wrote:

This seems simple to implement: keep two counters, where the second one
is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
reset the first counter so that further 5 pages will get HOT pruned. 5%
seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to
just clean as many as we are supposed to, then skip the rest, so we can
write sequential dirty pages to storage.

Keep in mind there's a disconnect between dirtying a page and writing it
to storage. A page could remain dirty for a long time in the buffer
cache. This writing of sequential pages would occur at checkpoint time
only, which seems the wrong thing to optimize. If some other process
needs to evict pages to make room to read some other page in, surely
it's going to try one page at a time, not write "many sequential dirty
pages."

Yes, it might be too much optimization to try to get the checkpoint to
flush all those pages sequentially, but I was thinking of our current
behavior where, after an update of all rows, we effectively write out
the entire table because we have dirtied every page. I guess with later
prune-based writes, we aren't really writing all the pages as we have
the pattern where pages with prunable content are kind of random. I guess
I was just wondering what value there is to your write-then-skip idea,
vs just writing the first X% of pages we find? Your idea certainly
spreads out the pruning, and doesn't require knowing the size of the
table, though I thought that information was easily determined.

One thing to consider is how we handle pruning of index scans that hit
multiple heap pages. Do we still write X% of the pages in the table, or
X% of the heap pages we actually access via SELECT? With the
write-then-skip approach, we would do X% of the pages we access, while
with the first-X% approach, we would probably prune all of them as we
would not be accessing most of the table. I don't think we can do the
first-X% of pages and have the percentage based on the number of
pages accessed as we have no way to know how many heap pages we will
access from the index. (We would know for bitmap scans, but that
complexity doesn't seem worth it.) That would argue, for consistency
with sequential and index-based heap access, that your approach is best.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#101Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#99)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 6:13 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Bruce Momjian wrote:

On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:

Bruce Momjian wrote:
This seems simple to implement: keep two counters, where the second one
is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
reset the first counter so that further 5 pages will get HOT pruned. 5%
seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to
just clean as many as we are supposed to, then skip the rest, so we can
write sequential dirty pages to storage.

Keep in mind there's a disconnect between dirtying a page and writing it
to storage. A page could remain dirty for a long time in the buffer
cache. This writing of sequential pages would occur at checkpoint time
only, which seems the wrong thing to optimize. If some other process
needs to evict pages to make room to read some other page in, surely
it's going to try one page at a time, not write "many sequential dirty
pages."

Well, for a big sequential scan, we use a ring buffer, so we will
typically be evicting the pages that we ourselves read in moments
before. So in this case we would do a lot of sequential writes of
dirty pages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#102Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Bruce Momjian (#100)
Re: Turning off HOT/Cleanup sometimes

On 4/21/15 10:04 AM, Bruce Momjian wrote:

One thing to consider is how we handle pruning of index scans that hit
multiple heap pages. Do we still write X% of the pages in the table, or
X% of the heap pages we actually access via SELECT? With the
write-then-skip approach, we would do X% of the pages we access, while
with the first-X% approach, we would probably prune all of them as we
would not be accessing most of the table. I don't think we can do the
first-X% of pages and have the percentage based on the number of
pages accessed as we have no way to know how many heap pages we will
access from the index.

This comment made me wonder... has anyone considered handing the pruning
work off to a bgworker, at least for SELECTs? That means the selects
themselves wouldn't be burdened by the actual prune work, only in
notifying the bgworker. While that's not going to be free, presumably
it's a lot cheaper...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#103Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#100)
Re: Turning off HOT/Cleanup sometimes

On Tue, Apr 21, 2015 at 11:04 AM, Bruce Momjian <bruce@momjian.us> wrote:

Yes, it might be too much optimization to try to get the checkpoint to
flush all those pages sequentially, but I was thinking of our current
behavior where, after an update of all rows, we effectively write out
the entire table because we have dirtied every page. I guess with later
prune-based writes, we aren't really writing all the pages as we have
the pattern where pages with prunable content are kind of random. I guess
I was just wondering what value there is to your write-then-skip idea,
vs just writing the first X% of pages we find? Your idea certainly
spreads out the pruning, and doesn't require knowing the size of the
table, though I thought that information was easily determined.

One thing to consider is how we handle pruning of index scans that hit
multiple heap pages. Do we still write X% of the pages in the table, or
X% of the heap pages we actually access via SELECT? With the
write-then-skip approach, we would do X% of the pages we access, while
with the first-X% approach, we would probably prune all of them as we
would not be accessing most of the table. I don't think we can do the
first-X% of pages and have the percentage based on the number of
pages accessed as we have no way to know how many heap pages we will
access from the index. (We would know for bitmap scans, but that
complexity doesn't seem worth it.) That would argue, for consistency
with sequential and index-based heap access, that your approach is best.

I actually implemented something like this for setting hint bits a few
years ago:

/messages/by-id/AANLkTik5QzR8wTs0MqCWwmNp-qHGrdKY5Av5aOB7W4Dp@mail.gmail.com
/messages/by-id/AANLkTimGKaG7wdu-x77GNV2Gh6_Qo5Ss1u5b6Q1MsPUy@mail.gmail.com

At least in later versions, the patch writes a certain number of
hinted pages, then skips writing a run of pages, then writes another
run of hinted pages. The basic problem here is that, after the fsync
queue compaction patch went in, the benefits on my tests were pretty
modest. Yeah, it costs something to write out lots of dirty pages,
but before the fsync queue compaction stuff, the initial scan of an
unhinted table took like 6x the time on the machine I tested on, but
after that, it was like 1.5x the time. Blunting that spike just
wasn't exciting enough.

It strikes me that it would be better to have an integrated strategy
for this problem. It doesn't make sense to have one strategy for
deciding whether to set hint bits and a separate strategy for deciding
whether to HOT-prune. And if we decide to set hint bits and
HOT-prune, it might be smart to try to mark the page all-visible, too,
if it is and we're not about to update it. I believe we're losing a
lot of performance on OLTP workloads by re-dirtying the same pages
over and over again. We've probably all hit cases where there is an
obvious loss of performance because of this sort of thing, but I'm
starting to think it's hurting us in a lot of less-obvious ways.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#104Peter Eisentraut
peter_e@gmx.net
In reply to: Jim Nasby (#102)
Re: Turning off HOT/Cleanup sometimes

On 4/21/15 4:45 PM, Jim Nasby wrote:

This comment made me wonder... has anyone considered handing the pruning
work off to a bgworker, at least for SELECTs? That means the selects
themselves wouldn't be burdened by the actual prune work, only in
notifying the bgworker. While that's not going to be free, presumably
it's a lot cheaper...

The nice thing about having foreground queries do the light cleanup is
that they can work in parallel and naturally hit the interesting parts
of the table first.

In order for a background worker to keep up with some of the workloads
that have been presented as counterexamples, you'd need multiple
background workers operating in parallel and preferring to work on
certain parts of a table. That would require a lot more sophisticated
job management than we currently have for, say, autovacuum.

#105Greg Stark
stark@mit.edu
In reply to: Bruce Momjian (#94)
Re: Turning off HOT/Cleanup sometimes

On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian <bruce@momjian.us> wrote:

But if the entire table is very hot, I think that that is just another way
of saying that autovacuum is horribly misconfigured. I think the purpose of

Well, we have to assume there are many misconfigured configurations ---
autovacuum isn't super-easy to configure, so we can't just blame the
user if this makes things worse. In fact, page pruning was designed
specifically for cases where autovacuum wasn't running or couldn't keep
up.

Well autovacuum isn't currently considering HOT pruning part of its
job at all. It's hard to call it "misconfigured" when there's
literally *no* way to configure it "correctly".

If you update less than autovacuum_vacuum_scale_factor fraction of the
table and then never update another row autovacuum will never run.
Ever. Every select will forevermore need to follow hot chains on that
table. Until eventually transaction wraparound forces a vacuum on that
table if that ever happens.
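
The trigger rule being described can be written out explicitly (the formula
and defaults match autovacuum's documented behaviour; the row counts below
are arbitrary):

```python
def autovacuum_triggers(dead_tuples, reltuples,
                        vacuum_threshold=50, scale_factor=0.2):
    """Autovacuum vacuums a table once its dead-tuple count exceeds
    autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor
    times the table's row count (defaults shown)."""
    return dead_tuples > vacuum_threshold + scale_factor * reltuples

# Updating 15% of a 1M-row table stays under the default 20% scale
# factor, so autovacuum never fires and readers keep chasing HOT chains.
under = autovacuum_triggers(150_000, 1_000_000)   # False
over = autovacuum_triggers(250_000, 1_000_000)    # True
```

Note that SELECT traffic appears nowhere in the formula, which is exactly
the gap being pointed out.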

Possibly autovacuum could be adjusted to count how many selects are
happening on the table and decide to vacuum it when the cost of the
selects following the dead tuples is balanced by the cost of doing a
vacuum. But that's not something included in the design of autovacuum
today.

The original design of tuple storage was aimed at optimizing the
steady state where most tuples were not recently updated. It
guaranteed that except for tuples that were in the process of being
updated or were recently updated a tuple read didn't have to read the
CLOG, didn't have to follow any chains, didn't have to do any I/O or
other work other than to read the bits on the tuple itself. When a
tuple is updated it's put into a state where everyone who comes along
has to do extra work but as soon as practical the hint bits get set
and that extra work stops.

We had similar discussions about setting hint bits in the past. I'm
not sure why HOT pruning is the focus now because I actually think
hint bit setting is a larger source of I/O in innocent looking selects
even today. And it's a major headache, people are always being
surprised that their selects cause lots of I/O and slow down
dramatically after a big update or data load has finished. It's
characterized as "why is the database writing everything twice" (and
saying it's actually writing everything three times doesn't make
people feel better). In the new age of checksums with hint bit logging
I wonder if it's an even bigger issue.

It occurs to me that generating these dirty pages isn't really that
expensive individually. It's only that there's a sudden influx of a
large number of dirty pages that causes them to get translated
immediately into filesystem I/O. Perhaps we should dirty pages on hint
bit updates and do HOT pruning only to the extent it can be done
without causing I/O. Of course it's hard to tell that in advance but
maybe something like "if the current buffer had to be fetched and
caused a dirty buffer to be evicted then skip hot pruning and don't
dirty it for any hint bit updates" would at least mean that once the
select fills up its share of buffers with dirty buffers it stops
dirtying more. It would dirty pages only as fast as bgwriter or
checkpoints manage to write them out.

That sounds a bit weird but I think the right solution should have
that combination of properties. It should guarantee that hint bits get
set and hot chains pruned within some length of time but that no one
select causes a storm of dirty buffers that then need to be flushed to
disk.

--
greg

#106Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Peter Eisentraut (#104)
Re: Turning off HOT/Cleanup sometimes

On 4/21/15 4:07 PM, Peter Eisentraut wrote:

On 4/21/15 4:45 PM, Jim Nasby wrote:
In order for a background worker to keep up with some of the workloads
that have been presented as counterexamples, you'd need multiple
background workers operating in parallel and preferring to work on
certain parts of a table. That would require a lot more sophisticated
job management than we currently have for, say, autovacuum.

My thought was that the foreground queries would send page IDs to the
bgworker via a shm_mq. If the queries have to do much waiting at all on IO
then I'd expect the bgworker to be able to keep pace with a bunch of
them since it's just grabbing buffers that are already in the pool (and
only those in the pool; it wouldn't make sense for it to pull it back
from the kernel, let alone disk).

We'd need to code this so that if a queue fills up the query doesn't
block; we just skip that opportunity to prune. I think that'd be fine.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#107Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#101)
Re: Turning off HOT/Cleanup sometimes

On Tue, Apr 21, 2015 at 04:36:53PM -0400, Robert Haas wrote:

Keep in mind there's a disconnect between dirtying a page and writing it
to storage. A page could remain dirty for a long time in the buffer
cache. This writing of sequential pages would occur at checkpoint time
only, which seems the wrong thing to optimize. If some other process
needs to evict pages to make room to read some other page in, surely
it's going to try one page at a time, not write "many sequential dirty
pages."

Well, for a big sequential scan, we use a ring buffer, so we will
typically be evicting the pages that we ourselves read in moments
before. So in this case we would do a lot of sequential writes of
dirty pages.

Ah, yes, this again supports the prune-then-skip approach, rather than
doing the first X% pruneable pages seen.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#108Peter Eisentraut
peter_e@gmx.net
In reply to: Jim Nasby (#106)
Re: Turning off HOT/Cleanup sometimes

On 4/22/15 11:37 AM, Jim Nasby wrote:

On 4/21/15 4:07 PM, Peter Eisentraut wrote:

On 4/21/15 4:45 PM, Jim Nasby wrote:
In order for a background worker to keep up with some of the workloads
that have been presented as counterexamples, you'd need multiple
background workers operating in parallel and preferring to work on
certain parts of a table. That would require a lot more sophisticated
job management than we currently have for, say, autovacuum.

My thought was that the foreground queries would send page IDs to the
bgworker via a shm_mq. If the queries have to do much waiting at all on IO
then I'd expect the bgworker to be able to keep pace with a bunch of
them since it's just grabbing buffers that are already in the pool (and
only those in the pool; it wouldn't make sense for it to pull it back
from the kernel, let alone disk).

We'd need to code this so that if a queue fills up the query doesn't
block; we just skip that opportunity to prune. I think that'd be fine.

I think a "to-clean-up map" would work better. But basically we need a
way to remember where to clean up later if we're not going to do it in
the foreground.

#109Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#104)
Re: Turning off HOT/Cleanup sometimes

On Tue, Apr 21, 2015 at 05:07:52PM -0400, Peter Eisentraut wrote:

On 4/21/15 4:45 PM, Jim Nasby wrote:

This comment made me wonder... has anyone considered handing the pruning
work off to a bgworker, at least for SELECTs? That means the selects
themselves wouldn't be burdened by the actual prune work, only in
notifying the bgworker. While that's not going to be free, presumably
it's a lot cheaper...

The nice thing about having foreground queries do the light cleanup is
that they can work in parallel and naturally hit the interesting parts
of the table first.

In order for a background worker to keep up with some of the workloads
that have been presented as counterexamples, you'd need multiple
background workers operating in parallel and preferring to work on
certain parts of a table. That would require a lot more sophisticated
job management than we currently have for, say, autovacuum.

Well, the visibility map tells us where _not_ to clean up, so using
another map to tell us _where_ to clean up might make sense. However,
the density of the map might be low enough that a list makes more sense,
as you suggested.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#110Kevin Grittner
kgrittn@ymail.com
In reply to: Greg Stark (#105)
Re: Turning off HOT/Cleanup sometimes

Greg Stark <stark@mit.edu> wrote:

And it's a major headache, people are always being surprised that
their selects cause lots of I/O and slow down dramatically after
a big update or data load has finished. It's characterized as
"why is the database writing everything twice" (and saying it's
actually writing everything three times doesn't make people feel
better).

When I looked at the life-cycle of a heap tuple in a database I was
using, I found that (ignoring related index access and ignoring
WAL-file copying, etc., for our backups), each tuple that existed
long enough to freeze and be eventually deleted caused a lot of
writes.

(1) WAL log the insert.
(2) Write the tuple.
(3) Hint and rewrite the tuple.
(4) WAL log the freeze of the tuple.
(5) Rewrite the frozen tuple.
(6) WAL-log the delete.
(7) Rewrite the deleted tuple.
(8) Prune and rewrite the page.
(9) Free line pointers and rewrite the page.

If I was lucky some of the writes could be combined in cache
because they happened close enough together. Also, one could hope
that not too much of the WAL-logging involved full page writes to
the WAL -- again, keeping steps close together in time helps with
that. If all of (1) through (5) are done in quick succession, you
save two physical writes of the heap page and save one full page
write to WAL. If steps (7) through (9) are done in quick
succession, you save two more physical writes to the heap. This is
part of what makes the aggressive incremental freezing being
discussed on a nearby thread appealing -- at least for some
workloads.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#111Bruce Momjian
bruce@momjian.us
In reply to: Greg Stark (#105)
Re: Turning off HOT/Cleanup sometimes

On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote:

On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian <bruce@momjian.us> wrote:

Well, we have to assume there are many misconfigured configurations ---
autovacuum isn't super-easy to configure, so we can't just blame the
user if this makes things worse. In fact, page pruning was designed
specifically for cases where autovacuum wasn't running or couldn't keep
up.

Well autovacuum isn't currently considering HOT pruning part of its
job at all. It's hard to call it "misconfigured" when there's
literally *no* way to configure it "correctly".

Good point, but doesn't vacuum remove the need for pruning as it removes
all the old rows?

If you update less than autovacuum_vacuum_scale_factor fraction of the
table and then never update another row autovacuum will never run.
Ever. Every select will forevermore need to follow hot chains on that
table. Until eventually transaction wraparound forces a vacuum on that
table if that ever happens.

Yes, that is a very good point, and it matches my concerns. Of course,
Simon's concern is to avoid overly-aggressive pruning where the row is
being pruned but will soon be modified, making the prune, and its WAL
volume, undesirable. We have to consider both cases in any final
solution.

Possibly autovacuum could be adjusted to count how many selects are
happening on the table and decide to vacuum it when the cost of the
selects following the dead tuples is balanced by the cost of doing a
vacuum. But that's not something included in the design of autovacuum
today.

Well, autovacuum is also going to clean indexes, which seems like
overkill for pruning HOT updates.

The original design of tuple storage was aimed at optimizing the
steady state where most tuples were not recently updated. It
guaranteed that except for tuples that were in the process of being
updated or were recently updated a tuple read didn't have to read the
CLOG, didn't have to follow any chains, didn't have to do any I/O or
other work other than to read the bits on the tuple itself. When a
tuple is updated it's put into a state where everyone who comes along
has to do extra work but as soon as practical the hint bits get set
and that extra work stops.

Yes, Simon is right that doing everything as-soon-as-possible is not
optimal. I think the trick is knowing when we should give up waiting
for something else to dirty the page and prune it.

We had similar discussions about setting hint bits in the past. I'm
not sure why HOT pruning is the focus now because I actually think
hint bit setting is a larger source of I/O in innocent looking selects
even today. And it's a major headache, people are always being
surprised that their selects cause lots of I/O and slow down
dramatically after a big update or data load has finished. It's
characterized as "why is the database writing everything twice" (and
saying it's actually writing everything three times doesn't make
people feel better). In the new age of checksums with hint bit logging
I wonder if it's an even bigger issue.

What would be the downside of only doing pruning during SELECT hint bit
setting? Hinting is delayed by long-running transactions, but so is
pruning. I assume you can do more pruning than setting all_visible
hints because the old prunable rows are older by definition, but I am
unclear how much older they are.

FYI, while hint bit setting causes page writes, it does not cause WAL
writes unless you have wal_log_hints set or page-level checksums are
enabled. By doing pruning at the same time as hint bit setting, you are
sharing the same page write, but are generating more WAL. Of course, if
you are setting all-visible, then you are by definition waiting longer
to prune than before, and this might be enough to make it a win for all
use cases. You would never end up skipping pruning forever in a read-only
workload, because setting the hint bits would eventually trigger the pruning.
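
For reference, the setting involved: hint-bit changes are WAL-logged only
when wal_log_hints is on, or when the cluster was initialized with
checksums (initdb --data-checksums); otherwise hinting dirties the page
without generating WAL.

```
# postgresql.conf -- off by default; turning it on makes the first
# hint-bit change to a page after a checkpoint emit a full-page image
wal_log_hints = on
```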

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#112Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#111)
Re: Turning off HOT/Cleanup sometimes

Bruce Momjian wrote:

On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote:

On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian <bruce@momjian.us> wrote:

Well, we have to assume there are many misconfigured configurations ---
autovacuum isn't super-easy to configure, so we can't just blame the
user if this makes things worse. In fact, page pruning was designed
specifically for cases where autovacuum wasn't running or couldn't keep
up.

Well autovacuum isn't currently considering HOT pruning part of its
job at all. It's hard to call it "misconfigured" when there's
literally *no* way to configure it "correctly".

Good point, but doesn't vacuum remove the need for pruning as it removes
all the old rows?

Sure. The point, I think, is to make autovacuum runs of some sort that
don't actually vacuum but only do HOT-pruning. Maybe this is a
reasonable solution to the problem that queries don't prune anymore
after Simon's patch. If we made autovac HOT-prune periodically, we
could have read-only queries prune only already-dirty pages. Of course,
that would need further adjustments to default number of autovac
workers, I/O allocation, etc.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#113Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#112)
Re: Turning off HOT/Cleanup sometimes

On Wed, Apr 22, 2015 at 06:07:00PM -0300, Alvaro Herrera wrote:

Good point, but doesn't vacuum remove the need for pruning as it removes
all the old rows?

Sure. The point, I think, is to make autovacuum runs of some sort that
don't actually vacuum but only do HOT-pruning. Maybe this is a
reasonable solution to the problem that queries don't prune anymore
after Simon's patch. If we made autovac HOT-prune periodically, we
could have read-only queries prune only already-dirty pages. Of course,
that would need further adjustments to default number of autovac
workers, I/O allocation, etc.

Do we really want to make vacuum more complex for this? vacuum does
have the delay settings we would need though.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

#114Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Kevin Grittner (#110)
Re: Turning off HOT/Cleanup sometimes

On 4/22/15 1:51 PM, Kevin Grittner wrote:

(1) WAL log the insert.
(2) Write the tuple.
(3) Hint and rewrite the tuple.
(4) WAL log the freeze of the tuple.
(5) Rewrite the frozen tuple.
(6) WAL-log the delete.
(7) Rewrite the deleted tuple.
(8) Prune and rewrite the page.
(9) Free line pointers and rewrite the page.

If I was lucky some of the writes could be combined in cache
because they happened close enough together. Also, one could hope
that not too much of the WAL-logging involved full page writes to
the WAL -- again, keeping steps close together in time helps with
that.

This is why I like the idea of methods that tell us where we need to do
cleanup... they provide us with a rough ability to track what tuples are
in what part of their lifecycle. The VM helps with this a small amount,
but really it only applies after 1 and 6; it doesn't help us with any
other portions.

Having a way to track recently created tuples would allow us to be much
more efficient with 1-3, and with aggressive freezing, 1-5. A way to
track recently deleted tuples would help with 6-7, possibly 6-9 if no
indexes.

If we doubled the size of the VM, that would let us track 4 states for
each page:

- Page has newly inserted tuples
- Page has newly deleted tuples
- Page is all visible
- Page is frozen

though as discussed elsewhere, we could probably combine all visible and
frozen.

The win from doing this would be easily knowing what pages need hinting
(newly inserted) and pruning (newly deleted). Unfortunately we still
wouldn't know whether we could do real work without visiting the page
itself, but I suspect that for many workloads just having newly
inserted/deleted would be a serious win.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#115Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#113)
Re: Turning off HOT/Cleanup sometimes

On Wed, Apr 22, 2015 at 5:17 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Apr 22, 2015 at 06:07:00PM -0300, Alvaro Herrera wrote:

Good point, but doesn't vacuum remove the need for pruning as it removes
all the old rows?

Sure. The point, I think, is to make autovacuum runs of some sort that
don't actually vacuum but only do HOT-pruning. Maybe this is a
reasonable solution to the problem that queries don't prune anymore
after Simon's patch. If we made autovac HOT-prune periodically, we
could have read-only queries prune only already-dirty pages. Of course,
that would need further adjustments to default number of autovac
workers, I/O allocation, etc.

Do we really want to make vacuum more complex for this? vacuum does
have the delay settings we would need though.

I think it's abundantly clear that, as wonderful as autovacuum is
compared with what we had before autovacuum, it's not good enough.
This is one area where I think improvement is definitely needed, and
I've suggested it before. Discussion began here:

/messages/by-id/AANLkTimd3ieGCm9pXV39ci6-owy3rX0mzz_N1tL=0ZLm@mail.gmail.com

Some of the things I suggested then seem dumb in hindsight, but I
think the basic concept is still valid: if we scan the heap and find
only a few dead tuples, the expense of scanning all of the indexes may
not be justified. Also, the fact that a relation can currently only
be vacuumed by one process at a time is coming to seem like a major
limitation. Some users are partitioning tables just so that each
partition can be autovac'd separately. That really shouldn't be
required.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#116Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Robert Haas (#115)
Re: Turning off HOT/Cleanup sometimes

On 4/23/15 8:25 AM, Robert Haas wrote:

Some users are partitioning tables just so that each
partition can be autovac'd separately. That really shouldn't be
required.

Are they doing this for improved heap scan performance? Index scan
performance? If the table wasn't partitioned, would they need more than
one pass through the indexes due to exhausting maintenance_work_mem?

There's probably some fairly low-hanging fruit for parallelizing vacuum,
but it really depends on what problems people are running into.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

#117Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#116)
Re: Turning off HOT/Cleanup sometimes

On Thu, Apr 23, 2015 at 10:44 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 4/23/15 8:25 AM, Robert Haas wrote:

Some users are partitioning tables just so that each
partition can be autovac'd separately. That really shouldn't be
required.

Are they doing this for improved heap scan performance? Index scan
performance? If the table wasn't partitioned, would they need more than one
pass through the indexes due to exhausting maintenance_work_mem?

I don't know of anyone with a properly-configured system who needs
more than one pass through the indexes due to exhausting
maintenance_work_mem. The issue is that you have to vacuum a table
frequently enough to avoid accumulating bloat. The frequency with
which you need to vacuum varies depending on the size of the table and
how frequently it's updated. However, a large, heavily-updated table
can take long enough to vacuum that, by the time you get done, it's
already overdue to be vacuumed again. That's a problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#118Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#117)
Re: Turning off HOT/Cleanup sometimes

On 2015-04-23 15:40:36 -0400, Robert Haas wrote:

The issue is that you have to vacuum a table frequently enough to
avoid accumulating bloat. The frequency with which you need to vacuum
varies depending on the size of the table and how frequently it's
updated. However, a large, heavily-updated table can take long enough
to vacuum that, by the time you get done, it's already overdue to be
vacuumed again. That's a problem.

Especially because the indexes are scanned fully. In many cases I've
observed the heap scans themselves being fast; but scanning hundreds
(yes) of gigabytes of indexes taking ages.

Andres

#119Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#90)
Re: Turning off HOT/Cleanup sometimes

Hi Simon,

On 2015-04-20 19:21:24 +0100, Simon Riggs wrote:

Since we have many votes in favour of change in this area I'll post a new
version and look for an early review/commit for next release.

If I see correctly there's been no new patch version since, right? The
patch is included in the current commitfest as needing review
nonetheless - that's possibly because it wasn't possible to "return with
feedback" for a while?

I marked it as RWF for now; if I missed something we can change it back.

Greetings,

Andres Freund
