CLOG contention

Started by Robert Haas · about 14 years ago · 51 messages
#1 Robert Haas
robertmhaas@gmail.com

A few weeks ago I posted some performance results showing that
increasing NUM_CLOG_BUFFERS was improving pgbench performance.

http://archives.postgresql.org/pgsql-hackers/2011-12/msg00095.php

I spent some time today looking at this in a bit more detail.
Somewhat obviously in retrospect, it turns out that the problem
becomes more severe the longer you run the test. CLOG lookups are
induced when we go to update a row that we've previously updated.
When the test first starts, just after pgbench -i, all the rows are
hinted and, even if they weren't, they all have the same XID. So no
problem. But, as the fraction of rows that have been updated
increases, it becomes progressively more likely that the next update
will hit a row that's already been updated. Initially, that's OK,
because we can keep all the CLOG pages of interest in the 8 available
buffers. But once we've eaten through enough XIDs - specifically, 8
buffers * 8192 bytes/buffer * 4 xids/byte = 256k - we can't keep all the
necessary pages in memory at the same time, and so we have to keep
replacing CLOG pages. This effect is not difficult to see even on my
2-core laptop, although I'm not sure whether it causes any material
performance degradation.
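In terms of the constants in src/backend/access/transam/clog.c, that
arithmetic looks like this (just a back-of-envelope restatement, not
proposed code):

#define CLOG_BITS_PER_XACT	2
#define CLOG_XACTS_PER_BYTE	4	/* 8 bits / 2 bits per xact */
#define CLOG_XACTS_PER_PAGE	(BLCKSZ * CLOG_XACTS_PER_BYTE)	/* 32768 at BLCKSZ = 8192 */
#define NUM_CLOG_BUFFERS	8

/* XIDs covered at once: 8 buffers * 32768 xids/page = 262144 (256k) */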

If you have enough concurrent tasks, a probably-more-serious form of
starvation can occur. As SlruSelectLRUPage notes:

/*
 * We need to wait for I/O.  Normal case is that it's dirty and we
 * must initiate a write, but it's possible that the page is already
 * write-busy, or in the worst case still read-busy.  In those cases
 * we wait for the existing I/O to complete.
 */

On Nate Boley's 32-core box, after running pgbench for a few minutes,
that "in the worst case" scenario starts happening quite regularly,
apparently because the number of people who simultaneously wish to
read different CLOG pages exceeds the number of available buffers
into which they can be read. The ninth and following backends to come
along have to wait until the least-recently-used page is no longer
read-busy before starting their reads.
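In outline, the path those extra backends take is something like this
(pseudocode only; SimpleLruWaitIO and the page_status flags are real
slru.c identifiers, but this is a simplification of the actual
control flow):

for (;;)
{
	slotno = SlruSelectLRUPage(ctl, pageno);	/* may pick a busy slot */

	if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS ||
		shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS)
	{
		SimpleLruWaitIO(ctl, slotno);	/* ninth-and-later backends park here */
		continue;	/* state has changed; pick a victim again */
	}
	break;		/* slot usable; now we can start our own read */
}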

So, what do we do about this? The obvious answer is "increase
NUM_CLOG_BUFFERS", and I'm not sure that's a bad idea. 64kB is a
pretty small cache on anything other than an embedded system, these
days. We could either increase the hard-coded value, or make it
configurable - but it would have to be PGC_POSTMASTER, since there's
no way to allocate more shared memory later on. The downsides of this
approach are:

1. If we make it configurable, nobody will have a clue what value to set.
2. If we just make it bigger, people laboring under the default 32MB
shared memory limit will conceivably suffer even more than they do now
if they just initdb and go.

A more radical approach would be to try to merge the buffer arenas for
the various SLRUs either with each other or with shared_buffers, which
would presumably allow a lot more flexibility to ratchet the number of
CLOG buffers up or down depending on overall memory pressure. Merging
the buffer arenas into shared_buffers seems like the most flexible
solution, but it also seems like a big, complex, error-prone behavior
change, because the SLRU machinery does things quite differently from
shared_buffers: we look up buffers with a linear array search rather
than a hash table probe; we have only a per-SLRU lock and a per-page
lock, rather than separate mapping locks, content locks,
io-in-progress locks, and pins; and while the main buffer manager is
content with some loosey-goosey approximation of recency, the SLRU
code makes a fervent attempt at strict LRU (slightly compromised for
the sake of reduced locking in SimpleLruReadPage_Readonly).

Any thoughts on what makes most sense here? I find it fairly tempting
to just crank up NUM_CLOG_BUFFERS and call it good, but the siren song
of refactoring is whispering in my other ear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> writes:

So, what do we do about this? The obvious answer is "increase
NUM_CLOG_BUFFERS", and I'm not sure that's a bad idea.

As you say, that's likely to hurt people running in small shared
memory. I too have thought about merging the SLRU areas into the main
shared buffer arena, and likewise have concluded that it is likely to
be way more painful than it's worth. What I think might be an
appropriate compromise is something similar to what we did for
autotuning wal_buffers: use a fixed percentage of shared_buffers, with
some minimum and maximum limits to ensure sanity. But picking the
appropriate percentage would take a bit of research.

regards, tom lane

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#1)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> writes:

... while the main buffer manager is
content with some loosey-goosey approximation of recency, the SLRU
code makes a fervent attempt at strict LRU (slightly compromised for
the sake of reduced locking in SimpleLruReadPage_Readonly).

Oh btw, I haven't looked at that code recently, but I have a nasty
feeling that there are parts of it that assume that the number of
buffers it is managing is fairly small. Cranking up the number
might require more work than just changing the value.

regards, tom lane

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#3)
Re: CLOG contention

On Wed, Dec 21, 2011 at 5:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

... while the main buffer manager is
content with some loosey-goosey approximation of recency, the SLRU
code makes a fervent attempt at strict LRU (slightly compromised for
the sake of reduced locking in SimpleLruReadPage_Readonly).

Oh btw, I haven't looked at that code recently, but I have a nasty
feeling that there are parts of it that assume that the number of
buffers it is managing is fairly small.  Cranking up the number
might require more work than just changing the value.

My memory was that you'd said benchmarks showed NUM_CLOG_BUFFERS needs
to be low enough to allow fast lookups, since the lookups don't use an
LRU; they just scan all buffers. Indeed, it was your objection that
stopped NUM_CLOG_BUFFERS from being increased many years before this.

With the increased performance we have now, I don't think increasing
that alone will be that useful since it doesn't solve all of the
problems and (I am told) likely increases lookup time.

The full list of clog problems I'm aware of is: raw lookup speed,
multi-user contention, writes at checkpoint and new xid allocation.

Would it be better just to have multiple SLRUs dedicated to the clog?
Simply partition things so we have 2^N sets of everything, and we look
up the xid in partition (xid % (2^N)). That would overcome all of the
problems, not just lookup, in exactly the same way that we partitioned
the buffer and lock manager. We would use a graduated offset on the
page to avoid zeroing pages at the same time. Clog size wouldn't
increase, we'd have the same number of bits, just spread across 2^N
files. We'd have more pages too, but that's not a bad thing since it
spreads out the contention.
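As a minimal sketch of that mapping (names illustrative; a fuller
version appears in the patch attached later in this thread):

#define NUM_CLOG_PARTITIONS	8	/* 2^N; value for illustration only */

/* each xid's status bits live in the SLRU for its partition */
#define TransactionIdToPartition(xid) \
	((xid) % (TransactionId) NUM_CLOG_PARTITIONS)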

Code-wise, those changes would be isolated to clog.c only, probably a
day's work if you like the idea.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#3)
Re: CLOG contention

On Wed, Dec 21, 2011 at 12:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Oh btw, I haven't looked at that code recently, but I have a nasty
feeling that there are parts of it that assume that the number of
buffers it is managing is fairly small.  Cranking up the number
might require more work than just changing the value.

Oh, you mean like the fact that it tries to do strict LRU page
replacement? *rolls eyes* We seem to have named the SLRU system
after one of its scalability limitations...

I think there probably are some scalability limits to the current
implementation, but also I think we could probably increase the
current value modestly with something less than a total rewrite.
Linearly scanning the slot array won't scale indefinitely, but I think
it will scale to more than 8 elements. The performance results I
posted previously make it clear that 8 -> 32 is a net win at least on
that system. One fairly low-impact option might be to make the cache
less than fully associative - e.g. given N buffers, a page with pageno
% 4 == X is only allowed to be in a slot numbered between (N/4)*X and
(N/4)*(X+1)-1. That likely would be counterproductive at N = 8 but
might be OK at larger values. We could also switch to using a hash
table but that seems awfully heavy-weight.
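For concreteness, the bucketed lookup might look roughly like this
(an illustrative sketch only, using the slru.c shared-state fields):

/* with nslots slots and 4 buckets, a page may live only in its bucket */
int	bucket = pageno % 4;
int	first = (nslots / 4) * bucket;				/* (N/4)*X */
int	last = (nslots / 4) * (bucket + 1) - 1;		/* (N/4)*(X+1)-1 */

for (slotno = first; slotno <= last; slotno++)
{
	if (shared->page_number[slotno] == pageno &&
		shared->page_status[slotno] != SLRU_PAGE_EMPTY)
		return slotno;		/* hit */
}
/* miss: choose a victim within [first, last] only */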

The real question is how to decide how many buffers to create. You
suggested a formula based on shared_buffers, but what would that
formula be? I mean, a typical large system is going to have 1,048,576
shared buffers, and it probably needs less than 0.1% of that amount of
CLOG buffers. My guess is that there's no real reason to skimp: if
you are really tight for memory, you might want to crank this down,
but otherwise you may as well just go with whatever we decide the
best-performing value is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#4)
Re: CLOG contention

On Wed, Dec 21, 2011 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

With the increased performance we have now, I don't think increasing
that alone will be that useful since it doesn't solve all of the
problems and (I am told) likely increases lookup time.

I have benchmarks showing that it works, for whatever that's worth.

The full list of clog problems I'm aware of is: raw lookup speed,
multi-user contention, writes at checkpoint and new xid allocation.

What is the best workload to show a bottleneck on raw lookup speed?

I wouldn't expect writes at checkpoint to be a big problem because
it's so little data.

What's the problem with new XID allocation?

Would it be better just to have multiple SLRUs dedicated to the clog?
Simply partition things so we have 2^N sets of everything, and we look
up the xid in partition (xid % (2^N)).  That would overcome all of the
problems, not just lookup, in exactly the same way that we partitioned
the buffer and lock manager. We would use a graduated offset on the
page to avoid zeroing pages at the same time. Clog size wouldn't
increase, we'd have the same number of bits, just spread across 2^N
files. We'd have more pages too, but that's not a bad thing since it
spreads out the contention.

It seems that would increase memory requirements (clog1 through clog4
with 2 pages each doesn't sound workable). It would also break
on-disk compatibility for pg_upgrade. I'm still holding out hope that
we can find a simpler solution...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#1)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> wrote:

Any thoughts on what makes most sense here? I find it fairly
tempting to just crank up NUM_CLOG_BUFFERS and call it good,

The only thought I have to add to the discussion so far is that the need
to do anything may be reduced significantly by any work to write
hint bits more aggressively. We only consult CLOG for tuples on
which hint bits have not yet been set, right? What if, before
writing a page, we try to set hint bits where we can? When
successful, it would not only prevent one or more later writes of
the page, but could also prevent having to load old CLOG pages.
Perhaps the hint bit issue should be addressed first, and *then* we
check whether we still have a problem with CLOG.

-Kevin

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#7)
Re: CLOG contention

On Wed, Dec 21, 2011 at 10:51 AM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Any thoughts on what makes most sense here?  I find it fairly
tempting to just crank up NUM_CLOG_BUFFERS and call it good,

The only thought I have to add to the discussion so far is that the need
to do anything may be reduced significantly by any work to write
hint bits more aggressively.  We only consult CLOG for tuples on
which hint bits have not yet been set, right?  What if, before
writing a page, we try to set hint bits where we can?  When
successful, it would not only prevent one or more later writes of
the page, but could also prevent having to load old CLOG pages.
Perhaps the hint bit issue should be addressed first, and *then* we
check whether we still have a problem with CLOG.

There may be workloads where that will help, but it's definitely not
going to cover all cases. Consider my trusty
pgbench-at-scale-factor-100 test case: since the working set fits
inside shared buffers, we're only writing pages at checkpoint time.
The contention happens because we randomly select rows from the table,
and whatever row we select hasn't been examined since it was last
updated, and so it's unhinted. But we're not reading the page in:
it's already in shared buffers, and has never been written out. I
don't see any realistic way to avoid the CLOG lookups in that case:
nobody else has had any reason to touch that page in any way since the
tuple was first written.

So I think we need a more general solution.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Alvaro Herrera
alvherre@commandprompt.com
In reply to: Robert Haas (#8)
Re: CLOG contention

Excerpts from Robert Haas's message of Wed Dec 21 13:18:36 -0300 2011:

There may be workloads where that will help, but it's definitely not
going to cover all cases. Consider my trusty
pgbench-at-scale-factor-100 test case: since the working set fits
inside shared buffers, we're only writing pages at checkpoint time.
The contention happens because we randomly select rows from the table,
and whatever row we select hasn't been examined since it was last
updated, and so it's unhinted. But we're not reading the page in:
it's already in shared buffers, and has never been written out. I
don't see any realistic way to avoid the CLOG lookups in that case:
nobody else has had any reason to touch that page in any way since the
tuple was first written.

Maybe we need a background "tuple hinter" process ...

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> writes:

I think there probably are some scalability limits to the current
implementation, but also I think we could probably increase the
current value modestly with something less than a total rewrite.
Linearly scanning the slot array won't scale indefinitely, but I think
it will scale to more than 8 elements. The performance results I
posted previously make it clear that 8 -> 32 is a net win at least on
that system.

Agreed, the question is whether 32 is enough to fix the problem for
anything except this one benchmark.

One fairly low-impact option might be to make the cache
less than fully associative - e.g. given N buffers, a page with pageno
% 4 == X is only allowed to be in a slot numbered between (N/4)*X and
(N/4)*(X+1)-1. That likely would be counterproductive at N = 8 but
might be OK at larger values.

I'm inclined to think that that specific arrangement wouldn't be good.
The normal access pattern for CLOG is, I believe, an exponentially
decaying probability-of-access for each page as you go further back from
current. We have a hack to pin the current (latest) page into SLRU all
the time, but you want the design to be such that the next-to-latest
page is most likely to still be around, then the second-latest, etc.

If I'm reading your equation correctly then the most recent pages would
compete against each other, not against much older pages, which is
exactly the wrong thing. Perhaps what you actually meant to say was
that all pages with the same number mod 4 are in one bucket, which would
be better, but still not really ideal: for instance the next-to-latest
page could end up getting removed while say the third-latest page is
still there because it's in a different associative bucket that's under
less pressure.

But possibly we could fix that with some other variant of the idea.
I certainly agree that strict LRU isn't an essential property here,
so long as we have a design that is matched to the expected access
pattern statistics.

We could also switch to using a hash
table but that seems awfully heavy-weight.

Yeah. If we're not going to go to hundreds of CLOG buffers, which
I think probably wouldn't be useful, then hashing is unlikely to be the
best answer.

The real question is how to decide how many buffers to create. You
suggested a formula based on shared_buffers, but what would that
formula be? I mean, a typical large system is going to have 1,048,576
shared buffers, and it probably needs less than 0.1% of that amount of
CLOG buffers.

Well, something like "0.1% with minimum of 8 and max of 32" might be
reasonable. What I'm mainly fuzzy about is the upper limit.
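Rendered in code, with Min/Max as in c.h, that suggestion would be
something like (the constants here are just the ones from this mail,
not a tested tuning):

/* 0.1% of shared_buffers, clamped to [8, 32] */
num_clog_buffers = Min(32, Max(8, NBuffers / 1000));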

regards, tom lane

#11 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#6)
Re: CLOG contention

On Wed, Dec 21, 2011 at 3:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 21, 2011 at 5:17 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

With the increased performance we have now, I don't think increasing
that alone will be that useful since it doesn't solve all of the
problems and (I am told) likely increases lookup time.

I have benchmarks showing that it works, for whatever that's worth.

The full list of clog problems I'm aware of is: raw lookup speed,
multi-user contention, writes at checkpoint and new xid allocation.

What is the best workload to show a bottleneck on raw lookup speed?

A microbenchmark.

I wouldn't expect writes at checkpoint to be a big problem because
it's so little data.

What's the problem with new XID allocation?

Earlier experience shows that those are areas of concern. You aren't
measuring response time in your tests, so you won't notice them as
problems. But they do affect throughput much more than intuition says
they would.

Would it be better just to have multiple SLRUs dedicated to the clog?
Simply partition things so we have 2^N sets of everything, and we look
up the xid in partition (xid % (2^N)).  That would overcome all of the
problems, not just lookup, in exactly the same way that we partitioned
the buffer and lock manager. We would use a graduated offset on the
page to avoid zeroing pages at the same time. Clog size wouldn't
increase, we'd have the same number of bits, just spread across 2^N
files. We'd have more pages too, but that's not a bad thing since it
spreads out the contention.

It seems that would increase memory requirements (clog1 through clog4
with 2 pages each doesn't sound workable).  It would also break
on-disk compatibility for pg_upgrade.  I'm still holding out hope that
we can find a simpler solution...

Not sure what you mean by "increase memory requirements". How would
increasing NUM_CLOG_BUFFERS = 64 differ from having NUM_CLOG_BUFFERS =
8 and NUM_CLOG_PARTITIONS = 8?

I think you appreciate that having 8 lwlocks rather than 1 might help
scalability.

I'm sure pg_upgrade can be tweaked easily enough and it would still
work quickly.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#12 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#10)
Re: CLOG contention

On Wed, Dec 21, 2011 at 11:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Agreed, the question is whether 32 is enough to fix the problem for
anything except this one benchmark.

Right. My thought on that topic is that it depends on what you mean
by "fix". It's clearly NOT possible to keep enough CLOG buffers
around to cover the entire range of XID space that might get probed,
at least not without some massive rethinking of the infrastructure.
It seems that the amount of space that might need to be covered there
is at least on the order of vacuum_freeze_table_age, which is to say
150 million by default. At 32K txns/page, that would require almost
5K pages, which is a lot more than 8.

On the other hand, if we just want to avoid having more requests
simultaneously in flight than we have buffers, so that backends don't
need to wait for an available buffer before beginning their I/O, then
something on the order of the number of CPUs in the machine is likely
sufficient. I'll do a little more testing and see if I can figure out
where the tipping point is on this 32-core box.

One fairly low-impact option might be to make the cache
less than fully associative - e.g. given N buffers, a page with pageno
% 4 == X is only allowed to be in a slot numbered between (N/4)*X and
(N/4)*(X+1)-1.  That likely would be counterproductive at N = 8 but
might be OK at larger values.

I'm inclined to think that that specific arrangement wouldn't be good.
The normal access pattern for CLOG is, I believe, an exponentially
decaying probability-of-access for each page as you go further back from
current.  We have a hack to pin the current (latest) page into SLRU all
the time, but you want the design to be such that the next-to-latest
page is most likely to still be around, then the second-latest, etc.

If I'm reading your equation correctly then the most recent pages would
compete against each other, not against much older pages, which is
exactly the wrong thing.  Perhaps what you actually meant to say was
that all pages with the same number mod 4 are in one bucket, which would
be better,

That's what I meant. I think the formula works out to that, but in
any case it's what I meant. :-)

but still not really ideal: for instance the next-to-latest
page could end up getting removed while say the third-latest page is
still there because it's in a different associative bucket that's under
less pressure.

Well, sure. But who is to say that's bad? I think you can find a way
to throw stones at any given algorithm we might choose to implement.
For example, if you contrive things so that you repeatedly access the
same old CLOG pages cyclically: 1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,...

...then our existing LRU algorithm will be anti-optimal, because we'll
keep the latest page plus the most recently accessed 7 old pages in
memory, and every lookup will fault out the page that the next lookup
is about to need. If you're not that excited about that happening in
real life, neither am I. But neither am I that excited about your
scenario: if the next-to-last page gets kicked out, there are a whole
bunch of pages -- maybe 8, if you imagine 32 buffers split 4 ways --
that have been accessed more recently than that next-to-last page. So
it wouldn't be resident in an 8-buffer pool either. Maybe the last
page was mostly transactions updating some infrequently-accessed
table, and we don't really need that page right now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#12)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> writes:

On Wed, Dec 21, 2011 at 11:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm inclined to think that that specific arrangement wouldn't be good.
The normal access pattern for CLOG is, I believe, an exponentially
decaying probability-of-access for each page as you go further back from
current. ... for instance the next-to-latest
page could end up getting removed while say the third-latest page is
still there because it's in a different associative bucket that's under
less pressure.

Well, sure. But who is to say that's bad? I think you can find a way
to throw stones at any given algorithm we might choose to implement.

The point I'm trying to make is that buffer management schemes like
that one are built on the assumption that the probability of access is
roughly uniform for all pages. We know (or at least have strong reason
to presume) that CLOG pages have very non-uniform probability of access.
The straight LRU scheme is good because it deals well with non-uniform
access patterns. Dividing the buffers into independent buckets in a way
that doesn't account for the expected access probabilities is going to
degrade things. (The approach Simon suggests nearby seems isomorphic to
yours and so suffers from this same objection, btw.)

For example, if you contrive things so that you repeatedly access the
same old CLOG pages cyclically: 1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8,...

Sure, and the reason that that's contrived is that it flies in the face
of reasonable assumptions about CLOG access probabilities. Any scheme
will lose some of the time, but you don't want to pick a scheme that is
more likely to lose for more probable access patterns.

It strikes me that one simple thing we could do is extend the current
heuristic that says "pin the latest page". That is, pin the last K
pages into SLRU, and apply LRU or some other method across the rest.
If K is large enough, that should get us down to where the differential
in access probability among the older pages is small enough to neglect,
and then we could apply associative bucketing or other methods to the
rest without fear of getting burnt by the common usage pattern. I don't
know what K would need to be, though. Maybe it's worth instrumenting
a benchmark run or two so we can get some facts rather than guesses
about the access frequencies?
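A sketch of how that might look inside SlruSelectLRUPage's victim
scan (K and the cutoff test are hypothetical, and page-number
wraparound is ignored for brevity):

#define CLOG_PINNED_PAGES	4	/* K; the right value needs benchmarking */

/* never evict one of the K most recent pages */
if (shared->page_number[slotno] >
	shared->latest_page_number - CLOG_PINNED_PAGES)
	continue;	/* skip this slot; treat it as pinned */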

regards, tom lane

#14 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#13)
Re: CLOG contention

On Wed, Dec 21, 2011 at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

It strikes me that one simple thing we could do is extend the current
heuristic that says "pin the latest page".  That is, pin the last K
pages into SLRU, and apply LRU or some other method across the rest.
If K is large enough, that should get us down to where the differential
in access probability among the older pages is small enough to neglect,
and then we could apply associative bucketing or other methods to the
rest without fear of getting burnt by the common usage pattern.  I don't
know what K would need to be, though.  Maybe it's worth instrumenting
a benchmark run or two so we can get some facts rather than guesses
about the access frequencies?

I guess the point is that it seems to me to depend rather heavily on
what benchmark you run. For something like pgbench, we initialize the
cluster with one or a few big transactions, so the page containing
those XIDs figures to stay hot for a very long time. Then after that
we choose rows to update randomly, which will produce the sort of
newer-pages-are-hotter-than-older-pages effect that you're talking
about. But the slope of the curve depends heavily on the scale
factor. If we have scale factor 1 (= 100,000 rows) then chances are
that when we randomly pick a row to update, we'll hit one that's been
touched within the last few hundred thousand updates - i.e. the last
couple of CLOG pages. But if we have scale factor 100 (= 10,000,000
rows) we might easily hit a row that hasn't been updated for many
millions of transactions, so there's going to be a much longer tail
there. And some other test could yield very different results - e.g.
something that uses lots of subtransactions might well have a much
longer tail, while something that does more than one update per
transaction would presumably have a shorter one.
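To put rough numbers on that, assuming each row's last update is
spread roughly uniformly over the most recent (number-of-rows)
transactions, and using 32K xids per CLOG page:

scale factor 1:      100,000 rows / 32,768 xids per page ≈   3 hot pages
scale factor 100: 10,000,000 rows / 32,768 xids per page ≈ 305 hot pages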

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#5)
Re: CLOG contention

On Wed, Dec 21, 2011 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think there probably are some scalability limits to the current
implementation, but also I think we could probably increase the
current value modestly with something less than a total rewrite.
Linearly scanning the slot array won't scale indefinitely, but I think
it will scale to more than 8 elements.  The performance results I
posted previously make it clear that 8 -> 32 is a net win at least on
that system.

Agreed to that, but I don't think it's nearly enough.

One fairly low-impact option might be to make the cache
less than fully associative - e.g. given N buffers, a page with pageno
% 4 == X is only allowed to be in a slot numbered between (N/4)*X and
(N/4)*(X+1)-1.  That likely would be counterproductive at N = 8 but
might be OK at larger values.

Which is pretty much the same as saying, yes, let's partition the clog
as I suggested, but by a different route.

We could also switch to using a hash
table but that seems awfully heavy-weight.

Which is a rewrite of SLRU from the ground up, and inappropriate for
most SLRU usage. We'd get partitioning "for free" as long as we rewrite.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#16 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#15)
Re: CLOG contention

On Wed, Dec 21, 2011 at 2:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Dec 21, 2011 at 3:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think there probably are some scalability limits to the current
implementation, but also I think we could probably increase the
current value modestly with something less than a total rewrite.
Linearly scanning the slot array won't scale indefinitely, but I think
it will scale to more than 8 elements.  The performance results I
posted previously make it clear that 8 -> 32 is a net win at least on
that system.

Agreed to that, but I don't think it's nearly enough.

One fairly low-impact option might be to make the cache
less than fully associative - e.g. given N buffers, a page with pageno
% 4 == X is only allowed to be in a slot numbered between (N/4)*X and
(N/4)*(X+1)-1.  That likely would be counterproductive at N = 8 but
might be OK at larger values.

Which is pretty much the same as saying, yes, let's partition the clog
as I suggested, but by a different route.

We could also switch to using a hash
table but that seems awfully heavy-weight.

Which is a rewrite of SLRU from the ground up, and inappropriate for
most SLRU usage. We'd get partitioning "for free" as long as we rewrite.

I'm not sure what your point is here. I feel like this is on the edge
of turning into an argument, and if we're going to have an argument
I'd like to know what we're arguing about. I am not arguing that
under no circumstances should we partition anything related to CLOG,
nor am I trying to deny you credit for your ideas. I'm merely saying
that the specific plan of having multiple SLRUs for CLOG doesn't
appeal to me -- mostly because I think it will make life difficult for
pg_upgrade without any compensating advantage. If we're going to go
that route, I'd rather build something into the SLRU machinery
generally that allows for the cache to be less than fully-associative,
with all of the savings in terms of lock contention that this entails.
Such a system could be used by any SLRU, not just CLOG, if it proved
to be helpful; and it would avoid any on-disk changes, with, as far as
I can see, basically no downside.

That having been said, Tom isn't convinced that any form of
partitioning is the right way to go, and since Tom often has good
ideas, I'd like to explore his notions of how we might fix this
problem other than via some form of partitioning before we focus in on
partitioning. Partitioning may ultimately be the right way to go, but
let's keep an open mind: this thread is only 14 hours old. The only
things I'm completely convinced of at this point are (1) we need more
CLOG buffers (but I don't know exactly how many) and (2) the current
code isn't designed to manage large numbers of buffers (but I don't
know exactly where it starts to fall over).

If I'm completely misunderstanding the point of your email, please set
me straight (gently).

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#12)
Re: CLOG contention

On Wed, Dec 21, 2011 at 12:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On the other hand, if we just want to avoid having more requests
simultaneously in flight than we have buffers, so that backends don't
need to wait for an available buffer before beginning their I/O, then
something on the order of the number of CPUs in the machine is likely
sufficient.  I'll do a little more testing and see if I can figure out
where the tipping point is on this 32-core box.

I recompiled with NUM_CLOG_BUFFERS = 8, 16, 24, 32, 40, 48 and ran
5-minute tests, using unlogged tables to avoid getting killed by
WALInsertLock contention. With 32 clients on this 32-core box, the
tipping point is somewhere in the neighborhood of 32 buffers. 40
buffers might still be winning over 32, or maybe not, but 48 is
definitely losing. Below 32, more is better, all the way up. Here
are the full results:

resultswu.clog16.32.100.300:tps = 19549.454462 (including connections establishing)
resultswu.clog16.32.100.300:tps = 19883.583245 (including connections establishing)
resultswu.clog16.32.100.300:tps = 19984.857186 (including connections establishing)
resultswu.clog24.32.100.300:tps = 20124.147651 (including connections establishing)
resultswu.clog24.32.100.300:tps = 20108.504407 (including connections establishing)
resultswu.clog24.32.100.300:tps = 20303.964120 (including connections establishing)
resultswu.clog32.32.100.300:tps = 20573.873097 (including connections establishing)
resultswu.clog32.32.100.300:tps = 20444.289259 (including connections establishing)
resultswu.clog32.32.100.300:tps = 20234.209965 (including connections establishing)
resultswu.clog40.32.100.300:tps = 21762.222195 (including connections establishing)
resultswu.clog40.32.100.300:tps = 20621.749677 (including connections establishing)
resultswu.clog40.32.100.300:tps = 20290.990673 (including connections establishing)
resultswu.clog48.32.100.300:tps = 19253.424997 (including connections establishing)
resultswu.clog48.32.100.300:tps = 19542.095191 (including connections establishing)
resultswu.clog48.32.100.300:tps = 19284.962036 (including connections establishing)
resultswu.master.32.100.300:tps = 18694.886622 (including connections establishing)
resultswu.master.32.100.300:tps = 18417.647703 (including connections establishing)
resultswu.master.32.100.300:tps = 18331.718955 (including connections establishing)

Parameters in use: shared_buffers = 8GB, maintenance_work_mem = 1GB,
synchronous_commit = off, checkpoint_segments = 300,
checkpoint_timeout = 15min, checkpoint_completion_target = 0.9,
wal_writer_delay = 20ms

It isn't clear to me whether we can extrapolate anything more general
from this. It'd be awfully interesting to repeat this experiment on,
say, an 8-core server, but I don't have one of those I can use at the
moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#16)
1 attachment(s)
Re: CLOG contention

On Wed, Dec 21, 2011 at 7:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I am not arguing

This seems like a normal and cool technical discussion to me.

 I'm merely saying
that the specific plan of having multiple SLRUs for CLOG doesn't
appeal to me -- mostly because I think it will make life difficult for
pg_upgrade without any compensating advantage.  If we're going to go
that route, I'd rather build something into the SLRU machinery
generally that allows for the cache to be less than fully-associative,
with all of the savings in terms of lock contention that this entails.
 Such a system could be used by any SLRU, not just CLOG, if it proved
to be helpful; and it would avoid any on-disk changes, with, as far as
I can see, basically no downside.

Partitioning will give us more buffers and more LWlocks, to spread the
contention when we access the buffers. I use that word because its
what we call the technique already used in the buffer manager and lock
manager. If you wish to call this "less than fully-associative" I
really don't mind, as long as we're discussing the same overall
concept, so we can then focus on an implementation of that concept,
which no doubt has many ways of doing it.

More buffers per lock does reduce the lock contention somewhat, but
not by much. So for me, it seems essential that we have more LWlocks
to solve the problem, which is where partitioning comes in.

My perspective is that there is clog contention in many places, not
just in the ones you identified. Main places I see are:

* Access to older pages (identified by you upthread). More buffers
addresses this problem.

* Committing requires us to hold exclusive lock on a page, so there is
contention from nearly all sessions for the same page. The only way to
solve that is by striping pages, so that one page in the current clog
architecture would be striped across N pages with consecutive xids in
separate partitions. Notably this addresses Tom's concern that there
is a much higher request rate on very recent pages - each page would
be split into N pages, so reducing contention.

* We allocate a new clog page every 32k xids. At the rates you have
now measured, we will do this every 1-2 seconds. When we do this, we
must allocate a new page, which means writing the LRU page, which will
be dirty, since we fill 8 buffers in 16 seconds (or even 32 buffers in
about a minute), yet only flush buffers at checkpoint every 5 minutes.
We then need to write an XLogRecord for the new page. All of that
happens while we have the XidGenLock held. Also, while this is
happening nothing can commit, or check clog. That causes nearly all
work to halt for about a second, perhaps longer while the traffic
queue clears. More obvious when writing to logged tables, since the
XLogInsert for the new clog page is then very badly contended. If we
partition then we will be able to continue accessing most of the clog
pages.

So I think we need
* more buffers
* clog page striping
* partitioning

And I would characterise what I am suggesting as "partitioning +
striping" with the free benefit that we increase the number of buffers
as well via partitioning.
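To make the striping concrete, here is how the macros in the attached
patch behave with an assumed NUM_CLOG_PARTITIONS of 8:

/* TransactionIdToPartition(xid) = xid % 8 */
/* XidStripe(xid) = 8 * (xid / 8), so xids 96..103 share stripe 96 */
/* xids 96..103 therefore map to partitions 0..7 with the same page
 * index, and eight concurrent committers hold eight different
 * control locks instead of fighting over one. */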

With all of that in mind, its relatively easy to rewrite the clog code
so we allocate N SLRUs rather than just 1. That means we just touch
the clog code. Striping adjacent xids onto separate pages in other
ways would gut the SLRU code. We could just partition but then won't
address Tom's concern, as you say. That is based upon code analysis
and hacking something together while thinking - in case it helps the
discussion, I'm posting that hack here, but it's not working yet. I
don't think reusing
code from bufmgr/lockmgr would help either.

Yes, you're right that I'm suggesting we change the clog data
structures and that therefore we'd need to change pg_upgrade as well.
But that seems like a relatively simple piece of code given the clear
mapping between old and new structures. It would be able to run
quickly at upgrade time.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

clog_partitioning.v0.1.patch (text/x-patch; charset=US-ASCII)
*** a/src/backend/access/transam/clog.c
--- b/src/backend/access/transam/clog.c
***************
*** 54,63 ****
  #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
  #define CLOG_XACT_BITMASK	((1 << CLOG_BITS_PER_XACT) - 1)
  
! #define TransactionIdToPage(xid)	((xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
! #define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
! #define TransactionIdToByte(xid)	(TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
! #define TransactionIdToBIndex(xid)	((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
  
  /* We store the latest async LSN for each group of transactions */
  #define CLOG_XACTS_PER_LSN_GROUP	32	/* keep this a power of 2 */
--- 54,65 ----
  #define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
  #define CLOG_XACT_BITMASK	((1 << CLOG_BITS_PER_XACT) - 1)
  
! #define XidStripe(xid) (NUM_CLOG_PARTITIONS * ((xid) / (TransactionId) NUM_CLOG_PARTITIONS))
! #define TransactionIdToPartition(xid) ((xid) % (TransactionId) NUM_CLOG_PARTITIONS)
! #define TransactionIdToPage(xid)	(XidStripe(xid) / (TransactionId) CLOG_XACTS_PER_PAGE)
! #define TransactionIdToPgIndex(xid) (XidStripe(xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
! #define TransactionIdToByte(xid)	(TransactionIdToPgIndex(XidStripe(xid)) / CLOG_XACTS_PER_BYTE)
! #define TransactionIdToBIndex(xid)	(XidStripe(xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
  
  /* We store the latest async LSN for each group of transactions */
  #define CLOG_XACTS_PER_LSN_GROUP	32	/* keep this a power of 2 */
***************
*** 66,88 ****
  #define GetLSNIndex(slotno, xid)	((slotno) * CLOG_LSNS_PER_PAGE + \
  	((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)
  
  
  /*
   * Link to shared-memory data structures for CLOG control
   */
! static SlruCtlData ClogCtlData;
! 
! #define ClogCtl (&ClogCtlData)
! 
  
! static int	ZeroCLOGPage(int pageno, bool writeXlog);
  static bool CLOGPagePrecedes(int page1, int page2);
! static void WriteZeroPageXlogRec(int pageno);
! static void WriteTruncateXlogRec(int pageno);
! static void TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
  						   TransactionId *subxids, XidStatus status,
  						   XLogRecPtr lsn, int pageno);
! static void TransactionIdSetStatusBit(TransactionId xid, XidStatus status,
  						  XLogRecPtr lsn, int slotno);
  static void set_status_by_pages(int nsubxids, TransactionId *subxids,
  					XidStatus status, XLogRecPtr lsn);
--- 68,92 ----
  #define GetLSNIndex(slotno, xid)	((slotno) * CLOG_LSNS_PER_PAGE + \
  	((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)
  
+ typedef struct xl_clog_page
+ {
+ 	int			partition;
+ 	int			pageno;
+ } xl_clog_page;
  
  /*
   * Link to shared-memory data structures for CLOG control
   */
! static SlruCtlData ClogCtl[NUM_CLOG_PARTITIONS];
  
! static int	ZeroCLOGPage(int partition, int pageno, bool writeXlog);
  static bool CLOGPagePrecedes(int page1, int page2);
! static void WriteZeroPageXlogRec(int partition, int pageno);
! static void WriteTruncateXlogRec(int partition, int pageno);
! static void TransactionIdSetPageStatus(int partition, TransactionId xid, int nsubxids,
  						   TransactionId *subxids, XidStatus status,
  						   XLogRecPtr lsn, int pageno);
! static void TransactionIdSetStatusBit(int partition, TransactionId xid, XidStatus status,
  						  XLogRecPtr lsn, int slotno);
  static void set_status_by_pages(int nsubxids, TransactionId *subxids,
  					XidStatus status, XLogRecPtr lsn);
***************
*** 144,149 **** TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
--- 148,154 ----
  					TransactionId *subxids, XidStatus status, XLogRecPtr lsn)
  {
  	int			pageno = TransactionIdToPage(xid);		/* get page of parent */
+ 	int			partition = TransactionIdToPartition(xid); /* of parent */
  	int			i;
  
  	Assert(status == TRANSACTION_STATUS_COMMITTED ||
***************
*** 151,161 **** TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
  
  	/*
  	 * See how many subxids, if any, are on the same page as the parent, if
! 	 * any.
  	 */
  	for (i = 0; i < nsubxids; i++)
  	{
! 		if (TransactionIdToPage(subxids[i]) != pageno)
  			break;
  	}
  
--- 156,167 ----
  
  	/*
  	 * See how many subxids, if any, are on the same page as the parent, if
! 	 * any. Notice that having more partitions most likely reduces this number.
  	 */
  	for (i = 0; i < nsubxids; i++)
  	{
! 		if (TransactionIdToPage(subxids[i]) != pageno ||
! 			TransactionIdToPartition(subxids[i]) != partition)
  			break;
  	}
  
***************
*** 167,173 **** TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
  		/*
  		 * Set the parent and all subtransactions in a single call
  		 */
! 		TransactionIdSetPageStatus(xid, nsubxids, subxids, status, lsn,
  								   pageno);
  	}
  	else
--- 173,179 ----
  		/*
  		 * Set the parent and all subtransactions in a single call
  		 */
! 		TransactionIdSetPageStatus(partition, xid, nsubxids, subxids, status, lsn,
  								   pageno);
  	}
  	else
***************
*** 183,188 **** TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
--- 189,198 ----
  		 *
  		 * To avoid touching the first page twice, skip marking subcommitted
  		 * for the subxids on that first page.
+ 		 *
+ 		 * Notice that all the complexity of clog partitions is hidden within
+ 		 * set_status_by_pages. The parent transaction still exists on one
+ 		 * page in one partition, so that part is unchanged by partitioning.
  		 */
  		if (status == TRANSACTION_STATUS_COMMITTED)
  			set_status_by_pages(nsubxids - nsubxids_on_first_page,
***************
*** 194,200 **** TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
  		 * if any
  		 */
  		pageno = TransactionIdToPage(xid);
! 		TransactionIdSetPageStatus(xid, nsubxids_on_first_page, subxids, status,
  								   lsn, pageno);
  
  		/*
--- 204,210 ----
  		 * if any
  		 */
  		pageno = TransactionIdToPage(xid);
! 		TransactionIdSetPageStatus(partition, xid, nsubxids_on_first_page, subxids, status,
  								   lsn, pageno);
  
  		/*
***************
*** 212,217 **** TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
--- 222,228 ----
   * transactions, chunking in the separate CLOG pages involved. We never
   * pass the whole transaction tree to this function, only subtransactions
   * that are on different pages to the top level transaction id.
+  * We sift the array once for each partition.
   */
  static void
  set_status_by_pages(int nsubxids, TransactionId *subxids,
***************
*** 220,241 **** set_status_by_pages(int nsubxids, TransactionId *subxids,
  	int			pageno = TransactionIdToPage(subxids[0]);
  	int			offset = 0;
  	int			i = 0;
  
! 	while (i < nsubxids)
  	{
! 		int			num_on_page = 0;
! 
! 		while (TransactionIdToPage(subxids[i]) == pageno && i < nsubxids)
  		{
! 			num_on_page++;
! 			i++;
  		}
  
! 		TransactionIdSetPageStatus(InvalidTransactionId,
! 								   num_on_page, subxids + offset,
! 								   status, lsn, pageno);
! 		offset = i;
! 		pageno = TransactionIdToPage(subxids[offset]);
  	}
  }
  
--- 231,280 ----
  	int			pageno = TransactionIdToPage(subxids[0]);
  	int			offset = 0;
  	int			i = 0;
+ 	int			partition;
+ 	int			part_nsubxids;
+ 	int			max_part_nsubxids = 32;
+ 	TransactionId *part_subxids = palloc(32 * sizeof(TransactionId));
  
! 	for (partition = 0; partition < NUM_CLOG_PARTITIONS; partition++)
  	{
! 		part_nsubxids = 0;
! 		for (i = 0; i < nsubxids; i++)
  		{
! 			/*
! 			 * Collect up all the xids for this partition
! 			 */
! 			if (TransactionIdToPartition(subxids[i]) == partition)
! 			{
! 				part_subxids[part_nsubxids++] = subxids[i];
! 				if (part_nsubxids >= max_part_nsubxids)
! 				{
! 					max_part_nsubxids *= 2;
! 					part_subxids = repalloc(part_subxids, max_part_nsubxids * sizeof(TransactionId));
! 				}
! 			}
  		}
  
! 		/*
! 		 * Now apply the changes by page, just for this partition
! 		 */
! 		i = 0;
! 		while (i < part_nsubxids)
! 		{
! 			int			num_on_page = 0;
! 
! 			while (TransactionIdToPage(part_subxids[i]) == pageno && i < part_nsubxids)
! 			{
! 				num_on_page++;
! 				i++;
! 			}
! 
! 			TransactionIdSetPageStatus(partition, InvalidTransactionId,
! 									   num_on_page, part_subxids + offset,
! 									   status, lsn, pageno);
! 			offset = i;
! 			pageno = TransactionIdToPage(part_subxids[offset]);
! 		}
  	}
  }
  
***************
*** 243,263 **** set_status_by_pages(int nsubxids, TransactionId *subxids,
   * Record the final state of transaction entries in the commit log for
   * all entries on a single page.  Atomic only on this page.
   *
!  * Otherwise API is same as TransactionIdSetTreeStatus()
   */
  static void
! TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
  						   TransactionId *subxids, XidStatus status,
  						   XLogRecPtr lsn, int pageno)
  {
  	int			slotno;
  	int			i;
  
  	Assert(status == TRANSACTION_STATUS_COMMITTED ||
  		   status == TRANSACTION_STATUS_ABORTED ||
  		   (status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
  
! 	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
  
  	/*
  	 * If we're doing an async commit (ie, lsn is valid), then we must wait
--- 282,304 ----
   * Record the final state of transaction entries in the commit log for
   * all entries on a single page.  Atomic only on this page.
   *
!  * Otherwise API is same as TransactionIdSetTreeStatus(), apart from
!  * the partition the page number is in.
   */
  static void
! TransactionIdSetPageStatus(int partition, TransactionId xid, int nsubxids,
  						   TransactionId *subxids, XidStatus status,
  						   XLogRecPtr lsn, int pageno)
  {
  	int			slotno;
  	int			i;
+ 	SlruCtlData *ClogCtlP = &ClogCtl[partition];
  
  	Assert(status == TRANSACTION_STATUS_COMMITTED ||
  		   status == TRANSACTION_STATUS_ABORTED ||
  		   (status == TRANSACTION_STATUS_SUB_COMMITTED && !TransactionIdIsValid(xid)));
  
! 	LWLockAcquire(ClogCtlP->shared->ControlLock, LW_EXCLUSIVE);
  
  	/*
  	 * If we're doing an async commit (ie, lsn is valid), then we must wait
***************
*** 268,274 **** TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
  	 * write-busy, since we don't care if the update reaches disk sooner than
  	 * we think.
  	 */
! 	slotno = SimpleLruReadPage(ClogCtl, pageno, XLogRecPtrIsInvalid(lsn), xid);
  
  	/*
  	 * Set the main transaction id, if any.
--- 309,315 ----
  	 * write-busy, since we don't care if the update reaches disk sooner than
  	 * we think.
  	 */
! 	slotno = SimpleLruReadPage(ClogCtlP, pageno, XLogRecPtrIsInvalid(lsn), xid);
  
  	/*
  	 * Set the main transaction id, if any.
***************
*** 286,312 **** TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
  		{
  			for (i = 0; i < nsubxids; i++)
  			{
! 				Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
! 				TransactionIdSetStatusBit(subxids[i],
  										  TRANSACTION_STATUS_SUB_COMMITTED,
  										  lsn, slotno);
  			}
  		}
  
  		/* ... then the main transaction */
! 		TransactionIdSetStatusBit(xid, status, lsn, slotno);
  	}
  
  	/* Set the subtransactions */
  	for (i = 0; i < nsubxids; i++)
  	{
! 		Assert(ClogCtl->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
! 		TransactionIdSetStatusBit(subxids[i], status, lsn, slotno);
  	}
  
! 	ClogCtl->shared->page_dirty[slotno] = true;
  
! 	LWLockRelease(CLogControlLock);
  }
  
  /*
--- 327,353 ----
  		{
  			for (i = 0; i < nsubxids; i++)
  			{
! 				Assert(ClogCtlP->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
! 				TransactionIdSetStatusBit(partition, subxids[i],
  										  TRANSACTION_STATUS_SUB_COMMITTED,
  										  lsn, slotno);
  			}
  		}
  
  		/* ... then the main transaction */
! 		TransactionIdSetStatusBit(partition, xid, status, lsn, slotno);
  	}
  
  	/* Set the subtransactions */
  	for (i = 0; i < nsubxids; i++)
  	{
! 		Assert(ClogCtlP->shared->page_number[slotno] == TransactionIdToPage(subxids[i]));
! 		TransactionIdSetStatusBit(partition, subxids[i], status, lsn, slotno);
  	}
  
! 	ClogCtlP->shared->page_dirty[slotno] = true;
  
! 	LWLockRelease(ClogCtlP->shared->ControlLock);
  }
  
  /*
***************
*** 315,329 **** TransactionIdSetPageStatus(TransactionId xid, int nsubxids,
   * Must be called with CLogControlLock held
   */
  static void
! TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
  {
  	int			byteno = TransactionIdToByte(xid);
  	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
  	char	   *byteptr;
  	char		byteval;
  	char		curval;
  
! 	byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
  	curval = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
  
  	/*
--- 356,373 ----
   * Must be called with CLogControlLock held
   */
  static void
! TransactionIdSetStatusBit(int partition, TransactionId xid, XidStatus status, XLogRecPtr lsn, int slotno)
  {
  	int			byteno = TransactionIdToByte(xid);
  	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
  	char	   *byteptr;
  	char		byteval;
  	char		curval;
+ 	SlruCtlData *ClogCtlP = &ClogCtl[partition];
  
! 	Assert(TransactionIdToPartition(xid) == partition);
! 
! 	byteptr = ClogCtlP->shared->page_buffer[slotno] + byteno;
  	curval = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
  
  	/*
***************
*** 363,370 **** TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, i
  	{
  		int			lsnindex = GetLSNIndex(slotno, xid);
  
! 		if (XLByteLT(ClogCtl->shared->group_lsn[lsnindex], lsn))
! 			ClogCtl->shared->group_lsn[lsnindex] = lsn;
  	}
  }
  
--- 407,414 ----
  	{
  		int			lsnindex = GetLSNIndex(slotno, xid);
  
! 		if (XLByteLT(ClogCtlP->shared->group_lsn[lsnindex], lsn))
! 			ClogCtlP->shared->group_lsn[lsnindex] = lsn;
  	}
  }
  
***************
*** 386,391 **** TransactionIdSetStatusBit(TransactionId xid, XidStatus status, XLogRecPtr lsn, i
--- 430,436 ----
  XidStatus
  TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
  {
+ 	int			partition = TransactionIdToPartition(xid);
  	int			pageno = TransactionIdToPage(xid);
  	int			byteno = TransactionIdToByte(xid);
  	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
***************
*** 393,410 **** TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
  	int			lsnindex;
  	char	   *byteptr;
  	XidStatus	status;
  
  	/* lock is acquired by SimpleLruReadPage_ReadOnly */
  
! 	slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);
! 	byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
  
  	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
  
  	lsnindex = GetLSNIndex(slotno, xid);
! 	*lsn = ClogCtl->shared->group_lsn[lsnindex];
  
! 	LWLockRelease(CLogControlLock);
  
  	return status;
  }
--- 438,456 ----
  	int			lsnindex;
  	char	   *byteptr;
  	XidStatus	status;
+ 	SlruCtlData *ClogCtlP = &ClogCtl[partition];
  
  	/* lock is acquired by SimpleLruReadPage_ReadOnly */
  
! 	slotno = SimpleLruReadPage_ReadOnly(ClogCtlP, pageno, xid);
! 	byteptr = ClogCtlP->shared->page_buffer[slotno] + byteno;
  
  	status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
  
  	lsnindex = GetLSNIndex(slotno, xid);
! 	*lsn = ClogCtlP->shared->group_lsn[lsnindex];
  
! 	LWLockRelease(ClogCtlP->shared->ControlLock);
  
  	return status;
  }
***************
*** 416,430 **** TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
  Size
  CLOGShmemSize(void)
  {
! 	return SimpleLruShmemSize(NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE);
  }
  
  void
  CLOGShmemInit(void)
  {
  	ClogCtl->PagePrecedes = CLOGPagePrecedes;
! 	SimpleLruInit(ClogCtl, "CLOG Ctl", NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE,
! 				  CLogControlLock, "pg_clog");
  }
  
  /*
--- 462,476 ----
  Size
  CLOGShmemSize(void)
  {
! 	return SimpleLruShmemSize(NUM_CLOG_PARTITIONS, NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE);
  }
  
  void
  CLOGShmemInit(void)
  {
  	ClogCtl->PagePrecedes = CLOGPagePrecedes;
! 		SimpleLruInit(ClogCtl, "CLOG Ctl", NUM_CLOG_PARTITIONS, NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE,
! 				  FirstClogControlLock, "pg_clog");
  }
  
  /*
***************
*** 437,453 **** void
  BootStrapCLOG(void)
  {
  	int			slotno;
  
! 	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
  
! 	/* Create and zero the first page of the commit log */
! 	slotno = ZeroCLOGPage(0, false);
  
! 	/* Make sure it's written out */
! 	SimpleLruWritePage(ClogCtl, slotno);
! 	Assert(!ClogCtl->shared->page_dirty[slotno]);
  
! 	LWLockRelease(CLogControlLock);
  }
  
  /*
--- 483,505 ----
  BootStrapCLOG(void)
  {
  	int			slotno;
+ 	int			partition;
  
! 	for (partition = 0; partition < NUM_CLOG_PARTITIONS; partition++)
! 	{
! 		SlruCtlData *ClogCtlP = &ClogCtl[partition];
  
! 		LWLockAcquire(ClogCtlP->shared->ControlLock, LW_EXCLUSIVE);
  
! 		/* Create and zero the first page of the commit log */
! 		slotno = ZeroCLOGPage(partition, 0, false);
  
! 		/* Make sure it's written out */
! 		SimpleLruWritePage(ClogCtlP, slotno);
! 		Assert(!ClogCtlP->shared->page_dirty[slotno]);
! 
! 		LWLockRelease(ClogCtlP->shared->ControlLock);
! 	}
  }
  
  /*
***************
*** 460,473 **** BootStrapCLOG(void)
   * Control lock must be held at entry, and will be held at exit.
   */
  static int
! ZeroCLOGPage(int pageno, bool writeXlog)
  {
  	int			slotno;
  
! 	slotno = SimpleLruZeroPage(ClogCtl, pageno);
  
  	if (writeXlog)
! 		WriteZeroPageXlogRec(pageno);
  
  	return slotno;
  }
--- 512,525 ----
   * Control lock must be held at entry, and will be held at exit.
   */
  static int
! ZeroCLOGPage(int partition, int pageno, bool writeXlog)
  {
  	int			slotno;
  
! 	slotno = SimpleLruZeroPage(&ClogCtl[partition], pageno);
  
  	if (writeXlog)
! 		WriteZeroPageXlogRec(partition, pageno);
  
  	return slotno;
  }
***************
*** 481,495 **** StartupCLOG(void)
  {
  	TransactionId xid = ShmemVariableCache->nextXid;
  	int			pageno = TransactionIdToPage(xid);
  
! 	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
  
! 	/*
! 	 * Initialize our idea of the latest page number.
! 	 */
! 	ClogCtl->shared->latest_page_number = pageno;
  
! 	LWLockRelease(CLogControlLock);
  }
  
  /*
--- 533,553 ----
  {
  	TransactionId xid = ShmemVariableCache->nextXid;
  	int			pageno = TransactionIdToPage(xid);
+ 	int			partition;
+ 
+ 	for (partition = 0; partition < NUM_CLOG_PARTITIONS; partition++)
+ 	{
+ 		SlruCtlData *ClogCtlP = &ClogCtl[partition];
  
! 		LWLockAcquire(ClogCtlP->shared->ControlLock, LW_EXCLUSIVE);
  
! 		/*
! 		 * Initialize our idea of the latest page number.
! 		 */
! 		ClogCtlP->shared->latest_page_number = pageno;
  
! 		LWLockRelease(ClogCtlP->shared->ControlLock);
! 	}
  }
  
  /*
***************
*** 500,544 **** TrimCLOG(void)
  {
  	TransactionId xid = ShmemVariableCache->nextXid;
  	int			pageno = TransactionIdToPage(xid);
  
! 	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
  
! 	/*
! 	 * Re-Initialize our idea of the latest page number.
! 	 */
! 	ClogCtl->shared->latest_page_number = pageno;
  
! 	/*
! 	 * Zero out the remainder of the current clog page.  Under normal
! 	 * circumstances it should be zeroes already, but it seems at least
! 	 * theoretically possible that XLOG replay will have settled on a nextXID
! 	 * value that is less than the last XID actually used and marked by the
! 	 * previous database lifecycle (since subtransaction commit writes clog
! 	 * but makes no WAL entry).  Let's just be safe. (We need not worry about
! 	 * pages beyond the current one, since those will be zeroed when first
! 	 * used.  For the same reason, there is no need to do anything when
! 	 * nextXid is exactly at a page boundary; and it's likely that the
! 	 * "current" page doesn't exist yet in that case.)
! 	 */
! 	if (TransactionIdToPgIndex(xid) != 0)
! 	{
! 		int			byteno = TransactionIdToByte(xid);
! 		int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
! 		int			slotno;
! 		char	   *byteptr;
  
! 		slotno = SimpleLruReadPage(ClogCtl, pageno, false, xid);
! 		byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
  
! 		/* Zero so-far-unused positions in the current byte */
! 		*byteptr &= (1 << bshift) - 1;
! 		/* Zero the rest of the page */
! 		MemSet(byteptr + 1, 0, BLCKSZ - byteno - 1);
  
! 		ClogCtl->shared->page_dirty[slotno] = true;
! 	}
  
! 	LWLockRelease(CLogControlLock);
  }
  
  /*
--- 558,608 ----
  {
  	TransactionId xid = ShmemVariableCache->nextXid;
  	int			pageno = TransactionIdToPage(xid);
+ 	int			partition;
  
! 	for (partition = 0; partition < NUM_CLOG_PARTITIONS; partition++)
! 	{
! 		SlruCtlData *ClogCtlP = &ClogCtl[partition];
  
! 		LWLockAcquire(ClogCtlP->shared->ControlLock, LW_EXCLUSIVE);
  
! 		/*
! 		 * Re-Initialize our idea of the latest page number.
! 		 */
! 		ClogCtlP->shared->latest_page_number = pageno;
  
! 		/*
! 		 * Zero out the remainder of the current clog page.  Under normal
! 		 * circumstances it should be zeroes already, but it seems at least
! 		 * theoretically possible that XLOG replay will have settled on a nextXID
! 		 * value that is less than the last XID actually used and marked by the
! 		 * previous database lifecycle (since subtransaction commit writes clog
! 		 * but makes no WAL entry).  Let's just be safe. (We need not worry about
! 		 * pages beyond the current one, since those will be zeroed when first
! 		 * used.  For the same reason, there is no need to do anything when
! 		 * nextXid is exactly at a page boundary; and it's likely that the
! 		 * "current" page doesn't exist yet in that case.)
! 		 */
! 		if (TransactionIdToPgIndex(xid) != 0)
! 		{
! 			int			byteno = TransactionIdToByte(xid);	/* XXX fix me! */
! 			int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
! 			int			slotno;
! 			char	   *byteptr;
  
! 			slotno = SimpleLruReadPage(ClogCtlP, pageno, false, xid);
! 			byteptr = ClogCtlP->shared->page_buffer[slotno] + byteno;
  
! 			/* Zero so-far-unused positions in the current byte */
! 			*byteptr &= (1 << bshift) - 1;
! 			/* Zero the rest of the page */
! 			MemSet(byteptr + 1, 0, BLCKSZ - byteno - 1);
  
! 			ClogCtlP->shared->page_dirty[slotno] = true;
! 		}
! 
! 		LWLockRelease(ClogCtlP->shared->ControlLock);
! 	}
  }
  
  /*
***************
*** 547,555 **** TrimCLOG(void)
  void
  ShutdownCLOG(void)
  {
  	/* Flush dirty CLOG pages to disk */
  	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
! 	SimpleLruFlush(ClogCtl, false);
  	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
  }
  
--- 611,622 ----
  void
  ShutdownCLOG(void)
  {
+ 	int			partition;
+ 
  	/* Flush dirty CLOG pages to disk */
  	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
! 	for (partition = 0; partition < NUM_CLOG_PARTITIONS; partition++)
! 		SimpleLruFlush(&ClogCtl[partition], false);
  	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
  }
  
***************
*** 559,567 **** ShutdownCLOG(void)
  void
  CheckPointCLOG(void)
  {
  	/* Flush dirty CLOG pages to disk */
  	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
! 	SimpleLruFlush(ClogCtl, true);
  	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
  }
  
--- 626,637 ----
  void
  CheckPointCLOG(void)
  {
+ 	int			partition;
+ 
  	/* Flush dirty CLOG pages to disk */
  	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
! 	for (partition = 0; partition < NUM_CLOG_PARTITIONS; partition++)
! 		SimpleLruFlush(&ClogCtl[partition], true);
  	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
  }
  
***************
*** 578,583 **** void
--- 648,655 ----
  ExtendCLOG(TransactionId newestXact)
  {
  	int			pageno;
+ 	int			partition = TransactionIdToPartition(newestXact);
+ 	SlruCtlData *ClogCtlP = &ClogCtl[partition];
  
  	/*
  	 * No work except at first XID of a page.  But beware: just after
***************
*** 589,600 **** ExtendCLOG(TransactionId newestXact)
  
  	pageno = TransactionIdToPage(newestXact);
  
! 	LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
  
  	/* Zero the page and make an XLOG entry about it */
! 	ZeroCLOGPage(pageno, !InRecovery);
  
! 	LWLockRelease(CLogControlLock);
  }
  
  
--- 661,672 ----
  
  	pageno = TransactionIdToPage(newestXact);
  
! 	LWLockAcquire(ClogCtlP->shared->ControlLock, LW_EXCLUSIVE);
  
  	/* Zero the page and make an XLOG entry about it */
! 	ZeroCLOGPage(partition, pageno, !InRecovery);
  
! 	LWLockRelease(ClogCtlP->shared->ControlLock);
  }
  
  
***************
*** 617,622 **** void
--- 689,695 ----
  TruncateCLOG(TransactionId oldestXact)
  {
  	int			cutoffPage;
+ 	int			partition;
  
  	/*
  	 * The cutoff point is the start of the segment containing oldestXact. We
***************
*** 624,638 **** TruncateCLOG(TransactionId oldestXact)
  	 */
  	cutoffPage = TransactionIdToPage(oldestXact);
  
! 	/* Check to see if there's any files that could be removed */
! 	if (!SlruScanDirectory(ClogCtl, SlruScanDirCbReportPresence, &cutoffPage))
! 		return;					/* nothing to remove */
  
! 	/* Write XLOG record and flush XLOG to disk */
! 	WriteTruncateXlogRec(cutoffPage);
  
! 	/* Now we can remove the old CLOG segment(s) */
! 	SimpleLruTruncate(ClogCtl, cutoffPage);
  }
  
  
--- 697,716 ----
  	 */
  	cutoffPage = TransactionIdToPage(oldestXact);
  
! 	for (partition = 0; partition < NUM_CLOG_PARTITIONS; partition++)
! 	{
! 		SlruCtlData *ClogCtlP = &ClogCtl[partition];
! 
! 		/* Check to see if there's any files that could be removed */
! 		if (!SlruScanDirectory(ClogCtlP, SlruScanDirCbReportPresence, &cutoffPage))
! 			continue;					/* nothing to remove */
  
! 		/* Write XLOG record and flush XLOG to disk */
! 		WriteTruncateXlogRec(partition, cutoffPage);
  
! 		/* Now we can remove the old CLOG segment(s) */
! 		SimpleLruTruncate(ClogCtlP, cutoffPage);
! 	}
  }
  
  
***************
*** 664,675 **** CLOGPagePrecedes(int page1, int page2)
   * Write a ZEROPAGE xlog record
   */
  static void
! WriteZeroPageXlogRec(int pageno)
  {
  	XLogRecData rdata;
  
! 	rdata.data = (char *) (&pageno);
! 	rdata.len = sizeof(int);
  	rdata.buffer = InvalidBuffer;
  	rdata.next = NULL;
  	(void) XLogInsert(RM_CLOG_ID, CLOG_ZEROPAGE, &rdata);
--- 742,757 ----
   * Write a ZEROPAGE xlog record
   */
  static void
! WriteZeroPageXlogRec(int partition, int pageno)
  {
  	XLogRecData rdata;
+ 	xl_clog_page	cpage;
  
! 	cpage.partition = partition;
! 	cpage.pageno = pageno;
! 
! 	rdata.data = (char *) &cpage;
! 	rdata.len = sizeof(xl_clog_page);
  	rdata.buffer = InvalidBuffer;
  	rdata.next = NULL;
  	(void) XLogInsert(RM_CLOG_ID, CLOG_ZEROPAGE, &rdata);
***************
*** 682,694 **** WriteZeroPageXlogRec(int pageno)
   * in TruncateCLOG().
   */
  static void
! WriteTruncateXlogRec(int pageno)
  {
  	XLogRecData rdata;
  	XLogRecPtr	recptr;
  
! 	rdata.data = (char *) (&pageno);
! 	rdata.len = sizeof(int);
  	rdata.buffer = InvalidBuffer;
  	rdata.next = NULL;
  	recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
--- 764,780 ----
   * in TruncateCLOG().
   */
  static void
! WriteTruncateXlogRec(int partition, int pageno)
  {
  	XLogRecData rdata;
  	XLogRecPtr	recptr;
+ 	xl_clog_page	cpage;
+ 
+ 	cpage.partition = partition;
+ 	cpage.pageno = pageno;
  
! 	rdata.data = (char *) &cpage;
! 	rdata.len = sizeof(xl_clog_page);
  	rdata.buffer = InvalidBuffer;
  	rdata.next = NULL;
  	recptr = XLogInsert(RM_CLOG_ID, CLOG_TRUNCATE, &rdata);
***************
*** 702,739 **** void
  clog_redo(XLogRecPtr lsn, XLogRecord *record)
  {
  	uint8		info = record->xl_info & ~XLR_INFO_MASK;
  
  	/* Backup blocks are not used in clog records */
  	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
  
  	if (info == CLOG_ZEROPAGE)
  	{
- 		int			pageno;
  		int			slotno;
  
! 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
  
! 		LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
  
! 		slotno = ZeroCLOGPage(pageno, false);
! 		SimpleLruWritePage(ClogCtl, slotno);
! 		Assert(!ClogCtl->shared->page_dirty[slotno]);
  
! 		LWLockRelease(CLogControlLock);
  	}
  	else if (info == CLOG_TRUNCATE)
  	{
! 		int			pageno;
  
! 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
  
  		/*
  		 * During XLOG replay, latest_page_number isn't set up yet; insert a
  		 * suitable value to bypass the sanity test in SimpleLruTruncate.
  		 */
! 		ClogCtl->shared->latest_page_number = pageno;
  
! 		SimpleLruTruncate(ClogCtl, pageno);
  	}
  	else
  		elog(PANIC, "clog_redo: unknown op code %u", info);
--- 788,828 ----
  clog_redo(XLogRecPtr lsn, XLogRecord *record)
  {
  	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+ 	xl_clog_page	cpage;
+ 	SlruCtlData *ClogCtlP;
  
  	/* Backup blocks are not used in clog records */
  	Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
  
  	if (info == CLOG_ZEROPAGE)
  	{
  		int			slotno;
  
! 		memcpy(&cpage, XLogRecGetData(record), sizeof(xl_clog_page));
! 
! 		ClogCtlP = &ClogCtl[cpage.partition];
  
! 		LWLockAcquire(ClogCtlP->shared->ControlLock, LW_EXCLUSIVE);
  
! 		slotno = ZeroCLOGPage(cpage.partition, cpage.pageno, false);
! 		SimpleLruWritePage(ClogCtlP, slotno);
! 		Assert(!ClogCtlP->shared->page_dirty[slotno]);
  
! 		LWLockRelease(ClogCtlP->shared->ControlLock);
  	}
  	else if (info == CLOG_TRUNCATE)
  	{
! 		memcpy(&cpage, XLogRecGetData(record), sizeof(xl_clog_page));
  
! 		ClogCtlP = &ClogCtl[cpage.partition];
  
  		/*
  		 * During XLOG replay, latest_page_number isn't set up yet; insert a
  		 * suitable value to bypass the sanity test in SimpleLruTruncate.
  		 */
! 		ClogCtlP->shared->latest_page_number = cpage.pageno;
  
! 		SimpleLruTruncate(ClogCtlP, cpage.pageno);
  	}
  	else
  		elog(PANIC, "clog_redo: unknown op code %u", info);
***************
*** 743,762 **** void
  clog_desc(StringInfo buf, uint8 xl_info, char *rec)
  {
  	uint8		info = xl_info & ~XLR_INFO_MASK;
  
  	if (info == CLOG_ZEROPAGE)
  	{
! 		int			pageno;
! 
! 		memcpy(&pageno, rec, sizeof(int));
! 		appendStringInfo(buf, "zeropage: %d", pageno);
  	}
  	else if (info == CLOG_TRUNCATE)
  	{
! 		int			pageno;
! 
! 		memcpy(&pageno, rec, sizeof(int));
! 		appendStringInfo(buf, "truncate before: %d", pageno);
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
--- 832,848 ----
  clog_desc(StringInfo buf, uint8 xl_info, char *rec)
  {
  	uint8		info = xl_info & ~XLR_INFO_MASK;
+ 	xl_clog_page	cpage;
  
  	if (info == CLOG_ZEROPAGE)
  	{
! 		memcpy(&cpage, rec, sizeof(xl_clog_page));
! 		appendStringInfo(buf, "zeropage: partition %d page %d", cpage.partition, cpage.pageno);
  	}
  	else if (info == CLOG_TRUNCATE)
  	{
! 		memcpy(&cpage, rec, sizeof(xl_clog_page));
! 		appendStringInfo(buf, "truncate before: partition %d page %d", cpage.partition, cpage.pageno);
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
*** a/src/backend/access/transam/multixact.c
--- b/src/backend/access/transam/multixact.c
***************
*** 1392,1399 **** MultiXactShmemSize(void)
  			 mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
  
  	size = SHARED_MULTIXACT_STATE_SIZE;
! 	size = add_size(size, SimpleLruShmemSize(NUM_MXACTOFFSET_BUFFERS, 0));
! 	size = add_size(size, SimpleLruShmemSize(NUM_MXACTMEMBER_BUFFERS, 0));
  
  	return size;
  }
--- 1392,1399 ----
  			 mul_size(sizeof(MultiXactId) * 2, MaxOldestSlot))
  
  	size = SHARED_MULTIXACT_STATE_SIZE;
! 	size = add_size(size, SimpleLruShmemSize(1, NUM_MXACTOFFSET_BUFFERS, 0));
! 	size = add_size(size, SimpleLruShmemSize(1, NUM_MXACTMEMBER_BUFFERS, 0));
  
  	return size;
  }
***************
*** 1409,1418 **** MultiXactShmemInit(void)
  	MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
  
  	SimpleLruInit(MultiXactOffsetCtl,
! 				  "MultiXactOffset Ctl", NUM_MXACTOFFSET_BUFFERS, 0,
  				  MultiXactOffsetControlLock, "pg_multixact/offsets");
  	SimpleLruInit(MultiXactMemberCtl,
! 				  "MultiXactMember Ctl", NUM_MXACTMEMBER_BUFFERS, 0,
  				  MultiXactMemberControlLock, "pg_multixact/members");
  
  	/* Initialize our shared state struct */
--- 1409,1418 ----
  	MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
  
  	SimpleLruInit(MultiXactOffsetCtl,
! 				  "MultiXactOffset Ctl", 1, NUM_MXACTOFFSET_BUFFERS, 0,
  				  MultiXactOffsetControlLock, "pg_multixact/offsets");
  	SimpleLruInit(MultiXactMemberCtl,
! 				  "MultiXactMember Ctl", 1, NUM_MXACTMEMBER_BUFFERS, 0,
  				  MultiXactMemberControlLock, "pg_multixact/members");
  
  	/* Initialize our shared state struct */
*** a/src/backend/access/transam/slru.c
--- b/src/backend/access/transam/slru.c
***************
*** 140,146 **** static bool SlruScanDirCbDeleteCutoff(SlruCtl ctl, char *filename,
   */
  
  Size
! SimpleLruShmemSize(int nslots, int nlsns)
  {
  	Size		sz;
  
--- 140,146 ----
   */
  
  Size
! SimpleLruShmemSize(int npartitions, int nslots, int nlsns)
  {
  	Size		sz;
  
***************
*** 156,237 **** SimpleLruShmemSize(int nslots, int nlsns)
  	if (nlsns > 0)
  		sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));	/* group_lsn[] */
  
! 	return BUFFERALIGN(sz) + BLCKSZ * nslots;
  }
  
  void
! SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
  			  LWLockId ctllock, const char *subdir)
  {
! 	SlruShared	shared;
  	bool		found;
  
! 	shared = (SlruShared) ShmemInitStruct(name,
! 										  SimpleLruShmemSize(nslots, nlsns),
  										  &found);
  
! 	if (!IsUnderPostmaster)
  	{
! 		/* Initialize locks and shared memory area */
! 		char	   *ptr;
! 		Size		offset;
! 		int			slotno;
  
! 		Assert(!found);
  
! 		memset(shared, 0, sizeof(SlruSharedData));
  
! 		shared->ControlLock = ctllock;
  
! 		shared->num_slots = nslots;
! 		shared->lsn_groups_per_page = nlsns;
  
! 		shared->cur_lru_count = 0;
  
! 		/* shared->latest_page_number will be set later */
  
! 		ptr = (char *) shared;
! 		offset = MAXALIGN(sizeof(SlruSharedData));
! 		shared->page_buffer = (char **) (ptr + offset);
! 		offset += MAXALIGN(nslots * sizeof(char *));
! 		shared->page_status = (SlruPageStatus *) (ptr + offset);
! 		offset += MAXALIGN(nslots * sizeof(SlruPageStatus));
! 		shared->page_dirty = (bool *) (ptr + offset);
! 		offset += MAXALIGN(nslots * sizeof(bool));
! 		shared->page_number = (int *) (ptr + offset);
! 		offset += MAXALIGN(nslots * sizeof(int));
! 		shared->page_lru_count = (int *) (ptr + offset);
! 		offset += MAXALIGN(nslots * sizeof(int));
! 		shared->buffer_locks = (LWLockId *) (ptr + offset);
! 		offset += MAXALIGN(nslots * sizeof(LWLockId));
  
! 		if (nlsns > 0)
! 		{
! 			shared->group_lsn = (XLogRecPtr *) (ptr + offset);
! 			offset += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));
! 		}
  
! 		ptr += BUFFERALIGN(offset);
! 		for (slotno = 0; slotno < nslots; slotno++)
! 		{
! 			shared->page_buffer[slotno] = ptr;
! 			shared->page_status[slotno] = SLRU_PAGE_EMPTY;
! 			shared->page_dirty[slotno] = false;
! 			shared->page_lru_count[slotno] = 0;
! 			shared->buffer_locks[slotno] = LWLockAssign();
! 			ptr += BLCKSZ;
  		}
! 	}
! 	else
! 		Assert(found);
  
! 	/*
! 	 * Initialize the unshared control struct, including directory path. We
! 	 * assume caller set PagePrecedes.
! 	 */
! 	ctl->shared = shared;
! 	ctl->do_fsync = true;		/* default behavior */
! 	StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir));
  }
  
  /*
--- 156,252 ----
  	if (nlsns > 0)
  		sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));	/* group_lsn[] */
  
! 	return npartitions * (BUFFERALIGN(sz) + BLCKSZ * nslots);
  }
  
  void
! SimpleLruInit(SlruCtl ctl, const char *name, int npartitions, int nslots, int nlsns,
  			  LWLockId ctllock, const char *subdir)
  {
! 	SlruShared	slrushared;
  	bool		found;
+ 	int			partition;
  
! 	slrushared = (SlruShared) ShmemInitStruct(name,
! 										  SimpleLruShmemSize(npartitions, nslots, nlsns),
  										  &found);
  
! 	for (partition = 0; partition < npartitions; partition++)
  	{
! 		SlruShared	shared = (SlruShared)
! 			((char *) slrushared +
! 			 partition * SimpleLruShmemSize(1, nslots, nlsns));
  
! 		if (!IsUnderPostmaster)
! 		{
! 			/* Initialize locks and shared memory area */
! 			char	   *ptr;
! 			Size		offset;
! 			int			slotno;
  
! 			Assert(!found);
  
! 			memset(shared, 0, sizeof(SlruSharedData));
  
! 			shared->ControlLock = ctllock + partition;
  
! 			shared->num_slots = nslots;
! 			shared->lsn_groups_per_page = nlsns;
  
! 			shared->cur_lru_count = 0;
  
! 			/* shared->latest_page_number will be set later */
  
! 			ptr = (char *) shared;
! 			offset = MAXALIGN(sizeof(SlruSharedData));
! 			shared->page_buffer = (char **) (ptr + offset);
! 			offset += MAXALIGN(nslots * sizeof(char *));
! 			shared->page_status = (SlruPageStatus *) (ptr + offset);
! 			offset += MAXALIGN(nslots * sizeof(SlruPageStatus));
! 			shared->page_dirty = (bool *) (ptr + offset);
! 			offset += MAXALIGN(nslots * sizeof(bool));
! 			shared->page_number = (int *) (ptr + offset);
! 			offset += MAXALIGN(nslots * sizeof(int));
! 			shared->page_lru_count = (int *) (ptr + offset);
! 			offset += MAXALIGN(nslots * sizeof(int));
! 			shared->buffer_locks = (LWLockId *) (ptr + offset);
! 			offset += MAXALIGN(nslots * sizeof(LWLockId));
! 
! 			if (nlsns > 0)
! 			{
! 				shared->group_lsn = (XLogRecPtr *) (ptr + offset);
! 				offset += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));
! 			}
! 
! 			ptr += BUFFERALIGN(offset);
! 			for (slotno = 0; slotno < nslots; slotno++)
! 			{
! 				shared->page_buffer[slotno] = ptr;
! 				shared->page_status[slotno] = SLRU_PAGE_EMPTY;
! 				shared->page_dirty[slotno] = false;
! 				shared->page_lru_count[slotno] = 0;
! 				shared->buffer_locks[slotno] = LWLockAssign();
! 				ptr += BLCKSZ;
! 			}
  		}
! 		else
! 			Assert(found);
  
! 		/*
! 		 * Initialize the unshared control struct, including directory path. We
! 		 * assume caller set PagePrecedes.
! 		 */
! 		ctl->shared = shared;
! 		ctl->do_fsync = true;		/* default behavior */
! 		if (npartitions == 1)
! 			snprintf(ctl->Dir, sizeof(ctl->Dir), "%s", subdir);
! 		else
! 			snprintf(ctl->Dir, sizeof(ctl->Dir), "%s/%d", subdir, partition);
! 
! 		ctl++;
! 	}
  }
  
  /*
*** a/src/backend/access/transam/subtrans.c
--- b/src/backend/access/transam/subtrans.c
***************
*** 171,184 **** SubTransGetTopmostTransaction(TransactionId xid)
  Size
  SUBTRANSShmemSize(void)
  {
! 	return SimpleLruShmemSize(NUM_SUBTRANS_BUFFERS, 0);
  }
  
  void
  SUBTRANSShmemInit(void)
  {
  	SubTransCtl->PagePrecedes = SubTransPagePrecedes;
! 	SimpleLruInit(SubTransCtl, "SUBTRANS Ctl", NUM_SUBTRANS_BUFFERS, 0,
  				  SubtransControlLock, "pg_subtrans");
  	/* Override default assumption that writes should be fsync'd */
  	SubTransCtl->do_fsync = false;
--- 171,184 ----
  Size
  SUBTRANSShmemSize(void)
  {
! 	return SimpleLruShmemSize(1, NUM_SUBTRANS_BUFFERS, 0);
  }
  
  void
  SUBTRANSShmemInit(void)
  {
  	SubTransCtl->PagePrecedes = SubTransPagePrecedes;
! 	SimpleLruInit(SubTransCtl, "SUBTRANS Ctl", 1, NUM_SUBTRANS_BUFFERS, 0,
  				  SubtransControlLock, "pg_subtrans");
  	/* Override default assumption that writes should be fsync'd */
  	SubTransCtl->do_fsync = false;
*** a/src/backend/commands/async.c
--- b/src/backend/commands/async.c
***************
*** 422,428 **** AsyncShmemSize(void)
  	size = mul_size(MaxBackends, sizeof(QueueBackendStatus));
  	size = add_size(size, sizeof(AsyncQueueControl));
  
! 	size = add_size(size, SimpleLruShmemSize(NUM_ASYNC_BUFFERS, 0));
  
  	return size;
  }
--- 422,428 ----
  	size = mul_size(MaxBackends, sizeof(QueueBackendStatus));
  	size = add_size(size, sizeof(AsyncQueueControl));
  
! 	size = add_size(size, SimpleLruShmemSize(1, NUM_ASYNC_BUFFERS, 0));
  
  	return size;
  }
***************
*** 470,476 **** AsyncShmemInit(void)
  	 * Set up SLRU management of the pg_notify data.
  	 */
  	AsyncCtl->PagePrecedes = asyncQueuePagePrecedes;
! 	SimpleLruInit(AsyncCtl, "Async Ctl", NUM_ASYNC_BUFFERS, 0,
  				  AsyncCtlLock, "pg_notify");
  	/* Override default assumption that writes should be fsync'd */
  	AsyncCtl->do_fsync = false;
--- 470,476 ----
  	 * Set up SLRU management of the pg_notify data.
  	 */
  	AsyncCtl->PagePrecedes = asyncQueuePagePrecedes;
! 	SimpleLruInit(AsyncCtl, "Async Ctl", 1, NUM_ASYNC_BUFFERS, 0,
  				  AsyncCtlLock, "pg_notify");
  	/* Override default assumption that writes should be fsync'd */
  	AsyncCtl->do_fsync = false;
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
***************
*** 788,794 **** OldSerXidInit(void)
  	 */
  	OldSerXidSlruCtl->PagePrecedes = OldSerXidPagePrecedesLogically;
  	SimpleLruInit(OldSerXidSlruCtl, "OldSerXid SLRU Ctl",
! 				  NUM_OLDSERXID_BUFFERS, 0, OldSerXidLock, "pg_serial");
  	/* Override default assumption that writes should be fsync'd */
  	OldSerXidSlruCtl->do_fsync = false;
  
--- 788,794 ----
  	 */
  	OldSerXidSlruCtl->PagePrecedes = OldSerXidPagePrecedesLogically;
  	SimpleLruInit(OldSerXidSlruCtl, "OldSerXid SLRU Ctl",
! 				  1, NUM_OLDSERXID_BUFFERS, 0, OldSerXidLock, "pg_serial");
  	/* Override default assumption that writes should be fsync'd */
  	OldSerXidSlruCtl->do_fsync = false;
  
***************
*** 1334,1340 **** PredicateLockShmemSize(void)
  
  	/* Shared memory structures for SLRU tracking of old committed xids. */
  	size = add_size(size, sizeof(OldSerXidControlData));
! 	size = add_size(size, SimpleLruShmemSize(NUM_OLDSERXID_BUFFERS, 0));
  
  	return size;
  }
--- 1334,1340 ----
  
  	/* Shared memory structures for SLRU tracking of old committed xids. */
  	size = add_size(size, sizeof(OldSerXidControlData));
! 	size = add_size(size, SimpleLruShmemSize(1, NUM_OLDSERXID_BUFFERS, 0));
  
  	return size;
  }
*** a/src/include/access/clog.h
--- b/src/include/access/clog.h
***************
*** 30,35 **** typedef int XidStatus;
--- 30,36 ----
  
  /* Number of SLRU buffers to use for clog */
  #define NUM_CLOG_BUFFERS	8
+ #define NUM_CLOG_PARTITIONS	8
  
  
  extern void TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
*** a/src/include/access/slru.h
--- b/src/include/access/slru.h
***************
*** 134,141 **** typedef struct SlruCtlData
  typedef SlruCtlData *SlruCtl;
  
  
! extern Size SimpleLruShmemSize(int nslots, int nlsns);
! extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
  			  LWLockId ctllock, const char *subdir);
  extern int	SimpleLruZeroPage(SlruCtl ctl, int pageno);
  extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
--- 134,141 ----
  typedef SlruCtlData *SlruCtl;
  
  
! extern Size SimpleLruShmemSize(int npartitions, int nslots, int nlsns);
! extern void SimpleLruInit(SlruCtl ctl, const char *name, int npartitions, int nslots, int nlsns,
  			  LWLockId ctllock, const char *subdir);
  extern int	SimpleLruZeroPage(SlruCtl ctl, int pageno);
  extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
*** a/src/include/storage/lwlock.h
--- b/src/include/storage/lwlock.h
***************
*** 21,26 ****
--- 21,29 ----
   */
  
+ /* Number of partitions of the clog SLRU */
+ #define NUM_CLOG_PARTITIONS  8
+ 
  /* Number of partitions of the shared buffer mapping hashtable */
  #define NUM_BUFFER_PARTITIONS  16
  
  /* Number of partitions the shared lock tables are divided into */
***************
*** 57,63 **** typedef enum LWLockId
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
! 	CLogControlLock,
  	SubtransControlLock,
  	MultiXactGenLock,
  	MultiXactOffsetControlLock,
--- 60,66 ----
  	WALWriteLock,
  	ControlFileLock,
  	CheckpointLock,
! 	CLogControlLock_NowUnused,
  	SubtransControlLock,
  	MultiXactGenLock,
  	MultiXactOffsetControlLock,
***************
*** 82,88 **** typedef enum LWLockId
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
! 	FirstPredicateLockMgrLock = FirstLockMgrLock + NUM_LOCK_PARTITIONS,
  
  	/* must be last except for MaxDynamicLWLock: */
  	NumFixedLWLocks = FirstPredicateLockMgrLock + NUM_PREDICATELOCK_PARTITIONS,
--- 85,92 ----
  	/* Individual lock IDs end here */
  	FirstBufMappingLock,
  	FirstLockMgrLock = FirstBufMappingLock + NUM_BUFFER_PARTITIONS,
! 	FirstClogControlLock = FirstLockMgrLock + NUM_LOCK_PARTITIONS,
! 	FirstPredicateLockMgrLock = FirstClogControlLock + NUM_CLOG_PARTITIONS,
  
  	/* must be last except for MaxDynamicLWLock: */
  	NumFixedLWLocks = FirstPredicateLockMgrLock + NUM_PREDICATELOCK_PARTITIONS,
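
The hunks above use TransactionIdToPartition() and an xl_clog_page
record type without showing their definitions. Definitions along the
following lines would be consistent with how they are used; this is a
reconstruction to aid review, not part of the posted patch:

/* stripe clog pages across the SLRU partitions by page number */
#define TransactionIdToPartition(xid) \
	((int) (((xid) / (TransactionId) CLOG_XACTS_PER_PAGE) % NUM_CLOG_PARTITIONS))

/* WAL payload shared by CLOG_ZEROPAGE and CLOG_TRUNCATE records */
typedef struct xl_clog_page
{
	int			partition;		/* clog SLRU partition */
	int			pageno;			/* page number within pg_clog */
} xl_clog_page;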
#19Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#18)
Re: CLOG contention

On Wed, Dec 21, 2011 at 4:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Partitioning will give us more buffers and more LWlocks, to spread the
contention when we access the buffers. I use that word because its
what we call the technique already used in the buffer manager and lock
manager. If you wish to call this "less than fully-associative" I
really don't mind, as long as we're discussing the same overall
concept, so we can then focus on an implementation of that concept,
which no doubt has many ways of doing it.

More buffers per lock does reduce the lock contention somewhat, but
not by much. So for me, it seems essential that we have more LWlocks
to solve the problem, which is where partitioning comes in.

My perspective is that there is clog contention in many places, not
just in the ones you identified.

Well, that's possible. The locking in slru.c is pretty screwy and
could probably benefit from better locking granularity. One point
worth noting is that the control lock for each SLRU protects all the
SLRU buffer mappings and the contents of all the buffers; in the main
buffer manager, those responsibilities are split across
BufFreelistLock, 16 buffer manager partition locks, one content lock
per buffer, and the buffer header spinlocks. (The SLRU per-buffer
locks are the equivalent of the I/O-in-progress locks, not the
content locks.) So splitting up CLOG into multiple SLRUs might not be
the only way of improving the lock granularity; the current situation
is almost comical.
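
To make the comparison concrete, an SLRU status lookup boils down to
roughly the following, all under a single lock. This is a simplified
sketch of the SimpleLruReadPage_ReadOnly pattern, ignoring the miss and
I/O paths, not the literal slru.c code:

static XidStatus
slru_read_status_sketch(SlruCtl ctl, int pageno, int byteno, int bshift)
{
	SlruShared	shared = ctl->shared;
	int			slotno;
	XidStatus	status;

	LWLockAcquire(shared->ControlLock, LW_SHARED);

	/* buffer mapping: a linear search, not a partitioned hash table */
	for (slotno = 0; slotno < shared->num_slots; slotno++)
		if (shared->page_number[slotno] == pageno &&
			shared->page_status[slotno] != SLRU_PAGE_EMPTY)
			break;

	/* page contents: inspected under the very same lock */
	status = (*(shared->page_buffer[slotno] + byteno) >> bshift)
		& CLOG_XACT_BITMASK;

	LWLockRelease(shared->ControlLock);
	return status;
}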

But on the flip side, I feel like your discussion of the problems is a
bit hand-wavy. I think we need some real test cases that we can look
at and measure, not just an informal description of what we think is
happening. I'm sure, for example, that repeatedly reading different
CLOG pages costs something - but I'm not sure that it's enough to have
a material impact on performance. And if it doesn't, then we'd be
better off leaving it alone and working on things that do. And if it
does, then we need a way to assess how successful any given approach
is in addressing that problem, so we can decide which of various
proposed approaches is best.

* We allocate a new clog page every 32k xids. At the rates you have
now measured, we will do this every 1-2 seconds.

And a new pg_subtrans page quite a bit more frequently than that.
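
For reference, the 32k figure falls straight out of the clog constants
(assuming the default 8192-byte BLCKSZ); these defines are from clog.c:

#define CLOG_BITS_PER_XACT	2
#define CLOG_XACTS_PER_BYTE	4
#define CLOG_XACTS_PER_PAGE	(BLCKSZ * CLOG_XACTS_PER_BYTE)	/* 32768 */

So at, say, 25,000 XIDs/second, a new clog page is needed about every
1.3 seconds.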

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#19)
Re: CLOG contention

On Thu, Dec 22, 2011 at 12:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

But on the flip side, I feel like your discussion of the problems is a
bit hand-wavy.  I think we need some real test cases that we can look
at and measure, not just an informal description of what we think is
happening.

I understand why you say that and take no offence. All I can say is
the last time I had access to a good test rig and well-structured
reporting and analysis, I was able to see evidence of what I described
to you here.

I no longer have that access, which is the main reason I've not done
anything in the last few years. We both know you do have good access
and that's the main reason I'm telling you about it rather than just
doing it myself.

* We allocate a new clog page every 32k xids. At the rates you have
now measured, we will do this every 1-2 seconds.

And a new pg_subtrans page quite a bit more frequently than that.

It is less of a concern, all the same. In most cases we can simply
drop pg_subtrans pages (though we don't do that as often as we could),
no fsync is required on write, no WAL record required for extension
and no update required at commit.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#21Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#20)
2 attachment(s)
Re: CLOG contention

On Thu, Dec 22, 2011 at 1:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

I understand why you say that and take no offence. All I can say is
the last time I had access to a good test rig and well-structured
reporting and analysis, I was able to see evidence of what I described
to you here.

I no longer have that access, which is the main reason I've not done
anything in the last few years. We both know you do have good access
and that's the main reason I'm telling you about it rather than just
doing it myself.

Right. But I need more details. If I know what to test and how to
test it, I can do it. Otherwise, I'm just guessing. I dislike
guessing.

You mentioned "latency" so this morning I ran pgbench with -l and
graphed the output. There are latency spikes every few seconds. I'm
attaching the overall graph as well as the graph of the last 100
seconds, where the spikes are easier to see clearly. Now, here's the
problem: it seems reasonable to hypothesize that the spikes are due to
CLOG page replacement since the frequency is at least plausibly right,
but this is obviously not enough to prove that conclusively. Ideas?

Also, if it is that, what do we do about it? I don't think any of the
ideas proposed so far are going to help much. Increasing the number
of CLOG buffers isn't going to fix the problem that once they're all
dirty, you have to write and fsync one before pulling in the next one.
Striping might actually make it worse - everyone will move to the
next buffer right around the same time, and instead of everybody
waiting for one fsync, they'll each be waiting for their own. Maybe
the solution is to have the background writer keep an eye on how many
CLOG buffers are dirty and start writing them out if the number gets
too big.
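
A very rough sketch of that last idea, purely to show the shape of it -
the names and the threshold here are hypothetical, nothing like
CountDirtyCLOGBuffers() or WriteOldestDirtyCLOGBuffer() exists today:

/* hypothetical: called from the bgwriter's main loop */
#define CLOG_DIRTY_WRITE_THRESHOLD	4	/* made-up knob */

static void
BgWriterFlushCLOG(void)
{
	/* write (and fsync) dirty clog pages before a foreground backend
	 * is forced to do it while everyone else waits behind it */
	while (CountDirtyCLOGBuffers() > CLOG_DIRTY_WRITE_THRESHOLD)
		WriteOldestDirtyCLOGBuffer();
}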

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

latency.png (image/png) [PNG image data not reproduced]
latency-end.png (image/png) [PNG image data not reproduced]
#22Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#21)
Re: CLOG contention

On Thu, Dec 22, 2011 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:

You mentioned "latency" so this morning I ran pgbench with -l and
graphed the output.  There are latency spikes every few seconds.  I'm
attaching the overall graph as well as the graph of the last 100
seconds, where the spikes are easier to see clearly.  Now, here's the
problem: it seems reasonable to hypothesize that the spikes are due to
CLOG page replacement since the frequency is at least plausibly right,
but this is obviously not enough to prove that conclusively.  Ideas?

Thanks. That illustrates the effect I explained earlier very clearly,
so now we all know I wasn't speculating.

Also, if it is that, what do we do about it?  I don't think any of the
ideas proposed so far are going to help much.

If you don't like guessing, don't guess, don't think. Just measure.

Does increasing the number of buffers solve the problems you see? That
must be the first port of call - is that enough, or not? If not, we
can discuss the various ideas, write patches and measure them.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#23Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#22)
Re: CLOG contention

On Sat, Dec 24, 2011 at 9:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, Dec 22, 2011 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Also, if it is that, what do we do about it?  I don't think any of the
ideas proposed so far are going to help much.

If you don't like guessing, don't guess, don't think. Just measure.

Does increasing the number of buffers solve the problems you see? That
must be the first port of call - is that enough, or not? If not, we
can discuss the various ideas, write patches and measure them.

Just in case you want a theoretical prediction to test:

increasing NUM_CLOG_BUFFERS should reduce the frequency of the spikes
you measured earlier. That should happen proportionally, so as that is
increased they will become even less frequent. But the size of the
buffer will not decrease the impact of each event when it happens.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#24Jim Nasby
jim@nasby.net
In reply to: Tom Lane (#2)
Re: CLOG contention

On Dec 20, 2011, at 11:29 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

So, what do we do about this? The obvious answer is "increase
NUM_CLOG_BUFFERS", and I'm not sure that's a bad idea.

As you say, that's likely to hurt people running in small shared
memory. I too have thought about merging the SLRU areas into the main
shared buffer arena, and likewise have concluded that it is likely to
be way more painful than it's worth. What I think might be an
appropriate compromise is something similar to what we did for
autotuning wal_buffers: use a fixed percentage of shared_buffers, with
some minimum and maximum limits to ensure sanity. But picking the
appropriate percentage would take a bit of research.

ISTM that this is based more on the number of CPUs than on total memory, no? Likewise, things like the number of shared buffer partitions would be highly dependent on the number of CPUs.

So perhaps we should either probe the number of CPUs on a box, or have a GUC to tell us how many there are...
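
That said, if we went with the shared_buffers-fraction idea, presumably
it would mirror the wal_buffers autotuning, roughly like the following;
the divisor and the clamps are made-up placeholders, exactly the "bit
of research" Tom mentions:

static int
CLOGShmemBuffers(void)
{
	/* scale with shared_buffers, but stay within sane bounds */
	return Min(128, Max(NUM_CLOG_BUFFERS, NBuffers / 1024));
}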
--
Jim C. Nasby, Database Architect jim@nasby.net
512.569.9461 (cell) http://jim.nasby.net

#25Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#23)
4 attachment(s)
Re: CLOG contention

On Tue, Dec 27, 2011 at 5:23 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Sat, Dec 24, 2011 at 9:25 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, Dec 22, 2011 at 4:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Also, if it is that, what do we do about it?  I don't think any of the
ideas proposed so far are going to help much.

If you don't like guessing, don't guess, don't think. Just measure.

Does increasing the number of buffers solve the problems you see? That
must be the first port of call - is that enough, or not? If not, we
can discuss the various ideas, write patches and measure them.

Just in case you want a theoretical prediction to test:

increasing NUM_CLOG_BUFFERS should reduce the frequency of the spikes
you measured earlier. That should happen proportionally, so as that is
increased they will become even less frequent. But the size of the
buffer will not decrease the impact of each event when it happens.

I'm still catching up on email, so apologies for the slow response on
this. I actually ran this test before Christmas, but didn't get
around to emailing the results. I'm attaching graphs of the last 100
seconds of a run with the normal count of CLOG buffers, and the last
100 seconds of a run with NUM_CLOG_BUFFERS = 32. I am also attaching
graphs of the entire runs.

It appears to me that increasing the number of CLOG buffers reduced
the severity of the latency spikes considerably. In the last 100
seconds, for example, master has several spikes in the 500-700ms
range, but with 32 CLOG buffers it never goes above 400 ms. Also, the
number of points associated with each spike is considerably less -
each spike seems to affect fewer transactions. So it seems that at
least on this machine, increasing the number of CLOG buffers both
improves performance and reduces latency.

I hypothesize that there are actually two kinds of latency spikes
here. Just taking a wild guess, I wonder if the *remaining* latency
spikes are caused by the effect that you mentioned before: namely, the
need to write an old CLOG page every time we advance onto a new one.
I further speculate that the spikes are more severe on the unpatched
code because this effect combines with the one I mentioned before: if
there are more simultaneous I/O requests than there are buffers, a new
I/O request has to wait for one of the I/Os already in progress to
complete. If the new I/O request that has to wait extra-long happens
to be the one caused by XID advancement, then things get really ugly.
If that hypothesis is correct, then it supports your previous belief
that more than one fix is needed here... but it also means we can get
a significant and I think quite worthwhile benefit just out of finding
a reasonable way to add some more buffers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

latency-end.png (image/png) [PNG image data not reproduced]
latency-clog32-end.png (image/png)
latency.png (image/png) [PNG image data not reproduced]
�.��R$Wzy�=R2����~�����M/��i�hm�&����u��}W��y+��5�</�!�o�,�����4����n��`��7���Y�����7.����(d�E@Y��]�nq��X�������<���t5��`W*M��\���v��lh�B���6��mdz�*x������XUA����^�R�q�=Z�l�j�+v}q����D6CU�|��-���L�g����T����6�.��X-���~���]F@??��%X|u�J,���-VI@�����r=�����*0�Q���+����b#������(Yx��e���\\SaR�$s�Sm����c�����P�q.?��/7M�������L���B����S�
�����%��&�M����0�%#�F�qg���@�s�'hvW@�P��]��hkR;���G�*�FUS�h���!\�K��f��7���F�&������8�v��#`�i1����#���[�hh���\����im�e^%���%�4�#`��
��;D7��������)8� y�����"8M�OT���S5��e��XPl|�L�A,�����gi�*u�����,���O�N
�����!r�b-�j���?n�*`�w:]e�PPq�#���G�n����\����om�����*�;����<����r�p�}BOjg���w�m����A���L^@ec�(P��yL����������a�F�6�����F��oj��������� 
�������n����}�*���sZw\aH���'���-�3�O�vc�dU��b��ct�e1�o��}(ou��v.��%d��O�~K�������0!`��5�vr.��������V�O�4G��qm|�/��{�Q���ZF*��2R�un�M#�W�����s���]TF���!BL�a|���3h��P���X��zI��6�/�MK���v�����{r�t�(����B[����k�8�r]0�y�{-O
�#e�.gI��K]�D��S!�",{��<[�+�~�s
���t�j������I�m����6�N�H��N�%j@�V!rX3�u�P�A��4����B@�>`.j��Ti�
�B�O�\��b��p,�
l�"e�U��8q�	6KE��p�H�=^VY�c�#�������p�K{F�7�/�l�=���x<*�G���I����i�pQ�������Cn
�K���<�`3�X�/3S���#�]��:gW_�x;��l�~|�F�n��D^.�?&�����T�����#��M�q�ga�������,�EI���
������o
���a;��9%���[����4��������E�K�X�sN�=��C��[c�A��'-`!\����p�!D&����G/�C��wK���
�f����yP�P�Z���t���W-!��~ ��'�����W��$���=���2'"+���$dO�
����L�]��$v�{jO�C�E
"7��I)������$����9>����4������~��,���2itU"�z���gM`�ZZg�	�.�:Zs1��$�f��}�x�TH9j������5%^N���(��q��G�N�m��
R@RE?��R@R$K�zM�?�m���T�Q����&���]"]X>v�����x��4���P�~P���R�{n�nkB!�B!�f�5�������xG���m�����G���;�+�5���g?*�
���������6Hy��ng��	����r�"��>,��9�%�;�)�6"|`w��fT�R������:\~5�+.��Q���<��~�g�b��si�|d
�w~����b����Q5 ����4��4���0��]�j��aTp��AP~��B!�B!�B!�T��`�_�O�HG�OO�^�u4����'
H�q~=�~���_��_r���DH'|��	9��������+��b>
H:"t�p�� @B���) ��V���7��E_�&!�B!��?���j��>�IEND�B`�
latency-clog32.pngimage/png; name=latency-clog32.pngDownload
#26Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#25)
Re: CLOG contention

On Thu, Jan 5, 2012 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

It appears to me that increasing the number of CLOG buffers reduced
the severity of the latency spikes considerably.  In the last 100
seconds, for example, master has several spikes in the 500-700ms
range, but with 32 CLOG buffers it never goes above 400 ms.  Also, the
number of points associated with each spike is considerably less -
each spike seems to affect fewer transactions.  So it seems that at
least on this machine, increasing the number of CLOG buffers both
improves performance and reduces latency.

I believed before that the increase was worthwhile and now even more so.

Let's commit the change to 32.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#27Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Simon Riggs (#26)
Re: CLOG contention

Simon Riggs <simon@2ndQuadrant.com> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

So it seems that at least on this machine, increasing the number
of CLOG buffers both improves performance and reduces latency.

I believed before that the increase was worthwhile and now even
more so.

Let's commit the change to 32.

+1

-Kevin

#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#25)
1 attachment(s)
Re: CLOG contention

On Thu, Jan 5, 2012 at 4:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I hypothesize that there are actually two kinds of latency spikes
here.  Just taking a wild guess, I wonder if the *remaining* latency
spikes are caused by the effect that you mentioned before: namely, the
need to write an old CLOG page every time we advance onto a new one.
I further speculate that the spikes are more severe on the unpatched
code because this effect combines with the one I mentioned before: if
there are more simultaneous I/O requests than there are buffers, a new
I/O request has to wait for one of the I/Os already in progress to
complete.  If the new I/O request that has to wait extra-long happens
to be the one caused by XID advancement, then things get really ugly.
If that hypothesis is correct, then it supports your previous belief
that more than one fix is needed here... but it also means we can get
a significant and I think quite worthwhile benefit just out of finding
a reasonable way to add some more buffers.

Sounds reasonable.

Patch to remove clog contention caused by my dirty clog LRU.

The patch implements background WAL allocation also, with the
intention of being separately tested, if possible.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

background_clean_slru_and_wal.v1.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 4060e60..dbefa02 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -565,6 +565,26 @@ CheckPointCLOG(void)
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
+/*
+ * Conditionally flush the CLOG LRU.
+ *
+ * When a backend does ExtendCLOG we need to write the CLOG LRU if it is
+ * dirty. Performing I/O while holding XidGenLock prevents new write
+ * transactions from starting. To avoid that we flush the CLOG LRU, if
+ * we think that a page write is due soon, according to a heuristic.
+ *
+ * Note that we're reading ShmemVariableCache->nextXid without a lock
+ * since the exact value doesn't matter as input into our heuristic.
+ */
+void
+CLOGBackgroundFlushLRU(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int		threshold = (CLOG_XACTS_PER_PAGE - (CLOG_XACTS_PER_PAGE / 4));
+
+	if (TransactionIdToPgIndex(xid) > threshold)
+		SlruBackgroundFlushLRUPage(ClogCtl);
+}
 
 /*
  * Make sure that CLOG has room for a newly-allocated XID.
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 30538ff..aea6c09 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -885,6 +885,82 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
 }
 
 /*
+ * Identify the LRU slot but just leave it as it is.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+SlruIdentifyLRUSlot(SlruCtl ctl)
+{
+	SlruShared	shared = ctl->shared;
+	int			slotno;
+	int			cur_count;
+	int			bestslot;
+	int			best_delta;
+	int			best_page_number;
+
+	/*
+	 * If we find any EMPTY slot, just select that one. Else locate the
+	 * least-recently-used slot.
+	 *
+	 * Normally the page_lru_count values will all be different and so
+	 * there will be a well-defined LRU page.  But since we allow
+	 * concurrent execution of SlruRecentlyUsed() within
+	 * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
+	 * acquire the same lru_count values.  In that case we break ties by
+	 * choosing the furthest-back page.
+	 *
+	 * In no case will we select the slot containing latest_page_number
+	 * for replacement, even if it appears least recently used.
+	 *
+	 * Notice that this next line forcibly advances cur_lru_count to a
+	 * value that is certainly beyond any value that will be in the
+	 * page_lru_count array after the loop finishes.  This ensures that
+	 * the next execution of SlruRecentlyUsed will mark the page newly
+	 * used, even if it's for a page that has the current counter value.
+	 * That gets us back on the path to having good data when there are
+	 * multiple pages with the same lru_count.
+	 */
+	cur_count = (shared->cur_lru_count)++;
+	best_delta = -1;
+	bestslot = 0;			/* no-op, just keeps compiler quiet */
+	best_page_number = 0;	/* ditto */
+	for (slotno = 0; slotno < shared->num_slots; slotno++)
+	{
+		int			this_delta;
+		int			this_page_number;
+
+		if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
+			return slotno;
+		this_delta = cur_count - shared->page_lru_count[slotno];
+		if (this_delta < 0)
+		{
+			/*
+			 * Clean up in case shared updates have caused cur_count
+			 * increments to get "lost".  We back off the page counts,
+			 * rather than trying to increase cur_count, to avoid any
+			 * question of infinite loops or failure in the presence of
+			 * wrapped-around counts.
+			 */
+			shared->page_lru_count[slotno] = cur_count;
+			this_delta = 0;
+		}
+		this_page_number = shared->page_number[slotno];
+		if ((this_delta > best_delta ||
+			 (this_delta == best_delta &&
+			  ctl->PagePrecedes(this_page_number, best_page_number))) &&
+			this_page_number != shared->latest_page_number)
+		{
+			bestslot = slotno;
+			best_delta = this_delta;
+			best_page_number = this_page_number;
+		}
+	}
+
+	return bestslot;
+}
+
+/*
  * Select the slot to re-use when we need a free slot.
  *
  * The target page number is passed because we need to consider the
@@ -905,11 +981,8 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
 	/* Outer loop handles restart after I/O */
 	for (;;)
 	{
-		int			slotno;
-		int			cur_count;
 		int			bestslot;
-		int			best_delta;
-		int			best_page_number;
+		int			slotno;
 
 		/* See if page already has a buffer assigned */
 		for (slotno = 0; slotno < shared->num_slots; slotno++)
@@ -919,69 +992,14 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
 				return slotno;
 		}
 
-		/*
-		 * If we find any EMPTY slot, just select that one. Else locate the
-		 * least-recently-used slot to replace.
-		 *
-		 * Normally the page_lru_count values will all be different and so
-		 * there will be a well-defined LRU page.  But since we allow
-		 * concurrent execution of SlruRecentlyUsed() within
-		 * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
-		 * acquire the same lru_count values.  In that case we break ties by
-		 * choosing the furthest-back page.
-		 *
-		 * In no case will we select the slot containing latest_page_number
-		 * for replacement, even if it appears least recently used.
-		 *
-		 * Notice that this next line forcibly advances cur_lru_count to a
-		 * value that is certainly beyond any value that will be in the
-		 * page_lru_count array after the loop finishes.  This ensures that
-		 * the next execution of SlruRecentlyUsed will mark the page newly
-		 * used, even if it's for a page that has the current counter value.
-		 * That gets us back on the path to having good data when there are
-		 * multiple pages with the same lru_count.
-		 */
-		cur_count = (shared->cur_lru_count)++;
-		best_delta = -1;
-		bestslot = 0;			/* no-op, just keeps compiler quiet */
-		best_page_number = 0;	/* ditto */
-		for (slotno = 0; slotno < shared->num_slots; slotno++)
-		{
-			int			this_delta;
-			int			this_page_number;
-
-			if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
-				return slotno;
-			this_delta = cur_count - shared->page_lru_count[slotno];
-			if (this_delta < 0)
-			{
-				/*
-				 * Clean up in case shared updates have caused cur_count
-				 * increments to get "lost".  We back off the page counts,
-				 * rather than trying to increase cur_count, to avoid any
-				 * question of infinite loops or failure in the presence of
-				 * wrapped-around counts.
-				 */
-				shared->page_lru_count[slotno] = cur_count;
-				this_delta = 0;
-			}
-			this_page_number = shared->page_number[slotno];
-			if ((this_delta > best_delta ||
-				 (this_delta == best_delta &&
-				  ctl->PagePrecedes(this_page_number, best_page_number))) &&
-				this_page_number != shared->latest_page_number)
-			{
-				bestslot = slotno;
-				best_delta = this_delta;
-				best_page_number = this_page_number;
-			}
-		}
+		bestslot = SlruIdentifyLRUSlot(ctl);
 
 		/*
-		 * If the selected page is clean, we're set.
+		 * If the selected page is clean or empty, we're set.
 		 */
-		if (shared->page_status[bestslot] == SLRU_PAGE_VALID &&
-			!shared->page_dirty[bestslot])
+		if (shared->page_status[bestslot] == SLRU_PAGE_EMPTY ||
+			(shared->page_status[bestslot] == SLRU_PAGE_VALID &&
+			!shared->page_dirty[bestslot]))
 			return bestslot;
 
 		/*
@@ -1067,6 +1085,39 @@ SimpleLruFlush(SlruCtl ctl, bool checkpoint)
 }
 
 /*
+ * Make sure the next victim buffer is clean, so that the next caller of
+ * SlruSelectLRUPage does not require I/O.
+ */
+void
+SlruBackgroundFlushLRUPage(SlruCtl ctl)
+{
+	SlruShared	shared = ctl->shared;
+	int			bestslot;
+
+	/*
+	 * Notice this takes only a shared lock on the ControlLock.
+	 * We aren't going to change the page/slot allocation, only
+	 * write if needed and reset the dirty status. This is OK
+	 * as long as only one process ever calls this, the bgwriter.
+	 */
+	LWLockAcquire(shared->ControlLock, LW_SHARED);
+
+	bestslot = SlruIdentifyLRUSlot(ctl);
+
+	/*
+	 * If the selected page is valid and dirty then write it out.
+	 * It's possible that the page is already write-busy, or in the worst
+	 * case still read-busy.  In those cases assume that the write we
+	 * wanted to do just happened and we can go.
+	 */
+	if (shared->page_status[bestslot] == SLRU_PAGE_VALID &&
+		shared->page_dirty[bestslot])
+		SlruInternalWritePage(ctl, bestslot, NULL);
+
+	LWLockRelease(shared->ControlLock);
+}
+
+/*
  * Remove all segments before the one holding the passed page number
  */
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e65962..3e17071 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -636,7 +636,6 @@ static bool RestoreArchivedFile(char *path, const char *xlogfname,
 					const char *recovername, off_t expectedSize);
 static void ExecuteRecoveryCommand(char *command, char *commandName,
 					   bool failOnerror);
-static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveOldXlogFiles(uint32 log, uint32 seg, XLogRecPtr endptr);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
@@ -3312,7 +3311,7 @@ ExecuteRecoveryCommand(char *command, char *commandName, bool failOnSignal)
  * recycled log segments, but the startup transient is likely to include
  * a lot of segment creations by foreground processes, which is not so good.
  */
-static void
+void
 PreallocXlogFiles(XLogRecPtr endptr)
 {
 	uint32		_logId;
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 1f8d2d6..41d49d3 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -265,6 +265,8 @@ BackgroundWriterMain(void)
 		 * Do one cycle of dirty-buffer writing.
 		 */
 		BgBufferSync();
+		CLOGBackgroundFlushLRU();
+		PreallocXlogFiles(GetXLogWriteRecPtr());
 
 		/* Nap for the configured time. */
 		BgWriterNap();
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 9cf54a4..cf20ae0 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -43,6 +43,7 @@ extern void StartupCLOG(void);
 extern void TrimCLOG(void);
 extern void ShutdownCLOG(void);
 extern void CheckPointCLOG(void);
+extern void CLOGBackgroundFlushLRU(void);
 extern void ExtendCLOG(TransactionId newestXact);
 extern void TruncateCLOG(TransactionId oldestXact);
 
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 41cd484..94d4247 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -144,6 +144,7 @@ extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
 						   TransactionId xid);
 extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
 extern void SimpleLruFlush(SlruCtl ctl, bool checkpoint);
+extern void SlruBackgroundFlushLRUPage(SlruCtl ctl);
 extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
 
 typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index db6380f..935eff1 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -264,6 +264,12 @@ extern pg_time_t GetLastSegSwitchTime(void);
 extern XLogRecPtr RequestXLogSwitch(void);
 
 /*
+ * Exported to support background writing
+ */
+extern void PreallocXlogFiles(XLogRecPtr endptr);
+extern void CLOGBackgroundFlushLRU(void);
+
+/*
  * These aren't in xlog.h because I'd rather not include fmgr.h there.
  */
 extern Datum pg_start_backup(PG_FUNCTION_ARGS);
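
For reference, the flush heuristic in CLOGBackgroundFlushLRU above fires once
nextXid is in the final quarter of the current CLOG page. With the default 8kB
block size the arithmetic works out as below (the constants are the real clog.c
ones; the snippet itself is just an illustrative sketch):

    #define CLOG_XACTS_PER_BYTE  4
    #define CLOG_XACTS_PER_PAGE  (BLCKSZ * CLOG_XACTS_PER_BYTE)  /* 8192 * 4 = 32768 */

    /* Flush once fewer than a quarter of the page's XIDs remain free,
     * i.e. when TransactionIdToPgIndex(xid) exceeds 32768 - 8192 = 24576. */
    int threshold = CLOG_XACTS_PER_PAGE - (CLOG_XACTS_PER_PAGE / 4);
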
#29Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#26)
Re: CLOG contention

On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Let's commit the change to 32.

I would like to do that, but I think we need to at least figure out a
way to provide an escape hatch for people without much shared memory.
We could do that, perhaps, by using a formula like this:

1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a
maximum of 32
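
In rough C terms that rule is just a clamped division; a minimal sketch, with
shared_buffers expressed in MB and illustrative names, not committed code:

    /* Illustrative only: one CLOG buffer per 128MB of shared_buffers,
     * clamped to [8, 32]; the 32-buffer cap is reached at 4GB. */
    n_clog_buffers = Min(32, Max(8, shared_buffers_mb / 128));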

I also think it would be worth a quick test to see how the increase
performs on a system with <32 cores.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#29)
Re: CLOG contention

On Thu, Jan 5, 2012 at 7:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Let's commit the change to 32.

I would like to do that, but I think we need to at least figure out a
way to provide an escape hatch for people without much shared memory.
We could do that, perhaps, by using a formula like this:

1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a
maximum of 32

We're talking about an extra 192KB or thereabouts and Clog buffers
will only be the size of subtrans when we've finished.

If you want to have a special low-memory option, then it would need to
include many more things than clog buffers.

Let's just use a constant value for clog buffers until the low-memory
patch arrives.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#31Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#30)
Re: CLOG contention

On Thu, Jan 5, 2012 at 2:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, Jan 5, 2012 at 7:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Let's commit the change to 32.

I would like to do that, but I think we need to at least figure out a
way to provide an escape hatch for people without much shared memory.
We could do that, perhaps, by using a formula like this:

1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a
maximum of 32

We're talking about an extra 192KB or thereabouts and Clog buffers
will only be the size of subtrans when we've finished.

If you want to have a special low-memory option, then it would need to
include many more things than clog buffers.

Let's just use a constant value for clog buffers until the low-memory
patch arrives.

Tom already stated that he found that unacceptable. Unless he changes
his opinion, we're not going to get far if you're only happy if it's
constant and he's only happy if there's a formula.

On the other hand, I think there's a decent argument that he should
change his opinion, because 192kB of memory is not a lot. However,
what I mostly want is something that nobody hates, so we can get it
committed and move on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32Merlin Moncure
mmoncure@gmail.com
In reply to: Robert Haas (#29)
Re: CLOG contention

On Thu, Jan 5, 2012 at 1:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Let's commit the change to 32.

I would like to do that, but I think we need to at least figure out a
way to provide an escape hatch for people without much shared memory.
We could do that, perhaps, by using a formula like this:

1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a
maximum of 32

The assumption that machines that need this will have gigabytes of
shared memory set is not valid IMO.

merlin

#33Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#31)
Re: CLOG contention

On Thu, Jan 5, 2012 at 7:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2012 at 2:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, Jan 5, 2012 at 7:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Let's commit the change to 32.

I would like to do that, but I think we need to at least figure out a
way to provide an escape hatch for people without much shared memory.
We could do that, perhaps, by using a formula like this:

1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a
maximum of 32

We're talking about an extra 192KB or thereabouts and Clog buffers
will only be the size of subtrans when we've finished.

If you want to have a special low-memory option, then it would need to
include many more things than clog buffers.

Let's just use a constant value for clog buffers until the low-memory
patch arrives.

Tom already stated that he found that unacceptable.  Unless he changes
his opinion, we're not going to get far if you're only happy if it's
constant and he's only happy if there's a formula.

On the other hand, I think there's a decent argument that he should
change his opinion, because 192kB of memory is not a lot.  However,
what I mostly want is something that nobody hates, so we can get it
committed and move on.

If that was a reasonable objection it would have applied when we added
serializable support, or any other SLRU for that matter.

If memory reduction is a concern to anybody, then a separate patch to
address *all* issues is required. Blocking this patch makes no sense.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#34Alvaro Herrera
alvherre@commandprompt.com
In reply to: Simon Riggs (#30)
Re: CLOG contention

Excerpts from Simon Riggs's message of Thu Jan 05 16:21:31 -0300 2012:

On Thu, Jan 5, 2012 at 7:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2012 at 11:10 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Let's commit the change to 32.

I would like to do that, but I think we need to at least figure out a
way to provide an escape hatch for people without much shared memory.
We could do that, perhaps, by using a formula like this:

1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a
maximum of 32

We're talking about an extra 192KB or thereabouts and Clog buffers
will only be the size of subtrans when we've finished.

Speaking of which, maybe it'd be a good idea to parametrize the subtrans
size according to the same (or a similar) formula too. (It might be
good to reduce multixact memory consumption too; I'd think that 4+4
pages should be more than sufficient for low memory systems, so making
those be half the clog values should be good)

So you get both things: reduce memory usage for systems on the low end,
which has been slowly increasing lately as we've added more uses of SLRU,
and more buffers for large systems.
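
A hypothetical sketch of that idea, with made-up variable names and no claim
about the right constants:

    /* Hypothetical only: derive the other SLRU sizes from whatever
     * value the clog formula produces. */
    num_subtrans_buffers     = num_clog_buffers;             /* match clog   */
    num_mxact_offset_buffers = Max(4, num_clog_buffers / 2); /* half of clog */
    num_mxact_member_buffers = Max(4, num_clog_buffers / 2);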

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#35Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#31)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> wrote:

Simon Riggs <simon@2ndquadrant.com> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Simon Riggs <simon@2ndquadrant.com> wrote:

Let's commit the change to 32.

I would like to do that, but I think we need to at least figure
out a way to provide an escape hatch for people without much
shared memory. We could do that, perhaps, by using a formula
like this:

1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8
and a maximum of 32

If we go with such a formula, I think 32 MB would be a more
appropriate divisor than 128 MB. Even on very large machines where
32 CLOG buffers would be a clear win, we often can't go above 1 or 2
GB of shared_buffers without hitting latency spikes due to overrun
of the RAID controller cache. (Now, that may change if we get DW
in, but that's not there yet.) 1 GB / 32 is 32 MB. This would
leave CLOG pinned at the minimum of 8 buffers (64 KB) all the way up
to shared_buffers of 256 MB.

Let's just use a constant value for clog buffers until the
low-memory patch arrives.

Tom already stated that he found that unacceptable. Unless he
changes his opinion, we're not going to get far if you're only
happy if it's constant and he's only happy if there's a formula.

On the other hand, I think there's a decent argument that he
should change his opinion, because 192kB of memory is not a lot.
However, what I mostly want is something that nobody hates, so we
can get it committed and move on.

I wouldn't hate it either way, as long as the divisor isn't too
large.

-Kevin

#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#29)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> writes:

I would like to do that, but I think we need to at least figure out a
way to provide an escape hatch for people without much shared memory.
We could do that, perhaps, by using a formula like this:

1 CLOG buffer per 128MB of shared_buffers, with a minimum of 8 and a
maximum of 32

I would be in favor of that, or perhaps some other formula (eg, maybe
the minimum should be less than 8 for when you've got very little shmem).

I think that the reason it's historically been a constant is that the
original coding took advantage of having a compile-time-constant number
of buffers --- but since we went over to the common SLRU infrastructure
for several different logs, there's no longer any benefit whatever to
using a simple constant.

regards, tom lane

#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#33)
Re: CLOG contention

Simon Riggs <simon@2ndQuadrant.com> writes:

On Thu, Jan 5, 2012 at 7:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On the other hand, I think there's a decent argument that he should
change his opinion, because 192kB of memory is not a lot.  However,
what I mostly want is something that nobody hates, so we can get it
committed and move on.

If that was a reasonable objection it would have applied when we added
serializable support, or any other SLRU for that matter.
If memory reduction is a concern to anybody, then a separate patch to
address *all* issues is required. Blocking this patch makes no sense.

No, your argument is the one that makes no sense. The fact that things
could be made better for low-mem situations is not an argument for
instead making them worse. Which is what going to a fixed value of 32
would do, in return for no benefit that I can see compared to using a
formula of some sort. The details of the formula barely matter, though
I would like to see one that bottoms out at less than 8 buffers so that
there is some advantage gained for low-memory cases.

regards, tom lane

#38Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#36)
Re: CLOG contention

On Thu, Jan 5, 2012 at 7:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think that the reason it's historically been a constant is that the
original coding took advantage of having a compile-time-constant number
of buffers --- but since we went over to the common SLRU infrastructure
for several different logs, there's no longer any benefit whatever to
using a simple constant.

You astound me, you really do.

Parameterised slru buffer sizes were proposed for 8.3 and opposed by you.

I guess we all reserve the right to change our minds...

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#38)
Re: CLOG contention

Simon Riggs <simon@2ndQuadrant.com> writes:

Parameterised slru buffer sizes were proposed for 8.3 and opposed by you.
I guess we all reserve the right to change our minds...

When presented with new data, sure. Robert's results offer a reason to
worry about this, which we did not have before now.

regards, tom lane

#40Robert Haas
robertmhaas@gmail.com
In reply to: Kevin Grittner (#35)
Re: CLOG contention

On Thu, Jan 5, 2012 at 2:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

If we go with such a formula, I think 32 MB would be a more
appropriate divisor than 128 MB.  Even on very large machines where
32 CLOG buffers would be a clear win, we often can't go above 1 or 2
GB of shared_buffers without hitting latency spikes due to overrun
of the RAID controller cache.  (Now, that may change if we get DW
in, but that's not there yet.)  1 GB / 32 is 32 MB.  This would
leave CLOG pinned at the minimum of 8 buffers (64 KB) all the way up
to shared_buffers of 256 MB.

That seems reasonable to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#36)
Re: CLOG contention

On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I would be in favor of that, or perhaps some other formula (eg, maybe
the minimum should be less than 8 for when you've got very little shmem).

I have some results that show that, under the right set of
circumstances, 8->32 is a win, and I can quantify by how much it wins.
I don't have any data at all to quantify the cost of dropping the
minimum from 8->6, or from 8->4, and therefore I'm reluctant to do it.
My guess is that it's a bad idea, anyway. Even on a system where
shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG
buffers. If we reduce the number of CLOG buffers from 8 to 4 (i.e. by
50%), we can increase the number of regular buffers from 1024 to 1028
(i.e. by <0.5%). Maybe you can find a case where that comes out to a
win, but you might have to look pretty hard.
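
Spelled out, the arithmetic there is (8kB pages assumed; a throwaway
calculation, not code from the tree):

    int regular_buffers = (8 * 1024) / 8;   /* 8MB of shared_buffers = 1024 pages  */
    int freed_kb        = (8 - 4) * 8;      /* dropping 8 -> 4 CLOG buffers = 32kB */
    int extra_buffers   = freed_kb / 8;     /* = 4 more regular buffers, <0.5%     */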

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#42Merlin Moncure
mmoncure@gmail.com
In reply to: Robert Haas (#40)
Re: CLOG contention

On Thu, Jan 5, 2012 at 2:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 5, 2012 at 2:44 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:

If we go with such a formula, I think 32 MB would be a more
appropriate divisor than 128 MB.  Even on very large machines where
32 CLOG buffers would be a clear win, we often can't go above 1 or 2
GB of shared_buffers without hitting latency spikes due to overrun
of the RAID controller cache.  (Now, that may change if we get DW
in, but that's not there yet.)  1 GB / 32 is 32 MB.  This would
leave CLOG pinned at the minimum of 8 buffers (64 KB) all the way up
to shared_buffers of 256 MB.

That seems reasonable to me.

likewise (champion bikeshedder here). It just so happens I typically
set 'large' server shared memory to 256mb.

merlin

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#41)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I would be in favor of that, or perhaps some other formula (eg, maybe
the minimum should be less than 8 for when you've got very little shmem).

I have some results that show that, under the right set of
circumstances, 8->32 is a win, and I can quantify by how much it wins.
I don't have any data at all to quantify the cost of dropping the
minimum from 8->6, or from 8->4, and therefore I'm reluctant to do it.
My guess is that it's a bad idea, anyway. Even on a system where
shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG
buffers. If we reduce the number of CLOG buffers from 8 to 4 (i.e. by
50%), we can increase the number of regular buffers from 1024 to 1028
(i.e. by <0.5%). Maybe you can find a case where that comes out to a
win, but you might have to look pretty hard.

I think you're rejecting the concept too easily. A setup with very
little shmem is only going to be suitable for low-velocity systems that
are not pushing too many transactions through per second, so it's not
likely to need so many CLOG buffers. And frankly I'm not that concerned
about what the performance is like: I'm more concerned about whether
PG will start up at all without modifying the system shmem limits,
on systems with legacy values for SHMMAX etc. Shaving a few
single-purpose buffers to make back what we spent on SSI, for example,
seems like a good idea to me.

regards, tom lane

#44Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#43)
Re: CLOG contention

On Thu, Jan 5, 2012 at 10:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I would be in favor of that, or perhaps some other formula (eg, maybe
the minimum should be less than 8 for when you've got very little shmem).

I have some results that show that, under the right set of
circumstances, 8->32 is a win, and I can quantify by how much it wins.
 I don't have any data at all to quantify the cost of dropping the
minimum from 8->6, or from 8->4, and therefore I'm reluctant to do it.
 My guess is that it's a bad idea, anyway.  Even on a system where
shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG
buffers.  If we reduce the number of CLOG buffers from 8 to 4 (i.e. by
50%), we can increase the number of regular buffers from 1024 to 1028
(i.e. by <0.5%).  Maybe you can find a case where that comes out to a
win, but you might have to look pretty hard.

I think you're rejecting the concept too easily.  A setup with very
little shmem is only going to be suitable for low-velocity systems that
are not pushing too many transactions through per second, so it's not
likely to need so many CLOG buffers.  And frankly I'm not that concerned
about what the performance is like: I'm more concerned about whether
PG will start up at all without modifying the system shmem limits,
on systems with legacy values for SHMMAX etc.  Shaving a few
single-purpose buffers to make back what we spent on SSI, for example,
seems like a good idea to me.

Having 32 clog buffers is important at the high end. I don't think
that other complexities should mask that truth and lead to us not
doing anything on this topic for this release.

Please can we make it user configurable? Prepared transactions
require config, lock table size is configurable also, so having SSI
and clog require config is not too much of a stretch. We can then
discuss intelligent autotuning behaviour when we have more time and
more evidence.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#45Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#43)
Re: CLOG contention

On Thu, Jan 5, 2012 at 5:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Thu, Jan 5, 2012 at 2:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I would be in favor of that, or perhaps some other formula (eg, maybe
the minimum should be less than 8 for when you've got very little shmem).

I have some results that show that, under the right set of
circumstances, 8->32 is a win, and I can quantify by how much it wins.
 I don't have any data at all to quantify the cost of dropping the
minimum from 8->6, or from 8->4, and therefore I'm reluctant to do it.
 My guess is that it's a bad idea, anyway.  Even on a system where
shared_buffers is just 8MB, we have 1024 regular buffers and 8 CLOG
buffers.  If we reduce the number of CLOG buffers from 8 to 4 (i.e. by
50%), we can increase the number of regular buffers from 1024 to 1028
(i.e. by <0.5%).  Maybe you can find a case where that comes out to a
win, but you might have to look pretty hard.

I think you're rejecting the concept too easily.  A setup with very
little shmem is only going to be suitable for low-velocity systems that
are not pushing too many transactions through per second, so it's not
likely to need so many CLOG buffers.

Well, if you take the same workload and spread it out over a long
period of time, it will still have just as many CLOG misses or
shared_buffers misses as it had when you did it all at top speed.
Admittedly, you're unlikely to run into the situation where you have
more people wanting to do simultaneous CLOG reads than there are buffers,
but you'll still thrash the cache.

And frankly I'm not that concerned
about what the performance is like: I'm more concerned about whether
PG will start up at all without modifying the system shmem limits,
on systems with legacy values for SHMMAX etc.

After thinking about this a bit, I think the problem is that the
divisor we picked is still too high. Suppose we set num_clog_buffers
= (shared_buffers / 4MB), with a minimum of 4 and maximum of 32. That
way, pretty much anyone who bothers to set shared_buffers to a
non-default value will get 32 CLOG buffers, which should be fine, but
people who are in the 32MB-or-less range can ramp down lower than what
we've allowed in the past. That seems like it might give us the best
of both worlds.
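
In PostgreSQL's units that reduces to a one-liner, since NBuffers counts
shared_buffers in 8kB pages and 4MB is 512 pages. A sketch of the shape it
could take, not necessarily the committed form:

    /* Sketch: shared_buffers / 4MB, clamped to [4, 32].
     * NBuffers is shared_buffers in units of 8kB pages; 4MB = 512 pages. */
    #define CLOGShmemBuffers()  Min(32, Max(4, NBuffers / 512))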

Shaving a few
single-purpose buffers to make back what we spent on SSI, for example,
seems like a good idea to me.

I think if we want to buy back that memory, the best way to do it
would be to add a GUC to disable SSI at startup time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#46Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#44)
Re: CLOG contention

Simon Riggs <simon@2ndQuadrant.com> writes:

Please can we either make it user configurable?

Weren't you just complaining that *I* was overcomplicating things?
I see no evidence to justify inventing a user-visible GUC here.
We have rough consensus on both the need for and the shape of a formula,
with just minor discussion about the exact parameters to plug into it.
Punting the problem off to a GUC is not a better answer.

regards, tom lane

#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#45)
Re: CLOG contention

Robert Haas <robertmhaas@gmail.com> writes:

After thinking about this a bit, I think the problem is that the
divisor we picked is still too high. Suppose we set num_clog_buffers
= (shared_buffers / 4MB), with a minimum of 4 and maximum of 32.

Works for me.

regards, tom lane

#48Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#46)
Re: CLOG contention

On Fri, Jan 6, 2012 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Simon Riggs <simon@2ndQuadrant.com> writes:

Please can we make it user configurable?

Weren't you just complaining that *I* was overcomplicating things?
I see no evidence to justify inventing a user-visible GUC here.
We have rough consensus on both the need for and the shape of a formula,
with just minor discussion about the exact parameters to plug into it.
Punting the problem off to a GUC is not a better answer.

As long as we get 32 buffers on big systems, I have no complaint.

I'm sorry if I moaned at you personally.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#49Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#47)
Re: CLOG contention

On Fri, Jan 6, 2012 at 11:05 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

After thinking about this a bit, I think the problem is that the
divisor we picked is still too high.  Suppose we set num_clog_buffers
= (shared_buffers / 4MB), with a minimum of 4 and maximum of 32.

Works for me.

Done. I tested this on my MacBook Pro and I see no statistically
significant difference from the change on a couple of small pgbench
tests. Hopefully that means this is good on large boxes and at worst
harmless on small ones.

As far as I can see, the trade-off is this: If you increase the number
of CLOG buffers, then your CLOG miss rate will go down. On the other
hand, the cost of looking up a CLOG buffer will go up. At some point,
the reduction in the miss rate will not be enough to pay for a longer
linear search - which also means holding CLogControlLock. I think
it'd probably be worthwhile to think about looking for something
slightly smarter than a linear search at some point, and maybe also
looking for a way to partition the locking better. But, this at least
picks the available low-hanging fruit, which is a good place to
start.
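
The linear search in question is the slot-lookup loop in slru.c, which runs
while CLogControlLock is held; roughly (a paraphrase of the loop visible in
the patches above, not exact source):

    /* Scan every slot to see if the page is already buffered. */
    for (slotno = 0; slotno < shared->num_slots; slotno++)
    {
        if (shared->page_number[slotno] == pageno &&
            shared->page_status[slotno] != SLRU_PAGE_EMPTY)
            return slotno;      /* hit: page already has a buffer */
    }
    /* miss: fall through to victim selection, possibly doing I/O */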

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#50Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#28)
1 attachment(s)
Re: CLOG contention

On Thu, Jan 5, 2012 at 6:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Patch to remove clog contention caused by dirty clog LRU.

v2, minor changes, updated for recent commits

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

background_clean_slru.v2.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 69b6ef3..f3e08e6 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -594,6 +594,26 @@ CheckPointCLOG(void)
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }
 
+/*
+ * Conditionally flush the CLOG LRU.
+ *
+ * When a backend does ExtendCLOG we need to write the CLOG LRU if it is
+ * dirty. Performing I/O while holding XidGenLock prevents new write
+ * transactions from starting. To avoid that we flush the CLOG LRU, if
+ * we think that a page write is due soon, according to a heuristic.
+ *
+ * Note that we're reading ShmemVariableCache->nextXid without a lock
+ * since the exact value doesn't matter as input into our heuristic.
+ */
+void
+CLOGBackgroundFlushLRU(void)
+{
+	TransactionId xid = ShmemVariableCache->nextXid;
+	int		threshold = (CLOG_XACTS_PER_PAGE - (CLOG_XACTS_PER_PAGE / 4));
+
+	if (TransactionIdToPgIndex(xid) > threshold)
+		SlruBackgroundFlushLRUPage(ClogCtl);
+}
 
 /*
  * Make sure that CLOG has room for a newly-allocated XID.
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 30538ff..aea6c09 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -885,6 +885,82 @@ SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid)
 }
 
 /*
+ * Identify the LRU slot but just leave it as it is.
+ *
+ * Control lock must be held at entry, and will be held at exit.
+ */
+static int
+SlruIdentifyLRUSlot(SlruCtl ctl)
+{
+	SlruShared	shared = ctl->shared;
+	int			slotno;
+	int			cur_count;
+	int			bestslot;
+	int			best_delta;
+	int			best_page_number;
+
+	/*
+	 * If we find any EMPTY slot, just select that one. Else locate the
+	 * least-recently-used slot.
+	 *
+	 * Normally the page_lru_count values will all be different and so
+	 * there will be a well-defined LRU page.  But since we allow
+	 * concurrent execution of SlruRecentlyUsed() within
+	 * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
+	 * acquire the same lru_count values.  In that case we break ties by
+	 * choosing the furthest-back page.
+	 *
+	 * In no case will we select the slot containing latest_page_number
+	 * for replacement, even if it appears least recently used.
+	 *
+	 * Notice that this next line forcibly advances cur_lru_count to a
+	 * value that is certainly beyond any value that will be in the
+	 * page_lru_count array after the loop finishes.  This ensures that
+	 * the next execution of SlruRecentlyUsed will mark the page newly
+	 * used, even if it's for a page that has the current counter value.
+	 * That gets us back on the path to having good data when there are
+	 * multiple pages with the same lru_count.
+	 */
+	cur_count = (shared->cur_lru_count)++;
+	best_delta = -1;
+	bestslot = 0;			/* no-op, just keeps compiler quiet */
+	best_page_number = 0;	/* ditto */
+	for (slotno = 0; slotno < shared->num_slots; slotno++)
+	{
+		int			this_delta;
+		int			this_page_number;
+
+		if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
+			return slotno;
+		this_delta = cur_count - shared->page_lru_count[slotno];
+		if (this_delta < 0)
+		{
+			/*
+			 * Clean up in case shared updates have caused cur_count
+			 * increments to get "lost".  We back off the page counts,
+			 * rather than trying to increase cur_count, to avoid any
+			 * question of infinite loops or failure in the presence of
+			 * wrapped-around counts.
+			 */
+			shared->page_lru_count[slotno] = cur_count;
+			this_delta = 0;
+		}
+		this_page_number = shared->page_number[slotno];
+		if ((this_delta > best_delta ||
+			 (this_delta == best_delta &&
+			  ctl->PagePrecedes(this_page_number, best_page_number))) &&
+			this_page_number != shared->latest_page_number)
+		{
+			bestslot = slotno;
+			best_delta = this_delta;
+			best_page_number = this_page_number;
+		}
+	}
+
+	return bestslot;
+}
+
+/*
  * Select the slot to re-use when we need a free slot.
  *
  * The target page number is passed because we need to consider the
@@ -905,11 +981,8 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
 	/* Outer loop handles restart after I/O */
 	for (;;)
 	{
-		int			slotno;
-		int			cur_count;
 		int			bestslot;
-		int			best_delta;
-		int			best_page_number;
+		int			slotno;
 
 		/* See if page already has a buffer assigned */
 		for (slotno = 0; slotno < shared->num_slots; slotno++)
@@ -919,69 +992,14 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
 				return slotno;
 		}
 
-		/*
-		 * If we find any EMPTY slot, just select that one. Else locate the
-		 * least-recently-used slot to replace.
-		 *
-		 * Normally the page_lru_count values will all be different and so
-		 * there will be a well-defined LRU page.  But since we allow
-		 * concurrent execution of SlruRecentlyUsed() within
-		 * SimpleLruReadPage_ReadOnly(), it is possible that multiple pages
-		 * acquire the same lru_count values.  In that case we break ties by
-		 * choosing the furthest-back page.
-		 *
-		 * In no case will we select the slot containing latest_page_number
-		 * for replacement, even if it appears least recently used.
-		 *
-		 * Notice that this next line forcibly advances cur_lru_count to a
-		 * value that is certainly beyond any value that will be in the
-		 * page_lru_count array after the loop finishes.  This ensures that
-		 * the next execution of SlruRecentlyUsed will mark the page newly
-		 * used, even if it's for a page that has the current counter value.
-		 * That gets us back on the path to having good data when there are
-		 * multiple pages with the same lru_count.
-		 */
-		cur_count = (shared->cur_lru_count)++;
-		best_delta = -1;
-		bestslot = 0;			/* no-op, just keeps compiler quiet */
-		best_page_number = 0;	/* ditto */
-		for (slotno = 0; slotno < shared->num_slots; slotno++)
-		{
-			int			this_delta;
-			int			this_page_number;
-
-			if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
-				return slotno;
-			this_delta = cur_count - shared->page_lru_count[slotno];
-			if (this_delta < 0)
-			{
-				/*
-				 * Clean up in case shared updates have caused cur_count
-				 * increments to get "lost".  We back off the page counts,
-				 * rather than trying to increase cur_count, to avoid any
-				 * question of infinite loops or failure in the presence of
-				 * wrapped-around counts.
-				 */
-				shared->page_lru_count[slotno] = cur_count;
-				this_delta = 0;
-			}
-			this_page_number = shared->page_number[slotno];
-			if ((this_delta > best_delta ||
-				 (this_delta == best_delta &&
-				  ctl->PagePrecedes(this_page_number, best_page_number))) &&
-				this_page_number != shared->latest_page_number)
-			{
-				bestslot = slotno;
-				best_delta = this_delta;
-				best_page_number = this_page_number;
-			}
-		}
+		bestslot = SlruIdentifyLRUSlot(ctl);
 
 		/*
-		 * If the selected page is clean, we're set.
+		 * If the selected page is clean or empty, we're set.
 		 */
-		if (shared->page_status[bestslot] == SLRU_PAGE_VALID &&
-			!shared->page_dirty[bestslot])
+		if (shared->page_status[bestslot] == SLRU_PAGE_EMPTY ||
+			(shared->page_status[bestslot] == SLRU_PAGE_VALID &&
+			!shared->page_dirty[bestslot]))
 			return bestslot;
 
 		/*
@@ -1067,6 +1085,39 @@ SimpleLruFlush(SlruCtl ctl, bool checkpoint)
 }
 
 /*
+ * Make sure the next victim buffer is clean, so that the next caller of
+ * SlruSelectLRUPage does not require I/O.
+ */
+void
+SlruBackgroundFlushLRUPage(SlruCtl ctl)
+{
+	SlruShared	shared = ctl->shared;
+	int			bestslot;
+
+	/*
+	 * Notice this takes only a shared lock on the ControlLock.
+	 * We aren't going to change the page/slot allocation, only
+	 * write if needed and reset the dirty status. This is OK
+	 * as long as only one process ever calls this, the bgwriter.
+	 */
+	LWLockAcquire(shared->ControlLock, LW_SHARED);
+
+	bestslot = SlruIdentifyLRUSlot(ctl);
+
+	/*
+	 * If the selected page is valid and dirty then write it out.
+	 * It's possible that the page is already write-busy, or in the worst
+	 * case still read-busy.  In those cases assume that the write we
+	 * wanted to do just happened and we can go.
+	 */
+	if (shared->page_status[bestslot] == SLRU_PAGE_VALID &&
+		shared->page_dirty[bestslot])
+		SlruInternalWritePage(ctl, bestslot, NULL);
+
+	LWLockRelease(shared->ControlLock);
+}
+
+/*
  * Remove all segments before the one holding the passed page number
  */
 void
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 1f8d2d6..66eee36 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -265,6 +265,7 @@ BackgroundWriterMain(void)
 		 * Do one cycle of dirty-buffer writing.
 		 */
 		BgBufferSync();
+		CLOGBackgroundFlushLRU();
 
 		/* Nap for the configured time. */
 		BgWriterNap();
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index bed3b8c..6376464 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -40,6 +40,7 @@ extern void StartupCLOG(void);
 extern void TrimCLOG(void);
 extern void ShutdownCLOG(void);
 extern void CheckPointCLOG(void);
+extern void CLOGBackgroundFlushLRU(void);
 extern void ExtendCLOG(TransactionId newestXact);
 extern void TruncateCLOG(TransactionId oldestXact);
 
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 41cd484..94d4247 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -144,6 +144,7 @@ extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
 						   TransactionId xid);
 extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
 extern void SimpleLruFlush(SlruCtl ctl, bool checkpoint);
+extern void SlruBackgroundFlushLRUPage(SlruCtl ctl);
 extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
 
 typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index db6380f..c17ae29 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -264,6 +264,11 @@ extern pg_time_t GetLastSegSwitchTime(void);
 extern XLogRecPtr RequestXLogSwitch(void);
 
 /*
+ * Exported to support background writing
+ */
+extern void CLOGBackgroundFlushLRU(void);
+
+/*
  * These aren't in xlog.h because I'd rather not include fmgr.h there.
  */
 extern Datum pg_start_backup(PG_FUNCTION_ARGS);
#51Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#50)
Re: CLOG contention

On Thu, Jan 12, 2012 at 4:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, Jan 5, 2012 at 6:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Patch to remove clog contention caused by dirty clog LRU.

v2, minor changes, updated for recent commits

This no longer applies to file src/backend/postmaster/bgwriter.c, due
to the latch code, and I'm not confident I know how to fix it.

Cheers,

Jeff