Initial 9.2 pgbench write results
Last year at this time, I was investigating things like ext3 vs xfs, how
well Linux's dirty_bytes parameter worked, and how effective a couple of
patches were on throughput & latency. The only patch that ended up
applied for 9.1 was for fsync compaction. That was measurably better in
terms of eliminating backend syncs altogether, and it also pulled up
average TPS a bit on the database scales I picked out to test it on.
That rambling group of test sets is available at
http://highperfpostgres.com/pgbench-results/index.htm
For the first round of 9.2 testing under a write-heavy load, I started
with 9.0 via the yum.postgresql.org packages for SL6, upgraded to 9.1
from there, and then used a source code build of 9.2 HEAD as of Feb 11
(58a9596ed4a509467e1781b433ff9c65a4e5b5ce). Attached is an Excel
spreadsheet showing the major figures, along with a CSV formatted copy
of that data too. Results that are ready so far are available at
http://highperfpostgres.com/results-write-9.2-cf4/index.htm
Most of that is good; here are the best and worst parts of the news in
compact form:
scale=500, db is 46% of RAM
Version Avg TPS
9.0 1961
9.1 2255
9.2 2525
scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
There's usually a tipping point with pgbench results, where the
characteristics change quite a bit as the database exceeds total RAM
size. You can see the background writer statistics change quite a bit
around there too. Last year the sharpest part of that transition
happened when exceeding total RAM; now it's happening just below that.
This test set takes about 26 hours to run in the stripped down form I'm
comparing, which doesn't even bother trying larger than RAM scales like
2000 or 3000 that might also be helpful. Most of the runtime is
spent on the larger scale database tests, which unfortunately are the
interesting ones this year. I'm torn at this point between chasing down
where this regression came from, moving forward with testing the new
patches proposed for this CF, and seeing if this regression also holds
with SSD storage. Obvious big commit candidates to bisect this over are
the bgwriter/checkpointer split (Nov 1) and the group commit changes
(Jan 30). Now I get to pay for not having set this up to run
automatically each week since earlier in the 9.2 development cycle.
If someone else wants to try to replicate the bad part of this, my best
guess for how is to use the same minimal postgresql.conf changes I have
here, and pick your database scale so that the test database just
barely fits into RAM. pgbench gives roughly 16MB of data per unit of
scale, and scale=1000 is 15GB; percentages above are relative to the
16GB of RAM in my server. Client count should be small, number of
physical cores is probably a good starting point (that's 4 in my system;
I didn't test below that). At higher client counts, the general
scalability improvements in 9.2 negate some of this downside.
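Concretely, the scale math works out like this (a sketch; the pgbench invocations in the comments are the standard CLI, but the database name and exact flags are up to you):

```python
# Pick a pgbench scale so the database just barely fits in RAM,
# using the ~16MB per scale unit figure from above.
ram_mb = 16 * 1024           # server RAM (16GB here)
mb_per_scale = 16            # approximate pgbench data per scale unit
scale = ram_mb * 94 // 100 // mb_per_scale   # target ~94% of RAM
print(scale)                 # 962, close to the scale=1000 used above
# Then roughly:
#   pgbench -i -s 1000 pgbench         # initialize at the chosen scale
#   pgbench -c 4 -j 4 -T 600 pgbench   # clients ~= physical cores, 10 min
```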
= Server config =
The main change to the 8 hyperthreaded core test server (Intel i7-870)
for this year is bumping it from 8GB to 16GB of RAM, which effectively
doubles the scale I can reach before things slow dramatically. It's
also been updated to run Scientific Linux 6.0, giving a slightly later
kernel. That kernel does have different defaults for
dirty_background_ratio and dirty_ratio, they're 10% and 20% now
(compared to 5%/10% in last year's tests).
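To put those percentages in context on this 16GB box (an illustration only; the kernel actually applies the ratios to reclaimable memory, so these are upper bounds):

```python
# Rough dirty-memory thresholds implied by the newer kernel defaults.
ram_gb = 16
print(ram_gb * 10 / 100)  # dirty_background_ratio: ~1.6GB before background writeback starts
print(ram_gb * 20 / 100)  # dirty_ratio: ~3.2GB before writers themselves are throttled
```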
Drive set for tests I'm publishing so far is basically the same: 4-port
Areca card with 256MB battery-backed cache, 3 disk RAID0 for the
database, single disk for the WAL, all cheap 7200 RPM drives. The OS is
a separate drive, not connected to the caching controller. That's also
where the pgbench latency data is being written. The idea is that this will be
similar to having around 10 drives in a production server, where you'll
also be using RAID1 for redundancy. I have some numbers brewing for
this system running with an Intel 320 series SSD, too, but they're not
ready yet.
= Test setup =
pgbench-tools has been upgraded to break down its graphs per test set
now, and there's even a configuration option to use client-side
JavaScript to present that in a tab-like interface. Thanks to
Ben Bleything for that one.
Minimal changes were made to the postgresql.conf. shared_buffers=2GB,
checkpoint_segments=64, and I left wal_buffers at its default so that
9.1 got credit for that going up. See
http://highperfpostgres.com/results-write-9.2-cf4/541/pg_settings.txt
for a full list of changes, drive mount options, and important kernel
settings. Much of that data wasn't collected in last year's
pgbench-tools runs.
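Spelled out, the non-default settings named here amount to just this (see the URL above for the full list):

```
shared_buffers = 2GB
checkpoint_segments = 64
# wal_buffers left at the default, which 9.1 auto-tunes upward
```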
= Results commentary =
For the most part the 9.2 results are quite good. The increase at high
client counts is solid, as expected from all the lock refactoring this
release has gotten. The smaller-than-RAM results that particularly
benefited from the 9.1 changes, especially the scale=500 ones, leaped
as much in 9.2 as they did in 9.1. scale=500 and clients=96 is up 58%
from 9.0 to 9.2 so far.
The problems are all around the higher scales. scale=4000 (58GB) was
detuned an average of 1.7% in 9.1, which seemed a fair trade for how
much the fsync compaction helped with worst-case behavior. It drops
another 7.2% on average in 9.2 so far though. The really bad one is
scale=1000 (15GB, so barely fitting in RAM now; very different from
scale=1000 last year). With this new kernel/more RAM/etc., I'm seeing
an average of a 7% TPS drop for the 9.1 changes. The drop from 9.1 to
9.2 is another 26%.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 02/14/2012 01:45 PM, Greg Smith wrote:
scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
A second pass through this data noted that the maximum number of buffers
cleaned by the background writer is <=2785 in 9.0/9.1, while it goes as
high as 17345 in 9.2. The background writer is so busy now it
hits the max_clean limit around 147 times in the slower[1] of the 9.2
runs. That's an average of once every 4 seconds, quite frequent.
Whereas max_clean rarely happens in the comparable 9.0/9.1 results.
This is starting to point my finger more toward this being an unintended
consequence of the background writer/checkpointer split.
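That frequency is easy to check, since these are 10-minute runs:

```python
# Sanity check on "once every 4 seconds": 147 max_clean hits per run.
run_seconds = 10 * 60
max_clean_hits = 147
print(round(run_seconds / max_clean_hits, 1))  # ~4.1 seconds between hits
```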
Thinking out loud, about solutions before the problem is even nailed
down, I wonder if we should consider lowering bgwriter_lru_maxpages now
in the default config? In older versions, the page cleaning work had at
most a 50% duty cycle; it was only running when checkpoints were not.
If we wanted to keep the ceiling on background writer cleaning at the
same level in the default configuration, that would require dropping
bgwriter_lru_maxpages from 100 to 50. That would be roughly the same
amount of maximum churn. It's obviously more complicated than that, but
I think there's a defensible position along those lines to consider.
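For a rough sense of what that ceiling means, here is the implied maximum cleaning throughput, assuming the default bgwriter_delay of 200ms and 8kB pages (check pg_settings for the actual values on a given system):

```python
# Maximum bgwriter cleaning rate implied by bgwriter_lru_maxpages.
def max_clean_mb_per_sec(lru_maxpages, delay_ms=200, page_kb=8):
    rounds_per_sec = 1000 / delay_ms          # bgwriter wakeups per second
    return lru_maxpages * rounds_per_sec * page_kb / 1024

print(max_clean_mb_per_sec(100))  # default ceiling: ~3.9 MB/s
print(max_clean_mb_per_sec(50))   # halved setting: ~2.0 MB/s
```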
As a historical aside, I wonder how much this behavior might have been
to blame for my failing to get spread checkpoints to show a positive
outcome during 9.1 development. The way that was written also kept the
cleaner running during checkpoints. I didn't measure those two changes
individually as much as I did the combination.
[1]: I normally do 3 runs of every scale/client combination, and find
that more useful than a single run lasting 3X as long. The first out of
each of the 3 runs I do at any scale is usually a bit faster than the
later two, presumably due to table and/or disk fragmentation. I've
tried to make this less of a factor in pgbench-tools by iterating
through all requested client counts first, before beginning a second run
of those scale/client combinations. So if the two client counts were 4
and 8, it would be 4/8/4/8/4/8, which works much better than 4/4/4/8/8/8
in terms of fragmentation impacting the average result. Whether it
would be better or worse to eliminate this difference by rebuilding the
whole database multiple times for each scale is complicated. I happen
to like seeing the results with a bit more fragmentation mixed in, to
see how they compare with the fresh database. Since more rebuilds would
also make these tests take much longer than they already do, that's the
tie-breaker that's led to the current testing schedule being the
preferred one.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Tue, Feb 14, 2012 at 3:25 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 02/14/2012 01:45 PM, Greg Smith wrote:
scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
A second pass through this data noted that the maximum number of buffers
cleaned by the background writer is <=2785 in 9.0/9.1, while it goes as high
as 17345 in 9.2. The background writer is so busy now it hits the
max_clean limit around 147 times in the slower[1] of the 9.2 runs. That's
an average of once every 4 seconds, quite frequent. Whereas max_clean
rarely happens in the comparable 9.0/9.1 results. This is starting to point
my finger more toward this being an unintended consequence of the background
writer/checkpointer split.
I guess the question that occurs to me is: why is it busier?
It may be that the changes we've made to reduce lock contention are
allowing foreground processes to get work done faster. When they get
work done faster, they dirty more buffers, and therefore the
background writer gets busier. Also, if the background writer is more
reliably cleaning pages even during checkpoints, that could have the
same effect. Backends write fewer of their own pages, therefore they
get more real work done, which of course means dirtying more pages.
But I'm just speculating here.
Thinking out loud, about solutions before the problem is even nailed down, I
wonder if we should consider lowering bgwriter_lru_maxpages now in the
default config? In older versions, the page cleaning work had at most a 50%
duty cycle; it was only running when checkpoints were not.
Is this really true? I see CheckpointWriteDelay calling BgBufferSync
in 9.1. Background writing would stop during the sync phase and
perhaps slow down a bit during checkpoint writing, but I don't think
it was stopped completely.
I'm curious what vmstat output looks like during your test. I've
found that's a good way to know whether the system is being limited by
I/O, CPU, or locks. It'd also be interesting to know what the %
utilization figures for the disks looked like.
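Something like this running alongside each test would capture it (device names here are just placeholders; substitute the database and WAL drives):

```
vmstat 1 > vmstat.log &
iostat -x 1 sda sdb > iostat.log &
```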
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Feb 18, 2012 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Feb 14, 2012 at 3:25 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 02/14/2012 01:45 PM, Greg Smith wrote:
scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
A second pass through this data noted that the maximum number of buffers
cleaned by the background writer is <=2785 in 9.0/9.1, while it goes as high
as 17345 in 9.2. The background writer is so busy now it hits the
max_clean limit around 147 times in the slower[1] of the 9.2 runs. That's
an average of once every 4 seconds, quite frequent. Whereas max_clean
rarely happens in the comparable 9.0/9.1 results. This is starting to point
my finger more toward this being an unintended consequence of the background
writer/checkpointer split.
I guess the question that occurs to me is: why is it busier?
It may be that the changes we've made to reduce lock contention are
allowing foreground processes to get work done faster. When they get
work done faster, they dirty more buffers, and therefore the
background writer gets busier. Also, if the background writer is more
reliably cleaning pages even during checkpoints, that could have the
same effect. Backends write fewer of their own pages, therefore they
get more real work done, which of course means dirtying more pages.
The checkpointer/bgwriter split allows the bgwriter to do more work,
which is the desired outcome, not an unintended consequence.
The general increase in performance means there is more work to do. So
both things mean there is more bgwriter activity.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 18, 2012 at 3:00 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Sat, Feb 18, 2012 at 7:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Feb 14, 2012 at 3:25 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 02/14/2012 01:45 PM, Greg Smith wrote:
scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
A second pass through this data noted that the maximum number of buffers
cleaned by the background writer is <=2785 in 9.0/9.1, while it goes as high
as 17345 in 9.2. The background writer is so busy now it hits the
max_clean limit around 147 times in the slower[1] of the 9.2 runs. That's
an average of once every 4 seconds, quite frequent. Whereas max_clean
rarely happens in the comparable 9.0/9.1 results. This is starting to point
my finger more toward this being an unintended consequence of the background
writer/checkpointer split.
I guess the question that occurs to me is: why is it busier?
It may be that the changes we've made to reduce lock contention are
allowing foreground processes to get work done faster. When they get
work done faster, they dirty more buffers, and therefore the
background writer gets busier. Also, if the background writer is more
reliably cleaning pages even during checkpoints, that could have the
same effect. Backends write fewer of their own pages, therefore they
get more real work done, which of course means dirtying more pages.
The checkpointer/bgwriter split allows the bgwriter to do more work,
which is the desired outcome, not an unintended consequence.
The general increase in performance means there is more work to do. So
both things mean there is more bgwriter activity.
I think you're saying pretty much the same thing I was saying, so I agree.
Here's what's bugging me. Greg seemed to be assuming that the
busyness of the background writer might be the cause of the
performance drop-off he measured on certain test cases. But you and I
both seem to feel that the busyness of the background writer is
intentional and desirable. Supposing we're right, where's the
drop-off coming from? *scratches head*
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 19, 2012 at 4:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
Here's what's bugging me. Greg seemed to be assuming that the
busyness of the background writer might be the cause of the
performance drop-off he measured on certain test cases. But you and I
both seem to feel that the busyness of the background writer is
intentional and desirable. Supposing we're right, where's the
drop-off coming from? *scratches head*
Any source of logical I/O becomes physical I/O when we run short of
memory. So if we're using more memory for any reason that will cause
more swapping. Or if we are doing things like consulting the vmap that
would also cause a problem.
I notice the issue is not as bad for 9.2 in the scale 4000 case, so it
seems more likely that we're just hitting the tipping point earlier on
9.2 and that scale 1000 is right in the middle of the tipping point.
What it does show quite clearly is that the extreme high end response
time variability is still there. It also shows that insufficient
performance testing has been done on this release so far. We may have
"solved" some scalability problems but we've completely ignored real
world performance issues and, as Greg says, we now get to pay the
price for not having done that earlier.
I've argued previously that we should have a performance tuning phase
at the end of the release cycle, now it looks that has become a
necessity. Which will turn out to be a good thing in the end, I'm
sure, even if it's a little worrying right now.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 14, 2012 at 6:45 PM, Greg Smith <greg@2ndquadrant.com> wrote:
Minimal changes were made to the postgresql.conf. shared_buffers=2GB,
checkpoint_segments=64, and I left wal_buffers at its default so that 9.1
got credit for that going up. See
http://highperfpostgres.com/results-write-9.2-cf4/541/pg_settings.txt for a
full list of changes, drive mount options, and important kernel settings.
Much of that data wasn't collected in last year's pgbench-tools runs.
Please retest with wal_buffers 128MB, checkpoint_segments 1024
Best to remove any tunable resource bottlenecks before we attempt
further analysis.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/18/2012 02:35 PM, Robert Haas wrote:
I see CheckpointWriteDelay calling BgBufferSync
in 9.1. Background writing would stop during the sync phase and
perhaps slow down a bit during checkpoint writing, but I don't think
it was stopped completely.
The sync phase can be pretty long here--that's where the worst-case
latency figures lasting many seconds are coming from. When checkpoints
are happening every 60 seconds as in some of these cases, that can
represent a decent percentage of time. Similarly, when the OS cache
fills, the write phase might block for a larger period of time than
normally expected. But, yes, you're right that my "BGW is active twice
as much in 9.2" comments are overstating the reality here.
I'm collecting one last bit of data before posting another full set of
results, but I'm getting more comfortable the issue here is simply
changes in the BGW behavior. The performance regression tracks the
background writer maximum intensity. I can match the original 9.1
performance just by dropping bgwriter_lru_maxpages, in cases where TPS
drops significantly between 9.2 and 9.1. At the same time, some cases
that improve between 9.1 and 9.2 perform worse if I do that. If whether
9.2 gains or loses compared to 9.1 is adjustable with a tunable
parameter, with some winning and others losing at the defaults, that path
forward is reasonable to deal with. The fact that pgbench is an unusual
write workload is well understood, and I can write something documenting
this possibility before 9.2 is officially released. I'm a lot less
stressed that there's really a problem here now.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On 02/19/2012 05:37 AM, Simon Riggs wrote:
Please retest with wal_buffers 128MB, checkpoint_segments 1024
The test parameters I'm using aim to run through several checkpoint
cycles in 10 minutes of time. Bumping up against the ugly edges of
resource bottlenecks is part of the test. Increasing
checkpoint_segments like that would lead to time driven checkpoints,
either 1 or 2 of them during 10 minutes. I'd have to increase the total
testing time by at least 5X to get an equal workout of the system. That
would be an interesting data point to collect if I had a few weeks to
focus just on that test. I think that's more than pgbench testing
deserves though.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Sun, Feb 19, 2012 at 11:12 PM, Greg Smith <greg@2ndquadrant.com> wrote:
I'm collecting one last bit of data before posting another full set of
results, but I'm getting more comfortable the issue here is simply changes
in the BGW behavior. The performance regression tracks the background
writer maximum intensity.
That's really quite fascinating... but it seems immensely
counterintuitive. Any idea why? BufFreelist contention between the
background writer and regular backends leading to buffer allocation
stalls, maybe?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I've updated http://highperfpostgres.com/results-write-9.2-cf4/index.htm
with more data including two alternate background writer configurations.
Since the sensitive part of the original results was scales of 500 and
1000, I've also gone back and added scale=750 runs to all results.
Quick summary is that I'm not worried about 9.2 performance now, I'm
increasingly confident that the earlier problems I reported on are just
bad interactions between the reinvigorated background writer and
workloads that are tough to write to disk. I'm satisfied I understand
these test results well enough to start evaluating the pending 9.2
changes in the CF queue I wanted to benchmark.
Attached are now useful client and scale graphs. All of 9.0, 9.1, and
9.2 have been run now with exactly the same scales and clients loads, so
the graphs of all three versions can be compared. The two 9.2
variations with alternate parameters were only run at some scales, which
means you can't compare them usefully on the clients graph; only on the
scaling one. They are very obviously in a whole different range of that
graph, just ignore the two that are way below the rest.
Here's a repeat of the interesting parts of the data set with new
points. Here "9.2N" is with no background writer, while "9.2H" has a
background writer set to half strength: bgwriter_lru_maxpages = 50. I
picked one middle client level out of the scale=750 results just to
focus better; relative results are not sensitive to that:
scale=500, db is 46% of RAM
Version Avg TPS
9.0 1961
9.1 2255
9.2 2525
9.2N 2267
9.2H 2300
scale=750, db is 69% of RAM; clients=16
Version Avg TPS
9.0 1298
9.1 1387
9.2 1477
9.2N 1489
9.2H 943
scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)
9.2N 516
9.2H 400
The fact that almost all the performance regression against 9.2 goes
away if the background writer is disabled is an interesting point. That
results actually get worse at scale=500 without the background writer is
another. That pair of observations makes me feel better that there's a
tuning trade-off here being implicitly made by having a more active
background writer in 9.2; it helps on some cases, hurts others. That I
can deal with. Everything lines up perfectly at scale=500 if I
reorder by TPS:
scale=500, db is 46% of RAM
Version Avg TPS
9.2 2525
9.2H 2300
9.2N 2267
9.1 2255
9.0 1961
That makes you want to say "the more background writer the better", right?
The new scale=750 numbers are weird though, and they keep this from
being so clear. I ran the parts that were most weird twice just because
they seemed so odd, and it was repeatable. Just like scale=500, with
scale=750 the 9.2/no background writer has the best performance of any
run. But the half-intensity one has the worst! It would be nice if it
fell between the 9.2 and 9.2N results, instead it's at the other edge.
The only lesson I can think to draw here is that once we're in the area
where performance is dominated by the trivia around exactly how writes
are scheduled, the optimal ordering of writes is just too complicated to
model that easily. The rest of this is all speculation on how to fit
some ideas to this data.
Going back to 8.3 development, one of the design trade-offs I was very
concerned about was not wasting resources by having the BGW run too
often. Even then it was clear that for these simple pgbench tests,
there were situations where letting backends do their own writes was
better than speculative writes from the background writer. The BGW
constantly risks writing a page that will be re-dirtied before it goes
to disk. That can't be too common though in the current design, since
it avoids things with high usage counts. (The original BGW wrote things
that were used recently, and that was a measurable problem by 8.3.)
I think an even bigger factor now is that the BGW writes can disturb
write ordering/combining done at the kernel and storage levels. It's
painfully obvious now how much PostgreSQL relies on that to get good
performance. All sorts of things break badly if we aren't getting
random writes scheduled to optimize seek times, in as many contexts as
possible. It doesn't seem unreasonable that background writer writes
can introduce some delay into the checkpoint writes, just by adding more
random components to what is already a difficult to handle write/sync
series. What I think these results are showing is that background
writer writes can deoptimize other forms of writes.
A second fact that's visible from the TPS graphs over the test run, and
obvious if you think about it, is that BGW writes force data to physical
disk earlier than they otherwise might go there. That's a subtle
pattern in the graphs. I expect that, though, given that one element of
"do I write this?" in Linux is how old the write is. Wondering about this
really emphasises that I need to either add graphing of vmstat/iostat
data to these graphs or switch to a benchmark that does that already. I
think I've got just enough people using pgbench-tools to justify the
feature even if I plan to use the program less.
I also have a good answer to "why does this only happen at these
scales?" now. At scales below here, the database is so small relative
to RAM that it just all fits all the time. That includes the indexes
being very small, so not many writes generated by their dirty blocks.
At higher scales, the write volume becomes seek bound, and the result is
so low that checkpoints become timeout based. So there are
significantly fewer of them. At the largest scales and client counts
here, there isn't a single checkpoint actually finished in some of these
10 minute long runs. One doesn't even start until 5 minutes have gone
by, and the checkpoint writes are so slow they take longer than 5
minutes to trickle out and sync, with all the competing I/O from
backends mixed in. Note that the "clients-sets" graph still shows a
strong jump from 9.0 to 9.1 at high client counts; I'm pretty sure
that's the fsync compaction at work.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, Feb 23, 2012 at 11:17 AM, Greg Smith <greg@2ndquadrant.com> wrote:
A second fact that's visible from the TPS graphs over the test run, and
obvious if you think about it, is that BGW writes force data to physical
disk earlier than they otherwise might go there. That's a subtle pattern in
the graphs. I expect that though, given one element to "do I write this?"
in Linux is how old the write is. Wondering about this really emphasises
that I need to either add graphing of vmstat/iostat data to these graphs or
switch to a benchmark that does that already. I think I've got just enough
people using pgbench-tools to justify the feature even if I plan to use the
program less.
For me, that is the key point.
For the test being performed there is no value in things being written
earlier, since doing so merely overexercises the I/O.
We should note that there is no feedback process in the bgwriter to do
writes only when the level of dirty writes by backends is high enough
to warrant the activity. Note that Linux has a demand paging
algorithm; it doesn't just clean all of the time. That's the reason
you still see some swapping: that activity is what wakes the
pager. We don't count the number of dirty writes by backends; we just
keep cleaning even when nobody wants it.
Earlier, I pointed out that bgwriter is being woken any time a user
marks a buffer dirty. That is overkill. The bgwriter should stay
asleep until a threshold number (TBD) of dirty writes is reached, then
it should wake up and do some cleaning. Having a continuously active
bgwriter is pointless for some workloads, whereas for others it
helps. So having a sleeping bgwriter isn't just a power management
issue, it's a performance issue in some cases.
/*
* Even in cases where there's been little or no buffer allocation
* activity, we want to make a small amount of progress through the buffer
* cache so that as many reusable buffers as possible are clean after an
* idle period.
*
* (scan_whole_pool_milliseconds / BgWriterDelay) computes how many times
* the BGW will be called during the scan_whole_pool time; slice the
* buffer pool into that many sections.
*/
Since scan_whole_pool_milliseconds is set to 2 minutes we scan the
whole bufferpool every 2 minutes, no matter how big the bufferpool,
even when nothing else is happening. Not cool.
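Working that through for the shared_buffers=2GB used in these tests (constants taken from the quoted comment; everything else is arithmetic):

```python
# Per-wakeup scan implied by the whole-pool-every-2-minutes behavior.
scan_whole_pool_ms = 120_000          # 2 minutes
bgwriter_delay_ms = 200               # default wakeup interval
nbuffers = 2 * 1024**3 // 8192        # 2GB of 8kB buffers = 262144
rounds_per_scan = scan_whole_pool_ms // bgwriter_delay_ms   # 600 wakeups per scan
print(nbuffers // rounds_per_scan)    # ~436 buffers examined per wakeup, even idle
```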
I think it would be sensible to have bgwriter stop when 10% of
shared_buffers are clean, rather than keep going even when no dirty
writes are happening.
So my suggestion is that we put in an additional clause into
BgBufferSync() to allow min_scan_buffers to fall to zero when X% of
shared buffers is clean. After that bgwriter should sleep. And be
woken again only by a dirty write from a user backend. That sounds like
the clean ratio will flip between 0 and X%, but the first dirty write will
occur long before we hit zero, so that will cause the bgwriter to attempt
to maintain a reasonably steady-state clean ratio.
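As a toy sketch (not Postgres code) of that policy, with the 10% figure from above and everything else an assumption:

```python
# Proposed hysteresis: bgwriter sleeps once X% of shared_buffers are
# clean, and is woken only by a backend dirtying a buffer.
def bgwriter_should_sleep(clean_buffers, nbuffers, clean_target_pct=10):
    return clean_buffers * 100 >= nbuffers * clean_target_pct

print(bgwriter_should_sleep(30000, 262144))  # ~11% clean -> True, go to sleep
print(bgwriter_should_sleep(10000, 262144))  # ~4% clean -> False, keep cleaning
```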
I would also take a wild guess that the 750 results are due to
freelist contention. To assess that, I post again the patch shown on
other threads designed to assess the overall level of freelist lwlock
contention.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
freelist_wait_stats.v2.patch (text/x-diff; charset=US-ASCII)
On 02/23/2012 07:36 AM, Simon Riggs wrote:
Since scan_whole_pool_milliseconds is set to 2 minutes we scan the
whole bufferpool every 2 minutes, no matter how big the bufferpool,
even when nothing else is happening. Not cool.
It's not quite that bad. Once the BGW has circled around the whole
buffer pool, such that it's swept so far ahead it's reached the clock
sweep strategy point, it stops. So when the system is idle, it creeps
forward until it's scanned the pool once. Then it still wakes up
regularly, but the computation of the bufs_to_lap lap number will reach
0. That aborts running the main buffer scanning loop, so each wakeup
only burns a bit of CPU time and a lock on BufFreelistLock--both
of which are surely to spare if the system is idle. While I can agree with
your power management argument, I don't see much of a performance win
from eliminating this bit.
The goal was to end up with a fully cleaned pool, ready to absorb a
jump from idle to a traffic spike. The logic behind where the "magic
constants" controlling it came from was all laid out at
http://archives.postgresql.org/pgsql-hackers/2007-09/msg00214.php
There's a bunch of code around that whole computation that only executes
if you enable BGW_DEBUG. I left that in there in case somebody wanted
to fiddle with this specific tuning work again, since it took so long to
get right. That was the last feature change made to the 8.3 background
writer tuning work.
I was content at that time to cut the minimal activity level in half
relative to what it was in 8.2, and that measured well enough. It's
hard to find compelling benchmark workloads where the background writer
really works well, though. I hope to look more at the interesting
cases I found here, now that I seem to have both positive and
negative results for background writer involvement.
As for freelist contention, I wouldn't expect that to be a major issue
in the cases I was testing. The background writer is just one of many
backends all contending for that lock. When there are dozens of backends
all grabbing, I'd think its individual impact would be a small slice of
the total activity. Of course, I'll reserve arguing that point until
I've benchmarked it.
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, Feb 23, 2012 at 8:44 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 02/23/2012 07:36 AM, Simon Riggs wrote:
Since scan_whole_pool_milliseconds is set to 2 minutes, we scan the
whole buffer pool every 2 minutes, no matter how big the pool,
even when nothing else is happening. Not cool.

It's not quite that bad. Once the BGW has circled around the whole buffer
pool, such that it's swept so far ahead it's reached the clock sweep
strategy point, it stops. So when the system is idle, it creeps forward
until it's scanned the pool once. Then it still wakes up regularly, but
the computation of the bufs_to_lap lap number will reach 0. That aborts
running the main buffer scanning loop, so each wakeup only burns a bit
of CPU time and a lock on BufFreelistLock--both of which are surely
to spare if the system is idle. While I can agree with your power management
argument, I don't see much of a performance win from eliminating this bit.
The behaviour is wrong though, because we're scanning for too long
when the system goes quiet, and then we wake up again too quickly - as
soon as a new buffer allocation happens.
We don't need to clean the complete buffer pool in 2 minutes. That's
exactly what a checkpoint does, and we slowed checkpoints down so they
wouldn't do that. So we're still writing way too much.
So the proposal was to make it scan only 10% of the bufferpool, not
100%, then sleep. We only need some clean buffers, we don't need *all*
buffers clean, especially on very large shared_buffers. And we should
wake up only when we see an effect on user backends, i.e. a dirty
write - which is the event the bgwriter is designed to avoid.
The last bit is the key - waking up only when a dirty write occurs. If
they aren't happening we literally don't need the bgwriter - as your
tests show.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 23, 2012 at 3:44 PM, Greg Smith <greg@2ndquadrant.com> wrote:
It's not quite that bad. Once the BGW has circled around the whole buffer
pool, such that it's swept so far ahead it's reached the clock sweep
strategy point, it stops. So when the system is idle, it creeps forward
until it's scanned the pool once. Then, it still wakes up regularly, but
the computation of the bufs_to_lap lap number will reach 0. That aborts
running the main buffer scanning loop, so it only burns a bit of CPU time
and a lock on BufFreelistLock each time it wakes--both of which are surely
to spare if the system is idle. While I can agree with your power management
argument, I don't see much of a performance win from eliminating this bit.
I think that goal of ending up with a clean buffer pool is a good one,
and I'm loathe to give it up. On the other hand, I agree with Simon
that it does seem a bit wasteful to scan the entire buffer arena
because there's one dirty buffer somewhere. But maybe we should look
at that as a reason to improve the way we find dirty buffers, rather
than a reason not to worry about writing them out. There's got to be
a better way than scanning the whole buffer pool. Honestly, though,
that feels like 9.3 material. So far there's no evidence that we've
introduced any regressions that can't be compensated for by tuning,
and this doesn't feel like the right time to embark on a bunch of new
engineering projects.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 23, 2012 at 11:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
this doesn't feel like the right time to embark on a bunch of new
engineering projects.
IMHO this is exactly the right time to do full system tuning. Only
when we have major projects committed can we move towards measuring
things and correcting deficiencies.
Doing tuning last is a natural consequence of the first rule of
tuning: Don't. That means we have to wait and see what problems emerge
and then fix them, so there has to be a time period when this is
allowed. This is exactly the same on any commercial implementation
project - you do tuning at the end before release.
I fully accept that this is not a time for heavy lifting. But it is a
time when we can apply a few low-invasive patches to improve things.
Tweaking the bgwriter is not exactly a big or complex thing. We will
be making many other small tweaks and fixes for months yet, so let's
just regard tuning as performance bug fixing and get on with it,
please.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 14, 2012 at 12:25 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 02/14/2012 01:45 PM, Greg Smith wrote:
scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0 535
9.1 491 (-8.4% relative to 9.0)
9.2 338 (-31.2% relative to 9.1)

A second pass through this data noted that the maximum number of buffers
cleaned by the background writer is <=2785 in 9.0/9.1, while it goes as
high as 17345 in 9.2.
There is something strange about the data for Set 4 (9.1) at scale 1000.
The number of buf_alloc varies a lot from run to run in that series
(by a factor of 60 from max to min).
But the TPS doesn't vary by very much.
How can that be? If a transaction needs a page that is not in the
cache, it needs to allocate a buffer. So the only thing that could
lower the allocation would be a higher cache hit rate, right? How
could there be so much variation in the cache hit rate from run to run
at the same scale?
Cheers,
Jeff
On Fri, Feb 24, 2012 at 5:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Feb 23, 2012 at 11:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
this doesn't feel like the right time to embark on a bunch of new
engineering projects.

IMHO this is exactly the right time to do full system tuning. Only
when we have major projects committed can we move towards measuring
things and correcting deficiencies.
Ideally we should measure things as we do them. Of course there will
be cases that we fail to test which slip through the cracks, as Greg
is now finding, and I agree we should try to fix any problems that we
turn up during testing. But, as I said before, so far Greg hasn't
turned up anything that can't be fixed by adjusting settings, so I
don't see a compelling case for change on that basis.
As a side point, there's no obvious reason why the problems Greg is
identifying here couldn't have been identified before committing the
background writer/checkpointer split. The fact that we didn't find
them then suggests to me that we need to be more, not less, cautious in
making further changes in this area.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 27, 2012 at 5:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Feb 24, 2012 at 5:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Thu, Feb 23, 2012 at 11:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
this doesn't feel like the right time to embark on a bunch of new
engineering projects.

IMHO this is exactly the right time to do full system tuning. Only
when we have major projects committed can we move towards measuring
things and correcting deficiencies.

Ideally we should measure things as we do them. Of course there will
be cases that we fail to test which slip through the cracks, as Greg
is now finding, and I agree we should try to fix any problems that we
turn up during testing. But, as I said before, so far Greg hasn't
turned up anything that can't be fixed by adjusting settings, so I
don't see a compelling case for change on that basis.
That isn't the case. We have evidence that the current knobs are
hugely ineffective in some cases.
Turning the bgwriter off is hardly "adjusting a setting"; it's
admitting that there is no useful setting.
I've suggested changes that aren't possible by tuning the current knobs.
As a side point, there's no obvious reason why the problems Greg is
identifying here couldn't have been identified before committing the
background writer/checkpointer split. The fact that we didn't find
them then suggests to me that we need to be more not less cautious in
making further changes in this area.
The split was essential to avoid the bgwriter action being forcibly
turned off during checkpoint sync. The fact that forcibly turning it
off is in some cases a benefit doesn't alter the fact that it was in
many cases a huge negative. If it's on, you can always turn it off, but
if it was not available at all, there was no tuning option. I see no
negative aspect to the split.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Feb 27, 2012 at 3:50 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
That isn't the case. We have evidence that the current knobs are
hugely ineffective in some cases.

Turning the bgwriter off is hardly "adjusting a setting"; it's
admitting that there is no useful setting.

I've suggested changes that aren't possible by tuning the current knobs.
OK, fair point. But I don't think any of us - Greg included - have an
enormously clear idea why turning the background writer off is
improving performance in some cases. I think we need to understand
that better before we start changing things.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company