parametric block size?

Started by Fabien over 11 years ago · 17 messages
#1 Fabien
coelho@cri.ensmp.fr

Hello devs,

The default blocksize is currently 8 kB, which is not necessarily optimal
for all setups, especially with SSDs, where the latency is much lower than
with HDDs.

There is a case for different values with significant impact on
performance (up to a not-to-be-sneezed-at 10% on a pgbench run on an SSD,
see http://www.cybertec.at/postgresql-block-sizes-getting-started/), and
ISTM that the ability to align the PostgreSQL block size to the underlying
FS/HW block size would be nice.

This is currently possible, but it requires recompiling and maintaining
distinct executables for various block sizes. This is annoying, so most
admins will not bother.

ISTM that a desirable and reasonably simple to implement feature would be
to be able to set the blocksize at "initdb" time, and "postgres" could use
the value found in the database instead of a compile-time one.
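
As a minimal illustration of where this would hook in: pg_control already
records the initdb-time page size, and startup currently refuses to run on
a mismatch (simplified from the real check in xlog.c); the idea would be
to adopt the stored value instead of rejecting it:

    if (ControlFile->blcksz != BLCKSZ)
        ereport(FATAL,
                (errmsg("database files are incompatible with server"),
                 errdetail("The database cluster was initialized with "
                           "BLCKSZ %d, but the server was compiled with "
                           "BLCKSZ %d.",
                           ControlFile->blcksz, BLCKSZ)));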

More advanced features, but with much more impact on the code, would be to
be able to change the size at database/table level.

Any thoughts?

--
Fabien.

#2 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fabien (#1)
Re: parametric block size?

Fabien wrote:

> ISTM that a desirable and reasonably simple to implement feature
> would be to be able to set the blocksize at "initdb" time, and
> "postgres" could use the value found in the database instead of a
> compile-time one.

I think you will find it more difficult to implement than it seems at
first. For one thing, there are several macros that depend on the block
size and the algorithms involved cannot work with dynamic sizes;
consider MaxIndexTuplesPerPage, which is used in PageIndexMultiDelete()
for instance. That value is used to allocate an array on the stack,
but that doesn't work if the array size is dynamic. (Actually it works
almost everywhere, but that feature is not in C89 and thus it fails on
Windows.) That shouldn't be a problem, you say, just palloc() the array
-- except that that function is called within critical sections in some
places (e.g. _bt_delitems_vacuum) and you cannot use palloc there.
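
To make the constraint concrete, here is a stand-alone sketch of the
pattern (ITEM_OVERHEAD is a made-up per-item cost, not a PostgreSQL
symbol; the real macro divides the usable page area by the aligned
tuple-plus-line-pointer size):

    #define BLCKSZ 8192                  /* compile-time constant today */
    #define ITEM_OVERHEAD 32             /* invented for illustration */
    #define MaxIndexTuplesPerPage (BLCKSZ / ITEM_OVERHEAD)

    static void
    page_index_multi_delete(void)
    {
        /* Legal C89 because the bound is a compile-time constant.  With
         * a run-time block size this would become a variable-length
         * array (C99, which MSVC rejects), and palloc() is no substitute
         * inside a critical section such as _bt_delitems_vacuum(). */
        int itemidbase[MaxIndexTuplesPerPage];

        (void) itemidbase;
    }

    int
    main(void)
    {
        page_index_multi_delete();
        return 0;
    }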

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#3 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Alvaro Herrera (#2)
Re: parametric block size?

Hello Alvaro,

>> ISTM that a desirable and reasonably simple to implement feature
>> would be to be able to set the blocksize at "initdb" time, and
>> "postgres" could use the value found in the database instead of a
>> compile-time one.

> I think you will find it more difficult to implement than it seems at
> first. For one thing, there are several macros that depend on the block
> size and the algorithms involved cannot work with dynamic sizes;
> consider MaxIndexTuplesPerPage, which is used in PageIndexMultiDelete()
> for instance. That value is used to allocate an array on the stack,
> but that doesn't work if the array size is dynamic. (Actually it works
> almost everywhere, but that feature is not in C89 and thus it fails on
> Windows.) That shouldn't be a problem, you say, just palloc() the array
> -- except that that function is called within critical sections in some
> places (e.g. _bt_delitems_vacuum) and you cannot use palloc there.

Hmmm. Thanks for your point... indeed there may be implementation
details... not a surprise:-)

Note that I was more asking about the desirability of the feature, the
implementation is another, although also relevant, issue. To me it is
really desirable given the potential performance impact, but maybe we
should not care about 10%?

About your point: if we really have to do without dynamic stack allocation
(C99 is only 15, not ripe for adult use yet, maybe when it turns 18 or 21,
depending on the state:-), a possible way around would be to allocate a
larger area of some MAX_BLCKSZ, with an ifdef for compilers that really
do not support dynamic stack allocation. Moreover, it might be possible
to hide it more or less cleanly in a macro. I had to use "-pedantic
-Werror" to get an error on dynamic stack allocation with "gcc
-std=c89".

--
Fabien.

#4 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fabien COELHO (#3)
Re: parametric block size?

Fabien COELHO wrote:

Hi,

> Note that I was more asking about the desirability of the feature,
> the implementation is another, although also relevant, issue. To me
> it is really desirable given the potential performance impact, but
> maybe we should not care about 10%?

10% performance improvement sounds good, no doubt. What will happen to
performance for people with the same block size? I mean, if you run a
comparison of current HEAD vs. patched with identical BLCKSZ, is there a
decrease in performance? I expect there will be some, although I'm not
sure to what extent. People who pg_upgrade, for example, will be stuck
with whatever blcksz they had on the original installation and so will
be unable to benefit from this improvement. I admit I'm not sure
where the break-even point is, i.e. what loss we're willing to
tolerate. It might be pretty small.

> About your point: if we really have to do without dynamic stack
> allocation (C99 is only 15, not ripe for adult use yet, maybe when
> it turns 18 or 21, depending on the state:-), a possible way around
> would be to allocate a larger area of some MAX_BLCKSZ, with an ifdef
> for compilers that really do not support dynamic stack
> allocation. Moreover, it might be possible to hide it more or less
> cleanly in a macro.

Maybe we could try to use dynamic stack allocation on compilers that
support it, and use your MAX_BLCKSZ idea on the rest. Of course,
finding all problematic code sites might prove difficult. I pointed out
the one case I'm familiar with because of working with similar code
recently.

> I had to use "-pedantic -Werror" to get an error on dynamic stack
> allocation with "gcc -std=c89".

Yeah, I guess in practice it will work everywhere except very old
dinosaurs and Windows. But see a thread elsewhere about supporting
VAXen; we don't appear to be prepared to drop support for dinosaurs just
yet.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#2)
Re: parametric block size?

On Tue, Jul 22, 2014 at 1:22 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

> Fabien wrote:
>
>> ISTM that a desirable and reasonably simple to implement feature
>> would be to be able to set the blocksize at "initdb" time, and
>> "postgres" could use the value found in the database instead of a
>> compile-time one.
>
> I think you will find it more difficult to implement than it seems at
> first. For one thing, there are several macros that depend on the block
> size and the algorithms involved cannot work with dynamic sizes;
> consider MaxIndexTuplesPerPage, which is used in PageIndexMultiDelete()
> for instance. That value is used to allocate an array on the stack,
> but that doesn't work if the array size is dynamic. (Actually it works
> almost everywhere, but that feature is not in C89 and thus it fails on
> Windows.) That shouldn't be a problem, you say, just palloc() the array
> -- except that that function is called within critical sections in some
> places (e.g. _bt_delitems_vacuum) and you cannot use palloc there.

There's a performance argument here as well. Static allocation is
likely faster than palloc, and there are likely many other places
where having things like BLCKSZ or MaxIndexTuplesPerPage as
compile-time constants saves a few cycles. A 10% speedup is nice, but
I wouldn't want to pay 1% for everybody to get back 10% for the people
who are willing to fiddle with the block size.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Robert Haas (#5)
Re: parametric block size?

Resent: previous message was stalled because of a bad "From:".

>>> ISTM that a desirable and reasonably simple to implement feature
>>> would be to be able to set the blocksize at "initdb" time, and
>>> "postgres" could use the value found in the database instead of a
>>> compile-time one.

>> I think you will find it more difficult to implement than it seems at
>> first. [...]

> There's a performance argument here as well. Static allocation is
> likely faster than palloc, and there are likely many other places where
> having things like BLCKSZ or MaxIndexTuplesPerPage as compile-time
> constants saves a few cycles. A 10% speedup is nice, but I wouldn't
> want to pay 1% for everybody to get back 10% for the people who are
> willing to fiddle with the block size.

Yes, I agree that it would not make much sense to have such a feature with
a significant performance penalty for most people.

From what I have seen, ISTM that palloc can be avoided altogether, either
with dynamic stack allocation when supported (that is, in most cases?), or
maybe in some cases by allocating a larger safe area. In that case, the
"block size" setting would really be a "max block size", and all valid
block sizes below it (e.g. for 8 kB: 1, 2, 4 and 8 kB) would be allowed.
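
As a sketch of the kind of validity check implied (MAX_BLCKSZ is again a
hypothetical name):

    #include <stdbool.h>

    #define MAX_BLCKSZ 8192    /* hypothetical compiled-in maximum */

    /* Accept any power-of-two page size from 1 kB up to the maximum. */
    static bool
    blcksz_is_valid(int blcksz)
    {
        return blcksz >= 1024 && blcksz <= MAX_BLCKSZ &&
               (blcksz & (blcksz - 1)) == 0;
    }

    int
    main(void)
    {
        return blcksz_is_valid(4096) ? 0 : 1;
    }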

--
Fabien.

#7 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Alvaro Herrera (#4)
Re: parametric block size?

>> Note that I was more asking about the desirability of the feature,
>> the implementation is another, although also relevant, issue. To me
>> it is really desirable given the potential performance impact, but
>> maybe we should not care about 10%?

> 10% performance improvement sounds good, no doubt. What will happen to
> performance for people with the same block size? I mean, if you run a
> comparison of current HEAD vs. patched with identical BLCKSZ, is there a
> decrease in performance? I expect there will be some, although I'm not
> sure to what extent.

I do not understand the question. Do you mean comparing the current
'compile-time set block size' vs. a hypothetical 'adaptive initdb-time
block size' version, which does not really exist yet?

I cannot answer that, but I would not expect significant differences. If
there were a significant performance impact, that would surely be no good.

> People who pg_upgrade, for example, will be stuck with whatever blcksz
> they had on the original installation and so will be unable to benefit
> from this improvement.

Sure. What I'm looking at is just having a postmaster executable which
tolerates several block sizes, although they must still be set & chosen
when initdb-ing anyway.

> I admit I'm not sure where the break-even point is, i.e. what loss
> we're willing to tolerate. It might be pretty small.

Minimal performance impact wrt the current version, got that!

--
Fabien.

#8 Andres Freund
andres@2ndquadrant.com
In reply to: Fabien (#1)
Re: parametric block size?

Hi,

On 2014-07-22 10:22:53 +0200, Fabien wrote:

> The default blocksize is currently 8 kB, which is not necessarily optimal
> for all setups, especially with SSDs, where the latency is much lower
> than with HDDs.

I don't think that really follows.

> There is a case for different values with significant impact on
> performance (up to a not-to-be-sneezed-at 10% on a pgbench run on an SSD,
> see http://www.cybertec.at/postgresql-block-sizes-getting-started/), and
> ISTM that the ability to align the PostgreSQL block size to the
> underlying FS/HW block size would be nice.

I don't think that benchmark is very meaningful. Way too small a scale,
way too short a runtime (there'll be barely any checkpoints, hot pruning,
or vacuuming at all).

> More advanced features, but with much more impact on the code, would be
> to be able to change the size at database/table level.

That'd be pretty horrible because the size of pages in shared_buffers
wouldn't be uniform anymore.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#8)
Re: parametric block size?

Hello Andres,

>> The default blocksize is currently 8 kB, which is not necessarily
>> optimal for all setups, especially with SSDs, where the latency is much
>> lower than with HDDs.

> I don't think that really follows.

The rationale, which may be proven false, is that with an SSD the latency
penalty for reading and writing randomly vs. sequentially is much lower
than for an HDD, so there is less incentive to group stuff into larger
chunks on that account.

>> There is a case for different values with significant impact on
>> performance (up to a not-to-be-sneezed-at 10% on a pgbench run on an
>> SSD, see http://www.cybertec.at/postgresql-block-sizes-getting-started/),
>> and ISTM that the ability to align the PostgreSQL block size to the
>> underlying FS/HW block size would be nice.

> I don't think that benchmark is very meaningful. Way too small a scale,
> way too short a runtime (there'll be barely any checkpoints, hot pruning,
> or vacuuming at all).

These benchmarks have the merit of existing and of being consistent (the
smaller the blocksize, the better the performance), and ISTM that the
performance results suggest that this is worth investigating.

Possibly the "small" scale means that the data fit in memory, so the
benchmarks as run emphasize the write performance linked to the
INSERTs/UPDATEs.

What would you suggest as meaningful for scale and run time, say on a
dual-core 8GB memory 256GB SSD laptop?

>> More advanced features, but with much more impact on the code, would be
>> to be able to change the size at database/table level.

> That'd be pretty horrible because the size of pages in shared_buffers
> wouldn't be uniform anymore.

Yep, I also thought of that, so I'm not planning to investigate.

--
Fabien.

#10 Andres Freund
andres@2ndquadrant.com
In reply to: Fabien COELHO (#9)
Re: parametric block size?

Hi,

On 2014-07-26 12:50:30 +0200, Fabien COELHO wrote:

>>> The default blocksize is currently 8 kB, which is not necessarily
>>> optimal for all setups, especially with SSDs, where the latency is
>>> much lower than with HDDs.

>> I don't think that really follows.

> The rationale, which may be proven false, is that with an SSD the latency
> penalty for reading and writing randomly vs. sequentially is much lower
> than for an HDD, so there is less incentive to group stuff into larger
> chunks on that account.

A higher number of blocks has overhead unrelated to this, though:
increased waste/lower storage density, as it becomes more frequent that
tuples don't fit into a page; more locks; a higher number of buffer
headers; more toasted rows; smaller toast chunks; more vacuuming/heap
pruning WAL records, ...

Now obviously there's also an inverse to this, otherwise we'd all be
using 1GB page sizes. But I don't think storage latency has much to do
with it - it's imo more about write amplification (i.e. turning a single
row update into an 8/4/16/32 kB write).
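
(To put rough numbers on that: dirtying one 100-byte row writes out a
whole page, so an 8 kB page is roughly 80x amplification at the page
level, and a 4 kB page about half that, before WAL and filesystem
effects.)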

>>> There is a case for different values with significant impact on
>>> performance (up to a not-to-be-sneezed-at 10% on a pgbench run on an
>>> SSD, see
>>> http://www.cybertec.at/postgresql-block-sizes-getting-started/), and
>>> ISTM that the ability to align the PostgreSQL block size to the
>>> underlying FS/HW block size would be nice.

>> I don't think that benchmark is very meaningful. Way too small a scale,
>> way too short a runtime (there'll be barely any checkpoints, hot
>> pruning, or vacuuming at all).

> These benchmarks have the merit of existing and of being consistent (the
> smaller the blocksize, the better the performance), and ISTM that the
> performance results suggest that this is worth investigating.

Well, it's easy to make claims that aren't meaningful with bad
benchmarks.

Those numbers are *far* too low for the presented SSD - invalidating the
entire thing. That's the speed you'd expect for rotating media, not an
SSD. My laptop has the 1TB variant of that disk and I get nearly 10 times
that number of TPS - with a parallel make running, a profiler started,
and assertions enabled.

This isn't an actual benchmark, sorry. It's SEO.

> Possibly the "small" scale means that the data fit in memory, so the
> benchmarks as run emphasize the write performance linked to the
> INSERTs/UPDATEs.

Well, the generated data is 160MB in size. Nobody with a concurrent
write-heavy OLTP load has that little data.

> What would you suggest as meaningful for scale and run time, say on a
> dual-core 8GB memory 256GB SSD laptop?

At the very least scale one hundred - then it likely doesn't fit into the
internal caches on common consumer drives anymore. But more importantly,
the test has to run over several checkpoint cycles, so that hot pruning
and vacuuming are also measured.
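
Concretely, something like the following (numbers merely illustrative;
scale 100 is roughly 1.5GB of data):

    pgbench -i -s 100 bench
    pgbench -c 4 -j 2 -T 1800 -P 60 bench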

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#10)
Re: parametric block size?

>> The rationale, which may be proven false, is that with an SSD the
>> latency penalty for reading and writing randomly vs. sequentially is
>> much lower than for an HDD, so there is less incentive to group stuff
>> into larger chunks on that account.

> A higher number of blocks has overhead unrelated to this, though:
> increased waste/lower storage density, as it becomes more frequent that
> tuples don't fit into a page; more locks; a higher number of buffer
> headers; more toasted rows; smaller toast chunks; more vacuuming/heap
> pruning WAL records, ...
>
> Now obviously there's also an inverse to this, otherwise we'd all be
> using 1GB page sizes. But I don't think storage latency has much to do
> with it - it's imo more about write amplification (i.e. turning a single
> row update into an 8/4/16/32 kB write).

I agree with your interesting discussion above. I do not think it
altogether invalidates my reasoning about latency, page size &
performance, but I may be wrong. On an HDD, writing a page takes more or
less the same time whatever the size of the page, so the incentive is to
try to benefit as much as possible from this write, thus to use larger
pages. On an SSD there is no such incentive: you can write smaller pages
at a lower cost.

Anyway, this needs measurements, not just words.

ISTM that there is a tradeoff. Whether the current 8 kB page size is the
best possible compromise, given the various effects and evolving
hardware, and whether the compromise would happen to be the same for an
HDD and an SSD, does not look obvious to me.

>> These benchmarks have the merit of existing and of being consistent
>> (the smaller the blocksize, the better the performance), and ISTM that
>> the performance results suggest that this is worth investigating.

> Well, it's easy to make claims that aren't meaningful with bad
> benchmarks.

Sure.

The basic claim that I'm making wrt this benchmark is that changing the
block size may have a significant impact on performance, and thus that
this is worth investigating. I think this claim is quite safe, even if
the benchmark is not the best possible one.

>> What would you suggest as meaningful for scale and run time, say on a
>> dual-core 8GB memory 256GB SSD laptop?

> At the very least scale one hundred - then it likely doesn't fit into
> the internal caches on common consumer drives anymore. But more
> importantly, the test has to run over several checkpoint cycles, so that
> hot pruning and vacuuming are also measured.

Ok.

--
Fabien.

#12 Andres Freund
andres@2ndquadrant.com
In reply to: Fabien COELHO (#11)
Re: parametric block size?

On 2014-07-26 19:06:58 +0200, Fabien COELHO wrote:

> The basic claim that I'm making wrt this benchmark is that changing the
> block size may have a significant impact on performance, and thus that
> this is worth investigating. I think this claim is quite safe, even if
> the benchmark is not the best possible one.

Well, you went straight to making it something adjustable at run
time. And I don't see that as being warranted at this point. But further
benchmarks sound like a good idea.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#12)
Re: parametric block size?

>> The basic claim that I'm making wrt this benchmark is that changing
>> the block size may have a significant impact on performance, and thus
>> that this is worth investigating. I think this claim is quite safe,
>> even if the benchmark is not the best possible one.

> Well, you went straight to making it something adjustable at run time.

What I really did was to go straight to asking the question:-)

Up to now I have two answers, or really caveats:

- a varying blocksize implementation should have minimal effect
on performance for users of the default settings;

- the said benchmark may not be that meaningful, so the performance
impact is to be assessed more thoroughly.

> And I don't see that as being warranted at this point. But further
> benchmarks sound like a good idea.

Yep. A 10% potential performance impact looks worth the investigation.

--
Fabien.

#14 Mark Kirkwood
mark.kirkwood@catalyst.net.nz
In reply to: Andres Freund (#8)
Re: parametric block size?

On 26/07/14 21:05, Andres Freund wrote:

>> More advanced features, but with much more impact on the code, would be
>> to be able to change the size at database/table level.

> That'd be pretty horrible because the size of pages in shared_buffers
> wouldn't be uniform anymore.

Possibly stopping at the tablespace level might be more straightforward.
To avoid messing up the pages in shared buffers we'd perhaps need
something like several shared buffer pools - each with either its own
blocksize or associated with a (set of) tablespace(s).

Obviously this sort of thing has a pretty big architecture/code impact,
so it is probably better to consider a 1st iteration with it being
initdb-specifiable only (as that would still be very convenient)!

Regards

Mark

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Fabien COELHO (#13)
Re: parametric block size?

On Sat, Jul 26, 2014 at 1:37 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

>> And I don't see that as being warranted at this point. But further
>> benchmarks sound like a good idea.

> Yep. A 10% potential performance impact looks worth the investigation.

I wonder, though, whether this isn't using a crowbar where some finer
instrument is called for. If, for example, bigger heap blocks give
better performance because a bigger I/O size just plain works better,
well then that's interesting in its own right. But if a bigger or
smaller block size yields better results on index scans, the right
solution might be to change the internal page structure used by that
index. For example, I remember reading a paper a few years back where
the authors found that large page sizes were inefficient because you
had to do a linear scan of all the items on the page; so they added
some kind of btree-like structure within the page and got great
results. So the page size itself wasn't the fundamental issue; it had
more to do with what kind of page layout made sense at one page size
vs. another page size.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Thomas Kellerer
spam_eater@gmx.net
In reply to: Mark Kirkwood (#14)
Re: parametric block size?

> Possibly stopping at the tablespace level might be more straightforward.
> To avoid messing up the pages in shared buffers we'd perhaps need
> something like several shared buffer pools - each with either its own
> blocksize or associated with a (set of) tablespace(s).

This is exactly how Oracle does it. You can specify the blocksize when
creating a tablespace.

For each blocksize a separate buffer cache ("shared buffers" in Postgres
terms) can be configured. So the cache is not maintained at the tablespace
level but at the blocksize level.
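
For reference, it looks roughly like this in Oracle (from memory, so
please check the documentation; names and sizes are illustrative):

    -- a buffer cache for 16K blocks must exist before such a tablespace:
    ALTER SYSTEM SET db_16k_cache_size = 128M;

    CREATE TABLESPACE big_blocks
        DATAFILE '/u01/oradata/big01.dbf' SIZE 1G
        BLOCKSIZE 16K;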

#17 Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Andres Freund (#12)
Re: parametric block size?

Hello Andres,

> But further benchmarks sound like a good idea.

I've started running some benchmarks with pgbench, with varying block &
WAL block sizes. I've done a blog post on a small subset of the results,
focusing on block size with SSDs and on validating the significance of
the figures found; see for more details:
http://blog.coelho.net/database/2014/08/08/postgresql-page-size-for-SSD/

I've also found an old post by Tomas Vondra who did really extensive
tests, including playing around with file system options:
http://www.fuzzy.cz/en/articles/ssd-benchmark-results-read-write-pgbench/

The cumulative and consistent result of all these tests, including
Hans-Jürgen Schönig's short tests, is that reducing the page size on SSDs
significantly increases pgbench-reported performance, by about 10%.

I've also done some tests with HDDs, which are quite disappointing:
PostgreSQL runs in a batch-like pattern, a few seconds at 1000 tps
followed by a catch-up phase of 20 seconds at about 0 (zero) tps, and
back to a new cycle. I'm not sure which parameter to tweak (PostgreSQL
configuration, Linux I/O scheduler, ext4 options, or possibly staying
away from ext4) to get something more stable.

--
Fabien.