Initdb-time block size specification

Started by David Christensen · almost 3 years ago · 45 messages · pgsql-hackers
#1David Christensen
david.christensen@crunchydata.com

Hi,

As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
variable block sizes; basically instead of BLCKSZ being a compile-time
constant, a single set of binaries can support all of the block sizes
Pg can support, using the value stored in pg_control as the basis.
(Possible future plans would be to make this something even more
dynamic, such as configured per tablespace, but this is out of scope;
this just sets up the infrastructure for this.)

Whereas we had traditionally used BLCKSZ to indicate the compile-time selected
block size, this commit adjusted things so the cluster block size can be
selected at initdb time.

In order to code for this, we introduce a few new defines:

- CLUSTER_BLOCK_SIZE is the blocksize for this cluster itself. This is not
valid until BlockSizeInit() has been called in the given backend, which we do as
early as possible by parsing the ControlFile and using the blcksz field.

- MIN_BLOCK_SIZE and MAX_BLOCK_SIZE are the limits for the selectable block
size. It is required that CLUSTER_BLOCK_SIZE is a power of 2 between these two
constants.

- DEFAULT_BLOCK_SIZE is the moral equivalent of BLCKSZ; it is the built-in
default value. This is used in a few places that just needed a buffer of an
arbitrary size, but the dynamic value CLUSTER_BLOCK_SIZE should almost always be
used instead.

- CLUSTER_RELSEG_SIZE is used instead of RELSEG_SIZE, since internally we are
storing the segment size in terms of number of blocks. RELSEG_SIZE is still
kept, but is used in terms of the number of blocks of DEFAULT_BLOCK_SIZE;
CLUSTER_RELSEG_SIZE scales appropriately (and is the only thing used internally)
to keep the same target total segment size regardless of block size.
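
The segment-size scaling described in the last point can be illustrated with a small sketch. The constants mirror the stock defaults (8kB blocks, 1 GB segments); the variable and helper names are invented for illustration, not taken from the patch:

```c
#include <assert.h>

/* Stock defaults: 8kB blocks, segments of 131072 such blocks = 1 GB. */
#define DEFAULT_BLOCK_SIZE 8192
#define RELSEG_SIZE        131072

static unsigned cluster_block_size = 16384;  /* assume read from pg_control */

/* CLUSTER_RELSEG_SIZE analogue: scale the per-segment block count so the
 * total segment size stays 1 GB regardless of the chosen block size. */
static unsigned
cluster_relseg_size(void)
{
    return (unsigned) (((unsigned long long) RELSEG_SIZE * DEFAULT_BLOCK_SIZE)
                       / cluster_block_size);
}
```

With 16kB blocks this yields 65536 blocks per segment, i.e. the same 1 GB target as 131072 blocks of 8kB.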

This patch uses a precalculated table to store the block size itself, as well as
additional derived values that have traditionally been compile-time
constants (example: MaxHeapTuplesPerPage). The traditional macro names are kept
so code that doesn't care about it should not need to change, however the
definition of these has changed (see the CalcXXX() routines in blocksize.h for
details).
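
A rough sketch of how such a precalculated table might look. The field set, limits, and formula here are illustrative guesses rather than the patch's actual contents; the MaxHeapTuplesPerPage arithmetic mirrors the classic compile-time formula (page size minus page header, divided by MAXALIGN'd heap tuple header plus line pointer):

```c
#include <assert.h>
#include <stdint.h>

#define MIN_BLOCK_SIZE 1024     /* assumed limits, matching configure's range */
#define MAX_BLOCK_SIZE 32768

typedef struct BlockSizeTable
{
    uint32_t block_size;        /* the initdb-selected size */
    uint32_t max_heap_tuples;   /* derived; was compile-time MaxHeapTuplesPerPage */
} BlockSizeTable;

static BlockSizeTable cluster_sizes;

/* stand-in for the CalcXXX() routines: (page - 24-byte page header) over
 * (24-byte MAXALIGN'd heap tuple header + 4-byte line pointer) */
static uint32_t
CalcMaxHeapTuplesPerPage(uint32_t blcksz)
{
    return (blcksz - 24) / (24 + 4);
}

/* BlockSizeInit(): populate the derived values for the target size */
void
BlockSizeInit(uint32_t blcksz)
{
    assert(blcksz >= MIN_BLOCK_SIZE && blcksz <= MAX_BLOCK_SIZE);
    assert((blcksz & (blcksz - 1)) == 0);   /* power of two */
    cluster_sizes.block_size = blcksz;
    cluster_sizes.max_heap_tuples = CalcMaxHeapTuplesPerPage(blcksz);
}

/* traditional-looking names, now table lookups rather than constants */
#define CLUSTER_BLOCK_SIZE   (cluster_sizes.block_size)
#define MaxHeapTuplesPerPage (cluster_sizes.max_heap_tuples)
```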

A new function, BlockSizeInit() populates the appropriate values based on the
target block size. This should be called as early as possible in any code that
utilizes block sizes. This patch adds this in the appropriate place on the
handful of src/bin/ programs that used BLCKSZ, so this caveat mainly impacts new
code.

Code which had previously used BLCKSZ should likely be able to get away with
changing these instances to CLUSTER_BLOCK_SIZE, unless you're using a structure
allocated on the stack. In these cases, the compiler will complain about a
dynamically-sized structure. The solution is to utilize an expression with
MAX_BLOCK_SIZE instead of BLCKSZ, ensuring enough stack space is allocated for
the maximum size. This does still require using CLUSTER_BLOCK_SIZE (or an
expression based on it) when actually using the structure, so in practice more
stack space may be allocated than is actually used; as long as there is plenty
of stack this should have no specific impact on code.
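
The stack-allocation pattern described above might look like the following minimal sketch, where cluster_block_size stands in for CLUSTER_BLOCK_SIZE and is assumed to have been set by BlockSizeInit():

```c
#include <string.h>

#define MAX_BLOCK_SIZE 32768

static unsigned cluster_block_size = 8192;  /* assume set by BlockSizeInit() */

/* Count the zero bytes in a page image via a stack-local copy. */
int
count_zero_bytes(const char *src)
{
    /* sized for the maximum selectable block size, not BLCKSZ, so the
     * compiler sees a constant array bound... */
    char page[MAX_BLOCK_SIZE];
    int  nzero = 0;

    /* ...while all actual accesses use the dynamic cluster size */
    memcpy(page, src, cluster_block_size);
    for (unsigned i = 0; i < cluster_block_size; i++)
        if (page[i] == 0)
            nzero++;
    return nzero;
}
```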

Initial (basic) performance testing shows only minor changes with the pgbench -S
benchmark, though this is obviously an area that will need considerable
testing/verification across multiple workloads.

Thanks,

David

Attachments:

v1-0001-Introduce-initdb-selectable-block-sizes.patch (application/octet-stream, +1666 −965)
#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Christensen (#1)
Re: Initdb-time block size specification

On 6/30/23 19:35, David Christensen wrote:

Hi,

As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
variable block sizes; basically instead of BLCKSZ being a compile-time
constant, a single set of binaries can support all of the block sizes
Pg can support, using the value stored in pg_control as the basis.
(Possible future plans would be to make this something even more
dynamic, such as configured per tablespace, but this is out of scope;
this just sets up the infrastructure for this.)

Whereas we had traditionally used BLCKSZ to indicate the compile-time selected
block size, this commit adjusted things so the cluster block size can be
selected at initdb time.

In order to code for this, we introduce a few new defines:

- CLUSTER_BLOCK_SIZE is the blocksize for this cluster itself. This is not
valid until BlockSizeInit() has been called in the given backend, which we do as
early as possible by parsing the ControlFile and using the blcksz field.

- MIN_BLOCK_SIZE and MAX_BLOCK_SIZE are the limits for the selectable block
size. It is required that CLUSTER_BLOCK_SIZE is a power of 2 between these two
constants.

- DEFAULT_BLOCK_SIZE is the moral equivalent of BLCKSZ; it is the built-in
default value. This is used in a few places that just needed a buffer of an
arbitrary size, but the dynamic value CLUSTER_BLOCK_SIZE should almost always be
used instead.

- CLUSTER_RELSEG_SIZE is used instead of RELSEG_SIZE, since internally we are
storing the segment size in terms of number of blocks. RELSEG_SIZE is still
kept, but is used in terms of the number of blocks of DEFAULT_BLOCK_SIZE;
CLUSTER_RELSEG_SIZE scales appropriately (and is the only thing used internally)
to keep the same target total segment size regardless of block size.

Do we really want to prefix the option with CLUSTER_? That seems to just
add a lot of noise into the patch, and I don't see much value in this
rename. I'd prefer keeping BLCKSZ and tweak just the couple places that
need "default" to use BLCKSZ_DEFAULT or something like that.

But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time
values, so after making this initdb-time parameter we should abandon
that (just like fc49e24fa69a did for segment sizes). Perhaps something
like cluster_block_size would work ... (yes, that touches all the places
too).

This patch uses a precalculated table to store the block size itself, as well as
additional derived values that have traditionally been compile-time
constants (example: MaxHeapTuplesPerPage). The traditional macro names are kept
so code that doesn't care about it should not need to change, however the
definition of these has changed (see the CalcXXX() routines in blocksize.h for
details).

A new function, BlockSizeInit() populates the appropriate values based on the
target block size. This should be called as early as possible in any code that
utilizes block sizes. This patch adds this in the appropriate place on the
handful of src/bin/ programs that used BLCKSZ, so this caveat mainly impacts new
code.

Code which had previously used BLCKSZ should likely be able to get away with
changing these instances to CLUSTER_BLOCK_SIZE, unless you're using a structure
allocated on the stack. In these cases, the compiler will complain about a
dynamically-sized structure. The solution is to utilize an expression with
MAX_BLOCK_SIZE instead of BLCKSZ, ensuring enough stack space is allocated for
the maximum size. This does still require using CLUSTER_BLOCK_SIZE (or an
expression based on it) when actually using the structure, so in practice more
stack space may be allocated than is actually used; as long as there is plenty
of stack this should have no specific impact on code.

Initial (basic) performance testing shows only minor changes with the pgbench -S
benchmark, though this is obviously an area that will need considerable
testing/verification across multiple workloads.

I wonder how to best evaluate/benchmark this. AFAICS what we want to
measure is the extra cost of making the values dynamic (which means the
compiler can't just optimize them out). I'd say a "pgbench -S" seems
like a good test.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3David Christensen
david.christensen@crunchydata.com
In reply to: Tomas Vondra (#2)
Re: Initdb-time block size specification

On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Do we really want to prefix the option with CLUSTER_? That seems to just
add a lot of noise into the patch, and I don't see much value in this
rename. I'd prefer keeping BLCKSZ and tweak just the couple places that
need "default" to use BLCKSZ_DEFAULT or something like that.

But more importantly, I'd say we use CAPITALIZED_NAMES for compile-time
values, so after making this initdb-time parameter we should abandon
that (just like fc49e24fa69a did for segment sizes). Perhaps something
like cluster_block_size would work ... (yes, that touches all the places
too).

Yes, I can see that being an equivalent change; thanks for the pointer
there. Definitely the "cluster_block_size" could be an approach,
though since it's just currently a #define for GetBlockSize(), maybe
we just replace with the equivalent instead. I was mainly trying to
make it something that was conceptually similar and easy to reason
about without getting bogged down in the details locally, but can see
that ALL_CAPS does have a specific meaning. Also eliminating the
BLCKSZ symbol meant it was easier to catch anything which depended on
that value. If we wanted to keep BLCKSZ, I'd be okay with that at
this point vs the CLUSTER_BLOCK_SIZE approach, could help to make the
patch smaller at this point.

Initial (basic) performance testing shows only minor changes with the pgbench -S
benchmark, though this is obviously an area that will need considerable
testing/verification across multiple workloads.

I wonder how to best evaluate/benchmark this. AFAICS what we want to
measure is the extra cost of making the values dynamic (which means the
compiler can't just optimize them out). I'd say a "pgbench -S" seems
like a good test.

Yep, I tested 100 runs apiece with pgbench -S at scale factor 100,
default settings for optimized builds of the same base commit with and
without the patch; saw a reduction of TPS around 1% in that case, but
I do think we'd want to look at different workloads; I presume the
largest impact would be seen when it's all in shared_buffers and no IO
is required.

Thanks,

David

#4Andres Freund
andres@anarazel.de
In reply to: David Christensen (#1)
Re: Initdb-time block size specification

Hi,

On 2023-06-30 12:35:09 -0500, David Christensen wrote:

As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
variable block sizes; basically instead of BLCKSZ being a compile-time
constant, a single set of binaries can support all of the block sizes
Pg can support, using the value stored in pg_control as the basis.
(Possible future plans would be to make this something even more
dynamic, such as configured per tablespace, but this is out of scope;
this just sets up the infrastructure for this.)

I am extremely doubtful this is a good idea. For one it causes a lot of churn,
but more importantly it turns currently cheap code paths into more expensive
ones.

Changes like

-#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))

Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is
actually variable.

I am fairly certain this is going to be causing substantial performance
regressions. I think we should reject this even if we don't immediately find
them, because it's almost guaranteed to cause some.

Besides this, I've not really heard any convincing justification for needing
this in the first place.

Greetings,

Andres Freund

#5David Christensen
david.christensen@crunchydata.com
In reply to: Andres Freund (#4)
Re: Initdb-time block size specification

On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-06-30 12:35:09 -0500, David Christensen wrote:

As discussed somewhat at PgCon, enclosed is v1 of a patch to provide
variable block sizes; basically instead of BLCKSZ being a compile-time
constant, a single set of binaries can support all of the block sizes
Pg can support, using the value stored in pg_control as the basis.
(Possible future plans would be to make this something even more
dynamic, such as configured per tablespace, but this is out of scope;
this just sets up the infrastructure for this.)

I am extremely doubtful this is a good idea. For one it causes a lot of churn,
but more importantly it turns currently cheap code paths into more expensive
ones.

Changes like

-#define BufHdrGetBlock(bufHdr)       ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
+#define BufHdrGetBlock(bufHdr)       ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * CLUSTER_BLOCK_SIZE))

Note that CLUSTER_BLOCK_SIZE, despite looking like a macro that's constant, is
actually variable.

Correct; that is mainly a notational device which would be easy enough
to change (and presumably would follow along the lines of the commit
Tomas pointed out above).

I am fairly certain this is going to be causing substantial performance
regressions. I think we should reject this even if we don't immediately find
them, because it's almost guaranteed to cause some.

What would be considered substantial? Some overhead would be expected,
but I think having an actual patch to evaluate lets us see what
potential there is. Seems like this will likely be optimized as an
offset stored in a register, so wouldn't expect huge changes here.
(There may be other approaches I haven't thought of in terms of
getting this.)
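
One shape the "offset stored in a register" idea could take, sketched under the assumption that block sizes remain powers of two (the names here are hypothetical, not from the patch):

```c
#include <stddef.h>

/* Cache log2(block size) once at startup; the buffer-offset multiply in
 * macros like BufHdrGetBlock then becomes a shift by a small cached
 * value rather than a multiply by a runtime variable. */
static unsigned cluster_block_shift = 13;   /* log2(8192), assume set at init */

static inline size_t
buf_block_offset(int buf_id)
{
    return (size_t) buf_id << cluster_block_shift;
}
```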

Besides this, I've not really heard any convincing justification for needing
this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here). We've had 8k blocks for a long time while
hardware has improved over 20+ years, and it would be interesting to
see how tuning things would open up additional avenues for performance
without requiring packagers to make a single choice on this regardless
of use-case. (The fact that we allow compiling this at a different
value suggests there is thought to be some utility having this be
something other than the default value.)

I just think it's one of those things that is hard to evaluate without
actually having something specific, which is why we have this patch
now.

Thanks,

David

#6Andres Freund
andres@anarazel.de
In reply to: David Christensen (#3)
Re: Initdb-time block size specification

Hi,

On 2023-06-30 14:09:55 -0500, David Christensen wrote:

On Fri, Jun 30, 2023 at 1:14 PM Tomas Vondra

I wonder how to best evaluate/benchmark this. AFAICS what we want to
measure is the extra cost of making the values dynamic (which means the
compiler can't just optimize them out). I'd say a "pgbench -S" seems
like a good test.

Yep, I tested 100 runs apiece with pgbench -S at scale factor 100,
default settings for optimized builds of the same base commit with and
without the patch; saw a reduction of TPS around 1% in that case, but
I do think we'd want to look at different workloads; I presume the
largest impact would be seen when it's all in shared_buffers and no IO
is required.

I think pgbench -S indeed isn't a good workload - the overhead for it is much
more in context switches and instantiating executor state etc than code that
is affected by the change.

And indeed. Comparing e.g. TPC-H, I see *massive* regressions. Some queries
are the same, but others regress by up to 70% (although more commonly around
10-20%).

That's larger than I thought, which makes me suspect that there's some bug in
the new code.

Interestingly, repeating the benchmark with a larger work_mem setting, the
regressions are still quite present, but smaller. I suspect the planner
chooses smarter plans which move bottlenecks more towards hashjoin code etc,
which won't be affected by this change.

IOW, you seriously need to evaluate analytics queries before this is worth
looking at further.

Greetings,

Andres Freund

#7Andres Freund
andres@anarazel.de
In reply to: David Christensen (#5)
Re: Initdb-time block size specification

Hi,

On 2023-06-30 15:05:54 -0500, David Christensen wrote:

I am fairly certain this is going to be causing substantial performance
regressions. I think we should reject this even if we don't immediately find
them, because it's almost guaranteed to cause some.

What would be considered substantial? Some overhead would be expected,
but I think having an actual patch to evaluate lets us see what
potential there is.

Anything beyond 1-2%, although even that imo is a hard sell.

Besides this, I've not really heard any convincing justification for needing
this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here).

You can do experiments today with the compile time option. Which does not
require regressing performance for everyone.

We've had 8k blocks for a long time while hardware has improved over 20+
years, and it would be interesting to see how tuning things would open up
additional avenues for performance without requiring packagers to make a
single choice on this regardless of use-case.

I suspect you're going to see more benefits from going to a *lower* setting
than a higher one. Some practical issues aside, plenty of storage hardware
these days would allow to get rid of FPIs if you go to 4k blocks (although it
often requires explicit sysadmin action to reformat the drive into that mode
etc). But obviously that's problematic from the "postgres limits" POV.

If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value. I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.
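
A toy sketch of that wrapper idea, with an invented naming scheme (nothing here reflects real PostgreSQL packaging): the wrapper would read blcksz from pg_control, build the matching binary name, and execv() it.

```c
#include <stdio.h>
#include <string.h>

/* Map a pg_control block size to a hypothetical per-BLCKSZ binary name;
 * a real wrapper would then execv() the result, replacing itself with
 * the matching server binary. */
static const char *
binary_for_blcksz(unsigned blcksz, char *buf, size_t len)
{
    snprintf(buf, len, "postgres-%uk", blcksz / 1024);
    return buf;
}
```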

Greetings,

Andres Freund

#8Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Christensen (#5)
Re: Initdb-time block size specification

On 6/30/23 22:05, David Christensen wrote:

On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:

...

Besides this, I've not really heard any convincing justification for needing
this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here). We've had 8k blocks for a long time while
hardware has improved over 20+ years, and it would be interesting to
see how tuning things would open up additional avenues for performance
without requiring packagers to make a single choice on this regardless
of use-case. (The fact that we allow compiling this at a different
value suggests there is thought to be some utility having this be
something other than the default value.)

I just think it's one of those things that is hard to evaluate without
actually having something specific, which is why we have this patch
now.

But it's possible to evaluate that - you just need to rebuild with a
different configuration option. Yes, allowing doing that at initdb is
simpler and allows testing this on systems where rebuilding is not
convenient. And having a binary that can deal with any block size would
be nice too.

In fact, I did exactly that a year ago for a conference, and I spoke
about it at the 2022 unconference too. Not sure if there's recording
from pgcon, but there is one from the other conference [1][2].

The short story is that the possible gains are significant (say +50%)
for data sets that don't fit into RAM. But that was with block size set
at compile time, the question is what's the impact of making it a
variable instead of a macro ....

[1]: https://www.youtube.com/watch?v=mVKpoQxtCXk
[2]: https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: Initdb-time block size specification

On 6/30/23 23:11, Andres Freund wrote:

...

If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value. I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.

I don't quite understand why we shouldn't do this (or at least try to).
IMO the benefits of using smaller blocks were substantial (especially
for 4kB, most likely due to matching the internal SSD page size). The other
benefits (reducing WAL volume) seem rather interesting too.

Sure, there are challenges (e.g. the overhead due to making it dynamic).
No doubt about that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10David Christensen
david.christensen@crunchydata.com
In reply to: Andres Freund (#6)
Re: Initdb-time block size specification

On Fri, Jun 30, 2023 at 3:29 PM Andres Freund <andres@anarazel.de> wrote:

And indeed. Comparing e.g. TPC-H, I see *massive* regressions. Some queries

are the same, but others regress by up to 70% (although more commonly around
10-20%).

Hmm, that is definitely not good.

That's larger than I thought, which makes me suspect that there's some bug in
the new code.

Will do a little profiling here to see if I can figure out the
regression. Which build optimization settings are you seeing this
under?

Interestingly, repeating the benchmark with a larger work_mem setting, the
regressions are still quite present, but smaller. I suspect the planner
chooses smarter plans which move bottlenecks more towards hashjoin code etc,
which won't be affected by this change.

Interesting.

IOW, you seriously need to evaluate analytics queries before this is worth
looking at further.

Makes sense, thanks for reviewing.

David

#11David Christensen
david.christensen@crunchydata.com
In reply to: Tomas Vondra (#8)
Re: Initdb-time block size specification

On Fri, Jun 30, 2023 at 4:12 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/30/23 22:05, David Christensen wrote:

On Fri, Jun 30, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:

...

Besides this, I've not really heard any convincing justification for needing
this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here). We've had 8k blocks for a long time while
hardware has improved over 20+ years, and it would be interesting to
see how tuning things would open up additional avenues for performance
without requiring packagers to make a single choice on this regardless
of use-case. (The fact that we allow compiling this at a different
value suggests there is thought to be some utility having this be
something other than the default value.)

I just think it's one of those things that is hard to evaluate without
actually having something specific, which is why we have this patch
now.

But it's possible to evaluate that - you just need to rebuild with a
different configuration option. Yes, allowing doing that at initdb is
simpler and allows testing this on systems where rebuilding is not
convenient. And having a binary that can deal with any block size would
be nice too.

In fact, I did exactly that a year ago for a conference, and I spoke
about it at the 2022 unconference too. Not sure if there's recording
from pgcon, but there is one from the other conference [1][2].

The short story is that the possible gains are significant (say +50%)
for data sets that don't fit into RAM. But that was with block size set
at compile time, the question is what's the impact of making it a
variable instead of a macro ....

[1] https://www.youtube.com/watch?v=mVKpoQxtCXk
[2] https://blog.pgaddict.com/pdf/block-sizes-postgresvision-2022.pdf

Cool, thanks for the video links; will review.

David

#12David Christensen
david.christensen@crunchydata.com
In reply to: Tomas Vondra (#9)
Re: Initdb-time block size specification

On Fri, Jun 30, 2023 at 4:27 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 6/30/23 23:11, Andres Freund wrote:

...

If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value. I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.

I don't quite understand why we shouldn't do this (or at least try to).
IMO the benefits of using smaller blocks were substantial (especially
for 4kB, most likely due to matching the internal SSD page size). The other
benefits (reducing WAL volume) seem rather interesting too.

If it's dynamic, we could also be able to add detection of the best
block size at initdb time, leading to improvements all around, say.
:-)

Sure, there are challenges (e.g. the overhead due to making it dynamic).
No doubt about that.

I definitely agree that it seems worth looking into. If nothing else,
being able to quantify just what/where the overhead comes from may be
instructive for future efforts.

David

#13David Christensen
david.christensen@crunchydata.com
In reply to: Andres Freund (#7)
Re: Initdb-time block size specification

On Fri, Jun 30, 2023 at 4:17 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-06-30 15:05:54 -0500, David Christensen wrote:

I am fairly certain this is going to be causing substantial performance
regressions. I think we should reject this even if we don't immediately find
them, because it's almost guaranteed to cause some.

What would be considered substantial? Some overhead would be expected,
but I think having an actual patch to evaluate lets us see what
potential there is.

Anything beyond 1-2%, although even that imo is a hard sell.

I'd agree that that threshold seems like a reasonable target, and
anything much above that would be regressive.

Besides this, I've not really heard any convincing justification for needing
this in the first place.

Doing this would open up experiments in larger block sizes, so we
would be able to have larger indexable tuples, say, or be able to
store data types that are larger than currently supported for tuple
row limits without dropping to toast (native vector data types come to
mind as a candidate here).

You can do experiments today with the compile time option. Which does not
require regressing performance for everyone.

Sure, not arguing that this is more performant than the current approach.

We've had 8k blocks for a long time while hardware has improved over 20+
years, and it would be interesting to see how tuning things would open up
additional avenues for performance without requiring packagers to make a
single choice on this regardless of use-case.

I suspect you're going to see more benefits from going to a *lower* setting
than a higher one. Some practical issues aside, plenty of storage hardware
these days would allow to get rid of FPIs if you go to 4k blocks (although it
often requires explicit sysadmin action to reformat the drive into that mode
etc). But obviously that's problematic from the "postgres limits" POV.

If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value. I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.

You may well be right, but I think if we haven't tried to do that and
measure it, it's hard to say exactly. There are of course more parts
of the system that are about BLCKSZ than the backend, plus you'd need
to build other extensions to support each option, so there is a lot
more that would need to change. (That's neither here nor there, as my
approach also involves changing all those places, so change isn't
inherently bad; just saying it's not a trivial solution to merely
iterate over the block size for binaries.)

David

#14Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: Initdb-time block size specification

On 6/30/23 23:11, Andres Freund wrote:

Hi,

...

I suspect you're going to see more benefits from going to a *lower* setting
than a higher one. Some practical issues aside, plenty of storage hardware
these days would allow to get rid of FPIs if you go to 4k blocks (although it
often requires explicit sysadmin action to reformat the drive into that mode
etc). But obviously that's problematic from the "postgres limits" POV.

I wonder what are the conditions/options for disabling FPI. I kinda
assume it'd apply to new drives with 4k sectors, with properly aligned
partitions etc. But I haven't seen any particularly clear confirmation
that's correct.

If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value. I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.

How would that work for extensions which may be built for a particular
BLCKSZ value (like pageinspect)?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#14)
Re: Initdb-time block size specification

On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:

On 6/30/23 23:11, Andres Freund wrote:

Hi,

...

I suspect you're going to see more benefits from going to a *lower* setting
than a higher one. Some practical issues aside, plenty of storage hardware
these days would allow to get rid of FPIs if you go to 4k blocks (although it
often requires explicit sysadmin action to reformat the drive into that mode
etc). But obviously that's problematic from the "postgres limits" POV.

I wonder what are the conditions/options for disabling FPI. I kinda
assume it'd apply to new drives with 4k sectors, with properly aligned
partitions etc. But I haven't seen any particularly clear confirmation
that's correct.

I don't think we have ever had to study this --- we just request the
write to the operating system, and we either get a successful reply or
we go into WAL recovery to reread the pre-image. We never really care
if the write is atomic, e.g., an 8k write can be done as 2 4kB writes or 4
2kB writes --- we don't care --- we only care whether they are all done or
not.

For a 4kB write, to say it is never partially written we would need the
operating system to guarantee that the 4kB write is not split into
smaller writes, since each smaller write being atomic on its own
would not help us.
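To make the torn-write hazard concrete, here is a small simulation (a hypothetical illustration, not PostgreSQL code; the checksum is a stand-in for Postgres's own page checksum): an 8kB page is flushed as two 4kB halves, a crash lands between them, and the checksum computed for the new page no longer matches what is read back.

```python
import hashlib

PAGE_SIZE = 8192   # PostgreSQL heap page
SECTOR = 4096      # assumed atomic write unit of the drive

def checksum(page: bytes) -> bytes:
    # Stand-in for a page checksum; PostgreSQL uses its own algorithm.
    return hashlib.sha256(page).digest()

old_page = b"A" * PAGE_SIZE            # page contents before the update
new_page = b"B" * PAGE_SIZE            # contents we intend to write
stored_checksum = checksum(new_page)   # checksum of the intended page

# Crash after only the first 4kB sector reaches stable storage:
torn_page = new_page[:SECTOR] + old_page[SECTOR:]

# The mismatch detects the torn write; recovery still needs a known-good
# copy of the page, which is what the full-page image in WAL provides.
assert checksum(torn_page) != stored_checksum
```

If the page size equaled the device's atomic write unit, the torn state above could not occur, which is the whole premise of this sub-thread.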

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#16Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#9)
Re: Initdb-time block size specification

Hi,

On 2023-06-30 23:27:45 +0200, Tomas Vondra wrote:

On 6/30/23 23:11, Andres Freund wrote:

...

If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value. I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.

I don't quite understand why we shouldn't do this (or at least try to).

IMO the benefits of using smaller blocks were substantial (especially
for 4kB, most likely due matching the internal SSD page size). The other
benefits (reducing WAL volume) seem rather interesting too.

Mostly because I think there are bigger gains to be had elsewhere.

IME not a whole lot of storage ships by default with externally visible 4k
sectors, but needs to be manually reformatted [1], which loses all data, so it
has to be done initially. Then postgres would also need OS-specific trickery
to figure out that indeed the IO stack is entirely 4k (checking the sector size is
not enough). And you run into the issue that suddenly the #column and
index-tuple-size limits are lower, which won't make it easier.

I think we should change the default of the WAL blocksize to 4k
though. There's practically no downsides, and it drastically reduces
postgres-side write amplification in many transactional workloads, by only
writing out partially filled 4k pages instead of partially filled 8k pages.
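The write-amplification claim is easy to quantify with back-of-the-envelope arithmetic (a sketch with an illustrative record size, not measured numbers): a small synchronous commit forces a flush of the whole current WAL page, so the smaller the WAL block, the less padding is written per flush.

```python
# Rough model: each synchronous commit flushes the (partially filled)
# current WAL page, rounded up to a whole WAL block.
record_bytes = 400  # WAL produced by a small transaction; hypothetical

def bytes_flushed(wal_block_size: int) -> int:
    # Ceiling-divide the record size up to whole WAL blocks.
    return -(-record_bytes // wal_block_size) * wal_block_size

for blksz in (8192, 4096):
    amp = bytes_flushed(blksz) / record_bytes
    print(f"{blksz}-byte WAL blocks: flush {bytes_flushed(blksz)} bytes "
          f"(~{amp:.1f}x amplification)")
```

Under this model, halving the WAL block size halves the bytes flushed for any record smaller than a block; real workloads batch commits, so the actual gain depends on how full pages are when flushed.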

Sure, there are challenges (e.g. the overhead due to making it dynamic).
No doubt about that.

I don't think the runtime-dynamic overhead is avoidable with reasonable effort
(leaving aside compiling code multiple times and switching between).

If we were to start building postgres for multiple compile-time settings, I
think there are uses other than switching between BLCKSZ, potentially more
interesting. E.g. you can see substantially improved performance by being able
to use SSE4.2 without CPU dispatch (partially because it allows compiler
autovectorization, partially because it allows the compiler to use newer
non-vectorized math instructions (when targeting AVX IIRC), partially because
the dispatch overhead is not insubstantial). Another example: ARMv8
performance is substantially better if you target ARMv8.1-A instead of
ARMv8.0, due to having atomic instructions instead of LL/SC (it still baffles
me that they didn't do this earlier, LL/SC is just inherently inefficient).

Greetings,

Andres Freund

[1]: To see the current LBA format: for d in /dev/nvme*n1; do echo "$d:"; sudo nvme id-ns -H $d | grep '^LBA Format'; echo; done
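The multiple-binaries idea mentioned above could be sketched as follows (a hypothetical illustration, not part of the patch; the install layout and binary names are invented): a thin wrapper reads the cluster's block size out of pg_control via pg_controldata, then exec's the matching per-BLCKSZ server binary.

```python
import os
import re
import subprocess
import sys

def parse_block_size(controldata_output: str) -> int:
    """Extract the cluster block size from pg_controldata's text output."""
    m = re.search(r"Database block size:\s+(\d+)", controldata_output)
    if not m:
        raise RuntimeError("could not determine block size from pg_control")
    return int(m.group(1))

def main():
    # pg_controldata prints human-readable fields from pg_control.
    out = subprocess.run(["pg_controldata", os.environ["PGDATA"]],
                         capture_output=True, text=True, check=True).stdout
    blcksz = parse_block_size(out)
    # Hypothetical install layout: one real server binary per BLCKSZ.
    real = f"/usr/lib/postgresql/bin/postgres-{blcksz // 1024}k"
    os.execv(real, [real] + sys.argv[1:])

# Entry point (would replace this process with the real server):
# if __name__ == "__main__":
#     main()
```

Since exec happens once at startup, this approach keeps every hot path compiled against a constant BLCKSZ, which is exactly the overhead argument being made here.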

#17Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#14)
Re: Initdb-time block size specification

Hi,

On 2023-06-30 23:42:30 +0200, Tomas Vondra wrote:

I wonder what are the conditions/options for disabling FPI. I kinda
assume it'd apply to new drives with 4k sectors, with properly aligned
partitions etc. But I haven't seen any particularly clear confirmation
that's correct.

Yea, it's not trivial. And the downsides are also substantial from a
replication / crash recovery performance POV - even with reading blocks ahead
of WAL replay, it's hard to beat just sequentially reading nearly all the data
you're going to need.

On 6/30/23 23:11, Andres Freund wrote:

If we really wanted to do this - but I don't think we do - I'd argue for
working on the buildsystem support to build the postgres binary multiple
times, for 4, 8, 16 kB BLCKSZ and having a wrapper postgres binary that just
exec's the relevant "real" binary based on the pg_control value. I really
don't see us ever wanting to make BLCKSZ runtime configurable within one
postgres binary. There's just too much intrinsic overhead associated with
that.

How would that work for extensions which may be built for a particular
BLCKSZ value (like pageinspect)?

I think we'd need to do something similar for extensions, likely loading them
from a path that includes the "subvariant" the server currently is running. Or
alternatively adding a suffix to the filename indicating the
variant. Something like pageinspect.x86-64-v4-4kB.so.

The x86-64-v* stuff is something gcc and clang added a couple years ago, so
that not every project has to define different "baseline" levels. I think it
was done in collaboration with the System V / Linux AMD64 ABI specification group
([1], section 3.1.1).

Greetings,

Andres

[1]: https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/master/raw/x86-64-ABI/abi.pdf?job=build (section 3.1.1)

#18Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bruce Momjian (#15)
Re: Initdb-time block size specification

On 6/30/23 23:53, Bruce Momjian wrote:

On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:

On 6/30/23 23:11, Andres Freund wrote:

Hi,

...

I suspect you're going to see more benefits from going to a *lower* setting
than a higher one. Some practical issues aside, plenty of storage hardware
these days would allow to get rid of FPIs if you go to 4k blocks (although it
often requires explicit sysadmin action to reformat the drive into that mode
etc). But obviously that's problematic from the "postgres limits" POV.

I wonder what are the conditions/options for disabling FPI. I kinda
assume it'd apply to new drives with 4k sectors, with properly aligned
partitions etc. But I haven't seen any particularly clear confirmation
that's correct.

I don't think we have ever had to study this --- we just request the
write to the operating system, and we either get a successful reply or
we go into WAL recovery to reread the pre-image. We never really care
if the write is atomic, e.g., an 8k write can be done as 2 4kB writes or 4
2kB writes --- we don't care --- we only care whether they are all done or
not.

For a 4kB write, to say it is not partially written would be to require
the operating system to guarantee that the 4kB write is not split into
smaller writes which might each be atomic because smaller atomic writes
would not help us.

Right, that's the dance we do to protect against torn pages. But Andres
suggested that if you have modern storage and configure it correctly,
writing 4kB pages would be atomic. So we wouldn't need to do this
FPI stuff, eliminating a pretty significant source of write amplification.
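For reference, the knob involved already exists: full-page images can be disabled today in postgresql.conf, but doing so is only safe if the entire I/O stack genuinely guarantees atomic page-sized writes (a sketch of the setting, not a recommendation):

```
# postgresql.conf -- only safe when the whole I/O stack (drive, partition
# alignment, filesystem, any device mapper layers) writes pages atomically
full_page_writes = off
```

The rest of this sub-thread is essentially about how hard it is to verify that precondition.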

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#18)
Re: Initdb-time block size specification

On Sat, Jul 1, 2023 at 12:21:03AM +0200, Tomas Vondra wrote:

On 6/30/23 23:53, Bruce Momjian wrote:

For a 4kB write, to say it is not partially written would be to require
the operating system to guarantee that the 4kB write is not split into
smaller writes which might each be atomic because smaller atomic writes
would not help us.

Right, that's the dance we do to protect against torn pages. But Andres
suggested that if you have modern storage and configure it correctly,
writing with 4kB pages would be atomic. So we wouldn't need to do this
FPI stuff, eliminating pretty significant source of write amplification.

I agree the hardware is atomic for 4k writes, but do we know the OS
always issues 4k writes?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#20Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#15)
Re: Initdb-time block size specification

Hi,

On 2023-06-30 17:53:34 -0400, Bruce Momjian wrote:

On Fri, Jun 30, 2023 at 11:42:30PM +0200, Tomas Vondra wrote:

On 6/30/23 23:11, Andres Freund wrote:

I suspect you're going to see more benefits from going to a *lower* setting
than a higher one. Some practical issues aside, plenty of storage hardware
these days would allow to get rid of FPIs if you go to 4k blocks (although it
often requires explicit sysadmin action to reformat the drive into that mode
etc). But obviously that's problematic from the "postgres limits" POV.

I wonder what are the conditions/options for disabling FPI. I kinda
assume it'd apply to new drives with 4k sectors, with properly aligned
partitions etc. But I haven't seen any particularly clear confirmation
that's correct.

I don't think we have ever had to study this --- we just request the
write to the operating system, and we either get a successful reply or
we go into WAL recovery to reread the pre-image. We never really care
if the write is atomic, e.g., an 8k write can be done as 2 4kB writes or 4
2kB writes --- we don't care --- we only care whether they are all done or
not.

Well, that works because we have FPI. This sub-discussion is motivated by
getting rid of FPIs.

For a 4kB write, to say it is not partially written would be to require
the operating system to guarantee that the 4kB write is not split into
smaller writes which might each be atomic because smaller atomic writes
would not help us.

That's why we're talking about drives with 4k sector size - you *can't* split
the writes below that.

The problem is that, as far as I know, it's not always obvious what block size
is being used at the actual storage level. It's not even trivial when
operating on a filesystem directly stored on a single block device [1]. Once
there's things like LVM or disk encryption involved, it gets pretty hairy [2].
Once you know all the block devices, it's not too bad, but ...

Greetings,

Andres Freund

[1]: On Linux I think you need to use stat() to figure out the st_dev for a
file, then look in /proc/self/mountinfo for the block device, and use the name
of the device to look up /sys/block/$d/queue/physical_block_size.

[2]: The above doesn't work because e.g. a device mapper target might only
support 4k sectors, even though the sectors on the underlying storage device
are 512b sectors. E.g. my root filesystem is encrypted, and if you follow the
above recipe (with the added step of resolving the symlink to know the actual
device name), you would see a 4k sector size, even though the underlying NVMe
disk only supports 512b sectors.
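The recipe in footnote [1] can be sketched in code (a hypothetical, Linux-only illustration using /sys/dev/block rather than parsing /proc/self/mountinfo; it inherits exactly the device-mapper caveat from footnote [2]): map a file to its backing block device via stat(), then read the sysfs physical_block_size attribute.

```python
import os

def physical_block_size(path: str, sysfs: str = "/sys") -> int:
    """Best-effort lookup of the physical block size backing `path`.

    Linux-only sketch. For a partition, /sys/dev/block/<maj>:<min> has no
    queue/ directory and one would need to ascend to the parent disk;
    that step (and stacked-device pitfalls) is omitted for brevity.
    """
    st = os.stat(path)
    major, minor = os.major(st.st_dev), os.minor(st.st_dev)
    attr = f"{sysfs}/dev/block/{major}:{minor}/queue/physical_block_size"
    with open(attr) as f:
        return int(f.read().strip())
```

Even when this returns 4096, that only describes the topmost device; as footnote [2] notes, a dm-crypt or LVM layer can advertise 4k sectors above a 512b disk, so the answer can't be trusted without walking the whole stack.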

#21Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#16)
#22Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#20)
#23Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#19)
#24Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#22)
#25Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#22)
#26Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#23)
#27Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#25)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Bruce Momjian (#27)
#29Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#28)
#30Andres Freund
andres@anarazel.de
In reply to: David Christensen (#10)
#31Peter Eisentraut
peter_e@gmx.net
In reply to: Tomas Vondra (#18)
#32David Christensen
david.christensen@crunchydata.com
In reply to: David Christensen (#1)
#33David Christensen
david.christensen@crunchydata.com
In reply to: David Christensen (#32)
#34John Naylor
john.naylor@enterprisedb.com
In reply to: David Christensen (#32)
#35David Christensen
david.christensen@crunchydata.com
In reply to: John Naylor (#34)
#36David Christensen
david.christensen@crunchydata.com
In reply to: David Christensen (#35)
#37Robert Haas
robertmhaas@gmail.com
In reply to: David Christensen (#36)
#38Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#37)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#38)
#40David Christensen
david.christensen@crunchydata.com
In reply to: Robert Haas (#39)
#41Hannu Krosing
hannu@tm.ee
In reply to: David Christensen (#40)
#42Andres Freund
andres@anarazel.de
In reply to: Hannu Krosing (#41)
#43David Christensen
david.christensen@crunchydata.com
In reply to: Hannu Krosing (#41)
#44Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#42)
#45David Christensen
david.christensen@crunchydata.com
In reply to: David Christensen (#32)