Lowering the default wal_blocksize to 4K

Started by Andres Freund over 2 years ago. 21 messages.
#1 Andres Freund
andres@anarazel.de

Hi,

I've mentioned this to a few people before, but forgot to start an actual
thread. So here we go:

I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
the current 8192. The reason is that

a) We don't gain much from a blocksize above 4096, as we already write all
the pending WAL data in one go (except when at the tail of
wal_buffers). We *do* incur more overhead for page headers, but compared to
the actual WAL data it is not a lot (~0.29% of space goes to page headers
with 8192 vs ~0.59% with 4096; see the quick calculation below).

b) Writing 8KB when we have to flush a partially filled buffer can
substantially increase write amplification. In a transactional workload,
this will often double the write volume.
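
For reference, those percentages are just the (MAXALIGN'd) 24-byte short page
header divided by the page size; a trivial sanity check:

#include <stdio.h>

int
main(void)
{
    const double hdr = 24.0;    /* SizeOfXLogShortPHD, MAXALIGN'd */

    /* prints ~0.29% and ~0.59% */
    printf("XLOG_BLCKSZ=8192: %.2f%% page header overhead\n", 100.0 * hdr / 8192.0);
    printf("XLOG_BLCKSZ=4096: %.2f%% page header overhead\n", 100.0 * hdr / 4096.0);

    return 0;
}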

Currently disks mostly have 4096 bytes as their "sector size". Sometimes
that's exposed directly, sometimes they can also write in 512 bytes, but that
internally requires a read-modify-write operation.

For some example numbers, I ran a very simple insert workload with a varying
number of clients with both a wal_blocksize=4096 and wal_blocksize=8192
cluster, and measured the amount of bytes written before/after. The table was
recreated before each run, followed by a checkpoint and the benchmark. Here I
ran the inserts only for 15s each, because the results don't change
meaningfully with longer runs.
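
One simple way to measure device-level write volume on Linux is to sample the
write-sectors counter in /sys/block/<dev>/stat before and after each run,
e.g. with a small helper like the following (the device name is just an
example, and this is just one way to get such numbers):

#include <stdio.h>
#include <stdlib.h>

/*
 * Return the number of 512-byte sectors written to a block device, taken
 * from the 7th field of /sys/block/<dev>/stat.  Sampling this before and
 * after a run and subtracting gives the bytes physically written.
 */
static unsigned long long
sectors_written(const char *dev)
{
    char        path[256];
    FILE       *f;
    unsigned long long v[7];

    snprintf(path, sizeof(path), "/sys/block/%s/stat", dev);
    f = fopen(path, "r");
    if (f == NULL ||
        fscanf(f, "%llu %llu %llu %llu %llu %llu %llu",
               &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]) != 7)
    {
        fprintf(stderr, "could not read %s\n", path);
        exit(1);
    }
    fclose(f);
    return v[6];                /* write sectors */
}

int
main(void)
{
    /* "nvme0n1" is just an example device name */
    printf("bytes written so far: %llu\n", sectors_written("nvme0n1") * 512);
    return 0;
}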

With XLOG_BLCKSZ=8192

clients      tps    disk bytes written
      1      667                 81296
      2      739                 89796
      4     1446                 89208
      8     2858                 90858
     16     5775                 96928
     32    11920                115351
     64    23686                135244
    128    46001                173390
    256    88833                239720
    512   146208                335669

With XLOG_BLCKSZ=4096

clients      tps    disk bytes written
      1      751                 46838
      2      773                 47936
      4     1512                 48317
      8     3143                 52584
     16     6221                 59097
     32    12863                 73776
     64    25652                 98792
    128    48274                133330
    256    88969                200720
    512   146298                298523

This is on a not-that-fast NVMe SSD (Samsung SSD 970 PRO 1TB).

It's IMO quite interesting that even at the higher client counts, the number
of bytes written doesn't reach parity.

On a stripe of two very fast SSDs:

With XLOG_BLCKSZ=8192

clients      tps    disk bytes written
      1    23786               2893392
      2    38515               4683336
      4    63436               4688052
      8   106618               4618760
     16   177905               4384360
     32   254890               3890664
     64   297113               3031568
    128   299878               2297808
    256   308774               1935064
    512   292515               1630408

With XLOG_BLCKSZ=4096

clients      tps    disk bytes written
      1    25742               1586748
      2    43578               2686708
      4    62734               2613856
      8   116217               2809560
     16   200802               2947580
     32   269268               2461364
     64   323195               2042196
    128   317160               1550364
    256   309601               1285744
    512   292063               1103816

It's fun to see how the total number of writes *decreases* at higher
concurrency, because it becomes more likely that pages are filled completely.

One thing I noticed is that our auto-configuration of wal_buffers leads to
different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
great.
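
(For context, the auto-tuning is roughly the following - paraphrased, not a
verbatim copy of XLOGChooseNumBuffers() in xlog.c. Both the shared_buffers/32
value and the 8-buffer floor are counts of XLOG_BLCKSZ-sized buffers, so the
auto-tuned wal_buffers in bytes roughly halves when the page size is halved:)

/*
 * Paraphrased sketch of the wal_buffers = -1 auto-tuning: nbuffers is
 * shared_buffers expressed in (8kB) buffer pages, seg_size and blcksz are
 * in bytes; the result is a count of XLOG_BLCKSZ-sized WAL buffers.
 */
static int
choose_num_wal_buffers(int nbuffers, int seg_size, int blcksz)
{
    int     xbuffers = nbuffers / 32;

    /* cap at one WAL segment's worth of buffers */
    if (xbuffers > seg_size / blcksz)
        xbuffers = seg_size / blcksz;

    /* but never fewer than 8 pages */
    if (xbuffers < 8)
        xbuffers = 8;

    return xbuffers;
}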

Performing the same COPY workload (1024 files, split across N clients) for
both settings shows no performance difference, but a very slight increase in
total bytes written (about 0.25%, which is roughly what I'd expect).

Personally I'd say the slight increase in WAL volume is more than outweighed
by the increase in throughput and decrease in bytes written.

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Greetings,

Andres Freund

#2 Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#1)
Re: Lowering the default wal_blocksize to 4K

On Mon, Oct 9, 2023 at 04:08:05PM -0700, Andres Freund wrote:

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

How do we handle shorter maximum row lengths and shorter maximum index
entry lengths?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#3 Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#2)
Re: Lowering the default wal_blocksize to 4K

Hi,

On 2023-10-09 19:26:54 -0400, Bruce Momjian wrote:

On Mon, Oct 9, 2023 at 04:08:05PM -0700, Andres Freund wrote:

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

How do we handle shorter maximum row lengths and shorter maximum index
entry lengths?

The WAL blocksize shouldn't influence either, unless we have a bug somewhere.

Greetings,

Andres Freund

#4 Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#3)
Re: Lowering the default wal_blocksize to 4K

On Mon, Oct 9, 2023 at 04:36:20PM -0700, Andres Freund wrote:

Hi,

On 2023-10-09 19:26:54 -0400, Bruce Momjian wrote:

On Mon, Oct 9, 2023 at 04:08:05PM -0700, Andres Freund wrote:

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

How do we handle shorter maximum row lengths and shorter maximum index
entry lengths?

The WAL blocksize shouldn't influence either, unless we have a bug somewhere.

Oh, good point.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Only you can decide what is important to you.

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#1)
Re: Lowering the default wal_blocksize to 4K

Andres Freund <andres@anarazel.de> writes:

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Seems like that's doubling the overhead of WAL page headers. Do we need
to try to skinny those down?

regards, tom lane

#6 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: Lowering the default wal_blocksize to 4K

Hi,

On 2023-10-09 23:16:30 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Seems like that's doubling the overhead of WAL page headers. Do we need
to try to skinny those down?

I think the overhead is small, and we are wasting so much space in other
places, that I am not worried about the proportional increase in page header
space usage at this point, particularly compared to the savings in overall
write rate and the increase in TPS. There are other areas where we can save
much more space, if we want to focus on that.

I was thinking we should perhaps do the opposite, namely getting rid of short
page headers. The overhead in the "byte position" <-> LSN conversion due to
the differing space is worse than the gain. Or do something in between - having
the system ID in the header adds a useful crosscheck, but I'm far less
convinced that having segment and block size in there, as 32bit numbers no
less, is worthwhile. After all, if the system id matches, it's not likely that
the xlog block or segment size differ.
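
To spell out where that conversion overhead comes from: because the first page
of each segment carries the long header, the usable bytes per page aren't
uniform within a segment, so the byte-position-to-LSN conversion has to
special-case that page. A simplified sketch of roughly what
XLogBytePosToRecPtr() has to do today (constants assume 8kB pages and 16MB
segments):

#include <stdint.h>

#define XLOG_BLCKSZ             8192
#define WAL_SEGMENT_SIZE        (16 * 1024 * 1024)
#define SizeOfXLogShortPHD      24
#define SizeOfXLogLongPHD       40

#define UsableBytesInPage       (XLOG_BLCKSZ - SizeOfXLogShortPHD)
#define UsableBytesInSegment \
    ((WAL_SEGMENT_SIZE / XLOG_BLCKSZ) * UsableBytesInPage - \
     (SizeOfXLogLongPHD - SizeOfXLogShortPHD))

/*
 * Convert a "usable byte position" into a segment number and an offset
 * within the segment.  The long header on the segment's first page forces
 * the extra branch and the division by a segment-size-dependent value;
 * with uniform page headers a single division by a compile-time constant
 * would do.
 */
static uint64_t
bytepos_to_seg_offset(uint64_t bytepos, uint64_t *fullsegs)
{
    uint64_t    bytesleft;
    uint64_t    fullpages;
    uint64_t    seg_offset;

    *fullsegs = bytepos / UsableBytesInSegment;
    bytesleft = bytepos % UsableBytesInSegment;

    if (bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD)
    {
        /* record fits on the segment's first page */
        seg_offset = bytesleft + SizeOfXLogLongPHD;
    }
    else
    {
        /* skip the first page, which has the larger header */
        bytesleft -= XLOG_BLCKSZ - SizeOfXLogLongPHD;
        fullpages = bytesleft / UsableBytesInPage;
        bytesleft = bytesleft % UsableBytesInPage;
        seg_offset = XLOG_BLCKSZ +
            fullpages * XLOG_BLCKSZ + bytesleft + SizeOfXLogShortPHD;
    }

    return seg_offset;
}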

Greetings,

Andres Freund

#7 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#1)
Re: Lowering the default wal_blocksize to 4K

On Tue, 10 Oct 2023 at 01:08, Andres Freund <andres@anarazel.de> wrote:

Hi,

I've mentioned this to a few people before, but forgot to start an actual
thread. So here we go:

I think we should lower the default wal_blocksize / XLOG_BLCKSZ to 4096, from
the current 8192.

Seems like a good idea.

It's IMO quite interesting that even at the higher client counts, the number
of bytes written don't reach parity.

It's fun to see how the total number of writes *decreases* at higher
concurrency, because it becomes more likely that pages are filled completely.

With higher client counts and short transactions I think it is not too
unexpected to see commit_delay+commit_siblings configured. Did you
measure the impact of this change on such configurations?

One thing I noticed is that our auto-configuration of wal_buffers leads to
different wal_buffers settings for different XLOG_BLCKSZ, which doesn't seem
great.

Hmm.

Performing the same COPY workload (1024 files, split across N clients) for
both settings shows no performance difference, but a very slight increase in
total bytes written (about 0.25%, which is roughly what I'd expect).

Personally I'd say the slight increase in WAL volume is more than outweighed
by the increase in throughput and decrease in bytes written.

Agreed.

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

It is not just the disk overhead of blocks, but we also maintain some
other data (currently in the form of XLogRecPtrs) in memory for each
WAL buffer, the overhead of which will also increase when we increase
the number of XLog pages per MB of WAL that we cache.
Additionally, highly concurrent workloads with transactions that write
a high multiple of XLOG_BLCKSZ bytes to WAL may start to see increased
overhead due to the .25% additional WAL getting written and a doubling
of the number of XLog pages being touched (both initialization and the
smaller memcpy for records that would now cross an extra page
boundary).

However, for all of these issues I doubt that they actually matter
much in the grand scheme of things, so I definitely wouldn't mind
moving to 4KiB XLog pages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#8 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#6)
Re: Lowering the default wal_blocksize to 4K

On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-10-09 23:16:30 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Seems like that's doubling the overhead of WAL page headers. Do we need
to try to skinny those down?

I think the overhead is small, and we are wasting so much space in other
places, that I am not worried about the proportional increase page header
space usage at this point, particularly compared to saving in overall write
rate and increase in TPS. There's other areas we can save much more space, if
we want to focus on that.

I was thinking we should perhaps do the opposite, namely getting rid of short
page headers. The overhead in the "byte position" <-> LSN conversion due to
the differing space is worse than the gain. Or do something inbetween - having
the system ID in the header adds a useful crosscheck, but I'm far less
convinced that having segment and block size in there, as 32bit numbers no
less, is worthwhile. After all, if the system id matches, it's not likely that
the xlog block or segment size differ.

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.
But I think we can relatively easily move segsize/blocksize checks to
a different place in the normal page header, which would reduce the
number of bytes we'd have to store elsewhere.

We could move segsize/blocksize into the xlp_info flags: 12 of the 16
bits are currently unused. Using 4 of these bits for segsize
(indicating 2^N MB, current accepted values are N=0..10 for 1 MB ...
1024MB) and 4 (or 3) for blcksz (as we currently support 1..64 kB
blocks, or 2^{0..6} kB). This would remove the need for 2 of the 3
fields in the large xlog block header.
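
A rough sketch of such an encoding (all macro names and bit positions below
are made up purely for illustration, assuming the existing flag bits stay in
the low nibble of xlp_info):

#include <stdint.h>

#define XLP_SEGSIZE_SHIFT   4                       /* bits 4..7: 2^N MB */
#define XLP_SEGSIZE_MASK    (0x0F << XLP_SEGSIZE_SHIFT)
#define XLP_BLCKSZ_SHIFT    8                       /* bits 8..10: 2^N kB */
#define XLP_BLCKSZ_MASK     (0x07 << XLP_BLCKSZ_SHIFT)

/* seg_size in bytes (1MB..1GB), blcksz in bytes (1kB..64kB), powers of 2 */
static inline uint16_t
xlp_encode_sizes(uint16_t info, uint32_t seg_size, uint32_t blcksz)
{
    uint16_t    seg_log2 = 0;
    uint16_t    blk_log2 = 0;

    while (((1024U * 1024U) << seg_log2) < seg_size)
        seg_log2++;
    while ((1024U << blk_log2) < blcksz)
        blk_log2++;

    return info |
        (uint16_t) (seg_log2 << XLP_SEGSIZE_SHIFT) |
        (uint16_t) (blk_log2 << XLP_BLCKSZ_SHIFT);
}

static inline uint32_t
xlp_decode_seg_size(uint16_t info)
{
    return (1024U * 1024U) << ((info & XLP_SEGSIZE_MASK) >> XLP_SEGSIZE_SHIFT);
}

static inline uint32_t
xlp_decode_blcksz(uint16_t info)
{
    return 1024U << ((info & XLP_BLCKSZ_MASK) >> XLP_BLCKSZ_SHIFT);
}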

After that we'll only have the system ID left from the extended
header, which we could store across 2 pages in the (current) alignment
losses of xlp_rem_len - even pages the upper half, uneven pages the
lower half of the ID. This should allow for enough integrity checks
without further increasing the size of XLogPageHeader in most
installations.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#9 Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#8)
Re: Lowering the default wal_blocksize to 4K

Hi,

On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:

On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:

On 2023-10-09 23:16:30 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Seems like that's doubling the overhead of WAL page headers. Do we need
to try to skinny those down?

I think the overhead is small, and we are wasting so much space in other
places, that I am not worried about the proportional increase page header
space usage at this point, particularly compared to saving in overall write
rate and increase in TPS. There's other areas we can save much more space, if
we want to focus on that.

I was thinking we should perhaps do the opposite, namely getting rid of short
page headers. The overhead in the "byte position" <-> LSN conversion due to
the differing space is worse than the gain. Or do something inbetween - having
the system ID in the header adds a useful crosscheck, but I'm far less
convinced that having segment and block size in there, as 32bit numbers no
less, is worthwhile. After all, if the system id matches, it's not likely that
the xlog block or segment size differ.

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.

I don't think that's something we need to address in every physical
segment. For one, there's no option to do so. But more importantly, if they
don't change the xlog block size, we'll just accept random WAL as well. If
somebody goes to the trouble of writing a custom tool, they can live with the
consequences of that potentially causing breakage. Particularly if the checks
wouldn't meaningfully prevent that anyway.

After that we'll only have the system ID left from the extended
header, which we could store across 2 pages in the (current) alignment
losses of xlp_rem_len - even pages the upper half, uneven pages the
lower half of the ID. This should allow for enough integrity checks
without further increasing the size of XLogPageHeader in most
installations.

I doubt that that's a good idea - what if there's just a single page in a
segment? And there aren't earlier segments? That's not a rare case, IME.

Greetings,

Andres Freund

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#9)
Re: Lowering the default wal_blocksize to 4K

On Wed, Oct 11, 2023 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:

On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:

I was thinking we should perhaps do the opposite, namely getting rid of short
page headers. The overhead in the "byte position" <-> LSN conversion due to
the differing space is worse than the gain. Or do something inbetween - having
the system ID in the header adds a useful crosscheck, but I'm far less
convinced that having segment and block size in there, as 32bit numbers no
less, is worthwhile. After all, if the system id matches, it's not likely that
the xlog block or segment size differ.

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.

I don't think that's something we need to address in every physical
segment. For one, there's no option to do so. But more importantly, if they
don't change the xlog block size, we'll just accept random WAL as well. If
somebody goes to the trouble of writing a custom tool, they can live with the
consequences of that potentially causing breakage. Particularly if the checks
wouldn't meaningfully prevent that anyway.

How about this idea: Put the system ID etc into the new record Robert
is proposing for the redo point, and also into the checkpoint record,
so that it's at both ends of the to-be-replayed range. That just
leaves the WAL segments in between. If you find yourself writing a
new record that would go in the first usable byte of a segment, insert
a new special system ID (etc) record that will be checked during
replay. For segments that start with XLP_FIRST_IS_CONTRECORD, don't
worry about it: those already form part of a chain of verification
(xlp_rem_len, xl_crc) that started on the preceding page, so it seems
almost impossible to accidentally replay from a segment that came from
another system.

#11 Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#10)
Re: Lowering the default wal_blocksize to 4K

Hi,

On 2023-10-11 14:39:12 +1300, Thomas Munro wrote:

On Wed, Oct 11, 2023 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:

On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:

On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:

I was thinking we should perhaps do the opposite, namely getting rid of short
page headers. The overhead in the "byte position" <-> LSN conversion due to
the differing space is worse than the gain. Or do something inbetween - having
the system ID in the header adds a useful crosscheck, but I'm far less
convinced that having segment and block size in there, as 32bit numbers no
less, is worthwhile. After all, if the system id matches, it's not likely that
the xlog block or segment size differ.

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.

I don't think that's something we need to address in every physical
segment. For one, there's no option to do so. But more importantly, if they
don't change the xlog block size, we'll just accept random WAL as well. If
somebody goes to the trouble of writing a custom tool, they can live with the
consequences of that potentially causing breakage. Particularly if the checks
wouldn't meaningfully prevent that anyway.

How about this idea: Put the system ID etc into the new record Robert
is proposing for the redo point, and also into the checkpoint record,
so that it's at both ends of the to-be-replayed range.

I think that's a very good idea.

That just leaves the WAL segments in between. If you find yourself writing
a new record that would go in the first usable byte of a segment, insert a
new special system ID (etc) record that will be checked during replay.

I don't see how we can do that without incurring a lot of overhead though. This
determination would need to happen in ReserveXLogInsertLocation(), while
holding the spinlock. Which is one of the most contended bits of code in
postgres. The whole reason that we have this "byte pos" to LSN conversion
stuff is to make the spinlock-protected part of ReserveXLogInsertLocation() as
short as possible.
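
For reference, the spinlock-protected part is essentially just a few additions
on byte positions; a simplified sketch (not a verbatim copy of
ReserveXLogInsertLocation()):

static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
                          XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
{
    XLogCtlInsert *Insert = &XLogCtl->Insert;
    uint64      startbytepos;
    uint64      endbytepos;
    uint64      prevbytepos;

    size = MAXALIGN(size);

    /*
     * This is the hot, contended section: just a few additions on "usable
     * byte positions".  Detecting "does this record start at the first
     * usable byte of a segment?" would require converting to a real LSN
     * (or dividing by the usable bytes per segment) while still holding
     * the lock.
     */
    SpinLockAcquire(&Insert->insertpos_lck);

    startbytepos = Insert->CurrBytePos;
    endbytepos = startbytepos + size;
    prevbytepos = Insert->PrevBytePos;
    Insert->CurrBytePos = endbytepos;
    Insert->PrevBytePos = startbytepos;

    SpinLockRelease(&Insert->insertpos_lck);

    /* the byte-position -> LSN conversions happen outside the lock */
    *StartPos = XLogBytePosToRecPtr(startbytepos);
    *EndPos = XLogBytePosToRecPtr(endbytepos);
    *PrevPtr = XLogBytePosToRecPtr(prevbytepos);
}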

For segments that start with XLP_FIRST_IS_CONTRECORD, don't worry about it:
those already form part of a chain of verification (xlp_rem_len, xl_crc)
that started on the preceding page, so it seems almost impossible to
accidentally replay from a segment that came from another system.

But I think we might just be ok with logic similar to this, even for the
non-contrecord case. If recovery starts in one segment where we have verified
sysid, xlog block size etc and we encounter a WAL record starting on the first
"content byte" of a segment, we can still verify that the prev LSN is correct
etc. Sure, if you try hard you could come up with a scenario where you could
mislead such a check, but we don't need to protect against intentional malice
here, just against accidents.

Greetings,

Andres Freund

#12 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andres Freund (#9)
Re: Lowering the default wal_blocksize to 4K

On Wed, 11 Oct 2023 at 01:29, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-10-10 21:30:44 +0200, Matthias van de Meent wrote:

On Tue, 10 Oct 2023 at 06:14, Andres Freund <andres@anarazel.de> wrote:

On 2023-10-09 23:16:30 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

There's an alternative approach we could take, which is to write in 4KB
increments, while keeping 8KB pages. With the current format that's not
obviously a bad idea. But given there aren't really advantages in 8KB WAL
pages, it seems we should just go for 4KB?

Seems like that's doubling the overhead of WAL page headers. Do we need
to try to skinny those down?

I think the overhead is small, and we are wasting so much space in other
places, that I am not worried about the proportional increase page header
space usage at this point, particularly compared to saving in overall write
rate and increase in TPS. There's other areas we can save much more space, if
we want to focus on that.

I was thinking we should perhaps do the opposite, namely getting rid of short
page headers. The overhead in the "byte position" <-> LSN conversion due to
the differing space is worse than the gain. Or do something inbetween - having
the system ID in the header adds a useful crosscheck, but I'm far less
convinced that having segment and block size in there, as 32bit numbers no
less, is worthwhile. After all, if the system id matches, it's not likely that
the xlog block or segment size differ.

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.

I don't think that's something we need to address in every physical
segment. For one, there's no option to do so.

Not block size, but xlog segment size is modifiable with pg_resetwal,
and could thus reasonably change across restarts. Apart from more
practical concerns around compile-time options requiring you to swap
out binaries, I don't really see why xlog block size couldn't be
changed with pg_resetwal in a securely shutdown cluster as one does
with the WAL segment size.

But more importantly, if they
don't change the xlog block size, we'll just accept random WAL as well. If
somebody goes to the trouble of writing a custom tool, they can live with the
consequences of that potentially causing breakage. Particularly if the checks
wouldn't meaningfully prevent that anyway.

I don't understand what you mean by that "we'll just accept random WAL
as well". We do significant validation in XLogReaderValidatePageHeader
to make sure that all pages of WAL are sufficiently formatted so that
they can securely be read by the available infrastructure with the
least chance of misreading data. There is no chance currently that we
read WAL from WAL segments that contain correct data for different
segment or block sizes. That includes WAL from segments created before
a pg_resetwal changed the WAL segment size.

If this "custom tool" refers to the typo-ed name of pg_resetwal, that
is hardly a custom tool, it is shipped with PostgreSQL and you can
find the sources under src/bin/pg_resetwal.

After that we'll only have the system ID left from the extended
header, which we could store across 2 pages in the (current) alignment
losses of xlp_rem_len - even pages the upper half, uneven pages the
lower half of the ID. This should allow for enough integrity checks
without further increasing the size of XLogPageHeader in most
installations.

I doubt that that's a good idea - what if there's just a single page in a
segment? And there aren't earlier segments? That's not a rare case, IME.

Then we'd still have 50% of a system ID which we can check against for
any corruption. I agree that it increases the chance of conflicts, but
it's still strictly better than nothing at all.
An alternative solution would be to write the first two pages of a WAL
segment regardless of contents, so that we essentially never only have
access to the first page during crash recovery. Physical replication's
recovery wouldn't be able to read ahead, but I consider that as less
problematic.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#9)
Re: Lowering the default wal_blocksize to 4K

On Tue, Oct 10, 2023 at 7:29 PM Andres Freund <andres@anarazel.de> wrote:

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.

I don't think that's something we need to address in every physical
segment. For one, there's no option to do so. But more importantly, if they
don't change the xlog block size, we'll just accept random WAL as well. If
somebody goes to the trouble of writing a custom tool, they can live with the
consequences of that potentially causing breakage. Particularly if the checks
wouldn't meaningfully prevent that anyway.

I'm extremely confused about what both of you are saying.

Matthias is referring to pg_reset_wal, which I assume means
pg_resetwal. But it has no option to change the WAL block size. It
does have an option to change the WAL segment size, but that's not the
same thing. And even if pg_resetwal did have an option to change the
WAL segment size, it removes all WAL from pg_wal when it runs, so you
wouldn't normally end up trying to replay WAL from before the
operation because it would have been removed. You might still have
those files around in an archive or something, but the primary doesn't
replay from the archive. You might have standbys, but I would assume
they would have to be rebuilt after changing the WAL block size on the
master, unless you were trying to follow some probably-too-clever
procedure to avoid a standby rebuild. So I'm really kind of lost as to
what the scenario is that Matthias has in mind.

But Andres's response doesn't make any sense to me either. What in the
world does "if they don't change the xlog block size, we'll just
accept random WAL as well" mean? Neither having nor not having a check
that the block size hasn't changed causes us to "just accept random
WAL". To "accept random WAL," we'd have to remove all of the sanity
checks, which nobody is proposing and nobody would accept.

But if we do want to keep those cross-checks, why not take what Thomas
proposed a little further and move all of xlp_sysid, xlp_seg_size, and
xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page
headers would become identical. We'd lose the ability to recheck those
values for every new segment, but it seems quite unlikely that any of
these values would change in the middle of replay. If they did, would
xl_prev and xl_crc be sufficient to catch that? I think Andres says in
a later email that they would be, and I think I'm inclined to agree.
False xl_prev matches don't seem especially unlikely, but xl_crc seems
like it should be effective.

--
Robert Haas
EDB: http://www.enterprisedb.com

#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#13)
Re: Lowering the default wal_blocksize to 4K

On Thu, Oct 12, 2023 at 9:05 AM Robert Haas <robertmhaas@gmail.com> wrote:

But if we do want to keep those cross-checks, why not take what Thomas
proposed a little further and move all of xlp_sysid, xlp_seg_size, and
xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page
headers would become identical.

FTR that's exactly what I was trying to say.

We'd lose the ability to recheck those
values for every new segment, but it seems quite unlikely that any of
these values would change in the middle of replay. If they did, would
xl_prev and xl_crc be sufficient to catch that? I think Andres says in
a later email that they would be, and I think I'm inclined to agree.
False xl_prev matches don't seem especially unlikely, but xl_crc seems
like it should be effective.

Right, it is strong enough, and covers the common case where a record
crosses the segment boundary.

That leaves only the segments where a record starts exactly on the
first usable byte of a segment, which is why I was trying to think of
a way to cover that case too. I suggested we could notice and insert
a new record at that place. But Andres suggests it would be too
expensive and not worth worrying about.

#15 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#14)
Re: Lowering the default wal_blocksize to 4K

On Thu, Oct 12, 2023 at 9:27 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Oct 12, 2023 at 9:05 AM Robert Haas <robertmhaas@gmail.com> wrote:

But if we do want to keep those cross-checks, why not take what Thomas
proposed a little further and move all of xlp_sysid, xlp_seg_size, and
xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO? Then long and short page
headers would become identical.

FTR that's exactly what I was trying to say.

And to be extra double explicit, the point of that is to kill the
'div' instruction that Andres was complaining about, because now the
division depends only on compile time constants so it can be done with
multiplication and bitswizzling tricks. For example, when X is a
variable I get:

*a = n / X;
0x0000000000000003 <+3>: mov %rdi,%rax
0x0000000000000006 <+6>: xor %edx,%edx
0x0000000000000008 <+8>: divq 0x0(%rip) # 0xf <f+15>
0x0000000000000011 <+17>: mov %rax,(%rsi)

*b = n % X;
0x000000000000000f <+15>: xor %edx,%edx
0x0000000000000014 <+20>: mov %rdi,%rax
0x0000000000000017 <+23>: divq 0x0(%rip) # 0x1e <f+30>
0x000000000000001e <+30>: mov %rdx,(%rcx)

... but when it's the constant 8192 - 24 I get:

*a = n / X;
0x0000000000000000 <+0>: movabs $0x2018120d8a279db7,%rax
0x000000000000000d <+13>: mov %rdi,%rdx
0x0000000000000010 <+16>: shr $0x3,%rdx
0x0000000000000014 <+20>: mul %rdx
0x0000000000000017 <+23>: shr $0x7,%rdx
0x000000000000001b <+27>: mov %rdx,(%rsi)

*b = n % X;
0x000000000000001e <+30>: imul $0x1fe8,%rdx,%rdx
0x0000000000000025 <+37>: sub %rdx,%rdi
0x0000000000000028 <+40>: mov %rdi,(%rcx)
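
(The source behind that disassembly is approximately the following - not the
exact function, just the shape of it, with X either a global uint64 or the
constant 8192 - 24:)

#include <stdint.h>

#ifdef CONSTANT_X
#define X ((uint64_t) (8192 - 24))      /* compile-time constant */
#else
uint64_t X;                             /* runtime variable */
#endif

void
f(uint64_t n, uint64_t *a, uint64_t *b)
{
    *a = n / X;
    *b = n % X;
}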

#16 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#13)
Re: Lowering the default wal_blocksize to 4K

Hi,

On 2023-10-11 16:05:02 -0400, Robert Haas wrote:

On Tue, Oct 10, 2023 at 7:29 PM Andres Freund <andres@anarazel.de> wrote:

Hmm. I don't think we should remove those checks, as I can see people
that would want to change their XLog block size with e.g.
pg_reset_wal.

I don't think that's something we need to address in every physical
segment. For one, there's no option to do so. But more importantly, if they
don't change the xlog block size, we'll just accept random WAL as well. If
somebody goes to the trouble of writing a custom tool, they can live with the
consequences of that potentially causing breakage. Particularly if the checks
wouldn't meaningfully prevent that anyway.

I'm extremely confused about what both of you are saying.

Matthias is referring to pg_reset_wal, which I assume means
pg_resetwal. But it has no option to change the WAL block size. It
does have an option to change the WAL segment size, but that's not the
same thing. And even if pg_resetwal did have an option to change the
WAL segment size, it removes all WAL from pg_wal when it runs, so you
wouldn't normally end up trying to replay WAL from before the
operation because it would have been removed. You might still have
those files around in an archive or something, but the primary doesn't
replay from the archive. You might have standbys, but I would assume
they would have to be rebuilt after changing the WAL block size on the
master, unless you were trying to follow some probably-too-clever
procedure to avoid a standby rebuild. So I'm really kind of lost as to
what the scenario is that Matthias has in mind.

I think the question is what the point of the crosschecks in long page headers
is. It's pretty easy to see what the point of the xlp_sysid check is - make it
less likely to accidentally replay WAL from a different system. It's much
less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all,
they are also in ControlFileData and the xlp_sysid check tied the control file
and WAL file together.

But Andres's response doesn't make any sense to me either. What in the world
does "if they don't change the xlog block size, we'll just accept random WAL
as well" mean? Neither having or not having a check that the block size
hasn't change causes us to "just accept random WAL". To "accept random WAL,"
we'd have to remove all of the sanity checks, which nobody is proposing and
nobody would accept.

Let me rephrase my point:

If somebody uses a modified pg_resetwal to change the xlog block size, then
tries to replay WAL from before that change, and is unlucky enough that the
LSN looked for in a segment is the start of a valid record both before/after
the pg_resetwal invocation, then yes, we might not catch that anymore if we
remove the block size check. But the much much more common case is that the
block size was *not* changed, in which case we *already* don't catch that
pg_resetwal was invoked.

ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are
a belt and suspenders check that is very unlikely to ever catch a mistake that
wouldn't otherwise be caught.

But if we do want to keep those cross-checks, why not take what Thomas
proposed a little further and move all of xlp_sysid, xlp_seg_size, and
xlp_xlog_blcksz into XLOG_CHECKPOINT_REDO?

I think that's what Thomas was proposing. Thinking about it a bit more I'm
not sure that having the data both in the checkpoint record itself and in
XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ...

Then long and short page headers would become identical.

Which would make the code more efficient...

We'd lose the ability to recheck those values for every new segment, but it
seems quite unlikely that any of these values would change in the middle of
replay.

I guess the most likely scenario would be a replica that has some local WAL
files (e.g. due to pg_basebackup -X ...), but accidentally configures a
restore_command pointing to the wrong archive. In that case recovery could
start up successfully as the checkpoint/redo records have sensible contents,
but we then wouldn't necessarily notice that the subsequent files aren't from
the correct system. Of course you need to be quite unlucky and have LSNs that
match between the systems. The most likely path I can think of is an idle
system with archive_timeout.

As outlined above, I don't think xlp_seg_size, xlp_xlog_blcksz buy us
anything, but that the protection by xlp_sysid is a bit more meaningful. So a
compromise position could be to include xlp_sysid in the page header, possibly
in a "chopped up" manner, as Matthias suggested.

If they did, would xl_prev and xl_crc be sufficient to catch that? I think
Andres says in a later email that they would be, and I think I'm inclined to
agree. False xl_prev matches don't seem especially unlikely, but xl_crc
seems like it should be effective.

I think it'd be an entirely tolerable risk. Even if a WAL file were falsely
replayed, we'd still notice the problem soon after. I think once such a
mistake was made, it'd be inadvisable to continue using the cluster anyway.

Greetings,

Andres Freund

#17 Andres Freund
andres@anarazel.de
In reply to: Matthias van de Meent (#12)
Re: Lowering the default wal_blocksize to 4K

Hi,

On 2023-10-11 16:09:21 +0200, Matthias van de Meent wrote:

On Wed, 11 Oct 2023 at 01:29, Andres Freund <andres@anarazel.de> wrote:

After that we'll only have the system ID left from the extended
header, which we could store across 2 pages in the (current) alignment
losses of xlp_rem_len - even pages the upper half, uneven pages the
lower half of the ID. This should allow for enough integrity checks
without further increasing the size of XLogPageHeader in most
installations.

I doubt that that's a good idea - what if there's just a single page in a
segment? And there aren't earlier segments? That's not a rare case, IME.

Then we'd still have 50% of a system ID which we can check against for
any corruption. I agree that it increases the chance of conflics, but
it's still strictly better than nothing at all.

A fair point - I somehow disregarded that all bits in the system id are
equally meaningful.

Greetings,

Andres Freund

#18 Robert Haas
robertmhaas@gmail.com
In reply to: Thomas Munro (#14)
Re: Lowering the default wal_blocksize to 4K

On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote:

That leaves only the segments where a record starts exactly on the
first usable byte of a segment, which is why I was trying to think of
a way to cover that case too. I suggested we could notice and insert
a new record at that place. But Andres suggests it would be too
expensive and not worth worrying about.

Hmm. Even in that case, xl_prev has to match. It's not like it's the
wild west. Sure, it's not nearly as good of a cross-check, but it's
something. It seems to me that it's not worth worrying very much about
xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
you're doing that kind of advanced magic, you need to be careful
enough to not mess it up, and if we still cross-check once per
checkpoint cycle that's pretty good. I do worry a bit about the sysid
changing under us, though. It's not that hard to get your WAL archives
mixed up, and it'd be nice to catch that right away.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#16)
Re: Lowering the default wal_blocksize to 4K

On Wed, Oct 11, 2023 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:

I think the question is what the point of the crosschecks in long page headers
is. It's pretty easy to see what the point of the xlp_sysid check is - make it
less likely to accidentally replay WAL from a different system. It's much
less clear what the point of xlp_seg_size and xlp_xlog_blcksz is - after all,
they are also in ControlFileData and the xlp_sysid check tied the control file
and WAL file together.

Yeah, fair.

Let me rephrase my point:

If somebody uses a modified pg_resetwal to change the xlog block size, then
tries to replay WAL from before that change, and is unlucky enough that the
LSN looked for in a segment is the start of a valid record both before/after
the pg_resetwal invocation, then yes, we might not catch that anymore if we
remove the block size check. But the much much more common case is that the
block size was *not* changed, in which case we *already* don't catch that
pg_resetwal was invoked.

Hmm. Should we invent a mechanism just for that?

ISTM that the xlp_seg_size and xlp_xlog_blcksz checks in long page headers are
a belt and suspenders check that is very unlikely to ever catch a mistake that
wouldn't otherwise be caught.

I think that's probably right.

I think that's what Thomas was proposing. Thinking about it a bit more I'm
not sure that having the data both in the checkpoint record itself and in
XLOG_CHECKPOINT_REDO buys much. But it's also pretty much free, so ...

Yes. To me, having it in the redo record seems considerably more
valuable. Because that's where we're going to begin replay, so we
should catch most problems straight off. To escape detection at that
point, you need to not just be pointed at the wrong WAL archive, but
actually have files of diverse origin mixed together in the same WAL
archive. That's a less-likely error, and we still have some ways of
catching it if it happens.

Which would make the code more efficient...

Right.

As outlined above, I don't think xlp_seg_size, xlp_xlog_blcksz buy us
anything, but that the protection by xlp_sysid is a bit more meaningful. So a
compromise position could be to include xlp_sysid in the page header, possibly
in a "chopped up" manner, as Matthias suggested.

I'm not that keen on the idea of storing the upper half and lower half
in alternate pages. That seems to me to add code complexity and
cognitive burden with little increased likelihood of catching real
problems. I'm not completely opposed to the idea if somebody wants to
make it happen, but I bet it would be better to either store the whole
thing or just cut it in half and store, say, the low-order bits.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20 Ants Aasma
ants@cybertec.at
In reply to: Robert Haas (#18)
Re: Lowering the default wal_blocksize to 4K

On Thu, 12 Oct 2023 at 16:36, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 11, 2023 at 4:28 PM Thomas Munro <thomas.munro@gmail.com>
wrote:

That leaves only the segments where a record starts exactly on the
first usable byte of a segment, which is why I was trying to think of
a way to cover that case too. I suggested we could notice and insert
a new record at that place. But Andres suggests it would be too
expensive and not worth worrying about.

Hmm. Even in that case, xl_prev has to match. It's not like it's the
wild west. Sure, it's not nearly as good of a cross-check, but it's
something. It seems to me that it's not worth worrying very much about
xlp_seg_size or xlp_blcksz changing undetected in that scenario - if
you're doing that kind of advanced magic, you need to be careful
enough to not mess it up, and if we still cross-check once per
checkpoint cycle that's pretty good. I do worry a bit about the sysid
changing under us, though. It's not that hard to get your WAL archives
mixed up, and it'd be nice to catch that right away.

This reminds me that xlp_tli is not being used to its full potential right
now either. We only check that it's not going backwards, but there is at
least one not very hard to hit way to get postgres to silently replay on
the wrong timeline. [1]

[1] /messages/by-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com

--
Ants Aasma
Senior Database Engineer
www.cybertec-postgresql.com

#21 Robert Haas
robertmhaas@gmail.com
In reply to: Ants Aasma (#20)
Re: Lowering the default wal_blocksize to 4K

On Thu, Oct 12, 2023 at 9:57 AM Ants Aasma <ants@cybertec.at> wrote:

This reminds me that xlp_tli is not being used to its full potential right now either. We only check that it's not going backwards, but there is at least one not very hard to hit way to get postgres to silently replay on the wrong timeline. [1]

[1] /messages/by-id/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw@mail.gmail.com

Maybe I'm missing something, but that seems mostly unrelated. What
you're discussing there is the server's ability to figure out when it
ought to perform a timeline switch. In other words, the server settles
on the wrong TLI and therefore opens and reads from the wrong
filename. But here, we're talking about the case where the server is
correct about the TLI and LSN and hence opens exactly the right file
on disk, but the contents of the file on disk aren't what they're
supposed to be due to a procedural error.

Said differently, I don't see how anything we could do with xlp_tli
would actually fix the problem discussed in that thread. That can
detect a situation where the TLI of the file doesn't match the TLI of
the pages inside the file, but it doesn't help with the case where the
server decided to read the wrong file in the first place.

But this does make me wonder whether storing xlp_tli and xlp_pageaddr
in every page is really worth the bit-space. That takes 12 bytes plus
any padding it forces us to incur, but the actual entropy content of
those 12 bytes must be quite low. In normal cases probably 7 or so of
those bytes are going to consist entirely of zero bits (TLI < 256,
LSN%8k == 0, LSN < 2^40). We could probably find a way of jumbling
the LSN, TLI, and maybe some other stuff into an 8-byte quantity or
even perhaps a 4-byte quantity that would do about as good a job
catching problems as what we have now (e.g.
LSN_HIGH32^LSN_LOW32^BITREVERSE(TLI)). In the event of a mismatch, the
value actually stored in the page header would be harder for humans to
understand, but I'm not sure that really matters here. Users should
mostly be concerned with whether a WAL file matches the cluster where
they're trying to replay it; forensics on misplaced or corrupted WAL
files should be comparatively rare.
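
Just to illustrate the kind of jumbling I mean (purely hypothetical, names
made up, not a worked-out proposal):

#include <stdint.h>

/*
 * Fold the 8-byte page address and the 4-byte TLI into one 4-byte check
 * value.  The TLI is bit-reversed before XORing so that TLI and LSN
 * changes tend to affect different bits, per the
 * LSN_HIGH32^LSN_LOW32^BITREVERSE(TLI) idea above.
 */
static inline uint32_t
xlp_check_value(uint64_t pageaddr, uint32_t tli)
{
    uint32_t    lsn_hi = (uint32_t) (pageaddr >> 32);
    uint32_t    lsn_lo = (uint32_t) pageaddr;
    uint32_t    rev = 0;

    for (int i = 0; i < 32; i++)
    {
        rev = (rev << 1) | (tli & 1);
        tli >>= 1;
    }

    return lsn_hi ^ lsn_lo ^ rev;
}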

--
Robert Haas
EDB: http://www.enterprisedb.com