problems with making relfilenodes 56-bits
OK, so the recent commit and revert of the 56-bit relfilenode patch
revealed a few issues that IMHO need design-level input. Let me try to
surface those here, starting a new thread to separate this discussion
from the clutter:
1. Commit Record Alignment. ParseCommitRecord() and ParseAbortRecord()
are dependent on every subsidiary structure that can be added to a
commit or abort record requiring exactly 4-byte alignment. IMHO, this
seems awfully fragile, even leaving the 56-bit patch aside. Prepare
records seem to have a much saner scheme: they've also got a bunch of
different things that can be stuck onto the main record, but they
maxalign each top-level thing that they stick in there. So
ParsePrepareRecord() doesn't have to make any icky alignment
assumptions the way ParseCommitRecord() and ParseAbortRecord() do.
Unfortunately, that scheme doesn't work as well for commit records,
because the very first top-level thing only needs 2 bytes. We're
currently using 4, and it would obviously be nicer to cut that down to
2 than to have it go up to 8. We could try to rejigger things around
somehow to avoid needing that 2-byte quantity in there as a separate
toplevel item, but I'm not quite sure how to do that, or we could just
copy everything to ensure alignment, but that seems kind of expensive.
If we don't decide to do either of those things, we should at least
better document, and preferably enforce via asserts, the requirement
that these structs be exactly 4-byte aligned, so that nobody else
makes the same mistake in the future.
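For illustration, enforcing this via asserts could look something like the
sketch below; which subsidiary structs need the check, and their exact
names, are assumptions here rather than a worked-out patch:

#include "postgres.h"
#include "access/xact.h"        /* xl_xact_* subsidiary structs */

/*
 * ParseCommitRecord()/ParseAbortRecord() walk the record by advancing a
 * pointer past each subsidiary struct, so each struct's size must be a
 * multiple of 4 bytes or the following struct ends up misaligned.
 */
StaticAssertDecl(sizeof(xl_xact_subxacts) % sizeof(uint32) == 0,
                 "xl_xact_subxacts must be a multiple of 4 bytes");
StaticAssertDecl(sizeof(xl_xact_relfilelocators) % sizeof(uint32) == 0,
                 "xl_xact_relfilelocators must be a multiple of 4 bytes");
StaticAssertDecl(sizeof(xl_xact_invals) % sizeof(uint32) == 0,
                 "xl_xact_invals must be a multiple of 4 bytes");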
2. WAL Size. Block references in the WAL are by RelFileLocator, so if
you make RelFileLocators bigger, WAL gets bigger. We'd have to test
the exact impact of this, but it seems a bit scary: if you have a WAL
stream with few FPIs doing DML on a narrow table, probably most
records will contain 1 block reference (and occasionally more, but I
guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
block reference feels like it might add up to something significant. I
don't really see any way around this, either: if you make relfilenode
values wider, they take up more space. Perhaps there's a way to claw
that back elsewhere, or we could do something really crazy like switch
to variable-width representations of integer quantities in WAL
records, but there doesn't seem to be any simple way forward other
than, you know, deciding that we're willing to pay the cost of the
additional WAL volume.
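To make the arithmetic concrete, here is a rough before/after sketch of the
locator portion of a (non-SAME_REL) block reference; the "wide" variant is
purely illustrative, not the actual patch layout (Oid and RelFileNumber are
the usual PostgreSQL typedefs):

typedef struct RelFileLocator       /* today: 12 bytes in each block reference */
{
    Oid         spcOid;            /* tablespace */
    Oid         dbOid;             /* database */
    RelFileNumber relNumber;       /* 32 bits */
} RelFileLocator;

typedef struct WideRelFileLocator   /* hypothetical: 16 bytes, i.e. +4 per reference */
{
    Oid         spcOid;
    Oid         dbOid;
    uint64      relNumber;         /* 56 bits used, stored in 8 bytes */
} WideRelFileLocator;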
3. Sinval Message Size. Sinval messages are 16 bytes right now.
They'll have to grow to 20 bytes if we do this. There's even less room
for bit-squeezing here than there is for the WAL stuff. I'm skeptical
that this really matters, but Tom seems concerned.
4. Other Uses of RelFileLocator. There are a bunch of structs I
haven't looked into yet that also embed RelFileLocator, which may have
their own issues with alignment, padding, and/or size: ginxlogSplit,
ginxlogDeletePage, ginxlogUpdateMeta, gistxlogPageReuse,
xl_heap_new_cid, xl_btree_reuse_page, LogicalRewriteMappingData,
xl_smgr_truncate, xl_seq_rec, ReorderBufferChange, FileTag. I think a
bunch of these are things that get written into WAL, but at least some
of them seem like they probably don't get written into WAL enough to
matter. Needs more investigation, though.
Thoughts?
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
3. Sinval Message Size. Sinval messages are 16 bytes right now.
They'll have to grow to 20 bytes if we do this. There's even less room
for bit-squeezing here than there is for the WAL stuff. I'm skeptical
that this really matters, but Tom seems concerned.
As far as that goes, I'm entirely prepared to accept a conclusion
that the benefits of widening relfilenodes justify whatever space
or speed penalties may exist there. However, we cannot honestly
make that conclusion if we haven't measured said penalties.
The same goes for the other issues you raise here.
regards, tom lane
On Wed, Sep 28, 2022 at 4:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
As far as that goes, I'm entirely prepared to accept a conclusion
that the benefits of widening relfilenodes justify whatever space
or speed penalties may exist there. However, we cannot honestly
make that conclusion if we haven't measured said penalties.
The same goes for the other issues you raise here.
I generally agree, but the devil is in the details.
I tend to agree with Robert that many individual WAL record types just
don't appear frequently enough to matter (it also helps that even the
per-record space overhead with wider 56-bit relfilenodes isn't so
bad). Just offhand I'd say that ginxlogSplit, ginxlogDeletePage,
ginxlogUpdateMeta, gistxlogPageReuse and xl_btree_reuse_page are
likely to be in this category (though would be nice to see some
numbers for those).
I'm much less sure about the other record types. Any WAL records with
a variable number of relfilenode entries seem like they might be more
of a problem. But I'm not ready to accept that that cannot be
ameliorated in some way. Just for example, it wouldn't be impossible to
do some kind of varbyte encoding for some record types. How many times
will the cluster actually need billions of relfilenodes? It has to
work, but maybe it can be suboptimal from a space overhead
perspective.
I'm not saying that we need to do anything fancy just yet. I'm
just saying that there definitely *are* options. Maybe it's not really
necessary to come up with something like a varbyte encoding, and maybe
the complexity it imposes just won't be worth it -- I really have no
opinion on that just yet.
--
Peter Geoghegan
On Thu, 29 Sep 2022, 00:06 Robert Haas, <robertmhaas@gmail.com> wrote:
2. WAL Size. Block references in the WAL are by RelFileLocator, so if
you make RelFileLocators bigger, WAL gets bigger. We'd have to test
the exact impact of this, but it seems a bit scary: if you have a WAL
stream with few FPIs doing DML on a narrow table, probably most
records will contain 1 block reference (and occasionally more, but I
guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
block reference feels like it might add up to something significant. I
don't really see any way around this, either: if you make relfilenode
values wider, they take up more space. Perhaps there's a way to claw
that back elsewhere, or we could do something really crazy like switch
to variable-width representations of integer quantities in WAL
records, but there doesn't seem to be any simple way forward other
than, you know, deciding that we're willing to pay the cost of the
additional WAL volume.
Re: WAL volume and record size optimization
I've been working off and on with WAL for some time now due to [0] and
the interest of Neon in the area, and I think we can reduce the size
of the base record by a significant margin:
Currently, our minimal WAL record is exactly 24 bytes: length (4B),
TransactionId (4B), previous record pointer (8B), flags (1B), redo
manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
fields, TransactionID could reasonably be omitted for certain WAL
records (as example: index insertions don't really need the XID).
Additionally, the length field could be made to be variable length,
and any padding is just plain bad (adding 4 bytes to all
insert/update/delete/lock records was frowned upon).
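For reference, the current header (paraphrased from
src/include/access/xlogrecord.h):

typedef struct XLogRecord
{
    uint32      xl_tot_len;     /* total length of entire record */
    TransactionId xl_xid;       /* xact id */
    XLogRecPtr  xl_prev;        /* ptr to previous record in log */
    uint8       xl_info;        /* flag bits */
    RmgrId      xl_rmid;        /* resource manager for this record */
    /* 2 bytes of padding here, initialized to zero */
    pg_crc32c   xl_crc;         /* CRC for this record */
} XLogRecord;                   /* 24 bytes */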
I'm working on a prototype patch for a more bare-bones WAL record
header of which the only required fields would be prevptr (8B), CRC
(4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
yet know the performance of this, but considering that there will
be a lot more conditionals in header decoding it might be slower for
any one backend, but faster overall (less overall IOps).
The flags field would be indications for additional information: [flag
name (bits): explanation (additional xlog header data in bytes)]
- len_size(0..1): xlog record size is at most xlrec_header_only (0B),
uint8_max(1B), uint16_max(2B), uint32_max(4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or
probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B)
(rationale for logging CID in [0], now in record header because XID is
included there as well, and both are required for consistent
snapshots.)
- has_rminfo (4): has non-zero redo-manager flags field (1B)
(rationale for separate field [1], non-zero allows 1B space
optimization for one of each RMGR's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record
compression, or other purposes.
A normal record header (XLOG record with at least some registered
data) would be only 15 to 17 bytes (0-1B rminfo + 1-2B in xl_len), and
one with XID only up to 21 bytes. So, when compared to the current
XLogRecord format, we would in general recover 2 or 3 bytes from the
xl_tot_len field, 1 or 2 bytes from the alignment hole, and
potentially the 4 bytes of the xid when that data is considered
useless during recovery, or physical or logical replication.
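Purely as an illustration, the layout described above might look like the
following; the names, bit assignments and struct shape are placeholders, not
the prototype's actual definitions:

#define XLR_LEN_SIZE_MASK       0x03    /* bits 0..1: 0/1/2/4-byte length field */
#define XLR_HAS_XID             0x04    /* bit 2: 4-byte (later 8-byte) XID follows */
#define XLR_HAS_CID             0x08    /* bit 3: 4-byte CommandId follows */
#define XLR_HAS_RMINFO          0x10    /* bit 4: 1-byte rmgr-specific flags follow */
#define XLR_SPECIAL_REL_UPDATE  0x20    /* bit 5: pre-existing meaning */
#define XLR_CHECK_CONSISTENCY   0x40    /* bit 6: pre-existing meaning */
                                        /* bit 7 currently unassigned */

typedef struct XLogRecordHeaderMin  /* hypothetical minimal header: 14 bytes */
{
    XLogRecPtr  xl_prev;        /* 8 bytes */
    pg_crc32c   xl_crc;         /* 4 bytes */
    RmgrId      xl_rmid;        /* 1 byte */
    uint8       xl_flags;       /* 1 byte, bits as above */
    /* optional length/xid/cid/rminfo fields follow, as indicated by xl_flags */
} XLogRecordHeaderMin;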
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc=SdEHHsW6HS+k9zw@mail.gmail.com
[1]: /messages/by-id/20220715173731.6t3km5cww3f5ztfq@awork3.anarazel.de
On Thu, Sep 29, 2022 at 12:24 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
Currently, our minimal WAL record is exactly 24 bytes: length (4B),
TransactionId (4B), previous record pointer (8B), flags (1B), redo
manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
fields, TransactionID could reasonably be omitted for certain WAL
records (as example: index insertions don't really need the XID).
Additionally, the length field could be made to be variable length,
and any padding is just plain bad (adding 4 bytes to all
insert/update/delete/lock records was frowned upon).
Right. I was shocked when I realized that we had two bytes of padding
in there, considering that numerous rmgrs are stealing bits from the
1-byte field that identifies the record type. My question was: why
aren't we exposing those 2 bytes for rmgr-type-specific use? Or for
something like xl_xact_commit, we could get rid of xl_xact_info if we
had those 2 bytes to work with.
Right now, I see that a bare commit record is 34 bytes which rounds
out to 40. With the trick above, we could shave off 4 bytes bringing
the size to 30 which would round to 32. That's a pretty significant
savings, although it'd be a lot better if we could get some kind of
savings for DML records which could be much higher frequency.
I'm working on a prototype patch for a more bare-bones WAL record
header of which the only required fields would be prevptr (8B), CRC
(4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
yet know the performance of this, but the considering that there will
be a lot more conditionals in header decoding it might be slower for
any one backend, but faster overall (less overall IOps)
The flags field would be indications for additional information: [flag
name (bits): explanation (additional xlog header data in bytes)]
- len_size(0..1): xlog record size is at most xlrec_header_only (0B),
uint8_max(1B), uint16_max(2B), uint32_max(4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or
probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B)
(rationale for logging CID in [0], now in record header because XID is
included there as well, and both are required for consistent
snapshots.
- has_rminfo (4): has non-zero redo-manager flags field (1B)
(rationale for separate field [1], non-zero allows 1B space
optimization for one of each RMGR's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record
compression, or other purposes.
Interesting. One fly in the ointment here is that WAL records start on
8-byte boundaries (probably MAXALIGN boundaries, but I didn't check
the details). And after the 24-byte header, there's a 2-byte header
(or 5-byte header) introducing the payload data (see
XLR_BLOCK_ID_DATA_SHORT/LONG). So if the size of the actual payload
data is a multiple of 8, and is short enough that we use the short
data header, we waste 6 bytes. If the data length is a multiple of 4,
we waste 2 bytes. And those are probably really common cases. So the
big improvements probably come from saving 2 bytes or 6 bytes or 10
bytes, and saving say 3 or 5 is probably not much better than 2. Or at
least that's what I'm guessing.
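Here's a quick back-of-the-envelope program for that guess, ignoring block
references and considering only the 24-byte XLogRecord plus the 2-byte short
data header, with the total padded out to 8 bytes:

#include <stdio.h>

int
main(void)
{
    const int header = 24 + 2;  /* XLogRecord + XLR_BLOCK_ID_DATA_SHORT header */

    for (int payload = 16; payload <= 40; payload += 4)
    {
        int unpadded = header + payload;
        int padded = (unpadded + 7) & ~7;   /* round up to the next multiple of 8 */

        printf("payload %2d -> record %2d, padded %2d, wasted %d\n",
               payload, unpadded, padded, padded - unpadded);
    }
    return 0;
}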
--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Sep 29, 2022 at 2:36 AM Robert Haas <robertmhaas@gmail.com> wrote:
2. WAL Size. Block references in the WAL are by RelFileLocator, so if
you make RelFileLocators bigger, WAL gets bigger. We'd have to test
the exact impact of this, but it seems a bit scary
I have done some testing around this area to see the impact on WAL
size especially when WAL sizes are smaller, with a very simple test
with insert/update/delete I can see around an 11% increase in WAL size
[1] then I did some more test with pgbench with smaller scale
factor(1) there I do not see a significant increase in the WAL size
although it increases WAL size around 1-2%. [2].
[1]:
checkpoint;
do $$
declare
lsn1 pg_lsn;
lsn2 pg_lsn;
diff float;
begin
select pg_current_wal_lsn() into lsn1;
CREATE TABLE test(a int);
for counter in 1..1000 loop
INSERT INTO test values(1);
UPDATE test set a=a+1;
DELETE FROM test where a=1;
end loop;
DROP TABLE test;
select pg_current_wal_lsn() into lsn2;
select pg_wal_lsn_diff(lsn2, lsn1) into diff;
raise notice '%', diff/1024;
end; $$;
wal generated head: 66199.09375 kB
wal generated patch: 73906.984375 kB
wal-size increase: 11%
[2]:
./pgbench -i postgres
./pgbench -c1 -j1 -t 30000 -M prepared postgres
wal generated head: 30780 kB
wal generated patch: 31284 kB
wal-size increase: ~1-2%
I have done further analysis to understand why, on the pgbench workload, the
WAL size increases by only 1-2%. With waldump I could see that the WAL size
per transaction increased from 566 bytes (on head) to 590 bytes (with patch),
which is around 4%; but when we look at the total WAL size difference after
30k transactions it is just 1-2%, and I think that is because there are other
records, like FPIs, which are not impacted.
Conclusion: As suspected, with a very targeted test case that generates
small WAL records we can see a significant 11% increase in WAL size, but
with a pgbench-style workload the increase in WAL size is much smaller.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote:
I have done some testing around this area to see the impact on WAL
size especially when WAL sizes are smaller, with a very simple test
with insert/update/delete I can see around an 11% increase in WAL size
[1] then I did some more test with pgbench with smaller scale
factor(1) there I do not see a significant increase in the WAL size
although it increases WAL size around 1-2%. [2].
I think it'd be interesting to look at per-record-type stats between two
equivalent workloads, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).
I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
pay off for 56bit relfilenodes. The class of problems foreclosed is large
enough to "waste" "improvement potential" on this.
Greetings,
Andres Freund
On Fri, Sep 30, 2022 at 5:20 PM Andres Freund <andres@anarazel.de> wrote:
I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
pay off for 56bit relfilenodes. The class of problems foreclosed is large
enough to "waste" "improvement potential" on this.
I agree overall.
A closely related but distinct question occurs to me: if we're going
to be "wasting" space on alignment padding in certain cases one way or
another, can we at least recognize those cases and take advantage at
the level of individual WAL record formats? In other words: So far
we've been discussing the importance of not going over a critical
threshold for certain WAL records. But it might also be valuable to
consider recognizing that that's inevitable, and that we might as well
make the most of it by including one or two other things.
This seems most likely to matter when managing the problem of negative
compression with per-WAL-record compression schemes for things like
arrays of page offset numbers [1]. If (say) a given compression scheme
"wastes" space for arrays of only 1-3 items, but we already know that
the relevant space will all be lost to alignment needed by code one
level down in any case, does it really count as waste? We're likely
always going to have some kind of negative compression, but you do get
to influence where and when the negative compression happens.
Not sure how relevant this will turn out to be, but seems worth
considering. More generally, thinking about how things work across
multiple layers of abstraction seems like it could be valuable in
other ways.
[1]: /messages/by-id/CAH2-WzmLCn2Hx9tQLdmdb+9CkHKLyWD2bsz=PmRebc4dAxjy6g@mail.gmail.com
--
Peter Geoghegan
On Sat, Oct 1, 2022 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote:
I have done some testing around this area to see the impact on WAL
size especially when WAL sizes are smaller, with a very simple test
with insert/update/delete I can see around an 11% increase in WAL size
[1] then I did some more test with pgbench with smaller scale
factor(1) there I do not see a significant increase in the WAL size
although it increases WAL size around 1-2%. [2].
I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).
While testing pgbench, I dumped the wal sizes using waldump. So in
pgbench case, most of the record sizes increased by 4 bytes as they
include single block references and the same is true for the other
test case I sent. Here is the wal dump of what the sizes look like
for a single pgbench transaction[1]. Maybe for seeing these changes
with the different workloads we can run some of the files from the
regression test and compare the individual wal sizes.
Head:
rmgr: Heap len (rec/tot): 54/ 54, tx: 867, lsn:
0/02DD1280, prev 0/02DD1250, desc: LOCK off 44: xid 867: flags 0x01
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/16424 blk 226
rmgr: Heap len (rec/tot): 171/ 171, tx: 867, lsn:
0/02DD12B8, prev 0/02DD1280, desc: UPDATE off 44 xmax 867 flags 0x11 ;
new off 30 xmax 0, blkref #0: rel 1663/5/16424 blk 1639, blkref #1:
rel 1663/5/16424 blk 226
rmgr: Btree len (rec/tot): 64/ 64, tx: 867, lsn:
0/02DD1368, prev 0/02DD12B8, desc: INSERT_LEAF off 290, blkref #0: rel
1663/5/16432 blk 39
rmgr: Heap len (rec/tot): 78/ 78, tx: 867, lsn:
0/02DD13A8, prev 0/02DD1368, desc: HOT_UPDATE off 15 xmax 867 flags
0x10 ; new off 19 xmax 0, blkref #0: rel 1663/5/16427 blk 0
rmgr: Heap len (rec/tot): 74/ 74, tx: 867, lsn:
0/02DD13F8, prev 0/02DD13A8, desc: HOT_UPDATE off 9 xmax 867 flags
0x10 ; new off 10 xmax 0, blkref #0: rel 1663/5/16425 blk 0
rmgr: Heap len (rec/tot): 79/ 79, tx: 867, lsn:
0/02DD1448, prev 0/02DD13F8, desc: INSERT off 9 flags 0x08, blkref #0:
rel 1663/5/16434 blk 0
rmgr: Transaction len (rec/tot): 46/ 46, tx: 867, lsn:
0/02DD1498, prev 0/02DD1448, desc: COMMIT 2022-10-01 11:24:03.464437
IST
Patch:
rmgr: Heap len (rec/tot): 58/ 58, tx: 818, lsn:
0/0218BEB0, prev 0/0218BE80, desc: LOCK off 34: xid 818: flags 0x01
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/100004 blk 522
rmgr: Heap len (rec/tot): 175/ 175, tx: 818, lsn:
0/0218BEF0, prev 0/0218BEB0, desc: UPDATE off 34 xmax 818 flags 0x11 ;
new off 8 xmax 0, blkref #0: rel 1663/5/100004 blk 1645, blkref #1:
rel 1663/5/100004 blk 522
rmgr: Btree len (rec/tot): 68/ 68, tx: 818, lsn:
0/0218BFA0, prev 0/0218BEF0, desc: INSERT_LEAF off 36, blkref #0: rel
1663/5/100010 blk 89
rmgr: Heap len (rec/tot): 82/ 82, tx: 818, lsn:
0/0218BFE8, prev 0/0218BFA0, desc: HOT_UPDATE off 66 xmax 818 flags
0x10 ; new off 90 xmax 0, blkref #0: rel 1663/5/100007 blk 0
rmgr: Heap len (rec/tot): 78/ 78, tx: 818, lsn:
0/0218C058, prev 0/0218BFE8, desc: HOT_UPDATE off 80 xmax 818 flags
0x10 ; new off 81 xmax 0, blkref #0: rel 1663/5/100005 blk 0
rmgr: Heap len (rec/tot): 83/ 83, tx: 818, lsn:
0/0218C0A8, prev 0/0218C058, desc: INSERT off 80 flags 0x08, blkref
#0: rel 1663/5/100011 blk 0
rmgr: Transaction len (rec/tot): 46/ 46, tx: 818, lsn:
0/0218C100, prev 0/0218C0A8, desc: COMMIT 2022-10-01 11:11:03.564063
IST
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).
I would expect, and Dilip's results seem to confirm, the effect to be
pretty uniform: basically, nearly every record gets bigger by 4 bytes.
That's because most records contain at least one block reference, and
if they contain multiple block references, likely all but one will be
marked BKPBLOCK_SAME_REL, so we pay the cost just once.
Because of alignment padding, the practical effect is probably that
about half of the records get bigger by 8 bytes and the other half
don't get bigger at all. But I see no reason to believe that things
are any better or worse than that. Most interesting record types are
going to contain some kind of variable-length payload, so the chances
that a 4 byte size increase pushes you across a MAXALIGN boundary seem
to be no better or worse than fifty-fifty.
I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
pay off for 56bit relfilenodes. The class of problems foreclosed is large
enough to "waste" "improvement potential" on this.
I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen. Trying to do it by optimizing specific WAL record
types seems like a real pain in the neck, because there's tons of
different WAL records that all have the same problem. Trying to do it
in a generic way makes more sense, and the fact that we have 2 padding
bytes available in XLogRecord seems like a place to start looking, but
the way forward from there is not clear to me.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).
I would expect, and Dilip's results seem to confirm, the effect to be
pretty uniform: basically, nearly every record gets bigger by 4 bytes.
That's because most records contain at least one block reference, and
if they contain multiple block references, likely all but one will be
marked BKPBLOCK_SAME_REL, so we pay the cost just once.
But it doesn't really matter that much if an already large record gets a bit
bigger. Whereas it does matter if it's a small record. Focussing on optimizing
the record types where the increase is large seems like a potential way
forward to me, even if we can't find something generic.
I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen.
I think there's plenty of potential...
Trying to do it by optimizing specific WAL record
types seems like a real pain in the neck, because there's tons of
different WAL records that all have the same problem.
I am not so sure about that. Improving a bunch of the most frequent small
records might buy you back enough on just about every workload to be OK.
I put the top record sizes for an installcheck run with full_page_writes off
at the bottom. Certainly our regression tests aren't generally
representative. But I think it still decently highlights how just improving a
few records could buy you back more than enough.
Trying to do it in a generic way makes more sense, and the fact that we have
2 padding bytes available in XLogRecord seems like a place to start looking,
but the way forward from there is not clear to me.
Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.
Greetings,
Andres Freund
by total size:
Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
Heap/INSERT 1041666 ( 50.48) 106565255 ( 50.54) 0 ( 0.00) 106565255 ( 43.92)
Btree/INSERT_LEAF 352196 ( 17.07) 24067672 ( 11.41) 0 ( 0.00) 24067672 ( 9.92)
Heap/DELETE 250852 ( 12.16) 13546008 ( 6.42) 0 ( 0.00) 13546008 ( 5.58)
Hash/INSERT 108499 ( 5.26) 7811928 ( 3.70) 0 ( 0.00) 7811928 ( 3.22)
Transaction/COMMIT 16053 ( 0.78) 6402657 ( 3.04) 0 ( 0.00) 6402657 ( 2.64)
Gist/PAGE_UPDATE 57225 ( 2.77) 5217100 ( 2.47) 0 ( 0.00) 5217100 ( 2.15)
Gin/UPDATE_META_PAGE 23943 ( 1.16) 4539970 ( 2.15) 0 ( 0.00) 4539970 ( 1.87)
Gin/INSERT 27004 ( 1.31) 3623998 ( 1.72) 0 ( 0.00) 3623998 ( 1.49)
Gist/PAGE_SPLIT 448 ( 0.02) 3391244 ( 1.61) 0 ( 0.00) 3391244 ( 1.40)
SPGist/ADD_LEAF 38968 ( 1.89) 3341696 ( 1.58) 0 ( 0.00) 3341696 ( 1.38)
...
XLOG/FPI 7228 ( 0.35) 378924 ( 0.18) 29788166 ( 93.67) 30167090 ( 12.43)
...
Gin/SPLIT 141 ( 0.01) 13011 ( 0.01) 1187588 ( 3.73) 1200599 ( 0.49)
...
-------- -------- -------- --------
Total 2063609 210848282 [86.89%] 31802766 [13.11%] 242651048 [100%]
(Included XLOG/FPI and Gin/SPLIT to explain why there's FPIs despite running with fpw=off)
sorted by number of records:
Heap/INSERT 1041666 ( 50.48) 106565255 ( 50.54) 0 ( 0.00) 106565255 ( 43.92)
Btree/INSERT_LEAF 352196 ( 17.07) 24067672 ( 11.41) 0 ( 0.00) 24067672 ( 9.92)
Heap/DELETE 250852 ( 12.16) 13546008 ( 6.42) 0 ( 0.00) 13546008 ( 5.58)
Hash/INSERT 108499 ( 5.26) 7811928 ( 3.70) 0 ( 0.00) 7811928 ( 3.22)
Gist/PAGE_UPDATE 57225 ( 2.77) 5217100 ( 2.47) 0 ( 0.00) 5217100 ( 2.15)
SPGist/ADD_LEAF 38968 ( 1.89) 3341696 ( 1.58) 0 ( 0.00) 3341696 ( 1.38)
Gin/INSERT 27004 ( 1.31) 3623998 ( 1.72) 0 ( 0.00) 3623998 ( 1.49)
Gin/UPDATE_META_PAGE 23943 ( 1.16) 4539970 ( 2.15) 0 ( 0.00) 4539970 ( 1.87)
Standby/LOCK 18451 ( 0.89) 775026 ( 0.37) 0 ( 0.00) 775026 ( 0.32)
Transaction/COMMIT 16053 ( 0.78) 6402657 ( 3.04) 0 ( 0.00) 6402657 ( 2.64)
On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:
Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.
With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.
A different idea is removing most block_ids from the record, and
optionally reducing per-block length fields to 1B. Used block ids are
effectively always sequential, and we only allow 33+4 valid values, so
we can use 2 bits to distinguish between 'block belonging to this ID
field have at most 255B of data registered' and 'blocks up to this ID
follow sequentially without own block ID'. That would save 2N-1 total
bytes for N blocks. It is scraping the barrel, but I think it is quite
possible.
Lastly, we could add XLR_BLOCK_ID_DATA_MED for values >255 containing
up to UINT16_MAX lengths. That would save 2 bytes for records that
only just pass the 255B barrier, where 2B is still a fairly
significant part of the record size.
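As a sketch, modeled on the existing short/long data headers in
xlogrecord.h (the constant's value and the struct name are hypothetical):

#define XLR_BLOCK_ID_DATA_MED   251     /* hypothetical: any id value not already in use */

typedef struct XLogRecordDataHeaderMed
{
    uint8       id;             /* XLR_BLOCK_ID_DATA_MED */
    /* followed by a uint16 data_length, stored unaligned */
} XLogRecordDataHeaderMed;      /* 3 bytes on the wire vs. 5 for the long header */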
Kind regards,
Matthias van de Meent
Hi,
On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:
Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.
With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.
What exactly is the theory for this significant increase? I don't think
xl_prev provides a meaningful protection against torn pages in the first
place?
Greetings,
Andres Freund
On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:
Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.
With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.
What exactly is the theory for this significant increase? I don't think
xl_prev provides a meaningful protection against torn pages in the first
place?
XLog pages don't have checksums, so they do not provide torn page
protection capabilities on their own.
A singular xlog record is protected against torn page writes through
the checksum that covers the whole record - if only part of the record
was written, we can detect that through the mismatching checksum.
However, if records end at the tear boundary, we must know for certain
that any record that starts after the tear is the record that was
written after the one before the tear. Page-local references/offsets
would not work, because the record decoding doesn't know which xlog
page the record should be located on; it could be either the version of
the page before it was recycled, or the one after.
Currently, we can detect this because the value of xl_prev will point
to a record far in the past (i.e. not the expected value), but with a
page-local version of xl_prev we would be less likely to detect torn
pages (and thus be unable to handle this without risk of corruption)
due to the significant chance of the truncated xl_prev value being the
same in both the old and new record.
Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.
With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.
Kind regards,
Matthias van de Meent
PS. there are ideas floating around (I heard about this one from
Heikki) where we could concatenate WAL records into one combined
record that has only one shared xl_prev+crc; which would save these 12
bytes per record. However, that needs a lot of careful consideration
to make sure that the persistence guarantee of operations doesn't get
lost somewhere in the traffic.
Hi,
On 2022-10-04 15:05:47 +0200, Matthias van de Meent wrote:
On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote:
On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:
On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:
Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.
With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.
What exactly is the theory for this significant increase? I don't think
xl_prev provides a meaningful protection against torn pages in the first
place?
XLog pages don't have checksums, so they do not provide torn page
protection capabilities on their own.
A singular xlog record is protected against torn page writes through
the checksum that covers the whole record - if only part of the record
was written, we can detect that through the mismatching checksum.
However, if records end at the tear boundary, we must know for certain
that any record that starts after the tear is the record that was
written after the one before the tear. Page-local references/offsets
would not work, because the record decoding doesn't know which xlog
page the record should be located on; it could be both the version of
the page before it was recycled, or the one after.
Currently, we can detect this because the value of xl_prev will point
to a record far in the past (i.e. not the expected value), but with a
page-local version of xl_prev we would be less likely to detect torn
pages (and thus be unable to handle this without risk of corruption)
due to the significant chance of the truncated xl_prev value being the
same in both the old and new record.
Think this is addressable, see below.
Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.
With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.
In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.
The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.
PS. there are ideas floating around (I heard about this one from
Heikki) where we could concatenate WAL records into one combined
record that has only one shared xl_prev+crc; which would save these 12
bytes per record. However, that needs a lot of careful consideration
to make sure that the persistence guarantee of operations doesn't get
lost somewhere in the traffic.
One version of that is to move the CRCs to the page header, make the pages
smaller (512 bytes / 4K, depending on the hardware), and to pad out partial
pages when flushing them out. Rewriting pages is bad for hardware and prevents
having multiple WAL IOs in flight at the same time.
Greetings,
Andres Freund
On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.
With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.
In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.
The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.
I'm really confused. xl_prev *is* a full-width LSN currently, as I
understand it. So in the scenario that Matthias poses, let's say the
segment was previously 000000010000000400000025 and now it's
000000010000000400000049. So if a given chunk of the page is leftover
from when the page was 000000010000000400000025, it will have xl_prev
values like 4/25xxxxxx. If it's been rewritten since the segment was
recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell
whether record B has been overwritten with a new record since the
segment was recycled. But if we stored only 2 bytes in each xl_prev
field, that would no longer be possible.
So I'm lost. It seems like Matthias has correctly identified a real
hazard, and not some weird corner case but actually something that
will happen regularly. All you need is for the old segment that got
recycled to have a record stating at the same place where the page
tore, and for the previous record to have been the same length as the
one on the new page. Given that there's only <~1024 places on a page
where a record can start, and given that in many workloads the lengths
of WAL records will be fairly uniform, this doesn't seem unlikely at
all.
A way to up the chances of detecting this case would be to store only
2 or 4 bytes of xl_prev on disk, but arrange to include the full
xl_prev value in the xl_crc calculation. Then your chances of a
collision are about 2^-32, or maybe more if you posit that CRC is a
weak and crappy algorithm, but even then it's strictly better than
just hoping that there isn't a tear point at a record boundary where
the same length record precedes the tear in both the old and new WAL
segments. However, on the flip side, even if you assume that CRC is a
fantastic algorithm with beautiful and state-of-the-art bit mixing,
the chances of it failing to notice the problem are still >0, whereas
the current algorithm that compares the full xl_prev value is a sure
thing. Because xl_prev values are never repeated, it's certain that
when a segment is recycled, any values that were legal for the old one
aren't legal in the new one.
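As a sketch of that idea, using the existing CRC32C macros; the function and
the way the truncated on-disk value would be handled are made up for
illustration:

#include "access/xlogdefs.h"    /* XLogRecPtr */
#include "port/pg_crc32c.h"     /* INIT_CRC32C, COMP_CRC32C, FIN_CRC32C */

/*
 * Mix the full 8-byte previous-record LSN into the CRC even though only a
 * truncated form of it would be stored on disk, so a stale record left over
 * in a recycled segment fails the CRC check with high probability.
 */
static pg_crc32c
compute_record_crc(const char *rec_data, uint32 rec_len, XLogRecPtr full_prev_lsn)
{
    pg_crc32c   crc;

    INIT_CRC32C(crc);
    COMP_CRC32C(crc, &full_prev_lsn, sizeof(full_prev_lsn));   /* not stored on disk */
    COMP_CRC32C(crc, rec_data, rec_len);                       /* record contents */
    FIN_CRC32C(crc);
    return crc;
}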
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-10-04 13:36:33 -0400, Robert Haas wrote:
On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:
Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.
With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.
In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.
The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.
I'm really confused. xl_prev *is* a full-width LSN currently, as I
understand it. So in the scenario that Matthias poses, let's say the
segment was previously 000000010000000400000025 and now it's
000000010000000400000049. So if a given chunk of the page is leftover
from when the page was 000000010000000400000025, it will have xl_prev
values like 4/25xxxxxx. If it's been rewritten since the segment was
recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell
whether record B has been overwritten with a new record since the
segment was recycled. But if we stored only 2 bytes in each xl_prev
field, that would no longer be possible.
Oh, I think I misunderstood the scenario. I was thinking of cases where we
write out a bunch of pages, crash, only some of the pages made it to disk, we
then write new ones of the same length, and now find a record after the "real"
end of the WAL to be valid. Not sure how I mentally swallowed the "recycled".
For the recycling scenario to be a problem we'll also need to crash, with
parts of the page ending up with the new contents and parts of the page ending
up with the old "pre recycling" content, correct? Because without a crash
we'll have zeroed out the remainder of the page (well, leaving walreceiver out
of the picture, grr).
However, this can easily happen without any record boundaries on the partially
recycled page, so we rely on the CRCs to protect against this.
Here I originally wrote a more in-depth explanation of the scenario I was
thinking about, where we already rely on CRCs to protect us. But, ooph, I think
they don't reliably, with today's design. But maybe I'm missing more things
today. Consider the following sequence:
1) we write WAL like this:
[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]
2) crash, the sectors with A and C made it to disk, the one with B didn't
3) We replay A, discover B is invalid (possibly zeroes or old contents),
insert a new record B' with the same length. Now it looks like this:
[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]
4) crash, the sector with B' makes it to disk
5) we replay A, B', C, because C has an xl_prev that's compatible with B'
location and a valid CRC.
Oops.
I think this can happen both within a single page and across page boundaries.
I hope I am missing something here?
A way to up the chances of detecting this case would be to store only
2 or 4 bytes of xl_prev on disk, but arrange to include the full
xl_prev value in the xl_crc calculation.
Right, that's what I was suggesting as well.
Then your chances of a collision are about 2^-32, or maybe more if you posit
that CRC is a weak and crappy algorithm, but even then it's strictly better
than just hoping that there isn't a tear point at a record boundary where
the same length record precedes the tear in both the old and new WAL
segments. However, on the flip side, even if you assume that CRC is a
fantastic algorithm with beautiful and state-of-the-art bit mixing, the
chances of it failing to notice the problem are still >0, whereas the
current algorithm that compares the full xl_prev value is a sure
thing. Because xl_prev values are never repeated, it's certain that when a
segment is recycled, any values that were legal for the old one aren't legal
in the new one.
Given that we already rely on the CRCs to detect corruption within a single
record spanning tear boundaries, this doesn't cause me a lot of heartburn. But
I suspect we might need to do something about the scenario I outlined above,
which likely would also increase the protection against this issue.
I think there might be reasonable ways to increase the guarantees based on the
2 byte xl_prev approach "alone". We don't have to store the offset from the
page header as a plain offset. What about storing something like:
page_offset ^ (page_lsn >> wal_segsz_shift)
I think something like that'd result in prev_not_really_lsn typically not
simply matching after recycling. Of course it only provides so much
protection, given 16bits...
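Spelled out as code, roughly (the exact shift and naming are illustrative):

#include "access/xlogdefs.h"    /* XLogRecPtr */

/*
 * Encode/decode the 2-byte "prev" field as a page offset XORed with bits of
 * the WAL page address, so a value written before the segment was recycled
 * will usually not decode to a plausible offset afterwards.
 */
static inline uint16
encode_prev(uint32 page_offset, XLogRecPtr page_lsn, int wal_segsz_shift)
{
    return (uint16) (page_offset ^ (page_lsn >> wal_segsz_shift));
}

static inline uint32
decode_prev(uint16 stored, XLogRecPtr page_lsn, int wal_segsz_shift)
{
    return (uint32) (uint16) (stored ^ (page_lsn >> wal_segsz_shift));
}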
Greetings,
Andres Freund
On Tue, Oct 4, 2022 at 2:30 PM Andres Freund <andres@anarazel.de> wrote:
Consider the following sequence:
1) we write WAL like this:
[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]
2) crash, the sectors with A and C made it to disk, the one with B didn't
3) We replay A, discover B is invalid (possibly zeroes or old contents),
insert a new record B' with the same length. Now it looks like this:
[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]
4) crash, the sector with B' makes it to disk
5) we replay A, B', C, because C has an xl_prev that's compatible with B'
location and a valid CRC.
Oops.
I think this can happen both within a single page and across page boundaries.
I hope I am missing something here?
If you are, I don't know what it is off-hand. That seems like a
plausible scenario to me. It does require the OS to write things out
of order, and I don't know how likely that is in practice, but the
answer probably isn't zero.
I think there might be reasonable ways to increase the guarantees based on the
2 byte xl_prev approach "alone". We don't have to store the offset from the
page header as a plain offset. What about storing something like:
page_offset ^ (page_lsn >> wal_segsz_shift)
I think something like that'd result in prev_not_really_lsn typically not
simply matching after recycling. Of course it only provides so much
protection, given 16bits...
Maybe. That does seem somewhat better, but I feel like it's hard to
reason about whether it's safe in absolute terms or just resistant to
the precise scenario Matthias postulated while remaining vulnerable to
slightly modified versions.
How about this: remove xl_prev. widen xl_crc to 64 bits. include the
CRC of the previous WAL record in the xl_crc calculation. That doesn't
cut quite as many bytes out of the record size as your proposal, but
it feels like it should strongly resist pretty much every attack of
this general type, with only the minor disadvantage that the more
expensive CRC calculation will destroy all hope of getting anything
committed.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2022-10-03 10:01:25 -0700, Andres Freund wrote:
On 2022-10-03 08:12:39 -0400, Robert Haas wrote:
On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen.
I think there's plenty potential...
I lightly dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.
A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducibility) shows a decent saving:
master: 241106544 - 230 MB
varint: 227858640 - 217 MB
The average record size goes from 102.7 to 95.7 bytes excluding the remaining
FPIs, 118.1 to 111.0 including FPIs.
There's plenty other spots that could be converted (e.g. the record length
which rarely needs four bytes), this is just meant as a demonstration.
I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
profile does show pg_varint_decode_uint64(), but partially that seems to be
offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains.
That would be tolerable, I think, because waldump --stats pretty much doesn't
do anything with the WAL.
But I suspect there's plenty of optimization potential in the varint
code. Right now it e.g. stores data as big endian, and the bswap instructions
do show up. And a lot of my bit-maskery could be optimized too.
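For anyone who wants to experiment without digging up that patch, a minimal
LEB128-style unsigned varint looks like the following; note this is not the
encoding used above (that one is big-endian and prefix-based), it just
illustrates the size behavior for small values:

#include <stdint.h>
#include <stddef.h>

static size_t
varint_encode_uint64(uint64_t value, uint8_t *buf)
{
    size_t  n = 0;

    do
    {
        uint8_t byte = value & 0x7F;

        value >>= 7;
        if (value != 0)
            byte |= 0x80;       /* continuation bit */
        buf[n++] = byte;
    } while (value != 0);

    return n;                   /* 1..10 bytes; small values need only 1-2 */
}

static size_t
varint_decode_uint64(const uint8_t *buf, uint64_t *value)
{
    uint64_t    result = 0;
    size_t      n = 0;
    int         shift = 0;

    for (;;)
    {
        uint8_t byte = buf[n++];

        result |= (uint64_t) (byte & 0x7F) << shift;
        if ((byte & 0x80) == 0)
            break;
        shift += 7;
    }
    *value = result;
    return n;
}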
Greetings,
Andres Freund
On Wed, Oct 5, 2022 at 5:19 AM Andres Freund <andres@anarazel.de> wrote:
I light dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.
A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducability) shows a decent saving:master: 241106544 - 230 MB
varint: 227858640 - 217 MB
The average record size goes from 102.7 to 95.7 bytes excluding the remaining
FPIs, 118.1 to 111.0 including FPIs.
I have also executed my original test after applying these patches on
top of the 56 bit relfilenode patch. Earlier we saw the WAL size
increase by 11% (66199.09375 kB to 73906.984375 kB); with these patches
applied the WAL generated is 58179.2265625 kB. That means that in this
particular example these patches reduce the WAL size by 12% even
with the 56-bit relfilenode patch.
[1]: /messages/by-id/CAFiTN-uut+04AdwvBY_oK_jLvMkwXUpDJj5mXg--nek+ucApPQ@mail.gmail.com
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com