problems with making relfilenodes 56-bits

[0]: /messages/by-id/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc=SdEHHsW6HS+k9zw@mail.gmail.com
[1]: /messages/by-id/20220715173731.6t3km5cww3f5ztfq@awork3.anarazel.de

boekewurm+postgres@gmail.com

over 3 years ago

In reply to: Robert Haas (#1)

Re: problems with making relfilenodes 56-bits

On Thu, 29 Sep 2022, 00:06 Robert Haas, <robertmhaas@gmail.com> wrote:

2. WAL Size. Block references in the WAL are by RelFileLocator, so if
you make RelFileLocators bigger, WAL gets bigger. We'd have to test
the exact impact of this, but it seems a bit scary: if you have a WAL
stream with few FPIs doing DML on a narrow table, probably most
records will contain 1 block reference (and occasionally more, but I
guess most will use BKPBLOCK_SAME_REL) and adding 4 bytes to that
block reference feels like it might add up to something significant. I
don't really see any way around this, either: if you make relfilenode
values wider, they take up more space. Perhaps there's a way to claw
that back elsewhere, or we could do something really crazy like switch
to variable-width representations of integer quantities in WAL
records, but there doesn't seem to be any simple way forward other
than, you know, deciding that we're willing to pay the cost of the
additional WAL volume.

Re: WAL volume and record size optimization

I've been working off and on with WAL for some time now due to [0]/messages/by-id/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc=SdEHHsW6HS+k9zw@mail.gmail.com and
the interest of Neon in the area, and I think we can reduce the size
of the base record by a significant margin:

Currently, our minimal WAL record is exactly 24 bytes: length (4B),
TransactionId (4B), previous record pointer (8B), flags (1B), redo
manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
fields, TransactionID could reasonably be omitted for certain WAL
records (as example: index insertions don't really need the XID).
Additionally, the length field could be made to be variable length,
and any padding is just plain bad (adding 4 bytes to all
insert/update/delete/lock records was frowned upon).

I'm working on a prototype patch for a more bare-bones WAL record
header of which the only required fields would be prevptr (8B), CRC
(4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
yet know the performance of this, but the considering that there will
be a lot more conditionals in header decoding it might be slower for
any one backend, but faster overall (less overall IOps)

The flags field would be indications for additional information: [flag
name (bits): explanation (additional xlog header data in bytes)]
- len_size(0..1): xlog record size is at most xlrec_header_only (0B),
uint8_max(1B), uint16_max(2B), uint32_max(4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or
probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B)
(rationale for logging CID in [0]/messages/by-id/CAEze2WhmU8WciEgaVPZm71vxFBOpp8ncDc=SdEHHsW6HS+k9zw@mail.gmail.com, now in record header because XID is
included there as well, and both are required for consistent
snapshots.
- has_rminfo (4): has non-zero redo-manager flags field (1B)
(rationale for separate field [1]/messages/by-id/20220715173731.6t3km5cww3f5ztfq@awork3.anarazel.de, non-zero allows 1B space
optimization for one of each RMGR's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record
compression, or other purposes.

A normal record header (XLOG record with at least some registered
data) would be only 15 to 17 bytes (0-1B rminfo + 1-2B in xl_len), and
one with XID only up to 21 bytes. So, when compared to the current
XLogRecord format, we would in general recover 2 or 3 bytes from the
xl_tot_len field, 1 or 2 bytes from the alignment hole, and
potentially the 4 bytes of the xid when that data is considered
useless during recovery, or physical or logical replication.

Kind regards,

Matthias van de Meent

robertmhaas@gmail.com

over 3 years ago

In reply to: Matthias van de Meent (#4)

Re: problems with making relfilenodes 56-bits

On Thu, Sep 29, 2022 at 12:24 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

Currently, our minimal WAL record is exactly 24 bytes: length (4B),
TransactionId (4B), previous record pointer (8B), flags (1B), redo
manager (1B), 2 bytes of padding and lastly the 4-byte CRC. Of these
fields, TransactionID could reasonably be omitted for certain WAL
records (as example: index insertions don't really need the XID).
Additionally, the length field could be made to be variable length,
and any padding is just plain bad (adding 4 bytes to all
insert/update/delete/lock records was frowned upon).

Right. I was shocked when I realized that we had two bytes of padding
in there, considering that numerous rmgrs are stealing bits from the
1-byte field that identifies the record type. My question was: why
aren't we exposing those 2 bytes for rmgr-type-specific use? Or for
something like xl_xact_commit, we could get rid of xl_xact_info if we
had those 2 bytes to work with.

Right now, I see that a bare commit record is 34 bytes which rounds
out to 40. With the trick above, we could shave off 4 bytes bringing
the size to 30 which would round to 32. That's a pretty significant
savings, although it'd be a lot better if we could get some kind of
savings for DML records which could be much higher frequency.

I'm working on a prototype patch for a more bare-bones WAL record
header of which the only required fields would be prevptr (8B), CRC
(4B), rmgr (1B) and flags (1B) for a minimal size of 14 bytes. I don't
yet know the performance of this, but the considering that there will
be a lot more conditionals in header decoding it might be slower for
any one backend, but faster overall (less overall IOps)

The flags field would be indications for additional information: [flag
name (bits): explanation (additional xlog header data in bytes)]
- len_size(0..1): xlog record size is at most xlrec_header_only (0B),
uint8_max(1B), uint16_max(2B), uint32_max(4B)
- has_xid (2): contains transaction ID of logging transaction (4B, or
probably 8B when we introduce 64-bit xids)
- has_cid (3): contains the command ID of the logging statement (4B)
(rationale for logging CID in [0], now in record header because XID is
included there as well, and both are required for consistent
snapshots.
- has_rminfo (4): has non-zero redo-manager flags field (1B)
(rationale for separate field [1], non-zero allows 1B space
optimization for one of each RMGR's operations)
- special_rel (5): pre-existing definition
- check_consistency (6): pre-existing definition
- unset (7): no meaning defined yet. Could be used for full record
compression, or other purposes.

Interesting. One fly in the ointment here is that WAL records start on
8-byte boundaries (probably MAXALIGN boundaries, but I didn't check
the details). And after the 24-byte header, there's a 2-byte header
(or 5-byte header) introducing the payload data (see
XLR_BLOCK_ID_DATA_SHORT/LONG). So if the size of the actual payload
data is a multiple of 8, and is short enough that we use the short
data header, we waste 6 bytes. If the data length is a multiple of 4,
we waste 2 bytes. And those are probably really common cases. So the
big improvements probably come from saving 2 bytes or 6 bytes or 10
bytes, and saving say 3 or 5 is probably not much better than 2. Or at
least that's what I'm guessing.

--
Robert Haas
EDB: http://www.enterprisedb.com

dilipbalaut@gmail.com

over 3 years ago

In reply to: Robert Haas (#1)

Re: problems with making relfilenodes 56-bits

On Thu, Sep 29, 2022 at 2:36 AM Robert Haas <robertmhaas@gmail.com> wrote:

2. WAL Size. Block references in the WAL are by RelFileLocator, so if
you make RelFileLocators bigger, WAL gets bigger. We'd have to test
the exact impact of this, but it seems a bit scary

I have done some testing around this area to see the impact on WAL
size especially when WAL sizes are smaller, with a very simple test
with insert/update/delete I can see around an 11% increase in WAL size
[1]: checkpoint; do $$ declare lsn1 pg_lsn; lsn2 pg_lsn; diff float; begin select pg_current_wal_lsn() into lsn1; CREATE TABLE test(a int); for counter in 1..1000 loop INSERT INTO test values(1); UPDATE test set a=a+1; DELETE FROM test where a=1; end loop; DROP TABLE test; select pg_current_wal_lsn() into lsn2; select pg_wal_lsn_diff(lsn2, lsn1) into diff; raise notice '%', diff/1024; end; $$;
factor(1) there I do not see a significant increase in the WAL size
although it increases WAL size around 1-2%. [2]./pgbench -i postgres ./pgbench -c1 -j1 -t 30000 -M prepared postgres wal generated head: 30780 kB wal generated patch: 31284 kB wal-size increase: ~1-2%.

[1]: checkpoint; do $$ declare lsn1 pg_lsn; lsn2 pg_lsn; diff float; begin select pg_current_wal_lsn() into lsn1; CREATE TABLE test(a int); for counter in 1..1000 loop INSERT INTO test values(1); UPDATE test set a=a+1; DELETE FROM test where a=1; end loop; DROP TABLE test; select pg_current_wal_lsn() into lsn2; select pg_wal_lsn_diff(lsn2, lsn1) into diff; raise notice '%', diff/1024; end; $$;
checkpoint;
do $$
declare
lsn1 pg_lsn;
lsn2 pg_lsn;
diff float;
begin
select pg_current_wal_lsn() into lsn1;
CREATE TABLE test(a int);
for counter in 1..1000 loop
INSERT INTO test values(1);
UPDATE test set a=a+1;
DELETE FROM test where a=1;
end loop;
DROP TABLE test;
select pg_current_wal_lsn() into lsn2;
select pg_wal_lsn_diff(lsn2, lsn1) into diff;
raise notice '%', diff/1024;
end; $$;

wal generated head: 66199.09375 kB
wal generated patch: 73906.984375 kB
wal-size increase: 11%

[2]: ./pgbench -i postgres ./pgbench -c1 -j1 -t 30000 -M prepared postgres wal generated head: 30780 kB wal generated patch: 31284 kB wal-size increase: ~1-2%
./pgbench -i postgres
./pgbench -c1 -j1 -t 30000 -M prepared postgres
wal generated head: 30780 kB
wal generated patch: 31284 kB
wal-size increase: ~1-2%

I have done further analysis to know why on pgbench workload the wal
size is increasing by 1-2%. So with waldump I could see that wal size
per transaction size increased from 566 (on head) to 590 (with patch),
so that is around 4% but when we see total wal size difference after
30k transaction then it is just 1-2%
and I think that is because there would be other records which are not
impacted like FPI

Conclusion: So as suspected with very small WAL sizes with a very
targeted test case we can see a significant 11% increase in WAL size
but with pgbench kind of workload the increase in WAL size is very
less.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

andres@anarazel.de

over 3 years ago

In reply to: Dilip Kumar (#6)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote:

I have done some testing around this area to see the impact on WAL
size especially when WAL sizes are smaller, with a very simple test
with insert/update/delete I can see around an 11% increase in WAL size
[1] then I did some more test with pgbench with smaller scale
factor(1) there I do not see a significant increase in the WAL size
although it increases WAL size around 1-2%. [2].

I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).

I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
pay off for 56bit relfilenodes. The class of problems foreclosed is large
enough to "waste" "improvement potential" on this.

Greetings,

Andres Freund

Peter Geoghegan

pg@bowt.ie

over 3 years ago

In reply to: Andres Freund (#7)

Re: problems with making relfilenodes 56-bits

On Fri, Sep 30, 2022 at 5:20 PM Andres Freund <andres@anarazel.de> wrote:

I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
pay off for 56bit relfilenodes. The class of problems foreclosed is large
enough to "waste" "improvement potential" on this.

I agree overall.

A closely related but distinct question occurs to me: if we're going
to be "wasting" space on alignment padding in certain cases one way or
another, can we at least recognize those cases and take advantage at
the level of individual WAL record formats? In other words: So far
we've been discussing the importance of not going over a critical
threshold for certain WAL records. But it might also be valuable to
consider recognizing that that's inevitable, and that we might as well
make the most of it by including one or two other things.

This seems most likely to matter when managing the problem of negative
compression with per-WAL-record compression schemes for things like
arrays of page offset numbers [1]/messages/by-id/CAH2-WzmLCn2Hx9tQLdmdb+9CkHKLyWD2bsz=PmRebc4dAxjy6g@mail.gmail.com -- Peter Geoghegan. If (say) a given compression scheme
"wastes" space for arrays of only 1-3 items, but we already know that
the relevant space will all be lost to alignment needed by code one
level down in any case, does it really count as waste? We're likely
always going to have some kind of negative compression, but you do get
to influence where and when the negative compression happens.

Not sure how relevant this will turn out to be, but seems worth
considering. More generally, thinking about how things work across
multiple layers of abstraction seems like it could be valuable in
other ways.

[1]: /messages/by-id/CAH2-WzmLCn2Hx9tQLdmdb+9CkHKLyWD2bsz=PmRebc4dAxjy6g@mail.gmail.com -- Peter Geoghegan
--
Peter Geoghegan

dilipbalaut@gmail.com

over 3 years ago

In reply to: Andres Freund (#7)

Re: problems with making relfilenodes 56-bits

On Sat, Oct 1, 2022 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-09-30 15:36:11 +0530, Dilip Kumar wrote:

I have done some testing around this area to see the impact on WAL
size especially when WAL sizes are smaller, with a very simple test
with insert/update/delete I can see around an 11% increase in WAL size
[1] then I did some more test with pgbench with smaller scale
factor(1) there I do not see a significant increase in the WAL size
although it increases WAL size around 1-2%. [2].

I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).

While testing pgbench, I dumped the wal sizes using waldump. So in
pgbench case, most of the record sizes increased by 4 bytes as they
include single block references and the same is true for the other
test case I sent. Here is the wal dump of what the sizes look like
for a single pgbench transaction[1]. Maybe for seeing these changes
with the different workloads we can run some of the files from the
regression test and compare the individual wal sizes.

Head:
rmgr: Heap len (rec/tot): 54/ 54, tx: 867, lsn:
0/02DD1280, prev 0/02DD1250, desc: LOCK off 44: xid 867: flags 0x01
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/16424 blk 226
rmgr: Heap len (rec/tot): 171/ 171, tx: 867, lsn:
0/02DD12B8, prev 0/02DD1280, desc: UPDATE off 44 xmax 867 flags 0x11 ;
new off 30 xmax 0, blkref #0: rel 1663/5/16424 blk 1639, blkref #1:
rel 1663/5/16424 blk 226
rmgr: Btree len (rec/tot): 64/ 64, tx: 867, lsn:
0/02DD1368, prev 0/02DD12B8, desc: INSERT_LEAF off 290, blkref #0: rel
1663/5/16432 blk 39
rmgr: Heap len (rec/tot): 78/ 78, tx: 867, lsn:
0/02DD13A8, prev 0/02DD1368, desc: HOT_UPDATE off 15 xmax 867 flags
0x10 ; new off 19 xmax 0, blkref #0: rel 1663/5/16427 blk 0
rmgr: Heap len (rec/tot): 74/ 74, tx: 867, lsn:
0/02DD13F8, prev 0/02DD13A8, desc: HOT_UPDATE off 9 xmax 867 flags
0x10 ; new off 10 xmax 0, blkref #0: rel 1663/5/16425 blk 0
rmgr: Heap len (rec/tot): 79/ 79, tx: 867, lsn:
0/02DD1448, prev 0/02DD13F8, desc: INSERT off 9 flags 0x08, blkref #0:
rel 1663/5/16434 blk 0
rmgr: Transaction len (rec/tot): 46/ 46, tx: 867, lsn:
0/02DD1498, prev 0/02DD1448, desc: COMMIT 2022-10-01 11:24:03.464437
IST

Patch:
rmgr: Heap len (rec/tot): 58/ 58, tx: 818, lsn:
0/0218BEB0, prev 0/0218BE80, desc: LOCK off 34: xid 818: flags 0x01
LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/5/100004 blk 522
rmgr: Heap len (rec/tot): 175/ 175, tx: 818, lsn:
0/0218BEF0, prev 0/0218BEB0, desc: UPDATE off 34 xmax 818 flags 0x11 ;
new off 8 xmax 0, blkref #0: rel 1663/5/100004 blk 1645, blkref #1:
rel 1663/5/100004 blk 522
rmgr: Btree len (rec/tot): 68/ 68, tx: 818, lsn:
0/0218BFA0, prev 0/0218BEF0, desc: INSERT_LEAF off 36, blkref #0: rel
1663/5/100010 blk 89
rmgr: Heap len (rec/tot): 82/ 82, tx: 818, lsn:
0/0218BFE8, prev 0/0218BFA0, desc: HOT_UPDATE off 66 xmax 818 flags
0x10 ; new off 90 xmax 0, blkref #0: rel 1663/5/100007 blk 0
rmgr: Heap len (rec/tot): 78/ 78, tx: 818, lsn:
0/0218C058, prev 0/0218BFE8, desc: HOT_UPDATE off 80 xmax 818 flags
0x10 ; new off 81 xmax 0, blkref #0: rel 1663/5/100005 blk 0
rmgr: Heap len (rec/tot): 83/ 83, tx: 818, lsn:
0/0218C0A8, prev 0/0218C058, desc: INSERT off 80 flags 0x08, blkref
#0: rel 1663/5/100011 blk 0
rmgr: Transaction len (rec/tot): 46/ 46, tx: 818, lsn:
0/0218C100, prev 0/0218C0A8, desc: COMMIT 2022-10-01 11:11:03.564063
IST

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#10

robertmhaas@gmail.com

over 3 years ago

In reply to: Andres Freund (#7)

Re: problems with making relfilenodes 56-bits

On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:

I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).

I would expect, and Dilip's results seem to confirm, the effect to be
pretty uniform: basically, nearly every record gets bigger by 4 bytes.
That's because most records contain at least one block reference, and
if they contain multiple block references, likely all but one will be
marked BKPBLOCK_SAME_REL, so we pay the cost just once.

Because of alignment padding, the practical effect is probably that
about half of the records get bigger by 8 bytes and the other half
don't get bigger at all. But I see no reason to believe that things
are any better or worse than that. Most interesting record types are
going to contain some kind of variable-length payload, so the chances
that a 4 byte size increase pushes you across a MAXALIGN boundary seem
to be no better or worse than fifty-fifty.

I think it'd be an OK tradeoff to optimize WAL usage for a few of the worst to
pay off for 56bit relfilenodes. The class of problems foreclosed is large
enough to "waste" "improvement potential" on this.

I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen. Trying to do it by optimizing specific WAL record
types seems like a real pain in the neck, because there's tons of
different WAL records that all have the same problem. Trying to do it
in a generic way makes more sense, and the fact that we have 2 padding
bytes available in XLogRecord seems like a place to start looking, but
the way forward from there is not clear to me.

--
Robert Haas
EDB: http://www.enterprisedb.com

#11

andres@anarazel.de

over 3 years ago

In reply to: Robert Haas (#10)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:

On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:

I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).

I would expect, and Dilip's results seem to confirm, the effect to be
pretty uniform: basically, nearly every record gets bigger by 4 bytes.
That's because most records contain at least one block reference, and
if they contain multiple block references, likely all but one will be
marked BKPBLOCK_SAME_REL, so we pay the cost just once.

But it doesn't really matter that much if an already large record gets a bit
bigger. Whereas it does matter if it's a small record. Focussing on optimizing
the record types where the increase is large seems like a potential way
forward to me, even if we can't find something generic.

I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen.

I think there's plenty potential...

Trying to do it by optimizing specific WAL record
types seems like a real pain in the neck, because there's tons of
different WAL records that all have the same problem.

I am not so sure about that. Improving a bunch of the most frequent small
records might buy you back enough on just about every workload to be OK.

I put the top record sizes for an installcheck run with full_page_writes off
at the bottom. Certainly our regression tests aren't generally
representative. But I think it still decently highlights how just improving a
few records could buy you back more than enough.

Trying to do it in a generic way makes more sense, and the fact that we have
2 padding bytes available in XLogRecord seems like a place to start looking,
but the way forward from there is not clear to me.

Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.

Greetings,

Andres Freund

by total size:

Type N (%) Record size (%) FPI size (%) Combined size (%)
---- - --- ----------- --- -------- --- ------------- ---
Heap/INSERT 1041666 ( 50.48) 106565255 ( 50.54) 0 ( 0.00) 106565255 ( 43.92)
Btree/INSERT_LEAF 352196 ( 17.07) 24067672 ( 11.41) 0 ( 0.00) 24067672 ( 9.92)
Heap/DELETE 250852 ( 12.16) 13546008 ( 6.42) 0 ( 0.00) 13546008 ( 5.58)
Hash/INSERT 108499 ( 5.26) 7811928 ( 3.70) 0 ( 0.00) 7811928 ( 3.22)
Transaction/COMMIT 16053 ( 0.78) 6402657 ( 3.04) 0 ( 0.00) 6402657 ( 2.64)
Gist/PAGE_UPDATE 57225 ( 2.77) 5217100 ( 2.47) 0 ( 0.00) 5217100 ( 2.15)
Gin/UPDATE_META_PAGE 23943 ( 1.16) 4539970 ( 2.15) 0 ( 0.00) 4539970 ( 1.87)
Gin/INSERT 27004 ( 1.31) 3623998 ( 1.72) 0 ( 0.00) 3623998 ( 1.49)
Gist/PAGE_SPLIT 448 ( 0.02) 3391244 ( 1.61) 0 ( 0.00) 3391244 ( 1.40)
SPGist/ADD_LEAF 38968 ( 1.89) 3341696 ( 1.58) 0 ( 0.00) 3341696 ( 1.38)
...
XLOG/FPI 7228 ( 0.35) 378924 ( 0.18) 29788166 ( 93.67) 30167090 ( 12.43)
...
Gin/SPLIT 141 ( 0.01) 13011 ( 0.01) 1187588 ( 3.73) 1200599 ( 0.49)
...
-------- -------- -------- --------
Total 2063609 210848282 [86.89%] 31802766 [13.11%] 242651048 [100%]

(Included XLOG/FPI and Gin/SPLIT to explain why there's FPIs despite running with fpw=off)

sorted by number of records:
Heap/INSERT 1041666 ( 50.48) 106565255 ( 50.54) 0 ( 0.00) 106565255 ( 43.92)
Btree/INSERT_LEAF 352196 ( 17.07) 24067672 ( 11.41) 0 ( 0.00) 24067672 ( 9.92)
Heap/DELETE 250852 ( 12.16) 13546008 ( 6.42) 0 ( 0.00) 13546008 ( 5.58)
Hash/INSERT 108499 ( 5.26) 7811928 ( 3.70) 0 ( 0.00) 7811928 ( 3.22)
Gist/PAGE_UPDATE 57225 ( 2.77) 5217100 ( 2.47) 0 ( 0.00) 5217100 ( 2.15)
SPGist/ADD_LEAF 38968 ( 1.89) 3341696 ( 1.58) 0 ( 0.00) 3341696 ( 1.38)
Gin/INSERT 27004 ( 1.31) 3623998 ( 1.72) 0 ( 0.00) 3623998 ( 1.49)
Gin/UPDATE_META_PAGE 23943 ( 1.16) 4539970 ( 2.15) 0 ( 0.00) 4539970 ( 1.87)
Standby/LOCK 18451 ( 0.89) 775026 ( 0.37) 0 ( 0.00) 775026 ( 0.32)
Transaction/COMMIT 16053 ( 0.78) 6402657 ( 3.04) 0 ( 0.00) 6402657 ( 2.64)

#12

boekewurm+postgres@gmail.com

over 3 years ago

In reply to: Andres Freund (#11)

Re: problems with making relfilenodes 56-bits

On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:

Hi,

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:

On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:

I think it'd be interesting to look at per-record-type stats between two
equivalent workload, to see where practical workloads suffer the most
(possibly with fpw=off, to make things more repeatable).

I would expect, and Dilip's results seem to confirm, the effect to be
pretty uniform: basically, nearly every record gets bigger by 4 bytes.
That's because most records contain at least one block reference, and
if they contain multiple block references, likely all but one will be
marked BKPBLOCK_SAME_REL, so we pay the cost just once.

But it doesn't really matter that much if an already large record gets a bit
bigger. Whereas it does matter if it's a small record. Focussing on optimizing
the record types where the increase is large seems like a potential way
forward to me, even if we can't find something generic.

I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen.

I think there's plenty potential...

Trying to do it by optimizing specific WAL record
types seems like a real pain in the neck, because there's tons of
different WAL records that all have the same problem.

I am not so sure about that. Improving a bunch of the most frequent small
records might buy you back enough on just about every workload to be OK.

I put the top record sizes for an installcheck run with full_page_writes off
at the bottom. Certainly our regression tests aren't generally
representative. But I think it still decently highlights how just improving a
few records could buy you back more than enough.

Trying to do it in a generic way makes more sense, and the fact that we have
2 padding bytes available in XLogRecord seems like a place to start looking,
but the way forward from there is not clear to me.

Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.

With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.

A different idea is removing most block_ids from the record, and
optionally reducing per-block length fields to 1B. Used block ids are
effectively always sequential, and we only allow 33+4 valid values, so
we can use 2 bits to distinguish between 'block belonging to this ID
field have at most 255B of data registered' and 'blocks up to this ID
follow sequentially without own block ID'. That would save 2N-1 total
bytes for N blocks. It is scraping the barrel, but I think it is quite
possible.

Lastly, we could add XLR_BLOCK_ID_DATA_MED for values >255 containing
up to UINT16_MAX lengths. That would save 2 bytes for records that
only just pass the 255B barrier, where 2B is still a fairly
significant part of the record size.

Kind regards,

Matthias van de Meent

#13

andres@anarazel.de

over 3 years ago

In reply to: Matthias van de Meent (#12)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:

On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:

Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.

With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.

What exactly is the theory for this significant increase? I don't think
xl_prev provides a meaningful protection against torn pages in the first
place?

Greetings,

Andres Freund

#14

boekewurm+postgres@gmail.com

over 3 years ago

In reply to: Andres Freund (#13)

Re: problems with making relfilenodes 56-bits

On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:

On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:

Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.

With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.

What exactly is the theory for this significant increase? I don't think
xl_prev provides a meaningful protection against torn pages in the first
place?

XLog pages don't have checksums, so they do not provide torn page
protection capabilities on their own.
A singular xlog record is protected against torn page writes through
the checksum that covers the whole record - if only part of the record
was written, we can detect that through the mismatching checksum.
However, if records end at the tear boundary, we must know for certain
that any record that starts after the tear is the record that was
written after the one before the tear. Page-local references/offsets
would not work, because the record decoding doesn't know which xlog
page the record should be located on; it could be both the version of
the page before it was recycled, or the one after.
Currently, we can detect this because the value of xl_prev will point
to a record far in the past (i.e. not the expected value), but with a
page-local version of xl_prev we would be less likely to detect torn
pages (and thus be unable to handle this without risk of corruption)
due to the significant chance of the truncated xl_prev value being the
same in both the old and new record.

Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.

With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.

Kind regards,

Matthias van de Meent

PS. there are ideas floating around (I heard about this one from
Heikki) where we could concatenate WAL records into one combined
record that has only one shared xl_prev+crc; which would save these 12
bytes per record. However, that needs a lot of careful consideration
to make sure that the persistence guarantee of operations doesn't get
lost somewhere in the traffic.

#15

andres@anarazel.de

over 3 years ago

In reply to: Matthias van de Meent (#14)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-10-04 15:05:47 +0200, Matthias van de Meent wrote:

On Mon, 3 Oct 2022 at 23:26, Andres Freund <andres@anarazel.de> wrote:

On 2022-10-03 19:40:30 +0200, Matthias van de Meent wrote:

On Mon, 3 Oct 2022, 19:01 Andres Freund, <andres@anarazel.de> wrote:

Random idea: xl_prev is large. Store a full xl_prev in the page header, but
only store a 2 byte offset from the page header xl_prev within each record.

With that small xl_prev we may not detect partial page writes in
recycled segments; or other issues in the underlying file system. With
small record sizes, the chance of returning incorrect data would be
significant for small records (it would be approximately the chance of
getting a record boundary on the underlying page boundary * chance of
getting the same MAXALIGN-adjusted size record before the persistence
boundary). That issue is part of the reason why my proposed change
upthread still contains the full xl_prev.

What exactly is the theory for this significant increase? I don't think
xl_prev provides a meaningful protection against torn pages in the first
place?

XLog pages don't have checksums, so they do not provide torn page
protection capabilities on their own.
A singular xlog record is protected against torn page writes through
the checksum that covers the whole record - if only part of the record
was written, we can detect that through the mismatching checksum.
However, if records end at the tear boundary, we must know for certain
that any record that starts after the tear is the record that was
written after the one before the tear. Page-local references/offsets
would not work, because the record decoding doesn't know which xlog
page the record should be located on; it could be both the version of
the page before it was recycled, or the one after.
Currently, we can detect this because the value of xl_prev will point
to a record far in the past (i.e. not the expected value), but with a
page-local version of xl_prev we would be less likely to detect torn
pages (and thus be unable to handle this without risk of corruption)
due to the significant chance of the truncated xl_prev value being the
same in both the old and new record.

Think this is addressable, see below.

Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.

With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.

In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.

The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.

PS. there are ideas floating around (I heard about this one from
Heikki) where we could concatenate WAL records into one combined
record that has only one shared xl_prev+crc; which would save these 12
bytes per record. However, that needs a lot of careful consideration
to make sure that the persistence guarantee of operations doesn't get
lost somewhere in the traffic.

One version of that is to move the CRCs to the page header, make the pages
smaller (512 bytes / 4K, depending on the hardware), and to pad out partial
pages when flushing them out. Rewriting pages is bad for hardware and prevents
having multiple WAL IOs in flight at the same time.

Greetings,

Andres Freund

#16

robertmhaas@gmail.com

over 3 years ago

In reply to: Andres Freund (#15)

Re: problems with making relfilenodes 56-bits

On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:

Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.

With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.

In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.

The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.

I'm really confused. xl_prev *is* a full-width LSN currently, as I
understand it. So in the scenario that Matthias poses, let's say the
segment was previously 000000010000000400000025 and now it's
000000010000000400000049. So if a given chunk of the page is leftover
from when the page was 000000010000000400000025, it will have xl_prev
values like 4/25xxxxxx. If it's been rewritten since the segment was
recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell
whether record B has been overwritten with a new record since the
segment was recycled. But if we stored only 2 bytes in each xl_prev
field, that would no longer be possible.

So I'm lost. It seems like Matthias has correctly identified a real
hazard, and not some weird corner case but actually something that
will happen regularly. All you need is for the old segment that got
recycled to have a record stating at the same place where the page
tore, and for the previous record to have been the same length as the
one on the new page. Given that there's only <~1024 places on a page
where a record can start, and given that in many workloads the lengths
of WAL records will be fairly uniform, this doesn't seem unlikely at
all.

A way to up the chances of detecting this case would be to store only
2 or 4 bytes of xl_prev on disk, but arrange to include the full
xl_prev value in the xl_crc calculation. Then your chances of a
collision are about 2^-32, or maybe more if you posit that CRC is a
weak and crappy algorithm, but even then it's strictly better than
just hoping that there isn't a tear point at a record boundary where
the same length record precedes the tear in both the old and new WAL
segments. However, on the flip side, even if you assume that CRC is a
fantastic algorithm with beautiful and state-of-the-art bit mixing,
the chances of it failing to notice the problem are still >0, whereas
the current algorithm that compares the full xl_prev value is a sure
thing. Because xl_prev values are never repeated, it's certain that
when a segment is recycled, any values that were legal for the old one
aren't legal in the new one.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17

andres@anarazel.de

over 3 years ago

In reply to: Robert Haas (#16)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-10-04 13:36:33 -0400, Robert Haas wrote:

On Tue, Oct 4, 2022 at 11:34 AM Andres Freund <andres@anarazel.de> wrote:

Example: Page { [ record A ] | tear boundary | [ record B ] } gets
recycled and receives a new record C at the place of A with the same
length.

With your proposal, record B would still be a valid record when it
follows C; as the page-local serial number/offset reference to the
previous record would still match after the torn write.
With the current situation and a full LSN in xl_prev, the mismatching
value in the xl_prev pointer allows us to detect this torn page write
and halt replay, before redoing an old (incorrect) record.

In this concrete scenario the 8 byte xl_prev doesn't provide *any* protection?
As you specified it, C has the same length as A, so B's xl_prev will be the
same whether it's a page local offset or the full 8 bytes.

The relevant protection against issues like this isn't xl_prev, it's the
CRC. We could improve the CRC by using the "full width" LSN for xl_prev rather
than the offset.

I'm really confused. xl_prev *is* a full-width LSN currently, as I
understand it. So in the scenario that Matthias poses, let's say the
segment was previously 000000010000000400000025 and now it's
000000010000000400000049. So if a given chunk of the page is leftover
from when the page was 000000010000000400000025, it will have xl_prev
values like 4/25xxxxxx. If it's been rewritten since the segment was
recycled, it will have xl_prev values like 4/49xxxxxx. So, we can tell
whether record B has been overwritten with a new record since the
segment was recycled. But if we stored only 2 bytes in each xl_prev
field, that would no longer be possible.

Oh, I think I misunderstood the scenario. I was thinking of cases where we
write out a bunch of pages, crash, only some of the pages made it to disk, we
then write new ones of the same length, and now find a record after the "real"
end of the WAL to be valid. Not sure how I mentally swallowed the "recycled".

For the recycling scenario to be a problem we'll also need to crash, with
parts of the page ending up with the new contents and parts of the page ending
up with the old "pre recycling" content, correct? Because without a crash
we'll have zeroed out the remainder of the page (well, leaving walreceiver out
of the picture, grr).

However, this can easily happen without any record boundaries on the partially
recycled page, so we rely on the CRCs to protect against this.

Here I originally wrote a more in-depth explanation of the scenario I was
thinking about, where we alread rely on CRCs to protect us. But, ooph, I think
they don't reliably, with today's design. But maybe I'm missing more things
today. Consider the following sequence:

1) we write WAL like this:

[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]

2) crash, the sectors with A and C made it to disk, the one with B didn't

3) We replay A, discover B is invalid (possibly zeroes or old contents),
insert a new record B' with the same length. Now it looks like this:

[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]

4) crash, the sector with B' makes it to disk

5) we replay A, B', C, because C has an xl_prev that's compatible with B'
location and a valid CRC.

Oops.

I think this can happen both within a single page and across page boundaries.

I hope I am missing something here?

A way to up the chances of detecting this case would be to store only
2 or 4 bytes of xl_prev on disk, but arrange to include the full
xl_prev value in the xl_crc calculation.

Right, that's what I was suggesting as well.

Then your chances of a collision are about 2^-32, or maybe more if you posit
that CRC is a weak and crappy algorithm, but even then it's strictly better
than just hoping that there isn't a tear point at a record boundary where
the same length record precedes the tear in both the old and new WAL
segments. However, on the flip side, even if you assume that CRC is a
fantastic algorithm with beautiful and state-of-the-art bit mixing, the
chances of it failing to notice the problem are still >0, whereas the
current algorithm that compares the full xl_prev value is a sure
thing. Because xl_prev values are never repeated, it's certain that when a
segment is recycled, any values that were legal for the old one aren't legal
in the new one.

Given that we already rely on the CRCs to detect corruption within a single
record spanning tear boundaries, this doesn't cause me a lot of heartburn. But
I suspect we might need to do something about the scenario I outlined above,
which likely would also increase the protection against this issue.

I think there might be reasonable ways to increase the guarantees based on the
2 byte xl_prev approach "alone". We don't have to store the offset from the
page header as a plain offset. What about storing something like:
page_offset ^ (page_lsn >> wal_segsz_shift)

I think something like that'd result in prev_not_really_lsn typically not
simply matching after recycling. Of course it only provides so much
protection, given 16bits...

Greetings,

Andres Freund

#18

robertmhaas@gmail.com

over 3 years ago

In reply to: Andres Freund (#17)

Re: problems with making relfilenodes 56-bits

On Tue, Oct 4, 2022 at 2:30 PM Andres Freund <andres@anarazel.de> wrote:

Consider the following sequence:

1) we write WAL like this:

[record A][tear boundary][record B, prev A_lsn][tear boundary][record C, prev B_lsn]

2) crash, the sectors with A and C made it to disk, the one with B didn't

3) We replay A, discover B is invalid (possibly zeroes or old contents),
insert a new record B' with the same length. Now it looks like this:

[record A][tear boundary][record B', prev A_lsn][tear boundary][record C, prev B_lsn]

4) crash, the sector with B' makes it to disk

5) we replay A, B', C, because C has an xl_prev that's compatible with B'
location and a valid CRC.

Oops.

I think this can happen both within a single page and across page boundaries.

I hope I am missing something here?

If you are, I don't know what it is off-hand. That seems like a
plausible scenario to me. It does require the OS to write things out
of order, and I don't know how likely that is in practice, but the
answer probably isn't zero.

I think there might be reasonable ways to increase the guarantees based on the
2 byte xl_prev approach "alone". We don't have to store the offset from the
page header as a plain offset. What about storing something like:
page_offset ^ (page_lsn >> wal_segsz_shift)

I think something like that'd result in prev_not_really_lsn typically not
simply matching after recycling. Of course it only provides so much
protection, given 16bits...

Maybe. That does seem somewhat better, but I feel like it's hard to
reason about whether it's safe in absolute terms or just resistant to
the precise scenario Matthias postulated while remaining vulnerable to
slightly modified versions.

How about this: remove xl_prev. widen xl_crc to 64 bits. include the
CRC of the previous WAL record in the xl_crc calculation. That doesn't
cut quite as many bytes out of the record size as your proposal, but
it feels like it should strongly resist pretty much every attack of
this general type, with only the minor disadvantage that the more
expensive CRC calculation will destroy all hope of getting anything
committed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19

andres@anarazel.de

over 3 years ago

In reply to: Andres Freund (#11)

2 attachment(s)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-10-03 10:01:25 -0700, Andres Freund wrote:

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:

On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen.

I think there's plenty potential...

I light dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducability) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

The average record size goes from 102.7 to 95.7 bytes excluding the remaining
FPIs, 118.1 to 111.0 including FPIs.

There's plenty other spots that could be converted (e.g. the record length
which rarely needs four bytes), this is just meant as a demonstration.

I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
profile does show pg_varint_decode_uint64(), but partially that seems to be
offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains.

That would be tolerable, I think, because waldump --stats pretty much doesn't
do anything with the WAL.

But I suspect there's plenty of optimization potential in the varint
code. Right now it e.g. stores data as big endian, and the bswap instructions
do show up. And a lot of my bit-maskery could be optimized too.

Greetings,

Andres Freund

Attachments:

v1-0001-Add-a-varint-implementation.patchtext/x-diff; charset=us-asciiDownload

From 68248e7aa220586cbf075d3cb6eb56a6461ae81f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 4 Oct 2022 16:04:16 -0700
Subject: [PATCH v1 1/2] Add a varint implementation

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20191210015054.5otdfuftxrqb5gum@alap3.anarazel.de
Backpatch:
---
 src/include/common/varint.h |  92 +++++++++++++++
 src/common/Makefile         |   1 +
 src/common/meson.build      |   1 +
 src/common/varint.c         | 228 ++++++++++++++++++++++++++++++++++++
 4 files changed, 322 insertions(+)
 create mode 100644 src/include/common/varint.h
 create mode 100644 src/common/varint.c

diff --git a/src/include/common/varint.h b/src/include/common/varint.h
new file mode 100644
index 00000000000..43d9ade7808
--- /dev/null
+++ b/src/include/common/varint.h
@@ -0,0 +1,92 @@
+#ifndef PG_VARINT_H
+#define PG_VARINT_H
+
+/*
+ * Variable length integer encoding.
+ *
+ * Represent the length of the varint, in bytes - 1, as the number of
+ * "leading" 0 bits in the first byte. I.e. the length is stored in
+ * unary.  If less than 9 bytes of data are needed, the leading 0 bits
+ * are followed by a 1 bit, followed by the data. When encoding 64bit
+ * integers, for 8 bytes of input data, there is however no high 1 bit
+ * in the second output byte, as that would be unnecessary (and
+ * increase the size to 10 bytes of output bytes). If, hypothetically,
+ * larger numbers were supported, that'd not be the case, obviously.
+ *
+ * The reason to encode the number of bytes as zero bits, rathter than
+ * ones, is that more architectures have instructions for efficiently
+ * finding the first bit set, than finding the first bit unset.
+ *
+ * In contrast to encoding the length as e.g. the high-bit in each following
+ * bytes, this allows for faster decoding (and encoding, although the benefit
+ * is smaller), because no length related branches are needed after processing
+ * the first byte and no 7 bit/byte -> 8 bit/byte conversion is needed. The
+ * mask / shift overhead is similar.
+ *
+ *
+ * I.e. the encoding of a unsiged 64bit integer is as follows
+ * [0 {#bytes - 1}][1 as separator][leading 0 bits][data]
+ *
+ * E.g.
+ * - 7 (input 0b0000'0111) is encoded as 0b1000'0111
+ * - 145 (input 0b1001'0001) is 0b0100'0000''1001'0001 (as the separator does
+ *   not fit into the one byte of the fixed width encoding)
+ * - 4141 (input 0b0001'0000''0010'1101) is 0b0101'0000''0010'1101 (the
+ *   leading 0, followed by a 1 indicating that two bytes are needed)
+ *
+ *
+ * Naively encoding negative two's complement integers this way would
+ * lead to such numbers always being stored as the maximum varint
+ * length, due to the leading sign bits.  Instead the sign bit is
+ * encoded in the least significant bit. To avoid different limits
+ * than native 64bit integers - in particular around the lowest
+ * possible value - the data portion is bit flipped instead of
+ * negated.
+ *
+ * The reason for the sign bit to be trailing is that that makes it
+ * trivial to reuse code of the unsigned case, whereas a leading bit
+ * would make that harder.
+ *
+ * XXX: One problem with storing negative numbers this way is that it
+ * makes varints not compare lexicographically. Is it worth the code
+ * duplication to allow for that? Given the variable width of
+ * integers, lexicographic sorting doesn't seem that relevant.
+ *
+ * Therefore the encoding of a signed 64bit integer is as follows:
+ * [0 {#bytes - 1}][1][leading 0 bits][data][sign bit]
+ *
+ * E.g. the encoding of
+ * -  7 (input 0b0000'0111) is encoded as 0b1000'1110
+ * - -7 (input 0b[1*]''1111'1000) is encoded as 0b1001'1101
+ */
+
+/*
+ * TODO:
+ * - check how well this works on big endian
+ * - consider encoding as little endian, more common
+ * - proper header separation
+ */
+
+
+/*
+ * The max value fitting in a 1 byte varint. One bit is needed for the length
+ * indicator.
+ */
+#define PG_VARINT_UINT64_MAX_1BYTE_VAL ((1L << (BITS_PER_BYTE - 1)) - 1)
+
+/*
+ * The max value fitting into a varint below 8 bytes has to take into account
+ * the number of bits for the length indicator for 8 bytes (7 bits), and
+ * additionally a bit for the separator (1 bit).
+ */
+#define PG_VARINT_UINT64_MAX_8BYTE_VAL ((1L << (64 - (7 + 1))) - 1)
+
+
+extern int pg_varint_encode_uint64(uint64 uval, uint8 *buf);
+extern int pg_varint_encode_int64(int64 val, uint8 *buf);
+
+extern uint64 pg_varint_decode_uint64(const uint8 *buf, int *consumed);
+extern int64 pg_varint_decode_int64(const uint8 *buf, int *consumed);
+
+
+#endif /* PG_VARINT_H */
diff --git a/src/common/Makefile b/src/common/Makefile
index e9af7346c9c..dd98b468fe4 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -78,6 +78,7 @@ OBJS_COMMON = \
 	stringinfo.o \
 	unicode_norm.o \
 	username.o \
+	varint.o \
 	wait_error.o \
 	wchar.o
 
diff --git a/src/common/meson.build b/src/common/meson.build
index 23842e1ffef..d1710569f1d 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -28,6 +28,7 @@ common_sources = files(
   'stringinfo.c',
   'unicode_norm.c',
   'username.c',
+  'varint.c',
   'wait_error.c',
   'wchar.c',
 )
diff --git a/src/common/varint.c b/src/common/varint.c
new file mode 100644
index 00000000000..9f41966773a
--- /dev/null
+++ b/src/common/varint.c
@@ -0,0 +1,228 @@
+#include "c.h"
+
+#include "common/varint.h"
+#include "port/pg_bswap.h"
+
+/*
+ * Encode unsigned 64bit integer as a variable length integer in buf, return
+ * encoded length.
+ */
+int
+pg_varint_encode_uint64(uint64 uval, uint8 *buf)
+{
+	if (uval <= PG_VARINT_UINT64_MAX_1BYTE_VAL)
+	{
+		buf[0] = (uint8) uval | 1u << 7;
+
+		return 1;
+	}
+	else if (uval <= PG_VARINT_UINT64_MAX_8BYTE_VAL)
+	{
+		int clz = __builtin_clzl(uval);
+
+		int bits_of_input_data;
+		int bytes_of_output_data;
+
+		uint64 output_uval = 0;
+		unsigned bytes_of_input_data;
+
+		bits_of_input_data =
+			(sizeof(uval) * BITS_PER_BYTE - clz); /* bits of actual data */
+
+		bytes_of_input_data =
+			(bits_of_input_data + (BITS_PER_BYTE - 1)) / BITS_PER_BYTE;
+
+		bytes_of_output_data = bytes_of_input_data;
+
+		/*
+		 * Check whether another output byte is needed to account for the
+		 * overhead of length/separator bits. Avoiding a separate division is
+		 * cheaper.
+		 *
+		 * XXX: I'm sure there's a neater way to do this.
+		 */
+		if ((bits_of_input_data + /* data */
+			 (bytes_of_input_data - 1) + /* length indicator in unary */
+			 1) /* 1 separator */
+			> (bytes_of_input_data * BITS_PER_BYTE) /* available space */
+			)
+			bytes_of_output_data++;
+
+		/* mask in value at correct position */
+		output_uval = uval << (64 - bytes_of_output_data * BITS_PER_BYTE);
+
+		/* set length bits, by setting the terminating bit to 1 */
+		output_uval |= (uint64)(1 << (BITS_PER_BYTE - (bytes_of_output_data)))
+			<< (64 - BITS_PER_BYTE); /* and shift into the most significant byte */
+
+		/* convert to big endian */
+		output_uval = pg_hton64(output_uval);
+
+		/* output */
+		memcpy(&buf[0], &output_uval, sizeof(uint64));
+
+		return bytes_of_output_data;
+	}
+	else
+	{
+		uint64 uval_be;
+
+		/*
+		 * Set length bits, no separator for 9 bytes (detectable via max
+		 * length).
+		 */
+		buf[0] = 0;
+
+		uval_be = pg_hton64(uval);
+		memcpy(&buf[1], &uval_be, sizeof(uint64));
+
+		return 9;
+	}
+}
+
+/*
+ * Encode signed 64bit integer as a variable length integer in buf, return
+ * encoded length.
+ *
+ * See also pg_varint_encode_uint64().
+ */
+int
+pg_varint_encode_int64(int64 val, uint8 *buf)
+{
+	uint64 uval;
+	bool neg;
+
+	if (val < 0)
+	{
+		neg = true;
+
+		/* reinterpret number as uint64 */
+		memcpy(&uval, &val, sizeof(int64));
+
+		/* store bit flipped value */
+		uval = ~uval;
+	}
+	else
+	{
+		neg = false;
+
+		uval = (uint64) val;
+	}
+
+	/* make space for sign bit */
+	uval <<= 1;
+	uval |= (uint8) neg;
+
+	return pg_varint_encode_uint64(uval, buf);
+}
+
+/*
+ * Decode buf into unsigned 64bit integer.
+ *
+ * Note that this function, for efficiency, reads 8 bytes, even if the
+ * variable integer is less than 8 bytes long. The buffer has to be
+ * allocated sufficiently large to account for that fact. The maximum
+ * amount of memory read is 9 bytes.
+ *
+ * FIXME: Need to length of buffer "used"!
+ */
+uint64
+pg_varint_decode_uint64(const uint8 *buf, int *consumed)
+{
+	uint8 first = buf[0];
+
+	if (first & (1 << (BITS_PER_BYTE - 1)))
+	{
+		/*
+		 * Fast path for common case of small integers.
+		 */
+		// XXX: Should we put this path into a static inline function in the
+		// header? It's pretty common...
+		*consumed = 1;
+		return first ^ (1 << 7);
+	}
+	else if (first != 0)
+	{
+		/*
+		 * Separate path for cases of 1-8 bytes - that case is different
+		 * enough from the 9 byte case that it's not worth sharing code.
+		 */
+		int clz =__builtin_clz(first);
+		int bytes = 8 - (32 - clz) + 1;
+		uint64 ret;
+
+		*consumed = bytes;
+
+		/*
+		 * Note that we explicitly read "too much" here - but we never look at
+		 * the additional data, if the length bit doesn't indicate that's
+		 * ok. A load that varies on length would require substantial, often
+		 * unpredictable, branching.
+		 */
+		memcpy(&ret, buf, sizeof(uint64));
+
+		/* move data into the appropriate place */
+		ret <<= BITS_PER_BYTE * (BITS_PER_BYTE - bytes);
+
+		/* restore native endianess*/
+		ret = pg_ntoh64(ret);
+
+		/* mask out length indicator bits & separator bit */
+		ret ^= 1L << (BITS_PER_BYTE * (bytes - 1) + (BITS_PER_BYTE - bytes));
+
+		return ret;
+	}
+	else
+	{
+		/*
+		 * 9 byte varint encoding of an 8byte integer. All the
+		 * data is following the width indicating byte, so
+		 * there's no length / separator bit to unset or such.
+		 */
+		uint64 ret;
+
+		*consumed = 9;
+
+		memcpy(&ret, &buf[1], sizeof(uint64));
+
+		/* restore native endianess*/
+		ret = pg_ntoh64(ret);
+
+		return ret;
+	}
+
+	pg_unreachable();
+	*consumed = 0;
+	return 0;
+}
+
+/*
+ * Decode buf into signed 64bit integer.
+ *
+ * See pg_varint_decode_uint64 for caveats.
+ */
+int64
+pg_varint_decode_int64(const uint8 *buf, int *consumed)
+{
+	uint64 uval = pg_varint_decode_uint64(buf, consumed);
+	bool neg;
+
+	/* determine sign */
+	neg = uval & 1;
+
+	/* remove sign bit */
+	uval >>= 1;
+
+	if (neg)
+	{
+		int64 val;
+
+		uval = ~uval;
+
+		memcpy(&val, &uval, sizeof(uint64));
+
+		return val;
+	}
+	else
+		return uval;
+}
-- 
2.37.3.542.gdd3f6c4cae

v1-0002-wip-wal-Encode-RelFileLocator-and-BLockNumber-as-.patchtext/x-diff; charset=us-asciiDownload

From 95783a38684976f81bb8ef0a66139d1bd54935ba Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 4 Oct 2022 16:08:42 -0700
Subject: [PATCH v1 2/2] wip: wal: Encode RelFileLocator and BLockNumber as
 varints

A run of installcheck in a cluster with autovacuum=off,
full_page_writes=off (for increased reproducability) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

Could be used in a bunch of additional places too.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/access/transam/xloginsert.c |  9 ++++----
 src/backend/access/transam/xlogreader.c | 29 +++++++++++++++++++++++--
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 5ca15ebbf20..a6e95a58966 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -33,6 +33,7 @@
 #include "access/xloginsert.h"
 #include "catalog/pg_control.h"
 #include "common/pg_lzcompress.h"
+#include "common/varint.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -807,11 +808,11 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 		}
 		if (!samerel)
 		{
-			memcpy(scratch, &regbuf->rlocator, sizeof(RelFileLocator));
-			scratch += sizeof(RelFileLocator);
+			scratch += pg_varint_encode_uint64(regbuf->rlocator.spcOid, (uint8*) scratch);
+			scratch += pg_varint_encode_uint64(regbuf->rlocator.dbOid, (uint8*) scratch);
+			scratch += pg_varint_encode_uint64(regbuf->rlocator.relNumber, (uint8*) scratch);
 		}
-		memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
-		scratch += sizeof(BlockNumber);
+		scratch += pg_varint_encode_uint64(regbuf->block, (uint8*) scratch);
 	}
 
 	/* followed by the record's origin, if any */
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5a8fe81f826..899b776ed25 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -31,6 +31,7 @@
 #include "access/xlogrecord.h"
 #include "catalog/pg_control.h"
 #include "common/pg_lzcompress.h"
+#include "common/varint.h"
 #include "replication/origin.h"
 
 #ifndef FRONTEND
@@ -1853,7 +1854,22 @@ DecodeXLogRecord(XLogReaderState *state,
 			}
 			if (!(fork_flags & BKPBLOCK_SAME_REL))
 			{
-				COPY_HEADER_FIELD(&blk->rlocator, sizeof(RelFileLocator));
+				int consumed;
+
+				/* FIXME: needs overflow checks from COPY_HEADER_FIELD() */
+
+				blk->rlocator.spcOid = pg_varint_decode_uint64((uint8*) ptr, &consumed);
+				ptr += consumed;
+				remaining -= consumed;
+
+				blk->rlocator.dbOid = pg_varint_decode_uint64((uint8*) ptr, &consumed);
+				ptr += consumed;
+				remaining -= consumed;
+
+				blk->rlocator.relNumber = pg_varint_decode_uint64((uint8*) ptr, &consumed);
+				ptr += consumed;
+				remaining -= consumed;
+
 				rlocator = &blk->rlocator;
 			}
 			else
@@ -1868,7 +1884,16 @@ DecodeXLogRecord(XLogReaderState *state,
 
 				blk->rlocator = *rlocator;
 			}
-			COPY_HEADER_FIELD(&blk->blkno, sizeof(BlockNumber));
+
+
+			{
+				int consumed;
+
+				/* FIXME: needs overflow checks from COPY_HEADER_FIELD() */
+				blk->blkno = pg_varint_decode_uint64((uint8*) ptr, &consumed);
+				ptr += consumed;
+				remaining -= consumed;
+			}
 		}
 		else
 		{
-- 
2.37.3.542.gdd3f6c4cae

#20

[1]: /messages/by-id/CAFiTN-uut+04AdwvBY_oK_jLvMkwXUpDJj5mXg--nek+ucApPQ@mail.gmail.com

dilipbalaut@gmail.com

over 3 years ago

In reply to: Andres Freund (#19)

Re: problems with making relfilenodes 56-bits

On Wed, Oct 5, 2022 at 5:19 AM Andres Freund <andres@anarazel.de> wrote:

I light dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducability) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

The average record size goes from 102.7 to 95.7 bytes excluding the remaining
FPIs, 118.1 to 111.0 including FPIs.

I have also executed my original test after applying these patches on
top of the 56 bit relfilenode patch. So earlier we saw the WAL size
increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
patch now the WAL generated is 58179.2265625. That means in this
particular example this patch is reducing the WAL size by 12% even
with the 56 bit relfilenode patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#21

robertmhaas@gmail.com

over 3 years ago

In reply to: Dilip Kumar (#20)

Re: problems with making relfilenodes 56-bits

On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have also executed my original test after applying these patches on
top of the 56 bit relfilenode patch. So earlier we saw the WAL size
increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
patch now the WAL generated is 58179.2265625. That means in this
particular example this patch is reducing the WAL size by 12% even
with the 56 bit relfilenode patch.

That's a very promising result, but the question in my mind is how
much work would be required to bring this patch to a committable
state?

--
Robert Haas
EDB: http://www.enterprisedb.com

#22

andres@anarazel.de

over 3 years ago

In reply to: Robert Haas (#21)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-10-10 08:10:22 -0400, Robert Haas wrote:

On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have also executed my original test after applying these patches on
top of the 56 bit relfilenode patch. So earlier we saw the WAL size
increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
patch now the WAL generated is 58179.2265625. That means in this
particular example this patch is reducing the WAL size by 12% even
with the 56 bit relfilenode patch.

That's a very promising result, but the question in my mind is how
much work would be required to bring this patch to a committable
state?

The biggest part clearly is to review the variable width integer patch. It's
not a large amount of code, but probably more complicated than average.

One complication there is that currently the patch assumes:
* Note that this function, for efficiency, reads 8 bytes, even if the
* variable integer is less than 8 bytes long. The buffer has to be
* allocated sufficiently large to account for that fact. The maximum
* amount of memory read is 9 bytes.

We could make a less efficient version without that assumption, but I think it
might be easier to just guarantee it in the xlog*.c case.

Using it in xloginsert.c is pretty darn simple, code-wise. xlogreader is bit
harder, although not for intrinsic reasons - the logic underlying
COPY_HEADER_FIELD seems unneccessary complicated to me. The minimal solution
would likely be to just wrap the varint reads in another weird macro.

Leaving the code issues themselves aside, one important thing would be to
evaluate what the performance impacts of the varint encoding/decoding are as
part of "full" server. I suspect it'll vanish in the noise, but we'd need to
validate that.

Greetings,

Andres Freund

#23

dilipbalaut@gmail.com

over 3 years ago

In reply to: Robert Haas (#21)

Re: problems with making relfilenodes 56-bits

On Mon, Oct 10, 2022 at 5:40 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 10, 2022 at 5:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have also executed my original test after applying these patches on
top of the 56 bit relfilenode patch. So earlier we saw the WAL size
increased by 11% (66199.09375 kB to 73906.984375 kB) and after this
patch now the WAL generated is 58179.2265625. That means in this
particular example this patch is reducing the WAL size by 12% even
with the 56 bit relfilenode patch.

That's a very promising result, but the question in my mind is how
much work would be required to bring this patch to a committable
state?

Right, the results are promising. I have done some more testing with
make installcheck WAL size (fpw=off) and I have seen a similar gain
with this patch.

1. Head: 272 MB
2. 56 bit RelfileLocator: 285 MB
3. 56 bit RelfileLocator + this patch: 261 MB

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#24

[0]: /messages/by-id/CAEze2WgZti_Bgs-Aw3egsR5PJQpHcYZwZFCJND5MS-O_DX0-Hg@mail.gmail.com
[1]: /messages/by-id/CAEze2WjOFzRzPMPYhH4odSa9OCF2XeZszE3jGJhJzrpdFmyLOw@mail.gmail.com

boekewurm+postgres@gmail.com

over 3 years ago

In reply to: Andres Freund (#19)

3 attachment(s)

Re: problems with making relfilenodes 56-bits

On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-10-03 10:01:25 -0700, Andres Freund wrote:

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:

On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen.

I think there's plenty potential...

I light dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducability) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

I think a signficant part of this improvement comes from the premise
of starting with a fresh database. tablespace OID will indeed most
likely be low, but database OID may very well be linearly distributed
if concurrent workloads in the cluster include updating (potentially
unlogged) TOASTed columns and the databases are not created in one
"big bang" but over the lifetime of the cluster. In that case DBOID
will consume 5B for a significant fraction of databases (anything with
OID >=2^28).

My point being: I don't think that we should have different WAL
performance in databases which is dependent on which OID was assigned
to that database.
In addition; this varlen encoding of relfilenode would mean that
performance would drop over time, as a relations' relfile locator is
updated to something with a wider number (through VACUUM FULL or other
relfilelocator cycling; e.g. re-importing a database). For maximum
performance, you'd have to tune your database to have the lowest
possible database, namespace and relfilelocator numbers; which (in
older clusters) implies hacking into the catalogs - which seems like
an antipattern.

I would have much less issue with this if we had separate counters per
database (and approximately incremental dbOid:s), but that's not the
case right now.

The average record size goes from 102.7 to 95.7 bytes excluding the remaining
FPIs, 118.1 to 111.0 including FPIs.

That's quite promising.

There's plenty other spots that could be converted (e.g. the record length
which rarely needs four bytes), this is just meant as a demonstration.

Agreed.

I used pg_waldump --stats for that range of WAL to measure the CPU overhead. A
profile does show pg_varint_decode_uint64(), but partially that seems to be
offset by the reduced amount of bytes to CRC. Maybe a ~2% overhead remains.

That would be tolerable, I think, because waldump --stats pretty much doesn't
do anything with the WAL.

But I suspect there's plenty of optimization potential in the varint
code. Right now it e.g. stores data as big endian, and the bswap instructions
do show up. And a lot of my bit-maskery could be optimized too.

One thing that comes to mind is that we will never see dbOid < 2^8
(and rarely < 2^14, nor spcOid less than 2^8 for that matter), so
we'll probably waste at least one or two bits in the encoding of those
values. That's not the end of the world, but it'd probably be better
if we could improve on that - up to 6% of the field's disk usage would
be wasted on an always-on bit.

----

Attached is a prototype patchset that reduces the WAL record size in
many common cases. This is a prototype, as it fails tests due to a
locking issue in prepared_xacts that I have not been able to find the
source of yet. It also could use some more polishing, but the base
case seems quite good. I haven't yet run the numbers though...

0001 - Extract xl_rminfo from xl_info
See [0]/messages/by-id/CAEze2WgZti_Bgs-Aw3egsR5PJQpHcYZwZFCJND5MS-O_DX0-Hg@mail.gmail.com for more info as to why that's useful, the patch was pulled
from there. It is mainly used to reduce the size of 0002; and mostly
consists of find-and-replace of rmgrs extracting their bits from
xl_info.

0002 - Rework XLogRecord
This makes many fields in the xlog header optional, reducing the size
of many xlog records by several bytes. This implements the design I
shared in my earlier message [1]/messages/by-id/CAEze2WjOFzRzPMPYhH4odSa9OCF2XeZszE3jGJhJzrpdFmyLOw@mail.gmail.com.

0003 - Rework XLogRecordBlockHeader.
This patch could be applied on current head, and saves some bytes in
per-block data. It potentially saves some bytes per registered
block/buffer in the WAL record (max 2 bytes for the first block, after
that up to 3). See the patch's commit message in the patch for
detailed information.

Kind regards,

Matthias van de Meent

Attachments:

0003-Compactify-XLogRecordBlockHeader-format-where-possib.patchapplication/x-patch; name=0003-Compactify-XLogRecordBlockHeader-format-where-possib.patchDownload

From b42ffc126fe714d3a25bb198fb45ef483c9e5795 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 12 Oct 2022 21:16:53 +0200
Subject: [PATCH 3/3] Compactify XLogRecordBlockHeader format, where possible.

- Add 'sequential block IDs' compression
  Usually, blocks are registered with sequential IDs starting with 0.
  Utilize that pattern to deduplicate those fields in that common case.
- Add 'mini block data' field size
  If blocks don't need 2 bytes to store length; we don't those 2 bytes.
- Omit data_length if !BKPBLOCK_HAS_DATA
  FPIs have their own data length bookkeeping, so don't bother with
  those 2 bytes if the field is 0.
---
 src/backend/access/transam/xloginsert.c | 115 +++++++++++++++++++++++-
 src/backend/access/transam/xlogreader.c |  39 +++++++-
 src/include/access/xlogrecord.h         |  25 ++++--
 3 files changed, 167 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 2ac1d0a870..c9d54c0a1e 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -559,6 +559,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 	XLogRecData *rdt_datas_last;
 	XLogRecord *rechdr = (XLogRecord *) hdr_scratch;
 	char	   *scratch = hdr_scratch;
+	uint8	   *first_blockid = NULL;
 
 	Assert((info & ~(XLR_INFO_USERFLAGS)) == 0);
 
@@ -626,7 +627,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 		bool		include_image;
 
 		if (!regbuf->in_use)
+		{
+			first_blockid = NULL;
 			continue;
+		}
 
 		only_hdr = false;
 		/* Determine if this block needs to be backed up */
@@ -843,8 +847,115 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 		prev_regbuf = regbuf;
 
 		/* Ok, copy the header to the scratch buffer */
-		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
-		scratch += SizeOfXLogRecordBlockHeader;
+		/* First block needs to initialize optional batching */
+		if (block_id == 0)
+		{
+			first_blockid = (uint8 *) scratch;
+
+			if (bkpb.data_length <= UINT8_MAX)
+			{
+				/*
+				 * Optimistic guess: if the first registered block has no data,
+				 * it is likely that subsequent blocks have little data too.
+				 * If we're wrong, a subsequent block will have 1 more byte
+				 * than when we would have guessed correctly. If we're right,
+				 * it saves a byte for each correctly guessed block.
+				 */
+				bkpb.id |= XLR_BLOCK_DATA_SMALL;
+				memcpy(scratch, &bkpb, offsetof(XLogRecordBlockHeader, data_length));
+				scratch += offsetof(XLogRecordBlockHeader, data_length);
+
+				if (bkpb.data_length == 0)
+					Assert((bkpb.fork_flags & BKPBLOCK_HAS_DATA) == 0);
+				else
+				{
+					uint8	data_length = bkpb.data_length;
+					memcpy(scratch++, &data_length, sizeof(uint8));
+				}
+			}
+			else
+			{
+				memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
+				scratch += SizeOfXLogRecordBlockHeader;
+			}
+		}
+		/* in a sequence of block ids; with sequence data at first_blockid */
+		else if (first_blockid != NULL)
+		{
+			if (bkpb.data_length == 0)
+			{
+				/*
+				 * Empty blocks don't care about data_length field size, as
+				 * it is not emitted.
+				 */
+				*first_blockid += 1;
+				*first_blockid |= XLR_BLOCK_FIRST_NP1_SEQ;
+
+				memcpy(scratch++, &bkpb.fork_flags, sizeof(bkpb.fork_flags));
+			}
+			else if (*first_blockid & XLR_BLOCK_DATA_SMALL)
+			{
+				if (bkpb.data_length <= UINT8_MAX)
+				{
+					uint8	fork_flags = bkpb.fork_flags;
+					uint8	data_length = bkpb.data_length;
+
+					*first_blockid += 1;
+					*first_blockid |= XLR_BLOCK_FIRST_NP1_SEQ;
+
+					memcpy(scratch++, &fork_flags, sizeof(uint8));
+					memcpy(scratch++, &data_length, sizeof(uint8));
+				}
+				else
+				{
+					first_blockid = NULL;
+					memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
+					scratch += SizeOfXLogRecordBlockHeader;
+				}
+			}
+			else
+			{
+				/*
+				 * It would save 1 byte in the length field to break the chain,
+				 * but that would cost 1 byte as well, and completely disable
+				 * any chances of saving 1 byte in the block header of a
+				 * later block.
+				 * So as long as the chain continues, we won't have a large
+				 * record header.
+				 * We do still ignore the block_id field in the block header.
+				 */
+				*first_blockid += 1;
+				*first_blockid |= XLR_BLOCK_FIRST_NP1_SEQ;
+
+				memcpy(scratch,
+					   ((char *) &bkpb) + offsetof(XLogRecordBlockHeader, fork_flags),
+					   SizeOfXLogRecordBlockHeader - offsetof(XLogRecordBlockHeader, fork_flags));
+				scratch += SizeOfXLogRecordBlockHeader - offsetof(XLogRecordBlockHeader, fork_flags);
+			}
+		}
+		/* not in a sequence of block ids */
+		else
+		{
+			if (bkpb.data_length <= UINT8_MAX)
+			{
+				memcpy(scratch, &bkpb, offsetof(XLogRecordBlockHeader, data_length));
+				scratch += offsetof(XLogRecordBlockHeader, data_length);
+
+				if (bkpb.data_length == 0)
+					Assert((bkpb.fork_flags & BKPBLOCK_HAS_DATA) == 0);
+				else
+				{
+					uint8	data_length = bkpb.data_length;
+					memcpy(scratch++, &data_length, sizeof(uint8));
+				}
+			}
+			else
+			{
+				memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
+				scratch += SizeOfXLogRecordBlockHeader;
+			}
+		}
+
 		if (include_image)
 		{
 			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 56bf86cb27..cb930c9e82 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1792,6 +1792,9 @@ DecodeXLogRecord(XLogReaderState *state,
 	uint32		datatotal;
 	RelFileLocator *rlocator = NULL;
 	uint8		block_id;
+	bool		shorthdr = false;
+	bool		in_bid_seq = false;
+	uint8		bid_seq_end = 0;
 
 	decoded->header = (XLRHeaderData) {0};
 	decoded->lsn = lsn;
@@ -1851,7 +1854,10 @@ DecodeXLogRecord(XLogReaderState *state,
 	datatotal = 0;
 	while (remaining > datatotal)
 	{
-		COPY_HEADER_FIELD(&block_id, sizeof(uint8));
+		if (in_bid_seq)
+			block_id++;
+		else
+			COPY_HEADER_FIELD(&block_id, sizeof(uint8));
 
 		if (block_id == XLR_BLOCK_ID_DATA_SHORT)
 		{
@@ -1884,12 +1890,24 @@ DecodeXLogRecord(XLogReaderState *state,
 		{
 			COPY_HEADER_FIELD(&decoded->toplevel_xid, sizeof(TransactionId));
 		}
-		else if (block_id <= XLR_MAX_BLOCK_ID)
+		else if ((block_id & XLR_BLOCK_ID_MASK) <= XLR_MAX_BLOCK_ID)
 		{
 			/* XLogRecordBlockHeader */
 			DecodedBkpBlock *blk;
 			uint8		fork_flags;
 
+			if (!in_bid_seq)
+			{
+				shorthdr = (block_id & XLR_BLOCK_DATA_SMALL) != 0;
+
+				if (block_id & XLR_BLOCK_FIRST_NP1_SEQ)
+				{
+					bid_seq_end = (block_id & XLR_BLOCK_ID_MASK) + 1;
+					block_id = decoded->max_block_id + 1;
+					in_bid_seq = true;
+				}
+			}
+
 			/* mark any intervening block IDs as not in use */
 			for (int i = decoded->max_block_id + 1; i < block_id; ++i)
 				decoded->blocks[i].in_use = false;
@@ -1915,8 +1933,19 @@ DecodeXLogRecord(XLogReaderState *state,
 			blk->has_data = ((fork_flags & BKPBLOCK_HAS_DATA) != 0);
 
 			blk->prefetch_buffer = InvalidBuffer;
+			if (!blk->has_data)
+			{
+				/* no data == no data_len, in all situations */
+			}
+			else if (shorthdr)
+			{
+				uint8 len = 0;
+				COPY_HEADER_FIELD(&len, sizeof(uint8));
+				blk->data_len = len;
+			}
+			else
+				COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 
-			COPY_HEADER_FIELD(&blk->data_len, sizeof(uint16));
 			/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
 			if (blk->has_data && blk->data_len == 0)
 			{
@@ -2033,6 +2062,10 @@ DecodeXLogRecord(XLogReaderState *state,
 				blk->rlocator = *rlocator;
 			}
 			COPY_HEADER_FIELD(&blk->blkno, sizeof(BlockNumber));
+
+			/* finish the sequence we're in */
+			if (block_id == bid_seq_end)
+				in_bid_seq = false;
 		}
 		else
 		{
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 94125f9878..862c922032 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -220,13 +220,18 @@ XLogRecordGetRMInfo(XLogRecord *record)
  *
  * Note that we don't attempt to align the XLogRecordBlockHeader struct!
  * So, the struct must be copied to aligned local storage before use.
+ * 
+ * Note that if .id & 0x80 is set; the block header is small instead.
+ * If .id & 0x40 is set, there will be id + 1 following block headers
+ * of this type, having incremental ids, but written to disk without
+ * the id field.
  */
 typedef struct XLogRecordBlockHeader
 {
 	uint8		id;				/* block reference ID */
 	uint8		fork_flags;		/* fork within the relation, and flags */
 	uint16		data_length;	/* number of payload bytes (not including page
-								 * image) */
+								 * image). Emitted iff BKPBLOCK_HAS_DATA */
 
 	/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
 	/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
@@ -359,11 +364,17 @@ typedef struct XLogRecordDataHeaderLong
  * need a handful of block references, but there are a few exceptions that
  * need more.
  */
-#define XLR_MAX_BLOCK_ID			32
-
-#define XLR_BLOCK_ID_DATA_SHORT		255
-#define XLR_BLOCK_ID_DATA_LONG		254
-#define XLR_BLOCK_ID_ORIGIN			253
-#define XLR_BLOCK_ID_TOPLEVEL_XID	252
+#define XLR_MAX_BLOCK_ID			0x20
+
+#define XLR_BLOCK_ID_DATA_SHORT		0x3F
+#define XLR_BLOCK_ID_DATA_LONG		0x3E
+#define XLR_BLOCK_ID_ORIGIN			0x3D
+#define XLR_BLOCK_ID_TOPLEVEL_XID	0x3C
+
+#define XLR_BLOCK_ID_MASK			0x3F
+#define XLR_BLOCK_FIRST_NP1_SEQ		0x40 /* the following blocks are 
+										  * 0..block_id + 1, and have omitted
+										  * their block ID */
+#define XLR_BLOCK_DATA_SMALL		0x80 /* data_length field is uint8 */
 
 #endif							/* XLOGRECORD_H */
-- 
2.30.2

0002-Compactify-xlog-format.patchapplication/x-patch; name=0002-Compactify-xlog-format.patchDownload

From eb42223e869ddc8b4d5c7a62fe8f1e6e58eedced Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 11 Oct 2022 19:15:58 +0200
Subject: [PATCH 2/3] Compactify xlog format

Reduces xlog header size to 16B in many cases, and <=21B in the
common case, saving 3-8 bytes /record.

Future improvements will see up to 2 more bytes saved per registered
block.

Note: prepared_xacts tests are disabled for now; because I had issues
that I couldn't trace down; resulting in a lock waiting for terminated
backends. IDK how that happened, but that's for a later day.
---
 contrib/pg_walinspect/pg_walinspect.c         |   4 +-
 src/backend/access/heap/heapam.c              |  24 +-
 src/backend/access/heap/rewriteheap.c         |   3 +-
 src/backend/access/transam/multixact.c        |   3 +-
 src/backend/access/transam/twophase.c         |   2 +-
 src/backend/access/transam/xact.c             |   6 +-
 src/backend/access/transam/xlog.c             |  90 +++-
 src/backend/access/transam/xloginsert.c       | 141 +++++-
 src/backend/access/transam/xlogprefetcher.c   |   2 +-
 src/backend/access/transam/xlogreader.c       | 438 ++++++++++++------
 src/backend/access/transam/xlogrecovery.c     |  31 +-
 src/backend/catalog/storage.c                 |   8 +-
 src/backend/commands/dbcommands.c             |  20 +-
 src/backend/replication/logical/decode.c      |   6 +
 src/backend/replication/logical/logical.c     |   2 +-
 .../replication/logical/logicalfuncs.c        |   2 +-
 src/backend/replication/slotfuncs.c           |   2 +-
 src/backend/replication/walsender.c           |   2 +-
 src/backend/utils/cache/inval.c               |   3 +-
 src/bin/pg_resetwal/pg_resetwal.c             |  23 +-
 src/bin/pg_rewind/parsexlog.c                 |   6 +-
 src/bin/pg_waldump/pg_waldump.c               |   2 +-
 src/include/access/xloginsert.h               |   2 +-
 src/include/access/xlogprefetcher.h           |   4 +-
 src/include/access/xlogreader.h               |   6 +-
 src/include/access/xlogrecord.h               | 173 ++++++-
 src/test/regress/parallel_schedule            |   7 +-
 27 files changed, 748 insertions(+), 264 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index f4e3b40bed..0c6eeab557 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -38,7 +38,7 @@ PG_FUNCTION_INFO_V1(pg_get_wal_stats_till_end_of_wal);
 
 static bool IsFutureLSN(XLogRecPtr lsn, XLogRecPtr *curr_lsn);
 static XLogReaderState *InitXLogReaderState(XLogRecPtr lsn);
-static XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+static XLRHeaderData *ReadNextXLogRecord(XLogReaderState *xlogreader);
 static void GetWALRecordInfo(XLogReaderState *record, Datum *values,
 							 bool *nulls, uint32 ncols);
 static XLogRecPtr ValidateInputLSNs(bool till_end_of_wal,
@@ -138,7 +138,7 @@ InitXLogReaderState(XLogRecPtr lsn)
  * encounter errors if the flush pointer falls in the middle of a record. In
  * that case we'll return NULL.
  */
-static XLogRecord *
+static XLRHeaderData *
 ReadNextXLogRecord(XLogReaderState *xlogreader)
 {
 	XLogRecord *record;
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 94e702cdb2..22b12da928 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2171,7 +2171,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		/* filtering by origin on a row level is much more efficient */
 		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-		recptr = XLogInsert(RM_HEAP_ID, rminfo);
+		recptr = XLogInsertExtended(RM_HEAP_ID, XLR_HAS_XID,
+									rminfo, InvalidCommandId);
 
 		PageSetLSN(page, recptr);
 	}
@@ -2519,7 +2520,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 			/* filtering by origin on a row level is much more efficient */
 			XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-			recptr = XLogInsert(RM_HEAP2_ID, rminfo);
+			recptr = XLogInsertExtended(RM_HEAP2_ID, XLR_HAS_XID,
+										rminfo, InvalidCommandId);
 
 			PageSetLSN(page, recptr);
 		}
@@ -3018,7 +3020,8 @@ l1:
 		/* filtering by origin on a row level is much more efficient */
 		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+		recptr = XLogInsertExtended(RM_HEAP_ID, XLR_HAS_XID,
+									XLOG_HEAP_DELETE, InvalidCommandId);
 
 		PageSetLSN(page, recptr);
 	}
@@ -5813,7 +5816,8 @@ heap_finish_speculative(Relation relation, ItemPointer tid)
 		XLogRegisterData((char *) &xlrec, SizeOfHeapConfirm);
 		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
 
-		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_CONFIRM);
+		recptr = XLogInsertExtended(RM_HEAP_ID, XLR_HAS_XID,
+									XLOG_HEAP_CONFIRM, InvalidCommandId);
 
 		PageSetLSN(page, recptr);
 	}
@@ -5956,7 +5960,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
 
 		/* No replica identity & replication origin logged */
 
-		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_DELETE);
+		recptr = XLogInsertExtended(RM_HEAP_ID, XLR_HAS_XID,
+									XLOG_HEAP_DELETE, InvalidCommandId);
 
 		PageSetLSN(page, recptr);
 	}
@@ -6067,7 +6072,8 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 
 		/* inplace updates aren't decoded atm, don't log the origin */
 
-		recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
+		recptr = XLogInsertExtended(RM_HEAP_ID, XLR_HAS_XID,
+									XLOG_HEAP_INPLACE, InvalidCommandId);
 
 		PageSetLSN(page, recptr);
 	}
@@ -8439,7 +8445,8 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	/* filtering by origin on a row level is much more efficient */
 	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-	recptr = XLogInsert(RM_HEAP_ID, rminfo);
+	recptr = XLogInsertExtended(RM_HEAP_ID, XLR_HAS_XID,
+								rminfo, InvalidCommandId);
 
 	return recptr;
 }
@@ -8513,7 +8520,8 @@ log_heap_new_cid(Relation relation, HeapTuple tup)
 
 	/* will be looked at irrespective of origin */
 
-	recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_NEW_CID);
+	recptr = XLogInsertExtended(RM_HEAP2_ID, XLR_HAS_XID,
+								XLOG_HEAP2_NEW_CID, InvalidCommandId);
 
 	return recptr;
 }
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index b01b39b008..0b54dec99d 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -927,7 +927,8 @@ logical_heap_rewrite_flush_mappings(RewriteState state)
 		XLogRegisterData(waldata_start, len);
 
 		/* write xlog record */
-		XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_REWRITE);
+		XLogInsertExtended(RM_HEAP2_ID, XLR_HAS_XID,
+						   XLOG_HEAP2_REWRITE, InvalidCommandId);
 
 		pfree(waldata_start);
 	}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index ca6e238542..2fe15635e9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -835,7 +835,8 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	XLogRegisterData((char *) (&xlrec), SizeOfMultiXactCreate);
 	XLogRegisterData((char *) members, nmembers * sizeof(MultiXactMember));
 
-	(void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID);
+	(void) XLogInsertExtended(RM_MULTIXACT_ID, XLR_HAS_XID,
+							  XLOG_MULTIXACT_CREATE_ID, InvalidCommandId);
 
 	/* Now enter the information into the OFFSETs and MEMBERs logs */
 	RecordNewMultiXact(multi, offset, nmembers, members);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 98c1ef7ec5..0bae7c71ea 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1396,7 +1396,7 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 static void
 XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 {
-	XLogRecord *record;
+	XLRHeaderData *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8ce44abfcf..fd9ce4d41f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5769,7 +5769,8 @@ XactLogCommitRecord(TimestampTz commit_time,
 	/* we allow filtering by xacts */
 	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-	return XLogInsertExtended(RM_XACT_ID, info, rminfo);
+	return XLogInsertExtended(RM_XACT_ID, info | XLR_HAS_XID,
+							  rminfo, InvalidCommandId);
 }
 
 /*
@@ -5917,7 +5918,8 @@ XactLogAbortRecord(TimestampTz abort_time,
 	if (TransactionIdIsValid(twophase_xid))
 		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-	return XLogInsertExtended(RM_XACT_ID, info, rminfo);
+	return XLogInsertExtended(RM_XACT_ID, info | XLR_HAS_XID,
+							  rminfo, InvalidCommandId);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3058041683..02ec968697 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -742,16 +742,16 @@ XLogInsertRecord(XLogRecData *rdata,
 	pg_crc32c	rdata_crc;
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
-	uint8		rminfo = rechdr->xl_rminfo;
-	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   rminfo == XLOG_SWITCH);
+	uint8		rminfo;
+	uint32		reclength;
+	bool		isLogSwitch;
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 	bool		prevDoPageWrites = doPageWrites;
 	TimeLineID	insertTLI;
 
 	/* we assume that all of the record header is in the first chunk */
-	Assert(rdata->len >= SizeOfXLogRecord);
+	Assert(rdata->len >= MinXLogHeaderSize);
 
 	/* cross-check on whether we should be here or not */
 	if (!XLogInsertAllowed())
@@ -763,6 +763,40 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	insertTLI = XLogCtl->InsertTimeLineID;
 
+	if (rechdr->xl_info & XLR_HAS_RMINFO)
+	{
+		Size offset = MinXLogHeaderSize;
+
+		switch (rechdr->xl_info & XLR2_LEN_MASK)
+		{
+			case XLR2_LEN_ABSENT:
+				break;
+			case XLR2_LEN_1B:
+				offset += 1;
+				break;
+			case XLR2_LEN_2B:
+				offset += 2;
+				break;
+			case XLR2_LEN_4B:
+				offset += 4;
+				break;
+			default:
+				pg_unreachable();
+		}
+		if (rechdr->xl_info & XLR_HAS_XID)
+			offset += sizeof(TransactionId);
+		if (rechdr->xl_info & XLR_HAS_CID)
+			offset += sizeof(CommandId);
+
+		rminfo = *((uint8 *) (rdata->data + offset));
+	}
+	else
+		rminfo = 0;
+
+	reclength = XLogRecordGetLength(rechdr);
+
+	isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID && rminfo == XLOG_SWITCH);
+
 	/*----------
 	 *
 	 * We have now done all the preparatory work we can without holding a
@@ -845,7 +879,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
 	else
 	{
-		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
+		ReserveXLogInsertLocation(reclength, &StartPos, &EndPos,
 								  &rechdr->xl_prev);
 		inserted = true;
 	}
@@ -865,7 +899,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
+		CopyXLogRecordToWAL(reclength, isLogSwitch, rdata,
 							StartPos, EndPos, insertTLI);
 
 		/*
@@ -896,14 +930,18 @@ XLogInsertRecord(XLogRecData *rdata,
 
 	END_CRIT_SECTION();
 
-	MarkCurrentTransactionIdLoggedIfAny();
+	if (rechdr->xl_info & XLR_HAS_XID)
+		MarkCurrentTransactionIdLoggedIfAny();
 
 	/*
 	 * Mark top transaction id is logged (if needed) so that we should not try
 	 * to log it again with the next WAL record in the current subtransaction.
 	 */
 	if (topxid_included)
+	{
+		Assert(rechdr->xl_info & XLR_HAS_XID);
 		MarkSubxactTopXidLogged();
+	}
 
 	/*
 	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
@@ -936,7 +974,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		 */
 		if (inserted)
 		{
-			EndPos = StartPos + SizeOfXLogRecord;
+			/* switch record is no more than a min-sized header plus rminfo */
+			EndPos = StartPos + MAXALIGN(MinXLogHeaderSize + sizeof(uint8));
 			if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 			{
 				uint64		offset = XLogSegmentOffset(EndPos, wal_segment_size);
@@ -977,7 +1016,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* We also need temporary space to decode the record. */
 		record = (XLogRecord *) recordBuf.data;
 		decoded = (DecodedXLogRecord *)
-			palloc(DecodeXLogRecordRequiredSpace(record->xl_tot_len));
+			palloc(DecodeXLogRecordRequiredSpace(reclength));
 
 		if (!debug_reader)
 			debug_reader = XLogReaderAllocate(wal_segment_size, NULL,
@@ -1022,7 +1061,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	/* Report WAL traffic to the instrumentation. */
 	if (inserted)
 	{
-		pgWalUsage.wal_bytes += rechdr->xl_tot_len;
+		pgWalUsage.wal_bytes += reclength;
 		pgWalUsage.wal_records++;
 		pgWalUsage.wal_fpi += num_fpi;
 	}
@@ -1056,7 +1095,7 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	size = MAXALIGN(size);
 
 	/* All (non xlog-switch) records should contain data. */
-	Assert(size > SizeOfXLogRecord);
+	Assert(size > MinXLogHeaderSize);
 
 	/*
 	 * The duration the spinlock needs to be held is minimized by minimizing
@@ -1107,7 +1146,7 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	uint64		startbytepos;
 	uint64		endbytepos;
 	uint64		prevbytepos;
-	uint32		size = MAXALIGN(SizeOfXLogRecord);
+	uint32		size = MAXALIGN(MinXLogHeaderSize + sizeof(uint8));
 	XLogRecPtr	ptr;
 	uint32		segleft;
 
@@ -1250,8 +1289,11 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 	 */
 	if (isLogSwitch && XLogSegmentOffset(CurrPos, wal_segment_size) != 0)
 	{
-		/* An xlog-switch record doesn't contain any data besides the header */
-		Assert(write_len == SizeOfXLogRecord);
+		/*
+		 * An xlog-switch record doesn't contain any data besides the basic
+		 * header and rminfo
+		 */
+		Assert(write_len == MAXALIGN(MinXLogHeaderSize + sizeof(uint8)));
 
 		/* Assert that we did reserve the right amount of space */
 		Assert(XLogSegmentOffset(EndPos, wal_segment_size) == 0);
@@ -4672,6 +4714,9 @@ BootStrapXLOG(void)
 	uint64		sysidentifier;
 	struct timeval tv;
 	pg_crc32c	crc;
+	const uint32 rec_tot_len = MinXLogHeaderSize + sizeof(uint8) /* xl_tot_len */
+		+ SizeOfXLogRecordDataHeaderShort + sizeof(checkPoint);
+	const uint8 rec_len = rec_tot_len;
 
 	/* allow ordinary WAL segment creation, like StartupXLOG() would */
 	SetInstallXLogFileSegmentActive();
@@ -4746,20 +4791,23 @@ BootStrapXLOG(void)
 	recptr = ((char *) page + SizeOfXLogLongPHD);
 	record = (XLogRecord *) recptr;
 	record->xl_prev = 0;
-	record->xl_xid = InvalidTransactionId;
-	record->xl_tot_len = SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort + sizeof(checkPoint);
-	record->xl_info = XLOG_CHECKPOINT_SHUTDOWN;
+	record->xl_info = XLR2_LEN_1B;
 	record->xl_rmid = RM_XLOG_ID;
-	recptr += SizeOfXLogRecord;
+
+	recptr += MinXLogHeaderSize;
+
+	memcpy(recptr, (char *) &rec_len, sizeof(uint8));
+	recptr += sizeof(uint8);
+
 	/* fill the XLogRecordDataHeaderShort struct */
 	*(recptr++) = (char) XLR_BLOCK_ID_DATA_SHORT;
 	*(recptr++) = sizeof(checkPoint);
-	memcpy(recptr, &checkPoint, sizeof(checkPoint));
+	memcpy(recptr, (char *) &checkPoint, sizeof(checkPoint));
 	recptr += sizeof(checkPoint);
-	Assert(recptr - (char *) record == record->xl_tot_len);
+	Assert(recptr - (char *) record == rec_len);
 
 	INIT_CRC32C(crc);
-	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
+	COMP_CRC32C(crc, ((char *) record) + offsetof(XLogRecord, xl_rmid), rec_len - offsetof(XLogRecord, xl_rmid));
 	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32C(crc);
 	record->xl_crc = crc;
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index cc4a262d8e..2ac1d0a870 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -118,7 +118,7 @@ static char *hdr_scratch = NULL;
 #define SizeOfXLogTransactionId	(sizeof(TransactionId) + sizeof(char))
 
 #define HEADER_SCRATCH_SIZE \
-	(SizeOfXLogRecord + \
+	(MaxXLogHeaderSize + \
 	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
 	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin + \
 	 SizeOfXLogTransactionId)
@@ -135,10 +135,11 @@ static bool begininsert_called = false;
 /* Memory context to hold the registered buffer and data references. */
 static MemoryContext xloginsert_cxt;
 
-static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
-									   XLogRecPtr RedoRecPtr, bool doPageWrites,
-									   XLogRecPtr *fpw_lsn, int *num_fpi,
-									   bool *topxid_included);
+static XLogRecData *
+XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
+				   XLogRecPtr RedoRecPtr, bool doPageWrites,
+				   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included,
+				   CommandId cid);
 static bool XLogCompressBackupBlock(char *page, uint16 hole_offset,
 									uint16 hole_length, char *dest, uint16 *dlen);
 
@@ -447,11 +448,15 @@ XLogSetRecordFlags(uint8 flags)
  * (LSN is the XLOG point up to which the XLOG must be flushed to disk
  * before the data page can be written out.  This implements the basic
  * WAL rule "write the log before the data".)
+ *
+ * Note: To include the XID in the record, you have to use
+ * XLogInsertExtended with info = XLR2_HAS_XID. To include CID, do the
+ * same with HAS_CID, and specify the command ID.
  */
 XLogRecPtr
 XLogInsert(RmgrId rmid, uint8 rminfo)
 {
-	return XLogInsertExtended(rmid, 0, rminfo);
+	return XLogInsertExtended(rmid, 0, rminfo, InvalidCommandId);
 }
 
 
@@ -459,11 +464,15 @@ XLogInsert(RmgrId rmid, uint8 rminfo)
  * Insert an XLOG record having the specified RMID, info and rminfo bytes,
  * with the body of the record being the data and buffer references
  * registered earlier with XLogRegister* calls.
+ * 
+ * CommandId is an argument to be called by the user, while TransactionId
+ * (if needed) is taken from this backend's state: There is at any time
+ * only one running XID, while there may be more than one active CommandIds.
  *
  * See also XLogInsert above.
  */
 XLogRecPtr
-XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo)
+XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo, CommandId cid)
 {
 	XLogRecPtr	EndPos;
 
@@ -475,8 +484,7 @@ XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo)
 	 * The caller can set XLR_SPECIAL_REL_UPDATE and
 	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_SPECIAL_REL_UPDATE |
-				  XLR_CHECK_CONSISTENCY)) != 0)
+	if ((info & ~XLR_INFO_USERFLAGS) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
 	TRACE_POSTGRESQL_WAL_INSERT(rmid, info);
@@ -509,7 +517,8 @@ XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo)
 		GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
 
 		rdt = XLogRecordAssemble(rmid, info, rminfo, RedoRecPtr, doPageWrites,
-								 &fpw_lsn, &num_fpi, &topxid_included);
+								 &fpw_lsn, &num_fpi, &topxid_included,
+								 cid);
 
 		EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
 								  topxid_included);
@@ -538,30 +547,57 @@ XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo)
 static XLogRecData *
 XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 				   XLogRecPtr RedoRecPtr, bool doPageWrites,
-				   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
+				   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included,
+				   CommandId cid)
 {
 	XLogRecData *rdt;
 	uint32		total_len = 0;
 	int			block_id;
+	bool		only_hdr = true;
 	pg_crc32c	rdata_crc;
 	registered_buffer *prev_regbuf = NULL;
 	XLogRecData *rdt_datas_last;
-	XLogRecord *rechdr;
+	XLogRecord *rechdr = (XLogRecord *) hdr_scratch;
 	char	   *scratch = hdr_scratch;
 
+	Assert((info & ~(XLR_INFO_USERFLAGS)) == 0);
+
 	/*
 	 * Note: this function can be called multiple times for the same record.
 	 * All the modifications we do to the rdata chains below must handle that.
 	 */
 
-	/* The record begins with the fixed-size header */
-	rechdr = (XLogRecord *) scratch;
-	scratch += SizeOfXLogRecord;
+	/*
+	 * The record begins with the variable-size header data. We pre-allocate
+	 * the fixed part of the xlog header section, plus the length field, as
+	 * we'll only fill those at the end of the record. The rest can be
+	 * pre-filled.
+	 */
+	scratch += MinXLogHeaderSize + sizeof(uint32);
 
 	hdr_rdt.next = NULL;
 	rdt_datas_last = &hdr_rdt;
 	hdr_rdt.data = hdr_scratch;
 
+	if (IsSubxactTopXidLogPending())
+		info |= XLR_HAS_XID;
+
+	if (info & XLR_HAS_XID)
+	{
+		TransactionId xid = GetCurrentTransactionIdIfAny();
+		memcpy(scratch, (char *) &xid, sizeof(TransactionId));
+		scratch += sizeof(TransactionId);
+	}
+
+	if (info & XLR_HAS_CID)
+	{
+		memcpy(scratch, (char *) &cid, sizeof(CommandId));
+		scratch += sizeof(CommandId);
+	}
+
+	if (rminfo != 0)
+		*(scratch++) = rminfo;
+
 	/*
 	 * Enforce consistency checks for this record if user is looking for it.
 	 * Do this before at the beginning of this routine to give the possibility
@@ -592,6 +628,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 		if (!regbuf->in_use)
 			continue;
 
+		only_hdr = false;
 		/* Determine if this block needs to be backed up */
 		if (regbuf->flags & REGBUF_FORCE_IMAGE)
 			needs_backup = true;
@@ -832,6 +869,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 	if ((curinsert_flags & XLOG_INCLUDE_ORIGIN) &&
 		replorigin_session_origin != InvalidRepOriginId)
 	{
+		only_hdr = false;
 		*(scratch++) = (char) XLR_BLOCK_ID_ORIGIN;
 		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
 		scratch += sizeof(replorigin_session_origin);
@@ -845,6 +883,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 		/* Set the flag that the top xid is included in the WAL */
 		*topxid_included = true;
 
+		only_hdr = false;
 		*(scratch++) = (char) XLR_BLOCK_ID_TOPLEVEL_XID;
 		memcpy(scratch, &xid, sizeof(TransactionId));
 		scratch += sizeof(TransactionId);
@@ -853,6 +892,8 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 	/* followed by main data, if any */
 	if (mainrdata_len > 0)
 	{
+		only_hdr = false;
+
 		if (mainrdata_len > 255)
 		{
 			*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
@@ -873,16 +914,75 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 	hdr_rdt.len = (scratch - hdr_scratch);
 	total_len += hdr_rdt.len;
 
+
+	/* ensure that we haven't yet set the length mask */
+	Assert((info & XLR2_LEN_MASK) == 0);
+
+	/*
+	 * Here, the full xlog header has been constructed, except for the xlog
+	 * record length, and the constant fields in the xlog header.
+	 */
+
+	/* fill the xlog header length field and mask */
+	if (only_hdr)
+	{
+		Assert(total_len - sizeof(uint32) == XLRSizeOfHeader(info));
+		info |= XLR2_LEN_ABSENT;
+		memmove(hdr_scratch + MinXLogHeaderSize,
+				hdr_scratch + MinXLogHeaderSize + sizeof(uint32),
+				hdr_rdt.len - MinXLogHeaderSize - sizeof(uint32));
+		hdr_rdt.len = hdr_rdt.len - sizeof(uint32);
+	}
+	else if (total_len - sizeof(uint32) <= UINT8_MAX - sizeof(uint8))
+	{
+		uint8 size = (uint8) (total_len - sizeof(uint32)) + sizeof(uint8);
+		info |= XLR2_LEN_1B;
+		memmove(hdr_scratch + MinXLogHeaderSize + sizeof(uint8),
+				hdr_scratch + MinXLogHeaderSize + sizeof(uint32),
+				hdr_rdt.len - MinXLogHeaderSize - sizeof(uint32));
+		memcpy(hdr_scratch + MinXLogHeaderSize, (char *) &size, sizeof(uint8));
+		total_len = size;
+		hdr_rdt.len = hdr_rdt.len - sizeof(uint32) + sizeof(uint8);
+	}
+	else if (total_len - sizeof(uint32) <= UINT16_MAX - sizeof(uint16))
+	{
+		uint16 size = (uint16) (total_len - sizeof(uint32)) + sizeof(uint16);
+		info |= XLR2_LEN_2B;
+		memmove(hdr_scratch + MinXLogHeaderSize + sizeof(uint16),
+				hdr_scratch + MinXLogHeaderSize + sizeof(uint32),
+				hdr_rdt.len - MinXLogHeaderSize - sizeof(uint32));
+		memcpy(hdr_scratch + MinXLogHeaderSize, (char *) &size, sizeof(uint16));
+		total_len = size;
+		hdr_rdt.len = hdr_rdt.len - sizeof(uint32) + sizeof(uint16);
+	}
+	else
+	{
+		uint32 size = total_len;
+		info |= XLR2_LEN_4B;
+		memcpy(hdr_scratch + MinXLogHeaderSize, (char *) &size, sizeof(uint32));
+		total_len = size;
+	}
+
+	/*
+	 * We've filled all variable-length data of the xlog header section,
+	 * which allows us to start CRC-ing the data.
+	 */
+
+	rechdr->xl_rmid = rmid;
+	rechdr->xl_info = info;
+
 	/*
 	 * Calculate CRC of the data
 	 *
 	 * Note that the record header isn't added into the CRC initially since we
 	 * don't know the prev-link yet.  Thus, the CRC will represent the CRC of
-	 * the whole record in the order: rdata, then backup blocks, then record
-	 * header.
+	 * the whole record in the order: xl_rmid, xl_info, varlen header fields,
+	 * rdata, then backup blocks, then prevptr.
 	 */
 	INIT_CRC32C(rdata_crc);
-	COMP_CRC32C(rdata_crc, hdr_scratch + SizeOfXLogRecord, hdr_rdt.len - SizeOfXLogRecord);
+	COMP_CRC32C(rdata_crc,
+				hdr_rdt.data + offsetof(XLogRecord, xl_rmid),
+				hdr_rdt.len - offsetof(XLogRecord, xl_rmid));
 	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
 		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
 
@@ -891,11 +991,6 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 	 * once we know where in the WAL the record will be inserted. The CRC does
 	 * not include the record header yet.
 	 */
-	rechdr->xl_xid = GetCurrentTransactionIdIfAny();
-	rechdr->xl_tot_len = total_len;
-	rechdr->xl_info = info;
-	rechdr->xl_rmid = rmid;
-	rechdr->xl_rminfo = rminfo;
 	rechdr->xl_prev = InvalidXLogRecPtr;
 	rechdr->xl_crc = rdata_crc;
 
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index e5d4f36182..f885670263 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -983,7 +983,7 @@ XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher, XLogRecPtr recPtr)
  * A wrapper for XLogReadRecord() that provides the same interface, but also
  * tries to initiate I/O for blocks referenced in future WAL records.
  */
-XLogRecord *
+XLRHeaderData *
 XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher, char **errmsg)
 {
 	DecodedXLogRecord *record;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 9200d7f56c..56bf86cb27 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -49,7 +49,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
-								  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
+								  XLogRecPtr PrevRecPtr, XLogRecord *record,
+								  bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
 							XLogRecPtr recptr);
 static void ResetDecoder(XLogReaderState *state);
@@ -418,7 +419,7 @@ XLogNextRecord(XLogReaderState *state, char **errormsg)
  * The returned pointer (or *errormsg) points to an internal buffer that's
  * valid until the next call to XLogReadRecord.
  */
-XLogRecord *
+XLRHeaderData *
 XLogReadRecord(XLogReaderState *state, char **errormsg)
 {
 	DecodedXLogRecord *decoded;
@@ -531,6 +532,83 @@ XLogReadRecordAlloc(XLogReaderState *state, size_t xl_tot_len, bool allow_oversi
 	return NULL;
 }
 
+/*
+ * Fills contdata with pointer to record continuation data of the next page,
+ * and len with the length of the data on that page.
+ */
+#define ReadPageData(dataNeeded, gotLen, expectLen, strict) \
+	do \
+	{ \
+		/* Wait for the next page to become available */ \
+		readOff = ReadPageInternal(state, targetPagePtr, \
+								   Min((dataNeeded) + SizeOfXLogShortPHD, \
+									   XLOG_BLCKSZ)); \
+		\
+		if (readOff == XLREAD_WOULDBLOCK) \
+			return XLREAD_WOULDBLOCK; \
+		else if (readOff < 0) \
+			goto err; \
+		Assert(SizeOfXLogShortPHD <= readOff); \
+		pageHeader = (XLogPageHeader) state->readBuf; \
+		/*
+		 * If we were expecting a continuation record and got an
+		 * "overwrite contrecord" flag, that means the continuation record
+		 * was overwritten with a different record.  Restart the read by
+		 * assuming the address to read is the location where we found
+		 * this flag; but keep track of the LSN of the record we were
+		 * reading, for later verification.
+		 */ \
+		if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD) \
+		{ \
+			state->overwrittenRecPtr = RecPtr; \
+			RecPtr = targetPagePtr; \
+			goto restart; \
+		} \
+		\
+		/* Check that the continuation on next page looks valid */ \
+		if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD)) \
+		{ \
+			report_invalid_record(state, \
+								  "there is no contrecord flag at %X/%X", \
+								  LSN_FORMAT_ARGS(RecPtr)); \
+			goto err; \
+		} \
+		/*
+		 * Cross-check that xlp_rem_len agrees with how much of the record
+		 * we expect there to be left.
+		 */ \
+		if (pageHeader->xlp_rem_len == 0 || (strict) ? ( \
+				(expectLen) != (pageHeader->xlp_rem_len + (gotLen)) \
+			) : ( \
+				(expectLen) < (pageHeader->xlp_rem_len + (gotLen)) \
+			)) \
+		{ \
+			report_invalid_record(state, \
+								  "invalid contrecord length %u (expected %lld) at %X/%X", \
+								  pageHeader->xlp_rem_len, \
+								  ((long long) (expectLen)) - (gotLen), \
+								  LSN_FORMAT_ARGS(RecPtr)); \
+			goto err; \
+		} \
+		/* Append the continuation from this page to the buffer */ \
+		pageHeaderSize = XLogPageHeaderSize(pageHeader); \
+		\
+		if (readOff < Min(pageHeaderSize + (dataNeeded), XLOG_BLCKSZ)) \
+		{ \
+			readOff = ReadPageInternal(state, targetPagePtr, \
+									   Min(pageHeaderSize + (dataNeeded), XLOG_BLCKSZ)); \
+			if (readOff == XLREAD_WOULDBLOCK) \
+				return XLREAD_WOULDBLOCK; \
+			else if (readOff < 0) \
+				goto err; \
+		} \
+		Assert(pageHeaderSize <= readOff); \
+		\
+		/* point to the right */\
+		contdata = (char *) state->readBuf + pageHeaderSize; \
+		len = readOff - pageHeaderSize; \
+	} while (false)
+
 static XLogPageReadResult
 XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 {
@@ -543,10 +621,16 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	uint32		targetRecOff;
 	uint32		pageHeaderSize;
 	bool		assembled;
-	bool		gotheader;
 	int			readOff;
 	DecodedXLogRecord *decoded;
 	char	   *errormsg;		/* not used */
+	union {
+		char bytes[MaxXLogHeaderSize];
+		XLogRecord hdr;
+	} rechdr = {0}; /* record header buffer for cross-page header reads */
+	uint32		rechdrsize = 0;
+	XLogPageHeader pageHeader;
+	char	   *contdata;
 
 	/*
 	 * randAccess indicates whether to verify the previous-record pointer of
@@ -602,7 +686,8 @@ restart:
 	 * fits on the same page.
 	 */
 	readOff = ReadPageInternal(state, targetPagePtr,
-							   Min(targetRecOff + SizeOfXLogRecord, XLOG_BLCKSZ));
+							   Min(targetRecOff + MAXALIGN(MinXLogHeaderSize),
+								   XLOG_BLCKSZ));
 	if (readOff == XLREAD_WOULDBLOCK)
 		return XLREAD_WOULDBLOCK;
 	else if (readOff < 0)
@@ -639,47 +724,126 @@ restart:
 	/* ReadPageInternal has verified the page header */
 	Assert(pageHeaderSize <= readOff);
 
-	/*
-	 * Read the record length.
-	 *
-	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
-	 */
-	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
+	contdata = state->readBuf + targetRecOff;
+	len = state->readLen - targetRecOff;
+
+	rechdrsize = Min(len, MinXLogHeaderSize);
+
+	memcpy(&rechdr.bytes[0], contdata, rechdrsize);
+
+	contdata += rechdrsize;
+	len -= rechdrsize;
+
 
 	/*
-	 * If the whole record header is on this page, validate it immediately.
-	 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-	 * rest of the header after reading it from the next page.  The xl_tot_len
-	 * check is necessary here to ensure that we enter the "Need to reassemble
-	 * record" code path below; otherwise we might fail to apply
-	 * ValidXLogRecordHeader at all.
+	 * If we don't yet have the minimum required data of an XLogRecord,
+	 * we have to read from the next page.
 	 */
-	if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
+	if (rechdrsize < MinXLogHeaderSize)
 	{
-		if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
-								   randAccess))
-			goto err;
-		gotheader = true;
+		int gotlen;
+
+		/* Calculate pointer to beginning of next page */
+		targetPagePtr += XLOG_BLCKSZ;
+		ReadPageData(MinXLogHeaderSize - rechdrsize,
+					 rechdrsize,
+					 MinXLogHeaderSize,
+					 false);
+
+		gotlen = Min(len, MinXLogHeaderSize - rechdrsize);
+
+		memcpy(&rechdr.bytes[rechdrsize],
+			   contdata,
+			   gotlen);
+		rechdrsize += gotlen;
+		contdata += gotlen;
+		len -= gotlen;
 	}
-	else
+
+	Assert(rechdrsize >= MinXLogHeaderSize);
+
+	/*
+	 * We now have a minimal XLogHeader in buffers. This is enough
+	 * to determine the size of all header data, but we don't yet have
+	 * all of that.
+	 */
+
+	total_len = XLRSizeOfHeader(rechdr.hdr.xl_info);
+
+	/*
+	 * We now known the expected size of the xlog header structures
+	 * (which includes the xl_rec_len field).
+	 * 
+	 * If the full headers section requires more data than what we've loaded
+	 * by now, we need to read that data.
+	 */
+	if (total_len > rechdrsize)
 	{
-		/* XXX: more validation should be done here */
-		if (total_len < SizeOfXLogRecord)
+		/* Consume any bytes that are still in the buffer */
+		if (len > 0)
 		{
-			report_invalid_record(state,
-								  "invalid record length at %X/%X: wanted %u, got %u",
-								  LSN_FORMAT_ARGS(RecPtr),
-								  (uint32) SizeOfXLogRecord, total_len);
-			goto err;
+			int consumed = Min(len, total_len - rechdrsize);
+
+			memcpy(&rechdr.bytes[rechdrsize],
+				   contdata,
+				   consumed);
+			contdata += consumed;
+			rechdrsize += consumed;
+			len -= consumed;
+		}
+
+		/* Continue reading from this page if the page wasn't fully read */
+		if (total_len > rechdrsize && state->readLen < XLOG_BLCKSZ)
+		{
+			int consumed;
+			int needed = Min(state->readLen + total_len - rechdrsize,
+							 XLOG_BLCKSZ);
+
+			ReadPageData(needed, rechdrsize, total_len, false);
+			consumed = Min(total_len - rechdrsize, len);
+
+			memcpy(&rechdr.bytes[rechdrsize],
+				   contdata,
+				   consumed);
+
+			contdata += consumed;
+			rechdrsize += consumed;
+			len -= consumed;
+		}
+
+		/* If we're still not done, read from the next page */
+		if (total_len > rechdrsize && state->readLen >= XLOG_BLCKSZ)
+		{
+			int consumed,
+				needed = total_len - rechdrsize;
+			targetPagePtr += XLOG_BLCKSZ;
+
+			ReadPageData(needed, rechdrsize, total_len, false);
+			consumed = Min(total_len - rechdrsize, len);
+
+			memcpy(&rechdr.bytes[rechdrsize],
+				   contdata,
+				   consumed);
+
+			contdata += consumed;
+			rechdrsize += consumed;
+			len -= consumed;
 		}
-		gotheader = false;
 	}
 
+	/*
+	 * By now, the full record header is loaded; allowing us to read the
+	 * record's length field.
+	*/
+	total_len = XLogRecordGetLength(&rechdr.hdr);
+
+	/*
+	 * Because the xlog header is loaded, we should validate it as well.
+	 */
+	if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, &rechdr.hdr,
+							   randAccess))
+		goto err;
+
 	/*
 	 * Find space to decode this record.  Don't allow oversized allocation if
 	 * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -705,12 +869,9 @@ restart:
 		goto err;
 	}
 
-	len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
-	if (total_len > len)
+	if (total_len > (XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ))
 	{
 		/* Need to reassemble record */
-		char	   *contdata;
-		XLogPageHeader pageHeader;
 		char	   *buffer;
 		uint32		gotlen;
 
@@ -728,105 +889,51 @@ restart:
 			goto err;
 		}
 
-		/* Copy the first fragment of the record from the first page. */
-		memcpy(state->readRecordBuf,
-			   state->readBuf + RecPtr % XLOG_BLCKSZ, len);
-		buffer = state->readRecordBuf + len;
-		gotlen = len;
-
-		do
-		{
-			/* Calculate pointer to beginning of next page */
-			targetPagePtr += XLOG_BLCKSZ;
-
-			/* Wait for the next page to become available */
-			readOff = ReadPageInternal(state, targetPagePtr,
-									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
+		/* Copy the buffered record header fragment */
+		memcpy(state->readRecordBuf, &rechdr.bytes[0], rechdrsize);
 
-			if (readOff == XLREAD_WOULDBLOCK)
-				return XLREAD_WOULDBLOCK;
-			else if (readOff < 0)
-				goto err;
+		len = Min(len, total_len - rechdrsize);
 
-			Assert(SizeOfXLogShortPHD <= readOff);
+		/* Copy the remaining bytes of the page */
+		memcpy(state->readRecordBuf + rechdrsize, contdata, len);
 
-			pageHeader = (XLogPageHeader) state->readBuf;
+		buffer = state->readRecordBuf + rechdrsize + len;
+		gotlen = rechdrsize + len;
+		
+		/* Finish the last bytes of this page when available */
+		if (state->readLen < XLOG_BLCKSZ)
+		{
+			int buffered = state->readLen;
+			ReadPageData((state->readLen - SizeOfXLogShortPHD) + (total_len - gotlen),
+						 gotlen, total_len, true);
 
-			/*
-			 * If we were expecting a continuation record and got an
-			 * "overwrite contrecord" flag, that means the continuation record
-			 * was overwritten with a different record.  Restart the read by
-			 * assuming the address to read is the location where we found
-			 * this flag; but keep track of the LSN of the record we were
-			 * reading, for later verification.
-			 */
-			if (pageHeader->xlp_info & XLP_FIRST_IS_OVERWRITE_CONTRECORD)
-			{
-				state->overwrittenRecPtr = RecPtr;
-				RecPtr = targetPagePtr;
-				goto restart;
-			}
+			if (pageHeader->xlp_rem_len < len)
+				len = pageHeader->xlp_rem_len;
 
-			/* Check that the continuation on next page looks valid */
-			if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
-			{
-				report_invalid_record(state,
-									  "there is no contrecord flag at %X/%X",
-									  LSN_FORMAT_ARGS(RecPtr));
-				goto err;
-			}
+			buffer += buffered;
+			len -= buffered;
 
-			/*
-			 * Cross-check that xlp_rem_len agrees with how much of the record
-			 * we expect there to be left.
-			 */
-			if (pageHeader->xlp_rem_len == 0 ||
-				total_len != (pageHeader->xlp_rem_len + gotlen))
-			{
-				report_invalid_record(state,
-									  "invalid contrecord length %u (expected %lld) at %X/%X",
-									  pageHeader->xlp_rem_len,
-									  ((long long) total_len) - gotlen,
-									  LSN_FORMAT_ARGS(RecPtr));
-				goto err;
-			}
+			memcpy(buffer, contdata, len);
 
-			/* Append the continuation from this page to the buffer */
-			pageHeaderSize = XLogPageHeaderSize(pageHeader);
+			buffer += len;
+			gotlen += len;
+		}
 
-			if (readOff < pageHeaderSize)
-				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize);
+		do
+		{
+			/* Calculate pointer to beginning of next page */
+			targetPagePtr += XLOG_BLCKSZ;
 
-			Assert(pageHeaderSize <= readOff);
+			ReadPageData(total_len - gotlen, gotlen, total_len, true);
 
-			contdata = (char *) state->readBuf + pageHeaderSize;
-			len = XLOG_BLCKSZ - pageHeaderSize;
 			if (pageHeader->xlp_rem_len < len)
 				len = pageHeader->xlp_rem_len;
 
-			if (readOff < pageHeaderSize + len)
-				readOff = ReadPageInternal(state, targetPagePtr,
-										   pageHeaderSize + len);
-
-			memcpy(buffer, (char *) contdata, len);
+			memcpy(buffer, contdata, len);
 			buffer += len;
 			gotlen += len;
-
-			/* If we just reassembled the record header, validate it. */
-			if (!gotheader)
-			{
-				record = (XLogRecord *) state->readRecordBuf;
-				if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr,
-										   record, randAccess))
-					goto err;
-				gotheader = true;
-			}
 		} while (gotlen < total_len);
 
-		Assert(gotheader);
-
 		record = (XLogRecord *) state->readRecordBuf;
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
@@ -846,6 +953,8 @@ restart:
 		else if (readOff < 0)
 			goto err;
 
+		record = (XLogRecord *) (state->readBuf + targetRecOff);
+
 		/* Record does not cross a page boundary */
 		if (!ValidXLogRecord(state, record, RecPtr))
 			goto err;
@@ -859,7 +968,7 @@ restart:
 	 * Special processing if it's an XLOG SWITCH record
 	 */
 	if (record->xl_rmid == RM_XLOG_ID &&
-		record->xl_rminfo == XLOG_SWITCH)
+		XLogRecordGetRMInfo(record) == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
 		state->NextRecPtr += state->segcxt.ws_segsize - 1;
@@ -1116,12 +1225,13 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 					  XLogRecPtr PrevRecPtr, XLogRecord *record,
 					  bool randAccess)
 {
-	if (record->xl_tot_len < SizeOfXLogRecord)
+	uint32 reclen = XLogRecordGetLength(record);
+	if (reclen < MinXLogHeaderSize)
 	{
 		report_invalid_record(state,
 							  "invalid record length at %X/%X: wanted %u, got %u",
 							  LSN_FORMAT_ARGS(RecPtr),
-							  (uint32) SizeOfXLogRecord, record->xl_tot_len);
+							  (uint32) MinXLogHeaderSize, reclen);
 		return false;
 	}
 	if (!RmgrIdIsValid(record->xl_rmid))
@@ -1184,16 +1294,28 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)
 
 	/* Calculate the CRC */
 	INIT_CRC32C(crc);
-	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
-	/* include the record header last */
+	COMP_CRC32C(crc,
+				((char *) record) + offsetof(XLogRecord, xl_rmid),
+				XLogRecordGetLength(record) - offsetof(XLogRecord, xl_rmid));
+	/* include the xl_prev pointer last */
 	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32C(crc);
 
 	if (!EQ_CRC32C(record->xl_crc, crc))
 	{
 		report_invalid_record(state,
-							  "incorrect resource manager data checksum in record at %X/%X",
-							  LSN_FORMAT_ARGS(recptr));
+							  "incorrect resource manager data checksum in record at %X/%X\n"
+							  "xl_crc: record: %d; calculated: %d\n"
+							  "xl_info: %x\n"
+							  "xl_prev: %X/%X\n"
+							  "xl_rmid: %X\n",
+							  LSN_FORMAT_ARGS(recptr),
+							  record->xl_crc,
+							  crc,
+							  record->xl_info,
+							  LSN_FORMAT_ARGS(record->xl_prev),
+							  record->xl_rmid);
+		Assert(false);
 		return false;
 	}
 
@@ -1657,11 +1779,11 @@ DecodeXLogRecord(XLogReaderState *state,
 	 */
 #define COPY_HEADER_FIELD(_dst, _size)			\
 	do {										\
-		if (remaining < _size)					\
+		if (remaining < (_size))				\
 			goto shortdata_err;					\
-		memcpy(_dst, ptr, _size);				\
-		ptr += _size;							\
-		remaining -= _size;						\
+		memcpy((_dst), ptr, (_size));			\
+		ptr += (_size);							\
+		remaining -= (_size);					\
 	} while(0)
 
 	char	   *ptr;
@@ -1671,7 +1793,7 @@ DecodeXLogRecord(XLogReaderState *state,
 	RelFileLocator *rlocator = NULL;
 	uint8		block_id;
 
-	decoded->header = *record;
+	decoded->header = (XLRHeaderData) {0};
 	decoded->lsn = lsn;
 	decoded->next = NULL;
 	decoded->record_origin = InvalidRepOriginId;
@@ -1679,11 +1801,53 @@ DecodeXLogRecord(XLogReaderState *state,
 	decoded->main_data = NULL;
 	decoded->main_data_len = 0;
 	decoded->max_block_id = -1;
+
+	decoded->header.xl_prev = record->xl_prev;
+	decoded->header.xl_crc = record->xl_crc;
+	decoded->header.xl_rmid = record->xl_rmid;
+	decoded->header.xl_info = record->xl_info;
+	decoded->header.xl_tot_len = XLogRecordGetLength(record);
+	
+	remaining = decoded->header.xl_tot_len - MinXLogHeaderSize;
 	ptr = (char *) record;
-	ptr += SizeOfXLogRecord;
-	remaining = record->xl_tot_len - SizeOfXLogRecord;
+	ptr += MinXLogHeaderSize;
+
+	switch (record->xl_info & XLR2_LEN_MASK)
+	{
+		case XLR2_LEN_ABSENT:
+			break;
+		case XLR2_LEN_1B:
+			ptr += 1;
+			remaining -= 1;
+			break;
+		case XLR2_LEN_2B:
+			ptr += 2;
+			remaining -= 2;
+			break;
+		case XLR2_LEN_4B:
+			ptr += 4;
+			remaining -= 4;
+			break;
+		default:
+			pg_unreachable();
+	}
+	
+	if (record->xl_info & XLR_HAS_XID)
+		COPY_HEADER_FIELD(&decoded->header.xl_xid, sizeof(TransactionId));
+	else
+		decoded->header.xl_xid = InvalidTransactionId;
+	
+	if (record->xl_info & XLR_HAS_CID)
+		COPY_HEADER_FIELD(&decoded->header.xl_cid, sizeof(CommandId));
+	else
+		decoded->header.xl_cid = InvalidCommandId;
+	
+	if (record->xl_info & XLR_HAS_RMINFO)
+		COPY_HEADER_FIELD(&decoded->header.xl_rminfo, sizeof(uint8));
+	else
+		decoded->header.xl_rminfo = 0;
 
-	/* Decode the headers */
+	/* Decode the block headers */
 	datatotal = 0;
 	while (remaining > datatotal)
 	{
@@ -1933,7 +2097,7 @@ DecodeXLogRecord(XLogReaderState *state,
 
 	/* Report the actual size we used. */
 	decoded->size = MAXALIGN(out - (char *) decoded);
-	Assert(DecodeXLogRecordRequiredSpace(record->xl_tot_len) >=
+	Assert(DecodeXLogRecordRequiredSpace(decoded->header.xl_tot_len) >=
 		   decoded->size);
 
 	return true;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 96d12994b5..d8c380f694 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -384,7 +384,7 @@ static char recoveryStopName[MAXFNAMELEN];
 static bool recoveryStopAfter;
 
 /* prototypes for local functions */
-static void ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *replayTLI);
+static void ApplyWalRecord(XLogReaderState *xlogreader, XLRHeaderData *record, TimeLineID *replayTLI);
 
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
@@ -412,9 +412,9 @@ static void recoveryPausesHere(bool endOfRecovery);
 static bool recoveryApplyDelay(XLogReaderState *record);
 static void ConfirmRecoveryPaused(void);
 
-static XLogRecord *ReadRecord(XLogPrefetcher *xlogprefetcher,
-							  int emode, bool fetching_ckpt,
-							  TimeLineID replayTLI);
+static XLRHeaderData *ReadRecord(XLogPrefetcher *xlogprefetcher,
+								 int emode, bool fetching_ckpt,
+								 TimeLineID replayTLI);
 
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
@@ -426,8 +426,8 @@ static XLogPageReadResult WaitForWALToBecomeAvailable(XLogRecPtr RecPtr,
 													  XLogRecPtr replayLSN,
 													  bool nonblocking);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
-static XLogRecord *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher,
-										XLogRecPtr RecPtr, TimeLineID replayTLI);
+static XLRHeaderData *ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher,
+										   XLogRecPtr RecPtr, TimeLineID replayTLI);
 static bool rescanLatestTimeLine(TimeLineID replayTLI, XLogRecPtr replayLSN);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
@@ -497,7 +497,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	XLogPageReadPrivate *private;
 	struct stat st;
 	bool		wasShutdown;
-	XLogRecord *record;
+	XLRHeaderData *record;
 	DBState		dbstate_at_startup;
 	bool		haveTblspcMap = false;
 	bool		haveBackupLabel = false;
@@ -1566,7 +1566,7 @@ ShutdownWalRecovery(void)
 void
 PerformWalRecovery(void)
 {
-	XLogRecord *record;
+	XLRHeaderData *record;
 	bool		reachedRecoveryTarget = false;
 	TimeLineID	replayTLI;
 
@@ -1811,7 +1811,7 @@ PerformWalRecovery(void)
  * Subroutine of PerformWalRecovery, to apply one WAL record.
  */
 static void
-ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *replayTLI)
+ApplyWalRecord(XLogReaderState *xlogreader, XLRHeaderData *record, TimeLineID *replayTLI)
 {
 	ErrorContextCallback errcallback;
 	bool		switchedTLI = false;
@@ -3004,11 +3004,11 @@ ConfirmRecoveryPaused(void)
  * (emode must be either PANIC, LOG). In standby mode, retries until a valid
  * record is available.
  */
-static XLogRecord *
+static XLRHeaderData *
 ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
 		   bool fetching_ckpt, TimeLineID replayTLI)
 {
-	XLogRecord *record;
+	XLRHeaderData *record;
 	XLogReaderState *xlogreader = XLogPrefetcherGetReader(xlogprefetcher);
 	XLogPageReadPrivate *private = (XLogPageReadPrivate *) xlogreader->private_data;
 
@@ -3912,11 +3912,11 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 /*
  * Subroutine to try to fetch and validate a prior checkpoint record.
  */
-static XLogRecord *
+static XLRHeaderData *
 ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 					 TimeLineID replayTLI)
 {
-	XLogRecord *record;
+	XLRHeaderData *record;
 	uint8		rminfo;
 
 	Assert(xlogreader != NULL);
@@ -3951,7 +3951,10 @@ ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 				(errmsg("invalid xl_info in checkpoint record")));
 		return NULL;
 	}
-	if (record->xl_tot_len != SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort + sizeof(CheckPoint))
+	if (record->xl_tot_len != (MinXLogHeaderSize
+								   + sizeof(uint8) /* length field */
+								   + SizeOfXLogRecordDataHeaderShort
+								   + sizeof(CheckPoint)))
 	{
 		ereport(LOG,
 				(errmsg("invalid length of checkpoint record")));
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 4ef46a1855..fa4be2c68b 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -194,7 +194,8 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-	XLogInsertExtended(RM_SMGR_ID, XLR_SPECIAL_REL_UPDATE, XLOG_SMGR_CREATE);
+	XLogInsertExtended(RM_SMGR_ID, XLR_SPECIAL_REL_UPDATE | XLR_HAS_XID,
+					   XLOG_SMGR_CREATE, InvalidCommandId);
 }
 
 /*
@@ -376,8 +377,9 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 
 		lsn = XLogInsertExtended(RM_SMGR_ID,
-								 XLR_SPECIAL_REL_UPDATE,
-								 XLOG_SMGR_TRUNCATE);
+								 XLR_SPECIAL_REL_UPDATE | XLR_HAS_XID,
+								 XLOG_SMGR_TRUNCATE,
+								 InvalidCommandId);
 
 		/*
 		 * Flush, because otherwise the truncation of the main relation might
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 25c4672917..fea070d768 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -624,9 +624,8 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
 			XLogRegisterData((char *) &xlrec,
 							 sizeof(xl_dbase_create_file_copy_rec));
 
-			(void) XLogInsertExtended(RM_DBASE_ID,
-									  XLR_SPECIAL_REL_UPDATE,
-									  XLOG_DBASE_CREATE_FILE_COPY);
+			(void) XLogInsertExtended(RM_DBASE_ID, XLR_SPECIAL_REL_UPDATE,
+									  XLOG_DBASE_CREATE_FILE_COPY, InvalidCommandId);
 		}
 		pfree(srcpath);
 		pfree(dstpath);
@@ -2022,9 +2021,8 @@ movedb(const char *dbname, const char *tblspcname)
 			XLogRegisterData((char *) &xlrec,
 							 sizeof(xl_dbase_create_file_copy_rec));
 
-			(void) XLogInsertExtended(RM_DBASE_ID,
-									  XLR_SPECIAL_REL_UPDATE,
-									  XLOG_DBASE_CREATE_FILE_COPY);
+			(void) XLogInsertExtended(RM_DBASE_ID, XLR_SPECIAL_REL_UPDATE,
+									  XLOG_DBASE_CREATE_FILE_COPY, InvalidCommandId);
 		}
 
 		/*
@@ -2117,9 +2115,8 @@ movedb(const char *dbname, const char *tblspcname)
 		XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
 		XLogRegisterData((char *) &src_tblspcoid, sizeof(Oid));
 
-		(void) XLogInsertExtended(RM_DBASE_ID,
-								  XLR_SPECIAL_REL_UPDATE,
-								  XLOG_DBASE_DROP);
+		(void) XLogInsertExtended(RM_DBASE_ID, XLR_SPECIAL_REL_UPDATE,
+								  XLOG_DBASE_DROP, InvalidCommandId);
 	}
 
 	/* Now it's safe to release the database lock */
@@ -2837,9 +2834,8 @@ remove_dbtablespaces(Oid db_id)
 		XLogRegisterData((char *) &xlrec, MinSizeOfDbaseDropRec);
 		XLogRegisterData((char *) tablespace_ids, ntblspc * sizeof(Oid));
 
-		(void) XLogInsertExtended(RM_DBASE_ID,
-								  XLR_SPECIAL_REL_UPDATE,
-								  XLOG_DBASE_DROP);
+		(void) XLogInsertExtended(RM_DBASE_ID, XLR_SPECIAL_REL_UPDATE,
+								  XLOG_DBASE_DROP, InvalidCommandId);
 	}
 
 	list_free(ltblspc);
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d6abeb9a9d..7cbbb9b73a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -200,7 +200,10 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				ParseCommitRecord(XLogRecGetRmInfo(buf->record), xlrec, &parsed);
 
 				if (!TransactionIdIsValid(parsed.twophase_xid))
+				{
+					Assert(info == XLOG_XACT_COMMIT);
 					xid = XLogRecGetXid(r);
+				}
 				else
 					xid = parsed.twophase_xid;
 
@@ -228,7 +231,10 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				ParseAbortRecord(XLogRecGetRmInfo(buf->record), xlrec, &parsed);
 
 				if (!TransactionIdIsValid(parsed.twophase_xid))
+				{
+					Assert(info == XLOG_XACT_ABORT);
 					xid = XLogRecGetXid(r);
+				}
 				else
 					xid = parsed.twophase_xid;
 
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 625a7f4273..6faa519a35 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -600,7 +600,7 @@ DecodingContextFindStartpoint(LogicalDecodingContext *ctx)
 	/* Wait for a consistent starting point */
 	for (;;)
 	{
-		XLogRecord *record;
+		XLRHeaderData *record;
 		char	   *err = NULL;
 
 		/* the read_page callback waits for new WAL */
diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index 7fa2b2cba7..a2a13c8ae4 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -256,7 +256,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 		/* Decode until we run out of records */
 		while (ctx->reader->EndRecPtr < end_of_wal)
 		{
-			XLogRecord *record;
+			XLRHeaderData *record;
 			char	   *errm = NULL;
 
 			record = XLogReadRecord(ctx->reader, &errm);
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index ca945994ef..6ffecd0fcc 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -495,7 +495,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		while (ctx->reader->EndRecPtr < moveto)
 		{
 			char	   *errm = NULL;
-			XLogRecord *record;
+			XLRHeaderData *record;
 
 			/*
 			 * Read records.  No changes are generated in fast_forward mode,
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e9ba500a15..8129a2552b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3019,7 +3019,7 @@ retry:
 static void
 XLogSendLogical(void)
 {
-	XLogRecord *record;
+	XLRHeaderData *record;
 	char	   *errm;
 
 	/*
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index eb5782f82a..78f17b5d09 100644
--- a/src/backend/utils/cache/inval.c
+++ b/src/backend/utils/cache/inval.c
@@ -1632,6 +1632,7 @@ LogLogicalInvalidations(void)
 		ProcessMessageSubGroupMulti(group, RelCacheMsgs,
 									XLogRegisterData((char *) msgs,
 													 n * sizeof(SharedInvalidationMessage)));
-		XLogInsert(RM_XACT_ID, XLOG_XACT_INVALIDATIONS);
+		XLogInsertExtended(RM_XACT_ID, XLR_HAS_XID,
+						   XLOG_XACT_INVALIDATIONS, InvalidCommandId);
 	}
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 089063f471..380d0a260c 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -1046,6 +1046,9 @@ WriteEmptyXLOG(void)
 	int			fd;
 	int			nbytes;
 	char	   *recptr;
+	const uint32 rec_tot_len = MinXLogHeaderSize + sizeof(uint8) /* xl_tot_len */
+		+ SizeOfXLogRecordDataHeaderShort + sizeof(CheckPoint);
+	const uint8 rec_len = rec_tot_len;
 
 	memset(buffer.data, 0, XLOG_BLCKSZ);
 
@@ -1060,23 +1063,33 @@ WriteEmptyXLOG(void)
 	longpage->xlp_seg_size = WalSegSz;
 	longpage->xlp_xlog_blcksz = XLOG_BLCKSZ;
 
+	/*
+	 * Ensure that the length actually fits within the 8-bit length field in
+	 * the header.
+	 */
+	Assert(rec_tot_len < UINT8_MAX);
+
 	/* Insert the initial checkpoint record */
 	recptr = (char *) page + SizeOfXLogLongPHD;
 	record = (XLogRecord *) recptr;
 	record->xl_prev = 0;
-	record->xl_xid = InvalidTransactionId;
-	record->xl_tot_len = SizeOfXLogRecord + SizeOfXLogRecordDataHeaderShort + sizeof(CheckPoint);
-	record->xl_info = XLOG_CHECKPOINT_SHUTDOWN;
+	record->xl_info = XLR2_LEN_1B;
 	record->xl_rmid = RM_XLOG_ID;
 
-	recptr += SizeOfXLogRecord;
+	recptr += MinXLogHeaderSize;
+
+	memcpy(recptr, (char *) &rec_len, sizeof(uint8));
+	recptr += sizeof(uint8);
+
 	*(recptr++) = (char) XLR_BLOCK_ID_DATA_SHORT;
 	*(recptr++) = sizeof(CheckPoint);
 	memcpy(recptr, &ControlFile.checkPointCopy,
 		   sizeof(CheckPoint));
 
 	INIT_CRC32C(crc);
-	COMP_CRC32C(crc, ((char *) record) + SizeOfXLogRecord, record->xl_tot_len - SizeOfXLogRecord);
+	COMP_CRC32C(crc,
+				((char *) record) + offsetof(XLogRecord, xl_rmid),
+				rec_len - offsetof(XLogRecord, xl_rmid));
 	COMP_CRC32C(crc, (char *) record, offsetof(XLogRecord, xl_crc));
 	FIN_CRC32C(crc);
 	record->xl_crc = crc;
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 132c4db65e..fc2dccb63c 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -66,7 +66,7 @@ void
 extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 			   XLogRecPtr endpoint, const char *restoreCommand)
 {
-	XLogRecord *record;
+	XLRHeaderData *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
 	XLogPageReadPrivate private;
@@ -124,7 +124,7 @@ XLogRecPtr
 readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex,
 			  const char *restoreCommand)
 {
-	XLogRecord *record;
+	XLRHeaderData *record;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
 	XLogPageReadPrivate private;
@@ -170,7 +170,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 				   XLogRecPtr *lastchkptredo, const char *restoreCommand)
 {
 	/* Walk backwards, starting from the given record */
-	XLogRecord *record;
+	XLRHeaderData *record;
 	XLogRecPtr	searchptr;
 	XLogReaderState *xlogreader;
 	char	   *errormsg;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 70886beedd..a821d7b40a 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -696,7 +696,7 @@ main(int argc, char **argv)
 	XLogDumpPrivate private;
 	XLogDumpConfig config;
 	XLogStats	stats;
-	XLogRecord *record;
+	XLRHeaderData *record;
 	XLogRecPtr	first_record;
 	char	   *waldir = NULL;
 	char	   *errormsg;
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index cfe53c7175..f41bca6cea 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -42,7 +42,7 @@
 extern void XLogBeginInsert(void);
 extern void XLogSetRecordFlags(uint8 flags);
 extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 rminfo);
-extern XLogRecPtr XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo);
+extern XLogRecPtr XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo, CommandId cid);
 extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
 extern void XLogRegisterData(char *data, uint32 len);
 extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
diff --git a/src/include/access/xlogprefetcher.h b/src/include/access/xlogprefetcher.h
index fdd67fcedd..fbfd9e02a4 100644
--- a/src/include/access/xlogprefetcher.h
+++ b/src/include/access/xlogprefetcher.h
@@ -47,8 +47,8 @@ extern XLogReaderState *XLogPrefetcherGetReader(XLogPrefetcher *prefetcher);
 extern void XLogPrefetcherBeginRead(XLogPrefetcher *prefetcher,
 									XLogRecPtr recPtr);
 
-extern XLogRecord *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
-											char **errmsg);
+extern XLRHeaderData *XLogPrefetcherReadRecord(XLogPrefetcher *prefetcher,
+											   char **errmsg);
 
 extern void XLogPrefetcherComputeStats(XLogPrefetcher *prefetcher);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index fb6cae08ad..9a31aec414 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -163,7 +163,7 @@ typedef struct DecodedXLogRecord
 	/* Public members. */
 	XLogRecPtr	lsn;			/* location */
 	XLogRecPtr	next_lsn;		/* location of next record */
-	XLogRecord	header;			/* header */
+	XLRHeaderData header;	/* header */
 	RepOriginId record_origin;
 	TransactionId toplevel_xid; /* XID of top-level transaction */
 	char	   *main_data;		/* record's main data portion */
@@ -355,8 +355,8 @@ typedef enum XLogPageReadResult
 } XLogPageReadResult;
 
 /* Read the next XLog record. Returns NULL on end-of-WAL or failure */
-extern struct XLogRecord *XLogReadRecord(XLogReaderState *state,
-										 char **errormsg);
+extern struct XLRHeaderData *XLogReadRecord(XLogReaderState *state,
+											char **errormsg);
 
 /* Consume the next record or error. */
 extern DecodedXLogRecord *XLogNextRecord(XLogReaderState *state,
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 17093d93b6..94125f9878 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -19,7 +19,11 @@
 
 /*
  * The overall layout of an XLOG record is:
- *		Fixed-size header (XLogRecord struct)
+ *		Fixed-size header (XLogRecord struct) + variable-sized header data:
+ *		 - xl_cid (0 or 4 bytes)
+ *		 - xl_cid (0 or 4 bytes)
+ *		 - xl_rmgr_flags (0 or 1 byte)
+ *		 - xl_len (0, 1, 2 or 4 bytes)
  *		XLogRecordBlockHeader struct
  *		XLogRecordBlockHeader struct
  *		...
@@ -37,23 +41,97 @@
  * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
  * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
  * used to distinguish between block references, and the main data structs.
+ *
+ * The smallest size that XLogRecord header takes up is now 14 bytes: 8 bytes
+ * in xl_prev, 4 in checksum, and 1 in xl_rmid and xl_info each, while the
+ * max-sized xlog header now takes up 27 bytes; with 4 bytes each in
+ * xl_tot_len, xl_xid and xl_cid, plus one in xl_rminfo.
  */
-typedef struct XLogRecord
-{
-	uint32		xl_tot_len;		/* total len of entire record */
-	TransactionId xl_xid;		/* xact id */
-	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
-	uint8		xl_info;		/* flag bits, see below */
-	RmgrId		xl_rmid;		/* resource manager for this record */
-	uint8		xl_rminfo;		/* flag bits for rmgr use */
-	/* 1 byte of padding here, initialize to zero */
-	pg_crc32c	xl_crc;			/* CRC for this record */
-
-	/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
+typedef struct XLogRecord {
+	XLogRecPtr	xl_prev;
+	pg_crc32c	xl_crc;
+	RmgrId		xl_rmid;
 
+	/* Flags for record handling and variable-length header fields */
+	uint8		xl_info;
+	/*
+	 * Without padding:
+	 * - depending on flags, length field follows (0, 1, 2 or 4 bytes)
+	 * - if HAS_XID, TransactionId follows
+	 * - if HAS_CID, CommandID follows
+	 * - if HAS_RMINFO, uint8 with rminfo flags follows
+	 * - XLogRecordBlockHeaders and XLogRecordDataHeader follow
+	 */
 } XLogRecord;
 
-#define SizeOfXLogRecord	(offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))
+/*
+ * 
+ */
+typedef struct XLRHeaderData {
+	XLogRecPtr	xl_prev;
+	pg_crc32c	xl_crc;
+	RmgrId		xl_rmid;
+	uint8		xl_info;
+	TransactionId xl_xid;
+	CommandId	xl_cid;
+	uint8		xl_rminfo;
+	uint32		xl_tot_len;
+} XLRHeaderData;
+
+#define MinXLogHeaderSize	( \
+	offsetof(XLogRecord, xl_info) \
+	+ sizeof(uint8) /* xl_info */ \
+)
+
+#define MaxXLogHeaderSize	( \
+	MinXLogHeaderSize \
+	+ sizeof(TransactionId) /* xl_xid */ \
+	+ sizeof(CommandId) /* xl_cid */ \
+	+ sizeof(uint8) /* xl_rminfo */ \
+	+ sizeof(uint32) /* xl_len */ \
+)
+
+/*
+ * Mask for getting the size of the length field
+ */
+#define XLR2_LEN_MASK			(0x03)
+
+/*
+ * IFF the record does not contain any registered data, the length field will
+ * be absent (as the size of a plain record is knowable from just the
+ * fixed-size struct's flags)
+ */
+#define XLR2_LEN_ABSENT			0x00
+/*
+ * Size of the xlog record is <= 255 bytes
+ */
+#define XLR2_LEN_1B				0x01
+/*
+ * Size of the xlog record is <= (2^16 - 1)
+ */
+#define XLR2_LEN_2B				0x02
+/*
+ * Size of the xlog record is <= (2^32 - 1)
+ */
+#define XLR2_LEN_4B				0x03
+
+/*
+ * Does this record contain an XID? This must be included if the data has
+ * transactional visibility.
+ */
+#define XLR_HAS_XID			0x04
+
+/*
+ * Doest this record contain a CID? This must be included if the data has
+ * transactional visibility, and remote snapshot transfer support is enabled.
+ */
+#define XLR_HAS_CID			0x08
+
+/*
+ * If the redo manager needs non-zero bits in the header to discern different
+ * types of WAL records, this flag is set.
+ */
+#define XLR_HAS_RMINFO			0x10
 
 /*
  * If a WAL record modifies any relation files, in ways not covered by the
@@ -61,7 +139,7 @@ typedef struct XLogRecord
  * by PostgreSQL itself, but it allows external tools that read WAL and keep
  * track of modified blocks to recognize such special record types.
  */
-#define XLR_SPECIAL_REL_UPDATE	0x01
+#define XLR_SPECIAL_REL_UPDATE	0x20
 
 /*
  * Enforces consistency checks of replayed WAL at recovery. If enabled,
@@ -70,7 +148,68 @@ typedef struct XLogRecord
  * of XLogInsert can use this value if necessary, but if
  * wal_consistency_checking is enabled for a rmgr this is set unconditionally.
  */
-#define XLR_CHECK_CONSISTENCY	0x02
+#define XLR_CHECK_CONSISTENCY	0x40
+
+#define XLR_INFO_USERFLAGS		( \
+	XLR_HAS_XID \
+		| XLR_HAS_CID \
+		| XLR_SPECIAL_REL_UPDATE \
+		| XLR_CHECK_CONSISTENCY \
+)
+
+#define XLRSizeOfHeader(infomask) ( \
+	MinXLogHeaderSize \
+		+ ((1 << ((infomask) & XLR2_LEN_MASK)) >> 1) \
+		+ (((infomask) & XLR_HAS_XID) ? sizeof(TransactionId) : 0) \
+		+ (((infomask) & XLR_HAS_CID) ? sizeof(CommandId) : 0) \
+		+ (((infomask) & XLR_HAS_RMINFO) ? sizeof(uint8) : 0) \
+)
+
+static inline uint32
+XLogRecordGetLength(XLogRecord *record)
+{
+	char *lenptr = (char *) record;
+	uint8 len8;
+	uint16 len16;
+	uint32 len32;
+
+	lenptr += MinXLogHeaderSize;
+
+	switch ((record->xl_info) & XLR2_LEN_MASK) {
+		case XLR2_LEN_ABSENT:
+			return XLRSizeOfHeader(record->xl_info);
+		case XLR2_LEN_1B:
+			memcpy(&len8, lenptr, sizeof(uint8));
+			return (uint32) len8;
+		case XLR2_LEN_2B:
+			memcpy(&len16, lenptr, sizeof(uint16));
+			return (uint32) len16;
+		case XLR2_LEN_4B:
+			memcpy(&len32, lenptr, sizeof(uint32));
+			return (uint32) len32;
+		default:
+			pg_unreachable();
+	}
+}
+
+static inline int8
+XLogRecordGetRMInfo(XLogRecord *record)
+{
+	int infooff = MinXLogHeaderSize;
+
+	if (!(record->xl_info & XLR_HAS_RMINFO))
+		return 0;
+
+	infooff += (1 << (record->xl_info & XLR2_LEN_MASK)) >> 1;
+
+	if (record->xl_info & XLR_HAS_XID)
+		infooff += sizeof(TransactionId);
+
+	if (record->xl_info & XLR_HAS_CID)
+		infooff += sizeof(CommandId);
+
+	return *(((uint8 *) record) + infooff);
+}
 
 /*
  * Header info for block data appended to an XLOG record.
@@ -145,7 +284,7 @@ typedef struct XLogRecordBlockImageHeader
 #define BKPIMAGE_COMPRESS_ZSTD	0x10
 
 #define	BKPIMAGE_COMPRESSED(info) \
-	((info & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_LZ4 | \
+	(((info) & (BKPIMAGE_COMPRESS_PGLZ | BKPIMAGE_COMPRESS_LZ4 | \
 			  BKPIMAGE_COMPRESS_ZSTD)) != 0)
 
 /*
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 9f644a0c1b..dfafbc36df 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -69,7 +69,12 @@ ignore: random
 # aggregates depends on create_aggregate
 # join depends on create_misc
 # ----------
-test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update delete namespace prepared_xacts
+test: select_into select_distinct select_distinct_on select_implicit select_having subselect union case join aggregates transactions random portals arrays btree_index hash_index update delete namespace
+
+# ----------
+# Another group of parallel tests
+# ----------
+# test: prepared_xacts
 
 # ----------
 # Another group of parallel tests
-- 
2.30.2

0001-Move-rmgr-specific-flags-out-of-xl_info-into-new-uin.patchapplication/x-patch; name=0001-Move-rmgr-specific-flags-out-of-xl_info-into-new-uin.patchDownload

From e8a87c56860e605cbef8cb5d712414441c300f5c Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 9 Sep 2022 15:38:55 +0200
Subject: [PATCH 1/3] Move rmgr-specific flags out of xl_info into new uint8
 xl_rminfo field.

This field doens't use extra data as it is located in the previously 2-byte
alignment padding in the xlog header behind xl_rmid.

The field is used to allow redo managers more bits for their own use without
growing the xlog record size.
---
 contrib/pg_walinspect/pg_walinspect.c        |  4 +-
 src/backend/access/brin/brin_pageops.c       | 16 +++---
 src/backend/access/brin/brin_xlog.c          |  8 +--
 src/backend/access/gin/ginxlog.c             |  6 +--
 src/backend/access/gist/gistxlog.c           |  6 +--
 src/backend/access/hash/hash_xlog.c          |  6 +--
 src/backend/access/heap/heapam.c             | 40 +++++++--------
 src/backend/access/nbtree/nbtinsert.c        | 18 +++----
 src/backend/access/nbtree/nbtpage.c          |  8 +--
 src/backend/access/nbtree/nbtxlog.c          | 10 ++--
 src/backend/access/rmgrdesc/brindesc.c       | 20 ++++----
 src/backend/access/rmgrdesc/clogdesc.c       | 10 ++--
 src/backend/access/rmgrdesc/committsdesc.c   | 10 ++--
 src/backend/access/rmgrdesc/dbasedesc.c      | 12 ++---
 src/backend/access/rmgrdesc/genericdesc.c    |  2 +-
 src/backend/access/rmgrdesc/gindesc.c        |  8 +--
 src/backend/access/rmgrdesc/gistdesc.c       |  8 +--
 src/backend/access/rmgrdesc/hashdesc.c       |  8 +--
 src/backend/access/rmgrdesc/heapdesc.c       | 46 ++++++++---------
 src/backend/access/rmgrdesc/logicalmsgdesc.c |  8 +--
 src/backend/access/rmgrdesc/mxactdesc.c      | 14 ++---
 src/backend/access/rmgrdesc/nbtdesc.c        |  8 +--
 src/backend/access/rmgrdesc/relmapdesc.c     |  8 +--
 src/backend/access/rmgrdesc/replorigindesc.c |  8 +--
 src/backend/access/rmgrdesc/seqdesc.c        |  8 +--
 src/backend/access/rmgrdesc/smgrdesc.c       | 10 ++--
 src/backend/access/rmgrdesc/spgdesc.c        |  8 +--
 src/backend/access/rmgrdesc/standbydesc.c    | 12 ++---
 src/backend/access/rmgrdesc/tblspcdesc.c     | 10 ++--
 src/backend/access/rmgrdesc/xactdesc.c       | 34 ++++++------
 src/backend/access/rmgrdesc/xlogdesc.c       | 28 +++++-----
 src/backend/access/spgist/spgxlog.c          |  6 +--
 src/backend/access/transam/clog.c            |  8 +--
 src/backend/access/transam/commit_ts.c       |  8 +--
 src/backend/access/transam/multixact.c       | 18 +++----
 src/backend/access/transam/twophase.c        |  2 +-
 src/backend/access/transam/xact.c            | 36 +++++++------
 src/backend/access/transam/xlog.c            | 34 ++++++------
 src/backend/access/transam/xloginsert.c      | 31 ++++++++---
 src/backend/access/transam/xlogprefetcher.c  |  2 +-
 src/backend/access/transam/xlogreader.c      |  2 +-
 src/backend/access/transam/xlogrecovery.c    | 54 ++++++++++----------
 src/backend/access/transam/xlogstats.c       |  2 +-
 src/backend/catalog/storage.c                | 15 +++---
 src/backend/commands/dbcommands.c            | 30 ++++++-----
 src/backend/commands/sequence.c              |  6 +--
 src/backend/commands/tablespace.c            |  8 +--
 src/backend/replication/logical/decode.c     | 38 +++++++-------
 src/backend/replication/logical/message.c    |  6 +--
 src/backend/replication/logical/origin.c     |  6 +--
 src/backend/storage/ipc/standby.c            | 10 ++--
 src/backend/utils/cache/relmapper.c          |  6 +--
 src/bin/pg_rewind/parsexlog.c                | 10 ++--
 src/bin/pg_waldump/pg_waldump.c              |  6 +--
 src/include/access/brin_xlog.h               |  2 +-
 src/include/access/clog.h                    |  2 +-
 src/include/access/ginxlog.h                 |  2 +-
 src/include/access/gistxlog.h                |  2 +-
 src/include/access/hash_xlog.h               |  2 +-
 src/include/access/heapam_xlog.h             |  4 +-
 src/include/access/multixact.h               |  2 +-
 src/include/access/nbtxlog.h                 |  2 +-
 src/include/access/spgxlog.h                 |  2 +-
 src/include/access/xact.h                    |  6 +--
 src/include/access/xlog.h                    |  2 +-
 src/include/access/xloginsert.h              |  3 +-
 src/include/access/xlogreader.h              |  1 +
 src/include/access/xlogrecord.h              | 11 +---
 src/include/access/xlogstats.h               |  2 +-
 src/include/catalog/storage_xlog.h           |  2 +-
 src/include/commands/dbcommands_xlog.h       |  2 +-
 src/include/commands/sequence.h              |  2 +-
 src/include/commands/tablespace.h            |  2 +-
 src/include/replication/message.h            |  2 +-
 src/include/storage/standbydefs.h            |  2 +-
 src/include/utils/relmapper.h                |  2 +-
 76 files changed, 411 insertions(+), 394 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index 38fb4106da..f4e3b40bed 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -188,10 +188,10 @@ GetWALRecordInfo(XLogReaderState *record, Datum *values,
 	int			i = 0;
 
 	desc = GetRmgr(XLogRecGetRmid(record));
-	id = desc.rm_identify(XLogRecGetInfo(record));
+	id = desc.rm_identify(XLogRecGetRmInfo(record));
 
 	if (id == NULL)
-		id = psprintf("UNKNOWN (%x)", XLogRecGetInfo(record) & ~XLR_INFO_MASK);
+		id = psprintf("UNKNOWN (%x)", XLogRecGetRmInfo(record));
 
 	initStringInfo(&rec_desc);
 	desc.rm_desc(&rec_desc, record);
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index f17aad51b6..150d6f9793 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -186,7 +186,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		{
 			xl_brin_samepage_update xlrec;
 			XLogRecPtr	recptr;
-			uint8		info = XLOG_BRIN_SAMEPAGE_UPDATE;
+			uint8		rminfo = XLOG_BRIN_SAMEPAGE_UPDATE;
 
 			xlrec.offnum = oldoff;
 
@@ -196,7 +196,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 			XLogRegisterBuffer(0, oldbuf, REGBUF_STANDARD);
 			XLogRegisterBufData(0, (char *) unconstify(BrinTuple *, newtup), newsz);
 
-			recptr = XLogInsert(RM_BRIN_ID, info);
+			recptr = XLogInsert(RM_BRIN_ID, rminfo);
 
 			PageSetLSN(oldpage, recptr);
 		}
@@ -271,9 +271,9 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		{
 			xl_brin_update xlrec;
 			XLogRecPtr	recptr;
-			uint8		info;
+			uint8		rminfo;
 
-			info = XLOG_BRIN_UPDATE | (extended ? XLOG_BRIN_INIT_PAGE : 0);
+			rminfo = XLOG_BRIN_UPDATE | (extended ? XLOG_BRIN_INIT_PAGE : 0);
 
 			xlrec.insert.offnum = newoff;
 			xlrec.insert.heapBlk = heapBlk;
@@ -294,7 +294,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 			/* old page */
 			XLogRegisterBuffer(2, oldbuf, REGBUF_STANDARD);
 
-			recptr = XLogInsert(RM_BRIN_ID, info);
+			recptr = XLogInsert(RM_BRIN_ID, rminfo);
 
 			PageSetLSN(oldpage, recptr);
 			PageSetLSN(newpage, recptr);
@@ -428,9 +428,9 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 	{
 		xl_brin_insert xlrec;
 		XLogRecPtr	recptr;
-		uint8		info;
+		uint8		rminfo;
 
-		info = XLOG_BRIN_INSERT | (extended ? XLOG_BRIN_INIT_PAGE : 0);
+		rminfo = XLOG_BRIN_INSERT | (extended ? XLOG_BRIN_INIT_PAGE : 0);
 		xlrec.heapBlk = heapBlk;
 		xlrec.pagesPerRange = pagesPerRange;
 		xlrec.offnum = off;
@@ -443,7 +443,7 @@ brin_doinsert(Relation idxrel, BlockNumber pagesPerRange,
 
 		XLogRegisterBuffer(1, revmapbuf, 0);
 
-		recptr = XLogInsert(RM_BRIN_ID, info);
+		recptr = XLogInsert(RM_BRIN_ID, rminfo);
 
 		PageSetLSN(page, recptr);
 		PageSetLSN(BufferGetPage(revmapbuf), recptr);
diff --git a/src/backend/access/brin/brin_xlog.c b/src/backend/access/brin/brin_xlog.c
index af6949882a..c0a23bdca6 100644
--- a/src/backend/access/brin/brin_xlog.c
+++ b/src/backend/access/brin/brin_xlog.c
@@ -56,7 +56,7 @@ brin_xlog_insert_update(XLogReaderState *record,
 	 * If we inserted the first and only tuple on the page, re-initialize the
 	 * page from scratch.
 	 */
-	if (XLogRecGetInfo(record) & XLOG_BRIN_INIT_PAGE)
+	if (XLogRecGetRmInfo(record) & XLOG_BRIN_INIT_PAGE)
 	{
 		buffer = XLogInitBufferForRedo(record, 0);
 		page = BufferGetPage(buffer);
@@ -308,9 +308,9 @@ brin_xlog_desummarize_page(XLogReaderState *record)
 void
 brin_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info & XLOG_BRIN_OPMASK)
+	switch (rminfo & XLOG_BRIN_OPMASK)
 	{
 		case XLOG_BRIN_CREATE_INDEX:
 			brin_xlog_createidx(record);
@@ -331,7 +331,7 @@ brin_redo(XLogReaderState *record)
 			brin_xlog_desummarize_page(record);
 			break;
 		default:
-			elog(PANIC, "brin_redo: unknown op code %u", info);
+			elog(PANIC, "brin_redo: unknown op code %u", rminfo);
 	}
 }
 
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 41b92115bf..e6a28ef6ce 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -725,7 +725,7 @@ ginRedoDeleteListPages(XLogReaderState *record)
 void
 gin_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	MemoryContext oldCtx;
 
 	/*
@@ -735,7 +735,7 @@ gin_redo(XLogReaderState *record)
 	 */
 
 	oldCtx = MemoryContextSwitchTo(opCtx);
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_GIN_CREATE_PTREE:
 			ginRedoCreatePTree(record);
@@ -765,7 +765,7 @@ gin_redo(XLogReaderState *record)
 			ginRedoDeleteListPages(record);
 			break;
 		default:
-			elog(PANIC, "gin_redo: unknown op code %u", info);
+			elog(PANIC, "gin_redo: unknown op code %u", rminfo);
 	}
 	MemoryContextSwitchTo(oldCtx);
 	MemoryContextReset(opCtx);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 998befd2cb..940d9cd90e 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -402,7 +402,7 @@ gistRedoPageReuse(XLogReaderState *record)
 void
 gist_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	MemoryContext oldCxt;
 
 	/*
@@ -412,7 +412,7 @@ gist_redo(XLogReaderState *record)
 	 */
 
 	oldCxt = MemoryContextSwitchTo(opCtx);
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_GIST_PAGE_UPDATE:
 			gistRedoPageUpdateRecord(record);
@@ -433,7 +433,7 @@ gist_redo(XLogReaderState *record)
 			/* nop. See gistGetFakeLSN(). */
 			break;
 		default:
-			elog(PANIC, "gist_redo: unknown op code %u", info);
+			elog(PANIC, "gist_redo: unknown op code %u", rminfo);
 	}
 
 	MemoryContextSwitchTo(oldCxt);
diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index e88213c742..9bc8f4edd4 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -1052,9 +1052,9 @@ hash_xlog_vacuum_one_page(XLogReaderState *record)
 void
 hash_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_HASH_INIT_META_PAGE:
 			hash_xlog_init_meta_page(record);
@@ -1096,7 +1096,7 @@ hash_redo(XLogReaderState *record)
 			hash_xlog_vacuum_one_page(record);
 			break;
 		default:
-			elog(PANIC, "hash_redo: unknown op code %u", info);
+			elog(PANIC, "hash_redo: unknown op code %u", rminfo);
 	}
 }
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bd4d85041d..94e702cdb2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2104,7 +2104,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		xl_heap_header xlhdr;
 		XLogRecPtr	recptr;
 		Page		page = BufferGetPage(buffer);
-		uint8		info = XLOG_HEAP_INSERT;
+		uint8		rminfo = XLOG_HEAP_INSERT;
 		int			bufflags = 0;
 
 		/*
@@ -2122,7 +2122,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
 			PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
 		{
-			info |= XLOG_HEAP_INIT_PAGE;
+			rminfo |= XLOG_HEAP_INIT_PAGE;
 			bufflags |= REGBUF_WILL_INIT;
 		}
 
@@ -2171,7 +2171,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		/* filtering by origin on a row level is much more efficient */
 		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-		recptr = XLogInsert(RM_HEAP_ID, info);
+		recptr = XLogInsert(RM_HEAP_ID, rminfo);
 
 		PageSetLSN(page, recptr);
 	}
@@ -2415,7 +2415,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 		{
 			XLogRecPtr	recptr;
 			xl_heap_multi_insert *xlrec;
-			uint8		info = XLOG_HEAP2_MULTI_INSERT;
+			uint8		rminfo = XLOG_HEAP2_MULTI_INSERT;
 			char	   *tupledata;
 			int			totaldatalen;
 			char	   *scratchptr = scratch.data;
@@ -2499,7 +2499,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 
 			if (init)
 			{
-				info |= XLOG_HEAP_INIT_PAGE;
+				rminfo |= XLOG_HEAP_INIT_PAGE;
 				bufflags |= REGBUF_WILL_INIT;
 			}
 
@@ -2519,7 +2519,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 			/* filtering by origin on a row level is much more efficient */
 			XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-			recptr = XLogInsert(RM_HEAP2_ID, info);
+			recptr = XLogInsert(RM_HEAP2_ID, rminfo);
 
 			PageSetLSN(page, recptr);
 		}
@@ -8237,7 +8237,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	xl_heap_update xlrec;
 	xl_heap_header xlhdr;
 	xl_heap_header xlhdr_idx;
-	uint8		info;
+	uint8		rminfo;
 	uint16		prefix_suffix[2];
 	uint16		prefixlen = 0,
 				suffixlen = 0;
@@ -8253,9 +8253,9 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	XLogBeginInsert();
 
 	if (HeapTupleIsHeapOnly(newtup))
-		info = XLOG_HEAP_HOT_UPDATE;
+		rminfo = XLOG_HEAP_HOT_UPDATE;
 	else
-		info = XLOG_HEAP_UPDATE;
+		rminfo = XLOG_HEAP_UPDATE;
 
 	/*
 	 * If the old and new tuple are on the same page, we only need to log the
@@ -8335,7 +8335,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	if (ItemPointerGetOffsetNumber(&(newtup->t_self)) == FirstOffsetNumber &&
 		PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
 	{
-		info |= XLOG_HEAP_INIT_PAGE;
+		rminfo |= XLOG_HEAP_INIT_PAGE;
 		init = true;
 	}
 	else
@@ -8439,7 +8439,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
 	/* filtering by origin on a row level is much more efficient */
 	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-	recptr = XLogInsert(RM_HEAP_ID, info);
+	recptr = XLogInsert(RM_HEAP_ID, rminfo);
 
 	return recptr;
 }
@@ -9122,7 +9122,7 @@ heap_xlog_insert(XLogReaderState *record)
 	 * If we inserted the first and only tuple on the page, re-initialize the
 	 * page from scratch.
 	 */
-	if (XLogRecGetInfo(record) & XLOG_HEAP_INIT_PAGE)
+	if (XLogRecGetRmInfo(record) & XLOG_HEAP_INIT_PAGE)
 	{
 		buffer = XLogInitBufferForRedo(record, 0);
 		page = BufferGetPage(buffer);
@@ -9216,7 +9216,7 @@ heap_xlog_multi_insert(XLogReaderState *record)
 	uint32		newlen;
 	Size		freespace = 0;
 	int			i;
-	bool		isinit = (XLogRecGetInfo(record) & XLOG_HEAP_INIT_PAGE) != 0;
+	bool		isinit = (XLogRecGetRmInfo(record) & XLOG_HEAP_INIT_PAGE) != 0;
 	XLogRedoAction action;
 
 	/*
@@ -9464,7 +9464,7 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
 		nbuffer = obuffer;
 		newaction = oldaction;
 	}
-	else if (XLogRecGetInfo(record) & XLOG_HEAP_INIT_PAGE)
+	else if (XLogRecGetRmInfo(record) & XLOG_HEAP_INIT_PAGE)
 	{
 		nbuffer = XLogInitBufferForRedo(record, 0);
 		page = (Page) BufferGetPage(nbuffer);
@@ -9828,14 +9828,14 @@ heap_xlog_inplace(XLogReaderState *record)
 void
 heap_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/*
 	 * These operations don't overwrite MVCC data so no conflict processing is
 	 * required. The ones in heap2 rmgr do.
 	 */
 
-	switch (info & XLOG_HEAP_OPMASK)
+	switch (rminfo & XLOG_HEAP_OPMASK)
 	{
 		case XLOG_HEAP_INSERT:
 			heap_xlog_insert(record);
@@ -9867,16 +9867,16 @@ heap_redo(XLogReaderState *record)
 			heap_xlog_inplace(record);
 			break;
 		default:
-			elog(PANIC, "heap_redo: unknown op code %u", info);
+			elog(PANIC, "heap_redo: unknown op code %u", rminfo);
 	}
 }
 
 void
 heap2_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info & XLOG_HEAP_OPMASK)
+	switch (rminfo & XLOG_HEAP_OPMASK)
 	{
 		case XLOG_HEAP2_PRUNE:
 			heap_xlog_prune(record);
@@ -9907,7 +9907,7 @@ heap2_redo(XLogReaderState *record)
 			heap_xlog_logical_rewrite(record);
 			break;
 		default:
-			elog(PANIC, "heap2_redo: unknown op code %u", info);
+			elog(PANIC, "heap2_redo: unknown op code %u", rminfo);
 	}
 }
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index f6f4af8bfe..90948bf819 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -1308,7 +1308,7 @@ _bt_insertonpg(Relation rel,
 		{
 			xl_btree_insert xlrec;
 			xl_btree_metadata xlmeta;
-			uint8		xlinfo;
+			uint8		xlrminfo;
 			XLogRecPtr	recptr;
 			uint16		upostingoff;
 
@@ -1320,7 +1320,7 @@ _bt_insertonpg(Relation rel,
 			if (isleaf && postingoff == 0)
 			{
 				/* Simple leaf insert */
-				xlinfo = XLOG_BTREE_INSERT_LEAF;
+				xlrminfo = XLOG_BTREE_INSERT_LEAF;
 			}
 			else if (postingoff != 0)
 			{
@@ -1329,18 +1329,18 @@ _bt_insertonpg(Relation rel,
 				 * postingoff field before newitem/orignewitem.
 				 */
 				Assert(isleaf);
-				xlinfo = XLOG_BTREE_INSERT_POST;
+				xlrminfo = XLOG_BTREE_INSERT_POST;
 			}
 			else
 			{
 				/* Internal page insert, which finishes a split on cbuf */
-				xlinfo = XLOG_BTREE_INSERT_UPPER;
+				xlrminfo = XLOG_BTREE_INSERT_UPPER;
 				XLogRegisterBuffer(1, cbuf, REGBUF_STANDARD);
 
 				if (BufferIsValid(metabuf))
 				{
 					/* Actually, it's an internal page insert + meta update */
-					xlinfo = XLOG_BTREE_INSERT_META;
+					xlrminfo = XLOG_BTREE_INSERT_META;
 
 					Assert(metad->btm_version >= BTREE_NOVAC_VERSION);
 					xlmeta.version = metad->btm_version;
@@ -1381,7 +1381,7 @@ _bt_insertonpg(Relation rel,
 									IndexTupleSize(origitup));
 			}
 
-			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+			recptr = XLogInsert(RM_BTREE_ID, xlrminfo);
 
 			if (BufferIsValid(metabuf))
 				PageSetLSN(metapg, recptr);
@@ -1962,7 +1962,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	if (RelationNeedsWAL(rel))
 	{
 		xl_btree_split xlrec;
-		uint8		xlinfo;
+		uint8		xlrminfo;
 		XLogRecPtr	recptr;
 
 		xlrec.level = ropaque->btpo_level;
@@ -2045,8 +2045,8 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 							(char *) rightpage + ((PageHeader) rightpage)->pd_upper,
 							((PageHeader) rightpage)->pd_special - ((PageHeader) rightpage)->pd_upper);
 
-		xlinfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
-		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+		xlrminfo = newitemonleft ? XLOG_BTREE_SPLIT_L : XLOG_BTREE_SPLIT_R;
+		recptr = XLogInsert(RM_BTREE_ID, xlrminfo);
 
 		PageSetLSN(origpage, recptr);
 		PageSetLSN(rightpage, recptr);
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8b96708b3e..e802b71704 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2634,7 +2634,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	{
 		xl_btree_unlink_page xlrec;
 		xl_btree_metadata xlmeta;
-		uint8		xlinfo;
+		uint8		xlrminfo;
 		XLogRecPtr	recptr;
 
 		XLogBeginInsert();
@@ -2673,12 +2673,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 			xlmeta.allequalimage = metad->btm_allequalimage;
 
 			XLogRegisterBufData(4, (char *) &xlmeta, sizeof(xl_btree_metadata));
-			xlinfo = XLOG_BTREE_UNLINK_PAGE_META;
+			xlrminfo = XLOG_BTREE_UNLINK_PAGE_META;
 		}
 		else
-			xlinfo = XLOG_BTREE_UNLINK_PAGE;
+			xlrminfo = XLOG_BTREE_UNLINK_PAGE;
 
-		recptr = XLogInsert(RM_BTREE_ID, xlinfo);
+		recptr = XLogInsert(RM_BTREE_ID, xlrminfo);
 
 		if (BufferIsValid(metabuf))
 		{
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index ad489e33b3..6c085fd43e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -1012,11 +1012,11 @@ btree_xlog_reuse_page(XLogReaderState *record)
 void
 btree_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	MemoryContext oldCtx;
 
 	oldCtx = MemoryContextSwitchTo(opCtx);
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
 			btree_xlog_insert(true, false, false, record);
@@ -1046,11 +1046,11 @@ btree_redo(XLogReaderState *record)
 			btree_xlog_delete(record);
 			break;
 		case XLOG_BTREE_MARK_PAGE_HALFDEAD:
-			btree_xlog_mark_page_halfdead(info, record);
+			btree_xlog_mark_page_halfdead(rminfo, record);
 			break;
 		case XLOG_BTREE_UNLINK_PAGE:
 		case XLOG_BTREE_UNLINK_PAGE_META:
-			btree_xlog_unlink_page(info, record);
+			btree_xlog_unlink_page(rminfo, record);
 			break;
 		case XLOG_BTREE_NEWROOT:
 			btree_xlog_newroot(record);
@@ -1062,7 +1062,7 @@ btree_redo(XLogReaderState *record)
 			_bt_restore_meta(record, 0);
 			break;
 		default:
-			elog(PANIC, "btree_redo: unknown op code %u", info);
+			elog(PANIC, "btree_redo: unknown op code %u", rminfo);
 	}
 	MemoryContextSwitchTo(oldCtx);
 	MemoryContextReset(opCtx);
diff --git a/src/backend/access/rmgrdesc/brindesc.c b/src/backend/access/rmgrdesc/brindesc.c
index f05607e6c3..44395dd9a5 100644
--- a/src/backend/access/rmgrdesc/brindesc.c
+++ b/src/backend/access/rmgrdesc/brindesc.c
@@ -20,17 +20,17 @@ void
 brin_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	info &= XLOG_BRIN_OPMASK;
-	if (info == XLOG_BRIN_CREATE_INDEX)
+	rminfo &= XLOG_BRIN_OPMASK;
+	if (rminfo == XLOG_BRIN_CREATE_INDEX)
 	{
 		xl_brin_createidx *xlrec = (xl_brin_createidx *) rec;
 
 		appendStringInfo(buf, "v%d pagesPerRange %u",
 						 xlrec->version, xlrec->pagesPerRange);
 	}
-	else if (info == XLOG_BRIN_INSERT)
+	else if (rminfo == XLOG_BRIN_INSERT)
 	{
 		xl_brin_insert *xlrec = (xl_brin_insert *) rec;
 
@@ -39,7 +39,7 @@ brin_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->pagesPerRange,
 						 xlrec->offnum);
 	}
-	else if (info == XLOG_BRIN_UPDATE)
+	else if (rminfo == XLOG_BRIN_UPDATE)
 	{
 		xl_brin_update *xlrec = (xl_brin_update *) rec;
 
@@ -49,19 +49,19 @@ brin_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->oldOffnum,
 						 xlrec->insert.offnum);
 	}
-	else if (info == XLOG_BRIN_SAMEPAGE_UPDATE)
+	else if (rminfo == XLOG_BRIN_SAMEPAGE_UPDATE)
 	{
 		xl_brin_samepage_update *xlrec = (xl_brin_samepage_update *) rec;
 
 		appendStringInfo(buf, "offnum %u", xlrec->offnum);
 	}
-	else if (info == XLOG_BRIN_REVMAP_EXTEND)
+	else if (rminfo == XLOG_BRIN_REVMAP_EXTEND)
 	{
 		xl_brin_revmap_extend *xlrec = (xl_brin_revmap_extend *) rec;
 
 		appendStringInfo(buf, "targetBlk %u", xlrec->targetBlk);
 	}
-	else if (info == XLOG_BRIN_DESUMMARIZE)
+	else if (rminfo == XLOG_BRIN_DESUMMARIZE)
 	{
 		xl_brin_desummarize *xlrec = (xl_brin_desummarize *) rec;
 
@@ -71,11 +71,11 @@ brin_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-brin_identify(uint8 info)
+brin_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_BRIN_CREATE_INDEX:
 			id = "CREATE_INDEX";
diff --git a/src/backend/access/rmgrdesc/clogdesc.c b/src/backend/access/rmgrdesc/clogdesc.c
index 87513732be..1b0cd7ddbf 100644
--- a/src/backend/access/rmgrdesc/clogdesc.c
+++ b/src/backend/access/rmgrdesc/clogdesc.c
@@ -21,16 +21,16 @@ void
 clog_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == CLOG_ZEROPAGE)
+	if (rminfo == CLOG_ZEROPAGE)
 	{
 		int			pageno;
 
 		memcpy(&pageno, rec, sizeof(int));
 		appendStringInfo(buf, "page %d", pageno);
 	}
-	else if (info == CLOG_TRUNCATE)
+	else if (rminfo == CLOG_TRUNCATE)
 	{
 		xl_clog_truncate xlrec;
 
@@ -41,11 +41,11 @@ clog_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-clog_identify(uint8 info)
+clog_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case CLOG_ZEROPAGE:
 			id = "ZEROPAGE";
diff --git a/src/backend/access/rmgrdesc/committsdesc.c b/src/backend/access/rmgrdesc/committsdesc.c
index 3a65538bb0..120562222f 100644
--- a/src/backend/access/rmgrdesc/committsdesc.c
+++ b/src/backend/access/rmgrdesc/committsdesc.c
@@ -22,16 +22,16 @@ void
 commit_ts_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == COMMIT_TS_ZEROPAGE)
+	if (rminfo == COMMIT_TS_ZEROPAGE)
 	{
 		int			pageno;
 
 		memcpy(&pageno, rec, sizeof(int));
 		appendStringInfo(buf, "%d", pageno);
 	}
-	else if (info == COMMIT_TS_TRUNCATE)
+	else if (rminfo == COMMIT_TS_TRUNCATE)
 	{
 		xl_commit_ts_truncate *trunc = (xl_commit_ts_truncate *) rec;
 
@@ -41,9 +41,9 @@ commit_ts_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-commit_ts_identify(uint8 info)
+commit_ts_identify(uint8 rminfo)
 {
-	switch (info)
+	switch (rminfo)
 	{
 		case COMMIT_TS_ZEROPAGE:
 			return "ZEROPAGE";
diff --git a/src/backend/access/rmgrdesc/dbasedesc.c b/src/backend/access/rmgrdesc/dbasedesc.c
index 523d0b3c1d..f184fbb2ab 100644
--- a/src/backend/access/rmgrdesc/dbasedesc.c
+++ b/src/backend/access/rmgrdesc/dbasedesc.c
@@ -22,9 +22,9 @@ void
 dbase_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == XLOG_DBASE_CREATE_FILE_COPY)
+	if (rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		xl_dbase_create_file_copy_rec *xlrec =
 		(xl_dbase_create_file_copy_rec *) rec;
@@ -33,7 +33,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->src_tablespace_id, xlrec->src_db_id,
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
-	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	else if (rminfo == XLOG_DBASE_CREATE_WAL_LOG)
 	{
 		xl_dbase_create_wal_log_rec *xlrec =
 		(xl_dbase_create_wal_log_rec *) rec;
@@ -41,7 +41,7 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "create dir %u/%u",
 						 xlrec->tablespace_id, xlrec->db_id);
 	}
-	else if (info == XLOG_DBASE_DROP)
+	else if (rminfo == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) rec;
 		int			i;
@@ -54,11 +54,11 @@ dbase_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-dbase_identify(uint8 info)
+dbase_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_DBASE_CREATE_FILE_COPY:
 			id = "CREATE_FILE_COPY";
diff --git a/src/backend/access/rmgrdesc/genericdesc.c b/src/backend/access/rmgrdesc/genericdesc.c
index d8509b884b..e1aa356686 100644
--- a/src/backend/access/rmgrdesc/genericdesc.c
+++ b/src/backend/access/rmgrdesc/genericdesc.c
@@ -50,7 +50,7 @@ generic_desc(StringInfo buf, XLogReaderState *record)
  * inside generic xlog records.
  */
 const char *
-generic_identify(uint8 info)
+generic_identify(uint8 rminfo)
 {
 	return "Generic";
 }
diff --git a/src/backend/access/rmgrdesc/gindesc.c b/src/backend/access/rmgrdesc/gindesc.c
index 7d147cea97..76505f5f14 100644
--- a/src/backend/access/rmgrdesc/gindesc.c
+++ b/src/backend/access/rmgrdesc/gindesc.c
@@ -74,9 +74,9 @@ void
 gin_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_GIN_CREATE_PTREE:
 			/* no further information */
@@ -179,11 +179,11 @@ gin_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-gin_identify(uint8 info)
+gin_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_GIN_CREATE_PTREE:
 			id = "CREATE_PTREE";
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 7dd3c1d500..d6b6404213 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -60,9 +60,9 @@ void
 gist_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_GIST_PAGE_UPDATE:
 			out_gistxlogPageUpdate(buf, (gistxlogPageUpdate *) rec);
@@ -86,11 +86,11 @@ gist_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-gist_identify(uint8 info)
+gist_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_GIST_PAGE_UPDATE:
 			id = "PAGE_UPDATE";
diff --git a/src/backend/access/rmgrdesc/hashdesc.c b/src/backend/access/rmgrdesc/hashdesc.c
index ef443bdb16..8ebfc7d16b 100644
--- a/src/backend/access/rmgrdesc/hashdesc.c
+++ b/src/backend/access/rmgrdesc/hashdesc.c
@@ -20,9 +20,9 @@ void
 hash_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_HASH_INIT_META_PAGE:
 			{
@@ -122,11 +122,11 @@ hash_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-hash_identify(uint8 info)
+hash_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_HASH_INIT_META_PAGE:
 			id = "INIT_META_PAGE";
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 923d3bc43d..c07f237365 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -35,17 +35,17 @@ void
 heap_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP_INSERT)
+	rminfo &= XLOG_HEAP_OPMASK;
+	if (rminfo == XLOG_HEAP_INSERT)
 	{
 		xl_heap_insert *xlrec = (xl_heap_insert *) rec;
 
 		appendStringInfo(buf, "off %u flags 0x%02X", xlrec->offnum,
 						 xlrec->flags);
 	}
-	else if (info == XLOG_HEAP_DELETE)
+	else if (rminfo == XLOG_HEAP_DELETE)
 	{
 		xl_heap_delete *xlrec = (xl_heap_delete *) rec;
 
@@ -54,7 +54,7 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->flags);
 		out_infobits(buf, xlrec->infobits_set);
 	}
-	else if (info == XLOG_HEAP_UPDATE)
+	else if (rminfo == XLOG_HEAP_UPDATE)
 	{
 		xl_heap_update *xlrec = (xl_heap_update *) rec;
 
@@ -67,7 +67,7 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->new_offnum,
 						 xlrec->new_xmax);
 	}
-	else if (info == XLOG_HEAP_HOT_UPDATE)
+	else if (rminfo == XLOG_HEAP_HOT_UPDATE)
 	{
 		xl_heap_update *xlrec = (xl_heap_update *) rec;
 
@@ -80,7 +80,7 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->new_offnum,
 						 xlrec->new_xmax);
 	}
-	else if (info == XLOG_HEAP_TRUNCATE)
+	else if (rminfo == XLOG_HEAP_TRUNCATE)
 	{
 		xl_heap_truncate *xlrec = (xl_heap_truncate *) rec;
 		int			i;
@@ -93,13 +93,13 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 		for (i = 0; i < xlrec->nrelids; i++)
 			appendStringInfo(buf, " %u", xlrec->relids[i]);
 	}
-	else if (info == XLOG_HEAP_CONFIRM)
+	else if (rminfo == XLOG_HEAP_CONFIRM)
 	{
 		xl_heap_confirm *xlrec = (xl_heap_confirm *) rec;
 
 		appendStringInfo(buf, "off %u", xlrec->offnum);
 	}
-	else if (info == XLOG_HEAP_LOCK)
+	else if (rminfo == XLOG_HEAP_LOCK)
 	{
 		xl_heap_lock *xlrec = (xl_heap_lock *) rec;
 
@@ -107,7 +107,7 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->offnum, xlrec->locking_xid, xlrec->flags);
 		out_infobits(buf, xlrec->infobits_set);
 	}
-	else if (info == XLOG_HEAP_INPLACE)
+	else if (rminfo == XLOG_HEAP_INPLACE)
 	{
 		xl_heap_inplace *xlrec = (xl_heap_inplace *) rec;
 
@@ -118,10 +118,10 @@ void
 heap2_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	info &= XLOG_HEAP_OPMASK;
-	if (info == XLOG_HEAP2_PRUNE)
+	rminfo &= XLOG_HEAP_OPMASK;
+	if (rminfo == XLOG_HEAP2_PRUNE)
 	{
 		xl_heap_prune *xlrec = (xl_heap_prune *) rec;
 
@@ -130,34 +130,34 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->nredirected,
 						 xlrec->ndead);
 	}
-	else if (info == XLOG_HEAP2_VACUUM)
+	else if (rminfo == XLOG_HEAP2_VACUUM)
 	{
 		xl_heap_vacuum *xlrec = (xl_heap_vacuum *) rec;
 
 		appendStringInfo(buf, "nunused %u", xlrec->nunused);
 	}
-	else if (info == XLOG_HEAP2_FREEZE_PAGE)
+	else if (rminfo == XLOG_HEAP2_FREEZE_PAGE)
 	{
 		xl_heap_freeze_page *xlrec = (xl_heap_freeze_page *) rec;
 
 		appendStringInfo(buf, "cutoff xid %u ntuples %u",
 						 xlrec->cutoff_xid, xlrec->ntuples);
 	}
-	else if (info == XLOG_HEAP2_VISIBLE)
+	else if (rminfo == XLOG_HEAP2_VISIBLE)
 	{
 		xl_heap_visible *xlrec = (xl_heap_visible *) rec;
 
 		appendStringInfo(buf, "cutoff xid %u flags 0x%02X",
 						 xlrec->cutoff_xid, xlrec->flags);
 	}
-	else if (info == XLOG_HEAP2_MULTI_INSERT)
+	else if (rminfo == XLOG_HEAP2_MULTI_INSERT)
 	{
 		xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
 
 		appendStringInfo(buf, "%d tuples flags 0x%02X", xlrec->ntuples,
 						 xlrec->flags);
 	}
-	else if (info == XLOG_HEAP2_LOCK_UPDATED)
+	else if (rminfo == XLOG_HEAP2_LOCK_UPDATED)
 	{
 		xl_heap_lock_updated *xlrec = (xl_heap_lock_updated *) rec;
 
@@ -165,7 +165,7 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec->offnum, xlrec->xmax, xlrec->flags);
 		out_infobits(buf, xlrec->infobits_set);
 	}
-	else if (info == XLOG_HEAP2_NEW_CID)
+	else if (rminfo == XLOG_HEAP2_NEW_CID)
 	{
 		xl_heap_new_cid *xlrec = (xl_heap_new_cid *) rec;
 
@@ -181,11 +181,11 @@ heap2_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-heap_identify(uint8 info)
+heap_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_HEAP_INSERT:
 			id = "INSERT";
@@ -226,11 +226,11 @@ heap_identify(uint8 info)
 }
 
 const char *
-heap2_identify(uint8 info)
+heap2_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_HEAP2_PRUNE:
 			id = "PRUNE";
diff --git a/src/backend/access/rmgrdesc/logicalmsgdesc.c b/src/backend/access/rmgrdesc/logicalmsgdesc.c
index 08e03aa30d..03e653c0e8 100644
--- a/src/backend/access/rmgrdesc/logicalmsgdesc.c
+++ b/src/backend/access/rmgrdesc/logicalmsgdesc.c
@@ -19,9 +19,9 @@ void
 logicalmsg_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == XLOG_LOGICAL_MESSAGE)
+	if (rminfo == XLOG_LOGICAL_MESSAGE)
 	{
 		xl_logical_message *xlrec = (xl_logical_message *) rec;
 		char	   *prefix = xlrec->message;
@@ -43,9 +43,9 @@ logicalmsg_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-logicalmsg_identify(uint8 info)
+logicalmsg_identify(uint8 rminfo)
 {
-	if ((info & ~XLR_INFO_MASK) == XLOG_LOGICAL_MESSAGE)
+	if ((rminfo) == XLOG_LOGICAL_MESSAGE)
 		return "MESSAGE";
 
 	return NULL;
diff --git a/src/backend/access/rmgrdesc/mxactdesc.c b/src/backend/access/rmgrdesc/mxactdesc.c
index 7076be2b3f..6e5a411d16 100644
--- a/src/backend/access/rmgrdesc/mxactdesc.c
+++ b/src/backend/access/rmgrdesc/mxactdesc.c
@@ -50,17 +50,17 @@ void
 multixact_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == XLOG_MULTIXACT_ZERO_OFF_PAGE ||
-		info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
+	if (rminfo == XLOG_MULTIXACT_ZERO_OFF_PAGE ||
+		rminfo == XLOG_MULTIXACT_ZERO_MEM_PAGE)
 	{
 		int			pageno;
 
 		memcpy(&pageno, rec, sizeof(int));
 		appendStringInfo(buf, "%d", pageno);
 	}
-	else if (info == XLOG_MULTIXACT_CREATE_ID)
+	else if (rminfo == XLOG_MULTIXACT_CREATE_ID)
 	{
 		xl_multixact_create *xlrec = (xl_multixact_create *) rec;
 		int			i;
@@ -70,7 +70,7 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 		for (i = 0; i < xlrec->nmembers; i++)
 			out_member(buf, &xlrec->members[i]);
 	}
-	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	else if (rminfo == XLOG_MULTIXACT_TRUNCATE_ID)
 	{
 		xl_multixact_truncate *xlrec = (xl_multixact_truncate *) rec;
 
@@ -81,11 +81,11 @@ multixact_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-multixact_identify(uint8 info)
+multixact_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_MULTIXACT_ZERO_OFF_PAGE:
 			id = "ZERO_OFF_PAGE";
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index 4843cd530d..c5e92e9e6b 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -20,9 +20,9 @@ void
 btree_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
 		case XLOG_BTREE_INSERT_UPPER:
@@ -121,11 +121,11 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-btree_identify(uint8 info)
+btree_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
 			id = "INSERT_LEAF";
diff --git a/src/backend/access/rmgrdesc/relmapdesc.c b/src/backend/access/rmgrdesc/relmapdesc.c
index 43d63eb9a4..0f2d0d80fb 100644
--- a/src/backend/access/rmgrdesc/relmapdesc.c
+++ b/src/backend/access/rmgrdesc/relmapdesc.c
@@ -20,9 +20,9 @@ void
 relmap_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == XLOG_RELMAP_UPDATE)
+	if (rminfo == XLOG_RELMAP_UPDATE)
 	{
 		xl_relmap_update *xlrec = (xl_relmap_update *) rec;
 
@@ -32,11 +32,11 @@ relmap_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-relmap_identify(uint8 info)
+relmap_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_RELMAP_UPDATE:
 			id = "UPDATE";
diff --git a/src/backend/access/rmgrdesc/replorigindesc.c b/src/backend/access/rmgrdesc/replorigindesc.c
index e3213b1016..3bf3fde05f 100644
--- a/src/backend/access/rmgrdesc/replorigindesc.c
+++ b/src/backend/access/rmgrdesc/replorigindesc.c
@@ -19,9 +19,9 @@ void
 replorigin_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_REPLORIGIN_SET:
 			{
@@ -48,9 +48,9 @@ replorigin_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-replorigin_identify(uint8 info)
+replorigin_identify(uint8 rminfo)
 {
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_REPLORIGIN_SET:
 			return "SET";
diff --git a/src/backend/access/rmgrdesc/seqdesc.c b/src/backend/access/rmgrdesc/seqdesc.c
index b3845f93bf..2a4b8367ec 100644
--- a/src/backend/access/rmgrdesc/seqdesc.c
+++ b/src/backend/access/rmgrdesc/seqdesc.c
@@ -21,21 +21,21 @@ void
 seq_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	xl_seq_rec *xlrec = (xl_seq_rec *) rec;
 
-	if (info == XLOG_SEQ_LOG)
+	if (rminfo == XLOG_SEQ_LOG)
 		appendStringInfo(buf, "rel %u/%u/%u",
 						 xlrec->locator.spcOid, xlrec->locator.dbOid,
 						 xlrec->locator.relNumber);
 }
 
 const char *
-seq_identify(uint8 info)
+seq_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_SEQ_LOG:
 			id = "LOG";
diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index e0ee8a078a..944f3650cf 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -21,9 +21,9 @@ void
 smgr_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == XLOG_SMGR_CREATE)
+	if (rminfo == XLOG_SMGR_CREATE)
 	{
 		xl_smgr_create *xlrec = (xl_smgr_create *) rec;
 		char	   *path = relpathperm(xlrec->rlocator, xlrec->forkNum);
@@ -31,7 +31,7 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfoString(buf, path);
 		pfree(path);
 	}
-	else if (info == XLOG_SMGR_TRUNCATE)
+	else if (rminfo == XLOG_SMGR_TRUNCATE)
 	{
 		xl_smgr_truncate *xlrec = (xl_smgr_truncate *) rec;
 		char	   *path = relpathperm(xlrec->rlocator, MAIN_FORKNUM);
@@ -43,11 +43,11 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-smgr_identify(uint8 info)
+smgr_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_SMGR_CREATE:
 			id = "CREATE";
diff --git a/src/backend/access/rmgrdesc/spgdesc.c b/src/backend/access/rmgrdesc/spgdesc.c
index d5d921a42a..2afb827de4 100644
--- a/src/backend/access/rmgrdesc/spgdesc.c
+++ b/src/backend/access/rmgrdesc/spgdesc.c
@@ -20,9 +20,9 @@ void
 spg_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_SPGIST_ADD_LEAF:
 			{
@@ -128,11 +128,11 @@ spg_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-spg_identify(uint8 info)
+spg_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_SPGIST_ADD_LEAF:
 			id = "ADD_LEAF";
diff --git a/src/backend/access/rmgrdesc/standbydesc.c b/src/backend/access/rmgrdesc/standbydesc.c
index 2dba39e349..5a2e6a323c 100644
--- a/src/backend/access/rmgrdesc/standbydesc.c
+++ b/src/backend/access/rmgrdesc/standbydesc.c
@@ -40,9 +40,9 @@ void
 standby_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == XLOG_STANDBY_LOCK)
+	if (rminfo == XLOG_STANDBY_LOCK)
 	{
 		xl_standby_locks *xlrec = (xl_standby_locks *) rec;
 		int			i;
@@ -52,13 +52,13 @@ standby_desc(StringInfo buf, XLogReaderState *record)
 							 xlrec->locks[i].xid, xlrec->locks[i].dbOid,
 							 xlrec->locks[i].relOid);
 	}
-	else if (info == XLOG_RUNNING_XACTS)
+	else if (rminfo == XLOG_RUNNING_XACTS)
 	{
 		xl_running_xacts *xlrec = (xl_running_xacts *) rec;
 
 		standby_desc_running_xacts(buf, xlrec);
 	}
-	else if (info == XLOG_INVALIDATIONS)
+	else if (rminfo == XLOG_INVALIDATIONS)
 	{
 		xl_invalidations *xlrec = (xl_invalidations *) rec;
 
@@ -69,11 +69,11 @@ standby_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-standby_identify(uint8 info)
+standby_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_STANDBY_LOCK:
 			id = "LOCK";
diff --git a/src/backend/access/rmgrdesc/tblspcdesc.c b/src/backend/access/rmgrdesc/tblspcdesc.c
index ed94b6e2dd..b8c050fcba 100644
--- a/src/backend/access/rmgrdesc/tblspcdesc.c
+++ b/src/backend/access/rmgrdesc/tblspcdesc.c
@@ -21,15 +21,15 @@ void
 tblspc_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == XLOG_TBLSPC_CREATE)
+	if (rminfo == XLOG_TBLSPC_CREATE)
 	{
 		xl_tblspc_create_rec *xlrec = (xl_tblspc_create_rec *) rec;
 
 		appendStringInfo(buf, "%u \"%s\"", xlrec->ts_id, xlrec->ts_path);
 	}
-	else if (info == XLOG_TBLSPC_DROP)
+	else if (rminfo == XLOG_TBLSPC_DROP)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) rec;
 
@@ -38,11 +38,11 @@ tblspc_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-tblspc_identify(uint8 info)
+tblspc_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_TBLSPC_CREATE:
 			id = "CREATE";
diff --git a/src/backend/access/rmgrdesc/xactdesc.c b/src/backend/access/rmgrdesc/xactdesc.c
index 39752cf349..93a653c75a 100644
--- a/src/backend/access/rmgrdesc/xactdesc.c
+++ b/src/backend/access/rmgrdesc/xactdesc.c
@@ -32,7 +32,7 @@
  */
 
 void
-ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *parsed)
+ParseCommitRecord(uint8 rminfo, xl_xact_commit *xlrec, xl_xact_parsed_commit *parsed)
 {
 	char	   *data = ((char *) xlrec) + MinSizeOfXactCommit;
 
@@ -43,7 +43,7 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
 
 	parsed->xact_time = xlrec->xact_time;
 
-	if (info & XLOG_XACT_HAS_INFO)
+	if (rminfo & XLOG_XACT_HAS_INFO)
 	{
 		xl_xact_xinfo *xl_xinfo = (xl_xact_xinfo *) data;
 
@@ -138,7 +138,7 @@ ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *pars
 }
 
 void
-ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
+ParseAbortRecord(uint8 rminfo, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
 {
 	char	   *data = ((char *) xlrec) + MinSizeOfXactAbort;
 
@@ -149,7 +149,7 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
 
 	parsed->xact_time = xlrec->xact_time;
 
-	if (info & XLOG_XACT_HAS_INFO)
+	if (rminfo & XLOG_XACT_HAS_INFO)
 	{
 		xl_xact_xinfo *xl_xinfo = (xl_xact_xinfo *) data;
 
@@ -236,7 +236,7 @@ ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed)
  * ParsePrepareRecord
  */
 void
-ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *parsed)
+ParsePrepareRecord(uint8 rminfo, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *parsed)
 {
 	char	   *bufptr;
 
@@ -328,11 +328,11 @@ xact_desc_stats(StringInfo buf, const char *label,
 }
 
 static void
-xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec, RepOriginId origin_id)
+xact_desc_commit(StringInfo buf, uint8 rminfo, xl_xact_commit *xlrec, RepOriginId origin_id)
 {
 	xl_xact_parsed_commit parsed;
 
-	ParseCommitRecord(info, xlrec, &parsed);
+	ParseCommitRecord(rminfo, xlrec, &parsed);
 
 	/* If this is a prepared xact, show the xid of the original xact */
 	if (TransactionIdIsValid(parsed.twophase_xid))
@@ -364,11 +364,11 @@ xact_desc_commit(StringInfo buf, uint8 info, xl_xact_commit *xlrec, RepOriginId
 }
 
 static void
-xact_desc_abort(StringInfo buf, uint8 info, xl_xact_abort *xlrec, RepOriginId origin_id)
+xact_desc_abort(StringInfo buf, uint8 rminfo, xl_xact_abort *xlrec, RepOriginId origin_id)
 {
 	xl_xact_parsed_abort parsed;
 
-	ParseAbortRecord(info, xlrec, &parsed);
+	ParseAbortRecord(rminfo, xlrec, &parsed);
 
 	/* If this is a prepared xact, show the xid of the original xact */
 	if (TransactionIdIsValid(parsed.twophase_xid))
@@ -391,11 +391,11 @@ xact_desc_abort(StringInfo buf, uint8 info, xl_xact_abort *xlrec, RepOriginId or
 }
 
 static void
-xact_desc_prepare(StringInfo buf, uint8 info, xl_xact_prepare *xlrec, RepOriginId origin_id)
+xact_desc_prepare(StringInfo buf, uint8 rminfo, xl_xact_prepare *xlrec, RepOriginId origin_id)
 {
 	xl_xact_parsed_prepare parsed;
 
-	ParsePrepareRecord(info, xlrec, &parsed);
+	ParsePrepareRecord(rminfo, xlrec, &parsed);
 
 	appendStringInfo(buf, "gid %s: ", parsed.twophase_gid);
 	appendStringInfoString(buf, timestamptz_to_str(parsed.xact_time));
@@ -436,27 +436,27 @@ void
 xact_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;
+	uint8		info = XLogRecGetRmInfo(record) & XLOG_XACT_OPMASK;
 
 	if (info == XLOG_XACT_COMMIT || info == XLOG_XACT_COMMIT_PREPARED)
 	{
 		xl_xact_commit *xlrec = (xl_xact_commit *) rec;
 
-		xact_desc_commit(buf, XLogRecGetInfo(record), xlrec,
+		xact_desc_commit(buf, XLogRecGetRmInfo(record), xlrec,
 						 XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ABORT || info == XLOG_XACT_ABORT_PREPARED)
 	{
 		xl_xact_abort *xlrec = (xl_xact_abort *) rec;
 
-		xact_desc_abort(buf, XLogRecGetInfo(record), xlrec,
+		xact_desc_abort(buf, XLogRecGetRmInfo(record), xlrec,
 						XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_PREPARE)
 	{
 		xl_xact_prepare *xlrec = (xl_xact_prepare *) rec;
 
-		xact_desc_prepare(buf, XLogRecGetInfo(record), xlrec,
+		xact_desc_prepare(buf, XLogRecGetRmInfo(record), xlrec,
 						  XLogRecGetOrigin(record));
 	}
 	else if (info == XLOG_XACT_ASSIGNMENT)
@@ -481,11 +481,11 @@ xact_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-xact_identify(uint8 info)
+xact_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & XLOG_XACT_OPMASK)
+	switch (rminfo & XLOG_XACT_OPMASK)
 	{
 		case XLOG_XACT_COMMIT:
 			id = "COMMIT";
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 3fd7185f21..65ac642908 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -37,10 +37,10 @@ void
 xlog_desc(StringInfo buf, XLogReaderState *record)
 {
 	char	   *rec = XLogRecGetData(record);
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info == XLOG_CHECKPOINT_SHUTDOWN ||
-		info == XLOG_CHECKPOINT_ONLINE)
+	if (rminfo == XLOG_CHECKPOINT_SHUTDOWN ||
+		rminfo == XLOG_CHECKPOINT_ONLINE)
 	{
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
@@ -65,33 +65,33 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestCommitTsXid,
 						 checkpoint->newestCommitTsXid,
 						 checkpoint->oldestActiveXid,
-						 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
+						 (rminfo == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
 	}
-	else if (info == XLOG_NEXTOID)
+	else if (rminfo == XLOG_NEXTOID)
 	{
 		Oid			nextOid;
 
 		memcpy(&nextOid, rec, sizeof(Oid));
 		appendStringInfo(buf, "%u", nextOid);
 	}
-	else if (info == XLOG_RESTORE_POINT)
+	else if (rminfo == XLOG_RESTORE_POINT)
 	{
 		xl_restore_point *xlrec = (xl_restore_point *) rec;
 
 		appendStringInfoString(buf, xlrec->rp_name);
 	}
-	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
+	else if (rminfo == XLOG_FPI || rminfo == XLOG_FPI_FOR_HINT)
 	{
 		/* no further information to print */
 	}
-	else if (info == XLOG_BACKUP_END)
+	else if (rminfo == XLOG_BACKUP_END)
 	{
 		XLogRecPtr	startpoint;
 
 		memcpy(&startpoint, rec, sizeof(XLogRecPtr));
 		appendStringInfo(buf, "%X/%X", LSN_FORMAT_ARGS(startpoint));
 	}
-	else if (info == XLOG_PARAMETER_CHANGE)
+	else if (rminfo == XLOG_PARAMETER_CHANGE)
 	{
 		xl_parameter_change xlrec;
 		const char *wal_level_str;
@@ -123,14 +123,14 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec.wal_log_hints ? "on" : "off",
 						 xlrec.track_commit_timestamp ? "on" : "off");
 	}
-	else if (info == XLOG_FPW_CHANGE)
+	else if (rminfo == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
 
 		memcpy(&fpw, rec, sizeof(bool));
 		appendStringInfoString(buf, fpw ? "true" : "false");
 	}
-	else if (info == XLOG_END_OF_RECOVERY)
+	else if (rminfo == XLOG_END_OF_RECOVERY)
 	{
 		xl_end_of_recovery xlrec;
 
@@ -139,7 +139,7 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 xlrec.ThisTimeLineID, xlrec.PrevTimeLineID,
 						 timestamptz_to_str(xlrec.end_time));
 	}
-	else if (info == XLOG_OVERWRITE_CONTRECORD)
+	else if (rminfo == XLOG_OVERWRITE_CONTRECORD)
 	{
 		xl_overwrite_contrecord xlrec;
 
@@ -151,11 +151,11 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 }
 
 const char *
-xlog_identify(uint8 info)
+xlog_identify(uint8 rminfo)
 {
 	const char *id = NULL;
 
-	switch (info & ~XLR_INFO_MASK)
+	switch (rminfo)
 	{
 		case XLOG_CHECKPOINT_SHUTDOWN:
 			id = "CHECKPOINT_SHUTDOWN";
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index 4c9f4020ff..cb3b864227 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -938,11 +938,11 @@ spgRedoVacuumRedirect(XLogReaderState *record)
 void
 spg_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	MemoryContext oldCxt;
 
 	oldCxt = MemoryContextSwitchTo(opCtx);
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_SPGIST_ADD_LEAF:
 			spgRedoAddLeaf(record);
@@ -969,7 +969,7 @@ spg_redo(XLogReaderState *record)
 			spgRedoVacuumRedirect(record);
 			break;
 		default:
-			elog(PANIC, "spg_redo: unknown op code %u", info);
+			elog(PANIC, "spg_redo: unknown op code %u", rminfo);
 	}
 
 	MemoryContextSwitchTo(oldCxt);
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index a7dfcfb4da..5badd53418 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -986,12 +986,12 @@ WriteTruncateXlogRec(int pageno, TransactionId oldestXact, Oid oldestXactDb)
 void
 clog_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Backup blocks are not used in clog records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == CLOG_ZEROPAGE)
+	if (rminfo == CLOG_ZEROPAGE)
 	{
 		int			pageno;
 		int			slotno;
@@ -1006,7 +1006,7 @@ clog_redo(XLogReaderState *record)
 
 		LWLockRelease(XactSLRULock);
 	}
-	else if (info == CLOG_TRUNCATE)
+	else if (rminfo == CLOG_TRUNCATE)
 	{
 		xl_clog_truncate xlrec;
 
@@ -1017,7 +1017,7 @@ clog_redo(XLogReaderState *record)
 		SimpleLruTruncate(XactCtl, xlrec.pageno);
 	}
 	else
-		elog(PANIC, "clog_redo: unknown op code %u", info);
+		elog(PANIC, "clog_redo: unknown op code %u", rminfo);
 }
 
 /*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 9aa4675cb7..43129c7a6f 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -984,12 +984,12 @@ WriteTruncateXlogRec(int pageno, TransactionId oldestXid)
 void
 commit_ts_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Backup blocks are not used in commit_ts records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == COMMIT_TS_ZEROPAGE)
+	if (rminfo == COMMIT_TS_ZEROPAGE)
 	{
 		int			pageno;
 		int			slotno;
@@ -1004,7 +1004,7 @@ commit_ts_redo(XLogReaderState *record)
 
 		LWLockRelease(CommitTsSLRULock);
 	}
-	else if (info == COMMIT_TS_TRUNCATE)
+	else if (rminfo == COMMIT_TS_TRUNCATE)
 	{
 		xl_commit_ts_truncate *trunc = (xl_commit_ts_truncate *) XLogRecGetData(record);
 
@@ -1019,7 +1019,7 @@ commit_ts_redo(XLogReaderState *record)
 		SimpleLruTruncate(CommitTsCtl, trunc->pageno);
 	}
 	else
-		elog(PANIC, "commit_ts_redo: unknown op code %u", info);
+		elog(PANIC, "commit_ts_redo: unknown op code %u", rminfo);
 }
 
 /*
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index a7383f553b..ca6e238542 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -365,7 +365,7 @@ static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 									 MultiXactOffset start, uint32 distance);
 static bool SetOffsetVacuumLimit(bool is_startup);
 static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
-static void WriteMZeroPageXlogRec(int pageno, uint8 info);
+static void WriteMZeroPageXlogRec(int pageno, uint8 rminfo);
 static void WriteMTruncateXlogRec(Oid oldestMultiDB,
 								  MultiXactId startTruncOff,
 								  MultiXactId endTruncOff,
@@ -3193,11 +3193,11 @@ MultiXactOffsetPrecedes(MultiXactOffset offset1, MultiXactOffset offset2)
  * OFFSETs page (info shows which)
  */
 static void
-WriteMZeroPageXlogRec(int pageno, uint8 info)
+WriteMZeroPageXlogRec(int pageno, uint8 rminfo)
 {
 	XLogBeginInsert();
 	XLogRegisterData((char *) (&pageno), sizeof(int));
-	(void) XLogInsert(RM_MULTIXACT_ID, info);
+	(void) XLogInsert(RM_MULTIXACT_ID, rminfo);
 }
 
 /*
@@ -3234,12 +3234,12 @@ WriteMTruncateXlogRec(Oid oldestMultiDB,
 void
 multixact_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Backup blocks are not used in multixact records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_MULTIXACT_ZERO_OFF_PAGE)
+	if (rminfo == XLOG_MULTIXACT_ZERO_OFF_PAGE)
 	{
 		int			pageno;
 		int			slotno;
@@ -3254,7 +3254,7 @@ multixact_redo(XLogReaderState *record)
 
 		LWLockRelease(MultiXactOffsetSLRULock);
 	}
-	else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
+	else if (rminfo == XLOG_MULTIXACT_ZERO_MEM_PAGE)
 	{
 		int			pageno;
 		int			slotno;
@@ -3269,7 +3269,7 @@ multixact_redo(XLogReaderState *record)
 
 		LWLockRelease(MultiXactMemberSLRULock);
 	}
-	else if (info == XLOG_MULTIXACT_CREATE_ID)
+	else if (rminfo == XLOG_MULTIXACT_CREATE_ID)
 	{
 		xl_multixact_create *xlrec =
 		(xl_multixact_create *) XLogRecGetData(record);
@@ -3298,7 +3298,7 @@ multixact_redo(XLogReaderState *record)
 
 		AdvanceNextFullTransactionIdPastXid(max_xid);
 	}
-	else if (info == XLOG_MULTIXACT_TRUNCATE_ID)
+	else if (rminfo == XLOG_MULTIXACT_TRUNCATE_ID)
 	{
 		xl_multixact_truncate xlrec;
 		int			pageno;
@@ -3339,7 +3339,7 @@ multixact_redo(XLogReaderState *record)
 		LWLockRelease(MultiXactTruncationLock);
 	}
 	else
-		elog(PANIC, "multixact_redo: unknown op code %u", info);
+		elog(PANIC, "multixact_redo: unknown op code %u", rminfo);
 }
 
 Datum
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 803d169f57..98c1ef7ec5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1429,7 +1429,7 @@ XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len)
 	}
 
 	if (XLogRecGetRmid(xlogreader) != RM_XACT_ID ||
-		(XLogRecGetInfo(xlogreader) & XLOG_XACT_OPMASK) != XLOG_XACT_PREPARE)
+		(XLogRecGetRmInfo(xlogreader) & XLOG_XACT_OPMASK) != XLOG_XACT_PREPARE)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("expected two-phase state data is not present in WAL at %X/%X",
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c1ffbd89b8..8ce44abfcf 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5624,7 +5624,8 @@ XactLogCommitRecord(TimestampTz commit_time,
 	xl_xact_invals xl_invals;
 	xl_xact_twophase xl_twophase;
 	xl_xact_origin xl_origin;
-	uint8		info;
+	uint8		rminfo;
+	uint8		info = 0;
 
 	Assert(CritSectionCount > 0);
 
@@ -5632,9 +5633,9 @@ XactLogCommitRecord(TimestampTz commit_time,
 
 	/* decide between a plain and 2pc commit */
 	if (!TransactionIdIsValid(twophase_xid))
-		info = XLOG_XACT_COMMIT;
+		rminfo = XLOG_XACT_COMMIT;
 	else
-		info = XLOG_XACT_COMMIT_PREPARED;
+		rminfo = XLOG_XACT_COMMIT_PREPARED;
 
 	/* First figure out and collect all the information needed */
 
@@ -5675,7 +5676,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
 		xl_relfilelocators.nrels = nrels;
-		info |= XLR_SPECIAL_REL_UPDATE;
+		info = XLR_SPECIAL_REL_UPDATE;
 	}
 
 	if (ndroppedstats > 0)
@@ -5710,7 +5711,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	}
 
 	if (xl_xinfo.xinfo != 0)
-		info |= XLOG_XACT_HAS_INFO;
+		rminfo |= XLOG_XACT_HAS_INFO;
 
 	/* Then include all the collected data into the commit record. */
 
@@ -5768,7 +5769,7 @@ XactLogCommitRecord(TimestampTz commit_time,
 	/* we allow filtering by xacts */
 	XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-	return XLogInsert(RM_XACT_ID, info);
+	return XLogInsertExtended(RM_XACT_ID, info, rminfo);
 }
 
 /*
@@ -5794,7 +5795,8 @@ XactLogAbortRecord(TimestampTz abort_time,
 	xl_xact_dbinfo xl_dbinfo;
 	xl_xact_origin xl_origin;
 
-	uint8		info;
+	uint8		rminfo;
+	uint8		info = 0;
 
 	Assert(CritSectionCount > 0);
 
@@ -5802,9 +5804,9 @@ XactLogAbortRecord(TimestampTz abort_time,
 
 	/* decide between a plain and 2pc abort */
 	if (!TransactionIdIsValid(twophase_xid))
-		info = XLOG_XACT_ABORT;
+		rminfo = XLOG_XACT_ABORT;
 	else
-		info = XLOG_XACT_ABORT_PREPARED;
+		rminfo = XLOG_XACT_ABORT_PREPARED;
 
 
 	/* First figure out and collect all the information needed */
@@ -5824,7 +5826,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 	{
 		xl_xinfo.xinfo |= XACT_XINFO_HAS_RELFILELOCATORS;
 		xl_relfilelocators.nrels = nrels;
-		info |= XLR_SPECIAL_REL_UPDATE;
+		info = XLR_SPECIAL_REL_UPDATE;
 	}
 
 	if (ndroppedstats > 0)
@@ -5864,7 +5866,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 	}
 
 	if (xl_xinfo.xinfo != 0)
-		info |= XLOG_XACT_HAS_INFO;
+		rminfo |= XLOG_XACT_HAS_INFO;
 
 	/* Then include all the collected data into the abort record. */
 
@@ -5915,7 +5917,7 @@ XactLogAbortRecord(TimestampTz abort_time,
 	if (TransactionIdIsValid(twophase_xid))
 		XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
 
-	return XLogInsert(RM_XACT_ID, info);
+	return XLogInsertExtended(RM_XACT_ID, info, rminfo);
 }
 
 /*
@@ -6158,7 +6160,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
 void
 xact_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;
+	uint8		info = XLogRecGetRmInfo(record) & XLOG_XACT_OPMASK;
 
 	/* Backup blocks are not used in xact records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
@@ -6168,7 +6170,7 @@ xact_redo(XLogReaderState *record)
 		xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record);
 		xl_xact_parsed_commit parsed;
 
-		ParseCommitRecord(XLogRecGetInfo(record), xlrec, &parsed);
+		ParseCommitRecord(XLogRecGetRmInfo(record), xlrec, &parsed);
 		xact_redo_commit(&parsed, XLogRecGetXid(record),
 						 record->EndRecPtr, XLogRecGetOrigin(record));
 	}
@@ -6177,7 +6179,7 @@ xact_redo(XLogReaderState *record)
 		xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record);
 		xl_xact_parsed_commit parsed;
 
-		ParseCommitRecord(XLogRecGetInfo(record), xlrec, &parsed);
+		ParseCommitRecord(XLogRecGetRmInfo(record), xlrec, &parsed);
 		xact_redo_commit(&parsed, parsed.twophase_xid,
 						 record->EndRecPtr, XLogRecGetOrigin(record));
 
@@ -6191,7 +6193,7 @@ xact_redo(XLogReaderState *record)
 		xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(record);
 		xl_xact_parsed_abort parsed;
 
-		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
+		ParseAbortRecord(XLogRecGetRmInfo(record), xlrec, &parsed);
 		xact_redo_abort(&parsed, XLogRecGetXid(record),
 						record->EndRecPtr, XLogRecGetOrigin(record));
 	}
@@ -6200,7 +6202,7 @@ xact_redo(XLogReaderState *record)
 		xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(record);
 		xl_xact_parsed_abort parsed;
 
-		ParseAbortRecord(XLogRecGetInfo(record), xlrec, &parsed);
+		ParseAbortRecord(XLogRecGetRmInfo(record), xlrec, &parsed);
 		xact_redo_abort(&parsed, parsed.twophase_xid,
 						record->EndRecPtr, XLogRecGetOrigin(record));
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 27085b15a8..3058041683 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -742,9 +742,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	pg_crc32c	rdata_crc;
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
-	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
+	uint8		rminfo = rechdr->xl_rminfo;
 	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   info == XLOG_SWITCH);
+							   rminfo == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 	bool		prevDoPageWrites = doPageWrites;
@@ -7728,17 +7728,17 @@ UpdateFullPageWrites(void)
 void
 xlog_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	XLogRecPtr	lsn = record->EndRecPtr;
 
 	/*
 	 * In XLOG rmgr, backup blocks are only used by XLOG_FPI and
 	 * XLOG_FPI_FOR_HINT records.
 	 */
-	Assert(info == XLOG_FPI || info == XLOG_FPI_FOR_HINT ||
+	Assert(rminfo == XLOG_FPI || rminfo == XLOG_FPI_FOR_HINT ||
 		   !XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_NEXTOID)
+	if (rminfo == XLOG_NEXTOID)
 	{
 		Oid			nextOid;
 
@@ -7755,7 +7755,7 @@ xlog_redo(XLogReaderState *record)
 		ShmemVariableCache->oidCount = 0;
 		LWLockRelease(OidGenLock);
 	}
-	else if (info == XLOG_CHECKPOINT_SHUTDOWN)
+	else if (rminfo == XLOG_CHECKPOINT_SHUTDOWN)
 	{
 		CheckPoint	checkPoint;
 		TimeLineID	replayTLI;
@@ -7852,7 +7852,7 @@ xlog_redo(XLogReaderState *record)
 
 		RecoveryRestartPoint(&checkPoint, record);
 	}
-	else if (info == XLOG_CHECKPOINT_ONLINE)
+	else if (rminfo == XLOG_CHECKPOINT_ONLINE)
 	{
 		CheckPoint	checkPoint;
 		TimeLineID	replayTLI;
@@ -7910,11 +7910,11 @@ xlog_redo(XLogReaderState *record)
 
 		RecoveryRestartPoint(&checkPoint, record);
 	}
-	else if (info == XLOG_OVERWRITE_CONTRECORD)
+	else if (rminfo == XLOG_OVERWRITE_CONTRECORD)
 	{
 		/* nothing to do here, handled in xlogrecovery_redo() */
 	}
-	else if (info == XLOG_END_OF_RECOVERY)
+	else if (rminfo == XLOG_END_OF_RECOVERY)
 	{
 		xl_end_of_recovery xlrec;
 		TimeLineID	replayTLI;
@@ -7937,19 +7937,19 @@ xlog_redo(XLogReaderState *record)
 					(errmsg("unexpected timeline ID %u (should be %u) in end-of-recovery record",
 							xlrec.ThisTimeLineID, replayTLI)));
 	}
-	else if (info == XLOG_NOOP)
+	else if (rminfo == XLOG_NOOP)
 	{
 		/* nothing to do here */
 	}
-	else if (info == XLOG_SWITCH)
+	else if (rminfo == XLOG_SWITCH)
 	{
 		/* nothing to do here */
 	}
-	else if (info == XLOG_RESTORE_POINT)
+	else if (rminfo == XLOG_RESTORE_POINT)
 	{
 		/* nothing to do here, handled in xlogrecovery.c */
 	}
-	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
+	else if (rminfo == XLOG_FPI || rminfo == XLOG_FPI_FOR_HINT)
 	{
 		/*
 		 * XLOG_FPI records contain nothing else but one or more block
@@ -7973,7 +7973,7 @@ xlog_redo(XLogReaderState *record)
 
 			if (!XLogRecHasBlockImage(record, block_id))
 			{
-				if (info == XLOG_FPI)
+				if (rminfo == XLOG_FPI)
 					elog(ERROR, "XLOG_FPI record did not contain a full-page image");
 				continue;
 			}
@@ -7983,11 +7983,11 @@ xlog_redo(XLogReaderState *record)
 			UnlockReleaseBuffer(buffer);
 		}
 	}
-	else if (info == XLOG_BACKUP_END)
+	else if (rminfo == XLOG_BACKUP_END)
 	{
 		/* nothing to do here, handled in xlogrecovery_redo() */
 	}
-	else if (info == XLOG_PARAMETER_CHANGE)
+	else if (rminfo == XLOG_PARAMETER_CHANGE)
 	{
 		xl_parameter_change xlrec;
 
@@ -8035,7 +8035,7 @@ xlog_redo(XLogReaderState *record)
 		/* Check to see if any parameter change gives a problem on recovery */
 		CheckRequiredParameterValues();
 	}
-	else if (info == XLOG_FPW_CHANGE)
+	else if (rminfo == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
 
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 5ca15ebbf2..cc4a262d8e 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -135,7 +135,7 @@ static bool begininsert_called = false;
 /* Memory context to hold the registered buffer and data references. */
 static MemoryContext xloginsert_cxt;
 
-static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info,
+static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 									   XLogRecPtr RedoRecPtr, bool doPageWrites,
 									   XLogRecPtr *fpw_lsn, int *num_fpi,
 									   bool *topxid_included);
@@ -436,8 +436,9 @@ XLogSetRecordFlags(uint8 flags)
 	curinsert_flags |= flags;
 }
 
+
 /*
- * Insert an XLOG record having the specified RMID and info bytes, with the
+ * Insert an XLOG record having the specified RMID and rminfo bytes, with the
  * body of the record being the data and buffer references registered earlier
  * with XLogRegister* calls.
  *
@@ -448,7 +449,21 @@ XLogSetRecordFlags(uint8 flags)
  * WAL rule "write the log before the data".)
  */
 XLogRecPtr
-XLogInsert(RmgrId rmid, uint8 info)
+XLogInsert(RmgrId rmid, uint8 rminfo)
+{
+	return XLogInsertExtended(rmid, 0, rminfo);
+}
+
+
+/*
+ * Insert an XLOG record having the specified RMID, info and rminfo bytes,
+ * with the body of the record being the data and buffer references
+ * registered earlier with XLogRegister* calls.
+ *
+ * See also XLogInsert above.
+ */
+XLogRecPtr
+XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo)
 {
 	XLogRecPtr	EndPos;
 
@@ -457,11 +472,10 @@ XLogInsert(RmgrId rmid, uint8 info)
 		elog(ERROR, "XLogBeginInsert was not called");
 
 	/*
-	 * The caller can set rmgr bits, XLR_SPECIAL_REL_UPDATE and
+	 * The caller can set XLR_SPECIAL_REL_UPDATE and
 	 * XLR_CHECK_CONSISTENCY; the rest are reserved for use by me.
 	 */
-	if ((info & ~(XLR_RMGR_INFO_MASK |
-				  XLR_SPECIAL_REL_UPDATE |
+	if ((info & ~(XLR_SPECIAL_REL_UPDATE |
 				  XLR_CHECK_CONSISTENCY)) != 0)
 		elog(PANIC, "invalid xlog info mask %02X", info);
 
@@ -494,7 +508,7 @@ XLogInsert(RmgrId rmid, uint8 info)
 		 */
 		GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
 
-		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
+		rdt = XLogRecordAssemble(rmid, info, rminfo, RedoRecPtr, doPageWrites,
 								 &fpw_lsn, &num_fpi, &topxid_included);
 
 		EndPos = XLogInsertRecord(rdt, fpw_lsn, curinsert_flags, num_fpi,
@@ -522,7 +536,7 @@ XLogInsert(RmgrId rmid, uint8 info)
  * current subtransaction.
  */
 static XLogRecData *
-XLogRecordAssemble(RmgrId rmid, uint8 info,
+XLogRecordAssemble(RmgrId rmid, uint8 info, uint8 rminfo,
 				   XLogRecPtr RedoRecPtr, bool doPageWrites,
 				   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
 {
@@ -881,6 +895,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
 	rechdr->xl_tot_len = total_len;
 	rechdr->xl_info = info;
 	rechdr->xl_rmid = rmid;
+	rechdr->xl_rminfo = rminfo;
 	rechdr->xl_prev = InvalidXLogRecPtr;
 	rechdr->xl_crc = rdata_crc;
 
diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 1cbac4b7f6..e5d4f36182 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -536,7 +536,7 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
 		if (replaying_lsn < record->lsn)
 		{
 			uint8		rmid = record->header.xl_rmid;
-			uint8		record_type = record->header.xl_info & ~XLR_INFO_MASK;
+			uint8		record_type = record->header.xl_rminfo;
 
 			if (rmid == RM_XLOG_ID)
 			{
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5a8fe81f82..9200d7f56c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -859,7 +859,7 @@ restart:
 	 * Special processing if it's an XLOG SWITCH record
 	 */
 	if (record->xl_rmid == RM_XLOG_ID &&
-		(record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
+		record->xl_rminfo == XLOG_SWITCH)
 	{
 		/* Pretend it extends to end of segment */
 		state->NextRecPtr += state->segcxt.ws_segsize - 1;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cb07694aea..96d12994b5 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -614,7 +614,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		if (record != NULL)
 		{
 			memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-			wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+			wasShutdown = (record->xl_rminfo == XLOG_CHECKPOINT_SHUTDOWN);
 			ereport(DEBUG1,
 					(errmsg_internal("checkpoint record is at %X/%X",
 									 LSN_FORMAT_ARGS(CheckPointLoc))));
@@ -768,7 +768,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 					(errmsg("could not locate a valid checkpoint record")));
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
-		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+		wasShutdown = (record->xl_rminfo == XLOG_CHECKPOINT_SHUTDOWN);
 	}
 
 	/*
@@ -1839,9 +1839,9 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	{
 		TimeLineID	newReplayTLI = *replayTLI;
 		TimeLineID	prevReplayTLI = *replayTLI;
-		uint8		info = record->xl_info & ~XLR_INFO_MASK;
+		uint8		rminfo = record->xl_rminfo;
 
-		if (info == XLOG_CHECKPOINT_SHUTDOWN)
+		if (rminfo == XLOG_CHECKPOINT_SHUTDOWN)
 		{
 			CheckPoint	checkPoint;
 
@@ -1849,7 +1849,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 			newReplayTLI = checkPoint.ThisTimeLineID;
 			prevReplayTLI = checkPoint.PrevTimeLineID;
 		}
-		else if (info == XLOG_END_OF_RECOVERY)
+		else if (rminfo == XLOG_END_OF_RECOVERY)
 		{
 			xl_end_of_recovery xlrec;
 
@@ -1958,12 +1958,12 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 static void
 xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	XLogRecPtr	lsn = record->EndRecPtr;
 
 	Assert(XLogRecGetRmid(record) == RM_XLOG_ID);
 
-	if (info == XLOG_OVERWRITE_CONTRECORD)
+	if (rminfo == XLOG_OVERWRITE_CONTRECORD)
 	{
 		/* Verify the payload of a XLOG_OVERWRITE_CONTRECORD record. */
 		xl_overwrite_contrecord xlrec;
@@ -1986,7 +1986,7 @@ xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI)
 		/* Verifying the record should only happen once */
 		record->overwrittenRecPtr = InvalidXLogRecPtr;
 	}
-	else if (info == XLOG_BACKUP_END)
+	else if (rminfo == XLOG_BACKUP_END)
 	{
 		XLogRecPtr	startpoint;
 
@@ -2176,15 +2176,15 @@ void
 xlog_outdesc(StringInfo buf, XLogReaderState *record)
 {
 	RmgrData	rmgr = GetRmgr(XLogRecGetRmid(record));
-	uint8		info = XLogRecGetInfo(record);
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	const char *id;
 
 	appendStringInfoString(buf, rmgr.rm_name);
 	appendStringInfoChar(buf, '/');
 
-	id = rmgr.rm_identify(info);
+	id = rmgr.rm_identify(rminfo);
 	if (id == NULL)
-		appendStringInfo(buf, "UNKNOWN (%X): ", info & ~XLR_INFO_MASK);
+		appendStringInfo(buf, "UNKNOWN (%X): ", rminfo);
 	else
 		appendStringInfo(buf, "%s: ", id);
 
@@ -2304,11 +2304,11 @@ checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI, TimeLineID prevTLI,
 static bool
 getRecordTimestamp(XLogReaderState *record, TimestampTz *recordXtime)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
-	uint8		xact_info = info & XLOG_XACT_OPMASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
+	uint8		xact_info = rminfo & XLOG_XACT_OPMASK;
 	uint8		rmid = XLogRecGetRmid(record);
 
-	if (rmid == RM_XLOG_ID && info == XLOG_RESTORE_POINT)
+	if (rmid == RM_XLOG_ID && rminfo == XLOG_RESTORE_POINT)
 	{
 		*recordXtime = ((xl_restore_point *) XLogRecGetData(record))->rp_time;
 		return true;
@@ -2498,7 +2498,7 @@ recoveryStopsBefore(XLogReaderState *record)
 	if (XLogRecGetRmid(record) != RM_XACT_ID)
 		return false;
 
-	xact_info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;
+	xact_info = XLogRecGetRmInfo(record) & XLOG_XACT_OPMASK;
 
 	if (xact_info == XLOG_XACT_COMMIT)
 	{
@@ -2511,7 +2511,7 @@ recoveryStopsBefore(XLogReaderState *record)
 		xl_xact_parsed_commit parsed;
 
 		isCommit = true;
-		ParseCommitRecord(XLogRecGetInfo(record),
+		ParseCommitRecord(XLogRecGetRmInfo(record),
 						  xlrec,
 						  &parsed);
 		recordXid = parsed.twophase_xid;
@@ -2527,7 +2527,7 @@ recoveryStopsBefore(XLogReaderState *record)
 		xl_xact_parsed_abort parsed;
 
 		isCommit = false;
-		ParseAbortRecord(XLogRecGetInfo(record),
+		ParseAbortRecord(XLogRecGetRmInfo(record),
 						 xlrec,
 						 &parsed);
 		recordXid = parsed.twophase_xid;
@@ -2599,7 +2599,7 @@ recoveryStopsBefore(XLogReaderState *record)
 static bool
 recoveryStopsAfter(XLogReaderState *record)
 {
-	uint8		info;
+	uint8		rminfo;
 	uint8		xact_info;
 	uint8		rmid;
 	TimestampTz recordXtime;
@@ -2611,7 +2611,7 @@ recoveryStopsAfter(XLogReaderState *record)
 	if (!ArchiveRecoveryRequested)
 		return false;
 
-	info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	rminfo = XLogRecGetRmInfo(record);
 	rmid = XLogRecGetRmid(record);
 
 	/*
@@ -2619,7 +2619,7 @@ recoveryStopsAfter(XLogReaderState *record)
 	 * the first one.
 	 */
 	if (recoveryTarget == RECOVERY_TARGET_NAME &&
-		rmid == RM_XLOG_ID && info == XLOG_RESTORE_POINT)
+		rmid == RM_XLOG_ID && rminfo == XLOG_RESTORE_POINT)
 	{
 		xl_restore_point *recordRestorePointData;
 
@@ -2660,7 +2660,7 @@ recoveryStopsAfter(XLogReaderState *record)
 	if (rmid != RM_XACT_ID)
 		return false;
 
-	xact_info = info & XLOG_XACT_OPMASK;
+	xact_info = rminfo & XLOG_XACT_OPMASK;
 
 	if (xact_info == XLOG_XACT_COMMIT ||
 		xact_info == XLOG_XACT_COMMIT_PREPARED ||
@@ -2689,7 +2689,7 @@ recoveryStopsAfter(XLogReaderState *record)
 			xl_xact_abort *xlrec = (xl_xact_abort *) XLogRecGetData(record);
 			xl_xact_parsed_abort parsed;
 
-			ParseAbortRecord(XLogRecGetInfo(record),
+			ParseAbortRecord(XLogRecGetRmInfo(record),
 							 xlrec,
 							 &parsed);
 			recordXid = parsed.twophase_xid;
@@ -2883,7 +2883,7 @@ recoveryApplyDelay(XLogReaderState *record)
 	if (XLogRecGetRmid(record) != RM_XACT_ID)
 		return false;
 
-	xact_info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;
+	xact_info = XLogRecGetRmInfo(record) & XLOG_XACT_OPMASK;
 
 	if (xact_info != XLOG_XACT_COMMIT &&
 		xact_info != XLOG_XACT_COMMIT_PREPARED)
@@ -3917,7 +3917,7 @@ ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 					 TimeLineID replayTLI)
 {
 	XLogRecord *record;
-	uint8		info;
+	uint8		rminfo;
 
 	Assert(xlogreader != NULL);
 
@@ -3943,9 +3943,9 @@ ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 				(errmsg("invalid resource manager ID in checkpoint record")));
 		return NULL;
 	}
-	info = record->xl_info & ~XLR_INFO_MASK;
-	if (info != XLOG_CHECKPOINT_SHUTDOWN &&
-		info != XLOG_CHECKPOINT_ONLINE)
+	rminfo = record->xl_rminfo;
+	if (rminfo != XLOG_CHECKPOINT_SHUTDOWN &&
+		rminfo != XLOG_CHECKPOINT_ONLINE)
 	{
 		ereport(LOG,
 				(errmsg("invalid xl_info in checkpoint record")));
diff --git a/src/backend/access/transam/xlogstats.c b/src/backend/access/transam/xlogstats.c
index 514181792d..5ff1a58134 100644
--- a/src/backend/access/transam/xlogstats.c
+++ b/src/backend/access/transam/xlogstats.c
@@ -79,7 +79,7 @@ XLogRecStoreStats(XLogStats *stats, XLogReaderState *record)
 	 * RmgrId).
 	 */
 
-	recid = XLogRecGetInfo(record) >> 4;
+	recid = XLogRecGetRmInfo(record);
 
 	/*
 	 * XACT records need to be handled differently. Those records use the
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index d708af19ed..4ef46a1855 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -194,7 +194,7 @@ log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum)
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-	XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE | XLR_SPECIAL_REL_UPDATE);
+	XLogInsertExtended(RM_SMGR_ID, XLR_SPECIAL_REL_UPDATE, XLOG_SMGR_CREATE);
 }
 
 /*
@@ -375,8 +375,9 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 
-		lsn = XLogInsert(RM_SMGR_ID,
-						 XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+		lsn = XLogInsertExtended(RM_SMGR_ID,
+								 XLR_SPECIAL_REL_UPDATE,
+								 XLOG_SMGR_TRUNCATE);
 
 		/*
 		 * Flush, because otherwise the truncation of the main relation might
@@ -958,12 +959,12 @@ void
 smgr_redo(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Backup blocks are not used in smgr records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_SMGR_CREATE)
+	if (rminfo == XLOG_SMGR_CREATE)
 	{
 		xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record);
 		SMgrRelation reln;
@@ -971,7 +972,7 @@ smgr_redo(XLogReaderState *record)
 		reln = smgropen(xlrec->rlocator, InvalidBackendId);
 		smgrcreate(reln, xlrec->forkNum, true);
 	}
-	else if (info == XLOG_SMGR_TRUNCATE)
+	else if (rminfo == XLOG_SMGR_TRUNCATE)
 	{
 		xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record);
 		SMgrRelation reln;
@@ -1060,5 +1061,5 @@ smgr_redo(XLogReaderState *record)
 		FreeFakeRelcacheEntry(rel);
 	}
 	else
-		elog(PANIC, "smgr_redo: unknown op code %u", info);
+		elog(PANIC, "smgr_redo: unknown op code %u", rminfo);
 }
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 96b46cbc02..25c4672917 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -624,8 +624,9 @@ CreateDatabaseUsingFileCopy(Oid src_dboid, Oid dst_dboid, Oid src_tsid,
 			XLogRegisterData((char *) &xlrec,
 							 sizeof(xl_dbase_create_file_copy_rec));
 
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+			(void) XLogInsertExtended(RM_DBASE_ID,
+									  XLR_SPECIAL_REL_UPDATE,
+									  XLOG_DBASE_CREATE_FILE_COPY);
 		}
 		pfree(srcpath);
 		pfree(dstpath);
@@ -2021,8 +2022,9 @@ movedb(const char *dbname, const char *tblspcname)
 			XLogRegisterData((char *) &xlrec,
 							 sizeof(xl_dbase_create_file_copy_rec));
 
-			(void) XLogInsert(RM_DBASE_ID,
-							  XLOG_DBASE_CREATE_FILE_COPY | XLR_SPECIAL_REL_UPDATE);
+			(void) XLogInsertExtended(RM_DBASE_ID,
+									  XLR_SPECIAL_REL_UPDATE,
+									  XLOG_DBASE_CREATE_FILE_COPY);
 		}
 
 		/*
@@ -2115,8 +2117,9 @@ movedb(const char *dbname, const char *tblspcname)
 		XLogRegisterData((char *) &xlrec, sizeof(xl_dbase_drop_rec));
 		XLogRegisterData((char *) &src_tblspcoid, sizeof(Oid));
 
-		(void) XLogInsert(RM_DBASE_ID,
-						  XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
+		(void) XLogInsertExtended(RM_DBASE_ID,
+								  XLR_SPECIAL_REL_UPDATE,
+								  XLOG_DBASE_DROP);
 	}
 
 	/* Now it's safe to release the database lock */
@@ -2834,8 +2837,9 @@ remove_dbtablespaces(Oid db_id)
 		XLogRegisterData((char *) &xlrec, MinSizeOfDbaseDropRec);
 		XLogRegisterData((char *) tablespace_ids, ntblspc * sizeof(Oid));
 
-		(void) XLogInsert(RM_DBASE_ID,
-						  XLOG_DBASE_DROP | XLR_SPECIAL_REL_UPDATE);
+		(void) XLogInsertExtended(RM_DBASE_ID,
+								  XLR_SPECIAL_REL_UPDATE,
+								  XLOG_DBASE_DROP);
 	}
 
 	list_free(ltblspc);
@@ -3040,12 +3044,12 @@ recovery_create_dbdir(char *path, bool only_tblspc)
 void
 dbase_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Backup blocks are not used in dbase records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_DBASE_CREATE_FILE_COPY)
+	if (rminfo == XLOG_DBASE_CREATE_FILE_COPY)
 	{
 		xl_dbase_create_file_copy_rec *xlrec =
 		(xl_dbase_create_file_copy_rec *) XLogRecGetData(record);
@@ -3117,7 +3121,7 @@ dbase_redo(XLogReaderState *record)
 		pfree(src_path);
 		pfree(dst_path);
 	}
-	else if (info == XLOG_DBASE_CREATE_WAL_LOG)
+	else if (rminfo == XLOG_DBASE_CREATE_WAL_LOG)
 	{
 		xl_dbase_create_wal_log_rec *xlrec =
 		(xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
@@ -3136,7 +3140,7 @@ dbase_redo(XLogReaderState *record)
 								true);
 		pfree(dbpath);
 	}
-	else if (info == XLOG_DBASE_DROP)
+	else if (rminfo == XLOG_DBASE_DROP)
 	{
 		xl_dbase_drop_rec *xlrec = (xl_dbase_drop_rec *) XLogRecGetData(record);
 		char	   *dst_path;
@@ -3198,5 +3202,5 @@ dbase_redo(XLogReaderState *record)
 		}
 	}
 	else
-		elog(PANIC, "dbase_redo: unknown op code %u", info);
+		elog(PANIC, "dbase_redo: unknown op code %u", rminfo);
 }
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index 99c9f91cba..56221903b7 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -1845,7 +1845,7 @@ void
 seq_redo(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	Buffer		buffer;
 	Page		page;
 	Page		localpage;
@@ -1854,8 +1854,8 @@ seq_redo(XLogReaderState *record)
 	xl_seq_rec *xlrec = (xl_seq_rec *) XLogRecGetData(record);
 	sequence_magic *sm;
 
-	if (info != XLOG_SEQ_LOG)
-		elog(PANIC, "seq_redo: unknown op code %u", info);
+	if (rminfo != XLOG_SEQ_LOG)
+		elog(PANIC, "seq_redo: unknown op code %u", rminfo);
 
 	buffer = XLogInitBufferForRedo(record, 0);
 	page = (Page) BufferGetPage(buffer);
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index b69ff37dbb..d9c89bbbf9 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -1516,19 +1516,19 @@ get_tablespace_name(Oid spc_oid)
 void
 tblspc_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Backup blocks are not used in tblspc records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_TBLSPC_CREATE)
+	if (rminfo == XLOG_TBLSPC_CREATE)
 	{
 		xl_tblspc_create_rec *xlrec = (xl_tblspc_create_rec *) XLogRecGetData(record);
 		char	   *location = xlrec->ts_path;
 
 		create_tablespace_directories(location, xlrec->ts_id);
 	}
-	else if (info == XLOG_TBLSPC_DROP)
+	else if (rminfo == XLOG_TBLSPC_DROP)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
@@ -1571,5 +1571,5 @@ tblspc_redo(XLogReaderState *record)
 		}
 	}
 	else
-		elog(PANIC, "tblspc_redo: unknown op code %u", info);
+		elog(PANIC, "tblspc_redo: unknown op code %u", rminfo);
 }
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 2cc0ac9eb0..d6abeb9a9d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -132,12 +132,12 @@ void
 xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 {
 	SnapBuild  *builder = ctx->snapshot_builder;
-	uint8		info = XLogRecGetInfo(buf->record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(buf->record);
 
 	ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(buf->record),
 							buf->origptr);
 
-	switch (info)
+	switch (rminfo)
 	{
 			/* this is also used in END_OF_RECOVERY checkpoints */
 		case XLOG_CHECKPOINT_SHUTDOWN:
@@ -164,7 +164,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_OVERWRITE_CONTRECORD:
 			break;
 		default:
-			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
+			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", rminfo);
 	}
 }
 
@@ -177,7 +177,7 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	SnapBuild  *builder = ctx->snapshot_builder;
 	ReorderBuffer *reorder = ctx->reorder;
 	XLogReaderState *r = buf->record;
-	uint8		info = XLogRecGetInfo(r) & XLOG_XACT_OPMASK;
+	uint8		info = XLogRecGetRmInfo(r) & XLOG_XACT_OPMASK;
 
 	/*
 	 * If the snapshot isn't yet fully built, we cannot decode anything, so
@@ -197,7 +197,7 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				bool		two_phase = false;
 
 				xlrec = (xl_xact_commit *) XLogRecGetData(r);
-				ParseCommitRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+				ParseCommitRecord(XLogRecGetRmInfo(buf->record), xlrec, &parsed);
 
 				if (!TransactionIdIsValid(parsed.twophase_xid))
 					xid = XLogRecGetXid(r);
@@ -225,7 +225,7 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 				bool		two_phase = false;
 
 				xlrec = (xl_xact_abort *) XLogRecGetData(r);
-				ParseAbortRecord(XLogRecGetInfo(buf->record), xlrec, &parsed);
+				ParseAbortRecord(XLogRecGetRmInfo(buf->record), xlrec, &parsed);
 
 				if (!TransactionIdIsValid(parsed.twophase_xid))
 					xid = XLogRecGetXid(r);
@@ -288,7 +288,7 @@ xact_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 
 				/* ok, parse it */
 				xlrec = (xl_xact_prepare *) XLogRecGetData(r);
-				ParsePrepareRecord(XLogRecGetInfo(buf->record),
+				ParsePrepareRecord(XLogRecGetRmInfo(buf->record),
 								   xlrec, &parsed);
 
 				/*
@@ -333,11 +333,11 @@ standby_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 {
 	SnapBuild  *builder = ctx->snapshot_builder;
 	XLogReaderState *r = buf->record;
-	uint8		info = XLogRecGetInfo(r) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(r);
 
 	ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(r), buf->origptr);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_RUNNING_XACTS:
 			{
@@ -367,7 +367,7 @@ standby_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			 */
 			break;
 		default:
-			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", info);
+			elog(ERROR, "unexpected RM_STANDBY_ID record type: %u", rminfo);
 	}
 }
 
@@ -377,7 +377,7 @@ standby_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 void
 heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 {
-	uint8		info = XLogRecGetInfo(buf->record) & XLOG_HEAP_OPMASK;
+	uint8		rminfo = XLogRecGetRmInfo(buf->record) & XLOG_HEAP_OPMASK;
 	TransactionId xid = XLogRecGetXid(buf->record);
 	SnapBuild  *builder = ctx->snapshot_builder;
 
@@ -391,7 +391,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		ctx->fast_forward)
 		return;
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_HEAP2_MULTI_INSERT:
 			if (!ctx->fast_forward &&
@@ -427,7 +427,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_HEAP2_LOCK_UPDATED:
 			break;
 		default:
-			elog(ERROR, "unexpected RM_HEAP2_ID record type: %u", info);
+			elog(ERROR, "unexpected RM_HEAP2_ID record type: %u", rminfo);
 	}
 }
 
@@ -437,7 +437,7 @@ heap2_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 void
 heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 {
-	uint8		info = XLogRecGetInfo(buf->record) & XLOG_HEAP_OPMASK;
+	uint8		rminfo = XLogRecGetRmInfo(buf->record) & XLOG_HEAP_OPMASK;
 	TransactionId xid = XLogRecGetXid(buf->record);
 	SnapBuild  *builder = ctx->snapshot_builder;
 
@@ -451,7 +451,7 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		ctx->fast_forward)
 		return;
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_HEAP_INSERT:
 			if (SnapBuildProcessChange(builder, xid, buf->origptr))
@@ -512,7 +512,7 @@ heap_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 			break;
 
 		default:
-			elog(ERROR, "unexpected RM_HEAP_ID record type: %u", info);
+			elog(ERROR, "unexpected RM_HEAP_ID record type: %u", rminfo);
 			break;
 	}
 }
@@ -562,13 +562,13 @@ logicalmsg_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	SnapBuild  *builder = ctx->snapshot_builder;
 	XLogReaderState *r = buf->record;
 	TransactionId xid = XLogRecGetXid(r);
-	uint8		info = XLogRecGetInfo(r) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(r);
 	RepOriginId origin_id = XLogRecGetOrigin(r);
 	Snapshot	snapshot;
 	xl_logical_message *message;
 
-	if (info != XLOG_LOGICAL_MESSAGE)
-		elog(ERROR, "unexpected RM_LOGICALMSG_ID record type: %u", info);
+	if (rminfo != XLOG_LOGICAL_MESSAGE)
+		elog(ERROR, "unexpected RM_LOGICALMSG_ID record type: %u", rminfo);
 
 	ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(r), buf->origptr);
 
diff --git a/src/backend/replication/logical/message.c b/src/backend/replication/logical/message.c
index 1c34912610..564990b610 100644
--- a/src/backend/replication/logical/message.c
+++ b/src/backend/replication/logical/message.c
@@ -80,10 +80,10 @@ LogLogicalMessage(const char *prefix, const char *message, size_t size,
 void
 logicalmsg_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	if (info != XLOG_LOGICAL_MESSAGE)
-		elog(PANIC, "logicalmsg_redo: unknown op code %u", info);
+	if (rminfo != XLOG_LOGICAL_MESSAGE)
+		elog(PANIC, "logicalmsg_redo: unknown op code %u", rminfo);
 
 	/* This is only interesting for logical decoding, see decode.c. */
 }
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index f19b72ff35..cf6be4e28d 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -822,9 +822,9 @@ StartupReplicationOrigin(void)
 void
 replorigin_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
-	switch (info)
+	switch (rminfo)
 	{
 		case XLOG_REPLORIGIN_SET:
 			{
@@ -861,7 +861,7 @@ replorigin_redo(XLogReaderState *record)
 				break;
 			}
 		default:
-			elog(PANIC, "replorigin_redo: unknown op code %u", info);
+			elog(PANIC, "replorigin_redo: unknown op code %u", rminfo);
 	}
 }
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 7db86f7885..74565e6cb0 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1145,7 +1145,7 @@ StandbyReleaseOldLocks(TransactionId oldxid)
 void
 standby_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Backup blocks are not used in standby records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
@@ -1154,7 +1154,7 @@ standby_redo(XLogReaderState *record)
 	if (standbyState == STANDBY_DISABLED)
 		return;
 
-	if (info == XLOG_STANDBY_LOCK)
+	if (rminfo == XLOG_STANDBY_LOCK)
 	{
 		xl_standby_locks *xlrec = (xl_standby_locks *) XLogRecGetData(record);
 		int			i;
@@ -1164,7 +1164,7 @@ standby_redo(XLogReaderState *record)
 											  xlrec->locks[i].dbOid,
 											  xlrec->locks[i].relOid);
 	}
-	else if (info == XLOG_RUNNING_XACTS)
+	else if (rminfo == XLOG_RUNNING_XACTS)
 	{
 		xl_running_xacts *xlrec = (xl_running_xacts *) XLogRecGetData(record);
 		RunningTransactionsData running;
@@ -1179,7 +1179,7 @@ standby_redo(XLogReaderState *record)
 
 		ProcArrayApplyRecoveryInfo(&running);
 	}
-	else if (info == XLOG_INVALIDATIONS)
+	else if (rminfo == XLOG_INVALIDATIONS)
 	{
 		xl_invalidations *xlrec = (xl_invalidations *) XLogRecGetData(record);
 
@@ -1190,7 +1190,7 @@ standby_redo(XLogReaderState *record)
 											 xlrec->tsId);
 	}
 	else
-		elog(PANIC, "standby_redo: unknown op code %u", info);
+		elog(PANIC, "standby_redo: unknown op code %u", rminfo);
 }
 
 /*
diff --git a/src/backend/utils/cache/relmapper.c b/src/backend/utils/cache/relmapper.c
index 6266568605..c9bbd98598 100644
--- a/src/backend/utils/cache/relmapper.c
+++ b/src/backend/utils/cache/relmapper.c
@@ -1084,12 +1084,12 @@ perform_relmap_update(bool shared, const RelMapFile *updates)
 void
 relmap_redo(XLogReaderState *record)
 {
-	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Backup blocks are not used in relmap records */
 	Assert(!XLogRecHasAnyBlockRefs(record));
 
-	if (info == XLOG_RELMAP_UPDATE)
+	if (rminfo == XLOG_RELMAP_UPDATE)
 	{
 		xl_relmap_update *xlrec = (xl_relmap_update *) XLogRecGetData(record);
 		RelMapFile	newmap;
@@ -1126,5 +1126,5 @@ relmap_redo(XLogReaderState *record)
 		pfree(dbpath);
 	}
 	else
-		elog(PANIC, "relmap_redo: unknown op code %u", info);
+		elog(PANIC, "relmap_redo: unknown op code %u", rminfo);
 }
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index 53f011a2fe..132c4db65e 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -201,7 +201,7 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 	searchptr = forkptr;
 	for (;;)
 	{
-		uint8		info;
+		uint8		rminfo;
 
 		XLogBeginRead(xlogreader, searchptr);
 		record = XLogReadRecord(xlogreader, &errormsg);
@@ -222,11 +222,11 @@ findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
 		 * be the latest checkpoint before WAL forked and not the checkpoint
 		 * where the primary has been stopped to be rewound.
 		 */
-		info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
+		rminfo = XLogRecGetRmInfo(xlogreader);
 		if (searchptr < forkptr &&
 			XLogRecGetRmid(xlogreader) == RM_XLOG_ID &&
-			(info == XLOG_CHECKPOINT_SHUTDOWN ||
-			 info == XLOG_CHECKPOINT_ONLINE))
+			(rminfo == XLOG_CHECKPOINT_SHUTDOWN ||
+			 rminfo == XLOG_CHECKPOINT_ONLINE))
 		{
 			CheckPoint	checkPoint;
 
@@ -370,7 +370,7 @@ extractPageInfo(XLogReaderState *record)
 	int			block_id;
 	RmgrId		rmid = XLogRecGetRmid(record);
 	uint8		info = XLogRecGetInfo(record);
-	uint8		rminfo = info & ~XLR_INFO_MASK;
+	uint8		rminfo = XLogRecGetRmInfo(record);
 
 	/* Is this a special record type that I recognize? */
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca5..70886beedd 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -449,7 +449,7 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 	const RmgrDescData *desc = GetRmgrDesc(XLogRecGetRmid(record));
 	uint32		rec_len;
 	uint32		fpi_len;
-	uint8		info = XLogRecGetInfo(record);
+	uint8		rminfo = XLogRecGetRmInfo(record);
 	XLogRecPtr	xl_prev = XLogRecGetPrev(record);
 	StringInfoData s;
 
@@ -462,9 +462,9 @@ XLogDumpDisplayRecord(XLogDumpConfig *config, XLogReaderState *record)
 		   LSN_FORMAT_ARGS(record->ReadRecPtr),
 		   LSN_FORMAT_ARGS(xl_prev));
 
-	id = desc->rm_identify(info);
+	id = desc->rm_identify(rminfo);
 	if (id == NULL)
-		printf("desc: UNKNOWN (%x) ", info & ~XLR_INFO_MASK);
+		printf("desc: UNKNOWN (%x) ", rminfo);
 	else
 		printf("desc: %s ", id);
 
diff --git a/src/include/access/brin_xlog.h b/src/include/access/brin_xlog.h
index 012a9afdf4..9a4445c122 100644
--- a/src/include/access/brin_xlog.h
+++ b/src/include/access/brin_xlog.h
@@ -145,7 +145,7 @@ typedef struct xl_brin_desummarize
 
 extern void brin_redo(XLogReaderState *record);
 extern void brin_desc(StringInfo buf, XLogReaderState *record);
-extern const char *brin_identify(uint8 info);
+extern const char *brin_identify(uint8 rminfo);
 extern void brin_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* BRIN_XLOG_H */
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 543f2e2643..c8aa16da93 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -58,6 +58,6 @@ extern int	clogsyncfiletag(const FileTag *ftag, char *path);
 
 extern void clog_redo(XLogReaderState *record);
 extern void clog_desc(StringInfo buf, XLogReaderState *record);
-extern const char *clog_identify(uint8 info);
+extern const char *clog_identify(uint8 rminfo);
 
 #endif							/* CLOG_H */
diff --git a/src/include/access/ginxlog.h b/src/include/access/ginxlog.h
index 7f985039bb..7b79f18ad7 100644
--- a/src/include/access/ginxlog.h
+++ b/src/include/access/ginxlog.h
@@ -208,7 +208,7 @@ typedef struct ginxlogDeleteListPages
 
 extern void gin_redo(XLogReaderState *record);
 extern void gin_desc(StringInfo buf, XLogReaderState *record);
-extern const char *gin_identify(uint8 info);
+extern const char *gin_identify(uint8 rminfo);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
 extern void gin_mask(char *pagedata, BlockNumber blkno);
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 9bbe4c2622..9bb848a207 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -106,7 +106,7 @@ typedef struct gistxlogPageReuse
 
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
-extern const char *gist_identify(uint8 info);
+extern const char *gist_identify(uint8 rminfo);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
 extern void gist_mask(char *pagedata, BlockNumber blkno);
diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h
index 59230706bb..e3200fa11c 100644
--- a/src/include/access/hash_xlog.h
+++ b/src/include/access/hash_xlog.h
@@ -261,7 +261,7 @@ typedef struct xl_hash_vacuum_one_page
 
 extern void hash_redo(XLogReaderState *record);
 extern void hash_desc(StringInfo buf, XLogReaderState *record);
-extern const char *hash_identify(uint8 info);
+extern const char *hash_identify(uint8 rminfo);
 extern void hash_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* HASH_XLOG_H */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 34220d93cf..306cc568eb 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -394,11 +394,11 @@ extern void HeapTupleHeaderAdvanceLatestRemovedXid(HeapTupleHeader tuple,
 
 extern void heap_redo(XLogReaderState *record);
 extern void heap_desc(StringInfo buf, XLogReaderState *record);
-extern const char *heap_identify(uint8 info);
+extern const char *heap_identify(uint8 rminfo);
 extern void heap_mask(char *pagedata, BlockNumber blkno);
 extern void heap2_redo(XLogReaderState *record);
 extern void heap2_desc(StringInfo buf, XLogReaderState *record);
-extern const char *heap2_identify(uint8 info);
+extern const char *heap2_identify(uint8 rminfo);
 extern void heap_xlog_logical_rewrite(XLogReaderState *r);
 
 extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer,
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4cbe17de7b..979f2fafe7 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -158,7 +158,7 @@ extern void multixact_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void multixact_redo(XLogReaderState *record);
 extern void multixact_desc(StringInfo buf, XLogReaderState *record);
-extern const char *multixact_identify(uint8 info);
+extern const char *multixact_identify(uint8 rminfo);
 extern char *mxid_to_string(MultiXactId multi, int nmembers,
 							MultiXactMember *members);
 
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index dd504d1885..0adecc3c05 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -350,6 +350,6 @@ extern void btree_mask(char *pagedata, BlockNumber blkno);
  * prototypes for functions in nbtdesc.c
  */
 extern void btree_desc(StringInfo buf, XLogReaderState *record);
-extern const char *btree_identify(uint8 info);
+extern const char *btree_identify(uint8 rminfo);
 
 #endif							/* NBTXLOG_H */
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 930ffdd4f7..06a4a4e074 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -249,7 +249,7 @@ typedef struct spgxlogVacuumRedirect
 
 extern void spg_redo(XLogReaderState *record);
 extern void spg_desc(StringInfo buf, XLogReaderState *record);
-extern const char *spg_identify(uint8 info);
+extern const char *spg_identify(uint8 rminfo);
 extern void spg_xlog_startup(void);
 extern void spg_xlog_cleanup(void);
 extern void spg_mask(char *pagedata, BlockNumber blkno);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index c604ee11f8..31aabb96e3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -512,9 +512,9 @@ extern void xact_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xact_identify(uint8 info);
 
 /* also in xactdesc.c, so they can be shared between front/backend code */
-extern void ParseCommitRecord(uint8 info, xl_xact_commit *xlrec, xl_xact_parsed_commit *parsed);
-extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed);
-extern void ParsePrepareRecord(uint8 info, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *parsed);
+extern void ParseCommitRecord(uint8 rminfo, xl_xact_commit *xlrec, xl_xact_parsed_commit *parsed);
+extern void ParseAbortRecord(uint8 rminfo, xl_xact_abort *xlrec, xl_xact_parsed_abort *parsed);
+extern void ParsePrepareRecord(uint8 rminfo, xl_xact_prepare *xlrec, xl_xact_parsed_prepare *parsed);
 
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index dce265098e..cf8c72c134 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -211,7 +211,7 @@ extern void XLogSetReplicationSlotMinimumLSN(XLogRecPtr lsn);
 
 extern void xlog_redo(XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, XLogReaderState *record);
-extern const char *xlog_identify(uint8 info);
+extern const char *xlog_identify(uint8 rminfo);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli);
 
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 001ff2f521..cfe53c7175 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -41,7 +41,8 @@
 /* prototypes for public functions in xloginsert.c: */
 extern void XLogBeginInsert(void);
 extern void XLogSetRecordFlags(uint8 flags);
-extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info);
+extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 rminfo);
+extern XLogRecPtr XLogInsertExtended(RmgrId rmid, uint8 info, uint8 rminfo);
 extern void XLogEnsureRecordSpace(int max_block_id, int ndatas);
 extern void XLogRegisterData(char *data, uint32 len);
 extern void XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316a..fb6cae08ad 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -410,6 +410,7 @@ extern bool DecodeXLogRecord(XLogReaderState *state,
 #define XLogRecGetPrev(decoder) ((decoder)->record->header.xl_prev)
 #define XLogRecGetInfo(decoder) ((decoder)->record->header.xl_info)
 #define XLogRecGetRmid(decoder) ((decoder)->record->header.xl_rmid)
+#define XLogRecGetRmInfo(decoder) ((decoder)->record->header.xl_rminfo)
 #define XLogRecGetXid(decoder) ((decoder)->record->header.xl_xid)
 #define XLogRecGetOrigin(decoder) ((decoder)->record->record_origin)
 #define XLogRecGetTopXid(decoder) ((decoder)->record->toplevel_xid)
diff --git a/src/include/access/xlogrecord.h b/src/include/access/xlogrecord.h
index 835151ec92..17093d93b6 100644
--- a/src/include/access/xlogrecord.h
+++ b/src/include/access/xlogrecord.h
@@ -45,7 +45,8 @@ typedef struct XLogRecord
 	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
 	uint8		xl_info;		/* flag bits, see below */
 	RmgrId		xl_rmid;		/* resource manager for this record */
-	/* 2 bytes of padding here, initialize to zero */
+	uint8		xl_rminfo;		/* flag bits for rmgr use */
+	/* 1 byte of padding here, initialize to zero */
 	pg_crc32c	xl_crc;			/* CRC for this record */
 
 	/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
@@ -54,14 +55,6 @@ typedef struct XLogRecord
 
 #define SizeOfXLogRecord	(offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))
 
-/*
- * The high 4 bits in xl_info may be used freely by rmgr. The
- * XLR_SPECIAL_REL_UPDATE and XLR_CHECK_CONSISTENCY bits can be passed by
- * XLogInsert caller. The rest are set internally by XLogInsert.
- */
-#define XLR_INFO_MASK			0x0F
-#define XLR_RMGR_INFO_MASK		0xF0
-
 /*
  * If a WAL record modifies any relation files, in ways not covered by the
  * usual block references, this flag is set. This is not used for anything
diff --git a/src/include/access/xlogstats.h b/src/include/access/xlogstats.h
index 7eb4370f2d..5df0f584c1 100644
--- a/src/include/access/xlogstats.h
+++ b/src/include/access/xlogstats.h
@@ -16,7 +16,7 @@
 #include "access/rmgr.h"
 #include "access/xlogreader.h"
 
-#define MAX_XLINFO_TYPES 16
+#define MAX_XLINFO_TYPES UINT8_MAX
 
 typedef struct XLogRecStats
 {
diff --git a/src/include/catalog/storage_xlog.h b/src/include/catalog/storage_xlog.h
index 44a5e2043b..4db1cdffad 100644
--- a/src/include/catalog/storage_xlog.h
+++ b/src/include/catalog/storage_xlog.h
@@ -54,6 +54,6 @@ extern void log_smgrcreate(const RelFileLocator *rlocator, ForkNumber forkNum);
 
 extern void smgr_redo(XLogReaderState *record);
 extern void smgr_desc(StringInfo buf, XLogReaderState *record);
-extern const char *smgr_identify(uint8 info);
+extern const char *smgr_identify(uint8 rminfo);
 
 #endif							/* STORAGE_XLOG_H */
diff --git a/src/include/commands/dbcommands_xlog.h b/src/include/commands/dbcommands_xlog.h
index 545e5430cc..109cd358d9 100644
--- a/src/include/commands/dbcommands_xlog.h
+++ b/src/include/commands/dbcommands_xlog.h
@@ -55,6 +55,6 @@ typedef struct xl_dbase_drop_rec
 
 extern void dbase_redo(XLogReaderState *record);
 extern void dbase_desc(StringInfo buf, XLogReaderState *record);
-extern const char *dbase_identify(uint8 info);
+extern const char *dbase_identify(uint8 rminfo);
 
 #endif							/* DBCOMMANDS_XLOG_H */
diff --git a/src/include/commands/sequence.h b/src/include/commands/sequence.h
index b3b04ccfa9..78c7b15cf2 100644
--- a/src/include/commands/sequence.h
+++ b/src/include/commands/sequence.h
@@ -64,7 +64,7 @@ extern void ResetSequenceCaches(void);
 
 extern void seq_redo(XLogReaderState *record);
 extern void seq_desc(StringInfo buf, XLogReaderState *record);
-extern const char *seq_identify(uint8 info);
+extern const char *seq_identify(uint8 rminfo);
 extern void seq_mask(char *page, BlockNumber blkno);
 
 #endif							/* SEQUENCE_H */
diff --git a/src/include/commands/tablespace.h b/src/include/commands/tablespace.h
index a11c9e9473..749413fc07 100644
--- a/src/include/commands/tablespace.h
+++ b/src/include/commands/tablespace.h
@@ -64,6 +64,6 @@ extern void remove_tablespace_symlink(const char *linkloc);
 
 extern void tblspc_redo(XLogReaderState *record);
 extern void tblspc_desc(StringInfo buf, XLogReaderState *record);
-extern const char *tblspc_identify(uint8 info);
+extern const char *tblspc_identify(uint8 rminfo);
 
 #endif							/* TABLESPACE_H */
diff --git a/src/include/replication/message.h b/src/include/replication/message.h
index 0b396c5669..6d3a77ea36 100644
--- a/src/include/replication/message.h
+++ b/src/include/replication/message.h
@@ -36,6 +36,6 @@ extern XLogRecPtr LogLogicalMessage(const char *prefix, const char *message,
 #define XLOG_LOGICAL_MESSAGE	0x00
 extern void logicalmsg_redo(XLogReaderState *record);
 extern void logicalmsg_desc(StringInfo buf, XLogReaderState *record);
-extern const char *logicalmsg_identify(uint8 info);
+extern const char *logicalmsg_identify(uint8 rminfo);
 
 #endif							/* PG_LOGICAL_MESSAGE_H */
diff --git a/src/include/storage/standbydefs.h b/src/include/storage/standbydefs.h
index c0234b6cf3..51625e45d9 100644
--- a/src/include/storage/standbydefs.h
+++ b/src/include/storage/standbydefs.h
@@ -22,7 +22,7 @@
 /* Recovery handlers for the Standby Rmgr (RM_STANDBY_ID) */
 extern void standby_redo(XLogReaderState *record);
 extern void standby_desc(StringInfo buf, XLogReaderState *record);
-extern const char *standby_identify(uint8 info);
+extern const char *standby_identify(uint8 rminfo);
 extern void standby_desc_invalidations(StringInfo buf,
 									   int nmsgs, SharedInvalidationMessage *msgs,
 									   Oid dbId, Oid tsId,
diff --git a/src/include/utils/relmapper.h b/src/include/utils/relmapper.h
index 92f1f779a4..6fd75b546f 100644
--- a/src/include/utils/relmapper.h
+++ b/src/include/utils/relmapper.h
@@ -68,6 +68,6 @@ extern void RestoreRelationMap(char *startAddress);
 
 extern void relmap_redo(XLogReaderState *record);
 extern void relmap_desc(StringInfo buf, XLogReaderState *record);
-extern const char *relmap_identify(uint8 info);
+extern const char *relmap_identify(uint8 rminfo);
 
 #endif							/* RELMAPPER_H */
-- 
2.30.2

#25

andres@anarazel.de

over 3 years ago

In reply to: Matthias van de Meent (#24)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote:

On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote:

On 2022-10-03 10:01:25 -0700, Andres Freund wrote:

On 2022-10-03 08:12:39 -0400, Robert Haas wrote:

On Fri, Sep 30, 2022 at 8:20 PM Andres Freund <andres@anarazel.de> wrote:
I thought about trying to buy back some space elsewhere, and I think
that would be a reasonable approach to getting this committed if we
could find a way to do it. However, I don't see a terribly obvious way
of making it happen.

I think there's plenty potential...

I light dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducability) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

I think a signficant part of this improvement comes from the premise
of starting with a fresh database. tablespace OID will indeed most
likely be low, but database OID may very well be linearly distributed
if concurrent workloads in the cluster include updating (potentially
unlogged) TOASTed columns and the databases are not created in one
"big bang" but over the lifetime of the cluster. In that case DBOID
will consume 5B for a significant fraction of databases (anything with
OID >=2^28).

My point being: I don't think that we should have different WAL
performance in databases which is dependent on which OID was assigned
to that database.

To me this is raising the bar to an absurd level. Some minor space usage
increase after oid wraparound and for very large block numbers isn't a huge
issue - if you're in that situation you already have a huge amount of wal.

0002 - Rework XLogRecord
This makes many fields in the xlog header optional, reducing the size
of many xlog records by several bytes. This implements the design I
shared in my earlier message [1].

0003 - Rework XLogRecordBlockHeader.
This patch could be applied on current head, and saves some bytes in
per-block data. It potentially saves some bytes per registered
block/buffer in the WAL record (max 2 bytes for the first block, after
that up to 3). See the patch's commit message in the patch for
detailed information.

The amount of complexity these two introduce seems quite substantial to
me. Both from an maintenance and a runtime perspective. I think we'd be better
off using building blocks like variable lengths encoded values than open
coding it in many places.

Greetings,

Andres Freund

#26

boekewurm+postgres@gmail.com

about 3 years ago

In reply to: Andres Freund (#25)

Re: problems with making relfilenodes 56-bits

On Wed, 12 Oct 2022 at 23:13, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-10-12 22:05:30 +0200, Matthias van de Meent wrote:

On Wed, 5 Oct 2022 at 01:50, Andres Freund <andres@anarazel.de> wrote:

I light dusted off my old varint implementation from [1] and converted the
RelFileLocator and BlockNumber from fixed width integers to varint ones. This
isn't meant as a serious patch, but an experiment to see if this is a path
worth pursuing.

A run of installcheck in a cluster with autovacuum=off, full_page_writes=off
(for increased reproducability) shows a decent saving:

master: 241106544 - 230 MB
varint: 227858640 - 217 MB

I think a signficant part of this improvement comes from the premise
of starting with a fresh database. tablespace OID will indeed most
likely be low, but database OID may very well be linearly distributed
if concurrent workloads in the cluster include updating (potentially
unlogged) TOASTed columns and the databases are not created in one
"big bang" but over the lifetime of the cluster. In that case DBOID
will consume 5B for a significant fraction of databases (anything with
OID >=2^28).

My point being: I don't think that we should have different WAL
performance in databases which is dependent on which OID was assigned
to that database.

To me this is raising the bar to an absurd level. Some minor space usage
increase after oid wraparound and for very large block numbers isn't a huge
issue - if you're in that situation you already have a huge amount of wal.

I didn't want to block all varlen encoding, I just want to make clear
that I don't think it's great for performance testing and consistency
across installations if WAL size (and thus part of your performance)
is dependent on which actual database/relation/tablespace combination
you're running your workload in.

With the 56-bit relfilenode, the size of a block reference would
realistically differ between 7 bytes and 23 bytes:

- tblspc=0=1B
db=16386=3B
rel=797=2B (787 = 4 * default # of data relations in a fresh DB in PG14, + 1)
block=0=1B

- tsp>=2^28 = 5B
dat>=2^28 =5B
rel>=2^49 =8B
block>=2^28 =5B

That's a difference of 16 bytes, of which only the block number can
realistically be directly influenced by the user ("just don't have
relations larger than X blocks").

If applied to Dilip's pgbench transaction data, that would imply a
minimum per transaction wal usage of 509 bytes, and a maximum per
transaction wal usage of 609 bytes. That is nearly a 20% difference in
WAL size based only on the location of your data, and I'm just not
comfortable with that. Users have little or zero control over the
internal IDs we assign to these fields, while it would affect
performance fairly significantly.

(difference % between min/max wal size is unchanged (within 0.1%)
after accounting for record alignment)

0002 - Rework XLogRecord
This makes many fields in the xlog header optional, reducing the size
of many xlog records by several bytes. This implements the design I
shared in my earlier message [1].

0003 - Rework XLogRecordBlockHeader.
This patch could be applied on current head, and saves some bytes in
per-block data. It potentially saves some bytes per registered
block/buffer in the WAL record (max 2 bytes for the first block, after
that up to 3). See the patch's commit message in the patch for
detailed information.

The amount of complexity these two introduce seems quite substantial to
me. Both from an maintenance and a runtime perspective. I think we'd be better
off using building blocks like variable lengths encoded values than open
coding it in many places.

I guess that's true for length fields, but I don't think dynamic
header field presence (the 0002 rewrite, and the omission of
data_length in 0003) is that bad. We already have dynamic data
inclusion through block ids 25x; I'm not sure why we couldn't do that
more compactly with bitfields as indicators instead (hence the dynamic
header size).

As for complexity, I think my current patchset is mostly complex due
to a lack of tooling. Note that decoding makes common use of
COPY_HEADER_FIELD, which we don't really have an equivalent for in
XLogRecordAssemble. I think the code for 0002 would improve
significantly in readability if such construct would be available.

To reduce complexity in 0003, I could drop the 'repeat id'
optimization, as that reduces the complexity significantly, at the
cost of not saving that 1 byte per registered block after the first.

Kind regards,

Matthias van de Meent

#27

robertmhaas@gmail.com

about 3 years ago

In reply to: Andres Freund (#25)

Re: problems with making relfilenodes 56-bits

On Wed, Oct 12, 2022 at 5:13 PM Andres Freund <andres@anarazel.de> wrote:

I think a signficant part of this improvement comes from the premise
of starting with a fresh database. tablespace OID will indeed most
likely be low, but database OID may very well be linearly distributed
if concurrent workloads in the cluster include updating (potentially
unlogged) TOASTed columns and the databases are not created in one
"big bang" but over the lifetime of the cluster. In that case DBOID
will consume 5B for a significant fraction of databases (anything with
OID >=2^28).

My point being: I don't think that we should have different WAL
performance in databases which is dependent on which OID was assigned
to that database.

To me this is raising the bar to an absurd level. Some minor space usage
increase after oid wraparound and for very large block numbers isn't a huge
issue - if you're in that situation you already have a huge amount of wal.

I have to admit that I worried about the same thing that Matthias
raises, more or less. But I don't know whether I'm right to be
worried. A variable-length representation of any kind is essentially a
gamble that values requiring fewer bytes will be more common than
values requiring more bytes, and by enough to justify the overhead
that the method has. And, you want it to be more common for each
individual user, not just overall. For example, more people are going
to have small relations than large ones, but nobody wants performance
to drop off a cliff when the relation passes a certain size threshold.
Now, it wouldn't drop off a cliff here, but what about someone with a
really big, append-only relation? Won't they just end up writing more
to WAL than with the present system?

Maybe not. They might still have some writes to relations other than
the very large, append-only relation, and then they could still win.
Also, if we assume that the overhead of the variable-length
representation is never more than 1 byte beyond what is needed to
represent the underlying quantity in the minimal number of bytes, they
are only going to lose if their relation is already more than half the
maximum theoretical size, and if that is the case, they are in danger
of hitting the size limit anyway. You can argue that there's still a
risk here, but it doesn't seem like that bad of a risk.

But the same thing is not so obvious for, let's say, database OIDs.
What if you just have one or a few databases, but due to the previous
history of the cluster, their OIDs just happen to be big? Then you're
just behind where you would have been without the patch. Granted, if
this happens to you, you will be in the minority, because most users
are likely to have small database OIDs, but the fact that other people
are writing less WAL on average isn't going to make you happy about
writing more WAL on average. And even for a user for which that
doesn't happen, it's not at all unlikely that the gains they see will
be less than what we see on a freshly-initdb'd database.

So I don't really know what the answer is here. I don't think this
technique sucks, but I don't think it's necessarily a categorical win
for every case, either. And it even seems hard to reason about which
cases are likely to be wins and which cases are likely to be losses.

0002 - Rework XLogRecord
This makes many fields in the xlog header optional, reducing the size
of many xlog records by several bytes. This implements the design I
shared in my earlier message [1].

0003 - Rework XLogRecordBlockHeader.
This patch could be applied on current head, and saves some bytes in
per-block data. It potentially saves some bytes per registered
block/buffer in the WAL record (max 2 bytes for the first block, after
that up to 3). See the patch's commit message in the patch for
detailed information.

The amount of complexity these two introduce seems quite substantial to
me. Both from a maintenance and a runtime perspective. I think we'd be better
off using building blocks like variable lengths encoded values than open
coding it in many places.

I agree that this looks pretty ornate as written, but I think there
might be some good ideas in here, too. It is also easy to reason about
this kind of thing at least in terms of space consumption. It's a bit
harder to know how things will play out in terms of CPU cycles and
code complexity.

--
Robert Haas
EDB: http://www.enterprisedb.com

#28

andres@anarazel.de

about 3 years ago

In reply to: Robert Haas (#27)

Re: problems with making relfilenodes 56-bits

Hi,

On 2022-10-17 17:14:21 -0400, Robert Haas wrote:

I have to admit that I worried about the same thing that Matthias
raises, more or less. But I don't know whether I'm right to be
worried. A variable-length representation of any kind is essentially a
gamble that values requiring fewer bytes will be more common than
values requiring more bytes, and by enough to justify the overhead
that the method has. And, you want it to be more common for each
individual user, not just overall. For example, more people are going
to have small relations than large ones, but nobody wants performance
to drop off a cliff when the relation passes a certain size threshold.
Now, it wouldn't drop off a cliff here, but what about someone with a
really big, append-only relation? Won't they just end up writing more
to WAL than with the present system?

Perhaps. But I suspect it'd be a very small increase because they'd be using
bulk-insert paths in all likelihood anyway, if they managed to get to a very
large relation. And even in that case, if we e.g. were to make the record size
variable length, they'd still pretty much never reach that and it'd be an
overall win.

The number of people with that large relations, leaving partitioning aside
which'd still benefit as each relation is smaller, strikes me as a very small
percentage. And as you say, it's not like there's a cliff where everything
starts to be horrible.

Maybe not. They might still have some writes to relations other than
the very large, append-only relation, and then they could still win.
Also, if we assume that the overhead of the variable-length
representation is never more than 1 byte beyond what is needed to
represent the underlying quantity in the minimal number of bytes, they
are only going to lose if their relation is already more than half the
maximum theoretical size, and if that is the case, they are in danger
of hitting the size limit anyway. You can argue that there's still a
risk here, but it doesn't seem like that bad of a risk.

Another thing here is that I suspect we ought to increase our relation size
beyond 4 byte * blocksize at some point - and then we'll have to use variable
encodings... Admittedly the amount of work needed to get there is substantial.

Somewhat relatedly, I think we, very slowly, should move towards wider OIDs as
well. Not having to deal with oid wraparound will be a significant win
(particularly for toast), but to keep the overhead reasonable, we're going to
need variable encodings.

But the same thing is not so obvious for, let's say, database OIDs.
What if you just have one or a few databases, but due to the previous
history of the cluster, their OIDs just happen to be big? Then you're
just behind where you would have been without the patch. Granted, if
this happens to you, you will be in the minority, because most users
are likely to have small database OIDs, but the fact that other people
are writing less WAL on average isn't going to make you happy about
writing more WAL on average. And even for a user for which that
doesn't happen, it's not at all unlikely that the gains they see will
be less than what we see on a freshly-initdb'd database.

I agree that going for variable width encodings on the basis of the database
oid field alone would be an unconvincing proposition. But variably encoding
database oids when we already variably encode other fields seems like a decent
bet. If you e.g. think of the 56-bit relfilenode field itself - obviously what
I was thinking about in the first place - it's going to be a win much more
often.

To really loose you'd not just have to have a large database oid, but also a
large tablespace and relation oid and a huge block number...

So I don't really know what the answer is here. I don't think this
technique sucks, but I don't think it's necessarily a categorical win
for every case, either. And it even seems hard to reason about which
cases are likely to be wins and which cases are likely to be losses.

True. I'm far less concerned than you or Matthias about increasing the size in
rare cases as long as it wins in the majority of cases. But that doesn't mean
every case is easy to consider.

0002 - Rework XLogRecord
This makes many fields in the xlog header optional, reducing the size
of many xlog records by several bytes. This implements the design I
shared in my earlier message [1].

0003 - Rework XLogRecordBlockHeader.
This patch could be applied on current head, and saves some bytes in
per-block data. It potentially saves some bytes per registered
block/buffer in the WAL record (max 2 bytes for the first block, after
that up to 3). See the patch's commit message in the patch for
detailed information.

The amount of complexity these two introduce seems quite substantial to
me. Both from a maintenance and a runtime perspective. I think we'd be better
off using building blocks like variable lengths encoded values than open
coding it in many places.

I agree that this looks pretty ornate as written, but I think there
might be some good ideas in here, too.

Agreed! Several of the ideas seem orthogonal to using variable encodings, so
this isn't really an either / or.

It is also easy to reason about this kind of thing at least in terms of
space consumption.

Hm, not for me, but...

Greetings,

Andres Freund

#29