v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Started by Justin Pryzbyover 2 years ago11 messages
#1Justin Pryzby
pryzby@telsasoft.com

I hit this assertion while pg_restoring data into a v16 instance.
postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64

wal_level=minimal and pg_dump --single-transaction both seem to be
required to hit the issue.

$ /usr/pgsql-16/bin/postgres -D ./pg16test -c maintenance_work_mem=1GB -c max_wal_size=16GB -c wal_level=minimal -c max_wal_senders=0 -c port=5678 -c logging_collector=no &

$ time sudo -u postgres /usr/pgsql-16/bin/pg_restore -d postgres -p 5678 --single-transaction --no-tablespace ./curtables

TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Core was generated by `postgres: postgres postgres [local] COMMIT '.
Program terminated with signal 6, Aborted.
#0 0x00007f28b8bd5387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64
(gdb) bt
#0 0x00007f28b8bd5387 in raise () from /lib64/libc.so.6
#1 0x00007f28b8bd6a78 in abort () from /lib64/libc.so.6
#2 0x00000000009bc8c9 in ExceptionalCondition (conditionName=conditionName@entry=0xa373e1 "size > SizeOfXLogRecord", fileName=fileName@entry=0xa31b13 "xlog.c", lineNumber=lineNumber@entry=1055) at assert.c:66
#3 0x000000000057b049 in ReserveXLogInsertLocation (PrevPtr=0x2e3d750, EndPos=<synthetic pointer>, StartPos=<synthetic pointer>, size=24) at xlog.c:1055
#4 XLogInsertRecord (rdata=rdata@entry=0xf187a0 <hdr_rdt>, fpw_lsn=fpw_lsn@entry=0, flags=<optimized out>, num_fpi=num_fpi@entry=0, topxid_included=topxid_included@entry=false) at xlog.c:844
#5 0x000000000058210c in XLogInsert (rmid=rmid@entry=0 '\000', info=info@entry=176 '\260') at xloginsert.c:510
#6 0x0000000000582b09 in log_newpage_range (rel=rel@entry=0x2e1f628, forknum=forknum@entry=FSM_FORKNUM, startblk=startblk@entry=0, endblk=endblk@entry=3, page_std=page_std@entry=false) at xloginsert.c:1317
#7 0x00000000005d02f9 in smgrDoPendingSyncs (isCommit=isCommit@entry=true, isParallelWorker=isParallelWorker@entry=false) at storage.c:837
#8 0x0000000000571637 in CommitTransaction () at xact.c:2225
#9 0x0000000000572b25 in CommitTransactionCommand () at xact.c:3201
#10 0x000000000086afc7 in finish_xact_command () at postgres.c:2782
#11 0x000000000086d7e1 in exec_simple_query (query_string=0x2dec4f8 "COMMIT") at postgres.c:1307

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#1)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Justin Pryzby <pryzby@telsasoft.com> writes:

I hit this assertion while pg_restoring data into a v16 instance.
postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64

wal_level=minimal and pg_dump --single-transaction both seem to be
required to hit the issue.

Hmm. I wonder if log_newpages() is confused here:

XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0);

Why is XLR_MAX_BLOCK_ID - 1 enough, rather than XLR_MAX_BLOCK_ID?

regards, tom lane

#3Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Justin Pryzby (#1)
1 attachment(s)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

On Mon, 17 Apr 2023 at 17:53, Justin Pryzby <pryzby@telsasoft.com> wrote:

I hit this assertion while pg_restoring data into a v16 instance.
postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64

wal_level=minimal and pg_dump --single-transaction both seem to be
required to hit the issue.

$ /usr/pgsql-16/bin/postgres -D ./pg16test -c maintenance_work_mem=1GB -c max_wal_size=16GB -c wal_level=minimal -c max_wal_senders=0 -c port=5678 -c logging_collector=no &

$ time sudo -u postgres /usr/pgsql-16/bin/pg_restore -d postgres -p 5678 --single-transaction --no-tablespace ./curtables

TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Core was generated by `postgres: postgres postgres [local] COMMIT '.
Program terminated with signal 6, Aborted.
#0 0x00007f28b8bd5387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install postgresql16-server-16-alpha_20230417_PGDG.rhel7.x86_64
(gdb) bt
#0 0x00007f28b8bd5387 in raise () from /lib64/libc.so.6
#1 0x00007f28b8bd6a78 in abort () from /lib64/libc.so.6
#2 0x00000000009bc8c9 in ExceptionalCondition (conditionName=conditionName@entry=0xa373e1 "size > SizeOfXLogRecord", fileName=fileName@entry=0xa31b13 "xlog.c", lineNumber=lineNumber@entry=1055) at assert.c:66
#3 0x000000000057b049 in ReserveXLogInsertLocation (PrevPtr=0x2e3d750, EndPos=<synthetic pointer>, StartPos=<synthetic pointer>, size=24) at xlog.c:1055
#4 XLogInsertRecord (rdata=rdata@entry=0xf187a0 <hdr_rdt>, fpw_lsn=fpw_lsn@entry=0, flags=<optimized out>, num_fpi=num_fpi@entry=0, topxid_included=topxid_included@entry=false) at xlog.c:844
#5 0x000000000058210c in XLogInsert (rmid=rmid@entry=0 '\000', info=info@entry=176 '\260') at xloginsert.c:510
#6 0x0000000000582b09 in log_newpage_range (rel=rel@entry=0x2e1f628, forknum=forknum@entry=FSM_FORKNUM, startblk=startblk@entry=0, endblk=endblk@entry=3, page_std=page_std@entry=false) at xloginsert.c:1317

Looking at log_newpage_range, it seems like we're always trying to log
a record if startblk < endblk; but don't register the PageIsNew()
buffers in the range. That means that if the last buffers in the range
are new, this can result in no buffers being registered in the last
iteration of the main loop (if the number of non-new buffers in the
range is 0 (mod 32)).

A change like attached should fix the issue; or alternatively we could
force log the last (new) buffer when we detect this edge case.

Kind regards,

Matthias van de Meent

Attachments:

7882549d3b6ca83e83eda88d46f6f29afda0ae03.diffapplication/octet-stream; name=7882549d3b6ca83e83eda88d46f6f29afda0ae03.diffDownload
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index e3d5937479fe..91de1f96044d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -1304,6 +1304,10 @@ log_newpage_range(Relation rel, ForkNumber forknum,
 			blkno++;
 		}
 
+		/* Don't emit a WAL record if all pages were new */
+		if (nbufs == 0)
+			break;
+
 		/* Write WAL record for this batch. */
 		XLogBeginInsert();
 
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#2)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

I wrote:

Hmm. I wonder if log_newpages() is confused here:
XLogEnsureRecordSpace(XLR_MAX_BLOCK_ID - 1, 0);

Oh, no, it's simpler than that: log_newpage_range is trying to
log zero page images, and ReserveXLogInsertLocation doesn't
like that because every WAL record should contain some data.
Will fix, thanks for report.

regards, tom lane

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Matthias van de Meent (#3)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

Looking at log_newpage_range, it seems like we're always trying to log
a record if startblk < endblk; but don't register the PageIsNew()
buffers in the range. That means that if the last buffers in the range
are new, this can result in no buffers being registered in the last
iteration of the main loop (if the number of non-new buffers in the
range is 0 (mod 32)).

Yeah, I just came to the same conclusion. One thing I don't understand
yet: log_newpage_range is old (it looks like this back to v12), and
that Assert is older, so why doesn't this reproduce further back?
Maybe the state where all the pages are new didn't happen before?
Is that telling us there's a bug somewhere else? Seems like a job
for git bisect.

To be clear: log_newpage_range is certainly broken, and your fix looks
appropriate. I'm just wondering what else we need to learn here.

regards, tom lane

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#5)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

I wrote:

Yeah, I just came to the same conclusion. One thing I don't understand
yet: log_newpage_range is old (it looks like this back to v12), and
that Assert is older, so why doesn't this reproduce further back?
Maybe the state where all the pages are new didn't happen before?

Bingo: bisecting shows the failure started at

commit 3d6a98457d8e21d85bed86cfd3e1d1df1b260721
Author: Andres Freund <andres@anarazel.de>
Date: Wed Apr 5 08:19:39 2023 -0700

Don't initialize page in {vm,fsm}_extend(), not needed

So previously, log_newpage_range could only have failed in very
unlikely circumstances, whereas now it's not hard to hit when
committing a table creation. I wonder what other bugs may be
lurking.

I'll patch it back to v12 anyway, since that function is
clearly wrong in isolation.

regards, tom lane

#7Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#5)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Hi,

On 2023-04-17 12:13:41 -0400, Tom Lane wrote:

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

Looking at log_newpage_range, it seems like we're always trying to log
a record if startblk < endblk; but don't register the PageIsNew()
buffers in the range. That means that if the last buffers in the range
are new, this can result in no buffers being registered in the last
iteration of the main loop (if the number of non-new buffers in the
range is 0 (mod 32)).

Yeah, I just came to the same conclusion. One thing I don't understand
yet: log_newpage_range is old (it looks like this back to v12), and
that Assert is older, so why doesn't this reproduce further back?
Maybe the state where all the pages are new didn't happen before?
Is that telling us there's a bug somewhere else? Seems like a job
for git bisect.

One plausible explanation is that bulk relation extension has made it more
likely to encounter this scenario. We had some bulk extension code before, but
it was triggered purely based on contention - quite unlikely in simple test
scenarios - but now we also bulk extend if we know that we'll insert multiple
pages (when coming from heap_multi_insert(), with sufficient data).

Greetings,

Andres Freund

#8Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Hi,

On 2023-04-17 13:50:30 -0400, Tom Lane wrote:

I wrote:

Yeah, I just came to the same conclusion. One thing I don't understand
yet: log_newpage_range is old (it looks like this back to v12), and
that Assert is older, so why doesn't this reproduce further back?
Maybe the state where all the pages are new didn't happen before?

Bingo: bisecting shows the failure started at

commit 3d6a98457d8e21d85bed86cfd3e1d1df1b260721
Author: Andres Freund <andres@anarazel.de>
Date: Wed Apr 5 08:19:39 2023 -0700

Don't initialize page in {vm,fsm}_extend(), not needed

So previously, log_newpage_range could only have failed in very
unlikely circumstances, whereas now it's not hard to hit when
committing a table creation. I wonder what other bugs may be
lurking.

Oh, interesting. We haven't initialized the extra pages added by
RelationAddExtraBlocks() (in <= 15) for quite a while now, so I'm a bit
surprised it causes more issues for the VM / FSM. I guess it's that it's quite
common in real workloads to contend on the extension lock and add extra
blocks, but not in simple single-threaded tests?

Greetings,

Andres Freund

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#8)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Andres Freund <andres@anarazel.de> writes:

On 2023-04-17 13:50:30 -0400, Tom Lane wrote:

So previously, log_newpage_range could only have failed in very
unlikely circumstances, whereas now it's not hard to hit when
committing a table creation. I wonder what other bugs may be
lurking.

Oh, interesting. We haven't initialized the extra pages added by
RelationAddExtraBlocks() (in <= 15) for quite a while now, so I'm a bit
surprised it causes more issues for the VM / FSM. I guess it's that it's quite
common in real workloads to contend on the extension lock and add extra
blocks, but not in simple single-threaded tests?

I haven't tried hard to run it to ground, but maybe log_newpage_range
isn't used in that code path? Seems like we'd have detected this
before now if the case were reachable without any crash involved.

regards, tom lane

#10Justin Pryzby
pryzby@telsasoft.com
In reply to: Tom Lane (#6)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

On Mon, Apr 17, 2023 at 01:50:30PM -0400, Tom Lane wrote:

I wrote:

Yeah, I just came to the same conclusion. One thing I don't understand
yet: log_newpage_range is old (it looks like this back to v12), and
that Assert is older, so why doesn't this reproduce further back?
Maybe the state where all the pages are new didn't happen before?

Bingo: bisecting shows the failure started at

Just curious: what "test" did you use to bisect with ?

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Justin Pryzby (#10)
Re: v16dev: TRAP: failed Assert("size > SizeOfXLogRecord"), File: "xlog.c", Line: 1055, PID: 13564

Justin Pryzby <pryzby@telsasoft.com> writes:

On Mon, Apr 17, 2023 at 01:50:30PM -0400, Tom Lane wrote:

Bingo: bisecting shows the failure started at

Just curious: what "test" did you use to bisect with ?

The test case I used looked like

start postmaster with -c wal_level=minimal -c max_wal_senders=0
make installcheck-parallel
psql -d regression -c "do 'begin for i in 1..1000 loop execute ''create table lots''||i||'' as select * from onek''; end loop; end';"
pg_dump -Fc -Z0 regression >~/regression.dump
createdb r2
pg_restore -d r2 --single-transaction --no-tablespace ~/regression.dump

Dumping the regression database as-is didn't reproduce it for me,
but after I added a bunch more tables it did reproduce.

(I added the -Z0 bit after some of the bisection test points hit the
interval where somebody had broken pg_dump's compression features.
It didn't seem relevant to the problem so I just disabled that.)

regards, tom lane