BUG #18658: Assert in SerialAdd() due to race condition
The following bug has been logged on the website:
Bug reference: 18658
Logged by: Andrew Bille
Email address: andrewbille@gmail.com
PostgreSQL version: 17.0
Operating system: Ubuntu 20.04
Description:
Hello!
The following script against a server (master c259b1578) compiled with
-DTEST_SUMMARIZE_SERIAL (without the define it's harder to reproduce...):
echo "CREATE TABLE t(f1 int);" | psql test
for i in seq 100; do
echo "
SET default_transaction_isolation to 'SERIALIZABLE';
SELECT 'INSERT INTO t VALUES(42);' FROM generate_series(1, 100) g
\gexec
" | psql -q test &
echo "
SET default_transaction_isolation to 'SERIALIZABLE';
SELECT 'INSERT INTO t VALUES(42);' FROM generate_series(1, 100) g
\gexec
" | psql -q test &
wait
grep 'TRAP:' server.log && break;
done
fails for me with the following stack trace:
#5 0x000055e6aaf3fef5 in ExceptionalCondition
(conditionName=conditionName@entry=0x55e6ab1232d1
"TransactionIdIsValid(tailXid)", fileName=fileName@entry=0x55e6ab1231d9
"predicate.c", lineNumber=lineNumber@entry=886) at assert.c:66
#6 0x000055e6aadcde64 in SerialAdd (xid=976743,
minConflictCommitSeqNo=18446744073709551615) at predicate.c:886
#7 0x000055e6aadd0877 in SummarizeOldestCommittedSxact () at
predicate.c:1521
#8 0x000055e6aadd094f in GetSerializableTransactionSnapshotInt
(snapshot=snapshot@entry=0x55e6ab32f3e0 <CurrentSnapshotData>,
sourcevxid=sourcevxid@entry=0x0, sourcepid=sourcepid@entry=-1) at
predicate.c:1789
#9 0x000055e6aadd2a51 in GetSerializableTransactionSnapshot
(snapshot=snapshot@entry=0x55e6ab32f3e0 <CurrentSnapshotData>) at
predicate.c:1693
#10 0x000055e6aaf868c4 in GetTransactionSnapshot () at snapmgr.c:257
#11 0x000055e6aade2d9a in exec_simple_query
(query_string=query_string@entry=0x55e6ab433fe0 "SELECT 'INSERT INTO t
VALUES(42);' FROM generate_series(1, 100) g\n ") at postgres.c:1173
#12 0x000055e6aade516f in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at postgres.c:4793
#13 0x000055e6aadde301 in BackendMain (startup_data=<optimized out>,
startup_data_len=<optimized out>) at backend_startup.c:107
#14 0x000055e6aad2cbf8 in postmaster_child_launch
(child_type=child_type@entry=B_BACKEND,
startup_data=startup_data@entry=0x7ffc1ca820e4 "",
startup_data_len=startup_data_len@entry=4,
client_sock=client_sock@entry=0x7ffc1ca82120)
at launch_backend.c:274
#15 0x000055e6aad30e72 in BackendStartup
(client_sock=client_sock@entry=0x7ffc1ca82120) at postmaster.c:3420
#16 0x000055e6aad31092 in ServerLoop () at postmaster.c:1653
#17 0x000055e6aad324e6 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x55e6ab42e390) at postmaster.c:1351
#18 0x000055e6aac48bd7 in main (argc=3, argv=0x55e6ab42e390) at main.c:197
This is also reproduced on dafaa3efb
Implement genuine serializable isolation level.
...
Thanks, Andrew
Hello again!
Using a build with no asserts enabled and without that define, I can also
see the following errors using the script with increased concurrency
(Initially we got a report from our client, who encountered these errors in
production):
2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|PANIC: could
not access status of transaction 18204
2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|DETAIL: Could
not open file "pg_serial/0000": No such file or directory.
I changed ERROR to PANIC to get a core dump.
echo "CREATE TABLE t(f1 int);" | psql
res=0
for i in `seq 200`; do
echo "I $i"
for l in `seq 80`; do
echo "
SET default_transaction_isolation to 'SERIALIZABLE';
SELECT 'INSERT INTO t VALUES(42);' FROM generate_series(1, 100) g
\gexec
" | psql -q &
done
wait
grep 'PANIC: could not access status' server.log && { res=1; break; }
grep 'TRAP: ' server.log && { res=1; break; }
done
#5 0x00005f824729a10a in errfinish (filename=0x5f824732df83 "slru.c",
lineno=1060, funcname=0x5f824732e220 <__func__.3> "SlruReportIOError") at
elog.c:599
#6 0x00005f8246c16c26 in SlruReportIOError (ctl=0x5f8247686c80
<SerialSlruCtlData>, pageno=17, xid=18204) at slru.c:1060
#7 0x00005f8246c15cea in SimpleLruReadPage (ctl=0x5f8247686c80
<SerialSlruCtlData>, pageno=17, write_ok=true, xid=18204) at slru.c:579
#8 0x00005f82470988c2 in SerialAdd (xid=18204,
minConflictCommitSeqNo=18446744073709551615) at predicate.c:928
#9 0x00005f8247099854 in SummarizeOldestCommittedSxact () at
predicate.c:1520
#10 0x00005f8247099ddd in GetSerializableTransactionSnapshotInt
(snapshot=0x5f8247691740 <CurrentSnapshotData>, sourcevxid=0x0,
sourcepid=-1) at predicate.c:1798
#11 0x00005f8247099c72 in GetSerializableTransactionSnapshot
(snapshot=0x5f8247691740 <CurrentSnapshotData>) at predicate.c:1692
#12 0x00005f82472ee3b1 in GetTransactionSnapshot () at snapmgr.c:257
#13 0x00005f82470adfe8 in exec_simple_query (query_string=0x5f8248e06a28
"INSERT INTO t VALUES(42);") at postgres.c:1173
#14 0x00005f82470b3453 in PostgresMain (dbname=0x5f8248e030f0 "regression",
username=0x5f8248e3fe08 "test") at postgres.c:4793
#15 0x00005f82470aa5ac in BackendMain (startup_data=0x7fffdf93a054 "",
startup_data_len=4) at backend_startup.c:107
#16 0x00005f8246fd6a6a in postmaster_child_launch (child_type=B_BACKEND,
startup_data=0x7fffdf93a054 "", startup_data_len=4,
client_sock=0x7fffdf93a0a0) at launch_backend.c:274
#17 0x00005f8246fdc4ed in BackendStartup (client_sock=0x7fffdf93a0a0) at
postmaster.c:3420
#18 0x00005f8246fd9dd0 in ServerLoop () at postmaster.c:1653
#19 0x00005f8246fd96d0 in PostmasterMain (argc=1, argv=0x5f8248e01090) at
postmaster.c:1351
#20 0x00005f8246e9729e in main (argc=1, argv=0x5f8248e01090) at main.c:197
Assert would not be triggered in this case, because of:
(gdb) fr 8
#8 0x00005f82470988c2 in SerialAdd (xid=18204,
minConflictCommitSeqNo=18446744073709551615) at predicate.c:928
928 slotno = SimpleLruReadPage(SerialSlruCtl,
targetPage, true, xid);
(gdb) p *serialControl
$1 = {headPage = 18, headXid = 18654, tailXid = 18557}
I'm not sure if this is the same problem.
Thanks to Alexandr Lakhin for help with reproducing the issue!
On Wed, Oct 16, 2024 at 10:33 PM PG Bug reporting form <
noreply@postgresql.org> wrote:
Show quoted text
The following bug has been logged on the website:
Bug reference: 18658
Logged by: Andrew Bille
Email address: andrewbille@gmail.com
PostgreSQL version: 17.0
Operating system: Ubuntu 20.04
Description:Hello!
The following script against a server (master c259b1578) compiled with
-DTEST_SUMMARIZE_SERIAL (without the define it's harder to reproduce...):echo "CREATE TABLE t(f1 int);" | psql test
for i in seq 100; do
echo "
SET default_transaction_isolation to 'SERIALIZABLE';
SELECT 'INSERT INTO t VALUES(42);' FROM generate_series(1, 100) g
\gexec
" | psql -q test &echo "
SET default_transaction_isolation to 'SERIALIZABLE';
SELECT 'INSERT INTO t VALUES(42);' FROM generate_series(1, 100) g
\gexec
" | psql -q test &wait
grep 'TRAP:' server.log && break;
donefails for me with the following stack trace:
#5 0x000055e6aaf3fef5 in ExceptionalCondition
(conditionName=conditionName@entry=0x55e6ab1232d1
"TransactionIdIsValid(tailXid)", fileName=fileName@entry=0x55e6ab1231d9
"predicate.c", lineNumber=lineNumber@entry=886) at assert.c:66
#6 0x000055e6aadcde64 in SerialAdd (xid=976743,
minConflictCommitSeqNo=18446744073709551615) at predicate.c:886
#7 0x000055e6aadd0877 in SummarizeOldestCommittedSxact () at
predicate.c:1521
#8 0x000055e6aadd094f in GetSerializableTransactionSnapshotInt
(snapshot=snapshot@entry=0x55e6ab32f3e0 <CurrentSnapshotData>,
sourcevxid=sourcevxid@entry=0x0, sourcepid=sourcepid@entry=-1) at
predicate.c:1789
#9 0x000055e6aadd2a51 in GetSerializableTransactionSnapshot
(snapshot=snapshot@entry=0x55e6ab32f3e0 <CurrentSnapshotData>) at
predicate.c:1693
#10 0x000055e6aaf868c4 in GetTransactionSnapshot () at snapmgr.c:257
#11 0x000055e6aade2d9a in exec_simple_query
(query_string=query_string@entry=0x55e6ab433fe0 "SELECT 'INSERT INTO t
VALUES(42);' FROM generate_series(1, 100) g\n ") at postgres.c:1173
#12 0x000055e6aade516f in PostgresMain (dbname=<optimized out>,
username=<optimized out>) at postgres.c:4793
#13 0x000055e6aadde301 in BackendMain (startup_data=<optimized out>,
startup_data_len=<optimized out>) at backend_startup.c:107
#14 0x000055e6aad2cbf8 in postmaster_child_launch
(child_type=child_type@entry=B_BACKEND,
startup_data=startup_data@entry=0x7ffc1ca820e4 "",
startup_data_len=startup_data_len@entry=4,
client_sock=client_sock@entry=0x7ffc1ca82120)
at launch_backend.c:274
#15 0x000055e6aad30e72 in BackendStartup
(client_sock=client_sock@entry=0x7ffc1ca82120) at postmaster.c:3420
#16 0x000055e6aad31092 in ServerLoop () at postmaster.c:1653
#17 0x000055e6aad324e6 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x55e6ab42e390) at postmaster.c:1351
#18 0x000055e6aac48bd7 in main (argc=3, argv=0x55e6ab42e390) at main.c:197This is also reproduced on dafaa3efb
Implement genuine serializable isolation level....
Thanks, Andrew
Hello Andrew,
17.10.2024 08:47, Andrew Bille wrote:
Using a build with no asserts enabled and without that define, I can also see the following errors using the script
with increased concurrency (Initially we got a report from our client, who encountered these errors in production):2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|PANIC: could not access status of transaction 18204
2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|DETAIL: Could not open file "pg_serial/0000": No such file
or directory.I changed ERROR to PANIC to get a core dump.
With debug logging added (look at the patch attached), I could observe
events, leading to the error:
2024-10-18 07:11:29.301 UTC [2370369:55][client backend][96/762:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.301 UTC [2370325:58][client backend][91/0:77843] LOG: !!!SerialSetActiveSerXmin|
serialControl->tailXid: 77832
2024-10-18 07:11:29.301 UTC [2370325:59][client backend][91/0:77843] STATEMENT: INSERT INTO t VALUES(42);
1) SerialSetActiveSerXmin(), called by some backend, sets tailXid, which
corresponds to page 77832 / 1024 = 76.
2024-10-18 07:11:29.301 UTC [2370252:83][client backend][85/794:0] LOG: statement: INSERT INTO t VALUES(42);
...
2024-10-18 07:11:29.309 UTC [2370375:56][client backend][83/767:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370271:106][client backend][63/810:0] LOG: !!!SerialAdd| xid: 77824,
serialControl->headPage: -1, tailXid: 77832, SERIAL_ENTRIESPERPAGE: 1024, firstZeroPage: 76, targetPage: 76, isNewPage: 1
2024-10-18 07:11:29.309 UTC [2370271:107][client backend][63/810:0] STATEMENT: INSERT INTO t VALUES(42);
2.1) SerialAdd() called when headPage is not defined yet, for xid 77824,
which corresponds to page 77824 / 1024 = 76.
2024-10-18 07:11:29.309 UTC [2370327:86][client backend][97/795:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370379:72][client backend][87/777:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370376:74][client backend][98/781:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370321:74][client backend][71/789:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370277:86][client backend][76/795:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370325:82][client backend][91/791:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370373:64][client backend][84/779:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370271:108][client backend][63/810:0] LOG: !!!SerialAdd| after SimpleLruZeroPage(),
firstZeroPage: 76, slotno: 0
2024-10-18 07:11:29.309 UTC [2370271:109][client backend][63/810:0] STATEMENT: INSERT INTO t VALUES(42);
2.2) SerialAdd() calls SimpleLruZeroPage() to zero out page 76.
2024-10-18 07:11:29.309 UTC [2370330:96][client backend][1/803:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370274:95][client backend][70/805:0] LOG: !!!SerialAdd| xid: 77790,
serialControl->headPage: 76, firstZeroPage: 77, targetPage: 75, isNewPage: 0
2024-10-18 07:11:29.309 UTC [2370274:96][client backend][70/805:0] STATEMENT: INSERT INTO t VALUES(42);
3.1) SerialAdd() called for another transaction, with xid 77790, which
corresponds to page 77790 / 1024 = 75, i. e. the page before the page
just zeroed; this page considered not new.
2024-10-18 07:11:29.309 UTC [2370274:97][client backend][70/805:0] LOG: !!!SimpleLruReadPage| pageno: 75, slotno: 16,
shared->page_status[slotno]: 0, shared->page_number[slotno]: 0, shared->page_dirty[slotno]: 0
2024-10-18 07:11:29.309 UTC [2370274:98][client backend][70/805:0] STATEMENT: INSERT INTO t VALUES(42);
3.2) SerialAdd() calls SimpleLruReadPage() for page 75, the page has status
SLRU_PAGE_EMPTY, so SimpleLruReadPage() wants to read it from disk...
2024-10-18 07:11:29.309 UTC [2370257:89][client backend][58/802:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370355:71][client backend][74/780:0] LOG: statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370274:99][client backend][70/805:0] PANIC: could not access status of transaction 77790
2024-10-18 07:11:29.309 UTC [2370274:100][client backend][70/805:0] DETAIL: Could not open file "pg_serial/0002": No
such file or directory.
2024-10-18 07:11:29.309 UTC [2370274:101][client backend][70/805:0] STATEMENT: INSERT INTO t VALUES(42);
3.3) And fails inside SlruPhysicalReadPage(), because that page was never
written to disk.
(I used log_line_prefix = '%m [%p:%l][%b][%v:%x] '; decreased 80 to 50 in
the script posted above, had to run the script several dozen times to
catch the error.)
I also saw a failure, when SerialAdd(), which actions 2 and 3 were
performed by one backend.
Best regards,
Alexander
Attachments:
SLRU-debugging.patchtext/x-patch; charset=UTF-8; name=SLRU-debugging.patchDownload+9-1
On 18/10/2024 11:00, Alexander Lakhin wrote:
Hello Andrew,
17.10.2024 08:47, Andrew Bille wrote:
Using a build with no asserts enabled and without that define, I can
also see the following errors using the script with increased
concurrency (Initially we got a report from our client, who
encountered these errors in production):2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|PANIC:
could not access status of transaction 18204
2024-10-17 04:58:29.774 UTC|test|regression|67109975.ed069|DETAIL:
Could not open file "pg_serial/0000": No such file or directory.I changed ERROR to PANIC to get a core dump.
With debug logging added (look at the patch attached), I could observe
events, leading to the error:
2024-10-18 07:11:29.301 UTC [2370369:55][client backend][96/762:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.301 UTC [2370325:58][client backend][91/0:77843]
LOG: !!!SerialSetActiveSerXmin| serialControl->tailXid: 77832
2024-10-18 07:11:29.301 UTC [2370325:59][client backend][91/0:77843]
STATEMENT: INSERT INTO t VALUES(42);1) SerialSetActiveSerXmin(), called by some backend, sets tailXid, which
corresponds to page 77832 / 1024 = 76.2024-10-18 07:11:29.301 UTC [2370252:83][client backend][85/794:0] LOG:
statement: INSERT INTO t VALUES(42);
...
2024-10-18 07:11:29.309 UTC [2370375:56][client backend][83/767:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370271:106][client backend][63/810:0]
LOG: !!!SerialAdd| xid: 77824, serialControl->headPage: -1, tailXid:
77832, SERIAL_ENTRIESPERPAGE: 1024, firstZeroPage: 76, targetPage: 76,
isNewPage: 1
2024-10-18 07:11:29.309 UTC [2370271:107][client backend][63/810:0]
STATEMENT: INSERT INTO t VALUES(42);2.1) SerialAdd() called when headPage is not defined yet, for xid 77824,
which corresponds to page 77824 / 1024 = 76.2024-10-18 07:11:29.309 UTC [2370327:86][client backend][97/795:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370379:72][client backend][87/777:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370376:74][client backend][98/781:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370321:74][client backend][71/789:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370277:86][client backend][76/795:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370325:82][client backend][91/791:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370373:64][client backend][84/779:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370271:108][client backend][63/810:0]
LOG: !!!SerialAdd| after SimpleLruZeroPage(), firstZeroPage: 76, slotno: 0
2024-10-18 07:11:29.309 UTC [2370271:109][client backend][63/810:0]
STATEMENT: INSERT INTO t VALUES(42);2.2) SerialAdd() calls SimpleLruZeroPage() to zero out page 76.
2024-10-18 07:11:29.309 UTC [2370330:96][client backend][1/803:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370274:95][client backend][70/805:0] LOG:
!!!SerialAdd| xid: 77790, serialControl->headPage: 76, firstZeroPage:
77, targetPage: 75, isNewPage: 0
2024-10-18 07:11:29.309 UTC [2370274:96][client backend][70/805:0]
STATEMENT: INSERT INTO t VALUES(42);3.1) SerialAdd() called for another transaction, with xid 77790, which
corresponds to page 77790 / 1024 = 75, i. e. the page before the page
just zeroed; this page considered not new.2024-10-18 07:11:29.309 UTC [2370274:97][client backend][70/805:0] LOG:
!!!SimpleLruReadPage| pageno: 75, slotno: 16,
shared->page_status[slotno]: 0, shared->page_number[slotno]: 0,
shared->page_dirty[slotno]: 0
2024-10-18 07:11:29.309 UTC [2370274:98][client backend][70/805:0]
STATEMENT: INSERT INTO t VALUES(42);3.2) SerialAdd() calls SimpleLruReadPage() for page 75, the page has status
SLRU_PAGE_EMPTY, so SimpleLruReadPage() wants to read it from disk...2024-10-18 07:11:29.309 UTC [2370257:89][client backend][58/802:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370355:71][client backend][74/780:0] LOG:
statement: INSERT INTO t VALUES(42);
2024-10-18 07:11:29.309 UTC [2370274:99][client backend][70/805:0]
PANIC: could not access status of transaction 77790
2024-10-18 07:11:29.309 UTC [2370274:100][client backend][70/805:0]
DETAIL: Could not open file "pg_serial/0002": No such file or directory.
2024-10-18 07:11:29.309 UTC [2370274:101][client backend][70/805:0]
STATEMENT: INSERT INTO t VALUES(42);3.3) And fails inside SlruPhysicalReadPage(), because that page was never
written to disk.(I used log_line_prefix = '%m [%p:%l][%b][%v:%x] '; decreased 80 to 50 in
the script posted above, had to run the script several dozen times to
catch the error.)I also saw a failure, when SerialAdd(), which actions 2 and 3 were
performed by one backend.
Thanks for the repro, Andrew & Alexander! I was able to reproduce this
too. It reproduces very quickly with the script you provided, if you add
this sleep to ReleasePredicateLocks():
@@ -3654,6 +3667,8 @@ ReleasePredicateLocks(bool isCommit, bool
isReadOnlySafe)
LWLockRelease(SerializableFinishedListLock);
+ pg_usleep(1000);
+
if (needToClear)
ClearOldPredicateLocks();
I think the assertion is too strict. It is normal for tailXid to be
invalid in this scenario. The condition is that an XID was added to the
finished list, but the global xmin has already advanced past that XID.
It gets cleared from the finished list by the ClearOldPredicateLocks()
call, but another backend might call SummarizeOldestCommittedSxact()
before that.
The attached patch fixes it.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
0001-Fix-race-condition-in-committing-a-serializable-tran.patchtext/x-patch; charset=UTF-8; name=0001-Fix-race-condition-in-committing-a-serializable-tran.patchDownload+9-5
Hello Heikki,
18.10.2024 23:15, Heikki Linnakangas wrote:
Thanks for the repro, Andrew & Alexander! I was able to reproduce this too. It reproduces very quickly with the script
you provided, if you add this sleep to ReleasePredicateLocks():@@ -3654,6 +3667,8 @@ ReleasePredicateLocks(bool isCommit, bool isReadOnlySafe)
LWLockRelease(SerializableFinishedListLock);
+ pg_usleep(1000); + if (needToClear) ClearOldPredicateLocks();I think the assertion is too strict. It is normal for tailXid to be invalid in this scenario. The condition is that an
XID was added to the finished list, but the global xmin has already advanced past that XID. It gets cleared from the
finished list by the ClearOldPredicateLocks() call, but another backend might call SummarizeOldestCommittedSxact()
before that.The attached patch fixes it.
Thank you for your attention to this!
I also encountered another (more rare) failure with that script (initially
on REL_16_STABLE, but now I've reproduced this on master too), when it
fails due to ENOSPC. (I could reproduce the failure more or less reliably
by running that script with parallel -j4 using 4 different servers.)
With additional logging added (see attached), I see the following:
2024-10-19 07:34:48.254 UTC [3032898:1][client backend][48/278:0] LOG: !!!SerialAdd| xid: 19957,
serialControl->headPage: 4294967295, tailXid: 20491, SERIAL_ENTRIESPERPAGE: 1024, firstZeroPage: 20, targetPage: 19,
isNewPage: 1
2024-10-19 07:34:48.254 UTC [3032898:2][client backend][48/278:0] STATEMENT: INSERT INTO t VALUES(42);
2024-10-19 07:34:48.254 UTC [3032898:3][client backend][48/278:0] LOG: !!!SerialAdd: isNewPage, firstZeroPage: 20,
targetPage: 19
2024-10-19 07:34:48.254 UTC [3032898:4][client backend][48/278:0] STATEMENT: INSERT INTO t VALUES(42);
2024-10-19 07:35:05.105 UTC [3032898:5][client backend][48/278:0] ERROR: could not access status of transaction 0
2024-10-19 07:35:05.105 UTC [3032898:6][client backend][48/278:0] DETAIL: Could not write to file "pg_serial/11FB3" at
offset 8192: No space left on device.
That is, if SerialAdd() gets xid preceding tailXid and belonging to a
preceding page, the page zeroing loop just runs until ENOSPC.
Your proposed fix (adjusted for REL_16_STABLE) eliminates the issue for me.
Thank you!
Best regards,
Alexander
Attachments:
SerialAdd-debugging.patchtext/x-patch; charset=UTF-8; name=SerialAdd-debugging.patchDownload+6-0
On 19/10/2024 12:00, Alexander Lakhin wrote:
Your proposed fix (adjusted for REL_16_STABLE) eliminates the issue for me.
Thank you!
Committed and backpatched to all supported branches. Thanks for the
report and the testing!
--
Heikki Linnakangas
Neon (https://neon.tech)
I think there is a bit more in this module. Though, not related to the
described error. See, we copy nextXid from TransamVariables assuming
TransactionId is atomic. But the nextXid is of FullTransactionId, i.e.
64-bit. And this will work fine on modern 64-bit platforms where 64-bit
FullTransactionId actually would be atomic, but not for the x86. And since
we still support x86 platforms, we better use lock here.
--
Best regards,
Maxim Orlov.