TRAP: FailedAssertion("tabstat->trans == trans", File: "pgstat_relation.c", Line: 508
Hi,
I get this crash running the attached test program. On my slow-disked
and old desktop it occurs once in 20 or so runs (it is yet another
installment of an old test that runs pgbench with logical replication).
15devel compiled from d3609dd25.
(The bash deletes stuff, and without my environment it will need some
tweaking)
Thanks,
Erik Rijkers
Attachments:
On Sat, Apr 16, 2022 at 09:37:55AM +0200, Erik Rijkers wrote:
I get this crash running the attached test program. On my slow-disked and
old desktop it occurs once in 20 or so runs (it is yet another installment
of an old test that runs pgbench with logical replication).15devel compiled from d3609dd25.
(The bash deletes stuff, and without my environment it will need some
tweaking)
Thanks for the report, Erik. This one is new, likely related to the
move of the stats to shared memory. I have added an open item.
--
Michael
Hi,
On 2022-04-16 09:37:55 +0200, Erik Rijkers wrote:
I get this crash running the attached test program. On my slow-disked and
old desktop it occurs once in 20 or so runs (it is yet another installment
of an old test that runs pgbench with logical replication).15devel compiled from d3609dd25.
(The bash deletes stuff, and without my environment it will need some
tweaking)
Any chance for a backtrace? I'll otherwise try to adjust the script, but ...
Greetings,
Andres Freund
At Sat, 16 Apr 2022 11:23:23 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-16 09:37:55 +0200, Erik Rijkers wrote:
I get this crash running the attached test program. On my slow-disked and
old desktop it occurs once in 20 or so runs (it is yet another installment
of an old test that runs pgbench with logical replication).15devel compiled from d3609dd25.
(The bash deletes stuff, and without my environment it will need some
tweaking)Any chance for a backtrace? I'll otherwise try to adjust the script, but ...
FWIW, the script keep succussfully running more than 140 times for me.
(on master, Cent8) And I haven't find a hypothesis for the cause of
the symptom.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Op 18-04-2022 om 09:13 schreef Kyotaro Horiguchi:
At Sat, 16 Apr 2022 11:23:23 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-16 09:37:55 +0200, Erik Rijkers wrote:
I get this crash running the attached test program. On my slow-disked and
old desktop it occurs once in 20 or so runs (it is yet another installment
of an old test that runs pgbench with logical replication).15devel compiled from d3609dd25.
(The bash deletes stuff, and without my environment it will need some
tweaking)Any chance for a backtrace? I'll otherwise try to adjust the script, but ...
FWIW, the script keep succussfully running more than 140 times for me.
(on master, Cent8) And I haven't find a hypothesis for the cause of
the symptom.
Hm. Just now I've recompiled and retried and after 5 runs got the same
crash. Then tried on another machine (also old, I'm afraid),
and built 1a8b11053 and ran the same thing. That failed on the first
try, and made core dump from which I extracted:
gdb ~/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres
core-postgres-6-500-500-8289-1650269886 -ex bt -ex q
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from
/home/aardvark/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres...done.
[New LWP 8289]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: logical replication worker for
subscription 16411 '.
Program terminated with signal 6, Aborted.
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
#1 0x000000357d633cd5 in abort () from /lib64/libc.so.6
#2 0x0000000000973fcb in ExceptionalCondition
(conditionName=conditionName@entry=0xb20d76 "tabstat->trans == trans",
errorType=errorType@entry=0x9c7c2b "FailedAssertion",
fileName=fileName@entry=0xb20d0b "pgstat_relation.c",
lineNumber=lineNumber@entry=508) at assert.c:69
#3 0x000000000086b77f in AtEOXact_PgStat_Relations
(xact_state=xact_state@entry=0x26f0b50, isCommit=isCommit@entry=true) at
pgstat_relation.c:508
#4 0x000000000086ec0f in AtEOXact_PgStat (isCommit=isCommit@entry=true,
parallel=parallel@entry=false) at pgstat_xact.c:54
#5 0x00000000005bd2a3 in CommitTransaction () at xact.c:2360
#6 0x00000000005be5d5 in CommitTransactionCommand () at xact.c:3048
#7 0x00000000007ee72b in apply_handle_commit_internal
(commit_data=commit_data@entry=0x7ffe4606a7a0) at worker.c:1532
#8 0x00000000007efac9 in apply_handle_commit (s=0x7ffe4606a940) at
worker.c:845
#9 apply_dispatch () at worker.c:2473
#10 0x00000000007f11a7 in LogicalRepApplyLoop (last_received=74454600)
at worker.c:2757
#11 start_apply () at worker.c:3526
#12 0x00000000007f175f in ApplyWorkerMain () at worker.c:3782
#13 0x00000000007bdba3 in StartBackgroundWorker () at bgworker.c:858
#14 0x00000000007c3241 in do_start_bgworker (rw=<optimized out>) at
postmaster.c:5802
#15 maybe_start_bgworkers () at postmaster.c:6026
#16 0x00000000007c3b65 in sigusr1_handler
(postgres_signal_arg=<optimized out>) at postmaster.c:5191
#17 <signal handler called>
#18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
#19 0x00000000007c41d6 in ServerLoop () at postmaster.c:1757
#20 0x00000000007c5c3b in PostmasterMain () at postmaster.c:1465
#21 0x0000000000720cfe in main (argc=11, argv=0x2615590) at main.c:202
I'm not sure that helps.
Show quoted text
regards.
At Mon, 18 Apr 2022 10:57:02 +0200, Erikjan Rijkers <er@xs4all.nl> wrote in
Hm. Just now I've recompiled and retried and after 5 runs got the
same crash. Then tried on another machine (also old, I'm afraid),
and built 1a8b11053 and ran the same thing. That failed on the first
try, and made core dump from which I extracted:
Thanks!
gdb ~/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres
core-postgres-6-500-500-8289-1650269886 -ex bt -ex q#2 0x0000000000973fcb in ExceptionalCondition
#(conditionName=conditionName@entry=0xb20d76 "tabstat->trans == trans",
#errorType=errorType@entry=0x9c7c2b "FailedAssertion",
fileName=fileName@entry=0xb20d0b "pgstat_relation.c",
lineNumber=lineNumber@entry=508) at assert.c:69
#3 0x000000000086b77f in AtEOXact_PgStat_Relations
#(xact_state=xact_state@entry=0x26f0b50, isCommit=isCommit@entry=true)
#at pgstat_relation.c:508
Could you read tabstat, *tabstat, trans, *trans here?
#4 0x000000000086ec0f in AtEOXact_PgStat (isCommit=isCommit@entry=true,
#parallel=parallel@entry=false) at pgstat_xact.c:54
#5 0x00000000005bd2a3 in CommitTransaction () at xact.c:2360
#6 0x00000000005be5d5 in CommitTransactionCommand () at xact.c:3048
#7 0x00000000007ee72b in apply_handle_commit_internal
#(commit_data=commit_data@entry=0x7ffe4606a7a0) at worker.c:1532
#8 0x00000000007efac9 in apply_handle_commit (s=0x7ffe4606a940) at
#worker.c:845
#9 apply_dispatch () at worker.c:2473
#10 0x00000000007f11a7 in LogicalRepApplyLoop (last_received=74454600)
#at worker.c:2757
#11 start_apply () at worker.c:3526
#12 0x00000000007f175f in ApplyWorkerMain () at worker.c:3782
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Op 19-04-2022 om 02:15 schreef Kyotaro Horiguchi:
At Mon, 18 Apr 2022 10:57:02 +0200, Erikjan Rijkers <er@xs4all.nl> wrote in
Hm. Just now I've recompiled and retried and after 5 runs got the
same crash. Then tried on another machine (also old, I'm afraid),
and built 1a8b11053 and ran the same thing. That failed on the first
try, and made core dump from which I extracted:Thanks!
gdb ~/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres
core-postgres-6-500-500-8289-1650269886 -ex bt -ex q#2 0x0000000000973fcb in ExceptionalCondition
#(conditionName=conditionName@entry=0xb20d76 "tabstat->trans == trans",
#errorType=errorType@entry=0x9c7c2b "FailedAssertion",
fileName=fileName@entry=0xb20d0b "pgstat_relation.c",
lineNumber=lineNumber@entry=508) at assert.c:69
#3 0x000000000086b77f in AtEOXact_PgStat_Relations
#(xact_state=xact_state@entry=0x26f0b50, isCommit=isCommit@entry=true)
#at pgstat_relation.c:508Could you read tabstat, *tabstat, trans, *trans here?
To be honest I'm not sure how to, but I gave it a try:
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from
/home/aardvark/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres...done.
[New LWP 21839]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: logical replication worker for
subscription 16411 '.
Program terminated with signal 6, Aborted.
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
#1 0x000000357d633cd5 in abort () from /lib64/libc.so.6
#2 0x000000000097400b in ExceptionalCondition
(conditionName=conditionName@entry=0xb20df6 "tabstat->trans == trans",
errorType=errorType@entry=0x9c7cab "FailedAssertion",
fileName=fileName@entry=0xb20d8b "pgstat_relation.c",
lineNumber=lineNumber@entry=508) at assert.c:69
#3 0x000000000086b7bf in AtEOXact_PgStat_Relations
(xact_state=xact_state@entry=0x2d9ab50, isCommit=isCommit@entry=true) at
pgstat_relation.c:508
#4 0x000000000086ec4f in AtEOXact_PgStat (isCommit=isCommit@entry=true,
parallel=parallel@entry=false) at pgstat_xact.c:54
#5 0x00000000005bd2a3 in CommitTransaction () at xact.c:2360
#6 0x00000000005be5d5 in CommitTransactionCommand () at xact.c:3048
#7 0x00000000007ee76b in apply_handle_commit_internal
(commit_data=commit_data@entry=0x7fffb90aa8e0) at worker.c:1532
#8 0x00000000007efb09 in apply_handle_commit (s=0x7fffb90aaa80) at
worker.c:845
#9 apply_dispatch () at worker.c:2473
#10 0x00000000007f11e7 in LogicalRepApplyLoop (last_received=74695984)
at worker.c:2757
#11 start_apply () at worker.c:3526
#12 0x00000000007f179f in ApplyWorkerMain () at worker.c:3782
#13 0x00000000007bdbb3 in StartBackgroundWorker () at bgworker.c:858
#14 0x00000000007c3251 in do_start_bgworker (rw=<optimized out>) at
postmaster.c:5802
#15 maybe_start_bgworkers () at postmaster.c:6026
#16 0x00000000007c3b75 in sigusr1_handler
(postgres_signal_arg=<optimized out>) at postmaster.c:5191
#17 <signal handler called>
#18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
#19 0x00000000007c41e6 in ServerLoop () at postmaster.c:1757
#20 0x00000000007c5c4b in PostmasterMain () at postmaster.c:1465
#21 0x0000000000720d0e in main (argc=11, argv=0x2cbf590) at main.c:202
(gdb) f 3
#3 0x000000000086b7bf in AtEOXact_PgStat_Relations
(xact_state=xact_state@entry=0x2d9ab50, isCommit=isCommit@entry=true) at
pgstat_relation.c:508
508 Assert(tabstat->trans == trans);
(gdb) p tabstat
$1 = <optimized out>
(gdb) p *tabstat
value has been optimized out
(gdb) p trans
$2 = <optimized out>
(gdb) p *trans
value has been optimized out
(gdb)
Show quoted text
#4 0x000000000086ec0f in AtEOXact_PgStat (isCommit=isCommit@entry=true,
#parallel=parallel@entry=false) at pgstat_xact.c:54
#5 0x00000000005bd2a3 in CommitTransaction () at xact.c:2360
#6 0x00000000005be5d5 in CommitTransactionCommand () at xact.c:3048
#7 0x00000000007ee72b in apply_handle_commit_internal
#(commit_data=commit_data@entry=0x7ffe4606a7a0) at worker.c:1532
#8 0x00000000007efac9 in apply_handle_commit (s=0x7ffe4606a940) at
#worker.c:845
#9 apply_dispatch () at worker.c:2473
#10 0x00000000007f11a7 in LogicalRepApplyLoop (last_received=74454600)
#at worker.c:2757
#11 start_apply () at worker.c:3526
#12 0x00000000007f175f in ApplyWorkerMain () at worker.c:3782regards.
Thaks Erik.
At Tue, 19 Apr 2022 07:00:30 +0200, Erik Rijkers <er@xs4all.nl> wrote in
Op 19-04-2022 om 02:15 schreef Kyotaro Horiguchi:
Could you read tabstat, *tabstat, trans, *trans here?
To be honest I'm not sure how to, but I gave it a try:
(gdb) p tabstat
$1 = <optimized out>
Great! It is that. But unfortunately they are optimized out.. Could
you cause the crash with -O0 binary? You will see the variable with
it. You can rebuild with the option as follows.
$ make clean; make install CUSTOM_COPT="-O0 -g"
You can dump only the whole xact_state chain from the current core
file but the result will give a bit obscure hint for diagnosis.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Op 19-04-2022 om 11:25 schreef Kyotaro Horiguchi:
Thaks Erik.
At Tue, 19 Apr 2022 07:00:30 +0200, Erik Rijkers <er@xs4all.nl> wrote in
Op 19-04-2022 om 02:15 schreef Kyotaro Horiguchi:
Could you read tabstat, *tabstat, trans, *trans here?
To be honest I'm not sure how to, but I gave it a try:
I rebuilt newest master (a62bff74b135) with
export CUSTOM_COPT="-O0 -g"
The 12th run of statbug.sh crashed and gave a corefile.
GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from
/home/aardvark/pg_stuff/pg_installations/pgsql.HEAD/bin/postgres...done.
[New LWP 25058]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: logical replication worker for
subscription 16411 '.
Program terminated with signal 6, Aborted.
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
#1 0x000000357d633cd5 in abort () from /lib64/libc.so.6
#2 0x0000000000b3bada in ExceptionalCondition (conditionName=0xd389a1
"tabstat->trans == trans", errorType=0xd388b2 "FailedAssertion",
fileName=0xd388a0 "pgstat_relation.c", lineNumber=508) at assert.c:69
#3 0x00000000009bf5dc in AtEOXact_PgStat_Relations
(xact_state=0x31b1b50, isCommit=true) at pgstat_relation.c:508
#4 0x00000000009c4107 in AtEOXact_PgStat (isCommit=true,
parallel=false) at pgstat_xact.c:54
#5 0x0000000000583764 in CommitTransaction () at xact.c:2360
#6 0x0000000000584354 in CommitTransactionCommand () at xact.c:3048
#7 0x000000000090b34e in apply_handle_commit_internal
(commit_data=0x7ffd024b5940) at worker.c:1532
#8 0x000000000090a287 in apply_handle_commit (s=0x7ffd024b59b0) at
worker.c:845
#9 0x000000000090ce3a in apply_dispatch (s=0x7ffd024b59b0) at worker.c:2473
#10 0x000000000090d41c in LogicalRepApplyLoop (last_received=74680880)
at worker.c:2757
#11 0x000000000090e974 in start_apply (origin_startpos=0) at worker.c:3526
#12 0x000000000090f156 in ApplyWorkerMain (main_arg=0) at worker.c:3782
#13 0x00000000008c7623 in StartBackgroundWorker () at bgworker.c:858
#14 0x00000000008d1557 in do_start_bgworker (rw=0x30ff0a0) at
postmaster.c:5802
#15 0x00000000008d1903 in maybe_start_bgworkers () at postmaster.c:6026
#16 0x00000000008d09ba in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5191
#17 <signal handler called>
#18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
#19 0x00000000008cc6c1 in ServerLoop () at postmaster.c:1757
#20 0x00000000008cc0aa in PostmasterMain (argc=11, argv=0x30d6590) at
postmaster.c:1465
#21 0x00000000007c9256 in main (argc=11, argv=0x30d6590) at main.c:202
(gdb) f 3
#3 0x00000000009bf5dc in AtEOXact_PgStat_Relations
(xact_state=0x31b1b50, isCommit=true) at pgstat_relation.c:508
508 Assert(tabstat->trans == trans);
(gdb) p tabstat
$1 = (PgStat_TableStatus *) 0x319e630
(gdb) p *tabstat
$2 = {t_id = 2139062143, t_shared = 127, trans = 0x7f7f7f7f7f7f7f7f,
t_counts = {t_numscans = 9187201950435737471, t_tuples_returned =
9187201950435737471, t_tuples_fetched = 9187201950435737471,
t_tuples_inserted = 9187201950435737471, t_tuples_updated =
9187201950435737471, t_tuples_deleted = 9187201950435737471,
t_tuples_hot_updated = 9187201950435737471, t_truncdropped = 127,
t_delta_live_tuples = 9187201950435737471, t_delta_dead_tuples =
9187201950435737471, t_changed_tuples = 9187201950435737471,
t_blocks_fetched = 9187201950435737471, t_blocks_hit = 9187201950435737471},
relation = 0x7f7f7f7f7f7f7f7f}
(gdb) p trans
$3 = (PgStat_TableXactStatus *) 0x31b1ba8
(gdb) p *trans
$4 = {tuples_inserted = 1, tuples_updated = 0, tuples_deleted = 0,
truncdropped = false, inserted_pre_truncdrop = 0, updated_pre_truncdrop
= 0, deleted_pre_truncdrop = 0, nest_level = 1, upper = 0x0,
parent = 0x319e630, next = 0x31b1ab8}
(gdb)
Looks like we're one step further, no?
Erik
Show quoted text
(gdb) p tabstat
$1 = <optimized out>Great! It is that. But unfortunately they are optimized out.. Could
you cause the crash with -O0 binary? You will see the variable with
it. You can rebuild with the option as follows.$ make clean; make install CUSTOM_COPT="-O0 -g"
You can dump only the whole xact_state chain from the current core
file but the result will give a bit obscure hint for diagnosis.regards.
On 2022-Apr-19, Erik Rijkers wrote:
(gdb) p tabstat
$1 = (PgStat_TableStatus *) 0x319e630
(gdb) p *tabstat
$2 = {t_id = 2139062143, t_shared = 127, trans = 0x7f7f7f7f7f7f7f7f,
t_counts = {t_numscans = 9187201950435737471, t_tuples_returned =
9187201950435737471, t_tuples_fetched = 9187201950435737471,
t_tuples_inserted = 9187201950435737471, t_tuples_updated =
9187201950435737471, t_tuples_deleted = 9187201950435737471,
t_tuples_hot_updated = 9187201950435737471, t_truncdropped = 127,
t_delta_live_tuples = 9187201950435737471, t_delta_dead_tuples =
9187201950435737471, t_changed_tuples = 9187201950435737471,
t_blocks_fetched = 9187201950435737471, t_blocks_hit = 9187201950435737471},
relation = 0x7f7f7f7f7f7f7f7f}
It looks like this struct is freed or is in a memory context that was
reset. Perhaps its lifetime wasn't carefully considered in the logical
replication code, which takes some shortcuts.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")
Hi,
On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
The 12th run of statbug.sh crashed and gave a corefile.
I ran through quite a few iterations by now, without reproducing :(
I guess there's some timing issue and you're hitting on your system
due to the slower disks.
Program terminated with signal 6, Aborted.
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
#1 0x000000357d633cd5 in abort () from /lib64/libc.so.6
#2 0x0000000000b3bada in ExceptionalCondition (conditionName=0xd389a1
"tabstat->trans == trans", errorType=0xd388b2 "FailedAssertion",
fileName=0xd388a0 "pgstat_relation.c", lineNumber=508) at assert.c:69
#3 0x00000000009bf5dc in AtEOXact_PgStat_Relations (xact_state=0x31b1b50,
isCommit=true) at pgstat_relation.c:508
#4 0x00000000009c4107 in AtEOXact_PgStat (isCommit=true, parallel=false) at
pgstat_xact.c:54
#5 0x0000000000583764 in CommitTransaction () at xact.c:2360
#6 0x0000000000584354 in CommitTransactionCommand () at xact.c:3048
#7 0x000000000090b34e in apply_handle_commit_internal
(commit_data=0x7ffd024b5940) at worker.c:1532
#8 0x000000000090a287 in apply_handle_commit (s=0x7ffd024b59b0) at
worker.c:845
#9 0x000000000090ce3a in apply_dispatch (s=0x7ffd024b59b0) at worker.c:2473
#10 0x000000000090d41c in LogicalRepApplyLoop (last_received=74680880) at
worker.c:2757
#11 0x000000000090e974 in start_apply (origin_startpos=0) at worker.c:3526
#12 0x000000000090f156 in ApplyWorkerMain (main_arg=0) at worker.c:3782
#13 0x00000000008c7623 in StartBackgroundWorker () at bgworker.c:858
#14 0x00000000008d1557 in do_start_bgworker (rw=0x30ff0a0) at
postmaster.c:5802
#15 0x00000000008d1903 in maybe_start_bgworkers () at postmaster.c:6026
#16 0x00000000008d09ba in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5191
#17 <signal handler called>
#18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
#19 0x00000000008cc6c1 in ServerLoop () at postmaster.c:1757
#20 0x00000000008cc0aa in PostmasterMain (argc=11, argv=0x30d6590) at
postmaster.c:1465
#21 0x00000000007c9256 in main (argc=11, argv=0x30d6590) at main.c:202
(gdb) f 3
#3 0x00000000009bf5dc in AtEOXact_PgStat_Relations (xact_state=0x31b1b50,
isCommit=true) at pgstat_relation.c:508
508 Assert(tabstat->trans == trans);
(gdb) p tabstat
$1 = (PgStat_TableStatus *) 0x319e630
(gdb) p *tabstat
$2 = {t_id = 2139062143, t_shared = 127, trans = 0x7f7f7f7f7f7f7f7f,
t_counts = {t_numscans = 9187201950435737471, t_tuples_returned =
9187201950435737471, t_tuples_fetched = 9187201950435737471,
t_tuples_inserted = 9187201950435737471, t_tuples_updated =
9187201950435737471, t_tuples_deleted = 9187201950435737471,
t_tuples_hot_updated = 9187201950435737471, t_truncdropped = 127,
t_delta_live_tuples = 9187201950435737471, t_delta_dead_tuples =
9187201950435737471, t_changed_tuples = 9187201950435737471,
t_blocks_fetched = 9187201950435737471, t_blocks_hit = 9187201950435737471},
relation = 0x7f7f7f7f7f7f7f7f}
(gdb) p trans
$3 = (PgStat_TableXactStatus *) 0x31b1ba8
(gdb) p *trans
$4 = {tuples_inserted = 1, tuples_updated = 0, tuples_deleted = 0,
truncdropped = false, inserted_pre_truncdrop = 0, updated_pre_truncdrop = 0,
deleted_pre_truncdrop = 0, nest_level = 1, upper = 0x0,
parent = 0x319e630, next = 0x31b1ab8}
(gdb)
Could you print out
p xact_state
p *xact_state
p xact_state->first
p *xact_state->first
Do you have the server log file for the failed run / instance?
Greetings,
Andres Freund
Hi,
On 2022-04-19 10:36:24 -0700, Andres Freund wrote:
On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
The 12th run of statbug.sh crashed and gave a corefile.
I ran through quite a few iterations by now, without reproducing :(
I guess there's some timing issue and you're hitting on your system
due to the slower disks.
Ah. I found the issue. The new pgstat_report_stat(true) call in
LogicalRepApplyLoop()'s "timeout" section doesn't check if we're in a
transaction. And the transactional stats code doesn't handle that (never
has).
I think all that's needed is a if (IsTransactionState()) around that
pgstat_report_stat().
It might be possible to put an assertion into pgstat_report_stat(), but
I need to look at the process exit code to see if it is.
Greetings,
Andres Freund
Op 19-04-2022 om 19:36 schreef Andres Freund:
Hi,
On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
The 12th run of statbug.sh crashed and gave a corefile.
I ran through quite a few iterations by now, without reproducing :(
I guess there's some timing issue and you're hitting on your system
due to the slower disks.Program terminated with signal 6, Aborted.
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x000000357d6324f5 in raise () from /lib64/libc.so.6
#1 0x000000357d633cd5 in abort () from /lib64/libc.so.6
#2 0x0000000000b3bada in ExceptionalCondition (conditionName=0xd389a1
"tabstat->trans == trans", errorType=0xd388b2 "FailedAssertion",
fileName=0xd388a0 "pgstat_relation.c", lineNumber=508) at assert.c:69
#3 0x00000000009bf5dc in AtEOXact_PgStat_Relations (xact_state=0x31b1b50,
isCommit=true) at pgstat_relation.c:508
#4 0x00000000009c4107 in AtEOXact_PgStat (isCommit=true, parallel=false) at
pgstat_xact.c:54
#5 0x0000000000583764 in CommitTransaction () at xact.c:2360
#6 0x0000000000584354 in CommitTransactionCommand () at xact.c:3048
#7 0x000000000090b34e in apply_handle_commit_internal
(commit_data=0x7ffd024b5940) at worker.c:1532
#8 0x000000000090a287 in apply_handle_commit (s=0x7ffd024b59b0) at
worker.c:845
#9 0x000000000090ce3a in apply_dispatch (s=0x7ffd024b59b0) at worker.c:2473
#10 0x000000000090d41c in LogicalRepApplyLoop (last_received=74680880) at
worker.c:2757
#11 0x000000000090e974 in start_apply (origin_startpos=0) at worker.c:3526
#12 0x000000000090f156 in ApplyWorkerMain (main_arg=0) at worker.c:3782
#13 0x00000000008c7623 in StartBackgroundWorker () at bgworker.c:858
#14 0x00000000008d1557 in do_start_bgworker (rw=0x30ff0a0) at
postmaster.c:5802
#15 0x00000000008d1903 in maybe_start_bgworkers () at postmaster.c:6026
#16 0x00000000008d09ba in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5191
#17 <signal handler called>
#18 0x000000357d6e1683 in __select_nocancel () from /lib64/libc.so.6
#19 0x00000000008cc6c1 in ServerLoop () at postmaster.c:1757
#20 0x00000000008cc0aa in PostmasterMain (argc=11, argv=0x30d6590) at
postmaster.c:1465
#21 0x00000000007c9256 in main (argc=11, argv=0x30d6590) at main.c:202
(gdb) f 3
#3 0x00000000009bf5dc in AtEOXact_PgStat_Relations (xact_state=0x31b1b50,
isCommit=true) at pgstat_relation.c:508
508 Assert(tabstat->trans == trans);
(gdb) p tabstat
$1 = (PgStat_TableStatus *) 0x319e630
(gdb) p *tabstat
$2 = {t_id = 2139062143, t_shared = 127, trans = 0x7f7f7f7f7f7f7f7f,
t_counts = {t_numscans = 9187201950435737471, t_tuples_returned =
9187201950435737471, t_tuples_fetched = 9187201950435737471,
t_tuples_inserted = 9187201950435737471, t_tuples_updated =
9187201950435737471, t_tuples_deleted = 9187201950435737471,
t_tuples_hot_updated = 9187201950435737471, t_truncdropped = 127,
t_delta_live_tuples = 9187201950435737471, t_delta_dead_tuples =
9187201950435737471, t_changed_tuples = 9187201950435737471,
t_blocks_fetched = 9187201950435737471, t_blocks_hit = 9187201950435737471},
relation = 0x7f7f7f7f7f7f7f7f}
(gdb) p trans
$3 = (PgStat_TableXactStatus *) 0x31b1ba8
(gdb) p *trans
$4 = {tuples_inserted = 1, tuples_updated = 0, tuples_deleted = 0,
truncdropped = false, inserted_pre_truncdrop = 0, updated_pre_truncdrop = 0,
deleted_pre_truncdrop = 0, nest_level = 1, upper = 0x0,
parent = 0x319e630, next = 0x31b1ab8}
(gdb)Could you print out
p xact_state
p *xact_state
p xact_state->first
p *xact_state->firstDo you have the server log file for the failed run / instance?
(gdb) p xact_state
$5 = (PgStat_SubXactStatus *) 0x31b1b50
(gdb) p *xact_state
$6 = {nest_level = 1, prev = 0x0, pending_drops = {head = {prev =
0x31b1b60, next = 0x31b1b60}}, pending_drops_count = 0, first = 0x31b1ba8}
(gdb) p xact_state->first
$7 = (PgStat_TableXactStatus *) 0x31b1ba8
(gdb) p *xact_state->first
$8 = {tuples_inserted = 1, tuples_updated = 0, tuples_deleted = 0,
truncdropped = false, inserted_pre_truncdrop = 0, updated_pre_truncdrop
= 0, deleted_pre_truncdrop = 0, nest_level = 1, upper = 0x0,
parent = 0x319e630, next = 0x31b1ab8}
(gdb)
The logfile is attached.
Erik
Show quoted text
Greetings,
Andres Freund
Attachments:
At Tue, 19 Apr 2022 10:55:26 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-19 10:36:24 -0700, Andres Freund wrote:
On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
The 12th run of statbug.sh crashed and gave a corefile.
I ran through quite a few iterations by now, without reproducing :(
I guess there's some timing issue and you're hitting on your system
due to the slower disks.Ah. I found the issue. The new pgstat_report_stat(true) call in
LogicalRepApplyLoop()'s "timeout" section doesn't check if we're in a
transaction. And the transactional stats code doesn't handle that (never
has).I think all that's needed is a if (IsTransactionState()) around that
pgstat_report_stat().
if (!IsTransactinoState()) ?
It might be possible to put an assertion into pgstat_report_stat(), but
I need to look at the process exit code to see if it is.
Inserting a sleep in pgoutput_commit_txn reproduced this. Crashes with
the same stack trace with the similar variable state.
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index b197bfd565..def4d751d3 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -568,6 +568,7 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
return;
}
+ sleep(2);
OutputPluginPrepareWrite(ctx, true);
logicalrep_write_commit(ctx->out, txn, commit_lsn);
OutputPluginWrite(ctx, true);
The following actuall works for this.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 4171371296..f4e5359513 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -2882,10 +2882,11 @@ LogicalRepApplyLoop(XLogRecPtr last_received)
send_feedback(last_received, requestReply, requestReply);
/*
- * Force reporting to ensure long idle periods don't lead to
- * arbitrarily delayed stats.
+ * Force reporting to ensure long out-of-transaction idle periods
+ * don't lead to arbitrarily delayed stats.
*/
- pgstat_report_stat(true);
+ if (!IsTransactionState())
+ pgstat_report_stat(true);
}
}
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Op 20-04-2022 om 06:54 schreef Kyotaro Horiguchi:
At Tue, 19 Apr 2022 10:55:26 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-19 10:36:24 -0700, Andres Freund wrote:
On 2022-04-19 13:50:25 +0200, Erik Rijkers wrote:
The 12th run of statbug.sh crashed and gave a corefile.
I ran through quite a few iterations by now, without reproducing :(
I guess there's some timing issue and you're hitting on your system
due to the slower disks.Ah. I found the issue. The new pgstat_report_stat(true) call in
LogicalRepApplyLoop()'s "timeout" section doesn't check if we're in a
transaction. And the transactional stats code doesn't handle that (never
has).I think all that's needed is a if (IsTransactionState()) around that
pgstat_report_stat().if (!IsTransactinoState()) ?
It might be possible to put an assertion into pgstat_report_stat(), but
I need to look at the process exit code to see if it is.Inserting a sleep in pgoutput_commit_txn reproduced this. Crashes with
the same stack trace with the similar variable state.diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c index b197bfd565..def4d751d3 100644 --- a/src/backend/replication/pgoutput/pgoutput.c +++ b/src/backend/replication/pgoutput/pgoutput.c @@ -568,6 +568,7 @@ pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, return; }+ sleep(2);
OutputPluginPrepareWrite(ctx, true);
logicalrep_write_commit(ctx->out, txn, commit_lsn);
OutputPluginWrite(ctx, true);The following actuall works for this.
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c index 4171371296..f4e5359513 100644 --- a/src/backend/replication/logical/worker.c +++ b/src/backend/replication/logical/worker.c @@ -2882,10 +2882,11 @@ LogicalRepApplyLoop(XLogRecPtr last_received) send_feedback(last_received, requestReply, requestReply);/* - * Force reporting to ensure long idle periods don't lead to - * arbitrarily delayed stats. + * Force reporting to ensure long out-of-transaction idle periods + * don't lead to arbitrarily delayed stats. */ - pgstat_report_stat(true); + if (!IsTransactionState()) + pgstat_report_stat(true); } }
Yes, that seems to fix it: I applied that latter patch, and ran my
program 250x without errors. Then I removed it again an it gave the
error within 15x.
thanks!
Erik
Show quoted text
regards.
On Wed, Apr 20, 2022 at 01:03:20PM +0200, Erik Rijkers wrote:
Yes, that seems to fix it: I applied that latter patch, and ran my program
250x without errors. Then I removed it again an it gave the error within
15x.
That looks simple enough, indeed. Andres, are you planning to address
this issue?
--
Michael
On Mon, Apr 25, 2022 at 03:18:52PM +0900, Michael Paquier wrote:
On Wed, Apr 20, 2022 at 01:03:20PM +0200, Erik Rijkers wrote:
Yes, that seems to fix it: I applied that latter patch, and ran my program
250x without errors. Then I removed it again an it gave the error within
15x.That looks simple enough, indeed. Andres, are you planning to address
this issue?
Ping. It looks annoying to release beta1 with that, as assertions are
likely going to be enabled in a lot of test builds.
--
Michael
Hi,
On 2022-05-11 15:48:40 +0900, Michael Paquier wrote:
On Mon, Apr 25, 2022 at 03:18:52PM +0900, Michael Paquier wrote:
On Wed, Apr 20, 2022 at 01:03:20PM +0200, Erik Rijkers wrote:
Yes, that seems to fix it: I applied that latter patch, and ran my program
250x without errors. Then I removed it again an it gave the error within
15x.That looks simple enough, indeed. Andres, are you planning to address
this issue?Ping. It looks annoying to release beta1 with that, as assertions are
likely going to be enabled in a lot of test builds.
I'll try to fix it tomorrow... Sorry for the delay.
Greetings,
Andres Freund
On Wed, May 11, 2022 at 08:32:14PM -0700, Andres Freund wrote:
I'll try to fix it tomorrow... Sorry for the delay.
Thanks, Andres.
--
Michael
Hi,
I finally pushed the fix for this. Erik, thanks for the report! And thanks
Michael for the ping...
On 2022-05-11 20:32:14 -0700, Andres Freund wrote:
On 2022-05-11 15:48:40 +0900, Michael Paquier wrote:
Ping. It looks annoying to release beta1 with that, as assertions are
likely going to be enabled in a lot of test builds.
FWIW, it's somewhat hard to hit (basically the sender needs to stall while
sending out a transaction / network being really slow), so it'd not have been
likely to be hit by all that many people.
Greetings,
Andres Freund