rare crash - FailedAssertion snapbuild.c Line: 580

Started by Erik Rijkersover 7 years ago5 messages
#1Erik Rijkers
er@xs4all.nl

Hello,

To test postgres 11, I still regularly run series of short sessions of
pgbench-over-logical-replication (basically the same thing that I used
last year [1]/messages/by-id/3897361c7010c4ac03f358173adbcd60@xs4all.nl - now in a perl incarnation). Most of the time the
replication is stable and finishes correctly but sometimes (rarely) I
get:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)

This will probably be difficult to reproduce and to act upon but I
wanted to report it anyway as in the course of the last few months I
have seen it several times, on several machines. Always rarely, always
postgres 11 (I did not try other versions).

Erik Rijkers

[1]: /messages/by-id/3897361c7010c4ac03f358173adbcd60@xs4all.nl
/messages/by-id/3897361c7010c4ac03f358173adbcd60@xs4all.nl

source/version: bf2d0462cd73 / REL_11_STABLE (compiled 20180828_1629)

1 primary, 4 replicas (on 1 host).
Basically: pgbench --port=6515 --quiet --initialize --scale=5 postgres
then: pgbench -M prepared -c 93 -j 8 -T 10 -P 2 -p 6515 postgres
then: wait for log-repl sync.
(I also added extra data type columns to the basic pgbench tables to
test all the different data types)

Below is a grep for the assert message, with 15 surrounding lines from
the latest occurrence.

$ p=REL_11_STABLE; cd /var/data1/pg_stuff/tmp/cascade/$p ; grep -A 15 -B
15 -Ei 'trap|assert' 65*/logfile.65* | less
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] LOG:
logical decoding found consistent point at 0/89E038E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:13.747 CEST [139409] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:14.522 CEST [139539] LOG:
received replication command: CREATE_REPLICATION_SLOT
"sub2_6517_6517_18834_sync_18804" TEMPORARY LOGICAL pgoutput
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] LOG:
logical decoding found initial starting point at 0/8AF5D590
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.681 CEST [139539] DETAIL:
Waiting for transactions (approximately 1) older than 886297 to end.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] LOG:
logical decoding found consistent point at 0/8B018310
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.720 CEST [139539] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.860 CEST [139407] LOG:
received replication command: START_REPLICATION SLOT
"sub2_6517_6517_18834_sync_18793" LOGICAL 0/89E03918 (proto_version '1',
publication_names '"pub1_6516"')
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] LOG:
starting logical decoding for slot "sub2_6517_6517_18834_sync_18793"
6516JTq_E8/logfile.6516-2018-08-29 14:28:15.861 CEST [139407] DETAIL:
Streaming transactions committing after 0/89E03918, reading WAL from
0/841EE960.
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] LOG:
logical decoding found consistent point at 0/841EF3E0
6516JTq_E8/logfile.6516-2018-08-29 14:28:16.064 CEST [139407] DETAIL:
Logical decoding will begin using saved snapshot.
6516JTq_E8/logfile.6516-2018-08-29 14:28:18.953 CEST [139541] LOG:
received replication command: CREATE_REPLICATION_SLOT
"sub2_6517_6517_18834_sync_18814" TEMPORARY LOGICAL pgoutput
USE_SNAPSHOT
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] LOG:
logical decoding found consistent point at 0/8C6BA408
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.002 CEST [139541] DETAIL:
There are no running transactions.
6516JTq_E8/logfile.6516:TRAP:
FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))",
File: "snapbuild.c", Line: 580)
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:
server process (PID 139541) was terminated by signal 6: Aborted
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] DETAIL:
Failed process was running: BEGIN READ ONLY ISOLATION LEVEL REPEATABLE
READ
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130398] LOG:
terminating any other active server processes
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [139405] CONTEXT:
slot "sub2_6517_6517", output plugin "pgoutput", in the change callback,
associated LSN 0/8B11CF00
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130411] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] WARNING:
terminating connection because of crash of another server process
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] DETAIL:
The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.112 CEST [130405] HINT: In
a moment you should be able to reconnect to the database and repeat your
command.
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.116 CEST [130398] LOG: all
server processes terminated; reinitializing
6516JTq_E8/logfile.6516-2018-08-29 14:28:20.515 CEST [139548] LOG:
database system was interrupted; last known up at 2018-08-29 14:26:14
CEST

#2Andres Freund
andres@anarazel.de
In reply to: Erik Rijkers (#1)
Re: rare crash - FailedAssertion snapbuild.c Line: 580

Hi,

On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:

To test postgres 11, I still regularly run series of short sessions of
pgbench-over-logical-replication (basically the same thing that I used last
year [1] - now in a perl incarnation). Most of the time the replication is
stable and finishes correctly but sometimes (rarely) I get:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)

This will probably be difficult to reproduce and to act upon but I wanted to
report it anyway as in the course of the last few months I have seen it
several times, on several machines. Always rarely, always postgres 11 (I did
not try other versions).

Thanks for testing! Could you possibly run the tests with core files
enabled, so we at get a backtrace in case of trouble? Knowing what the
values here are would be tremendously helpful...

Greetings,

Andres Freund

#3Erik Rijkers
er@xs4all.nl
In reply to: Andres Freund (#2)
Re: rare crash - FailedAssertion snapbuild.c Line: 580

On 2018-08-29 21:15, Andres Freund wrote:

Hi,

On 2018-08-29 17:43:17 +0200, Erik Rijkers wrote:

To test postgres 11, I still regularly run series of short sessions of
pgbench-over-logical-replication (basically the same thing that I used
last
year [1] - now in a perl incarnation). Most of the time the
replication is
stable and finishes correctly but sometimes (rarely) I get:

TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)

This will probably be difficult to reproduce and to act upon but I
wanted to
report it anyway as in the course of the last few months I have seen
it
several times, on several machines. Always rarely, always postgres 11
(I did
not try other versions).

Thanks for testing! Could you possibly run the tests with core files
enabled, so we at get a backtrace in case of trouble? Knowing what the
values here are would be tremendously helpful...

ok, is this any use?

$ gdb --quiet
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres
/var/data1/pg_stuff/tmp/cascade/REL_11_STABLE/6516_gW1Cl/data/core
Reading symbols from
/var/data1/pg_stuff/pg_installations/pgsql.REL_11_STABLE/bin/postgres...done.
[New LWP 147484]
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: walsender rijkers [local] idle in
transaction '.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f0fd20e7067 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or
directory.
(gdb) bt
#0 0x00007f0fd20e7067 in __GI_raise (sig=sig@entry=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f0fd20e8448 in __GI_abort () at abort.c:89
#2 0x00000000008880bf in ExceptionalCondition
(conditionName=conditionName@entry=0xa417f8
"!(TransactionIdPrecedesOrEquals(safeXid, snap->xmin))",
errorType=errorType@entry=0x8d365d "FailedAssertion",
fileName=fileName@entry=0xa41223 "snapbuild.c",
lineNumber=lineNumber@entry=580) at assert.c:54
#3 0x000000000072676e in SnapBuildInitialSnapshot () at snapbuild.c:580
#4 0x000000000072ed54 in CreateReplicationSlot (cmd=0x223bae0) at
walsender.c:951
#5 exec_replication_command (cmd_string=cmd_string@entry=0x21a1cf8
"CREATE_REPLICATION_SLOT \"sub2_6517_6517_18748_sync_18728\" TEMPORARY
LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1527
#6 0x000000000077e8ee in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x21cbbc8, dbname=<optimized out>, username=<optimized
out>) at postgres.c:4155
#7 0x0000000000704fde in BackendRun (port=0x21c4520) at
postmaster.c:4361
#8 BackendStartup (port=0x21c4520) at postmaster.c:4033
#9 ServerLoop () at postmaster.c:1706
#10 0x0000000000705e0f in PostmasterMain (argc=argc@entry=12,
argv=argv@entry=0x219c470) at postmaster.c:1379
#11 0x0000000000478d80 in main (argc=12, argv=0x219c470) at main.c:228
(gdb)

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Erik Rijkers (#3)
Re: rare crash - FailedAssertion snapbuild.c Line: 580

On 2018-Aug-30, Erik Rijkers wrote:

ok, is this any use?

Seems mostly good, but the Xids are not printed. Could you please do
"bt full"? Also:

frame 3
print *snap

Thanks,

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Erik Rijkers
er@xs4all.nl
In reply to: Alvaro Herrera (#4)
1 attachment(s)
Re: rare crash - FailedAssertion snapbuild.c Line: 580

On 2018-08-30 16:44, Alvaro Herrera wrote:

On 2018-Aug-30, Erik Rijkers wrote:

ok, is this any use?

Seems mostly good, but the Xids are not printed. Could you please do
"bt full"? Also:

frame 3
print *snap

See the attached.

Attachments:

gdb_bt_full_frame3_snap.txttext/plain; name=gdb_bt_full_frame3_snap.txtDownload