BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Started by PG Bug reporting form · about 7 years ago · 46 messages · pgsql-bugs
#1 PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15727
Logged by: Roman Zharkov
Email address: r.zharkov@postgrespro.ru
PostgreSQL version: Unsupported/Unknown
Operating system: Ubuntu 5.4.0-6ubuntu1~16.04.10
Description:

Hello,
Our test server crashed while benchmarking the Yandex Odyssey connection
pooler. There is no debugging info in the core dump at the moment, because it
takes a few hours to reproduce the error. We will try to reproduce the error
with debugging info.

Server version - PostgreSQL 12devel-master, commit - bb76134 on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0
20160609, 64-bit

The error does not reproduce on PostgreSQL 12devel-master, commit -
2fadf24

2019-04-01 15:27:38.829 +07 [7524] ERROR: unexpected table_lock_tuple status: 1
2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction 400048276, it was already committed
2019-04-01 15:27:39.001 +07 [7499] ERROR: unexpected table_lock_tuple status: 1
2019-04-01 15:27:39.001 +07 [7499] STATEMENT: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;
2019-04-01 15:27:39.001 +07 [7499] PANIC: cannot abort transaction 400051561, it was already committed
2019-04-01 15:27:41.193 +07 [7509] ERROR: unexpected table_lock_tuple status: 1
2019-04-01 15:27:41.193 +07 [7509] STATEMENT: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;
2019-04-01 15:27:41.193 +07 [7509] PANIC: cannot abort transaction 400088425, it was already committed
2019-04-01 15:41:01.176 +07 [3559] LOG: server process (PID 7499) was terminated by signal 6: Aborted
2019-04-01 15:41:01.176 +07 [3559] DETAIL: Failed process was running: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;
2019-04-01 15:41:01.176 +07 [3559] LOG: terminating any other active server processes

[New LWP 7509]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(42666) UPDATE
'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f1f4c6c7428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f1f4c6c7428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f1f4c6c902a in __GI_abort () at abort.c:89
#2 0x000000000085faac in errfinish ()
#3 0x00000000008633e4 in elog_finish ()
#4 0x000000000050266d in RecordTransactionAbort ()
#5 0x0000000000502788 in AbortTransaction ()
#6 0x0000000000502fb5 in AbortCurrentTransaction ()
#7 0x000000000075029f in PostgresMain ()
#8 0x0000000000480593 in ServerLoop ()
#9 0x00000000006dfc04 in PostmasterMain ()
#10 0x00000000004813b9 in main ()

#2 Andres Freund
andres@anarazel.de
In reply to: PG Bug reporting form (#1)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Hi,

On 2019-04-02 10:30:15 +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 15727
Logged by: Roman Zharkov
Email address: r.zharkov@postgrespro.ru
PostgreSQL version: Unsupported/Unknown
Operating system: Ubuntu 5.4.0-6ubuntu1~16.04.10
Description:

Hello,
Our test server crashed while benchmarking the Yandex Odyssey connection
pooler.

What's the exact workload? Just pgbench?

There is no debugging info in the core dump at the moment, because it takes a
few hours to reproduce the error. We will try to reproduce the error with
debugging info.

Unfortunately a backtrace is probably not going to help us much, I'm
afraid. The issue is likely triggered at an earlier stage than the error
thrown here.

Server version - PostgreSQL 12devel-master, commit - bb76134 on
x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0
20160609, 64-bit

The error does not reproduce on PostgreSQL 12devel-master, commit -
2fadf24

2019-04-01 15:27:38.829 +07 [7524] ERROR: unexpected table_lock_tuple
status: 1

Hm, so that appears to be related to my changes. Could you run this with
log_error_verbosity=verbose, so we know which of the codepaths this
hits? Apparently we're encountering an invisible tuple somewhere, which
I thought was not possible (hence the error) - previously we just
ignored that.
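A minimal way to apply that setting without restarting the server (a sketch; it assumes superuser access via psql and a database named test, as in the report):

```shell
# Turn on verbose error logging cluster-wide and reload the configuration.
# Verbose output adds the source file, line number, and function name to
# each log entry, which identifies the exact elog() call site.
psql -d test -c "ALTER SYSTEM SET log_error_verbosity = 'verbose';"
psql -d test -c "SELECT pg_reload_conf();"
```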

2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction
400048276, it was already committed

But that's probably a separate issue.

Greetings,

Andres Freund

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#2)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund <andres@anarazel.de> writes:

2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction
400048276, it was already committed

But that's probably a separate issue.

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

regards, tom lane

#4 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#3)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Hi,

On 2019-04-02 11:44:26 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction
400048276, it was already committed

But that's probably a separate issue.

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

I suspect that's a deferred trigger. But that code obviously could throw
errors, so we gotta handle that correctly.

Greetings,

Andres Freund

#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

On 2019-Apr-02, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction
400048276, it was already committed

But that's probably a separate issue.

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

AFAICS this error can only come from ExecDelete(), because the value 1
is TM_Invisible and the other callsites where the "unexpected
table_lock_tuple" error appears use different wording for that one.

Maybe it's the result of a deferred constraint being checked at that
time ... maybe it's trying to honor an "on cascade delete" setting for
an FK, and the affected tuple has already been updated or deleted?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#4)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund <andres@anarazel.de> writes:

On 2019-04-02 11:44:26 -0400, Tom Lane wrote:

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

I suspect that's a deferred trigger. But that code obviously could throw
errors, so we gotta handle that correctly.

Deferred triggers execute before the real transaction commit, not during
the critical section surrounding where we set the bit in pg_clog.
But this error is seemingly getting thrown from within that critical
section, otherwise we wouldn't have inconsistent xact status afterwards.

regards, tom lane

#7 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#6)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Hi,

On 2019-04-02 11:51:38 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

On 2019-04-02 11:44:26 -0400, Tom Lane wrote:

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

I suspect that's a deferred trigger. But that code obviously could throw
errors, so we gotta handle that correctly.

Deferred triggers execute before the real transaction commit, not during
the critical section surrounding where we set the bit in pg_clog.
But this error is seemingly getting thrown from within that critical
section, otherwise we wouldn't have inconsistent xact status afterwards.

I don't think it was thrown during the critical section - we'd otherwise
see the ERROR about unexpected lock status converted to a PANIC (and the
"cannot abort transaction, already committed" is a PANIC itself). I
suspect it's more likely that we corrupt the xact.c status somewhere -
but it's hard to say without knowing more about the workload.

Greetings,

Andres Freund

#8 Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#5)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Hi,

On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:

On 2019-Apr-02, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction
400048276, it was already committed

But that's probably a separate issue.

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

AFAICS this error can only come from ExecDelete(), because the value 1
is TM_Invisible and the other callsites where the "unexpected
table_lock_tuple" error appears use different wording for that one.

Hm? Why couldn't it be the ExecUpdate() case?

Maybe it's the result of a deferred constraint being checked at that
time ... maybe it's trying to honor an "on cascade delete" setting for
an FK, and the affected tuple has already been updated or deleted?

Then it ought to get TM_Deleted, no? We ought to wait till that
transaction commits, and then roll back. So there's something odd
happening here. I suspect there has to be some additional log output or
such to explain this.

Greetings,

Andres Freund

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#8)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

On 2019-Apr-02, Andres Freund wrote:

On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:

AFAICS this error can only come from ExecDelete(), because the value 1
is TM_Invisible and the other callsites where the "unexpected
table_lock_tuple" error appears use different wording for that one.

Hm? Why couldn't it be the ExecUpdate() case?

You're right, that one too.

Maybe it's the result of a deferred constraint being checked at that
time ... maybe it's trying to honor an "on cascade delete" setting for
an FK, and the affected tuple has already been updated or deleted?

Then it ought to get TM_Deleted, no? We ought to wait till that
transaction commits, and then roll back.

I was thinking that this would have happened in the same transaction;
but yeah I didn't spend too much time analyzing the exact code flow.
Anyway I agree that there's something odd going on, and perhaps you just
unmasked an earlier bug.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Andres Freund (#8)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

On 2019-04-02 23:13, Andres Freund wrote:

Hi,

On 2019-04-02 12:51:08 -0300, Alvaro Herrera wrote:

On 2019-Apr-02, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

2019-04-01 15:27:38.829 +07 [7524] STATEMENT: UPDATE pgbench_accounts SET
abalance = 1 WHERE aid = 1;
2019-04-01 15:27:38.829 +07 [7524] PANIC: cannot abort transaction
400048276, it was already committed

But that's probably a separate issue.

What that seems to indicate is that the "unexpected table_lock_tuple
status" error was thrown during commit, which seems pretty odd.

AFAICS this error can only come from ExecDelete(), because the value 1
is TM_Invisible and the other callsites where the "unexpected
table_lock_tuple" error appears use different wording for that one.

Hm? Why couldn't it be the ExecUpdate() case?

Maybe it's the result of a deferred constraint being checked at that
time ... maybe it's trying to honor an "on cascade delete" setting for
an FK, and the affected tuple has already been updated or deleted?

Then it ought to get TM_Deleted, no? We ought to wait till that
transaction commits, and then roll back. So there's something odd
happening here. I suspect there has to be some additional log output
or
such to explain this.

Greetings,

Andres Freund

I have made a new core file:

[New LWP 5967]
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(35888) UPDATE
'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fe194769428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007fe194769428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007fe19476b02a in __GI_abort () at abort.c:89
#2 0x0000000000a9c0ac in errfinish (dummy=0) at elog.c:552
#3 0x0000000000a9e653 in elog_finish (elevel=22, fmt=0xb695a0 "cannot
abort transaction %u, it was already committed") at elog.c:1365
#4 0x0000000000562c87 in RecordTransactionAbort (isSubXact=false) at
xact.c:1644
#5 0x0000000000563c93 in AbortTransaction () at xact.c:2670
#6 0x0000000000564516 in AbortCurrentTransaction () at xact.c:3191
#7 0x00000000009145b9 in PostgresMain (argc=1, argv=0x255d970,
dbname=0x255d950 "test", username=0x252dd88 "zharkov") at
postgres.c:4033
#8 0x000000000086406c in BackendRun (port=0x2556020) at
postmaster.c:4399
#9 0x00000000008637cd in BackendStartup (port=0x2556020) at
postmaster.c:4090
#10 0x000000000085facf in ServerLoop () at postmaster.c:1703
#11 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x252bcf0) at
postmaster.c:1376
#12 0x000000000077aeba in main (argc=3, argv=0x252bcf0) at main.c:228

Log:
2019-04-02 21:51:19.639 +07 [5967] STATEMENT: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;
2019-04-02 21:51:19.639 +07 [5967] PANIC: cannot abort transaction 200011850, it was already committed
2019-04-02 21:53:57.924 +07 [2473] LOG: server process (PID 5967) was terminated by signal 6: Aborted
2019-04-02 21:53:57.924 +07 [2473] DETAIL: Failed process was running: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 1;
2019-04-02 21:53:57.924 +07 [2473] LOG: terminating any other active server processes

We run pgbench with two custom scripts.

First:
\set i random_zipfian(1, 100000 * :scale, 1.01)
SELECT abalance FROM pgbench_accounts WHERE aid = :i;

Second:
\set i random_zipfian(1, 100000 * :scale, 1.01)
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = :i;
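The two scripts above can be written out and driven by pgbench roughly like this (a sketch; the ./sql/ file names are taken from the pgbench run summary later in the thread, and the pooler host/port are placeholders):

```shell
# Recreate the two custom pgbench scripts described in the report.
mkdir -p sql
cat > sql/ycsb_read_zipf.sql <<'EOF'
\set i random_zipfian(1, 100000 * :scale, 1.01)
SELECT abalance FROM pgbench_accounts WHERE aid = :i;
EOF
cat > sql/ycsb_update_zipf.sql <<'EOF'
\set i random_zipfian(1, 100000 * :scale, 1.01)
UPDATE pgbench_accounts SET abalance = 1 WHERE aid = :i;
EOF
# A run resembling the reported benchmark; it needs a server reachable
# through the Odyssey pooler (the host/port below are placeholders):
# pgbench -h 127.0.0.1 -p 6432 -c 200 -j 32 -T 300 \
#         -f sql/ycsb_read_zipf.sql -f sql/ycsb_update_zipf.sql test
```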

pgbench connects to the database through the connection pooler called
Odyssey.
We use a pool of 16 or 32 backends.
We run four series of pgbench tests with different connection counts
( 100, 150, 200, ... , 1000 ).
Three series with pool size 16 run normally, but the fourth series fails
with 200-300 connections. It took about five hours to reproduce the error
this time.

Here is the little piece of the pgbench log:

Bench start
Tue Apr 2 14:43:12 UTC 2019
Timestamp 1554216192
transaction type: multiple scripts
scaling factor: 1
query mode: simple
number of clients: 200
number of threads: 32
duration: 300 s
number of transactions actually processed: 8062981
latency average = 7.442 ms
tps = 26875.375100 (including connections establishing)
tps = 26875.694564 (excluding connections establishing)
SQL script 1: ./sql/ycsb_read_zipf.sql
- weight: 1 (targets 50.0% of total)
- 4019635 transactions (49.9% of total, tps = 13398.171023)
- latency average = 6.955 ms
- latency stddev = 8.366 ms
SQL script 2: ./sql/ycsb_update_zipf.sql
- weight: 1 (targets 50.0% of total)
- 4021508 transactions (49.9% of total, tps = 13404.414071)
- latency average = 7.909 ms
- latency stddev = 8.478 ms
Bench end
Tue Apr 2 14:48:12 UTC 2019
Timestamp 1554216492
Bench start
Tue Apr 2 14:48:12 UTC 2019
Timestamp 1554216492
client 220 script 1 aborted in command 1 query 0: PANIC: cannot abort transaction 200011850, it was already committed
WARNING: terminating connection because of crash of another server process

--
regards,

Roman Zharkov

#11 Andres Freund
andres@anarazel.de
In reply to: Zharkov Roman (#10)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Hi,

On 2019-04-03 00:04:15 +0700, r.zharkov@postgrespro.ru wrote:

pgbench connects to the database through the connection pooler called
Odyssey.
We use a pool of 16 or 32 backends.
We run four series of pgbench tests with different connection counts ( 100,
150, 200, ... , 1000 ).
Three series with pool size 16 run normally, but the fourth series fails
with 200-300 connections. It took about five hours to reproduce the error
this time.

Here is the little piece of the pgbench log:

Bench start
Tue Apr 2 14:43:12 UTC 2019
Timestamp 1554216192
transaction type: multiple scripts
scaling factor: 1

Is this the actual scaling factor? Did you create the pgbench database
with foreign keys?

Greetings,

Andres Freund

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zharkov Roman (#10)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

r.zharkov@postgrespro.ru writes:

I made the new core file:

Thanks, but this isn't much help --- it just shows what we already
know, ie the "cannot abort transaction" error occurs after some other
error was thrown.

What would be more helpful would be to adjust the places that
can throw the "unexpected table_lock_tuple status" error text to be
PANIC rather than ERROR, so that the core dump would show a stack
trace showing how we got to whichever of those places this is.
Or, run the test with a higher log verbosity, so that you can find
out which of those places is throwing the error to begin with;
then just PANIC-ize that one.

FWIW, I see six potential candidates, not two:

pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test);
pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u", res);
pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u", res);

regards, tom lane

#13 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#12)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Hi,

On 2019-04-02 13:13:58 -0400, Tom Lane wrote:

FWIW, I see six potential candidates, not two:

pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test);
pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u", res);
pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u", res);

These have an explicit case / separate error for TM_Invisible (= 1 -
which is the status we're seeing according to the error message)
though. So afaict they can't be relevant here.

Greetings,

Andres Freund

#14 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#12)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

On 2019-Apr-02, Tom Lane wrote:

FWIW, I see six potential candidates, not two:

pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test);
pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u", res);
pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u", res);

The other four handle the TM_Invisible case with a different error
message. Anyway, the six of them are unexpected so changing them all to
PANIC ought to be good enough.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#13)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Andres Freund <andres@anarazel.de> writes:

On 2019-04-02 13:13:58 -0400, Tom Lane wrote:

FWIW, I see six potential candidates, not two:

These have an explicit case / separate error for TM_Invisible (= 1 -
which is the status we're seeing according to the error message)
though. So afaict they can't be relevant here.

Oh, right, so it has to be from one of the calls in nodeModifyTable.c.
But it's hardly plausible that that runs post-commit. So now I'm
thinking that what we really need to know is why a TM_Invisible
result was returned. That's going to be a bit harder ...

regards, tom lane

#16 Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Tom Lane (#12)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

On Apr 3, 2019, at 0:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

r.zharkov@postgrespro.ru writes:

I made the new core file:

Thanks, but this isn't much help --- it just shows what we already
know, ie the "cannot abort transaction" error occurs after some other
error was thrown.

What would be more helpful would be to adjust the places that
can throw the "unexpected table_lock_tuple status" error text to be
PANIC rather than ERROR, so that the core dump would show a stack
trace showing how we got to whichever of those places this is.
Or, run the test with a higher log verbosity, so that you can find
out which of those places is throwing the error to begin with;
then just PANIC-ize that one.

FWIW, I see six potential candidates, not two:

pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test);
pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u", res);
pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u", res);

regards, tom lane

Ok. I will do it tomorrow morning.

regards,
Roman Zharkov

#17 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#15)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

Hi,

On 2019-04-02 13:30:56 -0400, Tom Lane wrote:

But it's hardly plausible that that runs post-commit.

Yea, one surely would hope that that's not happening. I suspect it's
possible it's running pre-commit and somehow the failure handling is
screwed up leading to the 'already committed' error.

So now I'm thinking that what we really need to know is why a
TM_Invisible result was returned. That's going to be a bit harder ...

Yea. I re-skimmed through the code and can't immediately see anything
wrong - not that such subtle issues would necessarily be noticed without
more information.

I wonder, is there a chance that you're using an extension with xact
hooks, or a modified version of postgres? Could you show us the exact
schema used, and the config?

Greetings,

Andres Freund

#18 Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Tom Lane (#12)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

On 2019-04-03 00:13, Tom Lane wrote:

r.zharkov@postgrespro.ru writes:

I made the new core file:

Thanks, but this isn't much help --- it just shows what we already
know, ie the "cannot abort transaction" error occurs after some other
error was thrown.

What would be more helpful would be to adjust the places that
can throw the "unexpected table_lock_tuple status" error text to be
PANIC rather than ERROR, so that the core dump would show a stack
trace showing how we got to whichever of those places this is.
Or, run the test with a higher log verbosity, so that you can find
out which of those places is throwing the error to begin with;
then just PANIC-ize that one.

FWIW, I see six potential candidates, not two:

pgsql/src/backend/commands/trigger.c: 3380: elog(ERROR, "unexpected table_lock_tuple status: %u", test);
pgsql/src/backend/executor/nodeLockRows.c: 232: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 881: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/nodeModifyTable.c: 1384: elog(ERROR, "unexpected table_lock_tuple status: %u",
pgsql/src/backend/executor/execReplication.c: 211: elog(ERROR, "unexpected table_lock_tuple status: %u", res);
pgsql/src/backend/executor/execReplication.c: 375: elog(ERROR, "unexpected table_lock_tuple status: %u", res);

regards, tom lane

Hello,
I have made a new core dump. Here are the results:
PostgreSQL 12devel-master/bb76134

Log:
TRAP: FailedAssertion("!((((tuple->t_data->t_infomask) & 0x0080) || (((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 | 0x0010))) == 0x0040)))", File: "heapam.c", Line:
2019-04-03 15:13:36.668 +07 [23163] LOG: server process (PID 27904) was terminated by signal 6: Aborted
2019-04-03 15:13:36.668 +07 [23163] DETAIL: Failed process was running: UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;
2019-04-03 15:13:36.668 +07 [23163] LOG: terminating any other active server processes
2019-04-03 15:13:36.669 +07 [27907] WARNING: terminating connection because of crash of another server process

Backtrace:
[New LWP 27904]
[Thread debugging using libthread_db enabled]
Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: zharkov test 127.0.0.1(44582) UPDATE
'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f5e0dc7d02a in __GI_abort () at abort.c:89
#2 0x0000000000a9b7ce in ExceptionalCondition (
conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) ||
(((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 |
0x0010))) == 0x0040)))",
errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 "heapam.c",
lineNumber=4286) at assert.c:54
#3 0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80,
tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive,
wait_policy=LockWaitBlock, follow_updates=false,
buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286
#4 0x00000000004f5903 in heapam_tuple_lock (relation=0x7f5e0f252d80,
tid=0x7ffed968cdf0, snapshot=0x303cef0, slot=0x30f4d80, cid=0,
mode=LockTupleNoKeyExclusive,
wait_policy=LockWaitBlock, flags=2 '\002', tmfd=0x7ffed968cd80) at
heapam_handler.c:352
#5 0x000000000074055a in table_lock_tuple (rel=0x7f5e0f252d80,
tid=0x7ffed968cdf0, snapshot=0x303cef0, slot=0x30f4d80, cid=0,
mode=LockTupleNoKeyExclusive, wait_policy=LockWaitBlock,
flags=2 '\002', tmfd=0x7ffed968cd80) at
../../../src/include/access/tableam.h:993
#6 0x00000000007421cc in ExecUpdate (mtstate=0x30eb780,
tupleid=0x7ffed968cdf0, oldtuple=0x0, slot=0x30ed1c8,
planSlot=0x30ed050, epqstate=0x30eb878, estate=0x30eb3d8,
canSetTag=true)
at nodeModifyTable.c:1242
#7 0x0000000000743756 in ExecModifyTable (pstate=0x30eb780) at
nodeModifyTable.c:2070
#8 0x0000000000712671 in ExecProcNodeFirst (node=0x30eb780) at
execProcnode.c:445
#9 0x000000000070748c in ExecProcNode (node=0x30eb780) at
../../../src/include/executor/executor.h:239
#10 0x0000000000709d11 in ExecutePlan (estate=0x30eb3d8,
planstate=0x30eb780, use_parallel_mode=false, operation=CMD_UPDATE,
sendTuples=false, numberTuples=0,
direction=ForwardScanDirection, dest=0x30ee508, execute_once=true)
at execMain.c:1643
#11 0x0000000000707ac9 in standard_ExecutorRun (queryDesc=0x303faf8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:362
#12 0x0000000000707905 in ExecutorRun (queryDesc=0x303faf8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:306
#13 0x0000000000915950 in ProcessQuery (plan=0x30df120,
sourceText=0x30161d8 "UPDATE pgbench_accounts SET abalance = 1 WHERE aid
= 2;", params=0x0, queryEnv=0x0, dest=0x30ee508,
completionTag=0x7ffed968d270 "") at pquery.c:161
#14 0x000000000091739c in PortalRunMulti (portal=0x3098078,
isTopLevel=true, setHoldSnapshot=false, dest=0x30ee508,
altdest=0x30ee508, completionTag=0x7ffed968d270 "") at pquery.c:1283
#15 0x00000000009168e8 in PortalRun (portal=0x3098078,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x30ee508, altdest=0x30ee508, completionTag=0x7ffed968d270 "")
at pquery.c:796
#16 0x000000000091027a in exec_simple_query (query_string=0x30161d8
"UPDATE pgbench_accounts SET abalance = 1 WHERE aid = 2;") at
postgres.c:1215
#17 0x00000000009148ff in PostgresMain (argc=1, argv=0x3042980,
dbname=0x3042960 "test", username=0x3012d98 "zharkov") at
postgres.c:4247
#18 0x000000000086406c in BackendRun (port=0x303b030) at
postmaster.c:4399
#19 0x00000000008637cd in BackendStartup (port=0x303b030) at
postmaster.c:4090
#20 0x000000000085facf in ServerLoop () at postmaster.c:1703
#21 0x000000000085f2dd in PostmasterMain (argc=3, argv=0x3010d00) at
postmaster.c:1376
#22 0x000000000077aeba in main (argc=3, argv=0x3010d00) at main.c:228

The changes I made:
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index e03ffdd..ab3bace 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -3358,7 +3358,7 @@ GetTupleForTrigger(EState *estate,
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
-				elog(ERROR, "unexpected table_lock_tuple status: %u", test);
+				elog(PANIC, "unexpected table_lock_tuple status: %u", test);
 				break;
 			case TM_Deleted:
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index d8b48c6..d06cbe0 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -207,7 +207,7 @@ retry:
 				elog(ERROR, "attempted to lock invisible tuple");
 				break;
 			default:
-				elog(ERROR, "unexpected table_lock_tuple status: %u", res);
+				elog(PANIC, "unexpected table_lock_tuple status: %u", res);
 				break;
 		}
 	}
@@ -371,7 +371,7 @@ retry:
 				elog(ERROR, "attempted to lock invisible tuple");
 				break;
 			default:
-				elog(ERROR, "unexpected table_lock_tuple status: %u", res);
+				elog(PANIC, "unexpected table_lock_tuple status: %u", res);
 				break;
 		}
 	}
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 7674ac8..c7a261c 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -229,7 +229,7 @@ lnext:
 					ereport(ERROR,
 							(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
 							 errmsg("could not serialize access due to concurrent update")));
-				elog(ERROR, "unexpected table_lock_tuple status: %u",
+				elog(PANIC, "unexpected table_lock_tuple status: %u",
 					 test);
 				break;
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7be0e77..b15b7a0 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -780,7 +780,7 @@ ldelete:;
 							 * locking the latest version via
 							 * TUPLE_LOCK_FLAG_FIND_LAST_VERSION.
 							 */
-							elog(ERROR, "unexpected table_lock_tuple status: %u",
+							elog(PANIC, "unexpected table_lock_tuple status: %u",
 								 result);
 							return NULL;
 						}
@@ -1269,7 +1269,7 @@ lreplace:;
 						default:
 							/* see table_lock_tuple call in ExecDelete() */
-							elog(ERROR, "unexpected table_lock_tuple status: %u",
+							elog(PANIC, "unexpected table_lock_tuple status: %u",
 								 result);
 							return NULL;
 					}

regards,

Roman Zharkov

#19Zharkov Roman
r.zharkov@postgrespro.ru
In reply to: Andres Freund (#11)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

On 2019-04-03 00:11, Andres Freund wrote:

> Hi,
>
> On 2019-04-03 00:04:15 +0700, r.zharkov@postgrespro.ru wrote:
>
>> pgbench connects to the database through the connection pooler called
>> Odyssey. We use a pool of 16 or 32 backends.
>> We ran four series of pgbench tests with different connection counts
>> (100, 150, 200, ..., 1000).
>> Three series with pool size 16 ran normally,
>> but the fourth series failed with 200-300 connections. It took about
>> five hours to reproduce the error this time.
>>
>> Here is a little piece of the pgbench log:
>>
>> Bench start
>> Tue Apr 2 14:43:12 UTC 2019
>> Timestamp 1554216192
>> transaction type: multiple scripts
>> scaling factor: 1
>
> Is this the actual scaling factor? Did you create the pgbench database
> with foreign keys?
>
> Greetings,
>
> Andres Freund

Hi,
Yes, 1 is the actual scaling factor.

My script to create database:
/db/zharkov/.vanila/bin/pg_ctl -D /db/zharkov/vanila -l
/db/zharkov/vanila/log.log stop
rm -rf /db/zharkov/vanila/*
/db/zharkov/.vanila/bin/initdb -D /db/zharkov/vanila --data-checksums
#/db/zharkov/.vanila/bin/initdb -D /db/zharkov/vanila
cp postgresql.conf /db/zharkov/vanila/
ulimit -c unlimited
/db/zharkov/.vanila/bin/pg_ctl -D /db/zharkov/vanila -c -l
/db/zharkov/vanila/log.log start
/db/zharkov/.vanila/bin/psql postgres -c 'create database test'
/db/zharkov/.vanila/bin/psql test -c 'create extension pg_prewarm;'
/db/zharkov/.vanila/bin/pgbench -i test -s 1
#/db/zharkov/.vanila/bin/pgbench -i test -s 1000
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm(
'pgbench_accounts' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm(
'pgbench_branches' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm(
'pgbench_history' );"
/db/zharkov/.vanila/bin/psql test -c "select pg_prewarm(
'pgbench_tellers' );"

The dump of the server settings is in the attachment.

--
regards,

Roman Zharkov

Attachments:

settings.txt (text/plain)
#20Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Zharkov Roman (#18)
Re: BUG #15727: PANIC: cannot abort transaction 295144144, it was already committed

On 2019-Apr-03, r.zharkov@postgrespro.ru wrote:

#0 0x00007f5e0dc7b428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f5e0dc7d02a in __GI_abort () at abort.c:89
#2 0x0000000000a9b7ce in ExceptionalCondition (
conditionName=0xb3de08 "!((((tuple->t_data->t_infomask) & 0x0080) ||
(((tuple->t_data->t_infomask) & (0x1000 | ((0x0040 | 0x0010) | 0x0040 |
0x0010))) == 0x0040)))",
errorType=0xb39a69 "FailedAssertion", fileName=0xb39ae8 "heapam.c",
lineNumber=4286) at assert.c:54
#3 0x00000000004ea024 in heap_lock_tuple (relation=0x7f5e0f252d80,
tuple=0x30f4dd0, cid=0, mode=LockTupleNoKeyExclusive,
wait_policy=LockWaitBlock, follow_updates=false,
buffer=0x7ffed968cbb4, tmfd=0x7ffed968cd80) at heapam.c:4286

Uhm ... this assertion failure doesn't seem related to the other
problem you were reporting. Here, it's failing because it thinks the
tuple should be marked as only locked when the current transaction is
the only Xmax for the tuple.

	/*
	 * As a check independent from those above, we can also avoid sleeping
	 * if the current transaction is the sole locker of the tuple. Note
	 * that the strength of the lock already held is irrelevant; this is
	 * not about recording the lock in Xmax (which will be done regardless
	 * of this optimization, below). Also, note that the cases where we
	 * hold a lock stronger than we are requesting are already handled
	 * above by not doing anything.
	 *
	 * Note we only deal with the non-multixact case here; MultiXactIdWait
	 * is well equipped to deal with this situation on its own.
	 */
	if (require_sleep && !(infomask & HEAP_XMAX_IS_MULTI) &&
		TransactionIdIsCurrentTransactionId(xwait))
	{
		/* ... but if the xmax changed in the meantime, start over */
		LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
		if (xmax_infomask_changed(tuple->t_data->t_infomask, infomask) ||
			!TransactionIdEquals(HeapTupleHeaderGetRawXmax(tuple->t_data),
								 xwait))
			goto l3;
		Assert(HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_data->t_infomask)); // <--- failed assertion
		require_sleep = false;
	}

I'm not quite sure what's up with that.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
