BUG #15808: ERROR: subtransaction logged without previous top-level txn record (SQLSTATE XX000)

Started by PG Bug reporting formalmost 7 years ago4 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 15808
Logged by: Mansour Behabadi
Email address: mansour@oxplot.com
PostgreSQL version: 10.6
Operating system: Amazon RDS
Description:

We have some custom logical replication client that makes
pg_logical_slot_get_changes() calls in SQL. E.g.:

select max(1) from pg_logical_slot_get_changes('testslot', '5754/E9332B68',
1000);

Once every few thousand calls, we get the following error:

ERROR: subtransaction logged without previous top-level txn record (SQLSTATE
XX000)

which will persist on all subsequent calls, essentially forcing us to drop
the slot and create a new one.

We had little success looking for solutions online and the only lead is that
of a recent commit
(https://github.com/postgres/postgres/commit/f49a80c481f74fa81407dce8e51dea6956cb64f8)
whose commit message seem to correlate to the error we're getting. Below is
the relevant excerpt:

The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.

We're not sure if this is a fix to our problem and whether upgrading to
Postgres 11 (which has this change in it) will solve the issue.

Please let me know if any more info is needed.

#2Andres Freund
andres@anarazel.de
In reply to: PG Bug reporting form (#1)
Re: BUG #15808: ERROR: subtransaction logged without previous top-level txn record (SQLSTATE XX000)

Hi,

On 2019-05-16 04:56:15 +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 15808
Logged by: Mansour Behabadi
Email address: mansour@oxplot.com
PostgreSQL version: 10.6
Operating system: Amazon RDS
Description:

We have some custom logical replication client that makes
pg_logical_slot_get_changes() calls in SQL. E.g.:

Unrelated to the bug: You really should use the streaming
interface. It's much, much, much more efficient.

https://www.postgresql.org/docs/current/logicaldecoding-walsender.html

Once every few thousand calls, we get the following error:

ERROR: subtransaction logged without previous top-level txn record (SQLSTATE
XX000)

which will persist on all subsequent calls, essentially forcing us to drop
the slot and create a new one.

That obviously shouldn't happen.

We had little success looking for solutions online and the only lead is that
of a recent commit
(https://github.com/postgres/postgres/commit/f49a80c481f74fa81407dce8e51dea6956cb64f8)
whose commit message seem to correlate to the error we're getting. Below is
the relevant excerpt:

The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.

That seems unrelated to the error message you're getting.

We're not sure if this is a fix to our problem and whether upgrading to
Postgres 11 (which has this change in it) will solve the issue.

Note that this change isn't just in 11:

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master Release: REL_11_BR [f49a80c48] 2018-06-26 16:48:10 -0400
Branch: REL_10_STABLE Release: REL_10_5 [b767b3f2e] 2018-06-26 16:38:34 -0400
Branch: REL9_6_STABLE Release: REL9_6_10 [da10d6a8a] 2018-06-26 16:38:34 -0400
Branch: REL9_5_STABLE Release: REL9_5_14 [4cb6f7837] 2018-06-26 16:38:34 -0400
Branch: REL9_4_STABLE Release: REL9_4_19 [962313558] 2018-06-26 16:38:34 -0400

Please let me know if any more info is needed.

The easiest way to progress here would be a recipe to reproduce the
problem. As long as the problem is on RDS, we unfortunately can't really
debug this - neither can we modify the source to emit more debugging
information, nor can we inspect the WAL files ourselves (I think).

It's possible that trying to reproduce this on RDS with the debug level
set to very high (debug5) would allow for a bit more insight. But I'm
somewhat doubtful.

Greetings,

Andres Freund

#3Dave Cramer
pg@fastcrypt.com
In reply to: Andres Freund (#2)
Re: BUG #15808: ERROR: subtransaction logged without previous top-level txn record (SQLSTATE XX000)

On Thu, 16 May 2019 at 13:04, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-05-16 04:56:15 +0000, PG Bug reporting form wrote:

The following bug has been logged on the website:

Bug reference: 15808
Logged by: Mansour Behabadi
Email address: mansour@oxplot.com
PostgreSQL version: 10.6
Operating system: Amazon RDS
Description:

We have some custom logical replication client that makes
pg_logical_slot_get_changes() calls in SQL. E.g.:

Unrelated to the bug: You really should use the streaming
interface. It's much, much, much more efficient.

https://www.postgresql.org/docs/current/logicaldecoding-walsender.html

Once every few thousand calls, we get the following error:

ERROR: subtransaction logged without previous top-level txn record

(SQLSTATE

XX000)

which will persist on all subsequent calls, essentially forcing us to

drop

the slot and create a new one.

That obviously shouldn't happen.

We had little success looking for solutions online and the only lead is

that

of a recent commit
(

https://github.com/postgres/postgres/commit/f49a80c481f74fa81407dce8e51dea6956cb64f8
)

whose commit message seem to correlate to the error we're getting. Below

is

the relevant excerpt:

The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise. To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.

That seems unrelated to the error message you're getting.

We're not sure if this is a fix to our problem and whether upgrading to
Postgres 11 (which has this change in it) will solve the issue.

Note that this change isn't just in 11:

Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Branch: master Release: REL_11_BR [f49a80c48] 2018-06-26 16:48:10 -0400
Branch: REL_10_STABLE Release: REL_10_5 [b767b3f2e] 2018-06-26 16:38:34
-0400
Branch: REL9_6_STABLE Release: REL9_6_10 [da10d6a8a] 2018-06-26 16:38:34
-0400
Branch: REL9_5_STABLE Release: REL9_5_14 [4cb6f7837] 2018-06-26 16:38:34
-0400
Branch: REL9_4_STABLE Release: REL9_4_19 [962313558] 2018-06-26 16:38:34
-0400

Please let me know if any more info is needed.

The easiest way to progress here would be a recipe to reproduce the
problem. As long as the problem is on RDS, we unfortunately can't really
debug this - neither can we modify the source to emit more debugging
information, nor can we inspect the WAL files ourselves (I think).

It's possible that trying to reproduce this on RDS with the debug level
set to very high (debug5) would allow for a bit more insight. But I'm
somewhat doubtful.

Andres,

It's possible that I have someone that would be able to run this in a
non-RDS environment.

It's unlikely we have a reproducible test case, but it's likely we can
modify the code on their boxes for debugging and or get WAL files for
inspection.

This is in a version of 9.6.14 so the above fix should be in it.

I'm willing to facilitate if you can provide some direction.

Dave

#4Hsu, John
hsuchen@amazon.com
In reply to: Dave Cramer (#3)
Re: BUG #15808: ERROR: subtransaction logged without previous top-level txn record (SQLSTATE XX000)

On 9/6/19, 1:40 PM, "Dave Cramer" <davecramer@gmail.com> wrote:

It's possible that I have someone that would be able to run this in a non-RDS environment.

It's unlikely we have a reproducible test case, but it's likely we can modify the code on their boxes for debugging and or get WAL files for inspection.

This is in a version of 9.6.14 so the above fix should be in it.

I'm willing to facilitate if you can provide some direction.

Hey Dave,

I have a repro for this issue here:
/messages/by-id/AB5978B2-1772-4FEE-A245-74C91704ECB0@amazon.com

I can still reproduce it on REL_11_STABLE as of the latest commit (7babff18).

postgres=# select version();
version
--------------------------------------------------------------------------------------------------------
PostgreSQL 11.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit
(1 row)

postgres=# \q

[ec2-user.... postgres]$ pg_recvlogical -d postgres --start --slot test_slot_1 -f /dev/null
pg_recvlogical: unexpected termination of replication stream: ERROR: subtransaction logged without previous top-level txn record
pg_recvlogical: disconnected; waiting 5 seconds to try again

Thanks,
John H