Decoding speculative insert with toast leaks memory

Started by Ashutosh Bapatalmost 5 years ago66 messages
#1Ashutosh Bapat
ashutosh.bapat.oss@gmail.com
1 attachment(s)

Hi All,
We saw OOM in a system where WAL sender consumed Gigabttes of memory
which was never released. Upon investigation, we found out that there
were many ReorderBufferToastHash memory contexts linked to
ReorderBuffer context, together consuming gigs of memory. They were
running INSERT ... ON CONFLICT .. among other things. A similar report
at [1]https://www.postgresql-archive.org/Diagnose-memory-leak-in-logical-replication-td6161318.html

I could reproduce a memory leak in wal sender using following steps
Session 1
postgres=# create table t_toast (a int primary key, b text);
postgres=# CREATE PUBLICATION dbz_minimal_publication FOR TABLE public.t_toast;

Terminal 4
$ pg_recvlogical -d postgres --slot pgoutput_minimal_test_slot
--create-slot -P pgoutput
$ pg_recvlogical -d postgres --slot pgoutput_minimal_test_slot --start
-o proto_version=1 -o publication_names='dbz_minimal_publication' -f
/dev/null

Session 1
postgres=# select * from pg_replication_slots ;
slot_name | plugin | slot_type | datoid | database
| temporary | active | active_pid | xmin | catalog_xmin | restart_lsn
| confirmed_flush_lsn
----------------------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
pgoutput_minimal_test_slot | pgoutput | logical | 12402 | postgres
| f | f | | | 570 | 0/15FFFD0
| 0/1600020

postgres=# begin;
postgres=# insert into t_toast values (500, repeat('something' ||
txid_current()::text, 100000)) ON CONFLICT (a) DO NOTHING;
INSERT 0 1

Session 2 (xid = 571)
postgres=# begin;
postgres=# insert into t_toast values (500, repeat('something' ||
txid_current()::text, 100000)) ON CONFLICT (a) DO NOTHING;

Session 3 (xid = 572)
postgres=# begin;
postgres=# insert into t_toast values (500, repeat('something' ||
txid_current()::text, 100000)) ON CONFLICT (a) DO NOTHING;

Session 1 (this session doesn't modify the table but is essential for
speculative insert to happen.)
postgres=# rollback;

Session 2 and 3 in the order in which you get control back (in my case
session 2 with xid = 571)
INSERT 0 1
postgres=# commit;
COMMIT

other session (in my case session 3 with xid = 572)
INSERT 0 0
postgres=# commit;
COMMIT

With the attached patch, we see following in the server logs
2021-03-25 09:57:20.469 IST [12424] LOG: starting logical decoding
for slot "pgoutput_minimal_test_slot"
2021-03-25 09:57:20.469 IST [12424] DETAIL: Streaming transactions
committing after 0/1600020, reading WAL from 0/15FFFD0.
2021-03-25 09:57:20.469 IST [12424] LOG: logical decoding found
consistent point at 0/15FFFD0
2021-03-25 09:57:20.469 IST [12424] DETAIL: There are no running transactions.
2021-03-25 09:59:45.494 IST [12424] LOG: initializing hash table for
transaction 571
2021-03-25 09:59:45.494 IST [12424] LOG: speculative insert
encountered in transaction 571
2021-03-25 09:59:45.494 IST [12424] LOG: speculative insert confirmed
in transaction 571
2021-03-25 09:59:45.504 IST [12424] LOG: destroying toast hash for
transaction 571
2021-03-25 09:59:50.806 IST [12424] LOG: initializing hash table for
transaction 572
2021-03-25 09:59:50.806 IST [12424] LOG: speculative insert
encountered in transaction 572
2021-03-25 09:59:50.806 IST [12424] LOG: toast hash for transaction
572 is not cleared

Observe that the toast_hash was cleaned for the transaction 571 which
successfully inserted the row but was not cleaned for the transaction
572 which performed DO NOTHING instead of INSERT.

Here's the sequence of events which leads to memory leak in wal sender
1. Transaction 571 performs a speculative INSERT which is decoded as
toast insert followed by speculative insert of row
2. decoding toast tuple, causes the toast hash to be created
3. Speculative insert is ignored while decoding
4. Speculative insert is confimed and decoded as a normal INSERT, also
destroying the toast hash
5. Transaction 572 performs speculative insert which is decoded as
toast insert followed by speculative insert of row
6. decoding toast tuple causes the toast hash to be created
7. speculative insert is ignored while decoding
... Speculative INSERT is never confirmed and thus toast hash is never
destroyed leaking memory

In memory context dump we see as many ReorderBufferToastHash as the
number of times the above sequence is repeated.
TopMemoryContext: 1279640 total in 7 blocks; 23304 free (17 chunks);
1256336 used
...
Replication command context: 32768 total in 3 blocks; 10952 free
(9 chunks); 21816 used
...
ReorderBuffer: 8192 total in 1 blocks; 7656 free (7 chunks); 536 used
ReorderBufferToastHash: 8192 total in 1 blocks; 2056 free (0
chunks); 6136 used
ReorderBufferToastHash: 8192 total in 1 blocks; 2056 free (0
chunks); 6136 used
ReorderBufferToastHash: 8192 total in 1 blocks; 2056 free (0
chunks); 6136 used

The relevant code is all in ReoderBufferCommit() in cases
REORDER_BUFFER_CHANGE_INSERT,
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM.

About the solution: The speculative insert needs to be ignored since
it can be rolled back later. If speculative insert is not confirmed,
there is no way to know that the speculative insert change required a
toast_hash table and destroy it before the next change starts.
ReorderBufferCommit seems to notice a speculative insert that was
never confirmed in the following code
1624 change_done:
1625
1626 /*
1627 * Either speculative insertion was
confirmed, or it was
1628 * unsuccessful and the record isn't needed anymore.
1629 */
1630 if (specinsert != NULL)
1631 {
1632 ReorderBufferReturnChange(rb, specinsert);
1633 specinsert = NULL;
1634 }
1635
1636 if (relation != NULL)
1637 {
1638 RelationClose(relation);
1639 relation = NULL;
1640 }
1641 break;

but by then we might have reused the toast_hash and thus can not be
destroyed. But that isn't the problem since the reused toast_hash will
be destroyed eventually.

It's only when the change next to speculative insert is something
other than INSERT/UPDATE/DELETE that we have to worry about a
speculative insert that was never confirmed. So may be for those
cases, we check whether specinsert != null and destroy toast_hash if
it exists.

[1]: https://www.postgresql-archive.org/Diagnose-memory-leak-in-logical-replication-td6161318.html

--
Best Wishes,
Ashutosh Bapat

Attachments:

rb_toas_hash_mem_leak.patchtext/x-patch; charset=US-ASCII; name=rb_toas_hash_mem_leak.patchDownload
diff --git a/src/backend/access/rmgrdesc/heapdesc.c b/src/backend/access/rmgrdesc/heapdesc.c
index 318a281d7f..2fd01b53f5 100644
--- a/src/backend/access/rmgrdesc/heapdesc.c
+++ b/src/backend/access/rmgrdesc/heapdesc.c
@@ -42,6 +42,8 @@ heap_desc(StringInfo buf, XLogReaderState *record)
 	{
 		xl_heap_insert *xlrec = (xl_heap_insert *) rec;
 
+		if (xlrec->flags & XLH_INSERT_IS_SPECULATIVE)
+			appendStringInfo(buf, "speculative, ");
 		appendStringInfo(buf, "off %u", xlrec->offnum);
 	}
 	else if (info == XLOG_HEAP_DELETE)
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1a4b87c419..cbc9ffbebb 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -351,6 +351,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Toast hash should be deallocated by now */
+	if (txn->toast_hash)
+		elog(LOG, "toast hash for transaction %d is not cleared", txn->xid);
 	pfree(txn);
 }
 
@@ -1517,6 +1520,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			{
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
 
+					elog(LOG, "speculative insert confirmed in transaction %d", xid);
 					/*
 					 * Confirmation for speculative insertion arrived. Simply
 					 * use as a normal record. It'll be cleaned up at the end
@@ -1638,6 +1642,7 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 
 				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT:
 
+					elog(LOG, "speculative insert encountered in transaction %d", xid);
 					/*
 					 * Speculative insertions are dealt with by delaying the
 					 * processing of the insert until the confirmation record
@@ -2924,6 +2929,7 @@ ReorderBufferToastInitHash(ReorderBuffer *rb, ReorderBufferTXN *txn)
 	hash_ctl.hcxt = rb->context;
 	txn->toast_hash = hash_create("ReorderBufferToastHash", 5, &hash_ctl,
 								  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	elog(LOG, "initializing hash table for transaction %d", txn->xid);
 }
 
 /*
@@ -3206,6 +3212,7 @@ ReorderBufferToastReset(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		}
 	}
 
+	elog(LOG, "destroying toast hash for transaction %d", txn->xid);
 	hash_destroy(txn->toast_hash);
 	txn->toast_hash = NULL;
 }
diff --git a/src/test/subscription/t/011_toast_subxacts.pl b/src/test/subscription/t/011_toast_subxacts.pl
new file mode 100644
index 0000000000..776f5b4794
--- /dev/null
+++ b/src/test/subscription/t/011_toast_subxacts.pl
@@ -0,0 +1,253 @@
+# Basic logical replication test
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 14;
+
+# Initialize publisher node
+my $node_publisher = get_new_node('publisher');
+$node_publisher->init(allows_streaming => 'logical');
+$node_publisher->start;
+
+# Create subscriber node
+my $node_subscriber = get_new_node('subscriber');
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->start;
+
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_mixed (a int primary key, b text, c numeric)");
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_mixed (a, b, c) VALUES (1, 'foo', 1.1)");
+
+# Setup structure on subscriber
+# different column count and order than on publisher
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_mixed (d text default 'local', c numeric, b text, a int primary key)"
+);
+
+# Setup logical replication
+my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
+$node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
+$node_publisher->safe_psql('postgres',
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_mixed"
+);
+
+my $appname = 'tap_sub';
+$node_subscriber->safe_psql('postgres',
+	"CREATE SUBSCRIPTION tap_sub CONNECTION '$publisher_connstr application_name=$appname' PUBLICATION tap_pub"
+);
+
+$node_publisher->wait_for_catchup($appname);
+
+# Also wait for initial table sync to finish
+my $synced_query =
+  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');";
+$node_subscriber->poll_query_until('postgres', $synced_query)
+  or die "Timed out while waiting for subscriber to synchronize data";
+
+
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_mixed VALUES (2, 'bar', 2.2)");
+
+$node_publisher->wait_for_catchup($appname);
+
+my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM tab_mixed");
+is( $result, qq(local|1.1|foo|1
+local|2.2|bar|2), 'check replicated changes with different column order');
+
+$node_publisher->safe_psql('postgres',
+	"UPDATE tab_mixed SET b = 'baz' WHERE a = 1");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT * FROM tab_mixed ORDER BY a");
+is( $result, qq(local|1.1|baz|1
+local|2.2|bar|2),
+	'update works with different column order and subscriber local values');
+
+# check behavior with toasted values
+
+$node_publisher->safe_psql('postgres',
+	"UPDATE tab_mixed SET b = repeat('xyzzy', 100000) WHERE a = 2");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT a, length(b), c, d FROM tab_mixed ORDER BY a");
+is( $result, qq(1|3|1.1|local
+2|500000|2.2|local),
+	'update transmits large column value');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT a, length(b), c, d FROM tab_mixed ORDER BY a");
+is( $result, qq(1|3|1.1|local
+2|500000|2.2|local),
+	'update transmits large column value');
+
+# It seems we do not free the hash tables used for storing toast values in
+# subtransactions leaking memory when they abort. So try adding a lot of toast
+# values
+$node_publisher->safe_psql('postgres',
+	"BEGIN;" .
+	"INSERT INTO tab_mixed VALUES (100, repeat('before_savept1' || txid_current()::text, 100000), 2.2);" .
+	"SAVEPOINT subxact1;" .
+	"INSERT INTO tab_mixed VALUES (101, repeat('after_savept1'|| txid_current()::text, 100000), 2.2);" .
+	"RELEASE SAVEPOINT subxact1;" .
+	"COMMIT");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_mixed");
+is( $result, qq(4),
+	'toast subtxn test release savepoint, commit, replicated');
+
+$node_publisher->safe_psql('postgres',
+	"BEGIN;" .
+	"INSERT INTO tab_mixed VALUES (200, repeat('before_savept1' || txid_current()::text, 100000), 2.2);" .
+	"SAVEPOINT subxact1;" .
+	"INSERT INTO tab_mixed VALUES (201, repeat('after_savept1'|| txid_current()::text, 100000), 2.2);" .
+	"RELEASE SAVEPOINT subxact1;" .
+	"ROLLBACK");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_mixed");
+is( $result, qq(4),
+	'toast subtxn test release savepoint, rollback');
+
+$node_publisher->safe_psql('postgres',
+	"BEGIN;" .
+	"INSERT INTO tab_mixed VALUES (300, repeat('before_savept1' || txid_current()::text, 100000), 2.2);" .
+	"SAVEPOINT subxact1;" .
+	"INSERT INTO tab_mixed VALUES (301, repeat('after_savept1'|| txid_current()::text, 100000), 2.2);" .
+	"ROLLBACK TO SAVEPOINT subxact1;" .
+	"COMMIT");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_mixed");
+is( $result, qq(5),
+	'toast subtxn test rollback savepoint, commit, replicated');
+
+$node_publisher->safe_psql('postgres',
+	"BEGIN;" .
+	"INSERT INTO tab_mixed VALUES (400, repeat('before_savept1' || txid_current()::text, 100000), 2.2);" .
+	"SAVEPOINT subxact1;" .
+	"INSERT INTO tab_mixed VALUES (401, repeat('after_savept1'|| txid_current()::text, 100000), 2.2);" .
+	"ROLLBACK TO SAVEPOINT subxact1;" .
+	"ROLLBACK");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_mixed");
+is( $result, qq(5),
+	'toast subtxn test rollback savepoint, commit, replicated');
+
+# The customer's load has INSERT ON CONFLICT DO UPDATE clause, so try that.
+$node_publisher->safe_psql('postgres',
+	"BEGIN;" .
+	"INSERT INTO tab_mixed VALUES (100, repeat('ins_upd_before_savept1' || txid_current()::text, 100000), 2.2) ON CONFLICT(a) DO UPDATE SET b = excluded.b;" .
+	"SAVEPOINT subxact1;" .
+	"INSERT INTO tab_mixed VALUES (101, repeat('ins_upd_after_savept1'|| txid_current()::text, 100000), 2.2) ON CONFLICT(a) DO UPDATE SET b = excluded.b;" .
+	"RELEASE SAVEPOINT subxact1;" .
+	"COMMIT");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_mixed");
+is( $result, qq(4),
+	'toast subtxn test release savepoint, commit, replicated');
+
+$node_publisher->safe_psql('postgres',
+	"BEGIN;" .
+	"INSERT INTO tab_mixed VALUES (200, repeat('ins_upd_before_savept1' || txid_current()::text, 100000), 2.2) ON CONFLICT(a) DO UPDATE SET b = excluded.b;" .
+	"SAVEPOINT subxact1;" .
+	"INSERT INTO tab_mixed VALUES (201, repeat('ins_upd_after_savept1'|| txid_current()::text, 100000), 2.2) ON CONFLICT(a) DO UPDATE SET b = excluded.b;" .
+	"RELEASE SAVEPOINT subxact1;" .
+	"ROLLBACK");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_mixed");
+is( $result, qq(4),
+	'toast subtxn test release savepoint, rollback');
+
+$node_publisher->safe_psql('postgres',
+	"BEGIN;" .
+	"INSERT INTO tab_mixed VALUES (300, repeat('ins_upd_before_savept1' || txid_current()::text, 100000), 2.2) ON CONFLICT(a) DO UPDATE SET b = excluded.b;" .
+	"SAVEPOINT subxact1;" .
+	"INSERT INTO tab_mixed VALUES (301, repeat('ins_upd_after_savept1'|| txid_current()::text, 100000), 2.2) ON CONFLICT(a) DO UPDATE SET b = excluded.b;" .
+	"ROLLBACK TO SAVEPOINT subxact1;" .
+	"COMMIT");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_mixed");
+is( $result, qq(5),
+	'toast subtxn test rollback savepoint, commit, replicated');
+
+$node_publisher->safe_psql('postgres',
+	"BEGIN;" .
+	"INSERT INTO tab_mixed VALUES (400, repeat('ins_upd_before_savept1' || txid_current()::text, 100000), 2.2) ON CONFLICT(a) DO UPDATE SET b = excluded.b;" .
+	"SAVEPOINT subxact1;" .
+	"INSERT INTO tab_mixed VALUES (401, repeat('ins_upd_after_savept1'|| txid_current()::text, 100000), 2.2) ON CONFLICT(a) DO UPDATE SET b = excluded.b;" .
+	"ROLLBACK TO SAVEPOINT subxact1;" .
+	"ROLLBACK");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM tab_mixed");
+is( $result, qq(5),
+	'toast subtxn test rollback savepoint, commit, replicated');
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM tab_mixed WHERE a >= 100");
+
+$node_publisher->wait_for_catchup('tap_sub');
+
+$node_publisher->safe_psql('postgres',
+	"UPDATE tab_mixed SET c = 3.3 WHERE a = 2");
+
+$node_publisher->wait_for_catchup($appname);
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT a, length(b), c, d FROM tab_mixed ORDER BY a");
+is( $result, qq(1|3|1.1|local
+2|500000|3.3|local),
+	'update with non-transmitted large column value');
+
+# check all the cleanup
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription");
+is($result, qq(0), 'check subscription was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_subscription_rel");
+is($result, qq(0),
+	'check subscription relation status was dropped on subscriber');
+
+$result = $node_publisher->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_slots");
+is($result, qq(0), 'check replication slot was dropped on publisher');
+
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*) FROM pg_replication_origin");
+is($result, qq(0), 'check replication origin was dropped on subscriber');
+
+$node_subscriber->stop('fast');
+$node_publisher->stop('fast');
In reply to: Ashutosh Bapat (#1)
Re: Decoding speculative insert with toast leaks memory

On Wed, Mar 24, 2021 at 10:34 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Hi All,
We saw OOM in a system where WAL sender consumed Gigabttes of memory
which was never released. Upon investigation, we found out that there
were many ReorderBufferToastHash memory contexts linked to
ReorderBuffer context, together consuming gigs of memory. They were
running INSERT ... ON CONFLICT .. among other things. A similar report
at [1]

What is the relationship between this bug and commit 7259736a6e5,
dealt specifically with TOAST and speculative insertion resource
management issues within reorderbuffer.c? Amit?

--
Peter Geoghegan

#3Amit Kapila
amit.kapila16@gmail.com
In reply to: Ashutosh Bapat (#1)
Re: Decoding speculative insert with toast leaks memory

On Thu, Mar 25, 2021 at 11:04 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Hi All,
We saw OOM in a system where WAL sender consumed Gigabttes of memory
which was never released. Upon investigation, we found out that there
were many ReorderBufferToastHash memory contexts linked to
ReorderBuffer context, together consuming gigs of memory. They were
running INSERT ... ON CONFLICT .. among other things. A similar report
at [1]

..

but by then we might have reused the toast_hash and thus can not be
destroyed. But that isn't the problem since the reused toast_hash will
be destroyed eventually.

It's only when the change next to speculative insert is something
other than INSERT/UPDATE/DELETE that we have to worry about a
speculative insert that was never confirmed. So may be for those
cases, we check whether specinsert != null and destroy toast_hash if
it exists.

Can we consider the possibility to destroy the toast_hash in
ReorderBufferCleanupTXN/ReorderBufferTruncateTXN? It will delay the
clean up of memory till the end of stream or txn but there won't be
any memory leak.

--
With Regards,
Amit Kapila.

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Decoding speculative insert with toast leaks memory

On Thu, May 27, 2021 at 8:27 AM Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Mar 24, 2021 at 10:34 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Hi All,
We saw OOM in a system where WAL sender consumed Gigabttes of memory
which was never released. Upon investigation, we found out that there
were many ReorderBufferToastHash memory contexts linked to
ReorderBuffer context, together consuming gigs of memory. They were
running INSERT ... ON CONFLICT .. among other things. A similar report
at [1]

What is the relationship between this bug and commit 7259736a6e5,
dealt specifically with TOAST and speculative insertion resource
management issues within reorderbuffer.c? Amit?

This seems to be a pre-existing bug. This should be reproduced in
PG-13 and or prior to that commit. Ashutosh can confirm?

--
With Regards,
Amit Kapila.

#5Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#3)
Re: Decoding speculative insert with toast leaks memory

On Thu, May 27, 2021 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 25, 2021 at 11:04 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Hi All,
We saw OOM in a system where WAL sender consumed Gigabttes of memory
which was never released. Upon investigation, we found out that there
were many ReorderBufferToastHash memory contexts linked to
ReorderBuffer context, together consuming gigs of memory. They were
running INSERT ... ON CONFLICT .. among other things. A similar report
at [1]

..

but by then we might have reused the toast_hash and thus can not be
destroyed. But that isn't the problem since the reused toast_hash will
be destroyed eventually.

It's only when the change next to speculative insert is something
other than INSERT/UPDATE/DELETE that we have to worry about a
speculative insert that was never confirmed. So may be for those
cases, we check whether specinsert != null and destroy toast_hash if
it exists.

Can we consider the possibility to destroy the toast_hash in
ReorderBufferCleanupTXN/ReorderBufferTruncateTXN? It will delay the
clean up of memory till the end of stream or txn but there won't be
any memory leak.

The other possibility could be to clean it up when we clean the spec
insert change in the below code:
/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
* record.
*/
if (specinsert)
{
ReorderBufferReturnChange(rb, specinsert, true);
specinsert = NULL;
}

But I guess we might miss cleaning it up in case of an error. A
similar problem could be there in the idea where we will try to tie
the clean up with the next change.

--
With Regards,
Amit Kapila.

#6Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#3)
Re: Decoding speculative insert with toast leaks memory

On Thu, May 27, 2021 at 9:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 25, 2021 at 11:04 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Hi All,
We saw OOM in a system where WAL sender consumed Gigabttes of memory
which was never released. Upon investigation, we found out that there
were many ReorderBufferToastHash memory contexts linked to
ReorderBuffer context, together consuming gigs of memory. They were
running INSERT ... ON CONFLICT .. among other things. A similar report
at [1]

..

but by then we might have reused the toast_hash and thus can not be
destroyed. But that isn't the problem since the reused toast_hash will
be destroyed eventually.

It's only when the change next to speculative insert is something
other than INSERT/UPDATE/DELETE that we have to worry about a
speculative insert that was never confirmed. So may be for those
cases, we check whether specinsert != null and destroy toast_hash if
it exists.

Can we consider the possibility to destroy the toast_hash in
ReorderBufferCleanupTXN/ReorderBufferTruncateTXN? It will delay the
clean up of memory till the end of stream or txn but there won't be
any memory leak.

Currently, we are ignoring XLH_DELETE_IS_SUPER, so maybe we can do
something based on this flag?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#7Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#5)
Re: Decoding speculative insert with toast leaks memory

On Thu, May 27, 2021 at 9:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 27, 2021 at 9:02 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Mar 25, 2021 at 11:04 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

Hi All,
We saw OOM in a system where WAL sender consumed Gigabttes of memory
which was never released. Upon investigation, we found out that there
were many ReorderBufferToastHash memory contexts linked to
ReorderBuffer context, together consuming gigs of memory. They were
running INSERT ... ON CONFLICT .. among other things. A similar report
at [1]

..

but by then we might have reused the toast_hash and thus can not be
destroyed. But that isn't the problem since the reused toast_hash will
be destroyed eventually.

It's only when the change next to speculative insert is something
other than INSERT/UPDATE/DELETE that we have to worry about a
speculative insert that was never confirmed. So may be for those
cases, we check whether specinsert != null and destroy toast_hash if
it exists.

Can we consider the possibility to destroy the toast_hash in
ReorderBufferCleanupTXN/ReorderBufferTruncateTXN? It will delay the
clean up of memory till the end of stream or txn but there won't be
any memory leak.

The other possibility could be to clean it up when we clean the spec
insert change in the below code:

Yeah that could be done.

/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
* record.
*/
if (specinsert)
{
ReorderBufferReturnChange(rb, specinsert, true);
specinsert = NULL;
}

But I guess we might miss cleaning it up in case of an error. A
similar problem could be there in the idea where we will try to tie
the clean up with the next change.

In error case also we can handle it in the CATCH block no?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#7)
Re: Decoding speculative insert with toast leaks memory

On Thu, May 27, 2021 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, May 27, 2021 at 9:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Can we consider the possibility to destroy the toast_hash in
ReorderBufferCleanupTXN/ReorderBufferTruncateTXN? It will delay the
clean up of memory till the end of stream or txn but there won't be
any memory leak.

The other possibility could be to clean it up when we clean the spec
insert change in the below code:

Yeah that could be done.

/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
* record.
*/
if (specinsert)
{
ReorderBufferReturnChange(rb, specinsert, true);
specinsert = NULL;
}

But I guess we might miss cleaning it up in case of an error. A
similar problem could be there in the idea where we will try to tie
the clean up with the next change.

In error case also we can handle it in the CATCH block no?

True, but if you do this clean-up in ReorderBufferCleanupTXN then you
don't need to take care at separate places. Also, toast_hash is stored
in txn so it appears natural to clean it up in while releasing TXN.

--
With Regards,
Amit Kapila.

#9Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#8)
1 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Thu, May 27, 2021 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 27, 2021 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

True, but if you do this clean-up in ReorderBufferCleanupTXN then you
don't need to take care at separate places. Also, toast_hash is stored
in txn so it appears natural to clean it up in while releasing TXN.

Make sense, basically, IMHO we will have to do in TruncateTXN and
ReturnTXN as attached?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Cleanup-toast-hash.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Cleanup-toast-hash.patchDownload
From ff049e1ab141ba6bc5c05d62e7a225abfb18fad8 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 27 May 2021 10:03:27 +0530
Subject: [PATCH v1] Cleanup toast hash

---
 src/backend/replication/logical/reorderbuffer.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 4401608..39242f2 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -437,6 +437,10 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* cleanup the toast hash if it is not done already. */
+	if (txn->toast_hash != NULL)
+		ReorderBufferToastReset(rb, txn);	
+
 	if (txn->invalidations)
 	{
 		pfree(txn->invalidations);
@@ -1629,6 +1633,10 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* cleanup the toast hash if it is not done already. */
+	if (txn->toast_hash != NULL)
+		ReorderBufferToastReset(rb, txn);
+
 	/* If this txn is serialized then clean the disk space. */
 	if (rbtxn_is_serialized(txn))
 	{
-- 
1.8.3.1

#10Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Dilip Kumar (#9)
Re: Decoding speculative insert with toast leaks memory

On 5/27/21 6:36 AM, Dilip Kumar wrote:

On Thu, May 27, 2021 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 27, 2021 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

True, but if you do this clean-up in ReorderBufferCleanupTXN then you
don't need to take care at separate places. Also, toast_hash is stored
in txn so it appears natural to clean it up in while releasing TXN.

Make sense, basically, IMHO we will have to do in TruncateTXN and
ReturnTXN as attached?

Yeah, I've been working on a fix over the last couple days (because of a
customer issue), and I ended up with the reset in ReorderBufferReturnTXN
too - it does solve the issue in the case I've been investigating.

I'm not sure the reset in ReorderBufferTruncateTXN is correct, though.
Isn't it possible that we'll need the TOAST data after streaming part of
the transaction? After all, we're not resetting invalidations, tuplecids
and snapshot either ... And we'll eventually clean it after the streamed
transaction gets commited (ReorderBufferStreamCommit ends up calling
ReorderBufferReturnTXN too).

I wonder if there's a way to free the TOASTed data earlier, instead of
waiting until the end of the transaction (as this patch does). But I
suspect it'd be way more complex, harder to backpatch, and destroying
the hash table is a good idea anyway.

Also, I think the "if (txn->toast_hash != NULL)" checks are not needed,
because it's the first thing ReorderBufferToastReset does.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Dilip Kumar
dilipbalaut@gmail.com
In reply to: Tomas Vondra (#10)
Re: Decoding speculative insert with toast leaks memory

On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/27/21 6:36 AM, Dilip Kumar wrote:

On Thu, May 27, 2021 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 27, 2021 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

True, but if you do this clean-up in ReorderBufferCleanupTXN then you
don't need to take care at separate places. Also, toast_hash is stored
in txn so it appears natural to clean it up in while releasing TXN.

Make sense, basically, IMHO we will have to do in TruncateTXN and
ReturnTXN as attached?

Yeah, I've been working on a fix over the last couple days (because of a
customer issue), and I ended up with the reset in ReorderBufferReturnTXN
too - it does solve the issue in the case I've been investigating.

I'm not sure the reset in ReorderBufferTruncateTXN is correct, though.
Isn't it possible that we'll need the TOAST data after streaming part of
the transaction? After all, we're not resetting invalidations, tuplecids
and snapshot either

Actually, as per the current design, we don't need the toast data
after the streaming. Because currently, we don't allow to stream the
transaction if we need to keep the toast across stream e.g. if we only
have toast insert without the main insert we assure this as partial
changes, another case is if we have multi-insert with toast then we
keep the transaction as mark partial until we get the last insert of
the multi-insert. So with the current design we don't have any case
where we need to keep toast data across streams.

... And we'll eventually clean it after the streamed

transaction gets commited (ReorderBufferStreamCommit ends up calling
ReorderBufferReturnTXN too).

Right, but generally after streaming we assert that txn->size should
be 0. That could be changed if we change the above design but this is
what it is today.

I wonder if there's a way to free the TOASTed data earlier, instead of
waiting until the end of the transaction (as this patch does). But I
suspect it'd be way more complex, harder to backpatch, and destroying
the hash table is a good idea anyway.

Right.

Also, I think the "if (txn->toast_hash != NULL)" checks are not needed,
because it's the first thing ReorderBufferToastReset does.

I see, I will change this. If we all agree with this idea.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#12Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Dilip Kumar (#11)
Re: Decoding speculative insert with toast leaks memory

On 5/28/21 2:17 PM, Dilip Kumar wrote:

On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/27/21 6:36 AM, Dilip Kumar wrote:

On Thu, May 27, 2021 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 27, 2021 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

True, but if you do this clean-up in ReorderBufferCleanupTXN then you
don't need to take care at separate places. Also, toast_hash is stored
in txn so it appears natural to clean it up in while releasing TXN.

Make sense, basically, IMHO we will have to do in TruncateTXN and
ReturnTXN as attached?

Yeah, I've been working on a fix over the last couple days (because of a
customer issue), and I ended up with the reset in ReorderBufferReturnTXN
too - it does solve the issue in the case I've been investigating.

I'm not sure the reset in ReorderBufferTruncateTXN is correct, though.
Isn't it possible that we'll need the TOAST data after streaming part of
the transaction? After all, we're not resetting invalidations, tuplecids
and snapshot either

Actually, as per the current design, we don't need the toast data
after the streaming. Because currently, we don't allow to stream the
transaction if we need to keep the toast across stream e.g. if we only
have toast insert without the main insert we assure this as partial
changes, another case is if we have multi-insert with toast then we
keep the transaction as mark partial until we get the last insert of
the multi-insert. So with the current design we don't have any case
where we need to keep toast data across streams.

... And we'll eventually clean it after the streamed

transaction gets commited (ReorderBufferStreamCommit ends up calling
ReorderBufferReturnTXN too).

Right, but generally after streaming we assert that txn->size should
be 0. That could be changed if we change the above design but this is
what it is today.

Can we add some assert to enforce this?

I wonder if there's a way to free the TOASTed data earlier, instead of
waiting until the end of the transaction (as this patch does). But I
suspect it'd be way more complex, harder to backpatch, and destroying
the hash table is a good idea anyway.

Right.

Also, I think the "if (txn->toast_hash != NULL)" checks are not needed,
because it's the first thing ReorderBufferToastReset does.

I see, I will change this. If we all agree with this idea.

+1 from me. I think it's good enough to do the cleanup at the end, and
it's an improvement compared to current state. There might be cases of
transactions doing many such speculative inserts and accumulating a lot
of data in the TOAST hash, but I find it very unlikely.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#12)
Re: Decoding speculative insert with toast leaks memory

On Fri, May 28, 2021 at 6:01 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/28/21 2:17 PM, Dilip Kumar wrote:

On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/27/21 6:36 AM, Dilip Kumar wrote:

On Thu, May 27, 2021 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, May 27, 2021 at 9:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

True, but if you do this clean-up in ReorderBufferCleanupTXN then you
don't need to take care at separate places. Also, toast_hash is stored
in txn so it appears natural to clean it up in while releasing TXN.

Make sense, basically, IMHO we will have to do in TruncateTXN and
ReturnTXN as attached?

Yeah, I've been working on a fix over the last couple days (because of a
customer issue), and I ended up with the reset in ReorderBufferReturnTXN
too - it does solve the issue in the case I've been investigating.

I'm not sure the reset in ReorderBufferTruncateTXN is correct, though.
Isn't it possible that we'll need the TOAST data after streaming part of
the transaction? After all, we're not resetting invalidations, tuplecids
and snapshot either

Actually, as per the current design, we don't need the toast data
after the streaming. Because currently, we don't allow to stream the
transaction if we need to keep the toast across stream e.g. if we only
have toast insert without the main insert we assure this as partial
changes, another case is if we have multi-insert with toast then we
keep the transaction as mark partial until we get the last insert of
the multi-insert. So with the current design we don't have any case
where we need to keep toast data across streams.

... And we'll eventually clean it after the streamed

transaction gets commited (ReorderBufferStreamCommit ends up calling
ReorderBufferReturnTXN too).

Right, but generally after streaming we assert that txn->size should
be 0. That could be changed if we change the above design but this is
what it is today.

Can we add some assert to enforce this?

There is already an Assert for this. See ReorderBufferCheckMemoryLimit.

--
With Regards,
Amit Kapila.

#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#10)
Re: Decoding speculative insert with toast leaks memory

On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I wonder if there's a way to free the TOASTed data earlier, instead of
waiting until the end of the transaction (as this patch does).

IIUC we are anyway freeing the toasted data at the next
insert/update/delete. We can try to free at other change message types
like REORDER_BUFFER_CHANGE_MESSAGE but as you said that may make the
patch more complex, so it seems better to do the fix on the lines of
what is proposed in the patch.

--
With Regards,
Amit Kapila.

#15Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#14)
Re: Decoding speculative insert with toast leaks memory

On 5/29/21 6:29 AM, Amit Kapila wrote:

On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I wonder if there's a way to free the TOASTed data earlier, instead of
waiting until the end of the transaction (as this patch does).

IIUC we are anyway freeing the toasted data at the next
insert/update/delete. We can try to free at other change message types
like REORDER_BUFFER_CHANGE_MESSAGE but as you said that may make the
patch more complex, so it seems better to do the fix on the lines of
what is proposed in the patch.

+1

Even if we started doing what you mention (freeing the hash for other
change types), we'd still need to do what the patch proposes because the
speculative insert may be the last change in the transaction. For the
other cases it works as a mitigation, so that we don't leak the memory
forever.

So let's get this committed, perhaps with a comment explaining that it
might be possible to reset earlier if needed.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Tomas Vondra (#15)
Re: Decoding speculative insert with toast leaks memory

On Sat, May 29, 2021 at 5:45 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/29/21 6:29 AM, Amit Kapila wrote:

On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I wonder if there's a way to free the TOASTed data earlier, instead of
waiting until the end of the transaction (as this patch does).

IIUC we are anyway freeing the toasted data at the next
insert/update/delete. We can try to free at other change message types
like REORDER_BUFFER_CHANGE_MESSAGE but as you said that may make the
patch more complex, so it seems better to do the fix on the lines of
what is proposed in the patch.

+1

Even if we started doing what you mention (freeing the hash for other
change types), we'd still need to do what the patch proposes because the
speculative insert may be the last change in the transaction. For the
other cases it works as a mitigation, so that we don't leak the memory
forever.

Right.

So let's get this committed, perhaps with a comment explaining that it
might be possible to reset earlier if needed.

Okay, I think it would be better if we can test this once for the
streaming case as well. Dilip, would you like to do that and send the
updated patch as per one of the comments by Tomas?

--
With Regards,
Amit Kapila.

#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#16)
Re: Decoding speculative insert with toast leaks memory

On Mon, 31 May 2021 at 8:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, May 29, 2021 at 5:45 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 5/29/21 6:29 AM, Amit Kapila wrote:

On Fri, May 28, 2021 at 5:16 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I wonder if there's a way to free the TOASTed data earlier, instead of
waiting until the end of the transaction (as this patch does).

IIUC we are anyway freeing the toasted data at the next
insert/update/delete. We can try to free at other change message types
like REORDER_BUFFER_CHANGE_MESSAGE but as you said that may make the
patch more complex, so it seems better to do the fix on the lines of
what is proposed in the patch.

+1

Even if we started doing what you mention (freeing the hash for other
change types), we'd still need to do what the patch proposes because the
speculative insert may be the last change in the transaction. For the
other cases it works as a mitigation, so that we don't leak the memory
forever.

Right.

So let's get this committed, perhaps with a comment explaining that it
might be possible to reset earlier if needed.

Okay, I think it would be better if we can test this once for the
streaming case as well. Dilip, would you like to do that and send the
updated patch as per one of the comments by Tomas?

I will do that in sometime.

--

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#18Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#17)
2 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Mon, May 31, 2021 at 8:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, 31 May 2021 at 8:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I think it would be better if we can test this once for the
streaming case as well. Dilip, would you like to do that and send the
updated patch as per one of the comments by Tomas?

I will do that sometime.

I have changed patches as Tomas suggested and also created back patches.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-Fix-memory-leak-in-toast-hash_v13-to-v9.6.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Fix-memory-leak-in-toast-hash_v13-to-v9.6.patchDownload
From 792892fce8771c003eca16847e5e494a70f318ed Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 31 May 2021 15:20:05 +0530
Subject: [PATCH v1] Fix memory leak in toast hash

While cleaning up the changes just destory the toast hash so that if
it is not already done in some cases we don't leak memory.
---
 src/backend/replication/logical/reorderbuffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..89fa7e7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -391,6 +391,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* cleanup the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	if (txn->invalidations)
 	{
 		pfree(txn->invalidations);
-- 
1.8.3.1

v1-0001-Fix-memory-leak-in-toast-hash-v14.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Fix-memory-leak-in-toast-hash-v14.patchDownload
From 9a7eeaf7b47ec6faff0bbc524412d46869136720 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 27 May 2021 10:03:27 +0530
Subject: [PATCH v1] Fix memory leak in toast hash

While cleaning up the changes just destory the toast hash so that if
it is not already done in some cases we don't leak memory.
---
 src/backend/replication/logical/reorderbuffer.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..ab65d3b 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -437,6 +437,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* cleanup the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	if (txn->invalidations)
 	{
 		pfree(txn->invalidations);
@@ -1637,6 +1640,9 @@ ReorderBufferTruncateTXN(ReorderBuffer *rb, ReorderBufferTXN *txn, bool txn_prep
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* Cleanup the toast hash. */
+	ReorderBufferToastReset(rb, txn);
+
 	/* If this txn is serialized then clean the disk space. */
 	if (rbtxn_is_serialized(txn))
 	{
-- 
1.8.3.1

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#18)
Re: Decoding speculative insert with toast leaks memory

On Mon, 31 May 2021 at 4:29 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 31, 2021 at 8:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, 31 May 2021 at 8:21 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Okay, I think it would be better if we can test this once for the
streaming case as well. Dilip, would you like to do that and send the
updated patch as per one of the comments by Tomas?

I will do that sometime.

I have changed patches as Tomas suggested and also created back patches.

I missed to do the test for streaming. I will to that tomorrow and reply
back.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#20Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#19)
Re: Decoding speculative insert with toast leaks memory

On Mon, May 31, 2021 at 6:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, 31 May 2021 at 4:29 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 31, 2021 at 8:50 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, 31 May 2021 at 8:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Okay, I think it would be better if we can test this once for the
streaming case as well. Dilip, would you like to do that and send the
updated patch as per one of the comments by Tomas?

I will do that sometime.

I have changed patches as Tomas suggested and also created back patches.

I missed to do the test for streaming. I will to that tomorrow and reply back.

For streaming transactions this issue is not there. Because this
problem will only occur if the last change is *SPEC INSERT * and after
that there is no other UPDATE/INSERT change because on that change we
are resetting the toast table. Now, if the transaction has only *SPEC
INSERT* without SPEC CONFIRM or any other INSERT/UPDATE then we will
not stream it. And if we get any next INSERT/UPDATE then only we can
select this for stream but in that case toast will be reset. So as of
today for streaming mode we don't have this problem.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#20)
Re: Decoding speculative insert with toast leaks memory

On Mon, May 31, 2021 at 8:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 31, 2021 at 6:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I missed to do the test for streaming. I will to that tomorrow and reply back.

For streaming transactions this issue is not there. Because this
problem will only occur if the last change is *SPEC INSERT * and after
that there is no other UPDATE/INSERT change because on that change we
are resetting the toast table. Now, if the transaction has only *SPEC
INSERT* without SPEC CONFIRM or any other INSERT/UPDATE then we will
not stream it. And if we get any next INSERT/UPDATE then only we can
select this for stream but in that case toast will be reset. So as of
today for streaming mode we don't have this problem.

What if the next change is a different SPEC_INSERT
(REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)? I think in that case we
will stream but won't free the toast memory.

--
With Regards,
Amit Kapila.

#22Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#21)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 9:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 31, 2021 at 8:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 31, 2021 at 6:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I missed to do the test for streaming. I will to that tomorrow and reply back.

For streaming transactions this issue is not there. Because this
problem will only occur if the last change is *SPEC INSERT * and after
that there is no other UPDATE/INSERT change because on that change we
are resetting the toast table. Now, if the transaction has only *SPEC
INSERT* without SPEC CONFIRM or any other INSERT/UPDATE then we will
not stream it. And if we get any next INSERT/UPDATE then only we can
select this for stream but in that case toast will be reset. So as of
today for streaming mode we don't have this problem.

What if the next change is a different SPEC_INSERT
(REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)? I think in that case we
will stream but won't free the toast memory.

But if the next change is again the SPEC INSERT then we will keep the
PARTIAL change flag set and we will not select this transaction for
stream right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#23Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#22)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 9:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 9:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 31, 2021 at 8:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 31, 2021 at 6:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I missed to do the test for streaming. I will to that tomorrow and reply back.

For streaming transactions this issue is not there. Because this
problem will only occur if the last change is *SPEC INSERT * and after
that there is no other UPDATE/INSERT change because on that change we
are resetting the toast table. Now, if the transaction has only *SPEC
INSERT* without SPEC CONFIRM or any other INSERT/UPDATE then we will
not stream it. And if we get any next INSERT/UPDATE then only we can
select this for stream but in that case toast will be reset. So as of
today for streaming mode we don't have this problem.

What if the next change is a different SPEC_INSERT
(REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)? I think in that case we
will stream but won't free the toast memory.

But if the next change is again the SPEC INSERT then we will keep the
PARTIAL change flag set and we will not select this transaction for
stream right?

Right, I think you can remove the change related to stream xact and
probably write some comments on why we don't need it for streamed
transactions. But, now I have another question related to fixing the
non-streamed case. What if after the missing spec_confirm we get the
delete operation in the transaction? It seems
ReorderBufferToastReplace always expects Insert/Update if we have
toast hash active in the transaction.

--
With Regards,
Amit Kapila.

#24Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#23)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 10:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 1, 2021 at 9:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 9:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 31, 2021 at 8:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 31, 2021 at 6:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I missed to do the test for streaming. I will to that tomorrow and reply back.

For streaming transactions this issue is not there. Because this
problem will only occur if the last change is *SPEC INSERT * and after
that there is no other UPDATE/INSERT change because on that change we
are resetting the toast table. Now, if the transaction has only *SPEC
INSERT* without SPEC CONFIRM or any other INSERT/UPDATE then we will
not stream it. And if we get any next INSERT/UPDATE then only we can
select this for stream but in that case toast will be reset. So as of
today for streaming mode we don't have this problem.

What if the next change is a different SPEC_INSERT
(REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)? I think in that case we
will stream but won't free the toast memory.

But if the next change is again the SPEC INSERT then we will keep the
PARTIAL change flag set and we will not select this transaction for
stream right?

Right, I think you can remove the change related to stream xact and
probably write some comments on why we don't need it for streamed
transactions. But, now I have another question related to fixing the
non-streamed case. What if after the missing spec_confirm we get the
delete operation in the transaction? It seems
ReorderBufferToastReplace always expects Insert/Update if we have
toast hash active in the transaction.

Yeah, that looks like a problem, I will test this case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#25Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#24)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 10:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 1, 2021 at 9:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 9:53 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, May 31, 2021 at 8:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, May 31, 2021 at 6:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I missed to do the test for streaming. I will to that tomorrow and reply back.

For streaming transactions this issue is not there. Because this
problem will only occur if the last change is *SPEC INSERT * and after
that there is no other UPDATE/INSERT change because on that change we
are resetting the toast table. Now, if the transaction has only *SPEC
INSERT* without SPEC CONFIRM or any other INSERT/UPDATE then we will
not stream it. And if we get any next INSERT/UPDATE then only we can
select this for stream but in that case toast will be reset. So as of
today for streaming mode we don't have this problem.

What if the next change is a different SPEC_INSERT
(REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT)? I think in that case we
will stream but won't free the toast memory.

But if the next change is again the SPEC INSERT then we will keep the
PARTIAL change flag set and we will not select this transaction for
stream right?

Right, I think you can remove the change related to stream xact and
probably write some comments on why we don't need it for streamed
transactions. But, now I have another question related to fixing the
non-streamed case. What if after the missing spec_confirm we get the
delete operation in the transaction? It seems
ReorderBufferToastReplace always expects Insert/Update if we have
toast hash active in the transaction.

Yeah, that looks like a problem, I will test this case.

I am able to hit that Assert after slight modification in the original
test case, basically, I added an extra delete in the spec abort
transaction and I got this assertion.

#0 0x00007f7b8cc3a387 in raise () from /lib64/libc.so.6
#1 0x00007f7b8cc3ba78 in abort () from /lib64/libc.so.6
#2 0x0000000000b027c7 in ExceptionalCondition (conditionName=0xcc11df
"change->data.tp.newtuple", errorType=0xcc0244 "FailedAssertion",
fileName=0xcc0290 "reorderbuffer.c", lineNumber=4601) at assert.c:69
#3 0x00000000008dfeaf in ReorderBufferToastReplace (rb=0x1a73e40,
txn=0x1b5d6e8, relation=0x7f7b8dab4d78, change=0x1b5fb68) at
reorderbuffer.c:4601
#4 0x00000000008db769 in ReorderBufferProcessTXN (rb=0x1a73e40,
txn=0x1b5d6e8, commit_lsn=24329048, snapshot_now=0x1b4b8d0,
command_id=0, streaming=false)
at reorderbuffer.c:2187
#5 0x00000000008dc1df in ReorderBufferReplay (txn=0x1b5d6e8,
rb=0x1a73e40, xid=748, commit_lsn=24329048, end_lsn=24329096,
commit_time=675842700629597,
origin_id=0, origin_lsn=0) at reorderbuffer.c:2601
#6 0x00000000008dc267 in ReorderBufferCommit (rb=0x1a73e40, xid=748,
commit_lsn=24329048, end_lsn=24329096, commit_time=675842700629597,
origin_id=0, origin_lsn=0)
at reorderbuffer.c:2625
#7 0x00000000008cc144 in DecodeCommit (ctx=0x1b319b0,
buf=0x7ffdf15fb0a0, parsed=0x7ffdf15faf00, xid=748, two_phase=false)
at decode.c:744

IMHO, as I stated earlier one way to fix this problem is that we add
the spec abort operation (DELETE + XLH_DELETE_IS_SUPER flag) to the
queue, maybe with action name
"REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT" and as part of processing
that just cleans up the toast and specinsert tuple and nothing else.
If we think that makes sense then I can work on that patch?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#25)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 11:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 11:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 10:21 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

Right, I think you can remove the change related to stream xact and
probably write some comments on why we don't need it for streamed
transactions. But, now I have another question related to fixing the
non-streamed case. What if after the missing spec_confirm we get the
delete operation in the transaction? It seems
ReorderBufferToastReplace always expects Insert/Update if we have
toast hash active in the transaction.

Yeah, that looks like a problem, I will test this case.

I am able to hit that Assert after slight modification in the original
test case, basically, I added an extra delete in the spec abort
transaction and I got this assertion.

Can we try with other Insert/Update after spec abort to check if there
can be other problems due to active toast_hash?

IMHO, as I stated earlier one way to fix this problem is that we add
the spec abort operation (DELETE + XLH_DELETE_IS_SUPER flag) to the
queue, maybe with action name
"REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT" and as part of processing
that just cleans up the toast and specinsert tuple and nothing else.
If we think that makes sense then I can work on that patch?

I think this should solve the problem but let's first try to see if we
have any other problems. Because, say, if we don't have any other
problem, then maybe removing Assert might work but I guess we still
need to process the tuple to find that we don't need to assemble toast
for it which again seems like overkill.

--
With Regards,
Amit Kapila.

#27Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#26)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 12:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IMHO, as I stated earlier one way to fix this problem is that we add
the spec abort operation (DELETE + XLH_DELETE_IS_SUPER flag) to the
queue, maybe with action name
"REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT" and as part of processing
that just cleans up the toast and specinsert tuple and nothing else.
If we think that makes sense then I can work on that patch?

I think this should solve the problem but let's first try to see if we
have any other problems. Because, say, if we don't have any other
problem, then maybe removing Assert might work but I guess we still
need to process the tuple to find that we don't need to assemble toast
for it which again seems like overkill.

Yeah, other operation will also fail, basically, if txn->toast_hash is
not NULL then we assume that we need to assemble the tuple using
toast, but if there is next insert in another relation and if that
relation doesn't have a toast table then it will fail with below
error. And otherwise also, if it is the same relation, then the toast
chunks of previous tuple will be used for constructing this new tuple.
I think we must have to clean the toast before processing the next
tuple so I think we can go with the solution I mentioned above.

static void
ReorderBufferToastReplace
{
...
toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
if (!RelationIsValid(toast_rel))
elog(ERROR, "could not open relation with OID %u",
relation->rd_rel->reltoastrelid);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#28Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#27)
1 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 5:22 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 12:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IMHO, as I stated earlier one way to fix this problem is that we add
the spec abort operation (DELETE + XLH_DELETE_IS_SUPER flag) to the
queue, maybe with action name
"REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT" and as part of processing
that just cleans up the toast and specinsert tuple and nothing else.
If we think that makes sense then I can work on that patch?

I think this should solve the problem but let's first try to see if we
have any other problems. Because, say, if we don't have any other
problem, then maybe removing Assert might work but I guess we still
need to process the tuple to find that we don't need to assemble toast
for it which again seems like overkill.

Yeah, other operation will also fail, basically, if txn->toast_hash is
not NULL then we assume that we need to assemble the tuple using
toast, but if there is next insert in another relation and if that
relation doesn't have a toast table then it will fail with below
error. And otherwise also, if it is the same relation, then the toast
chunks of previous tuple will be used for constructing this new tuple.
I think we must have to clean the toast before processing the next
tuple so I think we can go with the solution I mentioned above.

static void
ReorderBufferToastReplace
{
...
toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
if (!RelationIsValid(toast_rel))
elog(ERROR, "could not open relation with OID %u",
relation->rd_rel->reltoastrelid);

The attached patch fixes by queuing the spec abort change and cleaning
up the toast hash on spec abort. Currently, in this patch I am
queuing up all the spec abort changes, but as an optimization we can
avoid
queuing the spec abort for toast tables but for that we need to log
that as a flag in WAL. that this XLH_DELETE_IS_SUPER is for a toast
relation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-Bug-fix-for-speculative-abort.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Bug-fix-for-speculative-abort.patchDownload
From f607e140cae21183fea2fc226029d7673478509e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 1 Jun 2021 19:53:47 +0530
Subject: [PATCH v2] Bug fix for speculative abort

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 src/backend/replication/logical/decode.c        | 16 +++++++-----
 src/backend/replication/logical/reorderbuffer.c | 34 +++++++++++++++++++++++++
 src/include/replication/reorderbuffer.h         |  1 +
 3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..1a0d7dc 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1040,19 +1040,21 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
+	/* output plugin doesn't look for this origin, no need to queue */
+	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+		return;
+
+	change = ReorderBufferGetChange(ctx->reorder);
+
 	/*
 	 * Super deletions are irrelevant for logical decoding, it's driven by the
 	 * confirmation records.
 	 */
 	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
-	/* output plugin doesn't look for this origin, no need to queue */
-	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
-		return;
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
 
-	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..4940cf5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -520,6 +520,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -2254,6 +2255,36 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and for the remaining
+					 * entries we can just ignore.
+					 *
+					 * XXX For optimization, we may log a flag saying this is
+					 * a spec abort for the toast table and we can avoid queuing
+					 * that change.
+					 */
+					if (specinsert != NULL)
+					{
+						/* Clear the toast chunk */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert, true);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -3754,6 +3785,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4017,6 +4049,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4315,6 +4348,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..9ff0986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#28)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 8:01 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

The attached patch fixes by queuing the spec abort change and cleaning
up the toast hash on spec abort. Currently, in this patch I am
queuing up all the spec abort changes, but as an optimization we can
avoid
queuing the spec abort for toast tables but for that we need to log
that as a flag in WAL. that this XLH_DELETE_IS_SUPER is for a toast
relation.

I don't think that is required especially because we intend to
backpatch this, so I would like to keep such optimization for another
day. Few comments:

Comments:
------------
/*
* Super deletions are irrelevant for logical decoding, it's driven by the
* confirmation records.
*/
1. The above comment is not required after your other changes.

/*
* Either speculative insertion was confirmed, or it was
* unsuccessful and the record isn't needed anymore.
*/
if (specinsert != NULL)
2. The above comment needs some adjustment.

/*
* There's a speculative insertion remaining, just clean in up, it
* can't have been successful, otherwise we'd gotten a confirmation
* record.
*/
if (specinsert)
{
ReorderBufferReturnChange(rb, specinsert, true);
specinsert = NULL;
}

3. Ideally, we should have an Assert here because we shouldn't reach
without cleaning up specinsert. If there is still a chance then we
should mention that in the comments.

4.
+ case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+ /*
+ * Abort for speculative insertion arrived.

I think here we should explain why we can't piggyback cleanup on next
insert/update/delete.

5. Can we write a test case for it? I guess we don't need to use
multiple sessions if the conflicting record is already present.

Please see if the same patch works on back-branches? I guess this
makes the change bit tricky as it involves decoding a new message but
not sure if there is a better way.

--
With Regards,
Amit Kapila.

#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#27)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 1, 2021 at 5:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 12:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IMHO, as I stated earlier one way to fix this problem is that we add
the spec abort operation (DELETE + XLH_DELETE_IS_SUPER flag) to the
queue, maybe with action name
"REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT" and as part of processing
that just cleans up the toast and specinsert tuple and nothing else.
If we think that makes sense then I can work on that patch?

I think this should solve the problem but let's first try to see if we
have any other problems. Because, say, if we don't have any other
problem, then maybe removing Assert might work but I guess we still
need to process the tuple to find that we don't need to assemble toast
for it which again seems like overkill.

Yeah, other operation will also fail, basically, if txn->toast_hash is
not NULL then we assume that we need to assemble the tuple using
toast, but if there is next insert in another relation and if that
relation doesn't have a toast table then it will fail with below
error. And otherwise also, if it is the same relation, then the toast
chunks of previous tuple will be used for constructing this new tuple.

I think the same relation case might not create a problem because it
won't find the entry for it in the toast_hash, so it will return from
there but the other two problems will be there. So, one idea could be
to just avoid those two cases (by simply adding return for those
cases) and still we can rely on toast clean up on the next
insert/update/delete. However, your approach looks more natural to me
as that will allow us to clean up memory in all cases instead of
waiting till the transaction end. So, I still vote for going with your
patch's idea of cleaning at spec_abort but I am fine if you and others
decide not to process spec_abort message. What do you think? Tomas, do
you have any opinion on this matter?

--
With Regards,
Amit Kapila.

#31Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#30)
2 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 2, 2021 at 11:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 1, 2021 at 5:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 12:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IMHO, as I stated earlier one way to fix this problem is that we add
the spec abort operation (DELETE + XLH_DELETE_IS_SUPER flag) to the
queue, maybe with action name
"REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT" and as part of processing
that just cleans up the toast and specinsert tuple and nothing else.
If we think that makes sense then I can work on that patch?

I think this should solve the problem but let's first try to see if we
have any other problems. Because, say, if we don't have any other
problem, then maybe removing Assert might work but I guess we still
need to process the tuple to find that we don't need to assemble toast
for it which again seems like overkill.

Yeah, other operation will also fail, basically, if txn->toast_hash is
not NULL then we assume that we need to assemble the tuple using
toast, but if there is next insert in another relation and if that
relation doesn't have a toast table then it will fail with below
error. And otherwise also, if it is the same relation, then the toast
chunks of previous tuple will be used for constructing this new tuple.

I think the same relation case might not create a problem because it
won't find the entry for it in the toast_hash, so it will return from
there but the other two problems will be there.

Right

So, one idea could be

to just avoid those two cases (by simply adding return for those
cases) and still we can rely on toast clean up on the next
insert/update/delete. However, your approach looks more natural to me
as that will allow us to clean up memory in all cases instead of
waiting till the transaction end. So, I still vote for going with your
patch's idea of cleaning at spec_abort but I am fine if you and others
decide not to process spec_abort message. What do you think? Tomas, do
you have any opinion on this matter?

I agree that processing with spec abort looks more natural and ideally
the current code expects it to be getting cleaned after the change,
that's the reason we have those assertions and errors. OTOH I agree
that we can just return from those conditions because now we know that
with the current code those situations are possible. My vote is with
handling the spec abort option (Option1) because that looks more
natural way of handling these issues and we also don't have to clean
up the hash in "ReorderBufferReturnTXN" if no followup change after
spec abort. I am attaching the patches with both the approaches for
the reference.

Once we finalize on the approach then I will work on pending review
comments and also prepare the back branch patches.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

Option1_v2-0001-Bug-fix-for-speculative-abort.patchtext/x-patch; charset=US-ASCII; name=Option1_v2-0001-Bug-fix-for-speculative-abort.patchDownload
From f607e140cae21183fea2fc226029d7673478509e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 1 Jun 2021 19:53:47 +0530
Subject: [PATCH v2] Bug fix for speculative abort

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 src/backend/replication/logical/decode.c        | 16 +++++++-----
 src/backend/replication/logical/reorderbuffer.c | 34 +++++++++++++++++++++++++
 src/include/replication/reorderbuffer.h         |  1 +
 3 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..1a0d7dc 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1040,19 +1040,21 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
+	/* output plugin doesn't look for this origin, no need to queue */
+	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
+		return;
+
+	change = ReorderBufferGetChange(ctx->reorder);
+
 	/*
 	 * Super deletions are irrelevant for logical decoding, it's driven by the
 	 * confirmation records.
 	 */
 	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
-	/* output plugin doesn't look for this origin, no need to queue */
-	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
-		return;
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
 
-	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..4940cf5 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -520,6 +520,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -2254,6 +2255,36 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and for the remaining
+					 * entries we can just ignore.
+					 *
+					 * XXX For optimization, we may log a flag saying this is
+					 * a spec abort for the toast table and we can avoid queuing
+					 * that change.
+					 */
+					if (specinsert != NULL)
+					{
+						/* Clear the toast chunk */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert, true);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -3754,6 +3785,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4017,6 +4049,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4315,6 +4348,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..9ff0986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

Option2_v3-0001-Bug-fix-in-case-of-leftover-hash-after-spec-abort.patchtext/x-patch; charset=US-ASCII; name=Option2_v3-0001-Bug-fix-in-case-of-leftover-hash-after-spec-abort.patchDownload
From 68c672ec05b1867e3b170435f929321e43d05020 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 2 Jun 2021 11:16:34 +0530
Subject: [PATCH v3] Bug fix in case of leftover hash after spec abort

Basically, if the toast table insertion for spec insert followup
by spec abort then the toast hash was not cleaned up immeditely
and there was some assert and error based on that situation so
just ignore them.
---
 src/backend/replication/logical/reorderbuffer.c | 32 +++++++++++++++++--------
 1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..5c20e13 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -437,6 +437,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->tuplecid_hash = NULL;
 	}
 
+	/* cleanup the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	if (txn->invalidations)
 	{
 		pfree(txn->invalidations);
@@ -4583,6 +4586,25 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		return;
 
 	/*
+	 * If the current change is not a INSERT or UPDATE then it might be the
+	 * leftover toast hash from previous spec insert without spec confirm.  So
+	 * we can just ignore it.
+	 */
+	if (change->data.tp.newtuple == NULL)
+		return;
+
+	desc = RelationGetDescr(relation);
+
+	/*
+	 * If the current relation doesn't have a toast relation then it might be
+	 * the leftover toast hash from previous spec insert without spec confirm.
+	 * So we can just ignore it.
+	 */
+	toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
+	if (!RelationIsValid(toast_rel))
+		return;
+
+	/*
 	 * We're going to modify the size of the change, so to make sure the
 	 * accounting is correct we'll make it look like we're removing the change
 	 * now (with the old size), and then re-add it at the end.
@@ -4591,16 +4613,6 @@ ReorderBufferToastReplace(ReorderBuffer *rb, ReorderBufferTXN *txn,
 
 	oldcontext = MemoryContextSwitchTo(rb->context);
 
-	/* we should only have toast tuples in an INSERT or UPDATE */
-	Assert(change->data.tp.newtuple);
-
-	desc = RelationGetDescr(relation);
-
-	toast_rel = RelationIdGetRelation(relation->rd_rel->reltoastrelid);
-	if (!RelationIsValid(toast_rel))
-		elog(ERROR, "could not open relation with OID %u",
-			 relation->rd_rel->reltoastrelid);
-
 	toast_desc = RelationGetDescr(toast_rel);
 
 	/* should we allocate from stack instead? */
-- 
1.8.3.1

#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#31)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 2, 2021 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 2, 2021 at 11:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 1, 2021 at 5:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jun 1, 2021 at 12:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

IMHO, as I stated earlier one way to fix this problem is that we add
the spec abort operation (DELETE + XLH_DELETE_IS_SUPER flag) to the
queue, maybe with action name
"REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT" and as part of processing
that just cleans up the toast and specinsert tuple and nothing else.
If we think that makes sense then I can work on that patch?

I think this should solve the problem but let's first try to see if we
have any other problems. Because, say, if we don't have any other
problem, then maybe removing Assert might work but I guess we still
need to process the tuple to find that we don't need to assemble toast
for it which again seems like overkill.

Yeah, other operation will also fail, basically, if txn->toast_hash is
not NULL then we assume that we need to assemble the tuple using
toast, but if there is next insert in another relation and if that
relation doesn't have a toast table then it will fail with below
error. And otherwise also, if it is the same relation, then the toast
chunks of previous tuple will be used for constructing this new tuple.

I think the same relation case might not create a problem because it
won't find the entry for it in the toast_hash, so it will return from
there but the other two problems will be there.

Right

So, one idea could be

to just avoid those two cases (by simply adding return for those
cases) and still we can rely on toast clean up on the next
insert/update/delete. However, your approach looks more natural to me
as that will allow us to clean up memory in all cases instead of
waiting till the transaction end. So, I still vote for going with your
patch's idea of cleaning at spec_abort but I am fine if you and others
decide not to process spec_abort message. What do you think? Tomas, do
you have any opinion on this matter?

I agree that processing with spec abort looks more natural and ideally
the current code expects it to be getting cleaned after the change,
that's the reason we have those assertions and errors. OTOH I agree
that we can just return from those conditions because now we know that
with the current code those situations are possible. My vote is with
handling the spec abort option (Option1) because that looks more
natural way of handling these issues and we also don't have to clean
up the hash in "ReorderBufferReturnTXN" if no followup change after
spec abort.

Even, if we decide to go with spec_abort approach, it might be better
to still keep the toastreset call in ReorderBufferReturnTXN so that it
can be freed in case of error.

--
With Regards,
Amit Kapila.

#33Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#32)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 2, 2021 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 2, 2021 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 2, 2021 at 11:25 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the same relation case might not create a problem because it
won't find the entry for it in the toast_hash, so it will return from
there but the other two problems will be there.

Right

So, one idea could be

to just avoid those two cases (by simply adding return for those
cases) and still we can rely on toast clean up on the next
insert/update/delete. However, your approach looks more natural to me
as that will allow us to clean up memory in all cases instead of
waiting till the transaction end. So, I still vote for going with your
patch's idea of cleaning at spec_abort but I am fine if you and others
decide not to process spec_abort message. What do you think? Tomas, do
you have any opinion on this matter?

I agree that processing with spec abort looks more natural and ideally
the current code expects it to be getting cleaned after the change,
that's the reason we have those assertions and errors.

Okay, so, let's go with that approach. I have thought about whether it
creates any problem in back-branches but couldn't think of any
primarily because we are not going to send anything additional to
plugin/subscriber. Do you see any problems with back branches if we go
with this approach?

OTOH I agree
that we can just return from those conditions because now we know that
with the current code those situations are possible. My vote is with
handling the spec abort option (Option1) because that looks more
natural way of handling these issues and we also don't have to clean
up the hash in "ReorderBufferReturnTXN" if no followup change after
spec abort.

Even, if we decide to go with spec_abort approach, it might be better
to still keep the toastreset call in ReorderBufferReturnTXN so that it
can be freed in case of error.

Please take care of this as well.

--
With Regards,
Amit Kapila.

#34Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#33)
Re: Decoding speculative insert with toast leaks memory

On Mon, 7 Jun 2021 at 8:30 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 2, 2021 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Jun 2, 2021 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com>

wrote:

On Wed, Jun 2, 2021 at 11:25 AM Amit Kapila <amit.kapila16@gmail.com>

wrote:

I think the same relation case might not create a problem because it
won't find the entry for it in the toast_hash, so it will return from
there but the other two problems will be there.

Right

So, one idea could be

to just avoid those two cases (by simply adding return for those
cases) and still we can rely on toast clean up on the next
insert/update/delete. However, your approach looks more natural to me
as that will allow us to clean up memory in all cases instead of
waiting till the transaction end. So, I still vote for going with

your

patch's idea of cleaning at spec_abort but I am fine if you and

others

decide not to process spec_abort message. What do you think? Tomas,

do

you have any opinion on this matter?

I agree that processing with spec abort looks more natural and ideally
the current code expects it to be getting cleaned after the change,
that's the reason we have those assertions and errors.

Okay, so, let's go with that approach. I have thought about whether it
creates any problem in back-branches but couldn't think of any
primarily because we are not going to send anything additional to
plugin/subscriber. Do you see any problems with back branches if we go
with this approach?

I will check this and let you know.

OTOH I agree
that we can just return from those conditions because now we know that
with the current code those situations are possible. My vote is with
handling the spec abort option (Option1) because that looks more
natural way of handling these issues and we also don't have to clean
up the hash in "ReorderBufferReturnTXN" if no followup change after
spec abort.

Even, if we decide to go with spec_abort approach, it might be better
to still keep the toastreset call in ReorderBufferReturnTXN so that it
can be freed in case of error.

Please take care of this as well.

Ok

--

Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#35Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#34)
1 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Mon, Jun 7, 2021 at 8:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, 7 Jun 2021 at 8:30 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Jun 2, 2021 at 11:52 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Jun 2, 2021 at 11:38 AM Dilip Kumar <dilipbalaut@gmail.com>

wrote:

On Wed, Jun 2, 2021 at 11:25 AM Amit Kapila <amit.kapila16@gmail.com>

wrote:

I think the same relation case might not create a problem because it
won't find the entry for it in the toast_hash, so it will return

from

there but the other two problems will be there.

Right

So, one idea could be

to just avoid those two cases (by simply adding return for those
cases) and still we can rely on toast clean up on the next
insert/update/delete. However, your approach looks more natural to

me

as that will allow us to clean up memory in all cases instead of
waiting till the transaction end. So, I still vote for going with

your

patch's idea of cleaning at spec_abort but I am fine if you and

others

decide not to process spec_abort message. What do you think? Tomas,

do

you have any opinion on this matter?

I agree that processing with spec abort looks more natural and ideally
the current code expects it to be getting cleaned after the change,
that's the reason we have those assertions and errors.

Okay, so, let's go with that approach. I have thought about whether it
creates any problem in back-branches but couldn't think of any
primarily because we are not going to send anything additional to
plugin/subscriber. Do you see any problems with back branches if we go
with this approach?

I will check this and let you know.

OTOH I agree
that we can just return from those conditions because now we know that
with the current code those situations are possible. My vote is with
handling the spec abort option (Option1) because that looks more
natural way of handling these issues and we also don't have to clean
up the hash in "ReorderBufferReturnTXN" if no followup change after
spec abort.

Even, if we decide to go with spec_abort approach, it might be better
to still keep the toastreset call in ReorderBufferReturnTXN so that it
can be freed in case of error.

Please take care of this as well.

Ok

I have fixed all pending review comments and also added a test case which
is working fine. I haven't yet checked on the back branches. Let's
discuss if we think this patch looks fine then I can apply and test on the
back branches.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-Bug-fix-for-speculative-abort.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Bug-fix-for-speculative-abort.patchDownload
From 0b9c93398ef108a3d71cbac6f793a0314964aaa2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 1 Jun 2021 19:53:47 +0530
Subject: [PATCH v3] Bug fix for speculative abort

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 contrib/test_decoding/Makefile                     |  2 +-
 .../test_decoding/expected/speculative_abort.out   | 64 +++++++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 67 ++++++++++++++++++++++
 src/backend/replication/logical/decode.c           | 14 ++---
 src/backend/replication/logical/reorderbuffer.c    | 43 +++++++++++++-
 src/include/replication/reorderbuffer.h            |  1 +
 6 files changed, 180 insertions(+), 11 deletions(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b..1cab935 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot speculative_abort
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..d672deb
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,64 @@
+Parsed test spec with 3 sessions
+
+starting permutation: s1_session s1_lock_s2 s1_lock_s3 s1_begin s1_insert_tbl1 s2_session s2_begin s2_insert_tbl1 s3_session s3_begin s3_insert_tbl1 s1_unlock_s2 s1_unlock_s3 s1_lock_s2 s1_abort s3_commit s1_unlock_s2 s2_insert_tbl2 s2_commit s1_get_changes
+data           
+
+step s1_session: SET spec.session = 1;
+step s1_lock_s2: SELECT pg_advisory_lock(2);
+pg_advisory_lock
+
+               
+step s1_lock_s3: SELECT pg_advisory_lock(2);
+pg_advisory_lock
+
+               
+step s1_begin: BEGIN;
+step s1_insert_tbl1: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING;
+step s2_session: SET spec.session = 2;
+step s2_begin: BEGIN;
+s2: NOTICE:  2acquiring advisory lock on 2
+step s2_insert_tbl1: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step s3_session: SET spec.session = 3;
+step s3_begin: BEGIN;
+s3: NOTICE:  3acquiring advisory lock on 3
+s3: NOTICE:  3acquiring advisory lock on 3
+step s3_insert_tbl1: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step s1_unlock_s2: SELECT pg_advisory_unlock(2);
+pg_advisory_unlock
+
+t              
+step s1_unlock_s3: SELECT pg_advisory_unlock(2);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  2acquiring advisory lock on 2
+step s1_lock_s2: SELECT pg_advisory_lock(2);
+pg_advisory_lock
+
+               
+step s1_abort: ROLLBACK;
+s2: NOTICE:  2acquiring advisory lock on 2
+s3: NOTICE:  3acquiring advisory lock on 3
+step s3_insert_tbl1: <... completed>
+step s3_commit: COMMIT;
+step s1_unlock_s2: SELECT pg_advisory_unlock(2);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  2acquiring advisory lock on 2
+s2: NOTICE:  2acquiring advisory lock on 2
+step s2_insert_tbl1: <... completed>
+step s2_insert_tbl2: INSERT INTO tbl2 VALUES(1);
+step s2_commit: COMMIT;
+step s1_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..0e2ab29
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,67 @@
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+	CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+	BEGIN
+	-- depending on lock state, wait for lock 2 or 3
+	IF current_setting('spec.session')::int =  2 THEN
+		RAISE NOTICE '2acquiring advisory lock on 2';
+		PERFORM pg_advisory_lock(2);
+		PERFORM pg_advisory_unlock(2);
+	ELSIF current_setting('spec.session')::int =  3 THEN
+		RAISE NOTICE '3acquiring advisory lock on 3';
+		PERFORM pg_advisory_lock(3);
+		PERFORM pg_advisory_unlock(3);
+	END IF;
+	RETURN $1;
+	END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "s1"
+setup { SET synchronous_commit=on; }
+
+step "s1_lock_s2" { SELECT pg_advisory_lock(2); }
+step "s1_lock_s3" { SELECT pg_advisory_lock(2); }
+step "s1_session" { SET spec.session = 1; }
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_abort" { ROLLBACK; }
+step "s1_unlock_s2" { SELECT pg_advisory_unlock(2); }
+step "s1_unlock_s3" { SELECT pg_advisory_unlock(2); }
+step "s1_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s2"
+setup { SET synchronous_commit=on; }
+
+step "s2_session" { SET spec.session = 2; }
+step "s2_begin" { BEGIN; }
+step "s2_insert_tbl1" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s2_insert_tbl2" { INSERT INTO tbl2 VALUES(1); }
+step "s2_commit" { COMMIT; }
+
+session "s3"
+setup { SET synchronous_commit=on; }
+
+step "s3_session" { SET spec.session = 3; }
+step "s3_begin" { BEGIN; }
+step "s3_insert_tbl1" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s3_commit" { COMMIT; }
+
+
+permutation "s1_session" "s1_lock_s2" "s1_lock_s3" "s1_begin" "s1_insert_tbl1" "s2_session" "s2_begin" "s2_insert_tbl1" "s3_session" "s3_begin" "s3_insert_tbl1" "s1_unlock_s2" "s1_unlock_s3" "s1_lock_s2" "s1_abort" "s3_commit" "s1_unlock_s2" "s2_insert_tbl2" "s2_commit" "s1_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..453efc5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1040,19 +1040,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..dd95785 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -443,6 +443,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -520,6 +523,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -2211,8 +2215,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -2254,6 +2258,38 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and remaining changes for
+					 * the tables can be ignored.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * Clear the toast hash, we must clean the toast hash
+						 * before we start with a completely new tuple,
+						 * otherwise, while processing the new tuple it would
+						 * create a confusion that whether we need to process
+						 * these toast chunks or not.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert, true);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -3754,6 +3790,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4017,6 +4054,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4315,6 +4353,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..9ff0986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

#36Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#35)
Re: Decoding speculative insert with toast leaks memory

On Mon, Jun 7, 2021 at 6:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have fixed all pending review comments and also added a test case which is working fine.

Few observations and questions on testcase:
1.
+step "s1_lock_s2" { SELECT pg_advisory_lock(2); }
+step "s1_lock_s3" { SELECT pg_advisory_lock(2); }
+step "s1_session" { SET spec.session = 1; }
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000))
ON CONFLICT DO NOTHING; }
+step "s1_abort" { ROLLBACK; }
+step "s1_unlock_s2" { SELECT pg_advisory_unlock(2); }
+step "s1_unlock_s3" { SELECT pg_advisory_unlock(2); }

Here, s1_lock_s3 and s1_unlock_s3 uses 2 as identifier. Don't you need
to use 3 in that part of the test?

2. In the test, there seems to be an assumption that we can unlock s2
and s3 one after another, and then both will start waiting on s-1 but
isn't it possible that before s2 start waiting on s1, s3 completes its
insertion and then s2 will never proceed for speculative insertion?

I haven't yet checked on the back branches. Let's discuss if we think this patch looks fine then I can apply and test on the back branches.

Sure, that makes sense.

--
With Regards,
Amit Kapila.

#37Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#36)
Re: Decoding speculative insert with toast leaks memory

On Mon, Jun 7, 2021 at 6:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Jun 7, 2021 at 6:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have fixed all pending review comments and also added a test case

which is working fine.

Few observations and questions on testcase:
1.
+step "s1_lock_s2" { SELECT pg_advisory_lock(2); }
+step "s1_lock_s3" { SELECT pg_advisory_lock(2); }
+step "s1_session" { SET spec.session = 1; }
+step "s1_begin" { BEGIN; }
+step "s1_insert_tbl1" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000))
ON CONFLICT DO NOTHING; }
+step "s1_abort" { ROLLBACK; }
+step "s1_unlock_s2" { SELECT pg_advisory_unlock(2); }
+step "s1_unlock_s3" { SELECT pg_advisory_unlock(2); }

Here, s1_lock_s3 and s1_unlock_s3 uses 2 as identifier. Don't you need
to use 3 in that part of the test?

Yeah this should be 3.

2. In the test, there seems to be an assumption that we can unlock s2
and s3 one after another, and then both will start waiting on s-1 but
isn't it possible that before s2 start waiting on s1, s3 completes its
insertion and then s2 will never proceed for speculative insertion?

I agree, such race conditions are possible. Currently, I am not able to
think what we can do here, but I will think more on this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#38Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#37)
2 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Mon, Jun 7, 2021 at 6:45 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

2. In the test, there seems to be an assumption that we can unlock s2
and s3 one after another, and then both will start waiting on s-1 but
isn't it possible that before s2 start waiting on s1, s3 completes its
insertion and then s2 will never proceed for speculative insertion?

I agree, such race conditions are possible. Currently, I am not able to think what we can do here, but I will think more on this.

Based on the off list discussion, I have modified the test based on
the idea showed in
"isolation/specs/insert-conflict-specconflict.spec", other open point
we had about the race condition that how to ensure that when we unlock
any session it make progress, IMHO the isolation tested is designed in
a way that either all the waiting session returns with the output or
again block on a heavy weight lock before performing the next step.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-Bug-fix-for-speculative-abort.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Bug-fix-for-speculative-abort.patchDownload
From dcea4c36267ad2dc58dd0a57733a6f6276e2d754 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 8 Jun 2021 17:06:39 +0530
Subject: [PATCH v4 1/2] Bug fix for speculative abort

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 src/backend/replication/logical/decode.c        | 14 ++++----
 src/backend/replication/logical/reorderbuffer.c | 43 +++++++++++++++++++++++--
 src/include/replication/reorderbuffer.h         |  1 +
 3 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..453efc5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1040,19 +1040,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..dd95785 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -443,6 +443,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -520,6 +523,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -2211,8 +2215,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -2254,6 +2258,38 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and remaining changes for
+					 * the tables can be ignored.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * Clear the toast hash, we must clean the toast hash
+						 * before we start with a completely new tuple,
+						 * otherwise, while processing the new tuple it would
+						 * create a confusion that whether we need to process
+						 * these toast chunks or not.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert, true);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -3754,6 +3790,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4017,6 +4054,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4315,6 +4353,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..9ff0986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v4-0002-test-case.patchtext/x-patch; charset=US-ASCII; name=v4-0002-test-case.patchDownload
From 8b5a46a99489c5c1cacbaf6047440c4689727b9c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 8 Jun 2021 17:11:24 +0530
Subject: [PATCH v4 2/2] test case

---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 +++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 105 +++++++++++++++++++++
 3 files changed, 191 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b..1cab935 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot speculative_abort
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..79ee67c
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock_123() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock_123() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock_123() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock_123() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock_123() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock_123() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..01b6be5
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,105 @@
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock_123(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+     BEGIN
+        RAISE NOTICE 'blurt_and_lock_123() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock_123(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+#permutation "s1_session" "s1_lock_s2" "s1_lock_s3" "s1_begin" "s1_insert_tbl1" "s2_session" "s2_begin" "s2_insert_tbl1" "s3_session" "s3_begin" "s3_insert_tbl1" "s1_unlock_s2" "s1_unlock_s3" "s1_lock_s2" "s1_abort" "s3_commit" "s1_unlock_s2" "s2_insert_tbl2" "s2_commit" "s1_get_changes"
+
+# Test that speculative locks are correctly acquired and released, s2
+# inserts, s1 updates.
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock_123 function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
-- 
1.8.3.1

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#38)
Re: Decoding speculative insert with toast leaks memory

On Tue, Jun 8, 2021 at 5:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on the off list discussion, I have modified the test based on
the idea showed in
"isolation/specs/insert-conflict-specconflict.spec", other open point
we had about the race condition that how to ensure that when we unlock
any session it make progress, IMHO the isolation tested is designed in
a way that either all the waiting session returns with the output or
again block on a heavy weight lock before performing the next step.

Few comments:
1. The test has a lot of similarities and test duplication with what
we are doing in insert-conflict-specconflict.spec. Can we move it to
insert-conflict-specconflict.spec? I understand that having it in
test_decoding has the advantage that we can have all decoding tests in
one place but OTOH, we can avoid a lot of test-code duplication if we
add it in insert-conflict-specconflict.spec.

2.
+#permutation "s1_session" "s1_lock_s2" "s1_lock_s3" "s1_begin"
"s1_insert_tbl1" "s2_session" "s2_begin" "s2_insert_tbl1" "s3_session"
"s3_begin" "s3_insert_tbl1" "s1_unlock_s2" "s1_unlock_s3" "s1_lock_s2"
"s1_abort" "s3_commit" "s1_unlock_s2" "s2_insert_tbl2" "s2_commit"
"s1_get_changes"

This permutation is not matching with what we are actually doing.

3.
+# Test that speculative locks are correctly acquired and released, s2
+# inserts, s1 updates.

This test description doesn't seem to be correct. Can we change it to
something like: "Test logical decoding of speculative aborts for toast
insertion followed by insertion into a different table which doesn't
have a toast"?

Also, let's prepare and test the patches for back-branches. It would
be better if you can prepare separate patches for code and test-case
for each branch then I can merge them before commit. This helps with
testing on back-branches.

--
With Regards,
Amit Kapila.

#40Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#39)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 9, 2021 at 11:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 8, 2021 at 5:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on the off list discussion, I have modified the test based on
the idea showed in
"isolation/specs/insert-conflict-specconflict.spec", other open point
we had about the race condition that how to ensure that when we unlock
any session it make progress, IMHO the isolation tested is designed in
a way that either all the waiting session returns with the output or
again block on a heavy weight lock before performing the next step.

Few comments:
1. The test has a lot of similarities and test duplication with what
we are doing in insert-conflict-specconflict.spec. Can we move it to
insert-conflict-specconflict.spec? I understand that having it in
test_decoding has the advantage that we can have all decoding tests in
one place but OTOH, we can avoid a lot of test-code duplication if we
add it in insert-conflict-specconflict.spec.

It seems the isolation test runs on the default configuration, will it be a
good idea to change the wal_level to logical for the whole isolation tester
folder?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#40)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 9, 2021 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 9, 2021 at 11:03 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jun 8, 2021 at 5:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Based on the off list discussion, I have modified the test based on
the idea showed in
"isolation/specs/insert-conflict-specconflict.spec", other open point
we had about the race condition that how to ensure that when we unlock
any session it make progress, IMHO the isolation tested is designed in
a way that either all the waiting session returns with the output or
again block on a heavy weight lock before performing the next step.

Few comments:
1. The test has a lot of similarities and test duplication with what
we are doing in insert-conflict-specconflict.spec. Can we move it to
insert-conflict-specconflict.spec? I understand that having it in
test_decoding has the advantage that we can have all decoding tests in
one place but OTOH, we can avoid a lot of test-code duplication if we
add it in insert-conflict-specconflict.spec.

It seems the isolation test runs on the default configuration, will it be a good idea to change the wal_level to logical for the whole isolation tester folder?

No, that doesn't sound like a good idea to me. Let's keep it in
test_decoding then.

--
With Regards,
Amit Kapila.

#42Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#41)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 9, 2021 at 4:22 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 9, 2021 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Few comments:
1. The test has a lot of similarities and test duplication with what
we are doing in insert-conflict-specconflict.spec. Can we move it to
insert-conflict-specconflict.spec? I understand that having it in
test_decoding has the advantage that we can have all decoding tests in
one place but OTOH, we can avoid a lot of test-code duplication if we
add it in insert-conflict-specconflict.spec.

It seems the isolation test runs on the default configuration, will it

be a good idea to change the wal_level to logical for the whole isolation
tester folder?

No, that doesn't sound like a good idea to me. Let's keep it in
test_decoding then.

Okay, I will work on the remaining comments and back patches and send it by
tomorrow.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#43Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Dilip Kumar (#42)
Re: Decoding speculative insert with toast leaks memory

May I suggest to use a different name in the blurt_and_lock_123()
function, so that it doesn't conflict with the one in
insert-conflict-specconflict? Thanks

--
�lvaro Herrera 39�49'30"S 73�17'W

#44Dilip Kumar
dilipbalaut@gmail.com
In reply to: Alvaro Herrera (#43)
10 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 9, 2021 at 8:59 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

May I suggest to use a different name in the blurt_and_lock_123()
function, so that it doesn't conflict with the one in
insert-conflict-specconflict? Thanks

Renamed to blurt_and_lock(), is that fine?

I haved fixed other comments and also prepared patches for the back branches.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v5-0001-Bug-fix-for-speculative-abort_HEAD.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Bug-fix-for-speculative-abort_HEAD.patchDownload
From dcea4c36267ad2dc58dd0a57733a6f6276e2d754 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 8 Jun 2021 17:06:39 +0530
Subject: [PATCH v5 1/2] Bug fix for speculative abort

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 src/backend/replication/logical/decode.c        | 14 ++++----
 src/backend/replication/logical/reorderbuffer.c | 43 +++++++++++++++++++++++--
 src/include/replication/reorderbuffer.h         |  1 +
 3 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..453efc5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1040,19 +1040,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..dd95785 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -443,6 +443,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -520,6 +523,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -2211,8 +2215,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -2254,6 +2258,38 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and remaining changes for
+					 * the tables can be ignored.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * Clear the toast hash, we must clean the toast hash
+						 * before we start with a completely new tuple,
+						 * otherwise, while processing the new tuple it would
+						 * create a confusion that whether we need to process
+						 * these toast chunks or not.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert, true);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -3754,6 +3790,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4017,6 +4054,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4315,6 +4353,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..9ff0986 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v5-0001-Bug-fix-for-speculative-abort-v10.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Bug-fix-for-speculative-abort-v10.patchDownload
From 4440e5dca68da59d9d397efb890893470dd92aaf Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Jun 2021 11:59:58 +0530
Subject: [PATCH v5 1/2] Bug fix for speculative abort v10

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 src/backend/replication/logical/decode.c        | 14 ++++-----
 src/backend/replication/logical/reorderbuffer.c | 42 +++++++++++++++++++++++--
 src/include/replication/reorderbuffer.h         |  3 +-
 3 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d7a6f5d..3778ea9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -778,19 +778,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 165ba8f..2f80c73 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -353,6 +353,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -413,6 +416,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			break;
 			/* no data in addition to the struct itself */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1621,8 +1625,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1664,6 +1668,38 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and remaining changes for
+					 * the tables can be ignored.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * Clear the toast hash, we must clean the toast hash
+						 * before we start with a completely new tuple,
+						 * otherwise, while processing the new tuple it would
+						 * create a confusion that whether we need to process
+						 * these toast chunks or not.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -2423,6 +2459,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2716,6 +2753,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 			/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5530c4f..6c2889a 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -59,7 +59,8 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
-	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT
 };
 
 /*
-- 
1.8.3.1

v5-0001-Bug-fix-for-speculative-abort-v12_and_v13.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Bug-fix-for-speculative-abort-v12_and_v13.patchDownload
From e739d144dd9e2ed1e4903d3992b56b0ac76c1526 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Jun 2021 11:09:47 +0530
Subject: [PATCH v5 1/2] Bug fix for speculative abort v13

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 src/backend/replication/logical/decode.c        | 14 ++++-----
 src/backend/replication/logical/reorderbuffer.c | 42 +++++++++++++++++++++++--
 src/include/replication/reorderbuffer.h         |  1 +
 3 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..4985c2a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -800,19 +800,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..5e571ff 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -397,6 +397,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -467,6 +470,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1677,8 +1681,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1720,6 +1724,38 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and remaining changes for
+					 * the tables can be ignored.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * Clear the toast hash, we must clean the toast hash
+						 * before we start with a completely new tuple,
+						 * otherwise, while processing the new tuple it would
+						 * create a confusion that whether we need to process
+						 * these toast chunks or not.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -2640,6 +2676,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -3032,6 +3069,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..dbd2e84 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -62,6 +62,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v5-0001-Bug-fix-for-speculative-abort-v96.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Bug-fix-for-speculative-abort-v96.patchDownload
From ef1253e047556459cdcd415e4e2558353fa41e76 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Jun 2021 13:53:00 +0530
Subject: [PATCH v5 1/2] Bug fix for speculative abort v96

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 src/backend/replication/logical/decode.c        | 14 ++++-----
 src/backend/replication/logical/reorderbuffer.c | 42 +++++++++++++++++++++++--
 src/include/replication/reorderbuffer.h         |  3 +-
 3 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 1300902..571a901 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -778,19 +778,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f0de337..54c8901 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -364,6 +364,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	/* check whether to put into the slab cache */
 	if (rb->nr_cached_transactions < max_cached_transactions)
 	{
@@ -449,6 +452,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			break;
 			/* no data in addition to the struct itself */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1674,8 +1678,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1717,6 +1721,38 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and remaining changes for
+					 * the tables can be ignored.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * Clear the toast hash, we must clean the toast hash
+						 * before we start with a completely new tuple,
+						 * otherwise, while processing the new tuple it would
+						 * create a confusion that whether we need to process
+						 * these toast chunks or not.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -2476,6 +2512,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2764,6 +2801,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 			/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e085088..67f2981 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -59,7 +59,8 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
-	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT
 };
 
 /*
-- 
1.8.3.1

v5-0001-Bug-fix-for-speculative-abort-v11.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Bug-fix-for-speculative-abort-v11.patchDownload
From 1c08381fd1a32406a80f6f9a0bf6ce5916022b7c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Jun 2021 11:09:47 +0530
Subject: [PATCH v5 1/2] Bug fix for speculative abort v11

If speculative insert has a toast table insert then if that tuple
is not confirmed then the toast hash is not cleaned and that is
creating various problem like a) memory leak b) next insert is
using these uncleaned toast data for its insertion and other
error and assersion failure.  So this patch handle that by
queuing the spec abort changes and cleaning up the toast hash
on spec abort.  Currently, in this patch we are queuing up all
the spec abort changes, but as an optimization we can avoid
queuing the spec abort for toast tables but for that we need to
log that as a flag in WAL.
---
 src/backend/replication/logical/decode.c        | 14 ++++-----
 src/backend/replication/logical/reorderbuffer.c | 42 +++++++++++++++++++++++--
 src/include/replication/reorderbuffer.h         |  1 +
 3 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d7da3d6..676f921 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -804,19 +804,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1a4b87c..b9c6016 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -351,6 +351,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -418,6 +421,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1620,8 +1624,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1663,6 +1667,38 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived.  So cleanup the
+					 * specinsert tuple and toast hash.  If spec insert change
+					 * is NULL then do nothing, this is possible because we
+					 * have spec abort for each toast entry.  So we just have
+					 * to clean the specinsert and toast hash for the first
+					 * spec abort for the main table and remaining changes for
+					 * the tables can be ignored.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * Clear the toast hash, we must clean the toast hash
+						 * before we start with a completely new tuple,
+						 * otherwise, while processing the new tuple it would
+						 * create a confusion that whether we need to process
+						 * these toast chunks or not.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/*
+						* If the speculative insertion was aborted, the record
+						* isn't needed anymore.
+						*/
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -2475,6 +2511,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2778,6 +2815,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index c41f362..f51336f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -60,6 +60,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v5-0002-Test-logical-decoding-of-speculative-aborts_HEAD.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Test-logical-decoding-of-speculative-aborts_HEAD.patchDownload
From cb8f217a3c9764ff60294298965809a3b55dbcae Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Tue, 8 Jun 2021 17:11:24 +0530
Subject: [PATCH v5 2/2] Test logical decoding of speculative aborts

Test logical decoding of speculative aborts for toast
insertion followed by insertion into a different table
which doesn't have a toast
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 3 files changed, 197 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b..1cab935 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot speculative_abort
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..7e80a25
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertion are handled during logical decoding
+#
+# Does this by using advisory locks controlling progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
-- 
1.8.3.1

v5-0002-Test-logical-decoding-of-speculative-aborts-v10.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Test-logical-decoding-of-speculative-aborts-v10.patchDownload
From a6bc6c989b3a0fcab69bb1e63af239b30dfd4f9d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Jun 2021 11:11:44 +0530
Subject: [PATCH v5 2/2] Test logical decoding of speculative aborts v10

Test logical decoding of speculative aborts for toast
insertion followed by insertion into a different table
which doesn't have a toast
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 3 files changed, 197 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2db2b27..c7def09 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..7e80a25
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertion are handled during logical decoding
+#
+# Does this by using advisory locks controlling progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
-- 
1.8.3.1

v5-0002-Test-logical-decoding-of-speculative-aborts-v11.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Test-logical-decoding-of-speculative-aborts-v11.patchDownload
From d5f3cb36e68fde7dcd2a207a8e5443ebf7d23c83 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Jun 2021 11:11:44 +0530
Subject: [PATCH v5 2/2] Test logical decoding of speculative aborts v11

Test logical decoding of speculative aborts for toast
insertion followed by insertion into a different table
which doesn't have a toast
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 3 files changed, 197 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 65a91a8..29e5a3e 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..7e80a25
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertion are handled during logical decoding
+#
+# Does this by using advisory locks controlling progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
-- 
1.8.3.1

v5-0002-Test-logical-decoding-of-speculative-aborts-v12_and_v13.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Test-logical-decoding-of-speculative-aborts-v12_and_v13.patchDownload
From 34f99cd579b7994707b207ef8fd748a40c970e4a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Jun 2021 11:31:31 +0530
Subject: [PATCH v5 2/2] Test logical decoding of speculative aborts v13

Test logical decoding of speculative aborts for toast
insertion followed by insertion into a different table
which doesn't have a toast
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 3 files changed, 197 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..14e60bd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..7e80a25
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertion are handled during logical decoding
+#
+# Does this by using advisory locks controlling progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
-- 
1.8.3.1

v5-0002-Test-logical-decoding-of-speculative-aborts-v96.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Test-logical-decoding-of-speculative-aborts-v96.patchDownload
From 4d8e5a68b63ca14abfef755d78f383746b41000f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Thu, 10 Jun 2021 13:57:17 +0530
Subject: [PATCH v5 2/2] Test logical decoding of speculative aborts v96

Test logical decoding of speculative aborts for toast
insertion followed by insertion into a different table
which doesn't have a toast
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 3 files changed, 197 insertions(+), 1 deletion(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b6fc8da..18dcd2d 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -54,7 +54,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(MKDIR_P) isolation_output
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..7e80a25
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertion are handled during logical decoding
+#
+# Does this by using advisory locks controlling progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
-- 
1.8.3.1

#45Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#44)
1 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 10, 2021 at 2:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Jun 9, 2021 at 8:59 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

May I suggest to use a different name in the blurt_and_lock_123()
function, so that it doesn't conflict with the one in
insert-conflict-specconflict? Thanks

Renamed to blurt_and_lock(), is that fine?

I think a non-conflicting name should be fine.

I haved fixed other comments and also prepared patches for the back branches.

Okay, I have verified the fix on all branches and the newly added test
was giving error without patch and passes with code change patch. Few
minor things:
1. You forgot to make the change in ReorderBufferChangeSize for v13 patch.
2. I have made a few changes in the HEAD patch, (a) There was an
unnecessary cleanup of spec insert at one place. I have replaced that
with Assert. (b) I have added and edited few comments both in the code
and test patch.

Please find the patch for HEAD attached. Can you please prepare the
patch for back-branches by doing all the changes I have done in the
patch for HEAD?

--
With Regards,
Amit Kapila.

Attachments:

v6-0001-Fix-decoding-of-speculative-aborts.patchapplication/octet-stream; name=v6-0001-Fix-decoding-of-speculative-aborts.patchDownload
From 3ac8ba2025f5db2b63ab94b0b51789e14ae9ce46 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 10 Jun 2021 18:29:31 +0530
Subject: [PATCH v6] Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 src/backend/replication/logical/decode.c           |  14 ++-
 src/backend/replication/logical/reorderbuffer.c    |  49 ++++++---
 src/include/replication/reorderbuffer.h            |   9 +-
 6 files changed, 245 insertions(+), 25 deletions(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b..1cab935 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot speculative_abort
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..6b3cbfb
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertions are handled during logical decoding.
+#
+# Does this by using advisory locks controlling the progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast.
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..453efc5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1040,19 +1040,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..d905f32 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -443,6 +443,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -520,6 +523,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -2211,8 +2215,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -2254,6 +2258,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert, true);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -2360,16 +2390,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert, true);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -3754,6 +3776,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4017,6 +4040,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4315,6 +4339,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..ba257d8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -46,10 +46,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

#46Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#45)
6 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 10, 2021 at 7:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Please find the patch for HEAD attached. Can you please prepare the
patch for back-branches by doing all the changes I have done in the
patch for HEAD?

Done

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v7-0001-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v7-0001-Fix-decoding-of-speculative-aborts.patchDownload
From 3ac8ba2025f5db2b63ab94b0b51789e14ae9ce46 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 10 Jun 2021 18:29:31 +0530
Subject: [PATCH v6] Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 src/backend/replication/logical/decode.c           |  14 ++-
 src/backend/replication/logical/reorderbuffer.c    |  49 ++++++---
 src/include/replication/reorderbuffer.h            |   9 +-
 6 files changed, 245 insertions(+), 25 deletions(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 9a31e0b..1cab935 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -8,7 +8,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	spill slot truncate stream stats twophase twophase_stream
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
 	oldest_xmin snapshot_transfer subxact_without_top concurrent_stream \
-	twophase_snapshot
+	twophase_snapshot speculative_abort
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..6b3cbfb
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertions are handled during logical decoding.
+#
+# Does this by using advisory locks controlling the progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast.
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..453efc5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1040,19 +1040,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..d905f32 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -443,6 +443,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -520,6 +523,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -2211,8 +2215,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -2254,6 +2258,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert, true);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -2360,16 +2390,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert, true);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -3754,6 +3776,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4017,6 +4040,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4315,6 +4339,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..ba257d8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -46,10 +46,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v7-0001-96-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v7-0001-96-Fix-decoding-of-speculative-aborts.patchDownload
From 5d5116aac9db01242e36bc629fa19af816c47e23 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 10:08:42 +0530
Subject: [PATCH v7] 96-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 src/backend/replication/logical/decode.c           |  14 ++-
 src/backend/replication/logical/reorderbuffer.c    |  48 ++++++---
 src/include/replication/reorderbuffer.h            |  11 +-
 6 files changed, 245 insertions(+), 26 deletions(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index b6fc8da..18dcd2d 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -54,7 +54,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(MKDIR_P) isolation_output
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..6b3cbfb
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertions are handled during logical decoding.
+#
+# Does this by using advisory locks controlling the progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast.
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 1300902..571a901 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -778,19 +778,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f0de337..1cd0bbd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -364,6 +364,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	/* check whether to put into the slab cache */
 	if (rb->nr_cached_transactions < max_cached_transactions)
 	{
@@ -449,6 +452,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			break;
 			/* no data in addition to the struct itself */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1674,8 +1678,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1717,6 +1721,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -1792,16 +1822,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2476,6 +2498,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2764,6 +2787,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 			/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e085088..ddba4bd 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -44,10 +44,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -59,7 +59,8 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
-	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT
 };
 
 /*
-- 
1.8.3.1

v7-0001-v11-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v7-0001-v11-Fix-decoding-of-speculative-aborts.patchDownload
From 96ef95fdb64afcdf1f5c5cb0c298286c2cbc1e9f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 10:58:59 +0530
Subject: [PATCH v7] v11-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 src/backend/replication/logical/decode.c           |  14 ++-
 src/backend/replication/logical/reorderbuffer.c    |  48 ++++++---
 src/include/replication/reorderbuffer.h            |   9 +-
 6 files changed, 244 insertions(+), 25 deletions(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 65a91a8..29e5a3e 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..6b3cbfb
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertions are handled during logical decoding.
+#
+# Does this by using advisory locks controlling the progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast.
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d7da3d6..676f921 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -804,19 +804,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1a4b87c..244d203 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -351,6 +351,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -418,6 +421,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1620,8 +1624,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1695,6 +1699,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						break;
 					}
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -1770,16 +1800,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2475,6 +2497,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2778,6 +2801,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index c41f362..ffb5a94 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -44,10 +44,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -60,6 +60,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v7-0001-v12-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v7-0001-v12-Fix-decoding-of-speculative-aborts.patchDownload
From 30124d6c24daa4b9058259e0e940b1671e6c541b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 11:09:02 +0530
Subject: [PATCH v7] v12-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 src/backend/replication/logical/decode.c           |  14 ++-
 src/backend/replication/logical/reorderbuffer.c    |  48 ++++++---
 src/include/replication/reorderbuffer.h            |   9 +-
 6 files changed, 244 insertions(+), 25 deletions(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..14e60bd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..6b3cbfb
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertions are handled during logical decoding.
+#
+# Does this by using advisory locks controlling the progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast.
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eef2f88..ff18861 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -803,19 +803,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 941ca9a..3a922b7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -359,6 +359,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -426,6 +429,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1628,8 +1632,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1703,6 +1707,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						break;
 					}
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -1778,16 +1808,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2483,6 +2505,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2797,6 +2820,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 49df12a..1ced41f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -44,10 +44,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -60,6 +60,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v7-0001-v10-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v7-0001-v10-Fix-decoding-of-speculative-aborts.patchDownload
From 8aab67b16187179c16eaa4bb29e9673cd4313a5f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 10:45:49 +0530
Subject: [PATCH v7] v10-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 src/backend/replication/logical/decode.c           |  14 ++-
 src/backend/replication/logical/reorderbuffer.c    |  48 ++++++---
 src/include/replication/reorderbuffer.h            |  11 +-
 6 files changed, 245 insertions(+), 26 deletions(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index 2db2b27..c7def09 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -51,7 +51,7 @@ regresscheck-install-force: | submake-regress submake-test_decoding temp-install
 	    $(REGRESSCHECKS)
 
 ISOLATIONCHECKS=mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 isolationcheck: | submake-isolation submake-test_decoding temp-install
 	$(pg_isolation_regress_check) \
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..6b3cbfb
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertions are handled during logical decoding.
+#
+# Does this by using advisory locks controlling the progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast.
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d7a6f5d..3778ea9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -778,19 +778,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 165ba8f..0963b23 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -353,6 +353,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -413,6 +416,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			break;
 			/* no data in addition to the struct itself */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1621,8 +1625,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1664,6 +1668,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -1739,16 +1769,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2423,6 +2445,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2716,6 +2739,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 			/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5530c4f..d455567 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -44,10 +44,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -59,7 +59,8 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
-	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT
 };
 
 /*
-- 
1.8.3.1

v7-0001-v13-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v7-0001-v13-Fix-decoding-of-speculative-aborts.patchDownload
From 2e80138ad62a053ca051172bca4aef0081d59236 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 11:17:34 +0530
Subject: [PATCH v7] v13-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 contrib/test_decoding/Makefile                     |   2 +-
 .../test_decoding/expected/speculative_abort.out   |  85 ++++++++++++++++
 contrib/test_decoding/specs/speculative_abort.spec | 111 +++++++++++++++++++++
 src/backend/replication/logical/decode.c           |  14 ++-
 src/backend/replication/logical/reorderbuffer.c    |  49 ++++++---
 src/include/replication/reorderbuffer.h            |   9 +-
 6 files changed, 245 insertions(+), 25 deletions(-)
 create mode 100644 contrib/test_decoding/expected/speculative_abort.out
 create mode 100644 contrib/test_decoding/specs/speculative_abort.spec

diff --git a/contrib/test_decoding/Makefile b/contrib/test_decoding/Makefile
index f439c58..14e60bd 100644
--- a/contrib/test_decoding/Makefile
+++ b/contrib/test_decoding/Makefile
@@ -7,7 +7,7 @@ REGRESS = ddl xact rewrite toast permissions decoding_in_xact \
 	decoding_into_rel binary prepared replorigin time messages \
 	spill slot truncate
 ISOLATION = mxact delayed_startup ondisk_startup concurrent_ddl_dml \
-	oldest_xmin snapshot_transfer subxact_without_top
+	oldest_xmin snapshot_transfer subxact_without_top speculative_abort
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
 ISOLATION_OPTS = --temp-config $(top_srcdir)/contrib/test_decoding/logical.conf
diff --git a/contrib/test_decoding/expected/speculative_abort.out b/contrib/test_decoding/expected/speculative_abort.out
new file mode 100644
index 0000000..7492506
--- /dev/null
+++ b/contrib/test_decoding/expected/speculative_abort.out
@@ -0,0 +1,85 @@
+Parsed test spec with 3 sessions
+
+starting permutation: controller_locks controller_show_count s1_begin s1_insert_toast s2_insert_toast controller_show_count controller_unlock_1_1 controller_unlock_2_1 controller_unlock_1_3 controller_unlock_2_3 controller_show_count controller_unlock_2_2 controller_show_count controller_unlock_1_2 s1_insert_other s1_commit controller_get_changes
+data           
+
+step controller_locks: SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);
+pg_advisory_locksess           lock           
+
+               1              1              
+               1              2              
+               1              3              
+               2              1              
+               2              2              
+               2              3              
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step s1_begin: BEGIN;
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 3
+step s1_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 3
+step s2_insert_toast: INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; <waiting ...>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_1_1: SELECT pg_advisory_unlock(1, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_2_1: SELECT pg_advisory_unlock(2, 1);
+pg_advisory_unlock
+
+t              
+step controller_unlock_1_3: SELECT pg_advisory_unlock(1, 3);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step controller_unlock_2_3: SELECT pg_advisory_unlock(2, 3);
+pg_advisory_unlock
+
+t              
+s2: NOTICE:  blurt_and_lock() called for 1 in session 2
+s2: NOTICE:  acquiring advisory lock on 2
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+0              
+step controller_unlock_2_2: SELECT pg_advisory_unlock(2, 2);
+pg_advisory_unlock
+
+t              
+step s2_insert_toast: <... completed>
+step controller_show_count: SELECT COUNT(*) FROM tbl1;
+count          
+
+1              
+step controller_unlock_1_2: SELECT pg_advisory_unlock(1, 2);
+pg_advisory_unlock
+
+t              
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+s1: NOTICE:  blurt_and_lock() called for 1 in session 1
+s1: NOTICE:  acquiring advisory lock on 2
+step s1_insert_toast: <... completed>
+step s1_insert_other: INSERT INTO tbl2 VALUES(1);
+step s1_commit: COMMIT;
+step controller_get_changes: SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+data           
+
+BEGIN          
+table public.tbl1: INSERT: a[integer]:1 b[text]:'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+COMMIT         
+BEGIN          
+table public.tbl2: INSERT: a[integer]:1
+COMMIT         
+?column?       
+
+stop           
diff --git a/contrib/test_decoding/specs/speculative_abort.spec b/contrib/test_decoding/specs/speculative_abort.spec
new file mode 100644
index 0000000..6b3cbfb
--- /dev/null
+++ b/contrib/test_decoding/specs/speculative_abort.spec
@@ -0,0 +1,111 @@
+# INSERT ... ON CONFLICT test verifying that speculative abort for toast
+# insertions are handled during logical decoding.
+#
+# Does this by using advisory locks controlling the progress of
+# insertions. By waiting when building the index keys, it's possible
+# to schedule concurrent INSERT ON CONFLICTs so that there will always
+# be a speculative conflict.
+
+setup
+{
+	SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 'test_decoding');
+	DROP TABLE IF EXISTS tbl1;
+	CREATE TABLE tbl1 (a INT, b TEXT);
+	ALTER TABLE tbl1 ALTER COLUMN b SET STORAGE EXTERNAL;
+	CREATE TABLE tbl2 (a INT);
+
+    CREATE OR REPLACE FUNCTION blurt_and_lock(int) RETURNS int IMMUTABLE LANGUAGE plpgsql AS $$
+    BEGIN
+        RAISE NOTICE 'blurt_and_lock() called for % in session %', $1, current_setting('spec.session')::int;
+
+	-- depending on lock state, wait for lock 2 or 3
+        IF pg_try_advisory_xact_lock(current_setting('spec.session')::int, 1) THEN
+            RAISE NOTICE 'acquiring advisory lock on 2';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 2);
+        ELSE
+            RAISE NOTICE 'acquiring advisory lock on 3';
+            PERFORM pg_advisory_xact_lock(current_setting('spec.session')::int, 3);
+        END IF;
+    RETURN $1;
+    END;$$;
+
+	CREATE UNIQUE INDEX idx on tbl1(blurt_and_lock(a));
+
+	-- consume DDL
+	SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
+}
+
+teardown
+{
+    DROP TABLE tbl1;
+    SELECT 'stop' FROM pg_drop_replication_slot('isolation_slot');
+}
+
+session "controller"
+setup
+{
+    SET default_transaction_isolation = 'read committed';
+    SET application_name = 'isolation/insert-specconflict-controller';
+}
+step "controller_locks" {SELECT pg_advisory_lock(sess, lock), sess, lock FROM generate_series(1, 2) a(sess), generate_series(1,3) b(lock);}
+step "controller_unlock_1_1" { SELECT pg_advisory_unlock(1, 1); }
+step "controller_unlock_2_1" { SELECT pg_advisory_unlock(2, 1); }
+step "controller_unlock_1_2" { SELECT pg_advisory_unlock(1, 2); }
+step "controller_unlock_2_2" { SELECT pg_advisory_unlock(2, 2); }
+step "controller_unlock_1_3" { SELECT pg_advisory_unlock(1, 3); }
+step "controller_unlock_2_3" { SELECT pg_advisory_unlock(2, 3); }
+step "controller_show_count" { SELECT COUNT(*) FROM tbl1; }
+step "controller_get_changes" { SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); }
+
+session "s1"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 1;
+    SET application_name = 'isolation/insert-specconflict-s1';
+}
+
+step "s1_begin"  { BEGIN; }
+step "s1_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+step "s1_insert_other" { INSERT INTO tbl2 VALUES(1); }
+step "s1_commit"  { COMMIT; }
+
+session "s2"
+setup
+{
+	SET synchronous_commit=on;
+    SET default_transaction_isolation = 'read committed';
+    SET spec.session = 2;
+    SET application_name = 'isolation/insert-specconflict-s2';
+}
+step "s2_insert_toast" { INSERT INTO tbl1 VALUES(1, repeat('a', 4000)) ON CONFLICT DO NOTHING; }
+
+
+# Test logical decoding of speculative aborts for toast insertion followed by
+# insertion into a different table which doesn't have a toast.
+permutation
+   # acquire a number of locks, to control execution flow - the
+   # blurt_and_lock function acquires advisory locks that allow us to
+   # continue after a) the optimistic conflict probe b) after the
+   # insertion of the speculative tuple.
+   "controller_locks"
+   "controller_show_count"
+   "s1_begin"
+   "s1_insert_toast" "s2_insert_toast"
+   "controller_show_count"
+   # Switch both sessions to wait on the other lock next time (the speculative insertion)
+   "controller_unlock_1_1" "controller_unlock_2_1"
+   # Allow both sessions to continue
+   "controller_unlock_1_3" "controller_unlock_2_3"
+   "controller_show_count"
+   # Allow the second session to finish insertion
+   "controller_unlock_2_2"
+   # This should now show a successful insertion
+   "controller_show_count"
+   # Allow the first session to speculative abort
+   "controller_unlock_1_2"
+   # Insert into other table from s1 and commit
+   "s1_insert_other" "s1_commit"
+   # Get the changes
+   "controller_get_changes"
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..4985c2a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -800,19 +800,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..e86eb8c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -397,6 +397,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -467,6 +470,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1677,8 +1681,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1720,6 +1724,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -1827,16 +1857,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2640,6 +2662,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2747,6 +2770,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -3032,6 +3056,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..676491d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -46,10 +46,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -62,6 +62,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

#47Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#46)
Re: Decoding speculative insert with toast leaks memory

On Fri, Jun 11, 2021 at 11:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jun 10, 2021 at 7:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Please find the patch for HEAD attached. Can you please prepare the
patch for back-branches by doing all the changes I have done in the
patch for HEAD?

Done

Thanks, the patch looks good to me. I'll push these early next week
(Tuesday) unless someone has any other comments or suggestions.

--
With Regards,
Amit Kapila.

#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#47)
Re: Decoding speculative insert with toast leaks memory

On Fri, Jun 11, 2021 at 7:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Jun 11, 2021 at 11:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jun 10, 2021 at 7:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Please find the patch for HEAD attached. Can you please prepare the
patch for back-branches by doing all the changes I have done in the
patch for HEAD?

Done

Thanks, the patch looks good to me. I'll push these early next week
(Tuesday) unless someone has any other comments or suggestions.

I think the test in this patch is quite similar to what Noah has
pointed in the nearby thread [1]/messages/by-id/20210613073407.GA768908@rfd.leadboat.com to be failing at some intervals. Can
you also please once verify the same and if we can expect similar
failures here then we might want to consider dropping the test in this
patch for now? We can always come back to it once we find a good
solution to make it pass consistently.

[1]: /messages/by-id/20210613073407.GA768908@rfd.leadboat.com

--
With Regards,
Amit Kapila.

#49Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#48)
Re: Decoding speculative insert with toast leaks memory

On Mon, Jun 14, 2021 at 8:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the test in this patch is quite similar to what Noah has
pointed in the nearby thread [1] to be failing at some intervals. Can
you also please once verify the same and if we can expect similar
failures here then we might want to consider dropping the test in this
patch for now? We can always come back to it once we find a good
solution to make it pass consistently.

test insert-conflict-do-nothing ... ok 646 ms
test insert-conflict-do-nothing-2 ... ok 1994 ms
test insert-conflict-do-update ... ok 1786 ms
test insert-conflict-do-update-2 ... ok 2689 ms
test insert-conflict-do-update-3 ... ok 851 ms
test insert-conflict-specconflict ... FAILED 3695 ms
test delete-abort-savept ... ok 1238 ms

Yeah, this is the same test that we have used base for our test so
let's not push this test until it becomes stable.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#50Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#49)
6 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Mon, Jun 14, 2021 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 14, 2021 at 8:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the test in this patch is quite similar to what Noah has
pointed in the nearby thread [1] to be failing at some intervals. Can
you also please once verify the same and if we can expect similar
failures here then we might want to consider dropping the test in this
patch for now? We can always come back to it once we find a good
solution to make it pass consistently.

test insert-conflict-do-nothing ... ok 646 ms
test insert-conflict-do-nothing-2 ... ok 1994 ms
test insert-conflict-do-update ... ok 1786 ms
test insert-conflict-do-update-2 ... ok 2689 ms
test insert-conflict-do-update-3 ... ok 851 ms
test insert-conflict-specconflict ... FAILED 3695 ms
test delete-abort-savept ... ok 1238 ms

Yeah, this is the same test that we have used base for our test so
let's not push this test until it becomes stable.

Patches without test case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v8-0001-96-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v8-0001-96-Fix-decoding-of-speculative-aborts.patchDownload
From e73d20545d7f1725dc424de3d9168269bc20ad33 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 10:08:42 +0530
Subject: [PATCH v8] 96-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 src/backend/replication/logical/decode.c        | 14 ++++----
 src/backend/replication/logical/reorderbuffer.c | 48 ++++++++++++++++++-------
 src/include/replication/reorderbuffer.h         | 11 +++---
 3 files changed, 48 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 1300902..571a901 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -778,19 +778,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index f0de337..1cd0bbd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -364,6 +364,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	/* check whether to put into the slab cache */
 	if (rb->nr_cached_transactions < max_cached_transactions)
 	{
@@ -449,6 +452,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			break;
 			/* no data in addition to the struct itself */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1674,8 +1678,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1717,6 +1721,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -1792,16 +1822,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2476,6 +2498,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2764,6 +2787,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 			/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index e085088..ddba4bd 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -44,10 +44,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -59,7 +59,8 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
-	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT
 };
 
 /*
-- 
1.8.3.1

v8-0001-v12-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v8-0001-v12-Fix-decoding-of-speculative-aborts.patchDownload
From f2b9240a860764934cf26f06ad4fd9b73a616a1f Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 11:09:02 +0530
Subject: [PATCH v8] v12-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 src/backend/replication/logical/decode.c        | 14 ++++----
 src/backend/replication/logical/reorderbuffer.c | 48 ++++++++++++++++++-------
 src/include/replication/reorderbuffer.h         |  9 ++---
 3 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eef2f88..ff18861 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -803,19 +803,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 941ca9a..3a922b7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -359,6 +359,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -426,6 +429,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1628,8 +1632,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1703,6 +1707,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						break;
 					}
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -1778,16 +1808,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2483,6 +2505,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2797,6 +2820,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 49df12a..1ced41f 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -44,10 +44,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -60,6 +60,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v8-0001-v10-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v8-0001-v10-Fix-decoding-of-speculative-aborts.patchDownload
From 070ddba30178a5e239b19be83b0e7828a502ab6c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 10:45:49 +0530
Subject: [PATCH v8] v10-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 src/backend/replication/logical/decode.c        | 14 ++++----
 src/backend/replication/logical/reorderbuffer.c | 48 ++++++++++++++++++-------
 src/include/replication/reorderbuffer.h         | 11 +++---
 3 files changed, 48 insertions(+), 25 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d7a6f5d..3778ea9 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -778,19 +778,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 165ba8f..0963b23 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -353,6 +353,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -413,6 +416,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			break;
 			/* no data in addition to the struct itself */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1621,8 +1625,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1664,6 +1668,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -1739,16 +1769,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2423,6 +2445,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2716,6 +2739,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 			/* the base struct contains all the data, easy peasy */
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 5530c4f..d455567 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -44,10 +44,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -59,7 +59,8 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID,
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
-	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT
 };
 
 /*
-- 
1.8.3.1

v8-0001-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v8-0001-Fix-decoding-of-speculative-aborts.patchDownload
From 912f84100e5ba0820b349ff791075209f35e2513 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Thu, 10 Jun 2021 18:29:31 +0530
Subject: [PATCH v8] Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 src/backend/replication/logical/decode.c        | 14 +++----
 src/backend/replication/logical/reorderbuffer.c | 49 +++++++++++++++++++------
 src/include/replication/reorderbuffer.h         |  9 +++--
 3 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7067016..453efc5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -1040,19 +1040,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 2d9e127..d905f32 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -443,6 +443,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -520,6 +523,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change,
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -2211,8 +2215,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -2254,6 +2258,32 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert, true);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -2360,16 +2390,8 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert, true);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -3754,6 +3776,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4017,6 +4040,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -4315,6 +4339,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 0c6e9d1..ba257d8 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -46,10 +46,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v8-0001-v11-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v8-0001-v11-Fix-decoding-of-speculative-aborts.patchDownload
From 3b3433301386ac5f64af2647bd1e9cbac6cec784 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 10:58:59 +0530
Subject: [PATCH v8] v11-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 src/backend/replication/logical/decode.c        | 14 ++++----
 src/backend/replication/logical/reorderbuffer.c | 48 ++++++++++++++++++-------
 src/include/replication/reorderbuffer.h         |  9 ++---
 3 files changed, 47 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d7da3d6..676f921 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -804,19 +804,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 1a4b87c..244d203 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -351,6 +351,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -418,6 +421,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1620,8 +1624,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1695,6 +1699,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 						break;
 					}
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_MESSAGE:
 					rb->message(rb, txn, change->lsn, true,
 								change->data.msg.prefix,
@@ -1770,16 +1800,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2475,6 +2497,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2778,6 +2801,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index c41f362..ffb5a94 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -44,10 +44,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -60,6 +60,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

v8-0001-v13-Fix-decoding-of-speculative-aborts.patchtext/x-patch; charset=US-ASCII; name=v8-0001-v13-Fix-decoding-of-speculative-aborts.patchDownload
From 7b1fbfa9d2fe4365c5c2cfb54794ca34008343b5 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Fri, 11 Jun 2021 11:17:34 +0530
Subject: [PATCH v8] v13-Fix decoding of speculative aborts.

During decoding for speculative inserts, we were relying for cleaning
toast hash on confirmation records or next change records. But that
could lead to multiple problems (a) memory leak if there is neither a
confirmation record nor any other record after toast insertion for a
speculative insert in the transaction, (b) error and assertion failures
if the next operation is not an insert/update on the same table.

The fix is to start queuing spec abort change and clean up toast hash
and change record during its processing. Currently, we are queuing the
spec aborts for both toast and main table even though we perform cleanup
while processing the main table's spec abort record. Later, if we have a
way to distinguish between the spec abort record of toast and the main
table, we can avoid queuing the change for spec aborts of toast tables.

Reported-by: Ashutosh Bapat
Author: Dilip Kumar
Reviewed-by: Amit Kapila
Backpatch-through: 9.6, where it was introduced
Discussion: https://postgr.es/m/CAExHW5sPKF-Oovx_qZe4p5oM6Dvof7_P+XgsNAViug15Fm99jA@mail.gmail.com
---
 src/backend/replication/logical/decode.c        | 14 +++----
 src/backend/replication/logical/reorderbuffer.c | 49 +++++++++++++++++++------
 src/include/replication/reorderbuffer.h         |  9 +++--
 3 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c2e5e3a..4985c2a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -800,19 +800,17 @@ DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 	if (target_node.dbNode != ctx->slot->data.database)
 		return;
 
-	/*
-	 * Super deletions are irrelevant for logical decoding, it's driven by the
-	 * confirmation records.
-	 */
-	if (xlrec->flags & XLH_DELETE_IS_SUPER)
-		return;
-
 	/* output plugin doesn't look for this origin, no need to queue */
 	if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
 		return;
 
 	change = ReorderBufferGetChange(ctx->reorder);
-	change->action = REORDER_BUFFER_CHANGE_DELETE;
+
+	if (xlrec->flags & XLH_DELETE_IS_SUPER)
+		change->action = REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT;
+	else
+		change->action = REORDER_BUFFER_CHANGE_DELETE;
+
 	change->origin_id = XLogRecGetOrigin(r);
 
 	memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode));
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5251932..e86eb8c 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -397,6 +397,9 @@ ReorderBufferReturnTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		txn->invalidations = NULL;
 	}
 
+	/* Reset the toast hash */
+	ReorderBufferToastReset(rb, txn);
+
 	pfree(txn);
 }
 
@@ -467,6 +470,7 @@ ReorderBufferReturnChange(ReorderBuffer *rb, ReorderBufferChange *change)
 			}
 			break;
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
@@ -1677,8 +1681,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			change_done:
 
 					/*
-					 * Either speculative insertion was confirmed, or it was
-					 * unsuccessful and the record isn't needed anymore.
+					 * If speculative insertion was confirmed, the record isn't
+					 * needed anymore.
 					 */
 					if (specinsert != NULL)
 					{
@@ -1720,6 +1724,32 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 					specinsert = change;
 					break;
 
+				case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
+
+					/*
+					 * Abort for speculative insertion arrived. So cleanup the
+					 * specinsert tuple and toast hash.
+					 *
+					 * Note that we get the spec abort change for each toast
+					 * entry but we need to perform the cleanup only the first
+					 * time we get it for the main table.
+					 */
+					if (specinsert != NULL)
+					{
+						/*
+						 * We must clean the toast hash before processing a
+						 * completely new tuple to avoid confusion about the
+						 * previous tuple's toast chunks.
+						 */
+						Assert(change->data.tp.clear_toast_afterwards);
+						ReorderBufferToastReset(rb, txn);
+
+						/* We don't need this record anymore. */
+						ReorderBufferReturnChange(rb, specinsert);
+						specinsert = NULL;
+					}
+					break;
+
 				case REORDER_BUFFER_CHANGE_TRUNCATE:
 					{
 						int			i;
@@ -1827,16 +1857,8 @@ ReorderBufferCommit(ReorderBuffer *rb, TransactionId xid,
 			}
 		}
 
-		/*
-		 * There's a speculative insertion remaining, just clean in up, it
-		 * can't have been successful, otherwise we'd gotten a confirmation
-		 * record.
-		 */
-		if (specinsert)
-		{
-			ReorderBufferReturnChange(rb, specinsert);
-			specinsert = NULL;
-		}
+		/* speculative insertion record must be freed by now */
+		Assert(!specinsert);
 
 		/* clean up the iterator */
 		ReorderBufferIterTXNFinish(rb, iterstate);
@@ -2640,6 +2662,7 @@ ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -2747,6 +2770,7 @@ ReorderBufferChangeSize(ReorderBufferChange *change)
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			/* ReorderBufferChange contains everything important */
@@ -3032,6 +3056,7 @@ ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 				break;
 			}
 		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM:
+		case REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT:
 		case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
 		case REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID:
 			break;
diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
index 019bd38..676491d 100644
--- a/src/include/replication/reorderbuffer.h
+++ b/src/include/replication/reorderbuffer.h
@@ -46,10 +46,10 @@ typedef struct ReorderBufferTupleBuf
  * changes. Users of the decoding facilities will never see changes with
  * *_INTERNAL_* actions.
  *
- * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM changes concern
- * "speculative insertions", and their confirmation respectively.  They're
- * used by INSERT .. ON CONFLICT .. UPDATE.  Users of logical decoding don't
- * have to care about these.
+ * The INTERNAL_SPEC_INSERT and INTERNAL_SPEC_CONFIRM, and INTERNAL_SPEC_ABORT
+ * changes concern "speculative insertions", their confirmation, and abort
+ * respectively.  They're used by INSERT .. ON CONFLICT .. UPDATE.  Users of
+ * logical decoding don't have to care about these.
  */
 enum ReorderBufferChangeType
 {
@@ -62,6 +62,7 @@ enum ReorderBufferChangeType
 	REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
 	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+	REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
 	REORDER_BUFFER_CHANGE_TRUNCATE
 };
 
-- 
1.8.3.1

#51Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#50)
Re: Decoding speculative insert with toast leaks memory

On Mon, Jun 14, 2021 at 12:06 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 14, 2021 at 9:44 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Jun 14, 2021 at 8:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the test in this patch is quite similar to what Noah has
pointed in the nearby thread [1] to be failing at some intervals. Can
you also please once verify the same and if we can expect similar
failures here then we might want to consider dropping the test in this
patch for now? We can always come back to it once we find a good
solution to make it pass consistently.

test insert-conflict-do-nothing ... ok 646 ms
test insert-conflict-do-nothing-2 ... ok 1994 ms
test insert-conflict-do-update ... ok 1786 ms
test insert-conflict-do-update-2 ... ok 2689 ms
test insert-conflict-do-update-3 ... ok 851 ms
test insert-conflict-specconflict ... FAILED 3695 ms
test delete-abort-savept ... ok 1238 ms

Yeah, this is the same test that we have used base for our test so
let's not push this test until it becomes stable.

Patches without test case.

Pushed!

--
With Regards,
Amit Kapila.

#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#51)
Re: Decoding speculative insert with toast leaks memory

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed!

skink reports that this has valgrind issues:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2021-06-15%2020%3A49%3A26

2021-06-16 01:20:13.344 UTC [2198271][4/0:0] LOG: received replication command: IDENTIFY_SYSTEM
2021-06-16 01:20:13.384 UTC [2198271][4/0:0] LOG: received replication command: START_REPLICATION SLOT "sub2" LOGICAL 0/0 (proto_version '1', publication_names '"pub2"')
2021-06-16 01:20:13.454 UTC [2198271][4/0:0] LOG: starting logical decoding for slot "sub2"
2021-06-16 01:20:13.454 UTC [2198271][4/0:0] DETAIL: Streaming transactions committing after 0/157C828, reading WAL from 0/157C7F0.
2021-06-16 01:20:13.488 UTC [2198271][4/0:0] LOG: logical decoding found consistent point at 0/157C7F0
2021-06-16 01:20:13.488 UTC [2198271][4/0:0] DETAIL: There are no running transactions.
...
==2198271== VALGRINDERROR-BEGIN
==2198271== Conditional jump or move depends on uninitialised value(s)
==2198271== at 0x80EF890: rel_sync_cache_relation_cb (pgoutput.c:833)
==2198271== by 0x666AEB: LocalExecuteInvalidationMessage (inval.c:595)
==2198271== by 0x53A423: ReceiveSharedInvalidMessages (sinval.c:90)
==2198271== by 0x666026: AcceptInvalidationMessages (inval.c:683)
==2198271== by 0x53FBDD: LockRelationOid (lmgr.c:136)
==2198271== by 0x1D3943: relation_open (relation.c:56)
==2198271== by 0x26F21F: table_open (table.c:43)
==2198271== by 0x66D97F: ScanPgRelation (relcache.c:346)
==2198271== by 0x674644: RelationBuildDesc (relcache.c:1059)
==2198271== by 0x674BE8: RelationClearRelation (relcache.c:2568)
==2198271== by 0x675064: RelationFlushRelation (relcache.c:2736)
==2198271== by 0x6750A6: RelationCacheInvalidateEntry (relcache.c:2797)
==2198271== Uninitialised value was created by a heap allocation
==2198271== at 0x6AC308: MemoryContextAlloc (mcxt.c:826)
==2198271== by 0x68A8D9: DynaHashAlloc (dynahash.c:283)
==2198271== by 0x68A94B: element_alloc (dynahash.c:1675)
==2198271== by 0x68AA58: get_hash_entry (dynahash.c:1284)
==2198271== by 0x68B23E: hash_search_with_hash_value (dynahash.c:1057)
==2198271== by 0x68B3D4: hash_search (dynahash.c:913)
==2198271== by 0x80EE855: get_rel_sync_entry (pgoutput.c:681)
==2198271== by 0x80EEDA5: pgoutput_truncate (pgoutput.c:530)
==2198271== by 0x4E48A2: truncate_cb_wrapper (logical.c:797)
==2198271== by 0x4EFDDB: ReorderBufferCommit (reorderbuffer.c:1777)
==2198271== by 0x4E1DBE: DecodeCommit (decode.c:637)
==2198271== by 0x4E1F31: DecodeXactOp (decode.c:245)
==2198271==
==2198271== VALGRINDERROR-END

regards, tom lane

#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#52)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 16, 2021 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed!

skink reports that this has valgrind issues:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2021-06-15%2020%3A49%3A26

The problem happens at line:
rel_sync_cache_relation_cb()
{
..
if (entry->map)
..

I think the reason is that before we initialize 'entry->map' in
get_rel_sync_entry(), the invalidation is processed as part of which
when we try to clean up the entry, it tries to access uninitialized
value. Note, this won't happen in HEAD as we initialize 'entry->map'
before we get to process any invalidation. We have fixed a similar
issue in HEAD sometime back as part of the commit 69bd60672a, so we
need to make a similar change in PG-13 as well.

This problem is introduced by commit d250568121 (Fix memory leak due
to RelationSyncEntry.map.) not by the patch in this thread, so keeping
Amit L and Osumi-San in the loop.

--
With Regards,
Amit Kapila.

#54Amit Langote
amitlangote09@gmail.com
In reply to: Amit Kapila (#53)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 17, 2021 at 12:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 16, 2021 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed!

skink reports that this has valgrind issues:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2021-06-15%2020%3A49%3A26

The problem happens at line:
rel_sync_cache_relation_cb()
{
..
if (entry->map)
..

I think the reason is that before we initialize 'entry->map' in
get_rel_sync_entry(), the invalidation is processed as part of which
when we try to clean up the entry, it tries to access uninitialized
value. Note, this won't happen in HEAD as we initialize 'entry->map'
before we get to process any invalidation. We have fixed a similar
issue in HEAD sometime back as part of the commit 69bd60672a, so we
need to make a similar change in PG-13 as well.

This problem is introduced by commit d250568121 (Fix memory leak due
to RelationSyncEntry.map.) not by the patch in this thread, so keeping
Amit L and Osumi-San in the loop.

Thanks.

Maybe not sufficient as a fix, but I wonder if
rel_sync_cache_relation_cb() should really also check that
replicate_valid is true in the following condition:

/*
* Reset schema sent status as the relation definition may have changed.
* Also free any objects that depended on the earlier definition.
*/
if (entry != NULL)
{

If the problem is with HEAD, I don't quite understand why the last
statement of the following code block doesn't suffice:

/* Not found means schema wasn't sent */
if (!found)
{
/* immediately make a new entry valid enough to satisfy callbacks */
entry->schema_sent = false;
entry->streamed_txns = NIL;
entry->replicate_valid = false;
entry->pubactions.pubinsert = entry->pubactions.pubupdate =
entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
entry->publish_as_relid = InvalidOid;
entry->map = NULL; /* will be set by maybe_send_schema() if needed */
}

Do we need the same statement at the end of the following block?

/* Validate the entry */
if (!entry->replicate_valid)
{

--
Amit Langote
EDB: http://www.enterprisedb.com

#55Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#54)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 17, 2021 at 10:39 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Jun 17, 2021 at 12:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 16, 2021 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed!

skink reports that this has valgrind issues:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2021-06-15%2020%3A49%3A26

The problem happens at line:
rel_sync_cache_relation_cb()
{
..
if (entry->map)
..

I think the reason is that before we initialize 'entry->map' in
get_rel_sync_entry(), the invalidation is processed as part of which
when we try to clean up the entry, it tries to access uninitialized
value. Note, this won't happen in HEAD as we initialize 'entry->map'
before we get to process any invalidation. We have fixed a similar
issue in HEAD sometime back as part of the commit 69bd60672a, so we
need to make a similar change in PG-13 as well.

This problem is introduced by commit d250568121 (Fix memory leak due
to RelationSyncEntry.map.) not by the patch in this thread, so keeping
Amit L and Osumi-San in the loop.

Thanks.

Maybe not sufficient as a fix, but I wonder if
rel_sync_cache_relation_cb() should really also check that
replicate_valid is true in the following condition:

I don't think that is required because we initialize the entry in "if
(!found)" case in the HEAD.

/*
* Reset schema sent status as the relation definition may have changed.
* Also free any objects that depended on the earlier definition.
*/
if (entry != NULL)
{

If the problem is with HEAD,

The problem occurs only in PG-13. So, we need to make PG-13 code
similar to HEAD as far as initialization of entry is concerned.

--
With Regards,
Amit Kapila.

#56Amit Langote
amitlangote09@gmail.com
In reply to: Amit Kapila (#55)
1 attachment(s)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 17, 2021 at 3:42 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Jun 17, 2021 at 10:39 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Jun 17, 2021 at 12:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Jun 16, 2021 at 8:18 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Pushed!

skink reports that this has valgrind issues:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&amp;dt=2021-06-15%2020%3A49%3A26

The problem happens at line:
rel_sync_cache_relation_cb()
{
..
if (entry->map)
..

I think the reason is that before we initialize 'entry->map' in
get_rel_sync_entry(), the invalidation is processed as part of which
when we try to clean up the entry, it tries to access uninitialized
value. Note, this won't happen in HEAD as we initialize 'entry->map'
before we get to process any invalidation. We have fixed a similar
issue in HEAD sometime back as part of the commit 69bd60672a, so we
need to make a similar change in PG-13 as well.

This problem is introduced by commit d250568121 (Fix memory leak due
to RelationSyncEntry.map.) not by the patch in this thread, so keeping
Amit L and Osumi-San in the loop.

Thanks.

Maybe not sufficient as a fix, but I wonder if
rel_sync_cache_relation_cb() should really also check that
replicate_valid is true in the following condition:

I don't think that is required because we initialize the entry in "if
(!found)" case in the HEAD.

Yeah, I see that. If we can be sure that the callback can't get
called between hash_search() allocating the entry and the above code
block making the entry look valid, which appears to be the case, then
I guess we don't need to worry.

/*
* Reset schema sent status as the relation definition may have changed.
* Also free any objects that depended on the earlier definition.
*/
if (entry != NULL)
{

If the problem is with HEAD,

The problem occurs only in PG-13. So, we need to make PG-13 code
similar to HEAD as far as initialization of entry is concerned.

Oh I missed that the problem report is for the PG13 branch.

How about the attached patch then?

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

pg13-init-RelationSyncEntry-properly.patchapplication/octet-stream; name=pg13-init-RelationSyncEntry-properly.patchDownload
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 5b33b31515..30cb4e03aa 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -685,7 +685,22 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	Assert(entry != NULL);
 
 	/* Not found means schema wasn't sent */
-	if (!found || !entry->replicate_valid)
+	if (!found)
+	{
+		/*
+		 * Make the new entry valid enough for the callbacks to look at, in
+		 * case any of them get invoked during the more complicated
+		 * initialization steps below.
+		 */
+		entry->schema_sent = false;
+		entry->replicate_valid = false;
+		entry->pubactions.pubinsert = entry->pubactions.pubupdate =
+			entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
+		entry->publish_as_relid = InvalidOid;
+		entry->map = NULL;	/* will be set by maybe_send_schema() if needed */
+	}
+
+	if (!entry->replicate_valid)
 	{
 		List	   *pubids = GetRelationPublications(relid);
 		ListCell   *lc;
@@ -782,13 +797,9 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 		list_free(pubids);
 
 		entry->publish_as_relid = publish_as_relid;
-		entry->map = NULL;	/* will be set by maybe_send_schema() if needed */
 		entry->replicate_valid = true;
 	}
 
-	if (!found)
-		entry->schema_sent = false;
-
 	return entry;
 }
 
#57Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Langote (#56)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 17, 2021 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote:

Oh I missed that the problem report is for the PG13 branch.

How about the attached patch then?

Looks good, one minor comment, how about making the below comment,
same as on the head?

- if (!found || !entry->replicate_valid)
+ if (!found)
+ {
+ /*
+ * Make the new entry valid enough for the callbacks to look at, in
+ * case any of them get invoked during the more complicated
+ * initialization steps below.
+ */

On head:
if (!found)
{
/* immediately make a new entry valid enough to satisfy callbacks */

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#58Amit Langote
amitlangote09@gmail.com
In reply to: Dilip Kumar (#57)
2 attachment(s)
Re: Decoding speculative insert with toast leaks memory

Hi Dilip,

On Thu, Jun 17, 2021 at 4:45 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jun 17, 2021 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote:

Oh I missed that the problem report is for the PG13 branch.

How about the attached patch then?

Looks good,

Thanks for checking.

one minor comment, how about making the below comment,
same as on the head?

- if (!found || !entry->replicate_valid)
+ if (!found)
+ {
+ /*
+ * Make the new entry valid enough for the callbacks to look at, in
+ * case any of them get invoked during the more complicated
+ * initialization steps below.
+ */

On head:
if (!found)
{
/* immediately make a new entry valid enough to satisfy callbacks */

Agree it's better to have the same comment in both branches.

Though, I think it should be "the new entry", not "a new entry". I
find the sentence I wrote a bit more enlightening, but I am fine with
just fixing the aforementioned problem with the existing comment.

I've updated the patch. Also, attaching a patch for HEAD for the
s/a/the change. While at it, I also capitalized "immediately".

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

pg13-init-RelationSyncEntry-properly_v2.patchapplication/octet-stream; name=pg13-init-RelationSyncEntry-properly_v2.patchDownload
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 5b33b31515..da82cafa72 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -685,7 +685,20 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	Assert(entry != NULL);
 
 	/* Not found means schema wasn't sent */
-	if (!found || !entry->replicate_valid)
+	if (!found)
+	{
+		/*
+		 * Immediately make the new entry valid enough to satisfy callbacks.
+		 */
+		entry->schema_sent = false;
+		entry->replicate_valid = false;
+		entry->pubactions.pubinsert = entry->pubactions.pubupdate =
+			entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
+		entry->publish_as_relid = InvalidOid;
+		entry->map = NULL;	/* will be set by maybe_send_schema() if needed */
+	}
+
+	if (!entry->replicate_valid)
 	{
 		List	   *pubids = GetRelationPublications(relid);
 		ListCell   *lc;
@@ -782,13 +795,9 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 		list_free(pubids);
 
 		entry->publish_as_relid = publish_as_relid;
-		entry->map = NULL;	/* will be set by maybe_send_schema() if needed */
 		entry->replicate_valid = true;
 	}
 
-	if (!found)
-		entry->schema_sent = false;
-
 	return entry;
 }
 
HEAD-fix-get_rel_sync_entry-comment.patchapplication/octet-stream; name=HEAD-fix-get_rel_sync_entry-comment.patchDownload
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 63f108f960..3117308fb7 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -1024,7 +1024,9 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
 	/* Not found means schema wasn't sent */
 	if (!found)
 	{
-		/* immediately make a new entry valid enough to satisfy callbacks */
+		/*
+		 * Immediately make the new entry valid enough to satisfy callbacks.
+		 */
 		entry->schema_sent = false;
 		entry->streamed_txns = NIL;
 		entry->replicate_valid = false;
#59Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Langote (#58)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 17, 2021 at 1:35 PM Amit Langote <amitlangote09@gmail.com> wrote:

Hi Dilip,

On Thu, Jun 17, 2021 at 4:45 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Jun 17, 2021 at 12:52 PM Amit Langote <amitlangote09@gmail.com> wrote:

Oh I missed that the problem report is for the PG13 branch.

How about the attached patch then?

Looks good,

Thanks for checking.

one minor comment, how about making the below comment,
same as on the head?

- if (!found || !entry->replicate_valid)
+ if (!found)
+ {
+ /*
+ * Make the new entry valid enough for the callbacks to look at, in
+ * case any of them get invoked during the more complicated
+ * initialization steps below.
+ */

On head:
if (!found)
{
/* immediately make a new entry valid enough to satisfy callbacks */

Agree it's better to have the same comment in both branches.

Though, I think it should be "the new entry", not "a new entry". I
find the sentence I wrote a bit more enlightening, but I am fine with
just fixing the aforementioned problem with the existing comment.

I've updated the patch. Also, attaching a patch for HEAD for the
s/a/the change. While at it, I also capitalized "immediately".

Your patch looks good to me as well. I would like to retain the
comment as it is from master for now. I'll do some testing and push it
tomorrow unless there are additional comments.

--
With Regards,
Amit Kapila.

#60Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#59)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 17, 2021 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Your patch looks good to me as well. I would like to retain the
comment as it is from master for now. I'll do some testing and push it
tomorrow unless there are additional comments.

Pushed!

--
With Regards,
Amit Kapila.

#61Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Amit Kapila (#60)
Re: Decoding speculative insert with toast leaks memory

Hi,

On 6/18/21 5:50 AM, Amit Kapila wrote:

On Thu, Jun 17, 2021 at 2:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Your patch looks good to me as well. I would like to retain the
comment as it is from master for now. I'll do some testing and push it
tomorrow unless there are additional comments.

Pushed!

While rebasing a patch broken by 4daa140a2f5, I noticed that the patch
does this:

@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
        REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
        REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
        REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+       REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
        REORDER_BUFFER_CHANGE_TRUNCATE
 };

I understand adding the ABORT right after CONFIRM

Isn't that an undesirable ABI break for extensions? It changes the value
for the TRUNCATE item, so if an extension references to that somehow
it'd suddenly start failing (until it gets rebuilt). And the failures
would be pretty confusing and seemingly contradicting the code.

FWIW I don't know how likely it is for an extension to depend on the
TRUNCATE value (it'd be far worse for INSERT/UPDATE/DELETE), but seems
moving the new element at the end would solve this.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#61)
Re: Decoding speculative insert with toast leaks memory

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

While rebasing a patch broken by 4daa140a2f5, I noticed that the patch
does this:

@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+       REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
REORDER_BUFFER_CHANGE_TRUNCATE
};

Isn't that an undesirable ABI break for extensions?

I think it's OK in HEAD. I agree we shouldn't do it like that
in the back branches.

regards, tom lane

#63Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#62)
Re: Decoding speculative insert with toast leaks memory

On Wed, Jun 23, 2021 at 8:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

While rebasing a patch broken by 4daa140a2f5, I noticed that the patch
does this:

@@ -63,6 +63,7 @@ enum ReorderBufferChangeType
REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
+       REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT,
REORDER_BUFFER_CHANGE_TRUNCATE
};

Isn't that an undesirable ABI break for extensions?

I think it's OK in HEAD. I agree we shouldn't do it like that
in the back branches.

Okay, I'll change this in back branches and HEAD to keep the code
consistent, or do you think it is better to retain the order in HEAD
as it is and just change it for back-branches?

--
With Regards,
Amit Kapila.

#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Kapila (#63)
Re: Decoding speculative insert with toast leaks memory

Amit Kapila <amit.kapila16@gmail.com> writes:

I think it's OK in HEAD. I agree we shouldn't do it like that
in the back branches.

Okay, I'll change this in back branches and HEAD to keep the code
consistent, or do you think it is better to retain the order in HEAD
as it is and just change it for back-branches?

As I said, I'd keep the natural ordering in HEAD.

regards, tom lane

#65Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#64)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 24, 2021 at 12:25:15AM -0400, Tom Lane wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Okay, I'll change this in back branches and HEAD to keep the code
consistent, or do you think it is better to retain the order in HEAD
as it is and just change it for back-branches?

As I said, I'd keep the natural ordering in HEAD.

Yes, please keep the items in an alphabetical order on HEAD, and just
have the new item at the bottom of the enum in the back-branches.
That's the usual practice.
--
Michael

#66Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Paquier (#65)
Re: Decoding speculative insert with toast leaks memory

On Thu, Jun 24, 2021 at 11:04 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Jun 24, 2021 at 12:25:15AM -0400, Tom Lane wrote:

Amit Kapila <amit.kapila16@gmail.com> writes:

Okay, I'll change this in back branches and HEAD to keep the code
consistent, or do you think it is better to retain the order in HEAD
as it is and just change it for back-branches?

As I said, I'd keep the natural ordering in HEAD.

Yes, please keep the items in an alphabetical order on HEAD, and just
have the new item at the bottom of the enum in the back-branches.
That's the usual practice.

Okay, I have back-patched the change till v11 because before that
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_ABORT is already at the end.

--
With Regards,
Amit Kapila.