Invalid pointer access in logical decoding after error

Started by vignesh C6 months ago15 messages

vignesh21@gmail.com

6 months ago

1 attachment(s)

Hi,

I encountered an invalid pointer access issue. Below are the steps to
reproduce the issue:
-- Create table
CREATE TABLE t1(c1 int, c2 int);

-- Create publications with each publication selecting a different column
CREATE PUBLICATION pub1 for TABLE t1(c1);
CREATE PUBLICATION pub2 for TABLE t1(c2);

-- Create slot
SELECT * FROM pg_create_logical_replication_slot('test', 'pgoutput');

-- Insert couple of records
INSERT INTO t1 VALUES(1,1);
INSERT INTO t1 VALUES(2,2);

-- Execute slot_get_changes which will throw an error because of
different column lists
postgres=# SELECT * FROM pg_logical_slot_get_binary_changes('test',
NULL, NULL, 'proto_version', '4', 'publication_names', 'pub1,pub2');
ERROR: cannot use different column lists for table "public.t1" in
different publications
CONTEXT: slot "test", output plugin "pgoutput", in the change
callback, associated LSN 0/14C3C30

-- The second call simulates an issue where we try to free an invalid pointer
postgres=# SELECT * FROM pg_logical_slot_get_binary_changes('test',
NULL, NULL, 'proto_version', '4', 'publication_names', 'pub1,pub2');
ERROR: pfree called with invalid pointer 0x58983541e6b8 (header
0x6563617073656d61)
CONTEXT: slot "test", output plugin "pgoutput", in the change
callback, associated LSN 0/14C3C30

The error occurs because entry->columns is allocated in the entry
private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
context is a child of the PortalContext, which is cleared after an
error via: AbortTransaction -> AtAbort_Portals ->
MemoryContextDeleteChildren -> MemoryContextDelete ->
MemoryContextDeleteOnly
As a result, the memory backing entry->columns is freed, but the
RelationSyncCache which resides in CacheMemoryContext and thus
survives the error still holds a dangling pointer to this freed
memory, causing it to pfree an invalid pointer.
In the normal (positive) execution flow, pgoutput_shutdown() is called
to clean up the RelationSyncCache. This happens via:
FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown
But this is not called in case of an error case. To handle this case
safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
ensure pgoutput_shutdown is invoked and the stale cache is cleared
appropriately. Attached patch has the changes for the same.
Thoughts?

Regards,
Vignesh

Attachments:

v1-0001-Fix-referencing-invalid-pointer-in-logical-decodi.patchapplication/octet-stream; name=v1-0001-Fix-referencing-invalid-pointer-in-logical-decodi.patchDownload

From 141601d9025806e5d1fc2a739fe391335a817534 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Wed, 2 Jul 2025 10:26:02 +0530
Subject: [PATCH v1] Fix referencing invalid pointer in logical decoding after
 error

When an error occurs while processing changes with conflicting
column lists in different publications, the entry->columns memory
which is allocated in entry private context is freed. However, the
RelationSyncCache still holds a pointer to this memory, leading to
a crash on subsequent access.

To fix this, ensure FreeDecodingContext is called in the error path,
properly clearing the stale cache and preventing use-after-free.
---
 src/backend/replication/logical/logicalfuncs.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c
index ca53caac2f2..0d674f3c2bd 100644
--- a/src/backend/replication/logical/logicalfuncs.c
+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -106,7 +106,7 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	MemoryContext oldcontext;
 	XLogRecPtr	end_of_wal;
 	XLogRecPtr	wait_for_wal_lsn;
-	LogicalDecodingContext *ctx;
+	LogicalDecodingContext *ctx = NULL;
 	ResourceOwner old_resowner = CurrentResourceOwner;
 	ArrayType  *arr;
 	Size		ndim;
@@ -314,6 +314,10 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin
 	}
 	PG_CATCH();
 	{
+		/* free context, call shutdown callback */
+		if (ctx)
+			FreeDecodingContext(ctx);
+
 		/* clear all timetravel entries */
 		InvalidateSystemCaches();
 
-- 
2.43.0

Zhijie Hou (Fujitsu)

houzj.fnst@fujitsu.com

6 months ago

In reply to: vignesh C (#1)

RE: Invalid pointer access in logical decoding after error

On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:

Hi,

I encountered an invalid pointer access issue. Below are the steps to
reproduce the issue:

...

The error occurs because entry->columns is allocated in the entry
private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
context is a child of the PortalContext, which is cleared after an
error via: AbortTransaction
-> AtAbort_Portals ->
MemoryContextDeleteChildren -> MemoryContextDelete ->
MemoryContextDeleteOnly
As a result, the memory backing entry->columns is freed, but the
RelationSyncCache which resides in CacheMemoryContext and thus
survives the error still holds a dangling pointer to this freed
memory, causing it to pfree an invalid pointer.
In the normal (positive) execution flow, pgoutput_shutdown() is called
to clean up the RelationSyncCache. This happens via:
FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
this is not called in case of an error case. To handle this case
safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
Attached patch has the changes for the same.
Thoughts?

Thank you for reporting the issue and providing a fix.

I recall that we identified this general issue with the hash table in pgoutput
in other threads as well [1]/messages/by-id/OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA@OS0PR01MB5716.jpnprd01.prod.outlook.com. The basic consensus [2]/messages/by-id/20230820210545.plmlk3rnxxgkokkj@awork3.anarazel.de is that calling
FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
user code, increasing the risk of encountering another error within PG_CATCH.
This scenario could prevent execution of subsequent code to invalidate syscache
entries, which is problematic.

I think a better fix could be to introduce a memory context reset callback(on
data->cachectx) and perform the actions of pgoutput_shutdown() within it.

[1]: /messages/by-id/OS0PR01MB57167C62D7DA4A8EBBC92B0A941BA@OS0PR01MB5716.jpnprd01.prod.outlook.com
[2]: /messages/by-id/20230820210545.plmlk3rnxxgkokkj@awork3.anarazel.de

Best Regards,
Hou zj

vignesh C

vignesh21@gmail.com

6 months ago

In reply to: Zhijie Hou (Fujitsu) (#2)

1 attachment(s)

Re: Invalid pointer access in logical decoding after error

On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:

Hi,

I encountered an invalid pointer access issue. Below are the steps to
reproduce the issue:

...

The error occurs because entry->columns is allocated in the entry
private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
context is a child of the PortalContext, which is cleared after an
error via: AbortTransaction
-> AtAbort_Portals ->
MemoryContextDeleteChildren -> MemoryContextDelete ->
MemoryContextDeleteOnly
As a result, the memory backing entry->columns is freed, but the
RelationSyncCache which resides in CacheMemoryContext and thus
survives the error still holds a dangling pointer to this freed
memory, causing it to pfree an invalid pointer.
In the normal (positive) execution flow, pgoutput_shutdown() is called
to clean up the RelationSyncCache. This happens via:
FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
this is not called in case of an error case. To handle this case
safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
Attached patch has the changes for the same.
Thoughts?

Thank you for reporting the issue and providing a fix.

I recall that we identified this general issue with the hash table in pgoutput
in other threads as well [1]. The basic consensus [2] is that calling
FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
user code, increasing the risk of encountering another error within PG_CATCH.
This scenario could prevent execution of subsequent code to invalidate syscache
entries, which is problematic.

Yes, let's avoid this.

I think a better fix could be to introduce a memory context reset callback(on
data->cachectx) and perform the actions of pgoutput_shutdown() within it.

The attached v2 version patch has the changes for the same.

Regards,
Vignesh

Attachments:

v2-0001-Fix-referencing-invalid-pointer-in-logical-decodi.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Fix-referencing-invalid-pointer-in-logical-decodi.patchDownload

From dea5cc35c7ad2a86e86c28096b50f132d51bca57 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 3 Jul 2025 19:48:28 +0530
Subject: [PATCH v2] Fix referencing invalid pointer in logical decoding after
 error

When an error occurs while processing changes with conflicting
column lists in different publications, the entry->columns memory
which is allocated in entry private context is freed. However, the
RelationSyncCache still holds a pointer to this memory, leading to
a crash on subsequent access. Fix this by clearing the cached relation
info through MemoryContext reset callback.
---
 src/backend/replication/pgoutput/pgoutput.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 082b4d9d327..1d00587b715 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -424,6 +424,19 @@ parse_output_parameters(List *options, PGOutputData *data)
 				errmsg("option \"%s\" missing", "publication_names"));
 }
 
+/*
+ * Memory context reset callback for clearing the cached relation info.
+ */
+static void
+cache_context_callback(void *arg)
+{
+	if (RelationSyncCache)
+	{
+		hash_destroy(RelationSyncCache);
+		RelationSyncCache = NULL;
+	}
+}
+
 /*
  * Initialize this plugin
  */
@@ -433,6 +446,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 	static bool publication_callback_registered = false;
+	MemoryContextCallback *mcallback;
 
 	/* Create our memory context for private allocations. */
 	data->context = AllocSetContextCreate(ctx->context,
@@ -447,6 +461,10 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 										 "logical replication publication list context",
 										 ALLOCSET_SMALL_SIZES);
 
+	mcallback = palloc0(sizeof(MemoryContextCallback));
+	mcallback->func = cache_context_callback;
+	MemoryContextRegisterResetCallback(data->cachectx, mcallback);
+
 	ctx->output_plugin_private = data;
 
 	/* This plugin uses binary protocol. */
-- 
2.43.0

Masahiko Sawada

sawada.mshk@gmail.com

4 months ago

In reply to: vignesh C (#3)

Re: Invalid pointer access in logical decoding after error

On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:

Hi,

I encountered an invalid pointer access issue. Below are the steps to
reproduce the issue:

...

The error occurs because entry->columns is allocated in the entry
private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
context is a child of the PortalContext, which is cleared after an
error via: AbortTransaction
-> AtAbort_Portals ->
MemoryContextDeleteChildren -> MemoryContextDelete ->
MemoryContextDeleteOnly
As a result, the memory backing entry->columns is freed, but the
RelationSyncCache which resides in CacheMemoryContext and thus
survives the error still holds a dangling pointer to this freed
memory, causing it to pfree an invalid pointer.
In the normal (positive) execution flow, pgoutput_shutdown() is called
to clean up the RelationSyncCache. This happens via:
FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
this is not called in case of an error case. To handle this case
safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
Attached patch has the changes for the same.
Thoughts?

Thank you for reporting the issue and providing a fix.

I recall that we identified this general issue with the hash table in pgoutput
in other threads as well [1]. The basic consensus [2] is that calling
FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
user code, increasing the risk of encountering another error within PG_CATCH.
This scenario could prevent execution of subsequent code to invalidate syscache
entries, which is problematic.

Yes, let's avoid this.

I think a better fix could be to introduce a memory context reset callback(on
data->cachectx) and perform the actions of pgoutput_shutdown() within it.

The attached v2 version patch has the changes for the same.

We've addressed several memory-related issues in pgoutput. While most
of these issues didn't affect logical replication, they did impact
logical decoding called via SQL API. I find that these problems stem
from RelationSyncCache being defined as a file-scope static variable
and being allocated in CacheMemoryContext. I'm wondering if we could
move it to PGOutputData and create it under the logical decoding
context. This would ensure it's automatically cleaned up along with
the logical decoding context.

I also noticed another concerning issue: the entry->streamed_txns list
is maintained in CacheMemoryContext (see
set_schema_sent_in_streamed_txn()). This could lead to memory leaks
when logical decoding called via the SQL API encounters an error. This
issue isn't addressed in the current patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

vignesh C

vignesh21@gmail.com

4 months ago

In reply to: Masahiko Sawada (#4)

Re: Invalid pointer access in logical decoding after error

On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:

Hi,

I encountered an invalid pointer access issue. Below are the steps to
reproduce the issue:

...

The error occurs because entry->columns is allocated in the entry
private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
context is a child of the PortalContext, which is cleared after an
error via: AbortTransaction
-> AtAbort_Portals ->
MemoryContextDeleteChildren -> MemoryContextDelete ->
MemoryContextDeleteOnly
As a result, the memory backing entry->columns is freed, but the
RelationSyncCache which resides in CacheMemoryContext and thus
survives the error still holds a dangling pointer to this freed
memory, causing it to pfree an invalid pointer.
In the normal (positive) execution flow, pgoutput_shutdown() is called
to clean up the RelationSyncCache. This happens via:
FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
this is not called in case of an error case. To handle this case
safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
Attached patch has the changes for the same.
Thoughts?

Thank you for reporting the issue and providing a fix.

I recall that we identified this general issue with the hash table in pgoutput
in other threads as well [1]. The basic consensus [2] is that calling
FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
user code, increasing the risk of encountering another error within PG_CATCH.
This scenario could prevent execution of subsequent code to invalidate syscache
entries, which is problematic.

Yes, let's avoid this.

I think a better fix could be to introduce a memory context reset callback(on
data->cachectx) and perform the actions of pgoutput_shutdown() within it.

The attached v2 version patch has the changes for the same.

We've addressed several memory-related issues in pgoutput. While most
of these issues didn't affect logical replication, they did impact
logical decoding called via SQL API. I find that these problems stem
from RelationSyncCache being defined as a file-scope static variable
and being allocated in CacheMemoryContext. I'm wondering if we could
move it to PGOutputData and create it under the logical decoding
context. This would ensure it's automatically cleaned up along with
the logical decoding context.

I see one issue with keeping RelationSyncCache inside PGOutputData.
Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely
on this cache: they check its validity and update it in response to
invalidation events. The problem is that these callbacks cannot be
unregistered once they are registered. After pgoutput_shutdown() runs,
the hash table is destroyed and later FreeDecodingContext deletes the
logical decoding context. At that point, the registered callbacks may
still fire, but they’ll be holding onto a freed memory.

This is also noted directly in the callbacks:
/*
* We can get here if the plugin was used in SQL interface as the
* RelationSyncCache is destroyed when the decoding finishes, but there is
* no way to unregister the relcache invalidation callback.
*/
if (RelationSyncCache == NULL)
return;

So the core issue is that the callbacks may outlive the cache they
reference, and we don’t have a mechanism to unregister them cleanly.

Thoughts?

Regards,
Vignesh

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: vignesh C (#5)

1 attachment(s)

Re: Invalid pointer access in logical decoding after error

On Mon, Sep 29, 2025 at 1:52 AM vignesh C <vignesh21@gmail.com> wrote:

On Fri, 26 Sept 2025 at 05:29, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jul 3, 2025 at 7:55 AM vignesh C <vignesh21@gmail.com> wrote:

On Wed, 2 Jul 2025 at 13:21, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:

On Wed, Jul 2, 2025 at 2:42 PM vignesh C wrote:

Hi,

I encountered an invalid pointer access issue. Below are the steps to
reproduce the issue:

...

The error occurs because entry->columns is allocated in the entry
private context (entry->entry_cxt) by pub_collist_to_bitmapset(). This
context is a child of the PortalContext, which is cleared after an
error via: AbortTransaction
-> AtAbort_Portals ->
MemoryContextDeleteChildren -> MemoryContextDelete ->
MemoryContextDeleteOnly
As a result, the memory backing entry->columns is freed, but the
RelationSyncCache which resides in CacheMemoryContext and thus
survives the error still holds a dangling pointer to this freed
memory, causing it to pfree an invalid pointer.
In the normal (positive) execution flow, pgoutput_shutdown() is called
to clean up the RelationSyncCache. This happens via:
FreeDecodingContext -> shutdown_cb_wrapper -> pgoutput_shutdown But
this is not called in case of an error case. To handle this case
safely, I suggest calling FreeDecodingContext in the PG_CATCH block to
ensure pgoutput_shutdown is invoked and the stale cache is cleared appropriately.
Attached patch has the changes for the same.
Thoughts?

Thank you for reporting the issue and providing a fix.

I recall that we identified this general issue with the hash table in pgoutput
in other threads as well [1]. The basic consensus [2] is that calling
FreeDecodingContext() within PG_CATCH is not ideal, as this function includes
user code, increasing the risk of encountering another error within PG_CATCH.
This scenario could prevent execution of subsequent code to invalidate syscache
entries, which is problematic.

Yes, let's avoid this.

I think a better fix could be to introduce a memory context reset callback(on
data->cachectx) and perform the actions of pgoutput_shutdown() within it.

The attached v2 version patch has the changes for the same.

We've addressed several memory-related issues in pgoutput. While most
of these issues didn't affect logical replication, they did impact
logical decoding called via SQL API. I find that these problems stem
from RelationSyncCache being defined as a file-scope static variable
and being allocated in CacheMemoryContext. I'm wondering if we could
move it to PGOutputData and create it under the logical decoding
context. This would ensure it's automatically cleaned up along with
the logical decoding context.

I see one issue with keeping RelationSyncCache inside PGOutputData.
Both rel_sync_cache_publication_cb and rel_sync_cache_relation_cb rely
on this cache: they check its validity and update it in response to
invalidation events. The problem is that these callbacks cannot be
unregistered once they are registered. After pgoutput_shutdown() runs,
the hash table is destroyed and later FreeDecodingContext deletes the
logical decoding context. At that point, the registered callbacks may
still fire, but they’ll be holding onto a freed memory.

This is also noted directly in the callbacks:
/*
* We can get here if the plugin was used in SQL interface as the
* RelationSyncCache is destroyed when the decoding finishes, but there is
* no way to unregister the relcache invalidation callback.
*/
if (RelationSyncCache == NULL)
return;

So the core issue is that the callbacks may outlive the cache they
reference, and we don’t have a mechanism to unregister them cleanly.

Thoughts?

I agree with your analysis. It seems there is no convenient way to
move RelationSyncCache inside PGOutputData. So let's proceed with the
proposed approach.

I've done some minor changes to your v2 patch and updated the commit
message. IIUC this patch needs to be backpatched to v15. Please review
the attached patch.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchapplication/octet-stream; name=v3-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchDownload

From b6a940149029634bcd78a9f951b004a7a53a984d Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 3 Jul 2025 19:48:28 +0530
Subject: [PATCH v3] Fix access-to-already-freed-memory issue in pgoutput.

While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.

With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.

Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.

Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch: 15
---
 src/backend/replication/pgoutput/pgoutput.c | 23 +++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 92eb17049c3..a9f163ac704 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -229,6 +229,7 @@ static void send_relation_and_attrs(Relation relation, TransactionId xid,
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void rel_sync_cache_reset_cb(void *arg);
 static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
@@ -435,6 +436,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 	static bool publication_callback_registered = false;
+	MemoryContextCallback *mcallback;
 
 	/* Create our memory context for private allocations. */
 	data->context = AllocSetContextCreate(ctx->context,
@@ -449,6 +451,14 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 										 "logical replication publication list context",
 										 ALLOCSET_SMALL_SIZES);
 
+	/*
+	 * Ensure to cleanup RelationSyncCache even when logical decoding invoked
+	 * via SQL interface ends up with an error.
+	 */
+	mcallback = palloc0(sizeof(MemoryContextCallback));
+	mcallback->func = rel_sync_cache_reset_cb;
+	MemoryContextRegisterResetCallback(data->cachectx, mcallback);
+
 	ctx->output_plugin_private = data;
 
 	/* This plugin uses binary protocol. */
@@ -2432,6 +2442,19 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	}
 }
 
+/*
+ * Memory context reset callback for clearing the cached relation info.
+ */
+static void
+rel_sync_cache_reset_cb(void *arg)
+{
+	if (RelationSyncCache)
+	{
+		hash_destroy(RelationSyncCache);
+		RelationSyncCache = NULL;
+	}
+}
+
 /* Send Replication origin */
 static void
 send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
-- 
2.47.3

Euler Taveira

euler@eulerto.com

3 months ago

In reply to: Masahiko Sawada (#6)

2 attachment(s)

Re: Invalid pointer access in logical decoding after error

On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:

I agree with your analysis. It seems there is no convenient way to
move RelationSyncCache inside PGOutputData. So let's proceed with the
proposed approach.

+1. Shouldn't we add a comment mentioning that it is a possibility?

I've done some minor changes to your v2 patch and updated the commit
message. IIUC this patch needs to be backpatched to v15. Please review
the attached patch.

I verified that the bug exists since v15 as reported. Despite of the test case
provided by Vignesh (which I attached a modified version to be used in v15 or
later), I also added another test case that has a similar problem with
generated columns. This 2nd test case only works for v18 (where the feature was
introduced). This patch also fixes this case.

I'm curious about other cases related to RelationSyncCache. Is there any other
cases that this patch doesn't fix?

This patch looks good to me. Do we really need a new function with the same
content as pgoutput_shutdown? I don't like mcallback. It seems 'm' stands for
'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
_c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
already used in another place with MemoryContextCallback.

--
Euler Taveira
EDB https://www.enterprisedb.com/

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Euler Taveira (#7)

Re: Invalid pointer access in logical decoding after error

On Mon, Oct 6, 2025 at 6:55 PM Euler Taveira <euler@eulerto.com> wrote:

On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:

I agree with your analysis. It seems there is no convenient way to
move RelationSyncCache inside PGOutputData. So let's proceed with the
proposed approach.

+1. Shouldn't we add a comment mentioning that it is a possibility?

I think the primary reason why we cannot simply move RelationSyncCache
to PGOutputData is there is no way to unregister the invalidation
callback, which is already mentioned in some places:

/*
* We can get here if the plugin was used in SQL interface as the
* RelationSyncCache is destroyed when the decoding finishes, but there is
* no way to unregister the relcache invalidation callback.
*/
if (RelationSyncCache == NULL)

I've done some minor changes to your v2 patch and updated the commit
message. IIUC this patch needs to be backpatched to v15. Please review
the attached patch.

I verified that the bug exists since v15 as reported. Despite of the test case
provided by Vignesh (which I attached a modified version to be used in v15 or
later), I also added another test case that has a similar problem with
generated columns. This 2nd test case only works for v18 (where the feature was
introduced). This patch also fixes this case.

Thank you for the tests!

I'm curious about other cases related to RelationSyncCache. Is there any other
cases that this patch doesn't fix?

I think that with the patch we can ensure to clean up
RelationSyncCache alongside with other memory contexts used in
pgoutput, fixing problems stemming from RelationSyncCache referencing
already-freed-memory in the pgoutput memory contexts.

IIUC there is another memory leak issue in pgoutput as I reported on
this thread[1]/messages/by-id/CAD21AoCgqZ0BUpXjVY6tD1jSLtVSdWpG+LZyZimq4Uu3TymTAA@mail.gmail.com. It should be fixed in a separate patch.

This patch looks good to me. Do we really need a new function with the same
content as pgoutput_shutdown?

Probably we can call pgoutput_shutdown() in rel_sync_cache_reset_cb().

I don't like mcallback. It seems 'm' stands for
'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
_c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
already used in another place with MemoryContextCallback.

I think it's better to use 'mcallback' since back branches are already
using it (see commit bbe68c13a for example). While considering
backpatching this patch, I noticed that the memory context reset
callback function should be registered to ctx->context but not
ctx->cachectx.

Regards,

[1]: /messages/by-id/CAD21AoCgqZ0BUpXjVY6tD1jSLtVSdWpG+LZyZimq4Uu3TymTAA@mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

vignesh C

vignesh21@gmail.com

3 months ago

In reply to: Masahiko Sawada (#8)

1 attachment(s)

Re: Invalid pointer access in logical decoding after error

On Tue, 7 Oct 2025 at 23:40, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 6, 2025 at 6:55 PM Euler Taveira <euler@eulerto.com> wrote:

On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:

I agree with your analysis. It seems there is no convenient way to
move RelationSyncCache inside PGOutputData. So let's proceed with the
proposed approach.

+1. Shouldn't we add a comment mentioning that it is a possibility?

I think the primary reason why we cannot simply move RelationSyncCache
to PGOutputData is there is no way to unregister the invalidation
callback, which is already mentioned in some places:

/*
* We can get here if the plugin was used in SQL interface as the
* RelationSyncCache is destroyed when the decoding finishes, but there is
* no way to unregister the relcache invalidation callback.
*/
if (RelationSyncCache == NULL)

I've done some minor changes to your v2 patch and updated the commit
message. IIUC this patch needs to be backpatched to v15. Please review
the attached patch.

I verified that the bug exists since v15 as reported. Despite of the test case
provided by Vignesh (which I attached a modified version to be used in v15 or
later), I also added another test case that has a similar problem with
generated columns. This 2nd test case only works for v18 (where the feature was
introduced). This patch also fixes this case.

Thank you for the tests!

I also have verified that the test works till the PG15 branch.

I'm curious about other cases related to RelationSyncCache. Is there any other
cases that this patch doesn't fix?

I think that with the patch we can ensure to clean up
RelationSyncCache alongside with other memory contexts used in
pgoutput, fixing problems stemming from RelationSyncCache referencing
already-freed-memory in the pgoutput memory contexts.

IIUC there is another memory leak issue in pgoutput as I reported on
this thread[1]. It should be fixed in a separate patch.

+1 to fix it as a separate patch.

This patch looks good to me. Do we really need a new function with the same
content as pgoutput_shutdown?

Probably we can call pgoutput_shutdown() in rel_sync_cache_reset_cb().

Changed it to pgoutput_shutdown as it has the same functionality

I don't like mcallback. It seems 'm' stands for
'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
_c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
already used in another place with MemoryContextCallback.

I think it's better to use 'mcallback' since back branches are already
using it (see commit bbe68c13a for example).

Yes, that is better

While considering
backpatching this patch, I noticed that the memory context reset
callback function should be registered to ctx->context but not
ctx->cachectx.

Modified

The attached v4 version patch has the changes for the same.

Regards,
Vignesh

Attachments:

v4-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchapplication/octet-stream; name=v4-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchDownload

From 071592c308e474e89e8e62d67e6dec1b8cc8f4cc Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 3 Jul 2025 19:48:28 +0530
Subject: [PATCH v4] Fix access-to-already-freed-memory issue in pgoutput.

While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.

With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.

Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.

Author: vignesh C <vignesh21@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch: 15
---
 src/backend/replication/pgoutput/pgoutput.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 92eb17049c3..083a7828b14 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -229,6 +229,7 @@ static void send_relation_and_attrs(Relation relation, TransactionId xid,
 static void rel_sync_cache_relation_cb(Datum arg, Oid relid);
 static void rel_sync_cache_publication_cb(Datum arg, int cacheid,
 										  uint32 hashvalue);
+static void rel_sync_cache_reset_cb(void *arg);
 static void set_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
@@ -435,6 +436,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 	static bool publication_callback_registered = false;
+	MemoryContextCallback *mcallback;
 
 	/* Create our memory context for private allocations. */
 	data->context = AllocSetContextCreate(ctx->context,
@@ -449,6 +451,14 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 										 "logical replication publication list context",
 										 ALLOCSET_SMALL_SIZES);
 
+	/*
+	 * Ensure to cleanup RelationSyncCache even when logical decoding invoked
+	 * via SQL interface ends up with an error.
+	 */
+	mcallback = palloc0(sizeof(MemoryContextCallback));
+	mcallback->func = rel_sync_cache_reset_cb;
+	MemoryContextRegisterResetCallback(ctx->context, mcallback);
+
 	ctx->output_plugin_private = data;
 
 	/* This plugin uses binary protocol. */
@@ -2432,6 +2442,15 @@ rel_sync_cache_publication_cb(Datum arg, int cacheid, uint32 hashvalue)
 	}
 }
 
+/*
+ * Memory context reset callback for clearing the cached relation info.
+ */
+static void
+rel_sync_cache_reset_cb(void *arg)
+{
+	pgoutput_shutdown(NULL);
+}
+
 /* Send Replication origin */
 static void
 send_repl_origin(LogicalDecodingContext *ctx, RepOriginId origin_id,
-- 
2.43.0

#10

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: vignesh C (#9)

5 attachment(s)

Re: Invalid pointer access in logical decoding after error

On Wed, Oct 8, 2025 at 11:39 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 7 Oct 2025 at 23:40, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 6, 2025 at 6:55 PM Euler Taveira <euler@eulerto.com> wrote:

On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:

I agree with your analysis. It seems there is no convenient way to
move RelationSyncCache inside PGOutputData. So let's proceed with the
proposed approach.

+1. Shouldn't we add a comment mentioning that it is a possibility?

I think the primary reason why we cannot simply move RelationSyncCache
to PGOutputData is there is no way to unregister the invalidation
callback, which is already mentioned in some places:

/*
* We can get here if the plugin was used in SQL interface as the
* RelationSyncCache is destroyed when the decoding finishes, but there is
* no way to unregister the relcache invalidation callback.
*/
if (RelationSyncCache == NULL)

I've done some minor changes to your v2 patch and updated the commit
message. IIUC this patch needs to be backpatched to v15. Please review
the attached patch.

I verified that the bug exists since v15 as reported. Despite of the test case
provided by Vignesh (which I attached a modified version to be used in v15 or
later), I also added another test case that has a similar problem with
generated columns. This 2nd test case only works for v18 (where the feature was
introduced). This patch also fixes this case.

Thank you for the tests!

I also have verified that the test works till the PG15 branch.

I'm curious about other cases related to RelationSyncCache. Is there any other
cases that this patch doesn't fix?

I think that with the patch we can ensure to clean up
RelationSyncCache alongside with other memory contexts used in
pgoutput, fixing problems stemming from RelationSyncCache referencing
already-freed-memory in the pgoutput memory contexts.

IIUC there is another memory leak issue in pgoutput as I reported on
this thread[1]. It should be fixed in a separate patch.

+1 to fix it as a separate patch.

This patch looks good to me. Do we really need a new function with the same
content as pgoutput_shutdown?

Probably we can call pgoutput_shutdown() in rel_sync_cache_reset_cb().

Changed it to pgoutput_shutdown as it has the same functionality

I don't like mcallback. It seems 'm' stands for
'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
_c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
already used in another place with MemoryContextCallback.

I think it's better to use 'mcallback' since back branches are already
using it (see commit bbe68c13a for example).

Yes, that is better

While considering
backpatching this patch, I noticed that the memory context reset
callback function should be registered to ctx->context but not
ctx->cachectx.

Modified

The attached v4 version patch has the changes for the same.

Thank you for updating the patch! I was working on creating patches
for backbranches too, so let me share these patches.

I've made some cosmetic changes to align back branch codes, and
attached the patches for master to v15.

One thing we might want to consider is for v14 and v13. They don't
have this bug since the entry_ctx was introduced in v15, but it's
still true for them that RelationSyncCache is not cleaned up in error
cases if pgoutput is used via SQL API. For example, if
RelationSyncCache hash table gets corrupted for some reason, logical
decoding could repeat an error until logical decoding completes
successfully and its shutdown callback is called. Also, it might be a
good idea in general to ensure cleaning up the hash table after use.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments:

REL17_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchapplication/octet-stream; name=REL17_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchDownload

From b2fc158299f9fab539c29e6056aadcb381fb42a3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Oct 2025 10:40:13 -0700
Subject: [PATCH v5] Fix access-to-already-freed-memory issue in pgoutput.

While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.

With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.

Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.

Author: vignesh C <vignesh21@gmail.com>
Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch: 15
---
 src/backend/replication/pgoutput/pgoutput.c | 27 ++++++++++++---------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 99518c6b6dd..972a88e7c18 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -229,6 +229,7 @@ static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static void init_tuple_slot(PGOutputData *data, Relation relation,
 							RelationSyncEntry *entry);
+static void pgoutput_memory_context_reset(void *arg);
 
 /* row filter routines */
 static EState *create_estate_for_relation(Relation rel);
@@ -419,11 +420,18 @@ parse_output_parameters(List *options, PGOutputData *data)
 }
 
 /*
- * Callback of PGOutputData->context in charge of cleaning pubctx.
+ * Memory context reset callback of PGOutputData->context.
  */
 static void
-pgoutput_pubctx_reset_callback(void *arg)
+pgoutput_memory_context_reset(void *arg)
 {
+	if (RelationSyncCache)
+	{
+		hash_destroy(RelationSyncCache);
+		RelationSyncCache = NULL;
+	}
+
+	/* Better safe than sorry */
 	pubctx = NULL;
 }
 
@@ -452,8 +460,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								   "logical replication publication list context",
 								   ALLOCSET_SMALL_SIZES);
 
+	/*
+	 * Ensure to cleanup RelationSyncCache even when logical decoding invoked
+	 * via SQL interface ends up with an error.
+	 */
 	mcallback = palloc0(sizeof(MemoryContextCallback));
-	mcallback->func = pgoutput_pubctx_reset_callback;
+	mcallback->func = pgoutput_memory_context_reset;
 	MemoryContextRegisterResetCallback(ctx->context, mcallback);
 
 	ctx->output_plugin_private = data;
@@ -1729,14 +1741,7 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
 static void
 pgoutput_shutdown(LogicalDecodingContext *ctx)
 {
-	if (RelationSyncCache)
-	{
-		hash_destroy(RelationSyncCache);
-		RelationSyncCache = NULL;
-	}
-
-	/* Better safe than sorry */
-	pubctx = NULL;
+	pgoutput_memory_context_reset(NULL);
 }
 
 /*
-- 
2.47.3

REL15_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchapplication/octet-stream; name=REL15_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchDownload

From 453cc7ec35a06748bfc0dd8600b9fb7ba064bd5a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Oct 2025 10:40:13 -0700
Subject: [PATCH v5] Fix access-to-already-freed-memory issue in pgoutput.

While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.

With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.

Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.

Author: vignesh C <vignesh21@gmail.com>
Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch: 15
---
 src/backend/replication/pgoutput/pgoutput.c | 27 ++++++++++++---------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 64f62de635f..f3acc64c451 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -232,6 +232,7 @@ static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static void init_tuple_slot(PGOutputData *data, Relation relation,
 							RelationSyncEntry *entry);
+static void pgoutput_memory_context_reset(void *arg);
 
 /* row filter routines */
 static EState *create_estate_for_relation(Relation rel);
@@ -393,11 +394,18 @@ parse_output_parameters(List *options, PGOutputData *data)
 }
 
 /*
- * Callback of PGOutputData->context in charge of cleaning pubctx.
+ * Memory context reset callback of PGOutputData->context.
  */
 static void
-pgoutput_pubctx_reset_callback(void *arg)
+pgoutput_memory_context_reset(void *arg)
 {
+	if (RelationSyncCache)
+	{
+		hash_destroy(RelationSyncCache);
+		RelationSyncCache = NULL;
+	}
+
+	/* Better safe than sorry */
 	pubctx = NULL;
 }
 
@@ -426,8 +434,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								   "logical replication publication list context",
 								   ALLOCSET_SMALL_SIZES);
 
+	/*
+	 * Ensure to cleanup RelationSyncCache even when logical decoding invoked
+	 * via SQL interface ends up with an error.
+	 */
 	mcallback = palloc0(sizeof(MemoryContextCallback));
-	mcallback->func = pgoutput_pubctx_reset_callback;
+	mcallback->func = pgoutput_memory_context_reset;
 	MemoryContextRegisterResetCallback(ctx->context, mcallback);
 
 	ctx->output_plugin_private = data;
@@ -1761,14 +1773,7 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
 static void
 pgoutput_shutdown(LogicalDecodingContext *ctx)
 {
-	if (RelationSyncCache)
-	{
-		hash_destroy(RelationSyncCache);
-		RelationSyncCache = NULL;
-	}
-
-	/* Better safe than sorry */
-	pubctx = NULL;
+	pgoutput_memory_context_reset(NULL);
 }
 
 /*
-- 
2.47.3

REL18_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchapplication/octet-stream; name=REL18_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchDownload

From 36cd09f9d46d4341db5c05a720e0645f73998b6c Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 3 Jul 2025 19:48:28 +0530
Subject: [PATCH v5] Fix access-to-already-freed-memory issue in pgoutput.

While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.

With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.

Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.

Author: vignesh C <vignesh21@gmail.com>
Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch: 15
---
 src/backend/replication/pgoutput/pgoutput.c | 29 +++++++++++++++++----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 082b4d9d327..693ac28de3c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -235,6 +235,7 @@ static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static void init_tuple_slot(PGOutputData *data, Relation relation,
 							RelationSyncEntry *entry);
+static void pgoutput_memory_context_reset(void *arg);
 
 /* row filter routines */
 static EState *create_estate_for_relation(Relation rel);
@@ -424,6 +425,19 @@ parse_output_parameters(List *options, PGOutputData *data)
 				errmsg("option \"%s\" missing", "publication_names"));
 }
 
+/*
+ * Memory context reset callback of PGOutputData->context.
+ */
+static void
+pgoutput_memory_context_reset(void *arg)
+{
+	if (RelationSyncCache)
+	{
+		hash_destroy(RelationSyncCache);
+		RelationSyncCache = NULL;
+	}
+}
+
 /*
  * Initialize this plugin
  */
@@ -433,6 +447,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 	static bool publication_callback_registered = false;
+	MemoryContextCallback *mcallback;
 
 	/* Create our memory context for private allocations. */
 	data->context = AllocSetContextCreate(ctx->context,
@@ -447,6 +462,14 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 										 "logical replication publication list context",
 										 ALLOCSET_SMALL_SIZES);
 
+	/*
+	 * Ensure to cleanup RelationSyncCache even when logical decoding invoked
+	 * via SQL interface ends up with an error.
+	 */
+	mcallback = palloc0(sizeof(MemoryContextCallback));
+	mcallback->func = pgoutput_memory_context_reset;
+	MemoryContextRegisterResetCallback(ctx->context, mcallback);
+
 	ctx->output_plugin_private = data;
 
 	/* This plugin uses binary protocol. */
@@ -1758,11 +1781,7 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
 static void
 pgoutput_shutdown(LogicalDecodingContext *ctx)
 {
-	if (RelationSyncCache)
-	{
-		hash_destroy(RelationSyncCache);
-		RelationSyncCache = NULL;
-	}
+	pgoutput_memory_context_reset(NULL);
 }
 
 /*
-- 
2.47.3

REL16_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchapplication/octet-stream; name=REL16_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchDownload

From 94af57e3abae46961bbe573eb0045847f8ddcecd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Oct 2025 10:40:13 -0700
Subject: [PATCH v5] Fix access-to-already-freed-memory issue in pgoutput.

While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.

With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.

Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.

Author: vignesh C <vignesh21@gmail.com>
Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch: 15
---
 src/backend/replication/pgoutput/pgoutput.c | 27 ++++++++++++---------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 32b74bb4752..62e584a2cd1 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -231,6 +231,7 @@ static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static void init_tuple_slot(PGOutputData *data, Relation relation,
 							RelationSyncEntry *entry);
+static void pgoutput_memory_context_reset(void *arg);
 
 /* row filter routines */
 static EState *create_estate_for_relation(Relation rel);
@@ -407,11 +408,18 @@ parse_output_parameters(List *options, PGOutputData *data)
 }
 
 /*
- * Callback of PGOutputData->context in charge of cleaning pubctx.
+ * Memory context reset callback of PGOutputData->context.
  */
 static void
-pgoutput_pubctx_reset_callback(void *arg)
+pgoutput_memory_context_reset(void *arg)
 {
+	if (RelationSyncCache)
+	{
+		hash_destroy(RelationSyncCache);
+		RelationSyncCache = NULL;
+	}
+
+	/* Better safe than sorry */
 	pubctx = NULL;
 }
 
@@ -440,8 +448,12 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 								   "logical replication publication list context",
 								   ALLOCSET_SMALL_SIZES);
 
+	/*
+	 * Ensure to cleanup RelationSyncCache even when logical decoding invoked
+	 * via SQL interface ends up with an error.
+	 */
 	mcallback = palloc0(sizeof(MemoryContextCallback));
-	mcallback->func = pgoutput_pubctx_reset_callback;
+	mcallback->func = pgoutput_memory_context_reset;
 	MemoryContextRegisterResetCallback(ctx->context, mcallback);
 
 	ctx->output_plugin_private = data;
@@ -1725,14 +1737,7 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
 static void
 pgoutput_shutdown(LogicalDecodingContext *ctx)
 {
-	if (RelationSyncCache)
-	{
-		hash_destroy(RelationSyncCache);
-		RelationSyncCache = NULL;
-	}
-
-	/* Better safe than sorry */
-	pubctx = NULL;
+	pgoutput_memory_context_reset(NULL);
 }
 
 /*
-- 
2.47.3

master_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchapplication/octet-stream; name=master_v5-0001-Fix-access-to-already-freed-memory-issue-in-pgout.patchDownload

From 6d24f13538a001e0ccb6197349f9328e9d3d5c39 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignesh21@gmail.com>
Date: Thu, 3 Jul 2025 19:48:28 +0530
Subject: [PATCH v5] Fix access-to-already-freed-memory issue in pgoutput.

While pgoutput caches relation synchronization information in
RelationSyncCache that resides in CacheMemoryContext, each entry's
information (such as row filter expressions and column lists) is
stored in the entry's private memory context (entry_cxt in
RelationSyncEntry), which is a descendant memory context of the
decoding context. If a logical decoding invoked via SQL functions like
pg_logical_slot_get_binary_changes fails with an error, subsequent
logical decoding executions could access already-freed memory of the
entry's cache, resulting in a crash.

With this change, it's ensured that RelationSyncCache is cleaned up
even in error cases by using a memory context reset callback function.

Backpatch to 15, where entry_cxt was introduced for column filtering
and row filtering.

Author: vignesh C <vignesh21@gmail.com>
Co-authored-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Reviewed-by: Euler Taveira <euler@eulerto.com>
Discussion: https://postgr.es/m/CALDaNm0x-aCehgt8Bevs2cm=uhmwS28MvbYq1=s2Ekf0aDPkOA@mail.gmail.com
Backpatch: 15
---
 src/backend/replication/pgoutput/pgoutput.c | 29 +++++++++++++++++----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 80540c017bd..6a8269be9c1 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -235,6 +235,7 @@ static bool get_schema_sent_in_streamed_txn(RelationSyncEntry *entry,
 											TransactionId xid);
 static void init_tuple_slot(PGOutputData *data, Relation relation,
 							RelationSyncEntry *entry);
+static void pgoutput_memory_context_reset(void *arg);
 
 /* row filter routines */
 static EState *create_estate_for_relation(Relation rel);
@@ -426,6 +427,19 @@ parse_output_parameters(List *options, PGOutputData *data)
 				errmsg("option \"%s\" missing", "publication_names"));
 }
 
+/*
+ * Memory context reset callback of PGOutputData->context.
+ */
+static void
+pgoutput_memory_context_reset(void *arg)
+{
+	if (RelationSyncCache)
+	{
+		hash_destroy(RelationSyncCache);
+		RelationSyncCache = NULL;
+	}
+}
+
 /*
  * Initialize this plugin
  */
@@ -435,6 +449,7 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 {
 	PGOutputData *data = palloc0(sizeof(PGOutputData));
 	static bool publication_callback_registered = false;
+	MemoryContextCallback *mcallback;
 
 	/* Create our memory context for private allocations. */
 	data->context = AllocSetContextCreate(ctx->context,
@@ -449,6 +464,14 @@ pgoutput_startup(LogicalDecodingContext *ctx, OutputPluginOptions *opt,
 										 "logical replication publication list context",
 										 ALLOCSET_SMALL_SIZES);
 
+	/*
+	 * Ensure to cleanup RelationSyncCache even when logical decoding invoked
+	 * via SQL interface ends up with an error.
+	 */
+	mcallback = palloc0(sizeof(MemoryContextCallback));
+	mcallback->func = pgoutput_memory_context_reset;
+	MemoryContextRegisterResetCallback(ctx->context, mcallback);
+
 	ctx->output_plugin_private = data;
 
 	/* This plugin uses binary protocol. */
@@ -1760,11 +1783,7 @@ pgoutput_origin_filter(LogicalDecodingContext *ctx,
 static void
 pgoutput_shutdown(LogicalDecodingContext *ctx)
 {
-	if (RelationSyncCache)
-	{
-		hash_destroy(RelationSyncCache);
-		RelationSyncCache = NULL;
-	}
+	pgoutput_memory_context_reset(NULL);
 }
 
 /*
-- 
2.47.3

#11

vignesh C

vignesh21@gmail.com

3 months ago

In reply to: Masahiko Sawada (#10)

Re: Invalid pointer access in logical decoding after error

On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Oct 8, 2025 at 11:39 AM vignesh C <vignesh21@gmail.com> wrote:

On Tue, 7 Oct 2025 at 23:40, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 6, 2025 at 6:55 PM Euler Taveira <euler@eulerto.com> wrote:

On Mon, Oct 6, 2025, at 5:00 PM, Masahiko Sawada wrote:

I agree with your analysis. It seems there is no convenient way to
move RelationSyncCache inside PGOutputData. So let's proceed with the
proposed approach.

+1. Shouldn't we add a comment mentioning that it is a possibility?

I think the primary reason why we cannot simply move RelationSyncCache
to PGOutputData is there is no way to unregister the invalidation
callback, which is already mentioned in some places:

/*
* We can get here if the plugin was used in SQL interface as the
* RelationSyncCache is destroyed when the decoding finishes, but there is
* no way to unregister the relcache invalidation callback.
*/
if (RelationSyncCache == NULL)

I've done some minor changes to your v2 patch and updated the commit
message. IIUC this patch needs to be backpatched to v15. Please review
the attached patch.

I verified that the bug exists since v15 as reported. Despite of the test case
provided by Vignesh (which I attached a modified version to be used in v15 or
later), I also added another test case that has a similar problem with
generated columns. This 2nd test case only works for v18 (where the feature was
introduced). This patch also fixes this case.

Thank you for the tests!

I also have verified that the test works till the PG15 branch.

I'm curious about other cases related to RelationSyncCache. Is there any other
cases that this patch doesn't fix?

I think that with the patch we can ensure to clean up
RelationSyncCache alongside with other memory contexts used in
pgoutput, fixing problems stemming from RelationSyncCache referencing
already-freed-memory in the pgoutput memory contexts.

IIUC there is another memory leak issue in pgoutput as I reported on
this thread[1]. It should be fixed in a separate patch.

+1 to fix it as a separate patch.

This patch looks good to me. Do we really need a new function with the same
content as pgoutput_shutdown?

Probably we can call pgoutput_shutdown() in rel_sync_cache_reset_cb().

Changed it to pgoutput_shutdown as it has the same functionality

I don't like mcallback. It seems 'm' stands for
'memory' but if we want to use crypt names I would suggest 'mcb' (_m_emory
_c_all_b_ack) -- that is also a crypt name but a shorter one. Same pattern is
already used in another place with MemoryContextCallback.

I think it's better to use 'mcallback' since back branches are already
using it (see commit bbe68c13a for example).

Yes, that is better

While considering
backpatching this patch, I noticed that the memory context reset
callback function should be registered to ctx->context but not
ctx->cachectx.

Modified

The attached v4 version patch has the changes for the same.

Thank you for updating the patch! I was working on creating patches
for backbranches too, so let me share these patches.

I've made some cosmetic changes to align back branch codes, and
attached the patches for master to v15.

One thing we might want to consider is for v14 and v13. They don't
have this bug since the entry_ctx was introduced in v15, but it's
still true for them that RelationSyncCache is not cleaned up in error
cases if pgoutput is used via SQL API. For example, if
RelationSyncCache hash table gets corrupted for some reason, logical
decoding could repeat an error until logical decoding completes
successfully and its shutdown callback is called. Also, it might be a
good idea in general to ensure cleaning up the hash table after use.

Agreed, let's backpatch to PG13. Should we also add a test case in the
master branch, given that this issue has been around for a while?

Regards,
Vignesh

#12

Euler Taveira

euler@eulerto.com

3 months ago

In reply to: vignesh C (#11)

Re: Invalid pointer access in logical decoding after error

On Thu, Oct 9, 2025, at 10:40 AM, vignesh C wrote:

On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

One thing we might want to consider is for v14 and v13. They don't
have this bug since the entry_ctx was introduced in v15, but it's
still true for them that RelationSyncCache is not cleaned up in error
cases if pgoutput is used via SQL API. For example, if
RelationSyncCache hash table gets corrupted for some reason, logical
decoding could repeat an error until logical decoding completes
successfully and its shutdown callback is called. Also, it might be a
good idea in general to ensure cleaning up the hash table after use.

Agreed, let's backpatch to PG13. Should we also add a test case in the
master branch, given that this issue has been around for a while?

I'm wondering if it is a good idea because the bug doesn't manifest in v13 and
v14. At least the v13 has its final minor version in less than a month and EOL.
I would have caution when applying fixes to the latest minor version of a
stable branch; there won't be a chance to fix the fix in the next minor
release. Furthermore, in these back branches, the patch doesn't fix a known
issue. I wouldn't bother with these back branches. For v14, if, in a couple of
months, we have some reports that justify the backpatch, fine.

--
Euler Taveira
EDB https://www.enterprisedb.com/

#13

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Euler Taveira (#12)

Re: Invalid pointer access in logical decoding after error

On Thu, Oct 9, 2025 at 8:23 AM Euler Taveira <euler@eulerto.com> wrote:

On Thu, Oct 9, 2025, at 10:40 AM, vignesh C wrote:

On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

One thing we might want to consider is for v14 and v13. They don't
have this bug since the entry_ctx was introduced in v15, but it's
still true for them that RelationSyncCache is not cleaned up in error
cases if pgoutput is used via SQL API. For example, if
RelationSyncCache hash table gets corrupted for some reason, logical
decoding could repeat an error until logical decoding completes
successfully and its shutdown callback is called. Also, it might be a
good idea in general to ensure cleaning up the hash table after use.

Agreed, let's backpatch to PG13. Should we also add a test case in the
master branch, given that this issue has been around for a while?

I'm wondering if it is a good idea because the bug doesn't manifest in v13 and
v14. At least the v13 has its final minor version in less than a month and EOL.
I would have caution when applying fixes to the latest minor version of a
stable branch; there won't be a chance to fix the fix in the next minor
release. Furthermore, in these back branches, the patch doesn't fix a known
issue. I wouldn't bother with these back branches. For v14, if, in a couple of
months, we have some reports that justify the backpatch, fine.

Agreed. I'm hesitant with patching to v13 and v14. We've never got
such a bug report yet and the next minior version of v13 would be the
final release. I'll add some comments in the commit message.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#14

Masahiko Sawada

sawada.mshk@gmail.com

3 months ago

In reply to: Masahiko Sawada (#13)

Re: Invalid pointer access in logical decoding after error

On Thu, Oct 9, 2025 at 10:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 9, 2025 at 8:23 AM Euler Taveira <euler@eulerto.com> wrote:

On Thu, Oct 9, 2025, at 10:40 AM, vignesh C wrote:

On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

One thing we might want to consider is for v14 and v13. They don't
have this bug since the entry_ctx was introduced in v15, but it's
still true for them that RelationSyncCache is not cleaned up in error
cases if pgoutput is used via SQL API. For example, if
RelationSyncCache hash table gets corrupted for some reason, logical
decoding could repeat an error until logical decoding completes
successfully and its shutdown callback is called. Also, it might be a
good idea in general to ensure cleaning up the hash table after use.

Agreed, let's backpatch to PG13. Should we also add a test case in the
master branch, given that this issue has been around for a while?

I'm wondering if it is a good idea because the bug doesn't manifest in v13 and
v14. At least the v13 has its final minor version in less than a month and EOL.
I would have caution when applying fixes to the latest minor version of a
stable branch; there won't be a chance to fix the fix in the next minor
release. Furthermore, in these back branches, the patch doesn't fix a known
issue. I wouldn't bother with these back branches. For v14, if, in a couple of
months, we have some reports that justify the backpatch, fine.

Agreed. I'm hesitant with patching to v13 and v14. We've never got
such a bug report yet and the next minior version of v13 would be the
final release. I'll add some comments in the commit message.

And now pushed (from master to v15).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#15

vignesh C

vignesh21@gmail.com

3 months ago

In reply to: Masahiko Sawada (#14)

Re: Invalid pointer access in logical decoding after error

On Thu, 9 Oct 2025 at 23:33, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 9, 2025 at 10:07 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 9, 2025 at 8:23 AM Euler Taveira <euler@eulerto.com> wrote:

On Thu, Oct 9, 2025, at 10:40 AM, vignesh C wrote:

On Thu, 9 Oct 2025 at 00:16, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

One thing we might want to consider is for v14 and v13. They don't
have this bug since the entry_ctx was introduced in v15, but it's
still true for them that RelationSyncCache is not cleaned up in error
cases if pgoutput is used via SQL API. For example, if
RelationSyncCache hash table gets corrupted for some reason, logical
decoding could repeat an error until logical decoding completes
successfully and its shutdown callback is called. Also, it might be a
good idea in general to ensure cleaning up the hash table after use.

Agreed, let's backpatch to PG13. Should we also add a test case in the
master branch, given that this issue has been around for a while?

I'm wondering if it is a good idea because the bug doesn't manifest in v13 and
v14. At least the v13 has its final minor version in less than a month and EOL.
I would have caution when applying fixes to the latest minor version of a
stable branch; there won't be a chance to fix the fix in the next minor
release. Furthermore, in these back branches, the patch doesn't fix a known
issue. I wouldn't bother with these back branches. For v14, if, in a couple of
months, we have some reports that justify the backpatch, fine.

Agreed. I'm hesitant with patching to v13 and v14. We've never got
such a bug report yet and the next minior version of v13 would be the
final release. I'll add some comments in the commit message.

And now pushed (from master to v15).

Thanks for committing the changes. I monitored the build farm, and the
runs completed successfully. I've also closed the Cf entry at [1]https://commitfest.postgresql.org/patch/5903/.
[1]: https://commitfest.postgresql.org/patch/5903/

Regards,
Vignesh