Forgot to close an open relation in ReorderBufferProcessTXN()
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually,
that happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 52d06285a2..aac6ffc602 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2261,7 +2261,10 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
elog(ERROR, "could not open relation with OID %u", relid);
if (!RelationIsLogicallyLogged(relation))
+ {
+ RelationClose(relation);
continue;
+ }
relations[nrelations++] = relation;
}
On Thu, Apr 15, 2021 at 4:00 PM Japin Li <japinli@hotmail.com> wrote:
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually,
that happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
+1. Your fix looks correct to me but can we test it in some way?
--
With Regards,
Amit Kapila.
Amit Kapila <amit.kapila16@gmail.com> writes:
On Thu, Apr 15, 2021 at 4:00 PM Japin Li <japinli@hotmail.com> wrote:
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually,
that happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
+1. Your fix looks correct to me but can we test it in some way?
I think this code has a bigger problem: it should not be using
RelationIdGetRelation and RelationClose directly. 99.44% of
the backend goes through relation_open or one of the other
relation.c wrappers, so why doesn't this?
Possibly the answer is "it copied the equally misguided code
in pgoutput.c". A quick grep shows nothing else doing it this
way.
regards, tom lane
On Thu, Apr 15, 2021 at 10:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Amit Kapila <amit.kapila16@gmail.com> writes:
On Thu, Apr 15, 2021 at 4:00 PM Japin Li <japinli@hotmail.com> wrote:
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually,
that happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
+1. Your fix looks correct to me but can we test it in some way?
I think this code has a bigger problem: it should not be using
RelationIdGetRelation and RelationClose directly. 99.44% of
the backend goes through relation_open or one of the other
relation.c wrappers, so why doesn't this?
I think it is because relation_open expects either caller to have a
lock on the relation or don't use 'NoLock' lockmode. AFAIU, we don't
need to acquire a lock on relation while decoding changes from WAL
because it uses a historic snapshot to build a relcache entry and all
the later changes to the rel are absorbed while decoding WAL.
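For illustration, here is a minimal sketch (not taken from this thread's patches; the function names are the real PostgreSQL APIs, but the header paths and the usage shown are only an example and may vary by branch) contrasting the usual locked open with the lock-free open/close pair the decoding code relies on:
#include "postgres.h"
#include "access/relation.h"	/* relation_open, relation_close */
#include "storage/lockdefs.h"	/* AccessShareLock */
#include "utils/rel.h"
#include "utils/relcache.h"		/* RelationIdGetRelation, RelationClose */

/* Illustrative only: two ways of opening a relation by OID. */
static void
open_patterns(Oid relid)
{
	/* Normal backend code: take a heavyweight lock while opening. */
	Relation	rel = relation_open(relid, AccessShareLock);

	/* ... work with rel ... */
	relation_close(rel, AccessShareLock);

	/* Decoding path: historic snapshot, intentionally no heavyweight lock. */
	rel = RelationIdGetRelation(relid);
	if (!RelationIsValid(rel))
		elog(ERROR, "could not open relation with OID %u", relid);
	/* ... work with rel ... */
	RelationClose(rel);			/* the call this thread points out was missing */
}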
I think it is also important to *not* acquire any lock on relation
otherwise it can lead to some sort of deadlock or infinite wait in the
decoding process. Consider a case for operations like Truncate (or if
the user has acquired an exclusive lock on the relation in some other
way say via Lock command) which acquires an exclusive lock on
relation, it won't get replicated in synchronous mode (when
synchronous_standby_name is configured). The truncate operation will
wait for the transaction to be replicated to the subscriber and the
decoding process will wait for the Truncate operation to finish.
Possibly the answer is "it copied the equally misguided code
in pgoutput.c".
I think it is following what is done during decoding, otherwise, it
will lead to the problems as described above. We are already
discussing one of the similar problems [1] where pgoutput
unintentionally acquired a lock on the index and led to a sort of
deadlock.
If the above understanding is correct, I think we might want to
improve comments in this area.
[1]: /messages/by-id/OS0PR01MB6113C2499C7DC70EE55ADB82FB759@OS0PR01MB6113.jpnprd01.prod.outlook.com
--
With Regards,
Amit Kapila.
Hi,
On 2021-04-16 08:42:40 +0530, Amit Kapila wrote:
I think it is because relation_open expects either caller to have a
lock on the relation or don't use 'NoLock' lockmode. AFAIU, we don't
need to acquire a lock on relation while decoding changes from WAL
because it uses a historic snapshot to build a relcache entry and all
the later changes to the rel are absorbed while decoding WAL.
Right.
I think it is also important to *not* acquire any lock on relation
otherwise it can lead to some sort of deadlock or infinite wait in the
decoding process. Consider a case for operations like Truncate (or if
the user has acquired an exclusive lock on the relation in some other
way say via Lock command) which acquires an exclusive lock on
relation, it won't get replicated in synchronous mode (when
synchronous_standby_name is configured). The truncate operation will
wait for the transaction to be replicated to the subscriber and the
decoding process will wait for the Truncate operation to finish.
However, this cannot be really relied upon for catalog tables. An output
function might acquire locks or such. But for those we do not need to
decode contents...
This made me take a brief look at pgoutput.c - maybe I am missing
something, but how is the following not a memory leak?
static void
maybe_send_schema(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, ReorderBufferChange *change,
Relation relation, RelationSyncEntry *relentry)
{
...
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
CreateTupleDescCopy(outdesc));
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
If - and that's common - convert_tuples_by_name() won't have to do
anything, the copied tuple descs will be permanently leaked.
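For reference, a rough sketch of one way to avoid that particular leak (assuming the pgoutput.c variable names quoted above; this is not the fix proposed later in the thread, just an illustration of the idea): copy the descriptors, but free them again when no conversion map turns out to be needed.
	/* Sketch: release the copied TupleDescs when convert_tuples_by_name()
	 * decides no conversion is needed and returns NULL. */
	oldctx = MemoryContextSwitchTo(CacheMemoryContext);
	indesc = CreateTupleDescCopy(indesc);
	outdesc = CreateTupleDescCopy(outdesc);
	relentry->map = convert_tuples_by_name(indesc, outdesc);
	if (relentry->map == NULL)
	{
		FreeTupleDesc(indesc);
		FreeTupleDesc(outdesc);
	}
	MemoryContextSwitchTo(oldctx);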
Greetings,
Andres Freund
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de> wrote:
I think it is also important to *not* acquire any lock on relation
otherwise it can lead to some sort of deadlock or infinite wait in the
decoding process. Consider a case for operations like Truncate (or if
the user has acquired an exclusive lock on the relation in some other
way say via Lock command) which acquires an exclusive lock on
relation, it won't get replicated in synchronous mode (when
synchronous_standby_name is configured). The truncate operation will
wait for the transaction to be replicated to the subscriber and the
decoding process will wait for the Truncate operation to finish.
However, this cannot be really relied upon for catalog tables. An output
function might acquire locks or such. But for those we do not need to
decode contents...
True, so, if we don't need to decode contents then we won't have the
problems of the above kind.
This made me take a brief look at pgoutput.c - maybe I am missing
something, but how is the following not a memory leak?
static void
maybe_send_schema(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, ReorderBufferChange *change,
Relation relation, RelationSyncEntry *relentry)
{
...
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
CreateTupleDescCopy(outdesc));
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
If - and that's common - convert_tuples_by_name() won't have to do
anything, the copied tuple descs will be permanently leaked.
I also think this is a permanent leak. I think we need to free all the
memory associated with this map on the invalidation of this particular
relsync entry (basically in rel_sync_cache_relation_cb).
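A rough sketch of that suggestion (assuming pgoutput.c's RelationSyncEntry fields and the existing free_conversion_map(); the updated patch later in the thread does essentially this):
	/* Sketch: in rel_sync_cache_relation_cb(), when the entry is
	 * invalidated, also drop the tuple-conversion map kept for it.
	 * Note that free_conversion_map() does not free the TupleDescs
	 * stored inside the map. */
	if (entry != NULL)
	{
		entry->schema_sent = false;
		if (entry->map)
			free_conversion_map(entry->map);
		entry->map = NULL;
	}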
--
With Regards,
Amit Kapila.
On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 15, 2021 at 4:00 PM Japin Li <japinli@hotmail.com> wrote:
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually,
that happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
+1. Your fix looks correct to me but can we test it in some way?
I have tried to find a test but not able to find one. I have tried
with a foreign table but we don't log truncate for it, see
ExecuteTruncate. It has a check that it will log for relids where
RelationIsLogicallyLogged. If that is the case, it is not clear to me
how we can ever hit this condition? Have you tried to find the test?
--
With Regards,
Amit Kapila.
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de> wrote:
I think it is also important to *not* acquire any lock on relation
otherwise it can lead to some sort of deadlock or infinite wait in the
decoding process. Consider a case for operations like Truncate (or if
the user has acquired an exclusive lock on the relation in some other
way say via Lock command) which acquires an exclusive lock on
relation, it won't get replicated in synchronous mode (when
synchronous_standby_name is configured). The truncate operation will
wait for the transaction to be replicated to the subscriber and the
decoding process will wait for the Truncate operation to finish.
However, this cannot be really relied upon for catalog tables. An output
function might acquire locks or such. But for those we do not need to
decode contents...
I see that if we define a user_catalog_table (create table t1_cat(c1
int) WITH(user_catalog_table = true);), we are able to decode
operations like (insert, truncate) on such a table. What do you mean
by "But for those we do not need to decode contents"?
--
With Regards,
Amit Kapila.
On Sat, 17 Apr 2021 at 14:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 15, 2021 at 4:00 PM Japin Li <japinli@hotmail.com> wrote:
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually,
that happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
+1. Your fix looks correct to me but can we test it in some way?
I have tried to find a test but not able to find one. I have tried
with a foreign table but we don't log truncate for it, see
ExecuteTruncate. It has a check that it will log for relids where
RelationIsLogicallyLogged. If that is the case, it is not clear to me
how we can ever hit this condition? Have you tried to find the test?
I also don't find a test for this. It was introduced in 5dfd1e5a6696,
written by Simon Riggs, Marco Nenciarini and Peter Eisentraut. Maybe they
can explain when we can enter this condition?
--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
On Sat, Apr 17, 2021 at 12:05 PM Japin Li <japinli@hotmail.com> wrote:
On Sat, 17 Apr 2021 at 14:09, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Apr 15, 2021 at 4:00 PM Japin Li <japinli@hotmail.com> wrote:
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually,
that happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
+1. Your fix looks correct to me but can we test it in some way?
I have tried to find a test but not able to find one. I have tried
with a foreign table but we don't log truncate for it, see
ExecuteTruncate. It has a check that it will log for relids where
RelationIsLogicallyLogged. If that is the case, it is not clear to me
how we can ever hit this condition? Have you tried to find the test?
I also don't find a test for this. It is introduced in 5dfd1e5a6696,
wrote by Simon Riggs, Marco Nenciarini and Peter Eisentraut. Maybe they
can explain when we can enter this condition?
My guess is that this has been copied from the code a few lines above
to handle insert/update/delete where it is required to handle some DDL
ops like Alter Table but I think we don't need it here (for Truncate
op). If that understanding turns out to be true then we should either
have an Assert for this or an elog message.
--
With Regards,
Amit Kapila.
On Sat, Apr 17, 2021 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de> wrote:
I think it is also important to *not* acquire any lock on relation
otherwise it can lead to some sort of deadlock or infinite wait in the
decoding process. Consider a case for operations like Truncate (or if
the user has acquired an exclusive lock on the relation in some other
way say via Lock command) which acquires an exclusive lock on
relation, it won't get replicated in synchronous mode (when
synchronous_standby_name is configured). The truncate operation will
wait for the transaction to be replicated to the subscriber and the
decoding process will wait for the Truncate operation to finish.
However, this cannot be really relied upon for catalog tables. An output
function might acquire locks or such. But for those we do not need to
decode contents...
I see that if we define a user_catalog_table (create table t1_cat(c1
int) WITH(user_catalog_table = true);), we are able to decode
operations like (insert, truncate) on such a table. What do you mean
by "But for those we do not need to decode contents"?
I think we are allowed to decode the operations on user catalog tables
because we are using RelationIsLogicallyLogged() instead of
RelationIsAccessibleInLogicalDecoding() in ReorderBufferProcessTXN().
Based on this discussion, I think we should not be allowing decoding
of operations on user catalog tables, so we should use
RelationIsAccessibleInLogicalDecoding to skip such ops in
ReorderBufferProcessTXN(). Am I missing something?
Can you please clarify?
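To make the suggestion concrete, a hypothetical variant of the skip test in ReorderBufferProcessTXN() might look like the sketch below. This is an illustration only, relying on the rel.h macros behaving as described above, and is not a proposed patch:
			/* Skip relations that are not logically logged, and also skip
			 * catalog relations (system or user) that are only accessible
			 * for decoding's own catalog lookups. */
			if (!RelationIsLogicallyLogged(relation) ||
				RelationIsAccessibleInLogicalDecoding(relation))
			{
				RelationClose(relation);
				continue;
			}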
--
With Regards,
Amit Kapila.
On Sat, Apr 17, 2021 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de> wrote:
This made me take a brief look at pgoutput.c - maybe I am missing
something, but how is the following not a memory leak?
static void
maybe_send_schema(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, ReorderBufferChange *change,
Relation relation, RelationSyncEntry *relentry)
{
...
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
CreateTupleDescCopy(outdesc));
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
If - and that's common - convert_tuples_by_name() won't have to do
anything, the copied tuple descs will be permanently leaked.
I also think this is a permanent leak. I think we need to free all the
memory associated with this map on the invalidation of this particular
relsync entry (basically in rel_sync_cache_relation_cb).
I agree there's a problem here.
Back in:
/messages/by-id/CA+HiwqEeU19iQgjN6HF1HTPU0L5+JxyS5CmxaOVGNXBAfUY06Q@mail.gmail.com
I had proposed to move the map creation from maybe_send_schema() to
get_rel_sync_entry(), mainly because the latter is where I realized it
belongs, though a bit too late. Attached is the part of the patch
for this particular issue. It also takes care to release the copied
TupleDescs if the map is found to be unnecessary, thus preventing
leaking into CacheMemoryContext.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
pgoutput-map-tupdesc-leak.patch
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348dcf4..c66e0a20a8 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -444,15 +444,7 @@ maybe_send_schema(LogicalDecodingContext *ctx,
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
- TupleDesc indesc = RelationGetDescr(relation);
- TupleDesc outdesc = RelationGetDescr(ancestor);
- MemoryContext oldctx;
-
- /* Map must live as long as the session does. */
- oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
- MemoryContextSwitchTo(oldctx);
+
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
}
@@ -1108,6 +1100,39 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
list_free(pubids);
entry->publish_as_relid = publish_as_relid;
+
+ /*
+ * While at it, also set the map to convert leaf partition tuples into
+ * the format of the table to publish the tuples as of.
+ */
+ if (publish_as_relid != relid)
+ {
+ Relation relation = RelationIdGetRelation(relid);
+ Relation ancestor = RelationIdGetRelation(publish_as_relid);
+ TupleDesc indesc = RelationGetDescr(relation);
+ TupleDesc outdesc = RelationGetDescr(ancestor);
+ MemoryContext oldctx;
+
+ /* Map, if any, must live as long as the session does. */
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ /*
+ * TupleDescs must be copied before putting into the map, because
+ * they may not live as long as we want the map to live.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ entry->map = convert_tuples_by_name(indesc, outdesc);
+ if (entry->map == NULL)
+ {
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
+ MemoryContextSwitchTo(oldctx);
+ RelationClose(relation);
+ RelationClose(ancestor);
+ }
entry->replicate_valid = true;
}
On Saturday, April 17, 2021 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Apr 17, 2021 at 12:05 PM Japin Li <japinli@hotmail.com> wrote:
On Sat, 17 Apr 2021 at 14:09, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Thu, Apr 15, 2021 at 4:00 PM Japin Li <japinli@hotmail.com> wrote:
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually, that
happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
+1. Your fix looks correct to me but can we test it in some way?
I have tried to find a test but not able to find one. I have tried
with a foreign table but we don't log truncate for it, see
ExecuteTruncate. It has a check that it will log for relids where
RelationIsLogicallyLogged. If that is the case, it is not clear to
me how we can ever hit this condition? Have you tried to find the test?
I also don't find a test for this. It is introduced in 5dfd1e5a6696,
wrote by Simon Riggs, Marco Nenciarini and Peter Eisentraut. Maybe
they can explain when we can enter this condition?
My guess is that this has been copied from the code a few lines above to
handle insert/update/delete where it is required to handle some DDL ops like
Alter Table but I think we don't need it here (for Truncate op). If that
understanding turns out to be true then we should either have an Assert for
this or an elog message.
In this thread, we are discussing 3 topics below...
(1) necessity of the check for REORDER_BUFFER_CHANGE_TRUNCATE in ReorderBufferProcessTXN()
(2) discussion of whether we disallow decoding of operations on user catalog tables or not
(3) memory leak of maybe_send_schema() (patch already provided)
Let's address those one by one.
In terms of (1), which was close to the motivation of this thread,
first of all, I traced the truncate processing
and I think the check is already done on the truncate command side.
I preferred an Assert to an elog that should never be reached,
but it's OK to choose an elog if someone has a strong opinion on it.
Attached is the patch for this.
Best Regards,
Takamichi Osumi
Attachments:
replace_check_of_ReorderBufferProcessTXN_v01.patch
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cb484f..f708ef6 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -2267,8 +2267,12 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, ReorderBufferTXN *txn,
if (!RelationIsValid(relation))
elog(ERROR, "could not open relation with OID %u", relid);
- if (!RelationIsLogicallyLogged(relation))
- continue;
+ /*
+ * Logging the truncate has already done this
+ * test. See ExecuteTruncate for more information
+ * of the WAL write.
+ */
+ Assert(RelationIsLogicallyLogged(relation));
relations[nrelations++] = relation;
}
On Fri, 23 Apr 2021 at 22:33, osumi.takamichi@fujitsu.com <osumi.takamichi@fujitsu.com> wrote:
On Saturday, April 17, 2021 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Apr 17, 2021 at 12:05 PM Japin Li <japinli@hotmail.com> wrote:
On Sat, 17 Apr 2021 at 14:09, Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Thu, Apr 15, 2021 at 4:00 PM Japin Li <japinli@hotmail.com> wrote:
The RelationIdGetRelation() comment says:
Caller should eventually decrement count. (Usually, that
happens by calling RelationClose().)
However, it doesn't do it in ReorderBufferProcessTXN().
I think we should close it, here is a patch that fixes it. Thoughts?
+1. Your fix looks correct to me but can we test it in some way?
I have tried to find a test but not able to find one. I have tried
with a foreign table but we don't log truncate for it, see
ExecuteTruncate. It has a check that it will log for relids where
RelationIsLogicallyLogged. If that is the case, it is not clear to
me how we can ever hit this condition? Have you tried to find the test?
I also don't find a test for this. It is introduced in 5dfd1e5a6696,
wrote by Simon Riggs, Marco Nenciarini and Peter Eisentraut. Maybe
they can explain when we can enter this condition?
My guess is that this has been copied from the code a few lines above to
handle insert/update/delete where it is required to handle some DDL ops like
Alter Table but I think we don't need it here (for Truncate op). If that
understanding turns out to be true then we should either have an Assert for
this or an elog message.
In this thread, we are discussing 3 topics below...
(1) necessity of the check for REORDER_BUFFER_CHANGE_TRUNCATE in ReorderBufferProcessTXN()
(2) discussion of whether we disallow decoding of operations on user catalog tables or not
(3) memory leak of maybe_send_schema() (patch already provided)
Let's address those one by one.
In terms of (1), which was close to the motivation of this thread,
first of all, I traced the truncate processing
and I think the check is done by truncate command side as well.
I preferred Assert rather than never called elog,
but it's OK to choose elog if someone has strong opinion on it.
Attached the patch for this.
+1, make check-world passed.
--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.
On Fri, Apr 23, 2021 at 8:03 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Saturday, April 17, 2021 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I also don't find a test for this. It is introduced in 5dfd1e5a6696,
wrote by Simon Riggs, Marco Nenciarini and Peter Eisentraut. Maybe
they can explain when we can enter this condition?
My guess is that this has been copied from the code a few lines above to
handle insert/update/delete where it is required to handle some DDL ops like
Alter Table but I think we don't need it here (for Truncate op). If that
understanding turns out to be true then we should either have an Assert for
this or an elog message.
In this thread, we are discussing 3 topics below...
(1) necessity of the check for REORDER_BUFFER_CHANGE_TRUNCATE in ReorderBufferProcessTXN()
(2) discussion of whether we disallow decoding of operations on user catalog tables or not
(3) memory leak of maybe_send_schema() (patch already provided)
Let's address those one by one.
In terms of (1), which was close to the motivation of this thread,
I think (1) and (2) are related because if we need (2) then the check
removed by (1) needs to be replaced with another check. So, I am not
sure how to make this decision.
--
With Regards,
Amit Kapila.
On Tuesday, April 20, 2021 12:07 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, Apr 17, 2021 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de> wrote:
This made me take a brief look at pgoutput.c - maybe I am missing
something, but how is the following not a memory leak?
static void
maybe_send_schema(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, ReorderBufferChange *change,
Relation relation, RelationSyncEntry *relentry)
{
...
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
CreateTupleDescCopy(outdesc));
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
If - and that's common - convert_tuples_by_name() won't have to do
anything, the copied tuple descs will be permanently leaked.
I also think this is a permanent leak. I think we need to free all the
memory associated with this map on the invalidation of this particular
relsync entry (basically in rel_sync_cache_relation_cb).
I agree there's a problem here.
Back in:
/messages/by-id/CA+HiwqEeU19iQgjN6HF1HTPU0L5%2BJxyS5CmxaOVGNXBAfUY06Q%40mail.gmail.com
I had proposed to move the map creation from maybe_send_schema() to
get_rel_sync_entry(), mainly because the latter is where I realized it
belongs, though a bit too late. Attached is the part of the patch
for this particular issue. It also takes care to release the copied TupleDescs
if the map is found to be unnecessary, thus preventing leaking into
CacheMemoryContext.
Thank you for sharing the patch.
Your patch looks correct to me, and make check-world passed with it.
Also, I agree with the idea of placing the code that sets up the map in get_rel_sync_entry.
One thing I'd like to ask about is a more rigorous way to confirm that
the memory leak is solved by the patch, beyond just running make check-world.
I ran OSS HEAD under valgrind, expecting to see memory-leak reports whose call
stacks contain CreateTupleDescCopy reached from both pgoutput_change and
pgoutput_truncate, and expecting those reports to disappear after applying the patch.
But I cannot find any single report that pairs the pgoutput functions with
CreateTupleDescCopy when I use OSS HEAD, which means I need more targeted
testing to check whether the memory leak around CreateTupleDescCopy is addressed.
I collected the valgrind logs while running the regression tests under
src/test/subscription, so the code paths of interest should have been exercised.
Could someone give me advice on a way to confirm that the memory leak is solved?
Best Regards,
Takamichi Osumi
On Monday, April 26, 2021 2:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 23, 2021 at 8:03 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Saturday, April 17, 2021 4:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I also don't find a test for this. It is introduced in
5dfd1e5a6696, wrote by Simon Riggs, Marco Nenciarini and Peter
Eisentraut. Maybe they can explain when we can enter this condition?
My guess is that this has been copied from the code a few lines
above to handle insert/update/delete where it is required to handle
some DDL ops like Alter Table but I think we don't need it here (for
Truncate op). If that understanding turns out to be true then we
should either have an Assert for this or an elog message.
In this thread, we are discussing 3 topics below...
(1) necessity of the check for REORDER_BUFFER_CHANGE_TRUNCATE
in
ReorderBufferProcessTXN()
(2) discussion of whether we disallow decoding of operations on user
catalog tables or not
(3) memory leak of maybe_send_schema() (patch already provided)
Let's address those one by one.
In terms of (1), which was close to the motivation of this thread,
I think (1) and (2) are related because if we need (2) then the check removed
by (1) needs to be replaced with another check. So, I am not sure how to
make this decision.
Yeah, you are right.
On Monday, April 19, 2021 9:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Apr 17, 2021 at 12:01 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de>
wrote:
I think it is also important to *not* acquire any lock on relation
otherwise it can lead to some sort of deadlock or infinite wait in
the decoding process. Consider a case for operations like Truncate
(or if the user has acquired an exclusive lock on the relation in
some other way say via Lock command) which acquires an exclusive
lock on relation, it won't get replicated in synchronous mode
(when synchronous_standby_name is configured). The truncate
operation will wait for the transaction to be replicated to the
subscriber and the decoding process will wait for the Truncate operation to finish.
However, this cannot be really relied upon for catalog tables. An
output function might acquire locks or such. But for those we do not
need to decode contents...
I see that if we define a user_catalog_table (create table t1_cat(c1
int) WITH(user_catalog_table = true);), we are able to decode
operations like (insert, truncate) on such a table. What do you mean
by "But for those we do not need to decode contents"?I think we are allowed to decode the operations on user catalog tables
because we are using RelationIsLogicallyLogged() instead of
RelationIsAccessibleInLogicalDecoding() in ReorderBufferProcessTXN().
Based on this discussion, I think we should not be allowing decoding of
operations on user catalog tables, so we should use
RelationIsAccessibleInLogicalDecoding to skip such ops in
ReorderBufferProcessTXN(). Am I missing something?
Can you please clarify?
I don't understand that point, either.
I read the thread where user_catalog_table was introduced [1].
There, I couldn't find any discussion of whether we should skip decoding operations
on that kind of table or not; accordingly, I suppose it was simply never settled.
What surprised me a bit is that decoding operations on system catalog tables was
apparently being considered at the time, as in [2]. I cannot find any concrete
description of such use cases in the thread, though.
Anyway, I feel that disallowing decoding of operations on user catalog tables
doesn't spoil the feature's purpose, so I'm OK with doing that. What do you think?
[1]: /messages/by-id/20130914204913.GA4071@awork2.anarazel.de
Note that in this discussion, user_catalog_table was renamed from
treat_as_catalog_table in the middle of the thread; searching for that name may save you some time when looking through it.
[2]: /messages/by-id/CA+TgmobhDCHuckL_86wRDWJ31Gw3Y1HrQ4yUKEn7U1_hTbeVqQ@mail.gmail.com
Best Regards,
Takamichi Osumi
On Wed, Apr 28, 2021 at 5:36 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Monday, April 26, 2021 2:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 23, 2021 at 8:03 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
I think we are allowed to decode the operations on user catalog tables
because we are using RelationIsLogicallyLogged() instead of
RelationIsAccessibleInLogicalDecoding() in ReorderBufferProcessTXN().
Based on this discussion, I think we should not be allowing decoding of
operations on user catalog tables, so we should use
RelationIsAccessibleInLogicalDecoding to skip such ops in
ReorderBufferProcessTXN(). Am I missing something?
Can you please clarify?
I don't understand that point, either.
I read the context where the user_catalog_table was introduced - [1].
There, I couldn't find any discussion if we should skip decode operations
on that kind of tables or not. Accordingly, we just did not conclude it, I suppose.
What surprised me a bit is to decode operations of system catalog table are considered like [2]
somehow at the time. I cannot find any concrete description of such use cases in the thread, though.
Anyway, I felt disallowing decoding of operations on user catalog tables
doesn't spoil the feature's purpose. So, I'm OK to do so. What do you think ?
I am not so sure about it because I think we don't have any example of
user_catalog_tables in the core code. This is the reason I was kind of
looking towards Andres to clarify this. Right now, if the user
performs TRUNCATE on user_catalog_table in synchronous mode then it
will hang in case the decoding plugin takes even share lock on it. The
main reason is that we allow decoding of TRUNCATE operation for
user_catalog_tables. I think even if we want to allow decoding of
other operations on user_catalog_table, the decoding of TRUNCATE
should be prohibited but maybe we shouldn't allow decoding of any
operation on such tables as we don't do it for system catalog tables.
--
With Regards,
Amit Kapila.
Takamichi-san,
On Tue, Apr 27, 2021 at 9:37 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Tuesday, April 20, 2021 12:07 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, Apr 17, 2021 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de> wrote:
This made me take a brief look at pgoutput.c - maybe I am missing
something, but how is the following not a memory leak?
static void
maybe_send_schema(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, ReorderBufferChange *change,
Relation relation, RelationSyncEntry *relentry)
{
...
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
CreateTupleDescCopy(outdesc));
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
If - and that's common - convert_tuples_by_name() won't have to do
anything, the copied tuple descs will be permanently leaked.
I also think this is a permanent leak. I think we need to free all the
memory associated with this map on the invalidation of this particular
relsync entry (basically in rel_sync_cache_relation_cb).
I agree there's a problem here.
Back in:
/messages/by-id/CA+HiwqEeU19iQgjN6HF1HTPU0L5%2BJxyS5CmxaOVGNXBAfUY06Q%40mail.gmail.com
I had proposed to move the map creation from maybe_send_schema() to
get_rel_sync_entry(), mainly because the latter is where I realized it
belongs, though a bit too late. Attached is the part of the patch
for this particular issue. It also takes care to release the copied TupleDescs
if the map is found to be unnecessary, thus preventing leaking into
CacheMemoryContext.
Thank you for sharing the patch.
Your patch looks correct to me. Make check-world has
passed with it. Also, I agree with the idea to place
the processing to set the map in the get_rel_sync_entry.
Thanks for checking.
One thing I'd like to ask is an advanced way to confirm
the memory leak is solved by the patch, not just by running make check-world.
I used OSS HEAD and valgrind, expecting that
I could see function stack which has a call of CreateTupleDescCopy
from both pgoutput_change and pgoutput_truncate as memory leak report
in the valgrind logs, and they disappear after applying the patch.
But, I cannot find the pair of pgoutput functions and CreateTupleDescCopy in one report
when I used OSS HEAD, which means that I need to do advanced testing to check if
the memory leak of CreateTupleDescCopy is addressed.
I collected the logs from RT at src/test/subscription so should pass the routes of our interest.
Could someone give me
an advise about the way to confirm the memory leak is solved ?
I have not used valgrind or other testing methods to check this.
To me, it's clear in this case by only looking at the code that the
TupleDescs returned by CreateTupleDescCopy() ought to be freed when
convert_tuples_by_name() determines that no map is necessary such that
there will be no need to keep those TupleDesc copies around. Failing
that, those copies end up in CacheMemoryContext without anything
pointing to them, hence the leak. Actually, since maybe_send_schema()
doesn't execute this code if schema_sent is found to have been set by
earlier calls, the leak in question should occur only once in most
tests.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Thursday, April 29, 2021 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I am not so sure about it because I think we don't have any example of
user_catalog_tables in the core code. This is the reason I was kind of looking
towards Andres to clarify this. Right now, if the user performs TRUNCATE on
user_catalog_table in synchronous mode then it will hang in case the
decoding plugin takes even share lock on it. The main reason is that we allow
decoding of TRUNCATE operation for user_catalog_tables. I think even if we
want to allow decoding of other operations on user_catalog_table, the
decoding of TRUNCATE should be prohibited but maybe we shouldn't allow
decoding of any operation on such tables as we don't do it for system catalog
tables.
I tried the following scenarios to reproduce this.
Scenario1:
(1) set up 1 publisher and 1 subscriber
(2) create table with user_catalog_table = true on the pub
(3) insert some data to this table
(4) create publication for the table on the pub
(5) create table with user_catalog_table = true on the sub
(6) create subscription on the sub
(7) add synchronous_standby_names to publisher's configuration and restart the pub
(8) have 1 session to hold a lock to the user_catalog_table on the pub in access share mode
(9) have another session to truncate the user_catalog_table on the pub
Here, it keeps waiting, but I'm not sure this is the scenario described above,
since this deadlock is caused by (8)'s lock.
Scenario2:
(1) set up 1 publisher and 1 subscriber
(2) create table with user_catalog_table = true on the pub
(3) insert some data to this table
(4) create publication for the table on the pub
(5) create table with user_catalog_table = true on the sub
(6) create subscription on the sub
(7) add synchronous_standby_names to publisher's configuration and restart the pub
(8) have a session to truncate the user_catalog_table on the pub
Scenario 2 was successful.
Are these the scenarios you had in mind? If not, please let me know
the missing steps.
I would like to reproduce the scenario and write a patch to fix this.
Best Regards,
Takamichi Osumi
On Thu, May 13, 2021 at 11:15 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Thursday, April 29, 2021 2:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I am not so sure about it because I think we don't have any example of
user_catalog_tables in the core code. This is the reason I was kind of looking
towards Andres to clarify this. Right now, if the user performs TRUNCATE on
user_catalog_table in synchronous mode then it will hang in case the
decoding plugin takes even share lock on it. The main reason is that we allow
decoding of TRUNCATE operation for user_catalog_tables. I think even if we
want to allow decoding of other operations on user_catalog_table, the
decoding of TRUNCATE should be prohibited but maybe we shouldn't allow
decoding of any operation on such tables as we don't do it for system catalog
tables.
I tried the following scenarios for trying to reproduce this.
Scenario1:
(1) set up 1 publisher and 1 subscriber
(2) create table with user_catalog_table = true on the pub
(3) insert some data to this table
(4) create publication for the table on the pub
(5) create table with user_catalog_table = true on the sub
(6) create subscription on the sub
(7) add synchronous_standby_names to publisher's configuration and restart the pub
(8) have 1 session to hold a lock to the user_catalog_table on the pub in access share mode
(9) have another session to truncate the user_catalog_table on the pub
Here, it keeps waiting but I'm not sure this is the scenario described above,
since this deadlock is caused by (8)'s lock.
This is a lock time-out scenario, not a deadlock.
Scenario2:
(1) set up 1 publisher and 1 subscriber
(2) create table with user_catalog_table = true on the pub
(3) insert some data to this table
(4) create publication for the table on the pub
(5) create table with user_catalog_table = true on the sub
(6) create subscription on the sub
(7) add synchronous_standby_names to publisher's configuration and restart the pub
(8) have a session to truncate the user_catalog_table on the pub
Scenario 2 was successful.
Yeah, because pgoutput or for that matter even test_decoding doesn't
acquire a lock on user catalog tables.
Are these the scenario you have in mind,
if not please let me know for the missing steps.
I would like to reproduce the scenario and write a patch to fix this.
I don't think we can reproduce it with core plugins as they don't lock
user catalog tables. We either need to write a minimal decoding plugin
where we acquire a lock (maybe share lock) on the user catalog table
or hack test_decoding/pgoutput to take such a lock.
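For example, a hypothetical hack in a decoding callback (say a change callback in test_decoding or pgoutput) could look like the sketch below. LockRelationOid() and the macros used are the real APIs, but exactly where to place the call is only an assumption for reproducing the hang:
	/* Hypothetical reproducer hack: make the plugin lock user catalog
	 * tables it decodes, so a synchronous-mode TRUNCATE holding
	 * AccessExclusiveLock on the same table makes the walsender wait. */
	if (RelationIsUsedAsCatalogTable(relation))
		LockRelationOid(RelationGetRelid(relation), AccessShareLock);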
--
With Regards,
Amit Kapila.
On Tue, Apr 20, 2021 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, Apr 17, 2021 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de> wrote:
This made me take a brief look at pgoutput.c - maybe I am missing
something, but how is the following not a memory leak?
static void
maybe_send_schema(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, ReorderBufferChange *change,
Relation relation, RelationSyncEntry *relentry)
{
...
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
CreateTupleDescCopy(outdesc));
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
If - and that's common - convert_tuples_by_name() won't have to do
anything, the copied tuple descs will be permanently leaked.
I also think this is a permanent leak. I think we need to free all the
memory associated with this map on the invalidation of this particular
relsync entry (basically in rel_sync_cache_relation_cb).
I agree there's a problem here.
Back in:
/messages/by-id/CA+HiwqEeU19iQgjN6HF1HTPU0L5+JxyS5CmxaOVGNXBAfUY06Q@mail.gmail.com
I had proposed to move the map creation from maybe_send_schema() to
get_rel_sync_entry(), mainly because the latter is where I realized it
belongs, though a bit too late.
It seems in get_rel_sync_entry, it will only build the map again when
there is any invalidation in publication_rel. Don't we need to build
it after any DDL on the relation itself? I haven't tried this with a
test so I might be missing something. Also, don't we need to free the
entire map as suggested by me?
--
With Regards,
Amit Kapila.
On Thursday, May 13, 2021 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Apr 20, 2021 at 8:36 AM Amit Langote <amitlangote09@gmail.com>
wrote:
Back in:
/messages/by-id/CA+HiwqEeU19iQgjN6HF1HTPU0L5%2BJxyS5CmxaOVGNXBAfUY06Q%40mail.gmail.com
I had proposed to move the map creation from maybe_send_schema() to
get_rel_sync_entry(), mainly because the latter is where I realized it
belongs, though a bit too late.
It seems in get_rel_sync_entry, it will only build the map again when there is
any invalidation in publication_rel. Don't we need to build it after any DDL on
the relation itself? I haven't tried this with a test so I might be missing
something.
Yeah, the patch not only tries to address the memory leak
but also changes the timing (condition) to call convert_tuples_by_name.
This is because the patch placed the function within a condition of !entry->replicate_valid in get_rel_sync_entry.
OTOH, OSS HEAD calls it based on RelationSyncEntry's schema_sent in maybe_send_schema.
The two flags (replicate_valid and schema_sent) are reset at somewhat different times.
InvalidateSystemCaches resets both flags, but schema_sent is also reset by LocalExecuteInvalidationMessage,
while replicate_valid is reset by CallSyscacheCallbacks.
IIUC, InvalidateSystemCaches, which applies to both flags, is called
when a transaction starts, via AtStart_Cache, and when a table lock is taken via LockRelationOid, etc.
Accordingly, I think we can still notice changes after any DDL on the relation.
But, given the different timing, we need to understand the impact of the change accurately.
LocalExecuteInvalidationMessage is called from functions in reorderbuffer
(e.g. ReorderBufferImmediateInvalidation, ReorderBufferExecuteInvalidations).
So it seems to me that changing the condition, as the patch does,
reduces the chance of the reorderbuffer proactively resetting
the flag that ultimately leads to rebuilding the map.
Langote-san, could you please comment on this?
Best Regards,
Takamichi Osumi
On Thu, May 13, 2021 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Apr 20, 2021 at 8:36 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, Apr 17, 2021 at 1:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund <andres@anarazel.de> wrote:
This made me take a brief look at pgoutput.c - maybe I am missing
something, but how is the following not a memory leak?
static void
maybe_send_schema(LogicalDecodingContext *ctx,
ReorderBufferTXN *txn, ReorderBufferChange *change,
Relation relation, RelationSyncEntry *relentry)
{
...
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
CreateTupleDescCopy(outdesc));
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
If - and that's common - convert_tuples_by_name() won't have to do
anything, the copied tuple descs will be permanently leaked.
I also think this is a permanent leak. I think we need to free all the
memory associated with this map on the invalidation of this particular
relsync entry (basically in rel_sync_cache_relation_cb).
I agree there's a problem here.
Back in:
/messages/by-id/CA+HiwqEeU19iQgjN6HF1HTPU0L5+JxyS5CmxaOVGNXBAfUY06Q@mail.gmail.com
I had proposed to move the map creation from maybe_send_schema() to
get_rel_sync_entry(), mainly because the latter is where I realized it
belongs, though a bit too late.
It seems in get_rel_sync_entry, it will only build the map again when
there is any invalidation in publication_rel. Don't we need to build
it after any DDL on the relation itself? I haven't tried this with a
test so I might be missing something.
That's a good point, I didn't really think that through. So,
rel_sync_cache_relation_cb(), that gets called when the published
table's relcache is invalidated, only resets schema_sent but not
replicate_valid. The latter, as you said, is reset by
rel_sync_cache_publication_cb() when a pg_publication syscache
invalidation occurs. So with the patch, it's possible for the map to
not be recreated, even when it should, if for example DDL changes the
table's TupleDesc.
I have put the map-creation code back into maybe_send_schema() in the
attached updated patch, updated some comments related to the map, and
added a test case that would fail with the previous patch (due to
moving map-creation code into get_rel_sync_entry() that is) but
succeeds with the updated patch.
Also, don't we need to free the
entire map as suggested by me?
Yes, I had missed that too. Addressed in the updated patch.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
pgoutput-tupeconv-map-leak_v2.patch
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348dcf4..34cdaeded0 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -93,8 +93,9 @@ typedef struct RelationSyncEntry
Oid relid; /* relation oid */
/*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
+ * Did we send the schema? If ancestor relid is set, its schema must have
+ * also been sent and the map to convert the relation's tuples into the
+ * ancestor's format created before this can be set to be true.
*/
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
@@ -440,7 +441,10 @@ maybe_send_schema(LogicalDecodingContext *ctx,
if (schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * If needed, send the ancestor's schema first. Also, set the map to
+ * convert leaf partition tuples into the ancestor's format.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -448,12 +452,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, xid, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
}
@@ -1190,13 +1208,17 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
HASH_FIND, NULL);
/*
- * Reset schema sent status as the relation definition may have changed.
+ * Reset schema sent status as the relation definition may have changed,
+ * also freeing any objects that depended on the earlier definition.
*/
if (entry != NULL)
{
entry->schema_sent = false;
list_free(entry->streamed_txns);
entry->streamed_txns = NIL;
+ if (entry->map)
+ free_conversion_map(entry->map);
+ entry->map = NULL;
}
}
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index 6daf8daa3c..9de01017be 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -6,7 +6,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 54;
+use Test::More tests => 56;
# setup
@@ -624,3 +624,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
Takamichi-san,
On Fri, May 14, 2021 at 11:19 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Thursday, May 13, 2021 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Apr 20, 2021 at 8:36 AM Amit Langote <amitlangote09@gmail.com>
wrote:
Back in:
/messages/by-id/CA+HiwqEeU19iQgjN6HF1HTPU0L5%2BJxyS5CmxaOVGNXBAfUY06Q%40mail.gmail.com
I had proposed to move the map creation from maybe_send_schema() to
get_rel_sync_entry(), mainly because the latter is where I realized it
belongs, though a bit too late.
It seems in get_rel_sync_entry, it will only build the map again when there is
any invalidation in publication_rel. Don't we need to build it after any DDL on
the relation itself? I haven't tried this with a test so I might be missing
something.
Yeah, the patch not only tries to address the memory leak
but also changes the timing (condition) to call convert_tuples_by_name.
This is because the patch placed the function within a condition of !entry->replicate_valid in get_rel_sync_entry.
OTOH, OSS HEAD calls it based on RelationSyncEntry's schema_sent in maybe_send_schema.
The two flags (replicate_valid and schema_sent) are reset at different timing somehow.
InvalidateSystemCaches resets both flags but schema_send is also reset by LocalExecuteInvalidationMessage
while replicate_valid is reset by CallSyscacheCallbacks.
IIUC, InvalidateSystemCaches, which applies to both flags, is called
when a transaction starts, via AtStart_Cache and when a table lock is taken via LockRelationOid, etc.
Accordingly, I think we can notice changes after any DDL on the relation.
But, as for the different timing, we need to know the impact of the change accurately.
LocalExecuteInvalidationMessage is called from functions in reorderbuffer
(e.g. ReorderBufferImmediateInvalidation, ReorderBufferExecuteInvalidations).
This seems to me that changing the condition by the patch
reduces the chance of the reorderbuffer's proactive reset of
the flag which leads to rebuild the map in the end.
Langote-san, could you please explain this perspective?
Please check the reply I just sent. In summary, moving map-creation
into get_rel_sync_entry() was not correct.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Thursday, May 13, 2021 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, May 13, 2021 at 11:15 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
I tried the following scenarios for trying to reproduce this.
Scenario2:
(1) set up 1 publisher and 1 subscriber
(2) create table with user_catalog_table = true on the pub
(3) insert some data to this table
(4) create publication for the table on the pub
(5) create table with user_catalog_table = true on the sub
(6) create subscription on the sub
(7) add synchronous_standby_names to publisher's configuration and
restart the pub
(8) have a session to truncate the user_catalog_table on the pub
Scenario 2 was successful.
Yeah, because pgoutput or for that matter even test_decoding doesn't
acquire a lock on user catalog tables.
Are these the scenario you have in mind, if not please let me know for
the missing steps.
I would like to reproduce the scenario and write a patch to fix this.
I don't think we can reproduce it with core plugins as they don't lock user
catalog tables.
OK. My current understanding about how the deadlock happens is below.
1. TRUNCATE command is performed on user_catalog_table.
2. TRUNCATE command locks the table and index with ACCESS EXCLUSIVE LOCK.
3. TRUNCATE waits for the subscriber's synchronization
when synchronous_standby_names is set.
4. Here, the walsender stops *if* it tries to acquire a lock on the user_catalog_table,
because the table it wants to look at is already locked by the TRUNCATE.
If this is right, we need to go back to a somewhat higher-level discussion,
since whether we should hack a plugin to simulate a deadlock caused by referencing a
user_catalog_table with a lock depends on whether the plugin takes a lock on the user_catalog_table at all.
In other words, if the plugin does read-only access to that type of table with no lock
(via RelationIdGetRelation, for example), the deadlock concern disappears and we don't
need to add anything on the plugin side, IIUC.
Here, we haven't gotten any response about whether an output plugin takes (or should take)
a lock on the user_catalog_table. But I would like to reach a consensus
on this point before the implementation.
By the way, Amit-san already mentioned that the main reason for this
is that we allow decoding of the TRUNCATE operation for user_catalog_tables in synchronous mode.
The choices were already provided by Amit-san in a past email [1].
(1) disallow decoding of TRUNCATE operation for user_catalog_tables
(2) disallow decoding of any operation for user_catalog_tables like system catalog tables
Yet, I'm not sure if either option solves the deadlock concern completely.
If an application takes an ACCESS EXCLUSIVE lock with a LOCK command (not with TRUNCATE!)
on the user_catalog_table in a transaction, and the plugin then tries to take a lock on it,
I think the deadlock happens (see the sketch below). Of course, having a consensus that the plugin
takes no lock at all would remove this concern, though.
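A minimal sketch of that LOCK-command variant, assuming a hypothetical user catalog table user_cat and an ordinary published table published_tab:
BEGIN;
LOCK TABLE user_cat IN ACCESS EXCLUSIVE MODE;
INSERT INTO published_tab VALUES (1);
COMMIT;
-- the COMMIT waits for the synchronous subscriber while the ACCESS
-- EXCLUSIVE lock on user_cat is still held; a plugin that tries to lock
-- user_cat while decoding this transaction blocks on that lock, so the
-- confirmation never comes back and we get the same deadlock.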
Like this, I'd like to discuss these two items in question together first:
* whether the plugin should take a lock on user_catalog_table
* the range of decoding related to user_catalog_table
To me, taking no lock on the user_catalog_table from the plugin is fine because
we have the historic snapshot mechanism, which doesn't produce a deadlock in any combination,
even when the application executes a LOCK command for ACCESS EXCLUSIVE.
In addition, I agree with the idea of not decoding any operation on user_catalog_tables,
for better alignment with the usual system catalogs.
Thoughts?
[1]: /messages/by-id/CAA4eK1LP8xTysPEMEBYAZ=6KoMWfjyf0gzF-9Bp=SgVFvYQDVw@mail.gmail.com
Best Regards,
Takamichi Osumi
On Fri, May 14, 2021 at 12:44 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, May 13, 2021 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Also, don't we need to free the
entire map as suggested by me?
Yes, I had missed that too. Addressed in the updated patch.
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
I think the patch frees these descriptors when the map is NULL but not
otherwise because free_conversion_map won't free these descriptors.
BTW, have you tried this patch in back branches because I think we
should backpatch this fix?
--
With Regards,
Amit Kapila.
On Fri, May 14, 2021 at 2:20 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Thursday, May 13, 2021 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I don't think we can reproduce it with core plugins as they don't lock user
catalog tables.
OK. My current understanding about how the deadlock happens is below.
1. TRUNCATE command is performed on user_catalog_table.
2. TRUNCATE command locks the table and index with ACCESS EXCLUSIVE LOCK.
3. TRUNCATE waits for the subscriber's synchronization
when synchronous_standby_names is set.
4. Here, the walsender stops, *if* it tries to acquire a lock on the user_catalog_table
because the table where it wants to see is locked by the TRUNCATE already.
If this is right,
Yeah, the above steps are correct, so if we take a lock on
user_catalog_table when walsender is processing the WAL, it would lead
to a problem.
we need to go back to a little bit higher level discussion,
since whether we should hack any plugin to simulate the deadlock caused by user_catalog_table reference
with locking depends on the assumption if the plugin takes a lock on the user_catalog_table or not.
In other words, if the plugin does read only access to that type of table with no lock
(by RelationIdGetRelation for example ?), the deadlock concern disappears and we don't
need to add anything to plugin sides, IIUC.
True, if the plugin doesn't acquire any lock on user_catalog_tables,
then it is fine, but we don't prohibit plugins from acquiring locks on
user_catalog_tables. This is similar to system catalogs: the plugins
and decoding code do acquire locks on those.
Here, we haven't gotten any response about whether output plugin takes (should take)
the lock on the user_catalog_table. But, I would like to make a consensus
about this point before the implementation.
By the way, Amit-san already mentioned that the main reason for this
is that we allow decoding of TRUNCATE operation for user_catalog_table in synchronous mode.
The choices are provided by Amit-san already in the past email in [1].
(1) disallow decoding of TRUNCATE operation for user_catalog_tables
(2) disallow decoding of any operation for user_catalog_tables like system catalog tables
Yet, I'm not sure if either option solves the deadlock concern completely.
If application takes an ACCESS EXCLUSIVE lock by LOCK command (not by TRUNCATE !)
on the user_catalog_table in a transaction, and if the plugin tries to take a lock on it,
I think the deadlock happens. Of course, having a consensus that the plugin takes no lock at all
would remove this concern, though.
This is true for system catalogs as well. See the similar report [1].
Like this, I'd like to discuss those two items in question together at first.
* the plugin should take a lock on user_catalog_table or not
* the range of decoding related to user_catalog_table
To me, taking no lock on the user_catalog_table from plugin is fine
We allow taking locks on system catalogs, so why prohibit
user_catalog_tables? However, I agree that if we want plugins to
acquire the lock on user_catalog_tables then we should either prohibit
decoding of such relations or do something else to avoid deadlock
hazards.
[1]: /messages/by-id/CALDaNm1UB==gL9Poad4ETjfcyGdJBphWEzEZocodnBd--kJpVw@mail.gmail.com
--
With Regards,
Amit Kapila.
On Mon, May 17, 2021 at 6:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 14, 2021 at 12:44 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, May 13, 2021 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Also, don't we need to free the
entire map as suggested by me?
Yes, I had missed that too. Addressed in the updated patch.
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
I think the patch frees these descriptors when the map is NULL but not
otherwise because free_conversion_map won't free these descriptors.
You're right. I have fixed that by making the callback free the
TupleDescs explicitly.
BTW, have you tried this patch in back branches because I think we
should backpatch this fix?
I have created a version of the patch for v13, the only older release
that has this code, and can see that tests, including the newly added
one, pass.
Both patches are attached.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
PG13-pgoutput-tupeconv-map-leak_v3.patch
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c5fbebf55a..3afa8c1252 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -66,7 +66,8 @@ typedef struct RelationSyncEntry
/*
* Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
+ * have been sent and the map to convert the relation's tuples into the
+ * ancestor's format created before this can be set to be true.
*/
bool schema_sent;
@@ -295,7 +296,10 @@ maybe_send_schema(LogicalDecodingContext *ctx,
if (relentry->schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * If needed, send the ancestor's schema first. Also, set the map to
+ * convert leaf partition tuples into the ancestor's format.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -303,12 +307,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, ctx);
RelationClose(ancestor);
}
@@ -800,10 +818,24 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
HASH_FIND, NULL);
/*
- * Reset schema sent status as the relation definition may have changed.
+ * Reset schema sent status as the relation definition may have changed,
+ * also freeing any objects that depended on the earlier definition.
*/
if (entry != NULL)
+ {
entry->schema_sent = false;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
+ }
}
/*
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index a04c03a7e2..a59de0b28c 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 51;
+use Test::More tests => 53;
# setup
@@ -532,3 +532,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
HEAD-pgoutput-tupeconv-map-leak_v3.patch
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348dcf4..d51975025c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -94,7 +94,8 @@ typedef struct RelationSyncEntry
/*
* Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
+ * have been sent and the map to convert the relation's tuples into the
+ * ancestor's format created before this can be set to be true.
*/
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
@@ -440,7 +441,10 @@ maybe_send_schema(LogicalDecodingContext *ctx,
if (schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * If needed, send the ancestor's schema first. Also, set the map to
+ * convert leaf partition tuples into the ancestor's format.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -448,12 +452,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, xid, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
}
@@ -1190,13 +1208,25 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
HASH_FIND, NULL);
/*
- * Reset schema sent status as the relation definition may have changed.
+ * Reset schema sent status as the relation definition may have changed,
+ * also freeing any objects that depended on the earlier definition.
*/
if (entry != NULL)
{
entry->schema_sent = false;
list_free(entry->streamed_txns);
entry->streamed_txns = NIL;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
}
}
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index 6daf8daa3c..9de01017be 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -6,7 +6,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 54;
+use Test::More tests => 56;
# setup
@@ -624,3 +624,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
On Monday, May 17, 2021 6:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, May 17, 2021 at 6:13 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Fri, May 14, 2021 at 12:44 PM Amit Langote
<amitlangote09@gmail.com> wrote:
On Thu, May 13, 2021 at 7:43 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
Also, don't we need to free the
entire map as suggested by me?
Yes, I had missed that too. Addressed in the updated patch.
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
I think the patch frees these descriptors when the map is NULL but not
otherwise because free_conversion_map won't free these descriptors.
You're right. I have fixed that by making the callback free the TupleDescs
explicitly.
This fix looks correct. Also, the regression test run of v3 didn't fail.
Further, I've checked the newly added tests.
As you said, with v1's contents the test of v2 failed, but
v2 and v3 are fine, which proves that the DDL was adjusted in the right way.
BTW, have you tried this patch in back branches because I think we
should backpatch this fix?
I have created a version of the patch for v13, the only older release that has
this code, and can see that tests, including the newly added one, pass.
Both patches are attached.
The patch for PG13 applies cleanly and the regression tests succeeded.
I have a few really minor comments on your comments in the patch.
(1) schema_sent's comment
@@ -94,7 +94,8 @@ typedef struct RelationSyncEntry
/*
* Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
+ * have been sent and the map to convert the relation's tuples into the
+ * ancestor's format created before this can be set to be true.
*/
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
I suggest inserting a comma between 'created' and 'before'
because the sentence is a bit long and confusing.
Or, I thought of another comment idea for this part,
because the original one doesn't say anything about when the flag is reset.
"To avoid repetition to send the schema, this is set true after its first transmission.
Reset when any change of the relation definition is possible. If ancestor relid is set,
its schema must have also been sent while the map to convert the relation's tuples into
the ancestor's format created, before this flag is set true."
(2) comment in rel_sync_cache_relation_cb()
@@ -1190,13 +1208,25 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
HASH_FIND, NULL);
/*
- * Reset schema sent status as the relation definition may have changed.
+ * Reset schema sent status as the relation definition may have changed,
+ * also freeing any objects that depended on the earlier definition.
How about dividing this sentence into two sentences, like
"Reset schema sent status as the relation definition may have changed.
Also, free any objects that depended on the earlier definition."
Best Regards,
Takamichi Osumi
On Mon, May 17, 2021 at 9:45 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Monday, May 17, 2021 6:52 PM Amit Langote <amitlangote09@gmail.com> wrote:
Both patches are attached.
The patch for PG13 can be applied to it cleanly and the RT succeeded.
I have few really minor comments on your comments in the patch.
(1) schema_sent's comment
@@ -94,7 +94,8 @@ typedef struct RelationSyncEntry
/*
 * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
+ * have been sent and the map to convert the relation's tuples into the
+ * ancestor's format created before this can be set to be true.
 */
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
I suggest to insert a comma between 'created' and 'before'
because the sentence is a bit long and confusing.
Or, I thought another comment idea for this part,
because the original one doesn't care about the cycle of the reset.
"To avoid repetition to send the schema, this is set true after its first transmission.
Reset when any change of the relation definition is possible. If ancestor relid is set,
its schema must have also been sent while the map to convert the relation's tuples into
the ancestor's format created, before this flag is set true."
(2) comment in rel_sync_cache_relation_cb()
@@ -1190,13 +1208,25 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
HASH_FIND, NULL);
/*
- * Reset schema sent status as the relation definition may have changed.
+ * Reset schema sent status as the relation definition may have changed,
+ * also freeing any objects that depended on the earlier definition.
How about divide this sentence into two sentences like
"Reset schema sent status as the relation definition may have changed.
Also, free any objects that depended on the earlier definition."
Thanks for reading it over. I have revised comments in a way that
hopefully addresses your concerns.
While doing so, it occurred to me (maybe not for the first time) that
we are *unnecessarily* doing send_relation_and_attrs() for a relation
if the changes will be published using an ancestor's schema. In that
case, sending only the ancestor's schema suffices AFAICS. Changing
the code that way doesn't break any tests. I propose that we fix that
too.
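To illustrate the publish-via-ancestor case in question, here is a small sketch with hypothetical object names; when changes to a leaf partition are published under the root, only the root's schema needs to reach the subscriber:
CREATE TABLE tab_root (a int, b text) PARTITION BY RANGE (a);
CREATE TABLE tab_leaf PARTITION OF tab_root FOR VALUES FROM (0) TO (100);
CREATE PUBLICATION pub_viaroot FOR TABLE tab_root
  WITH (publish_via_partition_root = true);
-- an INSERT into tab_leaf is published as a change to tab_root, so
-- pgoutput only has to send tab_root's relation/attribute messages,
-- not tab_leaf's.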
Updated patches attached. I've added a commit message to both patches.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
HEAD-v4-0001-pgoutput-fix-code-for-publishing-using-ancestor-s.patch
From b0482733a1fc4fa8555932ef0387ccef44de5edb Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 18 May 2021 15:13:32 +0900
Subject: [PATCH v4] pgoutput: fix code for publishing using ancestor schema
Couple of fixes:
1. Make maybe_send_schema() not send a relation's schema if the
changes will be published using an ancestor's schema, because
sending only the ancestor's schema suffices.
2. Release memory allocated when creating the tuple-conversion map
and its component TupleDescs when its owning sync entry is
invalidated. TupleDescs must also be freed when no map is deemed
necessary to begin with.
---
src/backend/replication/pgoutput/pgoutput.c | 51 ++++++++++++++++-----
src/test/subscription/t/013_partition.pl | 28 ++++++++++-
2 files changed, 67 insertions(+), 12 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348dcf4..0b1ec102cb 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -74,7 +74,8 @@ static void send_relation_and_attrs(Relation relation, TransactionId xid,
/*
* Entry in the map used to remember which relation schemas we sent.
*
- * The schema_sent flag determines if the current schema record was already
+ * The schema_sent flag determines if the current schema record for the
+ * relation (or for its ancestor if publish_as_relid is set) was already
* sent to the subscriber (in which case we don't need to send it again).
*
* The schema cache on downstream is however updated only at commit time,
@@ -92,10 +93,6 @@ typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
- /*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
- */
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
* schema */
@@ -437,10 +434,16 @@ maybe_send_schema(LogicalDecodingContext *ctx,
else
schema_sent = relentry->schema_sent;
+ /* Nothing to do if we already sent the schema. */
if (schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * Nope, so send the schema. If the changes will be published using an
+ * ancestor's schema, not the relation's own, send that ancestor's schema
+ * instead. This is also a good place to set the map that will be used to
+ * convert the relation's tuples into the ancestor's format, if needed.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -448,16 +451,30 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, xid, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
}
-
- send_relation_and_attrs(relation, xid, ctx);
+ else
+ send_relation_and_attrs(relation, xid, ctx);
if (in_streaming)
set_schema_sent_in_streamed_txn(relentry, topxid);
@@ -1191,12 +1208,24 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
/*
* Reset schema sent status as the relation definition may have changed.
+ * Also free any objects that depended on the earlier definition.
*/
if (entry != NULL)
{
entry->schema_sent = false;
list_free(entry->streamed_txns);
entry->streamed_txns = NIL;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
}
}
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index 6daf8daa3c..9de01017be 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -6,7 +6,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 54;
+use Test::More tests => 56;
# setup
@@ -624,3 +624,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
--
2.24.1
PG13-v4-0001-pgoutput-fix-code-for-publishing-using-ancestor-s.patch
From 9fe52ba5216f8a9ac37187c7f542673da7bbc879 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 18 May 2021 14:52:04 +0900
Subject: [PATCH v4] pgoutput: fix code for publishing using ancestor schema
Couple of fixes:
1. Make maybe_send_schema() not send a relation's schema if the
changes will be published using an ancestor's schema, because
sending only the ancestor's schema suffices.
2. Release memory allocated when creating the tuple-conversion map
and its component TupleDescs when its owning sync entry is
invalidated. TupleDescs must also be freed when no map is deemed
necessary to begin with.
---
src/backend/replication/pgoutput/pgoutput.c | 53 +++++++++++++++++----
src/test/subscription/t/013_partition.pl | 28 ++++++++++-
2 files changed, 71 insertions(+), 10 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c5fbebf55a..41e9e76d85 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -57,6 +57,10 @@ static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *c
/*
* Entry in the map used to remember which relation schemas we sent.
*
+ * The schema_sent flag determines if the current schema record for the
+ * relation (or for its ancestor if publish_as_relid is set) was already sent
+ * to the subscriber (in which case we don't need to send it again).
+ *
* For partitions, 'pubactions' considers not only the table's own
* publications, but also those of all of its ancestors.
*/
@@ -64,10 +68,6 @@ typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
- /*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
- */
bool schema_sent;
bool replicate_valid;
@@ -292,10 +292,16 @@ static void
maybe_send_schema(LogicalDecodingContext *ctx,
Relation relation, RelationSyncEntry *relentry)
{
+ /* Nothing to do if we already sent the schema. */
if (relentry->schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * Nope, so send the schema. If the changes will be published using an
+ * ancestor's schema, not the relation's own, send that ancestor's schema
+ * instead. This is also a good place to set the map that will be used to
+ * convert the relation's tuples into the ancestor's format, if needed.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -303,16 +309,31 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, ctx);
RelationClose(ancestor);
}
+ else
+ send_relation_and_attrs(relation, ctx);
- send_relation_and_attrs(relation, ctx);
relentry->schema_sent = true;
}
@@ -801,9 +822,23 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
/*
* Reset schema sent status as the relation definition may have changed.
+ * Also, free any objects that depended on the earlier definition.
*/
if (entry != NULL)
+ {
entry->schema_sent = false;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
+ }
}
/*
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index a04c03a7e2..a59de0b28c 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 51;
+use Test::More tests => 53;
# setup
@@ -532,3 +532,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
--
2.24.1
On Monday, May 17, 2021 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 14, 2021 at 2:20 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Thursday, May 13, 2021 7:21 PM Amit Kapila
<amit.kapila16@gmail.com> wrote:
I don't think we can reproduce it with core plugins as they don't
lock user catalog tables.
OK. My current understanding about how the deadlock happens is below.
1. TRUNCATE command is performed on user_catalog_table.
2. TRUNCATE command locks the table and index with ACCESS EXCLUSIVE LOCK.
3. TRUNCATE waits for the subscriber's synchronization
when synchronous_standby_names is set.
4. Here, the walsender stops, *if* it tries to acquire a lock on the user_catalog_table
because the table where it wants to see is locked by the
TRUNCATE already.
If this is right,
Yeah, the above steps are correct, so if we take a lock on user_catalog_table
when walsender is processing the WAL, it would lead to a problem.
we need to go back to a little bit higher level discussion, since
whether we should hack any plugin to simulate the deadlock caused by
user_catalog_table reference with locking depends on the assumption if the plugin takes a lock on the user_catalog_table or not.
In other words, if the plugin does read only access to that type of
table with no lock (by RelationIdGetRelation for example ?), the
deadlock concern disappears and we don't need to add anything to plugin sides, IIUC.
True, if the plugin doesn't acquire any lock on user_catalog_table, then it is
fine but we don't prohibit plugins to acquire locks on user_catalog_tables.
This is similar to system catalogs, the plugins and decoding code do acquire
lock on those.
Thanks for sharing this. I'll take into account the idea
that a plugin can take a lock on a user_catalog_table.
Here, we haven't gotten any response about whether output plugin takes
(should take) the lock on the user_catalog_table. But, I would like to
make a consensus about this point before the implementation.
By the way, Amit-san already mentioned the main reason of this is that
we allow decoding of TRUNCATE operation for user_catalog_table in synchronous mode.
The choices are provided by Amit-san already in the past email in [1].
(1) disallow decoding of TRUNCATE operation for user_catalog_tables
(2) disallow decoding of any operation for user_catalog_tables like
system catalog tables
Yet, I'm not sure if either option solves the deadlock concern completely.
If application takes an ACCESS EXCLUSIVE lock by LOCK command (not by
TRUNCATE !) on the user_catalog_table in a transaction, and if the
plugin tries to take a lock on it, I think the deadlock happens. Of
course, having a consensus that the plugin takes no lock at all would remove this concern, though.
This is true for system catalogs as well. See the similar report [1]
Like this, I'd like to discuss those two items in question together at first.
* the plugin should take a lock on user_catalog_table or not
* the range of decoding related to user_catalog_table
To me, taking no lock on the user_catalog_table from plugin is fine
We allow taking locks on system catalogs, so why prohibit
user_catalog_tables? However, I agree that if we want plugins to acquire the
lock on user_catalog_tables then we should either prohibit decoding of such
relations or do something else to avoid deadlock hazards.
OK.
Although we have not concluded the range of logical decoding for user_catalog_tables
(i.e., whether we should exclude only the TRUNCATE command or all operations on that type of table),
I'm worried that disallowing the logical decoding of user_catalog_tables still produces
the deadlock. That's because disabling it by itself does not affect the
lock taken by the TRUNCATE command. What I have in mind is the example below.
(1) plugin (e.g. pgoutput) is designed to take a lock on user_catalog_table.
(2) logical replication is set up in synchronous mode.
(3) TRUNCATE command takes an access exclusive lock on the user_catalog_table.
(4) This time, we don't do anything for the TRUNCATE decoding.
(5) the plugin tries to take a lock on the truncated table
but, it can't due to the lock by TRUNCATE command.
I was not sure whether the place where the plugin takes the lock is in truncate_cb
or somewhere else not directly related to decoding of the user_catalog_table itself,
so I might be wrong. However, in this case,
the solution would be not disabling the decoding of user_catalog_tables
but prohibiting the TRUNCATE command on user_catalog_tables in synchronous mode.
If this is true, I need to extend an output plugin to simulate the deadlock first
and then remove it by fixing the TRUNCATE side. Thoughts?
Best Regards,
Takamichi Osumi
On Tue, May 18, 2021 at 1:29 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Monday, May 17, 2021 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
We allow taking locks on system catalogs, so why prohibit
user_catalog_tables? However, I agree that if we want plugins to acquire the
lock on user_catalog_tables then we should either prohibit decoding of such
relations or do something else to avoid deadlock hazards.
OK.
Although we have not concluded the range of logical decoding of user_catalog_table
(like we should exclude TRUNCATE command only or all operations on that type of table),
I'm worried that disallowing the logical decoding of user_catalog_table produces
the deadlock still. It's because disabling it by itself does not affect the
lock taken by TRUNCATE command. What I have in mind is an example below.
(1) plugin (e.g. pgoutput) is designed to take a lock on user_catalog_table.
(2) logical replication is set up in synchronous mode.
(3) TRUNCATE command takes an access exclusive lock on the user_catalog_table.
(4) This time, we don't do anything for the TRUNCATE decoding.
(5) the plugin tries to take a lock on the truncated table
but, it can't due to the lock by TRUNCATE command.
If you skip decoding of truncate then we won't invoke plugin API so
step 5 will be skipped.
I was not sure that the place where the plugin takes the lock is in truncate_cb
or somewhere else not directly related to decoding of the user_catalog_table itself,
so I might be wrong. However, in this case,
the solution would be not disabling the decoding of user_catalog_table
but prohibiting TRUNCATE command on user_catalog_table in synchronous_mode.
If this is true, I need to extend an output plugin and simulate the deadlock first
and remove it by fixing the TRUNCATE side. Thoughts ?
I suggest not spending too much time reproducing this because it is
quite clear that it will lead to deadlock if the plugin acquires lock
on user_catalog_table and we allow decoding of truncate. But if you
want to see how that happens you can try as well.
--
With Regards,
Amit Kapila.
On Tue, May 18, 2021 at 5:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, May 18, 2021 at 1:29 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Monday, May 17, 2021 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
We allow taking locks on system catalogs, so why prohibit
user_catalog_tables? However, I agree that if we want plugins to acquire the
lock on user_catalog_tables then we should either prohibit decoding of such
relations or do something else to avoid deadlock hazards.
OK.
Although we have not concluded the range of logical decoding of user_catalog_table
(like we should exclude TRUNCATE command only or all operations on that type of table),
I'm worried that disallowing the logical decoding of user_catalog_table produces
the deadlock still. It's because disabling it by itself does not affect the
lock taken by TRUNCATE command. What I have in mind is an example below.
(1) plugin (e.g. pgoutput) is designed to take a lock on user_catalog_table.
(2) logical replication is set up in synchronous mode.
(3) TRUNCATE command takes an access exclusive lock on the user_catalog_table.
(4) This time, we don't do anything for the TRUNCATE decoding.
(5) the plugin tries to take a lock on the truncated table
but, it can't due to the lock by TRUNCATE command.
If you skip decoding of truncate then we won't invoke plugin API so
step 5 will be skipped.
I think you were right here: even if we skip step 4, the plugin might take
a lock on the user_catalog_table for something else. I am not sure, but I
think we should prohibit truncate on user_catalog_tables as we
prohibit truncate on system catalog tables (see below [1]) if we want
plugins to lock them; otherwise, as you said, it might lead to deadlock.
For that matter, I think we should check all other operations
where we can take an exclusive lock on a [user]_catalog_table, say the
CLUSTER command, and compare the behavior of the same operations on system catalog
tables.
[1]:
postgres=# truncate pg_class;
ERROR: permission denied: "pg_class" is a system catalog
postgres=# cluster pg_class;
ERROR: there is no previously clustered index for table "pg_class"
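For comparison, the user_catalog_table side of that check could look like the following sketch (table name hypothetical); unlike a system catalog, such a table can currently be truncated:
CREATE TABLE user_cat (a int) WITH (user_catalog_table = true);
TRUNCATE user_cat;    -- currently allowed
TRUNCATE pg_class;    -- ERROR: permission denied: "pg_class" is a system catalog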
--
With Regards,
Amit Kapila.
On Wed, May 19, 2021 at 7:59 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, May 18, 2021 at 5:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, May 18, 2021 at 1:29 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Monday, May 17, 2021 6:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
We allow taking locks on system catalogs, so why prohibit
user_catalog_tables? However, I agree that if we want plugins to acquire the
lock on user_catalog_tables then we should either prohibit decoding of such
relations or do something else to avoid deadlock hazards.
OK.
Although we have not concluded the range of logical decoding of user_catalog_table
(like we should exclude TRUNCATE command only or all operations on that type of table),
I'm worried that disallowing the logical decoding of user_catalog_table produces
the deadlock still. It's because disabling it by itself does not affect the
lock taken by TRUNCATE command. What I have in mind is an example below.
(1) plugin (e.g. pgoutput) is designed to take a lock on user_catalog_table.
(2) logical replication is set up in synchronous mode.
(3) TRUNCATE command takes an access exclusive lock on the user_catalog_table.
(4) This time, we don't do anything for the TRUNCATE decoding.
(5) the plugin tries to take a lock on the truncated table
but, it can't due to the lock by TRUNCATE command.
If you skip decoding of truncate then we won't invoke plugin API so
step 5 will be skipped.
I think you were right here even if skip step-4, the plugin might take
a lock on user_catalog_table for something else. I am not sure but I
think we should prohibit truncate on user_catalog_tables as we
prohibit truncate on system catalog tables (see below [1]) if we want
plugin to lock them, otherwise, as you said it might lead to deadlock.
For the matter, I think we should once check all other operations
where we can take an exclusive lock on [user]_catalog_table, say
Cluster command, and compare the behavior of same on system catalog
tables.
[1]
postgres=# truncate pg_class;
ERROR: permission denied: "pg_class" is a system catalog
postgres=# cluster pg_class;
ERROR: there is no previously clustered index for table "pg_class"
Please ignore the cluster command, as we need to use 'using index' with that
command to make it successful. I just wanted to show the truncate command
behavior that you asked about.
--
With Regards,
Amit Kapila.
On Wednesday, May 19, 2021 11:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, May 19, 2021 at 7:59 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, May 18, 2021 at 5:29 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, May 18, 2021 at 1:29 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Monday, May 17, 2021 6:45 PM Amit Kapila
<amit.kapila16@gmail.com> wrote:
We allow taking locks on system catalogs, so why prohibit
user_catalog_tables? However, I agree that if we want plugins to
acquire the lock on user_catalog_tables then we should either
prohibit decoding of such relations or do something else to avoid deadlock hazards.
OK.
Although we have not concluded the range of logical decoding of
user_catalog_table (like we should exclude TRUNCATE command only
or all operations on that type of table), I'm worried that
disallowing the logical decoding of user_catalog_table produces
the deadlock still. It's because disabling it by itself does not affect the
lock taken by TRUNCATE command. What I have in mind is an example below.
(1) plugin (e.g. pgoutput) is designed to take a lock on
user_catalog_table.
(2) logical replication is set up in synchronous mode.
(3) TRUNCATE command takes an access exclusive lock on the user_catalog_table.
(4) This time, we don't do anything for the TRUNCATE decoding.
(5) the plugin tries to take a lock on the truncated table
but, it can't due to the lock by TRUNCATE command.
If you skip decoding of truncate then we won't invoke plugin API so
step 5 will be skipped.
I think you were right here even if skip step-4, the plugin might take
a lock on user_catalog_table for something else.
Yes, we can't know the exact place where the user wants to use the feature
of user_catalog_table. Even if we imagine that the user skips
the truncate decoding (I imagined continuing past, i.e. skipping, the
REORDER_BUFFER_CHANGE_TRUNCATE case in pgoutput),
it's possible that the user accesses the table somewhere else, for a different purpose, with a lock.
I am not sure but I
think we should prohibit truncate on user_catalog_tables as we
prohibit truncate on system catalog tables (see below [1]) if we want
plugin to lock them, otherwise, as you said it might lead to deadlock.
For the matter, I think we should once check all other operations
where we can take an exclusive lock on [user]_catalog_table, say
Cluster command, and compare the behavior of same on system catalog
tables.
[1]
postgres=# truncate pg_class;
ERROR: permission denied: "pg_class" is a system catalog
postgres=# cluster pg_class;
ERROR: there is no previously clustered index for table "pg_class"
Please ignore the cluster command as we need to use 'using index' with that
command to make it successful. I just want to show the truncate command
behavior for which you have asked the question.
Thank you so much for clarifying the direction.
I agree with changing the TRUNCATE side.
I'll make a patch based on this.
Best Regards,
Takamichi Osumi
On Wed, May 19, 2021 at 8:28 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Wednesday, May 19, 2021 11:33 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, May 19, 2021 at 7:59 AM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, May 18, 2021 at 5:29 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:
On Tue, May 18, 2021 at 1:29 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Monday, May 17, 2021 6:45 PM Amit Kapila
<amit.kapila16@gmail.com> wrote:
We allow taking locks on system catalogs, so why prohibit
user_catalog_tables? However, I agree that if we want plugins to
acquire the lock on user_catalog_tables then we should either
prohibit decoding of such relations or do something else to avoid deadlock hazards.
OK.
Although we have not concluded the range of logical decoding of
user_catalog_table (like we should exclude TRUNCATE command only
or all operations on that type of table), I'm worried that
disallowing the logical decoding of user_catalog_table produces
the deadlock still. It's because disabling it by itself does not affect the
lock taken by TRUNCATE command. What I have in mind is an example below.
(1) plugin (e.g. pgoutput) is designed to take a lock on
user_catalog_table.
(2) logical replication is set up in synchronous mode.
(3) TRUNCATE command takes an access exclusive lock on the user_catalog_table.
(4) This time, we don't do anything for the TRUNCATE decoding.
(5) the plugin tries to take a lock on the truncated table
but, it can't due to the lock by TRUNCATE command.
If you skip decoding of truncate then we won't invoke plugin API so
step 5 will be skipped.
I think you were right here even if skip step-4, the plugin might take
a lock on user_catalog_table for something else.
Yes, we can't know the exact place where the user wants to use the feature
of user_catalog_table. Even if we imagine that the user skips
the truncate decoding (I imagined continuing and skipping a case in
REORDER_BUFFER_CHANGE_TRUNCATE of pgoutput),
it's possible that the user accesses it somewhere else for different purpose with lock.
I am not sure but I
think we should prohibit truncate on user_catalog_tables as we
prohibit truncate on system catalog tables (see below [1]) if we want
plugin to lock them, otherwise, as you said it might lead to deadlock.
For the matter, I think we should once check all other operations
where we can take an exclusive lock on [user]_catalog_table, say
Cluster command, and compare the behavior of same on system catalog
tables.
[1]
postgres=# truncate pg_class;
ERROR: permission denied: "pg_class" is a system catalog
postgres=# cluster pg_class;
ERROR: there is no previously clustered index for table "pg_class"
Please ignore the cluster command as we need to use 'using index' with that
command to make it successful. I just want to show the truncate command
behavior for which you have asked the question.
Thank you so much for clarifying the direction.
I agree with changing the TRUNCATE side.
I'll make a patch based on this.
Isn't it a better idea to start a new thread where you can summarize
whatever we have discussed here about user_catalog_tables? We might
get the opinion from others about the behavior change you are
proposing.
--
With Regards,
Amit Kapila.
On Wednesday, May 19, 2021 1:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I am not sure but I
think we should prohibit truncate on user_catalog_tables as we
prohibit truncate on system catalog tables (see below [1]) if we
want plugin to lock them, otherwise, as you said it might lead to deadlock.
For the matter, I think we should once check all other operations
where we can take an exclusive lock on [user]_catalog_table, say
Cluster command, and compare the behavior of same on system
catalog tables.[1]
postgres=# truncate pg_class;
ERROR: permission denied: "pg_class" is a system catalog
postgres=# cluster pg_class;
ERROR: there is no previously clustered index for table "pg_class"
Please ignore the cluster command as we need to use 'using index'
with that command to make it successful. I just want to show the
truncate command behavior for which you have asked the question.
Thank you so much for clarifying the direction.
I agree with changing the TRUNCATE side.
I'll make a patch based on this.
Isn't it a better idea to start a new thread where you can summarize whatever
we have discussed here about user_catalog_tables? We might get the opinion
from others about the behavior change you are proposing.
You are right. So, I've started a new thread with the patch for this.
Best Regards,
Takamichi Osumi
On Tuesday, May 18, 2021 3:30 PM Amit Langote <amitlangote09@gmail.com> wrote:
While doing so, it occurred to me (maybe not for the first time) that we are
*unnecessarily* doing send_relation_and_attrs() for a relation if the changes
will be published using an ancestor's schema. In that case, sending only the
ancestor's schema suffices AFAICS. Changing the code that way doesn't
break any tests. I propose that we fix that too.
I've analyzed the validity of this new change.
My conclusion is that it has no bad impact,
which means your additional fix is acceptable.
I think this addition blurs the purpose of the patch a bit, though.
With the patch's removal of that send_relation_and_attrs() call,
we no longer send one pair of LOGICAL_REP_MSG_TYPE ('Y') and
LOGICAL_REP_MSG_RELATION ('R') messages to the subscriber
when we publish via an ancestor. Therefore, with the patch we no longer
register or update the type and relation for maybe_send_schema()'s
'relation' argument in the case where the ancestor's schema is used.
However, both pgoutput_change() and pgoutput_truncate()
check which OIDs to send to the subscriber for every operation.
Accordingly, that pair of messages for the 'relation' argument
isn't used on the subscriber in that case, and we are fine.
I'll comment on other minor things in another email.
Best Regards,
Takamichi Osumi
On Tuesday, May 18, 2021 3:30 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, May 17, 2021 at 9:45 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Monday, May 17, 2021 6:52 PM Amit Langote
<amitlangote09@gmail.com> wrote:
Both patches are attached.
The patch for PG13 can be applied to it cleanly and the RT succeeded.
I have few really minor comments on your comments in the patch.
(1) schema_sent's comment
@@ -94,7 +94,8 @@ typedef struct RelationSyncEntry
/*
* Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
+ * have been sent and the map to convert the relation's tuples into the
+ * ancestor's format created before this can be set to be true.
*/
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
I suggest to insert a comma between 'created' and 'before'
because the sentence is a bit long and confusing.
Or, I thought another comment idea for this part, because the original
one doesn't care about the cycle of the reset.
"To avoid repetition to send the schema, this is set true after its first
transmission.
Reset when any change of the relation definition is possible. If
ancestor relid is set, its schema must have also been sent while the
map to convert the relation's tuples into the ancestor's format created,
before this flag is set true."
(2) comment in rel_sync_cache_relation_cb()
@@ -1190,13 +1208,25 @@ rel_sync_cache_relation_cb(Datum arg, Oid
relid)
HASH_FIND, NULL);
/*
- * Reset schema sent status as the relation definition may have changed.
+ * Reset schema sent status as the relation definition may have
changed,
+ * also freeing any objects that depended on the earlier definition.
How about divide this sentence into two sentences like "Reset schema
sent status as the relation definition may have changed.
Also, free any objects that depended on the earlier definition."
Thanks for reading it over. I have revised comments in a way that hopefully
addresses your concerns.
Thank you for your fix.
The patches look good to me.
Just in case, I'll report that the two patches passed
the regression tests as expected, and from my side
there are no more suggestions.
I think they are ready for committer.
Best Regards,
Takamichi Osumi
On Thu, May 20, 2021 at 5:59 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Tuesday, May 18, 2021 3:30 PM Amit Langote <amitlangote09@gmail.com> wrote:
While doing so, it occurred to me (maybe not for the first time) that we are
*unnecessarily* doing send_relation_and_attrs() for a relation if the changes
will be published using an ancestor's schema. In that case, sending only the
ancestor's schema suffices AFAICS. Changing the code that way doesn't
break any tests. I propose that we fix that too.
I've analyzed this new change's validity.
My conclusion for this is that we don't have
any bad impact from this, which means your additional fix is acceptable.
I think this addition blurs the purpose of the patch a bit, though.
Okay, I've extracted that change into 0002.
With the removal of that send_relation_and_attrs() call in the patch,
we no longer send the pair of LOGICAL_REP_MSG_TYPE ('Y') and
LOGICAL_REP_MSG_RELATION ('R') messages for the relation itself to the
subscriber when we publish via an ancestor. Therefore, with the patch, we
no longer register or update the type and relation for maybe_send_schema()'s
argument 'relation' in the case where the ancestor's schema is used.
However, both pgoutput_change() and pgoutput_truncate()
have conditions to check which OIDs to send to the subscriber for any operation.
Accordingly, the pair of information for that argument 'relation'
isn't used on the subscriber in that case, so we are fine.
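For reference, here is a minimal sketch of the control flow that maybe_send_schema() ends up with after the 0002 change (based on the HEAD patch attached below; the xid argument is absent on PG13, and streaming handling and map construction are elided):

    if (relentry->publish_as_relid != RelationGetRelid(relation))
    {
        Relation    ancestor = RelationIdGetRelation(relentry->publish_as_relid);

        /* Publishing via an ancestor: its schema is all the subscriber needs. */
        send_relation_and_attrs(ancestor, xid, ctx);
        /* ... build relentry->map here so tuples can be converted ... */
        RelationClose(ancestor);
    }
    else
        send_relation_and_attrs(relation, xid, ctx);    /* leaf's own schema */

    relentry->schema_sent = true;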
Thanks for checking that.
Here are updated/divided patches.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
HEAD-v5-0001-pgoutput-fix-memory-management-of-RelationSyncEnt.patch (application/octet-stream)
From 41c4eb055a2930bfc139fa555c77d4a8478ff1a6 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 18 May 2021 15:13:32 +0900
Subject: [PATCH v5 1/2] pgoutput: fix memory management of
RelationSyncEntry.map
Release memory allocated when creating the tuple-conversion map
and its component TupleDescs when its owning sync entry is
invalidated. TupleDescs must also be freed when no map is deemed
necessary to begin with.
---
src/backend/replication/pgoutput/pgoutput.c | 45 +++++++++++++++++----
src/test/subscription/t/013_partition.pl | 28 ++++++++++++-
2 files changed, 64 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348dcf4..de89b14308 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -92,10 +92,6 @@ typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
- /*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
- */
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
* schema */
@@ -437,10 +433,17 @@ maybe_send_schema(LogicalDecodingContext *ctx,
else
schema_sent = relentry->schema_sent;
+ /* Nothing to do if we already sent the schema. */
if (schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * Nope, so send the schema. If the changes will be published using an
+ * ancestor's schema, not the relation's own, send that ancestor's schema
+ * before sending relation's own (XXX - maybe sending only the former
+ * suffices?). This is also a good place to set the map that will be used
+ * to convert the relation's tuples into the ancestor's format, if needed.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -448,12 +451,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, xid, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
}
@@ -1191,12 +1208,24 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
/*
* Reset schema sent status as the relation definition may have changed.
+ * Also free any objects that depended on the earlier definition.
*/
if (entry != NULL)
{
entry->schema_sent = false;
list_free(entry->streamed_txns);
entry->streamed_txns = NIL;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
}
}
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index 6daf8daa3c..9de01017be 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -6,7 +6,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 54;
+use Test::More tests => 56;
# setup
@@ -624,3 +624,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
--
2.24.1
PG13-v5-0002-pgoutput-don-t-send-leaf-partition-schema-when-pu.patch (application/octet-stream)
From 7ff324deae7030be1b3cf9e1174c6ab0015245e3 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Thu, 20 May 2021 21:42:27 +0900
Subject: [PATCH v5 2/2] pgoutput: don't send leaf partition schema when
publishing via ancestor
---
src/backend/replication/pgoutput/pgoutput.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 4351643da1..7dde97176b 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -57,6 +57,10 @@ static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *c
/*
* Entry in the map used to remember which relation schemas we sent.
*
+ * The schema_sent flag determines if the current schema record for the
+ * relation (or for its ancestor if publish_as_relid is set) was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
* For partitions, 'pubactions' considers not only the table's own
* publications, but also those of all of its ancestors.
*/
@@ -295,9 +299,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
/*
* Nope, so send the schema. If the changes will be published using an
* ancestor's schema, not the relation's own, send that ancestor's schema
- * before sending relation's own (XXX - maybe sending only the former
- * suffices?). This is also a good place to set the map that will be used
- * to convert the relation's tuples into the ancestor's format, if needed.
+ * instead. This is also a good place to set the map that will be used to
+ * convert the relation's tuples into the ancestor's format, if needed.
*/
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
@@ -328,8 +331,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
MemoryContextSwitchTo(oldctx);
RelationClose(ancestor);
}
-
- send_relation_and_attrs(relation, ctx);
+ else
+ send_relation_and_attrs(relation, ctx);
relentry->schema_sent = true;
}
--
2.24.1
HEAD-v5-0002-pgoutput-don-t-send-leaf-partition-schema-when-pu.patch (application/octet-stream)
From 08c3f7979011fae982c3c32691c56ab3d4886c08 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Thu, 20 May 2021 21:35:09 +0900
Subject: [PATCH v5 2/2] pgoutput: don't send leaf partition schema when
publishing via ancestor
---
src/backend/replication/pgoutput/pgoutput.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index de89b14308..0b1ec102cb 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -74,7 +74,8 @@ static void send_relation_and_attrs(Relation relation, TransactionId xid,
/*
* Entry in the map used to remember which relation schemas we sent.
*
- * The schema_sent flag determines if the current schema record was already
+ * The schema_sent flag determines if the current schema record for the
+ * relation (or for its ancestor if publish_as_relid is set) was already
* sent to the subscriber (in which case we don't need to send it again).
*
* The schema cache on downstream is however updated only at commit time,
@@ -440,9 +441,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
/*
* Nope, so send the schema. If the changes will be published using an
* ancestor's schema, not the relation's own, send that ancestor's schema
- * before sending relation's own (XXX - maybe sending only the former
- * suffices?). This is also a good place to set the map that will be used
- * to convert the relation's tuples into the ancestor's format, if needed.
+ * instead. This is also a good place to set the map that will be used to
+ * convert the relation's tuples into the ancestor's format, if needed.
*/
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
@@ -473,8 +473,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
MemoryContextSwitchTo(oldctx);
RelationClose(ancestor);
}
-
- send_relation_and_attrs(relation, xid, ctx);
+ else
+ send_relation_and_attrs(relation, xid, ctx);
if (in_streaming)
set_schema_sent_in_streamed_txn(relentry, topxid);
--
2.24.1
PG13-v5-0001-pgoutput-fix-memory-management-for-RelationSyncEn.patch (application/octet-stream)
From 89206bf31b5b915f32befc1b676f706810ad6eb6 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 18 May 2021 14:52:04 +0900
Subject: [PATCH v5 1/2] pgoutput: fix memory management for
RelationSyncEntry.map
Release memory allocated when creating the tuple-conversion map
and its component TupleDescs when its owning sync entry is
invalidated. TupleDescs must also be freed when no map is deemed
necessary to begin with.
---
src/backend/replication/pgoutput/pgoutput.c | 48 +++++++++++++++++----
src/test/subscription/t/013_partition.pl | 28 +++++++++++-
2 files changed, 67 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c5fbebf55a..4351643da1 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -64,10 +64,6 @@ typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
- /*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
- */
bool schema_sent;
bool replicate_valid;
@@ -292,10 +288,17 @@ static void
maybe_send_schema(LogicalDecodingContext *ctx,
Relation relation, RelationSyncEntry *relentry)
{
+ /* Nothing to do if we already sent the schema. */
if (relentry->schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * Nope, so send the schema. If the changes will be published using an
+ * ancestor's schema, not the relation's own, send that ancestor's schema
+ * before sending relation's own (XXX - maybe sending only the former
+ * suffices?). This is also a good place to set the map that will be used
+ * to convert the relation's tuples into the ancestor's format, if needed.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -303,16 +306,31 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, ctx);
RelationClose(ancestor);
}
send_relation_and_attrs(relation, ctx);
+
relentry->schema_sent = true;
}
@@ -801,9 +819,23 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
/*
* Reset schema sent status as the relation definition may have changed.
+ * Also, free any objects that depended on the earlier definition.
*/
if (entry != NULL)
+ {
entry->schema_sent = false;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
+ }
}
/*
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index a04c03a7e2..a59de0b28c 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 51;
+use Test::More tests => 53;
# setup
@@ -532,3 +532,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
--
2.24.1
On Thursday, May 20, 2021 9:59 PM Amit Langote <amitlangote09@gmail.com> wrote:
Here are updated/divided patches.
Thanks for your updates.
However, I've detected segmentation faults caused by the patch,
which can happen during 100_bugs.pl in src/test/subscription.
This happens more than once in ten runs.
This problem seems to be a timing issue and was already introduced by v3.
I also used v5 for HEAD and reproduced this failure, while
OSS HEAD doesn't reproduce it, even when I executed 100_bugs.pl 200 times in a tight loop.
I aligned the commit id 4f586fe2 for all checks. The logs below are the ones I got from v3.
* The message of the failure during TAP test.
# Postmaster PID for node "twoways" is 5015
Waiting for replication conn testsub's replay_lsn to pass pg_current_wal_lsn() on twoways
# poll_query_until timed out executing this query:
# SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'testsub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# psql: error: connection to server on socket "/tmp/cs8dhFOtZZ/.s.PGSQL.59345" failed: No such file or directory
# Is the server running locally and accepting connections on that socket?
timed out waiting for catchup at t/100_bugs.pl line 148.
The failure produces a core file, and its backtrace is below.
My first guess at the cause is that, between getting an entry from hash_search() in get_rel_sync_entry()
and setting the map via convert_tuples_by_name() in maybe_send_schema(), an invalidation message arrived,
which then tried to free the not-yet-set descs in the entry?
* core file backtrace
Core was generated by `postgres: twoways: walsender k5user [local] START_REPLICATION '.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225
1225 FreeTupleDesc(entry->map->indesc);
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libicu-50.2-4.el7_7.x86_64 libstdc++-4.8.5-44.el7.x86_64
(gdb) bt
#0 0x00007f93b38b8c2b in rel_sync_cache_relation_cb (arg=0, relid=16388) at pgoutput.c:1225
#1 0x0000000000ae21f0 in LocalExecuteInvalidationMessage (msg=0x21d1de8) at inval.c:601
#2 0x00000000008dbd6e in ReorderBufferExecuteInvalidations (nmsgs=4, msgs=0x21d1db8) at reorderbuffer.c:3232
#3 0x00000000008da70a in ReorderBufferProcessTXN (rb=0x21d1a40, txn=0x2210b58, commit_lsn=25569096, snapshot_now=0x21d1e10, command_id=1, streaming=false)
at reorderbuffer.c:2294
#4 0x00000000008dae56 in ReorderBufferReplay (txn=0x2210b58, rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619,
origin_id=0, origin_lsn=0) at reorderbuffer.c:2591
#5 0x00000000008daede in ReorderBufferCommit (rb=0x21d1a40, xid=748, commit_lsn=25569096, end_lsn=25569216, commit_time=674891531661619, origin_id=0,
origin_lsn=0) at reorderbuffer.c:2615
#6 0x00000000008cae06 in DecodeCommit (ctx=0x21e6880, buf=0x7fffb9383db0, parsed=0x7fffb9383c10, xid=748, two_phase=false) at decode.c:744
#7 0x00000000008ca1ed in DecodeXactOp (ctx=0x21e6880, buf=0x7fffb9383db0) at decode.c:278
#8 0x00000000008c9e76 in LogicalDecodingProcessRecord (ctx=0x21e6880, record=0x21e6c80) at decode.c:142
#9 0x0000000000901fcc in XLogSendLogical () at walsender.c:2876
#10 0x0000000000901327 in WalSndLoop (send_data=0x901f30 <XLogSendLogical>) at walsender.c:2306
#11 0x00000000008ffd5f in StartLogicalReplication (cmd=0x219aff8) at walsender.c:1206
#12 0x00000000009006ae in exec_replication_command (
cmd_string=0x2123c20 "START_REPLICATION SLOT \"pg_16400_sync_16392_6964617299612181363\" LOGICAL 0/182D058 (proto_version '2', publication_names '\"testpub\"')") at walsender.c:1646
#13 0x000000000096ffae in PostgresMain (argc=1, argv=0x7fffb93840d0, dbname=0x214ef18 "d1", username=0x214eef8 "k5user") at postgres.c:4482
I'll update when I get more information.
Best Regards,
Takamichi Osumi
On Friday, May 21, 2021 3:55 PM I wrote:
On Thursday, May 20, 2021 9:59 PM Amit Langote <amitlangote09@gmail.com> wrote:
Here are updated/divided patches.
Thanks for your updates.
However, I've detected segmentation faults caused by the patch, which can
happen during 100_bugs.pl in src/test/subscription.
This happens more than once in ten runs.
This problem seems to be a timing issue and was already introduced by v3.
I also used v5 for HEAD and reproduced this failure, while OSS HEAD doesn't
reproduce it, even when I executed 100_bugs.pl 200 times in a tight loop.
I aligned the commit id 4f586fe2 for all checks. The logs below are the ones I got from v3.
* The message of the failure during TAP test.
# Postmaster PID for node "twoways" is 5015
Waiting for replication conn testsub's replay_lsn to pass pg_current_wal_lsn() on twoways
# poll_query_until timed out executing this query:
# SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication WHERE application_name = 'testsub';
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# psql: error: connection to server on socket
"/tmp/cs8dhFOtZZ/.s.PGSQL.59345" failed: No such file or directory
# Is the server running locally and accepting connections on that
socket?
timed out waiting for catchup at t/100_bugs.pl line 148.
The failure produces a core file, and its backtrace is below.
My first guess at the cause is that, between getting an entry from
hash_search() in get_rel_sync_entry() and setting the map via
convert_tuples_by_name() in maybe_send_schema(), an invalidation
message arrived, which then tried to free the not-yet-set descs in the entry?
Sorry, this guess was not accurate at all.
Please ignore it, because entry->map needs to be set before we would
free the descs. Sorry for the noise.
Best Regards,
Takamichi Osumi
On Fri, May 21, 2021 at 3:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Thursday, May 20, 2021 9:59 PM Amit Langote <amitlangote09@gmail.com> wrote:
Here are updated/divided patches.
Thanks for your updates.
But, I've detected segmentation faults caused by the patch,
which can happen during 100_bugs.pl in src/test/subscription.
This happens more than once in ten runs.
This problem seems to be a timing issue and was already introduced by v3.
I also used v5 for HEAD and reproduced this failure, while
OSS HEAD doesn't reproduce it, even when I executed 100_bugs.pl 200 times in a tight loop.
I aligned the commit id 4f586fe2 for all checks. The logs below are the ones I got from v3.
My first guess at the cause is that, between getting an entry from hash_search() in get_rel_sync_entry()
and setting the map via convert_tuples_by_name() in maybe_send_schema(), an invalidation message arrived,
which then tried to free the not-yet-set descs in the entry?
Hmm, maybe get_rel_sync_entry() should explicitly set map to NULL when
first initializing an entry. It's possible that without doing so, the
map remains set to a garbage value, which causes the invalidation
callback that runs into such a partially initialized entry to segfault
upon trying to dereference that garbage pointer.
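For reference, a minimal sketch of the proposed initialization, based on the existing code in get_rel_sync_entry() (the exact set of fields initialized differs between branches, so treat this as an illustration rather than the patch itself):

    entry = (RelationSyncEntry *) hash_search(RelationSyncCache,
                                              (void *) &relid,
                                              HASH_ENTER, &found);
    if (!found)
    {
        /* dynahash does not zero a newly created entry, so set every field. */
        entry->schema_sent = false;
        entry->replicate_valid = false;
        entry->pubactions.pubinsert = entry->pubactions.pubupdate =
            entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
        entry->publish_as_relid = InvalidOid;
        entry->map = NULL;      /* keep the invalidation callback from freeing garbage */
    }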
I've tried that in the attached v6 patches. Please check.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
HEAD-v6-0001-pgoutput-fix-memory-management-of-RelationSyncEnt.patch (application/octet-stream)
From 3f232c70285442569fb0914903079c4888c73677 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 18 May 2021 15:13:32 +0900
Subject: [PATCH v6 1/2] pgoutput: fix memory management of
RelationSyncEntry.map
Release memory allocated when creating the tuple-conversion map
and its component TupleDescs when its owning sync entry is
invalidated. TupleDescs must also be freed when no map is deemed
necessary to begin with.
---
src/backend/replication/pgoutput/pgoutput.c | 46 +++++++++++++++++----
src/test/subscription/t/013_partition.pl | 28 ++++++++++++-
2 files changed, 65 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348dcf4..7ace9a3c42 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -92,10 +92,6 @@ typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
- /*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
- */
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
* schema */
@@ -437,10 +433,17 @@ maybe_send_schema(LogicalDecodingContext *ctx,
else
schema_sent = relentry->schema_sent;
+ /* Nothing to do if we already sent the schema. */
if (schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * Nope, so send the schema. If the changes will be published using an
+ * ancestor's schema, not the relation's own, send that ancestor's schema
+ * before sending relation's own (XXX - maybe sending only the former
+ * suffices?). This is also a good place to set the map that will be used
+ * to convert the relation's tuples into the ancestor's format, if needed.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -448,12 +451,26 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, xid, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
}
@@ -1011,6 +1028,7 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
entry->pubactions.pubinsert = entry->pubactions.pubupdate =
entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
entry->publish_as_relid = InvalidOid;
+ entry->map = NULL; /* will be set by maybe_send_schema() if needed */
}
/* Validate the entry */
@@ -1191,12 +1209,24 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
/*
* Reset schema sent status as the relation definition may have changed.
+ * Also free any objects that depended on the earlier definition.
*/
if (entry != NULL)
{
entry->schema_sent = false;
list_free(entry->streamed_txns);
entry->streamed_txns = NIL;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
}
}
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index 6daf8daa3c..9de01017be 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -6,7 +6,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 54;
+use Test::More tests => 56;
# setup
@@ -624,3 +624,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
--
2.24.1
HEAD-v6-0002-pgoutput-don-t-send-leaf-partition-schema-when-pu.patch (application/octet-stream)
From 5688e0a6fd0e6fdb12367b4cbf463c8779460833 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Thu, 20 May 2021 21:35:09 +0900
Subject: [PATCH v6 2/2] pgoutput: don't send leaf partition schema when
publishing via ancestor
---
src/backend/replication/pgoutput/pgoutput.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 7ace9a3c42..1593ce442c 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -74,7 +74,8 @@ static void send_relation_and_attrs(Relation relation, TransactionId xid,
/*
* Entry in the map used to remember which relation schemas we sent.
*
- * The schema_sent flag determines if the current schema record was already
+ * The schema_sent flag determines if the current schema record for the
+ * relation (or for its ancestor if publish_as_relid is set) was already
* sent to the subscriber (in which case we don't need to send it again).
*
* The schema cache on downstream is however updated only at commit time,
@@ -440,9 +441,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
/*
* Nope, so send the schema. If the changes will be published using an
* ancestor's schema, not the relation's own, send that ancestor's schema
- * before sending relation's own (XXX - maybe sending only the former
- * suffices?). This is also a good place to set the map that will be used
- * to convert the relation's tuples into the ancestor's format, if needed.
+ * instead. This is also a good place to set the map that will be used to
+ * convert the relation's tuples into the ancestor's format, if needed.
*/
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
@@ -473,8 +473,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
MemoryContextSwitchTo(oldctx);
RelationClose(ancestor);
}
-
- send_relation_and_attrs(relation, xid, ctx);
+ else
+ send_relation_and_attrs(relation, xid, ctx);
if (in_streaming)
set_schema_sent_in_streamed_txn(relentry, topxid);
--
2.24.1
PG13-v6-0002-pgoutput-don-t-send-leaf-partition-schema-when-pu.patch (application/octet-stream)
From bec60feed2fd4a9bef2cbdbc934f52b73e9f9aa1 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Thu, 20 May 2021 21:42:27 +0900
Subject: [PATCH v6 2/2] pgoutput: don't send leaf partition schema when
publishing via ancestor
---
src/backend/replication/pgoutput/pgoutput.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c691c063cd..4a5eb9ccd1 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -57,6 +57,10 @@ static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *c
/*
* Entry in the map used to remember which relation schemas we sent.
*
+ * The schema_sent flag determines if the current schema record for the
+ * relation (or for its ancestor if publish_as_relid is set) was already
+ * sent to the subscriber (in which case we don't need to send it again).
+ *
* For partitions, 'pubactions' considers not only the table's own
* publications, but also those of all of its ancestors.
*/
@@ -295,9 +299,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
/*
* Nope, so send the schema. If the changes will be published using an
* ancestor's schema, not the relation's own, send that ancestor's schema
- * before sending relation's own (XXX - maybe sending only the former
- * suffices?). This is also a good place to set the map that will be used
- * to convert the relation's tuples into the ancestor's format, if needed.
+ * instead. This is also a good place to set the map that will be used to
+ * convert the relation's tuples into the ancestor's format, if needed.
*/
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
@@ -328,8 +331,8 @@ maybe_send_schema(LogicalDecodingContext *ctx,
MemoryContextSwitchTo(oldctx);
RelationClose(ancestor);
}
-
- send_relation_and_attrs(relation, ctx);
+ else
+ send_relation_and_attrs(relation, ctx);
relentry->schema_sent = true;
}
--
2.24.1
PG13-v6-0001-pgoutput-fix-memory-management-for-RelationSyncEn.patch (application/octet-stream)
From e6bca40314cc515b33b27247768bff05cd42eeab Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 18 May 2021 14:52:04 +0900
Subject: [PATCH v6 1/2] pgoutput: fix memory management for
RelationSyncEntry.map
Release memory allocated when creating the tuple-conversion map
and its component TupleDescs when its owning sync entry is
invalidated. TupleDescs must also be freed when no map is deemed
necessary to begin with.
---
src/backend/replication/pgoutput/pgoutput.c | 49 +++++++++++++++++----
src/test/subscription/t/013_partition.pl | 28 +++++++++++-
2 files changed, 68 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c5fbebf55a..c691c063cd 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -64,10 +64,6 @@ typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
- /*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
- */
bool schema_sent;
bool replicate_valid;
@@ -292,10 +288,17 @@ static void
maybe_send_schema(LogicalDecodingContext *ctx,
Relation relation, RelationSyncEntry *relentry)
{
+ /* Nothing to do if we already sent the schema. */
if (relentry->schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * Nope, so send the schema. If the changes will be published using an
+ * ancestor's schema, not the relation's own, send that ancestor's schema
+ * before sending relation's own (XXX - maybe sending only the former
+ * suffices?). This is also a good place to set the map that will be used
+ * to convert the relation's tuples into the ancestor's format, if needed.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -303,16 +306,31 @@ maybe_send_schema(LogicalDecodingContext *ctx,
TupleDesc outdesc = RelationGetDescr(ancestor);
MemoryContext oldctx;
+ send_relation_and_attrs(ancestor, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, ctx);
RelationClose(ancestor);
}
send_relation_and_attrs(relation, ctx);
+
relentry->schema_sent = true;
}
@@ -759,6 +777,7 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
list_free(pubids);
entry->publish_as_relid = publish_as_relid;
+ entry->map = NULL; /* will be set by maybe_send_schema() if needed */
entry->replicate_valid = true;
}
@@ -801,9 +820,23 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
/*
* Reset schema sent status as the relation definition may have changed.
+ * Also, free any objects that depended on the earlier definition.
*/
if (entry != NULL)
+ {
entry->schema_sent = false;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
+ }
}
/*
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index a04c03a7e2..a59de0b28c 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 51;
+use Test::More tests => 53;
# setup
@@ -532,3 +532,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
--
2.24.1
On Friday, May 21, 2021 4:43 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, May 21, 2021 at 3:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
But, I've detected segmentation faults caused by the patch, which can
happen during 100_bugs.pl in src/test/subscription.
This happens more than once in ten runs.
This problem seems to be a timing issue and was already introduced by v3.
I also used v5 for HEAD and reproduced this failure, while OSS HEAD
doesn't reproduce it, even when I executed 100_bugs.pl 200 times in a tight loop.
I aligned the commit id 4f586fe2 for all checks. The logs below are the ones
I got from v3.
My first guess at the cause is that, between getting an entry
from hash_search() in get_rel_sync_entry() and setting the map via
convert_tuples_by_name() in maybe_send_schema(), an invalidation message
arrived, which then tried to free the not-yet-set descs in the entry?
Hmm, maybe get_rel_sync_entry() should explicitly set map to NULL when first
initializing an entry. It's possible that without doing so, the map remains set
to a garbage value, which causes the invalidation callback that runs into such
a partially initialized entry to segfault upon trying to dereference that garbage
pointer.
Just in case, I prepared a fresh PG build and
added an elog to get_rel_sync_entry() to print the map pointer's initial value.
When I executed 100_bugs.pl, I got some garbage values, shown below.
* The change I did:
@@ -1011,6 +1011,7 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
entry->pubactions.pubinsert = entry->pubactions.pubupdate =
entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
entry->publish_as_relid = InvalidOid;
+ elog(LOG, "**> the pointer's default value : %p", entry->map);
}
* Grep result of all logs from 100_bugs.pl
2021-05-21 09:05:56.132 UTC [29122] sub1 LOG: **> the pointer's default value : (nil)
2021-05-21 09:06:11.556 UTC [30198] testsub1 LOG: **> the pointer's default value : (nil)
2021-05-21 09:06:11.561 UTC [30200] pg_16389_sync_16384_6964667281140237667 LOG: **> the pointer's default value : 0x7f7f7f7f7f7f7f7f
2021-05-21 09:06:11.567 UTC [30191] testsub2 LOG: **> the pointer's default value : (nil)
2021-05-21 09:06:11.570 UTC [30194] pg_16387_sync_16384_6964667292923737489 LOG: **> the pointer's default value : 0x7f7f7f7f7f7f7f7f
2021-05-21 09:06:02.513 UTC [29809] testsub LOG: **> the pointer's default value : (nil)
2021-05-21 09:06:02.557 UTC [29809] testsub LOG: **> the pointer's default value : (nil)
So, your solution is right, I think.
I've tried that in the attached v6 patches. Please check.
With this fix, I don't get the failure.
I executed 100_bugs.pl 100 times in a loop and didn't hit the problem.
Again, I ran one make check-world for each combination of:
* OSS HEAD or PG13
* only the first patch or both patches applied
All of those passed.
Best Regards,
Takamichi Osumi
On Friday, May 21, 2021 9:45 PM I wrote:
On Friday, May 21, 2021 4:43 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, May 21, 2021 at 3:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
But, I've detected segmentation faults caused by the patch, which
can happen during 100_bugs.pl in src/test/subscription.
Hmm, maybe get_rel_sync_entry() should explicitly set map to NULL when
first initializing an entry. It's possible that without doing so, the
map remains set to a garbage value, which causes the invalidation
callback that runs into such a partially initialized entry to segfault
upon trying to dereference that garbage pointer.
Just in case, I prepared a fresh PG build and
added an elog to get_rel_sync_entry() to print the map pointer's initial value.
When I executed 100_bugs.pl, I got some garbage values, shown below.
* The change I did:
@@ -1011,6 +1011,7 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
entry->pubactions.pubinsert = entry->pubactions.pubupdate =
entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
entry->publish_as_relid = InvalidOid;
+ elog(LOG, "**> the pointer's default value : %p", entry->map);
}
(snip)
So, your solution is right, I think.
That check was a bit indirect, though.
I've also examined the core file from the v3 failure and printed the entry
to get more confidence. Sorry for using an inappropriate measure to verify the solution.
$1 = {relid = 16388, schema_sent = false, streamed_txns = 0x0, replicate_valid = false, pubactions = {pubinsert = false, pubupdate = false, pubdelete = false, pubtruncate = false}, publish_as_relid = 16388,
map = 0x7f7f7f7f7f7f7f7f}
Yes, the process tried to free garbage.
Now, we are convinced that we have addressed the problem. That's it !
Best Regards,
Takamichi Osumi
On Sat, May 22, 2021 at 11:00 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Friday, May 21, 2021 9:45 PM I wrote:
On Friday, May 21, 2021 4:43 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Fri, May 21, 2021 at 3:55 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
But, I've detected segmentation faults caused by the patch, which
can happen during 100_bugs.pl in src/test/subscription.
Hmm, maybe get_rel_sync_entry() should explicitly set map to NULL when
first initializing an entry. It's possible that without doing so, the
map remains set to a garbage value, which causes the invalidation
callback that runs into such a partially initialized entry to segfault
upon trying to dereference that garbage pointer.
Just in case, I prepared a fresh PG build and
added an elog to get_rel_sync_entry() to print the map pointer's initial value.
When I executed 100_bugs.pl, I got some garbage values, shown below.
* The change I did:
@@ -1011,6 +1011,7 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
entry->pubactions.pubinsert = entry->pubactions.pubupdate =
entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
entry->publish_as_relid = InvalidOid;
+ elog(LOG, "**> the pointer's default value : %p", entry->map);
}
(snip)
So, your solution is right, I think.
That check was a bit indirect, though.
I've also examined the core file from the v3 failure and printed the entry
to get more confidence. Sorry for using an inappropriate measure to verify the solution.
$1 = {relid = 16388, schema_sent = false, streamed_txns = 0x0, replicate_valid = false, pubactions = {pubinsert = false, pubupdate = false, pubdelete = false, pubtruncate = false}, publish_as_relid = 16388,
map = 0x7f7f7f7f7f7f7f7f}
Yes, the process tried to free garbage.
Now, we are convinced that we have addressed the problem. That's it !
Thanks for confirming that.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Saturday, May 22, 2021 11:58 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, May 22, 2021 at 11:00 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
I've also examined the core file from the v3 failure and printed the entry
to get more confidence. Sorry for using an inappropriate measure to verify the solution.
$1 = {relid = 16388, schema_sent = false, streamed_txns = 0x0,
replicate_valid = false, pubactions = {pubinsert = false, pubupdate = false,
pubdelete = false, pubtruncate = false}, publish_as_relid = 16388,
map = 0x7f7f7f7f7f7f7f7f}
Yes, the process tried to free garbage.
Now, we are convinced that we have addressed the problem. That's it!
Thanks for confirming that.
Langote-san, I need to report another issue.
When I additionally run make check-world with v6,
I get another failure, roughly once in
20 runs of make check-world with v6.
The test ended with the stderr output below.
NOTICE: database "regression" does not exist, skipping
make[2]: *** [check] Error 1
make[1]: *** [check-isolation-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2
make: *** Waiting for unfinished jobs....
And I had ./src/test/isolation/output_iso/regression.diffs and regression.out,
which told me the following.
test detach-partition-concurrently-1 ... ok 705 ms
test detach-partition-concurrently-2 ... ok 260 ms
test detach-partition-concurrently-3 ... FAILED 618 ms
test detach-partition-concurrently-4 ... ok 1384 ms
The diffs file showed the following.
diff -U3 /home/k5user/new_disk/repro_fail_v6/src/test/isolation/expected/detach-partition-concurrently-3.out /home/k5user/new_disk/repro_fail_v6/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- /home/k5user/new_disk/repro_fail_v6/src/test/isolation/expected/detach-partition-concurrently-3.out 2021-05-24 01:22:22.381488295 +0000
+++ /home/k5user/new_disk/repro_fail_v6/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out 2021-05-24 02:47:08.292488295 +0000
@@ -190,7 +190,7 @@
t
step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR: canceling statement due to user request
+ERROR: canceling statement due to user request
step s2detach2: ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY;
ERROR: partition "d3_listp1" already pending detach in partitioned table "public.d3_listp"
step s1c: COMMIT;
I'm not sure whether this is related to the patch or whether it already exists on OSS HEAD.
FYI: the steps I did are
1 - clone PG(I used f5024d8)
2 - git am the 2 patches for HEAD
* HEAD-v6-0001-pgoutput-fix-memory-management-of-RelationSyncEnt.patch
* HEAD-v6-0002-pgoutput-don-t-send-leaf-partition-schema-when-pu.patch
3 - configure with --enable-cassert --enable-debug --enable-tap-tests --with-icu CFLAGS=-O0 --prefix=/where/you/wanna/put/PG
4 - make -j2 2> make.log # did not get stderr output.
5 - make check-world -j8 2> make_check_world.log
(after this, I ran another tight-loop test repeating make check-world and got the error)
Best Regards,
Takamichi Osumi
On Mon, May 24, 2021 at 12:16 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
On Saturday, May 22, 2021 11:58 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Sat, May 22, 2021 at 11:00 AM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
I've also examined the core file from the v3 failure and printed the entry
to get more confidence. Sorry for using an inappropriate measure to verify the solution.
$1 = {relid = 16388, schema_sent = false, streamed_txns = 0x0,
replicate_valid = false, pubactions = {pubinsert = false, pubupdate = false,
pubdelete = false, pubtruncate = false}, publish_as_relid = 16388,
map = 0x7f7f7f7f7f7f7f7f}
Yes, the process tried to free garbage.
Now, we are convinced that we have addressed the problem. That's it!
Thanks for confirming that.
Langote-san, I need to report another issue.
Thanks for continued testing.
When I additionally run make check-world with v6,
I get another failure, roughly once in
20 runs of make check-world with v6.
The test ended with the stderr output below.
NOTICE: database "regression" does not exist, skipping
make[2]: *** [check] Error 1
make[1]: *** [check-isolation-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [check-world-src/test-recurse] Error 2
make: *** Waiting for unfinished jobs....
And I had ./src/test/isolation/output_iso/regression.diffs and regression.out,
which told me the following.
test detach-partition-concurrently-1 ... ok 705 ms
test detach-partition-concurrently-2 ... ok 260 ms
test detach-partition-concurrently-3 ... FAILED 618 ms
test detach-partition-concurrently-4 ... ok 1384 ms
The diffs file showed the following.
diff -U3 /home/k5user/new_disk/repro_fail_v6/src/test/isolation/expected/detach-partition-concurrently-3.out /home/k5user/new_disk/repro_fail_v6/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out
--- /home/k5user/new_disk/repro_fail_v6/src/test/isolation/expected/detach-partition-concurrently-3.out 2021-05-24 01:22:22.381488295 +0000
+++ /home/k5user/new_disk/repro_fail_v6/src/test/isolation/output_iso/results/detach-partition-concurrently-3.out 2021-05-24 02:47:08.292488295 +0000
@@ -190,7 +190,7 @@
t
step s2detach: <... completed>
-error in steps s1cancel s2detach: ERROR: canceling statement due to user request
+ERROR: canceling statement due to user request
step s2detach2: ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY;
ERROR: partition "d3_listp1" already pending detach in partitioned table "public.d3_listp"
step s1c: COMMIT;
I'm not sure whether this is related to the patch or whether it already exists on OSS HEAD.
Hmm, I doubt it would be this patch's fault. Maybe we still have some
unresolved issues with DETACH PARTITION CONCURRENTLY. I suggest you
report this in a new thread, preferably after you confirm that it's
reproducible.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Monday, May 24, 2021 12:23 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, May 24, 2021 at 12:16 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
When I additionally run make check-world with v6, I get
another failure, roughly once in
20 runs of make check-world with v6.
Hmm, I doubt it would be this patch's fault. Maybe we still have some
unresolved issues with DETACH PARTITION CONCURRENTLY. I suggest
you report this in a new thread, preferably after you confirm that it's
reproducible.
OK, I'll do so when I get this with OSS HEAD.
Best Regards,
Takamichi Osumi
On Monday, May 24, 2021 12:57 PM I wrote:
On Monday, May 24, 2021 12:23 PM Amit Langote <amitlangote09@gmail.com> wrote:
On Mon, May 24, 2021 at 12:16 PM osumi.takamichi@fujitsu.com
<osumi.takamichi@fujitsu.com> wrote:
When I additionally run make check-world with v6, I get
another failure, roughly once in
20 runs of make check-world with v6.
Hmm, I doubt it would be this patch's fault. Maybe we still have some
unresolved issues with DETACH PARTITION CONCURRENTLY. I suggest you
report this in a new thread, preferably after you confirm that it's
reproducible.
OK, I'll do so when I get this with OSS HEAD.
I've just reported this on -hackers in a separate thread.
It was not an issue with this patch.
Also, I have no further suggestions for the patch set you shared.
Best Regards,
Takamichi Osumi
On Fri, May 21, 2021 at 1:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
Hmm, maybe get_rel_sync_entry() should explicitly set map to NULL when
first initializing an entry. It's possible that without doing so, the
map remains set to a garbage value, which causes the invalidation
callback that runs into such a partially initialized entry to segfault
upon trying to dereference that garbage pointer.
I've tried that in the attached v6 patches. Please check.
v6-0001
=========
+ send_relation_and_attrs(ancestor, xid, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, xid, ctx);
Why do we need to move the send_relation_and_attrs() call? I think it
doesn't matter much either way, but OTOH, in the existing code, if
there is an error (say, 'out of memory' or some other) while building
the map, we won't have sent the relation attrs, whereas with your change we
will have sent them unnecessarily in such a case.
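For reference, the ordering being defended here looks roughly like this (a sketch assembled from the hunk quoted above, not the exact committed code; the xid argument is the HEAD signature):

    /*
     * Build the conversion map first; if this errors out (say, OOM), no
     * schema message has been sent yet.
     */
    oldctx = MemoryContextSwitchTo(CacheMemoryContext);
    indesc = CreateTupleDescCopy(indesc);
    outdesc = CreateTupleDescCopy(outdesc);
    relentry->map = convert_tuples_by_name(indesc, outdesc);
    if (relentry->map == NULL)
    {
        /* Map not necessary, so free the copied TupleDescs too. */
        FreeTupleDesc(indesc);
        FreeTupleDesc(outdesc);
    }
    MemoryContextSwitchTo(oldctx);

    /* Only now emit the ancestor's schema message. */
    send_relation_and_attrs(ancestor, xid, ctx);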
I feel there is no need to backpatch v6-0002. We can just make it a
HEAD-only change as that doesn't cause any bug even though it is
better not to send it. If we consider it as a HEAD-only change then
probably we can register it for PG-15. What do you think?
--
With Regards,
Amit Kapila.
On Thu, May 27, 2021 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 21, 2021 at 1:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
Hmm, maybe get_rel_sync_entry() should explicitly set map to NULL when
first initializing an entry. It's possible that without doing so, the
map remains set to a garbage value, which causes the invalidation
callback that runs into such a partially initialized entry to segfault
upon trying to dereference that garbage pointer.
I've tried that in the attached v6 patches. Please check.
v6-0001
=========
+ send_relation_and_attrs(ancestor, xid, ctx);
+
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
- send_relation_and_attrs(ancestor, xid, ctx);
Why do we need to move the send_relation_and_attrs() call? I think it
doesn't matter much either way, but OTOH, in the existing code, if
there is an error (say, 'out of memory' or some other) while building
the map, we won't have sent the relation attrs, whereas with your change we
will have sent them unnecessarily in such a case.
That's a good point. I've reverted that change in the attached.
I feel there is no need to backpatch v6-0002. We can just make it a
HEAD-only change as that doesn't cause any bug even though it is
better not to send it. If we consider it as a HEAD-only change then
probably we can register it for PG-15. What do you think?
Okay, I will see about creating a PG15 CF entry for 0002.
Please see attached v7-0001 with the part mentioned above fixed.
--
Amit Langote
EDB: http://www.enterprisedb.com
Attachments:
PG13-v7-0001-pgoutput-fix-memory-management-for-RelationSyncEn.patch (application/octet-stream)
From fb75973b016fd3777e108c098c65caa99359a701 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 18 May 2021 14:52:04 +0900
Subject: [PATCH v7] pgoutput: fix memory management for RelationSyncEntry.map
Release memory allocated when creating the tuple-conversion map
and its component TupleDescs when its owning sync entry is
invalidated. TupleDescs must also be freed when no map is deemed
necessary to begin with.
---
src/backend/replication/pgoutput/pgoutput.c | 49 ++++++++++++++++++---
src/test/subscription/t/013_partition.pl | 28 +++++++++++-
2 files changed, 69 insertions(+), 8 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index c5fbebf55a..d9e45bab4a 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -57,6 +57,10 @@ static void send_relation_and_attrs(Relation relation, LogicalDecodingContext *c
/*
* Entry in the map used to remember which relation schemas we sent.
*
+ * The schema_sent flag determines if the current schema record for the
+ * relation (and for its ancestor if publish_as_relid is set) was already sent
+ * to the subscriber (in which case we don't need to send it again).
+ *
* For partitions, 'pubactions' considers not only the table's own
* publications, but also those of all of its ancestors.
*/
@@ -64,10 +68,6 @@ typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
- /*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
- */
bool schema_sent;
bool replicate_valid;
@@ -292,10 +292,17 @@ static void
maybe_send_schema(LogicalDecodingContext *ctx,
Relation relation, RelationSyncEntry *relentry)
{
+ /* Nothing to do if we already sent the schema. */
if (relentry->schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * Nope, so send the schema. If the changes will be published using an
+ * ancestor's schema, not the relation's own, send that ancestor's schema
+ * before sending relation's own (XXX - maybe sending only the former
+ * suffices?). This is also a good place to set the map that will be used
+ * to convert the relation's tuples into the ancestor's format, if needed.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -305,8 +312,21 @@ maybe_send_schema(LogicalDecodingContext *ctx,
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, ctx);
RelationClose(ancestor);
@@ -759,6 +779,7 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
list_free(pubids);
entry->publish_as_relid = publish_as_relid;
+ entry->map = NULL; /* will be set by maybe_send_schema() if needed */
entry->replicate_valid = true;
}
@@ -801,9 +822,23 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
/*
* Reset schema sent status as the relation definition may have changed.
+ * Also, free any objects that depended on the earlier definition.
*/
if (entry != NULL)
+ {
entry->schema_sent = false;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
+ }
}
/*
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index 4b7d637c70..046e18c700 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -3,7 +3,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 54;
+use Test::More tests => 56;
# setup
@@ -621,3 +621,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
--
2.24.1
HEAD-v7-0001-pgoutput-fix-memory-management-of-RelationSyncEnt.patch
From 5e66b1c533ed07ea2d7d8e19eedf38046cd96547 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 18 May 2021 15:13:32 +0900
Subject: [PATCH v7 1/2] pgoutput: fix memory management of
RelationSyncEntry.map
Release memory allocated when creating the tuple-conversion map
and its component TupleDescs when its owning sync entry is
invalidated. TupleDescs must also be freed when no map is deemed
necessary to begin with.
---
src/backend/replication/pgoutput/pgoutput.c | 46 +++++++++++++++++----
src/test/subscription/t/013_partition.pl | 28 ++++++++++++-
2 files changed, 65 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index f68348dcf4..fe12d08a94 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -74,7 +74,8 @@ static void send_relation_and_attrs(Relation relation, TransactionId xid,
/*
* Entry in the map used to remember which relation schemas we sent.
*
- * The schema_sent flag determines if the current schema record was already
+ * The schema_sent flag determines if the current schema record for the
+ * relation (and for its ancestor if publish_as_relid is set) was already
* sent to the subscriber (in which case we don't need to send it again).
*
* The schema cache on downstream is however updated only at commit time,
@@ -92,10 +93,6 @@ typedef struct RelationSyncEntry
{
Oid relid; /* relation oid */
- /*
- * Did we send the schema? If ancestor relid is set, its schema must also
- * have been sent for this to be true.
- */
bool schema_sent;
List *streamed_txns; /* streamed toplevel transactions with this
* schema */
@@ -437,10 +434,17 @@ maybe_send_schema(LogicalDecodingContext *ctx,
else
schema_sent = relentry->schema_sent;
+ /* Nothing to do if we already sent the schema. */
if (schema_sent)
return;
- /* If needed, send the ancestor's schema first. */
+ /*
+ * Nope, so send the schema. If the changes will be published using an
+ * ancestor's schema, not the relation's own, send that ancestor's schema
+ * before sending relation's own (XXX - maybe sending only the former
+ * suffices?). This is also a good place to set the map that will be used
+ * to convert the relation's tuples into the ancestor's format, if needed.
+ */
if (relentry->publish_as_relid != RelationGetRelid(relation))
{
Relation ancestor = RelationIdGetRelation(relentry->publish_as_relid);
@@ -450,8 +454,21 @@ maybe_send_schema(LogicalDecodingContext *ctx,
/* Map must live as long as the session does. */
oldctx = MemoryContextSwitchTo(CacheMemoryContext);
- relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
- CreateTupleDescCopy(outdesc));
+
+ /*
+ * Make copies of the TupleDescs that will live as long as the map
+ * does before putting into the map.
+ */
+ indesc = CreateTupleDescCopy(indesc);
+ outdesc = CreateTupleDescCopy(outdesc);
+ relentry->map = convert_tuples_by_name(indesc, outdesc);
+ if (relentry->map == NULL)
+ {
+ /* Map not necessary, so free the TupleDescs too. */
+ FreeTupleDesc(indesc);
+ FreeTupleDesc(outdesc);
+ }
+
MemoryContextSwitchTo(oldctx);
send_relation_and_attrs(ancestor, xid, ctx);
RelationClose(ancestor);
@@ -1011,6 +1028,7 @@ get_rel_sync_entry(PGOutputData *data, Oid relid)
entry->pubactions.pubinsert = entry->pubactions.pubupdate =
entry->pubactions.pubdelete = entry->pubactions.pubtruncate = false;
entry->publish_as_relid = InvalidOid;
+ entry->map = NULL; /* will be set by maybe_send_schema() if needed */
}
/* Validate the entry */
@@ -1191,12 +1209,24 @@ rel_sync_cache_relation_cb(Datum arg, Oid relid)
/*
* Reset schema sent status as the relation definition may have changed.
+ * Also free any objects that depended on the earlier definition.
*/
if (entry != NULL)
{
entry->schema_sent = false;
list_free(entry->streamed_txns);
entry->streamed_txns = NIL;
+ if (entry->map)
+ {
+ /*
+ * Must free the TupleDescs contained in the map explicitly,
+ * because free_conversion_map() doesn't.
+ */
+ FreeTupleDesc(entry->map->indesc);
+ FreeTupleDesc(entry->map->outdesc);
+ free_conversion_map(entry->map);
+ }
+ entry->map = NULL;
}
}
diff --git a/src/test/subscription/t/013_partition.pl b/src/test/subscription/t/013_partition.pl
index 6daf8daa3c..9de01017be 100644
--- a/src/test/subscription/t/013_partition.pl
+++ b/src/test/subscription/t/013_partition.pl
@@ -6,7 +6,7 @@ use strict;
use warnings;
use PostgresNode;
use TestLib;
-use Test::More tests => 54;
+use Test::More tests => 56;
# setup
@@ -624,3 +624,29 @@ is($result, qq(), 'truncate of tab3 replicated');
$result = $node_subscriber2->safe_psql('postgres', "SELECT a FROM tab3_1");
is($result, qq(), 'truncate of tab3_1 replicated');
+
+# check that the map to convert tuples from leaf partition to the root
+# table is correctly rebuilt when a new column is added
+$node_publisher->safe_psql('postgres',
+ "ALTER TABLE tab2 DROP b, ADD COLUMN c text DEFAULT 'pub_tab2', ADD b text");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b) VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz')");
+$node_publisher->safe_psql('postgres',
+ "INSERT INTO tab2 (a, b, c) VALUES (6, 'aaa', 'xxx_c')");
+
+$node_publisher->wait_for_catchup('sub_viaroot');
+$node_publisher->wait_for_catchup('sub2');
+
+$result = $node_subscriber1->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
+
+$result = $node_subscriber2->safe_psql('postgres',
+ "SELECT c, a, b FROM tab2 ORDER BY 1, 2");
+is( $result, qq(pub_tab2|1|xxx
+pub_tab2|3|yyy
+pub_tab2|5|zzz
+xxx_c|6|aaa), 'inserts into tab2 replicated');
--
2.24.1
On Mon, May 31, 2021 at 8:51 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, May 27, 2021 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, May 21, 2021 at 1:12 PM Amit Langote <amitlangote09@gmail.com> wrote:
Why do we need to move send_relation_and_attrs() call? I think it
doesn't matter much either way but OTOH, in the existing code, if
there is an error (say 'out of memory' or some other) while building
the map, we won't send relation attrs whereas with your change we will
unnecessarily send those in such a case.

That's a good point. I've reverted that change in the attached.
Pushed.
I feel there is no need to backpatch v6-0002. We can just make it a
HEAD-only change as that doesn't cause any bug even though it is
better not to send it. If we consider it as a HEAD-only change then
probably we can register it for PG-15. What do you think?

Okay, I will see about creating a PG15 CF entry for 0002.
Thanks!
--
With Regards,
Amit Kapila.
On Tue, Jun 1, 2021 at 6:56 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, May 31, 2021 at 8:51 AM Amit Langote <amitlangote09@gmail.com> wrote:
On Thu, May 27, 2021 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Why do we need to move send_relation_and_attrs() call? I think it
doesn't matter much either way but OTOH, in the existing code, if
there is an error (say 'out of memory' or some other) while building
the map, we won't send relation attrs whereas with your change we will
unnecessarily send those in such a case.

That's a good point. I've reverted that change in the attached.
Pushed.
Thank you.
--
Amit Langote
EDB: http://www.enterprisedb.com