Problem with logical replication

Started by Dilip Kumarover 5 years ago11 messages
#1Dilip Kumar
Dilip Kumar
dilipbalaut@gmail.com
2 attachment(s)

While try to setup a cascading replication, I have observed that if we
set the REPLICA IDENTITY to FULL on the subscriber side then there is
an Assert hit.

After analysis I have found that, when we set the REPLICA IDENTITY to
FULL on subscriber side (because I wanted to make this a publisher for
another subscriber).
then it will set relation->rd_replidindex to InvalidOid refer below code snippet
RelationGetIndexList()
{
....
if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
relation->rd_replidindex = candidateIndex;
else
relation->rd_replidindex = InvalidOid;
}

But, while appying the update and if the table have an index we have
this assert in build_replindex_scan_key

static bool
build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
TupleTableSlot *searchslot)
{
...
Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
}

To me it appears like this assert is not correct. Attached patch has
removed this assert and things works fine.

#0 0x00007ff2a0c8d5d7 in raise () from /lib64/libc.so.6
#1 0x00007ff2a0c8ecc8 in abort () from /lib64/libc.so.6
#2 0x0000000000aa7c7d in ExceptionalCondition (conditionName=0xc1bb30
"RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel)",
errorType=0xc1bad9 "FailedAssertion",
fileName=0xc1bb1c "execReplication.c", lineNumber=60) at assert.c:67
#3 0x00000000007153c3 in build_replindex_scan_key
(skey=0x7fff25711560, rel=0x7ff2a1b0b800, idxrel=0x7ff2a1b0bd98,
searchslot=0x21328c8) at execReplication.c:60
#4 0x00000000007156ac in RelationFindReplTupleByIndex
(rel=0x7ff2a1b0b800, idxoid=16387, lockmode=LockTupleExclusive,
searchslot=0x21328c8, outslot=0x2132bb0) at execReplication.c:141
#5 0x00000000008aeba5 in FindReplTupleInLocalRel (estate=0x2150170,
localrel=0x7ff2a1b0b800, remoterel=0x214a7c8, remoteslot=0x21328c8,
localslot=0x7fff25711f28) at worker.c:989
#6 0x00000000008ae6f2 in apply_handle_update_internal
(relinfo=0x21327b0, estate=0x2150170, remoteslot=0x21328c8,
newtup=0x7fff25711fd0, relmapentry=0x214a7c8) at worker.c:820
#7 0x00000000008ae609 in apply_handle_update (s=0x7fff25719560) at worker.c:788
#8 0x00000000008af8b1 in apply_dispatch (s=0x7fff25719560) at worker.c:1362
#9 0x00000000008afd52 in LogicalRepApplyLoop (last_received=22926832)
at worker.c:1570
#10 0x00000000008b0c3a in ApplyWorkerMain (main_arg=0) at worker.c:2114
#11 0x0000000000869c15 in StartBackgroundWorker () at bgworker.c:813
#12 0x000000000087d28f in do_start_bgworker (rw=0x20a07a0) at postmaster.c:5852
#13 0x000000000087d63d in maybe_start_bgworkers () at postmaster.c:6078
#14 0x000000000087c685 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5247
#15 <signal handler called>
#16 0x00007ff2a0d458d3 in __select_nocancel () from /lib64/libc.so.6
#17 0x0000000000878153 in ServerLoop () at postmaster.c:1691
#18 0x0000000000877b42 in PostmasterMain (argc=3, argv=0x2079120) at
postmaster.c:1400
#19 0x000000000077f256 in main (argc=3, argv=0x2079120) at main.c:210

To reproduce this issue
run start1.sh
then execute below commands on publishers.
insert into pgbench_accounts values(1,2);
update pgbench_accounts set b=30 where a=1;

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

start1.shtext/x-sh; charset=US-ASCII; name=start1.sh
bugfix_replica_identity_full_on_subscriber.patchapplication/octet-stream; name=bugfix_replica_identity_full_on_subscriber.patch
#2Justin Pryzby
Justin Pryzby
pryzby@telsasoft.com
In reply to: Dilip Kumar (#1)
Re: Problem with logical replication (crash with REPLICA IDENTITY FULL and cascading replication)

Confirmed and added to opened items - this is a live bug back to v10.

for a in data data1; do ./tmp_install/usr/local/pgsql/bin/initdb -D $a --no-sync& done; wait
echo "wal_level = logical">> data/postgresql.conf; echo "port=5433" >> data1/postgresql.conf
for a in data data1; do ./tmp_install/usr/local/pgsql/bin/postmaster -D $a& done
for a in "CREATE TABLE pgbench_accounts(a int primary key, b int)" "ALTER TABLE pgbench_accounts REPLICA IDENTITY FULL" "CREATE PUBLICATION mypub FOR TABLE pgbench_accounts"; do \
for p in 5432 5433; do psql -d postgres -h /tmp -p "$p" -c "$a"; done; done

psql -d postgres -h /tmp -p 5433 -c "CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432 dbname=postgres' PUBLICATION mypub"
psql -d postgres -h /tmp -p 5432 -c "insert into pgbench_accounts values(1,2); update pgbench_accounts set b=30 where a=1;"

--
Justin

#3Masahiko Sawada
Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Dilip Kumar (#1)
Re: Problem with logical replication

On Thu, 16 Apr 2020 at 17:48, Dilip Kumar <dilipbalaut@gmail.com> wrote:

While try to setup a cascading replication, I have observed that if we
set the REPLICA IDENTITY to FULL on the subscriber side then there is
an Assert hit.

After analysis I have found that, when we set the REPLICA IDENTITY to
FULL on subscriber side (because I wanted to make this a publisher for
another subscriber).
then it will set relation->rd_replidindex to InvalidOid refer below code snippet
RelationGetIndexList()
{
....
if (replident == REPLICA_IDENTITY_DEFAULT && OidIsValid(pkeyIndex))
relation->rd_replidindex = pkeyIndex;
else if (replident == REPLICA_IDENTITY_INDEX && OidIsValid(candidateIndex))
relation->rd_replidindex = candidateIndex;
else
relation->rd_replidindex = InvalidOid;
}

But, while appying the update and if the table have an index we have
this assert in build_replindex_scan_key

static bool
build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
TupleTableSlot *searchslot)
{
...
Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
}

To me it appears like this assert is not correct. Attached patch has
removed this assert and things works fine.

#0 0x00007ff2a0c8d5d7 in raise () from /lib64/libc.so.6
#1 0x00007ff2a0c8ecc8 in abort () from /lib64/libc.so.6
#2 0x0000000000aa7c7d in ExceptionalCondition (conditionName=0xc1bb30
"RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel)",
errorType=0xc1bad9 "FailedAssertion",
fileName=0xc1bb1c "execReplication.c", lineNumber=60) at assert.c:67
#3 0x00000000007153c3 in build_replindex_scan_key
(skey=0x7fff25711560, rel=0x7ff2a1b0b800, idxrel=0x7ff2a1b0bd98,
searchslot=0x21328c8) at execReplication.c:60
#4 0x00000000007156ac in RelationFindReplTupleByIndex
(rel=0x7ff2a1b0b800, idxoid=16387, lockmode=LockTupleExclusive,
searchslot=0x21328c8, outslot=0x2132bb0) at execReplication.c:141
#5 0x00000000008aeba5 in FindReplTupleInLocalRel (estate=0x2150170,
localrel=0x7ff2a1b0b800, remoterel=0x214a7c8, remoteslot=0x21328c8,
localslot=0x7fff25711f28) at worker.c:989
#6 0x00000000008ae6f2 in apply_handle_update_internal
(relinfo=0x21327b0, estate=0x2150170, remoteslot=0x21328c8,
newtup=0x7fff25711fd0, relmapentry=0x214a7c8) at worker.c:820
#7 0x00000000008ae609 in apply_handle_update (s=0x7fff25719560) at worker.c:788
#8 0x00000000008af8b1 in apply_dispatch (s=0x7fff25719560) at worker.c:1362
#9 0x00000000008afd52 in LogicalRepApplyLoop (last_received=22926832)
at worker.c:1570
#10 0x00000000008b0c3a in ApplyWorkerMain (main_arg=0) at worker.c:2114
#11 0x0000000000869c15 in StartBackgroundWorker () at bgworker.c:813
#12 0x000000000087d28f in do_start_bgworker (rw=0x20a07a0) at postmaster.c:5852
#13 0x000000000087d63d in maybe_start_bgworkers () at postmaster.c:6078
#14 0x000000000087c685 in sigusr1_handler (postgres_signal_arg=10) at
postmaster.c:5247
#15 <signal handler called>
#16 0x00007ff2a0d458d3 in __select_nocancel () from /lib64/libc.so.6
#17 0x0000000000878153 in ServerLoop () at postmaster.c:1691
#18 0x0000000000877b42 in PostmasterMain (argc=3, argv=0x2079120) at
postmaster.c:1400
#19 0x000000000077f256 in main (argc=3, argv=0x2079120) at main.c:210

To reproduce this issue
run start1.sh
then execute below commands on publishers.
insert into pgbench_accounts values(1,2);
update pgbench_accounts set b=30 where a=1;

I could reproduce this issue by the steps you shared. For the bug fix
patch, I basically agree to remove that assertion from
build_replindex_scan_key() but I think it's better to update the
assertion instead of removal and update the following comment:

* This is not generic routine, it expects the idxrel to be replication
* identity of a rel and meet all limitations associated with that.
*/
static bool
build_replindex_scan_key(ScanKey skey, Relation rel, Relation idxrel,
TupleTableSlot *searchslot)
{

An alternative solution would be that logical replication worker
determines the access path based on its replica identity instead of
seeking the chance to use the primary key as follows:

@@ -981,7 +981,7 @@ FindReplTupleInLocalRel(EState *estate, Relation localrel,

*localslot = table_slot_create(localrel, &estate->es_tupleTable);

-   idxoid = GetRelationIdentityOrPK(localrel);
+   idxoid = RelationGetReplicaIndex(localrel);
    Assert(OidIsValid(idxoid) ||
           (remoterel->replident == REPLICA_IDENTITY_FULL));

That way, we can avoid such mismatch between replica identity and an
index for index scans. But a downside is that it will end up with a
sequential scan even if the local table has the primary key. IIUC if
the table has the primary key, a logical replication worker can use
the primary key for the update and delete even if its replica identity
is FULL, because the columns of the primary key are always a subset of
all columns. So I'll look at this closely but I agree with your idea.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Euler Taveira
Euler Taveira
euler.taveira@2ndquadrant.com
In reply to: Masahiko Sawada (#3)
1 attachment(s)
Re: Problem with logical replication

On Mon, 20 Apr 2020 at 10:25, Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:

On Thu, 16 Apr 2020 at 17:48, Dilip Kumar <dilipbalaut@gmail.com> wrote:

I could reproduce this issue by the steps you shared. For the bug fix
patch, I basically agree to remove that assertion from
build_replindex_scan_key() but I think it's better to update the
assertion instead of removal and update the following comment:

IMO the assertion is using the wrong function because it should test a

replica
identity or primary key (GetRelationIdentityOrPK). RelationGetReplicaIndex
returns InvalidOid even though the table has a primary key.
GetRelationIdentityOrPK tries to obtain a replica identity and if it fails,
it
tries a primary key. That's exact what this assertion should use. We should
also notice that FindReplTupleInLocalRel uses GetRelationIdentityOrPK and
after
a code path like RelationFindReplTupleByIndex -> build_replindex_scan_key it
should also use the same function.

Since GetRelationIdentityOrPK is a fallback function that
uses RelationGetReplicaIndex and RelationGetPrimaryKeyIndex, I propose that
we
move this static function to execReplication.c.

I attached a patch with the described solution. I also included a test that
covers this scenario.

--
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

0001-Fix-assert-failure-with-REPLICA-IDENTITY-FULL-in-the.patchtext/x-patch; charset=US-ASCII; name=0001-Fix-assert-failure-with-REPLICA-IDENTITY-FULL-in-the.patch
#5Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Euler Taveira (#4)
Re: Problem with logical replication

On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:

I attached a patch with the described solution. I also included a test that
covers this scenario.

-   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
+   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));

Not much a fan of adding a routine to relcache.c to do the work of two
routines already present, so I think that we had better add an extra
condition based on RelationGetPrimaryKeyIndex, and give up on
GetRelationIdentityOrPK() in execReplication.c. Wouldn't it also be
better to cross-check the replica identity here depending on if
RelationGetReplicaIndex() returns an invalid OID or not?
--
Michael

#6Masahiko Sawada
Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Michael Paquier (#5)
Re: Problem with logical replication

On Mon, 11 May 2020 at 16:28, Michael Paquier <michael@paquier.xyz> wrote:

On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:

I attached a patch with the described solution. I also included a test that
covers this scenario.

-   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
+   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));

Not much a fan of adding a routine to relcache.c to do the work of two
routines already present, so I think that we had better add an extra
condition based on RelationGetPrimaryKeyIndex, and give up on
GetRelationIdentityOrPK() in execReplication.c.

+1

In any case, it seems to me that the comment of
build_replindex_scan_key needs to be updated.

* This is not generic routine, it expects the idxrel to be replication
* identity of a rel and meet all limitations associated with that.

For example, we can update the above:

* This is not generic routine, it expects the idxrel to be replication
* identity or primary key of a rel and meet all limitations associated
* with that.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Euler Taveira
Euler Taveira
euler.taveira@2ndquadrant.com
In reply to: Masahiko Sawada (#6)
1 attachment(s)
Re: Problem with logical replication

On Tue, 12 May 2020 at 06:36, Masahiko Sawada <
masahiko.sawada@2ndquadrant.com> wrote:

On Mon, 11 May 2020 at 16:28, Michael Paquier <michael@paquier.xyz> wrote:

On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:

I attached a patch with the described solution. I also included a test

that

covers this scenario.

-   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
+   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));

Not much a fan of adding a routine to relcache.c to do the work of two
routines already present, so I think that we had better add an extra
condition based on RelationGetPrimaryKeyIndex, and give up on
GetRelationIdentityOrPK() in execReplication.c.

Although, I think this solution is fragile, I updated the patch

accordingly.
(When/If someone changed GetRelationIdentityOrPK() it will break this
assert)

In any case, it seems to me that the comment of
build_replindex_scan_key needs to be updated.

* This is not generic routine, it expects the idxrel to be replication
* identity of a rel and meet all limitations associated with that.

It is implicit that a primary key can be a replica identity so I think this

comment is fine.

--
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v2-0001-Fix-assert-failure-with-REPLICA-IDENTITY-FULL-in-the.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Fix-assert-failure-with-REPLICA-IDENTITY-FULL-in-the.patch
#8Dilip Kumar
Dilip Kumar
dilipbalaut@gmail.com
In reply to: Euler Taveira (#7)
Re: Problem with logical replication

On Wed, May 13, 2020 at 6:15 AM Euler Taveira
<euler.taveira@2ndquadrant.com> wrote:

On Tue, 12 May 2020 at 06:36, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:

On Mon, 11 May 2020 at 16:28, Michael Paquier <michael@paquier.xyz> wrote:

On Sun, May 10, 2020 at 07:08:03PM -0300, Euler Taveira wrote:

I attached a patch with the described solution. I also included a test that
covers this scenario.

-   Assert(RelationGetReplicaIndex(rel) == RelationGetRelid(idxrel));
+   Assert(GetRelationIdentityOrPK(rel) == RelationGetRelid(idxrel));

Not much a fan of adding a routine to relcache.c to do the work of two
routines already present, so I think that we had better add an extra
condition based on RelationGetPrimaryKeyIndex, and give up on
GetRelationIdentityOrPK() in execReplication.c.

Although, I think this solution is fragile, I updated the patch accordingly.
(When/If someone changed GetRelationIdentityOrPK() it will break this assert)

In any case, it seems to me that the comment of
build_replindex_scan_key needs to be updated.

* This is not generic routine, it expects the idxrel to be replication
* identity of a rel and meet all limitations associated with that.

It is implicit that a primary key can be a replica identity so I think this
comment is fine.

I like your idea of modifying the assert instead of completely removing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#9Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Euler Taveira (#7)
1 attachment(s)
Re: Problem with logical replication

On Tue, May 12, 2020 at 09:45:45PM -0300, Euler Taveira wrote:

In any case, it seems to me that the comment of
build_replindex_scan_key needs to be updated.

* This is not generic routine, it expects the idxrel to be replication
* identity of a rel and meet all limitations associated with that.

It is implicit that a primary key can be a replica identity so I think this
comment is fine.

Agreed. I don't think either that we need to update this comment. I
was playing with this patch and what you have here looks fine by me.
Two nits: the extra parenthesis in the assert are not necessary, and
the indentation had some diffs. Tom has just reindented the whole
tree, so let's keep things clean.
--
Michael

Attachments:

logirep-assert-v3.patchtext/x-diff; charset=us-ascii
#10Euler Taveira
Euler Taveira
euler.taveira@2ndquadrant.com
In reply to: Michael Paquier (#9)
Re: Problem with logical replication

On Fri, 15 May 2020 at 02:47, Michael Paquier <michael@paquier.xyz> wrote:

Agreed. I don't think either that we need to update this comment. I
was playing with this patch and what you have here looks fine by me.
Two nits: the extra parenthesis in the assert are not necessary, and
the indentation had some diffs. Tom has just reindented the whole
tree, so let's keep things clean.

LGTM.

--
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Euler Taveira (#10)
Re: Problem with logical replication

On Fri, May 15, 2020 at 08:48:53AM -0300, Euler Taveira wrote:

On Fri, 15 May 2020 at 02:47, Michael Paquier <michael@paquier.xyz> wrote:

Agreed. I don't think either that we need to update this comment. I
was playing with this patch and what you have here looks fine by me.
Two nits: the extra parenthesis in the assert are not necessary, and
the indentation had some diffs. Tom has just reindented the whole
tree, so let's keep things clean.

LGTM.

Thanks for double-checking. Applied and back-patched down to 10
then.
--
Michael