Logical replication can be broken by domain constraint with NOT VALID option

Started by Andrei Lepikhov, over 6 years ago, 5 messages, list: bugs
#1 Andrei Lepikhov
lepihov@gmail.com

Hi,

During patch development I ran into a small problem (see attachment,
fail_replication.sh):
1. We have a table with logical replication to another node.
2. On both the master and the replica, add a "NOT VALID" domain constraint
to the table such that some existing tuples violate it.
3. UPDATE the table: set the value of a tuple that violates the constraint
to a correct value.
4. That's all!

The reason for this problem is that on UPDATE the walsender sends the old tuple
value (which violates the constraint) along with the new version (which
satisfies the constraint).
The replication worker on the replica node restores the slot from the transfer
representation. During this process the domain constraint is checked and
an ERROR is raised.
Because the WAL record of the UPDATE command cannot be applied, logical
replication stops completely.
As I understand it, this problem can be reproduced in all postgres versions
that have the logical replication feature.
This problem can be solved in many ways. I wrote a patch (see attachment)
that solves it in the shortest way.
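
To make the failure mode described above concrete, here is a toy Python model. It is not PostgreSQL code: the "positive integer" domain and every function name are assumptions made for illustration. It only mimics the idea that the apply side validates both the old and the new tuple version.

```python
class DomainViolation(Exception):
    """Raised when a value fails the (hypothetical) domain CHECK."""


def check_domain(value: int) -> int:
    # Assumed constraint for this sketch: the value must be positive.
    if value <= 0:
        raise DomainViolation(f"value {value} violates the domain constraint")
    return value


def apply_update(table: dict, key: int, old_value: int, new_value: int) -> None:
    # Both tuple versions are rebuilt from the transfer representation,
    # and the domain check runs on each of them.
    old_checked = check_domain(old_value)  # raises for pre-existing bad data
    new_checked = check_domain(new_value)
    if table.get(key) == old_checked:
        table[key] = new_checked


def demo() -> str:
    # Row 1 was stored before the NOT VALID constraint existed.
    replica = {1: -5}
    try:
        # The UPDATE repairs the row, but the old value travels with it.
        apply_update(replica, key=1, old_value=-5, new_value=7)
    except DomainViolation as exc:
        return f"apply failed: {exc}"
    return "applied"


if __name__ == "__main__":
    print(demo())
```

In this model the update that would repair the row is rejected because of the old value alone, which is the situation reported here.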

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachments:

fail_replication.sh (application/x-shellscript)
0001-Fix-the-problem-of-logical-replication-with-domain-N.patch (text/x-patch, +4 -1)
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrei Lepikhov (#1)
Re: Logical replication can be broken by domain constraint with NOT VALID option

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

During patch development I ran into a small problem (see attachment,
fail_replication.sh):
1. We have a table with logical replication to another node.
2. On both the master and the replica, add a "NOT VALID" domain constraint
to the table such that some existing tuples violate it.
3. UPDATE the table: set the value of a tuple that violates the constraint
to a correct value.
4. That's all!

The reason for this problem is that on UPDATE the walsender sends the old tuple
value (which violates the constraint) along with the new version (which
satisfies the constraint).
The replication worker on the replica node restores the slot from the transfer
representation. During this process the domain constraint is checked and
an ERROR is raised.

I'm not sure this is something we should attempt to fix. There are
an infinite number of ways you can break logical replication by
presenting it with inconsistent data, and that's really what you've
done here.

This problem can be solved in many ways. I wrote a patch (see attachment)
that solves it in the shortest way.

That patch is certainly utterly unacceptable. It'd allow the
recipient to accept data that violates the domain constraint.

The situation you're describing would probably best be handled by
not adding the constraint on the replica side until all the
bad data has been corrected (and replicated).
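
The ordering suggested above can be sketched with a toy Python model. This is an illustration under assumptions, not PostgreSQL behaviour: the "positive integer" constraint, the dictionaries standing in for tables, and the function names are all hypothetical.

```python
def is_valid(value: int) -> bool:
    # Assumed domain constraint for this sketch: positive values only.
    return value > 0


def apply_update(table, key, old_value, new_value, constraint=None):
    # A replica without the constraint accepts any old/new pair.
    for value in (old_value, new_value):
        if constraint is not None and not constraint(value):
            raise ValueError(f"value {value} violates the constraint")
    if table.get(key) == old_value:
        table[key] = new_value


def migrate() -> dict:
    publisher = {1: -5, 2: 3}
    replica = dict(publisher)

    # Step 1: the constraint exists on the publisher only (NOT VALID there).
    # Step 2: repair the bad row on the publisher and replicate the change.
    old, new = publisher[1], 7
    publisher[1] = new
    apply_update(replica, 1, old, new, constraint=None)

    # Step 3: only now add the constraint on the replica; every row passes.
    if not all(is_valid(v) for v in replica.values()):
        raise RuntimeError("replica still holds invalid rows")
    return replica


if __name__ == "__main__":
    print(migrate())
```

Because the replica has no constraint while the repair is replicated, the old value never gets checked there.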

regards, tom lane

#3 Andrei Lepikhov
lepihov@gmail.com
In reply to: Tom Lane (#2)
Re: Logical replication can be broken by domain constraint with NOT VALID option

On 03/11/2019 20:42, Tom Lane wrote:

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

The reason for this problem is that on UPDATE the walsender sends the old tuple
value (which violates the constraint) along with the new version (which
satisfies the constraint).
The replication worker on the replica node restores the slot from the transfer
representation. During this process the domain constraint is checked and
an ERROR is raised.

I'm not sure this is something we should attempt to fix. There are
an infinite number of ways you can break logical replication by
presenting it with inconsistent data, and that's really what you've
done here.

This problem is reproduced in a standard way, following the documentation.
I assume this kind of inconsistency is allowed by the SQL standard because
it has a practical use.

This problem can be solved in many ways. I wrote a patch (see attachment)
that solves it in the shortest way.

That patch is certainly utterly unacceptable. It'd allow the
recipient to accept data that violates the domain constraint.

If this is the only reason, I propose a new version of the patch (see
attachment). It satisfies the "Paranoid safety" rule.

The situation you're describing would probably best be handled by
not adding the constraint on the replica side until all the
bad data has been corrected (and replicated).

On any PostgreSQL-based multimaster system, this will be a problem.

--
regards,
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company

Attachments:

v2-0001-Fix-the-problem-of-logical-replication-with-domain-N.patch (text/x-patch, +20 -3)
#4 Euler Taveira de Oliveira
In reply to: Andrei Lepikhov (#3)
Re: Logical replication can be broken by domain constraint with NOT VALID option

On Sun, Nov 3, 2019 at 23:33, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

On 03/11/2019 20:42, Tom Lane wrote:

Andrey Lepikhov <a.lepikhov@postgrespro.ru> writes:

The reason for this problem is that on UPDATE the walsender sends the old tuple
value (which violates the constraint) along with the new version (which
satisfies the constraint).
The replication worker on the replica node restores the slot from the transfer
representation. During this process the domain constraint is checked and
an ERROR is raised.

I'm not sure this is something we should attempt to fix. There are
an infinite number of ways you can break logical replication by
presenting it with inconsistent data, and that's really what you've
done here.

This problem is reproduced in a standard way, following the documentation.
I assume this kind of inconsistency is allowed by the SQL standard because
it has a practical use.

Could you point out the problem in the documentation?

This problem can be solved in many ways. I wrote a patch (see attachment)
that solves it in the shortest way.

That patch is certainly utterly unacceptable. It'd allow the
recipient to accept data that violates the domain constraint.

If this is the only reason, I propose a new version of the patch (see
attachment). It satisfies the "Paranoid safety" rule.

I don't think that is acceptable either. If you have different schemas
(even for a short period of time), you should handle it by dropping and
recreating the constraints. Logical replication is far from a complete
feature. There are bound to be cases where someone wants to enforce even
the FK constraints on the subscriber. I certainly wouldn't like to open
that can of worms. Relaxing constraints could lead to inconsistent
datasets across nodes. If you want to accept constraint violations,
drop the constraints.

The situation you're describing would probably best be handled by
not adding the constraint on the replica side until all the
bad data has been corrected (and replicated).

On any PostgreSQL-based multimaster system, this will be a problem.

... if you do not replicate DDLs in the same order they occur or if you
have different schemas.

--
Euler Taveira
Timbira - http://www.timbira.com.br/
PostgreSQL: Consulting, Development, 24x7 Support and Training

#5 Andrei Lepikhov
lepihov@gmail.com
In reply to: Euler Taveira de Oliveira (#4)
Re: Logical replication can be broken by domain constraint with NOT VALID option

On 05/11/2019 20:21, Euler Taveira wrote:

On Sun, Nov 3, 2019 at 23:33, Andrey Lepikhov
<a.lepikhov@postgrespro.ru> wrote:

If this is the only reason, I propose a new version of the patch (see
attachment). It satisfies the "Paranoid safety" rule.

I don't think that is acceptable either. If you have different schemas
(even for a short period of time), you should handle it by dropping and
recreating the constraints.

Changing a schema is a big deal. But adding a constraint with the "not valid"
option may be done frequently, for example to change the format of phone
numbers.

Logical replication is far from a complete
feature. There are bound to be cases where someone wants to enforce even
the FK constraints on the subscriber. I certainly wouldn't like to open
that can of worms. Relaxing constraints could lead to inconsistent
datasets across nodes. If you want to accept constraint violations,
drop the constraints.

Maybe logical replication is incomplete, but that is no argument against
fixing the errors that we find.
In the v2 version of the patch, constraints are suppressed only for the old
version of the tuple, which is used for the search in the heap and is never
applied. In this sense we are not relaxing any constraints.
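
The idea behind v2 can be illustrated with a toy Python model. This is not the patch itself, which is not reproduced in this thread: the `decode` helper, the `enforce_domain` flag and the "positive integer" domain are hypothetical names chosen for the sketch.

```python
class DomainViolation(Exception):
    """Raised when a value fails the (hypothetical) domain CHECK."""


def decode(value: int, enforce_domain: bool) -> int:
    # Assumed constraint for this sketch: the value must be positive.
    if enforce_domain and value <= 0:
        raise DomainViolation(f"value {value} violates the domain constraint")
    return value


def apply_update_v2(table: dict, key: int, old_value: int, new_value: int) -> None:
    # The old tuple only locates the existing row and is never stored,
    # so it is decoded without the domain check.
    old = decode(old_value, enforce_domain=False)
    # The new tuple is what gets written, so its check stays in force.
    new = decode(new_value, enforce_domain=True)
    if table.get(key) != old:
        raise LookupError("row to update not found")
    table[key] = new


if __name__ == "__main__":
    replica = {1: -5}  # stored before the NOT VALID constraint existed
    apply_update_v2(replica, key=1, old_value=-5, new_value=7)
    print(replica)
```

In this model an update whose new value violates the domain is still rejected, so nothing invalid can be written through the relaxed path.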

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company