Column Filtering in Logical Replication
Hi,
Filtering of columns at the publisher node will allow for selective
replication of data between publisher and subscriber. If the updates
on the publisher target only specific columns, the user will have an
option to reduce network consumption by not sending the data for the
columns that are filtered out. Note that replica identity values will
always be sent irrespective of the column filtering settings.
The column values that are not sent by the publisher will be populated
using local values on the subscriber. For INSERT commands, non-replicated
column values will be NULL or the column default.
If column names are not specified while creating or altering a publication,
all the columns are replicated as per current behaviour.
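As a rough illustration of the behaviour described above (a Python sketch, not the patch's C implementation; all names here are hypothetical): the publisher sends only the published columns, but replica identity columns always go along, and everything else is left for the subscriber to fill in locally.

```python
# Hypothetical sketch of publisher-side column filtering: only columns
# in the publication's column list are sent, but replica identity
# columns are always included regardless of the filter.

def filter_row(row, published_cols, replica_identity_cols):
    """Return only the column values that should be sent downstream."""
    send = set(published_cols) | set(replica_identity_cols)
    return {col: val for col, val in row.items() if col in send}

row = {"id": 1, "name": "alice", "email": "a@example.com", "bio": "..."}
# Publication lists only (id, name); replica identity is (id).
payload = filter_row(row, {"id", "name"}, {"id"})
# "email" and "bio" are never sent; the subscriber fills them locally.
```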
The proposal for syntax to add table with column names to publication is as
follows:
Create publication:
CREATE PUBLICATION <pub_name> [ FOR TABLE [ONLY] table_name
[(colname [, ...])] | FOR ALL TABLES ]
Alter publication:
ALTER PUBLICATION <pub_name> ADD TABLE [ONLY] table_name [(colname [, ...])]
Please find attached a patch that implements the above proposal.
While the patch contains basic implementation and tests, several
improvements and sanity checks are underway. I will post an updated
patch with those changes soon.
Kindly let me know your opinion.
Thank you,
Rahila Syed
Attachments:
0001-Add-column-filtering-to-logical-replication.patch
On Thu, Jul 1, 2021 at 1:06 AM Rahila Syed <rahilasyed90@gmail.com> wrote:
I haven't looked into the patch yet but +1 for the idea.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 1, 2021 at 1:06 AM Rahila Syed <rahilasyed90@gmail.com> wrote:
This idea gives more flexibility to the user, +1 for the feature.
Regards,
Vignesh
Hello, here are a few comments on this patch.
The patch adds a function get_att_num_by_name; but we have a lsyscache.c
function for that purpose, get_attnum. Maybe that one should be used
instead?
get_tuple_columns_map() returns a bitmapset of the attnos of the columns
in the given list, so its name feels wrong. I propose
get_table_columnset(). However, this function is invoked for every
insert/update change, so it's going to be far too slow to be usable. I
think you need to cache the bitmapset somewhere, so that the function is
only called on first use. I didn't look very closely, but it seems that
struct RelationSyncEntry may be a good place to cache it.
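The caching being suggested here could be sketched like this (illustrative Python, not the patch's C; `RelationSyncEntry` and the catalog lookup are stand-ins): the published column set is computed on first use and kept in the per-relation entry, so the per-row output path avoids rebuilding it on every change.

```python
# Sketch of lazily caching the published column set per relation, so it
# is built once rather than on every insert/update change.

class RelationSyncEntry:
    """Stand-in for pgoutput's per-relation cache entry."""
    def __init__(self, relid):
        self.relid = relid
        self.columns = None        # cached column set, built lazily

def lookup_published_columns(relid):
    # Stand-in for the (expensive) catalog lookup of the column list.
    return {"id", "name"}

def get_published_columns(entry):
    if entry.columns is None:      # first use: build and cache
        entry.columns = lookup_published_columns(entry.relid)
    return entry.columns
```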
The patch adds a new parse node PublicationTable, but doesn't add
copyfuncs.c, equalfuncs.c, readfuncs.c, outfuncs.c support for it.
Maybe try a compile with WRITE_READ_PARSE_PLAN_TREES and/or
COPY_PARSE_PLAN_TREES enabled to make sure everything is covered.
(I didn't verify that this actually catches anything ...)
The new column in pg_publication_rel is prrel_attr. This name seems at
odds with existing column names (we don't use underscores in catalog
column names). Maybe prattrs is good enough? prrelattrs? We tend to
use plurals for columns that are arrays.
It's not super clear to me that strlist_to_textarray() and related
processing will behave sanely when the column names contain weird
characters such as commas or quotes, or just when used with uppercase
column names. Maybe it's worth having tests that try to break such
cases.
You seem to have left a debugging "elog(LOG)" line in OpenTableList.
I got warnings from "git am" about trailing whitespace being added by
the patch in two places.
Thanks!
--
Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
Hi, I was wondering if/when a subset of cols is specified then does
that mean it will be possible for the table to be replicated to a
*smaller* table at the subscriber side?
e.g. can a table with 7 cols be replicated to a table with 2 cols?
table tab1(a,b,c,d,e,f,g) --> CREATE PUBLICATION pub1 FOR TABLE
tab1(a,b) --> table tab1(a,b)
~~
I thought maybe that should be possible, but the expected behaviour
for that scenario was not very clear to me from the thread/patch
comments. And the new TAP test uses the tab1 table created exactly the
same for pub/sub, so I couldn't tell from the test code either.
------
Kind Regards,
Peter Smith.
Fujitsu Australia
Hi Peter,
Currently, this capability is not included in the patch. If the table on
the subscriber server has fewer attributes than the one on the publisher
server, it throws an error at the time of CREATE SUBSCRIPTION.
About adding such functionality, I don't immediately see any issue with
it as long as we make sure replica identity columns are always present
on both instances. However, we need to carefully consider situations in
which a server subscribes to multiple publications, each publishing a
different subset of columns of a table.
Thank you,
Rahila Syed
Hi Alvaro,
Thank you for comments.
The patch adds a function get_att_num_by_name; but we have a lsyscache.c
function for that purpose, get_attnum. Maybe that one should be used
instead?
Thank you for pointing that out; I agree it makes sense to reuse the
existing function. Changed it accordingly in the attached patch.
get_tuple_columns_map() returns a bitmapset of the attnos of the columns
in the given list, so its name feels wrong. I propose
get_table_columnset(). However, this function is invoked for every
insert/update change, so it's going to be far too slow to be usable. I
think you need to cache the bitmapset somewhere, so that the function is
only called on first use. I didn't look very closely, but it seems that
struct RelationSyncEntry may be a good place to cache it.
Makes sense, changed accordingly.
The patch adds a new parse node PublicationTable, but doesn't add
copyfuncs.c, equalfuncs.c, readfuncs.c, outfuncs.c support for it.
Maybe try a compile with WRITE_READ_PARSE_PLAN_TREES and/or
COPY_PARSE_PLAN_TREES enabled to make sure everything is covered.
(I didn't verify that this actually catches anything ...)
I will test this and include these changes in the next version.
The new column in pg_publication_rel is prrel_attr. This name seems at
odds with existing column names (we don't use underscores in catalog
column names). Maybe prattrs is good enough? prrelattrs? We tend to
use plurals for columns that are arrays.
Renamed it to prattrs as per the suggestion.
It's not super clear to me that strlist_to_textarray() and related
processing will behave sanely when the column names contain weird
characters such as commas or quotes, or just when used with uppercase
column names. Maybe it's worth having tests that try to break such
cases.
Sure, I will include these tests in the next version.
You seem to have left a debugging "elog(LOG)" line in OpenTableList.
Removed.
I got warnings from "git am" about trailing whitespace being added by
the patch in two places.
Should be fixed now.
Thank you,
Rahila Syed
Attachments:
v1-0001-Add-column-filtering-to-logical-replication.patch
On 7/12/21 10:32 AM, Rahila Syed wrote:
Hi Peter,
Currently, this capability is not included in the patch. If the table on
the subscriber server has fewer attributes than the one on the publisher
server, it throws an error at the time of CREATE SUBSCRIPTION.
That's a bit surprising, to be honest. I do understand the patch simply
treats the filtered columns as "unchanged" because that's the simplest
way to filter the *data* of the columns. But if someone told me we can
"filter columns" I'd expect this to work without the columns on the
subscriber.
About having such a functionality, I don't immediately see any issue
with it as long
as we make sure replica identity columns are always present on both
instances.
Yeah, that seems like an inherent requirement.
However, need to carefully consider situations in which a server
subscribes to multiple
publications, each publishing a different subset of columns of a table.
Isn't that pretty much the same situation as for multiple subscriptions
each with a different set of I/U/D operations? IIRC we simply merge
those, so why not to do the same thing here and merge the attributes?
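The merge being described amounts to a set union across publications, sketched here in illustrative Python (the "no filter means all columns" rule is an assumption about how an unfiltered publication would combine with filtered ones, not something the patch defines):

```python
# Sketch of merging column filters when a subscription pulls the same
# table from several publications, analogous to merging I/U/D actions.

def merge_column_lists(column_lists):
    """Union of per-publication column sets.

    An empty set stands for 'no filter' (all columns), which wins over
    any filter, signalled here by returning None.
    """
    merged = set()
    for cols in column_lists:
        if not cols:
            return None   # one publication replicates every column
        merged |= set(cols)
    return merged

# pub1 publishes (a, b), pub2 publishes (b, c): subscriber gets (a, b, c).
combined = merge_column_lists([{"a", "b"}, {"b", "c"}])
```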
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7/12/21 11:38 AM, Rahila Syed wrote:
Hi Alvaro,
Thank you for comments.
The patch adds a function get_att_num_by_name; but we have a lsyscache.c
function for that purpose, get_attnum. Maybe that one should be used
instead?
Thank you for pointing that out; I agree it makes sense to reuse the
existing function. Changed it accordingly in the attached patch.
get_tuple_columns_map() returns a bitmapset of the attnos of the columns
in the given list, so its name feels wrong. I propose
get_table_columnset(). However, this function is invoked for every
insert/update change, so it's going to be far too slow to be usable. I
think you need to cache the bitmapset somewhere, so that the function is
only called on first use. I didn't look very closely, but it seems that
struct RelationSyncEntry may be a good place to cache it.
Makes sense, changed accordingly.
To nitpick, I find "Bitmapset *att_list" a bit annoying, because it's
not really a list ;-)
FWIW "make check" fails for me with this version, due to a segfault in
OpenTableList. Apparently there's some confusion - the code expects the
list to contain PublicationTable nodes, and tries to extract the
RangeVar from the elements. But the list actually contains RangeVar, so
this crashes and burns. See the attached backtrace.
I'd bet this is because the patch uses list of RangeVar in some cases
and list of PublicationTable in some cases, similarly to the "row
filtering" patch nearby. IMHO this is just confusing and we should
always pass list of PublicationTable nodes.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
crash.txt
On 2021-Jul-12, Tomas Vondra wrote:
FWIW "make check" fails for me with this version, due to segfault in
OpenTableList. Apparently there's some confusion - the code expects the
list to contain PublicationTable nodes, and tries to extract the
RangeVar from the elements. But the list actually contains RangeVar, so
this crashes and burns. See the attached backtrace.
I'd bet this is because the patch uses a list of RangeVar in some cases
and list of PublicationTable in some cases, similarly to the "row
filtering" patch nearby. IMHO this is just confusing and we should
always pass list of PublicationTable nodes.
+1 don't make the code guess what type of list it is. Changing all the
uses of that node to deal with PublicationTable seems best.
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"Cuando no hay humildad las personas se degradan" (A. Christie)
Hi Tomas,
Thank you for your comments.
Currently, this capability is not included in the patch. If the table on
the subscriber server has fewer attributes than the one on the publisher
server, it throws an error at the time of CREATE SUBSCRIPTION.
That's a bit surprising, to be honest. I do understand the patch simply
treats the filtered columns as "unchanged" because that's the simplest
way to filter the *data* of the columns. But if someone told me we can
"filter columns" I'd expect this to work without the columns on the
subscriber.
OK, I will look into adding this.
However, need to carefully consider situations in which a server
subscribes to multiple
publications, each publishing a different subset of columns of a table.
Isn't that pretty much the same situation as for multiple subscriptions
each with a different set of I/U/D operations? IIRC we simply merge
those, so why not to do the same thing here and merge the attributes?
Yeah, I agree with the solution of merging the attributes, similar to
how operations are merged. My concern was also from an implementation
point of view: whether it would be a very drastic change. I have now had
a look at how remote relation attributes are acquired for comparison
with local attributes at the subscriber. It seems that the publisher
will need to send the information about the filtered columns for each
publication specified during CREATE SUBSCRIPTION. This will be read on
the subscriber side, which in turn updates its cache accordingly.
Currently, the subscriber expects all attributes of a published relation
to be present. I will add code for this in the next version of the patch.
To nitpick, I find "Bitmapset *att_list" a bit annoying, because it's
not really a list ;-)
I will make this change with the next version.
FWIW "make check" fails for me with this version, due to segfault in
OpenTableLists. Apparenly there's some confusion - the code expects the
list to contain PublicationTable nodes, and tries to extract the
RangeVar from the elements. But the list actually contains RangeVar, so
this crashes and burns. See the attached backtrace.
Thank you for the report. This is fixed in the attached version; now all
publication function calls accept the PublicationTableInfo list.
Thank you,
Rahila Syed
Attachments:
v2-0001-Add-column-filtering-to-logical-replication.patch
On Tue, Jul 13, 2021 at 7:44 PM Rahila Syed <rahilasyed90@gmail.com> wrote:
The patch does not apply, and a rebase is required:
Hunk #8 succeeded at 1259 (offset 99 lines).
Hunk #9 succeeded at 1360 (offset 99 lines).
1 out of 9 hunks FAILED -- saving rejects to file
src/backend/replication/pgoutput/pgoutput.c.rej
patching file src/include/catalog/pg_publication.h
Changing the status to "Waiting on Author"
--
Ibrar Ahmed
Hello,
I think this looks good regarding the PublicationRelationInfo API that was
discussed.
Looking at OpenTableList(), I think you forgot to update the comment --
it says "open relations specified by a RangeVar list", but the list is
now of PublicationTable. Also I think it would be good to say that the
returned tables are PublicationRelationInfo, maybe such as "In the
returned list of PublicationRelationInfo, the tables are locked ..."
In AlterPublicationTables() I was confused by some code that seemed
commented a bit too verbosely (for a moment I thought the whole list was
being copied into a different format). May I suggest something more
compact like
/* Not yet in list; open it and add it to the list */
if (!found)
{
    Relation    oldrel;
    PublicationRelationInfo *pubrel;

    oldrel = table_open(oldrelid, ShareUpdateExclusiveLock);

    /* Wrap it in PublicationRelationInfo */
    pubrel = palloc(sizeof(PublicationRelationInfo));
    pubrel->relation = oldrel;
    pubrel->relid = oldrelid;
    pubrel->columns = NIL;      /* not needed */

    delrels = lappend(delrels, pubrel);
}
Thanks!
--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
One thing I just happened to notice is this part of your commit message
: REPLICA IDENTITY columns are always replicated
: irrespective of column names specification.
... for which you don't have any tests -- I mean, create a table with a
certain REPLICA IDENTITY and later try to publish a set of columns that
doesn't include all the columns in the replica identity, then verify
that those columns are indeed published.
Having said that, I'm not sure I agree with this design decision; what I
think this is doing is hiding from the user the fact that they are
publishing columns that they don't want to publish. I think as a user I
would rather get an error in that case:
ERROR: invalid column list in published set
DETAIL: The set of published commands does not include all the replica identity columns.
or something like that. Avoid possible nasty surprises of security-
leaking nature.
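The check being proposed could be sketched as follows (illustrative Python; the function name and error wording are invented for the example, not taken from any patch): a column list is rejected unless it covers every replica identity column.

```python
# Sketch of validating a publication's column list against the table's
# replica identity: every replica identity column must be published.

def validate_column_list(published_cols, replica_identity_cols):
    missing = set(replica_identity_cols) - set(published_cols)
    if missing:
        raise ValueError(
            "invalid column list in published set: replica identity "
            "columns not published: %s" % ", ".join(sorted(missing)))

validate_column_list({"id", "name"}, {"id"})       # ok
# validate_column_list({"name"}, {"id"}) would raise ValueError
```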
--
Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"On the other flipper, one wrong move and we're Fatal Exceptions"
(T.U.X.: Term Unit X - http://www.thelinuxreview.com/TUX/)
Hi,
Currently, this capability is not included in the patch. If the table on
the subscriber server has fewer attributes than the one on the publisher
server, it throws an error at the time of CREATE SUBSCRIPTION.
That's a bit surprising, to be honest. I do understand the patch simply
treats the filtered columns as "unchanged" because that's the simplest
way to filter the *data* of the columns. But if someone told me we can
"filter columns" I'd expect this to work without the columns on the
subscriber.
OK, I will look into adding this.
This has been added in the attached patch. Now, instead of treating the
filtered columns as unchanged and sending a byte with that information,
columns that are not in the filter are not sent to the subscriber server
at all. Along with saving network bandwidth, this allows logical
replication to work even between tables with different numbers of
columns, i.e. with the table on the subscriber server containing only
the filtered columns. Currently, replica identity columns are replicated
irrespective of the presence of column filters, hence the table on the
subscriber side must contain the replica identity columns.
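The subscriber-side mapping that makes a narrower table workable can be sketched in Python (illustrative only; the real logic lives in C around logicalrep_rel_open): sent remote columns are matched to local columns by name, so the local table only needs the columns that are actually published.

```python
# Sketch of mapping the columns actually sent by the publisher onto the
# subscriber's (possibly narrower) table, by column name.

def build_attr_map(remote_cols, local_cols):
    """Map each sent remote column name to a local column position."""
    local_pos = {name: i for i, name in enumerate(local_cols)}
    attr_map = {}
    for name in remote_cols:
        if name not in local_pos:
            raise ValueError("no local column for remote column %r" % name)
        attr_map[name] = local_pos[name]
    return attr_map

# Publisher table has (a..g) but publishes only (a, b);
# a two-column subscriber table is enough.
mapping = build_attr_map(["a", "b"], ["a", "b"])
```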
The patch adds a new parse node PublicationTable, but doesn't add
copyfuncs.c, equalfuncs.c, readfuncs.c, outfuncs.c support for it.
Thanks, added this.
Looking at OpenTableList(), I think you forgot to update the comment --
it says "open relations specified by a RangeVar list",
Thank you for the review, Modified this.
To nitpick, I find "Bitmapset *att_list" a bit annoying, because it's
not really a list ;-)
Changed this.
It's not super clear to me that strlist_to_textarray() and related
processing will behave sanely when the column names contain weird
characters such as commas or quotes, or just when used with uppercase
column names. Maybe it's worth having tests that try to break such
cases.
Added a few test cases for this.
In AlterPublicationTables() I was confused by some code that seemed
commented a bit too verbosely
Modified this as per the suggestion.
: REPLICA IDENTITY columns are always replicated
: irrespective of column names specification.
... for which you don't have any tests
I have added these tests.
Having said that, I'm not sure I agree with this design decision; what I
think this is doing is hiding from the user the fact that they are
publishing columns that they don't want to publish. I think as a user I
would rather get an error in that case:
ERROR: invalid column list in published set
DETAIL: The set of published commands does not include all the replica
identity columns.
or something like that. Avoid possible nasty surprises of security-
leaking nature.
OK, thank you for your opinion. I agree that giving an explicit error in
this case will be safer. I will include this if there are no
counter-views.
Thank you for your review comments. Please find attached the rebased and
updated patch.
Thank you,
Rahila Syed
Attachments:
v3-0001-Add-column-filtering-to-logical-replication.patch
On Mon, Aug 9, 2021 at 1:36 AM Rahila Syed <rahilasyed90@gmail.com> wrote:
+1 for an explicit error in this case.
Can you please explain why you have the restriction for including
replica identity columns and do we want to put a similar restriction
for the primary key? As far as I understand, if we allow default
values on subscribers for replica identity, then probably updates,
deletes won't work as they need to use replica identity (or PK) to
search the required tuple. If so, shouldn't we add this restriction
only when a publication has been defined for one of these (Update,
Delete) actions?
Another point is what if someone drops the column used in one of the
publications? Do we want to drop the entire relation from publication
or just remove the column filter or something else?
Do we want to consider that the columns specified in the filter must
not have NOT NULL constraint? Because, otherwise, the subscriber will
error out inserting such rows?
Minor comments:
================
pq_sendbyte(out, flags);
-
/* attribute name */
pq_sendstring(out, NameStr(att->attname));
@@ -953,6 +1000,7 @@ logicalrep_write_attrs(StringInfo out, Relation rel)
/* attribute mode */
pq_sendint32(out, att->atttypmod);
+
}
bms_free(idattrs);
diff --git a/src/backend/replication/logical/relation.c
b/src/backend/replication/logical/relation.c
index c37e2a7e29..d7a7b00841 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -354,7 +354,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid,
LOCKMODE lockmode)
attnum = logicalrep_rel_att_by_name(remoterel,
NameStr(attr->attname));
-
entry->attrmap->attnums[i] = attnum;
There are quite a few places in the patch that contains spurious line
additions or removals.
--
With Regards,
Amit Kapila.
On Mon, Aug 9, 2021 at 3:59 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
I noticed that other databases provide this feature [1] and they allow
users to specify "Columns that are included in Filter" or specify "All
columns to be included in filter except for a subset of columns". I am
not sure if we want to provide both ways in the first version but at
least we should consider it as a future extensibility requirement and
try to choose syntax accordingly.

[1] https://docs.oracle.com/en/cloud/paas/goldengate-cloud/gwuad/selecting-columns.html#GUID-9A851C8B-48F7-43DF-8D98-D086BE069E20
--
With Regards,
Amit Kapila.
Hi Amit,
Thanks for your review.
Can you please explain why you have the restriction for including
replica identity columns and do we want to put a similar restriction
for the primary key? As far as I understand, if we allow default
values on subscribers for replica identity, then probably updates,
deletes won't work as they need to use replica identity (or PK) to
search the required tuple. If so, shouldn't we add this restriction
only when a publication has been defined for one of these (Update,
Delete) actions?
Yes, as you mentioned, they are needed for updates and deletes to work.
The restriction requiring replica identity columns to be included in
column filters exists because, in case the replica identity column
values did not change, the old row's replica identity columns are not
sent to the subscriber; thus we need the new replica identity columns to
be sent to identify the row that is to be updated or deleted.
I haven't tested whether it would break INSERT as well. I will update
the patch accordingly.
Another point is what if someone drops the column used in one of the
publications? Do we want to drop the entire relation from publication
or just remove the column filter or something else?
Thanks for pointing this out. Currently, this is not handled in the
patch. I think dropping the column from the filter would make sense,
along the lines of the table being dropped from the publication when the
table itself is dropped.
Do we want to consider that the columns specified in the filter must
not have NOT NULL constraint? Because, otherwise, the subscriber will
error out inserting such rows?
I think you mean columns *not* specified in the filter must not have a
NOT NULL constraint on the subscriber, as this will break during insert
when it tries to insert NULL for columns not sent by the publisher.
I will look into fixing this. This probably won't be a problem in case
the column is auto-generated or has a default value.
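The hazard under discussion can be sketched in illustrative Python (not the server's actual insert path; the schema representation here is invented for the example): a subscriber column outside the filter falls back to its default, or NULL when there is none, so a NOT NULL column without a default fails on every replicated insert.

```python
# Sketch of a subscriber applying a replicated insert that carries only
# the filtered columns; unsent columns get their default or NULL.

def apply_insert(received, local_schema):
    """local_schema maps column -> (not_null, default). Illustrative only."""
    row = {}
    for col, (not_null, default) in local_schema.items():
        value = received.get(col, default)
        if value is None and not_null:
            raise ValueError("null value in NOT NULL column %r" % col)
        row[col] = value
    return row

schema = {"id": (True, None), "note": (False, None), "ts": (True, "now")}
# 'ts' is NOT NULL but has a default, so omitting it from the filter is fine;
# omitting 'id' (NOT NULL, no default) would fail.
row = apply_insert({"id": 1}, schema)
```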
Minor comments:
================
pq_sendbyte(out, flags);
-
/* attribute name */
pq_sendstring(out, NameStr(att->attname));

@@ -953,6 +1000,7 @@ logicalrep_write_attrs(StringInfo out, Relation rel)
/* attribute mode */
pq_sendint32(out, att->atttypmod);
+
}

bms_free(idattrs);

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index c37e2a7e29..d7a7b00841 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -354,7 +354,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
attnum = logicalrep_rel_att_by_name(remoterel,
NameStr(attr->attname));
-
entry->attrmap->attnums[i] = attnum;

There are quite a few places in the patch that contain spurious line
additions or removals.
Thank you for your comments, I will fix these.
Thank you,
Rahila Syed
On Thu, Aug 12, 2021 at 8:40 AM Rahila Syed <rahilasyed90@gmail.com> wrote:
Can you please explain why you have the restriction for including
replica identity columns and do we want to put a similar restriction
for the primary key? As far as I understand, if we allow default
values on subscribers for replica identity, then probably updates,
deletes won't work as they need to use replica identity (or PK) to
search the required tuple. If so, shouldn't we add this restriction
only when a publication has been defined for one of these (Update,
Delete) actions?
Yes, like you mentioned, they are needed for Updates and Deletes to work.
The restriction for including replica identity columns in column filters
exists because, in case the replica identity column values did not change,
the old row's replica identity columns are not sent to the subscriber; thus
we need the new row's replica identity columns to be sent to identify the
row that is to be Updated or Deleted.
I haven't tested whether it would break Insert as well. I will update the
patch accordingly.
Okay, but then we also need to ensure that the user shouldn't be
allowed to enable the 'update' or 'delete' actions for a publication that
contains some column filter that doesn't include the replica identity
columns.
Another point is what if someone drops the column used in one of the
publications? Do we want to drop the entire relation from publication
or just remove the column filter or something else?
Thanks for pointing this out. Currently, this is not handled in the patch.
I think dropping the column from the filter would make sense, along the
lines of the table being dropped from the publication in case of DROP TABLE.
I think it would be tricky if you want to remove the column from the
filter because you need to recompute the entire filter and update it
again. Also, you might need to do this for all the publications that
have a particular column in their filter clause. It might be easier to
drop the entire filter, but you can check; if another way turns out to
be easier, that is good.
Do we want to consider that the columns specified in the filter must
not have NOT NULL constraint? Because, otherwise, the subscriber will
error out inserting such rows?
I think you mean columns *not* specified in the filter must not have a NOT
NULL constraint on the subscriber, as this will break during insert: it
will try to insert NULL for columns not sent by the publisher.
Right.
--
With Regards,
Amit Kapila.
Hi,
Another point is what if someone drops the column used in one of the
publications? Do we want to drop the entire relation from publication
or just remove the column filter or something else?
After thinking about this, I think it is best to remove the entire table
from the publication if a column specified in the column filter is dropped
from the table. Because, if we drop the entire filter without dropping the
table, all the columns will be replicated, and the downstream server table
might not have those columns.
If we drop only the column from the filter, we might have to recreate the
filter and check for replica identity. That means if the replica identity
column is dropped, you can't drop it from the filter, and might have to
drop the entire publication-table mapping anyway.
Thus, I think it is cleanest to drop the entire relation from the
publication.
This has been implemented in the attached version.
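A sketch of this behaviour (names invented; behaviour as described above,
per the attached patch, not final):

```sql
CREATE PUBLICATION pub_accounts FOR TABLE accounts (id, balance);

-- Dropping a column that appears in the filter removes the whole
-- table from the publication rather than shrinking the filter:
ALTER TABLE accounts DROP COLUMN balance;
-- "accounts" is no longer part of pub_accounts.
```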
Do we want to consider that the columns specified in the filter must
not have NOT NULL constraint? Because, otherwise, the subscriber will
error out inserting such rows?
I think you mean columns *not* specified in the filter must not have a NOT
NULL constraint on the subscriber, as this will break during insert: it
will try to insert NULL for columns not sent by the publisher.
I will look into fixing this. Probably this won't be a problem in
case the column is auto-generated or contains a default value.
I am not sure if this needs to be handled. Ideally, we would need to
prevent the subscriber tables from having a NOT NULL constraint on
non-filter columns if the publisher uses column filters to publish the
values of the table. There is no way to do this at the time of creating
a table on the subscriber.
As this involves querying the publisher for this information, it could be
done at the time of initial table synchronization, i.e. error out if any
of the subscribed tables has a NOT NULL constraint on non-filter columns.
This will lead to the user dropping and recreating the subscription after
removing the
NOT NULL constraint from the table.
I think the same can be achieved by doing nothing and letting the
subscriber error out while inserting rows.
Minor comments:
================
 		pq_sendbyte(out, flags);
-
 		/* attribute name */
 		pq_sendstring(out, NameStr(att->attname));

@@ -953,6 +1000,7 @@ logicalrep_write_attrs(StringInfo out, Relation rel)
 		/* attribute mode */
 		pq_sendint32(out, att->atttypmod);
+
 	}
 	bms_free(idattrs);

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index c37e2a7e29..d7a7b00841 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -354,7 +354,6 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 		attnum = logicalrep_rel_att_by_name(remoterel,
 											NameStr(attr->attname));
-
 		entry->attrmap->attnums[i] = attnum;

There are quite a few places in the patch that contain spurious line
additions or removals.
Fixed these in the attached patch.
Having said that, I'm not sure I agree with this design decision; what I
think this is doing is hiding from the user the fact that they are
publishing columns that they don't want to publish. I think as a user I
would rather get an error in that case:
ERROR: invalid column list in published set
DETAIL: The set of published commands does not include all the replica
identity columns.
Added this.
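With that check in place, a sketch of the expected interaction (names
invented; error text as quoted above):

```sql
CREATE TABLE accounts (id int PRIMARY KEY, balance int, note text);

-- The column list omits the replica identity column "id":
CREATE PUBLICATION pub_accounts FOR TABLE accounts (balance, note);
-- ERROR:  invalid column list in published set
-- DETAIL:  The set of published commands does not include all the
--          replica identity columns.
```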
Also added some more tests. Please find attached a rebased and updated
patch.
Thank you,
Rahila Syed