Re: row filtering for logical replication

Started by Stephen Frostabout 7 years ago4 messages

sfrost@snowman.net

about 7 years ago

Greetings,

* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:

On 23/11/2018 03:02, Stephen Frost wrote:

* Euler Taveira (euler@timbira.com.br) wrote:

2018-02-28 21:54 GMT-03:00 Craig Ringer <craig@2ndquadrant.com>:

Good idea. I haven't read this yet, but one thing to make sure you've
handled is limiting the clause to referencing only the current tuple and the
catalogs. user-catalog tables are OK, too, anything that is
RelationIsAccessibleInLogicalDecoding().

This means only immutable functions may be invoked, since a stable or
volatile function might attempt to access a table. And views must be
prohibited or recursively checked. (We have tree walkers that would help
with this).

It might be worth looking at the current logic for CHECK expressions, since
the requirements are similar. In my opinion you could safely not bother with
allowing access to user catalog tables in the filter expressions and limit
them strictly to immutable functions and the tuple its self.

IIRC implementation is similar to RLS expressions. I'll check all of
these rules.

Given the similarity to RLS and the nearby discussion about allowing
non-superusers to create subscriptions, and probably publications later,
I wonder if we shouldn't be somehow associating this with RLS policies
instead of having the publication filtering be entirely independent..

I do see the appeal here, if you consider logical replication to be a
streaming select it probably applies well.

But given that this is happening inside output plugin which does not
have full executor setup and has catalog-only snapshot I am not sure how
feasible it is to try to merge these two things. As per my previous
email it's possible that we'll have to be stricter about what we allow
in expressions here.

I can certainly understand the concern about trying to combine the
implementation of this with that of RLS; perhaps that isn't a good fit
due to the additional constraints put on logical decoding.

That said, I still think it might make sense to consider these filters
for logical decoding to be policies and, ideally, to allow users to use
the same policy for both.

In the end, the idea of having to build a single large and complex
'create publication' command which has a bunch of tables, each with
their own filter clauses, just strikes me as pretty painful.

The other issue with merging this is that the use-case for filtering out
the data in logical replication is not necessarily about security, but
often about sending only relevant data. So it makes sense to have filter
on publication without RLS enabled on table and if we'd force that, we'd
limit usefulness of this feature.

I definitely have a serious problem if we are going to say that you
can't use this filtering for security-sensitive cases.

We definitely want to eventually create subscriptions as non-superuser
but that has zero effect on this as everything here is happening on
different server than where subscription lives (we already allow
creation of publications with just CREATE privilege on database and
ownership of the table).

What I wasn't clear about above was the idea that we might allow a user
other than the table owner to publish a given table, but that such a
publication should certanily only be allowed to include the rows which
that user has access to- as regulated by RLS. If the RLS policy is too
complex to allow that then I would think we'd simply throw an error at
the create publication time and the would-be publisher would need to
figure that out with the table owner.

I'll admit that this might seem like a stretch, but what happens today?
Today, people write cronjobs to try to sync between tables with FDWs and
you don't need to own a table to use it as the target of a foreign
table.

I do think that we'll need to have some additional privileges around who
is allowed to create publications, I'm not entirely thrilled with that
being combined with the ability to create schemas; the two seem quite
different to me.

* Euler Taveira (euler@timbira.com.br) wrote:

Em sex, 23 de nov de 2018 às 11:40, Petr Jelinek
<petr.jelinek@2ndquadrant.com> escreveu:

But given that this is happening inside output plugin which does not
have full executor setup and has catalog-only snapshot I am not sure how
feasible it is to try to merge these two things. As per my previous
email it's possible that we'll have to be stricter about what we allow
in expressions here.

This feature should be as simple as possible. I don't want to
introduce a huge overhead just for filtering some data. Data sharding
generally uses simple expressions.

RLS often uses simple filters too.

The other issue with merging this is that the use-case for filtering out
the data in logical replication is not necessarily about security, but
often about sending only relevant data. So it makes sense to have filter
on publication without RLS enabled on table and if we'd force that, we'd
limit usefulness of this feature.

Use the same infrastructure as RLS could be a good idea but use RLS
for row filtering is not. RLS is complex.

Right, this was along the lines I was thinking of- using the
infrastructure and the policy system, in particular.

Thanks!

Stephen

Import Notes

Reply to msg id not found: CAHE3wgj+JATk+Z4XkU7aWPX02uwBxhJaTTNHuCRQ0pFqQjNyQ@mail.gmail.com7bf075c2-7538-e5a4-d4f7-d1a51544dba5@2ndquadrant.com

Petr Jelinek

petr.jelinek@2ndquadrant.com

about 7 years ago

In reply to: Stephen Frost (#1)

On 14/12/2018 16:38, Stephen Frost wrote:

Greetings,

* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:

On 23/11/2018 03:02, Stephen Frost wrote:

* Euler Taveira (euler@timbira.com.br) wrote:

2018-02-28 21:54 GMT-03:00 Craig Ringer <craig@2ndquadrant.com>:

Good idea. I haven't read this yet, but one thing to make sure you've
handled is limiting the clause to referencing only the current tuple and the
catalogs. user-catalog tables are OK, too, anything that is
RelationIsAccessibleInLogicalDecoding().

This means only immutable functions may be invoked, since a stable or
volatile function might attempt to access a table. And views must be
prohibited or recursively checked. (We have tree walkers that would help
with this).

It might be worth looking at the current logic for CHECK expressions, since
the requirements are similar. In my opinion you could safely not bother with
allowing access to user catalog tables in the filter expressions and limit
them strictly to immutable functions and the tuple its self.

IIRC implementation is similar to RLS expressions. I'll check all of
these rules.

Given the similarity to RLS and the nearby discussion about allowing
non-superusers to create subscriptions, and probably publications later,
I wonder if we shouldn't be somehow associating this with RLS policies
instead of having the publication filtering be entirely independent..

I do see the appeal here, if you consider logical replication to be a
streaming select it probably applies well.

But given that this is happening inside output plugin which does not
have full executor setup and has catalog-only snapshot I am not sure how
feasible it is to try to merge these two things. As per my previous
email it's possible that we'll have to be stricter about what we allow
in expressions here.

I can certainly understand the concern about trying to combine the
implementation of this with that of RLS; perhaps that isn't a good fit
due to the additional constraints put on logical decoding.

That said, I still think it might make sense to consider these filters
for logical decoding to be policies and, ideally, to allow users to use
the same policy for both.

I am not against that as long as it's possible to have policy for
logical replication without having it for RLS and vice versa.

I also wonder if policies are flexible enough to allow for specifying
OLD and NEW - the replication filtering deals with DML, not with what's
visible, it might very well depend on differences between these (that's
something the current patch is missing as well BTW).

In the end, the idea of having to build a single large and complex
'create publication' command which has a bunch of tables, each with
their own filter clauses, just strikes me as pretty painful.

The other issue with merging this is that the use-case for filtering out
the data in logical replication is not necessarily about security, but
often about sending only relevant data. So it makes sense to have filter
on publication without RLS enabled on table and if we'd force that, we'd
limit usefulness of this feature.

I definitely have a serious problem if we are going to say that you
can't use this filtering for security-sensitive cases.

I am saying it should not be tied to only security sensitive cases,
because it has use cases that have nothing to do with security (ie, I
don't want this to depend on RLS being enabled for a table).

We definitely want to eventually create subscriptions as non-superuser
but that has zero effect on this as everything here is happening on
different server than where subscription lives (we already allow
creation of publications with just CREATE privilege on database and
ownership of the table).

What I wasn't clear about above was the idea that we might allow a user
other than the table owner to publish a given table, but that such a
publication should certanily only be allowed to include the rows which
that user has access to- as regulated by RLS. If the RLS policy is too
complex to allow that then I would think we'd simply throw an error at
the create publication time and the would-be publisher would need to
figure that out with the table owner.

My opinion is that this is useful, but not necessarily something v1
patch needs to solve. Having too many publications and subscriptions to
various places is not currently practical anyway due to decoding
duplicating all the work for every connection.

* Euler Taveira (euler@timbira.com.br) wrote:

Em sex, 23 de nov de 2018 ï¿½s 11:40, Petr Jelinek
<petr.jelinek@2ndquadrant.com> escreveu:

The other issue with merging this is that the use-case for filtering out
the data in logical replication is not necessarily about security, but
often about sending only relevant data. So it makes sense to have filter
on publication without RLS enabled on table and if we'd force that, we'd
limit usefulness of this feature.

Use the same infrastructure as RLS could be a good idea but use RLS
for row filtering is not. RLS is complex.

Right, this was along the lines I was thinking of- using the
infrastructure and the policy system, in particular.

Yeah that part is definitely worth investigating.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Stephen Frost

sfrost@snowman.net

about 7 years ago

In reply to: Petr Jelinek (#2)

Greetings,

* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:

On 14/12/2018 16:38, Stephen Frost wrote:

* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:

I do see the appeal here, if you consider logical replication to be a
streaming select it probably applies well.

But given that this is happening inside output plugin which does not
have full executor setup and has catalog-only snapshot I am not sure how
feasible it is to try to merge these two things. As per my previous
email it's possible that we'll have to be stricter about what we allow
in expressions here.

I can certainly understand the concern about trying to combine the
implementation of this with that of RLS; perhaps that isn't a good fit
due to the additional constraints put on logical decoding.

That said, I still think it might make sense to consider these filters
for logical decoding to be policies and, ideally, to allow users to use
the same policy for both.

I am not against that as long as it's possible to have policy for
logical replication without having it for RLS and vice versa.

RLS already is able to be enabled/disabled on a per-table basis. I
could see how we might want to extend the existing policy system to have
a way to enable/disable individual policies for RLS but that should be
reasonably straight-forward to do, I would think.

I also wonder if policies are flexible enough to allow for specifying
OLD and NEW - the replication filtering deals with DML, not with what's
visible, it might very well depend on differences between these (that's
something the current patch is missing as well BTW).

The policy system already has the notion of a 'visible' check and a
'does the new row match this' check (USING vs. WITH CHECK policies).
Perhaps if you could outline the specific use-cases that you're thinking
about, we could discuss them and make sure that they fit within those
mechanisms- or, if not, discuss if such a use-case would make sense for
RLS as well and, if so, figure out a way to support that for both.

In the end, the idea of having to build a single large and complex
'create publication' command which has a bunch of tables, each with
their own filter clauses, just strikes me as pretty painful.

The other issue with merging this is that the use-case for filtering out
the data in logical replication is not necessarily about security, but
often about sending only relevant data. So it makes sense to have filter
on publication without RLS enabled on table and if we'd force that, we'd
limit usefulness of this feature.

I definitely have a serious problem if we are going to say that you
can't use this filtering for security-sensitive cases.

I am saying it should not be tied to only security sensitive cases,
because it has use cases that have nothing to do with security (ie, I
don't want this to depend on RLS being enabled for a table).

I'm fine with this being able to be independently enabled/disabled,
apart from RLS.

We definitely want to eventually create subscriptions as non-superuser
but that has zero effect on this as everything here is happening on
different server than where subscription lives (we already allow
creation of publications with just CREATE privilege on database and
ownership of the table).

What I wasn't clear about above was the idea that we might allow a user
other than the table owner to publish a given table, but that such a
publication should certanily only be allowed to include the rows which
that user has access to- as regulated by RLS. If the RLS policy is too
complex to allow that then I would think we'd simply throw an error at
the create publication time and the would-be publisher would need to
figure that out with the table owner.

My opinion is that this is useful, but not necessarily something v1
patch needs to solve. Having too many publications and subscriptions to
various places is not currently practical anyway due to decoding
duplicating all the work for every connection.

I agree that supporting this could be done in a later patch, however, I
do feel that when we go to add support for non-owners to create
publications then RLS needs to be supported at that point (and by more
than just 'throw an error'). I can agree with incremental improvements
but I don't want to get to a point where we've got a bunch of
independent things only half of which work with other parts of the
system.

* Euler Taveira (euler@timbira.com.br) wrote:

Em sex, 23 de nov de 2018 às 11:40, Petr Jelinek
<petr.jelinek@2ndquadrant.com> escreveu:

The other issue with merging this is that the use-case for filtering out
the data in logical replication is not necessarily about security, but
often about sending only relevant data. So it makes sense to have filter
on publication without RLS enabled on table and if we'd force that, we'd
limit usefulness of this feature.

Use the same infrastructure as RLS could be a good idea but use RLS
for row filtering is not. RLS is complex.

Right, this was along the lines I was thinking of- using the
infrastructure and the policy system, in particular.

Yeah that part is definitely worth investigating.

Glad to hear that.

Thanks!

Stephen

Petr Jelinek

petr.jelinek@2ndquadrant.com

about 7 years ago

In reply to: Stephen Frost (#3)

Hi,

On 27/12/2018 20:05, Stephen Frost wrote:

Greetings,

* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:

On 14/12/2018 16:38, Stephen Frost wrote:

* Petr Jelinek (petr.jelinek@2ndquadrant.com) wrote:

I do see the appeal here, if you consider logical replication to be a
streaming select it probably applies well.

But given that this is happening inside output plugin which does not
have full executor setup and has catalog-only snapshot I am not sure how
feasible it is to try to merge these two things. As per my previous
email it's possible that we'll have to be stricter about what we allow
in expressions here.

I can certainly understand the concern about trying to combine the
implementation of this with that of RLS; perhaps that isn't a good fit
due to the additional constraints put on logical decoding.

That said, I still think it might make sense to consider these filters
for logical decoding to be policies and, ideally, to allow users to use
the same policy for both.

I am not against that as long as it's possible to have policy for
logical replication without having it for RLS and vice versa.

RLS already is able to be enabled/disabled on a per-table basis. I
could see how we might want to extend the existing policy system to have
a way to enable/disable individual policies for RLS but that should be
reasonably straight-forward to do, I would think.

Sure, I was mostly referring to having ability of enable/disable this
independently of enabling/disabling RLS which you are okay with based on
bellow so no issue there from my side.

I also wonder if policies are flexible enough to allow for specifying
OLD and NEW - the replication filtering deals with DML, not with what's
visible, it might very well depend on differences between these (that's
something the current patch is missing as well BTW).

The policy system already has the notion of a 'visible' check and a
'does the new row match this' check (USING vs. WITH CHECK policies).
Perhaps if you could outline the specific use-cases that you're thinking
about, we could discuss them and make sure that they fit within those
mechanisms- or, if not, discuss if such a use-case would make sense for
RLS as well and, if so, figure out a way to support that for both.

So we'd use USING for old row images (UPDATE/DELETE) and WITH CHECK for
new ones (UPDATE/INSERT)? I think OLD/NEW is somewhat more natural
naming of this as there is no "SELECT" part of operation here, but as
long as the functionality is there I don't mind syntax that much.

In the end, the idea of having to build a single large and complex
'create publication' command which has a bunch of tables, each with
their own filter clauses, just strikes me as pretty painful.

The other issue with merging this is that the use-case for filtering out
the data in logical replication is not necessarily about security, but
often about sending only relevant data. So it makes sense to have filter
on publication without RLS enabled on table and if we'd force that, we'd
limit usefulness of this feature.

I definitely have a serious problem if we are going to say that you
can't use this filtering for security-sensitive cases.

I am saying it should not be tied to only security sensitive cases,
because it has use cases that have nothing to do with security (ie, I
don't want this to depend on RLS being enabled for a table).

I'm fine with this being able to be independently enabled/disabled,
apart from RLS.

Cool.

We definitely want to eventually create subscriptions as non-superuser
but that has zero effect on this as everything here is happening on
different server than where subscription lives (we already allow
creation of publications with just CREATE privilege on database and
ownership of the table).

What I wasn't clear about above was the idea that we might allow a user
other than the table owner to publish a given table, but that such a
publication should certanily only be allowed to include the rows which
that user has access to- as regulated by RLS. If the RLS policy is too
complex to allow that then I would think we'd simply throw an error at
the create publication time and the would-be publisher would need to
figure that out with the table owner.

My opinion is that this is useful, but not necessarily something v1
patch needs to solve. Having too many publications and subscriptions to
various places is not currently practical anyway due to decoding
duplicating all the work for every connection.

I agree that supporting this could be done in a later patch, however, I
do feel that when we go to add support for non-owners to create
publications then RLS needs to be supported at that point (and by more
than just 'throw an error'). I can agree with incremental improvements
but I don't want to get to a point where we've got a bunch of
independent things only half of which work with other parts of the
system.

Yes, using RLS infrastructure now will make it easier to add support for
publishing without being owner at some later point, just let's please
not make publishing without being owner part of requirements for this.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services