Multivariate MCV stats can leak data to unprivileged users

Started by Dean Rasheedalmost 7 years ago34 messageshackers
Jump to latest
#1Dean Rasheed
dean.a.rasheed@gmail.com

While working on 1aebfbea83c, I noticed that the new multivariate MCV
stats feature suffers from the same problem, and also the original
problems that were fixed in e2d4ef8de8 and earlier --- namely that a
user can see values in the MCV lists that they shouldn't see (values
from tables that they don't have privileges on).

I think there are 2 separate issues here:

1). The table pg_statistic_ext is accessible to anyone, so any user
can see the MCV lists of any table. I think we should give this the
same treatment as pg_statistic, and hide it behind a security barrier
view, revoking public access from the table.

2). The multivariate MCV stats planner code can be made to invoke
user-defined operators, so a user can create a leaky operator and use
it to reveal data values from the MCV lists even if they have no
permissions on the table.

Attached is a draft patch to fix (2), which hooks into
statext_is_compatible_clause().

I haven't thought much about (1). There are some questions about what
exactly the view should look like. Probably it should translate table
oids to names, like pg_stats does, but should it also translate column
indexes to names? That could get fiddly. Should it unpack MCV items?

I'll raise this as an open item for PG12.

Regards,
Dean

Attachments:

fix-mcv-stats-planner-leak.patchtext/x-patch; charset=US-ASCII; name=fix-mcv-stats-planner-leak.patchDownload+186-8
#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#1)
Re: Multivariate MCV stats can leak data to unprivileged users

On Fri, May 10, 2019 at 10:19:44AM +0100, Dean Rasheed wrote:

While working on 1aebfbea83c, I noticed that the new multivariate MCV
stats feature suffers from the same problem, and also the original
problems that were fixed in e2d4ef8de8 and earlier --- namely that a
user can see values in the MCV lists that they shouldn't see (values
from tables that they don't have privileges on).

I think there are 2 separate issues here:

1). The table pg_statistic_ext is accessible to anyone, so any user
can see the MCV lists of any table. I think we should give this the
same treatment as pg_statistic, and hide it behind a security barrier
view, revoking public access from the table.

2). The multivariate MCV stats planner code can be made to invoke
user-defined operators, so a user can create a leaky operator and use
it to reveal data values from the MCV lists even if they have no
permissions on the table.

Attached is a draft patch to fix (2), which hooks into
statext_is_compatible_clause().

I think that patch is good.

I haven't thought much about (1). There are some questions about what
exactly the view should look like. Probably it should translate table
oids to names, like pg_stats does, but should it also translate column
indexes to names? That could get fiddly. Should it unpack MCV items?

Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats.
It would:

(1) translate the schema / relation / attribute names

I don't see why translating column indexes to names would be fiddly.
It's a matter of simple unnest + join, no? Or what issues you see?

(2) include values for ndistinct / dependencies, if built

Those don't include any actual values, so this should be OK. You might
make the argument that even this does leak a bit of information (e.g.
when you can see values in one column, and you know there's a strong
functional dependence, it tells you something about the other column
which you may not see). But I think we kinda already leak information
about that through estimates, so maybe that's not an issue.

(3) include MCV list only when user has access to all columns

Essentially, if the user is missing 'select' privilege on at least one
of the columns, there'll be NULL. Otherwise the MCV value.

The attached patch adds pg_stats_ext doing this. I don't claim it's the
best possible query backing the view, so any improvements are welcome.

I've been thinking we might somehow filter the statistics values, e.g. by
not showing values for attributes the user has no 'select' privilege on,
but that seems like a can of worms. It'd lead to MCV items that can't be
distinguished because the only difference was the removed attribute, and
so on. So I propose we simply show/hide the whole MCV list.

Likewise, I don't think it makes sense to expand the MCV list in this
view - that works for the single-dimensional case, because then the
list is expanded into two arrays (values + frequencies), which are easy
to process further. But for multivariate MCV lists that's much more
complex - we don't know how many attributes are there, for example.

So I suggest we just show the pg_mcv_list value as is, and leave it up
to the user to call the pg_mcv_list_items() function if needed.

This will also work for histograms, where expanding the value in the
pg_stats_ext would be even trickier.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

add-pg-stats-ext.patchtext/plain; charset=us-asciiDownload+20-0
#3Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#2)
Re: Multivariate MCV stats can leak data to unprivileged users

On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats.
It would:

(1) translate the schema / relation / attribute names

I don't see why translating column indexes to names would be fiddly.
It's a matter of simple unnest + join, no? Or what issues you see?

Yeah, you're right. I thought it would be harder than that. One minor
thing -- I think we should have an explicit ORDER BY where we collect
the column names, to guarantee that they're listed in the right order.

(2) include values for ndistinct / dependencies, if built

Those don't include any actual values, so this should be OK. You might
make the argument that even this does leak a bit of information (e.g.
when you can see values in one column, and you know there's a strong
functional dependence, it tells you something about the other column
which you may not see). But I think we kinda already leak information
about that through estimates, so maybe that's not an issue.

Hmm. For normal statistics, if the user has no permissions on the
table, they don't get to see any of these kinds of statistics, not
even things like n_distinct. I think we should do the same here --
i.e., if the user has no permissions on the table, don't let them see
anything. Such a user will not be able to run EXPLAIN on queries
against the table, so they won't get to see any estimates, and I don't
think they should get to see any extended statistics either.

(3) include MCV list only when user has access to all columns

Essentially, if the user is missing 'select' privilege on at least one
of the columns, there'll be NULL. Otherwise the MCV value.

OK, that seems reasonable, except as I said above, I think that should
apply to all statistics, not just the MCV lists.

The attached patch adds pg_stats_ext doing this. I don't claim it's the
best possible query backing the view, so any improvements are welcome.

I've been thinking we might somehow filter the statistics values, e.g. by
not showing values for attributes the user has no 'select' privilege on,
but that seems like a can of worms. It'd lead to MCV items that can't be
distinguished because the only difference was the removed attribute, and
so on. So I propose we simply show/hide the whole MCV list.

Agreed.

Likewise, I don't think it makes sense to expand the MCV list in this
view - that works for the single-dimensional case, because then the
list is expanded into two arrays (values + frequencies), which are easy
to process further. But for multivariate MCV lists that's much more
complex - we don't know how many attributes are there, for example.

So I suggest we just show the pg_mcv_list value as is, and leave it up
to the user to call the pg_mcv_list_items() function if needed.

I think expanding the MCV lists is actually quite useful because then
you can see arrays of values, nulls, frequencies and base frequencies
in a reasonably readable form (it certainly looks better than a binary
dump), without needing to join to a function call, which is a bit
ugly, and unmemorable.

The results from the attached look quite reasonable at first glance.
It contains a few other changes as well:

1). It exposes the schema, name and owner of the statistics object as
well via the view, for completeness.

2). It changes a few column names -- traditionally these views strip
off the column name prefix from the underlying table, so I've
attempted to be consistent with other similar views.

3). I added array-valued columns for each of the MCV list components,
which makes it more like pg_stats.

4). I moved all the permission checks to the top-level WHERE clause,
so a user needs to have select permissions on all the columns
mentioned by the statistics, and the table mustn't have RLS in effect,
otherwise the user won't see the row for that statistics object.

5). Some columns from pg_statistic_ext have to be made visible for
psql \d to work. Basically, it needs to be able to query for the
existence of extended statistics, but it doesn't need to see the
actual statistical data. Of course, we could change psql to use the
view, but this way gives us better backwards compatibility with older
clients.

This is still going to break compatibility of any user code looking at
stxndistinct or stxdependencies from pg_statistic_ext, but at least it
doesn't break old versions of psql.

Note: doc and test updates still to do.

Regards,
Dean

Attachments:

add-pg-stats-ext-v2.patchtext/x-patch; charset=US-ASCII; name=add-pg-stats-ext-v2.patchDownload+40-0
#4Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#3)
Re: Multivariate MCV stats can leak data to unprivileged users

On Thu, May 16, 2019 at 02:28:03PM +0100, Dean Rasheed wrote:

On Mon, 13 May 2019 at 23:36, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Yeah. I suggest we add a simple pg_stats_ext view, similar to pg_stats.
It would:

(1) translate the schema / relation / attribute names

I don't see why translating column indexes to names would be fiddly.
It's a matter of simple unnest + join, no? Or what issues you see?

Yeah, you're right. I thought it would be harder than that. One minor
thing -- I think we should have an explicit ORDER BY where we collect
the column names, to guarantee that they're listed in the right order.

Good point.

(2) include values for ndistinct / dependencies, if built

Those don't include any actual values, so this should be OK. You might
make the argument that even this does leak a bit of information (e.g.
when you can see values in one column, and you know there's a strong
functional dependence, it tells you something about the other column
which you may not see). But I think we kinda already leak information
about that through estimates, so maybe that's not an issue.

Hmm. For normal statistics, if the user has no permissions on the
table, they don't get to see any of these kinds of statistics, not
even things like n_distinct. I think we should do the same here --
i.e., if the user has no permissions on the table, don't let them see
anything. Such a user will not be able to run EXPLAIN on queries
against the table, so they won't get to see any estimates, and I don't
think they should get to see any extended statistics either.

OK, I haven't realized we don't show that even for normal stats.

(3) include MCV list only when user has access to all columns

Essentially, if the user is missing 'select' privilege on at least one
of the columns, there'll be NULL. Otherwise the MCV value.

OK, that seems reasonable, except as I said above, I think that should
apply to all statistics, not just the MCV lists.

The attached patch adds pg_stats_ext doing this. I don't claim it's the
best possible query backing the view, so any improvements are welcome.

I've been thinking we might somehow filter the statistics values, e.g. by
not showing values for attributes the user has no 'select' privilege on,
but that seems like a can of worms. It'd lead to MCV items that can't be
distinguished because the only difference was the removed attribute, and
so on. So I propose we simply show/hide the whole MCV list.

Agreed.

Likewise, I don't think it makes sense to expand the MCV list in this
view - that works for the single-dimensional case, because then the
list is expanded into two arrays (values + frequencies), which are easy
to process further. But for multivariate MCV lists that's much more
complex - we don't know how many attributes are there, for example.

So I suggest we just show the pg_mcv_list value as is, and leave it up
to the user to call the pg_mcv_list_items() function if needed.

I think expanding the MCV lists is actually quite useful because then
you can see arrays of values, nulls, frequencies and base frequencies
in a reasonably readable form (it certainly looks better than a binary
dump), without needing to join to a function call, which is a bit
ugly, and unmemorable.

Hmmm, ok. I think my main worry here is that it may or may not work for
more complex types of extended stats that are likely to come in the
future. Although, maybe it can be made work even for that.

The results from the attached look quite reasonable at first glance.
It contains a few other changes as well:

1). It exposes the schema, name and owner of the statistics object as
well via the view, for completeness.

2). It changes a few column names -- traditionally these views strip
off the column name prefix from the underlying table, so I've
attempted to be consistent with other similar views.

3). I added array-valued columns for each of the MCV list components,
which makes it more like pg_stats.

4). I moved all the permission checks to the top-level WHERE clause,
so a user needs to have select permissions on all the columns
mentioned by the statistics, and the table mustn't have RLS in effect,
otherwise the user won't see the row for that statistics object.

5). Some columns from pg_statistic_ext have to be made visible for
psql \d to work. Basically, it needs to be able to query for the
existence of extended statistics, but it doesn't need to see the
actual statistical data. Of course, we could change psql to use the
view, but this way gives us better backwards compatibility with older
clients.

This is still going to break compatibility of any user code looking at
stxndistinct or stxdependencies from pg_statistic_ext, but at least it
doesn't break old versions of psql.

Note: doc and test updates still to do.

Thanks. I'm travelling today/tomorrow, but I'll do my best to fill in the
missing bits ASAP.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#5Andres Freund
andres@anarazel.de
In reply to: Dean Rasheed (#3)
Re: Multivariate MCV stats can leak data to unprivileged users

Hi,

On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote:

5). Some columns from pg_statistic_ext have to be made visible for
psql \d to work. Basically, it needs to be able to query for the
existence of extended statistics, but it doesn't need to see the
actual statistical data. Of course, we could change psql to use the
view, but this way gives us better backwards compatibility with older
clients.

This is still going to break compatibility of any user code looking at
stxndistinct or stxdependencies from pg_statistic_ext, but at least it
doesn't break old versions of psql.

Hm, it's not normally a goal to keep old psql working against new
postgres versions. And there's plenty other issues preventing a v11 psql
to work against 12. I'd not let this guide any design decisions.

Greetings,

Andres Freund

#6Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Andres Freund (#5)
Re: Multivariate MCV stats can leak data to unprivileged users

On Fri, 17 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:

On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote:

5). Some columns from pg_statistic_ext have to be made visible for
psql \d to work. Basically, it needs to be able to query for the
existence of extended statistics, but it doesn't need to see the
actual statistical data. Of course, we could change psql to use the
view, but this way gives us better backwards compatibility with older
clients.

This is still going to break compatibility of any user code looking at
stxndistinct or stxdependencies from pg_statistic_ext, but at least it
doesn't break old versions of psql.

Hm, it's not normally a goal to keep old psql working against new
postgres versions. And there's plenty other issues preventing a v11 psql
to work against 12. I'd not let this guide any design decisions.

Ah good point. In fact running "\d some_table" from v11's psql against
a v12 database immediately falls over because of the removal of
relhasoids from pg_class, so this isn't a valid reason for retaining
access to any columns from pg_statistic_ext.

Regards,
Dean

#7Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Dean Rasheed (#6)
Re: Multivariate MCV stats can leak data to unprivileged users

On Sat, 18 May 2019 at 10:11, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Fri, 17 May 2019 at 21:29, Andres Freund <andres@anarazel.de> wrote:

On 2019-05-16 14:28:03 +0100, Dean Rasheed wrote:

5). Some columns from pg_statistic_ext have to be made visible for
psql \d to work. Basically, it needs to be able to query for the
existence of extended statistics, but it doesn't need to see the
actual statistical data. Of course, we could change psql to use the
view, but this way gives us better backwards compatibility with older
clients.

This is still going to break compatibility of any user code looking at
stxndistinct or stxdependencies from pg_statistic_ext, but at least it
doesn't break old versions of psql.

Hm, it's not normally a goal to keep old psql working against new
postgres versions. And there's plenty other issues preventing a v11 psql
to work against 12. I'd not let this guide any design decisions.

Ah good point. In fact running "\d some_table" from v11's psql against
a v12 database immediately falls over because of the removal of
relhasoids from pg_class, so this isn't a valid reason for retaining
access to any columns from pg_statistic_ext.

On the other hand, pg_dump relies on pg_statistic_ext to work out
which extended statistics objects to dump. If we were to change that
to use pg_stats_ext, then a user dumping a table with RLS using the
--enable-row-security flag wouldn't get any extended statistics
objects, which would be a somewhat surprising result.

That could be fixed by changing the view to return rows for every
extended statistics object, nulling out values in columns that the
user doesn't have permission to see, in a similar way to Tomas'
original patch. It would have to be modified to do the RLS check in
the same place as the privilege checks, rather than in the top-level
WHERE clause, and we'd probably also have to expose OIDs in addition
to names, because that's what clients like psql and pg_dump want. To
me, that feels quite messy though, so I think I'd still vote for
leaving the first few columns of pg_statistic_ext accessible to
public, and not have to change the clients to work differently from
v12 onwards.

Regards,
Dean

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#7)
Re: Multivariate MCV stats can leak data to unprivileged users

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On the other hand, pg_dump relies on pg_statistic_ext to work out
which extended statistics objects to dump. If we were to change that
to use pg_stats_ext, then a user dumping a table with RLS using the
--enable-row-security flag wouldn't get any extended statistics
objects, which would be a somewhat surprising result.

It seems like what we need here is to have a separation between the
*definition* of a stats object (which is what pg_dump needs access
to) and the current actual *data* in it. I'd have expected that
keeping those in separate catalogs would be the thing to do, though
perhaps it's too late for that.

regards, tom lane

#9Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#8)
Re: Multivariate MCV stats can leak data to unprivileged users

On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On the other hand, pg_dump relies on pg_statistic_ext to work out
which extended statistics objects to dump. If we were to change that
to use pg_stats_ext, then a user dumping a table with RLS using the
--enable-row-security flag wouldn't get any extended statistics
objects, which would be a somewhat surprising result.

It seems like what we need here is to have a separation between the
*definition* of a stats object (which is what pg_dump needs access
to) and the current actual *data* in it. I'd have expected that
keeping those in separate catalogs would be the thing to do, though
perhaps it's too late for that.

Yeah, with the benefit of hindsight, that would have made sense, but
that seems like a pretty big change to be attempting at this stage.

Regards,
Dean

#10Andres Freund
andres@anarazel.de
In reply to: Dean Rasheed (#9)
Re: Multivariate MCV stats can leak data to unprivileged users

Hi,

On May 18, 2019 8:43:29 AM PDT, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On the other hand, pg_dump relies on pg_statistic_ext to work out
which extended statistics objects to dump. If we were to change

that

to use pg_stats_ext, then a user dumping a table with RLS using the
--enable-row-security flag wouldn't get any extended statistics
objects, which would be a somewhat surprising result.

It seems like what we need here is to have a separation between the
*definition* of a stats object (which is what pg_dump needs access
to) and the current actual *data* in it. I'd have expected that
keeping those in separate catalogs would be the thing to do, though
perhaps it's too late for that.

Yeah, with the benefit of hindsight, that would have made sense, but
that seems like a pretty big change to be attempting at this stage.

Otoh, having a suboptimal catalog representation that we'll likely have to change in one of the next releases also isn't great. Seems likely that we'll need post beta1 catversion bumps anyway?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#11Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#10)
Re: Multivariate MCV stats can leak data to unprivileged users

On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote:

Hi,

On May 18, 2019 8:43:29 AM PDT, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:

On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

On the other hand, pg_dump relies on pg_statistic_ext to work out
which extended statistics objects to dump. If we were to change

that

to use pg_stats_ext, then a user dumping a table with RLS using the
--enable-row-security flag wouldn't get any extended statistics
objects, which would be a somewhat surprising result.

It seems like what we need here is to have a separation between the
*definition* of a stats object (which is what pg_dump needs access
to) and the current actual *data* in it. I'd have expected that
keeping those in separate catalogs would be the thing to do, though
perhaps it's too late for that.

Yeah, with the benefit of hindsight, that would have made sense, but
that seems like a pretty big change to be attempting at this stage.

Otoh, having a suboptimal catalog representation that we'll likely have
to change in one of the next releases also isn't great. Seems likely
that we'll need post beta1 catversion bumps anyway?

But that's not an issue intruduced by PG12, it works like that even for
the extended statistics introduced in PG10.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#11)
Re: Multivariate MCV stats can leak data to unprivileged users

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote:

On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

It seems like what we need here is to have a separation between the
*definition* of a stats object (which is what pg_dump needs access
to) and the current actual *data* in it.

Otoh, having a suboptimal catalog representation that we'll likely have
to change in one of the next releases also isn't great. Seems likely
that we'll need post beta1 catversion bumps anyway?

But that's not an issue intruduced by PG12, it works like that even for
the extended statistics introduced in PG10.

Yeah, but no time like the present to fix it if it's wrong ...

regards, tom lane

#13Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#12)
Re: Multivariate MCV stats can leak data to unprivileged users

On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Sat, May 18, 2019 at 11:49:06AM -0700, Andres Freund wrote:

On Sat, 18 May 2019 at 16:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:

It seems like what we need here is to have a separation between the
*definition* of a stats object (which is what pg_dump needs access
to) and the current actual *data* in it.

Otoh, having a suboptimal catalog representation that we'll likely have
to change in one of the next releases also isn't great. Seems likely
that we'll need post beta1 catversion bumps anyway?

But that's not an issue intruduced by PG12, it works like that even for
the extended statistics introduced in PG10.

Yeah, but no time like the present to fix it if it's wrong ...

Sorry, not sure I understand. Are you saying we should try to rework
this before the beta1 release, or that we don't have time to do that?

I think we have four options - rework it before beta1, rework it after
beta1, rework it in PG13 and leave it as it is now.

If the pg_dump thing si the only issue, maybe there's a simple solution
that reworking all the catalogs. Not sure. Are there any other reasons
why the current catalog representation would be suboptimal, or do we
have some precedent of a catalog split this way? I can't think of any.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#13)
Re: Multivariate MCV stats can leak data to unprivileged users

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

But that's not an issue intruduced by PG12, it works like that even for
the extended statistics introduced in PG10.

Yeah, but no time like the present to fix it if it's wrong ...

Sorry, not sure I understand. Are you saying we should try to rework
this before the beta1 release, or that we don't have time to do that?

I think we have four options - rework it before beta1, rework it after
beta1, rework it in PG13 and leave it as it is now.

Yup, that's about what the options are. I'm just voting against
"change it in v13". If we're going to change it, then the fewer
major versions that have the bogus definition the better --- and
since we're changing that catalog in v12 anyway, users will see
fewer distinct behaviors if we do this change too.

It's very possibly too late to get it done before beta1,
unfortunately. But as Andres noted, post-beta1 catversion
bumps are hardly unusual, so I do not think "rework after
beta1" is unacceptable.

regards, tom lane

#15Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#14)
Re: Multivariate MCV stats can leak data to unprivileged users

Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

On Sat, May 18, 2019 at 03:45:11PM -0400, Tom Lane wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

But that's not an issue intruduced by PG12, it works like that even for
the extended statistics introduced in PG10.

Yeah, but no time like the present to fix it if it's wrong ...

Sorry, not sure I understand. Are you saying we should try to rework
this before the beta1 release, or that we don't have time to do that?

I think we have four options - rework it before beta1, rework it after
beta1, rework it in PG13 and leave it as it is now.

Yup, that's about what the options are. I'm just voting against
"change it in v13". If we're going to change it, then the fewer
major versions that have the bogus definition the better --- and
since we're changing that catalog in v12 anyway, users will see
fewer distinct behaviors if we do this change too.

It's very possibly too late to get it done before beta1,
unfortunately. But as Andres noted, post-beta1 catversion
bumps are hardly unusual, so I do not think "rework after
beta1" is unacceptable.

Agreed.

Thanks,

Stephen

#16Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Stephen Frost (#15)
Re: Multivariate MCV stats can leak data to unprivileged users

On Sun, 19 May 2019 at 00:48, Stephen Frost <sfrost@snowman.net> wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:

I think we have four options - rework it before beta1, rework it after
beta1, rework it in PG13 and leave it as it is now.

Yup, that's about what the options are. I'm just voting against
"change it in v13". If we're going to change it, then the fewer
major versions that have the bogus definition the better --- and
since we're changing that catalog in v12 anyway, users will see
fewer distinct behaviors if we do this change too.

It's very possibly too late to get it done before beta1,
unfortunately. But as Andres noted, post-beta1 catversion
bumps are hardly unusual, so I do not think "rework after
beta1" is unacceptable.

Agreed.

Yes, that makes sense.

I think we shouldn't risk trying to get this into beta1, but let's try
to get it done as soon as possible after that.

Actually, it doesn't appear to be as big a change as I had feared. As
a starter for ten, here's a patch doing the basic split, moving the
extended stats data into a new catalog pg_statistic_ext_data (I'm not
particularly wedded to that name, it's just the first name that came
to mind).

With this patch the catalogs look like this:

\d pg_statistic_ext
Table "pg_catalog.pg_statistic_ext"
Column | Type | Collation | Nullable | Default
--------------+------------+-----------+----------+---------
oid | oid | | not null |
stxrelid | oid | | not null |
stxname | name | | not null |
stxnamespace | oid | | not null |
stxowner | oid | | not null |
stxkeys | int2vector | | not null |
stxkind | "char"[] | | not null |
Indexes:
"pg_statistic_ext_name_index" UNIQUE, btree (stxname, stxnamespace)
"pg_statistic_ext_oid_index" UNIQUE, btree (oid)
"pg_statistic_ext_relid_index" btree (stxrelid)

\d pg_statistic_ext_data
Table "pg_catalog.pg_statistic_ext_data"
Column | Type | Collation | Nullable | Default
-----------------+-----------------+-----------+----------+---------
stxoid | oid | | not null |
stxndistinct | pg_ndistinct | C | |
stxdependencies | pg_dependencies | C | |
stxmcv | pg_mcv_list | C | |
Indexes:
"pg_statistic_ext_data_stxoid_index" UNIQUE, btree (stxoid)

I opted to create/remove pg_statistic_ext_data tuples at the same time
as the pg_statistic_ext tuples, in CreateStatistics() /
RemoveStatisticsById(), so then it's easier to see that they're in a
one-to-one relationship, and other code doesn't need to worry about
the data tuple not existing. The other option would be to defer
inserting the data tuple to ANALYZE.

I couldn't resist moving the code block that declares
pg_statistic_ext's indexes in indexing.h to the right place, assuming
that file is (mostly) sorted alphabetically by catalog name. This puts
the extended stats entries just after the normal stats entries which
seems preferable.

This is only a very rough first draft (e.g., no doc updates), but it
passes all the regression tests.

Regards,
Dean

Attachments:

split-up-pg-statistics-ext.patchtext/x-patch; charset=US-ASCII; name=split-up-pg-statistics-ext.patchDownload+256-88
#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#16)
Re: Multivariate MCV stats can leak data to unprivileged users

Dean Rasheed <dean.a.rasheed@gmail.com> writes:

I think we shouldn't risk trying to get this into beta1, but let's try
to get it done as soon as possible after that.

Agreed.

\d pg_statistic_ext
Table "pg_catalog.pg_statistic_ext"
Column | Type | Collation | Nullable | Default
--------------+------------+-----------+----------+---------
oid | oid | | not null |
stxrelid | oid | | not null |
stxname | name | | not null |
stxnamespace | oid | | not null |
stxowner | oid | | not null |
stxkeys | int2vector | | not null |
stxkind | "char"[] | | not null |
Indexes:
"pg_statistic_ext_name_index" UNIQUE, btree (stxname, stxnamespace)
"pg_statistic_ext_oid_index" UNIQUE, btree (oid)
"pg_statistic_ext_relid_index" btree (stxrelid)

Check.

\d pg_statistic_ext_data
Table "pg_catalog.pg_statistic_ext_data"
Column | Type | Collation | Nullable | Default
-----------------+-----------------+-----------+----------+---------
stxoid | oid | | not null |
stxndistinct | pg_ndistinct | C | |
stxdependencies | pg_dependencies | C | |
stxmcv | pg_mcv_list | C | |
Indexes:
"pg_statistic_ext_data_stxoid_index" UNIQUE, btree (stxoid)

I wonder ... another way we could potentially do this is

create table pg_statistic_ext_data(
stxoid oid, -- OID of owning pg_statistic_ext entry
stxkind char, -- what kind of data
stxdata bytea -- the data, in some format or other
);

The advantage of this way is that we'd not have to rejigger the
catalog's rowtype every time we think of a new kind of extended
stats. The disadvantage is that manual inspection of the contents
of an entry would become much harder, for lack of any convenient
output function. However, this whole exercise is mostly to prevent
casual manual inspection anyway :-(, so I wonder how much we care
about that.

Also, I assume there's going to be a user-accessible view that shows
a join of these tables, but only those rows that correspond to columns
the current user can read all of. Should we give that view the name
pg_statistic_ext for maximum backward compatibility? I'm not sure.
pg_dump would probably prefer it if the view is what has a new name,
but other clients might like the other way.

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#17)
Re: Multivariate MCV stats can leak data to unprivileged users

I wrote:

I wonder ... another way we could potentially do this is

create table pg_statistic_ext_data(
stxoid oid, -- OID of owning pg_statistic_ext entry
stxkind char, -- what kind of data
stxdata bytea -- the data, in some format or other
);

The advantage of this way is that we'd not have to rejigger the
catalog's rowtype every time we think of a new kind of extended
stats. The disadvantage is that manual inspection of the contents
of an entry would become much harder, for lack of any convenient
output function.

No, wait, scratch that. We could fold the three existing types
pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
"pg_stats_ext_data", where the actual storage would need to have an
ID field (so we'd waste a byte or two duplicating the externally
visible stxkind field inside stxdata). The output function for this
type is just a switch over the existing code. The big advantage of
this way compared to the current approach is that adding a new
ext-stats type requires *zero* work with adding new catalog entries.
Just add another switch case in pg_stats_ext_data_out() and you're
done.

regards, tom lane

#19Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: Multivariate MCV stats can leak data to unprivileged users

On Sun, May 19, 2019 at 10:28:43AM -0400, Tom Lane wrote:

I wrote:

I wonder ... another way we could potentially do this is

create table pg_statistic_ext_data(
stxoid oid, -- OID of owning pg_statistic_ext entry
stxkind char, -- what kind of data
stxdata bytea -- the data, in some format or other
);

The advantage of this way is that we'd not have to rejigger the
catalog's rowtype every time we think of a new kind of extended
stats. The disadvantage is that manual inspection of the contents
of an entry would become much harder, for lack of any convenient
output function.

No, wait, scratch that. We could fold the three existing types
pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
"pg_stats_ext_data", where the actual storage would need to have an
ID field (so we'd waste a byte or two duplicating the externally
visible stxkind field inside stxdata). The output function for this
type is just a switch over the existing code. The big advantage of
this way compared to the current approach is that adding a new
ext-stats type requires *zero* work with adding new catalog entries.
Just add another switch case in pg_stats_ext_data_out() and you're
done.

The annoying thing is that this undoes the protections provided by special
data types generated only in internally. It's not possible to generate
e.g. pg_mcv_list values in user code (except for C code, at which points
all bets are off anyway). By abandoning this and reverting to bytea anyone
could craft a bytea and claim it's a statistic value.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#20Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#18)
Re: Multivariate MCV stats can leak data to unprivileged users

On Sun, 19 May 2019 at 15:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I wonder ... another way we could potentially do this is

create table pg_statistic_ext_data(
stxoid oid, -- OID of owning pg_statistic_ext entry
stxkind char, -- what kind of data
stxdata bytea -- the data, in some format or other
);

The advantage of this way is that we'd not have to rejigger the
catalog's rowtype every time we think of a new kind of extended
stats. The disadvantage is that manual inspection of the contents
of an entry would become much harder, for lack of any convenient
output function.

No, wait, scratch that. We could fold the three existing types
pg_ndistinct, pg_dependencies, pg_mcv_list into one new type, say
"pg_stats_ext_data", where the actual storage would need to have an
ID field (so we'd waste a byte or two duplicating the externally
visible stxkind field inside stxdata). The output function for this
type is just a switch over the existing code. The big advantage of
this way compared to the current approach is that adding a new
ext-stats type requires *zero* work with adding new catalog entries.
Just add another switch case in pg_stats_ext_data_out() and you're
done.

This feels a little over-engineered to me. Presumably there'd be a
compound key on (stxoid, stxkind) and we'd have to scan multiple rows
to get all the applicable stats, whereas currently they're all in one
row. And then the user-accessible view would probably need separate
sub-queries for each stats kind.

If the point is just to avoid adding columns to the catalog in future
releases, I'm not sure it's worth the added complexity. We know that
we will probably add histogram stats in a future release. I'm not sure
how many more kinds we'll end up adding, but it doesn't seem likely to
be a huge number. I think we'll add far more columns to other catalog
tables as we add new features to each release.

Regards,
Dean

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#19)
#22Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#21)
#23Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#22)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dean Rasheed (#23)
#25Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tom Lane (#24)
#26Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tom Lane (#24)
#27Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#26)
#28Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#27)
#29Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#28)
#30John Naylor
john.naylor@enterprisedb.com
In reply to: Tomas Vondra (#28)
#31Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#29)
#32Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Tomas Vondra (#31)
#33Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Tomas Vondra (#2)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Dean Rasheed (#33)