Multitenancy optimization

Started by Konstantin Knizhnikabout 7 years ago3 messageshackers

k.knizhnik@postgrespro.ru

about 7 years ago

Hi hacker,

Postgres is quite frequently used in different Internet services with
multi-tenant architecture.
It means that all object stored in the database have something like
"tenant_id" foreign key.
This key is used in all queries, i.e.

select * from Product where tenant_id=? and product_name=?;

The problem is that columns "tenant_id" and "product_name" are
frequently highly correlated (for example if this product is produced
just by one company).
And Postgres knows nothing about this correlation and so makes incorrect
estimation of selectivity of this predicate.

Certainly it is possible to create multicolumn statistics to notify
Postgres about columns correlation.
But unfortunately it is not good and working solution.

First of all we have to create multicolumn statistic for all possible
combinations of table's attributes including "tenant_id".
It is very inconvenient and inefficient.

Second - right now multicolumn statistic is not used for calculating
join selectivity. And for joins estimation errors are most critical,
causing Postgres to choose bad execution plans.

From my point of view the best solution is to make Postgres take in
account possible statistics errors and choose "stable" plan which
cost is not significantly increased in case of estimation errors. But it
requires huge refactoring of optimizer.

Right now I have information that some of Postgres customer which faced
with such problem just hacked calc_joinrel_size_estimate function,
checking attribute name and if it is "tenant_id" then do not take its
selectivity in account.
It leads to good query plans but certainly can not be considered as
acceptable solution.

I thought about more straightforward ways for reaching the same effect.
Right now Postgres allows to explicitly specify number of distinct
values for the attribute:

alter table foo alter column x set (n_distinct=1);

Unfortunately just setting it to 1 doesn't work. Postgres calculates
selectivity based on MCV or histogram and not using n_distinct value.
It is also possible to disable collection of statistic for this columns:

alter table foo alter column x set statistics 0;

But in this case Postgres is choosing DEFAULT_NUM_DISTINCT despite to
n_distinct option specified for this attribute.
I propose small patch which makes Postgres to use explicitly specified
n_distinct attribute option value when no statistic is available.

This test illustrating how it works (without this patch estimation for
this query is 1 row):

postgres=# create table foo(x integer, y integer);
CREATE TABLE
postgres=# insert into foo values (generate_series(1,100000)/10,
generate_series(1,100000)/10);
INSERT 0 100000
postgres=# alter table foo alter column x set (n_distinct=1);
ALTER TABLE
postgres=# alter table foo alter column x set statistics 0;
ALTER TABLE
postgres=# analyze foo;
ANALYZE
postgres=# explain select * from foo where x=100 and y=100;
QUERY PLAN
-------------------------------------------------------
Seq Scan on foo (cost=0.00..1943.00 rows=10 width=8)
Filter: ((x = 100) AND (y = 100))
(2 rows)

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Hadi Moshayedi

hadi@moshayedi.net

about 7 years ago

In reply to: Konstantin Knizhnik (#1)

Re: Multitenancy optimization

On Thu, Mar 28, 2019 at 5:40 AM Konstantin Knizhnik <
k.knizhnik@postgrespro.ru> wrote:

Certainly it is possible to create multicolumn statistics to notify
Postgres about columns correlation.
But unfortunately it is not good and working solution.

First of all we have to create multicolumn statistic for all possible
combinations of table's attributes including "tenant_id".
It is very inconvenient and inefficient.

On the inconvenient part: doesn't postgres itself automatically create
functional dependencies on combinations? i.e. it seems to me if we create
statistics on (a, b, c), then we don't need to create statistics on (a, b)
or (a, c) or (b, c), because the pg_statistic_ext entry for (a, b, c)
already includes enough information.

On the inefficient part, I think there's some areas of improvement here.
For example, if (product_id) -> seller_id correlation is 1.0, then
(product_id, product_name) -> seller_id correlation is definitely 1.0 and
we don't need to store it. So we can reduce the amount of information
stored in pg_statistic_ext -> stxdependencies, without losing any data
points.

More generally, if (a) -> b correlation is X, then (a, c) -> b correlation
is >= X. Maybe we can have a threshold to reduce number of entries in
pg_statistic_ext -> stxdependencies.

-- Hadi

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

about 7 years ago

In reply to: Hadi Moshayedi (#2)

Re: Multitenancy optimization

On 29.03.2019 11:06, Hadi Moshayedi wrote:

On Thu, Mar 28, 2019 at 5:40 AM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru <mailto:k.knizhnik@postgrespro.ru>> wrote:

Certainly it is possible to create multicolumn statistics to notify
Postgres about columns correlation.
But unfortunately it is not good and working solution.

First of all we have to create multicolumn statistic for all possible
combinations of table's attributes including "tenant_id".
It is very inconvenient and inefficient.

On the inconvenient part: doesn't postgres itself automatically create
functional dependencies on combinations? i.e. it seems to me if we
create statistics on (a, b, c), then we don't need to create
statistics on (a, b) or (a, c) or (b, c), because the pg_statistic_ext
entry for (a, b, c) already includes enough information.

On the inefficient part, I think there's some areas of improvement
here. For example, if (product_id) -> seller_id correlation is 1.0,
then (product_id, product_name) -> seller_id correlation is definitely
1.0 and we don't need to store it. So we can reduce the amount of
information stored in pg_statistic_ext -> stxdependencies, without
losing any data points.

More generally, if (a) -> b correlation is X, then (a, c) -> b
correlation is >= X. Maybe we can have a threshold to reduce number of
entries in pg_statistic_ext -> stxdependencies.

-- Hadi

Yes, Postgres automatically creates functional dependencies on combinations.
But actually you do not need ALL combinations. Table can contain
hundreds of attributes: number of combination in this case will not fit
in bigint.
This is why Postgres doesn't allow to create muticolumn statistic for
more than 8 columns.
So if you have table with hundred attributes and tenant_id, you with
have to manually create statistic for each <tenant_id,att-i> pair.
And it is very inconvenient (and as I already mentioned doesn't
completely solve the problem with join selectivity estimation).

May be there are some other ways of addressing this problem (although I
do not them).
But I think that in any case, if number of distinction values is
explicitly specified for the attribute, then this value should be used
by optimizer instead of dummy DEFAULT_NUM_DISTINCT.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Multitenancy optimization

Attachments: