UniqueKey v2

Started by Andy Fanover 2 years ago16 messageshackers

zhihui.fan1213@gmail.com

over 2 years ago

Hi,

David and I had worked the uniquekey stuff since 2020[1], and later it
is blocked by the NULL values stuff. Now the blocker should be removed
by Var.varnullingrels, so it is time to work on this again. During the
past 3 years, we have found more and more interesting usage of it.

Here is a design document and a part of implementation.

What is UniqueKey?
-----------------

UniqueKey represents a uniqueness information for a RelOptInfo. for
example:

SELECT id FROM t;

where the ID is the UniqueKey for the RelOptInfo (t). In the real word,
it has the following attributes:

1). It should be EquivalenceClass based. for example:

SELECT a FROM t WHERE a = id;

In this case, the UniqueKey should be 1 EC with two members
- EC(EM(a), EM(id)).

2). Each UniqueKey may be made up with 1+ EquivalenceClass. for example:

CREATE TABLE t(a int not null, b int not null);
CREATE UNIQUE INDEX on t(a, b);
SELECT * FROM t;

Where the UniqueKey for RelOptInfo (t) will be 2 ECs with each 1 has 1
member.

- EC(em=a), EC(em=b)

3). Each RelOptInfo may have 1+ UniqueKeys.

CREATE TABLE t(a int not null, b int not null, c int not null);
CREATE UNIQUE INDEX on t(a, b);
CREATE UNIQUE INDEX on t(c);

SELECT * FROM t;

Where the UniqueKey for RelOptInfo (t) will be
- [EC(em=a), EC(em=b)].
- [EC(em=c)]

4). A special case is about the one-row case. It works like:
SELECT * FROM t WHERE id = 1;
Here every single expression in the RelOptInfo (t) is unique.

Where can we use it?
--------------------
1. mark the distinct as no-op. SELECT DISTINCT uniquekey FROM v; This
optimization has been required several times in our threads.

2. Figure out more pathkey within the onerow case, then some planning
time can be reduced to be big extend. This user case is not absurd, I
run into a real user case like this:

CREATE TABLE small_t (id int primary key, b int, c int .. u int);
CREATE INDEX ON small_t(b);
CREATE INDEX ON small_t(c);
..

SELECT * FROM small_t s
JOIN t1 on t1.sb = s.b
JOIN T2 on t2.sc = s.c
..
JOIN t20 on t20.su = s.u
WHERE s.id = 1;

Without the above optimization, we don't know s.b /s.c is ordered
already, so it might keep more different paths for small_t because of
they have different interesting pathkey, and use more planning time
for sorting to support merge join.

With the above optimization, the planning time should be reduced since
the seq scan can produce a ordered result for every expression.

3. Figure out more interesting pathkey after join with normal UniqueKey.

CREATE TABLE t(id int primary key, b int, c int);
CREATE INDEX on t(c);
ANALYZE t;

explain (costs off)
select t1.id, t2.c from t t1
join t1 t2 on t1.id = t2.b
and t2.c > 3
order by t1.id, t2.c;

QUERY PLAN
--------------------------------------------------
Sort Key: t1.id, t2.c <--- this sort can be avoided actually.
-> Nested Loop
Join Filter: (t1.id = t2.b)
-> Index Only Scan using t_pkey on t t1
-> Index Scan using t1_c_idx on t1 t2
Index Cond: (c > 3)

*Without knowing the t1.id is unique*, which means there are some
duplicated data in t1.id, the duplication data in t1 will break the
order of (t1.id, t2.c), but if we know the t1.id is unique, the sort
will be not needed. I'm pretty happy with this finding.

4. Optimize some group by case, like

SELECT id, sum(b) FROM t GROUP BY id
is same with
SELECT id, b from t;

I'm not sure how often it is in the real life, I'm not so excited with
this for now.

How to present ECs in UniqueKey?
--------------------------------

I choose "Bitmapset *eclass_indexes;" finally, which is because
Bitmapset is memory compact and good at bms_union, bms_is_subset
stuffs. The value in the bitmap is the positions in root->eq_classes. It
is also be able to present the UniqueKey which is made up from multi
relations or upper relation. I'm pleased with the EC strategy because
the existing logic would even create a EC with single members which
means we don't need to create any EquivalenceClass for our own. for
example, in the case of

SELECT DISTINCT pk FROM t;

a EquivalenceClass with single member is created.

How to present single row in UniqueKey
-------------------------------------

I just use a 'Index relid', an non-zero value means the
RelOptInfo[relid] is single row. For the case like

SELECT * FROM t WHERE id = 1;
The UniqueKey is:
- UniqueKey(eclass_indexes=NULL, relid=1)

during a join, any unique keys join with single row, it's uniqueness can
be kept.

SELECT t1.uk, t2.a FROM t WHERE t2.id = 1 and any-qual(t1, t2);
- UniqueKey (t1.uk)

more specially, join two single row like:

SELECT * FROM t1 join t2 on true where t1.id = 1 and t2.id = 2;

the UniqueKey for the JoinRel will be:
- UniqueKey(eclass_indexes=NULL, relid=1)
- UniqueKey(eclass_indexes=NULL, relid=2)

However, the current single row presentation can't works well with Upper
relation, which I think it would be acceptable. See the following case:

SELECT count(*) FROM t1 JOIN t2 on true;

How to maintain the uniquekey?
-------------------------------
the uniquekey is maintained from baserel to join rel then to upper
relation. In the base rel, it comes from unique index. From the join
relation, it is maintained with two rules:

- the uniquekey in one side is still unique if it can't be duplicated
after the join. for example:

SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk;
UniqueKey on t1: t1.pk
UniqueKey on t1 Join t2: t1.pk

- The combined unique key from both sides are unique all the times.
SELECT t1.pk , t2.pk FROM t1 join t2 on true;
UniqueKey on t1 join t2: (t1.pk, t2.pk)

Some other operations like DISTINCT, GROUP BY can produce UniqueKey as well.

NULL values
-----------
I added notnullattrs in RelOptInfo, which present if these attributes may
not be NULL after the baserestrictinfo is executed. not-null-attributes
may be generated by not-null constraint in catalog or baserestrictinfo
(only) filter. However it is possible become NULLs because of outer
join, then Var.varnullingrels is used in this case. see
'var_is_nullable' function call.

To simplify the UniqueKey module, it doesn't care about the null values
during the maintaining, which means it may contains multi NULL values
all the time by design. However whenever a user case care about that,
the user case can get the answer with the above logic, that is what
'mark-distinct-as-noop' does.

How to reduce the overhead
----------------------------------
UniqueKey employs the similar strategy like PathKey, it only maintain
the interesting PathKey. Currently the interesting UniqueKey includes:
1). It is subset of distinct_pathkeys.
2). It is used in mergeable join clauses for unique key deduction (for
the join rel case, rule 1). In this case, it can be discarded quickly if
the join has been done.

To avoid to check if an uniquekey is subset of distinct clause again and
again, I cached the result into UnqiueKey struct during the UniqueKey
creation.

Since our first goal is just used for marking distinct as no-op, so if
there is no distinct clause at all, unique key will be not maintained at
the beginning. so we can have some codes like:

if (root->distinct_pathkeys == NULL)
return;

This fast path is NOT added for now for better code coverage.

What I have now:
----------------

The current patch just maintain the UniqueKey at the baserel level and
used it for mark-distinct-as-noop purpose. including the basic idea of

- How the UniqueKey is defined.
- How to find out the interesting pathkey in the base relation level.
- How to figure out the unique key contains NULL values.

Also the test cases are prepared, see uniquekey.sql.

Some deep issues can only be found during the development, but I still
like to gather more feedback to see if anything is wrong at the first
place. Like what role will the collation play on for UniqueKey.

Any thought?

jian he

jian.universality@gmail.com

over 2 years ago

In reply to: Andy Fan (#1)

Re: UniqueKey v2

hi.
After `git am`, I still cannot build.

../../Desktop/pg_sources/main/postgres/src/backend/optimizer/path/uniquekey.c:125:45:
error: variable ‘var’ set but not used
[-Werror=unused-but-set-variable]
125 | Var *var;
| ^~~

You also need to change src/backend/optimizer/path/meson.build.

git apply failed.

git am warning:
Applying: uniquekey on base relation and used it for mark-distinct-as-op.
.git/rebase-apply/patch:876: new blank line at EOF.
+
warning: 1 line adds whitespace errors.

I think you can use `git diff --check`
(https://git-scm.com/docs/git-diff) to check for whitespace related
errors.

Andy Fan

zhihui.fan1213@gmail.com

over 2 years ago

In reply to: jian he (#2)

Re: UniqueKey v2

jian he <jian.universality@gmail.com> writes:

Hi jian,

hi.
After `git am`, I still cannot build.

../../Desktop/pg_sources/main/postgres/src/backend/optimizer/path/uniquekey.c:125:45:
error: variable ‘var’ set but not used
[-Werror=unused-but-set-variable]
125 | Var *var;
| ^~~

Thanks for this report, looks clang 11 can't capture this error. I have
switched to clang 17 which would report this issue at the first place.

You also need to change src/backend/optimizer/path/meson.build.

Great thanks.

git apply failed.

git am warning:
Applying: uniquekey on base relation and used it for mark-distinct-as-op.
.git/rebase-apply/patch:876: new blank line at EOF.
+
warning: 1 line adds whitespace errors.

I think you can use `git diff --check`
(https://git-scm.com/docs/git-diff) to check for whitespace related
errors.

thanks for the really good suggestion. Here is the newer version:

jian he

jian.universality@gmail.com

over 2 years ago

In reply to: Andy Fan (#3)

Re: UniqueKey v2

On Tue, Oct 17, 2023 at 11:21 AM <zhihuifan1213@163.com> wrote:

thanks for the really good suggestion. Here is the newer version:

--- a/src/backend/optimizer/path/meson.build
+++ b/src/backend/optimizer/path/meson.build
@@ -10,4 +10,5 @@ backend_sources += files(
   'joinrels.c',
   'pathkeys.c',
   'tidpath.c',
+  'uniquekey.c'
 )
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 3ac25d47..5ed550ca 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -264,7 +264,10 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,

int strategy, bool nulls_first);
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,

 List *live_childrels);
-
+/*
+ * uniquekey.c
+ *     uniquekey.c related functions.
+ */

---------
i did some simple tests using text data type.

it works with the primary key, not with unique indexes.
it does not work when the column is unique, not null.

The following is my test.

begin;
CREATE COLLATION case_insensitive (provider = icu, locale =
'und-u-ks-level2', deterministic = false);
CREATE COLLATION upper_first (provider = icu, locale = 'und-u-kf-upper');
commit;
begin;
CREATE TABLE test_uniquekey3(a text, b text);
CREATE TABLE test_uniquekey4(a text, b text);
CREATE TABLE test_uniquekey5(a text, b text);
CREATE TABLE test_uniquekey6(a text, b text);
CREATE TABLE test_uniquekey7(a text not null, b text not null);
CREATE TABLE test_uniquekey8(a text not null, b text not null);
CREATE TABLE test_uniquekey9(a text primary key COLLATE upper_first, b
text not null);
CREATE TABLE test_uniquekey10(a text primary key COLLATE
case_insensitive, b text not null);
create unique index on test_uniquekey3 (a COLLATE case_insensitive nulls first)
nulls distinct
with (fillfactor = 80);
create unique index on test_uniquekey4 (a COLLATE case_insensitive nulls first)
nulls not distinct
with (fillfactor = 80);
create unique index on test_uniquekey5 (a COLLATE upper_first nulls first)
nulls distinct;
create unique index on test_uniquekey6 (a COLLATE upper_first nulls first)
nulls not distinct;
create unique index on test_uniquekey7 (a COLLATE upper_first nulls
first) nulls distinct;
create unique index on test_uniquekey8 (a COLLATE case_insensitive
nulls first) nulls not distinct;
insert into test_uniquekey3(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey4(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey5(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey6(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey7(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey8(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey9(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey10(a,b) select g::text, (g+10)::text from
generate_series(1,1e5) g;
insert into test_uniquekey3(a) VALUES(null),(null),(null);
insert into test_uniquekey4(a) VALUES(null);
insert into test_uniquekey5(a) VALUES(null),(null),(null);
insert into test_uniquekey6(a) VALUES(null);
commit;

ANALYZE test_uniquekey3, test_uniquekey4, test_uniquekey5
,test_uniquekey6,test_uniquekey7, test_uniquekey8
,test_uniquekey9, test_uniquekey10;

explain (costs off) select distinct a from test_uniquekey3;
explain (costs off) select distinct a from test_uniquekey4;
explain (costs off) select distinct a from test_uniquekey5;
explain (costs off) select distinct a from test_uniquekey6;
explain (costs off) select distinct a from test_uniquekey7;
explain (costs off) select distinct a from test_uniquekey8;
explain (costs off) select distinct a from test_uniquekey9;
explain (costs off) select distinct a from test_uniquekey10;
explain (costs off) select distinct a from test_uniquekey3 where a < '2000';
explain (costs off) select distinct a from test_uniquekey4 where a < '2000';
explain (costs off) select distinct a from test_uniquekey5 where a < '2000';
explain (costs off) select distinct a from test_uniquekey6 where a < '2000';
explain (costs off) select distinct a from test_uniquekey7 where a < '2000';
explain (costs off) select distinct a from test_uniquekey8 where a < '2000';
explain (costs off) select distinct a from test_uniquekey9 where a < '2000';
explain (costs off) select distinct a from test_uniquekey10 where a < '2000';

--very high selectivity
explain (costs off) select distinct a from test_uniquekey3 where a < '1001';
explain (costs off) select distinct a from test_uniquekey4 where a < '1001';
explain (costs off) select distinct a from test_uniquekey5 where a < '1001';
explain (costs off) select distinct a from test_uniquekey6 where a < '1001';
explain (costs off) select distinct a from test_uniquekey7 where a < '1001';
explain (costs off) select distinct a from test_uniquekey8 where a < '1001';
explain (costs off) select distinct a from test_uniquekey9 where a < '1001';
explain (costs off) select distinct a from test_uniquekey10 where a < '1001';
explain (costs off,ANALYZE) select distinct a from test_uniquekey9
where a < '1001';
explain (costs off,ANALYZE) select distinct a from test_uniquekey10
where a < '1001';

Andy Fan

zhihui.fan1213@gmail.com

over 2 years ago

In reply to: jian he (#4)

Re: UniqueKey v2

i did some simple tests using text data type.

it works with the primary key, not with unique indexes.
it does not work when the column is unique, not null.

The following is my test.

Can you simplify your test case please? I can't undertand what "doesn't
work" mean here and for which case. FWIW, this feature has nothing with
the real data, I don't think inserting any data is helpful unless I
missed anything.

--
Best Regards
Andy Fan

jian he

jian.universality@gmail.com

over 2 years ago

In reply to: Andy Fan (#5)

Re: UniqueKey v2

On Fri, Oct 20, 2023 at 4:33 PM <zhihuifan1213@163.com> wrote:

i did some simple tests using text data type.

it works with the primary key, not with unique indexes.
it does not work when the column is unique, not null.

The following is my test.

Can you simplify your test case please? I can't undertand what "doesn't
work" mean here and for which case. FWIW, this feature has nothing with
the real data, I don't think inserting any data is helpful unless I
missed anything.

Sorry for not explaining it very well.
"make distinct as no-op."
my understanding: it means: if fewer rows meet the criteria "columnX <
const_a;" , after analyze the table, it should use index only scan
for the queryA?
--queryA:
select distinct columnX from the_table where columnX < const_a;

There are several ways for columnX to be unique: primark key, unique
key, unique key nulls distinct, unique key nulls not distinct, unique
key and not null.

After applying your patch, only the primary key case will make the
queryA explain output using the index-only scan.

Andy Fan

zhihui.fan1213@gmail.com

over 2 years ago

In reply to: jian he (#6)

Re: UniqueKey v2

jian he <jian.universality@gmail.com> writes:

On Fri, Oct 20, 2023 at 4:33 PM <zhihuifan1213@163.com> wrote:

i did some simple tests using text data type.

it works with the primary key, not with unique indexes.
it does not work when the column is unique, not null.

The following is my test.

Can you simplify your test case please? I can't undertand what "doesn't
work" mean here and for which case. FWIW, this feature has nothing with
the real data, I don't think inserting any data is helpful unless I
missed anything.

Sorry for not explaining it very well.
"make distinct as no-op."
my understanding: it means: if fewer rows meet the criteria "columnX <
const_a;" , after analyze the table, it should use index only scan

No, "mark distinct as no-op" means the distinct node can be discarded
automatically since it is not needed any more. The simplest case would
be "select distinct pk from t", where it should be same as "select pk
from t". You can check the testcase for the more cases.

--
Best Regards
Andy Fan

Andy Fan

zhihui.fan1213@gmail.com

over 2 years ago

In reply to: Andy Fan (#1)

Re: UniqueKey v2

zhihuifan1213@163.com writes:

Hi,

Here is the v3, the mainly changes is it maintains the UniqueKey on
joinrel level, which probabaly is the most important part of this
feature. It shows how the UnqiueKey on joinrel is generated and how it
is discarded due to non-interesting-uniquekey and also show much details
about the single-row case.

I will always maintain README.uniquekey under src/backend/optimizer/path/
to include the latest state of this feature to save the time for
reviewer from going through from the begining. I also use the word "BAD
CASE" in uniquekey.sql to demo which sistuation is not handled well so
far, that probably needs more attention at the first review.

Antonin Houska

ah@cybertec.at

about 2 years ago

In reply to: Andy Fan (#8)

Re: UniqueKey v2

zhihuifan1213@163.com wrote:

Here is the v3, ...

I'm trying to enhance the join removal functionality (will post my patch in a
separate thread soon) and I consider your patch very helpful in this
area.

Following is my review. Attached are also some fixes and one enhancement:
propagation of the unique keys (UK) from a subquery to the parent query
(0004). (Note that 0001 is just your patch rebased).

* I think that, before EC is considered suitable for an UK, its ec_opfamilies
field needs to be checked. I try to do that in
find_ec_position_matching_expr(), see 0004.

* Set-returning functions (SRFs) can make a "distinct path" necessary even if
the join output is unique.

* RelOptInfo.notnullattrs

My understanding is that this field contains the set attributes whose
uniqueness is guaranteed by the unique key. They are acceptable because they
are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
value makes the containing row filtered out, so the row cannot break
uniqueness of the output. Am I right?

If so, I suggest to change the field name to something less generic, maybe
'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
more checks are needed before particular attribute can actually be used in
UniqueKey.

* add_uniquekey_for_uniqueindex()

I'd appreciate an explanation why the "single-row UK" is created. I think
the reason for unique_exprs==NIL is that a restriction clause VAR=CONST
exists for each column of the unique index. Whether I'm right or not, a
comment should state clearly what the reason is.

* uniquekey_useful_for_merging()

How does uniqueness relate to merge join? In README.uniquekey you seem to
point out that a single row is always sorted, but I don't think this
function is related to that fact. (Instead, I'd expect that pathkeys are
added to all paths for a single-row relation, but I'm not sure you do that
in the current version of the patch.)

* is_uniquekey_useful_afterjoin()

Now that my patch (0004) allows propagation of the unique keys from a
subquery to the upper query, I was wondering if the UniqueKey structure
needs the 'use_for_distinct field' I mean we should now propagate the unique
keys to the parent query whether the subquery has DISTINCT clause or not. I
noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
changed the function to always returned true. However nothing changed in the
output of regression tests (installcheck). Do you insist that the
'use_for_distinct' field is needed?

* uniquekey_contains_multinulls()

** Instead of calling the function when trying to use the UK, how about
checking the ECs when considering creation of the UK? If the tests fail,
just don't create the UK.

** What does the 'multi' word in the function name mean?

* relation_is_distinct_for()

The function name is too similar to rel_is_distinct_for(). I think the name
should indicate that you are checking the relation against a set of
pathkeys. Maybe rel_is_distinct_for_pathkeys() (and remove 'distinct' from
the argument name)? At the same time, it might be good to rename
rel_is_distinct_for() to rel_is_distinct_for_clauses().

* uniquekey_contains_in()

Shouldn't this be uniquekey_contained_in()? And likewise, shouldn't the
comment be " ... if UniqueKey is contained in the list of EquivalenceClass"
?

(In general, even though I'm not English native speaker, I think I see quite
a few grammar issues, which often make reading of the comments/documentation
a bit difficult.)

* Combining the UKs

IMO this is the most problematic part of the patch. You call
populate_joinrel_uniquekeys() for the same join multiple times, each time
with a different 'restrictlist', and you try to do two things at the same
time: 1) combine the UKs of the input relations into the UKs of the join
relation, 2) check if the join relation can be marked single-row.

I think that both 1) and 2) should be independent from join order, and thus
both computations should only take place once for given set of input
relations. And I think they should be done separately:

1) Compute the join UKs

As you admit in a comment in populate_joinrel_uniquekeys(), neither join
method nor clauses should matter. So I think you only need to pick the
"component UKs" (i.e. UKs of the input relations) which are usable above
that join (i.e. neither the join itself nor any join below sets any column
of the UK to NULL) and combine them.

Of course one problem is that the number of combinations can grow
exponentially as new relations are joined. I'm not sure it's necessary to
combine the UKs (and to discard some of them) immediately. Instead, maybe we
can keep lists of UKs only for base relations, and postpone picking the
suitable UKs and combining them until we actually need to check the relation
uniqueness.

2) Check if the join relation is single-row

I in order to get rid of the dependency on 'restrictlist', I think you can
use ECs. Consider a query from your regression tests:

CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int, d int, e int);

SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1;

The idea here seems to be that no more than one row of t1 matches the query
clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that
no more than one row of t2 matches the query (because t1 cannot provide the
clause with more than one input value of 'e'). And therefore, the join also
produces at most one row.

My theory is that relation is single-row if it has an UK such that each of
its ECs meets at least one of the following conditions:

a) contains a constant

b) contains a column of a relation which has already been proven single-row.

b) is referenced by an UK of a relation which has already been proven
single-row.

I think that in the example above, an EC {t1.e, t2.id} should exist. So when
checking whether 't2' is single-row, the condition b) cam be ised: the UK of
't2' should reference the EC {t1.e, t2.id}, which in turn contains the
column t1.e. And 't1' is unique because its EC meets the condition a). (Of
course you need to check em_jdomain before you use particular EM.)

Are you going to submit the patch to the first CF of PG 18?

Please let me know if I can contribute to the effort by reviewing or writing
some code.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#10

Andy Fan

zhihui.fan1213@gmail.com

about 2 years ago

In reply to: Antonin Houska (#9)

Re: UniqueKey v2

Hello Antonin,

zhihuifan1213@163.com wrote:

Here is the v3, ...

I'm trying to enhance the join removal functionality (will post my patch in a
separate thread soon) and I consider your patch very helpful in this
area.

Thanks for these words. The point 2) and 3) is pretty interesting to me
at [1]/messages/by-id/7mlamswjp81p.fsf@e18c07352.et15sqa and "enhance join removal" is another interesting user case.

Following is my review. Attached are also some fixes and one enhancement:
propagation of the unique keys (UK) from a subquery to the parent query
(0004). (Note that 0001 is just your patch rebased).

Thanks for that! more enhancment like uniquekey in partitioned table is
needed. This post is mainly used to check if more people is still
interested with this.

Are you going to submit the patch to the first CF of PG 18?

Since there are still known work to do, I'm not sure if it is OK to
submit in CF. What do you think about this part?

Please let me know if I can contribute to the effort by reviewing or writing
some code.

Absolutely yes! please feel free to review / writing any of them and do
remember add yourself into the author list if you do that.

Thanks for your review suggestion, I will get to this very soon if once
I get time, I hope it is in 4 weeks.

[1]: /messages/by-id/7mlamswjp81p.fsf@e18c07352.et15sqa
/messages/by-id/7mlamswjp81p.fsf@e18c07352.et15sqa

--
Best Regards
Andy Fan

#11

Andy Fan

zhihui.fan1213@gmail.com

about 2 years ago

In reply to: Antonin Houska (#9)

Re: UniqueKey v2

Hello Antonin,

Thanks for interesting with this topic!

* I think that, before EC is considered suitable for an UK, its ec_opfamilies
field needs to be checked. I try to do that in
find_ec_position_matching_expr(), see 0004.

Could you make the reason clearer for adding 'List *opfamily_lists;'
into UniqueKey? You said "This is needed to create ECs in the parent
query if the upper relation represents a subquery." but I didn't get the
it. Since we need to maintain the UniqueKey in the many places, I'd like
to keep it as simple as possbile. Of course, anything essentical should
be added for sure.

* Set-returning functions (SRFs) can make a "distinct path" necessary even if
the join output is unique.

You are right at this point, I will fix it in the coming version.

* RelOptInfo.notnullattrs

My understanding is that this field contains the set attributes whose
uniqueness is guaranteed by the unique key. They are acceptable because they
are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
value makes the containing row filtered out, so the row cannot break
uniqueness of the output. Am I right?

If so, I suggest to change the field name to something less generic, maybe
'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
more checks are needed before particular attribute can actually be used in
UniqueKey.

I don't think so, UniqueKey is just one of the places to use this
not-null property, see 3af704098 for the another user case of it.

(Because of 3af704098, we should leverage notnullattnums somehow in this
patch, which will be included in the next version as well).

* add_uniquekey_for_uniqueindex()

I'd appreciate an explanation why the "single-row UK" is created. I think
the reason for unique_exprs==NIL is that a restriction clause VAR=CONST
exists for each column of the unique index. Whether I'm right or not, a
comment should state clearly what the reason is.

You are understanding it correctly. I will add comments in the next
version.

* uniquekey_useful_for_merging()

How does uniqueness relate to merge join? In README.uniquekey you seem to
point out that a single row is always sorted, but I don't think this
function is related to that fact. (Instead, I'd expect that pathkeys are
added to all paths for a single-row relation, but I'm not sure you do that
in the current version of the patch.)

The merging is for "mergejoinable join clauses", see function
eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";

* is_uniquekey_useful_afterjoin()

Now that my patch (0004) allows propagation of the unique keys from a
subquery to the upper query, I was wondering if the UniqueKey structure
needs the 'use_for_distinct field' I mean we should now propagate the unique
keys to the parent query whether the subquery has DISTINCT clause or not. I
noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
changed the function to always returned true. However nothing changed in the
output of regression tests (installcheck). Do you insist that the
'use_for_distinct' field is needed?

* uniquekey_contains_multinulls()

** Instead of calling the function when trying to use the UK, how about
checking the ECs when considering creation of the UK? If the tests fail,
just don't create the UK.

I don't think so since we maintain the UniqueKey from bottom to top, you
can double check if my reason is appropriate.

CREATE TABLE t1(a int);
CREATE INDEX ON t1(a);

SELECT distinct t1.a FROM t1 JOIN t2 using(a);

We need to create the UniqueKey on the baserel for t1 and the NULL
values is filtered out in the joinrel. so we have to creating it with
allowing NULL values first.

** What does the 'multi' word in the function name mean?

multi means multiple, I thought we use this short name in the many
places, for ex bt_multi_page_stats after a quick search.

* relation_is_distinct_for()

The function name is too similar to rel_is_distinct_for(). I think the name
should indicate that you are checking the relation against a set of
pathkeys. Maybe rel_is_distinct_for_pathkeys() (and remove 'distinct' from
the argument name)? At the same time, it might be good to rename
rel_is_distinct_for() to rel_is_distinct_for_clauses().

OK.

* uniquekey_contains_in()

Shouldn't this be uniquekey_contained_in()? And likewise, shouldn't the
comment be " ... if UniqueKey is contained in the list of EquivalenceClass"
?

OK.

(In general, even though I'm not English native speaker, I think I see quite
a few grammar issues, which often make reading of the comments/documentation

Your English is really good:)

* Combining the UKs

IMO this is the most problematic part of the patch. You call
populate_joinrel_uniquekeys() for the same join multiple times,

Why do you think so? The below code is called in "make_join_rel"

populate_joinrel_uniquekeys(root, joinrel, rel1, rel2, ...);

so it should be only called once per joinrel.

Is your original question is about populate_joinrel_uniquekey_for_rel
rather than populate_joinrel_uniquekeys? We have the below codes:

outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, outerrel,
innerrel, restrictlist);
inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, innerrel,
outerrel, restrictlist);

This is mainly because of the following theory. Quoted from
README.uniquekey. Let's called this as "rule 1".

"""
How to maintain the uniquekey?
-------------------------------
.. From the join relation, it is maintained with two rules:

- the uniquekey in one side is still unique if it can't be duplicated
after the join. for example:

SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk;
UniqueKey on t1: t1.pk
UniqueKey on t1 Join t2: t1.pk
"""

AND the blow codes:

if (outeruk_still_valid || inneruk_still_valid)

/*
* the uniquekey on outers or inners have been added into joinrel so
* the combined uniuqekey from both sides is not needed.
*/
return;

We don't create the component uniquekey if any one side of the boths
sides is unique already. For example:

"(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is
unique", there is no need to create component UniqueKey (t1.id, t2.id);

each time
with a different 'restrictlist', and you try to do two things at the same
time: 1) combine the UKs of the input relations into the UKs of the join
relation, 2) check if the join relation can be marked single-row.

I think that both 1) and 2) should be independent from join order, and thus
both computations should only take place once for given set of input
relations. And I think they should be done separately:

1) Compute the join UKs

As you admit in a comment in populate_joinrel_uniquekeys(), neither join
method nor clauses should matter. So I think you only need to pick the
"component UKs" (i.e. UKs of the input relations) which are usable above
that join (i.e. neither the join itself nor any join below sets any column
of the UK to NULL) and combine them.

We need to do this only after the "if (!outeruk_still_valid &&
!inneruk_still_valid)" check, as explained above.

Of course one problem is that the number of combinations can grow
exponentially as new relations are joined.

Yes, that's why "rule 1" needed and "How to reduce the overhead" in
UniqueKey.README is introduced.

2) Check if the join relation is single-row

I in order to get rid of the dependency on 'restrictlist', I think you can
use ECs. Consider a query from your regression tests:

CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int, d int, e int);

SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1;

The idea here seems to be that no more than one row of t1 matches the query
clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that
no more than one row of t2 matches the query (because t1 cannot provide the
clause with more than one input value of 'e'). And therefore, the join also
produces at most one row.

You are correct and IMO my current code are able to tell it is a single
row as well.

1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a
consequence.
2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept
after the join because of rule 1 on joinrel. and t1 is singlerow, so the
joinrel is singlerow as well.

I'm interested with "get rid of the dependency on 'restrictlist', I
think you can use ECs.", let's see what we can improve.

My theory is that relation is single-row if it has an UK such that each of
its ECs meets at least one of the following conditions:

a) contains a constant

True.

b) contains a column of a relation which has already been proven single-row.

True, not sure if it is easy to tell.

b) is referenced by an UK of a relation which has already been proven
single-row.

I can't follow here...

I think that in the example above, an EC {t1.e, t2.id} should exist. So when
checking whether 't2' is single-row, the condition b) cam be ised: the UK of
't2' should reference the EC {t1.e, t2.id}, which in turn contains the
column t1.e. And 't1' is unique because its EC meets the condition a). (Of
course you need to check em_jdomain before you use particular EM.)

I think the existing rule 1 for joinrel works well with the singlerow
case naturally, what can be improved if we add the theory you suggested
here?

--
Best Regards
Andy Fan

#12

Antonin Houska

ah@cybertec.at

about 2 years ago

In reply to: Andy Fan (#11)

Re: UniqueKey v2

Andy Fan <zhihuifan1213@163.com> wrote:

* I think that, before EC is considered suitable for an UK, its ec_opfamilies
field needs to be checked. I try to do that in
find_ec_position_matching_expr(), see 0004.

Could you make the reason clearer for adding 'List *opfamily_lists;'
into UniqueKey? You said "This is needed to create ECs in the parent
query if the upper relation represents a subquery." but I didn't get the
it. Since we need to maintain the UniqueKey in the many places, I'd like
to keep it as simple as possbile. Of course, anything essentical should
be added for sure.

If unique keys are generated for a subquery output, they also need to be
created for the corresponding relation in the upper query ("sub" in the
following example):

select * from tab1 left join (select * from tab2) sub;

However, to create an unique key for "sub", you need an EC for each expression
of the key. And to create an EC, you in turn need the list of operator
families.

Even if the parent query already had ECs for the columns of "sub" which are
contained in the unique key, you need to make sure that those ECs are
"compatible" with the ECs of the subquery which generated the unique key. That
is, if an EC of the subquery considers certain input values equal, the EC of
the parent query must also be able to determine if they are equal or not.

* RelOptInfo.notnullattrs

My understanding is that this field contains the set attributes whose
uniqueness is guaranteed by the unique key. They are acceptable because they
are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
value makes the containing row filtered out, so the row cannot break
uniqueness of the output. Am I right?

If so, I suggest to change the field name to something less generic, maybe
'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
more checks are needed before particular attribute can actually be used in
UniqueKey.

I don't think so, UniqueKey is just one of the places to use this
not-null property, see 3af704098 for the another user case of it.

(Because of 3af704098, we should leverage notnullattnums somehow in this
patch, which will be included in the next version as well).

In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that
does not happen to 'notnullattnums' in the current master branch. Thus I think
that 'notnullattrs' is specific to the unique keys feature, so the field name
should be less generic.

* uniquekey_useful_for_merging()

How does uniqueness relate to merge join? In README.uniquekey you seem to
point out that a single row is always sorted, but I don't think this
function is related to that fact. (Instead, I'd expect that pathkeys are
added to all paths for a single-row relation, but I'm not sure you do that
in the current version of the patch.)

The merging is for "mergejoinable join clauses", see function
eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";

My question is: why is the uniqueness important specifically to merge join? I
understand that join evaluation can be more efficient if we know that one
input relation is unique (i.e. we only scan that relation until we find the
first match), but this is not specific to merge join.

* is_uniquekey_useful_afterjoin()

Now that my patch (0004) allows propagation of the unique keys from a
subquery to the upper query, I was wondering if the UniqueKey structure
needs the 'use_for_distinct field' I mean we should now propagate the unique
keys to the parent query whether the subquery has DISTINCT clause or not. I
noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
changed the function to always returned true. However nothing changed in the
output of regression tests (installcheck). Do you insist that the
'use_for_distinct' field is needed?

I miss your answer to this comment.

* uniquekey_contains_multinulls()

** Instead of calling the function when trying to use the UK, how about
checking the ECs when considering creation of the UK? If the tests fail,
just don't create the UK.

I don't think so since we maintain the UniqueKey from bottom to top, you
can double check if my reason is appropriate.

CREATE TABLE t1(a int);
CREATE INDEX ON t1(a);

SELECT distinct t1.a FROM t1 JOIN t2 using(a);

We need to create the UniqueKey on the baserel for t1 and the NULL
values is filtered out in the joinrel. so we have to creating it with
allowing NULL values first.

** What does the 'multi' word in the function name mean?

multi means multiple, I thought we use this short name in the many
places, for ex bt_multi_page_stats after a quick search.

Why not simply uniquekey_contains_nulls() ?

Actually I wouldn't say that an instance of UniqueKey contains any value (NULL
or NOT NULL) because it describes the whole relation rather than particular
row. I consider UniqueKey to be a set of expressions. How about
uniquekey_expression_nullable() ?

* Combining the UKs

IMO this is the most problematic part of the patch. You call
populate_joinrel_uniquekeys() for the same join multiple times,

Why do you think so? The below code is called in "make_join_rel"

Consider join of tables "a", "b" and "c". My understanding is that
make_join_rel() is called once with rel1={a} and rel2={b join c}, then with
rel1={a join b} and rel2={c}, etc. I wanted to say that each call should
produce the same set of unique keys.

I need to check this part more in detail.

Is your original question is about populate_joinrel_uniquekey_for_rel
rather than populate_joinrel_uniquekeys? We have the below codes:

outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, outerrel,
innerrel, restrictlist);
inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, innerrel,
outerrel, restrictlist);

This is mainly because of the following theory. Quoted from
README.uniquekey. Let's called this as "rule 1".

"""
How to maintain the uniquekey?
-------------------------------
.. From the join relation, it is maintained with two rules:

- the uniquekey in one side is still unique if it can't be duplicated
after the join. for example:

SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk;
UniqueKey on t1: t1.pk
UniqueKey on t1 Join t2: t1.pk
"""

AND the blow codes:

if (outeruk_still_valid || inneruk_still_valid)

/*
* the uniquekey on outers or inners have been added into joinrel so
* the combined uniuqekey from both sides is not needed.
*/
return;

We don't create the component uniquekey if any one side of the boths
sides is unique already. For example:

"(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is
unique", there is no need to create component UniqueKey (t1.id, t2.id);

ok, I need to check more in detail how this part works.

Of course one problem is that the number of combinations can grow
exponentially as new relations are joined.

Yes, that's why "rule 1" needed and "How to reduce the overhead" in
UniqueKey.README is introduced.

What if we are interested in unique keys of a subquery, but the subquery has
no DISTINCT clause?

2) Check if the join relation is single-row

I in order to get rid of the dependency on 'restrictlist', I think you can
use ECs. Consider a query from your regression tests:

CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int, d int, e int);

SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1;

The idea here seems to be that no more than one row of t1 matches the query
clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that
no more than one row of t2 matches the query (because t1 cannot provide the
clause with more than one input value of 'e'). And therefore, the join also
produces at most one row.

You are correct and IMO my current code are able to tell it is a single
row as well.

1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a
consequence.
2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept
after the join because of rule 1 on joinrel. and t1 is singlerow, so the
joinrel is singlerow as well.

I'm interested with "get rid of the dependency on 'restrictlist', I
think you can use ECs.", let's see what we can improve.

My theory is that relation is single-row if it has an UK such that each of
its ECs meets at least one of the following conditions:

a) contains a constant

True.

b) contains a column of a relation which has already been proven single-row.

True, not sure if it is easy to tell.

b) is referenced by an UK of a relation which has already been proven
single-row.

I can't follow here...

This is similar to EC containing a constant: if an EC is used by a single-row
UK, all its member can only have a single value.

I think that in the example above, an EC {t1.e, t2.id} should exist. So when
checking whether 't2' is single-row, the condition b) cam be used: the UK of
't2' should reference the EC {t1.e, t2.id}, which in turn contains the
column t1.e. And 't1' is unique because its EC meets the condition a). (Of
course you need to check em_jdomain before you use particular EM.)

I think the existing rule 1 for joinrel works well with the singlerow
case naturally, what can be improved if we add the theory you suggested
here?

This is still the explanation of the idea how to mark join unique key as a
single-row separately from the other logic. As noted above, I need to learn
more about the unique keys of a join.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#13

Antonin Houska

ah@cybertec.at

about 2 years ago

In reply to: Antonin Houska (#12)

Re: UniqueKey v2

Antonin Houska <ah@cybertec.at> wrote:

Andy Fan <zhihuifan1213@163.com> wrote:

* Combining the UKs

IMO this is the most problematic part of the patch. You call
populate_joinrel_uniquekeys() for the same join multiple times,

Why do you think so? The below code is called in "make_join_rel"

Consider join of tables "a", "b" and "c". My understanding is that
make_join_rel() is called once with rel1={a} and rel2={b join c}, then with
rel1={a join b} and rel2={c}, etc. I wanted to say that each call should
produce the same set of unique keys.

I need to check this part more in detail.

I think I understand now. By calling populate_joinrel_uniquekeys() for various
orderings, you can find out that various input relation unique keys can
represent the whole join. For example, if the ordering is

A JOIN (B JOIN C)

you can prove that the unique keys of A can be used for the whole join, while
for the ordering

B JOIN (A JOIN C)

you can prove the same for the unique keys of B, and so on.

Is your original question is about populate_joinrel_uniquekey_for_rel
rather than populate_joinrel_uniquekeys? We have the below codes:

outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, outerrel,
innerrel, restrictlist);
inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, innerrel,
outerrel, restrictlist);

This is mainly because of the following theory. Quoted from
README.uniquekey. Let's called this as "rule 1".

"""
How to maintain the uniquekey?
-------------------------------
.. From the join relation, it is maintained with two rules:

- the uniquekey in one side is still unique if it can't be duplicated
after the join. for example:

SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk;
UniqueKey on t1: t1.pk
UniqueKey on t1 Join t2: t1.pk
"""

AND the blow codes:

if (outeruk_still_valid || inneruk_still_valid)

/*
* the uniquekey on outers or inners have been added into joinrel so
* the combined uniuqekey from both sides is not needed.
*/
return;

We don't create the component uniquekey if any one side of the boths
sides is unique already. For example:

"(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is
unique", there is no need to create component UniqueKey (t1.id, t2.id);

ok, I need to check more in detail how this part works.

This optimization makes sense to me.

Of course one problem is that the number of combinations can grow
exponentially as new relations are joined.

Yes, that's why "rule 1" needed and "How to reduce the overhead" in
UniqueKey.README is introduced.

I think there should yet be some guarantee that the number of unique keys does
not grow exponentially. Perhaps a constant that allows a relation (base or
join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3
or 4.) And when picking the "best N keys", one criterion could be the number
of expressions in the key (the shorter key the better).

2) Check if the join relation is single-row

I in order to get rid of the dependency on 'restrictlist', I think you can
use ECs. Consider a query from your regression tests:

CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int, d int, e int);

SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1;

The idea here seems to be that no more than one row of t1 matches the query
clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that
no more than one row of t2 matches the query (because t1 cannot provide the
clause with more than one input value of 'e'). And therefore, the join also
produces at most one row.

You are correct and IMO my current code are able to tell it is a single
row as well.

1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a
consequence.
2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept
after the join because of rule 1 on joinrel. and t1 is singlerow, so the
joinrel is singlerow as well.

ok, I think I understand now.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

#14

Andy Fan

zhihui.fan1213@gmail.com

about 2 years ago

In reply to: Antonin Houska (#12)

Re: UniqueKey v2

Antonin Houska <ah@cybertec.at> writes:

Could you make the reason clearer for adding 'List *opfamily_lists;'
into UniqueKey? You said "This is needed to create ECs in the parent
query if the upper relation represents a subquery." but I didn't get the
it. Since we need to maintain the UniqueKey in the many places, I'd like
to keep it as simple as possbile. Of course, anything essentical should
be added for sure.

If unique keys are generated for a subquery output, they also need to be
created for the corresponding relation in the upper query ("sub" in the
following example):

OK.

select * from tab1 left join (select * from tab2) sub;

However, to create an unique key for "sub", you need an EC for each expression
of the key.

OK.

And to create an EC, you in turn need the list of operator
families.

I'm thinking if we need to "create" any EC. Can you find out a user case
where the outer EC is missed and the UniqueKey is still interesting? I
don't have an example now.

convert_subquery_pathkeys has a similar sistuation and has the following
codes:

outer_ec =
get_eclass_for_sort_expr(root,
(Expr *) outer_var,
sub_eclass->ec_opfamilies,
sub_member->em_datatype,
sub_eclass->ec_collation,
0,
rel->relids,
NULL,
false);

/*
* If we don't find a matching EC, sub-pathkey isn't
* interesting to the outer query
*/
if (outer_ec)
best_pathkey =
make_canonical_pathkey(root,
outer_ec,
sub_pathkey->pk_opfamily,
sub_pathkey->pk_strategy,
sub_pathkey->pk_nulls_first);
}

Even if the parent query already had ECs for the columns of "sub" which are
contained in the unique key, you need to make sure that those ECs are
"compatible" with the ECs of the subquery which generated the unique key. That
is, if an EC of the subquery considers certain input values equal, the EC of
the parent query must also be able to determine if they are equal or not.

* RelOptInfo.notnullattrs

My understanding is that this field contains the set attributes whose
uniqueness is guaranteed by the unique key. They are acceptable because they
are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
value makes the containing row filtered out, so the row cannot break
uniqueness of the output. Am I right?

If so, I suggest to change the field name to something less generic, maybe
'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
more checks are needed before particular attribute can actually be used in
UniqueKey.

I don't think so, UniqueKey is just one of the places to use this
not-null property, see 3af704098 for the another user case of it.

(Because of 3af704098, we should leverage notnullattnums somehow in this
patch, which will be included in the next version as well).

In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that
does not happen to 'notnullattnums' in the current master branch. Thus I think
that 'notnullattrs' is specific to the unique keys feature, so the field name
should be less generic.

OK.

* uniquekey_useful_for_merging()

How does uniqueness relate to merge join? In README.uniquekey you seem to
point out that a single row is always sorted, but I don't think this
function is related to that fact. (Instead, I'd expect that pathkeys are
added to all paths for a single-row relation, but I'm not sure you do that
in the current version of the patch.)

The merging is for "mergejoinable join clauses", see function
eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";

My question is: why is the uniqueness important specifically to merge join? I
understand that join evaluation can be more efficient if we know that one
input relation is unique (i.e. we only scan that relation until we find the
first match), but this is not specific to merge join.

So the answer is the "merging" in uniquekey_useful_for_merging() has
nothing with merge join.

* is_uniquekey_useful_afterjoin()

Now that my patch (0004) allows propagation of the unique keys from a
subquery to the upper query, I was wondering if the UniqueKey structure
needs the 'use_for_distinct field' I mean we should now propagate the unique
keys to the parent query whether the subquery has DISTINCT clause or not. I
noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
changed the function to always returned true. However nothing changed in the
output of regression tests (installcheck). Do you insist that the
'use_for_distinct' field is needed?

I miss your answer to this comment.

After we considers the uniquekey from subquery, 'use_for_distinct' field
is not needed.

** What does the 'multi' word in the function name mean?

multi means multiple, I thought we use this short name in the many
places, for ex bt_multi_page_stats after a quick search.

Why not simply uniquekey_contains_nulls() ?

Actually I wouldn't say that an instance of UniqueKey contains any value (NULL
or NOT NULL) because it describes the whole relation rather than particular
row. I consider UniqueKey to be a set of expressions. How about
uniquekey_expression_nullable() ?

uniquekey_expression_nullable() is a better name, I will use it in the
next version.

IIUC, we have reached to the agreement based on your latest response for
the most of the questions. Please point me if I missed anything.

Of course one problem is that the number of combinations can grow
exponentially as new relations are joined.

Yes, that's why "rule 1" needed and "How to reduce the overhead" in
UniqueKey.README is introduced.

What if we are interested in unique keys of a subquery, but the subquery has
no DISTINCT clause?

I agree we should remove the prerequisite of "use_for_distinct".

--
Best Regards
Andy Fan

#15

Andy Fan

zhihui.fan1213@gmail.com

about 2 years ago

In reply to: Antonin Houska (#13)

Re: UniqueKey v2

Consider join of tables "a", "b" and "c". My understanding is that
make_join_rel() is called once with rel1={a} and rel2={b join c}, then with
rel1={a join b} and rel2={c}, etc. I wanted to say that each call should
produce the same set of unique keys.

I need to check this part more in detail.

I think I understand now. By calling populate_joinrel_uniquekeys() for various
orderings, you can find out that various input relation unique keys can
represent the whole join. For example, if the ordering is

A JOIN (B JOIN C)

you can prove that the unique keys of A can be used for the whole join, while
for the ordering

B JOIN (A JOIN C)

you can prove the same for the unique keys of B, and so on.

Yes.

We don't create the component uniquekey if any one side of the boths
sides is unique already. For example:

"(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is
unique", there is no need to create component UniqueKey (t1.id, t2.id);

ok, I need to check more in detail how this part works.

This optimization makes sense to me.

OK.

Of course one problem is that the number of combinations can grow
exponentially as new relations are joined.

Yes, that's why "rule 1" needed and "How to reduce the overhead" in
UniqueKey.README is introduced.

I think there should yet be some guarantee that the number of unique keys does
not grow exponentially. Perhaps a constant that allows a relation (base or
join) to have at most N unique keys. (I imagine N to be rather small, e.g. 3
or 4.) And when picking the "best N keys", one criterion could be the number
of expressions in the key (the shorter key the better).

I don't want to introduce this complextity right now. I'm more
inerested with how to work with them effectivity. main effort includes:

- the design of bitmapset which is memory usage friendly and easy for
combinations.
- Optimize the singlerow cases to reduce N UnqiueKeys to 1 UniqueKey.

I hope we can pay more attention to this optimization (at most N
UniqueKeys) when the major inforastruce has been done.

You are correct and IMO my current code are able to tell it is a single
row as well.

1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a
consequence.
2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept
after the join because of rule 1 on joinrel. and t1 is singlerow, so the
joinrel is singlerow as well.

ok, I think I understand now.

OK.

At last, this probably is my first non-trival patchs which has multiple
authors, I don't want myself is the bottleneck for the coorperation, so
if you need something to do done sooner, please don't hesitate to ask me
for it explicitly.

Here is my schedule about this. I can provide the next version based
our discussion and your patches at the eariler of next week. and update
the UniqueKey.README to make sure the overall design clearer. What I
hope you to pay more attention is the UniqueKey.README besides the
code. I hope the UniqueKey.README can reduce the effort for others to
understand the overall design enormously.

--
Best Regards
Andy Fan

#16

Antonin Houska

ah@cybertec.at

about 2 years ago

In reply to: Andy Fan (#14)

Re: UniqueKey v2

Andy Fan <zhihuifan1213@163.com> wrote:

Antonin Houska <ah@cybertec.at> writes:

Could you make the reason clearer for adding 'List *opfamily_lists;'
into UniqueKey? You said "This is needed to create ECs in the parent
query if the upper relation represents a subquery." but I didn't get the
it. Since we need to maintain the UniqueKey in the many places, I'd like
to keep it as simple as possbile. Of course, anything essentical should
be added for sure.

If unique keys are generated for a subquery output, they also need to be
created for the corresponding relation in the upper query ("sub" in the
following example):

OK.

select * from tab1 left join (select * from tab2) sub;

However, to create an unique key for "sub", you need an EC for each expression
of the key.

OK.

And to create an EC, you in turn need the list of operator
families.

I'm thinking if we need to "create" any EC. Can you find out a user case
where the outer EC is missed and the UniqueKey is still interesting? I
don't have an example now.

convert_subquery_pathkeys has a similar sistuation and has the following
codes:

outer_ec =
get_eclass_for_sort_expr(root,
(Expr *) outer_var,
sub_eclass->ec_opfamilies,
sub_member->em_datatype,
sub_eclass->ec_collation,
0,
rel->relids,
NULL,
false);

/*
* If we don't find a matching EC, sub-pathkey isn't
* interesting to the outer query
*/
if (outer_ec)
best_pathkey =
make_canonical_pathkey(root,
outer_ec,
sub_pathkey->pk_opfamily,
sub_pathkey->pk_strategy,
sub_pathkey->pk_nulls_first);
}

I think that convert_subquery_pathkeys() just does not try that hard to
achieve its goal.

The example where it's important to create the EC in the outer query is what I
added to the subselect.sql regression test in the 0004- diff in [1]/messages/by-id/7971.1713526758@antos:

create table tabx as select * from generate_series(1,100) idx;
create table taby as select * from generate_series(1,100) idy;
create unique index on taby using btree (idy);
create view view_barrier with (security_barrier=true) as select * from taby;
analyze tabx, taby;
explain (costs off, verbose on) select * from tabx x left join view_barrier y on idy = idx;

If you modify find_ec_position_matching_expr() to return -1 instead of
creating the EC, you will get this plan

Hash Left Join
Output: x.idx, taby.idy
Hash Cond: (x.idx = taby.idy)
-> Seq Scan on public.tabx x
Output: x.idx
-> Hash
Output: taby.idy
-> Seq Scan on public.taby
Output: taby.idy

instead of this

Hash Left Join
Output: x.idx, taby.idy
Inner Unique: true
Hash Cond: (x.idx = taby.idy)
-> Seq Scan on public.tabx x
Output: x.idx
-> Hash
Output: taby.idy
-> Seq Scan on public.taby
Output: taby.idy

* uniquekey_useful_for_merging()

How does uniqueness relate to merge join? In README.uniquekey you seem to
point out that a single row is always sorted, but I don't think this
function is related to that fact. (Instead, I'd expect that pathkeys are
added to all paths for a single-row relation, but I'm not sure you do that
in the current version of the patch.)

The merging is for "mergejoinable join clauses", see function
eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";

My question is: why is the uniqueness important specifically to merge join? I
understand that join evaluation can be more efficient if we know that one
input relation is unique (i.e. we only scan that relation until we find the
first match), but this is not specific to merge join.

So the answer is the "merging" in uniquekey_useful_for_merging() has
nothing with merge join.

I don't understand. The function comment does mention merge join:

/*
* uniquekey_useful_for_merging
* Check if the uniquekey is useful for mergejoins above the given relation.
*
* similar with pathkeys_useful_for_merging.
*/
static bool
uniquekey_useful_for_merging(PlannerInfo *root, UniqueKey * ukey, RelOptInfo *rel)

[1]: /messages/by-id/7971.1713526758@antos

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

UniqueKey v2

Attachments:

Attachments:

Attachments:

Attachments: