Duplicate unique key values in inheritance tables

Started by Richard Guoover 1 year ago4 messages
#1Richard Guo
guofenglinux@gmail.com

I came across a query that returned incorrect results and I traced it
down to being caused by duplicate unique key values in an inheritance
table. As a simple example, consider

create table p (a int primary key, b int);
create table c () inherits (p);

insert into p select 1, 1;
insert into c select 1, 2;

select a, b from p;
a | b
---+---
1 | 1
1 | 2
(2 rows)

explain (verbose, costs off)
select a, b from p group by a;
QUERY PLAN
--------------------------------------
HashAggregate
Output: p.a, p.b
Group Key: p.a
-> Append
-> Seq Scan on public.p p_1
Output: p_1.a, p_1.b
-> Seq Scan on public.c p_2
Output: p_2.a, p_2.b
(8 rows)

The parser considers 'p.b' functionally dependent on the group by
column 'p.a' because 'p.a' is identified as the primary key for table
'p'. However, this causes confusion for the executor when determining
which 'p.b' value should be returned for each group. In my case, I
observed that sorted and hashed aggregation produce different results
for the same query.

Reading the doc, it seems that this is a documented limitation of the
inheritance feature that we would have duplicate unique key values in
inheritance tables. Even adding a unique constraint to the children
does not prevent duplication compared to the parent.

As a workaround for this issue, I'm considering whether we can skip
checking functional dependency on primary keys for inheritance
parents, given that we cannot guarantee uniqueness on the keys in this
case. Maybe something like below.

@@ -1421,7 +1427,9 @@ check_ungrouped_columns_walker(Node *node,
        Assert(var->varno > 0 &&
               (int) var->varno <= list_length(context->pstate->p_rtable));
        rte = rt_fetch(var->varno, context->pstate->p_rtable);
-       if (rte->rtekind == RTE_RELATION)
+       if (rte->rtekind == RTE_RELATION &&
+           !(rte->relkind == RELKIND_RELATION &&
+             rte->inh && has_subclass(rte->relid)))
        {
            if (check_functional_grouping(rte->relid,

Any thoughts?

Thanks
Richard

#2David Rowley
dgrowleyml@gmail.com
In reply to: Richard Guo (#1)
Re: Duplicate unique key values in inheritance tables

On Tue, 16 Jul 2024 at 12:45, Richard Guo <guofenglinux@gmail.com> wrote:

As a workaround for this issue, I'm considering whether we can skip
checking functional dependency on primary keys for inheritance
parents, given that we cannot guarantee uniqueness on the keys in this
case. Maybe something like below.

@@ -1421,7 +1427,9 @@ check_ungrouped_columns_walker(Node *node,
Assert(var->varno > 0 &&
(int) var->varno <= list_length(context->pstate->p_rtable));
rte = rt_fetch(var->varno, context->pstate->p_rtable);
-       if (rte->rtekind == RTE_RELATION)
+       if (rte->rtekind == RTE_RELATION &&
+           !(rte->relkind == RELKIND_RELATION &&
+             rte->inh && has_subclass(rte->relid)))
{
if (check_functional_grouping(rte->relid,

Any thoughts?

The problem with doing that is that it might mean queries that used to
work no longer work. CREATE VIEW could also fail where it used to
work which could render pg_dumps unrestorable.

Because it's a parser issue, I don't think we can fix it the same way
as a5be4062f was fixed.

I don't have any ideas on what we can do about this right now, but
thought it was worth sharing the above.

David

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: David Rowley (#2)
Re: Duplicate unique key values in inheritance tables

On Monday, July 15, 2024, David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 16 Jul 2024 at 12:45, Richard Guo <guofenglinux@gmail.com> wrote:

As a workaround for this issue, I'm considering whether we can skip
checking functional dependency on primary keys for inheritance
parents, given that we cannot guarantee uniqueness on the keys in this
case.

Because it's a parser issue, I don't think we can fix it the same way
as a5be4062f was fixed.

I don't have any ideas on what we can do about this right now, but
thought it was worth sharing the above.

Add another note to caveats in the docs and call it a feature. We produce
a valid answer for the data model encountered. The non-determinism isn’t
wrong, it’s just a poorly written query/model with non-deterministic
results. Since v15 we have an any_value aggregate - we basically are
applying this to the dependent columns implicitly. A bit of revisionist
history but I’d rather do that than break said queries. Especially at
parse time; I’d be a bit more open to execution-time enforcement if
functional dependency on the id turns out to have actually been violated.
But people want, and in other products have, any_value implicit aggregation
in this situation so it’s hard to say it is wrong even if we otherwise take
the position that we will not accept it.

David J.

#4David Rowley
dgrowleyml@gmail.com
In reply to: David G. Johnston (#3)
Re: Duplicate unique key values in inheritance tables

On Tue, 16 Jul 2024 at 13:28, David G. Johnston
<david.g.johnston@gmail.com> wrote:

Add another note to caveats in the docs and call it a feature. We produce a valid answer for the data model encountered. The non-determinism isn’t wrong, it’s just a poorly written query/model with non-deterministic results. Since v15 we have an any_value aggregate - we basically are applying this to the dependent columns implicitly. A bit of revisionist history but I’d rather do that than break said queries. Especially at parse time; I’d be a bit more open to execution-time enforcement if functional dependency on the id turns out to have actually been violated. But people want, and in other products have, any_value implicit aggregation in this situation so it’s hard to say it is wrong even if we otherwise take the position that we will not accept it.

I think it might be best just to ignore it and do nothing. Maybe it
would be worth putting something into the docs about it if people from
userland come complaining about a bug as the doc mention might stop
them wasting their time reporting something we already know about.
Otherwise, I feel the docs would just draw attention to something that
I'd personally rather people didn't do. As you say, using any_value()
would be the way we'd encourage people to do it if they don't care
which value of the ungrouped column they want, so documenting
something else doesn't seem quite right to me.

David