Recording foreign key relationships for the system catalogs

Started by Tom Laneabout 5 years ago14 messageshackers
Jump to latest
#1Tom Lane
tgl@sss.pgh.pa.us

Now that dfb75e478 is in, I looked into whether we can have some
machine-readable representation of the catalogs' foreign key
relationships. As per the previous discussion [1]/messages/by-id/dc5f44d9-5ec1-a596-0251-dadadcdede98@2ndquadrant.com, it's not
practical to have "real" SQL foreign key constraints, because
the semantics we use aren't quite right (i.e., using 0 instead
of NULL in rows with no reference). Nonetheless it would be
nice to have the knowledge available in some form, and we agreed
that a set-returning function returning the data would be useful.

The attached patch creates that function, and rewrites the oidjoins.sql
regression test to use it, in place of the very incomplete info that's
reverse-engineered by findoidjoins. It's mostly straightforward.

My original thought had been to add DECLARE_FOREIGN_KEY() macros
for all references, but I soon realized that in a large majority of
the cases, that's redundant with the BKI_LOOKUP() annotations we
already have. So I taught genbki.pl to extract FK data from
BKI_LOOKUP() as well as the explicit DECLARE macros. That didn't
remove the work entirely, because it turned out that we hadn't
bothered to apply BKI_LOOKUP() labels to most of the catalogs that
have no hand-made data. A big chunk of the patch consists in
adding those as needed. Also, I had to make the BKI_LOOKUP()
mechanism a little more complete, because it failed on pg_namespace
and pg_authid references. (It will still fail on some other
cases such as BKI_LOOKUP(pg_foreign_server), but I think there's
no need to fill that in until/unless we have some built-in data
that needs it.)

There are various loose ends yet to be cleaned up:

* I'm unsure whether it's better for the SRF to return the
column names as textual names, or as column numbers. Names was
a bit easier for all the parts of the current patch so I did
it that way, but maybe there's a case for the other way.
Actually the whole API for the SRF is just spur-of-the-moment,
so maybe a different API would be better.

* It would now be possible to remove the PGNSP and PGUID kluges
entirely in favor of plain BKI_LOOKUP references to pg_namespace
and pg_authid. The catalog header usage would get a little
more verbose: BKI_DEFAULT(PGNSP) becomes BKI_DEFAULT(pg_catalog)
and BKI_DEFAULT(PGUID) becomes BKI_DEFAULT(POSTGRES). I'm a bit
inclined to do it, simply to remove one bit of mechanism that has
to be documented; but it's material for a separate patch perhaps.

* src/tools/findoidjoins should be nuked entirely, AFAICS.
Again, that could be a follow-on patch.

* I've not touched the SGML docs. Certainly
pg_get_catalog_foreign_keys() should be documented, and some
adjustments in bki.sgml might be appropriate.

regards, tom lane

[1]: /messages/by-id/dc5f44d9-5ec1-a596-0251-dadadcdede98@2ndquadrant.com

Attachments:

add-catalog-foreign-key-info-1.patchtext/x-diff; charset=us-ascii; name=add-catalog-foreign-key-info-1.patchDownload+732-2146
#2Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#1)
Re: Recording foreign key relationships for the system catalogs

Very nice. Thanks to this patch, I can get rid of my own parse-catalogs.pl hack and use pg_get_catalog_foreign_keys() instead.

+1

I can with high confidence assert the correctness of pg_get_catalog_foreign_keys()'s output,
as it matches the lookup tables for the tool I'm hacking on precisely:

--
-- verify single column foreign keys
--
WITH
a AS (
SELECT
fktable::text,
fkcols[1]::text,
pktable::text,
pkcols[1]::text
FROM pg_get_catalog_foreign_keys()
WHERE cardinality(fkcols) = 1
),
b AS (
SELECT
table_name,
column_name,
ref_table_name,
ref_column_name
FROM pit.oid_joins
)
SELECT
(SELECT COUNT(*) FROM (SELECT * FROM a EXCEPT SELECT * FROM b) AS x) AS except_b,
(SELECT COUNT(*) FROM (SELECT * FROM b EXCEPT SELECT * FROM a) AS x) AS except_a,
(SELECT COUNT(*) FROM (SELECT * FROM b INTERSECT SELECT * FROM a) AS x) AS a_intersect_b
;

except_b | except_a | a_intersect_b
----------+----------+---------------
0 | 0 | 209
(1 row)

--
-- verify multi-column foreign keys
--
WITH
a AS (
SELECT
fktable::text,
fkcols,
pktable::text,
pkcols
FROM pg_get_catalog_foreign_keys()
WHERE cardinality(fkcols) > 1
),
b AS (
SELECT
table_name,
ARRAY[rel_column_name,attnum_column_name],
'pg_attribute',
'{attrelid,attnum}'::text[]
FROM pit.attnum_joins
)
SELECT
(SELECT COUNT(*) FROM (SELECT * FROM a EXCEPT SELECT * FROM b) AS x) AS except_b,
(SELECT COUNT(*) FROM (SELECT * FROM b EXCEPT SELECT * FROM a) AS x) AS except_a,
(SELECT COUNT(*) FROM (SELECT * FROM b INTERSECT SELECT * FROM a) AS x) AS a_intersect_b
;

except_b | except_a | a_intersect_b
----------+----------+---------------
0 | 0 | 8
(1 row)

/Joel

On Sun, Jan 31, 2021, at 22:39, Tom Lane wrote:

Now that dfb75e478 is in, I looked into whether we can have some
machine-readable representation of the catalogs' foreign key
relationships. As per the previous discussion [1], it's not
practical to have "real" SQL foreign key constraints, because
the semantics we use aren't quite right (i.e., using 0 instead
of NULL in rows with no reference). Nonetheless it would be
nice to have the knowledge available in some form, and we agreed
that a set-returning function returning the data would be useful.

The attached patch creates that function, and rewrites the oidjoins.sql
regression test to use it, in place of the very incomplete info that's
reverse-engineered by findoidjoins. It's mostly straightforward.

My original thought had been to add DECLARE_FOREIGN_KEY() macros
for all references, but I soon realized that in a large majority of
the cases, that's redundant with the BKI_LOOKUP() annotations we
already have. So I taught genbki.pl to extract FK data from
BKI_LOOKUP() as well as the explicit DECLARE macros. That didn't
remove the work entirely, because it turned out that we hadn't
bothered to apply BKI_LOOKUP() labels to most of the catalogs that
have no hand-made data. A big chunk of the patch consists in
adding those as needed. Also, I had to make the BKI_LOOKUP()
mechanism a little more complete, because it failed on pg_namespace
and pg_authid references. (It will still fail on some other
cases such as BKI_LOOKUP(pg_foreign_server), but I think there's
no need to fill that in until/unless we have some built-in data
that needs it.)

There are various loose ends yet to be cleaned up:

* I'm unsure whether it's better for the SRF to return the
column names as textual names, or as column numbers. Names was
a bit easier for all the parts of the current patch so I did
it that way, but maybe there's a case for the other way.
Actually the whole API for the SRF is just spur-of-the-moment,
so maybe a different API would be better.

* It would now be possible to remove the PGNSP and PGUID kluges
entirely in favor of plain BKI_LOOKUP references to pg_namespace
and pg_authid. The catalog header usage would get a little
more verbose: BKI_DEFAULT(PGNSP) becomes BKI_DEFAULT(pg_catalog)
and BKI_DEFAULT(PGUID) becomes BKI_DEFAULT(POSTGRES). I'm a bit
inclined to do it, simply to remove one bit of mechanism that has
to be documented; but it's material for a separate patch perhaps.

* src/tools/findoidjoins should be nuked entirely, AFAICS.
Again, that could be a follow-on patch.

* I've not touched the SGML docs. Certainly
pg_get_catalog_foreign_keys() should be documented, and some
adjustments in bki.sgml might be appropriate.

regards, tom lane

[1] /messages/by-id/dc5f44d9-5ec1-a596-0251-dadadcdede98@2ndquadrant.com

*Attachments:*
* add-catalog-foreign-key-info-1.patch

Kind regards,

Joel

#3Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#1)
Re: Recording foreign key relationships for the system catalogs

Could it be an idea to also add

OUT can_be_zero boolean

to pg_get_catalog_foreign_keys()'s out parameters?

This information is useful to know if one should be doing an INNER JOIN or a LEFT JOIN on the foreign keys.

The information is mostly available in the documentation already,
but not quite accurate, which I've proposed a patch [1]/messages/by-id/4ed9a372-7bf9-479a-926c-ae8e774717a8@www.fastmail.com to fix.

[1]: /messages/by-id/4ed9a372-7bf9-479a-926c-ae8e774717a8@www.fastmail.com

#4Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#1)
Re: Recording foreign key relationships for the system catalogs

Hi again,

After trying to use pg_get_catalog_foreign_keys() to replace what I had before,
I notice one ambiguity which I think is a serious problem in the machine-readable context.

The is_array OUT parameter doesn't say which of the possibly many fkcols that is the array column.

One example:

fktable | fkcols | pktable | pkcols | is_array
----------------------+-----------------------+--------------+-------------------+----------
pg_constraint | {conrelid,conkey} | pg_attribute | {attrelid,attnum} | t

Is the array "conrelid" or is it "conkey"? As a human, I know it's "conkey", but for a machine to figure out, one would need to join information_schema.columns and check the data_type or something similar.

Suggestions on how to fix:

* Make is_array an boolean[], and let each element represent the is_array value for each fkcols element.

* Change interface to be more like information_schema, and add a "ordinal_position" column, and return each column on a separate row.

I think I prefer the latter since it's more information_schema-conformant, but any works.

/Joel

#5Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#4)
Re: Recording foreign key relationships for the system catalogs

On Mon, Feb 1, 2021, at 20:33, Joel Jacobson wrote:

Suggestions on how to fix:

* Make is_array an boolean[], and let each element represent the is_array value for each fkcols element.

* Change interface to be more like information_schema, and add a "ordinal_position" column, and return each column on a separate row.

I think I prefer the latter since it's more information_schema-conformant, but any works.

Ops. I see a problem with just adding a "ordinal_position", since then one would also need enumerate the "foreing key" constraints or give them names like "constraint_name" in information_schema.table_constraints (since there can be multiple foreign keys per fktable, so there would be multiple e.g. ordinal_position=1 per fktable).

With this into consideration, I think the easiest and cleanest solution is to make is_array a boolean[].

I like the usage of arrays, it makes it much easier to understand the output,
as it's the visually easy to see what groups of columns that references what group of columns.

/Joel

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#4)
Re: Recording foreign key relationships for the system catalogs

"Joel Jacobson" <joel@compiler.org> writes:

The is_array OUT parameter doesn't say which of the possibly many fkcols that is the array column.

Yeah, I didn't write the sgml docs yet, but the comments explain that
the array is always the last fkcol. Maybe someday that won't be
general enough, but we can cross that bridge when we come to it.

regards, tom lane

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#3)
Re: Recording foreign key relationships for the system catalogs

"Joel Jacobson" <joel@compiler.org> writes:

Could it be an idea to also add
OUT can_be_zero boolean
to pg_get_catalog_foreign_keys()'s out parameters?

I was initially feeling resistant to that idea, but warmed to it
once I realized that a majority of the FK referencing columns
actually should not contain zeroes. So we can get a useful
improvement in the strictness of the test coverage if we make this
distinction --- and we can enforce it in the initial catalog data,
too.

So here's a v2 that does that. In the interests of brevity,
I spelled the declaration macros that allow a zero as BKI_LOOKUP_OPT,
DECLARE_FOREIGN_KEY_OPT, etc; and thus the output column is
also is_opt. I'm not wedded to that term but I think we need
something pretty short.

This also moves the oidjoins regression test to run near the
end of the test suite. As I commented earlier, that test was
originally mainly meant to validate the handwritten initial
data; but nowadays it's hard to see what it would catch that
genbki.pl doesn't. So the usefulness is in looking at rows
that get added later, and therefore we ought to run it after
the regression tests have created stuff. I've tried here
to run it in parallel with event_triggers, which might be
foolish.

I also added some documentation. I feel like this might
be committable at this point.

regards, tom lane

Attachments:

add-catalog-foreign-key-info-2.patchtext/x-diff; charset=us-ascii; name=add-catalog-foreign-key-info-2.patchDownload+904-2221
#8Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#7)
Re: Recording foreign key relationships for the system catalogs

On Tue, Feb 2, 2021, at 04:27, Tom Lane wrote:

Attachments:
add-catalog-foreign-key-info-2.patch

Very nice.

I could only find one minor error,
found by running the regression-tests,
and then using the query below to compare "is_opt"
with my own "zero_values" in my tool
that derives its value from pg_catalog content.

--
-- Are there any observed oid columns with zero values
-- that are also marked as NOT is_opt by pg_get_catalog_foreign_keys()?
--
regression=# SELECT
table_name,
column_name
FROM pit.oid_columns
WHERE zero_values
INTERSECT
SELECT
fktable::text,
unnest(fkcols)
FROM pg_get_catalog_foreign_keys()
WHERE NOT is_opt;

Expected to return no rows but:

table_name | column_name
---------------+-------------
pg_constraint | confrelid
(1 row)

regression=# SELECT * FROM pg_get_catalog_foreign_keys() WHERE 'confrelid' = ANY(fkcols);
fktable | fkcols | pktable | pkcols | is_array | is_opt
---------------+---------------------+--------------+-------------------+----------+--------
pg_constraint | {confrelid} | pg_class | {oid} | f | t
pg_constraint | {confrelid,confkey} | pg_attribute | {attrelid,attnum} | t | f
(2 rows)

Reading the new documentation, I interpret "is_opt=false" to be a negation of

"the referencing column(s) are allowed to contain zeroes instead of a valid reference"

i.e. that none of the referencing columns (fkcols) are allowed to contain zeroes,
but since "confrelid" apparently can contain zeroes:

regression=# select * from pg_constraint where confrelid = 0 limit 1;
-[ RECORD 1 ]-+------------------
oid | 12111
conname | pg_proc_oid_index
connamespace | 11
contype | p
condeferrable | f
condeferred | f
convalidated | t
conrelid | 1255
contypid | 0
conindid | 2690
conparentid | 0
confrelid | 0
confupdtype |
confdeltype |
confmatchtype |
conislocal | t
coninhcount | 0
connoinherit | t
conkey | {1}
confkey |
confreftype |
conpfeqop |
conppeqop |
conffeqop |
conexclop |
conbin |

I therefore think is_opt should be changed to true for this row:
fktable | fkcols | pktable | pkcols | is_array | is_opt
---------------+---------------------+--------------+-------------------+----------+--------
pg_constraint | {confrelid,confkey} | pg_attribute | {attrelid,attnum} | t | f

If this is fixed, I also agree this is ready to be committed.

/Joel

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#8)
Re: Recording foreign key relationships for the system catalogs

"Joel Jacobson" <joel@compiler.org> writes:

I could only find one minor error,
found by running the regression-tests,
and then using the query below to compare "is_opt"
with my own "zero_values" in my tool
that derives its value from pg_catalog content.
...
I therefore think is_opt should be changed to true for this row:
fktable | fkcols | pktable | pkcols | is_array | is_opt
---------------+---------------------+--------------+-------------------+----------+--------
pg_constraint | {confrelid,confkey} | pg_attribute | {attrelid,attnum} | t | f

No, I think it's correct as-is (and this is one reason that I chose to
have two separate FK entries for cases like this). confrelid can be
zero in rows that are not FK constraints. However, such a row must
also have empty confkey. The above entry states that for each element
of confkey, the pair (confrelid,confkey[i]) must be nonzero and have
a match in pg_attribute. It creates no constraint if confkey is empty.

If this is fixed, I also agree this is ready to be committed.

Appreciate the review! Please confirm if you agree with above
analysis.

regards, tom lane

#10Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#9)
Re: Recording foreign key relationships for the system catalogs

On Tue, Feb 2, 2021, at 17:00, Tom Lane wrote:

No, I think it's correct as-is (and this is one reason that I chose to
have two separate FK entries for cases like this). confrelid can be
zero in rows that are not FK constraints. However, such a row must
also have empty confkey. The above entry states that for each element
of confkey, the pair (confrelid,confkey[i]) must be nonzero and have
a match in pg_attribute. It creates no constraint if confkey is empty.

Thanks for explaining, I get it now.

Appreciate the review! Please confirm if you agree with above
analysis.

Yes, I agree with the analysis.

/Joel

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joel Jacobson (#10)
Re: Recording foreign key relationships for the system catalogs

"Joel Jacobson" <joel@compiler.org> writes:

On Tue, Feb 2, 2021, at 17:00, Tom Lane wrote:

Appreciate the review! Please confirm if you agree with above
analysis.

Yes, I agree with the analysis.

Cool, thanks. I've pushed it now.

regards, tom lane

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#1)
Re: Recording foreign key relationships for the system catalogs

I wrote:

* It would now be possible to remove the PGNSP and PGUID kluges
entirely in favor of plain BKI_LOOKUP references to pg_namespace
and pg_authid. The catalog header usage would get a little
more verbose: BKI_DEFAULT(PGNSP) becomes BKI_DEFAULT(pg_catalog)
and BKI_DEFAULT(PGUID) becomes BKI_DEFAULT(POSTGRES). I'm a bit
inclined to do it, simply to remove one bit of mechanism that has
to be documented; but it's material for a separate patch perhaps.

Here's a patch for that part. I think this is probably a good
idea not only because it removes magic, but because now that we
have various predefined roles it's becoming more and more likely
that some of those will need to be cross-referenced in other
catalogs' initial data. With this change, nothing special
will be needed for that. Multiple built-in schemas also become
more feasible than they were.

regards, tom lane

Attachments:

retire-PGNSP-and-PGUID-symbols.patchtext/x-diff; charset=us-ascii; name=retire-PGNSP-and-PGUID-symbols.patchDownload+46-52
#13Joel Jacobson
joel@compiler.org
In reply to: Tom Lane (#6)
Re: Recording foreign key relationships for the system catalogs

On Mon, Feb 1, 2021, at 21:03, Tom Lane wrote:

"Joel Jacobson" <joel@compiler.org> writes:

The is_array OUT parameter doesn't say which of the possibly many fkcols that is the array column.

Yeah, I didn't write the sgml docs yet, but the comments explain that
the array is always the last fkcol. Maybe someday that won't be
general enough, but we can cross that bridge when we come to it.

I've now fully migrated to using pg_get_catalog_foreign_keys()
instead of my own lookup tables, and have some additional hands-on experiences
to share with you.

I struggle to come up with a clean way to make use of is_array,
without being forced to introduce some CASE logic to figure out
if the fkcol is an array or not.

The alternative to join information_schema.columns and check data_type='ARRAY' is almost simpler,
but that seems wrong, since we now have is_array, and using it should be simpler than
joining information_schema.columns.

The best approach I've come up with so far is the CASE logic below:

WITH
foreign_keys AS
(
SELECT
fktable::text AS table_name,
unnest(fkcols) AS column_name,
pktable::text AS ref_table_name,
unnest(pkcols) AS ref_column_name,
--
-- is_array refers to the last fkcols column
--
unnest
(
CASE cardinality(fkcols)
WHEN 1 THEN ARRAY[is_array]
WHEN 2 THEN ARRAY[FALSE,is_array]
END
) AS is_array
FROM pg_get_catalog_foreign_keys()
)

If is_array would instead have been an boolean[], the query could have been written:

WITH
foreign_keys AS
(
SELECT
fktable::text AS table_name,
unnest(fkcols) AS column_name,
pktable::text AS ref_table_name,
unnest(pkcols) AS ref_column_name,
unnest(is_array) AS is_array
FROM pg_get_catalog_foreign_keys()
)

Maybe this can be written in a simpler way already.

Otherwise I think it would be more natural to change both is_array and is_opt
to boolean[] with the same cardinality as fkcols and pkcols,
to allow unnest()ing of them as well.

This would also be a more future proof solution,
and wouldn't require a code change to code using pg_get_catalog_foreign_keys(),
if we would ever add more complex cases in the future.

But even without increased future complexity,
I think the example above demonstrates a problem already today.

Maybe there is a simpler way to achieve what I'm trying to do,
i.e. to figure out if a specific fkcol is an array or not,
using some other simpler clever trick than the CASE variant above?

/Joel

#14Joel Jacobson
joel@compiler.org
In reply to: Joel Jacobson (#13)
Re: Recording foreign key relationships for the system catalogs

On Wed, Feb 3, 2021, at 21:41, Joel Jacobson wrote:

Otherwise I think it would be more natural to change both is_array and is_opt
to boolean[] with the same cardinality as fkcols and pkcols,
to allow unnest()ing of them as well.

Another option would perhaps be to add a new
system view in src/backend/catalog/system_views.sql

I see there are other cases with a slightly more complex view
using a function with a similar name, such as
the pg_stat_activity using pg_stat_get_activity().

Similar to this, maybe we could add a pg_catalog_foreign_keys view
using the output from pg_get_catalog_foreign_keys():

Example usage:

SELECT * FROM pg_catalog_foreign_keys
WHERE fktable = 'pg_constraint'::regclass
AND pktable = 'pg_attribute'::regclass;

fkid | fktable | fkcol | pktable | pkcol | is_array | is_opt | ordinal_position
------+---------------+-----------+--------------+----------+----------+--------+------------------
48 | pg_constraint | conkey | pg_attribute | attnum | t | t | 1
48 | pg_constraint | conrelid | pg_attribute | attrelid | f | f | 2
49 | pg_constraint | confkey | pg_attribute | attnum | t | f | 1
49 | pg_constraint | confrelid | pg_attribute | attrelid | f | f | 2
(4 rows)

The point of this would be to avoid unnecessary increase of data model complexity,
which I agree is not needed, since we only need single booleans as of today,
but to provide a more information_schema-like system view,
i.e. with columns on separate rows, with ordinal_position.

Since we don't have any "constraint_name" for these,
we need to enumerate the fks first, to let ordinal_position
be the position within each such fkid.

Here is my proposal on how to implement:

CREATE VIEW pg_catalog_foreign_keys AS
WITH
enumerate_fks AS (
SELECT
*,
ROW_NUMBER() OVER () AS fkid
FROM pg_catalog.pg_get_catalog_foreign_keys()
),
unnest_cols AS (
SELECT
C.fkid,
C.fktable,
unnest(C.fkcols) AS fkcol,
C.pktable,
unnest(C.pkcols) AS pkcol,
unnest(
CASE cardinality(fkcols)
WHEN 1 THEN ARRAY[C.is_array]
WHEN 2 THEN ARRAY[FALSE,C.is_array]
END
) AS is_array,
unnest(
CASE cardinality(fkcols)
WHEN 1 THEN ARRAY[C.is_opt]
WHEN 2 THEN ARRAY[FALSE,C.is_opt]
END
) AS is_opt
FROM enumerate_fks AS C
)
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY U.fkid
ORDER BY U.fkcol, U.pkcol
) AS ordinal_position
FROM unnest_cols AS U;

I think both the pg_get_catalog_foreign_keys() function
and this view are useful in different ways,
so it's good to provide both.

Only providing pg_get_catalog_foreign_keys() will
arguably mean some users of the function will need to implement
something like the same as above on their own, if they need the is_array and is_opt
value for a specific fkcol.

/Joel