interval_ops shall stop using btequalimage (deduplication)
The btequalimage() header comment says:
* Generic "equalimage" support function.
*
* B-Tree operator classes whose equality function could safely be replaced by
* datum_image_eq() in all cases can use this as their "equalimage" support
* function.
interval_ops, however, recognizes equal-but-distinguishable values:
create temp table t (c interval);
insert into t values ('1d'::interval), ('24h');
table t;
select distinct c from t;
The CREATE INDEX of the following test:
begin;
create table t (c interval);
insert into t select x from generate_series(1,500), (values ('1 year 1 month'::interval), ('1 year 30 days')) t(x);
select distinct c from t;
create index ti on t (c);
rollback;
Fails with:
2498151 2023-10-10 05:06:46.177 GMT DEBUG: building index "ti" on table "t" serially
2498151 2023-10-10 05:06:46.178 GMT DEBUG: index "ti" can safely use deduplication
TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"), File: "nbtutils.c", Line: 2443, PID: 2498151
I've also caught btree posting lists where one TID refers to a '1d' heap
tuple, while another TID refers to a '24h' heap tuple. amcheck complains.
Index-only scans can return the '1d' bits where the actual tuple had the '24h'
bits. Are there other consequences to highlight in the release notes? The
back-branch patch is larger, to fix things without initdb. Hence, I'm
attaching patches for HEAD and for v16 (trivial to merge back from there). I
glanced at the other opfamilies permitting deduplication, and they look okay:
[local] test=*# select amproc, amproclefttype = amprocrighttype as l_eq_r, array_agg(array[opfname, amproclefttype::regtype::text]) from pg_amproc join pg_opfamily f on amprocfamily = f.oid where amprocnum = 4 and opfmethod = 403 group by 1,2;
─[ RECORD 1 ]───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
amproc │ btequalimage
l_eq_r │ t
array_agg │ {{bit_ops,bit},{bool_ops,boolean},{bytea_ops,bytea},{char_ops,"\"char\""},{datetime_ops,date},{datetime_ops,"timestamp without time zone"},{datetime_ops,"timestamp with time zone"},{network_ops,inet},{integer_ops,smallint},{integer_ops,integer},{integer_ops,bigint},{interval_ops,interval},{macaddr_ops,macaddr},{oid_ops,oid},{oidvector_ops,oidvector},{time_ops,"time without time zone"},{timetz_ops,"time with time zone"},{varbit_ops,"bit varying"},{text_pattern_ops,text},{bpchar_pattern_ops,character},{money_ops,money},{tid_ops,tid},{uuid_ops,uuid},{pg_lsn_ops,pg_lsn},{macaddr8_ops,macaddr8},{enum_ops,anyenum},{xid8_ops,xid8}}
─[ RECORD 2 ]───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
amproc │ btvarstrequalimage
l_eq_r │ t
array_agg │ {{bpchar_ops,character},{text_ops,text},{text_ops,name}}
Thanks,
nm
Attachments:
interval-rescind-dedup-v1.patchtext/plain; charset=us-asciiDownload
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Dissociate btequalimage() from interval_ops, ending its deduplication.
Under interval_ops, some equal values are distinguishable. One such
pair is '24:00:00' and '1 day'. With that being so, btequalimage()
breaches the documented contract for the "equalimage" btree support
function. This can cause incorrect results from index-only scans.
Users should REINDEX any indexes having interval-type columns and for
which "pg_amcheck --heapallindexed" reports an error. This fix makes
interval_ops simply omit the support function, like numeric_ops does.
Back-pack to v13, where btequalimage() first appeared. In back
branches, for the benefit of old catalog content, btequalimage() code
will return false for type "interval". Going forward, back-branch
initdb will include the catalog change.
Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5b95012..4c70da4 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -172,8 +172,6 @@
{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '3',
amproc => 'in_range(interval,interval,interval,bool,bool)' },
-{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
- amprocrighttype => 'interval', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '1', amproc => 'macaddr_cmp' },
{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index a1bdf2c..7a6f36a 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2208,6 +2208,7 @@ ORDER BY 1, 2, 3;
| array_ops | array_ops | anyarray
| float_ops | float4_ops | real
| float_ops | float8_ops | double precision
+ | interval_ops | interval_ops | interval
| jsonb_ops | jsonb_ops | jsonb
| multirange_ops | multirange_ops | anymultirange
| numeric_ops | numeric_ops | numeric
@@ -2216,7 +2217,7 @@ ORDER BY 1, 2, 3;
| record_ops | record_ops | record
| tsquery_ops | tsquery_ops | tsquery
| tsvector_ops | tsvector_ops | tsvector
-(15 rows)
+(16 rows)
-- **************** pg_index ****************
-- Look for illegal values in pg_index fields.
interval-rescind-dedup-v1_16.patchtext/plain; charset=us-asciiDownload
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Dissociate btequalimage() from interval_ops, ending its deduplication.
Under interval_ops, some equal values are distinguishable. One such
pair is '24:00:00' and '1 day'. With that being so, btequalimage()
breaches the documented contract for the "equalimage" btree support
function. This can cause incorrect results from index-only scans.
Users should REINDEX any indexes having interval-type columns and for
which "pg_amcheck --heapallindexed" reports an error. This fix makes
interval_ops simply omit the support function, like numeric_ops does.
Back-pack to v13, where btequalimage() first appeared. In back
branches, for the benefit of old catalog content, btequalimage() code
will return false for type "interval". Going forward, back-branch
initdb will include the catalog change.
Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 9f06ee7..251dd23 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -43,6 +43,7 @@
#include "postgres.h"
#include "access/detoast.h"
+#include "catalog/pg_type_d.h"
#include "common/hashfn.h"
#include "fmgr.h"
#include "utils/builtins.h"
@@ -385,20 +386,17 @@ datum_image_hash(Datum value, bool typByVal, int typLen)
* datum_image_eq() in all cases can use this as their "equalimage" support
* function.
*
- * Currently, we unconditionally assume that any B-Tree operator class that
- * registers btequalimage as its support function 4 must be able to safely use
- * optimizations like deduplication (i.e. we return true unconditionally). If
- * it ever proved necessary to rescind support for an operator class, we could
- * do that in a targeted fashion by doing something with the opcintype
- * argument.
+ * Earlier minor releases erroneously associated this function with
+ * interval_ops. Detect that case to rescind deduplication support, without
+ * requiring initdb.
*-------------------------------------------------------------------------
*/
Datum
btequalimage(PG_FUNCTION_ARGS)
{
- /* Oid opcintype = PG_GETARG_OID(0); */
+ Oid opcintype = PG_GETARG_OID(0);
- PG_RETURN_BOOL(true);
+ PG_RETURN_BOOL(opcintype != INTERVALOID);
}
/*-------------------------------------------------------------------------
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5b95012..4c70da4 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -172,8 +172,6 @@
{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '3',
amproc => 'in_range(interval,interval,interval,bool,bool)' },
-{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
- amprocrighttype => 'interval', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '1', amproc => 'macaddr_cmp' },
{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index a1bdf2c..7a6f36a 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2208,6 +2208,7 @@ ORDER BY 1, 2, 3;
| array_ops | array_ops | anyarray
| float_ops | float4_ops | real
| float_ops | float8_ops | double precision
+ | interval_ops | interval_ops | interval
| jsonb_ops | jsonb_ops | jsonb
| multirange_ops | multirange_ops | anymultirange
| numeric_ops | numeric_ops | numeric
@@ -2216,7 +2217,7 @@ ORDER BY 1, 2, 3;
| record_ops | record_ops | record
| tsquery_ops | tsquery_ops | tsquery
| tsvector_ops | tsvector_ops | tsvector
-(15 rows)
+(16 rows)
-- **************** pg_index ****************
-- Look for illegal values in pg_index fields.
On Tue, Oct 10, 2023 at 6:33 PM Noah Misch <noah@leadboat.com> wrote:
interval_ops, however, recognizes equal-but-distinguishable values:
Fails with:
2498151 2023-10-10 05:06:46.177 GMT DEBUG: building index "ti" on table "t" serially
2498151 2023-10-10 05:06:46.178 GMT DEBUG: index "ti" can safely use deduplication
TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"), File: "nbtutils.c", Line: 2443, PID: 2498151
Nice catch.
Out of curiosity, how did you figure this out? Did it just occur to
you that interval_ops had a behavior that made deduplication unsafe?
Or did the problem come to your attention after running amcheck on a
customer database? Or was it something else?
I've also caught btree posting lists where one TID refers to a '1d' heap
tuple, while another TID refers to a '24h' heap tuple. amcheck complains.
Index-only scans can return the '1d' bits where the actual tuple had the '24h'
bits. Are there other consequences to highlight in the release notes?
Nothing else comes to mind right now. I should think about posting
list splits some more tomorrow, though -- those are tricky.
The back-branch patch is larger, to fix things without initdb. Hence, I'm
attaching patches for HEAD and for v16 (trivial to merge back from there).
I glanced at the other opfamilies permitting deduplication, and they look okay:
Due to the way that nbtsplitloc.c deals with duplicates, I'd expect
the same assertion failure with any index where a single leaf page is
filled with opclass-wise duplicates with more than one distinct
representation/output -- the details beyond that shouldn't matter. I
was happy with how easy it was to make this assertion fail (with a
known broken numeric_ops opclass) while testing/developing
deduplication. I'm a little surprised that it took this long to notice
the interval_ops issue.
Do we really need to change the catalog contents when backpatching?
--
Peter Geoghegan
On Tue, Oct 10, 2023 at 08:12:36PM -0700, Peter Geoghegan wrote:
On Tue, Oct 10, 2023 at 6:33 PM Noah Misch <noah@leadboat.com> wrote:
interval_ops, however, recognizes equal-but-distinguishable values:
Fails with:
2498151 2023-10-10 05:06:46.177 GMT DEBUG: building index "ti" on table "t" serially
2498151 2023-10-10 05:06:46.178 GMT DEBUG: index "ti" can safely use deduplication
TRAP: failed Assert("!itup_key->allequalimage || keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright)"), File: "nbtutils.c", Line: 2443, PID: 2498151Nice catch.
Out of curiosity, how did you figure this out? Did it just occur to
you that interval_ops had a behavior that made deduplication unsafe?
Or did the problem come to your attention after running amcheck on a
customer database? Or was it something else?
My friend got an amcheck "lacks matching index tuple" failure, and they asked
me about it. I ran into the assertion failure while reproducing things.
I'm a little surprised that it took this long to notice
the interval_ops issue.
Agreed. I don't usually store interval values in tables, and I'm not sure
I've ever indexed one. Who knows.
Do we really need to change the catalog contents when backpatching?
Not really. I think we usually do. On the other hand, unlike some past
cases, there's no functional need for the catalog changes. The catalog
changes just get a bit of efficiency. No strong preference here.
On Tue, Oct 10, 2023 at 8:29 PM Noah Misch <noah@leadboat.com> wrote:
My friend got an amcheck "lacks matching index tuple" failure, and they asked
me about it. I ran into the assertion failure while reproducing things.
Reminds me of the time that amcheck found a bug in the default btree
opclass for PostGIS's geography type.
The opclass wasn't really intended to be used for indexing. The issue
with the opclass (which violated transitive consistency) would
probably never have been detected were it not for the tendency of
PostGIS users to accidentally create useless B-Tree indexes on
geography columns. Users sometimes omitted "using gist", without
necessarily noticing that the index was practically useless.
Do we really need to change the catalog contents when backpatching?
Not really. I think we usually do. On the other hand, unlike some past
cases, there's no functional need for the catalog changes. The catalog
changes just get a bit of efficiency. No strong preference here.
I'll defer to you on this question, then.
I don't see any reason to delay committing your fix. The issue that
you've highlighted is exactly the kind of issue that I anticipated
might happen at some point. This seems straightforward.
--
Peter Geoghegan
On Tue, Oct 10, 2023 at 8:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
I don't see any reason to delay committing your fix. The issue that
you've highlighted is exactly the kind of issue that I anticipated
might happen at some point. This seems straightforward.
BTW, we don't need to recommend the heapallindexed option in the
release notes. Calling bt_check_index() will reliably indicate that
corruption is present when called against existing interval_ops
indexes once your fix is in. Simply having an index whose metapage's
allequalimage field is spuriously set to true will be recognized as
corruption right away. Obviously, this will be no less true with an
existing interval_ops index that happens to be completely empty.
--
Peter Geoghegan
On Tue, Oct 10, 2023 at 09:35:45PM -0700, Peter Geoghegan wrote:
On Tue, Oct 10, 2023 at 8:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
I don't see any reason to delay committing your fix. The issue that
you've highlighted is exactly the kind of issue that I anticipated
might happen at some point. This seems straightforward.BTW, we don't need to recommend the heapallindexed option in the
release notes. Calling bt_check_index() will reliably indicate that
corruption is present when called against existing interval_ops
indexes once your fix is in. Simply having an index whose metapage's
allequalimage field is spuriously set to true will be recognized as
corruption right away. Obviously, this will be no less true with an
existing interval_ops index that happens to be completely empty.
Interesting. So, >99% of interval-type indexes, even ones WITH
(deduplicate_items=off), will get amcheck failures. The <1% of exceptions
might include indexes having allequalimage=off due to an additional column,
e.g. a two-column (interval, numeric) index. If interval indexes are common
enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively
rare, that could argue for giving amcheck a special case. Specifically,
downgrade its "metapage incorrectly indicates that deduplication is safe" from
ERROR to WARNING for interval_ops only. Without that special case (i.e. with
the v1 patch), the release notes should probably resemble, "After updating,
run REINDEX on all indexes having an interval-type column." There's little
point in recommending pg_amcheck if >99% will fail. I'm inclined to bet that
interval-type indexes are rare, so I lean against adding the amcheck special
case. It's not a strong preference. Other opinions?
If users want to reduce their exposure now, they could do "ALTER INDEX ... SET
(deduplicate_items = off)" and then REINDEX any indexes already failing
"pg_amcheck --heapallindexed". allequalimage will remain wrong but have no
ill effects beyond making amcheck fail. Another REINDEX after the update
would let amcheck pass.
On Wed, Oct 11, 2023 at 11:38 AM Noah Misch <noah@leadboat.com> wrote:
Interesting. So, >99% of interval-type indexes, even ones WITH
(deduplicate_items=off), will get amcheck failures. The <1% of exceptions
might include indexes having allequalimage=off due to an additional column,
e.g. a two-column (interval, numeric) index. If interval indexes are common
enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively
rare, that could argue for giving amcheck a special case. Specifically,
downgrade its "metapage incorrectly indicates that deduplication is safe" from
ERROR to WARNING for interval_ops only.
I am not aware of any user actually running "deduplicate_items = off"
in production, for any index. It was added purely as a defensive thing
-- not because I anticipated any real need to disable deduplication.
Deduplication was optimized for being enabled by default.
Anything is possible, of course, but it's hard to give too much weight
to cases where two very unlikely things happen to intersect. (Plus
"deduplicate_items = off" doesn't really do that much; more on that
below.)
Without that special case (i.e. with
the v1 patch), the release notes should probably resemble, "After updating,
run REINDEX on all indexes having an interval-type column."
+1
There's little
point in recommending pg_amcheck if >99% will fail. I'm inclined to bet that
interval-type indexes are rare, so I lean against adding the amcheck special
case. It's not a strong preference. Other opinions?
Well, there will only be one known reason why anybody will ever see
this test fail (barring a near-miraculous coincidence where "generic
corruption" somehow gets passed our most basic sanity tests, only to
fail this metapage field check a bit later on). Even if you
pessimistically assume that similar problems remain undiscovered in
some other opfamily, this particular check isn't going to be the check
that detects the other problems -- you really would need
heapallindexed verification for that.
In short, this metapage check is only effective retrospectively, once
we recognize and fix problems in an opclass. Clearly there will be
exactly one case like that post-fix (interval_ops is at least the only
affected core code opfamily), so why not point that out directly with
a HINT? A HINT could go a long way towards putting the problem in
context, without really adding a special case, and without any real
question of users being misled.
If users want to reduce their exposure now, they could do "ALTER INDEX ... SET
(deduplicate_items = off)" and then REINDEX any indexes already failing
"pg_amcheck --heapallindexed". allequalimage will remain wrong but have no
ill effects beyond making amcheck fail. Another REINDEX after the update
would let amcheck pass.
Even when "deduplicate_items = off", that just means that the nbtree
code won't apply further deduplication passes going forward (until
such time as deduplication is reenabled). It doesn't really mean that
this problem can't exist. OTOH it's easy to detect affected indexes
using SQL. So this is one case where telling users to REINDEX really
does seem like the best thing (as opposed to something we say because
we're too lazy to come up with nuanced, practical guidance).
--
Peter Geoghegan
On Wed, Oct 11, 2023 at 01:00:44PM -0700, Peter Geoghegan wrote:
On Wed, Oct 11, 2023 at 11:38 AM Noah Misch <noah@leadboat.com> wrote:
Interesting. So, >99% of interval-type indexes, even ones WITH
(deduplicate_items=off), will get amcheck failures. The <1% of exceptions
might include indexes having allequalimage=off due to an additional column,
e.g. a two-column (interval, numeric) index. If interval indexes are common
enough and "pg_amcheck --heapallindexed" failures from $SUBJECT are relatively
rare, that could argue for giving amcheck a special case. Specifically,
downgrade its "metapage incorrectly indicates that deduplication is safe" from
ERROR to WARNING for interval_ops only.I am not aware of any user actually running "deduplicate_items = off"
in production, for any index. It was added purely as a defensive thing
-- not because I anticipated any real need to disable deduplication.
Deduplication was optimized for being enabled by default.
Sure. Low-importance background information: deduplicate_items=off got on my
radar while I was wondering if ALTER INDEX ... SET (deduplicate_items=off)
would clear allequalimage. If it had, we could have advised people to use
ALTER INDEX, then rebuild only those indexes still failing "pg_amcheck
--heapallindexed". ALTER INDEX doesn't do that, ruling out that idea.
Without that special case (i.e. with
the v1 patch), the release notes should probably resemble, "After updating,
run REINDEX on all indexes having an interval-type column."+1
There's little
point in recommending pg_amcheck if >99% will fail. I'm inclined to bet that
interval-type indexes are rare, so I lean against adding the amcheck special
case. It's not a strong preference. Other opinions?
exactly one case like that post-fix (interval_ops is at least the only
affected core code opfamily), so why not point that out directly with
a HINT? A HINT could go a long way towards putting the problem in
context, without really adding a special case, and without any real
question of users being misled.
Works for me. Added.
Attachments:
interval-rescind-dedup-v2.patchtext/plain; charset=us-asciiDownload
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Dissociate btequalimage() from interval_ops, ending its deduplication.
Under interval_ops, some equal values are distinguishable. One such
pair is '24:00:00' and '1 day'. With that being so, btequalimage()
breaches the documented contract for the "equalimage" btree support
function. This can cause incorrect results from index-only scans.
Users should REINDEX any btree indexes having interval-type columns.
After updating, pg_amcheck will report an error for almost all such
indexes. This fix makes interval_ops simply omit the support function,
like numeric_ops does. Back-pack to v13, where btequalimage() first
appeared. In back branches, for the benefit of old catalog content,
btequalimage() code will return false for type "interval". Going
forward, back-branch initdb will include the catalog change.
Reviewed by Peter Geoghegan.
Discussion: https://postgr.es/m/20231011013317.22.nmisch@google.com
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index dbb83d8..3e07a3e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -31,6 +31,7 @@
#include "access/xact.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_opfamily_d.h"
#include "commands/tablecmds.h"
#include "common/pg_prng.h"
#include "lib/bloomfilter.h"
@@ -338,10 +339,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
errmsg("index \"%s\" metapage has equalimage field set on unsupported nbtree version",
RelationGetRelationName(indrel))));
if (allequalimage && !_bt_allequalimage(indrel, false))
+ {
+ bool has_interval_ops = false;
+
+ for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(indrel); i++)
+ if (indrel->rd_opfamily[i] == INTERVAL_BTREE_FAM_OID)
+ has_interval_ops = true;
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("index \"%s\" metapage incorrectly indicates that deduplication is safe",
- RelationGetRelationName(indrel))));
+ RelationGetRelationName(indrel)),
+ has_interval_ops
+ ? errhint("This is known of \"interval\" indexes last built on a version predating 2023-11.")
+ : 0));
+ }
/* Check index, possibly against table it is an index on */
bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5b95012..4c70da4 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -172,8 +172,6 @@
{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '3',
amproc => 'in_range(interval,interval,interval,bool,bool)' },
-{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
- amprocrighttype => 'interval', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '1', amproc => 'macaddr_cmp' },
{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 91587b9..81a8525 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -50,7 +50,7 @@
opfmethod => 'btree', opfname => 'integer_ops' },
{ oid => '1977',
opfmethod => 'hash', opfname => 'integer_ops' },
-{ oid => '1982',
+{ oid => '1982', oid_symbol => 'INTERVAL_BTREE_FAM_OID',
opfmethod => 'btree', opfname => 'interval_ops' },
{ oid => '1983',
opfmethod => 'hash', opfname => 'interval_ops' },
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index a1bdf2c..7a6f36a 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2208,6 +2208,7 @@ ORDER BY 1, 2, 3;
| array_ops | array_ops | anyarray
| float_ops | float4_ops | real
| float_ops | float8_ops | double precision
+ | interval_ops | interval_ops | interval
| jsonb_ops | jsonb_ops | jsonb
| multirange_ops | multirange_ops | anymultirange
| numeric_ops | numeric_ops | numeric
@@ -2216,7 +2217,7 @@ ORDER BY 1, 2, 3;
| record_ops | record_ops | record
| tsquery_ops | tsquery_ops | tsquery
| tsvector_ops | tsvector_ops | tsvector
-(15 rows)
+(16 rows)
-- **************** pg_index ****************
-- Look for illegal values in pg_index fields.
interval-rescind-dedup-v2_16.patchtext/plain; charset=us-asciiDownload
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Dissociate btequalimage() from interval_ops, ending its deduplication.
Under interval_ops, some equal values are distinguishable. One such
pair is '24:00:00' and '1 day'. With that being so, btequalimage()
breaches the documented contract for the "equalimage" btree support
function. This can cause incorrect results from index-only scans.
Users should REINDEX any btree indexes having interval-type columns.
After updating, pg_amcheck will report an error for almost all such
indexes. This fix makes interval_ops simply omit the support function,
like numeric_ops does. Back-pack to v13, where btequalimage() first
appeared. In back branches, for the benefit of old catalog content,
btequalimage() code will return false for type "interval". Going
forward, back-branch initdb will include the catalog change.
Reviewed by Peter Geoghegan.
Discussion: https://postgr.es/m/20231011013317.22.nmisch@google.com
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index dbb83d8..3e07a3e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -31,6 +31,7 @@
#include "access/xact.h"
#include "catalog/index.h"
#include "catalog/pg_am.h"
+#include "catalog/pg_opfamily_d.h"
#include "commands/tablecmds.h"
#include "common/pg_prng.h"
#include "lib/bloomfilter.h"
@@ -338,10 +339,20 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
errmsg("index \"%s\" metapage has equalimage field set on unsupported nbtree version",
RelationGetRelationName(indrel))));
if (allequalimage && !_bt_allequalimage(indrel, false))
+ {
+ bool has_interval_ops = false;
+
+ for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(indrel); i++)
+ if (indrel->rd_opfamily[i] == INTERVAL_BTREE_FAM_OID)
+ has_interval_ops = true;
ereport(ERROR,
(errcode(ERRCODE_INDEX_CORRUPTED),
errmsg("index \"%s\" metapage incorrectly indicates that deduplication is safe",
- RelationGetRelationName(indrel))));
+ RelationGetRelationName(indrel)),
+ has_interval_ops
+ ? errhint("This is known of \"interval\" indexes last built on a version predating 2023-11.")
+ : 0));
+ }
/* Check index, possibly against table it is an index on */
bt_check_every_level(indrel, heaprel, heapkeyspace, parentcheck,
diff --git a/src/backend/utils/adt/datum.c b/src/backend/utils/adt/datum.c
index 9f06ee7..251dd23 100644
--- a/src/backend/utils/adt/datum.c
+++ b/src/backend/utils/adt/datum.c
@@ -43,6 +43,7 @@
#include "postgres.h"
#include "access/detoast.h"
+#include "catalog/pg_type_d.h"
#include "common/hashfn.h"
#include "fmgr.h"
#include "utils/builtins.h"
@@ -385,20 +386,17 @@ datum_image_hash(Datum value, bool typByVal, int typLen)
* datum_image_eq() in all cases can use this as their "equalimage" support
* function.
*
- * Currently, we unconditionally assume that any B-Tree operator class that
- * registers btequalimage as its support function 4 must be able to safely use
- * optimizations like deduplication (i.e. we return true unconditionally). If
- * it ever proved necessary to rescind support for an operator class, we could
- * do that in a targeted fashion by doing something with the opcintype
- * argument.
+ * Earlier minor releases erroneously associated this function with
+ * interval_ops. Detect that case to rescind deduplication support, without
+ * requiring initdb.
*-------------------------------------------------------------------------
*/
Datum
btequalimage(PG_FUNCTION_ARGS)
{
- /* Oid opcintype = PG_GETARG_OID(0); */
+ Oid opcintype = PG_GETARG_OID(0);
- PG_RETURN_BOOL(true);
+ PG_RETURN_BOOL(opcintype != INTERVALOID);
}
/*-------------------------------------------------------------------------
diff --git a/src/include/catalog/pg_amproc.dat b/src/include/catalog/pg_amproc.dat
index 5b95012..4c70da4 100644
--- a/src/include/catalog/pg_amproc.dat
+++ b/src/include/catalog/pg_amproc.dat
@@ -172,8 +172,6 @@
{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
amprocrighttype => 'interval', amprocnum => '3',
amproc => 'in_range(interval,interval,interval,bool,bool)' },
-{ amprocfamily => 'btree/interval_ops', amproclefttype => 'interval',
- amprocrighttype => 'interval', amprocnum => '4', amproc => 'btequalimage' },
{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
amprocrighttype => 'macaddr', amprocnum => '1', amproc => 'macaddr_cmp' },
{ amprocfamily => 'btree/macaddr_ops', amproclefttype => 'macaddr',
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 91587b9..81a8525 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -50,7 +50,7 @@
opfmethod => 'btree', opfname => 'integer_ops' },
{ oid => '1977',
opfmethod => 'hash', opfname => 'integer_ops' },
-{ oid => '1982',
+{ oid => '1982', oid_symbol => 'INTERVAL_BTREE_FAM_OID',
opfmethod => 'btree', opfname => 'interval_ops' },
{ oid => '1983',
opfmethod => 'hash', opfname => 'interval_ops' },
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index a1bdf2c..7a6f36a 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2208,6 +2208,7 @@ ORDER BY 1, 2, 3;
| array_ops | array_ops | anyarray
| float_ops | float4_ops | real
| float_ops | float8_ops | double precision
+ | interval_ops | interval_ops | interval
| jsonb_ops | jsonb_ops | jsonb
| multirange_ops | multirange_ops | anymultirange
| numeric_ops | numeric_ops | numeric
@@ -2216,7 +2217,7 @@ ORDER BY 1, 2, 3;
| record_ops | record_ops | record
| tsquery_ops | tsquery_ops | tsquery
| tsvector_ops | tsvector_ops | tsvector
-(15 rows)
+(16 rows)
-- **************** pg_index ****************
-- Look for illegal values in pg_index fields.
On Thu, Oct 12, 2023 at 4:10 PM Noah Misch <noah@leadboat.com> wrote:
exactly one case like that post-fix (interval_ops is at least the only
affected core code opfamily), so why not point that out directly with
a HINT? A HINT could go a long way towards putting the problem in
context, without really adding a special case, and without any real
question of users being misled.Works for me. Added.
Looks good. Thanks!
--
Peter Geoghegan
I've also caught btree posting lists where one TID refers to a '1d' heap
tuple, while another TID refers to a '24h' heap tuple. amcheck complains.
Index-only scans can return the '1d' bits where the actual tuple had the
'24h'
bits.
Have a build without the patch. I can't reproduce amcheck complaints in
release mode
where all these statements succeed.
create table t (c interval);
insert into t select x from generate_series(1,500), (values ('1 year 1
month'::interval), ('1 year 30 days')) t(x);
select distinct c from t;
create index ti on t (c);
select bt_index_check('ti'::regclass);
On following query, the query seems to return the right result sets,
index-only scan doesn't seem to be mislead by the misuse of btequalimage
postgres=# vacuum (INDEX_CLEANUP on) t;
VACUUM
postgres=# explain (analyze, buffers) select c::text, count(c) from t group
by c::text;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=49.27..49.29 rows=1 width=40) (actual
time=3.329..3.333 rows=2 loops=1)
Group Key: (c)::text
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=6
-> Index Only Scan using ti on t (cost=0.28..44.27 rows=1000 width=48)
(actual time=0.107..2.269 rows=1000 loops=1)
Heap Fetches: 0
Buffers: shared hit=6
Planning:
Buffers: shared hit=4
Planning Time: 0.319 ms
Execution Time: 3.432 ms
(11 rows)
postgres=# select c::text, count(c) from t group by c::text;
c | count
----------------+-------
1 year 1 mon | 500
1 year 30 days | 500
(2 rows)
* Generic "equalimage" support function.
*
* B-Tree operator classes whose equality function could safely be
replaced by
* datum_image_eq() in all cases can use this as their "equalimage" support
* function.
It seems to me that as long as a data type has a deterministic sort but not
necessarily be equalimage,
it should be able to support deduplication. e.g for interval type, we add
a byte wise tie breaker
after '24h' and '1day' are compared equally. In the btree, '24h' and '1day'
are still adjacent,
'1day' is always sorted before '24h' in a btree page, can we do dedup for
each value
without problem?
The assertion will still be triggered as it's testing btequalimage, but
I'll defer it for now.
Wanted to know if the above idea sounds sane first...
Donghang Lin
(ServiceNow)
On Thu, Oct 12, 2023 at 4:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
Show quoted text
On Thu, Oct 12, 2023 at 4:10 PM Noah Misch <noah@leadboat.com> wrote:
exactly one case like that post-fix (interval_ops is at least the only
affected core code opfamily), so why not point that out directly with
a HINT? A HINT could go a long way towards putting the problem in
context, without really adding a special case, and without any real
question of users being misled.Works for me. Added.
Looks good. Thanks!
--
Peter Geoghegan
On Mon, Oct 16, 2023 at 11:21:20PM -0700, Donghang Lin wrote:
I've also caught btree posting lists where one TID refers to a '1d' heap
tuple, while another TID refers to a '24h' heap tuple. amcheck complains.Index-only scans can return the '1d' bits where the actual tuple had the
'24h'
bits.Have a build without the patch. I can't reproduce amcheck complaints in
release mode
where all these statements succeed.
The queries I shared don't create the problematic structure, just an assertion
failure.
* Generic "equalimage" support function.
*
* B-Tree operator classes whose equality function could safely bereplaced by
* datum_image_eq() in all cases can use this as their "equalimage" support
* function.It seems to me that as long as a data type has a deterministic sort but not
necessarily be equalimage,
it should be able to support deduplication. e.g for interval type, we add
a byte wise tie breaker
after '24h' and '1day' are compared equally. In the btree, '24h' and '1day'
are still adjacent,
'1day' is always sorted before '24h' in a btree page, can we do dedup for
each value
without problem?
Yes. I'm not aware of correctness obstacles arising if one did that.
On Mon, Oct 16, 2023 at 11:21 PM Donghang Lin <donghanglin@gmail.com> wrote:
It seems to me that as long as a data type has a deterministic sort but not necessarily be equalimage,
it should be able to support deduplication. e.g for interval type, we add a byte wise tie breaker
after '24h' and '1day' are compared equally. In the btree, '24h' and '1day' are still adjacent,
'1day' is always sorted before '24h' in a btree page, can we do dedup for each value
without problem?
The assertion will still be triggered as it's testing btequalimage, but I'll defer it for now.
Wanted to know if the above idea sounds sane first...
It's hard to give any one reason why this won't work. I'm sure that
with enough effort some scheme like this could work. It's just that
there are significant practical problems that at least make it seem
unappealing as a project. This has been discussed before:
/messages/by-id/CAH2-WzkZkSC7G+v1WwXGo0emh8E-rByw=xSpBUoavk7PTjwF2Q@mail.gmail.com
--
Peter Geoghegan