BUG #3965: UNIQUE constraint fails on long column values

Started by Juho Saarikkoabout 18 years ago20 messagesbugsdocs
Jump to latest
#1Juho Saarikko
juhos@mbnet.fi
bugsdocs

The following bug has been logged online:

Bug reference: 3965
Logged by: Juho Saarikko
Email address: juhos@mbnet.fi
PostgreSQL version: 8.3RC2
Operating system: Linux
Description: UNIQUE constraint fails on long column values
Details:

It is impossible to add an UNIQUE constraint which includes columns with
long values. The reason seems to be that UNIQUE is implemented using b-tree
index, which cannot handle values longer than 8191 bytes.

While I didn't test, I'd imagine that this would also mean that any attempt
to insert such values to an already unique column would fail.

It is propably impossible to fix this in a simple way, since it is an
inherent result of the underlying storage specification rather than a mere
programming error, so the documentation needs to be updated to warn about
this.

I suggest implementing unique hash indexes and automatically creating one
(and turning the b-tree index into a non-unique one) when a large value is
inserted to fix this. Alternatively, fix b-trees so they can handle large
values; however, a hash index should be far more efficient for this specific
case, since the size of a hash is independent of pre-hash data size.

Exact error message:
******************
kuvat=# alter table pictures ADD constraint pic_unique unique (safe);
NOTICE: 00000: ALTER TABLE / ADD UNIQUE will create implicit index
"pic_unique" for table "pictures"
LOCATION: DefineIndex, indexcmds.c:434
ERROR: 54000: index row requires 47148 bytes, maximum size is 8191
LOCATION: index_form_tuple, indextuple.c:170

#2Bruce Momjian
bruce@momjian.us
In reply to: Juho Saarikko (#1)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

"Juho Saarikko" <juhos@mbnet.fi> writes:

It is propably impossible to fix this in a simple way, since it is an
inherent result of the underlying storage specification rather than a mere
programming error, so the documentation needs to be updated to warn about
this.

Point taken.

I suggest implementing unique hash indexes and automatically creating one
(and turning the b-tree index into a non-unique one) when a large value is
inserted to fix this. Alternatively, fix b-trees so they can handle large
values; however, a hash index should be far more efficient for this specific
case, since the size of a hash is independent of pre-hash data size.

With expression indexes you can do this yourself with something like
CREATE INDEX pk_hash on tab ((hashtext(safe)))

We can't do this automatically since it wouldn't enforce the UNIQUE
constraint. Conceivably we could actually do something about that but there's
nothing like that now.

We have hash indexes too but in practice a btree over a hash seems to work
just as well or better.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

#3Bruce Momjian
bruce@momjian.us
In reply to: Juho Saarikko (#1)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

Juho Saarikko wrote:

While I didn't test, I'd imagine that this would also mean that any attempt
to insert such values to an already unique column would fail.

Works here in 8.3:

test=> create table test (x text unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "test_x_key" for table "test"
CREATE TABLE
test=> insert into test values (repeat('a', 50000));
INSERT 0 1

Even this works:

test=> insert into test values (repeat('a', 50000) || 'b');

I believe the index only indexes 8192 bytes but checks the heap for
longer values to check the full length.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

Bruce Momjian <bruce@momjian.us> writes:

Juho Saarikko wrote:

While I didn't test, I'd imagine that this would also mean that any attempt
to insert such values to an already unique column would fail.

Works here in 8.3:

test=> create table test (x text unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "test_x_key" for table "test"
CREATE TABLE
test=> insert into test values (repeat('a', 50000));
INSERT 0 1

That test only works because it's eminently compressible.

The short answer to this bug report is that we're not very concerned
about fixing this because there is seldom a good reason to have an
index (unique or not) on fields that can get so wide. As was already
noted, if you do need a uniqueness check you can easily make a 99.9999%
solution by indexing the md5 hash (or some similar digest) of the
column. It doesn't really seem worthwhile to expend development work
on something that would benefit so few people.

regards, tom lane

#5Juho Saarikko
juhos@mbnet.fi
In reply to: Tom Lane (#4)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Juho Saarikko wrote:

While I didn't test, I'd imagine that this would also mean that any attempt
to insert such values to an already unique column would fail.

Works here in 8.3:

test=> create table test (x text unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "test_x_key" for table "test"
CREATE TABLE
test=> insert into test values (repeat('a', 50000));
INSERT 0 1

That test only works because it's eminently compressible.

The short answer to this bug report is that we're not very concerned
about fixing this because there is seldom a good reason to have an
index (unique or not) on fields that can get so wide. As was already
noted, if you do need a uniqueness check you can easily make a 99.9999%
solution by indexing the md5 hash (or some similar digest) of the
column. It doesn't really seem worthwhile to expend development work
on something that would benefit so few people.

regards, tom lane

But the documentation needs to be updated to mention this nonetheless.
It is a nasty surprise if it hits unawares.

Besides, it's not such an impossible scenario. I encountered this bug
when making an Usenet image archival system. Since the same images tend
to be reposted a lot, it makes sense to store them only once, and simply
reference the stored image from each context it was posted in. Currently
my program does the uniqueness constraining by itself; I was examining
having the database enforce it when I ran into this issue.

Such applications are not exactly rare: bayimg, img.google.com, etc. and
of course the innumerable Usenet archival sites could all conceivably
want to do something like this. So could any application which monitors
potentially repeating phenomena, for that matter. After all, saving a
single state of the system only once not only reduces the amount of data
stored, but could also help in actual analysis of it, since it becomes
trivial to recognize most and least often recurring states.

#6Kris Jurka
books@ejurka.com
In reply to: Bruce Momjian (#3)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

On Mon, 18 Feb 2008, Bruce Momjian wrote:

Juho Saarikko wrote:

While I didn't test, I'd imagine that this would also mean that any attempt
to insert such values to an already unique column would fail.

Works here in 8.3:

test=> insert into test values (repeat('a', 50000) || 'b');

This only works because it gets toasted before being put in the index.
Since you've selected something real compressible, you can fit 50k chars
into it.

Kris Jurka

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Juho Saarikko (#1)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

Juho Saarikko wrote:

I suggest implementing unique hash indexes and automatically creating one
(and turning the b-tree index into a non-unique one) when a large value is
inserted to fix this. Alternatively, fix b-trees so they can handle large
values; however, a hash index should be far more efficient for this specific
case, since the size of a hash is independent of pre-hash data size.

The current implementation of hash indexes actually store the whole key,
in addition to the hash, so the size of the hash index is not
independent of the key size. There has been some discussion on revamping
the hash index implementation, and changing that among other things, but
don't hold your breath.

As others have pointed out, CREATE UNIQUE INDEX i ON ((md5(column)) is a
pretty good work-around.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#7)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

As others have pointed out, CREATE UNIQUE INDEX i ON ((md5(column)) is a pretty
good work-around.

Unless you need cryptographic security I would not suggest using MD5. MD5 is
intentionally designed to take a substantial amount of CPU resources to
calculate.

Postgres's internal hash method is exposed for most data types as hashtext()
hashfloat8(), hashint4(), etc. These functions were chosen for their
lightweight design.

Cryptographic security is important only if you're concerned with people being
able to intentionally create collisions. In this scenario that's probably not
a top threat. Conceivably someone could create a denial-of-service attack
slowing down your server by causing your indexes to become unbalanced. But it
would be fairly challenging to engineer.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#9Francisco Olarte
folarte@peoplecall.com
In reply to: Bruce Momjian (#8)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

On Wednesday 20 February 2008, Gregory Stark wrote:

Unless you need cryptographic security I would not suggest using MD5. MD5
is intentionally designed to take a substantial amount of CPU resources to
calculate.

I thought it was the exact opposite, quoting from RFC1321:

The MD5 algorithm is designed to be quite fast on 32-bit machines. In
addition, the MD5 algorithm does not require any large substitution
tables; the algorithm can be coded quite compactly.

F.O.S.

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#8)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

Gregory Stark wrote:

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

As others have pointed out, CREATE UNIQUE INDEX i ON ((md5(column)) is a pretty
good work-around.

Unless you need cryptographic security I would not suggest using MD5. MD5 is
intentionally designed to take a substantial amount of CPU resources to
calculate.

Postgres's internal hash method is exposed for most data types as hashtext()
hashfloat8(), hashint4(), etc. These functions were chosen for their
lightweight design.

Cryptographic security is important only if you're concerned with people being
able to intentionally create collisions. In this scenario that's probably not
a top threat. Conceivably someone could create a denial-of-service attack
slowing down your server by causing your indexes to become unbalanced. But it
would be fairly challenging to engineer.

Return type of hash* functions is just 32 bits. I wonder if that's wide
enough to avoid accidental collisions? Depends on the application of
course...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#10)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

Gregory Stark wrote:

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

As others have pointed out, CREATE UNIQUE INDEX i ON ((md5(column)) is a pretty
good work-around.

Unless you need cryptographic security I would not suggest using MD5. MD5 is
intentionally designed to take a substantial amount of CPU resources to
calculate.

Return type of hash* functions is just 32 bits. I wonder if that's wide enough
to avoid accidental collisions? Depends on the application of course...

Oh, I missed that you were suggesting a UNIQUE index. That seems unsafe to me
even for md5 or its ilk. But that would depend on the application too.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

#12Michael Fuhr
mike@fuhr.org
In reply to: Francisco Olarte (#9)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

On Wed, Feb 20, 2008 at 12:21:03PM +0100, Francisco Olarte Sanz wrote:

On Wednesday 20 February 2008, Gregory Stark wrote:

Unless you need cryptographic security I would not suggest using MD5. MD5
is intentionally designed to take a substantial amount of CPU resources to
calculate.

I thought it was the exact opposite, quoting from RFC1321:

And if you *do* need cryptographic security then don't use MD5, and
consider using SHA-256 instead of SHA-1. See RFC 4270 for discussion.

ftp://ftp.rfc-editor.org/in-notes/rfc4270.txt

--
Michael Fuhr

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#11)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

Gregory Stark <stark@enterprisedb.com> writes:

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

Return type of hash* functions is just 32 bits. I wonder if that's wide enough
to avoid accidental collisions? Depends on the application of course...

Oh, I missed that you were suggesting a UNIQUE index. That seems unsafe to me
even for md5 or its ilk. But that would depend on the application too.

md5 is designed to be a signature, remember? If there weren't a very
high probability of its output being different for different inputs,
it wouldn't be good for anything.

The built-in hash functions definitely cannot be relied on to not have
collisions, though.

regards, tom lane

#14Bruce Momjian
bruce@momjian.us
In reply to: Michael Fuhr (#12)
bugsdocs
Re: BUG #3965: UNIQUE constraint fails on long column values

"Michael Fuhr" <mike@fuhr.org> writes:

On Wed, Feb 20, 2008 at 12:21:03PM +0100, Francisco Olarte Sanz wrote:

On Wednesday 20 February 2008, Gregory Stark wrote:

Unless you need cryptographic security I would not suggest using MD5. MD5
is intentionally designed to take a substantial amount of CPU resources to
calculate.

I thought it was the exact opposite, quoting from RFC1321:

Hm, ok, strike "intentionally". Nonetheless MD5 is quite computationally
intensive compared to quick hashes like the ones Postgres uses or CRC hashes
(which we ought to have functions for, but we don't seem to). SHA-1 is even
more computationally intensive and SHA-256 far more again.

For purposes of speeding up access a simple hash with the possibility of a few
collisions is normally fine. You add an additional clause to recheck the
original constraint.

For purposes of enforcing uniqueness I would be leery of depending on any
hash. The decision would depend on the application and the consequences of a
spurious error. The chances are slim but it's not impossible.

And if you *do* need cryptographic security then don't use MD5, and
consider using SHA-256 instead of SHA-1. See RFC 4270 for discussion.

ftp://ftp.rfc-editor.org/in-notes/rfc4270.txt

One of the factors in deciding between cryptographic algorithms is the
longevity required. MD5 has not been cracked but some suspicious weaknesses
have been discovered which might lead to a crack sometime in the future where
an attacker might be able to construct new plaintexts with identical hashes.
If you just need something secure for session keys then that's not going to be
a concern. If you need to distinguish user-provided documents from other
user-provided documents you're keeping for decades then it is.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!

#15Bruce Momjian
bruce@momjian.us
In reply to: Juho Saarikko (#5)
bugsdocs
Re: [BUGS] BUG #3965: UNIQUE constraint fails on long column values

Does anyone think it is a good idea to document that our indexes cannot
index arbirarily-long strings? I see nothing in the documentation now
about it.

---------------------------------------------------------------------------

Juho Saarikko wrote:

Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Juho Saarikko wrote:

While I didn't test, I'd imagine that this would also mean that any attempt
to insert such values to an already unique column would fail.

Works here in 8.3:

test=> create table test (x text unique);
NOTICE: CREATE TABLE / UNIQUE will create implicit index "test_x_key" for table "test"
CREATE TABLE
test=> insert into test values (repeat('a', 50000));
INSERT 0 1

That test only works because it's eminently compressible.

The short answer to this bug report is that we're not very concerned
about fixing this because there is seldom a good reason to have an
index (unique or not) on fields that can get so wide. As was already
noted, if you do need a uniqueness check you can easily make a 99.9999%
solution by indexing the md5 hash (or some similar digest) of the
column. It doesn't really seem worthwhile to expend development work
on something that would benefit so few people.

regards, tom lane

But the documentation needs to be updated to mention this nonetheless.
It is a nasty surprise if it hits unawares.

Besides, it's not such an impossible scenario. I encountered this bug
when making an Usenet image archival system. Since the same images tend
to be reposted a lot, it makes sense to store them only once, and simply
reference the stored image from each context it was posted in. Currently
my program does the uniqueness constraining by itself; I was examining
having the database enforce it when I ran into this issue.

Such applications are not exactly rare: bayimg, img.google.com, etc. and
of course the innumerable Usenet archival sites could all conceivably
want to do something like this. So could any application which monitors
potentially repeating phenomena, for that matter. After all, saving a
single state of the system only once not only reduces the amount of data
stored, but could also help in actual analysis of it, since it becomes
trivial to recognize most and least often recurring states.

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#16Phil Frost
phil@macprofessionals.com
In reply to: Bruce Momjian (#15)
bugsdocs
Re: Re: [BUGS] BUG #3965: UNIQUE constraint fails on long column values

On Mar 5, 2008, at 11:20 , Bruce Momjian wrote:

Does anyone think it is a good idea to document that our indexes
cannot
index arbirarily-long strings? I see nothing in the documentation now
about it.

+1

I never even knew. I'd be especially interested to hear about
details, like how long strings can be until they are not guaranteed
unique.

#17Bruce Momjian
bruce@momjian.us
In reply to: Phil Frost (#16)
bugsdocs
Re: Re: [BUGS] BUG #3965: UNIQUE constraint fails on long column values

Phil Frost wrote:

On Mar 5, 2008, at 11:20 , Bruce Momjian wrote:

Does anyone think it is a good idea to document that our indexes
cannot
index arbirarily-long strings? I see nothing in the documentation now
about it.

+1

I never even knew. I'd be especially interested to hear about
details, like how long strings can be until they are not guaranteed
unique.

Part of the problem is we don't know the length --- it is the compressed
length that is limited, I think to 8k.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#17)
bugsdocs
Re: Re: [BUGS] BUG #3965: UNIQUE constraint fails on long column values

Bruce Momjian <bruce@momjian.us> writes:

Part of the problem is we don't know the length --- it is the compressed
length that is limited, I think to 8k.

That's a problem, and the length limit is different for different index
types. So you're really not going to be able to say anything very
specific. Another problem is where would you put the information?

regards, tom lane

#19Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Tom Lane (#18)
bugsdocs
Re: Re: [BUGS] BUG #3965: UNIQUE constraint fails on long column values

On Wed, Mar 5, 2008 at 10:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bruce Momjian <bruce@momjian.us> writes:

Part of the problem is we don't know the length --- it is the compressed
length that is limited, I think to 8k.

That's a problem, and the length limit is different for different index
types. So you're really not going to be able to say anything very
specific. Another problem is where would you put the information?

In 'CREATE INDEX' docs...

http://www.postgresql.org/docs/8.3/interactive/sql-createindex.html

Along with the following statement:

method: The name of the index method to be used. Choices are btree, hash,
gist, and gin. The default method is btree.

Best regards,
--
gurjeet[.singh]@EnterpriseDB.com
singh.gurjeet@{ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

17° 29' 34.37"N, 78° 30' 59.76"E - Hyderabad *
18° 32' 57.25"N, 73° 56' 25.42"E - Pune
37° 47' 19.72"N, 122° 24' 1.69" W - San Francisco

http://gurjeet.frihost.net

Mail sent from my BlackLaptop device

#20Bruce Momjian
bruce@momjian.us
In reply to: Gurjeet Singh (#19)
bugsdocs
Re: Re: [BUGS] BUG #3965: UNIQUE constraint fails on long column values

Gurjeet Singh wrote:

On Wed, Mar 5, 2008 at 10:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bruce Momjian <bruce@momjian.us> writes:

Part of the problem is we don't know the length --- it is the compressed
length that is limited, I think to 8k.

That's a problem, and the length limit is different for different index
types. So you're really not going to be able to say anything very
specific. Another problem is where would you put the information?

In 'CREATE INDEX' docs...

http://www.postgresql.org/docs/8.3/interactive/sql-createindex.html

Along with the following statement:

method: The name of the index method to be used. Choices are btree, hash,
gist, and gin. The default method is btree.

We are trying to figure out how to specify the index-type-specific length
limits, not the index types themselves.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +