how is text-equality handled in postgresql?

Started by Gábor Farkasabout 12 years ago8 messagesgeneral
Jump to latest
#1Gábor Farkas
gabor.farkas@gmail.com

hi,

when i create an unique-constraint on a varchar field, how exactly
does postgresql compare the texts? i'm asking because in UNICODE there
are a lot of complexities about this..

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

thanks,
gabor

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#2Ivan Voras
ivoras@freebsd.org
In reply to: Gábor Farkas (#1)
Re: how is text-equality handled in postgresql?

On 15/01/2014 10:10, Gábor Farkas wrote:

hi,

when i create an unique-constraint on a varchar field, how exactly
does postgresql compare the texts? i'm asking because in UNICODE there
are a lot of complexities about this..

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

By default, it is "whatever the operating system thinks it's right".
PostgreSQL doesn't have its own collation code, it uses the OS's locale
support for this.

(which breaks on certain systems which don't have complete UTF-8 support
- I'm in favour of importing ICU at least as an optional dependancy,
similar to what the FreeBSD's patch does:
http://people.freebsd.org/~girgen/postgresql-icu/).

#3Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ivan Voras (#2)
Re: how is text-equality handled in postgresql?

On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:

On 15/01/2014 10:10, Gábor Farkas wrote:

hi,

when i create an unique-constraint on a varchar field, how exactly
does postgresql compare the texts? i'm asking because in UNICODE there
are a lot of complexities about this..

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

By default, it is "whatever the operating system thinks it's right".
PostgreSQL doesn't have its own collation code, it uses the OS's locale
support for this.

Just to add to this, whenever strcoll() (a locale aware comparator)
says two strings are equal, postgres re-compares them using strcmp().
See following code snippet off
src/backend/utils/adt/varlena.c:varstr_cmp() -

#ifdef HAVE_LOCALE_T
if (mylocale)
result = strcoll_l(a1p, a2p, mylocale);
else
#endif
result = strcoll(a1p, a2p);

/*
* In some locales strcoll() can claim that
nonidentical strings are
* equal. Believing that would be bad news for a
number of reasons,
* so we follow Perl's lead and sort "equal" strings
according to
* strcmp().
*/
if (result == 0)
result = strcmp(a1p, a2p);

--
Amit Langote

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#4Ivan Voras
ivoras@freebsd.org
In reply to: Amit Langote (#3)
Re: how is text-equality handled in postgresql?

On 15/01/2014 12:36, Amit Langote wrote:

On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:

On 15/01/2014 10:10, Gábor Farkas wrote:

hi,

when i create an unique-constraint on a varchar field, how exactly
does postgresql compare the texts? i'm asking because in UNICODE there
are a lot of complexities about this..

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

By default, it is "whatever the operating system thinks it's right".
PostgreSQL doesn't have its own collation code, it uses the OS's locale
support for this.

Just to add to this, whenever strcoll() (a locale aware comparator)
says two strings are equal, postgres re-compares them using strcmp().
See following code snippet off
src/backend/utils/adt/varlena.c:varstr_cmp() -

/*
* In some locales strcoll() can claim that
nonidentical strings are
* equal. Believing that would be bad news for a
number of reasons,
* so we follow Perl's lead and sort "equal" strings
according to
* strcmp().
*/
if (result == 0)
result = strcmp(a1p, a2p);

That seems odd and inefficient. Why would it be necessary? I would think
indexing (and other collation-sensitive operations) don't care what the
actual collation result is for arbitrary blobs of strings, as long as
they are stable?

#5Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Ivan Voras (#4)
Re: how is text-equality handled in postgresql?

On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras <ivoras@freebsd.org> wrote:

On 15/01/2014 12:36, Amit Langote wrote:

On Wed, Jan 15, 2014 at 7:39 PM, Ivan Voras <ivoras@freebsd.org> wrote:

On 15/01/2014 10:10, Gábor Farkas wrote:

hi,

when i create an unique-constraint on a varchar field, how exactly
does postgresql compare the texts? i'm asking because in UNICODE there
are a lot of complexities about this..

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

By default, it is "whatever the operating system thinks it's right".
PostgreSQL doesn't have its own collation code, it uses the OS's locale
support for this.

Just to add to this, whenever strcoll() (a locale aware comparator)
says two strings are equal, postgres re-compares them using strcmp().
See following code snippet off
src/backend/utils/adt/varlena.c:varstr_cmp() -

/*
* In some locales strcoll() can claim that
nonidentical strings are
* equal. Believing that would be bad news for a
number of reasons,
* so we follow Perl's lead and sort "equal" strings
according to
* strcmp().
*/
if (result == 0)
result = strcmp(a1p, a2p);

That seems odd and inefficient. Why would it be necessary? I would think
indexing (and other collation-sensitive operations) don't care what the
actual collation result is for arbitrary blobs of strings, as long as
they are stable?

This is the behavior since quite some time introduced by this commit

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

--
Amit Langote

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#6Ivan Voras
ivoras@freebsd.org
In reply to: Amit Langote (#5)
Re: how is text-equality handled in postgresql?

On 15/01/2014 13:29, Amit Langote wrote:

On Wed, Jan 15, 2014 at 9:02 PM, Ivan Voras <ivoras@freebsd.org> wrote:

On 15/01/2014 12:36, Amit Langote wrote:

* In some locales strcoll() can claim that
nonidentical strings are
* equal. Believing that would be bad news for a
number of reasons,
* so we follow Perl's lead and sort "equal" strings
according to
* strcmp().
*/
if (result == 0)
result = strcmp(a1p, a2p);

That seems odd and inefficient. Why would it be necessary? I would think
indexing (and other collation-sensitive operations) don't care what the
actual collation result is for arbitrary blobs of strings, as long as
they are stable?

This is the behavior since quite some time introduced by this commit

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=656beff59033ccc5261a615802e1a85da68e8fad

Ok, the commit comment is:

"Adjust string comparison so that only bitwise-equal strings are considered
equal: if strcoll claims two strings are equal, check it with strcmp, and
sort according to strcmp if not identical. This fixes inconsistent
behavior under glibc's hu_HU locale, and probably under some other locales
as well. Also, take advantage of the now-well-defined behavior to speed up
texteq, textne, bpchareq, bpcharne: they may as well just do a bitwise
comparison and not bother with strcoll at all."

... so it's just another workaround for OS specific locale issues - to
me it looks like just another reason to use ICU.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ivan Voras (#4)
Re: how is text-equality handled in postgresql?

Ivan Voras <ivoras@freebsd.org> writes:

On 15/01/2014 12:36, Amit Langote wrote:

Just to add to this, whenever strcoll() (a locale aware comparator)
says two strings are equal, postgres re-compares them using strcmp().

That seems odd and inefficient. Why would it be necessary? I would think
indexing (and other collation-sensitive operations) don't care what the
actual collation result is for arbitrary blobs of strings, as long as
they are stable?

If we didn't do it like this, we could not use hashing techniques for
text --- at least not unless we could find a hash function guaranteed
to yield the same values for any two strings that strcoll() claims are
equal.

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

#8Vick Khera
vivek@khera.org
In reply to: Gábor Farkas (#1)
Re: how is text-equality handled in postgresql?

On Wed, Jan 15, 2014 at 4:10 AM, Gábor Farkas <gabor.farkas@gmail.com>wrote:

or in other words, when are two varchars equal in postgres? when their
bytes are? or some algorithm is applied?

On this topic, when I write my strings to the DB and search from the DB,
should I canonicalize them first as NKFC (or some other), or just let the
DB figure it out? In my specific case I use perl DBI with place holders to
submit my queries.