C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.
What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.
Best,
Aleksey
On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.
Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.
To add:
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale. This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column. This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.
merlin
On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.
I think you're confusing encodings with locales. C is a locale. You
can have a database with a locale of C and UTF-8 encoding.
create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;
\l
Name | Owner | Encoding | Collate | Ctype |
Access privileges
--------------+----------+-----------+-------------+-------------+-----------------------
clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 |
SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.
2012/8/29 Merlin Moncure <mmoncure@gmail.com>
On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com>
wrote:
citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.To add:
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale. This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column. This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.
Good point! Thanks!
--
// Dmitriy.
On Wed, Aug 29, 2012 at 01:45:20PM -0500, Merlin Moncure wrote:
On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.To add:
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale. This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column. This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.
Do you get proper sort ordering in this case, or only when you specific
the proper collation at runtime?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Wed, Aug 29, 2012 at 12:52:50PM -0600, Scott Marlowe wrote:
On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.I think you're confusing encodings with locales. C is a locale. You
I think technically C is a non-locale.
can have a database with a locale of C and UTF-8 encoding.
create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;
\l
Name | Owner | Encoding | Collate | Ctype |
Access privileges
--------------+----------+-----------+-------------+-------------+-----------------------
clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 |SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.
Yes, but what sort ordering do you get in that case?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On Wed, Aug 29, 2012 at 2:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 29, 2012 at 12:52:50PM -0600, Scott Marlowe wrote:
On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:
On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.I think you're confusing encodings with locales. C is a locale. You
I think technically C is a non-locale.
True. But it's NOT an encoding.
can have a database with a locale of C and UTF-8 encoding.
create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;
\l
Name | Owner | Encoding | Collate | Ctype |
Access privileges
--------------+----------+-----------+-------------+-------------+-----------------------
clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 |SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.Yes, but what sort ordering do you get in that case?
Byte ordering.
On 2012-08-29, Bruce Momjian <bruce@momjian.us> wrote:
Name | Owner | Encoding | Collate | Ctype |
Access privileges
--------------+----------+-----------+-------------+-------------+-----------------------
clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 |SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.Yes, but what sort ordering do you get in that case?
C ordering is by unicode code point number.
c=# values ('a'),('Z'),('€'),('z'),('¢'),('ä'),('Ā') order by 1;
column1
---------
Z
a
z
¢
ä
Ā
€
--
⚂⚃ 100% natural
On Wed, 2012-08-29 at 10:31 -0700, Aleksey Tsalolikhin wrote:
What is the difference between C and en_US.UTF8, please?
There are many differences, but here is a simple one:
$ (echo a; echo A; echo b; echo B) | LC_ALL=C sort
A
B
a
b
$ (echo a; echo A; echo b; echo B) | LC_ALL=en_US.utf8 sort
a
A
b
B
On Sun, Sep 2, 2012 at 10:08 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
On Wed, 2012-08-29 at 10:31 -0700, Aleksey Tsalolikhin wrote:
What is the difference between C and en_US.UTF8, please?
There are many differences, but here is a simple one:
$ (echo a; echo A; echo b; echo B) | LC_ALL=C sort
...
$ (echo a; echo A; echo b; echo B) | LC_ALL=en_US.utf8 sort
...
Beautifully illustrated, and in a language I readily understand as a UNIX
sys admin. Thank you, Peter!
Thank you, Bruce, Merlin, and everybody else who replied.
I love the flexibility 9.1 allows to have a default sort locale with ability
to alter locale per-query if needed. It gives us the confidence to move forward
with switching to C for sorting -- right now we only need to sort on US
English data, so this is a perfect fit. When we internationalize, 9.1
allows us to switch back to UTF8-based sorting if needed. I love it!
Thank you!
Aleksey