C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

Started by Aleksey Tsalolikhinalmost 14 years ago11 messagesgeneral

atsaloli.tech@gmail.com

almost 14 years ago

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.

What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Best,
Aleksey

Bruce Momjian

bruce@momjian.us

almost 14 years ago

In reply to: Aleksey Tsalolikhin (#1)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.

What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Merlin Moncure

mmoncure@gmail.com

almost 14 years ago

In reply to: Bruce Momjian (#2)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.

What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.

To add:
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale. This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column. This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.

merlin

Scott Marlowe

scott.marlowe@gmail.com

almost 14 years ago

In reply to: Bruce Momjian (#2)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.

What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.

I think you're confusing encodings with locales. C is a locale. You
can have a database with a locale of C and UTF-8 encoding.

create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;

SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.

Dmitriy Igrishin

dmitigr@gmail.com

almost 14 years ago

In reply to: Merlin Moncure (#3)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

2012/8/29 Merlin Moncure <mmoncure@gmail.com>

On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com>

wrote:

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.

What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.

To add:
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale. This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column. This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.

Good point! Thanks!

--
// Dmitriy.

Bruce Momjian

bruce@momjian.us

almost 14 years ago

In reply to: Merlin Moncure (#3)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On Wed, Aug 29, 2012 at 01:45:20PM -0500, Merlin Moncure wrote:

On Wed, Aug 29, 2012 at 12:43 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.

What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.

To add:
The middle ground I usually choose is to have a database encoding of
UTF8 but with the C (aka POSIX) locale. This gives you the ability to
store any unicode but indexing operations will use the faster C string
comparison operations for a significant performance boost --
especially for partial string searches on an indexed column. This is
an even more attractive option in 9.1 with the ability to specify
specific collations at runtime.

Do you get proper sort ordering in this case, or only when you specific
the proper collation at runtime?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Bruce Momjian

bruce@momjian.us

almost 14 years ago

In reply to: Scott Marlowe (#4)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On Wed, Aug 29, 2012 at 12:52:50PM -0600, Scott Marlowe wrote:

On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.

What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.

I think you're confusing encodings with locales. C is a locale. You

I think technically C is a non-locale.

can have a database with a locale of C and UTF-8 encoding.

create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;

\l
Name | Owner | Encoding | Collate | Ctype |
Access privileges
--------------+----------+-----------+-------------+-------------+-----------------------
clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 |

SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.

Yes, but what sort ordering do you get in that case?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Scott Marlowe

scott.marlowe@gmail.com

almost 14 years ago

In reply to: Bruce Momjian (#7)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On Wed, Aug 29, 2012 at 2:17 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 29, 2012 at 12:52:50PM -0600, Scott Marlowe wrote:

On Wed, Aug 29, 2012 at 11:43 AM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Aug 29, 2012 at 10:31:21AM -0700, Aleksey Tsalolikhin wrote:

On Wed, Aug 29, 2012 at 9:45 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

citext unfortunately doesn't allow for index optimization of LIKE
queries, which IMNSHO defeats the whole purpose. to the best way
remains to use lower() ...
this will be index optimized and fast as long as you specified C
locale for your database.

What is the difference between C and en_US.UTF8, please? We see that
the same query (that invokes a sort) runs 15% faster under the C
locale. The output between C and en_US.UTF8 is identical. We're
considering moving our database from en_US.UTF8 to C, but we do deal
with internationalized text.

Well, C has reduced overhead for string comparisons, but obviously
doesn't work well for international characters. The single-byte
encodings have somewhat less overhead than UTF8. You can try using C
locales for databases that don't require non-ASCII characters.

I think you're confusing encodings with locales. C is a locale. You

I think technically C is a non-locale.

True. But it's NOT an encoding.

can have a database with a locale of C and UTF-8 encoding.

create database clocale_utf8 encoding='UTF8' LC_COLLATE= 'C' template=template0;

\l
Name | Owner | Encoding | Collate | Ctype |
Access privileges
--------------+----------+-----------+-------------+-------------+-----------------------
clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 |

SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.

Yes, but what sort ordering do you get in that case?

Byte ordering.

Jasen Betts

jasen@xnet.co.nz

almost 14 years ago

In reply to: Aleksey Tsalolikhin (#1)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On 2012-08-29, Bruce Momjian <bruce@momjian.us> wrote:

Name | Owner | Encoding | Collate | Ctype |
Access privileges
--------------+----------+-----------+-------------+-------------+-----------------------
clocale_utf8 | smarlowe | UTF8 | C | en_US.UTF-8 |

SQL_ASCII is the encoding equivalent of C locale, but it also allows
multi-byte characters.

Yes, but what sort ordering do you get in that case?

C ordering is by unicode code point number.

c=# values ('a'),('Z'),('€'),('z'),('¢'),('ä'),('Ā') order by 1;
column1
---------
Z
a
z
¢
ä
Ā
€

--
⚂⚃ 100% natural

#10

Peter Eisentraut

peter_e@gmx.net

almost 14 years ago

In reply to: Aleksey Tsalolikhin (#1)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On Wed, 2012-08-29 at 10:31 -0700, Aleksey Tsalolikhin wrote:

What is the difference between C and en_US.UTF8, please?

There are many differences, but here is a simple one:

$ (echo a; echo A; echo b; echo B) | LC_ALL=C sort
A
B
a
b
$ (echo a; echo A; echo b; echo B) | LC_ALL=en_US.utf8 sort
a
A
b
B

#11

Aleksey Tsalolikhin

atsaloli.tech@gmail.com

almost 14 years ago

In reply to: Peter Eisentraut (#10)

Re: C locale versus en_US.UTF8. (Was: String comparision in PostgreSQL)

On Sun, Sep 2, 2012 at 10:08 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

On Wed, 2012-08-29 at 10:31 -0700, Aleksey Tsalolikhin wrote:

What is the difference between C and en_US.UTF8, please?

There are many differences, but here is a simple one:

$ (echo a; echo A; echo b; echo B) | LC_ALL=C sort
...
$ (echo a; echo A; echo b; echo B) | LC_ALL=en_US.utf8 sort
...

Beautifully illustrated, and in a language I readily understand as a UNIX
sys admin. Thank you, Peter!

Thank you, Bruce, Merlin, and everybody else who replied.

I love the flexibility 9.1 allows to have a default sort locale with ability
to alter locale per-query if needed. It gives us the confidence to move forward
with switching to C for sorting -- right now we only need to sort on US
English data, so this is a perfect fit. When we internationalize, 9.1
allows us to switch back to UTF8-based sorting if needed. I love it!

Thank you!
Aleksey