locales and encodings Oh MY!

Started by markabout 15 years ago7 messages
#1mark
dvlhntr@gmail.com

Hi all,

So today while I investigating some slow queries I found out that some time
ago some of our databases (due, innocently I suspect, to upgrades or
hardware migrations) are now not using C as lc_ctype. And why some 'things'
got a lot slower...

To correct this problem it seems like I am going to have to do a dump,
intidb, then restore. (or add some new indexes and eat the extra overhead).
Unless someone knows something I don't about changing ctype and collation
with some sleight of hand (if you do I am very interested)

I have listed what I think I will be doing with regards to initdb. if anyone
sees problems with the following mixture during my dump -> init-> restore I
would be most keen in hearing about it.

initdb /path/to/data/dir --lc_ctype=C --lc_collation=C
--lc_message=en_US.UTF8 --lc_monetary=en_US.UTF8 --lc_numeric=en_US.UTF8
--lc_time=en_US.UTF8 -E UTF8

Thank you,

Mark

#2Gabriele Bartolini
gabriele.bartolini@2ndQuadrant.it
In reply to: mark (#1)
Re: locales and encodings Oh MY!

Hi Mark,

Il 12/11/10 03:31, mark ha scritto:

I have listed what I think I will be doing with regards to initdb. if anyone
sees problems with the following mixture during my dump -> init-> restore I
would be most keen in hearing about it.

FYI, PostgreSQL 8.4 introduced database level collation. Before that,
yes, you need to re-init your data dir.

initdb /path/to/data/dir --lc_ctype=C --lc_collation=C
--lc_message=en_US.UTF8 --lc_monetary=en_US.UTF8 --lc_numeric=en_US.UTF8
--lc_time=en_US.UTF8 -E UTF8

Maybe you meant --lc_collate ?

Cheers,
Gabriele

--
Gabriele Bartolini - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gabriele.bartolini@2ndQuadrant.it | www.2ndQuadrant.it

#3mark
dvlhntr@gmail.com
In reply to: Gabriele Bartolini (#2)
Re: locales and encodings Oh MY!

On Fri, Nov 12, 2010 at 12:45 AM, Gabriele Bartolini
<gabriele.bartolini@2ndquadrant.it> wrote:

Hi Mark,

Il 12/11/10 03:31, mark ha scritto:

I have listed what I think I will be doing with regards to initdb. if
anyone
sees problems with the following mixture during my dump ->  init->
 restore I
would be most keen in hearing about it.

FYI, PostgreSQL 8.4 introduced database level collation. Before that, yes,
you need to re-init your data dir.

Guess I missed that, I have 8.3.X boxes in production and 9.0.1 boxes
in dev so I guess only someone of them will require a re-init.

initdb /path/to/data/dir --lc_ctype=C --lc_collation=C
--lc_message=en_US.UTF8 --lc_monetary=en_US.UTF8 --lc_numeric=en_US.UTF8
--lc_time=en_US.UTF8 -E UTF8

Maybe you meant --lc_collate ?

Yes I did mean lc_collate - thanks

Cheers,
Gabriele

With how similar straight C and en_US.UTF8 are it was suggested to me,
by persons who are far more C knowledgeable then I in my office, that
this is something the PG community could "fix" . A "fix" being so that
"col LIKE 'foo%' " could use btree indexes in locales like en_US.UTF8
(and probably some others).

@hackers ->
is the request unreasonable ? anyone got any idea of the price tag to
make that happen ?

Show quoted text

--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartolini@2ndQuadrant.it | www.2ndQuadrant.it

#4Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: mark (#3)
Re: locales and encodings Oh MY!

mark wrote:

A "fix" being so that "col LIKE 'foo%' " could use btree indexes in
locales like en_US.UTF8 (and probably some others).

How about specifying an opclass?:

http://www.postgresql.org/docs/current/interactive/indexes-opclass.html

-Kevin

#5Robert Haas
robertmhaas@gmail.com
In reply to: mark (#3)
Re: [HACKERS] locales and encodings Oh MY!

With how similar straight C and en_US.UTF8 are it was suggested to me,
by persons who are far more C knowledgeable then I in my office, that
this is something the PG community could "fix" . A "fix" being so that
"col LIKE 'foo%' " could use btree indexes in locales like en_US.UTF8
(and probably some others).

@hackers ->
is the request unreasonable ? anyone got any idea of the price tag to
make that happen ?

I thought it already did that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: [HACKERS] locales and encodings Oh MY!

Robert Haas <robertmhaas@gmail.com> writes:

With how similar straight C and en_US.UTF8 are it was suggested to me,
by persons who are far more C knowledgeable then I in my office, that
this is something the PG community could "fix" . A "fix" being so that
"col LIKE 'foo%' " could use btree indexes in locales like en_US.UTF8
(and probably some others).

is the request unreasonable ? anyone got any idea of the price tag to
make that happen ?

I thought it already did that.

No, and the odds of it ever happening are insignificant. The sort order
associated with en_US (and other "dictionary order" locales) is just too
randomly different from what you need to optimize a LIKE search.
(Whoever told you en_US sorts similarly to C is nuts.)

The solution if you want the database's prevailing sort order to be en_US
is to put an extra text_pattern_ops index on the column you want to do
LIKE searches on. We might eventually have the ability to spell that
"put a C-locale index on the column", but text_pattern_ops is the way to
do it today.

regards, tom lane

#7mark
dvlhntr@gmail.com
In reply to: Tom Lane (#6)
Re: [HACKERS] locales and encodings Oh MY!

On Fri, Nov 12, 2010 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

With how similar straight C and en_US.UTF8 are it was suggested to me,
by persons who are far more C knowledgeable then I in my office, that
this is something the PG community could "fix" . A "fix" being so that
"col LIKE 'foo%' " could use btree indexes in locales like en_US.UTF8
(and probably some others).

is the request unreasonable ? anyone got any idea of the price tag to
make that happen ?

I thought it already did that.

No, and the odds of it ever happening are insignificant.  The sort order
associated with en_US (and other "dictionary order" locales) is just too
randomly different from what you need to optimize a LIKE search.
(Whoever told you en_US sorts similarly to C is nuts.)

The solution if you want the database's prevailing sort order to be en_US
is to put an extra text_pattern_ops index on the column you want to do
LIKE searches on.  We might eventually have the ability to spell that
"put a C-locale index on the column", but text_pattern_ops is the way to
do it today.

                       regards, tom lane

Ok I hear you loud and clear. I am going to eat the overhead until
I get to 9.0.1, currently on 8.3.X in some places.

I will either take an outage and do a dump - re-init-restore or
inplace upgrade and then do some locking, copy, drop old, rename new
db path.

thanks all.

..: Mark