Encoding/collation question
My older databases have LATIN1 encoding and C collation; the newer ones have
UTF8 encoding and en_US.UTF-8 collation. A web search taught me that I can
change each old database by dumping it and restoring it with the desired
encoding and collation types. My question is whether the older types make
any difference in a single-user environment.
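For reference, the dump-and-restore route can be sketched like this (database names are placeholders; the target must be created from template0 so that non-default encoding/collation settings are accepted):

```sql
-- Create the target database with the desired encoding and collation.
-- template0 is required here: template1 may already contain
-- encoding-dependent data.
CREATE DATABASE mydb_utf8
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8';

-- Then, from the shell:
--   pg_dump mydb_latin1 | psql mydb_utf8
```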
Regards,
Rich
Rich Shepard <rshepard@appl-ecosys.com> writes:
> My older databases have LATIN1 encoding and C collation; the newer ones have
> UTF8 encoding and en_US.UTF-8 collation. A web search taught me that I can
> change each old database by dumping it and restoring it with the desired
> encoding and collation types. My question is whether the older types make
> any difference in a single-user environment.
String comparisons in non-C collations tend to be a lot slower than
they are in C collation. Whether this makes a noticeable difference
to you depends on your workload, but certainly we've seen performance
gripes that trace to that.
If your data doesn't require the larger character set of UTF8, then
using LATIN-any is going to offer some space savings (for non-ASCII
characters) plus minor performance benefits due to the lack of
variable-width characters. This is less significant than the
collation issue, though, for most people.
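The space point is easy to check directly; in a UTF8-encoded database, non-ASCII characters occupy more than one byte (the string literal here is just an arbitrary example):

```sql
-- In a UTF8-encoded database:
SELECT char_length('café')  AS chars,   -- 4 characters
       octet_length('café') AS bytes;   -- 5 bytes: 'é' takes 2 bytes in UTF8
-- In a LATIN1-encoded database the same string is 4 bytes,
-- since every LATIN1 character is exactly one byte.
```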
regards, tom lane
On Wed, 11 Dec 2019, Tom Lane wrote:
> String comparisons in non-C collations tend to be a lot slower than they
> are in C collation. Whether this makes a noticeable difference to you
> depends on your workload, but certainly we've seen performance gripes that
> trace to that.
Tom,
How interesting.
> If your data doesn't require the larger character set of UTF8, then using
> LATIN-any is going to offer some space savings (for non-ASCII characters)
> plus minor performance benefits due to the lack of variable-width
> characters. This is less significant than the collation issue, though, for
> most people.
I doubt that my use will notice meaningful differences. Since there are only
two or three databases with UTF8 encoding and its collation, perhaps I'll
convert those to LATIN1 and C.
Thanks for the insights.
Regards,
Rich
"Rich" == Rich Shepard <rshepard@appl-ecosys.com> writes:
Rich> I doubt that my use will notice meaningful differences. Since
Rich> there are only two or three databases in UTF8 and its collation
Rich> perhaps I'll convert those to LATIN1 and C.
Note that it's perfectly fine to use UTF8 encoding and C collation (this
has the effect of sorting strings in Unicode codepoint order); this is
as fast for comparisons as LATIN1/C is.
For those cases where you need data to be sorted in a
culturally-meaningful order rather than in codepoint order, you can set
collations on specific columns or in individual queries.
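Andrew's suggestion can be sketched as SQL (the database, table, and column names are illustrative; the "en_US" collation must exist on the system):

```sql
-- Database with UTF8 encoding but the fast C collation as its default:
CREATE DATABASE mydb
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';

-- Override per column where linguistic ordering matters:
CREATE TABLE people (
    name text COLLATE "en_US"
);

-- ...or per query:
SELECT name FROM people ORDER BY name COLLATE "en_US";
```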
--
Andrew (irc:RhodiumToad)
On Thu, Dec 12, 2019 at 05:03:59AM +0000, Andrew Gierth wrote:
> Rich> I doubt that my use will notice meaningful differences. Since
> Rich> there are only two or three databases in UTF8 and its collation
> Rich> perhaps I'll convert those to LATIN1 and C.
>
> Note that it's perfectly fine to use UTF8 encoding and C collation (this
> has the effect of sorting strings in Unicode codepoint order); this is
> as fast for comparisons as LATIN1/C is.
>
> For those cases where you need data to be sorted in a
> culturally-meaningful order rather than in codepoint order, you can set
> collations on specific columns or in individual queries.
Nice, thanks for pointing that out. One addition: while this
may seem like "the" magic bullet, note that one will need
additional indexes for culturally-meaningful ORDER BY sorts
to be fast (with a default non-C collation, one gets a
culturally-meaningful index by default for that one non-C
locale).
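That point about indexes can be illustrated like this (table and column names are made up):

```sql
-- With a database default of C collation, this index sorts in
-- codepoint order and cannot serve a linguistic ORDER BY:
CREATE INDEX ON people (name);

-- A second index with an explicit collation is needed before
-- ORDER BY name COLLATE "en_US" can use an index scan:
CREATE INDEX ON people (name COLLATE "en_US");
```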
Question: is C collation expected to be future-proof /
rock-solid / stable -- like UTF8 for encoding choice -- or
could it end up like the SQL_ASCII encoding did: Yeah, we
support it, it's been in use a long time, it should work,
but, nah, one doesn't really want to choose it over UTF8 if
at all possible, or at least know *exactly* what one is doing
and BTW YMMV?
Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> Question: is C collation expected to be future-proof /
> rock-solid / stable -- like UTF8 for encoding choice -- or
> could it end up like the SQL_ASCII encoding did: Yeah, we
> support it, it's been in use a long time, it should work,
> but, nah, one doesn't really want to choose it over UTF8 if
> at all possible, or at least know *exactly* what one is doing
> and BTW YMMV?
C collation basically devolves to strcmp/memcmp, which are as standard
and well-defined as can be. If you're happy with the way it sorts
things then there's no reason not to use it.
It's actually all the *other* collations where you should worry about
their behavior being a moving target :-(.
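On that note, PostgreSQL records the provider's collation version so it can detect when the underlying OS or ICU library changes; a quick way to inspect this (the collversion column exists in PostgreSQL 10 and later):

```sql
-- C and POSIX have no version: their behavior never changes.
SELECT collname, collprovider, collversion
FROM pg_collation
ORDER BY collname;

-- If a library upgrade changed a collation's behavior, rebuild
-- affected indexes and record the new version, e.g.:
--   REINDEX INDEX some_index;
--   ALTER COLLATION "en_US" REFRESH VERSION;
```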
regards, tom lane
On Thu, 12 Dec 2019, Andrew Gierth wrote:
> Note that it's perfectly fine to use UTF8 encoding and C collation (this
> has the effect of sorting strings in Unicode codepoint order); this is as
> fast for comparisons as LATIN1/C is.
Andrew,
This is really useful insight. I hadn't thought about the relationship of
encoding to collation (which, I now know, are independent of each other).
> For those cases where you need data to be sorted in a
> culturally-meaningful order rather than in codepoint order, you can set
> collations on specific columns or in individual queries.
Not an issue for my work. :-)
Thanks very much,
Rich
On Thu, Dec 12, 2019 at 08:35:53AM -0500, Tom Lane wrote:
> C collation basically devolves to strcmp/memcmp, which are as standard
> and well-defined as can be. If you're happy with the way it sorts
> things then there's no reason not to use it.
So that's the collation to use when "technical" sorting is
required (say, when uniqueness does not depend on the notion
of culturally equivalent characters).
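A concrete consequence of that byte-wise comparison, as a small sketch:

```sql
-- Under C collation, ordering follows raw byte values, so all
-- uppercase ASCII letters sort before all lowercase ones:
SELECT 'B' < 'a' COLLATE "C";   -- 0x42 vs 0x61
SELECT 'b' < 'A' COLLATE "C";   -- 0x62 vs 0x41
```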
> It's actually all the *other* collations where you should worry about
> their behavior being a moving target :-(.
But then that is to be expected.
Thanks,
Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B