improve Chinese locale performance

Started by Quan Zongliangover 12 years ago16 messageshackers
Jump to latest
#1Quan Zongliang
quanzongliang@gmail.com

Hi hackers,

I tried to improve performance when database is Chinese.

Under openSUSE, create index on table with 54996 rows
locale=C, 140ms
locale=zh_CN, 985ms

I think the function strcoll() of Linux is too slow.
So, I made a new utf8 to GB18030 map, store Chinese order in it.
Do not call strcoll().
On my modified code, same operation, locale=zh_CN, 203ms.

My English is too bad to describe my idea.
Please find the attachment.

The users in China would like to use locale=C, because it is faster.
When need to order, they call function convert() to do.

And I found some wrong order in the locale zh_CN of Linux.

In my test under Windows, there is performance improve too.
Windows XP sp3, vm, locale=Chinese_People's Republic of China.936
original code, 343ms
modified code, 235ms

Maybe, some Asian Languages have same problem.

Regards,
Quan Zongliang

Attachments:

zhCNorder.tar.gzapplication/x-gzip; name=zhCNorder.tar.gzDownload+1-0
#2Craig Ringer
craig@2ndquadrant.com
In reply to: Quan Zongliang (#1)
Re: improve Chinese locale performance

On 07/22/2013 12:17 PM, Quan Zongliang wrote:

Hi hackers,

I tried to improve performance when database is Chinese.

Under openSUSE, create index on table with 54996 rows
locale=C, 140ms
locale=zh_CN, 985ms

I think the function strcoll() of Linux is too slow.
So, I made a new utf8 to GB18030 map, store Chinese order in it.
Do not call strcoll().
On my modified code, same operation, locale=zh_CN, 203ms.

It might be worth looking at gcc's strcoll() implementation. See if it
performs better when you use the latest gcc, and if not try to improve
gcc's strcoll() .

I'd be interested in seeing a test case for this that shows that the
results of your new collation are exactly the same as the original
strcoll() based approach.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Quan Zongliang
quanzongliang@gmail.com
In reply to: Craig Ringer (#2)
Re: improve Chinese locale performance

On 07/22/2013 03:54 PM, Craig Ringer wrote:

On 07/22/2013 12:17 PM, Quan Zongliang wrote:

Hi hackers,

I tried to improve performance when database is Chinese.

Under openSUSE, create index on table with 54996 rows
locale=C, 140ms
locale=zh_CN, 985ms

I think the function strcoll() of Linux is too slow.
So, I made a new utf8 to GB18030 map, store Chinese order in it.
Do not call strcoll().
On my modified code, same operation, locale=zh_CN, 203ms.

It might be worth looking at gcc's strcoll() implementation. See if it
performs better when you use the latest gcc, and if not try to improve
gcc's strcoll() .

I'd be interested in seeing a test case for this that shows that the
results of your new collation are exactly the same as the original
strcoll() based approach.

Do not same exactly.
I found some errors in gcc's strcoll() when order by Chinese character.
Because there are lots of special characters in Chinese.
gcc's strcoll() do not consider this or missed at part of them.

Yes, the best way is to impove gcc's strcoll().
But I don't know how to do.

Thanks,
Quan Zongliang

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Peter Eisentraut
peter_e@gmx.net
In reply to: Craig Ringer (#2)
Re: improve Chinese locale performance

On 7/22/13 3:54 AM, Craig Ringer wrote:

It might be worth looking at gcc's strcoll() implementation. See if it
performs better when you use the latest gcc, and if not try to improve
gcc's strcoll() .

I think part of the problem is that we call strcoll for each comparison,
instead of doing strxfrm once for each datum and then just strcmp for
each comparison. That is effectively equivalent to what the proposal
implements.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Bruce Momjian
bruce@momjian.us
In reply to: Peter Eisentraut (#4)
Re: improve Chinese locale performance

On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

I think part of the problem is that we call strcoll for each comparison,
instead of doing strxfrm once for each datum and then just strcmp for
each comparison. That is effectively equivalent to what the proposal
implements.

Fwiw I used to be a big proponent of using strxfrm. But upon further
analysis I realized it was a real difficult tradeoff. strxrfm saves
potentially a lot of cpu cost but at the expense of expanding the size
of the sort key. If the sort spills to disk or even if it's just
memory bandwidth limited it might actually be slower than doing the
additional cpu work of calling strcoll.

It's hard to see how to decide in advance which way will be faster. I
suspect strxfrm is still the better bet, especially for complex large
character set based locales like Chinese. strcoll might still win by a
large margin on simple mostly-ascii character sets.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Andrew Dunstan
andrew@dunslane.net
In reply to: Bruce Momjian (#5)
Re: improve Chinese locale performance

On 07/22/2013 12:49 PM, Greg Stark wrote:

On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

I think part of the problem is that we call strcoll for each comparison,
instead of doing strxfrm once for each datum and then just strcmp for
each comparison. That is effectively equivalent to what the proposal
implements.

Fwiw I used to be a big proponent of using strxfrm. But upon further
analysis I realized it was a real difficult tradeoff. strxrfm saves
potentially a lot of cpu cost but at the expense of expanding the size
of the sort key. If the sort spills to disk or even if it's just
memory bandwidth limited it might actually be slower than doing the
additional cpu work of calling strcoll.

It's hard to see how to decide in advance which way will be faster. I
suspect strxfrm is still the better bet, especially for complex large
character set based locales like Chinese. strcoll might still win by a
large margin on simple mostly-ascii character sets.

Perhaps we need a bit of performance testing to prove the point.

Maybe the behaviour should be locale-dependent.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#5)
Re: improve Chinese locale performance

On Mon, Jul 22, 2013 at 12:49 PM, Greg Stark <stark@mit.edu> wrote:

On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote:

I think part of the problem is that we call strcoll for each comparison,
instead of doing strxfrm once for each datum and then just strcmp for
each comparison. That is effectively equivalent to what the proposal
implements.

Fwiw I used to be a big proponent of using strxfrm. But upon further
analysis I realized it was a real difficult tradeoff. strxrfm saves
potentially a lot of cpu cost but at the expense of expanding the size
of the sort key. If the sort spills to disk or even if it's just
memory bandwidth limited it might actually be slower than doing the
additional cpu work of calling strcoll.

It's hard to see how to decide in advance which way will be faster. I
suspect strxfrm is still the better bet, especially for complex large
character set based locales like Chinese. strcoll might still win by a
large margin on simple mostly-ascii character sets.

The storage blow-up on systems I've tested is on the order of 10x.
That's possibly fine if the data still fits in memory, but it pretty
much sucks if it makes your sort spill to disk, which seems like a
likely outcome in many cases.

But I don't have much trouble believing the OP's contention that he's
coded a locale-specific version that is faster than the version that
ships with the OS. On glibc, for example, we copy the strings we want
to compare, so that we can add a terminating zero byte. The first
thing that glibc does is call strlen(). That's pretty horrible, and
I'm not at all sure the horror ends there, either.

It would be great to have support for user-defined collations in
PostgreSQL. Let the user provide their own comparison function and
whatever else is needed and use that instead of the OS-specific
support. Aside from the performance advantages, one could even create
collations that have the same names and orderings on all platforms we
support. Our support team has gotten more than one inquiry of the
form "what's the equivalent of Linux collation XYZ on Windows?" - and
telling them that there is no exact equivalent is not the answer the
want to hear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Craig Ringer
craig@2ndquadrant.com
In reply to: Robert Haas (#7)
Re: improve Chinese locale performance

(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

I'd be interested in seeing the rest data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Robert Haas
robertmhaas@gmail.com
In reply to: Craig Ringer (#8)
Re: improve Chinese locale performance

On Tue, Jul 23, 2013 at 9:42 AM, Craig Ringer <craig@2ndquadrant.com> wrote:

(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

Yes.

I'd be interested in seeing the rest data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

I pretty much lost interest in ICU upon reading that they use UTF-16
as their internal format.

http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

What that would mean for us is that instead of copying the input
strings into a temporary buffer and passing the buffer to strcoll(),
we'd need to convert them to ICU's representation (which means writing
twice as many bytes as the length of the input string in cases where
the input string is mostly single-byte characters) and then call ICU's
strcoll() equivalent. I agree that it might be worth testing, but I
can't work up much optimism. It seems to me that something that
operates directly on the server encoding could run a whole lot faster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Quan Zongliang
quanzongliang@gmail.com
In reply to: Craig Ringer (#8)
Re: improve Chinese locale performance

On 07/23/2013 09:42 PM, Craig Ringer wrote:

(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

Yes, we need OS-independent collations.

I'd be interested in seeing the rest data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Martijn van Oosterhout
kleptog@svana.org
In reply to: Robert Haas (#9)
Re: improve Chinese locale performance

On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:

I pretty much lost interest in ICU upon reading that they use UTF-16
as their internal format.

http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

The UTF-8 support has been steadily improving:

For example, icu::Collator::compareUTF8() compares two UTF-8 strings
incrementally, without converting all of the two strings to UTF-16 if
there is an early base letter difference.

http://userguide.icu-project.org/strings/utf-8

For all other encodings you should be able to use an iterator. As to
performance I have no idea.

The main issue with strxfrm() is its lame API. If it supported
returning prefixes you'd be set, but as it is you need >10MB of memory
just to transform a 10MB string, even if only the first few characers
would be enough to sort...

Mvg,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

He who writes carelessly confesses thereby at the very outset that he does
not attach much importance to his own thoughts.

-- Arthur Schopenhauer

#12Robert Haas
robertmhaas@gmail.com
In reply to: Martijn van Oosterhout (#11)
Re: improve Chinese locale performance

On Sun, Jul 28, 2013 at 5:39 AM, Martijn van Oosterhout
<kleptog@svana.org> wrote:

On Tue, Jul 23, 2013 at 10:34:21AM -0400, Robert Haas wrote:

I pretty much lost interest in ICU upon reading that they use UTF-16
as their internal format.

http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

The UTF-8 support has been steadily improving:

For example, icu::Collator::compareUTF8() compares two UTF-8 strings
incrementally, without converting all of the two strings to UTF-16 if
there is an early base letter difference.

http://userguide.icu-project.org/strings/utf-8

For all other encodings you should be able to use an iterator. As to
performance I have no idea.

The main issue with strxfrm() is its lame API. If it supported
returning prefixes you'd be set, but as it is you need >10MB of memory
just to transform a 10MB string, even if only the first few characers
would be enough to sort...

Yep, definitely. And by ">10MB" you mean ">90MB", at least on my Mac,
which is really outrageous.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Quan Zongliang
quanzongliang@gmail.com
In reply to: Craig Ringer (#8)
Re: improve Chinese locale performance

On 07/23/2013 09:42 PM, Craig Ringer wrote:

(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent collations?

I'd be interested in seeing the rest data for this performance report, partly as I'd like to see how ICU collations would compare when ICU is crudely hacked into place for testing.

I think of a new idea.
Add a compare method column to pg_collation.
Every collation has its own compare function or null.
When function varstr_cmp is called, if specified collation
has compare function, call it instead of strcoll().

How about this?

Regards.

Quan Zongliang

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Robert Haas
robertmhaas@gmail.com
In reply to: Quan Zongliang (#13)
Re: improve Chinese locale performance

On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang <quanzongliang@gmail.com> wrote:

I think of a new idea.
Add a compare method column to pg_collation.
Every collation has its own compare function or null.
When function varstr_cmp is called, if specified collation
has compare function, call it instead of strcoll().

I think we're going to need to have two kinds of collations:
OS-derived collations (which get all of their smarts from the OS), and
PG-internal collations (which use PG-aware code for everything).
Which I suspect is a bit more involved than what you're imagining, but
mixing and matching doesn't seem likely to end well.

However, what you're proposing might serve as a useful demonstration
of how much performance there is to be gained here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Quan Zongliang
quanzongliang@gmail.com
In reply to: Robert Haas (#14)
Re: improve Chinese locale performance

On 09/06/2013 01:02 AM, Robert Haas wrote:

On Wed, Sep 4, 2013 at 11:02 PM, Quan Zongliang <quanzongliang@gmail.com> wrote:

I think of a new idea.
Add a compare method column to pg_collation.
Every collation has its own compare function or null.
When function varstr_cmp is called, if specified collation
has compare function, call it instead of strcoll().

I think we're going to need to have two kinds of collations:
OS-derived collations (which get all of their smarts from the OS), and
PG-internal collations (which use PG-aware code for everything).
Which I suspect is a bit more involved than what you're imagining, but
mixing and matching doesn't seem likely to end well.

However, what you're proposing might serve as a useful demonstration
of how much performance there is to be gained here.

Understood.

I just try to speed up text compare, not redesign locale.

Do you have a plan to do this?

Thank you.

Quan Zongliang

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Robert Haas
robertmhaas@gmail.com
In reply to: Quan Zongliang (#15)
Re: improve Chinese locale performance

On Mon, Sep 9, 2013 at 5:22 AM, Quan Zongliang <quanzongliang@gmail.com> wrote:

Understood.

I just try to speed up text compare, not redesign locale.

Do you have a plan to do this?

Not any time soon, anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers