LIKE optimization in UTF-8 and locale-C
Hello,
I found that LIKE operators are slower on multi-byte encoding databases
than on single-byte ones. The difference comes from MatchText()
versus MBMatchText().
We already have an optimization for single-byte encodings, guarded by a
pg_database_encoding_max_length() == 1 test. I propose extending it to
the case of UTF-8 with locale C. In UTF-8, trailing bytes of multi-byte
characters never coincide with first bytes, so in that case we can use the
single-byte functions and byte-wise comparison. With the attached patch, the performance
of UTF-8 LIKE operators comes close to that of the single-byte encodings.
Databases initialized with locale C are widely used in Japan, because
the Japanese locale is broken on most platforms. Japanese users can
choose EUC-JP or UTF-8 as the server encoding, but I think UTF-8 will be
used more and more in the future.
---- test ----
initdb --no-locale --encoding=<encoding>
[HEAD]
SQL_ASCII : 7171 ms / 7203 ms / 7187 ms
LATIN1 : 7172 ms / 7156 ms / 7141 ms
UTF8 : 16235 ms / 16281 ms / 16281 ms
EUC_JP : 17454 ms / 17453 ms / 17438 ms
[with patch]
SQL_ASCII : 7062 ms / 7125 ms / 7125 ms
LATIN1 : 7047 ms / 7063 ms / 7047 ms
UTF8 : 7188 ms / 7234 ms / 7235 ms
EUC_JP : 17468 ms / 17453 ms / 17453 ms
CREATE OR REPLACE FUNCTION test() RETURNS INTEGER AS $$
DECLARE
    cnt integer;
BEGIN
    FOR i IN 1..1000 LOOP
        SELECT count(*) INTO cnt FROM item WHERE i_title LIKE '%BABABABABARIBA%' LIMIT 50;
    END LOOP;
    RETURN cnt;
END;
$$ LANGUAGE plpgsql;
SELECT count(*) FROM item; -- borrowed from DBT-1 (TPC-W)
count
-------
10000
(1 row)
---- patch ----
Index: src/backend/utils/adt/like.c
===================================================================
--- src/backend/utils/adt/like.c (head)
+++ src/backend/utils/adt/like.c (working copy)
@@ -21,6 +21,7 @@
#include "mb/pg_wchar.h"
#include "utils/builtins.h"
+#include "utils/pg_locale.h"
#define LIKE_TRUE 1
@@ -119,6 +120,13 @@
/*
+ * true iff match functions for single-byte characters are available.
+ */
+#define sb_match_available() \
+ (pg_database_encoding_max_length() == 1 || \
+ (lc_collate_is_c() && GetDatabaseEncoding() == PG_UTF8))
+
+/*
* interface routines called by the function manager
*/
@@ -138,7 +146,7 @@
p = VARDATA(pat);
plen = (VARSIZE(pat) - VARHDRSZ);
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
else
result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);
@@ -162,7 +170,7 @@
p = VARDATA(pat);
plen = (VARSIZE(pat) - VARHDRSZ);
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
else
result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);
@@ -186,7 +194,7 @@
p = VARDATA(pat);
plen = (VARSIZE(pat) - VARHDRSZ);
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
else
result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);
@@ -210,7 +218,7 @@
p = VARDATA(pat);
plen = (VARSIZE(pat) - VARHDRSZ);
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
else
result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);
@@ -275,7 +283,7 @@
int slen,
plen;
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
{
s = NameStr(*str);
slen = strlen(s);
@@ -316,7 +324,7 @@
int slen,
plen;
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
{
s = NameStr(*str);
slen = strlen(s);
@@ -357,7 +365,7 @@
int slen,
plen;
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
{
s = VARDATA(str);
slen = (VARSIZE(str) - VARHDRSZ);
@@ -393,7 +401,7 @@
int slen,
plen;
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
{
s = VARDATA(str);
slen = (VARSIZE(str) - VARHDRSZ);
@@ -429,7 +437,7 @@
text *esc = PG_GETARG_TEXT_P(1);
text *result;
- if (pg_database_encoding_max_length() == 1)
+ if (sb_match_available())
result = do_like_escape(pat, esc);
else
result = MB_do_like_escape(pat, esc);
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
I found LIKE operators are slower on multi-byte encoding databases
than single-byte encoding ones. It comes from difference between
MatchText() and MBMatchText().
We've had an optimization for single-byte encodings using
pg_database_encoding_max_length() == 1 test. I'll propose to extend it
in UTF-8 with locale-C case.
If this works for UTF8, won't it work for all the backend-legal
encodings?
regards, tom lane
On Thu, 2007-03-22 at 11:08, Tom Lane wrote:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
I found LIKE operators are slower on multi-byte encoding databases
than single-byte encoding ones. It comes from difference between
MatchText() and MBMatchText().
We've had an optimization for single-byte encodings using
pg_database_encoding_max_length() == 1 test. I'll propose to extend it
in UTF-8 with locale-C case.

If this works for UTF8, won't it work for all the backend-legal
encodings?
I guess it works well for % but not for _ , the latter has to know, how
many bytes the current (multibyte) character covers.
The length is still easy to find out for UTF8 encoding, so it may be
feasible to write UTF8MatchText() that is still faster than
MBMatchText().
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Hannu Krosing <hannu@skype.net> wrote:
We've had an optimization for single-byte encodings using
pg_database_encoding_max_length() == 1 test. I'll propose to extend it
in UTF-8 with locale-C case.

If this works for UTF8, won't it work for all the backend-legal
encodings?

I guess it works well for % but not for _ , the latter has to know, how
many bytes the current (multibyte) character covers.
Yes, % is not used in trailing bytes in any of the encodings, but _ is
used in some of them. I think we can use the optimization for all
of the server encodings except JOHAB.
Also, I noticed that locale settings are not used in LIKE matching,
so the following should be enough for checking the availability of the
byte-wise matching functions, or am I missing something?
#define sb_match_available() (GetDatabaseEncoding() != PG_JOHAB)
Multi-byte encodings supported as a server encoding:
| % 0x25 | _ 0x5f | \ 0x5c |
--------------+--------+--------+--------+-
EUC_JP | unused | unused | unused |
EUC_CN | unused | unused | unused |
EUC_KR | unused | unused | unused |
EUC_TW | unused | unused | unused |
JOHAB | unused | *used* | *used* |
UTF8 | unused | unused | unused |
MULE_INTERNAL | unused | unused | unused |
Just for reference, encodings only supported as a client encoding.
| % 0x25 | _ 0x5f | \ 0x5c |
--------------+--------+--------+--------+-
SJIS | unused | *used* | *used* |
BIG5 | unused | *used* | *used* |
GBK | unused | *used* | *used* |
UHC | unused | unused | unused |
GB18030 | unused | *used* | *used* |
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro skrev:
I guess it works well for % but not for _ , the latter has to know, how
many bytes the current (multibyte) character covers.

Yes, % is not used in trailing bytes for all encodings, but _ is
used in some of them. I think we can use the optimization for all
of the server encodings except JOHAB.
The problem with the LIKE pattern _ is that it has to know how long the
single character is that it should pass over. Say you have a UTF-8 string
with 2 characters encoded in 3 bytes ('ÖA'), where the first character
is 2 bytes:
0xC3 0x96 'A'
and now you want to match that with the LIKE pattern:
'_A'
How would that work in the C locale?
Maybe one should simply write a special version of LIKE for the UTF-8
encoding since it's probably the most used encoding today. But I don't
think you can use the C locale and that it would work for UTF-8.
/Dennis
Dennis Bjorklund <db@zigo.dhs.org> wrote:
The problem with the like pattern _ is that it has to know how long the
single character is that it should pass over. Say you have a UTF-8 string
with 2 characters encoded in 3 bytes ('ÖA'). Where the first character
is 2 bytes:
0xC3 0x96 'A'
and now you want to match that with the LIKE pattern:
'_A'
Thanks, it all made sense to me. My proposal was completely wrong.
The optimization of MBMatchText() seems to be the right way...
Maybe one should simply write a special version of LIKE for the UTF-8
encoding since it's probably the most used encoding today. But I don't
think you can use the C locale and that it would work for UTF-8.
But then, the present LIKE matching is not locale-aware; we treat multi-byte
characters properly, but always perform a char-by-char comparison.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On 2007-03-22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
I found LIKE operators are slower on multi-byte encoding databases
than single-byte encoding ones. It comes from difference between
MatchText() and MBMatchText().
We've had an optimization for single-byte encodings using
pg_database_encoding_max_length() == 1 test. I'll propose to extend it
in UTF-8 with locale-C case.

If this works for UTF8, won't it work for all the backend-legal
encodings?
It works for UTF8 only because UTF8 has special properties which are not
shared by, for example, EUC_*. Specifically, in UTF8 the octet sequence
for a multibyte character will never appear as a subsequence of the octet
sequence of a string of other multibyte characters. i.e. given a string
of two two-octet characters AB, the second octet of A plus the first octet
of B is not a valid UTF8 character (and likewise for longer characters).
(And while I haven't tested it, it looks like the patch posted doesn't
account properly for the use of _, so it needs a bit more work.)
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
On 2007-03-23, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:
Thanks, it all made sense to me. My proposal was completely wrong.
Actually, I think your proposal is fundamentally correct, merely incomplete.
Doing octet-based rather than character-based matching of strings is a
_design goal_ of UTF8. Treating UTF8 like any other multibyte charset and
converting everything to wide-chars is, in my opinion, always going to
result in suboptimal performance.
--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
On Fri, 2007-03-23 at 06:10, Andrew - Supernews wrote:
On 2007-03-23, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:
Thanks, it all made sense to me. My proposal was completely wrong.
Actually, I think your proposal is fundamentally correct, merely incomplete.
Doing octet-based rather than character-based matching of strings is a
_design goal_ of UTF8. Treating UTF8 like any other multibyte charset and
converting everything to wide-chars is, in my opinion, always going to
result in suboptimal performance.
Yes, that was what I meant by proposing a utf8-specific UTF8MatchText(),
which should not convert everything to wide chars, but instead do
byte-by-byte comparison and just be aware of the UTF-8 encoding, where it is
easy to know how many bytes (octets) each encoded character takes.
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Andrew - Supernews <andrew+nonews@supernews.com> wrote:
Actually, I think your proposal is fundamentally correct, merely incomplete.
Yeah, I fixed the patch to handle '_' correctly.
Doing octet-based rather than character-based matching of strings is a
_design goal_ of UTF8.
I think all "safe ASCII-superset" encodings are comparable by bytes,
not only UTF-8: all of their multi-byte characters consist of bytes larger
than 127. I updated the patch on this presupposition. It uses octet-based
matching in general and character-based matching only at '_'.
There was a 30%+ performance win for selections using multi-byte LIKE '%foo%'.
encoding | HEAD | patched
-----------+---------+---------
SQL_ASCII | 7094ms | 7062ms
LATIN1 | 7083ms | 7078ms
UTF8 | 17974ms | 11635ms (64.7%)
EUC_JP | 17032ms | 12109ms (71.1%)
If this patch is acceptable, please drop JOHAB encoding from server encodings
before it is applied. Trailing bytes of JOHAB can be less than 128.
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01475.php
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Attachments:
mbtextmatch.patch (application/octet-stream), +46 -41
"Andrew - Supernews" <andrew@supernews.net> wrote:
ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
ITAGAKI> by bytes, not only UTF-8.

This is false, particularly for EUC.

Umm, I see. I updated the optimization to be used only for the UTF8 case.
I also added some inlining hints that are useful on my machine (Pentium 4).
x1000 of LIKE '%foo%' on 10000-row tables [ms]
encoding | HEAD | P1 | P2 | P3
-----------+-------+-------+-------+-------
SQL_ASCII | 7094 | 7120 | 7063 | 7031
LATIN1 | 7083 | 7130 | 7057 | 7031
UTF8 | 17974 | 10859 | 10839 | 9682
EUC_JP | 17032 | 17557 | 17599 | 15240
- P1: UTF8MatchText()
- P2: P1 + __inline__ GenericMatchText()
- P3: P2 + __inline__ wchareq()
(The attached patch is P3.)
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Attachments:
utf8matchtext.patch (application/octet-stream), +325 -92
I assume this replaces all your earlier multi-byte LIKE patches.
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:
#define CHAREQ(p1, p2) wchareq(p1, p2)
#define ICHAREQ(p1, p2) wchareq(p1, p2)
Why are there two functions? Also, can't you use one function and just
pass a boolean to indicate whether case it be ignored?
Bruce Momjian wrote:
I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:

#define CHAREQ(p1, p2) wchareq(p1, p2)
#define ICHAREQ(p1, p2) wchareq(p1, p2)

Why are there two functions? Also, can't you use one function and just
pass a boolean to indicate whether case it be ignored?
Sorry, typo:
Why are there two functions? Also, can't you use one function and just
pass a boolean to indicate whether case should be ignored?
Bruce Momjian <bruce@momjian.us> wrote:
I do not understand this patch. You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically.

Why are there two functions? Also, can't you use one function and just
pass a boolean to indicate whether case should be ignored?

The same is true of MBMatchText() and MBMatchTextIC().
Now, I'll split the patch into two changes.

1. DropMBMatchTextIC.patch
Drop MBMatchTextIC() and use MBMatchText() instead.

2. UTF8MatchText.patch
Add UTF8MatchText() as a specialized version of MBMatchText().

As future work, it might be good to investigate rewriting
"col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in the planner, so that
we can avoid calling lower() on a constant pattern on the right-hand side and
can use functional indexes (lower(col)). I think we will never need MBMatchTextIC()
in the future unless we move to a wide-character server encoding :)
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Patch removed, updated version submitted.
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
Itagaki,
I find this still fairly unclean. It certainly took me some time to get
my head around what's going on.
ISTM we should generate all these match functions from one body of code
plus some #define magic.
As I understand it, we have three possible encoding switches: Single
Byte, UTF8 and other Multi Byte Charsets, and two possible case
settings: case Sensitive and Case Insensitive. That would make for a
total of six functions, but in the case of both UTF8 and other MBCS we
don't need a special Case Insensitive function - instead we downcase
both the string and the pattern and then use the Case Sensitive
function. That leaves a total of four functions.
What is not clear to me is why the UTF8 optimisation works, and why it
doesn't apply to other MBCS. At the very least we need a comment on that.
I also find the existing function naming convention somewhat annoying -
having foo() and MB_foo() is less than clear. I'd rather have SB_foo()
and MB_foo(). That's not your fault, of course.
If you supply me with some explanation on the UTF8 optimisation issue,
I'll prepare a revised patch along these lines.
cheers
andrew
I wrote:
If you supply me with some explanation on the UTF8 optimisation issue,
I'll prepare a revised patch along these lines.
Ok, I have studied some more and I think I understand what's going on.
AIUI, we are switching from some expensive char-wise comparisons to
cheap byte-wise comparisons in the UTF8 case because we know that in
UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
character sequence. Is that putting it too mildly? Do we need stronger
conditions than that? If it's correct, are there other MBCS for which
this is true?
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
Ok, I have studied some more and I think I understand what's going on.
AIUI, we are switching from some expensive char-wise comparisons to
cheap byte-wise comparisons in the UTF8 case because we know that in
UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
character sequence. Is that putting it too mildly? Do we need stronger
conditions than that? If it's correct, are there other MBCS for which
this is true?
I don't think this is a correct analysis. If it were correct then we
could use the optimization for all backend charsets because none of them
allow MB characters to contain non-high-bit-set bytes. But it was
stated somewhere upthread that that doesn't actually work. Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.
I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.
regards, tom lane