like/ilike improvements

tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: Andrew Dunstan (#1)

Re: like/ilike improvements

Andrew Dunstan <andrew@dunslane.net> writes:

... It turns out (according to the analysis) that the
only time we actually need to use NextChar is when we are matching an
"_" in a like/ilike pattern.

I thought we'd determined that advancing bytewise for "%" was also risky,
in two cases:

1. Multibyte character set that is not UTF8 (more specifically, does not
have a guarantee that first bytes and not-first bytes are distinct)

2. "_" immediately follows the "%".

regards, tom lane

andrew@dunslane.net

almost 19 years ago

In reply to: Tom Lane (#2)

Re: like/ilike improvements

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

... It turns out (according to the analysis) that the
only time we actually need to use NextChar is when we are matching an
"_" in a like/ilike pattern.

I thought we'd determined that advancing bytewise for "%" was also risky,
in two cases:

1. Multibyte character set that is not UTF8 (more specifically, does not
have a guarantee that first bytes and not-first bytes are distinct)

I will review - I thought we had ruled that out.

Which non-UTF8 multi-byte charset would be best to test with?

2. "_" immediately follows the "%".

The patch in fact calls NextChar in this case.

cheers

andrew

andrew@dunslane.net

almost 19 years ago

In reply to: Andrew Dunstan (#3)

Re: like/ilike improvements

Andrew Dunstan wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

... It turns out (according to the analysis) that the only time we
actually need to use NextChar is when we are matching an "_" in a
like/ilike pattern.

I thought we'd determined that advancing bytewise for "%" was also
risky,
in two cases:

1. Multibyte character set that is not UTF8 (more specifically, does not
have a guarantee that first bytes and not-first bytes are distinct)

I thought we disposed of the idea that there was a problem with charsets
that didn't do first byte special.

And Dennis said:

Tom Lane skrev:

You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.

This is only a problem when the pattern contains '%_'; we could detect
that case and advance byte by byte when it is absent. Right now we
check (*p == '\\') || (*p == '_') in each iteration when we scan over
characters for '%', and we could do it once and have different loops
for the two cases.

That's pretty much what the patch does now - it never tries to match a
single byte when it sees "_", whether or not preceded by "%".

cheers

andrew

tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: Andrew Dunstan (#4)

Re: like/ilike improvements

Andrew Dunstan <andrew@dunslane.net> writes:

Tom Lane wrote:

I thought we'd determined that advancing bytewise for "%" was also
risky, in two cases:

1. Multibyte character set that is not UTF8 (more specifically, does not
have a guarantee that first bytes and not-first bytes are distinct)

I thought we disposed of the idea that there was a problem with charsets
that didn't do first byte special.

We disposed of that in connection with a version of the patch that had
"%" advancing in NextChar units, so that comparison of ordinary
characters was always safely char-aligned. Consider 2-byte characters
represented as {AB} etc:

DATA x{AB}{CD}y

PATTERN %{BC}%

If "%" advances by bytes then this will find a spurious match. The
only thing that prevents it is if "B" can't be both a leading and a
trailing byte of validly-encoded MB characters.
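
To make the example above concrete, here is a minimal sketch (not code from the patch). It assumes a hypothetical 2-byte encoding in which any byte >= 0x80 starts a 2-byte character and may also appear as a trailing byte; the names char_len and contains are invented for illustration. The search for the middle of a %...% pattern is done either at every byte offset or only at character boundaries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding, for illustration only: any byte >= 0x80 starts a
 * 2-byte character, and the second byte may also be >= 0x80. */
static size_t char_len(unsigned char c)
{
    return (c >= 0x80) ? 2 : 1;
}

/* Look for needle inside hay. With bytewise == true every byte offset is
 * tried; otherwise only offsets that begin a character are tried. */
static bool contains(const unsigned char *hay, size_t hlen,
                     const unsigned char *needle, size_t nlen,
                     bool bytewise)
{
    size_t i = 0;

    assert(hay != NULL && needle != NULL);
    while (i + nlen <= hlen)
    {
        if (memcmp(hay + i, needle, nlen) == 0)
            return true;
        /* Advance one byte, or one whole character. */
        i += bytewise ? 1 : char_len(hay[i]);
    }
    return false;
}
```

With DATA encoded as the bytes 'x' 0x8A 0x8B 0x8C 0x8D 'y' and the pattern character {BC} as 0x8B 0x8C, the bytewise search reports a match at offset 2, which straddles two characters, while the character-aligned search reports none.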

regards, tom lane

guillaume.smet@gmail.com

almost 19 years ago

In reply to: Andrew Dunstan (#1)

Re: like/ilike improvements

On 5/22/07, Andrew Dunstan <andrew@dunslane.net> wrote:

But before I commit this I'd appreciate seeing some more testing, both
for correctness and performance.

Any chance the patch applies cleanly on a 8.2 code base? I can test it
on a real life 8.2 db but I won't have the time to load the data in a
CVS HEAD one.
If there is no obvious reason for it to fail on 8.2, I'll try to see
if I can apply it.

Thanks.

--
Guillaume

Andrew - Supernews

andrew+nonews@supernews.com

almost 19 years ago

In reply to: Andrew Dunstan (#1)

Re: like/ilike improvements

On 2007-05-22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If "%" advances by bytes then this will find a spurious match. The
only thing that prevents it is if "B" can't be both a leading and a
trailing byte of validly-encoded MB characters.

Which is (by design) true in UTF8, but is not true of most other
multibyte charsets.

The %_ case is also trivially handled in UTF8 by simply ensuring that
_ doesn't match a non-initial octet. This allows % to advance by bytes
without danger of losing sync.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

Mark Mielke

mark@mark.mielke.cc

almost 19 years ago

In reply to: Tom Lane (#2)

Re: like/ilike improvements

On Tue, May 22, 2007 at 12:12:51PM -0400, Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

... It turns out (according to the analysis) that the
only time we actually need to use NextChar is when we are matching an
"_" in a like/ilike pattern.

I thought we'd determined that advancing bytewise for "%" was also risky,
in two cases:
1. Multibyte character set that is not UTF8 (more specifically, does not
have a guarantee that first bytes and not-first bytes are distinct)
2. "_" immediately follows the "%".

Have you considered a two pass approach? First pass - match on bytes.
Only if you find a match with the first pass, start a second pass to
do a 'safe' check?

Are there optimizations to recognize whether the index was created as
lower(field) or upper(field), and translate ILIKE to the appropriate
one?

Cheers,
mark


tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: Andrew - Supernews (#7)

Re: like/ilike improvements

Andrew - Supernews <andrew+nonews@supernews.com> writes:

On 2007-05-22, Tom Lane <tgl@sss.pgh.pa.us> wrote:

If "%" advances by bytes then this will find a spurious match. The
only thing that prevents it is if "B" can't be both a leading and a
trailing byte of validly-encoded MB characters.

Which is (by design) true in UTF8, but is not true of most other
multibyte charsets.

The %_ case is also trivially handled in UTF8 by simply ensuring that
_ doesn't match a non-initial octet. This allows % to advance by bytes
without danger of losing sync.

Yeah. It seems we need three comparison functions after all:

1. Single-byte character set: needs NextByte and ByteEq only.

2. Generic multi-byte character set: both % and _ must advance by
characters to ensure we never try an out-of-alignment character
comparison. But simple character comparison works bytewise given
that. So primitives are NextChar, NextByte, ByteEq.

3. UTF8: % can advance bytewise. _ must check it is on a first byte
(else return match failure) and if so do NextChar. So primitives
are NextChar, NextByte, ByteEq, IsFirstByte.

In no case do we need CharEq. I'd be inclined to drop ByteEq as a
macro and just use "==", too.
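
As a rough illustration of case 3 (a sketch under stated assumptions, not the patch itself): UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a first-byte test is a single mask-and-compare. The function names below are hypothetical, and the input is assumed to be validly encoded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* UTF-8 continuation bytes look like 10xxxxxx; anything else starts a
 * character. */
static bool is_first_byte(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

/* Byte length of the character whose lead byte is c (assumes valid UTF-8). */
static size_t utf8_char_len(unsigned char c)
{
    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/* Handle "_": fail unless we are on a first byte, then consume one whole
 * character. Returns the number of bytes consumed, or 0 on match failure. */
static size_t match_underscore(const unsigned char *s, size_t slen)
{
    size_t n;

    assert(s != NULL);
    if (slen == 0 || !is_first_byte(s[0]))
        return 0;
    n = utf8_char_len(s[0]);
    return (n <= slen) ? n : 0;
}
```

Because "_" refuses to match in the middle of a character, a "%" that has advanced bytewise into a continuation byte simply fails there and tries the next byte.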

regards, tom lane

#10

andrew@dunslane.net

almost 19 years ago

In reply to: Tom Lane (#9)

Re: like/ilike improvements

Tom Lane wrote:

Yeah. It seems we need three comparison functions after all:

Yeah, that was my confusion. I thought we had concluded that we didn't,
but clearly we do.

1. Single-byte character set: needs NextByte and ByteEq only.

2. Generic multi-byte character set: both % and _ must advance by
characters to ensure we never try an out-of-alignment character
comparison. But simple character comparison works bytewise given
that. So primitives are NextChar, NextByte, ByteEq.

3. UTF8: % can advance bytewise. _ must check it is on a first byte
(else return match failure) and if so do NextChar. So primitives
are NextChar, NextByte, ByteEq, IsFirstByte.

In no case do we need CharEq. I'd be inclined to drop ByteEq as a
macro and just use "==", too.

I'll work this up. I think it will be easier if I marry cases 1 and 2,
with NextChar being the same as NextByte in the single byte case.

cheers

andrew

#11

Dennis Bjorklund

db@zigo.dhs.org

almost 19 years ago

In reply to: Andrew Dunstan (#4)

Re: like/ilike improvements

And Dennis said:

This is only a problem when the pattern contains '%_'; we could detect
that case and advance byte by byte when it is absent. Right now we
check (*p == '\\') || (*p == '_') in each iteration when we scan over
characters for '%', and we could do it once and have different loops
for the two cases.

That's pretty much what the patch does now - it never tries to match a
single byte when it sees "_", whether or not preceded by "%".

My comment was about UTF-8 since I thought we were making a special
version for UTF-8. I don't know what properties other multibyte encodings
have.

/Dennis

#12

andrew@dunslane.net

almost 19 years ago

In reply to: Tom Lane (#9)

Re: like/ilike improvements

Tom Lane wrote:

3. UTF8: % can advance bytewise. _ must check it is on a first byte
(else return match failure) and if so do NextChar. So primitives
are NextChar, NextByte, ByteEq, IsFirstByte.

We should only be able to get out of step from the "%_" case, I believe,
so we should only need to do the first-byte test in that case (which is
in a different code path from the normal "_" case). Does that seem right?

cheers

andrew

#13

tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: Andrew Dunstan (#12)

Re: like/ilike improvements

Andrew Dunstan <andrew@dunslane.net> writes:

We should only be able to get out of step from the "%_" case, I believe,
so we should only need to do the first-byte test in that case (which is
in a different code path from the normal "_" case). Does that seem right?

At least put Assert(IsFirstByte()) in the main path.

I'm a bit suspicious of the separate-path business anyway. Will it do
the right thing with say "%%%_" ?

regards, tom lane

#14

andrew@dunslane.net

almost 19 years ago

In reply to: Tom Lane (#13)

Re: like/ilike improvements

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

We should only be able to get out of step from the "%_" case, I believe,
so we should only need to do the first-byte test in that case (which is
in a different code path from the normal "_" case). Does that seem right?

At least put Assert(IsFirstByte()) in the main path.

I'm a bit suspicious of the separate-path business anyway. Will it do
the right thing with say "%%%_" ?

Yes:

    /* %% is the same as % according to the SQL standard */
    /* Advance past all %'s */
    while ((plen > 0) && (*p == '%'))
        NextByte(p, plen);

cheers

andrew

#15

andrew@dunslane.net

almost 19 years ago

In reply to: Andrew Dunstan (#14)

Re: like/ilike improvements

Andrew Dunstan wrote:

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

We should only be able to get out of step from the "%_" case, I
believe, so we should only need to do the first-byte test in that
case (which is in a different code path from the normal "_" case).
Does that seem right?

At least put Assert(IsFirstByte()) in the main path.

I'm a bit suspicious of the separate-path business anyway. Will it do
the right thing with say "%%%_" ?

Yes:

    /* %% is the same as % according to the SQL standard */
    /* Advance past all %'s */
    while ((plen > 0) && (*p == '%'))
        NextByte(p, plen);

I am also wondering if it might be sensible to make this choice once at
backend startup and store a function pointer, instead of doing it for
every string processed by like/ilike:

    if (pg_database_encoding_max_length() == 1)
        return SB_MatchText(s, slen, p, plen);
    else if (GetDatabaseEncoding() == PG_UTF8)
        return UTF8_MatchText(s, slen, p, plen);
    else
        return MB_MatchText(s, slen, p, plen);

I guess that might make matters harder if we ever got per-column encodings.

cheers

andrew

#16

tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: Andrew Dunstan (#15)

Re: like/ilike improvements

Andrew Dunstan <andrew@dunslane.net> writes:

I am also wondering if it might be sensible to make this choice once at
backend startup and store a function pointer, instead of doing it for
every string processed by like/ilike:

    if (pg_database_encoding_max_length() == 1)
        return SB_MatchText(s, slen, p, plen);
    else if (GetDatabaseEncoding() == PG_UTF8)
        return UTF8_MatchText(s, slen, p, plen);
    else
        return MB_MatchText(s, slen, p, plen);

I guess that might make matters harder if we ever got per-column encodings.

Yeah. It's not saving much anyway ... I wouldn't bother.

regards, tom lane

#17

andrew@dunslane.net

almost 19 years ago

In reply to: Tom Lane (#13)

Re: like/ilike improvements

Tom Lane wrote:

Andrew Dunstan <andrew@dunslane.net> writes:

We should only be able to get out of step from the "%_" case, I believe,
so we should only need to do the first-byte test in that case (which is
in a different code path from the normal "_" case). Does that seem right?

At least put Assert(IsFirstByte()) in the main path.

I'm a bit suspicious of the separate-path business anyway. Will it do
the right thing with say "%%%_" ?

OK, Here is a patch that I am fairly confident does what's been
discussed, as summarised by Tom.

To answer Guillaume's question - it probably won't apply cleanly to 8.2
sources.

cheers

andrew


#19

tgl@sss.pgh.pa.us

almost 19 years ago

In reply to: Andrew Dunstan (#17)

Re: like/ilike improvements

Andrew Dunstan <andrew@dunslane.net> writes:

OK, Here is a patch that I am fairly confident does what's been
discussed, as summarised by Tom.

! #define CHAREQ(p1, p2) (*p1 == *p2)
...
+ #define IsFirstByte(c) ((*c & 0xC0) != 0x80)

These macros are bugs waiting to happen. Please parenthesize the
arguments.

The header comment for like_match.c needs more love:

* This file is included by like.c *twice*, to provide an optimization
* for single-byte encodings.

I'm not sure I believe the new coding for %-matching at all, and I
certainly don't like the 100% lack of comments explaining why the
different cases are necessary and just how they differ. In particular,
once we've advanced more than one character, why does it still matter
what was immediately after the %?

There should somewhere be a block comment explaining all the reasoning
we've so painfully gone through about why the three cases (SB, MB, UTF8)
are needed and how they must differ.

regards, tom lane

#20

andrew@dunslane.net

almost 19 years ago

In reply to: Tom Lane (#19)

Re: like/ilike improvements

Tom Lane wrote:

I'm not sure I believe the new coding for %-matching at all, and I
certainly don't like the 100% lack of comments explaining why the
different cases are necessary and just how they differ. In particular,
once we've advanced more than one character, why does it still matter
what was immediately after the %?

I don't understand the question. The % processing looks for a place that
matches what is immediately after the % and then tries to match the
remainder using a recursive call - so it never actually does matter. I
haven't actually changed the fundamental logic AFAIK, I have just
rearranged and optimised it some.
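
For readers following along, the shape of that logic can be sketched as below. This is a deliberately simplified single-byte matcher with no escape handling, written for illustration; like_match is a hypothetical name and this is not the code in like_match.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified LIKE matcher: single-byte text, '%' and '_' only, no escapes. */
static bool like_match(const char *s, size_t slen, const char *p, size_t plen)
{
    assert(s != NULL && p != NULL);
    while (slen > 0 && plen > 0)
    {
        if (*p == '%')
        {
            /* A run of %'s is equivalent to a single %. */
            while (plen > 0 && *p == '%')
            {
                p++;
                plen--;
            }
            if (plen == 0)
                return true;    /* trailing % matches the rest of the text */

            /* Find each place that matches what follows the %, then try to
             * match the remainder with a recursive call. */
            for (; slen > 0; s++, slen--)
            {
                if ((*p == '_' || *p == *s) && like_match(s, slen, p, plen))
                    return true;
            }
            return false;
        }
        if (*p != '_' && *p != *s)
            return false;
        s++;
        slen--;
        p++;
        plen--;
    }

    /* Text exhausted: any pattern left over must consist only of %'s. */
    while (plen > 0 && *p == '%')
    {
        p++;
        plen--;
    }
    return slen == 0 && plen == 0;
}
```

The character after the % is only used to skip positions that cannot possibly start a match; the recursive call does the real work, which is why the pattern "%%%_" behaves the same as "%_".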

I admit that it takes some pondering to understand - I certainly intend
to adjust the comments once we are satisfied the code is right. It's
going to be next week now before I finish this up :-(

cheers

andrew
