Broken select on regular expression !!!

Started by Constantin Teodorescuover 26 years ago17 messages

Hi all,

I think I have found a bug in regexp based selections.
Watch this :

create table regdemo (fld1 varchar(32));
CREATE
insert into regdemo values('410');
INSERT 726409 1
insert into regdemo values('7410');
INSERT 726410 1
insert into regdemo values('source');
INSERT 726411 1
insert into regdemo values('destination');
INSERT 726412 1
select * from regdemo where fld1 ~* '^sou|^des';
fld1
-----------
source
destination
(2 rows)

select * from regdemo where fld1 ~* '41|^des';
fld1
-----------
410
7410
destination
(3 rows)

select * from regdemo where fld1 ~* '^41|^des';
fld1
----
(0 rows)

^^^^^^^^^^^^^^
!?!?!?!
I thought it should return '410' and 'destination' rows. But it returns
nothing!
The first select example with ^ in both variants ( ^sou|^des ) works !!!
The last one ( ^41|^des ) don't !

Am I missing something?
I am getting the same result also on 6.4.2 and 6.5 beta 1 versions!

Best regards,
--
Constantin Teodorescu
FLEX Consulting Braila, ROMANIA

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Constantin Teodorescu (#1)
Re: [HACKERS] Broken select on regular expression !!!

Constantin Teodorescu <teo@flex.ro> writes:

select * from regdemo where fld1 ~* '^41|^des';
fld1
----
(0 rows)

^^^^^^^^^^^^^^
!?!?!?!

I see it too. Even more interesting is that these variants are OK:

regression=> select * from regdemo where fld1 ~* '^des|^41';
fld1
-----------
410
destination
(2 rows)

regression=> select * from regdemo where fld1 ~* '(^41)|(^des)';
fld1
-----------
410
destination
(2 rows)

And if you want *really* disturbing:

regression=> select * from regdemo where fld1 ~* '^sou|^des';
fld1
-----------
source
destination
(2 rows)

regression=> select * from regdemo where fld1 ~ '^sou|^des';
fld1
----
(0 rows)

Something is rotten in the state of Denmark...

regards, tom lane

#3Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#2)
Re: [HACKERS] Broken select on regular expression !!!

select * from regdemo where fld1 ~* '^41|^des';
fld1
----
(0 rows)

^^^^^^^^^^^^^^
!?!?!?!

I see it too. Even more interesting is that these variants are OK:

regression=> select * from regdemo where fld1 ~* '^des|^41';
fld1
-----------
410
destination
(2 rows)

regression=> select * from regdemo where fld1 ~* '(^41)|(^des)';
fld1
-----------
410
destination
(2 rows)

And if you want *really* disturbing:

regression=> select * from regdemo where fld1 ~* '^sou|^des';
fld1
-----------
source
destination
(2 rows)

regression=> select * from regdemo where fld1 ~ '^sou|^des';
fld1
----
(0 rows)

Something is rotten in the state of Denmark...

These all oddness are caused by the parser (makeIndexable). When
makeIndexable sees ~* '^41|^des' , it tries to rewrite the target
regexp so that an index can be used. The rewritten query might be
something like:

fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377'

Apparently this is wrong. This is because makeIndexable does not
understand '|' and '^' appearing in the middle of the regexp. On the
other hand,

regression=> select * from regdemo where fld1 ~* '^des|^41';
regression=> select * from regdemo where fld1 ~* '^sou|^des';

will work since makeIndexable gave up the optimization if the op is
"~*" and a letter appearing right after '^' is *alphabet*.

Note that:

regression=> select * from regdemo where fld1 ~ '^sou|^des';

will not work because the op is *not* "~*".

It seems that the only solution is checking '|' to see if it appears
in the target regexp and giving up the optimization in that case.

One might think that ~* '^41|^des' can be rewritten like:

fld1 ~* '^41' or fld1 ~* '^des'

For me this seems not to be a good idea. To accomplish this, we have
to deeply parse the regexp (consider that we might have arbitrary
complex regexps) and such kind thing is a job regexp() shoud
do.

Comments?
---
Tatsuo Ishii

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#3)
Re: [HACKERS] Broken select on regular expression !!!

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

Something is rotten in the state of Denmark...

These all oddness are caused by the parser (makeIndexable).

Ah-hah, I'm sure you're right. That makes *two* serious bugs in
makeIndexable. (We still don't have an adequate solution
for its known problem in non-C locales...)

It seems that the only solution is checking '|' to see if it appears
in the target regexp and giving up the optimization in that case.

I'm feeling a strong urge to just rip out makeIndexable until
it can be redesigned... how many other problems has it got
that we haven't stumbled upon?

regards, tom lane

#5Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tatsuo Ishii (#3)
Re: [HACKERS] Broken select on regular expression !!!

These all oddness are caused by the parser (makeIndexable). When
makeIndexable sees ~* '^41|^des' , it tries to rewrite the target
regexp so that an index can be used. The rewritten query might be
something like:

fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377'

Apparently this is wrong. This is because makeIndexable does not
understand '|' and '^' appearing in the middle of the regexp. On the
other hand,

regression=> select * from regdemo where fld1 ~* '^des|^41';
regression=> select * from regdemo where fld1 ~* '^sou|^des';

will work since makeIndexable gave up the optimization if the op is
"~*" and a letter appearing right after '^' is *alphabet*.

Note that:

regression=> select * from regdemo where fld1 ~ '^sou|^des';

will not work because the op is *not* "~*".

It seems that the only solution is checking '|' to see if it appears
in the target regexp and giving up the optimization in that case.

One might think that ~* '^41|^des' can be rewritten like:

fld1 ~* '^41' or fld1 ~* '^des'

For me this seems not to be a good idea. To accomplish this, we have
to deeply parse the regexp (consider that we might have arbitrary
complex regexps) and such kind thing is a job regexp() shoud
do.

Again very clear, and caused by the indexing of regex's, as you suggest.
I can easily look for '|' in the string, and skip the optimization. Is
that the only special case I need to add?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#6Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: [HACKERS] Broken select on regular expression !!!

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

Something is rotten in the state of Denmark...

These all oddness are caused by the parser (makeIndexable).

Ah-hah, I'm sure you're right. That makes *two* serious bugs in
makeIndexable. (We still don't have an adequate solution
for its known problem in non-C locales...)

It seems that the only solution is checking '|' to see if it appears
in the target regexp and giving up the optimization in that case.

I'm feeling a strong urge to just rip out makeIndexable until
it can be redesigned... how many other problems has it got
that we haven't stumbled upon?

But if we rip it out, people will complain we are not using the index
for regex and LIKE. That is a pretty serious complaint.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#7Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tatsuo Ishii (#3)
Re: [HACKERS] Broken select on regular expression !!!

These all oddness are caused by the parser (makeIndexable). When
makeIndexable sees ~* '^41|^des' , it tries to rewrite the target
regexp so that an index can be used. The rewritten query might be
something like:

fld1 ~* '^41|^des' and fld1 >= '41|^' and fld1 <= '41|^\377'

I have just applied a fix to gram.y that should fix this case. Please
let me know how it works at your site. Thanks.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#8Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#7)
Re: [HACKERS] Broken select on regular expression !!!

Again very clear, and caused by the indexing of regex's, as you suggest.
I can easily look for '|' in the string, and skip the optimization. Is
that the only special case I need to add?

What about '{' ?
---
Tatsuo Ishii

#9Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#8)
Re: [HACKERS] Broken select on regular expression !!!

I have just applied a fix to gram.y that should fix this case. Please
let me know how it works at your site. Thanks.

It works great!

(This is FreeBSD 2.2.6-RELEASE)
--
Tatsuo Ishii

#10Don Baccus
dhogaza@pacifier.com
In reply to: Bruce Momjian (#6)
Re: [HACKERS] Broken select on regular expression !!!

At 12:08 AM 5/21/99 -0400, Bruce Momjian wrote:

But if we rip it out, people will complain we are not using the index
for regex and LIKE. That is a pretty serious complaint.

For this release, given that it's coming fairly soon (11 days???)
you might consider documenting the shortcomings, rather than
ripping it out. Indexing is important, serious users will expect
it, at least for LIKE (they might not know about regex if they're
just SQL grunts).

- Don Baccus, Portland OR <dhogaza@pacifier.com>
Nature photos, on-line guides, and other goodies at
http://donb.photo.net

#11Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tatsuo Ishii (#8)
Re: [HACKERS] Broken select on regular expression !!!

Again very clear, and caused by the indexing of regex's, as you suggest.
I can easily look for '|' in the string, and skip the optimization. Is
that the only special case I need to add?

What about '{' ?

Does it understand {? Man, what kind of regex library do we have?

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#12Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Don Baccus (#10)
Re: [HACKERS] Broken select on regular expression !!!

At 12:08 AM 5/21/99 -0400, Bruce Momjian wrote:

But if we rip it out, people will complain we are not using the index
for regex and LIKE. That is a pretty serious complaint.

For this release, given that it's coming fairly soon (11 days???)
you might consider documenting the shortcomings, rather than
ripping it out. Indexing is important, serious users will expect
it, at least for LIKE (they might not know about regex if they're
just SQL grunts).

I didn't even know our ~ operator supported '|'! Now I am told it know
about '{' too. I am checking on that one.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#13Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Bruce Momjian (#12)
Re: [HACKERS] Broken select on regular expression !!!

Again very clear, and caused by the indexing of regex's, as you suggest.
I can easily look for '|' in the string, and skip the optimization. Is
that the only special case I need to add?

What about '{' ?

Does it understand {? Man, what kind of regex library do we have?

I vaguely recall that we used to support only "basic" regex. At least
I thought so. Now looking into the source, I found we have supported
"extended" regex. FYI, our regex routines definitely supprt '{'. See
backend/regex/re_format.7.

P.S. I will commit a small regex test program under
backend/regex for the testing purpose.
--
Tatsuo Ishii

#14Thomas Lockhart
lockhart@alumni.caltech.edu
In reply to: Tatsuo Ishii (#13)
Re: [HACKERS] Broken select on regular expression !!!

I vaguely recall that we used to support only "basic" regex. At least
I thought so. Now looking into the source, I found we have supported
"extended" regex. FYI, our regex routines definitely supprt '{'. See
backend/regex/re_format.7.

Didn't even know that man page existed. It sure would be nice to have
it in the main docs, perhaps in the chapter on operators...

- Thomas

--
Thomas Lockhart lockhart@alumni.caltech.edu
South Pasadena, California

#15Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tatsuo Ishii (#13)
Re: [HACKERS] Broken select on regular expression !!!

Again very clear, and caused by the indexing of regex's, as you suggest.
I can easily look for '|' in the string, and skip the optimization. Is
that the only special case I need to add?

What about '{' ?

Does it understand {? Man, what kind of regex library do we have?

I vaguely recall that we used to support only "basic" regex. At least
I thought so. Now looking into the source, I found we have supported
"extended" regex. FYI, our regex routines definitely supprt '{'. See
backend/regex/re_format.7.

P.S. I will commit a small regex test program under
backend/regex for the testing purpose.

I have just commited a fix to skip {} too.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#16Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Constantin Teodorescu (#1)
Re: [HACKERS] Broken select on regular expression !!!

Fixed in 6.5 final.

Hi all,

I think I have found a bug in regexp based selections.
Watch this :

create table regdemo (fld1 varchar(32));
CREATE
insert into regdemo values('410');
INSERT 726409 1
insert into regdemo values('7410');
INSERT 726410 1
insert into regdemo values('source');
INSERT 726411 1
insert into regdemo values('destination');
INSERT 726412 1
select * from regdemo where fld1 ~* '^sou|^des';
fld1
-----------
source
destination
(2 rows)

select * from regdemo where fld1 ~* '41|^des';
fld1
-----------
410
7410
destination
(3 rows)

select * from regdemo where fld1 ~* '^41|^des';
fld1
----
(0 rows)

^^^^^^^^^^^^^^
!?!?!?!
I thought it should return '410' and 'destination' rows. But it returns
nothing!
The first select example with ^ in both variants ( ^sou|^des ) works !!!
The last one ( ^41|^des ) don't !

Am I missing something?
I am getting the same result also on 6.4.2 and 6.5 beta 1 versions!

Best regards,
--
Constantin Teodorescu
FLEX Consulting Braila, ROMANIA

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#17Bruce Momjian
maillist@candle.pha.pa.us
In reply to: Tatsuo Ishii (#8)
Re: [HACKERS] Broken select on regular expression !!!

Again very clear, and caused by the indexing of regex's, as you suggest.
I can easily look for '|' in the string, and skip the optimization. Is
that the only special case I need to add?

What about '{' ?

I dealt with this before 6.5.

-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026