Queries with Regular Expressions

Started by Silas Justinianoabout 20 years ago4 messagesgeneral
Jump to latest
#1Silas Justiniano
silasju@gmail.com

Hello!

My queries work fine with Regular Expressions, as:

SELECT field FROM table WHERE field ~ 'something';
SELECT field FROM table WHERE field ~* 'something';
SELECT field FROM table WHERE field ~* 'som[i,e]thing';

And it works fine for special characters:
SELECT field FROM table WHERE field ~* 'chão';

But I just can't make it work correctly using brackets:
SELECT field FROM table WHERE field ~* 'ch[aã]o';

It just returns tuples that have 'chao', but not 'chão'.

My queries are utf-8 an the database is SQL_ASCII.

Any idea? Thank you!

--
Silas Justiniano - Brazil

#2John D. Burger
john@mitre.org
In reply to: Silas Justiniano (#1)
Re: Queries with Regular Expressions

But I just can't make it work correctly using brackets:
SELECT field FROM table WHERE field ~* 'ch[aã]o';

It just returns tuples that have 'chao', but not 'chão'.

My queries are utf-8 an the database is SQL_ASCII.

I suspect the bracketed expression is turning into [aXY], where XY is
the two-byte sequence corresponding to ã in UTF8. So the regular
expression is only going to match strings of the form chao, chXo and
chYo. To make sure that this is what's happening, try this:

select length('ã');

I bet you get back 2, not 1. I don't know if a UTF8 database will
handle this correctly or not. The safest thing to do may be to use
queries like this:

SELECT field FROM table WHERE field ~* 'ch(a|ã)o';

- John D. Burger
MITRE

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: John D. Burger (#2)
Re: Queries with Regular Expressions

"John D. Burger" <john@mitre.org> writes:

My queries are utf-8 an the database is SQL_ASCII.

I suspect the bracketed expression is turning into [aXY], where XY is
the two-byte sequence corresponding to � in UTF8.

That's what it looks like to me. You can hardly blame the database for
this, when you haven't clued it in that you're using UTF8.

I don't know if a UTF8 database will handle this correctly or not.

I believe that it will work in recent PG releases (7.4 and up).

regards, tom lane

#4Silas Justiniano
silasju@gmail.com
In reply to: John D. Burger (#2)
Re: Queries with Regular Expressions

John, it worked completely fine! Thank you! I don't understand exactly
the difference between [] and () for REs, but I'm starting to study
them deeply.

Thank you very much! Pg is great!

On 4/6/06, John D. Burger <john@mitre.org> wrote:

But I just can't make it work correctly using brackets:
SELECT field FROM table WHERE field ~* 'ch[aã]o';

It just returns tuples that have 'chao', but not 'chão'.

My queries are utf-8 an the database is SQL_ASCII.

I suspect the bracketed expression is turning into [aXY], where XY is
the two-byte sequence corresponding to ã in UTF8. So the regular
expression is only going to match strings of the form chao, chXo and
chYo. To make sure that this is what's happening, try this:

select length('ã');

I bet you get back 2, not 1. I don't know if a UTF8 database will
handle this correctly or not. The safest thing to do may be to use
queries like this:

SELECT field FROM table WHERE field ~* 'ch(a|ã)o';

- John D. Burger
MITRE

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

--
Silas Justiniano - Brazil