Regexp match not working.. (SQL help)

Started by Phoenix Kiulaalmost 15 years ago4 messagesgeneral
Jump to latest
#1Phoenix Kiula
phoenix.kiula@gmail.com

I have a text column in a table, which I want to search through --
seeking the occurrence of about 300 small strings in it.

Let's say the table is like this:

table1 (
id bigint primary key
,mytext text
,mydate timestamp without time zone
);

I am using this SQL:

SELECT id FROM table1
WHERE mytext ~* E'sub1|sub2|sub3|sub4...'
LIMIT 10;

This is basically working, but some of the "mytext" columns being
returned that do not contain any of these substrings. Am I doing the
POSIX regexp wrongly? This same thing works when I try it in PHP with
preg_match. But not in Postgresql. I have tried several variations
too:

WHERE mytext ~* E'(sub1)(sub2)(sub3)(sub4)...'

None of this is working. I cannot seem to get out the results that do
NOT contain any of those strings.

Appreciate any pointers!

Thanks!

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Phoenix Kiula (#1)
Re: Regexp match not working.. (SQL help)

Phoenix Kiula <phoenix.kiula@gmail.com> writes:

I am using this SQL:

SELECT id FROM table1
WHERE mytext ~* E'sub1|sub2|sub3|sub4...'
LIMIT 10;

This is basically working, but some of the "mytext" columns being
returned that do not contain any of these substrings.

[ raised eyebrow... ] Could we see a concrete example?

One potential issue is that, depending on which PG version and locale
and database encoding you are using, case-insensitive matching might
not work properly on non-ASCII letters. Other than that, the only
gotcha I can think of is having regexp special characters in the
substrings and failing to escape them properly.

regards, tom lane

#3David G. Johnston
david.g.johnston@gmail.com
In reply to: Phoenix Kiula (#1)
Re: Regexp match not working.. (SQL help)

I am using this SQL:

SELECT id FROM table1
WHERE mytext ~* E'sub1|sub2|sub3|sub4...'
LIMIT 10;

This is basically working, but some of the "mytext" columns being returned
that do not contain any of these substrings. Am I doing the POSIX regexp
wrongly? This same thing works when I try it in PHP with preg_match. But

not

in Postgresql. I have tried several variations
too:

WHERE mytext ~* E'(sub1)(sub2)(sub3)(sub4)...'

When requesting help with RegEx you are strongly advised to supply the text
of the records that you are concerned about (in this case you say you are
getting false positives so provide the contents of "mytext" for those
records) AND the exact expression you are using.

You seem to indicate the contents of mytext contains a "text document" and
you are attempting to find specific words in that document. The expression
format supplied does not take into consideration word boundaries. If any
part of a word matches "subX" then the pattern will match.

You may want to consider finding one or more books on RegEx. The fact that
you consider E'(sub1)(sub2)...' to be a variation of E'sub1|sub2...'
indicates that the issue is likely not PostgreSQL itself but your
understanding on RegEx.

You may also want to try the "regexp_matches(...)" function in PostgreSQL.
Instead of just evaluating true/false it returns an array of all the matches
that were found. Using this you would be able to see exactly what text
PostgreSQL is matching with your expression.

Figuring out why something is matching that should not be (false positive)
is fairly easy since the engine itself will tell you what it matched. The
hard situation is the false-negative, where you think something should match
and it does not.

David J.

#4Phoenix Kiula
phoenix.kiula@gmail.com
In reply to: Phoenix Kiula (#1)
Re: Regexp match not working.. (SQL help)

On Wed, May 11, 2011 at 11:18 PM, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:

I have a text column in a table, which I want to search through --
seeking the occurrence of about 300 small strings in it.

Let's say the table is like this:

   table1 (
        id   bigint primary key
       ,mytext   text
       ,mydate  timestamp without time zone
   );

I am using this SQL:

  SELECT id FROM table1
  WHERE   mytext   ~*   E'sub1|sub2|sub3|sub4...'
  LIMIT 10;

This is basically working, but some of the "mytext" columns being
returned that do not contain any of these substrings. Am I doing the
POSIX regexp wrongly? This same thing works when I try it in PHP with
preg_match. But not in Postgresql. I have tried several variations
too:

  WHERE   mytext   ~*   E'(sub1)(sub2)(sub3)(sub4)...'

 None of this is working. I cannot seem to get out the results that do
NOT contain any of those strings.

Appreciate any pointers!

Thanks!

My bad. I figured out that the pipe should only separate the strings
to be searched. I had one stray pipe at the end:

SELECT id FROM table1
WHERE mytext ~* E'sub1|sub2|sub3|....subXY|'
LIMIT 10;

This meant that it was matching, well basically anything.

Sorry.