Determining if "in a text set"

Started by Richi Planaabout 27 years ago8 messagesgeneral

richip@mozcom.com

about 27 years ago

Hi,

I have a table/class with a text field which contains a single word
(possibly trailed by whitespaces). Given a phrase (a text literal composed
of words separated by one or more whitespaces), what's the best way to
SELECT those tuples with fields containing instances that match exactly
one of the words in the given string?

Example:

Table:
Tuple# ... FieldN
1 ... 'MON'
2 ... 'TUE'
3 ... 'THURS'
4 ... 'THU'
5 ... 'FRI'

Given phrase:
'MON TUE WED THURS'
Tuple 1 will match
Tuple 2 will match
Tuple 3 will match
Tuple 4 will NOT match
Tuple 5 will NOT match

Right now, I'm extracting (using strtok(), 8^P) each word from the given
phrase and using "SELECT * FROM {class} where FieldN LIKE '%{word}%'", but
it's terribly slow, relies on strtok() to parse words, and Tuple 4 above
will match.

L L Richi Plana 8^) ,-,-. ,-,-. ,-,-. ,-,-. ,-
LL LL Systems Administrator / / \ \ / / \ \ / / \ \ / / \ \ / /
LLLLL Mosaic Communications, Inc. \ \ / / \ \ / / \ \ / / \ \ / /
LLLLL mailto:richip@mozcom.com `-'-' `-'-' `-'-' `-'-'
------------------------------------------------------------------------
P G P Key available at http://www2.mozcom.com/~richip/richip.asc
Tired of Spam? Join this CAUCE! http://www.cauce.org/

Charles Tassell

ctassell@isn.net

about 27 years ago

In reply to: Richi Plana (#1)

Re: [GENERAL] Determining if "in a text set"

What about constructing your SQL query using a bunch of OR's? IE:
SELECT * FROM table WHERE day = 'MON' or day = 'TUE' ....
My memory of SQL syntax is woefully blurry (been a while, just getting back
into it) but this should work. That way you do the strtok stuff once (you
should never use strtok, strtok is evil, in the bad way) and only end up
with one query containing all your records.

At 12:41 AM 3/27/99, Richi Plana wrote:

Show quoted text

Hi,

I have a table/class with a text field which contains a single word
(possibly trailed by whitespaces). Given a phrase (a text literal composed
of words separated by one or more whitespaces), what's the best way to
SELECT those tuples with fields containing instances that match exactly
one of the words in the given string?

Example:

Table:
Tuple# ... FieldN
1 ... 'MON'
2 ... 'TUE'
3 ... 'THURS'
4 ... 'THU'
5 ... 'FRI'

Given phrase:
'MON TUE WED THURS'
Tuple 1 will match
Tuple 2 will match
Tuple 3 will match
Tuple 4 will NOT match
Tuple 5 will NOT match

Right now, I'm extracting (using strtok(), 8^P) each word from the given
phrase and using "SELECT * FROM {class} where FieldN LIKE '%{word}%'", but
it's terribly slow, relies on strtok() to parse words, and Tuple 4 above
will match.

L L Richi Plana 8^) ,-,-. ,-,-. ,-,-. ,-,-. ,-
LL LL Systems Administrator / / \ \ / / \ \ / / \ \ / / \ \ / /
LLLLL Mosaic Communications, Inc. \ \ / / \ \ / / \ \ / / \ \ / /
LLLLL mailto:richip@mozcom.com `-'-' `-'-' `-'-' `-'-'
------------------------------------------------------------------------
P G P Key available at http://www2.mozcom.com/~richip/richip.asc
Tired of Spam? Join this CAUCE! http://www.cauce.org/

Herouth Maoz

herouth@oumail.openu.ac.il

about 27 years ago

In reply to: Richi Plana (#1)

Re: [GENERAL] Determining if "in a text set"

I am redirecting the discussion to the SQL list, as it is more appropriate
than the GENERAL list.

At 06:41 +0200 on 27/03/1999, Richi Plana wrote:

I have a table/class with a text field which contains a single word
(possibly trailed by whitespaces). Given a phrase (a text literal composed
of words separated by one or more whitespaces), what's the best way to
SELECT those tuples with fields containing instances that match exactly
one of the words in the given string?

Example:

Table:
Tuple# ... FieldN
1 ... 'MON'
2 ... 'TUE'
3 ... 'THURS'
4 ... 'THU'
5 ... 'FRI'

Given phrase:
'MON TUE WED THURS'
Tuple 1 will match
Tuple 2 will match
Tuple 3 will match
Tuple 4 will NOT match
Tuple 5 will NOT match

Right now, I'm extracting (using strtok(), 8^P) each word from the given
phrase and using "SELECT * FROM {class} where FieldN LIKE '%{word}%'", but
it's terribly slow, relies on strtok() to parse words, and Tuple 4 above
will match.

First, do without the strtok. You could achieve the same result (that also
matches tuple 4) with:

SELECT * FROM {class}
WHERE '{phrase}' LIKE '%' || ( rtrim( FieldN ) || '%' );

Hehe... the strange expression simply cuts off spaces from the right side
of the field, and concatenates % signs on its left and right. So you match
the phrase against the keyword, not the other way around.

It is more readable if you use regular expression rather than 'like':

SELECT * FROM {class}
WHERE '{phrase}' ~ rtrim( FieldN );

Now, turning attention to eliminating the fourth tuple from matching: I'm
not an expert on regular expressions. I think postgres does not support
perl's concept of word boundaries. But if the phrase is supposed to contain
just spaces (ascii 32, if you will) as word separators, I'd match against
the field with a space on the left and nothing on the right, a space on the
right and nothing on the left, spaces on both sides, or standing alone:

SELECT * FROM {class}
WHERE '{phrase}' ~ ' ' || ( rtrim( FieldN ) || '$' )
OR '{phrase}' ~ '^' || ( rtrim( FieldN ) || ' ' )
OR '{phrase}' ~ ' ' || ( rtrim( FieldN ) || ' ' )
OR '{phrase}' = rtrim( FieldN );

The equivalent with 'like' is:

SELECT * FROM {class}
WHERE '{phrase}' LIKE '% ' || rtrim( FieldN )
OR '{phrase}' LIKE rtrim( FieldN ) || ' %'
OR '{phrase}' LIKE '% ' || ( rtrim( FieldN ) || ' %' )
OR '{phrase}' = rtrim( FieldN );

If the spaces between the words in your phrase are not just plain spaces
but can also be tabs etc, you will have to use the regexp version, and
write something like '[ \t\r\f\n]' - though I'm less than sure that
postgres's regular expressions support these. Can anyone tell us which
regexp definition postgres uses?

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma

Herouth Maoz

herouth@oumail.openu.ac.il

about 27 years ago

In reply to: Herouth Maoz (#3)

Re: [SQL] Re: [GENERAL] Determining if "in a text set"

At 19:24 +0200 on 28/03/1999, I wrote:

If the spaces between the words in your phrase are not just plain spaces
but can also be tabs etc, you will have to use the regexp version, and
write something like '[ \t\r\f\n]' - though I'm less than sure that
postgres's regular expressions support these. Can anyone tell us which
regexp definition postgres uses?

OK, I looked into it more deeply. The regular expressions in postgres seem
to follow the rules in the 'regex(5)' manpage.

So, if the delimiters are not necessarily plain space characters, you could
match, using the '~' version, with '[[:space:]]' instead of ' '. Ugly, but
it matches also tabs and newlines.

At least it does on my solaris. I'm not sure whether this is or is not
OS-dependent.

As for the words in the field, which you said can have trailing spaces. If
these spaces also may be tabs or newlines, you should use rtrim with two
arguments, and the second argument is a string of all characters which need
removal from the end of the string. The point is, however, that you type in
an actual tab or an actual newline.

Hope you followed me so far.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma

Richi Plana

richip@mozcom.com

about 27 years ago

In reply to: Herouth Maoz (#4)

Re: [SQL] Re: [GENERAL] Determining if "in a text set"

Hi,

On Sun, 28 Mar 1999, Herouth Maoz wrote:

|o| OK, I looked into it more deeply. The regular expressions in postgres seem
|o| to follow the rules in the 'regex(5)' manpage.
[snipped]
|o| At least it does on my solaris. I'm not sure whether this is or is not
|o| OS-dependent.

Looking at the PostgreSQL source, I think they're using some UC Berkeley
regex package (check out $(SRC)/src/backend/regex). There are two MAN
pages there if people are interested. re_format.7 specifies POSIX 1003.2
Regex specs ... Again, assuming that PostgreSQL uses this lib.

|o| Hope you followed me so far.

thanks, that was pretty straightforward. kinda teacher-like. ;^)

Richi Plana

richip@mozcom.com

about 27 years ago

In reply to: Herouth Maoz (#3)

Re: [GENERAL] Determining if "in a text set"

Hi,

On Sun, 28 Mar 1999, Herouth Maoz wrote:

|o| perl's concept of word boundaries. But if the phrase is supposed to contain
|o| just spaces (ascii 32, if you will) as word separators, I'd match against
|o| the field with a space on the left and nothing on the right, a space on the
|o| right and nothing on the left, spaces on both sides, or standing alone:
|o|
|o| SELECT * FROM {class}
|o| WHERE '{phrase}' ~ ' ' || ( rtrim( FieldN ) || '$' )
|o| OR '{phrase}' ~ '^' || ( rtrim( FieldN ) || ' ' )
|o| OR '{phrase}' ~ ' ' || ( rtrim( FieldN ) || ' ' )
|o| OR '{phrase}' = rtrim( FieldN );

1) I'm not sure the regex thing is ANSI SQL and would port to other SQL
systems easilly
2) Isn't there a way to do [^{ws}](word)[{ws}$], where [^{ws}] means
either the start or a whitespace and [{ws}$] means either a whitespace or
EOL?
3) When I try the concat strings operator (||), I get the following psql
error:

ERROR: parser: syntax error at or near "||"

What am I doing wrong? (I'm using PostgreSQL 6.4.2 on Sparc/Solaris with
GCC compiler) There's something wrong with the concat operator on my
system. I get diff. results from the ff. commands:

SELECT * FROM {class} where 'MON' = FieldN; and
SELECT * FROM {class} where 'MO' || 'N' = fieldN;

Anybody have a clue.

Richi Plana

richip@mozcom.com

about 27 years ago

In reply to: Richi Plana (#6)

Re: [SQL] Re: [GENERAL] Determining if "in a text set"

Hi,

On Mon, 29 Mar 1999, Richi Plana wrote:

|o| |o| SELECT * FROM {class}
|o| |o| WHERE '{phrase}' ~ ' ' || ( rtrim( FieldN ) || '$' )
|o| |o| OR '{phrase}' ~ '^' || ( rtrim( FieldN ) || ' ' )
|o| |o| OR '{phrase}' ~ ' ' || ( rtrim( FieldN ) || ' ' )
|o| |o| OR '{phrase}' = rtrim( FieldN );
|o|
|o| 3) When I try the concat strings operator (||), I get the following psql
|o| error:
|o|
|o| ERROR: parser: syntax error at or near "||"

My mistake. I didn't take into account hierarchy/precedence of operators.
It seems string concatenation (||) has a higher order compared to the
regex and LIKE operators. It should have been.

SELECT * FROM {class} WHERE '{phrase}' ~ ('^' || TRIM(FieldN) ...'

Herouth Maoz

herouth@oumail.openu.ac.il

about 27 years ago

In reply to: Richi Plana (#6)

Re: [GENERAL] Determining if "in a text set"

At 20:56 +0200 on 28/03/1999, Richi Plana wrote:

1) I'm not sure the regex thing is ANSI SQL and would port to other SQL
systems easilly

That's true. ANSI only has LIKE, but it's oh-so-limited.

2) Isn't there a way to do [^{ws}](word)[{ws}$], where [^{ws}] means
either the start or a whitespace and [{ws}$] means either a whitespace or
EOL?

Hehe... Actually, there is. This is what I came up with:

SELECT * FROM {class}
WHERE '{phrase}' ~ (
'(^|[[:space:]])' || ( rtrim( FieldN ) || '([[:space:]]|$)' )
);

It seems a bit complicated. The [[:space:]] combination is the set of all
whitespace characters. Like perl's \s, but locale-sensitive.

The combination '(^|[[:space:]])' uses the regexp alternatives syntax. This
means "Either the beginning of the phrase (^) or a space character". The
combination '([[:space:]]|$)' is, along the same lines, "either a space
character or the end of the phrase".

All the rest of the mess are concatenation operators, and the parentheses
required for their precedence and associativity.

3) When I try the concat strings operator (||), I get the following psql
error:

Yes, you are right about the precedence (I don't know why I didn't
encounter it yesterday). One must also remember that the || operator is
also non-associative (no real reason for it - it's a minor bug), so you
have to put parentheses around it if you concat more than two strings.

Herouth

--
Herouth Maoz, Internet developer.
Open University of Israel - Telem project
http://telem.openu.ac.il/~herutma