non-static LIKE patterns

Started by patrick keshishianalmost 14 years ago7 messagesgeneral
Jump to latest
#1patrick keshishian
pkeshish@gmail.com

Hi all,

I'm sure this has been discussed before, but I am not too sure what
key search-terms to use to find any potentially relevant discussions.

Issue: I have two tables, each has a column that contains a directory
path. First table contains a starting path and the second holds
sub-paths (retaining full path names from root directory). In short,
first table entries are sub-strings of those found in the second
table.

I need to match entries in second table to the first, so I use the
following in my WHERE clause:

... WHERE second.path LIKE first.path||'%'

This seemed to work at first, but it fails if the paths use
back-slashes (like Windows paths).

Following is a simple test-case to illustrate what I described.

PostgreSQL 9.1.1 (similar results with much older version)

$ psql -d db -e < testcase.sql
CREATE TEMPORARY TABLE foo (id INTEGER, a TEXT);
CREATE TABLE
CREATE TEMPORARY TABLE bar (id INTEGER, b TEXT);
CREATE TABLE
INSERT INTO foo VALUES (0, '/root/a/b');
INSERT 8030228 1
INSERT INTO foo VALUES (1, '\root\a\b');
INSERT 8030229 1
INSERT INTO bar VALUES (0, '/root/a/b/c/*nix');
INSERT 8030230 1
INSERT INTO bar VALUES (1, '\root\a\b\c\Windows');
INSERT 8030231 1
SELECT * FROM foo;
id | a
----+-----------
0 | /root/a/b
1 | \root\a\b
(2 rows)

SELECT * FROM bar;
id | b
----+---------------------
0 | /root/a/b/c/*nix
1 | \root\a\b\c\Windows
(2 rows)

SELECT a,b, b LIKE a||'%' FROM foo JOIN bar USING (id);
a | b | ?column?
-----------+---------------------+----------
/root/a/b | /root/a/b/c/*nix | t
\root\a\b | \root\a\b\c\Windows | f
(2 rows)

Hmm... just tried these two cases as well which seem interesting:

SELECT '\root\a\b\c\Windows' LIKE '\root\a\b'||'%';
?column?
----------
f
(1 row)

mod=# SELECT '\root\a\b\c\Windows' LIKE '\\root\\a\\b'||'%';
?column?
----------
t
(1 row)

Is this a bug in the SQL statement, or a bug in PostgreSQL? If the
former, what is the correct way to do this? If the latter, is there a
work-around?

I realize the same thing can be done with the following statement, but
it is harder to read and might be slightly more expensive to run on a
large data set.

SELECT a,b,substr(b,1,length(a)), substr(b,1,length(a)) = a FROM foo
JOIN bar USING (id);
a | b | substr | ?column?
-----------+---------------------+-----------+----------
/root/a/b | /root/a/b/c/*nix | /root/a/b | t
\root\a\b | \root\a\b\c\Windows | \root\a\b | t
(2 rows)

Thanks,
--patrick

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: patrick keshishian (#1)
Re: non-static LIKE patterns

patrick keshishian <pkeshish@gmail.com> writes:

I need to match entries in second table to the first, so I use the
following in my WHERE clause:
... WHERE second.path LIKE first.path||'%'
This seemed to work at first, but it fails if the paths use
back-slashes (like Windows paths).

By default, back-slash is a special character in LIKE patterns.
You can change that with the ESCAPE option. See
http://www.postgresql.org/docs/devel/static/functions-matching.html#FUNCTIONS-LIKE

regards, tom lane

#3patrick keshishian
pkeshish@gmail.com
In reply to: Tom Lane (#2)
Re: non-static LIKE patterns

On Wed, Apr 11, 2012 at 4:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

patrick keshishian <pkeshish@gmail.com> writes:

I need to match entries in second table to the first, so I use the
following in my WHERE clause:
      ... WHERE second.path LIKE first.path||'%'
This seemed to work at first, but it fails if the paths use
back-slashes (like Windows paths).

By default, back-slash is a special character in LIKE patterns.
You can change that with the ESCAPE option.  See
http://www.postgresql.org/docs/devel/static/functions-matching.html#FUNCTIONS-LIKE

Thanks for the quick reply. Would be tough choosing another
"reasonable" ESCAPE character while dealing with paths. Will think
more about this.

Cheers,
--patrick

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: patrick keshishian (#3)
Re: non-static LIKE patterns

patrick keshishian <pkeshish@gmail.com> writes:

Thanks for the quick reply. Would be tough choosing another
"reasonable" ESCAPE character while dealing with paths. Will think
more about this.

If you want it to be bulletproof, what I'd think about is something like

WHERE second.path LIKE quote_like(first.path)||'%'

where quote_like() is a function that inserts a backslash before each
backslash, percent, and underscore in the given value. Probably not
hard to cons that up from regexp_replace().

regards, tom lane

#5Noname
hamann.w@t-online.de
In reply to: Tom Lane (#4)
Re: non-static LIKE patterns

Tom Lane wrote:

patrick keshishian <pkeshish@gmail.com> writes:

Thanks for the quick reply. Would be tough choosing another
"reasonable" ESCAPE character while dealing with paths. Will think
more about this.

If you want it to be bulletproof, what I'd think about is something like

WHERE second.path LIKE quote_like(first.path)||'%'

where quote_like() is a function that inserts a backslash before each
backslash, percent, and underscore in the given value. Probably not
hard to cons that up from regexp_replace().

regards, tom lane

Just out of curiosity: wouldn't that (as well as using non-static like)
be an enormous performance problem?
I tried something with normal "~" regex matching some time ago but
gave up on the idea pretty soon

Regards
Wolfgang Hamann

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Noname (#5)
Re: non-static LIKE patterns

hamann.w@t-online.de writes:

Tom Lane wrote:
If you want it to be bulletproof, what I'd think about is something like
WHERE second.path LIKE quote_like(first.path)||'%'

Just out of curiosity: wouldn't that (as well as using non-static like)
be an enormous performance problem?

Well, it won't be free, but I think you've already doomed yourself to
a not-very-bright plan by using LIKE in this way at all.

In any case, as a wise man once said, you can make it run arbitrarily
fast if it doesn't have to give the right answer. Correctness trumps
any micro-optimization questions, so if you have to have prefix matching
of this sort, it's gonna cost ya somehow.

Actually, if the only case you're worried about is prefix match, you
could do it in substring style:

WHERE second.path = substring(first.path, 1, length(second.path))

(better double-check the substring syntax, I'm too lazy to). This is
still going to completely suck on a macro level: there's still no way to
perform the join except by tediously iterating through every combination
of rows. But it'll likely outrun any LIKE-based solution by some
percentage.

regards, tom lane

#7Noname
hamann.w@t-online.de
In reply to: Tom Lane (#6)
Re: non-static LIKE patterns

hamann.w@t-online.de writes:

Tom Lane wrote:
If you want it to be bulletproof, what I'd think about is something like
WHERE second.path LIKE quote_like(first.path)||'%'

Just out of curiosity: wouldn't that (as well as using non-static like)
be an enormous performance problem?

Well, it won't be free, but I think you've already doomed yourself to
a not-very-bright plan by using LIKE in this way at all.

In any case, as a wise man once said, you can make it run arbitrarily
fast if it doesn't have to give the right answer. Correctness trumps
any micro-optimization questions, so if you have to have prefix matching
of this sort, it's gonna cost ya somehow.

Hi Tom,

I just stumbled across this question because I regularly come across problems that,
at first, look like they should be solved with non-static LIKE or REGEX patterns

I actually have two situations where I would need a better plan. One is, fortunately,
fairly static (mostly lookups, hardly inserts) for name matches. Many famous people
appear in different spellings, say these two musicians
Franz|Ferenc Liszt
Fr(e|y)der(ic|yk) Chopin
So the first plan would be to regex-compare the sought name against the first name (or last name)
regexes. Run-time is astronomical, though
My current approach is to
a) keep the regexes in a separate table/column, so names with a regex entry are handled in
a smaller query
b) reverse the query: for every regex (they are well-behaved in this context) I pre-create a pattern
so that my actual query becomes
where pre-made-pattern ~ searched_name
c) while preparing the pattern, a common initial character (the "F" for Franz and Ferenc) is identified
to build an index. In the rare case that the first letter is already different, there would be
two entries in the table. So the actual query can check for first letter before it does the regex.

The other situation, unfortionately, is ad-hoc queries where I cannot do that kind of preparation
typically, the DB would contain strings like XY4711A, XY271, XY17321AAA, and I want to
check whether an input like XY17321 matches a database entry up to the end of the
numerals. So I add [^0-9]*$ to the end of my candidates, select
where candidate ~ entry-in-table
and go for a coffee or two

Of course I would prefer to see a pre-built solution do all that mess for me...

Regards
Wolfgang Hamann