How to get around LIKE inefficiencies?

Started by The Hermit Hacker · 25 messages
#1The Hermit Hacker
scrappy@hub.org

I'm trying to figure out how to speed up udmsearch when run under
PostgreSQL, and am being hit by atrocious performance when using a LIKE
query ... the query looks like:

SELECT ndict.url_id,ndict.intag
FROM ndict,url
WHERE ndict.word_id=1971739852
AND url.rec_id=ndict.url_id
AND (url.url LIKE 'http://www.postgresql.org/%');

Take off the AND ( LIKE ) part of the query and it finishes almost as
soon as you hit return. Put it back in, and you can go for coffee before
it finishes ...

If I do 'SELECT url_id FROM ndict WHERE word_id=1971739852', there
are 153 records returned ... is there some way that I'm not thinking of
to re-write the above so that it 'resolves' the equality before the LIKE,
in order to reduce the number of tuples it has to run the LIKE on? Is
there some way of writing the above so that it doesn't take forever to
execute?

I'm running this on a dual-PIII 450 server, 512MB of RAM, zero
swap space being used ... the database has its indices on one hard drive,
the tables themselves are on a second one ... it's PgSQL 7.0.2 (Tom,
anything in v7.0.3 that might improve this?) and startup is as:

#!/bin/tcsh
setenv PORT 5432
setenv POSTMASTER /pgsql/bin/postmaster
unlimit
${POSTMASTER} -B 384 -N 192 \
    -o "-F -S 32768" \
    -i -p ${PORT} -D/pgsql/data \
    >& /pgsql/logs/postmaster.${PORT}.$$ &

So it's not like I'm not throwing a lot of resources at this ...

Is there anything that we can do to improve this? I was trying to
think of some way to use a subselect to narrow the search results, or
something ...

Oh, EXPLAIN on the above comes down to:

NOTICE: QUERY PLAN:

Nested Loop (cost=0.00..1195.14 rows=1 width=10)
-> Index Scan using url_url on url (cost=0.00..2.73 rows=1 width=4)
-> Index Scan using n_word on ndict (cost=0.00..1187.99 rows=353 width=6)

EXPLAIN

ndict: 663018 tuples
url: 29276 tuples

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#1)
Re: How to get around LIKE inefficiencies?

A brute-force answer would be to remove the url_url index ;-)
dunno if that would slow down other queries, however.
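
For reference, that would just be (a sketch, assuming the index really is
named url_url, as the EXPLAIN output above suggests):

-- drop the index the planner keeps choosing to drive the LIKE clause
DROP INDEX url_url;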

regards, tom lane

#3Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#2)
Re: How to get around LIKE inefficiencies?

Sorry to be getting in here late. Have you tried CLUSTER? If it is
using an index scan, and it is slow, CLUSTER often helps, especially
when there are several duplicate matches, as there are with LIKE. Let me
know how that works.
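
For example (a sketch using the 7.0-era syntax, and assuming the n_word
index from the EXPLAIN output is the one worth clustering on):

-- physically reorder ndict by its word index, so the ~150 matching
-- tuples for a given word_id sit next to each other in the heap
CLUSTER n_word ON ndict;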

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.
>
> regards, tom lane

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)
Re: How to get around LIKE inefficiencies?

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Sorry to be getting in here late. Have you tried CLUSTER?

Prolly won't help much. I think what he's getting burnt by
is that the planner thinks that an indexscan based on the
LIKE 'http://www.postgresql.org/%' condition will be extremely
selective --- it has no idea that most of the URLs in his table
will match that prefix. It's ye same olde nonuniform-distribution
problem; until we have better statistics, there's not much hope
for a non-kluge solution.
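
To spell out what the planner does with that clause: with a C locale it
expands the prefix LIKE into an index range condition, roughly

-- a sketch of the derived range; the upper bound is the prefix with its
-- last byte incremented, so the trailing '/' becomes '0'
AND url.url >= 'http://www.postgresql.org/'
AND url.url <  'http://www.postgresql.org0'

and then estimates how selective that range is --- an estimate that goes
badly wrong when most of the table shares the prefix.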

regards, tom lane

#5Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: How to get around LIKE inefficiencies?

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Sorry to be getting in here late. Have you tried CLUSTER?

> Prolly won't help much. I think what he's getting burnt by
> is that the planner thinks that an indexscan based on the
> LIKE 'http://www.postgresql.org/%' condition will be extremely
> selective --- it has no idea that most of the URLs in his table
> will match that prefix. It's ye same olde nonuniform-distribution
> problem; until we have better statistics, there's not much hope
> for a non-kluge solution.

But I think it will help. There will be lots of index lookups, but they
will be sequential in the heap, not random over the heap.

#6Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: How to get around LIKE inefficiencies?

Yes, I am waiting to hear back on that.

> At 21:47 5/11/00 -0500, Tom Lane wrote:
>> It's ye same olde nonuniform-distribution
>> problem; until we have better statistics, there's not much hope
>> for a non-kluge solution.
>
> Wasn't somebody trying to do something with that a few weeks back?


#7Philip Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#2)
Re: How to get around LIKE inefficiencies?

At 21:28 5/11/00 -0500, Tom Lane wrote:

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.

Could you trick it into not using the index (and into using the other
strategy) by using a calculation:

SELECT ndict.url_id,ndict.intag
FROM ndict,url
WHERE ndict.word_id=1971739852
AND url.rec_id=ndict.url_id
AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');

It's a bit nasty, but because the left-hand side of the LIKE is now an
expression rather than the bare indexed column, the planner can no longer
match it to the url_url index.

If you had 7.1, the following might work:

SELECT url_id, intag FROM
    (SELECT ndict.url_id, ndict.intag, url.url
     FROM ndict, url
     WHERE ndict.word_id=1971739852
     AND url.rec_id=ndict.url_id) AS zzz
WHERE
    zzz.url LIKE 'http://www.postgresql.org/%';

But I don't know how subselect-in-from handles this sort of query - it
might be 'clever' enough to fold it somehow.
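
If it does fold it, one equally kludgy fence that might work is to give
the subselect a LIMIT/OFFSET clause, on the (untested) assumption that
7.1 won't flatten a subquery that carries one:

SELECT url_id, intag FROM
    (SELECT ndict.url_id, ndict.intag, url.url
     FROM ndict, url
     WHERE ndict.word_id=1971739852
     AND url.rec_id=ndict.url_id
     OFFSET 0) AS zzz  -- OFFSET 0 is a no-op that may discourage folding
WHERE
    zzz.url LIKE 'http://www.postgresql.org/%';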

----------------------------------------------------------------
Philip Warner | Albatross Consulting Pty. Ltd. (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81 | Fax: (+61) 0500 83 82 82
http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371

#8Philip Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#4)
Re: How to get around LIKE inefficiencies?

At 21:47 5/11/00 -0500, Tom Lane wrote:

> It's ye same olde nonuniform-distribution
> problem; until we have better statistics, there's not much hope
> for a non-kluge solution.

Wasn't somebody trying to do something with that a few weeks back?


#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Philip Warner (#7)
Re: How to get around LIKE inefficiencies?

Philip Warner <pjw@rhyme.com.au> writes:

> Could you trick it into not using the index (and into using the other
> strategy) by using a calculation:
>
> AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
>
> it's a bit nasty.

Looks like a great kluge to me ;-)

regards, tom lane

#10The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#2)
Re: How to get around LIKE inefficiencies?

yowch ... removing that one index makes my 'test' search (mvcc) come back
as:

[97366]: SQL 0.05s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE ndict.word_id=572517542 AND url.rec_id=ndict.url_id AND (url.url LIKE 'http://www.postgresql.org/%')

vs what we were doing before ... now, let's see how increasing the number
of docs indexed changes this ...

thanks ...

On Sun, 5 Nov 2000, Tom Lane wrote:

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.
>
> regards, tom lane


#11Philip Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#9)
Re: How to get around LIKE inefficiencies?

At 21:59 5/11/00 -0500, Tom Lane wrote:

> Looks like a great kluge to me ;-)

Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}


#12The Hermit Hacker
scrappy@hub.org
In reply to: Philip Warner (#11)
Re: How to get around LIKE inefficiencies?

On Mon, 6 Nov 2000, Philip Warner wrote:

> At 21:59 5/11/00 -0500, Tom Lane wrote:
>> Looks like a great kluge to me ;-)
>
> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}

Except, if we are telling it to get rid of using the index, may as well
get rid of it altogether, as updates/inserts would be slowed down by
having to update that too ...

I can now do three-word searches in what looks like no time ... not sure
how it will look after I have ~100k documents indexed, but at least now
it's not dead at 22k ...

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#12)
Re: How to get around LIKE inefficiencies?

The Hermit Hacker <scrappy@hub.org> writes:

> On Mon, 6 Nov 2000, Philip Warner wrote:
>> At 21:59 5/11/00 -0500, Tom Lane wrote:
>>> Looks like a great kluge to me ;-)
>> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}
>
> Except, if we are telling it to get rid of using the index, may as well
> get rid of it altogether, as updates/inserts would be slowed down by
> having to update that too ...

Sure --- but do you have any other query types where the index *is*
useful? If so, Philip's idea will let you suppress use of the index
for just this one kind of query.

regards, tom lane

#14Philip Warner
pjw@rhyme.com.au
In reply to: The Hermit Hacker (#12)
Re: How to get around LIKE inefficiencies?

At 23:12 5/11/00 -0400, The Hermit Hacker wrote:

> Except, if we are telling it to get rid of using the index, may as well
> get rid of it altogether, as updates/inserts would be slowed down by
> having to update that too ...

So long as you don't ever need the index for anything else, then getting
rid of it is by far the best solution. But, eg, if you want to check if a
page is already indexed you will probably end up with a sequential search.


#15Ron Chmara
ron@Opus1.COM
In reply to: The Hermit Hacker (#1)
Re: How to get around LIKE inefficiencies?

The Hermit Hacker wrote:

> I'm trying to figure out how to speed up udmsearch when run under
> PostgreSQL, and am being hit by atrocious performance when using a LIKE
> query ... the query looks like:
>
> SELECT ndict.url_id,ndict.intag
> FROM ndict,url
> WHERE ndict.word_id=1971739852
> AND url.rec_id=ndict.url_id
> AND (url.url LIKE 'http://www.postgresql.org/%');
>
> Take off the AND ( LIKE ) part of the query and it finishes almost as
> soon as you hit return. Put it back in, and you can go for coffee
> before it finishes ...

The entire *approach* is wrong. I'm currently in the process of optimizing
a db used for logfile mining, and it was originally built with the same
kludge ... it seems to make sense when there are only a few thousand
records, but at 20 million records, yikes!

The problem is that there's a "like" operation for something that is
fundamentally static (http://www.postgresql.org/) with some varying
data *after it*, that you're not using, in any form, for this operation.
This can be solved one of two ways:

1. Preprocess your files to strip out the paths and arguments into
a new field for the server part. You are only setting up that data once,
so you shouldn't be using a "like" operator for every query. It's not
like on Monday the server is "http://www.postgresql.org/1221" and on
Tuesday the server is "http://www.postgresql.org/12111". It's always
the *same server*, so split that data out into its own column, with its
own index.

This turns your query into:
SELECT ndict.url_id,ndict.intag
FROM ndict,url
WHERE ndict.word_id=1971739852
AND url.rec_id=ndict.url_id
AND url.server_url='http://www.postgresql.org/';

2. Trigger to do the above, if you're doing on-the-fly inserts into
your db (so you can't pre-process).
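
A sketch of both pieces (the column, function, and trigger names here are
hypothetical, and the string slicing assumes every URL has at least a
trailing slash after the host):

ALTER TABLE url ADD COLUMN server_url text;
CREATE INDEX url_server_url ON url (server_url);

-- 7.0-era plpgsql: fill in server_url as rows come in
CREATE FUNCTION url_set_server() RETURNS opaque AS '
DECLARE
    rest text;
BEGIN
    -- everything after "://"
    rest := substr(NEW.url, position(''://'' in NEW.url) + 3);
    -- keep scheme + host + the first "/" after the host
    NEW.server_url := substr(NEW.url, 1,
        position(''://'' in NEW.url) + 2 + position(''/'' in rest));
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER url_server_trig BEFORE INSERT OR UPDATE ON url
    FOR EACH ROW EXECUTE PROCEDURE url_set_server();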

-Ronabop

--
Brought to you from iBop the iMac, a MacOS, Win95, Win98, LinuxPPC machine,
which is currently in MacOS land. Your bopping may vary.

#16The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#13)
Re: How to get around LIKE inefficiencies?

On Sun, 5 Nov 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
>> On Mon, 6 Nov 2000, Philip Warner wrote:
>>> At 21:59 5/11/00 -0500, Tom Lane wrote:
>>>> Looks like a great kluge to me ;-)
>>> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}
>> Except, if we are telling it to get rid of using the index, may as well
>> get rid of it altogether, as updates/inserts would be slowed down by
>> having to update that too ...
>
> Sure --- but do you have any other query types where the index *is*
> useful? If so, Philip's idea will let you suppress use of the index
> for just this one kind of query.

Actually, it looks like they got a bit smart, and they search for the URL
in the url table based on the CRC32 value instead of text ...

#17The Hermit Hacker
scrappy@hub.org
In reply to: Philip Warner (#14)
Re: How to get around LIKE inefficiencies?

On Mon, 6 Nov 2000, Philip Warner wrote:

> At 23:12 5/11/00 -0400, The Hermit Hacker wrote:
>> Except, if we are telling it to get rid of using the index, may as well
>> get rid of it altogether, as updates/inserts would be slowed down by
>> having to update that too ...
>
> So long as you don't ever need the index for anything else, then getting
> rid of it is by far the best solution. But, eg, if you want to check if a
> page is already indexed you will probably end up with a sequential search.

Except it appears that they store both the text of the URL *and* the
CRC32 value of it, and do other queries based on the CRC32 value ...
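
Something like (a sketch; udmsearch's actual column names may differ):

-- equality on an indexed integer instead of a text comparison;
-- re-checking the text guards against CRC collisions
CREATE INDEX url_crc ON url (crc32);

SELECT rec_id FROM url
WHERE crc32 = 1461352354   -- hypothetical CRC32 of the URL text
AND url = 'http://www.postgresql.org/';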

#18Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#9)
Re: How to get around LIKE inefficiencies?

I am adding a new TODO item:

* Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
  ANALYZE, and CLUSTER

Seems we should be able to emit NOTICE messages suggesting performance
improvements.

> Philip Warner <pjw@rhyme.com.au> writes:
>> Could you trick it into not using the index (and into using the other
>> strategy) by using a calculation:
>>
>> AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
>>
>> it's a bit nasty.
>
> Looks like a great kluge to me ;-)
>
> regards, tom lane

#19Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#18)
Re: How to get around LIKE inefficiencies?

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> I am adding a new TODO item:
>> * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
>>   ANALYZE, and CLUSTER
>> Seems we should be able to emit NOTICE messages suggesting performance
>> improvements.
>
> This would be targeted to help those who refuse to read documentation?
>
> I'm not following the point here...

Well, I think there is some confusion about when CLUSTER is good, and I
can see people turning it on and running their application to look for
things they forgot, like indexes or vacuum analyze.

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#18)
Re: How to get around LIKE inefficiencies?

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> I am adding a new TODO item:
> * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
>   ANALYZE, and CLUSTER
> Seems we should be able to emit NOTICE messages suggesting performance
> improvements.

This would be targeted to help those who refuse to read documentation?

I'm not following the point here...

regards, tom lane

#21The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#19)
Re: How to get around LIKE inefficiencies?

On Sun, 5 Nov 2000, Bruce Momjian wrote:

>> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>> I am adding a new TODO item:
>>> * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
>>>   ANALYZE, and CLUSTER
>>> Seems we should be able to emit NOTICE messages suggesting performance
>>> improvements.
>>
>> This would be targeted to help those who refuse to read documentation?
>>
>> I'm not following the point here...
>
> Well, I think there is some confusion about when CLUSTER is good, and I
> can see people turning it on and running their application to look for
> things they forgot, like indexes or vacuum analyze.

So, we are gonna have an AI built into the database now too? My
experience to date is that each scenario is different for what can be
done to fix something ... as my last problem shows, I could remove the
index, since it isn't used anywhere else that I'm aware of, or, as
Philip pointed out, I could change my query ...

Now, this 'PERFORMANCE_TIPS', would it have to be smart enough to think
about Philip's solution, or only Tom's? How is such a knowledge base
maintained? Who is turned off of PgSQL when they enable that, and it
doesn't help their performance even though they follow the
recommendations?

#22Bruce Momjian
pgman@candle.pha.pa.us
In reply to: The Hermit Hacker (#21)
Re: How to get around LIKE inefficiencies?

> So, we are gonna have an AI built into the database now too? My
> experience to date is that each scenario is different for what can be
> done to fix something ... as my last problem shows, I could remove the
> index, since it isn't used anywhere else that I'm aware of, or, as
> Philip pointed out, I could change my query ...
>
> Now, this 'PERFORMANCE_TIPS', would it have to be smart enough to think
> about Philip's solution, or only Tom's? How is such a knowledge base
> maintained? Who is turned off of PgSQL when they enable that, and it
> doesn't help their performance even though they follow the
> recommendations?

Well, I think it would be helpful to catch the most obvious things
people forget, but if no one thinks it's a good idea, I will yank it.

#23Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#22)
Re: How to get around LIKE inefficiencies?

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Well, I think it would be helpful to catch the most obvious things
>> people forget, but if no one thinks it's a good idea, I will yank it.
>
> If you've got an idea *how* to do it in any sort of reliable fashion,
> I'm all ears. But it sounds more like pie-in-the-sky to me.

But I like pie. :-)

Well, we could throw a message when the optimizer tries to get
statistics on a column with no analyze stats, or table stats on a table
that has never been vacuumed, or does a sequential scan on a table that
has >50% expired rows.

We could throw a message when a query does an index scan that bounces
all over the heap looking for a single value. We could throw a message
when a constant is compared to a column, and there is no index on the
column.

Not perfect, but it would help catch some obvious things people forget.

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#22)
Re: How to get around LIKE inefficiencies?

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Well, I think it would be helpful to catch the most obvious things
> people forget, but if no one thinks it's a good idea, I will yank it.

If you've got an idea *how* to do it in any sort of reliable fashion,
I'm all ears. But it sounds more like pie-in-the-sky to me.

regards, tom lane

#25Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: The Hermit Hacker (#1)
Re: How to get around LIKE inefficiencies?

> I'm running this on a dual-PIII 450 server, 512MB of RAM, zero
> swap space being used ... the database has its indices on one hard drive,
> the tables themselves are on a second one ... it's PgSQL 7.0.2 (Tom,
> anything in v7.0.3 that might improve this?) and startup is as:
>
> #!/bin/tcsh
> setenv PORT 5432
> setenv POSTMASTER /pgsql/bin/postmaster
> unlimit
> ${POSTMASTER} -B 384 -N 192 \
>     -o "-F -S 32768" \
>     -i -p ${PORT} -D/pgsql/data \
>     >& /pgsql/logs/postmaster.${PORT}.$$ &

BTW, you have 32MB of sort space for each backend, and you allow up to
192 concurrent backends, so the whole postgres installation requires at
least 192*32MB = 6144MB of memory if all 192 users try to connect to
your server at the same time! I would suggest adding enough swap space
or decreasing the -S setting ...
--
Tatsuo Ishii