How to get around LIKE inefficiencies?

Started by The Hermit Hacker · 25 messages
#1The Hermit Hacker
scrappy@hub.org

I'm trying to figure out how to speed up udmsearch when run under
PostgreSQL, and am being hit by atrocious performance when using a LIKE
query ... the query looks like:

SELECT ndict.url_id,ndict.intag
FROM ndict,url
WHERE ndict.word_id=1971739852
AND url.rec_id=ndict.url_id
AND (url.url LIKE 'http://www.postgresql.org/%');

Take off the AND ( LIKE ) part of the query and it finishes almost as
soon as you hit return. Put it back in, and you can go for coffee before
it finishes ...

If I do 'SELECT url_id FROM ndict WHERE word_id=1971739852', there
are 153 records returned ... is there some way that I'm not thinking of
to re-write the above so that it 'resolves' the equality before the LIKE,
in order to reduce the number of tuples it has to run the LIKE on? Is
there some way of writing the above so that it doesn't take forever to
execute?

I'm running this on a dual-PIII 450 server, 512MB of RAM, zero
swap space being used ... the database has its indices on one hard drive,
the tables themselves are on a second one ... it's PgSQL 7.0.2 (Tom,
anything in v7.0.3 that might improve this?) and startup is as:

#!/bin/tcsh
setenv PORT 5432
setenv POSTMASTER /pgsql/bin/postmaster
unlimit
${POSTMASTER} -B 384 -N 192 \
    -o "-F -S 32768" \
    -i -p ${PORT} -D/pgsql/data \
    >& /pgsql/logs/postmaster.${PORT}.$$ &

So it's not like I'm not throwing a lot of resources at this ...

Is there anything that we can do to improve this? I was trying to
think of some way to use a subselect to narrow the search results, or
something ...

Oh, EXPLAIN on the above comes down to:

NOTICE: QUERY PLAN:

Nested Loop (cost=0.00..1195.14 rows=1 width=10)
-> Index Scan using url_url on url (cost=0.00..2.73 rows=1 width=4)
-> Index Scan using n_word on ndict (cost=0.00..1187.99 rows=353 width=6)

EXPLAIN

ndict: 663018 tuples
url: 29276 tuples

Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#1)
Re: How to get around LIKE inefficiencies?

A brute-force answer would be to remove the url_url index ;-)
dunno if that would slow down other queries, however.
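
For reference, that would just be (a sketch, assuming the index really is
named url_url, as the EXPLAIN output above suggests):

-- drop the index the planner keeps choosing to drive the LIKE clause
DROP INDEX url_url;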

regards, tom lane

#3Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#2)
Re: How to get around LIKE inefficiencies?

Sorry to be getting in here late. Have you tried CLUSTER? If it is
using an index scan, and it is slow, CLUSTER often helps, especially
when there are several duplicate matches, as there are with LIKE. Let me
know how that works.
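
For example (a sketch using the 7.0-era syntax, and assuming the n_word
index from the EXPLAIN output is the one worth clustering on):

-- physically reorder ndict by its word index, so the ~150 matching
-- tuples for a given word_id sit next to each other in the heap
CLUSTER n_word ON ndict;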

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.
>
> regards, tom lane

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#3)
Re: How to get around LIKE inefficiencies?

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Sorry to be getting in here late. Have you tried CLUSTER?

Prolly won't help much. I think what he's getting burnt by
is that the planner thinks that an indexscan based on the
LIKE 'http://www.postgresql.org/%' condition will be extremely
selective --- it has no idea that most of the URLs in his table
will match that prefix. It's ye same olde nonuniform-distribution
problem; until we have better statistics, there's not much hope
for a non-kluge solution.
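
To spell out what the planner does with that clause: with a C locale it
expands the prefix LIKE into an index range condition, roughly

-- a sketch of the derived range; the upper bound is the prefix with its
-- last byte incremented, so the trailing '/' becomes '0'
AND url.url >= 'http://www.postgresql.org/'
AND url.url <  'http://www.postgresql.org0'

and then estimates how selective that range is --- an estimate that goes
badly wrong when most of the table shares the prefix.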

regards, tom lane

#5Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: How to get around LIKE inefficiencies?

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Sorry to be getting in here late. Have you tried CLUSTER?

> Prolly won't help much. I think what he's getting burnt by
> is that the planner thinks that an indexscan based on the
> LIKE 'http://www.postgresql.org/%' condition will be extremely
> selective --- it has no idea that most of the URLs in his table
> will match that prefix. It's ye same olde nonuniform-distribution
> problem; until we have better statistics, there's not much hope
> for a non-kluge solution.

But I think it will help. There will be lots of index lookups, but they
will be sequential in the heap, not random over the heap.

#6Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: How to get around LIKE inefficiencies?

Yes, I am waiting to hear back on that.

> At 21:47 5/11/00 -0500, Tom Lane wrote:
>> It's ye same olde nonuniform-distribution
>> problem; until we have better statistics, there's not much hope
>> for a non-kluge solution.
>
> Wasn't somebody trying to do something with that a few weeks back?


#7Philip Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#2)
Re: How to get around LIKE inefficiencies?

At 21:28 5/11/00 -0500, Tom Lane wrote:

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.

Could you trick it into not using the index (and into using the other
strategy) by using a calculation:

SELECT ndict.url_id,ndict.intag
FROM ndict,url
WHERE ndict.word_id=1971739852
AND url.rec_id=ndict.url_id
AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');

It's a bit nasty, but because the left-hand side of the LIKE is now an
expression rather than the bare indexed column, the planner can no longer
match it to the url_url index.

If you had 7.1, the following might work:

SELECT url_id, intag FROM
    (SELECT ndict.url_id, ndict.intag, url.url
     FROM ndict, url
     WHERE ndict.word_id=1971739852
     AND url.rec_id=ndict.url_id) AS zzz
WHERE
    zzz.url LIKE 'http://www.postgresql.org/%';

But I don't know how subselect-in-from handles this sort of query - it
might be 'clever' enough to fold it somehow.
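
If it does fold it, one equally kludgy fence that might work is to give
the subselect a LIMIT/OFFSET clause, on the (untested) assumption that
7.1 won't flatten a subquery that carries one:

SELECT url_id, intag FROM
    (SELECT ndict.url_id, ndict.intag, url.url
     FROM ndict, url
     WHERE ndict.word_id=1971739852
     AND url.rec_id=ndict.url_id
     OFFSET 0) AS zzz  -- OFFSET 0 is a no-op that may discourage folding
WHERE
    zzz.url LIKE 'http://www.postgresql.org/%';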

----------------------------------------------------------------
Philip Warner | Albatross Consulting Pty. Ltd. (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81 | Fax: (+61) 0500 83 82 82
http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371

#8Philip Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#4)
Re: How to get around LIKE inefficiencies?

At 21:47 5/11/00 -0500, Tom Lane wrote:

> It's ye same olde nonuniform-distribution
> problem; until we have better statistics, there's not much hope
> for a non-kluge solution.

Wasn't somebody trying to do something with that a few weeks back?


#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Philip Warner (#7)
Re: How to get around LIKE inefficiencies?

Philip Warner <pjw@rhyme.com.au> writes:

> Could you trick it into not using the index (and into using the other
> strategy) by using a calculation:
>
> AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
>
> it's a bit nasty.

Looks like a great kluge to me ;-)

regards, tom lane

#10The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#2)
Re: How to get around LIKE inefficiencies?

yowch ... removing that one index makes my 'test' search (mvcc) come back
as:

[97366]: SQL 0.05s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE ndict.word_id=572517542 AND url.rec_id=ndict.url_id AND (url.url LIKE 'http://www.postgresql.org/%')

vs what we were doing before ... now, let's see how increasing the number
of docs indexed changes this ...

thanks ...

On Sun, 5 Nov 2000, Tom Lane wrote:

> A brute-force answer would be to remove the url_url index ;-)
> dunno if that would slow down other queries, however.
>
> regards, tom lane


#11Philip Warner
pjw@rhyme.com.au
In reply to: Tom Lane (#9)
Re: How to get around LIKE inefficiencies?

At 21:59 5/11/00 -0500, Tom Lane wrote:

> Looks like a great kluge to me ;-)

Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}


#12The Hermit Hacker
scrappy@hub.org
In reply to: Philip Warner (#11)
Re: How to get around LIKE inefficiencies?

On Mon, 6 Nov 2000, Philip Warner wrote:

> At 21:59 5/11/00 -0500, Tom Lane wrote:
>> Looks like a great kluge to me ;-)
>
> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}

Except, if we are telling it to get rid of using the index, may as well
get rid of it altogether, as updates/inserts would be slowed down by
having to update that too ...

I can now do three-word searches in what looks like no time ... not sure
how it will look after I have ~100k documents indexed, but at least now
it's not dead at 22k ...

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: The Hermit Hacker (#12)
Re: How to get around LIKE inefficiencies?

The Hermit Hacker <scrappy@hub.org> writes:

> On Mon, 6 Nov 2000, Philip Warner wrote:
>> At 21:59 5/11/00 -0500, Tom Lane wrote:
>>> Looks like a great kluge to me ;-)
>> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}
>
> Except, if we are telling it to get rid of using the index, may as well
> get rid of it altogether, as updates/inserts would be slowed down by
> having to update that too ...

Sure --- but do you have any other query types where the index *is*
useful? If so, Philip's idea will let you suppress use of the index
for just this one kind of query.

regards, tom lane

#14Philip Warner
pjw@rhyme.com.au
In reply to: The Hermit Hacker (#12)
Re: How to get around LIKE inefficiencies?

At 23:12 5/11/00 -0400, The Hermit Hacker wrote:

> Except, if we are telling it to get rid of using the index, may as well
> get rid of it altogether, as updates/inserts would be slowed down by
> having to update that too ...

So long as you don't ever need the index for anything else, then getting
rid of it is by far the best solution. But, eg, if you want to check if a
page is already indexed you will probably end up with a sequential search.


#15Ron Chmara
ron@Opus1.COM
In reply to: The Hermit Hacker (#1)
Re: How to get around LIKE inefficiencies?

The Hermit Hacker wrote:

> I'm trying to figure out how to speed up udmsearch when run under
> PostgreSQL, and am being hit by atrocious performance when using a LIKE
> query ... the query looks like:
>
> SELECT ndict.url_id,ndict.intag
> FROM ndict,url
> WHERE ndict.word_id=1971739852
> AND url.rec_id=ndict.url_id
> AND (url.url LIKE 'http://www.postgresql.org/%');
>
> Take off the AND ( LIKE ) part of the query and it finishes almost as
> soon as you hit return. Put it back in, and you can go for coffee
> before it finishes ...

The entire *approach* is wrong. I'm currently in the process of optimizing
a db used for logfile mining, and it was originally built with the same
kludge ... it seems to make sense when there are only a few thousand
records, but at 20 million records, yikes!

The problem is that there's a "like" operation for something that is
fundamentally static (http://www.postgresql.org/) with some varying
data *after it*, that you're not using, in any form, for this operation.
This can be solved one of two ways:

1. Preprocess your files to strip out the paths and arguments into
a new field for the server part. You are only setting up that data once,
so you shouldn't be using a "like" operator for every query. It's not
like on Monday the server is "http://www.postgresql.org/1221" and on
Tuesday the server is "http://www.postgresql.org/12111". It's always
the *same server*, so split that data out into its own column, with its
own index.

This turns your query into:
SELECT ndict.url_id,ndict.intag
FROM ndict,url
WHERE ndict.word_id=1971739852
AND url.rec_id=ndict.url_id
AND url.server_url='http://www.postgresql.org/';

2. Trigger to do the above, if you're doing on-the-fly inserts into
your db (so you can't pre-process).
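
A sketch of both pieces (the column, function, and trigger names here are
hypothetical, and the string slicing assumes every URL has at least a
trailing slash after the host):

ALTER TABLE url ADD COLUMN server_url text;
CREATE INDEX url_server_url ON url (server_url);

-- 7.0-era plpgsql: fill in server_url as rows come in
CREATE FUNCTION url_set_server() RETURNS opaque AS '
DECLARE
    rest text;
BEGIN
    -- everything after "://"
    rest := substr(NEW.url, position(''://'' in NEW.url) + 3);
    -- keep scheme + host + the first "/" after the host
    NEW.server_url := substr(NEW.url, 1,
        position(''://'' in NEW.url) + 2 + position(''/'' in rest));
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER url_server_trig BEFORE INSERT OR UPDATE ON url
    FOR EACH ROW EXECUTE PROCEDURE url_set_server();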

-Ronabop

--
Brought to you from iBop the iMac, a MacOS, Win95, Win98, LinuxPPC machine,
which is currently in MacOS land. Your bopping may vary.

#16The Hermit Hacker
scrappy@hub.org
In reply to: Tom Lane (#13)
Re: How to get around LIKE inefficiencies?

On Sun, 5 Nov 2000, Tom Lane wrote:

> The Hermit Hacker <scrappy@hub.org> writes:
>> On Mon, 6 Nov 2000, Philip Warner wrote:
>>> At 21:59 5/11/00 -0500, Tom Lane wrote:
>>>> Looks like a great kluge to me ;-)
>>> Hmph. I prefer to think of it as a 'user-defined optimizer hint'. ;-}
>> Except, if we are telling it to get rid of using the index, may as well
>> get rid of it altogether, as updates/inserts would be slowed down by
>> having to update that too ...
>
> Sure --- but do you have any other query types where the index *is*
> useful? If so, Philip's idea will let you suppress use of the index
> for just this one kind of query.

Actually, it looks like they got a bit smart, and they search for the URL
in the url table based on the CRC32 value instead of text ...

#17The Hermit Hacker
scrappy@hub.org
In reply to: Philip Warner (#14)
Re: How to get around LIKE inefficiencies?

On Mon, 6 Nov 2000, Philip Warner wrote:

> At 23:12 5/11/00 -0400, The Hermit Hacker wrote:
>> Except, if we are telling it to get rid of using the index, may as well
>> get rid of it altogether, as updates/inserts would be slowed down by
>> having to update that too ...
>
> So long as you don't ever need the index for anything else, then getting
> rid of it is by far the best solution. But, eg, if you want to check if a
> page is already indexed you will probably end up with a sequential search.

Except it appears that they store both the text of the URL *and* the
CRC32 value of it, and do other queries based on the CRC32 value ...
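
Something like (a sketch; udmsearch's actual column names may differ):

-- equality on an indexed integer instead of a text comparison;
-- re-checking the text guards against CRC collisions
CREATE INDEX url_crc ON url (crc32);

SELECT rec_id FROM url
WHERE crc32 = 1461352354   -- hypothetical CRC32 of the URL text
AND url = 'http://www.postgresql.org/';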

#18Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#9)
Re: How to get around LIKE inefficiencies?

I am adding a new TODO item:

* Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
  ANALYZE, and CLUSTER

Seems we should be able to emit NOTICE messages suggesting performance
improvements.

> Philip Warner <pjw@rhyme.com.au> writes:
>> Could you trick it into not using the index (and into using the other
>> strategy) by using a calculation:
>>
>> AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
>>
>> it's a bit nasty.
>
> Looks like a great kluge to me ;-)
>
> regards, tom lane

#19Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#18)
Re: How to get around LIKE inefficiencies?

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> I am adding a new TODO item:
>> * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
>>   ANALYZE, and CLUSTER
>> Seems we should be able to emit NOTICE messages suggesting performance
>> improvements.
>
> This would be targeted to help those who refuse to read documentation?
>
> I'm not following the point here...

Well, I think there is some confusion about when CLUSTER is good, and I
can see people turning it on and running their application to look for
things they forgot, like indexes or vacuum analyze.

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#18)
Re: How to get around LIKE inefficiencies?

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> I am adding a new TODO item:
> * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
>   ANALYZE, and CLUSTER
> Seems we should be able to emit NOTICE messages suggesting performance
> improvements.

This would be targeted to help those who refuse to read documentation?

I'm not following the point here...

regards, tom lane

#21The Hermit Hacker
scrappy@hub.org
In reply to: Bruce Momjian (#19)
Re: How to get around LIKE inefficiencies?

On Sun, 5 Nov 2000, Bruce Momjian wrote:

>> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>>> I am adding a new TODO item:
>>> * Add SET PERFORMANCE_TIPS option to suggest INDEX, VACUUM, VACUUM
>>>   ANALYZE, and CLUSTER
>>> Seems we should be able to emit NOTICE messages suggesting performance
>>> improvements.
>>
>> This would be targeted to help those who refuse to read documentation?
>>
>> I'm not following the point here...
>
> Well, I think there is some confusion about when CLUSTER is good, and I
> can see people turning it on and running their application to look for
> things they forgot, like indexes or vacuum analyze.

So, we are gonna have an AI built into the database now too? My
experience to date is that each scenario is different for what can be
done to fix something ... as my last problem shows, I could remove the
index, since it isn't used anywhere else that I'm aware of, or, as
Philip pointed out, I could change my query ...

Now, this 'PERFORMANCE_TIPS', would it have to be smart enough to think
about Philip's solution, or only Tom's? How is such a knowledge base
maintained? Who is turned off of PgSQL when they enable that, and it
doesn't help their performance even though they follow the
recommendations?

#22Bruce Momjian
pgman@candle.pha.pa.us
In reply to: The Hermit Hacker (#21)
Re: How to get around LIKE inefficiencies?

> So, we are gonna have an AI built into the database now too? My
> experience to date is that each scenario is different for what can be
> done to fix something ... as my last problem shows, I could remove the
> index, since it isn't used anywhere else that I'm aware of, or, as
> Philip pointed out, I could change my query ...
>
> Now, this 'PERFORMANCE_TIPS', would it have to be smart enough to think
> about Philip's solution, or only Tom's? How is such a knowledge base
> maintained? Who is turned off of PgSQL when they enable that, and it
> doesn't help their performance even though they follow the
> recommendations?

Well, I think it would be helpful to catch the most obvious things
people forget, but if no one thinks it's a good idea, I will yank it.

#23Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#22)
Re: How to get around LIKE inefficiencies?

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> Well, I think it would be helpful to catch the most obvious things
>> people forget, but if no one thinks it's a good idea, I will yank it.
>
> If you've got an idea *how* to do it in any sort of reliable fashion,
> I'm all ears. But it sounds more like pie-in-the-sky to me.

But I like pie. :-)

Well, we could throw a message when the optimizer tries to get
statistics on a column with no analyze stats, or table stats on a table
that has never been vacuumed, or does a sequential scan on a table that
has >50% expired rows.

We could throw a message when a query does an index scan that bounces
all over the heap looking for a single value. We could throw a message
when a constant is compared to a column, and there is no index on the
column.

Not perfect, but it would help catch some obvious things people forget.

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#22)
Re: How to get around LIKE inefficiencies?

Bruce Momjian <pgman@candle.pha.pa.us> writes:

> Well, I think it would be helpful to catch the most obvious things
> people forget, but if no one thinks it's a good idea, I will yank it.

If you've got an idea *how* to do it in any sort of reliable fashion,
I'm all ears. But it sounds more like pie-in-the-sky to me.

regards, tom lane

#25Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: The Hermit Hacker (#1)
Re: How to get around LIKE inefficiencies?

> I'm running this on a dual-PIII 450 server, 512MB of RAM, zero
> swap space being used ... the database has its indices on one hard drive,
> the tables themselves are on a second one ... it's PgSQL 7.0.2 (Tom,
> anything in v7.0.3 that might improve this?) and startup is as:
>
> #!/bin/tcsh
> setenv PORT 5432
> setenv POSTMASTER /pgsql/bin/postmaster
> unlimit
> ${POSTMASTER} -B 384 -N 192 \
>     -o "-F -S 32768" \
>     -i -p ${PORT} -D/pgsql/data \
>     >& /pgsql/logs/postmaster.${PORT}.$$ &

BTW, you have 32MB of sort space for each backend, and you allow up to
192 concurrent backends, so the whole postgres installation requires at
least 192*32MB = 6144MB of memory if all 192 users try to connect to
your server at the same time! I would suggest adding enough swap space
or decreasing the -S setting ...
--
Tatsuo Ishii