Performance Killer 'IN' ?

Started by Kai Hessingabout 20 years ago12 messagesgeneral

kai.hessing@hobsons.de

about 20 years ago

Hossa,

I just made a little test on our test-database. I have an excel sheet
with about 2000 entries that should be updated with the same value. In a
first try I generated an SQL-Syntax for every entry like:

UPDATE xyz SET status=-6 WHERE id=xyz1 AND status>-1;
UPDATE xyz SET status=-6 WHERE id=xyz2 AND status>-1;
UPDATE xyz SET status=-6 WHERE id=... AND status>-1;

The execution of the ~2000 SQL-commands took about 5-10 seconds.

Then I tried the same with generating only one request using IN with the
twothousand entries like:

UPDATE xyz WHERE id IN (xyz1, xyz2, ....) AND status>-1;

and it took about 10 Minutes to execute. So it is nearly a hundred times
slower. Can this be verified? Is there anything that can be done about
that? Else I would need to have a few words with our programmers...

By the way the testsystem is a basic Suse 9.3 with a default postgres
installation 8.0.x

Thanks and
*greets*
Kai

kai.hessing@hobsons.de

about 20 years ago

In reply to: Kai Hessing (#1)

Re: Performance Killer 'IN' ?

and it took about 10 Minutes to execute. So it is nearly a hundred times
slower. Can this be verified? Is there anything that can be done about
that? Else I would need to have a few words with our programmers...

OK... I overstated a little bit. The real numbers are:

1.) 21.5 seconds
2.) 363.7 seconds

But it is still a significant difference.

Joshua D. Drake

jd@commandprompt.com

about 20 years ago

In reply to: Kai Hessing (#2)

Re: Performance Killer 'IN' ?

Kai Hessing wrote:

and it took about 10 Minutes to execute. So it is nearly a hundred times
slower. Can this be verified? Is there anything that can be done about
that? Else I would need to have a few words with our programmers...

OK... I overstated a little bit. The real numbers are:

1.) 21.5 seconds
2.) 363.7 seconds

But it is still a significant difference.

Can you provide an explain analyze of each query?

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

kai.hessing@hobsons.de

about 20 years ago

In reply to: Joshua D. Drake (#3)

Re: Performance Killer 'IN' ?

Joshua D. Drake wrote:

Kai Hessing wrote:

1.) 21.5 seconds
2.) 363.7 seconds

But it is still a significant difference.

Can you provide an explain analyze of each query?

There isn't an explain analyze of the first variant, because there are
just 2000 SQL-Updates. The explain analyze for just one of the commands
(UPDATE xyz SET status=-6 WHERE phon='xyz1' AND status>-1;) is:
------------------
Index Scan using phon_phon_idx on phon (cost=0.00..5193.83 rows=530
width=148) (actual time=0.146..0.146 rows=0 loops=1)

Index Cond: ((phon)::text = 'xyz'::text)

Filter: (status > -1)

Total runtime: 0.387 ms

1 Datensï¿œtze (means data sets)

Laufzeit gesamt: 16.682 ms (means running time)
------------------

The second one (UPDATE xyz WHERE id IN (xyz1, xyz2, ....) AND
status>-1;) returns:
------------------
Seq Scan on phon (cost=0.00..1573304.58 rows=105931 width=148) (actual
time=369563.565..369563.565 rows=0 loops=1)

Filter: ((((phon)::text = 'xyz1) OR ((phon)::text = 'xyz2'::text) OR
((phon)::text = 'xyz3'::text) OR ((phon)::text = 'xyz4'::text) OR ...
[all the 2000 entries from the IN clause]

Total runtime: 369566.954 ms

667 Datensï¿œtze (means data sets)

Laufzeit gesamt: 370,179.246 ms (means running time)
------------------

Hope, that helps. Have a nice weekend. Being back to work on monday ;)

*greets*
Kai

markokr@gmail.com

about 20 years ago

In reply to: Kai Hessing (#4)

Re: Performance Killer 'IN' ?

On 3/31/06, Kai Hessing <kai.hessing@hobsons.de> wrote:

The second one (UPDATE xyz WHERE id IN (xyz1, xyz2, ....) AND
status>-1;) returns:
------------------
Seq Scan on phon (cost=0.00..1573304.58 rows=105931 width=148) (actual
time=369563.565..369563.565 rows=0 loops=1)

Just a shot in the dark: does the plan stay the same,
when you remove the ' AND status > -1' ?

--
marko

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Kai Hessing (#4)

Re: Performance Killer 'IN' ?

Kai Hessing <kai.hessing@hobsons.de> writes:

Index Scan using phon_phon_idx on phon (cost=0.00..5193.83 rows=530
width=148) (actual time=0.146..0.146 rows=0 loops=1)
...
Seq Scan on phon (cost=0.00..1573304.58 rows=105931 width=148) (actual
time=369563.565..369563.565 rows=0 loops=1)

You need to look into the discrepancy between estimated and actual row
counts. (I suppose the reason you're showing 0 rows here is that you
already did these UPDATEs and so none of the rows in question pass the
status filter --- but how many rows are there matching the phon index
conditions?) Perhaps a larger statistics target for the phon column
would be a good idea.

regards, tom lane

kai.hessing@hobsons.de

about 20 years ago

In reply to: Marko Kreen (#5)

Re: Performance Killer 'IN' ?

Marko Kreen wrote:

On 3/31/06, Kai Hessing <kai.hessing@hobsons.de> wrote:

The second one (UPDATE xyz WHERE id IN (xyz1, xyz2, ....) AND
status>-1;) returns:
------------------
Seq Scan on phon (cost=0.00..1573304.58 rows=105931 width=148) (actual
time=369563.565..369563.565 rows=0 loops=1)

Just a shot in the dark: does the plan stay the same,
when you remove the ' AND status > -1' ?

No difference: I skipped the 'AND status > -1' and have the following
results...

Using 2000x 'UPDATE phon SET status=-6 WHERE ' returns each time:
----------------------------
Index Scan using phon_phon_idx on phon (cost=0.00..5179.80 rows=1587
width=148) (actual time=31.452..31.470 rows=1 loops=1)

Index Cond: ((phon)::text = 'wink4103@uni-trier.de'::text)

Total runtime:3.414 ms
----------------------------
(Total runtime for all 2000 Updates: 23335.393 ms

Using the 'UPDATE xyz WHERE id IN ('xyz1', 'xyz2', other 2000
values.....)' returns:
----------------------------
Seq Scan on phon (cost=0.00..1564960.67 rows=317227 width=148) (actual
time=68.315..365621.761 rows=2522 loops=1)

Filter: (((phon)::text = '.....

Total runtime: 393182.745 ms
----------------------------

kai.hessing@hobsons.de

about 20 years ago

In reply to: Tom Lane (#6)

Re: Performance Killer 'IN' ?

Tom Lane wrote:

Kai Hessing <kai.hessing@hobsons.de> writes:

Index Scan using phon_phon_idx on phon (cost=0.00..5193.83 rows=530
width=148) (actual time=0.146..0.146 rows=0 loops=1)
...
Seq Scan on phon (cost=0.00..1573304.58 rows=105931 width=148) (actual
time=369563.565..369563.565 rows=0 loops=1)

You need to look into the discrepancy between estimated and actual row
counts. (I suppose the reason you're showing 0 rows here is that you
already did these UPDATEs and so none of the rows in question pass the
status filter --- but how many rows are there matching the phon index
conditions?) Perhaps a larger statistics target for the phon column
would be a good idea.

Yes... The 0 rows are there because I did the command before. Now I
resetted the test database to a previous state and dopped the 'AND
status>-1' in the SQL-syntax:

Using the 'UPDATE xyz WHERE id IN ('xyz1', 'xyz2', other 2000
values.....)' returns:
----------------------------
Seq Scan on phon (cost=0.00..1564960.67 rows=317227 width=148) (actual
time=68.315..365621.761 rows=2522 loops=1)

Filter: (((phon)::text = '.....

Total runtime: 393182.745 ms
----------------------------

(please see also the other post <49ck9sFo32mbU1@individual.net> )

What do you mean with larger statistics target?

markokr@gmail.com

about 20 years ago

In reply to: Kai Hessing (#7)

Re: Performance Killer 'IN' ?

On 4/3/06, Kai Hessing <kai.hessing@hobsons.de> wrote:

Marko Kreen wrote:

Just a shot in the dark: does the plan stay the same,
when you remove the ' AND status > -1' ?

No difference: I skipped the 'AND status > -1' and have the following
results...

Ok. Thanks. I once had similar query and it seemed that postgres
got confused if there was big IN and alse something else in WHERE
clause. Dropping the rest got postgres to use index for IN.

But I did not have time to research it then and with your result
seems it's not a problem in recent versions.

--
marko

tgl@sss.pgh.pa.us

about 20 years ago

In reply to: Kai Hessing (#8)

Re: Performance Killer 'IN' ?

Kai Hessing <kai.hessing@hobsons.de> writes:

Yes... The 0 rows are there because I did the command before. Now I
resetted the test database to a previous state and dopped the 'AND
status>-1' in the SQL-syntax:

Using the 'UPDATE xyz WHERE id IN ('xyz1', 'xyz2', other 2000
values.....)' returns:
----------------------------
Seq Scan on phon (cost=0.00..1564960.67 rows=317227 width=148) (actual
time=68.315..365621.761 rows=2522 loops=1)

Filter: (((phon)::text = '.....

Well, here's our problem it would seem: the planner is estimating the IN
clause to match 317227 rows, rather than the actual 2522. That's
naturally going to bias it against choosing an indexscan. You need to
get that estimate closer before there's going to be much chance of
choosing the right plan.

What do you mean with larger statistics target?

See ALTER TABLE SET STATISTICS, or just change default_statistics_target
and re-ANALYZE.

regards, tom lane

kai.hessing@hobsons.de

about 20 years ago

In reply to: Tom Lane (#10)

Re: Performance Killer 'IN' ?

Tom Lane wrote:

Well, here's our problem it would seem: the planner is estimating the IN
clause to match 317227 rows, rather than the actual 2522. That's
naturally going to bias it against choosing an indexscan. You need to
get that estimate closer before there's going to be much chance of
choosing the right plan.

What do you mean with larger statistics target?

See ALTER TABLE SET STATISTICS, or just change default_statistics_target
and re-ANALYZE.

Thanks, that definitly looks like a starting point. I will test it and
post my results. Btw. what happens if the estimation would be to low?

kai.hessing@hobsons.de

about 20 years ago

In reply to: Kai Hessing (#11)

Re: Performance Killer 'IN' ?

Kai Hessing wrote:

What do you mean with larger statistics target?

See ALTER TABLE SET STATISTICS, or just change default_statistics_target
and re-ANALYZE.

Thanks, that definitly looks like a starting point. I will test it and
post my results. Btw. what happens if the estimation would be to low?

unbelievable... I found the problem: Doing a simple ANALYZE after
resetting the test-db to a previous state is my friend. The results now are:

---------
Index Scan using phon_phon_idx, phon_phon_idx, phon_phon_idx, .....
(cost=0.00..56793.09 rows=4068 width=53) (actual time=0.155..363.810
rows=2534 loops=1)

Index Cond: (((phon)::text = 'xyz'::text) OR .....

Filter: (status > -1)

Total runtime: 11140.480 ms
---------

Thank you very much, this was the right hint!