performance question (something to do w/ parameterized stmts?, wrong index types?)

Started by Jeffrey Tenny · almost 20 years ago · 61 messages · hackers
#1 Jeffrey Tenny
jeffrey.tenny@comcast.net

Why does this query take so long? (PostgreSQL 8.0.3, FC4)
Hopefully I have provided enough information below.

LOG: statement: SELECT * FROM x WHERE f IN
($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,\
$25,$26,$27,$28,$29,$30,$31,$32,$33,$34,$35,$36,$37,$38,$39,$40,$41,$42,$43,$44,$45,$46,$47,$48,$49,$50,$51,$52,$53,$54,$55,$56,$57,$58,$59,$60,$61,$62,$63\
,$64,$65,$66,$67,$68,$69,$70,$71,$72,$73,$74,$75,$76,$77,$78,$79,$80,$81,$82,$83,$84,$85,$86,$87,$88,$89,$90,$91,$92,$93,$94,$95,$96,$97,$98,$99,$100,$101,\
$102,$103,$104,$105,$106,$107,$108,$109,$110,$111,$112,$113,$114,$115,$116,$117,$118,$119,$120,$121,$122,$123,$124,$125,$126,$127,$128,$129,$130,$131,$132,\
$133,$134,$135,$136,$137,$138,$139,$140,$141,$142,$143,$144,$145,$146,$147,$148,$149,$150,$151,$152,$153,$154,$155,$156,$157,$158,$159,$160,$161,$162,$163,\
$164,$165,$166,$167,$168,$169,$170,$171,$172,$173,$174,$175,$176,$177,$178,$179,$180,$181,$182,$183,$184,$185,$186,$187,$188,$189,$190,$191,$192,$193,$194,\
$195,$196,$197,$198,$199,$200,$201,$202,$203,$204,$205,$206,$207,$208,$209,$210,$211,$212,$213,$214,$215,$216,$217,$218,$219,$220,$221,$222,$223,$224,$225,\
$226,$227,$228,$229,$230,$231,$232,$233,$234,$235,$236,$237,$238,$239,$240,$241,$242,$243,$244,$245,$246,$247,$248,$249,$250,$251,$252,$253,$254,$255,$256,\
$257,$258,$259,$260,$261,$262,$263,$264,$265,$266,$267,$268,$269,$270,$271,$272,$273,$274,$275,$276,$277,$278,$279,$280,$281,$282,$283,$284,$285,$286,$287,\
$288,$289,$290,$291,$292,$293,$294,$295,$296,$297,$298,$299,$300,$301,$302,$303,$304,$305,$306,$307,$308,$309,$310,$311,$312,$313,$314,$315,$316,$317,$318,\
$319,$320,$321,$322,$323,$324,$325,$326,$327,$328,$329,$330,$331,$332,$333,$334,$335,$336,$337,$338,$339,$340,$341,$342,$343,$344,$345,$346,$347,$348,$349,\
$350,$351,$352,$353,$354,$355,$356,$357,$358,$359,$360,$361,$362,$363,$364,$365,$366,$367,$368,$369,$370,$371,$372,$373,$374,$375,$376,$377,$378,$379,$380,\
$381,$382,$383,$384,$385,$386,$387,$388,$389,$390,$391,$392,$393,$394,$395,$396,$397,$398,$399,$400,$401,$402,$403,$404,$405,$406,$407,$408,$409,$410,$411,\
$412,$413,$414,$415,$416,$417,$418,$419,$420,$421,$422,$423,$424,$425,$426,$427,$428,$429,$430,$431,$432,$433,$434,$435,$436,$437,$438,$439,$440,$441,$442,\
$443,$444,$445,$446,$447,$448,$449,$450,$451,$452,$453,$454,$455,$456,$457,$458,$459,$460,$461,$462,$463,$464,$465,$466,$467,$468,$469,$470,$471,$472,$473,\
$474,$475,$476,$477,$478,$479,$480,$481,$482,$483,$484,$485,$486,$487,$488,$489,$490,$491,$492,$493,$494,$495,$496,$497,$498,$499,$500,$501,$502,$503,$504,\
$505,$506,$507,$508,$509,$510,$511,$512,$513,$514,$515,$516,$517,$518,$519,$520,$521,$522,$523,$524,$525,$526,$527,$528,$529,$530,$531,$532,$533,$534,$535,\
$536,$537,$538,$539,$540,$541,$542,$543,$544,$545,$546,$547,$548,$549,$550,$551,$552,$553,$554,$555,$556,$557,$558,$559,$560,$561,$562,$563,$564,$565,$566,\
$567,$568,$569,$570,$571,$572,$573,$574,$575,$576,$577,$578,$579,$580,$581,$582,$583,$584,$585,$586,$587,$588,$589,$590,$591,$592,$593,$594,$595,$596,$597,\
$598,$599,$600,$601,$602,$603,$604,$605,$606,$607,$608,$609,$610,$611,$612,$613,$614,$615,$616,$617,$618,$619,$620,$621,$622,$623,$624,$625,$626,$627,$628,\
$629,$630,$631,$632,$633,$634,$635,$636,$637,$638,$639,$640,$641,$642,$643,$644,$645,$646,$647,$648,$649,$650)
ORDER BY f,c

LOG: EXECUTOR STATISTICS
DETAIL: ! system usage stats:
! 10.282945 elapsed 10.234444 user 0.048992 system sec
! [25.309152 user 0.500923 sys total]
! 0/0 [0/0] filesystem blocks in/out
! 0/0 [0/10397] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 0/15 [291/55] voluntary/involuntary context switches
! buffer usage stats:
! Shared blocks: 0 read, 0 written, buffer hit rate = 100.00%
! Local blocks: 0 read, 0 written, buffer hit rate = 0.00%
! Direct blocks: 0 read, 0 written

Here is the table description:

Table "public.x"
Column | Type | Modifiers
--------+---------+-----------
f | integer | not null
c | integer | not null
r | integer | not null
n | integer | not null
Indexes:
"x_c_idx" btree (c)
"x_f_idx" btree (f)
"testindex2" btree (f, c)

There are only 2,369 records in the X table.

I don't understand why this query should take 10 seconds in the executor
phase, with so little data being managed, and all relevant data already
in memory. Any clues?

Maybe there are more database server debugging options I should have
tweaked, but I'm not sure what. The stuff I turned on included:

log_duration = true
log_statement = 'all'
log_parser_stats = true
log_planner_stats = true
log_executor_stats = true

(N.B. log_statement_stats = true caused a server startup failure every time,
with no error message I could find, so it was deliberately left unset.)

stats_start_collector = true
stats_command_string = true
stats_block_level = true
stats_row_level = true
stats_reset_on_server_start = true

(FYI: 10 secs matters because this query is executed many times in
my application, nearly all of those executions are this slow, and the
aggregate query time is killing my app's response.)

Thanks for any tips!

#2 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Jeffrey Tenny (#1)
Re: performance question (something to do w/ parameterized stmts?, wrong index types?)

What's EXPLAIN ANALYZE show?

On Mon, May 08, 2006 at 01:29:28PM -0400, Jeffrey Tenny wrote:

> Why does this query take so long? (PostgreSQL 8.0.3, FC4)
> [...]

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#3 Mark Lewis
mark.lewis@mir3.com
In reply to: Jeffrey Tenny (#1)
Re: performance question (something to do w/

Doing a SELECT with a large list of variables inside an IN runs slowly
on every database we've tested. We've tested mostly in Oracle and
PostgreSQL, and both get very slow very quickly (actually Oracle refuses
to process the query at all after it gets too many bind parameters).

In our case, we have a (potentially quite large) set of external values
that we want to look up in the database. We originally thought that
doing a single select with a large IN clause was the way to go, but then
we did some performance analysis on the optimal batch size (number of
items to include per IN clause), and discovered that for most databases,
the optimal batch size was 1. For PostgreSQL I think it was 2.

The moral of the story is that you're probably better off running a
bunch of small selects than in trying to optimize things with one
gargantuan select.

-- Mark Lewis

On Mon, 2006-05-08 at 13:29 -0400, Jeffrey Tenny wrote:

> Why does this query take so long? (PostgreSQL 8.0.3, FC4)
> [...]

#4 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Mark Lewis (#3)
Re: performance question (something to do w/

On Mon, May 08, 2006 at 10:42:21AM -0700, Mark Lewis wrote:

> Doing a SELECT with a large list of variables inside an IN runs slowly
> on every database we've tested.
> [...]
> The moral of the story is that you're probably better off running a
> bunch of small selects than in trying to optimize things with one
> gargantuan select.

Ever experiment with loading the parameters into a temp table and
joining to that?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
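[A minimal sketch of the temp-table approach Jim suggests, using the table and column names from this thread; the loading step is an assumption, since how the 650 values arrive depends on the client:]

```sql
-- Inside a transaction, so ON COMMIT DROP cleans up automatically:
BEGIN;
CREATE TEMP TABLE wanted (f integer) ON COMMIT DROP;
-- On 8.0 the values go in via COPY or one INSERT per value:
INSERT INTO wanted VALUES (1);
INSERT INTO wanted VALUES (2);
-- ... and so on for each lookup value
SELECT x.* FROM x JOIN wanted USING (f) ORDER BY x.f, x.c;
COMMIT;
```

Note that unlike IN, a plain join repeats rows if wanted contains duplicate values, so either deduplicate on load or select DISTINCT.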

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#4)
Re: performance question (something to do w/

"Jim C. Nasby" <jnasby@pervasive.com> writes:

>> [...] we did some performance analysis on the optimal batch size (number of
>> items to include per IN clause), and discovered that for most databases,
>> the optimal batch size was 1. For PostgreSQL I think it was 2.
>
> Ever experiment with loading the parameters into a temp table and
> joining to that?

Also, it might be worth re-testing that conclusion with PG CVS tip
(or 8.2 when it comes out). The reimplementation of IN as = ANY that
I did a couple months ago might well change the results.

regards, tom lane
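[For reference, the = ANY form Tom mentions takes a single array parameter, so the statement text stays fixed no matter how many values are looked up; a sketch against the table from this thread, relying on the 8.2-era planner work he describes:]

```sql
-- Prepared form: one parameter regardless of the number of values.
SELECT * FROM x WHERE f = ANY ($1::integer[]) ORDER BY f, c;

-- The same thing with a literal array, for testing in psql:
SELECT * FROM x WHERE f = ANY (ARRAY[1, 2, 3]) ORDER BY f, c;
```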

#6 Kenneth Marshall
ktm@it.is.rice.edu
In reply to: Jim Nasby (#4)
Re: performance question (something to do w/

On Mon, May 08, 2006 at 12:50:13PM -0500, Jim C. Nasby wrote:

> [...]
> Ever experiment with loading the parameters into a temp table and
> joining to that?

The database usage of the DSPAM software is very similar to your use case.
Its fastest queries use PostgreSQL's generate_series functionality to unwind
the IN list into multiple single-value lookups. Here is the lookup function
it uses:

create function lookup_tokens(integer, bigint[])
returns setof dspam_token_data
language plpgsql stable
as '
declare
  v_rec record;
begin
  for v_rec in select * from dspam_token_data
      where uid = $1
        and token in (select $2[i]
                        from generate_series(array_lower($2, 1),
                                             array_upper($2, 1)) s(i))
  loop
    return next v_rec;
  end loop;
  return;
end;';

You should be able to try something similar for your workload.

Ken Marshall
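[A hypothetical invocation of that function, with the uid and token values invented purely for illustration:]

```sql
-- Returns the dspam_token_data rows for user 42 whose tokens appear in the array:
SELECT * FROM lookup_tokens(42, ARRAY[1001, 1002, 1003]::bigint[]);
```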

#7 Jeffrey Tenny
jeffrey.tenny@comcast.net
In reply to: Jim Nasby (#2)
Re: performance question (something to do w/ parameterized

Well, since I don't know the exact parameter values, just substituting
1-650 for $1-$650, I get:

Index Scan using testindex2 on x (cost=0.00..34964.52 rows=1503
width=16) (actual time=0.201..968.252 rows=677 loops=1)
Filter: ((f = 1) OR (f = 2) OR (f = 3) OR (f = 4) ...

So index usage is presumably good on this one.

Jim C. Nasby wrote:

> What's EXPLAIN ANALYZE show?

#8 Jeffrey Tenny
jeffrey.tenny@comcast.net
In reply to: Mark Lewis (#3)
Re: performance question (something to do w/ parameterized

Mark Lewis wrote:

> Doing a SELECT with a large list of variables inside an IN runs slowly
> on every database we've tested. We've tested mostly in Oracle and
> PostgreSQL, and both get very slow very quickly (actually Oracle refuses
> to process the query at all after it gets too many bind parameters).
>
> In our case, we have a (potentially quite large) set of external values
> that we want to look up in the database. We originally thought that
> doing a single select with a large IN clause was the way to go, but then
> we did some performance analysis on the optimal batch size (number of
> items to include per IN clause), and discovered that for most databases,
> the optimal batch size was 1. For PostgreSQL I think it was 2.

So that optimal batch size was measured with parameterized queries?

In my case, I was concerned about latency between app and database
server, so I try to minimize the number of queries I send to the
database server. (My app servers can be anywhere, they /should/ be
close to the database server, but there are no guarantees and I can't
control it).

The last time I tested for optimal batch size using non-parameterized
queries with same-host database and app, I got a batch size of
approximately 700 IN list elements (again, not variables in that test).
That was on postgres 7.X.Y.

Guess I'll have to try a test where I turn the parameterized statements
into regular statements.

I'm pretty sure it would be a bad idea for me to send one IN-list
element at a time in all cases. Even if the query prep was fast,
the network latency could kill my app.

> The moral of the story is that you're probably better off running a
> bunch of small selects than in trying to optimize things with one
> gargantuan select.

The algorithm currently tries to ensure that IN-lists of not more than
700 elements are sent to the database server, and breaks them into
multiple queries. If it has to break it into at least 3 queries, it
uses parameterized statements for the first 2+ and then a
non-parameterized statement for the last one (which may have a different
number of IN list elements than the prior batches).

#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeffrey Tenny (#7)
Re: performance question (something to do w/ parameterized

Jeffrey Tenny <jeffrey.tenny@comcast.net> writes:

> Well, since I don't know the exact parameter values, just substituting
> 1-650 for $1-$650, I get:
>
> Index Scan using testindex2 on x (cost=0.00..34964.52 rows=1503
> width=16) (actual time=0.201..968.252 rows=677 loops=1)
> Filter: ((f = 1) OR (f = 2) OR (f = 3) OR (f = 4) ...
>
> So index usage is presumably good on this one.

No, that's not a very nice plan at all --- the key thing to notice is
it says Filter: not Index Cond:. What you've actually got here is a
full-index scan over testindex2 (I guess it's doing that to achieve the
requested sort order), then computation of a 650-way boolean OR expression
for each row of the table. Ugh.

The other way of doing this would involve 650 separate index probes and
then sorting the result. Which would be pretty expensive too, but just
counting on my fingers it seems like that ought to come out at less than
the 35000 cost units for this plan. The planner evidently is coming up
with a different answer though. You might try dropping testindex2
(which I suppose is an index on (f,c)) so that it has only an index on
f to play with, and see what plan it picks and what the estimated/actual
costs are.

regards, tom lane

#10 Jeffrey Tenny
jeffrey.tenny@comcast.net
In reply to: Tom Lane (#9)
Re: performance question (something to do w/ parameterized

The original set of indexes were:

Indexes:
"x_c_idx" btree (c)
"x_f_idx" btree (f)
"testindex2" btree (f, c)

I dropped the multicolumn index 'testindex2', and a new explain analyze
looks like this:

Sort (cost=35730.71..35768.28 rows=1503 width=16) (actual
time=962.555..964.467 rows=677 loops=1)
Sort Key: f, c
-> Seq Scan on x (cost=0.00..34937.60 rows=1503 width=16) (actual
time=5.449..956.594 rows=677 loops=1)
Filter: ((f = 1) OR (f = 2) OR (f = 3) ...

Turning on the server debugging again, I got roughly identical
query times with and without the two column index.
It appears to have ignored the other indexes completely.

Tom Lane wrote:

> [...] You might try dropping testindex2 (which I suppose is an index on
> (f,c)) so that it has only an index on f to play with, and see what plan
> it picks and what the estimated/actual costs are.

#11 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeffrey Tenny (#10)
Re: performance question (something to do w/ parameterized

Jeffrey Tenny <jeffrey.tenny@comcast.net> writes:

> I dropped the multicolumn index 'testindex2', and a new explain analyze
> looks like this:
>
> Sort (cost=35730.71..35768.28 rows=1503 width=16) (actual
> time=962.555..964.467 rows=677 loops=1)
> Sort Key: f, c
> -> Seq Scan on x (cost=0.00..34937.60 rows=1503 width=16) (actual
> time=5.449..956.594 rows=677 loops=1)
> Filter: ((f = 1) OR (f = 2) OR (f = 3) ...
>
> Turning on the server debugging again, I got roughly identical
> query times with and without the two column index.

That's good, actually, seeing that the planner thinks they're close to
the same speed too. Now try "set enable_seqscan = off" to see if you
can force the multi-index-scan plan to be chosen, and see how that does.

regards, tom lane

#12 Jeffrey Tenny
jeffrey.tenny@comcast.net
In reply to: Tom Lane (#11)
Re: performance question (something to do w/ parameterized

I tried the seqscan disabling and got what sounds like the desired plan:

Sort (cost=54900.62..54940.29 rows=1587 width=16) (actual time=20.208..22.138 rows=677 loops=1)
Sort Key: f, c
-> Index Scan using x_f_idx, x_f_idx, ...
(cost=0.00..54056.96 rows=1587 width=16) (actual time=1.048..15.598 rows=677 loops=1)
Index Cond: ((f = 1) OR (f = 2) OR (f = 3) ....

I turned off the option in postgresql.conf and it did indeed improve all similar queries on that table
to have sub-second response time, down from 6/8/10 second responses. And the elapsed time for
the application action reflected this improvement.

So that raises two questions:

1) is there a way to enable that for a single query in a multi-query transaction?

2) am I opening a can of worms if I turn it off server-wide? (PROBABLY!)
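[Regarding question 1, one possibility, as a sketch assuming the setting should revert automatically, is SET LOCAL, which lasts only until the end of the current transaction; a plain SET/RESET pair scopes it to a single statement instead:]

```sql
BEGIN;
SET LOCAL enable_seqscan = off;  -- reverts automatically at COMMIT/ROLLBACK
SELECT * FROM x WHERE f IN (1, 2, 3 /* ... */) ORDER BY f, c;
COMMIT;

-- Or, to scope it to one query inside a larger transaction:
SET enable_seqscan = off;
SELECT * FROM x WHERE f IN (1, 2, 3 /* ... */) ORDER BY f, c;
RESET enable_seqscan;
```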

I've already had to tune the server to account for the fact that
the database is easily cached in memory but the processors are slow. (PIII 550Mhz Xeons)
I've lowered the cost of random pages and raised the cost of per-row processing
as follows (where the configuration defaults are also noted):

# - Planner Cost Constants -

#JDT: default effective_cache_size = 1000 # typically 8KB each
effective_cache_size = 50000 # typically 8KB each
#JDT: default: random_page_cost = 4 # units are one sequential page fetch cost
random_page_cost = 2 # units are one sequential page fetch cost
#JDT: default: cpu_tuple_cost = 0.01 # (same)
cpu_tuple_cost = 0.10 # (same)
#cpu_index_tuple_cost = 0.001 # (same)
#JDT: default: cpu_operator_cost = 0.0025 # (same)
cpu_operator_cost = 0.025 # (same)

Any suggestion for how to fix today's query without wrecking others (short of turning
seqscan off) is welcome, as is feedback on whether I've blundered in the settings above
(which may not be optimal, but which definitely fixed some formerly problematic queries
on that machine).

My transactions are large multi-query serializable transactions, so it's also important that any single-query targeting optimization
not affect other queries in the same transaction.

Thanks for the help.

Tom Lane wrote:

> That's good, actually, seeing that the planner thinks they're close to
> the same speed too. Now try "set enable_seqscan = off" to see if you
> can force the multi-index-scan plan to be chosen, and see how that does.

#13 Jeffrey Tenny
jeffrey.tenny@comcast.net
In reply to: Jeffrey Tenny (#12)
Re: performance question (something to do w/ parameterized

Re my question here: what would be the JDBC-proper technique?
My app is all JDBC.

Jeffrey Tenny wrote:

1) is there a way to enable that for a single query in a multi-query
transaction?
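One possibility (a sketch, not tested against the OP's setup): PostgreSQL's SET LOCAL changes a setting for the remainder of the current transaction only, so a JDBC app can bracket the one query with plain Statement.execute() round-trips. The sketch below (Python, just modeling the statement sequence) assumes enable_seqscan was previously at its default of on:

```python
# Sketch of the statement sequence a JDBC app could push through plain
# Statement.execute() calls.  SET LOCAL changes a setting for the current
# transaction only; since it still covers the *rest* of the transaction,
# we reset it explicitly (assuming the prior value was the default 'on').
def scoped_seqscan_off(query):
    return [
        "SET LOCAL enable_seqscan = off",
        query,
        "SET LOCAL enable_seqscan = on",  # restore for later queries
    ]

for stmt in scoped_seqscan_off("SELECT * FROM x WHERE f IN ($1,$2) ORDER BY f, c"):
    print(stmt)
```

Since SET LOCAL reverts automatically at COMMIT/ROLLBACK, the explicit reset only matters for later queries inside the same transaction, which is precisely the multi-query case asked about.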

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeffrey Tenny (#12)
Re: performance question (something to do w/ parameterized

Jeffrey Tenny <jeffrey.tenny@comcast.net> writes:

I tried the seqscan disabling and got what sounds like the desired plan:
Sort  (cost=54900.62..54940.29 rows=1587 width=16) (actual time=20.208..22.138 rows=677 loops=1)
  Sort Key: f, c
  ->  Index Scan using x_f_idx, x_f_idx, ...
      (cost=0.00..54056.96 rows=1587 width=16) (actual time=1.048..15.598 rows=677 loops=1)
        Index Cond: ((f = 1) OR (f = 2) OR (f = 3) ....

Hm, vs 35000 or so estimates for the slower plans. My recommendation
would be to decrease random_page_cost to 2 or so, instead of the brute
force disable-seqscans approach.

regards, tom lane

#15Jeffrey Tenny
jeffrey.tenny@comcast.net
In reply to: Tom Lane (#14)
Re: performance question (something to do w/ parameterized

Tom Lane wrote:

Jeffrey Tenny <jeffrey.tenny@comcast.net> writes:

I tried the seqscan disabling and got what sounds like the desired plan:
Sort  (cost=54900.62..54940.29 rows=1587 width=16) (actual time=20.208..22.138 rows=677 loops=1)
  Sort Key: f, c
  ->  Index Scan using x_f_idx, x_f_idx, ...
      (cost=0.00..54056.96 rows=1587 width=16) (actual time=1.048..15.598 rows=677 loops=1)
        Index Cond: ((f = 1) OR (f = 2) OR (f = 3) ....

Hm, vs 35000 or so estimates for the slower plans. My recommendation
would be to decrease random_page_cost to 2 or so, instead of the brute
force disable-seqscans approach.

The server was already running with random_page_cost=2 today for all tests, because of
the mods I've made to improve other problem queries in the past (my settings noted below, and
before in another msg on this topic).

So to nail this particular query, something additional is required (an even lower random_page_cost?).
What's a good value for slower processors/memory with the database in memory?
1? 0.5?

Just curious:
Has anybody ever done an exercise that generates customized postgresql defaults based on the
CPU, memory, architecture, bus speeds, etc.?
These old PIII Xeons are quite different from the newer AMD chips I use for postgres,
and tuning the postgresql.conf parameters has been very effective on the old Xeons. But it seems like there
must be a general knowledge base of what's generically more appropriate for certain types of hardware,
which would give people better initial defaults for a given platform. I know, step right up and do it :-)

Here's the postgresql defaults and actual settings I used for all tests today (from my production server):

I've already had to tune the server to account for the fact that
the database is easily cached in memory but the processors are slow (PIII 550MHz Xeons).
I've lowered the cost of random pages and raised the cost of per-row processing
as follows (where the configuration defaults are also noted):

# - Planner Cost Constants -

#JDT: default effective_cache_size = 1000 # typically 8KB each
effective_cache_size = 50000 # typically 8KB each
#JDT: default: random_page_cost = 4 # units are one sequential page fetch cost
random_page_cost = 2 # units are one sequential page fetch cost
#JDT: default: cpu_tuple_cost = 0.01 # (same)
cpu_tuple_cost = 0.10 # (same)
#cpu_index_tuple_cost = 0.001 # (same)
#JDT: default: cpu_operator_cost = 0.0025 # (same)
cpu_operator_cost = 0.025 # (same)

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jeffrey Tenny (#15)
Re: performance question (something to do w/ parameterized

Jeffrey Tenny <jeffrey.tenny@comcast.net> writes:

The server was already running with random_page_cost=2 today for all tests, because of
the mods I've made to improve other problem queries in the past (my settings noted below, and
before in another msg on this topic).

So to nail this particular query something additional is required (even lower random_page_cost?).
What's a good value for slower processors/memory and database in memory?

If you're pretty sure the database will always be RAM-resident, then 1.0
is the theoretically correct value.

regards, tom lane

#17Klint Gore
kg@kgb.une.edu.au
In reply to: Tom Lane (#16)
Re: performance question (something to do w/ parameterized

On Mon, 08 May 2006 19:37:37 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Jeffrey Tenny <jeffrey.tenny@comcast.net> writes:

The server was already running with random_page_cost=2 today for all tests, because of
the mods I've made to improve other problem queries in the past (my settings noted below, and
before in another msg on this topic).

So to nail this particular query something additional is required (even lower random_page_cost?).
What's a good value for slower processors/memory and database in memory?

If you're pretty sure the database will always be RAM-resident, then 1.0
is the theoretically correct value.

Would it be possible to craft a set of queries on specific data that
could advise a reasonable value for random_page_cost?

What sort of data distribution and query type would be heavily dependent
on random_page_cost? e.g. randomness of the data, size of the data
compared to physical memory.

klint.

+---------------------------------------+-----------------+
: Klint Gore                            : "Non rhyming    :
: EMail   : kg@kgb.une.edu.au           :  slang - the    :
: Snail   : A.B.R.I.                    :  possibilities  :
: Mail      University of New England   :  are useless"   :
:           Armidale NSW 2351 Australia :     L.J.J.      :
: Fax     : +61 2 6772 5376             :                 :
+---------------------------------------+-----------------+
#18PFC
lists@peufeu.com
In reply to: Tom Lane (#5)
Big IN() clauses etc : feature proposal

The moral of the story is that you're probably better off running a
bunch of small selects than in trying to optimize things with one
gargantuan select.

Ever experiment with loading the parameters into a temp table and
joining to that?

Also, it might be worth re-testing that conclusion with PG CVS tip
(or 8.2 when it comes out). The reimplementation of IN as = ANY that
I did a couple months ago might well change the results.
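To make the contrast concrete, here is a sketch (Python, just building the SQL text) of the placeholder-per-id shape the OP's log shows versus a single-array-parameter shape that = ANY enables; whether a given driver can bind the array is an assumption here:

```python
# Build the two query shapes.  The first is what the OP's log shows:
# one $n placeholder per id.  The second binds a single array parameter,
# which `= ANY` accepts (assuming the driver can send an array).
def in_list_query(n):
    placeholders = ",".join("$%d" % i for i in range(1, n + 1))
    return "SELECT * FROM x WHERE f IN (%s)" % placeholders

def any_array_query():
    return "SELECT * FROM x WHERE f = ANY ($1)"

print(in_list_query(3))    # SELECT * FROM x WHERE f IN ($1,$2,$3)
print(any_array_query())   # SELECT * FROM x WHERE f = ANY ($1)
```

Besides plan quality, the array form keeps the statement text constant regardless of list length, so a prepared plan can actually be reused across calls.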

Long mail, but I think it's interesting...

I think this is a generic problem which is often encountered: selecting
a bunch of records based on a list of primary keys (or another indexed,
unique field), said list being anything from very short to quite large.
Here are a few occurrences of this need:

1- The application supplies a list of ids (the case of the OP of this
thread)
2- A query Q1 yields a list of selected objects that we wish to use in
several subsequent queries.
And Q1 is a query we don't wish to run several times, either because it's
slow, complicated (advanced search, for instance), or because it acts on a
constantly moving dataset, so the results would be different each time.
So we store the result of Q1 in the application, in a temp table, or in an
array in a plpgsql variable, whatever, to reuse it.

Then, for each of these objects, often we will make more queries to
resolve foreign keys (get category name, owner name, from categories and
users tables, etc).

I have encountered both cases quite often, and they both pose a few
problems. I think it would be a good opportunity for a new feature (see
below).
A typical use case for point 2 :

Consider an "objects" table. Each object ...
- is related to one or several rows from the "categories" table via an
"objects_categories" link table.
- has an owner_id referencing the "users" table

I do an "advanced search" query on "objects", which returns a list of
objects. I can join directly to "users" to get the owner's name, but
joining to "categories" is already problematic because of the many-to-many
relationship.

I wish to do this: fetch all objects matching the search criteria;
fetch the owner users; fetch the categories; build in my application's
object space a clean and straightforward data representation for all this.

Also:
- I do not wish to complicate the search query.
- The row estimates for the search query results are likely to be "not so
good" (because it's a complex query); so the joins to users and
categories are likely to use suboptimal plans based on those "not so good"
estimates.
- The rows from "objects" are large; so moving them around through a lot
of small joins hurts performance.

The obvious solution is this:

BEGIN;
CREATE TEMPORARY TABLE results ON COMMIT DROP AS
  SELECT * FROM <advanced search query>;
ANALYZE results;

-- get the results to the application
SELECT * FROM results;

-- get object owners' info
SELECT * FROM users WHERE id IN (SELECT user_id FROM results);

-- get category info
SELECT * FROM categories
 WHERE id IN (SELECT category_id FROM objects_to_categories
               WHERE object_id IN (SELECT id FROM results));

-- get object/category relations (the ORM will use this to link objects
-- in the application)
SELECT * FROM objects_to_categories
 WHERE object_id IN (SELECT id FROM results);
COMMIT;

You might wonder why I do it this way for the "categories" table.
This is because I use an object-relational mapper which will instantiate
a User or Category class object for each row I fetch from these tables. I
do not want to fetch just the username using a simple join; I want
the full object, because:
- I want to instantiate these objects (they have useful methods to
process rights etc.)
- I do not want to mix columns from "objects" and "users"

And I do not wish to instantiate each category more than once. This would
waste memory, but more importantly, it is a lot cleaner to have only one
instance per row, because my ORM then translates the foreign key relations
into object relations (pointers). Each instantiated category will contain
a list of Object instances; each Object instance will contain a list of
the categories it belongs to, and point to its owner user.

Back to the point: I can't use the temp table method, because temp
tables are too slow.
Creating a temp table, filling it, analyzing it and then dropping it
takes about 100 ms. The search query, on average, takes 10 ms.

So I have to move this logic to the application, or to plpgsql, jump
through hoops and use big IN() clauses, which has the following drawbacks:
- slow
- ugly
- very hard for the ORM to auto-generate

*******************************

Feature proposal:

A way to store query results in a named buffer and reuse them in subsequent
queries.
This should be as fast as possible, store results in RAM if possible, and
be limited to the inside of a transaction.

Ways to store results like this already exist in various flavours inside
the postgres engine:
- Cursors (WITH SCROLL)
- Arrays (realistically, limited to a list of ids)
- Executor nodes: Materialize, Hash, Sort, etc.

The simplest to adapt would probably be the cursor.
Therefore I propose adding the capability to use a CURSOR like a
temporary table: join it to other tables, etc.
This would be implemented by making FETCH behave just like SELECT and be
usable in subqueries (see example below).

FETCH can return rows to the application. Why can't it return rows to
postgres itself without resorting to plpgsql tricks?

Cursors are already extremely fast.
If the cursor is declared WITH SCROLL, the result-set is buffered.
Therefore, the rowcount can be computed exactly, and a good plan can be
chosen.
The important columns could even be ANALYZEd if needed...

Example:

BEGIN;
DECLARE results SCROLL CURSOR WITHOUT HOLD FOR
  SELECT * FROM <advanced search query>;

-- get the results to the application
FETCH ALL FROM results;

-- get object owners' info
MOVE FIRST IN results;
SELECT * FROM users WHERE id IN (FETCH ALL user_id FROM results);

-- buffer object/category relations
MOVE FIRST IN results;
DECLARE cats SCROLL CURSOR WITHOUT HOLD FOR
  SELECT * FROM objects_to_categories
   WHERE object_id IN (FETCH ALL id FROM results);

-- get category info
SELECT * FROM categories WHERE id IN (FETCH ALL category_id FROM cats);

-- get object/category relations
MOVE FIRST IN cats;
FETCH ALL FROM cats;

COMMIT;

I really like this. It's clean, efficient, and easy to use.

This would be a lot faster than using temp tables.
Creating cursors is very fast, so we can create two and avoid doing the
same work twice (i.e. hashing the ids from the results to grab categories
only once).

*******************************

Goodies (utopian derivatives from this feature).

- Deferred Planning, and on-the-spot function estimates.

There are no rowcount estimates for set-returning functions in postgres.
This is a problem: when postgres thinks the function will return 1000
rows whereas in reality it returns 5 rows or 10K rows, suboptimal plans
are chosen.

SELECT * FROM objects WHERE id IN (SELECT * FROM function( params ));

This hairy problem can be solved easily with the proposed feature :

DECLARE my_set CURSOR WITHOUT HOLD FOR SELECT * FROM function( params );

Here, the result set for the function is materialized. Therefore, the
rowcount is known, and the following query can be executed with a very
good plan :

SELECT * FROM objects WHERE id IN (FETCH ALL FROM my_set);

It will likely be Hash + Nested Loop index scan for very few rows, maybe
merge joins for a lot of rows, etc. In both cases the result set from
function() needs to be hashed or sorted, which means buffered in memory or
on disk; the overhead of buffering it in the cursor would have been incurred
anyway, so there is no resource waste.

Likewise, a hard-to-estimate subquery which breaks the planning of the
outer SELECT could be embedded in a cursor and buffered.

In a distant future, the planner could choose to do this automatically,
effectively implementing deferred planning.

Thoughts?

#19Christian Kratzer
ck-lists@cksoft.de
In reply to: PFC (#18)
Re: Big IN() clauses etc : feature proposal

Hi,

On Tue, 9 May 2006, PFC wrote:
<snipp/>

Back to the point : I can't use the temp table method, because temp
tables are too slow.
Creating a temp table, filling it, analyzing it and then dropping it
takes about 100 ms. The search query, on average, takes 10 ms.

just some thoughts:

You might consider selecting just your primary key, or a set of
primary keys to the involved relations, in your search query. If you
currently use "select *" this can make your result set very large.

Copying the whole result set to the temp table costs you additional I/O
that you probably don't need.

Also you might try:

SELECT * FROM somewhere JOIN result USING (id)

Instead of:

SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)

Joins should be a lot faster than large IN clauses.

Here it will also help if result contains only the primary keys
and not all the other data. The join will be much faster.

On the other hand, if your search query runs in 10ms it seems to be fast
enough for you to run it multiple times. There's probably no point in
optimizing anything in that case.

Greetings
Christian

--
Christian Kratzer ck@cksoft.de
CK Software GmbH http://www.cksoft.de/
Phone: +49 7452 889 135 Fax: +49 7452 889 136

#20PFC
lists@peufeu.com
In reply to: Christian Kratzer (#19)
Re: Big IN() clauses etc : feature proposal

You might consider selecting just your primary key, or a set of
primary keys to the involved relations, in your search query. If you
currently use "select *" this can make your result set very large.

Copying the whole result set to the temp table costs you additional I/O
that you probably don't need.

It is a bit of a catch: I need this information, because the purpose of
the query is to retrieve these objects. I can first store the ids, then
retrieve the objects, but that's one more query.

Also you might try:
SELECT * FROM somewhere JOIN result USING (id)
Instead of:
SELECT * FROM somewhere WHERE id IN (SELECT id FROM result)

Yes, you're right in this case; however the query to retrieve the owners
needs to eliminate duplicates, which IN() does.
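A toy illustration of that duplicate-elimination point (Python standing in for the two SQL shapes; the table contents are invented):

```python
# The search "results" hold one row per found object, so the same
# owner id can appear many times.  (Data invented for illustration.)
results_user_ids = [7, 7, 3, 7]
users = {3: "alice", 7: "bob"}

# A plain join returns one owner row per occurrence...
join_rows = [users[uid] for uid in results_user_ids]

# ...while IN (SELECT user_id FROM results) dedups the id list first.
in_rows = [users[uid] for uid in sorted(set(results_user_ids))]

print(join_rows)  # ['bob', 'bob', 'alice', 'bob']
print(in_rows)    # ['alice', 'bob']
```

In SQL the join form can of course be deduplicated too (e.g. SELECT DISTINCT u.* FROM users u JOIN results r ON r.user_id = u.id), at the price of an explicit distinct step.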

On the other hand, if your search query runs in 10ms it seems to be fast
enough for you to run it multiple times. There's probably no point in
optimizing anything in that case.

I don't think so:
- 10 ms is a mean time; sometimes it takes much more time, sometimes
it's faster.
- Repeating the query might yield different results if records were added
or deleted in the meantime.
- Complex search queries have imprecise rowcount estimates; hence the
joins that I would add to them will get suboptimal plans.

Using a temp table is really the cleanest solution now; but it's too
slow, so I reverted to generating big IN() clauses in the application.

#21Christian Kratzer
ck-lists@cksoft.de
In reply to: PFC (#20)
#22PFC
lists@peufeu.com
In reply to: Christian Kratzer (#21)
#23Martijn van Oosterhout
kleptog@svana.org
In reply to: PFC (#22)
#24Csaba Nagy
nagy@ecircle-ag.com
In reply to: Martijn van Oosterhout (#23)
#25PFC
lists@peufeu.com
In reply to: Martijn van Oosterhout (#23)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: PFC (#18)
#27Bruce Momjian
bruce@momjian.us
In reply to: PFC (#18)
#28Mitchell Skinner
mitch@arctur.us
In reply to: PFC (#25)
#29PFC
lists@peufeu.com
In reply to: Bruce Momjian (#27)
#30PFC
lists@peufeu.com
In reply to: Mitchell Skinner (#28)
#31Dawid Kuroczko
qnex42@gmail.com
In reply to: PFC (#20)
#32PFC
lists@peufeu.com
In reply to: Dawid Kuroczko (#31)
#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: PFC (#30)
#34Markus Schaber
schabi@logix-tt.com
In reply to: PFC (#32)
#35PFC
lists@peufeu.com
In reply to: Markus Schaber (#34)
#36Martijn van Oosterhout
kleptog@svana.org
In reply to: PFC (#35)
#37Markus Schaber
schabi@logix-tt.com
In reply to: PFC (#35)
#38Nis Jørgensen
nis@superlativ.dk
In reply to: Martijn van Oosterhout (#36)
#39Markus Schaber
schabi@logix-tt.com
In reply to: Nis Jørgensen (#38)
#40Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: PFC (#20)
#41Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#33)
#42Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: PFC (#25)
#43PFC
lists@peufeu.com
In reply to: Jim Nasby (#41)
#44Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: PFC (#29)
#45PFC
lists@peufeu.com
In reply to: Jim Nasby (#42)
#46PFC
lists@peufeu.com
In reply to: Jim Nasby (#44)
#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#41)
#48Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#47)
#49Martijn van Oosterhout
kleptog@svana.org
In reply to: Zeugswetter Andreas SB SD (#48)
#50Bruce Momjian
bruce@momjian.us
In reply to: Jim Nasby (#44)
#51Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#47)
#52Scott Marlowe
smarlowe@g2switchworks.com
In reply to: Jim Nasby (#51)
#53Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim Nasby (#51)
#54Martijn van Oosterhout
kleptog@svana.org
In reply to: Bruce Momjian (#50)
#55Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Martijn van Oosterhout (#53)
#56Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Martijn van Oosterhout (#54)
#57Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Zeugswetter Andreas SB SD (#48)
#58PFC
lists@peufeu.com
In reply to: Jim Nasby (#57)
#59Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim Nasby (#57)
#60Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Tom Lane (#59)
#61Gregory S. Williamson
gsw@globexplorer.com
In reply to: Zeugswetter Andreas SB SD (#48)