-HEAD planner issue wrt hash_joins on dbt3 ?
Hi!
I'm about to do some benchmarking on -HEAD on some hardware I have
available, and it seems I'm hitting a rather weird issue causing the osdl
dbt3 benchmark to run very slowly, eating CPU time for hours ...
it seems that the issue is caused by the following query:
(in case it gets linewrapped:
http://www.kaltenbrunner.cc/files/dbt3_with_hashjoin.txt)
select n_name,
       sum(l_extendedprice * (1 - l_discount)) as revenue
from customer, orders, lineitem, supplier, nation, region
where c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and l_suppkey = s_suppkey
  and c_nationkey = s_nationkey
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'AFRICA'
  and o_orderdate >= date '1993-01-01'
  and o_orderdate < date '1993-01-01' + interval '1 year'
group by n_name
order by revenue desc;
that results in the following plan on my box:
Sort  (cost=2543391.75..2543391.81 rows=25 width=37)
  Sort Key: sum((lineitem.l_extendedprice * (1::double precision - lineitem.l_discount)))
  ->  HashAggregate  (cost=2543390.73..2543391.17 rows=25 width=37)
        ->  Hash Join  (cost=440864.81..2543027.40 rows=72666 width=37)
              Hash Cond: ((orders.o_custkey = customer.c_custkey) AND (supplier.s_nationkey = customer.c_nationkey))
              ->  Hash Join  (cost=377714.59..2415568.01 rows=1816643 width=49)
                    Hash Cond: (lineitem.l_orderkey = orders.o_orderkey)
                    ->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49)
                          ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41)
                                Merge Cond: (nation.n_nationkey = supplier.s_nationkey)
                                ->  Nested Loop  (cost=0.00..19.19 rows=5 width=33)
                                      ->  Index Scan using pk_nation on nation  (cost=0.00..9.38 rows=25 width=37)
                                      ->  Index Scan using pk_region on region  (cost=0.00..0.38 rows=1 width=4)
                                            Index Cond: (nation.n_regionkey = region.r_regionkey)
                                            Filter: (r_name = 'AFRICA'::bpchar)
                                ->  Index Scan using i_s_nationkey on supplier  (cost=0.00..9779.46 rows=100000 width=8)
                          ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16)
                                Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                                ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0)
                                      Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
                    ->  Hash  (cost=372023.51..372023.51 rows=2270971 width=8)
                          ->  Bitmap Heap Scan on orders  (cost=41391.94..372023.51 rows=2270971 width=8)
                                Recheck Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
                                ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..41391.94 rows=2270971 width=0)
                                      Index Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
              ->  Hash  (cost=55647.15..55647.15 rows=1500615 width=8)
                    ->  Seq Scan on customer  (cost=0.00..55647.15 rows=1500615 width=8)
(27 rows)
so it really thinks that doing hashes with gigantic amounts of data is
a good idea generally - this seems to be independent of work_mem - the
plan looks the same with 1MB vs 126MB (which I had during the run).
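(for reference, these are plain session settings - just an illustrative
sketch, not the exact session:)

-- the plan above is identical at both of these settings:
set work_mem = '1MB';
set work_mem = '126MB';
-- the toggle used to get the second plan below:
set enable_hashjoin = off;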
the profile of the backend eating the cpu looks similar to:
26351 27.9047 ExecScanHashBucket
8239 8.7248 hash_seq_search
6984 7.3958 hash_search_with_hash_value
setting enable_hashjoin to off results in a runtime of about 2.5 minutes:
(http://www.kaltenbrunner.cc/files/dbt3_without_hashjoin.txt)
Sort  (cost=3700257.38..3700257.45 rows=25 width=37) (actual time=286820.962..286820.968 rows=5 loops=1)
  Sort Key: sum((lineitem.l_extendedprice * (1::double precision - lineitem.l_discount)))
  ->  HashAggregate  (cost=3700256.37..3700256.80 rows=25 width=37) (actual time=286820.932..286820.941 rows=5 loops=1)
        ->  Nested Loop  (cost=730956.43..3699893.04 rows=72666 width=37) (actual time=43551.767..286488.555 rows=72441 loops=1)
              Join Filter: (customer.c_nationkey = supplier.s_nationkey)
              ->  Merge Join  (cost=730956.43..3624153.73 rows=1816643 width=49) (actual time=43281.710..257082.739 rows=1822547 loops=1)
                    Merge Cond: (lineitem.l_orderkey = orders.o_orderkey)
                    ->  Index Scan using i_l_orderkey on lineitem  (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)
                    ->  Sort  (cost=730956.43..732091.92 rows=454194 width=41) (actual time=43248.797..45754.223 rows=1822547 loops=1)
                          Sort Key: orders.o_orderkey
                          ->  Merge Join  (cost=670885.68..688278.21 rows=454194 width=41) (actual time=34469.359..42050.059 rows=455262 loops=1)
                                Merge Cond: (customer.c_custkey = orders.o_custkey)
                                ->  Sort  (cost=59105.79..59856.10 rows=300123 width=41) (actual time=8113.826..8491.532 rows=299493 loops=1)
                                      Sort Key: customer.c_custkey
                                      ->  Nested Loop  (cost=781.13..31801.81 rows=300123 width=41) (actual time=107.537..7461.355 rows=299493 loops=1)
                                            ->  Nested Loop  (cost=1.06..11.00 rows=5 width=33) (actual time=0.030..0.296 rows=5 loops=1)
                                                  Join Filter: (nation.n_regionkey = region.r_regionkey)
                                                  ->  Index Scan using pk_nation on nation  (cost=0.00..9.38 rows=25 width=37) (actual time=0.007..0.063 rows=25 loops=1)
                                                  ->  Materialize  (cost=1.06..1.07 rows=1 width=4) (actual time=0.002..0.004 rows=1 loops=25)
                                                        ->  Seq Scan on region  (cost=0.00..1.06 rows=1 width=4) (actual time=0.009..0.018 rows=1 loops=1)
                                                              Filter: (r_name = 'AFRICA'::bpchar)
                                            ->  Bitmap Heap Scan on customer  (cost=780.07..5607.85 rows=60025 width=8) (actual time=61.150..1331.466 rows=59899 loops=5)
                                                  Recheck Cond: (nation.n_nationkey = customer.c_nationkey)
                                                  ->  Bitmap Index Scan on i_c_nationkey  (cost=0.00..780.07 rows=60025 width=0) (actual time=44.637..44.637 rows=59899 loops=5)
                                                        Index Cond: (nation.n_nationkey = customer.c_nationkey)
                                ->  Sort  (cost=611779.89..617457.31 rows=2270971 width=8) (actual time=26355.515..29471.963 rows=2276859 loops=1)
                                      Sort Key: orders.o_custkey
                                      ->  Bitmap Heap Scan on orders  (cost=41391.94..372023.51 rows=2270971 width=8) (actual time=1630.604..16266.102 rows=2276859 loops=1)
                                            Recheck Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
                                            ->  Bitmap Index Scan on i_o_orderdate  (cost=0.00..41391.94 rows=2270971 width=0) (actual time=1352.037..1352.037 rows=2276859 loops=1)
                                                  Index Cond: ((o_orderdate >= '1993-01-01'::date) AND (o_orderdate < '1994-01-01 00:00:00'::timestamp without time zone))
              ->  Index Scan using pk_supplier on supplier  (cost=0.00..0.03 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1822547)
                    Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
Total runtime: 286984.386 ms
(34 rows)
(about 120s seem to be explain analyze overhead here)
fwiw the box in question is a dual 2.6GHz Opteron with 8GB of RAM - wal
is on the BBWC-onboard Smartarray (RAID 10 on 4 disks) and the data is
on a 14 disk Linux Software RAID 10, running Debian Sarge/AMD64 with
Kernel 2.6.17.7. the dbt3 database got initialized with a scaling factor
of 10 (running with just 1 works fine).
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
I'm about to do some benchmarking on -HEAD on some hardware I have
available, and it seems I'm hitting a rather weird issue causing the osdl
dbt3 benchmark to run very slowly, eating CPU time for hours ...
Could we see the actual EXPLAIN ANALYZE results for the slow plan?
I'm unconvinced by your "hash join is bad" analysis, especially in
the cases where you're giving it lots of work_mem. I think it's got
something to do with the different join orders. The rowcount estimates
in the fast plan all seem pretty good, but I'm betting something is
wrong with some of them in the slow case.
regards, tom lane
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
I'm about to do some benchmarking on -HEAD on some hardware I have
available, and it seems I'm hitting a rather weird issue causing the osdl
dbt3 benchmark to run very slowly, eating CPU time for hours ...
Could we see the actual EXPLAIN ANALYZE results for the slow plan?
I'm unconvinced by your "hash join is bad" analysis, especially in
the cases where you're giving it lots of work_mem. I think it's got
something to do with the different join orders. The rowcount estimates
in the fast plan all seem pretty good, but I'm betting something is
wrong with some of them in the slow case.
will do - but that will take a while - the slow one runs for 12h or so
even without explain analyze overhead ...
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
btw - the "hashjoin is bad" was more or less based on the observation
that nearly all of the cpu is burned in hash-related functions in the
profile (when profiling over a longer period of time those accumulate
even more % of the time than in the short profile I included in the
original report)
[ shrug... ] Two out of the three functions you mentioned are not used
by hash join, and anyway the other plan probably has a comparable
execution density in sort-related functions; does that make it bad?
It's possible that the large time for ExecScanHashBucket has something
to do with skewed usage of the hash buckets due to an unfortunate data
distribution, but that's theorizing far in advance of the data.
regards, tom lane
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
btw - the "hashjoin is bad" was more or less based on the observation
that nearly all of the cpu is burned in hash-related functions in the
profile (when profiling over a longer period of time those accumulate
even more % of the time than in the short profile I included in the
original report)
[ shrug... ] Two out of the three functions you mentioned are not used
by hash join, and anyway the other plan probably has a comparable
execution density in sort-related functions; does that make it bad?
hmm sorry for that - I should have checked the source before I made that
assumption :-(
It's possible that the large time for ExecScanHashBucket has something
to do with skewed usage of the hash buckets due to an unfortunate data
distribution, but that's theorizing far in advance of the data.
http://www.kaltenbrunner.cc/files/4/
has preliminary data of the dbt3/scaling 10 run I did which seems to
imply we have at least 4 queries in there that take an excessive amount
of time (query 5 is the one I started the complaint with).
However those results have to be taken with a grain of salt since there
is an apparent bug in the dbt3 code which seems to rely on
add_missing_from=on (as can be seen in some of the error logs of the
database), and towards the end of the throughput run I did some of the
explain analyzes for the report (those are the small 100% spikes in the
graph, due to the box using the second CPU to run them).
I will redo those tests later this week though ...
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
Could we see the actual EXPLAIN ANALYZE results for the slow plan?
Well, indeed it seems that the hash join is just an innocent bystander:
the bulk of the runtime (all but about 120 sec in fact) is spent here:
->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49) (actual time=60.325..24923860.713 rows=11897899 loops=1)
      ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41) (actual time=16.654..2578.060 rows=19837 loops=1)
      ...
      ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16) (actual time=13.492..1254.535 rows=600 loops=19837)
            Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
            ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0) (actual time=10.662..10.662 rows=600 loops=19837)
                  Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
I suppose that the profile result you showed was taken during the
startup transient where it was computing the hashtables that this loop's
results are joined to ... but that's not where the problem is. The
problem is repeating that bitmap scan on lineitem for nearly 20000
different l_suppkeys.
Apparently we've made the planner a bit too optimistic about the savings
that can be expected from repeated indexscans occurring on the inside of
a join. The other plan uses a different join order and doesn't try to
join lineitem until it's got orders.o_orderkey, whereupon it does a
mergejoin against an indexscan on lineitem:
-> Index Scan using i_l_orderkey on lineitem (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)
The runtimes for the remainders of the plans are roughly comparable, so
it's the cost of joining lineitem that is hurting here.
Is lineitem sorted (or nearly sorted) by l_orderkey? Part of the
problem could be overestimating the cost of this indexscan.
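(a quick way to eyeball the physical ordering, as a sketch - the
planner's correlation statistic is 1.0 for a perfectly sorted column:)

select attname, correlation
from pg_stats
where tablename = 'lineitem' and attname = 'l_orderkey';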
What are the physical sizes of lineitem and its indexes, and how do
those compare to your RAM? What are you using for planner settings
(particularly effective_cache_size)?
regards, tom lane
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
Could we see the actual EXPLAIN ANALYZE results for the slow plan?
Well, indeed it seems that the hash join is just an innocent bystander:
the bulk of the runtime (all but about 120 sec in fact) is spent here:
->  Nested Loop  (cost=13.65..1719683.85 rows=12000672 width=49) (actual time=60.325..24923860.713 rows=11897899 loops=1)
      ->  Merge Join  (cost=0.00..10248.66 rows=20000 width=41) (actual time=16.654..2578.060 rows=19837 loops=1)
      ...
      ->  Bitmap Heap Scan on lineitem  (cost=13.65..77.16 rows=665 width=16) (actual time=13.492..1254.535 rows=600 loops=19837)
            Recheck Cond: (lineitem.l_suppkey = supplier.s_suppkey)
            ->  Bitmap Index Scan on i_l_suppkey  (cost=0.00..13.65 rows=665 width=0) (actual time=10.662..10.662 rows=600 loops=19837)
                  Index Cond: (lineitem.l_suppkey = supplier.s_suppkey)
I suppose that the profile result you showed was taken during the
startup transient where it was computing the hashtables that this loop's
results are joined to ... but that's not where the problem is. The
problem is repeating that bitmap scan on lineitem for nearly 20000
different l_suppkeys.
possible - I actually took them over a longer period of time
Apparently we've made the planner a bit too optimistic about the savings
that can be expected from repeated indexscans occurring on the inside of
a join. The other plan uses a different join order and doesn't try to
join lineitem until it's got orders.o_orderkey, whereupon it does a
mergejoin against an indexscan on lineitem:
->  Index Scan using i_l_orderkey on lineitem  (cost=0.00..2715943.34 rows=60003360 width=16) (actual time=32.868..123668.380 rows=59991868 loops=1)
The runtimes for the remainders of the plans are roughly comparable, so
it's the cost of joining lineitem that is hurting here.
Is lineitem sorted (or nearly sorted) by l_orderkey? Part of the
problem could be overestimating the cost of this indexscan.
What are the physical sizes of lineitem and its indexes, and how do
those compare to your RAM? What are you using for planner settings
(particularly effective_cache_size)?
ouch - you are right (as usual) here.
effective_cache_size was set to 10GB (my fault for copying over the conf
from a 16GB box) during the run - lowering it just a few megabytes(!) or
to a more realistic 6GB results in the following MUCH better plan:
http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
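(the change itself is just a one-line session setting, shown here only
for clarity - the value is what fits this box:)

set effective_cache_size = '6GB';  -- was 10GB, copied from a 16GB machine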
as for the relation sizes:
dbt3=# select pg_relation_size('lineitem');
pg_relation_size
------------------
10832764928
(1 row)
dbt3=# select pg_total_relation_size('lineitem');
pg_total_relation_size
------------------------
22960259072
(1 row)
there are nine btree indexes on lineitem, all between 1.1GB and 1.4GB in
size.
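(for completeness, a sketch of how to list those - pg_size_pretty() just
for readability:)

select indexrelid::regclass as index,
       pg_size_pretty(pg_relation_size(indexrelid)) as size
from pg_index
where indrelid = 'lineitem'::regclass;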
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
Apparently we've made the planner a bit too optimistic about the savings
that can be expected from repeated indexscans occurring on the inside of
a join.
effective_cache_size was set to 10GB (my fault for copying over the conf
from a 16GB box) during the run - lowering it just a few megabytes(!) or
to a more realistic 6GB results in the following MUCH better plan:
http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
Interesting. It used to be that effective_cache_size wasn't all that
critical... what I think this report is showing is that with the 8.2
changes to try to account for caching effects in repeated indexscans,
we've turned that into a pretty significant parameter.
It'd be nice not to have to depend on the DBA to give us a good number
for this setting. But I don't know of any portable ways to find out
how much RAM is in the box, let alone what fraction of it we should
assume is available per-query.
regards, tom lane
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
Apparently we've made the planner a bit too optimistic about the savings
that can be expected from repeated indexscans occurring on the inside of
a join.
effective_cache_size was set to 10GB (my fault for copying over the conf
from a 16GB box) during the run - lowering it just a few megabytes(!) or
to a more realistic 6GB results in the following MUCH better plan:
http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
Interesting. It used to be that effective_cache_size wasn't all that
critical... what I think this report is showing is that with the 8.2
changes to try to account for caching effects in repeated indexscans,
we've turned that into a pretty significant parameter.
yes, I'm a bit worried about that too - it has been a bit of
"conventional wisdom" that setting effective_cache_size optimistically
will never hurt, and that it encourages postgresql to sometimes get a
better plan by favouring index-scans.
It'd be nice not to have to depend on the DBA to give us a good number
for this setting. But I don't know of any portable ways to find out
how much RAM is in the box, let alone what fraction of it we should
assume is available per-query.
well, there are really a number of things the dba had better give the
database accurate information about - though in that case we might go
from "too much won't hurt" to "too much will hurt" ...
Stefan
On Wed, Sep 13, 2006 at 10:47:09AM -0400, Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
Apparently we've made the planner a bit too optimistic about the savings
that can be expected from repeated indexscans occurring on the inside of
a join.
effective_cache_size was set to 10GB (my fault for copying over the conf
from a 16GB box) during the run - lowering it just a few megabytes(!) or
to a more realistic 6GB results in the following MUCH better plan:
http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
Interesting. It used to be that effective_cache_size wasn't all
that critical... what I think this report is showing is that with
the 8.2 changes to try to account for caching effects in repeated
indexscans, we've turned that into a pretty significant parameter.
It'd be nice not to have to depend on the DBA to give us a good
number for this setting. But I don't know of any portable ways to
find out how much RAM is in the box, let alone what fraction of it
we should assume is available per-query.
That's fairly straight-forward, if a little crude. We ask the DBA and
provide some tools for estimating and tuning same. :)
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter
Remember to vote!
[already sent a variant of that yesterday but it doesn't look like it
made it to the list]
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
Apparently we've made the planner a bit too optimistic about the savings
that can be expected from repeated indexscans occurring on the inside of
a join.
effective_cache_size was set to 10GB (my fault for copying over the conf
from a 16GB box) during the run - lowering it just a few megabytes(!) or
to a more realistic 6GB results in the following MUCH better plan:
http://www.kaltenbrunner.cc/files/dbt3_explain_analyze2.txt
Interesting. It used to be that effective_cache_size wasn't all that
critical... what I think this report is showing is that with the 8.2
changes to try to account for caching effects in repeated indexscans,
we've turned that into a pretty significant parameter.
took me a while due to hardware issues on my testbox - but there are new
results (with 6GB for effective_cache_size) up at:
http://www.kaltenbrunner.cc/files/5/
there are still a few issues with the validity of the run, like the rf
tests not actually being done right - but lowering effective_cache_size
gave a dramatic speedup on Q5, Q7 and Q8.
that is the explain for the 4h+ Q9:
http://www.kaltenbrunner.cc/files/analyze_q9.txt
increasing the statistics target up to 1000 does not seem to change
the plan btw.
disabling nested loops leads to the following (4 times faster) plan:
http://www.kaltenbrunner.cc/files/analyze_q9_no_nest.txt
since the hash-joins in there look rather slow (inappropriate hashtable
setup due to the wrong estimates?) I disabled hash joins too:
http://www.kaltenbrunner.cc/files/analyze_q9_no_nest_no_hashjoin.txt
and amazingly this plan is by far the fastest one in runtime (15min vs
4.5h ...) except that the planner thinks it is 20 times more expensive ...
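(for completeness, the three plans above come from nothing more exotic
than toggling per-session planner options - sketch only:)

-- default plan (4.5h+):
reset enable_nestloop;
reset enable_hashjoin;

-- the 4x faster plan:
set enable_nestloop = off;

-- the fastest plan (15min), despite a 20x higher estimated cost:
set enable_nestloop = off;
set enable_hashjoin = off;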
Stefan
Stefan Kaltenbrunner wrote:
since the hash-joins in there look rather slow (inappropriate hashtable
setup due to the wrong estimates?) I disabled hash joins too:
http://www.kaltenbrunner.cc/files/analyze_q9_no_nest_no_hashjoin.txt
and amazingly this plan is by far the fastest one in runtime (15min vs
4.5h ...) except that the planner thinks it is 20 times more expensive ...
some additional numbers (the first is with default settings, the second
with enable_nestloop = 'off', the third with enable_nestloop = 'off'
and enable_hashjoin = 'off'):
http://www.kaltenbrunner.cc/files/analyze_q7.txt
here we have a 3x speedup from disabling nested loops, and a 2x speedup
(over the original plan) with both nested loops and hash joins disabled.
http://www.kaltenbrunner.cc/files/analyze_q20.txt
here we have a 180x(!) speedup with both planner options disabled ...
it is worth mentioning that for both queries the estimated costs in
relation to each other look quite reasonable as soon as enable_nestloop
= 'off' (i.e. 5042928 vs 10715247 with 344 sec vs 514 sec for Q7, and
101441851 vs 101445468 with 10 sec vs 11 sec for Q20)
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
that is the explain for the 4h+ Q9:
http://www.kaltenbrunner.cc/files/analyze_q9.txt
The big problem there seems to be the drastic misestimation of the
number of rows matching the p_name ~~ '%ghost%' condition. What does
pg_stats have for the p_name column?
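(i.e. something along these lines - the DBT-3 "part" table is assumed:)

select null_frac, n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds
from pg_stats
where tablename = 'part' and attname = 'p_name';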
regards, tom lane
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
that is the explain for the 4h+ Q9:
http://www.kaltenbrunner.cc/files/analyze_q9.txt
The big problem there seems to be the drastic misestimation of the
number of rows matching the p_name ~~ '%ghost%' condition. What does
pg_stats have for the p_name column?
http://www.kaltenbrunner.cc/files/pg_stat_p_name.txt
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
some additional numbers (first one is with default settings, second is
with enable_nestloop = 'off', third one is with enable_nestloop = 'off'
and enable_hashjoin = 'off'):
I'm inclined to think you still have effective_cache_size set too high;
or at least that the planner is being too optimistic about how much
cache space is actually available to each indexscan.
With the code as it currently stands, effective_cache_size has some of
the same properties as work_mem: the planner effectively assumes that
that much space is available to *each* indexscan, and so you'd need to
de-rate the setting based on the complexity of queries and the number of
concurrent sessions.
I'm not sure what we could do about the concurrent-sessions issue, but
we could make some sort of attack on the query complexity issue by
pro-rating the effective_cache_size among all the tables used by a
query.
http://www.kaltenbrunner.cc/files/analyze_q20.txt
here we have a 180x(!) speedup with both disabled planner options ...
There's something awfully bogus about that one --- how is it that the
aggregate subplan, with the exact same plan and same number of
executions in all three cases, has an actual runtime 200x more in the
first case?
regards, tom lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
The big problem there seems to be the drastic misestimation of the
number of rows matching the p_name ~~ '%ghost%' condition. What does
pg_stats have for the p_name column?
Hmm ... pattern_sel already applies the operator directly to the
most_common_vals, but in this situation those aren't common enough
to help much. With such an extensive histogram it is awfully tempting
to assume that the histogram members are a representative sample, and
take the selectivity as being the fraction of histogram entries that
match the pattern. Maybe drop the first and last histogram entries
on the grounds they're probably outliers. Thoughts? What would be a
reasonable minimum histogram size to enable using this approach instead
of the guess-on-the-basis-of-the-pattern code?
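(to make the idea concrete, here's the same computation done from SQL
rather than in the selectivity code - a sketch only, using the usual
anyarray-to-text[] cast trick; unnest() is newer than the backend under
discussion, and the endpoint-dropping is left out:)

select avg((h ~~ '%ghost%')::int::float8) as est_selectivity
from (select unnest(histogram_bounds::text::text[]) as h
      from pg_stats
      where tablename = 'part' and attname = 'p_name') s;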
regards, tom lane
Tom Lane wrote:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
some additional numbers (first one is with default settings, second is
with enable_nestloop = 'off', third one is with enable_nestloop = 'off'
and enable_hashjoin = 'off'):
I'm inclined to think you still have effective_cache_size set too high;
or at least that the planner is being too optimistic about how much
cache space is actually available to each indexscan.
I have long-term external monitoring on that server and it indeed shows
that there was never less than about 5.8GB of buffercache used (or
more than 2.2GB used by other means). So 6GB might still be a bit on the
optimistic side, but it is not actually that far off from reality.
I will redo with lower settings - do you have any suggestions for that?
With the code as it currently stands, effective_cache_size has some of
the same properties as work_mem: the planner effectively assumes that
that much space is available to *each* indexscan, and so you'd need to
de-rate the setting based on the complexity of queries and the number of
concurrent sessions.
concurrency is 1 here - there is never more than a single query running
in parallel in those tests.
I'm not sure what we could do about the concurrent-sessions issue, but
we could make some sort of attack on the query complexity issue by
pro-rating the effective_cache_size among all the tables used by a
query.
hmm not sure i understand what you mean here :-(
http://www.kaltenbrunner.cc/files/analyze_q20.txt
here we have a 180x(!) speedup with both disabled planner options ...
There's something awfully bogus about that one --- how is it that the
aggregate subplan, with the exact same plan and same number of
executions in all three cases, has an actual runtime 200x more in the
first case?
hmm - good question. I will redo those in a bit ...
Stefan
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
Tom Lane wrote:
I'm not sure what we could do about the concurrent-sessions issue, but
we could make some sort of attack on the query complexity issue by
pro-rating the effective_cache_size among all the tables used by a
query.
hmm not sure i understand what you mean here :-(
Per the comment for index_pages_fetched:
* We assume that effective_cache_size is the total number of buffer pages
* available for both table and index, and pro-rate that space between the
* table and index. (Ideally other_pages should include all the other
* tables and indexes used by the query too; but we don't have a good way
* to get that number here.)
A first-order approximation to this would be to add up the total sizes
of all the other tables used in the query. I am thinking of leaving out
other indexes, mainly because we can't tell at this level which other
indexes are actually gonna get used. This would tend to underestimate
by leaving out indexes, but not by a lot if you assume indexes are much
smaller than their tables. It would also be an overestimate because
tables that are not indexscanned concurrently with the one under
consideration probably shouldn't be counted anyway. So one might hope
these effects would more or less cancel out. Anyway it seems to be a
better idea than what we have now.
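(a back-of-envelope version of that pro-rating for the Q5 tables, just
to make the idea concrete - a 6GB cache split in proportion to heap
size, ignoring indexes; numbers illustrative:)

select relname,
       pg_size_pretty((pg_relation_size(c.oid)::float8
                       / (select sum(pg_relation_size(oid))
                          from pg_class
                          where relname in ('customer', 'orders', 'lineitem',
                                            'supplier', 'nation', 'region'))
                       * 6 * 1024^3)::bigint) as cache_share
from pg_class c
where c.relname in ('customer', 'orders', 'lineitem',
                    'supplier', 'nation', 'region');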
I will redo with lower settings - do you have any suggestions for that ?
Try reducing effective_cache_size to maybe a fourth of what it is now.
If that helps the thing pick better plans for these multi-table queries,
then we should try changing the other_pages calculation as above.
regards, tom lane
On Sun, Sep 17, 2006 at 04:18:36PM -0400, Tom Lane wrote:
* table and index. (Ideally other_pages should include all the other
* tables and indexes used by the query too; but we don't have a good way
* to get that number here.)
A first-order approximation to this would be to add up the total sizes
of all the other tables used in the query. I am thinking of leaving out
other indexes, mainly because we can't tell at this level which other
indexes are actually gonna get used. This would tend to underestimate
by leaving out indexes, but not by a lot if you assume indexes are much
smaller than their tables. It would also be an overestimate because
tables that are not indexscanned concurrently with the one under
consideration probably shouldn't be counted anyway. So one might hope
these effects would more or less cancel out. Anyway it seems to be a
better idea than what we have now.
I think it'd be better to attack this problem from the "other side";
namely looking at what's actually cached. Sadly, I don't think there's
any way to actually query the OS for info about what it has buffered,
but we can look at what's in shared_buffers and assume that it's a
reasonable proxy for the OS's cache. Something like...
relBufPages / shared_buffers * effective_cache_size
should give us a decent idea of what percentage of a relation will be in
cache somewhere. (relBufPages is the number of pages the relation in
question has in the buffer).
Of course, that raises the question of how to track how many pages are
in shared buffers for a relation. Given the criticality of locking
there, we probably don't want to update that info in real-time, but for
this application it's probably OK to just scan through the buffer every
X period of time (maybe after X number of pages read into the buffers).
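(contrib/pg_buffercache already exposes enough to prototype the first
part - a sketch, run as superuser; the relfilenode join is approximate
since it ignores tablespaces:)

select c.relname, count(*) as buffered_pages
from pg_buffercache b
join pg_class c on b.relfilenode = c.relfilenode
where b.reldatabase = (select oid from pg_database
                       where datname = current_database())
group by c.relname
order by buffered_pages desc
limit 10;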
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
"Jim C. Nasby" <jimn@enterprisedb.com> writes:
I think it'd be better to attack this problem from the "other side";
namely looking at what's actually cached.
You can kiss goodbye to plan stability if you go that route... and
in any case I doubt the assumption that what's in shared buffers is
representative of what's in kernel cache.
regards, tom lane