EXPLAIN SELECT .. does not return
Hi, This has become a major problem for us. Thank you in advance for
your help.
OS: SUSE Linux 2.6.5-7.191-bigsmp
PostgreSQL: 7.4.8
Application: ModPerl Web application using DBI.pm
Database size: 100 Gb, 1025 Tables.
Problem: EXPLAIN SELECT ... does not return.
Description:
The Application uses an EXPLAIN cost to determine whether a client's dynamic
request for data is too demanding for the server so it can gracefully deny
them. (Currently, anything over cost=0.00..500000.00).
The system gets about 3000 page requests a day.
Certain SQL Queries, I believe those with many table joins, when run as
EXPLAIN plans, never return. As a result they seem to stay churning in the
system. Once that happens other queries build up and the performance of the
whole database server grinds to a halt. Postgresql never dies, but
eventually, user requests start timing out.
This happens on average two or three times a week. I kill an offending
process and all's well again. I have not been able to identify with
certainty
an offending SQL statement.
Config params, that have changed from default:
tcpip_socket = true
max_connections = 200
shared_buffers = 2000
sort_mem = 1048576
vacuum_mem = 65536
max_fsm_pages = 100000
max_fsm_relations = 1000
max_files_per_process = 1000
fsync = false
wal_sync_method = fsync
wal_buffers = 800
checkpoint_segments = 30
commit_delay = 100
commit_siblings = 50
effective_cache_size = 1000
random_page_cost = 4
geqo = true
geqo_threshold = 14
default_statistics_target = 100
from_collapse_limit = 13
join_collapse_limit = 13
Note: we load lumps of data ea. week. Then primarily it is a readonly
database.
On 12/5/05, David Link <dlink@soundscan.com> wrote:
Hi, This has become a major problem for us. Thank you in advance for
your help.OS: SUSE Linux 2.6.5-7.191-bigsmp
PostgreSQL: 7.4.8
Application: ModPerl Web application using DBI.pm
Database size: 100 Gb, 1025 Tables.Problem: EXPLAIN SELECT ... does not return.
Description:
The Application uses an EXPLAIN cost to determine whether a client's dynamic
request for data is too demanding for the server so it can gracefully deny
them. (Currently, anything over cost=0.00..500000.00).The system gets about 3000 page requests a day.
Certain SQL Queries, I believe those with many table joins, when run as
EXPLAIN plans, never return. As a result they seem to stay churning in the
system. Once that happens other queries build up and the performance of the
whole database server grinds to a halt. Postgresql never dies, but
eventually, user requests start timing out.This happens on average two or three times a week. I kill an offending
process and all's well again. I have not been able to identify with
certainty
an offending SQL statement.Config params, that have changed from default:
tcpip_socket = true
max_connections = 200
shared_buffers = 2000
sort_mem = 1048576
vacuum_mem = 65536
max_fsm_pages = 100000
max_fsm_relations = 1000
max_files_per_process = 1000
fsync = false
wal_sync_method = fsync
wal_buffers = 800
checkpoint_segments = 30
commit_delay = 100
commit_siblings = 50
effective_cache_size = 1000
random_page_cost = 4
geqo = true
geqo_threshold = 14
default_statistics_target = 100
from_collapse_limit = 13
join_collapse_limit = 13Note: we load lumps of data ea. week. Then primarily it is a readonly
database.
when you have thoses cases, you can take a look in pg_stats_activity
to find the offending query...
or simply logs all queries
--
Atentamente,
Jaime Casanova
(DBA: DataBase Aniquilator ;)
David Link <dlink@soundscan.com> writes:
Certain SQL Queries, I believe those with many table joins, when run as
EXPLAIN plans, never return.
I'd guess that one or all of these settings are excessive:
geqo_threshold = 14
from_collapse_limit = 13
join_collapse_limit = 13
Keep in mind that the planning cost is exponential in these limits,
eg geqo_threshold = 14 probably allows planning times about 14 times
greater than geqo_threshold = 13.
While I'm looking:
shared_buffers = 2000
That seems extremely low for modern machines.
sort_mem = 1048576
That, on the other hand, is almost certainly way too high for a system-wide
setting. You're promising you have 1Gb available for *each* sort.
max_fsm_pages = 100000
And this way too low for a 100Gb database, unless most of the tables
never see any UPDATEs or DELETEs.
wal_buffers = 800
Seems a bit high, especially considering you have fsync disabled and
thus there is no benefit whatever to buffering WAL.
commit_delay = 100
commit_siblings = 50
Have you measured any benefit to having this turned on?
All in all it looks like your configuration settings were chosen by
throwing darts :-(
regards, tom lane
Jaime Casanova wrote:
when you have thoses cases, you can take a look in pg_stats_activity
to find the offending query...or simply logs all queries
Thanks for the advice. I also turned on stat_command_string
Tom Lane wrote:
David Link <dlink@soundscan.com> writes:
Certain SQL Queries, I believe those with many table joins, when run as
EXPLAIN plans, never return.I'd guess that one or all of these settings are excessive:
geqo_threshold = 14
from_collapse_limit = 13
join_collapse_limit = 13Keep in mind that the planning cost is exponential in these limits,
eg geqo_threshold = 14 probably allows planning times about 14 times
greater than geqo_threshold = 13.While I'm looking:
shared_buffers = 2000
That seems extremely low for modern machines.
sort_mem = 1048576
That, on the other hand, is almost certainly way too high for a system-wide
setting. You're promising you have 1Gb available for *each* sort.max_fsm_pages = 100000
And this way too low for a 100Gb database, unless most of the tables
never see any UPDATEs or DELETEs.wal_buffers = 800
Seems a bit high, especially considering you have fsync disabled and
thus there is no benefit whatever to buffering WAL.commit_delay = 100
commit_siblings = 50Have you measured any benefit to having this turned on?
All in all it looks like your configuration settings were chosen by
throwing darts :-(regards, tom lane
Thanks for your reply, Tom. Different folks have made different
suggestions. Can you suggest more reasonable values for these? But
more importantly, do you think the problem I am having is due to these
configuration short comings?
Thanks much.
David
Thanks for your reply, Tom. Different folks have made different
suggestions. Can you suggest more reasonable values for these? But
more importantly, do you think the problem I am having is due to these
configuration short comings?Thanks much.
David take a look at:
http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html
Joshua D. Drake
Show quoted text
David
---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster