PostgreSQL pre-fork speedup
I had lots of trouble posting, so you may receive this
more than once. My apologies.
------------------------------
Hi,
I know the issue of pre-forking PostgreSQL has been
discussed previously. Someone mentioned that pre-forking
could be implemented once schemas become available in
PostgreSQL, because there will then be less need to run
multiple databases.
I think Oracle 7 uses pre-forking, and it speeds up
connection startup time considerably. There are often
cases where connection pooling or persistent connections
cannot be used efficiently (e.g. databases replicated or
split over hundreds of machines, or where persistent
connections leave too many idle connections open).
Instead, there is a real need to create a new connection
on every query, and with PostgreSQL having to fork on
every incoming connection, that can be quite slow.
This could be a big win since even a moderate
improvement at the connection
level will affect almost every user. Any chance of
that happening for 7.5?
Thanks.
__________________________________
Do you Yahoo!?
Win a $20,000 Career Makeover at Yahoo! HotJobs
http://hotjobs.sweepstakes.yahoo.com/careermakeover
sdv mailer wrote:
Instead, there's a big need to
create a new connection on
every query and with PostgreSQL needing to fork on
every incoming connection
can be quite slow.
Really? My general experience has been that forking/connection setup
times are very good with PgSQL. Do not assume your Oracle experience
transfers directly over -- Oracle has very large connection-time
overheads; PgSQL does not.
This could be a big win since even a moderate
improvement at the connection
level will affect almost every user. Any chance of
that happening for 7.5?
Only if you do it yourself, probably. The calculation of the developers
appears to be that the amount of time spent by the database on
fork/connect will generally be dwarfed by the amount of time spent by
the database actually doing work (this being a database, the actual
workloads required of the backend are much higher than, say, for a web
server). So the operational benefit of adding the complexity of a
pre-fork system is not very high. And if you have the rare workload
where a pre-fork actually *would* speed things up a great deal, you can
solve the problem yourself with a connection-pooling middleware.
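The pooling idea amounts to paying the connection cost once, up front, and then handing connections out and taking them back instead of reconnecting per query. A minimal sketch of the pattern (Python used for illustration; the stub factory here stands in for a real driver's connect call, which is an assumption, not any particular library's API):

```python
import queue

class ConnectionPool:
    """Minimal client-side pool: open a fixed set of connections up
    front, then hand them out and take them back instead of
    reconnecting on every query."""

    def __init__(self, connect, size):
        self._idle = queue.LifoQueue()   # LIFO: reuse the warmest connection
        for _ in range(size):
            self._idle.put(connect())    # connection cost paid once, here

    def acquire(self):
        return self._idle.get()          # blocks while all connections are busy

    def release(self, conn):
        self._idle.put(conn)

# Demo with a stub factory; a real application would pass the driver's
# connect callable instead.
made = []
def fake_connect():
    conn = object()
    made.append(conn)
    return conn

pool = ConnectionPool(fake_connect, size=2)
c1 = pool.acquire()
pool.release(c1)
c2 = pool.acquire()   # the same connection comes back: no new connect
```

The point is that the factory runs exactly `size` times no matter how many queries are issued; all the fork/startup cost moves out of the per-query critical path.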
--
Paul Ramsey
Refractions Research
Paul Ramsey <pramsey@refractions.net> writes:
... So the operational benefit of adding the complexity of a
pre-fork system is not very high.
In particular, most of the connection startup overhead work cannot be
performed until we've identified which database to connect to (since
it largely consists of bootstrapping access to the system catalogs
in that database). If you want that work to be done in advance of
receiving a client connection request, life is much more complicated
than it would be for something like Apache.
There is considerable discussion of this point in the pgsql-hackers
archives. I'd suggest reading the past threads before trying to do
anything yourself.
regards, tom lane
Forking consumes a large amount of CPU when you have
many simultaneous connections, and it adds to the
latency. MySQL users in particular may think
PostgreSQL's connection time is much slower, because
these users tend to perform relatively simple queries.
In my case, connection pooling and persistent
connections are useless for a large server farm
consisting of hundreds of partitioned and replicated
servers doing only simple queries.
Below is a benchmark of MySQL 3.2 and PostgreSQL 7.4
doing multiple connects/disconnects against the same
server (AMD 1.2GHz, 512MB, Linux 2.4). If forking is
the issue, then pre-forking will give a big boost,
especially for simple queries:
MySQL time
----------
0.012786865234375
0.011546850204468
0.01167106628418
<?php
// Time 20 connect/select-db/disconnect cycles.
// microtime(true) returns the current time as a float (PHP >= 5).
$time_start = microtime(true);
for ($i = 0; $i < 20; $i++) {
    $DBH = mysql_connect('127.0.0.1');
    mysql_select_db('test1');
    mysql_close($DBH);
}
$Time = microtime(true) - $time_start;
?>
MySQL time (with simple query)
------------------------------
0.015650987625122
0.01443886756897
0.014433860778809
<?php
// Same loop, but running a simple query on each connection.
$time_start = microtime(true);
for ($i = 0; $i < 20; $i++) {
    $DBH = mysql_connect('127.0.0.1');
    mysql_select_db('test1');
    $Res = mysql_query('SELECT * FROM table1 WHERE id = 1', $DBH);
    mysql_close($DBH);
}
$Time = microtime(true) - $time_start;
?>
PostgreSQL time
---------------
0.15319013595581
0.14930582046509
0.14920592308044
<?php
// Time 20 connect/disconnect cycles against PostgreSQL.
$time_start = microtime(true);
for ($i = 0; $i < 20; $i++) {
    $DBH = pg_connect('dbname=test1 host=127.0.0.1');
    pg_close($DBH);
}
$Time = microtime(true) - $time_start;
?>
PostgreSQL time (with simple query)
------------------------------------
0.19016313552856
0.18785095214844
0.18786096572876
<?php
// Same loop, but running a simple query on each connection.
$time_start = microtime(true);
for ($i = 0; $i < 20; $i++) {
    $DBH = pg_connect('dbname=test1 host=127.0.0.1');
    $Res = pg_query($DBH, 'SELECT * FROM table1 WHERE id = 1');
    pg_close($DBH);
}
$Time = microtime(true) - $time_start;
?>
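For what it's worth, dividing the first run of each benchmark by its 20 loop iterations gives the approximate per-connection cost being argued about (a rough back-of-the-envelope check using only the numbers above, not a rigorous benchmark):

```python
# Per-iteration cost, using the first timing from each benchmark above
# (each loop ran 20 connect/disconnect cycles).
mysql_connect = 0.012786865234375 / 20   # about 0.64 ms per connect/close
pg_connect    = 0.15319013595581 / 20    # about 7.7 ms per connect/close

print(round(pg_connect * 1000, 2), "ms per PostgreSQL connect")
print(round(pg_connect / mysql_connect, 1), "x slower than MySQL connect")
```

So the benchmark puts PostgreSQL's per-connection overhead at roughly a dozen times MySQL's, while still being well under 10 ms in absolute terms.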
sdv mailer <sdvmailer@yahoo.com> writes:
Forking consumes a large amount of CPU when you have
many simultaneous connections and adds up to the
latency. Particularly MySQL users may think
PostgreSQL's connection time is much slower because
these users tend to perform relatively simple queries.
Frankly, I think part of the reason you'll get little traction on this front is
that some people consider an application that puts establishing a database
connection in the critical path, especially one that does only a single simple
query and disconnects, a stupid design.
If connection time is a critical element, it's trivial to move it outside
the critical path and reuse connections. You'll get much better performance
that way as well, since both Postgres and Linux will have more time to observe
how the process behaves and adjust cache and scheduling behaviour.
In my case, connection pooling and persistent
connection is useless for a large server farm
consisting of hundreds of partitioned and replicated
servers doing only simple queries.
Well, have you tried either? It would involve having a lot more idle
connections, but then that's effectively the same as "preforking" anyway. Not
only would they be "preforked", but they wouldn't have to be shut down and
recreated repeatedly at all, even outside the critical path.
If the idle connections consume too much memory to be feasible (say, if
you have thousands of database servers but only a small, unpredictable subset
of them is busy at any time), then "preforking" wouldn't really help much
either; it would suffer the same problem.
--
greg
We used to run persistent connections until the DB
servers got maxed out because too many idle
connections were sucking up all the memory. Web servers run
different loads than database servers, and persistent
connections are notorious for crashing your DB.
Connection pooling (eg. SQLRelay) didn't work either
because we needed to connect to hundreds of DB servers
from each web server. Imagine having 200+ open
connections on the web server and how many more of
these connections remain idle. The situation gets
worse when you multiply by an even greater number of
web servers connected to all these database servers.
Do the math! We're talking large server farm here, not
2 or 3 machines.
Saving those X ms can be substantial for a large number
of simultaneous connections and shouldn't be
neglected; otherwise, why have persistent connections or
connection pooling in the first place? Imagine every
query using up that X ms of time just for
connecting/forking. From experience, it adds up to a lot.
I think pre-forking can be beneficial and is a lot
simpler than rewriting PostgreSQL as a multi-threaded
DB server. Pre-forking would not consume as much memory
as persistent connections, because it scales with the
database load and NOT with the web server load. I'm
guessing pre-forking will benefit most on systems
where launching a new process is expensive (Win32,
certain UNIXes).
Here's a snippet from one of the Apache conferences:
"Traditionally TCP/IP servers fork a new child to
handle incoming requests from clients. However, in the
situation of a busy web site, the overhead of forking
a huge number of children will simply suffocate the
server. As a consequence, Apache uses a different
technique. It forks a fixed number of children right
from the beginning. The children service incoming
requests independently, using different address
spaces. Apache can dynamically control the number of
children it forks based on current load. This design
has worked well and proved to be both reliable and
efficient; one of its best features is that the server
can survive the death of children and is also
reliable. It is also more efficient than the canonical
UNIX model of forking a new child for every request."
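The design that quote describes (bind once in the parent, fork a fixed set of children that each block in accept() on the shared socket) can be sketched in a few lines. This is an illustration in Python, not how Apache or PostgreSQL actually implement anything; a real server would add dynamic child management and signal handling:

```python
import os
import socket

def prefork_server(nchildren=3):
    """Bind a listening socket in the parent, then fork children that
    all block in accept() on it; the kernel hands each incoming
    connection to exactly one idle child."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))   # ephemeral port, for the demo
    srv.listen(nchildren)
    pids = []
    for _ in range(nchildren):
        pid = os.fork()
        if pid == 0:
            # Child: the fork cost was paid before any client arrived.
            # For this demo each child serves one request and exits.
            conn, _ = srv.accept()
            conn.sendall(b"pong:" + conn.recv(64))
            conn.close()
            os._exit(0)
        pids.append(pid)
    return srv, srv.getsockname()[1], pids

# Demo: three clients get served without a single post-connect fork.
srv, port, pids = prefork_server(3)
replies = []
for _ in range(3):
    c = socket.create_connection(("127.0.0.1", port))
    c.sendall(b"ping")
    replies.append(c.recv(64))
    c.close()
for pid in pids:
    os.waitpid(pid, 0)
srv.close()
```

Note the catch Tom Lane raises above: for PostgreSQL, the expensive part is not the fork itself but the per-database backend startup, which cannot be done before knowing which database the client wants.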
Besides solving my own problems, having a pre-fork
solution would benefit PostgreSQL too. MySQL has a
reputation for fast connections, and people notice it
because you cannot avoid simple queries (e.g.
counters, session retrieval, etc.). The truth of the
matter is that many people still operate on a
connect/query/disconnect model running simple queries,
and if you can satisfy these people it can be a big
marketing win for PostgreSQL.
Many web hosting companies out there don't allow
persistent connection, which is where MySQL shines.
Over and over again, we hear people say how MySQL is
fast for the Web because it can connect and execute
simple queries quickly. Take for instance
http://www-css.fnal.gov/dsg/external/freeware/pgsql-vs-mysql.html
"MySQL handles connections very fast, thus making it
suitable to use MySQL for Web - if you have hundreds
of CGIs connecting/disconnecting all the time you'd
like to avoid long startup procedures."
and
http://www-css.fnal.gov/dsg/external/freeware/Repl_mysql_vs_psql.html
"MySQL handles connections and simple SELECTs very
fast."
Likely PostgreSQL is just as fast, but if people don't
see that on their first try running a simple query, then
MySQL has already won the war when it comes to speed.
Another benchmark I came across:
http://www.randomnetworks.com/joseph/blog/?eid=101
sdv mailer wrote:
We used to run persistent connection until the DB
servers got maxed out because of too many idle
connections sucking up all the memory. Web servers run
different loads than database servers and persistent
connections are notorious for crashing your DB.
And this translates from your experiences with MySQL to PostgreSQL? You
haven't made it clear which platforms and what level of concurrent
connections gave you this behaviour. Tom Lane has already explained that
most of the connection time is probably spent configuring the connection
based on the database required, etc.
Connection pooling (eg. SQLRelay) didn't work either
because we needed to connect to hundreds of DB servers
from each web server. Imagine having 200+ open
connections on the web server and how many more of
these connections remain idle. The situation gets
worse when you multiply by an even greater number of
web servers connected to all these database servers.
Do the math! We're talking large server farm here, not
2 or 3 machines.
And "preforking" makes this different, how? Perhaps having a pool of
processes ready to be handed a query to a specific database, where you
configure N connections to db1, M to db2, etc., still means lots of resource
usage. In effect a preforked database server *is* an idle connection, just
without the TCP establishment and teardown sequence, which is negligible on
modern platforms -- and even if it were not negligible, it would be
effectively identical regardless of the chosen DB platform.
I think pre-forking can be beneficial and is a lot
simpler than to rewrite a multi-threaded DB server.
This is open source; feel free to do a proof of concept (or pay someone to
do a proof of concept), run the numbers, and see if your assertions hold for
real. Many others here with more experience than myself of running thousands
of connections at once don't appear to think so. My limited experience with
many hundreds of "idle" connections is that it is not particularly taxing at
all on even semi-modern hardware (PIII/512MB etc.).
Peter
And "preforking" makes this different, how? Perhaps having a pool of
processes ready to be handed a query to a specific database, where you
configure N connections to db1, M to db2, etc., still means lots of resource
usage. In effect a preforked database server *is* an idle connection, just
without the TCP establishment and teardown sequence, which is negligible on
modern platforms -- and even if it were not negligible, it would be
effectively identical regardless of the chosen DB platform.
In theory, it should drastically reduce the number of idle connections
for poor connection pooling on the other end.
The problem is pools for Apache that establish 1 connection per Apache
backend. 100 Apache backends means 100 backend connections (50 of which
may be idle, since not all pages use the database). Multiply that by 40
webservers and you have a real mess of idle connections.
Cutting that count down to 10 idlers in total by having PostgreSQL
prefork a specific database would make a significant difference.
The other (preferable) alternative is to convince Apache to use a common
connection pool per server rather than per Apache backend.
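The difference between the two models can be made concrete: with one connection per backend, the idle-connection count tracks the number of workers, while with a common pool it tracks the pool size. A small threaded sketch (Python for illustration; the "connections" are just labeled stand-ins, not real database handles):

```python
import threading
import queue

def run_workers(nworkers, pool_size):
    """Each worker simulates one page request: it needs the 'database'
    only briefly, so a small shared pool can serve many workers."""
    pool = queue.Queue()
    for i in range(pool_size):
        pool.put(f"conn{i}")        # stand-ins for real DB connections
    used = set()
    lock = threading.Lock()

    def page():
        conn = pool.get()           # borrow only for the DB phase
        with lock:
            used.add(conn)          # record which connection served us
        pool.put(conn)              # return it right after the "query"

    threads = [threading.Thread(target=page) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return used

# 100 "Apache backends" are served by at most 10 distinct connections,
# instead of holding 100 persistent ones open.
used = run_workers(nworkers=100, pool_size=10)
```

The per-backend model would instead open one connection per worker: 100 connections, most of them idle at any instant, which is exactly the mess described above.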
On Mon, May 03, 2004 at 11:59:45PM -0700, sdv mailer wrote:
Connection pooling (eg. SQLRelay) didn't work either
because we needed to connect to hundreds of DB servers
from each web server. Imagine having 200+ open
connections on the web server and how many more of
these connections remain idle. The situation gets
This sounds like a case where you probably ought to be using schema
support instead of many different databases, for the record. I don't
see how pre-forking is going to help you at all, because a connection
is to a database, so you're going to have to pick one, and it's
as likely as not to be the wrong one.
A
--
Andrew Sullivan | ajs@crankycanuck.ca
Rod Taylor <pg@rbt.ca> writes:
Cutting that count down to 10 idlers in total by having PostgreSQL
prefork a specific database would make a significant difference.
Well, it would be 10 for each database, since, as has been pointed out
before, loading the database is most of the delay.
If that's enough, why not just run 10 Apache processes instead of 100?
I'm assuming the static, non-database-driven content is already separated onto
other servers. In which case running 100 Apache processes, most of which are
idle, is the source of the problem.
--
greg
Pre-fork does not equal idle connections! Pre-fork
scales with database load, whereas persistent
connections scale with webserver load. A web server
that is heavily loaded but not necessarily performing
a lot of database activity will spawn hundreds of idle
database connections using persistent connections. With
pre-fork, you can potentially lower this to as few as
10 open connections.
Forking is quite fast on Linux, but creating a new
process is still 10x more expensive than creating a
thread, and it is even worse on the Win32 platform.
CPU load goes up because the OS needs to
allocate/deallocate memory, making it difficult to
reach a steady state of resource consumption.
More importantly, solving the forking delay will have
a big impact on the minds of people who have been
given the impression that forking is very, very slow.
Here's what one site has to say about PostgreSQL's forking:
http://www.geocities.com/mailsoftware42/db/
"Postgres forks on every incoming connection - and the
forking process and backend setup is a bit slow, but
one can speed up PostgreSQL by coding things as stored
procedures"
Pre-fork would give MySQL one less argument to throw at
PostgreSQL.
I think optimizing in this area will speed up the
general case for everyone, rather than optimizing a
feature that affects 10% of the users. On top of that,
it will make a strong marketing case, because forking
will no longer be a speed issue when compared to
MySQL.
I'm talking about connecting to multiple database
servers on separate machines. Schemas don't apply
here.
How much work would it take to make a pre-fork system
smart enough to open different databases on incoming
connections? How much of it can be modeled after
Apache?
On Wednesday 05 May 2004 07:24 am, Rod Taylor wrote:
And "preforking" makes this different, how? Perhaps having a pool of
processes ready to be handed a query to a specific database, where you
configure N connections to db1, M to db2, etc., still means lots of
resource usage. In effect a preforked database server *is* an idle
connection, just without the TCP establishment and teardown sequence,
which is negligible on modern platforms -- and even if it were not
negligible, it would be effectively identical regardless of the chosen
DB platform.
In theory, it should drastically reduce the number of idle connections
for poor connection pooling on the other end.
If the client is poorly written, there is nothing the server side can
really do to prevent that.
The problem are pools for Apache that establish 1 connection per Apache
backend. 100 Apache backends means 100 backend connections (50 of which
may be idle as not all pages use the database). Multiply that by 40
webservers and you have a real mess of idle connections.
Or, you run several separate Apache webservers. The ones that serve static
content or don't need database connections don't run with the ones that do.
And just like each idle Apache process uses memory and other resources,
each idle PostgreSQL connection does too. So managing the number of Apache
connections so that there aren't too many or too few solves the problem of
having too many or too few idle database connections. This is all stuff
that I personally have managed and planned for, and it is quite easy to do
without any connection pooling on the server side.
It all comes down to management, which Apache does a reasonable job of.
Either we duplicate the efforts of Apache (they are non-trivial), or we
piggy-back on their success. And who's to say that the right solution for
Apache is the right solution for another application? Are we going to
implement a different flavor of management for each kind of application?
I suggest you implement server-side connection pooling and see for yourself:
(a) How much overhead there is for configuration (which databases? How many
idle?)
(b) How much easier it is to do on the client side after all.
If you really believe that you are right and I am wrong, then prove it. I'll
be happy to be shown the error of my thinking (and see an improvement to
PostgreSQL in the process).
That's the great thing about Open Source. We can all talk the talk, but it
comes down to whoever actually walks the walk. In the proprietary world, no
one gets a chance to walk the walk.
--
Jonathan Gardner
jgardner@jonathangardner.net
Or, you run several separate Apache webservers. The ones that serve static
content or don't need database connections don't run with the ones that do.
And just like each idle Apache process uses memory and other resources,
each idle PostgreSQL connection does too. So managing the number of Apache
Considered that, but it doesn't help much. The duty cycle of any given
page is about 20% database, 80% webserver work. So at any given time, 80%
of the connections to the database will be idle in a best-case scenario.
If Apache did decent connection pooling, or PostgreSQL gave us a hand,
then a given webserver would need 1/4 of the connections, which could be
internally shared.
Page 1 start
Page 1 DB connect
Page 1 DB disconnect
.
. <IDLE persistent connection as work happens>
.
Page 1 transmit results
If we could really disconnect from the database and not suffer high
re-connection overhead, OR have Apache recognize the connection is unused
and allow another Apache backend to use it, there would not be a problem.
It all comes down to management, which Apache does a reasonable job of.
If you really believe that you are right and I am wrong, then prove it. I'll
be happy to be shown the error of my thinking (and see an improvement to
PostgreSQL in the process).
You wouldn't run into a problem like this on a system with good
connection pooling. JBoss comes to mind: once a connection is free, it is
available for other threads to use. AOLserver is a webserver which
demonstrates proper connection pooling.
Apache is the problem we're trying to work around. It does everything
per backend, rather than having a common pool for the server. That can
be fixed by improving PostgreSQL or by doing something (I'm not sure
what) with Apache.
On Wed, 2004-05-05 at 11:57, Greg Stark wrote:
Rod Taylor <pg@rbt.ca> writes:
Cutting that count down to 10 idlers in total by having PostgreSQL
prefork a specific database would make a significant difference.
Well, it would be 10 for each database, since as has been pointed out
before, loading the database is most of the delay.
If that's enough, why not just run 10 Apache processes instead of 100?
Because then we would need 10 times as many servers ;)
I'm assuming the static non-database driven content is already separated onto
other servers. In which case running 100 apache processes, most of which are
idle is the source of the problem.
Most of it has been. It's the duty cycle. As stated in another email,
only about 20% of the work a script does is database related -- which
occurs all at one time. Even when all Apache backends are active, a
large number of connections will be idle but were used or will be used
at some point during the generation of that page.
It really is an Apache fault -- but I don't think it can be fixed within Apache itself.
sdv mailer wrote:
I'm talking about connecting to multiple database
servers on separate machines. Schemas don't apply
here.
How much work would it take to make a pre-fork smart
enough to open different databases on incoming
connection? How much of it can be modeled after
Apache?
I've not used it but Tatsuo Ishii has just released pgpool v1.0. Quoting
from its README:
1. What is pgpool
pgpool is a connection server program for PostgreSQL. pgpool runs
between PostgreSQL's client(frontend) and server(backend). Any
PostgreSQL clients can connect to pgpool as if it's a real
PostgreSQL server.
pgpool caches the connection to PostgreSQL server to reduce the
overhead to establish the connection to it.
Also pgpool could use two PostgreSQL servers for fail over
purpose. If the first server goes down, pgpool will automatically
switch to the secondary server.
If that's not exactly what you're after, it must be fairly close. Check the
mailing list archives -- the download URL is in the "announce" list, April
2004.
I'm sure any real figures from your testing will be of much interest to
all of us.
HTH
--
Richard Huxton
Archonet Ltd
sdv mailer wrote:
[snip]
Pre-fork will give MySQL one less argument to throw at
PostgreSQL.
I think optimizing in this area will speed up the
general case for everyone rather than optimizing a
feature that affects 10% of the users. On top of that,
it will make a strong marketing case because forking
will no longer be a speed issue when compared to
MySQL.
So when can we expect to see your proof of concept code and benchmarks
to show the speedup achieved?
cheers
andrew
On Wed, May 05, 2004 at 02:12:58PM -0400, Rod Taylor wrote:
Most of it has been. It's the duty cycle. As stated in another email,
only about 20% of the work a script does is database related -- which
occurs all at one time. Even when all Apache backends are active, a
large number of connections will be idle but were used or will be used
at some point during the generation of that page.
It really is an Apache fault -- but I don't think it can be fixed within Apache itself.
mod_pg_pool or mod_valet_sql - Apache modules to handle postgresql
connection pools
http://sqlrelay.sourceforge.net/
http://dbbalancer.sourceforge.net/
Database connection pooling software
And, of course, most development environments (perl, php, java etc)
have their own language specific connection pooling solutions.
Cheers,
Steve
On Wed, 5 May 2004, sdv mailer wrote:
Forking is quite fast on Linux but creating a new
process is still 10x more expensive than creating a
thread and is even worse on Win32 platform. CPU load
goes up because the OS needs to allocate/deallocate
memory making it difficult to get a steady state
resource consumption.
Just a nit to pick here. In Linux, the difference between forking and
spawning a new thread is almost nothing. Definitely less than a factor of
2, and most assuredly less than the quoted factor of 10 here.
The fact that windows has a heavy process / lightweight thread design
means little to me, since I'll likely never deploy a production postgresql
server on it that needs to handle any serious load.
* Steve Atkins (steve@blighty.com) wrote:
On Wed, May 05, 2004 at 02:12:58PM -0400, Rod Taylor wrote:
Most of it has been. It's the duty cycle. As stated in another email,
only about 20% of the work a script does is database related -- which
occurs all at one time. Even when all Apache backends are active, a
large number of connections will be idle but were used or will be used
at some point during the generation of that page.
It really is an Apache fault -- but I don't think it can be fixed within Apache itself.
mod_pg_pool or mod_valet_sql - Apache modules to handle postgresql
connection pools
Looks like what we need are functions in PHP (or whatever language he's
using) that use the functions provided by these Apache modules, if they
don't already exist (as far as I know they don't?).
Stephen