how good is PostgreSQL
Hi,
For users of large PostgreSQL installations and PostgreSQL builders, this is for you.
I'm having a terrible time deciding now. :(
We're about to build a "huge" website now. I got tied up in signing the
contract without really getting enough information about PgSQL, since this is
what we plan to implement with PHP (normally we use MySQL, but I guess it
does not fit huge databases like that).
Here's my problem: we're about to build a site like hitbox.com, where a large
amount of database capacity is required. If, say, there are 100,000 users with
1,000 page hits per day each, and everything is logged, you can imagine how
huge this will be. I'm just so "nervous" (really, that's the term) that we'll
implement this and later on experience a slowdown or, worse than that, a
server crash.
My questions are:
1. What is the limit on the number of records in a table WITHOUT SUFFERING A
SLOWDOWN?
2. ... the limit on the number of tables per database?
3. ... the limit on the number of databases?
Thanks for your comments. I would really appreciate every comment I receive
regarding this.
Arnold
"Arnold Gamboa" <arnold@php4us.com> writes:
> We're about to build a "huge" website now. I got tied up in signing the
> contract without really getting enough information about PgSQL, since this is
> what we plan to implement with PHP (normally we use MySQL, but I guess it
> does not fit huge databases like that).
Can you do connection pooling and client-side caching of database queries
in PHP? Working with Java, this is where we really improve speed.
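A client-side query cache of the sort described here can be sketched in a few lines (Python used as a stand-in for PHP; `QueryCache` and `fake_db` are illustrative names, not an existing library):

```python
import time

class QueryCache:
    """Client-side cache: remember result sets for repeated SELECTs."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # query text -> (timestamp, rows)

    def get(self, query, run_query):
        now = time.time()
        hit = self.store.get(query)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]          # served from cache, no round trip
        rows = run_query(query)    # cache miss: hit the database
        self.store[query] = (now, rows)
        return rows

calls = []
def fake_db(query):
    calls.append(query)            # count real database round trips
    return [("row1",), ("row2",)]

cache = QueryCache(ttl_seconds=60)
first = cache.get("SELECT * FROM users", fake_db)
second = cache.get("SELECT * FROM users", fake_db)  # no second round trip
```

The second lookup never touches the database, which is exactly the saving a pooled, caching client buys you.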
Here's my problem.. We're about to build a site like hitbox.com where there
is a large amount of database required.. If say there is 100,000 users with
1000 page hits per day for each, and everything will be logged, you could
imagine how huge this will be. I'm just so "nervous" (really, that's the
term) if we implement this and later on experience a slow down or worse than
that, crash in the server.
How many database queries do you have per page hit ?
How many database inserts/updates do you have per page hit ?
Are you using the database for httpd access logging, or is it application-level
logging? Either way, you might want to look into an architecture where you
have a dedicated box for the logging.
But most important, test with real data. Populate your database and run
stress tests.
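A stress test along these lines might look like the following sketch. It uses Python's built-in sqlite3 purely as a self-contained stand-in; against a real PostgreSQL server you would point the same loop at your actual schema and query mix:

```python
import sqlite3, time

# Populate a throwaway table with synthetic data resembling the real load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (user_id INTEGER, url TEXT)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [(i % 1000, "/page/%d" % i) for i in range(10000)])
conn.commit()

# Time a burst of the per-page queries and report a rate.
start = time.time()
n_queries = 500
for i in range(n_queries):
    conn.execute("SELECT COUNT(*) FROM hits WHERE user_id = ?",
                 (i % 1000,)).fetchone()
elapsed = time.time() - start
rate = n_queries / elapsed if elapsed > 0 else float("inf")
```

The number that comes out is only meaningful when the data volume and query shapes match production, which is the point of populating with real data first.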
I was doing some testing on a portal my company has developed with
PostgreSQL as the backend database. Running on my Linux laptop (a P466 with
128MB RAM, Apache JServ, PostgreSQL 7.0.2), I managed to get about ~20 pageviews
a second. Each pageview had on average 4 queries and 1 insert.
But measure for yourself. Remember that you can gain a lot by tuning the
application, the database, and the OS.
regards,
Gunnar
Arnold Gamboa wrote:
> Hi,
> For users of large PostgreSQL installations and PostgreSQL builders, this is
> for you.
> I'm having a terrible time deciding now. :(
> We're about to build a "huge" website now. I got tied up in signing the
> contract without really getting enough information about PgSQL, since this is
> what we plan to implement with PHP (normally we use MySQL, but I guess it
> does not fit huge databases like that).
> Here's my problem: we're about to build a site like hitbox.com, where a large
> amount of database capacity is required. If, say, there are 100,000 users with
> 1,000 page hits per day each, and everything is logged, you can imagine how
> huge this will be. I'm just so "nervous" (really, that's the term) that we'll
> implement this and later on experience a slowdown or, worse than that, a
> server crash.
That is a LOT of work for any system. That is over 1100 page views a second, or
under 900µs each. A standard Pentium III system serving static pages would
have problems with that.
If you look at search engines, to get that performance with readonly data, they
usually cluster multiple systems and load balance across them. You may need to
segment your data and have multiple SQL servers perform different functions.
Also, that 1100 page view per second is assuming an even distribution of
traffic, which does not happen in a web server. If you average that much,
chances are there will be periods of twice that.
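The arithmetic behind those figures, spelled out:

```python
users = 100_000
hits_per_user_per_day = 1_000
seconds_per_day = 24 * 60 * 60            # 86,400

total_hits = users * hits_per_user_per_day   # 100 million page views/day
average_rate = total_hits / seconds_per_day  # ~1157 pages/second averaged
peak_rate = 2 * average_rate                 # the rule of thumb above: plan
                                             # for at least twice the average
```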
Look into a "local director," "Alteon," or even LVS.
> My questions are:
> 1. What is the limit on the number of records in a table WITHOUT SUFFERING A
> SLOWDOWN?
> 2. ... the limit on the number of tables per database?
> 3. ... the limit on the number of databases?
There are a couple of factors involved; it's more complex than a simple response.
Use multiple databases and put each on a separate disk, with its own
controller. Better yet, have multiple load balanced web boxes do a lot of
processing in PHP and offload much of the CPU bound SQL work to the "cheap" web
boxes, and have multiple SQL databases in the back handling various independent
tasks.
In a web site I worked on, we had multiple front end web servers, load balanced
with an Alteon. Each web server had its own SQL database which provided SQL
access to "static" data which was updated each week. We had an additional
single SQL database backend which all the Web servers accessed for synchronized
dynamic data.
If you are serious about the load you expect to put on this system you must be
careful:
Do not create any indexes you do not need.
Do not use the "foreign key" constraint as it forces a trigger for each insert.
Make sure you index the keys by which you will access data.
Avoid searching by strings, try to use keys.
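The indexing advice above can be demonstrated concretely (sqlite3 used here as a self-contained stand-in; on PostgreSQL you would check the same thing with EXPLAIN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(i, "user%d" % i, "u%d@example.com" % i) for i in range(1000)])

# Index only the key you will actually search by -- nothing else.
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Confirm the planner uses the index rather than a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE email = ?",
    ("u42@example.com",)).fetchall()
plan_text = " ".join(str(row) for row in plan)
```

If the plan shows a scan instead of the index, the query cost grows with the table, which is exactly the slowdown being warned about.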
Even after that, you have a long way to go before you will hit 1000
transactions per second from any SQL database.
If you are betting your business on this implementation, you have a lot of
homework to do.
> Thanks for your comments. I would really appreciate every comment I receive
> regarding this.
> Arnold
> Even after that, you have a long way to go before you will hit 1000
> transactions per second from any SQL database.
I guess they could always buy a few Sun E10000's on the backend, and a
large room of rack-mountable PC's for web/CGI serving. Nothing like
plopping down ten or twenty million dollars on hardware. : )
steve
Steve Wolfe wrote:
> > Even after that, you have a long way to go before you will hit 1000
> > transactions per second from any SQL database.
> I guess they could always buy a few Sun E10000's on the backend, and a
> large room of rack-mountable PC's for web/CGI serving. Nothing like
> plopping down ten or twenty million dollars on hardware. : )
Or they could buy a single IBM S/390, run Linux/390 and PostgreSQL on
that. Probably would cost less, and be more reliable. And they can
always load another Linux/390 VM -- an S/390 can run something like
41,000 virtual machines each running Linux/390 and Apache.
However, if you want to see the architecture of a _large_
database-backed website, see the story behind Digital City at
www.aolserver.com. While they're using Sybase instead of PostgreSQL,
the architecture is the same.
--
Lamar Owen
WGCR Internet Radio
1 Peter 4:11
> Or they could buy a single IBM S/390, run Linux/390 and PostgreSQL on
> that. Probably would cost less, and be more reliable. And they can
> always load another Linux/390 VM -- an S/390 can run something like
> 41,000 virtual machines each running Linux/390 and Apache.
Yeah.... I'm very optimistic about IBM's new chips that are coming out
next year. Each "processor module" will have 4 processors, but each
processor will have 2 cores - so in effect, each "processor module" has 8
processors on it. All processors will have copper interconnects and,
depending on the source, will debut at anywhere from 1.3 to 2 gigahertz. I
think that will certainly help them get a larger share of the high-end
market!
steve
> Even after that, you have a long way to go before you will hit 1000
> transactions per second from any SQL database.
Since my last post probably wasn't too useful, here's some information
that might be a little more help. It's a little long, I know, but hopefully
it will be of use to someone.
As programmers, we naturally want to throw things into databases for
three reasons. First, it's easy to get data in. Second, it's easy to get
relevant data out. And third, it's "cool". We don't want to work with flat
files, now do we? ; )
However, in some cases, using the database to get data out ends up
costing us a lot of time and money. Sometimes we do the same nasty query so
often, that we end up purchasing bigger hardware to make the system work
reasonably. Why? Because it was easier for us to write a program that did:
GetDataFromDatabase();
PrepareData();
PrintData();
Each time, the database server does the work. But it doesn't
necessarily have to be that way. In our company, we've found two trends
that have enabled us to save a LOT of processing power on our machines.
(read: Increase the capacity of our servers by 30% or more, with fairly
minor changes)
The first case is that of rarely-changing data. Some of our datasets
probably have around 50,000 to 1,000,000 views (selects) for each update
(insert/delete). Having the database repeat the query every time is a
waste. So, we began writing our programs such that they grab the data
from the database once and generate the HTML for every page, plus the
indexes. Then, when an update is made to the database (via the
administrative tools), it simply rewrites *the relevant HTML files* and
changes the indices pointing to them. (There are also some other very large
advantages to this sort of thing, but I'm not allowed to say them. ; ) )
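The regenerate-on-update scheme described above can be sketched roughly like this (all names are illustrative, not the actual system):

```python
import os, tempfile

PAGES = {}  # stand-in for rows in the database: page id -> body text

def render_page(page_id, body, outdir):
    """Write the static HTML file for one page; called only on updates."""
    path = os.path.join(outdir, "%s.html" % page_id)
    with open(path, "w") as f:
        f.write("<html><body>%s</body></html>" % body)
    return path

def update_page(page_id, body, outdir):
    """Admin-tool hook: change the data, then rewrite only the affected file."""
    PAGES[page_id] = body
    return render_page(page_id, body, outdir)

outdir = tempfile.mkdtemp()
path = update_page("about", "Rarely-changing content", outdir)
with open(path) as f:
    html = f.read()
```

The web server then serves the flat file on every hit; the database is touched once per update instead of once per view.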
The second case is that of often-repeated queries. One of the
offerings on our site is an online directory, which gets a pretty fair
amount of traffic. Unfortunately, it uses a proprietary program that was
purchased by management before they spoke with us. Grr.... It was the
most utterly inefficient program I've ever seen. It would *not* allow the
database to do joins, it would grab entire tables, then try to do the joins
itself, in Perl.
We rewrote the program to let PostgreSQL do the joins, and that sped
it up. Then we realized that a very small number of queries (those for the
first one or two levels of pages) accounted for a huge portion of the
usage. So, we replaced the front page with a static HTML page (the front
page doesn't change...), and saw another terrific drop in our system loads.
Overall, by only modifying a couple of our more heavily-used programs,
our server loads dropped by about 30%-40%. If we went to the trouble to
modify some others, it would drop even more. But we're going to rewrite
them completely for other reasons. : )
In any event, there are ways like this to save a LOT of CPU and disk I/O.
Most web servers can serve out several hundred static pages with the
resources that would otherwise deliver one dynamically-created,
database-driven page. It also allows you to cluster the web servers with
cheap commodity hardware, instead of using big iron on the database. And if
you have a big-iron machine running the back end, this can severely lighten
the load on it, keeping you from dropping a few hundred grand on the next
step up. ; )
(Incidentally, we've toyed around with developing a query-caching system
that would sit between PostgreSQL and our DB libraries. However, it seems
like it could be done *much* more efficiently in PostgreSQL itself, as it
would be much easier to keep track of which tables have changed, etc.
Anybody know if this sort of functionality is planned? It would be terrific
to simply give the machine another 256 megs of RAM and tell it to use it as
a DB cache...)
steve
markw wrote:
> Arnold Gamboa wrote:
> > Hi,
> > For users of large PostgreSQL installations and PostgreSQL builders, this
> > is for you.
> ..snip..
> Also, that 1100 page view per second is assuming an even distribution of
> traffic, which does not happen in a web server. If you average that much,
> chances are there will be periods of twice that.
That's excessively optimistic. If your daily average is 1100 per second, you'll
have 2200 average for many of the hours in that day, 5500 for a few hours, and
some 10-minute periods with 11,000, certainly once in a while.
++ kevin
--
Kevin O'Gorman (805) 650-6274 mailto:kogorman@pacbell.net
Permanent e-mail forwarder: mailto:Kevin.O'Gorman.64@Alum.Dartmouth.org
At school: mailto:kogorman@cs.ucsb.edu
Web: http://www.cs.ucsb.edu/~kogorman/index.html
Web: http://trixie.kosman.via.ayuda.com/~kevin/index.html
"There is a freedom lying beyond circumstance,
derived from the direct intuition that life can
be grounded upon its absorption in what is
changeless amid change"
-- Alfred North Whitehead
> As programmers, we naturally want to throw things into databases for
> three reasons. First, it's easy to get data in. Second, it's easy to get
> relevant data out. And third, it's "cool". We don't want to work with
> flat files, now do we? ; )
Kiddin', eh? :) Actually, the third reason seems to dominate the younger
developers' minds. People often tend to keep everything in a poor DBMS until
it begins to kick back. And this has an impact on the customers. Does your
system use a database? No, why should it? You mean you'll keep our dearly
beloved banner ads as flat files? Yes, this is where they belong. Sorry,
we'll look for someone more advanced. Good luck.
Of course, hardware vendors jump for joy :) Maybe I don't get it, but
IMHO there's no reason to put into the DB something that can't be indexed and
used in a WHERE clause.
> It would *not* allow the database to do joins, it would grab entire tables,
> then try to do the joins itself, in Perl.
Umh.... Yeah.... Well.... To keep compatibility with other Open Source
Databases and ESR/RMS, you know :)
> (Incidentally, we've toyed around with developing a query-caching
> system that would sit between PostgreSQL and our DB libraries.
Sounds amazing, but requires some research, I guess. However, in many
cases one would be more than happy with cached connections. Of course,
cached query results can be naturally added to that, but just connections
are OK to start with. Security....
--
contaminated fish and microchips
huge supertankers on Arabian trips
oily propaganda from the leaders' lips
all about the future
there's people over here, people over there
everybody's looking for a little more air
crossing all the borders just to take their share
planning for the future
Rainbow, Difficult to Cure
> > (Incidentally, we've toyed around with developing a query-caching
> > system that would sit between PostgreSQL and our DB libraries.
> Sounds amazing, but requires some research, I guess. However, in many
> cases one would be more than happy with cached connections. Of course,
> cached query results can be naturally added to that, but just connections
> are OK to start with. Security....
To me, it doesn't sound like it would be that difficult of a project, at
least not for the likes of the PostgreSQL developers. It also doesn't seem
like it would really introduce any security problems, not if it were done
inside of PostgreSQL. Long ago, I got sidetracked from my endeavors in C,
and so I don't feel that I'm qualified to do it. (Otherwise, I would have
done it already. : ) ) If you wanted it done in Perl or Object Pascal, I
could help. : )
Here's a simple design that I was tossing back and forth. Please
understand that I'm not saying this is the best way to do it, or even a good
way to do it. Just a possible way to do it. I haven't been able to give it
as much thought as I would like to. Here goes.
------------
Implementation
Upon starting, the PostgreSQL engine could allocate a chunk of memory,
sized according to the administrator's desire. That chunk would be used
solely for query caching.
When a query came in that was not cached (say, the first query), the
database engine would process it as normal. It would then return it to the
user, and add it to the cache. "Adding it to the cache" would mean that it
would enter the query itself, the result set, and a list of which tables the
query relied upon. The query that is stored could be either the query
coming from the user, or the query after it goes through the optimizer.
Each has pros and cons, I would probably favor using the query that comes
from the user.
When another query comes along, the caching engine would quickly look
in the hash table, and see if it already had the cached results of the
query. If so, it returns them, and wham. You've just avoided all of the
work of optimizing, parsing, and executing, not to mention the disk I/O. A
hash lookup seems extremely cheap compared to the work of actually
processing a query.
When an update/delete/insert comes along, the engine would analyze
which tables were affected, and clear the cache entries that relied upon
those tables.
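The three steps above (cache on miss, hash lookup on hit, invalidate on write) can be sketched as follows. This is an illustration of the proposed design, not actual or proposed PostgreSQL internals, and the class and method names are made up:

```python
class QueryCacheWithInvalidation:
    """Cache query results; invalidate entries when their tables change."""
    def __init__(self):
        self.results = {}    # query text -> result rows
        self.tables_of = {}  # query text -> set of tables the query reads

    def lookup(self, query):
        # Hash lookup: cheap compared to parsing, planning, and executing.
        return self.results.get(query)

    def add(self, query, rows, tables):
        self.results[query] = rows
        self.tables_of[query] = set(tables)

    def invalidate(self, changed_table):
        # An insert/update/delete touched changed_table: drop dependent entries.
        stale = [q for q, tabs in self.tables_of.items() if changed_table in tabs]
        for q in stale:
            del self.results[q]
            del self.tables_of[q]
        return len(stale)

cache = QueryCacheWithInvalidation()
cache.add("SELECT * FROM users", [("alice",)], tables=["users"])
cache.add("SELECT * FROM logs", [("hit",)], tables=["logs"])

hit = cache.lookup("SELECT * FROM users")   # cached
dropped = cache.invalidate("users")         # an UPDATE touched 'users'
miss = cache.lookup("SELECT * FROM users")  # gone; must re-execute
```

Note that entries depending on untouched tables survive the invalidation, which is what makes the scheme better than flushing everything on every write.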
-----------------
Cache Clearing
Cache clearing would be achieved via an LRU-based algorithm, which
would also take into account the amount of RAM used by each query in the
cache.
-----------------
Performance Impact
The potential performance differences range from a minuscule decrease to
a tremendous increase. And it's a lot cheaper to throw an extra half gig of
RAM in a machine than to upgrade processors and disk subsystems!
------------------
Possible Changes
One potential drawback is that when a table is modified, the queries
that rely upon it would be discarded. Where a table is updated frequently,
that could greatly reduce the performance benefit. One possible alternative
is to store the query cost with each query in the cache. When a table is
updated, those queries are marked as "dirty". If the system load is below a
certain amount, or the system has been idle, it could then re-execute those
queries and update the cache. Which queries it re-executed would be
determined by query cost and how frequently those cache entries
were used.
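The dirty-marking alternative might look like this sketch; the cost-times-usage priority is one possible weighting, not a tested formula, and all names are illustrative:

```python
class RefreshingCache:
    """Mark entries dirty instead of dropping them; refresh the costly,
    frequently-used ones first when the system is idle."""
    def __init__(self):
        self.entries = {}  # query -> dict(rows, tables, cost, uses, dirty)

    def add(self, query, rows, tables, cost):
        self.entries[query] = dict(rows=rows, tables=set(tables),
                                   cost=cost, uses=0, dirty=False)

    def lookup(self, query):
        e = self.entries.get(query)
        if e is None or e["dirty"]:
            return None            # dirty entries behave like cache misses
        e["uses"] += 1
        return e["rows"]

    def mark_dirty(self, changed_table):
        for e in self.entries.values():
            if changed_table in e["tables"]:
                e["dirty"] = True

    def refresh_order(self):
        # During idle time, re-execute costly, popular queries first.
        dirty = [(q, e) for q, e in self.entries.items() if e["dirty"]]
        dirty.sort(key=lambda qe: qe[1]["cost"] * (1 + qe[1]["uses"]),
                   reverse=True)
        return [q for q, _ in dirty]

cache = RefreshingCache()
cache.add("Q_cheap", [1], ["t"], cost=1)
cache.add("Q_costly", [2], ["t"], cost=100)
cache.lookup("Q_costly")      # bump its use count
cache.mark_dirty("t")         # a write touched table 't'
order = cache.refresh_order()
```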
-------------------
The reason I would prefer it done in the PostgreSQL engine (as opposed to
in a middleware application) is that the caching engine needs to know (a)
which tables a query relies upon, and (b) which tables get changed. It
seems that it would significantly reduce overhead to do those inside of
PostgreSQL (which is already doing the query parsing and analysis).
This could certainly give PostgreSQL a huge advantage over other
database systems, too. It could save administrators a very large chunk of
cash that they would otherwise have to spend on large systems. And it would
just be cool. ; )
steve
> Whenever a query is executed (not found in cache, etc.), the caching
> system would simply store the query, the results, and a list of tables
> queried. When a new query came in, it would do a quick lookup in the query
> hash to see if it already had the results. If so, whammo. Whenever an
> insert/delete/update was sensed, it would look at the tables being affected,
> and the caching mechanism would clear out the entries depending on those
> tables.
It seems to me that tracking the list of cached queries and watching for
queries that might invalidate them adds a lot of complexity to the back end,
and the front end still has to establish the connection and wait to transfer
the data over the socket.
On a more practical level, a backend solution would require someone with
fairly detailed knowledge of the internals of the backend. A front-end
solution is more likely to be implementable by someone not as knowledgeable.
One of the big advantages of your technique is that there is no code change at
the application level. This means less database lock-in. Maybe that is a
disadvantage too. ;-)
> It seems to me that tracking the list of cached queries and watching for
> queries that might invalidate them adds a lot of complexity to the back end,
> and the front end still has to establish the connection and wait to transfer
> the data over the socket.
I really don't think that it would. Checking to see if you have a query
(a hash lookup) is very, very cheap relative to normally processing a query,
I would think.
And invalidating cache entries would also be very, very cheap compared to
the normal activity of the database. Assuming hashes are done correctly, it
would probably be done much, much faster than any query could execute. If
software caches can increase the performance of disk drives that have
latencies in thousandths of seconds, I'm sure they could help with queries
that take hundredths or tenths of seconds. ; )
> On a more practical level, a backend solution would require someone with
> fairly detailed knowledge of the internals of the backend. A front-end
> solution is more likely to be implementable by someone not as knowledgeable.
Yeah. I was hoping that one of the developers would say "oooh... that
would rock. We should do that." : )
> One of the big advantages of your technique is that there is no code change at
> the application level. This means less database lock-in. Maybe that is a
> disadvantage too. ;-)
I'm sure that someone with a better understanding of the theory associated
with cache invalidation would design a better algorithm than I would, but it
seems that even a fairly rudimentary implementation would seriously increase
performance.
steve
> Here's a simple design that I was tossing back and forth. Please
> understand that I'm not saying this is the best way to do it, or even a
> good way to do it. Just a possible way to do it.
Sounds interesting, I certainly have reasons to play bad guy, but that's
what I always do, so nevermind :)
However, there's one major point where I disagree. Not that I have real
reasons to, or observation or analysis to back my position, just a
feeling. And the feeling is that the connection/query cache should be separate
from the DBMS server itself.
Several things come to mind right off, like possibilities to cache
connections to different sources, like PGSQL and Oracle, as well as a
chance to run this cache on a separate box that will perform various
additional functions, like load balancing. But that's right on the surface.
Still in doubt....
--
> Sounds interesting, I certainly have reasons to play bad guy, but that's
> what I always do, so nevermind :)
That's OK. Somebody has to be a realist. : )
> However, there's one major point where I disagree. Not that I have real
> reasons to, or observation or analysis to back my position, just a
> feeling. And the feeling is that the connection/query cache should be
> separate from the DBMS server itself.
> Several things come to mind right off, like possibilities to cache
> connections to different sources, like PGSQL and Oracle,
That would be a benefit if you're running multiple DBMS' in the house -
and you're certainly welcome to do something like that as a standalone
package. ; ) I think it would be terrific if PostgreSQL could have the
feature added to it, which would (a) give it a big performance benefit, (b)
let it take advantage of already-written code, and (c) make one less machine
and service to administer.
> as well as a chance to run this cache on a separate box that will perform
> various additional functions, like load balancing. But that's right on the
> surface.
> Still in doubt....
Yes, load balancing would be another good factor. However, to my (very
limited) knowledge, there aren't any truly good ways of splitting up
database work. If you're doing nothing but selects, it would be easy. But
when updates come around, it gets hairier - and when you try for
dynamic continuity-checking and database rebuilding, it gets very ugly. If
there are any systems that get around those without huge performance hits,
I'd love to hear about it.
(Of course, if you have lots of money, a Beowulf-style cluster with
high-bandwidth, low-latency interconnects becomes desirable. But that's a
different ballgame.)
However, there is one other possibility: With caching, your servers
might see enough of a performance increase that you wouldn't need to
load-balance them. : )
steve
* Steve Wolfe <steve@iboats.com> [001031 13:47] wrote:
> > > (Incidentally, we've toyed around with developing a query-caching
> > > system that would sit between PostgreSQL and our DB libraries.
> > Sounds amazing, but requires some research, I guess. However, in many
> > cases one would be more than happy with cached connections. Of course,
> > cached query results can be naturally added to that, but just connections
> > are OK to start with. Security....
> To me, it doesn't sound like it would be that difficult of a project, at
> least not for the likes of the PostgreSQL developers. It also doesn't seem
> like it would really introduce any security problems, not if it were done
> inside of PostgreSQL. Long ago, I got sidetracked from my endeavors in C,
> and so I don't feel that I'm qualified to do it. (Otherwise, I would have
> done it already. : ) ) If you wanted it done in Perl or Object Pascal, I
> could help. : )
> Here's a simple design that I was tossing back and forth. Please
> understand that I'm not saying this is the best way to do it, or even a good
> way to do it. Just a possible way to do it. I haven't been able to give it
> as much thought as I would like to. Here goes.
> ------------
> Implementation
> [snip]
Karel Zak <zakkr@zf.jcu.cz> implemented stored procedures for
PostgreSQL, but they still haven't been integrated.
You can find his second attempt to get a response from the developers
here:
http://people.freebsd.org/~alfred/karel-pgsql.txt
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
> I think this feature deserves to be put on the TODO list under exotic
> features.
Well, it's kinda implemented already, I believe, with decades of being run
unattended :)
> This feature would probably also be a threat to MySQL dominance in the
> web scripting area for websites with medium to high traffic ;)
Dominance? Who needs it, anyway?
--
KuroiNeko wrote:
> > Here's a simple design that I was tossing back and forth. Please
> > understand that I'm not saying this is the best way to do it, or even a
> > good way to do it. Just a possible way to do it.
> Sounds interesting, I certainly have reasons to play bad guy, but that's
> what I always do, so nevermind :)
I think this feature deserves to be put on the TODO list under exotic
features.
This feature would probably also be a threat to MySQL dominance in the
web scripting area for websites with medium to high traffic ;)
Poul L. Christiansen
a) Don't log to a database. Log data should be sent into a process
that collects any needed on-the-fly statistics and then outputs
into disk files (rotating hourly or daily depending on your needs).
This model is becoming pretty standard with Apache now; look at
rotatelogs in the Apache distribution for an example.
b) Number of records isn't really the issue. Query complexity and
number of queries are more pertinent. Generally, for example, a
single SELECT that pulls in multiple rows is much faster than
a bunch of small SELECTs.
c) For very high traffic, you are going to have multiple front-end
servers. If you design the system carefully, you can have a single
shared network disk used by all of your front ends, then just stack
boxes in front of it. This doesn't give you endless scalability, though;
at some point you'll saturate your network file server and/or database box.
d) PHP may not be a great choice. It doesn't provide a lot of hooks
for effective caching of database connections and/or results.
mod_perl or Java servlets may be better, depending on the details.
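Point (a) above can be sketched as a log writer that appends to hourly files rather than an SQL table (a minimal Python sketch; the file-naming scheme and function name are just illustrations):

```python
import os, tempfile, time

def log_hit(line, logdir, when=None):
    """Append one log line to an hourly file instead of an SQL table."""
    when = time.time() if when is None else when
    name = time.strftime("access-%Y%m%d-%H.log", time.gmtime(when))
    path = os.path.join(logdir, name)
    with open(path, "a") as f:
        f.write(line + "\n")
    return path

logdir = tempfile.mkdtemp()
t0 = 0  # fixed epoch timestamps for illustration
p1 = log_hit("GET /index.html", logdir, when=t0)
p2 = log_hit("GET /about.html", logdir, when=t0)
p3 = log_hit("GET /index.html", logdir, when=t0 + 3600)  # next hour, new file
```

Appending a line is far cheaper than an SQL insert, and the hourly files can be batch-loaded or summarized offline.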
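Point (b)'s advice, one SELECT pulling multiple rows instead of a bunch of small SELECTs, can be illustrated like this (Python's built-in sqlite3 as a stand-in for PostgreSQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, "user%d" % i) for i in range(100)])

wanted = [3, 7, 11, 42]

# Many small SELECTs: one round trip and one query per row.
rows_slow = [conn.execute("SELECT name FROM users WHERE id = ?",
                          (i,)).fetchone()[0]
             for i in wanted]

# One SELECT pulling all the rows at once.
placeholders = ",".join("?" * len(wanted))
rows_fast = [r[0] for r in conn.execute(
    "SELECT name FROM users WHERE id IN (%s) ORDER BY id" % placeholders,
    wanted)]
```

Both produce the same rows, but the batched form pays the per-query overhead once instead of once per row.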
- Tim Kientzle
> d) PHP may not be a great choice. It doesn't provide a lot of hooks
> for effective caching of database connections and/or results.
> mod_perl or Java servlets may be better, depending on the details.
One of our competitors spent a very, very large amount of money on high-end
Sun equipment so that they could write their CGI stuff in Java servlets.
It still ran slow. We run Perl on machines that pale compared to theirs,
and get far better performance. : )
steve
> We run Perl on machines that pale compared to theirs,
> and get far better performance. : )
Well, don't get me wrong, I'm not going to a war here :) But CGI is so
simple and straightforward that anything more than C is quite an overkill
(think assembly).
Myself, I'm planning to port all my Perl stuff eventually. Yes, Perl is
great for string handling, but when you spend a couple of weeks on BugTraq,
you'll suddenly feel that it's still too much. When you only let `known
good' values in, lex or regexp libs will do.
Sorry for the offtopic; anyone interested is welcome to email me in
private.
Ed
--