eWeek Poll: Which database is most critical to your organization?
Good Day All,
eWeek has posted a poll that asks which database server is most critical
to your organization.
The poll tracks the following databases:
- IBM DB2
- Informix
- Microsoft SQL Server
- MySQL
- Ingres
- Oracle
- PostgreSQL
- Sybase Adaptive Server Enterprise
- Sybase SQL Anywhere
- any object database
It is interesting to note that Oracle has shown a rather large jump
between yesterday and today - about 7% IIRC - a sign that the sample
size of the poll should become larger. :)
If you would care to vote, visit:
http://www.eweek.com/article/0%2c3658%2cs=708&a=23115%2c00.asp
Note: This message is being delivered to the Perl DBI, PHP general,
PostgreSQL Advocacy, MySQL general and Ruby general mailing lists.
However, to prevent possible and irritating cross-posting, I am
delivering each message individually. :)
Ciao!
--zak
--
__ ___ ___ ____ __
/ |/ /_ __/ __/ __ \/ / Zak Greant <zak@mysql.com>
/ /|_/ / // /\ \/ /_/ / /__ MySQL AB, Advocate
/_/ /_/\_, /___/\___\_\___/ Calgary, Canada
<___/ www.mysql.com 403.244.7213
On Tue, 2002-02-26 at 15:30, Zak Greant wrote:
> Good Day All,
>
> eWeek has posted a poll that asks which database server is most critical
> to your organization.
The article mentions a MySQL feature which apparently improved
performance considerably:
//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature. If the text of an incoming
query has a byte-for-byte match with a cached query, MySQL can retrieve
the results directly from the cache without compiling the query, getting
locks or doing index accesses. This query caching will be effective only
for tables with few updates, because any table update clears the cache
to guarantee that correct results are always returned.
//
My guess is that it would be relatively simple to implement. Any
comments on this?
If I implemented this, any chance this would make it into the tree? Of
course, it would be:
- disabled by default
- enabled on a table-by-table basis (maybe an ALTER TABLE command)
Cheers,
Neil
--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:04 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your
> On Tue, 2002-02-26 at 15:30, Zak Greant wrote:
> > Good Day All,
> >
> > eWeek has posted a poll that asks which database server is most critical
> > to your organization.
>
> The article mentions a MySQL feature which apparently improved
> performance considerably:
>
> //
> MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
> no other database we tested had this feature. If the text of an incoming
> query has a byte-for-byte match with a cached query, MySQL can retrieve
> the results directly from the cache without compiling the query, getting
> locks or doing index accesses. This query caching will be effective only
> for tables with few updates, because any table update clears the cache
> to guarantee that correct results are always returned.
> //
>
> My guess is that it would be relatively simple to implement. Any
> comments on this?
>
> If I implemented this, any chance this would make it into the tree? Of
> course, it would be:
>
>  - disabled by default
>  - enabled on a table-by-table basis (maybe an ALTER TABLE command)
I don't see how it will do any good. There is no "prepare" in
PostgreSQL and therefore you will simply be re-executing the queries
every time anyway. Also, parameter markers only work in embedded SQL,
and that is a single-tasking system.
I think it would be a major piece of work to do anything useful along
those lines.
If you look at how DB/2 works, you will see that they store prepared
statements. Another alternative would be to keep some point in the
parser marked and somehow jump to that point, but you would have to
be able to save a parse tree somewhere and also recognize the query.
Here is where problems come in...
-- Someone wants blue and blue-green, etc shirts that are backordered
SELECT shirt, color, backorder_qty FROM garments WHERE color like "BLUE%"
Now, another query comes along:
-- Someone else wants reddish, etc shirts that are backordered:
SELECT shirt, color, backorder_qty FROM garments WHERE color like "RED%"
It's the same query with different data. Without parameter markers you
will never know it. And yet this is exactly the sort of caching that is
useful.
On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:
> I don't see how it will do any good. There is no "prepare" in
> PostgreSQL and therefore you will simply be re-executing the queries
> every time anyway. Also, parameter markers only work in embedded SQL,
> and that is a single-tasking system.
Perhaps I wasn't clear. The feature I'm proposing is this:
When processing SELECT queries, but before any real work has been
done, look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could not immediately cache the
query, just increment a "frequency" counter stored in the hash table. If
the counter goes above a certain constant, we decide that the query is
worth caching, so we cache the full result in the hash table.

When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this cache.
If so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.
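[Editorial aside: the two-step scheme described above can be sketched in a
few lines. This is a toy model -- the class and names below are invented
for illustration, not taken from any patch; the real work would live
inside the backend, keyed on raw query text.]

```python
# Toy sketch of the proposed result cache; all names are hypothetical.
class QueryResultCache:
    def __init__(self, threshold=2):
        self.results = {}   # query text -> cached result rows
        self.counts = {}    # query text -> times the query was seen
        self.tables = {}    # query text -> tables the query reads
        self.threshold = threshold  # cache only after N sightings

    def on_select(self, query, tables, execute):
        if query in self.results:
            return self.results[query]      # hit: skip parse/plan/execute
        rows = execute()                    # miss: do the real work
        self.counts[query] = self.counts.get(query, 0) + 1
        if self.counts[query] >= self.threshold:
            self.results[query] = rows      # frequent enough to cache
            self.tables[query] = set(tables)
        return rows

    def on_write(self, table):
        # Any INSERT/UPDATE/DELETE on a cached table flushes its entries,
        # so a stale result is never returned.
        stale = [q for q, ts in self.tables.items() if table in ts]
        for q in stale:
            self.results.pop(q, None)
            self.tables.pop(q, None)

cache = QueryResultCache(threshold=2)
runs = []
fetch = lambda: runs.append(1) or [("shirt", "blue", 3)]
q = "SELECT shirt, color, backorder_qty FROM garments"
cache.on_select(q, ["garments"], fetch)   # executed, counted
cache.on_select(q, ["garments"], fetch)   # executed, now cached
cache.on_select(q, ["garments"], fetch)   # served from the cache
assert len(runs) == 2
cache.on_write("garments")                # table changed: flush
cache.on_select(q, ["garments"], fetch)   # executed again
assert len(runs) == 3
```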
HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.
Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.
Cheers,
Neil
--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:42 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your
> On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:
> > I don't see how it will do any good. There is no "prepare" in
> > PostgreSQL and therefore you will simply be re-executing the queries
> > every time anyway. Also, parameter markers only work in embedded SQL,
> > and that is a single-tasking system.
>
> Perhaps I wasn't clear. The feature I'm proposing is this:
>
> When processing SELECT queries, but before any real work has been
> done, look up the query in a hash table. If it already exists, return the
> cached result. If it doesn't exist, execute the query and cache the
> result in the hash table. Optionally, we could not immediately cache the
> query, just increment a "frequency" counter stored in the hash table. If
> the counter goes above a certain constant, we decide that the query is
> worth caching, so we cache the full result in the hash table.
Statistical tools are a good idea, because they can tell us where
indexes should be added. However, you cannot simply return the result
of the previous query, because the contents may have changed since the
last time it was executed. It is simply invalid to do that. If some
other system is doing that, then it isn't a relational database.
> When processing INSERTs, UPDATEs and DELETEs, check if the query
> would affect any of the tables for which we are maintaining this cache.
> If so, flush the cache. This ensures that we will never return invalid
> results. We could perhaps be fancy and keep stats on which columns our
> cached queries utilize and which columns the modifying query will
> affect, but that is unlikely to be an overall win.
How do you know whether or not someone has affected the row that you
are reading? If you do not know, then every single update, insert or
delete will mean that you have to refresh. And not only that, you will
also have to track it. For sure, it will make the whole system run
more slowly rather than faster.
> HOWEVER -- I don't see this feature as something that will appeal to,
> say, 75% of PgSQL users. If the table in question is being modified on a
> regular basis, or if a wide variety of queries are being issued, this
> cache isn't a good idea. Nevertheless, I think there are certainly some
> situations in which this cache is useful -- and furthermore, these kinds
> of "mostly read-only" situations are often where MySQL is chosen over
> PostgreSQL.
>
> Anyway, just putting this on the table -- if the consensus is that this
> isn't a very worthwhile feature, I won't bother with it.
Very likely, it is only my limited understanding not really grasping
what it is that you are trying to do. Even so, I don't think it really
helps even for read only queries, unless it is exactly the same query
with the same parameter markers and everything that was issued before.
That is very unusual. Normally, you won't have the entire query
hard-wired, but will allow the customer to do some sort of filtering of
the data.
I do (very much) like your idea to gather statistics. Again, this must
have the complete query plan parsed so that you know what the parameter
markers are. Otherwise, there will be completely different queries:
SELECT * FROM employee WHERE ssn = '777-99-1234'
SELECT * FROM employee WHERE ssn = '999-77-1234'
SELECT * FROM employee WHERE ssn = '797-97-1234'
SELECT * FROM employee WHERE ssn = '979-79-1234'
SELECT * FROM employee WHERE ssn = '779-99-1234'
when really only one query exists:
SELECT * FROM employee WHERE ssn = ?
If you do not parameterize the queries, your statistics will be
misleading at best and outright lies at worst. For example, suppose that
the above query appears 300,000 times, but with different SSN's. And
then someone runs this twice:

SELECT * FROM reports WHERE type = 'q'

The query that ran twice will be the most frequent, despite the fact
that the really frequent query happens 150,000 times as often.
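[Editorial aside: the point above can be made concrete with a tiny
text-level normalizer that masks literals before counting. As noted, a
real implementation would work on the parse tree; the regexes and names
below are invented for illustration.]

```python
import re

# Mask literal values so queries differing only in data collapse together.
_STRING = re.compile(r"'(?:[^']|'')*'")     # SQL string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")  # bare numeric literals

def normalize(query: str) -> str:
    """Replace literals with '?' parameter markers and tidy whitespace."""
    query = _STRING.sub("?", query)
    query = _NUMBER.sub("?", query)
    return " ".join(query.split())

seen = {}
for ssn in ("777-99-1234", "999-77-1234", "797-97-1234"):
    key = normalize(f"SELECT * FROM employee WHERE ssn = '{ssn}'")
    seen[key] = seen.get(key, 0) + 1

# All three collapse to a single parameterized statement.
assert seen == {"SELECT * FROM employee WHERE ssn = ?": 3}
```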
On 26 Feb 2002 13:30:41 MST, the world broke into rejoicing as
Zak Greant <zak@mysql.com> said:
> It is interesting to note that Oracle has shown a rather large jump
> between yesterday and today - about 7% IIRC - a sign that the sample
> size of the poll should become larger. :)
PostgreSQL has, between this morning and this evening, shown a jump
from around 7% to around 18%, so evidently the sample size has grown a
mite...
--
(reverse (concatenate 'string "gro.gultn@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/emacs.html
"... the open research model is justified. There is a passage in the
Bible (John 8:32, and on a plaque in CIA HQ), "And ye shall know the
truth, and the truth shall set ye free." -- Dave Dittrich
Dann Corbit wrote:
> -----Original Message-----
> From: Neil Conway [mailto:nconway@klamath.dyndns.org]
> Sent: Tuesday, February 26, 2002 3:04 PM
> To: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your
>
> [...]
>
> It's the same query with different data. Without parameter markers you
> will never know it. And yet this is exactly the sort of caching that is
> useful.

However, an exact match is still not a bad idea. You might have a
different execution plan depending on the statistics of the data in your
column.

If there were a way to store the execution plan and start executing from
there it still might not be a bad idea.

> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
If you were to expire the cache when the table was updated it wouldn't
be so bad, and you certainly don't want to cache everything under the
assumption something may come along -- as it probably won't. But if
you're interested in speeding this up I can't see any harm in adding a
stat into statistics that stores query frequency. If the frequency
goes above a certain notch (without the results changing -- static
table data) then cache results.
However, I think 99% of these cases could be easily rectified with a
php interface which caches the results into memory at the users
request as I really don't see hundreds of machines making the same
queries frequently -- rather 1 or 2 poorly written ones :)
Then again, I regularly hit one row database tables which act as
markers for data changes to see if the program should reload the
entire data set. (Parallel machines for interface purposes cache
table data in RAM normally, but test for changes with every request --
the goal was that they shouldn't have to communicate with each other.)
But a one-row table is very quick to select from. Can this make it
faster? If so, it'll be useful.
Anyway, write it up. If you can speed up some without making a hit
against others it'll be a large advantage.
--
Rod Taylor
This message represents the official view of the voices in my head
----- Original Message -----
From: "Neil Conway" <nconway@klamath.dyndns.org>
To: "Dann Corbit" <DCorbit@connx.com>
Cc: <pgsql-hackers@postgresql.org>
Sent: Tuesday, February 26, 2002 6:42 PM
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your
> On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:
> > I don't see how it will do any good. There is no "prepare" in
> > PostgreSQL and therefore you will simply be re-executing the queries
> > every time anyway. Also, parameter markers only work in embedded SQL,
> > and that is a single-tasking system.
>
> Perhaps I wasn't clear. The feature I'm proposing is this:
>
> When processing SELECT queries, but before any real work has been
> done, look up the query in a hash table. If it already exists, return
> the cached result. If it doesn't exist, execute the query and cache the
> result in the hash table. Optionally, we could not immediately cache
> the query, just increment a "frequency" counter stored in the hash
> table. If the counter goes above a certain constant, we decide that the
> query is worth caching, so we cache the full result in the hash table.
>
> When processing INSERTs, UPDATEs and DELETEs, check if the query
> would affect any of the tables for which we are maintaining this
> cache. If so, flush the cache. This ensures that we will never return
> invalid results. We could perhaps be fancy and keep stats on which
> columns our cached queries utilize and which columns the modifying
> query will affect, but that is unlikely to be an overall win.
>
> HOWEVER -- I don't see this feature as something that will appeal to,
> say, 75% of PgSQL users. If the table in question is being modified
> on a regular basis, or if a wide variety of queries are being issued,
> this cache isn't a good idea. Nevertheless, I think there are certainly
> some situations in which this cache is useful -- and furthermore, these
> kinds of "mostly read-only" situations are often where MySQL is chosen
> over PostgreSQL.
>
> Anyway, just putting this on the table -- if the consensus is that
> this isn't a very worthwhile feature, I won't bother with it.
>
> Cheers,
>
> Neil
> --
> Neil Conway <neilconway@rogers.com>
> PGP Key ID: DB3C29FC
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:
> Statistical tools are a good idea, because they can tell us where
> indexes should be added. However, you cannot simply return the result
> of the previous query, because the contents may have changed since the
> last time it was executed. It is simply invalid to do that. If some
> other system is doing that, then it isn't a relational database.
No -- as I said, any inserts, updates or deletes that affect the table
in question will cause a full cache flush.
> How do you know whether or not someone has affected the row that you
> are reading? If you do not know, then every single update, insert or
> delete will mean that you have to refresh.
Yes, that is true.
> And not only that, you will
> also have to track it. For sure, it will make the whole system run
> more slowly rather than faster.
I don't think tracking changes imposes a lot of overhead -- it is
relatively simple to determine if a query affects a given table.
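[Editorial aside: a crude illustration of how cheap that determination
can be. The helper below is hypothetical and text-based; a backend would
read the target table off the parse tree instead of regexing the text.]

```python
import re

# Hypothetical helper: find the table a simple write statement targets.
_WRITE = re.compile(
    r"^\s*(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM)\s+(\w+)", re.IGNORECASE)

def written_table(stmt: str):
    """Return the table an INSERT/UPDATE/DELETE touches, else None."""
    m = _WRITE.match(stmt)
    return m.group(1).lower() if m else None

assert written_table("INSERT INTO garments VALUES ('shirt','blue',3)") == "garments"
assert written_table("DELETE FROM garments WHERE color = 'red'") == "garments"
assert written_table("SELECT * FROM garments") is None
```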
> Very likely, it is only my limited understanding not really grasping
> what it is that you are trying to do. Even so, I don't think it really
> helps even for read-only queries, unless it is exactly the same query
> with the same parameter markers and everything that was issued before.
> That is very unusual. Normally, you won't have the entire query
> hard-wired, but will allow the customer to do some sort of filtering
> of the data.
Hmmm... the more I think about it, the more unusual it would be for
_exactly_ the same query to be repeated a lot. However, the article
reported a significant performance gain when this feature was enabled.
That could mean that:
(a) the performance measurements/benchmarks used by the article were
synthetic and don't reflect real database applications
(b) the feature MySQL implements is different than the one I am
describing
When I get a chance I'll investigate further the technique used by MySQL
to see if (b) is the case. However, it is beginning to look like this
isn't a good idea, overall.
Cheers,
Neil
--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
[snip]
> > Very likely, it is only my limited understanding not really grasping
> > what it is that you are trying to do. Even so, I don't think it really
> > helps even for read-only queries, unless it is exactly the same query
> > with the same parameter markers and everything that was issued before.
> > That is very unusual. Normally, you won't have the entire query
> > hard-wired, but will allow the customer to do some sort of filtering
> > of the data.
>
> Hmmm... the more I think about it, the more unusual it would be for
> _exactly_ the same query to be repeated a lot. However, the article
> reported a significant performance gain when this feature was enabled.
> That could mean that:
>
> (a) the performance measurements/benchmarks used by the article were
> synthetic and don't reflect real database applications
>
> (b) the feature MySQL implements is different than the one I am
> describing
>
> When I get a chance I'll investigate further the technique used by MySQL
> to see if (b) is the case. However, it is beginning to look like this
> isn't a good idea, overall.
I did not read the article at all, but I am familiar with query cache
and in fact, I do it a lot (I work for a database company). Here is
how the algorithm works:
You intercept every incoming query and parse it. Any physical data
gets replaced with parameter markers. A 64 bit hash is formed from the
parsed query with the parameter markers removed. The hash is used as
an index into a skiplist which also stores the original query. After
all, if a client has a million-dollar request, he won't be happy if the
unbelievably rare thing happens and the checksums of two different
queries agree.
You can add a counter to the data in the skiplist so that you know how
often the query happens. The parsed query will only be useful to a
system that can save time from having a query prepared (most systems
call it preparing the query). I was kind of surprised to see that
PostgreSQL does not have a prepare stage in libpq. This can be a
very large speedup in query execution (for obvious reasons).
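[Editorial aside: the hash-plus-original-text scheme described above might
look roughly like this. The dict stands in for the skiplist, the choice of
SHA-1 is arbitrary, and all names are invented for illustration.]

```python
import hashlib

def hash64(text: str) -> int:
    # Form a 64-bit hash from the normalized query text.
    return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8], "big")

cache = {}  # 64-bit hash -> list of [query text, hit count]

def note_query(normalized: str) -> int:
    """Record one occurrence; return how often this query has been seen."""
    bucket = cache.setdefault(hash64(normalized), [])
    for entry in bucket:
        if entry[0] == normalized:  # byte-for-byte check defeats collisions
            entry[1] += 1
            return entry[1]
    bucket.append([normalized, 1])
    return 1

assert note_query("SELECT * FROM employee WHERE ssn = ?") == 1
assert note_query("SELECT * FROM employee WHERE ssn = ?") == 2
assert note_query("SELECT * FROM reports WHERE type = ?") == 1
```

Keeping the full query text alongside the hash is what makes the rare
collision harmless: two different queries that happen to hash alike still
compare unequal and get separate entries.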
On Tue, 2002-02-26 at 17:14, cbbrowne@acm.org wrote:
> On 26 Feb 2002 13:30:41 MST, the world broke into rejoicing as
> Zak Greant <zak@mysql.com> said:
> > It is interesting to note that Oracle has shown a rather large jump
> > between yesterday and today - about 7% IIRC - a sign that the sample
> > size of the poll should become larger. :)
>
> PostgreSQL has, between this morning and this evening, shown a jump
> from around 7% to around 18%, so evidently the sample size has grown a
> mite...
Heh. :) Nice to see that the PostgreSQL community rallied so well!
--zak
"There are three kinds of lies: Lies, Damn Lies, and Statistics."
--Benjamin Disraeli
Neil Conway wrote:
<snip>
Hi everyone,
This is getting me to think:
We don't have a really established process for PostgreSQL development
which allows for "experimental features". aka the Linux kernel
"EXPERIMENTAL", and so forth.
aka ./configure --experimental-querycache
--experimental-pertablecostings --experimental-something?
This way, people can still use PostgreSQL as per normal, but it also
allows for development of code which might or might not actually see the
light of day, depending on if it turns out to really be useful.
Things should only become "experimental features" after careful
consideration, not by default. There are situations where it
would be beneficial (perhaps in Neil's example here). If Neil were to
split off a project and do it on Sourceforge, then only some people
would check it out when kind of ready, etc, etc and it wouldn't really
be as available to a wider audience. But if it's part of the present
source tree, I think the experimental features would see a wider
audience and we'd get a better indication of what's good/bad/etc.
Don't know how to support those features, but pretty much developers and
hard-core-users should be the only ones using them and therefore should
be able to figure most of the stuff out for themselves (i.e. we'd only
get the real questions).
Does this seem like a worthwhile viewpoint to consider? Then perhaps we
can think about starting to implement it for 7.3? (Wish I could code
better). :)
Regards and best wishes,
Justin Clift
> Perhaps I wasn't clear. The feature I'm proposing is this:
>
> When processing SELECT queries, but before any real work has been
> done, look up the query in a hash table. If it already exists, return the
> cached result. If it doesn't exist, execute the query and cache the
> result in the hash table. Optionally, we could not immediately cache the
> query, just increment a "frequency" counter stored in the hash table. If
> the counter goes above a certain constant, we decide that the query is
> worth caching, so we cache the full result in the hash table.
>
> When processing INSERTs, UPDATEs and DELETEs, check if the query
> would affect any of the tables for which we are maintaining this cache.
> If so, flush the cache. This ensures that we will never return invalid
> results. We could perhaps be fancy and keep stats on which columns our
> cached queries utilize and which columns the modifying query will
> affect, but that is unlikely to be an overall win.
>
> HOWEVER -- I don't see this feature as something that will appeal to,
> say, 75% of PgSQL users. If the table in question is being modified on a
> regular basis, or if a wide variety of queries are being issued, this
> cache isn't a good idea. Nevertheless, I think there are certainly some
> situations in which this cache is useful -- and furthermore, these kinds
> of "mostly read-only" situations are often where MySQL is chosen over
> PostgreSQL.
>
> Anyway, just putting this on the table -- if the consensus is that this
> isn't a very worthwhile feature, I won't bother with it.
>
> Cheers,
>
> Neil
> --
> Neil Conway <neilconway@rogers.com>
> PGP Key ID: DB3C29FC
--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
On Wed, 2002-02-27 at 05:23, Neil Conway wrote:
> On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:
> > Statistical tools are a good idea, because they can tell us where
> > indexes should be added. However, you cannot simply return the result
> > of the previous query, because the contents may have changed since the
> > last time it was executed. It is simply invalid to do that. If some
> > other system is doing that, then it isn't a relational database.
>
> No -- as I said, any inserts, updates or deletes that affect the table
> in question will cause a full cache flush.
>
> > How do you know whether or not someone has affected the row that you
> > are reading? If you do not know, then every single update, insert or
> > delete will mean that you have to refresh.
>
> Yes, that is true.
>
> > And not only that, you will
> > also have to track it. For sure, it will make the whole system run
> > more slowly rather than faster.
>
> I don't think tracking changes imposes a lot of overhead -- it is
> relatively simple to determine if a query affects a given table.
Perhaps you can do it in a simple way for MySQL which has no
rules/triggers/foreign keys with ON DELETE CASCADE.
The only way I can think of is by putting some kind of rule or trigger
on the table affected.
> Hmmm... the more I think about it, the more unusual it would be for
> _exactly_ the same query to be repeated a lot. However, the article
> reported a significant performance gain when this feature was enabled.
> That could mean that:
>
> (a) the performance measurements/benchmarks used by the article were
> synthetic and don't reflect real database applications
I think that a Slashdot-type web application would probably benefit a
lot.
----------------
Hannu
Here is the documentation about MySQL's new Query Cache. I think that it
would be helpful, as they indicate, for dynamic web sites, such as Slashdot.
There are hundreds or maybe thousands of queries in between added comments
and there are probably only a few common combinations of
threshold/nesting/sort.
MySQL Query Cache
=================
From version 4.0.1, `MySQL server' features a `Query Cache'. When in
use, the query cache stores the text of a `SELECT' query together with
the corresponding result that is sent to a client. If another
identical query is received, the server can then retrieve the results
from the query cache rather than parsing and executing the same query
again.
The query cache is extremely useful in an environment where (some)
tables don't change very often and you have a lot of identical queries.
This is a typical situation for many web servers that use a lot of
dynamic content.
Following are some performance data for the query cache (We got these
by running the MySQL benchmark suite on a Linux Alpha 2x500 MHz with
2GB RAM and a 64MB query cache):
* If you want to disable the query cache code set
`query_cache_size=0'. By disabling the query cache code there is
no noticeable overhead.
* If all of the queries you're performing are simple (such as
selecting a row from a table with one row) but still differ so
that the queries cannot be cached, the overhead for having the
query cache active is 13%. This could be regarded as the
worst-case scenario. However, in real life, queries are much more
complicated than our simple example, so the overhead is normally
significantly lower.
* Searching for one row in a one-row table is 238% faster. This
can be regarded as close to the minimum speedup to be expected for
a query that is cached.
How The Query Cache Operates
----------------------------
Queries are compared before parsing, thus
SELECT * FROM TABLE
and
Select * from table
are regarded as different queries by the query cache, so queries need to
be exactly the same (byte for byte) to be seen as identical. In
addition, a query may be seen as different if, for instance, one client
is using a different communication protocol format or a different
character set than another client.

Queries that use different databases, different protocol versions, or
different default character sets are considered different queries and
are cached separately.
The cache does work for `SELECT SQL_CALC_FOUND_ROWS ...' and `SELECT
FOUND_ROWS() ...' type queries because the number of found rows is also
stored in the cache.
If a table changes (`INSERT', `UPDATE', `DELETE', `TRUNCATE', `ALTER'
or `DROP TABLE|DATABASE'), then all cached queries that used this table
(possibly through a `MRG_MyISAM' table!) become invalid and are removed
from the cache.
Currently all `InnoDB' tables are invalidated on `COMMIT', in the
future this will be changed so only tables changed in the transaction
cause the corresponding cache entries to be invalidated.
A query cannot be cached if it contains one of the following functions:

     User-Defined Functions, `CONNECTION_ID', `FOUND_ROWS', `GET_LOCK',
     `RELEASE_LOCK', `LOAD_FILE', `MASTER_POS_WAIT', `NOW', `SYSDATE',
     `CURRENT_TIMESTAMP', `CURDATE', `CURRENT_DATE', `CURTIME',
     `CURRENT_TIME', `DATABASE', `ENCRYPT' (with one parameter),
     `LAST_INSERT_ID', `RAND', `UNIX_TIMESTAMP' (without parameters),
     `USER', `BENCHMARK'
Nor can a query be cached if it contains user variables, if it is of
the form `SELECT ... IN SHARE MODE', or if it is of the form `SELECT *
FROM ... WHERE autoincrement_column IS NULL' (the last form is used to
retrieve the last insert id as an ODBC workaround).
However, `FOUND_ROWS()' will return the correct value even if the
preceding query was fetched from the cache.
Queries that don't use any tables are not cached, nor are queries for
which the user has only a column-level privilege on any of the
involved tables.
Before a query is fetched from the query cache, MySQL will check that
the user has SELECT privilege to all the involved databases and tables.
If this is not the case, the cached result will not be used.
Query Cache Configuration
-------------------------
The query cache adds a few MySQL system variables for `mysqld', which
may be set in a configuration file or on the command line when starting
`mysqld'.
* `query_cache_limit' Don't cache results that are bigger than this
(default 1M).
* `query_cache_size' The memory allocated to store results from old
queries. If this is 0, the query cache is disabled (the default).
* `query_cache_startup_type' This may be set (only numeric) to:
  *Option*  *Description*
  0         OFF, don't cache or retrieve results
  1         ON, cache all results except `SELECT SQL_NO_CACHE ...' queries
  2         DEMAND, cache only `SELECT SQL_CACHE ...' queries
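Putting the three variables together, a `my.cnf' fragment might look like the following (the values here are invented for illustration, not recommendations):

```ini
[mysqld]
# Illustrative values only:
query_cache_size=16M          # 0 disables the cache (the default)
query_cache_limit=1M          # don't cache results larger than this
query_cache_startup_type=1    # 1 = ON: cache all but SQL_NO_CACHE queries
```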
Inside a thread (connection), the behaviour of the query cache can be
changed from the default. The syntax is as follows:
  `SQL_QUERY_CACHE_TYPE = OFF | ON | DEMAND'
  `SQL_QUERY_CACHE_TYPE = 0 | 1 | 2'
  *Option*     *Description*
  0 or OFF     Don't cache or retrieve results.
  1 or ON      Cache all results except `SELECT SQL_NO_CACHE ...' queries.
  2 or DEMAND  Cache only `SELECT SQL_CACHE ...' queries.
By default `SQL_QUERY_CACHE_TYPE' depends on the value of
`query_cache_startup_type' when the thread was created.
Query Cache Options in `SELECT'
-------------------------------
There are two query-cache-related parameters that may be specified in
a `SELECT' query:
* `SQL_CACHE' If `SQL_QUERY_CACHE_TYPE' is `DEMAND', allow the query
to be cached. If `SQL_QUERY_CACHE_TYPE' is `ON', this is the default.
If `SQL_QUERY_CACHE_TYPE' is `OFF', do nothing.
* `SQL_NO_CACHE' Make this query non-cacheable; don't allow it to be
stored in the cache.
Query Cache Status and Maintenance
----------------------------------
With the `FLUSH QUERY CACHE' command you can defragment the query cache
to better utilise its memory. This command will not remove any queries
from the cache. `FLUSH TABLES' also flushes the query cache.
The `RESET QUERY CACHE' command removes all query results from the
query cache.
You can monitor query cache performance in `SHOW STATUS':
* `Qcache_queries_in_cache' Number of queries registered in the cache.
* `Qcache_inserts' Number of queries added to the cache.
* `Qcache_hits' Number of cache hits.
* `Qcache_not_cached' Number of non-cached queries (not cacheable, or
skipped due to `SQL_QUERY_CACHE_TYPE').
* `Qcache_free_memory' Amount of free memory for the query cache.
* `Qcache_total_blocks' Total number of blocks in the query cache.
* `Qcache_free_blocks' Number of free memory blocks in the query cache.
Total number of queries = `Qcache_inserts' + `Qcache_hits' +
`Qcache_not_cached'.
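That identity also gives a cache hit rate. A small sketch (the counter values here are invented for illustration):

```python
# Deriving total queries and hit rate from the SHOW STATUS counters
# described above.  The sample values are made up for illustration.
status = {
    "Qcache_hits": 1500,
    "Qcache_inserts": 400,
    "Qcache_not_cached": 100,
}

total_queries = (status["Qcache_hits"] + status["Qcache_inserts"]
                 + status["Qcache_not_cached"])
hit_rate = status["Qcache_hits"] / total_queries

assert total_queries == 2000
assert hit_rate == 0.75
```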
The query cache uses variable length blocks, so `Qcache_total_blocks'
and `Qcache_free_blocks' may indicate query cache memory fragmentation.
After `FLUSH QUERY CACHE' only a single (big) free block remains.
Note: Every query needs a minimum of two blocks (one for the query text
and one or more for the query results). Also, every table used by a
query needs one block, but if two or more queries use the same table,
only one block needs to be allocated.
On Wed, 2002-02-27 at 10:39, Tom Lane wrote:
Neil Conway <nconway@klamath.dyndns.org> writes:
When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this
cache. If so, flush the cache. This ensures that we will never return
invalid results.
Note that this would imply that the cache is *global* across all
backends; therefore it is a shared data structure and hence an access
bottleneck. (Not to mention a memory-management headache, since the
size of shared memory can't easily be varied on-the-fly.)
I think it would be enough if the hashes and bookkeeping info
(tmin, tmax, filename) were in a global table. We could also purge all
data more than a few minutes old. We would also need an inverse lookup
from changed table to cached query, for cache invalidation on
insert/update/delete.
The result could be even saved in temp files and be mostly faster than
doing the full parse/plan/execute, both for complex queries returning a
few rows (saves planning time) or many rows (saves execute time).
The format used for saving should be the exact wire protocol, so that
efficient system calls could be used where available (Linux 2.4+ has a
system call that will transfer a whole file to a socket in one call,
bypassing all copying and caching).
The lookup part will be pretty trivial - look up using the hash, check
tmin/tmax, and if OK, push the cached result out to the client. This
will make us as fast or faster than MySQL for the
trivial_benchmark/busy_website case.
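The lookup path described above might look roughly like this in miniature (a toy sketch of the proposal, not PostgreSQL code; tmin/tmax are stand-ins for whatever transaction-visibility bounds the bookkeeping table would record):

```python
# Sketch of the proposed lookup path: hash the incoming query text,
# check the cached entry's validity window, and return the stored
# wire-format result bytes if it is still valid.
cache = {}  # query text -> (tmin, tmax, result_bytes)

def lookup(query: str, current_xid: int):
    entry = cache.get(query)
    if entry is None:
        return None                  # miss: fall through to parse/plan
    tmin, tmax, result = entry
    if not (tmin <= current_xid < tmax):
        del cache[query]             # stale: invalidate, treat as miss
        return None
    return result                    # hit: push these bytes to client

cache["SELECT max(n) FROM t"] = (100, 200, b"wire data")
assert lookup("SELECT max(n) FROM t", 150) == b"wire data"
assert lookup("SELECT max(n) FROM t", 250) is None  # outside window
assert lookup("SELECT max(n) FROM t", 150) is None  # entry was dropped
```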
The cache creation/maintenance part will be much trickier -
When creating cache
* the tables affected can be determined only from fully built plans
because of possible rule expansions.
* if there is a trigger on select for this query it can't be cached
* put temporary insert/update/delete triggers on all real tables used
in the query that will invalidate the cache - alternatively, we could
always run the invalidate-query-cache code for the affected table on
insert/update/delete when exact caching is enabled
* invalidate cache on schema changes
* run a periodic check and invalidate old cache entries.
Some of the above could also be needed for caching query plans.
I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications.
Actually, a web app that looks up the contents of a 5-row combobox is
not really brain-dead.
Doing all the caching (and cache invalidation) on the client side is
hard and fragile - what happens when someone adds a trigger in the
backend?
Has anyone looked at the details of the test case that
MySQL uses to claim that this is a good idea? Has it got any similarity
to your own usage patterns?
Yes - for content management / web apps.
No - for bean-counting apps.
We have talked about caching query plans for suitably-parameterized
queries, but even that seems unlikely to be a huge win; at least I'd
not think it useful to try to drive the cache totally automatically.
If an application could say "here's a query I expect to use a lot,
varying these specific parameters" then caching a plan for that would
make sense.
Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.
Sure. It would
a) make many queries faster
b) make client libs (ODBC/JDBC/ECPG) faster and simpler by not forcing
them to fake it.
But there is also a big class of applications that would benefit much
more from caching exact queries.
And it will make us as fast as MySQL for 100000 consecutive calls of
SELECT MAX(N) FROM T ;)
---------------
Hannu
On Wed, 2002-02-27 at 11:44, Hiroshi Inoue wrote:
Neil Conway wrote:
//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature.
Hmm, is it really a notable feature?
It makes a powerful database equally fast to a not-so-powerful one for
simple cases.
IIRC Oracle has had a similar feature for a long time.
I guess that's for exactly this reason :)
---------
Hannu
Neil Conway <nconway@klamath.dyndns.org> writes:
When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaing this cache. If
so, flush the cache. This ensures that we will never return invalid
results.
Note that this would imply that the cache is *global* across all
backends; therefore it is a shared data structure and hence an access
bottleneck. (Not to mention a memory-management headache, since the
size of shared memory can't easily be varied on-the-fly.)
I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications. Has anyone looked at the details of the test case that
MySQL uses to claim that this is a good idea? Has it got any similarity
to your own usage patterns?
We have talked about caching query plans for suitably-parameterized
queries, but even that seems unlikely to be a huge win; at least I'd
not think it useful to try to drive the cache totally automatically.
If an application could say "here's a query I expect to use a lot,
varying these specific parameters" then caching a plan for that would
make sense.
Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.
regards, tom lane
So that's it. A completely lame "benchmark faker" tool. Useful for
only the dumb benchmark they create.
On Wed, 2002-02-27 at 00:39, Tom Lane wrote:
I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications.
According to MySQL: "The query cache is extremely useful in an
environment where (some) tables don't change very often and you have a
lot of identical queries. This is a typical situation for many web
servers that use a lot of dynamic content."
Would people agree with the MySQL guys on this? In particular, that this
is a "typical situation" for many webapps?
Has anyone looked at the details of the test case that
MySQL uses to claim that this is a good idea?
I emailed the author of the eWeek benchmarks asking for more
information. My guess is that the benchmark results benefit from the
query cache because it executes exactly the same query over and over
again (e.g. selecting the same product from the database every time, not
simulating typical user behavior by selecting random products). If that
is the case, the results are clearly irrelevant.
Has it got any similarity to your own usage patterns?
I would be very interested to hear about this as well.
Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.
Okay, I'll take a look at this...
Cheers,
Neil
--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
Neil Conway wrote:
On Tue, 2002-02-26 at 15:30, Zak Greant wrote:
Good Day All,
eWeek has posted a poll that asks which database server is most critical
to your organization.
The article mentions a MySQL feature which apparently improved
performance considerably:
//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature.
Hmm, is it really a notable feature?
IIRC Oracle has had a similar feature for a long time.
regards,
Hiroshi Inoue