eWeek Poll: Which database is most critical to your organization?

Started by Zak Greant almost 24 years ago · 34 messages
#1 Zak Greant
zak@mysql.com

Good Day All,

eWeek has posted a poll that asks which database server is most critical
to your organization.

The poll tracks the following databases:

- IBM DB2
- Informix
- Microsoft SQL Server
- MySQL
- Ingres
- Oracle
- PostgreSQL
- Sybase Adaptive Server Enterprise
- Sybase SQL Anywhere
- any object database

It is interesting to note that Oracle has shown a rather large jump
between yesterday and today - about 7% IIRC - a sign that the sample
size of the poll should become larger. :)

If you would care to vote, visit:
http://www.eweek.com/article/0%2c3658%2cs=708&a=23115%2c00.asp

Note: This message is being delivered to the Perl DBI, PHP general,
PostgreSQL Advocacy, MySQL general and Ruby general mailing lists.
However, to prevent possible and irritating cross-posting, I am
delivering each message individually. :)

Ciao!

--zak

--
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /    Zak Greant <zak@mysql.com>
 / /|_/ / // /\ \/ /_/ / /__   MySQL AB, Advocate
/_/  /_/\_, /___/\___\_\___/   Calgary, Canada
       <___/   www.mysql.com   403.244.7213

#2 Neil Conway
nconway@klamath.dyndns.org
In reply to: Zak Greant (#1)
Re: eWeek Poll: Which database is most critical to your

On Tue, 2002-02-26 at 15:30, Zak Greant wrote:

Good Day All,

eWeek has posted a poll that asks which database server is most critical
to your organization.

The article mentions a MySQL feature which apparently improved
performance considerably:

//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature. If the text of an incoming
query has a byte-for-byte match with a cached query, MySQL can retrieve
the results directly from the cache without compiling the query, getting
locks or doing index accesses. This query caching will be effective only
for tables with few updates because any table updates clear the
cache to guarantee correct results are always returned.
//

My guess is that it would be relatively simple to implement. Any
comments on this?

If I implemented this, any chance this would make it into the tree? Of
course, it would be:

- disabled by default
- enabled on a table-by-table basis (maybe an ALTER TABLE command)

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

#3 Dann Corbit
DCorbit@connx.com
In reply to: Neil Conway (#2)
Re: eWeek Poll: Which database is most critical to your

-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:04 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to
your

On Tue, 2002-02-26 at 15:30, Zak Greant wrote:

Good Day All,

eWeek has posted a poll that asks which database server is most critical
to your organization.

The article mentions a MySQL feature which apparently improved
performance considerably:

//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature. If the text of an incoming
query has a byte-for-byte match with a cached query, MySQL can retrieve
the results directly from the cache without compiling the query, getting
locks or doing index accesses. This query caching will be effective only
for tables with few updates because any table updates clear the
cache to guarantee correct results are always returned.
//

My guess is that it would be relatively simple to implement. Any
comments on this?

If I implemented this, any chance this would make it into the tree? Of
course, it would be:

- disabled by default
- enabled on a table-by-table basis (maybe an ALTER TABLE command)

I don't see how it will do any good. There is no "prepare" in
PostgreSQL and therefore you will simply be reexecuting the queries
every time anyway. Also, parameter markers only work in embedded SQL
and that is a single-tasking system.

I think it would be a major piece of work to do anything useful along
those lines.

If you look at how DB/2 works, you will see that they store prepared
statements. Another alternative would be to keep some point in the
parser marked and somehow jump to that point, but you would have to
be able to save a parse tree somewhere and also recognize the query.

Here is where problems come in...
-- Someone wants blue and blue-green, etc shirts that are backordered
SELECT shirt, color, backorder_qty FROM garments WHERE color like "BLUE%"

Now, another query comes along:

-- Someone else wants reddish, etc shirts that are backordered:
SELECT shirt, color, backorder_qty FROM garments WHERE color like "RED%"

It's the same query with different data. Without parameter markers you
will never know it. And yet this is exactly the sort of caching that is
useful.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

#4 Neil Conway
nconway@klamath.dyndns.org
In reply to: Dann Corbit (#3)
Re: eWeek Poll: Which database is most critical to your

On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:

I don't see how it will do any good. There is no "prepare" in
PostgreSQL and therefore you will simply be reexecuting the queries
every time anyway. Also, parameter markers only work in embedded SQL
and that is a single-tasking system.

Perhaps I wasn't clear. The feature I'm proposing is this:

When processing SELECT queries but before any real work has been
done, look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could not immediately cache the
query, just increment a "frequency" counter stored in the hash table. If
the counter goes above a certain constant, we decide that the query is
worth caching, so we cache the full result in the hash table.

When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this cache. If
so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.
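A minimal sketch of the proposed mechanism, in Python purely for
illustration; the class and the run_query/tables_read callbacks are
stand-ins for the executor and parser, not anything that exists in
PostgreSQL:

    CACHE_THRESHOLD = 5  # assumed constant: cache after this many repeats

    class QueryCache:
        def __init__(self):
            self.results = {}  # query text -> (result rows, tables read)
            self.counts = {}   # query text -> times this query was seen

        def select(self, query, run_query, tables_read):
            # Serve a SELECT from the cache, or run it and maybe cache it.
            if query in self.results:
                return self.results[query][0]      # byte-for-byte hit
            self.counts[query] = self.counts.get(query, 0) + 1
            rows = run_query(query)
            if self.counts[query] >= CACHE_THRESHOLD:
                self.results[query] = (rows, tables_read(query))
            return rows

        def invalidate(self, modified_table):
            # Called for every INSERT/UPDATE/DELETE: flush any cached
            # entry that reads the modified table, so stale results are
            # never returned.
            stale = [q for q, (_, tables) in self.results.items()
                     if modified_table in tables]
            for q in stale:
                del self.results[q]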

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.

Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

#5 Dann Corbit
DCorbit@connx.com
In reply to: Neil Conway (#4)
Re: eWeek Poll: Which database is most critical to your

-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:42 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to
your

On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:

I don't see how it will do any good. There is no "prepare" in
PostgreSQL and therefore you will simply be reexecuting the queries
every time anyway. Also, parameter markers only work in embedded SQL
and that is a single-tasking system.

Perhaps I wasn't clear. The feature I'm proposing is this:

When processing SELECT queries but before any real work has been
done, look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could not immediately cache the
query, just increment a "frequency" counter stored in the hash table. If
the counter goes above a certain constant, we decide that the query is
worth caching, so we cache the full result in the hash table.

Statistical tools are a good idea, because they can tell us where
indexes should be added. However, you cannot simply return the result
of the previous query, because the contents may have changed since the
last time it was executed. It is simply invalid to do that. If some
other system is doing that, then it isn't a relational database.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this cache. If
so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.

How do you know whether or not someone has affected the row that you
are reading? If you do not know, then every single update, insert or
delete will mean that you have to refresh. And not only that, you will
also have to track it. For sure, it will make the whole system run
more slowly rather than faster.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.

Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.

Very likely, it is only my limited understanding not really grasping
what it is that you are trying to do. Even so, I don't think it really
helps even for read only queries, unless it is exactly the same query
with the same parameter markers and everything that was issued before.
That is very unusual. Normally, you won't have the entire query hard-
wired, but will allow the customer to do some sort of filtering of the
data.

I do (very much) like your idea to gather statistics. Again, this must
have the complete query plan parsed so that you know what the parameter
markers are. Otherwise, there will be completely different queries:

SELECT * FROM employee WHERE ssn = '777-99-1234'
SELECT * FROM employee WHERE ssn = '999-77-1234'
SELECT * FROM employee WHERE ssn = '797-97-1234'
SELECT * FROM employee WHERE ssn = '979-79-1234'
SELECT * FROM employee WHERE ssn = '779-99-1234'

when really only one query exists:
SELECT * FROM employee WHERE ssn = ?

If you do not parameterize the queries, your statistics will be
misleading at best and outright lies at worst. For example, suppose that
the above query appears 300,000 times, but with different SSNs. And then
someone runs this twice:

SELECT * FROM reports WHERE type = 'q'

The query that ran twice will be the most frequent, despite the fact
that the really frequent query happens 150,000 times as often.
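A toy illustration of this point, assuming a real implementation would
normalize literals via the SQL parser rather than the regular expression
used here:

    import re
    from collections import Counter

    # Replace string and numeric literals with parameter markers so that
    # queries differing only in data are counted as one query shape.
    _LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+\b")

    def normalize(query):
        return _LITERAL.sub("?", query)

    stats = Counter()
    for q in ["SELECT * FROM employee WHERE ssn = '777-99-1234'",
              "SELECT * FROM employee WHERE ssn = '999-77-1234'",
              "SELECT * FROM reports WHERE type = 'q'"]:
        stats[normalize(q)] += 1

    # stats now counts 2 for "SELECT * FROM employee WHERE ssn = ?" and
    # 1 for "SELECT * FROM reports WHERE type = ?".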
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

#6 Thomas Swan
tswan-lst@ics.olemiss.edu
In reply to: Dann Corbit (#3)
Re: eWeek Poll: Which database is most critical to your

Dann Corbit wrote:

It's the same query with different data. Without parameter markers you
will never know it. And yet this is exactly the sort of caching that is
useful.

However, an exact match is still not a bad idea. You might have a
different execution plan depending on the statistics of the data in your
column.

If there were a way to store the execution plan and start executing from
there it still might not be a bad idea.

#7 Rod Taylor
rbt@barchord.com
In reply to: Dann Corbit (#3)
Re: eWeek Poll: Which database is most critical to your

If you were to expire the cache when the table was updated it wouldn't
be so bad, and you certainly don't want to cache everything under the
assumption something may come along -- as it probably won't. But if
you're interested in speeding this up I can't see any harm in adding a
stat into statistics that stores query frequency. If the frequency
goes above a certain notch (without the results changing -- static
table data) then cache results.

However, I think 99% of these cases could be easily rectified with a
PHP interface which caches the results into memory at the user's
request, as I really don't see hundreds of machines making the same
queries frequently -- rather 1 or 2 poorly written ones :)

Then again, I regularly hit one-row database tables which act as
markers for data changes, to see if the program should reload the
entire data set. (Parallel machines for interface purposes normally
cache table data in RAM, but test for changes with every request --
the goal was that they shouldn't have to communicate with each other.)
But a 1-row table is very quick to select from. If this can make that
faster, it'll be useful.
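A sketch of that marker-table pattern, written against a Python DB-API
connection with psycopg-style %s placeholders; the data_markers table
and its columns are hypothetical:

    class MarkerCachedTable:
        # Keep a whole table cached in client RAM; on each request check
        # a one-row marker table that is bumped whenever the data changes.
        def __init__(self, conn, table):
            self.conn, self.table = conn, table
            self.version, self.rows = None, None

        def get_rows(self):
            cur = self.conn.cursor()
            cur.execute("SELECT version FROM data_markers WHERE tab = %s",
                        (self.table,))
            (version,) = cur.fetchone()
            if version != self.version:        # data changed: reload all
                cur.execute("SELECT * FROM " + self.table)  # trusted name
                self.rows = cur.fetchall()
                self.version = version
            return self.rows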

Anyway, write it up. If you can speed up some without making a hit
against others it'll be a large advantage.
--
Rod Taylor

This message represents the official view of the voices in my head

----- Original Message -----
From: "Neil Conway" <nconway@klamath.dyndns.org>
To: "Dann Corbit" <DCorbit@connx.com>
Cc: <pgsql-hackers@postgresql.org>
Sent: Tuesday, February 26, 2002 6:42 PM
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to
your

On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:

I don't see how it will do any good. There is no "prepare" in
PostgreSQL and therefore you will simply be reexecuting the queries
every time anyway. Also, parameter markers only work in embedded SQL
and that is a single-tasking system.

Perhaps I wasn't clear. The feature I'm proposing is this:

When processing SELECT queries but before any real work has been
done, look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could not immediately cache the
query, just increment a "frequency" counter stored in the hash table. If
the counter goes above a certain constant, we decide that the query is
worth caching, so we cache the full result in the hash table.

When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this cache.
If so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.

Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC


#8 Neil Conway
nconway@klamath.dyndns.org
In reply to: Dann Corbit (#5)
Re: eWeek Poll: Which database is most critical to your

On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:

Statistical tools are a good idea, because they can tell us where
indexes should be added. However, you cannot simply return the result
of the previous query, because the contents may have changed since the
last time it was executed. It is simply invalid to do that. If some
other system is doing that, then it isn't a relational database.

No -- as I said, any inserts, updates or deletes that affect the table
in question will cause a full cache flush.

How do you know whether or not someone has affected the row that you
are reading? If you do not know, then every single update, insert or
delete will mean that you have to refresh.

Yes, that is true.

And not only that, you will
also have to track it. For sure, it will make the whole system run
more slowly rather than faster.

I don't think tracking changes imposes a lot of overhead -- it is
relatively simple to determine if a query affects a given table.

Very likely, it is only my limited understanding not really grasping
what it is that you are trying to do. Even so, I don't think it really
helps even for read only queries, unless it is exactly the same query
with the same parameter markers and everything that was issued before.
That is very unusual. Normally, you won't have the entire query hard-
wired, but with allow the customer to do some sort of filtering of the
data.

Hmmm... the more I think about it, the more unusual it would be for
_exactly_ the same query to be repeated a lot. However, the article
reported a significant performance gain when this feature was enabled.
That could mean that:

(a) the performance measurements/benchmarks used by the article were
synthetic and don't reflect real database applications

(b) the feature MySQL implements is different than the one I am
describing

When I get a chance I'll investigate further the technique used by MySQL
to see if (b) is the case. However, it is beginning to look like this
isn't a good idea, overall.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

#9 Dann Corbit
DCorbit@connx.com
In reply to: Neil Conway (#8)
Re: eWeek Poll: Which database is most critical to your

[snip]

Very likely, it is only my limited understanding not really grasping
what it is that you are trying to do. Even so, I don't think it really
helps even for read only queries, unless it is exactly the same query
with the same parameter markers and everything that was issued before.
That is very unusual. Normally, you won't have the entire query hard-
wired, but will allow the customer to do some sort of filtering of the
data.

Hmmm... the more I think about it, the more unusual it would be for
_exactly_ the same query to be repeated a lot. However, the article
reported a significant performance gain when this feature was enabled.
That could mean that:

(a) the performance measurements/benchmarks used by the article were
synthetic and don't reflect real database applications

(b) the feature MySQL implements is different than the one I am
describing

When I get a chance I'll investigate further the technique used by MySQL
to see if (b) is the case. However, it is beginning to look like this
isn't a good idea, overall.

I did not read the article at all, but I am familiar with query cache
and in fact, I do it a lot (I work for a database company). Here is
how the algorithm works:

You intercept every incoming query and parse it. Any physical data
gets replaced with parameter markers. A 64 bit hash is formed from the
parsed query with the parameter markers removed. The hash is used as
an index into a skiplist which also stores the original query. After
all, if a client has a million dollar request, he won't be happy that
the unbelievably rare thing happened and the checksums agreed.

You can add a counter to the data in the skiplist so that you know how
often the query happens. The parsed query will only be useful to a
system that can save time from having a query prepared (most systems
call it preparing the query). I was kind of surprised to see that
PostgreSQL does not have a prepare stage in libpq. This can be a
very large speedup in query execution (for obvious reasons).
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
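A sketch of the scheme Dann outlines, with a Python dict standing in for
the skiplist; the normalized query text is stored and compared on every
hit precisely so the rare 64-bit collision can never return the wrong
entry:

    import hashlib

    def hash64(text):
        # Any decent 64-bit hash would do; this one is just convenient.
        return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8],
                              "big")

    cache = {}  # 64-bit hash -> list of [normalized_query, count, plan]

    def lookup(normalized_query):
        bucket = cache.setdefault(hash64(normalized_query), [])
        for entry in bucket:
            if entry[0] == normalized_query:  # verify: hashes can collide
                entry[1] += 1                 # usage counter for stats
                return entry
        entry = [normalized_query, 1, None]   # plan slot filled on prepare
        bucket.append(entry)
        return entry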

#10 Justin Clift
justin@postgresql.org
In reply to: Dann Corbit (#3)
Experimental Feature development in PostgreSQL

Neil Conway wrote:
<snip>

Hi everyone,

This is getting me to think:

We don't have a really established process for PostgreSQL development
which allows for "experimental features", like the Linux kernel's
"EXPERIMENTAL" flag and so forth.

e.g. ./configure --experimental-querycache
--experimental-pertablecostings --experimental-something?

This way, people can still use PostgreSQL as per normal, but it also
allows for development of code which might or might not actually see the
light of day, depending on whether it turns out to really be useful.

Things should only become "experimental features" after proper
consideration, not by default. There are situations where it would be
beneficial (perhaps in Neil's example here). If Neil were to split off
a project and do it on SourceForge, then only some people would check
it out when it was kind of ready, and it wouldn't really be as available
to a wider audience. But if it's part of the present source tree, I
think the experimental features would see a wider audience and we'd get
a better indication of what's good/bad/etc.

Don't know how to support those features, but pretty much developers and
hard-core-users should be the only ones using them and therefore should
be able to figure most of the stuff out for themselves (i.e. we'd only
get the real questions).

Does this seem like a worthwhile viewpoint to consider? Then perhaps we
can think about starting to implement it for 7.3. (Wish I could code
better.) :)

Regards and best wishes,

Justin Clift

Perhaps I wasn't clear. The feature I'm proposing is this:

When processing SELECT queries but before any real work has been
done, look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could not immediately cache the
query, just increment a "frequency" counter stored in the hash table. If
the counter goes above a certain constant, we decide that the query is
worth caching, so we cache the full result in the hash table.

When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this cache. If
so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.

Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC


--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi

#11 Hannu Krosing
hannu@krosing.net
In reply to: Neil Conway (#8)
Re: eWeek Poll: Which database is most critical to your

On Wed, 2002-02-27 at 05:23, Neil Conway wrote:

On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:

Statistical tools are a good idea, because they can tell us where
indexes should be added. However, you cannot simply return the result
of the previous query, because the contents may have changed since the
last time it was executed. It is simply invalid to do that. If some
other system is doing that, then it isn't a relational database.

No -- as I said, any inserts, updates or deletes that affect the table
in question will cause a full cache flush.

How do you know whether or not someone has affected the row that you
are reading? If you do not know, then every single update, insert or
delete will mean that you have to refresh.

Yes, that is true.

And not only that, you will
also have to track it. For sure, it will make the whole system run
more slowly rather than faster.

I don't think tracking changes imposes a lot of overhead -- it is
relatively simple to determine if a query affects a given table.

Perhaps you can do it in a simple way for MySQL which has no
rules/triggers/foreign keys with ON DELETE CASCADE.

The only way I can think of is by putting some kind of rule or trigger
on the table affected.

Hmmm... the more I think about it, the more unusual it would be for
_exactly_ the same query to be repeated a lot. However, the article
reported a significant performance gain when this feature was enabled.
That could mean that:

(a) the performance measurements/benchmarks used by the article were
synthetic and don't reflect real database applications

I think that a Slashdot-type web application would probably benefit a
lot.

----------------
Hannu

#12 Ken Hirsch
kenhirsch@myself.com
In reply to: Dann Corbit (#9)
Re: eWeek Poll: Which database is most critical to your

Here is the documentation about MySQL's new Query Cache. I think that it
would be helpful, as they indicate, for dynamic web sites, such as Slashdot.
There are hundreds or maybe thousands of queries in between added comments
and there are probably only a few common combinations of
threshold/nesting/sort.

MySQL Query Cache
=================

From version 4.0.1, `MySQL server' features a `Query Cache'. When in
use, the query cache stores the text of a `SELECT' query together with
the corresponding result that is sent to a client. If another
identical query is received, the server can then retrieve the results
from the query cache rather than parsing and executing the same query
again.

The query cache is extremely useful in an environment where (some)
tables don't change very often and you have a lot of identical queries.
This is a typical situation for many web servers that use a lot of
dynamic content.

Following are some performance data for the query cache (We got these
by running the MySQL benchmark suite on a Linux Alpha 2x500 MHz with
2GB RAM and a 64MB query cache):

* If you want to disable the query cache code set
`query_cache_size=0'. By disabling the query cache code there is
no noticeable overhead.

* If all of the queries you're performing are simple (such as
selecting a row from a table with one row), but still differ so
that the queries cannot be cached, the overhead for having the
query cache active is 13%. This could be regarded as the worst
case scenario. However, in real life, queries are much more
complicated than our simple example so the overhead is normally
significantly lower.

* A search for one row in a one-row table is 238% faster. This
can be regarded as close to the minimum speedup to be expected for
a query that is cached.

How The Query Cache Operates
----------------------------

Queries are compared before parsing, thus

SELECT * FROM TABLE

and

Select * from table

are regarded as different queries for the query cache, so queries need to
be exactly the same (byte for byte) to be seen as identical. In
addition, a query may be seen as different if for instance one client
is using a new communication protocol format or another character set
than another client.

Queries that use different databases, different protocol versions, or
different default character sets are considered different queries and
are cached separately.

The cache does work for `SELECT SQL_CALC_FOUND_ROWS ...' and `SELECT FOUND_ROWS()
...' type queries because the number of found rows is also stored in
the cache.

If a table changes (`INSERT', `UPDATE', `DELETE', `TRUNCATE', `ALTER'
or `DROP TABLE|DATABASE'), then all cached queries that used this table
(possibly through a `MRG_MyISAM' table!) become invalid and are removed
from the cache.

Currently all `InnoDB' tables are invalidated on `COMMIT'; in the
future this will be changed so that only tables changed in the
transaction cause the corresponding cache entries to be invalidated.

A query cannot be cached if it contains one of the following functions:

User-Defined Functions, `CONNECTION_ID', `FOUND_ROWS', `GET_LOCK',
`RELEASE_LOCK', `LOAD_FILE', `MASTER_POS_WAIT', `NOW', `SYSDATE',
`CURRENT_TIMESTAMP', `CURDATE', `CURRENT_DATE', `CURTIME',
`CURRENT_TIME', `DATABASE', `ENCRYPT' (with one parameter),
`LAST_INSERT_ID', `RAND', `UNIX_TIMESTAMP' (without parameters),
`USER', `BENCHMARK'

Nor can a query be cached if it contains user variables, if it is of
the form `SELECT ... IN SHARE MODE', or of the form `SELECT * FROM ...
WHERE AUTOINCREMENT_FIELD IS NULL' (to retrieve the last insert id -
ODBC workaround).

However, `FOUND_ROWS()' will return the correct value, even if the
preceding query was fetched from the cache.

Queries that don't use any tables, or where the user has a column
privilege for any of the involved tables, are not cached.

Before a query is fetched from the query cache, MySQL will check that
the user has SELECT privilege to all the involved databases and tables.
If this is not the case, the cached result will not be used.

Query Cache Configuration
-------------------------

The query cache adds a few `MySQL' system variables for `mysqld' which
may be set in a configuration file or on the command line when starting
`mysqld'.

* `query_cache_limit' Don't cache results that are bigger than this.
(Default 1M).

* `query_cache_size' The memory allocated to store results from old
queries. If this is 0, the query cache is disabled (default).

* `query_cache_startup_type' This may be set (only numeric) to:
  0 (OFF, don't cache or retrieve results)
  1 (ON, cache all results except `SELECT SQL_NO_CACHE ...' queries)
  2 (DEMAND, cache only `SELECT SQL_CACHE ...' queries)

Inside a thread (connection), the behaviour of the query cache can be
changed from the default. The syntax is as follows:

`SQL_QUERY_CACHE_TYPE = OFF | ON | DEMAND' or
`SQL_QUERY_CACHE_TYPE = 0 | 1 | 2'

*Option*      *Description*
0 or OFF      Don't cache or retrieve results.
1 or ON       Cache all results except `SELECT SQL_NO_CACHE ...' queries.
2 or DEMAND   Cache only `SELECT SQL_CACHE ...' queries.

By default `SQL_QUERY_CACHE_TYPE' depends on the value of
`query_cache_startup_type' when the thread was created.

Query Cache Options in `SELECT'
-------------------------------

There are two possible query cache related parameters that may be
specified in a `SELECT' query:

*Option*        *Description*
`SQL_CACHE'     If `SQL_QUERY_CACHE_TYPE' is `DEMAND', allow the query
                to be cached. If `SQL_QUERY_CACHE_TYPE' is `ON', this is
                the default. If `SQL_QUERY_CACHE_TYPE' is `OFF', do
                nothing.
`SQL_NO_CACHE'  Make this query non-cachable; don't allow this query to
                be stored in the cache.

Query Cache Status and Maintenance
----------------------------------

With the `FLUSH QUERY CACHE' command you can defragment the query cache
to better utilise its memory. This command will not remove any queries
from the cache. `FLUSH TABLES' also flushes the query cache.

The `RESET QUERY CACHE' command removes all query results from the
query cache.

You can monitor query cache performance in `SHOW STATUS':

*Variable*                 *Description*
`Qcache_queries_in_cache'  Number of queries registered in the cache.
`Qcache_inserts'           Number of queries added to the cache.
`Qcache_hits'              Number of cache hits.
`Qcache_not_cached'        Number of non-cached queries (not cachable,
                           or due to `SQL_QUERY_CACHE_TYPE').
`Qcache_free_memory'       Amount of free memory for the query cache.
`Qcache_total_blocks'      Total number of blocks in the query cache.
`Qcache_free_blocks'       Number of free memory blocks in the query
                           cache.

Total number of queries = `Qcache_inserts' + `Qcache_hits' +
`Qcache_not_cached'.

The query cache uses variable length blocks, so `Qcache_total_blocks'
and `Qcache_free_blocks' may indicate query cache memory fragmentation.
After `FLUSH QUERY CACHE' only a single (big) free block remains.

Note: Every query needs a minimum of two blocks (one for the query text
and one or more for the query results). Also, every table that is used
by a query needs one block, but if two or more queries use the same
table, only one block needs to be allocated.
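Putting the quoted counters to work: a small Python sketch (with made-up
numbers) that derives a hit rate from the identity given above:

    status = {  # illustrative values, as read from SHOW STATUS
        "Qcache_hits": 15000,
        "Qcache_inserts": 4000,
        "Qcache_not_cached": 1000,
    }

    # Total number of queries = Qcache_inserts + Qcache_hits +
    # Qcache_not_cached, per the documentation above.
    total = (status["Qcache_hits"] + status["Qcache_inserts"]
             + status["Qcache_not_cached"])
    print("hit rate: %.0f%% of %d queries"
          % (100.0 * status["Qcache_hits"] / total, total))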

#13 Hannu Krosing
hannu@krosing.net
In reply to: Dann Corbit (#3)
Re: eWeek Poll: Which database is most critical to your

On Wed, 2002-02-27 at 10:39, Tom Lane wrote:

Neil Conway <nconway@klamath.dyndns.org> writes:

When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this cache. If
so, flush the cache. This ensures that we will never return invalid
results.

Note that this would imply that the cache is *global* across all
backends; therefore it is a shared data structure and hence an access
bottleneck. (Not to mention a memory-management headache, since the
size of shared memory can't easily be varied on-the-fly.)

I think that it would be enough if the hashes and bookkeeping info
(tmin, tmax, filename) were in a global table. We could also purge all
data more than a few minutes old. We also need an inverse lookup from
changed table to cached query, for cache invalidation on
insert/update/delete.

The result could even be saved in temp files and be mostly faster than
doing the full parse/plan/execute, both for complex queries returning a
few rows (saves planning time) and for queries returning many rows
(saves execute time).

The format used for saving should be the exact wire protocol, so that
efficient system calls could be used where available (Linux 2.4+ has a
system call that will transfer a whole file to a socket in one call,
bypassing all copying and caching).

The lookup part will be pretty trivial - look up using the hash, check
tmin/tmax, and if OK push the cached result out to the client. This will
make us as fast or faster than MySQL for the
trivial_benchmark/busy_website case.

The cache creation/maintenance part will be much trickier - a sketch of
the bookkeeping follows below.

When creating cache:
* the tables affected can be determined only from fully built plans
because of possible rule expansions.
* if there is a trigger on select for this query it can't be cached
* put some temporary insert/update/delete triggers on all real tables
used in the query that will invalidate the cache - as an alternative we
could always run the invalidate-query-cache code for the affected table
on insert/update/delete if exact caching is enabled
* invalidate cache on schema changes
* run a periodic check and invalidate old cache entries.

Some of the above could also be needed for caching query plans.
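The promised sketch of the bookkeeping, assuming the affected tables
have already been pulled out of the built plan; all names here are
illustrative:

    import time
    from collections import defaultdict

    MAX_AGE = 300  # purge entries more than a few minutes old

    cache = {}                   # query hash -> (created, filename, tables)
    by_table = defaultdict(set)  # inverse lookup: table -> query hashes

    def remember(query_hash, filename, tables):
        cache[query_hash] = (time.time(), filename, tables)
        for t in tables:
            by_table[t].add(query_hash)

    def invalidate_table(table):
        # Run on insert/update/delete or schema change against `table`.
        for h in by_table.pop(table, set()):
            cache.pop(h, None)

    def purge_old(now=None):
        now = now or time.time()
        for h, (created, _, tables) in list(cache.items()):
            if now - created > MAX_AGE:
                del cache[h]
                for t in tables:
                    by_table[t].discard(h)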

I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications.

Actually a web app that looks up contents of a 5 row combobox is not
really brain-dead.

Doing all the caching (and cache invalidation) on the client side is
hard and fragile - what happens when someone adds a trigger in the
backend?

Has anyone looked at the details of the test case that
MySQL uses to claim that this is a good idea? Has it got any similarity
to your own usage patterns?

Yes - for content management / web apps.

No - for bean-counting apps.

We have talked about caching query plans for suitably-parameterized
queries, but even that seems unlikely to be a huge win; at least I'd
not think it useful to try to drive the cache totally automatically.
If an application could say "here's a query I expect to use a lot,
varying these specific parameters" then caching a plan for that would
make sense.

Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.

Sure. It would

a) make many queries faster

b) make client libs (ODBC/JDBC/ECPG) faster and simpler by not forcing
them to fake it.

But there is also a big class of applications that would benefit much
more from caching exact queries.

And it will make us as fast as MySQL for 100000 consecutive calls of
SELECT MAX(N) FROM T ;)

---------------
Hannu

#14 Hannu Krosing
hannu@krosing.net
In reply to: Zak Greant (#1)
Re: eWeek Poll: Which database is most critical to your

On Wed, 2002-02-27 at 11:44, Hiroshi Inoue wrote:

Neil Conway wrote:

//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature.

Hmm is it really a notable feature ?

It makes a powerful database as fast as a not-so-powerful one for
simple cases.

IIRC Oracle has had a similar feature for a long time.

I guess that's for exactly this reason :)

---------
Hannu

#15 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#4)
Re: eWeek Poll: Which database is most critical to your

Neil Conway <nconway@klamath.dyndns.org> writes:

When processing INSERTs, UPDATEs and DELETEs, check if the query
would affect any of the tables for which we are maintaining this cache. If
so, flush the cache. This ensures that we will never return invalid
results.

Note that this would imply that the cache is *global* across all
backends; therefore it is a shared data structure and hence an access
bottleneck. (Not to mention a memory-management headache, since the
size of shared memory can't easily be varied on-the-fly.)

I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications. Has anyone looked at the details of the test case that
MySQL uses to claim that this is a good idea? Has it got any similarity
to your own usage patterns?

We have talked about caching query plans for suitably-parameterized
queries, but even that seems unlikely to be a huge win; at least I'd
not think it useful to try to drive the cache totally automatically.
If an application could say "here's a query I expect to use a lot,
varying these specific parameters" then caching a plan for that would
make sense.

Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.

regards, tom lane
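For a feel of what "the application tells us what to parameterize" looks
like from the client side, here is a Python DB-API style sketch; no such
libpq interface existed at the time of this thread, and whether the plan
is truly prepared once depends entirely on the driver:

    QUERY = ("SELECT shirt, color, backorder_qty "
             "FROM garments WHERE color LIKE %s")

    def backordered(cur, prefix):
        # The query shape is fixed by the application; only the
        # parameter varies from call to call.
        cur.execute(QUERY, (prefix + "%",))
        return cur.fetchall()

    # backordered(cur, "BLUE"); backordered(cur, "RED")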

#16 Dann Corbit
DCorbit@connx.com
In reply to: Tom Lane (#15)
Re: eWeek Poll: Which database is most critical to your

So that's it. A completely lame "benchmark faker" tool. Useful for
only the dumb benchmark they create.

#17 Neil Conway
nconway@klamath.dyndns.org
In reply to: Tom Lane (#15)
Re: eWeek Poll: Which database is most critical to your

On Wed, 2002-02-27 at 00:39, Tom Lane wrote:

I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications.

According to MySQL: "The query cache is extremely useful in an
environment where (some) tables don't change very often and you have a
lot of identical queries. This is a typical situation for many web
servers that use a lot of dynamic content."

Would people agree with the MySQL guys on this? In particular, that this
is a "typical situation" for many webapps?

Has anyone looked at the details of the test case that
MySQL uses to claim that this is a good idea?

I emailed the author of the eWeek benchmarks asking for more
information. My guess is that the benchmark results benefit from the
query cache because it executes exactly the same query over and over
again (e.g. selecting the same product from the database every time, not
simulating typical user behavior by selecting random products). If that
is the case, the results are clearly irrelevant.

Has it got any similarity to your own usage patterns?

I would be very interested to hear about this as well.

Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.

Okay, I'll take a look at this...

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

#18 Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zak Greant (#1)
Re: eWeek Poll: Which database is most critical to your

Neil Conway wrote:

On Tue, 2002-02-26 at 15:30, Zak Greant wrote:

Good Day All,

eWeek has posted a poll that asks which database server is most critical
to your organization.

The article mentions a MySQL feature which apparently improved
performance considerably:

//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature.

Hmm is it really a notable feature ?
IIRC Oracle has had a similar feature for a long time.

regards,
Hiroshi Inoue

#19 Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Neil Conway (#17)
Re: eWeek Poll: Which database is most critical to your

According to MySQL: "The query cache is extremely useful in an
environment where (some) tables don't change very often and you have a
lot of identical queries. This is a typical situation for many web
servers that use a lot of dynamic content."

Would people agree with the MySQL guys on this? In particular, that this
is a "typical situation" for many webapps?

Hmmm. We have a lot of repeated _parameterised_ queries, but the recurrence
of identical queries is quite small. It'd be an interesting thing to try
and measure.

Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.

Okay, I'll take a look at this...

This is the more general solution, compared to MySQL's query cache - and can
speed up parameterised queries as well as identical queries...

Chris

#20 F Harvell
fharvell@fts.net
In reply to: Dann Corbit (#3)
Re: eWeek Poll: Which database is most critical to

On Tue, 26 Feb 2002 15:20:17 PST, "Dann Corbit" wrote:

-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:04 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to
your

On Tue, 2002-02-26 at 15:30, Zak Greant wrote:

Good Day All,

eWeek has posted a poll that asks which database server is most critical
to your organization.

The article mentions a MySQL feature which apparently improved
performance considerably:

//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature. If the text of an incoming
query has a byte-for-byte match with a cached query, MySQL can retrieve
the results directly from the cache without compiling the query, getting
locks or doing index accesses. This query caching will be effective only
for tables with few updates because any table updates clear the
cache to guarantee correct results are always returned.
//

My guess is that it would be relatively simple to implement. Any
comments on this?

If I implemented this, any chance this would make it into the tree? Of
course, it would be:

- disabled by default
- enabled on a table-by-table basis (maybe an ALTER TABLE command)

I don't see how it will do any good. There is no "prepare" in
PostgreSQL and therefore you will simply be reexecuting the queries
every time anyway. Also, parameter markers only work in embedded SQL
and that is a single-tasking system.

I think it would be a major piece of work to do anything useful along
those lines.

If you look at how DB/2 works, you will see that they store prepared
statements. Another alternative would be to keep some point in the
parser marked and somehow jump to that point, but you would have to
be able to save a parse tree somewhere and also recognize the query.

Here is where problems come in...
-- Someone wants blue and blue-green, etc shirts that are backordered
SELECT shirt, color, backorder_qty FROM garments WHERE color like "BLUE%"

Now, another query comes along:

-- Someone else wants reddish, etc shirts that are backordered:
SELECT shirt, color, backorder_qty FROM garments WHERE color like "RED%"

It's the same query with different data. Without parameter markers you
will never know it. And yet this is exactly the sort of caching that is
useful.

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

While the cache that MySQL implemented may be of limited value for
dynamic queries, it would be very useful for many, many queries. For
example, in many applications (especially the ones that I write), there
are literally hundreds of selects against static "lookup" tables that
maintain acceptable values for new data inputs (i.e., validated using
referential integrity). Each and every one of those selects would
receive an improvement.

Also, it is relatively easy to "parameterize" the select. After all,
it is a structured query. Just by parsing it, you can identify the
"parameters". As a matter of fact, I'm pretty certain that this is
already done by the optimizer when breaking the SQL down into the code
to perform the query. The trick is to cache the "parameterized"
version of the query. This is likely at or near the end of the
optimizer processing. Also, you would want to capture the query plan
in the cache. The query plan is not going to be interested at all in
the literal value of the parameters and therefore will be the same for
any query of the same form.

For example, from above:

SELECT shirt, color, backorder_qty FROM garments WHERE color like 'BLUE%'

should become something on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like '{param0}%'

The next query:

SELECT shirt, color, backorder_qty FROM garments WHERE color like 'RED%'

should also be "parameterized" on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like '{param0}%'

A lookup into the query hash would match and therefore _at least_
the same query plan can be used.

The commercial database "Caché" by InterSystems uses an approach
similar to this. The performance of that database is phenomenal. (Of
course, there is more going on in the internals of that system than
just the query cache.)

#21 Henshall, Stuart - WCP
SHenshall@westcountrypublications.co.uk
In reply to: Christopher Kings-Lynne (#19)
Re: eWeek Poll: Which database is most critical to your

-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: 27 February 2002 00:24

On Tue, 2002-02-26 at 19:08, Dann Corbit wrote:

Statistical tools are a good idea, because they can tell us where
indexes should be added. However, you cannot simply return the result
of the previous query, because the contents may have changed since the
last time it was executed. It is simply invalid to do that. If some
other system is doing that, then it isn't a relational database.

No -- as I said, any inserts, updates or deletes that affect the table
in question will cause a full cache flush.

How do you know whether or not someone has affected the row that you
are reading? If you do not know, then every single update, insert or
delete will mean that you have to refresh.

Yes, that is true.

And not only that, you will
also have to track it. For sure, it will make the whole system run
more slowly rather than faster.

I don't think tracking changes imposes a lot of overhead -- it is
relatively simple to determine if a query affects a given table.

Very likely, it is only my limited understanding not really grasping
what it is that you are trying to do. Even so, I don't think it really
helps even for read only queries, unless it is exactly the same query
with the same parameter markers and everything that was issued before.
That is very unusual. Normally, you won't have the entire query hard-
wired, but will allow the customer to do some sort of filtering of the
data.

Hmmm... the more I think about it, the more unusual it would be for
_exactly_ the same query to be repeated a lot. However, the article
reported a significant performance gain when this feature was enabled.
That could mean that:

(a) the performance measurements/benchmarks used by the article were
synthetic and don't reflect real database applications

(b) the feature MySQL implements is different than the one I am
describing

When I get a chance I'll investigate further the technique used by MySQL
to see if (b) is the case. However, it is beginning to look like this
isn't a good idea, overall.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

If I understand correctly, you'll be matching on the original query
string, then pulling out the previous plan rather than doing all the
planning again? Or were you thinking of storing the resultant tuples
(which seems far more difficult to do efficiently)?
Either way would be handy for me, as I have a number of clients who
all basically ask the same query and then ask it again every few minutes
to update themselves. Therefore this sounds like something that would
improve performance for me.
Hope I've understood correctly,
- Stuart

#22 Dann Corbit
DCorbit@connx.com
In reply to: Henshall, Stuart - WCP (#21)
Re: eWeek Poll: Which database is most critical to your

[snip]
If I understand correctly, you'll be matching on the original query
string, then pulling out the previous plan rather than doing all the
planning again? Or were you thinking of storing the resultant tuples
(which seems far more difficult to do efficiently)?
Either way would be handy for me, as I have a number of clients who
all basically ask the same query and then ask it again every few
minutes to update themselves. Therefore this sounds like something that
would improve performance for me.
Hope I've understood correctly,

If they expect the information to change, then caching the query data
is not going to help. Unless you have a read-only database, it is my
opinion that caching of the actual query data is a very bad idea. It
will complicate the code and (in the end) not end up being any faster.

Two other aspects of the approach (gathering statistical information
on the frequency and usage of queries and storing the prepared query)
are both excellent ideas and ought to be pursued. Maintaining an LRU
cache of prepared queries is also a good idea.

If you have a read only database or table, then you can make a lot of
safe, simplifying assumptions about queries against that table. I
think that you can cache the data for a read-only table or database.

But for those instances, will it really be worth it? In other words,
if I have some read only tables and they really are frequently
accessed by user queries, the data will be in memory anyway. How much
will you actually gain by saving the data somewhere?

Benchmarks are great, but I think it makes a lot more sense to focus
on real applications. If we want to improve web server performance
(here is a tangible, real goal) then why not try with a real, useful
web server application? If we want to perform well in benchmarks, use
real benchmarks like the TPC-X benchmarks. "The Benchmark Factory"
has some useful stuff along these lines.

For sure, I can write a benchmark that makes any tool look good. But
does it indicate that the tool is actually useful for getting real
work done?

Instead of trying to win benchmarks, we should imagine how we can
make improvements that will solve real scientific and business
problems.

IMO-YMMV
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

#23 Tom Lane
tgl@sss.pgh.pa.us
In reply to: F Harvell (#20)
Re: eWeek Poll: Which database is most critical to

F Harvell <fharvell@fts.net> writes:

The query plan is not going to be interested at all in
the literal value of the parameters and therefore will be the same for
any query of the same form.

Unfortunately, this is completely false.

For example, from above:

SELECT shirt, color, backorder_qty FROM garments WHERE color like 'BLUE%'

should become something on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like '{param0}%'

You managed to pick an example that's perfectly suited to demolish your
assertion. The query with "color like 'BLUE%'" can be optimized into an
indexscan (using index quals of the form "color >= 'BLUE' and color <
'BLUF'"), at least in C locale. The parameterized query cannot be
optimized at all, because the planner cannot know whether the
substituted parameter string will provide a left-anchored pattern.
What if param0 contains '_FOO' at runtime? An indexscan will be
useless in that case.
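
Concretely, assuming a (hypothetical) index on garments(color), the
literal form lets the planner derive range quals, while the
parameterized form does not:

CREATE INDEX garments_color_idx ON garments (color);   -- hypothetical

-- Literal, left-anchored pattern: the planner can add index quals,
-- roughly equivalent to writing them out by hand:
SELECT shirt, color, backorder_qty FROM garments
WHERE color >= 'BLUE' AND color < 'BLUF' AND color LIKE 'BLUE%';

-- Parameterized pattern: nothing can be derived at plan time, because
-- {param0} might turn out to be '_FOO' or '%FOO' at runtime:
SELECT shirt, color, backorder_qty FROM garments
WHERE color LIKE '{param0}%';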

In general, Postgres' query plans *do* depend on the values of
constants, and it's not always possible to produce an equally good plan
that doesn't assume anything about constants. This is why I think it's
a lousy idea for the system to try to automatically abstract a
parameterized query plan from the actual queries it sees. On the other
hand, an application programmer will have a very good idea of which
parts of a repeated query are really constant and which are parameters.
So what we really need is preparable parameterized queries, wherein the
application tells us what to parameterize, rather than having to guess
about it.
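
A sketch of what such an interface might look like at the SQL level
(hypothetical syntax, just to illustrate the idea):

PREPARE garments_by_color (text) AS
    SELECT shirt, color, backorder_qty FROM garments
    WHERE color LIKE $1;

EXECUTE garments_by_color ('BLUE%');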

regards, tom lane

#24Dann Corbit
DCorbit@connx.com
In reply to: Tom Lane (#23)
Re: eWeek Poll: Which database is most critical to

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, February 27, 2002 1:25 PM
To: F Harvell
Cc: Dann Corbit; Neil Conway; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to

F Harvell <fharvell@fts.net> writes:

The query plan is not going to be interested at all in
the literal value of the parameters and therefore will be the same for
any query of the same form.

Unfortunately, this is completely false.

For example, from above:

SELECT shirt, color, backorder_qty FROM garments WHERE color like
'BLUE%'

should become something on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like
'{param0}%'

You managed to pick an example that's perfectly suited to demolish your
assertion. The query with "color like 'BLUE%'" can be optimized into an
indexscan (using index quals of the form "color >= 'BLUE' and color <
'BLUF"), at least in C locale. The parameterized query cannot be
optimized at all, because the planner cannot know whether the
substituted parameter string will provide a left-anchored pattern.
What if param0 contains '_FOO' at runtime? An indexscan will be
useless in that case.

In general, Postgres' query plans *do* depend on the values of
constants, and it's not always possible to produce an equally good plan
that doesn't assume anything about constants. This is why I think it's
a lousy idea for the system to try to automatically abstract a
parameterized query plan from the actual queries it sees. On the other
hand, an application programmer will have a very good idea of which
parts of a repeated query are really constant and which are parameters.
So what we really need is preparable parameterized queries, wherein the
application tells us what to parameterize, rather than having to guess
about it.

Using the data to enhance the plan is quite a brilliant strategy.
I was not aware that PostgreSQL could do that.

Rdb has a very nice feature -- it allows you to *edit* the plan.
Obviously, you can get some real disasters that way, but for advanced
users, it is very nice.

#25Neil Conway
nconway@klamath.dyndns.org
In reply to: Henshall, Stuart - WCP (#21)
Re: eWeek Poll: Which database is most critical to your

On Wed, 2002-02-27 at 10:51, Henshall, Stuart - WCP wrote:

If I understand correctly you'll be matching on the original query
string, then pulling out the previous plan, rather than doing all the
planning again? Or were you thinking of storing the resultant tuples
(seems far more difficult to do efficiently)?

Well, those are really two different features. The second (caching
entire result sets based upon the _exact_ query string) is implemented
by MySQL, and is probably the more exotic feature of the two. There is
some debate about whether this is even worthwhile, or whether it merely
inflates benchmark numbers...

As Tom points out, the first feature (caching query plans) is probably
better implemented by allowing application developers to prepare queries
and specify their own parameters. This is a fairly conventional RDBMS
feature and it is already on the TODO list.

Either way would be handy for me though as I have a number of clients who
all basically ask the same query and then ask it again every few minutes
to update themselves. Therefore this sounds like something that would
improve performance for me.

Good to know...

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

#26Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zak Greant (#1)
Re: eWeek Poll: Which database is most critical to your

Hannu Krosing wrote:

On Wed, 2002-02-27 at 11:44, Hiroshi Inoue wrote:

Neil Conway wrote:

//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature.

Hmm, is it really a notable feature?

It makes a powerful database equally fast to a not-so-powerful one for
simple cases.

Sorry, I may be misunderstanding your point.
What I meant was that it's not a notably new feature;
I love the feature itself.

regards,
Hiroshi Inoue

#27Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#23)
Re: eWeek Poll: Which database is most critical to

In general, Postgres' query plans *do* depend on the values of
constants, and it's not always possible to produce an equally good plan
that doesn't assume anything about constants. This is why I think it's
a lousy idea for the system to try to automatically abstract a
parameterized query plan from the actual queries it sees. On the other
hand, an application programmer will have a very good idea of which
parts of a repeated query are really constant and which are parameters.
So what we really need is preparable parameterized queries, wherein the
application tells us what to parameterize, rather than having to guess
about it.

I think we could store the constants that went with the saved plan and
re-use the plan if the new constants were _similar_ to the old ones.
(Of course, figuring out _similar_ is the trick here.)
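
For instance, with the garments example from earlier in the thread
(hypothetical, as before), the first two constants are plausibly
_similar_, while the third really wants a different plan:

SELECT shirt FROM garments WHERE color LIKE 'BLUE%';  -- indexscan
SELECT shirt FROM garments WHERE color LIKE 'RED%';   -- similar: reuse plan?
SELECT shirt FROM garments WHERE color LIKE '%BLUE';  -- not left-anchored: replan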

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#28Denis Perchine
dyp@perchine.com
In reply to: Henshall, Stuart - WCP (#21)
Re: eWeek Poll: Which database is most critical to your

Hello,

If I understand correctly you'll be matching on the original query
string, then pulling out the previous plan, rather than doing all the
planning again?

No.

Or were you thinking of storing the resultant tuples (seems far more
difficult to do efficiently)?

Yes. They store the result of the query, not the plan. Storing a plan
is not as big a win as storing the results of queries that return a
relatively small number of tuples.

--
Denis

#29Michael Meskes
meskes@postgresql.org
In reply to: Tom Lane (#15)
Re: eWeek Poll: Which database is most critical to your

On Wed, Feb 27, 2002 at 12:39:16AM -0500, Tom Lane wrote:

I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications.

I don't think they are brain dead. Well, at first I thought so too,
but thinking about it some more, it made sense. After all, MySQL is used
mostly for web pages, and even your dynamic content doesn't change that
often. But in between there are thousands of concurrent accesses that
all execute the very same statement. This feature makes no sense IMO for
the "normal" use we both probably had in mind when first reading, but
for this web usage I see a benefit if it's implementable.

Michael
--
Michael Meskes
Michael@Fam-Meskes.De
Go SF 49ers! Go Rhein Fire!
Use Debian GNU/Linux! Use PostgreSQL!

#30Dave Page
dpage@vale-housing.co.uk
In reply to: Michael Meskes (#29)
Re: eWeek Poll: Which database is most critical to your

-----Original Message-----
From: Michael Meskes [mailto:meskes@postgresql.org]
Sent: 28 February 2002 07:04
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most
critical to your

On Wed, Feb 27, 2002 at 12:39:16AM -0500, Tom Lane wrote:

I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications.

I don't think they are brain dead. Well, at first I thought so too,
but thinking about it some more, it made sense. After all, MySQL is
used mostly for web pages, and even your dynamic content doesn't
change that often. But in between there are thousands of concurrent
accesses that all execute the very same statement. This feature makes
no sense IMO for the "normal" use we both probably had in mind when
first reading, but for this web usage I see a benefit if it's
implementable.

Every time someone browses http://pgadmin.postgresql.org/, this is
exactly what's going on, as well as (I imagine) with Vince's
interactive docs.

Regards, Dave.

#31F Harvell
fharvell@fts.net
In reply to: Tom Lane (#23)
Re: eWeek Poll: Which database is most critical to

On Wed, 27 Feb 2002 16:24:45 EST, Tom Lane wrote:

F Harvell <fharvell@fts.net> writes:

The query plan is not going to be interested at all in
the literal value of the parameters and therefore will be the same for
any query of the same form.

Unfortunately, this is completely false.

For example, from above:

SELECT shirt, color, backorder_qty FROM garments WHERE color like
'BLUE%'

should become something on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like
'{param0}%'

You managed to pick an example that's perfectly suited to demolish your
assertion. The query with "color like 'BLUE%'" can be optimized into an
indexscan (using index quals of the form "color >= 'BLUE' and color <
'BLUF"), at least in C locale. The parameterized query cannot be
optimized at all, because the planner cannot know whether the
substituted parameter string will provide a left-anchored pattern.
What if param0 contains '_FOO' at runtime? An indexscan will be
useless in that case.

Thanks for the feedback. In the example that was used, it is
important to note that the {param0} was the string literal "BLUE" and
not the % "operator". This IMHO ties the query to a left-anchored
pattern. I certainly do not think that the "parameter" can be
anything but a literal. Functions and operators would very likely
affect any query plan.

Is it true that the optimizer manipulates the literal? It would
seem that this would require a huge amount of processing (due to
character sets, etc.). It would appear more viable to use a simpler
optimization that does not manipulate the literal, such as an index
qual of the form "color{0,4} == 'BLUE'", rather than generating a
range comparison. Of course, this is a very simple query and I am
likely missing a critical concept.

Thanks,
F Harvell

#32Andrew McMillan
andrew@catalyst.net.nz
In reply to: Tom Lane (#15)
Re: eWeek Poll: Which database is most critical to your

On Wed, 2002-02-27 at 18:39, Tom Lane wrote:

I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read brain dead)
applications.

We have an application which we developed, and to which we specifically
added query caching of the type being discussed. This was added within
the application, rather than within the database, and provided huge
gains for the client.

One difficulty is in knowing when the exact same query will return a
different result set (i.e. when to invalidate your cached resultsets),
but as far as I can see this is just as knowable at a database level
as it was for us in the application code.

If this were available I wouldn't want it applied to _all_ queries,
however - I'd want to be able to enable it against particular tables
and/or queries (say, a 'WITH RESULTCACHE' option on the end of a
SELECT) where I knew there was going to be a dramatic win.

The reason there was such a large win for our application was that when
people go to a job website on Monday morning to look for a better place
to work, there is a strong tendency for them to be searching for much
the same sort of thing (as each other) - especially if the site is
getting into the millions of page views.

The application I talk about here was written in Progress (WebSpeed),
and we have found it much harder to do this sort of thing more recently
in Oracle or PostgreSQL.

This would be nice.

Regards,
Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267
Are you enrolled at http://schoolreunions.co.nz/ yet?

#33Dann Corbit
DCorbit@connx.com
In reply to: Andrew McMillan (#32)
Re: eWeek Poll: Which database is most critical to your

[snip]

Thought for the day: Caching this stuff could very easily exhaust the
entire server disk space in a jiffy. Any lists stashed in memory will
take away from database performance for normal sorts of queries. Any
lists stashed on disk will have to be read again. Where is the big
savings?

Sure. It would

a) make many queries faster

b) make client libs (ODBC/JDBC/ECPG) faster and simpler by not forcing
them to fake it.

But there is also a big class of applications that would benefit much
more from caching exact queries.

And it will make us as fast as MySQL for 100000 consecutive calls of
SELECT MAX(N) FROM T ;)

Try as I might, I can't think of anything that could possibly be more
useless than that.
;-)

I agree that some web apps might benefit mightily from a careful
implementation of this notion. Maybe a special function/procedure call
is in order.

If someone wants to explore this stuff, I'm all for it. But I think it
needs careful thought. I suspect that:
1. The work for retaining ACID properties of the database is ten times
harder than anyone cares to imagine with changes of this nature.
2. The benefit for most apps will be small, and will be negative if
not implemented properly.
3. The functionality can *already* be achieved by user applications
simply by holding open your own cursors and reusing them (a sketch
follows below).
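
For example (a sketch only; the jobs table is hypothetical, and the
cursor lives only as long as its transaction):

BEGIN;
DECLARE jobs_cur SCROLL CURSOR FOR
    SELECT job_title FROM jobs WHERE keywords LIKE 'java%';
FETCH 20 FROM jobs_cur;          -- first request: planned and executed once
MOVE BACKWARD ALL IN jobs_cur;   -- rewind
FETCH 20 FROM jobs_cur;          -- repeat request reuses the open cursor
COMMIT;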

For whoever wants to tackle this thing, I say, "GO for it!"
But go for it with both eyes open.

I don't think it should be anyone's priority. But that is just because
it isn't very interesting to me. Maybe other people are a lot more
keen on it.

#34Rod Taylor
rbt@zort.ca
In reply to: Dann Corbit (#5)
Re: eWeek Poll: Which database is most critical to your

(a) the performance measurements/benchmarks used by the article were
synthetic and don't reflect real database applications

I think that a Slashdot-type web application would probably benefit a
lot.

How's that? Slashdot serves (primarily) static pages which are
periodically generated. Anything that isn't static isn't hit very
often, or is dynamic enough that it doesn't matter (a second- or
third-level comment deep in a thread). It'll help generation of their
front page for those users logged in, but even then the content itself
is cached in memory on the webservers. Their second bottleneck is
using Apache with mod_perl. They'd get a lot faster results with a
custom-written webserver which serves Slashdot pages only -- even if
it was written as a daemon in Perl.

Why have the network overhead to hit a cached resource when you can
have it right in the webserver itself?

Use a table with a single row which holds a timestamp to determine
when the cache should be expired (triggers update the time on writes).
That query is done every time. That's it though; nothing else repeats
more than 5 times unless the data changes (5 as there are 5
webservers). But speed up that query by 2x and we can serve 2x more
pages :)

create table cache_expire (lastwrite timestamp);
insert into cache_expire default values;
select lastwrite from cache_expire;

Think you can get a big boost to that?
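
The write-side triggers would be something like this (a sketch; the
stories table is hypothetical):

create function touch_cache_expire() returns opaque as '
begin
    -- any write invalidates the webserver caches
    update cache_expire set lastwrite = now();
    return null;
end;
' language 'plpgsql';

create trigger stories_touch_cache
    after insert or update or delete on stories
    for each row execute procedure touch_cache_expire();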

Those phpBBs, etc. would get a much bigger boost by using a 5MB
shared-memory segment to store the articles etc. on the webserver
itself, as it avoids network and connection congestion of the database
entirely.