Postgresql Caching

Started by Anon Mous over 19 years ago · 31 messages · hackers
#1 Anon Mous
soundami@yahoo.com

Hi

I may have a workable idea on a way to add caching to Postgres without disturbing the MVCC functionality.

Caching, as I've been reading, can provide an amazing and sometimes almost unbelievable performance boost to a database-backed application, especially for data that is rarely modified.

http://www.danga.com/memcached/users.bml

The problem, as Tom Lane suggested, and the reason caching hasn't become more popular, is simply that it's very complex to keep up to date, especially in an MVCC environment. But almost every database has some tables (and usually these are among the most popular) that would benefit greatly from caching, for example security permissions or product descriptions and prices in an e-commerce application. Other tables, however, like orders and accounting, are very dynamic and cannot afford to ever be stale.

However, the problem is surmountable and has been figured out by Oracle, although I don't know how they did it:

http://www.oracle.com/technology/products/ias/joc/index.html

What if there was a way to automatically set up caching so data would only ever be cached when it's not being changed?

I think that may be possible with an extension to the PostgreSQL API that tracks which tables are involved in a transaction. Queries could be cached until a transaction starts to modify a table. At that point, the API passes all queries that reference that table to the database for processing.

In reality, even tables that are very active may see great speed improvements from caching, because most of the time transactions don't make any modifications until they are near the end of the transaction.
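To make the proposal concrete, here is a rough sketch in Python. Everything in it is hypothetical (the CachingWrapper class, the table-tracking scheme, and the dict standing in for memcached); it only illustrates the rule "cache reads per table, and purge and bypass the cache once a transaction starts modifying that table."

```python
# Hypothetical sketch of the proposed API wrapper (not a real PostgreSQL
# API). A dict stands in for memcached; `execute` stands in for a real
# driver call. Reads are cached per referenced table; the moment a
# transaction modifies a table, its cached queries are purged and reads
# on that table bypass the cache until commit.

class CachingWrapper:
    def __init__(self, execute):
        self.execute = execute    # the real database call
        self.cache = {}           # (sql, params) -> (rows, tables)
        self.dirty = set()        # tables modified by the open transaction

    def query(self, sql, params, tables):
        if self.dirty & set(tables):
            # A referenced table is being modified: go to the database.
            return self.execute(sql, params)
        key = (sql, params)
        if key not in self.cache:
            self.cache[key] = (self.execute(sql, params), set(tables))
        return self.cache[key][0]

    def modify(self, sql, params, tables):
        self.dirty |= set(tables)
        # Purge every cached query that references a dirty table.
        self.cache = {k: v for k, v in self.cache.items()
                      if not (v[1] & self.dirty)}
        return self.execute(sql, params)

    def commit(self):
        self.dirty.clear()        # tables become cacheable again


calls = []
def fake_execute(sql, params):
    calls.append(sql)
    return [("row",)]

db = CachingWrapper(fake_execute)
db.query("SELECT * FROM prices", (), ["prices"])
db.query("SELECT * FROM prices", (), ["prices"])   # served from cache
assert len(calls) == 1
db.modify("UPDATE prices SET x = 1", (), ["prices"])
db.query("SELECT * FROM prices", (), ["prices"])   # cache bypassed
assert len(calls) == 3
```

Note this sketch sidesteps the MVCC visibility questions discussed later in the thread: it only shows the bookkeeping, not how concurrent transactions would see each other's changes.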

So: would it be possible to combine a special implementation of memcache with a PostgreSQL interface wrapper?

Additional Possibilities

If the caching was implemented in an API, perhaps one could also add connection caching.

When Stale Data is OK

During a query request, let the application specifically say whether it would be OK with stale data no older than ___ seconds.

Yes, this would be a change to the API, but for those applications that can withstand receiving stale data, it could provide a significant performance boost on very active tables. I ran across a report recently suggesting that for some applications/tables, a 10 second delay can reduce database hits by over 50 percent.
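One way that staleness knob might look at the API level. This is a sketch with invented names; no such parameter exists in any PostgreSQL API today. Each cache entry remembers when it was fetched, and the caller states the staleness it can tolerate per query.

```python
# Sketch of a staleness-bounded read API (hypothetical names throughout).
# Each cache entry remembers when it was fetched; max_stale_seconds=0
# means the caller demands fresh data from the database.
import time

class StalenessCache:
    def __init__(self, execute):
        self.execute = execute        # the real database call
        self.cache = {}               # sql -> (fetched_at, rows)

    def query(self, sql, max_stale_seconds=0):
        entry = self.cache.get(sql)
        now = time.monotonic()
        if (entry is not None and max_stale_seconds > 0
                and now - entry[0] <= max_stale_seconds):
            return entry[1]           # fresh enough for this caller
        rows = self.execute(sql)
        self.cache[sql] = (now, rows)
        return rows


hits = []
db = StalenessCache(lambda sql: hits.append(sql) or [("v1",)])
db.query("SELECT * FROM orders", max_stale_seconds=10)
db.query("SELECT * FROM orders", max_stale_seconds=10)  # cached copy
assert len(hits) == 1
db.query("SELECT * FROM orders", max_stale_seconds=0)   # must be fresh
assert len(hits) == 2
```

Applications that cannot tolerate staleness simply pass 0 and always hit the database, so the dynamic tables discussed above lose nothing.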

Automatic Tuning

Let the caching system automatically track how often a particular table benefits from caching. If it doesn't, or if it's a critical table as defined by the database administrator, always bypass caching on that table.

Write Caching

On some tables, like web sessions, it may be worthwhile to implement a batching function where updates are written to disk (to be ACID compliant) and can later be more efficiently processed by the database in a batch. Of course, the API would have to detect any queries touching that table and run the batch first, but it seems that some performance gains are available, since most of the requests will be for a single row, and that would be available in the cache.
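The write-caching idea might look roughly like this. It is a hypothetical sketch: a plain list stands in for the durable on-disk log, and the batch-apply callback stands in for the database.

```python
# Sketch of the proposed write batching (hypothetical names). Updates are
# appended to a durable log (a plain list stands in for the "written to
# disk" step) and applied to the database only in batches; single-row
# reads are served from the cache, and anything else flushes the pending
# batch first so the reader never misses pending writes.

class WriteBatcher:
    def __init__(self, apply_batch):
        self.apply_batch = apply_batch   # sends a batch to the database
        self.pending = []                # stand-in for the durable log
        self.row_cache = {}              # latest value per key

    def update(self, key, value):
        self.pending.append((key, value))   # the "written to disk" step
        self.row_cache[key] = value         # single-row reads hit this

    def read(self, key):
        if key in self.row_cache:           # the common case: one row
            return self.row_cache[key]
        self.flush()                        # otherwise batch-apply first
        return None

    def flush(self):
        if self.pending:
            self.apply_batch(self.pending)
            self.pending = []


applied = []
wb = WriteBatcher(applied.extend)
wb.update("sess:1", "a")
wb.update("sess:1", "b")
assert wb.read("sess:1") == "b"   # served from cache, nothing applied yet
assert applied == []
wb.flush()
assert applied == [("sess:1", "a"), ("sess:1", "b")]
```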

Thanks

Daniel

#2 Mark Mielke
mark@mark.mielke.cc
In reply to: Anon Mous (#1)
Re: Postgresql Caching

On Sun, Oct 15, 2006 at 03:41:25AM -0700, Anon Mous wrote:

However, the problem is surmountable and has been figured out by Oracle,
although I don't know how they did it:

http://www.oracle.com/technology/products/ias/joc/index.html

I'm pretty sure this is application-side caching. The application is
able to cache intelligently and efficiently, as it is able to make
assumptions. It's only seeing one view of the data. The view is
internally consistent, and only the application is making changes
to the view that it sees.

On the rest of your thinking:

Are people comfortable with the belief that tables that do not change
are a valuable case to consider for caching?

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#3 Merlin Moncure
mmoncure@gmail.com
In reply to: Anon Mous (#1)
Re: Postgresql Caching

On 10/15/06, Anon Mous <soundami@yahoo.com> wrote:

Would it be possible to combine a special memcache implementation of
memcache with a Postgresql interface wrapper?

have you seen
http://people.freebsd.org/~seanc/pgmemcache/

merlin

#4 Mark Mielke
mark@mark.mielke.cc
In reply to: Merlin Moncure (#3)
Re: Postgresql Caching

On Sun, Oct 15, 2006 at 08:31:36PM +0530, Merlin Moncure wrote:

On 10/15/06, Anon Mous <soundami@yahoo.com> wrote:

Would it be possible to combine a special memcache implementation of
memcache with a Postgresql interface wrapper?

have you seen
http://people.freebsd.org/~seanc/pgmemcache/

Interesting. I note that they don't address the view consistency
problem any better than an application using memcached directly.
And that's the real problem with memcached, and why people are
tempted to 'indulge' by relying on PostgreSQL. Some people value
the consistency. Others don't. memcached, whether application-side,
or whether automatically invoked by triggers (pgmemcache) is a
decision to ignore the consistency.

Using memcache, I've had problems with consistency brought right to
the front. Both of these have failed me:

1) When updating a PostgreSQL record, I invalidate the memcache record.
If another process comes along in parallel before I commit, notices
that the memcache record is invalidated, it queries the data from
SQL, and updates the memcache record back to the old value. :-(

2) When updating a PostgreSQL record, I updated the memcache record
to the new value. If another process comes along in parallel before
I commit, that is still looking at an older view, cross-referencing
may not work as expected.

I'm currently settled on 2), but setting a short timeout (5 seconds) on
the data. Still an imperfect compromise between speed and accuracy, but
it isn't causing me problems... yet.
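That first failure mode is easy to reproduce in miniature. In the sketch below, plain dicts stand in for memcached and for the committed database state; the interleaving in the comments is exactly the race described in 1).

```python
# Miniature simulation of failure mode 1. A dict stands in for memcached,
# another for the committed database state; the interleaving is the race.

memcache = {}
committed = {"price": 100}   # the value other transactions can see

# Writer, mid-transaction: invalidates the cache before committing.
memcache.pop("price", None)
new_value = 200              # not yet committed

# Parallel reader: sees a cache miss, reads the committed (old!) value
# from SQL, and writes it back into the cache.
if "price" not in memcache:
    memcache["price"] = committed["price"]

# Writer commits.
committed["price"] = new_value

# The cache now holds the stale value even though the commit succeeded.
assert memcache["price"] == 100
assert committed["price"] == 200
```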

I don't see memcache as a general solution to query plan or query
result caching. Along these lines, I would look more towards having
the query plans or query results stored in cache along with the
transaction numbers that would let us know whether either is valid.

Consistency is very valuable to me. If it wasn't for memcache being
hundreds or more times faster, I wouldn't use it in the cases I do.
It can be dangerous.

Cheers,
mark


#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Mark Mielke (#4)
Re: Postgresql Caching

mark@mark.mielke.cc wrote:

Using memcache, I've had problems with consistency brought right to
the front. Both of these have failed me:

1) When updating a PostgreSQL record, I invalidate the memcache record.
If another process comes along in parallel before I commit, notices
that the memcache record is invalidated, it queries the data from
SQL, and updates the memcache record back to the old value. :-(

2) When updating a PostgreSQL record, I updated the memcache record
to the new value. If another process comes along in parallel before
I commit, that is still looking at an older view, cross-referencing
may not work as expected.

Hmm, have you tried having a NOTIFY when the Postgres record is updated,
and having a listener daemon that updates the value when it gets the
notification?

One issue is that if the listener is not running, you don't get to
update or invalidate the record, which would cause consistency problems.
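A sketch of what that listener daemon might look like. It is simulated so it runs standalone: a plain list stands in for the notifications a driver would deliver after LISTEN, and the trigger-side NOTIFY appears only in a comment.

```python
# Sketch of the listener-daemon idea, simulated so it runs standalone.
# In a real setup a trigger on the table would run something like
#     NOTIFY price_changes, '42';
# and the daemon would LISTEN price_changes and drain the notifications
# its driver delivers; here a plain list stands in for that queue.

memcache = {"price:42": "old", "price:7": "also-old"}

def handle_notifications(notifications, cache):
    """Invalidate one cached record per notification payload (a row id)."""
    for row_id in notifications:
        cache.pop("price:%s" % row_id, None)

# Pretend the trigger fired for row 42 after an UPDATE committed.
handle_notifications(["42"], memcache)
assert "price:42" not in memcache     # invalidated
assert memcache["price:7"] == "also-old"  # untouched
```

Because the NOTIFY is delivered only after the writer commits, this avoids the pre-commit repopulation race in failure mode 1), at the cost of a window of staleness between commit and invalidation (and the listener-downtime problem noted above).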

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#6 Shane Ambler
pgsql@007Marketing.com
In reply to: Merlin Moncure (#3)
Re: Postgresql Caching

Merlin Moncure wrote:

On 10/15/06, Anon Mous <soundami@yahoo.com> wrote:

Would it be possible to combine a special memcache implementation of
memcache with a Postgresql interface wrapper?

have you seen
http://people.freebsd.org/~seanc/pgmemcache/

merlin

Now you got me thinkin ;-P

Just throwing some ideas around -

What if we could do something like

CREATE TABLESPACE myramcache LOCATION MEMORY(2GB);

CREATE TABLE mycache (
id integer, data varchar(50))
USING TABLESPACE myramcache;

INSERT INTO mycache SELECT id,data FROM myrealtable;

You could setup a table in memory to contain known popular data, you
could also use this to create a temporary table in memory to speed up
multiple intermediate calculations without touching disks.

Or maybe just a view for caching -

CREATE MEMORY VIEW mycacheview
USING MAX(2GB) FOR LIFE(10)
AS SELECT * FROM myrealtable;

which would cache SELECTed rows in RAM for LIFE seconds before purging;
inserts/updates to myrealtable would automatically purge, or could carry
a trigger that purges, all or some of the view cache.

Or better yet maybe the memory tablespace idea could also be extended to
allow CREATE VIEW ... USING TABLESPACE myramcache LIFE(10);

TABLESPACE LOCATION MEMORY would seem to give an opportunity for later
expansion.

The memory tablespace idea could be expanded to work with something like
memcached (pg_ramcache_slave ?) - allowing multiple machines to work as
a ram cache for the server.

Something like -
CREATE MEMCACHE group1 SLAVE 192.168.0.5;
CREATE MEMCACHE group1 SLAVE 192.168.0.6 PORT 5436;
CREATE MEMCACHE group2 SLAVE 192.168.0.7;
CREATE TABLESPACE myramcache LOCATION MEMORY WITH group1 SLAVES;
CREATE TABLESPACE myramcache2 LOCATION MEMORY WITH group2 SLAVES;

Probably want to put in some limits such as only temporary tables and
'caching' views are allowed in memory tablespace.

Apart from temp tables these could all be saved into system tables so
they are re-created upon server startup.

#7 Andrew Dunstan
andrew@dunslane.net
In reply to: Alvaro Herrera (#5)
Re: Postgresql Caching

Alvaro Herrera wrote:

mark@mark.mielke.cc wrote:

Using memcache, I've had problems with consistency brought right to
the front. Both of these have failed me:

1) When updating a PostgreSQL record, I invalidate the memcache record.
If another process comes along in parallel before I commit, notices
that the memcache record is invalidated, it queries the data from
SQL, and updates the memcache record back to the old value. :-(

2) When updating a PostgreSQL record, I updated the memcache record
to the new value. If another process comes along in parallel before
I commit, that is still looking at an older view, cross-referencing
may not work as expected.

Hmm, have you tried having a NOTIFY when the Postgres record is updated,
and having a listener daemon that updates the value when it gets the
notification?

One issue is that if the listener is not running you don't get to
update nor invalidate the record, which would cause consistency problems.

The original poster seemed to be talking about seldom/never changed data.
But the interesting and often-discussed case is really rapidly changing
data such as shopping carts, web session objects etc.

On a somewhat unrelated note, I recently discussed a high performance
postgres installation with the CIO, who told me they were using a large
array of high end solid state disks (there is some serious money here) for
the backing store. That's probably something worth thinking about for high
traffic sites, which might help to relieve the problems for which caching
or in-memory tables are thought to be possible answers.

cheers

andrew

#8 Lexington Luthor
Lexington.Luthor@gmail.com
In reply to: Mark Mielke (#4)
Re: Postgresql Caching

mark@mark.mielke.cc wrote:

Using memcache, I've had problems with consistency brought right to
the front. Both of these have failed me:

1) When updating a PostgreSQL record, I invalidate the memcache record.
If another process comes along in parallel before I commit, notices
that the memcache record is invalidated, it queries the data from
SQL, and updates the memcache record back to the old value. :-(

How can this fail? The PostgreSQL MVCC will hold the second transaction
back until the effect on the tuple is known (i.e. after the first
transaction is over). Have you not been using SERIALIZABLE transactions?

With a bit of careful planning (and a few SELECT FOR UPDATE queries to
prevent deadlock), having perfect consistency and correct caching is
possible.

2) When updating a PostgreSQL record, I updated the memcache record
to the new value. If another process comes along in parallel before
I commit, that is still looking at an older view, cross-referencing
may not work as expected.

This breaks integrity, and all bets are off.

I'm currently settled on 2), but setting a short timeout (5 seconds) on
the data. Still an imperfect compromise between speed and accuracy, but
it isn't causing me problems... yet.

What exactly does your application do about the possibility of incorrect
data?

Consistency is very valuable to me. If it wasn't for memcache being
hundreds or more times faster, I wouldn't use it in the cases I do.
It can be dangerous.

Consistency and caching are not mutually exclusive, and there are many
frameworks that handle the burden of maintaining both for you.

Regards,
LL

#9 Mark Mielke
mark@mark.mielke.cc
In reply to: Lexington Luthor (#8)
Re: Postgresql Caching

On Sun, Oct 15, 2006 at 06:43:43PM +0100, Lexington Luthor wrote:

mark@mark.mielke.cc wrote:

Using memcache, I've had problems with consistency brought right to
the front. Both of these have failed me:
1) When updating a PostgreSQL record, I invalidate the memcache record.
If another process comes along in parallel before I commit, notices
that the memcache record is invalidated, it queries the data from
SQL, and updates the memcache record back to the old value. :-(

How can this fail? The PostgreSQL MVCC will hold the second transaction
back until the effect on the tuple is known (i.e. after the first
transaction is over). Have you not been using SERIALIZABLE transactions?

I don't *want* to use SERIALIZABLE transactions. That would grind my
application to a halt.

Consistency isn't the same as serializable. Memcache offers *NEITHER*.

With a bit of careful planning (and a few SELECT FOR UPDATE queries to
prevent deadlock), having perfect consistency and correct caching is
possible.

Your conclusion is false for all cases except data that will never change.

You can never have perfect consistency across different systems (memcache /
postgresql) and especially not when their visibility rules differ. What is
visible to something via memcache is always latest uncommitted. What is
visible in PostgreSQL is something less than that. Consistency is not
possible. Correct caching is therefore also not possible unless you define
correct as 'latest', and even then, you have problems if memcache expires
the record before the real record has been committed into PostgreSQL.

Under a significant enough load, it becomes visible.

2) When updating a PostgreSQL record, I updated the memcache record
to the new value. If another process comes along in parallel before
I commit, that is still looking at an older view, cross-referencing
may not work as expected.

This breaks integrity, and all bets are off.

Both break integrity.

I'm currently settled on 2), but setting a short timeout (5 seconds) on
the data. Still an imperfect compromise between speed and accuracy, but
it isn't causing me problems... yet.

What exactly does your application do about the possibility of incorrect
data?

Right now? I've limited it to display problems. Wrong counts. I think
I tracked down all the significant problems. For me, "latest" is often
equally good to "consistent" where memcache is giving "latest" and
PostgreSQL is giving "consistent".

Consistency is very valuable to me. If it wasn't for memcache being
hundreds or more times faster, I wouldn't use it in the cases I do.
It can be dangerous.

Consistency and caching are not mutually exclusive, and there are many
frameworks that handle the burden of maintaining both for you.

Consistency and memcached *are* mutually exclusive.

memcached provides no mechanisms for consistency.

Close may be good enough for many. Close is the best that it can do.

Cheers,
mark


#10 Mark Mielke
mark@mark.mielke.cc
In reply to: Shane Ambler (#6)
Re: Postgresql Caching

On Mon, Oct 16, 2006 at 03:08:39AM +0930, Shane Ambler wrote:

Just throwing some ideas around -
What if we could do something like
CREATE TABLESPACE myramcache LOCATION MEMORY(2GB);
CREATE TABLE mycache (
id as integer, data as varchar(50))
USING TABLESPACE myramcache;
INSERT INTO mycache SELECT id,data FROM myrealtable;
...
You could setup a table in memory to contain known popular data, you
could also use this to create a temporary table in memory to speed up
multiple intermediate calculations without touching disks.

I'm not sure what this would save. If the table is read-only, there
shouldn't be writes happening. If it's small, and frequently accessed,
it should fit in the buffer cache.

None of this avoids the cost of query planning, or query execution.

Cheers,
mark


#11 Mark Mielke
mark@mark.mielke.cc
In reply to: Lexington Luthor (#8)
Re: Postgresql Caching

On Sun, Oct 15, 2006 at 06:43:43PM +0100, Lexington Luthor wrote:

With a bit of careful planning (and a few SELECT FOR UPDATE queries to
prevent deadlock), having perfect consistency and correct caching is
possible.

I didn't respond directly to this claim of yours.

SELECT FOR UPDATE is only useful if I'm going to do SELECT. If I am
using memcache, one would presume that I am using it in place of
SELECT, to improve performance. If I'm going to SELECT and then
memcache, I haven't gained anything.

Cheers,
mark


#12 Shane Ambler
pgsql@007Marketing.com
In reply to: Mark Mielke (#10)
Re: Postgresql Caching

mark@mark.mielke.cc wrote:

On Mon, Oct 16, 2006 at 03:08:39AM +0930, Shane Ambler wrote:

You could setup a table in memory to contain known popular data, you
could also use this to create a temporary table in memory to speed up
multiple intermediate calculations without touching disks.

I'm not sure what this would save. If the table is read-only, there
shouldn't be writes happening. If it's small, and frequently accessed,
it should fit in the buffer cache.

Just because it is frequently accessed doesn't mean that it is small - the
main point is control over what is cached and a starting point for other
options mentioned later.

None of this avoids the cost of query planning, or query execution.

No but you can avoid costly disk access and still have the postgres
level of integrity and integration that memcached doesn't offer.

#13 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Shane Ambler (#12)
Re: Postgresql Caching

Shane Ambler <pgsql@007Marketing.com> writes:

mark@mark.mielke.cc wrote:

None of this avoids the cost of query planning, or query execution.

No but you can avoid costly disk access and still have the postgres
level of integrity and integration that memcached doesn't offer.

If you're just trying to cache data, it's not clear what you are doing
that the shared buffer cache and/or kernel-level disk cache doesn't
do already.

regards, tom lane

#14 Mark Mielke
mark@mark.mielke.cc
In reply to: Shane Ambler (#12)
Re: Postgresql Caching

On Mon, Oct 16, 2006 at 05:14:59AM +0930, Shane Ambler wrote:

mark@mark.mielke.cc wrote:

On Mon, Oct 16, 2006 at 03:08:39AM +0930, Shane Ambler wrote:

You could setup a table in memory to contain known popular data, you
could also use this to create a temporary table in memory to speed up
multiple intermediate calculations without touching disks.

I'm not sure what this would save. If the table is read-only, there
shouldn't be writes happening. If it's small, and frequently accessed,
it should fit in the buffer cache.

Just because it is frequently accessed doesn't mean that it is small - the
main point is control over what is cached and a starting point for other
options mentioned later.

This presumes that your instructions will do a better job than it is
already doing. You are telling it "use this much cache memory". Only,
if you specify more than the amount of RAM you have, this will be
swapped to disk, and you won't have avoided a disk access. If you
specify less than the amount of RAM you have, you are preventing
PostgreSQL or the kernel from deciding that another page is more
valuable than your "large static table".

None of this avoids the cost of query planning, or query execution.

No but you can avoid costly disk access and still have the postgres
level of integrity and integration that memcached doesn't offer.

If PostgreSQL or the kernel is caching it, this is zero.

As a thought experiment, I'm not seeing the benefit. I think if you
could prove a benefit, then any proof you provided could be used to
improve the already existing caching layers, and would apply equally
to read-only or read-write pages. For example, why not be able to
hint to PostgreSQL that a disk-based table should be considered a
priority to keep in RAM. That way, PostgreSQL would avoid pushing
pages from this table out.

I'm not convinced that this would be a gain, though. I highly expect
that an LRU rule is better than a statically defined "keep this in
RAM" rule.

Cheers,
mark


#15 Shane Ambler
pgsql@007Marketing.com
In reply to: Mark Mielke (#14)
Re: Postgresql Caching

mark@mark.mielke.cc wrote:

As a thought experiment, I'm not seeing the benefit. I think if you
could prove a benefit, then any proof you provided could be used to
improve the already existing caching layers, and would apply equally
to read-only or read-write pages. For example, why not be able to
hint to PostgreSQL that a disk-based table should be considered a
priority to keep in RAM. That way, PostgreSQL would avoid pushing
pages from this table out.

If memcached (or pgmemcached implemented in triggers) can show a speed
improvement from RAM-based caching of specific data (even with network
overhead), then it stands to reason that this RAM-based cache can be
integrated into Postgres tightly enough to overcome the issues that
pgmemcached has. So I threw some ideas out there to get others thinking
along these lines, to see if we can come up with a way to improve or
integrate this principle.

My original thoughts were integrating it into the sql level to allow the
database structure to define what we would want to cache in ram, which
is similar to what is happening with using pgmemcached.
Expanding CREATE TABLE to specify that a table gets priority in cache,
or allocating x amount of cache to table y, could be a better way than
saying "keep all of this table in RAM".

I think the main benefit of my first ideas would come from the later
examples I gave where create memory tablespace with slaves would allow
the use of extra machines, effectively increasing the ram available
outside the current postgres setup.

Maybe implementing this idea as a way of increasing the current postgres
caching would be a better implementation than the memory tablespaces
idea. As in integrating a version of pgmemcached as an option into the
current caching layers. Thus implementing it at the config level instead
of the structure design. Although defining tables to get priority or
allocated space in the ram cache would fit well with that.

#16 Josh Berkus
josh@agliodbs.com
In reply to: Mark Mielke (#4)
Re: Postgresql Caching

Mark,

Using memcache, I've had problems with consistency brought right to
the front. Both of these have failed me:

1) When updating a PostgreSQL record, I invalidate the memcache record.
If another process comes along in parallel before I commit, notices
that the memcache record is invalidated, it queries the data from
SQL, and updates the memcache record back to the old value. :-(

2) When updating a PostgreSQL record, I updated the memcache record
to the new value. If another process comes along in parallel before
I commit, that is still looking at an older view, cross-referencing
may not work as expected.

The answer is that cached values are not expected to be consistent. If they
were, then they'd have to have all of the transaction overhead which
PostgreSQL has, and lose any gain in efficiency.

Generally, memcached is best used for values that don't get kept in the
database at all. Example (how I use it)

Using pgmemcache for user session information:

1) User logs in. Their session information is stored in the permanent
user_session table and the pgmemcache pseudotable, user_session_cache.
2) User browses the website. This requires updating their session every time
a page is loaded with their last activity timestamp, their page path (for
backtracking) and their ip information (for hijack prevention). This
informaiton is recorded in user_session_cache *only*, with the presumption
that it will be lost if the server goes down.
3) User logs out (or is auto-logged-out). Keys are deleted from
user_session_cache and their exit information is written to the permanent
table user_session.
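The three steps above reduce to something like this toy sketch, where dicts stand in for the user_session table and the user_session_cache pseudotable (the table names come from the message; everything else is invented):

```python
# Toy sketch of the session flow described above: `permanent` stands in
# for the user_session table, `cache` for the user_session_cache
# pseudotable. Per-pageview churn touches only the cache; the permanent
# table is written only at login and logout.

permanent, cache = {}, {}

def login(user):
    permanent[user] = {"logged_in": True}          # 1) durable session row
    cache[user] = {"last_activity": 0, "path": []}

def page_view(user, ts, page):
    cache[user]["last_activity"] = ts              # 2) cache-only updates,
    cache[user]["path"].append(page)               #    lost on server crash

def logout(user):
    exit_info = cache.pop(user)                    # 3) delete cache keys,
    permanent[user] = {"logged_in": False,         #    persist exit info
                       "last_activity": exit_info["last_activity"]}


login("alice")
page_view("alice", 10, "/home")
page_view("alice", 20, "/cart")
logout("alice")
assert permanent["alice"]["last_activity"] == 20
assert "alice" not in cache
```

Here two page views cost zero writes to the permanent table; only the final state survives a logout, which is exactly the trade-off described below.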

The advantage of this is that it allows lowering the amount of write activity
to the user_session table by 95% with no loss of information we care about.
Of course, if you are using a Java or .NET application server, it probably
provides the above functionality itself.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

#17 Mark Mielke
mark@mark.mielke.cc
In reply to: Josh Berkus (#16)
Re: Postgresql Caching

On Sun, Oct 15, 2006 at 02:39:36PM -0700, Josh Berkus wrote:

Using memcache, I've had problems with consistency brought right to
the front. Both of these have failed me:
1) When updating a PostgreSQL record, I invalidate the memcache record.
If another process comes along in parallel before I commit, notices
that the memcache record is invalidated, it queries the data from
SQL, and updates the memcache record back to the old value. :-(

2) When updating a PostgreSQL record, I updated the memcache record
to the new value. If another process comes along in parallel before
I commit, that is still looking at an older view, cross-referencing
may not work as expected.

The answer is that cached values are not expected to be consistent. If they
were, then they'd have to have all of the transaction overhead which
PostgreSQL has, and lose any gain in efficiency.

Agree.

Generally, memcached is best used for values that don't get kept in the
database at all. Example (how I use it)

Using pgmemcache for user session information:

1) User logs in. Their session information is stored in the permanent
user_session table and the pgmemcache pseudotable, user_session_cache.
2) User browses the website. This requires updating their session every time
a page is loaded with their last activity timestamp, their page path (for
backtracking) and their IP information (for hijack prevention). This
information is recorded in user_session_cache *only*, with the presumption
that it will be lost if the server goes down.
3) User logs out (or is auto-logged-out). Keys are deleted from
user_session_cache and their exit information is written to the permanent
table user_session.

Is there a benefit here to doing this with pgmemcache over the application
calling in to memcache directly?

Are you able to effectively and efficiently include memcache derived
information within select queries that include information you want
to pull out of the database?

I like the logout commits information part, and it's something I've been
meaning to do, but haven't gotten around to.

The advantage of this is that it allows lowering the amount of write activity
to the user_session table by 95% with no loss of information we care about.
Of course, if you are using a Java or .NET application server, it probably
provides the above functionality itself.

Agree.

Cheers,
mark


#18 Mark Mielke
mark@mark.mielke.cc
In reply to: Shane Ambler (#15)
Re: Postgresql Caching

On Mon, Oct 16, 2006 at 07:00:20AM +0930, Shane Ambler wrote:

mark@mark.mielke.cc wrote:

As a thought experiment, I'm not seeing the benefit. I think if you
could prove a benefit, then any proof you provided could be used to
improve the already existing caching layers, and would apply equally
to read-only or read-write pages. For example, why not be able to
hint to PostgreSQL that a disk-based table should be considered a
priority to keep in RAM. That way, PostgreSQL would avoid pushing
pages from this table out.

If memcached (or pgmemcached implemented in triggers) can show a speed
improvement using ram based caching (even with network overhead) of
specific data then it stands to reason that this ram based cache can be
integrated into postgres with better integration that will overcome the
issues that pgmemcached has.

I think you might misunderstand how memcache gets its speed. It gets
away from reading or writing to disk because it makes no promise that
the data is safe, even in RAM. memcache may choose to overwrite the
data at any time.

It makes no consistency guarantees. There are no read/write
boundaries. You and I can both read, and both write, and the result
is indeterminate.

It limits itself to only two fields per row. A key and a value. Both
fields must be string types.

Based on all of the assumptions above, read and write operations are
fast enough that they can be serialized. This allows the server to be a
single process with a single thread, with no need for coordination
between processes or threads, no locking of data structures, and no
waiting on any resources. Requests can be dispatched immediately.
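The model described above can be sketched with a plain dictionary: a
single-threaded store that serializes string get/set requests and may
evict any entry at any time. This is an illustrative stand-in, not
memcached's actual code; the class name and eviction limit are invented
for the demo.

```python
from collections import OrderedDict

class TinyCache:
    """Illustrative stand-in for memcache's model: string keys and
    values, no transactions, and entries may be evicted at any time."""

    def __init__(self, max_items=2):
        self.data = OrderedDict()
        self.max_items = max_items

    def set(self, key: str, value: str) -> None:
        # Last write wins unconditionally; there are no read/write
        # boundaries, so concurrent writers get an indeterminate result.
        self.data[key] = value
        self.data.move_to_end(key)
        while len(self.data) > self.max_items:
            self.data.popitem(last=False)  # evict the oldest entry

    def get(self, key: str):
        # A miss may mean "never stored" or "evicted on a whim".
        return self.data.get(key)

cache = TinyCache(max_items=2)
cache.set("price:42", "19.99")
cache.set("desc:42", "blue widget")
cache.set("price:43", "5.00")       # forces eviction of price:42
assert cache.get("price:42") is None
```

Every operation is a dictionary probe, which is why such a server can
afford to serialize all requests on one thread.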

What of the above fits into PostgreSQL? Can PostgreSQL choose to remove
records on a whim? Can PostgreSQL get away with removing transaction
boundaries, and making specific tables always read and write to latest?
Can PostgreSQL tables be limited to two fields? Can PostgreSQL get away
with zero synchronization between processes or threads?

The architectures are almost opposite of each other. I don't see how you
could combine the architectures. Effectively, you would need to have two
engines inside PostgreSQL, with the engine type selected from the table
type. MySQL sort of does this. In MySQL, some tables support transactions
while others do not. Is that what you are asking for?

My original thoughts were integrating it into the sql level to allow the
database structure to define what we would want to cache in ram, which
is similar to what is happening with using pgmemcached.

In my experience, the most costly part of SQL, for very simple queries,
is the query planning. As soon as you have more than one key and one
value, you require query planning of some sort. I believe this is the
SQL overhead: parsing the SQL and determining how best to execute it.

Lighter weight databases, such as BerkeleyDB already exist to do what
you are asking for. memcache makes few guarantees. BerkeleyDB and
similar gives you transactions. PostgreSQL and similar give you SQL.
Each level of abstraction costs.

Expanding CREATE TABLE to specify that a table gets priority in the
cache, or to allocate x amount of cache to table y, could be a better
way than saying "keep all of this table in RAM".

Or, it could be a worse way. Where is the evidence that it would be better?

I think the main benefit of my first ideas would come from the later
examples I gave where create memory tablespace with slaves would allow
the use of extra machines, effectively increasing the ram available
outside the current postgres setup.

MySQL has some sort of distributed scheme like this, based on a
partitioning of the keys. I don't know how good it is. Other than
the sales pitch we were given when MySQL gave us a presentation, I
haven't heard of it in use by others.
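Key partitioning in its simplest form is just hashing the key to pick a
node. The sketch below assumes nothing about MySQL's actual
implementation; the node list and function name are invented for
illustration.

```python
import hashlib

NODES = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

def node_for(key: str) -> str:
    # Deterministic key -> node mapping: every client that hashes the
    # same way agrees on where a key lives, with no coordination.
    digest = hashlib.md5(key.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(NODES)
    return NODES[index]

# The same key always routes to the same node.
assert node_for("user:1") == node_for("user:1")
assert node_for("user:1") in NODES
```

A plain modulus remaps most keys whenever a node is added or removed,
which is why real deployments tend to prefer consistent hashing.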

Maybe implementing this idea as a way of extending the current Postgres
caching would be a better implementation than the memory-tablespaces
idea, as in integrating a version of pgmemcached as an option into the
current caching layers, thus implementing it at the config level instead
of in the structure design. Although defining tables that get priority
or allocated space in the RAM cache would fit well with that.

If there is a problem with the current cache algorithm, it should be fixed.

What is the problem with it?

I think the memcache people are thinking that the cost of PostgreSQL is
about the disk. Although the disk plays a part, I'm pretty sure it's
only a fraction. Not providing transaction guarantees, not providing an
SQL level abstraction, and not having multiple processes or threads
plays a much bigger part.

Cheers,
mark


#19Jeremy Drake
pgsql@jdrake.com
In reply to: Mark Mielke (#4)
Re: Postgresql Caching

On Sun, 15 Oct 2006, mark@mark.mielke.cc wrote:

On Sun, Oct 15, 2006 at 08:31:36PM +0530, Merlin Moncure wrote:

On 10/15/06, Anon Mous <soundami@yahoo.com> wrote:

Would it be possible to combine a special memcache implementation of
memcache with a Postgresql interface wrapper?

have you seen
http://people.freebsd.org/~seanc/pgmemcache/

Interesting. I note that they don't address the view consistency
problem any better than an application using memcached directly.
And that's the real problem with memcached, and why people are
tempted to 'indulge' by relying on PostgreSQL. Some people value
the consistency. Others don't. memcached, whether application-side,
or whether automatically invoked by triggers (pgmemcache) is a
decision to ignore the consistency.

Using memcache, I've had problems with consistency brought right to
the front. Both of these have failed me:

1) When updating a PostgreSQL record, I invalidate the memcache record.
If another process comes along in parallel before I commit and
notices that the memcache record is invalidated, it queries the data
from SQL and sets the memcache record back to the old value. :-(

2) When updating a PostgreSQL record, I update the memcache record
to the new value. If another process comes along in parallel before
I commit and is still looking at an older view, cross-referencing
may not work as expected.
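Race (1) can be replayed deterministically by interleaving the steps by
hand; plain dictionaries stand in for PostgreSQL and memcache in this
sketch, and the key names are invented for the demo.

```python
# A deterministic replay of race (1): invalidate, then a racing reader
# repopulates the cache with the still-visible old value.
db_committed = {"user:1": "old"}   # what MVCC readers see until commit
cache = {"user:1": "old"}

# Writer: begins a transaction, invalidates the cache, has not committed.
pending_write = ("user:1", "new")
cache.pop("user:1", None)

# Reader (in parallel, before the writer commits): cache miss, so it
# reads the still-visible old row and repopulates the cache with it.
value = cache.get("user:1")
if value is None:
    value = db_committed["user:1"]   # MVCC still shows the old value
    cache["user:1"] = value

# Writer commits.
db_committed[pending_write[0]] = pending_write[1]

# The cache now holds the stale value indefinitely.
assert cache["user:1"] == "old"
assert db_committed["user:1"] == "new"
```

Nothing ever invalidates the repopulated entry again, which is why the
stale value persists until someone else writes the key.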

Shouldn't you be able to use 2-stage commit for this? Prepare to commit,
update the memcache record, then commit? Or am I thinking of something
else?

I'm currently settled on 2), but setting a short timeout (5 seconds) on
the data. Still an imperfect compromise between speed and accuracy, but
it isn't causing me problems... yet.
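The compromise described, option (2) plus a short expiry, bounds how
long a stale value can survive. A minimal sketch, with hypothetical
helper names and a plain dict standing in for memcache (real clients
accept an expiry argument on set):

```python
import time

cache = {}  # key -> (value, expires_at)

def cache_set(key, value, ttl=5.0):
    # Even if a racing reader or writer leaves a stale value behind,
    # it can live at most ttl seconds.
    cache[key] = (value, time.monotonic() + ttl)

def cache_get(key):
    entry = cache.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.monotonic() >= expires_at:
        del cache[key]  # expired: force the next reader back to SQL
        return None
    return value

cache_set("user:1", "new", ttl=5.0)
assert cache_get("user:1") == "new"
cache_set("user:2", "x", ttl=-1.0)  # negative ttl: already expired
assert cache_get("user:2") is None
```

The 5-second figure trades accuracy for speed exactly as described:
readers may see a value up to ttl seconds out of date, but never older.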

--
Fortune's Real-Life Courtroom Quote #32:

Q: Do you know how far pregnant you are right now?
A: I will be three months November 8th.
Q: Apparently then, the date of conception was August 8th?
A: Yes.
Q: What were you and your husband doing at that time?

#20Harvell F
fharvell@file13.info
In reply to: Mark Mielke (#18)
Re: Postgresql Caching

On 15 Oct 2006, at 19:55, mark@mark.mielke.cc wrote:

On Mon, Oct 16, 2006 at 07:00:20AM +0930, Shane Ambler wrote:

mark@mark.mielke.cc wrote:

As a thought experiment, I'm not seeing the benefit. I think if you
could prove a benefit, then any proof you provided could be used to
improve the already existing caching layers, and would apply equally
to read-only or read-write pages. For example, why not be able to
hint to PostgreSQL that a disk-based table should be considered a
priority to keep in RAM. That way, PostgreSQL would avoid pushing
pages from this table out.

If memcached (or pgmemcached implemented in triggers) can show a speed
improvement using RAM-based caching of specific data (even with network
overhead), then it stands to reason that this RAM-based cache could be
integrated into Postgres tightly enough to overcome the issues that
pgmemcached has.

...
I think the memcache people are thinking that the cost of PostgreSQL is
about the disk. Although the disk plays a part, I'm pretty sure it's
only a fraction. Not providing transaction guarantees, not providing an
SQL level abstraction, and not having multiple processes or threads
plays a much bigger part.

Forgive my intrusion and perhaps simplistic viewpoint, however,
improved caching would be of great benefit for me as a web developer.

I wholeheartedly agree that disk IO is often a small part of
the expense of obtaining data from the database, especially for the
typical web-based application. Query parsing, joining, sorting, etc.
are all likely to be the real culprits. The existing caching mechanisms
(as I understand them), and especially the kernel disk caches, do
nothing to eliminate these overhead costs.

I would venture that the 80/20 rule applies here as in many, many
other instances. A full 80+% of the queries performed against the
database are performed over and over and over again with the same
criteria for a period of time and then the criteria changes for the
next period of time. This would be particularly true for seldom
changed tables that, for example, contain a list of the day's
advertisements. The data is changed slowly, once a day or once a
week, but, a query is made for every page hit. Usually the exact
same query.

I know, for you purists out there, that this is an obvious call
for an application level cache. Perhaps so, however, it complicates
the end-programmer environment _and_ it has consistency
disadvantages. Many of the programming languages being used provide
direct interfaces to PostgreSQL (not surprising given that the
programmers are using PostgreSQL) and some may even provide a caching
mechanism. Best case, integrating the two remains a task for the end-
programmer, worse case, the end-programmer has to implement the cache
as well. Rolling a cache into the database removes that complexity
by incorporating it into the existing PostgreSQL API. (BTW, I'm
aware that the consistency disadvantages of the application level
cache could probably be overcome by implementing notify in the cache
but, again, at added end-programmer expense.)

Getting back to the original posting, as I remember it, the
question was about seldom changed information. In that case, and
assuming a repetitive query as above, a simple query results cache
that is keyed on the passed SQL statement string and that simply
returns the previously cooked result set would be a really big
performance win.

Registering each cache entry by the tables included in the query,
and invalidating those entries on a committed update or insert
transaction to any of those tables, would, transparently, solve the
consistency problem.
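The proposal above, keying on the SQL text and registering each entry
under the tables it reads, can be sketched as follows. This is a
hypothetical illustration, not an existing PostgreSQL facility; all
names are invented.

```python
class ResultCache:
    """Sketch of a query-result cache keyed on the SQL string, with
    each entry registered under the tables it reads so a committed
    write to any of those tables invalidates it."""

    def __init__(self):
        self.results = {}        # sql text -> cooked result set
        self.by_table = {}       # table name -> set of sql texts

    def put(self, sql, tables, result):
        self.results[sql] = result
        for t in tables:
            self.by_table.setdefault(t, set()).add(sql)

    def get(self, sql):
        # A hit skips parsing, planning, joining, and sorting entirely.
        return self.results.get(sql)

    def on_commit(self, table):
        # A committed INSERT/UPDATE to `table` drops every cached
        # result that read from it.
        for sql in self.by_table.pop(table, set()):
            self.results.pop(sql, None)

rc = ResultCache()
rc.put("SELECT * FROM ads WHERE day = 'mon'", ["ads"], [("ad1",)])
assert rc.get("SELECT * FROM ads WHERE day = 'mon'") == [("ad1",)]
rc.on_commit("ads")
assert rc.get("SELECT * FROM ads WHERE day = 'mon'") is None
```

For the seldom-changed advertisements table in the example, the cache
would serve every repeated page-hit query until the daily update
commits and flushes the entries.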

Does this improve the "more interesting" case of heavily updated
tables? Not likely, however, for many web applications, it will
likely improve 80% of the queries leaving more cycles (and bandwidth)
available for the non-cacheable queries.

There would be other issues as well: non-invalidated cache entries
will accumulate rapidly if the criteria change often, large result sets
will cause cache contention, cursors will (likely) be unable to use the
cache, syntax/commands for manipulating cacheability will be needed,
etc. THIS DOES NOT ELIMINATE THE BASIC VALUE of a results cache under
the conditions specified above, conditions that I would venture make up
a large part of the queries that are (or could be) made by a web
application.

Thanks,
F

--
F Harvell
407 467-1919

#21Shane Ambler
pgsql@007Marketing.com
In reply to: Harvell F (#20)
#22Mark Mielke
mark@mark.mielke.cc
In reply to: Jeremy Drake (#19)
#23Mark Mielke
mark@mark.mielke.cc
In reply to: Shane Ambler (#21)
#24Markus Schaber
schabi@logix-tt.com
In reply to: Shane Ambler (#6)
#25Merlin Moncure
mmoncure@gmail.com
In reply to: Mark Mielke (#4)
#26Harvell F
fharvell@file13.info
In reply to: Shane Ambler (#21)
#27Neil Conway
neilc@samurai.com
In reply to: Markus Schaber (#24)
#28Mark Mielke
mark@mark.mielke.cc
In reply to: Neil Conway (#27)
#29Jeremy Drake
pgsql@jdrake.com
In reply to: Mark Mielke (#22)
#30Anon Mous
soundami@yahoo.com
In reply to: Jeremy Drake (#29)
#31Jeff Davis
pgsql@j-davis.com
In reply to: Anon Mous (#30)