Multiple Storage per Tablespace, or Volumes
Hi list,
Here's a proposal for an idea which stole a good part of my night.
I'll present the idea first, then 2 use cases giving some rationale and a
few details. Please note I won't be able to participate in any development
effort associated with this idea, should such a thing happen!
The bare idea is to provide a way to 'attach' multiple storage facilities (say
volumes) to a given tablespace. Each volume may be attached in READ ONLY,
READ WRITE or WRITE ONLY mode.
You can mix RW and WO volumes into the same tablespace, but can't have RO with
any W form, or so I think.
It would be pretty handy to be able to add and remove volumes on a live
cluster, and this could be a way to implement moving/extending tablespaces.
Use Case A: better read performances while keeping data write reliability
The first application of this multiple volumes per tablespace idea is to keep
a tablespace both into RAM (tmpfs or ramfs) and on disk (both RW).
Then PG should be able to read from both volumes when dealing with read
queries, and would have to fwrite()/fsync() both volumes for each write.
Of course, write speed will be constrained by the slowest volume, but the
quicker one would then be able to take away some amount of read queries
meanwhile.
It would be neat if PG were able to take each volume's relative write speed
into account in order to assign weights to each of the tablespace's volumes,
and have the planner or executor spread read queries among volumes accordingly.
For example, if a single query has a plan containing several full scans (of
indexes and/or tables) in the same tablespace, those could be done on
different volumes.
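A minimal sketch of the weighting idea above, in Python rather than anything resembling PostgreSQL internals. The volume names, the weights and the `pick_volume` helper are all hypothetical; the only point is that reads can be spread in proportion to each volume's measured speed.

```python
import random

def pick_volume(weights, rng=random):
    """Pick a volume name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Hypothetical relative speeds: the RAM-backed volume is assumed ~20x faster.
volume_weights = {"ram": 20.0, "disk": 1.0}

rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {"ram": 0, "disk": 0}
for _ in range(1000):
    counts[pick_volume(volume_weights, rng)] += 1

print(counts)  # most reads land on "ram", a few on "disk"
```

The slower volume still gets a small share of the reads, so it keeps contributing while the fast one is busy.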
Use Case B: Synchronous Master Slave(s) Replication
By using a Distributed File System capable of being mounted from several nodes
at the same time, we could have a configuration where a master node has
('exports') a WO tablespace volume, and one or more slaves (depending on FS
capability) configures a RO tablespace volume.
PG has then to be able to cope with a RO volume: the data are not written by
PG itself (local node point of view), so some limitations would certainly
occur.
Will it be possible, for example, to add indexes to data on slaves?
I'd use the solution even without this, thus...
When the master/slave link is broken, the master can no longer write to the
tablespace, as if it were a local disk failure of some sort, so this should
prevent nasty desync problems: data is written on all W volumes or data is
not written at all.
I realize this proposal is the first draft of a work to be done, and that I
won't be able to make a lot more than drafting this idea. This mail is sent
on the hackers list in the hope someone there will find this is worth
considering and polishing...
Regards, and thanks for the good work ;)
--
Dimitri Fontaine
On Mon, Feb 19, 2007 at 11:25:41AM +0100, Dimitri Fontaine wrote:
Hi list,
Here's a proposal of this idea which stole a good part of my night.
I'll present first the idea, then 2 use cases where to read some rational and
few details. Please note I won't be able to participate in any development
effort associated with this idea, may such a thing happen!
The bare idea is to provide a way to 'attach' multiple storage facilities (say
volumes) to a given tablespace. Each volume may be attached in READ ONLY,
READ WRITE or WRITE ONLY mode.
You can mix RW and WO volumes into the same tablespace, but can't have RO with
any W form, or so I think.
Somehow this seems like implementing RAID within postgres, which seems
a bit outside of the scope of a DB.
Use Case A: better read performances while keeping data write reliability
The first application of this multiple volumes per tablespace idea is to keep
a tablespace both into RAM (tmpfs or ramfs) and on disk (both RW).
For example, I don't believe there is a restriction against having one
member of a RAID array being a RAM disk.
Use Case B: Synchronous Master Slave(s) Replication
By using a Distributed File System capable of being mounted from several nodes
at the same time, we could have a configuration where a master node has
('exports') a WO tablespace volume, and one or more slaves (depending on FS
capability) configures a RO tablespace volume.
Here you have the problem of row visibility. The data in the table isn't
very useful without the clog, and that's not stored in a tablespace...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
regards, tom lane
Le lundi 19 février 2007 16:33, Tom Lane a écrit :
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
I thought moving some knowledge about data availability into PostgreSQL code
could provide some valuable performance benefit, allowing it to organize reads
(for example parallel table scans/index scans on different volumes) and
to obtain data from the volume known to be 'quicker' (or least used/loaded).
You're both saying RAID/LVM implementations provide good enough performances
for PG not having to go this way, if I understand correctly.
And distributed file systems are enough to have the replication stuff, without
PG having to deal explicitly with the work involved.
Maybe I should have slept after all ;)
Thanks for your time and comments, regards,
--
Dimitri Fontaine
Dimitri Fontaine <dim@dalibo.com> writes:
You're both saying RAID/LVM implementations provide good enough performances
for PG not having to go this way, if I understand correctly.
There's certainly no evidence to suggest that reimplementing them
ourselves would be a productive use of our time.
regards, tom lane
On Mon, Feb 19, 2007 at 05:10:36PM +0100, Dimitri Fontaine wrote:
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
I thought moving some knowledge about data availability into PostgreSQL code
could provide some valuable performance benefit, allowing to organize reads
(for example parallel tables scan/indexes scan to different volumes) and
obtaining data from 'quicker' known volume (or least used/charged).
Well, organising requests to be handled quickly is a function of
LVM/RAID, so we don't go there. However, speeding up scans by having
multiple requests is an interesting approach, as would perhaps a
different random_page_cost for different tablespaces.
My point is, don't try to implement the mechanics of LVM/RAID into
postgres, instead, work on providing ways for users to take advantage
of these mechanisms if they have them. Look at it as if you have got
LVM/RAID setup for your ideas, how do you get postgres to take
advantage of them?
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those
wheels when perfectly good implementations already exist for us to
sit on top of.
I expect that someone will point out that Windows doesn't support RAID
or LVM, and we'll have to reimplement it anyway.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Peter Eisentraut wrote:
Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those
wheels when perfectly good implementations already exist for us to
sit on top of.
I expect that someone will point out that Windows doesn't support RAID
or LVM, and we'll have to reimplement it anyway.
windows supports software raid just fine since Windows 2000 or so ...
Stefan
Peter Eisentraut wrote:
Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those
wheels when perfectly good implementations already exist for us to
sit on top of.
I expect that someone will point out that Windows doesn't support RAID
or LVM, and we'll have to reimplement it anyway.
Windows supports both RAID and LVM.
//Magnus
Magnus Hagander wrote:
Windows supports both RAID and LVM.
Oh good, so we've got that on record. :)
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
Stefan Kaltenbrunner wrote:
Peter Eisentraut wrote:
Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those
wheels when perfectly good implementations already exist for us to
sit on top of.
I expect that someone will point out that Windows doesn't support RAID
or LVM, and we'll have to reimplement it anyway.
windows supports software raid just fine since Windows 2000 or so ...
Longer than that... it supported mirroring and raid 5 in NT4 and
possibly even NT3.51 IIRC.
Joshua D. Drake
Stefan
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
Ok, warning, this is a "you know what would be sweet" moment.
What would be nice is to be able to detach one of the volumes, and
know the span of the data in there without being able to access the
data.
The problem that a lot of warehouse operators have is something like
this: "We know we have all this data, but we don't know what we will
want to do with it later. So keep it all. I'll get back to you when
I want to know something."
It'd be nice to be able to load up all that data once, and then shunt
it off into (say) read-only media. If one could then run a query
that would tell one which spans of data are candidates for the
search, you could bring back online (onto reasonably fast storage,
for instance) just the volumes you need to read.
A
--
Andrew Sullivan | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard
to use, but at least there is a lot of code underneath."
--Damien Katz
Andrew Sullivan wrote:
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
Ok, warning, this is a "you know what would be sweet" moment.
The dreaded words from a developer's mouth to every manager in the world.
Joshua D. Drake
On Mon, 19 Feb 2007, Joshua D. Drake wrote:
Longer than that... it supported mirroring and raid 5 in NT4 and
possibly even NT3.51 IIRC.
Mirroring and RAID 5 go back to Windows NT 3.1 Advanced Server in 1993:
http://support.microsoft.com/kb/114779
http://www.byte.com/art/9404/sec8/art7.htm
The main source of confusion about current support for this feature is
that the desktop/workstation versions of Windows don't have it. For
Windows XP, you need the XP Professional version to get "dynamic disk"
support; it's not in the home edition.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Joshua D. Drake wrote:
Andrew Sullivan wrote:
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
Ok, warning, this is a "you know what would be sweet" moment.
The dreaded words from a developer's mouth to every manager in the world.
Yea, I just instinctively hit "delete" when I saw that phrase.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Andrew Sullivan wrote:
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
Ok, warning, this is a "you know what would be sweet" moment.
What would be nice is to be able to detach one of the volumes, and
know the span of the data in there without being able to access the
data.
The problem that a lot of warehouse operators have is something like
this: "We know we have all this data, but we don't know what we will
want to do with it later. So keep it all. I'll get back to you when
I want to know something."
You should be able to do that with tablespaces and VACUUM FREEZE, the
point of the latter being that you can take the disk containing the
"read only" data offline, and still have the data readable after
plugging it back in, no matter how far along the transaction ID counter
is.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
On Mon, 2007-02-19 at 17:35 -0300, Alvaro Herrera wrote:
Andrew Sullivan wrote:
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top of.
Ok, warning, this is a "you know what would be sweet" moment.
What would be nice is to be able to detach one of the volumes, and
know the span of the data in there without being able to access the
data.
The problem that a lot of warehouse operators have is something like
this: "We know we have all this data, but we don't know what we will
want to do with it later. So keep it all. I'll get back to you when
I want to know something."
You should be able to do that with tablespaces and VACUUM FREEZE, the
point of the latter being that you can take the disk containing the
"read only" data offline, and still have the data readable after
plugging it back in, no matter how far along the transaction ID counter
is.
Doesn't work anymore because VACUUM FREEZE doesn't (and can't) take a
full table lock, so somebody can always update data after a data block
has been frozen. That can lead to putting a table onto read-only media
when it still needs vacuuming, which is a great way to break the DB. It
also doesn't freeze still visible data, so there's no easy way of doing
this. Waiting until the VF is the oldest Xid is prone to deadlock as
well.
Ideally, we'd have a copy to read-only media whilst freezing, as an
atomic operation, with some guarantees that it will actually have frozen
*everything*, or fail:
ALTER TABLE SET TABLESPACE foo READONLY;
Can we agree that as a TODO item?
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Mon, Feb 19, 2007 at 02:50:34PM -0500, Andrew Sullivan wrote:
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those
wheels when perfectly good implementations already exist for us to
sit on top of.
Ok, warning, this is a "you know what would be sweet" moment.
What would be nice is to be able to detach one of the volumes, and
know the span of the data in there without being able to access the
data.
The problem that a lot of warehouse operators have is something like
this: "We know we have all this data, but we don't know what we will
want to do with it later. So keep it all. I'll get back to you
when I want to know something."
It'd be nice to be able to load up all that data once, and then
shunt it off into (say) read-only media. If one could then run a
query that would tell one which spans of data are candidates for the
search, you could bring back online (onto reasonably fast storage,
for instance) just the volumes you need to read.
Isn't this one of the big use cases for table partitioning?
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter
Remember to vote!
I have a WIP patch that adds the main detail I have found I need to
properly tune checkpoint and background writer activity. I think it's
almost ready to submit (you can see the current patch against 8.2 at
http://www.westnet.com/~gsmith/content/postgresql/patch-checkpoint.txt )
after making it a bit more human-readable. But I've realized that along
with that, I need some guidance in regards to what log level is
appropriate for this information.
An example works better than explaining what the patch does:
2007-02-19 21:53:24.602 EST - DEBUG: checkpoint required (wrote checkpoint_segments)
2007-02-19 21:53:24.685 EST - DEBUG: checkpoint starting
2007-02-19 21:53:24.705 EST - DEBUG: checkpoint flushing buffer pool
2007-02-19 21:53:24.985 EST - DEBUG: checkpoint database fsync starting
2007-02-19 21:53:42.725 EST - DEBUG: checkpoint database fsync complete
2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush dirty=8034 write=279956 us sync=17739974 us
Remember that "Load distributed checkpoint" discussion back in December?
You can see exactly how bad the problem is on your system with this log
style (this is from a pgbench run where it's positively awful: it really
does take over 17 seconds for the fsync to execute, and there are clients
that are hung the whole time waiting for it).
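For anyone wanting to pull the timings out of these lines, here is a rough Python sketch. It assumes the exact message wording shown above, which comes from a WIP patch and may well change; `parse_flush` is an illustrative name, not part of the patch.

```python
import re

# Sample taken from the log excerpt above (wording is specific to the WIP patch).
LINE = ("2007-02-19 21:53:42.726 EST - DEBUG: checkpoint buffer flush "
        "dirty=8034 write=279956 us sync=17739974 us")

# \s+ between fields tolerates the line having been wrapped by a mail client.
PATTERN = re.compile(r"dirty=(\d+)\s+write=(\d+) us\s+sync=(\d+) us")

def parse_flush(line):
    """Return dirty buffer count plus write/sync times in seconds, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    dirty, write_us, sync_us = (int(g) for g in m.groups())
    return {"dirty": dirty, "write_s": write_us / 1e6, "sync_s": sync_us / 1e6}

stats = parse_flush(LINE)
print(stats)
```

Converting microseconds to seconds makes the problem obvious at a glance: the write phase takes well under a second, while the sync takes more than 17.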
I also instrumented the background writer. You get messages like this:
2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - Written = 5/5 Unscanned = 23/54
This shows that we wrote (5) the maximum pages we were allowed to write
(5) while failing to scan almost half (23) of the buffers we meant to look
at (54). By taking a look at this output while the system is under load,
I found I was able to do bgwriter optimization that used to take me days
of frustrating testing in hours. I've been waiting for a good guide to
bgwriter tuning since 8.1 came out. Once you have this, combined with
knowing how many buffers were dirty at checkpoint time because the
bgwriter didn't get to them in time (the number you want to minimize), the
tuning guide practically writes itself.
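A small Python sketch of how one might summarize those bgwriter messages while watching a loaded system. The message format is the one from the WIP patch quoted above and is an assumption here; `parse_bgwriter` and the result keys are made-up names for illustration.

```python
import re

# Sample line from the excerpt above (format is specific to the WIP patch).
LINE = ("2007-02-19 21:58:54.328 EST - DEBUG: BGWriter Scan All - "
        "Written = 5/5 Unscanned = 23/54")

PATTERN = re.compile(r"Written = (\d+)/(\d+)\s+Unscanned = (\d+)/(\d+)")

def parse_bgwriter(line):
    """Report whether the write limit was hit and what share went unscanned."""
    m = PATTERN.search(line)
    if m is None:
        return None
    written, max_written, unscanned, to_scan = map(int, m.groups())
    return {
        "hit_write_limit": written >= max_written,
        "unscanned_fraction": unscanned / to_scan if to_scan else 0.0,
    }

result = parse_bgwriter(LINE)
print(result)
```

For the sample line this reports that the write limit was hit and that 23 of 54 buffers, a bit over 40%, were never scanned.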
So my question is...what log level should all this go at? Right now, I
have the background writer stuff adjusting its level dynamically based on
what happened; it logs at DEBUG2 if it hits the write limit (which should
be a rare event once you're tuned properly), DEBUG3 for writes that
scanned everything they were supposed to, and DEBUG4 if it scanned but
didn't find anything to write. The source of checkpoint information logs
at DEBUG1, the fsync/write info at DEBUG2.
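The dynamic level selection described above could be read roughly as follows. This is a Python paraphrase of the prose, not the patch's actual C code; the function name and arguments are hypothetical, and it assumes the write limit is greater than zero.

```python
def bgwriter_log_level(written, max_written):
    """Map one bgwriter pass to a log level, following the description above.

    Assumes max_written > 0 (an assumption made for this sketch).
    """
    if written >= max_written:
        # Hit the write limit: should be rare once tuned, so log it loudest.
        return "DEBUG2"
    if written > 0:
        # Wrote something and scanned everything it was supposed to.
        return "DEBUG3"
    # Scanned but found nothing to write.
    return "DEBUG4"

for written in (5, 2, 0):
    print(written, bgwriter_log_level(written, max_written=5))
```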
I'd like to move some of these up. On my system, I even have many of the
checkpoint messages logged at INFO (the source of the checkpoint and the
total write time line). It's a bit chatty, but when you get some weird
system pause issue it makes it easy to figure out if checkpoints were to
blame. Given how useful I feel some of these messages are to system
tuning, and to explaining what currently appears as inexplicable pauses, I
don't want them to be buried at DEBUG levels where people are unlikely to
ever see them (I think some people may be concerned about turning on
things labeled DEBUG at all). I am aware that I am too deep into this to
have an unbiased opinion at this point though, which is why I ask for
feedback on how to proceed here.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Monday 19 February 2007 15:08, Bruce Momjian wrote:
Joshua D. Drake wrote:
Andrew Sullivan wrote:
On Mon, Feb 19, 2007 at 10:33:24AM -0500, Tom Lane wrote:
Martijn van Oosterhout <kleptog@svana.org> writes:
Somehow this seems like implementing RAID within postgres,
RAID and LVM too. I can't get excited about re-inventing those wheels
when perfectly good implementations already exist for us to sit on top
of.
Ok, warning, this is a "you know what would be sweet" moment.
The dreaded words from a developer's mouth to every manager in the world.
Yea, I just instinctively hit "delete" when I saw that phrase.
Too bad... I know Oracle can do what he wants... possibly other db systems as
well.
--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL