Compression and on-disk sorting
A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.
On-disk sorts are the most obvious candidate, but I suspect there's
others.
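A minimal sketch of the idea in Python (not PostgreSQL code; the row format and sizes are invented for illustration): gzip a spill buffer before it would hit disk, relying on the fact that sort spills are written and read strictly sequentially, which is exactly what stream compressors want.

```python
import gzip
import io

# Hypothetical spill buffer: sorted rows serialized one per line.
raw = b"".join(b"row-%08d\n" % i for i in range(10000))

buf = io.BytesIO()  # stands in for the spill file
with gzip.GzipFile(fileobj=buf, mode="wb") as spill:
    spill.write(raw)  # compress once, sequentially, before "hitting disk"
compressed = buf.getvalue()

# Regular, sorted data compresses very well, so far fewer bytes reach
# the disk (or the kernel cache) per spilled tuple.
```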
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes:
A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.
Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...
regards, tom lane
On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...
I'm really starting to see why other databases want the OS out of their
way...
I guess at this point the best we could do would be to have a
configurable limit for when compression started. The first X number of
bytes go out uncompressed, everything after that is compressed. I don't
know of any technical reason why you couldn't switch in the middle of a
file, so long as you knew exactly where you switched.
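A toy illustration of that scheme (the helper names are invented): the first X bytes are written raw, everything past the threshold goes through zlib, and the reader only needs to know the switch offset.

```python
import zlib

def spill_with_switch(data: bytes, threshold: int):
    """Write the first `threshold` bytes raw and compress the rest.

    Returns (blob, switch_offset). Everything before switch_offset in
    blob is uncompressed; everything after is a single zlib stream, so
    a reader that knows the offset can reconstruct the data exactly.
    """
    head, tail = data[:threshold], data[threshold:]
    return head + zlib.compress(tail), len(head)

def read_back(blob: bytes, switch_offset: int) -> bytes:
    # Raw prefix as-is, then decompress the tail from the offset on.
    return blob[:switch_offset] + zlib.decompress(blob[switch_offset:])
```

A real implementation would stream the tail through a compressor object instead of buffering it, but the bookkeeping is the same: one recorded offset marks exactly where the format switched.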
Jim C. Nasby wrote:
On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
I'm really starting to see why other databases want the OS out of their
way...
Some of it is pure NIH syndrome. I recently heard of some tests done by
a major DB team that showed their finely crafted raw file system stuff
performing at best a few percent better than a standard file system, and
sometimes worse. I have often heard of the supposed benefits of our
being able to go behind the OS, but I am very dubious about it. What
makes people think that we could do any better than the OS guys?
cheers
andrew
On Mon, May 15, 2006 at 03:44:50PM -0400, Andrew Dunstan wrote:
Jim C. Nasby wrote:
On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:
Some of it is pure NIH syndrome. I recently heard of some tests done by
a major DB team that showed their finely crafted raw file system stuff
performing at best a few percent better than a standard file system, and
sometimes worse. I have often heard of the supposed benefits of our
being able to go behind the OS, but I am very dubious about it. What
makes people think that we could do any better than the OS guys?
The problem is that it seems like there's never enough ability to clue
the OS in on what the application is trying to accomplish. For a long
time we didn't have a background writer, because the OS should be able
to flush things out on its own before checkpoint. Now there's talk of a
background reader, because backends keep stalling on waiting on disk IO.
In this case the problem is that we want to tell the OS "Hey, if this
stuff is actually going to go out to the spindles then compress it. And
by the way, we won't be doing any random access on it, either." But
AFAIK there's no option like that in fopen... :)
I agree, when it comes to base-level stuff like how to actually put the
data on the physical media, there's not much to be gained in this day
and age by using RAW storage, and in fact Oracle hasn't favored RAW for
quite some time. Every DBA I've ever talked to says that the only reason
to use RAW is if you're trying to eke every last ounce of performance
out of the hardware that you can, which for 99.99% of installs makes
absolutely no sense.
But, there's a big range between writing your own filesystem and
assuming that the OS should just handle everything for you. I think a
lot of commercial databases lean too far towards not trusting the OS
(which is understandable to a degree, given how much OSes have
improved), while in some areas I think we still rely too much on the OS
(like read-ahead).
On Mon, May 15, 2006 at 03:02:07PM -0500, Jim C. Nasby wrote:
The problem is that it seems like there's never enough ability to clue
the OS in on what the application is trying to accomplish. For a long
time we didn't have a background writer, because the OS should be able
to flush things out on its own before checkpoint. Now there's talk of a
background reader, because backends keep stalling on waiting on disk IO.
Hmm, I thought the background writer was created to reduce the cost of
checkpoint which has to write out modified pages in the buffers. The
background writer tries to keep the number of dirty pages down.
I don't know about Oracle, but with or without an OS, backends are
going to block on I/O and you'd still need a background reader in both
cases for precisely the same reason. We might be able to do without a
background reader if we did asynchronous I/O, but that can't be done
portably.
In this case the problem is that we want to tell the OS "Hey, if this
stuff is actually going to go out to the spindles then compress it. And
by the way, we won't be doing any random access on it, either." But
AFAIK there's no option like that in fopen... :)
posix_fadvise(). We don't use it and many OSes don't support it, but it
is there.
The O/S is some overhead, but the benefits outweigh the costs IMHO.
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
On Mon, May 15, 2006 at 10:09:47PM +0200, Martijn van Oosterhout wrote:
posix_fadvise(). We don't use it and many OSes don't support it, but it
is there.
There's an fadvise that tells the OS to compress the data if it actually
makes it to disk?
Jim C. Nasby wrote:
There's an fadvise that tells the OS to compress the data if it actually
makes it to disk?
Compressed-filesystem extensions (like e2compr, and I think either
FAT or NTFS) can do that.
I think the reasons against adding this feature to postgresql are
largely the same as the reasons why compressed filesystems aren't
very popular.
Has anyone tried running postgresql on a compressing file-system?
I'd expect the penalties to outweigh the benefits (or they'd be
more common); but if it gives impressive results, it might add
weight to this feature idea.
Ron M
I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.
Windows98? No, those decisions predate any thought of running Oracle
on Windows, probably by decades. But I think the thought process was
about as above whenever they did make it; they were running on some
pretty stupid OSes way back when.
regards, tom lane
Tom Lane wrote:
Windows98? No, those decisions predate any thought of running Oracle
on Windows, probably by decades. But I think the thought process was
about as above whenever they did make it; they were running on some
pretty stupid OSes way back when.
Windows XP?
****runs****
--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
On Mon, May 15, 2006 at 05:42:53PM -0700, Joshua D. Drake wrote:
Windows XP?
****runs****
You guys have to kill your Windows hate - in jest or otherwise. It's
zealous, and blinding. I'm picking on you Joshua, only because your
message is the one I saw last. Sorry...
Writing your own block caching layer can make a lot of sense. Why would
it be assumed that a file system designed for desktop use would be
optimal for database-style loads?
Why would it be assumed that a file system meant for many different
smaller files would be optimal for database-style loads?
It's always going to be true that the more specific the requirements,
the more highly optimized a system one can design. The Linux block
caching layer, or file system layout, can be beat *for sure* for
database loads.
The real question - and I believe Tom and others have correctly harped
on it in the past is - is it worth it? Until somebody actually rolls
up their sleeves, invests a month or more of their life in it, and
does it, we really won't know. And even then, the cost of maintenance
would have to be considered. Who is going to keep up-to-date on
theoretical storage models? What happens when generic file system
levels again surpass the first attempt?
Personally, I believe it would be worth it - but only to a few. And
most of these few are likely using Oracle. So, no gain unless
you can convince them to switch back... :-)
Cheers,
mark
--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
Oh come on. Sorry to troll, but this is too easy.
On 5/15/06, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:
You guys have to kill your Windows hate - in jest or otherwise. It's
zealous, and blinding.
[snip]
Why would it
be assumed, that a file system designed for use from a desktop, would be
optimal at all for database style loads?
It wouldn't.
Why would someone use a desktop OS for a database?
Why would you call the result of answering the previous question
zealous and blinding?
PG's use of the OS's block cache is a good move because it makes PG
tend to 'just work' where the alternatives require non-trivial tuning
(sizing their caches not to push the OS into swap). The advantages of
this are great enough that if additional smarts are needed in the OS
cache it might well be worth the work to add it there and to ask for
new fadvise flags to get the desired behavior.
That's something that would be easy enough for a dedicated hacker to
do, or easy enough to collaborate with the OS developers if the need
could be demonstrated clearly enough.
What reasonable OS couldn't you do that with?
:)
mark@mark.mielke.cc wrote:
Personally, I believe it would be worth it - but only to a few. And
most of these few are likely using Oracle. So, no gain unless
you can convince them to switch back... :-)
We do know that for commercial databases that support both raw and file
system storage, raw storage is only a few percentage points faster.
--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.
Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...
Yes, that is very true. However it would also increase the probability
that spill to disk is not needed, since more data fits in RAM.
It would probably need some sort of plugin architecture, since the
fastest compression algorithms that also reach good ratios (LZO) are
GPL-licensed.
LZO is proven to increase physical IO write speed with low CPU overhead.
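A sketch of what such a plugin seam might look like, using Python's zlib and lzma as stand-ins for LZO (which isn't in the standard library); the codec table and function names are invented for illustration:

```python
import lzma
import zlib

# Registry mapping a codec name to (compress, decompress) callables.
# A real plugin system would populate this from loadable modules, which
# is how a GPL-licensed codec like LZO could stay out of the core tree.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def spill_compress(name: str, data: bytes) -> bytes:
    compress, _ = CODECS[name]
    return compress(data)

def spill_decompress(name: str, blob: bytes) -> bytes:
    _, decompress = CODECS[name]
    return decompress(blob)
```

The point of the seam is that the spill code only ever sees the two callables, so swapping in a faster codec is a registry change, not a rewrite.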
Andreas
We do know that for commercial databases that support both raw and file
system storage, raw storage is only a few percentage points faster.
IMHO it is really not comparable, because they all use direct or async IO
that bypasses the OS buffer cache even when using filesystem files for
storage.
A substantial speed difference is allocation of space for restore
(no format of the fs and no file allocation needed).
I am not saying this to advocate moving in that direction, however.
I do however think that there is substantial headroom in reducing the
number of IO calls and reducing on-disk storage requirements,
especially in concurrent load scenarios.
Andreas
Compressed-filesystem extension (like e2compr, and I think either
Fat or NTFS) can do that.
Windows (NT/2000/XP) can compress individual directories and files under
NTFS; new files in a compressed directory are compressed by default.
So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression.
I don't know enough Linux/Unix to know if it has similar capabilities.
Bort, Paul wrote:
So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression. I don't know
enough Linux/Unix to know if it has similar capabilities.
Or would want to ...
I habitually turn off all compression on my Windows boxes, because it's
a performance hit in my experience. Disk is cheap ...
cheers
andrew
On Tue, 2006-05-16 at 11:53 -0400, Andrew Dunstan wrote:
I don't know enough Linux/Unix to know if it has similar capabilities.
Or would want to ...
I habitually turn off all compression on my Windows boxes, because it's
a performance hit in my experience. Disk is cheap ...
Disk storage is cheap. Disk bandwidth or throughput is very expensive.
Rod Taylor wrote:
Disk storage is cheap. Disk bandwidth or throughput is very expensive.
Sure, but in my experience using Windows File System compression is not
a win here. Presumably if it were an unqualified win they would have it
turned on everywhere. The fact that there's an option is a good
indication that it isn't in many cases. It is most commonly used for
files like executables that are in effect read-only - but that doesn't
help us.
cheers
andrew
On Tue, May 16, 2006 at 09:24:38AM +0200, Zeugswetter Andreas DCP SD wrote:
Yes, that is very true. However it would also increase the probability
that spill to disk is not needed, since more data fits in RAM.
That's a pretty thin margin though, depending on how good the
compression is. This also assumes that you have a compression algorithm
that supports random access.
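One standard way to get random access over compressed data is to compress fixed-size blocks and keep an offset index, so any block can be decompressed without touching the ones before it. A sketch (block size and helper names are arbitrary):

```python
import zlib

BLOCK = 4096  # uncompressed bytes per block; chosen arbitrarily here

def write_blocks(data: bytes):
    """Compress `data` block by block, returning (blob, index).

    index[i] is the byte offset of compressed block i within blob, with
    a final sentinel entry marking the end, so a reader can seek straight
    to any block.
    """
    blob, index = b"", []
    for i in range(0, len(data), BLOCK):
        index.append(len(blob))
        blob += zlib.compress(data[i:i + BLOCK])
    index.append(len(blob))  # sentinel: end of the last block
    return blob, index

def read_block(blob: bytes, index, i: int) -> bytes:
    # Decompress only block i, using the index to find its extent.
    return zlib.decompress(blob[index[i]:index[i + 1]])
```

The trade-off Jim alludes to is visible here: smaller blocks mean cheaper random access but a worse compression ratio, since each block is compressed independently.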