Compression and on-disk sorting

Started by Jim C. Nasby over 19 years ago, 105 messages
#1Jim C. Nasby
jnasby@pervasive.com

A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.

On-disk sorts are the most obvious candidate, but I suspect there's
others.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#1)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...

regards, tom lane

#3Jim C. Nasby
jnasby@pervasive.com
In reply to: Tom Lane (#2)
Re: Compression and on-disk sorting

On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...

I'm really starting to see why other databases want the OS out of their
way...

I guess at this point the best we could do would be to have a
configurable limit for when compression starts. The first X bytes go
out uncompressed, and everything after that is compressed. I don't
know of any technical reason why you couldn't switch in the middle of a
file, so long as you knew exactly where you switched.
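
A minimal sketch of that idea, assuming zlib for the compressed half; the
names here (SpillFile, spill_write, SPILL_RAW_LIMIT) are invented for
illustration, and the switch offset has to be recorded so a reader knows
where the raw bytes end and the compressed stream begins:

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define SPILL_RAW_LIMIT (64L * 1024 * 1024)   /* first 64MB stay uncompressed */

typedef struct SpillFile
{
    FILE   *raw;        /* already-open temp file, positioned at EOF */
    gzFile  gz;         /* NULL until we cross the limit */
    long    raw_bytes;  /* exact offset where compression starts */
} SpillFile;

static void
spill_write(SpillFile *sf, const void *buf, size_t len)
{
    if (sf->gz == NULL && sf->raw_bytes + (long) len > SPILL_RAW_LIMIT)
    {
        /* Switch to compressed output from here on. */
        fflush(sf->raw);
        sf->gz = gzdopen(fileno(sf->raw), "ab");
        if (sf->gz == NULL)
        {
            fprintf(stderr, "gzdopen failed\n");
            exit(1);
        }
        /* NB: gzclose() closes the fd, so don't fclose(sf->raw) as well. */
    }

    if (sf->gz != NULL)
        gzwrite(sf->gz, buf, (unsigned) len);
    else
    {
        fwrite(buf, 1, len, sf->raw);
        sf->raw_bytes += (long) len;
    }
}
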
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#4Andrew Dunstan
andrew@dunslane.net
In reply to: Jim C. Nasby (#3)
Re: Compression and on-disk sorting

Jim C. Nasby wrote:

On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...

I'm really starting to see why other databases want the OS out of their
way...

Some of it is pure NIH syndrome. I recently heard of some tests done by
a major DB team that showed their finely crafted raw file system stuff
performing at best a few percent better than a standard file system, and
sometimes worse. I have often heard of the supposed benefits of our
being able to go behind the OS, but I am very dubious about it. What
makes people think that we could do any better than the OS guys?

cheers

andrew

#5Jim C. Nasby
jnasby@pervasive.com
In reply to: Andrew Dunstan (#4)
Re: Compression and on-disk sorting

On Mon, May 15, 2006 at 03:44:50PM -0400, Andrew Dunstan wrote:

Jim C. Nasby wrote:

On Mon, May 15, 2006 at 02:18:03PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

A recent post Tom made in -bugs about how bad performance would be if we
spilled after-commit triggers to disk got me thinking... There are
several operations the database performs that potentially spill to disk.
Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...

I'm really starting to see why other databases want the OS out of their
way...

Some of it is pure NIH syndrome. I recently heard of some tests done by
a major DB team that showed their finely crafted raw file system stuff
performing at best a few percent better than a standard file system, and
sometimes worse. I have often heard of the supposed benefits of our
being able to go behind the OS, but I am very dubious about it. What
makes people think that we could do any better than the OS guys?

The problem is that it seems like there's never enough ability to clue
the OS in on what the application is trying to accomplish. For a long
time we didn't have a background writer, because the OS should be able
to flush things out on its own before checkpoint. Now there's talk of a
background reader, because backends keep stalling while waiting on disk I/O.

In this case the problem is that we want to tell the OS "Hey, if this
stuff is actually going to go out to the spindles then compress it. And
by the way, we won't be doing any random access on it, either." But
AFAIK there's no option like that in fopen... :)

I agree, when it comes to base-level stuff like how to actually put the
data on the physical media, there's not much to be gained in this day
and age by using RAW storage, and in fact Oracle hasn't favored RAW for
quite some time. Every DBA I've ever talked to says that the only reason
to use RAW is if you're trying to eke every last ounce of performance
out of the hardware that you can, which for 99.99% of installs makes
absolutely no sense.

But, there's a big range between writing your own filesystem and
assuming that the OS should just handle everything for you. I think a
lot of commercial databases lean too far towards not trusting the OS
(which is understandable to a degree, given how much OSes have
improved), while in some areas I think we still rely too much on the OS
(like read-ahead).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#6Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim C. Nasby (#5)
Re: Compression and on-disk sorting

On Mon, May 15, 2006 at 03:02:07PM -0500, Jim C. Nasby wrote:

The problem is that it seems like there's never enough ability to clue
the OS in on what the application is trying to accomplish. For a long
time we didn't have a background writer, because the OS should be able
to flush things out on its own before checkpoint. Now there's talk of a
background reader, because backends keep stalling while waiting on disk I/O.

Hmm, I thought the background writer was created to reduce the cost of
checkpoints, which have to write out the modified pages in the buffers. The
background writer tries to keep the number of dirty pages down.

I don't know about Oracle, but with or without an OS, backends are
going to block on I/O and you'd still need a background reader in both
cases for precisely the same reason. We might be able to do without a
background reader if we did asynchronous I/O, but that can't be done
portably.

In this case the problem is that we want to tell the OS "Hey, if this
stuff is actually going to go out to the spindles then compress it. And
by the way, we won't be doing any random access on it, either." But
AFAIK there's no option like that in fopen... :)

posix_fadvise(). We don't use it and many OSes don't support it, but it
is there.
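
For what it's worth, the hints that do exist look like the sketch below
(the file name is invented): you can declare sequential access and tell the
kernel when the cached pages are no longer needed, but there is no
compress-on-write flag.

#define _XOPEN_SOURCE 600   /* for posix_fadvise on some platforms */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int     fd = open("pgsql_tmp_example", O_CREAT | O_RDWR, 0600);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* This spill file will be written and read strictly sequentially. */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... write the spill data, read it back in order ... */

    /* Done with it: the kernel may evict the cached pages now. */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    unlink("pgsql_tmp_example");
    return 0;
}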

The OS adds some overhead, but the benefits outweigh the costs IMHO.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#7Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#6)
Re: Compression and on-disk sorting

On Mon, May 15, 2006 at 10:09:47PM +0200, Martijn van Oosterhout wrote:

In this case the problem is that we want to tell the OS "Hey, if this
stuff is actually going to go out to the spindles then compress it. And
by the way, we won't be doing any random access on it, either." But
AFAIK there's no option like that in fopen... :)

posix_fadvise(). We don't use it and many OSes don't support it, but it
is there.

There's an fadvise that tells the OS to compress the data if it actually
makes it to disk?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#8Ron Mayer
rm_pg@cheapcomplexdevices.com
In reply to: Jim C. Nasby (#7)
Re: Compression and on-disk sorting

Jim C. Nasby wrote:

There's an fadvise that tells the OS to compress the data if it actually
makes it to disk?

A compressed-filesystem extension (like e2compr, and I think either
FAT or NTFS) can do that.

I think the reasons against adding this feature to postgresql are
largely the same as the reasons why compressed filesystems aren't
very popular.

Has anyone tried running postgresql on a compressing file-system?
I'd expect the penalties to outweigh the benefits (or they'd be
more common); but if it gives impressive results, it might add
weight to this feature idea.

Ron M

I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Ron Mayer (#8)
Re: Compression and on-disk sorting

Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:

I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.

Windows98? No, those decisions predate any thought of running Oracle
on Windows, probably by decades. But I think the thought process was
about as above whenever they did make it; they were running on some
pretty stupid OSes way back when.

regards, tom lane

#10Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#9)
Re: Compression and on-disk sorting

Tom Lane wrote:

Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:

I think the real reason Oracle and others practically re-wrote
their own VM-system and filesystems is that at the time it was
important for them to run under Windows98; where it was rather
easy to write better filesystems than your customer's OS was
bundled with.

Windows98? No, those decisions predate any thought of running Oracle
on Windows, probably by decades. But I think the thought process was
about as above whenever they did make it; they were running on some
pretty stupid OSes way back when.

Windows XP?

****runs****

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

#11Noname
mark@mark.mielke.cc
In reply to: Joshua D. Drake (#10)
Re: Compression and on-disk sorting

On Mon, May 15, 2006 at 05:42:53PM -0700, Joshua D. Drake wrote:

Windows98? No, those decisions predate any thought of running Oracle
on Windows, probably by decades. But I think the thought process was
about as above whenever they did make it; they were running on some
pretty stupid OSes way back when.

Windows XP?
****runs****

You guys have to kill your Windows hate - in jest or otherwise. It's
zealous, and blinding. I'm picking on you Joshua, only because your
message is the one I saw last. Sorry...

Writing your own block caching layer can make a lot of sense. Why would it
be assumed that a file system designed for use from a desktop would be
optimal at all for database style loads?

Why would it be assumed that a file system to be used for many different
smaller files would be optimal at all for database style loads?

It's always going to be true that the more specific the requirements,
the more highly optimized one can design a system. The Linux block
caching layer or file system layout can be beaten *for sure* for database
loads.

The real question - and I believe Tom and others have correctly harped
on it in the past - is: is it worth it? Until somebody actually rolls
up their sleeves, invests a month or more of their life in it, and
does it, we really won't know. And even then, the cost of maintenance
would have to be considered. Who is going to keep up-to-date on
theoretical storage models? What happens when generic file system
levels again surpass the first attempt?

Personally, I believe it would be worth it - but only to a few. And
most of these few are likely using Oracle. So, no gain unless
you can convince them to switch back... :-)

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

#12Gregory Maxwell
gmaxwell@gmail.com
In reply to: Noname (#11)
Re: Compression and on-disk sorting

Oh come on. Sorry to troll, but this is too easy.

On 5/15/06, mark@mark.mielke.cc <mark@mark.mielke.cc> wrote:

You guys have to kill your Windows hate - in jest or otherwise. It's
zealous, and blinding.

[snip]

Why would it
be assumed that a file system designed for use from a desktop would be
optimal at all for database style loads?

It wouldn't.
Why would someone use a desktop OS for a database?
Why would you call the result of answering the previous question
zealous and blinding?

PG's use of the OS's block cache is a good move because it makes PG
tend to 'just work' where the alternatives require non-trivial tuning
(sizing their caches not to push the OS into swap). The advantages of
this are great enough that if additional smarts are needed in the OS
cache it might well be worth the work to add it there and to ask for
new fadvise flags to get the desired behavior.

That's something that would be easy enough for a dedicated hacker to
do, or easy enough to collaborate with the OS developers if the need
could be demonstrated clearly enough.

What reasonable OS couldn't you do that with?

:)

#13Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Noname (#11)
Re: Compression and on-disk sorting

mark@mark.mielke.cc wrote:

The real question - and I believe Tom and others have correctly harped
on it in the past - is: is it worth it? Until somebody actually rolls
up their sleeves, invests a month or more of their life in it, and
does it, we really won't know. And even then, the cost of maintenance
would have to be considered. Who is going to keep up-to-date on
theoretical storage models? What happens when generic file system
levels again surpass the first attempt?

Personally, I believe it would be worth it - but only to a few. And
most of these few are likely using Oracle. So, no gain unless
you can convince them to switch back... :-)

We do know that for commercial databases that support both raw and
file system storage, raw storage is only a few percentage
points faster.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#14Zeugswetter Andreas DCP SD
ZeugswetterA@spardat.at
In reply to: Bruce Momjian (#13)
Re: Compression and on-disk sorting

Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...

Yes, that is very true. However it would also increase the probability
that spill to disk is not needed, since more data fits in RAM.

It would probably need some sort of plugin architecture, since the
fastest compression algorithms that also reach good ratios (e.g. LZO) are
GPL.
LZO has been shown to increase physical I/O write speed with low CPU overhead.

Andreas

#15Zeugswetter Andreas DCP SD
ZeugswetterA@spardat.at
In reply to: Zeugswetter Andreas DCP SD (#14)
Re: Compression and on-disk sorting

Personally, I believe it would be worth it - but only to a few. And
most of these few are likely using Oracle. So, no gain unless
you can convince them to switch back... :-)

We do know that for commercial databases that support both raw and
file system storage, raw storage is only a few percentage
points faster.

IMHO it is really not comparable, because they all use direct or async I/O
that bypasses the OS buffer cache even when using filesystem files for
storage.
A substantial speed difference is in allocating space for a restore
(no filesystem format and no file allocation needed).

I am not saying this to advocate moving in that direction, however.
I do think that there is substantial headroom in reducing the number
of I/O calls and reducing on-disk storage requirements,
especially in concurrent load scenarios.

Andreas

#16Bort, Paul
pbort@tmwsystems.com
In reply to: Zeugswetter Andreas DCP SD (#15)
Re: Compression and on-disk sorting

A compressed-filesystem extension (like e2compr, and I think either
FAT or NTFS) can do that.

Windows (NT/2000/XP) can compress individual directories and files under
NTFS; new files in a compressed directory are compressed by default.

So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression.

I don't know enough Linux/Unix to know if it has similar capabilities.
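
For the record, a spill directory could also be flagged for NTFS
compression programmatically; a rough sketch using the documented
FSCTL_SET_COMPRESSION ioctl (the path is just an example, error handling
is minimal):

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int
main(void)
{
    USHORT  format = COMPRESSION_FORMAT_DEFAULT;
    DWORD   returned;
    HANDLE  dir;

    /* FILE_FLAG_BACKUP_SEMANTICS is required to open a directory handle. */
    dir = CreateFileA("C:\\pgdata\\pgsql_tmp",
                      GENERIC_READ | GENERIC_WRITE,
                      FILE_SHARE_READ | FILE_SHARE_WRITE,
                      NULL, OPEN_EXISTING,
                      FILE_FLAG_BACKUP_SEMANTICS, NULL);
    if (dir == INVALID_HANDLE_VALUE)
    {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* New files created under the directory inherit the compression flag. */
    if (!DeviceIoControl(dir, FSCTL_SET_COMPRESSION,
                         &format, sizeof(format),
                         NULL, 0, &returned, NULL))
    {
        fprintf(stderr, "FSCTL_SET_COMPRESSION failed: %lu\n", GetLastError());
        CloseHandle(dir);
        return 1;
    }

    CloseHandle(dir);
    return 0;
}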

#17Andrew Dunstan
andrew@dunslane.net
In reply to: Bort, Paul (#16)
Re: Compression and on-disk sorting

Bort, Paul wrote:

A compressed-filesystem extension (like e2compr, and I think either
FAT or NTFS) can do that.

Windows (NT/2000/XP) can compress individual directories and files under
NTFS; new files in a compressed directory are compressed by default.

So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression.

I don't know enough Linux/Unix to know if it has similar capabilities.

Or would want to ...

I habitually turn off all compression on my Windows boxes, because it's
a performance hit in my experience. Disk is cheap ...

cheers

andrew

#18Rod Taylor
pg@rbt.ca
In reply to: Andrew Dunstan (#17)
Re: Compression and on-disk sorting

On Tue, 2006-05-16 at 11:53 -0400, Andrew Dunstan wrote:

Bort, Paul wrote:

A compressed-filesystem extension (like e2compr, and I think either
FAT or NTFS) can do that.

Windows (NT/2000/XP) can compress individual directories and files under
NTFS; new files in a compressed directory are compressed by default.

So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression.

I don't know enough Linux/Unix to know if it has similar capabilities.

Or would want to ...

I habitually turn off all compression on my Windows boxes, because it's
a performance hit in my experience. Disk is cheap ...

Disk storage is cheap. Disk bandwidth or throughput is very expensive.
--

#19Andrew Dunstan
andrew@dunslane.net
In reply to: Rod Taylor (#18)
Re: Compression and on-disk sorting

Rod Taylor wrote:

I habitually turn off all compression on my Windows boxes, because it's
a performance hit in my experience. Disk is cheap ...

Disk storage is cheap. Disk bandwidth or throughput is very expensive.

Sure, but in my experience using Windows File System compression is not
a win here. Presumably if it were an unqualified win they would have it
turned on everywhere. The fact that there's an option is a good
indication that it isn't in many cases. It is most commonly used for
files like executables that are in effect read-only - but that doesn't
help us.

cheers

andrew

#20Jim C. Nasby
jnasby@pervasive.com
In reply to: Zeugswetter Andreas DCP SD (#14)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 09:24:38AM +0200, Zeugswetter Andreas DCP SD wrote:

Given that any time that happens we end up caring much less about CPU
usage and much more about disk IO, for any of these cases that use
non-random access, compressing the data before sending it to disk would
potentially be a sizeable win.

Note however that what the code thinks is a spill to disk and what
actually involves disk I/O are two different things. If you think
of it as a spill to kernel disk cache then the attraction is a lot
weaker...

Yes, that is very true. However it would also increase the probability
that spill to disk is not needed, since more data fits in RAM.

That's a pretty thin margin though, depending on how good the
compression is. This also assumes that you have a compression algorithm
that supports random access.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#21Jim C. Nasby
jnasby@pervasive.com
In reply to: Andrew Dunstan (#19)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:

Rod Taylor wrote:

I habitually turn off all compression on my Windows boxes, because it's
a performance hit in my experience. Disk is cheap ...

Disk storage is cheap. Disk bandwidth or throughput is very expensive.

Hey, that's my line! :P

Sure, but in my experience using Windows File System compression is not
a win here. Presumably if it were an unqualified win they would have it
turned on everywhere. The fact that there's an option is a good
indication that it isn't in many cases. It is most commonly used for
files like executables that are in effect read-only - but that doesn't
help us.

The issue with filesystem level compression is that it has to support
things like random access, which isn't needed for on-disk sorting (not
sure about other things like hashing, etc).

In any case, my curiosity is aroused, so I'm currently benchmarking
pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
some benchmarks with pgsql_tmp compressed.

Does anyone have time to hack some kind of compression into the on-disk
sort code just to get some benchmark numbers? Unfortunately, doing so is
beyond my meager C ability...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#22Jim C. Nasby
jnasby@pervasive.com
In reply to: Jim C. Nasby (#21)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:

In any case, my curiosity is aroused, so I'm currently benchmarking
pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
some benchmarks with pgsql_tmp compressed.

Results: http://jim.nasby.net/bench.log

As expected, compressing $PGDATA/base was a loss. But compressing
pgsql_tmp and then doing some disk-based sorts did show an improvement,
from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
a worst-case scenario. On the other hand, XP's compression algorithm
appears to be pretty aggressive, as it cut the size of the on-disk sort
file from almost 700MB to 82MB. There are probably gains to be had from a
different compression algorithm.

Does anyone have time to hack some kind of compression into the on-disk
sort code just to get some benchmark numbers? Unfortunately, doing so is
beyond my meager C ability...

--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#23Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim C. Nasby (#21)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:

Does anyone have time to hack some kind of compression into the on-disk
sort code just to get some benchmark numbers? Unfortunately, doing so is
beyond my meager C ability...

I had a look at this. At first glance it doesn't seem too hard, except
the whole logtape process kinda gets in the way. If it weren't for the
mark/restore it'd be trivial. Might take a stab at it some time, if I
can think of a way to handle the seeking...

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#24Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#23)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 11:46:15PM +0200, Martijn van Oosterhout wrote:

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:

Does anyone have time to hack some kind of compression into the on-disk
sort code just to get some benchmark numbers? Unfortunately, doing so is
beyond my meager C ability...

I had a look at this. At first glance it doesn't seem too hard, except
the whole logtape process kinda gets in the way. If it weren't for the
mark/restore it'd be trivial. Might take a stab at it some time, if I
can think of a way to handle the seeking...

Oh, do we need to randomly seek? Is that how we switch from one tape to
another?

It might be easier to switch to giving each tape its own file...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#25Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim C. Nasby (#24)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 04:50:22PM -0500, Jim C. Nasby wrote:

I had a look at this. At first glance it doesn't seem too hard, except
the whole logtape process kinda gets in the way. If it weren't for the
mark/restore it'd be trivial. Might take a stab at it some time, if I
can think of a way to handle the seeking...

Oh, do we need to randomly seek? Is that how we switch from one tape to
another?

Not seek, mark/restore. As the code describes, sometimes you go back a
tuple. The primary reason, I think, is the final pass: a merge sort
might read the tuples multiple times, so it needs to support it there.

It might be easier to switch to giving each tape its own file...

I don't think it would make much difference. OTOH, if this turns out to
be a win, the tuplestore could have the same optimisation.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#26Greg Stark
gsstark@mit.edu
In reply to: Martijn van Oosterhout (#25)
Re: Compression and on-disk sorting

Martijn van Oosterhout <kleptog@svana.org> writes:

It might be easier to switch to giving each tape its own file...

I don't think it would make much difference. OTOH, if this turns out to
be a win, the tuplestore could have the same optimisation.

Would giving each tape its own file make it easier to allow multiple temporary
sort areas and allow optimizing to avoid seeking when multiple spindles are
available?

--
greg

#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#25)
Re: Compression and on-disk sorting

Martijn van Oosterhout <kleptog@svana.org> writes:

Not seek, mark/restore. As the code describes, sometimes you go back a
tuple. The primary reason, I think, is the final pass: a merge sort
might read the tuples multiple times, so it needs to support it there.

However it'd be possible to tell logtape in advance whether a particular
tape needs to support that, and only apply compression when not; it
would work all the time for intermediate merge passes, and with the
recent executor changes to pass down you-need-to-support-mark flags,
it'd work for the output pass in a lot of cases too.

If you're just trying to get some quick and dirty numbers: do
compression, replace Seek/Tell with PANICs, and only test on plain
sorts no joins ;-)

regards, tom lane

#28Andrew Piskorski
atp@piskorski.com
In reply to: Jim C. Nasby (#21)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:

On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:

Rod Taylor wrote:

I habitually turn off all compression on my Windows boxes, because it's
a performance hit in my experience. Disk is cheap ...

Disk storage is cheap. Disk bandwidth or throughput is very expensive.

Sure, but in my experience using Windows File System compression is not
a win here. Presumably if it were an unqualified win they would have it

Does anyone have time to hack some kind of compression into the on-disk
sort code just to get some benchmark numbers? Unfortunately, doing so is
beyond my meager C ability...

Folks, first of all, I'm in no way an expert on data compression in
RDBMSs, but other databases DO include data compression features and
claim it as a SIGNIFICANT win in I/O reduction.

Looking at performance of the Windows File System compression, etc.,
doesn't make too much sense when there are actual RDBMS compression
implementations to compare to, on the commercial market, in open
source code, and in the academic literature.

Oracle has included "table compression" since 9iR2. They report table
size reductions of 2x to 4x as typical, with proportional reductions
in I/O, and supposedly, usually low to negligible overhead for writes:

http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14211/build_db.htm#sthref289

Decision Speed: Table Compression In Action by Meikel Poess and Hermann Baer (2003):
http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/poess_tablecomp.html

Compressing Data for Space and Speed by Sanjay Mishra (2004):
http://www.oracle.com/technology/oramag/oracle/04-mar/o24tech_data.html

Order For Maximum Compression:
http://oramossoracle.blogspot.com/2005/11/table-compression-order-for-maximum.html

I don't remember whether the current (Open Source) MonetDB includes
table compression or not, but they've published papers with lots of
interesting detail on the compression and other high-performance OLAP
features in their latest (unreleased) "X100" MonetDB research
codebase:

http://monetdb.cwi.nl/
http://homepages.cwi.nl/~mk/MonetDB/
http://sourceforge.net/projects/monetdb/
ftp://ftp.research.microsoft.com/pub/debull/A05june/issue1.htm

Now, the docs and papers above are all focused on query performance;
they say nothing directly about using compression for on-disk
sorts. But I'd bet that similar rules of thumb will apply in both
cases.

The main tricks seem to be: One, EXTREMELY lightweight compression
schemes - basically table lookups designed to be as CPU-friendly as
possible. Two, keep the data compressed in RAM as well so that you can
also cache more of the data, and indeed keep it compressed until
as late in the CPU processing pipeline as possible.

A corollary of that is: forget compression schemes like gzip - it
reduces data size nicely but is far too slow on the CPU to be
particularly useful in improving overall throughput rates.

Note, I have not really tested ANY of the above myself; your mileage
may well vary from what I recall from those various articles...

--
Andrew Piskorski <atp@piskorski.com>
http://www.piskorski.com/

#29Greg Stark
gsstark@mit.edu
In reply to: Andrew Piskorski (#28)
Re: Compression and on-disk sorting

Andrew Piskorski <atp@piskorski.com> writes:

The main tricks seem to be: One, EXTREMELY lightweight compression
schemes - basically table lookups designed to be as CPU-friendly as
possible. Two, keep the data compressed in RAM as well so that you can
also cache more of the data, and indeed keep it compressed until
as late in the CPU processing pipeline as possible.

A corollary of that is: forget compression schemes like gzip - it
reduces data size nicely but is far too slow on the CPU to be
particularly useful in improving overall throughput rates.

There are some very fast decompression algorithms:

http://www.oberhumer.com/opensource/lzo/

I think most of the mileage from "lookup tables" would be better implemented
at a higher level by giving tools to data modellers that let them achieve
denser data representations. Things like convenient enum data types, 1-bit
boolean data types, short integer data types, etc.
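
As a toy illustration of the space difference (purely conceptual, nothing
to do with any actual on-disk format):

#include <stdio.h>
#include <stdint.h>

typedef struct PackedRow
{
    unsigned int status : 4;    /* enum-ish column with up to 16 labels */
    unsigned int active : 1;    /* 1-bit boolean */
    unsigned int qty    : 11;   /* small integer, 0..2047 */
} PackedRow;

typedef struct WideRow
{
    int32_t status;             /* the same data as full-width integers */
    int32_t active;
    int32_t qty;
} WideRow;

int
main(void)
{
    printf("packed: %zu bytes, wide: %zu bytes per row\n",
           sizeof(PackedRow), sizeof(WideRow));
    return 0;
}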

--
greg

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#29)
Re: Compression and on-disk sorting

Greg Stark <gsstark@mit.edu> writes:

Andrew Piskorski <atp@piskorski.com> writes:

A corollary of that is: forget compression schemes like gzip - it
reduces data size nicely but is far too slow on the CPU to be
particularly useful in improving overall throughput rates.

There are some very fast decompression algorithms:

AFAICS the only sane choice here is to use
src/backend/utils/adt/pg_lzcompress.c, on the grounds that (1) it's
already in the backend, and (2) data compression in general is such a
minefield of patents that we'd be foolish to expose ourselves in more
than one direction.

Certainly, if you can't prototype a convincing performance win using
that algorithm, it's unlikely to be worth anyone's time to look harder.

regards, tom lane

#31Albe Laurenz
all@adv.magwien.gv.at
In reply to: Tom Lane (#30)
Re: Compression and on-disk sorting

Andrew Piskorski wrote:

Rod Taylor wrote:

Disk storage is cheap. Disk bandwidth or throughput is very expensive.

Oracle has included "table compression" since 9iR2. They report table
size reductions of 2x to 4x as typical, with proportional reductions
in I/O, and supposedly, usually low to negligible overhead for writes:

[...]

The main tricks seem to be: One, EXTREMELY lightweight compression
schemes - basically table lookups designed to be as CPU-friendly as
possible. Two, keep the data compressed in RAM as well so that you can
also cache more of the data, and indeed keep it compressed until
as late in the CPU processing pipeline as possible.

Oracle's compression seems to work as follows:
- At the beginning of each data block, there is a 'lookup table'
containing frequently used values in table entries (of that block).
- This lookup table is referenced from within the block.
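
A toy sketch of the general shape of that scheme - a per-block dictionary
of distinct values, with each row entry holding a small index into it.
This is not Oracle's actual format (which, as noted later in this thread,
is covered by a patent application); all names here are invented:

#include <stdio.h>
#include <string.h>

#define MAX_DICT    16          /* distinct values we can hold per block */
#define MAX_ROWS    64          /* row slots per block */

typedef struct Block
{
    const char *dict[MAX_DICT];     /* block-local lookup table */
    int         ndict;
    unsigned char rows[MAX_ROWS];   /* each row stores a dictionary index */
    int         nrows;
} Block;

/* Return the dictionary index for a value, adding it if new. */
static int
dict_lookup(Block *b, const char *value)
{
    for (int i = 0; i < b->ndict; i++)
        if (strcmp(b->dict[i], value) == 0)
            return i;
    if (b->ndict == MAX_DICT)
        return -1;              /* dictionary full: caller starts a new block */
    b->dict[b->ndict] = value;
    return b->ndict++;
}

int
main(void)
{
    /* Many rows sharing a "date" value collapse to one dictionary entry. */
    const char *dates[] = { "2006-05-17", "2006-05-17", "2006-05-16" };
    Block       block = { .ndict = 0, .nrows = 0 };

    for (int i = 0; i < 3; i++)
    {
        int idx = dict_lookup(&block, dates[i]);

        block.rows[block.nrows++] = (unsigned char) idx;
    }

    printf("%d rows, %d distinct values in the block dictionary\n",
           block.nrows, block.ndict);
    return 0;
}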

There is a White Paper that describes the algorithm and contains
praise for the effects:
http://www.oracle.com/technology/products/bi/pdf/o9ir2_compression_performance_twp.pdf

Oracle does not compress tables by default.
This is what they have to say about it:

Table compression should be used with highly redundant data, such as
tables with many foreign keys. You should avoid compressing tables with
much update or other DML activity. Although compressed tables or
partitions are updatable, there is some overhead in updating these
tables, and high update activity may work against compression by causing
some space to be wasted.

Yours,
Laurenz Albe

#32Martijn van Oosterhout
kleptog@svana.org
In reply to: Albe Laurenz (#31)
Re: Compression and on-disk sorting

On Wed, May 17, 2006 at 09:45:35AM +0200, Albe Laurenz wrote:

Oracle's compression seems to work as follows:
- At the beginning of each data block, there is a 'lookup table'
containing frequently used values in table entries (of that block).
- This lookup table is referenced from within the block.

Clever idea, pity we can't use it (what's the bet it's patented?). I'd
wager anything beyond simple compression is patented by someone.

The biggest issue is really that once postgres reads a block from disk
and uncompresses it, this block will be much larger than 8K. Somehow
you have to arrange storage for this.

I have some ideas, but as Tom says, we should go for the quick and
dirty numbers first, to determine if it's even worth doing.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#33Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#30)
Re: Compression and on-disk sorting

On Wed, May 17, 2006 at 12:03:15AM -0400, Tom Lane wrote:

AFAICS the only sane choice here is to use
src/backend/utils/adt/pg_lzcompress.c, on the grounds that (1) it's
already in the backend, and (2) data compression in general is such a
minefield of patents that we'd be foolish to expose ourselves in more
than one direction.

Unfortunately, the interface provided by pg_lzcompress.c is probably
insufficient for this purpose. You want to be able to compress tuples
as they get inserted and start a new block once the output reaches a
certain size. pg_lzcompress.c only has the options compress-whole-block
and decompress-whole-block.

zlib allows you to compress as the data comes along, keeping an eye on
the output buffer while you do it. For an initial test, using zlib
directly would probably be easier. If it works out we can look into
alternatives.
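
A minimal sketch of that streaming usage: feed data through deflate() and
write out the output buffer as it fills. write_block() here is a stand-in
for whatever would actually dump a block to the temp file.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define BLCKSZ 8192

static char outbuf[BLCKSZ];

static void
write_block(const char *buf, size_t len)
{
    fwrite(buf, 1, len, stdout);    /* stand-in for the logtape write */
}

static void
compress_chunk(z_stream *zs, const char *data, size_t len, int flush)
{
    zs->next_in = (Bytef *) data;
    zs->avail_in = (uInt) len;

    do
    {
        zs->next_out = (Bytef *) outbuf;
        zs->avail_out = BLCKSZ;

        if (deflate(zs, flush) == Z_STREAM_ERROR)
        {
            fprintf(stderr, "deflate failed\n");
            exit(1);
        }
        write_block(outbuf, BLCKSZ - zs->avail_out);
    } while (zs->avail_out == 0);   /* loop until deflate stops filling the buffer */
}

int
main(void)
{
    z_stream    zs;
    const char  tuple[] = "some tuple data, repeated ... ";
    int         i;

    memset(&zs, 0, sizeof(zs));
    if (deflateInit(&zs, 3) != Z_OK)    /* think gzip -3, as in the patch */
        return 1;

    for (i = 0; i < 1000; i++)
        compress_chunk(&zs, tuple, sizeof(tuple) - 1, Z_NO_FLUSH);

    compress_chunk(&zs, "", 0, Z_FINISH);   /* flush the remaining bytes */
    deflateEnd(&zs);
    return 0;
}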

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#34Andrew Piskorski
atp@piskorski.com
In reply to: Greg Stark (#29)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 11:48:21PM -0400, Greg Stark wrote:

There are some very fast decompression algorithms:

http://www.oberhumer.com/opensource/lzo/

Sure, and for some tasks in PostgreSQL perhaps it would be useful.
But at least as of July 2005, a Sandor Heman, one of the MonetDB guys,
had looked at zlib, bzlib2, lzrw, and lzo, and claimed that:

"... in general, it is very unlikely that we could achieve any
bandwidth gains with these algorithms. LZRW and LZO might increase
bandwidth on relatively slow disk systems, with bandwidths up to
100MB/s, but this would induce high processing overheads, which
interferes with query execution. On a fast disk system, such as our
350MB/s 12 disk RAID, all the generic algorithms will fail to achieve
any speedup."

http://www.google.com/search?q=MonetDB+LZO+Heman&btnG=Search
http://homepages.cwi.nl/~heman/downloads/msthesis.pdf

I think most of the mileage from "lookup tables" would be better implemented
at a higher level by giving tools to data modellers that let them achieve
denser data representations. Things like convenient enum data types, 1-bit
boolean data types, short integer data types, etc.

Things like enums and 1 bit booleans certainly could be useful, but
they cannot take advantage of duplicate values across multiple rows at
all, even if 1000 rows have the exact same value in their "date"
column and are all in the same disk block, right?

Thus I suspect that the exact opposite is true: a good table
compression scheme would render special denser data types largely
redundant and obsolete.

Good table compression might be a lot harder to do, of course.
Certainly Oracle's implementation of it had some bugs which made it
difficult to use reliably in practice (in certain circumstances
updates could fail, or if not fail perhaps have pathological
performance), bugs which are supposed to be fixed in 10.2.0.2, which
was only released within the last few months.

--
Andrew Piskorski <atp@piskorski.com>
http://www.piskorski.com/

#35Zeugswetter Andreas DCP SD
ZeugswetterA@spardat.at
In reply to: Andrew Piskorski (#34)
Re: Compression and on-disk sorting

Certainly, if you can't prototype a convincing performance win using
that algorithm, it's unlikely to be worth anyone's time to
look harder.

That should be easily possible with LZO. It would need to be a lib that
we can optionally link to (--with-lzo), since the lib is GPL.

LZO even allows for in-place decompression and overlapping compression.

Andreas

#36Hannu Krosing
hannu@skype.net
In reply to: Zeugswetter Andreas DCP SD (#35)
Re: Compression and on-disk sorting

On one fine day, Wed, 2006-05-17 at 12:20, Zeugswetter Andreas DCP SD wrote:

Certainly, if you can't prototype a convincing performance win using
that algorithm, it's unlikely to be worth anyone's time to
look harder.

That should be easily possible with LZO. It would need to be a lib that
we can optionally link to (--with-lzo), since the lib is GPL.

LZO even allows for in-place decompression and overlapping compression.

Does being GPL also automatically imply that it is patent-free, so that
we could re-implement it under BSD license if it gives good results?

------------
Hannu

#37Zeugswetter Andreas DCP SD
ZeugswetterA@spardat.at
In reply to: Hannu Krosing (#36)
Re: Compression and on-disk sorting

Unfortunately, the interface provided by pg_lzcompress.c is probably
insufficient for this purpose. You want to be able to compress tuples
as they get inserted and start a new block once the output reaches a

I don't think anything that compresses single tuples without context is
going to be a win under realistic circumstances.

I would at least compress whole pages. Allow a max ratio of 1:n,
have the pg buffer cache be uncompressed, and only compress on write
(the filesystem cache then holds compressed pages).

The tricky part is predicting whether a tuple still fits in an n*8k
uncompressed, 8k compressed page, but since LZO is fast you might even
test it in corner cases (probably logic that also needs to be in the
available page free-space calculation).
Choosing a good n is also tricky; probably 2 (or 3?) is good.

You probably also want to always keep the header part of the page
uncompressed.

Andreas

#38Jonah H. Harris
jonah.harris@gmail.com
In reply to: Martijn van Oosterhout (#32)
Re: Compression and on-disk sorting

On 5/17/06, Martijn van Oosterhout <kleptog@svana.org> wrote:

Clever idea, pity we can't use it (what's the bet it's patented?). I'd
wager anything beyond simple compression is patented by someone.

Oracle's patent application 20040054858 covers the method itself
including the process for storing and retrieving compressed data.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris@enterprisedb.com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/

#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#32)
Re: Compression and on-disk sorting

Martijn van Oosterhout <kleptog@svana.org> writes:

Clever idea, pity we can't use it (what's the bet it's patented?). I'd
wager anything beyond simple compression is patented by someone.

You're in for a rude awakening: even "simple compression" is anything
but simple. As I said, it's a minefield of patents. I recall reading a
very long statement by one of the zlib developers (Jean-loup Gailly, I
think) explaining exactly how they had threaded their way through that
minefield, and why they were different enough from half-a-dozen
similar-looking patented methods to not infringe any of them.

I feel fairly confident that zlib is patent-free, first because they did
their homework and second because they've now been out there and highly
visible for a good long time without getting sued. I've got no such
confidence in any other random algorithm you might choose --- in fact,
I'm not at all sure that pg_lzcompress.c is safe. If we were
aggressively trying to avoid patent risks we might well consider
dropping pg_lzcompress.c and using zlib exclusively.

regards, tom lane

#40Jim C. Nasby
jnasby@pervasive.com
In reply to: Greg Stark (#26)
Re: Compression and on-disk sorting

On Tue, May 16, 2006 at 06:48:25PM -0400, Greg Stark wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

It might be easier to switch to giving each tape its own file...

I don't think it would make much difference. OTOH, if this turns out to
be a win, the tuplestore could have the same optimisation.

Would giving each tape its own file make it easier to allow multiple temporary
sort areas and allow optimizing to avoid seeking when multiple spindles are
available?

Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem. And
even after all that work, there's not much that says it would perform
better than a simple RAID10.

What *might* make sense would be to provide two locations for pgsql_tmp,
because a lot of operations in there involve reading and writing at the
same time:

Read from heap while writing tapes to pgsql_tmp
read from tapes while writing final version to pgsql_tmp

There's probably some gain to be had by writing the final version to a
tablespace other than the default one (which is where pgsql_tmp would
be, I think). But recent changes in -HEAD mean that the second step is
now only performed in certain cases.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#41Jim C. Nasby
jnasby@pervasive.com
In reply to: Andrew Piskorski (#28)
Re: Compression and on-disk sorting

If we're going to consider table-level compression, ISTM the logical
first step is to provide greater control over TOASTing; namely
thresholds for when to compress and/or go to external storage that can
be set on a per-field or at least per-table basis.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#42Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#32)
Re: Compression and on-disk sorting

On Wed, May 17, 2006 at 10:06:04AM +0200, Martijn van Oosterhout wrote:

On Wed, May 17, 2006 at 09:45:35AM +0200, Albe Laurenz wrote:

Oracle's compression seems to work as follows:
- At the beginning of each data block, there is a 'lookup table'
containing frequently used values in table entries (of that block).
- This lookup table is referenced from within the block.

Clever idea, pity we can't use it (what's the bet it's patented?). I'd
wager anything beyond simple compression is patented by someone.

The biggest issue is really that once postgres reads a block from disk
and uncompresses it, this block will be much larger than 8K. Somehow
you have to arrange storage for this.

It's entirely possible that the best performance would be found from not
un-compressing blocks when putting them into shared_buffers, though.
That would mean you'd "only" have to deal with compression when pulling
individual tuples. Simple, right? :)
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#40)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

What *might* make sense would be to provide two locations for pgsql_tmp,
because a lot of operations in there involve reading and writing at the
same time:

Read from heap while writing tapes to pgsql_tmp
read from tapes while writing final version to pgsql_tmp

Note that a large part of the reason for the current logtape.c design
is to avoid requiring 2X or more disk space to sort X amount of data.
AFAICS, any design that does the above will put us right back in the 2X
regime. That's a direct, measurable penalty; it'll take more than
handwaving arguments to convince me we should change it in pursuit of
unquantified speed benefits.

regards, tom lane

#44Jim C. Nasby
jnasby@pervasive.com
In reply to: Tom Lane (#43)
Re: Compression and on-disk sorting

On Wed, May 17, 2006 at 11:38:05AM -0400, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

What *might* make sense would be to provide two locations for pgsql_tmp,
because a lot of operations in there involve reading and writing at the
same time:

Read from heap while writing tapes to pgsql_tmp
read from tapes while writing final version to pgsql_tmp

Note that a large part of the reason for the current logtape.c design
is to avoid requiring 2X or more disk space to sort X amount of data.
AFAICS, any design that does the above will put us right back in the 2X
regime. That's a direct, measurable penalty; it'll take more than
handwaving arguments to convince me we should change it in pursuit of
unquantified speed benefits.

I certainly agree that there's no reason to make this change without
testing, but if you've got enough spindles lying around to actually
make use of this I suspect that requiring 2X the disk space to sort X
won't bother you.

Actually, I suspect in most cases it won't matter; I don't think people
make a habit of trying to sort their entire database. :) But we'd want
to protect for the oddball cases... yech.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#44)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

On Wed, May 17, 2006 at 11:38:05AM -0400, Tom Lane wrote:

Note that a large part of the reason for the current logtape.c design
is to avoid requiring 2X or more disk space to sort X amount of data.

Actually, I suspect in most cases it won't matter; I don't think people
make a habit of trying to sort their entire database. :)

Well, some years ago we used something like 4X space to sort X amount of
data (this is the native behavior of 7-way polyphase merge, it turns out)
and we got yelled at. Which is what prompted the writing of logtape.c.
Maybe disk space has gotten so cheap since then that it no longer
matters ... but I suspect the nature of DB applications is that people
are always pushing the envelope of what their hardware can do.

regards, tom lane

#46Rod Taylor
pg@rbt.ca
In reply to: Jim C. Nasby (#44)
Re: Compression and on-disk sorting

Actually, I suspect in most cases it won't matter; I don't think people
make a habit of trying to sort their entire database. :) But we'd want
to protect for the oddball cases... yech.

I can make query result sets that are far larger than the database
itself.

create table fat_table_with_few_tuples(fat_status_id serial primary key,
fat_1 text, fat_2 text);

create table thin_table_with_many_tuples(fat_status_id integer
references fat_table_with_few_tuples, thin_1 integer, thin_2 integer);

SELECT * FROM thin_table_with_many_tuples NATURAL JOIN
fat_table_with_few_tuples order by fat_1, thin_1, thin_2, fat_2;

I would be asking the folks trying to use PostgreSQL for data
warehousing what their opinion is. A few fact tables in an audit query
could easily result in a very large amount of temporary diskspace being
required.

--

#47Martijn van Oosterhout
kleptog@svana.org
In reply to: Andrew Dunstan (#19)
1 attachment(s)
[PATCH] Compression and on-disk sorting

Pursuant to the discussions currently on -hackers, here's a patch that
uses zlib to compress the tapes as they go to disk. I default to
compression level 3 (think gzip -3).

Please speed test all you like; I *think* it's bug free, but you never
know.

Outstanding questions:

- I use zlib because the builtin pg_lzcompress can't do what zlib does.
Here we set up input and output buffers and zlib will process as much as
it can (input empty or output full). This means no marshalling is
required. We can compress the whole file without having it in memory.

- zlib allocates memory for compression and decompression, I don't know
how much. However, it allocates via the postgres mcxt system so it
shouldn't too hard to find out. Simon pointed out that we'll need to
track this because we might allow hundreds of tapes.

- Each tape is compressed as one long compressed stream. Currently no
seeking is allowed, so only sorts, no joins! (As Tom said, quick and
dirty numbers). This should show this possibility in its best light
but if we want to support seeking we're going to need to change that.
Maybe no compression on the last pass?

- It's probable that the benefits are strongly correlated to the speed
of your disk subsystem. We need to measure this effect. I can't
accurately measure this because my compiler doesn't inline any of the
functions in tuplesort.c.

In my test with a compression ratio of around 100-to-1, on 160MB of data
with tiny work_mem on my 5-year-old laptop, it speeds it up by 60%, so
it's obviously not a complete waste of time. Of course, YMMV :)

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

Attachments:

compress-sort.patch (text/plain; charset=us-ascii)
Index: src/backend/Makefile
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/Makefile,v
retrieving revision 1.114
diff -c -r1.114 Makefile
*** src/backend/Makefile	5 Jan 2006 01:56:29 -0000	1.114
--- src/backend/Makefile	17 May 2006 16:16:20 -0000
***************
*** 25,31 ****
  LIBS := $(filter-out -lpgport, $(LIBS))
  
  # The backend doesn't need everything that's in LIBS, however
! LIBS := $(filter-out -lz -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
  
  ##########################################################################
  
--- 25,31 ----
  LIBS := $(filter-out -lpgport, $(LIBS))
  
  # The backend doesn't need everything that's in LIBS, however
! LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS))
  
  ##########################################################################
  
Index: src/backend/utils/sort/logtape.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/sort/logtape.c,v
retrieving revision 1.21
diff -c -r1.21 logtape.c
*** src/backend/utils/sort/logtape.c	7 Mar 2006 23:46:24 -0000	1.21
--- src/backend/utils/sort/logtape.c	17 May 2006 16:16:21 -0000
***************
*** 80,91 ****
--- 80,96 ----
  #include "storage/buffile.h"
  #include "utils/logtape.h"
  
+ #include "zlib.h"
+ 
  /*
   * Block indexes are "long"s, so we can fit this many per indirect block.
   * NB: we assume this is an exact fit!
   */
  #define BLOCKS_PER_INDIR_BLOCK	((int) (BLCKSZ / sizeof(long)))
  
+ /* compression level to use for zlib */
+ #define COMPRESSION_LEVEL    3
+ 
  /*
   * We use a struct like this for each active indirection level of each
   * logical tape.  If the indirect block is not the highest level of its
***************
*** 131,136 ****
--- 136,142 ----
  	 * reading.
  	 */
  	char	   *buffer;			/* physical buffer (separately palloc'd) */
+ 	z_stream	zstream;		/* zlib stream */
  	long		curBlockNumber; /* this block's logical blk# within tape */
  	int			pos;			/* next read/write position in buffer */
  	int			nbytes;			/* total # of valid bytes in buffer */
***************
*** 509,514 ****
--- 515,532 ----
  }
  
  
+ static void *
+ lts_zalloc(void *cxt, unsigned int items, unsigned int size)
+ {
+ 	return MemoryContextAlloc( cxt, items*size );
+ }
+ 
+ static void 
+ lts_zfree(void *cxt, void *ptr)
+ {
+ 	return pfree( ptr );
+ }
+ 
  /*
   * Create a set of logical tapes in a temporary underlying file.
   *
***************
*** 556,561 ****
--- 574,586 ----
  		lt->curBlockNumber = 0L;
  		lt->pos = 0;
  		lt->nbytes = 0;
+ 		
+ 		memset( &lt->zstream, 0, sizeof(lt->zstream) );
+ 		lt->zstream.zalloc = lts_zalloc;
+ 		lt->zstream.zfree = lts_zfree;
+ 		lt->zstream.opaque = CurrentMemoryContext;
+ 		
+ 		deflateInit( &lt->zstream, COMPRESSION_LEVEL );  /* Fast compression */
  	}
  	return lts;
  }
***************
*** 627,633 ****
  				 void *ptr, size_t size)
  {
  	LogicalTape *lt;
- 	size_t		nthistime;
  
  	Assert(tapenum >= 0 && tapenum < lts->nTapes);
  	lt = &lts->tapes[tapenum];
--- 652,657 ----
***************
*** 643,650 ****
  		lt->indirect->nextup = NULL;
  	}
  
! 	while (size > 0)
  	{
  		if (lt->pos >= BLCKSZ)
  		{
  			/* Buffer full, dump it out */
--- 667,678 ----
  		lt->indirect->nextup = NULL;
  	}
  
! 	lt->zstream.next_in = ptr;
! 	lt->zstream.avail_in = size;
! 	
! 	while (lt->zstream.avail_in > 0)
  	{
+ 		int err;
  		if (lt->pos >= BLCKSZ)
  		{
  			/* Buffer full, dump it out */
***************
*** 661,680 ****
  			lt->nbytes = 0;
  		}
  
! 		nthistime = BLCKSZ - lt->pos;
! 		if (nthistime > size)
! 			nthistime = size;
! 		Assert(nthistime > 0);
  
! 		memcpy(lt->buffer + lt->pos, ptr, nthistime);
  
  		lt->dirty = true;
! 		lt->pos += nthistime;
! 		if (lt->nbytes < lt->pos)
! 			lt->nbytes = lt->pos;
! 		ptr = (void *) ((char *) ptr + nthistime);
! 		size -= nthistime;
  	}
  }
  
  /*
--- 689,741 ----
  			lt->nbytes = 0;
  		}
  
! 		lt->zstream.next_out = lt->buffer + lt->pos;
! 		lt->zstream.avail_out = BLCKSZ - lt->pos;
! 		err = deflate( &lt->zstream, Z_NO_FLUSH );
! 		lt->pos = BLCKSZ - lt->zstream.avail_out;
! 		lt->nbytes = lt->pos;
! 		lt->dirty = true;
! 		
! 		if( err != Z_OK )
! 			elog(ERROR, "deflate returned error: %d\n", err );
! 	}
! }
  
! /* Flushes any data still present in the buffers of the compression module. 
!  * Only necessary for writing (deflating), for reading we don't particularly
!  * care about unflushed data */
  
+ static void 
+ ltsFlush( LogicalTapeSet *lts, LogicalTape *lt)
+ {
+ 	int err;
+ 	
+ 	if( !lt->dirty )
+ 		return;
+ 		
+ 	lt->zstream.next_in = NULL;
+ 	lt->zstream.avail_in = 0;
+ 
+ 	for(;;)
+ 	{
+ 		lt->zstream.next_out = lt->buffer + lt->pos;
+ 		lt->zstream.avail_out = BLCKSZ - lt->pos;
+ 		err = deflate( &lt->zstream, Z_FINISH );
  		lt->dirty = true;
! 
! 		if( err == Z_STREAM_END )
! 			break;
! 		Assert(err == Z_OK);
! 		Assert(lt->zstream.avail_out == 0);
! 		ltsDumpBuffer(lts, lt);
! 		lt->numFullBlocks++;
! 		lt->curBlockNumber++;
! 		lt->nbytes = 0;
! 		lt->pos = 0;
  	}
+ 	lt->pos = BLCKSZ - lt->zstream.avail_out;
+ 	lt->nbytes = lt->pos;
+ 	return;
  }
  
  /*
***************
*** 692,697 ****
--- 753,761 ----
  	Assert(tapenum >= 0 && tapenum < lts->nTapes);
  	lt = &lts->tapes[tapenum];
  
+ 	lt->zstream.next_in = NULL;
+ 	lt->zstream.avail_in = 0;
+ 
  	if (!forWrite)
  	{
  		if (lt->writing)
***************
*** 702,711 ****
--- 766,781 ----
  			 * (destructive) read.
  			 */
  			if (lt->dirty)
+ 			{
+ 				ltsFlush(lts, lt);
  				ltsDumpBuffer(lts, lt);
+ 			}
  			lt->lastBlockBytes = lt->nbytes;
  			lt->writing = false;
  			datablocknum = ltsRewindIndirectBlock(lts, lt->indirect, false);
+ 			
+ 			deflateEnd( &lt->zstream );
+ 			inflateInit( &lt->zstream );
  		}
  		else
  		{
***************
*** 715,720 ****
--- 785,791 ----
  			 */
  			Assert(lt->frozen);
  			datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
+ 			inflateReset( &lt->zstream );
  		}
  		/* Read the first block, or reset if tape is empty */
  		lt->curBlockNumber = 0L;
***************
*** 761,766 ****
--- 832,840 ----
  		lt->curBlockNumber = 0L;
  		lt->pos = 0;
  		lt->nbytes = 0;
+ 		
+ 		inflateEnd( &lt->zstream );
+ 		deflateInit( &lt->zstream, COMPRESSION_LEVEL );
  	}
  }
  
***************
*** 774,788 ****
  				void *ptr, size_t size)
  {
  	LogicalTape *lt;
! 	size_t		nread = 0;
! 	size_t		nthistime;
  
  	Assert(tapenum >= 0 && tapenum < lts->nTapes);
  	lt = &lts->tapes[tapenum];
  	Assert(!lt->writing);
  
! 	while (size > 0)
  	{
  		if (lt->pos >= lt->nbytes)
  		{
  			/* Try to load more data into buffer. */
--- 848,865 ----
  				void *ptr, size_t size)
  {
  	LogicalTape *lt;
! //	size_t		nread = 0;
  
  	Assert(tapenum >= 0 && tapenum < lts->nTapes);
  	lt = &lts->tapes[tapenum];
  	Assert(!lt->writing);
  
! 	lt->zstream.next_out = ptr;
! 	lt->zstream.avail_out = size;
! 	
! 	while (lt->zstream.avail_out > 0)
  	{
+ 		int err;
  		if (lt->pos >= lt->nbytes)
  		{
  			/* Try to load more data into buffer. */
***************
*** 801,821 ****
  			if (lt->nbytes <= 0)
  				break;			/* EOF (possible here?) */
  		}
! 
! 		nthistime = lt->nbytes - lt->pos;
! 		if (nthistime > size)
! 			nthistime = size;
! 		Assert(nthistime > 0);
! 
! 		memcpy(ptr, lt->buffer + lt->pos, nthistime);
! 
! 		lt->pos += nthistime;
! 		ptr = (void *) ((char *) ptr + nthistime);
! 		size -= nthistime;
! 		nread += nthistime;
  	}
  
! 	return nread;
  }
  
  /*
--- 878,895 ----
  			if (lt->nbytes <= 0)
  				break;			/* EOF (possible here?) */
  		}
! 		
! 		lt->zstream.next_in = lt->buffer + lt->pos;
! 		lt->zstream.avail_in = lt->nbytes - lt->pos;
! 		err = inflate( &lt->zstream, Z_NO_FLUSH );
! 		lt->pos = lt->nbytes - lt->zstream.avail_in;
! 		/* Here we can detect the end of the compressed stream,
! 		 * can't do anything with that information though */
! 		if( err != Z_OK && err != Z_STREAM_END)
! 			elog(ERROR, "Inflate returned error: %d\n", err);
  	}
  
! 	return size - lt->zstream.avail_out;
  }
  
  /*
***************
*** 844,850 ****
--- 918,927 ----
  	 * partial indirect blocks, rewind for nondestructive read.
  	 */
  	if (lt->dirty)
+ 	{
+ 		ltsFlush(lts, lt);
  		ltsDumpBuffer(lts, lt);
+ 	}
  	lt->lastBlockBytes = lt->nbytes;
  	lt->writing = false;
  	lt->frozen = true;
***************
*** 853,858 ****
--- 930,939 ----
  	lt->curBlockNumber = 0L;
  	lt->pos = 0;
  	lt->nbytes = 0;
+ 			
+ 	deflateEnd( &lt->zstream );
+ 	inflateInit( &lt->zstream );
+ 
  	if (datablocknum != -1L)
  	{
  		ltsReadBlock(lts, datablocknum, (void *) lt->buffer);
***************
*** 879,884 ****
--- 960,967 ----
  	long		nblocks;
  	int			newpos;
  
+ 	elog(PANIC,"LogicalTapeBackspace on compressed file not yet implemented");
+ 	
  	Assert(tapenum >= 0 && tapenum < lts->nTapes);
  	lt = &lts->tapes[tapenum];
  	Assert(lt->frozen);
***************
*** 944,949 ****
--- 1027,1034 ----
  {
  	LogicalTape *lt;
  
+ 	elog(PANIC,"LogicalTapeSeek on compressed file not yet implemented");
+ 	
  	Assert(tapenum >= 0 && tapenum < lts->nTapes);
  	lt = &lts->tapes[tapenum];
  	Assert(lt->frozen);
***************
*** 1007,1012 ****
--- 1092,1099 ----
  {
  	LogicalTape *lt;
  
+ 	elog(PANIC,"LogicalTapeTell on compressed file not yet implemented");
+ 	
  	Assert(tapenum >= 0 && tapenum < lts->nTapes);
  	lt = &lts->tapes[tapenum];
  	*blocknum = lt->curBlockNumber;
#48Martijn van Oosterhout
kleptog@svana.org
In reply to: Martijn van Oosterhout (#47)
Re: Compression and on-disk sorting

For all those people not subscribed to -patches (it should appear in the
archive soon), I just posted a patch there implementing zlib compression
for logtape.c. If people have test machines for speed-testing this sort
of stuff, please have at it.

You can also download it here:
http://svana.org/kleptog/temp/compress-sort.patch

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#49Greg Stark
gsstark@mit.edu
In reply to: Jim C. Nasby (#40)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem.

I think the way you do this is simply by documenting that the admin should
create precisely one temp area per physical spindle (or raid array).

--
greg

#50Greg Stark
gsstark@mit.edu
In reply to: Andrew Piskorski (#34)
Re: Compression and on-disk sorting

Andrew Piskorski <atp@piskorski.com> writes:

Things like enums and 1 bit booleans certainly could be useful, but
they cannot take advantage of duplicate values across multiple rows at
all, even if 1000 rows have the exact same value in their "date"
column and are all in the same disk block, right?

That's an interesting direction to go in. Generic algorithms would still help
in that case: since the identical value would occur more frequently than other
values, it would be encoded with a smaller symbol. But there's going to be a
limit to how compressed it can get the data.

The ideal way to handle the situation you're describing would be to interleave
the tuples so that you have all 1000 values of the first column, followed by
all 1000 values of the second column and so on. Then you run a generic
algorithm on this and it achieves very high compression rates since there are
a lot of repeating patterns.

I don't see how you build a working database with data in this form however.
For example, a single insert would require updating small pieces of data
across the entire table. Perhaps there's some middle ground with interleaving
the tuples within a single compressed page, or something like that?

--
greg

#51Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#47)
Re: [PATCH] Compression and on-disk sorting

On Wed, 2006-05-17 at 18:17 +0200, Martijn van Oosterhout wrote:

Persuant to the discussions currently on -hackers, here's a patch that
uses zlib to compress the tapes as they go to disk. I default to the
compression level 3 (think gzip -3).

Please speed test all you like, I *think* it's bug free, but you never
know.

Outstanding questions:

- I use zlib because the builtin pg_lzcompress can't do what zlib does.
Here we setup input and output buffers and zlib will process as much as
it can (input empty or output full). This means no marshalling is
required. We can compress the whole file without having it in memory.

Licence is BSD-compatible and it works the way we need it to work.

- Each tape is compressed as one long compressed stream. Currently no
seeking is allowed, so only sorts, no joins! (As tom said, quick and
dirty numbers). This should show this possibility in its best light
but if we want to support seeking we're going to need to change that.
Maybe no compression on the last pass?

We should be able to do this without significant loss of compression by
redefining the lts block size to be 32k. That's the size of the
look-back window anyhow, so compressing the whole stream doesn't get us
much more.

- It's probable that the benefits are strongly correlated to the speed
of your disk subsystem. We need to measure this effect. I can't
accuratly measure this because my compiler doesn't inline any of the
functions in tuplesort.c.

Please make sure any tests have trace_sort = on.

In my test of a compression ratio around 100-to-1, on 160MB of data
with tiny work_mem on my 5 year old laptop, it speeds it up by 60% so
it's obviously not a complete waste of time. Ofcourse, YMMV :)

Sounds good. Well done.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Stark (#50)
Re: Compression and on-disk sorting

Greg Stark <gsstark@mit.edu> writes:

The ideal way to handle the situation you're describing would be to interleave
the tuples so that you have all 1000 values of the first column, followed by
all 1000 values of the second column and so on. Then you run a generic
algorithm on this and it achieves very high compression rates since there are
a lot of repeating patterns.

It's not obvious to me that that yields a form more compressible than
what we have now. As long as the previous value is within the lookback
window, an LZ-style compressor will still be able to use it. More
importantly, the layout you describe would be unable to take advantage
of any cross-column correlation, which in real data is likely to be a
useful property for compression.

regards, tom lane

#53Jim C. Nasby
jnasby@pervasive.com
In reply to: Greg Stark (#49)
Re: Compression and on-disk sorting

On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem.

I think the way you do this is simply by documenting that the admin should
create precisely one temp area per physical spindle (or raid array).

And you still need some way to tell PostgreSQL about all of that.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#54Jim C. Nasby
jnasby@pervasive.com
In reply to: Rod Taylor (#46)
Re: Compression and on-disk sorting

On Wed, May 17, 2006 at 12:16:13PM -0400, Rod Taylor wrote:

Actually, I suspect in most cases it won't matter; I don't think people
make a habit of trying to sort their entire database. :) But we'd want
to protect for the oddball cases... yech.

I can make query result sets that are far larger than the database
itself.

create table fat_table_with_few_tuples(fat_status_id serial primary key,
fat_1 text, fat_2 text);

create table thin_table_with_many_tuples(fat_status_id integer
references fat_table_with_few_tuples, thin_1 integer, thin_2 integer);

SELECT * FROM thin_table_with_many_tuples NATURAL JOIN
fat_table_with_few_tuples order by fat_1, thin_1, thin_2, fat_2;

I would be asking the folks trying to use PostgreSQL for data
warehousing what their opinion is. A few fact tables in an audit query
could easily result in a very large amount of temporary diskspace being
required.

Note my last sentence: we'd need to provide for cases where this was a
problem. How much that would complicate the code, I don't know...

This is another case where someone (with more skills than me) would need
to hack the backend enough to be able to test it and see how big a
performance gain there was.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#55Hannu Krosing
hannu@skype.net
In reply to: Jim C. Nasby (#41)
Re: Compression and on-disk sorting

On Wed, 2006-05-17 at 10:01, Jim C. Nasby wrote:

If we're going to consider table-level compression, ISTM the logical
first step is to provide greater control over TOASTing; namely
thresholds for when to compress and/or go to external storage that can
be set on a per-field or at least per-table basis.

Also, we would get a lot of "compression" if we could get rid of the index
on the toast table and use the ctid directly.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#56Jim C. Nasby
jnasby@pervasive.com
In reply to: Hannu Krosing (#55)
Re: Compression and on-disk sorting

On Wed, May 17, 2006 at 10:55:19PM +0300, Hannu Krosing wrote:

On Wed, 2006-05-17 at 10:01, Jim C. Nasby wrote:

If we're going to consider table-level compression, ISTM the logical
first step is to provide greater control over TOASTing; namely
thresholds for when to compress and/or go to external storage that can
be set on a per-field or at least per-table basis.

also, we would get a lot of "compression", if we could get rid of index
on toast table, and use the ctid directly.

It'd also be damn handy to be able to ditch the 2-pass vacuum
requirement. I've often wondered if treating the toast table as a real
table was overkill; for example, it should be possible to include WAL
information for it with the parent table. That plus using ctid as a
reference would hopefully allow for removing a lot of overhead from it.
Presumably all the MVCC info could go, since that would be taken care of
by the parent table.

Of course the downside is that it'd mean a different set of code for
handling toast storage...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#57Greg Stark
gsstark@mit.edu
In reply to: Jim C. Nasby (#53)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem.

I think the way you do this is simply by documenting that the admin should
create precisely one temp area per physical spindle (or raid array).

And you still need some way to tell PostgreSQL about all of that.

No, my point was that you tell Postgres how many spindles you have and where to
find them by creating precisely one temp area on each spindle. It then knows
that it should strive to maximize sequential reads within one temp area and
expect switching between temp areas (which represent multiple spindles) to be
better than multiplexing multiple tapes within a single temp area (which
represents a single spindle).

--
greg

#58Jim C. Nasby
jnasby@pervasive.com
In reply to: Greg Stark (#57)
Re: Compression and on-disk sorting

On Wed, May 17, 2006 at 05:44:22PM -0400, Greg Stark wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem.

I think the way you do this is simply by documenting that the admin should
create precisely one temp area per physical spindle (or raid array).

And you still need some way to tell PostgreSQL about all of that.

No, my point was that you tell Postges how many spindles you have and where to
find them by creating precisely one temp area on each spindle. It then knows

Which means we need all the interface bits to be able to tell PostgreSQL
where every single temp storage area is. Presumably much of the
tablespace mechanism could be used for this, but it's still a bunch of
work. And you can't just say "I have 8 spindles", you have to tell
PostgreSQL exactly where to put each temporary area (unless you just
have it put one on every tablespace you have defined).

that it should strive to maximize sequential reads within one temp area and
expect switching between temp areas (which represent multiple spindles) to be
better than multiplexing multiple tapes within a single temp area (which
represents a single spindle).

Which adds yet more complexity to all the code that uses the temp area.
And as others have brought up, you still have to allow for the case when
splitting all of this out into multiple files means you end up using
substantially more disk space. That further drives up the complexity.

My point is that unless someone shows that there's a non-trivial
performance gain here, it's not going to happen.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#59Greg Stark
gsstark@mit.edu
In reply to: Jim C. Nasby (#58)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Which means we need all the interface bits to be able to tell PostgreSQL
where every single temp storage area is. Presumably much of the
tablespace mechanism could be used for this, but it's still a bunch of
work. And you can't just say "I have 8 spindles", you have to tell
PostgreSQL exactly where to put each temporary area (unless you just
have it put one on every tablespace you have defined).

Yes, if you have more than one temporary area you definitely need a way to
tell Postgres where to put them since obviously they won't be in the postgres
base directory. But I think that's all Postgres really needs to know.

One could imagine a more complex version where Postgres has meta information
about the bandwidth and seek penalty for each sort area separately. Presumably
also for each table space. But that's a whole lot more complexity than
Postgres's current cost model.

that it should strive to maximize sequential reads within one temp area and
expect switching between temp areas (which represent multiple spindles) to be
better than multiplexing multiple tapes within a single temp area (which
represents a single spindle).

Which adds yet more complexity to all the code that uses the temp area.
And as others have brought up, you still have to allow for the case when
splitting all of this out into multiple files means you end up using
substantially more disk space. That further drives up the complexity.

You also have to consider that it won't always be a benefit to spread the sort
over multiple sort areas. If there's only one sort going on and you can reach
a 1:1 ratio between tapes and spindles then I think it would be a huge
benefit. Effectively boosting the sort speed by random_page_cost.

But if you don't have as many spindles as your algorithm needs tapes
then it's unclear which to multiplex down and whether you gain any benefit
once you're multiplexing over simply using a single sort area.

And worse, if there are multiple sorts going on in the system then you're not
going to get sequential access even if you have multiple sort areas available.
If you have N sort areas and N sorts are going on then you're probably better
off multiplexing each one down to a single sort area and letting them each
proceed without interfering with each other rather than having each one hog
all the sort areas and forcing the OS to do the multiplexing blindly.

My point is that unless someone shows that there's a non-trivial
performance gain here, it's not going to happen.

I think two extreme cases are well worth pursuing:

1) Use n sort areas for n tapes making everything purely sequential access.
That would be useful for large DSS systems where large sorts are running
and i/o bandwidth is high for sequential access. That gives effectively a
random_page_cost speed boost.

2) Use the current algorithm unchanged but have each sort use a different sort
area in some sort of round-robin fashion. That helps the OLTP-type
environment (ignoring for the moment that OLTP environments really ought
not be doing disk sorts) where people complain about unreliable execution
times more than slow execution times. If you can provide enough sort areas
then it would remove one big way that other concurrent queries impact the
execution time of queries.

--
greg

#60Florian Weimer
fw@deneb.enyo.de
In reply to: Jim C. Nasby (#5)
Re: Compression and on-disk sorting

* Jim C. Nasby:

The problem is that it seems like there's never enough ability to clue
the OS in on what the application is trying to accomplish. For a long
time we didn't have a background writer, because the OS should be able
to flush things out on it's own before checkpoint. Now there's talk of a
background reader, because backends keep stalling on waiting on disk IO.

I've recently seen this on one of our test systems -- neither CPU nor
disk I/O were maxed out.

Have you considered using asynchronous I/O? Maybe it results in less
complexity and fewer context switches than a background reader.

#61Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#51)
Re: [PATCH] Compression and on-disk sorting

On Wed, May 17, 2006 at 06:38:47PM +0100, Simon Riggs wrote:

- Each tape is compressed as one long compressed stream. Currently no
seeking is allowed, so only sorts, no joins! (As tom said, quick and
dirty numbers). This should show this possibility in its best light
but if we want to support seeking we're going to need to change that.
Maybe no compression on the last pass?

We should be able to do this without significant loss of compression by
redefining the lts block size to be 32k. That's the size of the
look-back window anyhow, so compressing the whole stream doesn't get us
much more.

The major problem is that looking back costs significantly more with
compression. If you need to look back into the previous compressed
block, you need to decompress the whole previous block. The simple
solution would be to keep a buffer of the last 32KB. Another possibility
would be to have a limit of 32KB of uncompressed data per block and
just remember the whole previous block.

Seek/Tell is not the hard part, it's the backspace. It would probably
be smart to make backspace call Seek, rather than trying to be smart
about it.

Another issue is that currently the compression code is completely
within logtape.c. To be able to seek backwards efficiently you might
have to change the abstraction so that it knows about the records from
tuplesort.c. That's much more work, which needs a lot more thinking.

Besides, we still haven't got any reports yet that this actually
provides a benefit on any machine less than five years old. Anyone out
there doing tests?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#62Zeugswetter Andreas DCP SD
ZeugswetterA@spardat.at
In reply to: Martijn van Oosterhout (#61)
Re: Compression and on-disk sorting

1) Use n sort areas for n tapes making everything purely sequential
access.

Testing I did some time ago has shown that if the IO block size is large
enough (256k), it does not really matter that much if the blocks are at
random locations. I think that is still true for current model disks.

So unless we parallelize, it is imho sufficient to see to it that we
write (and read) large enough blocks with single calls. This also has no
problem in highly concurrent scenarios, where you do not have enough
spindles.

Andreas

#63Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#61)
Re: [PATCH] Compression and on-disk sorting

On Thu, 2006-05-18 at 10:31 +0200, Martijn van Oosterhout wrote:

On Wed, May 17, 2006 at 06:38:47PM +0100, Simon Riggs wrote:

- Each tape is compressed as one long compressed stream. Currently no
seeking is allowed, so only sorts, no joins! (As tom said, quick and
dirty numbers). This should show this possibility in its best light
but if we want to support seeking we're going to need to change that.
Maybe no compression on the last pass?

We should be able to do this without significant loss of compression by
redefining the lts block size to be 32k. That's the size of the
look-back window anyhow, so compressing the whole stream doesn't get us
much more.

The major problem is looking back costs significantly more with
compression. If you need to look back into the previous compressed
block, you need to decompress the whole previous block. The simple
solution would be to keep a buffer of the last 32KB. Another posibility
would be to have a limit of 32KB of uncompressed data per block and
just remember the whole previous block.

Just do a Z_FULL_FLUSH when you hit end of block. That way all blocks
will be independent of each other and you can rewind as much as you
like. We can choose the block size to be 32KB or even 64KB, there's no
dependency there, just memory allocation. It should be pretty simple to
make the block size variable at run time, so we can select it according
to how many files and how much memory we have.
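
For illustration, a rough sketch of the Z_FULL_FLUSH idea (this is not
code from the patch; compress_chunk(), emit() and FLUSH_EVERY are invented
names, and the chunk size is only an example):

#include <zlib.h>

#define FLUSH_EVERY 32768       /* uncompressed bytes per independent chunk */

/* Compress "len" bytes from "src", handing output to emit(); returns 0 on
 * success.  "since_flush" tracks uncompressed bytes since the last full
 * flush. */
static int
compress_chunk(z_stream *zs, const unsigned char *src, size_t len,
               size_t *since_flush,
               void (*emit) (const unsigned char *, size_t))
{
    unsigned char out[8192];
    int         flush;

    zs->next_in = (Bytef *) src;
    zs->avail_in = (uInt) len;
    *since_flush += len;

    /*
     * Z_FULL_FLUSH discards the compressor's history at the chunk boundary,
     * so a reader can start inflating at the next chunk without any of the
     * preceding stream; Z_NO_FLUSH keeps full compression inside a chunk.
     */
    flush = (*since_flush >= FLUSH_EVERY) ? Z_FULL_FLUSH : Z_NO_FLUSH;
    if (flush == Z_FULL_FLUSH)
        *since_flush = 0;

    do
    {
        zs->next_out = out;
        zs->avail_out = sizeof(out);
        if (deflate(zs, flush) == Z_STREAM_ERROR)
            return -1;
        emit(out, sizeof(out) - zs->avail_out);
    } while (zs->avail_out == 0);   /* repeat until deflate has drained */

    return 0;
}

The price of Z_FULL_FLUSH is a few extra bytes of output and a reset of
the compressor's history at each chunk boundary; once the chunk is about
the size of the 32KB look-back window, little compression should be lost.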

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#64Simon Riggs
simon@2ndquadrant.com
In reply to: Jim C. Nasby (#22)
Re: Compression and on-disk sorting

On Tue, 2006-05-16 at 15:42 -0500, Jim C. Nasby wrote:

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:

In any case, my curiousity is aroused, so I'm currently benchmarking
pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
some benchmarks with pg_tmp compressed.

Results: http://jim.nasby.net/bench.log

As expected, compressing $PGDATA/base was a loss. But compressing
pgsql_tmp and then doing some disk-based sorts did show an improvement,
from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
a worst-case scenario. On the other hand, XP's compression algorithm
appears to be pretty aggressive, as it cut the size of the on-disk sort
file from almost 700MB to 82MB. There's probably gains to be had from a
different compression algorithm.

Can you re-run these tests using "SELECT aid FROM accounts ..."?
"SELECT 1 ..." is of course highly compressible.

I also note that the compressed file fits within memory and may not even
have been written out at all. That's good, but this sounds like the very
best case of what we can hope to achieve. We need to test a whole range
of cases to see if it is generally applicable, or only in certain cases
- and if so which ones.

Would you be able to write up some extensive testing of Martijn's patch?
He's followed through on your idea and written it (on -patches now...)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#65Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#63)
Re: [PATCH] Compression and on-disk sorting

On Thu, May 18, 2006 at 11:34:36AM +0100, Simon Riggs wrote:

Just do a Z_FULL_FLUSH when you hit end of block. That way all blocks
will be independent of each other and you can rewind as much as you
like. We can choose the block size to be 32KB or even 64KB, there's no
dependency there, just memory allocation. It should be pretty simple to
make the block size variable at run time, so we can select it according
to how many files and how much memory we have.

If you know you don't need to seek, there's no need to block the data
at all, one long stream is fine. So that case is easy.

For seeking, you need more work. I assume you're talking about 32KB
input block sizes (uncompressed). The output blocks will be of variable
size. These compressed blocks would be divided up into fixed 8K blocks
and written to disk.

To allow seeking, you'd have to do something like a header containing:

- length of previous compressed block
- length of this compressed block
- offset of block in uncompressed bytes (from beginning of tape)

This would allow you to scan backwards and forwards. If you want to be
able to jump to anywhere in the file, you may be better off storing the
file offsets (which would be implicit if the blocks are 32KB) in the
indirect blocks, using a search to find the right block, and then a
header in the block to find the offset.
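
Purely as an illustration (the struct and field names here are invented,
not anything in the patch), such a per-block header might look like:

#include <stdint.h>

typedef struct CompressedBlockHeader
{
    uint32_t    prev_clen;      /* compressed length of the previous block */
    uint32_t    this_clen;      /* compressed length of this block */
    uint64_t    uncomp_offset;  /* offset of this block's first byte within
                                 * the uncompressed tape */
} CompressedBlockHeader;

Walking forward uses this_clen, walking backward uses prev_clen, and a
seek to an arbitrary uncompressed offset finds the block whose
uncomp_offset covers the target, inflates just that block, and skips the
remaining bytes inside it.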

Still, I'd like some evidence of benefits before writing up something
like that.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

Show quoted text

From each according to his ability. To each according to his ability to litigate.

#66Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Greg Stark (#59)
Re: Compression and on-disk sorting

Uh, TODO already has:

o %Add a GUC variable to control the tablespace for temporary objects
and sort files

It could start with a random tablespace from a supplied list and
cycle through the list.

Do we need to add to this?

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#67Jim C. Nasby
jnasby@pervasive.com
In reply to: Simon Riggs (#64)
Re: Compression and on-disk sorting

On Thu, May 18, 2006 at 11:34:51AM +0100, Simon Riggs wrote:

On Tue, 2006-05-16 at 15:42 -0500, Jim C. Nasby wrote:

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:

In any case, my curiousity is aroused, so I'm currently benchmarking
pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
some benchmarks with pg_tmp compressed.

Results: http://jim.nasby.net/bench.log

As expected, compressing $PGDATA/base was a loss. But compressing
pgsql_tmp and then doing some disk-based sorts did show an improvement,
from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
a worst-case scenario. On the other hand, XP's compression algorithm
appears to be pretty aggressive, as it cut the size of the on-disk sort
file from almost 700MB to 82MB. There's probably gains to be had from a
different compression algorithm.

Can you re-run these tests using "SELECT aid FROM accounts ..."
"SELECT 1 ... " is of course highly compressible.

I also note that the compressed file fits within memory and may not even
have been written out at all. That's good, but this sounds like the very
best case of what we can hope to achieve. We need to test a whole range
of cases to see if it is generally applicable, or only in certain cases
- and if so which ones.

Would you be able to write up some extensive testing of Martijn's patch?
He's followed through on your idea and written it (on -patches now...)

Yes, I'm working on that. I'd rather test his stuff than XP's
compression anyway.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#68Jim C. Nasby
jnasby@pervasive.com
In reply to: Zeugswetter Andreas DCP SD (#62)
Re: Compression and on-disk sorting

On Thu, May 18, 2006 at 10:57:16AM +0200, Zeugswetter Andreas DCP SD wrote:

1) Use n sort areas for n tapes making everything purely sequential
access.

Testing I did some time ago has shown that if the IO block size is large
enough (256k), it does not really matter that much if the blocks are at
random locations. I think that is still true for current model disks.

So unless we parallelize, it is imho sufficient to see to it that we
write (and read) large enough blocks with single calls. This also has no
problem in highly concurrent scenarios, where you do not have enough
spindles.

AFAIK logtape currently reads in much less than 256k blocks. Of course
if you get lucky you'll read from one tape for some time before
switching to another, which should have a sort-of similar effect if the
drives aren't very busy with other things.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#66)
Re: Compression and on-disk sorting

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Uh, TODO already has:

o %Add a GUC variable to control the tablespace for temporary objects
and sort files

It could start with a random tablespace from a supplied list and
cycle through the list.

Do we need to add to this?

The list bit strikes me as lily-gilding, or at least designing features
well beyond proven need.

regards, tom lane

#70Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#69)
Re: Compression and on-disk sorting

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Uh, TODO already has:

o %Add a GUC variable to control the tablespace for temporary objects
and sort files

It could start with a random tablespace from a supplied list and
cycle through the list.

Do we need to add to this?

The list bit strikes me as lily-gilding, or at least designing features
well beyond proven need.

I have seen other databases do the round-robin idea, so it is worth
testing.

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#71Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim C. Nasby (#68)
Re: Compression and on-disk sorting

On Thu, May 18, 2006 at 11:22:46AM -0500, Jim C. Nasby wrote:

AFAIK logtape currently reads in much less than 256k blocks. Of course
if you get lucky you'll read from one tape for some time before
switching to another, which should have a sort-of similar effect if the
drives aren't very busy with other things.

Logtape works with BLCKSZ blocks. Whether it's advisable or not, I
don't know. One thing I'm noticing with this compression-in-logtape is
that the memory cost per tape increases considerably. Currently we rely
heavily on the OS to cache for us.

Let's say we buffer 256KB per tape, and we assume a large sort with many
tapes: you're going to blow all your work_mem on buffers. If using
compression uses more memory so that you can't have as many tapes and
thus need multiple passes, well, we need to test if this is still a
win.
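
As a rough worked example (the numbers are purely illustrative): at 256KB
of buffering per tape, a 256-tape merge needs 256 x 256KB = 64MB for tape
buffers alone, before counting any per-tape zlib state, which can easily
dwarf the work_mem the sort was supposed to fit in.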

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#72Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#71)
Re: Compression and on-disk sorting

On Thu, May 18, 2006 at 08:32:10PM +0200, Martijn van Oosterhout wrote:

On Thu, May 18, 2006 at 11:22:46AM -0500, Jim C. Nasby wrote:

AFAIK logtape currently reads in much less than 256k blocks. Of course
if you get lucky you'll read from one tape for some time before
switching to another, which should have a sort-of similar effect if the
drives aren't very busy with other things.

Logtape works with BLCKSZ blocks. Whether it's advisable or not, I
don't know. One thing I'm noticing with this compression-in-logtape is
that the memory cost per tape increases considerably. Currently we rely
heavily on the OS to cache for us.

Lets say we buffer 256KB per tape, and we assume a large sort with many
tapes, you're going to blow all your work_mem on buffers. If using
compression uses more memory so that you can't have as many tapes and
thus need multiple passes, well, we need to test if this is still a
win.

So you're sticking with 8K blocks on disk, after compression? And then
decompressing blocks as they come in?

Actually, I guess the amount of memory used for zlib's lookback buffer
(or whatever they call it) could be pretty substantial, and I'm not sure
if there would be a way to combine that across all tapes.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#73Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#72)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Actually, I guess the amount of memory used for zlib's lookback buffer
(or whatever they call it) could be pretty substantial, and I'm not sure
if there would be a way to combine that across all tapes.

But there's only one active write tape at a time. My recollection of
zlib is that compression is memory-hungry but decompression not so much,
so it seems like this shouldn't be a huge deal.

regards, tom lane

#74Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#61)
Re: [PATCH] Compression and on-disk sorting

On Thu, May 18, 2006 at 10:31:03AM +0200, Martijn van Oosterhout wrote:

Besides, we still havn't got any reports yet that this actually
provides a benefit on any machine less than five years ago. Anyone out
there doing tests?

Yes. I'm compiling the patched binaries right now, but the baseline
testing I've got so far is at
http://jim.nasby.net/misc/compress_sort.txt.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#75Jim C. Nasby
jnasby@pervasive.com
In reply to: Tom Lane (#73)
Re: Compression and on-disk sorting

On Thu, May 18, 2006 at 04:55:17PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Actually, I guess the amount of memory used for zlib's lookback buffer
(or whatever they call it) could be pretty substantial, and I'm not sure
if there would be a way to combine that across all tapes.

But there's only one active write tape at a time. My recollection of
zlib is that compression is memory-hungry but decompression not so much,
so it seems like this shouldn't be a huge deal.

It seems more appropriate to discuss results here, rather than on
-patches...

http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
I've run into a slight problem in that even at a compression level of
-3, zlib is cutting the on-disk size of sorts by 25x. So my pgbench sort
test with scale=150 that was producing a 2G on-disk sort is now
producing a 80M sort, which obviously fits in memory. And cuts sort
times by more than half.

So, if nothing else, it looks like compression is definitely a win if it
means you can now fit the sort within the disk cache. While that doesn't
sound like something very worthwhile, it essentially extends work_mem
from a fraction of memory to up to ~25x memory.

I'm currently loading up a pgbench database with a scaling factor of
15000; hopefully I'll have results from that testing in the morning.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#76Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim C. Nasby (#75)
Re: Compression and on-disk sorting

On Thu, May 18, 2006 at 10:02:44PM -0500, Jim C. Nasby wrote:

http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
I've run into a slight problem in that even at a compression level of
-3, zlib is cutting the on-disk size of sorts by 25x. So my pgbench sort
test with scale=150 that was producing a 2G on-disk sort is now
producing a 80M sort, which obviously fits in memory. And cuts sort
times by more than half.

I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
unbelievable. What's in the table? It would seem to imply that our
tuple format is far more compressible than we expected.

Do you have any stats on CPU usage? Memory usage?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#77Simon Riggs
simon@2ndquadrant.com
In reply to: Jim C. Nasby (#74)
Re: [PATCH] Compression and on-disk sorting

On Thu, 2006-05-18 at 17:10 -0500, Jim C. Nasby wrote:

On Thu, May 18, 2006 at 10:31:03AM +0200, Martijn van Oosterhout wrote:

Besides, we still havn't got any reports yet that this actually
provides a benefit on any machine less than five years ago. Anyone out
there doing tests?

Yes. I'm compiling the patched binaries right now, but the baseline
testing I've got so far is at
http://jim.nasby.net/misc/compress_sort.txt.

Looks a very good improvement. Well done Martijn/Jim.

The next question is: does it apply in all cases?

We need to test "SELECT aid from accounts" also, or some other scenarios
where the data is as uncompressible as possible. We should also try this
on a table where the rows have been inserted by different transactions,
so that the xmin value isn't the same for all tuples. We need to see if
there are cases where this causes a performance regression rather than
gain.

We still need to examine memory usage. Jim's testing so far is done on
already sorted data, so only uses 1 out of 715 tapes. If we did utilise
a much larger number of tapes, we could face difficulties with the
memory used during decompression.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#78Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#77)
Re: [PATCH] Compression and on-disk sorting

On Fri, May 19, 2006 at 12:43:05PM +0100, Simon Riggs wrote:

We need to test "SELECT aid from accounts" also, or some other scenarios
where the data is as uncompressible as possible. We should also try this
on a table where the rows have been inserted by different transactions,
so that the xmin value isn't the same for all tuples. We need to see if
there are cases where this causes a performance regression rather than
gain.

AFAIK, the xmin is not included in the sort. The only thing is maybe
the ctid, which is used in updates. Actually, repeating the above tests
but doing:

select xmin,xmax,cmin,cmax,ctid,* from <blah>

would be interesting. Even that would be compressible though.

Thinking about it, we're storing and compressing HeapTuples, right?
There are a few fields there that would compress really well:

t_tableOid
t_len (if there are no variable-length fields)
t_natts

Even t_hoff and the bitmask if the patterns of NULLs don't vary much
between rows.

We still need to examine memory usage. Jim's testing so far is done on
already sorted data, so only uses 1 out of 715 tapes. If we did utilise
a much larger number of tapes, we could face difficulties with the
memory used during decompression.

I'm going to see if I can make some changes to track maximum memory
usage per tape.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#79Luke Lonergan
LLonergan@greenplum.com
In reply to: Martijn van Oosterhout (#78)
Re: Compression and on-disk sorting

Jim,

http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
I've run into a slight problem in that even at a compression
level of -3, zlib is cutting the on-disk size of sorts by
25x. So my pgbench sort test with scale=150 that was
producing a 2G on-disk sort is now producing a 80M sort,
which obviously fits in memory. And cuts sort times by more than half.

When you're ready, we can test this on some other interesting cases and
on fast hardware.

BTW - external sorting is *still* 4x slower than a popular commercial DBMS
(PCDB) on real workloads when full rows are used in queries. The final
results we had after the last bit of sort improvements were limited to
cases where only the sort column was used in the query; for that case the
improved external sort code was as fast as PCDB provided lots of work_mem
is used, but when the whole contents of the row are consumed (as with
TPC-H and in many real-world cases) the performance is still far slower.

So, compression of the tuples may be just what we're looking for.

- Luke

#80Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#76)
Re: Compression and on-disk sorting

Martijn van Oosterhout <kleptog@svana.org> writes:

I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
unbeleiveable. What's in the table?

Yeah, I'd tend to question the test data being used. gzip does not do
that well on typical text (especially not at the lower settings we'd
likely want to use).

regards, tom lane

#81Martijn van Oosterhout
kleptog@svana.org
In reply to: Tom Lane (#80)
Re: Compression and on-disk sorting

On Fri, May 19, 2006 at 09:03:31AM -0400, Tom Lane wrote:

Martijn van Oosterhout <kleptog@svana.org> writes:

I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
unbeleiveable. What's in the table?

Yeah, I'd tend to question the test data being used. gzip does not do
that well on typical text (especially not at the lower settings we'd
likely want to use).

However, postgres tables are very highly compressible; 10-to-1 is not
that uncommon. pg_proc and pg_index compress by that much, for example.
Indexes compress even more (a few on my system compress 25-to-1, but
that could just be slack space, the record being 37-to-1
(pg_constraint_conname_nsp_index)).

The only table on my test system over 32KB that doesn't reach 2-to-1
compression with gzip -3 is one of the toast tables.

So getting 25-to-1 is a lot, but possibly not that extreme.
pg_statistic, which is about as close to random data as you're going to
get on a postgres system, compresses 5-to-1.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#82Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#81)
Re: Compression and on-disk sorting

Martijn van Oosterhout <kleptog@svana.org> writes:

However, postgres tables are very highly compressable, 10-to-1 is not
that uncommon. pg_proc and pg_index compress by that for example.
Indexes compress even more (a few on my system compress 25-to-1 but
that could just be slack space, the record being 37-to-1
(pg_constraint_conname_nsp_index)).

Anything containing a column of type "name" will compress amazingly well
because of all the padding spaces. I don't think that's representative
of user data though ... except maybe for the occasional novice using
"char(255)" ...

regards, tom lane

#83Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#76)
Re: Compression and on-disk sorting

On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:

On Thu, May 18, 2006 at 10:02:44PM -0500, Jim C. Nasby wrote:

http://jim.nasby.net/misc/compress_sort.txt is preliminary results.
I've run into a slight problem in that even at a compression level of
-3, zlib is cutting the on-disk size of sorts by 25x. So my pgbench sort
test with scale=150 that was producing a 2G on-disk sort is now
producing a 80M sort, which obviously fits in memory. And cuts sort
times by more than half.

I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
unbeleiveable. What's in the table? It would seem to imply that our
tuple format is far more compressable than we expected.

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

If needs be, I could try the patch against http://stats.distributed.net,
assuming that it would apply to REL_8_1.

Do you have any stats on CPU usage? Memory usage?

I've only been taking a look at vmstat from time to time, and I have yet
to see the machine get CPU-bound. Haven't really paid much attention to
memory. Is there anything in particular you're looking for? I can log
vmstat for the next set of runs (with a scaling factor of 10000). I plan
on doing those runs tonight...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#84Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#83)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:

I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
unbeleiveable. What's in the table? It would seem to imply that our
tuple format is far more compressable than we expected.

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

It's worse than that: IIRC what passes through a heaptuple sort are
tuples manufactured by heap_form_tuple, which will have consistently
zeroed header fields. However, the above isn't very helpful since the
rest of us have no idea what that "accounts" table contains. How wide
is the tuple data, and what's in it?

(This suggests that we might try harder to strip unnecessary header info
from tuples being written to tape inside tuplesort.c. I think most of
the required fields could be reconstructed given the TupleDesc.)

regards, tom lane

#85Hannu Krosing
hannu@skype.net
In reply to: Tom Lane (#84)
Re: Compression and on-disk sorting

On Fri, 2006-05-19 at 14:53, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:

I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
unbeleiveable. What's in the table? It would seem to imply that our
tuple format is far more compressable than we expected.

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

It's worse than that: IIRC what passes through a heaptuple sort are
tuples manufactured by heap_form_tuple, which will have consistently
zeroed header fields. However, the above isn't very helpful since the
rest of us have no idea what that "accounts" table contains. How wide
is the tuple data, and what's in it?

Was he not using pgbench data?

(This suggests that we might try harder to strip unnecessary header info
from tuples being written to tape inside tuplesort.c. I think most of
the required fields could be reconstructed given the TupleDesc.)

I guess that tape files compress better than the average table because
they are sorted, and thus at least a little more repetitive than the
rest. If there are varlen types, then they usually also bring an
abundance of small 4-byte integers, which should compress better than
4:1, maybe a lot better.
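
That intuition is easy to sanity-check with stock zlib outside the patch;
a minimal sketch that compresses a run of sorted 4-byte integers at level
3 and prints the ratio (build with "cc sortzip.c -lz"; the file name is
arbitrary):

/* Standalone check, not the sort patch: compress one million sorted
 * 4-byte integers with stock zlib at level 3 and print the ratio. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int
main(void)
{
    unsigned int n = 1000000;
    unsigned int *src = malloc(n * sizeof(unsigned int));
    uLong       srclen = n * sizeof(unsigned int);
    uLongf      dstlen;
    Bytef      *dst;
    unsigned int i;

    for (i = 0; i < n; i++)
        src[i] = i;                        /* sorted, highly repetitive */

    dstlen = compressBound(srclen);
    dst = malloc(dstlen);
    if (compress2(dst, &dstlen, (const Bytef *) src, srclen, 3) != Z_OK)
        return 1;

    printf("%lu -> %lu bytes (%.1fx)\n",
           (unsigned long) srclen, (unsigned long) dstlen,
           (double) srclen / (double) dstlen);
    return 0;
}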

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#86Martijn van Oosterhout
kleptog@svana.org
In reply to: Hannu Krosing (#85)
Re: Compression and on-disk sorting

On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

Was he not using pgbench data?

Hmm, so there were only 3 integer fields and one varlena structure which
was always empty. Prepend that with a tuple header whose fields are
mostly blank, or at least repeated, and yes, I can see how we might get
25-to-1 compression.

Maybe we need to change pgbench so that it puts random text in the
filler field; that would at least put some strain on the compression
algorithm...

I guess that tape files compress better than the average table because
they are sorted, and thus at least a little more repetitive than the
rest. If there are varlen types, then they usually also bring an
abundance of small 4-byte integers, which should compress better than
4:1, maybe a lot better.

Hmm, that makes sense. That also explains the 37-to-1 compression I was
seeing on indexes :).

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#87Jim C. Nasby
jnasby@pervasive.com
In reply to: Hannu Krosing (#85)
Re: Compression and on-disk sorting

On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:

On Fri, 2006-05-19 at 14:53, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

On Fri, May 19, 2006 at 09:29:03AM +0200, Martijn van Oosterhout wrote:

I'm seeing 250,000 blocks being cut down to 9,500 blocks. That's almost
unbelievable. What's in the table? It would seem to imply that our
tuple format is far more compressible than we expected.

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

It's worse than that: IIRC what passes through a heaptuple sort are
tuples manufactured by heap_form_tuple, which will have consistently
zeroed header fields. However, the above isn't very helpful since the
rest of us have no idea what that "accounts" table contains. How wide
is the tuple data, and what's in it?

Was he not using pgbench data?

I am. For reference:

bench=# \d accounts
Table "public.accounts"
Column | Type | Modifiers
----------+---------------+-----------
aid | integer | not null
bid | integer |
abalance | integer |
filler | character(84) |

(This suggests that we might try harder to strip unnecessary header info
from tuples being written to tape inside tuplesort.c. I think most of
the required fields could be reconstructed given the TupleDesc.)

I guess that tape files compress better than the average table because
they are sorted, and thus at least a little more repetitive than the
rest. If there are varlen types, then they usually also bring an
abundance of small 4-byte integers, which should compress better than
4:1, maybe a lot better.

If someone wants to provide a patch that strips out the headers, I can
test that as well.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#88Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#86)
Re: Compression and on-disk sorting

On Fri, May 19, 2006 at 09:29:44PM +0200, Martijn van Oosterhout wrote:

On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

Was he not using pgbench data?

Hmm, so there were only 3 integer fields and one varlena structure which
was always empty. Prepend that with a tuple header whose fields are
mostly blank, or at least repeated, and yes, I can see how we might get
25-to-1 compression.

Maybe we need to change pgbench so that it puts random text in the
filler field; that would at least put some strain on the compression
algorithm...

Wow, I thought there was actually something in there...

True random data wouldn't be such a great test either; what would
probably be best is a set of random words, since in real life you're
unlikely to have truly random data.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#89Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#88)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

True random data wouldn't be such a great test either; what would
probably be best is a set of random words, since in real life you're
unlikely to have truly random data.

True random data would provide worst-case compression behavior, so
we'd want to try that to find out what the downside is; but we shouldn't
consider it to be the design center.
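
The worst case is just as easy to measure outside the patch; a minimal
companion sketch that feeds compress2() pseudo-random bytes, where the
output should come out slightly larger than the input and never beyond
compressBound() (again, build with -lz):

/* Standalone worst-case check: compress pseudo-random bytes. zlib cannot
 * shrink them, so the output comes out a little larger than the input,
 * bounded by compressBound(). */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int
main(void)
{
    uLong   srclen = 1024 * 1024;
    Bytef  *src = malloc(srclen);
    uLongf  dstlen = compressBound(srclen);
    Bytef  *dst = malloc(dstlen);
    uLong   i;

    srand(12345);
    for (i = 0; i < srclen; i++)
        src[i] = (Bytef) (rand() & 0xff);  /* effectively incompressible */

    if (compress2(dst, &dstlen, src, srclen, 3) != Z_OK)
        return 1;

    printf("in %lu, out %lu, bound %lu\n",
           (unsigned long) srclen, (unsigned long) dstlen,
           (unsigned long) compressBound(srclen));
    return 0;
}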

regards, tom lane

#90Hannu Krosing
hannu@skype.net
In reply to: Jim C. Nasby (#88)
Re: Compression and on-disk sorting

On Fri, 2006-05-19 at 14:57, Jim C. Nasby wrote:

On Fri, May 19, 2006 at 09:29:44PM +0200, Martijn van Oosterhout wrote:

On Fri, May 19, 2006 at 10:02:50PM +0300, Hannu Krosing wrote:

It's just SELECT count(*) FROM (SELECT * FROM accounts ORDER BY bid) a;
If the tape routines were actually storing visibility information, I'd
expect that to be pretty compressible in this case since all the tuples
were presumably created in a single transaction by pgbench.

Was he not using pgbench data?

Hmm, so there were only 3 integer fields and one varlena structure which
was always empty. Prepend that with a tuple header whose fields are
mostly blank, or at least repeated, and yes, I can see how we might get
25-to-1 compression.

Maybe we need to change pgbench so that it puts random text in the
filler field; that would at least put some strain on the compression
algorithm...

Wow, I thought there was actually something in there...

True random data wouldn't be such a great test either; what would
probably be best is a set of random words, since in real life you're
unlikely to have truly random data.

I usually use something like the following for my "random name" tests:

#!/usr/bin/python

import random

words = [line.strip() for line in open('/usr/share/dict/words')]

def make_random_name(min_items, max_items):
    l = []
    for w in range(random.randint(min_items, max_items)):
        l.append(random.choice(words))
    return ' '.join(l)

It gives somewhat justifiable but still quite amusing results:

>>> make_random_name(2,4)
'encroaches Twedy'
>>> make_random_name(2,4)
'annuloida Maiah commends imputatively'
>>> make_random_name(2,4)
'terebral wine-driven pacota'
>>> make_random_name(2,4)
'ballads disenfranchise cabriolets spiny-fruited'

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

#91Martijn van Oosterhout
kleptog@svana.org
In reply to: Jim C. Nasby (#83)
Re: Compression and on-disk sorting

On Fri, May 19, 2006 at 01:39:45PM -0500, Jim C. Nasby wrote:

Do you have any stats on CPU usage? Memory usage?

I've only been taking a look at vmstat from time to time, and I have yet
to see the machine get CPU-bound. Haven't really paid much attention to
memory. Is there anything in particular you're looking for? I can log
vmstat for the next set of runs (with a scaling factor of 10000). I plan
on doing those runs tonight...

I've got some more info on zlibs memory usage:

Compression: 5816 bytes + 256KB buffer = approx 261KB
Decompression: 9512 bytes + 32KB buffer = approx 42KB

As Tom said, you only run one compression at a time, but logtape doesn't
know that. It can only free the compression structures on Rewind or
Freeze, neither of which is run until the merge pass. I don't
understand the algorithm well enough to know whether it's safe to rewind
the old tape in selectnewtape; that would seem to defeat the "freeze if
only one tape" optimisation.
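
If the roughly 256KB of per-stream compression state becomes a problem,
zlib itself offers a knob: deflateInit2() takes windowBits and memLevel
parameters, and per zlib's documentation the deflate state needs about
(1 << (windowBits+2)) + (1 << (memLevel+9)) bytes, i.e. roughly
128KB + 128KB at the defaults, while inflate needs about 1 << windowBits
(the 32KB above). A sketch, not part of the patch, of a smaller-footprint
setup (the inflate side would then need inflateInit2() with at least the
same windowBits):

/* Sketch only, not part of the patch: shrinking zlib's per-stream memory.
 * Per zlib's documentation, deflate needs roughly
 *     (1 << (windowBits + 2)) + (1 << (memLevel + 9))
 * bytes of state, i.e. about 128KB + 128KB at the defaults (15, 8);
 * windowBits = 12 and memLevel = 5 cut that to about 16KB + 16KB, at
 * some cost in compression ratio. */
#include <string.h>
#include <zlib.h>

int
init_small_deflater(z_stream *zs)
{
    memset(zs, 0, sizeof(*zs));            /* NULL zalloc/zfree = zlib defaults */
    return deflateInit2(zs,
                        3,                 /* compression level              */
                        Z_DEFLATED,        /* the only supported method      */
                        12,                /* windowBits: 4KB history window */
                        5,                 /* memLevel: smaller hash tables  */
                        Z_DEFAULT_STRATEGY);
}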

One final thing, with trace_sort=on on my machine I get this with
compression:

LOG: performsort done (except 28-way final merge): CPU 1.48s/7.49u sec elapsed 10.24 sec
LOG: external sort ended, 163 disk blocks used: CPU 1.48s/7.49u sec elapsed 10.30 sec

and without compression:

LOG: performsort done (except 28-way final merge): CPU 2.85s/1.90u sec elapsed 14.76 sec
LOG: external sort ended, 18786 disk blocks used: CPU 2.88s/1.90u sec elapsed 15.70 sec

This indicates an awful lot of I/O waiting, some 60% of the time
without compression. The compression has cut the I/O wait from 10sec to
1.5sec at the expense of 5.5sec of compression time. If you had a
faster compression algorithm (zlib is not that fast) the results would
be even better...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#92Jim C. Nasby
jnasby@pervasive.com
In reply to: Martijn van Oosterhout (#91)
Re: Compression and on-disk sorting

Finally completed testing of a dataset that doesn't fit in memory with
compression enabled. Results are at
http://jim.nasby.net/misc/pgsqlcompression .

Summary:
                work_mem   compressed   not compressed    gain
in-memory          20000        400.1            797.7   49.8%
in-memory           2000        371.4            805.7   53.9%
not in-memory      20000         8537            17436   51.0%
not in-memory       2000         8152            17820   54.3%

I find it very interesting that the gains are identical even when the
tapes should fit in memory. My guess is that for some reason the OS is
flushing those to disk anyway. In fact, watching gstat during a run, I
do see write activity hitting the drives. So if there was some way to
tune that behavior, the in-memory case would probably be much, much
faster. Anyone know FreeBSD well enough to suggest how to change this?
Anyone want to test on linux and see if the results are the same? This
could indicate that it might be advantageous to attempt an in-memory
sort with compressed data before spilling that compressed data to
disk...

As for CPU utilization, it was ~33% with compression and ~13% without.
That tells me that CPU could become a factor if everything was truly in
memory (including the table we were reading from), but if that's the
case there's a good chance that we wouldn't even be switching to an
on-disk sort. If everything isn't in memory then you're likely to be IO
bound anyway...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#93Joshua D. Drake
jd@commandprompt.com
In reply to: Jim C. Nasby (#92)
Re: Compression and on-disk sorting

Jim C. Nasby wrote:

Finally completed testing of a dataset that doesn't fit in memory with
compression enabled. Results are at
http://jim.nasby.net/misc/pgsqlcompression .

Summary:
                work_mem   compressed   not compressed    gain
in-memory          20000        400.1            797.7   49.8%
in-memory           2000        371.4            805.7   53.9%
not in-memory      20000         8537            17436   51.0%
not in-memory       2000         8152            17820   54.3%

I find it very interesting that the gains are identical even when the
tapes should fit in memory. My guess is that for some reason the OS is
flushing those to disk anyway. In fact, watching gstat during a run, I
do see write activity hitting the drives. So if there was some way to
tune that behavior, the in-memory case would probably be much, much
faster. Anyone know FreeBSD well enough to suggest how to change this?
Anyone want to test on linux and see if the results are the same? This
could indicate that it might be advantageous to attempt an in-memory
sort with compressed data before spilling that compressed data to
disk...

I can test it on linux just let me know what you need.

J

As for CPU utilization, it was ~33% with compression and ~13% without.
That tells me that CPU could become a factor if everything was truly in
memory (including the table we were reading from), but if that's the
case there's a good chance that we wouldn't even be switching to an
on-disk sort. If everything isn't in memory then you're likely to be IO
bound anyway...

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

#94Jim C. Nasby
jnasby@pervasive.com
In reply to: Joshua D. Drake (#93)
Re: Compression and on-disk sorting

On Wed, May 24, 2006 at 02:20:43PM -0700, Joshua D. Drake wrote:

Jim C. Nasby wrote:

Finally completed testing of a dataset that doesn't fit in memory with
compression enabled. Results are at
http://jim.nasby.net/misc/pgsqlcompression .

Summary:
                work_mem   compressed   not compressed    gain
in-memory          20000        400.1            797.7   49.8%
in-memory           2000        371.4            805.7   53.9%
not in-memory      20000         8537            17436   51.0%
not in-memory       2000         8152            17820   54.3%

I find it very interesting that the gains are identical even when the
tapes should fit in memory. My guess is that for some reason the OS is
flushing those to disk anyway. In fact, watching gstat during a run, I
do see write activity hitting the drives. So if there was some way to
tune that behavior, the in-memory case would probably be much, much
faster. Anyone know FreeBSD well enough to suggest how to change this?
Anyone want to test on linux and see if the results are the same? This
could indicate that it might be advantageous to attempt an in-memory
sort with compressed data before spilling that compressed data to
disk...

I can test it on linux just let me know what you need.

Actually, when I talked to Larry he mentioned that it'd be worth checking
whether we're doing something like opening the files with O_DIRECT, which
I haven't had a chance to do yet. Might be worth looking at that before
running more tests.
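
For reference, O_DIRECT is just an extra flag to open(); a minimal
illustration of what such an open looks like follows (this is not the
PostgreSQL temp-file code, only the flag under discussion, and O_DIRECT
requires block-aligned I/O buffers):

/* Illustration only: what an O_DIRECT open looks like. This is not the
 * PostgreSQL temp-file code; it just shows the flag being discussed.
 * O_DIRECT bypasses the kernel page cache and requires block-aligned
 * buffers; on Linux it needs _GNU_SOURCE, on FreeBSD it is in <fcntl.h>. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("pgsql_tmp_example", O_RDWR | O_CREAT | O_DIRECT, 0600);

    if (fd < 0)
    {
        perror("open(O_DIRECT)");
        return 1;
    }
    printf("opened fd %d with O_DIRECT\n", fd);
    close(fd);
    return 0;
}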

Anyway, I've posted the patch now as well, and compress_sort.txt has the
commands I was running. Those are just against a plain pgbench database
that's been freshly initialized (i.e., no dead tuples). I just created two
install directories from a checkout of HEAD via --prefix=, one with the
patch and one without. Both hit the same $PGDATA. I've posted the
postgresql.conf as well.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#95Jim C. Nasby
jnasby@pervasive.com
In reply to: Jim C. Nasby (#94)
Re: Compression and on-disk sorting

I've done some more testing with Tom's recently committed changes to
tuplesort.c, which remove the tuple headers from the sort data. It does
about 10% better than compression alone. What's interesting is that the
gain is about 10% regardless of compression, which means compression
isn't helping very much with all the redundant header data; I find that
very odd. And the header data is very redundant:

bench=# select xmin,xmax,cmin,cmax,aid from accounts order by aid limit 1;
  xmin  | xmax | cmin | cmax | aid
--------+------+------+------+-----
 280779 |    0 |    0 |    0 |   1
(1 row)

bench=# select xmin,xmax,cmin,cmax,aid from accounts order by aid desc limit 1;
  xmin  | xmax | cmin | cmax |    aid
--------+------+------+------+-----------
 310778 |    0 |    0 |    0 | 300000000
(1 row)

Makes sense, since pgbench loads the database via a string of COPY commands,
each of which loads 10000 rows.

Something else worth mentioning is that sort performance is worse with
larger work_mem for all cases except the old HEAD, prior to the
tuplesort.c changes. It looks like whatever was done to fix that will
need to be adjusted/rethought pending the outcome of using compression.

In any case, compression certainly seems to be a clear win, at least in
this case. If there's interest, I can test this on some larger hardware,
or if someone wants to produce a patch for pgbench that will load some
kind of real data into accounts.filler, I can test that as well.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#96Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#95)
Re: Compression and on-disk sorting

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Something else worth mentioning is that sort performance is worse with
larger work_mem for all cases except the old HEAD, prior to the
tuplesort.c changes. It looks like whatever was done to fix that will
need to be adjusted/rethought pending the outcome of using compression.

Please clarify. What are you comparing here exactly, and what cases did
you test?

regards, tom lane

#97Jim C. Nasby
jnasby@pervasive.com
In reply to: Tom Lane (#96)
Re: Compression and on-disk sorting

On Fri, May 26, 2006 at 12:35:36PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

Something else worth mentioning is that sort performance is worse with
larger work_mem for all cases except the old HEAD, prior to the
tuplesort.c changes. It looks like whatever was done to fix that will
need to be adjusted/rethought pending the outcome of using compression.

Please clarify. What are you comparing here exactly, and what cases did
you test?

Sorry, forgot to put the URL in:
http://jim.nasby.net/misc/pgsqlcompression/compress_sort.txt

But the meat is:
                                 -- work_mem --
                         Scale    2000     20000
not compressed             150   805.7     797.7
not compressed            3000   17820     17436
compressed                 150   371.4     400.1
compressed                3000    8152      8537
compressed, no headers    3000    7325      7876

Performance degrades with more work_mem any time compression is used. I
thought I had data on just your tuplesort.c change without compression,
but I guess I don't. :( I can run that tonight if desired.

As for the code, the 3 things I've tested are HEAD as of 5/17/06 with no
patches (labeled 'not compressed'); that code with the compression patch
(compressed); and that code with both the compression patch and your
change to tuplesort.c that removes tuple headers from the sorted data
(compressed, no headers).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#98Simon Riggs
simon@2ndquadrant.com
In reply to: Jim C. Nasby (#97)
Re: Compression and on-disk sorting

On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:

But the meat is:
                                 -- work_mem --
                         Scale    2000     20000
not compressed             150   805.7     797.7
not compressed            3000   17820     17436
compressed                 150   371.4     400.1
compressed                3000    8152      8537
compressed, no headers    3000    7325      7876

Since Tom has committed the header-removing patch, we need to test

not compressed, no headers v compressed, no headers

There is a noticeable rise in sort time with increasing work_mem, but
that needs to be weighed against the benefit that in general comes from
using a large heap for the sort. With the data you're using that always
looks like a loss, but that isn't true for all input data orderings.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#99Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#98)
Re: Compression and on-disk sorting

Simon Riggs <simon@2ndquadrant.com> writes:

There is a noticeable rise in sort time with increasing work_mem, but
that needs to be offset from the benefit that in-general comes from
using a large Heap for the sort. With the data you're using that always
looks like a loss, but that isn't true with all input data orderings.

Yeah, these are all the exact same test data, right? We need a bit more
variety in the test cases before drawing any sweeping conclusions.

regards, tom lane

#100Jim C. Nasby
jnasby@pervasive.com
In reply to: Tom Lane (#99)
Re: Compression and on-disk sorting

On Fri, May 26, 2006 at 04:41:51PM -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

There is a noticeable rise in sort time with increasing work_mem, but
that needs to be offset from the benefit that in-general comes from
using a large Heap for the sort. With the data you're using that always
looks like a loss, but that isn't true with all input data orderings.

Yeah, these are all the exact same test data, right? We need a bit more
variety in the test cases before drawing any sweeping conclusions.

All testing is select count(*) from (select * from accounts order by
bid) a; hitting a pgbench database, since that's something anyone can
(presumably) reproduce. Suggestions for other datasets welcome.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#101Jim C. Nasby
jnasby@pervasive.com
In reply to: Simon Riggs (#98)
Re: Compression and on-disk sorting

On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:

On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:

But the meat is:
                                 -- work_mem --
                         Scale    2000     20000
not compressed             150   805.7     797.7
not compressed            3000   17820     17436
compressed                 150   371.4     400.1
compressed                3000    8152      8537
compressed, no headers    3000    7325      7876

Since Tom has committed the header-removing patch, we need to test

not compressed, no headers v compressed, no headers

                                 -- work_mem --
                         Scale    2000     20000
not compressed             150   805.7     797.7
not compressed            3000   17820     17436
not compressed, no hdr    3000   14470     14507
compressed                 150   371.4     400.1
compressed                3000    8152      8537
compressed, no headers    3000    7325      7876

There is a noticeable rise in sort time with increasing work_mem, but
that needs to be offset from the benefit that in-general comes from
using a large Heap for the sort. With the data you're using that always
looks like a loss, but that isn't true with all input data orderings.

I thought that a change had been made to the on-disk sort specifically to
eliminate the problem of more work_mem making the sort take longer. I also
thought that there was something about that fix that was tunable.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#102Simon Riggs
simon@2ndquadrant.com
In reply to: Jim C. Nasby (#101)
Re: Compression and on-disk sorting

On Wed, 2006-06-07 at 01:33 -0500, Jim C. Nasby wrote:

On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:

On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:

But the meat is:
                                 -- work_mem --
                         Scale    2000     20000
not compressed             150   805.7     797.7
not compressed            3000   17820     17436
compressed                 150   371.4     400.1
compressed                3000    8152      8537
compressed, no headers    3000    7325      7876

Since Tom has committed the header-removing patch, we need to test

not compressed, no headers v compressed, no headers

                                 -- work_mem --
                         Scale    2000     20000
not compressed             150   805.7     797.7
not compressed            3000   17820     17436
not compressed, no hdr    3000   14470     14507
compressed                 150   371.4     400.1
compressed                3000    8152      8537
compressed, no headers    3000    7325      7876

That looks fairly conclusive. Can we try tests with data in reverse
order, so we use more tapes? We're still using a single tape, so the
additional overhead of compression doesn't cause any pain.

There is a noticeable rise in sort time with increasing work_mem, but
that needs to be offset from the benefit that in-general comes from
using a large Heap for the sort. With the data you're using that always
looks like a loss, but that isn't true with all input data orderings.

I thought that a change had been made to the on-disk sort specifically to
eliminate the problem of more work_mem making the sort take longer.

There was a severely non-optimal piece of code... but the general effect
still exists, as does the effect that higher work_mem produces fewer
runs, which speeds up the final stages of the sort.

I also
thought that there was something about that fix that was tunable.

Increasing work_mem makes *this* test case take longer.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#103Jim C. Nasby
jnasby@pervasive.com
In reply to: Simon Riggs (#102)
Re: Compression and on-disk sorting

On Wed, Jun 07, 2006 at 11:59:50AM +0100, Simon Riggs wrote:

On Wed, 2006-06-07 at 01:33 -0500, Jim C. Nasby wrote:

On Fri, May 26, 2006 at 09:21:44PM +0100, Simon Riggs wrote:

On Fri, 2006-05-26 at 14:47 -0500, Jim C. Nasby wrote:

But the meat is:
                                 -- work_mem --
                         Scale    2000     20000
not compressed             150   805.7     797.7
not compressed            3000   17820     17436
compressed                 150   371.4     400.1
compressed                3000    8152      8537
compressed, no headers    3000    7325      7876

Since Tom has committed the header-removing patch, we need to test

not compressed, no headers v compressed, no headers

                                 -- work_mem --
                         Scale    2000     20000
not compressed             150   805.7     797.7
not compressed            3000   17820     17436
not compressed, no hdr    3000   14470     14507
compressed                 150   371.4     400.1
compressed                3000    8152      8537
compressed, no headers    3000    7325      7876

That looks fairly conclusive. Can we try tests with data in reverse
order, so we use more tapes? We're still using a single tape, so the
additional overhead of compression doesn't cause any pain.

Would simply changing the ORDER BY to DESC suffice for this? FWIW:

bench=# select correlation from pg_stats where tablename='accounts' and attname='bid';
 correlation
-------------
           1
(1 row)

There is a noticeable rise in sort time with increasing work_mem, but
that needs to be offset from the benefit that in-general comes from
using a large Heap for the sort. With the data you're using that always
looks like a loss, but that isn't true with all input data orderings.

I thought that a change had been made to the on-disk sort specifically to
eliminate the problem of more work_mem making the sort take longer.

There was a severe non-optimal piece of code...but the general effect
still exists. As does the effect that having higher work_mem produces
fewer runs which speeds up the final stages of the sort.

I also
thought that there was something about that fix that was tunable.

Increasing work_mem makes *this* test case take longer.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#104Simon Riggs
simon@2ndquadrant.com
In reply to: Jim C. Nasby (#103)
Re: Compression and on-disk sorting

On Wed, 2006-06-07 at 09:35 -0500, Jim C. Nasby wrote:

Would simply changing the ORDER BY to DESC suffice for this? FWIW:

Try sorting on aid also, both ascending and descending.

We need to try lots of tests, not just one that's chosen to show the
patch in the best light. I want this, but we need to check.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#105Jim C. Nasby
jnasby@pervasive.com
In reply to: Simon Riggs (#104)
Re: Compression and on-disk sorting

On Wed, Jun 07, 2006 at 04:11:57PM +0100, Simon Riggs wrote:

On Wed, 2006-06-07 at 09:35 -0500, Jim C. Nasby wrote:

Would simply changing the ORDER BY to DESC suffice for this? FWIW:

Try sorting on aid also, both ascending and descending.

We need to try lots of tests, not just one that's chosen to show the
patch in the best light. I want this, but we need to check.

Well, correlation on everything in that table is 1. At this point maybe
it makes more sense to just come up with a different test case, possibly
one built with generate_series() and random(). Better yet would be if
someone came up with a patch that actually populated the filler field in
pgbench. Better still would be allowing the user to define how large they
wanted the filler field to be...
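
A hypothetical sketch of the filler-populating piece, not an actual
pgbench patch (the function name and word list are invented for
illustration), taking the target width as a parameter so the column ends
up as realistic text rather than blanks or pure noise:

/* Hypothetical sketch, not an actual pgbench patch: build a filler string
 * of a caller-chosen width out of a small word list, so the column holds
 * realistic text instead of blanks or pure noise. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *words[] = {
    "ballads", "commends", "cabriolets", "terebral", "encroaches", "spiny"
};
#define NWORDS  (sizeof(words) / sizeof(words[0]))

/* Fill buf (width + 1 bytes) with space-separated random words,
 * truncated to exactly width characters. */
static void
make_filler(char *buf, size_t width)
{
    size_t  used = 0;

    buf[0] = '\0';
    while (used < width)
        used += snprintf(buf + used, width + 1 - used,
                         used ? " %s" : "%s", words[rand() % NWORDS]);
    buf[width] = '\0';
}

int
main(void)
{
    char    filler[85];                    /* accounts.filler is char(84) */

    srand(42);
    make_filler(filler, 84);
    printf("%s\n", filler);
    return 0;
}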
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461