Update on sort-compression stuff

Started by Martijn van Oosterhout (May 2006) · 5 messages · pgsql-hackers
#1 Martijn van Oosterhout
kleptog@svana.org

I'm going to be offline for a few days but there are some things I've
tested in the meantime.

Once the compression ratio drops below 4-to-1, the overhead of zlib
becomes overwhelming compared to the savings. I'm not sure how common
that is; I basically had to fill a table with random data to get it
that low.
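The effect is easy to reproduce outside the sort code. The sketch below (the 32KB block size and zlib level 6 are assumptions for illustration, not what the patch used) compresses one highly repetitive block and one random block and prints the ratios:

```python
import os
import zlib

BLOCK = 32 * 1024  # assumed block size, matching the 32KB units mentioned below

repetitive = b"abcd" * (BLOCK // 4)   # highly compressible table data
random_data = os.urandom(BLOCK)       # essentially incompressible

for label, data in (("repetitive", repetitive), ("random", random_data)):
    compressed = zlib.compress(data, 6)
    print(f"{label}: {len(data) / len(compressed):.2f}-to-1")
```

On random input the "compressed" output is actually slightly larger than the input, so all the CPU spent in zlib buys nothing.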

I implemented a basic version using pg_lzcompress. It appears
that pg_lzcompress is very, very slow. I was afraid that I'd made an
infinite loop, but it was just really slow. Mind you, the overhead of
each call might have been the problem: it was being called on every
32KB block.

My suggestions at this point are:

- Test a way of storing tuples with less overhead than a HeapTuple
header. If you could do it for in-memory sorts, that'd mean you could
fit more tuples in memory before spilling to disk. Given the
"compression" in that case is extremely cheap, it'd be much more likely
to be beneficial.

- Consider replacing pg_lzcompress with zlib if available. Or at least
test pg_lzcompress in a more realistic environment, because it seems
quite slow.
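For concreteness, the per-block scheme under discussion can be sketched as follows. This is a toy model in Python, not the patch itself: tuples are packed into 32KB buffers, each buffer is compressed with one zlib call (amortizing per-call overhead), and each compressed block is length-prefixed so it can be read back. All names here are invented for illustration.

```python
import struct
import zlib

BLOCK_SIZE = 32 * 1024  # compress in 32KB units, as in the experiments above

def write_tape(tuples):
    """Pack length-prefixed tuples into ~32KB buffers and compress each
    buffer with a single zlib call."""
    tape, buf = [], b""
    for t in tuples:
        buf += struct.pack("<I", len(t)) + t
        if len(buf) >= BLOCK_SIZE:
            comp = zlib.compress(buf)
            tape.append(struct.pack("<I", len(comp)) + comp)
            buf = b""
    if buf:  # flush the final partial block
        comp = zlib.compress(buf)
        tape.append(struct.pack("<I", len(comp)) + comp)
    return b"".join(tape)

def read_tape(tape):
    """Decompress block by block and unpack the length-prefixed tuples."""
    pos, tuples = 0, []
    while pos < len(tape):
        (clen,) = struct.unpack_from("<I", tape, pos)
        pos += 4
        buf = zlib.decompress(tape[pos:pos + clen])
        pos += clen
        bpos = 0
        while bpos < len(buf):
            (tlen,) = struct.unpack_from("<I", buf, bpos)
            bpos += 4
            tuples.append(buf[bpos:bpos + tlen])
            bpos += tlen
    return tuples
```

The same structure would work with pg_lzcompress substituted for zlib, which is what makes the per-call overhead question testable in isolation.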

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/


From each according to his ability. To each according to his ability to litigate.

#2 Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Martijn van Oosterhout (#1)
Re: Update on sort-compression stuff

BTW, the test I ran this weekend ended up filling the disk, so I wasn't
able to get results. I hope to have results for a compressed sort that's
still larger than memory in the morning. Unfortunately I'm doing all
this on a machine I use for other things, so it's hard to do testing and
other things at the same time...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#1)
Re: Update on sort-compression stuff

Martijn van Oosterhout <kleptog@svana.org> writes:

> I implemented a basic version using pg_lzcompress. It appears
> that pg_lzcompress is very, very slow. I was afraid that I'd made an
> infinite loop, but it was just really slow. Mind you, the overhead of
> each call might have been the problem: it was being called on every
> 32KB block.

Something wrong there, because the entire point of pg_lzcompress was to
be pretty fast. I agree it needs looking into.

regards, tom lane

#4 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Martijn van Oosterhout (#1)
Re: Update on sort-compression stuff

Martijn van Oosterhout <kleptog@svana.org> writes:

> - Test a way of storing tuples with less overhead than a HeapTuple
> header. If you could do it for in-memory sorts, that'd mean you could
> fit more tuples in memory before spilling to disk. Given the
> "compression" in that case is extremely cheap, it'd be much more likely
> to be beneficial.

I looked into this and decided that trimming the headers for the
in-memory copies is not as attractive as all that. The killer problem
is that comparetup_heap() needs to be able to apply heap_getattr() to
the stored tuples to extract sort keys. Unless we want to support a
variant copy of the heap_getattr() infrastructure just for sort tuples,
it ain't gonna work. Another issue is that we'd be increasing the
palloc traffic for in-memory sorts, because tuplesort_gettuple() would
have to cons up a freshly palloc'd complete tuple to hand back to the
caller.

However, we can definitely trim a lot of overhead from what gets written
to "tape", so I'll have a go at doing that.
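To illustrate the kind of trimming meant here (this is a toy sketch, not PostgreSQL's actual on-tape format): the on-tape copy can drop everything an in-memory tuple carries for MVCC and random access, keeping only an attribute count, a null bitmap, and the attribute data. All names and field widths below are invented for illustration, and the one-byte bitmap limits this sketch to 8 attributes.

```python
import struct

def to_tape(values):
    """Serialize one tuple for tape: attribute count, a one-byte null
    bitmap, then length-prefixed non-null values (assumed to be bytes).
    No header fields beyond what's needed to read the tuple back."""
    assert len(values) <= 8  # one-byte bitmap in this sketch
    bitmap = 0
    for i, v in enumerate(values):
        if v is not None:
            bitmap |= 1 << i
    out = [struct.pack("<HB", len(values), bitmap)]
    for v in values:
        if v is not None:
            out.append(struct.pack("<H", len(v)) + v)
    return b"".join(out)

def from_tape(data):
    """Reverse of to_tape: rebuild the list of values (None for nulls)."""
    natts, bitmap = struct.unpack_from("<HB", data, 0)
    pos, values = 3, []
    for i in range(natts):
        if bitmap & (1 << i):
            (vlen,) = struct.unpack_from("<H", data, pos)
            pos += 2
            values.append(data[pos:pos + vlen])
            pos += vlen
        else:
            values.append(None)
    return values
```

Since tuples on tape are only ever read back sequentially, nothing needs the full in-memory layout there.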

regards, tom lane

#5 Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#4)
Re: Update on sort-compression stuff

On Tue, 2006-05-23 at 14:27 -0400, Tom Lane wrote:

> Martijn van Oosterhout <kleptog@svana.org> writes:
>
>> - Test a way of storing tuples with less overhead than a HeapTuple
>> header. If you could do it for in-memory sorts, that'd mean you could
>> fit more tuples in memory before spilling to disk. Given the
>> "compression" in that case is extremely cheap, it'd be much more likely
>> to be beneficial.
>
> I looked into this and decided that trimming the headers for the
> in-memory copies is not as attractive as all that. The killer problem
> is that comparetup_heap() needs to be able to apply heap_getattr() to
> the stored tuples to extract sort keys. Unless we want to support a
> variant copy of the heap_getattr() infrastructure just for sort tuples,
> it ain't gonna work. Another issue is that we'd be increasing the
> palloc traffic for in-memory sorts, because tuplesort_gettuple() would
> have to cons up a freshly palloc'd complete tuple to hand back to the
> caller.
>
> However, we can definitely trim a lot of overhead from what gets written
> to "tape", so I'll have a go at doing that.

If we write the tuples in compressed form and read them back in that
same form, there wouldn't be any additional palloc overhead at all. The
freelists would be full of over-large blocks, but that might not be such
a problem.

heap_getattr() is called from so few other places that it makes sense to
have a sort-specific version.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com