Improve compression speeds in pg_lzcompress.c

Started by Takeshi Yamamuroover 13 years ago28 messageshackers

yamamuro.takeshi@lab.ntt.co.jp

over 13 years ago

Hi, hackers,

The attached is a patch to improve compression speeds with loss of
compression ratios in backend/utils/adt/pg_lzcompress.c. Recent
modern compression techniques like google LZ4 and Snappy inspreid
me to write this patch. Thre are two points of my patch:

1. Skip at most 255 literals that might be incompressible
during pattern matching for LZ compression.

2. Update a hash table every PGLZ_HASH_GAP literals.

A sequence of literals is typically mixed up with compressible parts
and incompressible ones. Then, IMHO that it is reasonable to skip
PGLZ_SKIP_SIZE literals every a match is not found. The skipped multiple
literals are just copied to the output buffer, so pglz_out_literal() is
re-written (and renamed pglz_out_literals) so as to copy multiple
bytes, not a single byte.

And also, the current implementation updates a hash table for every a single
literal. However, as the updates obviously eat much processor time, skipping
the updates dynamically improves compression speeds.

I've done quick comparison tests with a Xeon 5670 processor.
A sequence logs of Apache hadoop and TREC GOV2 web data were used
as test sets. The former is highly compressible (low entroy) and the
other is difficult to compress (high entropy).

*******************
Compression Speed (Ratio)
Apache hadoop logs:
gzip 78.22MiB/s ( 5.31%)
bzip2 3.34MiB/s ( 3.04%)
lz4 939.45MiB/s ( 9.17%)
pg_lzcompress(original) 37.80MiB/s (11.76%)
pg_lzcompress(patch apaplied) 99.42MiB/s (14.19%)

TREC GOV2 web data:
gzip 21.22MiB/s (32.66%)
bzip2 8.61MiB/s (27.86%)
lz4 250.98MiB/s (49.82%)
pg_lzcompress(original) 20.44MiB/s (50.09%)
pg_lzcompress(patch apaplied) 48.67MiB/s (61.87%)

*******************

Obviously, both the compression ratio and the speed in the current
implementation are inferior to those in gzip. And, my patch
loses gzip and bzip2 in view of compression ratios though, the
compression speed overcomes those in gzip and bzip2.

Anyway, the compression speed in lz4 is very fast, so in my
opinion, there is a room to improve the current implementation
in pg_lzcompress.

regards,
--
----
Takeshi Yamamuro
NTT Cyber Communications Laboratory Group
Software Innovation Center
(Open Source Software Center)
Tel: +81-3-5860-5057 Fax: +81-3-5463-5490
Mail:yamamuro.takeshi@lab.ntt.co.jp

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Takeshi Yamamuro (#1)

Re: Improve compression speeds in pg_lzcompress.c

On 7 January 2013 07:29, Takeshi Yamamuro
<yamamuro.takeshi@lab.ntt.co.jp> wrote:

Anyway, the compression speed in lz4 is very fast, so in my
opinion, there is a room to improve the current implementation
in pg_lzcompress.

So why don't we use LZ4?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

John R Pierce

pierce@hogranch.com

over 13 years ago

In reply to: Simon Riggs (#2)

Re: Improve compression speeds in pg_lzcompress.c

On 1/7/2013 1:10 AM, Simon Riggs wrote:

On 7 January 2013 07:29, Takeshi Yamamuro
<yamamuro.takeshi@lab.ntt.co.jp> wrote:

Anyway, the compression speed in lz4 is very fast, so in my
opinion, there is a room to improve the current implementation
in pg_lzcompress.

So why don't we use LZ4?

what will changing compression formats do for compatability?

this is for the compressed data in pg_toast storage or something? will
this break pg_upgrade style operations?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: John R Pierce (#3)

Re: Improve compression speeds in pg_lzcompress.c

On 7 January 2013 09:19, John R Pierce <pierce@hogranch.com> wrote:

On 1/7/2013 1:10 AM, Simon Riggs wrote:

On 7 January 2013 07:29, Takeshi Yamamuro
<yamamuro.takeshi@lab.ntt.co.jp> wrote:

Anyway, the compression speed in lz4 is very fast, so in my
opinion, there is a room to improve the current implementation
in pg_lzcompress.

So why don't we use LZ4?

what will changing compression formats do for compatability?

this is for the compressed data in pg_toast storage or something? will this
break pg_upgrade style operations?

Anything that changes on-disk format would need to consider how to do
pg_upgrade. It's the major blocker in that area.

For this, it would be possible to have a new format and old format
coexist, but that will take more time to think through than we have
for this release, so this is a nice idea for further investigation in
9.4. Thanks for raising that point.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

over 13 years ago

In reply to: Simon Riggs (#4)

Re: Improve compression speeds in pg_lzcompress.c

On 2013-01-07 09:57:58 +0000, Simon Riggs wrote:

On 7 January 2013 09:19, John R Pierce <pierce@hogranch.com> wrote:

On 1/7/2013 1:10 AM, Simon Riggs wrote:

On 7 January 2013 07:29, Takeshi Yamamuro
<yamamuro.takeshi@lab.ntt.co.jp> wrote:

Anyway, the compression speed in lz4 is very fast, so in my
opinion, there is a room to improve the current implementation
in pg_lzcompress.

So why don't we use LZ4?

what will changing compression formats do for compatability?

this is for the compressed data in pg_toast storage or something? will this
break pg_upgrade style operations?

Anything that changes on-disk format would need to consider how to do
pg_upgrade. It's the major blocker in that area.

For this, it would be possible to have a new format and old format
coexist, but that will take more time to think through than we have
for this release, so this is a nice idea for further investigation in
9.4. Thanks for raising that point.

I think there should be enough bits available in the toast pointer to
indicate the type of compression. I seem to remember somebody even
posting a patch to that effect?
I agree that it's probably too late in the 9.3 cycle to start with this.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

John R Pierce

pierce@hogranch.com

over 13 years ago

In reply to: Andres Freund (#5)

Re: Improve compression speeds in pg_lzcompress.c

On 1/7/2013 2:05 AM, Andres Freund wrote:

I think there should be enough bits available in the toast pointer to
indicate the type of compression. I seem to remember somebody even
posting a patch to that effect?
I agree that it's probably too late in the 9.3 cycle to start with this.

so an upgraded database would have old toasted values in the old
compression format, and new toasted values in the new format in an
existing table? that's kind of ugly.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Andres Freund

andres@anarazel.de

over 13 years ago

In reply to: John R Pierce (#6)

Re: Improve compression speeds in pg_lzcompress.c

On 2013-01-07 02:21:26 -0800, John R Pierce wrote:

On 1/7/2013 2:05 AM, Andres Freund wrote:

I think there should be enough bits available in the toast pointer to
indicate the type of compression. I seem to remember somebody even
posting a patch to that effect?
I agree that it's probably too late in the 9.3 cycle to start with this.

so an upgraded database would have old toasted values in the old compression
format, and new toasted values in the new format in an existing table?
that's kind of ugly.

Well, ISTM thats just life. What you prefer? Converting all toast values
during pg_upgrade kinda goes against the aim of quick upgrades.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Bruce Momjian

bruce@momjian.us

over 13 years ago

In reply to: John R Pierce (#6)

Re: Improve compression speeds in pg_lzcompress.c

On Mon, Jan 7, 2013 at 10:21 AM, John R Pierce <pierce@hogranch.com> wrote:

On 1/7/2013 2:05 AM, Andres Freund wrote:

I think there should be enough bits available in the toast pointer to
indicate the type of compression. I seem to remember somebody even
posting a patch to that effect?
I agree that it's probably too late in the 9.3 cycle to start with this.

so an upgraded database would have old toasted values in the old compression
format, and new toasted values in the new format in an existing table?
that's kind of ugly.

I haven't looked at the patch. It's not obvious to me from the
description that the output isn't backwards compatible. The way the LZ
toast compression works the output is self-describing. There are many
different outputs that would decompress to the same thing and the
compressing code can choose how hard to look for earlier matches and
when to just copy bytes wholesale but the decompression will work
regardless.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Bruce Momjian (#8)

Re: Improve compression speeds in pg_lzcompress.c

On 7 January 2013 13:36, Greg Stark <stark@mit.edu> wrote:

On Mon, Jan 7, 2013 at 10:21 AM, John R Pierce <pierce@hogranch.com> wrote:

On 1/7/2013 2:05 AM, Andres Freund wrote:

I think there should be enough bits available in the toast pointer to
indicate the type of compression. I seem to remember somebody even
posting a patch to that effect?
I agree that it's probably too late in the 9.3 cycle to start with this.

so an upgraded database would have old toasted values in the old compression
format, and new toasted values in the new format in an existing table?
that's kind of ugly.

I haven't looked at the patch. It's not obvious to me from the
description that the output isn't backwards compatible. The way the LZ
toast compression works the output is self-describing. There are many
different outputs that would decompress to the same thing and the
compressing code can choose how hard to look for earlier matches and
when to just copy bytes wholesale but the decompression will work
regardless.

Good point, and a great reason to use this patch rather than LZ4 for 9.3

We could even have tuning parameters for toast compression, as long as
we keep the on disk format identical.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

Kenneth Marshall

ktm@rice.edu

over 13 years ago

In reply to: Simon Riggs (#2)

Re: Improve compression speeds in pg_lzcompress.c

On Mon, Jan 07, 2013 at 09:10:31AM +0000, Simon Riggs wrote:

On 7 January 2013 07:29, Takeshi Yamamuro
<yamamuro.takeshi@lab.ntt.co.jp> wrote:

Anyway, the compression speed in lz4 is very fast, so in my
opinion, there is a room to improve the current implementation
in pg_lzcompress.

So why don't we use LZ4?

Regards,
Ken

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Kenneth Marshall

ktm@rice.edu

over 13 years ago

In reply to: Bruce Momjian (#8)

Re: Improve compression speeds in pg_lzcompress.c

On Mon, Jan 07, 2013 at 01:36:33PM +0000, Greg Stark wrote:

On Mon, Jan 7, 2013 at 10:21 AM, John R Pierce <pierce@hogranch.com> wrote:

On 1/7/2013 2:05 AM, Andres Freund wrote:

I think there should be enough bits available in the toast pointer to
indicate the type of compression. I seem to remember somebody even
posting a patch to that effect?
I agree that it's probably too late in the 9.3 cycle to start with this.

so an upgraded database would have old toasted values in the old compression
format, and new toasted values in the new format in an existing table?
that's kind of ugly.

I haven't looked at the patch. It's not obvious to me from the
description that the output isn't backwards compatible. The way the LZ
toast compression works the output is self-describing. There are many
different outputs that would decompress to the same thing and the
compressing code can choose how hard to look for earlier matches and
when to just copy bytes wholesale but the decompression will work
regardless.

I think this comment refers to the lz4 option. I do agree that the patch
that was posted to improve the current compression speed should be able
to be implemented to allow the current results to be decompressed as well.

Regards,
Ken

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

Andres Freund

andres@anarazel.de

over 13 years ago

In reply to: Takeshi Yamamuro (#1)

Re: Improve compression speeds in pg_lzcompress.c

Hi,

It seems worth rereading the thread around
http://archives.postgresql.org/message-id/CAAZKuFb59sABSa7gCG0vnVnGb-mJCUBBbrKiyPraNXHnis7KMw%40mail.gmail.com
for people wanting to work on this.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Takeshi Yamamuro (#1)

Re: Improve compression speeds in pg_lzcompress.c

Takeshi Yamamuro <yamamuro.takeshi@lab.ntt.co.jp> writes:

The attached is a patch to improve compression speeds with loss of
compression ratios in backend/utils/adt/pg_lzcompress.c.

Why would that be a good tradeoff to make? Larger stored values require
more I/O, which is likely to swamp any CPU savings in the compression
step. Not to mention that a value once written may be read many times,
so the extra I/O cost could be multiplied many times over later on.

Another thing to keep in mind is that the compression area in general
is a minefield of patents. We're fairly confident that pg_lzcompress
as-is doesn't fall foul of any, but any significant change there would
probably require more research.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Merlin Moncure

mmoncure@gmail.com

over 13 years ago

In reply to: Tom Lane (#13)

Re: Improve compression speeds in pg_lzcompress.c

On Mon, Jan 7, 2013 at 10:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Takeshi Yamamuro <yamamuro.takeshi@lab.ntt.co.jp> writes:

The attached is a patch to improve compression speeds with loss of
compression ratios in backend/utils/adt/pg_lzcompress.c.

Why would that be a good tradeoff to make? Larger stored values require
more I/O, which is likely to swamp any CPU savings in the compression
step. Not to mention that a value once written may be read many times,
so the extra I/O cost could be multiplied many times over later on.

I disagree. pg compression is so awful it's almost never a net win.
I turn it off.

Another thing to keep in mind is that the compression area in general
is a minefield of patents. We're fairly confident that pg_lzcompress
as-is doesn't fall foul of any, but any significant change there would
probably require more research.

A minefield of *expired* patents. Fast lz based compression is used
all over the place -- for example by the lucene.

lz4.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Merlin Moncure (#14)

Re: Improve compression speeds in pg_lzcompress.c

Merlin Moncure <mmoncure@gmail.com> writes:

On Mon, Jan 7, 2013 at 10:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Takeshi Yamamuro <yamamuro.takeshi@lab.ntt.co.jp> writes:

The attached is a patch to improve compression speeds with loss of
compression ratios in backend/utils/adt/pg_lzcompress.c.

Why would that be a good tradeoff to make? Larger stored values require
more I/O, which is likely to swamp any CPU savings in the compression
step. Not to mention that a value once written may be read many times,
so the extra I/O cost could be multiplied many times over later on.

I disagree. pg compression is so awful it's almost never a net win.
I turn it off.

One report doesn't make it useless, but even if it is so on your data,
why would making it even less effective be a win?

Another thing to keep in mind is that the compression area in general
is a minefield of patents. We're fairly confident that pg_lzcompress
as-is doesn't fall foul of any, but any significant change there would
probably require more research.

A minefield of *expired* patents. Fast lz based compression is used
all over the place -- for example by the lucene.

The patents that had to be dodged for original LZ compression are gone,
true, but what's your evidence for saying that newer versions don't have
newer patents?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

Robert Haas

robertmhaas@gmail.com

over 13 years ago

In reply to: Tom Lane (#13)

Re: Improve compression speeds in pg_lzcompress.c

On Mon, Jan 7, 2013 at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Why would that be a good tradeoff to make? Larger stored values require
more I/O, which is likely to swamp any CPU savings in the compression
step. Not to mention that a value once written may be read many times,
so the extra I/O cost could be multiplied many times over later on.

I agree with this analysis, but I note that the test results show it
actually improving things along both parameters.

I'm not sure how general that result is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Robert Haas (#16)

Re: Improve compression speeds in pg_lzcompress.c

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jan 7, 2013 at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Why would that be a good tradeoff to make? Larger stored values require
more I/O, which is likely to swamp any CPU savings in the compression
step. Not to mention that a value once written may be read many times,
so the extra I/O cost could be multiplied many times over later on.

I agree with this analysis, but I note that the test results show it
actually improving things along both parameters.

Hm ... one of us is reading those results backwards, then.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Merlin Moncure

mmoncure@gmail.com

over 13 years ago

In reply to: Tom Lane (#15)

Re: Improve compression speeds in pg_lzcompress.c

On Mon, Jan 7, 2013 at 2:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

On Mon, Jan 7, 2013 at 10:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Takeshi Yamamuro <yamamuro.takeshi@lab.ntt.co.jp> writes:

The attached is a patch to improve compression speeds with loss of
compression ratios in backend/utils/adt/pg_lzcompress.c.

Why would that be a good tradeoff to make? Larger stored values require
more I/O, which is likely to swamp any CPU savings in the compression
step. Not to mention that a value once written may be read many times,
so the extra I/O cost could be multiplied many times over later on.

I disagree. pg compression is so awful it's almost never a net win.
I turn it off.

One report doesn't make it useless, but even if it is so on your data,
why would making it even less effective be a win?

That's a fair point. I'm neutral on the OP's proposal -- it's just
moving spots around the dog. If we didn't have better options, maybe
offering options to tune what we have would be worth implementing...
but by your standard ISTM we can't even do *that*.

Another thing to keep in mind is that the compression area in general
is a minefield of patents. We're fairly confident that pg_lzcompress
as-is doesn't fall foul of any, but any significant change there would
probably require more research.

A minefield of *expired* patents. Fast lz based compression is used
all over the place -- for example by the lucene.

The patents that had to be dodged for original LZ compression are gone,
true, but what's your evidence for saying that newer versions don't have
newer patents?

That's impossible (at least for a non-attorney) to do because the
patents are still flying (for example:
http://www.google.com/patents/US7650040). That said, you've framed
the debate so that any improvement to postgres compression requires an
IP lawyer. That immediately raises some questions:

*) why hold only compression type features in postgres to that
standard? Patents get mentioned here and there in the context of
other features in the archives but only compression seems to require a
proven clean pedigree. Why don't we require a patent search for
other interesting features? What evidence do *you* offer that lz4
violates any patents?

*) why is postgres the only FOSS project that cares about
patentability of say, lz4? (google 'lz4 patent')

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Andrew Dunstan

andrew@dunslane.net

over 13 years ago

In reply to: Tom Lane (#17)

Re: Improve compression speeds in pg_lzcompress.c

On 01/07/2013 04:19 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jan 7, 2013 at 11:16 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Why would that be a good tradeoff to make? Larger stored values require
more I/O, which is likely to swamp any CPU savings in the compression
step. Not to mention that a value once written may be read many times,
so the extra I/O cost could be multiplied many times over later on.

I agree with this analysis, but I note that the test results show it
actually improving things along both parameters.

Hm ... one of us is reading those results backwards, then.

I just went back and looked. Unless I'm misreading it he has about a 2.5
times speed improvement but about a 20% worse compression result.

What would be interesting would be to see if the knobs he's tweaked
could be tweaked a bit more to give us substantial speedup without
significant space degradation.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

Robert Haas

robertmhaas@gmail.com

over 13 years ago

In reply to: Tom Lane (#17)

Re: Improve compression speeds in pg_lzcompress.c

On Mon, Jan 7, 2013 at 4:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hm ... one of us is reading those results backwards, then.

*looks*

It's me.

Sorry for the noise.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Takeshi Yamamuro

yamamuro.takeshi@lab.ntt.co.jp

over 13 years ago

In reply to: Andrew Dunstan (#19)

#22

Takeshi Yamamuro

yamamuro.takeshi@lab.ntt.co.jp

over 13 years ago

In reply to: Kenneth Marshall (#10)

#23

Takeshi Yamamuro