pluggable compression support

Started by Andres Freund · almost 13 years ago · 33 messages · pgsql-hackers
#1 Andres Freund
andres@anarazel.de

Hi,

While hacking on the indirect toast support I felt the need to find out
how to make compression formats pluggable.

In
http://archives.postgresql.org/message-id/20130605150144.GD28067%40alap2.anarazel.de
I submitted an initial patch that showed some promising results.

Here's a more cleaned up version which isn't intermingled with indirect
toast tuple support anymore.

It still contains a guc as described in the above message to control the
algorithm used for compressing new tuples but I think we should remove
that guc after testing.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Add-snappy-compression-algorithm-to-contrib.patch (text/x-patch, +1701, -2)
0002-pluggable-compression.patch (text/x-patch, +220, -64)
#2 Josh Berkus
josh@agliodbs.com
In reply to: Andres Freund (#1)
Re: pluggable compression support

On 06/14/2013 04:01 PM, Andres Freund wrote:

It still contains a guc as described in the above message to control the
algorithm used for compressing new tuples but I think we should remove
that guc after testing.

Did you add the storage attribute?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#2)
Re: pluggable compression support

On 2013-06-14 17:12:01 -0700, Josh Berkus wrote:

On 06/14/2013 04:01 PM, Andres Freund wrote:

It still contains a guc as described in the above message to control the
algorithm used for compressing new tuples but I think we should remove
that guc after testing.

Did you add the storage attribute?

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

If we want to make it configurable on a per column basis I think the way
to go is to add a new column to pg_attribute and split compression
related things out of attstorage into attcompression.
That's a fair amount of work and it includes a minor compatibility break
in the catalog format, so I'd prefer not to do it until there's a good
reason to do so.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#4 Josh Berkus
josh@agliodbs.com
In reply to: Andres Freund (#1)
Re: pluggable compression support

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

Then it's not "pluggable", is it? It's "upgradable compression
support", if anything. Which is fine, but let's not confuse people.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


#5 Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#4)
Re: pluggable compression support

On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

Then it's not "pluggable", is it? It's "upgradable compression
support", if anything. Which is fine, but let's not confuse people.

The point is that it's pluggable on the storage level, in the sense that
several different algorithms can coexist and new ones can be added
relatively easily.
That part is what seems to have blocked progress for quite a while
now. So fixing that seems to be the interesting thing.

I am happy enough to do the work of making it configurable if we want it
to be... But I have zero interest in doing it only to throw it away in the
end because we decide we don't need it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#5)
Re: pluggable compression support

On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

Then it's not "pluggable", is it? It's "upgradable compression
support", if anything. Which is fine, but let's not confuse people.

The point is that it's pluggable on the storage level in the sense of
that several different algorithms can coexist and new ones can
relatively easily added.
That part is what seems to have blocked progress for quite a while
now. So fixing that seems to be the interesting thing.

I am happy enough to do the work of making it configurable if we want it
to be... But I have zap interest of doing it and throw it away in the
end because we decide we don't need it.

I don't think we need it. I think what we need to decide is which
algorithm is legally OK to use. And then put it in.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers. Maybe it would be valuable
to get some opinions from people who ARE lawyers. Tom and Heikki both
work for real big companies which, I'm guessing, have substantial
legal departments; perhaps they could pursue getting the algorithms of
possible interest vetted. Or, I could try to find out whether it's
possible to do something similar through EnterpriseDB.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#7 Joshua D. Drake
jd@commandprompt.com
In reply to: Robert Haas (#6)
Re: pluggable compression support

On 06/14/2013 06:56 PM, Robert Haas wrote:

On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

Then it's not "pluggable", is it? It's "upgradable compression
support", if anything. Which is fine, but let's not confuse people.

The point is that it's pluggable on the storage level in the sense of
that several different algorithms can coexist and new ones can
relatively easily added.
That part is what seems to have blocked progress for quite a while
now. So fixing that seems to be the interesting thing.

I am happy enough to do the work of making it configurable if we want it
to be... But I have zap interest of doing it and throw it away in the
end because we decide we don't need it.

I don't think we need it. I think what we need is to decide is which
algorithm is legally OK to use. And then put it in.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers. Maybe it would be valuable
to get some opinions from people who ARE lawyers. Tom and Heikki both
work for real big companies which, I'm guessing, have substantial
legal departments; perhaps they could pursue getting the algorithms of
possible interest vetted. Or, I could try to find out whether it's
possible do something similar through EnterpriseDB.

We have IP legal representation through Software in the Public Interest,
which pretty much specializes in this type of thing.

Should I follow up? If so, I need a summary of the exact question
including licenses etc.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/ 509-416-6579
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC, @cmdpromptinc
For my dreams of your image that blossoms
a rose in the deeps of my heart. - W.B. Yeats


#8 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#6)
Re: pluggable compression support

On 2013-06-14 21:56:52 -0400, Robert Haas wrote:

I don't think we need it. I think what we need is to decide is which
algorithm is legally OK to use. And then put it in.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers. Maybe it would be valuable
to get some opinions from people who ARE lawyers. Tom and Heikki both
work for real big companies which, I'm guessing, have substantial
legal departments; perhaps they could pursue getting the algorithms of
possible interest vetted. Or, I could try to find out whether it's
possible do something similar through EnterpriseDB.

I personally don't think the legal argument holds all that much water
for snappy and lz4. But then the opinion of a European non-lawyer doesn't
hold much weight either.
Both are widely used by a large number of open and closed projects, some of
which have patent grant clauses in their licenses. E.g. hadoop and
cassandra use lz4, and I'd be surprised if the companies behind those
have opened themselves to litigation.

I think we should preliminarily decide which algorithm to use before we
get lawyers involved. I'd be surprised if they can make such an analysis
faster than we can rule out one of them via benchmarks.

Will post an updated patch that includes lz4 as well.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#9 Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#6)
Re: pluggable compression support

On 06/15/2013 03:56 AM, Robert Haas wrote:

On Fri, Jun 14, 2013 at 8:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-06-14 17:35:02 -0700, Josh Berkus wrote:

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

Then it's not "pluggable", is it? It's "upgradable compression
support", if anything. Which is fine, but let's not confuse people.

The point is that it's pluggable on the storage level in the sense of
that several different algorithms can coexist and new ones can
relatively easily added.
That part is what seems to have blocked progress for quite a while
now. So fixing that seems to be the interesting thing.

I am happy enough to do the work of making it configurable if we want it
to be... But I have zap interest of doing it and throw it away in the
end because we decide we don't need it.

I don't think we need it. I think what we need is to decide is which
algorithm is legally OK to use. And then put it in.

If it were truly pluggable - that is, just load a .dll, set a GUC and start
using it - then the selection of algorithms would be much
wider as several slow-but-high-compression ones under GPL could be
used as well, similar to how currently PostGIS works.

gzip and bzip2 are the first two that came in mind, but I am sure there
are more.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers. Maybe it would be valuable
to get some opinions from people who ARE lawyers.

Making a truly pluggable compression API delegates this question
to end users.

Delegation is good, as it lets you get done more :)

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#10 Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#3)
Re: pluggable compression support

On 06/15/2013 02:20 AM, Andres Freund wrote:

On 2013-06-14 17:12:01 -0700, Josh Berkus wrote:

On 06/14/2013 04:01 PM, Andres Freund wrote:

It still contains a guc as described in the above message to control the
algorithm used for compressing new tuples but I think we should remove
that guc after testing.

Did you add the storage attribute?

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

If not significantly harder than what you currently do, I'd prefer a
true pluggable compression support which is

a) dynamically configurable, say by using a GUC

and

b) self-describing, that is, the compressed data should have enough
info to determine how to decompress it.

additionally it *could* have the property Simon proposed earlier
of *uncompressed* pages having some predetermined size, so we
could retain optimisations of substring() even on compressed TOAST
values.

the latter of course could also be achieved by adding an offset
column to toast tables as well.

One more idea - if we are already changing toast table structure, we
could introduce a notion of "compress block", which could run over
several storage pages for much improved compression compared
to compressing only a single page at a time.
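
The substring() idea above depends on fixed-size *uncompressed* chunks: if every CHUNK_RAW bytes are compressed independently (with an offset index mapping chunk numbers to compressed positions), a substring request only has to decompress the chunks it overlaps. A toy sketch of that locality computation, with hypothetical names and a deliberately tiny chunk size:

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_RAW 4   /* fixed uncompressed chunk size (tiny, for illustration) */

/* Which chunks does substring(off, len) touch? Everything outside
 * [first, last] can stay compressed on disk. Assumes len > 0. */
static void chunks_for_range(size_t off, size_t len,
                             size_t *first, size_t *last)
{
    *first = off / CHUNK_RAW;
    *last  = (off + len - 1) / CHUNK_RAW;
}
```

So a 6-byte substring starting at offset 5 touches only chunks 1 and 2, however large the whole value is.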

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#11 Andres Freund
andres@anarazel.de
In reply to: Hannu Krosing (#10)
Re: pluggable compression support

On 2013-06-15 13:25:49 +0200, Hannu Krosing wrote:

On 06/15/2013 02:20 AM, Andres Freund wrote:

On 2013-06-14 17:12:01 -0700, Josh Berkus wrote:

On 06/14/2013 04:01 PM, Andres Freund wrote:

It still contains a guc as described in the above message to control the
algorithm used for compressing new tuples but I think we should remove
that guc after testing.

Did you add the storage attribute?

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

If not significantly harder than what you currently do, I'd prefer a
true pluggable compression support which is

a) dynamically configurable , say by using a GUG

and

b) self-describing, that is, the compressed data should have enough
info to determine how to decompress it.

Could you perhaps actually read the description and the discussion
before making wild suggestions? Possibly even the patch.
Compressed toast datums now *do* have an identifier of the compression
algorithm used. That's how we can discern between pglz and whatever
algorithm we come up with.

But those identifiers should be *small* (since they are added to all
Datums) and they need to be stable, even across pg_upgrade. So I think
making this user configurable would be a grave error at this point.
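
A minimal sketch of the scheme being described, i.e. a small algorithm identifier stored with each compressed datum that decompression dispatches on. Every name below is hypothetical and the flat leading byte is a simplification; the actual patch encodes the identifier in the toast datum's header:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical algorithm identifiers; the real values must stay stable
 * forever (think pg_upgrade), which is why they cannot be user-assigned. */
#define ALGO_PGLZ   0x01
#define ALGO_SNAPPY 0x02

/* "Compress" by tagging the payload. A real implementation would call
 * pglz or snappy here; memcpy stands in for the compressed bytes. */
static size_t toy_compress(unsigned char algo, const char *in, size_t len,
                           unsigned char *out)
{
    out[0] = algo;              /* one-byte algorithm identifier */
    memcpy(out + 1, in, len);
    return len + 1;
}

/* Decompression dispatches on the stored identifier: the marker must be
 * in the datum itself, because only the varlena is passed in. */
static size_t toy_decompress(const unsigned char *in, size_t len, char *out)
{
    switch (in[0])
    {
        case ALGO_PGLZ:
        case ALGO_SNAPPY:
            memcpy(out, in + 1, len - 1);   /* per-algorithm call here */
            return len - 1;
        default:
            return 0;                       /* unknown id: corrupt datum */
    }
}
```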

additionally it *could* have the property Simon proposed earlier
of *uncompressed* pages having some predetermined size, so we
could retain optimisations of substring() even on compressed TOAST
values.

We are not changing the toast format here, so I don't think that's
applicable. That's a completely separate feature.

the latter of course could also be achieved by adding offset
column to toast tables as well.

One more idea - if we are already changing toast table structure, we
could introduce a notion of "compress block", which could run over
several storage pages for much improved compression compared
to compressing only a single page at a time.

We aren't changing the toast table structure. And we can't easily do so,
think of pg_upgrade.
Besides, toast always has compressed datums over several chunks. What
would be beneficial would be to compress in a way that you can compress
several datums together, but that's several magnitudes more complex and
is unrelated to this feature.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#12 Andres Freund
andres@anarazel.de
In reply to: Hannu Krosing (#9)
Re: pluggable compression support

On 2013-06-15 13:11:47 +0200, Hannu Krosing wrote:

If it were truly pluggable - that is just load a .dll, set a GUG and start
using it

Ok. I officially rechristen the patchset to 'extensible compression
support'.

- then the selection of algorithms would be much
wider as several slow-but-high-compression ones under GPL could be
used as well, similar to how currently PostGIS works.

gzip and bzip2 are the first two that came in mind, but I am sure there
are more.

gzip barely has a higher compression ratio than lz4 and is an order of
magnitude slower at decompressing, so I don't think it's interesting.
I personally think bzip2 is too slow to be useful, even for
decompression. What might be useful is something like lzma, but its
implementation is so complex that I don't really want to touch it.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers. Maybe it would be valuable
to get some opinions from people who ARE lawyers.

Making a truly pluggable compression API delegates this question
to end users.

Delegation is good, as it lets you get done more :)

No. It increases the feature's complexity by an order of magnitude. That's
not good. And it means that about nobody but a few expert users will benefit
from it, so I am pretty strongly opposed.

You suddenly need to solve the question of how the identifiers for
compression formats are allocated and preserved across pg_upgrade and
across machines.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13 Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#11)
Re: pluggable compression support

On 06/15/2013 01:56 PM, Andres Freund wrote:

On 2013-06-15 13:25:49 +0200, Hannu Krosing wrote:

On 06/15/2013 02:20 AM, Andres Freund wrote:

On 2013-06-14 17:12:01 -0700, Josh Berkus wrote:

On 06/14/2013 04:01 PM, Andres Freund wrote:

It still contains a guc as described in the above message to control the
algorithm used for compressing new tuples but I think we should remove
that guc after testing.

Did you add the storage attribute?

No. I think as long as we only have pglz and one new algorithm (even if
that is lz4 instead of the current snappy) we should just always use the
new algorithm. Unless I missed it nobody seemed to have voiced a
contrary position?
For testing/evaluation the guc seems to be sufficient.

If not significantly harder than what you currently do, I'd prefer a
true pluggable compression support which is
a) dynamically configurable , say by using a GUG
and
b) self-describing, that is, the compressed data should have enough
info to determine how to decompress it.

Could you perhaps actually read the the description and the discussion
before making wild suggestions? Possibly even the patch.
Compressed toast datums now *do* have an identifier of the compression
algorithm used.
That's how we can discern between pglz and whatever
algorithm we come up with.

Claiming that the algorithm will be one of only two (current and
"whatever algorithm we come up with ") suggests that it is
only one bit, which is undoubtedly too little for having a "pluggable"
compression API :)

But those identifiers should be *small* (since they are added to all
Datums)

if there will be any alignment at all between the datums, then
one byte will be lost in the noise ("remember: nobody will need
more than 256 compression algorithms")
OTOH, if you plan to put these format markers in the compressed
stream and change the compression algorithm while reading it, I am lost.

and they need to be stable, even across pg_upgrade. So I think
making this user configurable would be grave error at this point.

"at this point" in what sense?

additionally it *could* have the property Simon proposed earlier
of *uncompressed* pages having some predetermined size, so we
could retain optimisations of substring() even on compressed TOAST
values.

We are not changing the toast format here, so I don't think that's
applicable. That's a completely separate feature.

the latter of course could also be achieved by adding offset
column to toast tables as well.
One more idea - if we are already changing toast table structure, we
could introduce a notion of "compress block", which could run over
several storage pages for much improved compression compared
to compressing only a single page at a time.

We aren't changing the toast table structure. And we can't easily do so,
think of pg_upgrade.

Where was the page for "features rejected because of pg_upgrade" ;)

Besides, toast always has compressed datums over several chunks. What
would be beneficial would be to compress in a way that you can compress
several datums together, but that's several magnitudes more complex and
is unrelated to this feature.

Greetings,

Andres Freund

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#14 Hannu Krosing
hannu@tm.ee
In reply to: Andres Freund (#12)
Re: pluggable compression support

On 06/15/2013 02:02 PM, Andres Freund wrote:

On 2013-06-15 13:11:47 +0200, Hannu Krosing wrote:

If it were truly pluggable - that is just load a .dll, set a GUG and start
using it

Ok. I officially rechristen the patchset to 'extensible compression
support'.

- then the selection of algorithms would be much
wider as several slow-but-high-compression ones under GPL could be
used as well, similar to how currently PostGIS works.
gzip and bzip2 are the first two that came in mind, but I am sure there
are more.

gzip barely has a higher compression ratio than lz4 and is a magnitude
slower decompressing, so I don't think it's interesting.
I personally think bzip2 is too slow to be useful, even for
decompression.

with low compression settings gzip and bzip2 seem to decompress at the
same speed:
http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

(an interesting thing there is memory usage, but I guess it is just an
artefact of outer layers around the algorithm)

and whether better compression translates to more speed depends heavily on
disk speeds:
http://www.citusdata.com/blog/64-zfs-compression claims quite big
performance increases from using gzip, even with its slow decompression.

What might be useful is something like lzma, but it's
implementation is so complex that I don't really want to touch it.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers. Maybe it would be valuable
to get some opinions from people who ARE lawyers.

Making a truly pluggable compression API delegates this question
to end users.

Delegation is good, as it lets you get done more :)

No. It increases the features complexity by a magnitude. That's not
good. And it means that about nobody but a few expert users will benefit
from it, so I am pretty strongly opposed.

You suddently need to solve the question of how the identifiers for
compression formats are allocated and preserved across pg_upgrade and
across machines.

This is something similar to what we already need to do for any non-builtin
type OID.

Greetings,

Andres Freund

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#15 Andres Freund
andres@anarazel.de
In reply to: Hannu Krosing (#13)
Re: pluggable compression support

On 2013-06-15 14:11:54 +0200, Hannu Krosing wrote:

Could you perhaps actually read the the description and the discussion
before making wild suggestions? Possibly even the patch.
Compressed toast datums now *do* have an identifier of the compression
algorithm used.
That's how we can discern between pglz and whatever
algorithm we come up with.

Claiming that the algorithm will be one of only two (current and
"whatever algorithm we come up with ") suggests that it is
only one bit, which is undoubtedly too little for having a "pluggable"
compression API :)

No, I am thinking 127 + 2 possible algorithms for now. If we need more,
the space used for the algorithm can be extended transparently at that
point.
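
Extending the identifier space transparently could work like a variable-length integer: values below 128 occupy one byte, and the high bit signals a second byte, so existing one-byte datums never need rewriting. This encoding is purely a hypothetical illustration, not the patch's actual layout:

```c
#include <assert.h>
#include <stddef.h>

/* Encode an algorithm id; returns the number of bytes written (1 or 2).
 * Ids 0..127 stay one byte; the 0x80 bit marks an extension byte. */
static size_t encode_algo(unsigned int id, unsigned char *out)
{
    if (id < 0x80)
    {
        out[0] = (unsigned char) id;
        return 1;
    }
    out[0] = 0x80 | (id >> 8);      /* extension bit plus high bits */
    out[1] = id & 0xFF;
    return 2;
}

/* Decode an id; returns the number of bytes consumed. */
static size_t decode_algo(const unsigned char *in, unsigned int *id)
{
    if ((in[0] & 0x80) == 0)
    {
        *id = in[0];
        return 1;
    }
    *id = ((unsigned int) (in[0] & 0x7F) << 8) | in[1];
    return 2;
}
```

Old datums tagged with a one-byte id remain decodable unchanged after the two-byte form is introduced, which is the pg_upgrade-compatibility property discussed above.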

But those identifiers should be *small* (since they are added to all
Datums)

if there will be any alignment at all between the datums, then
one byte will be lost in the noise ("remember: nobody will need
more than 256 compression algorithms")

No. There's no additional alignment involved here.

OTOH, if you plan to put these format markers in the compressed
stream and change the compression algorithm while reading it, I am
lost.

The marker *needs* to be in the compressed stream. When decompressing a
datum we only get passed the varlena.

and they need to be stable, even across pg_upgrade. So I think
making this user configurable would be grave error at this point.

"at this point" in what sense ?

If we add another algorithm with different tradeoffs we can have a column
attribute for choosing the algorithm. If there proves to be a need to
add more configurability, we can do that at that point.

Greetings,

Andres Freund


#16 Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#8)
Re: pluggable compression support

On Saturday, June 15, 2013 3:50 PM Andres Freund wrote:
On 2013-06-14 21:56:52 -0400, Robert Haas wrote:

I don't think we need it. I think what we need is to decide is which
algorithm is legally OK to use. And then put it in.

In the past, we've had a great deal of speculation about that legal
question from people who are not lawyers. Maybe it would be valuable
to get some opinions from people who ARE lawyers. Tom and Heikki both
work for real big companies which, I'm guessing, have substantial
legal departments; perhaps they could pursue getting the algorithms of
possible interest vetted. Or, I could try to find out whether it's
possible do something similar through EnterpriseDB.

I personally don't think the legal arguments holds all that much water
for snappy and lz4. But then the opinion of a european non-lawyer doesn't
hold much either.
Both are widely used by a large number open and closed projects, some of
which have patent grant clauses in their licenses. E.g. hadoop,
cassandra use lz4, and I'd be surprised if the companies behind those
have opened themselves to litigation.

I think we should preliminarily decide which algorithm to use before we
get lawyers involved. I'd surprised if they can make such a analysis
faster than we can rule out one of them via benchmarks.

I have also tried to use snappy for the patch "Performance Improvement by reducing WAL for Update Operation".
It has shown very good results and performed very well in all the tests Heikki asked for.
Results are at below link:
/messages/by-id/009001ce2c6e$9bea4790$d3bed6b0$@kapila@huawei.com

I think if we can get snappy into core, it can be used for more things.
I wanted to try it for FPW as well.

With Regards,
Amit Kapila.


#17 Robert Haas
robertmhaas@gmail.com
In reply to: Hannu Krosing (#13)
Re: pluggable compression support

On Sat, Jun 15, 2013 at 8:11 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:

Claiming that the algorithm will be one of only two (current and
"whatever algorithm we come up with ") suggests that it is
only one bit, which is undoubtedly too little for having a "pluggable"
compression API :)

See /messages/by-id/20130607143053.GJ29964@alap2.anarazel.de

But those identifiers should be *small* (since they are added to all
Datums)

if there will be any alignment at all between the datums, then
one byte will be lost in the noise ("remember: nobody will need
more than 256 compression algorithms")
OTOH, if you plan to put these format markers in the compressed
stream and change the compression algorithm while reading it, I am lost.

The above-linked email addresses this point as well: there are bits
available in the toast pointer. But there aren't MANY bits without
increasing the storage footprint, so trying to do something that's
more general than we really need is going to cost us in terms of
on-disk footprint. Is that really worth it? And if so, why? I don't
find the idea of a trade-off between compression/decompression speed
and compression ratio to be very exciting. As Andres says, bzip2 is
impractically slow for ... almost everything. If there's a good
BSD-licensed algorithm available, let's just use it and be done. Our
current algorithm has lasted us a very long time; I see no reason to
think we'll want to change this again for another 10 years, and by
that time, we may have redesigned the storage format altogether,
making the limited extensibility of our current TOAST pointer format
moot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#18 Robert Haas
robertmhaas@gmail.com
In reply to: Hannu Krosing (#14)
Re: pluggable compression support

On Sat, Jun 15, 2013 at 8:22 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:

You suddently need to solve the question of how the identifiers for
compression formats are allocated and preserved across pg_upgrade and
across machines.

This is something similar we already need to do for any non-builtin type
OID.

That's true, but that code has already been written. And it's not
trivial. The code involved is CREATE/ALTER/DROP TYPE plus all the
corresponding pg_dump mechanism. To do what you're proposing here,
we'd need CREATE/ALTER/DROP COMPRESSION METHOD, and associated pg_dump
--binary-upgrade support. I think Andres is entirely right to be
skeptical about that. It will make this project about 4 times as hard
for almost no benefit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#19Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#17)
Re: pluggable compression support

On 06/16/2013 03:50 AM, Robert Haas wrote:

On Sat, Jun 15, 2013 at 8:11 AM, Hannu Krosing <hannu@2ndquadrant.com> wrote:

Claiming that the algorithm will be one of only two (current and
"whatever algorithm we come up with ") suggests that it is
only one bit, which is undoubtedly too little for having a "pluggable"
compression API :)

See /messages/by-id/20130607143053.GJ29964@alap2.anarazel.de

But those identifiers should be *small* (since they are added to all
Datums)

if there will be any alignment at all between the datums, then
one byte will be lost in the noise ("remember: nobody will need
more than 256 compression algorithms")
OTOH, if you plan to put these format markers in the compressed
stream and change the compression algorithm while reading it, I am lost.

The above-linked email addresses this point as well: there are bits
available in the toast pointer. But there aren't MANY bits without
increasing the storage footprint, so trying to do something that's
more general than we really need is going to cost us in terms of
on-disk footprint. Is that really worth it? And if so, why? I don't
find the idea of a trade-off between compression/decompression speed
and compression ratio to be very exciting. As Andres says, bzip2 is
impractically slow for ... almost everything. If there's a good
BSD-licensed algorithm available, let's just use it and be done. Our
current algorithm has lasted us a very long time;

My scepticism about current algorithm comes from a brief test
(which may have been flawed) which showed almost no compression
for plain XML fields.

It may very well be that I was doing something stupid and got
wrong results, though, as the functions for asking about toast
internals - like "is this field compressed" or "what is the
compressed length of this field" - are well hidden, if available
at all, in our documentation.
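(One way to check this without any internals-level functions is to
compare the stored size with the logical length, e.g. with the
documented `pg_column_size()` function; the table and column names
below are made up for illustration:

```sql
-- pg_column_size() reports the on-disk, possibly compressed size;
-- octet_length() reports the logical (uncompressed) length.
-- If the two are close, the value was stored with little or no
-- compression benefit.
SELECT octet_length(doc)   AS raw_bytes,
       pg_column_size(doc) AS stored_bytes
FROM   xml_docs;
```

This doesn't say *why* a value wasn't compressed, but it does show
whether pglz achieved anything on a given column.)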

I see no reason to
think we'll want to change this again for another 10 years, and by
that time, we may have redesigned the storage format altogether,
making the limited extensibility of our current TOAST pointer format
moot.

Agreed.

I just hoped that "pluggable compression support" would
be something that enables people not directly interested in
hacking the core to experiment with compression and thereby
possibly coming up with something that changes your "not
useful in next 10 years" prediction :)

Seeing that the scope of this patch is actually much narrower,
I have no objections of doing it as proposed by Andres.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#12)
Re: pluggable compression support

On 15 June 2013 13:02, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-06-15 13:11:47 +0200, Hannu Krosing wrote:

If it were truly pluggable - that is just load a .dll, set a GUC and start
using it

Ok. I officially rechristen the patchset to 'extensible compression
support'.

+1

(I confess I was confused also.)

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#21Simon Riggs
simon@2ndQuadrant.com
In reply to: Hannu Krosing (#10)
#22Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#21)
#23Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#8)
#24Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#24)
#26Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#2)
#27Andres Freund
andres@anarazel.de
In reply to: Josh Berkus (#26)
#28Claudio Freire
klaussfreire@gmail.com
In reply to: Andres Freund (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Josh Berkus (#26)
#30Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#2)
#31Huchev
hugochevrain@gmail.com
In reply to: Josh Berkus (#30)
#32Daniel Farina
daniel@heroku.com
In reply to: Huchev (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Daniel Farina (#32)