QuickLZ compression algorithm (Re: Inclusion in the PostgreSQL backend for toasting rows)
I asked the author of the QuickLZ algorithm about licensing...
Sounds like he is willing to cooperate. This is what I got from him:
On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
Hi Stephen,
That sounds really exciting, I'd love to see QuickLZ included into
PostgreSQL. I'd be glad to offer support and add custom optimizations,
features or hacks or whatever should turn up.
My only concern is to avoid undermining the commercial license, but this
can, as you suggest, be solved by exceptionally allowing QuickLZ to be
linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
construction is possible.
Greetings,
Lasse Reinhold
Developer
http://www.quicklz.com/
lar@quicklz.com
On Sat, Jan 3 15:46, 'Stephen R. van den Berg' sent:
PostgreSQL is the most advanced Open Source database at this moment, it is
being distributed under a Berkeley license though.
What if we'd like to use your QuickLZ algorithm in the PostgreSQL core
to compress rows in the internal archive format (it's not going to be a
compression algorithm which is exposed to the SQL level)?
Is it conceivable that you'd allow us to use the algorithm free of charge
and allow it to be distributed under the Berkeley license, as long as it
is part of the PostgreSQL backend?
--
Sincerely,
Stephen R. van den Berg.
Expect the unexpected!)
--
Sincerely,
Stephen R. van den Berg.
On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
That sounds really exciting, I'd love to see QuickLZ included into
PostgreSQL. I'd be glad to offer support and add custom optimizations,
features or hacks or whatever should turn up.
My only concern is to avoid undermining the commercial license, but this
can, as you suggest, be solved by exceptionally allowing QuickLZ to be
linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
construction is possible.
Hmm ... keep in mind that PostgreSQL is used as a base for a certain
number of commercial, non-BSD products (Greenplum, Netezza,
EnterpriseDB, Truviso, are the ones that come to mind). Would this
exception allow for linking QuickLZ with them too? It doesn't sound to
me like you're open to relicensing it under BSD, which puts us in an
uncomfortable position.
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote:
On Sat, Jan 3, 2009 at 17:56, Lasse Reinhold <lar@quicklz.com> wrote:
That sounds really exciting, I'd love to see QuickLZ included into
PostgreSQL. I'd be glad to offer support and add custom optimizations,
features or hacks or whatever should turn up.
My only concern is to avoid undermining the commercial license, but this
can, as you suggest, be solved by exceptionally allowing QuickLZ to be
linked with PostgreSQL. Since I have exclusive copyright of QuickLZ any
construction is possible.
Hmm ... keep in mind that PostgreSQL is used as a base for a certain
number of commercial, non-BSD products (Greenplum, Netezza,
EnterpriseDB, Truviso, are the ones that come to mind). Would this
exception allow for linking QuickLZ with them too? It doesn't sound to
me like you're open to relicensing it under BSD, which puts us in an
uncomfortable position.
I'm not speaking for Lasse, merely providing food for thought, but it sounds
feasible to me (and conforming to the spirit of Lasse's intended license)
to put something like the following license on his code, which would allow
inclusion into the PostgreSQL codebase and not restrict usage in any
of the derived works:
"Grant license to use the code in question without cost, provided that
the code is being linked to at least 50% of the PostgreSQL code it is
being distributed alongside with."
This should allow commercial reuse in derived products without undesirable
side effects.
--
Sincerely,
Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
I'm not speaking for Lasse, merely providing food for thought, but it sounds
feasible to me (and conforming to the spirit of Lasse's intended license)
to put something like the following license on his code, which would allow
inclusion into the PostgreSQL codebase and not restrict usage in any
of the derived works:
"Grant license to use the code in question without cost, provided that
the code is being linked to at least 50% of the PostgreSQL code it is
being distributed alongside with."
This should allow commercial reuse in derived products without undesirable
side effects.
I think Postgres becomes non-DFSG-free if this is done. All of a
sudden one can't pull arbitrary pieces of code out of PG and use them
in other projects (which I'd argue is the intent if not the letter of
the DFSG). Have we ever allowed code in on these terms before? Are
we willing to be dropped from Debian and possibly Red Hat if this is
the case?
-Doug
Douglas McNaught wrote:
On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
I'm not speaking for Lasse, merely providing food for thought, but it sounds
feasible to me (and conforming to the spirit of Lasse's intended license)
to put something like the following license on his code, which would allow
inclusion into the PostgreSQL codebase and not restrict usage in any
of the derived works:
"Grant license to use the code in question without cost, provided that
the code is being linked to at least 50% of the PostgreSQL code it is
being distributed alongside with."
This should allow commercial reuse in derived products without undesirable
side effects.
I think Postgres becomes non-DFSG-free if this is done. All of a
sudden one can't pull arbitrary pieces of code out of PG and use them
in other projects (which I'd argue is the intent if not the letter of
the DFSG). Have we ever allowed code in on these terms before? Are
we willing to be dropped from Debian and possibly Red Hat if this is
the case?
Presumably a clean room implementation of this algorithm would get us
over these hurdles, if anyone wants to undertake it.
I certainly agree that we don't want arbitrary bits of our code to be
encumbered or licensed differently from the rest.
cheers
andrew
Andrew Dunstan wrote:
Douglas McNaught wrote:
On Mon, Jan 5, 2009 at 3:18 AM, Stephen R. van den Berg <srb@cuci.nl> wrote:
I'm not speaking for Lasse, merely providing food for thought, but it sounds
feasible to me (and conforming to the spirit of Lasse's intended license)
to put something like the following license on his code, which would allow
inclusion into the PostgreSQL codebase and not restrict usage in any
of the derived works:
"Grant license to use the code in question without cost, provided that
the code is being linked to at least 50% of the PostgreSQL code it is
being distributed alongside with."
This should allow commercial reuse in derived products without undesirable
side effects.
I think Postgres becomes non-DFSG-free if this is done. All of a
sudden one can't pull arbitrary pieces of code out of PG and use them
in other projects (which I'd argue is the intent if not the letter of
the DFSG). Have we ever allowed code in on these terms before? Are
we willing to be dropped from Debian and possibly Red Hat if this is
the case?
Presumably a clean room implementation of this algorithm would get us
over these hurdles, if anyone wants to undertake it.
I certainly agree that we don't want arbitrary bits of our code to be
encumbered or licensed differently from the rest.
Do we actually have any numbers showing that QuickLZ is faster and/or
compresses better than what we have now?
Stefan
Are
we willing to be dropped from Debian and possibly Red Hat if this is
the case?
No. I frankly think this discussion is a dead end.
The whole thing got started because Alex Hunsaker pointed out that his
database got a lot bigger because we disabled compression on columns >
1MB. It seems like the obvious thing to do is turn it back on again.
The only objection to that is that it will hurt performance,
especially on substring operations. That led to a discussion of
alternative compression algorithms, which is only relevant if we
believe that there are people out there who want to do substring
extractions on huge columns AND want those columns to be compressed.
At least on this thread, we have zero requests for that feature
combination.
What we do have is a suggestion from several people that the database
shouldn't be in the business of compressing data AT ALL. If we want
to implement that suggestion, then we could change the default column
storage type.
Regardless of whether we do that or not, no one has offered any
justification of the arbitrary decision not to compress columns >1MB,
and at least one person (Peter) has suggested that it is exactly
backwards. I think he's right, and this part should be backed out.
That will leave us back in the sensible place where people who want
compression can get it (which is currently false) and people who don't
want it can get rid of it (which has always been true). If there is a
demand for alternative compression algorithms, then someone can submit
a patch for that for 8.5.
...Robert
Douglas McNaught wrote:
"Grant license to use the code in question without cost, provided that
the code is being linked to at least 50% of the PostgreSQL code it is
being distributed alongside with."
This should allow commercial reuse in derived products without undesirable
side effects.
I think Postgres becomes non-DFSG-free if this is done. All of a
sudden one can't pull arbitrary pieces of code out of PG and use them
in other projects (which I'd argue is the intent if not the letter of
the DFSG). Have we ever allowed code in on these terms before? Are
we willing to be dropped from Debian and possibly Red Hat if this is
the case?
Upon reading the DFSG, it seems you have a point...
However...
QuickLZ is dual licensed:
a. Royalty-free perpetual use as part of the PostgreSQL backend or
any derived works of PostgreSQL which link in *at least* 50% of the
original PostgreSQL codebase.
b. GPL if a) does not apply for some reason.
I.e. for all intents and purposes, it fits the bill for both:
1. PostgreSQL-derived products (existing and future).
2. Debian/RedHat, since the source can be used under GPL.
In essence, it would be kind of a GPL license on steroids; it grants
Berkeley-style rights as long as the source is part of PostgreSQL (or a
derived work thereof), and it falls back to GPL if extracted.
--
Sincerely,
Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
"Douglas McNaught" <doug@mcnaught.org> writes:
I think Postgres becomes non-DFSG-free if this is done. All of a
sudden one can't pull arbitrary pieces of code out of PG and use them
in other projects (which I'd argue is the intent if not the letter of
the DFSG). Have we ever allowed code in on these terms before?
No, and we aren't starting now. Any submission that's not under a
BSD-equivalent license will be rejected. Count on it.
regards, tom lane
On Jan 5, 2009, at 1:16 PM, Stephen R. van den Berg wrote:
Upon reading the DFSG, it seems you have a point...
However...
QuickLZ is dual licensed:
a. Royalty-free perpetual use as part of the PostgreSQL backend or
any derived works of PostgreSQL which link in *at least* 50% of the
original PostgreSQL codebase.
How does one even define "50% of the original PostgreSQL codebase"?
What nonsense.
-M
Robert Haas wrote:
What we do have is a suggestion from several people that the database
shouldn't be in the business of compressing data AT ALL. If we want
+1
IMHO, this is a job for the application. I also think the current
implementation is a little odd in that it only compresses data objects
under a meg.
--
Andrew Chernow
eSilo, LLC
every bit counts
http://www.esilo.com/
"Robert Haas" <robertmhaas@gmail.com> writes:
Regardless of whether we do that or not, no one has offered any
justification of the arbitrary decision not to compress columns >1MB,
Er, yes, there was discussion before the change, for instance:
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php
And do you have any response to this point?
I think the right value for this setting is going to depend on the
environment. If the system is starved for cpu cycles then you won't want to
compress large data. If it's starved for i/o bandwidth but has spare cpu
cycles then you will.
http://archives.postgresql.org/pgsql-hackers/2009-01/msg00074.php
and at least one person (Peter) has suggested that it is exactly
backwards. I think he's right, and this part should be backed out.
Well, the original code had a threshold above which we *always* compressed even
if it saved only a single byte.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
"Robert Haas" <robertmhaas@gmail.com> writes:
The whole thing got started because Alex Hunsaker pointed out that his
database got a lot bigger because we disabled compression on columns >
1MB. It seems like the obvious thing to do is turn it back on again.
I suggest that before we make any knee-jerk responses, we need to go
back and reread the prior discussion. The current 8.4 code was proposed
here:
http://archives.postgresql.org/pgsql-patches/2008-02/msg00053.php
and that message links to several older threads that were complaining
about the 8.3 behavior. In particular the notion of an upper limit
on what we should attempt to compress was discussed in this thread:
http://archives.postgresql.org/pgsql-general/2007-08/msg01129.php
After poking around in those threads a bit, I think that the current
threshold of 1MB was something I just made up on the fly (I did note
that it needed tuning...). Perhaps something like 10MB would be a
better default. Another possibility is to have different minimum
compression rates for "small" and "large" datums.
regards, tom lane
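[Editorial note: Tom's idea of different minimum compression rates for "small" and "large" datums could be sketched as below. This is purely an illustration in Python, not PostgreSQL's actual TOAST code; the constants and function names are hypothetical, and zlib stands in for pglz.]

```python
import zlib

# Illustrative constants, not the real TOAST settings: demand bigger
# relative savings before keeping the compressed form of a large datum,
# since decompressing it later costs more CPU per access.
SMALL_LIMIT = 2048          # bytes; "small" vs "large" cutoff (hypothetical)
MIN_SAVINGS_SMALL = 0.20    # keep compression if it saves at least 20%
MIN_SAVINGS_LARGE = 0.30    # demand more savings from large datums

def toast_compress(datum: bytes) -> tuple[bytes, bool]:
    """Return (stored_bytes, is_compressed) under a size-dependent
    minimum-savings rule, instead of a hard never-compress-above-1MB cap."""
    if not datum:
        return datum, False
    compressed = zlib.compress(datum)
    min_savings = MIN_SAVINGS_SMALL if len(datum) <= SMALL_LIMIT else MIN_SAVINGS_LARGE
    savings = 1 - len(compressed) / len(datum)
    if savings >= min_savings:
        return compressed, True
    return datum, False  # compression didn't pay off; store as-is
```

The point of the sliding bar is that a large datum must justify its later decompression cost with a better ratio, rather than being excluded outright by size.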
On Mon, 05 Jan 2009 13:44:57 -0500, Andrew Chernow wrote:
Robert Haas wrote:
What we do have is a suggestion from several people that the database
shouldn't be in the business of compressing data AT ALL. If we want
DB2 users generally seem very happy with the built-in compression.
IMHO, this is a job for the application.
Changing applications is several times more expensive and often simply not
possible.
-h
On Mon, Jan 5, 2009 at 2:02 PM, Gregory Stark <stark@enterprisedb.com> wrote:
Regardless of whether we do that or not, no one has offered any
justification of the arbitrary decision not to compress columns >1MB,
Er, yes, there was discussion before the change, for instance:
http://archives.postgresql.org/pgsql-hackers/2007-08/msg00082.php
OK, maybe I'm missing something, but I don't see anywhere in that
email where it suggests NEVER compressing anything above 1MB. It
suggests some more nuanced things which are quite different.
And do you have any response to this point?
I think the right value for this setting is going to depend on the
environment. If the system is starved for cpu cycles then you won't want to
compress large data. If it's starved for i/o bandwidth but has spare cpu
cycles then you will.
http://archives.postgresql.org/pgsql-hackers/2009-01/msg00074.php
I think it is a good point, to the extent that compression is an
option that people choose in order to improve performance. I'm not
really convinced that this is the case, but I haven't seen much
evidence on either side of the question.
Well, the original code had a threshold above which we *always* compressed even
if it saved only a single byte.
I certainly don't think that's right either.
...Robert
On Mon, 2009-01-05 at 13:04 -0500, Robert Haas wrote:
Are
we willing to be dropped from Debian and possibly Red Hat if this is
the case?
Regardless of whether we do that or not, no one has offered any
justification of the arbitrary decision not to compress columns >1MB,
and at least one person (Peter) has suggested that it is exactly
backwards. I think he's right, and this part should be backed out.
That will leave us back in the sensible place where people who want
compression can get it (which is currently false) and people who don't
want it can get rid of it (which has always been true). If there is a
demand for alternative compression algorithms, then someone can submit
a patch for that for 8.5.
...Robert
+1
Sincerely,
Joshua D. Drake
--
PostgreSQL
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997
Guaranteed compression of large data fields is the responsibility of the
client. The database should feel free to compress behind the scenes if
it turns out to be desirable, but an expectation that it compresses is
wrong in my opinion.
That said, I'm wondering why compression has to be a problem or why >1
Mbyte is a reasonable compromise? I missed the original thread that led
to 8.4. It seems to me that transparent file system compression doesn't
have limits like "files must be less than 1 Mbyte to be compressed".
They don't exhibit poor file system performance. I remember back in the
386/486 days that I would always DriveSpace-compress everything,
because hard disks were so slow then that DriveSpace would actually
increase performance. The toast tables already give a sort of
block-addressable scheme. Compression can be on a per block or per set
of blocks basis allowing for seek into the block, or if compression
doesn't seem to be working for the first few blocks, the later blocks
can be stored uncompressed? Or is that too complicated compared to what
we have now? :-)
Cheers,
mark
--
Mark Mielke <mark@mielke.cc>
A.M. wrote:
On Jan 5, 2009, at 1:16 PM, Stephen R. van den Berg wrote:
Upon reading the DFSG, it seems you have a point...
However...
QuickLZ is dual licensed:
a. Royalty-free perpetual use as part of the PostgreSQL backend or
any derived works of PostgreSQL which link in *at least* 50% of the
original PostgreSQL codebase.
How does one even define "50% of the original PostgreSQL codebase"?
What nonsense.
It's a suggested (but by no means definitive) technical translation of
the legalese term "substantial". Substitute with something better, by all
means.
--
Sincerely,
Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"
Mark Mielke <mark@mark.mielke.cc> writes:
It seems to me that transparent file system compression doesn't have limits
like "files must be less than 1 Mbyte to be compressed". They don't exhibit
poor file system performance.
Well I imagine those implementations are more complex than toast is. I'm not
sure what lessons we can learn from their behaviour directly.
I remember back in the 386/486 days, that I would always DriveSpace compress
everything, because hard disks were so slow then that DriveSpace would
actually increase performance.
Surely this depends on whether your machine was cpu starved or disk starved?
Do you happen to recall which camp these anecdotal machines from 1980 fell in?
The toast tables already give a sort of block-addressable scheme.
Compression can be on a per block or per set of blocks basis allowing for
seek into the block,
The current toast architecture is that we compress the whole datum, then store
the datum either inline or using the same external blocking mechanism that we
use when not compressing. So this doesn't fit at all.
It does seem like an interesting idea to have toast chunks which are
compressed individually. So each chunk could be, say, an 8kb chunk of
plaintext and stored as whatever size it ends up being after compression. That
would allow us to do random access into external chunks as well as allow
overlaying the cpu costs of decompression with the i/o costs. It would get a
lower compression ratio than compressing the whole object together but we
would have to experiment to see how big a problem that was.
It would be pretty much rewriting the toast mechanism for external compressed
data though. Currently the storage and the compression are handled separately.
This would tie the two together in a separate code path.
Hm, it occurs to me we could almost use the existing code. Just store it as a
regular uncompressed external datum but allow the toaster to operate on the
data column (which it's normally not allowed to) to compress it, but not store
it externally.
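[Editorial note: the per-chunk idea Greg describes might be sketched as follows. This is an illustration only: zlib stands in for pglz, the names are made up, and a real implementation would also have to record where each compressed chunk lives on disk, since the chunks end up with varying stored sizes.]

```python
import zlib

CHUNK = 8192  # compress each 8kb slice of plaintext independently

def toast_store_chunked(datum: bytes) -> list[bytes]:
    """Compress fixed-size plaintext chunks independently, so that a
    substring can later be fetched without decompressing the whole datum."""
    return [zlib.compress(datum[i:i + CHUNK])
            for i in range(0, len(datum), CHUNK)]

def toast_substr(chunks: list[bytes], offset: int, length: int) -> bytes:
    """Random access: decompress only the chunks overlapping
    [offset, offset + length)."""
    first = offset // CHUNK
    last = (offset + length - 1) // CHUNK
    plain = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    start = offset - first * CHUNK
    return plain[start:start + length]
```

As the thread notes, the trade-off is a worse overall ratio (each 8kb chunk is compressed without context from its neighbours) in exchange for cheap seeks and the ability to overlap decompression with i/o.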
or if compression doesn't seem to be working for the first few blocks, the
later blocks can be stored uncompressed? Or is that too complicated compared
to what we have now? :-)
Actually we do that now, it was part of the same patch we're discussing.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Tom Lane wrote:
"Robert Haas" <robertmhaas@gmail.com> writes:
The whole thing got started because Alex Hunsaker pointed out that his
database got a lot bigger because we disabled compression on columns >
1MB. It seems like the obvious thing to do is turn it back on again.
After poking around in those threads a bit, I think that the current
threshold of 1MB was something I just made up on the fly (I did note
that it needed tuning...). Perhaps something like 10MB would be a
better default. Another possibility is to have different minimum
compression rates for "small" and "large" datums.
As far as I can imagine, the following use cases apply:
a. Columnsize <= 2048 bytes without substring access.
b. Columnsize <= 2048 bytes with substring access.
c. Columnsize > 2048 bytes compressible without substring access (text).
d. Columnsize > 2048 bytes incompressible with substring access (multimedia).
Can anyone think of another use case I missed here?
To cover those cases, the following solutions seem feasible:
Sa. Disable compression for this column (manually, by the DBA).
Sb. Check if the compression saves more than 20%, store uncompressed otherwise.
Sc. Check if the compression saves more than 20%, store uncompressed otherwise.
Sd. Check if the compression saves more than 20%, store uncompressed otherwise.
For Sb, Sc and Sd we should probably only check the first 256KB or so to
determine the expected savings.
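[Editorial note: the 256KB-probe heuristic in the last paragraph could look roughly like this. A sketch only: the names are invented, zlib stands in for pglz, and it assumes the first 256KB is representative of the whole datum, which the thread does not guarantee.]

```python
import zlib

PROBE = 256 * 1024   # sample size suggested above
MIN_SAVINGS = 0.20   # the 20% bar from Sb, Sc and Sd

def worth_compressing(datum: bytes) -> bool:
    """Probe only the first 256KB to estimate whether compressing the
    whole datum would clear the 20% savings bar."""
    sample = datum[:PROBE]
    if not sample:
        return False
    return len(zlib.compress(sample)) <= (1 - MIN_SAVINGS) * len(sample)
```

This keeps the compression decision O(1) in the datum size: text (case c) passes the probe and gets compressed, while already-compressed multimedia (case d) fails it after one cheap sample.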
--
Sincerely,
Stephen R. van den Berg.
"Well, if we're going to make a party of it, let's nibble Nobby's nuts!"