lztext and compression ratios...
I have been looking at using the lztext type and I have some
questions/observations. Most of my experience comes from attempting to
compress text records in a different database (CTREE), but I think the
experience is transferable.
My typical table consists of variable length text records. The average
length record is around 1K bytes. I would like to compress my records
to save space and improve I/O performance (smaller records means more
records fit into the file system cache which means less I/O - or so the
theory goes). I am not too concerned about CPU as we are using a 4-way
Sun Enterprise class server. So compression seems like a good idea to me.
My experience with attempting to compress such relatively small
(around 1K) text strings is that the compression ratio is not very
good. This is because the string is not long enough for the LZ
compression algorithm to establish really good compression patterns,
and because the de-compression table has to be built into each
record. What I have done in the past to get around these problems is
to "teach" the compression algorithm the patterns ahead of time and
store the de-compression patterns in an external table. Using
this technique, I have achieved *much* better compression ratios.
So my questions/comments are:
- What are the typical compression ratios on relatively small (i.e.
around 1K) strings seen with lztext?
- Does anyone see a need/use for a generalized string compression
type that can be "trained" external to the individual records?
- Am I crazy in even attempting to compress strings of this relative
size? My largest table currently contains about 2 million entries of
roughly 1K-size strings, or about 2 gig of data. If I could compress this
to about 33% of its original size (not unreasonable with a trained LZ
compression), I would save a lot of disk space (not really important)
and a lot of file system cache space (very important) and be able to fit
the entire table into memory (very, very important).
Thank you,
Jeff
Jeffery Collins <collins@onyx-technologies.com> writes:
My experience with attempting to compress such relatively small
(around 1K) text strings is that the compression ratio is not very
good. This is because the string is not long enough for the LZ
compression algorithm to establish really good compression patterns,
and because the de-compression table has to be built into each
record. What I have done in the past to get around these problems is
to "teach" the compression algorithm the patterns ahead of time and
store the de-compression patterns in an external table. Using
this technique, I have achieved *much* better compression ratios.
(Puts on compression-guru hat...)
There is much in what you say. Perhaps we should consider keeping the
lztext type around (currently it's slated for doom in 7.1, since the
TOAST feature will make plain text do everything lztext does and more)
and having it be different from text in that a training sample is
supplied when the column is defined. Not quite sure how that should
look or where to store the sample, but it could be a big win for tables
having a large number of moderate-sized text entries.
regards, tom lane
I was sitting here reading about TOAST and 7.1 and another question came
into my mind.. I looked on the web page for some information about features
to be added in 7.1 and an approximate 7.1 release date.. I found information
on the TOAST feature as well as Referential Integrity but nothing else.. If
I missed it please just point me in the right direction, if not I'd like to
ask about the approximate time of 7.1 arriving in a stable form and whether full
text indexing is going to be a part of 7.1..
I ask because if 7.1 is going to include full text indexing natively and is
going to arrive pretty soon, I might not continue on this project as I have
been (the new full text index trigger)..
Thanks!!!
-Mitch
figure sometime in October before v7.1 is released ... i believe the last
talk was of a beta starting sometime in September, but nothing is
ever "set in stone" until we actually do it :)
On Wed, 5 Jul 2000, Mitch Vincent wrote:
I was sitting here reading about TOAST and 7.1 and another question came
into my mind.. I looked on the web page for some information about features
to be added in 7.1 and an approximate 7.1 release date.. I found information
on the TOAST feature as well as Referential Integrity but nothing else.. If
I missed it please just point me in the right direction, if not I'd like to
ask about the approximate time of 7.1 arriving in a stable form and whether full
text indexing is going to be a part of 7.1..
I ask because if 7.1 is going to include full text indexing natively and is
going to arrive pretty soon, I might not continue on this project as I have
been (the new full text index trigger)..
Thanks!!!
-Mitch
Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
"Mitch Vincent" <mitch@venux.net> writes:
I was sitting here reading about TOAST and 7.1 and another question came
into my mind.. I looked on the web page for some information about features
to be added in 7.1 and an approximate 7.1 release date.. I found information
on the TOAST feature as well as Referential Integrity but nothing else.. If
I missed it please just point me in the right direction, if not I'd like to
ask about the approximate time of 7.1 arriving in a stable form and whether full
text indexing is going to be a part of 7.1..
AFAIK there are no plans on the table to do more with full text
indexing; at least none of the core developers are working on it.
(If someone else is and I've forgotten, my apologies.)
Currently the plans for 7.1 look like this:
* WAL (assuming Vadim gets it done soon)
* TOAST (pretty far along already)
* new fmgr (fmgr itself done, still need to turn crank on
converting built-in functions)
* better memory management, no more intraquery memory leaks
(about half done)
* some other stuff, but I think those are all the "big features"
that anyone has committed to get done
* whatever else gets done meanwhile
We were shooting for going beta around Aug 1 with release around Sep 1,
but don't hold us to that ;-).
Notably missing from this list is OUTER JOIN and other things that
depend on querytree redesign (such as fixing all the restrictions on
views). We've agreed to put that off till 7.2 in hopes of keeping the
7.1 development cycle reasonably short. Not sure what else will be
in 7.2, but the querytree work will be.
I ask because if 7.1 is going to include full text indexing natively and is
going to arrive pretty soon, I might not continue on this project as I have
been (the new full text index trigger)..
Seems like you should keep at it.
regards, tom lane
Mitch Vincent wrote:
I was sitting here reading about TOAST and 7.1 and another question came
into my mind.. I looked on the web page for some information about features
to be added in 7.1 and an approximate 7.1 release date.. I found information
on the TOAST feature as well as Referential Integrity but nothing else.. If
I missed it please just point me in the right direction, if not I'd like to
ask about the approximate time of 7.1 arriving in a stable form and whether full
text indexing is going to be a part of 7.1..
The project page you've seen was an idea I had several months
ago. I had a big success with it, since I got co-developers
for FOREIGN KEY. It was the first time I did something in an
open source project together with other developers, and the result
was stunning - even for me.
Unfortunately, none of the other "inner-circle" developers
picked up on that idea. Maybe what they're doing is not of a
nature where they would gain anything from co-
developers/volunteers.
Not an answer to your real question, more like a kick in the
A.. of my colleagues.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jeffery Collins wrote:
I have been looking at using the lztext type and I have some
questions/observations. Most of my experience comes from attempting to
compress text records in a different database (CTREE), but I think the
experience is transferable.
First of all, I welcome any suggestions/input on the
compression code I've implemented. Seems you're experienced
in this area, so keep going and hammer down any of my points
below.
You should know that the "lztext" type will disappear soon
and be replaced with the general "all varlena types are
potentially compressible" approach of TOAST.
My typical table consists of variable length text records. The average
length record is around 1K bytes. I would like to compress my records
to save space and improve I/O performance (smaller records means more
records fit into the file system cache which means less I/O - or so the
theory goes). I am not too concerned about CPU as we are using a 4-way
Sun Enterprise class server. So compression seems like a good idea to me.
My experience with attempting to compress such relatively small
(around 1K) text strings is that the compression ratio is not very
good. This is because the string is not long enough for the LZ
compression algorithm to establish really good compression patterns,
and because the de-compression table has to be built into each
record. What I have done in the past to get around these problems is
to "teach" the compression algorithm the patterns ahead of time and
store the de-compression patterns in an external table. Using
this technique, I have achieved *much* better compression ratios.
The compression algorithm used in "lztext" (and so in TOAST
in the future) doesn't have a de-compression table at all.
It's based on Adisak Pochanayon's SLZ algorithm, which emits
either a <literal_char> or a <token>. A <token> just tells how far to
go back in the OUTPUT buffer and how many bytes to copy from
OUTPUT to OUTPUT. Look at the code for details.
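Just to make the shape of that concrete, here is a toy decompressor
in the same spirit. It is only an illustrative sketch with a made-up
control-byte/token layout, not the actual format or code in
pg_lzcompress.c:

    /*
     * Toy LZ77-style decompressor: a control byte selects, per item,
     * either a literal byte or a 2-byte token.  A token copies "len"
     * bytes starting "off" bytes back in the OUTPUT already produced,
     * so no per-record decompression table is needed.
     */
    #include <stddef.h>
    #include <stdint.h>

    size_t
    toy_lz_decompress(const uint8_t *src, size_t srclen, uint8_t *dst)
    {
        const uint8_t *send = src + srclen;
        uint8_t       *dp = dst;

        while (src < send)
        {
            uint8_t ctrl = *src++;
            int     bit;

            for (bit = 0; bit < 8 && src < send; bit++, ctrl >>= 1)
            {
                if (ctrl & 1)
                {
                    /* token: 12-bit offset, 4-bit length (toy layout) */
                    uint16_t tok = (uint16_t) ((src[0] << 8) | src[1]);
                    size_t   off = (tok >> 4) + 1;
                    size_t   len = (tok & 0x0F) + 3;
                    uint8_t *cp = dp - off;

                    src += 2;
                    while (len--)
                        *dp++ = *cp++;   /* copy OUTPUT to OUTPUT */
                }
                else
                    *dp++ = *src++;      /* literal byte */
            }
        }
        return (size_t) (dp - dst);
    }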
My design rules for the compression inside of Postgres have
been
- being fast on decompression
- being good for relatively small values
- being fast on compression
The first rule is met by the implementation itself. Don't
underestimate this design rule! Usually you don't update as
often as you query. And the implementation of TOAST requires
super fast decompression.
The second rule is met by not needing any initial
decompression table inside of the stored value.
The third rule is controlled by the default strategy of the
algorithm, (unfortunately) hardcoded into
utils/adt/pg_lzcompress.c. It'll never try to compress items
smaller than 256 bytes. It'll fall back to plain storage (for
a speed advantage while decompressing a value) if less than 20%
compression is gained. It'll stop the match lookup if a
backward match of 128 or more bytes is found.
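For illustration only, you can picture that strategy as a little
parameter block like the following. The struct and field names are
made up for this sketch; they are not necessarily what
pg_lzcompress.c really uses:

    typedef struct ToyLZStrategy
    {
        size_t  min_input_size;  /* don't even try below this size        */
        int     min_comp_rate;   /* fall back to plain storage below this
                                  * percentage of savings                 */
        size_t  good_match_len;  /* stop the match lookup once a backward
                                  * match at least this long is found     */
    } ToyLZStrategy;

    static const ToyLZStrategy toy_default_strategy = {
        256,                     /* skip items smaller than 256 bytes     */
        20,                      /* require at least 20% compression      */
        128                      /* a 128-byte match is good enough       */
    };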
So my questions/comments are:
- What are the typical compression ratios on relatively small (i.e.
around 1K) strings seen with lztext?
I don't have items that small handy, but here is a table containing a
file path and its content. All files were HTML files.
From - To | count(*) | avg(length) | avg(octet_length)
---------------+----------+-------------+------------------
1024 - 2047 | 14 | 1905 | 1470
2048 - 4095 | 67 | 3059 | 1412
4096 - 8191 | 45 | 5384 | 2412
8192 - | 25 | 17200 | 6323
---------------+----------+-------------+------------------
all | 151 | 5986 | 2529
- Does anyone see a need/use for a generalized string compression
type that can be "trained" external to the individual records?
Yes, of course. Maybe "lztext" can be a framework for you, and
we just tell the toaster "never apply your lousy compression
on that" (it's prepared for that).
- Am I crazy in even attempting to compress strings of this relative
size? My largest table currently contains about 2 million entries of
roughly 1K-size strings, or about 2 gig of data. If I could compress this
to about 33% of its original size (not unreasonable with a trained LZ
compression), I would save a lot of disk space (not really important)
and a lot of file system cache space (very important) and be able to fit
the entire table into memory (very, very important).
No one is crazy for attempting to improve something. It might turn
out not to work well, or to have been brain damaged from the start.
But someone who never tries will miss all his chances.
Final note:
One thing to keep in mind is that the LZ algorithm you're
thinking of must be distributable under the terms of the BSD
license. If it's copyrighted or patented by any third party
that doesn't agree to these terms, it's out of the discussion and
will never appear in the Postgres source tree. The LZ
algorithm used in GIF in particular is one of these show-stoppers.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
JanWieck@t-online.de (Jan Wieck) writes:
One thing to keep in mind is that the LZ algorithm you're
thinking of must be distributable under the terms of the BSD
license. If it's copyrighted or patented by any third party
that doesn't agree to these terms, it's out of the discussion and
will never appear in the Postgres source tree. The LZ
algorithm used in GIF in particular is one of these show-stoppers.
As long as you brought it up: how sure are you that the method you've
used is not subject to any patents? The mere fact that you learned
it from someone who didn't patent it does not guarantee anything ---
someone else could have invented it independently and filed for a
patent.
If you can show that this method uses no ideas not found in zlib,
then I'll feel reassured --- a good deal of legal research went into
zlib to make sure it didn't fall foul of any patents, and zlib has
now been around long enough that it'd be tough for anyone to get a
new patent on one of its techniques. But if SLZ has any additional
ideas in it, then we could be in trouble. There are an awful lot of
compression patents.
regards, tom lane
Jan,
Thank you for your comments on lztext. They were very informative. I hope
you (and the rest of the list) don't mind my ramblings on compression, but I
think there is some performance/space advantage if compression can be integrated
into a table. I admit to not being a compression expert. All that I know
has been learned from reading some zlib source.
The compression that I have been working with (and am more familiar with) is
Huffman compression. You are probably familiar with it and may have even
evaluated it when you were looking at compression techniques. In my limited
understanding, Huffman compression has some advantages and disadvantages
compared with LZ compression.
The disadvantages seem to be:
- In normal Huffman compression, the string to be compressed is examined
twice. Once to build the translation table and once to do the translation.
- The translation table must be written with the string. This makes the
result larger and limits its effectiveness with small strings.
- I don't know for sure, but I wouldn't be surprised if decompression is
slower with Huffman than with LZ compression.
To get around these disadvantages, I modified my Huffman support in the
following way:
- I trained the translation table and stored the table separately. I was
able to do this because I knew before hand the type of text that would be in
my columns.
- I "improved" the compression by giving the translation table the
ability to look for well known substrings in addition to byte values. For
example the string "THE" in an English text might have it's own Huffman
translation value instead of relying on the individual T, H and E
translations. Again I was able to do this because I knew the type of text I
would be storing in the column.
Because the translation table is external to the string, it no longer has to
be stored with each value. This improves the compression ratio and helps
decompression, as the table does not need to be read and interpreted. This
approach also allows for and takes advantage of cross-row commonalities.
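To illustrate the idea (and only the idea), here is a stripped-down
sketch of an externally "trained" table. It is a plain static
substring dictionary rather than my actual Huffman coder, the
dictionary entries are invented, and it assumes 7-bit ASCII input so
that codes 0x80..0xFF are free to stand for dictionary entries:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* the "trained" table, stored outside the rows and shared by all values */
    static const char *const dict[] = {
        " the ", " and ", "tion", "ing ", NULL
    };

    size_t
    dict_compress(const char *src, uint8_t *dst)
    {
        uint8_t *dp = dst;

        while (*src)
        {
            int hit = -1, i;

            for (i = 0; dict[i] != NULL; i++)
                if (strncmp(src, dict[i], strlen(dict[i])) == 0)
                {
                    hit = i;
                    break;
                }

            if (hit >= 0)
            {
                *dp++ = (uint8_t) (0x80 + hit);  /* one byte for the substring */
                src += strlen(dict[hit]);
            }
            else
                *dp++ = (uint8_t) *src++;        /* ordinary ASCII byte */
        }
        return (size_t) (dp - dst);
    }

    size_t
    dict_decompress(const uint8_t *src, size_t len, char *dst)
    {
        char   *dp = dst;
        size_t  i;

        for (i = 0; i < len; i++)
        {
            if (src[i] >= 0x80)
            {
                const char *s = dict[src[i] - 0x80];

                memcpy(dp, s, strlen(s));
                dp += strlen(s);
            }
            else
                *dp++ = (char) src[i];
        }
        *dp = '\0';
        return (size_t) (dp - dst);
    }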
After doing all of this, I was able to get the average compression ratio to
about 3.5 (i.e. orig size / new size) on text columns of 1K or less.
Of course, the disadvantage to this is that if the dataset changes over time,
the compression translations do not change with it and the compression ratio
will get worse. In the very worst case, where the dataset totally changes
(say the language changed from English to Russian), the "compressed" string
could actually get larger than the original string, because everything the
compression algorithm thought it knew was totally wrong.
As I said, I did this with a different database (CTREE). My company is now
converting from CTREE to postgresql and I would like to find some fairly
elegant way to include this type of compression support. My CTREE
implementation was rather a hack, but now that we are moving to a better
database, I would like to have a better solution for compression.
Originally I latched onto the idea of using lztext or something like lztext,
but I agree it is better if the column type is something standard and the
compression is hidden by the backend. I guess my vision would be to be able
to define either a column type or a column attribute that would allow me to
indicate the type of compression to use and possibly the compression
algorithm and translation set. Actually, the real vision is to have it
all hidden and have the database learn about the column as rows are added and
modify the compression translation to adapt to the changing column data.
Jeff
Tom Lane wrote:
JanWieck@t-online.de (Jan Wieck) writes:
One thing to keep in mind is that the LZ algorithm you're
thinking of must be distributable under the terms of the BSD
license. If it's copyrighted or patented by any third party
that doesn't agree to these terms, it's out of the discussion and
will never appear in the Postgres source tree. The LZ
algorithm used in GIF in particular is one of these show-stoppers.
As long as you brought it up: how sure are you that the method you've
used is not subject to any patents? The mere fact that you learned
it from someone who didn't patent it does not guarantee anything ---
someone else could have invented it independently and filed for a
patent.
Now that you ask for it: I'm not sure. Could be.
If you can show that this method uses no ideas not found in zlib,
then I'll feel reassured --- a good deal of legal research went into
zlib to make sure it didn't fall foul of any patents, and zlib has
now been around long enough that it'd be tough for anyone to get a
new patent on one of its techniques. But if SLZ has any additional
ideas in it, then we could be in trouble. There are an awful lot of
compression patents.
I don't know enough about the algorithms used in zlib to judge
that. Is there someone out there who could verify it if I
describe in enough detail what our compression code does?
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
JanWieck@t-online.de (Jan Wieck) writes:
As long as you brought it up: how sure are you that the method you've
used is not subject to any patents?
Now that you ask for it: I'm not sure. Could be.
If you can show that this method uses no ideas not found in zlib,
then I'll feel reassured
I don't know enough about the algorithms used in zlib to judge
that. Is there someone out there who could verify it if I
describe in enough detail what our compression code does?
After a quick look at the code, I don't think there is anything
problematic about the data representation or the decompression
algorithm. The compression algorithm is another story, and it's
not real well commented :-(. The important issues are how you
search for matches in the past text and how you decide which match
is the best one to use. Please update the code comments to describe
that, and I'll take another look.
regards, tom lane
Tom Lane wrote:
JanWieck@t-online.de (Jan Wieck) writes:
As long as you brought it up: how sure are you that the method you've
used is not subject to any patents?
Now that you ask for it: I'm not sure. Could be.
If you can show that this method uses no ideas not found in zlib,
then I'll feel reassured.
I don't know enough about the algorithms used in zlib to judge
that. Is there someone out there who could verify it if I
describe in enough detail what our compression code does?
After a quick look at the code, I don't think there is anything
problematic about the data representation or the decompression
algorithm. The compression algorithm is another story, and it's
not real well commented :-(. The important issues are how you
search for matches in the past text and how you decide which match
is the best one to use. Please update the code comments to describe
that, and I'll take another look.
Done. You'll find a new section in the top comments.
While writing it I noticed that the algorithm is really
expensive for big items. The history lookup table allocated
is 8 times (on 32 bit architectures) the size of the input.
So if you want to have 1MB compressed, it'll allocate 8MB for
the history. It hit me when I was hunting a bug in the
toaster earlier today. Doing an update to a toasted item of
5MB, resulting in a new value of 10MB, the backend blew up to
290MB of virtual memory - oh boy. I definitely need to make
that smarter.
When I wrote it I never thought about items that big. It was
before we had the idea of TOAST.
This all might open another discussion I'll start in a
separate thread.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
figure sometime in October before v7.1 is released ... i believe the last
talk was of a beta starting sometime in September, but nothing is
ever "set in stone" until we actually do it :)
I agree. August is a very useful month because a lot of people are slow
at work during that month, and that gives them time to work on
PostgreSQL.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
JanWieck@t-online.de (Jan Wieck) writes:
Tom Lane wrote:
After a quick look at the code, I don't think there is anything
problematic about the data representation or the decompression
algorithm. The compression algorithm is another story, and it's
not real well commented :-(. The important issues are how you
search for matches in the past text and how you decide which match
is the best one to use. Please update the code comments to describe
that, and I'll take another look.
Done. You'll find a new section in the top comments.
I think we are probably OK. Anyone who wants to check the issue might
want to start with http://www.faqs.org/faqs/compression-faq/,
particularly part 1 item 8. The two patents I'm most concerned about
are Waterworth's (4701745) and Stac's (5016009) which both cover basic
variants of LZ77 (for full text of these or any other US patent consult
http://patent.womplex.ibm.com/). But it looks to me like this
particular variant can be argued not to fall under either patent.
For example, Waterworth's patent specifies using 3-byte hash values
while you use 4-byte, and stupid as that is it's sufficient to get
you out from under. (A more technically valid reason is that he
doesn't use hash collision chains.) The Stac patent also specifies
a data structure considerably different from your hash chains.
To give you an idea of what's involved here, I attach an old message
from Jean-loup Gailly, author of gzip and zlib, explaining why he
believes gzip doesn't fall foul of these two patents. Not all of
his reasons apply to your method, but I think enough do. (BTW,
Gailly's a good guy --- I worked with him on the PNG project.
I think I'll write to him and see if he's got time to review this
code for patent problems.)
While writing it I noticed that the algorithm is really
expensive for big items. The history lookup table allocated
is 8 times (on 32 bit architectures) the size of the input.
So if you want to have 1MB compressed, it'll allocate 8MB for
the history. It hit me when I was hunting a bug in the
toaster earlier today. Doing an update to a toasted item of
5MB, resulting in a new value of 10MB, the backend blew up to
290MB of virtual memory - oh boy. I definitely need to make
that smarter.
Yes, you need some smarter method for choosing the size of the
hash table... a braindead but possibly sufficient answer is to
have some hard limit on the size... or you could just use a
hard-wired constant size to begin with. I think it's usually
considered wise to use a prime for the table size and reduce
the raw input bits modulo the prime.
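Something along these lines, say. This is just a sketch of the
fixed-size-plus-prime-modulo idea, with made-up names, not a patch
against the existing pg_lzcompress.c:

    #include <stdint.h>

    #define HIST_TABLE_SIZE 8191    /* prime, fixed regardless of input size */

    static unsigned int
    hist_hash(const uint8_t *p)
    {
        /* mix four input bytes, then reduce modulo the prime table size */
        uint32_t h = ((uint32_t) p[0] << 24) ^ ((uint32_t) p[1] << 16) ^
                     ((uint32_t) p[2] << 8)  ^  (uint32_t) p[3];

        return (unsigned int) (h % HIST_TABLE_SIZE);
    }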
regards, tom lane
Article: 14604 of gnu.misc.discuss
Path: chorus!octave.chorus.fr!jloup
From: jloup@chorus.fr (Jean-loup Gailly)
Newsgroups: gnu.misc.discuss
Subject: Re: LPF, NY Times, GNU and compression algorithms
Message-ID: <7989@chorus.chorus.fr>
Date: 10 Mar 94 10:10:03 GMT
References: <RXN106.94Mar6103732@wilbur.cac.psu.edu> <HSU.94Mar7204324@laphroaig.cs.hut.fi> <MARC.94Mar8090631@marc.watson.ibm.com>
Sender: news@chorus.chorus.fr
Distribution: gnu
Lines: 201
Marc Auslander <marc@watson.ibm.com> writes:
Two Stac patents in the case are 4701745 and 5016009.
From 4601745: [typo, meant: 4701745]
the data processing means including circuit means operable to check
whether a sequence of successive bytes to be processed is identical with
a sequence of bytes already processed, and including a hash generating
means responsive to the application of a predetermined number of bytes ...
From 5016009:
in order to perform the hashing function, a data compression system
includes certain hash data structures including a history array
pointer ...
also:
... is found ..., encoding said matching string ... a variable length
indicator ..., said predetermined strategy ensuring that a matching
string of two characters of said input data stream is compressed to
less than said two characters of said input data stream.
So - on the face of it, why isn't gzip covered by these patents?
Let's take each patent in turn. A clarification first: the Stac patent
4,701,745 was not invented by Stac. It was initially owned by Ferranti
(UK) and only recently bought by Stac. This was a very profitable
acquisition (assuming it cost Stac far less than the $120 million they
won by using this patent against Microsoft).
(a) 4,701,745
This algorithm is now known as LZRW1, because Ross Williams reinvented
it independently later and posted it on comp.compression on April 22,
1991. Exactly the same algorithm has also been patented by Gibson and
Graybill (5,049,881). The patent office failed to recognize that
it was the same algorithm.
This algorithm uses LZ77 with hashing but no collision chains and
outputs unmatched bytes directly without further compression. gzip
uses collision chains of arbitrary length, and uses Huffman encoding
on the unmatched bytes:
- Claim 1 of the patent is restricted to (emphasis added by me):
output means operable to APPLY to a transfer medium each byte of data
not forming part of such an identical sequence; and
ENCODING means responsive to the identification of such a sequence
to APPLY to the transfer medium an identification signal which
identifies both the location in the input store of the previous
occurrence of the sequence of bytes and the number of bytes
contained in the sequence.
The claim thus makes a clear distinction between "encoding" and
"applying to the transfer medium". A system which compresses the
unmatched bytes does not infringe this patent.
- The description of the patent and claim 2 make clear that the check
for identity of the sequences of bytes is to be made only once (no
hash collision chains). Gzip performs an arbitrary number of such
checks. The "means" enumerated in claim 1 specify what the hash
table consists of, and this does not include any means for storing
hash collision chains.
- Claim 2 also requires that *all* bytes participating in the hash
function should be compared:
A system as claimed in claim 1 in which the circuit means also
includes check means operable to check for identity between EACH
of the said predetermined number of bytes in sequence and EACH of
a similar sequence of bytes contained in the input store at a
location defined by a pointer read out from the temporary store at
said address
[in plain English, this is the check for hash collision]
and to check whether identity exists between succeeding bytes in
each sequence of bytes, and a byte counter operable to count the
number of identical bytes in each sequence.
[this is the determination of the match length]
Gzip never checks for equality of the third byte used in the hash
function. The hash function is such that on a hash hit with equality
of the first two bytes, the third byte necessarily matches.
- In addition, gzip uses a "lazy" evaluation of string matches. Even
when a match is found, gzip may encode (with Huffman coding) a single
unmatched byte. This is done when gzip determines that it is more
beneficial to parse the input string differently because a longer
match follows. In the Waterworth patent, a string match is always
encoded as a (length, pointer) pair.
All other claims of the patent are dependent on claim 1
("a system as claimed in claim 1 in which ..."). Since gzip
does not infringe claim 1 it does not infringe the other claims.
In particular, claim 6 explicitly states that unmatched strings
are not compressed:
A system as claimed in claim 5 in which the data receiving means
includes decoder means operable to separate UNCOMPRESSED bytes of
data from identification signals.
The gzip decoder never receives uncompressed bytes since all input is
compressed with Huffman coding [both literals and (length, offset) pairs].
The only "invention" in the Waterworth patent is the absence of hash
collision chains. The original description of the LZ77 algorithm
required searching for the true longest match, not just checking the
length of one particular match. Using hashing for string searching
was very well known at the time of the patent application (March 86).
The patent description specifies explicitly that "Hash techniques are
well known and many differents forms of hash will be suitable".
The --fast option of gzip was deliberately made slower than it could be,
precisely because of the existence of the Waterworth patent.
See in particular the following part of the gzip TODO file:
Add a super-fast compression method, suitable for implementing
file systems with transparent compression. One problem is that the
best candidate (lzrw1) is patented twice (Waterworth 4,701,745
and Gibson & Graybill 5,049,881).
(b) 5,016,009
This is standard LZ77 with hashing, and collisions resolved using
linked lists. There are several important restrictions which let gzip
escape from the patent:
- the linked lists are implemented only with offsets. The end of
a chain is detected by adding together all offsets, until the
sum becomes greater than the size of the history buffer. gzip uses
direct indices, and so detects the end of the chains differently.
The exact wording of claim 1 of the patent is:
... said data compression system comprising ... an offset array means
... said method comprising the steps of ...
calculating a difference between said history array pointer
and said pointer obtained from said hash table means,
storing said difference into said offset array means entry
pointed to by said history array pointer, ...
gzip never calculates such a difference and does not have any offset
array.
- unmatched strings are emitted as literal bytes without any
compression. gzip uses Huffman encoding on the unmatched strings.
This is the same argument as for the Waterworth patent.
- unmatched strings are preceded by
... a "raw" data tag indicating that no matching data string was found
gzip does not use such a tag because it uses a single Huffman table for
both string literals and match lengths. It is only the prefix
property of Huffman codes which allows the decoder to distinguish
the two cases. So there is not a unique "raw" tag preceding all
literals. This is not a minor point. It is one of the reasons
giving gzip superior compression to that obtained with the Stac
algorithm.
- a string match is always encoded as a (length, pointer) pair.
Gzip uses a "lazy" evaluation of string matches as described
above for the Waterworth patent.
All other claims of the patent are dependent on claim 1 ("the method
of claim 1 wherein ..."). Since gzip does not infringe claim 1 it does
not infringe the other claims. In any case, I have studied in detail
all the 77 claims to make sure that gzip does not infringe.
Unrelated note: this Stac patent is the only one where I found an
original and non-obvious idea. The hash table is refreshed using an
incremental mechanism, so that the refresh overhead is distributed
among all input bytes. This allows the real-time response necessary in
disk compressors such as Stacker (the Stac product). gzip does not use
this idea, and refreshes the hash table in a straightforward manner
every 32K bytes.
One final comment: I estimate that I have now spent more time studying
data compression patents than actually implementing data compression
algorithms. I have a partial list of 318 data compression patents,
even though for the moment I restrict myself mostly to lossless
algorithms (ignoring most image compression patents). Richard
Stallman has been *extremely* careful before accepting gzip as the GNU
compressor. I continue to study new patents regularly. I would of
course very much prefer spending what's left of my spare time
improving the gzip compression algorithm instead of studying patents.
Some improvements that I would have liked to put in gzip have not been
incorporated because of patents.
In short, every possible precaution has been taken to make sure that
gzip isn't covered by patents.
Jean-loup Gailly, author of gzip.
jloup@chorus.fr
Maybe you just want to use zlib. Let other guys hammer out the details.
On Thu, 6 Jul 2000, Jan Wieck wrote:
While writing it I noticed that the algorithm is really
expensive for big items. The history lookup table allocated
is 8 times (on 32 bit architectures) the size of the input.
So if you want to have 1MB compressed, it'll allocate 8MB for
the history. It hit me when I was hunting a bug in the
toaster earlier today. Doing an update to a toasted item of
5MB, resulting in a new value of 10MB, the backend blew up to
290MB of virtual memory - oh boy. I definitely need to make
that smarter.
When I wrote it I never thought about items that big. It was
before we had the idea of TOAST.
This all might open another discussion I'll start in a
separate thread.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
--
Peter Eisentraut Sernanders vaeg 10:115
peter_e@gmx.net 75262 Uppsala
http://yi.org/peter-e/ Sweden
eisentrp@csis.gvsu.edu writes:
Maybe you just want to use zlib. Let other guys hammer out the details.
I've been wondering about that myself. Jan, have you actually
benchmarked decompression speed on this method vs. stock zlib?
Unless the differential is huge we might want to just skip the
patent worries.
zlib has a BSD-style license, so we could include it in our
distribution to avoid worries about whether it's installed.
I've written to Jean-loup for his advice. He's pretty busy these
days (CTO at some startup, I believe) but we'll see if he has time
to comment.
regards, tom lane
eisentrp@csis.gvsu.edu writes:
Maybe you just want to use zlib. Let other guys hammer out the details.
I've been wondering about that myself. Jan, have you actually
benchmarked decompression speed on this method vs. stock zlib?
Unless the differential is huge we might want to just skip the
patent worries.
If we found later that there was a patent problem, could we just issue a
new release with a new compression algorithm?
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Unless the differential is huge we might want to just skip the
patent worries.
If we found later that there was a patent problem, could we just issue a
new release with a new compression algorithm?
Yeah, we could, and it could presumably even be a fully compatible
dot-release with no change to the on-disk representation. That
representation and consequently the decompression algorithm are safe
enough, it's the details of the compressor's search for matching
patterns that are a patent minefield.
However, changing the code after-the-fact might not be enough to keep us
from being sued :-(. I'd rather use something that's pretty well
established as being in the clear...
regards, tom lane
eisentrp@csis.gvsu.edu wrote:
Maybe you just want to use zlib. Let other guys hammer out the details.
We cannot assume that zlib is available everywhere. Thus it
must be determined during configure, and where it isn't, TOAST
can only move off values to make tuples fit into blocks.
Since decompression of already-in-memory items is a lot faster
than doing an index scan on the TOAST table, I expect this to
make installations without zlib damned slow.
And how should binary distributions like RPMs handle it? I
assume that this problem is already on its way because of
the integration of zlib into pg_dump. The only way I see is
having different RPMs for each possible combination of
available helper libs. Or is there another way to work
around it?
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
JanWieck@t-online.de (Jan Wieck) writes:
eisentrp@csis.gvsu.edu wrote:
Maybe you just want to use zlib. Let other guys hammer out the details.
We cannot assume that zlib is available everywhere.
We can if we include it in our distribution --- which we could; it's
pretty small and uses a BSD-style license. I can assure you the zlib
guys would be happy with that. And it's certainly as portable as our
own code. The real question is, is a custom compressor enough better
than zlib for our purposes to make it worth taking any patent risks?
We could run zlib at a low compression setting (-z1 to -z3 maybe)
to make compression relatively fast, and since that also doesn't
generate a custom Huffman tree, the overhead in the compressed data
is minor even for short strings. And its memory footprint is
certainly no worse than Jan's method...
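For reference, calling zlib at a low setting really is as simple as
the following. compress2() is the stock zlib one-shot entry point;
the level of 3 here is only a guess at a reasonable "fast" setting
and would need benchmarking:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int
    main(void)
    {
        const char src[] = "some moderately sized text value ...";
        uLong      srclen = (uLong) strlen(src);
        Bytef      dst[1024];
        uLongf     dstlen = sizeof(dst);   /* in: buffer size, out: result size */

        /* level 3: cheap compression, no custom Huffman tree overhead to speak of */
        if (compress2(dst, &dstlen, (const Bytef *) src, srclen, 3) != Z_OK)
            return 1;

        printf("%lu -> %lu bytes\n",
               (unsigned long) srclen, (unsigned long) dstlen);
        return 0;
    }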
The real question is whether zlib decompression is markedly slower
than Jan's code. Certainly Jan's method is a lot simpler and *should*
be faster --- but on the other hand, zlib has had a heck of a lot
of careful performance tuning put into it over the years. The speed
difference might not be as bad as all that.
I think it's worth taking a look at the option.
regards, tom lane