[PATCH] COPY .. COMPRESSED

Started by Stephen Frost, about 13 years ago, 44 messages, pgsql-hackers
#1 Stephen Frost
sfrost@snowman.net

Greetings,

Attached is a patch to add a 'COMPRESSED' option to COPY which will
cause COPY to expect a gzip'd file on input and which will output a
gzip'd file on output. Included is support for backend COPY, psql's
\copy, regression tests for both, and documentation.

On top of this I plan to submit a trivial patch to add support for
this to file_fdw, allowing creation of FDW tables which operate
directly on compressed files (including CSVs, which is what I need
this patch for).

I've also begun working on a patch to allow this capability to be used
through pg_dump/pg_restore which would reduce the bandwidth used
between the client and the server for backups and restores. Ideally,
one would also be able to use custom format dumps, with compression,
even if the client-side pg_dump/pg_restore wasn't compiled with zlib
support.

Thanks,

Stephen

Attachments:

copy_compressed.patch (text/x-diff; charset=us-ascii), +624 -244
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#1)
Re: [PATCH] COPY .. COMPRESSED

Stephen Frost <sfrost@snowman.net> writes:

Attached is a patch to add a 'COMPRESSED' option to COPY which will
cause COPY to expect a gzip'd file on input and which will output a
gzip'd file on output. Included is support for backend COPY, psql's
\copy, regression tests for both, and documentation.

I don't think it's a very good idea to invent such a specialized option,
nor to tie it to gzip, which is widely considered to be old news.

There was discussion (and, I think, a patch in the queue) for allowing
COPY to pipe into or out of an arbitrary shell pipe. Why would that not
be enough to cover this use-case? That is, instead of a hard-wired
capability, people would do something like COPY TO '| gzip >file.gz'.
Or they could use bzip2 or whatever struck their fancy.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#2)
Re: [PATCH] COPY .. COMPRESSED

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

Attached is a patch to add a 'COMPRESSED' option to COPY which will
cause COPY to expect a gzip'd file on input and which will output a
gzip'd file on output. Included is support for backend COPY, psql's
\copy, regression tests for both, and documentation.

I don't think it's a very good idea to invent such a specialized option,
nor to tie it to gzip, which is widely considered to be old news.

We're already using gzip/zlib for pg_dump/pg_restore, so it was simple
and straightforward to add and would allow utilizing this option while
keeping the custom dump format the same. It also happens to match what
I need. While gzip might be 'old hat', it's still extremely popular.
I'd be happy to add support for bzip2 or something else that people are
interested in, and support compression options for zlib if necessary
too. This was intended to get the ball rolling on something, as the last
discussion that I had seen while hunting through the archives was from
2006; obviously I missed the boat on the last set of patches.

There was discussion (and, I think, a patch in the queue) for allowing
COPY to pipe into or out of an arbitrary shell pipe. Why would that not
be enough to cover this use-case? That is, instead of a hard-wired
capability, people would do something like COPY TO '| gzip >file.gz'.
Or they could use bzip2 or whatever struck their fancy.

Sounds like a nice idea, but I can't imagine it'd be available to anyone
except for superusers, and looking at that patch, that's exactly the
restriction which is in place for it. In addition, that patch's support
for "\copy" implements everything locally, making it little different
from "zcat mycsv.csv.gz | psql". The patch that I proposed actually
sent the compressed stream across the wire, reducing bandwidth
utilization.
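To make the bandwidth point concrete, server-side streaming compression can be sketched in a few lines. This is an illustrative Python sketch (a hypothetical `compress_stream` helper, not the patch's actual C code): the server compresses incrementally before anything crosses the wire.

```python
import zlib

def compress_stream(chunks):
    """Incrementally gzip an iterable of byte chunks, as a server could
    do before writing the data out to the client connection."""
    comp = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # gzip framing
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    yield comp.flush()

# Repetitive row data (typical of COPY output) compresses well.
rows = [b"42,some,row,data\n"] * 10000
raw = b"".join(rows)
compressed = b"".join(compress_stream(rows))
assert len(compressed) < len(raw)  # far fewer bytes cross the wire
```

The client then decompresses the stream (or simply writes it to a `.gz` file), which is what makes the bandwidth saving possible without any protocol change.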

All that said, I've nothing against having the pipe option for the
backend COPY command; a bit annoyed with myself for somehow missing that
patch. I don't like what it's doing with psql's \copy command and would
rather we figure out a way to support PROGRAM .. TO STDOUT, but that
still would require superuser privileges. I don't see any easy way to
support compressed data streaming to/from the server for COPY w/o
defining what methods are available or coming up with some ACL system
for what programs can be called by the backend.

Thanks,

Stephen

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Stephen Frost (#3)
Re: [PATCH] COPY .. COMPRESSED

On 14 January 2013 13:43, Stephen Frost <sfrost@snowman.net> wrote:

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

Attached is a patch to add a 'COMPRESSED' option to COPY which will
cause COPY to expect a gzip'd file on input and which will output a
gzip'd file on output. Included is support for backend COPY, psql's
\copy, regression tests for both, and documentation.

I don't think it's a very good idea to invent such a specialized option,
nor to tie it to gzip, which is widely considered to be old news.

We're already using gzip/zlib for pg_dump/pg_restore, so it was simple
and straightforward to add and would allow utilizing this option while
keeping the custom dump format the same.

Both thoughts are useful, I think.

There is a new option being added to pre/post process data, so it
seems like the best way to add new features - in general.

Specifically, we do support compressed output so a simple patch to
allow re-loading of the compressed data we generate does seem sensible
and reasonable.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5 Stephen Frost
sfrost@snowman.net
In reply to: Simon Riggs (#4)
Re: [PATCH] COPY .. COMPRESSED

* Simon Riggs (simon@2ndQuadrant.com) wrote:

There is a new option being added to pre/post process data, so it
seems like the best way to add new features - in general.

That structure appears to have no option for passing compressed data to
or from a client connection. Instead, it actually overloads the typical
meaning for options sent to \copy (which, imv, is "run COPY on the server
with these options and have the results stored locally") to mean
something different (run part of the COPY command on the server and part
of it locally).

Specifically, we do support compressed output so a simple patch to
allow re-loading of the compressed data we generate does seem sensible
and reasonable.

Right, we're already using gzip for pg_dump/pg_restore. This just gives
an option to move that compression over to the server side. Also, I'd
be happy to add support for other compression options.

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

Thanks,

Stephen

#6 Claudio Freire
klaussfreire@gmail.com
In reply to: Stephen Frost (#5)
Re: [PATCH] COPY .. COMPRESSED

On Mon, Jan 14, 2013 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

Unless it's one of a set of superuser-authorized compression tools.

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#5)
Re: [PATCH] COPY .. COMPRESSED

Stephen Frost <sfrost@snowman.net> writes:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

The design that was being kicked around allowed pipes to be used on the
client side too, ie \copy foo to '| gzip ...'. That form would not
require any special privileges, and might be thought preferable for
another reason too: it offloads the work from the server.

regards, tom lane

#8 Stephen Frost
sfrost@snowman.net
In reply to: Claudio Freire (#6)
Re: [PATCH] COPY .. COMPRESSED

* Claudio Freire (klaussfreire@gmail.com) wrote:

On Mon, Jan 14, 2013 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

Unless it's one of a set of superuser-authorized compression tools.

Which would require a new ACL system for handling that, as I mentioned..
That certainly isn't what the existing patch does.

What would that look like? How would it operate? How would a user
invoke it or even know what options are available? Would we provide
anything by default? It's great to consider that possibility but
there's a lot of details involved.

I'm a bit nervous about having a generalized system which can run
anything on the system when called by a superuser but when called by a
regular user we're on the hook to verify the request against a
superuser-provided list and to then make sure nothing goes wrong.

Thanks,

Stephen

#9 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#7)
Re: [PATCH] COPY .. COMPRESSED

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

The design that was being kicked around allowed pipes to be used on the
client side too, ie \copy foo to '| gzip ...'. That form would not
require any special privileges, and might be thought preferable for
another reason too: it offloads the work from the server.

It's a different use-case which, imv, is really already trivially
covered:

psql -c 'COPY foo TO STDOUT;' | gzip > myfile.gz

While there is no option currently for having the server do the
compression before sending the data over the wire.

Thanks,

Stephen

#10 Peter Eisentraut
peter_e@gmx.net
In reply to: Stephen Frost (#9)
Re: [PATCH] COPY .. COMPRESSED

On 1/14/13 11:28 AM, Stephen Frost wrote:

While there is no option currently for having the server do the
compression before sending the data over the wire.

OpenSSL?

#11 Stephen Frost
sfrost@snowman.net
In reply to: Peter Eisentraut (#10)
Re: [PATCH] COPY .. COMPRESSED

* Peter Eisentraut (peter_e@gmx.net) wrote:

On 1/14/13 11:28 AM, Stephen Frost wrote:

While there is no option currently for having the server do the
compression before sending the data over the wire.

OpenSSL?

To be honest, I expected that to come up earlier in this discussion.
It'd be redundant to use OpenSSL for compression and then ALSO do
compression on the client side to save into a custom format dump.
There's also plenty of reasons to not want to deal with OpenSSL just to
have compression support. Now, protocol-level on-the-wire compression
is another option, but there's quite a few drawbacks to that and quite a
bit of work involved. Having support for COPY-based compression could
be an answer for many cases where on-the-wire compression is desirable.

Being able to use pipes for the backend side of COPY is a good
solution for that. I'm looking forward to having it and plan to review
the patch. That said, I'd like to find an answer to some of these other
use cases, if possible.

Thanks,

Stephen

#12 Greg Stark
In reply to: Stephen Frost (#11)
Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 2:33 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Peter Eisentraut (peter_e@gmx.net) wrote:

On 1/14/13 11:28 AM, Stephen Frost wrote:

While there is no option currently for having the server do the
compression before sending the data over the wire.

OpenSSL?

To be honest, I expected that to come up earlier in this discussion.
It'd be redundant to use OpenSSL for compression and then ALSO do
compression on the client side to save into a custom format dump.

For what it's worth there was a security announcement not long ago
that made OpenSSL disable compression in streams by default. I'm not
sure if it's relevant to Postgres or not.

--
greg

#13 Claudio Freire
klaussfreire@gmail.com
In reply to: Stephen Frost (#11)
Re: [PATCH] COPY .. COMPRESSED

On Mon, Jan 14, 2013 at 11:33 PM, Stephen Frost <sfrost@snowman.net> wrote:

Now, protocol-level on-the-wire compression
is another option, but there's quite a few drawbacks to that and quite a
bit of work involved. Having support for COPY-based compression could
be an answer for many cases where on-the-wire compression is desirable.

Like?

Postgres' packetized protocol lends itself quite well for this kind of
thing. It could even be done on a packet-by-packet basis. The only
drawback I see, is that it pretty much rules out piping through
arbitrary commands (a protocol needs to be very clearly defined).

#14 Peter Eisentraut
peter_e@gmx.net
In reply to: Stephen Frost (#1)
Re: [PATCH] COPY .. COMPRESSED

On 1/13/13 9:16 PM, Stephen Frost wrote:

On top of this I plan to submit a trivial patch to add support for
this to file_fdw, allowing creation of FDW tables which operate
directly on compressed files (including CSVs, which is what I need
this patch for).

I've also begun working on a patch to allow this capability to be used
through pg_dump/pg_restore which would reduce the bandwidth used
between the client and the server for backups and restores. Ideally,
one would also be able to use custom format dumps, with compression,
even if the client-side pg_dump/pg_restore wasn't compiled with zlib
support.

I think a problem is that this code is now serving such different uses.

Operating on compressed files transparently in file_fdw is obviously
useful, but why only gzip? The gold standard is GNU tar, which can
operate on any compressed file in a variety of compression formats
without even having to specify an option.
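GNU tar's auto-detection boils down to sniffing magic bytes at the start of the file. A minimal sketch of the same idea (the `sniff` helper and format table are mine, for illustration, not from any patch):

```python
import bz2, gzip, lzma

# Well-known compression format signatures (first bytes of the file).
MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}

def sniff(data: bytes) -> str:
    """Identify a compression format from its leading magic bytes."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "plain"

assert sniff(gzip.compress(b"a,b,c\n")) == "gzip"
assert sniff(bz2.compress(b"a,b,c\n")) == "bzip2"
assert sniff(lzma.compress(b"a,b,c\n")) == "xz"
assert sniff(b"a,b,c\n") == "plain"
```

With detection like this, file_fdw could in principle dispatch to the right decompressor without the user specifying a format option at all.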

Writing compressed COPY output files on the backend has limited uses, at
least none have been clearly explained, and the popen patch might
address those better.

Writing compressed COPY output on the frontend can already be done
differently.

Compression on the wire is a different debate and it probably shouldn't
be snuck in through this backdoor.

Putting compressed COPY output from the backend straight into a
compressed pg_dump file sounds interesting, but this patch doesn't do
that yet, and I think there will be more issues to solve there.

#15 Stephen Frost
sfrost@snowman.net
In reply to: Peter Eisentraut (#14)
Re: [PATCH] COPY .. COMPRESSED

* Peter Eisentraut (peter_e@gmx.net) wrote:

Operating on compressed files transparently in file_fdw is obviously
useful, but why only gzip?

This isn't really an argument, imv. It's only gzip *right this moment*
because that's all that I implemented. I've already offered to add
bzip2 or whatever else people would like.

The gold standard is GNU tar, which can
operate on any compressed file in a variety of compression formats
without even having to specify an option.

Yes, that's what I was hoping to get to, eventually.

Writing compressed COPY output files on the backend has limited uses, at
least none have been clearly explained, and the popen patch might
address those better.

I do see value in the popen patch for server-side operations.

Writing compressed COPY output on the frontend can already be done
differently.

Certainly. On a similar vein, I'm not convinced that the popen patch
for psql's \copy is really a great addition.

Compression on the wire is a different debate and it probably shouldn't
be snuck in through this backdoor.

Considering the COPY-COMPRESSED-to-FE piece is the vast majority of the
patch, I hope you understand that it certainly wasn't my intent to try
and 'sneak it in'. Support for reading and writing compressed files
with COPY directly from the FE was one of my goals from the start on
this.

Putting compressed COPY output from the backend straight into a
compressed pg_dump file sounds interesting, but this patch doesn't do
that yet, and I think there will be more issues to solve there.

Let me just vent my dislike for the pg_dump code. :) Probably half the
time spent on this overall patch was fighting with that to make it work,
and it's actually about 90% of the way there, imv. Getting the
compressed data into pg_dump is working in my local branch, going to a
directory-format dump output, but the custom format is causing me some
difficulties, which I believe are related to the blocking that's used:
the blocks coming off the wire are 'full-size', if you will, instead of
being chunked down to 4KB by the client-side compression. I've simply
not had time to debug and fix it, and wanted to get the general patch
out for discussion (which I'm glad that I did, given that there's other
work going on that's related).
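The re-slicing itself is mechanically simple. As an illustrative Python sketch (a hypothetical `rechunk` helper, not pg_dump's code), arbitrarily sized chunks coming off the wire can be buffered and re-emitted as the fixed-size blocks the custom format's blocking seems to assume:

```python
def rechunk(stream, block_size=4096):
    """Re-slice an iterable of arbitrarily sized byte chunks into
    fixed-size blocks, emitting a short final block if data remains."""
    buf = bytearray()
    for chunk in stream:
        buf.extend(chunk)
        while len(buf) >= block_size:
            yield bytes(buf[:block_size])
            del buf[:block_size]
    if buf:
        yield bytes(buf)

# "Full-size" wire chunks become uniform 4KB blocks plus a remainder.
blocks = list(rechunk([b"x" * 10000, b"y" * 300], 4096))
assert [len(b) for b in blocks] == [4096, 4096, 2108]
```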

Thanks,

Stephen

#16 Stephen Frost
sfrost@snowman.net
In reply to: Claudio Freire (#13)
Re: [PATCH] COPY .. COMPRESSED

* Claudio Freire (klaussfreire@gmail.com) wrote:

Postgres' packetized protocol lends itself quite well for this kind of
thing. It could even be done on a packet-by-packet basis. The only
drawback I see, is that it pretty much rules out piping through
arbitrary commands (a protocol needs to be very clearly defined).

Actually, wouldn't PG's packet-based protocol be exactly the wrong way
to do any kind of good on-the-wire compression? You certainly wouldn't
want to waste time compressing small packets, such as a single INSERT
command or similar, as you'll always have to send a packet out anyway.
Even doing it at the protocol level with something ssl-like, where
you wrap the entire connection, wouldn't help if the client has a
process along the lines of:

send INSERT command
wait for response
send INSERT command
wait for response
..
..

Since you'd still have to flush after each small bit of data.

Where it does work well is when you move into a bulk-data mode (ala
COPY) and can compress relatively large amounts of data into a smaller
number of full-size packets to be sent.
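The difference is easy to quantify with zlib (an illustrative Python sketch, nothing from the protocol code): flushing the compressor after every small message pays a per-flush overhead of several bytes and loses cross-row redundancy, while bulk mode compresses the whole stream and flushes once.

```python
import zlib

rows = [f"{i},abcdefgh\n".encode() for i in range(2000)]

# Request/response style: a sync flush after every small message,
# because each one must go out on the wire immediately.
comp = zlib.compressobj(6)
flushed = 0
for r in rows:
    flushed += len(comp.compress(r))
    flushed += len(comp.flush(zlib.Z_SYNC_FLUSH))  # ~5 bytes overhead each
flushed += len(comp.flush())

# Bulk mode (COPY-style): compress the whole stream, flush at the end.
bulk = len(zlib.compress(b"".join(rows), 6))

assert bulk < flushed  # bulk-mode compression wins decisively
```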

Thanks,

Stephen

#17 Garick Hamlin
ghamlin@isc.upenn.edu
In reply to: Greg Stark (#12)
Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 01:35:57PM +0000, Greg Stark wrote:

On Tue, Jan 15, 2013 at 2:33 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Peter Eisentraut (peter_e@gmx.net) wrote:

On 1/14/13 11:28 AM, Stephen Frost wrote:

While there is no option currently for having the server do the
compression before sending the data over the wire.

OpenSSL?

To be honest, I expected that to come up earlier in this discussion.
It'd be redundant to use OpenSSL for compression and then ALSO do
compression on the client side to save into a custom format dump.

For what it's worth there was a security announcement not long ago
that made OpenSSL disable compression in streams by default. I'm not
sure if it's relevant to Postgres or not.

It's an interesting question. It might be. I thought at first it
wouldn't be relevant, but on reflection it is.

This attack is called the CRIME attack. This class of attacks stems from reuse
of a dictionary across some sort of confidentiality boundary. The attacker
looks at the traffic and notices 'how big' the network response is. This
tells the attacker whether the compressor has already seen the text.

So imagine I have a website and I keep session cookies, user names, and
password crypts in a database. Imagine the session key is a long
hexadecimal number. As an attacker, I could send, in the username field
of the login form, guesses at the prefix of somebody else's session key,
and peek at the encrypted traffic from the script to the database. As the
prefix match length increases, the database reply gets shorter.
Essentially, it's a side-channel attack that ends up reducing guessing to
a radix search by prefix of all valid session keys. In this attack, I
don't have access to the database except through the web form, but I
can see the encrypted database traffic.
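This length side channel can be demonstrated in a few lines. The following is a Python sketch with a made-up session id, nothing PostgreSQL-specific: because compression happens before encryption, an attacker who sees only ciphertext sizes can test prefix guesses.

```python
import zlib

SECRET = "sessionid=4b3a9f7c2d6e"  # hypothetical secret in the reply stream

def observed_length(guess):
    # The attacker controls `guess` and observes only the size of the
    # compressed (and then encrypted) reply containing the secret.
    reply = f"user={guess}; {SECRET}".encode()
    return len(zlib.compress(reply, 9))

# A guess sharing a longer prefix with the secret compresses better
# (a longer backreference match), so the observed reply is shorter.
right = observed_length("sessionid=4b3a")
wrong = observed_length("sessionid=zQxW")
assert right < wrong
```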

It is not a huge vulnerability, but yeah in some use cases if postgresql
used compression it might provide a difficult, but possible route.

Garick

#18 Claudio Freire
klaussfreire@gmail.com
In reply to: Stephen Frost (#16)
Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 1:08 PM, Stephen Frost <sfrost@snowman.net> wrote:

Where it does work well is when you move into a bulk-data mode (ala
COPY) and can compress relatively large amounts of data into a smaller
number of full-size packets to be sent.

Well... exactly. COPY is one case, big result sets is another.
And packet headers can include whether each packet is compressed or
not, which is quite transparent and easy to handle. There could even
be a negotiation phase and make it backwards-compatible.
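A per-message compressed flag might look like the following sketch (a hypothetical framing in Python, not the actual FE/BE protocol): one flag byte and a length prefix per message, falling back to the raw body whenever compression doesn't pay.

```python
import struct
import zlib

def pack(payload: bytes) -> bytes:
    """Frame a message as: 1-byte flag (1 = compressed), 4-byte
    big-endian body length, then the body itself."""
    body = zlib.compress(payload)
    if len(body) < len(payload):
        flag = 1                      # compression actually saved bytes
    else:
        flag, body = 0, payload       # tiny/incompressible: send raw
    return struct.pack("!BI", flag, len(body)) + body

def unpack(msg: bytes) -> bytes:
    flag, length = struct.unpack("!BI", msg[:5])
    body = msg[5:5 + length]
    return zlib.decompress(body) if flag else body

bulk = b"row,row,row\n" * 500   # repetitive -> flag set
tiny = b"x"                     # too small  -> flag clear
assert unpack(pack(bulk)) == bulk
assert unpack(pack(tiny)) == tiny
assert pack(bulk)[0] == 1 and pack(tiny)[0] == 0
```

A real design would negotiate the capability at connection startup, so an old client that never sets or expects the flag keeps working unchanged.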

#19 Stephen Frost
sfrost@snowman.net
In reply to: Claudio Freire (#18)
Re: [PATCH] COPY .. COMPRESSED

* Claudio Freire (klaussfreire@gmail.com) wrote:

On Tue, Jan 15, 2013 at 1:08 PM, Stephen Frost <sfrost@snowman.net> wrote:

Where it does work well is when you move into a bulk-data mode (ala
COPY) and can compress relatively large amounts of data into a smaller
number of full-size packets to be sent.

Well... exactly. COPY is one case, big result sets is another.
And packet headers can include whether each packet is compressed or
not, which is quite transparent and easy to handle. There could even
be a negotiation phase and make it backwards-compatible.

COPY and a large result set are the only cases, and a large result set
could easily be put inside of a COPY statement. I agree that large
result sets outside of COPY could benefit from compression and perhaps
we can formulate a way to support that also.

Thanks,

Stephen

#20 Bruce Momjian
bruce@momjian.us
In reply to: Stephen Frost (#15)
Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 10:55:04AM -0500, Stephen Frost wrote:

* Peter Eisentraut (peter_e@gmx.net) wrote:

Operating on compressed files transparently in file_fdw is obviously
useful, but why only gzip?

This isn't really an argument, imv. It's only gzip *right this moment*
because that's all that I implemented. I've already offered to add
bzip2 or whatever else people would like.

And this leads to a support-my-compression-binary-of-the-day mess. Why
not just allow them to do '|compression-binary'?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#21 Stephen Frost
sfrost@snowman.net
In reply to: Bruce Momjian (#20)
#22 Chris Browne
cbbrowne@acm.org
In reply to: Stephen Frost (#21)
#23 Bruce Momjian
bruce@momjian.us
In reply to: Chris Browne (#22)
#24 Stephen Frost
sfrost@snowman.net
In reply to: Chris Browne (#22)
#25 Stephen Frost
sfrost@snowman.net
In reply to: Bruce Momjian (#23)
#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#21)
#27 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#26)
#28 Bruce Momjian
bruce@momjian.us
In reply to: Stephen Frost (#25)
#29 Stephen Frost
sfrost@snowman.net
In reply to: Bruce Momjian (#28)
#30 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#27)
#31 Stephen Frost
sfrost@snowman.net
In reply to: Tom Lane (#30)
#32 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#31)
#33 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Stephen Frost (#31)
#34 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#33)
#35 Claudio Freire
klaussfreire@gmail.com
In reply to: Tom Lane (#30)
#36 Peter Eisentraut
peter_e@gmx.net
In reply to: Stephen Frost (#21)
#37 Stephen Frost
sfrost@snowman.net
In reply to: Peter Eisentraut (#36)
#38 Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#9)
#39 Robert Haas
robertmhaas@gmail.com
In reply to: Chris Browne (#22)
#40 Chris Browne
cbbrowne@acm.org
In reply to: Robert Haas (#39)
#41 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
#42 Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#41)
#43 Stephen Frost
sfrost@snowman.net
In reply to: Bruce Momjian (#42)
#44 Claudio Freire
klaussfreire@gmail.com
In reply to: Robert Haas (#41)