[PATCH] COPY .. COMPRESSED

Started by Stephen Frostalmost 13 years ago44 messages

sfrost@snowman.net

almost 13 years ago

1 attachment(s)

Greetings,

Attached is a patch to add a 'COMPRESSED' option to COPY which will
cause COPY to expect a gzip'd file on input and which will output a
gzip'd file on output. Included is support for backend COPY, psql's
\copy, regression tests for both, and documentation.

On top of this I plan to submit a trivial patch to add support for
this to file_fdw, allowing creation of FDW tables which operate
directly on compressed files (including CSVs, which is what I need
this patch for).

I've also begun working on a patch to allow this capability to be used
through pg_dump/pg_restore which would reduce the bandwidth used
between the client and the server for backups and restores. Ideally,
one would also be able to use custom format dumps, with compression,
even if the client-side pg_dump/pg_restore wasn't compiled with zlib
support.

Thanks,

Stephen

Attachments:

copy_compressed.patchtext/x-diff; charset=us-asciiDownload

colordiff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
new file mode 100644
index 6a0fabc..5c58dd2
*** a/doc/src/sgml/ref/copy.sgml
--- b/doc/src/sgml/ref/copy.sgml
*************** COPY { <replaceable class="parameter">ta
*** 38,43 ****
--- 38,44 ----
      DELIMITER '<replaceable class="parameter">delimiter_character</replaceable>'
      NULL '<replaceable class="parameter">null_string</replaceable>'
      HEADER [ <replaceable class="parameter">boolean</replaceable> ]
+     COMPRESSED [ <replaceable class="parameter">boolean</replaceable> ]
      QUOTE '<replaceable class="parameter">quote_character</replaceable>'
      ESCAPE '<replaceable class="parameter">escape_character</replaceable>'
      FORCE_QUOTE { ( <replaceable class="parameter">column_name</replaceable> [, ...] ) | * }
*************** COPY { <replaceable class="parameter">ta
*** 254,259 ****
--- 255,271 ----
       </para>
      </listitem>
     </varlistentry>
+ 
+    <varlistentry>
+     <term><literal>COMPRESSED</literal></term>
+     <listitem>
+      <para>
+       Specifies that the file contents are compressed using zlib.  On input,
+       the data should be compressed with zlib (eg: using gzip).  On output,
+       the resulting data will be the compressed contents of the table.
+      </para>
+     </listitem>
+    </varlistentry>
  
     <varlistentry>
      <term><literal>QUOTE</literal></term>
colordiff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
new file mode 100644
index abd82cf..1f394de
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***************
*** 19,24 ****
--- 19,27 ----
  #include <sys/stat.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>
+ #ifdef HAVE_LIBZ
+ #include <zlib.h>
+ #endif
  
  #include "access/heapam.h"
  #include "access/htup_details.h"
***************
*** 59,66 ****
  typedef enum CopyDest
  {
  	COPY_FILE,					/* to/from file */
  	COPY_OLD_FE,				/* to/from frontend (2.0 protocol) */
! 	COPY_NEW_FE					/* to/from frontend (3.0 protocol) */
  } CopyDest;
  
  /*
--- 62,71 ----
  typedef enum CopyDest
  {
  	COPY_FILE,					/* to/from file */
+ 	COPY_GZFILE,				/* to/from compressed file */
  	COPY_OLD_FE,				/* to/from frontend (2.0 protocol) */
! 	COPY_NEW_FE,				/* to/from frontend (3.0 protocol) */
! 	COPY_NEW_FE_COMPRESSED		/* FE 3.0 protocol, compressed */
  } CopyDest;
  
  /*
*************** typedef struct CopyStateData
*** 95,100 ****
--- 100,110 ----
  	/* low-level state data */
  	CopyDest	copy_dest;		/* type of copy source/destination */
  	FILE	   *copy_file;		/* used if copy_dest == COPY_FILE */
+ #ifdef HAVE_LIBZ
+ 	gzFile		copy_gzfile;	/* used if copy_dest == COPY_GZFILE */
+ 	z_stream   *zstrm;			/* used if streaming compressed data */
+ 	char	   *zstrm_buf;		/* used if streaming compressed data */
+ #endif
  	StringInfo	fe_msgbuf;		/* used for all dests during COPY TO, only for
  								 * dest == COPY_NEW_FE in COPY FROM */
  	bool		fe_eof;			/* true if detected end of copy data */
*************** typedef struct CopyStateData
*** 109,114 ****
--- 119,125 ----
  	List	   *attnumlist;		/* integer list of attnums to copy */
  	char	   *filename;		/* filename, or NULL for STDIN/STDOUT */
  	bool		binary;			/* binary format? */
+ 	bool		compressed;		/* compressed file? */
  	bool		oids;			/* include OIDs? */
  	bool		freeze;			/* freeze rows on loading? */
  	bool		csv_mode;		/* Comma Separated Value format? */
*************** static bool CopyGetInt32(CopyState cstat
*** 320,325 ****
--- 331,437 ----
  static void CopySendInt16(CopyState cstate, int16 val);
  static bool CopyGetInt16(CopyState cstate, int16 *val);
  
+ #ifdef HAVE_LIBZ
+ /* Helper functions for working with zlib */
+ static void zstrm_init(CopyState cstate, bool is_from);
+ void *zstrm_palloc(void *curr_memctx, unsigned int num, unsigned int size);
+ void zstrm_pfree(void *opaque, void *addr);
+ 
+ /*
+  * Define some useful constants.
+  * We're just using the default windowBits, but we have to combine it
+  * with 16 to turn on gzip encoding, which we want to use.
+  */
+ #define DEF_WINDOW_BITS 15
+ #define GZIP_ENCODING 16
+ #define DEF_MEMLEVEL 8
+ 
+ /*
+  * Our wrapper to be like standard palloc(), which we later pass to zlib
+  * Note that we can't just use palloc() since it's a #define to
+  * MemoryContextAlloc.
+  */
+ void *
+ zstrm_palloc(void *curr_memctx, unsigned int num, unsigned int size)
+ {
+ 	return MemoryContextAlloc(curr_memctx, num*size);
+ }
+ 
+ /*
+  * Do-nothing function to pass as free()
+  * No need to actually pfree() as the memory context will be released.
+  */
+ void
+ zstrm_pfree(void *opaque, void *addr)
+ {
+ 	opaque = NULL; /* not used */
+ 	addr = NULL; /* will be handled by the context being released */
+ 
+ 	return;
+ }
+ 
+ /* Set up zstrm for streaming data */
+ static void
+ zstrm_init(CopyState cstate, bool is_from)
+ {
+ 	/* Allocate storage for our z_stream and buffer */
+ 	cstate->zstrm = palloc(sizeof(z_stream));
+ 	cstate->zstrm_buf = palloc(RAW_BUF_SIZE);
+ 
+ 	/* Have zlib use our malloc/free functions */
+ 	cstate->zstrm->zalloc = &zstrm_palloc;
+ 	cstate->zstrm->zfree  = &zstrm_pfree;
+ 
+ 	/* Provide the current memory context, used by MemoryConextAlloc() */
+ 	cstate->zstrm->opaque = CurrentMemoryContext;
+ 
+ 	/* Initialize avail_in- there's never data ready right away... */
+ 	cstate->zstrm->avail_in = 0;
+ 
+ 	/*
+ 	 * Initialize our pointers and structures based on if we are going
+ 	 * doing decompression (from) or compression (to)
+ 	 */
+ 	if (is_from)
+ 	{
+ 		/*
+ 		 * With decompression, we set up the input buffer as where we will
+ 		 * load our data in to and then ask inflate() to process the
+ 		 * compressed data and store the uncompressed results in the
+ 		 * caller's buffer
+ 		 */
+ 		/* Data will eventually be coming from our buffer */
+ 		cstate->zstrm->next_in  = (unsigned char*) cstate->zstrm_buf;
+ 
+ 		/* Output space not available yet */
+ 		cstate->zstrm->next_out = Z_NULL;
+ 		cstate->zstrm->avail_out = 0;
+ 		if (inflateInit2(cstate->zstrm,
+ 						 DEF_WINDOW_BITS | GZIP_ENCODING) != Z_OK)
+ 			elog(ERROR, "Error initializing zlib stream.");
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * With compression, we set up the output buffer as where we will
+ 		 * receive data from the deflate() function, to send to the client;
+ 		 * the input data comes from the caller sending us uncompressed data.
+ 		 */
+ 		/* No data available initially */
+ 		cstate->zstrm->next_in  = Z_NULL;
+ 
+ 		/* Data will eventually be going to our buffer */
+ 		cstate->zstrm->next_out = (unsigned char*) cstate->zstrm_buf;
+ 		cstate->zstrm->avail_out = RAW_BUF_SIZE;
+ 		if (deflateInit2(cstate->zstrm, Z_DEFAULT_COMPRESSION,
+ 						 Z_DEFLATED, DEF_WINDOW_BITS | GZIP_ENCODING,
+ 						 DEF_MEMLEVEL, Z_DEFAULT_STRATEGY) != Z_OK)
+ 			elog(ERROR, "Error initializing zlib stream.");
+ 	}
+ 
+ 	return;
+ }
+ #endif
  
  /*
   * Send copy start/stop messages for frontend copies.  These have changed
*************** SendCopyBegin(CopyState cstate)
*** 333,339 ****
  		/* new way */
  		StringInfoData buf;
  		int			natts = list_length(cstate->attnumlist);
! 		int16		format = (cstate->binary ? 1 : 0);
  		int			i;
  
  		pq_beginmessage(&buf, 'H');
--- 445,451 ----
  		/* new way */
  		StringInfoData buf;
  		int			natts = list_length(cstate->attnumlist);
! 		int16		format = (cstate->binary || cstate->compressed ? 1 : 0);
  		int			i;
  
  		pq_beginmessage(&buf, 'H');
*************** SendCopyBegin(CopyState cstate)
*** 342,348 ****
  		for (i = 0; i < natts; i++)
  			pq_sendint(&buf, format, 2);		/* per-column formats */
  		pq_endmessage(&buf);
! 		cstate->copy_dest = COPY_NEW_FE;
  	}
  	else if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 2)
  	{
--- 454,473 ----
  		for (i = 0; i < natts; i++)
  			pq_sendint(&buf, format, 2);		/* per-column formats */
  		pq_endmessage(&buf);
! 
! 		if (cstate->compressed)
! 		{
! #ifdef HAVE_LIBZ
! 			cstate->copy_dest = COPY_NEW_FE_COMPRESSED;
! 
! 			/* Initialize our zstream */
! 			zstrm_init(cstate,false);
! #else
! 			elog(ERROR, "Not compiled with zlib support.");
! #endif
! 		}
! 		else
! 			cstate->copy_dest = COPY_NEW_FE;
  	}
  	else if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 2)
  	{
*************** ReceiveCopyBegin(CopyState cstate)
*** 378,384 ****
  		/* new way */
  		StringInfoData buf;
  		int			natts = list_length(cstate->attnumlist);
! 		int16		format = (cstate->binary ? 1 : 0);
  		int			i;
  
  		pq_beginmessage(&buf, 'G');
--- 503,509 ----
  		/* new way */
  		StringInfoData buf;
  		int			natts = list_length(cstate->attnumlist);
! 		int16		format = (cstate->binary || cstate->compressed ? 1 : 0);
  		int			i;
  
  		pq_beginmessage(&buf, 'G');
*************** ReceiveCopyBegin(CopyState cstate)
*** 387,393 ****
  		for (i = 0; i < natts; i++)
  			pq_sendint(&buf, format, 2);		/* per-column formats */
  		pq_endmessage(&buf);
! 		cstate->copy_dest = COPY_NEW_FE;
  		cstate->fe_msgbuf = makeStringInfo();
  	}
  	else if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 2)
--- 512,532 ----
  		for (i = 0; i < natts; i++)
  			pq_sendint(&buf, format, 2);		/* per-column formats */
  		pq_endmessage(&buf);
! 
! 		if (cstate->compressed)
! 		{
! #ifdef HAVE_LIBZ
! 			cstate->copy_dest = COPY_NEW_FE_COMPRESSED;
! 
! 			/* Initialize our zstream */
! 			zstrm_init(cstate,true);
! #else
! 			elog(ERROR, "Not compiled with zlib support.");
! #endif
! 		}
! 		else
! 			cstate->copy_dest = COPY_NEW_FE;
! 
  		cstate->fe_msgbuf = makeStringInfo();
  	}
  	else if (PG_PROTOCOL_MAJOR(FrontendProtocol) >= 2)
*************** SendCopyEnd(CopyState cstate)
*** 424,429 ****
--- 563,593 ----
  		/* Send Copy Done message */
  		pq_putemptymessage('c');
  	}
+ 	else if (cstate->copy_dest == COPY_NEW_FE_COMPRESSED)
+ 	{
+ #ifdef HAVE_LIBZ
+ 		int result;
+ 
+ 		/* Finish sending data */
+ 		cstate->zstrm->next_in  = (unsigned char*) cstate->fe_msgbuf->data;
+ 		cstate->zstrm->avail_in = cstate->fe_msgbuf->len;
+ 
+ 		do {
+ 			result = deflate(cstate->zstrm, Z_FINISH);
+ 			if (result != Z_OK && result != Z_STREAM_END)
+ 				elog(ERROR, "Error finishing compressed stream.");
+ 
+ 			(void) pq_putmessage('d', (char*) cstate->zstrm_buf, RAW_BUF_SIZE - cstate->zstrm->avail_out);
+ 
+ 		/* ensure we empty the buffer */
+ 		} while (result == Z_OK || cstate->zstrm->avail_in > 0);
+ 
+ 		/* Send Copy Done message */
+ 		pq_putemptymessage('c');
+ #else
+ 		elog(ERROR, "Not compiled with zlib support.");
+ #endif
+ 	}
  	else
  	{
  		CopySendData(cstate, "\\.", 2);
*************** CopySendEndOfRow(CopyState cstate)
*** 486,491 ****
--- 650,676 ----
  						(errcode_for_file_access(),
  						 errmsg("could not write to COPY file: %m")));
  			break;
+ 		case COPY_GZFILE:
+ 			if (!cstate->binary)
+ 			{
+ 				/* Default line termination depends on platform */
+ #ifndef WIN32
+ 				CopySendChar(cstate, '\n');
+ #else
+ 				CopySendString(cstate, "\r\n");
+ #endif
+ 			}
+ 
+ #ifdef HAVE_LIBZ
+ 			if (gzwrite(cstate->copy_gzfile, fe_msgbuf->data,
+ 						fe_msgbuf->len) != fe_msgbuf->len)
+ 				ereport(ERROR,
+ 						(errcode_for_file_access(),
+ 						 errmsg("could not write to COPY file: %m")));
+ #else
+ 			elog(ERROR, "Not compiled with zlib support.");
+ #endif
+ 			break;
  		case COPY_OLD_FE:
  			/* The FE/BE protocol uses \n as newline for all platforms */
  			if (!cstate->binary)
*************** CopySendEndOfRow(CopyState cstate)
*** 507,512 ****
--- 692,731 ----
  			/* Dump the accumulated row as one CopyData message */
  			(void) pq_putmessage('d', fe_msgbuf->data, fe_msgbuf->len);
  			break;
+ 		case COPY_NEW_FE_COMPRESSED:
+ #ifdef HAVE_LIBZ
+ 			/* The FE/BE protocol uses \n as newline for all platforms */
+ 			if (!cstate->binary)
+ 				CopySendChar(cstate, '\n');
+ 
+ 			/* We should always have some room for more data */
+ 			Assert(cstate->zstrm->avail_out > 0);
+ 			
+ 			cstate->zstrm->next_in  = (unsigned char*) fe_msgbuf->data;
+ 			cstate->zstrm->avail_in = fe_msgbuf->len;
+ 
+ 			/* Dump the accumulated row as one CopyData message */
+ 			do {
+ 				if (deflate(cstate->zstrm, Z_NO_FLUSH) != Z_OK)
+ 					elog(ERROR, "Error during compression");
+ 
+ 				/* Once avail_out is full, send it */
+ 				if (cstate->zstrm->avail_out == 0)
+ 				{
+ 					(void) pq_putmessage('d', (char*) cstate->zstrm_buf, RAW_BUF_SIZE);
+ 
+ 					/* Reset the buffer */
+ 					cstate->zstrm->avail_out = RAW_BUF_SIZE;
+ 					cstate->zstrm->next_out = (unsigned char*) cstate->zstrm_buf;
+ 				}
+ 			} while (cstate->zstrm->avail_in > 0); /* ensure we empty buffer */
+ #else
+ 			elog(ERROR, "Not compiled with zlib support.");
+ #endif
+ 			break;
+ 		default:
+ 			/* Should never reach here */
+ 			Assert(false);
  	}
  
  	resetStringInfo(fe_msgbuf);
*************** CopyGetData(CopyState cstate, void *data
*** 539,544 ****
--- 758,774 ----
  						(errcode_for_file_access(),
  						 errmsg("could not read from COPY file: %m")));
  			break;
+ 		case COPY_GZFILE:
+ #ifdef HAVE_LIBZ
+ 			bytesread = gzread(cstate->copy_gzfile, databuf, maxread);
+ 			if (bytesread == -1)
+ 				ereport(ERROR,
+ 						(errcode_for_file_access(),
+ 						 errmsg("could not read from COPY file: %m")));
+ #else
+ 			elog(ERROR, "Not compiled with zlib support.");
+ #endif
+ 			break;
  		case COPY_OLD_FE:
  
  			/*
*************** CopyGetData(CopyState cstate, void *data
*** 557,562 ****
--- 787,794 ----
  			}
  			bytesread = minread;
  			break;
+ 		case COPY_NEW_FE_COMPRESSED:
+ 			/* Fall through and handle as part of COPY_NEW_FE case */
  		case COPY_NEW_FE:
  			while (maxread > 0 && bytesread < minread && !cstate->fe_eof)
  			{
*************** CopyGetData(CopyState cstate, void *data
*** 609,621 ****
  							break;
  					}
  				}
! 				avail = cstate->fe_msgbuf->len - cstate->fe_msgbuf->cursor;
! 				if (avail > maxread)
! 					avail = maxread;
! 				pq_copymsgbytes(cstate->fe_msgbuf, databuf, avail);
! 				databuf = (void *) ((char *) databuf + avail);
! 				maxread -= avail;
! 				bytesread += avail;
  			}
  			break;
  	}
--- 841,918 ----
  							break;
  					}
  				}
! 				if (cstate->copy_dest == COPY_NEW_FE_COMPRESSED)
! 				{
! #ifdef HAVE_LIBZ
! 					int result;
! 
! 					/* Calculate how much data is *now* available */
! 					cstate->zstrm->avail_in = cstate->zstrm->avail_in
! 											  + cstate->fe_msgbuf->len
! 											  - cstate->fe_msgbuf->cursor;
! 
! 					/*
! 					 * The amount of input data should never exceed our buffer
! 					 * size or something has gone awry
! 					 */
! 					Assert(((char*) cstate->zstrm->next_in
! 								  + cstate->zstrm->avail_in
! 								  - cstate->zstrm_buf) < RAW_BUF_SIZE);
! 
! 					/* Tack on the next set of input data to our in buffer */
! 					pq_copymsgbytes(cstate->fe_msgbuf,
! 									(char*) cstate->zstrm->next_in,
! 									cstate->zstrm->avail_in);
! 
! 					/*
! 					 * Tell inflate() to store the uncompressed data into our
! 					 * output buffer
! 					 */
! 					cstate->zstrm->next_out = (unsigned char*) databuf;
! 					cstate->zstrm->avail_out = maxread;
! 
! 					/* decompress input data into our output buffer */
! 					result = inflate(cstate->zstrm, Z_NO_FLUSH);
! 					if (result != Z_OK && result != Z_STREAM_END)
! 						elog(ERROR, "Error processing compressed input.");
! 
! 					/*
! 					 * Move any left-over data up to the front of our buffer,
! 					 * to make room for the next batch; the amount of left-
! 					 * over data is likely to be pretty small.
! 					 */
! 					if (cstate->zstrm->avail_in > 0)
! 						memmove(cstate->zstrm_buf,
! 								cstate->zstrm->next_in,
! 								cstate->zstrm->avail_in);
! 
! 					/* Update our pointer to the next set of input data */
! 					cstate->zstrm->next_in = (unsigned char*) cstate->zstrm_buf;
! 
! 					/* Bytes returned from inflate() */
! 					avail = maxread - cstate->zstrm->avail_out;
! 
! 					/*
! 					 * Update our databuf pointer, in case we need to provide
! 					 * more data in this call.
! 					 */
! 					databuf = (void *) ((char *) databuf + avail);
! 					maxread -= avail;
! 					bytesread += avail;
! #else
! 					elog(ERROR,"Not compiled with zlib support.");
! #endif
! 				}
! 				else
! 				{
! 					avail = cstate->fe_msgbuf->len - cstate->fe_msgbuf->cursor;
! 					if (avail > maxread)
! 						avail = maxread;
! 					pq_copymsgbytes(cstate->fe_msgbuf, databuf, avail);
! 					databuf = (void *) ((char *) databuf + avail);
! 					maxread -= avail;
! 					bytesread += avail;
! 				}
  			}
  			break;
  	}
*************** ProcessCopyOptions(CopyState cstate,
*** 889,894 ****
--- 1186,1205 ----
  						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  						 errmsg("COPY format \"%s\" not recognized", fmt)));
  		}
+ 		else if (strcmp(defel->defname, "compressed") == 0)
+ 		{
+ #ifdef HAVE_LIBZ
+ 			if (cstate->compressed)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_SYNTAX_ERROR),
+ 						 errmsg("conflicting or redundant options")));
+ 			cstate->compressed = defGetBoolean(defel);
+ #else
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_SYNTAX_ERROR),
+ 					 errmsg("Not compiled with zlib support.")));
+ #endif
+ 		}
  		else if (strcmp(defel->defname, "oids") == 0)
  		{
  			if (cstate->oids)
*************** BeginCopy(bool is_from,
*** 1379,1385 ****
  	/* See Multibyte encoding comment above */
  	cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding);
  
! 	cstate->copy_dest = COPY_FILE;		/* default */
  
  	MemoryContextSwitchTo(oldcontext);
  
--- 1690,1699 ----
  	/* See Multibyte encoding comment above */
  	cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding);
  
! 	if (cstate->compressed)
! 		cstate->copy_dest = COPY_GZFILE;
! 	else
! 		cstate->copy_dest = COPY_FILE;		/* default */
  
  	MemoryContextSwitchTo(oldcontext);
  
*************** BeginCopy(bool is_from,
*** 1392,1402 ****
  static void
  EndCopy(CopyState cstate)
  {
! 	if (cstate->filename != NULL && FreeFile(cstate->copy_file))
! 		ereport(ERROR,
! 				(errcode_for_file_access(),
! 				 errmsg("could not close file \"%s\": %m",
! 						cstate->filename)));
  
  	MemoryContextDelete(cstate->copycontext);
  	pfree(cstate);
--- 1706,1734 ----
  static void
  EndCopy(CopyState cstate)
  {
! 	if (cstate->filename != NULL)
! 	{
! 		if (cstate->compressed)
! 		{
! #ifdef HAVE_LIBZ
! 			if (FreeFileGz(cstate->copy_gzfile))
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not close file \"%s\": %m",
! 								cstate->filename)));
! #else
! 			elog(ERROR,"Not compiled with zlib support.");
! #endif
! 		}
! 		else
! 		{
! 			if(FreeFile(cstate->copy_file))
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not close file \"%s\": %m",
! 								cstate->filename)));
! 		}
! 	}
  
  	MemoryContextDelete(cstate->copycontext);
  	pfree(cstate);
*************** BeginCopyTo(Relation rel,
*** 1467,1486 ****
  
  		cstate->filename = pstrdup(filename);
  		oumask = umask(S_IWGRP | S_IWOTH);
! 		cstate->copy_file = AllocateFile(cstate->filename, PG_BINARY_W);
! 		umask(oumask);
! 
! 		if (cstate->copy_file == NULL)
  			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not open file \"%s\" for writing: %m",
! 							cstate->filename)));
  
! 		fstat(fileno(cstate->copy_file), &st);
! 		if (S_ISDIR(st.st_mode))
  			ereport(ERROR,
  					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 					 errmsg("\"%s\" is a directory", cstate->filename)));
  	}
  
  	MemoryContextSwitchTo(oldcontext);
--- 1799,1846 ----
  
  		cstate->filename = pstrdup(filename);
  		oumask = umask(S_IWGRP | S_IWOTH);
! 		if (cstate->compressed)
! 		{
! #ifdef HAVE_LIBZ
! 			cstate->copy_gzfile = AllocateFileGz(cstate->filename, PG_BINARY_W);
! #else
  			ereport(ERROR,
! 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 					 errmsg("Not compiled with zlib support.")));
! #endif
! 		}
! 		else
! 			cstate->copy_file = AllocateFile(cstate->filename, PG_BINARY_W);
! 		umask(oumask);
  
! 		if (cstate->compressed)
! 		{
! #ifdef HAVE_LIBZ
! 			if (cstate->copy_gzfile == NULL)
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not open file \"%s\" for writing (compressed): %m",
! 								cstate->filename)));
! #else
  			ereport(ERROR,
  					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 					 errmsg("Not compiled with zlib support.")));
! #endif
! 		}
! 		else
! 		{
! 			if (cstate->copy_file == NULL)
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not open file \"%s\" for writing: %m",
! 								cstate->filename)));
! 
! 			fstat(fileno(cstate->copy_file), &st);
! 			if (S_ISDIR(st.st_mode))
! 				ereport(ERROR,
! 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 						 errmsg("\"%s\" is a directory", cstate->filename)));
! 		}
  	}
  
  	MemoryContextSwitchTo(oldcontext);
*************** BeginCopyFrom(Relation rel,
*** 2419,2437 ****
  		struct stat st;
  
  		cstate->filename = pstrdup(filename);
! 		cstate->copy_file = AllocateFile(cstate->filename, PG_BINARY_R);
! 
! 		if (cstate->copy_file == NULL)
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not open file \"%s\" for reading: %m",
! 							cstate->filename)));
  
! 		fstat(fileno(cstate->copy_file), &st);
! 		if (S_ISDIR(st.st_mode))
  			ereport(ERROR,
  					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 					 errmsg("\"%s\" is a directory", cstate->filename)));
  	}
  
  	if (!cstate->binary)
--- 2779,2816 ----
  		struct stat st;
  
  		cstate->filename = pstrdup(filename);
! 		if (cstate->compressed)
! 		{
! #ifdef HAVE_LIBZ
! 			cstate->copy_gzfile = AllocateFileGz(cstate->filename, PG_BINARY_R);
  
! 			if (cstate->copy_gzfile == NULL)
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not open file \"%s\" for reading: %m",
! 								cstate->filename)));
! #else
  			ereport(ERROR,
  					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 					 errmsg("Not compiled with zlib support.")));
! #endif
! 		}
! 		else
! 		{
! 			cstate->copy_file = AllocateFile(cstate->filename, PG_BINARY_R);
! 
! 			if (cstate->copy_file == NULL)
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not open file \"%s\" for reading: %m",
! 								cstate->filename)));
! 
! 			fstat(fileno(cstate->copy_file), &st);
! 			if (S_ISDIR(st.st_mode))
! 				ereport(ERROR,
! 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
! 						 errmsg("\"%s\" is a directory", cstate->filename)));
! 		}
  	}
  
  	if (!cstate->binary)
*************** CopyReadLine(CopyState cstate)
*** 2823,2829 ****
  		 * after \. up to the protocol end of copy data.  (XXX maybe better
  		 * not to treat \. as special?)
  		 */
! 		if (cstate->copy_dest == COPY_NEW_FE)
  		{
  			do
  			{
--- 3202,3209 ----
  		 * after \. up to the protocol end of copy data.  (XXX maybe better
  		 * not to treat \. as special?)
  		 */
! 		if (cstate->copy_dest == COPY_NEW_FE ||
! 			cstate->copy_dest == COPY_NEW_FE_COMPRESSED)
  		{
  			do
  			{
colordiff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
new file mode 100644
index 4568876..23ebaa7
*** a/src/backend/parser/gram.y
--- b/src/backend/parser/gram.y
*************** static void processCASbits(int cas_bits,
*** 521,528 ****
  	CACHE CALLED CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
  	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
  	CLUSTER COALESCE COLLATE COLLATION COLUMN COMMENT COMMENTS COMMIT
! 	COMMITTED CONCURRENTLY CONFIGURATION CONNECTION CONSTRAINT CONSTRAINTS
! 	CONTENT_P CONTINUE_P CONVERSION_P COPY COST CREATE
  	CROSS CSV CURRENT_P
  	CURRENT_CATALOG CURRENT_DATE CURRENT_ROLE CURRENT_SCHEMA
  	CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE
--- 521,528 ----
  	CACHE CALLED CASCADE CASCADED CASE CAST CATALOG_P CHAIN CHAR_P
  	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
  	CLUSTER COALESCE COLLATE COLLATION COLUMN COMMENT COMMENTS COMMIT
! 	COMMITTED COMPRESSED CONCURRENTLY CONFIGURATION CONNECTION CONSTRAINT
! 	CONSTRAINTS CONTENT_P CONTINUE_P CONVERSION_P COPY COST CREATE
  	CROSS CSV CURRENT_P
  	CURRENT_CATALOG CURRENT_DATE CURRENT_ROLE CURRENT_SCHEMA
  	CURRENT_TIME CURRENT_TIMESTAMP CURRENT_USER CURSOR CYCLE
*************** copy_opt_item:
*** 2403,2408 ****
--- 2403,2412 ----
  				{
  					$$ = makeDefElem("header", (Node *)makeInteger(TRUE));
  				}
+ 			| COMPRESSED
+ 				{
+ 					$$ = makeDefElem("compressed", (Node *)makeInteger(TRUE));
+ 				}
  			| QUOTE opt_as Sconst
  				{
  					$$ = makeDefElem("quote", (Node *)makeString($3));
*************** unreserved_keyword:
*** 12471,12476 ****
--- 12475,12481 ----
  			| COMMENTS
  			| COMMIT
  			| COMMITTED
+ 			| COMPRESSED
  			| CONFIGURATION
  			| CONNECTION
  			| CONSTRAINTS
colordiff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
new file mode 100644
index 07ee51c..4547285
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
***************
*** 66,71 ****
--- 66,74 ----
  #ifdef HAVE_SYS_RESOURCE_H
  #include <sys/resource.h>		/* for getrlimit */
  #endif
+ #ifdef HAVE_LIBZ
+ #include <zlib.h>
+ #endif
  
  #include "miscadmin.h"
  #include "access/xact.h"
*************** static uint64 temporary_files_size = 0;
*** 201,206 ****
--- 204,210 ----
  typedef enum
  {
  	AllocateDescFile,
+ 	AllocateDescGzFile,
  	AllocateDescDir,
  	AllocateDescRawFD
  } AllocateDescKind;
*************** typedef struct
*** 211,216 ****
--- 215,223 ----
  	union
  	{
  		FILE	   *file;
+ #ifdef HAVE_LIBZ
+ 		gzFile		gzfile;
+ #endif
  		DIR		   *dir;
  		int			fd;
  	}			desc;
*************** TryAgain:
*** 1543,1548 ****
--- 1550,1607 ----
  	return NULL;
  }
  
+ /*
+  * Like AllocateFile, but uses zlib to open a compressed file.
+  */
+ #ifdef HAVE_LIBZ
+ gzFile
+ AllocateFileGz(const char *name, const char *mode)
+ {
+ 	gzFile	   file;
+ 
+ 	DO_DB(elog(LOG, "AllocateFileGz: Allocated %d (%s)",
+ 			   numAllocatedDescs, name));
+ 
+ 	/*
+ 	 * The test against MAX_ALLOCATED_DESCS prevents us from overflowing
+ 	 * allocatedFiles[]; the test against max_safe_fds prevents AllocateFile
+ 	 * from hogging every one of the available FDs, which'd lead to infinite
+ 	 * looping.
+ 	 */
+ 	if (numAllocatedDescs >= MAX_ALLOCATED_DESCS ||
+ 		numAllocatedDescs >= max_safe_fds - 1)
+ 		elog(ERROR, "exceeded MAX_ALLOCATED_DESCS while trying to open file \"%s\"",
+ 			 name);
+ 
+ TryAgain:
+ 	if ((file = gzopen(name, mode)) != NULL)
+ 	{
+ 		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
+ 
+ 		desc->kind = AllocateDescGzFile;
+ 		desc->desc.gzfile = file;
+ 		desc->create_subid = GetCurrentSubTransactionId();
+ 		numAllocatedDescs++;
+ 		return desc->desc.gzfile;
+ 	}
+ 
+ 	if (errno == EMFILE || errno == ENFILE)
+ 	{
+ 		int			save_errno = errno;
+ 
+ 		ereport(LOG,
+ 				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+ 				 errmsg("out of file descriptors: %m; release and retry")));
+ 		errno = 0;
+ 		if (ReleaseLruFile())
+ 			goto TryAgain;
+ 		errno = save_errno;
+ 	}
+ 
+ 	return NULL;
+ }
+ #endif
+ 
  
  /*
   * Like AllocateFile, but returns an unbuffered fd like open(2)
*************** FreeDesc(AllocateDesc *desc)
*** 1600,1605 ****
--- 1659,1672 ----
  		case AllocateDescFile:
  			result = fclose(desc->desc.file);
  			break;
+ 		case AllocateDescGzFile:
+ #ifdef HAVE_LIBZ
+ 			result = gzclose(desc->desc.gzfile);
+ #else
+ 			elog(ERROR,"AllocateDesc not compiled with zlib support");
+ 			result = 0;			/* keep compiler quiet */
+ #endif
+ 			break;
  		case AllocateDescDir:
  			result = closedir(desc->desc.dir);
  			break;
*************** FreeFile(FILE *file)
*** 1648,1653 ****
--- 1715,1750 ----
  }
  
  /*
+  * Close a file returned by AllocateFileGz.
+  *
+  * Note we do not check gzclose's return value --- it is up to the caller
+  * to handle close errors.
+  */
+ #ifdef HAVE_LIBZ
+ int
+ FreeFileGz(gzFile file)
+ {
+ 	int			i;
+ 
+ 	DO_DB(elog(LOG, "FreeFileGz: Allocated %d", numAllocatedDescs));
+ 
+ 	/* Remove file from list of allocated files, if it's present */
+ 	for (i = numAllocatedDescs; --i >= 0;)
+ 	{
+ 		AllocateDesc *desc = &allocatedDescs[i];
+ 
+ 		if (desc->kind == AllocateDescGzFile && desc->desc.gzfile == file)
+ 			return FreeDesc(desc);
+ 	}
+ 
+ 	/* Only get here if someone passes us a file not in allocatedDescs */
+ 	elog(WARNING, "file passed to FreeFileGz was not obtained from AllocateFile");
+ 
+ 	return gzclose(file);
+ }
+ #endif
+ 
+ /*
   * Close a file returned by OpenTransientFile.
   *
   * Note we do not check close's return value --- it is up to the caller
colordiff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
new file mode 100644
index af60dac..2aecd74
*** a/src/include/parser/kwlist.h
--- b/src/include/parser/kwlist.h
*************** PG_KEYWORD("comment", COMMENT, UNRESERVE
*** 85,90 ****
--- 85,91 ----
  PG_KEYWORD("comments", COMMENTS, UNRESERVED_KEYWORD)
  PG_KEYWORD("commit", COMMIT, UNRESERVED_KEYWORD)
  PG_KEYWORD("committed", COMMITTED, UNRESERVED_KEYWORD)
+ PG_KEYWORD("compressed", COMPRESSED, UNRESERVED_KEYWORD)
  PG_KEYWORD("concurrently", CONCURRENTLY, TYPE_FUNC_NAME_KEYWORD)
  PG_KEYWORD("configuration", CONFIGURATION, UNRESERVED_KEYWORD)
  PG_KEYWORD("connection", CONNECTION, UNRESERVED_KEYWORD)
colordiff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
new file mode 100644
index 940d9d4..6cd1ae4
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
***************
*** 41,46 ****
--- 41,49 ----
  
  #include <dirent.h>
  
+ #ifdef HAVE_LIBZ
+ #include <zlib.h>
+ #endif
  
  /*
   * FileSeek uses the standard UNIX lseek(2) flags.
*************** extern char *FilePathName(File file);
*** 80,85 ****
--- 83,94 ----
  extern FILE *AllocateFile(const char *name, const char *mode);
  extern int	FreeFile(FILE *file);
  
+ /* Operations for compressed files */
+ #ifdef HAVE_LIBZ
+ extern gzFile AllocateFileGz(const char *name, const char *mode);
+ extern int	FreeFileGz(gzFile file);
+ #endif
+ 
  /* Operations to allow use of the <dirent.h> library routines */
  extern DIR *AllocateDir(const char *dirname);
  extern struct dirent *ReadDir(DIR *dir, const char *dirname);
colordiff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
new file mode 100644
index ab3f508..5f83d98
*** a/src/test/regress/input/copy.source
--- b/src/test/regress/input/copy.source
*************** copy copytest2 from '@abs_builddir@/resu
*** 91,96 ****
--- 91,116 ----
  
  select * from copytest except select * from copytest2;
  
+ truncate copytest2;
+ 
+ --- same test but with compression
+ 
+ copy copytest to '@abs_builddir@/results/copytest.csv.gz' csv compressed;
+ 
+ copy copytest2 from '@abs_builddir@/results/copytest.csv.gz' csv compressed;
+ 
+ select * from copytest except select * from copytest2;
+ 
+ truncate copytest2;
+ 
+ --- same test but with compression through psql
+ 
+ \copy copytest to '@abs_builddir@/results/copytest.csv.gz' csv compressed;
+ 
+ \copy copytest2 from '@abs_builddir@/results/copytest.csv.gz' csv compressed;
+ 
+ select * from copytest except select * from copytest2;
+ 
  
  -- test header line feature
  
colordiff --git a/src/test/regress/output/copy.source b/src/test/regress/output/copy.source
new file mode 100644
index febca71..a6730fe
*** a/src/test/regress/output/copy.source
--- b/src/test/regress/output/copy.source
*************** select * from copytest except select * f
*** 61,66 ****
--- 61,84 ----
  -------+------+--------
  (0 rows)
  
+ truncate copytest2;
+ --- same test but with compression
+ copy copytest to '@abs_builddir@/results/copytest.csv.gz' csv compressed;
+ copy copytest2 from '@abs_builddir@/results/copytest.csv.gz' csv compressed;
+ select * from copytest except select * from copytest2;
+  style | test | filler 
+ -------+------+--------
+ (0 rows)
+ 
+ truncate copytest2;
+ --- same test but with compression through psql
+ \copy copytest to '@abs_builddir@/results/copytest.csv.gz' csv compressed;
+ \copy copytest2 from '@abs_builddir@/results/copytest.csv.gz' csv compressed;
+ select * from copytest except select * from copytest2;
+  style | test | filler 
+ -------+------+--------
+ (0 rows)
+ 
  -- test header line feature
  create temp table copytest3 (
  	c1 int,

Tom Lane

tgl@sss.pgh.pa.us

almost 13 years ago

In reply to: Stephen Frost (#1)

Re: [PATCH] COPY .. COMPRESSED

Stephen Frost <sfrost@snowman.net> writes:

Attached is a patch to add a 'COMPRESSED' option to COPY which will
cause COPY to expect a gzip'd file on input and which will output a
gzip'd file on output. Included is support for backend COPY, psql's
\copy, regression tests for both, and documentation.

I don't think it's a very good idea to invent such a specialized option,
nor to tie it to gzip, which is widely considered to be old news.

There was discussion (and, I think, a patch in the queue) for allowing
COPY to pipe into or out of an arbitrary shell pipe. Why would that not
be enough to cover this use-case? That is, instead of a hard-wired
capability, people would do something like COPY TO '| gzip >file.gz'.
Or they could use bzip2 or whatever struck their fancy.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Tom Lane (#2)

Re: [PATCH] COPY .. COMPRESSED

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

Attached is a patch to add a 'COMPRESSED' option to COPY which will
cause COPY to expect a gzip'd file on input and which will output a
gzip'd file on output. Included is support for backend COPY, psql's
\copy, regression tests for both, and documentation.

I don't think it's a very good idea to invent such a specialized option,
nor to tie it to gzip, which is widely considered to be old news.

We're already using gzip/zlib for pg_dump/pg_restore, so it was simple
and straight-forward to add and would allow utilizing this option while
keeping the custom dump format the same. It also happens to match what
I need. While gzip might be 'old hat' it's still extremely popular.
I'd be happy to add support for bzip2 or something else that people are
interested in, and support compression options for zlib if necessary
too. This was intended to get the ball rolling on something as the last
discussion that I had seen while hunting through the archives was from
2006, obviously I missed the boat on the last set of patches.

There was discussion (and, I think, a patch in the queue) for allowing
COPY to pipe into or out of an arbitrary shell pipe. Why would that not
be enough to cover this use-case? That is, instead of a hard-wired
capability, people would do something like COPY TO '| gzip >file.gz'.
Or they could use bzip2 or whatever struck their fancy.

Sounds like a nice idea, but I can't imagine it'd be available to anyone
except for superusers, and looking at that patch, that's exactly the
restriction which is in place for it. In addition, that patch's support
for "\copy" implements everything locally, making it little different
from "zcat mycsv.csv.gz | psql". The patch that I proposed actually
sent the compressed stream across the wire, reducing bandwidth
utilization.

All that said, I've nothing against having the pipe option for the
backend COPY command; a bit annoyed with myself for somehow missing that
patch. I don't like what it's doing with psql's \copy command and would
rather we figure out a way to support PROGRAM .. TO STDOUT, but that
still would require superuser privileges. I don't see any easy way to
support compressed data streaming to/from the server for COPY w/o
defining what methods are available or coming up with some ACL system
for what programs can be called by the backend.

Thanks,

Stephen

Simon Riggs

simon@2ndQuadrant.com

almost 13 years ago

In reply to: Stephen Frost (#3)

Re: [PATCH] COPY .. COMPRESSED

On 14 January 2013 13:43, Stephen Frost <sfrost@snowman.net> wrote:

Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

Attached is a patch to add a 'COMPRESSED' option to COPY which will
cause COPY to expect a gzip'd file on input and which will output a
gzip'd file on output. Included is support for backend COPY, psql's
\copy, regression tests for both, and documentation.

I don't think it's a very good idea to invent such a specialized option,
nor to tie it to gzip, which is widely considered to be old news.

We're already using gzip/zlib for pg_dump/pg_restore, so it was simple
and straight-forward to add and would allow utilizing this option while
keeping the custom dump format the same.

Both thoughts are useful, I think.

There is a new option being added to pre/post process data, so it
seems like the best way to add new features - in general.

Specifically, we do support compressed output so a simple patch to
allow re-loading of the compressed data we generate does seem sensible
and reasonable.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Simon Riggs (#4)

Re: [PATCH] COPY .. COMPRESSED

* Simon Riggs (simon@2ndQuadrant.com) wrote:

There is a new option being added to pre/post process data, so it
seems like the best way to add new features - in general.

That structure appears to have no option for passing compressed data to
or from a client connection. Instead, it actually overloads the typical
meaning for options sent to \copy (which, imv, is "run COPY on the server
with these options and have the results stored locally") to mean
something different (run part of the COPY command on the server and part
of it locally).

Specifically, we do support compressed output so a simple patch to
allow re-loading of the compressed data we generate does seem sensible
and reasonable.

Right, we're already using gzip for pg_dump/pg_restore. This just gives
an option to move that compression over to the server side. Also, I'd
be happy to add support for other compression options.

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

Thanks,

Stephen

Claudio Freire

klaussfreire@gmail.com

almost 13 years ago

In reply to: Stephen Frost (#5)

Re: [PATCH] COPY .. COMPRESSED

On Mon, Jan 14, 2013 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

Unless it's one of a set of superuser-authorized compression tools.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

almost 13 years ago

In reply to: Stephen Frost (#5)

Re: [PATCH] COPY .. COMPRESSED

Stephen Frost <sfrost@snowman.net> writes:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

The design that was being kicked around allowed pipes to be used on the
client side too, ie \copy foo to '| gzip ...'. That form would not
require any special privileges, and might be thought preferable for
another reason too: it offloads the work from the server.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Claudio Freire (#6)

Re: [PATCH] COPY .. COMPRESSED

* Claudio Freire (klaussfreire@gmail.com) wrote:

On Mon, Jan 14, 2013 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

Unless it's one of a set of superuser-authorized compression tools.

Which would require a new ACL system for handling that, as I mentioned..
That certainly isn't what the existing patch does.

What would that look like? How would it operate? How would a user
invoke it or even know what options are available? Would we provide
anything by default? It's great to consider that possibility but
there's a lot of details involved.

I'm a bit nervous about having a generalized system which can run
anything on the system when called by a superuser but when called by a
regular user we're on the hook to verify the request against a
superuser-provided list and to then make sure nothing goes wrong.

Thanks,

Stephen

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Tom Lane (#7)

Re: [PATCH] COPY .. COMPRESSED

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

The design that was being kicked around allowed pipes to be used on the
client side too, ie \copy foo to '| gzip ...'. That form would not
require any special privileges, and might be thought preferable for
another reason too: it offloads the work from the server.

It's a different use-case which, imv, is really already trivially
covered:

psql -c 'COPY foo TO STDOUT;' | gzip > myfile.gz

While there is no option currently for having the server do the
compression before sending the data over the wire.

Thanks,

Stephen

#10

Peter Eisentraut

peter_e@gmx.net

almost 13 years ago

In reply to: Stephen Frost (#9)

Re: [PATCH] COPY .. COMPRESSED

On 1/14/13 11:28 AM, Stephen Frost wrote:

While there is no option currently for having the server do the
compression before sending the data over the wire.

OpenSSL?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Peter Eisentraut (#10)

Re: [PATCH] COPY .. COMPRESSED

* Peter Eisentraut (peter_e@gmx.net) wrote:

On 1/14/13 11:28 AM, Stephen Frost wrote:

While there is no option currently for having the server do the
compression before sending the data over the wire.

OpenSSL?

To be honest, I expected that to come up earlier in this discussion.
It'd be redundant to use OpenSSL for compression and then ALSO do
compression on the client side to save into a custom format dump.
There's also plenty of reasons to not want to deal with OpenSSL just to
have compression support. Now, protocol-level on-the-wire compression
is another option, but there's quite a few drawbacks to that and quite a
bit of work involved. Having support for COPY-based compression could
be an answer for many cases where on-the-wire compression is desirable.

Being able to use pipe's for the backend-side of COPY is a good
solution, for that. I'm looking forward to having it and plan to review
the patch. That said, I'd like to find an answer to some of these other
use cases, if possible.

Thanks,

Stephen

#12

Greg Stark

stark@mit.edu

almost 13 years ago

In reply to: Stephen Frost (#11)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 2:33 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Peter Eisentraut (peter_e@gmx.net) wrote:

On 1/14/13 11:28 AM, Stephen Frost wrote:

While there is no option currently for having the server do the
compression before sending the data over the wire.

OpenSSL?

To be honest, I expected that to come up earlier in this discussion.
It'd be redundant to use OpenSSL for compression and then ALSO do
compression on the client side to save into a custom format dump.

For what it's worth there was a security announcement not long ago
that made OpenSSL disable compression in streams by default. I'm not
sure if it's relevant to Postgres or not.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

Claudio Freire

klaussfreire@gmail.com

almost 13 years ago

In reply to: Stephen Frost (#11)

Re: [PATCH] COPY .. COMPRESSED

On Mon, Jan 14, 2013 at 11:33 PM, Stephen Frost <sfrost@snowman.net> wrote:

Now, protocol-level on-the-wire compression
is another option, but there's quite a few drawbacks to that and quite a
bit of work involved. Having support for COPY-based compression could
be an answer for many cases where on-the-wire compression is desirable.

Like?

Postgres' packetized protocol lends itself quite well for this kind of
thing. It could even be done on a packet-by-packet basis. The only
drawback I see, is that it pretty much rules out piping through
arbitrary commands (a protocol needs to be very clearly defined).

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

Peter Eisentraut

peter_e@gmx.net

almost 13 years ago

In reply to: Stephen Frost (#1)

Re: [PATCH] COPY .. COMPRESSED

On 1/13/13 9:16 PM, Stephen Frost wrote:

On top of this I plan to submit a trivial patch to add support for
this to file_fdw, allowing creation of FDW tables which operate
directly on compressed files (including CSVs, which is what I need
this patch for).

I've also begun working on a patch to allow this capability to be used
through pg_dump/pg_restore which would reduce the bandwidth used
between the client and the server for backups and restores. Ideally,
one would also be able to use custom format dumps, with compression,
even if the client-side pg_dump/pg_restore wasn't compiled with zlib
support.

I think a problem is that this code is now serving such different uses.

Operating on compressed files transparently in file_fdw is obviously
useful, but why only gzip? The gold standard is GNU tar, which can
operate on any compressed file in a variety of compression formats
without even having to specify an option.

Writing compressed COPY output files on the backend has limited uses, at
least none have been clearly explained, and the popen patch might
address those better.

Writing compressed COPY output on the frontend can already be done
differently.

Compression on the wire is a different debate and it probably shouldn't
be snuck in through this backdoor.

Putting compressed COPY output from the backend straight into a
compressed pg_dump file sounds interested, but this patch doesn't do
that yet, and I think there will be more issues to solve there.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Peter Eisentraut (#14)

Re: [PATCH] COPY .. COMPRESSED

* Peter Eisentraut (peter_e@gmx.net) wrote:

Operating on compressed files transparently in file_fdw is obviously
useful, but why only gzip?

This isn't really an argument, imv. It's only gzip *right this moment*
because that's all that I implemented. I've already offered to add
bzip2 or whatever else people would like.

The gold standard is GNU tar, which can
operate on any compressed file in a variety of compression formats
without even having to specify an option.

Yes, that's what I was hoping to get to, eventually.

Writing compressed COPY output files on the backend has limited uses, at
least none have been clearly explained, and the popen patch might
address those better.

I do see value in the popen patch for server-side operations.

Writing compressed COPY output on the frontend can already be done
differently.

Certainly. On a similar vein, I'm not convinced that the popen patch
for psql's \copy is really a great addition.

Compression on the wire is a different debate and it probably shouldn't
be snuck in through this backdoor.

Considering the COPY-COMPRESSED-to-FE piece is the vast majority of the
patch, I hope you understand that it certainly wasn't my intent to try
and 'sneak it in'. Support for reading and writing compressed files
with COPY directly from the FE was one of my goals from the start on
this.

Putting compressed COPY output from the backend straight into a
compressed pg_dump file sounds interested, but this patch doesn't do
that yet, and I think there will be more issues to solve there.

Let me just vent my dislike for the pg_dump code. :) Probably half the
time spent on this overall patch was fighting with that to make it work
and it's actually about 90% of the way there, imv. Getting the
compressed data into pg_dump is working in my local branch, going to a
directory-format dump output, but the custom format is causing me some
difficulties which I believe are related to the blocking that's used and
that the blocks coming off the wire were 'full-size', if you will,
instead of being chunked down to 4KB by the client-side compression.
I've simply not had time to debug it and fix it and wanted to get the
general patch out for discussion (which I'm glad that I did, given that
there's other work going on that's related).

Thanks,

Stephen

#16

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Claudio Freire (#13)

Re: [PATCH] COPY .. COMPRESSED

* Claudio Freire (klaussfreire@gmail.com) wrote:

Postgres' packetized protocol lends itself quite well for this kind of
thing. It could even be done on a packet-by-packet basis. The only
drawback I see, is that it pretty much rules out piping through
arbitrary commands (a protocol needs to be very clearly defined).

Actually, wouldn't PG's packet-based protocol be exactly the wrong way
to do any kind of good on-the-wire compression? You certainly wouldn't
want to waste time compressing small packets, such as a single INSERT
command or similar, as you'll always have to send a packet out anyway.
Even doing it at the protocol level with something ssl-like, where
you wrap the entire connection, wouldn't help if the client has a
process along the lines of:

send INSERT command
wait for response
send INSERT command
wait for response
..
..

Since you'd still have to flush after each small bit of data.

Where it does work well is when you move into a bulk-data mode (ala
COPY) and can compress relatively large amounts of data into a smaller
number of full-size packets to be sent.

Thanks,

Stephen

#17

Garick Hamlin

ghamlin@isc.upenn.edu

almost 13 years ago

In reply to: Greg Stark (#12)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 01:35:57PM +0000, Greg Stark wrote:

On Tue, Jan 15, 2013 at 2:33 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Peter Eisentraut (peter_e@gmx.net) wrote:

On 1/14/13 11:28 AM, Stephen Frost wrote:

While there is no option currently for having the server do the
compression before sending the data over the wire.

OpenSSL?

To be honest, I expected that to come up earlier in this discussion.
It'd be redundant to use OpenSSL for compression and then ALSO do
compression on the client side to save into a custom format dump.

For what it's worth there was a security announcement not long ago
that made OpenSSL disable compression in streams by default. I'm not
sure if it's relevant to Postgres or not.

It's an interesting question. It might be. I thought at first it
wouldn't be relevant, but on reflection it is.

This attack is called the CRIME attack. This class of attacks stem from reuse
of a dictionary across some sort of confidentiality boundary. The attacker
looks at the traffic and notices 'how big' the network response is. This
tells the attacker the compressor has seen already seem the text.

So imagine, I have a website and I keep session cookies, user names, and
password crypts in a database. Imagine the session key is a long
hexidecimal number. As an attacker, I could send in the username field
of the login form guesses of the prefix of somebody else's session key
I peek at the encrypted traffic from script to the database. As the
prefix match length increases the database reply gets shorter.
Essentially, its a side channel attack that ends up reducing guessing to
a radix search by prefix of all valid session keys. In this attack, I
don't have access to the database except through the web form, but I
can see the encrypted database traffic.

It is not a huge vulnerability, but yeah in some use cases if postgresql
used compression it might provide a difficult, but possible route.

Garick

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

Claudio Freire

klaussfreire@gmail.com

almost 13 years ago

In reply to: Stephen Frost (#16)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 1:08 PM, Stephen Frost <sfrost@snowman.net> wrote:

Where it does work well is when you move into a bulk-data mode (ala
COPY) and can compress relatively large amounts of data into a smaller
number of full-size packets to be sent.

Well... exactly. COPY is one case, big result sets is another.
And packet headers can include whether each packet is compressed or
not, which is quite transparent and easy to handle. There could even
be a negotiation phase and make it backwards-compatible.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Claudio Freire (#18)

Re: [PATCH] COPY .. COMPRESSED

* Claudio Freire (klaussfreire@gmail.com) wrote:

On Tue, Jan 15, 2013 at 1:08 PM, Stephen Frost <sfrost@snowman.net> wrote:

Where it does work well is when you move into a bulk-data mode (ala
COPY) and can compress relatively large amounts of data into a smaller
number of full-size packets to be sent.

Well... exactly. COPY is one case, big result sets is another.
And packet headers can include whether each packet is compressed or
not, which is quite transparent and easy to handle. There could even
be a negotiation phase and make it backwards-compatible.

COPY and a large result set are the only cases, and a large result set
could easily be put inside of a COPY statement. I agree that large
result sets outside of COPY could benefit from compression and perhaps
we can formulate a way to support that also.

Thanks,

Stephen

#20

Bruce Momjian

bruce@momjian.us

almost 13 years ago

In reply to: Stephen Frost (#15)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 10:55:04AM -0500, Stephen Frost wrote:

* Peter Eisentraut (peter_e@gmx.net) wrote:

Operating on compressed files transparently in file_fdw is obviously
useful, but why only gzip?

This isn't really an argument, imv. It's only gzip *right this moment*
because that's all that I implemented. I've already offered to add
bzip2 or whatever else people would like.

And this leads to support-my-compression-binary-of-the-day mess. Why
not just allow them to do '|compression-binary'?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Bruce Momjian (#20)

Re: [PATCH] COPY .. COMPRESSED

* Bruce Momjian (bruce@momjian.us) wrote:

On Tue, Jan 15, 2013 at 10:55:04AM -0500, Stephen Frost wrote:

* Peter Eisentraut (peter_e@gmx.net) wrote:

Operating on compressed files transparently in file_fdw is obviously
useful, but why only gzip?

This isn't really an argument, imv. It's only gzip *right this moment*
because that's all that I implemented. I've already offered to add
bzip2 or whatever else people would like.

And this leads to support-my-compression-binary-of-the-day mess. Why
not just allow them to do '|compression-binary'?

You're right, to clarify, for *file_fdw*, which is a backend-only
operation, the popen patch is great (thought I made that clear before).

The popen patch doesn't support the '|compression-binary' option through
the FE protocol. Even if it did, it would only be available for
superusers as we can't allow regular users to run arbitrary commands on
the server-side.

Thanks,

Stephen

#22

Christopher Browne

cbbrowne@gmail.com

almost 13 years ago

In reply to: Stephen Frost (#21)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 2:53 PM, Stephen Frost <sfrost@snowman.net> wrote:

The popen patch doesn't support the '|compression-binary' option through
the FE protocol. Even if it did, it would only be available for
superusers as we can't allow regular users to run arbitrary commands on
the server-side.

That points towards a fix that involves having a set of non-arbitrary commands
that we allow plain users to use.

Hmm. There's an interesting thought...

How about having a "pg_filters" table in pg_catalog which allows capturing
labels and names of known-to-be-safe binary filters:

insert into pg_filters (label, location)
values
('zcat', '/usr/bin/zcat'),
('bzip2', '/usr/bin/bzip2'),
('bunzip2', '/usr/bin/bunzip2');

And then having some capability to grant permissions to roles to use
these filters.

That's not a "version 1" capability... Suppose we have, in 9.3, that there are
direct references to "|/usr/bin/zcat" (and such), and then hope, in
9.4, to tease
this out to be a non-superuser-capable facility via the above pg_filters?

These filters should be useful for FDWs as well as for COPY.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23

Bruce Momjian

bruce@momjian.us

almost 13 years ago

In reply to: Christopher Browne (#22)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 03:37:07PM -0500, Christopher Browne wrote:

On Tue, Jan 15, 2013 at 2:53 PM, Stephen Frost <sfrost@snowman.net> wrote:

The popen patch doesn't support the '|compression-binary' option through
the FE protocol. Even if it did, it would only be available for
superusers as we can't allow regular users to run arbitrary commands on
the server-side.

That points towards a fix that involves having a set of non-arbitrary commands
that we allow plain users to use.

Hmm. There's an interesting thought...

How about having a "pg_filters" table in pg_catalog which allows capturing
labels and names of known-to-be-safe binary filters:

insert into pg_filters (label, location)
values
('zcat', '/usr/bin/zcat'),
('bzip2', '/usr/bin/bzip2'),
('bunzip2', '/usr/bin/bunzip2');

And then having some capability to grant permissions to roles to use
these filters.

That's not a "version 1" capability... Suppose we have, in 9.3, that there are
direct references to "|/usr/bin/zcat" (and such), and then hope, in
9.4, to tease
this out to be a non-superuser-capable facility via the above pg_filters?

These filters should be useful for FDWs as well as for COPY.

Well, COPY is super-user only, so it seems only useful for FDW, no? We
already have lots of user-configuration FDW commands, so I can see
adding this one too.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#24

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Christopher Browne (#22)

Re: [PATCH] COPY .. COMPRESSED

* Christopher Browne (cbbrowne@gmail.com) wrote:

How about having a "pg_filters" table in pg_catalog which allows capturing
labels and names of known-to-be-safe binary filters:

I was considering that (though I was thinking they'd be
"transformations" rather than filters; filter implies that you're
removing something, imv), but as I mentioned upthread, there are dangers
in that direction and having a default set of options strikes me as a
lot more challenging to provide.

insert into pg_filters (label, location)
values
('zcat', '/usr/bin/zcat'),
('bzip2', '/usr/bin/bzip2'),
('bunzip2', '/usr/bin/bunzip2');

We'd need to include which direction is supported also, I think.

And then having some capability to grant permissions to roles to use
these filters.

Yes, an additional ACL system, as I mentioned upthread, would be
required for this.

That's not a "version 1" capability... Suppose we have, in 9.3, that there are
direct references to "|/usr/bin/zcat" (and such), and then hope, in
9.4, to tease
this out to be a non-superuser-capable facility via the above pg_filters?

It would be good to flush out what the syntax, etc, would look like for
this, if we're going to support it, before we go down a road that limits
us in what we can do. For example, if we implement the existing popen
call, and then later want to allow non-superusers to use certain
filters, how would the non-superuser specify the filter? I really don't
think we want to be taking the shell-like command provided by a
non-superuser and then try to match that against a list of commands in a
table..

These filters should be useful for FDWs as well as for COPY.

I'm not sure I see how any FDW beyond file_fdw would really benefit from
this..? I don't think a MySQL FDW or Reddis FDW would gain anything...

Thanks,

Stephen

#25

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Bruce Momjian (#23)

Re: [PATCH] COPY .. COMPRESSED

* Bruce Momjian (bruce@momjian.us) wrote:

Well, COPY is super-user only, so it seems only useful for FDW, no? We
already have lots of user-configuration FDW commands, so I can see
adding this one too.

COPY is most certainly not superuser-only.. COPY w/ popen, if that
popen can call anything, would certainly have to be superuser-only.

COPY TO STDOUT / FROM STDIN is available to and used a huge amount by
non-superusers. Would be great if we could allow that to work with
compressed data also, imv.

Thanks,

Stephen

#26

Tom Lane

tgl@sss.pgh.pa.us

almost 13 years ago

In reply to: Stephen Frost (#21)

Re: [PATCH] COPY .. COMPRESSED

Stephen Frost <sfrost@snowman.net> writes:

* Bruce Momjian (bruce@momjian.us) wrote:

And this leads to support-my-compression-binary-of-the-day mess. Why
not just allow them to do '|compression-binary'?

The popen patch doesn't support the '|compression-binary' option through
the FE protocol. Even if it did, it would only be available for
superusers as we can't allow regular users to run arbitrary commands on
the server-side.

I find the argument that this supports compression-over-the-wire to be
quite weak, because COPY is only one form of bulk data transfer, and
one that a lot of applications don't ever use. If we think we need to
support transmission compression for ourselves, it ought to be
integrated at the wire protocol level, not in COPY.

Just to not look like I'm rejecting stuff without proposing
alternatives, here is an idea about a backwards-compatible design for
doing that: we could add an option that can be set in the connection
request packet. Say, "transmission_compression = gzip".

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#27

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Tom Lane (#26)

Re: [PATCH] COPY .. COMPRESSED

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

I find the argument that this supports compression-over-the-wire to be
quite weak, because COPY is only one form of bulk data transfer, and
one that a lot of applications don't ever use. If we think we need to
support transmission compression for ourselves, it ought to be
integrated at the wire protocol level, not in COPY.

As far as I can tell, COPY is the option which is strongly recommended
for bulk-data-operations. I can see the use-case for wanting SELECT
results to be compressed, but it strikes me as the 10% case for PG users
rather than the 90% one. Ditto for COPY vs. large INSERT .. VALUES.

Compressing every small packet seems like it'd be overkill and might
surprise people by actually reducing performance in the case of lots of
small requests.

It also strikes me as a bit silly to do something like:

zcat myfile.gz | \
psql -Z -c "COPY mytable FROM STDIN;"

Just to not look like I'm rejecting stuff without proposing
alternatives, here is an idea about a backwards-compatible design for
doing that: we could add an option that can be set in the connection
request packet. Say, "transmission_compression = gzip".

Alright, do we want/need to support multiple options there? What do
people think we should support? Any other particular details or issues
that come to mind with such an implementation?

I'm willing to work through that if it's the route everyone agrees with.

Thanks,

Stephen

#28

Bruce Momjian

bruce@momjian.us

almost 13 years ago

In reply to: Stephen Frost (#25)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 04:22:48PM -0500, Stephen Frost wrote:

* Bruce Momjian (bruce@momjian.us) wrote:

Well, COPY is super-user only, so it seems only useful for FDW, no? We
already have lots of user-configuration FDW commands, so I can see
adding this one too.

COPY is most certainly not superuser-only.. COPY w/ popen, if that
popen can call anything, would certainly have to be superuser-only.

COPY with a file name is super-user-only. I am unclear how you would
use STDIN/STDOUT in any meaningful way with binary data produced by
compression. I guess you could with libpq.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#29

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Bruce Momjian (#28)

Re: [PATCH] COPY .. COMPRESSED

* Bruce Momjian (bruce@momjian.us) wrote:

COPY with a file name is super-user-only. I am unclear how you would
use STDIN/STDOUT in any meaningful way with binary data produced by
compression. I guess you could with libpq.

The patch that I posted provided this:

psql -h myhost -c "COPY mytable FROM STDIN COMPRESSED;" < myfile.gz

With the compressed file being transferred unmolested to the server side
where it was decompressed and processed by the server.

Thanks,

Stephen

#30

Tom Lane

tgl@sss.pgh.pa.us

almost 13 years ago

In reply to: Stephen Frost (#27)

Re: [PATCH] COPY .. COMPRESSED

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

I find the argument that this supports compression-over-the-wire to be
quite weak, because COPY is only one form of bulk data transfer, and
one that a lot of applications don't ever use. If we think we need to
support transmission compression for ourselves, it ought to be
integrated at the wire protocol level, not in COPY.

As far as I can tell, COPY is the option which is strongly recommended
for bulk-data-operations. I can see the use-case for wanting SELECT
results to be compressed, but it strikes me as the 10% case for PG users
rather than the 90% one. Ditto for COPY vs. large INSERT .. VALUES.

Really? Given that libpq provides no useful support for doing anything
with COPY data, much less higher-level packages such as Perl DBI, I'd
venture that the real-world ratio is more like 90/10. If not 99/1.
There might be a few souls out there who are hardy enough and concerned
enough with performance to have made their apps speak COPY protocol,
and not given up on it the first time they hit a quoting/escaping bug
... but not many, I bet.

Compressing every small packet seems like it'd be overkill and might
surprise people by actually reducing performance in the case of lots of
small requests.

Yeah, proper selection and integration of a compression method would be
critical, which is one reason that I'm not suggesting a plugin for this.
You couldn't expect any-random-compressor to work well. I think zlib
would be okay though when making use of its stream compression features.
The key thing there is to force a stream buffer flush (too lazy to look
up exactly what zlib calls it, but they have the concept) exactly when
we're about to do a flush to the socket. That way we get cross-packet
compression but don't have a problem with the compressor failing to send
the last partial message when we need it to.

(My suggestion of an expansible option is for future-proofing, not
because I think we'd try to support more than one option today.)

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#31

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Tom Lane (#30)

Re: [PATCH] COPY .. COMPRESSED

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Really? Given that libpq provides no useful support for doing anything
with COPY data, much less higher-level packages such as Perl DBI, I'd
venture that the real-world ratio is more like 90/10. If not 99/1.

Perhaps I'm taking a bit too narrow view of the world, but my thinking
is OLTP won't want things compressed, as it increases latency of
requests, while OLAP users are operating with enough data that they'll
go through the effort to use COPY.

There might be a few souls out there who are hardy enough and concerned
enough with performance to have made their apps speak COPY protocol,
and not given up on it the first time they hit a quoting/escaping bug
... but not many, I bet.

The Perl/PHP/Ruby/etc users that are writing OLTP systems aren't likely
going to be interested in this. The OLAP users are likely using psql
(it's what we're using to load terrabytes of data via COPY, JDBC, DBI,
etc, all have been tried and pale in comparison..).

The key thing there is to force a stream buffer flush (too lazy to look
up exactly what zlib calls it, but they have the concept) exactly when
we're about to do a flush to the socket. That way we get cross-packet
compression but don't have a problem with the compressor failing to send
the last partial message when we need it to.

Yes, I'm familiar with it.

(My suggestion of an expansible option is for future-proofing, not
because I think we'd try to support more than one option today.)

Fair enough.

Thanks,

Stephen

#32

Tom Lane

tgl@sss.pgh.pa.us

almost 13 years ago

In reply to: Stephen Frost (#31)

Re: [PATCH] COPY .. COMPRESSED

Stephen Frost <sfrost@snowman.net> writes:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Really? Given that libpq provides no useful support for doing anything
with COPY data, much less higher-level packages such as Perl DBI, I'd
venture that the real-world ratio is more like 90/10. If not 99/1.

Perhaps I'm taking a bit too narrow view of the world, but my thinking
is OLTP won't want things compressed, as it increases latency of
requests, while OLAP users are operating with enough data that they'll
go through the effort to use COPY.

I should think the argument for or against wire-protocol compression
depends mainly on your network environment, not the nature of your
application. Either bytes sent are more expensive than CPU cycles at
each end, or vice versa. Latency could be a big deal if we weren't
going to force compressor flushes at synchronization boundaries, but
if we are, any added latency is a matter of a few cycles at most.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#33

Tom Lane

tgl@sss.pgh.pa.us

almost 13 years ago

In reply to: Stephen Frost (#31)

Re: [PATCH] COPY .. COMPRESSED

Stephen Frost <sfrost@snowman.net> writes:

Perhaps I'm taking a bit too narrow view of the world, but my thinking
is OLTP won't want things compressed, as it increases latency of
requests, while OLAP users are operating with enough data that they'll
go through the effort to use COPY.

Also, if there are so many people using COPY in their apps, why have we
never seen any submitted patches to extend libpq with functions to
construct/deconstruct COPY data? Surely somebody would have said to
themselves that they shouldn't be responsible for knowing those escaping
rules.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#34

Andrew Dunstan

andrew@dunslane.net

almost 13 years ago

In reply to: Tom Lane (#33)

Re: [PATCH] COPY .. COMPRESSED

On 01/15/2013 06:22 PM, Tom Lane wrote:

Stephen Frost <sfrost@snowman.net> writes:

Perhaps I'm taking a bit too narrow view of the world, but my thinking
is OLTP won't want things compressed, as it increases latency of
requests, while OLAP users are operating with enough data that they'll
go through the effort to use COPY.

Also, if there are so many people using COPY in their apps, why have we
never seen any submitted patches to extend libpq with functions to
construct/deconstruct COPY data? Surely somebody would have said to
themselves that they shouldn't be responsible for knowing those escaping
rules.

There are perfectly good libraries in Perl and other languages for
constructing/deconstructing CSV data. If we didn't have CSV
import/export I suspect we would have heard lots more howls by now.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35

Claudio Freire

klaussfreire@gmail.com

almost 13 years ago

In reply to: Tom Lane (#30)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Compressing every small packet seems like it'd be overkill and might
surprise people by actually reducing performance in the case of lots of
small requests.

Yeah, proper selection and integration of a compression method would be
critical, which is one reason that I'm not suggesting a plugin for this.
You couldn't expect any-random-compressor to work well. I think zlib
would be okay though when making use of its stream compression features.
The key thing there is to force a stream buffer flush (too lazy to look
up exactly what zlib calls it, but they have the concept) exactly when
we're about to do a flush to the socket. That way we get cross-packet
compression but don't have a problem with the compressor failing to send
the last partial message when we need it to.

Just a "stream flush bit" (or stream reset bit) on the packet header
would do. First packet on any stream would be marked, and that's it.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#36

Peter Eisentraut

peter_e@gmx.net

almost 13 years ago

In reply to: Stephen Frost (#21)

Re: [PATCH] COPY .. COMPRESSED

On 1/15/13 2:53 PM, Stephen Frost wrote:

You're right, to clarify, for *file_fdw*, which is a backend-only
operation, the popen patch is great (thought I made that clear before).

I would think that if we get writable FDWs, you would want file_fdw to
go through zlib so that it can write directly to the file.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Peter Eisentraut (#36)

Re: [PATCH] COPY .. COMPRESSED

* Peter Eisentraut (peter_e@gmx.net) wrote:

On 1/15/13 2:53 PM, Stephen Frost wrote:

You're right, to clarify, for *file_fdw*, which is a backend-only
operation, the popen patch is great (thought I made that clear before).

I would think that if we get writable FDWs, you would want file_fdw to
go through zlib so that it can write directly to the file.

With the popen patch, I expect it could be defined as '|gzip >
myfile.gz'.. I believe that patch did that. It'd be ideal to add
support for that to file_fdw also, certainly. That shouldn't be hard as
file_fdw is essentially a wrapper around backend COPY already and this
would just be adding a few additional options to pass through.

Thanks,

Stephen

#38

Robert Haas

robertmhaas@gmail.com

almost 13 years ago

In reply to: Stephen Frost (#9)

Re: [PATCH] COPY .. COMPRESSED

On Mon, Jan 14, 2013 at 11:28 AM, Stephen Frost <sfrost@snowman.net> wrote:

* Tom Lane (tgl@sss.pgh.pa.us) wrote:

Stephen Frost <sfrost@snowman.net> writes:

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.

The design that was being kicked around allowed pipes to be used on the
client side too, ie \copy foo to '| gzip ...'. That form would not
require any special privileges, and might be thought preferable for
another reason too: it offloads the work from the server.

It's a different use-case which, imv, is really already trivially
covered:

psql -c 'COPY foo TO STDOUT;' | gzip > myfile.gz

While there is no option currently for having the server do the
compression before sending the data over the wire.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#39

Robert Haas

robertmhaas@gmail.com

almost 13 years ago

In reply to: Christopher Browne (#22)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 3:37 PM, Christopher Browne <cbbrowne@gmail.com> wrote:

That points towards a fix that involves having a set of non-arbitrary commands
that we allow plain users to use.

Hmm. There's an interesting thought...

How about having a "pg_filters" table in pg_catalog which allows capturing
labels and names of known-to-be-safe binary filters:

insert into pg_filters (label, location)
values
('zcat', '/usr/bin/zcat'),
('bzip2', '/usr/bin/bzip2'),
('bunzip2', '/usr/bin/bunzip2');

And then having some capability to grant permissions to roles to use
these filters.

I suspect that's going to be less efficient than using a compression
library that's linked into the backend, because you have to copy all
the data through the kernel to another process and back. And it's
certainly a lot more complex.

If it greatly broadened the applicability of this feature I might
think it was worthwhile, but I can't see that it does. I suspect that
supporting zlib, which we already linked against, would cater to
something well upwards of 90% of the use cases here. Sure, there are
other things, but zlib is very widely used and bzip2 IME is far too
slow to be taken seriously for this kind of application. The
additional space savings that you get for the additional CPU
investment is typically small, and if you really need it, having to
un-gzip and re-bzip2 on the client is always an option. If you're
using bzip2 you obviously have CPU time to burn.

At any rate, I think it would be good to avoid letting our desire for
infinite flexibility get in the way of doing something useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#40

Christopher Browne

cbbrowne@gmail.com

almost 13 years ago

In reply to: Robert Haas (#39)

Re: [PATCH] COPY .. COMPRESSED

On Wed, Jan 16, 2013 at 5:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 15, 2013 at 3:37 PM, Christopher Browne <cbbrowne@gmail.com> wrote:

That points towards a fix that involves having a set of non-arbitrary commands
that we allow plain users to use.

Hmm. There's an interesting thought...

How about having a "pg_filters" table in pg_catalog which allows capturing
labels and names of known-to-be-safe binary filters:

insert into pg_filters (label, location)
values
('zcat', '/usr/bin/zcat'),
('bzip2', '/usr/bin/bzip2'),
('bunzip2', '/usr/bin/bunzip2');

And then having some capability to grant permissions to roles to use
these filters.

I suspect that's going to be less efficient than using a compression
library that's linked into the backend, because you have to copy all
the data through the kernel to another process and back. And it's
certainly a lot more complex.

More complex, certainly.

By spawning a separate process, we'd get benefit of multicore CPUs,
so I'm not sure I agree that it's necessarily slower.

At any rate, I think it would be good to avoid letting our desire for
infinite flexibility get in the way of doing something useful.

Oh, agreed. I was actively thinking of the cooler bits of this pointing more
towards 9.4 than 9.3.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41

Robert Haas

robertmhaas@gmail.com

almost 13 years ago

In reply to: Tom Lane (#26)

Re: [PATCH] COPY .. COMPRESSED

On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I find the argument that this supports compression-over-the-wire to be
quite weak, because COPY is only one form of bulk data transfer, and
one that a lot of applications don't ever use. If we think we need to
support transmission compression for ourselves, it ought to be
integrated at the wire protocol level, not in COPY.

Just to not look like I'm rejecting stuff without proposing
alternatives, here is an idea about a backwards-compatible design for
doing that: we could add an option that can be set in the connection
request packet. Say, "transmission_compression = gzip".

But presumably this would transparently compress at one end and
decompress at the other end, which is again a somewhat different use
case. To get compressed output on the client side, you have to
decompress and recompress. Maybe that's OK, but it's not quite the
same thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42

Bruce Momjian

bruce@momjian.us

almost 13 years ago

In reply to: Robert Haas (#41)

Re: [PATCH] COPY .. COMPRESSED

On Wed, Jan 16, 2013 at 06:19:09PM -0500, Robert Haas wrote:

On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I find the argument that this supports compression-over-the-wire to be
quite weak, because COPY is only one form of bulk data transfer, and
one that a lot of applications don't ever use. If we think we need to
support transmission compression for ourselves, it ought to be
integrated at the wire protocol level, not in COPY.

Just to not look like I'm rejecting stuff without proposing
alternatives, here is an idea about a backwards-compatible design for
doing that: we could add an option that can be set in the connection
request packet. Say, "transmission_compression = gzip".

But presumably this would transparently compress at one end and
decompress at the other end, which is again a somewhat different use
case. To get compressed output on the client side, you have to
decompress and recompress. Maybe that's OK, but it's not quite the
same thing.

Is there a TODO here?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43

Stephen Frost

sfrost@snowman.net

almost 13 years ago

In reply to: Bruce Momjian (#42)

Re: [PATCH] COPY .. COMPRESSED

* Bruce Momjian (bruce@momjian.us) wrote:

On Wed, Jan 16, 2013 at 06:19:09PM -0500, Robert Haas wrote:

But presumably this would transparently compress at one end and
decompress at the other end, which is again a somewhat different use
case. To get compressed output on the client side, you have to
decompress and recompress. Maybe that's OK, but it's not quite the
same thing.

Is there a TODO here?

For some reason, I thought there was a patch...

Thanks,

Stephen

#44

Claudio Freire

klaussfreire@gmail.com

almost 13 years ago

In reply to: Robert Haas (#41)

Re: [PATCH] COPY .. COMPRESSED

On Wed, Jan 16, 2013 at 8:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I find the argument that this supports compression-over-the-wire to be
quite weak, because COPY is only one form of bulk data transfer, and
one that a lot of applications don't ever use. If we think we need to
support transmission compression for ourselves, it ought to be
integrated at the wire protocol level, not in COPY.

Just to not look like I'm rejecting stuff without proposing
alternatives, here is an idea about a backwards-compatible design for
doing that: we could add an option that can be set in the connection
request packet. Say, "transmission_compression = gzip".

But presumably this would transparently compress at one end and
decompress at the other end, which is again a somewhat different use
case. To get compressed output on the client side, you have to
decompress and recompress. Maybe that's OK, but it's not quite the
same thing.

Well, libpq could give some access to raw compressed streams, but,
really, even with double compression on the client, it solves the
bandwidth issue, not only for pg_dump, pg_restore, and copy, but also
for all other transfer-intensive applications. I do think it's the
best option.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers