Add ENCODING option to COPY

Started by Hitoshi Haradaalmost 15 years ago23 messages

umi.tanuki@gmail.com

almost 15 years ago

1 attachment(s)

Here's the patch to add ENCODING option to COPY command.
The fundamental issue was explained months ago:

http://archives.postgresql.org/message-id/AANLkTikCt6bHXZjO_oX+JS7+G=jAQ7gVZPu0Owjcsbfb@mail.gmail.com

In short, client_encoding is not appropriate for copy operation so we
should need the specialized option for it. Robert Haas agreed with its
need later in the thread. Thanks.

The patch overrides client_encoding by the added ENCODING option, and
restores it as soon as copy is done. I see some complaints ask to use
pg_do_encoding_conversion() instead of
pg_client_to_server/server_to_client(), but the former will surely add
slight overhead per reading line and I believe copy is
performance-sensitive command.

I'll add this to the CF app.

Regards,

--
Hitoshi Harada

Attachments:

copy_encoding.patchapplication/octet-stream; name=copy_encoding.patchDownload

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 38424ad..1f9e4cc 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -40,7 +40,8 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
     QUOTE '<replaceable class="parameter">quote_character</replaceable>'
     ESCAPE '<replaceable class="parameter">escape_character</replaceable>'
     FORCE_QUOTE { ( <replaceable class="parameter">column</replaceable> [, ...] ) | * }
-    FORCE_NOT_NULL ( <replaceable class="parameter">column</replaceable> [, ...] )
+    FORCE_NOT_NULL ( <replaceable class="parameter">column</replaceable> [, ...] ) |
+    ENCODING <replaceable class="parameter">encoding_name</replaceable>
 </synopsis>
  </refsynopsisdiv>
 
@@ -282,6 +283,18 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>ENCODING</></term>
+    <listitem>
+     <para>
+      Specifies that the file is encoded in the <replaceable
+      class="parameter">encoding_name</replaceable>.  If this option is
+      omitted, the current client encoding is used. See the Notes below
+      for more details.
+     </para>
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </refsect1>
 
@@ -377,8 +390,9 @@ COPY <replaceable class="parameter">count</replaceable>
    </para>
 
    <para>
-    Input data is interpreted according to the current client encoding,
-    and output data is encoded in the current client encoding, even
+    Input data is interpreted according to <literal>ENCODING</literal>
+    option or the current client encoding, and output data is encoded
+    in <literal>ENCODING</literal> or the current client encoding, even
     if the data does not pass through the client but is read from or
     written to a file directly by the server.
    </para>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 841bf22..c571315 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -97,7 +97,9 @@ typedef struct CopyStateData
 	bool		fe_eof;			/* true if detected end of copy data */
 	EolType		eol_type;		/* EOL type of input */
 	int			client_encoding;	/* remote side's character encoding */
-	bool		need_transcoding;		/* client encoding diff from server? */
+	int			saved_encoding;		/* client encoding to be restored */
+	bool		encoding_option;	/* has encoding option? */
+	bool		need_transcoding;		/* encoding diff from server? */
 	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
 	uint64		processed;		/* # of tuples processed */
 
@@ -811,6 +813,22 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 						 errmsg("conflicting or redundant options")));
 			cstate->escape = defGetString(defel);
 		}
+		else if (strcmp(defel->defname, "encoding") == 0)
+		{
+			if (cstate->encoding_option)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			if (PG_VALID_ENCODING(pg_char_to_encoding(defGetString(defel))))
+					cstate->client_encoding =
+							pg_char_to_encoding(defGetString(defel));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("argument to option \"%s\" must be a valid encoding name",
+								defel->defname)));
+			cstate->encoding_option = true;
+		}
 		else if (strcmp(defel->defname, "force_quote") == 0)
 		{
 			if (force_quote || force_quote_all)
@@ -1169,11 +1187,25 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 	cstate->processed = 0;
 
 	/*
-	 * Set up encoding conversion info.  Even if the client and server
-	 * encodings are the same, we must apply pg_client_to_server() to validate
-	 * data in multibyte encodings.
+	 * Set up encoding conversion info.  If encoding option is specified,
+	 * use it instead of client_encoding of GUC.  In any cases, we use
+	 * pg_client_to_server/server_to_client() for performance reason.
+	 * Be careful to restore the previous encoding setting on changing
+	 * current client_encoding.
+	 */
+	if (cstate->encoding_option)
+	{
+		cstate->saved_encoding = pg_get_client_encoding();
+		SetClientEncoding(cstate->client_encoding, true);
+	}
+	else
+		cstate->client_encoding = pg_get_client_encoding();
+
+	/*
+	 * Even if the client and server encodings are the same,
+	 * we must apply pg_client_to_server() to validate data in
+	 * multibyte encodings.
 	 */
-	cstate->client_encoding = pg_get_client_encoding();
 	cstate->need_transcoding =
 		(cstate->client_encoding != GetDatabaseEncoding() ||
 		 pg_database_encoding_max_length() > 1);
@@ -1188,6 +1220,10 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 	else
 		DoCopyTo(cstate);		/* copy from database to file */
 
+	/* restore previous client_encoding if client_encoding has changed */
+	if (cstate->encoding_option)
+		SetClientEncoding(cstate->saved_encoding, true);
+
 	/*
 	 * Close the relation or query.  If reading, we can release the
 	 * AccessShareLock we got; if writing, we should hold the lock until end
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 660947c..3318dd0 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2216,6 +2216,10 @@ copy_opt_item:
 				{
 					$$ = makeDefElem("force_not_null", (Node *)$4);
 				}
+			| ENCODING ColId_or_Sconst
+				{
+					$$ = makeDefElem("encoding", (Node *)makeString($2));
+				}
 		;
 
 /* The following exist for backward compatibility with very old versions */
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 15cbe02..88d7d16 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -46,10 +46,10 @@ CONTEXT:  COPY x, line 1: "2001	231	\N	\N"
 COPY x from stdin;
 ERROR:  extra data after last expected column
 CONTEXT:  COPY x, line 1: "2002	232	40	50	60	70	80"
--- various COPY options: delimiters, oids, NULL string
+-- various COPY options: delimiters, oids, NULL string, encoding
 COPY x (b, c, d, e) from stdin with oids delimiter ',' null 'x';
 COPY x from stdin WITH DELIMITER AS ';' NULL AS '';
-COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X';
+COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X' ENCODING sql_ascii;
 -- check results of copy in
 SELECT * FROM x;
    a   | b  |     c      |   d    |          e           
@@ -187,7 +187,7 @@ COPY y TO stdout WITH CSV QUOTE '''' DELIMITER '|';
 Jackson, Sam|\h
 It is "perfect".|	
 ''|
-COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\';
+COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\' ENCODING sql_ascii;
 "Jackson, Sam","\\h"
 "It is \"perfect\".","	"
 "",
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index c2e8b03..d2683d1 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -72,7 +72,7 @@ COPY x from stdin;
 2002	232	40	50	60	70	80
 \.
 
--- various COPY options: delimiters, oids, NULL string
+-- various COPY options: delimiters, oids, NULL string, encoding
 COPY x (b, c, d, e) from stdin with oids delimiter ',' null 'x';
 500000,x,45,80,90
 500001,x,\x,\\x,\\\x
@@ -83,7 +83,7 @@ COPY x from stdin WITH DELIMITER AS ';' NULL AS '';
 3000;;c;;
 \.
 
-COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X';
+COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X' ENCODING sql_ascii;
 4000:\X:C:\X:\X
 4001:1:empty::
 4002:2:null:\X:\X
@@ -127,7 +127,7 @@ INSERT INTO y VALUES ('', NULL);
 
 COPY y TO stdout WITH CSV;
 COPY y TO stdout WITH CSV QUOTE '''' DELIMITER '|';
-COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\';
+COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\' ENCODING sql_ascii;
 COPY y TO stdout WITH CSV FORCE QUOTE *;
 
 -- Repeat above tests with new 9.0 option syntax

Itagaki Takahiro

itagaki.takahiro@gmail.com

almost 15 years ago

In reply to: Hitoshi Harada (#1)

Re: Add ENCODING option to COPY

On Sat, Jan 15, 2011 at 02:25, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

The patch overrides client_encoding by the added ENCODING option, and
restores it as soon as copy is done.

We cannot do that because error messages should be encoded in the original
encoding even during COPY commands with encoding option. Error messages
could contain non-ASCII characters if lc_messages is set.

I see some complaints ask to use
pg_do_encoding_conversion() instead of
pg_client_to_server/server_to_client(), but the former will surely add
slight overhead per reading line

If we want to reduce the overhead, we should cache the conversion procedure
in CopyState. How about adding something like "FmgrInfo file_to_server_covv"
into it?

--
Itagaki Takahiro

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Itagaki Takahiro (#2)

Re: Add ENCODING option to COPY

2011/1/25 Itagaki Takahiro <itagaki.takahiro@gmail.com>:

On Sat, Jan 15, 2011 at 02:25, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

The patch overrides client_encoding by the added ENCODING option, and
restores it as soon as copy is done.

We cannot do that because error messages should be encoded in the original
encoding even during COPY commands with encoding option. Error messages
could contain non-ASCII characters if lc_messages is set.

Agreed.

I see some complaints ask to use
pg_do_encoding_conversion() instead of
pg_client_to_server/server_to_client(), but the former will surely add
slight overhead per reading line

If we want to reduce the overhead, we should cache the conversion procedure
in CopyState. How about adding something like "FmgrInfo file_to_server_covv"
into it?

I looked down to the code and found that we cannot pass FmgrInfo * to
any functions defined in pg_wchar.h, since the header file is shared
in libpq, too.

For the record, I also tried pg_do_encoding_conversion() instead of
pg_client_to_server/server_to_client(), and the simple benchmark shows
it is too slow.

with 3000000 lines with 3 columns (~22MB tsv) COPY FROM

*utf8 -> utf8 (no conversion)
13428.233ms
13322.832ms
15661.093ms

*euc_jp -> utf8 (client_encoding)
17527.470ms
16457.452ms
16522.337ms

*euc_jp -> utf8 (pg_do_encoding_conversion)
20550.983ms
21425.313ms
20774.323ms

I'll check the code more if we have better alternatives.

Regards,

--
Hitoshi Harada

Robert Haas

robertmhaas@gmail.com

almost 15 years ago

In reply to: Hitoshi Harada (#3)

Re: Add ENCODING option to COPY

On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

I'll check the code more if we have better alternatives.

Where are we with this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Robert Haas (#4)

Re: Add ENCODING option to COPY

2011/1/31 Robert Haas <robertmhaas@gmail.com>:

On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

I'll check the code more if we have better alternatives.

Where are we with this?

I'll post another version today.

Regards,

--
Hitoshi Harada

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Hitoshi Harada (#5)

1 attachment(s)

Re: Add ENCODING option to COPY

2011/1/31 Hitoshi Harada <umi.tanuki@gmail.com>:

2011/1/31 Robert Haas <robertmhaas@gmail.com>:

On Tue, Jan 25, 2011 at 10:24 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

I'll check the code more if we have better alternatives.

Where are we with this?

I'll post another version today.

Here's the patch.

Finally I concluded the concern Itagaki-san raised can be solved by
adding code that restores client_encoding in copy_in_error_callback. I
tested some encoding mismatch cases with this patch and saw
appropriate messages in NLS environment.

Regards,

--
Hitoshi Harada

Attachments:

copy_encoding.v2.patchapplication/octet-stream; name=copy_encoding.v2.patchDownload

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 38424ad..1f9e4cc 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -40,7 +40,8 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
     QUOTE '<replaceable class="parameter">quote_character</replaceable>'
     ESCAPE '<replaceable class="parameter">escape_character</replaceable>'
     FORCE_QUOTE { ( <replaceable class="parameter">column</replaceable> [, ...] ) | * }
-    FORCE_NOT_NULL ( <replaceable class="parameter">column</replaceable> [, ...] )
+    FORCE_NOT_NULL ( <replaceable class="parameter">column</replaceable> [, ...] ) |
+    ENCODING <replaceable class="parameter">encoding_name</replaceable>
 </synopsis>
  </refsynopsisdiv>
 
@@ -282,6 +283,18 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>ENCODING</></term>
+    <listitem>
+     <para>
+      Specifies that the file is encoded in the <replaceable
+      class="parameter">encoding_name</replaceable>.  If this option is
+      omitted, the current client encoding is used. See the Notes below
+      for more details.
+     </para>
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </refsect1>
 
@@ -377,8 +390,9 @@ COPY <replaceable class="parameter">count</replaceable>
    </para>
 
    <para>
-    Input data is interpreted according to the current client encoding,
-    and output data is encoded in the current client encoding, even
+    Input data is interpreted according to <literal>ENCODING</literal>
+    option or the current client encoding, and output data is encoded
+    in <literal>ENCODING</literal> or the current client encoding, even
     if the data does not pass through the client but is read from or
     written to a file directly by the server.
    </para>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 841bf22..0c85842 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -97,7 +97,9 @@ typedef struct CopyStateData
 	bool		fe_eof;			/* true if detected end of copy data */
 	EolType		eol_type;		/* EOL type of input */
 	int			client_encoding;	/* remote side's character encoding */
-	bool		need_transcoding;		/* client encoding diff from server? */
+	int			saved_encoding;		/* client encoding to be restored */
+	bool		encoding_option;	/* has encoding option? */
+	bool		need_transcoding;		/* encoding diff from server? */
 	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
 	uint64		processed;		/* # of tuples processed */
 
@@ -811,6 +813,22 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 						 errmsg("conflicting or redundant options")));
 			cstate->escape = defGetString(defel);
 		}
+		else if (strcmp(defel->defname, "encoding") == 0)
+		{
+			if (cstate->encoding_option)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			if (PG_VALID_ENCODING(pg_char_to_encoding(defGetString(defel))))
+					cstate->client_encoding =
+							pg_char_to_encoding(defGetString(defel));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("argument to option \"%s\" must be a valid encoding name",
+								defel->defname)));
+			cstate->encoding_option = true;
+		}
 		else if (strcmp(defel->defname, "force_quote") == 0)
 		{
 			if (force_quote || force_quote_all)
@@ -1169,11 +1187,25 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 	cstate->processed = 0;
 
 	/*
-	 * Set up encoding conversion info.  Even if the client and server
-	 * encodings are the same, we must apply pg_client_to_server() to validate
-	 * data in multibyte encodings.
+	 * Set up encoding conversion info.  If encoding option is specified,
+	 * use it instead of client_encoding of GUC.  In any cases, we use
+	 * pg_client_to_server/server_to_client() for performance reason.
+	 * Be careful to restore the previous encoding setting on changing
+	 * current client_encoding.
+	 */
+	if (cstate->encoding_option)
+	{
+		cstate->saved_encoding = pg_get_client_encoding();
+		SetClientEncoding(cstate->client_encoding, true);
+	}
+	else
+		cstate->client_encoding = pg_get_client_encoding();
+
+	/*
+	 * Even if the client and server encodings are the same,
+	 * we must apply pg_client_to_server() to validate data in
+	 * multibyte encodings.
 	 */
-	cstate->client_encoding = pg_get_client_encoding();
 	cstate->need_transcoding =
 		(cstate->client_encoding != GetDatabaseEncoding() ||
 		 pg_database_encoding_max_length() > 1);
@@ -1188,6 +1220,10 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 	else
 		DoCopyTo(cstate);		/* copy from database to file */
 
+	/* restore the previous client_encoding if it has changed */
+	if (cstate->encoding_option)
+		SetClientEncoding(cstate->saved_encoding, true);
+
 	/*
 	 * Close the relation or query.  If reading, we can release the
 	 * AccessShareLock we got; if writing, we should hold the lock until end
@@ -1571,6 +1607,14 @@ copy_in_error_callback(void *arg)
 {
 	CopyState	cstate = (CopyState) arg;
 
+	/*
+	 * If we changed encoding by the option, need to restore
+	 * the original client encoding so that error message
+	 * can be encoded in the proper one.
+	 */
+	if (cstate->encoding_option)
+		SetClientEncoding(cstate->saved_encoding, true);
+
 	if (cstate->binary)
 	{
 		/* can't usefully display the data */
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 456db5c..1f698d1 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2217,6 +2217,10 @@ copy_opt_item:
 				{
 					$$ = makeDefElem("force_not_null", (Node *)$4);
 				}
+			| ENCODING ColId_or_Sconst
+				{
+					$$ = makeDefElem("encoding", (Node *)makeString($2));
+				}
 		;
 
 /* The following exist for backward compatibility with very old versions */
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 15cbe02..88d7d16 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -46,10 +46,10 @@ CONTEXT:  COPY x, line 1: "2001	231	\N	\N"
 COPY x from stdin;
 ERROR:  extra data after last expected column
 CONTEXT:  COPY x, line 1: "2002	232	40	50	60	70	80"
--- various COPY options: delimiters, oids, NULL string
+-- various COPY options: delimiters, oids, NULL string, encoding
 COPY x (b, c, d, e) from stdin with oids delimiter ',' null 'x';
 COPY x from stdin WITH DELIMITER AS ';' NULL AS '';
-COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X';
+COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X' ENCODING sql_ascii;
 -- check results of copy in
 SELECT * FROM x;
    a   | b  |     c      |   d    |          e           
@@ -187,7 +187,7 @@ COPY y TO stdout WITH CSV QUOTE '''' DELIMITER '|';
 Jackson, Sam|\h
 It is "perfect".|	
 ''|
-COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\';
+COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\' ENCODING sql_ascii;
 "Jackson, Sam","\\h"
 "It is \"perfect\".","	"
 "",
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index c2e8b03..d2683d1 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -72,7 +72,7 @@ COPY x from stdin;
 2002	232	40	50	60	70	80
 \.
 
--- various COPY options: delimiters, oids, NULL string
+-- various COPY options: delimiters, oids, NULL string, encoding
 COPY x (b, c, d, e) from stdin with oids delimiter ',' null 'x';
 500000,x,45,80,90
 500001,x,\x,\\x,\\\x
@@ -83,7 +83,7 @@ COPY x from stdin WITH DELIMITER AS ';' NULL AS '';
 3000;;c;;
 \.
 
-COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X';
+COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X' ENCODING sql_ascii;
 4000:\X:C:\X:\X
 4001:1:empty::
 4002:2:null:\X:\X
@@ -127,7 +127,7 @@ INSERT INTO y VALUES ('', NULL);
 
 COPY y TO stdout WITH CSV;
 COPY y TO stdout WITH CSV QUOTE '''' DELIMITER '|';
-COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\';
+COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\' ENCODING sql_ascii;
 COPY y TO stdout WITH CSV FORCE QUOTE *;
 
 -- Repeat above tests with new 9.0 option syntax

Tom Lane

tgl@sss.pgh.pa.us

almost 15 years ago

In reply to: Hitoshi Harada (#6)

Re: Add ENCODING option to COPY

Hitoshi Harada <umi.tanuki@gmail.com> writes:

Finally I concluded the concern Itagaki-san raised can be solved by
adding code that restores client_encoding in copy_in_error_callback.

That seems like an absolutely horrid idea. Error context callbacks
should not have side-effects like that. They're not guaranteed to be
called at all, let alone in any particular order. In this case I'd also
be worried that the state needs to be fixed before elog.c reaches the
point of calling the callbacks --- there's nothing to say that it might
not try to translate some strings to the client encoding earlier than
that.

It might happen to work today (or at least in the scenarios you tested),
but it seems fragile as can be.

regards, tom lane

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Tom Lane (#7)

1 attachment(s)

Re: Add ENCODING option to COPY

2011/2/1 Tom Lane <tgl@sss.pgh.pa.us>:

Hitoshi Harada <umi.tanuki@gmail.com> writes:

Finally I concluded the concern Itagaki-san raised can be solved by
adding code that restores client_encoding in copy_in_error_callback.

It might happen to work today (or at least in the scenarios you tested),
but it seems fragile as can be.

Although I thought its fragile-ness was acceptable to avoid making the
patch too complex, I agree with you.
The third patch is attached, modifying mb routines so that they can
receive conversion procedures as FmgrInof * and save the function
pointer in CopyState.
I tested it with encoding option and could not see performance slowdown.

Regards,

--
Hitoshi Harada

Attachments:

copy_encoding.v3.patchapplication/octet-stream; name=copy_encoding.v3.patchDownload

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 93fb2cf..6ea9372 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1370,8 +1370,13 @@ CopyTo(CopyState cstate)
 	/* We use fe_msgbuf as a per-row buffer regardless of copy_dest */
 	cstate->fe_msgbuf = makeStringInfo();
 
-	/* Setup encoding conversion proc */
-	if (cstate->need_transcoding)
+	/*
+	 * We don't need conversion procedure when the encodings match
+	 * between target and server, but need_transcoding indicates
+	 * we need verify encoding still.
+	 */
+	if (cstate->need_transcoding &&
+		cstate->target_encoding != GetDatabaseEncoding())
 	{
 		Oid		funcoid;
 
@@ -1380,6 +1385,8 @@ CopyTo(CopyState cstate)
 		cstate->conv_proc = (FmgrInfo *) palloc(sizeof(FmgrInfo));
 		fmgr_info(funcoid, cstate->conv_proc);
 	}
+	else
+		cstate->conv_proc = NULL;
 
 	/* Get info about the columns we need to process. */
 	cstate->out_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo));
@@ -1989,8 +1996,13 @@ CopyFrom(CopyState cstate)
 	cstate->cur_attname = NULL;
 	cstate->cur_attval = NULL;
 
-	/* Setup encoding conversion proc */
-	if (cstate->need_transcoding)
+	/*
+	 * We don't need conversion procedure when the encodings match
+	 * between target and server, but need_transcoding indicates
+	 * we need verify encoding still.
+	 */
+	if (cstate->need_transcoding &&
+		cstate->target_encoding != GetDatabaseEncoding())
 	{
 		Oid		funcoid;
 
@@ -1999,6 +2011,8 @@ CopyFrom(CopyState cstate)
 		cstate->conv_proc = (FmgrInfo *) palloc(sizeof(FmgrInfo));
 		fmgr_info(funcoid, cstate->conv_proc);
 	}
+	else
+		cstate->conv_proc = NULL;
 
 	bistate = GetBulkInsertState();

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Hitoshi Harada (#8)

1 attachment(s)

Re: Add ENCODING option to COPY

2011/2/1 Hitoshi Harada <umi.tanuki@gmail.com>:

2011/2/1 Tom Lane <tgl@sss.pgh.pa.us>:

Hitoshi Harada <umi.tanuki@gmail.com> writes:

Finally I concluded the concern Itagaki-san raised can be solved by
adding code that restores client_encoding in copy_in_error_callback.

It might happen to work today (or at least in the scenarios you tested),
but it seems fragile as can be.

Although I thought its fragile-ness was acceptable to avoid making the
patch too complex, I agree with you.
The third patch is attached, modifying mb routines so that they can
receive conversion procedures as FmgrInof * and save the function
pointer in CopyState.
I tested it with encoding option and could not see performance slowdown.

Hmm, sorry, the patch was wrong. Correct version is attached.

Regards,

--
Hitoshi Harada

Attachments:

copy_encoding.v4.patchapplication/octet-stream; name=copy_encoding.v4.patchDownload

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 38424ad..1f9e4cc 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -40,7 +40,8 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
     QUOTE '<replaceable class="parameter">quote_character</replaceable>'
     ESCAPE '<replaceable class="parameter">escape_character</replaceable>'
     FORCE_QUOTE { ( <replaceable class="parameter">column</replaceable> [, ...] ) | * }
-    FORCE_NOT_NULL ( <replaceable class="parameter">column</replaceable> [, ...] )
+    FORCE_NOT_NULL ( <replaceable class="parameter">column</replaceable> [, ...] ) |
+    ENCODING <replaceable class="parameter">encoding_name</replaceable>
 </synopsis>
  </refsynopsisdiv>
 
@@ -282,6 +283,18 @@ COPY { <replaceable class="parameter">table_name</replaceable> [ ( <replaceable
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>ENCODING</></term>
+    <listitem>
+     <para>
+      Specifies that the file is encoded in the <replaceable
+      class="parameter">encoding_name</replaceable>.  If this option is
+      omitted, the current client encoding is used. See the Notes below
+      for more details.
+     </para>
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </refsect1>
 
@@ -377,8 +390,9 @@ COPY <replaceable class="parameter">count</replaceable>
    </para>
 
    <para>
-    Input data is interpreted according to the current client encoding,
-    and output data is encoded in the current client encoding, even
+    Input data is interpreted according to <literal>ENCODING</literal>
+    option or the current client encoding, and output data is encoded
+    in <literal>ENCODING</literal> or the current client encoding, even
     if the data does not pass through the client but is read from or
     written to a file directly by the server.
    </para>
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 841bf22..6ea9372 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -96,9 +96,11 @@ typedef struct CopyStateData
 	bool		fe_copy;		/* true for all FE copy dests */
 	bool		fe_eof;			/* true if detected end of copy data */
 	EolType		eol_type;		/* EOL type of input */
-	int			client_encoding;	/* remote side's character encoding */
-	bool		need_transcoding;		/* client encoding diff from server? */
+	int			target_encoding;	/* remote side's character encoding */
+	bool		encoding_option;	/* has encoding option? */
+	bool		need_transcoding;		/* encoding diff from server? */
 	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
+	FmgrInfo   *conv_proc;		/* encoding conversion proc */
 	uint64		processed;		/* # of tuples processed */
 
 	/* parameters from the COPY command */
@@ -811,6 +813,22 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 						 errmsg("conflicting or redundant options")));
 			cstate->escape = defGetString(defel);
 		}
+		else if (strcmp(defel->defname, "encoding") == 0)
+		{
+			if (cstate->encoding_option)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("conflicting or redundant options")));
+			if (PG_VALID_ENCODING(pg_char_to_encoding(defGetString(defel))))
+					cstate->target_encoding =
+							pg_char_to_encoding(defGetString(defel));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("argument to option \"%s\" must be a valid encoding name",
+								defel->defname)));
+			cstate->encoding_option = true;
+		}
 		else if (strcmp(defel->defname, "force_quote") == 0)
 		{
 			if (force_quote || force_quote_all)
@@ -1169,16 +1187,22 @@ DoCopy(const CopyStmt *stmt, const char *queryString)
 	cstate->processed = 0;
 
 	/*
-	 * Set up encoding conversion info.  Even if the client and server
-	 * encodings are the same, we must apply pg_client_to_server() to validate
-	 * data in multibyte encodings.
+	 * Set up encoding conversion info.  If encoding option is specified,
+	 * use it instead of client_encoding of GUC.
+	 */
+	if (!cstate->encoding_option)
+		cstate->target_encoding = pg_get_client_encoding();
+
+	/*
+	 * Even if the client and server encodings are the same,
+	 * we must apply pg_client_to_server() to validate data in
+	 * multibyte encodings.
 	 */
-	cstate->client_encoding = pg_get_client_encoding();
 	cstate->need_transcoding =
-		(cstate->client_encoding != GetDatabaseEncoding() ||
+		(cstate->target_encoding != GetDatabaseEncoding() ||
 		 pg_database_encoding_max_length() > 1);
 	/* See Multibyte encoding comment above */
-	cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->client_encoding);
+	cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->target_encoding);
 
 	cstate->copy_dest = COPY_FILE;		/* default */
 	cstate->filename = stmt->filename;
@@ -1346,6 +1370,24 @@ CopyTo(CopyState cstate)
 	/* We use fe_msgbuf as a per-row buffer regardless of copy_dest */
 	cstate->fe_msgbuf = makeStringInfo();
 
+	/*
+	 * We don't need conversion procedure when the encodings match
+	 * between target and server, but need_transcoding indicates
+	 * we need verify encoding still.
+	 */
+	if (cstate->need_transcoding &&
+		cstate->target_encoding != GetDatabaseEncoding())
+	{
+		Oid		funcoid;
+
+		funcoid = FindDefaultConversionProc(
+									GetDatabaseEncoding(), cstate->target_encoding);
+		cstate->conv_proc = (FmgrInfo *) palloc(sizeof(FmgrInfo));
+		fmgr_info(funcoid, cstate->conv_proc);
+	}
+	else
+		cstate->conv_proc = NULL;
+
 	/* Get info about the columns we need to process. */
 	cstate->out_functions = (FmgrInfo *) palloc(num_phys_attrs * sizeof(FmgrInfo));
 	foreach(cur, cstate->attnumlist)
@@ -1400,8 +1442,13 @@ CopyTo(CopyState cstate)
 		 * encoding, because it will be sent directly with CopySendString.
 		 */
 		if (cstate->need_transcoding)
-			cstate->null_print_client = pg_server_to_client(cstate->null_print,
-													 cstate->null_print_len);
+			cstate->null_print_client =
+				pg_cached_encoding_conversion(cstate->null_print,
+											  cstate->null_print_len,
+											  cstate->conv_proc,
+											  GetDatabaseEncoding(),
+											  cstate->target_encoding,
+											  false);
 
 		/* if a header has been requested send the line */
 		if (cstate->header_line)
@@ -1949,6 +1996,24 @@ CopyFrom(CopyState cstate)
 	cstate->cur_attname = NULL;
 	cstate->cur_attval = NULL;
 
+	/*
+	 * We don't need conversion procedure when the encodings match
+	 * between target and server, but need_transcoding indicates
+	 * we need verify encoding still.
+	 */
+	if (cstate->need_transcoding &&
+		cstate->target_encoding != GetDatabaseEncoding())
+	{
+		Oid		funcoid;
+
+		funcoid = FindDefaultConversionProc(
+								cstate->target_encoding, GetDatabaseEncoding());
+		cstate->conv_proc = (FmgrInfo *) palloc(sizeof(FmgrInfo));
+		fmgr_info(funcoid, cstate->conv_proc);
+	}
+	else
+		cstate->conv_proc = NULL;
+
 	bistate = GetBulkInsertState();
 
 	/* Set up callback to identify error line number */
@@ -2351,8 +2416,12 @@ CopyReadLine(CopyState cstate)
 	{
 		char	   *cvt;
 
-		cvt = pg_client_to_server(cstate->line_buf.data,
-								  cstate->line_buf.len);
+		cvt = pg_cached_encoding_conversion(cstate->line_buf.data,
+											cstate->line_buf.len,
+											cstate->conv_proc,
+											cstate->target_encoding,
+											GetDatabaseEncoding(),
+											true);
 		if (cvt != cstate->line_buf.data)
 		{
 			/* transfer converted data back to line_buf */
@@ -2711,7 +2780,7 @@ not_end_of_copy:
 
 			mblen_str[0] = c;
 			/* All our encodings only read the first byte to get the length */
-			mblen = pg_encoding_mblen(cstate->client_encoding, mblen_str);
+			mblen = pg_encoding_mblen(cstate->target_encoding, mblen_str);
 			IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(mblen - 1);
 			IF_NEED_REFILL_AND_EOF_BREAK(mblen - 1);
 			raw_buf_ptr += mblen - 1;
@@ -3210,7 +3279,11 @@ CopyAttributeOutText(CopyState cstate, char *string)
 	char		delimc = cstate->delim[0];
 
 	if (cstate->need_transcoding)
-		ptr = pg_server_to_client(string, strlen(string));
+		ptr = pg_cached_encoding_conversion(string, strlen(string),
+											cstate->conv_proc,
+											GetDatabaseEncoding(),
+											cstate->target_encoding,
+											false);
 	else
 		ptr = string;
 
@@ -3283,7 +3356,7 @@ CopyAttributeOutText(CopyState cstate, char *string)
 				start = ptr++;	/* we include char in next run */
 			}
 			else if (IS_HIGHBIT_SET(c))
-				ptr += pg_encoding_mblen(cstate->client_encoding, ptr);
+				ptr += pg_encoding_mblen(cstate->target_encoding, ptr);
 			else
 				ptr++;
 		}
@@ -3370,7 +3443,11 @@ CopyAttributeOutCSV(CopyState cstate, char *string,
 		use_quote = true;
 
 	if (cstate->need_transcoding)
-		ptr = pg_server_to_client(string, strlen(string));
+		ptr = pg_cached_encoding_conversion(string, strlen(string),
+											cstate->conv_proc,
+											GetDatabaseEncoding(),
+											cstate->target_encoding,
+											false);
 	else
 		ptr = string;
 
@@ -3397,7 +3474,7 @@ CopyAttributeOutCSV(CopyState cstate, char *string,
 					break;
 				}
 				if (IS_HIGHBIT_SET(c) && cstate->encoding_embeds_ascii)
-					tptr += pg_encoding_mblen(cstate->client_encoding, tptr);
+					tptr += pg_encoding_mblen(cstate->target_encoding, tptr);
 				else
 					tptr++;
 			}
@@ -3421,7 +3498,7 @@ CopyAttributeOutCSV(CopyState cstate, char *string,
 				start = ptr;	/* we include char in next run */
 			}
 			if (IS_HIGHBIT_SET(c) && cstate->encoding_embeds_ascii)
-				ptr += pg_encoding_mblen(cstate->client_encoding, ptr);
+				ptr += pg_encoding_mblen(cstate->target_encoding, ptr);
 			else
 				ptr++;
 		}
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 456db5c..1f698d1 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2217,6 +2217,10 @@ copy_opt_item:
 				{
 					$$ = makeDefElem("force_not_null", (Node *)$4);
 				}
+			| ENCODING ColId_or_Sconst
+				{
+					$$ = makeDefElem("encoding", (Node *)makeString($2));
+				}
 		;
 
 /* The following exist for backward compatibility with very old versions */
diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index a041812..b38b7a0 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -71,8 +71,11 @@ static int	pending_client_encoding = PG_SQL_ASCII;
 
 
 /* Internal functions */
-static char *perform_default_encoding_conversion(const char *src,
-									int len, bool is_client_to_server);
+static bool need_conversion(const char *src, int len,
+							int src_encoding, int dest_encoding, bool check_sanity);
+static char *perform_cached_encoding_conversion(const char *src,
+												int len, FmgrInfo *flinfo,
+												int src_encoding, int dest_encoding);
 static int	cliplen(const char *str, int len, int limit);
 
 
@@ -500,50 +503,12 @@ pg_client_to_server(const char *s, int len)
 	Assert(DatabaseEncoding);
 	Assert(ClientEncoding);
 
-	if (len <= 0)
+	if (!need_conversion(s, len, ClientEncoding->encoding,
+						 DatabaseEncoding->encoding, true))
 		return (char *) s;
 
-	if (ClientEncoding->encoding == DatabaseEncoding->encoding ||
-		ClientEncoding->encoding == PG_SQL_ASCII)
-	{
-		/*
-		 * No conversion is needed, but we must still validate the data.
-		 */
-		(void) pg_verify_mbstr(DatabaseEncoding->encoding, s, len, false);
-		return (char *) s;
-	}
-
-	if (DatabaseEncoding->encoding == PG_SQL_ASCII)
-	{
-		/*
-		 * No conversion is possible, but we must still validate the data,
-		 * because the client-side code might have done string escaping using
-		 * the selected client_encoding.  If the client encoding is ASCII-safe
-		 * then we just do a straight validation under that encoding.  For an
-		 * ASCII-unsafe encoding we have a problem: we dare not pass such data
-		 * to the parser but we have no way to convert it.	We compromise by
-		 * rejecting the data if it contains any non-ASCII characters.
-		 */
-		if (PG_VALID_BE_ENCODING(ClientEncoding->encoding))
-			(void) pg_verify_mbstr(ClientEncoding->encoding, s, len, false);
-		else
-		{
-			int			i;
-
-			for (i = 0; i < len; i++)
-			{
-				if (s[i] == '\0' || IS_HIGHBIT_SET(s[i]))
-					ereport(ERROR,
-							(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
-					 errmsg("invalid byte value for encoding \"%s\": 0x%02x",
-							pg_enc2name_tbl[PG_SQL_ASCII].name,
-							(unsigned char) s[i])));
-			}
-		}
-		return (char *) s;
-	}
-
-	return perform_default_encoding_conversion(s, len, true);
+	return perform_cached_encoding_conversion(s, len, ToServerConvProc,
+				ClientEncoding->encoding, DatabaseEncoding->encoding);
 }
 
 /*
@@ -555,43 +520,101 @@ pg_server_to_client(const char *s, int len)
 	Assert(DatabaseEncoding);
 	Assert(ClientEncoding);
 
+	if (!need_conversion(s, len, DatabaseEncoding->encoding,
+						 ClientEncoding->encoding, false))
+		return (char *) s;
+	return perform_cached_encoding_conversion(s, len, ToClientConvProc,
+				DatabaseEncoding->encoding, ClientEncoding->encoding);
+}
+
+/*
+ * given conversion Fmgr, do encoding conversion with sanity check if needed.
+ */
+char *
+pg_cached_encoding_conversion(const char *s, int len, FmgrInfo *convproc,
+					int src_encoding, int dest_encoding, bool check_sanity)
+{
 	if (len <= 0)
 		return (char *) s;
 
-	if (ClientEncoding->encoding == DatabaseEncoding->encoding ||
-		ClientEncoding->encoding == PG_SQL_ASCII ||
-		DatabaseEncoding->encoding == PG_SQL_ASCII)
-		return (char *) s;		/* assume data is valid */
+	if (!need_conversion(s, len, src_encoding, dest_encoding, check_sanity))
+		return (char *) s;
 
-	return perform_default_encoding_conversion(s, len, false);
+	return perform_cached_encoding_conversion(s, len, convproc,
+											  src_encoding, dest_encoding);
 }
 
 /*
+ * returns true if conversion is needed. check_sanity = true means
+ * the source string should be verified even though the conversion happened.
+ */
+static bool
+need_conversion(const char *s, int len, int src_encoding, int dest_encoding,
+				bool check_sanity)
+{
+	if (check_sanity)
+	{
+		if (src_encoding == dest_encoding || src_encoding == PG_SQL_ASCII)
+		{
+			/*
+			 * No conversion is needed, but we must still validate the data.
+			 */
+			(void) pg_verify_mbstr(src_encoding, s, len, false);
+			return false;
+		}
+
+		if (dest_encoding == PG_SQL_ASCII)
+		{
+			/*
+			 * No conversion is possible, but we must still validate the
+			 * data, because the client-side code might have done string
+			 * escaping using the selected client_encoding.  If the client
+			 * encoding is ASCII-safe then we just do a straight validation
+			 * under that encoding.  For an ASCII-unsafe encoding we have a
+			 * problem: we dare not pass such data to the parser but we have
+			 * no way to convert it.	We compromise by rejecting the data
+			 * if it contains any non-ASCII characters.
+			 */
+			if (PG_VALID_BE_ENCODING(src_encoding))
+				(void) pg_verify_mbstr(src_encoding, s, len, false);
+			else
+			{
+				int		i;
+
+				for (i = 0; i < len; i++)
+				{
+					if (s[i] == '\0' || IS_HIGHBIT_SET(s[i]))
+						ereport(ERROR,
+								(errcode(ERRCODE_CHARACTER_NOT_IN_REPERTOIRE),
+						 errmsg("invalid byte value for encoding \"%s\": 0x%02x",
+								pg_enc2name_tbl[PG_SQL_ASCII].name,
+								(unsigned char) s[i])));
+				}
+			}
+			return false;
+		}
+	}
+	else
+	{
+		if (src_encoding == dest_encoding ||
+			src_encoding == PG_SQL_ASCII ||
+			dest_encoding == PG_SQL_ASCII)
+			return false;
+	}
+
+	return true;
+}
+/*
  *	Perform default encoding conversion using cached FmgrInfo. Since
  *	this function does not access database at all, it is safe to call
  *	outside transactions.  If the conversion has not been set up by
  *	SetClientEncoding(), no conversion is performed.
  */
 static char *
-perform_default_encoding_conversion(const char *src, int len, bool is_client_to_server)
+perform_cached_encoding_conversion(const char *src, int len,
+					FmgrInfo *flinfo, int src_encoding, int dest_encoding)
 {
 	char	   *result;
-	int			src_encoding,
-				dest_encoding;
-	FmgrInfo   *flinfo;
-
-	if (is_client_to_server)
-	{
-		src_encoding = ClientEncoding->encoding;
-		dest_encoding = DatabaseEncoding->encoding;
-		flinfo = ToServerConvProc;
-	}
-	else
-	{
-		src_encoding = DatabaseEncoding->encoding;
-		dest_encoding = ClientEncoding->encoding;
-		flinfo = ToClientConvProc;
-	}
 
 	if (flinfo == NULL)
 		return (char *) src;
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index f110723..576773e 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -21,6 +21,10 @@
 
 #include <sys/types.h>
 
+#ifndef FRONTEND
+#include <fmgr.h>
+#endif
+
 /*
  * The pg_wchar type
  */
@@ -421,6 +425,12 @@ extern unsigned char *pg_do_encoding_conversion(unsigned char *src, int len,
 extern char *pg_client_to_server(const char *s, int len);
 extern char *pg_server_to_client(const char *s, int len);
 
+#ifndef FRONTEND
+extern char *pg_cached_encoding_conversion(const char *s, int len,
+						FmgrInfo *convproc, int src_encoding,
+						int dest_encoding, bool check_sanity);
+#endif
+
 extern unsigned short BIG5toCNS(unsigned short big5, unsigned char *lc);
 extern unsigned short CNStoBIG5(unsigned short cns, unsigned char lc);
 
diff --git a/src/test/regress/expected/copy2.out b/src/test/regress/expected/copy2.out
index 15cbe02..88d7d16 100644
--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -46,10 +46,10 @@ CONTEXT:  COPY x, line 1: "2001	231	\N	\N"
 COPY x from stdin;
 ERROR:  extra data after last expected column
 CONTEXT:  COPY x, line 1: "2002	232	40	50	60	70	80"
--- various COPY options: delimiters, oids, NULL string
+-- various COPY options: delimiters, oids, NULL string, encoding
 COPY x (b, c, d, e) from stdin with oids delimiter ',' null 'x';
 COPY x from stdin WITH DELIMITER AS ';' NULL AS '';
-COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X';
+COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X' ENCODING sql_ascii;
 -- check results of copy in
 SELECT * FROM x;
    a   | b  |     c      |   d    |          e           
@@ -187,7 +187,7 @@ COPY y TO stdout WITH CSV QUOTE '''' DELIMITER '|';
 Jackson, Sam|\h
 It is "perfect".|	
 ''|
-COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\';
+COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\' ENCODING sql_ascii;
 "Jackson, Sam","\\h"
 "It is \"perfect\".","	"
 "",
diff --git a/src/test/regress/sql/copy2.sql b/src/test/regress/sql/copy2.sql
index c2e8b03..d2683d1 100644
--- a/src/test/regress/sql/copy2.sql
+++ b/src/test/regress/sql/copy2.sql
@@ -72,7 +72,7 @@ COPY x from stdin;
 2002	232	40	50	60	70	80
 \.
 
--- various COPY options: delimiters, oids, NULL string
+-- various COPY options: delimiters, oids, NULL string, encoding
 COPY x (b, c, d, e) from stdin with oids delimiter ',' null 'x';
 500000,x,45,80,90
 500001,x,\x,\\x,\\\x
@@ -83,7 +83,7 @@ COPY x from stdin WITH DELIMITER AS ';' NULL AS '';
 3000;;c;;
 \.
 
-COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X';
+COPY x from stdin WITH DELIMITER AS ':' NULL AS E'\\X' ENCODING sql_ascii;
 4000:\X:C:\X:\X
 4001:1:empty::
 4002:2:null:\X:\X
@@ -127,7 +127,7 @@ INSERT INTO y VALUES ('', NULL);
 
 COPY y TO stdout WITH CSV;
 COPY y TO stdout WITH CSV QUOTE '''' DELIMITER '|';
-COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\';
+COPY y TO stdout WITH CSV FORCE QUOTE col2 ESCAPE E'\\' ENCODING sql_ascii;
 COPY y TO stdout WITH CSV FORCE QUOTE *;
 
 -- Repeat above tests with new 9.0 option syntax

#10

Itagaki Takahiro

itagaki.takahiro@gmail.com

almost 15 years ago

In reply to: Hitoshi Harada (#9)

Re: Add ENCODING option to COPY

On Tue, Feb 1, 2011 at 13:08, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

The third patch is attached, modifying mb routines so that they can
receive conversion procedures as FmgrInof * and save the function
pointer in CopyState.
I tested it with encoding option and could not see performance slowdown.

Hmm, sorry, the patch was wrong. Correct version is attached.

Here is a brief review for the patch.

* Syntax: ENCODING encoding vs. ENCODING 'encoding'
We don't have to quote encoding names in the patch, but we always need
quotes for CREATE DATABASE WITH ENCODING. I think we should modify
CREATE DATABASE to accept unquoted encoding names aside from the patch.

Changes in pg_wchar.h are the most debatable parts of the patch.
The patch adds pg_cached_encoding_conversion(). We normally use
pg_do_encoding_conversion(), but it is slower than the proposed
function because it lookups conversion proc from catalog every call.

* Can we use #ifndef FRONTEND in the header?
Usage of fmgr.h members will broke client applications without the #ifdef,
but I guess client apps don't always have definitions of FRONTEND.
If we don't want to change pg_wchar.h, pg_conversion_fn.h might be
a better header for the new API because FindDefaultConversion() is in it.

Changes in copy.c looks good except a few trivial cosmetic issues:

* encoding_option could be a local variable instead of cstate's member.
* cstate->client_encoding is renamed to target_encoding,
but I prefer file_encoding or remote_encoding.
* CopyState can have conv_proc entity as a member instead of the pointer.
* need_transcoding checks could be replaced with conv_proc IS NULL check.

--
Itagaki Takahiro

#11

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Itagaki Takahiro (#10)

Re: Add ENCODING option to COPY

2011/2/4 Itagaki Takahiro <itagaki.takahiro@gmail.com>:

On Tue, Feb 1, 2011 at 13:08, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

The third patch is attached, modifying mb routines so that they can
receive conversion procedures as FmgrInof * and save the function
pointer in CopyState.
I tested it with encoding option and could not see performance slowdown.

Hmm, sorry, the patch was wrong. Correct version is attached.

Here is a brief review for the patch.

Thanks for the review.

* Syntax: ENCODING encoding vs. ENCODING 'encoding'
We don't have to quote encoding names in the patch, but we always need
quotes for CREATE DATABASE WITH ENCODING. I think we should modify
CREATE DATABASE to accept unquoted encoding names aside from the patch.

I followed the syntax in SET client_encoding TO xxx, where you don't
need quote xxx. I didn't care CREATE DATABASE.

Changes in pg_wchar.h are the most debatable parts of the patch.
The patch adds pg_cached_encoding_conversion(). We normally use
pg_do_encoding_conversion(), but it is slower than the proposed
function because it lookups conversion proc from catalog every call.

* Can we use #ifndef FRONTEND in the header?
Usage of fmgr.h members will broke client applications without the #ifdef,
but I guess client apps don't always have definitions of FRONTEND.
If we don't want to change pg_wchar.h, pg_conversion_fn.h might be
a better header for the new API because FindDefaultConversion() is in it.

It doesn't look to me like clear solution. FindDefaultConversion() is
implemented in pg_conversion.c, whereas
pg_cached_encoding_conversion() in mbutils.c. Maybe adding another
header, namely pg_wchar_backend.h?

* CopyState can have conv_proc entity as a member instead of the pointer.
* need_transcoding checks could be replaced with conv_proc IS NULL check.

No, need_transcoding means you need verification even if the target
and server encoding is the same. See comments in CopyTo().

If pg_wchar_backend .h is acceptable, I'll post the revised patch.

Regards,

--
Hitoshi Harada

#12

Tom Lane

tgl@sss.pgh.pa.us

almost 15 years ago

In reply to: Itagaki Takahiro (#10)

Re: Add ENCODING option to COPY

Itagaki Takahiro <itagaki.takahiro@gmail.com> writes:

* Syntax: ENCODING encoding vs. ENCODING 'encoding'
We don't have to quote encoding names in the patch, but we always need
quotes for CREATE DATABASE WITH ENCODING. I think we should modify
CREATE DATABASE to accept unquoted encoding names aside from the patch.

The reason that we use quotes in CREATE DATABASE is that encoding names
aren't assumed to be valid SQL identifiers. If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

Changes in pg_wchar.h are the most debatable parts of the patch.
The patch adds pg_cached_encoding_conversion(). We normally use
pg_do_encoding_conversion(), but it is slower than the proposed
function because it lookups conversion proc from catalog every call.

Why should callers need to be changed for that? Just make the existing
function use caching.

* Can we use #ifndef FRONTEND in the header?
Usage of fmgr.h members will broke client applications without the #ifdef,
but I guess client apps don't always have definitions of FRONTEND.
If we don't want to change pg_wchar.h, pg_conversion_fn.h might be
a better header for the new API because FindDefaultConversion() is in it.

Yeah, putting backend-only stuff into that header is a nonstarter.

regards, tom lane

#13

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Tom Lane (#12)

Re: Add ENCODING option to COPY

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Itagaki Takahiro <itagaki.takahiro@gmail.com> writes:

* Syntax: ENCODING encoding vs. ENCODING 'encoding'
We don't have to quote encoding names in the patch, but we always need
quotes for CREATE DATABASE WITH ENCODING. I think we should modify
CREATE DATABASE to accept unquoted encoding names aside from the patch.

The reason that we use quotes in CREATE DATABASE is that encoding names
aren't assumed to be valid SQL identifiers. If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

What about SET client_encoding TO encoding?

Changes in pg_wchar.h are the most debatable parts of the patch.
The patch adds pg_cached_encoding_conversion(). We normally use
pg_do_encoding_conversion(), but it is slower than the proposed
function because it lookups conversion proc from catalog every call.

Why should callers need to be changed for that? Just make the existing
function use caching.

Because the demanded behavior is similar to
pg_client_to_server/server_to_client, which caches functions as
specialized client encoding and database encoding as the file local
variables. We didn't have such spaces to cache functions for other
encoding conversions, so decided to cache them in CopyState and pass
them to the mb routine.

Regards,

--
Hitoshi Harada

#14

Tom Lane

tgl@sss.pgh.pa.us

almost 15 years ago

In reply to: Hitoshi Harada (#13)

Re: Add ENCODING option to COPY

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

The reason that we use quotes in CREATE DATABASE is that encoding names
aren't assumed to be valid SQL identifiers. �If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

What about SET client_encoding TO encoding?

SET is in its own little world --- it will interchangeably take names
with or without quotes. It is not a precedent to follow elsewhere.

regards, tom lane

#15

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Tom Lane (#14)

Re: Add ENCODING option to COPY

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

The reason that we use quotes in CREATE DATABASE is that encoding names
aren't assumed to be valid SQL identifiers. If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

What about SET client_encoding TO encoding?

SET is in its own little world --- it will interchangeably take names
with or without quotes. It is not a precedent to follow elsewhere.

I see. I'll update my patch, after the mb change discussion gets done.

* Can we use #ifndef FRONTEND in the header?
Usage of fmgr.h members will broke client applications without the #ifdef,
but I guess client apps don't always have definitions of FRONTEND.
If we don't want to change pg_wchar.h, pg_conversion_fn.h might be
a better header for the new API because FindDefaultConversion() is in it.

Yeah, putting backend-only stuff into that header is a nonstarter.

Do you mean you think it' all right to define
pg_cached_encoding_conversion() in pg_conversion_fn.h?

Regards,

--
Hitoshi Harada

#16

Tom Lane

tgl@sss.pgh.pa.us

almost 15 years ago

In reply to: Hitoshi Harada (#15)

Re: Add ENCODING option to COPY

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Yeah, putting backend-only stuff into that header is a nonstarter.

Do you mean you think it' all right to define
pg_cached_encoding_conversion() in pg_conversion_fn.h?

That seems pretty random too. I still think you've designed this API
badly and it'd be better to avoid exposing the FmgrInfo to callers
by letting the function manage the cache internally.

regards, tom lane

#17

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Tom Lane (#16)

Re: Add ENCODING option to COPY

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Yeah, putting backend-only stuff into that header is a nonstarter.

Do you mean you think it' all right to define
pg_cached_encoding_conversion() in pg_conversion_fn.h?

That seems pretty random too. I still think you've designed this API
badly and it'd be better to avoid exposing the FmgrInfo to callers
by letting the function manage the cache internally.

I'll try it in a few days, but only making struct that holds FmgrInfo
in a file local like tuplestorestate comes up with so far....

Regards,

--
Hitoshi Harada

#18

Robert Haas

robertmhaas@gmail.com

almost 15 years ago

In reply to: Hitoshi Harada (#17)

Re: Add ENCODING option to COPY

On Fri, Feb 4, 2011 at 1:54 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Yeah, putting backend-only stuff into that header is a nonstarter.

Do you mean you think it' all right to define
pg_cached_encoding_conversion() in pg_conversion_fn.h?

That seems pretty random too. I still think you've designed this API
badly and it'd be better to avoid exposing the FmgrInfo to callers
by letting the function manage the cache internally.

I'll try it in a few days, but only making struct that holds FmgrInfo
in a file local like tuplestorestate comes up with so far....

We've been through several iterations of this patch now and haven't
come up with something committable. I think it's time to mark this
one Returned with Feedback and revisit this for 9.2.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Robert Haas (#18)

Re: Add ENCODING option to COPY

2011/2/8 Robert Haas <robertmhaas@gmail.com>:

On Fri, Feb 4, 2011 at 1:54 PM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Hitoshi Harada <umi.tanuki@gmail.com> writes:

2011/2/5 Tom Lane <tgl@sss.pgh.pa.us>:

Yeah, putting backend-only stuff into that header is a nonstarter.

Do you mean you think it' all right to define
pg_cached_encoding_conversion() in pg_conversion_fn.h?

That seems pretty random too. I still think you've designed this API
badly and it'd be better to avoid exposing the FmgrInfo to callers
by letting the function manage the cache internally.

I'll try it in a few days, but only making struct that holds FmgrInfo
in a file local like tuplestorestate comes up with so far....

We've been through several iterations of this patch now and haven't
come up with something committable. I think it's time to mark this
one Returned with Feedback and revisit this for 9.2.

As I've been thinking these days. The design isn't fixed yet even now.
Thanks for all the reviews.

Regards,

--
Hitoshi Harada

#20

Peter Eisentraut

peter_e@gmx.net

almost 15 years ago

In reply to: Tom Lane (#12)

Re: Add ENCODING option to COPY

On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote:

The reason that we use quotes in CREATE DATABASE is that encoding
names aren't assumed to be valid SQL identifiers. If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

Since encoding names are built-in and therefore well known, and the
names have been aligned with the SQL standard names, which are
identifiers, I don't think this argument is valid (anymore).

It probably shouldn't be changed inconsistently as part of an unrelated
patch, but I think the idea has merit.

#21

Tom Lane

tgl@sss.pgh.pa.us

almost 15 years ago

In reply to: Peter Eisentraut (#20)

Re: Add ENCODING option to COPY

Peter Eisentraut <peter_e@gmx.net> writes:

On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote:

The reason that we use quotes in CREATE DATABASE is that encoding
names aren't assumed to be valid SQL identifiers. If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

Since encoding names are built-in and therefore well known, and the
names have been aligned with the SQL standard names, which are
identifiers, I don't think this argument is valid (anymore).

What about "UTF-8"?

regards, tom lane

#22

Hitoshi Harada

umi.tanuki@gmail.com

almost 15 years ago

In reply to: Tom Lane (#21)

Re: Add ENCODING option to COPY

2011/2/9 Tom Lane <tgl@sss.pgh.pa.us>:

Peter Eisentraut <peter_e@gmx.net> writes:

On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote:

The reason that we use quotes in CREATE DATABASE is that encoding
names aren't assumed to be valid SQL identifiers. If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

Since encoding names are built-in and therefore well known, and the
names have been aligned with the SQL standard names, which are
identifiers, I don't think this argument is valid (anymore).

What about "UTF-8"?

Then, quote it?

db1=# set client_encoding to utf-8;
ERROR: syntax error at or near "-"

Regards,

--
Hitoshi Harada

#23

Peter Eisentraut

peter_e@gmx.net

almost 15 years ago

In reply to: Tom Lane (#21)

Re: Add ENCODING option to COPY

On tis, 2011-02-08 at 10:53 -0500, Tom Lane wrote:

Peter Eisentraut <peter_e@gmx.net> writes:

On fre, 2011-02-04 at 10:47 -0500, Tom Lane wrote:

The reason that we use quotes in CREATE DATABASE is that encoding
names aren't assumed to be valid SQL identifiers. If this patch isn't
following the CREATE DATABASE precedent, it's the patch that's wrong,
not CREATE DATABASE.

Since encoding names are built-in and therefore well known, and the
names have been aligned with the SQL standard names, which are
identifiers, I don't think this argument is valid (anymore).

What about "UTF-8"?

The canonical name of that is UTF8. But you can quote it if you want to
spell it differently.