confusing / inefficient "need_transcoding" handling in copy

Started by Andres Freund · about 2 years ago · 21 messages
#1Andres Freund
andres@anarazel.de

Hi,

Looking at the profiles in [1], and similar profiles locally, made me wonder
why a basic COPY TO shows pg_server_to_any() and the strlen() to compute the
length of the to-be-converted string so heavily in profiles. Example
profile, for the query in [2]:

-   88.11%    12.02%  postgres      postgres          [.] CopyOneRowTo
   - 76.09% CopyOneRowTo
      - 37.24% CopyAttributeOutText
         + 14.25% __strlen_evex
         + 2.76% pg_server_to_any
         + 0.03% 0xffffffff82a00c86
      + 31.82% OutputFunctionCall
      + 2.98% CopySendEndOfRow
      + 2.75% appendBinaryStringInfo
      + 0.58% MemoryContextReset
      + 0.02% 0xffffffff82a00c86
   + 12.01% standard_ExecutorRun
   + 0.02% PostgresMain

In the basic cases the client and server encoding should be the same after
all, so why do we need to do any conversion?

The code has a comment about this:

/*
* Set up encoding conversion info. Even if the file and server encodings
* are the same, we must apply pg_any_to_server() to validate data in
* multibyte encodings.
*/
cstate->need_transcoding =
(cstate->file_encoding != GetDatabaseEncoding() ||
pg_database_encoding_max_length() > 1);

I don't really understand why we need to validate anything during COPY TO?
Which is good, because it turns out that we don't actually validate anything,
as pg_server_to_any() returns without doing anything if the encoding matches:

if (encoding == DatabaseEncoding->encoding ||
encoding == PG_SQL_ASCII)
return unconstify(char *, s); /* assume data is valid */

This means that the strlen() we do in the call to pg_server_to_any(), which on
its own takes 14.25% of the cycles, computes something that will never be
used.
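
The effect can be modeled outside PostgreSQL. The following is a minimal standalone sketch (my own illustration, with made-up encoding constants standing in for the real pg_wchar.h values, not the actual copyto.c code): it shows the early-return that makes the precomputed length dead work, and the hoisted check that avoids calling strlen() at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the real encoding IDs */
enum { ENC_UTF8, ENC_SQL_ASCII, ENC_LATIN1 };

static int server_encoding = ENC_UTF8;

/* Mirrors the early-return in pg_server_to_any(): when the target
 * encoding matches the database encoding (or is SQL_ASCII), the input
 * is handed back untouched, so a length computed by the caller is
 * never used. */
static const char *server_to_any(const char *s, int len, int encoding)
{
    (void) len;                 /* unused on this fast path */
    if (encoding == server_encoding || encoding == ENC_SQL_ASCII)
        return s;               /* assume data is valid */
    /* real code would convert here */
    return s;
}

/* The fix under discussion: decide need_transcoding up front, so the
 * caller can skip the strlen() entirely when no conversion can occur. */
static bool need_transcoding(int file_encoding)
{
    return !(file_encoding == server_encoding ||
             file_encoding == ENC_SQL_ASCII);
}
```

With need_transcoding computed this way, CopyAttributeOutText can branch before ever measuring the string.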

Unsurprisingly, only doing transcoding when encodings differ yields a sizable
improvement, about 18% for the query in [2].

I haven't yet dug into the code history. One guess is that this should only
have been set this way for COPY FROM.

Greetings,

Andres Freund

[1]: /messages/by-id/ZcGE8LrjGW8pmtOf@paquier.xyz
[2]: COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 1000000::int4)) TO '/dev/null';

#2Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#1)
Re: confusing / inefficient "need_transcoding" handling in copy

On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote:

I don't really understand why we need to validate anything during COPY TO?
Which is good, because it turns out that we don't actually validate anything,
as pg_server_to_any() returns without doing anything if the encoding matches:

if (encoding == DatabaseEncoding->encoding ||
encoding == PG_SQL_ASCII)
return unconstify(char *, s); /* assume data is valid */

This means that the strlen() we do in the call to pg_server_to_any(), which on
its own takes 14.25% of the cycles, computes something that will never be
used.

Indeed, that's wasting cycles for nothing when the client and server
encoding match.

Unsurprisingly, only doing transcoding when encodings differ yields a sizable
improvement, about 18% for [2].

I haven't yet dug into the code history. One guess is that this should only
have been set this way for COPY FROM.

Looking at the git history, this looks like an oversight of c61a2f58418e
that has added the condition on pg_database_encoding_max_length(), no?
Adding Tom and Ishii-san, even if this comes from 2005.
--
Michael

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#2)
Re: confusing / inefficient "need_transcoding" handling in copy

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote:

I haven't yet dug into the code history. One guess is that this should only
have been set this way for COPY FROM.

Looking at the git history, this looks like an oversight of c61a2f58418e
that has added the condition on pg_database_encoding_max_length(), no?
Adding Tom and Ishii-san, even if this comes from 2005.

Yeah, back in 8.1 that code was applied for both directions, but
probably it should have enforced validation for same-encoding
cases only for COPY FROM.

It looks like now we have a mess, because the condition was copied
verbatim into copyto.c but not copyfrom.c. Aren't we failing to
validate encoding in this case in COPY FROM, which is where we
actually need to?

regards, tom lane

#4Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#3)
Re: confusing / inefficient "need_transcoding" handling in copy

Hi,

On 2024-02-06 12:51:48 -0500, Tom Lane wrote:

Michael Paquier <michael@paquier.xyz> writes:

On Mon, Feb 05, 2024 at 06:05:04PM -0800, Andres Freund wrote:

I haven't yet dug into the code history. One guess is that this should only
have been set this way for COPY FROM.

Looking at the git history, this looks like an oversight of c61a2f58418e
that has added the condition on pg_database_encoding_max_length(), no?
Adding Tom and Ishii-san, even if this comes from 2005.

Yeah, back in 8.1 that code was applied for both directions, but
probably it should have enforced validation for same-encoding
cases only for COPY FROM.

It looks like now we have a mess, because the condition was copied
verbatim into copyto.c but not copyfrom.c. Aren't we failing to
validate encoding in this case in COPY FROM, which is where we
actually need to?

I think the code is just very confusing - there actually *is* verification of
the encoding, it just happens at a different, earlier, layer, namely in
copyfromparse.c: CopyConvertBuf() which says:
/*
* If the file and server encoding are the same, no encoding conversion is
* required. However, we still need to verify that the input is valid for
* the encoding.
*/

And does indeed verify that.
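
As a rough standalone illustration of what that earlier layer does (my own minimal sketch, not the PostgreSQL implementation, and deliberately ignoring overlong-sequence and codepoint-range checks): when the file and server encodings match, no conversion happens, but every byte sequence is still checked for structural validity. For UTF-8 that looks roughly like:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified UTF-8 structural check: classify the lead byte, then
 * verify the expected number of continuation bytes follow. */
static bool utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i];
        size_t seqlen;

        if (c < 0x80)
            seqlen = 1;
        else if ((c & 0xE0) == 0xC0)
            seqlen = 2;
        else if ((c & 0xF0) == 0xE0)
            seqlen = 3;
        else if ((c & 0xF8) == 0xF0)
            seqlen = 4;
        else
            return false;       /* invalid lead byte */

        if (i + seqlen > len)
            return false;       /* truncated sequence at end of input */

        for (size_t j = 1; j < seqlen; j++)
            if ((s[i + j] & 0xC0) != 0x80)
                return false;   /* bad continuation byte */

        i += seqlen;
    }
    return true;
}
```

This is exactly the kind of per-buffer validation that is cheap to do once in CopyConvertBuf() but would be wasted if repeated per attribute in the output path.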

One unfortunate issue: We don't have any tests verifying that COPY FROM
catches encoding issues.

Regards,

Andres

#5Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#4)
Re: confusing / inefficient "need_transcoding" handling in copy

On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote:

I think the code is just very confusing - there actually *is* verification of
the encoding, it just happens at a different, earlier, layer, namely in
copyfromparse.c: CopyConvertBuf() which says:
/*
* If the file and server encoding are the same, no encoding conversion is
* required. However, we still need to verify that the input is valid for
* the encoding.
*/

And does indeed verify that.

This was changed by Heikki in f82de5c46bdf, in 2021, which removed
the pg_database_encoding_max_length() check in the COPY FROM case.
(Adding him now in CC, in case he has any comments).

One unfortunate issue: We don't have any tests verifying that COPY FROM
catches encoding issues.

Oops.

Anyway, I was looking at the copyto.c code because I need a resolution
on this thread before I can make progress on the pluggable COPY APIs.
Echoing what you mentioned upthread: all we need to do is set
need_transcoding only when the client and server encodings are not the
same, right? Am I missing something?

Runtime gets much better on average: around 1260ms on HEAD vs 1023ms
with the attached patch, for the upthread example on a single process.
Some profile data from CopyOneRowTo(), if relevant:
* HEAD:
-   82.78%    10.96%  postgres  postgres            [.] CopyOneRowTo
    - 71.82% CopyOneRowTo
       + 30.87% OutputFunctionCall
       - 13.21% CopyAttributeOutText
            pg_server_to_any
       - 9.48% appendBinaryStringInfo
            4.93% enlargeStringInfo
         3.33% 0xffffa4e1e234
       + 3.20% CopySendEndOfRow
         2.66% 0xffffa4e1e214
         1.02% pgstat_progress_update_param
         0.86% memcpy@plt
         0.74% 0xffffa4e1cba4
         0.72% MemoryContextReset
         0.72% 0xffffa4e1cba8
         0.59% enlargeStringInfo
         0.55% 0xffffa4e1cb40
         0.54% 0xffffa4e1cb74
         0.52% 0xffffa4e1cb8c
    + 10.96% _start
* patch:
-   80.82%    12.25%  postgres  postgres            [.] CopyOneRowTo
    - 68.57% CopyOneRowTo
       + 36.55% OutputFunctionCall
         11.44% CopyAttributeOutText
       + 8.87% appendBinaryStringInfo
       + 3.79% CopySendEndOfRow
         1.01% pgstat_progress_update_param
         0.79% int2out
         0.66% MemoryContextReset
         0.63% 0xffffaa624ba8
         0.60% memcpy@plt
         0.60% enlargeStringInfo
         0.53% 0xffffaa624ba4
    + 12.25% _start

That's a performance-only change, but there may be a good argument for
backpatching something, perhaps?
--
Michael

Attachments:

0001-Speedup-COPY-TO.patch (text/x-diff; charset=us-ascii, +8/-7)
#6Sutou Kouhei
kou@clear-code.com
In reply to: Andres Freund (#4)
Re: confusing / inefficient "need_transcoding" handling in copy

Hi,

In <20240206222445.hzq22pb2nye7rm67@awork3.anarazel.de>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800,
Andres Freund <andres@anarazel.de> wrote:

One unfortunate issue: We don't have any tests verifying that COPY FROM
catches encoding issues.

How about the attached patch for it?

How about the following to avoid needless transcoding?

----
diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c
index bd4710a79b..80ec26cafe 100644
--- a/src/backend/commands/copyto.c
+++ b/src/backend/commands/copyto.c
@@ -612,13 +612,14 @@ BeginCopyTo(ParseState *pstate,
 		cstate->file_encoding = cstate->opts.file_encoding;
 	/*
-	 * Set up encoding conversion info.  Even if the file and server encodings
-	 * are the same, we must apply pg_any_to_server() to validate data in
-	 * multibyte encodings.
+	 * Set up encoding conversion info. We use pg_server_to_any() for the
+	 * conversion. pg_server_to_any() does nothing with an encoding that
+	 * equals to GetDatabaseEncoding() and PG_SQL_ASCII. If
+	 * cstate->file_encoding equals to GetDatabaseEncoding() and PG_SQL_ASCII,
+	 * we don't need to transcode.
 	 */
-	cstate->need_transcoding =
-		(cstate->file_encoding != GetDatabaseEncoding() ||
-		 pg_database_encoding_max_length() > 1);
+	cstate->need_transcoding = !(cstate->file_encoding == GetDatabaseEncoding() ||
+								 cstate->file_encoding == PG_SQL_ASCII);
 	/* See Multibyte encoding comment above */
 	cstate->encoding_embeds_ascii = PG_ENCODING_IS_CLIENT_ONLY(cstate->file_encoding);

----

Numbers on my environment:

master: 861.646ms
patched: 697.547ms

SQL:
COPY (SELECT 1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2, generate_series(1, 1000000::int4)) TO '/dev/null' \watch c=5

Thanks,
--
kou

Attachments:

v1-0001-Add-a-test-for-invalid-encoding-for-COPY-FROM.patch (text/x-patch; charset=utf-8, +19/-2)
#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#5)
Re: confusing / inefficient "need_transcoding" handling in copy

On 08/02/2024 09:05, Michael Paquier wrote:

On Tue, Feb 06, 2024 at 02:24:45PM -0800, Andres Freund wrote:

I think the code is just very confusing - there actually *is* verification of
the encoding, it just happens at a different, earlier, layer, namely in
copyfromparse.c: CopyConvertBuf() which says:
/*
* If the file and server encoding are the same, no encoding conversion is
* required. However, we still need to verify that the input is valid for
* the encoding.
*/

And does indeed verify that.

This was changed by Heikki in f82de5c46bdf, in 2021, which removed
the pg_database_encoding_max_length() check in the COPY FROM case.
(Adding him now in CC, in case he has any comments).

Yeah, I agree the "pg_database_encoding_max_length() > 1" check in COPY
TO is unnecessary. It's always been like that, but now that the COPY TO
and COPY FROM cases don't share the code, it's more obvious. Let's
remove it.

On your patch:

+	 * Set up encoding conversion info, validating data if server and
+	 * client encodings are not the same (see also pg_server_to_any).

There's no validation, just conversion. I'd suggest:

"Set up encoding conversion info if the file and server encodings differ
(see also pg_server_to_any)."

Other than that, +1

That's a performance-only change, but there may be a good argument for
backpatching something, perhaps?

-1 for backpatching, out of principle. This is not a regression, it's
always been like that. Seems innocent, but why is this different from
any other performance improvement we make?

BTW, I can see an optimization opportunity even if the encodings differ:
Currently, CopyAttributeOutText() calls pg_server_to_any(), and then
grovels through the string to find any characters that need to be
quoted. You could do it the other way round and handle quoting before
the conversion. That has two benefits:

1. You don't need the strlen() call, because you just scanned through
the string so you already know its length.
2. You don't need to worry about 'encoding_embeds_ascii' when you
operate on the server encoding.
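
Heikki's reordering can be sketched in isolation (a minimal standalone illustration with a hypothetical helper name, not PostgreSQL source): escape the attribute while it is still in the server encoding, accumulating the output length during the scan, so the subsequent conversion step needs no separate strlen().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Escape a few text-format COPY specials (backslash, tab, newline)
 * into 'out', which must hold at least 2*len bytes. Returns the
 * escaped length, which the caller can pass straight to the encoding
 * conversion routine instead of calling strlen(). */
static size_t escape_copy_text(const char *in, size_t len, char *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
    {
        switch (in[i])
        {
            case '\\': out[n++] = '\\'; out[n++] = '\\'; break;
            case '\t': out[n++] = '\\'; out[n++] = 't';  break;
            case '\n': out[n++] = '\\'; out[n++] = 'n';  break;
            default:   out[n++] = in[i];                 break;
        }
    }
    return n;   /* known length: no strlen() before conversion */
}
```

Because the scan runs over the server encoding, it never has to ask whether a client-only encoding embeds ASCII-looking bytes inside multibyte characters; that concern only applies after conversion.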

--
Heikki Linnakangas
Neon (https://neon.tech)

#8Sutou Kouhei
kou@clear-code.com
In reply to: Sutou Kouhei (#6)
Re: confusing / inefficient "need_transcoding" handling in copy

Hi,

In <20240208.172501.2177371292839763981.kou@clear-code.com>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Thu, 08 Feb 2024 17:25:01 +0900 (JST),
Sutou Kouhei <kou@clear-code.com> wrote:

How about the following to avoid needless transcoding?

Oh, sorry. I missed Michael's patch:
/messages/by-id/ZcR9Q9hJ8GedFSCd@paquier.xyz

I withdraw my change.

Thanks,
--
kou

#9Michael Paquier
michael@paquier.xyz
In reply to: Sutou Kouhei (#8)
Re: confusing / inefficient "need_transcoding" handling in copy

On Thu, Feb 08, 2024 at 05:29:46PM +0900, Sutou Kouhei wrote:

Oh, sorry. I missed Michael's patch:
/messages/by-id/ZcR9Q9hJ8GedFSCd@paquier.xyz

I withdraw my change.

No problem. Thanks for caring about that.
--
Michael

#10Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#7)
Re: confusing / inefficient "need_transcoding" handling in copy

On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote:

There's no validation, just conversion. I'd suggest:

"Set up encoding conversion info if the file and server encodings differ
(see also pg_server_to_any)."

Other than that, +1

Cool. I've used your wording and applied that on HEAD.

BTW, I can see an optimization opportunity even if the encodings differ:
Currently, CopyAttributeOutText() calls pg_server_to_any(), and then grovels
through the string to find any characters that need to be quoted. You could
do it the other way round and handle quoting before the conversion. That has
two benefits:

1. You don't need the strlen() call, because you just scanned through the
string so you already know its length.
2. You don't need to worry about 'encoding_embeds_ascii' when you operate on
the server encoding.

That sounds right, though it looks like there would still be cases
where you'd need the strlen() call if !encoding_embeds_ascii.
--
Michael

#11Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#10)
Re: confusing / inefficient "need_transcoding" handling in copy

Hi,

On 2024-02-09 09:36:28 +0900, Michael Paquier wrote:

On Thu, Feb 08, 2024 at 10:25:07AM +0200, Heikki Linnakangas wrote:

There's no validation, just conversion. I'd suggest:

"Set up encoding conversion info if the file and server encodings differ
(see also pg_server_to_any)."

Other than that, +1

Cool. I've used your wording and applied that on HEAD.

Thanks. LGTM.

Greetings,

Andres Freund

#12Michael Paquier
michael@paquier.xyz
In reply to: Sutou Kouhei (#6)
Re: confusing / inefficient "need_transcoding" handling in copy

On Thu, Feb 08, 2024 at 05:25:01PM +0900, Sutou Kouhei wrote:

In <20240206222445.hzq22pb2nye7rm67@awork3.anarazel.de>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 6 Feb 2024 14:24:45 -0800,
Andres Freund <andres@anarazel.de> wrote:

One unfortunate issue: We don't have any tests verifying that COPY FROM
catches encoding issues.

How about the attached patch for it?

+CREATE TABLE test (t text);
+COPY test FROM stdin WITH (ENCODING 'EUC_JP');
+こんにちは
+\.
+
+DROP TABLE test;

We have a couple of non-ASCII characters in the tests, but I suspect
that this one will not be digested correctly everywhere, even if
EUC_JP should be OK to use for the check. How about writing an
arbitrary sequence of bytes into a temporary file that gets used for
the COPY FROM instead? See for example how we do that with
abs_builddir in copy.sql.
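
The approach Michael describes can be sketched as follows (my own standalone illustration; the helper name and file path are hypothetical, and the real test would drive this from copy.sql via abs_builddir): write a byte sequence that is invalid in the target encoding into a data file, so the test does not depend on how the SQL script itself is encoded or transcoded.

```c
#include <assert.h>
#include <stdio.h>

/* Write bytes that are invalid in EUC_JP (0xFF cannot start a valid
 * EUC_JP sequence) into 'path', for use as COPY FROM input. Returns 0
 * on success, -1 on failure. */
static int write_invalid_euc_jp(const char *path)
{
    const unsigned char bytes[] = { 0xFF, 0xFF, '\n' };
    FILE       *f = fopen(path, "wb");
    size_t      written;

    if (f == NULL)
        return -1;
    written = fwrite(bytes, 1, sizeof(bytes), f);
    fclose(f);
    return (written == sizeof(bytes)) ? 0 : -1;
}
```

COPY FROM that file WITH (ENCODING 'EUC_JP') should then fail with an "invalid byte sequence" error regardless of the platform's default encoding.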
--
Michael

#13Sutou Kouhei
kou@clear-code.com
In reply to: Michael Paquier (#12)
Re: confusing / inefficient "need_transcoding" handling in copy

Hi,

In <ZcvlgMEjt3qY8eiL@paquier.xyz>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900,
Michael Paquier <michael@paquier.xyz> wrote:

We have a couple of non-ASCII characters in the tests, but I suspect
that this one will not be digested correctly everywhere, even if
EUC_JP should be OK to use for the check. How about writing an
arbitrary sequence of bytes into a temporary file that gets used for
the COPY FROM instead? See for example how we do that with
abs_builddir in copy.sql.

It makes sense. How about the attached patch?

Thanks,
--
kou

Attachments:

v2-0001-Add-a-test-for-invalid-encoding-for-COPY-FROM.patch (text/x-patch; charset=us-ascii, +29/-2)
#14Sutou Kouhei
kou@clear-code.com
In reply to: Sutou Kouhei (#13)
Re: confusing / inefficient "need_transcoding" handling in copy

Hi Michael,

ping.

(Do you think that this patch is still needed?)

Thanks,
--
kou

In <20240214.114608.2091541942684063981.kou@clear-code.com>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 11:46:08 +0900 (JST),
Sutou Kouhei <kou@clear-code.com> wrote:


Hi,

In <ZcvlgMEjt3qY8eiL@paquier.xyz>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Wed, 14 Feb 2024 06:56:16 +0900,
Michael Paquier <michael@paquier.xyz> wrote:

We have a couple of non-ASCII characters in the tests, but I suspect
that this one will not be digested correctly everywhere, even if
EUC_JP should be OK to use for the check. How about writing an
arbitrary sequence of bytes into a temporary file that gets used for
the COPY FROM instead? See for example how we do that with
abs_builddir in copy.sql.

It makes sense. How about the attached patch?

Thanks,
--
kou

#15Michael Paquier
michael@paquier.xyz
In reply to: Sutou Kouhei (#14)
Re: confusing / inefficient "need_transcoding" handling in copy

On Fri, Dec 06, 2024 at 04:20:42PM +0900, Sutou Kouhei wrote:

(Do you think that this patch is still needed?)

This thread has fallen off my radar, my apologies about that.

Yes, I think expanding these tests is a good thing. Let's
take it one step at a time. I have a couple of comments.

+-- U+3042 HIRAGANA LETTER A
+COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+COPY test FROM :'utf8_csv' WITH (FORMAT csv, ENCODING 'EUC_JP');
+ERROR:  invalid byte sequence for encoding "EUC_JP": 0xe3 0x81
+CONTEXT:  COPY test, line 1
+DROP TABLE test;

COPY uses client_encoding when the ENCODING option is not specified.
Perhaps more tests should be added with this value set via
SET client_encoding?

Another thing would be valid conversions back and forth. For example,
I recall that LATIN1 accepts any bytes and can be converted to
UTF-8, so we could use it to expand the proposed tests a bit more?
Or something like that?

This is not going to be portable across the buildfarm.  Two reasons
are spotted by the CI (there may be others):
1) For Windows, as in the following regression.diffs:
 COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+ERROR:  character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252"
2) Second failure on Linux, with 32-bit builds:
 COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported

Likely, this should be made conditional, based on the fact that the
database needs to be able to support utf8? There are a couple of
examples like that in the tree, based on the following SQL trick:
SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
\if :skip_test
\quit
\endif

This requires an alternate output for the non-utf8 case.
--
Michael

#16Sutou Kouhei
kou@clear-code.com
In reply to: Michael Paquier (#15)
Re: confusing / inefficient "need_transcoding" handling in copy

Hi,

In <Z1fKrTkT-eIVAK7F@paquier.xyz>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 10 Dec 2024 13:59:25 +0900,
Michael Paquier <michael@paquier.xyz> wrote:

client_encoding would be used by COPY when not specifying ENCODING
option. Perhaps more tests should be added with this value specified
by a SET client_encoding?

It makes sense. I missed the case. I've added the case to
the v3 patch.

Another one would be valid conversions back and forth. For example,
I recall that LATIN1 accepts any bytes and can apply a conversion to
UTF-8, so we could use it and expand a bit more the proposed tests?
Or something like that?

OK. I've added valid cases too by using LATIN1 as you
suggested.

This is not going to be portable across the buildfarm.  Two reasons
are spotted by the CI (there may be others):
1) For Windows, as in the following regression.diffs:
COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+ERROR:  character with byte sequence 0xe3 0x81 0x82 in encoding "UTF8" has no equivalent in encoding "WIN1252"
2) Second failure on Linux, with 32-bit builds:
COPY (SELECT E'\u3042') TO :'utf8_csv' WITH (FORMAT csv, ENCODING 'UTF8');
+ERROR:  conversion between UTF8 and SQL_ASCII is not supported

Likely, this should be made conditional, based on the fact that the
database needs to be able to support utf8? There are a couple of
examples like that in the tree, based on the following SQL trick:
SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
\if :skip_test
\quit
\endif

Thanks. I didn't notice the portability problem. I've added
the skip trick.

This requires an alternate output for the non-utf8 case.

Oh! I didn't know the "XXX_1.out" feature.

Thanks,
--
kou

Attachments:

v3-0001-Add-tests-for-invalid-encoding-for-COPY-FROM.patch (text/x-patch; charset=us-ascii, +58/-2)
#17Michael Paquier
michael@paquier.xyz
In reply to: Sutou Kouhei (#16)
Re: confusing / inefficient "need_transcoding" handling in copy

On Thu, Dec 12, 2024 at 03:25:41PM +0900, Sutou Kouhei wrote:

In <Z1fKrTkT-eIVAK7F@paquier.xyz>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Tue, 10 Dec 2024 13:59:25 +0900,
Michael Paquier <michael@paquier.xyz> wrote:

Another one would be valid conversions back and forth. For example,
I recall that LATIN1 accepts any bytes and can apply a conversion to
UTF-8, so we could use it and expand a bit more the proposed tests?
Or something like that?

OK. I've added valid cases too by using LATIN1 as you
suggested.

I may have missed something but v3 does not use that for a valid
conversion?

Likely, this should be made conditional, based on the fact that the
database needs to be able to support utf8? There are a couple of
examples like that in the tree, based on the following SQL trick:
SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
\if :skip_test
\quit
\endif

Thanks. I didn't notice the portability problem. I've added
the skip trick.

Indeed, thanks.

Oh! I didn't know the "XXX_1.out" feature.

You have missed the inclusion of an alternate output, which should be
something like that to bypass the test rather than failing:
--- /dev/null
+++ b/src/test/regress/expected/copyencoding_1.out
@@ -0,0 +1,7 @@
+--
+-- Test cases for COPY encoding
+--
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit

I guess you have the file but forgot a `git add`.
--
Michael

#18Sutou Kouhei
kou@clear-code.com
In reply to: Michael Paquier (#17)
Re: confusing / inefficient "need_transcoding" handling in copy

Hi,

In <Z1ukEe2d7ml6-oaZ@paquier.xyz>
"Re: confusing / inefficient "need_transcoding" handling in copy" on Fri, 13 Dec 2024 12:03:45 +0900,
Michael Paquier <michael@paquier.xyz> wrote:

OK. I've added valid cases too by using LATIN1 as you
suggested.

I may have missed something but v3 does not use that for a valid
conversion?

Oh, sorry... I attached the wrong patch.
The attached v4 patch includes this case.

Oh! I didn't know the "XXX_1.out" feature.

You have missed the inclusion of an alternate output, which should be
something like that to bypass the test rather than failing:
--- /dev/null
+++ b/src/test/regress/expected/copyencoding_1.out
@@ -0,0 +1,7 @@
+--
+-- Test cases for COPY encoding
+--
+-- skip test if not UTF8 server encoding
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit

I guess that you have the file, forgot a `git add`.

I did "git add" but attached the wrong file...

The v4 patch includes this too.

Thanks,
--
kou

Attachments:

v4-0001-Add-tests-for-encoding-for-COPY-FROM.patch (text/x-patch; charset=us-ascii, +103/-2)
#19Michael Paquier
michael@paquier.xyz
In reply to: Sutou Kouhei (#18)
Re: confusing / inefficient "need_transcoding" handling in copy

On Fri, Dec 13, 2024 at 12:27:37PM +0900, Sutou Kouhei wrote:

Oh, sorry... I attached wrong patch...
I attach the v4 patch that includes this case.

Sounds fair to me as a beginning for the code paths setting
need_transcoding.

Note that using "test" as a table name for the tests is not a good idea,
as it could very likely conflict with some concurrent activity. I
would also add two RESET queries to remove the dependency on
client_encoding once the test no longer needs to rely on it. No need to
send a new patch for all that, just noting in passing.
--
Michael

#20Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#19)
Re: confusing / inefficient "need_transcoding" handling in copy

On Sat, Dec 14, 2024 at 04:46:57PM +0900, Michael Paquier wrote:

Note that using "test" as a table name for the tests is not a good idea,
as it could very likely conflict with some concurrent activity. I
would also add two RESET queries to remove the dependency on
client_encoding once the test no longer needs to rely on it. No need to
send a new patch for all that, just noting in passing.

I got some time to look again at this one. Applied with the tweaks
for the table name and the two RESET queries.
--
Michael

#21Sutou Kouhei
kou@clear-code.com
In reply to: Michael Paquier (#20)