Perform COPY FROM encoding conversions in larger chunks
I've been looking at the COPY FROM parsing code, trying to refactor it
so that the parallel COPY would be easier to implement. I haven't
touched parallelism itself, just looking for ways to smooth the way,
and for ways to speed up COPY in general.
Currently, COPY FROM parses the input one line at a time. Each line is
converted to the database encoding separately, or if the file encoding
matches the database encoding, we just check that the input is valid for
the encoding. It would be more efficient to do the encoding
conversion/verification in larger chunks. At least potentially: the
current conversion/verification implementations work one byte at a time,
so it doesn't matter too much, but there are faster algorithms out there
that use SIMD instructions or lookup tables and benefit from larger inputs.
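For illustration, here is the kind of chunk-at-a-time fast path that larger
inputs enable. This is just a sketch of the general technique, not code from
the patches; it checks whether a buffer is pure 7-bit ASCII, inspecting 16
bytes per SSE2 instruction:

#include <emmintrin.h>
#include <stdbool.h>

/*
 * Return true if all 'len' bytes are 7-bit ASCII. _mm_movemask_epi8()
 * collects the high bit of each of the 16 bytes in the chunk, so a
 * nonzero mask means some byte is not ASCII.
 */
static bool
is_all_ascii(const unsigned char *s, int len)
{
	while (len >= 16)
	{
		__m128i		chunk = _mm_loadu_si128((const __m128i *) s);

		if (_mm_movemask_epi8(chunk) != 0)
			return false;
		s += 16;
		len -= 16;
	}
	while (len-- > 0)
	{
		if (*s++ & 0x80)
			return false;
	}
	return true;
}

With one-line-at-a-time processing, a loop like this rarely sees enough bytes
at once to pay off.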
So I'd like to change it so that the encoding conversion/verification is
done before splitting the input into lines. The problem is that the
conversion and verification functions throw an error on incomplete
input, so we can't pass them a chunk of N raw bytes if we don't know
where the character boundaries are. The first step in this effort is to
change the conversion and verification routines to allow that. Attached
patches 0001-0004 do that:
For encoding conversions, change the signature of the conversion
functions by adding a "bool noError" argument and making them return the
number of input bytes successfully converted. That way, the conversion
function can be called in a streaming fashion: load a buffer with raw
input without caring about the character boundaries, call the conversion
function to convert it except for the few bytes at the end that might be
an incomplete character, load the buffer with more data, and repeat.
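To make the calling pattern concrete, here is a minimal sketch of such a
streaming caller. The helper names (read_source, convert_noerror,
report_incomplete_input) are hypothetical stand-ins, not the actual
functions in the patches, and output-buffer position handling is elided:

#include <stdbool.h>
#include <string.h>

extern int	read_source(char *dst, int space);	/* refill from input */
extern int	convert_noerror(const char *src, int srclen,
							char *dst, bool noError);
extern void report_incomplete_input(void);

#define RAWBUF_SIZE 65536

static void
stream_convert(char *dst)
{
	char		raw[RAWBUF_SIZE];
	int			buflen = 0;

	for (;;)
	{
		int			nread;
		int			nconv;

		/* Top up the buffer, ignoring character boundaries. */
		nread = read_source(raw + buflen, RAWBUF_SIZE - buflen);
		buflen += nread;
		if (buflen == 0)
			break;				/* all input consumed and converted */

		/* Convert all but a possibly-incomplete trailing character. */
		nconv = convert_noerror(raw, buflen, dst, true);

		if (nread == 0 && nconv == 0)
		{
			/* At EOF, a leftover tail really is an incomplete character. */
			report_incomplete_input();
			break;
		}

		/* Carry the unconverted tail over to the next iteration. */
		memmove(raw, raw + nconv, buflen - nconv);
		buflen -= nconv;
	}
}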
For encoding verification, add a new function that works similarly. It
takes N bytes of raw input, verifies as much of it as possible, and
returns the number of input bytes that were valid. In principle, this
could have been implemented by calling the existing pg_encoding_mblen()
and pg_encoding_verifymb() functions in a loop, but that would be too
slow, so this adds encoding-specific functions instead. The UTF-8
implementation is slightly optimized by basically inlining the
pg_utf8_mblen() call; the other implementations are pretty naive.
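For reference, the "too slow" generic version described above would look
roughly like this, built on the existing per-character primitive (a sketch,
not part of the patches):

#include "mb/pg_wchar.h"

/*
 * Verify as much of the input as possible and return the number of
 * valid bytes. pg_encoding_verifymb() returns the character's length
 * if it is valid, or -1 if not.
 */
static int
generic_verifystr(int encoding, const char *mbstr, int len)
{
	const char *start = mbstr;

	while (len > 0)
	{
		int			chlen = pg_encoding_verifymb(encoding, mbstr, len);

		if (chlen < 0)
			break;				/* invalid or truncated character */
		mbstr += chlen;
		len -= chlen;
	}
	return (int) (mbstr - start);
}

The specialized per-encoding functions can avoid the indirect call and
per-character dispatch on every iteration, which is presumably where the
win comes from.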
- Heikki
Attachments:
0001-Add-new-mbverifystr-function-for-each-encoding.patch (text/x-patch, +491 -95)
0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch (text/x-patch, +21 -22)
0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch (text/x-patch, +136 -11)
0004-Change-conversion-function-signature.patch (text/x-patch, +1292 -634)
0005-Do-COPY-FROM-encoding-conversion-verification-in-lar.patch (text/x-patch, +262 -81)
On Wed, Dec 16, 2020 at 02:17:58PM +0200, Heikki Linnakangas wrote:
I've been looking at the COPY FROM parsing code, trying to refactor it so
that the parallel COPY would be easier to implement. I haven't touched
parallelism itself, just looking for ways to smooth the way, and for ways
to speed up COPY in general.
Yes, this makes a lot of sense. Glad you are looking into this.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com
The usefulness of a cup is in its emptiness, Bruce Lee
One of the patches in this patch set is worth calling out separately:
0003-Add-direct-conversion-routines-between-EUC_TW-and-Bi.patch. Per
commit message:
Add direct conversion routines between EUC_TW and Big5.
Conversions between EUC_TW and Big5 were previously implemented by
converting the whole input to MIC first, and then from MIC to the
target encoding. Implement functions to convert directly between the
two.
The reason to do this now is that the next patch will change the
conversion function signature so that if the input is
invalid, we convert as much as we can and return the number of bytes
successfully converted. That's not possible if we use an intermediary
format, because if an error happens in the intermediary -> final
conversion, we lose track of the location of the invalid character in
the original input. Avoiding the intermediate step should be faster
too.
This patch is fairly independent of the others. It could be reviewed and
applied separately.
In order to verify that the new code is correct, I wrote some helper
plpgsql functions to generate all valid EUC_TW and Big5 byte sequences
that encode one character, and tested converting each of them. Then I
compared the results with an unpatched server, to check that the new
code performs the same conversion. This is perhaps overkill, but since
it's pretty straightforward to enumerate all the input characters, we
might as well do it.
For the sake of completeness, I wrote similar helpers for all the other
encodings and conversions, except for UTF-8: there are too many formally
valid codepoints for that to be feasible. This does test round-trip
conversions of all codepoints from all the other encodings to UTF-8 and
back, though, so there's pretty good coverage of UTF-8 too.
This test suite is probably too large to add to the source tree, but for
the sake of the archives, I'm attaching it here. The first patch adds
the test suite, including the expected output of each conversion. The
second patch contains expected output changes for the above patch to add
direct conversions between EUC_TW and Big5. It affected the error
messages for some byte sequences that cannot be converted. For example,
on unpatched master:
postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR: character with byte sequence 0x95 0xfd 0xcc in encoding
"MULE_INTERNAL" has no equivalent in encoding "BIG5"
With the patch:
postgres=# select convert('\xfdcc', 'euc_tw', 'big5');
ERROR: character with byte sequence 0xfd 0xcc in encoding "EUC_TW" has
no equivalent in encoding "BIG5"
The old message talked about "MULE_INTERNAL", which exposes the
implementation detail that we used it as an intermediate in the
conversion. That can be confusing to a user; the new message makes more
sense. So that's also nice.
- Heikki
Attachments:
0001-Add-conversion-test-suite.patch.bz2 (application/x-bzip, +13 -8)
0002-Expected-output-changes-for-patch-to-convert-directl.patch.bz2 (application/x-bzip, +2 -0)
On Wed, Dec 16, 2020 at 8:18 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Currently, COPY FROM parses the input one line at a time. Each line is
converted to the database encoding separately, or if the file encoding
matches the database encoding, we just check that the input is valid for
the encoding. It would be more efficient to do the encoding
conversion/verification in larger chunks. At least potentially: the
current conversion/verification implementations work one byte at a time,
so it doesn't matter too much, but there are faster algorithms out there
that use SIMD instructions or lookup tables and benefit from larger
inputs.
Hi Heikki,
This is great news. I've seen examples of such algorithms and that'd be
nice to have. I haven't studied the patch in detail, but it looks fine on
the whole.
In 0004, it seems you have some doubts about upgrade compatibility. Is that
because user-defined conversions would no longer have the right signature?
--
John Naylor
EDB: http://www.enterprisedb.com
On 22/12/2020 22:01, John Naylor wrote:
In 0004, it seems you have some doubts about upgrade compatibility. Is
that because user-defined conversions would no longer have the right
signature?
Exactly. If you have an extension that adds a custom conversion function
and does CREATE CONVERSION, the old installation script will fail on the
new version. That causes trouble for pg_dump+restore and pg_upgrade.
Perhaps we could accept the old signature in the server when you do
CREATE CONVERSION, but somehow mark the conversion as broken in the
catalog so that you would get a runtime error if you tried to use it.
That would be enough to make pg_dump+restore (and pg_upgrade) not throw
an error, and you could then upgrade the extension later (ALTER
EXTENSION UPDATE).
I'm not sure it's worth the trouble, though. Custom conversions are very
rare. And I don't think any other object can depend on a conversion, so
you can always drop the conversion before upgrade, and re-create it with
the new function signature afterwards. A note in the release notes and a
check in pg_upgrade, with instructions to drop and recreate the
conversion, are probably enough.
- Heikki
On Wed, Dec 23, 2020 at 3:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I'm not sure it's worth the trouble, though. Custom conversions are very
rare. And I don't think any other object can depend on a conversion, so
you can always drop the conversion before upgrade, and re-create it with
the new function signature afterwards. A note in the release notes and a
check in pg_upgrade, with instructions to drop and recreate the
conversion, are probably enough.
That was my thought as well.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi Heikki,
0001 through 0003 are straightforward, and I think they can be committed
now if you like.
0004 is also pretty straightforward. The check you proposed upthread for
pg_upgrade seems like the best solution to make that workable. I'll take a
look at 0005 soon.
I measured the conversions that were rewritten in 0003, and there is indeed
a noticeable speedup:
Big5 to EUC-TW:
head 196ms
0001-3 152ms
EUC-TW to Big5:
head 190ms
0001-3 144ms
I've attached the driver function for reference. Example use:
select drive_conversion(
1000, 'euc_tw'::name, 'big5'::name,
convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
);
I took a look at the test suite also, and the only thing to note is a
couple places where the comment doesn't match the code:
+ -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
+ byte1 = hex('0e');
+ for byte2 in hex('a1')..hex('df') loop
+ return next b(byte1, byte2);
+ end loop;
+
+ -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
+ byte1 = hex('0f');
+ for byte2 in hex('a1')..hex('fe') loop
+ for byte3 in hex('a1')..hex('fe') loop
+ return next b(byte1, byte2, byte3);
+ end loop;
+ end loop;
Not sure if it matters, but thought I'd mention it anyway.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
On 28/01/2021 01:23, John Naylor wrote:
Hi Heikki,
0001 through 0003 are straightforward, and I think they can be committed
now if you like.
Thanks for the review!
I did some more rigorous microbenchmarking of patch 1 and 2. I used the
attached test script, which calls convert_from() function to perform
UTF-8 verification on two large strings, about 60kb each. One of the
strings is pure ASCII, and the other is an HTML page that contains a mix
of ASCII and multibyte characters.
Compiled with "gcc -O2", gcc version 10.2.1 20210110 (Debian 10.2.1-6)
| mixed | ascii
-----------+-------+-------
master | 1866 | 1250
patch 1 | 959 | 507
patch 1+2 | 1396 | 987
So, the first patch,
0001-Add-new-mbverifystr-function-for-each-encoding.patch, made a huge
difference, even with pure ASCII input. That's very surprising, because
there is already a fast path for pure-ASCII input in pg_verify_mbstr_len().
Even more surprising was that the second patch
(0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
actually made things worse again. I thought it would give a modest gain,
but nope.
It seems to me that GCC is not doing good job at optimizing the loop in
pg_verify_mbstr(). The first patch fixes that, but the second patch
somehow trips up GCC again.
So I also tried this with "gcc -O3" and clang:
Compiled with "gcc -O3"
| mixed | ascii
-----------+-------+-------
master | 1522 | 1225
patch 1 | 753 | 507
patch 1+2 | 868 | 507
Compiled with "clang -O2", Debian clang version 11.0.1-2
| mixed | ascii
-----------+-------+-------
master | 1257 | 520
patch 1 | 899 | 507
patch 1+2 | 884 | 508
With gcc -O3, the results are better, but the second patch still seems
harmful. With clang, I got the result I expected: almost no difference
with pure-ASCII input, because there's already a fast path for that, and
a nice speedup with multibyte characters. Still, I was surprised how big
the speedup from the first patch was, and how little additional gain the
second patch gives.
Based on these results, I'm going to commit the first patch, but not the
second one. There are much faster UTF-8 verification routines out there,
using SIMD instructions and whatnot, and we should consider adopting one
of those, but that's future work.
- Heikki
Attachments:
On 28/01/2021 01:23, John Naylor wrote:
Hi Heikki,
0001 through 0003 are straightforward, and I think they can be committed
now if you like.

0004 is also pretty straightforward. The check you proposed upthread for
pg_upgrade seems like the best solution to make that workable. I'll take
a look at 0005 soon.

I measured the conversions that were rewritten in 0003, and there is
indeed a noticeable speedup:

Big5 to EUC-TW:
head 196ms
0001-3 152ms

EUC-TW to Big5:
head 190ms
0001-3 144ms

I've attached the driver function for reference. Example use:

select drive_conversion(
1000, 'euc_tw'::name, 'big5'::name,
convert('a few kB of utf8 text here', 'utf8', 'euc_tw')
);
Thanks! I have committed patches 0001 and 0003 in this series, with
minor comment fixes. Next I'm going to write the pg_upgrade check for
patch 0004, to get that into a committable state too.
I took a look at the test suite also, and the only thing to note is a
couple places where the comment doesn't match the code:

+ -- JIS X 0201: 2-byte encoded chars starting with 0x8e (SS2)
+ byte1 = hex('0e');
+ for byte2 in hex('a1')..hex('df') loop
+ return next b(byte1, byte2);
+ end loop;
+
+ -- JIS X 0212: 3-byte encoded chars, starting with 0x8f (SS3)
+ byte1 = hex('0f');
+ for byte2 in hex('a1')..hex('fe') loop
+ for byte3 in hex('a1')..hex('fe') loop
+ return next b(byte1, byte2, byte3);
+ end loop;
+ end loop;

Not sure if it matters, but thought I'd mention it anyway.
Good catch! The comments were correct, and the tests were wrong, not
testing those 2- and 3-byte encoded characters as intended. It doesn't
matter for testing this patch; I only included those euc_jis_2004 tests
for the sake of completeness. But if someone finds this test suite in
the archives and wants to use it for something real, make sure you fix
that first.
- Heikki
On 28/01/2021 15:05, Heikki Linnakangas wrote:
Next I'm going to write the pg_upgrade check for
patch 0004, to get that into a committable state too.
As promised, here are new versions of the remaining patches, with the
pg_upgrade check added. If you have any custom encoding conversions in
the old cluster, pg_upgrade now fails:
Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* data types in user tables                 ok
Checking for contrib/isn with bigint-passing mismatch       ok
Checking for user-defined encoding conversions              fatal

Your installation contains user-defined encoding conversions.
The conversion function parameters changed in PostgreSQL version 14
so this cluster cannot currently be upgraded. You can remove the
encoding conversions in the old cluster and restart the upgrade.
A list of user-defined encoding conversions is in the file:
    encoding_conversions.txt

Failure, exiting
To test this, I wrote a dummy conversion function, also attached.
- Heikki
Attachments:
v2-0001-Change-conversion-function-signature.patch (text/x-patch, +1387 -634)
v2-0002-Do-COPY-FROM-encoding-conversion-verification-in-.patch (text/x-patch, +270 -82)
0001-Custom-encoding-conversion-routine-for-testing.patch (text/x-patch, +86 -1)
On Thu, Jan 28, 2021 at 7:36 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Even more surprising was that the second patch
(0002-Replace-pg_utf8_verifystr-with-a-faster-implementati.patch)
actually made things worse again. I thought it would give a modest gain,
but nope.
Hmm, that surprised me too.
Based on these results, I'm going to commit the first patch, but not the
second one. There are much faster UTF-8 verification routines out there,
using SIMD instructions and whatnot, and we should consider adopting one
of those, but that's future work.
I have something in mind for that.
I took a look at v2, and for the first encoding I tried, it fails to report
the error for invalid input:
create database euctest WITH ENCODING 'EUC_CN' LC_COLLATE='zh_CN.eucCN'
LC_CTYPE='zh_CN.eucCN' TEMPLATE=template0;
\c euctest
create table foo (a text);
master:
euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
ä
\.
ERROR: character with byte sequence 0xc3 0xa4 in encoding "UTF8" has no
equivalent in encoding "EUC_CN"
CONTEXT: COPY foo, line 1
patch:
euctest=# copy foo from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
ä
\.
COPY 0
euctest=#
I believe the problem is in UtfToLocal(). I've attached a fix formatted as
a text file to avoid confusing the cfbot. The fix keeps the debugging
ereport() in case you find it useful. Some additional test coverage might
be good here, but not sure how much work that would be. I didn't check any
other conversions yet.
v2-0002 seems fine to me, I just have cosmetic comments here:
+ * the same, no conversion is required by we must still validate that the
s/by/but/
This comment in copyfrom_internal.h above the *StateData struct is the same
as the corresponding one in copyto.c:

* Multi-byte encodings: all supported client-side encodings encode multi-byte
* characters by having the first byte's high bit set. Subsequent bytes of the
* character can have the high bit not set. When scanning data in such an
* encoding to look for a match to a single-byte (ie ASCII) character, we must
* use the full pg_encoding_mblen() machinery to skip over multibyte
* characters, else we might find a false match to a trailing byte. In
* supported server encodings, there is no possibility of a false match, and
* it's faster to make useless comparisons to trailing bytes than it is to
* invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
* when we have to do it the hard way.

The references to pg_encoding_mblen() and encoding_embeds_ascii are out of
date for copy-from. I'm not sure the rest is relevant to copy-from anymore,
either. Can you confirm?
This comment inside the struct is now out of date as well:
* Similarly, line_buf holds the whole input line being processed. The
* input cycle is first to read the whole line into line_buf, convert it
* to server encoding there, and then extract the individual attribute
HEAD has this macro already:
/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
It might make sense to create a CONVERSION_BUF_BYTES equivalent since the
patch calculates cstate->conversion_buf_len - cstate->conversion_buf_index
in a couple places.
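Something like this would mirror the existing macro (a sketch only; the
field names are taken from the review comment above):

/* Shorthand for number of unconverted bytes available in conversion_buf */
#define CONVERSION_BUF_BYTES(cstate) \
	((cstate)->conversion_buf_len - (cstate)->conversion_buf_index)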
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
fix-utf8-convertedbytes-calc.txt (text/plain, +7 -3)
On 30/01/2021 20:47, John Naylor wrote:
I took a look at v2, and for the first encoding I tried, it fails to
report the error for invalid input:
That's embarrassing...
I believe the problem is in UtfToLocal(). I've attached a fix formatted
as a text file to avoid confusing the cfbot. The fix keeps the debugging
ereport() in case you find it useful.
Thanks. I fixed it slightly differently, and also changed LocalToUtf()
to follow the same pattern, even though LocalToUtf() did not have the
same bug.
Some additional test coverage might be good here, but not sure how
much work that would be. I didn't check any other conversions yet.
I added a bunch of tests for various built-in conversions.
v2-0002 seems fine to me, I just have cosmetic comments here:
+ * the same, no conversion is required by we must still validate that the
s/by/but/
This comment in copyfrom_internal.h above the *StateData struct is the
same as the corresponding one in copyto.c:

* Multi-byte encodings: all supported client-side encodings encode multi-byte
* characters by having the first byte's high bit set. Subsequent bytes of the
* character can have the high bit not set. When scanning data in such an
* encoding to look for a match to a single-byte (ie ASCII) character, we must
* use the full pg_encoding_mblen() machinery to skip over multibyte
* characters, else we might find a false match to a trailing byte. In
* supported server encodings, there is no possibility of a false match, and
* it's faster to make useless comparisons to trailing bytes than it is to
* invoke pg_encoding_mblen() to skip over them. encoding_embeds_ascii is true
* when we have to do it the hard way.

The references to pg_encoding_mblen() and encoding_embeds_ascii are out
of date for copy-from. I'm not sure the rest is relevant to copy-from
anymore, either. Can you confirm?
Yeah, that comment is obsolete for COPY FROM, the encoding conversion
works differently now. Removed it from copyfrom_internal.h.
This comment inside the struct is now out of date as well:
* Similarly, line_buf holds the whole input line being processed. The
* input cycle is first to read the whole line into line_buf, convert it
* to server encoding there, and then extract the individual attribute

HEAD has this macro already:

/* Shorthand for number of unconsumed bytes available in raw_buf */
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)

It might make sense to create a CONVERSION_BUF_BYTES equivalent since
the patch calculates cstate->conversion_buf_len - cstate->conversion_buf_index
in a couple places.
Thanks for the review!
I spent some time refactoring and adding comments all around the patch,
hopefully making it all more clear. One notable difference is that I
renamed 'raw_buf' (which exists in master too) to 'input_buf', and
renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this
patch again another day with fresh eyes, and also try to add some tests
for the corner cases at buffer boundaries.
Attached is a new set of patches. I added some regression tests for the
built-in conversion functions, which cover the bug you found, and many
other interesting cases that did not have test coverage yet. It comes in
two patches: the first patch uses just the existing convert_from() SQL
function, and the second one uses the new "noError" variants of the
conversion functions. I also kept the bug-fixes compared to the previous
patch version as a separate commit, for easier review.
- Heikki
Attachments:
v3-0001-Add-regression-tests-for-built-in-encoding-conver.patch (text/x-patch, +675 -1)
v3-0002-Change-conversion-function-signature.patch (text/x-patch, +1390 -636)
v3-0003-Add-tests-for-the-new-noError-variants-of-built-i.patch (text/x-patch, +482 -268)
v3-0004-Fix-bugs-in-the-commit-to-change-conversion-funct.patch (text/x-patch, +17 -13)
v3-0005-Do-COPY-FROM-encoding-conversion-verification-in-.patch (text/x-patch, +715 -162)
On Mon, Feb 1, 2021 at 12:15 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Thanks. I fixed it slightly differently, and also changed LocalToUtf()
to follow the same pattern, even though LocalToUtf() did not have the
same bug.
Looks good to me.
I added a bunch of tests for various built-in conversions.
Nice! I would like to have utf8 tests for every category of invalid byte
(overlong, surrogate, 5 bytes, etc), but it's not necessary for this patch.
I spent some time refactoring and adding comments all around the patch,
hopefully making it all more clear. One notable difference is that I
renamed 'raw_buf' (which exists in master too) to 'input_buf', and
renamed 'conversion_buf' to 'raw_buf'. I'm going to read through this
patch again another day with fresh eyes, and also try to add some tests
for the corner cases at buffer boundaries.
The comments and renaming are really helpful in understanding that file!
Although a new patch is likely forthcoming, I did take a brief look and
found the following:
In copyfromparse.c, this is now out of date:
* Read the next input line and stash it in line_buf, with conversion to
* server encoding.
One of your FIXME comments seems to allude to this, but if we really need a
difference here, maybe it should be explained:
+#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */
+#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */
Lastly, it looks like pg_do_encoding_conversion_buf() ended up in 0003
accidentally?
--
John Naylor
EDB: http://www.enterprisedb.com
On 02/02/2021 23:42, John Naylor wrote:
Although a new patch is likely forthcoming, I did take a brief look and
found the following:In copyfromparse.c, this is now out of date:
* Read the next input line and stash it in line_buf, with conversion to
* server encoding.One of your FIXME comments seems to allude to this, but if we really
need a difference here, maybe it should be explained:+#define INPUT_BUF_SIZE 65536 /* we palloc INPUT_BUF_SIZE+1 bytes */
+#define RAW_BUF_SIZE 65536 /* allocated size of the buffer */
We do in fact still need the +1 for the NUL terminator. It was missing
from the last patch version, but that was wrong; my fuzz testing
actually uncovered a bug caused by that. Fixed.
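For clarity, the allocation in question looks roughly like this (a sketch
with approximate names; input_buf is the renamed buffer mentioned earlier):

/* One byte beyond INPUT_BUF_SIZE leaves room for the terminating NUL. */
cstate->input_buf = (char *) palloc(INPUT_BUF_SIZE + 1);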
Attached are new patch versions. The first patch is same as before, but
rebased, pgindented, and with a couple of tiny fixes where conversion
functions were still missing the "if (noError) break;" checks.
I've hacked on the second patch more, doing more refactoring and
commenting for readability. I think it's in pretty good shape now.
- Heikki
Attachments:
v4-0001-Add-noError-argument-to-encoding-conversion-funct.patch (text/x-patch, +2322 -629)
v4-0002-Do-COPY-FROM-encoding-conversion-verification-in-.patch (text/x-patch, +500 -180)
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 02/02/2021 23:42, John Naylor wrote:
In copyfromparse.c, this is now out of date:
* Read the next input line and stash it in line_buf, with conversion to
* server encoding.
This comment for CopyReadLine() is still there. Conversion already happened
by now, so I think this comment is outdated.
Other than that, I think this is ready for commit.
--
John Naylor
EDB: http://www.enterprisedb.com
On 09/02/2021 15:40, John Naylor wrote:
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 02/02/2021 23:42, John Naylor wrote:
In copyfromparse.c, this is now out of date:
* Read the next input line and stash it in line_buf, with conversion to
* server encoding.

This comment for CopyReadLine() is still there. Conversion already
happened by now, so I think this comment is outdated.

Other than that, I think this is ready for commit.
Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1"
somehow went missing again. That was causing a failure on Windows at
cfbot.cputube.org.
I'll read through this one more time with fresh eyes tomorrow or the day
after, and push. Thanks for all the review!
- Heikki
On 09/02/2021 19:36, Heikki Linnakangas wrote:
On 09/02/2021 15:40, John Naylor wrote:
On Sun, Feb 7, 2021 at 2:13 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 02/02/2021 23:42, John Naylor wrote:
In copyfromparse.c, this is now out of date:
* Read the next input line and stash it in line_buf, with conversion to
* server encoding.

This comment for CopyReadLine() is still there. Conversion already
happened by now, so I think this comment is outdated.

Other than that, I think this is ready for commit.

Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1"
somehow went missing again. That was causing a failure on Windows at
cfbot.cputube.org.

I'll read through this one more time with fresh eyes tomorrow or the day
after, and push. Thanks for all the review!
Forgot attachment..
- Heikki
Attachments:
v5-0001-Add-noError-argument-to-encoding-conversion-funct.patch (text/x-patch, +2322 -629)
v5-0002-Do-COPY-FROM-encoding-conversion-verification-in-.patch (text/x-patch, +500 -184)
On Tue, Feb 9, 2021 at 1:44 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Fixed. And also fixed one more bug in allocating raw_buf_size, the "+ 1"
somehow went missing again. That was causing a failure on Windows at
cfbot.cputube.org.

I'll read through this one more time with fresh eyes tomorrow or the day
after, and push. Thanks for all the review!

Forgot attachment..

- Heikki
I went ahead and rebased these.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v6-0001-Add-noError-argument-to-encoding-conversion-funct.patch (application/octet-stream, +2322 -629)
v6-0002-Do-COPY-FROM-encoding-conversion-verification-in-.patch (application/octet-stream, +500 -184)
I wrote:
I went ahead and rebased these.
It looks like FreeBSD doesn't like this for some reason.
I also wanted to see if this patch set had any performance effect, with and
without changing how UTF-8 is validated, using the blackhole am from
https://github.com/michaelpq/pg_plugins/tree/master/blackhole_am.
create extension blackhole_am;
create table blackhole_tab (a text) using blackhole_am ;
time ./inst/bin/psql -c "copy blackhole_tab from '/path/to/test-copy.txt'"
...where test-copy.txt is made by:
for i in {1..100}; do cat UTF-8-Sampler.htm >> test-copy.txt ; done;
On Linux x86-64, gcc 8.4, I get these numbers (minimum of five runs):
master:
109ms
v6 do encoding in larger chunks:
109ms
v7 utf8 SIMD:
98ms
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Mar 18, 2021 at 2:05 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
I wrote:
I went ahead and rebased these.
It looks like FreeBSD doesn't like this for some reason.
On closer examination, make check was "terminated"; it's not that the
tests failed...
--
John Naylor
EDB: http://www.enterprisedb.com