Speed up COPY FROM text/CSV parsing using SIMD
Hi hackers,
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
Idea
====
The current text/CSV parser processes input byte-by-byte, checking
whether each byte is a special character (\n, \r, quote, escape) or a
regular character, and transitions states in a state machine. This
sequential processing is inefficient and likely causes frequent branch
mispredictions due to the many if statements.
I thought this problem could be addressed by leveraging SIMD and
vectorized operations for faster processing.
Implementation Overview
=======================
1. Create a vector of special characters (e.g., Vector8 nl =
vector8_broadcast('\n');).
2. Load the input buffer into a Vector8 variable called chunk.
3. Perform vectorized operations between chunk and the special
character vectors to check if the buffer contains any special
characters.
4-1. If no special characters are found, advance the input_buf_ptr by
sizeof(Vector8).
4-2. If special characters are found, advance the input_buf_ptr as far
as possible, then fall back to the original text/CSV parser for
byte-by-byte processing.
Test
====
I tested the performance by measuring the time it takes to load a CSV
file created using the attached SQL script with the following COPY
command:
=# COPY t FROM '/tmp/t.csv' (FORMAT csv);
Environment
-----------
OS: Rocky Linux 9.6
CPU: Intel Core i7-10710U (6 Cores / 12 Threads, 1.1 GHz Base / 4.7
GHz Boost, AVX2 & FMA supported)
Time
----
master: 02:44.943
patch applied: 02:36.878 (about 5% faster)
Perf
----
The call graphs are attached, and the shares of CopyReadLineText are:
master: 12.15%
patch applied: 8.04%
Thoughts?
I would appreciate feedback on the implementation and any suggestions
for further improvement.
--
Best regards,
Shinya Kato
NTT OSS Center
Attachments:
v1-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (application/octet-stream, +72 -1)
Hi,
Thank you for working on this!
On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
Hi hackers,
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
I have been working on the same idea. I was not moving input_buf_ptr
as far as possible, so I think your approach is better.
Also, I did a benchmark on the text format, varying the line length in
a table from 1 byte to 1 megabyte. The peak improvement is at a line
length of 4096, where the improvement is more than 20% [1]. I saw no
regression with your patch.
...
Thoughts?
I would appreciate feedback on the implementation and any suggestions
for further improvement.
I have a couple of ideas that I was working on:
---
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.
I think you can continue SIMD inside quoted fields. The only important
thing is that you need to set last_was_esc to false when SIMD has
skipped the chunk.
---
+ * - When the remaining buffer size is smaller than the size of a SIMD
+ * vector register, as SIMD operations require processing data in
+ * fixed-size chunks.
You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)',
but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
copy_buf_len || need_data', so you basically need to wait for at least
sizeof(Vector8) characters to pass before the next SIMD run. And in
the worst case, if CopyLoadInputBuf() puts one character less than
sizeof(Vector8) into the buffer, you can never run SIMD. I think we
need to make sure that CopyLoadInputBuf() loads at least
sizeof(Vector8) characters into input_buf so we do not encounter that
problem.
---
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
---
I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
radar for this thread). I hope these help, please let me know if you
have any questions.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
Feedback.txt (text/plain, +51 -5)
Hi,
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
Also, I did a benchmark on text format. I created a benchmark for line
length in a table being from 1 byte to 1 megabyte. The peak improvement
is line length being 4096 and the improvement is more than 20% [1], I
saw no regression on your patch.
I did the same benchmark for the CSV format. The peak improvement is
at a line length of 4096, where the improvement is more than 25% [1].
I saw a 5% regression on the 1-byte benchmark; there are no other
regressions.
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
I wanted to try using SIMD in CopyReadAttributesCSV() as well. The
improvement on the 4096 byte line length input [1] goes from 25% to
35%, the regression on the 1 byte input is the same.
CopyReadAttributesCSV() changes are attached as feedback v2.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v2-0001-Feedback.txt (text/plain, +160 -17)
On Thu, Aug 7, 2025 at 8:15 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
Thank you for working on this!
On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
Hi hackers,
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
I have been working on the same idea. I was not moving input_buf_ptr
as far as possible, so I think your approach is better.
Great. I'm looking forward to working with you on this feature implementation.
Also, I did a benchmark on text format. I created a benchmark for line
length in a table being from 1 byte to 1 megabyte. The peak improvement
is line length being 4096 and the improvement is more than 20% [1], I
saw no regression on your patch.
Thank you for the additional benchmarks.
I have a couple of ideas that I was working on:
---
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ *   require sequential processing to handle correctly.
I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
That's a clever point that last_was_esc should be reset to false when
a SIMD chunk is skipped. You're right about that specific case.
However, the core challenge is not what happens when we skip a chunk,
but what happens when a chunk contains special characters like quotes
or escapes. The main reason we avoid SIMD inside quoted fields is that
the parsing logic becomes fundamentally sequential and
context-dependent.
To correctly parse "" as a single literal quote, we must perform a
lookahead to check the next character. This is an inherently
sequential operation that doesn't map well to SIMD's parallel nature.
Trying to handle this stateful logic with SIMD would lead to
significant implementation complexity, especially with edge cases like
an escape character falling on the last byte of a chunk.
+ * - When the remaining buffer size is smaller than the size of a SIMD
+ *   vector register, as SIMD operations require processing data in
+ *   fixed-size chunks.
You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)'
but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
copy_buf_len || need_data' so basically you need to wait at least the
sizeof(Vector8) character to pass for the next SIMD. And in the worst
case; if CopyLoadInputBuf() puts one character less than
sizeof(Vector8), then you can't ever run SIMD. I think we need to make
sure that CopyLoadInputBuf() loads at least the sizeof(Vector8)
character to the input_buf so we do not encounter that problem.
I think you're probably right, but we only need to account for
sizeof(Vector8) when USE_NO_SIMD is not defined.
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
Agreed, I will.
I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
radar for this thread). I hope these help, please let me know if you
have any questions.
Thanks a lot!
On Mon, Aug 11, 2025 at 5:52 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
...
I did the same benchmark for the CSV format. The peak improvement is
line length being 4096 and the improvement is more than 25% [1]. I saw
a 5% regression on the 1 byte benchmark, there are no other
regressions.
Thank you. I'm not too concerned about a regression when there's only
one byte per line.
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
I wanted to try using SIMD in CopyReadAttributesCSV() as well. The
improvement on the 4096 byte line length input [1] goes from 25% to
35%, the regression on the 1 byte input is the same.
Yes, I'm on it. I'm currently adding the SIMD logic to
CopyReadAttributesCSV() as you suggested. I'll share the new version
of the patch soon.
--
Best regards,
Shinya Kato
NTT OSS Center
On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11.kato@gmail.com> wrote:
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ *   require sequential processing to handle correctly.
I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
...
Ah, you're right. My apologies, I misunderstood the implementation. It
appears that SIMD can be used even within quoted strings.
I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place.
Furthermore, when inside a quote (in_quote) in CSV mode, the detection
of \n and \r can be disabled.
+ last_was_esc = false;
Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it’s best to keep the current logic for toggling
last_was_esc.
+ int advance = pg_rightmost_one_pos32(mask);
+ input_buf_ptr += advance;
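As a side note, the mask-to-advance step quoted above can be modeled portably. The function below mirrors what pg_rightmost_one_pos32() returns, assuming (for illustration) a movemask-style bitmask in which bit i corresponds to byte i of the chunk:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Position of the rightmost set bit, i.e. the index of the first
 * special byte in the chunk; the parser may consume every byte before
 * it.  The caller must ensure mask != 0, as with
 * pg_rightmost_one_pos32().
 */
static int
rightmost_one_pos(uint32_t mask)
{
    int pos = 0;

    while ((mask & 1) == 0)
    {
        mask >>= 1;
        pos++;
    }
    return pos;
}
```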
I've attached a new patch that includes these changes. Further
modifications are still in progress.
--
Best regards,
Shinya Kato
NTT OSS Center
Attachments:
v2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (application/octet-stream, +71 -1)
Following Nazir's findings about 4096 bytes being the best-performing
line length, I did more benchmarks on my side on both TEXT and CSV
formats, with two different cases: normal data (no special characters)
and data with many special characters.
The results are as good as expected and similar to the previous benchmarks:
~30.9% faster copy in TEXT format
~32.4% faster copy in CSV format
20%-30% reduction in cycles per instruction
For lines containing a lot of special characters (e.g., tables with
large numbers of columns), we obviously expect regressions here
because of the overhead of many fallbacks to scalar processing.
Results when special characters make up 1/3 of the line length:
~43.9% slower copy in TEXT format
~16.7% slower copy in CSV format
So even with fewer occurrences of special characters, or wider spacing
between them, there might still be some regression. This may be an
insignificant case, but it can be treated in other patches if we
consider not using the SIMD path sometimes.
I hope this helps and further confirms the patch.
Regards,
Ayoub Kazar
On Thu, Aug 14, 2025 at 01:55, Shinya Kato <shinya11.kato@gmail.com> wrote:
...
Hi,
On Thu, 14 Aug 2025 at 05:25, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Following Nazir's findings about 4096 bytes being the performant line
length, I did more benchmarks from my side on both TEXT and CSV
formats with two different cases of normal data (no special
characters) and data with many special characters.
...
Thanks for running that benchmark! Would you mind sharing a reproducer
for the regression you observed?
--
Regards,
Nazir Bilal Yavuz
Microsoft
Thanks for running that benchmark! Would you mind sharing a reproducer
for the regression you observed?
--
Regards,
Nazir Bilal Yavuz
Microsoft
Of course, I attached the SQL to generate the text and CSV test
files. If having special characters in 1/3 of the line length is an
exaggeration, something lower might still reproduce some regression
along the same lines.
Best regards,
Ayoub Kazar
Attachments:
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
I have a couple of ideas that I was working on:
---
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ *   require sequential processing to handle correctly.
I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
There is a trick of doing a carryless multiplication with -1 that can
be used to process quoted/not-quoted transitions with SIMD [1].
This is able to convert a bitmask of unescaped quote character
positions to a quote mask in a single operation. I last looked at it 5
years ago, but I remember coming to the conclusion that it would work
for implementing PostgreSQL's interpretation of CSV.
[1]: https://github.com/geofflangdale/simdcsv/blob/master/src/main.cpp#L76
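For readers unfamiliar with the trick: carry-less multiplication of the quote bitmask by an all-ones word computes a prefix XOR over its bits, so each output bit says whether an odd number of quotes precede that byte, i.e. whether the byte lies inside a quoted region. A portable shift-xor ladder (shown here instead of the PCLMULQDQ intrinsic) produces the same result:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Prefix XOR: output bit i is the XOR of input bits 0..i.  Fed a
 * bitmask of unescaped quote positions, the result marks the bytes
 * inside quotes, matching the single CLMUL-by-(-1) operation used by
 * simdcsv.
 */
static uint64_t
prefix_xor(uint64_t x)
{
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}
```

For example, for the bytes a"bc"d the quote mask 0b010010 becomes 0b001110: the set bits cover the opening quote and the quoted content.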
--
Ants
Hi,
On Thu, 14 Aug 2025 at 18:00, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Thanks for running that benchmark! Would you mind sharing a reproducer
for the regression you observed?
Of course, I attached the sql to generate the text and csv test files.
If having a 1/3 of line length of special characters can be an
exaggeration, something lower might still reproduce some regressions
of course for the same idea.
Thank you so much!
I am able to reproduce the regression you mentioned, but both
regressions are 20% on my end. I found (by experimenting) that SIMD
causes a regression if it advances fewer than 5 characters.
So, I implemented a small heuristic. It works like this:
- If advance < 5 -> insert a sleep penalty (n cycles).
- Each time advance < 5, n is doubled.
- Each time advance ≥ 5, n is halved.
I am sharing a POC patch to show the heuristic; it can be applied on
top of v1-0001. The heuristic version has the same performance
improvements as v1-0001, but the regression is 5% instead of 20%
compared to master.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
COPY-SIMD-add-heuristic-to-avoid-regression-on-sm.txt (text/plain, +40 -6)
Hi,
On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.
So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).
'sleep' might be a poor word choice here. I meant skipping SIMD for n
number of times.
--
Regards,
Nazir Bilal Yavuz
Microsoft
On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
Hi,
On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.
So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).
'sleep' might be a poor word choice here. I meant skipping SIMD for n
number of times.
I was thinking a bit about that this morning. I wonder whether,
instead of having a constantly applied heuristic like this, it might
be better to do a little extra accounting on the first, say, 1000
lines of an input file, and if less than some portion of the input is
found to be special characters, then switch to the SIMD code. What
that portion should be would need to be determined by some
experimentation with a variety of typical workloads, but given your
findings, 20% seems like a good starting point.
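A sketch of this one-time accounting idea. The function name, the sampled special-character set, and the 20% threshold below are illustrative assumptions, not taken from any patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Scan a sample of the input (e.g. the first 1000 lines) once, and
 * enable the SIMD path only if special characters make up less than
 * the given fraction of the sample.
 */
static bool
should_use_simd(const char *sample, size_t len, double threshold)
{
    size_t specials = 0;

    for (size_t i = 0; i < len; i++)
    {
        char c = sample[i];

        if (c == '\n' || c == '\r' || c == '"' || c == '\\')
            specials++;
    }
    return len > 0 && (double) specials / (double) len < threshold;
}
```

The decision is made once, so a mispredicted shape later in the file is not corrected; that is the trade-off against an adaptive scheme.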
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
...
I am sharing a POC patch to show the heuristic; it can be applied on
top of v1-0001. The heuristic version has the same performance
improvements as v1-0001, but the regression is 5% instead of 20%
compared to the master.
Yes, this is good; I'm also getting only about a 5% regression now.
Regards,
Ayoub Kazar
Hi,
On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
Hi,
On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
...
I was thinking a bit about that this morning. I wonder if it might be
better instead of having a constantly applied heuristic like this, it
might be better to do a little extra accounting in the first, say, 1000
lines of an input file, and if less than some portion of the input is
found to be special characters then switch to the SIMD code. What that
portion should be would need to be determined by some experimentation
with a variety of typical workloads, but given your findings 20% seems
like a good starting point.
I implemented a heuristic somewhat similar to this. It is a mix of
the previous heuristic and your idea; it works like this:
Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.
1 - We will try SIMD and decide if it is worth it to run SIMD.
1.1 - If it is worth it, we will continue to run SIMD and we will
halve the simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
and we will not run SIMD for these many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we will go back to the #1.
Note: simd_last_sleep_cycle cannot exceed 1024, so we will try SIMD
at least once every 1024 lines.
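The scheme above can be modeled as a toy state machine (names are illustrative, not the patch's):

```c
#include <assert.h>

/*
 * After each line on which SIMD was tried, report whether it paid off;
 * the returned value is how many of the following lines should bypass
 * the SIMD path: doubled on failure up to a cap of 1024, halved on
 * success.
 */
typedef struct
{
    int sleep_cycle;    /* lines to skip; 0 means try SIMD every line */
} SimdHeuristic;

static int
next_skip(SimdHeuristic *h, int simd_was_worth_it)
{
    if (simd_was_worth_it)
        h->sleep_cycle /= 2;        /* 1.1: halve on success */
    else if (h->sleep_cycle == 0)
        h->sleep_cycle = 1;         /* 1.2: start skipping */
    else if (h->sleep_cycle < 1024)
        h->sleep_cycle *= 2;        /* 1.2: double, capped at 1024 */
    return h->sleep_cycle;
}
```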
With this heuristic, the regression is limited to 2% in the worst case.
Patches are attached, the first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes. 0002 is
the actual heuristic patch.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v3-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (text/x-patch, +73 -1)
v3-0002-COPY-SIMD-per-line-heuristic.patch (text/x-patch, +89 -7)
Hello,
I've rebenchmarked the new heuristic patch. We still have the
previous improvements, ranging from 15% to 30%. For regressions, I see
at most 3% or 4% in the worst case, so this is solid.
I'm also trying the idea of doing SIMD inside quotes with a prefix
XOR using carry-less multiplication, avoiding the slow path in all
cases even with odd-looking input. But it needs to take into account
the availability of the PCLMULQDQ instruction with <wmmintrin.h>, and
there it quickly starts to become messy. Alternatively, we can wait
for the decision to start requiring x86-64-v2 or v3, which have
SSE4.2 and AVX2.
Regards,
Ayoub Kazar
Hi,
On Thu, 16 Oct 2025 at 17:29, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.
I had a typo there; the correct sentence is:
"Overall logic is that we *will* run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines."
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On Sat, 18 Oct 2025 at 21:46, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Hello,
I’ve rebenchmarked the new heuristic patch, We still have the previous improvements ranging from 15% to 30%. For regressions i see at maximum 3% or 4% in the worst case, so this is solid.
Thank you so much for doing this! The results look nice, do you think
there are any other benchmarks that might be interesting to try?
I'm also trying the idea of doing SIMD inside quotes with prefix XOR using carry less multiplication avoiding the slow path in all cases even with weird looking input, but it needs to take into consideration the availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we go, it quickly starts to become dirty OR we can wait for the decision to start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.
I cannot quite picture this; would you mind sharing a few examples or patches?
--
Regards,
Nazir Bilal Yavuz
Microsoft
On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
...
With this heuristic the regression is limited to 2% in the worst case.
My worry is that the worst case is actually quite common. Sparse data
sets dominated by a lot of null values (and hence lots of special
characters) are very common. Are people prepared to accept a 2%
regression on load times for such data sets?
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
With this heuristic the regression is limited to 2% in the worst case.
My worry is that the worst case is actually quite common. Sparse data sets
dominated by a lot of null values (and hence lots of special characters) are
very common. Are people prepared to accept a 2% regression on load times for
such data sets?
Without knowing how common it is, I think it's difficult to judge whether
2% is a reasonable trade-off. If <5% of workloads might see a small
regression while the other >95% see double-digit percentage improvements,
then I might argue that it's fine. But I'm not sure we have any way to
know those sorts of details at the moment.
I'm also at least a little skeptical about the 2% number. IME that's
generally within the noise range and can vary greatly between machines and
test runs.
--
nathan
On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
...
Without knowing how common it is, I think it's difficult to judge whether
2% is a reasonable trade-off. If <5% of workloads might see a small
regression while the other >95% see double-digit percentage improvements,
then I might argue that it's fine. But I'm not sure we have any way to
know those sorts of details at the moment.
I guess what I don't understand is why we actually need to do the test
continuously, even using an adaptive algorithm. Data files in my
experience usually have lines with fairly similar shapes. It's highly
unlikely that the first 1000 (say) lines of a file will be rich in
special characters while some later significant section isn't, or vice
versa. Therefore, doing the test once should yield the correct answer,
which can be applied to the rest of the file. That should reduce the
worst-case regression to ~0% without sacrificing any of the
performance gains. I appreciate the elegance of what Bilal has done
here, but it does seem like overkill.
I'm also at least a little skeptical about the 2% number. IME that's
generally within the noise range and can vary greatly between machines and
test runs.
Fair point.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com