Speed up COPY FROM text/CSV parsing using SIMD
Hi hackers,
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
Idea
====
The current text/CSV parser processes input byte-by-byte, checking
whether each byte is a special character (\n, \r, quote, escape) or a
regular character, and transitions states in a state machine. This
sequential processing is inefficient and likely causes frequent branch
mispredictions due to the many if statements.
I thought this problem could be addressed by leveraging SIMD and
vectorized operations for faster processing.
Implementation Overview
=======================
1. Create a vector of special characters (e.g., Vector8 nl =
vector8_broadcast('\n');).
2. Load the input buffer into a Vector8 variable called chunk.
3. Perform vectorized operations between chunk and the special
character vectors to check if the buffer contains any special
characters.
4-1. If no special characters are found, advance the input_buf_ptr by
sizeof(Vector8).
4-2. If special characters are found, advance the input_buf_ptr as far
as possible, then fall back to the original text/CSV parser for
byte-by-byte processing.
Test
====
I tested the performance by measuring the time it takes to load a CSV
file created using the attached SQL script with the following COPY
command:
=# COPY t FROM '/tmp/t.csv' (FORMAT csv);
Environment
-----------
OS: Rocky Linux 9.6
CPU: Intel Core i7-10710U (6 Cores / 12 Threads, 1.1 GHz Base / 4.7
GHz Boost, AVX2 & FMA supported)
Time
----
master: 02:44.943
patch applied: 02:36.878 (about 5% faster)
Perf
----
The call graphs for each run are attached, and the rates of CopyReadLineText are:
master: 12.15%
patch applied: 8.04%
Thoughts?
=========
I would appreciate feedback on the implementation and any suggestions
for further improvement.
--
Best regards,
Shinya Kato
NTT OSS Center
Attachments:
v1-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch
From 5ae3be7d262e4251bf21ac0c73b3e0ebc2ba615d Mon Sep 17 00:00:00 2001
From: Shinya Kato <shinya11.kato@gmail.com>
Date: Mon, 28 Jul 2025 22:08:20 +0900
Subject: [PATCH v1] Speed up COPY FROM text/CSV parsing using SIMD
The inner loop of CopyReadLineText scans for newlines and other special
characters by processing the input byte-by-byte. For large inputs, this
can be a performance bottleneck.
This commit introduces a SIMD-accelerated path. When not parsing inside
a quoted field, we can use vector instructions to scan the input buffer
for any character of interest in 16-byte chunks. This significantly
improves performance, especially for data with long, unquoted fields.
---
src/backend/commands/copyfromparse.c | 72 ++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index b1ae97b833d..5aba0fa6cb7 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -71,7 +71,9 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_bitutils.h"
#include "port/pg_bswap.h"
+#include "port/simd.h"
#include "utils/builtins.h"
#include "utils/rel.h"
@@ -1255,6 +1257,14 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
char quotec = '\0';
char escapec = '\0';
+#ifndef USE_NO_SIMD
+ Vector8 nl = vector8_broadcast('\n');
+ Vector8 cr = vector8_broadcast('\r');
+ Vector8 bs = vector8_broadcast('\\');
+ Vector8 quote;
+ Vector8 escape;
+#endif
+
if (is_csv)
{
quotec = cstate->opts.quote[0];
@@ -1262,6 +1272,12 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* ignore special escape processing if it's the same as quotec */
if (quotec == escapec)
escapec = '\0';
+
+#ifndef USE_NO_SIMD
+ quote = vector8_broadcast(quotec);
+ if (quotec != escapec)
+ escape = vector8_broadcast(escapec);
+#endif
}
/*
@@ -1328,6 +1344,62 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
need_data = false;
}
+#ifndef USE_NO_SIMD
+ /*
+ * SIMD instructions are used here to efficiently scan the input buffer
+ * for special characters (e.g., newline, carriage return, quotes, or
+ * escape characters). This approach significantly improves performance
+ * compared to byte-by-byte iteration, especially for large input
+ * buffers.
+ *
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.
+ * - When the remaining buffer size is smaller than the size of a SIMD
+ * vector register, as SIMD operations require processing data in
+ * fixed-size chunks.
+ */
+ if (!in_quote && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match;
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
+
+ /* Create a mask of all special characters we need to stop at */
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ match = vector8_or(match, vector8_eq(chunk, bs));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+ input_buf_ptr += advance;
+ }
+ else
+ {
+ /* No special characters found, so skip the entire chunk */
+ input_buf_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
/* OK to fetch a character */
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];
--
2.47.1
Hi,
Thank you for working on this!
On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
Hi hackers,
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
I have been working on the same idea. I was not moving input_buf_ptr
as far as possible, so I think your approach is better.
Also, I did a benchmark on text format. I created a benchmark for line
length in a table being from 1 byte to 1 megabyte. The peak improvement
is line length being 4096 and the improvement is more than 20% [1], I
saw no regression on your patch.
Idea
====
The current text/CSV parser processes input byte-by-byte, checking
whether each byte is a special character (\n, \r, quote, escape) or a
regular character, and transitions states in a state machine. This
sequential processing is inefficient and likely causes frequent branch
mispredictions due to the many if statements.
I thought this problem could be addressed by leveraging SIMD and
vectorized operations for faster processing.
Implementation Overview
=======================
1. Create a vector of special characters (e.g., Vector8 nl =
vector8_broadcast('\n');).
2. Load the input buffer into a Vector8 variable called chunk.
3. Perform vectorized operations between chunk and the special
character vectors to check if the buffer contains any special
characters.
4-1. If no special characters are found, advance the input_buf_ptr by
sizeof(Vector8).
4-2. If special characters are found, advance the input_buf_ptr as far
as possible, then fall back to the original text/CSV parser for
byte-by-byte processing.
...
Thought?
I would appreciate feedback on the implementation and any suggestions
for further improvement.
I have a couple of ideas that I was working on:
---
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.
I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
---
+ * - When the remaining buffer size is smaller than the size of a SIMD
+ * vector register, as SIMD operations require processing data in
+ * fixed-size chunks.
You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)',
but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
copy_buf_len || need_data', so basically you need to wait for at least
sizeof(Vector8) characters to pass before the next SIMD scan. And in
the worst case, if CopyLoadInputBuf() puts one character less than
sizeof(Vector8) into the buffer, you can never run SIMD. I think we
need to make sure that CopyLoadInputBuf() loads at least
sizeof(Vector8) characters into the input_buf so we do not encounter
that problem.
---
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
---
I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
radar for this thread). I hope these help, please let me know if you
have any questions.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
Feedback.txt
From b13f4cdf134eef5fbecf9ea06f9b1c99890b7c02 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 7 Aug 2025 13:27:34 +0300
Subject: [PATCH] Feedback
---
src/backend/commands/copyfromparse.c | 55 ++++++++++++++++++++++++++--
1 file changed, 51 insertions(+), 4 deletions(-)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 5aba0fa6cb7..dae5c1f698c 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -670,8 +670,12 @@ CopyLoadInputBuf(CopyFromState cstate)
/* If we now have some unconverted data, try to convert it */
CopyConvertBuf(cstate);
- /* If we now have some more input bytes ready, return them */
- if (INPUT_BUF_BYTES(cstate) > nbytes)
+ /*
+ * If we now have at least sizeof(Vector8) input bytes ready, return
+ * them. This is beneficial for SIMD processing in the
+ * CopyReadLineText() function.
+ */
+ if (INPUT_BUF_BYTES(cstate) > nbytes + sizeof(Vector8))
return;
/*
@@ -1322,7 +1326,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* unsafe with the old v2 COPY protocol, but we don't support that
* anymore.
*/
- if (input_buf_ptr >= copy_buf_len || need_data)
+ if (input_buf_ptr + sizeof(Vector8) >= copy_buf_len || need_data)
{
REFILL_LINEBUF;
@@ -1359,7 +1363,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* vector register, as SIMD operations require processing data in
* fixed-size chunks.
*/
- if (!in_quote && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match;
@@ -1395,6 +1399,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
{
/* No special characters found, so skip the entire chunk */
input_buf_ptr += sizeof(Vector8);
+ last_was_esc = false;
continue;
}
}
@@ -1650,6 +1655,11 @@ CopyReadAttributesText(CopyFromState cstate)
char *cur_ptr;
char *line_end_ptr;
+#ifndef USE_NO_SIMD
+ Vector8 bs = vector8_broadcast('\\');
+ Vector8 delim = vector8_broadcast(delimc);
+#endif
+
/*
* We need a special case for zero-column tables: check that the input
* line is empty, and return.
@@ -1717,6 +1727,43 @@ CopyReadAttributesText(CopyFromState cstate)
{
char c;
+#ifndef USE_NO_SIMD
+ if (line_end_ptr - cur_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match;
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) cur_ptr);
+
+ /* Create a mask of all special characters we need to stop at */
+ match = vector8_or(vector8_eq(chunk, bs), vector8_eq(chunk, delim));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+ memcpy(output_ptr, cur_ptr, advance);
+ output_ptr += advance;
+ cur_ptr += advance;
+ }
+ else
+ {
+ /* No special characters found, so skip the entire chunk */
+ memcpy(output_ptr, cur_ptr, sizeof(Vector8));
+ output_ptr += sizeof(Vector8);
+ cur_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
end_ptr = cur_ptr;
if (cur_ptr >= line_end_ptr)
break;
--
2.50.1
Hi,
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
Also, I did a benchmark on text format. I created a benchmark for line
length in a table being from 1 byte to 1 megabyte. The peak improvement
is line length being 4096 and the improvement is more than 20% [1], I
saw no regression on your patch.
I did the same benchmark for the CSV format. The peak improvement is
line length being 4096 and the improvement is more than 25% [1]. I saw
a 5% regression on the 1 byte benchmark; there are no other
regressions.
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
I wanted to try using SIMD in CopyReadAttributesCSV() as well. The
improvement on the 4096 byte line length input [1] goes from 25% to
35%, the regression on the 1 byte input is the same.
CopyReadAttributesCSV() changes are attached as feedback v2.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v2-0001-Feedback.txt
From 203d648c4cf64c6d629f2abc719a371dd0393e22 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Thu, 7 Aug 2025 13:27:34 +0300
Subject: [PATCH v2] Feedback
---
src/backend/commands/copyfromparse.c | 176 ++++++++++++++++++++++++---
1 file changed, 160 insertions(+), 16 deletions(-)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 5aba0fa6cb7..7b83e64e23b 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -670,8 +670,12 @@ CopyLoadInputBuf(CopyFromState cstate)
/* If we now have some unconverted data, try to convert it */
CopyConvertBuf(cstate);
- /* If we now have some more input bytes ready, return them */
- if (INPUT_BUF_BYTES(cstate) > nbytes)
+ /*
+ * If we now have at least sizeof(Vector8) input bytes ready, return
+ * them. This is beneficial for SIMD processing in the
+ * CopyReadLineText() function.
+ */
+ if (INPUT_BUF_BYTES(cstate) > nbytes + sizeof(Vector8))
return;
/*
@@ -1322,7 +1326,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* unsafe with the old v2 COPY protocol, but we don't support that
* anymore.
*/
- if (input_buf_ptr >= copy_buf_len || need_data)
+ if (input_buf_ptr + sizeof(Vector8) >= copy_buf_len || need_data)
{
REFILL_LINEBUF;
@@ -1345,21 +1349,22 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
}
#ifndef USE_NO_SIMD
+
/*
- * SIMD instructions are used here to efficiently scan the input buffer
- * for special characters (e.g., newline, carriage return, quotes, or
- * escape characters). This approach significantly improves performance
- * compared to byte-by-byte iteration, especially for large input
- * buffers.
+ * SIMD instructions are used here to efficiently scan the input
+ * buffer for special characters (e.g., newline, carriage return,
+ * quotes, or escape characters). This approach significantly improves
+ * performance compared to byte-by-byte iteration, especially for
+ * large input buffers.
*
- * However, SIMD optimization cannot be applied in the following cases:
- * - Inside quoted fields, where escape sequences and closing quotes
- * require sequential processing to handle correctly.
- * - When the remaining buffer size is smaller than the size of a SIMD
- * vector register, as SIMD operations require processing data in
- * fixed-size chunks.
+ * However, SIMD optimization cannot be applied in the following
+ * cases: - Inside quoted fields, where escape sequences and closing
+ * quotes require sequential processing to handle correctly. - When
+ * the remaining buffer size is smaller than the size of a SIMD vector
+ * register, as SIMD operations require processing data in fixed-size
+ * chunks.
*/
- if (!in_quote && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match;
@@ -1388,13 +1393,15 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* Found a special character. Advance up to that point and let
* the scalar code handle it.
*/
- int advance = pg_rightmost_one_pos32(mask);
+ int advance = pg_rightmost_one_pos32(mask);
+
input_buf_ptr += advance;
}
else
{
/* No special characters found, so skip the entire chunk */
input_buf_ptr += sizeof(Vector8);
+ last_was_esc = false;
continue;
}
}
@@ -1650,6 +1657,11 @@ CopyReadAttributesText(CopyFromState cstate)
char *cur_ptr;
char *line_end_ptr;
+#ifndef USE_NO_SIMD
+ Vector8 bs = vector8_broadcast('\\');
+ Vector8 delim = vector8_broadcast(delimc);
+#endif
+
/*
* We need a special case for zero-column tables: check that the input
* line is empty, and return.
@@ -1717,6 +1729,44 @@ CopyReadAttributesText(CopyFromState cstate)
{
char c;
+#ifndef USE_NO_SIMD
+ if (line_end_ptr - cur_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match;
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) cur_ptr);
+
+ /* Create a mask of all special characters we need to stop at */
+ match = vector8_or(vector8_eq(chunk, bs), vector8_eq(chunk, delim));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point and
+ * let the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ memcpy(output_ptr, cur_ptr, advance);
+ output_ptr += advance;
+ cur_ptr += advance;
+ }
+ else
+ {
+ /* No special characters found, so skip the entire chunk */
+ memcpy(output_ptr, cur_ptr, sizeof(Vector8));
+ output_ptr += sizeof(Vector8);
+ cur_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
end_ptr = cur_ptr;
if (cur_ptr >= line_end_ptr)
break;
@@ -1906,6 +1956,12 @@ CopyReadAttributesCSV(CopyFromState cstate)
char *cur_ptr;
char *line_end_ptr;
+#ifndef USE_NO_SIMD
+ Vector8 quote = vector8_broadcast(quotec);
+ Vector8 delim = vector8_broadcast(delimc);
+ Vector8 escape = vector8_broadcast(escapec);
+#endif
+
/*
* We need a special case for zero-column tables: check that the input
* line is empty, and return.
@@ -1972,6 +2028,50 @@ CopyReadAttributesCSV(CopyFromState cstate)
/* Not in quote */
for (;;)
{
+#ifndef USE_NO_SIMD
+ if (line_end_ptr - cur_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match;
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) cur_ptr);
+
+ /*
+ * Create a mask of all special characters we need to stop
+ * at
+ */
+ match = vector8_or(vector8_eq(chunk, quote), vector8_eq(chunk, delim));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point
+ * and let the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ memcpy(output_ptr, cur_ptr, advance);
+ output_ptr += advance;
+ cur_ptr += advance;
+ }
+ else
+ {
+ /*
+ * No special characters found, so skip the entire
+ * chunk
+ */
+ memcpy(output_ptr, cur_ptr, sizeof(Vector8));
+ output_ptr += sizeof(Vector8);
+ cur_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
end_ptr = cur_ptr;
if (cur_ptr >= line_end_ptr)
goto endfield;
@@ -1995,6 +2095,50 @@ CopyReadAttributesCSV(CopyFromState cstate)
/* In quote */
for (;;)
{
+#ifndef USE_NO_SIMD
+ if (line_end_ptr - cur_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match;
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) cur_ptr);
+
+ /*
+ * Create a mask of all special characters we need to stop
+ * at
+ */
+ match = vector8_or(vector8_eq(chunk, quote), vector8_eq(chunk, escape));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point
+ * and let the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ memcpy(output_ptr, cur_ptr, advance);
+ output_ptr += advance;
+ cur_ptr += advance;
+ }
+ else
+ {
+ /*
+ * No special characters found, so skip the entire
+ * chunk
+ */
+ memcpy(output_ptr, cur_ptr, sizeof(Vector8));
+ output_ptr += sizeof(Vector8);
+ cur_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
end_ptr = cur_ptr;
if (cur_ptr >= line_end_ptr)
ereport(ERROR,
--
2.50.1
On Thu, Aug 7, 2025 at 8:15 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
Thank you for working on this!
On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
Hi hackers,
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
I have been working on the same idea. I was not moving input_buf_ptr
as far as possible, so I think your approach is better.
Great. I'm looking forward to working with you on this feature implementation.
Also, I did a benchmark on text format. I created a benchmark for line
length in a table being from 1 byte to 1 megabyte. The peak improvement
is line length being 4096 and the improvement is more than 20% [1], I
saw no regression on your patch.
Thank you for the additional benchmarks.
I have a couple of ideas that I was working on:
---
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.
I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
That's a clever point that last_was_esc should be reset to false when
a SIMD chunk is skipped. You're right about that specific case.
However, the core challenge is not what happens when we skip a chunk,
but what happens when a chunk contains special characters like quotes
or escapes. The main reason we avoid SIMD inside quoted fields is that
the parsing logic becomes fundamentally sequential and
context-dependent.
To correctly parse a "" as a single literal quote, we must perform a
lookahead to check the next character. This is an inherently
sequential operation that doesn't map well to SIMD's parallel nature.
Trying to handle this stateful logic with SIMD would lead to
significant implementation complexity, especially with edge cases like
an escape character falling on the last byte of a chunk.
+ * - When the remaining buffer size is smaller than the size of a SIMD
+ * vector register, as SIMD operations require processing data in
+ * fixed-size chunks.
You run SIMD when 'copy_buf_len - input_buf_ptr >= sizeof(Vector8)'
but you only call CopyLoadInputBuf() when 'input_buf_ptr >=
copy_buf_len || need_data' so basically you need to wait at least the
sizeof(Vector8) character to pass for the next SIMD. And in the worst
case; if CopyLoadInputBuf() puts one character less than
sizeof(Vector8), then you can't ever run SIMD. I think we need to make
sure that CopyLoadInputBuf() loads at least the sizeof(Vector8)
character to the input_buf so we do not encounter that problem.
I think you're probably right, but we only need to account for
sizeof(Vector8) when USE_NO_SIMD is not defined.
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
Agreed, I will.
I shared my ideas as a Feedback.txt file (.txt to stay off CFBot's
radar for this thread). I hope these help, please let me know if you
have any questions.
Thanks a lot!
On Mon, Aug 11, 2025 at 5:52 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
On Thu, 7 Aug 2025 at 04:49, Shinya Kato <shinya11.kato@gmail.com> wrote:
I have implemented SIMD optimization for the COPY FROM (FORMAT {csv,
text}) command and observed approximately a 5% performance
improvement. Please see the detailed test results below.
Also, I did a benchmark on text format. I created a benchmark for line
length in a table being from 1 byte to 1 megabyte. The peak improvement
is line length being 4096 and the improvement is more than 20% [1], I
saw no regression on your patch.
I did the same benchmark for the CSV format. The peak improvement is
line length being 4096 and the improvement is more than 25% [1]. I saw
a 5% regression on the 1 byte benchmark, there are no other
regressions.
Thank you. I'm not too concerned about a regression when there's only
one byte per line.
What do you think about adding SIMD to CopyReadAttributesText() and
CopyReadAttributesCSV() functions? When I add your SIMD approach to
CopyReadAttributesText() function, the improvement on the 4096 byte
line length input [1] goes from 20% to 30%.
I wanted to try using SIMD in CopyReadAttributesCSV() as well. The
improvement on the 4096 byte line length input [1] goes from 25% to
35%, the regression on the 1 byte input is the same.
Yes, I'm on it. I'm currently adding the SIMD logic to
CopyReadAttributesCSV() as you suggested. I'll share the new version
of the patch soon.
--
Best regards,
Shinya Kato
NTT OSS Center
On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11.kato@gmail.com> wrote:
+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.
I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
That's a clever point that last_was_esc should be reset to false when
a SIMD chunk is skipped. You're right about that specific case.
However, the core challenge is not what happens when we skip a chunk,
but what happens when a chunk contains special characters like quotes
or escapes. The main reason we avoid SIMD inside quoted fields is that
the parsing logic becomes fundamentally sequential and
context-dependent.
To correctly parse a "" as a single literal quote, we must perform a
lookahead to check the next character. This is an inherently
sequential operation that doesn't map well to SIMD's parallel nature.
Trying to handle this stateful logic with SIMD would lead to
significant implementation complexity, especially with edge cases like
an escape character falling on the last byte of a chunk.
Ah, you're right. My apologies, I misunderstood the implementation. It
appears that SIMD can be used even within quoted strings.
I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place.
Furthermore, when inside a quote (in_quote) in CSV mode, the detection
of \n and \r can be disabled.
+ last_was_esc = false;
Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it’s best to keep the current logic for toggling
last_was_esc.
+ int advance = pg_rightmost_one_pos32(mask);
+ input_buf_ptr += advance;
I've attached a new patch that includes these changes. Further
modifications are still in progress.
--
Best regards,
Shinya Kato
NTT OSS Center
Attachments:
v2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch
From 69e16f8c7a52d967385a1dc9b1602bbd4472df60 Mon Sep 17 00:00:00 2001
From: Shinya Kato <shinya11.kato@gmail.com>
Date: Mon, 28 Jul 2025 22:08:20 +0900
Subject: [PATCH v2] Speed up COPY FROM text/CSV parsing using SIMD
---
src/backend/commands/copyfromparse.c | 71 ++++++++++++++++++++++++++++
1 file changed, 71 insertions(+)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index b1ae97b833d..f1a6ea81dd1 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -71,7 +71,9 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_bitutils.h"
#include "port/pg_bswap.h"
+#include "port/simd.h"
#include "utils/builtins.h"
#include "utils/rel.h"
@@ -1255,6 +1257,14 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
char quotec = '\0';
char escapec = '\0';
+#ifndef USE_NO_SIMD
+ Vector8 nl = vector8_broadcast('\n');
+ Vector8 cr = vector8_broadcast('\r');
+ Vector8 bs = vector8_broadcast('\\');
+ Vector8 quote;
+ Vector8 escape;
+#endif
+
if (is_csv)
{
quotec = cstate->opts.quote[0];
@@ -1262,6 +1272,12 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* ignore special escape processing if it's the same as quotec */
if (quotec == escapec)
escapec = '\0';
+
+#ifndef USE_NO_SIMD
+ quote = vector8_broadcast(quotec);
+ if (quotec != escapec)
+ escape = vector8_broadcast(escapec);
+#endif
}
/*
@@ -1328,6 +1344,61 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
need_data = false;
}
+#ifndef USE_NO_SIMD
+ /*
+ * Use SIMD instructions to efficiently scan the input buffer for
+ * special characters (e.g., newline, carriage return, quote, and
+ * escape). This is faster than byte-by-byte iteration, especially on
+ * large buffers.
+ *
+ * We do not apply the SIMD fast path in either of the following cases:
+ * - When the previously processed character was an escape character
+ * (last_was_esc), since the next byte must be examined sequentially.
+ * - The remaining buffer is smaller than one vector width
+ * (sizeof(Vector8)); SIMD operates on fixed-size chunks.
+ */
+ if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match;
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
+
+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ match = vector8_or(match, vector8_eq(chunk, bs));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+ input_buf_ptr += advance;
+ }
+ else
+ {
+ /* No special characters found, so skip the entire chunk */
+ input_buf_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
/* OK to fetch a character */
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];
--
2.47.3
Following Nazir's findings about 4096 bytes being the performant line
length, I did more benchmarks from my side on both TEXT and CSV formats
with two different cases of normal data (no special characters) and data
with many special characters.
Results are as good as expected and similar to previous benchmarks:
~30.9% faster copy in TEXT format
~32.4% faster copy in CSV format
20%-30% reduction in cycles per instruction
In the case of lines with a lot of special characters (e.g., tables
with large numbers of columns), we obviously expect regressions here
because of the overhead of many fallbacks to scalar processing.
Results for special characters making up 1/3 of the line length:
~43.9% slower copy in TEXT format
~16.7% slower copy in CSV format
So even with fewer occurrences of special characters, or wider spacing
between them, there might still be some regressions. That may be an
insignificant case, but it can be addressed in later patches if we
decide to sometimes skip the SIMD path.
I hope this helps more and confirms the patch.
Regards,
Ayoub Kazar
On Thu, 14 Aug 2025 at 01:55, Shinya Kato <shinya11.kato@gmail.com> wrote:
On Tue, Aug 12, 2025 at 4:25 PM Shinya Kato <shinya11.kato@gmail.com> wrote:

+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.

I think you can continue SIMD inside quoted fields. The only important
thing is that you need to set last_was_esc to false when SIMD skipped
the chunk.

That's a clever point that last_was_esc should be reset to false when
a SIMD chunk is skipped. You're right about that specific case.

However, the core challenge is not what happens when we skip a chunk,
but what happens when a chunk contains special characters like quotes
or escapes. The main reason we avoid SIMD inside quoted fields is that
the parsing logic becomes fundamentally sequential and
context-dependent.

To correctly parse "" as a single literal quote, we must perform a
lookahead to check the next character. This is an inherently
sequential operation that doesn't map well to SIMD's parallel nature.

Trying to handle this stateful logic with SIMD would lead to
significant implementation complexity, especially with edge cases like
an escape character falling on the last byte of a chunk.

Ah, you're right. My apologies, I misunderstood the implementation. It
appears that SIMD can be used even within quoted strings.

I think it would be better not to use the SIMD path when last_was_esc
is true. The next character is likely to be a special character, and
handling this case outside the SIMD loop would also improve
readability by consolidating the last_was_esc toggle logic in one
place.

Furthermore, when inside a quote (in_quote) in CSV mode, the detection
of \n and \r can be disabled.

+ last_was_esc = false;

Regarding the implementation, I believe we must set last_was_esc to
false when advancing input_buf_ptr, as shown in the code below. For
this reason, I think it's best to keep the current logic for toggling
last_was_esc.

+ int advance = pg_rightmost_one_pos32(mask);
+ input_buf_ptr += advance;

I've attached a new patch that includes these changes. Further
modifications are still in progress.

--
Best regards,
Shinya Kato
NTT OSS Center
Hi,
On Thu, 14 Aug 2025 at 05:25, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Following Nazir's findings about 4096 bytes being the performant line length, I did more benchmarks on both TEXT and CSV formats with two different cases: normal data (no special characters) and data with many special characters.

Results are as good as expected and similar to previous benchmarks:
~30.9% faster copy in TEXT format
~32.4% faster copy in CSV format
20%-30% reduction in cycles per instruction

In the case of lines with a lot of special characters (e.g., tables with large numbers of columns), we obviously expect regressions here because of the overhead of many fallbacks to scalar processing.

Results for special characters making up 1/3 of the line length:
~43.9% slower copy in TEXT format
~16.7% slower copy in CSV format

So even with fewer occurrences of special characters, or wider spacing between them, there might still be some regressions. That may be an insignificant case, but it can be addressed in later patches if we decide to sometimes skip the SIMD path.

I hope this helps more and confirms the patch.
Thanks for running that benchmark! Would you mind sharing a reproducer
for the regression you observed?
--
Regards,
Nazir Bilal Yavuz
Microsoft
Of course, I attached the sql to generate the text and csv test files.
If a third of the line length being special characters seems like an
exaggeration, something lower might still reproduce some regression
along the same lines.
Best regards,
Ayoub Kazar
Attachments:
On Thu, 7 Aug 2025 at 14:15, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
I have a couple of ideas that I was working on:
---

+ * However, SIMD optimization cannot be applied in the following cases:
+ * - Inside quoted fields, where escape sequences and closing quotes
+ * require sequential processing to handle correctly.

I think you can continue SIMD inside quoted fields. Only important
thing is you need to set last_was_esc to false when SIMD skipped the
chunk.
There is a trick with doing carryless multiplication with -1 that can
be used to SIMD process transitions between quoted/not-quoted. [1]
This is able to convert a bitmask of unescaped quote character
positions to a quote mask in a single operation. I last looked at it 5
years ago, but I remember coming to the conclusion that it would work
for implementing PostgreSQL's interpretation of CSV.
[1]: https://github.com/geofflangdale/simdcsv/blob/master/src/main.cpp#L76
--
Ants
Hi,
On Thu, 14 Aug 2025 at 18:00, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Thanks for running that benchmark! Would you mind sharing a reproducer
for the regression you observed?

Of course, I attached the sql to generate the text and csv test files.
If a third of the line length being special characters seems like an
exaggeration, something lower might still reproduce some regression
along the same lines.
Thank you so much!
I am able to reproduce the regression you mentioned, but both
regressions are 20% on my end. I found (by experimenting) that SIMD
causes a regression if it advances fewer than 5 characters.
So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).
- Each time advance < 5, n is doubled.
- Each time advance ≥ 5, n is halved.
I am sharing a POC patch to show the heuristic; it can be applied on
top of v1-0001. The heuristic version has the same performance
improvements as v1-0001, but the regression is 5% instead of 20%
compared to master.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
COPY-SIMD-add-heuristic-to-avoid-regression-on-sm.txttext/plain; charset=UTF-8; name=COPY-SIMD-add-heuristic-to-avoid-regression-on-sm.txtDownload
From aa55843b0c64bed9f72cf8cd7854df9df7ef989b Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 19 Aug 2025 15:16:02 +0300
Subject: [PATCH v1] COPY SIMD: add heuristic to avoid regression on small
advances
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
When SIMD advances fewer than 5 characters, performance regresses.
To mitigate this, introduce a heuristic:
- If advance < 5 -> insert a sleep penalty (n cycles).
- Each time advance < 5, n is doubled.
- Each time advance ≥ 5, n is halved.
---
src/backend/commands/copyfromparse.c | 42 ++++++++++++++++++++++++++--
1 file changed, 40 insertions(+), 2 deletions(-)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 5aba0fa6cb7..e58d7d4e353 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1263,6 +1263,9 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
Vector8 bs = vector8_broadcast('\\');
Vector8 quote;
Vector8 escape;
+
+ int sleep_cyle = 0;
+ int last_sleep_cyle = 1;
#endif
if (is_csv)
@@ -1359,7 +1362,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* vector register, as SIMD operations require processing data in
* fixed-size chunks.
*/
- if (!in_quote && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (sleep_cyle <= 0 && !in_quote && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match;
@@ -1390,14 +1393,49 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
*/
int advance = pg_rightmost_one_pos32(mask);
input_buf_ptr += advance;
+
+ /*
+ * If we advance fewer than 5 characters we cause a regression.
+ * Sleep a bit, then try again. Sleep time increases
+ * exponentially.
+ */
+ if (advance < 5)
+ {
+ if (last_sleep_cyle >= PG_INT16_MAX / 2)
+ last_sleep_cyle = PG_INT16_MAX;
+ else
+ last_sleep_cyle = last_sleep_cyle << 1;
+
+ sleep_cyle = last_sleep_cyle;
+ }
+
+ /*
+ * If we advance more than 4 characters, this means we have a
+ * performance improvement. Halve the sleep time for the next sleep.
+ */
+ else
+ {
+ last_sleep_cyle = Max(last_sleep_cyle >> 1, 1);
+ sleep_cyle = 0;
+ }
}
else
{
- /* No special characters found, so skip the entire chunk */
+ /*
+ * No special characters found, so skip the entire chunk and
+ * halve sleep time for next sleep.
+ */
input_buf_ptr += sizeof(Vector8);
+ last_sleep_cyle = Max(last_sleep_cyle >> 1, 1);
continue;
}
}
+
+ /*
+ * Vulnerable to overflow if we are in quote for more than INT16_MAX
+ * characters.
+ */
+ sleep_cyle--;
#endif
/* OK to fetch a character */
--
2.50.1
Hi,
On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.

So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).
'sleep' might be a poor word choice here. I meant skipping SIMD for n
number of times.
--
Regards,
Nazir Bilal Yavuz
Microsoft
On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
Hi,
On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.

So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).
'sleep' might be a poor word choice here. I meant skipping SIMD for n
number of times.
I was thinking a bit about that this morning. Instead of having a
constantly applied heuristic like this, I wonder if it might be better
to do a little extra accounting in the first, say, 1000 lines of an
input file, and if less than some portion of the input is found to be
special characters, then switch to the SIMD code. What that
portion should be would need to be determined by some experimentation
with a variety of typical workloads, but given your findings 20% seems
like a good starting point.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, 14 Aug 2025 at 18:00, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Thanks for running that benchmark! Would you mind sharing a reproducer
for the regression you observed?

Of course, I attached the sql to generate the text and csv test files.
If having a 1/3 of line length of special characters can be an
exaggeration, something lower might still reproduce some regressions of
course for the same idea.

Thank you so much!

I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.

So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).
- Each time advance < 5, n is doubled.
- Each time advance ≥ 5, n is halved.

I am sharing a POC patch to show heuristic, it can be applied on top
of v1-0001. Heuristic version has the same performance improvements
with the v1-0001 but the regression is %5 instead of %20 compared to
the master.

--
Regards,
Nazir Bilal Yavuz
Microsoft
Yes, this is good; I'm also getting only about a 5% regression now.
Regards,
Ayoub Kazar
Hi,
On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
Hi,
On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.

So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).

'sleep' might be a poor word choice here. I meant skipping SIMD for n
number of times.

I was thinking a bit about that this morning. Instead of having a
constantly applied heuristic like this, I wonder if it might be better
to do a little extra accounting in the first, say, 1000 lines of an
input file, and if less than some portion of the input is found to be
special characters, then switch to the SIMD code. What that portion
should be would need to be determined by some experimentation with a
variety of typical workloads, but given your findings 20% seems like a
good starting point.
I implemented a heuristic similar to this. It is a mix of the
previous heuristic and your idea, and it works like this:
Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.
1 - We will try SIMD and decide if it is worth it to run SIMD.
1.1 - If it is worth it, we will continue to run SIMD and we will
halve the simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
and we will not run SIMD for these many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we will go back to the #1.
Note: simd_last_sleep_cycle cannot exceed 1024, so we will retry SIMD
at least once every 1024 lines.
With this heuristic, the regression is limited to 2% in the worst case.
Patches are attached, the first patch is v2-0001 from Shinya with the
'-Werror=maybe-uninitialized' fixes and the pgindent changes. 0002 is
the actual heuristic patch.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v3-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchDownload
From 2d2372e90305a81c80fe182003933039bf32f97e Mon Sep 17 00:00:00 2001
From: Shinya Kato <shinya11.kato@gmail.com>
Date: Mon, 28 Jul 2025 22:08:20 +0900
Subject: [PATCH v3 1/2] Speed up COPY FROM text/CSV parsing using SIMD
---
src/backend/commands/copyfromparse.c | 73 ++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index b1ae97b833d..99959a40fab 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -71,7 +71,9 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_bitutils.h"
#include "port/pg_bswap.h"
+#include "port/simd.h"
#include "utils/builtins.h"
#include "utils/rel.h"
@@ -1255,6 +1257,14 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
char quotec = '\0';
char escapec = '\0';
+#ifndef USE_NO_SIMD
+ Vector8 nl = vector8_broadcast('\n');
+ Vector8 cr = vector8_broadcast('\r');
+ Vector8 bs = vector8_broadcast('\\');
+ Vector8 quote = vector8_broadcast(0);
+ Vector8 escape = vector8_broadcast(0);
+#endif
+
if (is_csv)
{
quotec = cstate->opts.quote[0];
@@ -1262,6 +1272,12 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* ignore special escape processing if it's the same as quotec */
if (quotec == escapec)
escapec = '\0';
+
+#ifndef USE_NO_SIMD
+ quote = vector8_broadcast(quotec);
+ if (quotec != escapec)
+ escape = vector8_broadcast(escapec);
+#endif
}
/*
@@ -1328,6 +1344,63 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
need_data = false;
}
+#ifndef USE_NO_SIMD
+
+ /*
+ * Use SIMD instructions to efficiently scan the input buffer for
+ * special characters (e.g., newline, carriage return, quote, and
+ * escape). This is faster than byte-by-byte iteration, especially on
+ * large buffers.
+ *
+ * We do not apply the SIMD fast path in either of the following
+ * cases: - When the previously processed character was an escape
+ * character (last_was_esc), since the next byte must be examined
+ * sequentially. - The remaining buffer is smaller than one vector
+ * width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
+ */
+ if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match = vector8_broadcast(0);
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
+
+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ match = vector8_or(match, vector8_eq(chunk, bs));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ input_buf_ptr += advance;
+ }
+ else
+ {
+ /* No special characters found, so skip the entire chunk */
+ input_buf_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
/* OK to fetch a character */
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];
--
2.51.0
v3-0002-COPY-SIMD-per-line-heuristic.patchtext/x-patch; charset=US-ASCII; name=v3-0002-COPY-SIMD-per-line-heuristic.patchDownload
From ad050583d3c14bdec44266d8d2110b384fa9d7dc Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 14 Oct 2025 13:18:13 +0300
Subject: [PATCH v3 2/2] COPY SIMD per-line heuristic
---
src/include/commands/copyfrom_internal.h | 7 ++
src/backend/commands/copyfrom.c | 6 ++
src/backend/commands/copyfromparse.c | 82 ++++++++++++++++++++++--
3 files changed, 89 insertions(+), 6 deletions(-)
diff --git a/src/include/commands/copyfrom_internal.h b/src/include/commands/copyfrom_internal.h
index c8b22af22d8..9dd31320f52 100644
--- a/src/include/commands/copyfrom_internal.h
+++ b/src/include/commands/copyfrom_internal.h
@@ -89,6 +89,13 @@ typedef struct CopyFromStateData
const char *cur_attval; /* current att value for error messages */
bool relname_only; /* don't output line number, att, etc. */
+ /* SIMD variables */
+ bool simd_continue;
+ bool simd_initialized;
+ uint16 simd_last_sleep_cycle;
+ uint16 simd_current_sleep_cycle;
+
+
/*
* Working state
*/
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 12781963b4f..4bdfd96c244 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1721,6 +1721,12 @@ BeginCopyFrom(ParseState *pstate,
cstate->cur_attval = NULL;
cstate->relname_only = false;
+ /* Initialize SIMD variables */
+ cstate->simd_continue = false;
+ cstate->simd_initialized = false;
+ cstate->simd_current_sleep_cycle = 0;
+ cstate->simd_last_sleep_cycle = 0;
+
/*
* Allocate buffers for the input pipeline.
*
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 99959a40fab..24cef54e5e4 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -143,12 +143,14 @@ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
/* non-export function prototypes */
static bool CopyReadLine(CopyFromState cstate, bool is_csv);
-static bool CopyReadLineText(CopyFromState cstate, bool is_csv);
static int CopyReadAttributesText(CopyFromState cstate);
static int CopyReadAttributesCSV(CopyFromState cstate);
static Datum CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
Oid typioparam, int32 typmod,
bool *isnull);
+static pg_attribute_always_inline bool CopyReadLineText(CopyFromState cstate,
+ bool is_csv,
+ bool simd_continue);
static pg_attribute_always_inline bool CopyFromTextLikeOneRow(CopyFromState cstate,
ExprContext *econtext,
Datum *values,
@@ -1173,8 +1175,23 @@ CopyReadLine(CopyFromState cstate, bool is_csv)
resetStringInfo(&cstate->line_buf);
cstate->line_buf_valid = false;
- /* Parse data and transfer into line_buf */
- result = CopyReadLineText(cstate, is_csv);
+ /* If this is the first time we read, initialize the SIMD state */
+ if (unlikely(!cstate->simd_initialized))
+ {
+ cstate->simd_initialized = true;
+ cstate->simd_continue = true;
+ cstate->simd_current_sleep_cycle = 0;
+ cstate->simd_last_sleep_cycle = 0;
+ }
+
+ /*
+ * Parse data and transfer into line_buf. To get benefit from inlining,
+ * call CopyReadLineText() with the constant boolean variables.
+ */
+ if (cstate->simd_continue)
+ result = CopyReadLineText(cstate, is_csv, true);
+ else
+ result = CopyReadLineText(cstate, is_csv, false);
if (result)
{
@@ -1241,8 +1258,8 @@ CopyReadLine(CopyFromState cstate, bool is_csv)
/*
* CopyReadLineText - inner loop of CopyReadLine for text mode
*/
-static bool
-CopyReadLineText(CopyFromState cstate, bool is_csv)
+static pg_attribute_always_inline bool
+CopyReadLineText(CopyFromState cstate, bool is_csv, bool simd_continue)
{
char *copy_input_buf;
int input_buf_ptr;
@@ -1258,11 +1275,16 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
char escapec = '\0';
#ifndef USE_NO_SIMD
+#define SIMD_SLEEP_MAX 1024
+#define SIMD_ADVANCE_AT_LEAST 5
Vector8 nl = vector8_broadcast('\n');
Vector8 cr = vector8_broadcast('\r');
Vector8 bs = vector8_broadcast('\\');
Vector8 quote = vector8_broadcast(0);
Vector8 escape = vector8_broadcast(0);
+
+ uint64 simd_total_cycle = 0;
+ uint64 simd_total_advance = 0;
#endif
if (is_csv)
@@ -1358,12 +1380,14 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* sequentially. - The remaining buffer is smaller than one vector
* width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
*/
- if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (simd_continue && !last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match = vector8_broadcast(0);
uint32 mask;
+ simd_total_cycle++;
+
/* Load a chunk of data into a vector register */
vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
@@ -1391,11 +1415,13 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
int advance = pg_rightmost_one_pos32(mask);
input_buf_ptr += advance;
+ simd_total_advance += advance;
}
else
{
/* No special characters found, so skip the entire chunk */
input_buf_ptr += sizeof(Vector8);
+ simd_total_advance += sizeof(Vector8);
continue;
}
}
@@ -1603,6 +1629,50 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
}
} /* end of outer loop */
+#ifndef USE_NO_SIMD
+
+ /* SIMD was enabled */
+ if (simd_continue)
+ {
+ /* SIMD is worth it */
+ if (simd_total_cycle && simd_total_advance / simd_total_cycle >= SIMD_ADVANCE_AT_LEAST)
+ {
+ Assert(cstate->simd_current_sleep_cycle == 0);
+ cstate->simd_last_sleep_cycle >>= 1;
+ }
+ /* SIMD was enabled but it isn't worth it */
+ else
+ {
+ uint16 simd_last_sleep_cycle = cstate->simd_last_sleep_cycle;
+
+ cstate->simd_continue = false;
+
+ if (simd_last_sleep_cycle == 0)
+ simd_last_sleep_cycle = 1;
+ else if (simd_last_sleep_cycle >= SIMD_SLEEP_MAX / 2)
+ simd_last_sleep_cycle = SIMD_SLEEP_MAX;
+ else
+ simd_last_sleep_cycle <<= 1;
+ cstate->simd_current_sleep_cycle = simd_last_sleep_cycle;
+ cstate->simd_last_sleep_cycle = simd_last_sleep_cycle;
+ }
+ }
+ /* SIMD was disabled */
+ else
+ {
+ /*
+ * We should come here with decrementing
+ * cstate->simd_current_sleep_cycle from a positive number.
+ */
+ Assert(cstate->simd_current_sleep_cycle != 0);
+ cstate->simd_current_sleep_cycle--;
+
+ if (cstate->simd_current_sleep_cycle == 0)
+ cstate->simd_continue = true;
+ }
+
+#endif
+
/*
* Transfer any still-uncopied data to line_buf.
*/
--
2.51.0
Hello,
I've rebenchmarked the new heuristic patch. We still have the previous
improvements, ranging from 15% to 30%. For regressions, I see at most 3%
or 4% in the worst case, so this is solid.
I'm also trying the idea of doing SIMD inside quotes with a prefix XOR
using carryless multiplication, avoiding the slow path in all cases even
with weird-looking input. But it needs to take into account the
availability of the PCLMULQDQ instruction set via <wmmintrin.h>, and it
quickly starts to become dirty; alternatively, we can wait for the
decision to start requiring x86-64-v2 or v3, which has SSE4.2 and AVX2.
Regards,
Ayoub Kazar
Hi,
On Thu, 16 Oct 2025 at 17:29, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.
I had a typo there; the correct sentence is:
"Overall logic is that we *will* run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines."
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On Sat, 18 Oct 2025 at 21:46, KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Hello,
I've rebenchmarked the new heuristic patch. We still have the previous improvements ranging from 15% to 30%. For regressions I see at most 3% or 4% in the worst case, so this is solid.
Thank you so much for doing this! The results look nice, do you think
there are any other benchmarks that might be interesting to try?
I'm also trying the idea of doing SIMD inside quotes with prefix XOR using carry less multiplication avoiding the slow path in all cases even with weird looking input, but it needs to take into consideration the availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we go, it quickly starts to become dirty OR we can wait for the decision to start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.
I can not quite picture this, would you mind sharing a few examples or patches?
--
Regards,
Nazir Bilal Yavuz
Microsoft
On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
Hi,
On Thu, 21 Aug 2025 at 18:47, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-08-19 Tu 10:14 AM, Nazir Bilal Yavuz wrote:
Hi,
On Tue, 19 Aug 2025 at 15:33, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:

I am able to reproduce the regression you mentioned but both
regressions are %20 on my end. I found that (by experimenting) SIMD
causes a regression if it advances less than 5 characters.

So, I implemented a small heuristic. It works like that:
- If advance < 5 -> insert a sleep penalty (n cycles).

'sleep' might be a poor word choice here. I meant skipping SIMD for n
number of times.

I was thinking a bit about that this morning. I wonder if it might be
better instead of having a constantly applied heuristic like this, it
might be better to do a little extra accounting in the first, say, 1000
lines of an input file, and if less than some portion of the input is
found to be special characters then switch to the SIMD code. What that
portion should be would need to be determined by some experimentation
with a variety of typical workloads, but given your findings 20% seems
like a good starting point.

I implemented a heuristic similar to this. It is a mix of the
previous heuristic and your idea, and it works like this:

Overall logic is that we will not run SIMD for the entire line and we
decide if it is worth it to run SIMD for the next lines.

1 - We will try SIMD and decide if it is worth it to run SIMD.
1.1 - If it is worth it, we will continue to run SIMD and we will
halve the simd_last_sleep_cycle variable.
1.2 - If it is not worth it, we will double the simd_last_sleep_cycle
and we will not run SIMD for these many lines.
1.3 - After skipping simd_last_sleep_cycle lines, we will go back to #1.

Note: simd_last_sleep_cycle cannot exceed 1024, so we will retry SIMD
at least once every 1024 lines.

With this heuristic, the regression is limited to 2% in the worst case.
My worry is that the worst case is actually quite common. Sparse data
sets dominated by a lot of null values (and hence lots of special
characters) are very common. Are people prepared to accept a 2%
regression on load times for such data sets?
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
With this heuristic the regression is limited by %2 in the worst case.
My worry is that the worst case is actually quite common. Sparse data sets
dominated by a lot of null values (and hence lots of special characters) are
very common. Are people prepared to accept a 2% regression on load times for
such data sets?
Without knowing how common it is, I think it's difficult to judge whether
2% is a reasonable trade-off. If <5% of workloads might see a small
regression while the other >95% see double-digit percentage improvements,
then I might argue that it's fine. But I'm not sure we have any way to
know those sorts of details at the moment.
I'm also at least a little skeptical about the 2% number. IME that's
generally within the noise range and can vary greatly between machines and
test runs.
--
nathan
On 2025-10-20 Mo 1:04 PM, Nathan Bossart wrote:
On Mon, Oct 20, 2025 at 10:02:23AM -0400, Andrew Dunstan wrote:
On 2025-10-16 Th 10:29 AM, Nazir Bilal Yavuz wrote:
With this heuristic the regression is limited by %2 in the worst case.
My worry is that the worst case is actually quite common. Sparse data sets
dominated by a lot of null values (and hence lots of special characters) are
very common. Are people prepared to accept a 2% regression on load times for
such data sets?

Without knowing how common it is, I think it's difficult to judge whether
2% is a reasonable trade-off. If <5% of workloads might see a small
regression while the other >95% see double-digit percentage improvements,
then I might argue that it's fine. But I'm not sure we have any way to
know those sorts of details at the moment.
I guess what I don't understand is why we actually need to do the test
continuously, even using an adaptive algorithm. Data files in my
experience usually have lines with fairly similar shapes. It's highly
unlikely that you will get the first 1000 (say) lines of a file that
are rich in special characters and then some later significant section
that isn't, or vice versa. Therefore, doing the test once should yield
the correct answer that can be applied to the rest of the file. That
should reduce the worst case regression to ~0% without sacrificing any
of the performance gains. I appreciate the elegance of what Bilal has
done here, but it does seem like overkill.
I'm also at least a little skeptical about the 2% number. IME that's
generally within the noise range and can vary greatly between machines and
test runs.
Fair point.
cheers
andrew
--
Andrew Dunstan
EDB:https://www.enterprisedb.com
Hi,
On Mon, 20 Oct 2025 at 23:32, Andrew Dunstan <andrew@dunslane.net> wrote:
I guess what I don't understand is why we actually need to do the test
continuously, even using an adaptive algorithm. Data files in my
experience usually have lines with fairly similar shapes. It's highly
unlikely that you will get the first 1000 (say) lines of a file that
are rich in special characters and then some later significant section
that isn't, or vice versa. Therefore, doing the test once should yield
the correct answer that can be applied to the rest of the file. That
should reduce the worst case regression to ~0% without sacrificing any
of the performance gains. I appreciate the elegance of what Bilal has
done here, but it does seem like overkill.
I think the problem is deciding how many lines to process before
deciding for the rest. 1000 lines could work for the small sized data
but it might not work for the big sized data. Also, it might cause
worse regressions for the small sized data. Because of this reason, I
tried to implement a heuristic that will work regardless of the size
of the data. The last heuristic I suggested will run SIMD for
approximately (#number_of_lines / 1024 [1024 is the max number of
lines to sleep before running SIMD again]) lines if all characters in
the data are special characters.
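To make the back-off concrete, here is a rough, self-contained sketch. The constant SIMD_SLEEP_MAX and the doubling scheme follow the v3 patch excerpts quoted later in this thread; the struct and function names are made up for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIMD_SLEEP_MAX 1024     /* max lines to sleep before retrying SIMD */

/* Hypothetical per-COPY back-off state; field names are illustrative. */
typedef struct SimdBackoff
{
    uint32_t last_sleep_cycle;  /* sleep length chosen after the last failure */
    uint32_t remaining_sleep;   /* lines left to process without SIMD */
} SimdBackoff;

/* Called once per line: returns true when the SIMD path should be tried. */
static bool
simd_should_try(SimdBackoff *s)
{
    if (s->remaining_sleep > 0)
    {
        s->remaining_sleep--;
        return false;
    }
    return true;
}

/* SIMD advanced far enough: reset so it runs on every line again. */
static void
simd_report_success(SimdBackoff *s)
{
    s->last_sleep_cycle = 0;
    s->remaining_sleep = 0;
}

/* SIMD did not pay off: double the sleep, capped at SIMD_SLEEP_MAX. */
static void
simd_report_failure(SimdBackoff *s)
{
    if (s->last_sleep_cycle == 0)
        s->last_sleep_cycle = 1;
    else if (s->last_sleep_cycle >= SIMD_SLEEP_MAX / 2)
        s->last_sleep_cycle = SIMD_SLEEP_MAX;
    else
        s->last_sleep_cycle <<= 1;
    s->remaining_sleep = s->last_sleep_cycle;
}
```

With pathological input where every attempt "fails", the sleeps grow 1, 2, 4, ..., 1024, so in steady state SIMD is attempted on only about one line in 1024, which is where the bounded worst-case regression comes from.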
--
Regards,
Nazir Bilal Yavuz
Microsoft
On Sat, Oct 18, 2025 at 10:01 PM Nazir Bilal Yavuz <byavuz81@gmail.com>
wrote:
Thank you so much for doing this! The results look nice, do you think
there are any other benchmarks that might be interesting to try?
I'm also trying the idea of doing SIMD inside quotes with prefix XOR
using carry less multiplication avoiding the slow path in all cases even
with weird looking input, but it needs to take into consideration the
availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we
go, it quickly starts to become dirty OR we can wait for the decision to
start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2.
I can not quite picture this, would you mind sharing a few examples or
patches?
The idea aims to avoid stopping at characters that are not actually special
in their position (inside quotes, escaped, etc.).
This is done by creating a number of masks from the original chunk, such as
a quote_mask, an escape_mask, and an odd-escape-sequences mask; from these
we can deduce which quotes are not special to stop at.
Then, for inside quotes, we aim to know which characters in our chunk are
inside quotes (also keeping track of the previous chunk's quote state), and
there's a clever/fast way to do it [1].
After this you start to match LF and CR etc., all while maintaining the
state of what you've seen (the annoying part).
At the end you only reach the scalar path, advancing by the position of the
first real special character that requires special treatment.
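For concreteness, here is a minimal, portable sketch of the prefix-XOR step described above. A carry-less multiply by an all-ones constant (PCLMULQDQ, as in [1]) computes exactly this prefix XOR; the shift ladder below is the scalar equivalent, since (1+x)(1+x^2)...(1+x^32) is the all-ones polynomial of degree 63. All names here are illustrative, not from any patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Prefix XOR over the bits of x: output bit i is the XOR of input bits
 * 0..i.  Equivalent to a 64-bit carry-less multiply by ~0 (PCLMULQDQ),
 * truncated mod x^64.
 */
static uint64_t
prefix_xor(uint64_t x)
{
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}

/*
 * Given a bitmask of quote positions in a 64-byte chunk (bit i set when
 * byte i is an unescaped quote), return a mask of the bytes that are
 * inside a quoted region, carrying the open/closed state across chunks.
 */
static uint64_t
inside_quotes(uint64_t quote_bits, uint64_t *prev_state)
{
    uint64_t in = prefix_xor(quote_bits) ^ *prev_state;

    /* If the chunk ends inside quotes, the next chunk starts inverted. */
    *prev_state = (in >> 63) ? UINT64_MAX : 0;
    return in;
}
```

Newlines whose bit is set in the resulting mask are not record separators, so they could be skipped without falling back to the scalar parser.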
However, after trying to implement this on the existing pipeline of the
COPY command [2] (a broken, hopeless try, but it has the idea), it becomes
very unreasonable for a lot of reasons:
- It is very challenging to correctly handle commas inside quoted fields,
and to track quoted vs. unquoted state (especially across chunk boundaries,
or with escaped quotes).
- Using carry-less multiplication (CLMUL) for prefix XOR on a 16-byte
chunk is overkill for some architectures where the PCLMULQDQ latency is
high [3][4].
- It starts to feel that handling these cases is inherently scalar; doing
all that work for a 16-byte chunk would be unreasonable since it's not
free, compared to a simple SIMD assist with Nazir's heuristic, which is
much nicer in general.
Currently we are at 200-400Mbps, which isn't that terrible compared to
production and non-production-grade parsers (of course we don't only parse
in our case). Also, we are using SSE2 only, so theoretically if we add
support for AVX later on we'll have even better numbers.
Maybe more micro optimizations to the current heuristic can squeeze it more.
[1]: https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq/
[2]: https://github.com/AyoubKaz07/postgres/commit/73c6ecfedae4cce5c3f375fd6074b1ca9dfe1daf
[3]: https://agner.org/optimize/instruction_tables.pdf
[4]: https://www.uops.info/table.html
Regards,
Ayoub Kazar.
On Tue, Oct 21, 2025, 8:17 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Currently we are at 200-400Mbps which isn't that terrible compared to
production and non production grade parsers (of course we don't only parse
in our case), also we are using SSE2 only so theoretically if we add
support for avx later on we'll have even better numbers.
Maybe more micro optimizations to the current heuristic can squeeze it
more.
Regards,
Ayoub Kazar.
Sorry, I meant 200-400MB/s.
Regards.
Ayoub Kazar.
On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
I think the problem is deciding how many lines to process before
deciding for the rest. 1000 lines could work for the small sized data
but it might not work for the big sized data. Also, it might cause a
worse regressions for the small sized data.
IMHO we have some leeway with smaller amounts of data. If COPY FROM for
1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
unlikely users would be inconvenienced all that much. (Those numbers are
completely made up in order to illustrate my point.)
Because of this reason, I
tried to implement a heuristic that will work regardless of the size
of the data. The last heuristic I suggested will run SIMD for
approximately (#number_of_lines / 1024 [1024 is the max number of
lines to sleep before running SIMD again]) lines if all characters in
the data are special characters.
I wonder if we could mitigate the regression further by spacing out the
checks a bit more. It could be worth comparing a variety of values to
identify what works best with the test data.
--
nathan
On Tue, Oct 21, 2025 at 08:17:01AM +0200, KAZAR Ayoub wrote:
I'm also trying the idea of doing SIMD inside quotes with prefix XOR
using carry less multiplication avoiding the slow path in all cases even
with weird looking input, but it needs to take into consideration the
availability of PCLMULQDQ instruction set with <wmmintrin.h> and here we
go, it quickly starts to become dirty OR we can wait for the decision to
start requiring x86-64-v2 or v3 which has SSE4.2 and AVX2. [...]
Currently we are at 200-400Mbps which isn't that terrible compared to
production and non production grade parsers (of course we don't only parse
in our case), also we are using SSE2 only so theoretically if we add
support for avx later on we'll have even better numbers.
Maybe more micro optimizations to the current heuristic can squeeze it more.
I'd greatly prefer that we stick with SSE2/Neon (i.e., simd.h) unless the
gains are extraordinary. Beyond the inherent complexity of using
architecture-specific intrinsics, you also have to deal with configure-time
checks, runtime checks, and function pointer overhead juggling. That tends
to be a lot of work for the amount of gain.
--
nathan
Hi,
On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Oct 21, 2025 at 12:09:27AM +0300, Nazir Bilal Yavuz wrote:
I think the problem is deciding how many lines to process before
deciding for the rest. 1000 lines could work for the small sized data
but it might not work for the big sized data. Also, it might cause a
worse regressions for the small sized data.
IMHO we have some leeway with smaller amounts of data. If COPY FROM for
1000 rows takes 19 milliseconds as opposed to 11 milliseconds, it seems
unlikely users would be inconvenienced all that much. (Those numbers are
completely made up in order to illustrate my point.)
Because of this reason, I
tried to implement a heuristic that will work regardless of the size
of the data. The last heuristic I suggested will run SIMD for
approximately (#number_of_lines / 1024 [1024 is the max number of
lines to sleep before running SIMD again]) lines if all characters in
the data are special characters.
I wonder if we could mitigate the regression further by spacing out the
checks a bit more. It could be worth comparing a variety of values to
identify what works best with the test data.
Do you mean that instead of doubling the SIMD sleep, we should
multiply it by 3 (or another factor)? Or are you referring to
increasing the maximum sleep from 1024? Or possibly both?
--
Regards,
Nazir Bilal Yavuz
Microsoft
On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
I wonder if we could mitigate the regression further by spacing out the
checks a bit more. It could be worth comparing a variety of values to
identify what works best with the test data.
Do you mean that instead of doubling the SIMD sleep, we should
multiply it by 3 (or another factor)? Or are you referring to
increasing the maximum sleep from 1024? Or possibly both?
I'm not sure of the precise details, but the main thrust of my suggestion
is to assume that whatever sampling you do to determine whether to use SIMD
is good for a larger chunk of data. That is, if you are sampling 1K lines
and then using the result to choose whether to use SIMD for the next 100K
lines, we could instead bump the latter number to 1M lines (or something).
That way we minimize the regression for relatively uniform data sets while
retaining some ability to adapt in case things change halfway through a
large table.
--
nathan
On 2025-10-22 We 3:24 PM, Nathan Bossart wrote:
On Wed, Oct 22, 2025 at 03:33:37PM +0300, Nazir Bilal Yavuz wrote:
On Tue, 21 Oct 2025 at 21:40, Nathan Bossart <nathandbossart@gmail.com> wrote:
I wonder if we could mitigate the regression further by spacing out the
checks a bit more. It could be worth comparing a variety of values to
identify what works best with the test data.
Do you mean that instead of doubling the SIMD sleep, we should
multiply it by 3 (or another factor)? Or are you referring to
increasing the maximum sleep from 1024? Or possibly both?
I'm not sure of the precise details, but the main thrust of my suggestion
is to assume that whatever sampling you do to determine whether to use SIMD
is good for a larger chunk of data. That is, if you are sampling 1K lines
and then using the result to choose whether to use SIMD for the next 100K
lines, we could instead bump the latter number to 1M lines (or something).
That way we minimize the regression for relatively uniform data sets while
retaining some ability to adapt in case things change halfway through a
large table.
I'd be ok with numbers like this, although I suspect the numbers of
cases where we see shape shifts like this in the middle of a data set
would be vanishingly small.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Oct 29, 2025 at 5:23 PM Andrew Dunstan <andrew@dunslane.net> wrote:
Hello!
I wanted to reproduce the results using files attached by Shinya Kato and
Ayoub Kazar. I installed a postgres compiled from master, and then I
installed a postgres built from master plus Nazir Bilal Yavuz's v3 patches
applied.
The master+v3patches postgres naturally performed better on copying into
the database: anywhere from 11% better for the t.csv file produced by
Shinya's test.sql, to 35% better copying in the t_4096_none.csv file
created by Ayoub Kazar's simd-copy-from-bench.sql.
But here's where it gets weird. The two files created by Ayoub Kazar's
simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt,
and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
respectively.
This seems impossible.
A few things I should note:
I timed the commands using the Unix time command, like so:
time psql -X -U mwood -h localhost -d postgres -c '\copy t from
/tmp/t_4096_escape.txt'
For each file, I timed the copy 6 times and took the average.
This was done on my work Linux machine while also running Chrome and an
Open Office spreadsheet; not a dedicated machine only running postgres.
All of the copy results took between 4.5 seconds (Shinya's t.csv copied
into postgres compiled from master) to 2 seconds (Ayoub
Kazar's t_4096_none.csv copied into postgres compiled from master plus
Nazir's v3 patches).
Perhaps I need to fiddle with the provided SQL to produce larger files to
get longer run times? Maybe sub-second differences won't tell as
interesting a story as minutes-long copy commands?
Thanks for reading this.
--
-- Manni Wood EDB: https://www.enterprisedb.com
On Tue, Nov 11, 2025 at 11:23 PM Manni Wood <manni.wood@enterprisedb.com>
wrote:
Hello!
I wanted reproduce the results using files attached by Shinya Kato and
Ayoub Kazar. I installed a postgres compiled from master, and then I
installed a postgres built from master plus Nazir Bilal Yavuz's v3 patches
applied.
The master+v3patches postgres naturally performed better on copying into
the database: anywhere from 11% better for the t.csv file produced by
Shinyo's test.sql, to 35% better copying in the t_4096_none.csv file
created by Ayoub Kazar's simd-copy-from-bench.sql.
But here's where it gets weird. The two files created by Ayoub Kazar's
simd-copy-from-bench.sql that are supposed to be slower, t_4096_escape.txt,
and t_4096_quote.csv, actually ran faster on my machine, by 11% and 5%
respectively.
This seems impossible.
A few things I should note:
I timed the commands using the Unix time command, like so:
time psql -X -U mwood -h localhost -d postgres -c '\copy t from
/tmp/t_4096_escape.txt'
For each file, I timed the copy 6 times and took the average.
This was done on my work Linux machine while also running Chrome and an
Open Office spreadsheet; not a dedicated machine only running postgres.
Hello,
I think if you run a perf benchmark (if it still reproduces), it should
be possible to explain why it's performing like that by looking at the
CPI and other metrics and comparing them to my findings.
What I also suggest is to push the data even closer to the worst case,
i.e. more special characters, where the switching between SIMD and
scalar processing hurts (in the simd-copy-from-bench.sql file); if it
still does a good job then there's something to look at.
All of the copy results took between 4.5 seconds (Shinyo's t.csv copied
into postgres compiled from master) to 2 seconds (Ayoub
Kazar's t_4096_none.csv copied into postgres compiled from master plus
Nazir's v3 patches).
Perhaps I need to fiddle with the provided SQL to produce larger files to
get longer run times? Maybe sub-second differences won't tell as
interesting a story as minutes-long copy commands?
I did try it on some GBs (around 2-5GB only) and the differences were not
that much, but if you can run this on more GBs (at least 10GB) it would be
good to look at, although I don't suspect anything interesting since the
shape of data is the same for the totality of the COPY.
Thanks for reading this.
--
-- Manni Wood EDB: https://www.enterprisedb.com
Thanks for the info.
Regards,
Ayoub Kazar.
On Wed, Nov 12, 2025 at 8:44 AM KAZAR Ayoub <ma_kazar@esi.dz> wrote:
Hello again!
It looks like using 10 times the data removed the apparent speedup in the
simd code when the simd code has to deal with t_4096_escape.txt
and t_4096_quote.csv. When both files contain 1,000,000 lines each,
postgres master+v3patch imports 0.63% slower and 0.54% slower respectively.
For 1,000,000 lines of t_4096_none.txt, the v3 patch yields a 30% speedup.
For 1,000,000 lines of t_4096_none.csv, the v3 patch yields a 33% speedup.
I got these numbers just via simple timing, though this time I used psql's
\timing feature. I left psql running rather than launching it each time as
I did when I used the unix "time" command. I ran the copy command 5 times
for each file and averaged the results. Again, this happened on a Linux
machine that also happened to be running Chrome and Open Office's
spreadsheet.
I should probably try to construct some .txt or .csv files that would trip
up the simd on/off heuristic in the v3 patch.
If data "in the wild" tend to be roughly the same "shape" from row to row,
as Andrew's experience has shown, I imagine these million row results bode
well for the v3 patch...
--
-- Manni Wood EDB: https://www.enterprisedb.com
I'd like to mark myself as the committer for this one, but I noticed that the
commitfest entry [0] has been marked as Withdrawn. Could someone either
reopen it or create a new one as appropriate (assuming there is a desire to
continue with it)? I'm hoping to start spending more time on it soon.
[0]: https://commitfest.postgresql.org/patch/5952/
--
nathan
On Tue, Nov 18, 2025, 07:16 Nathan Bossart <nathandbossart@gmail.com> wrote:
I'd like to mark myself as the committer for this one, but I noticed that the
commitfest entry [0] has been marked as Withdrawn. Could someone either
reopen it or create a new one as appropriate (assuming there is a desire to
continue with it)? I'm hoping to start spending more time on it soon.
I closed this entry because I currently don't have enough time to continue
developing this patch. It is fine if someone else reopens it; I will do my
best to see the patch whenever I can.
Shinya
Hi,
On Tue, 18 Nov 2025 at 01:53, Shinya Kato <shinya11.kato@gmail.com> wrote:
On Tue, Nov 18, 2025, 07:16 Nathan Bossart <nathandbossart@gmail.com> wrote:
I'd like to mark myself as the committer this one, but I noticed that the
commitfest entry [0] has been marked as Withdrawn. Could someone either
reopen it or create a new one as appropriate (assuming there is a desire to
continue with it)? I'm hoping to start spending more time on it soon.
I closed this entry because I currently don't have enough time to continue
developing this patch. It is fine if someone else reopens it; I will do my
best to see the patch whenever I can.
Thank you for all your work on this patch.
I would like to continue working on this but I am not sure what are
the correct steps to reopen this commitfest entry. Do I just need to
change commitfest entry's status to 'Needs review'?
--
Regards,
Nazir Bilal Yavuz
Microsoft
On 2025-11-18 Tu 3:04 AM, Nazir Bilal Yavuz wrote:
Hi,
On Tue, 18 Nov 2025 at 01:53, Shinya Kato <shinya11.kato@gmail.com> wrote:
I closed this entry because I currently don't have enough time to continue
developing this patch. It is fine if someone else reopens it; I will do my
best to see the patch whenever I can.
Thank you for all your work on this patch.
I would like to continue working on this but I am not sure what are
the correct steps to reopen this commitfest entry. Do I just need to
change commitfest entry's status to 'Needs review'?
That should do it, I believe.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,
On Tue, 18 Nov 2025 at 17:01, Andrew Dunstan <andrew@dunslane.net> wrote:
On 2025-11-18 Tu 3:04 AM, Nazir Bilal Yavuz wrote:
Hi,
I would like to continue working on this but I am not sure what are
the correct steps to reopen this commitfest entry. Do I just need to
change commitfest entry's status to 'Needs review'?
That should do it, I believe.
Thanks, done.
--
Regards,
Nazir Bilal Yavuz
Microsoft
On Mon, Nov 17, 2025, 11:16 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:
(assuming there is a desire to
continue with it)?
I'm hoping to start spending more time on it soon.
Some things worth noting for future reference (so someone else won't
waste time thinking about them): previously I tried several extra micro
optimizations inside and around CopyReadLineText:
SIMD alignment: Forcing 16-byte aligned buffers so we could use aligned
memory instructions (_mm_load_si128 vs _mm_loadu_si128) provided no
measurable benefit on modern CPUs (there's definitely a thread somewhere
talking about it that I haven't encountered yet). This likely explains why
simd.h exclusively uses unaligned load intrinsics; the performance
difference has become negligible since Nehalem processors.
Memory prefetching: Explicit prefetch instructions for the COPY buffer
pipeline (copy_raw_buf, input buffers, etc.) either showed no improvement
or a slight regression. Multiple chunks already fit within a cache line,
other buffers are too far away to prefetch usefully, and the next part of
the buffer is easily prefetched by the hardware anyway, so it turns out
not to be worth the extra uops.
Instruction-level parallelism: Spreading out too many independent vector
operations to increase ILP eventually degrades performance, likely due to
backend saturation observed through perf (most likely execution port and
execution unit contention?).
This simply suggests that further optimization work should focus on the
pipeline as a whole for large benefits (parallel COPY [0], maybe?).
[0]: /messages/by-id/CAA4eK1+kpddvvLxWm4BuG_AhVvYz8mKAEa7osxp_X0d4ZEiV=g@mail.gmail.com
--
Regards,
Ayoub Kazar
On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
Thanks, done.
I took a look at the v3 patches. Here are my high-level thoughts:
+ /*
+ * Parse data and transfer into line_buf. To get benefit from inlining,
+ * call CopyReadLineText() with the constant boolean variables.
+ */
+ if (cstate->simd_continue)
+ result = CopyReadLineText(cstate, is_csv, true);
+ else
+ result = CopyReadLineText(cstate, is_csv, false);
I'm curious whether this actually generates different code, and if it does,
if it's actually faster. We're already branching on cstate->simd_continue
here.
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
In other places, processing 2 or 4 vectors of data at a time has proven
faster. Have you tried that here?
+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ match = vector8_or(match, vector8_eq(chunk, bs));
The amount of branching here catches my eye. Some branching might be
unavoidable, but in general we want to keep these SIMD paths as branch-free
as possible.
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ input_buf_ptr += advance;
+ simd_total_advance += advance;
Do we actually need to advance here? Or could we just fall through to the
scalar path? My suspicion is that this extra code doesn't gain us much.
+ if (simd_last_sleep_cycle == 0)
+ simd_last_sleep_cycle = 1;
+ else if (simd_last_sleep_cycle >= SIMD_SLEEP_MAX / 2)
+ simd_last_sleep_cycle = SIMD_SLEEP_MAX;
+ else
+ simd_last_sleep_cycle <<= 1;
+ cstate->simd_current_sleep_cycle = simd_last_sleep_cycle;
+ cstate->simd_last_sleep_cycle = simd_last_sleep_cycle;
IMHO we should be looking for ways to simplify this should-we-use-SIMD
code. For example, perhaps we could just disable the SIMD path for 10K or
100K lines any time a special character is found. I'm dubious that a lot
of complexity is warranted.
--
nathan
Hi,
Thank you for looking into this!
On Thu, 20 Nov 2025 at 00:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
Thanks, done.
I took a look at the v3 patches. Here are my high-level thoughts:
+ /*
+ * Parse data and transfer into line_buf. To get benefit from inlining,
+ * call CopyReadLineText() with the constant boolean variables.
+ */
+ if (cstate->simd_continue)
+ result = CopyReadLineText(cstate, is_csv, true);
+ else
+ result = CopyReadLineText(cstate, is_csv, false);
I'm curious whether this actually generates different code, and if it does,
if it's actually faster. We're already branching on cstate->simd_continue
here.
I had the same doubts before, but my benchmark shows a nice speedup. I
used a test which is full of delimiters. The current code gives 2700
ms, but when I replaced these lines with 'result =
CopyReadLineText(cstate, is_csv, cstate->simd_continue);', the result
was 2920 ms. I compiled code with both -O3 and -O2 and the results
were similar.
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
In other places, processing 2 or 4 vectors of data at a time has proven
faster. Have you tried that here?
Sorry, I could not find the related code piece. I only saw the
vector8_load() inside the hex_decode_safe() function, and its comment
says:
/*
* We must process 2 vectors at a time since the output will be half the
* length of the input.
*/
But this does not mention any speedup from using 2 vectors at a time.
Could you please show the related code?
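As an aside, the broadcast-and-compare idea under discussion can be sketched portably with 64-bit SWAR, standing in for what simd.h's vector8_broadcast()/vector8_eq()/vector8_highbit_mask() do with 16-byte registers. The function names and the chosen special-character set here are illustrative, and the code assumes a little-endian machine and a GCC/Clang builtin:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Broadcast one byte value to all 8 lanes of a 64-bit word. */
#define BROADCAST8(c) (0x0101010101010101ULL * (uint8_t) (c))

/*
 * SWAR "has zero byte": bit 7 of each byte is set when that byte is zero.
 * Bytes above the lowest zero can be false positives because of borrow
 * propagation, but the lowest set bit is always exact, which is all we
 * need to find the FIRST special character.
 */
static uint64_t
zero_bytes(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

/*
 * Return the index of the first special byte (\n, \r, or \\ here) in an
 * 8-byte chunk, or 8 if the chunk contains only plain data.
 */
static int
first_special(const char *p)
{
    uint64_t chunk;
    uint64_t hits;

    memcpy(&chunk, p, sizeof(chunk));   /* unaligned load */
    hits = zero_bytes(chunk ^ BROADCAST8('\n'))
         | zero_bytes(chunk ^ BROADCAST8('\r'))
         | zero_bytes(chunk ^ BROADCAST8('\\'));
    if (hits == 0)
        return 8;                       /* advance past the whole chunk */
    return __builtin_ctzll(hits) / 8;   /* little-endian byte index */
}
```

With SSE2 the same comparison covers 16 bytes at once, and processing 2 or 4 such vectors per iteration, as suggested above, would just OR more comparison results together before the single branch.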
+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ match = vector8_or(match, vector8_eq(chunk, bs));
The amount of branching here catches my eye. Some branching might be
unavoidable, but in general we want to keep these SIMD paths as branch-free
as possible.
You are right, I will check these branches and will try to remove as
many branches as possible.
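One common branch-removal trick, sketched here in portable SWAR form on uint64 words rather than with PostgreSQL's Vector8 API (all helper names are illustrative): run every comparison unconditionally and neutralize the ones that don't apply with a data-dependent all-ones/all-zero mask instead of an if.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SWAR byte-equality: high bit of each byte set where w's byte == c */
static uint64_t
byte_eq_mask(uint64_t w, uint8_t c)
{
    uint64_t x = w ^ (0x0101010101010101ULL * c);

    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

/*
 * Build the special-character mask without branching on in_quote: the
 * comparisons run unconditionally, and ((uint64_t) in_quote - 1) wraps
 * to all-ones when in_quote is false (keep the EOL matches) and to zero
 * when it is true (suppress them).
 */
static uint64_t
special_mask(uint64_t w, bool in_quote, uint8_t quotec)
{
    uint64_t eol = byte_eq_mask(w, '\n') | byte_eq_mask(w, '\r');
    uint64_t keep_eol = (uint64_t) in_quote - 1;

    return (eol & keep_eol) | byte_eq_mask(w, quotec);
}
```

The Vector8 equivalent would broadcast 0x00 or 0xFF once per buffer and AND it with the newline/carriage-return matches, so the per-chunk loop stays branch-free.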
+ /*
+  * Found a special character. Advance up to that point and let
+  * the scalar code handle it.
+  */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ input_buf_ptr += advance;
+ simd_total_advance += advance;

Do we actually need to advance here?  Or could we just fall through to the
scalar path?  My suspicion is that this extra code doesn't gain us much.
My testing shows that if we advance more than ~5 characters then SIMD
is worth it, but if we advance fewer than ~5, the code causes a
regression. I used this information while writing a heuristic.
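The advance computation being debated is small; here is a self-contained sketch of it, with __builtin_ctz (GCC/Clang) standing in for pg_rightmost_one_pos32:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given a per-byte match bitmask from a vector comparison, the position
 * of the lowest set bit is how many plain bytes can be skipped before
 * the scalar code must take over; a zero mask means the whole chunk is
 * plain data.
 */
static int
bytes_to_skip(uint32_t mask, int vector_width)
{
    if (mask == 0)
        return vector_width;        /* skip the entire chunk */
    return __builtin_ctz(mask);     /* advance to the first special byte */
}
```

The ~5-character break-even figure above is about whether this partial advance (plus the mask bookkeeping) outruns simply falling through to the scalar loop at the chunk start.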
+ if (simd_last_sleep_cycle == 0)
+     simd_last_sleep_cycle = 1;
+ else if (simd_last_sleep_cycle >= SIMD_SLEEP_MAX / 2)
+     simd_last_sleep_cycle = SIMD_SLEEP_MAX;
+ else
+     simd_last_sleep_cycle <<= 1;
+ cstate->simd_current_sleep_cycle = simd_last_sleep_cycle;
+ cstate->simd_last_sleep_cycle = simd_last_sleep_cycle;

IMHO we should be looking for ways to simplify this should-we-use-SIMD
code.  For example, perhaps we could just disable the SIMD path for 10K or
100K lines any time a special character is found.  I'm dubious that a lot
of complexity is warranted.
I think this is a bit too harsh, since SIMD is still worth it if it can
advance more than ~5 characters on average. I am trying to use SIMD as
much as possible when it is worth it, but what you suggested could remove
the regression completely, so perhaps that is the correct way.
--
Regards,
Nazir Bilal Yavuz
Microsoft
On 2025-11-20 Th 7:55 AM, Nazir Bilal Yavuz wrote:
Hi,
Thank you for looking into this!
On Thu, 20 Nov 2025 at 00:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
IMHO we should be looking for ways to simplify this should-we-use-SIMD
code. For example, perhaps we could just disable the SIMD path for 10K or
100K lines any time a special character is found. I'm dubious that a lot
of complexity is warranted.

I think this is a bit too harsh since SIMD is still worth it if SIMD
can advance more than ~5 character average. I am trying to use SIMD as
much as possible when it is worth it but what you said can remove the
regression completely, perhaps that is the correct way.
Perhaps a very small regression (say under 1%) in the worst case would
be OK. But the closer you can get that to zero the more acceptable this
will be. Very large loads of sparse data, which will often have lots of
special characters AIUI, are very common, so we should not dismiss the
worst case as an outlier. I still like the idea of testing, say, a
thousand lines every million, or something like that.
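The "test a thousand lines every million" idea can be sketched as a small periodic sampler (all names and constants here are illustrative, not from any posted patch; DENSITY_CUTOFF reuses the ~5 break-even figure from earlier in the thread):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SAMPLE_LINES   1000        /* lines measured per sample window */
#define INTERVAL_LINES 1000000     /* re-check once per this many lines */
#define DENSITY_CUTOFF 5           /* break-even plain chars per special */

typedef struct
{
    uint64_t lines_seen;
    uint64_t sample_chars;
    uint64_t sample_specials;
    bool     use_simd;
} simd_decider;

/* call once per line; only the first SAMPLE_LINES of each interval count */
static void
decider_observe(simd_decider *d, uint64_t line_chars, uint64_t line_specials)
{
    uint64_t phase = d->lines_seen % INTERVAL_LINES;

    if (phase < SAMPLE_LINES)
    {
        d->sample_chars += line_chars;
        d->sample_specials += line_specials;
    }

    if (phase == SAMPLE_LINES - 1)
    {
        /* window complete: decide, then reset for the next interval */
        d->use_simd = (d->sample_specials == 0 ||
                       d->sample_chars / d->sample_specials >= DENSITY_CUTOFF);
        d->sample_chars = 0;
        d->sample_specials = 0;
    }
    d->lines_seen++;
}
```

Because the decision is revisited every interval, a file whose density changes partway through (like the alternating-line files tested above) would flip the flag instead of being stuck with the first chunk's verdict.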
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Thu, Nov 20, 2025 at 03:55:43PM +0300, Nazir Bilal Yavuz wrote:
On Thu, 20 Nov 2025 at 00:01, Nathan Bossart <nathandbossart@gmail.com> wrote:
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);

In other places, processing 2 or 4 vectors of data at a time has proven
faster.  Have you tried that here?

Sorry, I could not find the related code piece. I only saw the
vector8_load() inside of hex_decode_safe() function and its comment
says:

/*
 * We must process 2 vectors at a time since the output will be half the
 * length of the input.
 */

But this does not mention any speedup from using 2 vectors at a time.
Could you please show the related code?
See pg_lfind32().
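For readers without the source handy: pg_lfind32() unrolls its loop to process four vector registers per iteration, ORs the per-register match results together, and tests for a hit only once per group. The same shape, sketched here in portable SWAR form with plain uint64 words standing in for registers (names are illustrative, not PostgreSQL's API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* SWAR test: does the 8-byte word contain the byte c anywhere? */
static uint64_t
word_has_byte(uint64_t w, uint8_t c)
{
    uint64_t x = w ^ (0x0101010101010101ULL * c);

    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

static bool
chunk_has_byte(const uint8_t *buf, size_t len, uint8_t c)
{
    size_t i = 0;

    /* 4 words (32 bytes) per iteration, one combined check per group */
    for (; i + 32 <= len; i += 32)
    {
        uint64_t w0, w1, w2, w3;

        memcpy(&w0, buf + i, 8);
        memcpy(&w1, buf + i + 8, 8);
        memcpy(&w2, buf + i + 16, 8);
        memcpy(&w3, buf + i + 24, 8);

        if (word_has_byte(w0, c) | word_has_byte(w1, c) |
            word_has_byte(w2, c) | word_has_byte(w3, c))
            return true;
    }
    /* scalar tail for the remainder */
    for (; i < len; i++)
        if (buf[i] == c)
            return true;
    return false;
}
```

The win comes from fewer loop-control branches and better instruction-level parallelism per byte scanned, though for COPY the payoff is less clear since a hit must still be located within the group.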
--
nathan
Hello.
I tried Ayoub Kazar's test files again, using Nazir Bilal Yavuz's v3
patches, but with one difference since my last attempt: this time, I used 5
million lines per file. For each 5 million line file, I ran the import 5
times and averaged the results.
(I found that even using 1 million lines could sometimes produce surprising
speedups where the newer algorithm should be at least a tiny bit slower
than the non-simd version.)
The text file with no special characters is 30% faster. The CSV file with
no special characters is 39% faster. The text file with roughly 1/3rd
special characters is 0.5% slower. The CSV file with roughly 1/3rd special
characters is 2.7% slower.
I also tried files that alternated lines with no special characters and
lines with 1/3rd special characters, thinking I could force the algorithm
to continually check whether or not it should use simd and therefore force
more overhead in the try-simd/don't-try-simd housekeeping code. The text
file was still 50% faster. The CSV file was still 13% faster.
On Mon, Nov 24, 2025 at 3:59 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:
On Thu, Nov 20, 2025 at 03:55:43PM +0300, Nazir Bilal Yavuz wrote:
On Thu, 20 Nov 2025 at 00:01, Nathan Bossart <nathandbossart@gmail.com>
wrote:
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);

In other places, processing 2 or 4 vectors of data at a time has proven
faster.  Have you tried that here?

Sorry, I could not find the related code piece. I only saw the
vector8_load() inside of hex_decode_safe() function and its comment
says:

/*
 * We must process 2 vectors at a time since the output will be half the
 * length of the input.
 */

But this does not mention any speedup from using 2 vectors at a time.
Could you please show the related code?

See pg_lfind32().
--
nathan
--
-- Manni Wood EDB: https://www.enterprisedb.com
Hello,
On Wed, Nov 19, 2025 at 10:01 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:
On Tue, Nov 18, 2025 at 05:20:05PM +0300, Nazir Bilal Yavuz wrote:
Thanks, done.
I took a look at the v3 patches. Here are my high-level thoughts:
+ /*
+  * Parse data and transfer into line_buf. To get benefit from inlining,
+  * call CopyReadLineText() with the constant boolean variables.
+  */
+ if (cstate->simd_continue)
+     result = CopyReadLineText(cstate, is_csv, true);
+ else
+     result = CopyReadLineText(cstate, is_csv, false);

I'm curious whether this actually generates different code, and if it does,
if it's actually faster.  We're already branching on cstate->simd_continue
here.
I've compiled both versions with -O2 and confirmed they generate different
code. When simd_continue is passed as a constant to CopyReadLineText, the
compiler optimizes out the condition checks from the SIMD path.
A small benchmark on a 1GB+ file shows the expected benefit which is around
6% performance improvement.
I've attached the assembly outputs in case someone wants to check something
else.
Regards,
Ayoub Kazar
Correction to my last post:
I also tried files that alternated lines with no special characters and
lines with 1/3rd special characters, thinking I could force the algorithm
to continually check whether or not it should use simd and therefore force
more overhead in the try-simd/don't-try-simd housekeeping code. The text
file was still 20% faster (not 50% faster as I originally stated --- that
was a typo). The CSV file was still 13% faster.
Also, apologies for posting at the top in my last e-mail.
--
-- Manni Wood EDB: https://www.enterprisedb.com
Hello, all.
Andrew, I tried your suggestion of just reading the first chunk of the copy
file to determine if SIMD is worth using. Attached are v4 versions of the
patches showing a first attempt at doing that.
I attached test.sh.txt to show how I've been testing, with 5 million lines
of the various copy file variations introduced by Ayoub Kazar.
The text copy with no special chars is 30% faster. The CSV copy with no
special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The
CSV with 1/3rd quotes is 0.27% slower.
This set of patches follows the simplest suggestion of just testing the
first N lines (actually first N bytes) of the file and then deciding
whether or not to enable SIMD. This set of patches does not follow Andrew's
later suggestion of maybe checking again every million lines or so.
--
-- Manni Wood EDB: https://www.enterprisedb.com
Attachments:
v4-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchDownload
From 38b587dda44cb7160ee734cdea55a573f302c3a9 Mon Sep 17 00:00:00 2001
From: Manni Wood <manni.wood@enterprisedb.com>
Date: Fri, 5 Dec 2025 18:33:46 -0600
Subject: [PATCH v4 2/2] Speed up COPY FROM text/CSV parsing using SIMD
Authors: Shinya Kato <shinya11.kato@gmail.com>,
Nazir Bilal Yavuz <byavuz81@gmail.com>,
Ayoub Kazar <ma_kazar@esi.dz>
Reviewers: Andrew Dunstan <andrew@dunslane.net>
Discussion:
https://www.postgresql.org/message-id/flat/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com
---
src/backend/commands/copyfrom.c | 3 +++
src/backend/commands/copyfromparse.c | 29 +++++++++++++++++++++++-
src/include/commands/copyfrom_internal.h | 11 +++++++++
3 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 12781963b4f..e638623e5b5 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1720,6 +1720,9 @@ BeginCopyFrom(ParseState *pstate,
cstate->cur_attname = NULL;
cstate->cur_attval = NULL;
cstate->relname_only = false;
+ cstate->special_chars_encountered = 0;
+ cstate->checked_simd = false;
+ cstate->use_simd = false;
/*
* Allocate buffers for the input pipeline.
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 1edb525f072..8cfdfcd4cd8 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1346,6 +1346,28 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
#ifndef USE_NO_SIMD
+ /*
+ * Wait until we have read more than BYTES_PROCESSED_UNTIL_SIMD_CHECK.
+ * cstate->bytes_processed will grow an unpredictable amount with each
+ * call to this function, so just wait until we have crossed the
+ * threshold.
+ */
+ if (!cstate->checked_simd && cstate->bytes_processed > BYTES_PROCESSED_UNTIL_SIMD_CHECK)
+ {
+ cstate->checked_simd = true;
+
+ /*
+ * If we have not read too many special characters
+ * (SPECIAL_CHAR_SIMD_THRESHOLD) then start using SIMD to speed up
+ * processing. This heuristic assumes that input does not vary too
+ * much from line to line and that the number of special characters
+ * encountered in the first BYTES_PROCESSED_UNTIL_SIMD_CHECK is
+ * indicative of the whole file.
+ */
+ if (cstate->special_chars_encountered < SPECIAL_CHAR_SIMD_THRESHOLD)
+ cstate->use_simd = true;
+ }
+
/*
* Use SIMD instructions to efficiently scan the input buffer for
* special characters (e.g., newline, carriage return, quote, and
@@ -1358,7 +1380,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* sequentially. - The remaining buffer is smaller than one vector
* width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
*/
- if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (cstate->use_simd && !last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match = vector8_broadcast(0);
@@ -1415,6 +1437,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
*/
if (c == '\r')
{
+ cstate->special_chars_encountered++;
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
}
@@ -1446,6 +1469,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \r */
if (c == '\r' && (!is_csv || !in_quote))
{
+ cstate->special_chars_encountered++;
/* Check for \r\n on first line, _and_ handle \r\n. */
if (cstate->eol_type == EOL_UNKNOWN ||
cstate->eol_type == EOL_CRNL)
@@ -1502,6 +1526,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \n */
if (c == '\n' && (!is_csv || !in_quote))
{
+ cstate->special_chars_encountered++;
if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
@@ -1524,6 +1549,8 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
{
char c2;
+ cstate->special_chars_encountered++;
+
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
IF_NEED_REFILL_AND_EOF_BREAK(0);
diff --git a/src/include/commands/copyfrom_internal.h b/src/include/commands/copyfrom_internal.h
index c8b22af22d8..215215f909f 100644
--- a/src/include/commands/copyfrom_internal.h
+++ b/src/include/commands/copyfrom_internal.h
@@ -181,6 +181,17 @@ typedef struct CopyFromStateData
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
uint64 bytes_processed; /* number of bytes processed so far */
+
+ /* the amount of bytes to read until checking if we should try simd */
+#define BYTES_PROCESSED_UNTIL_SIMD_CHECK 100000
+ /* the number of special chars read below which we use simd */
+#define SPECIAL_CHAR_SIMD_THRESHOLD 20000
+ uint64 special_chars_encountered; /* number of special chars
+ * encountered so far */
+ bool checked_simd; /* we read BYTES_PROCESSED_UNTIL_SIMD_CHECK
+ * and checked if we should use SIMD on the
+ * rest of the file */
+ bool use_simd; /* use simd to speed up copying */
} CopyFromStateData;
extern void ReceiveCopyBegin(CopyFromState cstate);
--
2.52.0
v4-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchDownload
From 0b1f786bf58c3d90e078d4afa83b7d43dda08491 Mon Sep 17 00:00:00 2001
From: Manni Wood <manni.wood@enterprisedb.com>
Date: Fri, 5 Dec 2025 18:30:00 -0600
Subject: [PATCH v4 1/2] Speed up COPY FROM text/CSV parsing using SIMD
Authors: Shinya Kato <shinya11.kato@gmail.com>,
Nazir Bilal Yavuz <byavuz81@gmail.com>,
Ayoub Kazar <ma_kazar@esi.dz>
Reviewers: Andrew Dunstan <andrew@dunslane.net>
Discussion:
https://www.postgresql.org/message-id/flat/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com
---
src/backend/commands/copyfromparse.c | 73 ++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index a09e7fbace3..1edb525f072 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -71,7 +71,9 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_bitutils.h"
#include "port/pg_bswap.h"
+#include "port/simd.h"
#include "utils/builtins.h"
#include "utils/rel.h"
@@ -1255,6 +1257,14 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
char quotec = '\0';
char escapec = '\0';
+#ifndef USE_NO_SIMD
+ Vector8 nl = vector8_broadcast('\n');
+ Vector8 cr = vector8_broadcast('\r');
+ Vector8 bs = vector8_broadcast('\\');
+ Vector8 quote = vector8_broadcast(0);
+ Vector8 escape = vector8_broadcast(0);
+#endif
+
if (is_csv)
{
quotec = cstate->opts.quote[0];
@@ -1262,6 +1272,12 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* ignore special escape processing if it's the same as quotec */
if (quotec == escapec)
escapec = '\0';
+
+#ifndef USE_NO_SIMD
+ quote = vector8_broadcast(quotec);
+ if (quotec != escapec)
+ escape = vector8_broadcast(escapec);
+#endif
}
/*
@@ -1328,6 +1344,63 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
need_data = false;
}
+#ifndef USE_NO_SIMD
+
+ /*
+ * Use SIMD instructions to efficiently scan the input buffer for
+ * special characters (e.g., newline, carriage return, quote, and
+ * escape). This is faster than byte-by-byte iteration, especially on
+ * large buffers.
+ *
+ * We do not apply the SIMD fast path in either of the following
+ * cases: - When the previously processed character was an escape
+ * character (last_was_esc), since the next byte must be examined
+ * sequentially. - The remaining buffer is smaller than one vector
+ * width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
+ */
+ if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match = vector8_broadcast(0);
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
+
+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ match = vector8_or(match, vector8_eq(chunk, bs));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ input_buf_ptr += advance;
+ }
+ else
+ {
+ /* No special characters found, so skip the entire chunk */
+ input_buf_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
/* OK to fetch a character */
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];
--
2.52.0
Hi,
On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
Hello, all.
Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
Thank you for doing this!
I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayoub Kazar.
The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
My input-generation script is not ready to share yet, but the inputs
follow this format: text_${n}.input, where n represents the number of
normal characters before the delimiter. For example:
n = 0 -> "\n\n\n\n\n..." (no normal characters)
n = 1 -> "a\n..." (1 normal character before the delimiter)
...
n = 5 -> "aaaaa\n..."
… continuing up to n = 32.
Each line has 4096 chars and there are a total of 100000 lines in each
input file.
I only benchmarked the text format. I compared the latest heuristic I
shared [1] with the current method. The benchmarks show roughly a ~16%
regression at the worst case (n = 2), with regressions up to n = 5.
For the remaining values, performance was similar.
Actual comparison of timings (in ms):
current method / heuristic
n = 0 -> 3252.7253 / 2856.2753 (12%)
n = 1 -> 2910.321 / 2520.7717 (13%)
n = 2 -> 2865.008 / 2403.2017 (16%)
n = 3 -> 2608.649 / 2353.1477 (9%)
n = 4 -> 2460.74 / 2300.1783 (6%)
n = 5 -> 2451.696 / 2362.1573 (3%)
No difference for the rest.
Side note: Sorry for the delay in responding, I will continue working
on this next week.
[1]: https://www.postgresql.org/message-id/CAN55FZ1KF7XNpm2XyG=M-sFUODai=6Z8a11xE3s4YRBeBKY3tA@mail.gmail.com
--
Regards,
Nazir Bilal Yavuz
Microsoft
Hi,
On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
Hello, all.
Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
Thank you for doing this!
I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayoub Kazar.
The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
My input-generation script is not ready to share yet, but the inputs
follow this format: text_${n}.input, where n represents the number of
normal characters before the delimiter. For example:

n = 0 -> "\n\n\n\n\n..." (no normal characters)
n = 1 -> "a\n..." (1 normal character before the delimiter)
...
n = 5 -> "aaaaa\n..."
… continuing up to n = 32.

Each line has 4096 chars and there are a total of 100000 lines in each
input file.

I only benchmarked the text format. I compared the latest heuristic I
shared [1] with the current method. The benchmarks show roughly a ~16%
regression at the worst case (n = 2), with regressions up to n = 5.
For the remaining values, performance was similar.
I tried to improve the v4 patchset. My changes are:
1 - I changed CopyReadLineText() to an inline function and passed the
use_simd variable as an argument to benefit from inlining.
2 - The main for loop in CopyReadLineText() runs many times, so I moved
the use_simd check to CopyReadLine().
3 - Instead of 'bytes_processed', I used 'chars_processed', because
cstate->bytes_processed is incremented before the bytes are actually
processed, which can lead to wrong results.
4 - Because of #2 and #3, instead of the absolute
'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio 'chars_processed /
special_chars_encountered' to determine whether we want to use SIMD.
5 - cstate->special_chars_encountered was incremented incorrectly in the
CSV case: it was not incremented for the quote and escape delimiters. I
moved all increments of cstate->special_chars_encountered to one central
place and tried to optimize it, but it still causes a regression, since
it adds one more branch.
With these changes, I was able to reduce the regression from 16% to 10%.
The regression drops to 7% if I modify #5 for text-only input, but I did
not do that.
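A minimal sketch of the ratio test described in change #4 (function and constant names are illustrative; the value 5 is the break-even advance per special character reported earlier in the thread):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIMD_RATIO_CUTOFF 5     /* average plain chars per special char */

static bool
should_use_simd(uint64_t chars_processed, uint64_t special_chars)
{
    /* no special characters seen at all: SIMD is clearly profitable */
    if (special_chars == 0)
        return true;
    return chars_processed / special_chars >= SIMD_RATIO_CUTOFF;
}
```

A ratio adapts to however many bytes the sampling phase happened to cover, which an absolute count like SPECIAL_CHAR_SIMD_THRESHOLD cannot do.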
My changes are in patch 0003.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
v4.1-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchtext/x-patch; charset=US-ASCII; name=v4.1-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchDownload
From a1b4d28069786c3fb506c79e096312fcfd585fdb Mon Sep 17 00:00:00 2001
From: Shinya Kato <shinya11.kato@gmail.com>
Date: Mon, 28 Jul 2025 22:08:20 +0900
Subject: [PATCH v4.1 1/3] Speed up COPY FROM text/CSV parsing using SIMD
---
src/backend/commands/copyfromparse.c | 76 ++++++++++++++++++++++++++++
1 file changed, 76 insertions(+)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 62afcd8fad1..cf110767542 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -71,7 +71,9 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_bitutils.h"
#include "port/pg_bswap.h"
+#include "port/simd.h"
#include "utils/builtins.h"
#include "utils/rel.h"
@@ -1255,6 +1257,14 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
char quotec = '\0';
char escapec = '\0';
+#ifndef USE_NO_SIMD
+ Vector8 nl = vector8_broadcast('\n');
+ Vector8 cr = vector8_broadcast('\r');
+ Vector8 bs = vector8_broadcast('\\');
+ Vector8 quote = vector8_broadcast(0);
+ Vector8 escape = vector8_broadcast(0);
+#endif
+
if (is_csv)
{
quotec = cstate->opts.quote[0];
@@ -1262,6 +1272,12 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* ignore special escape processing if it's the same as quotec */
if (quotec == escapec)
escapec = '\0';
+
+#ifndef USE_NO_SIMD
+ quote = vector8_broadcast(quotec);
+ if (quotec != escapec)
+ escape = vector8_broadcast(escapec);
+#endif
}
/*
@@ -1328,6 +1344,66 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
need_data = false;
}
+#ifndef USE_NO_SIMD
+
+ /*
+ * Use SIMD instructions to efficiently scan the input buffer for
+ * special characters (e.g., newline, carriage return, quote, and
+ * escape). This is faster than byte-by-byte iteration, especially on
+ * large buffers.
+ *
+ * We do not apply the SIMD fast path in either of the following
+ * cases: - When the previously processed character was an escape
+ * character (last_was_esc), since the next byte must be examined
+ * sequentially. - The remaining buffer is smaller than one vector
+ * width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
+ */
+ if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match = vector8_broadcast(0);
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) &copy_input_buf[input_buf_ptr]);
+
+ if (is_csv)
+ {
+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ {
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+ match = vector8_or(match, vector8_eq(chunk, bs));
+ }
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ input_buf_ptr += advance;
+ }
+ else
+ {
+ /* No special characters found, so skip the entire chunk */
+ input_buf_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
/* OK to fetch a character */
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];
--
2.51.0
v4.1-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchtext/x-patch; charset=US-ASCII; name=v4.1-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patchDownload
From 3a2f9ff26755a5248b7a33770f4603fec483d3bc Mon Sep 17 00:00:00 2001
From: Manni Wood <manni.wood@enterprisedb.com>
Date: Fri, 5 Dec 2025 18:33:46 -0600
Subject: [PATCH v4.1 2/3] Speed up COPY FROM text/CSV parsing using SIMD
Authors: Shinya Kato <shinya11.kato@gmail.com>,
Nazir Bilal Yavuz <byavuz81@gmail.com>,
Ayoub Kazar <ma_kazar@esi.dz>
Reviewers: Andrew Dunstan <andrew@dunslane.net>
Discussion:
https://www.postgresql.org/message-id/flat/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com
---
src/include/commands/copyfrom_internal.h | 11 +++++++++
src/backend/commands/copyfrom.c | 3 +++
src/backend/commands/copyfromparse.c | 29 +++++++++++++++++++++++-
3 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/src/include/commands/copyfrom_internal.h b/src/include/commands/copyfrom_internal.h
index c8b22af22d8..215215f909f 100644
--- a/src/include/commands/copyfrom_internal.h
+++ b/src/include/commands/copyfrom_internal.h
@@ -181,6 +181,17 @@ typedef struct CopyFromStateData
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
uint64 bytes_processed; /* number of bytes processed so far */
+
+ /* the amount of bytes to read until checking if we should try simd */
+#define BYTES_PROCESSED_UNTIL_SIMD_CHECK 100000
+ /* the number of special chars read below which we use simd */
+#define SPECIAL_CHAR_SIMD_THRESHOLD 20000
+ uint64 special_chars_encountered; /* number of special chars
+ * encountered so far */
+ bool checked_simd; /* we read BYTES_PROCESSED_UNTIL_SIMD_CHECK
+ * and checked if we should use SIMD on the
+ * rest of the file */
+ bool use_simd; /* use simd to speed up copying */
} CopyFromStateData;
extern void ReceiveCopyBegin(CopyFromState cstate);
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 12781963b4f..e638623e5b5 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1720,6 +1720,9 @@ BeginCopyFrom(ParseState *pstate,
cstate->cur_attname = NULL;
cstate->cur_attval = NULL;
cstate->relname_only = false;
+ cstate->special_chars_encountered = 0;
+ cstate->checked_simd = false;
+ cstate->use_simd = false;
/*
* Allocate buffers for the input pipeline.
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index cf110767542..549b56c21fb 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1346,6 +1346,28 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
#ifndef USE_NO_SIMD
+ /*
+ * Wait until we have read more than BYTES_PROCESSED_UNTIL_SIMD_CHECK.
+ * cstate->bytes_processed will grow an unpredictable amount with each
+ * call to this function, so just wait until we have crossed the
+ * threshold.
+ */
+ if (!cstate->checked_simd && cstate->bytes_processed > BYTES_PROCESSED_UNTIL_SIMD_CHECK)
+ {
+ cstate->checked_simd = true;
+
+ /*
+ * If we have not read too many special characters
+ * (SPECIAL_CHAR_SIMD_THRESHOLD) then start using SIMD to speed up
+ * processing. This heuristic assumes that input does not vary too
+ * much from line to line and that the number of special characters
+ * encountered in the first BYTES_PROCESSED_UNTIL_SIMD_CHECK is
+ * indicative of the whole file.
+ */
+ if (cstate->special_chars_encountered < SPECIAL_CHAR_SIMD_THRESHOLD)
+ cstate->use_simd = true;
+ }
+
/*
* Use SIMD instructions to efficiently scan the input buffer for
* special characters (e.g., newline, carriage return, quote, and
@@ -1358,7 +1380,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* sequentially. - The remaining buffer is smaller than one vector
* width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
*/
- if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (cstate->use_simd && !last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match = vector8_broadcast(0);
@@ -1418,6 +1440,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
*/
if (c == '\r')
{
+ cstate->special_chars_encountered++;
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
}
@@ -1449,6 +1472,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \r */
if (c == '\r' && (!is_csv || !in_quote))
{
+ cstate->special_chars_encountered++;
/* Check for \r\n on first line, _and_ handle \r\n. */
if (cstate->eol_type == EOL_UNKNOWN ||
cstate->eol_type == EOL_CRNL)
@@ -1505,6 +1529,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \n */
if (c == '\n' && (!is_csv || !in_quote))
{
+ cstate->special_chars_encountered++;
if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
@@ -1527,6 +1552,8 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
{
char c2;
+ cstate->special_chars_encountered++;
+
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
IF_NEED_REFILL_AND_EOF_BREAK(0);
--
2.51.0
Attachment: v4.1-0003-Feedback-Changes.patch (text/x-patch)
From 8d0e6766175abac15b39884126c29da03657be40 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 9 Dec 2025 15:32:10 +0300
Subject: [PATCH v4.1 3/3] Feedback / Changes
---
src/include/commands/copyfrom_internal.h | 9 +--
src/backend/commands/copyfrom.c | 1 +
src/backend/commands/copyfromparse.c | 88 +++++++++++++++---------
3 files changed, 60 insertions(+), 38 deletions(-)
diff --git a/src/include/commands/copyfrom_internal.h b/src/include/commands/copyfrom_internal.h
index 215215f909f..397720bf875 100644
--- a/src/include/commands/copyfrom_internal.h
+++ b/src/include/commands/copyfrom_internal.h
@@ -183,12 +183,13 @@ typedef struct CopyFromStateData
uint64 bytes_processed; /* number of bytes processed so far */
/* the amount of bytes to read until checking if we should try simd */
-#define BYTES_PROCESSED_UNTIL_SIMD_CHECK 100000
- /* the number of special chars read below which we use simd */
-#define SPECIAL_CHAR_SIMD_THRESHOLD 20000
+#define CHARS_PROCESSED_UNTIL_SIMD_CHECK 100000
+ /* the ratio of special chars read below which we use simd */
+#define SPECIAL_CHAR_SIMD_RATIO 4
+ uint64 chars_processed;
uint64 special_chars_encountered; /* number of special chars
* encountered so far */
- bool checked_simd; /* we read BYTES_PROCESSED_UNTIL_SIMD_CHECK
+ bool checked_simd; /* we read CHARS_PROCESSED_UNTIL_SIMD_CHECK
* and checked if we should use SIMD on the
* rest of the file */
bool use_simd; /* use simd to speed up copying */
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index e638623e5b5..d44dd16eced 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1720,6 +1720,7 @@ BeginCopyFrom(ParseState *pstate,
cstate->cur_attname = NULL;
cstate->cur_attval = NULL;
cstate->relname_only = false;
+ cstate->chars_processed = 0;
cstate->special_chars_encountered = 0;
cstate->checked_simd = false;
cstate->use_simd = false;
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 549b56c21fb..86a268d0df9 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -143,7 +143,7 @@ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
/* non-export function prototypes */
static bool CopyReadLine(CopyFromState cstate, bool is_csv);
-static bool CopyReadLineText(CopyFromState cstate, bool is_csv);
+static pg_attribute_always_inline bool CopyReadLineText(CopyFromState cstate, bool is_csv, bool use_simd);
static int CopyReadAttributesText(CopyFromState cstate);
static int CopyReadAttributesCSV(CopyFromState cstate);
static Datum CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
@@ -1173,8 +1173,40 @@ CopyReadLine(CopyFromState cstate, bool is_csv)
resetStringInfo(&cstate->line_buf);
cstate->line_buf_valid = false;
- /* Parse data and transfer into line_buf */
- result = CopyReadLineText(cstate, is_csv);
+#ifndef USE_NO_SIMD
+
+ /*
+ * Wait until we have read more than CHARS_PROCESSED_UNTIL_SIMD_CHECK.
+ * cstate->chars_processed will grow an unpredictable amount with each
+ * call to this function, so just wait until we have crossed the
+ * threshold.
+ */
+ if (!cstate->checked_simd && cstate->chars_processed > CHARS_PROCESSED_UNTIL_SIMD_CHECK)
+ {
+ cstate->checked_simd = true;
+
+ /*
+ * If we have not read too many special characters then start using
+ * SIMD to speed up processing. This heuristic assumes that input does
+ * not vary too much from line to line and that number of special
+ * characters encountered in the first
+ * CHARS_PROCESSED_UNTIL_SIMD_CHECK are indicative of the whole file.
+ */
+ if (cstate->chars_processed / SPECIAL_CHAR_SIMD_RATIO >= cstate->special_chars_encountered)
+ {
+ cstate->use_simd = true;
+ }
+ }
+#endif
+
+ /*
+ * Parse data and transfer into line_buf. To get benefit from inlining,
+ * call CopyReadLineText() with the constant boolean variables.
+ */
+ if (cstate->use_simd)
+ result = CopyReadLineText(cstate, is_csv, true);
+ else
+ result = CopyReadLineText(cstate, is_csv, false);
if (result)
{
@@ -1241,8 +1273,8 @@ CopyReadLine(CopyFromState cstate, bool is_csv)
/*
* CopyReadLineText - inner loop of CopyReadLine for text mode
*/
-static bool
-CopyReadLineText(CopyFromState cstate, bool is_csv)
+static pg_attribute_always_inline bool
+CopyReadLineText(CopyFromState cstate, bool is_csv, bool use_simd)
{
char *copy_input_buf;
int input_buf_ptr;
@@ -1309,7 +1341,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
input_buf_ptr = cstate->input_buf_index;
copy_buf_len = cstate->input_buf_len;
- for (;;)
+ for (;; cstate->chars_processed++)
{
int prev_raw_ptr;
char c;
@@ -1346,28 +1378,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
#ifndef USE_NO_SIMD
- /*
- * Wait until we have read more than BYTES_PROCESSED_UNTIL_SIMD_CHECK.
- * cstate->bytes_processed will grow an unpredictable amount with each
- * call to this function, so just wait until we have crossed the
- * threshold.
- */
- if (!cstate->checked_simd && cstate->bytes_processed > BYTES_PROCESSED_UNTIL_SIMD_CHECK)
- {
- cstate->checked_simd = true;
-
- /*
- * If we have not read too many special characters
- * (SPECIAL_CHAR_SIMD_THRESHOLD) then start using SIMD to speed up
- * processing. This heuristic assumes that input does not vary too
- * much from line to line and that number of special characters
- * encountered in the first BYTES_PROCESSED_UNTIL_SIMD_CHECK are
- * indicative of the whole file.
- */
- if (cstate->special_chars_encountered < SPECIAL_CHAR_SIMD_THRESHOLD)
- cstate->use_simd = true;
- }
-
/*
* Use SIMD instructions to efficiently scan the input buffer for
* special characters (e.g., newline, carriage return, quote, and
@@ -1380,7 +1390,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* sequentially. - The remaining buffer is smaller than one vector
* width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
*/
- if (cstate->use_simd && !last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (use_simd && !last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match = vector8_broadcast(0);
@@ -1430,6 +1440,21 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];
+ /* Use this calculation to decide whether to use SIMD later */
+ if (!use_simd && unlikely(!cstate->checked_simd))
+ {
+ if (is_csv)
+ {
+ if (c == '\r' || c == '\n' || c == quotec || c == escapec)
+ cstate->special_chars_encountered++;
+ }
+ else
+ {
+ if (c == '\r' || c == '\n' || c == '\\')
+ cstate->special_chars_encountered++;
+ }
+ }
+
if (is_csv)
{
/*
@@ -1440,7 +1465,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
*/
if (c == '\r')
{
- cstate->special_chars_encountered++;
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
}
@@ -1472,7 +1496,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \r */
if (c == '\r' && (!is_csv || !in_quote))
{
- cstate->special_chars_encountered++;
/* Check for \r\n on first line, _and_ handle \r\n. */
if (cstate->eol_type == EOL_UNKNOWN ||
cstate->eol_type == EOL_CRNL)
@@ -1529,7 +1552,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \n */
if (c == '\n' && (!is_csv || !in_quote))
{
- cstate->special_chars_encountered++;
if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
@@ -1552,8 +1574,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
{
char c2;
- cstate->special_chars_encountered++;
-
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
IF_NEED_REFILL_AND_EOF_BREAK(0);
--
2.51.0
On Tue, Dec 9, 2025 at 7:40 AM Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com>
wrote:
Hello, all.
Andrew, I tried your suggestion of just reading the first chunk of the
copy file to determine if SIMD is worth using. Attached are v4 versions of
the patches showing a first attempt at doing that.

Thank you for doing this!

I attached test.sh.txt to show how I've been testing, with 5 million
lines of the various copy file variations introduced by Ayub Kazar.

The text copy with no special chars is 30% faster. The CSV copy with
no special chars is 48% faster. The text with 1/3rd escapes is 3% slower.
The CSV with 1/3rd quotes is 0.27% slower.

This set of patches follows the simplest suggestion of just testing
the first N lines (actually first N bytes) of the file and then deciding
whether or not to enable SIMD. This set of patches does not follow Andrew's
later suggestion of maybe checking again every million lines or so.

My input-generation script is not ready to share yet, but the inputs
follow this format: text_${n}.input, where n represents the number of
normal characters before the delimiter. For example:

n = 0 -> "\n\n\n\n\n..." (no normal characters)
n = 1 -> "a\n..." (1 normal character before the delimiter)
...
n = 5 -> "aaaaa\n..."
... continuing up to n = 32.

Each line has 4096 chars and there are a total of 100000 lines in each
input file.

I only benchmarked the text format. I compared the latest heuristic I
shared [1] with the current method. The benchmarks show roughly a ~16%
regression at the worst case (n = 2), with regressions up to n = 5.
For the remaining values, performance was similar.

I tried to improve the v4 patchset. My changes are:

1 - I changed CopyReadLineText() to an inline function and passed the
use_simd variable as an argument to benefit from inlining.

2 - The main for loop in the CopyReadLineText() function runs many
times, so I moved the use_simd check to the CopyReadLine() function.

3 - Instead of 'bytes_processed', I used 'chars_processed' because
cstate->bytes_processed is incremented before the bytes are processed,
which can cause wrong results.

4 - Because of #2 and #3, instead of having 'SPECIAL_CHAR_SIMD_THRESHOLD',
I used the ratio of 'chars_processed / special_chars_encountered' to
determine whether we want to use SIMD.

5 - cstate->special_chars_encountered was incremented incorrectly for the
CSV case: it was not incremented for the quote and escape characters. I
moved all increments of cstate->special_chars_encountered to one central
place and tried to optimize it, but it still causes a regression because
it adds one more branch.

With these changes, I was able to decrease the regression from 16% to
10%. The regression drops to 7% if I restrict #5 to text input only, but
I did not do that.

My changes are in patch 0003.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Bilal Yavuz (Nazir Bilal Yavuz?), I did not get a chance to do any work on
this today, but wanted to thank you for finding my logic errors in counting
special chars for CSV, and hacking on my naive solution to make it faster.
By attempting Andrew Dunstan's suggestion, I got a better feel for the
reality that the "housekeeping" code produces a significant amount of
overhead.
--
-- Manni Wood EDB: https://www.enterprisedb.com
Hi,
On Wed, 10 Dec 2025 at 01:13, Manni Wood <manni.wood@enterprisedb.com> wrote:
Bilal Yavuz (Nazir Bilal Yavuz?),
It is Nazir Bilal Yavuz, I changed some settings on my phone and it
seems that it affected my mail account, hopefully it should be fixed
now.
I did not get a chance to do any work on this today, but wanted to thank you for finding my logic errors in counting special chars for CSV, and hacking on my naive solution to make it faster. By attempting Andrew Dunstan's suggestion, I got a better feel for the reality that the "housekeeping" code produces a significant amount of overhead.
You are welcome! v4.1 has some problems with the in_quote case in the
SIMD handling code and with counting the cstate->chars_processed
variable. I fixed them in v4.2.
--
Regards,
Nazir Bilal Yavuz
Microsoft
Attachments:
Attachment: v4.2-0001-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (text/x-patch)
From e4546b0612bd2fde6190a9ade6e60a1f08299184 Mon Sep 17 00:00:00 2001
From: Manni Wood <manni.wood@enterprisedb.com>
Date: Fri, 5 Dec 2025 18:30:00 -0600
Subject: [PATCH v4.2 1/3] Speed up COPY FROM text/CSV parsing using SIMD
Authors: Shinya Kato <shinya11.kato@gmail.com>,
Nazir Bilal Yavuz <byavuz81@gmail.com>,
Ayoub Kazar <ma_kazar@esi.dz>
Reviewers: Andrew Dunstan <andrew@dunslane.net>
Discussion:
https://www.postgresql.org/message-id/flat/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com
---
src/backend/commands/copyfromparse.c | 73 ++++++++++++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 62afcd8fad1..673d6683a72 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -71,7 +71,9 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_bitutils.h"
#include "port/pg_bswap.h"
+#include "port/simd.h"
#include "utils/builtins.h"
#include "utils/rel.h"
@@ -1255,6 +1257,14 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
char quotec = '\0';
char escapec = '\0';
+#ifndef USE_NO_SIMD
+ Vector8 nl = vector8_broadcast('\n');
+ Vector8 cr = vector8_broadcast('\r');
+ Vector8 bs = vector8_broadcast('\\');
+ Vector8 quote = vector8_broadcast(0);
+ Vector8 escape = vector8_broadcast(0);
+#endif
+
if (is_csv)
{
quotec = cstate->opts.quote[0];
@@ -1262,6 +1272,12 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* ignore special escape processing if it's the same as quotec */
if (quotec == escapec)
escapec = '\0';
+
+#ifndef USE_NO_SIMD
+ quote = vector8_broadcast(quotec);
+ if (quotec != escapec)
+ escape = vector8_broadcast(escapec);
+#endif
}
/*
@@ -1328,6 +1344,63 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
need_data = false;
}
+#ifndef USE_NO_SIMD
+
+ /*
+ * Use SIMD instructions to efficiently scan the input buffer for
+ * special characters (e.g., newline, carriage return, quote, and
+ * escape). This is faster than byte-by-byte iteration, especially on
+ * large buffers.
+ *
+ * We do not apply the SIMD fast path in either of the following
+ * cases: - When the previously processed character was an escape
+ * character (last_was_esc), since the next byte must be examined
+ * sequentially. - The remaining buffer is smaller than one vector
+ * width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
+ */
+ if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ {
+ Vector8 chunk;
+ Vector8 match = vector8_broadcast(0);
+ uint32 mask;
+
+ /* Load a chunk of data into a vector register */
+ vector8_load(&chunk, (const uint8 *) ©_input_buf[input_buf_ptr]);
+
+ /* \n and \r are not special inside quotes */
+ if (!in_quote)
+ match = vector8_or(vector8_eq(chunk, nl), vector8_eq(chunk, cr));
+
+ if (is_csv)
+ {
+ match = vector8_or(match, vector8_eq(chunk, quote));
+ if (escapec != '\0')
+ match = vector8_or(match, vector8_eq(chunk, escape));
+ }
+ else
+ match = vector8_or(match, vector8_eq(chunk, bs));
+
+ /* Check if we found any special characters */
+ mask = vector8_highbit_mask(match);
+ if (mask != 0)
+ {
+ /*
+ * Found a special character. Advance up to that point and let
+ * the scalar code handle it.
+ */
+ int advance = pg_rightmost_one_pos32(mask);
+
+ input_buf_ptr += advance;
+ }
+ else
+ {
+ /* No special characters found, so skip the entire chunk */
+ input_buf_ptr += sizeof(Vector8);
+ continue;
+ }
+ }
+#endif
+
/* OK to fetch a character */
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];
--
2.51.0
Attachment: v4.2-0002-Speed-up-COPY-FROM-text-CSV-parsing-using-SIMD.patch (text/x-patch)
From 92ac4ada1e4833f81ce30164b48868dc1ade102f Mon Sep 17 00:00:00 2001
From: Manni Wood <manni.wood@enterprisedb.com>
Date: Fri, 5 Dec 2025 18:33:46 -0600
Subject: [PATCH v4.2 2/3] Speed up COPY FROM text/CSV parsing using SIMD
Authors: Shinya Kato <shinya11.kato@gmail.com>,
Nazir Bilal Yavuz <byavuz81@gmail.com>,
Ayoub Kazar <ma_kazar@esi.dz>
Reviewers: Andrew Dunstan <andrew@dunslane.net>
Discussion:
https://www.postgresql.org/message-id/flat/CAOzEurSW8cNr6TPKsjrstnPfhf4QyQqB4tnPXGGe8N4e_v7Jig@mail.gmail.com
---
src/include/commands/copyfrom_internal.h | 11 +++++++++
src/backend/commands/copyfrom.c | 3 +++
src/backend/commands/copyfromparse.c | 29 +++++++++++++++++++++++-
3 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/src/include/commands/copyfrom_internal.h b/src/include/commands/copyfrom_internal.h
index c8b22af22d8..215215f909f 100644
--- a/src/include/commands/copyfrom_internal.h
+++ b/src/include/commands/copyfrom_internal.h
@@ -181,6 +181,17 @@ typedef struct CopyFromStateData
#define RAW_BUF_BYTES(cstate) ((cstate)->raw_buf_len - (cstate)->raw_buf_index)
uint64 bytes_processed; /* number of bytes processed so far */
+
+ /* the amount of bytes to read until checking if we should try simd */
+#define BYTES_PROCESSED_UNTIL_SIMD_CHECK 100000
+ /* the number of special chars read below which we use simd */
+#define SPECIAL_CHAR_SIMD_THRESHOLD 20000
+ uint64 special_chars_encountered; /* number of special chars
+ * encountered so far */
+ bool checked_simd; /* we read BYTES_PROCESSED_UNTIL_SIMD_CHECK
+ * and checked if we should use SIMD on the
+ * rest of the file */
+ bool use_simd; /* use simd to speed up copying */
} CopyFromStateData;
extern void ReceiveCopyBegin(CopyFromState cstate);
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 2ae3d2ba86e..6711c0cfcdd 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1720,6 +1720,9 @@ BeginCopyFrom(ParseState *pstate,
cstate->cur_attname = NULL;
cstate->cur_attval = NULL;
cstate->relname_only = false;
+ cstate->special_chars_encountered = 0;
+ cstate->checked_simd = false;
+ cstate->use_simd = false;
/*
* Allocate buffers for the input pipeline.
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 673d6683a72..d548674c8ff 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1346,6 +1346,28 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
#ifndef USE_NO_SIMD
+ /*
+ * Wait until we have read more than BYTES_PROCESSED_UNTIL_SIMD_CHECK.
+ * cstate->bytes_processed will grow an unpredictable amount with each
+ * call to this function, so just wait until we have crossed the
+ * threshold.
+ */
+ if (!cstate->checked_simd && cstate->bytes_processed > BYTES_PROCESSED_UNTIL_SIMD_CHECK)
+ {
+ cstate->checked_simd = true;
+
+ /*
+ * If we have not read too many special characters
+ * (SPECIAL_CHAR_SIMD_THRESHOLD) then start using SIMD to speed up
+ * processing. This heuristic assumes that input does not vary too
+ * much from line to line and that number of special characters
+ * encountered in the first BYTES_PROCESSED_UNTIL_SIMD_CHECK are
+ * indicative of the whole file.
+ */
+ if (cstate->special_chars_encountered < SPECIAL_CHAR_SIMD_THRESHOLD)
+ cstate->use_simd = true;
+ }
+
/*
* Use SIMD instructions to efficiently scan the input buffer for
* special characters (e.g., newline, carriage return, quote, and
@@ -1358,7 +1380,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* sequentially. - The remaining buffer is smaller than one vector
* width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
*/
- if (!last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (cstate->use_simd && !last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match = vector8_broadcast(0);
@@ -1415,6 +1437,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
*/
if (c == '\r')
{
+ cstate->special_chars_encountered++;
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
}
@@ -1446,6 +1469,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \r */
if (c == '\r' && (!is_csv || !in_quote))
{
+ cstate->special_chars_encountered++;
/* Check for \r\n on first line, _and_ handle \r\n. */
if (cstate->eol_type == EOL_UNKNOWN ||
cstate->eol_type == EOL_CRNL)
@@ -1502,6 +1526,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \n */
if (c == '\n' && (!is_csv || !in_quote))
{
+ cstate->special_chars_encountered++;
if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
@@ -1524,6 +1549,8 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
{
char c2;
+ cstate->special_chars_encountered++;
+
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
IF_NEED_REFILL_AND_EOF_BREAK(0);
--
2.51.0
Attachment: v4.2-0003-Feedback-Changes.patch (text/x-patch)
From 128574f80963c5b532c8aa7e7fad84a7e6e20874 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavuz81@gmail.com>
Date: Tue, 9 Dec 2025 15:32:10 +0300
Subject: [PATCH v4.2 3/3] Feedback / Changes
---
src/include/commands/copyfrom_internal.h | 9 +--
src/backend/commands/copyfrom.c | 1 +
src/backend/commands/copyfromparse.c | 92 +++++++++++++++---------
3 files changed, 65 insertions(+), 37 deletions(-)
diff --git a/src/include/commands/copyfrom_internal.h b/src/include/commands/copyfrom_internal.h
index 215215f909f..397720bf875 100644
--- a/src/include/commands/copyfrom_internal.h
+++ b/src/include/commands/copyfrom_internal.h
@@ -183,12 +183,13 @@ typedef struct CopyFromStateData
uint64 bytes_processed; /* number of bytes processed so far */
/* the amount of bytes to read until checking if we should try simd */
-#define BYTES_PROCESSED_UNTIL_SIMD_CHECK 100000
- /* the number of special chars read below which we use simd */
-#define SPECIAL_CHAR_SIMD_THRESHOLD 20000
+#define CHARS_PROCESSED_UNTIL_SIMD_CHECK 100000
+ /* the ratio of special chars read below which we use simd */
+#define SPECIAL_CHAR_SIMD_RATIO 4
+ uint64 chars_processed;
uint64 special_chars_encountered; /* number of special chars
* encountered so far */
- bool checked_simd; /* we read BYTES_PROCESSED_UNTIL_SIMD_CHECK
+ bool checked_simd; /* we read CHARS_PROCESSED_UNTIL_SIMD_CHECK
* and checked if we should use SIMD on the
* rest of the file */
bool use_simd; /* use simd to speed up copying */
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 6711c0cfcdd..2b77ba2556c 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1720,6 +1720,7 @@ BeginCopyFrom(ParseState *pstate,
cstate->cur_attname = NULL;
cstate->cur_attval = NULL;
cstate->relname_only = false;
+ cstate->chars_processed = 0;
cstate->special_chars_encountered = 0;
cstate->checked_simd = false;
cstate->use_simd = false;
diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index d548674c8ff..720222152da 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -143,7 +143,7 @@ static const char BinarySignature[11] = "PGCOPY\n\377\r\n\0";
/* non-export function prototypes */
static bool CopyReadLine(CopyFromState cstate, bool is_csv);
-static bool CopyReadLineText(CopyFromState cstate, bool is_csv);
+static pg_attribute_always_inline bool CopyReadLineText(CopyFromState cstate, bool is_csv, bool use_simd);
static int CopyReadAttributesText(CopyFromState cstate);
static int CopyReadAttributesCSV(CopyFromState cstate);
static Datum CopyReadBinaryAttribute(CopyFromState cstate, FmgrInfo *flinfo,
@@ -1173,8 +1173,40 @@ CopyReadLine(CopyFromState cstate, bool is_csv)
resetStringInfo(&cstate->line_buf);
cstate->line_buf_valid = false;
- /* Parse data and transfer into line_buf */
- result = CopyReadLineText(cstate, is_csv);
+#ifndef USE_NO_SIMD
+
+ /*
+ * Wait until we have read more than CHARS_PROCESSED_UNTIL_SIMD_CHECK.
+ * cstate->chars_processed will grow an unpredictable amount with each
+ * call to this function, so just wait until we have crossed the
+ * threshold.
+ */
+ if (!cstate->checked_simd && cstate->chars_processed > CHARS_PROCESSED_UNTIL_SIMD_CHECK)
+ {
+ cstate->checked_simd = true;
+
+ /*
+ * If we have not read too many special characters then start using
+ * SIMD to speed up processing. This heuristic assumes that input does
+ * not vary too much from line to line and that number of special
+ * characters encountered in the first
+ * CHARS_PROCESSED_UNTIL_SIMD_CHECK are indicative of the whole file.
+ */
+ if (cstate->chars_processed / SPECIAL_CHAR_SIMD_RATIO >= cstate->special_chars_encountered)
+ {
+ cstate->use_simd = true;
+ }
+ }
+#endif
+
+ /*
+ * Parse data and transfer into line_buf. To get benefit from inlining,
+ * call CopyReadLineText() with the constant boolean variables.
+ */
+ if (cstate->use_simd)
+ result = CopyReadLineText(cstate, is_csv, true);
+ else
+ result = CopyReadLineText(cstate, is_csv, false);
if (result)
{
@@ -1241,11 +1273,12 @@ CopyReadLine(CopyFromState cstate, bool is_csv)
/*
* CopyReadLineText - inner loop of CopyReadLine for text mode
*/
-static bool
-CopyReadLineText(CopyFromState cstate, bool is_csv)
+static pg_attribute_always_inline bool
+CopyReadLineText(CopyFromState cstate, bool is_csv, bool use_simd)
{
char *copy_input_buf;
int input_buf_ptr;
+ int start_input_buf_ptr;
int copy_buf_len;
bool need_data = false;
bool hit_eof = false;
@@ -1309,6 +1342,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
input_buf_ptr = cstate->input_buf_index;
copy_buf_len = cstate->input_buf_len;
+ start_input_buf_ptr = input_buf_ptr;
for (;;)
{
int prev_raw_ptr;
@@ -1327,9 +1361,11 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
REFILL_LINEBUF;
CopyLoadInputBuf(cstate);
+ cstate->chars_processed += (input_buf_ptr - start_input_buf_ptr);
/* update our local variables */
hit_eof = cstate->input_reached_eof;
input_buf_ptr = cstate->input_buf_index;
+ start_input_buf_ptr = input_buf_ptr;
copy_buf_len = cstate->input_buf_len;
/*
@@ -1346,28 +1382,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
#ifndef USE_NO_SIMD
- /*
- * Wait until we have read more than BYTES_PROCESSED_UNTIL_SIMD_CHECK.
- * cstate->bytes_processed will grow an unpredictable amount with each
- * call to this function, so just wait until we have crossed the
- * threshold.
- */
- if (!cstate->checked_simd && cstate->bytes_processed > BYTES_PROCESSED_UNTIL_SIMD_CHECK)
- {
- cstate->checked_simd = true;
-
- /*
- * If we have not read too many special characters
- * (SPECIAL_CHAR_SIMD_THRESHOLD) then start using SIMD to speed up
- * processing. This heuristic assumes that input does not vary too
- * much from line to line and that number of special characters
- * encountered in the first BYTES_PROCESSED_UNTIL_SIMD_CHECK are
- * indicative of the whole file.
- */
- if (cstate->special_chars_encountered < SPECIAL_CHAR_SIMD_THRESHOLD)
- cstate->use_simd = true;
- }
-
/*
* Use SIMD instructions to efficiently scan the input buffer for
* special characters (e.g., newline, carriage return, quote, and
@@ -1380,7 +1394,7 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
* sequentially. - The remaining buffer is smaller than one vector
* width (sizeof(Vector8)); SIMD operates on fixed-size chunks.
*/
- if (cstate->use_simd && !last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
+ if (use_simd && !last_was_esc && copy_buf_len - input_buf_ptr >= sizeof(Vector8))
{
Vector8 chunk;
Vector8 match = vector8_broadcast(0);
@@ -1427,6 +1441,21 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
prev_raw_ptr = input_buf_ptr;
c = copy_input_buf[input_buf_ptr++];
+ /* Use this calculation to decide whether to use SIMD later */
+ if (!use_simd && unlikely(!cstate->checked_simd))
+ {
+ if (is_csv)
+ {
+ if (c == '\r' || c == '\n' || c == quotec || c == escapec)
+ cstate->special_chars_encountered++;
+ }
+ else
+ {
+ if (c == '\r' || c == '\n' || c == '\\')
+ cstate->special_chars_encountered++;
+ }
+ }
+
if (is_csv)
{
/*
@@ -1437,7 +1466,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
*/
if (c == '\r')
{
- cstate->special_chars_encountered++;
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
}
@@ -1469,7 +1497,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \r */
if (c == '\r' && (!is_csv || !in_quote))
{
- cstate->special_chars_encountered++;
/* Check for \r\n on first line, _and_ handle \r\n. */
if (cstate->eol_type == EOL_UNKNOWN ||
cstate->eol_type == EOL_CRNL)
@@ -1526,7 +1553,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
/* Process \n */
if (c == '\n' && (!is_csv || !in_quote))
{
- cstate->special_chars_encountered++;
if (cstate->eol_type == EOL_CR || cstate->eol_type == EOL_CRNL)
ereport(ERROR,
(errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
@@ -1549,8 +1575,6 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
{
char c2;
- cstate->special_chars_encountered++;
-
IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(0);
IF_NEED_REFILL_AND_EOF_BREAK(0);
@@ -1635,6 +1659,8 @@ CopyReadLineText(CopyFromState cstate, bool is_csv)
*/
REFILL_LINEBUF;
+ cstate->chars_processed += (input_buf_ptr - start_input_buf_ptr);
+
return result;
}
--
2.51.0
Hi everyone,
On Tue, Dec 09, 2025 at 04:40:19PM +0300, Bilal Yavuz wrote:
Hi,
On Sat, 6 Dec 2025 at 10:55, Bilal Yavuz <byavuz81@gmail.com> wrote:
Hi,
On Sat, 6 Dec 2025 at 04:40, Manni Wood <manni.wood@enterprisedb.com> wrote:
Hello, all.
Andrew, I tried your suggestion of just reading the first chunk of the copy file to determine if SIMD is worth using. Attached are v4 versions of the patches showing a first attempt at doing that.
Thank you for doing this!
I attached test.sh.txt to show how I've been testing, with 5 million lines of the various copy file variations introduced by Ayub Kazar.
The text copy with no special chars is 30% faster. The CSV copy with no special chars is 48% faster. The text with 1/3rd escapes is 3% slower. The CSV with 1/3rd quotes is 0.27% slower.
This set of patches follows the simplest suggestion of just testing the first N lines (actually first N bytes) of the file and then deciding whether or not to enable SIMD. This set of patches does not follow Andrew's later suggestion of maybe checking again every million lines or so.
My input-generation script is not ready to share yet, but the inputs
follow this format: text_${n}.input, where n represents the number of
normal characters before the delimiter. For example:

n = 0 -> "\n\n\n\n\n..." (no normal characters)
n = 1 -> "a\n..." (1 normal character before the delimiter)
...
n = 5 -> "aaaaa\n..."
… continuing up to n = 32.

Each line has 4096 chars and there are a total of 100000 lines in each
input file.

I only benchmarked the text format. I compared the latest heuristic I
shared [1] with the current method. The benchmarks show roughly a ~16%
regression in the worst case (n = 2), with regressions up to n = 5.
For the remaining values, performance was similar.

I tried to improve the v4 patchset. My changes are:

1 - I changed CopyReadLineText() to an inline function and passed the
use_simd variable as an argument to benefit from inlining.

2 - The main for loop in CopyReadLineText() is executed many times, so
I moved the use_simd check up into the CopyReadLine() function.

3 - Instead of 'bytes_processed', I used 'chars_processed', because
cstate->bytes_processed is incremented before the bytes are actually
processed, which can produce wrong results.

4 - Because of #2 and #3, instead of having
'SPECIAL_CHAR_SIMD_THRESHOLD', I used the ratio 'chars_processed /
special_chars_encountered' to determine whether we want to use SIMD.

5 - cstate->special_chars_encountered was incremented incorrectly in
the CSV case: it was not incremented for the quote and escape
characters. I moved all increments of
cstate->special_chars_encountered to one central place and tried to
optimize it, but it still causes a regression because it adds one more
branch.

With these changes, I was able to decrease the regression from 16% to
10%. The regression decreases to 7% if I restrict the #5 change to
text-only input, but I did not do that.

My changes are in the 0003.
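Change #4 above, the running ratio of ordinary to special characters, can be sketched as below. The threshold constant and function name are hypothetical, not taken from the 0003 patch; they only show the shape of the test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative tuning constant, not the patch's actual value. */
#define SIMD_MIN_CHARS_PER_SPECIAL 16

/*
 * Hypothetical sketch: keep running counters as the parser consumes
 * input, and stay on the SIMD path only while special characters are
 * rare enough on average for chunk-skipping to pay off.
 */
static bool
copy_should_use_simd(uint64_t chars_processed,
                     uint64_t special_chars_encountered)
{
    /* No special characters seen yet: chunk-skipping can only help. */
    if (special_chars_encountered == 0)
        return true;

    /* Average run of ordinary bytes between special bytes. */
    return chars_processed / special_chars_encountered
           >= SIMD_MIN_CHARS_PER_SPECIAL;
}
```

Compared with a fixed SPECIAL_CHAR_SIMD_THRESHOLD, this adapts to the density of special characters rather than to an absolute count, which is why it pairs naturally with moving the check out of the hot loop into CopyReadLine().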
I was helping collect some data, but I'm a little behind sharing what I
ran against the v4.1 patches (on commit 07961ef8) with the v4.2 version
out there...
I hope it's still helpful that I share what I collected even though
they are not quite as nice, but maybe it's more about how/where I ran
them.
My laptop has an Intel(R) Core(TM) Ultra 7 165H, where most of these
tests were using up 95%+ of one of the cores (I have hyperthreading
disabled), and using about 10% of the SSD's capacity.
Summarizing my results from the same script Manni ran: I didn't see as
much of an improvement in the positive tests, and saw worse regressions
in the other tests.
text copy with no special chars: 18% improvement (15s off the 80s
pre-patch time)
CSV copy with no special chars: 23% improvement (23s off the 96s
pre-patch time)
text with 1/3rd escapes: 6% slower (an additional 5s on the 85s
pre-patch time)
CSV with 1/3rd quotes: 7% slower (an additional 10s on the 129s
pre-patch time)
I'm wondering if my laptop/processor isn't the best test bed for this...
Regards,
Mark
--
Mark Wong <markwkm@gmail.com>
EDB https://enterprisedb.com
On Fri, Dec 12, 2025 at 2:42 PM Mark Wong <markwkm@gmail.com> wrote:
Hello, Everyone!
I have attached two files. 1) the shell script that Mark and I have been
using to get our test results, and 2) a screenshot of a spreadsheet of my
latest test results. (Please let me know if there's a different format than
a screenshot that I could share my spreadsheet in.)
I took greater care this time to compile all three variants of Postgres
(master at bfb335df, master at bfb335df with v4.2 patches installed, master
at bfb335df with v3 patches installed) with the same gcc optimization flags
that would be used to build Postgres packages. To the best of my knowledge,
the two gcc flags of greatest interest would be -g and -O2. I built all
three variants of Postgres using meson like so:
BRANCH=$(git branch --show-current)
meson setup build --prefix=/home/mwood/compiled-pg-instances/${BRANCH} --buildtype=debugoptimized
It occurred to me that end users mainly care about 1) wall clock time
(is the speedup noticeable in "real time", or just technically faster /
using less CPU?) and 2) the behavior of Postgres binaries compiled with
the same optimization level one would get when installing Postgres from
packages like .deb or .rpm; in other words, will users see speedups
without having to manually compile Postgres themselves.
My interesting finding, on my laptop (ThinkPad P14s Gen 1 running Ubuntu
24.04.3), is different from Mark Wong's. On my laptop, using three
Postgres installations all compiled with the -O2 optimization flag, I
see speedups with the v4.2 patch everywhere except CSV with 1/3rd
quotes, which shows a 2% slowdown. But with Nazir's proposed v3 patch, I
see improvements across the board. So even for a text file with 1/3rd
escape characters, and even with a CSV file with 1/3rd quotes, I see
speedups of 11% and 26% respectively.
The format of these test files originally comes from Ayoub Kazar's test
scripts; all Mark and I have done in playing with them is make them much
larger: 5,000,000 rows, based on the assumption that longer tests are
better tests.
I find my results interesting enough that I'd be curious to know if anybody
else can reproduce them. It is very interesting that Mark's results are
noticeably different from mine.
--
Manni Wood
EDB: https://www.enterprisedb.com