[PATCH] Optimize json_lex_string by batching character copying
When parsing JSON strings need to be converted from the JSON string
format to a c-style string. A simple copy of the buffer does not suffice
because of the various escape sequences that that JSON supports. Because
of this our JSON parser wrote characters into the c-style string buffer
one at a time.
However, this is only necessary for these escaped sequences that map to
another character. This patch changes the behaviour for non-escaped
characters. These are now copied in batches instead of one character at
a time.
To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB. The JSONB values
are a JSON object with a single level. They contain a few small keys and
values, but one very big value that's a stringified JSON blob. So this
JSON blob contains a relatively high number of escape characters, to
escape all the " characters. This change improves performance for
workload this workload on my machine by ~18% (going from 1m24s to 1m09s).
@Andres, there was indeed some low hanging fruit.
@John Naylor, SSE2 indeed sounds like another nice improvement. I'll leave
that to you.
Attachments:
0001-Optimize-json_lex_string-by-batching-character-copie.patchapplication/octet-stream; name=0001-Optimize-json_lex_string-by-batching-character-copie.patchDownload
From 8f8aa638ee2b6dfa85fa8bc0ec5788f44768e92f Mon Sep 17 00:00:00 2001
From: Jelte Fennema <github-tech@jeltef.nl>
Date: Fri, 24 Jun 2022 09:19:13 +0200
Subject: [PATCH] Optimize json_lex_string by batching character copies
When parsing JSON strings need to be converted from the JSON string
format to a c-style string. A simple copy of the buffer does not suffice
because of the various escape sequences that that JSON supports. Because
of this our JSON parser wrote characters into the c-style string buffer
one at a time.
However, this is only necessary for these escaped sequences that map to
another character. This patch changes the behaviour for non-escaped
characters. These are now copied in batches instead of one character at
a time.
To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB. The JSONB values
are a JSON object with a single level. They contain a few small keys and
values, but one very big value that's a stringified JSON blob. So this
JSON blob contains a relatively high number of escape characters, to
escape all the " characters. This change improves performance for
workload this workload on my machine by ~18% (going from 1m24s to 1m09s).
---
src/common/jsonapi.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 98e4ef0942..219ecb9df9 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -674,6 +674,7 @@ json_lex_string(JsonLexContext *lex)
char *s;
int len;
int hi_surrogate = -1;
+ int copyable_characters_length = 0;
if (lex->strval != NULL)
resetStringInfo(lex->strval);
@@ -692,7 +693,18 @@ json_lex_string(JsonLexContext *lex)
return JSON_INVALID_TOKEN;
}
else if (*s == '"')
+ {
+ if (copyable_characters_length)
+ {
+ /* flush copyable characters */
+ appendBinaryStringInfo(
+ lex->strval,
+ s - copyable_characters_length,
+ copyable_characters_length);
+
+ }
break;
+ }
else if ((unsigned char) *s < 32)
{
/* Per RFC4627, these characters MUST be escaped. */
@@ -702,6 +714,16 @@ json_lex_string(JsonLexContext *lex)
}
else if (*s == '\\')
{
+ if (copyable_characters_length)
+ {
+ /* flush copyable characters */
+ appendBinaryStringInfo(
+ lex->strval,
+ s - copyable_characters_length,
+ copyable_characters_length);
+ copyable_characters_length = 0;
+
+ }
/* OK, we have an escape character. */
s++;
len++;
@@ -818,7 +840,7 @@ json_lex_string(JsonLexContext *lex)
case '"':
case '\\':
case '/':
- appendStringInfoChar(lex->strval, *s);
+ copyable_characters_length++;
break;
case 'b':
appendStringInfoChar(lex->strval, '\b');
@@ -861,7 +883,7 @@ json_lex_string(JsonLexContext *lex)
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
- appendStringInfoChar(lex->strval, *s);
+ copyable_characters_length++;
}
}
--
2.34.1
Hi,
Looking at the patch,
+ if (copyable_characters_length)
+ {
+ /* flush copyable characters */
+ appendBinaryStringInfo(
+ lex->strval,
+ s - copyable_characters_length,
+ copyable_characters_length);
+
+ }
break;
I wonder why copyable_characters_length is not reset after flushing.
Cheers
Import Notes
Resolved by subject fallback
+ if (copyable_characters_length) + { + /* flush copyable characters */ + appendBinaryStringInfo( + lex->strval, + s - copyable_characters_length, + copyable_characters_length); + + } break;I wonder why copyable_characters_length is not reset after flushing.
It breaks from the loop right after. So copyable_characters_length isn't used
again and thus resetting is not necessary. But I agree this could use a comment.
Hi,
On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:
To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB.
This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
ontop of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.
It's useful for things like this to include a way for others to use the same
benchmark...
I tried your patch with:
DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;
SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';
Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.
I think your patch doesn't quite go far enough - we still end up looping for
each character, have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop to see until where the string
last isn't faster. That then obviously could be SIMDified.
Separately, it seems pretty awful efficiency / code density wise to have the
NULL checks for ->strval all over. Might be worth forcing json_lex() and
json_lex_string() to be inlined, with a constant parameter deciding whether
->strval is expected. That'd likely be enough to get the compiler specialize
the code for us.
Might also be worth to maintain ->strval using appendBinaryStringInfoNT().
Greetings,
Andres Freund
Hi,
On 2022-06-24 17:18:10 -0700, Andres Freund wrote:
On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:
To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB.This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
ontop of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.It's useful for things like this to include a way for others to use the same
benchmark...I tried your patch with:
DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';
Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.I think your patch doesn't quite go far enough - we still end up looping for
each character, have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop to see until where the string
last isn't faster. That then obviously could be SIMDified.
A naive implementation (attached) of that gets me down to 706ms.
Greetings,
Andres Freund
Attachments:
json-lex-string-lookahead-speed.difftext/x-diff; charset=us-asciiDownload
diff --git i/src/common/jsonapi.c w/src/common/jsonapi.c
index 98e4ef09426..63d92c66aec 100644
--- i/src/common/jsonapi.c
+++ w/src/common/jsonapi.c
@@ -858,10 +858,25 @@ json_lex_string(JsonLexContext *lex)
}
else if (lex->strval != NULL)
{
+ size_t chunklen = 1;
+
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
- appendStringInfoChar(lex->strval, *s);
+ while (len + chunklen < lex->input_length)
+ {
+ char next = *(s + chunklen);
+
+ if (next == '\\' || next == '"' || (unsigned char) next < 32)
+ break;
+
+ chunklen++;
+ }
+
+ appendBinaryStringInfo(lex->strval, s, chunklen);
+
+ s += (chunklen - 1);
+ len += (chunklen - 1);
}
}
On Sat, Jun 25, 2022 at 8:05 AM Andres Freund <andres@anarazel.de> wrote:
I tried your patch with:
DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';
Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.I think your patch doesn't quite go far enough - we still end up looping for
each character, have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop to see until where the string
last isn't faster. That then obviously could be SIMDified.A naive implementation (attached) of that gets me down to 706ms.
Taking this a step further, I modified json_lex and json_lex_string to
use a const end pointer instead of maintaining the length (0001). The
observed speedup is small enough that it might not be real, but the
code is simpler this way, and it makes 0002 and 0003 easier to reason
about. Then I modified your patch to do the same (0002). Hackish SSE2
support is in 0003.
To exercise the SIMD code a bit, I added a second test:
DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 10) from pg_description pd
)
SELECT (select json_agg(row_to_json(long)) as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;
SELECT 1 FROM long_json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';
With this, I get (turbo disabled, min of 3):
short test:
master: 769ms
0001: 751ms
0002: 703ms
0003: 701ms
long test;
master: 939ms
0001: 883ms
0002: 537ms
0003: 439ms
I think 0001/2 are mostly in committable shape.
With 0003, I'd want to make the portability check a bit nicer and more
centralized. I'm thinking of modifying the CRC check to report that
the host cpu/compiler understands SSE4.2 x86 intrinsics, and then the
compile time SSE2 check can piggyback on top of that without a runtime
check. This is conceptually easy but a bit of work to not look like a
hack (which probably means the ARM CRC check should look more generic
somehow...). The regression tests will likely need some work as well.
Separately, it seems pretty awful efficiency / code density wise to have the
NULL checks for ->strval all over. Might be worth forcing json_lex() and
json_lex_string() to be inlined, with a constant parameter deciding whether
->strval is expected. That'd likely be enough to get the compiler specialize
the code for us.
I had a look at this but it's a bit more invasive than I want to
devote time to at this point.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v3-0003-Use-vectorized-lookahead-in-json_lex_string-on-x8.patchtext/x-patch; charset=US-ASCII; name=v3-0003-Use-vectorized-lookahead-in-json_lex_string-on-x8.patchDownload
From 1484b7541b9ed3f0c476e31f99340d36895bc629 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 5 Jul 2022 18:57:37 +0700
Subject: [PATCH v3 3/3] Use vectorized lookahead in json_lex_string on x86
---
src/common/jsonapi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index ad4858c623..978f18b129 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -24,6 +24,12 @@
#include "miscadmin.h"
#endif
+/* WIP: put somewhere sensible and consider removing CRC from the names */
+#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
+#include <nmmintrin.h>
+#define USE_SSE2
+#endif
+
/*
* The context of the parser is maintained by the recursive descent
* mechanism, but is passed explicitly to the error reporting routine
@@ -851,12 +857,54 @@ json_lex_string(JsonLexContext *lex)
}
else if (lex->strval != NULL)
{
+#ifdef USE_SSE2
+ __m128i block,
+ has_backslash,
+ has_doublequote,
+ control,
+ has_control,
+ error_cum = _mm_setzero_si128();
+ const __m128i backslash = _mm_set1_epi8('\\');
+ const __m128i doublequote = _mm_set1_epi8('"');
+ const __m128i max_control = _mm_set1_epi8(0x1F);
+#endif
/* start lookahead at next byte */
char *p = s + 1;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
+#ifdef USE_SSE2
+ while (p < end - sizeof(__m128i))
+ {
+ block = _mm_loadu_si128((const __m128i *) p);
+
+ /* direct comparison to quotes and backslashes */
+ has_backslash = _mm_cmpeq_epi8(block, backslash);
+ has_doublequote = _mm_cmpeq_epi8(block, doublequote);
+
+ /*
+ * use saturation arithmetic to check for <= highest control
+ * char
+ */
+ control = _mm_subs_epu8(block, max_control);
+ has_control = _mm_cmpeq_epi8(control, _mm_setzero_si128());
+
+ /*
+ * set bits in error_cum where the corresponding lanes in has_*
+ * are set
+ */
+ error_cum = _mm_or_si128(error_cum, has_backslash);
+ error_cum = _mm_or_si128(error_cum, has_doublequote);
+ error_cum = _mm_or_si128(error_cum, has_control);
+
+ if (_mm_movemask_epi8(error_cum))
+ break;
+
+ p += sizeof(__m128i);
+ }
+#endif /* USE_SSE2 */
+
while (p < end)
{
if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
--
2.36.1
v3-0002-Build-json-strings-in-larger-chunks-during-lexing.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Build-json-strings-in-larger-chunks-during-lexing.patchDownload
From 03d3be083ba60d272e848b3ec96db3c6d47a3b06 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 1 Jul 2022 17:28:20 +0700
Subject: [PATCH v3 2/3] Build json strings in larger chunks during lexing
Add lookahead loop to json_lex_string. This way, we can batch calls
to appendBinaryStringInfo.
Jelte Fennema and Andres Freund, with some adjustments by me
Discussion:
https://www.postgresql.org/message-id/CAGECzQQuXbies_nKgSiYifZUjBk6nOf2%3DTSXqRjj2BhUh8CTeA%40mail.gmail.com
Discussion:
https://www.postgresql.org/message-id/flat/PR3PR83MB0476F098CBCF68AF7A1CA89FF7B49@PR3PR83MB0476.EURPRD83.prod.outlook.com
---
src/common/jsonapi.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index eeedc0645a..ad4858c623 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -851,10 +851,26 @@ json_lex_string(JsonLexContext *lex)
}
else if (lex->strval != NULL)
{
+ /* start lookahead at next byte */
+ char *p = s + 1;
+
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
- appendStringInfoChar(lex->strval, *s);
+ while (p < end)
+ {
+ if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
+ break;
+ p++;
+ }
+
+ appendBinaryStringInfo(lex->strval, s, p - s);
+
+ /*
+ * s will be incremented at the top of the loop, so set it to just
+ * behind our lookahead position
+ */
+ s = p - 1;
}
}
--
2.36.1
v3-0001-Simplify-json-lexing-state.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Simplify-json-lexing-state.patchDownload
From 3d8b39ff1c1a4abf9effc45323b293b62551770a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 08:35:24 +0700
Subject: [PATCH v3 1/3] Simplify json lexing state
Instead of updating the length as we go, use a const pointer to end of
the input, which we know already at the start
---
src/common/jsonapi.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 98e4ef0942..eeedc0645a 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -519,26 +519,23 @@ JsonParseErrorType
json_lex(JsonLexContext *lex)
{
char *s;
- int len;
+ char *const end = lex->input + lex->input_length;
JsonParseErrorType result;
/* Skip leading whitespace. */
s = lex->token_terminator;
- len = s - lex->input;
- while (len < lex->input_length &&
- (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
+ while (s < end && (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
{
if (*s++ == '\n')
{
++lex->line_number;
lex->line_start = s;
}
- len++;
}
lex->token_start = s;
/* Determine token type. */
- if (len >= lex->input_length)
+ if (s >= end)
{
lex->token_start = NULL;
lex->prev_token_terminator = lex->token_terminator;
@@ -623,7 +620,7 @@ json_lex(JsonLexContext *lex)
* the whole word as an unexpected token, rather than just
* some unintuitive prefix thereof.
*/
- for (p = s; p - s < lex->input_length - len && JSON_ALPHANUMERIC_CHAR(*p); p++)
+ for (p = s; p < end && JSON_ALPHANUMERIC_CHAR(*p); p++)
/* skip */ ;
/*
@@ -672,7 +669,7 @@ static inline JsonParseErrorType
json_lex_string(JsonLexContext *lex)
{
char *s;
- int len;
+ char *const end = lex->input + lex->input_length;
int hi_surrogate = -1;
if (lex->strval != NULL)
@@ -680,13 +677,11 @@ json_lex_string(JsonLexContext *lex)
Assert(lex->input_length > 0);
s = lex->token_start;
- len = lex->token_start - lex->input;
for (;;)
{
s++;
- len++;
/* Premature end of the string. */
- if (len >= lex->input_length)
+ if (s >= end)
{
lex->token_terminator = s;
return JSON_INVALID_TOKEN;
@@ -704,8 +699,7 @@ json_lex_string(JsonLexContext *lex)
{
/* OK, we have an escape character. */
s++;
- len++;
- if (len >= lex->input_length)
+ if (s >= end)
{
lex->token_terminator = s;
return JSON_INVALID_TOKEN;
@@ -718,8 +712,7 @@ json_lex_string(JsonLexContext *lex)
for (i = 1; i <= 4; i++)
{
s++;
- len++;
- if (len >= lex->input_length)
+ if (s >= end)
{
lex->token_terminator = s;
return JSON_INVALID_TOKEN;
--
2.36.1
Hi,
On 2022-07-06 12:10:20 +0700, John Naylor wrote:
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c index eeedc0645a..ad4858c623 100644 --- a/src/common/jsonapi.c +++ b/src/common/jsonapi.c @@ -851,10 +851,26 @@ json_lex_string(JsonLexContext *lex) } else if (lex->strval != NULL) { + /* start lookahead at next byte */ + char *p = s + 1; + if (hi_surrogate != -1) return JSON_UNICODE_LOW_SURROGATE;- appendStringInfoChar(lex->strval, *s); + while (p < end) + { + if (*p == '\\' || *p == '"' || (unsigned char) *p < 32) + break; + p++; + } + + appendBinaryStringInfo(lex->strval, s, p - s); + + /* + * s will be incremented at the top of the loop, so set it to just + * behind our lookahead position + */ + s = p - 1; } }--
2.36.1
I think before committing something along those lines we should make the
relevant bits also be applicable when ->strval is NULL, as several functions
use that (notably json_in IIRC). Afaics we'd just need to move the strval
check to be around the appendBinaryStringInfo(). And it should simplify the
function, because some of the relevant code is duplicated outside as well...
Greetings,
Andres Freund
On Wed, Jul 6, 2022 at 12:18 PM Andres Freund <andres@anarazel.de> wrote:
I think before committing something along those lines we should make the
relevant bits also be applicable when ->strval is NULL, as several functions
use that (notably json_in IIRC). Afaics we'd just need to move the strval
check to be around the appendBinaryStringInfo().
That makes sense and is easy.
And it should simplify the
function, because some of the relevant code is duplicated outside as well...
Not sure how far to take this, but I put the returnable paths inside
the "other" path, so only backslash will go back to the top.
Both the above changes are split into a new 0003 patch for easier
review, but in the end will likely be squashed with 0002.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v4-0001-Simplify-json-lexing-state.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Simplify-json-lexing-state.patchDownload
From 3d8b39ff1c1a4abf9effc45323b293b62551770a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 08:35:24 +0700
Subject: [PATCH v4 1/4] Simplify json lexing state
Instead of updating the length as we go, use a const pointer to end of
the input, which we know already at the start
---
src/common/jsonapi.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 98e4ef0942..eeedc0645a 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -519,26 +519,23 @@ JsonParseErrorType
json_lex(JsonLexContext *lex)
{
char *s;
- int len;
+ char *const end = lex->input + lex->input_length;
JsonParseErrorType result;
/* Skip leading whitespace. */
s = lex->token_terminator;
- len = s - lex->input;
- while (len < lex->input_length &&
- (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
+ while (s < end && (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
{
if (*s++ == '\n')
{
++lex->line_number;
lex->line_start = s;
}
- len++;
}
lex->token_start = s;
/* Determine token type. */
- if (len >= lex->input_length)
+ if (s >= end)
{
lex->token_start = NULL;
lex->prev_token_terminator = lex->token_terminator;
@@ -623,7 +620,7 @@ json_lex(JsonLexContext *lex)
* the whole word as an unexpected token, rather than just
* some unintuitive prefix thereof.
*/
- for (p = s; p - s < lex->input_length - len && JSON_ALPHANUMERIC_CHAR(*p); p++)
+ for (p = s; p < end && JSON_ALPHANUMERIC_CHAR(*p); p++)
/* skip */ ;
/*
@@ -672,7 +669,7 @@ static inline JsonParseErrorType
json_lex_string(JsonLexContext *lex)
{
char *s;
- int len;
+ char *const end = lex->input + lex->input_length;
int hi_surrogate = -1;
if (lex->strval != NULL)
@@ -680,13 +677,11 @@ json_lex_string(JsonLexContext *lex)
Assert(lex->input_length > 0);
s = lex->token_start;
- len = lex->token_start - lex->input;
for (;;)
{
s++;
- len++;
/* Premature end of the string. */
- if (len >= lex->input_length)
+ if (s >= end)
{
lex->token_terminator = s;
return JSON_INVALID_TOKEN;
@@ -704,8 +699,7 @@ json_lex_string(JsonLexContext *lex)
{
/* OK, we have an escape character. */
s++;
- len++;
- if (len >= lex->input_length)
+ if (s >= end)
{
lex->token_terminator = s;
return JSON_INVALID_TOKEN;
@@ -718,8 +712,7 @@ json_lex_string(JsonLexContext *lex)
for (i = 1; i <= 4; i++)
{
s++;
- len++;
- if (len >= lex->input_length)
+ if (s >= end)
{
lex->token_terminator = s;
return JSON_INVALID_TOKEN;
--
2.36.1
v4-0004-Use-vectorized-lookahead-in-json_lex_string-on-x8.patchtext/x-patch; charset=US-ASCII; name=v4-0004-Use-vectorized-lookahead-in-json_lex_string-on-x8.patchDownload
From 82e13b6bebd85a152ededcfd75495c0c0f642354 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 15:50:09 +0700
Subject: [PATCH v4 4/4] Use vectorized lookahead in json_lex_string on x86
---
src/common/jsonapi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 81e176ad8d..44e8ed2b2f 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -24,6 +24,12 @@
#include "miscadmin.h"
#endif
+/* WIP: put somewhere sensible and consider removing CRC from the names */
+#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
+#include <nmmintrin.h>
+#define USE_SSE2
+#endif
+
/*
* The context of the parser is maintained by the recursive descent
* mechanism, but is passed explicitly to the error reporting routine
@@ -842,12 +848,54 @@ json_lex_string(JsonLexContext *lex)
}
else
{
+#ifdef USE_SSE2
+ __m128i block,
+ has_backslash,
+ has_doublequote,
+ control,
+ has_control,
+ error_cum = _mm_setzero_si128();
+ const __m128i backslash = _mm_set1_epi8('\\');
+ const __m128i doublequote = _mm_set1_epi8('"');
+ const __m128i max_control = _mm_set1_epi8(0x1F);
+#endif
/* start lookahead at current byte */
char *p = s;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
+#ifdef USE_SSE2
+ while (p < end - sizeof(__m128i))
+ {
+ block = _mm_loadu_si128((const __m128i *) p);
+
+ /* direct comparison to quotes and backslashes */
+ has_backslash = _mm_cmpeq_epi8(block, backslash);
+ has_doublequote = _mm_cmpeq_epi8(block, doublequote);
+
+ /*
+ * use saturation arithmetic to check for <= highest control
+ * char
+ */
+ control = _mm_subs_epu8(block, max_control);
+ has_control = _mm_cmpeq_epi8(control, _mm_setzero_si128());
+
+ /*
+ * set bits in error_cum where the corresponding lanes in has_*
+ * are set
+ */
+ error_cum = _mm_or_si128(error_cum, has_backslash);
+ error_cum = _mm_or_si128(error_cum, has_doublequote);
+ error_cum = _mm_or_si128(error_cum, has_control);
+
+ if (_mm_movemask_epi8(error_cum))
+ break;
+
+ p += sizeof(__m128i);
+ }
+#endif /* USE_SSE2 */
+
while (p < end)
{
if (*p == '\\' || *p == '"')
--
2.36.1
v4-0003-Use-lookahead-path-in-json-string-lexing-for-the-.patchtext/x-patch; charset=US-ASCII; name=v4-0003-Use-lookahead-path-in-json-string-lexing-for-the-.patchDownload
From ef486287090daa24d51735ba9fa9585341b6e8ec Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 15:35:33 +0700
Subject: [PATCH v4 3/4] Use lookahead path in json string lexing for the
non-escape case too
This removes some duplicated code and enables the no-escape path
to be optimized in the same way.
Per suggestion from Andres Freund
---
src/common/jsonapi.c | 46 +++++++++++++++++++++++---------------------
1 file changed, 24 insertions(+), 22 deletions(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index ad4858c623..81e176ad8d 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -686,15 +686,6 @@ json_lex_string(JsonLexContext *lex)
lex->token_terminator = s;
return JSON_INVALID_TOKEN;
}
- else if (*s == '"')
- break;
- else if ((unsigned char) *s < 32)
- {
- /* Per RFC4627, these characters MUST be escaped. */
- /* Since *s isn't printable, exclude it from the context string */
- lex->token_terminator = s;
- return JSON_ESCAPING_REQUIRED;
- }
else if (*s == '\\')
{
/* OK, we have an escape character. */
@@ -849,22 +840,41 @@ json_lex_string(JsonLexContext *lex)
return JSON_ESCAPING_INVALID;
}
}
- else if (lex->strval != NULL)
+ else
{
- /* start lookahead at next byte */
- char *p = s + 1;
+ /* start lookahead at current byte */
+ char *p = s;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
while (p < end)
{
- if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
+ if (*p == '\\' || *p == '"')
break;
+ else if ((unsigned char) *p < 32)
+ {
+ /* Per RFC4627, these characters MUST be escaped. */
+ /*
+ * Since *s isn't printable, exclude it from the context
+ * string
+ */
+ lex->token_terminator = p;
+ return JSON_ESCAPING_REQUIRED;
+ }
p++;
}
- appendBinaryStringInfo(lex->strval, s, p - s);
+ if (lex->strval != NULL)
+ appendBinaryStringInfo(lex->strval, s, p - s);
+
+ if (*p == '"')
+ {
+ /* Hooray, we found the end of the string! */
+ lex->prev_token_terminator = lex->token_terminator;
+ lex->token_terminator = p + 1;
+ return JSON_SUCCESS;
+ }
/*
* s will be incremented at the top of the loop, so set it to just
@@ -873,14 +883,6 @@ json_lex_string(JsonLexContext *lex)
s = p - 1;
}
}
-
- if (hi_surrogate != -1)
- return JSON_UNICODE_LOW_SURROGATE;
-
- /* Hooray, we found the end of the string! */
- lex->prev_token_terminator = lex->token_terminator;
- lex->token_terminator = s + 1;
- return JSON_SUCCESS;
}
/*
--
2.36.1
v4-0002-Build-json-strings-in-larger-chunks-during-lexing.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Build-json-strings-in-larger-chunks-during-lexing.patchDownload
From 03d3be083ba60d272e848b3ec96db3c6d47a3b06 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 1 Jul 2022 17:28:20 +0700
Subject: [PATCH v4 2/4] Build json strings in larger chunks during lexing
Add lookahead loop to json_lex_string. This way, we can batch calls
to appendBinaryStringInfo.
Jelte Fennema and Andres Freund, with some adjustments by me
Discussion:
https://www.postgresql.org/message-id/CAGECzQQuXbies_nKgSiYifZUjBk6nOf2%3DTSXqRjj2BhUh8CTeA%40mail.gmail.com
Discussion:
https://www.postgresql.org/message-id/flat/PR3PR83MB0476F098CBCF68AF7A1CA89FF7B49@PR3PR83MB0476.EURPRD83.prod.outlook.com
---
src/common/jsonapi.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index eeedc0645a..ad4858c623 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -851,10 +851,26 @@ json_lex_string(JsonLexContext *lex)
}
else if (lex->strval != NULL)
{
+ /* start lookahead at next byte */
+ char *p = s + 1;
+
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
- appendStringInfoChar(lex->strval, *s);
+ while (p < end)
+ {
+ if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
+ break;
+ p++;
+ }
+
+ appendBinaryStringInfo(lex->strval, s, p - s);
+
+ /*
+ * s will be incremented at the top of the loop, so set it to just
+ * behind our lookahead position
+ */
+ s = p - 1;
}
}
--
2.36.1
I've pushed 0001 (although the email seems to have been swallowed
again), and pending additional comments on 0002 and 0003 I'll squash
and push those next week. 0004 needs some thought on integrating with
symbols we discover during configure.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Jul 8, 2022 at 3:06 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I've pushed 0001 (although the email seems to have been swallowed
again), and pending additional comments on 0002 and 0003 I'll squash
and push those next week.
This is done.
0004 needs some thought on integrating with
symbols we discover during configure.
Still needs thought.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2022-07-06 15:58:44 +0700, John Naylor wrote:
From 82e13b6bebd85a152ededcfd75495c0c0f642354 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 15:50:09 +0700
Subject: [PATCH v4 4/4] Use vectorized lookahead in json_lex_string on x86---
src/common/jsonapi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c index 81e176ad8d..44e8ed2b2f 100644 --- a/src/common/jsonapi.c +++ b/src/common/jsonapi.c @@ -24,6 +24,12 @@ #include "miscadmin.h" #endif+/* WIP: put somewhere sensible and consider removing CRC from the names */ +#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)) +#include <nmmintrin.h> +#define USE_SSE2 +#endif + /* * The context of the parser is maintained by the recursive descent * mechanism, but is passed explicitly to the error reporting routine @@ -842,12 +848,54 @@ json_lex_string(JsonLexContext *lex) } else { +#ifdef USE_SSE2 + __m128i block, + has_backslash, + has_doublequote, + control, + has_control, + error_cum = _mm_setzero_si128(); + const __m128i backslash = _mm_set1_epi8('\\'); + const __m128i doublequote = _mm_set1_epi8('"'); + const __m128i max_control = _mm_set1_epi8(0x1F); +#endif /* start lookahead at current byte */ char *p = s;if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;+#ifdef USE_SSE2 + while (p < end - sizeof(__m128i)) + { + block = _mm_loadu_si128((const __m128i *) p); + + /* direct comparison to quotes and backslashes */ + has_backslash = _mm_cmpeq_epi8(block, backslash); + has_doublequote = _mm_cmpeq_epi8(block, doublequote); + + /* + * use saturation arithmetic to check for <= highest control + * char + */ + control = _mm_subs_epu8(block, max_control); + has_control = _mm_cmpeq_epi8(control, _mm_setzero_si128()); + + /* + * set bits in error_cum where the corresponding lanes in has_* + * are set + */ + error_cum = _mm_or_si128(error_cum, has_backslash); + error_cum = _mm_or_si128(error_cum, has_doublequote); + error_cum = _mm_or_si128(error_cum, has_control); + + if (_mm_movemask_epi8(error_cum)) + break; + + p += sizeof(__m128i); + } +#endif /* USE_SSE2 */ + while (p < end) { if (*p == '\\' || *p == '"') -- 2.36.1
I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.
IMO, it's pretty unreadable *now*, for lack of comments about what it's
doing and why.
regards, tom lane
Hi,
On 2022-07-11 11:53:26 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.IMO, it's pretty unreadable *now*, for lack of comments about what it's
doing and why.
Yea, that could at least be addressed by adding comments. But even with a
bunch of comments, it'd still be pretty hard to read once the events above
have happened (and they seem kind of inevitable).
I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that could
be useful.
Greetings,
Andres Freund
On Mon, Jul 11, 2022 at 11:07 PM Andres Freund <andres@anarazel.de> wrote:
I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that
could
be useful.
In simple cases, we could possibly abstract the entire loop. With this
particular case, I imagine the most approachable way to write the loop
would be a bit more low-level:
while (p < end - VECTOR_WIDTH &&
!vector_has_byte(p, '\\') &&
!vector_has_byte(p, '"') &&
vector_min_byte(p, 0x20))
p += VECTOR_WIDTH
I wonder if we'd lose a bit of efficiency here by not accumulating set bits
from the three conditions, but it's worth trying.
--
John Naylor
EDB: http://www.enterprisedb.com
On 2022-06-24 Fr 20:18, Andres Freund wrote:
Hi,
On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:
To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB.This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
ontop of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.It's useful for things like this to include a way for others to use the same
benchmark...I tried your patch with:
DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';
I've been doing some other work related to json parsing and John
referred me to this. But it's actually not the best test for pure json
parsing - casting to jsonb involves some extra work besides pure
parsing. Instead I've been using this query with the same table, which
should be almost all json parsing:
select 1 from json_as_text where t::json is null;
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
I wrote
On Mon, Jul 11, 2022 at 11:07 PM Andres Freund <andres@anarazel.de> wrote:
I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that could
be useful.In simple cases, we could possibly abstract the entire loop. With this particular case, I imagine the most approachable way to write the loop would be a bit more low-level:
while (p < end - VECTOR_WIDTH &&
!vector_has_byte(p, '\\') &&
!vector_has_byte(p, '"') &&
vector_min_byte(p, 0x20))
p += VECTOR_WIDTHI wonder if we'd lose a bit of efficiency here by not accumulating set bits from the three conditions, but it's worth trying.
The attached implements the above, more or less, using new pg_lfind8()
and pg_lfind8_le(), which in turn are based on helper functions that
act on a single vector. The pg_lfind* functions have regression tests,
but I haven't done the same for json yet. I went the extra step to use
bit-twiddling for non-SSE builds using uint64 as a "vector", which
still gives a pretty good boost (test below, min of 3):
master:
356ms
v5:
259ms
v5 disable SSE:
288ms
It still needs a bit of polishing and testing, but I think it's a good
workout for abstracting SIMD out of the way.
-------------
test:
DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;
select 1 from long_json_as_text where t::json is null; -- from Andrew upthread
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v5-json-lex-string-simd-ops.patchtext/x-patch; charset=US-ASCII; name=v5-json-lex-string-simd-ops.patchDownload
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..1f9eb134e8 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
#include "common/jsonapi.h"
#include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
#ifndef FRONTEND
#include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
}
else
{
- char *p;
+ char *p = s;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
* Skip to the first byte that requires special handling, so we
* can batch calls to appendBinaryStringInfo.
*/
- for (p = s; p < end; p++)
+ while (p < end - SIZEOF_VECTOR &&
+ !pg_lfind8('\\', (unsigned char*) p, SIZEOF_VECTOR) &&
+ !pg_lfind8('"', (unsigned char*) p, SIZEOF_VECTOR) &&
+ !pg_lfind8_le(0x1F, (unsigned char*) p, SIZEOF_VECTOR))
+ p += SIZEOF_VECTOR;
+
+ for (; p < end; p++)
{
if (*p == '\\' || *p == '"')
break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..e090ea6ac3 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,7 @@
/*-------------------------------------------------------------------------
*
* pg_lfind.h
- * Optimized linear search routines.
+ * Optimized linear search routines using SIMD intrinsics where available
*
* Copyright (c) 2022, PostgreSQL Global Development Group
*
@@ -15,6 +15,76 @@
#include "port/simd.h"
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+ /* round down to multiple of vector length */
+ uint32 iterations = nelem & ~(SIZEOF_VECTOR - 1);
+ TYPEOF_VECTOR chunk;
+
+ for (i = 0; i < iterations; i += SIZEOF_VECTOR)
+ {
+#ifdef USE_SSE2
+ chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+ memcpy(&chunk, &base[i], sizeof(chunk));
+#endif /* USE_SSE2 */
+ if (vector_eq_byte(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that is less than or equal to 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+ /* round down to multiple of vector length */
+ uint32 iterations = nelem & ~(SIZEOF_VECTOR - 1);
+ TYPEOF_VECTOR chunk;
+
+ for (i = 0; i < iterations; i += SIZEOF_VECTOR)
+ {
+#ifdef USE_SSE2
+ chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+ memcpy(&chunk, &base[i], sizeof(chunk));
+#endif /* USE_SSE2 */
+ if (vector_le_byte(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (base[i] <= key)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pg_lfind32
*
@@ -26,7 +96,6 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
{
uint32 i = 0;
- /* Use SIMD intrinsics where available. */
#ifdef USE_SSE2
/*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..0185bc2ae0 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -25,6 +25,125 @@
#if (defined(__x86_64__) || defined(_M_AMD64))
#include <emmintrin.h>
#define USE_SSE2
+#define TYPEOF_VECTOR __m128i
+
+#else
+#define TYPEOF_VECTOR uint64
+#endif /* (defined(__x86_64__) || defined(_M_AMD64)) */
+
+#define SIZEOF_VECTOR sizeof(TYPEOF_VECTOR)
+
+/* return a vector with all bytes set to c */
+static inline TYPEOF_VECTOR
+vector_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi8(c);
+#else
+ return ~UINT64CONST(0) / 0xFF * c;
#endif
+}
+
+/* return true if any bytes in the vector are zero */
+static inline bool
+vector_has_zero(const TYPEOF_VECTOR v)
+{
+#ifdef USE_SSE2
+ return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+ /* from https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */
+ return (v - vector_broadcast(0x01)) & ~v & vector_broadcast(0x80);
+#endif
+}
+
+static inline bool
+vector_eq_byte(const TYPEOF_VECTOR v, const uint8 c)
+{
+ bool result;
+
+ /* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+ bool assert_result = false;
+ const char* s = (const char*) &v;
+
+ for (int j = 0; j < SIZEOF_VECTOR; j++)
+ {
+ if (s[j] == c)
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+ result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector_broadcast(c)));
+#else
+ /* any bytes in v equal to c will evaluate to zero via XOR */
+ result = vector_has_zero(v ^ vector_broadcast(c));
+#endif /* USE_SSE2 */
+
+#ifdef USE_ASSERT_CHECKING
+ Assert(assert_result == result);
+#endif /* USE_ASSERT_CHECKING */
+
+ return result;
+}
+
+static inline bool
+vector_le_byte(const TYPEOF_VECTOR v, const uint8 c)
+{
+ bool result;
+#ifdef USE_SSE2
+ __m128i sub;
+#endif
+
+#if !defined(USE_SSE2) || defined(USE_ASSERT_CHECKING)
+ const char* s = (const char*) &v;
+#endif
+
+ /* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+ bool assert_result = false;
+
+ for (int j = 0; j < SIZEOF_VECTOR; j++)
+ {
+ if (s[j] <= c)
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+ /* use saturating subtraction to find bytes <= c, which will present as NUL bytes in 'sub' */
+ sub = _mm_subs_epu8(v, vector_broadcast(c));
+ return _mm_movemask_epi8(_mm_cmpeq_epi8(sub, _mm_setzero_si128()));
+#else
+ /* to find bytes <= c, we can use bitwise operations to find bytes < c+1, but it only works if c+1 <= 128 */
+ /* from https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord */
+ if (c + 1 <= 128)
+ return (v - vector_broadcast(c + 1)) & ~v & vector_broadcast(0x80);
+ else
+ {
+ /* one byte at a time */
+ for (int j = 0; j < SIZEOF_VECTOR; j++)
+ {
+ if (s[j] <= c)
+ {
+ result = true;
+ break;
+ }
+ }
+ }
+#endif
+
+#ifdef USE_ASSERT_CHECKING
+ Assert(assert_result == result);
+#endif /* USE_ASSERT_CHECKING */
+
+ return result;
+}
#endif /* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
- test_lfind
-------------
+SELECT test_lfind8();
+ test_lfind8
+-------------
+
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le
+----------------
+
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32
+--------------
(1 row)
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
RETURNS pg_catalog.void
AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..b853ddb609 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -18,10 +18,104 @@
PG_MODULE_MAGIC;
-PG_FUNCTION_INFO_V1(test_lfind);
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+ unsigned char* str1 = (unsigned char*) "1234567890abcdef";
+ unsigned char* str2 = (unsigned char*) "1234567890abcdefZ";
+ size_t len1 = strlen((char*) str1);
+ size_t len2 = strlen((char*) str2);
+ uint8 key;
+
+ /* whole length of 16*/
+ Assert(len1 == 16);
+
+ key = 'X';
+ if (pg_lfind8(key, str1, len1))
+ elog(ERROR, "pg_lfind8() found nonexistent element '%c'", key);
+ key = 'f';
+ if (!pg_lfind8(key, str1, len1))
+ elog(ERROR, "pg_lfind8() did not find existing element '%c'", key);
+ /* include terminator */
+ key = '\0';
+ if (!pg_lfind8(key, str1, len1 + 1))
+ elog(ERROR, "pg_lfind8() did not find existing element '%c'", key);
+
+ /* restricted length */
+ key = '5';
+ if (pg_lfind8(key, str1, 4))
+ elog(ERROR, "pg_lfind8() found nonexistent element '%c'", key);
+ key = '4';
+ if (!pg_lfind8(key, str1, 4))
+ elog(ERROR, "pg_lfind8() did not find existing element '%c'", key);
+
+ /* test byte-wise path with string larger than vector size */
+ Assert(len2 > 16);
+
+ key = 'Y';
+ if (pg_lfind8(key, str2, len2))
+ elog(ERROR, "pg_lfind8() found nonexistent element '%c'", key);
+ key = 'Z';
+ if (!pg_lfind8(key, str2, len2))
+ elog(ERROR, "pg_lfind8() did not find existing element '%c'", key);
+
+ PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+ unsigned char* str1 = (unsigned char*) "A_CDEFGHIJKLMNO_";
+ unsigned char* str2 = (unsigned char*) "A_CDEFGHIJKLMNO_3";
+ size_t len1 = strlen((char*) str1);
+ size_t len2 = strlen((char*) str2);
+ uint8 key;
+
+ /* whole length of 16*/
+ Assert(len1 == 16);
+
+ /* search for char with value one less than minimum */
+ key = '@';
+ if (pg_lfind8_le(key, str1, len1))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '%c'", key);
+ /* search for minimum char */
+ key = 'A';
+ if (!pg_lfind8_le(key, str1, len1))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '%c'", key);
+ key = 'B';
+ if (!pg_lfind8_le(key, str1, len1))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '%c'", key);
+
+ /* search for terminating null by <= 0x01 */
+ key = 0x01;
+ if (!pg_lfind8_le(key, str1, len1 + 1))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '%c'", key);
+
+ /* test byte-wise path with string larger than vector size */
+ Assert(len2 > 16);
+
+ /* search for char with value one less than minimum */
+ key = '2';
+ if (pg_lfind8_le(key, str2, len2))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '%c'", key);
+
+ /* search for minimum char */
+ key = '3';
+ if (!pg_lfind8_le(key, str2, len2))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '%c'", key);
+ key = '4';
+ if (!pg_lfind8_le(key, str2, len2))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '%c'", key);
+
+ PG_RETURN_VOID();
+}
+PG_FUNCTION_INFO_V1(test_lfind32);
Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
{
#define TEST_ARRAY_SIZE 135
uint32 test_array[TEST_ARRAY_SIZE] = {0};
Hi,
I ran this test.
DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;
select 1 from long_json_as_text where t::json is null;
head:
Time: 161,741ms
v5:
Time: 270,298 ms
ubuntu 64 bits
gcc 9.4.0
Am I missing something?
regards,
Ranier Vilela
Import Notes
Resolved by subject fallback
Em seg., 15 de ago. de 2022 às 15:34, Ranier Vilela <ranier.vf@gmail.com>
escreveu:
Hi,
I ran this test.
DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;select 1 from long_json_as_text where t::json is null;
head:
Time: 161,741msv5:
Time: 270,298 ms
Sorry too fast, 270,298ms with native memchr.
v5
Time: 208,689 ms
regards,
Ranier Vilela
On Mon, Aug 15, 2022 at 08:33:21PM +0700, John Naylor wrote:
The attached implements the above, more or less, using new pg_lfind8()
and pg_lfind8_le(), which in turn are based on helper functions that
act on a single vector. The pg_lfind* functions have regression tests,
but I haven't done the same for json yet. I went the extra step to use
bit-twiddling for non-SSE builds using uint64 as a "vector", which
still gives a pretty good boost (test below, min of 3):
Looks pretty reasonable to me.
+#ifdef USE_SSE2 + chunk = _mm_loadu_si128((const __m128i *) &base[i]); +#else + memcpy(&chunk, &base[i], sizeof(chunk)); +#endif /* USE_SSE2 */
+#ifdef USE_SSE2 + chunk = _mm_loadu_si128((const __m128i *) &base[i]); +#else + memcpy(&chunk, &base[i], sizeof(chunk)); +#endif /* USE_SSE2 */
Perhaps there should be a macro or inline function for loading a vector so
that these USE_SSE2 checks can be abstracted away, too.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Tue, Aug 16, 2022 at 4:23 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Aug 15, 2022 at 08:33:21PM +0700, John Naylor wrote:
+#ifdef USE_SSE2 + chunk = _mm_loadu_si128((const __m128i *) &base[i]); +#else + memcpy(&chunk, &base[i], sizeof(chunk)); +#endif /* USE_SSE2 */Perhaps there should be a macro or inline function for loading a vector so
that these USE_SSE2 checks can be abstracted away, too.
This is done. Also:
- a complete overhaul of the pg_lfind8* tests
- using a typedef for the vector type
- some refactoring, name changes and other cleanups (a few of these
could also be applied to the 32-byte element path, but that is left
for future work)
TODO: json-specific tests of the new path
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v6-json-lex-string-simd.patchapplication/x-patch; name=v6-json-lex-string-simd.patchDownload
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..efcaedd682 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
#include "common/jsonapi.h"
#include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
#ifndef FRONTEND
#include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
}
else
{
- char *p;
+ char *p = s;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
* Skip to the first byte that requires special handling, so we
* can batch calls to appendBinaryStringInfo.
*/
- for (p = s; p < end; p++)
+ while (p < end - sizeof(Vector) &&
+ !pg_lfind8('\\', (uint8 *) p, sizeof(Vector)) &&
+ !pg_lfind8('"', (uint8 *) p, sizeof(Vector)) &&
+ !pg_lfind8_le(0x1F, (uint8 *) p, sizeof(Vector)))
+ p += sizeof(Vector);
+
+ for (; p < end; p++)
{
if (*p == '\\' || *p == '"')
break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..bb4033c7fc 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,7 @@
/*-------------------------------------------------------------------------
*
* pg_lfind.h
- * Optimized linear search routines.
+ * Optimized linear search routines using SIMD intrinsics where available
*
* Copyright (c) 2022, PostgreSQL Global Development Group
*
@@ -15,6 +15,68 @@
#include "port/simd.h"
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector) - 1);
+ Vector chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector))
+ {
+ vector_load(&chunk, &base[i]);
+ if (vector_eq_byte(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal to
+ * 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector) - 1);
+ Vector chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector))
+ {
+ vector_load(&chunk, &base[i]);
+ if (vector_le_byte(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (base[i] <= key)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pg_lfind32
*
@@ -26,7 +88,6 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
{
uint32 i = 0;
- /* Use SIMD intrinsics where available. */
#ifdef USE_SSE2
/*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..56da8e27cd 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -25,6 +25,141 @@
#if (defined(__x86_64__) || defined(_M_AMD64))
#include <emmintrin.h>
#define USE_SSE2
+typedef __m128i Vector;
+
+#else
+typedef uint64 Vector;
+#endif /* (defined(__x86_64__) || defined(_M_AMD64)) */
+
+
+static inline void vector_load(Vector * v, const uint8 *s);
+static inline Vector vector_broadcast(const uint8 c);
+static inline bool vector_has_zero(const Vector v);
+static inline bool vector_le_byte(const Vector v, const uint8 c);
+static inline bool vector_eq_byte(const Vector v, const uint8 c);
+
+
+/* load a chunk of memory into a register */
+static inline void
+vector_load(Vector * v, const uint8 *s)
+{
+#ifdef USE_SSE2
+ *v = _mm_loadu_si128((const __m128i *) s);
+#else
+ memcpy(v, s, sizeof(Vector));
+#endif
+}
+
+/* return a vector with all bytes set to c */
+static inline Vector
+vector_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi8(c);
+#else
+ return ~UINT64CONST(0) / 0xFF * c;
#endif
+}
+
+/* return true if any bytes in the vector are zero */
+static inline bool
+vector_has_zero(const Vector v)
+{
+#ifdef USE_SSE2
+ return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+ return vector_le_byte(v, 0);
+#endif
+}
+
+static inline bool
+vector_eq_byte(const Vector v, const uint8 c)
+{
+ bool result;
+
+ /* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+ bool assert_result = false;
+ const uint8* s = (const uint8*) &v;
+
+ for (int i = 0; i < sizeof(Vector); i++)
+ {
+ if (s[i] == c)
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+ result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector_broadcast(c)));
+#else
+ /* any bytes in v equal to c will evaluate to zero via XOR */
+ result = vector_has_zero(v ^ vector_broadcast(c));
+#endif /* USE_SSE2 */
+
+#ifdef USE_ASSERT_CHECKING
+ Assert(assert_result == result);
+#endif /* USE_ASSERT_CHECKING */
+
+ return result;
+}
+
+static inline bool
+vector_le_byte(const Vector v, const uint8 c)
+{
+ bool result = false;
+#ifdef USE_SSE2
+ __m128i sub;
+#endif
+
+#if !defined(USE_SSE2) || defined(USE_ASSERT_CHECKING)
+ const uint8* s = (const uint8*) &v;
+#endif
+
+ /* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+ bool assert_result = false;
+
+ for (int i = 0; i < sizeof(Vector); i++)
+ {
+ if (s[i] <= c)
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+ /* use saturating subtraction to find bytes <= c, which will present as NUL bytes in 'sub' */
+ sub = _mm_subs_epu8(v, vector_broadcast(c));
+ result = vector_has_zero(sub);
+#else
+ /* to find bytes <= c, we can use bitwise operations to find bytes < c+1, but it only works if c+1 <= 128 and if the highest bit in v is not set */
+ /* from https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord */
+ if ((int64) v >= 0 && c < 0x80)
+ result = (v - vector_broadcast(c + 1)) & ~v & vector_broadcast(0x80);
+ else
+ {
+ /* one byte at a time */
+ for (int i = 0; i < sizeof(Vector); i++)
+ {
+ if (s[i] <= c)
+ {
+ result = true;
+ break;
+ }
+ }
+ }
+#endif
+
+#ifdef USE_ASSERT_CHECKING
+ Assert(assert_result == result);
+#endif /* USE_ASSERT_CHECKING */
+
+ return result;
+}
#endif /* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
- test_lfind
-------------
+SELECT test_lfind8();
+ test_lfind8
+-------------
+
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le
+----------------
+
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32
+--------------
(1 row)
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
RETURNS pg_catalog.void
AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..18ca9d0018 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -18,10 +18,97 @@
PG_MODULE_MAGIC;
-PG_FUNCTION_INFO_V1(test_lfind);
+/* workhorse for test_lfind8 */
+static void
+test_lfind8_internal(uint8 key)
+{
+ /* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+ uint8 charbuf[LEN_WITH_TAIL];
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search tail to test one-byte-at-a-time path */
+ charbuf[LEN_WITH_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search with vector operations */
+ charbuf[LEN_NO_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+ test_lfind8_internal(0);
+ test_lfind8_internal(1);
+ test_lfind8_internal(0x7F);
+ test_lfind8_internal(0x80);
+ test_lfind8_internal(0xFD);
+
+ PG_RETURN_VOID();
+}
+
+/* workhorse for test_lfind8_le */
+static void
+test_lfind8_le_internal(uint8 key)
+{
+ /* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+ uint8 charbuf[LEN_WITH_TAIL];
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search tail to test one-byte-at-a-time path */
+ charbuf[LEN_WITH_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search with vector operations */
+ charbuf[LEN_NO_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+ test_lfind8_le_internal(0);
+ test_lfind8_le_internal(1);
+ test_lfind8_le_internal(0x7F);
+ test_lfind8_le_internal(0x80);
+ test_lfind8_le_internal(0xFD);
+
+ PG_RETURN_VOID();
+}
+PG_FUNCTION_INFO_V1(test_lfind32);
Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
{
#define TEST_ARRAY_SIZE 135
uint32 test_array[TEST_ARRAY_SIZE] = {0};
On Fri, Aug 19, 2022 at 03:11:36PM +0700, John Naylor wrote:
This is done. Also:
- a complete overhaul of the pg_lfind8* tests
- using a typedef for the vector type
- some refactoring, name changes and other cleanups (a few of these
could also be applied to the 32-byte element path, but that is left
for future work)TODO: json-specific tests of the new path
This looks pretty good to me. Should we rename vector_broadcast() and
vector_has_zero() to indicate that they are working with bytes (e.g.,
vector_broadcast_byte())? We might be able to use vector_broadcast_int()
in the 32-bit functions, and your other vector functions already have a
_byte suffix.
In general, the approach you've taken seems like a decent readability
improvement. I'd be happy to try my hand at adjusting the 32-bit path and
adding ARM versions of all this stuff.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Aug 19, 2022 at 01:42:15PM -0700, Nathan Bossart wrote:
On Fri, Aug 19, 2022 at 03:11:36PM +0700, John Naylor wrote:
This is done. Also:
- a complete overhaul of the pg_lfind8* tests
- using a typedef for the vector type
- some refactoring, name changes and other cleanups (a few of these
could also be applied to the 32-byte element path, but that is left
for future work)TODO: json-specific tests of the new path
This looks pretty good to me. Should we rename vector_broadcast() and
vector_has_zero() to indicate that they are working with bytes (e.g.,
vector_broadcast_byte())? We might be able to use vector_broadcast_int()
in the 32-bit functions, and your other vector functions already have a
_byte suffix.In general, the approach you've taken seems like a decent readability
improvement. I'd be happy to try my hand at adjusting the 32-bit path and
adding ARM versions of all this stuff.
I spent some more time looking at this one, and I had a few ideas that I
thought I'd share. 0001 is your v6 patch with a few additional changes,
including simplying the assertions for readability, splitting out the
Vector type into Vector8 and Vector32 (needed for ARM), and adjusting
pg_lfind32() to use the new tools in simd.h. 0002 adds ARM versions of
everything, which obsoletes the other thread I started [0]/messages/by-id/20220819200829.GA395728@nathanxps13. This is still
a little rough around the edges (e.g., this should probably be more than 2
patches), but I think it helps demonstrate a more comprehensive design than
what I've proposed in the pg_lfind32-for-ARM thread [0]/messages/by-id/20220819200829.GA395728@nathanxps13.
Apologies if I'm stepping on your toes a bit here.
[0]: /messages/by-id/20220819200829.GA395728@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-json_lex_string-SIMD.patchtext/x-diff; charset=us-asciiDownload
From 7dd35c8ffe8e42885586fb16a77b6c3e792c6a6d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 20 Aug 2022 21:14:01 -0700
Subject: [PATCH 1/2] json_lex_string() SIMD
---
src/common/jsonapi.c | 11 +-
src/include/port/pg_lfind.h | 132 ++++++----
src/include/port/simd.h | 227 ++++++++++++++++++
.../test_lfind/expected/test_lfind.out | 18 +-
.../modules/test_lfind/sql/test_lfind.sql | 4 +-
.../modules/test_lfind/test_lfind--1.0.sql | 10 +-
src/test/modules/test_lfind/test_lfind.c | 91 ++++++-
7 files changed, 443 insertions(+), 50 deletions(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..87e1d0b192 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
#include "common/jsonapi.h"
#include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
#ifndef FRONTEND
#include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
}
else
{
- char *p;
+ char *p = s;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
* Skip to the first byte that requires special handling, so we
* can batch calls to appendBinaryStringInfo.
*/
- for (p = s; p < end; p++)
+ while (p < end - sizeof(Vector8) &&
+ !pg_lfind8('\\', (uint8 *) p, sizeof(Vector8)) &&
+ !pg_lfind8('"', (uint8 *) p, sizeof(Vector8)) &&
+ !pg_lfind8_le(0x1F, (uint8 *) p, sizeof(Vector8)))
+ p += sizeof(Vector8);
+
+ for (; p < end; p++)
{
if (*p == '\\' || *p == '"')
break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..def858cbe1 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,7 @@
/*-------------------------------------------------------------------------
*
* pg_lfind.h
- * Optimized linear search routines.
+ * Optimized linear search routines using SIMD intrinsics where available
*
* Copyright (c) 2022, PostgreSQL Global Development Group
*
@@ -15,6 +15,68 @@
#include "port/simd.h"
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ vector8_load(&chunk, &base[i]);
+ if (vector8_eq(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal to
+ * 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ vector8_load(&chunk, &base[i]);
+ if (vector8_le(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (base[i] <= key)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pg_lfind32
*
@@ -24,59 +86,50 @@
static inline bool
pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
{
+ bool result = false;
uint32 i = 0;
+#ifdef USE_ASSERT_CHECKING
+ size_t nelem_for_asserts = nelem;
+#endif
- /* Use SIMD intrinsics where available. */
#ifdef USE_SSE2
-
/*
* A 16-byte register only has four 4-byte lanes. For better
* instruction-level parallelism, each loop iteration operates on a block
* of four registers. Testing has showed this is ~40% faster than using a
* block of two registers.
*/
- const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
- uint32 iterations = nelem & ~0xF; /* round down to multiple of 16 */
+ const Vector32 keys = vector32_broadcast(key); /* load 4 copies of key */
+ uint32 tail_idx = nelem & ~0xF; /* round down to multiple of 16 */
-#if defined(USE_ASSERT_CHECKING)
- bool assert_result = false;
-
- /* pre-compute the result for assert checking */
- for (i = 0; i < nelem; i++)
+ for (i = 0; i < tail_idx; i += 16)
{
- if (key == base[i])
- {
- assert_result = true;
- break;
- }
- }
-#endif
+ Vector32 vals1, vals2, vals3, vals4,
+ result1, result2, result3, result4,
+ tmp1, tmp2, result;
- for (i = 0; i < iterations; i += 16)
- {
/* load the next block into 4 registers holding 4 values each */
- const __m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
- const __m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
- const __m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
- const __m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
+ vector32_load(&vals1, &base[i]);
+ vector32_load(&vals2, &base[i + 4]);
+ vector32_load(&vals3, &base[i + 8]);
+ vector32_load(&vals4, &base[i + 12]);
/* compare each value to the key */
- const __m128i result1 = _mm_cmpeq_epi32(keys, vals1);
- const __m128i result2 = _mm_cmpeq_epi32(keys, vals2);
- const __m128i result3 = _mm_cmpeq_epi32(keys, vals3);
- const __m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+ result1 = vector32_veq(keys, vals1);
+ result2 = vector32_veq(keys, vals2);
+ result3 = vector32_veq(keys, vals3);
+ result4 = vector32_veq(keys, vals4);
/* combine the results into a single variable */
- const __m128i tmp1 = _mm_or_si128(result1, result2);
- const __m128i tmp2 = _mm_or_si128(result3, result4);
- const __m128i result = _mm_or_si128(tmp1, tmp2);
+ tmp1 = vector32_vor(result1, result2);
+ tmp2 = vector32_vor(result3, result4);
+ result = vector32_vor(tmp1, tmp2);
/* see if there was a match */
if (_mm_movemask_epi8(result) != 0)
{
-#if defined(USE_ASSERT_CHECKING)
- Assert(assert_result == true);
-#endif
+ Assert(lfind(&key, base, &nelem_for_asserts, sizeof(uint32),
+ uint32_cmp_eq));
return true;
}
}
@@ -87,17 +140,14 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
{
if (key == base[i])
{
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
- Assert(assert_result == true);
-#endif
- return true;
+ result = true;
+ break;
}
}
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
- Assert(assert_result == false);
-#endif
- return false;
+ Assert(result == (lfind(&key, base, &nelem_for_asserts, sizeof(uint32),
+ uint32_cmp_eq) != NULL));
+ return result;
}
#endif /* PG_LFIND_H */
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..4dda87f3dd 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -13,6 +13,8 @@
#ifndef SIMD_H
#define SIMD_H
+#include "utils/elog.h"
+
/*
* SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
* that compilers targeting this architecture understand SSE2 intrinsics.
@@ -25,6 +27,231 @@
#if (defined(__x86_64__) || defined(_M_AMD64))
#include <emmintrin.h>
#define USE_SSE2
+typedef __m128i Vector8;
+typedef __m128i Vector32;
+
+/*
+ * If no SIMD instructions are available, we emulate specialized vector
+ * operations using uint64.
+ */
+#else
+typedef uint64 Vector8;
+typedef uint64 Vector32;
+#endif
+
+
+static inline void vector8_load(Vector8 *v, const uint8 *s);
+static inline void vector32_load(Vector32 *v, const uint32 *s);
+static inline Vector8 vector8_broadcast(const uint8 c);
+static inline Vector32 vector32_broadcast(const uint32 c);
+static inline bool vector8_has_zero(const Vector8 v);
+static inline bool vector8_le(const Vector8 v, const uint8 c);
+static inline bool vector8_eq(const Vector8 v, const uint8 c);
+static inline Vector32 vector32_veq(const Vector32 v1, const Vector32 v2);
+static inline Vector32 vector32_vor(const Vector32 v1, const Vector32 v2);
+
+
+/*
+ * Stuff for assert-enabled builds.
+ */
+#ifdef USE_ASSERT_CHECKING
+
+#include <search.h>
+
+static size_t nelem_vector8 = sizeof(Vector8) / sizeof(uint8);
+
+static int
+uint8_cmp_eq(const void *key, const void *elem)
+{
+ uint8 k = *((const uint8 *) key);
+ uint8 e = *((const uint8 *) elem);
+
+ if (k < e)
+ return -1;
+ if (k > e)
+ return 1;
+ return 0;
+}
+
+static int
+uint32_cmp_eq(const void *key, const void *elem)
+{
+ uint32 k = *((const uint32 *) key);
+ uint32 e = *((const uint32 *) elem);
+
+ if (k < e)
+ return -1;
+ if (k > e)
+ return 1;
+ return 0;
+}
+
+static int
+uint8_cmp_le(const void *key, const void *elem)
+{
+ uint8 k = *((const uint8 *) key);
+ uint8 e = *((const uint8 *) elem);
+
+ /*
+ * This is counterintuitive. We want lfind() to report success if it finds
+ * an element <= the key, so we need to return 0 any time the key is >= the
+ * current element.
+ */
+ if (k >= e)
+ return 0;
+ return -1;
+}
+
+#endif /* USE_ASSERT_CHECKING */
+
+
+/*
+ * Functions for loading a chunk of memory into a vector.
+ */
+
+static inline void
+vector8_load(Vector8 *v, const uint8 *s)
+{
+#ifdef USE_SSE2
+ *v = _mm_loadu_si128((const __m128i *) s);
+#else
+ memcpy(v, s, sizeof(Vector8));
+#endif
+}
+
+static inline void
+vector32_load(Vector32 *v, const uint32 *s)
+{
+#ifdef USE_SSE2
+ *v = _mm_loadu_si128((const __m128i *) s);
+#else
+ elog(ERROR, "vector32() without SIMD not implemented");
+ pg_unreachable();
+#endif
+}
+
+
+/*
+ * Functions for creating a vector with all elements set to the same value.
+ */
+
+static inline Vector8
+vector8_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi8(c);
+#else
+ return ~UINT64CONST(0) / 0xFF * c;
+#endif
+}
+
+static inline Vector32
+vector32_broadcast(const uint32 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi32(c);
+#else
+ elog(ERROR, "vector32_broadcast() without SIMD not implemented");
+ pg_unreachable();
+#endif
+}
+
+
+/*
+ * Functions for comparing vector elements to a given value.
+ */
+
+static inline bool
+vector8_has_zero(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+ return vector8_le(v, 0);
+#endif
+}
+
+static inline bool
+vector8_eq(const Vector8 v, const uint8 c)
+{
+ bool result;
+
+#ifdef USE_SSE2
+ result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector8_broadcast(c)));
+#else
+ /* any bytes in v equal to c will evaluate to zero via XOR */
+ result = vector8_has_zero(v ^ vector8_broadcast(c));
+#endif /* USE_SSE2 */
+
+ Assert(result == (lfind(&c, &v, &nelem_vector8, sizeof(uint8),
+ uint8_cmp_eq) != NULL));
+ return result;
+}
+
+static inline Vector32
+vector32_veq(const Vector32 v1, const Vector32 v2)
+{
+#ifdef USE_SSE2
+ return _mm_cmpeq_epi32(v1, v2);
+#else
+ elog(ERROR, "vector32_veq() without SIMD not implemented");
+ pg_unreachable();
+#endif
+}
+
+static inline bool
+vector8_le(const Vector8 v, const uint8 c)
+{
+ bool result = false;
+
+#ifdef USE_SSE2
+ /*
+ * Use saturating subtraction to find bytes <= c, which will present as
+ * NUL bytes in 'sub'.
+ */
+ __m128i sub = _mm_subs_epu8(v, vector8_broadcast(c));
+ result = vector8_has_zero(sub);
+#else
+ /*
+ * To find bytes <= c, we can use bitwise operations to find bytes < c + 1,
+ * but it only works if c + 1 <= 128 and if the highest bit in v is not set
+ * (from https://graphics.stanford.edu/~seander/bithacks.html).
+ */
+ if ((int64) v >= 0 && c < 0x80)
+ result = (v - vector8_broadcast(c + 1)) & ~v & vector8_broadcast(0x80);
+ else
+ {
+ /* one byte at a time */
+ for (int i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] <= c)
+ {
+ result = true;
+ break;
+ }
+ }
+ }
+#endif
+
+ Assert(result == (lfind(&c, &v, &nelem_vector8, sizeof(uint8),
+ uint8_cmp_le) != NULL));
+ return result;
+}
+
+
+/*
+ * Functions for bitwise operations on vectors.
+ */
+
+static inline Vector32
+vector32_vor(const Vector32 v1, const Vector32 v2)
+{
+#ifdef USE_SSE2
+ return _mm_or_si128(v1, v2);
+#else
+ elog(ERROR, "vector32_vor() without SIMD not implemented");
+ pg_unreachable();
#endif
+}
#endif /* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
- test_lfind
-------------
+SELECT test_lfind8();
+ test_lfind8
+-------------
+
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le
+----------------
+
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32
+--------------
(1 row)
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
RETURNS pg_catalog.void
AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..efe6b60bc5 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -18,10 +18,97 @@
PG_MODULE_MAGIC;
-PG_FUNCTION_INFO_V1(test_lfind);
+/* workhorse for test_lfind8 */
+static void
+test_lfind8_internal(uint8 key)
+{
+ /* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector8))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+ uint8 charbuf[LEN_WITH_TAIL];
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search tail to test one-byte-at-a-time path */
+ charbuf[LEN_WITH_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search with vector operations */
+ charbuf[LEN_NO_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+ test_lfind8_internal(0);
+ test_lfind8_internal(1);
+ test_lfind8_internal(0x7F);
+ test_lfind8_internal(0x80);
+ test_lfind8_internal(0xFD);
+
+ PG_RETURN_VOID();
+}
+
+/* workhorse for test_lfind8_le */
+static void
+test_lfind8_le_internal(uint8 key)
+{
+ /* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector8))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+ uint8 charbuf[LEN_WITH_TAIL];
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search tail to test one-byte-at-a-time path */
+ charbuf[LEN_WITH_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search with vector operations */
+ charbuf[LEN_NO_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+ test_lfind8_le_internal(0);
+ test_lfind8_le_internal(1);
+ test_lfind8_le_internal(0x7F);
+ test_lfind8_le_internal(0x80);
+ test_lfind8_le_internal(0xFD);
+
+ PG_RETURN_VOID();
+}
+PG_FUNCTION_INFO_V1(test_lfind32);
Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
{
#define TEST_ARRAY_SIZE 135
uint32 test_array[TEST_ARRAY_SIZE] = {0};
--
2.25.1
0002-ARM-SIMD-support.patchtext/x-diff; charset=us-asciiDownload
From e2fa356adfc41a13ea18da839b53353b774739cc Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 20 Aug 2022 21:44:21 -0700
Subject: [PATCH 2/2] ARM SIMD support
---
src/include/port/pg_lfind.h | 6 +++++-
src/include/port/simd.h | 28 ++++++++++++++++++++++++++++
2 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index def858cbe1..04f09200b4 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -92,7 +92,7 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
size_t nelem_for_asserts = nelem;
#endif
-#ifdef USE_SSE2
+#if defined(USE_SSE2) || defined(__ARM_NEON)
/*
* A 16-byte register only has four 4-byte lanes. For better
* instruction-level parallelism, each loop iteration operates on a block
@@ -126,7 +126,11 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
result = vector32_vor(tmp1, tmp2);
/* see if there was a match */
+#ifdef USE_SSE2
if (_mm_movemask_epi8(result) != 0)
+#elif defined(__ARM_NEON)
+ if (vmaxvq_u32(result) != 0)
+#endif
{
Assert(lfind(&key, base, &nelem_for_asserts, sizeof(uint32),
uint32_cmp_eq));
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 4dda87f3dd..4f40b31e2a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -30,6 +30,15 @@
typedef __m128i Vector8;
typedef __m128i Vector32;
+/*
+ * Include arm_neon.h if the compiler is targeting an architecture that
+ * supports ARM Advanced SIMD (Neon) intrinsics.
+ */
+#elif defined(__ARM_NEON)
+#include <arm_neon.h>
+typedef uint8x16_t Vector8;
+typedef uint32x4_t Vector32;
+
/*
* If no SIMD instructions are available, we emulate specialized vector
* operations using uint64.
@@ -114,6 +123,8 @@ vector8_load(Vector8 *v, const uint8 *s)
{
#ifdef USE_SSE2
*v = _mm_loadu_si128((const __m128i *) s);
+#elif defined(__ARM_NEON)
+ *v = vld1q_u8(s);
#else
memcpy(v, s, sizeof(Vector8));
#endif
@@ -124,6 +135,8 @@ vector32_load(Vector32 *v, const uint32 *s)
{
#ifdef USE_SSE2
*v = _mm_loadu_si128((const __m128i *) s);
+#elif defined(__ARM_NEON)
+ *v = vld1q_u32(s);
#else
elog(ERROR, "vector32() without SIMD not implemented");
pg_unreachable();
@@ -140,6 +153,8 @@ vector8_broadcast(const uint8 c)
{
#ifdef USE_SSE2
return _mm_set1_epi8(c);
+#elif defined(__ARM_NEON)
+ return vdupq_n_u8(c);
#else
return ~UINT64CONST(0) / 0xFF * c;
#endif
@@ -150,6 +165,8 @@ vector32_broadcast(const uint32 c)
{
#ifdef USE_SSE2
return _mm_set1_epi32(c);
+#elif defined(__ARM_NEON)
+ return vdupq_n_u32(c);
#else
elog(ERROR, "vector32_broadcast() without SIMD not implemented");
pg_unreachable();
@@ -166,6 +183,8 @@ vector8_has_zero(const Vector8 v)
{
#ifdef USE_SSE2
return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#elif defined(__ARM_NEON)
+ return vmaxvq_u8(vceqzq_u8(v));
#else
return vector8_le(v, 0);
#endif
@@ -178,6 +197,8 @@ vector8_eq(const Vector8 v, const uint8 c)
#ifdef USE_SSE2
result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector8_broadcast(c)));
+#elif defined(__ARM_NEON)
+ result = vmaxvq_u8(vceqq_u8(v, vector8_broadcast(c)));
#else
/* any bytes in v equal to c will evaluate to zero via XOR */
result = vector8_has_zero(v ^ vector8_broadcast(c));
@@ -193,6 +214,8 @@ vector32_veq(const Vector32 v1, const Vector32 v2)
{
#ifdef USE_SSE2
return _mm_cmpeq_epi32(v1, v2);
+#elif defined(__ARM_NEON)
+ return vceqq_u32(v1, v2);
#else
elog(ERROR, "vector32_veq() without SIMD not implemented");
pg_unreachable();
@@ -211,6 +234,9 @@ vector8_le(const Vector8 v, const uint8 c)
*/
__m128i sub = _mm_subs_epu8(v, vector8_broadcast(c));
result = vector8_has_zero(sub);
+#elif __ARM_NEON
+ uint8x16_t sub = vqsubq_u8(v, vector8_broadcast(c));
+ result = vector8_has_zero(sub);
#else
/*
* To find bytes <= c, we can use bitwise operations to find bytes < c + 1,
@@ -248,6 +274,8 @@ vector32_vor(const Vector32 v1, const Vector32 v2)
{
#ifdef USE_SSE2
return _mm_or_si128(v1, v2);
+#elif defined(__ARM_NEON)
+ return vorrq_u32(v1, v2);
#else
elog(ERROR, "vector32_vor() without SIMD not implemented");
pg_unreachable();
--
2.25.1
On Sun, Aug 21, 2022 at 12:47 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
I spent some more time looking at this one, and I had a few ideas that I
thought I'd share. 0001 is your v6 patch with a few additional changes,
including simplying the assertions for readability, splitting out the
Vector type into Vector8 and Vector32 (needed for ARM), and adjusting
pg_lfind32() to use the new tools in simd.h. 0002 adds ARM versions of
everything, which obsoletes the other thread I started [0]. This is still
a little rough around the edges (e.g., this should probably be more than 2
patches), but I think it helps demonstrate a more comprehensive design than
what I've proposed in the pg_lfind32-for-ARM thread [0].Apologies if I'm stepping on your toes a bit here.
Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.
Not a full review, but on a brief look:
- I like the idea of simplifying the assertions, but I can't get
behind using platform lfind to do it, since it has a different API,
requires new functions we don't need, and possibly has portability
issues. A simple for-loop is better for assertions.
- A runtime elog is not appropriate for a compile time check -- use
#error instead.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:
Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.
Yeah, I'll remove those to keep this thread focused.
- I like the idea of simplifying the assertions, but I can't get
behind using platform lfind to do it, since it has a different API,
requires new functions we don't need, and possibly has portability
issues. A simple for-loop is better for assertions.
My main goal with this was improving readability, which is likely possible
without lfind(). I'll see what I can do.
- A runtime elog is not appropriate for a compile time check -- use
#error instead.
Will do.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Mon, Aug 22, 2022 at 02:22:29PM -0700, Nathan Bossart wrote:
On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:
Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.Yeah, I'll remove those to keep this thread focused.
Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
Attachments:
v8-0001-json_lex_string-SIMD.patchtext/x-diff; charset=us-asciiDownload
From dd7b3c3b0567d77a7f171d9cb4b6a8d3ee30ceec Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 20 Aug 2022 21:14:01 -0700
Subject: [PATCH v8 1/1] json_lex_string() SIMD
---
src/common/jsonapi.c | 11 +-
src/include/port/pg_lfind.h | 65 +++++++-
src/include/port/simd.h | 143 ++++++++++++++++++
.../test_lfind/expected/test_lfind.out | 18 ++-
.../modules/test_lfind/sql/test_lfind.sql | 4 +-
.../modules/test_lfind/test_lfind--1.0.sql | 10 +-
src/test/modules/test_lfind/test_lfind.c | 91 ++++++++++-
7 files changed, 332 insertions(+), 10 deletions(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..87e1d0b192 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
#include "common/jsonapi.h"
#include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
#ifndef FRONTEND
#include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
}
else
{
- char *p;
+ char *p = s;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
* Skip to the first byte that requires special handling, so we
* can batch calls to appendBinaryStringInfo.
*/
- for (p = s; p < end; p++)
+ while (p < end - sizeof(Vector8) &&
+ !pg_lfind8('\\', (uint8 *) p, sizeof(Vector8)) &&
+ !pg_lfind8('"', (uint8 *) p, sizeof(Vector8)) &&
+ !pg_lfind8_le(0x1F, (uint8 *) p, sizeof(Vector8)))
+ p += sizeof(Vector8);
+
+ for (; p < end; p++)
{
if (*p == '\\' || *p == '"')
break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..86342d71d8 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,8 @@
/*-------------------------------------------------------------------------
*
* pg_lfind.h
- * Optimized linear search routines.
+ * Optimized linear search routines using SIMD intrinsics where
+ * available.
*
* Copyright (c) 2022, PostgreSQL Global Development Group
*
@@ -15,6 +16,68 @@
#include "port/simd.h"
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ vector8_load(&chunk, &base[i]);
+ if (vector8_eq(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal to
+ * 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ vector8_load(&chunk, &base[i]);
+ if (vector8_le(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (base[i] <= key)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pg_lfind32
*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..23f6269169 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -25,6 +25,149 @@
#if (defined(__x86_64__) || defined(_M_AMD64))
#include <emmintrin.h>
#define USE_SSE2
+typedef __m128i Vector8;
+
+/*
+ * If no SIMD instructions are available, we emulate specialized vector
+ * operations using uint64.
+ */
+#else
+typedef uint64 Vector8;
+#endif
+
+
+static inline void vector8_load(Vector8 *v, const uint8 *s);
+static inline Vector8 vector8_broadcast(const uint8 c);
+static inline bool vector8_has_zero(const Vector8 v);
+static inline bool vector8_eq(const Vector8 v, const uint8 c);
+static inline bool vector8_le(const Vector8 v, const uint8 c);
+
+
+/*
+ * Functions for loading a chunk of memory into a vector.
+ */
+
+static inline void
+vector8_load(Vector8 *v, const uint8 *s)
+{
+#ifdef USE_SSE2
+ *v = _mm_loadu_si128((const __m128i *) s);
+#else
+ memcpy(v, s, sizeof(Vector8));
+#endif
+}
+
+
+/*
+ * Functions for creating a vector with all elements set to the same value.
+ */
+
+static inline Vector8
+vector8_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi8(c);
+#else
+ return ~UINT64CONST(0) / 0xFF * c;
#endif
+}
+
+
+/*
+ * Functions for comparing vector elements to a given value.
+ */
+
+static inline bool
+vector8_has_zero(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+ return vector8_le(v, 0);
+#endif
+}
+
+static inline bool
+vector8_eq(const Vector8 v, const uint8 c)
+{
+ bool result;
+
+ /* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+ bool assert_result = false;
+ for (int i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] == c)
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+ result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector8_broadcast(c)));
+#else
+ /* any bytes in v equal to c will evaluate to zero via XOR */
+ result = vector8_has_zero(v ^ vector8_broadcast(c));
+#endif
+
+ Assert(assert_result == result);
+ return result;
+}
+
+static inline bool
+vector8_le(const Vector8 v, const uint8 c)
+{
+ bool result = false;
+#ifdef USE_SSE2
+ __m128i sub;
+#endif
+
+ /* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+ bool assert_result = false;
+ for (int i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] <= c)
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+ /*
+ * Use saturating subtraction to find bytes <= c, which will present as
+ * NUL bytes in 'sub'.
+ */
+ sub = _mm_subs_epu8(v, vector8_broadcast(c));
+ result = vector8_has_zero(sub);
+#else
+ /*
+ * To find bytes <= c, we can use bitwise operations to find bytes < c + 1,
+ * but it only works if c + 1 <= 128 and if the highest bit in v is not set
+ * (from https://graphics.stanford.edu/~seander/bithacks.html).
+ */
+ if ((int64) v >= 0 && c < 0x80)
+ result = (v - vector8_broadcast(c + 1)) & ~v & vector8_broadcast(0x80);
+ else
+ {
+ /* one byte at a time */
+ for (int i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] <= c)
+ {
+ result = true;
+ break;
+ }
+ }
+ }
+#endif
+
+ Assert(assert_result == result);
+ return result;
+}
#endif /* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
- test_lfind
-------------
+SELECT test_lfind8();
+ test_lfind8
+-------------
+
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le
+----------------
+
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32
+--------------
(1 row)
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
RETURNS pg_catalog.void
AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..efe6b60bc5 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -18,10 +18,97 @@
PG_MODULE_MAGIC;
-PG_FUNCTION_INFO_V1(test_lfind);
+/* workhorse for test_lfind8 */
+static void
+test_lfind8_internal(uint8 key)
+{
+ /* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector8))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+ uint8 charbuf[LEN_WITH_TAIL];
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search tail to test one-byte-at-a-time path */
+ charbuf[LEN_WITH_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search with vector operations */
+ charbuf[LEN_NO_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+ test_lfind8_internal(0);
+ test_lfind8_internal(1);
+ test_lfind8_internal(0x7F);
+ test_lfind8_internal(0x80);
+ test_lfind8_internal(0xFD);
+
+ PG_RETURN_VOID();
+}
+
+/* workhorse for test_lfind8_le */
+static void
+test_lfind8_le_internal(uint8 key)
+{
+ /* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector8))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+ uint8 charbuf[LEN_WITH_TAIL];
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search tail to test one-byte-at-a-time path */
+ charbuf[LEN_WITH_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_WITH_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, LEN_WITH_TAIL);
+ /* search with vector operations */
+ charbuf[LEN_NO_TAIL - 1] = key;
+ if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_NO_TAIL))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+ test_lfind8_le_internal(0);
+ test_lfind8_le_internal(1);
+ test_lfind8_le_internal(0x7F);
+ test_lfind8_le_internal(0x80);
+ test_lfind8_le_internal(0xFD);
+
+ PG_RETURN_VOID();
+}
+PG_FUNCTION_INFO_V1(test_lfind32);
Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
{
#define TEST_ARRAY_SIZE 135
uint32 test_array[TEST_ARRAY_SIZE] = {0};
--
2.25.1
On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Mon, Aug 22, 2022 at 02:22:29PM -0700, Nathan Bossart wrote:
On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:
Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.Yeah, I'll remove those to keep this thread focused.
Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.
LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon. Possible
bikeshedding: Functions like vector8_eq() might be misunderstood as
comparing two vectors, but here we are comparing each lane with a
scalar. I wonder if vector8_eq_scalar() et al might be more clear.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Aug 23, 2022 at 01:03:03PM +0700, John Naylor wrote:
On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart
Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon. Possible
bikeshedding: Functions like vector8_eq() might be misunderstood as
comparing two vectors, but here we are comparing each lane with a
scalar. I wonder if vector8_eq_scalar() et al might be more clear.
Good point. I had used vector32_veq() to denote vector comparison, which
would extend to something like vector8_seq(). But that doesn't seem
descriptive enough. It might be worth considering vector8_contains() or
vector8_has() as well. I don't really have an opinion, but if I had to
pick something, I guess I'd choose vector8_contains().
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 24, 2022 at 12:15 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Tue, Aug 23, 2022 at 01:03:03PM +0700, John Naylor wrote:
On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart
Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon. Possible
bikeshedding: Functions like vector8_eq() might be misunderstood as
comparing two vectors, but here we are comparing each lane with a
scalar. I wonder if vector8_eq_scalar() et al might be more clear.Good point. I had used vector32_veq() to denote vector comparison, which
would extend to something like vector8_seq(). But that doesn't seem
descriptive enough. It might be worth considering vector8_contains() or
vector8_has() as well. I don't really have an opinion, but if I had to
pick something, I guess I'd choose vector8_contains().
It seems "scalar" would be a bad choice since it already means
(confusingly) operating on the least significant element of a vector.
I'm thinking of *_has and *_has_le, matching the already existing in
the earlier patch *_has_zero.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Aug 24, 2022 at 11:59:25AM +0700, John Naylor wrote:
It seems "scalar" would be a bad choice since it already means
(confusingly) operating on the least significant element of a vector.
I'm thinking of *_has and *_has_le, matching the already existing in
the earlier patch *_has_zero.
That seems reasonable to me.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 24, 2022 at 11:56 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Wed, Aug 24, 2022 at 11:59:25AM +0700, John Naylor wrote:
It seems "scalar" would be a bad choice since it already means
(confusingly) operating on the least significant element of a vector.
I'm thinking of *_has and *_has_le, matching the already existing in
the earlier patch *_has_zero.That seems reasonable to me.
Okay, done that way, also in v9:
- a convenience macro in the test suite which is handy now and can be
used for 32-bit element tests if we like
- more tests
- pgindent and some additional comment smithing
- split out the json piece for a later commit
- For the following comment, pgindent will put spaced operands on a
separate line which is not great for readability. and our other
reference to the Stanford bithacks page keeps the in-page link, and I
see no reason to exclude it -- if it goes missing, the whole page will
still load. So I put back those two details.
+ * To find bytes <= c, we can use bitwise operations to find
bytes < c+1,
+ * but it only works if c+1 <= 128 and if the highest bit in v
is not set.
+ * Adapted from
+ * https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
I think I'll go ahead and commit 0001 in a couple days pending further comments.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v9-0002-Speed-up-json_lex_string-via-vector-operations.patchtext/x-patch; charset=US-ASCII; name=v9-0002-Speed-up-json_lex_string-via-vector-operations.patchDownload
From 606c14de59a68ed88bff75f7250c37ad082fbd9f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 25 Aug 2022 13:32:28 +0700
Subject: [PATCH v9 2/2] Speed up json_lex_string via vector operations
TODO: tests
---
src/common/jsonapi.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..87e1d0b192 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
#include "common/jsonapi.h"
#include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
#ifndef FRONTEND
#include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
}
else
{
- char *p;
+ char *p = s;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
* Skip to the first byte that requires special handling, so we
* can batch calls to appendBinaryStringInfo.
*/
- for (p = s; p < end; p++)
+ while (p < end - sizeof(Vector8) &&
+ !pg_lfind8('\\', (uint8 *) p, sizeof(Vector8)) &&
+ !pg_lfind8('"', (uint8 *) p, sizeof(Vector8)) &&
+ !pg_lfind8_le(0x1F, (uint8 *) p, sizeof(Vector8)))
+ p += sizeof(Vector8);
+
+ for (; p < end; p++)
{
if (*p == '\\' || *p == '"')
break;
--
2.36.1
v9-0001-Add-optimized-functions-for-linear-search-within-.patchtext/x-patch; charset=US-ASCII; name=v9-0001-Add-optimized-functions-for-linear-search-within-.patchDownload
From fe7d8f2471e2e0b37ccdc1a0a2d7fc5bb93af7d9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 20 Aug 2022 21:14:01 -0700
Subject: [PATCH v9 1/2] Add optimized functions for linear search within byte
arrays
In similar vein to b6ef167564, add pg_lfind8() and pg_lfind8_le()
to search for bytes equal or less-than-or-equal to a given byte,
respectively. To abstract away platform details, add helper functions
and typedefs to simd.h.
John Naylor and Nathan Bossart, per suggestion from Andres Freund
Discussion: https://www.postgresql.org/message-id/CAFBsxsGzaaGLF%3DNuq61iRXTyspbO9rOjhSqFN%3DV6ozzmta5mXg%40mail.gmail.com
---
src/include/port/pg_lfind.h | 68 +++++++-
src/include/port/simd.h | 154 ++++++++++++++++++
.../test_lfind/expected/test_lfind.out | 18 +-
.../modules/test_lfind/sql/test_lfind.sql | 4 +-
.../modules/test_lfind/test_lfind--1.0.sql | 10 +-
src/test/modules/test_lfind/test_lfind.c | 100 +++++++++++-
6 files changed, 345 insertions(+), 9 deletions(-)
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..a4e13dffec 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,8 @@
/*-------------------------------------------------------------------------
*
* pg_lfind.h
- * Optimized linear search routines.
+ * Optimized linear search routines using SIMD intrinsics where
+ * available.
*
* Copyright (c) 2022, PostgreSQL Global Development Group
*
@@ -15,6 +16,70 @@
#include "port/simd.h"
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ vector8_load(&chunk, &base[i]);
+ if (vector8_has(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal to
+ * 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ vector8_load(&chunk, &base[i]);
+ if (vector8_has_le(chunk, key))
+ return true;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (base[i] <= key)
+ return true;
+ }
+
+ return false;
+}
+
/*
* pg_lfind32
*
@@ -26,7 +91,6 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
{
uint32 i = 0;
- /* Use SIMD intrinsics where available. */
#ifdef USE_SSE2
/*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..56df989094 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -13,6 +13,12 @@
#ifndef SIMD_H
#define SIMD_H
+/*
+ * Note: VectorN in this file refers to a register where the element operands
+ * are N bits wide. The vector width is platform-specific, so users that care
+ * about that will need to inspect "sizeof(VectorN)".
+ */
+
/*
* SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
* that compilers targeting this architecture understand SSE2 intrinsics.
@@ -25,6 +31,154 @@
#if (defined(__x86_64__) || defined(_M_AMD64))
#include <emmintrin.h>
#define USE_SSE2
+typedef __m128i Vector8;
+
+#else
+/*
+ * If no SIMD instructions are available, we can in some cases emulate vector
+ * operations using bitwise operations on unsigned integers.
+ */
+typedef uint64 Vector8;
#endif
+
+static inline void vector8_load(Vector8 *v, const uint8 *s);
+static inline Vector8 vector8_broadcast(const uint8 c);
+static inline bool vector8_has_zero(const Vector8 v);
+static inline bool vector8_has(const Vector8 v, const uint8 c);
+static inline bool vector8_has_le(const Vector8 v, const uint8 c);
+
+
+/*
+ * Functions for loading a chunk of memory into a vector.
+ */
+
+static inline void
+vector8_load(Vector8 *v, const uint8 *s)
+{
+#ifdef USE_SSE2
+ *v = _mm_loadu_si128((const __m128i *) s);
+#else
+ memcpy(v, s, sizeof(Vector8));
+#endif
+}
+
+
+/*
+ * Functions for creating a vector with all elements set to the same value.
+ */
+
+static inline Vector8
+vector8_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+ return _mm_set1_epi8(c);
+#else
+ return ~UINT64CONST(0) / 0xFF * c;
+#endif
+}
+
+
+/*
+ * Functions for comparing vector elements to a given value.
+ */
+
+static inline bool
+vector8_has_zero(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+ return vector8_has_le(v, 0);
+#endif
+}
+
+static inline bool
+vector8_has(const Vector8 v, const uint8 c)
+{
+ bool result;
+
+ /* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+ bool assert_result = false;
+
+ for (int i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] == c)
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+ result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector8_broadcast(c)));
+#else
+ /* any bytes in v equal to c will evaluate to zero via XOR */
+ result = vector8_has_zero(v ^ vector8_broadcast(c));
+#endif
+
+ Assert(assert_result == result);
+ return result;
+}
+
+static inline bool
+vector8_has_le(const Vector8 v, const uint8 c)
+{
+ bool result = false;
+#ifdef USE_SSE2
+ __m128i sub;
+#endif
+
+ /* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+ bool assert_result = false;
+
+ for (int i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] <= c)
+ {
+ assert_result = true;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+
+ /*
+ * Use saturating subtraction to find bytes <= c, which will present as
+ * NUL bytes in 'sub'.
+ */
+ sub = _mm_subs_epu8(v, vector8_broadcast(c));
+ result = vector8_has_zero(sub);
+#else
+
+ /*
+ * To find bytes <= c, we can use bitwise operations to find bytes < c+1,
+ * but it only works if c+1 <= 128 and if the highest bit in v is not set.
+ * Adapted from
+ * https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
+ */
+ if ((int64) v >= 0 && c < 0x80)
+ result = (v - vector8_broadcast(c + 1)) & ~v & vector8_broadcast(0x80);
+ else
+ {
+ /* one byte at a time */
+ for (int i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] <= c)
+ {
+ result = true;
+ break;
+ }
+ }
+ }
+#endif
+
+ Assert(assert_result == result);
+ return result;
+}
+
#endif /* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
- test_lfind
-------------
+SELECT test_lfind8();
+ test_lfind8
+-------------
+
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le
+----------------
+
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32
+--------------
(1 row)
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
-- the operations complete without crashing or hanging and that none of their
-- internal sanity tests fail.
--
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
-- complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
RETURNS pg_catalog.void
AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..e0c905aad3 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -16,12 +16,108 @@
#include "fmgr.h"
#include "port/pg_lfind.h"
+/*
+ * Convenience macros for testing both vector and scalar operations. The 2x
+ * factor is to make sure iteration works
+ */
+#define LEN_NO_TAIL(vectortype) (2 * sizeof(vectortype))
+#define LEN_WITH_TAIL(vectortype) (LEN_NO_TAIL(vectortype) + 3)
+
PG_MODULE_MAGIC;
-PG_FUNCTION_INFO_V1(test_lfind);
+/* workhorse for test_lfind8 */
+static void
+test_lfind8_internal(uint8 key)
+{
+ uint8 charbuf[LEN_WITH_TAIL(Vector8)];
+ const int len_no_tail = LEN_NO_TAIL(Vector8);
+ const int len_with_tail = LEN_WITH_TAIL(Vector8);
+
+ memset(charbuf, 0xFF, len_with_tail);
+ /* search tail to test one-byte-at-a-time path */
+ charbuf[len_with_tail - 1] = key;
+ if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_with_tail))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8(key, charbuf, len_with_tail))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_with_tail))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, len_with_tail);
+ /* search with vector operations */
+ charbuf[len_no_tail - 1] = key;
+ if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_no_tail))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8(key, charbuf, len_no_tail))
+ elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_no_tail))
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+ test_lfind8_internal(0);
+ test_lfind8_internal(1);
+ test_lfind8_internal(0x7F);
+ test_lfind8_internal(0x80);
+ test_lfind8_internal(0x81);
+ test_lfind8_internal(0xFD);
+ test_lfind8_internal(0xFE);
+ test_lfind8_internal(0xFF);
+
+ PG_RETURN_VOID();
+}
+
+/* workhorse for test_lfind8_le */
+static void
+test_lfind8_le_internal(uint8 key)
+{
+ uint8 charbuf[LEN_WITH_TAIL(Vector8)];
+ const int len_no_tail = LEN_NO_TAIL(Vector8);
+ const int len_with_tail = LEN_WITH_TAIL(Vector8);
+
+ memset(charbuf, 0xFF, len_with_tail);
+ /* search tail to test one-byte-at-a-time path */
+ charbuf[len_with_tail - 1] = key;
+ if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, len_with_tail))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8_le(key, charbuf, len_with_tail))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, len_with_tail))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, len_with_tail);
+ /* search with vector operations */
+ charbuf[len_no_tail - 1] = key;
+ if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, len_no_tail))
+ elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+ if (key < 0xFF && !pg_lfind8_le(key, charbuf, len_no_tail))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+ if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, len_no_tail))
+ elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+ test_lfind8_le_internal(0);
+ test_lfind8_le_internal(1);
+ test_lfind8_le_internal(0x7F);
+ test_lfind8_le_internal(0x80);
+ test_lfind8_le_internal(0x81);
+ test_lfind8_le_internal(0xFD);
+ test_lfind8_le_internal(0xFE);
+ test_lfind8_le_internal(0xFF);
+
+ PG_RETURN_VOID();
+}
+PG_FUNCTION_INFO_V1(test_lfind32);
Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
{
#define TEST_ARRAY_SIZE 135
uint32 test_array[TEST_ARRAY_SIZE] = {0};
--
2.36.1
On Thu, Aug 25, 2022 at 01:35:45PM +0700, John Naylor wrote:
- For the following comment, pgindent will put spaced operands on a
separate line which is not great for readability. and our other
reference to the Stanford bithacks page keeps the in-page link, and I
see no reason to exclude it -- if it goes missing, the whole page will
still load. So I put back those two details.+ * To find bytes <= c, we can use bitwise operations to find bytes < c+1, + * but it only works if c+1 <= 128 and if the highest bit in v is not set. + * Adapted from + * https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
This was just unnecessary fiddling on my part, sorry about that.
+test_lfind8_internal(uint8 key) +{ + uint8 charbuf[LEN_WITH_TAIL(Vector8)]; + const int len_no_tail = LEN_NO_TAIL(Vector8); + const int len_with_tail = LEN_WITH_TAIL(Vector8); + + memset(charbuf, 0xFF, len_with_tail); + /* search tail to test one-byte-at-a-time path */ + charbuf[len_with_tail - 1] = key; + if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_with_tail)) + elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1); + if (key < 0xFF && !pg_lfind8(key, charbuf, len_with_tail)) + elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key); + if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_with_tail)) + elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1); + + memset(charbuf, 0xFF, len_with_tail); + /* search with vector operations */ + charbuf[len_no_tail - 1] = key; + if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_no_tail)) + elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1); + if (key < 0xFF && !pg_lfind8(key, charbuf, len_no_tail)) + elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key); + if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_no_tail)) + elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1); +}
nitpick: Shouldn't the elog() calls use "==" instead of "<=" for this one?
Otherwise, 0001 looks good to me.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Aug 26, 2022 at 10:14 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
+test_lfind8_internal(uint8 key)
[...]
+ elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1); +}nitpick: Shouldn't the elog() calls use "==" instead of "<=" for this one?
Good catch, will fix.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Aug 25, 2022 at 1:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I think I'll go ahead and commit 0001 in a couple days pending further comments.
Pushed with Nathan's correction and some cosmetic rearrangements.
--
John Naylor
EDB: http://www.enterprisedb.com
I wonder why copyable_characters_length is not reset after flushing.
It's not necessary because of the break statement right after. But this part
of the code was refactored away in John's improved patch that's actually
merged:
https://github.com/postgres/postgres/commit/3838fa269c15706df2b85ce2d6af8aacd5611655
On Tue, Aug 23, 2022 at 1:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon.
Here's the final piece. I debated how many tests to add and decided it
was probably enough to add one each for checking quotes and
backslashes in the fast path. There is one cosmetic change in the
code: Before, the vectorized less-equal check compared to 0x1F, but
the byte-wise path did so with < 32. I made them both "less-equal 31"
for consistency. I'll commit this by the end of the week unless anyone
has a better idea about testing.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v10-0001-Optimize-JSON-lexing-of-long-strings.patchtext/x-patch; charset=US-ASCII; name=v10-0001-Optimize-JSON-lexing-of-long-strings.patchDownload
From f1159dcc2044edb107e0dfeae5e8f3c7feb10cd2 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 31 Aug 2022 10:39:17 +0700
Subject: [PATCH v10] Optimize JSON lexing of long strings
Use optimized linear search when looking ahead for end quotes,
backslashes, and non-printable characters. This results in nearly 40%
faster JSON parsing on x86-64 when most values are long strings, and
all platforms should see some improvement.
Reviewed by Andres Freund and Nathan Bossart
Discussion: https://www.postgresql.org/message-id/CAFBsxsGhaR2KQ5eisaK%3D6Vm60t%3DaxhD8Ckj1qFoCH1pktZi%2B2w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com
---
src/common/jsonapi.c | 13 ++++++++++---
src/test/regress/expected/json.out | 13 +++++++++++++
src/test/regress/sql/json.sql | 5 +++++
3 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..cfc025749c 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
#include "common/jsonapi.h"
#include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
#ifndef FRONTEND
#include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
}
else
{
- char *p;
+ char *p = s;
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;
@@ -853,11 +854,17 @@ json_lex_string(JsonLexContext *lex)
* Skip to the first byte that requires special handling, so we
* can batch calls to appendBinaryStringInfo.
*/
- for (p = s; p < end; p++)
+ while (p < end - sizeof(Vector8) &&
+ !pg_lfind8('\\', (uint8 *) p, sizeof(Vector8)) &&
+ !pg_lfind8('"', (uint8 *) p, sizeof(Vector8)) &&
+ !pg_lfind8_le(31, (uint8 *) p, sizeof(Vector8)))
+ p += sizeof(Vector8);
+
+ for (; p < end; p++)
{
if (*p == '\\' || *p == '"')
break;
- else if ((unsigned char) *p < 32)
+ else if ((unsigned char) *p <= 31)
{
/* Per RFC4627, these characters MUST be escaped. */
/*
diff --git a/src/test/regress/expected/json.out b/src/test/regress/expected/json.out
index e9d6e9faf2..cb181226e9 100644
--- a/src/test/regress/expected/json.out
+++ b/src/test/regress/expected/json.out
@@ -42,6 +42,19 @@ LINE 1: SELECT '"\v"'::json;
^
DETAIL: Escape sequence "\v" is invalid.
CONTEXT: JSON data, line 1: "\v...
+-- Check fast path for longer strings (at least 16 bytes long)
+SELECT ('"'||repeat('.', 12)||'abc"')::json; -- OK
+ json
+-------------------
+ "............abc"
+(1 row)
+
+SELECT ('"'||repeat('.', 12)||'abc\n"')::json; -- OK, legal escapes
+ json
+---------------------
+ "............abc\n"
+(1 row)
+
-- see json_encoding test for input with unicode escapes
-- Numbers.
SELECT '1'::json; -- OK
diff --git a/src/test/regress/sql/json.sql b/src/test/regress/sql/json.sql
index e366c6f51b..589e0cea36 100644
--- a/src/test/regress/sql/json.sql
+++ b/src/test/regress/sql/json.sql
@@ -7,6 +7,11 @@ SELECT '"abc
def"'::json; -- ERROR, unescaped newline in string constant
SELECT '"\n\"\\"'::json; -- OK, legal escapes
SELECT '"\v"'::json; -- ERROR, not a valid JSON escape
+
+-- Check fast path for longer strings (at least 16 bytes long)
+SELECT ('"'||repeat('.', 12)||'abc"')::json; -- OK
+SELECT ('"'||repeat('.', 12)||'abc\n"')::json; -- OK, legal escapes
+
-- see json_encoding test for input with unicode escapes
-- Numbers.
--
2.36.1
On Wed, Aug 31, 2022 at 10:50:39AM +0700, John Naylor wrote:
Here's the final piece. I debated how many tests to add and decided it
was probably enough to add one each for checking quotes and
backslashes in the fast path. There is one cosmetic change in the
code: Before, the vectorized less-equal check compared to 0x1F, but
the byte-wise path did so with < 32. I made them both "less-equal 31"
for consistency. I'll commit this by the end of the week unless anyone
has a better idea about testing.
LGTM
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Wed, Aug 31, 2022 at 11:17 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
On Wed, Aug 31, 2022 at 10:50:39AM +0700, John Naylor wrote:
Here's the final piece. I debated how many tests to add and decided it
was probably enough to add one each for checking quotes and
backslashes in the fast path. There is one cosmetic change in the
code: Before, the vectorized less-equal check compared to 0x1F, but
the byte-wise path did so with < 32. I made them both "less-equal 31"
for consistency. I'll commit this by the end of the week unless anyone
has a better idea about testing.LGTM
Pushed, thanks for looking!
--
John Naylor
EDB: http://www.enterprisedb.com