[PATCH] Optimize json_lex_string by batching character copying

Started by Jelte Fennemaover 3 years ago37 messages

Jelte.Fennema@microsoft.com

over 3 years ago

1 attachment(s)

When parsing JSON strings need to be converted from the JSON string
format to a c-style string. A simple copy of the buffer does not suffice
because of the various escape sequences that that JSON supports. Because
of this our JSON parser wrote characters into the c-style string buffer
one at a time.

However, this is only necessary for these escaped sequences that map to
another character. This patch changes the behaviour for non-escaped
characters. These are now copied in batches instead of one character at
a time.

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB. The JSONB values
are a JSON object with a single level. They contain a few small keys and
values, but one very big value that's a stringified JSON blob. So this
JSON blob contains a relatively high number of escape characters, to
escape all the " characters. This change improves performance for
workload this workload on my machine by ~18% (going from 1m24s to 1m09s).

@Andres, there was indeed some low hanging fruit.
@John Naylor, SSE2 indeed sounds like another nice improvement. I'll leave
that to you.

Attachments:

0001-Optimize-json_lex_string-by-batching-character-copie.patchapplication/octet-stream; name=0001-Optimize-json_lex_string-by-batching-character-copie.patchDownload

From 8f8aa638ee2b6dfa85fa8bc0ec5788f44768e92f Mon Sep 17 00:00:00 2001
From: Jelte Fennema <github-tech@jeltef.nl>
Date: Fri, 24 Jun 2022 09:19:13 +0200
Subject: [PATCH] Optimize json_lex_string by batching character copies

When parsing JSON strings need to be converted from the JSON string
format to a c-style string. A simple copy of the buffer does not suffice
because of the various escape sequences that that JSON supports. Because
of this our JSON parser wrote characters into the c-style string buffer
one at a time.

However, this is only necessary for these escaped sequences that map to
another character. This patch changes the behaviour for non-escaped
characters. These are now copied in batches instead of one character at
a time.

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB. The JSONB values
are a JSON object with a single level. They contain a few small keys and
values, but one very big value that's a stringified JSON blob. So this
JSON blob contains a relatively high number of escape characters, to
escape all the " characters. This change improves performance for
workload this workload on my machine by ~18% (going from 1m24s to 1m09s).
---
 src/common/jsonapi.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 98e4ef0942..219ecb9df9 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -674,6 +674,7 @@ json_lex_string(JsonLexContext *lex)
 	char	   *s;
 	int			len;
 	int			hi_surrogate = -1;
+	int			copyable_characters_length = 0;
 
 	if (lex->strval != NULL)
 		resetStringInfo(lex->strval);
@@ -692,7 +693,18 @@ json_lex_string(JsonLexContext *lex)
 			return JSON_INVALID_TOKEN;
 		}
 		else if (*s == '"')
+		{
+			if (copyable_characters_length)
+			{
+				/* flush copyable characters */
+				appendBinaryStringInfo(
+									   lex->strval,
+									   s - copyable_characters_length,
+									   copyable_characters_length);
+
+			}
 			break;
+		}
 		else if ((unsigned char) *s < 32)
 		{
 			/* Per RFC4627, these characters MUST be escaped. */
@@ -702,6 +714,16 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else if (*s == '\\')
 		{
+			if (copyable_characters_length)
+			{
+				/* flush copyable characters */
+				appendBinaryStringInfo(
+									   lex->strval,
+									   s - copyable_characters_length,
+									   copyable_characters_length);
+				copyable_characters_length = 0;
+
+			}
 			/* OK, we have an escape character. */
 			s++;
 			len++;
@@ -818,7 +840,7 @@ json_lex_string(JsonLexContext *lex)
 					case '"':
 					case '\\':
 					case '/':
-						appendStringInfoChar(lex->strval, *s);
+						copyable_characters_length++;
 						break;
 					case 'b':
 						appendStringInfoChar(lex->strval, '\b');
@@ -861,7 +883,7 @@ json_lex_string(JsonLexContext *lex)
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
-			appendStringInfoChar(lex->strval, *s);
+			copyable_characters_length++;
 		}
 	}
 
-- 
2.34.1

Zhihong Yu

zyu@yugabyte.com

over 3 years ago

In reply to: Jelte Fennema (#1)

Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,
Looking at the patch,

+           if (copyable_characters_length)
+           {
+               /* flush copyable characters */
+               appendBinaryStringInfo(
+                                      lex->strval,
+                                      s - copyable_characters_length,
+                                      copyable_characters_length);
+
+           }
            break;

I wonder why copyable_characters_length is not reset after flushing.

Cheers

Import Notes

Resolved by subject fallback

Jelte Fennema

Jelte.Fennema@microsoft.com

over 3 years ago

In reply to: Zhihong Yu (#2)

Re: [PATCH] Optimize json_lex_string by batching character copying

+           if (copyable_characters_length)
+           {
+               /* flush copyable characters */
+               appendBinaryStringInfo(
+                                      lex->strval,
+                                      s - copyable_characters_length,
+                                      copyable_characters_length);
+
+           }
            break;

I wonder why copyable_characters_length is not reset after flushing.

It breaks from the loop right after. So copyable_characters_length isn't used
again and thus resetting is not necessary. But I agree this could use a comment.

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Jelte Fennema (#1)

Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB.

This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
ontop of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.

It's useful for things like this to include a way for others to use the same
benchmark...

I tried your patch with:

DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;

SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.

I think your patch doesn't quite go far enough - we still end up looping for
each character, have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop to see until where the string
last isn't faster. That then obviously could be SIMDified.

Separately, it seems pretty awful efficiency / code density wise to have the
NULL checks for ->strval all over. Might be worth forcing json_lex() and
json_lex_string() to be inlined, with a constant parameter deciding whether
->strval is expected. That'd likely be enough to get the compiler specialize
the code for us.

Might also be worth to maintain ->strval using appendBinaryStringInfoNT().

Greetings,

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Andres Freund (#4)

1 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-06-24 17:18:10 -0700, Andres Freund wrote:

On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB.

This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
ontop of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.

It's useful for things like this to include a way for others to use the same
benchmark...

I tried your patch with:

DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;

SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.

I think your patch doesn't quite go far enough - we still end up looping for
each character, have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop to see until where the string
last isn't faster. That then obviously could be SIMDified.

A naive implementation (attached) of that gets me down to 706ms.

Greetings,

Andres Freund

Attachments:

json-lex-string-lookahead-speed.difftext/x-diff; charset=us-asciiDownload

diff --git i/src/common/jsonapi.c w/src/common/jsonapi.c
index 98e4ef09426..63d92c66aec 100644
--- i/src/common/jsonapi.c
+++ w/src/common/jsonapi.c
@@ -858,10 +858,25 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else if (lex->strval != NULL)
 		{
+			size_t chunklen = 1;
+
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
-			appendStringInfoChar(lex->strval, *s);
+			while (len + chunklen < lex->input_length)
+			{
+				char next = *(s + chunklen);
+
+				if (next == '\\' || next == '"' || (unsigned char) next < 32)
+					break;
+
+				chunklen++;
+			}
+
+			appendBinaryStringInfo(lex->strval, s, chunklen);
+
+			s += (chunklen - 1);
+			len += (chunklen - 1);
 		}
 	}

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Andres Freund (#5)

3 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Sat, Jun 25, 2022 at 8:05 AM Andres Freund <andres@anarazel.de> wrote:

I tried your patch with:

DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;

SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.

I think your patch doesn't quite go far enough - we still end up looping for
each character, have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop to see until where the string
last isn't faster. That then obviously could be SIMDified.

A naive implementation (attached) of that gets me down to 706ms.

Taking this a step further, I modified json_lex and json_lex_string to
use a const end pointer instead of maintaining the length (0001). The
observed speedup is small enough that it might not be real, but the
code is simpler this way, and it makes 0002 and 0003 easier to reason
about. Then I modified your patch to do the same (0002). Hackish SSE2
support is in 0003.

To exercise the SIMD code a bit, I added a second test:

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 10) from pg_description pd
)
SELECT (select json_agg(row_to_json(long)) as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

SELECT 1 FROM long_json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

With this, I get (turbo disabled, min of 3):

short test:
master: 769ms
0001: 751ms
0002: 703ms
0003: 701ms

long test;
master: 939ms
0001: 883ms
0002: 537ms
0003: 439ms

I think 0001/2 are mostly in committable shape.

With 0003, I'd want to make the portability check a bit nicer and more
centralized. I'm thinking of modifying the CRC check to report that
the host cpu/compiler understands SSE4.2 x86 intrinsics, and then the
compile time SSE2 check can piggyback on top of that without a runtime
check. This is conceptually easy but a bit of work to not look like a
hack (which probably means the ARM CRC check should look more generic
somehow...). The regression tests will likely need some work as well.

Separately, it seems pretty awful efficiency / code density wise to have the
NULL checks for ->strval all over. Might be worth forcing json_lex() and
json_lex_string() to be inlined, with a constant parameter deciding whether
->strval is expected. That'd likely be enough to get the compiler specialize
the code for us.

I had a look at this but it's a bit more invasive than I want to
devote time to at this point.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v3-0003-Use-vectorized-lookahead-in-json_lex_string-on-x8.patchtext/x-patch; charset=US-ASCII; name=v3-0003-Use-vectorized-lookahead-in-json_lex_string-on-x8.patchDownload

From 1484b7541b9ed3f0c476e31f99340d36895bc629 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 5 Jul 2022 18:57:37 +0700
Subject: [PATCH v3 3/3] Use vectorized lookahead in json_lex_string on x86

---
 src/common/jsonapi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index ad4858c623..978f18b129 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -24,6 +24,12 @@
 #include "miscadmin.h"
 #endif
 
+/*  WIP: put somewhere sensible and consider removing CRC from the names */
+#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
+#include <nmmintrin.h>
+#define USE_SSE2
+#endif
+
 /*
  * The context of the parser is maintained by the recursive descent
  * mechanism, but is passed explicitly to the error reporting routine
@@ -851,12 +857,54 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else if (lex->strval != NULL)
 		{
+#ifdef USE_SSE2
+			__m128i		block,
+						has_backslash,
+						has_doublequote,
+						control,
+						has_control,
+						error_cum = _mm_setzero_si128();
+			const		__m128i backslash = _mm_set1_epi8('\\');
+			const		__m128i doublequote = _mm_set1_epi8('"');
+			const		__m128i max_control = _mm_set1_epi8(0x1F);
+#endif
 			/* start lookahead at next byte */
 			char	   *p = s + 1;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
+#ifdef USE_SSE2
+			while (p < end - sizeof(__m128i))
+			{
+				block = _mm_loadu_si128((const __m128i *) p);
+
+				/* direct comparison to quotes and backslashes */
+				has_backslash = _mm_cmpeq_epi8(block, backslash);
+				has_doublequote = _mm_cmpeq_epi8(block, doublequote);
+
+				/*
+				 * use saturation arithmetic to check for <= highest control
+				 * char
+				 */
+				control = _mm_subs_epu8(block, max_control);
+				has_control = _mm_cmpeq_epi8(control, _mm_setzero_si128());
+
+				/*
+				 * set bits in error_cum where the corresponding lanes in has_*
+				 * are set
+				 */
+				error_cum = _mm_or_si128(error_cum, has_backslash);
+				error_cum = _mm_or_si128(error_cum, has_doublequote);
+				error_cum = _mm_or_si128(error_cum, has_control);
+
+				if (_mm_movemask_epi8(error_cum))
+					break;
+
+				p += sizeof(__m128i);
+			}
+#endif							/* USE_SSE2 */
+
 			while (p < end)
 			{
 				if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
-- 
2.36.1

v3-0002-Build-json-strings-in-larger-chunks-during-lexing.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Build-json-strings-in-larger-chunks-during-lexing.patchDownload

From 03d3be083ba60d272e848b3ec96db3c6d47a3b06 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 1 Jul 2022 17:28:20 +0700
Subject: [PATCH v3 2/3] Build json strings in larger chunks during lexing

Add lookahead loop to json_lex_string. This way, we can batch calls
to appendBinaryStringInfo.

Jelte Fennema and Andres Freund, with some adjustments by me

Discussion:
https://www.postgresql.org/message-id/CAGECzQQuXbies_nKgSiYifZUjBk6nOf2%3DTSXqRjj2BhUh8CTeA%40mail.gmail.com
Discussion:
https://www.postgresql.org/message-id/flat/PR3PR83MB0476F098CBCF68AF7A1CA89FF7B49@PR3PR83MB0476.EURPRD83.prod.outlook.com
---
 src/common/jsonapi.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index eeedc0645a..ad4858c623 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -851,10 +851,26 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else if (lex->strval != NULL)
 		{
+			/* start lookahead at next byte */
+			char	   *p = s + 1;
+
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
-			appendStringInfoChar(lex->strval, *s);
+			while (p < end)
+			{
+				if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
+					break;
+				p++;
+			}
+
+			appendBinaryStringInfo(lex->strval, s, p - s);
+
+			/*
+			 * s will be incremented at the top of the loop, so set it to just
+			 * behind our lookahead position
+			 */
+			s = p - 1;
 		}
 	}
 
-- 
2.36.1

v3-0001-Simplify-json-lexing-state.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Simplify-json-lexing-state.patchDownload

From 3d8b39ff1c1a4abf9effc45323b293b62551770a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 08:35:24 +0700
Subject: [PATCH v3 1/3] Simplify json lexing state

Instead of updating the length as we go, use a const pointer to end of
the input, which we know already at the start
---
 src/common/jsonapi.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 98e4ef0942..eeedc0645a 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -519,26 +519,23 @@ JsonParseErrorType
 json_lex(JsonLexContext *lex)
 {
 	char	   *s;
-	int			len;
+	char	   *const end = lex->input + lex->input_length;
 	JsonParseErrorType result;
 
 	/* Skip leading whitespace. */
 	s = lex->token_terminator;
-	len = s - lex->input;
-	while (len < lex->input_length &&
-		   (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
+	while (s < end && (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
 	{
 		if (*s++ == '\n')
 		{
 			++lex->line_number;
 			lex->line_start = s;
 		}
-		len++;
 	}
 	lex->token_start = s;
 
 	/* Determine token type. */
-	if (len >= lex->input_length)
+	if (s >= end)
 	{
 		lex->token_start = NULL;
 		lex->prev_token_terminator = lex->token_terminator;
@@ -623,7 +620,7 @@ json_lex(JsonLexContext *lex)
 					 * the whole word as an unexpected token, rather than just
 					 * some unintuitive prefix thereof.
 					 */
-					for (p = s; p - s < lex->input_length - len && JSON_ALPHANUMERIC_CHAR(*p); p++)
+					for (p = s; p < end && JSON_ALPHANUMERIC_CHAR(*p); p++)
 						 /* skip */ ;
 
 					/*
@@ -672,7 +669,7 @@ static inline JsonParseErrorType
 json_lex_string(JsonLexContext *lex)
 {
 	char	   *s;
-	int			len;
+	char	   *const end = lex->input + lex->input_length;
 	int			hi_surrogate = -1;
 
 	if (lex->strval != NULL)
@@ -680,13 +677,11 @@ json_lex_string(JsonLexContext *lex)
 
 	Assert(lex->input_length > 0);
 	s = lex->token_start;
-	len = lex->token_start - lex->input;
 	for (;;)
 	{
 		s++;
-		len++;
 		/* Premature end of the string. */
-		if (len >= lex->input_length)
+		if (s >= end)
 		{
 			lex->token_terminator = s;
 			return JSON_INVALID_TOKEN;
@@ -704,8 +699,7 @@ json_lex_string(JsonLexContext *lex)
 		{
 			/* OK, we have an escape character. */
 			s++;
-			len++;
-			if (len >= lex->input_length)
+			if (s >= end)
 			{
 				lex->token_terminator = s;
 				return JSON_INVALID_TOKEN;
@@ -718,8 +712,7 @@ json_lex_string(JsonLexContext *lex)
 				for (i = 1; i <= 4; i++)
 				{
 					s++;
-					len++;
-					if (len >= lex->input_length)
+					if (s >= end)
 					{
 						lex->token_terminator = s;
 						return JSON_INVALID_TOKEN;
-- 
2.36.1

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: John Naylor (#6)

Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-07-06 12:10:20 +0700, John Naylor wrote:

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index eeedc0645a..ad4858c623 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -851,10 +851,26 @@ json_lex_string(JsonLexContext *lex)
}
else if (lex->strval != NULL)
{
+			/* start lookahead at next byte */
+			char	   *p = s + 1;
+
if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;

-			appendStringInfoChar(lex->strval, *s);
+			while (p < end)
+			{
+				if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
+					break;
+				p++;
+			}
+
+			appendBinaryStringInfo(lex->strval, s, p - s);
+
+			/*
+			 * s will be incremented at the top of the loop, so set it to just
+			 * behind our lookahead position
+			 */
+			s = p - 1;
}
}

--
2.36.1

I think before committing something along those lines we should make the
relevant bits also be applicable when ->strval is NULL, as several functions
use that (notably json_in IIRC). Afaics we'd just need to move the strval
check to be around the appendBinaryStringInfo(). And it should simplify the
function, because some of the relevant code is duplicated outside as well...

Greetings,

Andres Freund

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Andres Freund (#7)

4 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Wed, Jul 6, 2022 at 12:18 PM Andres Freund <andres@anarazel.de> wrote:

I think before committing something along those lines we should make the
relevant bits also be applicable when ->strval is NULL, as several functions
use that (notably json_in IIRC). Afaics we'd just need to move the strval
check to be around the appendBinaryStringInfo().

That makes sense and is easy.

And it should simplify the
function, because some of the relevant code is duplicated outside as well...

Not sure how far to take this, but I put the returnable paths inside
the "other" path, so only backslash will go back to the top.

Both the above changes are split into a new 0003 patch for easier
review, but in the end will likely be squashed with 0002.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v4-0001-Simplify-json-lexing-state.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Simplify-json-lexing-state.patchDownload

From 3d8b39ff1c1a4abf9effc45323b293b62551770a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 08:35:24 +0700
Subject: [PATCH v4 1/4] Simplify json lexing state

Instead of updating the length as we go, use a const pointer to end of
the input, which we know already at the start
---
 src/common/jsonapi.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 98e4ef0942..eeedc0645a 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -519,26 +519,23 @@ JsonParseErrorType
 json_lex(JsonLexContext *lex)
 {
 	char	   *s;
-	int			len;
+	char	   *const end = lex->input + lex->input_length;
 	JsonParseErrorType result;
 
 	/* Skip leading whitespace. */
 	s = lex->token_terminator;
-	len = s - lex->input;
-	while (len < lex->input_length &&
-		   (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
+	while (s < end && (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r'))
 	{
 		if (*s++ == '\n')
 		{
 			++lex->line_number;
 			lex->line_start = s;
 		}
-		len++;
 	}
 	lex->token_start = s;
 
 	/* Determine token type. */
-	if (len >= lex->input_length)
+	if (s >= end)
 	{
 		lex->token_start = NULL;
 		lex->prev_token_terminator = lex->token_terminator;
@@ -623,7 +620,7 @@ json_lex(JsonLexContext *lex)
 					 * the whole word as an unexpected token, rather than just
 					 * some unintuitive prefix thereof.
 					 */
-					for (p = s; p - s < lex->input_length - len && JSON_ALPHANUMERIC_CHAR(*p); p++)
+					for (p = s; p < end && JSON_ALPHANUMERIC_CHAR(*p); p++)
 						 /* skip */ ;
 
 					/*
@@ -672,7 +669,7 @@ static inline JsonParseErrorType
 json_lex_string(JsonLexContext *lex)
 {
 	char	   *s;
-	int			len;
+	char	   *const end = lex->input + lex->input_length;
 	int			hi_surrogate = -1;
 
 	if (lex->strval != NULL)
@@ -680,13 +677,11 @@ json_lex_string(JsonLexContext *lex)
 
 	Assert(lex->input_length > 0);
 	s = lex->token_start;
-	len = lex->token_start - lex->input;
 	for (;;)
 	{
 		s++;
-		len++;
 		/* Premature end of the string. */
-		if (len >= lex->input_length)
+		if (s >= end)
 		{
 			lex->token_terminator = s;
 			return JSON_INVALID_TOKEN;
@@ -704,8 +699,7 @@ json_lex_string(JsonLexContext *lex)
 		{
 			/* OK, we have an escape character. */
 			s++;
-			len++;
-			if (len >= lex->input_length)
+			if (s >= end)
 			{
 				lex->token_terminator = s;
 				return JSON_INVALID_TOKEN;
@@ -718,8 +712,7 @@ json_lex_string(JsonLexContext *lex)
 				for (i = 1; i <= 4; i++)
 				{
 					s++;
-					len++;
-					if (len >= lex->input_length)
+					if (s >= end)
 					{
 						lex->token_terminator = s;
 						return JSON_INVALID_TOKEN;
-- 
2.36.1

v4-0004-Use-vectorized-lookahead-in-json_lex_string-on-x8.patchtext/x-patch; charset=US-ASCII; name=v4-0004-Use-vectorized-lookahead-in-json_lex_string-on-x8.patchDownload

From 82e13b6bebd85a152ededcfd75495c0c0f642354 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 15:50:09 +0700
Subject: [PATCH v4 4/4] Use vectorized lookahead in json_lex_string on x86

---
 src/common/jsonapi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 81e176ad8d..44e8ed2b2f 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -24,6 +24,12 @@
 #include "miscadmin.h"
 #endif
 
+/*  WIP: put somewhere sensible and consider removing CRC from the names */
+#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
+#include <nmmintrin.h>
+#define USE_SSE2
+#endif
+
 /*
  * The context of the parser is maintained by the recursive descent
  * mechanism, but is passed explicitly to the error reporting routine
@@ -842,12 +848,54 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
+#ifdef USE_SSE2
+			__m128i		block,
+						has_backslash,
+						has_doublequote,
+						control,
+						has_control,
+						error_cum = _mm_setzero_si128();
+			const		__m128i backslash = _mm_set1_epi8('\\');
+			const		__m128i doublequote = _mm_set1_epi8('"');
+			const		__m128i max_control = _mm_set1_epi8(0x1F);
+#endif
 			/* start lookahead at current byte */
 			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
+#ifdef USE_SSE2
+			while (p < end - sizeof(__m128i))
+			{
+				block = _mm_loadu_si128((const __m128i *) p);
+
+				/* direct comparison to quotes and backslashes */
+				has_backslash = _mm_cmpeq_epi8(block, backslash);
+				has_doublequote = _mm_cmpeq_epi8(block, doublequote);
+
+				/*
+				 * use saturation arithmetic to check for <= highest control
+				 * char
+				 */
+				control = _mm_subs_epu8(block, max_control);
+				has_control = _mm_cmpeq_epi8(control, _mm_setzero_si128());
+
+				/*
+				 * set bits in error_cum where the corresponding lanes in has_*
+				 * are set
+				 */
+				error_cum = _mm_or_si128(error_cum, has_backslash);
+				error_cum = _mm_or_si128(error_cum, has_doublequote);
+				error_cum = _mm_or_si128(error_cum, has_control);
+
+				if (_mm_movemask_epi8(error_cum))
+					break;
+
+				p += sizeof(__m128i);
+			}
+#endif							/* USE_SSE2 */
+
 			while (p < end)
 			{
 				if (*p == '\\' || *p == '"')
-- 
2.36.1

v4-0003-Use-lookahead-path-in-json-string-lexing-for-the-.patchtext/x-patch; charset=US-ASCII; name=v4-0003-Use-lookahead-path-in-json-string-lexing-for-the-.patchDownload

From ef486287090daa24d51735ba9fa9585341b6e8ec Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 15:35:33 +0700
Subject: [PATCH v4 3/4] Use lookahead path in json string lexing for the
 non-escape case too

This removes some duplicated code and enables the no-escape path
to be optimized in the same way.

Per suggestion from Andres Freund
---
 src/common/jsonapi.c | 46 +++++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 22 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index ad4858c623..81e176ad8d 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -686,15 +686,6 @@ json_lex_string(JsonLexContext *lex)
 			lex->token_terminator = s;
 			return JSON_INVALID_TOKEN;
 		}
-		else if (*s == '"')
-			break;
-		else if ((unsigned char) *s < 32)
-		{
-			/* Per RFC4627, these characters MUST be escaped. */
-			/* Since *s isn't printable, exclude it from the context string */
-			lex->token_terminator = s;
-			return JSON_ESCAPING_REQUIRED;
-		}
 		else if (*s == '\\')
 		{
 			/* OK, we have an escape character. */
@@ -849,22 +840,41 @@ json_lex_string(JsonLexContext *lex)
 				return JSON_ESCAPING_INVALID;
 			}
 		}
-		else if (lex->strval != NULL)
+		else
 		{
-			/* start lookahead at next byte */
-			char	   *p = s + 1;
+			/* start lookahead at current byte */
+			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
 			while (p < end)
 			{
-				if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
+				if (*p == '\\' || *p == '"')
 					break;
+				else if ((unsigned char) *p < 32)
+				{
+					/* Per RFC4627, these characters MUST be escaped. */
+					/*
+					 * Since *s isn't printable, exclude it from the context
+					 * string
+					 */
+					lex->token_terminator = p;
+					return JSON_ESCAPING_REQUIRED;
+				}
 				p++;
 			}
 
-			appendBinaryStringInfo(lex->strval, s, p - s);
+			if (lex->strval != NULL)
+				appendBinaryStringInfo(lex->strval, s, p - s);
+
+			if (*p == '"')
+			{
+				/* Hooray, we found the end of the string! */
+				lex->prev_token_terminator = lex->token_terminator;
+				lex->token_terminator = p + 1;
+				return JSON_SUCCESS;
+			}
 
 			/*
 			 * s will be incremented at the top of the loop, so set it to just
@@ -873,14 +883,6 @@ json_lex_string(JsonLexContext *lex)
 			s = p - 1;
 		}
 	}
-
-	if (hi_surrogate != -1)
-		return JSON_UNICODE_LOW_SURROGATE;
-
-	/* Hooray, we found the end of the string! */
-	lex->prev_token_terminator = lex->token_terminator;
-	lex->token_terminator = s + 1;
-	return JSON_SUCCESS;
 }
 
 /*
-- 
2.36.1

v4-0002-Build-json-strings-in-larger-chunks-during-lexing.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Build-json-strings-in-larger-chunks-during-lexing.patchDownload

From 03d3be083ba60d272e848b3ec96db3c6d47a3b06 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 1 Jul 2022 17:28:20 +0700
Subject: [PATCH v4 2/4] Build json strings in larger chunks during lexing

Add lookahead loop to json_lex_string. This way, we can batch calls
to appendBinaryStringInfo.

Jelte Fennema and Andres Freund, with some adjustments by me

Discussion:
https://www.postgresql.org/message-id/CAGECzQQuXbies_nKgSiYifZUjBk6nOf2%3DTSXqRjj2BhUh8CTeA%40mail.gmail.com
Discussion:
https://www.postgresql.org/message-id/flat/PR3PR83MB0476F098CBCF68AF7A1CA89FF7B49@PR3PR83MB0476.EURPRD83.prod.outlook.com
---
 src/common/jsonapi.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index eeedc0645a..ad4858c623 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -851,10 +851,26 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else if (lex->strval != NULL)
 		{
+			/* start lookahead at next byte */
+			char	   *p = s + 1;
+
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
-			appendStringInfoChar(lex->strval, *s);
+			while (p < end)
+			{
+				if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
+					break;
+				p++;
+			}
+
+			appendBinaryStringInfo(lex->strval, s, p - s);
+
+			/*
+			 * s will be incremented at the top of the loop, so set it to just
+			 * behind our lookahead position
+			 */
+			s = p - 1;
 		}
 	}
 
-- 
2.36.1

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: John Naylor (#8)

Re: [PATCH] Optimize json_lex_string by batching character copying

I've pushed 0001 (although the email seems to have been swallowed
again), and pending additional comments on 0002 and 0003 I'll squash
and push those next week. 0004 needs some thought on integrating with
symbols we discover during configure.

--
John Naylor
EDB: http://www.enterprisedb.com

#10

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: John Naylor (#9)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Fri, Jul 8, 2022 at 3:06 PM John Naylor <john.naylor@enterprisedb.com> wrote:

I've pushed 0001 (although the email seems to have been swallowed
again), and pending additional comments on 0002 and 0003 I'll squash
and push those next week.

This is done.

0004 needs some thought on integrating with
symbols we discover during configure.

Still needs thought.

--
John Naylor
EDB: http://www.enterprisedb.com

#11

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: John Naylor (#8)

Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-07-06 15:58:44 +0700, John Naylor wrote:

From 82e13b6bebd85a152ededcfd75495c0c0f642354 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 15:50:09 +0700
Subject: [PATCH v4 4/4] Use vectorized lookahead in json_lex_string on x86

---
src/common/jsonapi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 81e176ad8d..44e8ed2b2f 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -24,6 +24,12 @@
#include "miscadmin.h"
#endif

+/*  WIP: put somewhere sensible and consider removing CRC from the names */
+#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
+#include <nmmintrin.h>
+#define USE_SSE2
+#endif
+
/*
* The context of the parser is maintained by the recursive descent
* mechanism, but is passed explicitly to the error reporting routine
@@ -842,12 +848,54 @@ json_lex_string(JsonLexContext *lex)
}
else
{
+#ifdef USE_SSE2
+			__m128i		block,
+						has_backslash,
+						has_doublequote,
+						control,
+						has_control,
+						error_cum = _mm_setzero_si128();
+			const		__m128i backslash = _mm_set1_epi8('\\');
+			const		__m128i doublequote = _mm_set1_epi8('"');
+			const		__m128i max_control = _mm_set1_epi8(0x1F);
+#endif
/* start lookahead at current byte */
char	   *p = s;

if (hi_surrogate != -1)
return JSON_UNICODE_LOW_SURROGATE;

+#ifdef USE_SSE2
+			while (p < end - sizeof(__m128i))
+			{
+				block = _mm_loadu_si128((const __m128i *) p);
+
+				/* direct comparison to quotes and backslashes */
+				has_backslash = _mm_cmpeq_epi8(block, backslash);
+				has_doublequote = _mm_cmpeq_epi8(block, doublequote);
+
+				/*
+				 * use saturation arithmetic to check for <= highest control
+				 * char
+				 */
+				control = _mm_subs_epu8(block, max_control);
+				has_control = _mm_cmpeq_epi8(control, _mm_setzero_si128());
+
+				/*
+				 * set bits in error_cum where the corresponding lanes in has_*
+				 * are set
+				 */
+				error_cum = _mm_or_si128(error_cum, has_backslash);
+				error_cum = _mm_or_si128(error_cum, has_doublequote);
+				error_cum = _mm_or_si128(error_cum, has_control);
+
+				if (_mm_movemask_epi8(error_cum))
+					break;
+
+				p += sizeof(__m128i);
+			}
+#endif							/* USE_SSE2 */
+
while (p < end)
{
if (*p == '\\' || *p == '"')
-- 
2.36.1

I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.

Greetings,

Andres Freund

#12

Tom Lane

tgl@sss.pgh.pa.us

over 3 years ago

In reply to: Andres Freund (#11)

Re: [PATCH] Optimize json_lex_string by batching character copying

Andres Freund <andres@anarazel.de> writes:

I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.

IMO, it's pretty unreadable *now*, for lack of comments about what it's
doing and why.

regards, tom lane

#13

Andres Freund

andres@anarazel.de

over 3 years ago

In reply to: Tom Lane (#12)

Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-07-11 11:53:26 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.

IMO, it's pretty unreadable *now*, for lack of comments about what it's
doing and why.

Yea, that could at least be addressed by adding comments. But even with a
bunch of comments, it'd still be pretty hard to read once the events above
have happened (and they seem kind of inevitable).

I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that could
be useful.

Greetings,

Andres Freund

#14

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Andres Freund (#13)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Mon, Jul 11, 2022 at 11:07 PM Andres Freund <andres@anarazel.de> wrote:

I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that

could

be useful.

In simple cases, we could possibly abstract the entire loop. With this
particular case, I imagine the most approachable way to write the loop
would be a bit more low-level:

while (p < end - VECTOR_WIDTH &&
!vector_has_byte(p, '\\') &&
!vector_has_byte(p, '"') &&
vector_min_byte(p, 0x20))
p += VECTOR_WIDTH

I wonder if we'd lose a bit of efficiency here by not accumulating set bits
from the three conditions, but it's worth trying.
--
John Naylor
EDB: http://www.enterprisedb.com

#15

Andrew Dunstan

andrew@dunslane.net

over 3 years ago

In reply to: Andres Freund (#4)

Re: [PATCH] Optimize json_lex_string by batching character copying

On 2022-06-24 Fr 20:18, Andres Freund wrote:

Hi,

On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly JSONB values of ~15kB.

This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
ontop of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.

It's useful for things like this to include a way for others to use the same
benchmark...

I tried your patch with:

DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;

SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

I've been doing some other work related to json parsing and John
referred me to this. But it's actually not the best test for pure json
parsing - casting to jsonb involves some extra work besides pure
parsing. Instead I've been using this query with the same table, which
should be almost all json parsing:

select 1 from json_as_text where t::json is null;

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#16

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: John Naylor (#14)

1 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

I wrote

On Mon, Jul 11, 2022 at 11:07 PM Andres Freund <andres@anarazel.de> wrote:

I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that could
be useful.

In simple cases, we could possibly abstract the entire loop. With this particular case, I imagine the most approachable way to write the loop would be a bit more low-level:

while (p < end - VECTOR_WIDTH &&
!vector_has_byte(p, '\\') &&
!vector_has_byte(p, '"') &&
vector_min_byte(p, 0x20))
p += VECTOR_WIDTH

I wonder if we'd lose a bit of efficiency here by not accumulating set bits from the three conditions, but it's worth trying.

The attached implements the above, more or less, using new pg_lfind8()
and pg_lfind8_le(), which in turn are based on helper functions that
act on a single vector. The pg_lfind* functions have regression tests,
but I haven't done the same for json yet. I went the extra step to use
bit-twiddling for non-SSE builds using uint64 as a "vector", which
still gives a pretty good boost (test below, min of 3):

master:
356ms

v5:
259ms

v5 disable SSE:
288ms

It still needs a bit of polishing and testing, but I think it's a good
workout for abstracting SIMD out of the way.

-------------
test:

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null; -- from Andrew upthread

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v5-json-lex-string-simd-ops.patchtext/x-patch; charset=US-ASCII; name=v5-json-lex-string-simd-ops.patchDownload

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..1f9eb134e8 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
 
 #include "common/jsonapi.h"
 #include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
 
 #ifndef FRONTEND
 #include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
-			char	   *p;
+			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
 			 * Skip to the first byte that requires special handling, so we
 			 * can batch calls to appendBinaryStringInfo.
 			 */
-			for (p = s; p < end; p++)
+			while (p < end - SIZEOF_VECTOR &&
+					!pg_lfind8('\\', (unsigned char*) p, SIZEOF_VECTOR) &&
+					!pg_lfind8('"', (unsigned char*) p, SIZEOF_VECTOR) &&
+					!pg_lfind8_le(0x1F, (unsigned char*) p, SIZEOF_VECTOR))
+				p += SIZEOF_VECTOR;
+
+			for (; p < end; p++)
 			{
 				if (*p == '\\' || *p == '"')
 					break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..e090ea6ac3 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pg_lfind.h
- *	  Optimized linear search routines.
+ *	  Optimized linear search routines using SIMD intrinsics where available
  *
  * Copyright (c) 2022, PostgreSQL Global Development Group
  *
@@ -15,6 +15,76 @@
 
 #include "port/simd.h"
 
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		iterations = nelem & ~(SIZEOF_VECTOR - 1);
+	TYPEOF_VECTOR		chunk;
+
+	for (i = 0; i < iterations; i += SIZEOF_VECTOR)
+	{
+#ifdef USE_SSE2
+		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+		memcpy(&chunk, &base[i], sizeof(chunk));
+#endif							/* USE_SSE2 */
+		if (vector_eq_byte(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that is less than or equal to 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		iterations = nelem & ~(SIZEOF_VECTOR - 1);
+	TYPEOF_VECTOR		chunk;
+
+	for (i = 0; i < iterations; i += SIZEOF_VECTOR)
+	{
+#ifdef USE_SSE2
+		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+		memcpy(&chunk, &base[i], sizeof(chunk));
+#endif							/* USE_SSE2 */
+		if (vector_le_byte(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (base[i] <= key)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * pg_lfind32
  *
@@ -26,7 +96,6 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-	/* Use SIMD intrinsics where available. */
 #ifdef USE_SSE2
 
 	/*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..0185bc2ae0 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -25,6 +25,125 @@
 #if (defined(__x86_64__) || defined(_M_AMD64))
 #include <emmintrin.h>
 #define USE_SSE2
+#define TYPEOF_VECTOR __m128i
+
+#else
+#define TYPEOF_VECTOR uint64
+#endif							/* (defined(__x86_64__) || defined(_M_AMD64)) */
+
+#define SIZEOF_VECTOR sizeof(TYPEOF_VECTOR)
+
+/* return a vector with all bytes set to c */
+static inline TYPEOF_VECTOR
+vector_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+	return _mm_set1_epi8(c);
+#else
+	return ~UINT64CONST(0) / 0xFF * c;
 #endif
+}
+
+/* return true if any bytes in the vector are zero */
+static inline bool
+vector_has_zero(const TYPEOF_VECTOR v)
+{
+#ifdef USE_SSE2
+	return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+	/* from https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord */
+	return (v - vector_broadcast(0x01)) & ~v & vector_broadcast(0x80);
+#endif
+}
+
+static inline bool
+vector_eq_byte(const TYPEOF_VECTOR v, const uint8 c)
+{
+	bool		result;
+
+	/* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+	bool		assert_result = false;
+	const char* s = (const char*) &v;
+
+	for (int j = 0; j < SIZEOF_VECTOR; j++)
+	{
+		if (s[j] == c)
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+	result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector_broadcast(c)));
+#else
+	/* any bytes in v equal to c will evaluate to zero via XOR */
+	result = vector_has_zero(v ^ vector_broadcast(c));
+#endif							/* USE_SSE2 */
+
+#ifdef USE_ASSERT_CHECKING
+	Assert(assert_result == result);
+#endif							/* USE_ASSERT_CHECKING */
+
+	return result;
+}
+
+static inline bool
+vector_le_byte(const TYPEOF_VECTOR v, const uint8 c)
+{
+	bool		result;
+#ifdef USE_SSE2
+	__m128i		sub;
+#endif
+
+#if !defined(USE_SSE2) || defined(USE_ASSERT_CHECKING)
+	const char* s = (const char*) &v;
+#endif
+
+	/* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+	bool		assert_result = false;
+
+	for (int j = 0; j < SIZEOF_VECTOR; j++)
+	{
+		if (s[j] <= c)
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+	/* use saturating subtraction to find bytes <= c, which will present as NUL bytes in 'sub' */
+	sub = _mm_subs_epu8(v, vector_broadcast(c));
+	return _mm_movemask_epi8(_mm_cmpeq_epi8(sub, _mm_setzero_si128()));
+#else
+	/* to find bytes <= c, we can use bitwise operations to find bytes < c+1, but it only works if c+1 <= 128 */
+	/* from https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord */
+	if (c + 1 <= 128)
+		return (v - vector_broadcast(c + 1)) & ~v & vector_broadcast(0x80);
+	else
+	{
+		/* one byte at a time */
+		for (int j = 0; j < SIZEOF_VECTOR; j++)
+		{
+			if (s[j] <= c)
+			{
+				result = true;
+				break;
+			}
+		}
+	}
+#endif
+
+#ifdef USE_ASSERT_CHECKING
+	Assert(assert_result == result);
+#endif							/* USE_ASSERT_CHECKING */
+
+	return result;
+}
 
 #endif							/* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
- test_lfind 
-------------
+SELECT test_lfind8();
+ test_lfind8 
+-------------
+ 
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le 
+----------------
+ 
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32 
+--------------
  
 (1 row)
 
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
 
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
 	RETURNS pg_catalog.void
 	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..b853ddb609 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -18,10 +18,104 @@
 
 PG_MODULE_MAGIC;
 
-PG_FUNCTION_INFO_V1(test_lfind);
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+	unsigned char* str1 = (unsigned char*) "1234567890abcdef";
+	unsigned char* str2 = (unsigned char*) "1234567890abcdefZ";
+	size_t len1 = strlen((char*) str1);
+	size_t len2 = strlen((char*) str2);
+	uint8			key;
+
+	/* whole length of 16*/
+	Assert(len1 == 16);
+
+	key = 'X';
+	if (pg_lfind8(key, str1, len1))
+		elog(ERROR, "pg_lfind8() found nonexistent element '%c'", key);
+	key = 'f';
+	if (!pg_lfind8(key, str1, len1))
+		elog(ERROR, "pg_lfind8() did not find existing element '%c'", key);
+	/* include terminator */
+	key = '\0';
+	if (!pg_lfind8(key, str1, len1 + 1))
+		elog(ERROR, "pg_lfind8() did not find existing element '%c'", key);
+
+	/* restricted length */
+	key = '5';
+	if (pg_lfind8(key, str1, 4))
+		elog(ERROR, "pg_lfind8() found nonexistent element '%c'", key);
+	key = '4';
+	if (!pg_lfind8(key, str1, 4))
+		elog(ERROR, "pg_lfind8() did not find existing element '%c'", key);
+
+	/* test byte-wise path with string larger than vector size */
+	Assert(len2 > 16);
+
+	key = 'Y';
+	if (pg_lfind8(key, str2, len2))
+		elog(ERROR, "pg_lfind8() found nonexistent element '%c'", key);
+	key = 'Z';
+	if (!pg_lfind8(key, str2, len2))
+		elog(ERROR, "pg_lfind8() did not find existing element '%c'", key);
+
+	PG_RETURN_VOID();
+}
+
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+	unsigned char* str1 = (unsigned char*) "A_CDEFGHIJKLMNO_";
+	unsigned char* str2 = (unsigned char*) "A_CDEFGHIJKLMNO_3";
+	size_t len1 = strlen((char*) str1);
+	size_t len2 = strlen((char*) str2);
+	uint8			key;
+
+	/* whole length of 16*/
+	Assert(len1 == 16);
+
+	/* search for char with value one less than minimum */
+	key = '@';
+	if (pg_lfind8_le(key, str1, len1))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '%c'", key);
+	/* search for minimum char */
+	key = 'A';
+	if (!pg_lfind8_le(key, str1, len1))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '%c'", key);
+	key = 'B';
+	if (!pg_lfind8_le(key, str1, len1))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '%c'", key);
+
+	/* search for terminating null by <= 0x01 */
+	key = 0x01;
+	if (!pg_lfind8_le(key, str1, len1 + 1))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '%c'", key);
+
+	/* test byte-wise path with string larger than vector size */
+	Assert(len2 > 16);
+
+	/* search for char with value one less than minimum */
+	key = '2';
+	if (pg_lfind8_le(key, str2, len2))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '%c'", key);
+
+	/* search for minimum char */
+	key = '3';
+	if (!pg_lfind8_le(key, str2, len2))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '%c'", key);
+	key = '4';
+	if (!pg_lfind8_le(key, str2, len2))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '%c'", key);
+
+	PG_RETURN_VOID();
+}
 
+PG_FUNCTION_INFO_V1(test_lfind32);
 Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
 {
 #define TEST_ARRAY_SIZE 135
 	uint32		test_array[TEST_ARRAY_SIZE] = {0};

#17

Ranier Vilela

ranier.vf@gmail.com

over 3 years ago

In reply to: John Naylor (#16)

Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

I ran this test.

select 1 from long_json_as_text where t::json is null;

head:
Time: 161,741ms

v5:
Time: 270,298 ms

ubuntu 64 bits
gcc 9.4.0

Am I missing something?

regards,
Ranier Vilela

Import Notes

Resolved by subject fallback

#18

Ranier Vilela

ranier.vf@gmail.com

over 3 years ago

In reply to: Ranier Vilela (#17)

Re: [PATCH] Optimize json_lex_string by batching character copying

Em seg., 15 de ago. de 2022 às 15:34, Ranier Vilela <ranier.vf@gmail.com>
escreveu:

Hi,

I ran this test.

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null;

head:
Time: 161,741ms

v5:
Time: 270,298 ms

Sorry too fast, 270,298ms with native memchr.

v5
Time: 208,689 ms

regards,
Ranier Vilela

#19

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: John Naylor (#16)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Mon, Aug 15, 2022 at 08:33:21PM +0700, John Naylor wrote:

The attached implements the above, more or less, using new pg_lfind8()
and pg_lfind8_le(), which in turn are based on helper functions that
act on a single vector. The pg_lfind* functions have regression tests,
but I haven't done the same for json yet. I went the extra step to use
bit-twiddling for non-SSE builds using uint64 as a "vector", which
still gives a pretty good boost (test below, min of 3):

Looks pretty reasonable to me.

+#ifdef USE_SSE2
+		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+		memcpy(&chunk, &base[i], sizeof(chunk));
+#endif							/* USE_SSE2 */

+#ifdef USE_SSE2
+		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+		memcpy(&chunk, &base[i], sizeof(chunk));
+#endif							/* USE_SSE2 */

Perhaps there should be a macro or inline function for loading a vector so
that these USE_SSE2 checks can be abstracted away, too.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#20

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Nathan Bossart (#19)

1 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Tue, Aug 16, 2022 at 4:23 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Aug 15, 2022 at 08:33:21PM +0700, John Naylor wrote:
+#ifdef USE_SSE2
+             chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+             memcpy(&chunk, &base[i], sizeof(chunk));
+#endif                                                       /* USE_SSE2 */
Perhaps there should be a macro or inline function for loading a vector so
that these USE_SSE2 checks can be abstracted away, too.

This is done. Also:
- a complete overhaul of the pg_lfind8* tests
- using a typedef for the vector type
- some refactoring, name changes and other cleanups (a few of these
could also be applied to the 32-byte element path, but that is left
for future work)

TODO: json-specific tests of the new path

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v6-json-lex-string-simd.patchapplication/x-patch; name=v6-json-lex-string-simd.patchDownload

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..efcaedd682 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
 
 #include "common/jsonapi.h"
 #include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
 
 #ifndef FRONTEND
 #include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
-			char	   *p;
+			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
 			 * Skip to the first byte that requires special handling, so we
 			 * can batch calls to appendBinaryStringInfo.
 			 */
-			for (p = s; p < end; p++)
+			while (p < end - sizeof(Vector) &&
+				   !pg_lfind8('\\', (uint8 *) p, sizeof(Vector)) &&
+				   !pg_lfind8('"', (uint8 *) p, sizeof(Vector)) &&
+				   !pg_lfind8_le(0x1F, (uint8 *) p, sizeof(Vector)))
+				p += sizeof(Vector);
+
+			for (; p < end; p++)
 			{
 				if (*p == '\\' || *p == '"')
 					break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..bb4033c7fc 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pg_lfind.h
- *	  Optimized linear search routines.
+ *	  Optimized linear search routines using SIMD intrinsics where available
  *
  * Copyright (c) 2022, PostgreSQL Global Development Group
  *
@@ -15,6 +15,68 @@
 
 #include "port/simd.h"
 
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector) - 1);
+	Vector		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector))
+	{
+		vector_load(&chunk, &base[i]);
+		if (vector_eq_byte(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal to
+ * 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector) - 1);
+	Vector		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector))
+	{
+		vector_load(&chunk, &base[i]);
+		if (vector_le_byte(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (base[i] <= key)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * pg_lfind32
  *
@@ -26,7 +88,6 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-	/* Use SIMD intrinsics where available. */
 #ifdef USE_SSE2
 
 	/*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..56da8e27cd 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -25,6 +25,141 @@
 #if (defined(__x86_64__) || defined(_M_AMD64))
 #include <emmintrin.h>
 #define USE_SSE2
+typedef __m128i Vector;
+
+#else
+typedef uint64 Vector;
+#endif							/* (defined(__x86_64__) || defined(_M_AMD64)) */
+
+
+static inline void vector_load(Vector * v, const uint8 *s);
+static inline Vector vector_broadcast(const uint8 c);
+static inline bool vector_has_zero(const Vector v);
+static inline bool vector_le_byte(const Vector v, const uint8 c);
+static inline bool vector_eq_byte(const Vector v, const uint8 c);
+
+
+/* load a chunk of memory into a register */
+static inline void
+vector_load(Vector * v, const uint8 *s)
+{
+#ifdef USE_SSE2
+	*v = _mm_loadu_si128((const __m128i *) s);
+#else
+	memcpy(v, s, sizeof(Vector));
+#endif
+}
+
+/* return a vector with all bytes set to c */
+static inline Vector
+vector_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+	return _mm_set1_epi8(c);
+#else
+	return ~UINT64CONST(0) / 0xFF * c;
 #endif
+}
+
+/* return true if any bytes in the vector are zero */
+static inline bool
+vector_has_zero(const Vector v)
+{
+#ifdef USE_SSE2
+	return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+	return vector_le_byte(v, 0);
+#endif
+}
+
+static inline bool
+vector_eq_byte(const Vector v, const uint8 c)
+{
+	bool		result;
+
+	/* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+	bool		assert_result = false;
+	const uint8* s = (const uint8*) &v;
+
+	for (int i = 0; i < sizeof(Vector); i++)
+	{
+		if (s[i] == c)
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+	result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector_broadcast(c)));
+#else
+	/* any bytes in v equal to c will evaluate to zero via XOR */
+	result = vector_has_zero(v ^ vector_broadcast(c));
+#endif							/* USE_SSE2 */
+
+#ifdef USE_ASSERT_CHECKING
+	Assert(assert_result == result);
+#endif							/* USE_ASSERT_CHECKING */
+
+	return result;
+}
+
+static inline bool
+vector_le_byte(const Vector v, const uint8 c)
+{
+	bool		result = false;
+#ifdef USE_SSE2
+	__m128i		sub;
+#endif
+
+#if !defined(USE_SSE2) || defined(USE_ASSERT_CHECKING)
+	const uint8* s = (const uint8*) &v;
+#endif
+
+	/* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+	bool		assert_result = false;
+
+	for (int i = 0; i < sizeof(Vector); i++)
+	{
+		if (s[i] <= c)
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+	/* use saturating subtraction to find bytes <= c, which will present as NUL bytes in 'sub' */
+	sub = _mm_subs_epu8(v, vector_broadcast(c));
+	result = vector_has_zero(sub);
+#else
+	/* to find bytes <= c, we can use bitwise operations to find bytes < c+1, but it only works if c+1 <= 128 and if the highest bit in v is not set */
+	/* from https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord */
+	if ((int64) v >= 0 && c < 0x80)
+		result = (v - vector_broadcast(c + 1)) & ~v & vector_broadcast(0x80);
+	else
+	{
+		/* one byte at a time */
+		for (int i = 0; i < sizeof(Vector); i++)
+		{
+			if (s[i] <= c)
+			{
+				result = true;
+				break;
+			}
+		}
+	}
+#endif
+
+#ifdef USE_ASSERT_CHECKING
+	Assert(assert_result == result);
+#endif							/* USE_ASSERT_CHECKING */
+
+	return result;
+}
 
 #endif							/* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
- test_lfind 
-------------
+SELECT test_lfind8();
+ test_lfind8 
+-------------
+ 
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le 
+----------------
+ 
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32 
+--------------
  
 (1 row)
 
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
 
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
 	RETURNS pg_catalog.void
 	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..18ca9d0018 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -18,10 +18,97 @@
 
 PG_MODULE_MAGIC;
 
-PG_FUNCTION_INFO_V1(test_lfind);
+/* workhorse for test_lfind8 */
+static void
+test_lfind8_internal(uint8 key)
+{
+	/* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+	uint8 charbuf[LEN_WITH_TAIL];
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[LEN_WITH_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search with vector operations */
+	charbuf[LEN_NO_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+	test_lfind8_internal(0);
+	test_lfind8_internal(1);
+	test_lfind8_internal(0x7F);
+	test_lfind8_internal(0x80);
+	test_lfind8_internal(0xFD);
+
+	PG_RETURN_VOID();
+}
+
+/* workhorse for test_lfind8_le */
+static void
+test_lfind8_le_internal(uint8 key)
+{
+	/* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+	uint8 charbuf[LEN_WITH_TAIL];
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[LEN_WITH_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search with vector operations */
+	charbuf[LEN_NO_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+	test_lfind8_le_internal(0);
+	test_lfind8_le_internal(1);
+	test_lfind8_le_internal(0x7F);
+	test_lfind8_le_internal(0x80);
+	test_lfind8_le_internal(0xFD);
+
+	PG_RETURN_VOID();
+}
 
+PG_FUNCTION_INFO_V1(test_lfind32);
 Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
 {
 #define TEST_ARRAY_SIZE 135
 	uint32		test_array[TEST_ARRAY_SIZE] = {0};

#21

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: John Naylor (#20)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Fri, Aug 19, 2022 at 03:11:36PM +0700, John Naylor wrote:

This is done. Also:
- a complete overhaul of the pg_lfind8* tests
- using a typedef for the vector type
- some refactoring, name changes and other cleanups (a few of these
could also be applied to the 32-byte element path, but that is left
for future work)

TODO: json-specific tests of the new path

This looks pretty good to me. Should we rename vector_broadcast() and
vector_has_zero() to indicate that they are working with bytes (e.g.,
vector_broadcast_byte())? We might be able to use vector_broadcast_int()
in the 32-bit functions, and your other vector functions already have a
_byte suffix.

In general, the approach you've taken seems like a decent readability
improvement. I'd be happy to try my hand at adjusting the 32-bit path and
adding ARM versions of all this stuff.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#22

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: Nathan Bossart (#21)

2 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Fri, Aug 19, 2022 at 01:42:15PM -0700, Nathan Bossart wrote:

On Fri, Aug 19, 2022 at 03:11:36PM +0700, John Naylor wrote:

This is done. Also:
- a complete overhaul of the pg_lfind8* tests
- using a typedef for the vector type
- some refactoring, name changes and other cleanups (a few of these
could also be applied to the 32-byte element path, but that is left
for future work)

TODO: json-specific tests of the new path

This looks pretty good to me. Should we rename vector_broadcast() and
vector_has_zero() to indicate that they are working with bytes (e.g.,
vector_broadcast_byte())? We might be able to use vector_broadcast_int()
in the 32-bit functions, and your other vector functions already have a
_byte suffix.

In general, the approach you've taken seems like a decent readability
improvement. I'd be happy to try my hand at adjusting the 32-bit path and
adding ARM versions of all this stuff.

I spent some more time looking at this one, and I had a few ideas that I
thought I'd share. 0001 is your v6 patch with a few additional changes,
including simplying the assertions for readability, splitting out the
Vector type into Vector8 and Vector32 (needed for ARM), and adjusting
pg_lfind32() to use the new tools in simd.h. 0002 adds ARM versions of
everything, which obsoletes the other thread I started [0]/messages/by-id/20220819200829.GA395728@nathanxps13. This is still
a little rough around the edges (e.g., this should probably be more than 2
patches), but I think it helps demonstrate a more comprehensive design than
what I've proposed in the pg_lfind32-for-ARM thread [0]/messages/by-id/20220819200829.GA395728@nathanxps13.

Apologies if I'm stepping on your toes a bit here.

[0]: /messages/by-id/20220819200829.GA395728@nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

0001-json_lex_string-SIMD.patchtext/x-diff; charset=us-asciiDownload

From 7dd35c8ffe8e42885586fb16a77b6c3e792c6a6d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 20 Aug 2022 21:14:01 -0700
Subject: [PATCH 1/2] json_lex_string() SIMD

---
 src/common/jsonapi.c                          |  11 +-
 src/include/port/pg_lfind.h                   | 132 ++++++----
 src/include/port/simd.h                       | 227 ++++++++++++++++++
 .../test_lfind/expected/test_lfind.out        |  18 +-
 .../modules/test_lfind/sql/test_lfind.sql     |   4 +-
 .../modules/test_lfind/test_lfind--1.0.sql    |  10 +-
 src/test/modules/test_lfind/test_lfind.c      |  91 ++++++-
 7 files changed, 443 insertions(+), 50 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..87e1d0b192 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
 
 #include "common/jsonapi.h"
 #include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
 
 #ifndef FRONTEND
 #include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
-			char	   *p;
+			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
 			 * Skip to the first byte that requires special handling, so we
 			 * can batch calls to appendBinaryStringInfo.
 			 */
-			for (p = s; p < end; p++)
+			while (p < end - sizeof(Vector8) &&
+				   !pg_lfind8('\\', (uint8 *) p, sizeof(Vector8)) &&
+				   !pg_lfind8('"', (uint8 *) p, sizeof(Vector8)) &&
+				   !pg_lfind8_le(0x1F, (uint8 *) p, sizeof(Vector8)))
+				p += sizeof(Vector8);
+
+			for (; p < end; p++)
 			{
 				if (*p == '\\' || *p == '"')
 					break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..def858cbe1 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pg_lfind.h
- *	  Optimized linear search routines.
+ *	  Optimized linear search routines using SIMD intrinsics where available
  *
  * Copyright (c) 2022, PostgreSQL Global Development Group
  *
@@ -15,6 +15,68 @@
 
 #include "port/simd.h"
 
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		vector8_load(&chunk, &base[i]);
+		if (vector8_eq(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal to
+ * 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		vector8_load(&chunk, &base[i]);
+		if (vector8_le(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (base[i] <= key)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * pg_lfind32
  *
@@ -24,59 +86,50 @@
 static inline bool
 pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
+	bool		result = false;
 	uint32		i = 0;
+#ifdef USE_ASSERT_CHECKING
+	size_t		nelem_for_asserts = nelem;
+#endif
 
-	/* Use SIMD intrinsics where available. */
 #ifdef USE_SSE2
-
 	/*
 	 * A 16-byte register only has four 4-byte lanes. For better
 	 * instruction-level parallelism, each loop iteration operates on a block
 	 * of four registers. Testing has showed this is ~40% faster than using a
 	 * block of two registers.
 	 */
-	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
-	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+	const Vector32 keys = vector32_broadcast(key);  /* load 4 copies of key */
+	uint32		tail_idx = nelem & ~0xF;    /* round down to multiple of 16 */
 
-#if defined(USE_ASSERT_CHECKING)
-	bool		assert_result = false;
-
-	/* pre-compute the result for assert checking */
-	for (i = 0; i < nelem; i++)
+	for (i = 0; i < tail_idx; i += 16)
 	{
-		if (key == base[i])
-		{
-			assert_result = true;
-			break;
-		}
-	}
-#endif
+		Vector32	vals1, vals2, vals3, vals4,
+					result1, result2, result3, result4,
+					tmp1, tmp2, result;
 
-	for (i = 0; i < iterations; i += 16)
-	{
 		/* load the next block into 4 registers holding 4 values each */
-		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
-		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
-		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
-		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
+		vector32_load(&vals1, &base[i]);
+		vector32_load(&vals2, &base[i + 4]);
+		vector32_load(&vals3, &base[i + 8]);
+		vector32_load(&vals4, &base[i + 12]);
 
 		/* compare each value to the key */
-		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
-		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
-		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
-		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+		result1 = vector32_veq(keys, vals1);
+		result2 = vector32_veq(keys, vals2);
+		result3 = vector32_veq(keys, vals3);
+		result4 = vector32_veq(keys, vals4);
 
 		/* combine the results into a single variable */
-		const		__m128i tmp1 = _mm_or_si128(result1, result2);
-		const		__m128i tmp2 = _mm_or_si128(result3, result4);
-		const		__m128i result = _mm_or_si128(tmp1, tmp2);
+		tmp1 = vector32_vor(result1, result2);
+		tmp2 = vector32_vor(result3, result4);
+		result = vector32_vor(tmp1, tmp2);
 
 		/* see if there was a match */
 		if (_mm_movemask_epi8(result) != 0)
 		{
-#if defined(USE_ASSERT_CHECKING)
-			Assert(assert_result == true);
-#endif
+			Assert(lfind(&key, base, &nelem_for_asserts, sizeof(uint32),
+						 uint32_cmp_eq));
 			return true;
 		}
 	}
@@ -87,17 +140,14 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	{
 		if (key == base[i])
 		{
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
-			Assert(assert_result == true);
-#endif
-			return true;
+			result = true;
+			break;
 		}
 	}
 
-#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
-	Assert(assert_result == false);
-#endif
-	return false;
+	Assert(result == (lfind(&key, base, &nelem_for_asserts, sizeof(uint32),
+							uint32_cmp_eq) != NULL));
+	return result;
 }
 
 #endif							/* PG_LFIND_H */
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..4dda87f3dd 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -13,6 +13,8 @@
 #ifndef SIMD_H
 #define SIMD_H
 
+#include "utils/elog.h"
+
 /*
  * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
  * that compilers targeting this architecture understand SSE2 intrinsics.
@@ -25,6 +27,231 @@
 #if (defined(__x86_64__) || defined(_M_AMD64))
 #include <emmintrin.h>
 #define USE_SSE2
+typedef __m128i Vector8;
+typedef __m128i Vector32;
+
+/*
+ * If no SIMD instructions are available, we emulate specialized vector
+ * operations using uint64.
+ */
+#else
+typedef uint64 Vector8;
+typedef uint64 Vector32;
+#endif
+
+
+static inline void vector8_load(Vector8 *v, const uint8 *s);
+static inline void vector32_load(Vector32 *v, const uint32 *s);
+static inline Vector8 vector8_broadcast(const uint8 c);
+static inline Vector32 vector32_broadcast(const uint32 c);
+static inline bool vector8_has_zero(const Vector8 v);
+static inline bool vector8_le(const Vector8 v, const uint8 c);
+static inline bool vector8_eq(const Vector8 v, const uint8 c);
+static inline Vector32 vector32_veq(const Vector32 v1, const Vector32 v2);
+static inline Vector32 vector32_vor(const Vector32 v1, const Vector32 v2);
+
+
+/*
+ * Stuff for assert-enabled builds.
+ */
+#ifdef USE_ASSERT_CHECKING
+
+#include <search.h>
+
+static size_t nelem_vector8 = sizeof(Vector8) / sizeof(uint8);
+
+static int
+uint8_cmp_eq(const void *key, const void *elem)
+{
+	uint8 k = *((const uint8 *) key);
+	uint8 e = *((const uint8 *) elem);
+
+	if (k < e)
+		return -1;
+	if (k > e)
+		return 1;
+	return 0;
+}
+
+static int
+uint32_cmp_eq(const void *key, const void *elem)
+{
+	uint32 k = *((const uint32 *) key);
+	uint32 e = *((const uint32 *) elem);
+
+	if (k < e)
+		return -1;
+	if (k > e)
+		return 1;
+	return 0;
+}
+
+static int
+uint8_cmp_le(const void *key, const void *elem)
+{
+	uint8 k = *((const uint8 *) key);
+	uint8 e = *((const uint8 *) elem);
+
+	/*
+	 * This is counterintuitive.  We want lfind() to report success if it finds
+	 * an element <= the key, so we need to return 0 any time the key is >= the
+	 * current element.
+	 */
+	if (k >= e)
+		return 0;
+	return -1;
+}
+
+#endif                          /* USE_ASSERT_CHECKING */
+
+
+/*
+ * Functions for loading a chunk of memory into a vector.
+ */
+
+static inline void
+vector8_load(Vector8 *v, const uint8 *s)
+{
+#ifdef USE_SSE2
+	*v = _mm_loadu_si128((const __m128i *) s);
+#else
+	memcpy(v, s, sizeof(Vector8));
+#endif
+}
+
+static inline void
+vector32_load(Vector32 *v, const uint32 *s)
+{
+#ifdef USE_SSE2
+	*v = _mm_loadu_si128((const __m128i *) s);
+#else
+	elog(ERROR, "vector32() without SIMD not implemented");
+	pg_unreachable();
+#endif
+}
+
+
+/*
+ * Functions for creating a vector with all elements set to the same value.
+ */
+
+static inline Vector8
+vector8_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+	return _mm_set1_epi8(c);
+#else
+	return ~UINT64CONST(0) / 0xFF * c;
+#endif
+}
+
+static inline Vector32
+vector32_broadcast(const uint32 c)
+{
+#ifdef USE_SSE2
+	return _mm_set1_epi32(c);
+#else
+	elog(ERROR, "vector32_broadcast() without SIMD not implemented");
+	pg_unreachable();
+#endif
+}
+
+
+/*
+ * Functions for comparing vector elements to a given value.
+ */
+
+static inline bool
+vector8_has_zero(const Vector8 v)
+{
+#ifdef USE_SSE2
+	return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+	return vector8_le(v, 0);
+#endif
+}
+
+static inline bool
+vector8_eq(const Vector8 v, const uint8 c)
+{
+	bool		result;
+
+#ifdef USE_SSE2
+	result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector8_broadcast(c)));
+#else
+	/* any bytes in v equal to c will evaluate to zero via XOR */
+	result = vector8_has_zero(v ^ vector8_broadcast(c));
+#endif							/* USE_SSE2 */
+
+	Assert(result == (lfind(&c, &v, &nelem_vector8, sizeof(uint8),
+							uint8_cmp_eq) != NULL));
+	return result;
+}
+
+static inline Vector32
+vector32_veq(const Vector32 v1, const Vector32 v2)
+{
+#ifdef USE_SSE2
+	return _mm_cmpeq_epi32(v1, v2);
+#else
+	elog(ERROR, "vector32_veq() without SIMD not implemented");
+	pg_unreachable();
+#endif
+}
+
+static inline bool
+vector8_le(const Vector8 v, const uint8 c)
+{
+	bool		result = false;
+
+#ifdef USE_SSE2
+	/*
+	 * Use saturating subtraction to find bytes <= c, which will present as
+	 * NUL bytes in 'sub'.
+	 */
+	__m128i sub = _mm_subs_epu8(v, vector8_broadcast(c));
+	result = vector8_has_zero(sub);
+#else
+	/*
+	 * To find bytes <= c, we can use bitwise operations to find bytes < c + 1,
+	 * but it only works if c + 1 <= 128 and if the highest bit in v is not set
+	 * (from https://graphics.stanford.edu/~seander/bithacks.html).
+	 */
+	if ((int64) v >= 0 && c < 0x80)
+		result = (v - vector8_broadcast(c + 1)) & ~v & vector8_broadcast(0x80);
+	else
+	{
+		/* one byte at a time */
+		for (int i = 0; i < sizeof(Vector8); i++)
+		{
+			if (((const uint8 *) &v)[i] <= c)
+			{
+				result = true;
+				break;
+			}
+		}
+	}
+#endif
+
+	Assert(result == (lfind(&c, &v, &nelem_vector8, sizeof(uint8),
+							uint8_cmp_le) != NULL));
+	return result;
+}
+
+
+/*
+ * Functions for bitwise operations on vectors.
+ */
+
+static inline Vector32
+vector32_vor(const Vector32 v1, const Vector32 v2)
+{
+#ifdef USE_SSE2
+	return _mm_or_si128(v1, v2);
+#else
+	elog(ERROR, "vector32_vor() without SIMD not implemented");
+	pg_unreachable();
 #endif
+}
 
 #endif							/* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
- test_lfind 
-------------
+SELECT test_lfind8();
+ test_lfind8 
+-------------
+ 
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le 
+----------------
+ 
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32 
+--------------
  
 (1 row)
 
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
 
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
 	RETURNS pg_catalog.void
 	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..efe6b60bc5 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -18,10 +18,97 @@
 
 PG_MODULE_MAGIC;
 
-PG_FUNCTION_INFO_V1(test_lfind);
+/* workhorse for test_lfind8 */
+static void
+test_lfind8_internal(uint8 key)
+{
+	/* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector8))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+	uint8 charbuf[LEN_WITH_TAIL];
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[LEN_WITH_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search with vector operations */
+	charbuf[LEN_NO_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+	test_lfind8_internal(0);
+	test_lfind8_internal(1);
+	test_lfind8_internal(0x7F);
+	test_lfind8_internal(0x80);
+	test_lfind8_internal(0xFD);
+
+	PG_RETURN_VOID();
+}
+
+/* workhorse for test_lfind8_le */
+static void
+test_lfind8_le_internal(uint8 key)
+{
+	/* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector8))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+	uint8 charbuf[LEN_WITH_TAIL];
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[LEN_WITH_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search with vector operations */
+	charbuf[LEN_NO_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+	test_lfind8_le_internal(0);
+	test_lfind8_le_internal(1);
+	test_lfind8_le_internal(0x7F);
+	test_lfind8_le_internal(0x80);
+	test_lfind8_le_internal(0xFD);
+
+	PG_RETURN_VOID();
+}
 
+PG_FUNCTION_INFO_V1(test_lfind32);
 Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
 {
 #define TEST_ARRAY_SIZE 135
 	uint32		test_array[TEST_ARRAY_SIZE] = {0};
-- 
2.25.1

0002-ARM-SIMD-support.patchtext/x-diff; charset=us-asciiDownload

From e2fa356adfc41a13ea18da839b53353b774739cc Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 20 Aug 2022 21:44:21 -0700
Subject: [PATCH 2/2] ARM SIMD support

---
 src/include/port/pg_lfind.h |  6 +++++-
 src/include/port/simd.h     | 28 ++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index def858cbe1..04f09200b4 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -92,7 +92,7 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	size_t		nelem_for_asserts = nelem;
 #endif
 
-#ifdef USE_SSE2
+#if defined(USE_SSE2) || defined(__ARM_NEON)
 	/*
 	 * A 16-byte register only has four 4-byte lanes. For better
 	 * instruction-level parallelism, each loop iteration operates on a block
@@ -126,7 +126,11 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 		result = vector32_vor(tmp1, tmp2);
 
 		/* see if there was a match */
+#ifdef USE_SSE2
 		if (_mm_movemask_epi8(result) != 0)
+#elif defined(__ARM_NEON)
+		if (vmaxvq_u32(result) != 0)
+#endif
 		{
 			Assert(lfind(&key, base, &nelem_for_asserts, sizeof(uint32),
 						 uint32_cmp_eq));
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 4dda87f3dd..4f40b31e2a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -30,6 +30,15 @@
 typedef __m128i Vector8;
 typedef __m128i Vector32;
 
+/*
+ * Include arm_neon.h if the compiler is targeting an architecture that
+ * supports ARM Advanced SIMD (Neon) intrinsics.
+ */
+#elif defined(__ARM_NEON)
+#include <arm_neon.h>
+typedef uint8x16_t Vector8;
+typedef uint32x4_t Vector32;
+
 /*
  * If no SIMD instructions are available, we emulate specialized vector
  * operations using uint64.
@@ -114,6 +123,8 @@ vector8_load(Vector8 *v, const uint8 *s)
 {
 #ifdef USE_SSE2
 	*v = _mm_loadu_si128((const __m128i *) s);
+#elif defined(__ARM_NEON)
+	*v = vld1q_u8(s);
 #else
 	memcpy(v, s, sizeof(Vector8));
 #endif
@@ -124,6 +135,8 @@ vector32_load(Vector32 *v, const uint32 *s)
 {
 #ifdef USE_SSE2
 	*v = _mm_loadu_si128((const __m128i *) s);
+#elif defined(__ARM_NEON)
+	*v = vld1q_u32(s);
 #else
 	elog(ERROR, "vector32() without SIMD not implemented");
 	pg_unreachable();
@@ -140,6 +153,8 @@ vector8_broadcast(const uint8 c)
 {
 #ifdef USE_SSE2
 	return _mm_set1_epi8(c);
+#elif defined(__ARM_NEON)
+	return vdupq_n_u8(c);
 #else
 	return ~UINT64CONST(0) / 0xFF * c;
 #endif
@@ -150,6 +165,8 @@ vector32_broadcast(const uint32 c)
 {
 #ifdef USE_SSE2
 	return _mm_set1_epi32(c);
+#elif defined(__ARM_NEON)
+	return vdupq_n_u32(c);
 #else
 	elog(ERROR, "vector32_broadcast() without SIMD not implemented");
 	pg_unreachable();
@@ -166,6 +183,8 @@ vector8_has_zero(const Vector8 v)
 {
 #ifdef USE_SSE2
 	return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#elif defined(__ARM_NEON)
+	return vmaxvq_u8(vceqzq_u8(v));
 #else
 	return vector8_le(v, 0);
 #endif
@@ -178,6 +197,8 @@ vector8_eq(const Vector8 v, const uint8 c)
 
 #ifdef USE_SSE2
 	result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector8_broadcast(c)));
+#elif defined(__ARM_NEON)
+	result = vmaxvq_u8(vceqq_u8(v, vector8_broadcast(c)));
 #else
 	/* any bytes in v equal to c will evaluate to zero via XOR */
 	result = vector8_has_zero(v ^ vector8_broadcast(c));
@@ -193,6 +214,8 @@ vector32_veq(const Vector32 v1, const Vector32 v2)
 {
 #ifdef USE_SSE2
 	return _mm_cmpeq_epi32(v1, v2);
+#elif defined(__ARM_NEON)
+	return vceqq_u32(v1, v2);
 #else
 	elog(ERROR, "vector32_veq() without SIMD not implemented");
 	pg_unreachable();
@@ -211,6 +234,9 @@ vector8_le(const Vector8 v, const uint8 c)
 	 */
 	__m128i sub = _mm_subs_epu8(v, vector8_broadcast(c));
 	result = vector8_has_zero(sub);
+#elif __ARM_NEON
+	uint8x16_t sub = vqsubq_u8(v, vector8_broadcast(c));
+	result = vector8_has_zero(sub);
 #else
 	/*
 	 * To find bytes <= c, we can use bitwise operations to find bytes < c + 1,
@@ -248,6 +274,8 @@ vector32_vor(const Vector32 v1, const Vector32 v2)
 {
 #ifdef USE_SSE2
 	return _mm_or_si128(v1, v2);
+#elif defined(__ARM_NEON)
+	return vorrq_u32(v1, v2);
 #else
 	elog(ERROR, "vector32_vor() without SIMD not implemented");
 	pg_unreachable();
-- 
2.25.1

#23

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Nathan Bossart (#22)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Sun, Aug 21, 2022 at 12:47 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

I spent some more time looking at this one, and I had a few ideas that I
thought I'd share. 0001 is your v6 patch with a few additional changes,
including simplying the assertions for readability, splitting out the
Vector type into Vector8 and Vector32 (needed for ARM), and adjusting
pg_lfind32() to use the new tools in simd.h. 0002 adds ARM versions of
everything, which obsoletes the other thread I started [0]. This is still
a little rough around the edges (e.g., this should probably be more than 2
patches), but I think it helps demonstrate a more comprehensive design than
what I've proposed in the pg_lfind32-for-ARM thread [0].

Apologies if I'm stepping on your toes a bit here.

Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.

Not a full review, but on a brief look:

- I like the idea of simplifying the assertions, but I can't get
behind using platform lfind to do it, since it has a different API,
requires new functions we don't need, and possibly has portability
issues. A simple for-loop is better for assertions.
- A runtime elog is not appropriate for a compile time check -- use
#error instead.

--
John Naylor
EDB: http://www.enterprisedb.com

#24

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: John Naylor (#23)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:

Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.

Yeah, I'll remove those to keep this thread focused.

- I like the idea of simplifying the assertions, but I can't get
behind using platform lfind to do it, since it has a different API,
requires new functions we don't need, and possibly has portability
issues. A simple for-loop is better for assertions.

My main goal with this was improving readability, which is likely possible
without lfind(). I'll see what I can do.

- A runtime elog is not appropriate for a compile time check -- use
#error instead.

Will do.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#25

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: Nathan Bossart (#24)

1 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Mon, Aug 22, 2022 at 02:22:29PM -0700, Nathan Bossart wrote:

On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:

Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.

Yeah, I'll remove those to keep this thread focused.

Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachments:

v8-0001-json_lex_string-SIMD.patchtext/x-diff; charset=us-asciiDownload

From dd7b3c3b0567d77a7f171d9cb4b6a8d3ee30ceec Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 20 Aug 2022 21:14:01 -0700
Subject: [PATCH v8 1/1] json_lex_string() SIMD

---
 src/common/jsonapi.c                          |  11 +-
 src/include/port/pg_lfind.h                   |  65 +++++++-
 src/include/port/simd.h                       | 143 ++++++++++++++++++
 .../test_lfind/expected/test_lfind.out        |  18 ++-
 .../modules/test_lfind/sql/test_lfind.sql     |   4 +-
 .../modules/test_lfind/test_lfind--1.0.sql    |  10 +-
 src/test/modules/test_lfind/test_lfind.c      |  91 ++++++++++-
 7 files changed, 332 insertions(+), 10 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..87e1d0b192 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
 
 #include "common/jsonapi.h"
 #include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
 
 #ifndef FRONTEND
 #include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
-			char	   *p;
+			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
 			 * Skip to the first byte that requires special handling, so we
 			 * can batch calls to appendBinaryStringInfo.
 			 */
-			for (p = s; p < end; p++)
+			while (p < end - sizeof(Vector8) &&
+				   !pg_lfind8('\\', (uint8 *) p, sizeof(Vector8)) &&
+				   !pg_lfind8('"', (uint8 *) p, sizeof(Vector8)) &&
+				   !pg_lfind8_le(0x1F, (uint8 *) p, sizeof(Vector8)))
+				p += sizeof(Vector8);
+
+			for (; p < end; p++)
 			{
 				if (*p == '\\' || *p == '"')
 					break;
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..86342d71d8 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,8 @@
 /*-------------------------------------------------------------------------
  *
  * pg_lfind.h
- *	  Optimized linear search routines.
+ *	  Optimized linear search routines using SIMD intrinsics where
+ *	  available.
  *
  * Copyright (c) 2022, PostgreSQL Global Development Group
  *
@@ -15,6 +16,68 @@
 
 #include "port/simd.h"
 
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		vector8_load(&chunk, &base[i]);
+		if (vector8_eq(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal to
+ * 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		vector8_load(&chunk, &base[i]);
+		if (vector8_le(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (base[i] <= key)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * pg_lfind32
  *
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..23f6269169 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -25,6 +25,149 @@
 #if (defined(__x86_64__) || defined(_M_AMD64))
 #include <emmintrin.h>
 #define USE_SSE2
+typedef __m128i Vector8;
+
+/*
+ * If no SIMD instructions are available, we emulate specialized vector
+ * operations using uint64.
+ */
+#else
+typedef uint64 Vector8;
+#endif
+
+
+static inline void vector8_load(Vector8 *v, const uint8 *s);
+static inline Vector8 vector8_broadcast(const uint8 c);
+static inline bool vector8_has_zero(const Vector8 v);
+static inline bool vector8_eq(const Vector8 v, const uint8 c);
+static inline bool vector8_le(const Vector8 v, const uint8 c);
+
+
+/*
+ * Functions for loading a chunk of memory into a vector.
+ */
+
+static inline void
+vector8_load(Vector8 *v, const uint8 *s)
+{
+#ifdef USE_SSE2
+	*v = _mm_loadu_si128((const __m128i *) s);
+#else
+	memcpy(v, s, sizeof(Vector8));
+#endif
+}
+
+
+/*
+ * Functions for creating a vector with all elements set to the same value.
+ */
+
+static inline Vector8
+vector8_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+	return _mm_set1_epi8(c);
+#else
+	return ~UINT64CONST(0) / 0xFF * c;
 #endif
+}
+
+
+/*
+ * Functions for comparing vector elements to a given value.
+ */
+
+static inline bool
+vector8_has_zero(const Vector8 v)
+{
+#ifdef USE_SSE2
+	return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+	return vector8_le(v, 0);
+#endif
+}
+
+static inline bool
+vector8_eq(const Vector8 v, const uint8 c)
+{
+	bool		result;
+
+	/* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+	bool		assert_result = false;
+	for (int i = 0; i < sizeof(Vector8); i++)
+	{
+		if (((const uint8 *) &v)[i] == c)
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+	result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector8_broadcast(c)));
+#else
+	/* any bytes in v equal to c will evaluate to zero via XOR */
+	result = vector8_has_zero(v ^ vector8_broadcast(c));
+#endif
+
+	Assert(assert_result == result);
+	return result;
+}
+
+static inline bool
+vector8_le(const Vector8 v, const uint8 c)
+{
+	bool		result = false;
+#ifdef USE_SSE2
+	__m128i		sub;
+#endif
+
+	/* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+	bool		assert_result = false;
+	for (int i = 0; i < sizeof(Vector8); i++)
+	{
+		if (((const uint8 *) &v)[i] <= c)
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+	/*
+	 * Use saturating subtraction to find bytes <= c, which will present as
+	 * NUL bytes in 'sub'.
+	 */
+	sub = _mm_subs_epu8(v, vector8_broadcast(c));
+	result = vector8_has_zero(sub);
+#else
+	/*
+	 * To find bytes <= c, we can use bitwise operations to find bytes < c + 1,
+	 * but it only works if c + 1 <= 128 and if the highest bit in v is not set
+	 * (from https://graphics.stanford.edu/~seander/bithacks.html).
+	 */
+	if ((int64) v >= 0 && c < 0x80)
+		result = (v - vector8_broadcast(c + 1)) & ~v & vector8_broadcast(0x80);
+	else
+	{
+		/* one byte at a time */
+		for (int i = 0; i < sizeof(Vector8); i++)
+		{
+			if (((const uint8 *) &v)[i] <= c)
+			{
+				result = true;
+				break;
+			}
+		}
+	}
+#endif
+
+	Assert(assert_result == result);
+	return result;
+}
 
 #endif							/* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
- test_lfind 
-------------
+SELECT test_lfind8();
+ test_lfind8 
+-------------
+ 
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le 
+----------------
+ 
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32 
+--------------
  
 (1 row)
 
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
 
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
 	RETURNS pg_catalog.void
 	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..efe6b60bc5 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -18,10 +18,97 @@
 
 PG_MODULE_MAGIC;
 
-PG_FUNCTION_INFO_V1(test_lfind);
+/* workhorse for test_lfind8 */
+static void
+test_lfind8_internal(uint8 key)
+{
+	/* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector8))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+	uint8 charbuf[LEN_WITH_TAIL];
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[LEN_WITH_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search with vector operations */
+	charbuf[LEN_NO_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+	test_lfind8_internal(0);
+	test_lfind8_internal(1);
+	test_lfind8_internal(0x7F);
+	test_lfind8_internal(0x80);
+	test_lfind8_internal(0xFD);
+
+	PG_RETURN_VOID();
+}
+
+/* workhorse for test_lfind8_le */
+static void
+test_lfind8_le_internal(uint8 key)
+{
+	/* The byte searched for shouldn't be in the first vector-sized chunk, to make sure iteration works */
+#define LEN_NO_TAIL (2 * sizeof(Vector8))
+#define LEN_WITH_TAIL (LEN_NO_TAIL + 3)
+
+	uint8 charbuf[LEN_WITH_TAIL];
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[LEN_WITH_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_WITH_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, LEN_WITH_TAIL);
+	/* search with vector operations */
+	charbuf[LEN_NO_TAIL - 1] = key;
+	if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8_le(key, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, LEN_NO_TAIL))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+	test_lfind8_le_internal(0);
+	test_lfind8_le_internal(1);
+	test_lfind8_le_internal(0x7F);
+	test_lfind8_le_internal(0x80);
+	test_lfind8_le_internal(0xFD);
+
+	PG_RETURN_VOID();
+}
 
+PG_FUNCTION_INFO_V1(test_lfind32);
 Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
 {
 #define TEST_ARRAY_SIZE 135
 	uint32		test_array[TEST_ARRAY_SIZE] = {0};
-- 
2.25.1

#26

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Nathan Bossart (#25)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Mon, Aug 22, 2022 at 02:22:29PM -0700, Nathan Bossart wrote:

On Mon, Aug 22, 2022 at 09:35:34AM +0700, John Naylor wrote:

Not at all! However, the 32-bit-element changes are irrelevant for
json, and make review more difficult. I would suggest keeping those in
the other thread starting with whatever refactoring is needed. I can
always rebase over that.

Yeah, I'll remove those to keep this thread focused.

Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.

LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon. Possible
bikeshedding: Functions like vector8_eq() might be misunderstood as
comparing two vectors, but here we are comparing each lane with a
scalar. I wonder if vector8_eq_scalar() et al might be more clear.

--
John Naylor
EDB: http://www.enterprisedb.com

#27

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: John Naylor (#26)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Tue, Aug 23, 2022 at 01:03:03PM +0700, John Naylor wrote:

On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart

Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.

LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon. Possible
bikeshedding: Functions like vector8_eq() might be misunderstood as
comparing two vectors, but here we are comparing each lane with a
scalar. I wonder if vector8_eq_scalar() et al might be more clear.

Good point. I had used vector32_veq() to denote vector comparison, which
would extend to something like vector8_seq(). But that doesn't seem
descriptive enough. It might be worth considering vector8_contains() or
vector8_has() as well. I don't really have an opinion, but if I had to
pick something, I guess I'd choose vector8_contains().

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#28

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Nathan Bossart (#27)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Wed, Aug 24, 2022 at 12:15 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Tue, Aug 23, 2022 at 01:03:03PM +0700, John Naylor wrote:

On Tue, Aug 23, 2022 at 10:32 AM Nathan Bossart

Here's a new version of the patch with the 32-bit changes and calls to
lfind() removed.

LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon. Possible
bikeshedding: Functions like vector8_eq() might be misunderstood as
comparing two vectors, but here we are comparing each lane with a
scalar. I wonder if vector8_eq_scalar() et al might be more clear.

Good point. I had used vector32_veq() to denote vector comparison, which
would extend to something like vector8_seq(). But that doesn't seem
descriptive enough. It might be worth considering vector8_contains() or
vector8_has() as well. I don't really have an opinion, but if I had to
pick something, I guess I'd choose vector8_contains().

It seems "scalar" would be a bad choice since it already means
(confusingly) operating on the least significant element of a vector.
I'm thinking of *_has and *_has_le, matching the already existing in
the earlier patch *_has_zero.

--
John Naylor
EDB: http://www.enterprisedb.com

#29

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: John Naylor (#28)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Wed, Aug 24, 2022 at 11:59:25AM +0700, John Naylor wrote:

It seems "scalar" would be a bad choice since it already means
(confusingly) operating on the least significant element of a vector.
I'm thinking of *_has and *_has_le, matching the already existing in
the earlier patch *_has_zero.

That seems reasonable to me.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#30

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Nathan Bossart (#29)

2 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Wed, Aug 24, 2022 at 11:56 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Wed, Aug 24, 2022 at 11:59:25AM +0700, John Naylor wrote:

It seems "scalar" would be a bad choice since it already means
(confusingly) operating on the least significant element of a vector.
I'm thinking of *_has and *_has_le, matching the already existing in
the earlier patch *_has_zero.

That seems reasonable to me.

Okay, done that way, also in v9:
- a convenience macro in the test suite which is handy now and can be
used for 32-bit element tests if we like
- more tests
- pgindent and some additional comment smithing
- split out the json piece for a later commit
- For the following comment, pgindent will put spaced operands on a
separate line which is not great for readability. and our other
reference to the Stanford bithacks page keeps the in-page link, and I
see no reason to exclude it -- if it goes missing, the whole page will
still load. So I put back those two details.

+        * To find bytes <= c, we can use bitwise operations to find
bytes < c+1,
+        * but it only works if c+1 <= 128 and if the highest bit in v
is not set.
+        * Adapted from
+        * https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord

I think I'll go ahead and commit 0001 in a couple days pending further comments.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v9-0002-Speed-up-json_lex_string-via-vector-operations.patchtext/x-patch; charset=US-ASCII; name=v9-0002-Speed-up-json_lex_string-via-vector-operations.patchDownload

From 606c14de59a68ed88bff75f7250c37ad082fbd9f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 25 Aug 2022 13:32:28 +0700
Subject: [PATCH v9 2/2] Speed up json_lex_string via vector operations

TODO: tests
---
 src/common/jsonapi.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..87e1d0b192 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
 
 #include "common/jsonapi.h"
 #include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
 
 #ifndef FRONTEND
 #include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
-			char	   *p;
+			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
@@ -853,7 +854,13 @@ json_lex_string(JsonLexContext *lex)
 			 * Skip to the first byte that requires special handling, so we
 			 * can batch calls to appendBinaryStringInfo.
 			 */
-			for (p = s; p < end; p++)
+			while (p < end - sizeof(Vector8) &&
+				   !pg_lfind8('\\', (uint8 *) p, sizeof(Vector8)) &&
+				   !pg_lfind8('"', (uint8 *) p, sizeof(Vector8)) &&
+				   !pg_lfind8_le(0x1F, (uint8 *) p, sizeof(Vector8)))
+				p += sizeof(Vector8);
+
+			for (; p < end; p++)
 			{
 				if (*p == '\\' || *p == '"')
 					break;
-- 
2.36.1

v9-0001-Add-optimized-functions-for-linear-search-within-.patchtext/x-patch; charset=US-ASCII; name=v9-0001-Add-optimized-functions-for-linear-search-within-.patchDownload

From fe7d8f2471e2e0b37ccdc1a0a2d7fc5bb93af7d9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathandbossart@gmail.com>
Date: Sat, 20 Aug 2022 21:14:01 -0700
Subject: [PATCH v9 1/2] Add optimized functions for linear search within byte
 arrays

In similar vein to b6ef167564, add pg_lfind8() and pg_lfind8_le()
to search for bytes equal or less-than-or-equal to a given byte,
respectively. To abstract away platform details, add helper functions
and typedefs to simd.h.

John Naylor and Nathan Bossart, per suggestion from Andres Freund

Discussion: https://www.postgresql.org/message-id/CAFBsxsGzaaGLF%3DNuq61iRXTyspbO9rOjhSqFN%3DV6ozzmta5mXg%40mail.gmail.com
---
 src/include/port/pg_lfind.h                   |  68 +++++++-
 src/include/port/simd.h                       | 154 ++++++++++++++++++
 .../test_lfind/expected/test_lfind.out        |  18 +-
 .../modules/test_lfind/sql/test_lfind.sql     |   4 +-
 .../modules/test_lfind/test_lfind--1.0.sql    |  10 +-
 src/test/modules/test_lfind/test_lfind.c      | 100 +++++++++++-
 6 files changed, 345 insertions(+), 9 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index fb125977b2..a4e13dffec 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -1,7 +1,8 @@
 /*-------------------------------------------------------------------------
  *
  * pg_lfind.h
- *	  Optimized linear search routines.
+ *	  Optimized linear search routines using SIMD intrinsics where
+ *	  available.
  *
  * Copyright (c) 2022, PostgreSQL Global Development Group
  *
@@ -15,6 +16,70 @@
 
 #include "port/simd.h"
 
+/*
+ * pg_lfind8
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		vector8_load(&chunk, &base[i]);
+		if (vector8_has(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+/*
+ * pg_lfind8_le
+ *
+ * Return true if there is an element in 'base' that is less than or equal to
+ * 'key', otherwise return false.
+ */
+static inline bool
+pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
+{
+	uint32		i;
+
+	/* round down to multiple of vector length */
+	uint32		tail_idx = nelem & ~(sizeof(Vector8) - 1);
+	Vector8		chunk;
+
+	for (i = 0; i < tail_idx; i += sizeof(Vector8))
+	{
+		vector8_load(&chunk, &base[i]);
+		if (vector8_has_le(chunk, key))
+			return true;
+	}
+
+	/* Process the remaining elements one at a time. */
+	for (; i < nelem; i++)
+	{
+		if (base[i] <= key)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * pg_lfind32
  *
@@ -26,7 +91,6 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-	/* Use SIMD intrinsics where available. */
 #ifdef USE_SSE2
 
 	/*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index a571e79f57..56df989094 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -13,6 +13,12 @@
 #ifndef SIMD_H
 #define SIMD_H
 
+/*
+ * Note: VectorN in this file refers to a register where the element operands
+ * are N bits wide. The vector width is platform-specific, so users that care
+ * about that will need to inspect "sizeof(VectorN)".
+ */
+
 /*
  * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
  * that compilers targeting this architecture understand SSE2 intrinsics.
@@ -25,6 +31,154 @@
 #if (defined(__x86_64__) || defined(_M_AMD64))
 #include <emmintrin.h>
 #define USE_SSE2
+typedef __m128i Vector8;
+
+#else
+/*
+ * If no SIMD instructions are available, we can in some cases emulate vector
+ * operations using bitwise operations on unsigned integers.
+ */
+typedef uint64 Vector8;
 #endif
 
+
+static inline void vector8_load(Vector8 *v, const uint8 *s);
+static inline Vector8 vector8_broadcast(const uint8 c);
+static inline bool vector8_has_zero(const Vector8 v);
+static inline bool vector8_has(const Vector8 v, const uint8 c);
+static inline bool vector8_has_le(const Vector8 v, const uint8 c);
+
+
+/*
+ * Functions for loading a chunk of memory into a vector.
+ */
+
+static inline void
+vector8_load(Vector8 *v, const uint8 *s)
+{
+#ifdef USE_SSE2
+	*v = _mm_loadu_si128((const __m128i *) s);
+#else
+	memcpy(v, s, sizeof(Vector8));
+#endif
+}
+
+
+/*
+ * Functions for creating a vector with all elements set to the same value.
+ */
+
+static inline Vector8
+vector8_broadcast(const uint8 c)
+{
+#ifdef USE_SSE2
+	return _mm_set1_epi8(c);
+#else
+	return ~UINT64CONST(0) / 0xFF * c;
+#endif
+}
+
+
+/*
+ * Functions for comparing vector elements to a given value.
+ */
+
+static inline bool
+vector8_has_zero(const Vector8 v)
+{
+#ifdef USE_SSE2
+	return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
+#else
+	return vector8_has_le(v, 0);
+#endif
+}
+
+static inline bool
+vector8_has(const Vector8 v, const uint8 c)
+{
+	bool		result;
+
+	/* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+	bool		assert_result = false;
+
+	for (int i = 0; i < sizeof(Vector8); i++)
+	{
+		if (((const uint8 *) &v)[i] == c)
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+	result = _mm_movemask_epi8(_mm_cmpeq_epi8(v, vector8_broadcast(c)));
+#else
+	/* any bytes in v equal to c will evaluate to zero via XOR */
+	result = vector8_has_zero(v ^ vector8_broadcast(c));
+#endif
+
+	Assert(assert_result == result);
+	return result;
+}
+
+static inline bool
+vector8_has_le(const Vector8 v, const uint8 c)
+{
+	bool		result = false;
+#ifdef USE_SSE2
+	__m128i		sub;
+#endif
+
+	/* pre-compute the result for assert checking */
+#ifdef USE_ASSERT_CHECKING
+	bool		assert_result = false;
+
+	for (int i = 0; i < sizeof(Vector8); i++)
+	{
+		if (((const uint8 *) &v)[i] <= c)
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif							/* USE_ASSERT_CHECKING */
+
+#ifdef USE_SSE2
+
+	/*
+	 * Use saturating subtraction to find bytes <= c, which will present as
+	 * NUL bytes in 'sub'.
+	 */
+	sub = _mm_subs_epu8(v, vector8_broadcast(c));
+	result = vector8_has_zero(sub);
+#else
+
+	/*
+	 * To find bytes <= c, we can use bitwise operations to find bytes < c+1,
+	 * but it only works if c+1 <= 128 and if the highest bit in v is not set.
+	 * Adapted from
+	 * https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
+	 */
+	if ((int64) v >= 0 && c < 0x80)
+		result = (v - vector8_broadcast(c + 1)) & ~v & vector8_broadcast(0x80);
+	else
+	{
+		/* one byte at a time */
+		for (int i = 0; i < sizeof(Vector8); i++)
+		{
+			if (((const uint8 *) &v)[i] <= c)
+			{
+				result = true;
+				break;
+			}
+		}
+	}
+#endif
+
+	Assert(assert_result == result);
+	return result;
+}
+
 #endif							/* SIMD_H */
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 222c8fd7ff..1d4b14e703 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -4,9 +4,21 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
- test_lfind 
-------------
+SELECT test_lfind8();
+ test_lfind8 
+-------------
+ 
+(1 row)
+
+SELECT test_lfind8_le();
+ test_lfind8_le 
+----------------
+ 
+(1 row)
+
+SELECT test_lfind32();
+ test_lfind32 
+--------------
  
 (1 row)
 
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 899f1dd49b..766c640831 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -5,4 +5,6 @@ CREATE EXTENSION test_lfind;
 -- the operations complete without crashing or hanging and that none of their
 -- internal sanity tests fail.
 --
-SELECT test_lfind();
+SELECT test_lfind8();
+SELECT test_lfind8_le();
+SELECT test_lfind32();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index d82ab0567e..81801926ae 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -3,6 +3,14 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION test_lfind" to load this file. \quit
 
-CREATE FUNCTION test_lfind()
+CREATE FUNCTION test_lfind32()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8()
+	RETURNS pg_catalog.void
+	AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lfind8_le()
 	RETURNS pg_catalog.void
 	AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index a000746fb8..e0c905aad3 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -16,12 +16,108 @@
 #include "fmgr.h"
 #include "port/pg_lfind.h"
 
+/*
+ * Convenience macros for testing both vector and scalar operations. The 2x
+ * factor is to make sure iteration works
+ */
+#define LEN_NO_TAIL(vectortype) (2 * sizeof(vectortype))
+#define LEN_WITH_TAIL(vectortype) (LEN_NO_TAIL(vectortype) + 3)
+
 PG_MODULE_MAGIC;
 
-PG_FUNCTION_INFO_V1(test_lfind);
+/* workhorse for test_lfind8 */
+static void
+test_lfind8_internal(uint8 key)
+{
+	uint8		charbuf[LEN_WITH_TAIL(Vector8)];
+	const int	len_no_tail = LEN_NO_TAIL(Vector8);
+	const int	len_with_tail = LEN_WITH_TAIL(Vector8);
+
+	memset(charbuf, 0xFF, len_with_tail);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[len_with_tail - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, len_with_tail);
+	/* search with vector operations */
+	charbuf[len_no_tail - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8);
+Datum
+test_lfind8(PG_FUNCTION_ARGS)
+{
+	test_lfind8_internal(0);
+	test_lfind8_internal(1);
+	test_lfind8_internal(0x7F);
+	test_lfind8_internal(0x80);
+	test_lfind8_internal(0x81);
+	test_lfind8_internal(0xFD);
+	test_lfind8_internal(0xFE);
+	test_lfind8_internal(0xFF);
+
+	PG_RETURN_VOID();
+}
+
+/* workhorse for test_lfind8_le */
+static void
+test_lfind8_le_internal(uint8 key)
+{
+	uint8		charbuf[LEN_WITH_TAIL(Vector8)];
+	const int	len_no_tail = LEN_NO_TAIL(Vector8);
+	const int	len_with_tail = LEN_WITH_TAIL(Vector8);
+
+	memset(charbuf, 0xFF, len_with_tail);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[len_with_tail - 1] = key;
+	if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8_le(key, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, len_with_tail);
+	/* search with vector operations */
+	charbuf[len_no_tail - 1] = key;
+	if (key > 0x00 && pg_lfind8_le(key - 1, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8_le() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8_le(key, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && !pg_lfind8_le(key + 1, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8_le() did not find existing element <= '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lfind8_le);
+Datum
+test_lfind8_le(PG_FUNCTION_ARGS)
+{
+	test_lfind8_le_internal(0);
+	test_lfind8_le_internal(1);
+	test_lfind8_le_internal(0x7F);
+	test_lfind8_le_internal(0x80);
+	test_lfind8_le_internal(0x81);
+	test_lfind8_le_internal(0xFD);
+	test_lfind8_le_internal(0xFE);
+	test_lfind8_le_internal(0xFF);
+
+	PG_RETURN_VOID();
+}
 
+PG_FUNCTION_INFO_V1(test_lfind32);
 Datum
-test_lfind(PG_FUNCTION_ARGS)
+test_lfind32(PG_FUNCTION_ARGS)
 {
 #define TEST_ARRAY_SIZE 135
 	uint32		test_array[TEST_ARRAY_SIZE] = {0};
-- 
2.36.1

#31

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: John Naylor (#30)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Thu, Aug 25, 2022 at 01:35:45PM +0700, John Naylor wrote:

- For the following comment, pgindent will put spaced operands on a
separate line which is not great for readability. and our other
reference to the Stanford bithacks page keeps the in-page link, and I
see no reason to exclude it -- if it goes missing, the whole page will
still load. So I put back those two details.
+        * To find bytes <= c, we can use bitwise operations to find
bytes < c+1,
+        * but it only works if c+1 <= 128 and if the highest bit in v
is not set.
+        * Adapted from
+        * https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord

This was just unnecessary fiddling on my part, sorry about that.

+test_lfind8_internal(uint8 key)
+{
+	uint8		charbuf[LEN_WITH_TAIL(Vector8)];
+	const int	len_no_tail = LEN_NO_TAIL(Vector8);
+	const int	len_with_tail = LEN_WITH_TAIL(Vector8);
+
+	memset(charbuf, 0xFF, len_with_tail);
+	/* search tail to test one-byte-at-a-time path */
+	charbuf[len_with_tail - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_with_tail))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+
+	memset(charbuf, 0xFF, len_with_tail);
+	/* search with vector operations */
+	charbuf[len_no_tail - 1] = key;
+	if (key > 0x00 && pg_lfind8(key - 1, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key - 1);
+	if (key < 0xFF && !pg_lfind8(key, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8() did not find existing element <= '0x%x'", key);
+	if (key < 0xFE && pg_lfind8(key + 1, charbuf, len_no_tail))
+		elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}

nitpick: Shouldn't the elog() calls use "==" instead of "<=" for this one?

Otherwise, 0001 looks good to me.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#32

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Nathan Bossart (#31)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Fri, Aug 26, 2022 at 10:14 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

+test_lfind8_internal(uint8 key)

[...]

+             elog(ERROR, "pg_lfind8() found nonexistent element <= '0x%x'", key + 1);
+}
nitpick: Shouldn't the elog() calls use "==" instead of "<=" for this one?

Good catch, will fix.

--
John Naylor
EDB: http://www.enterprisedb.com

#33

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: John Naylor (#30)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Thu, Aug 25, 2022 at 1:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

I think I'll go ahead and commit 0001 in a couple days pending further comments.

Pushed with Nathan's correction and some cosmetic rearrangements.

--
John Naylor
EDB: http://www.enterprisedb.com

#34

Jelte Fennema

Jelte.Fennema@microsoft.com

over 3 years ago

In reply to: Zhihong Yu (#2)

Re: [PATCH] Optimize json_lex_string by batching character copying

I wonder why copyable_characters_length is not reset after flushing.

It's not necessary because of the break statement right after. But this part
of the code was refactored away in John's improved patch that's actually
merged:
https://github.com/postgres/postgres/commit/3838fa269c15706df2b85ce2d6af8aacd5611655

#35

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: John Naylor (#26)

1 attachment(s)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Tue, Aug 23, 2022 at 1:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

LGTM overall. My plan is to split out the json piece, adding tests for
that, and commit the infrastructure for it fairly soon.

Here's the final piece. I debated how many tests to add and decided it
was probably enough to add one each for checking quotes and
backslashes in the fast path. There is one cosmetic change in the
code: Before, the vectorized less-equal check compared to 0x1F, but
the byte-wise path did so with < 32. I made them both "less-equal 31"
for consistency. I'll commit this by the end of the week unless anyone
has a better idea about testing.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v10-0001-Optimize-JSON-lexing-of-long-strings.patchtext/x-patch; charset=US-ASCII; name=v10-0001-Optimize-JSON-lexing-of-long-strings.patchDownload

From f1159dcc2044edb107e0dfeae5e8f3c7feb10cd2 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 31 Aug 2022 10:39:17 +0700
Subject: [PATCH v10] Optimize JSON lexing of long strings

Use optimized linear search when looking ahead for end quotes,
backslashes, and non-printable characters. This results in nearly 40%
faster JSON parsing on x86-64 when most values are long strings, and
all platforms should see some improvement.

Reviewed by Andres Freund and Nathan Bossart
Discussion: https://www.postgresql.org/message-id/CAFBsxsGhaR2KQ5eisaK%3D6Vm60t%3DaxhD8Ckj1qFoCH1pktZi%2B2w%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA%40mail.gmail.com
---
 src/common/jsonapi.c               | 13 ++++++++++---
 src/test/regress/expected/json.out | 13 +++++++++++++
 src/test/regress/sql/json.sql      |  5 +++++
 3 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index fefd1d24d9..cfc025749c 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -19,6 +19,7 @@
 
 #include "common/jsonapi.h"
 #include "mb/pg_wchar.h"
+#include "port/pg_lfind.h"
 
 #ifndef FRONTEND
 #include "miscadmin.h"
@@ -844,7 +845,7 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
-			char	   *p;
+			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
@@ -853,11 +854,17 @@ json_lex_string(JsonLexContext *lex)
 			 * Skip to the first byte that requires special handling, so we
 			 * can batch calls to appendBinaryStringInfo.
 			 */
-			for (p = s; p < end; p++)
+			while (p < end - sizeof(Vector8) &&
+				   !pg_lfind8('\\', (uint8 *) p, sizeof(Vector8)) &&
+				   !pg_lfind8('"', (uint8 *) p, sizeof(Vector8)) &&
+				   !pg_lfind8_le(31, (uint8 *) p, sizeof(Vector8)))
+				p += sizeof(Vector8);
+
+			for (; p < end; p++)
 			{
 				if (*p == '\\' || *p == '"')
 					break;
-				else if ((unsigned char) *p < 32)
+				else if ((unsigned char) *p <= 31)
 				{
 					/* Per RFC4627, these characters MUST be escaped. */
 					/*
diff --git a/src/test/regress/expected/json.out b/src/test/regress/expected/json.out
index e9d6e9faf2..cb181226e9 100644
--- a/src/test/regress/expected/json.out
+++ b/src/test/regress/expected/json.out
@@ -42,6 +42,19 @@ LINE 1: SELECT '"\v"'::json;
                ^
 DETAIL:  Escape sequence "\v" is invalid.
 CONTEXT:  JSON data, line 1: "\v...
+-- Check fast path for longer strings (at least 16 bytes long)
+SELECT ('"'||repeat('.', 12)||'abc"')::json; -- OK
+       json        
+-------------------
+ "............abc"
+(1 row)
+
+SELECT ('"'||repeat('.', 12)||'abc\n"')::json; -- OK, legal escapes
+        json         
+---------------------
+ "............abc\n"
+(1 row)
+
 -- see json_encoding test for input with unicode escapes
 -- Numbers.
 SELECT '1'::json;				-- OK
diff --git a/src/test/regress/sql/json.sql b/src/test/regress/sql/json.sql
index e366c6f51b..589e0cea36 100644
--- a/src/test/regress/sql/json.sql
+++ b/src/test/regress/sql/json.sql
@@ -7,6 +7,11 @@ SELECT '"abc
 def"'::json;					-- ERROR, unescaped newline in string constant
 SELECT '"\n\"\\"'::json;		-- OK, legal escapes
 SELECT '"\v"'::json;			-- ERROR, not a valid JSON escape
+
+-- Check fast path for longer strings (at least 16 bytes long)
+SELECT ('"'||repeat('.', 12)||'abc"')::json; -- OK
+SELECT ('"'||repeat('.', 12)||'abc\n"')::json; -- OK, legal escapes
+
 -- see json_encoding test for input with unicode escapes
 
 -- Numbers.
-- 
2.36.1

#36

Nathan Bossart

nathandbossart@gmail.com

over 3 years ago

In reply to: John Naylor (#35)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Wed, Aug 31, 2022 at 10:50:39AM +0700, John Naylor wrote:

Here's the final piece. I debated how many tests to add and decided it
was probably enough to add one each for checking quotes and
backslashes in the fast path. There is one cosmetic change in the
code: Before, the vectorized less-equal check compared to 0x1F, but
the byte-wise path did so with < 32. I made them both "less-equal 31"
for consistency. I'll commit this by the end of the week unless anyone
has a better idea about testing.

LGTM

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#37

John Naylor

john.naylor@enterprisedb.com

over 3 years ago

In reply to: Nathan Bossart (#36)

Re: [PATCH] Optimize json_lex_string by batching character copying

On Wed, Aug 31, 2022 at 11:17 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Wed, Aug 31, 2022 at 10:50:39AM +0700, John Naylor wrote:

Here's the final piece. I debated how many tests to add and decided it
was probably enough to add one each for checking quotes and
backslashes in the fast path. There is one cosmetic change in the
code: Before, the vectorized less-equal check compared to 0x1F, but
the byte-wise path did so with < 32. I made them both "less-equal 31"
for consistency. I'll commit this by the end of the week unless anyone
has a better idea about testing.

LGTM

Pushed, thanks for looking!

--
John Naylor
EDB: http://www.enterprisedb.com