[PATCH] Optimize json_lex_string by batching character copying

Started by Jelte Fennema · almost 4 years ago · 37 messages · pgsql-hackers
#1 Jelte Fennema
Jelte.Fennema@microsoft.com

When parsing JSON, strings need to be converted from the JSON string
format to a C-style string. A simple copy of the buffer does not suffice
because of the various escape sequences that JSON supports. Because
of this, our JSON parser wrote characters into the C-style string buffer
one at a time.

However, writing one character at a time is only necessary for escape
sequences, which map to another character. This patch changes the
behaviour for non-escaped characters: these are now copied in batches
instead of one character at a time.
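
In outline, the change works like this (a minimal standalone sketch with illustrative names; the real patch operates on the lexer state and appends to lex->strval with appendBinaryStringInfo):

#include <string.h>

/*
 * Minimal sketch of the batching idea: instead of appending unescaped
 * characters one at a time, count the run of plain characters and
 * flush it with a single copy when an escape (or the end) is reached.
 * Escape decoding itself is elided here.
 */
static char *
copy_unescaped_batched(const char *s, const char *end, char *out)
{
	int			copyable_characters_length = 0;

	while (s < end)
	{
		if (*s == '\\')
		{
			/* flush the batch accumulated so far */
			if (copyable_characters_length)
			{
				memcpy(out, s - copyable_characters_length,
					   copyable_characters_length);
				out += copyable_characters_length;
				copyable_characters_length = 0;
			}
			s += 2;				/* skip the escape; the real code decodes it */
		}
		else
		{
			/* plain character: just extend the pending batch */
			copyable_characters_length++;
			s++;
		}
	}

	/* flush any trailing batch */
	if (copyable_characters_length)
		memcpy(out, s - copyable_characters_length,
			   copyable_characters_length);

	return out + copyable_characters_length;
}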

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly large JSONB values of ~15kB. The JSONB values
are JSON objects with a single level. They contain a few small keys and
values, but one very big value that is a stringified JSON blob. This
JSON blob therefore contains a relatively high number of escape characters,
needed to escape all the " characters. This change improves performance for
this workload on my machine by ~18% (going from 1m24s to 1m09s).

@Andres, there was indeed some low-hanging fruit.
@John Naylor, SSE2 indeed sounds like another nice improvement. I'll leave
that to you.

Attachments:

0001-Optimize-json_lex_string-by-batching-character-copie.patch (application/octet-stream) +24-3
#2 Zhihong Yu
zyu@yugabyte.com
In reply to: Jelte Fennema (#1)
Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,
Looking at the patch,

+           if (copyable_characters_length)
+           {
+               /* flush copyable characters */
+               appendBinaryStringInfo(
+                                      lex->strval,
+                                      s - copyable_characters_length,
+                                      copyable_characters_length);
+
+           }
            break;

I wonder why copyable_characters_length is not reset after flushing.

Cheers

#3 Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Zhihong Yu (#2)
Re: [PATCH] Optimize json_lex_string by batching character copying
+           if (copyable_characters_length)
+           {
+               /* flush copyable characters */
+               appendBinaryStringInfo(
+                                      lex->strval,
+                                      s - copyable_characters_length,
+                                      copyable_characters_length);
+
+           }
            break;

I wonder why copyable_characters_length is not reset after flushing.

It breaks from the loop right after. So copyable_characters_length isn't used
again and thus resetting is not necessary. But I agree this could use a comment.

#4 Andres Freund
andres@anarazel.de
In reply to: Jelte Fennema (#1)
Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly large JSONB values of ~15kB.

This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
on top of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.

It's useful for things like this to include a way for others to use the same
benchmark...

I tried your patch with:

DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;

SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.

I think your patch doesn't quite go far enough - we still end up looping for
each character, and have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop that scans ahead to where
the string ends isn't faster. That then obviously could be SIMDified.

Separately, it seems pretty awful efficiency / code density wise to have the
NULL checks for ->strval all over. Might be worth forcing json_lex() and
json_lex_string() to be inlined, with a constant parameter deciding whether
->strval is expected. That'd likely be enough to get the compiler to
specialize the code for us.
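
As a standalone illustration of that specialization pattern (names are invented for the example, not the jsonapi.c code):

#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the specialization idea.  "need_value" is constant at each
 * call site of the inlined helper, so the compiler can fold the branch
 * away and emit two copies: one that builds the output, one that only
 * validates and advances, with no per-character NULL checks in either.
 */
static inline size_t
lex_string_internal(const char *s, const char *end, char *out,
					const bool need_value)
{
	size_t		n = 0;

	while (s < end && *s != '"' && *s != '\\')
	{
		if (need_value)			/* constant-folded per call site */
			out[n] = *s;
		n++;
		s++;
	}
	return n;
}

static size_t
lex_string_with_value(const char *s, const char *end, char *out)
{
	return lex_string_internal(s, end, out, true);
}

static size_t
lex_string_skip_value(const char *s, const char *end)
{
	return lex_string_internal(s, end, NULL, false);
}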

Might also be worth maintaining ->strval using appendBinaryStringInfoNT().

Greetings,

Andres Freund

#5 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-06-24 17:18:10 -0700, Andres Freund wrote:

Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.

I think your patch doesn't quite go far enough - we still end up looping for
each character, and have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop that scans ahead to where
the string ends isn't faster. That then obviously could be SIMDified.

A naive implementation (attached) of that gets me down to 706ms.

Greetings,

Andres Freund

Attachments:

json-lex-string-lookahead-speed.diff (text/x-diff) +16-1
#6 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#5)
Re: [PATCH] Optimize json_lex_string by batching character copying

On Sat, Jun 25, 2022 at 8:05 AM Andres Freund <andres@anarazel.de> wrote:

I tried your patch with:

DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;

SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

Which the patch improves from 846ms to 754ms (best of three). A bit smaller
than your improvement, but still nice.

I think your patch doesn't quite go far enough - we still end up looping for
each character, and have the added complication of needing to flush the
"buffer". I'd be surprised if a "dedicated" loop that scans ahead to where
the string ends isn't faster. That then obviously could be SIMDified.

A naive implementation (attached) of that gets me down to 706ms.

Taking this a step further, I modified json_lex and json_lex_string to
use a const end pointer instead of maintaining the length (0001). The
observed speedup is small enough that it might not be real, but the
code is simpler this way, and it makes 0002 and 0003 easier to reason
about. Then I modified your patch to do the same (0002). Hackish SSE2
support is in 0003.

To exercise the SIMD code a bit, I added a second test:

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 10) from pg_description pd
)
SELECT (select json_agg(row_to_json(long)) as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

SELECT 1 FROM long_json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

With this, I get (turbo disabled, min of 3):

short test:
master: 769ms
0001: 751ms
0002: 703ms
0003: 701ms

long test:
master: 939ms
0001: 883ms
0002: 537ms
0003: 439ms

I think 0001/2 are mostly in committable shape.

With 0003, I'd want to make the portability check a bit nicer and more
centralized. I'm thinking of modifying the CRC check to report that
the host cpu/compiler understands SSE4.2 x86 intrinsics, and then the
compile time SSE2 check can piggyback on top of that without a runtime
check. This is conceptually easy but a bit of work to not look like a
hack (which probably means the ARM CRC check should look more generic
somehow...). The regression tests will likely need some work as well.

Separately, it seems pretty awful efficiency / code density wise to have the
NULL checks for ->strval all over. Might be worth forcing json_lex() and
json_lex_string() to be inlined, with a constant parameter deciding whether
->strval is expected. That'd likely be enough to get the compiler to
specialize the code for us.

I had a look at this but it's a bit more invasive than I want to
devote time to at this point.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v3-0003-Use-vectorized-lookahead-in-json_lex_string-on-x8.patch (text/x-patch) +48-1
v3-0002-Build-json-strings-in-larger-chunks-during-lexing.patch (text/x-patch) +17-2
v3-0001-Simplify-json-lexing-state.patch (text/x-patch) +8-16
#7 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#6)
Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-07-06 12:10:20 +0700, John Naylor wrote:

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index eeedc0645a..ad4858c623 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -851,10 +851,26 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else if (lex->strval != NULL)
 		{
+			/* start lookahead at next byte */
+			char	   *p = s + 1;
+
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
-			appendStringInfoChar(lex->strval, *s);
+			while (p < end)
+			{
+				if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
+					break;
+				p++;
+			}
+
+			appendBinaryStringInfo(lex->strval, s, p - s);
+
+			/*
+			 * s will be incremented at the top of the loop, so set it to just
+			 * behind our lookahead position
+			 */
+			s = p - 1;
 		}
 	}

--
2.36.1

I think before committing something along those lines we should make the
relevant bits also be applicable when ->strval is NULL, as several functions
use that (notably json_in IIRC). Afaics we'd just need to move the strval
check to be around the appendBinaryStringInfo(). And it should simplify the
function, because some of the relevant code is duplicated outside as well...
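
Concretely, something like this (a sketch against the quoted hunk, not the committed change):

			/* lookahead runs regardless of whether we keep the string value */
			while (p < end)
			{
				if (*p == '\\' || *p == '"' || (unsigned char) *p < 32)
					break;
				p++;
			}

			/* only the copy itself depends on ->strval */
			if (lex->strval != NULL)
				appendBinaryStringInfo(lex->strval, s, p - s);

			/* s is incremented at the top of the outer loop */
			s = p - 1;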

Greetings,

Andres Freund

#8 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#7)
Re: [PATCH] Optimize json_lex_string by batching character copying

On Wed, Jul 6, 2022 at 12:18 PM Andres Freund <andres@anarazel.de> wrote:

I think before committing something along those lines we should make the
relevant bits also be applicable when ->strval is NULL, as several functions
use that (notably json_in IIRC). Afaics we'd just need to move the strval
check to be around the appendBinaryStringInfo().

That makes sense and is easy.

And it should simplify the
function, because some of the relevant code is duplicated outside as well...

Not sure how far to take this, but I put the returnable paths inside
the "other" path, so only backslash will go back to the top.

Both the above changes are split into a new 0003 patch for easier
review, but in the end will likely be squashed with 0002.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v4-0001-Simplify-json-lexing-state.patch (text/x-patch) +8-16
v4-0004-Use-vectorized-lookahead-in-json_lex_string-on-x8.patch (text/x-patch) +48-1
v4-0003-Use-lookahead-path-in-json-string-lexing-for-the-.patch (text/x-patch) +24-23
v4-0002-Build-json-strings-in-larger-chunks-during-lexing.patch (text/x-patch) +17-2
#9 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#8)
Re: [PATCH] Optimize json_lex_string by batching character copying

I've pushed 0001 (although the email seems to have been swallowed
again), and pending additional comments on 0002 and 0003 I'll squash
and push those next week. 0004 needs some thought on integrating with
symbols we discover during configure.

--
John Naylor
EDB: http://www.enterprisedb.com

#10 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#9)
Re: [PATCH] Optimize json_lex_string by batching character copying

On Fri, Jul 8, 2022 at 3:06 PM John Naylor <john.naylor@enterprisedb.com> wrote:

I've pushed 0001 (although the email seems to have been swallowed
again), and pending additional comments on 0002 and 0003 I'll squash
and push those next week.

This is done.

0004 needs some thought on integrating with
symbols we discover during configure.

Still needs thought.

--
John Naylor
EDB: http://www.enterprisedb.com

#11 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#8)
Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-07-06 15:58:44 +0700, John Naylor wrote:

From 82e13b6bebd85a152ededcfd75495c0c0f642354 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 6 Jul 2022 15:50:09 +0700
Subject: [PATCH v4 4/4] Use vectorized lookahead in json_lex_string on x86

---
src/common/jsonapi.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 48 insertions(+)

diff --git a/src/common/jsonapi.c b/src/common/jsonapi.c
index 81e176ad8d..44e8ed2b2f 100644
--- a/src/common/jsonapi.c
+++ b/src/common/jsonapi.c
@@ -24,6 +24,12 @@
 #include "miscadmin.h"
 #endif
 
+/*  WIP: put somewhere sensible and consider removing CRC from the names */
+#if (defined (__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
+#include <nmmintrin.h>
+#define USE_SSE2
+#endif
+
 /*
  * The context of the parser is maintained by the recursive descent
  * mechanism, but is passed explicitly to the error reporting routine
@@ -842,12 +848,54 @@ json_lex_string(JsonLexContext *lex)
 		}
 		else
 		{
+#ifdef USE_SSE2
+			__m128i		block,
+						has_backslash,
+						has_doublequote,
+						control,
+						has_control,
+						error_cum = _mm_setzero_si128();
+			const		__m128i backslash = _mm_set1_epi8('\\');
+			const		__m128i doublequote = _mm_set1_epi8('"');
+			const		__m128i max_control = _mm_set1_epi8(0x1F);
+#endif
 			/* start lookahead at current byte */
 			char	   *p = s;
 
 			if (hi_surrogate != -1)
 				return JSON_UNICODE_LOW_SURROGATE;
 
+#ifdef USE_SSE2
+			while (p < end - sizeof(__m128i))
+			{
+				block = _mm_loadu_si128((const __m128i *) p);
+
+				/* direct comparison to quotes and backslashes */
+				has_backslash = _mm_cmpeq_epi8(block, backslash);
+				has_doublequote = _mm_cmpeq_epi8(block, doublequote);
+
+				/*
+				 * use saturation arithmetic to check for <= highest control
+				 * char
+				 */
+				control = _mm_subs_epu8(block, max_control);
+				has_control = _mm_cmpeq_epi8(control, _mm_setzero_si128());
+
+				/*
+				 * set bits in error_cum where the corresponding lanes in has_*
+				 * are set
+				 */
+				error_cum = _mm_or_si128(error_cum, has_backslash);
+				error_cum = _mm_or_si128(error_cum, has_doublequote);
+				error_cum = _mm_or_si128(error_cum, has_control);
+
+				if (_mm_movemask_epi8(error_cum))
+					break;
+
+				p += sizeof(__m128i);
+			}
+#endif							/* USE_SSE2 */
+
 			while (p < end)
 			{
 				if (*p == '\\' || *p == '"')
-- 
2.36.1
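
To unpack the saturation trick in that hunk: _mm_subs_epu8 is a saturating unsigned subtract, so subtracting 0x1F maps every byte <= 0x1F to exactly zero, and the compare against zero then flags precisely the control-character lanes. A one-byte scalar analogue (illustrative, not part of the patch):

#include <stdbool.h>

/* scalar analogue of the vectorized control-character test */
static inline bool
is_control_char(unsigned char b)
{
	unsigned char diff = (b > 0x1F) ? (b - 0x1F) : 0;	/* saturating subtract */

	return diff == 0;			/* true iff b <= 0x1F */
}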

I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.

Greetings,

Andres Freund

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#11)
Re: [PATCH] Optimize json_lex_string by batching character copying

Andres Freund <andres@anarazel.de> writes:

I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.

IMO, it's pretty unreadable *now*, for lack of comments about what it's
doing and why.

regards, tom lane

#13 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#12)
Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

On 2022-07-11 11:53:26 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I wonder if we can't abstract this at least a bit better. If we go that route
a bit further, then add another arch, this code will be pretty much
unreadable.

IMO, it's pretty unreadable *now*, for lack of comments about what it's
doing and why.

Yea, that could at least be addressed by adding comments. But even with a
bunch of comments, it'd still be pretty hard to read once the events above
have happened (and they seem kind of inevitable).

I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that could
be useful.

Greetings,

Andres Freund

#14 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#13)
Re: [PATCH] Optimize json_lex_string by batching character copying

On Mon, Jul 11, 2022 at 11:07 PM Andres Freund <andres@anarazel.de> wrote:

I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that could
be useful.

In simple cases, we could possibly abstract the entire loop. With this
particular case, I imagine the most approachable way to write the loop
would be a bit more low-level:

while (p < end - VECTOR_WIDTH &&
       !vector_has_byte(p, '\\') &&
       !vector_has_byte(p, '"') &&
       vector_min_byte(p, 0x20))
    p += VECTOR_WIDTH;

I wonder if we'd lose a bit of efficiency here by not accumulating set bits
from the three conditions, but it's worth trying.
--
John Naylor
EDB: http://www.enterprisedb.com

#15 Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#4)
Re: [PATCH] Optimize json_lex_string by batching character copying

On 2022-06-24 Fr 20:18, Andres Freund wrote:

Hi,

On 2022-06-24 08:47:09 +0000, Jelte Fennema wrote:

To test performance of this change I used COPY BINARY from a JSONB table
into another, containing fairly large JSONB values of ~15kB.

This will have a lot of other costs included (DML is expensive). I'd suggest
storing the json in a text column and casting it to json[b], with a filter
on top of the json[b] result that cheaply filters it away. That should end up
spending nearly all the time somewhere around json parsing.

It's useful for things like this to include a way for others to use the same
benchmark...

I tried your patch with:

DROP TABLE IF EXISTS json_as_text;
CREATE TABLE json_as_text AS SELECT (SELECT json_agg(row_to_json(pd)) as t FROM pg_description pd) FROM generate_series(1, 100);
VACUUM FREEZE json_as_text;

SELECT 1 FROM json_as_text WHERE jsonb_typeof(t::jsonb) = 'not me';

I've been doing some other work related to json parsing and John
referred me to this. But it's actually not the best test for pure json
parsing - casting to jsonb involves some extra work besides pure
parsing. Instead I've been using this query with the same table, which
should be almost all json parsing:

select 1 from json_as_text where t::json is null;

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#16 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#14)
Re: [PATCH] Optimize json_lex_string by batching character copying

I wrote:

On Mon, Jul 11, 2022 at 11:07 PM Andres Freund <andres@anarazel.de> wrote:

I wonder if we can add a somewhat more general function for scanning until
some characters are found using SIMD? There's plenty other places that could
be useful.

In simple cases, we could possibly abstract the entire loop. With this particular case, I imagine the most approachable way to write the loop would be a bit more low-level:

while (p < end - VECTOR_WIDTH &&
       !vector_has_byte(p, '\\') &&
       !vector_has_byte(p, '"') &&
       vector_min_byte(p, 0x20))
    p += VECTOR_WIDTH;

I wonder if we'd lose a bit of efficiency here by not accumulating set bits from the three conditions, but it's worth trying.

The attached implements the above, more or less, using new pg_lfind8()
and pg_lfind8_le(), which in turn are based on helper functions that
act on a single vector. The pg_lfind* functions have regression tests,
but I haven't done the same for json yet. I went the extra step to use
bit-twiddling for non-SSE builds using uint64 as a "vector", which
still gives a pretty good boost (test below, min of 3):

master:
356ms

v5:
259ms

v5 disable SSE:
288ms

It still needs a bit of polishing and testing, but I think it's a good
workout for abstracting SIMD out of the way.
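
For reference, the uint64 fallback relies on classic SWAR idioms of roughly this shape (a hedged sketch; names and signatures are illustrative, not the actual pg_lfind8/pg_lfind8_le code):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define ONES	(~UINT64_C(0) / 0xFF)	/* 0x0101010101010101 */
#define HIGHS	(ONES * 0x80)			/* 0x8080808080808080 */

/* true if any byte in v is zero (Bit Twiddling Hacks) */
static inline bool
chunk_has_zero(uint64_t v)
{
	return ((v - ONES) & ~v & HIGHS) != 0;
}

/* true if any of the 8 bytes starting at p equals c */
static inline bool
chunk_has_byte(const char *p, unsigned char c)
{
	uint64_t	chunk;

	memcpy(&chunk, p, sizeof(chunk));	/* alignment-safe load */
	return chunk_has_zero(chunk ^ (ONES * c));
}

/* true if any of the 8 bytes starting at p is <= c (requires c < 0x80) */
static inline bool
chunk_has_byte_le(const char *p, unsigned char c)
{
	uint64_t	chunk;

	memcpy(&chunk, p, sizeof(chunk));
	return ((chunk - ONES * (c + 1)) & ~chunk & HIGHS) != 0;
}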

-------------
test:

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null; -- from Andrew upthread

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v5-json-lex-string-simd-ops.patch (text/x-patch) +322-10
#17 Ranier Vilela
ranier.vf@gmail.com
In reply to: John Naylor (#16)
Re: [PATCH] Optimize json_lex_string by batching character copying

Hi,

I ran this test.

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null;

head:
Time: 161,741ms

v5:
Time: 270,298 ms

ubuntu 64 bits
gcc 9.4.0

Am I missing something?

regards,
Ranier Vilela

#18 Ranier Vilela
ranier.vf@gmail.com
In reply to: Ranier Vilela (#17)
Re: [PATCH] Optimize json_lex_string by batching character copying

On Mon, Aug 15, 2022 at 3:34 PM Ranier Vilela <ranier.vf@gmail.com>
wrote:

Hi,

I ran this test.

DROP TABLE IF EXISTS long_json_as_text;
CREATE TABLE long_json_as_text AS
with long as (
select repeat(description, 11)
from pg_description
)
select (select json_agg(row_to_json(long))::text as t from long) from
generate_series(1, 100);
VACUUM FREEZE long_json_as_text;

select 1 from long_json_as_text where t::json is null;

head:
Time: 161,741ms

v5:
Time: 270,298 ms

Sorry, I replied too fast: the 270,298 ms result was with native memchr.

v5
Time: 208,689 ms

regards,
Ranier Vilela

#19 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#16)
Re: [PATCH] Optimize json_lex_string by batching character copying

On Mon, Aug 15, 2022 at 08:33:21PM +0700, John Naylor wrote:

The attached implements the above, more or less, using new pg_lfind8()
and pg_lfind8_le(), which in turn are based on helper functions that
act on a single vector. The pg_lfind* functions have regression tests,
but I haven't done the same for json yet. I went the extra step to use
bit-twiddling for non-SSE builds using uint64 as a "vector", which
still gives a pretty good boost (test below, min of 3):

Looks pretty reasonable to me.

+#ifdef USE_SSE2
+		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+		memcpy(&chunk, &base[i], sizeof(chunk));
+#endif							/* USE_SSE2 */
+#ifdef USE_SSE2
+		chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+		memcpy(&chunk, &base[i], sizeof(chunk));
+#endif							/* USE_SSE2 */

Perhaps there should be a macro or inline function for loading a vector so
that these USE_SSE2 checks can be abstracted away, too.
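
Something along these lines, perhaps (an illustrative sketch; the typedef and function name are assumptions, not the eventual API):

#include <stdint.h>
#include <string.h>

#ifdef USE_SSE2
#include <emmintrin.h>
typedef __m128i Vector8;
#else
typedef uint64_t Vector8;
#endif

/* load a chunk of memory into a vector, hiding the USE_SSE2 branch */
static inline void
vector8_load(Vector8 *v, const uint8_t *s)
{
#ifdef USE_SSE2
	*v = _mm_loadu_si128((const __m128i *) s);
#else
	memcpy(v, s, sizeof(Vector8));
#endif
}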

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#20 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#19)
Re: [PATCH] Optimize json_lex_string by batching character copying

On Tue, Aug 16, 2022 at 4:23 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Aug 15, 2022 at 08:33:21PM +0700, John Naylor wrote:

+#ifdef USE_SSE2
+             chunk = _mm_loadu_si128((const __m128i *) &base[i]);
+#else
+             memcpy(&chunk, &base[i], sizeof(chunk));
+#endif                                                       /* USE_SSE2 */

Perhaps there should be a macro or inline function for loading a vector so
that these USE_SSE2 checks can be abstracted away, too.

This is done. Also:
- a complete overhaul of the pg_lfind8* tests
- using a typedef for the vector type
- some refactoring, name changes and other cleanups (a few of these
could also be applied to the 32-byte element path, but that is left
for future work)

TODO: json-specific tests of the new path

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v6-json-lex-string-simd.patch (application/x-patch) +323-10
#21 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#20)
#22 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#21)
#23 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#22)
#24 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#23)
#25 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#24)
#26 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#25)
#27 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#26)
#28 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#27)
#29 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#28)
#30 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#29)
#31 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#30)
#32 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#31)
#33 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#30)
#34 Jelte Fennema
Jelte.Fennema@microsoft.com
In reply to: Zhihong Yu (#2)
#35 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#26)
#36 Nathan Bossart
nathandbossart@gmail.com
In reply to: John Naylor (#35)
#37 John Naylor
john.naylor@enterprisedb.com
In reply to: Nathan Bossart (#36)