WIP Incremental JSON Parser
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.
One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that, particularly if done in frontend code, this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.
In any case, I've had fun so it's not a total loss come what may :-)
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
json-incremental-parser-2023-12-26.patch
On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.
Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.
One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that, particularly if done in frontend code, this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.
It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
As far as addressing that other issue, we could address the issue
either by having the semantic routines free the memory if they don't
need it, or alternatively by having the parser itself free the memory
after invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical approach. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if they want or (2) that it's a palloc'd chunk
that the caller must not free or (3) that it's not a palloc'd chunk.
We can further document the memory context in which the chunk will be
allocated, if applicable, and when/if the parser will free it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2024-01-02 Tu 10:14, Robert Haas wrote:
On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.
Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.
One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that, particularly if done in frontend code, this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.
It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
As far as addressing that other issue, we could address the issue
either by having the semantic routines free the memory if they don't
need it, or alternatively by having the parser itself free the memory
after invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical approach. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if they want or (2) that it's a palloc'd chunk
that the caller must not free or (3) that it's not a palloc'd chunk.
We can further document the memory context in which the chunk will be
allocated, if applicable, and when/if the parser will free it.
Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.
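As a sketch of the interning idea (all names here are invented; a real
version would presumably use dynahash or simplehash rather than this toy
fixed-size table, and error handling is elided):

```c
#include <stdlib.h>
#include <string.h>

/* Toy string interner: each distinct field name is allocated once and
 * reused for the rest of the parse.  A fixed-size chained table is
 * enough to illustrate the idea. */
#define INTERN_BUCKETS 256

typedef struct InternEntry
{
    struct InternEntry *next;
    char                name[];     /* the interned string itself */
} InternEntry;

static InternEntry *intern_table[INTERN_BUCKETS];

static unsigned
intern_hash(const char *s)
{
    unsigned    h = 5381;

    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % INTERN_BUCKETS;
}

/* Returns a pointer that stays valid until end of parse; allocates
 * only the first time a given name is seen. */
static const char *
intern_field_name(const char *name)
{
    unsigned    h = intern_hash(name);
    InternEntry *e;

    for (e = intern_table[h]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->name;     /* repeat field: no new allocation */

    e = malloc(sizeof(InternEntry) + strlen(name) + 1);
    strcpy(e->name, name);
    e->next = intern_table[h];
    intern_table[h] = e;
    return e->name;
}
```

The returned pointer stays valid until the table is torn down, so a callback
only needs to duplicate it if it keeps the name beyond the parse.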
The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.
IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will have a cost but won't really save anything more than just
freeing the memory that isn't needed, and will probably be more expensive.
The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.
Hmm. This makes me wonder if you've measured how much actual leakage there is?
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2024-01-03 We 08:45, Robert Haas wrote:
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.
IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will have a cost but won't really save anything more than just
freeing the memory that isn't needed, and will probably be more expensive.
I don't quite follow.
Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.
I guess we can benchmark it and see what the performance impact of using
a hash table might be.
Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().
The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.
Hmm. This makes me wonder if you've measured how much actual leakage there is?
No I haven't. I have simply theorized about how much memory we might
consume if nothing were done by the callers to free the memory.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.
I guess we can benchmark it and see what the performance impact of using
a hash table might be.
Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().
Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.
I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2024-01-03 We 10:12, Robert Haas wrote:
On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.
I guess we can benchmark it and see what the performance impact of using
a hash table might be.
Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().
Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.
I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.
OK, cleaning up in the client code will be much simpler, so let's go
with that for now and revisit it later if necessary.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
You could also build a streaming incremental parser. That is, one that
outputs a path and a leaf value (where leaf values are scalar values,
`null`, `true`, `false`, numbers, and strings). Then if the caller is
doing something JSONPath-like then the caller can probably immediately
free almost all allocations and even terminate the parse early.
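As a sketch of what the caller's side of such a streaming interface might
look like (the names and path syntax are invented for illustration, not
taken from the patch):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical streaming interface: the parser calls one callback per
 * leaf value, passing a JSONPath-style location and the scalar's text.
 * Returning false abandons the parse early. */
typedef bool (*leaf_callback) (void *state, const char *path,
                               const char *leaf_value);

/* Example consumer: stop as soon as a leaf under a given path prefix
 * is seen, keeping no allocations around at all. */
typedef struct
{
    const char *prefix;
    int         matches;
} PrefixState;

static bool
match_prefix(void *vstate, const char *path, const char *leaf_value)
{
    PrefixState *state = vstate;

    (void) leaf_value;          /* a real consumer would use this */
    if (strncmp(path, state->prefix, strlen(state->prefix)) == 0)
    {
        state->matches++;
        return false;           /* tell the parser to stop */
    }
    return true;                /* keep parsing */
}
```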
Nico
--
On Wed, Jan 3, 2024 at 6:36 PM Nico Williams <nico@cryptonector.com> wrote:
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
You could also build a streaming incremental parser. That is, one that
outputs a path and a leaf value (where leaf values are scalar values,
`null`, `true`, `false`, numbers, and strings). Then if the caller is
doing something JSONPath-like then the caller can probably immediately
free almost all allocations and even terminate the parse early.
I think our current parser is event-based rather than this ... but it
seems like this could easily be built on top of it, if someone wanted
to.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.
The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.
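One way to get rid of such strlen() calls, sketched with invented names
(not necessarily what the attached diff does), is to record each
production's length once, at compile time:

```c
#include <stddef.h>
#include <string.h>

/* Sketch only: Prediction and PREDICTION are invented names, not the
 * patch's actual structures.  The idea is to pay for the length once,
 * at compile time, rather than calling strlen() on every push. */
typedef struct
{
    const char *prod;
    size_t      len;
} Prediction;

/* sizeof on a string literal counts the terminating NUL; subtract it. */
#define PREDICTION(s)  { (s), sizeof(s) - 1 }

static const Prediction scalar_prediction = PREDICTION("TSCALAR");

static size_t
prediction_len(const Prediction *p)
{
    return p->len;              /* no strlen() in the hot loop */
}
```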
Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?
+            case JSON_SEM_SCALAR_CALL:
+                {
+                    json_scalar_action sfunc = sem->scalar;
+
+                    if (sfunc != NULL)
+                        (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                }
Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?
One possible use would be in parsing large manifest files for
incremental backup.
I'm keeping an eye on this thread for OAuth, since the clients have to
parse JSON as well. Those responses tend to be smaller, though, so
you'd have to really be hurting for resources to need this.
--Jacob
Attachments:
no-strlen.diff.txt
On 2024-01-09 Tu 13:46, Jacob Champion wrote:
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.
The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.
Thanks for looking! I've been playing around with a similar idea, but
yours might be better.
Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?
Also a good suggestion. Will look and see. IIRC I had trouble with this bit.
+            case JSON_SEM_SCALAR_CALL:
+                {
+                    json_scalar_action sfunc = sem->scalar;
+
+                    if (sfunc != NULL)
+                        (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                }
Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?
Good point. In fact it might be responsible for the error I'm currently
trying to get to the bottom of.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-01-09 Tu 13:46, Jacob Champion wrote:
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.
The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.
Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?
+            case JSON_SEM_SCALAR_CALL:
+                {
+                    json_scalar_action sfunc = sem->scalar;
+
+                    if (sfunc != NULL)
+                        (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                }
Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?
One possible use would be in parsing large manifest files for
incremental backup.
I'm keeping an eye on this thread for OAuth, since the clients have to
parse JSON as well. Those responses tend to be smaller, though, so
you'd have to really be hurting for resources to need this.
I've incorporated your suggestion, and fixed the bug you identified.
The attached also adds support for incrementally parsing backup
manifests, and uses that in the three places we call the manifest parser.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v3-0001-Introduce-a-non-recursive-JSON-parser.patch
v3-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v3-0003-Use-incremental-parsing-of-backup-manifests.patch
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
======
[1]: https://commitfest.postgresql.org/46/4725/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4725
Kind Regards,
Peter Smith.
On 2024-01-22 Mo 01:29, Peter Smith wrote:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks.
Let's see if the attached does better.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v4-0003-Use-incremental-parsing-of-backup-manifests.patch
v4-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v4-0001-Introduce-a-non-recursive-JSON-parser.patch
On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:
On 2024-01-22 Mo 01:29, Peter Smith wrote:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks.
Let's see if the attached does better.
This time for sure! (Watch me pull a rabbit out of my hat!)
It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed it,
and I think the test will now pass.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v5-0001-Introduce-a-non-recursive-JSON-parser.patch
v5-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v5-0003-Use-incremental-parsing-of-backup-manifests.patch
On 2024-01-22 Mo 18:01, Andrew Dunstan wrote:
On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:
On 2024-01-22 Mo 01:29, Peter Smith wrote:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks.
Let's see if the attached does better.
This time for sure! (Watch me pull a rabbit out of my hat!)
It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed
it, and I think the test will now pass.
Fixed one problem but there are some others. I'm hoping this will
satisfy the cfbot.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v6-0001-Introduce-a-non-recursive-JSON-parser.patch
v6-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v6-0003-Use-incremental-parsing-of-backup-manifests.patch
On 2024-01-22 Mo 21:02, Andrew Dunstan wrote:
On 2024-01-22 Mo 18:01, Andrew Dunstan wrote:
On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:
On 2024-01-22 Mo 01:29, Peter Smith wrote:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks.
Let's see if the attached does better.
This time for sure! (Watch me pull a rabbit out of my hat!)
It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed
it, and I think the test will now pass.
Fixed one problem but there are some others. I'm hoping this will
satisfy the cfbot.
The cfbot reports an error on a 32 bit build
<https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:
# Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) # Failed test 'incremental backup from node1'
I have set up a Debian 12 EC2 instance following the recipe at
<https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>,
and ran what I think are the same tests dozens of times, but the failure
did not reappear in my setup. Unfortunately, the test doesn't show the
failing manifest or log the failing field, so trying to divine what
happened here is more than difficult.
Not sure how to address this.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Jan 24, 2024 at 10:04 AM Andrew Dunstan <andrew@dunslane.net> wrote:
The cfbot reports an error on a 32 bit build <https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:
# Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) # Failed test 'incremental backup from node1'
I have set up a Debian 12 EC2 instance following the recipe at
<https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>,
and ran what I think are the same tests dozens of times, but the failure
did not reappear in my setup. Unfortunately, the test doesn't show the
failing manifest or log the failing field, so trying to divine what
happened here is more than difficult.
Not sure how to address this.
Yeah, that's really odd. The backup size field is printed like this:
appendStringInfo(&buf, "\"Size\": %zu, ", size);
And parsed like this:
    size = strtoul(parse->size, &ep, 10);
    if (*ep)
        json_manifest_parse_failure(parse->context,
                                    "file size is not an integer");
I confess to bafflement -- how could the output of the first fail to
be parsed by the second? The manifest has to look pretty much valid in
order not to error out before it gets to this check, with just that
one field corrupted. But I don't understand how that could happen.
I agree that the error reporting could be better here, but it didn't
seem worth spending time on when I wrote the code. I figured the only
way we could end up with something like "Size": "Walrus" is if the
user was messing with us on purpose. Apparently that's not so, yet the
mechanism eludes me. Or maybe it's not some random string, but is
something like an empty string or a number with trailing garbage or a
number that's out of range. But I don't see how any of those things
can happen either.
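For what it's worth, one gap in the quoted check: strtoul() reports an
out-of-range value through errno (ERANGE) while still consuming every
digit, so the *ep test alone can't see overflow. A belt-and-braces sketch
(parse_size is an invented name, not the manifest code's function):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stricter variant of the quoted size check: reject empty input,
 * trailing garbage, and values too wide for unsigned long (easier to
 * hit on 32-bit, where unsigned long is often 32 bits). */
static bool
parse_size(const char *str, unsigned long *result)
{
    char       *ep;

    errno = 0;
    *result = strtoul(str, &ep, 10);
    if (ep == str || *ep != '\0')
        return false;           /* empty, or trailing garbage */
    if (errno == ERANGE)
        return false;           /* out of range for unsigned long */
    return true;
}
```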
Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2024-01-24 We 13:08, Robert Haas wrote:
Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.
Yeah, I thought earlier today I was on the track of an off by one error,
but I was apparently mistaken, so here's the same patch set with an
extra patch that logs a bunch of stuff, and might let us see what's
upsetting the cfbot.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v7-0001-Introduce-a-non-recursive-JSON-parser.patch
v7-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v7-0003-Use-incremental-parsing-of-backup-manifests.patch
v7-0004-add-logging-traces-to-see-if-we-can-find-out-what.patch
On 2024-01-26 Fr 12:15, Andrew Dunstan wrote:
On 2024-01-24 We 13:08, Robert Haas wrote:
Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.
Yeah, I thought earlier today I was on the track of an off by one
error, but I was apparently mistaken, so here's the same patch set
with an extra patch that logs a bunch of stuff, and might let us see
what's upsetting the cfbot.
Well, that didn't help a lot, but meanwhile the CFBot seems to have
decided in the last few days that it's now happy, so full steam ahead! ;-)
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com