WIP Incremental JSON Parser

Started by Andrew Dunstan · about 2 years ago · 102 messages
#1Andrew Dunstan
andrew@dunslane.net

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.

One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that particularly if done in frontend code this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.

In any case, I've had fun so it's not a total loss come what may :-)

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

json-incremental-parser-2023-12-26.patch (text/x-patch, +1411 -9)
#2Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#1)
Re: WIP Incremental JSON Parser

On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.

Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.

One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that particularly if done in frontend code this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.

It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.

As far as addressing that other issue, we could address the issue
either by having the semantic routines free the memory if they don't
need it, or alternatively by having the parser itself free the memory
after invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical approach. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if they want or (2) that it's a palloc'd chunk
that the caller must not free or (3) that it's not a palloc'd chunk.
We can further document the memory context in which the chunk will be
allocated, if applicable, and when/if the parser will free it.

--
Robert Haas
EDB: http://www.enterprisedb.com

#3Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#2)
Re: WIP Incremental JSON Parser

On 2024-01-02 Tu 10:14, Robert Haas wrote:

On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.

Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.

One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that particularly if done in frontend code this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.

It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.

As far as addressing that other issue, we could address the issue
either by having the semantic routines free the memory if they don't
need it, or alternatively by having the parser itself free the memory
after invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical approach. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if they want or (2) that it's a palloc'd chunk
that the caller must not free or (3) that it's not a palloc'd chunk.
We can further document the memory context in which the chunk will be
allocated, if applicable, and when/if the parser will free it.

Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.

The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#4Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#3)
Re: WIP Incremental JSON Parser

On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.

IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will cost but won't really save anything more than just
freeing the memory not needed, but will probably be more expensive.

The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.

Hmm. This makes me wonder if you've measured how much actual leakage there is?

--
Robert Haas
EDB: http://www.enterprisedb.com

#5Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#4)
Re: WIP Incremental JSON Parser

On 2024-01-03 We 08:45, Robert Haas wrote:

On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.

IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will cost but won't really save anything more than just
freeing the memory not needed, but will probably be more expensive.

I don't quite follow.

Say we have a document with an array of 1M objects, each with a field
called "color". As it stands we'll allocate space for that field name 1M
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.

I guess we can benchmark it and see what the performance impact of using
a hash table might be.

Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().

The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.

Hmm. This makes me wonder if you've measured how much actual leakage there is?

No I haven't. I have simply theorized about how much memory we might
consume if nothing were done by the callers to free the memory.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#6Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#5)
Re: WIP Incremental JSON Parser

On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Say we have a document with an array of 1M objects, each with a field
called "color". As it stands we'll allocate space for that field name 1M
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.

I guess we can benchmark it and see what the performance impact of using
a hash table might be.

Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().

Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.

I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#6)
Re: WIP Incremental JSON Parser

On 2024-01-03 We 10:12, Robert Haas wrote:

On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Say we have a document with an array of 1M objects, each with a field
called "color". As it stands we'll allocate space for that field name 1M
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.

I guess we can benchmark it and see what the performance impact of using
a hash table might be.

Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().

Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.

I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.

OK, cleaning up in the client code will be much simpler, so let's go
with that for now and revisit it later if necessary.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#8Nico Williams
nico@cryptonector.com
In reply to: Robert Haas (#2)
Re: WIP Incremental JSON Parser

On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:

It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.

You could also build a streaming incremental parser. That is, one that
outputs a path and a leaf value (where leaf values are scalar values,
`null`, `true`, `false`, numbers, and strings). Then if the caller is
doing something JSONPath-like then the caller can probably immediately
free almost all allocations and even terminate the parse early.

Nico
--

#9Robert Haas
robertmhaas@gmail.com
In reply to: Nico Williams (#8)
Re: WIP Incremental JSON Parser

On Wed, Jan 3, 2024 at 6:36 PM Nico Williams <nico@cryptonector.com> wrote:

On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:

It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.

You could also build a streaming incremental parser. That is, one that
outputs a path and a leaf value (where leaf values are scalar values,
`null`, `true`, `false`, numbers, and strings). Then if the caller is
doing something JSONPath-like then the caller can probably immediately
free almost all allocations and even terminate the parse early.

I think our current parser is event-based rather than this ... but it
seems like this could easily be built on top of it, if someone wanted
to.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#1)
Re: WIP Incremental JSON Parser

On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.

The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.

Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?

+               case JSON_SEM_SCALAR_CALL:
+                   {
+                       json_scalar_action sfunc = sem->scalar;
+
+                       if (sfunc != NULL)
+                           (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                   }

Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?

One possible use would be in parsing large manifest files for
incremental backup.

I'm keeping an eye on this thread for OAuth, since the clients have to
parse JSON as well. Those responses tend to be smaller, though, so
you'd have to really be hurting for resources to need this.

--Jacob

Attachments:

no-strlen.diff.txt (text/plain, +41 -31)
#11Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#10)
Re: WIP Incremental JSON Parser

On 2024-01-09 Tu 13:46, Jacob Champion wrote:

On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.

The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.

Thanks for looking! I've been playing around with a similar idea, but
yours might be better.

Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?

Also a good suggestion. Will look and see. IIRC I had trouble with this bit.

+               case JSON_SEM_SCALAR_CALL:
+                   {
+                       json_scalar_action sfunc = sem->scalar;
+
+                       if (sfunc != NULL)
+                           (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                   }

Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?

Good point. In fact it might be responsible for the error I'm currently
trying to get to the bottom of.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#12Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#10)
Re: WIP Incremental JSON Parser

On 2024-01-09 Tu 13:46, Jacob Champion wrote:

On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:

Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.

The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.

Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?

+               case JSON_SEM_SCALAR_CALL:
+                   {
+                       json_scalar_action sfunc = sem->scalar;
+
+                       if (sfunc != NULL)
+                           (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                   }

Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?

One possible use would be in parsing large manifest files for
incremental backup.

I'm keeping an eye on this thread for OAuth, since the clients have to
parse JSON as well. Those responses tend to be smaller, though, so
you'd have to really be hurting for resources to need this.

I've incorporated your suggestion, and fixed the bug you identified.

The attached also adds support for incrementally parsing backup
manifests, and uses that in the three places we call the manifest parser.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

v3-0001-Introduce-a-non-recursive-JSON-parser.patch (text/x-patch, +1455 -10)
v3-0002-Add-support-for-incrementally-parsing-backup-mani.patch (text/x-patch, +117 -9)
v3-0003-Use-incremental-parsing-of-backup-manifests.patch (text/x-patch, +175 -61)
#13Peter Smith
smithpb2250@gmail.com
In reply to: Andrew Dunstan (#12)
Re: WIP Incremental JSON Parser

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1]: https://commitfest.postgresql.org/46/4725/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4725

Kind Regards,
Peter Smith.

#14Andrew Dunstan
andrew@dunslane.net
In reply to: Peter Smith (#13)
Re: WIP Incremental JSON Parser

On 2024-01-22 Mo 01:29, Peter Smith wrote:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

Thanks.

Let's see if the attached does better.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

v4-0003-Use-incremental-parsing-of-backup-manifests.patch (text/x-patch, +175 -61)
v4-0002-Add-support-for-incrementally-parsing-backup-mani.patch (text/x-patch, +117 -9)
v4-0001-Introduce-a-non-recursive-JSON-parser.patch (text/x-patch, +1485 -10)
#15Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#14)
Re: WIP Incremental JSON Parser

On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:

On 2024-01-22 Mo 01:29, Peter Smith wrote:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

Thanks.

Let's see if the attached does better.

This time for sure! (Watch me pull a rabbit out of my hat!)

It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed it,
and I think the test will now pass.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

v5-0001-Introduce-a-non-recursive-JSON-parser.patch (text/x-patch, +1486 -10)
v5-0002-Add-support-for-incrementally-parsing-backup-mani.patch (text/x-patch, +117 -9)
v5-0003-Use-incremental-parsing-of-backup-manifests.patch (text/x-patch, +175 -61)
#16Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#15)
Re: WIP Incremental JSON Parser

On 2024-01-22 Mo 18:01, Andrew Dunstan wrote:

On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:

On 2024-01-22 Mo 01:29, Peter Smith wrote:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

Thanks.

Let's see if the attached does better.

This time for sure! (Watch me pull a rabbit out of my hat!)

It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed
it, and I think the test will now pass.

Fixed one problem but there are some others. I'm hoping this will
satisfy the cfbot.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

v6-0001-Introduce-a-non-recursive-JSON-parser.patch (text/x-patch, +1486 -10)
v6-0002-Add-support-for-incrementally-parsing-backup-mani.patch (text/x-patch, +117 -9)
v6-0003-Use-incremental-parsing-of-backup-manifests.patch (text/x-patch, +177 -61)
#17Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#16)
Re: WIP Incremental JSON Parser

On 2024-01-22 Mo 21:02, Andrew Dunstan wrote:

On 2024-01-22 Mo 18:01, Andrew Dunstan wrote:

On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:

On 2024-01-22 Mo 01:29, Peter Smith wrote:

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

Thanks.

Let's see if the attached does better.

This time for sure! (Watch me pull a rabbit out of my hat!)

It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed
it, and I think the test will now pass.

Fixed one problem but there are some others. I'm hoping this will
satisfy the cfbot.

The cfbot reports an error on a 32 bit build
<https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:

# Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) # Failed test 'incremental backup from node1'

I have set up a Debian 12 EC2 instance following the recipe at
<https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>,
and ran what I think are the same tests dozens of times, but the failure
did not reappear in my setup. Unfortunately, the test doesn't show the
failing manifest or log the failing field, so trying to divine what
happened here is more than difficult.

Not sure how to address this.

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

#18Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Dunstan (#17)
Re: WIP Incremental JSON Parser

On Wed, Jan 24, 2024 at 10:04 AM Andrew Dunstan <andrew@dunslane.net> wrote:

The cfbot reports an error on a 32 bit build <https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:

# Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) # Failed test 'incremental backup from node1'

I have set up a Debian 12 EC2 instance following the recipe at <https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>, and ran what I think are the same tests dozens of times, but the failure did not reappear in my setup. Unfortunately, the test doesn't show the failing manifest or log the failing field, so trying to divine what happened here is more than difficult.

Not sure how to address this.

Yeah, that's really odd. The backup size field is printed like this:

appendStringInfo(&buf, "\"Size\": %zu, ", size);

And parsed like this:

size = strtoul(parse->size, &ep, 10);
if (*ep)
json_manifest_parse_failure(parse->context,

"file size is not an integer");

I confess to bafflement -- how could the output of the first fail to
be parsed by the second? The manifest has to look pretty much valid in
order not to error out before it gets to this check, with just that
one field corrupted. But I don't understand how that could happen.

I agree that the error reporting could be better here, but it didn't
seem worth spending time on when I wrote the code. I figured the only
way we could end up with something like "Size": "Walrus" is if the
user was messing with us on purpose. Apparently that's not so, yet the
mechanism eludes me. Or maybe it's not some random string, but is
something like an empty string or a number with trailing garbage or a
number that's out of range. But I don't see how any of those things
can happen either.

Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.

--
Robert Haas
EDB: http://www.enterprisedb.com

#19Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#18)
Re: WIP Incremental JSON Parser

On 2024-01-24 We 13:08, Robert Haas wrote:

Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.

Yeah, I thought earlier today I was on the track of an off-by-one error,
but I was apparently mistaken, so here's the same patch set with an
extra patch that logs a bunch of stuff, and might let us see what's
upsetting the cfbot.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Attachments:

v7-0001-Introduce-a-non-recursive-JSON-parser.patch (text/x-patch; +1486 -10)
v7-0002-Add-support-for-incrementally-parsing-backup-mani.patch (text/x-patch; +117 -9)
v7-0003-Use-incremental-parsing-of-backup-manifests.patch (text/x-patch; +177 -61)
v7-0004-add-logging-traces-to-see-if-we-can-find-out-what.patch (text/x-patch; +21 -4)
#20Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#19)
Re: WIP Incremental JSON Parser

On 2024-01-26 Fr 12:15, Andrew Dunstan wrote:

On 2024-01-24 We 13:08, Robert Haas wrote:

Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.

Yeah, I thought earlier today I was on the track of an off-by-one
error, but I was apparently mistaken, so here's the same patch set
with an extra patch that logs a bunch of stuff, and might let us see
what's upsetting the cfbot.

Well, that didn't help a lot, but meanwhile the CFBot seems to have
decided in the last few days that it's now happy, so full steam ahead! ;-)

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#21Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#20)
#22Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#21)
#23Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#22)
#24Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#23)
#25Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#24)
#26Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#25)
#27Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#26)
#28Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#27)
#29Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#28)
#30Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#29)
#31Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#30)
#32Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#31)
#33Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#32)
#34Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#33)
#35Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#34)
#36Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#35)
#37Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#36)
#38Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#37)
#39Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#38)
#40Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#39)
#41Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#40)
#42Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#41)
#43Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#42)
#44Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#43)
#45Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#44)
#46Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#45)
#47Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#46)
#48Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#47)
#49Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#48)
#50Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#48)
#51Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#50)
#52Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#46)
#53Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#52)
#54Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#47)
#55Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#54)
#56Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#55)
#57Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#56)
#58Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#57)
#59Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#58)
#60Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#59)
#61Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#60)
#62Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#61)
#63Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#62)
#64Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#63)
#65Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#64)
#66Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#65)
#67Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Andrew Dunstan (#66)
#68Nathan Bossart
nathandbossart@gmail.com
In reply to: Jelte Fennema-Nio (#67)
#69Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jelte Fennema-Nio (#67)
#70Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jacob Champion (#69)
#71Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Nathan Bossart (#68)
#72Andrew Dunstan
andrew@dunslane.net
In reply to: Nathan Bossart (#68)
#73Nathan Bossart
nathandbossart@gmail.com
In reply to: Andrew Dunstan (#72)
#74Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jacob Champion (#70)
#75Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Jelte Fennema-Nio (#74)
#76Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jacob Champion (#75)
#77Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Jelte Fennema-Nio (#76)
#78Andrew Dunstan
andrew@dunslane.net
In reply to: Nathan Bossart (#73)
#79Nathan Bossart
nathandbossart@gmail.com
In reply to: Andrew Dunstan (#78)
#80Andrew Dunstan
andrew@dunslane.net
In reply to: Nathan Bossart (#79)
#81Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andrew Dunstan (#78)
#82Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#81)
#83Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#82)
#84Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#83)
#85Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#82)
#86Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#84)
#87Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#86)
#88Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#87)
#89Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#88)
#90Andrew Dunstan
andrew@dunslane.net
In reply to: Jacob Champion (#89)
#91Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Andrew Dunstan (#90)
#92Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Michael Paquier (#87)
#93Jacob Champion
jacob.champion@enterprisedb.com
In reply to: Michael Paquier (#86)
#94Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#88)
#95Andrew Dunstan
andrew@dunslane.net
In reply to: Andrew Dunstan (#94)
#96Michael Paquier
michael@paquier.xyz
In reply to: Jacob Champion (#91)
#97Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#95)
#98Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#97)
#99Michael Paquier
michael@paquier.xyz
In reply to: Andrew Dunstan (#98)
#100Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#99)
#101Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#100)
#102Andrew Dunstan
andrew@dunslane.net
In reply to: Michael Paquier (#101)