WIP Incremental JSON Parser
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.
One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that, particularly if done in frontend code, this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.
In any case, I've had fun so it's not a total loss come what may :-)
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
json-incremental-parser-2023-12-26.patch
On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.
Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.
One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that, particularly if done in frontend code, this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.
It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
As far as addressing that other issue, we could address the issue
either by having the semantic routines free the memory if they don't
need it, or alternatively by having the parser itself free the memory
after invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical approach. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if they want or (2) that it's a palloc'd chunk
that the caller must not free or (3) that it's not a palloc'd chunk.
We can further document the memory context in which the chunk will be
allocated, if applicable, and when/if the parser will free it.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2024-01-02 Tu 10:14, Robert Haas wrote:
On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser. Nevertheless, I'm attaching my current WIP state for it,
and I'll add it to the next CF to keep the conversation going.
Thanks for doing this. I think it's useful even if it's slower than
the current parser, although that probably necessitates keeping both,
which isn't great, but I don't see a better alternative.
One possible use would be in parsing large manifest files for
incremental backup. However, it struck me a few days ago that this might
not work all that well. The current parser and the new parser both
palloc() space for each field name and scalar token in the JSON (unless
they aren't used, which is normally not the case), and they don't free
it, so that, particularly if done in frontend code, this amounts to a
possible memory leak, unless the semantic routines do the freeing
themselves. So while we can save some memory by not having to slurp in
the whole JSON in one hit, we aren't saving any of that other allocation
of memory, which amounts to almost as much space as the raw JSON.
It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
As far as addressing that other issue, we could address the issue
either by having the semantic routines free the memory if they don't
need it, or alternatively by having the parser itself free the memory
after invoking any callbacks to which it might be passed. The latter
approach feels more conceptually pure, but the former might be the
more practical approach. I think what really matters here is that we
document who must or may do which things. When a callback gets passed
a pointer, we can document either that (1) it's a palloc'd chunk that
the callback can free if they want or (2) that it's a palloc'd chunk
that the caller must not free or (3) that it's not a palloc'd chunk.
We can further document the memory context in which the chunk will be
allocated, if applicable, and when/if the parser will free it.
Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.
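As a sketch of the interning idea (all names here are invented; a real
version would presumably use dynahash or simplehash rather than this toy
fixed-size table, and error handling is elided):

```c
#include <stdlib.h>
#include <string.h>

/* Toy string interner: each distinct field name is allocated once and
 * reused for the rest of the parse.  A fixed-size chained table is
 * enough to illustrate the idea. */
#define INTERN_BUCKETS 256

typedef struct InternEntry
{
    struct InternEntry *next;
    char                name[];     /* the interned string itself */
} InternEntry;

static InternEntry *intern_table[INTERN_BUCKETS];

static unsigned
intern_hash(const char *s)
{
    unsigned    h = 5381;

    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % INTERN_BUCKETS;
}

/* Returns a pointer that stays valid until end of parse; allocates
 * only the first time a given name is seen. */
static const char *
intern_field_name(const char *name)
{
    unsigned    h = intern_hash(name);
    InternEntry *e;

    for (e = intern_table[h]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e->name;     /* repeat field: no new allocation */

    e = malloc(sizeof(InternEntry) + strlen(name) + 1);
    strcpy(e->name, name);
    e->next = intern_table[h];
    intern_table[h] = e;
    return e->name;
}
```

The returned pointer stays valid until the table is torn down, so a callback
only needs to duplicate it if it keeps the name beyond the parse.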
The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.
IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will have a cost but won't really save anything more than just
freeing the memory that isn't needed, and will probably be more expensive.
The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.
Hmm. This makes me wonder if you've measured how much actual leakage there is?
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2024-01-03 We 08:45, Robert Haas wrote:
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.
IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will have a cost but won't really save anything more than just
freeing the memory that isn't needed, and will probably be more expensive.
I don't quite follow.
Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.
I guess we can benchmark it and see what the performance impact of using
a hash table might be.
Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().
The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.
Hmm. This makes me wonder if you've measured how much actual leakage there is?
No I haven't. I have simply theorized about how much memory we might
consume if nothing were done by the callers to free the memory.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.
I guess we can benchmark it and see what the performance impact of using
a hash table might be.
Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().
Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.
I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2024-01-03 We 10:12, Robert Haas wrote:
On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.
I guess we can benchmark it and see what the performance impact of using
a hash table might be.
Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().
Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.
I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.
OK, cleaning up in the client code will be much simpler, so let's go
with that for now and revisit it later if necessary.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
You could also build a streaming incremental parser. That is, one that
outputs a path and a leaf value (where leaf values are scalar values,
`null`, `true`, `false`, numbers, and strings). Then if the caller is
doing something JSONPath-like then the caller can probably immediately
free almost all allocations and even terminate the parse early.
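As a sketch of what the caller's side of such a streaming interface might
look like (the names and path syntax are invented for illustration, not
taken from the patch):

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical streaming interface: the parser calls one callback per
 * leaf value, passing a JSONPath-style location and the scalar's text.
 * Returning false abandons the parse early. */
typedef bool (*leaf_callback) (void *state, const char *path,
                               const char *leaf_value);

/* Example consumer: stop as soon as a leaf under a given path prefix
 * is seen, keeping no allocations around at all. */
typedef struct
{
    const char *prefix;
    int         matches;
} PrefixState;

static bool
match_prefix(void *vstate, const char *path, const char *leaf_value)
{
    PrefixState *state = vstate;

    (void) leaf_value;          /* a real consumer would use this */
    if (strncmp(path, state->prefix, strlen(state->prefix)) == 0)
    {
        state->matches++;
        return false;           /* tell the parser to stop */
    }
    return true;                /* keep parsing */
}
```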
Nico
--
On Wed, Jan 3, 2024 at 6:36 PM Nico Williams <nico@cryptonector.com> wrote:
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
It seems like a pretty significant savings no matter what. Suppose the
backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
create a 1MB buffer and feed the data to the parser in 1MB chunks.
Well, that saves 2GB less 1MB, full stop. Now if we address the issue
you raise here in some way, we can potentially save even more memory,
which is great, but even if we don't, we still saved a bunch of memory
that could not have been saved in any other way.
You could also build a streaming incremental parser. That is, one that
outputs a path and a leaf value (where leaf values are scalar values,
`null`, `true`, `false`, numbers, and strings). Then if the caller is
doing something JSONPath-like then the caller can probably immediately
free almost all allocations and even terminate the parse early.
I think our current parser is event-based rather than this ... but it
seems like this could easily be built on top of it, if someone wanted
to.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.
The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.
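One way to get rid of such strlen() calls, sketched with invented names
(not necessarily what the attached diff does), is to record each
production's length once, at compile time:

```c
#include <stddef.h>
#include <string.h>

/* Sketch only: Prediction and PREDICTION are invented names, not the
 * patch's actual structures.  The idea is to pay for the length once,
 * at compile time, rather than calling strlen() on every push. */
typedef struct
{
    const char *prod;
    size_t      len;
} Prediction;

/* sizeof on a string literal counts the terminating NUL; subtract it. */
#define PREDICTION(s)  { (s), sizeof(s) - 1 }

static const Prediction scalar_prediction = PREDICTION("TSCALAR");

static size_t
prediction_len(const Prediction *p)
{
    return p->len;              /* no strlen() in the hot loop */
}
```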
Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?
+            case JSON_SEM_SCALAR_CALL:
+                {
+                    json_scalar_action sfunc = sem->scalar;
+
+                    if (sfunc != NULL)
+                        (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                }
Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?
One possible use would be in parsing large manifest files for
incremental backup.
I'm keeping an eye on this thread for OAuth, since the clients have to
parse JSON as well. Those responses tend to be smaller, though, so
you'd have to really be hurting for resources to need this.
--Jacob
Attachments:
no-strlen.diff.txt
On 2024-01-09 Tu 13:46, Jacob Champion wrote:
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.
The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.
Thanks for looking! I've been playing around with a similar idea, but
yours might be better.
Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?
Also a good suggestion. Will look and see. IIRC I had trouble with this bit.
+            case JSON_SEM_SCALAR_CALL:
+                {
+                    json_scalar_action sfunc = sem->scalar;
+
+                    if (sfunc != NULL)
+                        (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                }
Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?
Good point. In fact it might be responsible for the error I'm currently
trying to get to the bottom of.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 2024-01-09 Tu 13:46, Jacob Champion wrote:
On Tue, Dec 26, 2023 at 8:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
Quite a long time ago Robert asked me about the possibility of an
incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
performance is significantly worse than that of the current Recursive
Descent parser.
The prediction stack is neat. It seems like the main loop is hit so
many thousands of times that micro-optimization would be necessary...
I attached a sample diff to get rid of the strlen calls during
push_prediction(), which speeds things up a bit (8-15%, depending on
optimization level) on my machines.
Maybe it's possible to condense some of those productions down, and
reduce the loop count? E.g. does every "scalar" production need to go
three times through the loop/stack, or can the scalar semantic action
just peek at the next token prediction and do all the callback work at
once?
+            case JSON_SEM_SCALAR_CALL:
+                {
+                    json_scalar_action sfunc = sem->scalar;
+
+                    if (sfunc != NULL)
+                        (*sfunc) (sem->semstate, scalar_val, scalar_tok);
+                }
Is it safe to store state (scalar_val/scalar_tok) on the stack, or
does it disappear if the parser hits an incomplete token?
One possible use would be in parsing large manifest files for
incremental backup.
I'm keeping an eye on this thread for OAuth, since the clients have to
parse JSON as well. Those responses tend to be smaller, though, so
you'd have to really be hurting for resources to need this.
I've incorporated your suggestion, and fixed the bug you identified.
The attached also adds support for incrementally parsing backup
manifests, and uses that in the three places we call the manifest parser.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v3-0001-Introduce-a-non-recursive-JSON-parser.patch
v3-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v3-0003-Use-incremental-parsing-of-backup-manifests.patch
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
======
[1]: https://commitfest.postgresql.org/46/4725/
[2]: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4725
Kind Regards,
Peter Smith.
On 2024-01-22 Mo 01:29, Peter Smith wrote:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks.
Let's see if the attached does better.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v4-0003-Use-incremental-parsing-of-backup-manifests.patch
v4-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v4-0001-Introduce-a-non-recursive-JSON-parser.patch
On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:
On 2024-01-22 Mo 01:29, Peter Smith wrote:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks.
Let's see if the attached does better.
This time for sure! (Watch me pull a rabbit out of my hat!)
It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed it,
and I think the test will now pass.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v5-0001-Introduce-a-non-recursive-JSON-parser.patch
v5-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v5-0003-Use-incremental-parsing-of-backup-manifests.patch
On 2024-01-22 Mo 18:01, Andrew Dunstan wrote:
On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:
On 2024-01-22 Mo 01:29, Peter Smith wrote:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks.
Let's see if the attached does better.
This time for sure! (Watch me pull a rabbit out of my hat!)
It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed
it, and I think the test will now pass.
Fixed one problem but there are some others. I'm hoping this will
satisfy the cfbot.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v6-0001-Introduce-a-non-recursive-JSON-parser.patch
v6-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v6-0003-Use-incremental-parsing-of-backup-manifests.patch
On 2024-01-22 Mo 21:02, Andrew Dunstan wrote:
On 2024-01-22 Mo 18:01, Andrew Dunstan wrote:
On 2024-01-22 Mo 14:16, Andrew Dunstan wrote:
On 2024-01-22 Mo 01:29, Peter Smith wrote:
2024-01 Commitfest.
Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.
Thanks.
Let's see if the attached does better.
This time for sure! (Watch me pull a rabbit out of my hat!)
It turns out that NO_TEMP_INSTALL=1 can do ugly things, so I removed
it, and I think the test will now pass.
Fixed one problem but there are some others. I'm hoping this will
satisfy the cfbot.
The cfbot reports an error on a 32 bit build
<https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:
# Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) # Failed test 'incremental backup from node1'
I have set up a Debian 12 EC2 instance following the recipe at
<https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>,
and ran what I think are the same tests dozens of times, but the failure
did not reappear in my setup. Unfortunately, the test doesn't show the
failing manifest or log the failing field, so trying to divine what
happened here is more than difficult.
Not sure how to address this.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Jan 24, 2024 at 10:04 AM Andrew Dunstan <andrew@dunslane.net> wrote:
The cfbot reports an error on a 32 bit build <https://api.cirrus-ci.com/v1/artifact/task/6055909135220736/testrun/build-32/testrun/pg_combinebackup/003_timeline/log/regress_log_003_timeline>:
# Running: pg_basebackup -D /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2 --no-sync -cfast --incremental /tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup1/backup_manifest
pg_basebackup: error: could not upload manifest: ERROR: could not parse backup manifest: file size is not an integer
pg_basebackup: removing data directory "/tmp/cirrus-ci-build/build-32/testrun/pg_combinebackup/003_timeline/data/t_003_timeline_node1_data/backup/backup2"
[02:41:07.830](0.073s) not ok 2 - incremental backup from node1
[02:41:07.830](0.000s) # Failed test 'incremental backup from node1'
I have set up a Debian 12 EC2 instance following the recipe at
<https://raw.githubusercontent.com/anarazel/pg-vm-images/main/scripts/linux_debian_install_deps.sh>,
and ran what I think are the same tests dozens of times, but the failure
did not reappear in my setup. Unfortunately, the test doesn't show the
failing manifest or log the failing field, so trying to divine what
happened here is more than difficult.
Not sure how to address this.
Yeah, that's really odd. The backup size field is printed like this:
appendStringInfo(&buf, "\"Size\": %zu, ", size);
And parsed like this:
    size = strtoul(parse->size, &ep, 10);
    if (*ep)
        json_manifest_parse_failure(parse->context,
                                    "file size is not an integer");
I confess to bafflement -- how could the output of the first fail to
be parsed by the second? The manifest has to look pretty much valid in
order not to error out before it gets to this check, with just that
one field corrupted. But I don't understand how that could happen.
I agree that the error reporting could be better here, but it didn't
seem worth spending time on when I wrote the code. I figured the only
way we could end up with something like "Size": "Walrus" is if the
user was messing with us on purpose. Apparently that's not so, yet the
mechanism eludes me. Or maybe it's not some random string, but is
something like an empty string or a number with trailing garbage or a
number that's out of range. But I don't see how any of those things
can happen either.
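For what it's worth, one gap in the quoted check: strtoul() reports an
out-of-range value through errno (ERANGE) while still consuming every
digit, so the *ep test alone can't see overflow. A belt-and-braces sketch
(parse_size is an invented name, not the manifest code's function):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stricter variant of the quoted size check: reject empty input,
 * trailing garbage, and values too wide for unsigned long (easier to
 * hit on 32-bit, where unsigned long is often 32 bits). */
static bool
parse_size(const char *str, unsigned long *result)
{
    char       *ep;

    errno = 0;
    *result = strtoul(str, &ep, 10);
    if (ep == str || *ep != '\0')
        return false;           /* empty, or trailing garbage */
    if (errno == ERANGE)
        return false;           /* out of range for unsigned long */
    return true;
}
```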
Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 2024-01-24 We 13:08, Robert Haas wrote:
Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.
Yeah, I thought earlier today I was on the track of an off by one error,
but I was apparently mistaken, so here's the same patch set with an
extra patch that logs a bunch of stuff, and might let us see what's
upsetting the cfbot.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
v7-0001-Introduce-a-non-recursive-JSON-parser.patch
v7-0002-Add-support-for-incrementally-parsing-backup-mani.patch
v7-0003-Use-incremental-parsing-of-backup-manifests.patch
v7-0004-add-logging-traces-to-see-if-we-can-find-out-what.patch
On 2024-01-26 Fr 12:15, Andrew Dunstan wrote:
On 2024-01-24 We 13:08, Robert Haas wrote:
Maybe you should adjust your patch to dump the manifests into the log
file with note(). Then when cfbot runs on it you can see exactly what
the raw file looks like. Although I wonder if it's possible that the
manifest itself is OK, but somehow it gets garbled when uploaded to
the server, either because the client code that sends it or the server
code that receives it does something that isn't safe in 32-bit mode.
If we hypothesize an off-by-one error or a buffer overrun, that could
possibly explain how one field got garbled while the rest of the file
is OK.
Yeah, I thought earlier today I was on the track of an off by one
error, but I was apparently mistaken, so here's the same patch set
with an extra patch that logs a bunch of stuff, and might let us see
what's upsetting the cfbot.
Well, that didn't help a lot, but meanwhile the CFBot seems to have
decided in the last few days that it's now happy, so full steam ahead! ;-)
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com