Proposal to use JSON for Postgres Parser format

Started by Michel Pelletier over 3 years ago · 28 messages
#1 Michel Pelletier
pelletier.michel@gmail.com

Hello hackers,

As noted in the source:

https://github.com/postgres/postgres/blob/master/src/include/nodes/pg_list.h#L6-L11

* Once upon a time, parts of Postgres were written in Lisp and used real
* cons-cell lists for major data structures. When that code was rewritten
* in C, we initially had a faithful emulation of cons-cell lists, which
* unsurprisingly was a performance bottleneck. A couple of major rewrites
* later, these data structures are actually simple expansible arrays;
* but the "List" name and a lot of the notation survives.

The Postgres parser format as described in the wiki page:

https://wiki.postgresql.org/wiki/Query_Parsing

looks almost, but not quite, entirely like JSON:

SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;
(
{SELECT
:distinctClause <>
:intoClause <>
:targetList (
{RESTARGET
:name <>
:indirection <>
:val
{COLUMNREF
:fields (
{A_STAR
}
)
:location 7
}
:location 7
}
)
:fromClause (
{RANGEVAR
:schemaname <>
:relname foo
:inhOpt 2
:relpersistence p
:alias <>
:location 14
}
)
... and so on
)

This non-standard format is useful for visual inspection and perhaps
simple parsing. The parsers that do exist for it are generally specific
to particular languages. If there were a standard way to parse queries,
tools like code generators and analysis tools could work with the many
libraries that already handle JSON quite well. Future potential would
include exposing this data to ddl_command_start event triggers.
Providing a JSON Schema would also aid tools that want to validate or
transform the JSON with rule-based systems.
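
For illustration only, here is how the RESTARGET fragment above might
come out as JSON; the key names and nesting are my assumption, not a
worked-out schema, and the cast is just to show that stock tooling can
already parse it:

SELECT '{"RESTARGET": {"name": null,
                       "indirection": null,
                       "val": {"COLUMNREF": {"fields": [{"A_STAR": {}}],
                                             "location": 7}},
                       "location": 7}}'::jsonb;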

I would like to propose a discussion that in a future major release
Postgres switch
from this custom format to JSON. The current format in question is
generated from macros and functions found in
`src/backend/nodes/readfuncs.c` and `src/backend/nodes/outfuncs.c` and
converting them to emit valid JSON would be relatively
straightforward.

One downside is that this would not be a forward-compatible change
across releases. But since it is unlikely that much code relies on
this custom format, that should not be a huge problem for most.

Thoughts?

-Michel

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michel Pelletier (#1)
Re: Proposal to use JSON for Postgres Parser format

Michel Pelletier <pelletier.michel@gmail.com> writes:

I would like to propose a discussion that in a future major release
Postgres switch from this custom format to JSON.

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose. JSON doesn't fit our needs all that
closely, so some things like bitmapsets would become a lot longer;
and even where the semantics are pretty-much-the-same, JSON's
insistence on details like quoting field names will add bytes.
Perhaps making the physical storage be JSONB not JSON would help that
pain point. It's still far from ideal though.

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent. That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.
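
For instance (the function name here is purely illustrative; nothing
like it exists today):

SELECT pg_node_tree_to_json(ev_action) FROM pg_rewrite LIMIT 1;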

regards, tom lane

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#2)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Sep 19, 2022 at 7:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent. That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.

That seems like a perfectly good solution, as long as it can be done
in a way that doesn't leave consumers of the JSON output at any kind
of disadvantage.

I find the current node tree format ludicrously verbose, and generally
hard to work with. But it's not the format itself, really -- that's
not the problem. The underlying data structures are typically very
information dense. So having an output format that's a known quantity
sounds very valuable to me.

Writing declarative @> containment queries against (say) a JSON
variant of node tree format seems like it could be a huge quality of
life improvement. It will make the output format even more verbose,
but that might not matter in the same way as it does right now.
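
A minimal sketch of the kind of query I mean, assuming (hypothetically)
that the stored node tree could be cast to jsonb:

SELECT rulename
FROM pg_rewrite
WHERE ev_action::jsonb @>
      '{"SELECT": {"fromClause": [{"RANGEVAR": {"relname": "foo"}}]}}';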

--
Peter Geoghegan

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Michel Pelletier (#1)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 12:16 PM Michel Pelletier
<pelletier.michel@gmail.com> wrote:

This non-standard format

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley). At some point it was probably exchanging data between the
Lisp and C parts of POSTGRES, and you could maybe sorta claim it's
based on an ANSI standard (Lisp!), but not with a straight face :-)

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#3)
Re: Proposal to use JSON for Postgres Parser format

Peter Geoghegan <pg@bowt.ie> writes:

Writing declarative @> containment queries against (say) a JSON
variant of node tree format seems like it could be a huge quality of
life improvement.

There are certainly use-cases for something like that, but let's
be clear about it: that's a niche case of interest to developers
and pretty much nobody else. For ordinary users, what matters about
the node tree storage format is compactness and speed of loading.
Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

regards, tom lane

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#2)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 7:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michel Pelletier <pelletier.michel@gmail.com> writes:

I would like to propose a discussion that in a future major release
Postgres switch from this custom format to JSON.

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose. JSON doesn't fit our needs all that
closely, so some things like bitmapsets would become a lot longer;
and even where the semantics are pretty-much-the-same, JSON's
insistence on details like quoting field names will add bytes.
Perhaps making the physical storage be JSONB not JSON would help that
pain point. It's still far from ideal though.

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent. That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.

I think the DDL deparsing stuff that is being discussed as a base for
DDL logical replication provides something like what you are saying [1].

[1]: /messages/by-id/CAFPTHDaqqGxqncAP42Z=w9GVXDR92HN-57O=2Zy6tmayV2_eZw@mail.gmail.com
[2]: /messages/by-id/CAAD30U+pVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo=OjdQGJ9g@mail.gmail.com

--
With Regards,
Amit Kapila.

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#4)
Re: Proposal to use JSON for Postgres Parser format

Thomas Munro <thomas.munro@gmail.com> writes:

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

regards, tom lane

#8 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#5)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

There are certainly use-cases for something like that, but let's
be clear about it: that's a niche case of interest to developers
and pretty much nobody else. For ordinary users, what matters about
the node tree storage format is compactness and speed of loading.

Of course. But is there any reason to think that there has to be even
a tiny cost imposed on users?

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections. This
is certainly not a new consideration.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

--
Peter Geoghegan

#9 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#7)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Sep 19, 2022 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

It's on GitHub:

https://github.com/kelvich/postgres_pre95
--
Peter Geoghegan

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#7)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

One of the tarballs at https://dsf.berkeley.edu/postgres.html has the
complete RCS history, but Stas Kelvich imported it to GitHub, as Peter
G just reported faster than I could.

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 4:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Sep 20, 2022 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

One of the tarballs at https://dsf.berkeley.edu/postgres.html has the
complete RCS history, but Stas Kelvich imported it to GitHub, as Peter
G just reported faster than I could.

To explain my earlier guess: reader code for #S(STRUCTNAME ...) can
be seen here, though it's being lexed as "PLAN_SYM" so perhaps the
author of that C already didn't know that was a general syntax for
Lisp structs. (Example: at a Lisp prompt, if you write (defstruct foo
x y z) then (make-foo :x 1 :y 2 :z 3), the resulting object will be
printed as #S(FOO :x 1 :y 2 :z 3), so I'm guessing that the POSTGRES
Lisp code, which sadly (for me) was ripped out before even that repo
IIUC, must have used defstruct-based plans.)

https://github.com/kelvich/postgres_pre95/blob/master/src/backend/lib/lispread.c#L132

It may still be within the bounds of what a real Lisp could be
convinced to read though, given a reader macro to handle {} and maybe
some other little tweaks here and there.

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#8)
Re: Proposal to use JSON for Postgres Parser format

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

regards, tom lane

#13 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#12)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Sep 19, 2022 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

That sounds very much like the 1980s graduate student equivalent of
JSON to my ears.

JSON is generally manipulated as native Javascript/python/whatever
lists, maps, and strings. It's an interchange format that tries not to
be obtrusive in the same way as things like XML always are, at the
cost of making things kinda dicey for things like numeric precision
(unless you can account for everything). Isn't that...basically the
same concept as the lisp-y dump format, at a high level?

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

Agreed.

--
Peter Geoghegan

#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#13)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Sep 19, 2022 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

That sounds very much like the 1980s graduate student equivalent of
JSON to my ears.

Yeah. Easy data interchange on Lisp systems is built in: just write
objects into a socket/file/whatever and read them back, as people now
do with JSON/XML/whatever. That's the format we see here.

JSON is generally manipulated as native Javascript/python/whatever
lists, maps, and strings. It's an interchange format that tries not to
be obtrusive in the same way as things like XML always are, at the
cost of making things kinda dicey for things like numeric precision
(unless you can account for everything). Isn't that...basically the
same concept as the lisp-y dump format, at a high level?

Yes, s-expressions and JSON are absolutely the same concept; simple
representation of simple data structures of a dynamically typed
language. There's even a chain of events connecting the two: JSON is
roughly the literal data syntax from Javascript's grammar, and
Javascript is the language that Brendan Eich developed after Netscape
hired him to do an embedded Lisp (Scheme) for the browser, except they
decided at some point to change tack and make their new language have
a surface grammar more like Java, the new hotness. If the goal was to
make sure it caught on, it's hard to conclude they were wrong...

#15 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#12)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

+1 for considering storage format and text format separately.

Let's consider what our criteria could be for the storage format.

1) Storage effectiveness (shorter is better) and
serialization/deserialization effectiveness (faster is better). On
this criterion, the custom binary format looks perfect.
2) Robustness in the case of corruption. It seems much easier to
detect the data corruption and possibly make some partial manual
recovery for textual format.
3) Standardness. It's better to use something known worldwide or at
least used in other parts of PostgreSQL than something completely
custom. From this perspective, JSON/JSONB is better than custom
things.

------
Regards,
Alexander Korotkov

#16 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#15)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 1:00 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

+1 for considering storage format and text format separately.

Let's consider what our criteria could be for the storage format.

1) Storage effectiveness (shorter is better) and
serialization/deserialization effectiveness (faster is better). On
this criterion, the custom binary format looks perfect.
2) Robustness in the case of corruption. It seems much easier to
detect the data corruption and possibly make some partial manual
recovery for textual format.
3) Standardness. It's better to use something known worldwide or at
least used in other parts of PostgreSQL than something completely
custom. From this perspective, JSON/JSONB is better than custom
things.

(sorry, I've accidentally cut the last paragraph from the message)

It seems that there is no perfect fit for this multi-criteria
optimization, and we should pick what is more important. Any
thoughts?

------
Regards,
Alexander Korotkov

#17 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Alexander Korotkov (#15)
Re: Proposal to use JSON for Postgres Parser format

On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

+1 for considering storage format and text format separately.

Let's consider what our criteria could be for the storage format.

1) Storage effectiveness (shorter is better) and
serialization/deserialization effectiveness (faster is better). On
this criterion, the custom binary format looks perfect.
2) Robustness in the case of corruption. It seems much easier to
detect the data corruption and possibly make some partial manual
recovery for textual format.
3) Standardness. It's better to use something known worldwide or at
least used in other parts of PostgreSQL than something completely
custom. From this perspective, JSON/JSONB is better than custom
things.

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.
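
For reference, this is the sort of measurement I mean (runnable as-is,
since pg_node_tree can be cast to text):

-- on-disk size of pg_rewrite, including TOAST
SELECT pg_size_pretty(pg_total_relation_size('pg_rewrite'));
-- raw, uncompressed size of the serialized node trees
SELECT pg_size_pretty(sum(octet_length(ev_action::text))::bigint)
FROM pg_rewrite;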

In that same thread, I also suggested that we could try to not emit a
Node's fields if they contain their default values while serializing;
such as the common `:location -1` or `:mynodefield <>`. Those fields
still take up space in the format, while conveying no interesting
information (the absense of that field in the struct definition would
convey the same). It would be useful if this new serialized format
would allow us to do similar tricks cheaply.

As for JSON vs JSONB for storage:
I'm fairly certain that JSONB is less compact than JSON (without
taking compression into the picture) due to the 4-byte guaranteed
overhead for each jsonb element; while for JSON that is only 2 bytes
for each (up to 3 when you consider separators, plus potential extra
overhead for escaped values that are unlikely to appear in our catalogs).
Some numbers can be stored more efficiently in JSONB, but only large
numbers and small fractions that we're unlikely to hit in system
views: a back-of-the-envelope calculation puts the cutoff point of
efficient storage between strings-of-decimals and Numeric at >10^12, <
-10^11, or very precise fractional values.
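
A quick way to see that per-element overhead (a rough sketch; exact
numbers vary by version and platform):

SELECT pg_column_size('{"location": -1, "relname": "foo"}'::json)
           AS json_bytes,
       pg_column_size('{"location": -1, "relname": "foo"}'::jsonb)
           AS jsonb_bytes;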

Kind regards,

Matthias van de Meent

[0]: /messages/by-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w@mail.gmail.com

#18 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Matthias van de Meent (#17)
Re: Proposal to use JSON for Postgres Parser format

On 2022-Sep-20, Matthias van de Meent wrote:

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

True. Currently, the largest ev_action values compress pretty well. I
think if we wanted this to be more succint, we would have to invent some
binary format -- perhaps something like Protocol Buffers: it'd be stored
in the binary format in catalogs, but for output it would be converted
into something easy to read (we already do this for
pg_statistic_ext_data for example). We'd probably lose compressibility,
but that'd be okay because the binary format would already remove most
of the redundancy by nature.

Do we want to go there?

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming" (A. Stepanov)

#19 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Alvaro Herrera (#18)
Re: Proposal to use JSON for Postgres Parser format

On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Sep-20, Matthias van de Meent wrote:

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

True. Currently, the largest ev_action values compress pretty well. I
think if we wanted this to be more succinct, we would have to invent some
binary format -- perhaps something like Protocol Buffers: it'd be stored
in the binary format in catalogs, but for output it would be converted
into something easy to read (we already do this for
pg_statistic_ext_data for example). We'd probably lose compressibility,
but that'd be okay because the binary format would already remove most
of the redundancy by nature.

Do we want to go there?

I don't think that a binary format would be much better for
debugging/fixing than an optimization of the current textual format
when combined with compression. As I mentioned in that thread, there
is a lot of improvement possible with the existing format, and I think
any debugging of serialized nodes would greatly benefit from using a
textual format.

Then again, I also agree that this argument doesn't hold its weight
when storage and output formats are going to be different. I trust
that any new tooling introduced as a result of this thread will be
better than what we have right now.

As for best format: I don't know. The current format is usable, and a
better format would not store any data for default values. JSON can do
that, but I could think of many formats that could do the same (Smile,
BSON, xml, etc.).

I do not think that protobuf is the best choice for storage, though,
because it has its own rules on what it considers a default value and
what it does or does not serialize: zero is considered the only
default for numbers, as is the empty string for text, etc.
I think it is all right for general use, but with e.g. `location: -1`
in just about every parse node we'd probably want to select our own
values to ignore during (de)serialization of fields.

Kind regards,

Matthias van de Meent

#20 Michel Pelletier
pelletier.michel@gmail.com
In reply to: Matthias van de Meent (#19)
Re: Proposal to use JSON for Postgres Parser format

On Wed, Sep 21, 2022 at 11:04 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2022-Sep-20, Matthias van de Meent wrote:

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

True. Currently, the largest ev_action values compress pretty well. I
think if we wanted this to be more succinct, we would have to invent some
binary format -- perhaps something like Protocol Buffers: it'd be stored
in the binary format in catalogs, but for output it would be converted
into something easy to read (we already do this for
pg_statistic_ext_data for example). We'd probably lose compressibility,
but that'd be okay because the binary format would already remove most
of the redundancy by nature.

Do we want to go there?

I don't think that a binary format would be much better for
debugging/fixing than an optimization of the current textual format
when combined with compression.

I agree, JSON is not perfect, but it compresses and it's usable
everywhere. My personal need for this is purely developer experience
and, as Tom pointed out, a "niche" need for sure, but we are starting
to do some serious work with Dan Lynch's plpgsql deparser tool to
generate RLS policies from meta schema models, and having the same
format come out of the parser would make a complete end-to-end
solution for us, especially if
we can get this data from a function in a ddl_command_start event trigger.
Dan also writes a popular deparser for Javascript, and unifying the formats
across these tools would be a big win for us.

As I mentioned in that thread, there
is a lot of improvement possible with the existing format, and I think
any debugging of serialized nodes would greatly benefit from using a
textual format.

Agreed.

Then again, I also agree that this argument doesn't hold its weight
when storage and output formats are going to be different. I trust
that any new tooling introduced as a result of this thread will be
better than what we have right now.

Separating formats seems like a lot of work to me, to get what might not be
a huge improvement over compressing JSON, for what seems unlikely to be
more than a few megabytes of parsed SQL.

As for best format: I don't know. The current format is usable, and a
better format would not store any data for default values. JSON can do
that, but I could think of many formats that could do the same (Smile,
BSON, xml, etc.).

I do not think that protobuf is the best choice for storage, though,
because it has its own rules on what it considers a default value and
what it does or does not serialize: zero is considered the only
default for numbers, as is the empty string for text, etc.
I think it is all right for general use, but with e.g. `location: -1`
in just about every parse node we'd probably want to select our own
values to ignore during (de)serialization of fields.

Agreed.

Thank you everyone who has contributed to this thread, I'm pleased that it
got a very spirited debate and I apologize for the delay in getting back to
everyone.

I'd like to spike on a proposed patch that:

- Converts the existing text format to JSON (or possibly jsonb,
considering feedback from this thread)
- Can be stored compressed
- Can be passed to a ddl_command_start event trigger with a function (a
sketch of this piece follows below).
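
As a sketch of that last item (pg_parse_tree_json() is a stand-in name
for whatever function the patch would actually add; nothing like it
exists yet):

CREATE FUNCTION log_parse_tree() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- hypothetical function returning the command's parse tree as JSON
    RAISE NOTICE 'parse tree: %', pg_parse_tree_json();
END;
$$;

CREATE EVENT TRIGGER log_ddl_parse_tree
    ON ddl_command_start
    EXECUTE FUNCTION log_parse_tree();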

Thoughts?

-Michel

#21 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: Proposal to use JSON for Postgres Parser format

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose.

Very much seconded - the various pg_node_trees are a quite significant
fraction of the overall size of an empty database. And they're not
particularly useful for a human either.

IIRC it's not just catalog storage that's affected; it's also relevant
for parallel query.

My pet peeve is the way datums are output as individual bytes, each
printed as an integer; for narrow fixed-width datums that includes a
lot of 0s for bytes that aren't even used in the datum.
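
To make that concrete (output abridged; from a 64-bit little-endian
build, exact bytes vary):

CREATE TEMP VIEW v AS SELECT 42::int4 AS x;
SELECT ev_action::text FROM pg_rewrite WHERE ev_class = 'v'::regclass;
-- contains, among much else: :constvalue 4 [ 42 0 0 0 0 0 0 0 ]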

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Greetings,

Andres Freund

#22 Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#21)
Re: Proposal to use JSON for Postgres Parser format

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#23 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrew Dunstan (#22)
Re: Proposal to use JSON for Postgres Parser format

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

------
Regards,
Alexander Korotkov

#24 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Alexander Korotkov (#23)
Re: Proposal to use JSON for Postgres Parser format

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

I think the proposal of Tom's that we're +1'ing is a pg_node_tree-to-json
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure on changing the storage format of pg_node_tree to json.

Kind regards,

Matthias van de Meent

#25 Michel Pelletier
pelletier.michel@gmail.com
In reply to: Matthias van de Meent (#24)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to some
JSON equivalent. That would address the needs of external code that
doesn't want to write a custom parser, while not tying us directly to
JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

I think the proposal of Tom's that we're +1'ing is a pg_node_tree-to-json
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I agree.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure on changing the storage format of pg_node_tree to json.

I'm going to spike on this function and will get back to the thread with
any updates.

Thank you!

-Michel

#26 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#11)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

To explain my earlier guess: reader code for #S(STRUCTNAME ...) can
be seen here, though it's being lexed as "PLAN_SYM" so perhaps the
author of that C already didn't know that was a general syntax for
Lisp structs. (Example: at a Lisp prompt, if you write (defstruct foo
x y z) then (make-foo :x 1 :y 2 :z 3), the resulting object will be
printed as #S(FOO :x 1 :y 2 :z 3), so I'm guessing that the POSTGRES
Lisp code, which sadly (for me) was ripped out before even that repo
IIUC, must have used defstruct-based plans.)

That defstruct guess is confirmed by page 36 and nearby of
https://dsf.berkeley.edu/papers/UCB-MS-zfong.pdf.

#27 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michel Pelletier (#25)
Re: Proposal to use JSON for Postgres Parser format

On Mon, 31 Oct 2022 at 15:56, Michel Pelletier
<pelletier.michel@gmail.com> wrote:

On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

I think the proposal of Tom's that we're +1'ing is a pg_node_tree-to-json
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I agree.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure on changing the storage format of pg_node_tree to json.

I'm going to spike on this function and will get back to the thread with any updates.

Michel, did you get a result from this spike?

I'm asking because I've now spiked most of my ideas on updating the node
text format, and am working on wrapping them up into a patch (or
patchset) later this week. The ideas for this are:

1. Don't write fields with default values for their types, such as
NULL for Node* fields;
2. Reset location fields before transforming the node tree to text
when we don't have a copy of the original query, which removes
location fields from serialization with step 1;
3. Add default() node labels to struct fields that do not share the
field type's default, allowing more fields to be omitted with step 1;
4. Add a special default_ref() pg_node_attr for node fields that default
to another node field's value, used in Var's varnosyn/varattnosyn as
referring to varno/varattno; and
5. Truncate trailing 0s in Const's outDatum notation of by-ref types,
so that e.g. Consts with `name` data don't waste so much space with 0s

Currently, it reduces the pg_total_relation_size metric of pg_rewrite
after TOAST compression by 35% vs pg16, down to 483328 bytes / 59
pages, from 753664 bytes / 92 pages. The raw size of the ev_action
column's data (that is, before compression) is reduced by 55% to
1.18MB (from 2.80MB), and the largest default shipped row (the
information_schema.columns view) in that table is reduced to 'only'
78kB raw, from 193kB.

RW performance hasn't been tested yet, so that is still to be determined...

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#28 jian he
jian.universality@gmail.com
In reply to: Matthias van de Meent (#27)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Dec 5, 2023 at 12:45 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 31 Oct 2022 at 15:56, Michel Pelletier
<pelletier.michel@gmail.com> wrote:

On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

I think the proposal of Tom's that we're +1'ing is a pg_node_tree-to-json
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I agree.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure on changing the storage format of pg_node_tree to json.

I'm going to spike on this function and will get back to the thread with any updates.

Michel, did you get a result from this spike?

I'm asking because I've now spiked most of my ideas on updating the node
text format, and am working on wrapping them up into a patch (or
patchset) later this week. The ideas for this are:

1. Don't write fields with default values for their types, such as
NULL for Node* fields;
2. Reset location fields before transforming the node tree to text
when we don't have a copy of the original query, which removes
location fields from serialization with step 1;
3. Add default() node labels to struct fields that do not share the
field type's default, allowing more fields to be omitted with step 1;
4. Add a special default_ref() pg_node_attr for node fields that default
to another node field's value, used in Var's varnosyn/varattnosyn as
referring to varno/varattno; and
5. Truncate trailing 0s in Const's outDatum notation of by-ref types,
so that e.g. Consts with `name` data don't waste so much space with 0s

can you share the draft patch, if it is still there?