Proposal to use JSON for Postgres Parser format

Started by Michel Pelletier over 3 years ago · 28 messages
#1 Michel Pelletier
pelletier.michel@gmail.com

Hello hackers,

As noted in the source:

https://github.com/postgres/postgres/blob/master/src/include/nodes/pg_list.h#L6-L11

* Once upon a time, parts of Postgres were written in Lisp and used real
* cons-cell lists for major data structures. When that code was rewritten
* in C, we initially had a faithful emulation of cons-cell lists, which
* unsurprisingly was a performance bottleneck. A couple of major rewrites
* later, these data structures are actually simple expansible arrays;
* but the "List" name and a lot of the notation survives.

The Postgres parser format as described in the wiki page:

https://wiki.postgresql.org/wiki/Query_Parsing

looks almost, but not quite, entirely like JSON:

SELECT * FROM foo where bar = 42 ORDER BY id DESC LIMIT 23;
(
{SELECT
:distinctClause <>
:intoClause <>
:targetList (
{RESTARGET
:name <>
:indirection <>
:val
{COLUMNREF
:fields (
{A_STAR
}
)
:location 7
}
:location 7
}
)
:fromClause (
{RANGEVAR
:schemaname <>
:relname foo
:inhOpt 2
:relpersistence p
:alias <>
:location 14
}
)
... and so on
)

This non-standard format is useful for visual inspection and perhaps
simple parsing. The parsers that do exist for it are generally specific
to particular languages. If there were a standard way to parse queries,
tools like code generators and analysis tools could work with the many
libraries that already handle JSON quite well. Future potential would
include exposing this data to ddl_command_start event triggers.
Providing a JSON Schema would also aid tools that want to validate or
transform the JSON with rule-based systems.
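
For illustration only, here is how the RESTARGET fragment above might
come out as JSON; the key names and nesting are my assumption, not a
worked-out schema, and the cast is just to show that stock tooling can
already parse it:

SELECT '{"RESTARGET": {"name": null,
                       "indirection": null,
                       "val": {"COLUMNREF": {"fields": [{"A_STAR": {}}],
                                             "location": 7}},
                       "location": 7}}'::jsonb;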

I would like to propose a discussion that in a future major release
Postgres switch
from this custom format to JSON. The current format in question is
generated from macros and functions found in
`src/backend/nodes/readfuncs.c` and `src/backend/nodes/outfuncs.c` and
converting them to emit valid JSON would be relatively
straightforward.

One downside is that this would not be a forward-compatible change
across releases. But since it is unlikely that much code relies on
this custom format, that should not be a huge problem for most.

Thoughts?

-Michel

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michel Pelletier (#1)
Re: Proposal to use JSON for Postgres Parser format

Michel Pelletier <pelletier.michel@gmail.com> writes:

I would like to propose a discussion that in a future major release
Postgres switch from this custom format to JSON.

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose. JSON doesn't fit our needs all that
closely, so some things like bitmapsets would become a lot longer;
and even where the semantics are pretty-much-the-same, JSON's
insistence on details like quoting field names will add bytes.
Perhaps making the physical storage be JSONB not JSON would help that
pain point. It's still far from ideal though.

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent. That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.
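
For instance (the function name here is purely illustrative; nothing
like it exists today):

SELECT pg_node_tree_to_json(ev_action) FROM pg_rewrite LIMIT 1;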

regards, tom lane

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#2)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Sep 19, 2022 at 7:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent. That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.

That seems like a perfectly good solution, as long as it can be done
in a way that doesn't leave consumers of the JSON output at any kind
of disadvantage.

I find the current node tree format ludicrously verbose, and generally
hard to work with. But it's not the format itself, really -- that's
not the problem. The underlying data structures are typically very
information dense. So having an output format that's a known quantity
sounds very valuable to me.

Writing declarative @> containment queries against (say) a JSON
variant of node tree format seems like it could be a huge quality of
life improvement. It will make the output format even more verbose,
but that might not matter in the same way as it does right now.
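
A minimal sketch of the kind of query I mean, assuming (hypothetically)
that the stored node tree could be cast to jsonb:

SELECT rulename
FROM pg_rewrite
WHERE ev_action::jsonb @>
      '{"SELECT": {"fromClause": [{"RANGEVAR": {"relname": "foo"}}]}}';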

--
Peter Geoghegan

#4 Thomas Munro
thomas.munro@gmail.com
In reply to: Michel Pelletier (#1)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 12:16 PM Michel Pelletier
<pelletier.michel@gmail.com> wrote:

This non-standard format

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley). At some point it was probably exchanging data between the
Lisp and C parts of POSTGRES, and you could maybe sorta claim it's
based on an ANSI standard (Lisp!), but not with a straight face :-)

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#3)
Re: Proposal to use JSON for Postgres Parser format

Peter Geoghegan <pg@bowt.ie> writes:

Writing declarative @> containment queries against (say) a JSON
variant of node tree format seems like it could be a huge quality of
life improvement.

There are certainly use-cases for something like that, but let's
be clear about it: that's a niche case of interest to developers
and pretty much nobody else. For ordinary users, what matters about
the node tree storage format is compactness and speed of loading.
Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

regards, tom lane

#6 Amit Kapila
amit.kapila16@gmail.com
In reply to: Tom Lane (#2)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 7:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Michel Pelletier <pelletier.michel@gmail.com> writes:

I would like to propose a discussion that in a future major release
Postgres switch from this custom format to JSON.

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose. JSON doesn't fit our needs all that
closely, so some things like bitmapsets would become a lot longer;
and even where the semantics are pretty-much-the-same, JSON's
insistence on details like quoting field names will add bytes.
Perhaps making the physical storage be JSONB not JSON would help that
pain point. It's still far from ideal though.

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to
some JSON equivalent. That would address the needs of external
code that doesn't want to write a custom parser, while not tying
us directly to JSON.

I think the DDL deparsing stuff that is being discussed as a base for
DDL logical replication provides something like what you are saying [1].

[1]: /messages/by-id/CAFPTHDaqqGxqncAP42Z=w9GVXDR92HN-57O=2Zy6tmayV2_eZw@mail.gmail.com
[2]: /messages/by-id/CAAD30U+pVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo=OjdQGJ9g@mail.gmail.com

--
With Regards,
Amit Kapila.

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Thomas Munro (#4)
Re: Proposal to use JSON for Postgres Parser format

Thomas Munro <thomas.munro@gmail.com> writes:

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

regards, tom lane

#8 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#5)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

There are certainly use-cases for something like that, but let's
be clear about it: that's a niche case of interest to developers
and pretty much nobody else. For ordinary users, what matters about
the node tree storage format is compactness and speed of loading.

Of course. But is there any reason to think that there has to be even
a tiny cost imposed on users?

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections. This
is certainly not a new consideration.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

--
Peter Geoghegan

#9 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#7)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Sep 19, 2022 at 8:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

It's on GitHub:

https://github.com/kelvich/postgres_pre95
--
Peter Geoghegan

#10 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#7)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

One of the tarballs at https://dsf.berkeley.edu/postgres.html has the
complete RCS history, but Stas Kelvich imported it to GitHub, as Peter
G just reported faster than I could.

#11 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#10)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 4:03 PM Thomas Munro <thomas.munro@gmail.com> wrote:

On Tue, Sep 20, 2022 at 3:58 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thomas Munro <thomas.munro@gmail.com> writes:

FWIW, it derives from Lisp s-expressions, but deviates from Lisp's
default reader/printer behaviour in small ways, including being case
sensitive and using {NAME :x 1 ...} instead of #S(NAME :x 1 ...) for
structs for reasons that are lost AFAIK (there's a dark age between
the commit history of the old Berkeley repo and our current repo, and
it looks like plan nodes were still printed as #(NAME ...) at
Berkeley).

Wow, where did you find a commit history for Berkeley's code?
There's evidence in the tarballs I have that they were using
RCS, but I never heard that the repo was made public.

One of the tarballs at https://dsf.berkeley.edu/postgres.html has the
complete RCS history, but Stas Kelvich imported it to GitHub, as Peter
G just reported faster than I could.

To explain my earlier guess: reader code for #S(STRUCTNAME ...) can
be seen here, though it's being lexed as "PLAN_SYM" so perhaps the
author of that C already didn't know that was a general syntax for
Lisp structs. (Example: at a Lisp prompt, if you write (defstruct foo
x y z) then (make-foo :x 1 :y 2 :z 3), the resulting object will be
printed as #S(FOO :x 1 :y 2 :z 3), so I'm guessing that the POSTGRES
Lisp code, which sadly (for me) was ripped out before even that repo
IIUC, must have used defstruct-based plans.)

https://github.com/kelvich/postgres_pre95/blob/master/src/backend/lib/lispread.c#L132

It may still be within the bounds of what a real Lisp could be
convinced to read though, given a reader macro to handle {} and maybe
some other little tweaks here and there.

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#8)
Re: Proposal to use JSON for Postgres Parser format

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

regards, tom lane

#13 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#12)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Sep 19, 2022 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

That sounds very much like the 1980s graduate student equivalent of
JSON to my ears.

JSON is generally manipulated as native Javascript/python/whatever
lists, maps, and strings. It's an interchange format that tries not to
be obtrusive in the same way as things like XML always are, at the
cost of making things kinda dicey for things like numeric precision
(unless you can account for everything). Isn't that...basically the
same concept as the lisp-y dump format, at a high level?

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

Agreed.

--
Peter Geoghegan

#14 Thomas Munro
thomas.munro@gmail.com
In reply to: Peter Geoghegan (#13)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 4:58 PM Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Sep 19, 2022 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

That sounds very much like the 1980s graduate student equivalent of
JSON to my ears.

Yeah. Easy data interchange on Lisp systems is built in: just write
objects into a socket/file/whatever and read them back, as people now
do with JSON/XML/whatever. That's the format we see here.

JSON is generally manipulated as native Javascript/python/whatever
lists, maps, and strings. It's an interchange format that tries not to
be obtrusive in the same way as things like XML always are, at the
cost of making things kinda dicey for things like numeric precision
(unless you can account for everything). Isn't that...basically the
same concept as the lisp-y dump format, at a high level?

Yes, s-expressions and JSON are absolutely the same concept; simple
representation of simple data structures of a dynamically typed
language. There's even a chain of events connecting the two: JSON is
roughly the literal data syntax from Javascript's grammar, and
Javascript is the language that Brendan Eich developed after Netscape
hired him to do an embedded Lisp (Scheme) for the browser, except they
decided at some point to change tack and make their new language have
a surface grammar more like Java, the new hotness. If the goal was to
make sure it caught on, it's hard to conclude they were wrong...

#15 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Tom Lane (#12)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

+1 for considering storage format and text format separately.

Let's consider what our criteria could be for the storage format.

1) Storage effectiveness (shorter is better) and
serialization/deserialization effectiveness (faster is better). On
this criterion, the custom binary format looks perfect.
2) Robustness in the case of corruption. It seems much easier to
detect the data corruption and possibly make some partial manual
recovery for textual format.
3) Standardness. It's better to use something known worldwide or at
least used in other parts of PostgreSQL than something completely
custom. From this perspective, JSON/JSONB is better than custom
things.

------
Regards,
Alexander Korotkov

#16 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Alexander Korotkov (#15)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 1:00 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

+1 for considering storage format and text format separately.

Let's consider what our criteria could be for the storage format.

1) Storage effectiveness (shorter is better) and
serialization/deserialization effectiveness (faster is better). On
this criterion, the custom binary format looks perfect.
2) Robustness in the case of corruption. It seems much easier to
detect the data corruption and possibly make some partial manual
recovery for textual format.
3) Standardness. It's better to use something known worldwide or at
least used in other parts of PostgreSQL than something completely
custom. From this perspective, JSON/JSONB is better than custom
things.

(sorry, I've accidentally cut the last paragraph from the message)

It seems that there is no perfect fit for this multi-criteria
optimization, and we should pick what is more important. Any
thoughts?

------
Regards,
Alexander Korotkov

#17 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Alexander Korotkov (#15)
Re: Proposal to use JSON for Postgres Parser format

On Tue, 20 Sept 2022 at 12:00, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Tue, Sep 20, 2022 at 7:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

On Mon, Sep 19, 2022 at 8:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Our existing format is certainly not great on those metrics, but
I do not see how "let's use JSON!" is a route to improvement.

The existing format was designed with developer convenience as a goal,
though -- despite my complaints, and in spite of your objections.

As Munro adduces nearby, it'd be a stretch to conclude that the current
format was designed with any Postgres-related goals in mind at all.
I think he's right that it's a variant of some Lisp-y dump format that's
probably far hoarier than even Berkeley Postgres.

If it didn't have to be easy (or even practical) for developers to
directly work with the output format, then presumably the format used
internally could be replaced with something lower level and faster. So
it seems like the two goals (developer ergonomics and faster
interchange format for users) might actually be complementary.

I think the principal mistake in what we have now is that the storage
format is identical to the "developer friendly" text format (plus or
minus some whitespace). First we need to separate those. We could
have more than one equivalent text format perhaps, and I don't have
any strong objection to basing the text format (or one of them) on
JSON.

+1 for considering storage format and text format separately.

Let's consider what our criteria could be for the storage format.

1) Storage effectiveness (shorter is better) and
serialization/deserialization effectiveness (faster is better). On
this criterion, the custom binary format looks perfect.
2) Robustness in the case of corruption. It seems much easier to
detect the data corruption and possibly make some partial manual
recovery for textual format.
3) Standardness. It's better to use something known worldwide or at
least used in other parts of PostgreSQL than something completely
custom. From this perspective, JSON/JSONB is better than custom
things.

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.
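
For reference, this is the sort of measurement I mean (runnable as-is,
since pg_node_tree can be cast to text):

-- on-disk size of pg_rewrite, including TOAST
SELECT pg_size_pretty(pg_total_relation_size('pg_rewrite'));
-- raw, uncompressed size of the serialized node trees
SELECT pg_size_pretty(sum(octet_length(ev_action::text))::bigint)
FROM pg_rewrite;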

In that same thread, I also suggested that we could try to not emit a
Node's fields if they contain their default values while serializing;
such as the common `:location -1` or `:mynodefield <>`. Those fields
still take up space in the format, while conveying no interesting
information (the absense of that field in the struct definition would
convey the same). It would be useful if this new serialized format
would allow us to do similar tricks cheaply.

As for JSON vs JSONB for storage:
I'm fairly certain that JSONB is less compact than JSON (without
taking compression into the picture) due to the 4-byte guaranteed
overhead for each jsonb element; while for JSON that is only 2 bytes
for each (up to 3 when you consider separators, plus potential extra
overhead for escaped values that are unlikely to appear in our catalogs).
Some numbers can be stored more efficiently in JSONB, but only large
numbers and small fractions that we're unlikely to hit in system
views: a back-of-the-envelope calculation puts the cutoff point of
efficient storage between strings-of-decimals and Numeric at >10^12, <
-10^11, or very precise fractional values.
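
A quick way to see that per-element overhead (a rough sketch; exact
numbers vary by version and platform):

SELECT pg_column_size('{"location": -1, "relname": "foo"}'::json)
           AS json_bytes,
       pg_column_size('{"location": -1, "relname": "foo"}'::jsonb)
           AS jsonb_bytes;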

Kind regards,

Matthias van de Meent

[0]: /messages/by-id/CAEze2WgGexDM63dOvndLdAWwA6uSmSsc97jmrCuNmrF1JEDK7w@mail.gmail.com

#18 Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Matthias van de Meent (#17)
Re: Proposal to use JSON for Postgres Parser format

On 2022-Sep-20, Matthias van de Meent wrote:

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

True. Currently, the largest ev_action values compress pretty well. I
think if we wanted this to be more succint, we would have to invent some
binary format -- perhaps something like Protocol Buffers: it'd be stored
in the binary format in catalogs, but for output it would be converted
into something easy to read (we already do this for
pg_statistic_ext_data for example). We'd probably lose compressibility,
but that'd be okay because the binary format would already remove most
of the redundancy by nature.

Do we want to go there?

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Java is clearly an example of money oriented programming" (A. Stepanov)

#19 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Alvaro Herrera (#18)
Re: Proposal to use JSON for Postgres Parser format

On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

On 2022-Sep-20, Matthias van de Meent wrote:

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

True. Currently, the largest ev_action values compress pretty well. I
think if we wanted this to be more succinct, we would have to invent some
binary format -- perhaps something like Protocol Buffers: it'd be stored
in the binary format in catalogs, but for output it would be converted
into something easy to read (we already do this for
pg_statistic_ext_data for example). We'd probably lose compressibility,
but that'd be okay because the binary format would already remove most
of the redundancy by nature.

Do we want to go there?

I don't think that a binary format would be much better for
debugging/fixing than an optimization of the current textual format
when combined with compression. As I mentioned in that thread, there
is a lot of improvement possible with the existing format, and I think
any debugging of serialized nodes would greatly benefit from using a
textual format.

Then again, I also agree that this argument doesn't hold its weight
when storage and output formats are going to be different. I trust
that any new tooling introduced as a result of this thread will be
better than what we have right now.

As for best format: I don't know. The current format is usable, and a
better format would not store any data for default values. JSON can do
that, but I could think of many formats that could do the same (Smile,
BSON, xml, etc.).

I do not think that protobuf is the best choice for storage, though,
because it has its own rules on what it considers a default value and
what it does or does not serialize: zero is considered the only
default for numbers, as is the empty string for text, etc.
I think it is all right for general use, but with e.g. `location: -1`
in just about every parse node we'd probably want to select our own
values to ignore during (de)serialization of fields.

Kind regards,

Matthias van de Meent

#20 Michel Pelletier
pelletier.michel@gmail.com
In reply to: Matthias van de Meent (#19)
Re: Proposal to use JSON for Postgres Parser format

On Wed, Sep 21, 2022 at 11:04 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

On Tue, 20 Sept 2022 at 17:29, Alvaro Herrera <alvherre@alvh.no-ip.org>
wrote:

On 2022-Sep-20, Matthias van de Meent wrote:

Allow me to add: compressibility

In the thread surrounding [0] there were complaints about the size of
catalogs, and specifically the template database. Significant parts of
that (688kB of 8080kB a fresh PG14 database) are in pg_rewrite, which
consists mostly of serialized Nodes. If we're going to replace our
current NodeToText infrastructure, we'd better know we can effectively
compress this data.

True. Currently, the largest ev_action values compress pretty well. I
think if we wanted this to be more succinct, we would have to invent some
binary format -- perhaps something like Protocol Buffers: it'd be stored
in the binary format in catalogs, but for output it would be converted
into something easy to read (we already do this for
pg_statistic_ext_data for example). We'd probably lose compressibility,
but that'd be okay because the binary format would already remove most
of the redundancy by nature.

Do we want to go there?

I don't think that a binary format would be much better for
debugging/fixing than an optimization of the current textual format
when combined with compression.

I agree, JSON is not perfect, but it compresses and it's usable
everywhere. My personal need for this is purely developer experience
and, as Tom pointed out, a "niche" need for sure, but we are starting
to do some serious work with Dan Lynch's plpgsql deparser tool to
generate RLS policies from meta schema models, and having the same
format come out of the parser would make a complete end-to-end
solution for us, especially if
we can get this data from a function in a ddl_command_start event trigger.
Dan also writes a popular deparser for Javascript, and unifying the formats
across these tools would be a big win for us.

As I mentioned in that thread, there
is a lot of improvement possible with the existing format, and I think
any debugging of serialized nodes would greatly benefit from using a
textual format.

Agreed.

Then again, I also agree that this argument doesn't hold its weight
when storage and output formats are going to be different. I trust
that any new tooling introduced as a result of this thread will be
better than what we have right now.

Separating formats seems like a lot of work to me, to get what might not be
a huge improvement over compressing JSON, for what seems unlikely to be
more than a few megabytes of parsed SQL.

As for best format: I don't know. The current format is usable, and a
better format would not store any data for default values. JSON can do
that, but I could think of many formats that could do the same (Smile,
BSON, xml, etc.).

I do not think that protobuf is the best choice for storage, though,
because it has its own rules on what it considers a default value and
what it does or does not serialize: zero is considered the only
default for numbers, as is the empty string for text, etc.
I think it is all right for general use, but with e.g. `location: -1`
in just about every parse node we'd probably want to select our own
values to ignore during (de)serialization of fields.

Agreed.

Thank you everyone who has contributed to this thread, I'm pleased that it
got a very spirited debate and I apologize for the delay in getting back to
everyone.

I'd like to spike on a proposed patch that:

- Converts the existing text format to JSON (or possibly jsonb,
considering feedback from this thread)
- Can be stored compressed
- Can be passed to a ddl_command_start event trigger with a function (a
sketch of this piece follows below).
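
As a sketch of that last item (pg_parse_tree_json() is a stand-in name
for whatever function the patch would actually add; nothing like it
exists yet):

CREATE FUNCTION log_parse_tree() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- hypothetical function returning the command's parse tree as JSON
    RAISE NOTICE 'parse tree: %', pg_parse_tree_json();
END;
$$;

CREATE EVENT TRIGGER log_ddl_parse_tree
    ON ddl_command_start
    EXECUTE FUNCTION log_parse_tree();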

Thoughts?

-Michel

#21 Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#2)
Re: Proposal to use JSON for Postgres Parser format

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

There are certainly reasons to think about changing the node tree
storage format; but if we change it, I'd like to see it go to something
more compact not more verbose.

Very much seconded - the various pg_node_trees are a quite significant
fraction of the overall size of an empty database. And they're not
particularly useful for a human either.

IIRC it's not just catalog storage that's affected; it's also relevant
for parallel query.

My pet peeve is the way datums are output as individual bytes, each
printed as an integer; for narrow fixed-width datums that includes a
lot of 0s for bytes that aren't even used in the datum.
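
To make that concrete (output abridged; from a 64-bit little-endian
build, exact bytes vary):

CREATE TEMP VIEW v AS SELECT 42::int4 AS x;
SELECT ev_action::text FROM pg_rewrite WHERE ev_class = 'v'::regclass;
-- contains, among much else: :constvalue 4 [ 42 0 0 0 0 0 0 0 ]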

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Greetings,

Andres Freund

#22 Andrew Dunstan
andrew@dunslane.net
In reply to: Andres Freund (#21)
Re: Proposal to use JSON for Postgres Parser format

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

#23 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Andrew Dunstan (#22)
Re: Proposal to use JSON for Postgres Parser format

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

------
Regards,
Alexander Korotkov

#24 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Alexander Korotkov (#23)
Re: Proposal to use JSON for Postgres Parser format

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

I think the proposal of Tom's that we're +1'ing is a pg_node_tree-to-json
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure on changing the storage format of pg_node_tree to json.

Kind regards,

Matthias van de Meent

#25 Michel Pelletier
pelletier.michel@gmail.com
In reply to: Matthias van de Meent (#24)
Re: Proposal to use JSON for Postgres Parser format

On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com>
wrote:

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion
function that converts whatever the catalog storage format is to some
JSON equivalent. That would address the needs of external code that
doesn't want to write a custom parser, while not tying us directly to
JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

I think the proposal of Tom's that we're +1'ing is a pg_node_tree-to-json
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I agree.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure on changing the storage format of pg_node_tree to json.

I'm going to spike on this function and will get back to the thread with
any updates.

Thank you!

-Michel

#26 Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#11)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Sep 20, 2022 at 4:16 PM Thomas Munro <thomas.munro@gmail.com> wrote:

To explain my earlier guess: reader code for #S(STRUCTNAME ...) can
be seen here, though it's being lexed as "PLAN_SYM" so perhaps the
author of that C already didn't know that was a general syntax for
Lisp structs. (Example: at a Lisp prompt, if you write (defstruct foo
x y z) then (make-foo :x 1 :y 2 :z 3), the resulting object will be
printed as #S(FOO :x 1 :y 2 :z 3), so I'm guessing that the POSTGRES
Lisp code, which sadly (for me) was ripped out before even that repo
IIUC, must have used defstruct-based plans.)

That defstruct guess is confirmed by page 36 and nearby of
https://dsf.berkeley.edu/papers/UCB-MS-zfong.pdf.

#27 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Michel Pelletier (#25)
Re: Proposal to use JSON for Postgres Parser format

On Mon, 31 Oct 2022 at 15:56, Michel Pelletier
<pelletier.michel@gmail.com> wrote:

On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

I think the proposal of Tom's that we're +1'ing is a pg_node_tree-to-json
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I agree.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure on changing the storage format of pg_node_tree to json.

I'm going to spike on this function and will get back to the thread with any updates.

Michel, did you get a result from this spike?

I'm asking because I've now spiked most of my ideas on updating the node
text format, and am working on wrapping them up into a patch (or
patchset) later this week. The ideas for this are:

1. Don't write fields with default values for their types, such as
NULL for Node* fields;
2. Reset location fields before transforming the node tree to text
when we don't have a copy of the original query, which removes
location fields from serialization with step 1;
3. Add default() node labels to struct fields that do not share the
field type's default, allowing more fields to be omitted with step 1;
4. Add a special default_ref() pg_node_attr for node fields that default
to another node field's value, used in Var's varnosyn/varattnosyn as
referring to varno/varattno; and
5. Truncate trailing 0s in Const's outDatum notation of by-ref types,
so that e.g. Consts with `name` data don't waste so much space with 0s

Currently, it reduces the pg_total_relation_size metric of pg_rewrite
after TOAST compression by 35% vs pg16, down to 483328 bytes / 59
pages, from 753664 bytes / 92 pages. The raw size of the ev_action
column's data (that is, before compression) is reduced by 55% to
1.18MB (from 2.80MB), and the largest default shipped row (the
information_schema.columns view) in that table is reduced to 'only'
78kB raw, from 193kB.

RW performance hasn't been tested yet, so that is still to be determined...

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#28 jian he
jian.universality@gmail.com
In reply to: Matthias van de Meent (#27)
Re: Proposal to use JSON for Postgres Parser format

On Tue, Dec 5, 2023 at 12:45 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 31 Oct 2022 at 15:56, Michel Pelletier
<pelletier.michel@gmail.com> wrote:

On Mon, Oct 31, 2022 at 6:15 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:

On Mon, 31 Oct 2022 at 13:46, Alexander Korotkov <aekorotkov@gmail.com> wrote:

On Fri, Oct 28, 2022 at 4:27 PM Andrew Dunstan <andrew@dunslane.net> wrote:

On 2022-10-27 Th 19:38, Andres Freund wrote:

Hi,

On 2022-09-19 22:29:15 -0400, Tom Lane wrote:

Maybe a compromise could be found whereby we provide a conversion function
that converts whatever the catalog storage format is to some JSON
equivalent. That would address the needs of external code that doesn't want
to write a custom parser, while not tying us directly to JSON.

+1

Agreed.

+1

Michel, it seems that you now have a green light to implement a
node-to-json function.

I think the proposal of Tom's that we're +1'ing is a pg_node_tree-to-json
SQL function / cast, which is tangentially related to the "nodeToJson
/ changing the storage format of pg_node_tree to json" proposal, but
not the same.

I agree.

I will add my +1 to Tom's proposal for that function/cast, but I'm not
sure on changing the storage format of pg_node_tree to json.

I'm going to spike on this function and will get back to the thread with any updates.

Michel, did you get a result from this spike?

I'm asking because I've now spiked most of my ideas on updating the node
text format, and am working on wrapping them up into a patch (or
patchset) later this week. The ideas for this are:

1. Don't write fields with default values for their types, such as
NULL for Node* fields;
2. Reset location fields before transforming the node tree to text
when we don't have a copy of the original query, which removes
location fields from serialization with step 1;
3. Add default() node labels to struct fields that do not share the
field type's default, allowing more fields to be omitted with step 1;
4. Add a special default_ref() pg_node_attr for node fields that default
to another node field's value, used in Var's varnosyn/varattnosyn as
referring to varno/varattno; and
5. Truncate trailing 0s in Const's outDatum notation of by-ref types,
so that e.g. Consts with `name` data don't waste so much space with 0s

can you share the draft patch, if it is still there?