PATCH: jsonpath string methods: lower, upper, initcap, l/r/btrim, replace, split_part
Hello hackers,
This patch is a follow-up and generalization to [0].
It adds the following jsonpath methods: lower, upper, initcap, l/r/btrim,
replace, split_part.
It makes jsonpath able to support expressions like these:
select jsonb_path_query('" hElLo WorlD "', '$.btrim().lower().upper().lower().replace("hello","bye") starts with "bye"');
select jsonb_path_query('"abc~@~def~@~ghi"', '$.split_part("~@~", 2)')
They, of course, forward their implementation to the internal
pg_proc-registered function.
As a first wip/poc I've picked the functions I typically need to clean up
JSON data.
I've also added a README.jsonpath with documentation on how to add a new
jsonpath method.
If I had this available when I started, it would have saved me some time.
So, I am leaving it here for the next hacker.
This patch is not particularly intrusive to existing code:
Afaict, the only struct I've touched is JsonPathParseItem, where I added
{ JsonPathParseItem *arg0, *arg1; } method_args.
Up until now, most of the jsonpath methods that accept arguments rely on
left/right operands,
which works, but dedicated argument slots could be more convenient for future,
more complex methods.
I've also added the appropriate jspGetArgX(JsonPathItem *v, JsonPathItem
*a).
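A rough sketch of what such explicit argument slots might look like, in isolation. ToyParseItem and toy_get_arg are hypothetical stand-ins made up for illustration; they are not the patch's JsonPathParseItem or jspGetArgX, and the real struct carries many more members.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a jsonpath parse item. */
typedef struct ToyParseItem ToyParseItem;

struct ToyParseItem
{
    int         type;       /* item kind, e.g. a replace() method */
    const char *strval;     /* payload for string literals, else NULL */

    /* Explicit argument slots instead of reusing left/right operands. */
    struct
    {
        ToyParseItem *arg0;
        ToyParseItem *arg1;
    }           method_args;
};

/*
 * Hypothetical accessor: return argument n (0 or 1) of a method item,
 * or NULL when the item is NULL, the slot is empty, or n is out of range.
 */
static ToyParseItem *
toy_get_arg(const ToyParseItem *item, int n)
{
    if (item == NULL)
        return NULL;
    if (n == 0)
        return item->method_args.arg0;
    if (n == 1)
        return item->method_args.arg1;
    return NULL;
}
```

With this shape, a two-argument method such as replace("hello","bye") would keep its "from" string in arg0 and its "to" string in arg1, rather than overloading binary-operator fields.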
Open items
- What happens if the jsonpath standard adds a new method by the same name?
A.D. mentioned this in [0] with the proposal of having a prefix like pg_ or
initial-upper letter.
- Still using the default collation like the rest of the jsonpath code.
- documentation N/A yet
- I do realize that the process of adding a new method sketches an
imaginary CREATE JSONPATH FUNCTION. This has been on the back of my mind for some
time now,
but I can't say I have an action plan for this yet.
GitHub PR view if you prefer:
https://github.com/Florents-Tselai/postgres/pull/18
[0]: /messages/by-id/185BF814-9225-46DB-B1A1-6468CF2C8B63@justatheory.com
All the best,
Flo
Attachments:
v1-0001-This-patch-adds-the-following-string-processing-m.patch (application/octet-stream, +1048 -5)
Florents Tselai <florents.tselai@gmail.com> writes:
This patch is a follow-up and generalization to [0].
It adds the following jsonpath methods: lower, upper, initcap, l/r/btrim,
replace, split_part.
How are you going to deal with the fact that this makes jsonpath
operations not guaranteed immutable? (See commit cb599b9dd
for some context.) Those are all going to have behavior that's
dependent on the underlying locale.
We have the kluge of having separate "_tz" functions to support
non-immutable datetime operations, but that way doesn't seem like
it's going to scale well to multiple sources of mutability.
regards, tom lane
On Thu, Sep 26, 2024 at 12:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Florents Tselai <florents.tselai@gmail.com> writes:
This patch is a follow-up and generalization to [0].
It adds the following jsonpath methods: lower, upper, initcap, l/r/btrim,
replace, split_part.
How are you going to deal with the fact that this makes jsonpath
operations not guaranteed immutable? (See commit cb599b9dd
for some context.) Those are all going to have behavior that's
dependent on the underlying locale.
We have the kluge of having separate "_tz" functions to support
non-immutable datetime operations, but that way doesn't seem like
it's going to scale well to multiple sources of mutability.
While inventing the "_tz" functions, I was thinking about the jsonpath methods
and operators defined in the standard at the time. Now I see huge interest in
extending that. I wonder if we can introduce a notion of flexible
mutability? Imagine that the jsonb_path_query() function (and others) has
another function which analyzes the arguments and reports mutability. If the
jsonpath argument is constant and all methods inside are safe, then
jsonb_path_query() is immutable; otherwise it is stable. I was
thinking about that back when working on jsonpath, but at that time the problem
seemed too limited for this kind of solution. Now, it's possibly time
to shake off the dust from this idea. What do you think?
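To make the idea a bit more concrete, here is a toy sketch of such an analysis in C. Everything in it (the Toy* names, the linked-list path representation, and in particular which methods count as "safe") is an assumption for illustration only; it is not existing PostgreSQL code, and deciding which methods are actually safe is exactly the open question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical item kinds of a parsed path. */
typedef enum
{
    TOY_ROOT,           /* $ */
    TOY_KEY,            /* .key accessor */
    TOY_LOWER,          /* .lower(), assumed locale-dependent here */
    TOY_DATETIME_TZ     /* datetime operation that needs the time zone */
} ToyItemKind;

typedef struct ToyItem
{
    ToyItemKind     kind;
    struct ToyItem *next;   /* next step in the path, or NULL */
} ToyItem;

typedef enum
{
    TOY_IMMUTABLE,
    TOY_STABLE
} ToyVolatility;

/* Assumed classification: only plain accessors are treated as safe. */
static bool
toy_item_is_safe(ToyItemKind kind)
{
    return kind == TOY_ROOT || kind == TOY_KEY;
}

/*
 * Report immutable only if the path is a constant and every step is
 * safe; in all other cases fall back to stable.
 */
static ToyVolatility
toy_path_volatility(const ToyItem *path, bool path_is_constant)
{
    if (!path_is_constant)
        return TOY_STABLE;

    for (const ToyItem *it = path; it != NULL; it = it->next)
    {
        if (!toy_item_is_safe(it->kind))
            return TOY_STABLE;
    }
    return TOY_IMMUTABLE;
}
```

A path like $.a would be reported as immutable under these assumptions, while $.a.lower() or any non-constant path argument would be reported as stable.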
------
Regards,
Alexander Korotkov
Supabase
Hi, Florents!
On Wed, Sep 25, 2024 at 9:18 PM Florents Tselai
<florents.tselai@gmail.com> wrote:
This patch is a follow-up and generalization to [0].
It adds the following jsonpath methods: lower, upper, initcap, l/r/btrim, replace, split_part.
It makes jsonpath able to support expressions like these:
select jsonb_path_query('" hElLo WorlD "', '$.btrim().lower().upper().lower().replace("hello","bye") starts with "bye"');
select jsonb_path_query('"abc~@~def~@~ghi"', '$.split_part("~@~", 2)')
Did you check whether these new methods are now in the SQL standard, or in a
draft of the SQL standard?
------
Regards,
Alexander Korotkov
Supabase
On Thu, Sep 26, 2024 at 1:55 PM Alexander Korotkov <aekorotkov@gmail.com>
wrote:
On Thu, Sep 26, 2024 at 12:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Florents Tselai <florents.tselai@gmail.com> writes:
This patch is a follow-up and generalization to [0].
It adds the following jsonpath methods: lower, upper, initcap, l/r/btrim,
replace, split_part.
How are you going to deal with the fact that this makes jsonpath
operations not guaranteed immutable? (See commit cb599b9dd
for some context.) Those are all going to have behavior that's
dependent on the underlying locale.
We have the kluge of having separate "_tz" functions to support
non-immutable datetime operations, but that way doesn't seem like
it's going to scale well to multiple sources of mutability.
While inventing the "_tz" functions, I was thinking about the jsonpath methods
and operators defined in the standard at the time. Now I see huge interest in
extending that. I wonder if we can introduce a notion of flexible
mutability? Imagine that the jsonb_path_query() function (and others) has
another function which analyzes the arguments and reports mutability. If the
jsonpath argument is constant and all methods inside are safe, then
jsonb_path_query() is immutable; otherwise it is stable. I was
thinking about that back when working on jsonpath, but at that time the problem
seemed too limited for this kind of solution. Now, it's possibly time
to shake off the dust from this idea. What do you think?
------
Regards,
Alexander Korotkov
Supabase
In case you're having a deja vu, while researching this
I did come across [0], where this was discussed back in 2019.
In this patch I've conveniently left jspIsMutable and jspIsMutableWalker
untouched and under the rug,
but for the few seconds I pondered over this, the best answer I came up with
was
a simple heuristic along the lines of what Alexander says above:
if all elements are safe, then the whole jsp is immutable.
If we really want to tackle this and make jsonpath richer though,
I don't think we can avoid being a little more flexible/explicit wrt
mutability.
Speaking of extensible: the jsonpath standard does mention function
extensions [1],
so it looks like we're covered by the standard, and the mutability aspect
is an implementation detail. No?
And having said that, the whole jsonb/jsonpath parser/executor
infrastructure is extremely powerful
and kinda under-utilized if we use it "only" for jsonpath.
Tbh, I can see it supporting more specific DSLs and even offering hooks for
extensions.
And I know for certain I'm not the only one thinking about this.
See [2] for example, where they've lifted, shifted and renamed the
jsonb/jsonpath infra to build a separate language for graphs.
[0]: /messages/by-id/CAPpHfdvDci4iqNF9fhRkTqhe-5_8HmzeLt56drH+_Rv2rNRqfg@mail.gmail.com
[1]: https://www.rfc-editor.org/rfc/rfc9535.html#name-function-extensions
[2]: https://github.com/apache/age/blob/master/src/include/utils/agtype.h
On Sep 26, 2024, at 13:59, Florents Tselai <florents.tselai@gmail.com> wrote:
Speaking of extensible: the jsonpath standard does mention function extensions [1] ,
so it looks like we're covered by the standard, and the mutability aspect is an implementation detail. No?
That’s not the standard used for Postgres jsonpath. Postgres follows the SQL/JSON standard in the SQL standard, which is not publicly available, but a few people on the list have copies they’ve purchased and so could provide some context.
In a previous post I wondered if the SQL standard had some facility for function extensions, but I suspect not. Maybe in the next iteration?
And having said that, the whole jsonb/jsonpath parser/executor infrastructure is extremely powerful
and kinda under-utilized if we use it "only" for jsonpath.
Tbh, I can see it supporting more specific DSLs and even offering hooks for extensions.
And I know for certain I'm not the only one thinking about this.
See [2] for example where they've lifted, shifted and renamed the jsonb/jsonpath infra to build a separate language for graphs
I’m all for extensibility, though jsonpath does need to continue to comply with the SQL standard. Do you have some idea of the sorts of hooks that would allow extension authors to use some of that underlying capability?
Best,
David
On 27 Sep 2024, at 12:45 PM, David E. Wheeler <david@justatheory.com> wrote:
On Sep 26, 2024, at 13:59, Florents Tselai <florents.tselai@gmail.com> wrote:
Speaking of extensible: the jsonpath standard does mention function extensions [1] ,
so it looks like we're covered by the standard, and the mutability aspect is an implementation detail. No?
That’s not the standard used for Postgres jsonpath. Postgres follows the SQL/JSON standard in the SQL standard, which is not publicly available, but a few people on the list have copies they’ve purchased and so could provide some context.
In a previous post I wondered if the SQL standard had some facility for function extensions, but I suspect not. Maybe in the next iteration?
And having said that, the whole jsonb/jsonpath parser/executor infrastructure is extremely powerful
and kinda under-utilized if we use it "only" for jsonpath.
Tbh, I can see it supporting more specific DSLs and even offering hooks for extensions.
And I know for certain I'm not the only one thinking about this.
See [2] for example where they've lifted, shifted and renamed the jsonb/jsonpath infra to build a separate language for graphs
I’m all for extensibility, though jsonpath does need to continue to comply with the SQL standard. Do you have some idea of the sorts of hooks that would allow extension authors to use some of that underlying capability?
Re-tracing what I had to do:
1. Define a new JsonPathItemType jpiMyExtType and map it to a JsonPathKeyword.
2. Add a new JsonPathKeyword and make the lexer and parser aware of it.
3. Tell the main executor executeItemOptUnwrapTarget what to do when the new type is matched.
I think steps 1 and 2 are the trickiest because they require hooks into the lexer jsonpath_scan.l and the parser jsonpath_gram.y.
Step 3 is the meat of a potential hook, which would be something like
extern JsonPathExecResult executeOnMyJsonpathItem(JsonPathExecContext *cxt, JsonbValue *jb, JsonValueList *found);
This should be called by the main executor executeItemOptUnwrapTarget when it encounters case jpiMyExtType.
It looks like quite an endeavor, to be honest.
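For what it's worth, the executor side (step 3) could be pictured roughly like this. It is a self-contained toy, not PostgreSQL code: the Toy* types, toy_register_hook, and the hook signature are all invented for illustration, and it sidesteps the harder lexer/parser part entirely.

```c
#include <stddef.h>

typedef enum
{
    TOY_EXEC_OK,
    TOY_EXEC_NOT_FOUND,
    TOY_EXEC_ERROR
} ToyExecResult;

/* Hypothetical item types: one built-in and one extension-defined. */
typedef enum
{
    TOY_JPI_KEY = 0,
    TOY_JPI_MY_EXT_TYPE,
    TOY_JPI_COUNT
} ToyItemType;

typedef struct ToyExecContext
{
    int         found_count;    /* number of results produced so far */
} ToyExecContext;

typedef ToyExecResult (*ToyItemHook) (ToyExecContext *cxt, const char *value);

/* One optional hook per item type; all NULL until registered. */
static ToyItemHook toy_hooks[TOY_JPI_COUNT];

static void
toy_register_hook(ToyItemType type, ToyItemHook hook)
{
    toy_hooks[type] = hook;
}

/* Toy "main executor": handles built-ins inline, defers the rest to hooks. */
static ToyExecResult
toy_execute_item(ToyExecContext *cxt, ToyItemType type, const char *value)
{
    switch (type)
    {
        case TOY_JPI_KEY:
            cxt->found_count++;
            return TOY_EXEC_OK;
        default:
            if ((int) type >= 0 && (int) type < TOY_JPI_COUNT &&
                toy_hooks[type] != NULL)
                return toy_hooks[type] (cxt, value);
            return TOY_EXEC_ERROR;  /* unknown item type, no hook installed */
    }
}

/* Example extension hook: treats any non-empty string as a match. */
static ToyExecResult
toy_my_ext_hook(ToyExecContext *cxt, const char *value)
{
    if (value == NULL || value[0] == '\0')
        return TOY_EXEC_NOT_FOUND;
    cxt->found_count++;
    return TOY_EXEC_OK;
}
```

An extension would call toy_register_hook(TOY_JPI_MY_EXT_TYPE, toy_my_ext_hook) once at load time; until then the executor rejects the unknown item type.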
On Thu, Sep 26, 2024 at 1:55 PM Alexander Korotkov <aekorotkov@gmail.com>
wrote:
On Thu, Sep 26, 2024 at 12:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Florents Tselai <florents.tselai@gmail.com> writes:
This patch is a follow-up and generalization to [0].
It adds the following jsonpath methods: lower, upper, initcap, l/r/btrim,
replace, split_part.
How are you going to deal with the fact that this makes jsonpath
operations not guaranteed immutable? (See commit cb599b9dd
for some context.) Those are all going to have behavior that's
dependent on the underlying locale.
We have the kluge of having separate "_tz" functions to support
non-immutable datetime operations, but that way doesn't seem like
it's going to scale well to multiple sources of mutability.
While inventing the "_tz" functions, I was thinking about the jsonpath methods
and operators defined in the standard at the time. Now I see huge interest in
extending that. I wonder if we can introduce a notion of flexible
mutability? Imagine that the jsonb_path_query() function (and others) has
another function which analyzes the arguments and reports mutability. If the
jsonpath argument is constant and all methods inside are safe, then
jsonb_path_query() is immutable; otherwise it is stable. I was
thinking about that back when working on jsonpath, but at that time the problem
seemed too limited for this kind of solution. Now, it's possibly time
to shake off the dust from this idea. What do you think?
I was thinking about taking another stab at this.
Would someone more versed in the inner workings of jsonpath like to weigh
in on the immutability wrt locale?
On Wed, Mar 5, 2025 at 2:30 PM Florents Tselai
<florents.tselai@gmail.com> wrote:
I was thinking about taking another stab at this.
Would someone more versed in the inner workings of jsonpath like to weigh in on the immutability wrt locale?
I'm not sure the issues with immutability here are particularly
related to jsonpath -- I think they may just be general problems with
our framework for immutability.
I always struggle a bit to remember our policy on these issues -- to
the best of my knowledge, we haven't documented it anywhere, and I
think we probably should. I believe the way it works is that whenever
a function depends on the operating system's timestamp or locale
definitions, we decide it has to be stable, not immutable. We don't
expect those things to be updated very often, but we know sometimes
they do get updated.
Now apparently what we've done for time zones is we have both
jsonb_path_exists and jsonb_path_exists_tz, and the former only supports
things that are truly immutable while the latter additionally supports
things that depend on time zone, and are thus marked stable.
I suppose we could just add support for these locale-dependent
operations to the "tz" version and have them error out in the non-tz
version. After all, the effect of depending on time zone is, as far as
I know, the same as the effect of depending on locale: the function
can't be immutable any more. The only real problem with that idea, at
least to my knowledge, is that the function naming makes you think
that it's just about time zones and not about anything else. Maybe
that's a wart we can live with?
Tom writes earlier in the thread that:
# We have the kluge of having separate "_tz" functions to support
# non-immutable datetime operations, but that way doesn't seem like
# it's going to scale well to multiple sources of mutability.
But I'm not sure I understand why it matters that there are multiple
sources of mutability here. Maybe I'm missing a piece of the puzzle
here.
--
Robert Haas
EDB: http://www.enterprisedb.com
On May 9, 2025, at 15:50, Robert Haas <robertmhaas@gmail.com> wrote:
# We have the kluge of having separate "_tz" functions to support
# non-immutable datetime operations, but that way doesn't seem like
# it's going to scale well to multiple sources of mutability.
But I'm not sure I understand why it matters that there are multiple
sources of mutability here. Maybe I'm missing a piece of the puzzle
here.
I read that to mean “we’re not going to add another jsonb_path_exists_* function for every potentially mutable JSONPath function.” But I take your point that it could be generalized for *any* mutable function. In which case maybe it should be renamed?
Best,
David
On 13 May 2025, at 2:07 PM, David E. Wheeler <david@justatheory.com> wrote:
On May 9, 2025, at 15:50, Robert Haas <robertmhaas@gmail.com> wrote:
# We have the kluge of having separate "_tz" functions to support
# non-immutable datetime operations, but that way doesn't seem like
# it's going to scale well to multiple sources of mutability.
But I'm not sure I understand why it matters that there are multiple
sources of mutability here. Maybe I'm missing a piece of the puzzle
here.
I read that to mean “we’re not going to add another jsonb_path_exists_* function for every potentially mutable JSONPath function.” But I take your point that it could be generalized for *any* mutable function. In which case maybe it should be renamed?
Best,
David
We discussed this a bit during the APFS:
As Robert said—and I agree—renaming the existing _tz family would be more trouble than it’s worth, given the need for deprecations, migration paths, etc. If we were designing this today, suffixes like _stable or _volatile might have been more appropriate, but at this point, we’re better off staying consistent with the _tz family.
So the path forward seems to be:
- Put these new functions under the jsonb_path_*_tz family.
- Raise an error if they’re used in the non-_tz versions.
- Document this behavior clearly.
I’ll make sure to follow the patterns in the existing _tz functions closely.
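As a rough illustration of the "raise an error in the non-_tz versions" rule, here is a standalone C sketch. The ToyMethod enum, the use_tz flag, and the error text are assumptions made up for this example; the actual patch would presumably report the error through the backend's usual error machinery rather than return a string.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of jsonpath methods. */
typedef enum
{
    TOY_M_SIZE,         /* .size(): no locale dependence */
    TOY_M_LOWER,        /* .lower() */
    TOY_M_UPPER,        /* .upper() */
    TOY_M_INITCAP       /* .initcap() */
} ToyMethod;

/* Assumed classification of which methods depend on the locale. */
static bool
toy_method_is_locale_dependent(ToyMethod m)
{
    return m == TOY_M_LOWER || m == TOY_M_UPPER || m == TOY_M_INITCAP;
}

/*
 * Return NULL if the method may run, or an error message if it is
 * locale-dependent and the caller is not one of the _tz variants.
 */
static const char *
toy_check_method_allowed(ToyMethod m, bool use_tz)
{
    if (toy_method_is_locale_dependent(m) && !use_tz)
        return "locale-dependent jsonpath method requires a _tz function variant";
    return NULL;
}
```

Here use_tz stands for "called through a jsonb_path_*_tz function", so .lower() would be rejected from jsonb_path_query but accepted from jsonb_path_query_tz.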
Other thoughts and heads-ups are, of course, welcome.
Patch CF entry: https://commitfest.postgresql.org/patch/5270/
Last updated Sept 24, so it will also need a rebase to account for changes in jsonpath_scan.l. I’ll get to that shortly.
On May 13, 2025, at 16:24, Florents Tselai <florents.tselai@gmail.com> wrote:
As Robert said—and I agree—renaming the existing _tz family would be more trouble than it’s worth, given the need for deprecations, migration paths, etc. If we were designing this today, suffixes like _stable or _volatile might have been more appropriate, but at this point, we’re better off staying consistent with the _tz family.
I get the pragmatism, and don’t want to over-bike-shed, but what a wart to live with. [I just went back and re-read Robert’s post, and didn’t realize he used exactly the same expression!] Would it really be too effortful to create _stable or _volatile functions and leave the _tz functions as a sort of legacy?
Or maybe there’s a nice backronym we could come up with for _tz.
Best,
David
On 13 May 2025, at 11:00 PM, David E. Wheeler <david@justatheory.com> wrote:
On May 13, 2025, at 16:24, Florents Tselai <florents.tselai@gmail.com> wrote:
As Robert said—and I agree—renaming the existing _tz family would be more trouble than it’s worth, given the need for deprecations, migration paths, etc. If we were designing this today, suffixes like _stable or _volatile might have been more appropriate, but at this point, we’re better off staying consistent with the _tz family.
I get the pragmatism, and don’t want to over-bike-shed, but what a wart to live with. [I just went back and re-read Robert’s post, and didn’t realize he used exactly the same expression!] Would it really be too effortful to create _stable or _volatile functions and leave the _tz functions as a sort of legacy?
Thinking about it a second time, you may be right.
Especially if more people are interested in adding even more methods there.
Here’s a patch just merging the latest changes in the jsonpath tooling;
no substantial changes to v1; mainly for CFbot to pick this up.
Attachments:
v2-0001-Rebase-latest-changes.-jsonpath_scan.l-white-spac.patch (application/octet-stream, +1081 -5)
On Tue, May 13, 2025 at 11:00 PM David E. Wheeler <david@justatheory.com> wrote:
On May 13, 2025, at 16:24, Florents Tselai <florents.tselai@gmail.com> wrote:
As Robert said—and I agree—renaming the existing _tz family would be more trouble than it’s worth, given the need for deprecations, migration paths, etc. If we were designing this today, suffixes like _stable or _volatile might have been more appropriate, but at this point, we’re better off staying consistent with the _tz family.
I get the pragmatism, and don’t want to over-bike-shed, but what a wart to live with. [I just went back and re-read Robert’s post, and didn’t realize he used exactly the same expression!] Would it really be too effortful to create _stable or _volatile functions and leave the _tz functions as a sort of legacy?
No, that wouldn't be too much work, but the issue is that people will
keep using the _tz versions and when we eventually try to remove them
those people will complain no matter how prominent we make the
deprecation notice. If we want to go this route, maybe we should do
something like:
1. Add the new versions with a _s suffix or whatever.
2. Invent a GUC jsonb_tz_warning = { on | off } that advises you to
use the new functions instead, whenever you use the old ones.
3. After N years, flip the default value from off to on.
4. After M additional years, remove the old functions and the GUC.
5. Still get complaints.
--
Robert Haas
EDB: http://www.enterprisedb.com
On May 21, 2025, at 14:06, Robert Haas <robertmhaas@gmail.com> wrote:
No, that wouldn't be too much work, but the issue is that people will
keep using the _tz versions and when we eventually try to remove them
those people will complain no matter how prominent we make the
deprecation notice. If we want to go this route, maybe we should do
something like:
1. Add the new versions with a _s suffix or whatever.
2. Invent a GUC jsonb_tz_warning = { on | off } that advises you to
use the new functions instead, whenever you use the old ones.
3. After N years, flip the default value from off to on.
4. After M additional years, remove the old functions and the GUC.
5. Still get complaints.
Complainers gonna complain. 🫠
Any idea how widespread the use of the function is? It was added in 17, and I’ve met few who have really dug into the jsonpath stuff yet, let alone needed the time zone conversion functionality.
Best,
David
"David E. Wheeler" <david@justatheory.com> writes:
On May 21, 2025, at 14:06, Robert Haas <robertmhaas@gmail.com> wrote:
If we want to go this route, maybe we should do
something like:
...
5. Still get complaints.
Complainers gonna complain.
Yeah. I do not see the point of that amount of effort.
Any idea how widespread the use of the function is? It was added in 17, and I’ve met few who have really dug into the jsonpath stuff yet, let alone needed the time zone conversion functionality.
That's a good point. We should also remember that if somebody really
really doesn't want to fix their app, they can trivially create a
wrapper function with the old name.
Having said that, what's wrong with inventing some improved function
names and never removing the old ones?
regards, tom lane
On Wed, May 21, 2025 at 2:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Having said that, what's wrong with inventing some improved function
names and never removing the old ones?
I don't particularly like the clutter, but if the consensus is that
the clutter doesn't matter, fair enough.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 22 May 2025, at 5:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 21, 2025 at 2:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Having said that, what's wrong with inventing some improved function
names and never removing the old ones?
I don't particularly like the clutter, but if the consensus is that
the clutter doesn't matter, fair enough.
It depends really on how much future work we expect in adding more methods in jsonpath.
I think there’s a lot of potential there, but that’s a guess really.
On David’s point about popularity:
In my experience timestamp related stuff from jsonb documents end up in a generated column,
and are indexed & queried there.
I expect that to continue in PG18 onwards as we’ll have virtual gen columns too.
Just to be clear, though, adding another version of these functions means
we’ll have an additional (now third) set of the same 5 functions:
the vanilla versions (considered immutable), the suffixed *_tz ones, and the new *_stable or *_volatile (?) ones:
jsonb_path_exists
jsonb_path_query
jsonb_path_query_array
jsonb_path_query_first
jsonb_path_match
On May 22, 2025, at 12:38, Florents Tselai <florents.tselai@gmail.com> wrote:
In my experience timestamp related stuff from jsonb documents end up in a generated column,
and are indexed & queried there.
Have you seen this in the wild using the _tz functions? I wouldn’t think they were indexable, given the volatility.
D
On 09.05.25 21:50, Robert Haas wrote:
I always struggle a bit to remember our policy on these issues -- to
the best of my knowledge, we haven't documented it anywhere, and I
think we probably should. I believe the way it works is that whenever
a function depends on the operating system's timestamp or locale
definitions, we decide it has to be stable, not immutable. We don't
expect those things to be updated very often, but we know sometimes
they do get updated.
I don't understand how this discussion got to the conclusion that
functions that depend on the locale cannot be immutable. Note that the
top-level functions lower, upper, and initcap themselves are immutable.