add function argument names to regex* functions.

Started by jian heover 2 years ago48 messageshackers

jian.universality@gmail.com

over 2 years ago

Hi.
similar to [1]/messages/by-id/877cw3jl8y.fsf@wibble.ilmari.org, add function argument names to the following functions:
regexp_like, regexp_match,regexp_matches,regexp_replace,
regexp_substr,regexp_split_to_array,regexp_split_to_table,regexp_count

so I call these function in a different notation[2]https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html#SQL-SYNTAX-CALLING-FUNCS, like:

SELECT regexp_like(string=>'a'||CHR(10)||'d', pattern=>'a.d', flags:='n');
select regexp_match(string=>'abc',n pattern=>'(B)(c)', flags=>'i');
select regexp_matches(string=>'Programmer', pattern=>'(\w)(.*?\1)',
flags=>'ig');
SELECT regexp_replace(source=>'A PostgreSQL function',
pattern=>'a|e|i|o|u', replacement=>'X', start=>1, n=>4, flags=>'i');
SELECT regexp_substr(string=>'1234567890',
pattern=>'(123)(4(56)(78))', start=>1, n=>1, flags=>'i', subexpr=>4);
SELECT regexp_split_to_array(string=>'thE QUick bROWn FOx jUMPs ovEr
The lazy dOG', pattern=>'e', flags=>'i');

SELECT foo, length(foo)
FROM regexp_split_to_table(string=>'thE QUick bROWn FOx jUMPs ovEr The
lazy dOG', pattern=>'e',flags=>'i') AS foo;
SELECT regexp_count(string=>'ABCABCABCABC', pattern=>'Abc', start=>1,
flags=>'i');

As you can see, the substring function argument names need an explicit
double quote,
which doesn't look good, so I gave up.

[1]: /messages/by-id/877cw3jl8y.fsf@wibble.ilmari.org
/messages/by-id/877cw3jl8y.fsf@wibble.ilmari.org
[2]: https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html#SQL-SYNTAX-CALLING-FUNCS
https://www.postgresql.org/docs/current/sql-syntax-calling-funcs.html#SQL-SYNTAX-CALLING-FUNCS
[3]: https://www.postgresql.org/docs/current/functions-matching.html

Peter Eisentraut

peter_e@gmx.net

over 2 years ago

In reply to: jian he (#1)

Re: add function argument names to regex* functions.

On 27.12.23 17:53, jian he wrote:

similar to [1], add function argument names to the following functions:
regexp_like, regexp_match,regexp_matches,regexp_replace,
regexp_substr,regexp_split_to_array,regexp_split_to_table,regexp_count

Note that these functions are a quasi-standard that is shared with other
SQL implementations. It might be worth looking around if there are
already other implementations of this idea.

jian he

jian.universality@gmail.com

over 2 years ago

In reply to: Peter Eisentraut (#2)

Re: add function argument names to regex* functions.

On Thu, Dec 28, 2023 at 6:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 27.12.23 17:53, jian he wrote:

similar to [1], add function argument names to the following functions:
regexp_like, regexp_match,regexp_matches,regexp_replace,
regexp_substr,regexp_split_to_array,regexp_split_to_table,regexp_count

Note that these functions are a quasi-standard that is shared with other
SQL implementations. It might be worth looking around if there are
already other implementations of this idea.

turns out people do like calling functions via explicitly mentioning
function argument names, example: [0]https://stackoverflow.com/questions/33387348/oracle-named-parameters-in-regular-functions
There are no provisions for the argument names.

I looked around the oracle implementation in [1]https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/REGEXP_SUBSTR.html#GUID-2903904D-455F-4839-A8B2-1731EF4BD099, and the oracle regex
related function argumentation name in [2]https://dbfiddle.uk/h_SBDEKi
I use the doc [3]https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP syntax explanation and add the related function names.

Current, regex.* function syntax seems fine. but only parameter `N`
seems a little bit weird.
If we change the function's argument name, we also need to change
function syntax explanation in the doc; vise versa.

QUOTE:
The regexp_instr function returns the starting or ending position of
the N'th match of a POSIX regular expression pattern to a string, or
zero if there is no such match. It has the syntax regexp_instr(string,
pattern [, start [, N [, endoption [, flags [, subexpr ]]]]]). pattern
is searched for in string, normally from the beginning of the string,
but if the start parameter is provided then beginning from that
character index. If N is specified then the N'th match of the pattern
is located, otherwise the first match is located.
END OF QUOTE.

maybe we can change `N` to occurrence. but `occurrence` is kind of verbose.

[0]: https://stackoverflow.com/questions/33387348/oracle-named-parameters-in-regular-functions
[1]: https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/REGEXP_SUBSTR.html#GUID-2903904D-455F-4839-A8B2-1731EF4BD099
[2]: https://dbfiddle.uk/h_SBDEKi
[3]: https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP

Dian Fay

di@nmfay.com

over 2 years ago

In reply to: jian he (#3)

Re: add function argument names to regex* functions.

On Wed Dec 27, 2023 at 10:28 PM EST, jian he wrote:

On Thu, Dec 28, 2023 at 6:25 AM Peter Eisentraut <peter@eisentraut.org> wrote:

On 27.12.23 17:53, jian he wrote:

similar to [1], add function argument names to the following functions:
regexp_like, regexp_match,regexp_matches,regexp_replace,
regexp_substr,regexp_split_to_array,regexp_split_to_table,regexp_count

Note that these functions are a quasi-standard that is shared with other
SQL implementations. It might be worth looking around if there are
already other implementations of this idea.

turns out people do like calling functions via explicitly mentioning
function argument names, example: [0]
There are no provisions for the argument names.

I looked around the oracle implementation in [1], and the oracle regex
related function argumentation name in [2]
I use the doc [3] syntax explanation and add the related function names.

Current, regex.* function syntax seems fine. but only parameter `N`
seems a little bit weird.
If we change the function's argument name, we also need to change
function syntax explanation in the doc; vise versa.

QUOTE:
The regexp_instr function returns the starting or ending position of
the N'th match of a POSIX regular expression pattern to a string, or
zero if there is no such match. It has the syntax regexp_instr(string,
pattern [, start [, N [, endoption [, flags [, subexpr ]]]]]). pattern
is searched for in string, normally from the beginning of the string,
but if the start parameter is provided then beginning from that
character index. If N is specified then the N'th match of the pattern
is located, otherwise the first match is located.
END OF QUOTE.

maybe we can change `N` to occurrence. but `occurrence` is kind of verbose.

[0] https://stackoverflow.com/questions/33387348/oracle-named-parameters-in-regular-functions
[1] https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/REGEXP_SUBSTR.html#GUID-2903904D-455F-4839-A8B2-1731EF4BD099
[2] https://dbfiddle.uk/h_SBDEKi
[3] https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP

I've been trying to use named arguments more diligently so expanding
support for built-in functions is welcome. The patch applies cleanly and
works as advertised.

I agree that the parameter name `n` is not ideal. For example, in
`regexp_replace` it's easy to misinterpret it as "make up to n
replacements". This has not been a problem when `n` only lives in the
documentation which explains exactly what it does, but that context is
not readily available in code expressing `n => 3`.

Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.

1. https://www.postgresql.org/docs/current/sql-keywords-appendix.html

Peter Eisentraut

peter_e@gmx.net

over 2 years ago

In reply to: jian he (#3)

Re: add function argument names to regex* functions.

On 28.12.23 04:28, jian he wrote:

I looked around the oracle implementation in [1], and the oracle regex
related function argumentation name in [2]
I use the doc [3] syntax explanation and add the related function names.

Current, regex.* function syntax seems fine. but only parameter `N`
seems a little bit weird.
If we change the function's argument name, we also need to change
function syntax explanation in the doc; vise versa.

So, it looks like Oracle already has defined parameter names for these,
so we should make ours match.

Jim Nasby

Jim.Nasby@BlueTreble.com

over 2 years ago

In reply to: Dian Fay (#4)

Re: add function argument names to regex* functions.

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 1/1/24 12:05 PM, Dian Fay wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CY3KI01TXYMD.3Q41WQ6U0NFEO@nmfay.com">
<pre>I agree that the parameter name `n` is not ideal. For example, in
`regexp_replace` it's easy to misinterpret it as "make up to n
replacements". This has not been a problem when `n` only lives in the
documentation which explains exactly what it does, but that context is
not readily available in code expressing `n => 3`.
</pre>
</blockquote>
Agreed; IMO it's worth diverging from what Oracle has done here.<br>
<blockquote type="cite"
cite="mid:CY3KI01TXYMD.3Q41WQ6U0NFEO@nmfay.com">
<pre>
Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.
</pre>
</blockquote>
<p>I'm not a fan of "index" since that leaves the question of
whether it's 0 or 1 based. "Position" is a bit better, but I think
Jian's suggestion of "occurance" is best.<br>
</p>
<pre class="moz-signature" cols="72">--
Jim Nasby, Data Architect, Austin TX</pre>
</body>
</html>

Dian Fay

di@nmfay.com

over 2 years ago

In reply to: Jim Nasby (#6)

Re: add function argument names to regex* functions.

Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.

I'm not a fan of "index" since that leaves the question of
whether it's 0 or 1 based. "Position" is a bit better, but I think
Jian's suggestion of "occurance" is best.

We do have precedent for one-based `index` in Postgres: array types are
1-indexed by default! "Occurrence" removes that ambiguity but it's long
and easy to misspell (I looked it up after typing it just now and it
_still_ feels off).

How's "instance"?

Jim Nasby

Jim.Nasby@BlueTreble.com

over 2 years ago

In reply to: Dian Fay (#7)

Re: add function argument names to regex* functions.

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 1/3/24 5:05 PM, Dian Fay wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CY5G4TAMVX32.23TH6LFF93ATZ@nmfay.com">
<blockquote type="cite">
<blockquote type="cite">
<pre class="moz-quote-pre" wrap="">Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
I'm not a fan of "index" since that leaves the question of
whether it's 0 or 1 based. "Position" is a bit better, but I think
Jian's suggestion of "occurance" is best.
</pre>
</blockquote>
<pre class="moz-quote-pre" wrap="">
We do have precedent for one-based `index` in Postgres: array types are
1-indexed by default! "Occurrence" removes that ambiguity but it's long
and easy to misspell (I looked it up after typing it just now and it
_still_ feels off).

How's "instance"?
</pre>
</blockquote>
<p>Presumably someone referencing arguments by name would have just
looked up the names via \df or whatever, so presumably misspelling
wouldn't be a big issue. But I think "instance" is OK as well.<br>
</p>
<pre class="moz-signature" cols="72">--
Jim Nasby, Data Architect, Austin TX</pre>
</body>
</html>

jian he

jian.universality@gmail.com

over 2 years ago

In reply to: Jim Nasby (#8)

Re: add function argument names to regex* functions.

On Thu, Jan 4, 2024 at 7:26 AM Jim Nasby <jim.nasby@gmail.com> wrote:

On 1/3/24 5:05 PM, Dian Fay wrote:

Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.

I'm not a fan of "index" since that leaves the question of
whether it's 0 or 1 based. "Position" is a bit better, but I think
Jian's suggestion of "occurance" is best.

We do have precedent for one-based `index` in Postgres: array types are
1-indexed by default! "Occurrence" removes that ambiguity but it's long
and easy to misspell (I looked it up after typing it just now and it
_still_ feels off).

How's "instance"?

Presumably someone referencing arguments by name would have just looked up the names via \df or whatever, so presumably misspelling wouldn't be a big issue. But I think "instance" is OK as well.

--
Jim Nasby, Data Architect, Austin TX

regexp_instr: It has the syntax regexp_instr(string, pattern [, start
[, N [, endoption [, flags [, subexpr ]]]]])
oracle:
REGEXP_INSTR (source_char, pattern, [, position [, occurrence [,
return_opt [, match_param [, subexpr ]]]]] )

"string" and "source_char" are almost the same descriptive, so maybe
there is no need to change.
"start" is better than "position", imho.
"return_opt" is better than "endoption", (maybe we need change, for
now I didn't)
"flags" cannot be changed to "match_param", given it quite everywhere
in functions-matching.html.

similarly for function regexp_replace, oracle using "repplace_string",
we use "replacement"(mentioned in the doc).
so I don't think we need to change to "repplace_string".

Based on how people google[0]https://www.google.com/search?q=regex+nth+match&oq=regex+nth+match&gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIGCAEQRRg8MgYIAhBFGDzSAQc2MThqMGo5qAIAsAIA&sourceid=chrome&ie=UTF-8, I think `occurrence` is ok, even though
it's verbose.
to change from `N` to `occurrence`, we also need to change the doc,
that is why this patch is more larger.

[0]: https://www.google.com/search?q=regex+nth+match&oq=regex+nth+match&gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIGCAEQRRg8MgYIAhBFGDzSAQc2MThqMGo5qAIAsAIA&sourceid=chrome&ie=UTF-8

#10

Dian Fay

di@nmfay.com

over 2 years ago

In reply to: jian he (#9)

Re: add function argument names to regex* functions.

On Thu Jan 4, 2024 at 2:03 AM EST, jian he wrote:

On Thu, Jan 4, 2024 at 7:26 AM Jim Nasby <jim.nasby@gmail.com> wrote:

On 1/3/24 5:05 PM, Dian Fay wrote:

Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.

I'm not a fan of "index" since that leaves the question of
whether it's 0 or 1 based. "Position" is a bit better, but I think
Jian's suggestion of "occurance" is best.

We do have precedent for one-based `index` in Postgres: array types are
1-indexed by default! "Occurrence" removes that ambiguity but it's long
and easy to misspell (I looked it up after typing it just now and it
_still_ feels off).

How's "instance"?

Presumably someone referencing arguments by name would have just looked up the names via \df or whatever, so presumably misspelling wouldn't be a big issue. But I think "instance" is OK as well.

--
Jim Nasby, Data Architect, Austin TX

regexp_instr: It has the syntax regexp_instr(string, pattern [, start
[, N [, endoption [, flags [, subexpr ]]]]])
oracle:
REGEXP_INSTR (source_char, pattern, [, position [, occurrence [,
return_opt [, match_param [, subexpr ]]]]] )

"string" and "source_char" are almost the same descriptive, so maybe
there is no need to change.
"start" is better than "position", imho.
"return_opt" is better than "endoption", (maybe we need change, for
now I didn't)
"flags" cannot be changed to "match_param", given it quite everywhere
in functions-matching.html.

similarly for function regexp_replace, oracle using "repplace_string",
we use "replacement"(mentioned in the doc).
so I don't think we need to change to "repplace_string".

Based on how people google[0], I think `occurrence` is ok, even though
it's verbose.
to change from `N` to `occurrence`, we also need to change the doc,
that is why this patch is more larger.

[0]: https://www.google.com/search?q=regex+nth+match&oq=regex+nth+match&gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIGCAEQRRg8MgYIAhBFGDzSAQc2MThqMGo5qAIAsAIA&sourceid=chrome&ie=UTF-8

The `regexp_replace` summary in table 9.10 is mismatched and still
specifies the first parameter name as `string` instead of `source`.
Since all the other functions use `string`, should `regexp_replace` do
the same or is this a case where an established "standard" diverges?

I noticed the original documentation for some of these functions is
rather disorganized; summaries explain `occurrence` without explaining
the prior `start` parameter, and detailed documentation in 9.7 is
usually a single paragraph per function running pell-mell through ifs
and buts without section headings, so entries in table 9.10 have to
reference the entire section 9.7.3 instead of their specific functions.
It's out of scope here, but should I bring this up on pgsql-docs?

#11

jian he

jian.universality@gmail.com

over 2 years ago

In reply to: Dian Fay (#10)

Re: add function argument names to regex* functions.

On Mon, Jan 8, 2024 at 8:44 AM Dian Fay <di@nmfay.com> wrote:

On Thu Jan 4, 2024 at 2:03 AM EST, jian he wrote:

On Thu, Jan 4, 2024 at 7:26 AM Jim Nasby <jim.nasby@gmail.com> wrote:

On 1/3/24 5:05 PM, Dian Fay wrote:

Another possibility is `index`, which is relatively short and not a
reserved keyword ^1. `position` is not as precise but would avoid the
conceptual overloading of ordinary indices.

I'm not a fan of "index" since that leaves the question of
whether it's 0 or 1 based. "Position" is a bit better, but I think
Jian's suggestion of "occurance" is best.

We do have precedent for one-based `index` in Postgres: array types are
1-indexed by default! "Occurrence" removes that ambiguity but it's long
and easy to misspell (I looked it up after typing it just now and it
_still_ feels off).

How's "instance"?

Presumably someone referencing arguments by name would have just looked up the names via \df or whatever, so presumably misspelling wouldn't be a big issue. But I think "instance" is OK as well.

--
Jim Nasby, Data Architect, Austin TX

regexp_instr: It has the syntax regexp_instr(string, pattern [, start
[, N [, endoption [, flags [, subexpr ]]]]])
oracle:
REGEXP_INSTR (source_char, pattern, [, position [, occurrence [,
return_opt [, match_param [, subexpr ]]]]] )

"string" and "source_char" are almost the same descriptive, so maybe
there is no need to change.
"start" is better than "position", imho.
"return_opt" is better than "endoption", (maybe we need change, for
now I didn't)
"flags" cannot be changed to "match_param", given it quite everywhere
in functions-matching.html.

similarly for function regexp_replace, oracle using "repplace_string",
we use "replacement"(mentioned in the doc).
so I don't think we need to change to "repplace_string".

Based on how people google[0], I think `occurrence` is ok, even though
it's verbose.
to change from `N` to `occurrence`, we also need to change the doc,
that is why this patch is more larger.

[0]: https://www.google.com/search?q=regex+nth+match&oq=regex+nth+match&gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIGCAEQRRg8MgYIAhBFGDzSAQc2MThqMGo5qAIAsAIA&sourceid=chrome&ie=UTF-8

The `regexp_replace` summary in table 9.10 is mismatched and still
specifies the first parameter name as `string` instead of `source`.
Since all the other functions use `string`, should `regexp_replace` do
the same or is this a case where an established "standard" diverges?

got it. Thanks for pointing it out.

in functions-matching.html
if I change <replaceable>source</replaceable> to
<replaceable>string</replaceable> then
there are no markup "string" and markup "string", it's kind of
slightly confusing.

So does the following refactored description of regexp_replace make sense:

The <replaceable>string</replaceable> is returned unchanged if
there is no match to the <replaceable>pattern</replaceable>. If there is a
match, the <replaceable>string</replaceable> is returned with the
<replaceable>replacement</replaceable> string substituted for the matching
substring. The <replaceable>replacement</replaceable> string can contain
<literal>\</literal><replaceable>n</replaceable>, where
<replaceable>n</replaceable> is 1
through 9, to indicate that the source substring matching the
<replaceable>n</replaceable>'th parenthesized subexpression of
the pattern should be
inserted, and it can contain <literal>\&</literal> to indicate that the
substring matching the entire pattern should be inserted. Write
<literal>\\</literal> if you need to put a literal backslash in
the replacement
text.

I noticed the original documentation for some of these functions is
rather disorganized; summaries explain `occurrence` without explaining
the prior `start` parameter, and detailed documentation in 9.7 is
usually a single paragraph per function running pell-mell through ifs
and buts without section headings, so entries in table 9.10 have to
reference the entire section 9.7.3 instead of their specific functions.
It's out of scope here, but should I bring this up on pgsql-docs?

I got it.
in Table 9.10. Other String Functions and Operators, if we can
reference the specific function would be great.
As for now, in the browser, you need to use Ctrl+F to find the
detailed explanation in 9.7.3.
you can just bring your suggested or patch to pgsql-hackers@postgresql.org.

#12

Dian Fay

di@nmfay.com

over 2 years ago

In reply to: jian he (#11)

Re: add function argument names to regex* functions.

On Mon Jan 8, 2024 at 9:26 AM EST, jian he wrote:

On Mon, Jan 8, 2024 at 8:44 AM Dian Fay <di@nmfay.com> wrote:

The `regexp_replace` summary in table 9.10 is mismatched and still
specifies the first parameter name as `string` instead of `source`.
Since all the other functions use `string`, should `regexp_replace` do
the same or is this a case where an established "standard" diverges?

got it. Thanks for pointing it out.

in functions-matching.html
if I change <replaceable>source</replaceable> to
<replaceable>string</replaceable> then
there are no markup "string" and markup "string", it's kind of
slightly confusing.

So does the following refactored description of regexp_replace make sense:

The <replaceable>string</replaceable> is returned unchanged if
there is no match to the <replaceable>pattern</replaceable>. If there is a
match, the <replaceable>string</replaceable> is returned with the
<replaceable>replacement</replaceable> string substituted for the matching
substring. The <replaceable>replacement</replaceable> string can contain
<literal>\</literal><replaceable>n</replaceable>, where
<replaceable>n</replaceable> is 1
through 9, to indicate that the source substring matching the
<replaceable>n</replaceable>'th parenthesized subexpression of
the pattern should be
inserted, and it can contain <literal>\&</literal> to indicate that the
substring matching the entire pattern should be inserted. Write
<literal>\\</literal> if you need to put a literal backslash in
the replacement
text.

That change makes sense to me! I'll see about the section refactoring
after this lands.

#13

jian he

jian.universality@gmail.com

over 2 years ago

In reply to: Dian Fay (#12)

Re: add function argument names to regex* functions.

On Tue, Jan 9, 2024 at 8:52 AM Dian Fay <di@nmfay.com> wrote:

On Mon Jan 8, 2024 at 9:26 AM EST, jian he wrote:

On Mon, Jan 8, 2024 at 8:44 AM Dian Fay <di@nmfay.com> wrote:

The `regexp_replace` summary in table 9.10 is mismatched and still
specifies the first parameter name as `string` instead of `source`.
Since all the other functions use `string`, should `regexp_replace` do
the same or is this a case where an established "standard" diverges?

got it. Thanks for pointing it out.

in functions-matching.html
if I change <replaceable>source</replaceable> to
<replaceable>string</replaceable> then
there are no markup "string" and markup "string", it's kind of
slightly confusing.

So does the following refactored description of regexp_replace make sense:

The <replaceable>string</replaceable> is returned unchanged if
there is no match to the <replaceable>pattern</replaceable>. If there is a
match, the <replaceable>string</replaceable> is returned with the
<replaceable>replacement</replaceable> string substituted for the matching
substring. The <replaceable>replacement</replaceable> string can contain
<literal>\</literal><replaceable>n</replaceable>, where
<replaceable>n</replaceable> is 1
through 9, to indicate that the source substring matching the
<replaceable>n</replaceable>'th parenthesized subexpression of
the pattern should be
inserted, and it can contain <literal>\&</literal> to indicate that the
substring matching the entire pattern should be inserted. Write
<literal>\\</literal> if you need to put a literal backslash in
the replacement
text.

That change makes sense to me! I'll see about the section refactoring
after this lands.

I put the changes into the new patch.

#14

Dian Fay

di@nmfay.com

over 2 years ago

In reply to: jian he (#13)

Re: add function argument names to regex* functions.

On Wed Jan 10, 2024 at 9:18 AM EST, jian he wrote:

On Tue, Jan 9, 2024 at 8:52 AM Dian Fay <di@nmfay.com> wrote:

On Mon Jan 8, 2024 at 9:26 AM EST, jian he wrote:

On Mon, Jan 8, 2024 at 8:44 AM Dian Fay <di@nmfay.com> wrote:

The `regexp_replace` summary in table 9.10 is mismatched and still
specifies the first parameter name as `string` instead of `source`.
Since all the other functions use `string`, should `regexp_replace` do
the same or is this a case where an established "standard" diverges?

got it. Thanks for pointing it out.

in functions-matching.html
if I change <replaceable>source</replaceable> to
<replaceable>string</replaceable> then
there are no markup "string" and markup "string", it's kind of
slightly confusing.

So does the following refactored description of regexp_replace make sense:

The <replaceable>string</replaceable> is returned unchanged if
there is no match to the <replaceable>pattern</replaceable>. If there is a
match, the <replaceable>string</replaceable> is returned with the
<replaceable>replacement</replaceable> string substituted for the matching
substring. The <replaceable>replacement</replaceable> string can contain
<literal>\</literal><replaceable>n</replaceable>, where
<replaceable>n</replaceable> is 1
through 9, to indicate that the source substring matching the
<replaceable>n</replaceable>'th parenthesized subexpression of
the pattern should be
inserted, and it can contain <literal>\&</literal> to indicate that the
substring matching the entire pattern should be inserted. Write
<literal>\\</literal> if you need to put a literal backslash in
the replacement
text.

That change makes sense to me! I'll see about the section refactoring
after this lands.

I put the changes into the new patch.

Sorry, I missed one minor issue with v2. The replacement on lines
6027-6028 of func.sgml (originally "`n` rows if there are `n` matches")
is not needed and could be more confusing since the `n` represents a
number, not an argument to `regexp_matches`. I've built v3 and gone over
everything else one more time and it looks good.

#15

Peter Eisentraut

peter_e@gmx.net

about 2 years ago

In reply to: jian he (#13)

Re: add function argument names to regex* functions.

On 10.01.24 15:18, jian he wrote:

I put the changes into the new patch.

Reading back through the discussion, I wasn't quite able to interpret
the resolution regarding Oracle compatibility. From the patch, it looks
like you chose not to adopt the parameter names from Oracle. Was that
your intention?

#16

jian he

jian.universality@gmail.com

about 2 years ago

In reply to: Peter Eisentraut (#15)

Re: add function argument names to regex* functions.

On Thu, Jan 18, 2024 at 4:17 PM Peter Eisentraut <peter@eisentraut.org> wrote:

On 10.01.24 15:18, jian he wrote:

I put the changes into the new patch.

Reading back through the discussion, I wasn't quite able to interpret
the resolution regarding Oracle compatibility. From the patch, it looks
like you chose not to adopt the parameter names from Oracle. Was that
your intention?

per committee message:
https://git.postgresql.org/cgit/postgresql.git/commit/?id=6424337073589476303b10f6d7cc74f501b8d9d7
Even if the names are all the same, our function is still not the same
as oracle.

There is a documentation bug.
In [0]https://www.postgresql.org/docs/current/functions-matching.html, Table 9.25. Regular Expression Functions Equivalencies
regexp_replace function definition: regexp_replace(string, pattern, replacement)

In one of the <tip> section below, regexp_replace explains as
<<<<<
The regexp_replace function provides substitution of new text for
substrings that match POSIX regular expression patterns. It has the
syntax regexp_replace(source, pattern, replacement [, start [, N ]] [,
flags ]). (Notice that N cannot be specified unless start is, but
flags can be given in any case.)
<<<<<
So I changed the first argument of regexp_replace to "string". So
accordingly, the doc needs to change also, which I did.

another regex* function argument changes: from "N" to "occurences",  example:
    + If <replaceable>occurrence</replaceable> is specified
    + then the <replaceable>occurrence</replaceable>'th match of the pattern
    + is located,

but [2] says
Speaking of the "occurrence'th
occurrence" is just silly, not to mention long and easy to misspell."

summary:
adding function-named notation is my intention.
To make regex.* functions named-notation works, we need to add
proargnames to src/include/catalog/pg_proc.dat.
add proargnames also require changing the doc.
naming proargnames is a matter of taste now, So I only change 'N' to
'occurrence'.

[0]: https://www.postgresql.org/docs/current/functions-matching.html
[1]: /messages/by-id/fc160ee0-c843-b024-29bb-97b5da61971f@darold.net

#17

jian he

jian.universality@gmail.com

about 2 years ago

In reply to: jian he (#16)

Re: add function argument names to regex* functions.

On Sat, Jan 20, 2024 at 10:55 AM jian he <jian.universality@gmail.com> wrote:

another regex* function argument changes: from "N" to "occurences",  example:
+ If <replaceable>occurrence</replaceable> is specified
+ then the <replaceable>occurrence</replaceable>'th match of the pattern
+ is located,
but [2] says
Speaking of the "occurrence'th
occurrence" is just silly, not to mention long and easy to misspell."

sorry.
[2]: , The reference link is /messages/by-id/1567465.1627860115@sss.pgh.pa.us
/messages/by-id/1567465.1627860115@sss.pgh.pa.us

my previous post will link to the whole thread.

#18

Tom Lane

tgl@sss.pgh.pa.us

about 2 years ago

In reply to: jian he (#16)

Re: add function argument names to regex* functions.

jian he <jian.universality@gmail.com> writes:

On Thu, Jan 18, 2024 at 4:17 PM Peter Eisentraut <peter@eisentraut.org> wrote:

Reading back through the discussion, I wasn't quite able to interpret
the resolution regarding Oracle compatibility. From the patch, it looks
like you chose not to adopt the parameter names from Oracle. Was that
your intention?

per committee message:
https://git.postgresql.org/cgit/postgresql.git/commit/?id=6424337073589476303b10f6d7cc74f501b8d9d7
Even if the names are all the same, our function is still not the same
as oracle.

The fact that there's minor discrepancies in the regex languages
doesn't seem to me to have a lot of bearing on whether we should
follow Oracle's choices of parameter names.

However, if we do follow Oracle, it seems like we should do that
consistently, which this patch doesn't. For instance, per [1]https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/REGEXP_SUBSTR.html#GUID-2903904D-455F-4839-A8B2-1731EF4BD099
Oracle calls the arguments of regex_substr

source_char,
pattern,
position,
occurrence,
match_param,
subexpr

while we have

string,
pattern,
start,
N,
flags,
subexpr

The patch proposes to replace "N" with "occurrence" but not touch
the other discrepancies, which seems to me to be a pretty poor
choice. "occurrence" is very long and difficult to spell correctly,
and if you're not following Oracle slavishly, exactly what is the
argument in its favor? I quite agree that Oracle's other choices
aren't improvements over ours, but neither is that one.

On the whole my inclination would be to stick to the names we have
in the documentation. There might be an argument for changing "N"
to something lower-case so you don't have to quote it; but if we do,
I'd go for, say, "count".

regards, tom lane

[1]: https://docs.oracle.com/en/database/oracle/oracle-database/23/sqlrf/REGEXP_SUBSTR.html#GUID-2903904D-455F-4839-A8B2-1731EF4BD099

#19

jian he

jian.universality@gmail.com

about 2 years ago

In reply to: Tom Lane (#18)

Re: add function argument names to regex* functions.

On Wed, Apr 3, 2024 at 4:45 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

jian he <jian.universality@gmail.com> writes:

On Thu, Jan 18, 2024 at 4:17 PM Peter Eisentraut <peter@eisentraut.org> wrote:

Reading back through the discussion, I wasn't quite able to interpret
the resolution regarding Oracle compatibility. From the patch, it looks
like you chose not to adopt the parameter names from Oracle. Was that
your intention?

per committee message:
https://git.postgresql.org/cgit/postgresql.git/commit/?id=6424337073589476303b10f6d7cc74f501b8d9d7
Even if the names are all the same, our function is still not the same
as oracle.

The fact that there's minor discrepancies in the regex languages
doesn't seem to me to have a lot of bearing on whether we should
follow Oracle's choices of parameter names.

However, if we do follow Oracle, it seems like we should do that
consistently, which this patch doesn't. For instance, per [1]
Oracle calls the arguments of regex_substr

source_char,
pattern,
position,
occurrence,
match_param,
subexpr

while we have

string,
pattern,
start,
N,
flags,
subexpr

The patch proposes to replace "N" with "occurrence" but not touch
the other discrepancies, which seems to me to be a pretty poor
choice. "occurrence" is very long and difficult to spell correctly,
and if you're not following Oracle slavishly, exactly what is the
argument in its favor? I quite agree that Oracle's other choices
aren't improvements over ours, but neither is that one.

On the whole my inclination would be to stick to the names we have
in the documentation. There might be an argument for changing "N"
to something lower-case so you don't have to quote it; but if we do,
I'd go for, say, "count".

we have
---------------------------------------------------------------
The replacement string can contain \n, where n is 1 through 9, to
indicate that the source substring matching the n'th parenthesized
subexpression of the pattern should be inserted, and it can contain \&
to indicate that the substring matching the entire pattern should be
inserted.
----------------------------------------------------------------------------
in the regexp_replace explanation section.
changing "N" to lower-case would be misleading for regexp_replace?
so I choose "count".

By the way, I think the above is so hard to comprehend.
I can only find related test in src/test/regress/sql/strings.sql are:
SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})',
E'(\\1) \\2-\\3');
SELECT regexp_replace('foobarrbazz', E'(.)\\1', E'X\\&Y', 'g');
SELECT regexp_replace('foobarrbazz', E'(.)\\1', E'X\\\\Y', 'g');

but these tests seem not friendly.
maybe we should have some simple examples to demonstrate the above paragraph.

#20

Robert Haas

robertmhaas@gmail.com

almost 2 years ago

In reply to: jian he (#19)

Re: add function argument names to regex* functions.

On Thu, Apr 4, 2024 at 9:55 AM jian he <jian.universality@gmail.com> wrote:

in the regexp_replace explanation section.
changing "N" to lower-case would be misleading for regexp_replace?
so I choose "count".

I don't see why that would be confusing for regexp_replace
specifically, but I think N => count is a reasonable change to make.
However, I don't think this quite works:

+ then the <replaceable>count</replaceable>'th match of the pattern

An English speaker is more likely to understand what is meant by
"N'th" than what is meant by "count'th". Even if they can guess, it's
kinda strange-looking. I think it needs to be rephrased somehow, but
I'm not sure exactly how.

By the way, I think the above is so hard to comprehend.
I can only find related test in src/test/regress/sql/strings.sql are:
SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})',
E'(\\1) \\2-\\3');
SELECT regexp_replace('foobarrbazz', E'(.)\\1', E'X\\&Y', 'g');
SELECT regexp_replace('foobarrbazz', E'(.)\\1', E'X\\\\Y', 'g');

but these tests seem not friendly.
maybe we should have some simple examples to demonstrate the above paragraph.

Examples in the regression tests aren't meant as tests, not examples
for users to copy. If we want examples, those belong in the
documentation. However, I see that regexp_replace already has some
examples at https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-REGEXP
so I'm not sure exactly what you think should be added.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21

Robert Haas

robertmhaas@gmail.com

almost 2 years ago

In reply to: Robert Haas (#20)

#22