regexp_match() returning text

Started by Emre Hasegelialmost 10 years ago9 messageshackers
Jump to latest
#1Emre Hasegeli
emre@hasegeli.com

Attached patch adds regexp_match() function which is a simple variant of
regexp_matches() that doesn't return a set. It is based on Tom Lane's
comment to bug #10889 [1]/messages/by-id/23769.1404747766@sss.pgh.pa.us.

[1]: /messages/by-id/23769.1404747766@sss.pgh.pa.us

Attachments:

regexp-match-v1.patchtext/x-diff; charset=utf-8; name=regexp-match-v1.patchDownload+188-48
#2David G. Johnston
david.g.johnston@gmail.com
In reply to: Emre Hasegeli (#1)
Re: regexp_match() returning text

On Sun, May 29, 2016 at 1:28 PM, Emre Hasegeli <emre@hasegeli.com> wrote:

Attached patch adds regexp_match() function which is a simple variant of
regexp_matches() that doesn't return a set. It is based on Tom Lane's
comment to bug #10889 [1].

[1] /messages/by-id/23769.1404747766@sss.pgh.pa.us

Not sure if we need two functions or what but I dislike that:

"It ignores the parenthesized subexpressions in the pattern."

​The main problem being solved is the use of a SETOF result. I'm inclined
to prefer that the final, single, result is still an array.

Even if we return a single text value instead of an array it would be nice
to return the first (only...) parenthesized subexpression so that the input
pattern isn't constrained.

​I've got a style issue with the information_schema - I like to call it
useless-use-of-E'' - but that was there long before this patch...

/* user mustn't specify 'g' for regexp_split */ - do we add "or
regexp_match" or just removed the extraneous detail?

There seems to be scope creep regarding "regexp_split_to_table" that I'm
surprised to find. Related to that is the unexpected removal of the
"force_glob" parameter to setup_regexp_matches. You took what was a single
block of code and now duplicated it without any explanation in the commit
message (a code comment wouldn't work for this kind of change). The change
to flags from passing a pointer to text to passing in a pointer to a
previously derived pg_re_flags makes more sense on its face, and it is
apparently a non-public API, but again constitutes a refactoring that at
least would ideally be a separate commit from the one the introduces the
new behavior.

Also, I see an assert for the "no subexpressions" policy but I have seemed
to have overlooked where the non-assert-enabled user is informed of the
prohibition.

Tom's opinion on all this will be needed - I am not a C code writer nor do
I follow the patch writing process that closely generally, but I have a
keen interest in this topic - but there seems to be a bit more here than
simply having a function identical to the existing regexp_matches function
that prohibits the global flag and only ever returns a single array per the
existing API.

I'm good with the name, regexp_match - even with the behavior being changed
to returning an array instead of text. If there was any active plans to
redo the API so that we'd return a composite type instead of an array I'd
be more inclined to reserve "match" for that change and make this one a bit
more verbose.

David J.

#3Andrew Gierth
andrew@tao11.riddles.org.uk
In reply to: Emre Hasegeli (#1)
Re: regexp_match() returning text

"Emre" == Emre Hasegeli <emre@hasegeli.com> writes:

Emre> Attached patch adds regexp_match() function which is a simple
Emre> variant of regexp_matches() that doesn't return a set.

We already have a function that takes a string and a regexp and returns
a single text result: substring().

Regexp flags other than 'g' can also be embedded in the regexp:

postgres=# select substring('foo bar' from '(?i)BA+');
substring
-----------
ba

Returning an array containing the values of all capture groups might be
more useful (substring returns the value of the first capture group if
any, otherwise the matched string).

--
Andrew (irc:RhodiumToad)

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andrew Gierth (#3)
Re: regexp_match() returning text

On 5/30/16 1:01 PM, Andrew Gierth wrote:

Returning an array containing the values of all capture groups might be
more useful (substring returns the value of the first capture group if
any, otherwise the matched string).

+1.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Emre Hasegeli
emre@hasegeli.com
In reply to: David G. Johnston (#2)
Re: regexp_match() returning text

The main problem being solved is the use of a SETOF result. I'm inclined to
prefer that the final, single, result is still an array.

I have changed it like that. New patch attached.

I've got a style issue with the information_schema - I like to call it
useless-use-of-E'' - but that was there long before this patch...

We better not touch it.

/* user mustn't specify 'g' for regexp_split */ - do we add "or
regexp_match" or just removed the extraneous detail?

I don't think it would be a nice error message.

There seems to be scope creep regarding "regexp_split_to_table" that I'm
surprised to find. Related to that is the unexpected removal of the
"force_glob" parameter to setup_regexp_matches. You took what was a single
block of code and now duplicated it without any explanation in the commit
message (a code comment wouldn't work for this kind of change). The change
to flags from passing a pointer to text to passing in a pointer to a
previously derived pg_re_flags makes more sense on its face, and it is
apparently a non-public API, but again constitutes a refactoring that at
least would ideally be a separate commit from the one the introduces the new
behavior.

That check doesn't belong to setup_regexp_matches() in the first place.
The arguments of the function are organised to be caller agnostic,
and then it gives an error on behalf of regexp_split(). The check
fits better to the regexp_split() functions even with duplication.

I can split it to another patch, but I think these kind of changes most
often go together.

Would you mind adding yourself to the reviewers on the Commitfest app?
I think you have already read though it.

Thanks for all the feedback.

Attachments:

regexp-match-v2.patchtext/x-diff; charset=utf-8; name=regexp-match-v2.patchDownload+183-48
#6David G. Johnston
david.g.johnston@gmail.com
In reply to: Emre Hasegeli (#5)
Re: regexp_match() returning text

On Saturday, June 4, 2016, Emre Hasegeli <emre@hasegeli.com> wrote:

The main problem being solved is the use of a SETOF result. I'm

inclined to

prefer that the final, single, result is still an array.

I have changed it like that. New patch attached.

Good

I've got a style issue with the information_schema - I like to call it
useless-use-of-E'' - but that was there long before this patch...

We better not touch it.

Agreed

/* user mustn't specify 'g' for regexp_split */ - do we add "or
regexp_match" or just removed the extraneous detail?

I don't think it would be a nice error message.

Just meant the comment. I think they look good now though the part about
force global in the split area just says what the code does and not why.
Namely that when splitting we find all matches so the input is completely
split. We disallow a user specification of global for some arbitrary
reason; though I don't have any reason to change that and the present
behavior doesn't cause people to complain.

There seems to be scope creep regarding "regexp_split_to_table" that I'm
surprised to find. Related to that is the unexpected removal of the
"force_glob" parameter to setup_regexp_matches. You took what was a

single

block of code and now duplicated it without any explanation in the commit
message (a code comment wouldn't work for this kind of change). The

change

to flags from passing a pointer to text to passing in a pointer to a
previously derived pg_re_flags makes more sense on its face, and it is
apparently a non-public API, but again constitutes a refactoring that at
least would ideally be a separate commit from the one the introduces the

new

behavior.

That check doesn't belong to setup_regexp_matches() in the first place.
The arguments of the function are organised to be caller agnostic,
and then it gives an error on behalf of regexp_split(). The check
fits better to the regexp_split() functions even with duplication.

I can split it to another patch, but I think these kind of changes most
often go together.

After reading this again I understand better what is going on and agree
with the changes as written.

Would you mind adding yourself to the reviewers on the Commitfest app?
I think you have already read though it.

Done.

I didn't compile either patch but given the scope and complexity I'd say it
is ready for committer without that confirmed. Tom usually touches the
regexp code and I'm pretty sure he'll look at this with an eye no one else
has. Though I wouldn't expect anything until work on 10 begins in earnest.

David J.

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: David G. Johnston (#6)
Re: regexp_match() returning text

"David G. Johnston" <david.g.johnston@gmail.com> writes:

I didn't compile either patch but given the scope and complexity I'd say it
is ready for committer without that confirmed. Tom usually touches the
regexp code and I'm pretty sure he'll look at this with an eye no one else
has. Though I wouldn't expect anything until work on 10 begins in earnest.

Pushed with some cosmetic adjustments to the code and rather more
extensive work on the documentation.

I did *not* push the hunk in citext.sgml, since that was alleging support
that doesn't actually exist in this patch. To make this work for citext,
we need to add wrapper functions similar to citext's wrappers for
regexp_matches. And that in turn means a citext extension version bump,
which makes it more notationally tedious than I felt like dealing with
today. I'm bouncing that requirement back to you to produce a separate
patch for.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Emre Hasegeli
emre@hasegeli.com
In reply to: Tom Lane (#7)
Re: regexp_match() returning text

I did *not* push the hunk in citext.sgml, since that was alleging support
that doesn't actually exist in this patch. To make this work for citext,
we need to add wrapper functions similar to citext's wrappers for
regexp_matches. And that in turn means a citext extension version bump,
which makes it more notationally tedious than I felt like dealing with
today. I'm bouncing that requirement back to you to produce a separate
patch for.

It is attached.

Attachments:

citext-regexp-match-v1.patchapplication/octet-stream; name=citext-regexp-match-v1.patchDownload+95-5
#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Emre Hasegeli (#8)
Re: regexp_match() returning text

Emre Hasegeli <emre@hasegeli.com> writes:

I did *not* push the hunk in citext.sgml, since that was alleging support
that doesn't actually exist in this patch. To make this work for citext,
we need to add wrapper functions similar to citext's wrappers for
regexp_matches. And that in turn means a citext extension version bump,
which makes it more notationally tedious than I felt like dealing with
today. I'm bouncing that requirement back to you to produce a separate
patch for.

It is attached.

You missed updating citext_1.out, but otherwise looks good. Pushed.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers