writing new regexp functions
I am wanting to write some new C functions which leverage postgresql's
existing regexp code in an extension module. I notice that the functions
RE_compile_and_cache and RE_compile_and_execute in
src/backend/util/regexp.c contain the code necessary to connect the regexp
code in src/backend/regex with the postgresql string conversion, error
reporting, and memory management infrastructure, as well as providing
caching of regexes which would probably be a win to any regex function in
postgres. It would seem that these functions would be useful to any
C function dealing with regexp matching in postgresql, but they are static
functions, so they cannot be used outside of
src/backend/utils/adt/regexp.c. Since all of the core regexp functions
are in this file, this has been ok, but it is my opinion that these
functions should be made visible and added to a header file so that
extensions can make use of them, because any add-on functions that want to
use the regex code in postgres in some new way would need to basically
duplicate that same code in order to do so.
Is there some specific reason that these functions are static, or would it
be ok to make them non-static and add them to a header (say,
src/include/utils/regexp.h) so that extensions could use them as well? I
could put together a patch for this if desired, or it seems simple enough
that someone could just do it...
--
I can't decide whether to commit suicide or go bowling.
-- Florence Henderson
Jeremy Drake <pgsql@jdrake.com> writes:
Is there some specific reason that these functions are static,
Yeah: not cluttering the global namespace. I'm not excited about
exporting everything that anybody could possibly want access to;
that just makes it harder to maintain the code. When you see a
static function, you know that you don't have to look further than
the current file to understand how it's used. When you see a global
function, the difficulty of knowing how it's used is an order of
magnitude higher, maybe more. What's more, if you want to change it
then you have to worry about the possible side-effects on unknown
non-core code that might be calling it.
Is there a reason for not putting your new code itself into regexp.c?
regards, tom lane
On Thu, 1 Feb 2007, Tom Lane wrote:
Jeremy Drake <pgsql@jdrake.com> writes:
Is there some specific reason that these functions are static,
Yeah: not cluttering the global namespace.
Is there a reason for not putting your new code itself into regexp.c?
Not really, I just figured it would be cleaner/easier to write it as an
extension. I also figure that it is unlikely that every regexp function
that anyone could possibly want will be implemented in core in that one
file. If anyone writes an extension like this, they would need to
duplicate a good amount of code in order to do so, that would make more
difficulty in maintaining the code if it should need to change. It also
makes developing a new function a lot easier, no need to re-initdb to add
the function, no need to relink the postmaster and restart it every time
the function changes.
Anyway, the particular thing I was writing was a function like
substring(str FROM pattern) which instead of returning just the first
match group, would return an array of text containing all of the match
groups. I exported the functions in my sandbox, and wrote a module with a
function that does this.
regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?
--
Calling J-Man Kink. Calling J-Man Kink. Hash missile sighted, target
Los Angeles. Disregard personal feelings about city and intercept.
On Thu, 1 Feb 2007, Jeremy Drake wrote:
On Thu, 1 Feb 2007, Tom Lane wrote:
Jeremy Drake <pgsql@jdrake.com> writes:
Is there some specific reason that these functions are static,
Yeah: not cluttering the global namespace.
Is there a reason for not putting your new code itself into regexp.c?
Not really, I just figured it would be cleaner/easier to write it as an
extension. I also figure that it is unlikely that every regexp function
that anyone could possibly want will be implemented in core in that one
file.
<snip>
Anyway, the particular thing I was writing was a function like
substring(str FROM pattern) which instead of returning just the first
match group, would return an array of text containing all of the match
groups. I exported the functions in my sandbox, and wrote a module with a
function that does this.
I have attached the patch I have put together, which does the following:
* Expose the previously static RE_* functions from regexp.c which wrap
the code in src/backend/regex with postgres-style errors, string
conversion, and caching of patterns.
* expose regex_flavor guc var, which is needed to know how to interpret
patterns when compiling them
* Add a couple more RE_* functions in regexp.c to provide access
to different levels of the process, which were necessary to avoid
duplicating effort elsewhere.
* Update replace_text_regexp in varlena.c to use newly exposed functions
from regexp.c instead of duplicating error handling code from there.
Also attached is the function I wrote to retrieve all of the capture
groups in a pattern match in a text[]. I also intend to put together a
function analogous to split_part which will take a string and a pattern to
split on, and return setof text.
Let me know if I should work under the assumption of the attached patch
and write the functions for contrib or pgfoundry, or to put the functions
in regexp.c and try to get them in core, or both? (it made my life a lot
easier working on the function to not have to restart the postmaster every
time I recompiled it, may be nice for the future to be able to make
extensions like this...)
--
To err is human, to forgive, beyond the scope of the Operating System.
On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
On Thu, 1 Feb 2007, Tom Lane wrote:
Jeremy Drake <pgsql@jdrake.com> writes:
Is there some specific reason that these functions are static,
Yeah: not cluttering the global namespace.
Is there a reason for not putting your new code itself into regexp.c?
Not really, I just figured it would be cleaner/easier to write it as an
extension. I also figure that it is unlikely that every regexp function
that anyone could possibly want will be implemented in core in that one
file. If anyone writes an extension like this, they would need to
duplicate a good amount of code in order to do so, that would make more
difficulty in maintaining the code if it should need to change. It also
makes developing a new function a lot easier, no need to re-initdb to add
the function, no need to relink the postmaster and restart it every time
the function changes.Anyway, the particular thing I was writing was a function like
substring(str FROM pattern) which instead of returning just the
first match group, would return an array of text containing all of
the match groups.
That'd be great! People who use dynamic languages like Perl would
feel much more at home having access to all the matches. While you're
at it, could you could make pre-match and post-match (optionally--I
know it's expensive) available?
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter
Remember to vote!
On Thu, 1 Feb 2007, David Fetter wrote:
On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
Anyway, the particular thing I was writing was a function like
substring(str FROM pattern) which instead of returning just the
first match group, would return an array of text containing all of
the match groups.
If you are subscribed to -patches, I sent my code to date there earlier
this evening. I also said that I wanted to make a function that split on
a pattern (like perl split) and returned setof text.
That'd be great! People who use dynamic languages like Perl would
feel much more at home having access to all the matches. While you're
at it, could you could make pre-match and post-match (optionally--I
know it's expensive) available?
I could, but I'm not sure how someone would go about accessing such a
thing. What I just wrote would be most like this perl:
@foo = ($str=~/pattern/);
Where would pre and post match fit into this? Are you talking about a
different function? Or sticking prematch at the beginning of the array
and postmatch at the end? I could also put the whole match somewhere
also, but I did not in this version.
The code I wrote returns a text[] which is one-dimensional, has a lower
bound of 1 (as most postgres arrays do), where if there are n capture
groups, ra[1] has the first capture group and ra[n] has the last one.
Since postgres has an option to make different lower bounds, I suppose I
could have an option to put the prematch in [-1], the entire match in [0],
and the postmatch in [n+1]. This seems to be odd to me though.
I guess I'm saying, I agree that the entire match, prematch, and postmatch
would be helpful, but how would you propose to present these to the user?
Cheers,
D
--
To err is human, to forgive, beyond the scope of the Operating System.
On Thu, Feb 01, 2007 at 10:16:54PM -0800, Jeremy Drake wrote:
On Thu, 1 Feb 2007, David Fetter wrote:
On Thu, Feb 01, 2007 at 05:11:30PM -0800, Jeremy Drake wrote:
Anyway, the particular thing I was writing was a function like
substring(str FROM pattern) which instead of returning just the
first match group, would return an array of text containing all
of the match groups.If you are subscribed to -patches, I sent my code to date there
earlier this evening. I also said that I wanted to make a function
that split on a pattern (like perl split) and returned setof text.That'd be great! People who use dynamic languages like Perl would
feel much more at home having access to all the matches. While
you're at it, could you could make pre-match and post-match
(optionally--I know it's expensive) available?I could, but I'm not sure how someone would go about accessing such
a thing. What I just wrote would be most like this perl: @foo =
($str=~/pattern/);
Where would pre and post match fit into this? Are you talking about a
different function?
Yes, although it might have the same name, as in regex_match(pattern
TEXT, string TEXT, return_pre_and_post BOOL).
Or sticking prematch at the beginning of the array and postmatch at
the end? I could also put the whole match somewhere also, but I did
not in this version.
The data structure could be something like
TYPE matches (
prematch TEXT,
match TEXT[],
postmatch TEXT
)
The code I wrote returns a text[] which is one-dimensional, has a lower
bound of 1 (as most postgres arrays do), where if there are n capture
groups, ra[1] has the first capture group and ra[n] has the last one.
Since postgres has an option to make different lower bounds, I suppose I
could have an option to put the prematch in [-1], the entire match in [0],
and the postmatch in [n+1]. This seems to be odd to me though.
Odd == bad. I think the pre- and post-matches should be different in
essence, not just in index :)
I guess I'm saying, I agree that the entire match, prematch, and postmatch
would be helpful, but how would you propose to present these to the user?
See above :)
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter
Remember to vote!
On Thu, 1 Feb 2007, David Fetter wrote:
Yes, although it might have the same name, as in regex_match(pattern
TEXT, string TEXT, return_pre_and_post BOOL).The data structure could be something like
TYPE matches (
prematch TEXT,
match TEXT[],
postmatch TEXT
)
I just coded up for this:
CREATE FUNCTION regexp_matches(IN str text, IN pattern text) RETURNS
text[]
AS 'MODULE_PATHNAME', 'regexp_matches'
LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION regexp_matches(
IN str text, IN pattern text, IN return_pre_and_post bool,
OUT prematch text, OUT fullmatch text, OUT matches text[], OUT
postmatch text) RETURNS record
AS 'MODULE_PATHNAME', 'regexp_matches'
LANGUAGE C IMMUTABLE STRICT;
Which works like this:
jeremyd=# \pset null '\\N'
Null display is "\N".
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(beque)$re$);
regexp_matches
----------------
{bar,beque}
(1 row)
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(beque)$re$, false);
prematch | fullmatch | matches | postmatch
----------+-----------+-------------+-----------
\N | \N | {bar,beque} | \N
(1 row)
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(beque)$re$, true);
prematch | fullmatch | matches | postmatch
----------+-----------+-------------+-----------
foo | barbeque | {bar,beque} | baz
(1 row)
And then you also have this behavior in the matches array:
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(.*)(beque)$re$);
regexp_matches
----------------
{bar,"",beque}
(1 row)
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(.+)(beque)$re$);
regexp_matches
----------------
\N
(1 row)
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(.+)?(beque)$re$);
regexp_matches
------------------
{bar,NULL,beque}
(1 row)
Reasonable?
--
A.A.A.A.A.:
An organization for drunks who drive
On Fri, 2 Feb 2007, Jeremy Drake wrote:
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(beque)$re$, false);
prematch | fullmatch | matches | postmatch
----------+-----------+-------------+-----------
\N | \N | {bar,beque} | \N
(1 row)
I just changed this to fill in fullmatch when the bool is false, so this
one would look like:
prematch | fullmatch | matches | postmatch
----------+-----------+-------------+-----------
\N | barbeque | {bar,beque} | \N
(1 row)
I also removed my check for capture groups, since in this setup you could
get useful output without any. I am still trying to decide whether or not
to add back an error if you called the no-bool version which just returns
the array, and you do not have any capture groups. ISTM this is likely an
oversight on the query author's part, and it would be helpful to alert him
to this.
If you have no capture groups, the matches array is empty (not null). If
the match happened at the start of the string, the prematch is an empty
string, and if the match happened at the end of the string, the postmatch
is an empty string.
Reasonable?
--
It's odd, and a little unsettling, to reflect upon the fact that
English is the only major language in which "I" is capitalized; in many
other languages "You" is capitalized and the "i" is lower case.
-- Sydney J. Harris
On Fri, Feb 02, 2007 at 12:54:30AM -0800, Jeremy Drake wrote:
On Fri, 2 Feb 2007, Jeremy Drake wrote:
jeremyd=# select * from regexp_matches('foobarbequebaz',
$re$(bar)(beque)$re$, false);
prematch | fullmatch | matches | postmatch
----------+-----------+-------------+-----------
\N | \N | {bar,beque} | \N
(1 row)I just changed this to fill in fullmatch when the bool is false, so this
one would look like:
prematch | fullmatch | matches | postmatch
----------+-----------+-------------+-----------
\N | barbeque | {bar,beque} | \N
(1 row)I also removed my check for capture groups, since in this setup you could
get useful output without any. I am still trying to decide whether or not
to add back an error if you called the no-bool version which just returns
the array, and you do not have any capture groups. ISTM this is likely an
oversight on the query author's part, and it would be helpful to alert him
to this.If you have no capture groups, the matches array is empty (not null). If
the match happened at the start of the string, the prematch is an empty
string, and if the match happened at the end of the string, the postmatch
is an empty string.Reasonable?
This is great :)
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter
Remember to vote!
On Fri, 2 Feb 2007, Jeremy Drake wrote:
I just coded up for this:
CREATE FUNCTION regexp_matches(IN str text, IN pattern text) RETURNS
text[]
AS 'MODULE_PATHNAME', 'regexp_matches'
LANGUAGE C IMMUTABLE STRICT;CREATE FUNCTION regexp_matches(
IN str text, IN pattern text, IN return_pre_and_post bool,
OUT prematch text, OUT fullmatch text, OUT matches text[], OUT
postmatch text) RETURNS record
AS 'MODULE_PATHNAME', 'regexp_matches'
LANGUAGE C IMMUTABLE STRICT;
I wanted to put out there the question of what order the parameters to
these regex functions should go. ISTM most people expect them to go
(pattern, string), but I made these functions consistant with
substring(text,text) which takes (string, pattern). Now I have been
working on a regexp_split function, which takes (pattern, string), which
is what someone familiar with the function from perl would expect, but is
not consistant with substring or now with my regexp_matches function.
I want to ask, should I break with following substring's precedent, and
put the pattern first (as most people probably would expect), or should I
break with perl's precedent and put the pattern second (to behave like
substring)?
--
We cannot put the face of a person on a stamp unless said person is
deceased. My suggestion, therefore, is that you drop dead.
-- James E. Day, Postmaster General
Jeremy Drake <pgsql@jdrake.com> writes:
I want to ask, should I break with following substring's precedent, and
put the pattern first (as most people probably would expect), or should I
break with perl's precedent and put the pattern second (to behave like
substring)?
All of SQL's pattern match operators have the pattern on the right, so
my advice is to stick with that and try not to think about Perl ;-)
regards, tom lane
On Fri, Feb 02, 2007 at 08:56:31PM -0500, Tom Lane wrote:
Jeremy Drake <pgsql@jdrake.com> writes:
I want to ask, should I break with following substring's
precedent, and put the pattern first (as most people probably
would expect), or should I break with perl's precedent and put the
pattern second (to behave like substring)?All of SQL's pattern match operators have the pattern on the right,
so my advice is to stick with that and try not to think about Perl
;-)
Perl provides inspiration, but here, consistency would help more than
orderly imitation of how it does what it does. And besides, when
people really need Perl, they can pull it in as a PL :)
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter
Remember to vote!
On Fri, 2 Feb 2007, David Fetter wrote:
On Fri, Feb 02, 2007 at 08:56:31PM -0500, Tom Lane wrote:
All of SQL's pattern match operators have the pattern on the right,
so my advice is to stick with that and try not to think about Perl
;-)Perl provides inspiration, but here, consistency would help more than
orderly imitation of how it does what it does. And besides, when
people really need Perl, they can pull it in as a PL :)
Alright, here is my code to date. I have put the pattern after the string
in the split function, as discussed above. The .tar.gz file expects to be
untarred in contrib/. I have made some regression tests that can be run
using 'make installcheck' as normal for contrib. I think they exercise
the corner cases in the code, but I may very well have missed some. It
requires the (previously submitted) attached patch to core to compile, as
it takes advantage of new exported functions from
src/backend/utils/adt/regexp.c.
Let me know if you see any bugs or issues with this code, and I am open to
suggestions for further regression tests ;)
Things that I still want to look into:
* regexp flags (a la regexp_replace).
* maybe make regexp_matches return setof whatever, if given a 'g' flag
return all matches in string.
* maybe a join function that works as an aggregate
SELECT join(',', col) FROM tbl
currently can be written as
SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')
--
It was a virgin forest, a place where the Hand of Man had never set
foot.
On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote:
Let me know if you see any bugs or issues with this code, and I am
open to suggestions for further regression tests ;)
Things that I still want to look into:
* regexp flags (a la regexp_replace).
One more text field at the end is how the regexp_replace() one does
it.
* maybe make regexp_matches return setof whatever, if given a 'g' flag
return all matches in string.
This is doable with current machinery, albeit a little clumsily.
* maybe a join function that works as an aggregate
SELECT join(',', col) FROM tbl
currently can be written as
SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')
The array_accum() aggregate in the docs works OK for this purpose.
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter
Remember to vote!
On Sun, 4 Feb 2007, David Fetter wrote:
On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote:
Let me know if you see any bugs or issues with this code, and I am
open to suggestions for further regression tests ;)Things that I still want to look into:
* regexp flags (a la regexp_replace).One more text field at the end is how the regexp_replace() one does
it.
That's how I did it.
* maybe make regexp_matches return setof whatever, if given a 'g' flag
return all matches in string.This is doable with current machinery, albeit a little clumsily.
I have implemented this too.
* maybe a join function that works as an aggregate
SELECT join(',', col) FROM tbl
currently can be written as
SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')The array_accum() aggregate in the docs works OK for this purpose.
I have not tackled this yet, I think it may be better to stick with the
ARRAY() construct for now.
So, here is the new version of the code, and also a new version of the
patch to core, which fixes some compile warnings that I did not see at
first because I was using ICC rather than GCC.
Here is the README.regexp_ext from the tar file:
This package contains regexp functions beyond those currently provided
in core PostgreSQL, utilizing the regexp engine built into core. This
is still a work-in-progress.
The most recent version of this code can be found at
http://www.jdrake.com/postgresql/regexp/regexp_ext.tar.gz
and the prerequisite patch to PostgreSQL core, which has been submitted
for review, can be found at
http://www.jdrake.com/postgresql/regexp/regexp-export.patch
The .tar.gz file expects to be untarred in contrib/. I have made some
regression tests that can be run using 'make installcheck' as normal for
contrib. I think they exercise the corner cases in the code, but I may
very well have missed some. It requires the above mentioned patch to
core to compile, as it takes advantage of new exported functions from
src/backend/utils/adt/regexp.c.
Let me know if you see any bugs or issues with this code, and I am open to
suggestions for further regression tests ;)
Functions implemented in this module:
* regexp_split(str text, pattern text) RETURNS SETOF text
regexp_split(str text, pattern text, flags text) RETURNS SETOF text
returns each section of the string delimited by the pattern.
* regexp_matches(str text, pattern text) RETURNS text[]
returns all capture groups when matching pattern against string in an array
* regexp_matches(str text, pattern text, flags text) RETURNS SETOF
(prematch text, fullmatch text, matches text[], postmatch text)
returns all capture groups when matching pattern against string in an array.
also returns the entire match in fullmatch. if the 'g' option is given,
returns all matches in the string. if the 'r' option is given, also return
the text before and after the match in prematch and postmatch respectively.
See the regression tests for more details about usage and return values.
Recent changes:
* I have put the pattern after the string in all of the functions, as
discussed on the pgsql-hackers mailing list.
* regexp flags (a la regexp_replace).
* make regexp_matches return setof whatever, if given a 'g' flag return
all matches in string.
Things that I still want to look into:
* maybe a join function that works as an aggregate
SELECT join(',', col) FROM tbl
currently can be written as
SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')
--
Philogeny recapitulates erogeny; erogeny recapitulates philogeny.
On Sun, 4 Feb 2007, David Fetter wrote:
On Fri, Feb 02, 2007 at 07:01:33PM -0800, Jeremy Drake wrote:
Let me know if you see any bugs or issues with this code, and I am
open to suggestions for further regression tests ;)
I have not heard anything, so I guess at this point I should figure out
where to go next with this. I see a couple options:
* Set this up as a pgfoundry project or contrib. This would require
merging the patch to expose some functions from regexp.c outside that
file, which has raised some concerns about maintainability.
* Put together a patch to add these functions to core. I could put them
directly in regexp.c, so the support functions could stay static. My
concern here is that I don't know if there are any functions currently
in core with OUT parameters. I don't know the acceptable style for
handling this: OUT parameters, a named composite type, ...?
Does anyone have any opinions either way, as to how I should proceed
from here?
* maybe a join function that works as an aggregate
SELECT join(',', col) FROM tbl
currently can be written as
SELECT array_to_string(ARRAY(SELECT col FROM tbl), ',')The array_accum() aggregate in the docs works OK for this purpose.
I have decided not to pursue this function, I think the array construct,
or the array_accum option, is about the best possible currently. If it
should become possible in the future to write aggregates with a non-sql
state type (structs with pointers) it may be worthwhile to re-evaluate
this.
--
The cost of living hasn't affected its popularity.
Jeremy Drake <pgsql@jdrake.com> writes:
* Put together a patch to add these functions to core. I could put them
directly in regexp.c, so the support functions could stay static. My
concern here is that I don't know if there are any functions currently
in core with OUT parameters.
As of 8.2 there are.
If we are going to include these I would vote for core not contrib
status, exactly to avoid having to export those functions.
regards, tom lane
On Wed, Feb 07, 2007 at 09:23:58AM -0500, Tom Lane wrote:
Jeremy Drake <pgsql@jdrake.com> writes:
* Put together a patch to add these functions to core. I could put them
directly in regexp.c, so the support functions could stay static. My
concern here is that I don't know if there are any functions currently
in core with OUT parameters.As of 8.2 there are.
If we are going to include these I would vote for core not contrib
status, exactly to avoid having to export those functions.
+1 for core. :)
Cheers,
D
--
David Fetter <david@fetter.org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter
Remember to vote!
On Wed, 7 Feb 2007, Tom Lane wrote:
Jeremy Drake <pgsql@jdrake.com> writes:
* Put together a patch to add these functions to core. I could put them
directly in regexp.c, so the support functions could stay static. My
concern here is that I don't know if there are any functions currently
in core with OUT parameters.As of 8.2 there are.
Could you give me the name of one in pg_proc.h so I can see how I should
go about adding one there?
If we are going to include these I would vote for core not contrib
status, exactly to avoid having to export those functions.
OK, this patch will be my next project.
--
History is curious stuff
You'd think by now we had enough
Yet the fact remains I fear
They make more of it every year.