Character classes

Started by PG Bug reporting form, almost 7 years ago · 3 messages · docs
#1 PG Bug reporting form
noreply@postgresql.org

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/11/functions-matching.html
Description:

On https://www.postgresql.org/docs/11/functions-matching.html paragraph
9.7.3.2. Bracket Expressions says "Standard character class names are:
alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper,
xdigit". The class "ascii" exists, but is not mentioned (probably a
combination of some of the other classes). Are there any other classes? Do
they work only for ASCII characters (e.g. '\u00A0' is not picked up by
'[:blank:]')?
best regards
geert

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: PG Bug reporting form (#1)
Re: Character classes

PG Doc comments form <noreply@postgresql.org> writes:

> On https://www.postgresql.org/docs/11/functions-matching.html paragraph
> 9.7.3.2. Bracket Expressions says "Standard character class names are:
> alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper,
> xdigit". The class "ascii" exists, but is not mentioned (probably a
> combination of some of the other classes). Are there any other classes?

Hm, fair question. I think the text means to say that these are the
character class names required by the POSIX regexp spec, which is
accurate. A look into our src/backend/regex/regc_locale.c will show
you that we also implement "ascii", and no others. That probably ought
to be documented.
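
As a rough illustration of the name list described above (the twelve POSIX names quoted from the documentation, plus "ascii"), the lookup could be sketched in C as follows. This is not the code from regc_locale.c; the table layout and the function name is_known_class_name are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <string.h>

/* The twelve POSIX class names quoted from the docs, plus "ascii".
 * Illustrative table only; not copied from PostgreSQL's source. */
static const char *const class_names[] = {
    "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
    "lower", "print", "punct", "space", "upper", "xdigit", NULL
};

/* Hypothetical helper: returns 1 if name is one of the class names
 * listed above, 0 otherwise. */
int is_known_class_name(const char *name)
{
    for (size_t i = 0; class_names[i] != NULL; i++) {
        if (strcmp(class_names[i], name) == 0)
            return 1;
    }
    return 0;
}
```

With this table, is_known_class_name("ascii") returns 1, while any name outside the list returns 0.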

> Do they work only for ASCII characters (e.g. '\u00A0' is not picked up
> by '[:blank:]')?

The POSIX ones are implemented by calling the C library, so it's whatever
the ctype.h and wctype.h functions think is appropriate for your LC_CTYPE
setting.
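
To make the locale dependence concrete, here is a minimal C sketch of asking the C library whether a single byte counts as "blank" under a given LC_CTYPE. It is not PostgreSQL code: the helper name blank_in_locale is hypothetical, and it only covers the single-byte ctype.h path, not the wctype.h path used for wide characters.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper (not from PostgreSQL): returns 1 if byte c is
 * classified as blank under the named LC_CTYPE locale, 0 if it is not,
 * and -1 if that locale cannot be selected on this system.
 * Note: setlocale() changes the locale for the whole process. */
int blank_in_locale(const char *locale_name, unsigned char c)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return isblank(c) ? 1 : 0;
}
```

In the "C" locale the C standard limits isblank() to space and horizontal tab, so byte 0xA0 is not blank there. Under another LC_CTYPE setting the answer is whatever that platform's locale data says, which is why the same pattern can match differently on different servers.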

The 20-year-old reference in our text to ctype(3) seems rather unhelpful
today; in the first place, there's no such man page on my Linux systems,
and in the second place, wctype(3) is more important if it exists, and
in the third place what a reader actually wants to know is that this
is controlled by the LC_CTYPE server parameter. It'd likely be better
to dump the man-page reference altogether and instead point readers to
our "Locale Support" chapter.

regards, tom lane

#3 Thomas Munro
thomas.munro@gmail.com
In reply to: Tom Lane (#2)
Re: Character classes

On Tue, May 21, 2019 at 6:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> The 20-year-old reference in our text to ctype(3) seems rather unhelpful
> today; in the first place, there's no such man page on my Linux systems,
> and in the second place, wctype(3) is more important if it exists, and
> in the third place what a reader actually wants to know is that this
> is controlled by the LC_CTYPE server parameter. It'd likely be better
> to dump the man-page reference altogether and instead point readers to
> our "Locale Support" chapter.

No opinion on the reference, but out of curiosity I hunted down the
equivalent man page on a RHEL system. There it goes by ctype.h(0P),
which makes some kind of sense: there isn't a ctype function, so it
has no business in section 3, while wctype is a function so there is a
wctype(3) along with a header page wctype.h(0P). 0P seems to be for
POSIX headers, or something like that. BSDen don't seem to bother
with this distinction and just provide ctype(3).

--
Thomas Munro
https://enterprisedb.com