Tightening isspace() tests where behavior should match SQL parser

Started by Tom Laneover 8 years ago5 messages

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

1 attachment(s)

The proximate cause of bug #14662,
/messages/by-id/20170519162316.29945.5021@wrigleys.postgresql.org
appears to be that SplitIdentifierString's isspace() checks are
identifying some bytes of a multibyte character as spaces. It's not
quite clear to me whether there's an encoding configuration problem
involved in this specific report, but in any case feeding individual
bytes of a multibyte character to <ctype.h> functions is highly dubious.
The best you can say about that is that the behavior will be
platform-dependent.

I think that the easiest way to resolve this is to replace the isspace()
calls with scanner_isspace(), on the grounds that we are trying to parse
identifiers the same way that the core SQL lexer would, and therefore we
should use its definition of whitespace.

I went looking through all our other calls of isspace() to see if we
have the same problem anywhere else, and identified these functions
as being at risk:

parse_ident()
regproc.c's parseNameAndArgTypes (for regprocedurein and siblings)
SplitIdentifierString()
SplitDirectoriesString()

In all four cases, we must allow for non-ASCII input and it's
arguable that correct behavior is to match the core lexer,
so I propose replacing isspace() with scanner_isspace() in
these places.

There are quite a lot more isspace() calls than that of course,
but the rest of them seem probably all right to me. As an
example, I don't see a need to make float8in() stricter about
what it will allow as trailing whitespace. We're not expecting
any non-ASCII input really, and if we do see multibyte characters
and all the bytes manage to pass isspace(), not much harm is done.

Attached is a proposed patch. I'm vacillating on whether to
back-patch this --- it will fix a reported bug, but it seems
at least somewhat conceivable that it will also break cases
that were working acceptably before. Thoughts?

regards, tom lane

Heikki Linnakangas

hlinnaka@iki.fi

over 8 years ago

In reply to: Tom Lane (#1)

Re: Tightening isspace() tests where behavior should match SQL parser

On 05/20/2017 01:48 PM, Tom Lane wrote:

Attached is a proposed patch. I'm vacillating on whether to
back-patch this --- it will fix a reported bug, but it seems
at least somewhat conceivable that it will also break cases
that were working acceptably before. Thoughts?

+1 for back-patching. If I understand correctly, it would change the
behavior when you pass a string with non-ASCII leading or trailing
whitespace characters to those functions. I suppose that can happen, but
it's only pure luck if the current code happens to work in that case.
And even if so, it's IMO still quite reasonable to change the behavior,
on the grounds that the new behavior is what the core SQL lexer would do.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Heikki Linnakangas (#2)

Re: Tightening isspace() tests where behavior should match SQL parser

Heikki Linnakangas <hlinnaka@iki.fi> writes:

On 05/20/2017 01:48 PM, Tom Lane wrote:

Attached is a proposed patch. I'm vacillating on whether to
back-patch this --- it will fix a reported bug, but it seems
at least somewhat conceivable that it will also break cases
that were working acceptably before. Thoughts?

+1 for back-patching. If I understand correctly, it would change the
behavior when you pass a string with non-ASCII leading or trailing
whitespace characters to those functions. I suppose that can happen, but
it's only pure luck if the current code happens to work in that case.

Well, it'd work properly for e.g. no-break space in LATINn. But that
seems like a very narrow use-case, and probably not enough to justify
the misbehavior you might get in multi-byte character sets.

A compromise possibly worth considering is to apply isspace() only in
single-byte encodings. I think that's more complication than it's
worth, but others might think differently.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Tom Lane

tgl@sss.pgh.pa.us

over 8 years ago

In reply to: Tom Lane (#3)

Re: Tightening isspace() tests where behavior should match SQL parser

I wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

+1 for back-patching. If I understand correctly, it would change the
behavior when you pass a string with non-ASCII leading or trailing
whitespace characters to those functions. I suppose that can happen, but
it's only pure luck if the current code happens to work in that case.

Well, it'd work properly for e.g. no-break space in LATINn.

Actually, it's dubious that treating no-break space as whitespace is
correct at all in these use-cases. The core scanner would think it's
an identifier character, so it's not impossible that somebody would
consider it cute to write   as part of a SQL identifier. If
the core scanner accepts that, so must these functions.

Hence, applied and back-patched.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

over 8 years ago

In reply to: Tom Lane (#4)

Re: Tightening isspace() tests where behavior should match SQL parser

On 5/24/17 15:34, Tom Lane wrote:

I wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:

+1 for back-patching. If I understand correctly, it would change the
behavior when you pass a string with non-ASCII leading or trailing
whitespace characters to those functions. I suppose that can happen, but
it's only pure luck if the current code happens to work in that case.

Well, it'd work properly for e.g. no-break space in LATINn.

Actually, it's dubious that treating no-break space as whitespace is
correct at all in these use-cases. The core scanner would think it's
an identifier character, so it's not impossible that somebody would
consider it cute to write   as part of a SQL identifier. If
the core scanner accepts that, so must these functions.

The SQL standard might permit non-breaking spaces or similar things as
token delimiters, so it could be legitimate to look into changing that
sometime. But in any case it should be consistent, so it's correct to
make this change now.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Participants

Thread Outline

Tightening isspace() tests where behavior should match SQL parser

Attachments: