EOL characters and multibyte encodings

Started by Joe Conway, about 19 years ago | 6 messages | hackers

mail@joeconway.com

about 19 years ago

I finally was able to get PL/R to compile and run on Windows recently. This has
led to people using a Windows-based client (typically PgAdmin III) to
create PL/R functions. Immediately I started to receive reports of
failures that turned out to be due to the carriage return (\r) used in
standard Win32 EOLs (\r\n). It seems that the R parser only accepts
newlines (\n), even on Win32 (confirmed on r-devel list with a core
developer).

My first thought on fixing this issue was to simply replace all
instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
R parser. As far as I know, any instances of '\r' embedded in a
syntactically valid R statement must be escaped (i.e. literally the
characters "\" and "r"), so that should not be a problem. But I am
concerned about how this potentially plays against multibyte characters.
Is it safe to do this, or do I need to use a mb-aware replace algorithm?

Thanks,

Joe

Tom Lane

tgl@sss.pgh.pa.us

about 19 years ago

In reply to: Joe Conway (#1)

Re: EOL characters and multibyte encodings

Joe Conway <mail@joeconway.com> writes:

I finally was able to get PL/R to compile and run on Windows recently. This has
led to people using a Windows-based client (typically PgAdmin III) to
create PL/R functions. Immediately I started to receive reports of
failures that turned out to be due to the carriage return (\r) used in
standard Win32 EOLs (\r\n). It seems that the R parser only accepts
newlines (\n), even on Win32 (confirmed on r-devel list with a core
developer).

My first thought on fixing this issue was to simply replace all
instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
R parser. As far as I know, any instances of '\r' embedded in a
syntactically valid R statement must be escaped (i.e. literally the
characters "\" and "r"), so that should not be a problem. But I am
concerned about how this potentially plays against multibyte characters.
Is it safe to do this, or do I need to use a mb-aware replace algorithm?

It's safe, because you'll be dealing with prosrc inside the backend,
therefore using a backend-legal encoding, and those don't have any ASCII
aliasing problems (all bytes of an MB character must have high bit set).

However I dislike doing it exactly that way because line numbers in the
R script will all get doubled. Unless R never reports errors in terms
of line numbers, you'd be better off to either delete the \r characters
or replace them with spaces.

regards, tom lane
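
To make the line-number concern concrete, here is a minimal Python sketch. The thread itself contains no code and PL/R is written in C, so the function names and the sample source below are illustrative assumptions, not PL/R code. It compares the three ways of handling the \r in a \r\n-terminated script and reports the line on which a given statement ends up.

```python
# Illustrative only: names and sample text are assumptions, not PL/R code.
SRC = b"x <- 1\r\ny <- 2\r\nstop('boom')\r\n"

def cr_to_newline(src: bytes) -> bytes:
    """Replace every CR with LF (the approach first proposed)."""
    return src.replace(b"\r", b"\n")

def cr_to_space(src: bytes) -> bytes:
    """Replace every CR with a space, leaving the LFs alone."""
    return src.replace(b"\r", b" ")

def cr_deleted(src: bytes) -> bytes:
    """Delete every CR, leaving the LFs alone."""
    return src.replace(b"\r", b"")

def line_of(src: bytes, needle: bytes) -> int:
    """1-based line number on which needle first appears."""
    return src[: src.index(needle)].count(b"\n") + 1

print(line_of(cr_to_newline(SRC), b"stop"))  # 5: each CRLF became two line breaks
print(line_of(cr_to_space(SRC), b"stop"))    # 3: same line as in the original
print(line_of(cr_deleted(SRC), b"stop"))     # 3: same line as in the original
```

With CR turned into LF, the statement written on line 3 is reported on line 5, which is the doubling described above; deleting the CR or turning it into a space keeps the original numbering.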

Andrew Dunstan

andrew@dunslane.net

about 19 years ago

In reply to: Joe Conway (#1)

Re: EOL characters and multibyte encodings

Joe Conway wrote:

I finally was able to get PL/R to compile and run on Windows recently. This
has led to people using a Windows-based client (typically PgAdmin
III) to create PL/R functions. Immediately I started to receive
reports of failures that turned out to be due to the carriage return
(\r) used in standard Win32 EOLs (\r\n). It seems that the R parser
only accepts newlines (\n), even on Win32 (confirmed on r-devel list
with a core developer).

My first thought on fixing this issue was to simply replace all
instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to
the R parser. As far as I know, any instances of '\r' embedded in a
syntactically valid R statement must be escaped (i.e. literally the
characters "\" and "r"), so that should not be a problem. But I am
concerned about how this potentially plays against multibyte
characters. Is it safe to do this, or do I need to use a mb-aware
replace algorithm?

Didn't we just settle that all the server-side encodings have to be
ASCII supersets? In which case, just removing the CRs should be quite safe.

cheers

andrew
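
The "ASCII superset" property can be checked with a short Python sketch. It uses Python's own utf-8 and euc_jp codecs as stand-ins for the UTF8 and EUC_JP server encodings; the function names and sample text are assumptions made for illustration. It confirms that every byte of a multibyte character has the high bit set, so a byte-wise removal of 0x0D cannot damage a character.

```python
# Illustrative sketch; function names and sample text are assumptions.
CR = 0x0D

def multibyte_bytes(text: str, encoding: str) -> list:
    """Return the encoded bytes of every non-ASCII character in text."""
    return [b for ch in text if ord(ch) > 0x7F for b in ch.encode(encoding)]

def strip_cr(src: bytes) -> bytes:
    """Drop every 0x0D byte without decoding the input."""
    return bytes(b for b in src if b != CR)

SAMPLE = "s <- '日本語'\r\nprint(s)\r\n"

for enc in ("utf-8", "euc_jp"):
    # High bit set on every byte of a multibyte character: none can equal 0x0D.
    assert all(b & 0x80 for b in multibyte_bytes(SAMPLE, enc))
    # Stripping CR at the byte level gives the same result as stripping it
    # from the decoded text.
    assert strip_cr(SAMPLE.encode(enc)).decode(enc) == SAMPLE.replace("\r", "")
```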

Joe Conway

mail@joeconway.com

about 19 years ago

In reply to: Tom Lane (#2)

Re: EOL characters and multibyte encodings

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

My first thought on fixing this issue was to simply replace all
instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
R parser. As far as I know, any instances of '\r' embedded in a
syntactically valid R statement must be escaped (i.e. literally the
characters "\" and "r"), so that should not be a problem. But I am
concerned about how this potentially plays against multibyte characters.
Is it safe to do this, or do I need to use a mb-aware replace algorithm?

It's safe, because you'll be dealing with prosrc inside the backend,
therefore using a backend-legal encoding, and those don't have any ASCII
aliasing problems (all bytes of an MB character must have high bit set).

Great -- I wasn't sure about that.

However I dislike doing it exactly that way because line numbers in the
R script will all get doubled. Unless R never reports errors in terms
of line numbers, you'd be better off to either delete the \r characters
or replace them with spaces.

Good point. But I need to be able to deal with Apple EOLs too -- IIRC
those can be *only* '\r'. So I guess I need to do a look-ahead whenever
I run into '\r', see if it is followed by '\n', and then munge the
string accordingly.

Joe
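
The look-ahead described above could be sketched roughly as follows. This is a Python illustration under assumptions (the function name normalize_eols is hypothetical, and the real change would live in PL/R's C code): a CR followed by LF is dropped, a lone CR becomes LF, and everything else is copied through, so each original line break yields exactly one LF.

```python
# Illustrative sketch of the look-ahead; normalize_eols is a hypothetical name.
CR, LF = 0x0D, 0x0A

def normalize_eols(src: bytes) -> bytes:
    """Convert CRLF and lone-CR line endings to LF, one LF per line break."""
    out = bytearray()
    i, n = 0, len(src)
    while i < n:
        b = src[i]
        if b == CR:
            if i + 1 < n and src[i + 1] == LF:
                # CRLF: skip the CR; the LF is copied on the next iteration.
                i += 1
                continue
            # Lone CR (old Apple style): emit a single LF in its place.
            out.append(LF)
        else:
            out.append(b)
        i += 1
    return bytes(out)

print(normalize_eols(b"a\r\nb\rc\n"))  # b'a\nb\nc\n'
```

Because the scan works on raw bytes, it relies on the backend-encoding guarantee discussed earlier in the thread: 0x0D and 0x0A never occur inside a multibyte character.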

William ZHANG

zedware@gmail.com

about 19 years ago

In reply to: Joe Conway (#1)

Re: EOL characters and multibyte encodings

"Joe Conway" <mail@joeconway.com> wrote:

Tom Lane wrote:

Joe Conway <mail@joeconway.com> writes:

My first thought on fixing this issue was to simply replace all
instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
R parser. As far as I know, any instances of '\r' embedded in a
syntactically valid R statement must be escaped (i.e. literally the
characters "\" and "r"), so that should not be a problem. But I am
concerned about how this potentially plays against multibyte characters.
Is it safe to do this, or do I need to use a mb-aware replace algorithm?

It's safe, because you'll be dealing with prosrc inside the backend,
therefore using a backend-legal encoding, and those don't have any ASCII
aliasing problems (all bytes of an MB character must have high bit set).

The low (second) byte of some characters in BIG5, GBK, and GB18030 may be
less than 0x7F and so does not have the high bit set. Fortunately, these
encodings never use 0x0D or 0x0A (CR and LF) inside a multibyte character.

Regards,
William ZHANG
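
This claim can be spot-checked with a small Python sketch. It uses Python's big5, gbk, and gb18030 codecs, which are an assumption here: they are not PostgreSQL's conversion tables and may differ in detail. The function name is hypothetical. It collects every byte below 0x80 that appears inside a multibyte character for code points up to U+FFFF.

```python
# Illustrative sketch using Python's codecs, not PostgreSQL's conversion tables.
def ascii_range_bytes(encoding: str) -> set:
    """Bytes below 0x80 that occur inside a multibyte character."""
    found = set()
    for cp in range(0x80, 0x10000):
        try:
            raw = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # code point not representable in this encoding
        if len(raw) > 1:
            found.update(b for b in raw if b < 0x80)
    return found

for enc in ("big5", "gbk", "gb18030"):
    aliased = ascii_range_bytes(enc)
    # Some bytes do fall in the ASCII range...
    assert aliased
    # ...but CR and LF are never among them.
    assert 0x0D not in aliased and 0x0A not in aliased
```

The ASCII-range bytes that do turn up include 0x5C (the backslash), which is the kind of aliasing that makes a naive byte scan unsafe in these encodings even though CR and LF themselves are not affected.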

Andrew Dunstan

andrew@dunslane.net

about 19 years ago

In reply to: William ZHANG (#5)

Re: EOL characters and multibyte encodings

William ZHANG wrote:

It's safe, because you'll be dealing with prosrc inside the backend,
therefore using a backend-legal encoding, and those don't have any ASCII
aliasing problems (all bytes of an MB character must have high bit set).

The low (second) byte of some characters in BIG5, GBK, and GB18030 may be
less than 0x7F and so does not have the high bit set. Fortunately, these
encodings never use 0x0D or 0x0A (CR and LF) inside a multibyte character.

Those are client-only encodings, precisely for this sort of reason, and
thus not relevant to the present discussion. As Tom points out above,
when the language handler gets the code it will be encoded in the
relevant backend encoding which can't be any of these.

(Side note: the restriction by the R parser to Unix-only line endings is
a dreadful piece of design. As Jon Postel rightly said, the best rule is
"Be liberal in what you accept and conservative in what you send." Just
about every parser for every language has been able to handle this, so
why must R be different?)

cheers

andrew