EOL characters and multibyte encodings
I finally was able to get PL/R to compile and run on Windows recently. This
has led to people using a Windows-based client (typically PgAdmin III) to
create PL/R functions. Immediately I started to receive reports of
failures that turned out to be due to the carriage return (\r) used in
standard Win32 EOLs (\r\n). It seems that the R parser only accepts
newlines (\n), even on Win32 (confirmed on r-devel list with a core
developer).
My first thought on fixing this issue was to simply replace all
instances of '\r' in pg_proc.prosrc with '\n' prior to sending it to the
R parser. As far as I know, any instances of '\r' embedded in a
syntactically valid R statement must be escaped (i.e. literally the
characters "\" and "r"), so that should not be a problem. But I am
concerned about how this potentially plays against multibyte characters.
Is it safe to do this, or do I need to use a mb-aware replace algorithm?
Thanks,
Joe
It's safe, because you'll be dealing with prosrc inside the backend,
therefore using a backend-legal encoding, and those don't have any ASCII
aliasing problems (all bytes of an MB character must have high bit set).
However I dislike doing it exactly that way because line numbers in the
R script will all get doubled. Unless R never reports errors in terms
of line numbers, you'd be better off to either delete the \r characters
or replace them with spaces.
regards, tom lane
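The two points in the reply above can be checked with a short sketch. It is written in Python for brevity, not in the C that PL/R itself uses, and all function names are hypothetical. The three encodings are only a sample of those PostgreSQL accepts as server encodings, and the sample strings are made up. The sketch checks that every byte of a multibyte character in those encodings has the high bit set, so a byte-wise search for 0x0D cannot land inside a character. It also compares line counts for the three candidate treatments of '\r'.

```python
# Sample of encodings PostgreSQL allows on the server side (not exhaustive).
# The sample strings are arbitrary illustrations.
SERVER_ENCODINGS = {
    "utf-8": "naïve 日本語 한국어",
    "euc_jp": "日本語のコメント",
    "euc_kr": "한국어 주석",
}


def multibyte_bytes_have_high_bit(text: str, encoding: str) -> bool:
    """Return True if every byte of every multibyte character has 0x80 set."""
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1 and not all(b & 0x80 for b in encoded):
            return False
    return True


def strip_cr(src: bytes) -> bytes:
    """Delete every carriage return byte, leaving newlines untouched."""
    return src.replace(b"\r", b"")


def line_count(src: bytes) -> int:
    """Count newline bytes, i.e. the number of terminated lines."""
    return src.count(b"\n")


# A three-line script as a Windows client would store it.
crlf_script = b"x <- 1\r\ny <- 2\r\nx + y\r\n"

if __name__ == "__main__":
    for enc, sample in SERVER_ENCODINGS.items():
        print(enc, multibyte_bytes_have_high_bit(sample, enc))
    print("CR -> LF    :", line_count(crlf_script.replace(b"\r", b"\n")))
    print("CR deleted  :", line_count(strip_cr(crlf_script)))
    print("CR -> space :", line_count(crlf_script.replace(b"\r", b" ")))
```

Deleting the '\r' or replacing it with a space keeps the script at three lines, while replacing it with '\n' yields six, which is the line-number doubling described above.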
Didn't we just settle that all the server-side encodings have to be
ASCII supersets? In which case, just removing the CRs should be quite safe.
cheers
andrew
Tom Lane wrote:
> It's safe, because you'll be dealing with prosrc inside the backend,
> therefore using a backend-legal encoding, and those don't have any ASCII
> aliasing problems (all bytes of an MB character must have high bit set).
Great -- I wasn't sure about that.
> However I dislike doing it exactly that way because line numbers in the
> R script will all get doubled. Unless R never reports errors in terms
> of line numbers, you'd be better off to either delete the \r characters
> or replace them with spaces.
Good point. But I need to be able to deal with Apple EOLs too -- IIRC
those can be *only* '\r'. So I guess I need to do a look-ahead whenever
I run into '\r', see if it is followed by '\n', and then munge the
string accordingly.
Joe
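The look-ahead described above might be sketched as follows. This is an illustration in Python working on raw bytes, not the code that went into PL/R. The name `normalize_eol` is hypothetical, and the choice to map a lone '\r' to '\n' is an assumption about the intended result. A '\r' that is followed by '\n' is dropped, so Windows, old Mac and Unix sources all end up with one '\n' per line and line numbers are preserved.

```python
CR = 0x0D
LF = 0x0A


def normalize_eol(src: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF, byte by byte."""
    out = bytearray()
    n = len(src)
    for i, b in enumerate(src):
        if b == CR:
            # Look ahead: is this CR the first half of a CRLF pair?
            if i + 1 < n and src[i + 1] == LF:
                continue  # drop it; the LF is copied on the next iteration
            out.append(LF)  # lone CR (old Mac style) becomes a newline
        else:
            out.append(b)
    return bytes(out)


if __name__ == "__main__":
    mixed = b"a <- 1\r\nb <- 2\rc <- 3\n"
    print(normalize_eol(mixed))
```

Because the scan only ever compares single bytes against 0x0D and 0x0A, it relies on the property discussed earlier in the thread: in a backend encoding those byte values never occur inside a multibyte character.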
"Joe Conway" <mail@joeconway.com>
Tom Lane wrote:
> It's safe, because you'll be dealing with prosrc inside the backend,
> therefore using a backend-legal encoding, and those don't have any ASCII
> aliasing problems (all bytes of an MB character must have high bit set).
The lower byte of some characters in BIG5, GBK, and GB18030 may be less than
0x7F and so does not have the high bit set. Fortunately, these encodings
never use 0x0D or 0x0A (CR and LF) there.
Regards,
William ZHANG
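The observation above can be spot-checked with Python's bundled codecs. The sketch below is a rough check, not a proof. It scans only the CJK Unified Ideographs range U+4E00 to U+9FA5, the function name `scan` is hypothetical, and it assumes Python's `big5`, `gbk` and `gb18030` codecs follow the same byte layouts as the encodings discussed here. For each encoding it reports whether any byte of a multibyte character lacks the high bit, and whether any such byte is CR or LF.

```python
ENCODINGS = ("big5", "gbk", "gb18030")
CR = 0x0D
LF = 0x0A


def scan(encoding: str, first: int = 0x4E00, last: int = 0x9FA5):
    """Return (saw_low_byte, saw_cr_or_lf) for characters in the range."""
    saw_low_byte = False
    saw_cr_or_lf = False
    for code_point in range(first, last + 1):
        try:
            encoded = chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            continue  # character not representable in this encoding
        for b in encoded:
            if b < 0x80:
                saw_low_byte = True
            if b in (CR, LF):
                saw_cr_or_lf = True
    return saw_low_byte, saw_cr_or_lf


if __name__ == "__main__":
    for enc in ENCODINGS:
        low, crlf = scan(enc)
        print(f"{enc}: byte without high bit={low}, CR or LF byte={crlf}")
```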
William ZHANG wrote:
>> It's safe, because you'll be dealing with prosrc inside the backend,
>> therefore using a backend-legal encoding, and those don't have any ASCII
>> aliasing problems (all bytes of an MB character must have high bit set).
> The lower byte of some characters in BIG5, GBK, GB18030 may be less than
> 0x7F and don't have the high bit set. Fortunately, they don't use 0x0D and
> 0x0A (CR and LF).
Those are client-only encodings, precisely for this sort of reason, and
thus not relevant to the present discussion. As Tom points out above,
when the language handler gets the code it will be encoded in the
relevant backend encoding which can't be any of these.
(Side note: the restriction by the R parser to unix-only line endings is
a dreadful piece of design. As Jon Postel rightly said, the best rule is
"Be liberal in what you accept and conservative in what you send." Just
about every parser for every language has been able to handle this, so
why must R be different?)
cheers
andrew