multi line text data/query ?bug?
While I was testing an issue in python I discovered a general
*nix/windows/mac issue with newlines.
The same query will give different results depending on what client executes
it.
create table test1(f1 text);
insert into test1 values('this is a long string.
it will have new lines in it.
I want to see if those new lines go away.
so ignore. the ugliness')
If the insert was done on a Windows machine there will be a CRLF as the
EOLN, if done on Unix it will have LF and if done on Mac it will have CR.
So if the insert was done on windows, the following query will only work
from a windows client :
select * from test1 where f1='this is a long string.
it will have new lines in it.
I want to see if those new lines go away.
so ignore. the ugliness'
If that insert was done on a *NIX then the query will only work from that
client
Am Dienstag, 22. Mᅵrz 2005 08:00 schrieb Sim Zacks:
While I was testing an issue in python I discovered a general
*nix/windows/mac issue with newlines.
The same query will give different results depending on what client
executes it.
While that is indoubtedly strange behavior, I can't see how it qualifies as a
bug. You press a certain key, and you get exactly the characters that the
key is defined to produce. That differs between operating systems, but so do
a lot of other things.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/
While I would agree with you that from a purely technical standpoint, the
user inserted into the database a CRLF and a query with just an LF does not
exactly match that, from a users and more practical perspective, that does
not make sense at all. That is why I surrounded the word bug in ??.
I would say that from a users perspective it qualifies as a bug because they
did not put in specific binary characters. They want a newline. From a
database standards perspective, I would argue that any database that allows
connections from a client without qualifying a required operating system
should be OS neutral.
I would say it is a bug from a users perspective because the exact same
query works differently from different clients. Since the user does not
choose what binary characters to put in, they are invisible to the user.
Anything that is completely invisible to the user should not be considered
valid qualifying data.
As there is no postgresql database standard, such as "all newlines are unix
newlines" it is impossible to write a client that will necessarily return
the data that you want.
This is the exact problem we are having with Python right now, as a Windows
client cannot write a python function to be run on a linux server.
"Peter Eisentraut" <peter_e@gmx.net> wrote in message
news:200503221535.07641.peter_e@gmx.net...
Am Dienstag, 22. M�rz 2005 08:00 schrieb Sim Zacks:
While I was testing an issue in python I discovered a general
*nix/windows/mac issue with newlines.
The same query will give different results depending on what client
executes it.While that is indoubtedly strange behavior, I can't see how it qualifies
as a
bug. You press a certain key, and you get exactly the characters that the
key is defined to produce. That differs between operating systems, but so
do
Show quoted text
a lot of other things.
--
Peter Eisentraut
http://developer.postgresql.org/~petere/---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
On Wed, 23 Mar 2005, Sim Zacks wrote:
While I would agree with you that from a purely technical standpoint, the
user inserted into the database a CRLF and a query with just an LF does not
exactly match that, from a users and more practical perspective, that does
not make sense at all. That is why I surrounded the word bug in ??.I would say that from a users perspective it qualifies as a bug because they
did not put in specific binary characters. They want a newline. From a
database standards perspective, I would argue that any database that allows
connections from a client without qualifying a required operating system
should be OS neutral.I would say it is a bug from a users perspective because the exact same
query works differently from different clients. Since the user does not
choose what binary characters to put in, they are invisible to the user.
Anything that is completely invisible to the user should not be considered
valid qualifying data.As there is no postgresql database standard, such as "all newlines are unix
newlines" it is impossible to write a client that will necessarily return
the data that you want.This is the exact problem we are having with Python right now, as a Windows
client cannot write a python function to be run on a linux server.
Unfortunately, it's not that simple. There are problems with python
when _both_ the client and the server are Windows. Python itself
_always_ uses \n even on Windows. So the only solution is to
"pythonize" the input (convert to \n), no matter what.
For the more general problem of handling text, see my comments in
this thread:
http://archives.postgresql.org/pgsql-general/2005-01/msg00792.php
There are interesting problems with multiline text, as a datatype.
Think of digital signatures and checksums. Think of a simple function:
len(text)
should it count line separators as characters? In theory, the only
way to get cross-platform consistent behaviour, is to _ignore_ line
separators when counting or checksumming. But the real world solution
is to treat textfiles as binary and let the users or the application
handle the conversion.
.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo@ESI.it
In any case, there are 2 correct solutions to the problem for the case of
postgresql.
1) Database standard - postgresql chooses a newline standard and every
client must support that if they support postgresql. Either put the onus on
the client developers to override the OS standard and support the postgresql
standard, or have the db convert the incoming newlines into the db standard.
2) Server's OS - the server must convert any new lines coming in to the
standard that it expects. This is similar to an ftp client that converts
newlines on the transfer. That means that data sent to a Windows server with
an LF will be converted to a CRLF and vice versa.
The data restore function will also have to follow the above procedures to
make sure you can take data from one server to the other without
compromising integrity.
Without one of these solutions, PostGreSQL is not compatible between servers
and clients.
A query written on any client should return the same result. The query being
the visible appearance on the screen. That is what the users would expect to
have returned.
"Marco Colombo" <pgsql@esiway.net> wrote in message
news:Pine.LNX.4.61.0503231028120.26346@Megathlon.ESI...
On Wed, 23 Mar 2005, Sim Zacks wrote:
While I would agree with you that from a purely technical standpoint,
the
user inserted into the database a CRLF and a query with just an LF does
not
exactly match that, from a users and more practical perspective, that
does
not make sense at all. That is why I surrounded the word bug in ??.
I would say that from a users perspective it qualifies as a bug because
they
did not put in specific binary characters. They want a newline. From a
database standards perspective, I would argue that any database that
allows
connections from a client without qualifying a required operating system
should be OS neutral.I would say it is a bug from a users perspective because the exact same
query works differently from different clients. Since the user does not
choose what binary characters to put in, they are invisible to the user.
Anything that is completely invisible to the user should not be
considered
valid qualifying data.
As there is no postgresql database standard, such as "all newlines are
unix
newlines" it is impossible to write a client that will necessarily
return
the data that you want.
This is the exact problem we are having with Python right now, as a
Windows
Show quoted text
client cannot write a python function to be run on a linux server.
Unfortunately, it's not that simple. There are problems with python
when _both_ the client and the server are Windows. Python itself
_always_ uses \n even on Windows. So the only solution is to
"pythonize" the input (convert to \n), no matter what.For the more general problem of handling text, see my comments in
this thread:
http://archives.postgresql.org/pgsql-general/2005-01/msg00792.phpThere are interesting problems with multiline text, as a datatype.
Think of digital signatures and checksums. Think of a simple function:
len(text)
should it count line separators as characters? In theory, the only
way to get cross-platform consistent behaviour, is to _ignore_ line
separators when counting or checksumming. But the real world solution
is to treat textfiles as binary and let the users or the application
handle the conversion..TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo@ESI.it---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?
On Wed, 23 Mar 2005, Sim Zacks wrote:
In any case, there are 2 correct solutions to the problem for the case of
postgresql.
1) Database standard - postgresql chooses a newline standard and every
client must support that if they support postgresql. Either put the onus on
the client developers to override the OS standard and support the postgresql
standard, or have the db convert the incoming newlines into the db standard.2) Server's OS - the server must convert any new lines coming in to the
standard that it expects. This is similar to an ftp client that converts
newlines on the transfer. That means that data sent to a Windows server with
an LF will be converted to a CRLF and vice versa.The data restore function will also have to follow the above procedures to
make sure you can take data from one server to the other without
compromising integrity.Without one of these solutions, PostGreSQL is not compatible between servers
and clients.
A query written on any client should return the same result. The query being
the visible appearance on the screen. That is what the users would expect to
have returned.
Yeah, those were my points.
The _open_ problems are:
- what about storing a signed document? it's possible that newline
conversion makes the signature invalid. How would you restore the original
document? Before you answer think of:
a) a client running on a platform different from the one that inserted
the document;
b) a document with _mixed_ newline types, such as a windows text with
enmbedded bare \n or \r. [*]
- what about any other function that may be affected by newline style?
I mean, the user may insert a text that he knows it's 1000 chars long,
and finds that PG thinks it's only 980. Is this "consistent"? What if
the user selects for the messages longer than 990? What is the expected
answer, from the user standpoint?
There's no easy solution I think.
[*] This is _way_ more common than you'd think. RFC2822, internet message
format, says lines are CRFL separated. It happens sometimes that a message
contains a NL or a CR alone.
.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo@ESI.it
- what about storing a signed document? it's possible that newline
conversion makes the signature invalid. How would you restore the
original
document? Before you answer think of:
a) a client running on a platform different from the one that inserted
the document;
b) a document with _mixed_ newline types, such as a windows text with
enmbedded bare \n or \r. [*]
I think there can be a difference between ASCII (text) data and binary data.
Think of the example of FTP again, which handles this very nicely. Binary
data should not have new lines converted. There is a well accepted
difference between binary and ascii. Text can be defined as being completely
readable by the human eye. If there happens to be binary data embedded in
the text, that is not a part of the text but rather supports the text. Any
data that can not have the binary data modified becuase it will render the
data unusable is not really text.
I am not familiar with signed documents, but I would question how they are
currently handled in a cross OS environment. You may as well ask the same
question about how you would handle a jpeg image. Obviously you will not
want to look for LF and replace it with CRLF because in that data LF does
not mean skip a line.
what about any other function that may be affected by newline style?
I mean, the user may insert a text that he knows it's 1000 chars long,
and finds that PG thinks it's only 980. Is this "consistent"?
If a user inserts a string that he thinks is 1000 charcters long and across
PG implementations and documentation it is considered to be 980 charcters
long, then that is consistent. If he enters a string that he thinks is 1000
characters long and sometimes PG thinks that it is 980, sometimes 1000 and
sometimes 1050 that is not consistent.
"Sim Zacks" <sim@compulab.co.il> writes:
A query written on any client should return the same result. The query being
the visible appearance on the screen.
This is presupposing the answer to the question at hand. I do not agree
with the above premise; it would seem to imply, for example, expanding
tabs to spaces so that "where foo = '<tab>'" yields the same result as
"where foo = ' '" for some appropriate number of spaces.
regards, tom lane
The difference between a Tab and a newline is that tab is a universally
recognized single ascii character while newline is in flux. Aside from this,
a tab is a quasi-viewable character as the cursor will not go to the middle
of the tab. Meaning if the tab takes up the space of 10 characters, you
could not scroll to the place where the 5th character would be if it were in
fact 10 spaces. You cannot highlight half of a tab in editors that allow
text highlighting. I would therefore say that a tab is as visible as a space
and can be easily differentiated. On the other hand, it is impossible to
determine which binary charcters the editor stuck in at the end of a newline
without looking at the binary/hex code.
I understand the complexity of dealing with multiple operating systems, but
seriously, how many non-viewable characters can be embedded in text that
actually make a difference between operating systems? Are there any besides
newline?
Sim
"Tom Lane" <tgl@sss.pgh.pa.us> wrote in message
news:8379.1111600299@sss.pgh.pa.us...
"Sim Zacks" <sim@compulab.co.il> writes:
A query written on any client should return the same result. The query
being
Show quoted text
the visible appearance on the screen.
This is presupposing the answer to the question at hand. I do not agree
with the above premise; it would seem to imply, for example, expanding
tabs to spaces so that "where foo = '<tab>'" yields the same result as
"where foo = ' '" for some appropriate number of spaces.regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
On Thu, Mar 24, 2005 at 09:24:11AM +0200, Sim Zacks wrote:
The difference between a Tab and a newline is that tab is a universally
recognized single ascii character while newline is in flux. Aside from this,
a tab is a quasi-viewable character as the cursor will not go to the middle
of the tab. Meaning if the tab takes up the space of 10 characters, you
could not scroll to the place where the 5th character would be if it were in
fact 10 spaces. You cannot highlight half of a tab in editors that allow
text highlighting. I would therefore say that a tab is as visible as a space
and can be easily differentiated. On the other hand, it is impossible to
determine which binary charcters the editor stuck in at the end of a newline
without looking at the binary/hex code.
Actually, Emacs has a space-through-tab mode in which you can just move
through a tab as if it were spaces. If you delete or insert it
automatically reforms the tabs and spaces around it. Several editors
have an auto-lineending mode in which they'll detect the end of line
character and apply that everywhere. Admittedly this is an extreme case
but editors regularly show things that don't reflect the underlying
file.
I understand the complexity of dealing with multiple operating systems, but
seriously, how many non-viewable characters can be embedded in text that
actually make a difference between operating systems? Are there any besides
newline?
Sure, the character 0xE9 means different things depending on the
encoding and will sort differently based on the locale. Text files
generally don't indicate what encoding they are, leading to all sorts
of confusion. Unix tends to use Latin1 or UTF8. Windows has it's own
encoding.
IMHO, if you're trying to write portably, don't just hit enter when you
want an end of line, use \n or \r to indicate *exactly* what you mean.
Using a variable behaviour and expecting the server to fix it for you
is wrong. I beleive the server should take exactly what the client
gives as the client is the only one who knows for sure the type.
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
Show quoted text
Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
tool for doing 5% of the work and then sitting around waiting for someone
else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes:
IMHO, if you're trying to write portably, don't just hit enter when you
want an end of line, use \n or \r to indicate *exactly* what you mean.
Indeed. We are already permissive about line endings in SQL text, so
this discussion really boils down only to whether we should reinterpret
data that's inside a string literal. There are good style reasons why
you should never put an unescaped newline into a string literal in the
first place ... and if you do, I don't think it's the database's job to
second-guess what you meant.
regards, tom lane