client side syntax error localisation for psql (v1)
Dear patchers,
Please find attached my first attempt at providing localisation
information on the client side for syntax errors in "psql". Basically, one
function is added which takes care of the burden. I put a lot of
comments in the function to try to make it clear what is going on.
It validates for me against current CVS head.
I've also added new regression tests in "errors.sql".
Although I have taken great and painful care to deal with multi-byte
encodings, I'm not sure how I can validate that, as I don't have a
Japanese terminal to play with.
Have a nice day,
--
Fabien Coelho - coelho@cri.ensmp.fr
Attachments:
client_side_syntax_error_localisation.patch (text/plain; charset=US-ASCII), +306 -6
Dear patchers,
Sorry, wrong list:-(
--
Fabien.
Fabien COELHO <coelho@cri.ensmp.fr> writes:
Please find attached my first attempt at providing localisation
information on the client side for syntax errors in "psql".
"Localisation" tends to mean something quite different around here.
I'd suggest renaming the function to something like ReportSyntaxErrorPosition.
Would it read better to add "..." at the front and/or back of the query
extract, when you have actually truncated text at that end?
The "on line N" bit seems just noise to me.
regards, tom lane
Dear Tom,
Please find attached my first attempt at providing localisation
information on the client side for syntax errors in "psql".
"Localisation" tends to mean something quite different around here. I'd
suggest renaming the function to something like ReportSyntaxErrorPosition.
Ok. "Localisation" was the right French word ;-)
Would it read better to add "..." at the front and/or back of the query
extract, when you have actually truncated text at that end?
I thought about it as well, but did not implement it.
I agree that it would look better. I can have a go.
The "on line N" bit seems just noise to me.
It depends.
If you're in the middle of a sql-script as I often do, giving the line
within the statement isn't bad. Say:
SELECT o.titre
FROM collections AS col, ouvrages AS o, a_ecrit AS e, auteurs AS a
WHERE o.id=e.oeuvre AND e.auteur=a.id AND o.collection=col.id
AND col.nom LIKE '%Spirou%' AND a.nom<>'Franquin'
EXCEPT
SELECT o.titre
FROM collections AS col, ouvrages AS o, a_ecrit AS e, auteurs AS a
WHERE o.id=e.oeuvre AND e.auteur=a.id AND o.collection=col.id
AND col.nom LIKE '%Spirou%' AND a.nom='Franquin';
All lines look the same... So I would prefer to keep it.
Well, if it is the only issue against the adoption of the patch, I will do
without the line number.
I'll submit a new patch in a moment.
Thanks for your comments,
--
Fabien.
Fabien COELHO <coelho@cri.ensmp.fr> writes:
The "on line N" bit seems just noise to me.
It depends.
I can see that it would be useful in a very large query. Perhaps
include it only when the query has more than N lines, for some N
like three to five or so?
Another possibility is to keep the cursor as just "^", and bury the
line info in the query extract. For instance:
Short single-line query, no truncation needed:
QUERY: SELECT * FROM foo WHRE bar;
QUERY:                   ^
Truncation on the right just adds ..., no other change needed:
QUERY: SELECT * FROM foo WHRE lots-of-conditions...
QUERY:                   ^
When you truncate on the left, count the number of newlines removed,
and emit "LINE n" if there were any:
QUERY: LINE 4: FROM foo WHRE lots-of-conditions...
QUERY:                  ^
or
QUERY: LINE 4: ...FROM foo WHRE lots-of-conditions...
QUERY:                     ^
(So "LINE n" would never say "LINE 1", only larger line numbers.)
Here I'm imagining that the leading ... gets inserted when text has been
removed after the start of the indicated line. So another possible
output format is
QUERY: ...FROM foo WHRE lots-of-conditions...
QUERY:             ^
if you removed some text but not any newlines from the start of the
query.
I think this wouldn't be terribly complex to implement, and it would
make things look fairly nice for both short and long queries.
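In code, the extract-plus-caret scheme described above might look roughly like this. This is only a hypothetical ASCII-only sketch (one byte per terminal column, sidestepping the multi-byte questions raised later in the thread); the names report_error_position and EXTRACT_WIDTH are invented for illustration and are not from the actual patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define EXTRACT_WIDTH 40        /* maximum bytes of query text to show */

/* Build a two-line report for the error at byte offset errpos:
 * line1 is an extract of the query with "..." where text was cut
 * and a "LINE n: " prefix when left truncation removed newlines;
 * line2 is the caret line.  ASCII-only: one byte == one column. */
static void
report_error_position(const char *query, int errpos, char *line1, char *line2)
{
    int         len = (int) strlen(query);
    int         start = errpos - EXTRACT_WIDTH / 2;
    int         end;
    int         lineno = 1;
    char        prefix[32] = "";
    int         caret;

    if (start < 0)
        start = 0;
    end = start + EXTRACT_WIDTH;
    if (end > len)
        end = len;

    /* count the newlines removed on the left to get the line number */
    for (int i = 0; i < start; i++)
        if (query[i] == '\n')
            lineno++;
    if (lineno > 1)
        sprintf(prefix, "LINE %d: ", lineno);

    sprintf(line1, "%s%s%.*s%s",
            prefix,
            start > 0 ? "..." : "",
            end - start, query + start,
            end < len ? "..." : "");

    /* pad the caret under the offending byte */
    caret = (int) strlen(prefix) + (start > 0 ? 3 : 0) + (errpos - start);
    sprintf(line2, "%*s^", caret, "");
}
```

For a short one-line query this reproduces the first example above: the extract is printed as-is and the caret lands under the offending word; "LINE n" only appears once left truncation has removed at least one newline.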
One last point: it seems a bit odd to use QUERY: as the prefix for both
the query extract and the cursor line. I don't have a suggestion what
to use instead, offhand.
regards, tom lane
Dear Tom,
The "on line N" bit seems just noise to me.
It depends.
I can see that it would be useful in a very large query. Perhaps
include it only when the query has more than N lines, for some N
like three to five or so?
Yes, I can do that.
Another possibility is to keep the cursor as just "^", and bury the
line info in the query extract. For instance:
Well, I want to preserve space for the query extract, that's where the
context information is! If I put more information there, I'll have to
reduce the extract length.
Moreover the cursor line information content is small, so it seems better
to put the line information there.
(So "LINE n" would never say "LINE 1", only larger line numbers.)
I agree that LINE 1 looks pretty useless, especially if there is only
one line;-)
I think this wouldn't be terribly complex to implement, and it would
make things look fairly nice for both short and long queries.
There is also an alignment issue here, as depending on the number of
digits, the cursor line would have to be adjusted accordingly.
One last point: it seems a bit odd to use QUERY: as the prefix for both
the query extract and the cursor line. I don't have a suggestion what
to use instead, offhand.
I agree, but couldn't think of a better choice either.
There is also a localisation issue here, as the translation of both lines
must match so that the alignment is kept. I thought that if it is the very
same word, the translation should be the same.
Thanks for your comments and ideas, I'll submit a v3 on (my) tomorrow.
Have a nice day,
--
Fabien Coelho - coelho@cri.ensmp.fr
Fabien COELHO <coelho@cri.ensmp.fr> writes:
Another possibility is to keep the cursor as just "^", and bury the
line info in the query extract. For instance:
Well, I want to preserve space for the query extract, that's where the
context information is! If I put more information there, I'll have to
reduce the extract length.
But in the form you are using, you have to reserve that space anyway.
Consider an error marker at the end of the line:
QUERY: select 'text that is long enough to be a problem':
QUERY:                                                  ^ on line 1
Having to leave space for "on line 1" at the right end is not better
than putting "line 1:" at the left end.
Moreover the cursor line information content is small, so it seems better
to put the line information there.
It seems confusing to me. The cursor is just a marker. Consider also
the next step up from this scheme: instead of assuming a dumb terminal,
suppose you can use color or underlining or reverse-video or some such.
Then you might want to do just one line of output:
QUERY: SELECT foo FROM bar WHRE baz;
                           ----
where my underlining represents using the terminal's highlight ability.
With such a scheme, putting LINE n on the same line fits naturally.
There is also an alignment issue here, as depending on the number of
digits, the cursor line would have to be adjusted accordingly.
Sure, but you have to buy into that anyway if you are going to do "...".
It's not going to add more than about one strlen() call to what you need
to do.
There is also a localisation issue here, as the translation of both lines
must match so that the alignment is kept. I thought that if it is the very
same word, the translation should be the same.
Hmm, that's a good point, although perhaps you could insert a note to
translators to add space as needed to make the two strings the same
length.
regards, tom lane
Fabien COELHO wrote:
There is also a localisation issue here, as the translation of both
lines must match so that the alignment is kept. I thought that if it
is the very same word, the translation should be the same.
You can just indent with as many spaces as needed. This is done in other
places as well.
Btw., a better word than "query" would be "statement".
Please find attached my first attempt at providing localisation
information on the client side for syntax errors in "psql". Basically, one
function is added which takes care of the burden. I put a lot of
comments in the function to try to make it clear what is going on.
It validates for me against current CVS head.
I've also added new regression tests in "errors.sql".
Although I have taken great and painful care to deal with multi-byte
encodings, I'm not sure how I can validate that, as I don't have a
Japanese terminal to play with.
As far as I understand your code, it will be broken on many multi-byte
encodings.
1) a character is not always represented on a terminal proportional to
the storage size. For example, a kanji character in UTF-8 encoding
has a storage size of 3 bytes while it occupies only twice the space
of an ASCII character on a terminal. The same thing can perhaps be said
of LATIN 2, 3, etc. in UTF-8.
2) It assumes all encodings are "ASCII compatible". Apparently some
client-side-only encodings do not satisfy this requirement. Examples
include SJIS and Big5.
I think 2) could be solved by careful coding.
However, 1) is hard to solve. We need a kind of mapping table between
"storage size" and "terminal size" for each encoding. Note that this
is not a new problem; sophisticated editors such as Emacs (Mule) have
already conquered it.
--
Tatsuo Ishii
Dear Tom,
Well, I want to preserve space for the query extract, that's where the
context information is! If I put more information there, I'll have to
reduce the extract length.
But in the form you are using, you have to reserve that space anyway.
Consider an error marker at the end of the line:
QUERY: select 'text that is long enough to be a problem':
QUERY:                                                  ^ on line 1
Having to leave space for "on line 1" at the right end is not better
than putting "line 1:" at the left end.
Yes, I partly agree. However the space above can be used nevertheless for
the extract, so it is not totally "lost":
QUERY: select 'text that is long enough to be a problem': from foo ...
QUERY:                                                  ^ on line 1
Moreover the cursor line information content is small, so it seems better
to put the line information there.
It seems confusing to me. The cursor is just a marker.
Sure.
Consider also the next step up from this scheme: instead of assuming a
dumb terminal, suppose you can use color or underlining or reverse-video
or some such. Then you might want to do just one line of output:
QUERY: SELECT foo FROM bar WHRE baz;
                           ----
where my underlining represents using the terminal's highlight ability.
With such a scheme, putting LINE n on the same line fits naturally.
Sure, but the current status of psql is to have no such thing, and
I'm not going to start terminal-dependent features in the software;-)
I just want to help my students with syntax errors.
As a compromise, I can suggest the following:
LINE 4: WHERE id=123 AND name LIKE 'calvin' GROP BY name...
                                            ^
So "QUERY:" is dropped out. If there are only a few lines, I can just
put "LINE:". I'll have to deal with the alignment problem, no big deal,
but not very nice either.
There is also an alignment issue here, as depending on the number of
digits, the cursor line would have to be adjusted accordingly.
Sure, but you have to buy into that anyway if you are going to do "...".
Well, I know that "..." is 3 chars long, so I can put a " " in the
cursor line in that case. If it is a number, the length depends on its
value.
It's not going to add more than about one strlen() call to what you need
to do.
I cannot strlen an integer value;-) I have to convert to a string, or
deal directly with the line number.
With my new current implementation, the line number is only shown if the
query is more than 3 lines.
--
Fabien Coelho - coelho@cri.ensmp.fr
Dear Tatsuo,
Thanks for your reply, as I noticed from the source code that your name
often appears when dealing with multi-byte issues;-)
On Fri, 12 Mar 2004, Tatsuo Ishii wrote:
As far as I understand your code, it will be broken on many multi-byte
encodings.
1) a character is not always represented on a terminal proportional to
the storage size. For example, a kanji character in UTF-8 encoding
has a storage size of 3 bytes while it occupies only twice the space
of an ASCII character on a terminal. The same thing can perhaps be said
of LATIN 2, 3, etc. in UTF-8.
I thought I dealt with that in the code by calling PQmblen for every char.
Am I wrong ?
2) It assumes all encodings are "ASCII compatible". Apparently some
client-side-only encodings do not satisfy this requirement. Examples
include SJIS and Big5.
What I mean by "ASCII compatible" is that spaces, newlines, carriage
returns, tabs and NUL (C string termination) are one-byte characters.
This assumption seemed pretty safe to me.
If this is not the case, I cannot understand how any error message could
work in psql. If one printf(" "), that would not be a space character?
Or is the terminal doing some "on the fly" translation?? What if a
file is read with such encoding??? Or is there a special compilation
option to generate special strings, but in this case the executable
would not be compatible with any other terminal????
Well, I just underline my lack of knowledge here:-(
If not, how can I detect these special characters that I need to change ?
Maybe I could translate the string to a pg_wchar[] if the function is
available to psql?
Also, as a quick and dirty temporary fix, I can skip statement extraction
for those encodings that do not meet my expectations. So I would need to
know what encodings are at risk with the current scheme?
--
Fabien Coelho - coelho@cri.ensmp.fr
Fabien COELHO wrote:
There is also a localisation issue here, as the translation of both
lines must match so that the alignment is kept. I thought that if it
is the very same word, the translation should be the same.
You can just indent with as many spaces as needed. This is done in other
places as well.
Yes, but then I need to know the length of the translation...
I did that, so I hope sprintf did the translation job.
Thanks,
--
Fabien Coelho - coelho@cri.ensmp.fr
Dear Tatsuo,
Thanks for your reply, as I noticed from the source code that your name
often appears when dealing with multi-byte issues;-)
On Fri, 12 Mar 2004, Tatsuo Ishii wrote:
As far as I understand your code, it will be broken on many multi-byte
encodings.
1) a character is not always represented on a terminal proportional to
the storage size. For example, a kanji character in UTF-8 encoding
has a storage size of 3 bytes while it occupies only twice the space
of an ASCII character on a terminal. The same thing can perhaps be said
of LATIN 2, 3, etc. in UTF-8.
I thought I dealt with that in the code by calling PQmblen for every char.
Am I wrong ?
PQmblen returns the storage size, which is not necessarily the same as the
character width represented on a terminal. For example, for a kanji
character in UTF-8, PQmblen returns 3, but it occupies 2 x the ASCII
character space, not x 3. Isn't that a problem for you?
2) It assumes all encodings are "ASCII compatible". Apparently some
client-side-only encodings do not satisfy this requirement. Examples
include SJIS and Big5.
What I mean by "ASCII compatible" is that spaces, newlines, carriage
returns, tabs and NUL (C string termination) are one-byte characters.
This assumption seemed pretty safe to me.
If this is not the case, I cannot understand how any error message could
work in psql. If one printf(" "), that would not be a space character?
Or is the terminal doing some "on the fly" translation?? What if a
file is read with such encoding??? Or is there a special compilation
option to generate special strings, but in this case the executable
would not be compatible with any other terminal????
Well, I just underline my lack of knowledge here:-(
If not, how can I detect these special characters that I need to change ?
Maybe I could translate the string to a pg_wchar[] if the function is
available to psql?
I think you can do it safely using PQmblen.
1) start from the beginning of the target string
2) apply PQmblen
3) if it returns 1, you can do the special character detection
4) otherwise it must not be an ASCII character and you can skip as
many bytes as PQmblen returns
5) goto 1) if any characters remain
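The loop described in steps 1) to 5) can be sketched as follows. Since PQmblen lives in libpq and needs the client encoding, a hypothetical UTF-8-only utf8_mblen stands in for it here, and counting newlines serves as the example of "special character detection":

```c
#include <assert.h>

/* Stand-in for libpq's PQmblen(): byte length of the UTF-8 character
 * starting at p (the real function also takes the client encoding). */
static int
utf8_mblen(const unsigned char *p)
{
    if (*p < 0x80)
        return 1;
    if ((*p & 0xE0) == 0xC0)
        return 2;
    if ((*p & 0xF0) == 0xE0)
        return 3;
    return 4;
}

/* Count newlines following the recipe: test the byte only when the
 * character is a single byte, otherwise skip the whole character so
 * that trailing bytes of a multi-byte character are never misread. */
static int
count_newlines_mb(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;
    int         n = 0;

    while (*p)
    {
        int         len = utf8_mblen(p);

        if (len == 1 && *p == '\n')
            n++;
        p += len;
    }
    return n;
}
```

The point of the skip in step 4) is that in some encodings (SJIS, Big5) a trailing byte of a multi-byte character can collide with an ASCII code, so the byte must never be inspected in isolation.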
Also, as a quick and dirty temporary fix, I can skip statement extraction
for those encodings that do not meet my expectations. So I would need to
know what encodings are at risk with the current scheme?
--
Fabien Coelho - coelho@cri.ensmp.fr
PQmblen returns the storage size, which is not necessarily the same as the
character width represented on a terminal. For example, for a kanji
character in UTF-8, PQmblen returns 3, but it occupies 2 x the ASCII
character space, not x 3. Isn't that a problem for you?
2) It assumes all encodings are "ASCII compatible". Apparently some
client-side-only encodings do not satisfy this requirement. Examples
include SJIS and Big5.
What I mean by "ASCII compatible" is that spaces, newlines, carriage
returns, tabs and NUL (C string termination) are one-byte characters.
This assumption seemed pretty safe to me.
If this is not the case, I cannot understand how any error message could
work in psql. If one printf(" "), that would not be a space character?
Or is the terminal doing some "on the fly" translation?? What if a
file is read with such encoding??? Or is there a special compilation
option to generate special strings, but in this case the executable
would not be compatible with any other terminal????
Well, I just underline my lack of knowledge here:-(
If not, how can I detect these special characters that I need to change ?
Maybe I could translate the string to a pg_wchar[] if the function is
available to psql?
I think you can do it safely using PQmblen.
1) start from the beginning of the target string
2) apply PQmblen
3) if it returns 1, you can do the special character detection
4) otherwise it must not be an ASCII character and you can skip as
many bytes as PQmblen returns
5) goto 1) if any characters remain
~~of course this should be 2)
--
Tatsuo Ishii
Dear Tatsuo,
1) a character is not always represented on a terminal proportional to
the storage size. For example, a kanji character in UTF-8 encoding
has a storage size of 3 bytes while it occupies only twice the space
of an ASCII character on a terminal. The same thing can perhaps be said
of LATIN 2, 3, etc. in UTF-8.
I thought I dealt with that in the code by calling PQmblen for every char.
Am I wrong ?
PQmblen returns the storage size, which is not necessarily the same as the
character width represented on a terminal. For example, for a kanji
character in UTF-8, PQmblen returns 3, but it occupies 2 x the ASCII
character space, not x 3. Isn't that a problem for you?
If I read you correctly, you mean that 1 character may take 3 bytes
of storage in the string, but it is not guaranteed to be 1 character
from the terminal perspective... Argh, that's definitely an issue:-(
I assumed that one character, whatever the encoding, would be 1 character
wide on the display.
If it is not the case, I think I can put/compute this information in the
translation structures that are used by PQmblen, and implement a
PQmbtermlen function...
Maybe you could point me some source of information about display lengths
of characters depending on the encoding?
What I mean by "ASCII compatible" is that spaces, newlines, carriage
returns, tabs and NUL (C string termination) are one-byte characters.
This assumption seemed pretty safe to me.
I think you can do it safely using PQmblen.
Ok, what you describe is basically what I've done with the qidx
computation, as suggested by Tom Lane, and then later I check that the
encoded length is one to find my special characters.
Thanks for your reply,
--
Fabien Coelho - coelho@cri.ensmp.fr
PQmblen returns the storage size, which is not necessarily the same as the
character width represented on a terminal. For example, for a kanji
character in UTF-8, PQmblen returns 3, but it occupies 2 x the ASCII
character space, not x 3. Isn't that a problem for you?
If I read you correctly, you mean that 1 character may take 3 bytes
of storage in the string, but it is not guaranteed to be 1 character
from the terminal perspective... Argh, that's definitely an issue:-(
I assumed that one character, whatever the encoding, would be 1 character
wide on the display.
That's not correct...
One thing I have to note is that some Asian characters, such as
Japanese and Chinese, require twice the space on a terminal for each
character compared with plain ASCII characters. This is hard to
explain to those who are not familiar with kanji... Could you take a
look at the included screen shot? As you can see, there are four ASCII
characters on the first line. On the second line there are *two* kanji
characters, and they occupy the same space as the four ASCII
characters above. Moreover, the storage size for the first line is 4,
but the storage size for the second line may vary depending on the
encoding. If the encoding is EUC_JP or SJIS, it takes 4 bytes;
however, it takes 6 bytes if the encoding is UTF-8. Got it?
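The storage-size versus display-width distinction Tatsuo describes can be sketched for UTF-8 with a hand-rolled decoder and a deliberately crude width rule (a few CJK blocks count as two columns, everything else as one). The code-point ranges below are an approximation for illustration only; a real "PQmbtermlen" would need proper per-encoding tables.

```c
#include <assert.h>

/* Decode one UTF-8 character starting at s; store the code point in *cp
 * and return the number of bytes consumed (the "storage size"). */
static int
utf8_decode(const unsigned char *s, unsigned int *cp)
{
    if (s[0] < 0x80)
    {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)
    {
        *cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0)
    {
        *cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12) |
          ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Crude terminal width of one character: double-width for a few CJK
 * blocks, single-width otherwise.  Real tables are more complete. */
static int
char_width(unsigned int cp)
{
    if ((cp >= 0x1100 && cp <= 0x115F) ||   /* Hangul Jamo */
        (cp >= 0x2E80 && cp <= 0x9FFF) ||   /* CJK radicals .. ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
        (cp >= 0xF900 && cp <= 0xFAFF) ||   /* CJK compatibility */
        (cp >= 0xFF00 && cp <= 0xFF60))     /* fullwidth forms */
        return 2;
    return 1;
}

/* Terminal columns occupied by a UTF-8 string. */
static int
display_width(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;
    unsigned int cp;
    int         width = 0;

    while (*p)
    {
        p += utf8_decode(p, &cp);
        width += char_width(cp);
    }
    return width;
}
```

With this, a two-kanji string such as the one in the screen shot takes 6 bytes of storage in UTF-8 but reports a display width of 4 columns, matching the four-ASCII-character line above it.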
If it is not the case, I think I can put/compute this information in the
translation structures that are used by PQmblen, and implement a
PQmbtermlen function...
Maybe you could point me to some source of information about display lengths
of characters depending on the encoding?
I could write a "PQmbtermlen" function for every encoding supported by
PostgreSQL except UTF-8. That kind of info for UTF-8 might be quite
complex. I believe there are some mapping tables or functions to get
that kind of info somewhere on the Internet, but I don't remember.
I think you can do it safely using PQmblen.
Ok, what you describe is basically what I've done with the qidx
computation, as suggested by Tom Lane, and then later I check that the
encoded length is one to find my special characters.
Oh, ok.
Thanks for your reply,
You are welcome!
--
Tatsuo Ishii
Fabien COELHO <coelho@cri.ensmp.fr> writes:
As a compromise, I can suggest the following:
LINE 4: WHERE id=123 AND name LIKE 'calvin' GROP BY name...
                                            ^
That works for me. I don't mind it saying LINE 1: in the one-line case.
It's not going to add more than about one strlen() call to what you need
to do.
I cannot strlen an integer value;-) I have to convert to a string, or
deal directly with the line number.
Well, what I'm imagining is that you sprintf("LINE %d") and then strlen
that in the process of constructing the first line, and later add that
length to the offset you need in the second line.
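The "one extra strlen" Tom describes might look like this hypothetical helper (make_cursor_line is an invented name): format the prefix once with sprintf, then reuse its length as padding for the caret line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Pad the caret line so the "^" lands caret_offset columns into the
 * extract, after a "LINE n: " prefix of whatever width %d produced. */
static void
make_cursor_line(int lineno, int caret_offset, char *out)
{
    char        prefix[32];
    int         pad;

    sprintf(prefix, "LINE %d: ", lineno);
    pad = (int) strlen(prefix) + caret_offset;  /* the one extra strlen */
    sprintf(out, "%*s^", pad, "");
}
```

The alignment concern about the number of digits disappears: however wide "%d" renders, strlen of the formatted prefix gives exactly the right padding.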
Tatsuo's complaints about characters that span more than one position
are a much nastier problem, in any case :-(. Can we cope?
regards, tom lane
Dear Tatsuo,
One thing I have to note is that some Asian characters, such as Japanese
and Chinese, require twice the space on a terminal for each character
compared with plain ASCII characters. This is hard to explain to those
who are not familiar with kanji...
I learnt a little bit of Chinese a few years ago, but I never saw the
computer version.
Could you take a look at included screen shot?
I haven't found it. However, I've done a little bit of experimenting with my
emacs 21 mule demonstration, and indeed there are two different character
widths...
As you can see, there are four ASCII characters on the first line. On the
second line there are *two* kanji characters, and they occupy the same
space as the four ASCII characters above. Moreover, the storage size for
the first line is 4, but the storage size for the second line may vary
depending on the encoding. If the encoding is EUC_JP or SJIS, it takes 4
bytes; however, it takes 6 bytes if the encoding is UTF-8. Got it?
Yep.
Maybe you could point me to some source of information about display lengths
of characters depending on the encoding?
I could write a "PQmbtermlen" function for every encoding supported by
PostgreSQL
That would be great ! ;-)
except UTF-8. That kind of info for UTF-8 might be quite
complex. I believe there are some mapping tables or functions to get
that kind of info somewhere on the Internet, but I don't remember.
That would be even greater;-)
I guess a "return 1" version would be a quick stopgap fix that could
be improved later on...
--
Fabien COELHO.
Could you take a look at included screen shot?
I haven't found it. However, I've done a little bit of experimenting with my
Oops. I have included it this time.
Maybe you could point me to some source of information about display lengths
of characters depending on the encoding?
I could write a "PQmbtermlen" function for every encoding supported by
PostgreSQL
That would be great ! ;-)
Ok, I will work on this.
except UTF-8. That kind of info for UTF-8 might be quite
complex. I believe there are some mapping tables or functions to get
that kind of info somewhere on the Internet, but I don't remember.
That would be even greater ;-)
I guess a "return 1" version would be a quick stopgap fix that could
be improved later on...
Agreed.
Attachments: (screen shot)
On Sat, 13 Mar 2004, Tatsuo Ishii wrote:
Oops. I have included this time.
Oh! A Japanese vi!
I could write a "PQmbtermlen" function for every encoding supported by
PostgreSQL
That would be great ! ;-)
Ok, I will work on this.
Thanks.
--
Fabien Coelho - coelho@cri.ensmp.fr