non-ASCII characters in SGML documentation (and elsewhere)

Started by Peter Eisentraut, almost 15 years ago, 10 messages, docs
#1Peter Eisentraut
peter_e@gmx.net

There are a few literal non-ASCII characters in the SGML documentation,
namely in

isn.sgml
release-7.4.sgml
release-8.4.sgml

Also, there are some encoded (&foo;) non-ASCII characters in

release-8.0.sgml
release-8.1.sgml
release-8.2.sgml
unaccent.sgml

These all work fine, because they are all LATIN1, and DocBook SGML uses
LATIN1.

But I notice that the contributor names in the 9.1 release notes have
been carefully ASCII-fied, presumably from the Git UTF-8 commit
messages.

For additional amusement, when creating the HISTORY file, lynx recodes
the HTML into the encoding specified by your LC_CTYPE environment
setting.

Also, the following source files contain non-ASCII characters in
comments:

src/backend/port/dynloader/darwin.c (LATIN1)
src/backend/storage/lmgr/predicate.c (UTF8)
src/backend/storage/lmgr/README-SSI (UTF8)

The last two are new in 9.1.
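
A survey like the one above can be approximated with a short script. The following is only a sketch, not the method used to produce the lists in this message: the function names are made up for illustration, and the "LATIN1?" label is a guess, since any byte sequence is valid LATIN1.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ASCII', 'UTF8', or 'LATIN1?' for the given file contents."""
    try:
        data.decode("ascii")
        return "ASCII"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF8"
    except UnicodeDecodeError:
        # Every byte sequence decodes as LATIN1, so this is only a guess
        # meaning "has high-bit bytes but is not valid UTF-8".
        return "LATIN1?"


def non_ascii_lines(data: bytes):
    """Yield (line_number, raw_line) for each line with a byte >= 0x80."""
    for number, line in enumerate(data.splitlines(), start=1):
        if any(byte >= 0x80 for byte in line):
            yield number, line


def report(path: str) -> None:
    """Print the guessed encoding and offending line numbers for one file."""
    with open(path, "rb") as handle:
        content = handle.read()
    kind = classify_encoding(content)
    if kind != "ASCII":
        hits = [number for number, _ in non_ascii_lines(content)]
        print(f"{path}: {kind}, lines {hits}")
```

Calling report() on each file in the tree would list only the files that contain non-ASCII bytes, together with the line numbers to look at.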

So, some questions:

* Should we consistently use entities for encoding non-ASCII
characters in SGML? Or use LATIN1 freely?
* Should we allow/use non-ASCII characters in the release notes?
* What encoding should the HISTORY file have?
* Should we allow non-ASCII characters in general source files?
* If so, what should the encoding be?

#2Susanne Ebrecht
susanne@2ndQuadrant.com
In reply to: Peter Eisentraut (#1)
Re: non-ASCII characters in SGML documentation (and elsewhere)

Hello Peter,

On 19.05.2011 23:49, Peter Eisentraut wrote:

So, some questions:

* Should we consistently use entities for encoding non-ASCII
characters in SGML? Or use LATIN1 freely?
* Should we allow/use non-ASCII characters in the release notes?
* What encoding should the HISTORY file have?
* Should we allow non-ASCII characters in general source files?
* If so, what should the encoding be?

one more argument for switching to XML? :)

I guess we will get some more non-ASCII signs in documentation.
How do you want to document the collation stuff?
Collations are for all that isn't ASCII.
Our docs usually have small examples.
I can imagine that you want to place German or Russian letters or whatever
else as examples into doc.

Do you have another idea than using utf8?
What do you expect would not fit into utf8?
I would expect words like déjà vu, meaning words that English just copied
from French and that still use the French accents.
Or even personal names with e.g. umlauts, accents, and other special
signs from other languages.

Also consider - usually editors (vi, emacs) use utf8 today.

Btw.
For German docs I use utf8.
The HTML output works well with both 'ö' and '&ouml;'.
I have not yet tested other outputs.

I just changed to utf8 in the stylesheets and use export SP_ENCODING=XML
before compiling.

Unfortunately index sorting works with neither 'ö' nor '&ouml;' yet.
We are still fighting with it and trying to figure out how we can force
it to sort correctly.
Just changing the makefile didn't help.

But in the English docs, I doubt that you have to deal with index entries
for words using non-ASCII characters.

So very small, low-effort changes might already help.

Susanne

--
Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#1)
Re: non-ASCII characters in SGML documentation (and elsewhere)

Peter Eisentraut <peter_e@gmx.net> writes:

* Should we consistently use entities for encoding non-ASCII
characters in SGML? Or use LATIN1 freely?

I think we previously discussed this and agreed that all non-ASCII in
the SGML docs should be written as entities. The existence of
violations of that rule is just, well, a violation that ought to be
fixed.
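
As an illustration of what "written as entities" would mean for a literal character, here is a minimal sketch. It assumes that the HTML 4 entity names in Python's html.entities module match the ISO Latin 1 entity names available to the SGML docs for the characters in question (this holds for names such as eacute and ouml, but is not checked against the DocBook DTD), and entityify is a made-up helper name, not an existing tool.

```python
from html.entities import codepoint2name


def entityify(text: str) -> str:
    """Replace each non-ASCII character with a named or numeric reference."""
    pieces = []
    for char in text:
        codepoint = ord(char)
        if codepoint < 128:
            # Plain ASCII passes through untouched.
            pieces.append(char)
        elif codepoint in codepoint2name:
            # Named entity, e.g. U+00E9 becomes &eacute;
            pieces.append(f"&{codepoint2name[codepoint]};")
        else:
            # Fallback: decimal numeric character reference. Whether the
            # SGML toolchain accepts this for a given character is an
            # assumption, not something verified here.
            pieces.append(f"&#{codepoint};")
    return "".join(pieces)
```

For example, entityify("déjà vu") gives "d&eacute;j&agrave; vu".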

* Should we allow/use non-ASCII characters in the release notes?
* What encoding should the HISTORY file have?

Ideally "sure, if entity-ified", but I don't know what to do about
HISTORY.

* Should we allow non-ASCII characters in general source files?

Prefer "no" here.

regards, tom lane

#4Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#3)
Re: non-ASCII characters in SGML documentation (and elsewhere)

Excerpts from Tom Lane's message of Fri May 20 07:56:58 -0400 2011:

Peter Eisentraut <peter_e@gmx.net> writes:

* Should we consistently use entities for encoding non-ASCII
characters in SGML? Or use LATIN1 freely?

I think we previously discussed this and agreed that all non-ASCII in
the SGML docs should be written as entities. The existence of
violations of that rule is just, well, a violation that ought to be
fixed.

+1

* Should we allow/use non-ASCII characters in the release notes?
* What encoding should the HISTORY file have?

Ideally "sure, if entity-ified", but I don't know what to do about
HISTORY.

Can we recode that to plain ascii? I think iconv has a //TRANSLIT flag
or something like that.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#5Susanne Ebrecht
susanne@2ndQuadrant.com
In reply to: Tom Lane (#3)
Re: non-ASCII characters in SGML documentation (and elsewhere)

On 20.05.2011 13:56, Tom Lane wrote:

* Should we allow non-ASCII characters in general source files?

Prefer "no" here.

I only see two reasons for non-ASCII signs in English.
Either it is a foreign name of e.g. a person
or it is a word that English took from French like in déjà vu.
For the second I am sure you will find synonyms that are ASCII only.

The only other reason that I can see for non-ASCII signs in our docs is
for demonstrating collations.

Susanne

--
Susanne Ebrecht - 2ndQuadrant
PostgreSQL Development, 24x7 Support, Training and Services
www.2ndQuadrant.com

#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Susanne Ebrecht (#5)
Re: non-ASCII characters in SGML documentation (and elsewhere)

Excerpts from Susanne Ebrecht's message of Fri May 20 09:04:26 -0400 2011:

On 20.05.2011 13:56, Tom Lane wrote:

* Should we allow non-ASCII characters in general source files?

Prefer "no" here.

I only see two reasons for non-ASCII signs in English.
Either it is a foreign name of e.g. a person
or it is a word that English took from French like in déjà vu.

I'd like my name accented in the release notes, thanks.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#3)
Re: non-ASCII characters in SGML documentation (and elsewhere)

On Fri, 2011-05-20 at 07:56 -0400, Tom Lane wrote:

* Should we allow non-ASCII characters in general source files?

Prefer "no" here.

Going through this I felt a little bad butchering up people's names that
hadn't bothered anyone before now. So as a compromise, I made
contributor names UTF-8 consistently, but removed other uses of
non-ASCII characters.

#8Peter Eisentraut
peter_e@gmx.net
In reply to: Alvaro Herrera (#4)
Re: non-ASCII characters in SGML documentation (and elsewhere)

On Fri, 2011-05-20 at 08:16 -0400, Alvaro Herrera wrote:

* Should we allow/use non-ASCII characters in the release notes?
* What encoding should the HISTORY file have?

Ideally "sure, if entity-ified", but I don't know what to do about
HISTORY.

Can we recode that to plain ascii? I think iconv has a //TRANSLIT
flag or something like that.

To make this work on FreeBSD, where we build the releases, we need to
use the following command:

"/usr/bin/perl" -p -e 's/<H(1|2)$/<H\1 align=center/g' HISTORY.html | LC_ALL=en_US.ISO8859-1 lynx -force_html -dump -nolist -stdin | iconv -f latin1 -t us-ascii//TRANSLIT > HISTORY

This also works on Linux/glibc, but FreeBSD is a bit stricter/more
limited. Not sure about other platforms, but I'd guess if they don't
have the required locales, they'd be no worse off than now anyway.

The results are reasonable. It actually depends on the platform
what //TRANSLIT does, e.g. on FreeBSD ö -> "o, on Linux ö -> o.
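
For readers without iconv at hand, the kind of result //TRANSLIT gives on Linux (ö -> o) can be imitated roughly as follows. This is a stand-in technique based on Unicode decomposition, not what iconv does internally, and to_ascii is a hypothetical name used only for this sketch.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Approximate an ASCII transliteration by dropping combining marks."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the characters that are not combining marks.
    stripped = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    # Anything still outside ASCII (a character with no decomposition,
    # such as U+00DF) becomes '?'.
    return stripped.encode("ascii", "replace").decode("ascii")
```

With this approach an accented name comes out with the bare base letters, which is also why the output differs from the FreeBSD style ("o) mentioned above.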

#9Bruce Momjian
bruce@momjian.us
In reply to: Alvaro Herrera (#6)
Re: non-ASCII characters in SGML documentation (and elsewhere)

Alvaro Herrera wrote:

Excerpts from Susanne Ebrecht's message of Fri May 20 09:04:26 -0400 2011:

On 20.05.2011 13:56, Tom Lane wrote:

* Should we allow non-ASCII characters in general source files?

Prefer "no" here.

I only see two reasons for non-ASCII signs in English.
Either it is a foreign name of e.g. a person
or it is a word that English took from French like in déjà vu.

I'd like my name accented in the release notes, thanks.

Sure, you want the first "A" in Alvaro with an accent. I would love to
backpatch that but it would be a royal pain. I am afraid it can only
easily be done in future release notes.

I have added the proper markup to our release note checklist; patch
attached. Does anyone else want special handling for their name?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Attachments:

/rtmp/accent (text/x-diff, +2 -0)

#10Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#9)
Re: non-ASCII characters in SGML documentation (and elsewhere)

Excerpts from Bruce Momjian's message of Wed Oct 12 18:21:19 -0300 2011:

Alvaro Herrera wrote:

Excerpts from Susanne Ebrecht's message of Fri May 20 09:04:26 -0400 2011:

On 20.05.2011 13:56, Tom Lane wrote:

* Should we allow non-ASCII characters in general source files?

Prefer "no" here.

I only see two reasons for non-ASCII signs in English.
Either it is a foreign name of e.g. a person
or it is a word that English took from French like in déjà vu.

I'd like my name accented in the release notes, thanks.

Sure, you want the first "A" in Alvaro with an accent. I would love to
backpatch that but it would be a royal pain. I am afraid it can only
easily be done in future release notes.

Many thanks, Bruce.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support