Doc: typo in config.sgml
I think there's an unnecessary underscore in config.sgml.
Attached patch fixes it.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
Attachments:
fix_config.patch (text/x-patch; charset=iso-8859-1)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..08173ecb5c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9380,7 +9380,7 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
<para>
If <varname>transaction_timeout</varname> is shorter or equal to
<varname>idle_in_transaction_session_timeout</varname> or <varname>statement_timeout</varname>
- then the longer timeout is ignored.
+ then the longer timeout is ignored.
</para>
<para>
On Mon, 30 Sep 2024 15:34:04 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
Attached patch fixes it.
I could not apply the patch with an error.
error: patch failed: doc/src/sgml/config.sgml:9380
error: doc/src/sgml/config.sgml: patch does not apply
I found your patch contains an odd character (ASCII Code 240?)
by performing `od -c` command on the file. See the attached file.
Regards,
Yugo Nagata
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
I think there's an unnecessary underscore in config.sgml.
Attached patch fixes it.
I could not apply the patch with an error.
error: patch failed: doc/src/sgml/config.sgml:9380
error: doc/src/sgml/config.sgml: patch does not apply
Strange. I have no problem applying the patch here.
I found your patch contains an odd character (ASCII Code 240?)
by performing `od -c` command on the file. See the attached file.
Yes, 240 in octal (== 0xc2) is in the patch but it's because current
config.sgml includes the character. You can check it by looking at
line 9383 of config.sgml.
I think it was introduced by 28e858c0f95.
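For reference, a minimal way to confirm the byte sequence at that spot (run in doc/src/sgml; GNU sed and od assumed, and the line number is the one given above):

  sed -n '9383p' config.sgml | od -An -tx1

A non-breaking space shows up as the byte pair c2 a0, whereas an ordinary space is just 20.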
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Mon, 30 Sep 2024 17:23:24 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
Attached patch fixes it.
I could not apply the patch with an error.
error: patch failed: doc/src/sgml/config.sgml:9380
error: doc/src/sgml/config.sgml: patch does not apply
Strange. I have no problem applying the patch here.
I found your patch contains an odd character (ASCII Code 240?)
by performing `od -c` command on the file. See the attached file.
Yes, 240 in octal (== 0xc2) is in the patch but it's because current
config.sgml includes the character. You can check it by looking at
line 9383 of config.sgml.
Yes, you are right, I can find the 0xc2 char in config.sgml using od -c,
although I still could not apply the patch.
I think this is a non-breaking space (C2 A0) in UTF-8. I guess my
terminal normally regards this as a space, so applying the patch fails.
I also found one in line 85 of ref/drop_extension.sgml.
I think it was introduced by 28e858c0f95.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
--
Yugo NAGATA <nagata@sraoss.co.jp>
I think there's an unnecessary underscore in config.sgml.
I was wrong. The particular byte sequence just looked like an underscore
in my editor, but it is actually 0xc2a0, which must be a
"non-breaking space" encoded in UTF-8. I guess someone mistakenly
inserted a non-breaking space while editing config.sgml.
However, the mistake does not affect the patch.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Mon, 30 Sep 2024 18:03:44 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
I was wrong. The particular byte sequences just looked an underscore
on my editor but the byte sequence is actually 0xc2a0, which must be a
"non breaking space" encoded in UTF-8. I guess someone mistakenly
insert a non breaking space while editing config.sgml.
However the mistake does not affect the patch.
It looks like our emails have crossed.
Anyway, I agree with removing non-breaking spaces, including the
one found in line 85 of ref/drop_extension.sgml.
Regards,
Yugo Nagata
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
--
Yugo NAGATA <nagata@sraoss.co.jp>
On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
I was wrong. The particular byte sequences just looked an underscore
on my editor but the byte sequence is actually 0xc2a0, which must be a
"non breaking space" encoded in UTF-8. I guess someone mistakenly
insert a non breaking space while editing config.sgml.
I wonder if it would be worth adding a check for this like we have for tabs?
The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
(doing so made me realize we don't have an equivalent meson target).
--
Daniel Gustafsson
Attachments:
check_nbsp.diff (application/octet-stream)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 9c9bbfe375..f6d2c85226 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs
+check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -259,6 +259,9 @@ endif # sqlmansectnum != 7
check-tabs:
@( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || (echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
+check-nbsp:
+ @( ! grep -e "\xA0" $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || (echo "Non-breaking space appear in SGML/XML files" 1>&2; exit 1)
+
##
## Clean
##
On Mon, 30 Sep 2024 11:59:48 +0200
Daniel Gustafsson <daniel@yesql.se> wrote:
On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
I was wrong. The particular byte sequences just looked an underscore
on my editor but the byte sequence is actually 0xc2a0, which must be a
"non breaking space" encoded in UTF-8. I guess someone mistakenly
insert a non breaking space while editing config.sgml.
I wonder if it would be worth to add a check for this like we have to tabs?
The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
(doing so made me realize we don't have an equivalent meson target).
Your patch couldn't detect 0xA0 in config.sgml on my machine, but it works
when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`.
However, it also detects the following line in charset.sgml.
(https://www.postgresql.org/docs/current/collation.html)
For example, locale und-u-kb sorts 'àe' before 'aé'.
This is not a non-breaking space, so it should not be detected as an error.
Regards,
Yugo Nagata
--
Daniel Gustafsson
--
Yugo Nagata <nagata@sraoss.co.jp>
I wonder if it would be worth to add a check for this like we have to tabs?
+1.
The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
(doing so made me realize we don't have an equivalent meson target).
Your patch couldn't detect 0xA0 in config.sgml in my machine, but it works
when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`.
However, it also detects the following line in charset.sgml.
(https://www.postgresql.org/docs/current/collation.html)
For example, locale und-u-kb sorts 'àe' before 'aé'.
This is not non-breaking space, so should not be detected as an error.
That's because a non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes); 0xa0 is nbsp's code
point in Unicode, i.e. U+00A0.
So grep -P "[\xC2\xA0]" should work to detect nbsp.
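The byte sequence is easy to confirm with printf and od (octal 302 240 == hex c2 a0):

  printf '\302\240' | od -An -tx1
   c2 a0

As the follow-up below notes, the brackets are not actually needed: [\xC2\xA0] is a character class matching either single byte, so it can also hit other UTF-8 sequences that merely contain the 0xA0 byte.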
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Mon, 30 Sep 2024 20:07:31 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
I wonder if it would be worth to add a check for this like we have to tabs?
+1.
The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
(doing so made me realize we don't have an equivalent meson target).
Your patch couldn't detect 0xA0 in config.sgml in my machine, but it works
when I use `grep -P "[\xA0]"` instead of `grep -e "\xA0"`.
However, it also detects the following line in charset.sgml.
(https://www.postgresql.org/docs/current/collation.html)
For example, locale und-u-kb sorts 'àe' before 'aé'.
This is not non-breaking space, so should not be detected as an error.
That's because non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
point in Unicode. i.e. U+00A0).
So grep -P "[\xC2\xA0]" should work to detect nbsp.
`LC_ALL=C grep -P "\xC2\xA0"` works for my environment.
([ and ] were not necessary.)
When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
but I think it is better to specify both LC_ALL=C and "\xC2\xA0" to make sure
nbsp is detected.
One problem is that the -P option is available only in GNU grep; the grep on macOS doesn't support it.
On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
Maybe a better way is to use perl itself rather than grep, as follows:
`perl -ne '/\xC2\xA0/ and print' `
I attached a patch fixed in this way.
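A quick sanity check of the perl approach, using a throwaway file (the file name here is made up for the demo):

  printf 'foo\302\240bar\n' > /tmp/nbsp-demo.sgml
  perl -ne '/\xC2\xA0/ and print "$ARGV:$_"' /tmp/nbsp-demo.sgml
  /tmp/nbsp-demo.sgml:foo bar

Perl handles the \xHH escapes in the regex itself, so this behaves the same regardless of the locale or which grep/sed is installed.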
Regards,
Yugo Nagata
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
v2_check_nbsp.diff (text/x-diff)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 9c9bbfe375..2081ba1ffc 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs
+check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -259,6 +259,9 @@ endif # sqlmansectnum != 7
check-tabs:
@( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || (echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
+check-nbsp:
+ @( ! $(PERL) -ne '/\xC2\xA0/ and print "$$ARGV $$_"' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || (echo "Non-breaking space appear in SGML/XML files" 1>&2; exit 1)
+
##
## Clean
##
That's because non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
point in Unicode. i.e. U+00A0).
So grep -P "[\xC2\xA0]" should work to detect nbsp.
`LC_ALL=C grep -P "\xC2\xA0"` works for my environment.
([ and ] were not necessary.)
When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
nbsp.
One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
Maybe, better way is use perl itself rather than grep as following.
`perl -ne '/\xC2\xA0/ and print' `
I attached a patch fixed in this way.
GNU sed can also be used without setting LC_ALL:
sed -n /"\xC2\xA0"/p
However I am not sure if non-GNU sed can do this too...
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Mon, 30 Sep 2024 17:23:24 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
Attached patch fixes it.
I could not apply the patch with an error.
error: patch failed: doc/src/sgml/config.sgml:9380
error: doc/src/sgml/config.sgml: patch does not apply
Strange. I have no problem applying the patch here.
I found your patch contains an odd character (ASCII Code 240?)
by performing `od -c` command on the file. See the attached file.
Yes, 240 in octal (== 0xc2) is in the patch but it's because current
config.sgml includes the character. You can check it by looking at
line 9383 of config.sgml.
Yes, you are right, I can find the 0xc2 char in config.sgml using od -c,
although I still could not apply the patch.
I think this is non-breaking space of (C2A0) of utf-8. I guess my
terminal normally regards this as a space, so applying patch fails.
I found it also in line 85 of ref/drop_extension.sgml.
Thanks. I have pushed the fix for ref/drop_extension.sgml along with
config.sgml.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Tue, 01 Oct 2024 10:33:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
That's because non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
point in Unicode. i.e. U+00A0).
So grep -P "[\xC2\xA0]" should work to detect nbsp.
`LC_ALL=C grep -P "\xC2\xA0"` works for my environment.
([ and ] were not necessary.)
When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
nbsp.
One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
Maybe, better way is use perl itself rather than grep as following.
`perl -ne '/\xC2\xA0/ and print' `
I attached a patch fixed in this way.
GNU sed can also be used without setting LC_ALL:
sed -n /"\xC2\xA0"/p
However I am not sure if non-GNU sed can do this too...
Although I haven't checked it myself, BSD sed doesn't support the \x escape, according to
[1] https://stackoverflow.com/questions/24275070/sed-not-giving-me-correct-substitute-operation-for-newline-with-mac-difference
By the way, I've attached a patch modified a bit to use the plural form of the message,
the same as check-tabs:
Non-breaking **spaces** appear in SGML/XML files
Regards,
Yugo Nagata
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
v3_check_nbsp.diff (text/x-diff)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 9c9bbfe375..17feae9ed0 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs
+check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -259,6 +259,9 @@ endif # sqlmansectnum != 7
check-tabs:
@( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || (echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
+check-nbsp:
+ @( ! $(PERL) -ne '/\xC2\xA0/ and print "$$ARGV $$_"' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+
##
## Clean
##
On Tue, 1 Oct 2024 15:16:52 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:
On Tue, 01 Oct 2024 10:33:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
That's because non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
point in Unicode. i.e. U+00A0).
So grep -P "[\xC2\xA0]" should work to detect nbsp.
`LC_ALL=C grep -P "\xC2\xA0"` works for my environment.
([ and ] were not necessary.)
When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
nbsp.
One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
Maybe, better way is use perl itself rather than grep as following.
`perl -ne '/\xC2\xA0/ and print' `
I attached a patch fixed in this way.
GNU sed can also be used without setting LC_ALL:
sed -n /"\xC2\xA0"/p
However I am not sure if non-GNU sed can do this too...
Although I've not check it myself, BSD sed doesn't support \x escape according to [1].
By the way, I've attached a patch a bit modified to use the plural form statement
as same as check-tabs.
Non-breaking **spaces** appear in SGML/XML files
The previous patch was broken because the perl command failed to return the correct result.
I've attached an updated patch to fix the return value. In passing, I added line breaks
for long lines.
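For context, perl exits with status 0 whether or not the pattern matched, so the earlier "! ( perl ... )" construct could not distinguish a clean tree from one containing nbsp. Counting matches and exiting from an END block gives grep-like semantics; a quick check against a throwaway file (names made up for the demo):

  printf 'foo\302\240bar\n' > /tmp/nbsp-demo.sgml
  perl -ne '/\xC2\xA0/ and print("$ARGV:$_"),$n++; END {exit($n>0)}' /tmp/nbsp-demo.sgml; echo $?
  /tmp/nbsp-demo.sgml:foo bar
  1
  perl -ne '/\xC2\xA0/ and print("$ARGV:$_"),$n++; END {exit($n>0)}' /dev/null; echo $?
  0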
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
v4_check_nbsp.diff (text/x-diff)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 9c9bbfe375..e5607585af 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs
+check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -255,9 +255,15 @@ clean-man:
endif # sqlmansectnum != 7
-# tabs are harmless, but it is best to avoid them in SGML files
+# tabs and non-breaking spaces are harmless, but it is best to avoid them in SGML files
check-tabs:
- @( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || (echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
+ @( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
+ (echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
+
+check-nbsp:
+ @ ( $(PERL) -ne '/\xC2\xA0/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+ $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
+ (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
##
## Clean
On Tue, 1 Oct 2024 22:20:55 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Tue, 1 Oct 2024 15:16:52 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:
On Tue, 01 Oct 2024 10:33:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
That's because non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
point in Unicode. i.e. U+00A0).
So grep -P "[\xC2\xA0]" should work to detect nbsp.
`LC_ALL=C grep -P "\xC2\xA0"` works for my environment.
([ and ] were not necessary.)
When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
nbsp.
One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
Maybe, better way is use perl itself rather than grep as following.
`perl -ne '/\xC2\xA0/ and print' `
I attached a patch fixed in this way.
GNU sed can also be used without setting LC_ALL:
sed -n /"\xC2\xA0"/p
However I am not sure if non-GNU sed can do this too...
Although I've not check it myself, BSD sed doesn't support \x escape according to [1].
By the way, I've attached a patch a bit modified to use the plural form statement
as same as check-tabs.
Non-breaking **spaces** appear in SGML/XML files
The previous patch was broken because the perl command failed to return the correct result.
I've attached an updated patch to fix the return value. In passing, I added line breaks
for long lines.
I've attached an updated patch.
I added the comment to explain why Perl is used instead of grep or sed.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
v5_check_nbsp.diff (text/x-diff)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 9c9bbfe375..65ed32cd0a 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs
+check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -257,7 +257,15 @@ endif # sqlmansectnum != 7
# tabs are harmless, but it is best to avoid them in SGML files
check-tabs:
- @( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || (echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
+ @( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
+ (echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
+
+# Non-breaking spaces are harmless, but it is best to avoid them in SGML files.
+# Use perl command because non-GNU grep or sed could not have hex escape sequence.
+check-nbsp:
+ @ ( $(PERL) -ne '/\xC2\xA0/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+ $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
+ (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
##
## Clean
On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
I was wrong. The particular byte sequences just looked an underscore
on my editor but the byte sequence is actually 0xc2a0, which must be a
"non breaking space" encoded in UTF-8. I guess someone mistakenly
insert a non breaking space while editing config.sgml.
I wonder if it would be worth to add a check for this like we have to tabs?
The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
(doing so made me realize we don't have an equivalent meson target).
Can we check for any character outside the support range of SGML?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Tue, 1 Oct 2024 22:20:55 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
On Tue, 1 Oct 2024 15:16:52 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:
On Tue, 01 Oct 2024 10:33:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
That's because non-breaking space (nbsp) is not encoded as 0xa0 in
UTF-8. nbsp in UTF-8 is "0xc2 0xa0" (2 bytes) (A 0xa0 is a nbsp's code
point in Unicode. i.e. U+00A0).
So grep -P "[\xC2\xA0]" should work to detect nbsp.
`LC_ALL=C grep -P "\xC2\xA0"` works for my environment.
([ and ] were not necessary.)
When LC_ALL is null, `grep -P "\xA0"` could not detect any characters in charset.sgml,
but I think it is better to specify both LC_ALL=C and "\xC2\xA0" for making sure detecting
nbsp.
One problem is that -P option can be used in only GNU grep, and grep in mac doesn't support it.
On bash, we can also use `grep $'\xc2\xa0'`, but I am not sure we can assume the shell is bash.
Maybe, better way is use perl itself rather than grep as following.
`perl -ne '/\xC2\xA0/ and print' `
I attached a patch fixed in this way.
GNU sed can also be used without setting LC_ALL:
sed -n /"\xC2\xA0"/p
However I am not sure if non-GNU sed can do this too...
Although I've not check it myself, BSD sed doesn't support \x escape according to [1].
By the way, I've attached a patch a bit modified to use the plural form statement
as same as check-tabs.
Non-breaking **spaces** appear in SGML/XML files
The previous patch was broken because the perl command failed to return the correct result.
I've attached an updated patch to fix the return value. In passing, I added line breaks
for long lines.
I've attached a updated patch.
I added the comment to explain why Perl is used instead of grep or sed.
Looks good to me. If there are no objections, I will commit this to the
master branch.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
On Tue, 1 Oct 2024 22:20:55 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached a updated patch.
I added the comment to explain why Perl is used instead of grep or sed.
Looks good to me. If there's no objection, I will commit this to
master branch.
No objections, LGTM.
--
Daniel Gustafsson
Hi Daniel, Yugo,
On 8 Oct 2024, at 02:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
On Tue, 1 Oct 2024 22:20:55 +0900
Yugo Nagata <nagata@sraoss.co.jp> wrote:
I've attached a updated patch.
I added the comment to explain why Perl is used instead of grep or sed.
Looks good to me. If there's no objection, I will commit this to
master branch.
No objections, LGTM.
Thank you for the patch and review! I have pushed the patch.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Mon, 7 Oct 2024 15:45:54 -0400
Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
I was wrong. The particular byte sequences just looked an underscore
on my editor but the byte sequence is actually 0xc2a0, which must be a
"non breaking space" encoded in UTF-8. I guess someone mistakenly
insert a non breaking space while editing config.sgml.
I wonder if it would be worth to add a check for this like we have to tabs?
The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
(doing so made me realize we don't have an equivalent meson target).
Can we check for any character outside the support range of SGML?
How can we define the range of allowed characters in SGML?
We can detect non-ASCII characters by using the regexp /\P{ascii}/ or /[^\x00-\x7f]/,
but such characters are used in some places in charset.sgml and in some names in release-*.sgml.
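Either spelling works on raw bytes; a quick check (the input string is just an example):

  printf 'caf\303\251\n' | perl -ne 'print "non-ASCII found\n" if /\P{ascii}/'
  non-ASCII found
  printf 'caf\303\251\n' | perl -ne 'print "non-ASCII found\n" if /[^\x00-\x7f]/'
  non-ASCII found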
Regards,
Yugo Nagata
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
--
Yugo Nagata <nagata@sraoss.co.jp>
On Mon, 7 Oct 2024 15:45:54 -0400
Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
I was wrong. The particular byte sequences just looked an underscore
on my editor but the byte sequence is actually 0xc2a0, which must be a
"non breaking space" encoded in UTF-8. I guess someone mistakenly
insert a non breaking space while editing config.sgml.
I wonder if it would be worth to add a check for this like we have to tabs?
The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
(doing so made me realize we don't have an equivalent meson target).
Can we check for any character outside the support range of SGML?
What we can define the range of allowed characters range in SGML?
We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/,
but they are used in some places in charset.sgml and some names in release-*.sgml.
I failed to find any standard regarding what characters are allowed in
SGML/XML. Assuming that any valid Unicode characters are allowed in
our *sgml files, I am afraid the best we can do is grepping for non-ASCII
characters in the files and checking the results by visual
inspection. Besides nbsp, there are tons of confusing Unicode
characters out there. For example, there are many "hyphen-like
characters":
https://www.compart.com/en/unicode/category/Pd
If one of them is used in the sgml files, it is possible that it
was accidentally inserted.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Wed, Oct 9, 2024 at 11:49:29AM +0900, Tatsuo Ishii wrote:
On Mon, Sep 30, 2024 at 11:59:48AM +0200, Daniel Gustafsson wrote:
On 30 Sep 2024, at 11:03, Tatsuo Ishii <ishii@postgresql.org> wrote:
I think there's an unnecessary underscore in config.sgml.
I was wrong. The particular byte sequences just looked an underscore
on my editor but the byte sequence is actually 0xc2a0, which must be a
"non breaking space" encoded in UTF-8. I guess someone mistakenly
insert a non breaking space while editing config.sgml.
I wonder if it would be worth to add a check for this like we have to tabs?
The attached adds a rule to "make -C doc/src/sgml check" for trapping nbsp
(doing so made me realize we don't have an equivalent meson target).
Can we check for any character outside the support range of SGML?
What we can define the range of allowed characters range in SGML?
We can detect non-ASCII characters by using regexp /\P{ascii}/ or /[^\x00-\x7f]/,
but they are used in some places in charset.sgml and some names in release-*.sgml.
I failed to find any standard regarding what characters are allowed in
SGML/XML. Assuming that any valid Unicode characters are allowed in
our *sgml files, I am afraid the best we can do is grepping non-ASCII
characters against the files and checking the results by a visual
inspection. Besides nbsp, there are tons of confusing Unicode
characters out there. For example there are many "hyphen like
characters".https://www.compart.com/en/unicode/category/Pd
If one of them is used in the sgml files, it may be possible that it
was accidentally inserted.
Can we use Unicode in the SGML files?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes:
Can we use Unicode in the SGML files?
I believe we've been doing it for contributors' names that require
non-ASCII letters, but not in any other places.
regards, tom lane
Bruce Momjian <bruce@momjian.us> writes:
Can we use Unicode in the SGML files?
I believe we've been doing it for contributors' names that require
non-ASCII letters, but not in any other places.
We have non-ASCII letters in charset.sgml too, to show some examples
of collation.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On 9 Oct 2024, at 04:49, Tatsuo Ishii <ishii@postgresql.org> wrote:
Besides nbsp, there are tons of confusing Unicode
characters out there. For example there are many "hyphen like
characters".
Using characters which look alike is in the field of internet security known as
homograph attacks, where for example a url visually passes for postgresql.org
but in fact leads to an attacker. That sort of attack clearly doesn't apply to
our docs though. However, what might cause similar problems is if we use a
unicode character in example code which the reader could be expected to
copy/paste into psql and run, which then (at best) causes a syntax error. We
could probably build tooling to catch this (most likely not too hard in XSLT)
but the ROI for that might be unfavourable. Even with tooling, committer
caution is needed to ensure we don't publish examples that might cause
unintended side effects when executed by copy/paste.
What separates nbsp is that it may affect the rendering in an unintuitive way
by forcing two words to not break even if the viewport is too narrow to fit.
Catching such characters seems worthwhile since it's also quite doable with a
trivial grep.
--
Daniel Gustafsson
On Thu, 10 Oct 2024 16:00:41 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
Bruce Momjian <bruce@momjian.us> writes:
Can we use Unicode in the SGML files?
I believe we've been doing it for contributors' names that require
non-ASCII letters, but not in any other places.
We have non-ASCII letters in charset.sgml too, to show some examples
of collation.
We can check SGML/XML files for non-ASCII letters by preparing an "allowlist"
that contains the lines which are allowed to have non-ASCII characters,
although this list will need to be maintained when the lines in it are modified.
I've attached a patch to add a simple Perl script to do this.
While testing this script, I found that "stylesheet-man.xsl" also has non-ASCII
characters. I don't know whether these characters are really necessary, though, since
I don't understand this file well.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
0001-Doc-Add-check-to-detect-non-ASCII-characters.patch (text/x-diff)
From c5a16f1f7c515294cb600554fe1bbe045d25ec26 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Thu, 10 Oct 2024 23:35:19 +0900
Subject: [PATCH] Doc: Add check to detect non-ASCII characters
---
doc/src/sgml/Makefile | 11 ++++----
doc/src/sgml/check_non_ascii.pl | 47 +++++++++++++++++++++++++++++++++
2 files changed, 52 insertions(+), 6 deletions(-)
create mode 100644 doc/src/sgml/check_non_ascii.pl
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 65ed32cd0a..90cbeed542 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
+check: postgres.sgml $(ALLSGML) check-tabs check-non-ascii
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -260,12 +260,11 @@ check-tabs:
@( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
(echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
-# Non-breaking spaces are harmless, but it is best to avoid them in SGML files.
+# Non-ASCII characters are harmless, but it is best to avoid them in SGML files.
# Use perl command because non-GNU grep or sed could not have hex escape sequence.
-check-nbsp:
- @ ( $(PERL) -ne '/\xC2\xA0/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
- $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
- (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+check-non-ascii:
+ @ ( $(PERL) $(srcdir)/check_non_ascii.pl $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
+ (echo "Non-ASCII characters appear in SGML/XML files" 1>&2; exit 1)
##
## Clean
diff --git a/doc/src/sgml/check_non_ascii.pl b/doc/src/sgml/check_non_ascii.pl
new file mode 100644
index 0000000000..1d7ae405b5
--- /dev/null
+++ b/doc/src/sgml/check_non_ascii.pl
@@ -0,0 +1,47 @@
+#!/usr/bin/perl
+#
+# Check if non-ASCII characters appear in SGML/XML files
+# Copyright (c) 2000-2024, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+
+# list of lines where non-ascii characters are allowed
+my %allowlist = (
+'./charset.sgml' => [
+"SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true",
+" <entry><literal>'n' = 'ñ'</literal></entry>",
+" performed. For example, <literal>'á'</literal> may be composed of the",
+" locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>",
+" before <literal>'aé'</literal>."
+],
+'./stylesheet-man.xsl' => [
+'<l:template name="sect.*" text="Section %n, “%t”, in the documentation"/>'
+]
+);
+
+# begin of the acknowledgements for contributors in the release-note
+my $release_ack='<sect2 id="release-.*-acknowledgements">';
+
+my $n = 0;
+foreach my $file (@ARGV)
+{
+ open my $fh, '<', $file or die;
+ while (my $line = <$fh>)
+ {
+ # skip lines in allowlist
+ next if exists($allowlist{$file}) and (grep {$line =~ $_} @{$allowlist{$file}});
+
+ # skip contributor names in the acknowledgements
+ last if ($line =~ /$release_ack/);
+
+ # check non-ascii characters
+ if ($line =~ /[^\x00-\x7f]/)
+ {
+ print "$file:$line";
+ $n++;
+ }
+ }
+ close $fh;
+}
+exit($n>0);
--
2.34.1
We can check non-ASCII letters SGML/XML files by preparing "allowlist"
that contains lines which are allowed to have non-ascii characters,
although this list will need to be maintained when lines in it are modified.
I've attached a patch to add a simple Perl script to do this.
I doubt it really works. For example, nbsp can be used for formatting
(that's the purpose of the character in the first place). Whenever a
developer decides to use or not to use nbsp, the "allowlist" needs to be
maintained. It's too annoying.
I think it's better to add the non-ASCII character check to the
committing checklist and let committers check for non-ASCII characters in
the patch. Non-ASCII characters are rarely used, so it would not become a
burden.
https://wiki.postgresql.org/wiki/Committing_checklist
Maybe we can add to the wiki page something like this?
git diff origin/master | grep -P '[^\x00-\x7f]'
During testing this script, I found "stylesheet-man.xsl" also has non-ascii
characters. I don't know these characters are really necessary though, since
I don't understand this file well.
They are U+201C (double turned comma quotation mark) and U+201D
(double comma quotation mark).
<l:template name="sect3" text="Section %n, “%t”, in the documentation"/>
I would like to know why they are necessary too.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Fri, 11 Oct 2024 12:16:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
We can check non-ASCII letters SGML/XML files by preparing "allowlist"
that contains lines which are allowed to have non-ascii characters,
although this list will need to be maintained when lines in it are modified.
I've attached a patch to add a simple Perl script to do this.
I doubt it really works. For example, nbsp can be used formatting
(that's the purpose of the character in the first place). Whenever a
developer decides to or not to use nbsp, "allowlist" needs to be
maintained. It's too annoying.
I suppose non-ASCII characters, including nbsp, are basically disallowed,
so the allowlist will not grow unless there is some special reason.
However, it is true that there is more or less a cost to maintaining the list,
so if people don't think this check is worth adding,
I will withdraw this proposal.
I think it's better to add the non-ASCII character checking to the
comitting check list and let committers check non-ASCII character in
the patch. Non-ASCII characters rarely used and it would not become a
burden.
https://wiki.postgresql.org/wiki/Committing_checklist
Maybe we can add to the wiki page something like this?
git diff origin/master | grep -P '[^\x00-\x7f]'
During testing this script, I found "stylesheet-man.xsl" also has non-ascii
characters. I don't know these characters are really necessary though, since
I don't understand this file well.
They are U+201C (double turned comma quotation mark) and U+201D
(double comma quotation mark).
<l:template name="sect3" text="Section %n, “%t”, in the documentation"/>
I would like to know why they are necessary too.
+1
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
On Fri, Oct 11, 2024 at 12:36:53PM +0900, Yugo NAGATA wrote:
On Fri, 11 Oct 2024 12:16:50 +0900 (JST)
Tatsuo Ishii <ishii@postgresql.org> wrote:
We can check non-ASCII letters SGML/XML files by preparing "allowlist"
that contains lines which are allowed to have non-ascii characters,
although this list will need to be maintained when lines in it are modified.
I've attached a patch to add a simple Perl script to do this.
I doubt it really works. For example, nbsp can be used formatting
(that's the purpose of the character in the first place). Whenever a
developer decides to or not to use nbsp, "allowlist" needs to be
maintained. It's too annoying.
I suppose non-ascii characters including nbsp are basically disallowed,
so the allowlist will not increase unless there is some special reason.
However, it is true that there might be a cost for maintaining the list
more or less, so if people don't think it is worth adding this check,
I will withdraw this proposal.
I did some more research and we were able to clarify our behavior in
release.sgml:
We can only use Latin1 characters, not all UTF8 characters,
because rendering engines must support the referenced characters,
and they currently only support Latin1. In the SGML files we
encode non-ASCII Latin1 characters as HTML entities, e.g.,
Álvaro Herrera. Oddly, it is possible to add Latin1
characters as UTF8, but we currently prevent this via the
Makefile "check-non-ascii" check.
We used to use UTF8 characters in SGML files, but only UTF8 characters
that had Latin1 equivalents, and I think the toolchain would convert
UTF8 to Latin1 for us.
What I ended up doing was to change the UTF8 encoded characters to HTML
entities, and then modify the Makefile to check for any non-ASCII
characters. This will catch any other UTF8 characters.
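For example (taking one of the charset.sgml lines touched by the patch below), the conversion would look like this, with &ntilde; being the standard HTML entity name for that Latin1 character:

  raw UTF-8:     <entry><literal>'n' = 'ñ'</literal></entry>
  HTML entity:   <entry><literal>'n' = '&ntilde;'</literal></entry>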
I also added a dummy 'pdf' target that is the same as the postgres.pdf
dummy target; we already had an "html" target, so I thought a "pdf" one
made sense.
Patch attached. I plan to apply this in a few days to master.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Attachments:
latin1.diff (text/x-diff; charset=utf-8)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 65ed32cd0ab..87d21783e52 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -143,7 +143,7 @@ postgres.txt: postgres.html
## Print
##
-postgres.pdf:
+postgres.pdf pdf:
$(error Invalid target; use postgres-A4.pdf or postgres-US.pdf as targets)
XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
+check: postgres.sgml $(ALLSGML) check-tabs check-non-ascii
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -262,10 +262,9 @@ check-tabs:
# Non-breaking spaces are harmless, but it is best to avoid them in SGML files.
# Use perl command because non-GNU grep or sed could not have hex escape sequence.
-check-nbsp:
- @ ( $(PERL) -ne '/\xC2\xA0/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
- $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
- (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+check-non-ascii:
+ @( ! grep -P '[^\x00-\x7f]' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
+ (echo "Non-ASCII characters appear in SGML/XML files; use HTML entities for Latin1 characters" 1>&2; exit 1)
##
## Clean
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 1ef5322b912..f5e115e8d6e 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1225,7 +1225,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
<programlisting>
-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
-SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
-- upper case letters sort before lower case.
@@ -1282,7 +1282,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<entry><literal>'ab' = U&'a\2063b'</literal></entry>
<entry><literal>'x-y' = 'x_y'</literal></entry>
<entry><literal>'g' = 'G'</literal></entry>
- <entry><literal>'n' = 'ñ'</literal></entry>
+ <entry><literal>'n' = 'ñ'</literal></entry>
<entry><literal>'y' = 'z'</literal></entry>
</row>
</thead>
@@ -1346,7 +1346,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<para>
At every level, even with full normalization off, basic normalization is
- performed. For example, <literal>'á'</literal> may be composed of the
+ performed. For example, <literal>'á'</literal> may be composed of the
code points <literal>U&'\0061\0301'</literal> or the single code
point <literal>U&'\00E1'</literal>, and those sequences will be
considered equal even at the <literal>identic</literal> level. To treat
@@ -1430,8 +1430,8 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
<entry><literal>false</literal></entry>
<entry>
Backwards comparison for the level 2 differences. For example,
- locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
- before <literal>'aé'</literal>.
+ locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
+ before <literal>'aé'</literal>.
</entry>
</row>
diff --git a/doc/src/sgml/images/genetic-algorithm.svg b/doc/src/sgml/images/genetic-algorithm.svg
index fb9fdd1ba78..2ce5f1b2712 100644
--- a/doc/src/sgml/images/genetic-algorithm.svg
+++ b/doc/src/sgml/images/genetic-algorithm.svg
@@ -72,7 +72,7 @@
<title>a4->end</title>
<path fill="none" stroke="#000000" d="M259,-312.5834C259,-312.5834 259,-54.659 259,-54.659"/>
<polygon fill="#000000" stroke="#000000" points="262.5001,-54.659 259,-44.659 255.5001,-54.6591 262.5001,-54.659"/>
-<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true  </text>
+<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true</text>
</g>
<!-- a5 -->
<g id="node7" class="node">
@@ -85,7 +85,7 @@
<title>a4->a5</title>
<path fill="none" stroke="#000000" d="M144,-298.269C144,-298.269 144,-286.5248 144,-286.5248"/>
<polygon fill="#000000" stroke="#000000" points="147.5001,-286.5248 144,-276.5248 140.5001,-286.5249 147.5001,-286.5248"/>
-<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false   </text>
+<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false</text>
</g>
<!-- a6 -->
<g id="node8" class="node">
diff --git a/doc/src/sgml/release.sgml b/doc/src/sgml/release.sgml
index 8433690dead..65c86f54c0e 100644
--- a/doc/src/sgml/release.sgml
+++ b/doc/src/sgml/release.sgml
@@ -26,13 +26,15 @@ non-ASCII characters find using grep -P '[\x80-\xFF]' or
http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
- We cannot use UTF8 because rendering engines have to
- support the referenced characters.
-
- Do not use numeric _UTF_ numeric character escapes (&#nnn;),
- we can only use Latin1.
-
- Example: Alvaro Herrera is Álvaro Herrera
+ We can only use Latin1 characters, not all UTF8 characters,
+ because rendering engines must support the referenced characters,
+ and they currently only support Latin1. In the SGML files we
+ encode non-ASCII Latin1 characters as HTML entities, e.g.,
+ Álvaro Herrera. Oddly, it is possible to add Latin1
+ characters as UTF8, but we we currently prevent this via the
+ Makefile "check-non-ascii" check.
+
+ Do not use numeric _UTF_ numeric character escapes (&#nnn;).
wrap long lines
diff --git a/doc/src/sgml/stylesheet-man.xsl b/doc/src/sgml/stylesheet-man.xsl
index fcb485c2931..2e2564da683 100644
--- a/doc/src/sgml/stylesheet-man.xsl
+++ b/doc/src/sgml/stylesheet-man.xsl
@@ -213,12 +213,12 @@
<!-- Slight rephrasing to indicate that missing sections are found
in the documentation. -->
<l:context name="xref-number-and-title">
- <l:template name="chapter" text="Chapter %n, %t, in the documentation"/>
- <l:template name="sect1" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect2" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect3" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect4" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect5" text="Section %n, â%tâ, in the documentation"/>
+ <l:template name="chapter" text="Chapter %n, "%t", in the documentation"/>
+ <l:template name="sect1" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect2" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect3" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect4" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect5" text="Section %n, "%t", in the documentation"/>
</l:context>
</l:l10n>
</l:i18n>
On Mon, Oct 14, 2024 at 03:05:35PM -0400, Bruce Momjian wrote:
I did some more research and we able to clarify our behavior in
release.sgml:
I have specified some more details in my patched version:
We can only use Latin1 characters, not all UTF8 characters,
because some rendering engines do not support non-Latin1 UTF8
characters. Specifically, the HTML rendering engine can display
all UTF8 characters, but the PDF rendering engine can only display
Latin1 characters. In PDF files, non-Latin1 UTF8 characters are
displayed as "###".
In the SGML files we encode non-ASCII Latin1 characters as HTML
entities, e.g., Álvaro. Oddly, it is possible to safely
represent Latin1 characters in SGML files as UTF8 for HTML and
PDF output, but we currently disallow this via the Makefile
"check-non-ascii" rule.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Hi Bruce,
On Mon, 14 Oct 2024 16:31:11 -0400
Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Oct 14, 2024 at 03:05:35PM -0400, Bruce Momjian wrote:
I did some more research and we able to clarify our behavior in
release.sgml:
I have specified some more details in my patched version:
We can only use Latin1 characters, not all UTF8 characters,
because some rendering engines do not support non-Latin1 UTF8
characters. Specifically, the HTML rendering engine can display
all UTF8 characters, but the PDF rendering engine can only display
Latin1 characters. In PDF files, non-Latin1 UTF8 characters are
displayed as "###".In the SGML files we encode non-ASCII Latin1 characters as HTML
entities, e.g., Álvaro. Oddly, it is possible to safely
represent Latin1 characters in SGML files as UTF8 for HTML and
PDF output, but we we currently disallow this via the Makefile
"check-non-ascii" rule.
I agree with encoding non-Latin1 characters and disallowing non-ASCII
characters totally.
I found your patch includes fixes in *.svg files, so how about also checking
them in check-non-ascii? Also, I think it is better to use perl instead
of grep because non-GNU grep doesn't support hex escape sequences. I've attached
an updated patch for the Makefile. The changes to release.sgml above are not applied
yet, though.
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
v2_latin1.diff (text/x-diff)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 65ed32cd0a..3d992ebd84 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -143,7 +143,7 @@ postgres.txt: postgres.html
## Print
##
-postgres.pdf:
+postgres.pdf pdf:
$(error Invalid target; use postgres-A4.pdf or postgres-US.pdf as targets)
XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
@@ -194,7 +194,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
+check: postgres.sgml $(ALLSGML) check-tabs check-non-ascii
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -257,15 +257,16 @@ endif # sqlmansectnum != 7
# tabs are harmless, but it is best to avoid them in SGML files
check-tabs:
- @( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
+ @( ! grep ' ' $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.svg) ) || \
(echo "Tabs appear in SGML/XML files" 1>&2; exit 1)
-# Non-breaking spaces are harmless, but it is best to avoid them in SGML files.
-# Use perl command because non-GNU grep or sed could not have hex escape sequence.
-check-nbsp:
- @ ( $(PERL) -ne '/\xC2\xA0/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
- $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
- (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+# Disallow non-ASCII characters because some rendering engines do not
+# support non-Latin1 UTF8 characters. Use perl command because non-GNU grep
+# or sed could not have hex escape sequence.
+check-non-ascii:
+ @ ( $(PERL) -ne '/[^\x00-\x7f]/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+ $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.svg) ) || \
+ (echo "Non-ASCII characters appear in SGML/XML files; use HTML entities for Latin1 characters" 1>&2; exit 1)
##
## Clean
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 1ef5322b91..f5e115e8d6 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1225,7 +1225,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
<programlisting>
-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
-SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
-- upper case letters sort before lower case.
@@ -1282,7 +1282,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<entry><literal>'ab' = U&'a\2063b'</literal></entry>
<entry><literal>'x-y' = 'x_y'</literal></entry>
<entry><literal>'g' = 'G'</literal></entry>
- <entry><literal>'n' = 'ñ'</literal></entry>
+ <entry><literal>'n' = 'ñ'</literal></entry>
<entry><literal>'y' = 'z'</literal></entry>
</row>
</thead>
@@ -1346,7 +1346,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<para>
At every level, even with full normalization off, basic normalization is
- performed. For example, <literal>'á'</literal> may be composed of the
+ performed. For example, <literal>'á'</literal> may be composed of the
code points <literal>U&'\0061\0301'</literal> or the single code
point <literal>U&'\00E1'</literal>, and those sequences will be
considered equal even at the <literal>identic</literal> level. To treat
@@ -1430,8 +1430,8 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
<entry><literal>false</literal></entry>
<entry>
Backwards comparison for the level 2 differences. For example,
- locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
- before <literal>'aé'</literal>.
+ locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
+ before <literal>'aé'</literal>.
</entry>
</row>
diff --git a/doc/src/sgml/images/genetic-algorithm.svg b/doc/src/sgml/images/genetic-algorithm.svg
index fb9fdd1ba7..2ce5f1b271 100644
--- a/doc/src/sgml/images/genetic-algorithm.svg
+++ b/doc/src/sgml/images/genetic-algorithm.svg
@@ -72,7 +72,7 @@
<title>a4->end</title>
<path fill="none" stroke="#000000" d="M259,-312.5834C259,-312.5834 259,-54.659 259,-54.659"/>
<polygon fill="#000000" stroke="#000000" points="262.5001,-54.659 259,-44.659 255.5001,-54.6591 262.5001,-54.659"/>
-<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true </text>
+<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true</text>
</g>
<!-- a5 -->
<g id="node7" class="node">
@@ -85,7 +85,7 @@
<title>a4->a5</title>
<path fill="none" stroke="#000000" d="M144,-298.269C144,-298.269 144,-286.5248 144,-286.5248"/>
<polygon fill="#000000" stroke="#000000" points="147.5001,-286.5248 144,-276.5248 140.5001,-286.5249 147.5001,-286.5248"/>
-<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false </text>
+<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false</text>
</g>
<!-- a6 -->
<g id="node8" class="node">
diff --git a/doc/src/sgml/release.sgml b/doc/src/sgml/release.sgml
index 8433690dea..65c86f54c0 100644
--- a/doc/src/sgml/release.sgml
+++ b/doc/src/sgml/release.sgml
@@ -26,13 +26,15 @@ non-ASCII characters find using grep -P '[\x80-\xFF]' or
http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
- We cannot use UTF8 because rendering engines have to
- support the referenced characters.
-
- Do not use numeric _UTF_ numeric character escapes (&#nnn;),
- we can only use Latin1.
-
- Example: Alvaro Herrera is Álvaro Herrera
+ We can only use Latin1 characters, not all UTF8 characters,
+ because rendering engines must support the referenced characters,
+ and they currently only support Latin1. In the SGML files we
+ encode non-ASCII Latin1 characters as HTML entities, e.g.,
+ Álvaro Herrera. Oddly, it is possible to add Latin1
+ characters as UTF8, but we we currently prevent this via the
+ Makefile "check-non-ascii" check.
+
+ Do not use numeric _UTF_ numeric character escapes (&#nnn;).
wrap long lines
diff --git a/doc/src/sgml/stylesheet-man.xsl b/doc/src/sgml/stylesheet-man.xsl
index fcb485c293..2e2564da68 100644
--- a/doc/src/sgml/stylesheet-man.xsl
+++ b/doc/src/sgml/stylesheet-man.xsl
@@ -213,12 +213,12 @@
<!-- Slight rephrasing to indicate that missing sections are found
in the documentation. -->
<l:context name="xref-number-and-title">
- <l:template name="chapter" text="Chapter %n, %t, in the documentation"/>
- <l:template name="sect1" text="Section %n, “%t”, in the documentation"/>
- <l:template name="sect2" text="Section %n, “%t”, in the documentation"/>
- <l:template name="sect3" text="Section %n, “%t”, in the documentation"/>
- <l:template name="sect4" text="Section %n, “%t”, in the documentation"/>
- <l:template name="sect5" text="Section %n, “%t”, in the documentation"/>
+ <l:template name="chapter" text="Chapter %n, "%t", in the documentation"/>
+ <l:template name="sect1" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect2" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect3" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect4" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect5" text="Section %n, "%t", in the documentation"/>
</l:context>
</l:l10n>
</l:i18n>
On Tue, Oct 15, 2024 at 10:10:36AM +0900, Yugo NAGATA wrote:
Hi Bruce,
On Mon, 14 Oct 2024 16:31:11 -0400
Bruce Momjian <bruce@momjian.us> wrote:On Mon, Oct 14, 2024 at 03:05:35PM -0400, Bruce Momjian wrote:
I did some more research and we were able to clarify our behavior in
release.sgml:I have specified some more details in my patched version:
We can only use Latin1 characters, not all UTF8 characters,
because some rendering engines do not support non-Latin1 UTF8
characters. Specifically, the HTML rendering engine can display
all UTF8 characters, but the PDF rendering engine can only display
Latin1 characters. In PDF files, non-Latin1 UTF8 characters are
displayed as "###".In the SGML files we encode non-ASCII Latin1 characters as HTML
entities, e.g., Álvaro. Oddly, it is possible to safely
represent Latin1 characters in SGML files as UTF8 for HTML and
PDF output, but we currently disallow this via the Makefile
"check-non-ascii" rule.I agree with encoding non-Latin1 characters and disallowing non-ASCII
characters totally.I found your patch includes fixes in *.svg files, so how about checking
also them by check-non-ascii? Also, I think it is better to use perl instead
of grep because non-GNU grep doesn't support hex escape sequences. I've attached
a updated patch for Makefile. The changes in release.sgml above is not applied
yet, though.
Yes, good idea on using Perl and checking svg files --- I have used your
Makefile rule.
Attached is an updated patch. I realized that the new rules apply to
all SGML files, not just the release notes, so I have created
README.non-ASCII and moved the description there.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Attachments:
latin1.difftext/x-diff; charset=utf-8Download
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 65ed32cd0ab..a3ff1168729 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -143,11 +143,12 @@ postgres.txt: postgres.html
## Print
##
-postgres.pdf:
+postgres.pdf pdf:
$(error Invalid target; use postgres-A4.pdf or postgres-US.pdf as targets)
XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
+# XSL Formatting Objects (FO), https://en.wikipedia.org/wiki/XSL_Formatting_Objects
%-A4.fo: stylesheet-fo.xsl %-full.xml
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type A4 -o $@ $^
@@ -194,7 +195,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALLSGML) check-tabs check-nbsp
+check: postgres.sgml $(ALLSGML) check-tabs check-non-ascii
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -262,10 +263,10 @@ check-tabs:
# Non-breaking spaces are harmless, but it is best to avoid them in SGML files.
# Use perl command because non-GNU grep or sed could not have hex escape sequence.
-check-nbsp:
- @ ( $(PERL) -ne '/\xC2\xA0/ and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
- $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) ) || \
- (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+check-non-ascii:
+ @ ( $(PERL) -ne '/[^\x00-\x7f]/ and print("$$ARGV: $$_"), $$n++; END { exit($$n > 0) }' \
+ $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.svg) ) || \
+ (echo "Non-ASCII characters appear in SGML/XML files; use HTML entities for Latin1 characters" 1>&2; exit 1)
##
## Clean
diff --git a/doc/src/sgml/README.non-ASCII b/doc/src/sgml/README.non-ASCII
new file mode 100644
index 00000000000..a7300bcb2d3
--- /dev/null
+++ b/doc/src/sgml/README.non-ASCII
@@ -0,0 +1,38 @@
+<!-- doc/src/sgml/README.non-ASCII -->
+
+Representation of non-ASCII characters
+--------------------------------------
+
+Find non-ASCII characters using:
+
+ grep --color='auto' -P "[\x80-\xFF]"
+
+Convert to HTML4 named entity (&) escapes
+-----------------------------------------
+
+We support several output formats:
+
+* html (supports all Unicode characters)
+* man (supports all Unicode characters)
+* pdf (supports only Latin-1 characters)
+* info
+
+While some output formatting tools support all Unicode characters,
+others only support Latin-1 characters. Specifically, the PDF rendering
+engine can only display Latin-1 characters; non-Latin-1 Unicode
+characters are displayed as "###".
+
+Therefore, in the SGML files, we only use Latin-1 characters. We encode
+these characters as HTML entities, e.g., Álvaro. Oddly, in SGML
+files it is possible to safely represent Latin-1 characters in UTF8
+encoding for all output formats, but we currently disallow this via
+the Makefile rule "check-non-ascii".
+
+Do not use UTF numeric character escapes (&#nnn;).
+
+HTML entities
+ official: http://www.w3.org/TR/html4/sgml/entities.html
+ one page: http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
+ other lists: http://www.zipcon.net/~swhite/docs/computers/browsers/entities.html
+ http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
+ https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 1ef5322b912..f5e115e8d6e 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1225,7 +1225,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
<programlisting>
-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
-SELECT 'Ã' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
-- upper case letters sort before lower case.
@@ -1282,7 +1282,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<entry><literal>'ab' = U&'a\2063b'</literal></entry>
<entry><literal>'x-y' = 'x_y'</literal></entry>
<entry><literal>'g' = 'G'</literal></entry>
- <entry><literal>'n' = 'ñ'</literal></entry>
+ <entry><literal>'n' = 'ñ'</literal></entry>
<entry><literal>'y' = 'z'</literal></entry>
</row>
</thead>
@@ -1346,7 +1346,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<para>
At every level, even with full normalization off, basic normalization is
- performed. For example, <literal>'á'</literal> may be composed of the
+ performed. For example, <literal>'á'</literal> may be composed of the
code points <literal>U&'\0061\0301'</literal> or the single code
point <literal>U&'\00E1'</literal>, and those sequences will be
considered equal even at the <literal>identic</literal> level. To treat
@@ -1430,8 +1430,8 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
<entry><literal>false</literal></entry>
<entry>
Backwards comparison for the level 2 differences. For example,
- locale <literal>und-u-kb</literal> sorts <literal>'Ã e'</literal>
- before <literal>'aé'</literal>.
+ locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
+ before <literal>'aé'</literal>.
</entry>
</row>
diff --git a/doc/src/sgml/images/genetic-algorithm.svg b/doc/src/sgml/images/genetic-algorithm.svg
index fb9fdd1ba78..2ce5f1b2712 100644
--- a/doc/src/sgml/images/genetic-algorithm.svg
+++ b/doc/src/sgml/images/genetic-algorithm.svg
@@ -72,7 +72,7 @@
<title>a4->end</title>
<path fill="none" stroke="#000000" d="M259,-312.5834C259,-312.5834 259,-54.659 259,-54.659"/>
<polygon fill="#000000" stroke="#000000" points="262.5001,-54.659 259,-44.659 255.5001,-54.6591 262.5001,-54.659"/>
-<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true  </text>
+<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true</text>
</g>
<!-- a5 -->
<g id="node7" class="node">
@@ -85,7 +85,7 @@
<title>a4->a5</title>
<path fill="none" stroke="#000000" d="M144,-298.269C144,-298.269 144,-286.5248 144,-286.5248"/>
<polygon fill="#000000" stroke="#000000" points="147.5001,-286.5248 144,-276.5248 140.5001,-286.5249 147.5001,-286.5248"/>
-<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false   </text>
+<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false</text>
</g>
<!-- a6 -->
<g id="node8" class="node">
diff --git a/doc/src/sgml/release.sgml b/doc/src/sgml/release.sgml
index 8433690dead..cee577ff8d3 100644
--- a/doc/src/sgml/release.sgml
+++ b/doc/src/sgml/release.sgml
@@ -16,24 +16,6 @@ pg_[A-Za-z0-9_]+ <application>, <structname>
\<[a-z]+_[a-z_]+\> <varname>, <structfield>
<systemitem class="osname">
-non-ASCII characters find using grep -P '[\x80-\xFF]' or
- (remove 'X') grep -X-color='auto' -P -n "[\x80-\xFF]"
- convert to HTML4 named entity (&) escapes
-
- official: http://www.w3.org/TR/html4/sgml/entities.html
- one page: http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
- other lists: http://www.zipcon.net/~swhite/docs/computers/browsers/entities.html
- http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
- https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
-
- We cannot use UTF8 because rendering engines have to
- support the referenced characters.
-
- Do not use numeric _UTF_ numeric character escapes (&#nnn;),
- we can only use Latin1.
-
- Example: Alvaro Herrera is Álvaro Herrera
-
wrap long lines
For new features, add links to the documentation sections.
diff --git a/doc/src/sgml/stylesheet-man.xsl b/doc/src/sgml/stylesheet-man.xsl
index fcb485c2931..2e2564da683 100644
--- a/doc/src/sgml/stylesheet-man.xsl
+++ b/doc/src/sgml/stylesheet-man.xsl
@@ -213,12 +213,12 @@
<!-- Slight rephrasing to indicate that missing sections are found
in the documentation. -->
<l:context name="xref-number-and-title">
- <l:template name="chapter" text="Chapter %n, %t, in the documentation"/>
- <l:template name="sect1" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect2" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect3" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect4" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect5" text="Section %n, â%tâ, in the documentation"/>
+ <l:template name="chapter" text="Chapter %n, "%t", in the documentation"/>
+ <l:template name="sect1" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect2" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect3" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect4" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect5" text="Section %n, "%t", in the documentation"/>
</l:context>
</l:l10n>
</l:i18n>
On 15.10.24 18:54, Bruce Momjian wrote:
I agree with encoding non-Latin1 characters and disallowing non-ASCII
characters totally.I found your patch includes fixes in *.svg files, so how about checking
also them by check-non-ascii? Also, I think it is better to use perl instead
of grep because non-GNU grep doesn't support hex escape sequences. I've attached
a updated patch for Makefile. The changes in release.sgml above is not applied
yet, though.Yes, good idea on using Perl and checking svg files --- I have used your
Makefile rule.Attached is an updated patch. I realized that the new rules apply to
all SGML files, not just the release notes, so I have created
README.non-ASCII and moved the description there.
I don't understand the point of this. Maybe it's okay to try to detect
certain "hidden" whitespace characters, like in the case that started
this thread. But I don't see the value in prohibiting all non-ASCII
characters, as is being proposed here.
On Tue, Oct 15, 2024 at 10:34:16PM +0200, Peter Eisentraut wrote:
On 15.10.24 18:54, Bruce Momjian wrote:
I agree with encoding non-Latin1 characters and disallowing non-ASCII
characters totally.I found your patch includes fixes in *.svg files, so how about checking
also them by check-non-ascii? Also, I think it is better to use perl instead
of grep because non-GNU grep doesn't support hex escape sequences. I've attached
a updated patch for Makefile. The changes in release.sgml above is not applied
yet, though.Yes, good idea on using Perl and checking svg files --- I have used your
Makefile rule.Attached is an updated patch. I realized that the new rules apply to
all SGML files, not just the release notes, so I have created
README.non-ASCII and moved the description there.I don't understand the point of this. Maybe it's okay to try to detect
certain "hidden" whitespace characters, like in the case that started this
thread. But I don't see the value in prohibiting all non-ASCII characters,
as is being proposed here.
Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
if desired.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On 15.10.24 22:37, Bruce Momjian wrote:
I don't understand the point of this. Maybe it's okay to try to detect
certain "hidden" whitespace characters, like in the case that started this
thread. But I don't see the value in prohibiting all non-ASCII characters,
as is being proposed here.Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8.
But your patch prohibits even otherwise allowed Latin-1 characters.
I don't see why we need to enforce this at this level. Whatever
downstream toolchain has requirements about which characters are allowed
will complain if it encounters a character it doesn't like.
Bruce Momjian <bruce@momjian.us> writes:
Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
if desired.
That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.
regards, tom lane
On Tue, Oct 15, 2024 at 11:08:15PM +0200, Peter Eisentraut wrote:
On 15.10.24 22:37, Bruce Momjian wrote:
I don't understand the point of this. Maybe it's okay to try to detect
certain "hidden" whitespace characters, like in the case that started this
thread. But I don't see the value in prohibiting all non-ASCII characters,
as is being proposed here.Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8.But your patch prohibits even otherwise allowed Latin-1 characters.
Well, yes, they are Latin-1 characters encoded as UTF-8.
I don't see why we need to enforce this at this level. Whatever downstream
toolchain has requirements about which characters are allowed will complain
if it encounters a character it doesn't like.
Uh, the PDF build does not complain if you pass it non-Latin-1 UTF8
characters. To test this, I added some Russian characters (non-Latin-1)
to release.sgml:
(⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩,
⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩,
⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier
letters or "signs" (⟨ъ⟩, ⟨ь⟩)
and I ran 'make postgres-US.pdf', and then removed the Russian
characters and ran the same command again. The output, including stderr,
was identical. The PDFs, of course, were not, with the Russian
characters showing as "####". Makefile output attached.
So, in summary, the PDF build is allowed to complain, but it does not.
Even if it did complain, odds are most people are only going to test an
HTML build of their patch, if at all, rather than a PDF build.
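For reference, a rough sketch of that test, run from doc/src/sgml (the
log file names are only illustrative):
    make postgres-US.pdf > build-with-utf8.log 2>&1      # test characters present in release.sgml
    git checkout -- release.sgml                         # drop the uncommitted test characters
    make postgres-US.pdf > build-without-utf8.log 2>&1
    diff build-with-utf8.log build-without-utf8.log      # identical here, i.e. no warning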
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Attachments:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
if desired.That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.
Uh, why can't we use HTML entities going forward? Is that harder? Can
we just exclude the release notes from this check?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.
Uh, why can't we use HTML entities going forward? Is that harder?
Yes: it requires looking up the entities. The mail you are probably
consulting to make a release note or commit message is most likely
just going to contain the person's name as normally spelled.
Plus (as you pointed out earlier today) there aren't HTML entities for
all characters.
Can we just exclude the release notes from this check?
What is the point of a check we can only enforce against part of the
documentation?
regards, tom lane
On Tue, Oct 15, 2024 at 05:59:05PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.Uh, why can't we use HTML entities going forward? Is that harder?
Yes: it requires looking up the entities. The mail you are probably
consulting to make a release note or commit message is most likely
just going to contain the person's name as normally spelled.Plus (as you pointed out earlier today) there aren't HTML entities for
all characters.Can we just exclude the release notes from this check?
What is the point of a check we can only enforce against part of the
documentation?
If people are uncomfortable with a hard requirement, we can convert the
Latin-1 we have now to HTML entities, and then just give people a
command in README.non-ASCII to check for UTF8 if they wish. Patch
attached.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Attachments:
latin1.difftext/x-diff; charset=utf-8Download
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 65ed32cd0ab..ad5796819b9 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -143,11 +143,12 @@ postgres.txt: postgres.html
## Print
##
-postgres.pdf:
+postgres.pdf pdf:
$(error Invalid target; use postgres-A4.pdf or postgres-US.pdf as targets)
XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
+# XSL Formatting Objects (FO), https://en.wikipedia.org/wiki/XSL_Formatting_Objects
%-A4.fo: stylesheet-fo.xsl %-full.xml
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type A4 -o $@ $^
diff --git a/doc/src/sgml/README.non-ASCII b/doc/src/sgml/README.non-ASCII
new file mode 100644
index 00000000000..9c21e02e8f2
--- /dev/null
+++ b/doc/src/sgml/README.non-ASCII
@@ -0,0 +1,37 @@
+<!-- doc/src/sgml/README.non-ASCII -->
+
+Representation of non-ASCII characters
+--------------------------------------
+
+Find non-ASCII characters using:
+
+ grep --recursive --color='auto' -P '[\x80-\xFF]' .
+
+Convert to HTML4 named entity (&) escapes
+-----------------------------------------
+
+We support several output formats:
+
+* html (supports all Unicode characters)
+* man (supports all Unicode characters)
+* pdf (supports only Latin-1 characters)
+* info
+
+While some output formatting tools support all Unicode characters,
+others only support Latin-1 characters. Specifically, the PDF rendering
+engine can only display Latin-1 characters; non-Latin-1 Unicode
+characters are displayed as "###".
+
+Therefore, in the SGML files, we only use Latin-1 characters. We
+typically encode these characters as HTML entities, e.g., Álvaro.
+It is also possible to safely represent Latin-1 characters in UTF8
+encoding for all output formats.
+
+Do not use UTF numeric character escapes (&#nnn;).
+
+HTML entities
+ official: http://www.w3.org/TR/html4/sgml/entities.html
+ one page: http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
+ other lists: http://www.zipcon.net/~swhite/docs/computers/browsers/entities.html
+ http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
+ https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 1ef5322b912..f5e115e8d6e 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1225,7 +1225,7 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
<programlisting>
-- ignore differences in accents and case
CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
-SELECT 'Ã' = 'A' COLLATE ignore_accent_case; -- true
+SELECT 'Å' = 'A' COLLATE ignore_accent_case; -- true
SELECT 'z' = 'Z' COLLATE ignore_accent_case; -- true
-- upper case letters sort before lower case.
@@ -1282,7 +1282,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<entry><literal>'ab' = U&'a\2063b'</literal></entry>
<entry><literal>'x-y' = 'x_y'</literal></entry>
<entry><literal>'g' = 'G'</literal></entry>
- <entry><literal>'n' = 'ñ'</literal></entry>
+ <entry><literal>'n' = 'ñ'</literal></entry>
<entry><literal>'y' = 'z'</literal></entry>
</row>
</thead>
@@ -1346,7 +1346,7 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
<para>
At every level, even with full normalization off, basic normalization is
- performed. For example, <literal>'á'</literal> may be composed of the
+ performed. For example, <literal>'á'</literal> may be composed of the
code points <literal>U&'\0061\0301'</literal> or the single code
point <literal>U&'\00E1'</literal>, and those sequences will be
considered equal even at the <literal>identic</literal> level. To treat
@@ -1430,8 +1430,8 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
<entry><literal>false</literal></entry>
<entry>
Backwards comparison for the level 2 differences. For example,
- locale <literal>und-u-kb</literal> sorts <literal>'Ã e'</literal>
- before <literal>'aé'</literal>.
+ locale <literal>und-u-kb</literal> sorts <literal>'àe'</literal>
+ before <literal>'aé'</literal>.
</entry>
</row>
diff --git a/doc/src/sgml/images/genetic-algorithm.svg b/doc/src/sgml/images/genetic-algorithm.svg
index fb9fdd1ba78..2ce5f1b2712 100644
--- a/doc/src/sgml/images/genetic-algorithm.svg
+++ b/doc/src/sgml/images/genetic-algorithm.svg
@@ -72,7 +72,7 @@
<title>a4->end</title>
<path fill="none" stroke="#000000" d="M259,-312.5834C259,-312.5834 259,-54.659 259,-54.659"/>
<polygon fill="#000000" stroke="#000000" points="262.5001,-54.659 259,-44.659 255.5001,-54.6591 262.5001,-54.659"/>
-<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true  </text>
+<text text-anchor="middle" x="246" y="-186.6212" font-family="sans-serif" font-size="10.00" fill="#000000">true</text>
</g>
<!-- a5 -->
<g id="node7" class="node">
@@ -85,7 +85,7 @@
<title>a4->a5</title>
<path fill="none" stroke="#000000" d="M144,-298.269C144,-298.269 144,-286.5248 144,-286.5248"/>
<polygon fill="#000000" stroke="#000000" points="147.5001,-286.5248 144,-276.5248 140.5001,-286.5249 147.5001,-286.5248"/>
-<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false   </text>
+<text text-anchor="middle" x="127" y="-284.3969" font-family="sans-serif" font-size="10.00" fill="#000000">false</text>
</g>
<!-- a6 -->
<g id="node8" class="node">
diff --git a/doc/src/sgml/release.sgml b/doc/src/sgml/release.sgml
index 8433690dead..cee577ff8d3 100644
--- a/doc/src/sgml/release.sgml
+++ b/doc/src/sgml/release.sgml
@@ -16,24 +16,6 @@ pg_[A-Za-z0-9_]+ <application>, <structname>
\<[a-z]+_[a-z_]+\> <varname>, <structfield>
<systemitem class="osname">
-non-ASCII characters find using grep -P '[\x80-\xFF]' or
- (remove 'X') grep -X-color='auto' -P -n "[\x80-\xFF]"
- convert to HTML4 named entity (&) escapes
-
- official: http://www.w3.org/TR/html4/sgml/entities.html
- one page: http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
- other lists: http://www.zipcon.net/~swhite/docs/computers/browsers/entities.html
- http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
- https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
-
- We cannot use UTF8 because rendering engines have to
- support the referenced characters.
-
- Do not use numeric _UTF_ numeric character escapes (&#nnn;),
- we can only use Latin1.
-
- Example: Alvaro Herrera is Álvaro Herrera
-
wrap long lines
For new features, add links to the documentation sections.
diff --git a/doc/src/sgml/stylesheet-man.xsl b/doc/src/sgml/stylesheet-man.xsl
index fcb485c2931..2e2564da683 100644
--- a/doc/src/sgml/stylesheet-man.xsl
+++ b/doc/src/sgml/stylesheet-man.xsl
@@ -213,12 +213,12 @@
<!-- Slight rephrasing to indicate that missing sections are found
in the documentation. -->
<l:context name="xref-number-and-title">
- <l:template name="chapter" text="Chapter %n, %t, in the documentation"/>
- <l:template name="sect1" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect2" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect3" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect4" text="Section %n, â%tâ, in the documentation"/>
- <l:template name="sect5" text="Section %n, â%tâ, in the documentation"/>
+ <l:template name="chapter" text="Chapter %n, "%t", in the documentation"/>
+ <l:template name="sect1" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect2" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect3" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect4" text="Section %n, "%t", in the documentation"/>
+ <l:template name="sect5" text="Section %n, "%t", in the documentation"/>
</l:context>
</l:l10n>
</l:i18n>
On 15.10.24 23:51, Bruce Momjian wrote:
I don't see why we need to enforce this at this level. Whatever downstream
toolchain has requirements about which characters are allowed will complain
if it encounters a character it doesn't like.Uh, the PDF build does not complain if you pass it a non-Latin-1 UTF8
characters. To test this I added some Russian characters (non-Latin-1)
to release.sgml:(⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩,
⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩,
⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier
letters or "signs" (⟨ъ⟩, ⟨ь⟩)and I ran 'make postgres-US.pdf', and then removed the Russian
characters and ran the same command again. The output, including stderr
was identical. The PDFs, of course, were not, with the Russian
characters showing as "####". Makefile output attached.
Hmm, mine complains:
/opt/homebrew/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf
Picked up JAVA_TOOL_OPTIONS: -Djava.awt.headless=true
[WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting
with "Symbol,normal,400".
[WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found.
Substituting with "ZapfDingbats,normal,400".
[WARN] FOUserAgent - Glyph "⟨" (0x27e8) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "б" (0x431, afii10066) not available in font
"Times-Roman".
[WARN] FOUserAgent - Glyph "⟩" (0x27e9) not available in font "Times-Roman".
[WARN] FOUserAgent - Glyph "в" (0x432, afii10067) not available in font
"Times-Roman".
[WARN] FOUserAgent - Glyph "г" (0x433, afii10068) not available in font
"Times-Roman".
[WARN] FOUserAgent - Glyph "д" (0x434, afii10069) not available in font
"Times-Roman".
[WARN] FOUserAgent - Glyph "ж" (0x436, afii10072) not available in font
"Times-Roman".
[WARN] FOUserAgent - Glyph "з" (0x437, afii10073) not available in font
"Times-Roman".
[WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value
found on the parent FO.
On 15.10.24 23:51, Bruce Momjian wrote:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
if desired.That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.Uh, why can't we use HTML entities going forward? Is that harder?
I think the question should be the other way around. The entities are a
historical workaround for when encoding support and rendering support
was poor. Now you can just type in the characters you want as is, which
seems nicer.
On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
On 15.10.24 23:51, Bruce Momjian wrote:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
if desired.That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.Uh, why can't we use HTML entities going forward? Is that harder?
I think the question should be the other way around. The entities are a
historical workaround for when encoding support and rendering support was
poor. Now you can just type in the characters you want as is, which seems
nicer.
Yes, that does make sense, and if we fully supported Unicode, we could
ignore all of this.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Wed, Oct 16, 2024 at 09:58:23AM +0200, Peter Eisentraut wrote:
On 15.10.24 23:51, Bruce Momjian wrote:
I don't see why we need to enforce this at this level. Whatever downstream
toolchain has requirements about which characters are allowed will complain
if it encounters a character it doesn't like.Uh, the PDF build does not complain if you pass it a non-Latin-1 UTF8
characters. To test this I added some Russian characters (non-Latin-1)
to release.sgml:(⟨б⟩, ⟨в⟩, ⟨г⟩, ⟨д⟩, ⟨ж⟩, ⟨з⟩, ⟨к⟩, ⟨л⟩, ⟨м⟩, ⟨н⟩, ⟨п⟩, ⟨р⟩, ⟨с⟩, ⟨т⟩,
⟨ф⟩, ⟨х⟩, ⟨ц⟩, ⟨ч⟩, ⟨ш⟩, ⟨щ⟩), ten vowels (⟨а⟩, ⟨е⟩, ⟨ё⟩, ⟨и⟩, ⟨о⟩, ⟨у⟩,
⟨ы⟩, ⟨э⟩, ⟨ю⟩, ⟨я⟩), a semivowel / consonant (⟨й⟩), and two modifier
letters or "signs" (⟨ъ⟩, ⟨ь⟩)and I ran 'make postgres-US.pdf', and then removed the Russian
characters and ran the same command again. The output, including stderr
was identical. The PDFs, of course, were not, with the Russian
characters showing as "####". Makefile output attached.Hmm, mine complains:
My Debian 12 toolchain must be older.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
On 15.10.24 23:51, Bruce Momjian wrote:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
if desired.That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.Uh, why can't we use HTML entities going forward? Is that harder?
I think the question should be the other way around. The entities are a
historical workaround for when encoding support and rendering support was
poor. Now you can just type in the characters you want as is, which seems
nicer.Yes, that does make sense, and if we fully supported Unicode, we could
ignore all of this.
Patch applied to master --- no new UTF8 restrictions.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Hi Bruce,
On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
On 15.10.24 23:51, Bruce Momjian wrote:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
if desired.That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.Uh, why can't we use HTML entities going forward? Is that harder?
I think the question should be the other way around. The entities are a
historical workaround for when encoding support and rendering support was
poor. Now you can just type in the characters you want as is, which seems
nicer.Yes, that does make sense, and if we fully supported Unicode, we could
ignore all of this.Patch applied to master --- no new UTF8 restrictions.
I thought the conclusion of the discussion was to allow the use of LATIN1
(or UTF-8 encoded LATIN1) characters in SGML files without converting
them to HTML entities. Your patch seems to do the opposite.
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Sat, Nov 2, 2024 at 07:27:00AM +0900, Tatsuo Ishii wrote:
On Wed, Oct 16, 2024 at 09:54:57AM -0400, Bruce Momjian wrote:
On Wed, Oct 16, 2024 at 10:00:15AM +0200, Peter Eisentraut wrote:
On 15.10.24 23:51, Bruce Momjian wrote:
On Tue, Oct 15, 2024 at 05:27:49PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Well, we can only use Latin-1, so the idea is that we will be explicit
about specifying Latin-1 only as HTML entities, rather than letting
non-Latin-1 creep in as UTF8. We can exclude certain UTF8 or SGML files
if desired.That policy would cause substantial problems with contributor names
in the release notes. I agree with Peter that we don't need this.
Catching otherwise-invisible characters seems sufficient.Uh, why can't we use HTML entities going forward? Is that harder?
I think the question should be the other way around. The entities are a
historical workaround for when encoding support and rendering support was
poor. Now you can just type in the characters you want as is, which seems
nicer.Yes, that does make sense, and if we fully supported Unicode, we could
ignore all of this.Patch applied to master --- no new UTF8 restrictions.
I thought the conclusion of the discussion was allowing to use LATIN1
(or UTF-8 encoded LATIN1) characters in SGML files without converting
them to HTML entities. Your patch seems to do opposite.
Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
LATIN1 characters we had with HTML entities, so there are none
currently.
I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
so I added a cron job on my server to alert me when non-ASCII characters
appear.
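A minimal sketch of such a cron check (the checkout path and mail
recipient are hypothetical):
    #!/bin/sh
    # nightly-doc-check.sh: report any non-ASCII bytes in the SGML sources
    cd /path/to/postgresql/doc/src/sgml || exit 1
    git pull -q
    if grep -r -l -P '[\x80-\xFF]' . > /tmp/doc-non-ascii.txt
    then
        mail -s "non-ASCII characters in SGML docs" you@example.org < /tmp/doc-non-ascii.txt
    fi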
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
LATIN1 characters we had with HTML entities, so there are none
currently.I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
so I added a cron job on my server to alert me when non-ASCII characters
appear.
So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters in the SGML docs? If my
understanding is correct, it can also be achieved by using some tools
like:
iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:
iconv: illegal input sequence at position 175
An advantage of this is that we don't need to convert each LATIN1
character to HTML entities, which makes the sgml file authors' lives a
little bit easier.
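A minimal sketch of such a check, assuming GNU iconv and a POSIX shell
(the file globs are only illustrative):
    for f in doc/src/sgml/*.sgml doc/src/sgml/ref/*.sgml
    do
        # iconv exits non-zero when it hits a byte sequence with no Latin-1 mapping
        if ! iconv -f UTF-8 -t ISO-8859-1 "$f" >/dev/null 2>&1
        then
            echo "non-Latin-1 character in $f" 1>&2
            status=1
        fi
    done
    exit ${status:-0}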
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Sat, Nov 2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote:
Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
LATIN1 characters we had with HTML entities, so there are none
currently.I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
so I added a cron job on my server to alert me when non-ASCII characters
appear.So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters is in the SGML docs? If my
understanding is correct, it can be also achieved by using some tools
like:iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:iconv: illegal input sequence at position 175
An advantage of this is, we don't need to covert each LATIN1
characters to HTML entities and make the sgml file authors life a
little bit easier.
I might have misread the feedback. I know people didn't want a Makefile
rule to prevent it, but I thought converting the few UTF8 characters we had
was acceptable. Let me think some more and come up with a patch.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On 02.11.24 14:18, Bruce Momjian wrote:
On Sat, Nov 2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote:
Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
LATIN1 characters we had with HTML entities, so there are none
currently.I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs
so I added a cron job on my server to alert me when non-ASCII characters
appear.So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters is in the SGML docs? If my
understanding is correct, it can be also achieved by using some tools
like:iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:iconv: illegal input sequence at position 175
An advantage of this is, we don't need to covert each LATIN1
characters to HTML entities and make the sgml file authors life a
little bit easier.I might have misread the feedback. I know people didn't want a Makfile
rule to prevent it, but I though converting few UTF8's we had was
acceptable. Let me think some more and come up with a patch.
The question of encoding characters as entities is orthogonal to the
issue of only allowing Unicode characters that have a mapping to Latin
1. This patch seems to confuse these two issues, and I don't think it
actually fixed the second one, which is the one that was complained
about. I don't think anyone actually complained about the first one,
which is the one that was actually patched.
I think the iconv approach is an idea worth checking out.
It's also not necessarily true that the set of characters provided by
the built-in PDF fonts is exactly the set of characters in Latin 1. It
appears to be close enough, but I'm not sure, and I haven't found any
authoritative information on that. Another approach for a fix would be
to get FOP to produce the required warnings or errors more reliably. I
know it has a bunch of logging settings (ultimately via log4j), so there
might be some possibilities.
On Tue, 5 Nov 2024 10:08:17 +0100
Peter Eisentraut <peter@eisentraut.org> wrote:
So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters is in the SGML docs? If my
understanding is correct, it can be also achieved by using some tools
like:iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:iconv: illegal input sequence at position 175
An advantage of this is, we don't need to covert each LATIN1
characters to HTML entities and make the sgml file authors life a
little bit easier.
I think the iconv approach is an idea worth checking out.
It's also not necessarily true that the set of characters provided by
the built-in PDF fonts is exactly the set of characters in Latin 1. It
appears to be close enough, but I'm not sure, and I haven't found any
authoritative information on that.
I found a description in the FAQ on Apache FOP [1] that explains that some glyphs for
the Latin1 character set are not contained in the standard text fonts.
The standard text fonts supplied with Acrobat Reader have mostly glyphs for
characters from the ISO Latin 1 character set. For a variety of reasons, even
those are not completely guaranteed to work, for example you can't use the fi
ligature from the standard serif font.
[1]: https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
However, it seems that using iconv to detect non-Latin1 characters may still be
useful because such characters are likely not displayed in PDF. For example, we can do this
in make check, as in the attached patch 0002. It cannot show the filename where one
is found, though.
Another approach for a fix would be
to get FOP produce the required warnings or errors more reliably. I
know it has a bunch of logging settings (ultimately via log4j), so there
might be some possibilities.
When a character that cannot be displayed in PDF is found, a warning
"Glyph ... not available in font ..." is output in fop's log. We can
prevent such characters from being included in the PDF by checking for
that message, as in the attached patch 0001. However, this is checked after
the PDF is generated, since I could not find a way to terminate the
generation immediately when such a character is detected.
Regards,
Yugo Nagata
--
Yugo Nagata <nagata@sraoss.co.jp>
Attachments:
0002-Check-non-latin1-characters-in-make-check.patchtext/x-diff; name=0002-Check-non-latin1-characters-in-make-check.patchDownload
From b6bed0089fa510480dc410969ecff42a55ea7442 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:45:18 +0900
Subject: [PATCH 2/2] Check non-latin1 characters in make check
---
doc/src/sgml/Makefile | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index edc3725e5a..39822082c8 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -157,10 +157,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
%.pdf: %.fo $(ALL_IMAGES)
$(FOP) -fo $< -pdf $@ 2>&1 | \
- awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+ awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
(echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
-
##
## EPUB
##
@@ -197,7 +196,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp
+check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp check-non-latin1
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -270,6 +269,11 @@ check-nbsp:
$(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
(echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+# Non-Latin1 characters cannot be displayed in PDF.
+check-non-latin1:
+ @ (iconv -t ISO-8859-1 -f UTF-8 $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) >/dev/null 2>&1) || \
+ (echo "Non-Latin1 characters appear in SGML/XML files" 1>&2; exit 1)
+
##
## Clean
##
--
2.34.1
0001-Disallow-characters-that-cannot-be-displayed-in-PDF.patchtext/x-diff; name=0001-Disallow-characters-that-cannot-be-displayed-in-PDF.patchDownload
From 7e6a612c15bf65169e31906371218cdf13fcacdb Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:22:02 +0900
Subject: [PATCH 1/2] Disallow characters that cannot be displayed in PDF
---
doc/src/sgml/Makefile | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..edc3725e5a 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+ (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
##
--
2.34.1
On Mon, Nov 11, 2024 at 10:02:15PM +0900, Yugo Nagata wrote:
On Tue, 5 Nov 2024 10:08:17 +0100
Peter Eisentraut <peter@eisentraut.org> wrote:So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters is in the SGML docs? If my
understanding is correct, it can be also achieved by using some tools
like:iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:iconv: illegal input sequence at position 175
An advantage of this is, we don't need to covert each LATIN1
characters to HTML entities and make the sgml file authors life a
little bit easier.I think the iconv approach is an idea worth checking out.
It's also not necessarily true that the set of characters provided by
the built-in PDF fonts is exactly the set of characters in Latin 1. It
appears to be close enough, but I'm not sure, and I haven't found any
authoritative information on that.I found a description in FAQ on Apache FOP [1] that explains some glyphs for
Latin1 character set are not contained in the standard text fonts.The standard text fonts supplied with Acrobat Reader have mostly glyphs for
characters from the ISO Latin 1 character set. For a variety of reasons, even
those are not completely guaranteed to work, for example you can't use the fi
ligature from the standard serif font.
So, the failure of ligatures is usually caused by not using the right
Adobe Font Metric (AFM) file, I think. I have seen faulty ligature
rendering in PDFs but was always able to fix it by using the right AFM
file. Odds are, the failure is caused by using a standard Latin1 AFM file
and not the AFM file that matches the font being used.
[1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
However, it seems that using iconv to detect non-Latin1 characters may be still
useful because these are likely not displayed in PDF. For example, we can do this
in make check as the attached patch 0002. It cannot show the filname where one
is found, though.
I was thinking something like:
grep -l --recursive -P '[\x80-\xFF]' . |
while read FILE
do iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
done
This only checks files with non-ASCII characters.
Another approach for a fix would be
to get FOP produce the required warnings or errors more reliably. I
know it has a bunch of logging settings (ultimately via log4j), so there
might be some possibilities.When a character that cannot be displayed in PDF is found, a warning
"Glyph ... not available in font ...." is output in fop's log. We can
prevent such characters from being contained in PDF by checking
the message as the attached patch 0001. However, this is checked after
the pdf is generated since I could not have an idea how to terminate the
generation immediately when such character is detected.
So, are we sure this will be the message even for non-English users? I
thought checking for warning message text was too fragile.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Mon, 18 Nov 2024 16:04:20 -0500
Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Nov 11, 2024 at 10:02:15PM +0900, Yugo Nagata wrote:
On Tue, 5 Nov 2024 10:08:17 +0100
Peter Eisentraut <peter@eisentraut.org> wrote:So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters is in the SGML docs? If my
understanding is correct, it can be also achieved by using some tools
like:iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:iconv: illegal input sequence at position 175
An advantage of this is, we don't need to covert each LATIN1
characters to HTML entities and make the sgml file authors life a
little bit easier.I think the iconv approach is an idea worth checking out.
It's also not necessarily true that the set of characters provided by
the built-in PDF fonts is exactly the set of characters in Latin 1. It
appears to be close enough, but I'm not sure, and I haven't found any
authoritative information on that.I found a description in FAQ on Apache FOP [1] that explains some glyphs for
Latin1 character set are not contained in the standard text fonts.The standard text fonts supplied with Acrobat Reader have mostly glyphs for
characters from the ISO Latin 1 character set. For a variety of reasons, even
those are not completely guaranteed to work, for example you can't use the fi
ligature from the standard serif font.So, the failure of ligatures is caused usually by not using the right
Adobe Font Metric (AFM) file, I think. I have seen faulty ligature
rendering in PDFs but was alway able to fix it by using the right AFM
file. Odds are, failure is caused by using a standard Latin1 AFM file
and not the AFM file that matches the font being used.[1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
However, it seems that using iconv to detect non-Latin1 characters may be still
useful because these are likely not displayed in PDF. For example, we can do this
in make check as the attached patch 0002. It cannot show the filname where one
is found, though.I was thinking something like:
grep -l --recursive -P '[\x80-\xFF]' . |
while read FILE
do iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
doneThis only checks files with non-ASCII characters.
Checking for non-latin1 after non-ASCII characters seems like a good idea.
I attached an updated patch (0002) that uses perl instead of grep
because non-GNU grep does not support hex escape sequences.
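For example, a standalone form of the same check (the globs are only
illustrative) works with any perl, regardless of which grep is installed:
    perl -ne 'print "$ARGV:$_" if /[\x80-\xFF]/' doc/src/sgml/*.sgml doc/src/sgml/ref/*.sgml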
Another approach for a fix would be
to get FOP produce the required warnings or errors more reliably. I
know it has a bunch of logging settings (ultimately via log4j), so there
might be some possibilities.When a character that cannot be displayed in PDF is found, a warning
"Glyph ... not available in font ...." is output in fop's log. We can
prevent such characters from being contained in PDF by checking
the message as the attached patch 0001. However, this is checked after
the pdf is generated since I could not have an idea how to terminate the
generation immediately when such character is detected.So, are we sure this will be the message even for non-English users? I
thought checking for warning message text was too fragile.
I am not sure whether fop has messages in languages other than English,
although I've never seen Japanese messages in its output.
I wonder if we can get consistent results when it is executed with LANG=C.
The updated patch 0001 takes this approach.
Regards,
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
v2-0002-Check-non-latin1-characters-in-make-check.patchtext/x-diff; name=v2-0002-Check-non-latin1-characters-in-make-check.patchDownload
From d73024303b4bbac3d6a7e861f7b3b91b0541a5ba Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:45:18 +0900
Subject: [PATCH v2 2/2] Check non-latin1 characters in make check
---
doc/src/sgml/Makefile | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 18bf87d031..55dd2da299 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -36,6 +36,10 @@ ifndef FOP
FOP = $(missing) fop
endif
+ifndef ICONV
+ICONV = $(missing) iconv
+endif
+
PANDOC = pandoc
XMLINCLUDE = --path . --path $(srcdir)
@@ -160,7 +164,6 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
(echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
-
##
## EPUB
##
@@ -197,7 +200,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp
+check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp check-non-latin1
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -270,6 +273,12 @@ check-nbsp:
$(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
(echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+# Non-Latin1 characters cannot be displayed in PDF.
+check-non-latin1:
+ @ ( $(PERL) -ne '/[\x80-\xFF]/ and `${ICONV} -t ISO-8859-1 -f UTF-8 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+ $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
+ (echo "Non-Latin1 characters appear in SGML/XML files" 1>&2; exit 1)
+
##
## Clean
##
--
2.34.1
v2-0001-Disallow-characters-that-cannot-be-displayed-in-P.patchtext/x-diff; name=v2-0001-Disallow-characters-that-cannot-be-displayed-in-P.patchDownload
From 3abf606f693776410dd667bd59b0d33b9b6a75f3 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:22:02 +0900
Subject: [PATCH v2 1/2] Disallow characters that cannot be displayed in PDF
---
doc/src/sgml/Makefile | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..18bf87d031 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+ (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
##
--
2.34.1
On Tue, Nov 19, 2024 at 11:29:07AM +0900, Yugo NAGATA wrote:
On Mon, 18 Nov 2024 16:04:20 -0500
So, the failure of ligatures is usually caused by not using the right
Adobe Font Metric (AFM) file, I think. I have seen faulty ligature
rendering in PDFs but was always able to fix it by using the right AFM
file. Odds are, the failure is caused by using a standard Latin1 AFM file
and not the AFM file that matches the font being used. [1]
[1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
However, it seems that using iconv to detect non-Latin1 characters may still be
useful because these are likely not displayed in PDF. For example, we can do this
in make check as in the attached patch 0002. It cannot show the filename where one
is found, though.
I was thinking something like:
grep -l --recursive -P '[\x80-\xFF]' . |
while read FILE
do iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
done
This only checks files with non-ASCII characters.
Checking non-Latin1 after non-ASCII characters seems like a good idea.
I attached an updated patch (0002) that uses perl instead of grep
because non-GNU grep may not support hex escape sequences.
Yes, good point.
So, are we sure this will be the message even for non-English users? I
thought checking for warning message text was too fragile.
I am not sure whether fop has messages in non-English, although I've never
seen Japanese messages output.
I wonder if we can get unified results if it is executed with LANG=C.
The updated patch 0001 is revised in this direction.
Yes, good idea.
+ @ ( $(PERL) -ne '/[\x80-\xFF]/ and `${ICONV} -t ISO-8859-1 -f UTF-8 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
I am thinking we should have -f before -t because it is from/to.
I like this approach.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Mon, 18 Nov 2024 22:07:40 -0500
Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Nov 19, 2024 at 11:29:07AM +0900, Yugo NAGATA wrote:
On Mon, 18 Nov 2024 16:04:20 -0500
So, the failure of ligatures is usually caused by not using the right
Adobe Font Metric (AFM) file, I think. I have seen faulty ligature
rendering in PDFs but was always able to fix it by using the right AFM
file. Odds are, the failure is caused by using a standard Latin1 AFM file
and not the AFM file that matches the font being used. [1]
[1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
However, it seems that using iconv to detect non-Latin1 characters may still be
useful because these are likely not displayed in PDF. For example, we can do this
in make check as in the attached patch 0002. It cannot show the filename where one
is found, though.
I was thinking something like:
grep -l --recursive -P '[\x80-\xFF]' . |
while read FILE
do iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
done
This only checks files with non-ASCII characters.
Checking non-Latin1 after non-ASCII characters seems like a good idea.
I attached an updated patch (0002) that uses perl instead of grep
because non-GNU grep may not support hex escape sequences.
Yes, good point.
So, are we sure this will be the message even for non-English users? I
thought checking for warning message text was too fragile.
I am not sure whether fop has messages in non-English, although I've never
seen Japanese messages output.
I wonder if we can get unified results if it is executed with LANG=C.
The updated patch 0001 is revised in this direction.
Yes, good idea.
+ @ ( $(PERL) -ne '/[\x80-\xFF]/ and `${ICONV} -t ISO-8859-1 -f UTF-8 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
I am thinking we should have -f before -t because it is from/to.
I've updated the patch 0002 to move -f before -t.
Also, I added a new patch 0003 that updates the configure script to check
whether iconv exists. When it does not exist, the message
"ERROR: `iconv' is missing on your system." will be raised.
However, this change may be unnecessary since iconv is a POSIX standard
and most UNIX-like systems would have it.
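For reference, with these patches applied the new rule would run as part of the
existing documentation checks, for example:

cd doc/src/sgml
make check              # runs check-tabs, check-nbsp and the new check-non-latin1
make check-non-latin1   # or run only the new rule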
Regards,
Yugo Nagata
--
Yugo NAGATA <nagata@sraoss.co.jp>
Attachments:
v3-0003-Check-whether-iconv-exists-for-detecting-non-lati.patchtext/x-diff; name=v3-0003-Check-whether-iconv-exists-for-detecting-non-lati.patchDownload
From 93adc51c0135d274cea75f2de2b328480c72a94c Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Tue, 19 Nov 2024 19:19:14 +0900
Subject: [PATCH v3 3/3] Check whether iconv exists for detecting non-latin1
characters
---
configure | 65 ++++++++++++++++++++++++++++++++++++++----
configure.ac | 1 +
doc/src/sgml/Makefile | 6 +++-
src/Makefile.global.in | 1 +
4 files changed, 67 insertions(+), 6 deletions(-)
diff --git a/configure b/configure
index f58eae1baa..eaf02c5660 100755
--- a/configure
+++ b/configure
@@ -632,6 +632,7 @@ PG_VERSION_NUM
LDFLAGS_EX_BE
PROVE
DBTOEPUB
+ICONV
FOP
XSLTPROC
XMLLINT
@@ -14728,7 +14729,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -14774,7 +14775,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -14798,7 +14799,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -14843,7 +14844,7 @@ else
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -14867,7 +14868,7 @@ rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
We can't simply define LARGE_OFF_T to be 9223372036854775807,
since some C++ compilers masquerading as C compilers
incorrectly reject 9223372036854775807. */
-#define LARGE_OFF_T (((off_t) 1 << 62) - 1 + ((off_t) 1 << 62))
+#define LARGE_OFF_T ((((off_t) 1 << 31) << 31) - 1 + (((off_t) 1 << 31) << 31))
int off_t_is_large[(LARGE_OFF_T % 2147483629 == 721
&& LARGE_OFF_T % 2147483647 == 1)
? 1 : -1];
@@ -18535,6 +18536,60 @@ $as_echo_n "checking for FOP... " >&6; }
$as_echo "$FOP" >&6; }
fi
+if test -z "$ICONV"; then
+ for ac_prog in iconv
+do
+ # Extract the first word of "$ac_prog", so it can be a program name with args.
+set dummy $ac_prog; ac_word=$2
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for $ac_word" >&5
+$as_echo_n "checking for $ac_word... " >&6; }
+if ${ac_cv_path_ICONV+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ case $ICONV in
+ [\\/]* | ?:[\\/]*)
+ ac_cv_path_ICONV="$ICONV" # Let the user override the test with a path.
+ ;;
+ *)
+ as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH
+do
+ IFS=$as_save_IFS
+ test -z "$as_dir" && as_dir=.
+ for ac_exec_ext in '' $ac_executable_extensions; do
+ if as_fn_executable_p "$as_dir/$ac_word$ac_exec_ext"; then
+ ac_cv_path_ICONV="$as_dir/$ac_word$ac_exec_ext"
+ $as_echo "$as_me:${as_lineno-$LINENO}: found $as_dir/$ac_word$ac_exec_ext" >&5
+ break 2
+ fi
+done
+ done
+IFS=$as_save_IFS
+
+ ;;
+esac
+fi
+ICONV=$ac_cv_path_ICONV
+if test -n "$ICONV"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICONV" >&5
+$as_echo "$ICONV" >&6; }
+else
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+fi
+
+
+ test -n "$ICONV" && break
+done
+
+else
+ # Report the value of ICONV in configure's output in all cases.
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ICONV" >&5
+$as_echo_n "checking for ICONV... " >&6; }
+ { $as_echo "$as_me:${as_lineno-$LINENO}: result: $ICONV" >&5
+$as_echo "$ICONV" >&6; }
+fi
+
if test -z "$DBTOEPUB"; then
for ac_prog in dbtoepub
do
diff --git a/configure.ac b/configure.ac
index 82c5009e3e..1196f857cf 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2321,6 +2321,7 @@ fi
PGAC_PATH_PROGS(XMLLINT, xmllint)
PGAC_PATH_PROGS(XSLTPROC, xsltproc)
PGAC_PATH_PROGS(FOP, fop)
+PGAC_PATH_PROGS(ICONV, iconv)
PGAC_PATH_PROGS(DBTOEPUB, dbtoepub)
#
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 820ae7c456..416dfc6c89 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -36,6 +36,10 @@ ifndef FOP
FOP = $(missing) fop
endif
+ifndef ICONV
+ICONV = $(missing) iconv
+endif
+
PANDOC = pandoc
XMLINCLUDE = --path . --path $(srcdir)
@@ -271,7 +275,7 @@ check-nbsp:
# Non-Latin1 characters cannot be displayed in PDF.
check-non-latin1:
- @ ( $(PERL) -ne '/[\x80-\xFF]/ and `iconv -f UTF-8 -t ISO-8859-1 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+ @ ( $(PERL) -ne '/[\x80-\xFF]/ and `LANG=C ${ICONV} -f UTF-8 -t ISO-8859-1 "$$ARGV"` and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
$(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
(echo "Non-Latin1 characters appear in SGML/XML files" 1>&2; exit 1)
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 0f38d712d1..f3bd700664 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -517,6 +517,7 @@ STRIP_SHARED_LIB = @STRIP_SHARED_LIB@
DBTOEPUB = @DBTOEPUB@
FOP = @FOP@
+ICONV = @ICONV@
XMLLINT = @XMLLINT@
XSLTPROC = @XSLTPROC@
--
2.34.1
v3-0002-Check-non-latin1-characters-in-make-check.patchtext/x-diff; name=v3-0002-Check-non-latin1-characters-in-make-check.patchDownload
From d07e2646a0a27852e169686fcce6c5647840abf3 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:45:18 +0900
Subject: [PATCH v3 2/3] Check non-latin1 characters in make check
---
doc/src/sgml/Makefile | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 18bf87d031..820ae7c456 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -160,7 +160,6 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
(echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
-
##
## EPUB
##
@@ -197,7 +196,7 @@ MAKEINFO = makeinfo
##
# Quick syntax check without style processing
-check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp
+check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp check-non-latin1
$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
@@ -270,6 +269,12 @@ check-nbsp:
$(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
(echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
+# Non-Latin1 characters cannot be displayed in PDF.
+check-non-latin1:
+ @ ( $(PERL) -ne '/[\x80-\xFF]/ and `iconv -f UTF-8 -t ISO-8859-1 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+ $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
+ (echo "Non-Latin1 characters appear in SGML/XML files" 1>&2; exit 1)
+
##
## Clean
##
--
2.34.1
v3-0001-Disallow-characters-that-cannot-be-displayed-in-P.patchtext/x-diff; name=v3-0001-Disallow-characters-that-cannot-be-displayed-in-P.patchDownload
From 3abf606f693776410dd667bd59b0d33b9b6a75f3 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nagata@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:22:02 +0900
Subject: [PATCH v3 1/3] Disallow characters that cannot be displayed in PDF
---
doc/src/sgml/Makefile | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..18bf87d031 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+ (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
##
--
2.34.1
I have looked into the patches.
Subject: [PATCH v3 1/3] Disallow characters that cannot be displayed in PDF
---
doc/src/sgml/Makefile | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..18bf87d031 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
Shouldn't "CLANG" be "LANG"?
+ awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+ (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
Currently "make postgres*.pdf" generates the pdf file even if there's
a "not available in font" error while generating it. With the patch
the pdf file is removed in this case. I'm not sure if this is an
improvement because there's no way to generate such a pdf file if
there's such a warning. Printing "Found characters that cannot be
displayed in PDF" is good, but I'd prefer let users decide whether
they retain or remove the pdf file.
Subject: [PATCH v3 3/3] Check whether iconv exists for detecting non-latin1
characters
---
configure | 65 ++++++++++++++++++++++++++++++++++++++----
configure.ac | 1 +
doc/src/sgml/Makefile | 6 +++-
src/Makefile.global.in | 1 +
You don't need to include the patch for configure. The committer will
generate configure when it gets committed. See the discussion:
/messages/by-id/20241126.102906.1020285543012274306.ishii@postgresql.org
Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
On Tue, Nov 26, 2024 at 06:25:13PM +0900, Tatsuo Ishii wrote:
I have looked into the patches.
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
Shouldn't "CLANG" be "LANG"?
Yes, probably.
+ awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+ (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
Currently "make postgres*.pdf" generates the pdf file even if there's
a "not available in font" error while generating it. With the patch
the pdf file is removed in this case. I'm not sure if this is an
improvement because there's no way to generate such a pdf file if
there's such a warning. Printing "Found characters that cannot be
displayed in PDF" is good, but I'd prefer let users decide whether
they retain or remove the pdf file.
Looking at the patch:
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ CLANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+ (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
it returns an error if it sees a "not available in font" warning, and
since src/Makefile.global has .DELETE_ON_ERROR, and this is included in
doc/src/sgml/Makefile, the file is deleted on the awk 'exit' error.
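A minimal standalone illustration of that .DELETE_ON_ERROR behavior, with
made-up target names:

.DELETE_ON_ERROR:
broken.pdf:
	echo "partial output" > $@
	exit 1

Running "make broken.pdf" writes the file but then removes it again because the
recipe exited non-zero, which is what happens here when awk exits with the
error status.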
If there are invalid characters in the PDF, shouldn't the PDF be
considered invalid and removed from the build? To allow such builds to
keep those PDF files, we would probably need to override
.DELETE_ON_ERROR, but it would have to be done in a way that an error
exit from FOP would still remove the PDF file. I think we would have to
have FOP write to a temporary file, and then override
.DELETE_ON_ERROR just for the check for the "not available in
font" text in the temporary file.
Do we want to add this complexity?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes:
Do we want to add this complexity?
I don't think this patch is doing anything I want at all.
regards, tom lane
On Tue, Nov 26, 2024 at 11:43:02AM -0500, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Do we want to add this complexity?
I don't think this patch is doing anything I want at all.
Gee, I kind of liked the patch, but maybe you didn't like the additional
complexity to check the PDF output twice, once on input (complex) and
once on output. The attached patch only does the output check.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Attachments:
latin1.difftext/x-diff; charset=us-asciiDownload
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b536..feba0698605 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN { err = 0 } { print } /not available in font/ { err = 1 } END { exit err }' 1>&2 || \
+ (echo "Found characters that cannot be displayed in the PDF document" 1>&2; exit 1)
##
Bruce Momjian <bruce@momjian.us> writes:
On Tue, Nov 26, 2024 at 11:43:02AM -0500, Tom Lane wrote:
I don't think this patch is doing anything I want at all.
Gee, I kind of liked the patch, but maybe you didn't like the additional
complexity to check the PDF output twice, once on input (complex) and
once on output. The attached patch only does the output check.
It's still not doing anything I want at all. I'm with Tatsuo
on this: I do not want the makefiles deciding for me which
warnings are acceptable.
regards, tom lane
On Tue, Nov 26, 2024 at 12:41:37PM -0500, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
On Tue, Nov 26, 2024 at 11:43:02AM -0500, Tom Lane wrote:
I don't think this patch is doing anything I want at all.
Gee, I kind of liked the patch, but maybe you didn't like the additional
complexity to check the PDF output twice, once on input (complex) and
once on output. The attached patch only does the output check.
It's still not doing anything I want at all. I'm with Tatsuo
on this: I do not want the makefiles deciding for me which
warnings are acceptable.
Okay, how about the attached patch that just prints the message at the
bottom, with no error. We could do this for all warnings, but I think
there are some we expect.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Attachments:
latin1.difftext/x-diff; charset=us-asciiDownload
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b536..cffb06317f9 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN { warn = 0 } { print } /not available in font/ { warn = 1 } \
+ END { if (warn != 0) print("\nFound characters that cannot be displayed in the PDF document") }' 1>&2
##
On Tue, Nov 26, 2024 at 02:04:15PM -0500, Bruce Momjian wrote:
On Tue, Nov 26, 2024 at 12:41:37PM -0500, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
On Tue, Nov 26, 2024 at 11:43:02AM -0500, Tom Lane wrote:
I don't think this patch is doing anything I want at all.
Gee, I kind of liked the patch, but maybe you didn't like the additional
complexity to check the PDF output twice, once on input (complex) and
once on output. The attached patch only does the output check.
It's still not doing anything I want at all. I'm with Tatsuo
on this: I do not want the makefiles deciding for me which
warnings are acceptable.
Okay, how about the attached patch that just prints the message at the
bottom, with no error. We could do this for all warnings, but I think
there are some we expect.
Patch applied. I added a mention of README.non-ASCII.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On Tue, Nov 5, 2024 at 10:08:17AM +0100, Peter Eisentraut wrote:
On 02.11.24 14:18, Bruce Momjian wrote:
On Sat, Nov 2, 2024 at 12:02:12PM +0900, Tatsuo Ishii wrote:
Yes, we _allow_ LATIN1 characters in the SGML docs, but I replaced the
LATIN1 characters we had with HTML entities, so there are none
currently.
I think it is too easy for non-Latin1 UTF8 to creep into our SGML docs,
so I added a cron job on my server to alert me when non-ASCII characters
appear.
So you convert LATIN1 characters to HTML entities so that it's easier
to detect non-LATIN1 characters in the SGML docs? If my
understanding is correct, it can also be achieved by using some tools
like:
iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
If there are some non-LATIN1 characters in release-17.sgml,
it will complain like:
iconv: illegal input sequence at position 175
An advantage of this is that we don't need to convert each LATIN1
character to HTML entities, and it makes the sgml file authors' life a
little bit easier.
I might have misread the feedback. I know people didn't want a Makefile
rule to prevent it, but I thought converting the few UTF8 characters we had was
acceptable. Let me think some more and come up with a patch.
The question of encoding characters as entities is orthogonal to the issue
of only allowing Unicode characters that have a mapping to Latin 1. This
patch seems to confuse these two issues, and I don't think it actually fixed
the second one, which is the one that was complained about. I don't think
anyone actually complained about the first one, which is the one that was
actually patched.
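To illustrate the distinction with a made-up example: "é" can be written in the
SGML either as a raw UTF-8 character or as the entity &eacute;, and either way
it maps to Latin 1, so the PDF fonts can show it; a character such as "漢" has
no Latin-1 mapping, so the PDF build will warn about it whether it is written
raw or as a numeric entity like &#x6F22;.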
Now that we have a warning about non-emittable characters in the PDF
build, do you want me to put back the Latin1 characters in the SGML
files or leave them as HTML entities?
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
Bruce Momjian <bruce@momjian.us> writes:
Now that we have a warning about non-emittable characters in the PDF
build, do you want me to put back the Latin1 characters in the SGML
files or leave them as HTML entities?
I think going forward we're going to be putting in people's names
in UTF8 --- I was certainly planning to start doing that. It doesn't
matter that much what we do with existing cases, though.
regards, tom lane
On Mon, Dec 2, 2024 at 09:33:39PM -0500, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Now that we have a warning about non-emittable characters in the PDF
build, do you want me to put back the Latin1 characters in the SGML
files or leave them as HTML entities?
I think going forward we're going to be putting in people's names
in UTF8 --- I was certainly planning to start doing that. It doesn't
Yes, I expected that, and added an item to my release checklist to make
a PDF file and check for the warning. I don't normally do that.
matter that much what we do with existing cases, though.
Okay, I think Peter had an opinion but I wasn't sure what it was.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"
On 03.12.24 04:13, Bruce Momjian wrote:
On Mon, Dec 2, 2024 at 09:33:39PM -0500, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Now that we have a warning about non-emittable characters in the PDF
build, do you want me to put back the Latin1 characters in the SGML
files or leave them as HTML entities?
I think going forward we're going to be putting in people's names
in UTF8 --- I was certainly planning to start doing that. It doesn't
Yes, I expected that, and added an item to my release checklist to make
a PDF file and check for the warning. I don't normally do that.
matter that much what we do with existing cases, though.
Okay, I think Peter had an opinion but I wasn't sure what it was.
I would prefer that the parts of commit 641a5b7a144 that replace
non-ASCII characters with entities are reverted.
On 26.11.24 20:04, Bruce Momjian wrote:
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN { warn = 0 } { print } /not available in font/ { warn = 1 } \
+ END { if (warn != 0) print("\nFound characters that cannot be displayed in the PDF document") }' 1>&2
Wouldn't that lose the exit code from the fop execution?
On Tue, Dec 3, 2024 at 09:05:45PM +0100, Peter Eisentraut wrote:
On 26.11.24 20:04, Bruce Momjian wrote:
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN { warn = 0 } { print } /not available in font/ { warn = 1 } \
+ END { if (warn != 0) print("\nFound characters that cannot be displayed in the PDF document") }' 1>&2
Wouldn't that lose the exit code from the fop execution?
Yikes, I think it would. Let me work on a fix now.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
On Tue, Dec 3, 2024 at 09:03:37PM +0100, Peter Eisentraut wrote:
On 03.12.24 04:13, Bruce Momjian wrote:
On Mon, Dec 2, 2024 at 09:33:39PM -0500, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Now that we have a warning about non-emittable characters in the PDF
build, do you want me to put back the Latin1 characters in the SGML
files or leave them as HTML entities?
I think going forward we're going to be putting in people's names
in UTF8 --- I was certainly planning to start doing that. It doesn't
Yes, I expected that, and added an item to my release checklist to make
a PDF file and check for the warning. I don't normally do that.
matter that much what we do with existing cases, though.
Okay, I think Peter had an opinion but I wasn't sure what it was.
I would prefer that the parts of commit 641a5b7a144 that replace non-ASCII
characters with entities are reverted.
Done.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
On Tue, Dec 3, 2024 at 03:58:20PM -0500, Bruce Momjian wrote:
On Tue, Dec 3, 2024 at 09:05:45PM +0100, Peter Eisentraut wrote:
On 26.11.24 20:04, Bruce Momjian wrote:
%.pdf: %.fo $(ALL_IMAGES)
- $(FOP) -fo $< -pdf $@
+ LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+ awk 'BEGIN { warn = 0 } { print } /not available in font/ { warn = 1 } \
+ END { if (warn != 0) print("\nFound characters that cannot be displayed in the PDF document") }' 1>&2
Wouldn't that lose the exit code from the fop execution?
Yikes, I think it would. Let me work on a fix now.
Fixed in the attached applied patch. Glad you saw this mistake.
--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
Attachments:
master.difftext/x-diff; charset=us-asciiDownload
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 4a08b6f433e..9d52715ff4b 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,9 +156,11 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
%.pdf: %.fo $(ALL_IMAGES)
- LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
- awk 'BEGIN { warn = 0 } { print } /not available in font/ { warn = 1 } \
- END { if (warn != 0) print("\nFound characters that cannot be output in the PDF document; see README.non-ASCII") }' 1>&2
+ @# There is no easy way to pipe output and capture its return code, so output a special string on failure.
+ { LANG=C $(FOP) -fo $< -pdf $@ 2>&1; [ "$$?" -ne 0 ] && echo "FOP_ERROR"; } | \
+ awk 'BEGIN { warn = 0 } ! /^FOP_ERROR$$/ { print } /not available in font/ { warn = 1 } \
+ END { if (warn != 0) print("\nFound characters that cannot be output in the PDF document; see README.non-ASCII"); \
+ if ($$0 ~ /^FOP_ERROR$$/) { exit 1} }' 1>&2
##
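To see why the FOP_ERROR sentinel is needed: a pipeline's exit status is that of
its last command, so without it a fop failure would be hidden behind the awk
stage. A stripped-down shell illustration (not part of the patch):

false | awk '{ print }'; echo "exit status: $?"
{ false; [ "$?" -ne 0 ] && echo "FOP_ERROR"; } | \
  awk '! /^FOP_ERROR$/ { print } END { if ($0 ~ /^FOP_ERROR$/) exit 1 }'; echo "exit status: $?"

The first line reports 0 even though "false" failed; the second reports 1
because the sentinel line makes awk exit non-zero, mirroring what the applied
rule does with the fop exit code.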