BUG #18057: unaccent removes intentional spaces

Started by PG Bug reporting formover 2 years ago10 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 18057
Logged by: Martin Schlossarek
Email address: martin@schlossarek.me
PostgreSQL version: 15.1
Operating system: Fedora 38
Description:

I discovered that the unaccent extension also removes intentional spaces
that are explicitly specified in the accent.rules. As far as I see it
correctly, all fraction characters are affected, for example:

```sql
# select unaccent('1½');
--- expected output: 1 1/2
--- actual output: 11/2
```

Affected characters:
```bash
$ curl -s
"https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=contrib/unaccent/unaccent.rules;hb=HEAD"
| grep -E " "
¼ 1/4
½ 1/2
¾ 3/4
⅐ 1/7
⅑ 1/9
⅒ 1/10
⅓ 1/3
⅔ 2/3
⅕ 1/5
⅖ 2/5
⅗ 3/5
⅘ 4/5
⅙ 1/6
⅚ 5/6
⅛ 1/8
⅜ 3/8
⅝ 5/8
⅞ 7/8
⅟ 1/
↉ 0/3
```

#2Michael Paquier
michael@paquier.xyz
In reply to: PG Bug reporting form (#1)
Re: BUG #18057: unaccent removes intentional spaces

On Tue, Aug 15, 2023 at 07:54:57PM +0000, PG Bug reporting form wrote:

I discovered that the unaccent extension also removes intentional spaces
that are explicitly specified in the accent.rules. As far as I see it
correctly, all fraction characters are affected, for example:

```sql
# select unaccent('1½');
--- expected output: 1 1/2
--- actual output: 11/2
```

Agreed that this looks incorrect as-is. This goes as far as 9a206d0
when these has been introduced, and it looks like the culprit is
around initTrie() where the entries are loaded. See around t_isspace,
for example.
--
Michael

#3Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#2)
Re: BUG #18057: unaccent removes intentional spaces

On Wed, Aug 16, 2023 at 09:00:43AM +0900, Michael Paquier wrote:

Agreed that this looks incorrect as-is. This goes as far as 9a206d0
when these has been introduced, and it looks like the culprit is
around initTrie() where the entries are loaded. See around t_isspace,
for example.

I was looking at the code, and my first impression was right. All
leading and trailing whitespaces between the two characters listed in
the rule file are discarded. The thing is that we clearly document
the parsing rules for the sake of any custom files one can feed to the
extension:
https://www.postgresql.org/docs/devel/unaccent.html

I am not sure what we can do here. Doing nothing is certainly an
option, but I am wondering if we could put in place an extra rule
where whitespaces can be part of the translated character if it uses
double quotes, for example. Thoughts?
--
Michael

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#3)
Re: BUG #18057: unaccent removes intentional spaces

Michael Paquier <michael@paquier.xyz> writes:

I was looking at the code, and my first impression was right. All
leading and trailing whitespaces between the two characters listed in
the rule file are discarded. The thing is that we clearly document
the parsing rules for the sake of any custom files one can feed to the
extension:
https://www.postgresql.org/docs/devel/unaccent.html

I am not sure what we can do here. Doing nothing is certainly an
option, but I am wondering if we could put in place an extra rule
where whitespaces can be part of the translated character if it uses
double quotes, for example. Thoughts?

Yeah, we could extend the parsing rules that way. It would break
any rules files that currently use double quote as a plain character,
but it seems unlikely that there are any. In any case, as long as
this wasn't back-patched I think such a change would be acceptable.

regards, tom lane

#5Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#4)
Re: BUG #18057: unaccent removes intentional spaces

On Sat, Aug 19, 2023 at 12:30:04PM -0400, Tom Lane wrote:

Yeah, we could extend the parsing rules that way. It would break
any rules files that currently use double quote as a plain character,
but it seems unlikely that there are any. In any case, as long as
this wasn't back-patched I think such a change would be acceptable.

Okay, thanks. Note that we do use double-quotes as a translated
character in a few cases, but as long as these are only handled as a
single character we could be OK. Or actually, wouldn't it be better
to always force escaping even for double quotes listed as single
characters? Based on what unaccent.rules has, that's not necessary,
but it could simplify the python code generating the file or the C
parsing. For example, the existing '”' would become listed as "\"" in
our unaccent.rules.
--
Michael

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#5)
Re: BUG #18057: unaccent removes intentional spaces

Michael Paquier <michael@paquier.xyz> writes:

On Sat, Aug 19, 2023 at 12:30:04PM -0400, Tom Lane wrote:

Yeah, we could extend the parsing rules that way. It would break
any rules files that currently use double quote as a plain character,
but it seems unlikely that there are any. In any case, as long as
this wasn't back-patched I think such a change would be acceptable.

Okay, thanks. Note that we do use double-quotes as a translated
character in a few cases, but as long as these are only handled as a
single character we could be OK.

Oh, shame on me for not actually looking into unaccent.rules :-(
That does make it more complicated, although I think it's still the
case that an incompatible change could be OK.

Or actually, wouldn't it be better
to always force escaping even for double quotes listed as single
characters?

I was envisioning following SQL identifier rules, that is you double
the double quote. Then the existing " entries would have to become
"""", which isn't much fun. But what you're suggesting would make
both " and \ into magic characters, doubling the chance of problems
with existing rules files.

We could perhaps allow standalone " to still be considered valid,
but that adds ambiguity that I think we'd eventually regret.
For example, what is the interpretation of

" "

Is that a (rather useless) translation of " to itself, or is it
a command to delete tab characters? So I think a clean break
would be better.

regards, tom lane

#7Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#6)
Re: BUG #18057: unaccent removes intentional spaces

On Sat, Aug 19, 2023 at 08:08:26PM -0400, Tom Lane wrote:

I was envisioning following SQL identifier rules, that is you double
the double quote. Then the existing " entries would have to become
"""", which isn't much fun. But what you're suggesting would make
both " and \ into magic characters, doubling the chance of problems
with existing rules files.

Apologies for the confusion. I was thinking to also escape \ in
quoted strings. Your suggestion to use a second double-quote for the
escaping is fine by me. """" feels a bit ugly-ish in the rules file,
for sure, but that does not look like a huge issue to me as long as
the python script generates consistent contents ;)
--
Michael

#8Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#7)
Re: BUG #18057: unaccent removes intentional spaces

On Sun, Aug 20, 2023 at 09:20:48AM +0900, Michael Paquier wrote:

Apologies for the confusion. I was thinking to also escape \ in
quoted strings. Your suggestion to use a second double-quote for the
escaping is fine by me. """" feels a bit ugly-ish in the rules file,
for sure, but that does not look like a huge issue to me as long as
the python script generates consistent contents ;)

Please find attached a patch to achieve that. This includes tweaks
for the python script to update unaccent.rules, docs about the rules
for the quotes and tests.

The patch also includes a custom rule file that I have used to stress
more the parsing logic, but I intend to remove it in the final version
of the patch (and it fails with meson). It would be possible to have
a TAP test that sneaks a custom rule file in the installation tree,
but that's not worth the extra cycles IMO.

What do you think?
--
Michael

Attachments:

0001-unaccent-Add-support-for-quoted-translated-character.patchtext/x-diff; charset=utf-8Download+216-38
#9Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#8)
Re: BUG #18057: unaccent removes intentional spaces

On Mon, Aug 21, 2023 at 04:14:20PM +0900, Michael Paquier wrote:

The patch also includes a custom rule file that I have used to stress
more the parsing logic, but I intend to remove it in the final version
of the patch (and it fails with meson). It would be possible to have
a TAP test that sneaks a custom rule file in the installation tree,
but that's not worth the extra cycles IMO.

So, it's been a couple of weeks here.

I have given this patch a second look and something that stood out is
that we'd still report "more than two strings in unaccent rule" as
warning when loading a rule where the target string is an unfinished
quoted area, which was kind of confusing.

The patch has been updated to improve that by making the error state a
bit smarter, and I have removed the previous tests with the custom
file and its fancy rules. Any comments and/or objections?
--
Michael

Attachments:

v2-0001-unaccent-Add-support-for-quoted-translated-charac.patchtext/x-diff; charset=utf-8Download+164-39
#10Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#9)
Re: BUG #18057: unaccent removes intentional spaces

On Tue, Sep 19, 2023 at 04:26:43PM +0900, Michael Paquier wrote:

The patch has been updated to improve that by making the error state a
bit smarter, and I have removed the previous tests with the custom
file and its fancy rules. Any comments and/or objections?

And applied as of 59f47fb98dab.
--
Michael