ts_parse reports different between MacOS, FreeBSD/Linux

Started by Mark Felder over 5 years ago | 2 messages | general
#1 Mark Felder
feld@FreeBSD.org

Hello,

We have an application whose test suite fails on MacOS when running the search tests on Unicode characters.

I've narrowed it down to the following:

macos=# select * from ts_parse('default','天');
 tokid | token
-------+-------
    12 | 天
(1 row)

freebsd=# select * from ts_parse('default','天');
 tokid | token
-------+-------
     2 | 天
(1 row)

This has been bugging me for a while, but it's a test our devs using MacOS just ignore for now, as we know it passes in our CI/CD pipeline on FreeBSD/Linux. It seems that anyone shipping an app on MacOS and bundling Postgres is going to have a bad time with searching.

Please let me know if there's anything I can do to help. Will gladly test patches.

Thanks,

--
Mark Felder
ports-secteam & portmgr alumni
feld@FreeBSD.org

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Felder (#1)
Re: ts_parse reports different between MacOS, FreeBSD/Linux

"Mark Felder" <feld@FreeBSD.org> writes:

> We have an application whose test suite fails on MacOS when running the search tests on unicode characters.

Yeah, known problem :-(. The text search parser relies on the C library's
locale data to classify characters as being letters, digits, etc.
Unfortunately, the UTF8 locales on macOS are just horribly bad, and
report many results that are different from other platforms.
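
To see the kind of classification involved, here is a minimal sketch that asks the C library directly how it classifies a character under a given locale. It is not PostgreSQL's parser code: the helper name classify_in_locale and the enum are hypothetical, and it assumes that wide characters hold Unicode code points in UTF-8 locales, which is typical on these platforms but not guaranteed by the C standard.

```c
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical result codes for this sketch (not PostgreSQL token ids). */
enum char_class
{
    CC_LOCALE_UNAVAILABLE,
    CC_LETTER,
    CC_DIGIT,
    CC_SPACE,
    CC_OTHER
};

/*
 * Hypothetical helper: switch LC_CTYPE to the named locale, then ask the
 * C library's wide-character classification functions about one character.
 * Assumes wc is a Unicode code point when the locale is a UTF-8 one.
 */
enum char_class
classify_in_locale(const char *locale_name, wint_t wc)
{
    /* setlocale returns NULL if the requested locale is not installed. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return CC_LOCALE_UNAVAILABLE;

    if (iswalpha(wc))
        return CC_LETTER;
    if (iswdigit(wc))
        return CC_DIGIT;
    if (iswspace(wc))
        return CC_SPACE;
    return CC_OTHER;
}
```

Calling classify_in_locale("en_US.UTF-8", 0x5929) (U+5929 is 天; the locale name is an assumption and may differ on your system) from a small main() on each platform and comparing the results is one way to check whether the libc classifications disagree for a given character.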

I suppose that Apple has got reasonable Unicode character knowledge
somewhere in their OS; they are just not very interested in making the
POSIX locale APIs work well. Which leaves us with a bit of a problem
for getting consistent results cross-platform.

regards, tom lane