EBCDIC sorting as a use case for ICU rules

Started by Daniel Veriteover 2 years ago12 messages
#1Daniel Verite
daniel@manitou-mail.org

Hi,

In the "Order changes in PG16 since ICU introduction" discussion, one
sub-thread [1]/messages/by-id/a28aba5fa6bf1abfff96e40b6d6acff8412edb15.camel@j-davis.com was about having a credible use case for tailoring collations
with custom rules, a new feature in v16.

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]https://en.wikipedia.org/wiki/EBCDIC. It turns out it has been already asked on
-general a few years ago [3]/messages/by-id/0A3221C70F24FB45833433255569204D1F84A7AD@G01JPEXMBYT05 with no satisfactory answer at the time ,
and that it can be implemented with rules in v16.

A collation like the following this seems to work (the rule simply enumerates
US-ASCII letters in the EBCDIC alphabet order, with adequate quoting)

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$&'
'<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'<a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<'~'<s<t<u<v<w<x<y<z<'['<'^'<']'<'{'<A<B<C<D<E<F<G<H<I<'}'<J<K<L<M<N<O<P<Q<R<'\'<S<T<U<V<W<X<Y<Z<0<1<2<3<4<5<6<7<8<9$$);

This can be useful for people who migrate from mainframes to Postgres
and need their migration tests to produce the same sorted results as the
original system.
Since rules can be defined at the database level with the icu_rules option,
they don't even need to tweak their queries to add COLLATE clauses,
which surely is appreciable in that kind of project.

US-ASCII when sorted in EBCDIC order comes out like this:

.<(+|&!$*);-/,%_>?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST
UVWXYZ0123456789

Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.

I'm open to suggestions on whether this EBCDIC example is worth being in the
doc in some form or putting this in the wiki would be good enough.

[1]: /messages/by-id/a28aba5fa6bf1abfff96e40b6d6acff8412edb15.camel@j-davis.com
/messages/by-id/a28aba5fa6bf1abfff96e40b6d6acff8412edb15.camel@j-davis.com

[2]: https://en.wikipedia.org/wiki/EBCDIC

[3]: /messages/by-id/0A3221C70F24FB45833433255569204D1F84A7AD@G01JPEXMBYT05
/messages/by-id/0A3221C70F24FB45833433255569204D1F84A7AD@G01JPEXMBYT05

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

#2Joe Conway
mail@joeconway.com
In reply to: Daniel Verite (#1)
Re: EBCDIC sorting as a use case for ICU rules

On 6/21/23 09:28, Daniel Verite wrote:

In the "Order changes in PG16 since ICU introduction" discussion, one
sub-thread [1] was about having a credible use case for tailoring collations
with custom rules, a new feature in v16.

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]. It turns out it has been already asked on
-general a few years ago [3] with no satisfactory answer at the time ,
and that it can be implemented with rules in v16.

Oh, very cool! I have seen the requirement for EBCDIC come up multiple
times over the years.

<snip>

Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.

I'm open to suggestions on whether this EBCDIC example is worth being in the
doc in some form or putting this in the wiki would be good enough.

I would definitely favor adding to the docs, but no idea how to deal
with the length issue.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#3Jeff Davis
pgsql@j-davis.com
In reply to: Daniel Verite (#1)
Re: EBCDIC sorting as a use case for ICU rules

On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote:

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]. It turns out it has been already  asked on
-general a few years ago [3] with no satisfactory answer at the time
,
and that it can be implemented with rules in v16.

Interesting, thank you!

This can be useful for people who migrate from mainframes to Postgres
and need their migration tests to produce the same sorted results as
the
original system.
Since rules can be defined at the database level with the icu_rules
option,
they don't even need to tweak their queries to add COLLATE clauses,
which surely is appreciable in that kind of project.

I still had some technical concerns about the ICU rules feature,
unfortunately, and one option is to only allow it for the collation
objects and not the database level collation. How much would that hurt
this use case?

I'm open to suggestions on whether this EBCDIC example is worth being
in the
doc in some form or putting this in the wiki would be good enough.

I like the idea of having a real example. Ideally, we could add some
explanation along the way about how the rule is constructed to match
EBCDIC, which would reduce the shock of a long rule like that.

I wonder why the rule syntax is such that it cannot be broken up? Would
it be incorrect for us to allow some whitespace in there?

Regards,
Jeff Davis

#4Jonathan S. Katz
jkatz@postgresql.org
In reply to: Jeff Davis (#3)
Re: EBCDIC sorting as a use case for ICU rules

On 6/21/23 12:14 PM, Jeff Davis wrote:

On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote:

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]. It turns out it has been already  asked on
-general a few years ago [3] with no satisfactory answer at the time
,
and that it can be implemented with rules in v16.

Interesting, thank you!

+1 -- this is very helpful framing the problem, thank you!

This can be useful for people who migrate from mainframes to Postgres
and need their migration tests to produce the same sorted results as
the
original system.
Since rules can be defined at the database level with the icu_rules
option,
they don't even need to tweak their queries to add COLLATE clauses,
which surely is appreciable in that kind of project.

I still had some technical concerns about the ICU rules feature,
unfortunately, and one option is to only allow it for the collation
objects and not the database level collation. How much would that hurt
this use case?

I'm open to suggestions on whether this EBCDIC example is worth being
in the
doc in some form or putting this in the wiki would be good enough.

I like the idea of having a real example. Ideally, we could add some
explanation along the way about how the rule is constructed to match
EBCDIC, which would reduce the shock of a long rule like that.

I wonder why the rule syntax is such that it cannot be broken up? Would
it be incorrect for us to allow some whitespace in there?

I'll give the unhelpful comment of "yes, I agree we should have a real
world example", especially one that seems relevant to helping more
people adopt PostgreSQL.

#5Daniel Verite
daniel@manitou-mail.org
In reply to: Jeff Davis (#3)
1 attachment(s)
Re: EBCDIC sorting as a use case for ICU rules

Jeff Davis wrote:

I still had some technical concerns about the ICU rules feature,
unfortunately, and one option is to only allow it for the collation
objects and not the database level collation. How much would that hurt
this use case?

For a regression test suite that should produce results with the custom
order, not being able to configure the sort rules at the db level means
that you'd have to change all the queries to add explicit COLLATE clauses.
I guess that could be quite annoying if the test suite is large.

About making a doc patch from this, I've came up with the attached,
which generates a CREATE COLLATION statement with rules from an
arbitrary strings that just lists characters in whichever order is desired.

In the case of EBCDIC and code page 37, it turns out that there are
several versions of "code page 37", with more or less additions of
characters outside the US-ASCII range. This is why I decided
to show code that generates the rules rather than an already generated
rule. Users may simply change the codepage_37 string in the code
to add or rearrange any characters.

Also the patch makes the relevant sections of "CREATE COLLATION" and
"CREATE DATABASE" point to "Collation Support" with the idea to
centralize the information on tailoring rules.

I'll add this to the next CF.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachments:

collation-icu-rules.patchtext/plainDownload
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index ed84465996..05d2a61cba 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1497,6 +1497,53 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
     </para>
    </sect3>
 
+   <sect3 id="icu-tailoring-rules">
+     <title>Tailoring rules</title>
+     <para>
+       The order of individual code points may be changed with tailoring rules, whose syntax is detailed at <ulink url="https://unicode-org.github.io/icu/userguide/collation/customization/"></ulink>.
+       As an example, the following code sets up a collation named <literal>ebcdic</literal> with rules to sort US-ASCII characters in the order of the <ulink url="https://en.wikipedia.org/wiki/EBCDIC">EBCDIC encoding.</ulink>
+<programlisting>
+<![CDATA[
+DO $body$
+DECLARE
+  -- list ASCII letters in the order of the EBCDIC encoding
+  codepage_37 constant text := ' .<(+|&!$*);-/,%_>?`:#@''="'
+    'abcdefghijklmnopqr~stuvwxyz^[]{ABCDEFGHI}JKLMNOPQR'
+    '\STUVWXYZ0123456789';
+  rules text;
+BEGIN
+  -- format the list as a valid set of rules
+  WITH list AS
+   (SELECT CASE WHEN ch<>'''' THEN ch ELSE '\''' END AS c,
+    CASE WHEN ch !~ '[''A-Za-z0-9]' THEN '''' ELSE '' END as q,
+    n
+    FROM regexp_split_to_table(codepage_37, '') WITH ORDINALITY AS x(ch,n)
+    )
+  SELECT '&'||array_to_string(array_agg(concat(q,c,q) order by n),'<')
+    FROM list INTO rules;
+
+  -- create the collation with the rules adequately formatted and quoted
+  EXECUTE format('CREATE COLLATION ebcdic (PROVIDER = ''icu'', LOCALE = ''und'''
+    ', rules = %L)', rules);
+END
+$body$ LANGUAGE plpgsql;
+]]>
+
+SELECT c FROM (VALUES('a'),('b'),('A'),('B'),('1'),('2'),('!'),('^')) AS x(c)
+ORDER BY c COLLATE "ebcdic";
+ c
+---
+ !
+ a
+ b
+ ^
+ A
+ B
+ 1
+ 2
+</programlisting>
+     </para>
+   </sect3>
    <sect3 id="icu-external-references">
     <title>External References for ICU</title>
     <para>
@@ -1528,6 +1575,11 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
        <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
       </para>
      </listitem>
+     <listitem>
+      <para>
+       <ulink url="https://unicode-org.github.io/icu/userguide/collation/customization/">Collation customization (tailoring rules)</ulink>
+      </para>
+     </listitem>
      <listitem>
       <para>
        <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink>
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index b86a9bbb9c..5003ae7d38 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -165,9 +165,8 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
      <listitem>
       <para>
        Specifies additional collation rules to customize the behavior of the
-       collation.  This is supported for ICU only.  See <ulink
-       url="https://unicode-org.github.io/icu/userguide/collation/customization/"/>
-       for details on the syntax.
+       collation.  This is supported for ICU only.
+       See <xref linkend="icu-tailoring-rules"/> for details.
       </para>
      </listitem>
     </varlistentry>
@@ -261,8 +260,8 @@ CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');
    <quote>V</quote>, but is treated as a secondary difference similar to an
    accent.  Rules like this are contained in the locale definitions of some
    languages.  (Of course, if a locale definition already contains the desired
-   rules, then they don't need to be specified again explicitly.)  See the ICU
-   documentation for further details and examples on the rules syntax.
+   rules, then they don't need to be specified again explicitly.)  See
+   <xref linkend="icu-tailoring-rules"/> for further details and examples on the rules syntax.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index b2c8aef1ad..ce7317f81b 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -232,9 +232,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        <para>
         Specifies additional collation rules to customize the behavior of the
         default collation of this database.  This is supported for ICU only.
-        See <ulink
-        url="https://unicode-org.github.io/icu/userguide/collation/customization/"/>
-        for details on the syntax.
+        See <xref linkend="icu-tailoring-rules"/> for details.
        </para>
       </listitem>
      </varlistentry>
#6Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Daniel Verite (#1)
Re: EBCDIC sorting as a use case for ICU rules

On 21.06.23 15:28, Daniel Verite wrote:

A collation like the following this seems to work (the rule simply enumerates
US-ASCII letters in the EBCDIC alphabet order, with adequate quoting)

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$&'
'<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'<a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<'~'<s<t<u<v<w<x<y<z<'['<'^'<']'<'{'<A<B<C<D<E<F<G<H<I<'}'<J<K<L<M<N<O<P<Q<R<'\'<S<T<U<V<W<X<Y<Z<0<1<2<3<4<5<6<7<8<9$$);

This can be useful for people who migrate from mainframes to Postgres
and need their migration tests to produce the same sorted results as the
original system.
Since rules can be defined at the database level with the icu_rules option,
they don't even need to tweak their queries to add COLLATE clauses,
which surely is appreciable in that kind of project.

US-ASCII when sorted in EBCDIC order comes out like this:

.<(+|&!$*);-/,%_>?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST
UVWXYZ0123456789

Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.

You can use whitespace in the rules. For example,

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$
& ' ' < '.' < '<' < '(' < '+' < \|
< '&' < '!' < '$' < '*' < ')' < ';'
< '-' < '/' < ',' < '%' < '_' < '>' < '?'
< '`' < ':' < '#' < '@' < \' < '=' < '"'
< a < b < c < d < e < f < g < h < i
< j < k < l < m < n < o < p < q < r
< '~' < s < t < u < v < w < x < y < z
< '[' < '^' < ']'
< '{' < A < B < C < D < E < F < G < H < I
< '}' < J < K < L < M < N < O < P < Q < R
< '\' < S < T < U < V < W < X < Y < Z
< 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9
$$);

(This particular layout is meant to match the rows in
https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout.)

#7Peter Eisentraut
peter.eisentraut@enterprisedb.com
In reply to: Daniel Verite (#5)
Re: EBCDIC sorting as a use case for ICU rules

On 30.06.23 13:08, Daniel Verite wrote:

About making a doc patch from this, I've came up with the attached,
which generates a CREATE COLLATION statement with rules from an
arbitrary strings that just lists characters in whichever order is desired.

I like adding more documentation and links around this. But I'm not
sure how this code you are including is supposed to help users
understand the rules language. Effectively, this would be adding
another rules mechanism on top of the existing one, but doesn't explain
either one.

#8Jeff Davis
pgsql@j-davis.com
In reply to: Peter Eisentraut (#6)
Re: EBCDIC sorting as a use case for ICU rules

On Thu, 2023-07-06 at 11:32 +0200, Peter Eisentraut wrote:

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$
& ' ' < '.' < '<' < '(' < '+' < \|
< '&' < '!' < '$' < '*' < ')' < ';'
< '-' < '/' < ',' < '%' < '_' < '>' < '?'
< '`' < ':' < '#' < '@' < \' < '=' < '"'
< a < b < c < d < e < f < g < h < i
< j < k < l < m < n < o < p < q < r
< '~' < s < t < u < v < w < x < y < z
< '[' < '^' < ']'
< '{' < A < B < C < D < E < F < G < H < I
< '}' < J < K < L < M < N < O < P < Q < R
< '\'  < S < T < U < V < W < X < Y < Z
< 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9
$$);

That looks much nicer and would go nicely in the documentation along
with some explanation.

Regards,
Jeff Davis

#9Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#6)
1 attachment(s)
Re: EBCDIC sorting as a use case for ICU rules

Peter Eisentraut wrote:

You can use whitespace in the rules. For example,

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$

Nice, it's clearly better that the piece of code I had in the
previous patch.
It can also be made more compact by grouping consecutive
code points, for instance <*a-r for 'a' to 'r'
I changed it that way, and also moved '^' before '[' and ']',
since according to [1]https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout, '^' is at location 0xB0 and '[' and ']'
at 0xBA and 0xBB.

Updated patch attached.

[1]: https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachments:

collation-icu-rules-v2.patchtext/plainDownload
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index ed84465996..24de689753 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1497,6 +1497,39 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
     </para>
    </sect3>
 
+   <sect3 id="icu-tailoring-rules">
+     <title>Tailoring rules</title>
+     <para>
+       The order of individual code points may be changed with tailoring rules, whose syntax is detailed at <ulink url="https://unicode-org.github.io/icu/userguide/collation/customization/"></ulink>.
+       As an example, the following statement sets up a collation named <literal>ebcdic</literal> with rules to sort US-ASCII characters in the order of the <ulink url="https://en.wikipedia.org/wiki/EBCDIC">EBCDIC encoding.</ulink>
+<programlisting>
+<![CDATA[
+CREATE COLLATION ebcdic (provider='icu', locale='und',
+rules=$$
+& ' ' < '.' < '<' < '(' < '+' < \|
+< '&' < '!' < '$' < '*' < ')' < ';'
+< '-' < '/' < ',' < '%' < '_' < '>' < '?'
+< '`' < ':' < '#' < '@' < \' < '=' < '"'
+<*a-r < '~' <*s-z < '^' < '[' < ']'
+< '{' <*A-I < '}' <*J-R < '\' <*S-Z <*0-9
+$$);
+]]>
+
+SELECT c FROM (VALUES('a'),('b'),('A'),('B'),('1'),('2'),('!'),('^')) AS x(c)
+ORDER BY c COLLATE "ebcdic";
+ c
+---
+ !
+ a
+ b
+ ^
+ A
+ B
+ 1
+ 2
+</programlisting>
+     </para>
+   </sect3>
    <sect3 id="icu-external-references">
     <title>External References for ICU</title>
     <para>
@@ -1528,6 +1561,11 @@ SELECT 'x-y' = 'x_y' COLLATE level4; -- false
        <ulink url="https://unicode-org.github.io/icu/userguide/locale/"></ulink>
       </para>
      </listitem>
+     <listitem>
+      <para>
+       <ulink url="https://unicode-org.github.io/icu/userguide/collation/customization/">Collation customization (tailoring rules)</ulink>
+      </para>
+     </listitem>
      <listitem>
       <para>
        <ulink url="https://unicode-org.github.io/icu/userguide/collation/api.html"></ulink>
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index b86a9bbb9c..5003ae7d38 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -165,9 +165,8 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
      <listitem>
       <para>
        Specifies additional collation rules to customize the behavior of the
-       collation.  This is supported for ICU only.  See <ulink
-       url="https://unicode-org.github.io/icu/userguide/collation/customization/"/>
-       for details on the syntax.
+       collation.  This is supported for ICU only.
+       See <xref linkend="icu-tailoring-rules"/> for details.
       </para>
      </listitem>
     </varlistentry>
@@ -261,8 +260,8 @@ CREATE COLLATION german_phonebook (provider = icu, locale = 'de-u-co-phonebk');
    <quote>V</quote>, but is treated as a secondary difference similar to an
    accent.  Rules like this are contained in the locale definitions of some
    languages.  (Of course, if a locale definition already contains the desired
-   rules, then they don't need to be specified again explicitly.)  See the ICU
-   documentation for further details and examples on the rules syntax.
+   rules, then they don't need to be specified again explicitly.)  See
+   <xref linkend="icu-tailoring-rules"/> for further details and examples on the rules syntax.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml
index b2c8aef1ad..ce7317f81b 100644
--- a/doc/src/sgml/ref/create_database.sgml
+++ b/doc/src/sgml/ref/create_database.sgml
@@ -232,9 +232,7 @@ CREATE DATABASE <replaceable class="parameter">name</replaceable>
        <para>
         Specifies additional collation rules to customize the behavior of the
         default collation of this database.  This is supported for ICU only.
-        See <ulink
-        url="https://unicode-org.github.io/icu/userguide/collation/customization/"/>
-        for details on the syntax.
+        See <xref linkend="icu-tailoring-rules"/> for details.
        </para>
       </listitem>
      </varlistentry>
#10Peter Eisentraut
peter@eisentraut.org
In reply to: Daniel Verite (#9)
Re: EBCDIC sorting as a use case for ICU rules

On 17.07.23 10:10, Daniel Verite wrote:

Peter Eisentraut wrote:

You can use whitespace in the rules. For example,

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$

Nice, it's clearly better that the piece of code I had in the
previous patch.
It can also be made more compact by grouping consecutive
code points, for instance <*a-r for 'a' to 'r'
I changed it that way, and also moved '^' before '[' and ']',
since according to [1], '^' is at location 0xB0 and '[' and ']'
at 0xBA and 0xBB.

Updated patch attached.

Committed with some editing. I moved the existing rules example from
the CREATE COLLATION page into the new section you created, so we have a
simple example followed by the complex example.

In reply to: Daniel Verite (#1)
1 attachment(s)
Re: EBCDIC sorting as a use case for ICU rules

Hi,

Sorry to chime in so lately, I was waiting for some customer feedback.

On Wed, 21 Jun 2023 15:28:38 +0200
"Daniel Verite" <daniel@manitou-mail.org> wrote:

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2].
It turns out it has been already asked on
-general a few years ago [3] with no satisfactory answer at the time ,
and that it can be implemented with rules in v16.

We worked with a customer few months ago about this question and end up with a
procedure to build new locale/collation for glibc and load them in PostgreSQL
[1]: /messages/by-id/20230209144947.1dfad6c0@karst

Our customer built the fr_ebcdic locale file themselves, based on the EBCDIC
IBM500 codepage (including about the same characters than iso 8859-1) and share
it under the BY-CC licence. See in attachment.

The procedure is quite simple:

1. copy this file under "/usr/share/i18n/locales/fr_ebcdic"
2. build it using "localedef -c -i fr_ebcdic -f UTF-8 fr_ebcdic.UTF-8"
3. restart your PostgreSQL instance (because of localeset weird behavior)
4. "pg_import_system_collations('schema')" or create the collation, eg.:
CREATE COLLATION fr_ebcdic (
PROVIDER = libc,
LC_COLLATE = fr_ebcdic.utf8,
LC_CTYPE = fr_ebcdic.utf8
);

Now, same question than for the ICU: do we want to provide documentation about
this? Online documentation about such feature are quite arid. In fact, this
could be useful in various other way than just EBCDIC.

Regards,

[1]: /messages/by-id/20230209144947.1dfad6c0@karst

Attachments:

fr_ebcdicapplication/octet-stream; name=fr_ebcdicDownload
#12Daniel Verite
daniel@manitou-mail.org
In reply to: Peter Eisentraut (#10)
Re: EBCDIC sorting as a use case for ICU rules

Peter Eisentraut wrote:

Committed with some editing. I moved the existing rules example from
the CREATE COLLATION page into the new section you created, so we have a
simple example followed by the complex example.

OK, thanks for pushing this!

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite