Is it possible to sort strings in EBCDIC order in PostgreSQL server?

Started by Tsunakawa, Takayukiover 8 years ago15 messagesgeneral
Jump to latest
#1Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com

Hello,

It is embarrassing to ask such a thing, but is there any way to sort character column values in EBCDIC order inside the PostgreSQL server? i.e. is it possible to use EBCDIC collation in PostgreSQL, say, by using ICU or something? We need to run on a certain mainframe.

I see almost no hope from the following, but let me make sure.

/messages/by-id/28548.1424884373@sss.pgh.pa.us

"Hmm ... EBCDIC ... is that mandatory? Because there are a pretty large
number of ASCII dependencies in PG, mostly arising from our support of
multibyte character sets, which are all expected to be ASCII supersets.
If there's not a way to run the server process in ASCII-land, you're
likely dead in the water. I would not want to try to ferret out all
the dependencies."

Regards
Takayuki Tsunakawa

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tsunakawa, Takayuki (#1)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

Tsunakawa, Takayuki wrote:

It is embarrassing to ask such a thing, but is there any way to sort character column values in EBCDIC order inside the PostgreSQL server?
i.e. is it possible to use EBCDIC collation in PostgreSQL, say, by using ICU or something? We need to run on a certain mainframe.

I see almost no hope from the following, but let me make sure.

That should not be too difficult. PostgreSQL is extensible!

Just define operators that implement <, <=, =, >= and > for EBCDIC,
create an operator class for these (for indexing) and use
ORDER BY ... USING.

Yours,
Laurenz Albe

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tsunakawa, Takayuki (#1)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

It is embarrassing to ask such a thing, but is there any way to sort character column values in EBCDIC order inside the PostgreSQL server? i.e. is it possible to use EBCDIC collation in PostgreSQL, say, by using ICU or something? We need to run on a certain mainframe.

I see almost no hope from the following, but let me make sure.
/messages/by-id/28548.1424884373@sss.pgh.pa.us

"Hmm ... EBCDIC ... is that mandatory? Because there are a pretty large
number of ASCII dependencies in PG, mostly arising from our support of
multibyte character sets, which are all expected to be ASCII supersets.
If there's not a way to run the server process in ASCII-land, you're
likely dead in the water. I would not want to try to ferret out all
the dependencies."

As I recall, that question was about converting the server to use
EBCDIC strings everywhere. That seems impractical for the reasons
I mentioned. But you could probably sort according to EBCDIC order
if you were willing to accept some inefficiency: internal to the
comparison function, convert each string to EBCDIC on each call.

Another way you could imagine is to store EBCDIC strings embedded
in bytea values.

regards, tom lane

In reply to: Tsunakawa, Takayuki (#1)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

On Sun, Dec 10, 2017 at 11:09 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

It is embarrassing to ask such a thing, but is there any way to sort character column values in EBCDIC order inside the PostgreSQL server? i.e. is it possible to use EBCDIC collation in PostgreSQL, say, by using ICU or something? We need to run on a certain mainframe.

There is no need for embarrassment. I think that you mean that you'd
like to get behavior equivalent to the "C" locale with an EBCDIC
encoding. Right? I think that you could use ICU to do this with some
work. ICU collations are not tied to a particular encoding, unlike the
libc collations (they support most but not all collations).

That said, the idea of an "EBCDIC collation" seems limiting. Why
should a system like DB2 for the mainframe (that happens to use EBCDIC
as its encoding) not have a more natural, human-orientated collation
even while using EBCDIC? ISTM that the point of using the "C" locale
(with EBDIC or with UTF-8 or with any other encoding) is to get a
performance benefit where the actual collation's behavior doesn't
matter much to users. Are you sure it's really important to be
*exactly* compatible with EBCDIC order? As long as you're paying for a
custom collation, why not just use a collation that is helpful to
humans?

--
Peter Geoghegan

#5Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Peter Geoghegan (#4)
RE: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

Hi Laurenz, Tom, Peter,

Thanks for your suggestions. The practical solution seems to be to override comparison operators of char, varchar and text data types with UDFs that behave as Tom mentioned.

From: Peter Geoghegan [mailto:pg@bowt.ie]

That said, the idea of an "EBCDIC collation" seems limiting. Why
should a system like DB2 for the mainframe (that happens to use EBCDIC
as its encoding) not have a more natural, human-orientated collation
even while using EBCDIC? ISTM that the point of using the "C" locale
(with EBDIC or with UTF-8 or with any other encoding) is to get a
performance benefit where the actual collation's behavior doesn't
matter much to users. Are you sure it's really important to be
*exactly* compatible with EBCDIC order? As long as you're paying for a
custom collation, why not just use a collation that is helpful to
humans?

You are right. I'd like to ask the customer whether and why they need EBCDIC ordering.

Regards
Takayuki Tsunakawa

#6John McKown
john.archie.mckown@gmail.com
In reply to: Tsunakawa, Takayuki (#5)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

On Tue, Dec 12, 2017 at 2:17 AM, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:

Hi Laurenz, Tom, Peter,

Thanks for your suggestions. The practical solution seems to be to
override comparison operators of char, varchar and text data types with
UDFs that behave as Tom mentioned.

From: Peter Geoghegan [mailto:pg@bowt.ie]

That said, the idea of an "EBCDIC collation" seems limiting. Why
should a system like DB2 for the mainframe (that happens to use EBCDIC
as its encoding) not have a more natural, human-orientated collation
even while using EBCDIC? ISTM that the point of using the "C" locale
(with EBDIC or with UTF-8 or with any other encoding) is to get a
performance benefit where the actual collation's behavior doesn't
matter much to users. Are you sure it's really important to be
*exactly* compatible with EBCDIC order? As long as you're paying for a
custom collation, why not just use a collation that is helpful to
humans?

You are right. I'd like to ask the customer whether and why they need
EBCDIC ordering.

​This is a guess on my part, based on many years on an EBCDIC system. But
I'll bet that they are doing a conversion off of the EBCDIC system (maybe
Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They
want to be able to compare the output from the existing system to the
output on the new system. EBCDIC orders "lower case", "upper case", then
"digits". The default C locale on Linux (I don't know Windows) will sort
"digits", then alphabetic with the lower then upper case of each letter in
order like: "aAbB...zZ". Comparing identical data which is not presented in
exactly the same order would be very difficult. ​

Regards
Takayuki Tsunakawa

--
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown

#7James Keener
jim@jimkeener.com
In reply to: John McKown (#6)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

The default C locale on Linux (I don't know Windows) will sort "digits",
then alphabetic with the lower then upper case of each letter in order
like: "aAbB...zZ"

That's no true at all! The C locales are 0-9A-Za-z

#include <locale.h>

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int myCompare (const void * a, const void * b)
{
return strcmp (*(const char **) a, *(const char **) b);
}

void with_other_locale (char *new_locale,
const char **arr,
const int n
)
{
char *old_locale, *saved_locale;

/* Get the name of the current locale. */
old_locale = setlocale (LC_ALL, NULL);

/* Copy the name so it won’t be clobbered by setlocale. */
saved_locale = strdup (old_locale);

if (strlen(new_locale) == 0)
{
new_locale = saved_locale;
}

/* Now change the locale and do some stuff with it. */
setlocale (LC_ALL, new_locale);
qsort (arr, n, sizeof (const char *), myCompare);

printf("\nSorted array in locale %s is\n", new_locale);
for (int i = 0; i < n; i++)
printf("%d: %s \n", i, arr[i]);

/* Restore the original locale. */
setlocale (LC_ALL, saved_locale);
free (saved_locale);
}

int main ()
{
const char *arr[] = {"Jim", "job", "Anne", "aardvark", "Isaac",
"island", "12 Days of Christmas", "12 drummers"};
int n = sizeof(arr)/sizeof(arr[0]);
int i;

printf("Given array is\n");
for (i = 0; i < n; i++)
printf("%d: %s \n", i, arr[i]);

with_other_locale("", arr, n);
with_other_locale("C", arr, n);
with_other_locale("en_US.UTF-8", arr, n);
with_other_locale("UTF-8", arr, n);
return 0;
}

Gives

Given array is
0: Jim
1: job
2: Anne
3: aardvark
4: Isaac
5: island
6: 12 Days of Christmas
7: 12 drummers

Sorted array in locale C is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job

Sorted array in locale C is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job

Sorted array in locale en_US.UTF-8 is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job

Sorted array in locale UTF-8 is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job

I actually don't think there is a case-insensitive locale (bydefault?) on a
unix machine.

Jim

On Tue, Dec 12, 2017 at 8:18 AM, John McKown <john.archie.mckown@gmail.com>
wrote:

Show quoted text

On Tue, Dec 12, 2017 at 2:17 AM, Tsunakawa, Takayuki <
tsunakawa.takay@jp.fujitsu.com> wrote:

Hi Laurenz, Tom, Peter,

Thanks for your suggestions. The practical solution seems to be to
override comparison operators of char, varchar and text data types with
UDFs that behave as Tom mentioned.

From: Peter Geoghegan [mailto:pg@bowt.ie]

That said, the idea of an "EBCDIC collation" seems limiting. Why
should a system like DB2 for the mainframe (that happens to use EBCDIC
as its encoding) not have a more natural, human-orientated collation
even while using EBCDIC? ISTM that the point of using the "C" locale
(with EBDIC or with UTF-8 or with any other encoding) is to get a
performance benefit where the actual collation's behavior doesn't
matter much to users. Are you sure it's really important to be
*exactly* compatible with EBCDIC order? As long as you're paying for a
custom collation, why not just use a collation that is helpful to
humans?

You are right. I'd like to ask the customer whether and why they need
EBCDIC ordering.

​This is a guess on my part, based on many years on an EBCDIC system. But
I'll bet that they are doing a conversion off of the EBCDIC system (maybe
Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They
want to be able to compare the output from the existing system to the
output on the new system. EBCDIC orders "lower case", "upper case", then
"digits". The default C locale on Linux (I don't know Windows) will sort
"digits", then alphabetic with the lower then upper case of each letter in
order like: "aAbB...zZ". Comparing identical data which is not presented in
exactly the same order would be very difficult. ​

Regards
Takayuki Tsunakawa

--
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown

#8John McKown
john.archie.mckown@gmail.com
In reply to: James Keener (#7)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

On Tue, Dec 12, 2017 at 9:11 AM, James Keener <jim@jimkeener.com> wrote:

The default C locale on Linux (I don't know Windows) will sort "digits",

then alphabetic with the lower then upper case of each letter in order
like: "aAbB...zZ"

That's no true at all! The C locales are 0-9A-Za-z

​Thanks for the correction. Turns out that I forgot that my default locale
on Linux was en_US.utf8.​

--
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown

#9James Keener
jim@jimkeener.com
In reply to: John McKown (#8)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

en_US.utf8. is still 0-9A-Za-z and in my example set (as it's my default
too :))

You'd need a case insensitive collation to do what you described, and I'm
not sure those exist in postgres. (I guess you could always build your own
if you _really_ wanted to.

Jim

On Tue, Dec 12, 2017 at 10:24 AM, John McKown <john.archie.mckown@gmail.com>
wrote:

Show quoted text

On Tue, Dec 12, 2017 at 9:11 AM, James Keener <jim@jimkeener.com> wrote:

The default C locale on Linux (I don't know Windows) will sort "digits",

then alphabetic with the lower then upper case of each letter in order
like: "aAbB...zZ"

That's no true at all! The C locales are 0-9A-Za-z

​Thanks for the correction. Turns out that I forgot that my default locale
on Linux was en_US.utf8.​

--
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown

#10James Keener
jim@jimkeener.com
In reply to: James Keener (#9)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

Sorry for spamming the list. It appears that I'm an idiot. Sorry :(

jim=# select * from test order by a collate "C";
a
----------------------
12 Days of Christmas
12 drummers
Anne
Isaac
Jim
a
aardvark
b
island
job
(10 rows)

jim=# select * from test order by a collate "en_US.utf8";
a
----------------------
12 Days of Christmas
12 drummers
a
aardvark
Anne
b
Isaac
island
Jim
job
(10 rows)

On Tue, Dec 12, 2017 at 10:36 AM, James Keener <jim@jimkeener.com> wrote:

Show quoted text

en_US.utf8. is still 0-9A-Za-z and in my example set (as it's my default
too :))

You'd need a case insensitive collation to do what you described, and I'm
not sure those exist in postgres. (I guess you could always build your own
if you _really_ wanted to.

Jim

On Tue, Dec 12, 2017 at 10:24 AM, John McKown <
john.archie.mckown@gmail.com> wrote:

On Tue, Dec 12, 2017 at 9:11 AM, James Keener <jim@jimkeener.com> wrote:

The default C locale on Linux (I don't know Windows) will sort "digits",

then alphabetic with the lower then upper case of each letter in order
like: "aAbB...zZ"

That's no true at all! The C locales are 0-9A-Za-z

​Thanks for the correction. Turns out that I forgot that my default
locale on Linux was en_US.utf8.​

--
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown

#11John McKown
john.archie.mckown@gmail.com
In reply to: James Keener (#10)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

On Tue, Dec 12, 2017 at 9:43 AM, James Keener <jim@jimkeener.com> wrote:

Sorry for spamming the list. It appears that I'm an idiot. Sorry :(

​I guess we're even now. We both made a similar mistake.​

​But, despite my error, I still think the OP's need for an EBCDIC order is
to compare output from parallel runs of an application from both an EBCDIC
host and and ASCII host, to make sure they are "the same".

--
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown

#12James Keener
jim@jimkeener.com
In reply to: John McKown (#11)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

If that's the case, I wonder if OP could write a function that would
convert from the ASCII code-point ot the EBCDIC codepoint. For instance,
(using the function at https://wiki.postgresql.org/wiki/Binary_Replace)
convert A to char 193, `select binary_replace('Anne'::bytea, 'A'::bytea,
'\xc1'::bytea);` and then sort by the result of this function.

Jim

On Tue, Dec 12, 2017 at 10:47 AM, John McKown <john.archie.mckown@gmail.com>
wrote:

Show quoted text

On Tue, Dec 12, 2017 at 9:43 AM, James Keener <jim@jimkeener.com> wrote:

Sorry for spamming the list. It appears that I'm an idiot. Sorry :(

​I guess we're even now. We both made a similar mistake.​

​But, despite my error, I still think the OP's need for an EBCDIC order is
to compare output from parallel runs of an application from both an EBCDIC
host and and ASCII host, to make sure they are "the same".

--
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown

#13George Neuner
gneuner2@comcast.net
In reply to: Tsunakawa, Takayuki (#1)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

On Tue, 12 Dec 2017 07:18:10 -0600, John McKown
<john.archie.mckown@gmail.com> wrote:

?This is a guess on my part, based on many years on an EBCDIC system. But
I'll bet that they are doing a conversion off of the EBCDIC system (maybe
Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They
want to be able to compare the output from the existing system to the
output on the new system.

[he asks naively:]

If a port off the EBCDIC system really is the case, wouldn't it be
easier just to write an external compare tool? Or maybe a PG
function to convert UTF strings as EBCDIC so as to compare string
column dumps?

Either of the above seems more rational than putting a whole lot of
effort into creating an EBCDIC collation.

[Not that it is terribly hard to create a new collation. It's more a
matter of "why bother?" when you are trying to move away from that
system in the 1st place.]

YMMV,
George

In reply to: John McKown (#6)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

On Tue, Dec 12, 2017 at 5:18 AM, John McKown
<john.archie.mckown@gmail.com> wrote:

On Tue, Dec 12, 2017 at 2:17 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

Hi Laurenz, Tom, Peter,

Thanks for your suggestions. The practical solution seems to be to
override comparison operators of char, varchar and text data types with UDFs
that behave as Tom mentioned.

From: Peter Geoghegan [mailto:pg@bowt.ie]

That said, the idea of an "EBCDIC collation" seems limiting. Why
should a system like DB2 for the mainframe (that happens to use EBCDIC
as its encoding) not have a more natural, human-orientated collation
even while using EBCDIC? ISTM that the point of using the "C" locale
(with EBDIC or with UTF-8 or with any other encoding) is to get a
performance benefit where the actual collation's behavior doesn't
matter much to users. Are you sure it's really important to be
*exactly* compatible with EBCDIC order? As long as you're paying for a
custom collation, why not just use a collation that is helpful to
humans?

You are right. I'd like to ask the customer whether and why they need
EBCDIC ordering.

This is a guess on my part, based on many years on an EBCDIC system. But
I'll bet that they are doing a conversion off of the EBCDIC system (maybe
Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They
want to be able to compare the output from the existing system to the output
on the new system. EBCDIC orders "lower case", "upper case", then "digits".

ICU supports creating custom collations that reorder upper and lower
case, or digits with scripts (e.g. Latin alphabet characters). See the
documentation -- "23.2.2.3.2. ICU collations". Advanced customization
is possible.

--
Peter Geoghegan

#15Jeremy Schneider
schneider@ardentperf.com
In reply to: Peter Geoghegan (#14)
Re: Is it possible to sort strings in EBCDIC order in PostgreSQL server?

On 12/12/17 10:21, Peter Geoghegan wrote:

ICU supports creating custom collations that reorder upper and lower
case, or digits with scripts (e.g. Latin alphabet characters). See the
documentation -- "23.2.2.3.2. ICU collations". Advanced customization
is possible.

I just gave this a try and I'm not sure I completely understand what
it's doing but it definitely doesn't look like it's sorting according to
EBCDIC byte orders. (This was on centos 7 with libicu-50.1.2) I might
have created the collation incorrectly. I'd love to hear if anyone else
finds a way to get this working outside of custom comparison operators.

pops-10.1 root@db1=# create collation "ebcdic" (provider=icu,
locale='cp037');
CREATE COLLATION

pops-10.1 root@db1=# create table test (data text);
CREATE TABLE

pops-10.1 root@db1=# insert into test
values('a'),('b'),('A'),('B'),('1'),('2'),('!'),('^');
INSERT 0 8

pops-10.1 root@db1=# select * from test order by data collate "ebcdic";
data
------
!
^
1
2
a
A
b
B
(8 rows)

--
http://about.me/jeremy_schneider