Supporting SJIS as a database encoding

Started by Tsunakawa, Takayuki · over 9 years ago · 80 messages · pgsql-hackers
#1 Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com

Hello,

I'd like to propose adding SJIS as a database encoding. You may wonder why SJIS is still necessary in the world of Unicode. The purpose is to achieve comparable performance when migrating legacy database systems from other DBMSs with little modification of applications.

Recently, we failed to migrate some customer's legacy database from DBMS-X to PostgreSQL. That customer wished for PostgreSQL, but PostgreSQL couldn't meet the performance requirement.

The system uses DBMS-X with the database character set being SJIS. The main applications are written in embedded SQL, which requires SJIS in the host variables. They insisted they cannot use UTF8 for the host variables because that would require large modifications to the applications due to character handling. So, with DBMS-X, no character set conversion is necessary between the clients and the server.

On the other hand, PostgreSQL doesn't support SJIS as a database encoding, so character set conversion from UTF-8 to SJIS has to be performed. The batch application runs millions of SELECTs, each of which retrieves more than 100 columns, and many of those columns are of character type.

If PostgreSQL supports SJIS, PostgreSQL will match or outperform DBMS-X for these applications. We confirmed this by using psql to run a subset of the batch processing. When the client encoding is SJIS, one FETCH of 10,000 rows took about 500 ms. When the client encoding is UTF8 (the same as the database encoding), the same FETCH took 270 ms.

Supporting SJIS may help PostgreSQL regain some attention here in Japan, in the context of database migration. BTW, MySQL supports SJIS as a database encoding. PostgreSQL used to be the most popular open source database in Japan, but MySQL is now more popular.

But what I'm wondering is why PostgreSQL doesn't support SJIS. Was there any technical difficulty? Is there anything you are worried about if adding SJIS?

I'd like to write a patch for adding SJIS if there's no strong objection. I'd appreciate it if you could point me to good design information for adding a server encoding (e.g. the URL of the most recent patch that added a new server encoding).

Regards
Takayuki Tsunakawa

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tsunakawa, Takayuki (#1)
Re: Supporting SJIS as a database encoding

But what I'm wondering is why PostgreSQL doesn't support SJIS. Was there any technical difficulty? Is there anything you are worried about if adding SJIS?

Yes, there's a technical difficulty in the backend code. In many places
it is assumed that any string is "ASCII compatible", which means no
ASCII byte value appears as part of a multi-byte character. Here is a
random example from src/backend/utils/adt/varlena.c:

/* Else, it's the traditional escaped style */
for (bc = 0, tp = inputText; *tp != '\0'; bc++)
{
    if (tp[0] != '\\')
        tp++;

SJIS sometimes uses the byte for '\' (0x5C) as the second byte of a two-byte character.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


#3 Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tatsuo Ishii (#2)
Re: Supporting SJIS as a database encoding

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tatsuo Ishii

But what I'm wondering is why PostgreSQL doesn't support SJIS. Was there
any technical difficulty? Is there anything you are worried about if adding
SJIS?

Yes, there's a technical difficulty in the backend code. In many places it
is assumed that any string is "ASCII compatible", which means no ASCII
byte value appears as part of a multi-byte character. Here is a random
example from src/backend/utils/adt/varlena.c:

/* Else, it's the traditional escaped style */
for (bc = 0, tp = inputText; *tp != '\0'; bc++)
{
    if (tp[0] != '\\')
        tp++;

SJIS sometimes uses the byte for '\' (0x5C) as the second byte of a two-byte character.

Thanks, I'll try to understand the seriousness of the problem, as I don't have good knowledge of character sets. But your example seems to tell everything about the difficulty...

Before digging into the problem, could you share your impression of whether PostgreSQL can support SJIS? Would it be hopeless? Can't we find any direction to go? Can I find the relevant source code by searching for specific words like "ASCII", "HIGH_BIT", "\\", etc.?

Regards
Takayuki Tsunakawa


#4 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tsunakawa, Takayuki (#3)
Re: Supporting SJIS as a database encoding

Before digging into the problem, could you share your impression on whether PostgreSQL can support SJIS? Would it be hopeless? Can't we find any direction to go? Can I find relevant source code by searching specific words like "ASCII", "HIGH_BIT", "\\" etc?

For starters, you could grep for "multibyte".

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tsunakawa, Takayuki (#3)
Re: Supporting SJIS as a database encoding

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

Before digging into the problem, could you share your impression on
whether PostgreSQL can support SJIS? Would it be hopeless?

I think it's pretty much hopeless. Even if we were willing to make every
bit of code that looks for '\' and other specific at-risk characters
multi-byte aware (with attendant speed penalties), we could expect that
third-party extensions would still contain vulnerable code. More, we
could expect that new bugs of the same ilk would get introduced all the
time. Many such bugs would amount to security problems. So the amount of
effort and vigilance required seems out of proportion to the benefits.

Most of the recent discussion about allowed backend encodings has run
more in the other direction, ie, "why don't we disallow everything but
UTF8 and get rid of all the infrastructure for multiple backend
encodings?". I'm not personally in favor of that, but there are very
few hackers who want to add any more overhead in this area.

regards, tom lane


#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#5)
Re: Supporting SJIS as a database encoding

On 09/05/2016 05:47 PM, Tom Lane wrote:

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

Before digging into the problem, could you share your impression on
whether PostgreSQL can support SJIS? Would it be hopeless?

I think it's pretty much hopeless.

Agreed.

But one thing that would help a little would be to optimize the UTF-8
-> SJIS conversion. It uses a very generic routine, with a binary search
over a large array of mappings. I bet you could do better than that,
maybe using a hash table or a radix tree instead of the large
binary-searched array.

- Heikki


#7 Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Tom Lane (#5)
Re: Supporting SJIS as a database encoding

From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

Before digging into the problem, could you share your impression on
whether PostgreSQL can support SJIS? Would it be hopeless?

I think it's pretty much hopeless. Even if we were willing to make every
bit of code that looks for '\' and other specific at-risk characters
multi-byte aware (with attendant speed penalties), we could expect that
third-party extensions would still contain vulnerable code. More, we could
expect that new bugs of the same ilk would get introduced all the time.
Many such bugs would amount to security problems. So the amount of effort
and vigilance required seems out of proportion to the benefits.

Hmm, this sounds like a death sentence. But as I don't yet have good knowledge of character set handling, I'm not completely convinced why PostgreSQL cannot support SJIS. I wonder why and how other DBMSs support SJIS, and how their implementations differ. Would processing characters with multibyte-aware functions (mb...) solve the problem? Isn't the current implementation blocking support for other character sets that have similar characteristics? I'll study character set handling...

Most of the recent discussion about allowed backend encodings has run more
in the other direction, ie, "why don't we disallow everything but
UTF8 and get rid of all the infrastructure for multiple backend encodings?".
I'm not personally in favor of that, but there are very few hackers who
want to add any more overhead in this area.

Personally, I totally agree. I want non-Unicode character sets to disappear from the world. But real-world business doesn't seem to forgive the lack of SJIS...

Regards
Takayuki Tsunakawa


#8 Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Heikki Linnakangas (#6)
Re: Supporting SJIS as a database encoding

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Heikki
But one thing that would help a little, would be to optimize the UTF-8
-> SJIS conversion. It uses a very generic routine, with a binary search
over a large array of mappings. I bet you could do better than that, maybe
using a hash table or a radix tree instead of the large binary-searched
array.

That sounds worth pursuing. Thanks!

Regards
Takayuki Tsunakawa


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tsunakawa, Takayuki (#7)
Re: Supporting SJIS as a database encoding

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

Using multibyte-functions like mb... to process characters would solve
the problem?

Well, sure. The problem is (1) finding all the places that need that
(I'd estimate dozens to hundreds of places in the core code, and then
there's the question of extensions); (2) preventing new
non-multibyte-aware code from being introduced after you've fixed those
places; and (3) the performance penalties you'd take, because a lot of
those places are bottlenecks and it's much cheaper to not worry about
character lengths in an inner loop.

Isn't the current implementation blocking the support of
other character sets that have similar characteristics?

Sure, SJIS is not the only encoding that we consider frontend-only.
See

https://www.postgresql.org/docs/devel/static/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED

regards, tom lane


#10 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#6)
Re: Supporting SJIS as a database encoding

Hello,

At Mon, 5 Sep 2016 19:38:33 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in <529db688-72fc-1ca2-f898-b0b99e30076f@iki.fi>

On 09/05/2016 05:47 PM, Tom Lane wrote:

"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes:

Before digging into the problem, could you share your impression on
whether PostgreSQL can support SJIS? Would it be hopeless?

I think it's pretty much hopeless.

Agreed.

+1, even as a user of SJIS:)

But one thing that would help a little, would be to optimize the UTF-8
-> SJIS conversion. It uses a very generic routine, with a binary
search over a large array of mappings. I bet you could do better than
that, maybe using a hash table or a radix tree instead of the large
binary-searched array.

I'm very impressed by the idea. The mean number of iterations for a
binary search on the current conversion table, with about 8000
characters, is about 13, and the table size is under 100 kB (maybe).

A three-level array with 2 byte values will take about 1.6~2MB of memory.

A radix tree for UTF-8 -> some-encoding conversion requires about, or
up to (using 1-byte indexes to point to the next level):

  (0x7f + 1)                                /* 1-byte heads */
+ (0xdf - 0xc2 + 1) * (0xbf - 0x80 + 1)     /* 2-byte sequences */
+ (0xef - 0xe0 + 1) * (0xbf - 0x80 + 1)^2   /* 3-byte sequences */
= 128 + 1920 + 65536 = 67 kbytes

SJIS characters are at most 2 bytes long, so about 8000
characters take an extra 16 kB. Some padding space will be
added on top of that.

As a result, a radix tree seems promising because of the small
additional memory requirement and far fewer comparisons. Big5 and
other encodings, including EUC-*, will also benefit from it.

Implementing the radix tree code, then redefining the format of the
mapping table to support the radix tree, then modifying the mapping
generator script are needed.

If no one opposes this, I'll do that.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#11 Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Kyotaro Horiguchi (#10)
Re: Supporting SJIS as a database encoding

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
HORIGUCHI

Implementing the radix tree code, then redefining the format of the mapping table to support the radix tree, then modifying the mapping generator script are needed.

If no one opposes this, I'll do that.

+100
Great analysis, and I admire your guts. I very much appreciate your taking this on!

Regards
Takayuki Tsunakawa


#12 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tsunakawa, Takayuki (#11)
Re: Supporting SJIS as a database encoding

Hello,

At Tue, 6 Sep 2016 03:43:46 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F5E66CE@G01JPEXMBYT05>

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
HORIGUCHI

Implementing the radix tree code, then redefining the format of the mapping table to support the radix tree, then modifying the mapping generator script are needed.

If no one opposes this, I'll do that.

+100
Great analysis and your guts. I very much appreciate your trial!

Thanks. By the way, there's another issue related to SJIS
conversion. MS932 has several characters that have multiple code
points. Converting text in this encoding to and from Unicode
causes a round-trip problem. For example,

8754 (ROMAN NUMERAL ONE in the NEC special characters)
=> U+2160 (ROMAN NUMERAL ONE)
=> FA4A (ROMAN NUMERAL ONE in the IBM extensions)

By my counting, 398 characters are affected by this kind of
replacement. In addition to that, "GAIJI" (private use area
characters) are not allowed. Does this meet your purpose?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#13 Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Kyotaro Horiguchi (#12)
Re: Supporting SJIS as a database encoding

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
Thanks. By the way, there's another issue related to SJIS conversion. MS932
has several characters that have multiple code points. Converting text
in this encoding to and from Unicode causes a round-trip problem. For
example,

8754 (ROMAN NUMERAL ONE in the NEC special characters)
=> U+2160 (ROMAN NUMERAL ONE)
=> FA4A (ROMAN NUMERAL ONE in the IBM extensions)

By my counting, 398 characters are affected by this kind of replacement.
In addition to that, "GAIJI" (private use area characters) are not allowed.
Does this meet your purpose?

Supporting GAIJI is not a requirement as far as I know. Thank you for sharing the information.

# I realize my lack of knowledge about character sets...

Regards
Takayuki Tsunakawa


#14 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#12)
Re: Supporting SJIS as a database encoding

Hello,

At Wed, 07 Sep 2016 16:13:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160907.161304.112519789.horiguchi.kyotaro@lab.ntt.co.jp>

Implementing radix tree code, then redefining the format of mapping table

to suppot radix tree, then modifying mapping generator script are needed.

If no one oppse to this, I'll do that.

So, I did that as a PoC. The radix tree takes a little less than
100 kB (far smaller than expected :) and it is definitely
faster than binsearch.

The attached patch does the following things.

- Defines a struct for a static radix tree
(utf_radix_tree). Currently it supports up to 3-byte codes.

- Adds a map generator script UCS_to_SJIS_radix.pl, which
generates utf8_to_sjis_radix.map from utf8_to_sjis.map.

- Adds a new conversion function utf8_to_sjis_radix.

- Modifies UtfToLocal so as to allow map to be NULL.

- Modifies utf8_to_sjis to use the new conversion function
instead of ULmapSJIS.

The followings are to be done.

- utf8_to_sjis_radix could be more generic.

- SJIS->UTF8 is not implemented, but it would be easily done since
there's no difference in using the radix tree mechanism
(though the output character is currently assumed to be 2 bytes long).

- It doesn't support 4-byte codes, so this is not applicable to
sjis_2004. Extending the radix tree to support 4-byte codes wouldn't
be hard.

The following is the result of a simple test.

=# create table t (a text); alter table t alter column a storage plain;
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');
=# insert into t values ('... 7130 characters containing (I believe) all characters in SJIS encoding');

# Doing that twice is just my mistake.

$ export PGCLIENTENCODING=SJIS

$ psql -c '\encoding' postgres
SJIS

<Using radix tree>
$ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' > /dev/null

real 0m22.696s
user 0m16.991s
sys 0m0.182s

Using binsearch the result for the same operation was
real 0m35.296s
user 0m17.166s
sys 0m0.216s

Returning in UTF-8 bloats the result string by about 1.5 times so
it doesn't seem to make sense comparing with it. But it takes
real = 47.35s.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Use-radix-tree-to-encoding-characters-of-Shift-JIS.patch.gz (application/octet-stream)
#15 Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Kyotaro Horiguchi (#14)
Re: Supporting SJIS as a database encoding

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
HORIGUCHI
<Using radix tree>
$ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' > /dev/null

real 0m22.696s
user 0m16.991s
sys 0m0.182s

Using binsearch the result for the same operation was
real 0m35.296s
user 0m17.166s
sys 0m0.216s

Returning in UTF-8 bloats the result string by about 1.5 times so it doesn't
seem to make sense comparing with it. But it takes real = 47.35s.

Cool, a 36% speedup! Does this difference vary depending on the actual characters used? E.g., would the speedup be greater if most of the characters are ASCII?

Regards
Takayuki Tsunakawa


#16 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tsunakawa, Takayuki (#15)
Re: Supporting SJIS as a database encoding

At Thu, 8 Sep 2016 07:09:51 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in <0A3221C70F24FB45833433255569204D1F5E7D4A@G01JPEXMBYT05>

From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kyotaro
HORIGUCHI
<Using radix tree>
$ time psql postgres -c 'select t.a from t, generate_series(0, 9999)' > /dev/null

real 0m22.696s
user 0m16.991s
sys 0m0.182s

Using binsearch the result for the same operation was
real 0m35.296s
user 0m17.166s
sys 0m0.216s

Returning in UTF-8 bloats the result string by about 1.5 times so it doesn't
seem to make sense comparing with it. But it takes real = 47.35s.

Cool, a 36% speedup! Does this difference vary depending on the actual characters used? E.g., would the speedup be greater if most of the characters are ASCII?

Binsearch on JIS X 0208 always needs about 10 rounds of comparison
and bisecting, while the radix tree requires three hops through arrays
for most of the characters and two hops for some. In short, this
effect won't differ between 2- and 3-byte characters in UTF-8.

The translation speed of ASCII characters (U+20 - U+7F) is not
affected by the character conversion mechanism. They are just
copied without conversion.

As a result, there's no speedup if the output consists only of
ASCII characters, and maximum speedup when the output consists
only of 2-byte UTF-8 characters.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#17 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#14)
Re: Supporting SJIS as a database encoding

On 09/08/2016 09:35 AM, Kyotaro HORIGUCHI wrote:

Returning in UTF-8 bloats the result string by about 1.5 times so
it doesn't seem to make sense comparing with it. But it takes
real = 47.35s.

Nice!

I was hoping that this would also make the binaries smaller. A few dozen
kB of storage is perhaps not a big deal these days, but still. And
smaller tables would also consume less memory and CPU cache.

I removed the #include "../../Unicode/utf8_to_sjis.map" line, so that
the old table isn't included anymore, compiled, and ran "strip
utf8_and_sjis.so". Without this patch, it's 126 kB, and with it, it's
160 kB. So the radix tree takes a little bit more space.

That's not too bad, and I'm sure we could live with that, but with a few
simple tricks, we could do better. First, since all the values we store
in the tree are < 0xffff, we could store them in int16 instead of int32,
and halve the size of the table right off the bat. That won't work for
all encodings, of course, but it might be worth it to have two versions
of the code, one for int16 and another for int32.

Another trick is to eliminate redundancies in the tables. Many of the
tables contain lots of zeros, as in:

/* c3xx */{
/* c380 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* c388 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* c390 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x817e,
/* c398 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* c3a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* c3a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* c3b0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x8180,
/* c3b8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
},

and

/* e388xx */{
/* e38880 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38888 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38890 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38898 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388b0 */ 0x0000, 0xfa58, 0x878b, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388b8 */ 0x0000, 0x878c, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
},

You could overlay the last row of the first table, which is all zeros,
with the first row of the second table, which is also all zeros. (Many
of the tables have a lot more zero-rows than this example.)

But yes, this patch looks very promising in general. I think we should
switch over to radix trees for all the encodings.

- Heikki


#18 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#17)
Re: Supporting SJIS as a database encoding

Hello,

At Tue, 13 Sep 2016 11:44:01 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in <7ff67a45-a53e-4d38-e25d-3a121afea47c@iki.fi>

On 09/08/2016 09:35 AM, Kyotaro HORIGUCHI wrote:

Returning in UTF-8 bloats the result string by about 1.5 times so
it doesn't seem to make sense comparing with it. But it takes
real = 47.35s.

Nice!

Thanks!

I was hoping that this would also make the binaries smaller. A few
dozen kB of storage is perhaps not a big deal these days, but
still. And smaller tables would also consume less memory and CPU
cache.

Agreed.

I removed the #include "../../Unicode/utf8_to_sjis.map" line, so that
the old table isn't included anymore, compiled, and ran "strip
utf8_and_sjis.so". Without this patch, it's 126 kB, and with it, it's
160 kB. So the radix tree takes a little bit more space.

That's not too bad, and I'm sure we could live with that, but with a
few simple tricks, we could do better. First, since all the values we
store in the tree are < 0xffff, we could store them in int16 instead
of int32, and halve the size of the table right off the bat. That won't
work for all encodings, of course, but it might be worth it to
have two versions of the code, one for int16 and another for int32.

That's right. I used int imprudently. All of the characters in the
patch, and most characters in encodings other than the Unicode-related
ones, fit in 2 bytes. 3-byte characters can go in a separate
table in the struct for that case. Otherwise, two or more versions
of the struct are possible, since currently the radix struct is
utf8_and_sjis's own, despite the fact that it is in pg_wchar.h
in the patch.

Another trick is to eliminate redundancies in the tables. Many of the
tables contain lots of zeros, as in:

/* c3xx */{
...
/* c390 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x817e,
/* c398 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* c3a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* c3a8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* c3b0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x8180,
/* c3b8 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000
},

and

/* e388xx */{
/* e38880 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38888 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38890 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e38898 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
/* e388a0 */ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
...
},

You could overlay the last row of the first table, which is all zeros,
with the first row of the second table, which is also all zeros. (Many
of the tables have a lot more zero-rows than this example.)

Yes, the bunch of zeros was an annoyance. Several compression
techniques are available in exchange for some additional CPU time,
but the technique you suggested doesn't require such a sacrifice,
which sounds nice.

But yes, this patch looks very promising in general. I think we should
switch over to radix trees for the all the encodings.

The result was more than I expected for a character set with
about 7000 characters. We can expect a certain amount of advantage
even for character sets that have fewer than a hundred
characters.

I'll work on this for the next CF.

Thanks.

--
Kyotaro Horiguchi
NTT Open Source Software Center


#19 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#18)
Re: Supporting SJIS as a database encoding

Hello, I did this.

As a result, the radix tree is about 1.5 times faster and needs
half the memory.

At Wed, 21 Sep 2016 15:14:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160921.151427.265121484.horiguchi.kyotaro@lab.ntt.co.jp>

I'll work on this for the next CF.

The radix conversion function and the map conversion script became
more generic than in the previous state, so I could easily add
radix conversion for EUC_JP in addition to Shift-JIS.

nm -S said that the size of the radix tree data for sjis->utf8
conversion is 34 kB and that for utf8->sjis is 46 kB (eucjp->utf8
57 kB, utf8->eucjp 93 kB). LUmapSJIS and ULmapSJIS were 62 kB and
59 kB, and LUmapEUC_JP and ULmapEUC_JP were 106 kB and 105 kB. If I'm
not missing something, the radix tree is faster and requires less
memory.

A simple test running 'select '7070 sjis chars' x 100' (I'm not
sure, but the size is 1404 kB) on a local connection shows that this
is fast enough.

radix: real 0m0.285s / user 0m0.199s / sys 0m0.006s
master: real 0m0.418s / user 0m0.180s / sys 0m0.004s

To make sure, the result of a test of sending the same amount of
ASCII string (1404kB) on SJIS and UTF8(no-conversion) encoding is
as follows.

ascii/utf8-sjis: real 0m0.220s / user 0m0.176s / sys 0m0.011s
ascii/utf8-utf8: real 0m0.137s / user 0m0.111s / sys 0m0.008s

======
Random discussions -

Currently the tree structure is divided into several elements:
one for 2-byte codes, other ones for 3-byte and 4-byte codes, and an
output table. All but the last could logically and technically be
merged into a single table, but that would make the generator script
far more complex than it currently is. I no longer want to play
hide-and-seek with complex Perl objects..

It might be better to make this a native feature of the
core. Currently the helper function is in core, but that function
is given as conv_func on calling LocalToUtf.

The current implementation uses the *.map files of pg_utf_to_local as
input. That doesn't seem good, but the radix tree files are completely
uneditable. Providing custom-made loading functions for every
source instead of load_chartable() would be the way to go.

# However, utf8_to_sjis.map, for example, doesn't seem to have been
# generated from the source mentioned in UCS_to_SJIS.pl

I'm not sure that compilers other than gcc accept the generated map
file content.

RADIXTREE.pm is written in a rather old style, but that seems to be no problem.

I haven't tried this with charsets that contain 4-byte codes.

I haven't considered charsets with combined characters. I don't
think they are needed immediately.

Though I believe that this is easily applied to other
conversions, I tried it only with character sets that I know
about.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

- 0001-Radix-tree-infrastructure-for-character-encoding.patch (text/x-patch)
- 0002-Use-radix-tree-for-UTF8-ShiftJIS-conversion.patch.gz (application/octet-stream)
- 0003-Use-radix-tree-for-UTF8-EUC_JP-conversion.patch.gz (application/octet-stream)
#20Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#18)
Radix tree for character conversion

This is a different topic from the original thread, so I renamed
the subject and reposted. Sorry for the duplicate posting.

======================
Hello, I did this.

As a result, the radix tree is about 1.5 times faster and needs
half the memory.

At Wed, 21 Sep 2016 15:14:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160921.151427.265121484.horiguchi.kyotaro@lab.ntt.co.jp>

I'll work on this for the next CF.

The radix conversion function and map conversion script have
become more generic than before, so I could easily add radix
conversion for EUC_JP in addition to Shift JIS.

nm -S says that the size of the radix tree data for sjis->utf8
conversion is 34kB and that for utf8->sjis is 46kB (eucjp->utf8
57kB, utf8->eucjp 93kB). LUmapSJIS and ULmapSJIS were 62kB and
59kB, and LUmapEUC_JP and ULmapEUC_JP were 106kB and 105kB.
Unless I'm missing something, the radix tree is faster and
requires less memory.
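
For illustration, the lookup that these numbers compare can be
sketched as a byte-indexed trie: each byte of the input character
directly selects the next level, instead of binary-searching one
flat map. Everything below (the names and the two sample
mappings) is a made-up miniature for this thread, not code from
the patch.

```c
#include <stdint.h>

/* Second level: one output chunk per covered lead byte; the second
 * byte of the SJIS character indexes it directly. */
static const uint32_t sjis_82xx[256] = {
    [0xA0] = 0x3042,            /* SJIS 0x82A0 -> U+3042 HIRAGANA A */
    [0xA2] = 0x3044,            /* SJIS 0x82A2 -> U+3044 HIRAGANA I */
};

/* First level: the lead byte selects a chunk (NULL = invalid lead). */
static const uint32_t *const sjis_lead[256] = {
    [0x82] = sjis_82xx,
};

/* Look up a 2-byte SJIS character; returns 0 for unmapped input. */
uint32_t sjis2uni(uint8_t b1, uint8_t b2)
{
    const uint32_t *chunk = sjis_lead[b1];
    return chunk ? chunk[b2] : 0;
}
```

Each lookup is two array indexes with no comparisons, and chunks
are shared or omitted where a lead byte maps nothing, which is
where the size savings over the flat LUmap/ULmap arrays would
come from.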

A simple test that runs 'select '7070 sjis chars'' 100 times (I'm
not sure, but the transferred size is about 1404kB) over a local
connection shows that this is fast enough.

radix: real 0m0.285s / user 0m0.199s / sys 0m0.006s
master: real 0m0.418s / user 0m0.180s / sys 0m0.004s

For comparison, the results of sending the same amount of ASCII
data (1404kB) with the SJIS and UTF8 (no-conversion) client
encodings are as follows.

ascii/utf8-sjis: real 0m0.220s / user 0m0.176s / sys 0m0.011s
ascii/utf8-utf8: real 0m0.137s / user 0m0.111s / sys 0m0.008s

======
Random discussions -

Currently the tree structure is divided into several elements:
one for 2-byte codes, others for 3-byte and 4-byte codes, and an
output table. All but the last could logically and technically be
merged into a single table, but that would make the generator
script far more complex than it is now. I no longer want to play
hide-and-seek with complex Perl objects..
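
The split described above might look like this in miniature:
separate per-length lookups that all resolve into one shared
output table, with the converter dispatching on the byte length
of the current character. All names and the stub lookups here are
invented for illustration, not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* One shared output table; index 0 is reserved for "no mapping". */
static const uint32_t out_table[] = { 0, 0x3042, 0x3044 };

/* Per-length trees, stubbed as trivial functions: each resolves a
 * byte sequence to an index into out_table. */
static int lookup2(const uint8_t *p)
{
    return (p[0] == 0x82 && p[1] == 0xA0) ? 1 : 0;
}

static int lookup3(const uint8_t *p)
{
    (void) p;                   /* no 3-byte mappings in this toy */
    return 0;
}

/* Convert one multibyte character; returns 0 if unmapped. */
uint32_t convert_char(const uint8_t *p, size_t len)
{
    int idx = 0;

    switch (len)
    {
        case 2: idx = lookup2(p); break;
        case 3: idx = lookup3(p); break;
        default: break;         /* 1-byte ASCII handled elsewhere */
    }
    return out_table[idx];
}
```

Merging the per-length trees into one structure would save the
dispatch, at the cost of the generator-script complexity
mentioned above.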

It might be better to make this a native feature of the
core. Currently the helper function is in core, but it is passed
as conv_func when calling LocalToUtf.

The current implementation uses the *.map files of
pg_utf_to_local as input. That does not seem ideal, but the radix
tree files are completely uneditable by hand. Providing custom
loading functions for every source, instead of load_chartable(),
would be the way to go.

# However, utf8_to_sjis.map, for example, doesn't seem to have
# been generated from the source mentioned in UCS_to_SJIS.pl.

I'm not sure that compilers other than gcc accept the generated
map file content.

RADIXTREE.pm is written in a rather old style, but that seems to be no problem.

I haven't tried this for charsets that contain 4-byte codes.

I haven't considered charsets with combined characters. I don't
think that is needed immediately.

Though I believe this can easily be applied to other conversions,
I have tried it only with character sets I know about.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

- 0001-Radix-tree-infrastructure-for-character-encoding.patch (text/x-patch)
- 0002-Use-radix-tree-for-UTF8-ShiftJIS-conversion.patch.gz (application/octet-stream)
- 0003-Use-radix-tree-for-UTF8-EUC_JP-conversion.patch.gz (application/octet-stream)
#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#20)
#22Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#21)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#22)
#24Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#22)
#25Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#24)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#25)
#27Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#26)
#28Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#27)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#28)
#30David Fetter
david@fetter.org
In reply to: Robert Haas (#28)
#31Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#29)
#32Daniel Gustafsson
daniel@yesql.se
In reply to: Kyotaro Horiguchi (#27)
#33Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Gustafsson (#32)
#34Daniel Gustafsson
daniel@yesql.se
In reply to: Kyotaro Horiguchi (#33)
#35Daniel Gustafsson
daniel@yesql.se
In reply to: Daniel Gustafsson (#34)
#36Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Gustafsson (#34)
#37Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Gustafsson (#35)
#38Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Gustafsson (#34)
#39Daniel Gustafsson
daniel@yesql.se
In reply to: Kyotaro Horiguchi (#38)
#40Peter Eisentraut
peter_e@gmx.net
In reply to: Daniel Gustafsson (#32)
#41Daniel Gustafsson
daniel@yesql.se
In reply to: Peter Eisentraut (#40)
#42Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Gustafsson (#41)
#43Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#42)
#44Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Daniel Gustafsson (#32)
#45Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#42)
#46Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#45)
#47Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#46)
#48Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#45)
#49Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#48)
#50Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#49)
#51Ishii Ayumi
ayumi.ishii.pg@gmail.com
In reply to: Kyotaro Horiguchi (#50)
#52Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ishii Ayumi (#51)
#53Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#50)
#54Michael Paquier
michael@paquier.xyz
In reply to: Ishii Ayumi (#51)
#55Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#54)
#56Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#53)
#57Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#56)
#58Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#57)
#59Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#58)
#60Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#59)
#61Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#60)
#62Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#61)
#63Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#62)
#64Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#63)
#65Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#64)
#66Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#63)
#67Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#66)
#68Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#67)
#69Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#68)
#70Michael Paquier
michael@paquier.xyz
In reply to: Kyotaro Horiguchi (#69)
#71Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Michael Paquier (#70)
#72Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#71)
#73Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#72)
#74Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#73)
#75Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#74)
#76Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#74)
#77Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Kyotaro Horiguchi (#76)
#78Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Heikki Linnakangas (#77)
#79Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#78)
#80Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#79)