Patch: add conversion from pg_wchar to multibyte
Hackers,
The attached patch adds conversion from a pg_wchar string to a multibyte string.
This functionality is needed for my patch on index support for regular
expression search:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php
Analyzing the conversion from multibyte to pg_wchar, I found the following
types of conversions:
1) Trivial conversion for single-byte encodings. It just adds leading zeros
to each byte.
2) Conversion from UTF-8 to Unicode code points.
3) Conversions from the euc* encodings. They write the bytes of a character
into pg_wchar in inverse order, starting from the lowest byte (this
explanation assumes a little-endian system).
4) Conversion from the mule encoding. This conversion is unclear to me and
also seems to be lossy.
It was easy to write the inverse conversion for 1-3. I've changed conversion
4 to behave like 3. I'm not sure my change is OK, because I didn't understand
the original conversion.
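As a minimal illustration of case 1 (the function name and signature are mine, not the patch's actual API): since the forward direction only zero-extends each byte, the inverse just keeps the low byte.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int pg_wchar;

/*
 * Hypothetical sketch of the inverse of case 1 (single-byte encodings).
 * The forward conversion zero-extends each byte into a pg_wchar, so the
 * inverse simply truncates back to the low byte.
 */
static int
sketch_wchar2single_with_len(const pg_wchar *from, unsigned char *to, int len)
{
	int		cnt = 0;

	while (len > 0 && *from)
	{
		*to++ = (unsigned char) (*from++ & 0xFF);
		len--;
		cnt++;
	}
	*to = '\0';
	return cnt;
}
```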
------
With best regards,
Alexander Korotkov.
Attachments:
wchar2mb-0.1.patch (application/octet-stream, +197 -101)
Hi Alexander,
Perhaps I'm too early with these tests, but FWIW I reran my earlier test program against three
instances. (the patches compiled fine, and make check was without problem).
-- 3 instances:
HEAD port 6542
trgm_regex port 6547 HEAD + trgm-regexp patch (22 Nov 2011) [1]
trgm_regex_wchar2mb port 6549 HEAD + trgm-regexp + wchar2mb patch (23 Apr 2012) [2]
[1]: http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php
[2]: http://archives.postgresql.org/pgsql-hackers/2012-04/msg01095.php
-- table sizes:
azjunk4 10^4 rows 1 MB
azjunk5 10^5 rows 11 MB
azjunk6 10^6 rows 112 MB
azjunk7 10^7 rows 1116 MB
for table creation/structure, see:
[3]: http://archives.postgresql.org/pgsql-hackers/2012-01/msg01094.php
Results for three instances with 4 repetitions per instance are attached.
Although the regexes I chose are somewhat arbitrary, it does show some of the good, the bad and
the ugly of the patch(es). (Also: I've limited the tests to a range of 'workable' regexps, i.e.
avoiding unbounded regexps)
hth (and thanks, great work!),
Erik Rijkers
Attachments:
trgm_regex_test.out.20120429_1300.txt.gz (application/x-gzip)
On Sun, Apr 29, 2012 at 8:12 AM, Erik Rijkers <er@xs4all.nl> wrote:
Perhaps I'm too early with these tests, but FWIW I reran my earlier test program against three
instances. (the patches compiled fine, and make check was without problem).
These tests results seem to be more about the pg_trgm changes than the
patch actually on this thread, unless I'm missing something. But the
executive summary seems to be that pg_trgm might need to be a bit
smarter about costing the trigram-based search, because when the
number of trigrams is really big, using the index is
counterproductive. Hopefully that's not too hard to fix; the basic
approach seems quite promising.
(I haven't actually looked at the patch on this thread yet to
understand how it fits in; the above comments are about the pg_trgm
regex stuff.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
Hopefully that's not too hard to fix; the basic approach seems
quite promising.
After playing with trigram searches for name searches against copies
of a production database with appropriate indexing, our shop has
chosen it as the new way to do name searches here. It's really
nice.
My biggest complaint is related to setting the threshold for the %
operator. It seems to me that there should be a GUC to control the
default, and that there should be a way to set the threshold for
each % operator in a query (if there is more than one). The
function names which must be used on the connection before running
the queries don't give any clue that they are related to trigrams:
show_limit() and set_limit() are nearly useless for conveying the
semantics of what they do.
Even with those issues, trigram similarity searching is IMO one of
the top five coolest things about PostgreSQL and should be promoted
heavily.
-Kevin
Hi Erik
On Sun, Apr 29, 2012 at 4:12 PM, Erik Rijkers <er@xs4all.nl> wrote:
Perhaps I'm too early with these tests, but FWIW I reran my earlier test
program against three
instances. (the patches compiled fine, and make check was without
problem).
-- 3 instances:
HEAD port 6542
trgm_regex port 6547 HEAD + trgm-regexp patch (22 Nov 2011) [1]
trgm_regex_wchar2mb port 6549 HEAD + trgm-regexp + wchar2mb patch (23
Apr 2012) [2]
Actually, the wchar2mb patch doesn't affect the behaviour of trgm-regexp. It
provides a correct way to do some of the encoding-conversion work that the
last published version of trgm-regexp does internally. So "HEAD + trgm-regexp
patch" and "HEAD + trgm-regexp + wchar2mb patch" should behave similarly.
[1] http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php
[2] http://archives.postgresql.org/pgsql-hackers/2012-04/msg01095.php
-- table sizes:
azjunk4 10^4 rows 1 MB
azjunk5 10^5 rows 11 MB
azjunk6 10^6 rows 112 MB
azjunk7 10^7 rows 1116 MB
for table creation/structure, see:
[3] http://archives.postgresql.org/pgsql-hackers/2012-01/msg01094.php
Results for three instances with 4 repetitions per instance are attached.
Although the regexes I chose are somewhat arbitrary, it does show some of
the good, the bad and
the ugly of the patch(es). (Also: I've limited the tests to a range of
'workable' regexps, i.e.
avoiding unbounded regexps)
Thank you for testing!
Such synthetic tests are very valuable for finding corner cases of the
patch, bugs, etc.
But it would also be nice to do some tests on real-life datasets with
real-life regexps, in order to see the real benefit of this approach to
indexing and to do some comparison with other approaches. Maybe you or
somebody else could obtain such datasets?
Also, I did some optimizations in the algorithm. The automaton analysis stage
should become less CPU- and memory-consuming. I'll publish a new version soon.
------
With best regards,
Alexander Korotkov.
On Mon, Apr 30, 2012 at 10:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Apr 29, 2012 at 8:12 AM, Erik Rijkers <er@xs4all.nl> wrote:
Perhaps I'm too early with these tests, but FWIW I reran my earlier test
program against three
instances. (the patches compiled fine, and make check was without
problem).
These tests results seem to be more about the pg_trgm changes than the
patch actually on this thread, unless I'm missing something. But the
executive summary seems to be that pg_trgm might need to be a bit
smarter about costing the trigram-based search, because when the
number of trigrams is really big, using the index is
counterproductive. Hopefully that's not too hard to fix; the basic
approach seems quite promising.
Right. When the number of trigrams is big, it is slow to scan the posting
lists of all of them. The solution in this case is to exclude the most
frequent trigrams from the index scan. But that requires some kind of
statistics on trigram frequencies, which we don't have. We could estimate
frequencies using some hard-coded assumptions about natural languages, or we
could exclude arbitrary trigrams, but I don't like either of these ideas.
This problem is also relevant to LIKE/ILIKE searches using trigram indexes.
Something similar can occur in tsearch when we search for "frequent_term
& rare_term". In some situations (depending on the terms' frequencies) it's
better to exclude frequent_term from the index scan and do a recheck. We have
the relevant statistics to make such a decision, but it doesn't seem feasible
to get at them through the current GIN interface.
Perhaps you have some comments on the idea of conversion from pg_wchar to
multibyte? Is it acceptable at all?
------
With best regards,
Alexander Korotkov.
On Tue, May 1, 2012 at 1:48 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
My biggest complaint is related to setting the threshold for the %
operator. It seems to me that there should be a GUC to control the
default, and that there should be a way to set the threshold for
each % operator in a query (if there is more than one). The
function names which must be used on the connection before running
the queries don't give any clue that they are related to trigrams:
show_limit() and set_limit() are nearly useless for conveying the
semantics of what they do.
I think this problem could be avoided by introducing a composite type
representing a trigram similarity query. It could consist of a query text
and a similarity threshold. This type would serve a purpose similar to
tsquery or query_int. It would make queries more heavyweight, but would
allow using a distinct similarity threshold for each operator in the same query.
------
With best regards,
Alexander Korotkov.
On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Right. When number of trigrams is big, it is slow to scan posting list of
all of them. The solution is this case is to exclude most frequent trigrams
from index scan. But, it require some kind of statistics of trigrams
frequencies which we don't have. We could estimate frequencies using some
hard-coded assumptions about natural languages. Or we could exclude
arbitrary trigrams. But I don't like both these ideas. This problem is also
relevant for LIKE/ILIKE search using trigram indexes.
I was thinking you could perhaps do it just based on the *number* of
trigrams, not necessarily their frequency.
Perhaps you have some comments on the idea of conversion from pg_wchar to
multibyte? Is it acceptable at all?
Well, I'm not an expert on encodings, but it seems like a logical
extension of what we're doing right now, so I don't really see why
not. I'm confused by the diff hunks in pg_mule2wchar_with_len,
though. Presumably either the old code is right (in which case, don't
change it) or the new code is right (in which case, there's a bug fix
needed here that ought to be discussed and committed separately from
the rest of the patch). Maybe I am missing something.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 2, 2012 at 4:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Right. When the number of trigrams is big, it is slow to scan the posting
lists of all of them. The solution in this case is to exclude the most
frequent trigrams from the index scan. But that requires some kind of
statistics on trigram frequencies, which we don't have. We could estimate
frequencies using some hard-coded assumptions about natural languages, or we
could exclude arbitrary trigrams, but I don't like either of these ideas.
This problem is also relevant to LIKE/ILIKE searches using trigram indexes.
I was thinking you could perhaps do it just based on the *number* of
trigrams, not necessarily their frequency.
Imagine we have two queries:
1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
The first query requires reading the posting lists of trigrams "abc" and
"bcd". The second query requires reading the posting lists of trigrams "abc",
"bcd", "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
We could decide to use an index scan for the first query and a sequential
scan for the second because the number of posting lists to read is high. But
that is unreasonable, because the second query is actually narrower than the
first one: we can use the same index scan for it, and the recheck will remove
all false positives.
When the number of trigrams is high, we can just exclude some of them from
the index scan. That would be better than falling back to a sequential scan.
But the question is which trigrams to exclude? Ideally we would keep the
rarest trigrams to make the index scan cheaper.
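A tiny sketch of the arithmetic behind this example (the helper name is mine, not pg_trgm's actual extraction code): a constant substring of length n contributes n - 2 overlapping trigrams, so '%abcd%' needs 2 posting lists while '%abcdefghijk%' needs 9, even though the longer pattern is strictly more selective.

```c
#include <assert.h>
#include <string.h>

/*
 * Illustrative arithmetic only: count the overlapping trigrams a
 * constant LIKE-pattern body would require posting lists for.
 */
static int
sketch_trigram_count(const char *pattern_body)
{
	int		n = (int) strlen(pattern_body);

	return (n >= 3) ? n - 2 : 0;
}
```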
Perhaps you have some comments on the idea of conversion from pg_wchar to
multibyte? Is it acceptable at all?
Well, I'm not an expert on encodings, but it seems like a logical
extension of what we're doing right now, so I don't really see why
not. I'm confused by the diff hunks in pg_mule2wchar_with_len,
though. Presumably either the old code is right (in which case, don't
change it) or the new code is right (in which case, there's a bug fix
needed here that ought to be discussed and committed separately from
the rest of the patch). Maybe I am missing something.
Unfortunately, I didn't understand the original logic of
pg_mule2wchar_with_len. I just made a proposal about how it could work. I
hope somebody more familiar with this code will clarify the situation.
------
With best regards,
Alexander Korotkov.
On Wed, May 2, 2012 at 9:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
I was thinking you could perhaps do it just based on the *number* of
trigrams, not necessarily their frequency.
Imagine we have two queries:
1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
The first query requires reading the posting lists of trigrams "abc" and
"bcd". The second query requires reading the posting lists of trigrams "abc",
"bcd", "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
We could decide to use an index scan for the first query and a sequential
scan for the second because the number of posting lists to read is high. But
that is unreasonable, because the second query is actually narrower than the
first one: we can use the same index scan for it, and the recheck will remove
all false positives.
When the number of trigrams is high, we can just exclude some of them from
the index scan. That would be better than falling back to a sequential scan.
But the question is which trigrams to exclude? Ideally we would keep the
rarest trigrams to make the index scan cheaper.
True. I guess I was thinking more of the case where you've got
abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|.... There's probably some point
at which it becomes silly to think about using the index.
Perhaps you have some comments on the idea of conversion from pg_wchar to
multibyte? Is it acceptable at all?
Well, I'm not an expert on encodings, but it seems like a logical
extension of what we're doing right now, so I don't really see why
not. I'm confused by the diff hunks in pg_mule2wchar_with_len,
though. Presumably either the old code is right (in which case, don't
change it) or the new code is right (in which case, there's a bug fix
needed here that ought to be discussed and committed separately from
the rest of the patch). Maybe I am missing something.
Unfortunately, I didn't understand the original logic of
pg_mule2wchar_with_len. I just made a proposal about how it could work. I
hope somebody more familiar with this code will clarify the situation.
Well, do you think the current code is buggy, or not?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 2, 2012 at 5:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 2, 2012 at 9:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
Imagine we have two queries:
1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
The first query requires reading the posting lists of trigrams "abc" and
"bcd". The second query requires reading the posting lists of trigrams "abc",
"bcd", "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
We could decide to use an index scan for the first query and a sequential
scan for the second because the number of posting lists to read is high. But
that is unreasonable, because the second query is actually narrower than the
first one: we can use the same index scan for it, and the recheck will remove
all false positives.
When the number of trigrams is high, we can just exclude some of them from
the index scan. That would be better than falling back to a sequential scan.
But the question is which trigrams to exclude? Ideally we would keep the
rarest trigrams to make the index scan cheaper.
True. I guess I was thinking more of the case where you've got
abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|.... There's probably some point
at which it becomes silly to think about using the index.
Yes, such situations are also possible.
Well, I'm not an expert on encodings, but it seems like a logical
extension of what we're doing right now, so I don't really see why
not. I'm confused by the diff hunks in pg_mule2wchar_with_len,
though. Presumably either the old code is right (in which case, don't
change it) or the new code is right (in which case, there's a bug fix
needed here that ought to be discussed and committed separately from
the rest of the patch). Maybe I am missing something.
Unfortunately, I didn't understand the original logic of
pg_mule2wchar_with_len. I just made a proposal about how it could work. I
hope somebody more familiar with this code will clarify the situation.
Well, do you think the current code is buggy, or not?
Probably, but I'm not sure. The conversion seems lossy to me, unless I'm
missing something about the mule encoding.
------
With best regards,
Alexander Korotkov.
Hello, Ishii-san!
We talked at PGCon about my questions on the mule-to-wchar conversion. My
questions about the pg_mule2wchar_with_len function are as follows. In these
parts of the code:
else if (IS_LCPRV1(*from) && len >= 3)
{
from++;
*to = *from++ << 16;
*to |= *from++;
len -= 3;
}
and
else if (IS_LCPRV2(*from) && len >= 4)
{
from++;
*to = *from++ << 16;
*to |= *from++ << 8;
*to |= *from++;
len -= 4;
}
we skip the first character of the original string. Are we able to restore it
from pg_wchar?
Also, in this part of the code we're shifting the first byte by 16 bits:
if (IS_LC1(*from) && len >= 2)
{
*to = *from++ << 16;
*to |= *from++;
len -= 2;
}
else if (IS_LCPRV1(*from) && len >= 3)
{
from++;
*to = *from++ << 16;
*to |= *from++;
len -= 3;
}
Why don't we shift it by 8 bits?
You can see my patch in this thread, where I propose purely mechanical
changes to this function that make the inverse conversion possible.
------
With best regards,
Alexander Korotkov.
Hi Alexander,
It was good seeing you in Ottawa!
Hello, Ishii-san!
We talked at PGCon about my questions on the mule-to-wchar conversion. My
questions about the pg_mule2wchar_with_len function are as follows. In these
parts of the code:
else if (IS_LCPRV1(*from) && len >= 3)
{
from++;
*to = *from++ << 16;
*to |= *from++;
len -= 3;
}
and
else if (IS_LCPRV2(*from) && len >= 4)
{
from++;
*to = *from++ << 16;
*to |= *from++ << 8;
*to |= *from++;
len -= 4;
}
we skip the first character of the original string. Are we able to restore it
from pg_wchar?
I think it's possible. The first characters are defined like this:
#define IS_LCPRV1(c) ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
#define IS_LCPRV2(c) ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)
It seems IS_LCPRV1 is not used in any PostgreSQL-supported encoding at this
point, which means there is zero chance that an existing database includes
LCPRV1. So you could safely ignore it.
For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5)
in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
and it is fixed to 0x9d. So you can always restore the value to 0x9d.
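The observation above suggests a sketch of the inverse direction (function name and signature are illustrative, not the patch's): since LCPRV2 sequences in supported encodings always begin with 0x9d, the inverse conversion can emit that prefix unconditionally before the three bytes stored in the wchar.

```c
#include <assert.h>

/*
 * Sketch: restore an LCPRV2 multibyte sequence from a pg_wchar. The
 * leading 0x9d is not stored in the wchar, but since it is fixed for
 * all supported encodings it can be regenerated unconditionally.
 */
static int
sketch_wchar_to_lcprv2(unsigned int w, unsigned char *to)
{
	to[0] = 0x9d;				/* restored, not stored in the wchar */
	to[1] = (unsigned char) ((w >> 16) & 0xFF);
	to[2] = (unsigned char) ((w >> 8) & 0xFF);
	to[3] = (unsigned char) (w & 0xFF);
	return 4;
}
```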
Also, in this part of the code we're shifting the first byte by 16 bits:
if (IS_LC1(*from) && len >= 2)
{
*to = *from++ << 16;
*to |= *from++;
len -= 2;
}
else if (IS_LCPRV1(*from) && len >= 3)
{
from++;
*to = *from++ << 16;
*to |= *from++;
len -= 3;
}
Why don't we shift it by 8 bits?
Because we want the first byte in the LC1 case to be placed in the second
byte of the wchar, i.e.:
byte 0: always 0
byte 1: leading byte (the first byte of the multibyte)
byte 2: always 0
byte 3: the second byte of the multibyte
Note that we always assume that byte 1 (called the "leading byte", LB for
short) represents the id of the character set (from 0x81 to 0xff) in the
MULE INTERNAL encoding. For the mapping between LB and charsets, see
pg_wchar.h.
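The byte layout above can be sketched as a round trip (helper names are mine, not pg_wchar.h's): the leading byte lands in byte 1 of the wchar and the second multibyte byte in byte 3, which is exactly what the "<< 16" shift achieves.

```c
#include <assert.h>

typedef unsigned int pg_wchar;

/* Pack an LC1 two-byte sequence: LB into byte 1, second byte into byte 3. */
static pg_wchar
sketch_lc1_to_wchar(unsigned char lb, unsigned char b2)
{
	return ((pg_wchar) lb << 16) | b2;
}

/* Unpack it again; nothing is lost, so the inverse conversion is exact. */
static void
sketch_wchar_to_lc1(pg_wchar w, unsigned char *lb, unsigned char *b2)
{
	*lb = (unsigned char) ((w >> 16) & 0xFF);
	*b2 = (unsigned char) (w & 0xFF);
}
```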
You can see my patch in this thread, where I propose purely mechanical
changes to this function that make the inverse conversion possible.
------
With best regards,
Alexander Korotkov.
On Tue, May 22, 2012 at 11:50 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
I think it's possible. The first characters are defined like this:
#define IS_LCPRV1(c) ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
#define IS_LCPRV2(c) ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)
It seems IS_LCPRV1 is not used in any PostgreSQL-supported encoding at this
point, which means there is zero chance that an existing database includes
LCPRV1. So you could safely ignore it.
For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5)
in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
and it is fixed to 0x9d. So you can always restore the value to 0x9d.
Also, in this part of the code we're shifting the first byte by 16 bits:
if (IS_LC1(*from) && len >= 2)
{
*to = *from++ << 16;
*to |= *from++;
len -= 2;
}
else if (IS_LCPRV1(*from) && len >= 3)
{
from++;
*to = *from++ << 16;
*to |= *from++;
len -= 3;
}
Why don't we shift it by 8 bits?
Because we want the first byte in the LC1 case to be placed in the second
byte of the wchar, i.e.:
byte 0: always 0
byte 1: leading byte (the first byte of the multibyte)
byte 2: always 0
byte 3: the second byte of the multibyte
Note that we always assume that byte 1 (called the "leading byte", LB for
short) represents the id of the character set (from 0x81 to 0xff) in the
MULE INTERNAL encoding. For the mapping between LB and charsets, see
pg_wchar.h.
Thanks for your comments. They clarify a lot.
But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2. Isn't it
possible for them to produce the same pg_wchar?
------
With best regards,
Alexander Korotkov.
Thanks for your comments. They clarify a lot.
But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2. Isn't it
possible for them to produce the same pg_wchar?
If LB is in 0x90 - 0x99 range, then they are LC2.
If LB is in 0xf0 - 0xff range, then they are LCPRV2.
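The disambiguation rule above can be sketched as a pair of predicates (helper names are mine, not pg_wchar.h's): after conversion, the leading byte alone tells LC2 and LCPRV2 apart, so the inverse conversion can branch on its range.

```c
#include <assert.h>

/* LC2: the leading byte of the wchar falls in 0x90 - 0x99. */
static int
sketch_lb_is_lc2(unsigned char lb)
{
	return lb >= 0x90 && lb <= 0x99;
}

/* LCPRV2: the leading byte falls in 0xf0 - 0xff. */
static int
sketch_lb_is_lcprv2(unsigned char lb)
{
	return lb >= 0xf0;			/* unsigned char, so <= 0xff always holds */
}
```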
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Tue, May 22, 2012 at 3:27 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
Thanks for your comments. They clarify a lot.
But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2. Isn't it
possible for them to produce the same pg_wchar?
If LB is in the 0x90 - 0x99 range, then they are LC2.
If LB is in the 0xf0 - 0xff range, then they are LCPRV2.
Thanks. I rewrote the inverse conversion from pg_wchar to mule. A new
version of the patch is attached.
------
With best regards,
Alexander Korotkov.
Attachments:
wchar2mb-0.2.patch (application/octet-stream, +267 -96)
On Tue, May 22, 2012 at 3:27 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
Thanks for your comments. They clarify a lot.
But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2. Isn't it
possible for them to produce the same pg_wchar?
If LB is in the 0x90 - 0x99 range, then they are LC2.
If LB is in the 0xf0 - 0xff range, then they are LCPRV2.
Thanks. I rewrote the inverse conversion from pg_wchar to mule. A new
version of the patch is attached.
[forgot to cc: to the list]
I looked into your patch, especially pg_wchar2euc_with_len(const
pg_wchar *from, unsigned char *to, int len).
I think there's a little room to enhance the function.
if (*from >> 24)
{
*to++ = *from >> 24;
*to++ = (*from >> 16) & 0xFF;
*to++ = (*from >> 8) & 0xFF;
*to++ = *from & 0xFF;
cnt += 4;
}
Since the function walks through this for every single wchar, something like:
if ((c = *from >> 24))
{
*to++ = c;
*to++ = (*from >> 16) & 0xFF;
*to++ = (*from >> 8) & 0xFF;
*to++ = *from & 0xFF;
cnt += 4;
}
will save a few cycles (though I'm not sure whether the optimizer produces
similar code anyway).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Thu, May 24, 2012 at 12:04 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
Thanks. I rewrote inverse conversion from pg_wchar to mule. New version of
patch is attached.
Review:
It looks to me like pg_wchar2utf_with_len will not work, because
unicode_to_utf8 returns its second argument unmodified - not, as your
code seems to assume, the byte following what was already written.
MULE also looks problematic. The code that you've written isn't
symmetric with the opposite conversion, unlike what you did in all
other cases, and I don't understand why. I'm also somewhat baffled by
the reverse conversion: it treats a multi-byte sequence beginning with
a byte for which IS_LCPRV1(x) returns true as invalid if there are
less than 3 bytes available, but it only reads two; similarly, for
IS_LCPRV2(x), it demands 4 bytes but converts only 3.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 27, 2012 at 11:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
It looks to me like pg_wchar2utf_with_len will not work, because
unicode_to_utf8 returns its second argument unmodified - not, as your
code seems to assume, the byte following what was already written.
Fixed.
MULE also looks problematic. The code that you've written isn't
symmetric with the opposite conversion, unlike what you did in all
other cases, and I don't understand why. I'm also somewhat baffled by
the reverse conversion: it treats a multi-byte sequence beginning with
a byte for which IS_LCPRV1(x) returns true as invalid if there are
less than 3 bytes available, but it only reads two; similarly, for
IS_LCPRV2(x), it demands 4 bytes but converts only 3.
Should we keep the existing pg_wchar representation for the MULE encoding?
Perhaps we can modify it as in version 0.1 of the patch in order to make it
more transparent.
------
With best regards,
Alexander Korotkov.
Attachments:
wchar2mb-0.4.patch (application/octet-stream, +272 -96)
On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
MULE also looks problematic. The code that you've written isn't
symmetric with the opposite conversion, unlike what you did in all
other cases, and I don't understand why. I'm also somewhat baffled by
the reverse conversion: it treats a multi-byte sequence beginning with
a byte for which IS_LCPRV1(x) returns true as invalid if there are
less than 3 bytes available, but it only reads two; similarly, for
IS_LCPRV2(x), it demands 4 bytes but converts only 3.
Should we keep the existing pg_wchar representation for the MULE encoding?
Perhaps we can modify it as in version 0.1 of the patch in order to make it
more transparent.
Changing the encoding would break pg_upgrade, so -1 from me on that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company