What is the maximum encoding-conversion growth rate, anyway?

Started by Tom Lane, over 18 years ago, 15 messages

tgl@sss.pgh.pa.us

over 18 years ago

I just rearranged the code in mbutils.c a little bit to make it more
robust if conversion of an over-length string is attempted, and noted
this comment:

/*
* When converting strings between different encodings, we assume that space
* for converted result is 4-to-1 growth in the worst case. The rate for
* currently supported encoding pairs are within 3 (SJIS JIS X0201 half width
* kanna -> UTF8 is the worst case). So "4" should be enough for the moment.
*
* Note that this is not the same as the maximum character width in any
* particular encoding.
*/
#define MAX_CONVERSION_GROWTH 4
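
The worst case named in that comment can be checked by hand. What follows is a minimal sketch, not PostgreSQL's conversion code: it assumes the standard mapping of JIS X 0201 half-width katakana bytes 0xA1..0xDF onto U+FF61..U+FF9F, and the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a BMP code point (below U+10000) as UTF-8; returns bytes written. */
static size_t utf8_encode_bmp(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
}

/*
 * Convert one SJIS half-width kana byte (0xA1..0xDF) to UTF-8.
 * Returns the number of bytes written, or 0 if the byte is out of range.
 */
static size_t sjis_halfwidth_kana_to_utf8(unsigned char b, unsigned char *out)
{
    if (b < 0xA1 || b > 0xDF)
        return 0;
    return utf8_encode_bmp((uint32_t) b + 0xFEC0, out);
}
```

Every byte in that range lands at or above U+FF61, so each single input byte yields three output bytes, which is the 3-to-1 growth the comment mentions.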

It strikes me that this is overly pessimistic, since we do not support
5- or 6-byte UTF8 characters, and AFAICS there are no 1-byte characters
in any supported encoding that require 4 bytes in another. Could we
reduce the multiplier to 3? Or even 2? This has a direct impact on the
longest COPY lines we can support, so I'd like it not to be larger than
necessary.
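
To make the COPY impact concrete, here is a rough sketch of the arithmetic. It assumes the output buffer is sized as `len * MAX_CONVERSION_GROWTH + 1` bytes and that a single allocation is capped at 0x3fffffff bytes (1 GB minus one, the value of PostgreSQL's MaxAllocSize); `ALLOC_LIMIT` and `max_convertible_len` are illustrative names, not identifiers from mbutils.c.

```c
#include <stddef.h>

/* Assumed cap on one allocation (1 GB - 1). */
#define ALLOC_LIMIT ((size_t) 0x3fffffff)

/*
 * Longest input, in bytes, whose worst-case output buffer of
 * len * growth + 1 bytes still fits under ALLOC_LIMIT.
 */
static size_t max_convertible_len(size_t growth)
{
    return (ALLOC_LIMIT - 1) / growth;
}
```

Under these assumptions a multiplier of 4 allows inputs of roughly 256 MB, 3 allows roughly 341 MB, and 2 allows roughly 512 MB.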

regards, tom lane

ishii@postgresql.org

over 18 years ago

In reply to: Tom Lane (#1)

Re: What is the maximum encoding-conversion growth rate, anyway?

I'm afraid we have to make it larger, rather than smaller, for 8.3. For
example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
UTF-8 sequences (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

So the worst case is now 6, rather than 3.
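
The two UTF-8 sequences quoted here can be decoded to see what the pair represents. This is a small illustrative sketch (the function and array names are made up for the example): the first sequence is U+304B, HIRAGANA LETTER KA, and the second is U+309A, the combining semi-voiced sound mark, so one 2-byte SHIFT_JIS_2004 character becomes a base letter plus a combining mark.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one 3-byte UTF-8 sequence into its code point (no validation). */
static uint32_t utf8_decode3(const unsigned char *s)
{
    return ((uint32_t) (s[0] & 0x0F) << 12) |
           ((uint32_t) (s[1] & 0x3F) << 6) |
           (uint32_t) (s[2] & 0x3F);
}

/* The two sequences from the example: e3 81 8b followed by e3 82 9a. */
static const unsigned char sjis2004_82f5_as_utf8[] = {
    0xE3, 0x81, 0x8B, 0xE3, 0x82, 0x9A
};
```

Two input bytes therefore produce six output bytes in total.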

Can we add a column to pg_conversion which represents the "growth
rate"? This would bring the rate for most encodings well below 6.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Tatsuo Ishii (#2)

Re: What is the maximum encoding-conversion growth rate, anyway?

Tatsuo Ishii <ishii@postgresql.org> writes:

I'm afraid we have to make it larger, rather than smaller, for 8.3. For
example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
UTF-8 sequences (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

So the worst case is now 6, rather than 3.

Yipes.

Can we add a column to pg_conversion which represents the "growth
rate"? This would bring the rate for most encodings well below 6.

We need to do something, but the pg_conversion catalog seems a bad place
to put the info --- don't we have places that need to be able to do
conversion without catalog access?

Perhaps better would be to redefine the API for the conversion functions
so that they palloc their own result space. Then each conversion
function would have to know the maximum growth rate for its particular
conversion. This change would also make it feasible for a conversion
function to prescan the data and determine an exact output size, if that
seemed worthwhile because the potential growth rate was too extreme.
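
As an illustration of the prescan idea, here is a minimal two-pass sketch for one simple conversion, LATIN1 to UTF-8. It is only a sketch under stated assumptions: it uses plain `malloc` rather than `palloc`, the function names are hypothetical, and it does not reflect the actual conversion-function signature.

```c
#include <stddef.h>
#include <stdlib.h>

/* Pass 1: exact UTF-8 size of a LATIN1 string (1 byte for ASCII, else 2). */
static size_t latin1_utf8_size(const unsigned char *src, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        n += (src[i] < 0x80) ? 1 : 2;
    return n;
}

/*
 * Pass 2: allocate exactly what is needed and convert. Returns a
 * NUL-terminated buffer the caller must free(), or NULL on failure.
 */
static unsigned char *latin1_to_utf8(const unsigned char *src, size_t len,
                                     size_t *out_len)
{
    size_t need = latin1_utf8_size(src, len);
    unsigned char *dst = malloc(need + 1);
    unsigned char *p = dst;

    if (dst == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            *p++ = src[i];
        } else {
            *p++ = (unsigned char) (0xC0 | (src[i] >> 6));
            *p++ = (unsigned char) (0x80 | (src[i] & 0x3F));
        }
    }
    *p = '\0';
    *out_len = need;
    return dst;
}
```

The trade-off is an extra pass over the input in exchange for never allocating more than the result needs.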

regards, tom lane

ishii@postgresql.org

over 18 years ago

In reply to: Tom Lane (#3)

Re: What is the maximum encoding-conversion growth rate, anyway?

Can we add a column to pg_conversion which represents the "growth
rate"? This would bring the rate for most encodings well below 6.

We need to do something, but the pg_conversion catalog seems a bad place
to put the info --- don't we have places that need to be able to do
conversion without catalog access?

Can you tell me where? I thought conversion functions are always
called via OidFunctionCall5, so we need to consult the
pg_conversion catalog beforehand anyway.

--
Tatsuo Ishii
SRA OSS, Inc. Japan

mike@fuhr.org

over 18 years ago

In reply to: Tom Lane (#3)

Re: What is the maximum encoding-conversion growth rate, anyway?

On Mon, May 28, 2007 at 10:23:42PM -0400, Tom Lane wrote:

Tatsuo Ishii <ishii@postgresql.org> writes:

I'm afraid we have to make it larger, rather than smaller, for 8.3. For
example, 0x82f5 in SHIFT_JIS_2004 (new in 8.3) becomes a *pair* of 3-byte
UTF-8 sequences (0x00e3818b and 0x00e3829a). See
util/mb/Unicode/shift_jis_2004_to_utf8_combined.map for more details.

So the worst case is now 6, rather than 3.

Yipes.

Isn't MAX_CONVERSION_GROWTH a multiplier? Doesn't 2 bytes becoming
2 * 3 bytes represent a growth of 3, not 6? Or does that 2-byte
SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
encoding? Or am I missing something?

--
Michael Fuhr

ishii@postgresql.org

over 18 years ago

In reply to: Michael Fuhr (#5)

Re: What is the maximum encoding-conversion growth rate, anyway?

Isn't MAX_CONVERSION_GROWTH a multiplier? Doesn't 2 bytes becoming
2 * 3 bytes represent a growth of 3, not 6? Or does that 2-byte
SHIFT_JIS_2004 sequence have a 1-byte sequence in another supported
encoding? Or am I missing something?

Oops. You are right. MAX_CONVERSION_GROWTH should be 3 (= (2*3)/2),
rather than 6, for this case.

So it seems we could safely lower MAX_CONVERSION_GROWTH to 3 for
the moment.
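
The arithmetic in this exchange is output bytes divided by input bytes, rounded up to a whole multiplier. A one-function sketch (the name `growth_factor` is just for illustration):

```c
#include <stddef.h>

/* Smallest integer multiplier m such that in_bytes * m >= out_bytes. */
static size_t growth_factor(size_t in_bytes, size_t out_bytes)
{
    return (out_bytes + in_bytes - 1) / in_bytes;
}
```

For the SHIFT_JIS_2004 example, `growth_factor(2, 6)` is 3, the same as the 1-byte half-width kana case, `growth_factor(1, 3)`.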
--
Tatsuo Ishii
SRA OSS, Inc. Japan

ishii@postgresql.org

over 18 years ago

In reply to: Tatsuo Ishii (#6)

Re: What is the maximum encoding-conversion growth rate, anyway?

Thinking about it more, it struck me that users can define a conversion
with an arbitrary growth rate by using CREATE CONVERSION. So it seems we
need a way to declare the growth rate anyway.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

tgl@sss.pgh.pa.us

over 18 years ago

In reply to: Tatsuo Ishii (#7)

Re: What is the maximum encoding-conversion growth rate, anyway?

Tatsuo Ishii <ishii@postgresql.org> writes:

Thinking about it more, it struck me that users can define a conversion
with an arbitrary growth rate by using CREATE CONVERSION. So it seems we
need a way to declare the growth rate anyway.

Seems to me that would be an argument for moving the palloc inside the
conversion functions, as I suggested before.

In practice though, I find it hard to imagine a pair of encodings for
which the growth rate is more than 3x. You'd need something that
translates a single-byte character into 4 or more bytes (pretty
unlikely, especially considering we require all these encodings to be
ASCII supersets); or something that translates a 2-byte character into
more than 6 bytes.

regards, tom lane

mike@fuhr.org

over 18 years ago

In reply to: Tom Lane (#8)

Re: What is the maximum encoding-conversion growth rate, anyway?

On Tue, May 29, 2007 at 10:00:06AM -0400, Tom Lane wrote:

In practice though, I find it hard to imagine a pair of encodings for
which the growth rate is more than 3x. You'd need something that
translates a single-byte character into 4 or more bytes (pretty
unlikely, especially considering we require all these encodings to be
ASCII supersets); or something that translates a 2-byte character into
more than 6 bytes.

Many characters in the 0x80..0xff range of single-byte encodings
like LATIN1 become four bytes in GB18030 (e.g., LATIN1 f1 = GB18030
81 30 8a 39). PostgreSQL doesn't currently support such conversions
but it's something to be aware of.
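
To show what such a conversion would do to a buffer sized with a multiplier of 3, here is a rough sketch. It pessimistically assumes every byte in 0x80..0xff becomes 4 bytes (the message says "many", not all), and the names `ASSUMED_GROWTH`, `latin1_gb18030_size` and `fits_preallocated` are hypothetical.

```c
#include <stddef.h>

#define ASSUMED_GROWTH 3   /* the multiplier under discussion */

/*
 * Output size of a hypothetical LATIN1 -> GB18030 conversion, assuming
 * ASCII stays 1 byte and every byte in 0x80..0xff becomes 4 bytes.
 */
static size_t latin1_gb18030_size(const unsigned char *src, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        n += (src[i] < 0x80) ? 1 : 4;
    return n;
}

/* Does the result (plus a NUL) fit in a len * ASSUMED_GROWTH + 1 buffer? */
static int fits_preallocated(const unsigned char *src, size_t len)
{
    return latin1_gb18030_size(src, len) + 1 <= len * ASSUMED_GROWTH + 1;
}
```

A string made only of such high bytes would not fit, while a mostly-ASCII string would, which is why a fixed multiplier has to be chosen for the worst case rather than the typical one.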

--
Michael Fuhr

Jeroen T. Vermeulen

jtv@xs4all.nl

over 18 years ago

In reply to: Tatsuo Ishii (#7)

Re: What is the maximum encoding-conversion growth rate, anyway?

On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:

Thinking about it more, it struck me that users can define a conversion
with an arbitrary growth rate by using CREATE CONVERSION. So it seems we
need a way to declare the growth rate anyway.

Would it make sense to define just the longest and shortest character
lengths for an encoding? Then for any conversion you'd have a safe
estimate of

ceil(target_encoding.max_char_len / source_encoding.min_char_len)

...without going through every possible conversion.
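
The proposed estimate could be sketched as follows. The `EncodingLimits` struct and its field values are hypothetical; nothing like it is described in the thread as existing in the catalogs.

```c
#include <stddef.h>

/* Hypothetical per-encoding limits, as proposed in the message above. */
typedef struct {
    size_t min_char_len;  /* shortest character, in bytes */
    size_t max_char_len;  /* longest character, in bytes */
} EncodingLimits;

/* ceil(target.max_char_len / source.min_char_len) using integer math. */
static size_t estimated_growth(EncodingLimits source, EncodingLimits target)
{
    return (target.max_char_len + source.min_char_len - 1)
           / source.min_char_len;
}
```

With a source whose shortest character is 1 byte and a target whose longest is 4 bytes, the estimate is 4. It is conservative because it pairs the longest target character with the shortest source character, and it is a bound only if each source character maps to exactly one target character.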

Jeroen

bruce@momjian.us

over 18 years ago

In reply to: Tom Lane (#1)

Re: What is the maximum encoding-conversion growth rate, anyway?

Where are we on this?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

ishii@postgresql.org

over 18 years ago

In reply to: Jeroen T. Vermeulen (#10)

Re: What is the maximum encoding-conversion growth rate, anyway?

Sorry for the delay.

On Tue, May 29, 2007 20:51, Tatsuo Ishii wrote:

Thinking about it more, it struck me that users can define a conversion
with an arbitrary growth rate by using CREATE CONVERSION. So it seems we
need a way to declare the growth rate anyway.

Would it make sense to define just the longest and shortest character
lengths for an encoding? Then for any conversion you'd have a safe
estimate of

ceil(target_encoding.max_char_len / source_encoding.min_char_len)

...without going through every possible conversion.

This will not work, since certain conversions allow n-character to
m-character mappings.
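
A small sketch of why n-to-m mappings defeat a per-character estimate: it measures the worst multiplier over a table of mappings. The table type and its contents are hypothetical and stand in for whatever a user-defined conversion might do.

```c
#include <stddef.h>

/* One entry of a hypothetical user-defined conversion table. */
typedef struct {
    size_t in_bytes;   /* bytes consumed from the source */
    size_t out_bytes;  /* bytes produced in the target */
} MappingEntry;

/* Worst observed multiplier over the table, rounded up. */
static size_t worst_growth(const MappingEntry *map, size_t n)
{
    size_t worst = 0;

    for (size_t i = 0; i < n; i++) {
        size_t g = (map[i].out_bytes + map[i].in_bytes - 1) / map[i].in_bytes;

        if (g > worst)
            worst = g;
    }
    return worst;
}
```

An entry such as `{1, 8}`, one 1-byte character expanding to two 4-byte characters, gives a multiplier of 8 even though no single target character is longer than 4 bytes.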
--
Tatsuo Ishii
SRA OSS, Inc. Japan

ishii@postgresql.org

over 18 years ago

In reply to: Bruce Momjian (#11)

Re: What is the maximum encoding-conversion growth rate, anyway?

The conclusion of the discussion appears to be that we could safely
reduce MAX_CONVERSION_GROWTH from 4 to 3 for all existing built-in
conversions.

However, since user-defined conversions could have an arbitrary growth
rate, it would probably be better to leave it as it is for now.

For 8.4, maybe we could change the conversion functions' signature so
that we don't need a fixed growth rate, as Tom suggested.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

bruce@momjian.us

over 18 years ago

In reply to: Tatsuo Ishii (#13)

Re: What is the maximum encoding-conversion growth rate, anyway?

This has been saved for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

bruce@momjian.us

almost 18 years ago

In reply to: Tom Lane (#8)

Re: What is the maximum encoding-conversion growth rate, anyway?

Added to TODO:

* Change memory allocation for multi-byte functions so memory is
allocated inside conversion functions

Currently we preallocate memory based on worst-case usage.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +