Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

Started by David HJover 2 years ago8 messagesgeneral
Jump to latest
#1David HJ
chuxiongzhong@gmail.com

Hello PostgreSQL Community,

I am writing to propose an alternative compilation of PostgreSQL that
allows for a 256-byte identifier length limit, alongside the existing
64-byte version.

*Problem:*
The current limit of 63 bytes can be quite restrictive, especially for
databases that use multi-byte character sets like UTF-8. In such cases, a
Chinese character takes up 3 bytes, limiting the name to just 21 characters.

*Use-Case:*
In specific use-cases, table names like
"能源消耗统计_全球主要国家石油与天然气使用量_年度碳排放与可再生能源比例表" and
"气候变化研究_全球主要地区极端天气事件频率_年度灾害损失与应对措施表" can far exceed the current limit.

*Proposed Solution:*
I propose that we offer an alternative compilation of PostgreSQL that
increases the NAMEDATALEN constant to allow for 256-byte identifiers. This
would be particularly useful for databases that make extensive use of
multi-byte character sets like UTF-8.

*Dual Compilation:*
To maintain backward compatibility and to offer more flexibility, I suggest
compiling this 256-byte version alongside the existing 64-byte version.
Users could then choose which version best suits their needs.

*Backward Compatibility:*
I understand that this change could affect backward compatibility. Offering
this as an alternative compilation could be a way to mitigate those
concerns. I am open to suggestions on how this could be implemented to
minimize disruptions.

I look forward to your feedback and hope we can make PostgreSQL even more
adaptable to various needs.

Best regards,
Chuxiong

#2Laurenz Albe
laurenz.albe@cybertec.at
In reply to: David HJ (#1)
Re: Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

On Tue, 2023-10-10 at 11:49 +0800, David HJ wrote:

I am writing to propose an alternative compilation of PostgreSQL that allows
for a 256-byte identifier length limit, alongside the existing 64-byte version.

Problem:
The current limit of 63 bytes can be quite restrictive, especially for databases
that use multi-byte character sets like UTF-8. In such cases, a Chinese character
takes up 3 bytes, limiting the name to just 21 characters.

Use-Case:
In specific use-cases, table names like
"能源消耗统计_全球主要国家石油与天然气使用量_年度碳排放与可再生能源比例表" and
"气候变化研究_全球主要地区极端天气事件频率_年度灾害损失与应对措施表" can far exceed
the current limit.

I have little understanding for people who name their tables
energy_consumption_statistics_oil_and_natural_gas_usage_in_major_countries_around_the_world_annual_carbon_emissions_and_renewable_energy_ratio_table.
The table could just as well be named "能源统计排放比". The exact description could
go in the table comment. Apart from that, it is a good idea to use table names
that are standard SQL identifiers, so that you don't have to double quote them
all the time.

Anyway, you are not the first person to hit the limit, so there is clearly a
pain that some people feel.

Proposed Solution:
I propose that we offer an alternative compilation of PostgreSQL that increases
the NAMEDATALEN constant to allow for 256-byte identifiers. This would be
particularly useful for databases that make extensive use of multi-byte character
sets like UTF-8.

Wouldn't it be a good solution if we promote the #define to a configure option,
like "./configure --identifier-length-limit=256"?

Yours,
Laurenz Albe

#3Julien Rouhaud
rjuju123@gmail.com
In reply to: Laurenz Albe (#2)
Re: Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

On Tue, Oct 10, 2023 at 08:22:55AM +0200, Laurenz Albe wrote:

On Tue, 2023-10-10 at 11:49 +0800, David HJ wrote:

I am writing to propose an alternative compilation of PostgreSQL that allows
for a 256-byte identifier length limit, alongside the existing 64-byte version.

Problem:
The current limit of 63 bytes can be quite restrictive, especially for databases
that use multi-byte character sets like UTF-8. In such cases, a Chinese character
takes up 3 bytes, limiting the name to just 21 characters.

Anyway, you are not the first person to hit the limit, so there is clearly a
pain that some people feel.

Proposed Solution:
I propose that we offer an alternative compilation of PostgreSQL that increases
the NAMEDATALEN constant to allow for 256-byte identifiers. This would be
particularly useful for databases that make extensive use of multi-byte character
sets like UTF-8.

Wouldn't it be a good solution if we promote the #define to a configure option,
like "./configure --identifier-length-limit=256"?

Note that there was some thread recently [1]/messages/by-id/20230630211153.kbysulcjedxa5ii6@awork3.anarazel.de and following messages where the possibility of having
some kind of compilation matrix to generate multiple set of binaries with
various compile-time values was discussed, so I guess it could fit well with
that approach.

[1]: /messages/by-id/20230630211153.kbysulcjedxa5ii6@awork3.anarazel.de and following messages
/messages/by-id/20230630211153.kbysulcjedxa5ii6@awork3.anarazel.de
and following messages

#4Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: David HJ (#1)
Re: Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

Hello PostgreSQL Community,

I am writing to propose an alternative compilation of PostgreSQL that
allows for a 256-byte identifier length limit, alongside the existing
64-byte version.

*Problem:*
The current limit of 63 bytes can be quite restrictive, especially for
databases that use multi-byte character sets like UTF-8. In such cases, a
Chinese character takes up 3 bytes, limiting the name to just 21 characters.

*Use-Case:*
In specific use-cases, table names like
"能源消耗统计_全球主要国家石油与天然气使用量_年度碳排放与可再生能源比例表" and
"气候变化研究_全球主要地区极端天气事件频率_年度灾害损失与应对措施表" can far exceed the current limit.

*Proposed Solution:*
I propose that we offer an alternative compilation of PostgreSQL that
increases the NAMEDATALEN constant to allow for 256-byte identifiers. This
would be particularly useful for databases that make extensive use of
multi-byte character sets like UTF-8.

Another solution would be, letting the meaning of NAMEDATALEN to be
number of *characters*, not the number of bytes. This way, you can use
up to 64 UTF-8 characters. In my understanding MySQL already does this
way. I know this requires non trivial code modifications to PostgreSQL
but would be better than to make binaries with random NAMEDATALEN
values.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

#5Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Tatsuo Ishii (#4)
Re: Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

On Tue, 2023-10-10 at 15:53 +0900, Tatsuo Ishii wrote:

Another solution would be, letting the meaning of NAMEDATALEN to be
number of *characters*, not the number of bytes. This way, you can use
up to 64 UTF-8 characters. In my understanding MySQL already does this
way. I know this requires non trivial code modifications to PostgreSQL
but would be better than to make binaries with random NAMEDATALEN
values.

Since "name" is a fixed-length data type, that would require the stored
size to increase to accomodate the extra bytes. Wouldn't that change the
storage format and break pg_upgrade?

Yours,
Laurenz Albe

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Laurenz Albe (#5)
Re: Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Tue, 2023-10-10 at 15:53 +0900, Tatsuo Ishii wrote:

Another solution would be, letting the meaning of NAMEDATALEN to be
number of *characters*, not the number of bytes. This way, you can use
up to 64 UTF-8 characters. In my understanding MySQL already does this
way. I know this requires non trivial code modifications to PostgreSQL
but would be better than to make binaries with random NAMEDATALEN
values.

Since "name" is a fixed-length data type, that would require the stored
size to increase to accomodate the extra bytes. Wouldn't that change the
storage format and break pg_upgrade?

Yeah, the real reason this is unlikely to happen is precisely that
"name" is fixed-length. Increasing the standard NAMEDATALEN by 4x,
or even 2x, has been proposed and rejected many times before because
of the bloat it would cause in places like pg_attribute, pg_proc,
in-memory tuple descriptors, etc.

The real way forward IMO is to find a way to make "name" variable-length,
thus both satisfying people who need a few long names and reducing
overhead for everybody. This is difficult to do without breaking
mountains of backend code, but there's been some discussions about
ways to accomplish that. The most recent thread I could find is

/messages/by-id/CALSd-crdmj9PGdvdioU=a5W7P=TgNmEB2QP9wiF6DTUbBuMXrQ@mail.gmail.com

regards, tom lane

#7Peter Eisentraut
peter_e@gmx.net
In reply to: Laurenz Albe (#2)
Re: Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

On 10.10.23 08:22, Laurenz Albe wrote:

Apart from that, it is a good idea to use table names that are standard
SQL identifiers, so that you don't have to double quote them all the time.

FWIW, the Chinese character sequences posted here would be valid
unquoted identifiers if PostgreSQL implemented standard SQL
correctly/completely.

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Eisentraut (#7)
Re: Proposal to Compile a 256-Byte Identifier Length Version Alongside the Current 64-Byte Version

Peter Eisentraut <peter@eisentraut.org> writes:

On 10.10.23 08:22, Laurenz Albe wrote:

Apart from that, it is a good idea to use table names that are standard
SQL identifiers, so that you don't have to double quote them all the time.

FWIW, the Chinese character sequences posted here would be valid
unquoted identifiers if PostgreSQL implemented standard SQL
correctly/completely.

I'm pretty sure they're valid unquoted identifiers today,
because by and large we'll take any non-ASCII as identifier
characters. Conforming to the letter of the spec would
reduce, not increase, the set of strings we'll call identifiers.

regards, tom lane