NAMEDATALEN increase because of non-latin languages
Hello dear hackers. I understand the position of the developers community
about NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but
only if we speak about latin languages.
Postgresql has wonderful support for unicode in table and column names. And
it looks like very good idea to create table with names on native language
for databases across the world. But when I want to create, for example,
table with name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in
Russian for catalog of counteragent contacts) it will be auto-shrinked to
"Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem
- many words in Russian are just longer than it's English counterparts and
I have many examples like this.
Although recompiling the source is not so hard, updating is hard. I know
that is not free for disk space because of storing table names and field
names but, from my point of view, in 2021 year convenience is more
important than disk space.
I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future
releases.
Sorry for wasted time for this message if this topic is not match with
direction of postgresql development (and thank you for your hard work)
On Wed, Aug 18, 2021 at 7:08 PM Денис Романенко <deromanenko@gmail.com> wrote:
Hello dear hackers. I understand the position of the developers community about NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but only if we speak about latin languages.
Postgresql has wonderful support for unicode in table and column names. And it looks like very good idea to create table with names on native language for databases across the world. But when I want to create, for example, table with name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian for catalog of counteragent contacts) it will be auto-shrinked to "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem - many words in Russian are just longer than it's English counterparts and I have many examples like this.
Although recompiling the source is not so hard, updating is hard. I know that is not free for disk space because of storing table names and field names but, from my point of view, in 2021 year convenience is more important than disk space.
Unfortunately, the problem isn't really the additional disk space it
would require. The problem is the additional performance hit and
memory overhead, as the catalog names are part of the internal
syscache.
I understand your frustration, but given those problems I don't think
that postgres will increase the default NAMEDATALEN value any time
soon, even though it's in contradiction with the SQL standard.
Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко <deromanenko@gmail.com>
escreveu:
Hello dear hackers. I understand the position of the developers community
about NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but
only if we speak about latin languages.Postgresql has wonderful support for unicode in table and column names.
And it looks like very good idea to create table with names on native
language for databases across the world. But when I want to create, for
example, table with name "Catalog_Контрагенты_КонтактнаяИнформация" (that
stands in Russian for catalog of counteragent contacts) it will be
auto-shrinked to "Catalog_Контрагенты_КонтактнаяИнформ". And this is not a
fictional problem - many words in Russian are just longer than it's English
counterparts and I have many examples like this.Although recompiling the source is not so hard, updating is hard. I know
that is not free for disk space because of storing table names and field
names but, from my point of view, in 2021 year convenience is more
important than disk space.I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future
releases.
+1 once that Oracle Database 12.2 and higher, has support for 128 bytes
names.
What possibly, in the future, could impact some migration from Oracle to
Postgres.
regards,
Ranier Vilela
On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
Unfortunately, the problem isn't really the additional disk space it
would require. The problem is the additional performance hit and
memory overhead, as the catalog names are part of the internal
syscache.
Some actual numbers on recent hardware would show what kind of tradeoff is
involved. No one has done that for a long time that I recall.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Aug 18, 2021 at 7:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
Unfortunately, the problem isn't really the additional disk space it
would require. The problem is the additional performance hit and
memory overhead, as the catalog names are part of the internal
syscache.Some actual numbers on recent hardware would show what kind of tradeoff is involved. No one has done that for a long time that I recall.
Agreed, but I don't have access to such hardware. However this won't
influence the memory overhead part, and there is already frequent
problems with that, especially since declarative partitioning, so I
don't see how we could afford that without some kind of cache TTL or
similar. AFAIR the last discussion about it a few years ago didn't
lead anywhere :(
I don't very close with PG testing methodology, but I can pay for a server
(virtual or dedicated, DO maybe) and give access to it, if anyone has time
for that.
Or if someone describes to me steps and shows where to look - I can do it
by myself.
Import Notes
Resolved by subject fallback
Could we just make the limitation to be 64 (or 128) _characters_ not _bytes_ ?
Memory sizes and processor speeds have grown by order(s) of magnitude
since the 64 byte limit was decided and supporting non-ASCII charsets
properly seems like a prudent thing to do.
Also - have we checked that at least the truncation does not cut utf-8
characters in half ?
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Show quoted text
On Wed, Aug 18, 2021 at 1:33 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
On Wed, Aug 18, 2021 at 7:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Wed, Aug 18, 2021 at 7:15 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
Unfortunately, the problem isn't really the additional disk space it
would require. The problem is the additional performance hit and
memory overhead, as the catalog names are part of the internal
syscache.Some actual numbers on recent hardware would show what kind of tradeoff is involved. No one has done that for a long time that I recall.
Agreed, but I don't have access to such hardware. However this won't
influence the memory overhead part, and there is already frequent
problems with that, especially since declarative partitioning, so I
don't see how we could afford that without some kind of cache TTL or
similar. AFAIR the last discussion about it a few years ago didn't
lead anywhere :(
On Wed, Aug 18, 2021 at 7:33 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
Some actual numbers on recent hardware would show what kind of tradeoff
is involved. No one has done that for a long time that I recall.
Agreed, but I don't have access to such hardware. However this won't
Well, by "recent" I had in mind something more recent than 2002, which is
the time where I see a lot of hits in the archives if you search for this
topic.
influence the memory overhead part, and there is already frequent
problems with that, especially since declarative partitioning, so I
That's a fair point.
don't see how we could afford that without some kind of cache TTL or
similar. AFAIR the last discussion about it a few years ago didn't
lead anywhere :(
If you mean the thread "Protect syscache from bloating with negative cache
entries", it had activity earlier this year, so I wouldn't give up hope
just yet. Progress has been slow, so I'll see about putting some effort
into that after concluding my attempt to speed up the syscaches first [1]/messages/by-id/CAFBsxsE35VLJ3hHkjJARB3QWqJ0zWeDw-jzqrfzkzMPuD_Ctvw@mail.gmail.com.
The main thing I'm worried about is the fact that a name would no longer
fit in a Datum. The rest I think we can mitigate in some way.
[1]: /messages/by-id/CAFBsxsE35VLJ3hHkjJARB3QWqJ0zWeDw-jzqrfzkzMPuD_Ctvw@mail.gmail.com
/messages/by-id/CAFBsxsE35VLJ3hHkjJARB3QWqJ0zWeDw-jzqrfzkzMPuD_Ctvw@mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Aug 18, 2021 at 8:03 AM Hannu Krosing <hannuk@google.com> wrote:
Could we just make the limitation to be 64 (or 128) _characters_ not
_bytes_ ?
That couldn't work because characters are variable length. The limit has to
be a fixed length in bytes so we can quickly compute offsets in the
attribute tuple.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Aug 18, 2021 at 8:04 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Agreed, but I don't have access to such hardware. However this won't
Well, by "recent" I had in mind something more recent than 2002, which is the time where I see a lot of hits in the archives if you search for this topic.
Yeah, but my current laptop has a tendency to crash after a few minute
if I stress it too much, so I'm still out.
If you mean the thread "Protect syscache from bloating with negative cache entries", it had activity earlier this year, so I wouldn't give up hope just yet.
Yes that's the thread I was thinking about. I'm not giving up hope
either, but I also don't see it being solved for v15.
The main thing I'm worried about is the fact that a name would no longer fit in a Datum. The rest I think we can mitigate in some way.
Agreed.
On Wed, Aug 18, 2021 at 8:03 PM Hannu Krosing <hannuk@google.com> wrote:
Also - have we checked that at least the truncation does not cut utf-8
characters in half ?
Yes, same for all other places that can truncate text (like the query
text in pg_stat_activity and similar). See usage of pg_mbcliplen() in
truncate_identifier.
On Wed, 2021-08-18 at 08:16 -0300, Ranier Vilela wrote:
Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко <deromanenko@gmail.com> escreveu:
Hello dear hackers. I understand the position of the developers community about
NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but only if we
speak about latin languages.Postgresql has wonderful support for unicode in table and column names. And it
looks like very good idea to create table with names on native language for
databases across the world. But when I want to create, for example, table with
name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in Russian for
catalog of counteragent contacts) it will be auto-shrinked to
"Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictional problem -
many words in Russian are just longer than it's English counterparts and I
have many examples like this.Although recompiling the source is not so hard, updating is hard. I know that
is not free for disk space because of storing table names and field names but,
from my point of view, in 2021 year convenience is more important than disk space.I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in future releases.
My stance here is that you should always use ASCII only for database identifiers,
not only because of this, but also to avoid unpleasant encoding problems if
you want to do something like
pg_dump -t Catalog_Контрагенты_КонтактнаяИнформация mydb
on a shell with an encoding different from the database encoding.
So I am not too excited about this.
+1 once that Oracle Database 12.2 and higher, has support for 128 bytes names.
What possibly, in the future, could impact some migration from Oracle to Postgres.
That seems to be a better argument from my point of view.
I have no idea as to how bad the additional memory impact for the catalog
caches would be...
Yours,
Laurenz Albe
Em qua., 18 de ago. de 2021 às 09:33, Laurenz Albe <laurenz.albe@cybertec.at>
escreveu:
On Wed, 2021-08-18 at 08:16 -0300, Ranier Vilela wrote:
Em qua., 18 de ago. de 2021 às 08:08, Денис Романенко <
deromanenko@gmail.com> escreveu:
Hello dear hackers. I understand the position of the developers
community about
NAMEDATALEN length - and, in fact, 63 bytes is more than enough - but
only if we
speak about latin languages.
Postgresql has wonderful support for unicode in table and column
names. And it
looks like very good idea to create table with names on native
language for
databases across the world. But when I want to create, for example,
table with
name "Catalog_Контрагенты_КонтактнаяИнформация" (that stands in
Russian for
catalog of counteragent contacts) it will be auto-shrinked to
"Catalog_Контрагенты_КонтактнаяИнформ". And this is not a fictionalproblem -
many words in Russian are just longer than it's English counterparts
and I
have many examples like this.
Although recompiling the source is not so hard, updating is hard. I
know that
is not free for disk space because of storing table names and field
names but,
from my point of view, in 2021 year convenience is more important
than disk space.
I ask you to consider increasing NAMEDATALEN for maybe 128 bytes in
future releases.
My stance here is that you should always use ASCII only for database
identifiers,
not only because of this, but also to avoid unpleasant encoding problems if
you want to do something likepg_dump -t Catalog_Контрагенты_КонтактнаяИнформация mydb
on a shell with an encoding different from the database encoding.
So I am not too excited about this.
+1 once that Oracle Database 12.2 and higher, has support for 128 bytes
names.
What possibly, in the future, could impact some migration from Oracle to
Postgres.
That seems to be a better argument from my point of view.
I have no idea as to how bad the additional memory impact for the catalog
caches would be...
It seems to me that this is a case for macro:
HAS_SUPPORT_NAME_128_BYTES
Деnis Романенко would like and would pay the price for regression in
exchange for the convenience.
What impacts him now is the difficulty of maintaining a private tree, with
this support.
regards,
Ranier Vilela
John Naylor <john.naylor@enterprisedb.com> writes:
The main thing I'm worried about is the fact that a name would no longer
fit in a Datum. The rest I think we can mitigate in some way.
Not sure what you mean by that? name is a pass-by-ref data type.
Anyway, this whole argument could be rendered moot if we could convert
name to a variable-length type. That would satisfy *both* sides of
the argument, since those who need long names could have them, while
those who don't would see net reduction instead of growth in space usage.
Of course, this is far far easier said than done; else we would have
done it years ago. But maybe it's not entirely out of reach.
I do not think it'd be hard to change "name" to have the same on-disk
storage representation as cstring; the hard part is what about its
usage in fixed-width catalog structures. Maybe we could finesse that
by decreeing that the name column always has to be the last
non-CATALOG_VARLEN field. (This would require fixing up the couple of
places where we let some other var-width field have that distinction;
but I suspect that would be small in comparison to the other work this
implies. If there are any catalogs having two name columns, one of them
would become more difficult to reach from C code.)
Another fun thing --- and, I fear, another good argument against just
raising NAMEDATALEN --- is what about TupleDescs, which last I checked
used an array of fixed-width pg_attribute images. But maybe we could
replace that with an array of pointers. Andres already did a lot of
the heavy code churn required to hide that data structure behind
TupleDescAttr() macros, so changing the representation should be much
less painful than it would once have been.
I wonder if we'd get complaints from changing the catalog column layouts
that much. People are used to the name at the front, I think. OTOH,
I expected a lot of bleating about the OID column becoming frontmost,
but there hasn't been much.
Anyway, I have little desire to work on this myself, but I recommend that
somebody who is more affected by the name length restriction look into it.
regards, tom lane
On Wed, Aug 18, 2021 at 10:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Anyway, this whole argument could be rendered moot if we could convert
name to a variable-length type. That would satisfy *both* sides of
the argument, since those who need long names could have them, while
those who don't would see net reduction instead of growth in space usage.
Yeah it seems like the best way forward.
Of course, this is far far easier said than done; else we would have
done it years ago. But maybe it's not entirely out of reach.
I do not think it'd be hard to change "name" to have the same on-disk
storage representation as cstring; the hard part is what about its
usage in fixed-width catalog structures. Maybe we could finesse that
by decreeing that the name column always has to be the last
non-CATALOG_VARLEN field. (This would require fixing up the couple of
places where we let some other var-width field have that distinction;
but I suspect that would be small in comparison to the other work this
implies. If there are any catalogs having two name columns, one of them
would become more difficult to reach from C code.)
Here is the list on some recent build (as of 17707c059c):
relname | array_agg
------------------+----------------------------------
pg_collation | {collname,collctype,collcollate}
pg_database | {datctype,datcollate,datname}
pg_event_trigger | {evtevent,evtname}
pg_subscription | {subname,subslotname}
pg_trigger | {tgname,tgnewtable,tgoldtable}
(5 rows)
I wonder if we'd get complaints from changing the catalog column layouts
that much. People are used to the name at the front, I think. OTOH,
I expected a lot of bleating about the OID column becoming frontmost,
but there hasn't been much.
I don't think that would be comparable. Having an extra oid in the
1st column doesn't really make a raw SELECT * harder to read. But
having the XXXname column way behind, and not even at the end, means
that most people will have to type an extra "xxxname," for each
throwaway query run to quickly verify something. I know that I often
do that, and while I could live with it I'd rather not have to do it.
On 18.08.21 13:33, Julien Rouhaud wrote:
Agreed, but I don't have access to such hardware. However this won't
influence the memory overhead part, and there is already frequent
problems with that, especially since declarative partitioning,
On the flip side, with partitioning you need room for longer table
names, since you need room for the real name plus some partition identifier.
Julien Rouhaud <rjuju123@gmail.com> writes:
On Wed, Aug 18, 2021 at 10:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder if we'd get complaints from changing the catalog column layouts
that much. People are used to the name at the front, I think. OTOH,
I expected a lot of bleating about the OID column becoming frontmost,
but there hasn't been much.
I don't think that would be comparable. Having an extra oid in the
1st column doesn't really make a raw SELECT * harder to read. But
having the XXXname column way behind, and not even at the end, means
that most people will have to type an extra "xxxname," for each
throwaway query run to quickly verify something. I know that I often
do that, and while I could live with it I'd rather not have to do it.
Yeah, it would annoy the heck out of me too. Again there's a potential
technical solution, which is to decouple the user-visible column order
from the storage order. However, multiple people have tilted at that
windmill without much success, so making it a prerequisite for improving
the name-length situation doesn't seem like a smart plan.
regards, tom lane
On 8/18/21 10:53 AM, Tom Lane wrote:
Julien Rouhaud <rjuju123@gmail.com> writes:
On Wed, Aug 18, 2021 at 10:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wonder if we'd get complaints from changing the catalog column layouts
that much. People are used to the name at the front, I think. OTOH,
I expected a lot of bleating about the OID column becoming frontmost,
but there hasn't been much.I don't think that would be comparable. Having an extra oid in the
1st column doesn't really make a raw SELECT * harder to read. But
having the XXXname column way behind, and not even at the end, means
that most people will have to type an extra "xxxname," for each
throwaway query run to quickly verify something. I know that I often
do that, and while I could live with it I'd rather not have to do it.Yeah, it would annoy the heck out of me too. Again there's a potential
technical solution, which is to decouple the user-visible column order
from the storage order. However, multiple people have tilted at that
windmill without much success, so making it a prerequisite for improving
the name-length situation doesn't seem like a smart plan.
There might be other benefits, though. IIRC what we discussed years ago
was having for each attribute an immutable number (presumably attnum as
now) and a mutable display order and storage order. Previous patches
didn't implement this and so were rejected. I think part of the trouble
is that we'd have to go through roughly 1700 mentions of attnum in the
source and decide if it's really attnum they want or if it's
attnum_display_order or attnum_storage_order. So this has the potential
to be extraordinarily invasive and potentially bug-prone. And then
there's the world of extensions to consider.
I have a bit less sympathy for the argument that just moving it will
break things that use 'select *' on the catalogs. In general, if you
care about the order of columns you should name the columns you want in
the order you want. I've seen 'select *' break for people on other
changes, like adding or dropping a column. It might cause Postgres
developers a bit of pain, but it should be manageable, so I kind of like
your suggestion.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
On 8/18/21 10:53 AM, Tom Lane wrote:
Yeah, it would annoy the heck out of me too. Again there's a potential
technical solution, which is to decouple the user-visible column order
from the storage order. However, multiple people have tilted at that
windmill without much success, so making it a prerequisite for improving
the name-length situation doesn't seem like a smart plan.
There might be other benefits, though. IIRC what we discussed years ago
was having for each attribute an immutable number (presumably attnum as
now) and a mutable display order and storage order. Previous patches
didn't implement this and so were rejected. I think part of the trouble
is that we'd have to go through roughly 1700 mentions of attnum in the
source and decide if it's really attnum they want or if it's
attnum_display_order or attnum_storage_order. So this has the potential
to be extraordinarily invasive and potentially bug-prone. And then
there's the world of extensions to consider.
Yeah, exactly: conceptually that's simple, but flushing all the bugs
out would be a years-long nightmare. It'd make all the fun we had
with missed attisdropped checks look like a walk in the park. Unless
somebody can figure out a way to mechanically check for mistakes,
I don't think I want to go there.
I wonder though if we could fix the immediate problem with something
less ambitious. The hard part of the full proposal, I think, is
separating permanent identity from physical position. If we were to
split out *only* the display order from that, the patch footprint
ought to be far far smaller --- basically, I think, we'd need to fix
star-expansion and not a lot more in the backend. Of course,
client-side code like psql's \d and pg_dump would need to be upgraded
too, but any missed things there would be cosmetic that's-not-the-
expected-column-order bugs, not core dumps. Also, at least in the v1
patch we could use this just for system catalogs without exposing it
as a feature for user tables, which would greatly restrict the set of
client-side places that really need fixed.
(I think Alvaro was the last person to mess with this issue, so I
wonder what his take is on the feasibility of such a restricted
solution.)
regards, tom lane
On 2021-Aug-18, Tom Lane wrote:
I wonder though if we could fix the immediate problem with something
less ambitious. The hard part of the full proposal, I think, is
separating permanent identity from physical position. If we were to
split out *only* the display order from that, the patch footprint
ought to be far far smaller --- basically, I think, we'd need to fix
star-expansion and not a lot more in the backend. Of course,
client-side code like psql's \d and pg_dump would need to be upgraded
too, but any missed things there would be cosmetic that's-not-the-
expected-column-order bugs, not core dumps. Also, at least in the v1
patch we could use this just for system catalogs without exposing it
as a feature for user tables, which would greatly restrict the set of
client-side places that really need fixed.(I think Alvaro was the last person to mess with this issue, so I
wonder what his take is on the feasibility of such a restricted
solution.)
Yeah, my impression is that the project of just changing star expansion
is much, much easier than splitting out attribute identity from physical
location. The former, as I recall, is a localized change in expandRTE
and friends, and you don't have to touch anything else. The other part
of the change is much more invasive and got me into territory that I
wasn't able to navigate successfully.
(I think we should consider keeping 'attnum' as the display-order
attribute, and have the physical-and-identity attribute get a new name,
say attphysnum. That's so that you don't have to change psql and the
whole world -- the change is transparent to them. This means we need a
first step that's very invasive, because every single current use of
'attnum' has to be changed to attphysnum first, followed by a functional
patch that changes a few of those back to attnum. It'd be a large
inconvenience to backend developers to ease the lives of client-side
developers.)
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/