Design strategy for table with many attributes

Started by Lok P · almost 2 years ago · 13 messages · general
#1 Lok P
loknath.73@gmail.com

Hello,
In one of our applications, we receive transactions as messages/events and
also in files, which are then parsed and stored in a relational database.
Each transaction has ~900+ attributes/columns. Logically they are part of
one single transaction and should be stored in one table as one single row.
There will be ~500 million such transactions coming into the system each
day, with approximately 10K peak write TPS and 5K read TPS in the target
state. This system has a Postgres database as the "source of truth" or OLTP
store, and data then moves to Snowflake for the OLAP store.

Initially, when the system was designed, the number of attributes per
transaction was <100, but slowly the business wants to keep/persist other
attributes too in the current system, and the number of columns keeps growing.

However, having worked with some other database systems, we have received
suggestions from DBAs not to have too many columns in a single table. For
example, in Oracle they say not to go beyond ~255 columns, as row chaining
and row migration will then come back to haunt us. We are also afraid that
concurrent DML on the table may make it a contention point. So I wanted to
understand: in such a situation, what would be the best design approach,
irrespective of the database? Or, say, what is the maximum number of columns
per table we should restrict ourselves to? Should we break the single
transaction into multiple tables, like one main table and other addenda
tables with the same primary key, to join and fetch the results wherever
necessary?

Regards
Lok

#2 David G. Johnston
david.g.johnston@gmail.com
In reply to: Lok P (#1)
Re: Design strategy for table with many attributes

On Thu, Jul 4, 2024 at 12:38 PM Lok P <loknath.73@gmail.com> wrote:

Should we break the single transaction into multiple tables like one main
table and other addenda tables with the same primary key to join and fetch
the results wherever necessary?

I would say yes. Find a way to logically group sets of columns together
and place those groups into separate tables. I'd also be looking for cases
where multiple columns really should be multiple rows. This is not
uncommon.
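
For illustration only, a rough sketch of that kind of split (all table and
column names here are hypothetical):

create table txn_main (
    txn_id     bigint primary key,
    created_at timestamptz not null,
    amount     numeric not null
);

-- one of several "addenda" tables, 1:1 with the main table
create table txn_risk_addendum (
    txn_id     bigint primary key references txn_main (txn_id),
    risk_score numeric,
    risk_notes text
);

-- join only when a query actually needs both groups
select m.txn_id, m.amount, r.risk_score
from txn_main m
join txn_risk_addendum r using (txn_id)
where m.txn_id = 12345;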

David J.

#3 Kent Dorfman
kent.dorfman766@gmail.com
In reply to: Lok P (#1)
Re: Design strategy for table with many attributes

On 7/4/24 15:37, Lok P wrote:

Or say, what is the maximum number of columns per table we should
restrict? Should we break the single transaction into multiple tables
like one main table and other addenda tables with the same primary key
to join and fetch the results wherever necessary?

900 columns makes my head hurt badly...

The neophyte will design a table with a separate field for each type of
phone number that may be encountered.  The experienced designer will
move all phone numbers to their own table, where each entry/row contains a
reference link, a "number_type" field, and a field with the actual
number in it... three fields in a table that is manageable and can be
queried/modified without stressing the database server.
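
Something like this, as a sketch (names are made up):

create table phone_number (
    person_id   bigint not null,   -- reference link back to the owning person/row
    number_type text   not null,   -- 'home', 'mobile', 'work', ...
    number      text   not null
);

-- query/modify numbers without touching the wide person table
select number
from phone_number
where person_id = 42 and number_type = 'mobile';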

#4 Ron
ronljohnsonjr@gmail.com
In reply to: Lok P (#1)
Re: Design strategy for table with many attributes

On Thu, Jul 4, 2024 at 3:38 PM Lok P <loknath.73@gmail.com> wrote:

Hello,
In one of our applications, we receive transactions as messages/events and
also in files, which are then parsed and stored in a relational database.
Each transaction has ~900+ attributes/columns. Logically they are part of
one single transaction

Nothing out of the ordinary.

and should be stored in one table as one single row.

Says who?

There will be ~500 million such transactions coming into the system each
day, with approximately 10K peak write TPS and 5K read TPS in the target
state. This system has a Postgres database as the "source of truth" or OLTP
store, and data then moves to Snowflake for the OLAP store.

Initially, when the system was designed, the number of attributes per
transaction was <100, but slowly the business wants to keep/persist other
attributes too in the current system, and the number of columns keeps growing.

However, having worked with some other database systems, we have received
suggestions from DBAs not to have too many columns in a single table. For
example, in Oracle they say not to go beyond ~255 columns, as row chaining
and row migration will then come back to haunt us. We are also afraid that
concurrent DML on the table may make it a contention point. So I wanted to
understand: in such a situation, what would be the best design approach,
irrespective of the database? Or, say, what is the maximum number of columns
per table we should restrict ourselves to? Should we break the single
transaction into multiple tables, like one main table and other addenda
tables with the same primary key, to join and fetch the results wherever
necessary?

You need database normalization. It's a big topic. Here's a good simple
explanation:
https://popsql.com/blog/normalization-in-sql

#5 Guyren Howe
guyren@gmail.com
In reply to: Lok P (#1)
Re: Design strategy for table with many attributes

Ultimately, the way you should store the data depends on how you will use it. When you retrieve these values, what are the different ways you’ll be using them?

Normalised representations are more flexible, and the pragmatic, performance-based consideration is that all the values in a row are always effectively retrieved together. So if you reasonably often have a query that only accesses the creation date and transaction id, then it will be pretty slow if you are also always retrieving 500 other columns you don’t need.

So: you might often pull all the values *other* than the attributes (creation date, creator, job information, whatever) together. This argues that those values should be in one table, and the attributes in another.

Will you usually be using *all* of the attributes for a particular transaction together in the same operation? It might make sense to store them in eg an array in that case. But this significantly reduces the speed of accessing particular attributes separately.

It is likely that you will want to slice things by particular named attribute across many transactions. This argues for the more normalised form, as does the general principle of doing things in the way that is most general and flexible.

When considering how the data will be used, please consider not only the developers of your current application(s), but also eg data analysts, managers, future applications etc.

The less specific you can be about how you want to use the data, the more you should lean toward fully normalising.

Note also that you can store your data in a normalised and maximally flexible form, but also use triggers or batch jobs to gather various permutations of the data for specific purposes. If you really do have, say, some major part of your project that uses all the attributes of a given transaction together, but you also have other uses, you may want to store both the normalised/attribute table and the “all the values together” version.

Even if you want to store “all the values together”, it may well be better to use an array, JSON or HStore, rather than having a super-wide table. JSON would, for example, let you enumerate all the column names and employ Postgres’s really nice JSON query features.
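
For example, a minimal jsonb sketch (the column and key names are made up):

create table txn (
    txn_id     bigint primary key,
    created_at timestamptz not null,
    attrs      jsonb not null
);

-- enumerate the attribute names present in one row
select jsonb_object_keys(attrs) from txn where txn_id = 1;

-- pull a single attribute across many rows
select txn_id, attrs ->> 'merchant_category' as merchant_category from txn;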


#6 Lok P
loknath.73@gmail.com
In reply to: David G. Johnston (#2)
Re: Design strategy for table with many attributes

On Fri, Jul 5, 2024 at 1:26 AM David G. Johnston <david.g.johnston@gmail.com>
wrote:

On Thu, Jul 4, 2024 at 12:38 PM Lok P <loknath.73@gmail.com> wrote:

Should we break the single transaction into multiple tables like one main
table and other addenda tables with the same primary key to join and fetch
the results wherever necessary?

I would say yes. Find a way to logically group sets of columns together
and place those groups into separate tables. I'd also be looking for cases
where multiple columns really should be multiple rows. This is not
uncommon.

David J.

Thank you David.

As you said, to logically break this into multiple tables, I believe it
means the split should be such that, most of the time, there is no need to
query multiple tables and join them to fetch the results; a query should be
able to fetch its results from one table at any point in time.

But do you also suggest keeping those table pieces related to each other
through the same primary key? And won't there be a problem when we load the
data? For example, in the normal scenario the data load goes to one table,
but when we break it into multiple tables the load happens to all the
individual pieces; won't that cause an additional burden on the data load?

Also, I understand the technical limit on the maximum number of columns per
table is ~1600. But would you advise restricting us to some lower number
long before reaching that limit, so that we will not face any anomalies as
we grow in future? And should we maintain any specific order of the columns
from the first to the last column in the table?

#7 David Rowley
dgrowleyml@gmail.com
In reply to: Lok P (#6)
Re: Design strategy for table with many attributes

On Fri, 5 Jul 2024 at 17:07, Lok P <loknath.73@gmail.com> wrote:

Also, I understand the technical limit on the maximum number of columns per table is ~1600. But would you advise restricting us to some lower number long before reaching that limit, so that we will not face any anomalies as we grow in future? And should we maintain any specific order of the columns from the first to the last column in the table?

Something else you may wish to consider, depending on the column types
of your 900+ columns, is the possibility that some INSERTs may fail due
to row length while others with shorter variable-length values may be
OK.

Here's a quick example with psql:

select 'create table a (' || string_agg('a'||x||' text not null default $$$$', ',') || ')'
from generate_series(1,1000) x;
\gexec
insert into a default values;
INSERT 0 1

Again, but with a larger DEFAULT to make the tuple larger:

select 'create table b (' || string_agg('a'||x||' text not null default $$hello world$$', ',') || ')'
from generate_series(1,1000) x;
\gexec
insert into b default values;
ERROR: row is too big: size 12024, maximum size 8160

There is a paragraph at the bottom of [1] with some warnings about
things relating to this.

The tuple length would be fixed for fixed-length types defined as NOT
NULL. So, if you have that, there should be no such surprises.

David

[1]: https://www.postgresql.org/docs/current/limits.html

#8 David G. Johnston
david.g.johnston@gmail.com
In reply to: Lok P (#6)
Re: Design strategy for table with many attributes

On Thursday, July 4, 2024, Lok P <loknath.73@gmail.com> wrote:

But do you also suggest keeping those table pieces related to each other
through the same primary key?

Yes, every row produced from the input data “row” should get the same ID
associated with it - either as an entire PK or as a component of a
multi-column PK/unique index.
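
As a sketch of the multi-column variant (names here are hypothetical), each
addendum row carries the transaction ID plus a discriminator, and the pair
forms the PK:

create table txn_attribute (
    txn_id     bigint not null,
    attr_name  text   not null,
    attr_value text,
    primary key (txn_id, attr_name)
);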

And won't there be a problem when we load the data? For example, in the
normal scenario the data load goes to one table, but when we break it into
multiple tables the load happens to all the individual pieces; won't that
cause an additional burden on the data load?

Yes, doing this requires additional CPU time to perform the work. I’d say
IO is hopefully a wash.

Also, I understand the technical limit on the maximum number of
columns per table is ~1600. But would you advise restricting us to some
lower number long before reaching that limit, so that we will not face any
anomalies as we grow in future?

In a row-oriented system, wider is worse. I start evaluating a table
design with that in mind at the fourth column (counting the surrogate key
that is usually present and the natural key, and ignoring auditing columns).

And should we maintain any specific order of the columns from the first to
the last column in the table?

There is material out there on micro-optimizing column ordering to match
alignment boundaries. I'd say the benefit is meaningful, but there is a
non-trivial cost to actually set up the testing to verify that what you've
figured out is working. I've never actually done it myself, though it
seems like something someone could/may have written an algorithm for
(I do not recall ever seeing mention of one).
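
If you did want to experiment, the declared length and alignment of each
column's type can at least be pulled from the catalogs, something like this
(the table name is a placeholder):

select a.attnum, a.attname, t.typname, t.typlen, t.typalign
from pg_attribute a
join pg_type t on t.oid = a.atttypid
where a.attrelid = 'your_table'::regclass
  and a.attnum > 0
  and not a.attisdropped
order by a.attnum;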

David J.

#9 Lok P
loknath.73@gmail.com
In reply to: Lok P (#1)
Re: Design strategy for table with many attributes

On Fri, Jul 5, 2024 at 10:45 AM Guyren Howe <guyren@gmail.com> wrote:

On Jul 4, 2024, at 22:07, Lok P <loknath.73@gmail.com> wrote:

If you stick to the principle of grouping columns in a table when you use
those columns together, you should be good.

Note that you might want to split up the “parent” table if that naturally
groups its columns together for certain uses. In that case, you could have
the same pk on all the 1:1 tables you then have. In that case, the pk for
each of those tables is also the fk.

Thank you.

When you said below,

*"Note that you might want to split up the “parent” table if that naturally
groups its columns together for certain uses. In that case, you could have
the same pk on all the 1:1 tables you then have. In that case, the pk for
each of those tables is also the fk."*
Do you mean having a real FK created through DDL and maintaining it, or
just assuming it, with no need to define it for all the pieces/tables, and
only keeping the same PK across all the pieces, since we know they are
related to the same transaction and are logically related?

#10 Lok P
loknath.73@gmail.com
In reply to: Lok P (#1)
Re: Design strategy for table with many attributes

Some folks in the team suggested keeping key business attributes, or say
the frequently queried attributes, in individual columns, and the others
clubbed into a JSON column in the same table. Is that advisable, or can any
issues occur with this approach? I am also not sure how effectively Postgres
processes JSON (from both a read and a write perspective) as compared to
normal columns in an OLTP environment. Please advise.

As David suggested, it breaks if a row exceeds the 8K limit, i.e. a single
page size; will that still hold true if we have a column with JSON in it?

On Fri, 5 Jul, 2024, 12:04 pm Guyren Howe, <guyren@gmail.com> wrote:

On Jul 4, 2024, at 23:28, Lok P <loknath.73@gmail.com> wrote:

*"Note that you might want to split up the “parent” table if that
naturally groups its columns together for certain uses. In that case, you
could have the same pk on all the 1:1 tables you then have. In that case,
the pk for each of those tables is also the fk."*
Do you mean having a real FK created through DDL and maintaining it, or
just assuming it, with no need to define it for all the pieces/tables, and
only keeping the same PK across all the pieces, since we know they are
related to the same transaction and are logically related?

A primary key *names something*. Often it’s a kind of platonic
representation of a real thing — say, a person.

I might use a person’s login columns in some functions, and the person’s
name, birth date, etc in other functions.

Rather than have one table, I should split this into two, but use the same
primary key (I would name them both id, or both, say, person_id,
irrespective of the name of the table, so it’s clear you’re doing this).

You can just do a join on the mutual primary keys as you’d expect. In
fact, if you name them the same, you can just use NATURAL JOIN.

So you’d have person_details and person_login tables, and have a person_id
pk for both.
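
A minimal sketch of that layout (the column lists are just illustrative):

create table person_details (
    person_id  bigint primary key,
    full_name  text not null,
    birth_date date
);

create table person_login (
    person_id     bigint primary key references person_details (person_id),
    login_name    text not null unique,
    password_hash text not null
);

-- person_id is the only shared column name, so NATURAL JOIN joins on it
select full_name, login_name
from person_details natural join person_login;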

#11 David Rowley
dgrowleyml@gmail.com
In reply to: Lok P (#10)
Re: Design strategy for table with many attributes

On Fri, 5 Jul 2024 at 19:53, Lok P <loknath.73@gmail.com> wrote:

As David suggested, it breaks if a row exceeds the 8K limit, i.e. a single page size; will that still hold true if we have a column with JSON in it?

You wouldn't be at risk of the same tuple length problem if you
reduced the column count and stored the additional details in JSON.
Each varlena column is either stored in the tuple inline, or toasted
and stored out of line. Out of line values need an 18-byte pointer to
the toasted data. That pointer contributes to the tuple length.

This isn't me advocating for JSON. I'm just explaining the
limitations. I think I'd only advocate for JSON if the properties you
need to store vary wildly between each tuple. There's a large
overhead to storing JSON labels, which you'd pay the price of for each
tuple. That sounds like it would scale terribly with the data volumes
you've suggested you'll be processing.
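
You can get a rough feel for that label overhead with pg_column_size();
the key names below are made up:

-- long key names are stored again inside every single jsonb value
select pg_column_size('{"merchant_category_code": "5411", "terminal_entry_mode": "05"}'::jsonb);

-- the same data with short keys takes noticeably less space per tuple
select pg_column_size('{"mcc": "5411", "tem": "05"}'::jsonb);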

David

#12 Lok P
loknath.73@gmail.com
In reply to: David Rowley (#11)
Re: Design strategy for table with many attributes

On Fri, 5 Jul, 2024, 1:44 pm David Rowley, <dgrowleyml@gmail.com> wrote:

On Fri, 5 Jul 2024 at 19:53, Lok P <loknath.73@gmail.com> wrote:

As David suggested, it breaks if a row exceeds the 8K limit, i.e. a single
page size; will that still hold true if we have a column with JSON in it?

You wouldn't be at risk of the same tuple length problem if you
reduced the column count and stored the additional details in JSON.
Each varlena column is either stored in the tuple inline, or toasted
and stored out of line. Out of line values need an 18-byte pointer to
the toasted data. That pointer contributes to the tuple length.

David

Got it. Thank you very much.

So there would be a performance overhead with JSON, and we need to validate
that carefully if we go in that direction at all.

However, out of curiosity: if the toasted/compressed component or column,
which is the JSON itself, goes beyond 8K post-compression, will it break then?

#13 David Rowley
dgrowleyml@gmail.com
In reply to: Lok P (#12)
Re: Design strategy for table with many attributes

On Fri, 5 Jul 2024 at 20:53, Lok P <loknath.73@gmail.com> wrote:

However, out of curiosity: if the toasted/compressed component or column, which is the JSON itself, goes beyond 8K post-compression, will it break then?

No. The size limit of a varlena field such as TEXT, JSON or JSONB is
1GB. See "field size" in [1].

Please also read [2].

David

[1]: https://www.postgresql.org/docs/current/limits.html
[2]: https://www.postgresql.org/docs/current/storage-toast.html