Standard of data storage and transformation

Started by yudhi s over 1 year ago · 3 messages · general
#1 yudhi s
learnerdatabase99@gmail.com

Hi All,
We have a use case where transaction data for multiple customers lands in
one of our Postgres databases (version 15.4), consumed from multiple
sources (batch file processing, Kafka event processing, etc.). It is
currently stored in Postgres in normalized form, with constraints,
indexes, and partitions defined. This Postgres database holds roughly one
month of transaction data. The use cases are online transaction search
reports (mostly real-time reporting), daily transaction batch reports per
customer, and month-end reports for customers. In the target state it will
hold approx. ~400 million transactions/day, which can mean billions of
rows across multiple related parent/child tables.

There is another requirement to send this customer transaction data to an
OLAP system, a Snowflake database, where it will be persisted for many
years. The lag between the data in Postgres/OLTP and in Snowflake will be
~1 hour. Any reporting API will query Postgres for less than one month's
worth of transaction data; if it needs to scan more than one month's
worth, it will point to the Snowflake database.

Now the question we are wondering about is: should we send the data as-is,
in normalized table form, to Snowflake and transform/flatten it there to
support the reporting use cases, or should we first flatten/transform the
data in Postgres itself into another structure (for example, materialized
views on top of the base tables, as sketched below) and only then move
that data to Snowflake? What is the appropriate standard, and what is the
downside if we do anything different?
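
For illustration, the second option would look something like this (the
table and column names here are only examples, not our real schema):

-- Example only: flatten in Postgres before shipping to Snowflake.
CREATE MATERIALIZED VIEW txn_flat AS
SELECT t.txn_id,
       t.txn_ts,
       t.amount,
       c.customer_id,
       c.customer_name,
       d.merchant_name,
       d.channel
FROM   transactions       t
JOIN   customers          c ON c.customer_id = t.customer_id
JOIN   transaction_detail d ON d.txn_id      = t.txn_id;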

Regards
Yudhi

#2 Ron
ronljohnsonjr@gmail.com
In reply to: yudhi s (#1)
Re: Standard of data storage and transformation

Some thoughts:
0) How big are the records?
1) Materialized views add disk space overhead.
2) Materialized views are for when you query the same static data over and
over again.
3) IIUC, you'll be moving the data from PG to Snowflake just once.
4) Writing an MV to disk and then reading it only once doubles the IO
requirements.
5) Thus, my first thought would be to extract the data from PG using a
denormalizing "plain" view (see the first sketch after this list).
5a) If you can't make that Fast Enough, then obviously you must pull the
normalized data from PG and denormalize it elsewhere. You know your
situation better than we do.
6) Indices will be critical: not too many, but not too few.
7) Obviously consider partitioning, but note that too many partitions can
make query planning MUCH slower.
7a) 31 days cut into hours means 744 partitions. That's a LOT.
7b) Partitioning every THREE hours means only 248 child tables. A lot, but
much more manageable.
7c) That might well kill reporting performance, though, if a report is,
for example, one customer across many partitions.
8) You (hopefully) know what kind of queries will be run. Maybe partition
by customer (or whatever) range and THEN by an hour range (see the second
sketch below).
8a) You'd have to simultaneously run multiple extract jobs (one for each
"customer" range), but that might not be too onerous, since each hour
partition would then be smaller.
9) Testing. Nothing beats full-scale testing.
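
Here's point 5 sketched, with made-up names, of course; the join runs only
when the extract job reads the view, so nothing is written to disk twice:

-- Made-up names: a plain (non-materialized) denormalizing view.
CREATE VIEW txn_extract AS
SELECT t.txn_id, t.txn_ts, t.amount,
       c.customer_id, c.customer_name,
       d.merchant_name, d.channel
FROM   transactions       t
JOIN   customers          c ON c.customer_id = t.customer_id
JOIN   transaction_detail d ON d.txn_id      = t.txn_id;

-- The hourly extract reads the view once:
COPY (SELECT * FROM txn_extract
      WHERE  txn_ts >= '2024-08-06 00:00'
      AND    txn_ts <  '2024-08-06 01:00')
TO STDOUT WITH (FORMAT csv);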

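And point 8 sketched; the names and range boundaries are purely
illustrative:

CREATE TABLE transactions (
    txn_id      bigint        NOT NULL,
    customer_id int           NOT NULL,
    txn_ts      timestamptz   NOT NULL,
    amount      numeric(12,2)
) PARTITION BY RANGE (customer_id);

-- First level: a "customer" range...
CREATE TABLE transactions_c000 PARTITION OF transactions
    FOR VALUES FROM (0) TO (1000)
    PARTITION BY RANGE (txn_ts);

-- ...then a time range within it.
CREATE TABLE transactions_c000_d01 PARTITION OF transactions_c000
    FOR VALUES FROM ('2024-08-01') TO ('2024-08-02');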

#3 yudhi s
learnerdatabase99@gmail.com
In reply to: Ron (#2)
Re: Standard of data storage and transformation

The table has 100+ columns, and the denormalized/flattened table we are
planning to create will have even more, since it will be built from the
columns of multiple tables joined together. However, the plan was to have
the cooked data ready so that no additional work is needed in the target
or downstream systems. So I was thinking of a physical model for
persisting the transformed data (using an MV) rather than a simple view,
which would carry a performance overhead at query time.

Because what is happening is that, after the data moves to Snowflake,
people create their own versions of the transformed table on top of the
normalized tables (we call these "refiners") and then query those from the
UI or for reporting. Some people say we should avoid those downstream
refiners and should do the transformation here, at the source, in
Postgres. Also, the plan was to move the data from Postgres once every
hour.
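
The rough hourly flow we had in mind with the MV approach is below (all
object names are placeholders, continuing the earlier txn_flat example):

-- REFRESH ... CONCURRENTLY needs a unique index on the MV.
CREATE UNIQUE INDEX ON txn_flat (txn_id);

-- Refresh once per hour, then ship only the newest slice to Snowflake.
REFRESH MATERIALIZED VIEW CONCURRENTLY txn_flat;

COPY (SELECT * FROM txn_flat
      WHERE  txn_ts >= now() - interval '1 hour')
TO STDOUT WITH (FORMAT csv);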