Question on overall design

Started by veem v over 2 years ago · 13 messages · general
Jump to latest
#1veem v
veema0000@gmail.com

Hello All,
Although it's not exactly related to open-source PostgreSQL, I want to ask
this question here to get the views of colleagues with decades of
experience in the database world. We would like some guidance on whether
the design below looks okay for our customer use case.

We currently have financial-system transaction data streaming to an
on-premise Oracle Exadata (X9) database. This database supports processing
of 400 million transactions per day. A single transaction for us is a
combination of 7-8 inserts into different transaction tables, with indexes,
unique constraints, etc. defined on them. Transactions are
processed/committed in batches (~1000 per batch) in the database, and the
system persists data for ~6 months. We also have all sorts of OLAP
applications (daily/monthly batch reports) running on the same database,
along with some user-facing UI applications showing customer transactions.
So it currently serves a hybrid workload and is a one-stop solution for all
use cases.

Many of the applications are moving from on-premise to AWS as part of a
modernization journey, with AWS as the chosen cloud partner. The product is
expected to expand across more regions, and this system is expected to
serve increased transaction volume. We also have a requirement to persist
transaction data for ~10 years so it is available for analytics/data-science
use cases.

So the team is thinking of splitting it into two parts:
1) An OLTP-type use case in which we persist/write the transaction data
quickly and show it to the UI-related apps in near real time, or as close
to it as possible. This database would store at most 60-90 days of
transaction data. We are not sure whether there is an Oracle Exadata
equivalent on AWS, so the team is planning to experiment with Aurora
PostgreSQL. Please correct me if there are other options we should consider
instead.

2) Data beyond ~90 days would then move into another database, or into S3
object storage, which would keep it for ~10 years and make it queryable
through the necessary APIs. That tier is supposed to cater to the
OLAP/analytics/data-science use cases.

Is the above design okay? Also, regarding the second point above, i.e.
persisting the historical data in a queryable state, should we go for a
database like Snowflake, or should we just keep it on S3 as-is and make it
queryable through APIs? Please advise.

#2Ron
ronljohnsonjr@gmail.com
In reply to: veem v (#1)
Re: Question on overall design

On Sat, Dec 9, 2023 at 6:14 AM veem v <veema0000@gmail.com> wrote:
[snip]

"OK" is relative, but it's what we did in a similar situation: two years of
data on-line and 5 years of data in compressed files in S3. (We're
required to keep data for 7 years, but they *never* ask for records more
than 2 years old. If they ever do, we'll manually load the relevant data
back into PG.)

(I can't imagine that querying billions of unindexed flat-file records via
S3 would be fast.)

How often do end users look for data more than 90 days old? Two years old?

How quickly does the old data need to be available?

Maybe a four-tiered system of PG-Snowflake-S3-S3_Glacier would balance
speed and cost (or maybe Snowflake would just drive up costs).

#3veem v
veema0000@gmail.com
In reply to: veem v (#1)
Re: Question on overall design

Ron Johnson <ronljohnsonjr@gmail.com> wrote:
[snip]

Thank you so much, Ron.

When you said "*two years of data on-line and 5 years of data in compressed
files in S3*", do you mean two years of data in Aurora PostgreSQL, which
will be the OLTP database, with the rest simply dumped as-is into an S3
bucket, and when we need to query that data, we just read those S3 files
back into Aurora PostgreSQL?

The currently running Oracle Exadata system has six months of data
(~200 TB), and the transaction tables are all range-partitioned on a daily
basis. Of that, the most recent ~2 months of data is queried frequently and
the other ~4 months less frequently. However, in the target architecture we
want to build in the cloud, the analytics/data-science team has
requirements to query ~3 years of history. Beyond ~3 years we may need that
data only rarely.

We were initially thinking of having one database serve both OLTP and OLAP
use cases (somewhat like Oracle Exadata currently does for us), but it
seems we don't have such an option on AWS: PostgreSQL would serve the OLTP
use case, while Snowflake would serve the OLAP use case.

So do you suggest having both databases in use: recent transaction data for
the last 3 months streamed to Aurora PostgreSQL, data from 3 months to 3
years parked in Snowflake to serve the OLAP/analytics use case, and data
from 3 years to 10 years kept in S3 (in Parquet or Iceberg format) so that
Snowflake can query it directly when needed?

OR

Or do you suggest keeping just the last ~3 months of data in Aurora
PostgreSQL and everything else in Snowflake, considering that it stores
data in compressed form and its storage is cheap (~$23 per TB per month)?

A few colleagues are pointing to Databricks for the analytics use case. Is
that a good option here?


#4Ron
ronljohnsonjr@gmail.com
In reply to: veem v (#3)
Re: Question on overall design

On Sat, Dec 9, 2023 at 2:13 PM veem v <veema0000@gmail.com> wrote:

[snip]

So do you mean two years of data in Aurora PostgreSQL, which will be the
OLTP database, with the rest dumped as-is into an S3 bucket and simply read
back into Aurora when we need to query it?

(RDS Postgresql, not Aurora, but that's beside the point.)

Yes.

But keep reading...

[snip]

I can't answer that without knowing what the end users actually need
(details, or just summaries of historical data, in different tiers).

You all will have to do the cost:benefit analysis of different
architectures.


#5veem v
veema0000@gmail.com
In reply to: Ron (#4)
Re: Question on overall design

Thank you so much for the response.

Got your point; we will check whether we really need details or just
summaries for the historical data. But it looks like we will need detailed
transaction data for ~2 years at least.

My understanding was that AWS has two different offerings, and "Aurora
PostgreSQL" is more performant and costlier than "RDS PostgreSQL". We will
double-check this, though.

However, how do we prove that RDS/Aurora PostgreSQL is going to serve our
OLTP requirement here, similar to the current on-premise Oracle Exadata?
For the OLTP use case we are expecting ~15K TPS of writes and ~2K TPS of
reads, and response times for the UI queries are expected to be sub-second.
But yes, as I mentioned, individual transactions will be batched and then
written to the database, so this should create less resource consumption
and contention.

To test whether Aurora PostgreSQL is comparable for the above needs (the
expected performance at a reasonable cost), how should we test it? Since we
won't be able to test everything right away, should we benchmark basic read
and write performance to gain some confidence and then go ahead with
development?

Say, for example, if one transaction consists of ~8 inserts, we can create
sample target tables on Aurora PostgreSQL with the required
indexes/constraints and run those inserts from multiple threads (for
concurrency) using BlazeMeter, then observe/compare the response time, CPU,
I/O, and memory usage of the PostgreSQL database at the target TPS.
Similarly, to check read performance, we can run multiple SELECT queries
from BlazeMeter and compare the response times. Is this the correct
approach for validating the database here, or does another approach exist?

Another question that comes to mind: I have read in the past that vacuum
can be a problem in PostgreSQL. Is it going to cause trouble in Aurora
PostgreSQL too, for such a highly transactional read/write system? How do
we test/validate that?
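One way to watch for vacuum trouble during such a load test is to track
dead-tuple buildup and autovacuum activity; the monitoring view below
exists in both community PostgreSQL and Aurora PostgreSQL.

```sql
-- Tables with the most dead tuples, and when autovacuum last ran on them.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;
```

If n_dead_tup keeps growing while the test runs, autovacuum is not keeping
up, and the per-table autovacuum settings likely need tuning.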


#6Ron
ronljohnsonjr@gmail.com
In reply to: veem v (#5)
Re: Question on overall design

I don't know anything about Aurora; I only have experience with RDS
Postgresql.

We successfully migrated from on-prem Oracle (12c, I think) to RDS
Postgresql 12, and were very happy: little down time (I take pride in
having designed/implemented that; thanks, ora2pg!), with lower disk usage
(8TB, down to 5TB) and lower CPU usage.

I'm not sure what the TPS was in Oracle, but the server level "we" are on
(I'm no longer with that company, and don't remember the specifics; 48 vCPU
/ 132 GB RAM, I think, with 10,000 IOPs) is over-powered for *our* needs.

You're going to have to spin up a full-sized instance, import a *lot* of
real data(*) into a production-identical schema, and then run your batch
load process using test data (or copies of real batches). That's the only
way you'll *really* know.

Beware of excessive partitioning. We had to "departition" most tables
because of the impact on performance.

(*) Use ora2pg to export views in the Oracle database. It's *very* easy; a
small EC2 VM running Linux with enough disk space lets you automate the
extraction from Oracle and the import into AWS Postgresql.
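For what it's worth, a typical ora2pg flow looks roughly like the sketch
below; the host names, schema, and paths are placeholders, and the exact
options should be checked against the ora2pg documentation.

```shell
# Create a migration project skeleton, then edit its ora2pg.conf
# (ORACLE_DSN, ORACLE_USER, SCHEMA, ...).
ora2pg --project_base /migration --init_project mydb_migration

# Export the schema DDL and the table data (as COPY statements).
ora2pg -c mydb_migration/config/ora2pg.conf -t TABLE -o schema.sql
ora2pg -c mydb_migration/config/ora2pg.conf -t COPY  -o data.sql

# Load both into the AWS PostgreSQL instance.
psql -h my-rds-endpoint -d mydb -f schema.sql
psql -h my-rds-endpoint -d mydb -f data.sql
```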


#7veem v
veema0000@gmail.com
In reply to: Ron (#6)
Re: Question on overall design

Thank you so much Ron. I have some more doubts related to this.

We were wondering whether there is any utility in PG with which we can
generate a large volume of sample data, which we could load both into our
on-premise Oracle Exadata box and into Aurora PostgreSQL in the cloud, to
compare read and write performance. Does any such utility exist, so that we
can quickly get a large data volume? But I think what you are saying is
that we should use real data to get the closest possible benchmarks;
correct me if I am wrong.
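(On the Postgres side, `generate_series` in an `INSERT ... SELECT` is the usual quick way to fabricate volume. As a database-agnostic alternative that works for both Oracle and Postgres, here is a minimal Python sketch that writes synthetic transaction rows as CSV suitable for a bulk loader such as `COPY`; the column layout and value ranges are made up for illustration:)

```python
import csv, io, random
from datetime import datetime, timedelta

def synth_transactions(n: int, start: datetime, seed: int = 42) -> str:
    """Generate n synthetic transaction rows as CSV text for bulk load."""
    rng = random.Random(seed)
    buf = io.StringIO()
    w = csv.writer(buf)
    for txn_id in range(1, n + 1):
        # random time within the day, millisecond resolution
        ts = start + timedelta(milliseconds=rng.randrange(86_400_000))
        w.writerow([
            txn_id,                              # hypothetical txn_id
            rng.randrange(1, 1_000_000),         # hypothetical account_id
            round(rng.uniform(0.01, 5000), 2),   # hypothetical amount
            ts.isoformat(),
        ])
    return buf.getvalue()

csv_text = synth_transactions(1000, datetime(2023, 12, 9))
assert len(csv_text.splitlines()) == 1000
```

Loading it is then a `\copy` from psql on the Postgres side (table and file names hypothetical).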

In Oracle we used to consult the data dictionary views (the AWR views) to
see current and historical performance statistics: CPU, IO, memory usage,
object-level contention, etc. Is there an equivalent in Aurora PostgreSQL,
so we can monitor performance during the load test, get some idea of how
well it goes, and see how much capacity headroom remains or whether we are
saturating it?

When you said "*Beware of excessive partitioning. We had to "departition"
most tables, because of the impact on performance*", my understanding is
that partitioning reduces IO while reading, since partition pruning means
less data is scanned, and that the overhead of routing an
INSERT/UPDATE/DELETE to the correct partition is minimal. So can you please
help me understand exactly what performance impact you are highlighting?
Are you talking about the size of the partitions or the total count of
partitions? In our case we are planning either daily or hourly range
partitions, depending on data volume; I am not sure whether there is a
sweet spot for partition size in PostgreSQL. If you are pointing to a high
partition count, then in our case, persisting ~90 days of data would mean
~90 daily range partitions or ~2160 hourly range partitions per transaction
table in Aurora PostgreSQL. It would be helpful if you could explain a bit
more about exactly what performance impact you faced with partitioning in
Aurora PostgreSQL.
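As a sanity check on those partition counts, a quick back-of-the-envelope calculation using the workload figures stated earlier in the thread (400 million transactions/day, ~8 inserts each):

```python
# Back-of-the-envelope partition sizing for a 90-day retention window.
TXNS_PER_DAY = 400_000_000      # from the current Exadata workload
ROWS_PER_TXN = 8                # ~8 inserts per business transaction
RETENTION_DAYS = 90

daily_parts = RETENTION_DAYS                 # one range partition per day
hourly_parts = RETENTION_DAYS * 24           # one range partition per hour

rows_per_daily_part = TXNS_PER_DAY * ROWS_PER_TXN   # rows landing per day
rows_per_hourly_part = rows_per_daily_part // 24

print(daily_parts, hourly_parts)             # 90 2160
print(f"{rows_per_daily_part:,} rows/day")   # 3,200,000,000 rows/day
```

So either scheme keeps the partition count modest by Postgres standards; the bigger question is whether ~3.2 billion rows per daily partition is comfortable to index and prune.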

*"Use ora2pg to export views in the Oracle database. It's very easy; a
small EC2 VM running Linux with enough disk space lets you automate the
extraction from Oracle and importation into AWS Postgresql."*

I need to explore this a bit more, I believe. We have an on-premise Oracle
database, so we can move data directly to Aurora PostgreSQL in the cloud.
Another thing: we have some sample data available in Snowflake on AWS; is
there a mechanism available to move that data into Aurora PostgreSQL?

On Sun, 10 Dec 2023 at 02:27, Ron Johnson <ronljohnsonjr@gmail.com> wrote:


I don't know anything about Aurora, only have experience with RDS
Postgresql.

We successfully migrated from on-prem Oracle (12c, I think) to RDS
Postgresql 12, and were very happy: little down time (I take pride in
designing/implementing that; thanks, ora2pg!), with lower disk (8TB, down
to 5TB) and CPU usage.

I'm not sure what the TPS was in Oracle, but the server level "we" are on
(I'm no longer with that company, and don't remember the specifics; 48 vCPU
/ 132 GB RAM, I think, with 10,000 IOPs) is over-powered for *our* needs.

You're going to have to spin up a full-sized instance, import a *lot* of
real data(*) into a production-identical schema and then run your batch
load process using test data (or copies of real batches). That's the only
way you'll *really* know.

Beware of excessive partitioning. We had to "departition" most tables,
because of the impact on performance.

(*) Use ora2pg to export views in the Oracle database. It's *very* easy;
a small EC2 VM running Linux with enough disk space lets you automate the
extraction from Oracle and importation into AWS Postgresql.

On Sat, Dec 9, 2023 at 3:36 PM veem v <veema0000@gmail.com> wrote:

Thank you so much for the response.

Got your point, will check if we really need details or summary for the
historical data. But it looks like we will need detailed transaction data
for ~2 years at least.

My understanding was that AWS has two different offerings, and "Aurora
PostgreSQL" is more performant and costlier than "RDS PostgreSQL". I will
double-check this, though.

However, how do we prove that RDS/Aurora PostgreSQL will serve our OLTP
requirement here, similar to the current on-premise Oracle Exadata? For the
OLTP use case we are expecting ~15K TPS write and ~2K TPS read, and the
response time for the UI queries is expected to be sub-second. But yes, as
I mentioned, individual transactions will be batched before being written
to the database, so that should reduce resource consumption and contention.
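To make that target concrete, here is the write rate implied by those numbers, under the stated assumptions of ~8 inserts per transaction and ~1000-transaction batches:

```python
# Implied write throughput for the stated OLTP target.
WRITE_TPS = 15_000        # business transactions per second
INSERTS_PER_TXN = 8       # ~8 rows per business transaction
BATCH_SIZE = 1_000        # transactions committed per batch

rows_per_sec = WRITE_TPS * INSERTS_PER_TXN       # row inserts per second
commits_per_sec = WRITE_TPS / BATCH_SIZE         # commits per second

print(rows_per_sec, commits_per_sec)  # 120000 15.0
```

So batching turns 15K logical TPS into only ~15 commits/second, but the storage layer still has to absorb ~120K row inserts (plus index maintenance) per second.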

To test whether Aurora PostgreSQL can cater to the above needs (the
expected performance at a reasonable cost), how should we test it? Since we
won't be able to test everything right away, should we benchmark basic read
and write performance to gain some confidence and then go ahead with
development?

For example, if one transaction consists of ~8 inserts, we can create
sample target tables on Aurora PostgreSQL with the required
indexes/constraints and run those inserts from multiple threads (for
concurrency) using BlazeMeter, then compare the response time, CPU, IO, and
memory usage of the PostgreSQL database at the target TPS. Similarly, to
test read performance, we can run multiple SELECT queries from BlazeMeter
and compare response times. Is this the correct approach for validating the
database here, or is there another approach?
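A minimal sketch of that kind of concurrent-insert harness, with the database round-trip replaced by a stub so only the shape is shown; in a real test the stub would be a psycopg/JDBC batch execute against the target tables, and everything here is illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def execute_batch(batch):
    # Stub: stands in for one batched INSERT round-trip to the database.
    time.sleep(0.001)      # pretend each batch commit takes ~1 ms
    return len(batch)

def run_load(threads: int, batches: int, batch_size: int):
    """Fire `batches` batches from `threads` workers, report throughput."""
    work = [[("INSERT", i)] * batch_size for i in range(batches)]
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        rows = sum(pool.map(execute_batch, work))
    elapsed = time.perf_counter() - t0
    return rows, rows / elapsed   # total rows, rows per second

rows, rps = run_load(threads=8, batches=40, batch_size=1000)
assert rows == 40_000
```

The same skeleton with real connections (and server-side metrics sampled alongside) gives the response-time/CPU/IO comparison described above; BlazeMeter or pgbench would play the same role at larger scale.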

Another question that comes to mind: I have read in the past that VACUUM
can be a problem in PostgreSQL. Is it going to give trouble in Aurora
PostgreSQL too, for such a high-throughput read/write system? How do we
test/validate that?

On Sun, 10 Dec 2023 at 01:29, Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

On Sat, Dec 9, 2023 at 2:13 PM veem v <veema0000@gmail.com> wrote:

Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

"OK" is relative, but it's what we did in a similar situation: two
years of data on-line and 5 years of data in compressed files in S3.
(We're required to keep data for 7 years, but they *never* ask for
records more than 2 years old. If they ever do, we'll manually load the
relevant data back into PG.)
(I can't imagine that querying billions of unindexed flat-file records
via S3 would be fast.)
How often do end users look for data more than 90 days old? Two years
old?
How quickly does the old data need to be available?
Maybe a four-tiered system of PG-Snowflake-S3-S3_Glacier would balance
speed and cost (or maybe Snowflake would just drive up costs).

Thank you so much, Ron.

When you said "*two years of data on-line and 5 years of data in
compressed files in S3*", do you mean two years of data in Aurora
PostgreSQL (which will be the OLTP database), with the rest dumped as-is
into an S3 bucket, and whenever we need to query that older data, we simply
load those S3 files back into Aurora PostgreSQL?

(RDS Postgresql, not Aurora, but that's beside the point.)

Yes.

But keep reading...

The currently running Oracle Exadata system holds SIX months of data
(~200TB), and the transaction tables are all range-partitioned on a daily
basis. Of that, ~2 months of data gets queried frequently and the other
~4 months less frequently. However, in the target architecture we want to
build on the cloud, there are requirements for the analytics/data science
team to query ~3 years of history. Beyond ~3 years we would need that data
only rarely.

We were initially thinking of having one database serve both the OLTP and
OLAP use cases (somewhat like Oracle Exadata does for us today), but it
seems we don't have such an option on AWS: PostgreSQL would serve the OLTP
use case, whereas Snowflake would serve the OLAP use case.

So do you suggest using both databases, i.e. streaming the most recent
~3 months of transaction data to Aurora PostgreSQL, parking data from
3 months to 3 years in Snowflake to serve the OLAP/analytics use case, and
keeping data from 3 years to 10 years in S3 (as Parquet or Iceberg) so that
Snowflake can query it directly when needed?

OR

Do you suggest keeping just the last ~3 months of data in Aurora PostgreSQL
and everything else in Snowflake, considering that it stores data
compressed and storage is cheap ($23 per TB per month)?
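For scale, the warehouse storage cost implied by those numbers; rough arithmetic only, assuming the ~200TB-per-6-months footprint stated earlier, linear growth, the quoted $23/TB-month rate, and ignoring compression gains:

```python
# Rough storage-cost estimate for the 3-month-to-3-year warehouse tier.
TB_PER_MONTH_RATE = 23          # $/TB-month (quoted warehouse rate)
TB_PER_6_MONTHS = 200           # from the current Exadata footprint
YEARS_RETAINED = 3              # warehouse tier holds up to ~3 years

tb_retained = TB_PER_6_MONTHS * (YEARS_RETAINED * 12 / 6)  # TB on hand
monthly_cost = tb_retained * TB_PER_MONTH_RATE             # $/month

print(tb_retained, monthly_cost)  # 1200.0 27600.0
```

In practice columnar compression would shrink the footprint considerably, so this is an upper bound on the storage line item (compute/query cost is separate).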

A few colleagues are pointing to Databricks for the analytics use case.
Is that a good option here?

I can't answer that without knowing what the end users actually need
(details, or just summaries of historical data, in different tiers).

You all will have to do the cost:benefit analysis of different
architectures.


#8Ron
ronljohnsonjr@gmail.com
In reply to: veem v (#7)
Re: Question on overall design

* PG has pgbench; *maybe* you can hack it to work on Oracle.
* If you want to know how well an RDBMS will work on your workload, then
you must provide it with a simulated workload. Right?
* AWS RDS Postgresql has a dashboard that *might* be similar to AWR. Or it
might not...
* We departitioned because SELECT statements were *slow*. All partitions
were scanned, even when the partition key was specified in the WHERE clause.

#9veem v
veema0000@gmail.com
In reply to: Ron (#8)
Re: Question on overall design

Thank you for your response.

Apologies if it sounds silly, but is it advisable to use just one database
in this flow, e.g. Snowflake for the OLAP use case with a cache layer (like
Redis) on top of it to cater to our OLTP use case, rather than having a
dedicated OLTP database like Aurora PostgreSQL?

Basically, I wanted to know whether we would be missing anything by not
having an OLTP database here for our use case.

On Sun, 10 Dec, 2023, 10:25 pm Ron Johnson, <ronljohnsonjr@gmail.com> wrote:

Show quoted text

* PG has pgbench; *maybe* you can hack it to work on Oracle.
* If you want to know how well an RDBMS will work on your workload, then
you must provide it with a simulated workload. Right?
* AWS RDS Postgresql has a dashboard that *might* be similar to AWR. Or
it might not...
* We departitioned because SELECT statements were *slow*. All partitions
were scanned, even when the partition key was specified in the WHERE clause.

On Sun, Dec 10, 2023 at 8:45 AM veem v <veema0000@gmail.com> wrote:

Thank you so much Ron. I have some more doubts related to this.

We were thinking , if there is any utility in PG with which we can
create/generate large sample data volume which we can use to run it on our
on premise Oracle exadata box and use the same on the aurora postgresql in
cloud to see the read and write performance comparison. Is there any such
exists, to quickly get some large data volume? But i think , what you are
saying is, we should use real data to get actual or closest possible
benchmarks, correct me if wrong?

We used to see the data dictionary views (called AWR views) in Oracle to
see the current and historical performance statistics like CPU, IO , Memory
usage, object level contentions etc. in the oracle database. Do we have
such a thing available in Aurora postgre, so as to monitor the
performance and get some idea of how well the load test goes and what
capacity is available or are we saturating it?

When you said "*Beware of excessive partitioning. We had to "departion"
most tables, because of the impact on performance.*" , as I understand
partitioning helps in reducing IO while reading , as it scans less data
because of partition pruning. And while writing there is almost minimal
overhead in identifying the required partition to which the
INSERTed/Updated/Deleted data will be landing ON. So can you please help me
understand what exact performance impact you are highlighting here? Are you
talking about the size of the partitions or total count of the partitions?
In our case we are planning to do either daily range partition or hourly
based on data data volume, not sure if there exists some sweet spot in
regards to what should be the size of each partition in postgresql be. If
you are pointing to higher count of partitions of table , then in our case
if we persist ~90 days data then for a transaction table it would be ~90
daily range partitions or ~2160 hourly range partitions in the aurora
postgresql. It would be helpful , if you can explain a bit regarding what
exact performance impact you faced in regards to the partitioning in aurora
postgresql.

*"Use ora2pg to export views in the Oracle database. It's very easy; a
small EC2 VM running Linux with enough disk space lets you automate the
extraction from Oracle and importation into AWS Postgresql.)"*

Need to explore a bit more on this I believe. We have an oracle on
premise database, so we can move data directly to aurora postgresql in the
cloud. Another thing , is we have some sample data available in the AWS
snowflake but then not sure if some mechanism is available to move the same
data to the aurora postgresql ?

On Sun, 10 Dec 2023 at 02:27, Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

I don't know anything about Aurora, only have experience with RDS
Postgresql.

We successfully migrated from on-prem Oracle (12c, I think) to RDS
Postgresql 12, and were very happy: little down time (I take pride in
designing/implementing that; thanks, ora2pg!), with lower disk (8TB, down
to 5TB) and CPU usage.

I'm not sure what the TPS was in Oracle, but the server level "we" are
on (I'm no longer with that company, and don't remember the specifics (48
vCPU / 132 GB RAM, I think, with 10,000 IOPs) is over-powered for *our*
needs.

You're going to have to spin up a full-sized instance, import a *lot* of
real data(*) into a production-identical schema and then run your batch
load process using test data (or copies of real batches). That's the only
way you'll *really* know.

Beware of excessive partitioning. We had to "departion" most tables,
because of the impact on performance.

(*) Use ora2pg to export views in the Oracle database. It's *very*
easy; a small EC2 VM running Linux with enough disk space lets you automate
the extraction from Oracle and importation into AWS Postgresql.)

On Sat, Dec 9, 2023 at 3:36 PM veem v <veema0000@gmail.com> wrote:

Thank you so much for the response.

Got your point, will check if we really need details or summary for the
historical data. But it looks like we will need detailed transaction data
for ~2 years at least.

My understanding was that AWS has two different offerings and "aurora
postgresql" is more performant and costlier than "RDS postgresql". Will
double check on this though.

However , how to prove RDS/aurora postgresql is going to serve our OLTP
requirement here , similar to the current on premise Oracle exadata. For
the OLTP use case we are expecting ~15K TPS write and 2K TPS read and the
response for the UI queries are expected to be within subseconds. But yes,
as i mentioned individual transactions will be batched and then will be
written to the database, so this should have lesser resource consumption
and contention created.

To test if Aurora postgresql will be comparable to cater the above
needs (in regards to the expected performance with nominal cost) ,how
should we test it? As we won't be able to test everything right away,
Should we test basic read and write performance and benchmark to have some
confidence and go ahead with development?

Say for example if one transaction consists of ~8 Inserts we can create
a sample target table on aurora Postgre with required indexes/constraints
and try running those inserts from multiple threads(for concurrency) using
blazemeter and see/compare the response time, CPU, IO, Memory usage for the
Postgresql database with set TPS. Similarly to see read performance, we can
run multiple select queries from blazemeter and compare the response time.
Is this the correct approach for validating the database here or any other
approach exists?

And another question coming to mind, I read in past Vaccum to be a
problem in postgresql, is it going to give trouble in Aurora postgresql
too, for such a highly transactional read/write system? How to
test/validate that?

On Sun, 10 Dec 2023 at 01:29, Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

On Sat, Dec 9, 2023 at 2:13 PM veem v <veema0000@gmail.com> wrote:

Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

"OK" is relative, but it's what we did in a similar situation: two
years of data on-line and 5 years of data in compressed files in S3.
(We're required to keep data for 7 years, but they *never* ask for
records more than 2 years old. If they ever do, we'll manually load the
relevant data back into PG.)
(I can't imagine that querying billions of unindexed flat-file
records via S3 would be fast.)
How often do end users look for data more than 90 days old? Two
years old?
How quickly does the old data need to be available?
Maybe a four-tiered system of PG-Snowflake-S3-S3_Glacier would
balance speed and cost (or maybe Snowflake would just drive up costs).

Thank You so much Ron.

When you said "*two years of data on-line and 5 years of data in
compressed files in S3*." So do you mean two years of data in aurora
postgre which will be OLTP database and rest just dump as is in S3 bucket
and when we need any such query for those data , just simply read and dump
those S3 files back in the aurora postgre?

(RDS Postgresql, not Aurora, but that's beside the point.)

Yes.

But keep reading...

In the currently running oracle exadata system , it has SIX months of

data (which is ~200TB) and the transaction tables are all range partitioned
on a daily basis. And out of that ~2months of data gets frequently queried
and other ~4months of data gets queried less frequently. However, in the
target architecture which we want to build on cloud here, there are some
requirements for the analytics/data science team to query ~3years history.
Beyond ~3years we may need that data rarely.

We were initially thinking of just having one database to serve both
OLTP and OLAP use cases(somewhat like oracle exadata currently doing for
us) but it seems we don't have such an option here on AWS. Postgre will
serve OLTP use case whereas Snowflake will serve OLAP use case.

So do you suggest having both the databases in use, like recent
transaction data for last 3 months should be streamed to aurora postgre,
then from 3months till 3years of data should be parked in snowflake which
will serve OLAP/analytics use case. and from 3years till 10years will be
kept in S3 (as parquet or Iceberg format) so that even Snowflake can query
those directly when needed.

OR

Or do you suggest keeping just the last ~3 months of data on Aurora
Postgres and everything else in Snowflake, considering that it stores data
in compressed form and its storage is cheap ($23 per TB per month)?
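To put the $23/TB/month figure in context, a back-of-envelope cost for the
warm (3-month-to-3-year) tier; the 4:1 compression ratio here is purely an
assumption and should be measured on real data:

```python
# Rough Snowflake storage cost for the warm tier at $23/TB/month.
tb_per_month = 200 / 6              # ~33.3 TB/month, from ~200 TB / 6 months
warm_tb_raw = tb_per_month * 33     # months 4..36 kept in Snowflake
compression_ratio = 4               # assumed, not measured
warm_tb_stored = warm_tb_raw / compression_ratio

monthly_cost = warm_tb_stored * 23  # $23 per TB per month
print(round(warm_tb_raw), round(monthly_cost))
```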

A few colleagues are pointing to Databricks for the analytics use case.
Is that a good option here?

I can't answer that without knowing what the end users actually need
(details, or just summaries of historical data, in different tiers).

You all will have to do the cost:benefit analysis of different
architectures.


#10 Dominique Devienne
ddevienne@gmail.com
In reply to: Ron (#8)
Re: Question on overall design

On Sun, Dec 10, 2023 at 5:56 PM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

* We departitioned because SELECT statements were *slow*. All partitions
were scanned, even when the partition key was specified in the WHERE clause.

Surely that's not the case on newer PostgreSQL, is it? Otherwise what's
the point of partitioning?
Also, I remember reading something about recent improvements with a large
number of partitions, no?

As someone who's interested in partitioning, I'd appreciate details.
Thanks, --DD

#11 Ron
ronljohnsonjr@gmail.com
In reply to: Dominique Devienne (#10)
Re: Question on overall design


This was on 12.5. v13 was just released, and we weren't confident about
running a mission-critical system on a .1 version.

All "transaction" tables were partitioned by month on partition_date,
while the PK was (table_name_id, partition_date).

Queries were _slow_, even when the application knew the partition_date
range (since queries might span months). PG just wouldn't prune.

When I departitioned the tables, performance became acceptable.

#12 Chris Travers
chris.travers@gmail.com
In reply to: Ron (#11)
Re: Question on overall design

On Tue, Dec 12, 2023 at 2:11 AM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

On Mon, Dec 11, 2023 at 4:41 AM Dominique Devienne <ddevienne@gmail.com>
wrote:

On Sun, Dec 10, 2023 at 5:56 PM Ron Johnson <ronljohnsonjr@gmail.com>
wrote:

* We departitioned because SELECT statements were *slow*. All
partitions were scanned, even when the partition key was specified in the
WHERE clause.


This was on 12.5. v13 was just released, and we weren't confident about
running a mission-critical system on a .1 version.

Something's wrong if all partitions are scanned even when the partition
clause is explicit in the where clause.

There are however some things which can cause problems here, such as type
casts of the partition key, or when the partition key is being brought in
from a join.

All "transaction" tables were partitioned by month on partition_date,
while the PK was (table_name_id, partition_date).

Queries were _slow_, even when the application knew the partition_date
range (since queries might span months). PG just wouldn't prune.

Was there a datatype issue here? Like having a partition key of type
timestamp, but the query casting from date?

When I departitioned the tables, performance became acceptable.

--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more

#13 Ron
ronljohnsonjr@gmail.com
In reply to: Chris Travers (#12)
Re: Question on overall design

On Mon, Dec 11, 2023 at 10:34 PM Chris Travers <chris.travers@gmail.com>
wrote:


Something's wrong if all partitions are scanned even when the partition
clause is explicit in the where clause.

There are however some things which can cause problems here, such as type
casts of the partition key, or when the partition key is being brought in
from a join.

Here's a snippet. part_date (type timestamp without time zone) is the
partition key:

and separation0_.part_date>=to_date('01-Jun-2021', 'DD-Mon-YYYY')
and transmissi1_.part_date>=to_date('01-Jun-2021', 'DD-Mon-YYYY')
and separation0_.part_date=transmissi1_.part_date

All "transaction" tables were partitioned by month on partition_date,
while the PK was (table_name_id, partition_date).

Queries were _slow_, even when the application knew the partition_date
range (since queries might span months). PG just wouldn't prune.

Was there a datatype issue here? Like having a partition key of type
timestamp, but the query casting from date?

The partition key was of type timestamp, while the right-hand side of
the predicate would be whatever to_date() generated.
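The quickest way to settle this class of problem is to reproduce it
minimally and read the plan. A sketch with hypothetical table and
partition names (it needs a live PostgreSQL server, so it is shown here
untested):

```sql
-- Hypothetical minimal reproduction: a range-partitioned table matching
-- the shape described above (timestamp key, monthly partitions).
CREATE TABLE txn (
    id        bigint,
    part_date timestamp NOT NULL,
    PRIMARY KEY (id, part_date)
) PARTITION BY RANGE (part_date);

CREATE TABLE txn_2021_05 PARTITION OF txn
    FOR VALUES FROM ('2021-05-01') TO ('2021-06-01');
CREATE TABLE txn_2021_06 PARTITION OF txn
    FOR VALUES FROM ('2021-06-01') TO ('2021-07-01');

-- If pruning works, txn_2021_05 should not be scanned.  Note that
-- to_date() is not immutable, so pruning may be deferred to executor
-- startup and show up as "Subplans Removed" rather than a shorter plan.
EXPLAIN (COSTS OFF)
SELECT * FROM txn
WHERE part_date >= to_date('01-Jun-2021', 'DD-Mon-YYYY');
```

Comparing that plan against one using a plain literal
(`part_date >= TIMESTAMP '2021-06-01'`) shows whether the function call
or a type mismatch is what defeats plan-time pruning.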