Aurora PostgreSQL RDS DB Latency
Hi Postgres Team,
We are using the AWS Aurora RDS PostgreSQL 9.6.8 engine. We recently deleted a
few million rows from the database and ran into an issue in one of our dev
accounts where the DB was not behaving normally after this deletion. We
reindexed and vacuumed the entire database but couldn't bring it back to the
same state as before. So as a next step we deleted the database and recreated
it by copying the snapshot from a production instance, and then vacuumed and
reindexed it again.
After this the dev database seems to be in a better state than earlier, but we
are seeing that a few of our DB calls take more than 1 minute when fetching
data, and we observed this is because the query plan is executing a hash join
as part of the query, whereas a similar query on the prod instance is not
doing any hash join and returns faster.
Also, we did not want to experiment by modifying the DB settings, such as
setting enable_hashjoin to off or random_page_cost to 1, as we don't have
these settings changed on the Prod instance.
Note:
The partition table sizes we have here are between 40 GB and 75 GB, which is
our normal size range; we create a new partition table every 7 days.
We appreciate your ideas on what we could be missing and what we can correct
here to reduce the query latency.
Thanks
githubKran
Is default_statistics_target the same on both the prod and the dev (AWS)
instances? Have you analyzed all tables being used in this query to ensure the
stats are up to date? If the optimizer is choosing a different plan, then the
stats must be different, IMO.
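For reference, the kind of thing I would check on each instance (the table
name is just a placeholder):

show default_statistics_target;
-- when each table was last analyzed, and roughly how many live rows it has
select relname, last_analyze, last_autoanalyze, n_live_tup
from pg_stat_user_tables
where relname = 'tableName';
analyze verbose tableName;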
*Michael Lewis | Software Engineer*
*Entrata*
Thanks for your reply. I have verified a few of the tables and their default
statistics targets, and they seem to be the same. Is there anything in
particular you want me to look at to differentiate the Prod and Non prod
databases? (Also, the DB instance size is the same, but there is a little
more data in the Non prod Aurora RDS instance compared to the Prod instance.)
Query used:
select * from pg_stats where tablename = 'tableName'
Did you update the stats by running ANALYZE on the tables involved, or
perhaps the entire database on the 'Non prod Aurora RDS instance'? Can you
share the two execution plans?
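To capture those, something along these lines on each instance should work
(the statement below is only a placeholder for your actual slow query):

explain (analyze, buffers)
select * from tableName where some_column = 123;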
*Michael Lewis | Software Engineer*
*Entrata*
*c: 619.370.8697*
Here are the plans for both of the DB instances.
Michael - Did you get a chance to look at this issue? Also, we see a
sequential scan being done instead of an index scan.
I didn't see your email yesterday, sorry about that. Index scans instead of
sequential scans, and a nested loop instead of a hash join, means that you
have bad row count estimates on the "Non prod Aurora RDS instance", as far as
I can figure. Have you run commands like
analyze asset_info_2019_2_part4;
analyze asset_info_2019_2_part2;
etc.? If the data are very similar, the indexes all exist, and
default_statistics_target is the same, then you should be getting the same
plans.
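If it helps, the per-column stats the planner works from can be compared
directly between the two instances with something like this (using one of
the tables above as an example):

select attname, n_distinct, null_frac, correlation
from pg_stats
where tablename = 'asset_info_2019_2_part4';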
*Michael Lewis | Software Engineer*
*Entrata*
*c: 619.370.8697*
1) Yes, we did VACUUM for all the tables like asset_info_2019_2_part1, part2,
and part4, and also for location_info_xxx, to remove any dead tuples, and we
also rebuilt the indexes.
2) REINDEX TABLE location_data_2018_12_part4; (likewise ran this for all the
tables, and also VACUUM for all the tables).
3) The data in the Non prod instance is more. One thing to mention here: when
we built the Non prod instance we copied a SNAPSHOT from the Prod instance,
inserted about 100 million rows of data on top of that, and then did VACUUM
and re-indexed the tables.
I can't think of anything else we can do here, but let us know if you need any
more details on this problem. I am happy to share more details.
How many total rows are in these tables? I am assuming these are partitions
and those 100 million rows got distributed across them. If the data difference
is significant, then you aren't guaranteed similar performance. You may want
to follow more of the suggested steps at
https://wiki.postgresql.org/wiki/SlowQueryQuestions
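A quick way to compare row counts without scanning the tables is the planner's
own estimate, e.g. (assuming the partition naming used earlier in the thread):

select relname, reltuples::bigint as estimated_rows
from pg_class
where relname like 'asset_info_2019_2_part%'
order by relname;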
*Michael Lewis | Software Engineer*
*Entrata*
*c: 619.370.8697*
Michael - Yes, correct, the data of 100 million rows is distributed across all
the partitions.
Finally, I feel we have come to a conclusion: after we changed
random_page_cost from 4 to 2 on the Non prod instance, we see improvements in
the query, which used to take 1 minute and now takes 1-2 seconds.
That's correct, we have around 490 million rows in a few of our partition
tables. The partition tables are created for every 7 days.
We ran our API test, which hits the backend Aurora RDS PostgreSQL database,
and we see that our query response times and requests/sec are better than
before. Do you recommend this setting on a Production instance? Right now we
are planning to implement this option of random_page_cost = 2. (We know it
also works if we turn off the hash join, but we don't want to implement that
change and would rather use random_page_cost = 2.)
Questions:
1) What is your recommendation on this? Can we make this change on the Prod
instance, which is performing better today, or only keep this change on the
Non prod instance? (Actually we want to implement this change on the Non Prod
instance.)
2) Are there any downsides to changing this value?
3) Also, another question: we have 64 TB of storage and are running on an
r4.16xlarge EC2 instance. A few of our partitions are bigger and a few of them
are smaller. We have data from the last 9 months and we are planning to keep
the data for close to 1 year, until May. Do you see any concerns with this?
Eventually we are thinking to archive this data in the next 2 months by
dropping the older partitions.
4) What could be the problems of keeping the data longer if there is regular
maintenance like VACUUM and other maintenance activities?
This is beyond my expertise, except to say that if your storage is SSDs in
AWS, then you definitely want random_page_cost close to the same as
seq_page_cost (1 by default), assuming your data is likely to be in cache, as
discussed in the documentation
<https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS>.
As it says, "Reducing this value relative to seq_page_cost will cause the
system to prefer index scans", as you saw. Changing the value on production
would again depend on the storage type used and how good the cache hit rate
is.
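For what it's worth, on plain PostgreSQL the change can be tried per session
before making it permanent (on Aurora the permanent change would normally go
through the cluster/DB parameter group); the database name below is only a
placeholder:

set random_page_cost = 2;                      -- current session only, for testing
alter database mydb set random_page_cost = 2;  -- persist it for one database
reset random_page_cost;                        -- undo the session setting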
As far as I know, dropping old partitions should not be significantly
impactful to the system other than no longer needing to store that data
(cost, time for full backups, etc).
Again, as I understand things, there is not a big impact from having old
unused tables in terms of maintenance. They should be ignored by normal
processes.
Glad you got your issue resolved.
*Michael Lewis*
Thanks for the feedback. You have been giving your thoughts/suggestions since
the beginning of this case, and it was helpful. I think I realized later,
based on your suggestion, that increasing the default statistics target from
100 was not done correctly initially, as I had it set at the session level
without setting it on the partition tables. As a next step I have set the
statistics target to 1000 on all of the partition tables we use from Nov 2018
to Feb 2019 and then ran ANALYZE. I am currently running a load test to see
how the DB performance is behaving right now, and so far it is running better
than before. (I have reset all the previous changes except the statistics
change.) I will keep you posted after the test finishes.
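For anyone following along, setting the target per table/column looks roughly
like this (the column name here is only a placeholder; we applied it to the
columns used in the joins and filters, then re-analyzed):

alter table asset_info_2019_2_part4 alter column asset_id set statistics 1000;
analyze asset_info_2019_2_part4;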
Questions:
1) Can I further increase the setting to 3000 and see how the system behaves?
How do I know the best value to use for my database in terms of the sampling
limit with the default statistics setting?
2) Apart from analyzing the tables, do I need to make any other changes
related to the statistics setting?
3) Also, the current work_mem is set to 4 MB and we haven't played with this
value so far. For future needs, can I increase the work_mem setting?
Appreciate your reply.
Thanks
1) You can increase it as much as you want, but (auto)analyze will take longer
to examine default_statistics_target * 300 rows (with a target of 1000, about
300,000 sampled rows per table) and compute the most common values and the
frequencies of those values. How much variation does your data actually have?
If your data only has 50 distinct values with a fairly even distribution, then
there is no need to increase it from 100 even. Also, deciding on the best
query plan will take a little more time for the optimizer, since it will be
examining a bigger collection of stats on the tables where you have increased
the target.
2) I am not aware.
3) I am not aware of anything about your application, so I can't recommend any
number outright, but 4MB for work_mem definitely seems low to me, assuming you
have 16GB or more memory available, unless you have very high concurrency. It
will depend on how many sorts per statement, how many users, etc. If you spill
over to disk on routine operations, then things are definitely going to be
much slower than if you are able to keep things in memory. You could try
running explain analyze and just verify that you are keeping things in memory.
You could also turn on automatic gathering of explain analyze plans in your
live environment if you have the room for logging and can tolerate just a
little latency.
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-WORK-MEM
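A rough way to check is something like the following, with placeholder
table/column names and a work_mem value picked only for illustration:

set work_mem = '64MB';   -- session-level experiment only
explain (analyze, buffers)
select * from asset_info_2019_2_part4 order by created_at desc limit 1000;
-- "Sort Method: quicksort  Memory: ..." means the sort stayed in memory;
-- "Sort Method: external merge  Disk: ..." means it spilled to disk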
I'm glad your performance is improved in the AWS/dev environment. It can be a
big hassle to test things in an environment that performs significantly
differently.
*Michael Lewis*
Thanks for the feedback.
Hello Team,
Hope everyone is doing great!
*Background*
We use PostgreSQL version 10.6 and rely on it heavily for our day-to-day writes and reads. We have 2 clusters running the PostgreSQL engine: one cluster keeps data for up to 60 days and the other retains data beyond 1 year. The data is partitioned roughly by week (a new partition about every 5 days), so we have around 5 partitions per month per table; with 2 primary tables that is about 10 partition tables a month. So in cluster-1 we have around 20 partitions and in cluster-2 we have around 160 partitions (data going back to 2018). We also want to keep up to 2 years of data in cluster-2 to serve our customers' data needs, and so far we have reached about 1 year of retained data.
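(For anyone trying to picture the layout: each weekly partition is roughly of this shape. The DDL below is purely illustrative; the table, columns, and date ranges are made up and do not reflect our real schema.)

-- Illustrative only: a range-partitioned archive table on the cluster-2 side
CREATE TABLE asset_data (
    asset_id   bigint       NOT NULL,
    created_at timestamptz  NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One ~5-day partition
CREATE TABLE asset_data_2019_w45 PARTITION OF asset_data
    FOR VALUES FROM ('2019-11-04') TO ('2019-11-09');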
*Current activity*
We have a custom weekly migration DB script job that moves data from one cluster to the other. It does the following (a rough sketch of these steps is shown after the list):
1) Use COPY to export the data from cluster-1, split into binary files.
2) Write the binary data into the cluster-2 table.
3) Create the indexes after the data is copied.
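A rough sketch of those three steps (the table name, file names, chunking condition, and index definitions are made up for illustration; our actual script differs):

-- Step 1, on cluster-1: export the week's data as several smaller binary files
\copy (SELECT * FROM asset_data_2019_w45 WHERE asset_id % 4 = 0) TO 'chunk_0.bin' WITH (FORMAT binary)
\copy (SELECT * FROM asset_data_2019_w45 WHERE asset_id % 4 = 1) TO 'chunk_1.bin' WITH (FORMAT binary)
-- ... remaining chunks ...

-- Step 2, on cluster-2: load each binary file into the archive partition
\copy asset_data_2019_w45 FROM 'chunk_0.bin' WITH (FORMAT binary)
\copy asset_data_2019_w45 FROM 'chunk_1.bin' WITH (FORMAT binary)

-- Step 3: build the indexes only after all the data is loaded
CREATE INDEX ON asset_data_2019_w45 (asset_id);
CREATE INDEX ON asset_data_2019_w45 (created_at);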
*The problem we have right now*
The last 2 times the weekly migration ran, the cluster's read replica instance restarted because it had fallen behind the master (writer) instance. Everything afterwards worked seamlessly, but we want to avoid the replica restarting. To avoid the restart we started producing smaller binary files and copying those over to cluster-2 instead of writing 1 big file of 450 million records. In the most recent migration the reader instance did not restart after we split the 1 big file into multiple files, but it did restart after the indexes were created on the new table, presumably because index creation is write intensive.
*DB parameters set on migration job*
work_mem set to 8 GB and maintenance_work_mem = 32 GB.
Indexes per table = 3
Total indexes for 2 tables = 5
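(These are currently applied for the whole job; for reference, they can also be scoped to just the migration session rather than cluster-wide. The sketch below only shows the values already under discussion, not a recommendation.)

-- Set only for the migration job's session, so the rest of the cluster keeps its defaults
SET maintenance_work_mem = '16GB';   -- the value being debated in question 1 below
SET work_mem = '1GB';                -- placeholder value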
*DB size*
Cluster-2 = 8.6 TB
Cluster-1 = 3.6 TB
Peak table row count = 400 - 480 million rows
Average table row count = 300 - 350 million rows
Per-table size = 90 - 95 GB; per-table index size is about 45 GB
*Questions*
1) Can we decrease maintenance_work_mem to 16 GB, and would that slow down the writes to the cluster enough that the reader instance can keep the data in sync?
2) Based on the above use case, what are your recommendations for keeping the data for up to 2 years?
3) What other recommendations do you have?
Appreciate your replies.
Thanks
githubkran
Sorry, changing the subject line.
On Fri, Nov 15, 2019 at 6:26 AM github kran <githubkran@gmail.com> wrote:
*DB parameters set on migration job*
work_mem set to 8 GB and maintenance_work_mem = 32 GB.
These numbers look crazy high - how much memory does your server have? More than 1 TB?