Unexplained disk usage in AWS Aurora Postgres

Started by Chris Borckholder over 5 years ago · 9 messages · general
#1Chris Borckholder
chris.borckholder@bitpanda.com

Hi!

We are experiencing a strange situation with an AWS Aurora postgres
instance.
The database steadily grows in size, which is expected and normal.
After enabling logical replication, the disk usage reported by AWS metrics
increases much faster than the database size (as seen by \l+ in psql). The
current state is that the database size is ~290GB, while AWS reports >640GB
disk usage.
We reached out to AWS support, of course, which is ultimately responsible.
Unfortunately, they have not been able to diagnose this so far.

I checked with the queries from the wiki
https://wiki.postgresql.org/wiki/Disk_Usage , which essentially give the
same result.
I tried to check the WAL segment file size, but we have no permission to
execute select pg_ls_waldir().
The replication slot is active and it also progresses
(pg_replication_slots.confirmed_flush_lsn increases and is close to
pg_current_wal_flush_lsn).
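
To illustrate that comparison: an LSN such as `16/B374D848` is a 64-bit WAL position written as two hexadecimal halves, so the distance between a slot's position and the current position is a plain subtraction. Below is a minimal Python sketch of that arithmetic. The helper names and the sample values are made up for illustration, and on the server side the built-in `pg_wal_lsn_diff()` performs the same calculation.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN like '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_lag_bytes(current_lsn: str, slot_lsn: str) -> int:
    """Bytes of WAL between the slot's position and the current position."""
    return parse_lsn(current_lsn) - parse_lsn(slot_lsn)


# Hypothetical values, as they might be read from pg_current_wal_flush_lsn()
# and pg_replication_slots.confirmed_flush_lsn.
lag = lsn_lag_bytes("16/B374D848", "16/B3740000")
print(f"slot is {lag} bytes behind")
```

A small lag here only shows that the subscriber keeps confirming data; it does not by itself say how much WAL the server still retains.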

Can you think of other things that I could check from within postgres with
limited permissions to diagnose this?

Best Regards
Chris

#2Srinivasa T N
seenutn@gmail.com
In reply to: Chris Borckholder (#1)
Re: Unexplained disk usage in AWS Aurora Postgres

There may be a lot of WAL files, or the log files in pg_log might be
huge. Running "du -sh *" in the data directory holding the database might help.

Regards,
Seenu.


#3Mohamed Wael Khobalatte
mkhobalatte@grubhub.com
In reply to: Chris Borckholder (#1)
Re: Unexplained disk usage in AWS Aurora Postgres


If you archive WAL files, maybe the archive_command is failing?

#4Chris Borckholder
chris.borckholder@bitpanda.com
In reply to: Srinivasa T N (#2)
Re: Unexplained disk usage in AWS Aurora Postgres

Thank you for your insight, Seenu!

That is a good point. Unfortunately, we do not have access to the
server/file system, as the database is a managed service.
Access to the file system from postgres, e.g. via pg_ls_dir, is also blocked.

Are you aware of another, creative way to infer the wal file size from
within postgres?
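
One rough way to approximate this without file system access, assuming stock PostgreSQL semantics: a replication slot's `restart_lsn` (a column of pg_replication_slots) marks the oldest WAL the slot may still need, so the distance from it to the current WAL position bounds what that slot forces the server to keep. The Python sketch below turns that distance into a count of 16 MiB segments, which is the PostgreSQL default segment size. The function name and sample LSNs are hypothetical, and since Aurora replaces the storage layer, its real accounting may differ.

```python
from typing import Tuple

WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # PostgreSQL default; an assumption for Aurora


def retained_wal_estimate(current_lsn: str, restart_lsn: str) -> Tuple[int, int]:
    """Return (bytes, segment count) of WAL held back since restart_lsn."""

    def to_bytes(lsn: str) -> int:
        # An LSN is two hex halves of a 64-bit position: 'high/low'.
        high, low = lsn.split("/")
        return (int(high, 16) << 32) | int(low, 16)

    current, restart = to_bytes(current_lsn), to_bytes(restart_lsn)
    if current < restart:
        return 0, 0
    # Count every segment touched between the two positions, inclusive.
    segments = current // WAL_SEGMENT_BYTES - restart // WAL_SEGMENT_BYTES + 1
    return current - restart, segments


# Hypothetical sample values, not taken from the instance discussed here.
retained, segments = retained_wal_estimate("16/B374D848", "16/A0000000")
print(f"about {retained / 2**20:.0f} MiB in {segments} segments")
```

If the estimate comes out small while the reported disk usage keeps growing, retained WAL for that slot is unlikely to be the whole explanation.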

Best Regards
Chris


#5Chris Borckholder
chris.borckholder@bitpanda.com
In reply to: Mohamed Wael Khobalatte (#3)
Re: Unexplained disk usage in AWS Aurora Postgres

Thanks for your insight!

I cannot find any errors related to archiving in the logs that are
accessible to me.
It's definitely something that I will forward to the support team of the
managed database.

Best Regards
Chris


#6Adam Brusselback
adambrusselback@gmail.com
In reply to: Chris Borckholder (#5)
Re: Unexplained disk usage in AWS Aurora Postgres

I would highly suggest you reach out to AWS support for Aurora questions;
support is part of what you're paying for.
For the reasons you mentioned and more, it's pretty hard to debug issues
because it isn't actually Postgres.

#7Christoph Moench-Tegeder
cmt@burggraben.net
In reply to: Chris Borckholder (#1)
Re: Unexplained disk usage in AWS Aurora Postgres

## Chris Borckholder (chris.borckholder@bitpanda.com):

We are experiencing a strange situation with an AWS Aurora postgres
instance.

The main problem here is that "Amazon Aurora" is not PostgreSQL.
If I understand Amazon's documentation, what you are using is
officially named "Amazon Aurora with PostgreSQL Compatibility",
and that sums it up quite nicely: Aurora is a database engine
developed at Amazon - and its inner workings are not publicly
documented.
Whatever is using up that disk space - only AWS Support can know.

Regards,
Christoph

--
Spare Space

#8Chris Borckholder
chris.borckholder@bitpanda.com
In reply to: Christoph Moench-Tegeder (#7)
Re: Unexplained disk usage in AWS Aurora Postgres

Thank you, Adam and Christoph,

You are totally right that AWS support is the one to help me with this
problem.
I have been in contact with them about it for quite some time, and as there
was no progress on resolving it,
I tried to find some insight or trick that I had missed here. It's a long shot
(:

Best Regards
Chris


#9Ravi Krishna
srkrishna@yahoo.com
In reply to: Christoph Moench-Tegeder (#7)
Re: Unexplained disk usage in AWS Aurora Postgres

The main problem here is that "Amazon Aurora" is not PostgreSQL.
If I understand Amazon's documentation, what you are using is
officially named "Amazon Aurora with PostgreSQL Compatibility",
and that sums it up quite nicely: Aurora is a database engine
developed at Amazon - and its inner workings are not publicly
documented.
Whatever is using up that disk space - only AWS Support can know.

Correct. Aurora is basically forked PG code, but with a different I/O layer. That explains why
they are quite behind community PG in versions.