filesystem full during vacuum - space recovery issues
Hi
I have a large database (multi-TB) which had a vacuum full running, but
the database ran out of space during the rebuild of one of the large
data tables.
Cleaning down the WAL files got the database restarted (an archiving
problem led to the initial disk full).
However, the disk space is still at 99% as it appears the large table
rebuild files are still hanging around using space and have not been
deleted.
My problem now is how to reclaim this space and get my free space back
to where it should be.
I tried some scripts to map the data files to relations, but this didn't
work: removing some files led to startup failure despite them appearing
to be unrelated to anything in the database - I had to put them back,
and then startup worked.
Any suggestions here?
Thanks
Tom
On Mon, 2024-07-15 at 14:47 -0400, Thomas Simpson wrote:
I have a large database (multi TB) which had a vacuum full running but the database ran out of space during the rebuild of one of the large data tables.
Cleaning down the WAL files got the database restarted (an archiving problem led to the initial disk full).
However, the disk space is still at 99% as it appears the large table rebuild files are still hanging around using space and have not been deleted.
My problem now is how do I get this space back to return my free space back to where it should be?
I tried some scripts to map the data files to relations but this didn't work as removing some files led to startup failure despite them appearing to be unrelated to anything in the database - I had to put them back and then startup worked.
Any suggestions here?
That reads like the sad old story: "cleaning down" WAL files - you mean deleting the
very files that would have enabled PostgreSQL to recover from the crash that was
caused by the full file system.
Did you run "pg_resetwal"? If yes, that probably led to data corruption.
The above are just guesses. Anyway, there is no good way to get rid of the files
that were left behind after the crash. The reliable way of doing so is also the way
to get rid of potential data corruption caused by "cleaning down" the database:
pg_dump the whole thing and restore the dump to a new, clean cluster.
Yes, that will be a painfully long down time. An alternative is to restore a backup
taken before the crash.
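In its simplest form, that dump-and-restore could be something like the following (host names are illustrative, and the new cluster is assumed to be freshly initialized):

# Dump the whole old cluster and feed it straight into the new one,
# with no intermediate dump file.
pg_dumpall -h old-server -U postgres | psql -h new-server -U postgres -d postgres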
Yours,
Laurenz Albe
Also, you can use a multi-process dump and restore with pg_dump, plus the
pigz utility for compression.
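For example, something along these lines (database name, job counts and paths are illustrative):

# Directory-format dump with parallel jobs; pg_restore can reload it in parallel too.
pg_dump -Fd --jobs=8 -f /backup/mydb.dir mydb
pg_restore -d mydb --jobs=8 /backup/mydb.dir

# Alternatively, a plain-format dump piped through pigz (parallel gzip) gives a
# compressed single stream, at the cost of a serial restore.
pg_dump mydb | pigz -p 8 > /backup/mydb.sql.gz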
Thanks
Thanks Laurenz & Imran for your comments.
My responses inline below.
Thanks
Tom
On 15-Jul-2024 20:58, Laurenz Albe wrote:
That reads like the sad old story: "cleaning down" WAL files - you mean deleting the
very files that would have enabled PostgreSQL to recover from the crash that was
caused by the full file system.
Did you run "pg_resetwal"? If yes, that probably led to data corruption.
No, I just removed the excess, already-archived WALs to get space and
restarted. The vacuum full that was running had created files for the
large table it was processing, and these are still hanging around eating
space without doing anything useful. The shutdown prevented the
rollback from cleanly removing them, which seems to be the core problem.
The above are just guesses. Anyway, there is no good way to get rid of the files
that were left behind after the crash. The reliable way of doing so is also the way
to get rid of potential data corruption caused by "cleaning down" the database:
pg_dump the whole thing and restore the dump to a new, clean cluster.
Yes, that will be a painfully long down time. An alternative is to restore a backup taken before the crash.
My issue now is that the dump & reload is taking a huge time; I know the
hardware is capable of multi-GB/s throughput but the reload is taking a
long time - projected to be about 10 days at the current rate
(about 30Mb/sec). The old server and new server have a 10G link between
them and storage is SSD backed, so the hardware is capable of much, much
more than it is doing now.
Is there a way to improve the reload performance? Tuning of any type -
even if I need to undo it later once the reload is done.
My backups were in progress when all the issues happened, so they're not
such a good starting point, and I'd actually prefer the clean reload
since this DB has been through multiple upgrades (without reloads) until
now, so I know it's not especially clean. The size has always prevented
a full reload before, but the database is relatively low traffic now, so
I can afford some time to reload - though ideally not 10 days.
On Wed, Jul 17, 2024 at 9:26 AM Thomas Simpson <ts@talentstack.to> wrote:
[snip]
...huge time; I know the hardware is capable of multi-GB/s throughput but the reload is taking a long time - projected to be about 10 days to reload at the current rate (about 30Mb/sec). The old server and new server have a 10G link between them and storage is SSD backed, so the hardware is capable of much much more than it is doing now.
Is there a way to improve the reload performance? Tuning of any type - even if I need to undo it later once the reload is done.
That would, of course, depend on what you're currently doing. pg_dumpall
of a Big Database is certainly suboptimal compared to "pg_dump -Fd
--jobs=24".
This is what I run (which I got mostly from a databasesoup.com blog post)
on the target instance before doing "pg_restore -Fd --jobs=24":
# TimeOut (seconds) is referenced below but not declared in the original snippet; the value here is illustrative.
declare -i TimeOut=3600
declare -i CheckPoint=30
declare -i SharedBuffs=32
declare -i MaintMem=3
declare -i MaxWalSize=36
declare -i WalBuffs=64
# Restart with restore-friendly (and deliberately crash-unsafe) settings for the duration of the load.
pg_ctl restart -wt$TimeOut -mfast \
-o "-c hba_file=$PGDATA/pg_hba_maintmode.conf" \
-o "-c fsync=off" \
-o "-c log_statement=none" \
-o "-c log_temp_files=100kB" \
-o "-c log_checkpoints=on" \
-o "-c log_min_duration_statement=120000" \
-o "-c shared_buffers=${SharedBuffs}GB" \
-o "-c maintenance_work_mem=${MaintMem}GB" \
-o "-c synchronous_commit=off" \
-o "-c archive_mode=off" \
-o "-c full_page_writes=off" \
-o "-c checkpoint_timeout=${CheckPoint}min" \
-o "-c max_wal_size=${MaxWalSize}GB" \
-o "-c wal_level=minimal" \
-o "-c max_wal_senders=0" \
-o "-c wal_buffers=${WalBuffs}MB" \
-o "-c autovacuum=off"
After the pg_restore -Fd --jobs=24 and vacuumdb --analyze-only --jobs=24:
pg_ctl stop -wt$TimeOut && pg_ctl start -wt$TimeOut
Of course, these parameter values were for *my* hardware.
Thanks Ron for the suggestions - I applied some of the settings, which
helped throughput a little, but they were not an ideal solution for me -
let me explain.
Due to the size, I do not have the option to use the directory mode (or
anything that writes the dump to disk), as that creates a directory of
multiple output files (hence why it can do multiple jobs). I do not have
the several hundred TB of space to hold the output and there is no
practical way to get it, especially for a transient reload.
I have my original server plus my replica; as the replica also applied
the WALs, it too filled up and went down. I've basically recreated it
as a primary server and am using a pipeline to dump from the original
into it, as I know it has enough space for the final loaded database
and should have space left over after the clean rebuild (whereas the
original server still has its space exhausted due to the leftover files).
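The pipeline in question is roughly of this shape (host and port are illustrative, and netcat flags vary between implementations):

# On the new (receiving) server: listen and feed the stream straight into psql.
nc -l 9999 | psql -d postgres -f -

# On the old server: stream the dump across the 10G link without writing any local file.
pg_dumpall | nc new-server 9999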
Incidentally, this state is also why going back to a backup is not
helpful either: the restore and subsequent re-apply of the WALs would
just end up filling the disk and recreating the original problem.
Even with the improved throughput, current calculations are pointing to
almost 30 days to recreate the database through dump and reload which is
a pretty horrible state to be in.
I think this is perhaps an area of improvement - especially as larger
PostgreSQL databases become more common, I'm not the only person who
could face this issue.
Perhaps an additional dumpall mode that generates multiple output pipes
(I'm piping via netcat to the other server) - it would need to be
combined with multiple listening streams on the receiving side, plus
some degree of ordering/feedback, to reproduce the essentially
serialized output of the current dumpall. But this feels like PostgreSQL
expert-developer territory.
Thanks
Tom
There's no free lunch, and you can't squeeze blood from a turnip.
Single-threading will *ALWAYS* be slow: if you want speed, temporarily
throw more hardware at it: specifically another disk (and possibly more RAM
and CPU).
On 15/07/2024 19:47, Thomas Simpson wrote:
My problem now is how do I get this space back to return my free space back to where it should be?
I tried some scripts to map the data files to relations but this didn't work as removing some files led to startup failure despite them appearing to be unrelated to anything in the database - I had to put them back and then startup worked.
I don't know exactly what you tried to do.
What would normally happen on a failed VACUUM FULL that fills up the
disk so the server crashes is that there are loads of data files
containing the partially rebuilt table. Nothing 'internal' to PostgreSQL
will point to those files as the internal pointers all change to the new
table in an ACID way, so you should be able to delete them.
You can usually find these relatively easily by looking in the relevant
tablespace directory for the base filename of a new huge table (lots
and lots of files with the same base name - e.g. looking for files called
*.1000 will find you base filenames for relations over about 1TB) and
checking whether pg_filenode_relation() can turn the filenode into a
relation. If it can't - meaning they're not currently in use for a
relation - then you should be able to just delete all those files.
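Roughly speaking, that check could look like this (the database OID, filenode and paths are placeholders):

# Large relations leave many 1GB segment files sharing a base name;
# a ".1000" segment implies a relation of roughly 1TB.
ls "$PGDATA"/base/<database_oid>/*.1000

# Ask PostgreSQL whether a candidate filenode still maps to a relation;
# a NULL result suggests the files are leftovers from the failed rebuild.
psql -d mydb -c "SELECT pg_filenode_relation(0, 123456);"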
Is this what you tried, or did your 'script to map data files to
relations' do something else? You were a bit ambiguous about that part
of things.
Paul
[BTW, v9.6 which I know is old but this server is stuck there]
Yes, I was querying relfilenode from pg_class to get the file names
(integers) and then comparing against a directory listing; files which
did not match any relfilenode were candidates to remove.
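Presumably something along these lines (database name, OID and paths are placeholders). One caveat: mapped system catalogs have relfilenode = 0 in pg_class and are tracked in pg_filenode.map instead, so a file missing from this query's output is not necessarily unused:

# Filenodes PostgreSQL knows about for this database.
psql -d mydb -At -c "SELECT relfilenode FROM pg_class WHERE relfilenode <> 0;" \
    | sort > known_filenodes.txt
# Base filenodes actually present on disk (strip segment and fork suffixes).
ls "$PGDATA"/base/<database_oid> \
    | grep -E '^[0-9]+' \
    | sed -E 's/(\.[0-9]+|_fsm|_vm|_init)$//' \
    | sort -u > on_disk.txt
# Files on disk with no matching pg_class entry.
comm -13 known_filenodes.txt on_disk.txt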
I moved these elsewhere (i.e. not delete, just move out the way so I
could move them back in case of trouble).
Without these apparently unrelated files, the database did not start and
complained about them being missing, so I had to put them back. This
was despite not finding any reference to the filename/number in pg_class.
At that point I gave up since I cannot afford to make the problem worse!
I know I'm stuck with the slow rebuild at this point. However, I doubt
I am the only person in the world who needs to dump and reload a large
database. My thought is that this is a weak point for PostgreSQL, so it
makes sense to consider ways to improve the dump/reload process,
especially as it's the last-resort upgrade path recommended in the
upgrade guide and the general fail-safe route to get out of trouble.
Thanks
Tom
On Thu, Jul 18, 2024 at 3:01 PM Thomas Simpson <ts@talentstack.to> wrote:
[snip]
[BTW, v9.6 which I know is old but this server is stuck there]
[snip]
I know I'm stuck with the slow rebuild at this point. However, I doubt I
am the only person in the world that needs to dump and reload a large
database. My thought is this is a weak point for PostgreSQL so it makes
sense to consider ways to improve the dump reload process, especially as
it's the last-resort upgrade path recommended in the upgrade guide and the
general fail-safe route to get out of trouble.
No database does fast single-threaded backups.
On 18-Jul-2024 16:32, Ron Johnson wrote:
No database does fast single-threaded backups.
Agreed. My thought is that it should be possible for a 'new dumpall' to
be multi-threaded.
Something like:
* Set number of threads on 'source' (perhaps by querying a listening
destination for how many threads it is prepared to accept via a control
port)
* Select each database in turn
* Organize the tables which do not have references themselves
* Send each table separately in each thread (or queue them until a
thread is available) ('Stage 1')
* Rendezvous stage 1 completion (pause sending, wait until feedback from
destination confirming all completed) so we have a known consistent
state that is safe to proceed to subsequent tables
* Work through tables that do refer to the previously sent tables in the
same way (since the tables they reference now exist and have their data)
('Stage 2')
* Repeat progressively until all tables are done ('Stage 3', 4 etc. as
necessary)
The current dumpall is essentially already doing this table organization
[minus stage checkpoints/multi-threading], otherwise the dump/load
would not work. It may even be doing a lot of this for 'directory'
mode? The change here is organizing n threads to process tables
concurrently where possible and coordinating the pipes so they only send
data which can be accepted.
The destination would need to have a multi-threaded listener and
co-ordinate with the sender on some control channel to feed back
completion of each stage.
Something like a destination host and control channel port to establish
the pipes, then additional netcat pipes on incremental ports above
the control port for each thread used.
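As a very rough illustration of the receiving side of such a scheme (ports, thread count and the handshake are all hypothetical - none of this exists today):

# Destination: one listener per data stream feeding psql, plus a control port.
THREADS=4
CONTROL_PORT=5000
for i in $(seq 1 $THREADS); do
  nc -l $((CONTROL_PORT + i)) | psql -d postgres -f - &
done
# Control channel: the sender would wait for stage-completion messages here
# before dispatching tables that depend on ones already loaded.
nc -l $CONTROL_PORT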
Dumpall seems like it could be a reasonable start point since it is
already doing the complicated bits of serializing the dump data so it
can be consistently loaded.
Probably not really an admin question at this point, more a feature
enhancement.
Is there anything fundamentally wrong that someone with more intimate
knowledge of dumpall could point out?
Thanks
Tom
Multi-threaded writing to the same giant text file won't work too well,
when all the data for one table needs to be together.
Just temporarily add another disk for backups.
[Added cross-post to pgsql-hackers@lists.postgresql.org - background is
a multi-TB database that needs to be recovered via pg_dumpall & reload,
and thoughts on ways to make pg_dump scale to multiple threads to
expedite loading into a new cluster. A straight dump to a file is
impractical as the dump will be 200TB; hackers may be a better home for
the discussion than the current admin list]
Hi Ron
On 18-Jul-2024 18:41, Ron Johnson wrote:
Multi-threaded writing to the same giant text file won't work too well, when all the data for one table needs to be together.
Just temporarily add another disk for backups.
For clarity, I'm not proposing multi-threaded writing to one file; the
proposal is a new special mode which makes multiple output streams
across *network sockets* to a listener on the other side. The goal is
avoiding any files at all and only using multiple network streams to
gain multi-threaded processing, with some co-ordination to keep things
organized and consistent.
This would really be specifically for the use-case of dump/reload
upgrade or recreate rather than everyday use. And particularly for very
large databases.
Looking at pg_dump.c, it's doing the baseline organization, but the
extension would be adding the required coordination with the
destination. So, for a huge table (I have many) these would go in
different streams, but if there is a dependency (FK relations etc.) the
checkpoint needs to ensure those are met before proceeding. Worst case
scenario, it would end up using only 1 thread, but it would be very
unusual to have a database where every table depends on another table
all the way down.
In theory at least, some gains should be achieved for typical databases
where a degree of parallelism is possible.
Thanks
Tom
1) Add new disk, use a new tablespace to move some big tables to it, to get back up and running
2) Replica server provisioned sufficiently for the db, pg_basebackup to it
3) Get streaming replication working
4) Switch over to new server
In other words, if you don't want terrible downtime, you need yet another server fully provisioned to be able to run your db.
--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/
200TB... how do you currently back up your database?
Do you actually have 100G networking between the nodes? Because if not, a single CPU should be able to saturate 10G.
Likewise the receiving end would need disk capable of keeping up. Which brings up the question, why not write to disk, but directly to the destination rather than write locally then copy?
Do you require dump-reload because of suspected corruption? That's a tough one. But if not, if the goal is just to get up and running on a new server, why not pg_basebackup, streaming replica, promote? That depends on the level of data modification activity being low enough that pg_basebackup can keep up with WAL as it's generated and apply it faster than new WAL comes in, but given that your server is currently keeping up with writing that much WAL and flushing that many changes, seems likely it would keep up as long as the network connection is fast enough. Anyway, in that scenario, you don't need to care how long pg_basebackup takes.
If you do need a dump/reload because of suspected corruption, the only thing I can think of is something like doing it a table at a time--partitioning would help here, if practical.
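For reference, that route would be something like the following (connection details and paths are illustrative; -X stream and -R work on 9.6):

# Clone the primary directly into the new server's empty data directory,
# streaming WAL while the copy runs, and write standby settings so it
# starts as a replica; promote it once it has caught up.
pg_basebackup -h old-server -U replication -D /var/lib/pgsql/9.6/data \
    -X stream -R --checkpoint=fast --progress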
Hi Scott,
I realize some of the background was snipped from what I sent to the
hacker list, so I'll try to fill in the details.
Short background: a very large database ran out of space during a vacuum
full, taking down the server. There is a replica which was applying the
WALs, so it too ran out of space. On restart after clearing some
space, the database came back up but left the in-progress rebuild
files behind. I've cleared that replica and am using it as my rebuild
target just now.
Trying to identify the 'orphan' files and move them away always led to
the database noticing that the supposedly unused files were gone and
refusing to start, so I had no successful way to clean up and get space
back.
Last resort after discussion is pg_dumpall & reload. I'm doing this via
a network pipe (netcat) as I do not have the vast amount of storage
necessary for the dump file to be stored (in any format).
On 19-Jul-2024 09:26, Scott Ribe wrote:
Do you actually have 100G networking between the nodes? Because if not, a single CPU should be able to saturate 10G.
Servers connect via 10G WAN; sending is not the issue, it's application
of the incoming stream on the destination which is bottlenecked.
Likewise the receiving end would need disk capable of keeping up. Which brings up the question, why not write to disk, but directly to the destination rather than write locally then copy?
In this case, it's not a local write, it's piped via netcat.
Do you require dump-reload because of suspected corruption? That's a tough one. But if not, if the goal is just to get up and running on a new server, why not pg_basebackup, streaming replica, promote? That depends on the level of data modification activity being low enough that pg_basebackup can keep up with WAL as it's generated and apply it faster than new WAL comes in, but given that your server is currently keeping up with writing that much WAL and flushing that many changes, seems likely it would keep up as long as the network connection is fast enough. Anyway, in that scenario, you don't need to care how long pg_basebackup takes.
If you do need a dump/reload because of suspected corruption, the only thing I can think of is something like doing it a table at a time--partitioning would help here, if practical.
The basebackup is, to the best of my understanding, essentially just
copying the database files. Since the failed vacuum has left extra
files, my expectation is these too would be copied, leaving me in the
same position I started in. If I'm wrong, please tell me as that would
be vastly quicker - it is how I originally set up the replica and it
took only a few hours on the 10G link.
The inability to get a clean start if I move any files out of the way
leads me to be concerned about some underlying corruption/issue, and the
recommendation earlier in the discussion was to opt for dump/reload as
the fail-safe.
Resigned to my fate, my thoughts were to see if there is a way to
improve the dump-reload approach for the future. Since dump-reload is
the ultimate upgrade suggestion in the documentation, it seems
worthwhile to see if there is a way to improve its performance,
especially as very large databases like mine are a thing with
PostgreSQL. From a quick review of pg_dump.c (I'm no expert on it,
obviously), it feels like it's already doing most of what needs to be
done, and the addition is some sort of multi-thread coordination with a
restore client to ensure each thread can successfully complete each task
it has before accepting more work. I realize that's actually difficult
to implement.
Thanks
Tom
On Jul 19, 2024, at 7:46 AM, Thomas Simpson <ts@talentstack.to> wrote:
I realize some of the background was snipped on what I sent to the hacker list, I'll try to fill in the details.
I was gone from my computer for a day and lost track of the thread.
Perhaps logical replication could help you out here?
Hi Scott
I'm not sure - perhaps, but at this point I've got that dump/reload
running, and provided it completes OK (in about 20 days' time at the
current rate), I'll be fine with this.
The database itself is essentially an archive of data and is no longer
being added to at this point, so the rebuild time is an annoyance
rather than a disaster.
[But incidentally, I am working on an even larger project which is
likely to make this one seem small, so improvement around large
databases is important to me.]
However, my thought is around how to avoid this issue in the future and
to improve the experience for others faced with the dump-reload which is
always the fall-back upgrade suggestion between versions.
Getting parallelism should be possible, and the current pg_dump does that
for directory mode from what I can see - making multiple threads etc.
According to parallel.c in pg_dump, it even looks like most of where my
thought process was going is actually already there.
The extension should be adding synchronization/checkpointing between the
generating dump and the receiving reload to ensure objects are not
processed until all their requirements are already present in the new
database. This is all based around routing via network streams instead
of the filesystem as currently happens.
Perhaps this is already in place, since the restore can be done in
parallel and so must implement that ordering already? If someone
with a good understanding of dump is able to comment or even give
suggestions, I'm not against making a first attempt at implementing
something.
I see from git blame that Tom Lane did a bunch of work around the
parallel dump back in 2020 - perhaps he could make suggestions, either
via private direct email or the list?
Thanks
Tom
Hi Doug
On 19-Jul-2024 17:21, Doug Reynolds wrote:
Thomas—
Why are you using logical backups for a database this large? A
solution like PgBackRest? Obviously, if you are going to upgrade, but
for operational use, that seems to be a slow choice.
In normal operation the server runs as a primary-replica pair and
pgbackrest handles backups. Right when disk space was used up,
pgbackrest also took a backup during the failed vacuum, so going back to
it (or anything earlier) would also roll the WALs forward to the present
and put me right back where I am just now, by running out of space part
way through.
It's a pragmatic decision: trying various things short of the
dump-reload would take a number of days, with a high likelihood of
needing to resort to dump-reload anyway. I'd already tried a few file
matching/moving exercises but they all prevented the database from
starting up, so I cut my losses and started the dump-reload this week
instead of next week, since there's a limited window before this becomes
a larger problem.
My thoughts on improving pg_dump are to help make it a better tool for
worst-case scenarios like this in the future, or for those who like the
dump-reload as part of upgrades but have reasonably sized databases.
Thanks
Tom
There is a plan for a non-text mode for pg_dumpall. I have started work
on it, and hope to have a WIP patch in a month or so. It's not my
intention to parallelize it for the first cut, but it could definitely
be parallelizable in future. However, it will require writing to disk
somewhere, albeit that the data will be compressed. It's well nigh
impossible to parallelize text format dumps.
Restoration of custom and directory format dumps has long been
parallelized. Parallel dumps require directory format, and so will
non-text pg_dumpall.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Fri, Jul 19, 2024 at 10:19 PM Thomas Simpson <ts@talentstack.to> wrote:
Hi Doug
On 19-Jul-2024 17:21, Doug Reynolds wrote:
Thomas—
Why are you using logical backups for a database this large? A solution like PgBackRest? Obviously, if you are going to upgrade, but for operational use, that seems to be a slow choice.
In normal operation the server runs as a primary-replica and pgbackrest handles backups.
Expire the oldest pgbackrest backup, so as to free up space for a
multithreaded pg_dump.
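For example (stanza name and retention setting are illustrative):

# Keep only the most recent full backup (and its dependents),
# expiring everything older to free repository space.
pgbackrest --stanza=localhost expire --repo1-retention-full=1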
Right when disk space was used up, pgbackrest also took a backup during
the failed vacuum so going back to it (or anything earlier) would also roll
forward the WALs for recovery to date and put me right back where I am just
now by running out of space part way through.
Who says you have to restore to the failure point? That's what the
"--target" option is for.
For example, if you took a full backup on 7/14 at midnight, and want to
restore to 7/18 23:00, run:
declare LL=detail
declare PGData=/path/to/data
declare -i Threads=`nproc`-2
declare BackupSet=20240714-000003F
declare RestoreUntil="2024-07-18 23:00"
# Parallel point-in-time restore of the chosen backup set, up to the target timestamp.
pgbackrest restore \
--stanza=localhost \
--log-level-file=$LL \
--log-level-console=$LL \
--process-max=${Threads} \
--pg1-path=$PGData \
--set=$BackupSet \
--type=time --target="${RestoreUntil}"
Hi Andrew,
This is very interesting.
I had started looking at pg_dumpall trying to work out an approach. I
noticed parallel.c essentially already does all the thread creation and
coordination that I knew would be needed. Given that is a solved
problem, I started to look further (continued below).
My general approach (which I'm sure is naive) was:
Add to pg_dumpall the concept of a backup phase; I have the basic hooks
in place. Phase 0 = role grants etc. - the stuff before dumping actual
databases. I intercepted the fprintf(OPF, ...) calls with a hook
function that for a normal run just ends up doing the same as fprintf,
but for my parallel mode has a hook to send the info via the network
(still to be done, but I think I may need to alter the fprintf stuff
with more granularity about what is being processed at each output to
help this part, such as outputRoleCreate, outputComment etc.).
Each subsequent phase is a whole database - incremented at each pg_dump
call. The actual pg_dump gets a new format, -F N for network; based
around the directory dump as the base, my intention was to make
multiple network pipes to send the data in place of the files within the
directory - essentially relying on whatever is already done to organize
parallel dumps to disk to be sufficient for coordinating network streaming.
The restore side needs to do network listen plus some handshaking to
confirm completion of the incoming phases, any necessary dependency
tracking on restore etc.
My goal was to actively avoid the disk usage part through the
coordination over the network between dump and restore even though my
starting point is the pg_backup_directory code. Any problem on the
restore side would feed back and halt the dump side in error so this is
a new failure mode compared with how it works just now.
I'll hold off a bit as I'm very interested in any feedback you have,
particularly if you see serious flaws in my though process here.
Thanks
Tom