filesystem full during vacuum - space recovery issues
Hi
I have a large database (multi-TB) which had a vacuum full running, but
the database ran out of space during the rebuild of one of the large
data tables.
Cleaning down the WAL files got the database restarted (an archiving
problem led to the initial disk full).
However, the disk space is still at 99% as it appears the large table
rebuild files are still hanging around using space and have not been
deleted.
My problem now is how to reclaim this space and get my free space back
to where it should be.
I tried some scripts to map the data files to relations, but this didn't
work: removing some files led to startup failure despite them appearing
to be unrelated to anything in the database - I had to put them back,
and then startup worked.
Any suggestions here?
Thanks
Tom
On Mon, 2024-07-15 at 14:47 -0400, Thomas Simpson wrote:
I have a large database (multi TB) which had a vacuum full running but the database ran out of space during the rebuild of one of the large data tables.
Cleaning down the WAL files got the database restarted (an archiving problem led to the initial disk full).
However, the disk space is still at 99% as it appears the large table rebuild files are still hanging around using space and have not been deleted.
My problem now is how do I get this space back to return my free space back to where it should be?
I tried some scripts to map the data files to relations but this didn't work as removing some files led to startup failure despite them appearing to be unrelated to anything in the database - I had to put them back and then startup worked.
Any suggestions here?
That reads like the sad old story: "cleaning down" WAL files - you mean deleting the
very files that would have enabled PostgreSQL to recover from the crash that was
caused by the full file system.
Did you run "pg_resetwal"? If yes, that probably led to data corruption.
The above are just guesses. Anyway, there is no good way to get rid of the files
that were left behind after the crash. The reliable way of doing so is also the way
to get rid of potential data corruption caused by "cleaning down" the database:
pg_dump the whole thing and restore the dump to a new, clean cluster.
Yes, that will be a painfully long down time. An alternative is to restore a backup
taken before the crash.
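In its simplest form, that dump-and-restore could be something like the following (host names are illustrative, and the new cluster is assumed to be freshly initialized):

# Dump the whole old cluster and feed it straight into the new one,
# with no intermediate dump file.
pg_dumpall -h old-server -U postgres | psql -h new-server -U postgres -d postgres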
Yours,
Laurenz Albe
Also, you can use a multi-process dump and restore with pg_dump, plus the
pigz utility for compression.
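For example, something along these lines (database name, job counts and paths are illustrative):

# Directory-format dump with parallel jobs; pg_restore can reload it in parallel too.
pg_dump -Fd --jobs=8 -f /backup/mydb.dir mydb
pg_restore -d mydb --jobs=8 /backup/mydb.dir

# Alternatively, a plain-format dump piped through pigz (parallel gzip) gives a
# compressed single stream, at the cost of a serial restore.
pg_dump mydb | pigz -p 8 > /backup/mydb.sql.gz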
Thanks
Thanks Laurenz & Imran for your comments.
My responses inline below.
Thanks
Tom
On 15-Jul-2024 20:58, Laurenz Albe wrote:
That reads like the sad old story: "cleaning down" WAL files - you mean deleting the
very files that would have enabled PostgreSQL to recover from the crash that was
caused by the full file system.
Did you run "pg_resetwal"? If yes, that probably led to data corruption.
No, I just removed the excess, already-archived WALs to get space and
restarted. The vacuum full that was running had created files for the
large table it was processing, and these are still hanging around eating
space without doing anything useful. The shutdown prevented the
rollback from cleanly removing them, which seems to be the core problem.
The above are just guesses. Anyway, there is no good way to get rid of the files
that were left behind after the crash. The reliable way of doing so is also the way
to get rid of potential data corruption caused by "cleaning down" the database:
pg_dump the whole thing and restore the dump to a new, clean cluster.
Yes, that will be a painfully long down time. An alternative is to restore a backup taken before the crash.
My issue now is that the dump & reload is taking a huge time; I know the
hardware is capable of multi-GB/s throughput but the reload is taking a
long time - projected to be about 10 days at the current rate
(about 30Mb/sec). The old server and new server have a 10G link between
them and storage is SSD backed, so the hardware is capable of much, much
more than it is doing now.
Is there a way to improve the reload performance? Tuning of any type -
even if I need to undo it later once the reload is done.
My backups were in progress when all the issues happened, so they're not
such a good starting point, and I'd actually prefer the clean reload
since this DB has been through multiple upgrades (without reloads) until
now, so I know it's not especially clean. The size has always prevented
a full reload before, but the database is relatively low traffic now, so
I can afford some time to reload - though ideally not 10 days.
On Wed, Jul 17, 2024 at 9:26 AM Thomas Simpson <ts@talentstack.to> wrote:
[snip]
...huge time; I know the hardware is capable of multi-GB/s throughput but the reload is taking a long time - projected to be about 10 days to reload at the current rate (about 30Mb/sec). The old server and new server have a 10G link between them and storage is SSD backed, so the hardware is capable of much much more than it is doing now.
Is there a way to improve the reload performance? Tuning of any type - even if I need to undo it later once the reload is done.
That would, of course, depend on what you're currently doing. pg_dumpall
of a Big Database is certainly suboptimal compared to "pg_dump -Fd
--jobs=24".
This is what I run (which I got mostly from a databasesoup.com blog post)
on the target instance before doing "pg_restore -Fd --jobs=24":
# TimeOut (seconds) is referenced below but not declared in the original snippet; the value here is illustrative.
declare -i TimeOut=3600
declare -i CheckPoint=30
declare -i SharedBuffs=32
declare -i MaintMem=3
declare -i MaxWalSize=36
declare -i WalBuffs=64
# Restart with restore-friendly (and deliberately crash-unsafe) settings for the duration of the load.
pg_ctl restart -wt$TimeOut -mfast \
-o "-c hba_file=$PGDATA/pg_hba_maintmode.conf" \
-o "-c fsync=off" \
-o "-c log_statement=none" \
-o "-c log_temp_files=100kB" \
-o "-c log_checkpoints=on" \
-o "-c log_min_duration_statement=120000" \
-o "-c shared_buffers=${SharedBuffs}GB" \
-o "-c maintenance_work_mem=${MaintMem}GB" \
-o "-c synchronous_commit=off" \
-o "-c archive_mode=off" \
-o "-c full_page_writes=off" \
-o "-c checkpoint_timeout=${CheckPoint}min" \
-o "-c max_wal_size=${MaxWalSize}GB" \
-o "-c wal_level=minimal" \
-o "-c max_wal_senders=0" \
-o "-c wal_buffers=${WalBuffs}MB" \
-o "-c autovacuum=off"
After the pg_restore -Fd --jobs=24 and vacuumdb --analyze-only --jobs=24:
pg_ctl stop -wt$TimeOut && pg_ctl start -wt$TimeOut
Of course, these parameter values were for *my* hardware.
Thanks Ron for the suggestions - I applied some of the settings, which
helped throughput a little, but they were not an ideal solution for me -
let me explain.
Due to the size, I do not have the option to use the directory mode (or
anything that writes the dump to disk), as that creates a directory of
multiple output files (hence why it can do multiple jobs). I do not have
the several hundred TB of space to hold the output and there is no
practical way to get it, especially for a transient reload.
I have my original server plus my replica; as the replica also applied
the WALs, it too filled up and went down. I've basically recreated it
as a primary server and am using a pipeline to dump from the original
into it, as I know it has enough space for the final loaded database
and should have space left over after the clean rebuild (whereas the
original server still has its space exhausted due to the leftover files).
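The pipeline in question is roughly of this shape (host and port are illustrative, and netcat flags vary between implementations):

# On the new (receiving) server: listen and feed the stream straight into psql.
nc -l 9999 | psql -d postgres -f -

# On the old server: stream the dump across the 10G link without writing any local file.
pg_dumpall | nc new-server 9999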
Incidentally, this state is also why going back to a backup is not
helpful either: the restore and subsequent re-apply of the WALs would
just end up filling the disk and recreating the original problem.
Even with the improved throughput, current calculations are pointing to
almost 30 days to recreate the database through dump and reload which is
a pretty horrible state to be in.
I think this is perhaps an area of improvement - especially as larger
PostgreSQL databases become more common, I'm not the only person who
could face this issue.
Perhaps an additional dumpall mode that generates multiple output pipes
(I'm piping via netcat to the other server) - it would need to be
combined with multiple listening streams on the receiving side, plus
some degree of ordering/feedback, to reproduce the essentially
serialized output of the current dumpall. But this feels like PostgreSQL
expert-developer territory.
Thanks
Tom
There's no free lunch, and you can't squeeze blood from a turnip.
Single-threading will *ALWAYS* be slow: if you want speed, temporarily
throw more hardware at it: specifically another disk (and possibly more RAM
and CPU).
On 15/07/2024 19:47, Thomas Simpson wrote:
My problem now is how do I get this space back to return my free space back to where it should be?
I tried some scripts to map the data files to relations but this didn't work as removing some files led to startup failure despite them appearing to be unrelated to anything in the database - I had to put them back and then startup worked.
I don't know exactly what you tried to do.
What would normally happen on a failed VACUUM FULL that fills up the
disk so the server crashes is that there are loads of data files
containing the partially rebuilt table. Nothing 'internal' to PostgreSQL
will point to those files as the internal pointers all change to the new
table in an ACID way, so you should be able to delete them.
You can usually find these relatively easily by looking in the relevant
tablespace directory for the base filename of a new huge table (lots
and lots of files with the same base name - e.g. looking for files called
*.1000 will find you base filenames for relations over about 1TB) and
checking whether pg_filenode_relation() can turn the filenode into a
relation. If it can't - meaning they're not currently in use for a
relation - then you should be able to just delete all those files.
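Roughly speaking, that check could look like this (the database OID, filenode and paths are placeholders):

# Large relations leave many 1GB segment files sharing a base name;
# a ".1000" segment implies a relation of roughly 1TB.
ls "$PGDATA"/base/<database_oid>/*.1000

# Ask PostgreSQL whether a candidate filenode still maps to a relation;
# a NULL result suggests the files are leftovers from the failed rebuild.
psql -d mydb -c "SELECT pg_filenode_relation(0, 123456);"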
Is this what you tried, or did your 'script to map data files to
relations' do something else? You were a bit ambiguous about that part
of things.
Paul
[BTW, v9.6 which I know is old but this server is stuck there]
Yes, I was querying relfilenode from pg_class to get the file names
(integers) and then comparing against a directory listing; files which
did not match any relfilenode were candidates to remove.
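Presumably something along these lines (database name, OID and paths are placeholders). One caveat: mapped system catalogs have relfilenode = 0 in pg_class and are tracked in pg_filenode.map instead, so a file missing from this query's output is not necessarily unused:

# Filenodes PostgreSQL knows about for this database.
psql -d mydb -At -c "SELECT relfilenode FROM pg_class WHERE relfilenode <> 0;" \
    | sort > known_filenodes.txt
# Base filenodes actually present on disk (strip segment and fork suffixes).
ls "$PGDATA"/base/<database_oid> \
    | grep -E '^[0-9]+' \
    | sed -E 's/(\.[0-9]+|_fsm|_vm|_init)$//' \
    | sort -u > on_disk.txt
# Files on disk with no matching pg_class entry.
comm -13 known_filenodes.txt on_disk.txt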
I moved these elsewhere (i.e. not delete, just move out the way so I
could move them back in case of trouble).
Without these apparently unrelated files, the database did not start and
complained about them being missing, so I had to put them back. This
was despite not finding any reference to the filename/number in pg_class.
At that point I gave up since I cannot afford to make the problem worse!
I know I'm stuck with the slow rebuild at this point. However, I doubt
I am the only person in the world who needs to dump and reload a large
database. My thought is that this is a weak point for PostgreSQL, so it
makes sense to consider ways to improve the dump/reload process,
especially as it's the last-resort upgrade path recommended in the
upgrade guide and the general fail-safe route to get out of trouble.
Thanks
Tom
On Thu, Jul 18, 2024 at 3:01 PM Thomas Simpson <ts@talentstack.to> wrote:
[snip]
[BTW, v9.6 which I know is old but this server is stuck there]
[snip]
I know I'm stuck with the slow rebuild at this point. However, I doubt I
am the only person in the world that needs to dump and reload a large
database. My thought is this is a weak point for PostgreSQL so it makes
sense to consider ways to improve the dump reload process, especially as
it's the last-resort upgrade path recommended in the upgrade guide and the
general fail-safe route to get out of trouble.
No database does fast single-threaded backups.
On 18-Jul-2024 16:32, Ron Johnson wrote:
No database does fast single-threaded backups.
Agreed. My thought is that it should be possible for a 'new dumpall' to
be multi-threaded.
Something like:
* Set number of threads on 'source' (perhaps by querying a listening
destination for how many threads it is prepared to accept via a control
port)
* Select each database in turn
* Organize the tables which do not have references themselves
* Send each table separately in each thread (or queue them until a
thread is available) ('Stage 1')
* Rendezvous stage 1 completion (pause sending, wait until feedback from
destination confirming all completed) so we have a known consistent
state that is safe to proceed to subsequent tables
* Work through tables that do refer to the previously sent tables in the
same way (since the tables they reference now exist and have their data)
('Stage 2')
* Repeat progressively until all tables are done ('Stage 3', 4 etc. as
necessary)
The current dumpall is essentially already doing this table organization
[minus stage checkpoints/multi-threading], otherwise the dump/load
would not work. It may even be doing a lot of this for 'directory'
mode? The change here is organizing n threads to process tables
concurrently where possible and coordinating the pipes so they only send
data which can be accepted.
The destination would need to have a multi-threaded listener and
co-ordinate with the sender on some control channel to feed back
completion of each stage.
Something like a destination host and control channel port to establish
the pipes, then additional netcat pipes on incremental ports above
the control port for each thread used.
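As a very rough illustration of the receiving side of such a scheme (ports, thread count and the handshake are all hypothetical - none of this exists today):

# Destination: one listener per data stream feeding psql, plus a control port.
THREADS=4
CONTROL_PORT=5000
for i in $(seq 1 $THREADS); do
  nc -l $((CONTROL_PORT + i)) | psql -d postgres -f - &
done
# Control channel: the sender would wait for stage-completion messages here
# before dispatching tables that depend on ones already loaded.
nc -l $CONTROL_PORT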
Dumpall seems like it could be a reasonable start point since it is
already doing the complicated bits of serializing the dump data so it
can be consistently loaded.
Probably not really an admin question at this point, more a feature
enhancement.
Is there anything fundamentally wrong that someone with more intimate
knowledge of dumpall could point out?
Thanks
Tom
Multi-threaded writing to the same giant text file won't work too well,
when all the data for one table needs to be together.
Just temporarily add another disk for backups.
[Added cross-post to pgsql-hackers@lists.postgresql.org - background is
a multi-TB database that needs to be recovered via pg_dumpall & reload,
and thoughts on ways to make pg_dump scale to multiple threads to
expedite loading into a new cluster. A straight dump to a file is
impractical as the dump will be 200TB; hackers may be a better home for
the discussion than the current admin list]
Hi Ron
On 18-Jul-2024 18:41, Ron Johnson wrote:
Multi-threaded writing to the same giant text file won't work too well, when all the data for one table needs to be together.
Just temporarily add another disk for backups.
For clarity, I'm not proposing multi-threaded writing to one file; the
proposal is a new special mode which makes multiple output streams
across *network sockets* to a listener on the other side. The goal is
avoiding any files at all and only using multiple network streams to
gain multi-threaded processing, with some co-ordination to keep things
organized and consistent.
This would really be specifically for the use-case of dump/reload
upgrade or recreate rather than everyday use. And particularly for very
large databases.
Looking at pg_dump.c, it's doing the baseline organization, but the
extension would be adding the required coordination with the
destination. So, for a huge table (I have many) these would go in
different streams, but if there is a dependency (FK relations etc.) the
checkpoint needs to ensure those are met before proceeding. Worst case
scenario, it would end up using only 1 thread, but it would be very
unusual to have a database where every table depends on another table
all the way down.
In theory at least, some gains should be achieved for typical databases
where a degree of parallelism is possible.
Thanks
Tom
1) Add new disk, use a new tablespace to move some big tables to it, to get back up and running
2) Replica server provisioned sufficiently for the db, pg_basebackup to it
3) Get streaming replication working
4) Switch over to new server
In other words, if you don't want terrible downtime, you need yet another server fully provisioned to be able to run your db.
--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/
200TB... how do you currently back up your database?
Do you actually have 100G networking between the nodes? Because if not, a single CPU should be able to saturate 10G.
Likewise the receiving end would need disk capable of keeping up. Which brings up the question, why not write to disk, but directly to the destination rather than write locally then copy?
Do you require dump-reload because of suspected corruption? That's a tough one. But if not, if the goal is just to get up and running on a new server, why not pg_basebackup, streaming replica, promote? That depends on the level of data modification activity being low enough that pg_basebackup can keep up with WAL as it's generated and apply it faster than new WAL comes in, but given that your server is currently keeping up with writing that much WAL and flushing that many changes, seems likely it would keep up as long as the network connection is fast enough. Anyway, in that scenario, you don't need to care how long pg_basebackup takes.
If you do need a dump/reload because of suspected corruption, the only thing I can think of is something like doing it a table at a time--partitioning would help here, if practical.
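For reference, that route would be something like the following (connection details and paths are illustrative; -X stream and -R work on 9.6):

# Clone the primary directly into the new server's empty data directory,
# streaming WAL while the copy runs, and write standby settings so it
# starts as a replica; promote it once it has caught up.
pg_basebackup -h old-server -U replication -D /var/lib/pgsql/9.6/data \
    -X stream -R --checkpoint=fast --progress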
Hi Scott,
I realize some of the background was snipped from what I sent to the
hacker list, so I'll try to fill in the details.
Short background: a very large database ran out of space during a vacuum
full, taking down the server. There is a replica which was applying the
WALs, so it too ran out of space. On restart after clearing some
space, the database came back up but left the in-progress rebuild
files behind. I've cleared that replica and am using it as my rebuild
target just now.
Trying to identify the 'orphan' files and move them away always led to
the database noticing that the supposedly unused files were gone and
refusing to start, so I had no successful way to clean up and get space
back.
Last resort after discussion is pg_dumpall & reload. I'm doing this via
a network pipe (netcat) as I do not have the vast amount of storage
necessary for the dump file to be stored (in any format).
On 19-Jul-2024 09:26, Scott Ribe wrote:
Do you actually have 100G networking between the nodes? Because if not, a single CPU should be able to saturate 10G.
Servers connect via 10G WAN; sending is not the issue, it's application
of the incoming stream on the destination which is bottlenecked.
Likewise the receiving end would need disk capable of keeping up. Which brings up the question, why not write to disk, but directly to the destination rather than write locally then copy?
In this case, it's not a local write, it's piped via netcat.
Do you require dump-reload because of suspected corruption? That's a tough one. But if not, if the goal is just to get up and running on a new server, why not pg_basebackup, streaming replica, promote? That depends on the level of data modification activity being low enough that pg_basebackup can keep up with WAL as it's generated and apply it faster than new WAL comes in, but given that your server is currently keeping up with writing that much WAL and flushing that many changes, seems likely it would keep up as long as the network connection is fast enough. Anyway, in that scenario, you don't need to care how long pg_basebackup takes.
If you do need a dump/reload because of suspected corruption, the only thing I can think of is something like doing it a table at a time--partitioning would help here, if practical.
The basebackup is, to the best of my understanding, essentially just
copying the database files. Since the failed vacuum has left extra
files, my expectation is these too would be copied, leaving me in the
same position I started in. If I'm wrong, please tell me as that would
be vastly quicker - it is how I originally set up the replica and it
took only a few hours on the 10G link.
The inability to get a clean start if I move any files out of the way
leads me to be concerned about some underlying corruption/issue, and the
recommendation earlier in the discussion was to opt for dump/reload as
the fail-safe.
Resigned to my fate, my thoughts were to see if there is a way to
improve the dump-reload approach for the future. Since dump-reload is
the ultimate upgrade suggestion in the documentation, it seems
worthwhile to see if there is a way to improve its performance,
especially as very large databases like mine are a thing with
PostgreSQL. From a quick review of pg_dump.c (I'm no expert on it,
obviously), it feels like it's already doing most of what needs to be
done, and the addition is some sort of multi-thread coordination with a
restore client to ensure each thread can successfully complete each task
it has before accepting more work. I realize that's actually difficult
to implement.
Thanks
Tom
On Jul 19, 2024, at 7:46 AM, Thomas Simpson <ts@talentstack.to> wrote:
I realize some of the background was snipped on what I sent to the hacker list, I'll try to fill in the details.
I was gone from my computer for a day and lost track of the thread.
Perhaps logical replication could help you out here?
Hi Scott
I'm not sure - perhaps, but at this point I've got that dump/reload
running, and provided it completes OK (in about 20 days' time at the
current rate), I'll be fine with this.
The database itself is essentially an archive of data and is no longer
being added to at this point, so the rebuild time is an annoyance
rather than a disaster.
[But incidentally, I am working on an even larger project which is
likely to make this one seem small, so improvement around large
databases is important to me.]
However, my thought is around how to avoid this issue in the future and
to improve the experience for others faced with the dump-reload which is
always the fall-back upgrade suggestion between versions.
Getting parallelism should be possible, and the current pg_dump does that
for directory mode from what I can see - making multiple threads etc.
According to parallel.c in pg_dump, it even looks like most of where my
thought process was going is actually already there.
The extension should be adding synchronization/checkpointing between the
generating dump and the receiving reload to ensure objects are not
processed until all their requirements are already present in the new
database. This is all based around routing via network streams instead
of the filesystem as currently happens.
Perhaps this is already in place, since the restore can be done in
parallel and so must implement that ordering already? If someone
with a good understanding of dump is able to comment or even give
suggestions, I'm not against making a first attempt at implementing
something.
I see from git blame that Tom Lane did a bunch of work around the
parallel dump back in 2020 - perhaps he could make suggestions, either
via private direct email or the list?
Thanks
Tom
Hi Doug
On 19-Jul-2024 17:21, Doug Reynolds wrote:
Thomas—
Why are you using logical backups for a database this large? A
solution like PgBackRest? Obviously, if you are going to upgrade, but
for operational use, that seems to be a slow choice.
In normal operation the server runs as a primary-replica pair and
pgbackrest handles backups. Right when disk space was used up,
pgbackrest also took a backup during the failed vacuum, so going back to
it (or anything earlier) would also roll the WALs forward to the present
and put me right back where I am just now, by running out of space part
way through.
It's a pragmatic decision: trying various things short of the
dump-reload would take a number of days, with a high likelihood of
needing to resort to dump-reload anyway. I'd already tried a few file
matching/moving exercises but they all prevented the database from
starting up, so I cut my losses and started the dump-reload this week
instead of next week, since there's a limited window before this becomes
a larger problem.
My thoughts on improving pg_dump are to help make it a better tool for
worst-case scenarios like this in the future, or for those who like the
dump-reload as part of upgrades but have reasonably sized databases.
Thanks
Tom
There is a plan for a non-text mode for pg_dumpall. I have started work
on it, and hope to have a WIP patch in a month or so. It's not my
intention to parallelize it for the first cut, but it could definitely
be parallelizable in future. However, it will require writing to disk
somewhere, albeit that the data will be compressed. It's well nigh
impossible to parallelize text format dumps.
Restoration of custom and directory format dumps has long been
parallelized. Parallel dumps require directory format, and so will
non-text pg_dumpall.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Fri, Jul 19, 2024 at 10:19 PM Thomas Simpson <ts@talentstack.to> wrote:
Hi Doug
On 19-Jul-2024 17:21, Doug Reynolds wrote:
Thomas—
Why are you using logical backups for a database this large? A solution like PgBackRest? Obviously, if you are going to upgrade, but for operational use, that seems to be a slow choice.
In normal operation the server runs as a primary-replica and pgbackrest handles backups.
Expire the oldest pgbackrest backup, so as to free up space for a
multithreaded pg_dump.
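For example (stanza name and retention setting are illustrative):

# Keep only the most recent full backup (and its dependents),
# expiring everything older to free repository space.
pgbackrest --stanza=localhost expire --repo1-retention-full=1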
Right when disk space was used up, pgbackrest also took a backup during
the failed vacuum so going back to it (or anything earlier) would also roll
forward the WALs for recovery to date and put me right back where I am just
now by running out of space part way through.
Who says you have to restore to the failure point? That's what the
"--target" option is for.
For example, if you took a full backup on 7/14 at midnight, and want to
restore to 7/18 23:00, run:
declare LL=detail
declare PGData=/path/to/data
declare -i Threads=`nproc`-2
declare BackupSet=20240714-000003F
declare RestoreUntil="2024-07-18 23:00"
# Parallel point-in-time restore of the chosen backup set, up to the target timestamp.
pgbackrest restore \
--stanza=localhost \
--log-level-file=$LL \
--log-level-console=$LL \
--process-max=${Threads} \
--pg1-path=$PGData \
--set=$BackupSet \
--type=time --target="${RestoreUntil}"
Hi Andrew,
This is very interesting.
I had started looking at pg_dumpall trying to work out an approach. I
noticed parallel.c essentially already does all the thread creation and
coordination that I knew would be needed. Given that is a solved
problem, I started to look further (continued below).
My general approach (which I'm sure is naive) was:
Add to pg_dumpall the concept of a backup phase; I have the basic hooks
in place. Phase 0 = role grants etc. - the stuff before dumping actual
databases. I intercepted the fprintf(OPF, ...) calls with a hook
function that for a normal run just ends up doing the same as fprintf,
but for my parallel mode has a hook to send the info via the network
(still to be done, but I think I may need to alter the fprintf stuff
with more granularity about what is being processed at each output to
help this part, such as outputRoleCreate, outputComment etc.).
Each subsequent phase is a whole database - incremented at each pg_dump
call. The actual pg_dump gets a new format, -F N for network; based
around the directory dump as the base, my intention was to make
multiple network pipes to send the data in place of the files within the
directory - essentially relying on whatever is already done to organize
parallel dumps to disk to be sufficient for coordinating network streaming.
The restore side needs to do network listen plus some handshaking to
confirm completion of the incoming phases, any necessary dependency
tracking on restore etc.
My goal was to actively avoid the disk usage part through the
coordination over the network between dump and restore even though my
starting point is the pg_backup_directory code. Any problem on the
restore side would feed back and halt the dump side in error so this is
a new failure mode compared with how it works just now.
I'll hold off a bit as I'm very interested in any feedback you have,
particularly if you see serious flaws in my though process here.
Thanks
Tom