Journal-based VACUUM FULL
Hi team,
I'm new to contributing, so hopefully this is the right place. I've
searched the forum and it seems this is the place for feature
requests/suggestions.
I was reading about VACUUM and VACUUM FULL and saw that the current
implementation of VACUUM FULL writes to an entirely new file and then
switches to it, as opposed to rewriting the current file in place. I
assume the reason for this is safety: if the database shuts down in the
middle of the vacuum, the most you lose is the progress of the vacuum,
and you have to start from scratch, but otherwise the database retains
its integrity and does not become corrupted. This seems to be an
effective but somewhat rudimentary system that could be improved. Most
notably, rewriting almost the entire database takes up essentially
double the storage for the duration of the rewrite, which can become
expensive or simply inconvenient in IaaS (and probably other)
installations where drive sizes are scaled on demand. Even if VACUUM
FULL doesn't need to run often, having to reallocate drive space for an
effective duplication is not ideal. My suggestion is a journal-based
in-place rewrite of the database files.
This means that VACUUM FULL would do a "pre-processing" pass over the
database and figure out, at a fairly low level, what operations need to
be done to compact the database back to its "correct" size. These
operations would be written in their entirety to a journal that records
everything about to be performed, with some mechanism for checking
whether each operation has already been performed, on the same
principle described here:
https://en.wikipedia.org/wiki/Journaling_file_system. This would allow
a safe in-place rewrite of the file, such that you can recover from an
unexpected shutdown by resuming the VACUUM FULL from the journal, or by
detecting where the copy hole is in the file and recording/ignoring it.
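To make this concrete, here is a rough sketch in C of what a single
journal entry might look like. This is purely illustrative; none of it
is actual PostgreSQL code, and the names and field layout are just my
assumptions:

    #include <stdint.h>

    /*
     * One entry in the hypothetical compaction journal: "copy len bytes
     * living at src_offset down to dst_offset". Compaction only ever
     * moves data toward the start of the file, so dst_offset is always
     * less than src_offset.
     */
    typedef struct JournalEntry
    {
        uint64_t src_offset;    /* where the live bytes currently sit */
        uint64_t dst_offset;    /* where they should end up */
        uint64_t len;           /* kept explicit here for clarity; since
                                 * destinations are laid out contiguously,
                                 * it could be derived from the next
                                 * entry's dst_offset, which is how the
                                 * two-pointers-plus-a-flag size estimate
                                 * below is reached */
        uint8_t  completed;     /* persisted once the copy has finished */
    } JournalEntry;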
The journal could be something as simple as a record of which byte
ranges need to be copied into which other byte ranges. The journal
should record whenever a byte-range copy completes, for the sake of
error recovery. Obviously, each byte range has a maximum size equal to
the copy distance from the source to the destination, so that the
destination never overwrites the source, which would make recovery
impossible (how could you know where in the copy you stopped?).
However, this has a snowball effect: the further you are into the
rewrite, the further apart the source and destination ranges become, so
you can copy bigger chunks at a time and don't have to update the
journal's completion flags as often. In the case of a shutdown during a
copy, you merely read the journal, look for the first copy that isn't
completed yet, and continue rewriting from there. Even if some of the
bytes have already been copied, there is no corruption, because the
source bytes were never overwritten. Finally, a simple file truncate
can take place once all the copies are complete, and the journal can be
deleted. This means the headroom required in the filesystem would be
much smaller, and a journal entry would pay for itself on any copy of
at least 17 bytes (assuming two 8-byte pointers plus a bit for the
completion flag). The only situation in which this system would consume
more space than a total copy is if the database has more than 50%
garbage, the garbage is perfectly spread out, i.e. isn't in large
chunks that can be copied (and therefore journaled) at once, and each
piece of garbage is smaller than 17 bytes. Obviously, the journal
itself would need an error-checking mechanism to ensure it was
correctly and completely written, but this can be as simple as a hash
of the whole journal stored at its end.
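Here is a rough sketch of the recovery/resume loop I have in mind,
again purely illustrative C rather than anything resembling the real
codebase. It assumes the journal has already been validated against its
trailing hash, and a real implementation would fsync and persist the
completion flags rather than just set them in memory:

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>     /* ftruncate */

    typedef struct JournalEntry
    {
        uint64_t src, dst, len;
        uint8_t  completed;
    } JournalEntry;

    /*
     * Start, or resume after a crash, the in-place compaction described
     * by a validated journal. Entries are ordered by destination offset.
     */
    static int
    resume_compaction(FILE *data, JournalEntry *journal, size_t n)
    {
        char        buf[8192];
        size_t      i;

        if (n == 0)
            return 0;

        for (i = 0; i < n; i++)
        {
            JournalEntry *e = &journal[i];
            uint64_t    done = 0;

            if (e->completed)
                continue;       /* finished before the crash */

            /*
             * Redo the whole copy. This is safe even if it was half-done
             * when we crashed: dst + len <= src guarantees the source
             * bytes are still intact, so re-copying is idempotent.
             */
            while (done < e->len)
            {
                uint64_t    left = e->len - done;
                size_t      chunk = left > sizeof(buf) ? sizeof(buf)
                                                       : (size_t) left;

                if (fseeko(data, (off_t) (e->src + done), SEEK_SET) != 0 ||
                    fread(buf, 1, chunk, data) != chunk ||
                    fseeko(data, (off_t) (e->dst + done), SEEK_SET) != 0 ||
                    fwrite(buf, 1, chunk, data) != chunk)
                    return -1;
                done += chunk;
            }

            fflush(data);       /* a real version would fsync the data... */
            e->completed = 1;   /* ...then persist this flag to the journal */
        }

        /* All live data now sits at the front of the file; cut the tail. */
        return ftruncate(fileno(data),
                         (off_t) (journal[n - 1].dst + journal[n - 1].len));
    }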
An alternative to the completion flags is to compute a hash of the data
to be copied and store it in the journal; during recovery you compare
the destination with the hash. This has the advantage of not needing to
write to the journal to keep it up to date during the operations, but
the disadvantages of having to compute many hashes while recovering and
of storing the hashes in the journal, taking up more space. It's also
arguably less safe, as there is always the chance (albeit extremely
unlikely) of a collision, which would mean the data is not actually
validated. I would argue the risk of this is lower than the risk of bit
rot flipping a completion bit, however.
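The hash variant of recovery might look something like this. I've used
64-bit FNV-1a purely to keep the sketch dependency-free; any reasonable
hash would do:

    #include <stdint.h>
    #include <stddef.h>

    /* 64-bit FNV-1a; a stand-in for whatever hash the journal uses. */
    static uint64_t
    fnv1a(const unsigned char *p, size_t len)
    {
        uint64_t    h = 0xcbf29ce484222325ULL;  /* FNV offset basis */

        while (len-- > 0)
        {
            h ^= *p++;
            h *= 0x100000001b3ULL;              /* FNV prime */
        }
        return h;
    }

    /*
     * No completion flag to maintain: the journal stores, per entry, a
     * hash of the source bytes taken when the journal was written, and
     * recovery decides whether an entry was already copied by hashing
     * whatever currently sits at the destination.
     */
    static int
    entry_already_done(const unsigned char *dst_bytes, size_t len,
                       uint64_t recorded_hash)
    {
        return fnv1a(dst_bytes, len) == recorded_hash;
    }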
A journaling system like this *might* have performance benefits too,
specifically on less intelligent file systems like NTFS, which can
become easily fragmented (causing potentially big performance issues on
spinning disks). Rewriting the same file will never require a
filesystem defragment. The other advantage, as mentioned before, is
with auto-scaling drives used as storage for the DB (as they often are
in IaaS/PaaS services). Not having to scale up rapidly could be a
performance boost in some cases.
Finally, a journaling system like this would also lend itself to
stopping and resuming in the middle of a VACUUM FULL. Once the journal
is created and the rewrites have started, assuming the journal's
completion flags are kept up to date, you can stop the operation in the
middle (presumably filling the current "gap" with null bytes or
otherwise indicating to the DB that there's a gap in the middle that
should be ignored) and continue using the database as usual. This means
you can do a "soft" recovery wherein the database is only halfway
vacuumed but is still perfectly operational and functional. You can
also resume from this soft recovery by simply continuing from the last
copy that was completed. Obviously, you only regain disk space when you
reach the end and truncate the file, but you are at least able to pause
and resume the operation, waiting only for the current copy block to
finish instead of for the entire VACUUM FULL to finish.
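A sketch of the pause step, under my assumption above that the
transient gap is simply zeroed so the DB knows to ignore it; resuming
is then the same scan used for crash recovery:

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Pause a half-finished compaction: zero out the gap between the end
     * of the already-compacted data and the start of the data still to
     * be moved, leaving the file fully usable. The two offsets come from
     * the journal (end of the last completed destination range, start of
     * the next source range).
     */
    static int
    pause_compaction(FILE *data, uint64_t gap_start, uint64_t gap_end)
    {
        static const char zeros[4096];      /* zero-initialized */
        uint64_t    off = gap_start;

        if (fseeko(data, (off_t) gap_start, SEEK_SET) != 0)
            return -1;

        while (off < gap_end)
        {
            uint64_t    left = gap_end - off;
            size_t      chunk = left > sizeof(zeros) ? sizeof(zeros)
                                                     : (size_t) left;

            if (fwrite(zeros, 1, chunk, data) != chunk)
                return -1;
            off += chunk;
        }
        return fflush(data);    /* again, a real version would fsync */
    }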
I hope this was a coherent explanation of my suggestion. It's possible and
maybe even likely that there's a glaring misunderstanding or assumption on
my part that means this isn't practical, but I'd still love to get feedback
on it.
Thanks,
Ryan Sheasby
On 2/21/19 12:16 AM, Ryan David Sheasby wrote:
I was reading about VACUUM and VACUUM FULL and saw that the current
implementation of VACUUM FULL writes to an entirely new file and then
switches to it, as opposed to rewriting the current file in place. I
assume the reason for this is safety: if the database shuts down in the
middle of the vacuum, the most you lose is the progress of the vacuum,
and you have to start from scratch, but otherwise the database retains
its integrity and does not become corrupted. This seems to be an
effective but somewhat rudimentary system that could be improved. Most
notably, rewriting almost the entire database takes up essentially
double the storage for the duration of the rewrite, which can become
expensive or simply inconvenient in IaaS (and probably other)
installations where drive sizes are scaled on demand. Even if VACUUM
FULL doesn't need to run often, having to reallocate drive space for an
effective duplication is not ideal. My suggestion is a journal-based
in-place rewrite of the database files.
Hi,
VACUUM FULL used to modify the table in place in PostgreSQL 8.4 and
earlier, but that solution was slow and often caused plenty of index
bloat while moving the rows around in the table, which is why
PostgreSQL 9.0 switched to rewriting the whole table and its indexes.
I have not heard many requests for bringing back the old behavior, but
I could easily have missed them. Either way I do not think there would
be much demand for an in-place VACUUM FULL unless the index bloat
problem is also solved.
Additionally, I do not think that the project would want a whole new kind
of infrastructure just to solve this very narrow case. PostgreSQL
already has its own journal (the write-ahead log) which is used to
ensure crash safety, and I think any proposed solution for this would
need to use the WAL.
Andreas
Thanks for getting back to me. I had a small discussion with @sfrost on
the Slack team and understand the issue better now. I must admit I
didn't realize that the scope of the WAL extended to VACUUM operations,
which is why I suggested a new journaling system. I realize now that
the issue is not safety (the WAL already sorts that out), but
performance. I will rethink my suggestion and let you know if I come up
with a useful/performant way of doing this.
Thanks,
Ryan Sheasby
Hi,
On 2019-02-21 16:27:06 +0100, Andreas Karlsson wrote:
I have not heard many requests for bringing back the old behavior, but I
could easily have missed them. Either way I do not think there would be much
demand for an in-place VACUUM FULL unless the index bloat problem is also
solved.
Yea, I don't think there's much either. What I think there's PLENTY of
need for is something like pg_repack in core. And one could argue that
the trigger-based logging it does to the old table, to catch up with
changes made concurrently with the rewrite, is a form of journaling...
Greetings,
Andres Freund
On 2/22/19 2:15 PM, Andres Freund wrote:
Hi,
On 2019-02-21 16:27:06 +0100, Andreas Karlsson wrote:
I have not heard many requests for bringing back the old behavior, but I
could easily have missed them. Either way I do not think there would be much
demand for an in-place VACUUM FULL unless the index bloat problem is also
solved.

Yea, I don't think there's much either. What I think there's PLENTY of
need for is something like pg_repack in core. And one could argue that
the trigger-based logging it does to the old table, to catch up with
changes made concurrently with the rewrite, is a form of journaling...
+1. Maybe this is something that should be on the agenda of the next
developers' meeting.
cheers
andrew
--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,
* Andrew Dunstan (andrew.dunstan@2ndquadrant.com) wrote:
On 2/22/19 2:15 PM, Andres Freund wrote:
On 2019-02-21 16:27:06 +0100, Andreas Karlsson wrote:
I have not heard many requests for bringing back the old behavior, but I
could easily have missed them. Either way I do not think there would be much
demand for an in-place VACUUM FULL unless the index bloat problem is also
solved.

Yea, I don't think there's much either. What I think there's PLENTY of
need for is something like pg_repack in core. And one could argue that
the trigger-based logging it does to the old table, to catch up with
changes made concurrently with the rewrite, is a form of journaling...

+1. Maybe this is something that should be on the agenda of the next
developers' meeting.
Seems more appropriate to the developer unconference, though perhaps
that's what you meant...?
Thanks!
Stephen