finding changed blocks using WAL scanning

Started by Robert Haas, about 7 years ago. 56 messages. pgsql-hackers.
#1Robert Haas
robertmhaas@gmail.com

Over at /messages/by-id/CA+TgmobFVe4J4AA7z9OMUzKnm09Tt+sybhxeL_Ddst3q3wqpzQ@mail.gmail.com
I mentioned parsing the WAL to extract block references so that
incremental backup could efficiently determine which blocks needed to
be copied. Ashwin replied in
/messages/by-id/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrQQsm8kFO4A@mail.gmail.com
to mention that the same approach could be useful for pg_upgrade:

# Currently, pg_rewind requires
# all the WAL logs to be present on source side from point of divergence to
# rewind. Instead just parse the wal and keep the changed blocks around on
# source. Then don't need to retain the WAL but can still rewind using the
# changed block map.

Since there are at least two possible use cases for an efficient way to
learn when blocks last changed, and in each case the performance
benefits of being able to learn that are potentially quite large, I'm
starting this thread to brainstorm how such a system might work.

It seems to me that there are basically two ways of storing this kind
of information, plus a bunch of variants. One way is to store files
that cover a range of LSNs, and basically contain a synopsis of the
WAL for those LSNs. You omit all the actual data and just mention
which blocks were changed by some record in that part of the WAL. In
this type of scheme, the storage required is roughly proportional to
the volume of WAL for which you wish to retain data. Pruning old data
is easy; just remove the files that provide information about LSNs
that you don't care about any more. The other way is to store data
about each block, or each range of blocks, or all the blocks that hash
onto a certain slot; for each, store the newest LSN that has modified
that block, or a block in that range, or a block that hashes onto
that slot. In this system, storage is roughly proportional to the
size of the database cluster, except maybe in the hashing case, but I
*think* that case will degrade unless you basically expand the map to
be roughly proportional to the size of the cluster anyway. I may be
wrong.

Of these two variants, I am inclined to prefer the version where each
file is a summary of the block references within some range of LSNs.
It seems simpler to implement to me. You just read a bunch of WAL
files and then when you get tired you stop and emit an output file.
You need to protect yourself against untimely crashes. One way is to
stick a checksum into the output file. After you finish writing it,
fsync() it before you start writing the next one. After a restart,
read the latest such file and see if the checksum is OK. If not,
regenerate it; if so, assume it's good and move on. Files other than
the last one can be assumed good. Another way is to create the file
with a temporary name, fsync() it, and then rename it into place and
fsync() again. The background worker that generates the files can
have a GUC to remove them when they are older than some threshold
amount of time, or you can keep them forever and let the user manually
remove stuff they no longer want based on LSN. That's pretty much it.
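The temporary-name-plus-rename variant described above can be sketched as follows. This is only an illustration in Python (the real thing would be C inside a background worker, and the function name is made up); it shows the fsync ordering that makes the file appear atomically and durably under its final name:

```python
import os

def write_summary_file(final_path, data):
    """Write a WAL-summary file so that a crash can never leave a
    half-written file under the final name: write under a temporary
    name, fsync it, rename into place, then fsync the directory."""
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # data durable before the rename
    os.rename(tmp_path, final_path)   # atomic on POSIX filesystems
    dir_fd = os.open(os.path.dirname(final_path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)              # make the rename itself durable
    finally:
        os.close(dir_fd)
```

With this ordering, any file that exists under its final name is known complete, so no checksum check is needed on restart.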

The version where you keep an LSN per block/range/hash bucket seems
more complex in terms of durability. Now you have to worry not only
about creating new files, but about modifying them. That seems to add
some complexity. I think it may be possible to make it work without
doing write-ahead logging for every change, but it certainly needs
careful thought to avoid torn page problems and checkpoint
synchronization issues. Moreover, it potentially uses lots and lots
of inodes if there are many relations in the cluster. You can avoid
that by not creating maps for small files, if you like, or by
switching to the hash bucket approach. But either of those approaches
is lossy. Omitting the maps for small files means you always have to
assume everything in those files changed. The hash bucket approach is
vulnerable to hash collisions; you have to assume that all blocks that
hash to a given bucket have changed. Those are probably manageable
disadvantages, but I think they are real ones.

There is one thing that does worry me about the file-per-LSN-range
approach, and that is memory consumption when trying to consume the
information. Suppose you have a really high velocity system. I don't
know exactly what the busiest systems around are doing in terms of
data churn these days, but let's say just for kicks that we are
dirtying 100GB/hour. That means, roughly 12.5 million block
references per hour. If each block reference takes 12 bytes, that's
maybe 150MB/hour in block reference files. If you run a daily
incremental backup, you've got to load all the block references for
the last 24 hours and deduplicate them, which means you're going to
need about 3.6GB of memory. If you run a weekly incremental backup,
you're going to need about 25GB of memory. That is not ideal. One
can keep the memory consumption to a more reasonable level by using
temporary files. For instance, say you realize you're going to need
25GB of memory to store all the block references you have, but you
only have 1GB of memory that you're allowed to use. Well, just
hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno,
writing each batch to a separate temporary file, and then process each
of those 32 files separately. That does add some additional I/O, but
it's not crazily complicated and doesn't seem too terrible, at least
to me. Still, it's something not to like.
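The hash-partitioning trick above might look like this sketch. The fixed-width record layout and field set are hypothetical (chosen to match the dboid/tsoid/relfilenode/segno partitioning key mentioned above), and the real code would of course be C:

```python
import os, struct, tempfile

# Hypothetical record: dboid, tsoid, relfilenode, segno, blkno,
# each a 4-byte little-endian unsigned int.
REC = struct.Struct("<IIIII")

def partition_block_refs(block_refs, num_partitions=32, tmpdir=None):
    """Hash-partition block references into temporary files so each
    file can later be deduplicated independently in bounded memory."""
    tmpdir = tmpdir or tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, "refs.%d" % i) for i in range(num_partitions)]
    files = [open(p, "wb") for p in paths]
    for ref in block_refs:
        dboid, tsoid, relfilenode, segno, blkno = ref
        # Partition on everything except the block number, so all
        # references to one segment land in the same batch.
        part = hash((dboid, tsoid, relfilenode, segno)) % num_partitions
        files[part].write(REC.pack(*ref))
    for f in files:
        f.close()
    return paths

def dedup_partition(path):
    """Read one partition back and return its distinct block refs."""
    refs = set()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(REC.size), b""):
            refs.add(REC.unpack(chunk))
    return refs
```

Each partition is then small enough to deduplicate in memory on its own.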

Another problem to think about is whether the most recent data is
going to be available when you need it. This concern applies to
either approach. In the case of incremental backup, the server is
going to be up and running, so if the WAL-scanner gets behind, you can
just wait for it to catch up. In the case of pg_rewind, the server is
going to be down, so that doesn't work. You probably need to figure
out how new the data you have is, and then scan all the newer WAL to
pick up any additional block references. That's a bit of a pain, but
I don't see any real alternative. In the case of the
file-per-LSN-range approach, it's easy to see what LSNs are covered by
the files. In the case of the LSN-per-block/range/hash bucket
approach, you can presumably rely on the last checkpoint having
flushed all the pending changes to disk, but the WAL scanner could've
been arbitrarily far behind at that point, so you'd have to store a
piece of metadata someplace that tells you exactly how far the WAL
scanner had progressed and then have pg_rewind fish it out.

Yet another question is how to make sure WAL doesn't get removed
before we finish scanning it. Peter mentioned on the other thread
that we could use a variant replication slot, which immediately made
me wonder why we'd need a variant. Actually, the biggest problem I
see here is that if we use a replication slot, somebody might try to
drop it or use it for some other purpose, and that would mess things
up. I guess we could pull the usual trick of reserving names that
start with 'pg_' for internal purposes. Or we could just hard-code
the LSN that was last scanned for this purpose as a bespoke constraint
on WAL discard. Not sure what is best.

I think all of this should be optional functionality. It's going to
be really useful for people with large databases, I think, but people
with small databases may not care, and it won't be entirely free. If
it's not enabled, then the functionality that would otherwise exploit
it can fall back to doing things in a less efficient way; nothing
needs to break hard.

Opinions?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#2Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#1)
Re: finding changed blocks using WAL scanning

On Wed, Apr 10, 2019 at 5:49 PM Robert Haas <robertmhaas@gmail.com> wrote:

There is one thing that does worry me about the file-per-LSN-range
approach, and that is memory consumption when trying to consume the
information. Suppose you have a really high velocity system. I don't
know exactly what the busiest systems around are doing in terms of
data churn these days, but let's say just for kicks that we are
dirtying 100GB/hour. That means, roughly 12.5 million block
references per hour. If each block reference takes 12 bytes, that's
maybe 150MB/hour in block reference files. If you run a daily
incremental backup, you've got to load all the block references for
the last 24 hours and deduplicate them, which means you're going to
need about 3.6GB of memory. If you run a weekly incremental backup,
you're going to need about 25GB of memory. That is not ideal. One
can keep the memory consumption to a more reasonable level by using
temporary files. For instance, say you realize you're going to need
25GB of memory to store all the block references you have, but you
only have 1GB of memory that you're allowed to use. Well, just
hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno,
writing each batch to a separate temporary file, and then process each
of those 32 files separately. That does add some additional I/O, but
it's not crazily complicated and doesn't seem too terrible, at least
to me. Still, it's something not to like.

Oh, I'm being dumb. We should just have the process that writes out
these files sort the records first. Then when we read them back in to
use them, we can do a merge pass like MergeAppend would do, and
you never need very much memory at all.
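The sort-then-merge idea can be sketched in a few lines. This is an illustration, not the proposed implementation; it assumes each run (one per block-reference file) is already sorted, as the writer would guarantee:

```python
import heapq

def merge_deduplicate(sorted_runs):
    """Merge several already-sorted streams of block references and
    yield each distinct reference exactly once, MergeAppend-style.
    Memory use is proportional to the number of runs, not their size."""
    prev = None
    for ref in heapq.merge(*sorted_runs):
        if ref != prev:       # runs are sorted, so duplicates are adjacent
            yield ref
            prev = ref
```

In the real system each run would be streamed from disk rather than held in a list, which is what keeps memory flat.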

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Peter Eisentraut
peter_e@gmx.net
In reply to: Robert Haas (#1)
Re: finding changed blocks using WAL scanning

On 2019-04-10 23:49, Robert Haas wrote:

It seems to me that there are basically two ways of storing this kind
of information, plus a bunch of variants. One way is to store files
that cover a range of LSNs, and basically contain a synopsis of the
WAL for those LSNs. You omit all the actual data and just mention
which blocks were changed by some record in that part of the WAL.

That seems better than the other variant.

Yet another question is how to make sure WAL doesn't get removed
before we finish scanning it. Peter mentioned on the other thread
that we could use a variant replication slot, which immediately made
me wonder why we'd need a variant. Actually, the biggest problem I
see here is that if we use a replication slot, somebody might try to
drop it or use it for some other purpose, and that would mess things
up. I guess we could pull the usual trick of reserving names that
start with 'pg_' for internal purposes. Or we could just hard-code
the LSN that was last scanned for this purpose as a bespoke constraint
on WAL discard. Not sure what is best.

The word "variant" was used as a hedge ;-), but now that I think about
it ...

I had in mind that you could have different overlapping incremental
backup jobs in existence at the same time. Maybe a daily one to a
nearby disk and a weekly one to a faraway cloud. Each one of these
would need a separate replication slot, so that the information that is
required for *that* incremental backup series is preserved between runs.
So just one reserved replication slot that feeds the block summaries
wouldn't work. Perhaps what would work is a flag on the replication
slot itself "keep block summaries for this slot". Then when all the
slots with the block summary flag are past an LSN, you can clean up the
summaries before that LSN.
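The cleanup rule in that last sentence reduces to a minimum over the flagged slots. A sketch, with a made-up representation of summary files as (start_lsn, end_lsn) ranges:

```python
def removable_summaries(summary_files, flagged_slot_lsns):
    """summary_files: list of (start_lsn, end_lsn) ranges.
    flagged_slot_lsns: positions of slots marked 'keep block summaries'.
    A summary file is removable once every flagged slot has advanced
    past its end LSN."""
    if not flagged_slot_lsns:
        return list(summary_files)      # no flagged slot pins anything
    horizon = min(flagged_slot_lsns)    # oldest consumer's position
    return [f for f in summary_files if f[1] <= horizon]
```

So the daily and weekly backup jobs each hold their own slot, and only summaries older than the slowest of them are cleaned up.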

I think all of this should be optional functionality. It's going to
be really useful for people with large databases, I think, but people
with small databases may not care, and it won't be entirely free. If
it's not enabled, then the functionality that would otherwise exploit
it can fall back to doing things in a less efficient way; nothing
needs to break hard.

With the flag-on-the-slot scheme you wouldn't need a separate knob to
turn this on, because it's just enabled when backup software has
created an appropriate slot.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Robert Haas
robertmhaas@gmail.com
In reply to: Peter Eisentraut (#3)
Re: finding changed blocks using WAL scanning

On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

I had in mind that you could have different overlapping incremental
backup jobs in existence at the same time. Maybe a daily one to a
nearby disk and a weekly one to a faraway cloud. Each one of these
would need a separate replication slot, so that the information that is
required for *that* incremental backup series is preserved between runs.
So just one reserved replication slot that feeds the block summaries
wouldn't work. Perhaps what would work is a flag on the replication
slot itself "keep block summaries for this slot". Then when all the
slots with the block summary flag are past an LSN, you can clean up the
summaries before that LSN.

I don't think that quite works. There are two different LSNs. One is
the LSN of the oldest WAL archive that we need to keep around so that
it can be summarized, and the other is the LSN of the oldest summary
we need to keep around so it can be used for incremental backup
purposes. You can't keep both of those LSNs in the same slot.
Furthermore, the LSN stored in the slot is defined as the amount of
WAL we need to keep, not the amount of something else (summaries) that
we need to keep. Reusing that same field to mean something different
sounds inadvisable.

In other words, I think there are two problems which we need to
clearly separate: one is retaining WAL so we can generate summaries,
and the other is retaining summaries so we can generate incremental
backups. Even if we solve the second problem by using some kind of
replication slot, we still need to solve the first problem somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Ashwin Agrawal
aagrawal@pivotal.io
In reply to: Robert Haas (#1)
Re: finding changed blocks using WAL scanning

On Wed, Apr 10, 2019 at 2:50 PM Robert Haas <robertmhaas@gmail.com> wrote:

Over at
/messages/by-id/CA+TgmobFVe4J4AA7z9OMUzKnm09Tt+sybhxeL_Ddst3q3wqpzQ@mail.gmail.com
I mentioned parsing the WAL to extract block references so that
incremental backup could efficiently determine which blocks needed to
be copied. Ashwin replied in

/messages/by-id/CALfoeitO-vkfjubMFQRmgyXghL-uUnZLNxbr=obrQQsm8kFO4A@mail.gmail.com
to mention that the same approach could be useful for pg_upgrade:

Thank you for initiating this separate thread. Just a typo above: it's
not pg_upgrade but pg_rewind.

Let me first explain the thought I have around how to leverage this for
pg_rewind, or really any type of incremental recovery. Would love to
hear thoughts on it.

Currently, incremental recovery of any form (a replica going down and
coming back up, or bringing back the old primary after failover to a
replica) requires *all* the WAL to be present from the point of
disconnect. So it's all-or-nothing in those terms: if the WAL is
available we can recover incrementally, otherwise we have to perform a
full basebackup. If we come up with this mechanism to find and store
the changed blocks from WAL, we can provide an intermediate level of
incremental recovery, which will be better than full recovery.

WAL allows tuple-level granularity for recovery (if we ignore FPIs for
a moment). The modified blocks from WAL, when the WAL itself is no
longer available, still allow block-level incremental recovery.

So pg_basebackup (or some other tool, or just an option to it) and
pg_rewind can leverage the changed blocks and perform the recovery when
the WAL can't be retained due to space constraints.

pg_rewind can also be optimized: it currently copies blocks from
source to target which were present in the target's WAL to rewind, so
such blocks can easily be skipped from being copied again.

Depending on the pattern and size of the changes in the WAL, just
copying over the changed blocks instead of replaying all the WAL for
incremental recovery could prove more efficient.

It seems to me that there are basically two ways of storing this kind

of information, plus a bunch of variants. One way is to store files
that cover a range of LSNs, and basically contain a synopsis of the
WAL for those LSNs. You omit all the actual data and just mention
which blocks were changed by some record in that part of the WAL. In
this type of scheme, the storage required is roughly proportional to
the volume of WAL for which you wish to retain data. Pruning old data
is easy; just remove the files that provide information about LSNs
that you don't care about any more. The other way is to store data
about each block, or each range of blocks, or all the blocks that hash
onto a certain slot; for each, store the newest LSN that has modified
that block, or a block in that range, or a block that hashes onto
that slot. In this system, storage is roughly proportional to the
size of the database cluster, except maybe in the hashing case, but I
*think* that case will degrade unless you basically expand the map to
be roughly proportional to the size of the cluster anyway. I may be
wrong.

Of these two variants, I am inclined to prefer the version where each
file is a summary of the block references within some range of LSNs.
It seems simpler to implement to me. You just read a bunch of WAL
files and then when you get tired you stop and emit an output file.
You need to protect yourself against untimely crashes. One way is to
stick a checksum into the output file. After you finish writing it,
fsync() it before you start writing the next one. After a restart,
read the latest such file and see if the checksum is OK. If not,
regenerate it; if so, assume it's good and move on. Files other than
the last one can be assumed good. Another way is to create the file
with a temporary name, fsync() it, and then rename it into place and
fsync() again. The background worker that generates the files can
have a GUC to remove them when they are older than some threshold
amount of time, or you can keep them forever and let the user manually
remove stuff they no longer want based on LSN. That's pretty much it.

+1 for the first option. Seems simpler and straightforward.

#6Ashwin Agrawal
aagrawal@pivotal.io
In reply to: Robert Haas (#4)
Re: finding changed blocks using WAL scanning

On Thu, Apr 11, 2019 at 6:27 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

I had in mind that you could have different overlapping incremental
backup jobs in existence at the same time. Maybe a daily one to a
nearby disk and a weekly one to a faraway cloud. Each one of these
would need a separate replication slot, so that the information that is
required for *that* incremental backup series is preserved between runs.
So just one reserved replication slot that feeds the block summaries
wouldn't work. Perhaps what would work is a flag on the replication
slot itself "keep block summaries for this slot". Then when all the
slots with the block summary flag are past an LSN, you can clean up the
summaries before that LSN.

I don't think that quite works. There are two different LSNs. One is
the LSN of the oldest WAL archive that we need to keep around so that
it can be summarized, and the other is the LSN of the oldest summary
we need to keep around so it can be used for incremental backup
purposes. You can't keep both of those LSNs in the same slot.
Furthermore, the LSN stored in the slot is defined as the amount of
WAL we need to keep, not the amount of something else (summaries) that
we need to keep. Reusing that same field to mean something different
sounds inadvisable.

In other words, I think there are two problems which we need to
clearly separate: one is retaining WAL so we can generate summaries,
and the other is retaining summaries so we can generate incremental
backups. Even if we solve the second problem by using some kind of
replication slot, we still need to solve the first problem somehow.

Just a thought for the first problem (it may not be simpler): could the
replication slot be enhanced to define X amount of WAL to retain, and
once that limit is reached, collect the summary and let the WAL be deleted?

#7Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ashwin Agrawal (#6)
Re: finding changed blocks using WAL scanning

At Thu, 11 Apr 2019 10:00:35 -0700, Ashwin Agrawal <aagrawal@pivotal.io> wrote in <CALfoeis0qOyGk+KQ3AbkfRVv=XbsSecqHfKSag=i_SLWMT+B0A@mail.gmail.com>

On Thu, Apr 11, 2019 at 6:27 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Apr 11, 2019 at 3:52 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

I had in mind that you could have different overlapping incremental
backup jobs in existence at the same time. Maybe a daily one to a
nearby disk and a weekly one to a faraway cloud. Each one of these
would need a separate replication slot, so that the information that is
required for *that* incremental backup series is preserved between runs.
So just one reserved replication slot that feeds the block summaries
wouldn't work. Perhaps what would work is a flag on the replication
slot itself "keep block summaries for this slot". Then when all the
slots with the block summary flag are past an LSN, you can clean up the
summaries before that LSN.

I don't think that quite works. There are two different LSNs. One is
the LSN of the oldest WAL archive that we need to keep around so that
it can be summarized, and the other is the LSN of the oldest summary
we need to keep around so it can be used for incremental backup
purposes. You can't keep both of those LSNs in the same slot.
Furthermore, the LSN stored in the slot is defined as the amount of
WAL we need to keep, not the amount of something else (summaries) that
we need to keep. Reusing that same field to mean something different
sounds inadvisable.

In other words, I think there are two problems which we need to
clearly separate: one is retaining WAL so we can generate summaries,
and the other is retaining summaries so we can generate incremental
backups. Even if we solve the second problem by using some kind of
replication slot, we still need to solve the first problem somehow.

Just a thought for the first problem (it may not be simpler): could the
replication slot be enhanced to define X amount of WAL to retain, and
once that limit is reached, collect the summary and let the WAL be deleted?

I think Peter is saying that a slot for block summaries doesn't keep
the WAL segments themselves, but instead keeps segmented block
summaries. n block-summary slots maintain n block summaries, and the
newest block summary is "active", in other words, continuously updated
by the WAL records passing by. When a backup tool requests the block
summary for, say, the oldest slot, the active summary is closed and a
new summary is opened from the LSN at that time, which becomes the new
LSN of the slot. Then the concatenated block summary is sent. Finally
the oldest summary is removed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#8Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#2)
Re: finding changed blocks using WAL scanning

On Wed, Apr 10, 2019 at 08:11:11PM -0400, Robert Haas wrote:

On Wed, Apr 10, 2019 at 5:49 PM Robert Haas <robertmhaas@gmail.com> wrote:

There is one thing that does worry me about the file-per-LSN-range
approach, and that is memory consumption when trying to consume the
information. Suppose you have a really high velocity system. I don't
know exactly what the busiest systems around are doing in terms of
data churn these days, but let's say just for kicks that we are
dirtying 100GB/hour. That means, roughly 12.5 million block
references per hour. If each block reference takes 12 bytes, that's
maybe 150MB/hour in block reference files. If you run a daily
incremental backup, you've got to load all the block references for
the last 24 hours and deduplicate them, which means you're going to
need about 3.6GB of memory. If you run a weekly incremental backup,
you're going to need about 25GB of memory. That is not ideal. One
can keep the memory consumption to a more reasonable level by using
temporary files. For instance, say you realize you're going to need
25GB of memory to store all the block references you have, but you
only have 1GB of memory that you're allowed to use. Well, just
hash-partition the data 32 ways by dboid/tsoid/relfilenode/segno,
writing each batch to a separate temporary file, and then process each
of those 32 files separately. That does add some additional I/O, but
it's not crazily complicated and doesn't seem too terrible, at least
to me. Still, it's something not to like.

Oh, I'm being dumb. We should just have the process that writes out
these files sort the records first. Then when we read them back in to
use them, we can just do a merge pass like MergeAppend would do. Then
you never need very much memory at all.

Can I throw out a simple idea? What if, when we finish writing a WAL
file, we create a new file 000000010000000000000001.modblock which
lists all the heap/index files and block numbers modified in that WAL
file? How much does that help with the list I posted earlier?

I think there is some interesting complexity brought up in this thread.
Which options are going to minimize storage I/O and network I/O, have
only background overhead, allow parallel operation, and integrate with
pg_basebackup? Eventually we will need to evaluate the incremental
backup options against these criteria.

I am thinking tools could retain modblock files along with the WAL,
and could pull full-page writes either from the WAL or from PGDATA.
This avoids the need to scan 16MB WAL files, and the WAL files and
modblock files could be expired independently.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#9Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#8)
Re: finding changed blocks using WAL scanning

On Mon, Apr 15, 2019 at 4:31 PM Bruce Momjian <bruce@momjian.us> wrote:

Can I throw out a simple idea? What if, when we finish writing a WAL
file, we create a new file 000000010000000000000001.modblock which
lists all the heap/index files and block numbers modified in that WAL
file? How much does that help with the list I posted earlier?

I think there is some interesting complexity brought up in this thread.
Which options are going to minimize storage I/O, network I/O, have only
background overhead, allow parallel operation, integrate with
pg_basebackup. Eventually we will need to evaluate the incremental
backup options against these criteria.

I am thinking tools could retain modblock files along with WAL, could
pull full-page-writes from WAL, or from PGDATA. It avoids the need to
scan 16MB WAL files, and the WAL files and modblock files could be
expired independently.

That is pretty much exactly what I was intending to propose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#9)
Re: finding changed blocks using WAL scanning

On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:

On Mon, Apr 15, 2019 at 4:31 PM Bruce Momjian <bruce@momjian.us> wrote:

Can I throw out a simple idea? What if, when we finish writing a WAL
file, we create a new file 000000010000000000000001.modblock which
lists all the heap/index files and block numbers modified in that WAL
file? How much does that help with the list I posted earlier?

I think there is some interesting complexity brought up in this thread.
Which options are going to minimize storage I/O, network I/O, have only
background overhead, allow parallel operation, integrate with
pg_basebackup. Eventually we will need to evaluate the incremental
backup options against these criteria.

I am thinking tools could retain modblock files along with WAL, could
pull full-page-writes from WAL, or from PGDATA. It avoids the need to
scan 16MB WAL files, and the WAL files and modblock files could be
expired independently.

That is pretty much exactly what I was intending to propose.

OK, good. Some of your wording was vague so I was unclear exactly what
you were suggesting.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#11Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#9)
Re: finding changed blocks using WAL scanning

On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:

That is pretty much exactly what I was intending to propose.

Any caller of XLogWrite() could switch to a new segment once the
current one is done, and I am not sure that we would want some random
backend to potentially slow down to do that kind of operation.

Or would a separate background worker do this work by itself? An
external tool can do that easily already:
https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks
--
Michael

#12Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#10)
Re: finding changed blocks using WAL scanning

On Mon, Apr 15, 2019 at 10:22 PM Bruce Momjian <bruce@momjian.us> wrote:

I am thinking tools could retain modblock files along with WAL, could
pull full-page-writes from WAL, or from PGDATA. It avoids the need to
scan 16MB WAL files, and the WAL files and modblock files could be
expired independently.

That is pretty much exactly what I was intending to propose.

OK, good. Some of your wording was vague so I was unclear exactly what
you were suggesting.

Well, I guess the part that isn't like what I was suggesting is the
idea that there should be exactly one modified block file per segment.
The biggest problem with that idea is that a single WAL record can be
split across two segments (or, in pathological cases, perhaps more).
I think it makes sense to talk about the blocks modified by WAL
between LSN A and LSN B, but it doesn't make much sense to talk about
the block modified by the WAL in segment XYZ.

You can make it kinda make sense by saying "the blocks modified by
records *beginning in* segment XYZ" or alternatively "the blocks
modified by records *ending in* segment XYZ", but that seems confusing
to me. For example, suppose you decide on the first one --
000000010000000100000068.modblock will contain all blocks modified by
records that begin in 000000010000000100000068. Well, that means that
to generate the 000000010000000100000068.modblock, you will need
access to 000000010000000100000068 AND probably also
000000010000000100000069 and in rare cases perhaps
00000001000000010000006A or even later files. I think that's actually
pretty confusing.

It seems better to me to give the files names like
${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
00000001.0000000168000058.00000001687DBBB8.modblock, so that you can
see exactly which *records* are covered by that segment.
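The proposed naming convention is easy to construct and parse mechanically. A sketch, inferring the field widths (8 hex digits for the TLI, 16 for each LSN) from the example name above:

```python
def modblock_filename(tli, start_lsn, end_lsn):
    """Build a ${TLI}.${STARTLSN}.${ENDLSN}.modblock file name,
    e.g. 00000001.0000000168000058.00000001687DBBB8.modblock."""
    return "%08X.%016X.%016X.modblock" % (tli, start_lsn, end_lsn)

def parse_modblock_filename(name):
    """Recover (tli, start_lsn, end_lsn) from a modblock file name."""
    tli, start, end, suffix = name.split(".")
    assert suffix == "modblock"
    return int(tli, 16), int(start, 16), int(end, 16)
```

Because the start and end LSNs are fixed-width hex, a plain lexicographic sort of the file names within one timeline also sorts them by LSN range.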

And I suspect it may also be a good idea to bunch up the records from
several WAL files. Especially if you are using 16MB WAL files,
collecting all of the block references from a single WAL file is going
to produce a very small file. I suspect that the modified block files
will end up being 100x smaller than the WAL itself, perhaps more, and
I don't think anybody will appreciate us adding another PostgreSQL
subsystem that spews out huge numbers of tiny little files. If, for
example, somebody's got a big cluster that is churning out a WAL
segment every second, they would probably still be happy to have a new
modified block file only, say, every 10 seconds.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#11)
Re: finding changed blocks using WAL scanning

On Mon, Apr 15, 2019 at 11:45 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Apr 15, 2019 at 09:04:13PM -0400, Robert Haas wrote:

That is pretty much exactly what I was intending to propose.

Any caller of XLogWrite() could switch to a new segment once the
current one is done, and I am not sure that we would want some random
backend to potentially slow down to do that kind of operation.

Or would a separate background worker do this work by itself? An
external tool can do that easily already:
https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks

I was thinking that a dedicated background worker would be a good
option, but Stephen Frost seems concerned (over on the other thread)
about how much load that would generate. That never really occurred
to me as a serious issue and I suspect for many people it wouldn't be,
but there might be some.

It's cool that you have a command-line tool that does this as well.
Over there, it was also discussed that we might want to have both a
command-line tool and a background worker. I think, though, that we
would want to get the output in some kind of compressed binary format,
rather than text. e.g.

4-byte database OID
4-byte tablespace OID
any number of relation OID/block OID pairings for that
database/tablespace combination
4-byte zero to mark the end of the relation OID/block OID list
and then repeat all of the above any number of times

That might be too dumb and I suspect we want some headers and a
checksum, but we should try to somehow exploit the fact that there
aren't likely to be many distinct databases or many distinct
tablespaces mentioned -- whereas relation OID and block number will
probably have a lot more entropy.
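As a rough sketch of that layout (illustrative only; the struct and function names are invented, and a real format would add the file header and checksum mentioned above):

```c
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

typedef struct
{
    Oid         relid;
    BlockNumber blkno;
} BlockRef;

/*
 * Write one group: the 4-byte database and tablespace OIDs, then the
 * relation OID/block number pairs, then a 4-byte zero sentinel.  On
 * read, a relid of 0 (InvalidOid) marks the end of the group.
 */
static void
write_group(FILE *fp, Oid dboid, Oid tsoid,
            const BlockRef *refs, size_t nrefs)
{
    Oid         zero = 0;

    fwrite(&dboid, sizeof(Oid), 1, fp);
    fwrite(&tsoid, sizeof(Oid), 1, fp);
    fwrite(refs, sizeof(BlockRef), nrefs, fp);
    fwrite(&zero, sizeof(Oid), 1, fp);
}
```

Because each group repeats only the two header OIDs once, the low-entropy database/tablespace values are amortized across all the block references in the group.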

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#12)
Re: finding changed blocks using WAL scanning

On Thu, Apr 18, 2019 at 03:43:30PM -0400, Robert Haas wrote:

You can make it kinda make sense by saying "the blocks modified by
records *beginning in* segment XYZ" or alternatively "the blocks
modified by records *ending in* segment XYZ", but that seems confusing
to me. For example, suppose you decide on the first one --
000000010000000100000068.modblock will contain all blocks modified by
records that begin in 000000010000000100000068. Well, that means that
to generate the 000000010000000100000068.modblock, you will need
access to 000000010000000100000068 AND probably also
000000010000000100000069 and in rare cases perhaps
00000001000000010000006A or even later files. I think that's actually
pretty confusing.

It seems better to me to give the files names like
${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g.
00000001.0000000168000058.00000001687DBBB8.modblock, so that you can
see exactly which *records* are covered by that segment.

How would you choose the STARTLSN/ENDLSN? If you could do it per
checkpoint, rather than per-WAL, I think that would be great.

And I suspect it may also be a good idea to bunch up the records from
several WAL files. Especially if you are using 16MB WAL files,
collecting all of the block references from a single WAL file is going
to produce a very small file. I suspect that the modified block files
will end up being 100x smaller than the WAL itself, perhaps more, and
I don't think anybody will appreciate us adding another PostgreSQL
system that spews out huge numbers of tiny little files. If, for
example, somebody's got a big cluster that is churning out a WAL
segment every second, they would probably still be happy to have a new
modified block file only, say, every 10 seconds.

Agreed.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#15Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#14)
Re: finding changed blocks using WAL scanning

On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian <bruce@momjian.us> wrote:

How would you choose the STARTLSN/ENDLSN? If you could do it per
checkpoint, rather than per-WAL, I think that would be great.

I thought of that too. It seems appealing, because you probably only
really care whether a particular block was modified between one
checkpoint and the next, not exactly when during that interval it was
modified. However, the simple algorithm of "just stop when you get to
a checkpoint record" does not work, because the checkpoint record
itself points back to a much earlier LSN, and I think that it's that
earlier LSN that is interesting. So if you want to make this work you
have to be more clever, and I'm not sure I'm clever enough.

I think it's important that a .modblock file not be too large, because
then it will use too much memory, and that it not cover too much WAL,
because then it will be too imprecise about when the blocks were
modified. Perhaps we should have a threshold for each -- e.g. emit
the next .modblock file after finding 2^20 distinct block references
or scanning 1GB of WAL. Then individual files would probably be in
the single-digit numbers of megabytes in size, assuming we do a decent
job with the compression, and you never need to scan more than 1GB of
WAL to regenerate one. If the starting point for a backup falls in
the middle of such a file, and you include the whole file, at worst
you have ~8GB of extra blocks to read, but in most cases less, because
your writes probably have some locality and the file may not actually
contain the full 2^20 block references. You could also make it more
fine-grained than that if you don't mind having more smaller files
floating around.
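The dual-threshold policy could be as simple as the following sketch (the constants and function name are illustrative, not a proposal for actual settings):

```c
#include <stdint.h>

#define MODBLOCK_MAX_REFS   (1U << 20)            /* 2^20 distinct blocks */
#define MODBLOCK_MAX_WAL    (1024ULL * 1024 * 1024) /* 1GB of WAL scanned */

/*
 * Emit the next .modblock file as soon as either threshold is reached,
 * whichever comes first.
 */
static int
time_to_emit_modblock(uint32_t distinct_refs, uint64_t wal_bytes_scanned)
{
    return distinct_refs >= MODBLOCK_MAX_REFS ||
           wal_bytes_scanned >= MODBLOCK_MAX_WAL;
}
```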

It would definitely be better if we could set things up so that we
could always switch to the next .modblock file when we cross a
potential redo start point, but they're not noted in the WAL so I
don't see how to do that. I don't know if it would be possible to
insert some new kind of log record concurrently with fixing the redo
location, so that redo always started at a record of this new type.
That would certainly be helpful for this kind of thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#15)
Re: finding changed blocks using WAL scanning

On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote:

On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian <bruce@momjian.us> wrote:

How would you choose the STARTLSN/ENDLSN? If you could do it per
checkpoint, rather than per-WAL, I think that would be great.

I thought of that too. It seems appealing, because you probably only
really care whether a particular block was modified between one
checkpoint and the next, not exactly when during that interval it was
modified. However, the simple algorithm of "just stop when you get to
a checkpoint record" does not work, because the checkpoint record
itself points back to a much earlier LSN, and I think that it's that
earlier LSN that is interesting. So if you want to make this work you
have to be more clever, and I'm not sure I'm clever enough.

OK, so let's back up and study how this will be used. Someone wanting
to make a useful incremental backup will need the changed blocks from
the time of the start of the base backup. It is fine if they
incrementally back up some blocks modified _before_ the base backup, but
they need all blocks after, until some marker. They will obviously
still use WAL to recover to a point after the incremental backup, so
there is no need to get every modified block up to current, just up to
some cut-off point where WAL can be discarded.

I can see a 1GB marker being used for that. It would prevent an
incremental backup from being done until the first 1G modblock file was
written, since until then there is no record of modified blocks, but
that seems fine. A 1G marker would allow for consistent behavior
independent of server restarts and base backups.

How would the modblock file record all the modified blocks across
restarts and crashes? I assume that 1G of WAL would not be available
for scanning. I suppose that writing a modblock file to some PGDATA
location when WAL is removed would work since during a crash the
modblock file could be updated with the contents of the existing pg_wal
files.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
#17Andres Freund
andres@anarazel.de
In reply to: Bruce Momjian (#16)
Re: finding changed blocks using WAL scanning

On 2019-04-18 17:47:56 -0400, Bruce Momjian wrote:

I can see a 1GB marker being used for that. It would prevent an
incremental backup from being done until the first 1G modblock file was
written, since until then there is no record of modified blocks, but
that seems fine. A 1G marker would allow for consistent behavior
independent of server restarts and base backups.

That doesn't seem like a good idea - it'd make writing regression tests
for this impractical.

#18Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#13)
Re: finding changed blocks using WAL scanning

On Thu, Apr 18, 2019 at 03:51:14PM -0400, Robert Haas wrote:

I was thinking that a dedicated background worker would be a good
option, but Stephen Frost seems concerned (over on the other thread)
about how much load that would generate. That never really occurred
to me as a serious issue and I suspect for many people it wouldn't be,
but there might be some.

WAL segment size can go up to 1GB, and that is no longer fixed at
compile time. So scanning a very large segment is not going to be
free. I think that Stephen's performance concerns are legitimate, as
the WAL partitions currently see only sequential read and write
patterns.

It's cool that you have a command-line tool that does this as well.
Over there, it was also discussed that we might want to have both a
command-line tool and a background worker. I think, though, that we
would want to get the output in some kind of compressed binary format,
rather than text. e.g.

Please feel free to reuse it, tweaked however you like, for any
patch submitted to upstream. Reshaping or redirecting the
data is not a big issue once the basics with the WAL reader are in
place.
--
Michael

#19Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#13)
Re: finding changed blocks using WAL scanning

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Mon, Apr 15, 2019 at 11:45 PM Michael Paquier <michael@paquier.xyz> wrote:

Any caller of XLogWrite() could switch to a new segment once the
current one is done, and I am not sure that we would want some random
backend to potentially slow down to do that kind of operation.

Or would a separate background worker do this work by itself? An
external tool can do that easily already:
https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks

I was thinking that a dedicated background worker would be a good
option, but Stephen Frost seems concerned (over on the other thread)
about how much load that would generate. That never really occurred
to me as a serious issue and I suspect for many people it wouldn't be,
but there might be some.

While I do think we should at least be thinking about the load caused
from scanning the WAL to generate a list of blocks that are changed, the
load I was more concerned with in the other thread is the effort
required to actually merge all of those changes together over a large
amount of WAL. I'm also not saying that we couldn't have either of
those pieces done as a background worker, just that it'd be really nice
to have an external tool (or library) that can be used on an independent
system to do that work.

It's cool that you have a command-line tool that does this as well.
Over there, it was also discussed that we might want to have both a
command-line tool and a background worker. I think, though, that we
would want to get the output in some kind of compressed binary format,
rather than text. e.g.

4-byte database OID
4-byte tablespace OID
any number of relation OID/block OID pairings for that
database/tablespace combination
4-byte zero to mark the end of the relation OID/block OID list
and then repeat all of the above any number of times

I agree that we'd like to get the data in a binary format of some kind.

That might be too dumb and I suspect we want some headers and a
checksum, but we should try to somehow exploit the fact that there
aren't likely to be many distinct databases or many distinct
tablespaces mentioned -- whereas relation OID and block number will
probably have a lot more entropy.

I'm not remembering exactly where this idea came from, but I don't
believe it's my own (and I think there's some tool which already does
this.. maybe it's rsync?), but I certainly don't think we want to
repeat the relation OID for every block, and I don't think we really
want to store a block number for every block. Instead, something like:

4-byte database OID
4-byte tablespace OID
relation OID

starting-ending block numbers
bitmap covering range of blocks
starting-ending block numbers
bitmap covering range of blocks
4-byte zero to mark the end of the relation
...
4-byte database OID
4-byte tablespace OID
relation OID

starting-ending block numbers
bitmap covering range of blocks
4-byte zero to mark the end of the relation
...

Only for relations which actually have changes though, of course.

Haven't implemented it, so it's entirely possible there's reasons why it
wouldn't work, but I do like the bitmap idea. I definitely think we
need a checksum, as you mentioned.
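A minimal sketch of the range-plus-bitmap idea (hypothetical names; a real encoder would also decide where to break a run when the gaps between modified blocks get large):

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t BlockNumber;

typedef struct
{
    BlockNumber start;
    BlockNumber end;        /* inclusive */
    uint8_t    *bitmap;     /* one bit per block in [start, end] */
} BlockRun;

/* Build one run covering blocks[0..n-1], which must be sorted. */
static BlockRun
build_run(const BlockNumber *blocks, size_t n)
{
    BlockRun    run;
    size_t      nbits, i;

    run.start = blocks[0];
    run.end = blocks[n - 1];
    nbits = run.end - run.start + 1;
    run.bitmap = calloc((nbits + 7) / 8, 1);
    for (i = 0; i < n; i++)
    {
        size_t  bit = blocks[i] - run.start;

        run.bitmap[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }
    return run;
}

/* Was this block modified, according to the run? */
static int
block_in_run(const BlockRun *run, BlockNumber blkno)
{
    size_t  bit;

    if (blkno < run->start || blkno > run->end)
        return 0;
    bit = blkno - run->start;
    return (run->bitmap[bit / 8] >> (bit % 8)) & 1;
}
```

Stored this way, a dense cluster of writes costs one bit per block in the range instead of a 4-byte block number per block.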

Thanks!

Stephen

#20Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#16)
Re: finding changed blocks using WAL scanning

On Thu, Apr 18, 2019 at 5:47 PM Bruce Momjian <bruce@momjian.us> wrote:

How would the modblock file record all the modified blocks across
restarts and crashes? I assume that 1G of WAL would not be available
for scanning. I suppose that writing a modblock file to some PGDATA
location when WAL is removed would work since during a crash the
modblock file could be updated with the contents of the existing pg_wal
files.

I think you've got to prevent the WAL from being removed until a
.modblock file has been written. In more detail, you should (a) scan
all the WAL segments that will be summarized in the .modblock file,
(b) write the file under a temporary name, (c) fsync it, (d) rename it
into place, (e) fsync it again, and (f) then allow those WAL segments
to be removed, if they are otherwise eligible to be removed.
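Steps (b) through (e) can be sketched with plain POSIX calls (illustrative only; PostgreSQL itself would go through its own fd and fsync wrappers, and the error handling here is minimal). The final fsync is on the containing directory, so that the rename itself is durable:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
durable_write_modblock(const char *dir, const char *tmpname,
                       const char *finalname,
                       const void *buf, size_t len)
{
    int         fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len || /* (b) write under temp name */
        fsync(fd) != 0)                         /* (c) make contents durable */
    {
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(tmpname, finalname) != 0)        /* (d) rename into place */
        return -1;

    fd = open(dir, O_RDONLY);                   /* (e) fsync the directory */
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;   /* (f) the summarized WAL segments may now be removed */
}
```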

If 1GB of WAL is too much to keep around (which I doubt, except on
systems that are so small and low-activity that they don't need
incremental backups anyway), then you'd have to scan less WAL at once
and write smaller .modblock files.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
