.ready and .done files considered harmful

Started by Robert Haas · almost 5 years ago · 118 messages · pgsql-hackers
#1 Robert Haas
robertmhaas@gmail.com

I and various colleagues of mine have from time to time encountered
systems that got a bit behind on WAL archiving, because the
archive_command started failing and nobody noticed right away.
Ideally, people should have monitoring for this and put it to rights
immediately, but some people don't. If those people happen to have a
relatively small pg_wal partition, they will likely become aware of
the issue when it fills up and takes down the server, but some users
provision disk space pretty generously and therefore nothing compels
them to notice the issue until they fill it up. In at least one case,
on a system that was actually generating a reasonable amount of WAL,
this took in excess of six months.

As you might imagine, pg_wal can get fairly large in such scenarios,
but the user is generally less concerned with solving that problem
than they are with getting the system back up. It is doubtless true
that the user would prefer to shrink the disk usage down to something
more reasonable over time, but on the facts as presented, it can't
really be an urgent issue for them. What they really need is just to
free up a little disk space somehow or other and then get archiving running
fast enough to keep up with future WAL generation. Regrettably, the
archiver cannot do this, not even if you set archive_command =
/bin/true, because the archiver will barely ever actually run the
archive_command. Instead, it will spend virtually all of its time
calling readdir(), because for some reason it feels a need to make a
complete scan of the archive_status directory before archiving a WAL
file, and then it has to make another scan before archiving the next
one.

Someone - and it's probably for the best that the identity of that
person remains unknown to me - came up with a clever solution to this
problem, which is now used almost as a matter of routine whenever this
comes up. You just run pg_archivecleanup on your pg_wal directory, and
then remove all the corresponding .ready files and call it a day. I
haven't scrutinized the code for pg_archivecleanup, but evidently it
avoids needing O(n^2) time for this and therefore can clean up the
whole directory in something like the amount of time the archiver
would take to deal with a single file. While this seems to be quite an
effective procedure and I have not yet heard any user complaints, it
seems disturbingly error-prone, and honestly shouldn't ever be
necessary. The issue here is only that pgarch.c acts as though after
archiving 000000010000000000000001, 000000010000000000000002, and then
000000010000000000000003, we have no idea what file we might need to
archive next. Could it, perhaps, be 000000010000000000000004? Only a
full directory scan will tell us the answer!

I have two possible ideas for addressing this; perhaps other people
will have further suggestions. A relatively non-invasive fix would be
to teach pgarch.c how to increment a WAL file name. After archiving
segment N, check using stat() whether there's a .ready file for
segment N+1. If so, do that one next. If not, then fall back to
performing a full directory scan. As far as I can see, this is just
cheap insurance. If archiving is keeping up, the extra stat() won't
matter much. If it's not, this will save more system calls than it
costs. Since during normal operation it shouldn't really be possible
for files to show up in pg_wal out of order, I don't really see a
scenario where this changes the behavior, either. If there are gaps in
the sequence at startup time, this will cope with it exactly the same
as we do now, except with a better chance of finishing before I
retire.

However, that's still pretty wasteful. Every time we have to wait for
the next file to be ready for archiving, we'll basically fall back to
repeatedly scanning the whole directory, waiting for it to show up.
And I think that we can't get around that by just using stat() to look
for the appearance of the file we expect to see, because it's possible
that we might be doing all of this on a standby which then gets
promoted, or some upstream primary gets promoted, and WAL files start
appearing on a different timeline, making our prediction of what the
next filename will be incorrect. But perhaps we could work around this
by allowing pgarch.c to access shared memory, in which case it could
examine the current timeline whenever it wants, and probably also
whatever LSNs it needs to know what's safe to archive. If we did that,
could we just get rid of the .ready and .done files altogether? Are
they just a really expensive IPC mechanism to avoid a shared memory
connection, or is there some more fundamental reason why we need them?
And is there any good reason why the archiver shouldn't be connected
to shared memory? It is certainly nice to avoid having more processes
connected to shared memory than necessary, but the current scheme is
so inefficient that I think we end up worse off.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

#2 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#1)
Re: .ready and .done files considered harmful

Hi,

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:

I have two possible ideas for addressing this; perhaps other people
will have further suggestions. A relatively non-invasive fix would be
to teach pgarch.c how to increment a WAL file name. After archiving
segment N, check using stat() whether there's a .ready file for
segment N+1. If so, do that one next. If not, then fall back to
performing a full directory scan.

Hm. I wonder if it'd not be better to determine multiple files to be
archived in one readdir() pass?

As far as I can see, this is just cheap insurance. If archiving is
keeping up, the extra stat() won't matter much. If it's not, this will
save more system calls than it costs. Since during normal operation it
shouldn't really be possible for files to show up in pg_wal out of
order, I don't really see a scenario where this changes the behavior,
either. If there are gaps in the sequence at startup time, this will
cope with it exactly the same as we do now, except with a better
chance of finishing before I retire.

There's definitely gaps in practice :(. Due to the massive performance
issues with archiving there are several tools that archive multiple
files as part of one archive command invocation (and mark the additional
archived files as .done immediately).

However, that's still pretty wasteful. Every time we have to wait for
the next file to be ready for archiving, we'll basically fall back to
repeatedly scanning the whole directory, waiting for it to show up.

Hm. That seems like it's only an issue because .done and .ready are in
the same directory? Otherwise the directory would be empty while we're
waiting for the next file to be ready to be archived. I hate that that's
a thing but given the serial nature of archiving, with high per-call
overhead, I don't think it'd be ok to just break that without a
replacement :(.

But perhaps we could work around this by allowing pgarch.c to access
shared memory, in which case it could examine the current timeline
whenever it wants, and probably also whatever LSNs it needs to know
what's safe to archive.

FWIW, the shared memory stats patch implies doing that, since the
archiver reports stats.

If we did that, could we just get rid of the .ready and .done files
altogether? Are they just a really expensive IPC mechanism to avoid a
shared memory connection, or is there some more fundamental reason why
we need them?

What kind of shared memory mechanism are you thinking of? Due to
timelines and history files I don't think simple position counters would
be quite enough.

I think the aforementioned "batching" archive commands are part of the
problem :(.

And is there any good reason why the archiver shouldn't be connected
to shared memory? It is certainly nice to avoid having more processes
connected to shared memory than necessary, but the current scheme is
so inefficient that I think we end up worse off.

I think there is no fundamental reason for avoiding shared memory in the
archiver. I guess there's a minor robustness advantage, because the
forked shell to start the archive command won't be attached to shared
memory. But that's only until the child exec()s to the archive command.

There is some minor performance advantage as well, not having to process
the often large and contended memory mapping for shared_buffers is
probably measurable - but swamped by the cost of needing to actually
archive the segment.

My only "concern" with doing anything around this is that I think the
whole approach of archive_command is just hopelessly broken, with even
just halfway busy servers only able to keep up archiving if they muck
around with postgres internal data during archive command execution. Add
to that how hard it is to write a robust archive command (e.g. the one
in our docs still suggests test ! -f && cp, which means that copy
failing in the middle yields an incomplete archive)...

While I don't think it's all that hard to design a replacement, it's
however likely still more work than addressing the O(n^2) issue, so ...

Greetings,

Andres Freund

#3 Andrey Borodin
amborodin@acm.org
In reply to: Andres Freund (#2)
Re: .ready and .done files considered harmful

On 4 May 2021, at 09:27, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:

I have two possible ideas for addressing this; perhaps other people
will have further suggestions. A relatively non-invasive fix would be
to teach pgarch.c how to increment a WAL file name. After archiving
segment N, check using stat() whether there's a .ready file for
segment N+1. If so, do that one next. If not, then fall back to
performing a full directory scan.

Hm. I wonder if it'd not be better to determine multiple files to be
archived in one readdir() pass?

FWIW we use both methods [0]. WAL-G has a pipe with WAL-push candidates.
We add some predictions there, and if that does not fill the upload concurrency, we list the archive_status contents (concurrently with background uploads).

As far as I can see, this is just cheap insurance. If archiving is
keeping up, the extra stat() won't matter much. If it's not, this will
save more system calls than it costs. Since during normal operation it
shouldn't really be possible for files to show up in pg_wal out of
order, I don't really see a scenario where this changes the behavior,
either. If there are gaps in the sequence at startup time, this will
cope with it exactly the same as we do now, except with a better
chance of finishing before I retire.

There's definitely gaps in practice :(. Due to the massive performance
issues with archiving there are several tools that archive multiple
files as part of one archive command invocation (and mark the additional
archived files as .done immediately).

Interestingly, we used to rename .ready->.done some years ago. But pgBackRest developers convinced me that it's not a good idea to mess with the data dir [1]. Then pg_probackup developers convinced me that renaming .ready->.done on our own scales better and implemented this functionality for us [2].

If we did that, could we just get rid of the .ready and .done files
altogether? Are they just a really expensive IPC mechanism to avoid a
shared memory connection, or is there some more fundamental reason why
we need them?

What kind of shared memory mechanism are you thinking of? Due to
timelines and history files I don't think simple position counters would
be quite enough.

I think the aforementioned "batching" archive commands are part of the
problem :(.

I'd be happy if we had a table with files that need to be archived, a table with registered archivers and a function to say "archiver number X has done its job on file Y". Archiver could listen to some archiver channel while sleeping or something like that.

Thanks!

Best regards, Andrey Borodin.

[0]: https://github.com/x4m/wal-g/blob/c8a785217fe1123197280fd24254e51492bf5a68/internal/bguploader.go#L119-L137
[1]: /messages/by-id/20180828200754.GI3326@tamriel.snowman.net
[2]: https://github.com/wal-g/wal-g/pull/950

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#2)
Re: .ready and .done files considered harmful

On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote:

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:

I have two possible ideas for addressing this; perhaps other people
will have further suggestions. A relatively non-invasive fix would be
to teach pgarch.c how to increment a WAL file name. After archiving
segment N, check using stat() whether there's a .ready file for
segment N+1. If so, do that one next. If not, then fall back to
performing a full directory scan.

Hm. I wonder if it'd not be better to determine multiple files to be
archived in one readdir() pass?

I think both methods have some merit. If we had a way to pass a range
of files to archive_command instead of just one, then your way is
distinctly better, and perhaps we should just go ahead and invent such
a thing. If not, your way doesn't entirely solve the O(n^2) problem,
since you have to choose some upper bound on the number of file names
you're willing to buffer in memory, but it may lower it enough that it
makes no practical difference. I am somewhat inclined to think that it
would be good to start with the method I'm proposing, since it is a
clear-cut improvement over what we have today and can be done with a
relatively limited amount of code change and no redesign, and then
perhaps do something more ambitious afterward.

There's definitely gaps in practice :(. Due to the massive performance
issues with archiving there are several tools that archive multiple
files as part of one archive command invocation (and mark the additional
archived files as .done immediately).

Good to know.

However, that's still pretty wasteful. Every time we have to wait for
the next file to be ready for archiving, we'll basically fall back to
repeatedly scanning the whole directory, waiting for it to show up.

Hm. That seems like it's only an issue because .done and .ready are in
the same directory? Otherwise the directory would be empty while we're
waiting for the next file to be ready to be archived.

I think that's right.

I hate that that's
a thing but given the serial nature of archiving, with high per-call
overhead, I don't think it'd be ok to just break that without a
replacement :(.

I don't know quite what you mean by this. Moving .done files to a
separate directory from .ready files could certainly be done and I
don't think it even would be that hard. It does seem like a bit of a
half measure though. If we're going to redesign this I think we ought
to be more ambitious than that.

But perhaps we could work around this by allowing pgarch.c to access
shared memory, in which case it could examine the current timeline
whenever it wants, and probably also whatever LSNs it needs to know
what's safe to archive.

FWIW, the shared memory stats patch implies doing that, since the
archiver reports stats.

Are you planning to commit that for v15? If so, will it be early in
the cycle, do you think?

What kind of shared memory mechanism are you thinking of? Due to
timelines and history files I don't think simple position counters would
be quite enough.

I was thinking of simple position counters, but we could do something
more sophisticated. I don't even care if we stick with .ready/.done
for low-frequency stuff like timeline and history files. But I think
we'd be better off avoiding it for WAL files, because there are just
too many of them, and it's too hard to create a system that actually
scales. Or else we need a way for a single .ready file to cover many
WAL files in need of being archived, rather than just one.

I think there is no fundamental reason for avoiding shared memory in the
archiver. I guess there's a minor robustness advantage, because the
forked shell to start the archive command won't be attached to shared
memory. But that's only until the child exec()s to the archive command.

That doesn't seem like a real issue because we're not running
user-defined code between fork() and exec().

There is some minor performance advantage as well, not having to process
the often large and contended memory mapping for shared_buffers is
probably measurable - but swamped by the cost of needing to actually
archive the segment.

Process it how?

Another option would be to have two processes. You could have one that
stayed connected to shared memory and another that JUST ran the
archive_command, and they could talk over a socket or something. But
that would add a bunch of extra complexity, so I don't want to do it
unless we actually need to do it.

My only "concern" with doing anything around this is that I think the
whole approach of archive_command is just hopelessly broken, with even
just halfway busy servers only able to keep up archiving if they muck
around with postgres internal data during archive command execution. Add
to that how hard it is to write a robust archive command (e.g. the one
in our docs still suggests test ! -f && cp, which means that copy
failing in the middle yields an incomplete archive)...

While I don't think it's all that hard to design a replacement, it's
however likely still more work than addressing the O(n^2) issue, so ...

I think it is probably a good idea to fix the O(n^2) issue first, and
then as a separate step try to redefine things so that a decent
archive command doesn't have to poke around as much at internal stuff.
Part of that should probably involve having a way to pass a range of
files to archive_command instead of a single file. I was also
wondering whether we should go further and allow for the archiving to
be performed by C code running inside the backend rather than shelling
out to an external command.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#4)
Re: .ready and .done files considered harmful

On Tue, May 4, 2021 at 7:38 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote:

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:

I have two possible ideas for addressing this; perhaps other people
will have further suggestions. A relatively non-invasive fix would be
to teach pgarch.c how to increment a WAL file name. After archiving
segment N, check using stat() whether there's a .ready file for
segment N+1. If so, do that one next. If not, then fall back to
performing a full directory scan.

Hm. I wonder if it'd not be better to determine multiple files to be
archived in one readdir() pass?

I think both methods have some merit. If we had a way to pass a range
of files to archive_command instead of just one, then your way is
distinctly better, and perhaps we should just go ahead and invent such
a thing. If not, your way doesn't entirely solve the O(n^2) problem,
since you have to choose some upper bound on the number of file names
you're willing to buffer in memory, but it may lower it enough that it
makes no practical difference. I am somewhat inclined to think that it
would be good to start with the method I'm proposing, since it is a
clear-cut improvement over what we have today and can be done with a
relatively limited amount of code change and no redesign, and then
perhaps do something more ambitious afterward.

I agree that if we continue to archive one file using the archive
command then Robert's solution of checking the existence of the next
WAL segment (N+1) has an advantage. But, currently, if you notice
pgarch_readyXlog always considers any history file as the oldest file,
but that will not be true if we try to predict the next WAL segment
name. For example, if we have archived 000000010000000000000004 then
next we will look for 000000010000000000000005 but after generating
segment 000000010000000000000005, if there is a timeline switch then
we will have the below files in the archive status
(000000010000000000000005.ready, 00000002.history file). Now, the
existing archiver will archive 00000002.history first whereas our code
will archive 000000010000000000000005 first. That said, I don't see
any problem with that because before archiving any segment file from
TL 2 we will definitely archive the 00000002.history file because we
will not find the 000000010000000000000006.ready and we will scan the
full directory and now we will find 00000002.history as oldest file.

However, that's still pretty wasteful. Every time we have to wait for
the next file to be ready for archiving, we'll basically fall back to
repeatedly scanning the whole directory, waiting for it to show up.

Is this true, that we only go scanning when we have to wait for the
next file to be ready? If I read the code in
"pgarch_ArchiverCopyLoop", for every single file to archive it calls
"pgarch_readyXlog", which scans the directory every time.
So I did not understand your point that it only needs to scan the full
directory when it has to wait for the next .ready file. It appears
it always scans the full directory after archiving each WAL segment.
What am I missing?

Hm. That seems like it's only an issue because .done and .ready are in
the same directory? Otherwise the directory would be empty while we're
waiting for the next file to be ready to be archived.

I think that's right.

If we agree with your above point that it only needs to scan the full
directory when it has to wait for the next file to be ready then
making a separate directory for the .done files could improve things a
lot, because the directory will be empty, so scanning will not be very costly.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Dilip Kumar (#5)
Re: .ready and .done files considered harmful

On Tue, May 4, 2021 at 11:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I agree that if we continue to archive one file using the archive
command then Robert's solution of checking the existence of the next
WAL segment (N+1) has an advantage. But, currently, if you notice
pgarch_readyXlog always considers any history file as the oldest file,
but that will not be true if we try to predict the next WAL segment
name. For example, if we have archived 000000010000000000000004 then
next we will look for 000000010000000000000005 but after generating
segment 000000010000000000000005, if there is a timeline switch then
we will have the below files in the archive status
(000000010000000000000005.ready, 00000002.history file). Now, the
existing archiver will archive 00000002.history first whereas our code
will archive 000000010000000000000005 first. That said, I don't see
any problem with that because before archiving any segment file from
TL 2 we will definitely archive the 00000002.history file because we
will not find the 000000010000000000000006.ready and we will scan the
full directory and now we will find 00000002.history as oldest file.

OK, that makes sense and is good to know.

However, that's still pretty wasteful. Every time we have to wait for
the next file to be ready for archiving, we'll basically fall back to
repeatedly scanning the whole directory, waiting for it to show up.

Is this true, that we only go scanning when we have to wait for the
next file to be ready? If I read the code in
"pgarch_ArchiverCopyLoop", for every single file to archive it calls
"pgarch_readyXlog", which scans the directory every time.
So I did not understand your point that it only needs to scan the full
directory when it has to wait for the next .ready file. It appears
it always scans the full directory after archiving each WAL segment.
What am I missing?

It's not true now, but my proposal would make it true.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#6)
Re: .ready and .done files considered harmful

On Tue, May 4, 2021 at 10:12 PM Robert Haas <robertmhaas@gmail.com> wrote:

Is this true, that we only go scanning when we have to wait for the
next file to be ready? If I read the code in
"pgarch_ArchiverCopyLoop", for every single file to archive it calls
"pgarch_readyXlog", which scans the directory every time.
So I did not understand your point that it only needs to scan the full
directory when it has to wait for the next .ready file. It appears
it always scans the full directory after archiving each WAL segment.
What am I missing?

It's not true now, but my proposal would make it true.

Okay, got it. Thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8 Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#6)
Re: .ready and .done files considered harmful

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Tue, May 4, 2021 at 11:54 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I agree that if we continue to archive one file using the archive
command then Robert's solution of checking the existence of the next
WAL segment (N+1) has an advantage. But, currently, if you notice
pgarch_readyXlog always considers any history file as the oldest file,
but that will not be true if we try to predict the next WAL segment
name. For example, if we have archived 000000010000000000000004 then
next we will look for 000000010000000000000005 but after generating
segment 000000010000000000000005, if there is a timeline switch then
we will have the below files in the archive status
(000000010000000000000005.ready, 00000002.history file). Now, the
existing archiver will archive 00000002.history first whereas our code
will archive 000000010000000000000005 first. That said, I don't see
any problem with that because before archiving any segment file from
TL 2 we will definitely archive the 00000002.history file because we
will not find the 000000010000000000000006.ready and we will scan the
full directory and now we will find 00000002.history as oldest file.

OK, that makes sense and is good to know.

I expect David will chime in on this thread too, but I did want to point
out that when it comes to archiving history files you'd *really* like
that to be done just about as quickly as absolutely possible, to avoid
the case that we saw before that code was added, to wit: two promotions
done too quickly that ended up with conflicting history and possibly
conflicting WAL files trying to be archived, and ensuing madness.

It's not just about making sure that we archive the history file for a
timeline before archiving WAL segments along that timeline but also
about making sure we get that history file into the archive as fast as
we can, and archiving a 16MB WAL first would certainly delay that.

Thanks,

Stephen

#9 Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#8)
Re: .ready and .done files considered harmful

On Wed, May 5, 2021 at 1:06 PM Stephen Frost <sfrost@snowman.net> wrote:

It's not just about making sure that we archive the history file for a
timeline before archiving WAL segments along that timeline but also
about making sure we get that history file into the archive as fast as
we can, and archiving a 16MB WAL first would certainly delay that.

Ooph. That's a rather tough constraint. Could we get around it by
introducing some kind of signalling mechanism, perhaps? Like if
there's a new history file, that must mean the server has switched
timelines -- I think, anyway -- so if we notified the archiver every
time there was a timeline switch it could react accordingly.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10 Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#9)
Re: .ready and .done files considered harmful

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Wed, May 5, 2021 at 1:06 PM Stephen Frost <sfrost@snowman.net> wrote:

It's not just about making sure that we archive the history file for a
timeline before archiving WAL segments along that timeline but also
about making sure we get that history file into the archive as fast as
we can, and archiving a 16MB WAL first would certainly delay that.

Ooph. That's a rather tough constraint. Could we get around it by
introducing some kind of signalling mechanism, perhaps? Like if
there's a new history file, that must mean the server has switched
timelines -- I think, anyway -- so if we notified the archiver every
time there was a timeline switch it could react accordingly.

I would think something like that would be alright and not worse than
what we've got now.

That said, in an ideal world, we'd have a way to get the new timeline to
switch to in a way that doesn't leave open race conditions, so as long
as we're talking about big changes to the way archiving and archive_command
work (or about throwing out the horrible idea that is archive_command in
the first place and replacing it with appropriate hooks such that
someone could install an extension which would handle archiving...), I
would hope we'd have a way of saying "please, atomically, go get me a new
timeline."

Just as a reminder for those following along at home, as I'm sure you're
already aware, the way we figure out what timeline to switch to when a
replica is getting promoted is that we go run the restore command asking
for history files until we get back "nope, there is no file named
0000123.history", and then we switch to that timeline and then try to
push such a history file into the repo and hope that it works.
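That probe loop amounts to something like the following sketch (restore_history() is a hypothetical stand-in for running the restore command; the real logic lives in the recovery code):

```c
#include <stdbool.h>

/* Hypothetical: run the restore command for "<TLI>.history"; returns
 * true if the archive had such a file. */
extern bool restore_history(unsigned int tli);

/* Pick the next timeline at promotion: probe successive history files
 * until one is missing, then claim that ID.  The race: two replicas
 * probing concurrently can both see "no such history file" for the
 * same ID and both claim the same timeline. */
static unsigned int
choose_new_timeline(unsigned int current_tli)
{
    unsigned int probe = current_tli + 1;

    while (restore_history(probe))
        probe++;
    /* caller then pushes a history file for "probe" into the archive
     * and hopes nobody else did the same concurrently */
    return probe;
}
```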

Thanks,

Stephen

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#10)
Re: .ready and .done files considered harmful

On Wed, May 5, 2021 at 4:13 PM Stephen Frost <sfrost@snowman.net> wrote:

I would think something like that would be alright and not worse than
what we've got now.

OK.

That said, in an ideal world, we'd have a way to get the new timeline to
switch to in a way that doesn't leave open race conditions, so as long
we're talking about big changes to the way archiving and archive_command
work (or about throwing out the horrible idea that is archive_command in
the first place and replacing it with appropriate hooks such that
someone could install an extension which would handle archiving...), I
would hope we'd have a way of saying "please, atomically, go get me a new
timeline."

Just as a reminder for those following along at home, as I'm sure you're
already aware, the way we figure out what timeline to switch to when a
replica is getting promoted is that we go run the restore command asking
for history files until we get back "nope, there is no file named
0000123.history", and then we switch to that timeline and then try to
push such a history file into the repo and hope that it works.

Huh, I had not thought about that problem. So, at the risk of getting
sidetracked, what exactly are you asking for here? Let the extension
pick the timeline using an algorithm of its own devising, rather than
having core do it? Or what?

--
Robert Haas
EDB: http://www.enterprisedb.com

#12 Andres Freund
andres@anarazel.de
In reply to: Stephen Frost (#10)
Re: .ready and .done files considered harmful

Hi,

On 2021-05-05 16:13:08 -0400, Stephen Frost wrote:

Just as a reminder for those following along at home, as I'm sure you're
already aware, the way we figure out what timeline to switch to when a
replica is getting promoted is that we go run the restore command asking
for history files until we get back "nope, there is no file named
0000123.history", and then we switch to that timeline and then try to
push such a history file into the repo and hope that it works.

Which is why the whole concept of timelines as we have them right now is
pretty much useless. It is fundamentally impossible to guarantee unique
timeline ids in all cases if they are assigned sequentially at timeline
creation - consider needing to promote a node on both ends of a split
network. I'm quite doubtful that pretending to tackle this problem via
archiving order is a good idea, given the fundamentally racy nature.

Greetings,

Andres Freund

#13Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#11)
Re: .ready and .done files considered harmful

Hi,

On 2021-05-05 16:22:21 -0400, Robert Haas wrote:

Huh, I had not thought about that problem. So, at the risk of getting
sidetracked, what exactly are you asking for here? Let the extension
pick the timeline using an algorithm of its own devising, rather than
having core do it? Or what?

Not Stephen, but to me the most reasonable way to address this is to
make timeline identifiers wider and randomly allocated. The
sequential-looking nature of timelines is, imo, actively unhelpful.

Greetings,

Andres Freund

#14Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#11)
Re: .ready and .done files considered harmful

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Wed, May 5, 2021 at 4:13 PM Stephen Frost <sfrost@snowman.net> wrote:

That said, in an ideal world, we'd have a way to get the new timeline to
switch to in a way that doesn't leave open race conditions, so long as
we're talking about big changes to the way archiving and archive_command
work (or about throwing out the horrible idea that is archive_command in
the first place and replacing it with appropriate hooks such that
someone could install an extension which would handle archiving...), I
would hope we'd have a way of saying "please, atomically, go get me a new
timeline."

Just as a reminder for those following along at home, as I'm sure you're
already aware, the way we figure out what timeline to switch to when a
replica is getting promoted is that we go run the restore command asking
for history files until we get back "nope, there is no file named
0000123.history", and then we switch to that timeline and then try to
push such a history file into the repo and hope that it works.

Huh, I had not thought about that problem. So, at the risk of getting
sidetracked, what exactly are you asking for here? Let the extension
pick the timeline using an algorithm of its own devising, rather than
having core do it? Or what?

Having the extension do it somehow is an interesting idea and one which
might be kind of cool.

The first thought I had was to make it archive_command's job to "pick"
the timeline by just re-trying to push the .history file (the actual
contents of it don't change, as the information in the file is about the
timeline we are switching *from* and at what LSN). That requires an
archive command which will fail if that file already exists though and,
ideally, would perform the file archival in an atomic fashion (though
this last bit isn't strictly necessary - anything along these lines would
certainly be better than the current state).

Having an entirely independent command/hook that's explicitly for this
case would be another approach, of course: either one that allows the
extension to pick the destination timeline, or one defined as "return
success only if the file is successfully archived, but do *not*
overwrite any existing file of the same name; return an error
instead", combined with the same approach as outlined above.

Thanks,

Stephen

#15Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#13)
Re: .ready and .done files considered harmful

On Wed, May 5, 2021 at 4:31 PM Andres Freund <andres@anarazel.de> wrote:

On 2021-05-05 16:22:21 -0400, Robert Haas wrote:

Huh, I had not thought about that problem. So, at the risk of getting
sidetracked, what exactly are you asking for here? Let the extension
pick the timeline using an algorithm of its own devising, rather than
having core do it? Or what?

Not Stephen, but to me the most reasonable way to address this is to
make timeline identifiers wider and randomly allocated. The
sequential-looking nature of timelines is, imo, actively unhelpful.

Yeah, I always wondered why we didn't assign them randomly.

--
Robert Haas
EDB: http://www.enterprisedb.com

#16Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#15)
Re: .ready and .done files considered harmful

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Wed, May 5, 2021 at 4:31 PM Andres Freund <andres@anarazel.de> wrote:

On 2021-05-05 16:22:21 -0400, Robert Haas wrote:

Huh, I had not thought about that problem. So, at the risk of getting
sidetracked, what exactly are you asking for here? Let the extension
pick the timeline using an algorithm of its own devising, rather than
having core do it? Or what?

Not Stephen, but to me the most reasonable way to address this is to
make timeline identifiers wider and randomly allocated. The
sequential-looking nature of timelines is, imo, actively unhelpful.

Yeah, I always wondered why we didn't assign them randomly.

Based on what we do today regarding the info we put into .history files,
trying to figure out which is the "latest" timeline might be a bit
tricky with randomly selected timelines. Maybe we could find a way to
solve that though.
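[One conceivable shape of a solution, sketched by the editor purely as an illustration of the idea gestured at above: with random IDs, "latest" would have to come from the parent links already recorded in the *.history files rather than from numeric order. `tip_of_chain` is a hypothetical helper:]

```python
def tip_of_chain(parent_of):
    """parent_of maps a timeline ID to the ID it branched from,
    as recoverable from the *.history files. With random IDs the
    newest timeline is the one no other timeline lists as its
    parent, i.e. the tip of the branch chain."""
    parents = set(parent_of.values())
    tips = [tli for tli in parent_of if tli not in parents]
    if len(tips) != 1:
        # Multiple tips would mean the history has diverged.
        raise ValueError("diverged history: %r" % tips)
    return tips[0]

# Random-looking IDs: 0x2AF9 branched from 0x91C3, which branched from 1.
assert tip_of_chain({0x91C3: 1, 0x2AF9: 0x91C3}) == 0x2AF9
```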

I do note that this comment in timeline.c is, ahem, perhaps over-stating
things a bit:

* Note: while this is somewhat heuristic, it does positively guarantee
* that (result + 1) is not a known timeline, and therefore it should
* be safe to assign that ID to a new timeline.

Thanks,

Stephen

#17Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#16)
Re: .ready and .done files considered harmful

On Wed, May 5, 2021 at 4:53 PM Stephen Frost <sfrost@snowman.net> wrote:

I do note that this comment in timeline.c is, ahem, perhaps over-stating
things a bit:

* Note: while this is somewhat heuristic, it does positively guarantee
* that (result + 1) is not a known timeline, and therefore it should
* be safe to assign that ID to a new timeline.

OK, that made me laugh out loud.

--
Robert Haas
EDB: http://www.enterprisedb.com

#18Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#4)
Re: .ready and .done files considered harmful

At Tue, 4 May 2021 10:07:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Tue, May 4, 2021 at 12:27 AM Andres Freund <andres@anarazel.de> wrote:

On 2021-05-03 16:49:16 -0400, Robert Haas wrote:

But perhaps we could work around this by allowing pgarch.c to access
shared memory, in which case it could examine the current timeline
whenever it wants, and probably also whatever LSNs it needs to know
what's safe to archive.

FWIW, the shared memory stats patch implies doing that, since the
archiver reports stats.

Are you planning to commit that for v15? If so, will it be early in
the cycle, do you think?

FWIW, it's already been done for v14.

Author: Fujii Masao <fujii@postgresql.org>
Date: Mon Mar 15 13:13:14 2021 +0900

Make archiver process an auxiliary process.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#18)
Re: .ready and .done files considered harmful

On Thu, May 6, 2021 at 3:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

FWIW, it's already been done for v14.

Author: Fujii Masao <fujii@postgresql.org>
Date: Mon Mar 15 13:13:14 2021 +0900

Make archiver process an auxiliary process.

Oh, I hadn't noticed. Thanks.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#19)
Re: .ready and .done files considered harmful

How are you envisioning the shared-memory signaling working in the
original sample case, where the archiver had been failing for half a
year?

Or should we perhaps have a system table for ready-to-archive WAL
files, to get around the limitations of the file system and return just
the needed files with ORDER BY ... LIMIT, as we already know how to
make lookups in a database fast?

Cheers
Hannu


On Thu, May 6, 2021 at 12:24 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, May 6, 2021 at 3:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

FWIW It's already done for v14 individually.

Author: Fujii Masao <fujii@postgresql.org>
Date: Mon Mar 15 13:13:14 2021 +0900

Make archiver process an auxiliary process.

Oh, I hadn't noticed. Thanks.

--
Robert Haas
EDB: http://www.enterprisedb.com
