Fetching timeline during recovery

Started by Jehan-Guillaume de Rorthais · over 6 years ago · 31 messages · pgsql-hackers

Hello,

Fetching the timeline from a standby could be useful in various situations,
either for backup tools [1] or for failover tools during some kind of election
process.

Please, find in attachment a first trivial patch to support pg_walfile_name()
and pg_walfile_name_offset() on a standby.

The previous restriction on these functions seems related to ThisTimeLineID not
being safe on a standby. This patch fetches the timeline from
WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
this is updated each time some data is flushed to the WAL.

As the SQL function pg_last_wal_receive_lsn() reads WalRcv->receivedUpto,
which is updated at the same time, any tool relying on these functions should
be quite fine. It will just have to parse the TLI from the walfile name.
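For illustration, the timeline is simply the first 8 hex digits of the 24-character segment name. Here is a minimal sketch of that parsing, assuming the default 16 MB segment size (the helper name is mine, not from the patch):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024                # default wal_segment_size
SEGS_PER_XLOGID = 0x100000000 // WAL_SEG_SIZE  # 256 segments per 4 GB "xlogid"

def parse_walfile_name(name):
    """Split TTTTTTTTXXXXXXXXYYYYYYYY into (timeline, start LSN as text)."""
    tli = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    lsn = (log * SEGS_PER_XLOGID + seg) * WAL_SEG_SIZE
    return tli, "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)

print(parse_walfile_name("000000020000000A000000FF"))  # (2, 'A/FF000000')
```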

It doesn't seem perfectly sane, though. I suspect a race condition in any SQL
statement that tries to get the LSN and the walfile name at the same time
if the timeline changes in the meantime. Ideally, a single function should be
able to return both the LSN and the TLI, with only one read from WalRcv. I'm
not sure whether I should change the result of pg_last_wal_receive_lsn() or add
a brand new admin function. Any advice?

Last, I plan to produce an extension to support this on older releases. Is
this something that could be integrated into the official source tree during a
minor release, or should I publish it on e.g. PGXN?

Regards,

[1]: /messages/by-id/BF2AD4A8-E7F5-486F-92C8-A6959040DEB6@yandex-team.ru

Attachments:

0001-Support-pg_walfile_name-on-standby.patch (text/x-patch, +10 −13)
#2Andrey Borodin
amborodin@acm.org
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: Fetching timeline during recovery

On 23 Jul 2019, at 21:05, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Fetching the timeline from a standby could be useful in various situation.
Either for backup tools [1] or failover tools during some kind of election
process.

That backup tool reads the timeline from pg_control_checkpoint() and formats the WAL file name itself when necessary.

Please, find in attachment a first trivial patch to support pg_walfile_name()
and pg_walfile_name_offset() on a standby.

You just cannot format a WAL file name for an LSN when the timeline changed,
because there are at least three WALs for that point: previous, new, and
partial. However, reading the TLI from the checkpoint seems safe for backup
purposes.
The only reason for WAL-G to read that timeline is to mark a backup invalid: if
its name is base_00000001XXXXXXXXYY00000YY and a timeline change happens, it
should be named base_00000002XXXXXXXXYY00000YY (the consistency point is not on
TLI 2), but WAL-G cannot rename the backup during backup-push.
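The three-candidates point can be sketched like this (a toy Python illustration of my own, assuming the default 16 MB segment size; the helper names are not from any tool):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def segment_name(tli, lsn):
    segno = lsn // WAL_SEG_SIZE
    return "%08X%08X%08X" % (tli, segno // 256, segno % 256)

def candidates_at_switch(old_tli, new_tli, switch_lsn):
    # The segment covering the switch LSN can exist under three names:
    # on the old timeline, on the new timeline, and as the old-timeline
    # copy archived with a ".partial" suffix.
    old = segment_name(old_tli, switch_lsn)
    new = segment_name(new_tli, switch_lsn)
    return [old, new, old + ".partial"]

print(candidates_at_switch(1, 2, 0xAFF000000))
```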

Hope this information is useful. Thanks!

Best regards, Andrey Borodin.

[0]: https://github.com/wal-g/wal-g/blob/master/internal/timeline.go#L39

#3David Steele
david@pgmasters.net
In reply to: Andrey Borodin (#2)
Re: Fetching timeline during recovery

On 7/23/19 2:59 PM, Andrey Borodin wrote:

On 23 Jul 2019, at 21:05, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Fetching the timeline from a standby could be useful in various situation.
Either for backup tools [1] or failover tools during some kind of election
process.

That backup tool reads the timeline from pg_control_checkpoint() and formats the WAL file name itself when necessary.

We do the same [1].

Please, find in attachment a first trivial patch to support pg_walfile_name()
and pg_walfile_name_offset() on a standby.

You just cannot format a WAL file name for an LSN when the timeline changed,
because there are at least three WALs for that point: previous, new, and
partial. However, reading the TLI from the checkpoint seems safe for backup
purposes.
The only reason for WAL-G to read that timeline is to mark a backup invalid: if
its name is base_00000001XXXXXXXXYY00000YY and a timeline change happens, it
should be named base_00000002XXXXXXXXYY00000YY (the consistency point is not on
TLI 2), but WAL-G cannot rename the backup during backup-push.

Naming considerations aside, I don't think that a timeline switch during
a standby backup is a good idea, mostly because it is (currently) not
tested. We don't allow it in pgBackRest.

[1]: https://github.com/pgbackrest/pgbackrest/blob/release/2.15.1/lib/pgBackRest/Db.pm#L1008

--
-David
david@pgmasters.net

#4Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: David Steele (#3)
Re: Fetching timeline during recovery

On Tue, 23 Jul 2019 16:00:29 -0400
David Steele <david@pgmasters.net> wrote:

On 7/23/19 2:59 PM, Andrey Borodin wrote:

On 23 Jul 2019, at 21:05, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

Fetching the timeline from a standby could be useful in various situation.
Either for backup tools [1] or failover tools during some kind of election
process.

That backup tool reads the timeline from pg_control_checkpoint() and
formats the WAL file name itself when necessary.

We do the same [1].

Thank you both for your comments.

OK, so backup tools are fine with reading slightly outdated data from the
controldata file.

Anyway, my use case is mostly about auto failover. During election, I currently
have to force a checkpoint on standbys to get their real timeline from the
controldata.

However, the forced checkpoint could be very long [1] (considering auto
failover). I need to be able to compare TLs without all the burden of a
CHECKPOINT just for this.

As I wrote, my favorite solution would be a function returning BOTH the
current TL and LSN at the same time. I'll send a patch to the list tomorrow
and we can bikeshed the details later depending on the feedback.

In the meantime, the previous patch might still be useful for some other
purpose. Comments are welcome.

Thanks,

[1]: this exact use case is actually hiding behind this thread: /messages/by-id/CAEkBuzeno6ztiM1g4WdzKRJFgL8b2nfePNU=q3sBiEZUm-D-sQ@mail.gmail.com

#5Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#1)
Re: Fetching timeline during recovery

On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais wrote:

Please, find in attachment a first trivial patch to support pg_walfile_name()
and pg_walfile_name_offset() on a standby.
The previous restriction on these functions seems related to ThisTimeLineID not
being safe on a standby. This patch fetches the timeline from
WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
this is updated each time some data is flushed to the WAL.

FWIW, I don't have any objections to lift a bit the restrictions on
those functions if we can make that reliable enough. Now during
recovery you cannot rely on ThisTimeLineID as you say, per mostly the
following bit in xlog.c (the comment block a little bit up also has
explanations):
    /*
     * ThisTimeLineID is normally not set when we're still in recovery.
     * However, recycling/preallocating segments above needed ThisTimeLineID
     * to determine which timeline to install the segments on. Reset it now,
     * to restore the normal state of affairs for debugging purposes.
     */
    if (RecoveryInProgress())
        ThisTimeLineID = 0;

Your patch does not account for the case of archive recovery, where
there is no WAL receiver; as the shared memory state of the WAL
receiver is not set, a timeline of 0 would be used. The replay timeline is
something we could use here instead via GetXLogReplayRecPtr().
CreateRestartPoint actually takes the latest WAL receiver or replayed
point for its end LSN position, whichever is newer.

Last, I plan to produce an extension to support this on older release. Is
it something that could be integrated in official source tree during a minor
release or should I publish it on eg. pgxn?

Unfortunately no. This is a behavior change so it cannot find its way
into back branches. The WAL receiver state is in shared memory and
published, so that's easy enough to get. We don't do that for XLogCtl
unfortunately. I think that there are arguments for being more
flexible with it, and perhaps have a system-level view to be able to
look at some of its fields.

There is also a downside with get_controlfile(), which is that it
fetches directly the data from the on-disk pg_control, and
post-recovery this only gets updated at the first checkpoint.
--
Michael

#6Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Michael Paquier (#5)
Re: Fetching timeline during recovery

Hello Michael,

On Wed, 24 Jul 2019 09:49:05 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais wrote:

Please, find in attachment a first trivial patch to support
pg_walfile_name() and pg_walfile_name_offset() on a standby.
The previous restriction on these functions seems related to ThisTimeLineID not
being safe on a standby. This patch fetches the timeline from
WalRcv->receivedTLI using GetWalRcvWriteRecPtr(). As far as I understand,
this is updated each time some data is flushed to the WAL.

[...]

Your patch does not account for the case of archive recovery, where
there is no WAL receiver; as the shared memory state of the WAL
receiver is not set, a timeline of 0 would be used.

Indeed. I tested this case with the following query and was fine with the
NULL result:

select pg_walfile_name(pg_last_wal_receive_lsn());

I was fine with this result because my use case requires replication anyway. A
NULL result would mean that the node never streamed from the old primary since
its last startup, so a failover should ignore it.

However, the NULL just comes from pg_last_wal_receive_lsn() here. The following
query returns a wrong result:

select pg_walfile_name('0/1')

000000000000000000000000

I fixed that. See patch 0001-v2-* in attachment.
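The root cause can be sketched like this (my own Python mock-up of the name formatting, not the patch's C code; default 16 MB segment size assumed):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def walfile_name(tli, lsn):
    # With an unset (zero) timeline, the old code happily formatted the
    # bogus all-zero name; the fix is to refuse instead.
    if tli == 0:
        raise ValueError("timeline unknown: no WAL received yet")
    segno = lsn // WAL_SEG_SIZE
    return "%08X%08X%08X" % (tli, segno // 256, segno % 256)

print(walfile_name(1, 0x1))  # the segment holding LSN 0/1 on timeline 1
```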

The replay timeline is something we could use here instead via
GetXLogReplayRecPtr(). CreateRestartPoint actually takes the latest WAL
receiver or replayed point for its end LSN position, whichever is newer.

I did consider GetXLogReplayRecPtr() or even XLogCtl->replayEndTLI (which is
updated right before replay). However, both depend on read activity on the
standby. That's why I picked WalRcv->receivedTLI, which is updated regardless
of the read activity on the standby.

Last, I plan to produce an extension to support this on older release. Is
it something that could be integrated in official source tree during a minor
release or should I publish it on eg. pgxn?

Unfortunately no. This is a behavior change so it cannot find its way
into back branches.

Yes, my patch is a behavior change. But here, I was talking about an
extension, not core itself, to support this feature in older releases.

The WAL receiver state is in shared memory and published, so that's easy
enough to get. We don't do that for XLogCtl unfortunately.

Both are in shared memory, but WalRcv has a public function to get its
receivedTLI member.

XLogCtl has nothing public to expose its ThisTimeLineID member. However, from
a module, I'm able to fetch it using:

XLogCtl = ShmemInitStruct("XLOG Ctl", XLOGShmemSize(), &found);
SpinLockAcquire(&XLogCtl->info_lck);
tl = XLogCtl->ThisTimeLineID;
SpinLockRelease(&XLogCtl->info_lck);

As the "XLOG Ctl" index entry already exists in shmem, ShmemInitStruct returns
the existing structure from there. I'm not sure it was supposed to be used this
way, though... Adding a public function might be cleaner, but it would not help
for older releases.

I think that there are arguments for being more flexible with it, and perhaps
have a system-level view to be able to look at some of its fields.

Great idea. I'll give it a try to keep the discussion going.

There is also a downside with get_controlfile(), which is that it
fetches directly the data from the on-disk pg_control, and
post-recovery this only gets updated at the first checkpoint.

Indeed, that's why I started this patch and thread.

Thanks,

Attachments:

0001-v2-Support-pg_walfile_name-on-standby.patch (text/x-patch, +25 −15)
#7Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Jehan-Guillaume de Rorthais (#6)
Re: Fetching timeline during recovery

Hello,

On Wed, 24 Jul 2019 14:33:27 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Wed, 24 Jul 2019 09:49:05 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
wrote:

[...]

I think that there are arguments for being more flexible with it, and
perhaps have a system-level view to be able to look at some of its fields.

Great idea. I'll give it a try to keep the discussion on.

After some thinking, I did not find enough data to expose to justify the
creation of a system-level view. As I just need the current timeline, I
wrote "pg_current_timeline()". Please find the patch in attachment.

The current behavior is quite simple:
* if the cluster is in production, return ThisTimeLineID
* else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)

This is a really naive implementation. We should probably add some code around
the startup process to gather and share general recovery stats. This would
allow fetching e.g. the current recovery method, the latest xlog file name
restored from archives or streaming, its timeline, etc.

Any thoughts?

Regards,

Attachments:

0001-v1-Add-function-pg_current_timeline.patch (text/x-patch, +50 −1)
#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#7)
Re: Fetching timeline during recovery

Hi.

At Thu, 25 Jul 2019 19:38:08 +0200, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote in <20190725193808.1648ddc8@firost>

On Wed, 24 Jul 2019 14:33:27 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Wed, 24 Jul 2019 09:49:05 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
wrote:

[...]

I think that there are arguments for being more flexible with it, and
perhaps have a system-level view to be able to look at some of its fields.

Great idea. I'll give it a try to keep the discussion on.

After some thinking, I did not find enough data to expose to justify the
creation a system-level view. As I just need the current timeline I
wrote "pg_current_timeline()". Please, find the patch in attachment.

The current behavior is quite simple:
* if the cluster is in production, return ThisTimeLineID
* else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)

This is really naive implementation. We should probably add some code around
the startup process to gather and share general recovery stats. This would
allow to fetch eg. the current recovery method, latest xlog file name restored
from archives or streaming, its timeline, etc.

Any thoughts?

If replay is delayed behind a timeline switch point, the replay LSN and
the receive/write/flush LSNs are on different timelines. When the
replica has not yet reached the new timeline to which an already
received file belongs, the function returns a wrong file name,
specifically a name consisting of the latest segment number and
the older timeline the segment doesn't belong to.
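Concretely, the mismatch looks like this (a toy sketch with hypothetical values; default 16 MB segment size, helper name of my own):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def segment_name(tli, lsn):
    segno = lsn // WAL_SEG_SIZE
    return "%08X%08X%08X" % (tli, segno // 256, segno % 256)

# The walreceiver is already writing on timeline 2...
received_tli, received_lsn = 2, 0x25000000
# ...while replay has not yet crossed the switch point on timeline 1.
replay_tli = 1

print(segment_name(received_tli, received_lsn))  # file that really exists
print(segment_name(replay_tli, received_lsn))    # never existed: new segno, old TLI
```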

We have a separate LSN-reporting function for each of several purposes:

pg_current_wal_lsn
pg_current_wal_insert_lsn
pg_current_wal_flush_lsn
pg_last_wal_receive_lsn
pg_last_wal_replay_lsn

But I'm not sure just adding more pg_last_*_timeline() functions to
this list is a good thing...

The function returns NULL for NULL input (STRICT behavior) but
returns (NULL, NULL) for an undefined timeline. I don't think the
difference is meaningful.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Kyotaro Horiguchi (#8)
Re: Fetching timeline during recovery

On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

Hi.

At Thu, 25 Jul 2019 19:38:08 +0200, Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote in <20190725193808.1648ddc8@firost>

On Wed, 24 Jul 2019 14:33:27 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Wed, 24 Jul 2019 09:49:05 +0900
Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jul 23, 2019 at 06:05:18PM +0200, Jehan-Guillaume de Rorthais
wrote:

[...]

I think that there are arguments for being more flexible with it, and
perhaps have a system-level view to be able to look at some of its
fields.

Great idea. I'll give it a try to keep the discussion on.

After some thinking, I did not find enough data to expose to justify the
creation a system-level view. As I just need the current timeline I
wrote "pg_current_timeline()". Please, find the patch in attachment.

The current behavior is quite simple:
* if the cluster is in production, return ThisTimeLineID
* else return walrcv->receivedTLI (using GetWalRcvWriteRecPtr)

This is really naive implementation. We should probably add some code around
the startup process to gather and share general recovery stats. This would
allow to fetch eg. the current recovery method, latest xlog file name
restored from archives or streaming, its timeline, etc.

Any thoughts?

If replay is delayed behind a timeline switch point, the replay LSN and
the receive/write/flush LSNs are on different timelines. When the
replica has not yet reached the new timeline to which an already
received file belongs, the function returns a wrong file name,
specifically a name consisting of the latest segment number and
the older timeline the segment doesn't belong to.

Indeed.

We have an LSN reporting function each for several objectives.

pg_current_wal_lsn
pg_current_wal_insert_lsn
pg_current_wal_flush_lsn
pg_last_wal_receive_lsn
pg_last_wal_replay_lsn

Yes. In fact, my current implementation might be split as:

pg_current_wal_tl: returns the TL on a production cluster
pg_last_wal_received_tl: returns the last received TL on a standby

If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
*flush_tl would be useful, as a cluster in production is not supposed to
change its timeline during its lifetime.

But, I'm not sure just adding further pg_last_*_timeline() to
this list is a good thing..

I think this is a much better idea than mixing different cases (production and
standby) in the same function as I did. Moreover, it's much more coherent with
the other existing functions.

The function returns NULL for NULL input (STRICT behavior) but
returns (NULL, NULL) for undefined timeline. I don't think the
differene is meaningful.

Unless I'm missing something, nothing
returns "(NULL, NULL)" in 0001-v1-Add-function-pg_current_timeline.patch.

Thank you for your feedback!

#10Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Jehan-Guillaume de Rorthais (#9)
Re: Fetching timeline during recovery

On Fri, 26 Jul 2019 10:02:58 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

[...]

We have an LSN reporting function each for several objectives.

pg_current_wal_lsn
pg_current_wal_insert_lsn
pg_current_wal_flush_lsn
pg_last_wal_receive_lsn
pg_last_wal_replay_lsn

Yes. In fact, my current implementation might be split as:

pg_current_wal_tl: returns TL on a production cluster
pg_last_wal_received_tl: returns last received TL on a standby

If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
*flush_tl would be useful as a cluster in production is not supposed to
change its timeline during its lifetime.

But, I'm not sure just adding further pg_last_*_timeline() to
this list is a good thing..

I think this is a much better idea than mixing different case (production and
standby) in the same function as I did. Moreover, it's much more coherent with
other existing functions.

Please find in attachment a new version of the patch. It now creates two new
functions:

pg_current_wal_tl()
pg_last_wal_received_tl()

Regards,

Attachments:

0001-v2-Add-functions-to-get-timeline.patch (text/x-patch, +69 −1)
#11Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Jehan-Guillaume de Rorthais (#10)
Re: Fetching timeline during recovery

On Fri, 26 Jul 2019 18:22:25 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 10:02:58 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

[...]

We have an LSN reporting function each for several objectives.

pg_current_wal_lsn
pg_current_wal_insert_lsn
pg_current_wal_flush_lsn
pg_last_wal_receive_lsn
pg_last_wal_replay_lsn

Yes. In fact, my current implementation might be split as:

pg_current_wal_tl: returns TL on a production cluster
pg_last_wal_received_tl: returns last received TL on a standby

If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
*flush_tl would be useful as a cluster in production is not supposed to
change its timeline during its lifetime.

But, I'm not sure just adding further pg_last_*_timeline() to
this list is a good thing..

I think this is a much better idea than mixing different case (production
and standby) in the same function as I did. Moreover, it's much more
coherent with other existing functions.

Please, find in attachment a new version of the patch. It now creates two new
functions:

pg_current_wal_tl()
pg_last_wal_received_tl()

I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
Please find the corrected patch in attachment:
0001-v3-Add-functions-to-get-timeline.patch

Also, TimeLineID is declared as a uint32. So why do we use
PG_RETURN_INT32/Int32GetDatum to return a timeline and not PG_RETURN_UINT32?
See eg. in pg_stat_get_wal_receiver().

Regards,

Attachments:

0001-v3-Add-functions-to-get-timeline.patch (text/x-patch, +69 −1)
#12Fujii Masao
masao.fujii@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#11)
Re: Fetching timeline during recovery

On Mon, Jul 29, 2019 at 7:26 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 18:22:25 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 10:02:58 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 16:49:53 +0900 (Tokyo Standard Time)
Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:

[...]

We have an LSN reporting function each for several objectives.

pg_current_wal_lsn
pg_current_wal_insert_lsn
pg_current_wal_flush_lsn
pg_last_wal_receive_lsn
pg_last_wal_replay_lsn

Yes. In fact, my current implementation might be split as:

pg_current_wal_tl: returns TL on a production cluster
pg_last_wal_received_tl: returns last received TL on a standby

If useful, I could add pg_last_wal_replayed_tl. I don't think *insert_tl and
*flush_tl would be useful as a cluster in production is not supposed to
change its timeline during its lifetime.

But, I'm not sure just adding further pg_last_*_timeline() to
this list is a good thing..

I think this is a much better idea than mixing different case (production
and standby) in the same function as I did. Moreover, it's much more
coherent with other existing functions.

Please, find in attachment a new version of the patch. It now creates two new
functions:

pg_current_wal_tl()
pg_last_wal_received_tl()

I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
Please find the corrected patch in attachment:
0001-v3-Add-functions-to-get-timeline.patch

Thanks for the patch! Here are some comments from me.

You need to write the documentation explaining the functions
that you're planning to add.

+/*
+ * Returns the current timeline on a production cluster
+ */
+Datum
+pg_current_wal_tl(PG_FUNCTION_ARGS)

The timeline ID that this function returns seems almost
the same as pg_control_checkpoint().timeline_id,
when the server is in production. So I'm not sure
if it's worth adding that new function.

+ currentTL = GetCurrentTimeLine();
+
+ PG_RETURN_INT32(currentTL);

Is GetCurrentTimeLine() really necessary? Seems ThisTimeLineID can be
returned directly since it indicates the current timeline ID in production.

+pg_last_wal_received_tl(PG_FUNCTION_ARGS)
+{
+ TimeLineID lastReceivedTL;
+ WalRcvData *walrcv = WalRcv;
+
+ SpinLockAcquire(&walrcv->mutex);
+ lastReceivedTL = walrcv->receivedTLI;
+ SpinLockRelease(&walrcv->mutex);

I think that it's smarter to use GetWalRcvWriteRecPtr() to
get the last received TLI, like pg_last_wal_receive_lsn() does.

The timeline ID that this function returns is the same as
pg_stat_wal_receiver.received_tli while walreceiver is running.
But when walreceiver is not running, pg_stat_wal_receiver returns
no record, and pg_last_wal_received_tl() would be useful to
get the timeline only in this case. Is my understanding right?

Also, TimeLineID is declared as a uint32. So why do we use
PG_RETURN_INT32/Int32GetDatum to return a timeline and not PG_RETURN_UINT32?
See eg. in pg_stat_get_wal_receiver().

pg_stat_wal_receiver.received_tli is declared as integer.

Regards,

--
Fujii Masao

#13Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Fujii Masao (#12)
Re: Fetching timeline during recovery

On Wed, 4 Sep 2019 00:32:03 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

On Mon, Jul 29, 2019 at 7:26 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 18:22:25 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 10:02:58 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

[...]

Please, find in attachment a new version of the patch. It now creates two
new functions:

pg_current_wal_tl()
pg_last_wal_received_tl()

I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
Please find the corrected patch in attachment:
0001-v3-Add-functions-to-get-timeline.patch

Thanks for the patch! Here are some comments from me.

Thank you for your review!

Please, find in attachment the v4 of the patch:
0001-v4-Add-functions-to-get-timeline.patch

Answers below.

You need to write the documentation explaining the functions
that you're thinking to add.

Done.

+/*
+ * Returns the current timeline on a production cluster
+ */
+Datum
+pg_current_wal_tl(PG_FUNCTION_ARGS)

The timeline ID that this function returns seems almost
the same as pg_control_checkpoint().timeline_id,
when the server is in production. So I'm not sure
if it's worth adding that new function.

pg_control_checkpoint().timeline_id is read from the controldata file on disk,
which is asynchronously updated with the real status of the local cluster.
Right after a promotion, fetching the TL from pg_control_checkpoint() is wrong
and can cause race conditions on the client side.

This is the main reason I am working on this patch.

+ currentTL = GetCurrentTimeLine();
+
+ PG_RETURN_INT32(currentTL);

Is GetCurrentTimeLine() really necessary? Seems ThisTimeLineID can be
returned directly since it indicates the current timeline ID in production.

Indeed. I might have over-focused on memory state. ThisTimeLineID seems to be
updated soon enough during the promotion, in fact, even before
XLogCtl->ThisTimeLineID:

if (ArchiveRecoveryRequested)
{
[...]
ThisTimeLineID = findNewestTimeLine(recoveryTargetTLI) + 1;
[...]
}

/* Save the selected TimeLineID in shared memory, too */
XLogCtl->ThisTimeLineID = ThisTimeLineID;

+pg_last_wal_received_tl(PG_FUNCTION_ARGS)
+{
+ TimeLineID lastReceivedTL;
+ WalRcvData *walrcv = WalRcv;
+
+ SpinLockAcquire(&walrcv->mutex);
+ lastReceivedTL = walrcv->receivedTLI;
+ SpinLockRelease(&walrcv->mutex);

I think that it's smarter to use GetWalRcvWriteRecPtr() to
get the last received TLI, like pg_last_wal_receive_lsn() does.

I was hesitant between the current implementation and using
GetWalRcvWriteRecPtr(). I chose the current implementation to avoid
unnecessary operations under the spinlock and make it as fast as possible.

However, maybe I'm splitting hairs here in comparison to just calling
GetWalRcvWriteRecPtr() and avoiding minor code duplication.

Being hesitant, v4 of the patch uses GetWalRcvWriteRecPtr() as suggested.

The timeline ID that this function returns is the same as
pg_stat_wal_receiver.received_tli while walreceiver is running.
But when walreceiver is not running, pg_stat_wal_receiver returns
no record, and pg_last_wal_received_tl() would be useful to
get the timeline only in this case. Is this my understanding right?

Exactly.

Also, TimeLineID is declared as a uint32. So why do we use
PG_RETURN_INT32/Int32GetDatum to return a timeline and not PG_RETURN_UINT32?
See eg. in pg_stat_get_wal_receiver().

pg_stat_wal_receiver.received_tli is declared as integer.

Oh, right. Thank you.

Thanks,

Attachments:

0001-v4-Add-functions-to-get-timeline.patch (text/x-patch, +66 −1)
#14Fujii Masao
masao.fujii@gmail.com
In reply to: Jehan-Guillaume de Rorthais (#13)
Re: Fetching timeline during recovery

On Sat, Sep 7, 2019 at 12:06 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Wed, 4 Sep 2019 00:32:03 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

On Mon, Jul 29, 2019 at 7:26 PM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 18:22:25 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

On Fri, 26 Jul 2019 10:02:58 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

[...]

Please, find in attachment a new version of the patch. It now creates two
new functions:

pg_current_wal_tl()
pg_last_wal_received_tl()

I just found I forgot to use PG_RETURN_INT32 in pg_last_wal_received_tl().
Please find the corrected patch in attachment:
0001-v3-Add-functions-to-get-timeline.patch

Thanks for the patch! Here are some comments from me.

Thank you for your review!

Please, find in attachment the v4 of the patch:
0001-v4-Add-functions-to-get-timeline.patch

Thanks for updating the patch!

Should we add regression tests for these functions? For example,
what about using these functions to check the timeline switch case,
in src/test/recovery/t/004_timeline_switch.pl?

Answers below.

You need to write the documentation explaining the functions
that you're thinking to add.

Done.

Thanks!

+ <entry>Get current write-ahead log timeline</entry>

I'm not sure if "current write-ahead log timeline" is proper word.
"timeline ID of current write-ahead log" is more appropriate?

+       <entry><type>int</type></entry>
+       <entry>Get last write-ahead log timeline received and sync to disk by
+        streaming replication.

Same as above. I think that "timeline ID of last write-ahead log received
and sync to disk ..." is better here.

Like pg_last_wal_receive_lsn(), something like "If recovery has
completed this will remain static at the value of the last WAL
record received and synced to disk during recovery.
If streaming replication is disabled, or if it has not yet started,
the function returns NULL." should be in this description?

+/*
+ * Returns the current timeline on a production cluster
+ */
+Datum
+pg_current_wal_tl(PG_FUNCTION_ARGS)

I think that "tl" in the function name should be "tli". "tli" is used
for other functions and views related to timelines, e.g.,
pg_stat_wal_receiver.received_tli. Thoughts?

The timeline ID that this function returns seems almost
the same as pg_control_checkpoint().timeline_id,
when the server is in production. So I'm not sure
if it's worth adding that new function.

pg_control_checkpoint().timeline_id is read from the controldata file on disk
which is asynchronously updated with the real status of the local cluster.
Right after a promotion, fetching the TL from pg_control_checkpoint() is wrong
and can cause race conditions on client side.

Understood.

The timeline ID that this function returns is the same as
pg_stat_wal_receiver.received_tli while walreceiver is running.
But when walreceiver is not running, pg_stat_wal_receiver returns
no record, and pg_last_wal_received_tl() would be useful to
get the timeline only in this case. Is my understanding right?

Exactly.

I'm just imagining that some users want to use pg_last_wal_receive_lsn() and
pg_last_wal_receive_tli() together to, e.g., get the name of the WAL file received
last. But there can be a corner case where the return values of
pg_last_wal_receive_lsn() and of pg_last_wal_receive_tli() are inconsistent.
This can happen because those values are NOT fetched within a single lock.
That is, each function takes its own lock to get its value.

So, to avoid that corner case and get a consistent WAL file name,
we might want to have a function that gets both the LSN and
timeline ID of the last received WAL record within a single lock
(i.e., just uses GetWalRcvWriteRecPtr()) and returns them.
Thoughts?

Regards,

--
Fujii Masao

#15Jehan-Guillaume de Rorthais
jgdr@dalibo.com
In reply to: Fujii Masao (#14)
Re: Fetching timeline during recovery

On Mon, 9 Sep 2019 19:44:10 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

On Sat, Sep 7, 2019 at 12:06 AM Jehan-Guillaume de Rorthais
<jgdr@dalibo.com> wrote:

On Wed, 4 Sep 2019 00:32:03 +0900
Fujii Masao <masao.fujii@gmail.com> wrote:

[...]

Thanks for updating the patch!

Thank you for your review!

Please find in attachment a new version of the patch.

0001-v5-Add-facilities-to-fetch-real-timeline-from-SQL.patch

Should we add regression tests for these functions? For example,
what about using these functions to check the timeline switch case,
in src/test/recovery/t/004_timeline_switch.pl?

Indeed, I added 6 tests to this file.

[...]

Thank you for all other suggestions. They all make sense for v4 of the patch.
However, I removed pg_current_wal_tl() and pg_last_wal_received_tl() to explore
a patch paying attention to your next comment.

I'm just imagining that some users want to use pg_last_wal_receive_lsn() and
pg_last_wal_receive_tli() together to, e.g., get the name of the WAL file received
last. But there can be a corner case where the return values of
pg_last_wal_receive_lsn() and of pg_last_wal_receive_tli() are inconsistent.
This can happen because those values are NOT fetched within a single lock.
That is, each function takes its own lock to get its value.

So, to avoid that corner case and get a consistent WAL file name,
we might want to have a function that gets both the LSN and
timeline ID of the last received WAL record within a single lock
(i.e., just uses GetWalRcvWriteRecPtr()) and returns them.
Thoughts?

You are right.

So either I add some new functions or I overload the existing ones.

I was not convinced by adding two new functions very close to pg_current_wal_lsn
and pg_last_wal_receive_lsn but with slightly different names (e.g. suffixed
with _tli).

I chose to overload pg_current_wal_lsn and pg_last_wal_receive_lsn as
pg_current_wal_lsn(with_tli bool) and pg_last_wal_receive_lsn(with_tli bool).

Both functions return the record (lsn pg_lsn, timeline int4). If with_tli is
NULL or false, the timeline field is NULL.

Documentation is updated to reflect this.

Thoughts?

If this solution is accepted, some other function of the same family might be
good candidates as well, for the sake of homogeneity:

* pg_current_wal_insert_lsn
* pg_current_wal_flush_lsn
* pg_last_wal_replay_lsn

However, I'm not sure how useful this would be.

Thanks again for your time, suggestions and review!

Regards,

Attachments:

0001-v5-Add-facilities-to-fetch-real-timeline-from-SQL.patchtext/x-patchDownload+234-3
#16Michael Paquier
michael@paquier.xyz
In reply to: Jehan-Guillaume de Rorthais (#15)
Re: Fetching timeline during recovery

On Thu, Sep 26, 2019 at 07:20:46PM +0200, Jehan-Guillaume de Rorthais wrote:

If this solution is accepted, some other function of the same family might be
good candidates as well, for the sake of homogeneity:

* pg_current_wal_insert_lsn
* pg_current_wal_flush_lsn
* pg_last_wal_replay_lsn

However, I'm not sure how useful this would be.

Thanks again for your time, suggestions and review!

+{ oid => '3435', descr => 'current wal flush location',
+  proname => 'pg_last_wal_receive_lsn', provolatile => 'v',
proisstrict => 'f',
This description is incorrect.

And please use OIDs in the range of 8000~9999 for patches in
development. You could just use src/include/catalog/unused_oids which
would point out a random range.

+   if (recptr == 0) {
+       nulls[0] = 1;
+       nulls[1] = 1;
+   }
The indentation of the code is incorrect; these should use actual
booleans, and recptr should be compared with InvalidXLogRecPtr (note also the
existence of the macro XLogRecPtrIsInvalid).  Just for the style.

As said in the last emails exchanged on this thread, I don't see how
you can avoid multiple functions with different meanings depending on
whether the cluster is a primary or a standby, knowing that we
have two different concepts of WAL when in recovery (the received
LSN and the replayed LSN) and three concepts for primaries (insert,
current, flush). I agree as well with Fujii-san's point about
not returning the TLI and the LSN across different functions, as this
opens the door to inconsistency in the data received by
the client.

+ * When the first parameter (variable 'with_tli') is true, returns the current
+ * timeline as second field. If false, second field is null.
I don't see much point in having this input parameter
determine the NULL-ness of one of the result columns, and I think
that you had better use a completely different function name for each
of them instead of overloading the existing functions.  Let's remember that a
lot of tools use the existing functions directly in the SELECT clause
for LSN calculations, which work on a 64-bit integer *without* a
timeline assigned to it.  However your patch mixes both concepts by
using pg_current_wal_lsn.

So we could do more with the introduction of five new functions which
allow grabbing the LSN and the TLI in use for the replay, receive, insert,
write and flush positions:
- pg_current_wal_flush_info
- pg_current_wal_insert_info
- pg_current_wal_info
- pg_last_wal_receive_info
- pg_last_wal_replay_info

I would actually be tempted to do the following: one single SRF
function, say pg_wal_info, which takes a text argument with one of
the following values: flush, write, insert, receive, replay. Thinking
more about it, that would be rather neat, and more extensible than the
rest discussed until now. See for example PostgresNode::lsn.
--
Michael

#17Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#16)
Re: Fetching timeline during recovery

Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:

I would be actually tempted to do the following: one single SRF
function, say pg_wal_info which takes a text argument in input with
the following values: flush, write, insert, receive, replay. Thinking
more about it that would be rather neat, and more extensible than the
rest discussed until now. See for example PostgresNode::lsn.

I've not followed this discussion very closely but I agree entirely that
it's really nice to have the timeline be able to be queried in a more
timely manner than asking through pg_control_checkpoint() gives you.

I'm not sure about adding a text argument to such a function though, I
would think you'd either have multiple rows if it's an SRF that gives
you the information on each row and allows a user to filter with a WHERE
clause, or do something like what pg_stat_replication has and just have
a bunch of columns.

Given that we've already gone with the "bunch of columns" approach
elsewhere, it seems like that approach would be more consistent.

Thanks,

Stephen

#18Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#17)
Re: Fetching timeline during recovery

On Wed, Dec 11, 2019 at 10:16:29AM -0500, Stephen Frost wrote:

I've not followed this discussion very closely but I agree entirely that
it's really nice to have the timeline be able to be queried in a more
timely manner than asking through pg_control_checkpoint() gives you.

I'm not sure about adding a text argument to such a function though, I
would think you'd either have multiple rows if it's an SRF that gives
you the information on each row and allows a user to filter with a WHERE
clause, or do something like what pg_stat_replication has and just have
a bunch of columns.

With a NULL added for the values which cannot be defined, then, like
trying to use the function on a primary for the fields which can only
show up during recovery? That would be possible; still, my heart tells me
that a function returning one row is a more natural approach for
this stuff. I may be too used to what we have in the TAP
tests, though.
--
Michael

#19Stephen Frost
sfrost@snowman.net
In reply to: Michael Paquier (#18)
Re: Fetching timeline during recovery

Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:

On Wed, Dec 11, 2019 at 10:16:29AM -0500, Stephen Frost wrote:

I've not followed this discussion very closely but I agree entirely that
it's really nice to have the timeline be able to be queried in a more
timely manner than asking through pg_control_checkpoint() gives you.

I'm not sure about adding a text argument to such a function though, I
would think you'd either have multiple rows if it's an SRF that gives
you the information on each row and allows a user to filter with a WHERE
clause, or do something like what pg_stat_replication has and just have
a bunch of columns.

With a NULL added for the values which cannot be defined, then, like
trying to use the function on a primary for the fields which can only
show up during recovery?

Sure, the function would only return those values that make sense for
the state that the system is in.

That would be possible; still, my heart tells me
that a function returning one row is a more natural approach for
this stuff. I may be too used to what we have in the TAP
tests, though.

I'm confused: wouldn't the above approach be a function that's returning
only one row, if you had a bunch of columns and then had NULL values for
those cases that didn't apply...? Or, if you were thinking about the SRF
approach that you suggested, you could use a WHERE clause to make it
only one row... Though I can see how it's nicer to just have one row in
some cases, which is why I was suggesting the "bunch of columns"
approach.

Thanks,

Stephen

#20Michael Paquier
michael@paquier.xyz
In reply to: Stephen Frost (#19)
Re: Fetching timeline during recovery

On Wed, Dec 11, 2019 at 10:45:25AM -0500, Stephen Frost wrote:

I'm confused- wouldn't the above approach be a function that's returning
only one row, if you had a bunch of columns and then had NULL values for
those cases that didn't apply..? Or, if you were thinking about the SRF
approach that you suggested, you could use a WHERE clause to make it
only one row... Though I can see how it's nicer to just have one row in
some cases which is why I was suggesting the "bunch of columns"
approach.

Oh, sorry. I see the confusion now and that's my fault. In
/messages/by-id/20191211052002.GK72921@paquier.xyz
I mentioned an SRF function which takes an input argument, but that
makes no sense. What I would prefer is just one
function, returning one row (LSN, TLI), taking one input argument to
select which WAL information the caller wants, with five possible cases
(write, insert, flush, receive, replay).

Then, what you are referring to is one function which returns all
(LSN,TLI) for the five cases (write, insert, etc.), so it would return
one row with 10 columns, with NULL mapping to the values which have no
meaning (like replay on a primary).

And on top of that we have a third possibility: one SRF function
returning 5 rows with three attributes (mode, LSN, TLI), where mode
corresponds to one value in the set {write, insert, etc.}.

I actually prefer the first one, and you mentioned the second. But
there could be a point in doing the third one. An advantage of the
second and third ones is that you may be able to get a consistent view
of all the data, but it means holding locks a bit longer to look at
the values. Let's see what others think.
--
Michael
