Add LSN <-> time conversion functionality

Started by Melanie Plagemanabout 2 years ago33 messages
#1Melanie Plageman
melanieplageman@gmail.com
5 attachment(s)

Hi,

Elsewhere [1]/messages/by-id/CAAKRu_b3tpbdRPUPh1Q5h35gXhY=spH2ssNsEsJ9sDfw6=PEAg@mail.gmail.com I required a way to estimate the time corresponding to a
particular LSN in the past. I devised the attached LSNTimeline, a data
structure mapping LSNs <-> timestamps with decreasing precision for
older time, LSN pairs. This can be used to locate and translate a
particular time to LSN or vice versa using linear interpolation.

I've added an instance of the LSNTimeline to PgStat_WalStats and insert
new values to it in background writer's main loop. This patch set also
introduces some new pageinspect functions exposing LSN <-> time
translations.

Outside of being useful to users wondering about the last modification
time of a particular block in a relation, the LSNTimeline can be put to
use in other Postgres sub-systems to govern behavior based on resource
consumption -- using the LSN consumption rate as a proxy.

As mentioned in [1]/messages/by-id/CAAKRu_b3tpbdRPUPh1Q5h35gXhY=spH2ssNsEsJ9sDfw6=PEAg@mail.gmail.com, the LSNTimeline is a prerequisite for my
implementation of a new freeze heuristic which seeks to freeze only
pages which will remain unmodified for a certain amount of wall clock
time. But one can imagine other uses for such translation capabilities.

The pageinspect additions need a bit more work. I didn't bump the
pageinspect version (didn't add the new functions to a new pageinspect
version file). I also didn't exercise the new pageinspect functions in a
test. I was unsure how to write a test which would be guaranteed not to
flake. Because the background writer updates the timeline, it seemed a
remote possibility that the time or LSN returned by the functions would
be 0 and as such, I'm not sure even a test that SELECT time/lsn > 0
would always pass.

I also noticed the pageinspect functions don't have XML id attributes
for link discoverability. I planned to add that in a separate commit.

- Melanie

[1]: /messages/by-id/CAAKRu_b3tpbdRPUPh1Q5h35gXhY=spH2ssNsEsJ9sDfw6=PEAg@mail.gmail.com

Attachments:

v1-0001-Record-LSN-at-postmaster-startup.patchtext/x-patch; charset=US-ASCII; name=v1-0001-Record-LSN-at-postmaster-startup.patchDownload
From 75a48fec0e9f0909dd11a676bb51e494fa4ca61c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 5 Dec 2023 07:29:39 -0500
Subject: [PATCH v1 1/5] Record LSN at postmaster startup

The insert_lsn at postmaster startup can be used along with PgStartTime
as seed values for a timeline mapping LSNs to time. Future commits will
add such a structure for LSN <-> time conversions. A start LSN allows
for such conversions before even inserting a value into the timeline.
The current time and current insert LSN can be used along with
PgStartTime and PgStartLSN.

This is WIP, as I'm not sure if I did this in the right place.
---
 src/backend/access/transam/xlog.c   | 2 ++
 src/backend/postmaster/postmaster.c | 1 +
 src/include/utils/builtins.h        | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1264849883..aa71e502e4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -146,6 +146,8 @@ bool		XLOG_DEBUG = false;
 
 int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
 
+XLogRecPtr	PgStartLSN = InvalidXLogRecPtr;
+
 /*
  * Number of WAL insertion locks to use. A higher value allows more insertions
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index b163e89cbb..d858e04454 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1448,6 +1448,7 @@ PostmasterMain(int argc, char *argv[])
 	 * Remember postmaster startup time
 	 */
 	PgStartTime = GetCurrentTimestamp();
+	PgStartLSN = GetXLogInsertRecPtr();
 
 	/*
 	 * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 2f8b46d6da..0cb24e10e6 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -17,6 +17,7 @@
 #include "fmgr.h"
 #include "nodes/nodes.h"
 #include "utils/fmgrprotos.h"
+#include "access/xlogdefs.h"
 
 /* Sign + the most decimal digits an 8-byte number could have */
 #define MAXINT8LEN 20
@@ -82,6 +83,8 @@ extern void generate_operator_clause(fmStringInfo buf,
 									 Oid opoid,
 									 const char *rightop, Oid rightoptype);
 
+extern PGDLLIMPORT XLogRecPtr PgStartLSN;
+
 /* varchar.c */
 extern int	bpchartruelen(char *s, int len);
 
-- 
2.37.2

v1-0002-Add-LSNTimeline-for-converting-LSN-time.patchtext/x-patch; charset=US-ASCII; name=v1-0002-Add-LSNTimeline-for-converting-LSN-time.patchDownload
From aedbc5058e370705b4041732f671263a2050b997 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:40:27 -0500
Subject: [PATCH v1 2/5] Add LSNTimeline for converting LSN <-> time

Add a new structure, LSNTimeline, consisting of LSNTimes -- each an LSN,
time pair. Each LSNTime can represent multiple logical LSN, time pairs,
referred to as members. LSN <-> time conversions can be done using
linear interpolation with two LSNTimes on the LSNTimeline.

This commit does not add a global instance of LSNTimeline. It adds the
structures and functions needed to maintain and access such a timeline.
---
 src/backend/utils/activity/pgstat_wal.c | 199 ++++++++++++++++++++++++
 src/include/pgstat.h                    |  34 ++++
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 235 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 6a81b78135..ba40aad258 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,11 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 #include "executor/instrument.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
 
 
 PgStat_PendingWalStats PendingWalStats = {0};
@@ -32,6 +35,12 @@ PgStat_PendingWalStats PendingWalStats = {0};
 static WalUsage prevWalUsage;
 
 
+static void lsntime_absorb(LSNTime *a, const LSNTime *b);
+void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
+
+XLogRecPtr estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
+TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
+
 /*
  * Calculate how much WAL usage counters have increased and update
  * shared WAL and IO statistics.
@@ -184,3 +193,193 @@ pgstat_wal_snapshot_cb(void)
 		   sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Set *a to be the earlier of *a or *b.
+ */
+static void
+lsntime_absorb(LSNTime *a, const LSNTime *b)
+{
+	LSNTime		result;
+	int			new_members = a->members + b->members;
+
+	if (a->time < b->time)
+		result = *a;
+	else if (b->time < a->time)
+		result = *b;
+	else if (a->lsn < b->lsn)
+		result = *a;
+	else if (b->lsn < a->lsn)
+		result = *b;
+	else
+		result = *a;
+
+	*a = result;
+	a->members = new_members;
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeline in the first element with spare
+ * capacity.
+ */
+void
+lsntime_insert(LSNTimeline *timeline, TimestampTz time,
+			   XLogRecPtr lsn)
+{
+	LSNTime		temp;
+	LSNTime		carry = {.lsn = lsn,.time = time,.members = 1};
+
+	for (int i = 0; i < timeline->length; i++)
+	{
+		bool		full;
+		LSNTime    *cur = &timeline->data[i];
+
+		/*
+		 * An array element's capacity to represent members is 2 ^ its
+		 * position in the array.
+		 */
+		full = cur->members >= (1 << i);
+
+		/*
+		 * If the current element is not yet at capacity, then insert the
+		 * passed-in LSNTime into this element by taking the smaller of the it
+		 * and the current LSNTime element. This is required to ensure that
+		 * time moves forward on the timeline.
+		 */
+		if (!full)
+		{
+			Assert(cur->members == carry.members);
+			Assert(cur->members + carry.members <= 1 << i);
+			lsntime_absorb(cur, &carry);
+			return;
+		}
+
+		/*
+		 * If the current element is full, ensure that the inserting LSNTime
+		 * is larger than the current element. This must be true for time to
+		 * move forward on the timeline.
+		 */
+		Assert(carry.lsn >= cur->lsn || carry.time >= cur->time);
+
+		/*
+		 * If the element is at capacity, swap the element with the carry and
+		 * continue on to find an element with space to represent the new
+		 * member.
+		 */
+		temp = *cur;
+		*cur = carry;
+		carry = temp;
+	}
+
+	/*
+	 * Time to use another element in the array -- and increase the length in
+	 * the process
+	 */
+	timeline->data[timeline->length] = carry;
+	timeline->length++;
+}
+
+
+/*
+ * Translate time to a LSN using the provided timeline. The timeline will not
+ * be modified.
+ */
+XLogRecPtr
+estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time)
+{
+	TimestampTz time_elapsed;
+	XLogRecPtr	lsns_elapsed;
+	double		result;
+
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the target time is after the current time, our best estimate of the
+	 * LSN is the current insert LSN.
+	 */
+	if (time >= end.time)
+		return end.lsn;
+
+	for (int i = 0; i < timeline->length; i++)
+	{
+		/* Pass times more recent than our target time */
+		if (timeline->data[i].time > time)
+			continue;
+
+		/* Found the first element before our target time */
+		start = timeline->data[i];
+
+		/*
+		 * If there is only one element in the array, use the current time as
+		 * the end of the range. Otherwise it is the element preceding our
+		 * start.
+		 */
+		if (i > 0)
+			end = timeline->data[i - 1];
+		break;
+	}
+
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+
+	result = (double) (time - start.time) / time_elapsed * lsns_elapsed + start.lsn;
+	if (result < 0)
+		return InvalidXLogRecPtr;
+	return result;
+}
+
+/*
+ * Translate lsn to a time using the provided timeline. The timeline will not
+ * be modified.
+ */
+TimestampTz
+estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn)
+{
+	TimestampTz time_elapsed;
+	XLogRecPtr	lsns_elapsed;
+	TimestampTz result;
+
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the target LSN is after the current insert LSN, the current time is
+	 * our best estimate.
+	 */
+	if (lsn >= end.lsn)
+		return end.time;
+
+	for (int i = 0; i < timeline->length; i++)
+	{
+		/* Pass LSNs more recent than our target LSN */
+		if (timeline->data[i].lsn > lsn)
+			continue;
+
+		/* Found the first element before our target LSN */
+		start = timeline->data[i];
+
+		/*
+		 * If there is only one element in the array, use the current LSN and
+		 * time as the end of the range. Otherwise, use the preceding element
+		 * (the first element occuring before our target LSN in the timeline).
+		 */
+		if (i > 0)
+			end = timeline->data[i - 1];
+		break;
+	}
+
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+
+	result = (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
+	if (result < 0)
+		return 0;
+	return result;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ab91b3b367..ddbe320bf3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -428,6 +429,39 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter autoanalyze_count;
 } PgStat_StatTabEntry;
 
+/*
+ * The elements of an LSNTimeline. Each LSNTime represents one or more time,
+ * LSN pairs. The LSN is typically the insert LSN recorded at the time. Members
+ * is the number of logical members -- each a time, LSN pair -- represented in
+ * the LSNTime.
+ */
+typedef struct LSNTime
+{
+	TimestampTz time;
+	XLogRecPtr	lsn;
+	uint64		members;
+} LSNTime;
+
+/*
+ * A timeline consists of LSNTimes from most to least recent. Each element of
+ * the array in the timeline may represent 2^array index logical members --
+ * meaning that each element's capacity is twice that of the preceding element.
+ * This gives more recent times greater precision than less recent ones. An
+ * array of size 64 should provide sufficient capacity without accounting for
+ * what to do when all elements of the array are at capacity.
+ *
+ * When LSNTimes are inserted into the timeline, they are absorbed into the
+ * first array element with spare capacity -- with the new combined element
+ * having the lesser of the two values. The timeline's length is the highest
+ * array index representing one or more logical members. Use the timeline for
+ * LSN <-> time conversion using linear interpolation.
+ */
+typedef struct LSNTimeline
+{
+	int			length;
+	LSNTime		data[64];
+} LSNTimeline;
+
 typedef struct PgStat_WalStats
 {
 	PgStat_Counter wal_records;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e37ef9aa76..3a4121b482 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1520,6 +1520,8 @@ LogicalTapeSet
 LsnReadQueue
 LsnReadQueueNextFun
 LsnReadQueueNextStatus
+LSNTime
+LSNTimeline
 LtreeGistOptions
 LtreeSignature
 MAGIC
-- 
2.37.2

v1-0003-Add-LSNTimeline-to-PgStat_WalStats.patchtext/x-patch; charset=US-ASCII; name=v1-0003-Add-LSNTimeline-to-PgStat_WalStats.patchDownload
From 7b5b8e53026ecdcdccc07029a38d35e8d13985fd Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:41:47 -0500
Subject: [PATCH v1 3/5] Add LSNTimeline to PgStat_WalStats

Add a globally maintained instance of the new LSNTimeline to
PgStat_WalStats and add utility functions for maintaining and accessing
it. This commit does not insert new values to the timeline or use the
helpers to access it.
---
 src/backend/utils/activity/pgstat_wal.c | 48 +++++++++++++++++++++----
 src/include/pgstat.h                    |  6 ++++
 2 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index ba40aad258..594185acb9 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -36,10 +36,10 @@ static WalUsage prevWalUsage;
 
 
 static void lsntime_absorb(LSNTime *a, const LSNTime *b);
-void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
+static void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
 
-XLogRecPtr estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
-TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
+static XLogRecPtr estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
+static TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
 
 /*
  * Calculate how much WAL usage counters have increased and update
@@ -222,7 +222,7 @@ lsntime_absorb(LSNTime *a, const LSNTime *b)
  * Insert a new LSNTime into the LSNTimeline in the first element with spare
  * capacity.
  */
-void
+static void
 lsntime_insert(LSNTimeline *timeline, TimestampTz time,
 			   XLogRecPtr lsn)
 {
@@ -284,7 +284,7 @@ lsntime_insert(LSNTimeline *timeline, TimestampTz time,
  * Translate time to a LSN using the provided timeline. The timeline will not
  * be modified.
  */
-XLogRecPtr
+static XLogRecPtr
 estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time)
 {
 	TimestampTz time_elapsed;
@@ -336,7 +336,7 @@ estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time)
  * Translate lsn to a time using the provided timeline. The timeline will not
  * be modified.
  */
-TimestampTz
+static TimestampTz
 estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn)
 {
 	TimestampTz time_elapsed;
@@ -383,3 +383,39 @@ estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn)
 		return 0;
 	return result;
 }
+
+XLogRecPtr
+pgstat_wal_estimate_lsn_at_time(TimestampTz time)
+{
+	XLogRecPtr	result;
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_lsn_at_time(&stats_shmem->stats.timeline, time);
+	LWLockRelease(&stats_shmem->lock);
+
+	return result;
+}
+
+TimestampTz
+pgstat_wal_estimate_time_at_lsn(XLogRecPtr lsn)
+{
+	TimestampTz result;
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_time_at_lsn(&stats_shmem->stats.timeline, lsn);
+	LWLockRelease(&stats_shmem->lock);
+
+	return result;
+}
+
+void
+pgstat_wal_update_lsntimeline(TimestampTz time, XLogRecPtr lsn)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	lsntime_insert(&stats_shmem->stats.timeline, time, lsn);
+	LWLockRelease(&stats_shmem->lock);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index ddbe320bf3..dd914e606e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -472,6 +472,7 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_sync;
 	PgStat_Counter wal_write_time;
 	PgStat_Counter wal_sync_time;
+	LSNTimeline timeline;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -754,6 +755,11 @@ extern void pgstat_execute_transactional_drops(int ndrops, struct xl_xact_stats_
 extern void pgstat_report_wal(bool force);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 
+/* Helpers for maintaining the LSNTimeline */
+extern XLogRecPtr pgstat_wal_estimate_lsn_at_time(TimestampTz time);
+extern TimestampTz pgstat_wal_estimate_time_at_lsn(XLogRecPtr lsn);
+extern void pgstat_wal_update_lsntimeline(TimestampTz time, XLogRecPtr lsn);
+
 
 /*
  * Variables in pgstat.c
-- 
2.37.2

v1-0004-Bgwriter-maintains-global-LSNTimeline.patchtext/x-patch; charset=US-ASCII; name=v1-0004-Bgwriter-maintains-global-LSNTimeline.patchDownload
From 18d4844e8cf1a6ddc92c46c57fac682fda79ad41 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:32:40 -0500
Subject: [PATCH v1 4/5] Bgwriter maintains global LSNTimeline

Insert new LSN, time pairs to the global LSNTimeline stored in
PgStat_WalStats in the background writer's main loop. This ensures that
new values are added to the timeline in a regular manner.
---
 src/backend/postmaster/bgwriter.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d02dc17b9c..9a2ef869b5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -277,6 +277,7 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_lsn = GetLastImportantRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
@@ -289,10 +290,11 @@ BackgroundWriterMain(void)
 			 * the end of the record.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn <= GetLastImportantRecPtr())
+				last_snapshot_lsn <= current_lsn)
 			{
 				last_snapshot_lsn = LogStandbySnapshot();
 				last_snapshot_ts = now;
+				pgstat_wal_update_lsntimeline(now, current_lsn);
 			}
 		}
 
-- 
2.37.2

v1-0005-Add-time-LSN-translation-functions-to-pageinspect.patchtext/x-patch; charset=US-ASCII; name=v1-0005-Add-time-LSN-translation-functions-to-pageinspect.patchDownload
From ddc0e0b9dbb5b6457e121e286737736dfaad7ef5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 15:46:16 -0500
Subject: [PATCH v1 5/5] Add time <-> LSN translation functions to pageinspect

Previous commits added a global LSNTimeline, maintained by background
writer, that allows approximate translations between time and LSNs. This
can be useful for approximating the time of last modification of a page
or estimating the LSN consumption rate to moderate maintenance processes
and balance system resource utilization. This commit adds user-facing
access to the conversion capabilities of the timeline.
---
 .../pageinspect/pageinspect--1.10--1.11.sql   | 10 +++++
 contrib/pageinspect/rawpage.c                 | 26 +++++++++++
 doc/src/sgml/pageinspect.sgml                 | 45 +++++++++++++++++++
 3 files changed, 81 insertions(+)

diff --git a/contrib/pageinspect/pageinspect--1.10--1.11.sql b/contrib/pageinspect/pageinspect--1.10--1.11.sql
index 8fa5e105bc..72b16d5f84 100644
--- a/contrib/pageinspect/pageinspect--1.10--1.11.sql
+++ b/contrib/pageinspect/pageinspect--1.10--1.11.sql
@@ -26,3 +26,13 @@ ALTER FUNCTION hash_bitmap_info(regclass, int8) PARALLEL RESTRICTED;
 -- Likewise for gist_page_items.
 ALTER FUNCTION brin_page_items(bytea, regclass) PARALLEL RESTRICTED;
 ALTER FUNCTION gist_page_items(bytea, regclass) PARALLEL RESTRICTED;
+
+CREATE FUNCTION estimate_lsn_at_time(IN input_time timestamp with time zone,
+    OUT lsn pg_lsn)
+AS 'MODULE_PATHNAME', 'estimate_lsn_at_time'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+CREATE FUNCTION estimate_time_at_lsn(IN lsn pg_lsn,
+    OUT result timestamp with time zone)
+AS 'MODULE_PATHNAME', 'estimate_time_at_lsn'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
index b25a63cbd6..6d15ab542f 100644
--- a/contrib/pageinspect/rawpage.c
+++ b/contrib/pageinspect/rawpage.c
@@ -22,6 +22,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/checksum.h"
 #include "utils/builtins.h"
@@ -335,6 +336,9 @@ page_header(PG_FUNCTION_ARGS)
 PG_FUNCTION_INFO_V1(page_checksum_1_9);
 PG_FUNCTION_INFO_V1(page_checksum);
 
+PG_FUNCTION_INFO_V1(estimate_lsn_at_time);
+PG_FUNCTION_INFO_V1(estimate_time_at_lsn);
+
 static Datum
 page_checksum_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
 {
@@ -374,3 +378,25 @@ page_checksum(PG_FUNCTION_ARGS)
 {
 	return page_checksum_internal(fcinfo, PAGEINSPECT_V1_8);
 }
+
+Datum
+estimate_time_at_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn = PG_GETARG_LSN(0);
+	TimestampTz result;
+
+	result = pgstat_wal_estimate_time_at_lsn(lsn);
+
+	PG_RETURN_TIMESTAMPTZ(result);
+}
+
+Datum
+estimate_lsn_at_time(PG_FUNCTION_ARGS)
+{
+	TimestampTz time = PG_GETARG_TIMESTAMPTZ(0);
+	XLogRecPtr	result;
+
+	result = pgstat_wal_estimate_lsn_at_time(time);
+
+	PG_RETURN_LSN(result);
+}
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 27e0598f74..cfd60bfd9a 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -153,6 +153,51 @@ test=# SELECT fsm_page_contents(get_raw_page('pg_class', 'fsm', 0));
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>estimate_lsn_at_time(lsn timestamptz) returns pg_lsn</function>
+     <indexterm>
+      <primary>estimate_lsn_at_time</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>estimate_lsn_at_time</function> estimates the LSN at the provided time.
+     </para>
+
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>estimate_time_at_lsn(lsn pg_lsn) returns timestamp with timezone</function>
+     <indexterm>
+      <primary>estimate_time_at_lsn</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>estimate_time_at_lsn</function> estimates the time at provided LSN.
+     </para>
+
+     <para>
+      One useful application is approximating the last modification time of a
+      given page in a relation. For example, when combined with pageinspect
+      functions returning a page LSN:
+<screen>
+test=# SELECT estimate_time_at_lsn((SELECT lsn from page_header(get_raw_page('pg_class', 0))));
+     estimate_time_at_lsn
+-------------------------------
+ 2023-12-22 08:01:02.393598-05
+</screen>
+     </para>
+
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </sect2>
 
-- 
2.37.2

#2Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#1)
5 attachment(s)
Re: Add LSN <-> time conversion functionality

On Wed, Dec 27, 2023 at 5:16 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

Elsewhere [1] I required a way to estimate the time corresponding to a
particular LSN in the past. I devised the attached LSNTimeline, a data
structure mapping LSNs <-> timestamps with decreasing precision for
older time, LSN pairs. This can be used to locate and translate a
particular time to LSN or vice versa using linear interpolation.

Attached is a new version which fixes one overflow danger I noticed in
the original patch set.

I have also been doing some thinking about the LSNTimeline data
structure. Its array elements are combined before all elements have
been used. This sacrifices precision earlier than required. I tried
some alternative structures that would use the whole array. There are
a lot of options, though. Currently each element fits twice as many
members as the preceding element. To use the whole array, we'd have to
change the behavior from filling each element to its max capacity to
something that filled elements only partially. I'm not sure what the
best distribution would be.

I've added an instance of the LSNTimeline to PgStat_WalStats and insert
new values to it in background writer's main loop. This patch set also
introduces some new pageinspect functions exposing LSN <-> time
translations.

I was thinking that maybe it is silly to have the functions allowing
for translation between LSN and time in the pageinspect extension --
since they are not specifically related to pages (pages are just an
object that has an accessible LSN). I was thinking perhaps we add them
as system information functions. However, the closest related
functions I can think of are those to get the current LSN (like
pg_current_wal_lsn ()). And those are listed as system administration
functions under backup control [1]https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-ADMIN-BACKUP. I don't think the LSN <-> time
functionality fits under backup control.

If I did put them in one of the system information function sections
[2]: https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO

- Melanie

[1]: https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
[2]: https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO

Attachments:

v2-0002-Add-LSNTimeline-for-converting-LSN-time.patchtext/x-patch; charset=US-ASCII; name=v2-0002-Add-LSNTimeline-for-converting-LSN-time.patchDownload
From 8590125d66ce366b35251e5aff14db1a858edda9 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:40:27 -0500
Subject: [PATCH v2 2/5] Add LSNTimeline for converting LSN <-> time

Add a new structure, LSNTimeline, consisting of LSNTimes -- each an LSN,
time pair. Each LSNTime can represent multiple logical LSN, time pairs,
referred to as members. LSN <-> time conversions can be done using
linear interpolation with two LSNTimes on the LSNTimeline.

This commit does not add a global instance of LSNTimeline. It adds the
structures and functions needed to maintain and access such a timeline.
---
 src/backend/utils/activity/pgstat_wal.c | 199 ++++++++++++++++++++++++
 src/include/pgstat.h                    |  34 ++++
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 235 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 1a3c0e1a669..e8d9660f82e 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,11 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 #include "executor/instrument.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
 
 
 PgStat_PendingWalStats PendingWalStats = {0};
@@ -32,6 +35,12 @@ PgStat_PendingWalStats PendingWalStats = {0};
 static WalUsage prevWalUsage;
 
 
+static void lsntime_absorb(LSNTime *a, const LSNTime *b);
+void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
+
+XLogRecPtr estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
+TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
+
 /*
  * Calculate how much WAL usage counters have increased and update
  * shared WAL and IO statistics.
@@ -184,3 +193,193 @@ pgstat_wal_snapshot_cb(void)
 		   sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Set *a to be the earlier of *a or *b.
+ */
+static void
+lsntime_absorb(LSNTime *a, const LSNTime *b)
+{
+	LSNTime		result;
+	uint64		new_members = a->members + b->members;
+
+	if (a->time < b->time)
+		result = *a;
+	else if (b->time < a->time)
+		result = *b;
+	else if (a->lsn < b->lsn)
+		result = *a;
+	else if (b->lsn < a->lsn)
+		result = *b;
+	else
+		result = *a;
+
+	*a = result;
+	a->members = new_members;
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeline in the first element with spare
+ * capacity.
+ */
+void
+lsntime_insert(LSNTimeline *timeline, TimestampTz time,
+			   XLogRecPtr lsn)
+{
+	LSNTime		temp;
+	LSNTime		carry = {.lsn = lsn,.time = time,.members = 1};
+
+	for (int i = 0; i < timeline->length; i++)
+	{
+		bool		full;
+		LSNTime    *cur = &timeline->data[i];
+
+		/*
+		 * An array element's capacity to represent members is 2 ^ its
+		 * position in the array.
+		 */
+		full = cur->members >= (1 << i);
+
+		/*
+		 * If the current element is not yet at capacity, then insert the
+		 * passed-in LSNTime into this element by taking the smaller of the it
+		 * and the current LSNTime element. This is required to ensure that
+		 * time moves forward on the timeline.
+		 */
+		if (!full)
+		{
+			Assert(cur->members == carry.members);
+			Assert(cur->members + carry.members <= 1 << i);
+			lsntime_absorb(cur, &carry);
+			return;
+		}
+
+		/*
+		 * If the current element is full, ensure that the inserting LSNTime
+		 * is larger than the current element. This must be true for time to
+		 * move forward on the timeline.
+		 */
+		Assert(carry.lsn >= cur->lsn || carry.time >= cur->time);
+
+		/*
+		 * If the element is at capacity, swap the element with the carry and
+		 * continue on to find an element with space to represent the new
+		 * member.
+		 */
+		temp = *cur;
+		*cur = carry;
+		carry = temp;
+	}
+
+	/*
+	 * Time to use another element in the array -- and increase the length in
+	 * the process
+	 */
+	timeline->data[timeline->length] = carry;
+	timeline->length++;
+}
+
+
+/*
+ * Translate time to a LSN using the provided timeline. The timeline will not
+ * be modified.
+ */
+XLogRecPtr
+estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time)
+{
+	TimestampTz time_elapsed;
+	XLogRecPtr	lsns_elapsed;
+	double		result;
+
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the target time is after the current time, our best estimate of the
+	 * LSN is the current insert LSN.
+	 */
+	if (time >= end.time)
+		return end.lsn;
+
+	for (int i = 0; i < timeline->length; i++)
+	{
+		/* Pass times more recent than our target time */
+		if (timeline->data[i].time > time)
+			continue;
+
+		/* Found the first element before our target time */
+		start = timeline->data[i];
+
+		/*
+		 * If there is only one element in the array, use the current time as
+		 * the end of the range. Otherwise it is the element preceding our
+		 * start.
+		 */
+		if (i > 0)
+			end = timeline->data[i - 1];
+		break;
+	}
+
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+
+	result = (double) (time - start.time) / time_elapsed * lsns_elapsed + start.lsn;
+	if (result < 0)
+		return InvalidXLogRecPtr;
+	return result;
+}
+
+/*
+ * Translate lsn to a time using the provided timeline. The timeline will not
+ * be modified.
+ */
+TimestampTz
+estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn)
+{
+	TimestampTz time_elapsed;
+	XLogRecPtr	lsns_elapsed;
+	TimestampTz result;
+
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the target LSN is after the current insert LSN, the current time is
+	 * our best estimate.
+	 */
+	if (lsn >= end.lsn)
+		return end.time;
+
+	for (int i = 0; i < timeline->length; i++)
+	{
+		/* Pass LSNs more recent than our target LSN */
+		if (timeline->data[i].lsn > lsn)
+			continue;
+
+		/* Found the first element before our target LSN */
+		start = timeline->data[i];
+
+		/*
+		 * If there is only one element in the array, use the current LSN and
+		 * time as the end of the range. Otherwise, use the preceding element
+		 * (the first element occuring before our target LSN in the timeline).
+		 */
+		if (i > 0)
+			end = timeline->data[i - 1];
+		break;
+	}
+
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+
+	result = (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
+	if (result < 0)
+		return 0;
+	return result;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710e..4f25773d681 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -428,6 +429,39 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter autoanalyze_count;
 } PgStat_StatTabEntry;
 
+/*
+ * The elements of an LSNTimeline. Each LSNTime represents one or more time,
+ * LSN pairs. The LSN is typically the insert LSN recorded at the time. Members
+ * is the number of logical members -- each a time, LSN pair -- represented in
+ * the LSNTime.
+ */
+typedef struct LSNTime
+{
+	TimestampTz time;
+	XLogRecPtr	lsn;
+	uint64		members;
+} LSNTime;
+
+/*
+ * A timeline consists of LSNTimes from most to least recent. Each element of
+ * the array in the timeline may represent 2^array index logical members --
+ * meaning that each element's capacity is twice that of the preceding element.
+ * This gives more recent times greater precision than less recent ones. An
+ * array of size 64 should provide sufficient capacity without accounting for
+ * what to do when all elements of the array are at capacity.
+ *
+ * When LSNTimes are inserted into the timeline, they are absorbed into the
+ * first array element with spare capacity -- with the new combined element
+ * having the lesser of the two values. The timeline's length is the highest
+ * array index representing one or more logical members. Use the timeline for
+ * LSN <-> time conversion using linear interpolation.
+ */
+typedef struct LSNTimeline
+{
+	int			length;
+	LSNTime		data[64];
+} LSNTimeline;
+
 typedef struct PgStat_WalStats
 {
 	PgStat_Counter wal_records;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 90b37b919c2..32057181277 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1525,6 +1525,8 @@ LogicalTapeSet
 LsnReadQueue
 LsnReadQueueNextFun
 LsnReadQueueNextStatus
+LSNTime
+LSNTimeline
 LtreeGistOptions
 LtreeSignature
 MAGIC
-- 
2.37.2

v2-0003-Add-LSNTimeline-to-PgStat_WalStats.patchtext/x-patch; charset=US-ASCII; name=v2-0003-Add-LSNTimeline-to-PgStat_WalStats.patchDownload
From d07f54554b74dd3984c16b3d81b723ed735e80af Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:41:47 -0500
Subject: [PATCH v2 3/5] Add LSNTimeline to PgStat_WalStats

Add a globally maintained instance of the new LSNTimeline to
PgStat_WalStats and add utility functions for maintaining and accessing
it. This commit does not insert new values to the timeline or use the
helpers to access it.
---
 src/backend/utils/activity/pgstat_wal.c | 48 +++++++++++++++++++++----
 src/include/pgstat.h                    |  6 ++++
 2 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e8d9660f82e..274ed7a24cd 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -36,10 +36,10 @@ static WalUsage prevWalUsage;
 
 
 static void lsntime_absorb(LSNTime *a, const LSNTime *b);
-void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
+static void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
 
-XLogRecPtr estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
-TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
+static XLogRecPtr estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
+static TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
 
 /*
  * Calculate how much WAL usage counters have increased and update
@@ -222,7 +222,7 @@ lsntime_absorb(LSNTime *a, const LSNTime *b)
  * Insert a new LSNTime into the LSNTimeline in the first element with spare
  * capacity.
  */
-void
+static void
 lsntime_insert(LSNTimeline *timeline, TimestampTz time,
 			   XLogRecPtr lsn)
 {
@@ -284,7 +284,7 @@ lsntime_insert(LSNTimeline *timeline, TimestampTz time,
  * Translate time to a LSN using the provided timeline. The timeline will not
  * be modified.
  */
-XLogRecPtr
+static XLogRecPtr
 estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time)
 {
 	TimestampTz time_elapsed;
@@ -336,7 +336,7 @@ estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time)
  * Translate lsn to a time using the provided timeline. The timeline will not
  * be modified.
  */
-TimestampTz
+static TimestampTz
 estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn)
 {
 	TimestampTz time_elapsed;
@@ -383,3 +383,39 @@ estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn)
 		return 0;
 	return result;
 }
+
+XLogRecPtr
+pgstat_wal_estimate_lsn_at_time(TimestampTz time)
+{
+	XLogRecPtr	result;
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_lsn_at_time(&stats_shmem->stats.timeline, time);
+	LWLockRelease(&stats_shmem->lock);
+
+	return result;
+}
+
+TimestampTz
+pgstat_wal_estimate_time_at_lsn(XLogRecPtr lsn)
+{
+	TimestampTz result;
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_time_at_lsn(&stats_shmem->stats.timeline, lsn);
+	LWLockRelease(&stats_shmem->lock);
+
+	return result;
+}
+
+void
+pgstat_wal_update_lsntimeline(TimestampTz time, XLogRecPtr lsn)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	lsntime_insert(&stats_shmem->stats.timeline, time, lsn);
+	LWLockRelease(&stats_shmem->lock);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4f25773d681..a9bf8301a34 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -472,6 +472,7 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_sync;
 	PgStat_Counter wal_write_time;
 	PgStat_Counter wal_sync_time;
+	LSNTimeline timeline;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -754,6 +755,11 @@ extern void pgstat_execute_transactional_drops(int ndrops, struct xl_xact_stats_
 extern void pgstat_report_wal(bool force);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 
+/* Helpers for maintaining the LSNTimeline */
+extern XLogRecPtr pgstat_wal_estimate_lsn_at_time(TimestampTz time);
+extern TimestampTz pgstat_wal_estimate_time_at_lsn(XLogRecPtr lsn);
+extern void pgstat_wal_update_lsntimeline(TimestampTz time, XLogRecPtr lsn);
+
 
 /*
  * Variables in pgstat.c
-- 
2.37.2

v2-0005-Add-time-LSN-translation-functions-to-pageinspect.patchtext/x-patch; charset=US-ASCII; name=v2-0005-Add-time-LSN-translation-functions-to-pageinspect.patchDownload
From 768829e8e64d7296a86ce55467c5e2adf1e2b3f7 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 15:46:16 -0500
Subject: [PATCH v2 5/5] Add time <-> LSN translation functions to pageinspect

Previous commits added a global LSNTimeline, maintained by background
writer, that allows approximate translations between time and LSNs. This
can be useful for approximating the time of last modification of a page
or estimating the LSN consumption rate to moderate maintenance processes
and balance system resource utilization. This commit adds user-facing
access to the conversion capabilities of the timeline.

ci-os-only:
---
 .../pageinspect/pageinspect--1.10--1.11.sql   | 10 +++++
 contrib/pageinspect/rawpage.c                 | 26 +++++++++++
 doc/src/sgml/pageinspect.sgml                 | 45 +++++++++++++++++++
 3 files changed, 81 insertions(+)

diff --git a/contrib/pageinspect/pageinspect--1.10--1.11.sql b/contrib/pageinspect/pageinspect--1.10--1.11.sql
index 8fa5e105bc4..72b16d5f84d 100644
--- a/contrib/pageinspect/pageinspect--1.10--1.11.sql
+++ b/contrib/pageinspect/pageinspect--1.10--1.11.sql
@@ -26,3 +26,13 @@ ALTER FUNCTION hash_bitmap_info(regclass, int8) PARALLEL RESTRICTED;
 -- Likewise for gist_page_items.
 ALTER FUNCTION brin_page_items(bytea, regclass) PARALLEL RESTRICTED;
 ALTER FUNCTION gist_page_items(bytea, regclass) PARALLEL RESTRICTED;
+
+CREATE FUNCTION estimate_lsn_at_time(IN input_time timestamp with time zone,
+    OUT lsn pg_lsn)
+AS 'MODULE_PATHNAME', 'estimate_lsn_at_time'
+LANGUAGE C STRICT PARALLEL SAFE;
+
+CREATE FUNCTION estimate_time_at_lsn(IN lsn pg_lsn,
+    OUT result timestamp with time zone)
+AS 'MODULE_PATHNAME', 'estimate_time_at_lsn'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
index 2800ebd62f5..514d8092838 100644
--- a/contrib/pageinspect/rawpage.c
+++ b/contrib/pageinspect/rawpage.c
@@ -22,6 +22,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pageinspect.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/checksum.h"
 #include "utils/builtins.h"
@@ -335,6 +336,9 @@ page_header(PG_FUNCTION_ARGS)
 PG_FUNCTION_INFO_V1(page_checksum_1_9);
 PG_FUNCTION_INFO_V1(page_checksum);
 
+PG_FUNCTION_INFO_V1(estimate_lsn_at_time);
+PG_FUNCTION_INFO_V1(estimate_time_at_lsn);
+
 static Datum
 page_checksum_internal(PG_FUNCTION_ARGS, enum pageinspect_version ext_version)
 {
@@ -374,3 +378,25 @@ page_checksum(PG_FUNCTION_ARGS)
 {
 	return page_checksum_internal(fcinfo, PAGEINSPECT_V1_8);
 }
+
+Datum
+estimate_time_at_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn = PG_GETARG_LSN(0);
+	TimestampTz result;
+
+	result = pgstat_wal_estimate_time_at_lsn(lsn);
+
+	PG_RETURN_TIMESTAMPTZ(result);
+}
+
+Datum
+estimate_lsn_at_time(PG_FUNCTION_ARGS)
+{
+	TimestampTz time = PG_GETARG_TIMESTAMPTZ(0);
+	XLogRecPtr	result;
+
+	result = pgstat_wal_estimate_lsn_at_time(time);
+
+	PG_RETURN_LSN(result);
+}
diff --git a/doc/src/sgml/pageinspect.sgml b/doc/src/sgml/pageinspect.sgml
index 27e0598f74c..cfd60bfd9aa 100644
--- a/doc/src/sgml/pageinspect.sgml
+++ b/doc/src/sgml/pageinspect.sgml
@@ -153,6 +153,51 @@ test=# SELECT fsm_page_contents(get_raw_page('pg_class', 'fsm', 0));
      </para>
     </listitem>
    </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>estimate_lsn_at_time(lsn timestamptz) returns pg_lsn</function>
+     <indexterm>
+      <primary>estimate_lsn_at_time</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>estimate_lsn_at_time</function> estimates the LSN at the provided time.
+     </para>
+
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term>
+     <function>estimate_time_at_lsn(lsn pg_lsn) returns timestamp with timezone</function>
+     <indexterm>
+      <primary>estimate_time_at_lsn</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <function>estimate_time_at_lsn</function> estimates the time at provided LSN.
+     </para>
+
+     <para>
+      One useful application is approximating the last modification time of a
+      given page in a relation. For example, when combined with pageinspect
+      functions returning a page LSN:
+<screen>
+test=# SELECT estimate_time_at_lsn((SELECT lsn from page_header(get_raw_page('pg_class', 0))));
+     estimate_time_at_lsn
+-------------------------------
+ 2023-12-22 08:01:02.393598-05
+</screen>
+     </para>
+
+    </listitem>
+   </varlistentry>
+
   </variablelist>
  </sect2>
 
-- 
2.37.2

v2-0001-Record-LSN-at-postmaster-startup.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Record-LSN-at-postmaster-startup.patchDownload
From fab7205db15a23e15c157e52704690d895e8cd87 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 5 Dec 2023 07:29:39 -0500
Subject: [PATCH v2 1/5] Record LSN at postmaster startup

The insert_lsn at postmaster startup can be used along with PgStartTime
as seed values for a timeline mapping LSNs to time. Future commits will
add such a structure for LSN <-> time conversions. A start LSN allows
for such conversions before even inserting a value into the timeline.
The current time and current insert LSN can be used along with
PgStartTime and PgStartLSN.

This is WIP, as I'm not sure if I did this in the right place.
---
 src/backend/access/transam/xlog.c   | 2 ++
 src/backend/postmaster/postmaster.c | 1 +
 src/include/utils/builtins.h        | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a23..b0f34f3b7a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -146,6 +146,8 @@ bool		XLOG_DEBUG = false;
 
 int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
 
+XLogRecPtr	PgStartLSN = InvalidXLogRecPtr;
+
 /*
  * Number of WAL insertion locks to use. A higher value allows more insertions
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index feb471dd1df..951114342a5 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1448,6 +1448,7 @@ PostmasterMain(int argc, char *argv[])
 	 * Remember postmaster startup time
 	 */
 	PgStartTime = GetCurrentTimestamp();
+	PgStartLSN = GetXLogInsertRecPtr();
 
 	/*
 	 * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 359c570f23e..16a7a058bc7 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -17,6 +17,7 @@
 #include "fmgr.h"
 #include "nodes/nodes.h"
 #include "utils/fmgrprotos.h"
+#include "access/xlogdefs.h"
 
 /* Sign + the most decimal digits an 8-byte number could have */
 #define MAXINT8LEN 20
@@ -85,6 +86,8 @@ extern void generate_operator_clause(fmStringInfo buf,
 									 Oid opoid,
 									 const char *rightop, Oid rightoptype);
 
+extern PGDLLIMPORT XLogRecPtr PgStartLSN;
+
 /* varchar.c */
 extern int	bpchartruelen(char *s, int len);
 
-- 
2.37.2

v2-0004-Bgwriter-maintains-global-LSNTimeline.patchtext/x-patch; charset=US-ASCII; name=v2-0004-Bgwriter-maintains-global-LSNTimeline.patchDownload
From 3da2b81d052f39580cd8336853668db6233d2243 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:32:40 -0500
Subject: [PATCH v2 4/5] Bgwriter maintains global LSNTimeline

Insert new LSN, time pairs to the global LSNTimeline stored in
PgStat_WalStats in the background writer's main loop. This ensures that
new values are added to the timeline in a regular manner.
---
 src/backend/postmaster/bgwriter.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index d7d6cc0cd7b..ec6828aa2a5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -277,6 +277,7 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_lsn = GetLastImportantRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
@@ -289,10 +290,11 @@ BackgroundWriterMain(void)
 			 * the end of the record.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn <= GetLastImportantRecPtr())
+				last_snapshot_lsn <= current_lsn)
 			{
 				last_snapshot_lsn = LogStandbySnapshot();
 				last_snapshot_ts = now;
+				pgstat_wal_update_lsntimeline(now, current_lsn);
 			}
 		}
 
-- 
2.37.2

#3Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Melanie Plageman (#2)
Re: Add LSN <-> time conversion functionality

Hi,

I took a look at this today, to try to understand the purpose and how it
works. Let me share some initial thoughts and questions I have. Some of
this may be wrong/missing the point, so apologies for that.

The goal seems worthwhile in general - the way I understand it, the
patch aims to provide tracking of WAL "velocity", i.e. how much WAL was
generated over time. Which we now don't have, as we only maintain simple
cumulative stats/counters. And then uses it to estimate timestamp for a
given LSN, and vice versa, because that's what the pruning patch needs.

When I first read this, I immediately started wondering if this might
use the commit timestamp stuff we already have. Because for each commit
we already store the LSN and commit timestamp, right? But I'm not sure
that would be a good match - the commit_ts serves a very special purpose
of mapping XID => (LSN, timestamp), I don't see how to make it work for
(LSN=>timestmap) and (timestamp=>LSN) very easily.

As for the inner workings of the patch, my understanding is this:

- "LSNTimeline" consists of "LSNTime" entries representing (LSN,ts)
points, but those points are really "buckets" that grow larger and
larger for older periods of time.

- The entries are being added from bgwriter, i.e. on each loop we add
the current (LSN, timestamp) into the timeline.

- We then estimate LSN/timestamp using the data stored in LSNTimeline
(either LSN => timestamp, or the opposite direction).

Some comments in arbitrary order:

- AFAIK each entry represent an interval of time, and the next (older)
interval is twice as long, right? So the first interval is 1 second,
then 2 seconds, 4 seconds, 8 seconds, ...

- But I don't understand how the LSNTimeline entries are "aging" and get
less accurate, while the "current" bucket is short. lsntime_insert()
seems to simply move to the next entry, but doesn't that mean we insert
the entries into larger and larger buckets?

- The comments never really spell what amount of time the entries cover
/ how granular it is. My understanding is it's simply measured in number
of entries added, which is assumed to be constant and drive by
bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?

Is there some case where bgwriter would just loop without sleeping,
filling the timeline much faster? (I can't think of any, but ...)

- The LSNTimeline comment claims an array of size 64 is large enough to
not need to care about filling it, but maybe it should briefly explain
why we can never fill it (I guess 2^64 is just too many).

- I don't quite understand why 0005 adds the functions to pageinspect.
This has nothing to do with pages, right?

- Not sure why we need 0001. Just so that the "estimate" functions in
0002 have a convenient "start" point? Surely we could look at the
current LSNTimeline data and use the oldest value, or (if there's no
data) use the current timestamp/LSN?

- I wonder what happens if we lose the data - we know that if people
reset statistics for whatever reason (or just lose them because of a
crash, or because they're on a replica), bad things happen to
autovacuum. What's the (expected) impact on pruning?

- What about a SRF function that outputs the whole LSNTimeline? Would be
useful for debugging / development, I think. (Just a suggestion).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#3)
5 attachment(s)
Re: Add LSN <-> time conversion functionality

Thanks so much for reviewing!

On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

When I first read this, I immediately started wondering if this might
use the commit timestamp stuff we already have. Because for each commit
we already store the LSN and commit timestamp, right? But I'm not sure
that would be a good match - the commit_ts serves a very special purpose
of mapping XID => (LSN, timestamp), I don't see how to make it work for
(LSN=>timestmap) and (timestamp=>LSN) very easily.

I took a look at the code in commit_ts.c, and I really don't see a way
of reusing any of this commit<->timestamp infrastructure for
timestamp<->LSN mappings.

As for the inner workings of the patch, my understanding is this:

- "LSNTimeline" consists of "LSNTime" entries representing (LSN,ts)
points, but those points are really "buckets" that grow larger and
larger for older periods of time.

Yes, they are buckets in the sense that they represent multiple values
but each contains a single LSNTime value which is the minimum of all
the LSNTimes we "merged" into that single array element. In order to
represent a range of time, you need to use two array elements. The
linear interpolation from time <-> LSN is all done with two elements.

- AFAIK each entry represent an interval of time, and the next (older)
interval is twice as long, right? So the first interval is 1 second,
then 2 seconds, 4 seconds, 8 seconds, ...

- But I don't understand how the LSNTimeline entries are "aging" and get
less accurate, while the "current" bucket is short. lsntime_insert()
seems to simply move to the next entry, but doesn't that mean we insert
the entries into larger and larger buckets?

Because the earlier array elements can represent fewer logical members
than later ones and because elements are merged into the next element
when space runs out, later array elements will contain older data and
more of it, so those "ranges" will be larger. But, after thinking
about it and also reading your feedback, I realized my algorithm was
silly because it starts merging logical members before it has even
used the whole array.

The attached v3 has a new algorithm. Now, LSNTimes are added from the
end of the array forward until all array elements have at least one
logical member (array length == volume). Once array length == volume,
new LSNTimes will result in merging logical members in existing
elements. We want to merge older members because those can be less
precise. So, the number of logical members per array element will
always monotonically increase starting from the beginning of the array
(which contains the most recent data) and going to the end. We want to
use all the available space in the array. That means that each LSNTime
insertion will always result in a single merge. We want the timeline
to be inclusive of the oldest data, so merging means taking the
smaller value of two LSNTime values. I had to pick a rule for choosing
which elements to merge. So, I choose the merge target as the oldest
element whose logical membership is < 2x its predecessor. I merge the
merge target's predecessor into the merge target. Then I move all of
the intervening elements down 1. Then I insert the new LSNTime at
index 0.

- The comments never really spell what amount of time the entries cover
/ how granular it is. My understanding is it's simply measured in number
of entries added, which is assumed to be constant and drive by
bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?

Is there some case where bgwriter would just loop without sleeping,
filling the timeline much faster? (I can't think of any, but ...)

bgwriter will wake up when there are buffers to flush, which is likely
correlated with there being new LSNs. So, actually it seems like it
might work well to rely on only filling the timeline when there are
things for bgwriter to do.

- The LSNTimeline comment claims an array of size 64 is large enough to
not need to care about filling it, but maybe it should briefly explain
why we can never fill it (I guess 2^64 is just too many).

The new structure fits a different number of members. I have yet to
calculate that number, but it should be explained in the comments once
I do.

For example, if we made an LSNTimeline with volume 4, once every
element had one LSNTime and we needed to start merging, the following
is how many logical members each element would have after each of four
merges:
1111
1112
1122
1114
1124
So, if we store the number of members as an unsigned 64-bit int and we
have an LSNTimeline with volume 64, what is the maximum number of
members can we store if we hold all of the invariants described in my
algorithm above (we only merge when required, every element holds < 2x
the number of logical members as its predecessor, we do exactly one
merge every insertion [when required], membership must monotonically
increase [choose the oldest element meeting the criteria when deciding
what to merge])?

- I don't quite understand why 0005 adds the functions to pageinspect.
This has nothing to do with pages, right?

You're right. I just couldn't think of a good place to put the
functions. In version 3, I just put the SQL functions in pgstat_wal.c
and made them generally available (i.e. not in a contrib module). I
haven't added docs back yet. But perhaps a section near the docs
describing pg_xact_commit_timestamp() [1]https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO-COMMIT-TIMESTAMP? I wasn't sure if I should
put the SQL function source code in pgstatfuncs.c -- I kind of prefer
it in pgstat_wal.c but there are no other SQL functions there.

- Not sure why we need 0001. Just so that the "estimate" functions in
0002 have a convenient "start" point? Surely we could look at the
current LSNTimeline data and use the oldest value, or (if there's no
data) use the current timestamp/LSN?

When there are 0 or 1 entries in the timeline you'll get an answer
that could be very off if you just return the current timestamp or
LSN. I guess that is okay?

- I wonder what happens if we lose the data - we know that if people
reset statistics for whatever reason (or just lose them because of a
crash, or because they're on a replica), bad things happen to
autovacuum. What's the (expected) impact on pruning?

This is an important question. Because stats aren't crashsafe, we
could return very inaccurate numbers for some time/LSN values if we
crash. I don't actually know what we could do about that. When I use
the LSNTimeline for the freeze heuristic it is less of an issue
because the freeze heuristic has a fallback strategy when there aren't
enough stats to do its calculations. But other users of this
LSNTimeline will simply get highly inaccurate results (I think?). Is
there anything we could do about this? It seems bad.

Andres had brought up something at some point about, what if the
database is simply turned off for awhile and then turned back on. Even
if you cleanly shut down, will there be "gaps" in the timeline? I
think that could be okay, but it is something to think about.

- What about a SRF function that outputs the whole LSNTimeline? Would be
useful for debugging / development, I think. (Just a suggestion).

Good idea! I've added this. Though, maybe there was a simpler way to
implement than I did.

Just a note, all of my comments could use a lot of work, but I want to
get consensus on the algorithm before I make sure and write about it
in a perfect way.

- Melanie

[1]: https://www.postgresql.org/docs/devel/functions-info.html#FUNCTIONS-INFO-COMMIT-TIMESTAMP

Attachments:

v3-0005-Add-time-LSN-translation-functions.patchtext/x-patch; charset=US-ASCII; name=v3-0005-Add-time-LSN-translation-functions.patchDownload
From cf9e6f507bc9781bf79e8c39766c8e84209d2ada Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 21 Feb 2024 20:06:29 -0500
Subject: [PATCH v3 5/5] Add time <-> LSN translation functions

Previous commits added a global LSNTimeline, maintained by background
writer, that allows approximate translations between time and LSNs.

Add SQL-callable functions to convert from LSN to time and back and a
SQL-callable function returning the entire LSNTimeline.

This could be useful in combination with SQL-callable functions
accessing a page LSN to approximate the time of last modification of a
page or estimating the LSN consumption rate to moderate maintenance
processes and balance system resource utilization.
---
 src/backend/utils/activity/pgstat_wal.c | 58 +++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat         | 22 ++++++++++
 2 files changed, 80 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index df4c91ee3cf..27f2b23cd88 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -18,6 +18,7 @@
 #include "postgres.h"
 
 #include "access/xlog.h"
+#include "funcapi.h"
 #include "utils/pgstat_internal.h"
 #include "executor/instrument.h"
 #include "utils/builtins.h"
@@ -418,3 +419,60 @@ pgstat_wal_update_lsntimeline(TimestampTz time, XLogRecPtr lsn)
 	lsntime_insert(&stats_shmem->stats.timeline, time, lsn);
 	LWLockRelease(&stats_shmem->lock);
 }
+
+PG_FUNCTION_INFO_V1(pg_estimate_lsn_at_time);
+PG_FUNCTION_INFO_V1(pg_estimate_time_at_lsn);
+PG_FUNCTION_INFO_V1(pg_lsntimeline);
+
+Datum
+pg_estimate_time_at_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn = PG_GETARG_LSN(0);
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+	TimestampTz result;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_time_at_lsn(&stats_shmem->stats.timeline, lsn);
+	LWLockRelease(&stats_shmem->lock);
+
+	PG_RETURN_TIMESTAMPTZ(result);
+}
+
+Datum
+pg_estimate_lsn_at_time(PG_FUNCTION_ARGS)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+	TimestampTz time = PG_GETARG_TIMESTAMPTZ(0);
+	XLogRecPtr	result;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_lsn_at_time(&stats_shmem->stats.timeline, time);
+	LWLockRelease(&stats_shmem->lock);
+
+	PG_RETURN_LSN(result);
+}
+
+Datum
+pg_lsntimeline(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	LSNTimeline *timeline;
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	timeline = &stats_shmem->stats.timeline;
+	for (int i = LSNTIMELINE_VOLUME - timeline->length; i < LSNTIMELINE_VOLUME; i++)
+	{
+		Datum		values[2] = {0};
+		bool		nulls[2] = {0};
+
+		values[0] = timeline->data[i].time;
+		values[1] = timeline->data[i].lsn;
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+	LWLockRelease(&stats_shmem->lock);
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 9c120fc2b7f..e69cf9c2437 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6326,6 +6326,28 @@
   prorettype => 'timestamptz', proargtypes => 'xid',
   prosrc => 'pg_xact_commit_timestamp' },
 
+{ oid => '9997',
+  descr => 'get approximate LSN at a particular point in time',
+  proname => 'pg_estimate_lsn_at_time', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => 'timestamptz',
+  prosrc => 'pg_estimate_lsn_at_time' },
+
+{ oid => '9996',
+  descr => 'get approximate time at a particular LSN',
+  proname => 'pg_estimate_time_at_lsn', provolatile => 'v',
+  prorettype => 'timestamptz', proargtypes => 'pg_lsn',
+  prosrc => 'pg_estimate_time_at_lsn' },
+
+{ oid => '9994',
+  descr => 'print the LSN timeline',
+  proname => 'pg_lsntimeline', prorows => '64',
+  proretset => 't', provolatile => 'v', proparallel => 's',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,pg_lsn}',
+  proargmodes => '{o,o}',
+  proargnames => '{time, lsn}',
+  prosrc => 'pg_lsntimeline' },
+
 { oid => '6168',
   descr => 'get commit timestamp and replication origin of a transaction',
   proname => 'pg_xact_commit_timestamp_origin', provolatile => 'v',
-- 
2.37.2

v3-0003-Add-LSNTimeline-to-PgStat_WalStats.patchtext/x-patch; charset=US-ASCII; name=v3-0003-Add-LSNTimeline-to-PgStat_WalStats.patchDownload
From 6fd317a319de144a7a999fc6230268521e72a36f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 21 Feb 2024 20:28:27 -0500
Subject: [PATCH v3 3/5] Add LSNTimeline to PgStat_WalStats

Add a globally maintained instance of the new LSNTimeline to
PgStat_WalStats and a utility function to insert new values.
---
 src/backend/utils/activity/pgstat_wal.c | 10 ++++++++++
 src/include/pgstat.h                    |  4 ++++
 2 files changed, 14 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 96e84319f6f..df4c91ee3cf 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -408,3 +408,13 @@ stop:
 	result = (double) (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
 	return Max(result, 0);
 }
+
+void
+pgstat_wal_update_lsntimeline(TimestampTz time, XLogRecPtr lsn)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	lsntime_insert(&stats_shmem->stats.timeline, time, lsn);
+	LWLockRelease(&stats_shmem->lock);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1926ddb00ed..8a63f56fdd3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -482,6 +482,7 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_sync;
 	PgStat_Counter wal_write_time;
 	PgStat_Counter wal_sync_time;
+	LSNTimeline timeline;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -764,6 +765,9 @@ extern void pgstat_execute_transactional_drops(int ndrops, struct xl_xact_stats_
 extern void pgstat_report_wal(bool force);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 
+/* Helpers for maintaining the LSNTimeline */
+extern void pgstat_wal_update_lsntimeline(TimestampTz time, XLogRecPtr lsn);
+
 
 /*
  * Variables in pgstat.c
-- 
2.37.2

v3-0001-Record-LSN-at-postmaster-startup.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Record-LSN-at-postmaster-startup.patchDownload
From 1348109617ce772835336ea8e8a9781407f6060d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 5 Dec 2023 07:29:39 -0500
Subject: [PATCH v3 1/5] Record LSN at postmaster startup

The insert_lsn at postmaster startup can be used along with PgStartTime
as seed values for a timeline mapping LSNs to time. Future commits will
add such a structure for LSN <-> time conversions. A start LSN allows
for such conversions before even inserting a value into the timeline.
The current time and current insert LSN can be used along with
PgStartTime and PgStartLSN.

This is WIP, as I'm not sure if I did this in the right place.
---
 src/backend/access/transam/xlog.c   | 2 ++
 src/backend/postmaster/postmaster.c | 1 +
 src/include/utils/builtins.h        | 3 +++
 3 files changed, 6 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1162d55bff..3fea9f4c31f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -146,6 +146,8 @@ bool		XLOG_DEBUG = false;
 
 int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
 
+XLogRecPtr	PgStartLSN = InvalidXLogRecPtr;
+
 /*
  * Number of WAL insertion locks to use. A higher value allows more insertions
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index df945a5ac4d..9e5cad60549 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1440,6 +1440,7 @@ PostmasterMain(int argc, char *argv[])
 	 * Remember postmaster startup time
 	 */
 	PgStartTime = GetCurrentTimestamp();
+	PgStartLSN = GetXLogInsertRecPtr();
 
 	/*
 	 * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 359c570f23e..16a7a058bc7 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -17,6 +17,7 @@
 #include "fmgr.h"
 #include "nodes/nodes.h"
 #include "utils/fmgrprotos.h"
+#include "access/xlogdefs.h"
 
 /* Sign + the most decimal digits an 8-byte number could have */
 #define MAXINT8LEN 20
@@ -85,6 +86,8 @@ extern void generate_operator_clause(fmStringInfo buf,
 									 Oid opoid,
 									 const char *rightop, Oid rightoptype);
 
+extern PGDLLIMPORT XLogRecPtr PgStartLSN;
+
 /* varchar.c */
 extern int	bpchartruelen(char *s, int len);
 
-- 
2.37.2

v3-0004-Bgwriter-maintains-global-LSNTimeline.patchtext/x-patch; charset=US-ASCII; name=v3-0004-Bgwriter-maintains-global-LSNTimeline.patchDownload
From 6204555c6d59035cba9f850c676d0b1f530c6f43 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:32:40 -0500
Subject: [PATCH v3 4/5] Bgwriter maintains global LSNTimeline

Insert new LSN, time pairs to the global LSNTimeline stored in
PgStat_WalStats in the background writer's main loop. This ensures that
new values are added to the timeline in a regular manner.
---
 src/backend/postmaster/bgwriter.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 6364b16261f..4b4d5db60bb 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -272,6 +272,7 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_lsn = GetLastImportantRecPtr();
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
@@ -284,10 +285,11 @@ BackgroundWriterMain(void)
 			 * the end of the record.
 			 */
 			if (now >= timeout &&
-				last_snapshot_lsn <= GetLastImportantRecPtr())
+				last_snapshot_lsn <= current_lsn)
 			{
 				last_snapshot_lsn = LogStandbySnapshot();
 				last_snapshot_ts = now;
+				pgstat_wal_update_lsntimeline(now, current_lsn);
 			}
 		}
 
-- 
2.37.2

v3-0002-Add-LSNTimeline-for-converting-LSN-time.patchtext/x-patch; charset=US-ASCII; name=v3-0002-Add-LSNTimeline-for-converting-LSN-time.patchDownload
From 800b0610f430f965e9216a374afe638bbec7bb6f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:40:27 -0500
Subject: [PATCH v3 2/5] Add LSNTimeline for converting LSN <-> time

Add a new structure, LSNTimeline, consisting of LSNTimes -- each an LSN,
time pair. Each LSNTime can represent multiple logical LSN, time pairs,
referred to as members. LSN <-> time conversions can be done using
linear interpolation with two LSNTimes on the LSNTimeline.

This commit does not add a global instance of LSNTimeline. It adds the
structures and functions needed to maintain and access such a timeline.
---
 src/backend/utils/activity/pgstat_wal.c | 224 ++++++++++++++++++++++++
 src/include/pgstat.h                    |  44 +++++
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 270 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 1a3c0e1a669..96e84319f6f 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,12 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "utils/pgstat_internal.h"
 #include "executor/instrument.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+#include "utils/pg_lsn.h"
 
 
 PgStat_PendingWalStats PendingWalStats = {0};
@@ -32,6 +36,12 @@ PgStat_PendingWalStats PendingWalStats = {0};
 static WalUsage prevWalUsage;
 
 
+static void lsntime_merge(LSNTime *target, LSNTime *src);
+static void lsntime_insert(LSNTimeline *timeline, TimestampTz time, XLogRecPtr lsn);
+
+XLogRecPtr	estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time);
+TimestampTz estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn);
+
 /*
  * Calculate how much WAL usage counters have increased and update
  * shared WAL and IO statistics.
@@ -184,3 +194,217 @@ pgstat_wal_snapshot_cb(void)
 		   sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Set *target to be the earlier of *target or *src.
+ */
+static void
+lsntime_merge(LSNTime *target, LSNTime *src)
+{
+	LSNTime		result;
+	uint64		new_members = target->members + src->members;
+
+	if (target->time < src->time)
+		result = *target;
+	else if (src->time < target->time)
+		result = *src;
+	else if (target->lsn < src->lsn)
+		result = *target;
+	else if (src->lsn < target->lsn)
+		result = *src;
+	else
+		result = *target;
+
+	*target = result;
+	target->members = new_members;
+	src->members = 1;
+}
+
+static int
+lsntime_merge_target(LSNTimeline *timeline)
+{
+	/* Don't merge if free space available */
+	Assert(timeline->length == LSNTIMELINE_VOLUME);
+
+	for (int i = timeline->length; i-- > 0;)
+	{
+		/*
+		 * An array element can represent up to twice the number of members
+		 * represented by the preceding array element.
+		 */
+		if (timeline->data[i].members < (2 * timeline->data[i - 1].members))
+			return i;
+	}
+
+	/* Should not be reachable or we are out of space */
+	Assert(false);
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeline in the first available element,
+ * or, if there are no empty elements, insert it into the element at index 0,
+ * merge the logical members of two old buckets and move the intervening
+ * elements down by one.
+ */
+void
+lsntime_insert(LSNTimeline *timeline, TimestampTz time,
+			   XLogRecPtr lsn)
+{
+	int			merge_target;
+	LSNTime		entrant = {.lsn = lsn,.time = time,.members = 1};
+
+	if (timeline->length < LSNTIMELINE_VOLUME)
+	{
+		/*
+		 * The new entry should exceed the most recent entry to ensure time
+		 * moves forward on the timeline.
+		 */
+		Assert(timeline->length == 0 ||
+			   (lsn >= timeline->data[LSNTIMELINE_VOLUME - timeline->length].lsn &&
+				time >= timeline->data[LSNTIMELINE_VOLUME - timeline->length].time));
+
+		/*
+		 * If there are unfilled elements in the timeline, then insert the
+		 * passed-in LSNTime into the tail of the array.
+		 */
+		timeline->length++;
+		timeline->data[LSNTIMELINE_VOLUME - timeline->length] = entrant;
+		return;
+	}
+
+	/*
+	 * If all elements in the timeline represent at least one member, merge
+	 * the oldest element whose membership is < 2x its predecessor with its
+	 * preceding member. Then shift all elements preceding these two elements
+	 * down by one and insert the passed-in LSNTime at array index 0.
+	 */
+	merge_target = lsntime_merge_target(timeline);
+	Assert(merge_target >= 0 && merge_target < timeline->length);
+	lsntime_merge(&timeline->data[merge_target], &timeline->data[merge_target - 1]);
+	memmove(&timeline->data[1], &timeline->data[0], sizeof(LSNTime) * merge_target - 1);
+	timeline->data[0] = entrant;
+}
+
+/*
+ * Translate time to a LSN using the provided timeline. The timeline will not
+ * be modified.
+ */
+XLogRecPtr
+estimate_lsn_at_time(const LSNTimeline *timeline, TimestampTz time)
+{
+	XLogRecPtr	result;
+	int64		time_elapsed,
+				lsns_elapsed;
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the provided time is before DB startup, the best we can do is return
+	 * the start LSN.
+	 */
+	if (time < start.time)
+		return start.lsn;
+
+	/*
+	 * If the provided time is after now, the current LSN is our best
+	 * estimate.
+	 */
+	if (time >= end.time)
+		return end.lsn;
+
+	/*
+	 * Loop through the timeline. Stop at the first LSNTime earlier than our
+	 * target time. This LSNTime will be our interpolation start point. If
+	 * there's an LSNTime later than that, then that will be our interpolation
+	 * end point.
+	 */
+	for (int i = LSNTIMELINE_VOLUME - timeline->length; i < LSNTIMELINE_VOLUME; i++)
+	{
+		if (timeline->data[i].time > time)
+			continue;
+
+		start = timeline->data[i];
+		if (i > 0)
+			end = timeline->data[i - 1];
+		goto stop;
+	}
+
+	/*
+	 * If we exhausted the timeline, then use its earliest LSNTime as our
+	 * interpolation end point.
+	 */
+	if (timeline->length > 0)
+		end = timeline->data[timeline->length - 1];
+
+stop:
+	Assert(end.time > start.time);
+	Assert(end.lsn > start.lsn);
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+	result = (double) (time - start.time) / time_elapsed * lsns_elapsed + start.lsn;
+	return Max(result, 0);
+}
+
+/*
+ * Translate lsn to a time using the provided timeline. The timeline will not
+ * be modified.
+ */
+TimestampTz
+estimate_time_at_lsn(const LSNTimeline *timeline, XLogRecPtr lsn)
+{
+	int64		time_elapsed,
+				lsns_elapsed;
+	TimestampTz result;
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the LSN is before DB startup, the best we can do is return that
+	 * time.
+	 */
+	if (lsn <= start.lsn)
+		return start.time;
+
+	/*
+	 * If the target LSN is after the current insert LSN, the current time is
+	 * our best estimate.
+	 */
+	if (lsn >= end.lsn)
+		return end.time;
+
+	/*
+	 * Loop through the timeline. Stop at the first LSNTime earlier than our
+	 * target LSN. This LSNTime will be our interpolation start point. If
+	 * there's an LSNTime later than that, then that will be our interpolation
+	 * end point.
+	 */
+	for (int i = LSNTIMELINE_VOLUME - timeline->length; i < LSNTIMELINE_VOLUME; i++)
+	{
+		if (timeline->data[i].lsn > lsn)
+			continue;
+
+		start = timeline->data[i];
+		if (i > 0)
+			end = timeline->data[i - 1];
+		goto stop;
+	}
+
+	/*
+	 * If we exhausted the timeline, then use its earliest LSNTime as our
+	 * interpolation end point.
+	 */
+	if (timeline->length > 0)
+		end = timeline->data[timeline->length - 1];
+
+stop:
+	Assert(end.time > start.time);
+	Assert(end.lsn > start.lsn);
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+	result = (double) (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
+	return Max(result, 0);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710e..1926ddb00ed 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -428,6 +429,49 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter autoanalyze_count;
 } PgStat_StatTabEntry;
 
+/*
+ * The elements of an LSNTimeline. Each LSNTime represents one or more time,
+ * LSN pairs. The LSN is typically the insert LSN recorded at the time. Members
+ * is the number of logical members -- each a time, LSN pair -- represented in
+ * the LSNTime.
+ */
+typedef struct LSNTime
+{
+	TimestampTz time;
+	XLogRecPtr	lsn;
+	uint64		members;
+} LSNTime;
+
+#define LSNTIMELINE_VOLUME 64
+/*
+ * A timeline consists of LSNTimes from most to least recent. The array is
+ * filled from end to start before the contents of any elements are merged.
+ * Once the LSNTimeline length == volume (the array is full), old elements are
+ * merged to make space for new elements at index 0. When merging logical
+ * members, each element of the array in the timeline may represent twice as
+ * many logical members as the preceding element.
+ *
+ * This gives more recent times greater precision than less recent ones. An
+ * array of size 64 and an unsigned 64-bit number for the number of members
+ * should provide sufficient capacity without accounting for what to do when
+ * all elements of the array are at capacity.
+ *
+ * After every element has at least one logical member, when a new LSNTime is
+ * inserted, the oldest array element whose logical membership is < 2x its
+ * predecessor is the merge target. Its preceding element is merged into it.
+ * Then all of the intervening elements are moved down by one and the new
+ * LSNTime is inserted at index 0.
+ *
+ * Merging two elements is combining their members and assigning the lesser
+ * LSNTime. Use the timeline for LSN <-> time conversion using linear
+ * interpolation.
+ */
+typedef struct LSNTimeline
+{
+	int			length;
+	LSNTime		data[LSNTIMELINE_VOLUME];
+} LSNTimeline;
+
 typedef struct PgStat_WalStats
 {
 	PgStat_Counter wal_records;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d808aad8b05..aef83230836 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1525,6 +1525,8 @@ LogicalTapeSet
 LsnReadQueue
 LsnReadQueueNextFun
 LsnReadQueueNextStatus
+LSNTime
+LSNTimeline
 LtreeGistOptions
 LtreeSignature
 MAGIC
-- 
2.37.2

#5Daniel Gustafsson
daniel@yesql.se
In reply to: Melanie Plageman (#4)
1 attachment(s)
Re: Add LSN <-> time conversion functionality

On 22 Feb 2024, at 03:45, Melanie Plageman <melanieplageman@gmail.com> wrote:
On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

- Not sure why we need 0001. Just so that the "estimate" functions in
0002 have a convenient "start" point? Surely we could look at the
current LSNTimeline data and use the oldest value, or (if there's no
data) use the current timestamp/LSN?

When there are 0 or 1 entries in the timeline you'll get an answer
that could be very off if you just return the current timestamp or
LSN. I guess that is okay?

I don't think that's a huge problem at such a young "lsn-age", but I might be
missing something.

- I wonder what happens if we lose the data - we know that if people
reset statistics for whatever reason (or just lose them because of a
crash, or because they're on a replica), bad things happen to
autovacuum. What's the (expected) impact on pruning?

This is an important question. Because stats aren't crashsafe, we
could return very inaccurate numbers for some time/LSN values if we
crash. I don't actually know what we could do about that. When I use
the LSNTimeline for the freeze heuristic it is less of an issue
because the freeze heuristic has a fallback strategy when there aren't
enough stats to do its calculations. But other users of this
LSNTimeline will simply get highly inaccurate results (I think?). Is
there anything we could do about this? It seems bad.

A complication with this over stats is that we can't recompute this in case of
a crash/corruption issue. The simple solution is to consider this unlogged
data and start fresh at every unclean shutdown, but I have a feeling that won't
be good enough for basing heuristics on?

Andres had brought up something at some point about, what if the
database is simply turned off for awhile and then turned back on. Even
if you cleanly shut down, will there be "gaps" in the timeline? I
think that could be okay, but it is something to think about.

The gaps would represent reality, so there is nothing wrong per se with gaps,
but if they inflate the interval of a bucket which in turns impact the
precision of the results then that can be a problem.

Just a note, all of my comments could use a lot of work, but I want to
get consensus on the algorithm before I make sure and write about it
in a perfect way.

I'm not sure "a lot of work" is accurate, they seem pretty much there to me,
but I think that an illustration of running through the algorithm in an
ascii-art array would be helpful.

Reading through this I think such a function has merits, not only for your
usecase but other heuristic based work and quite possibly systems debugging.

While the bucketing algorithm is a clever algorithm for degrading precision for
older entries without discarding them, I do fear that we'll risk ending up with
answers like "somewhere between in the past and even further in the past".
I've been playing around with various compression algorithms for packing the
buckets such that we can retain precision for longer. Since you were aiming to
work on other patches leading up to the freeze, let's pick this up again once
things calm down.

When compiling I got this warning for lsntime_merge_target:

pgstat_wal.c:242:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.

The issue seems to be that the can't-really-happen path is protected with an
Assertion, which is a no-op for production builds. I think we should handle
the error rather than rely on testing catching it (since if it does happen even
though it can't, it's going to be when it's for sure not tested). Returning an
invalid array subscript like -1 and testing for it in lsntime_insert, with an
elog(WARNING..), seems enough.

-    last_snapshot_lsn <= GetLastImportantRecPtr())
+    last_snapshot_lsn <= current_lsn)
I think we should delay extracting the LSN with GetLastImportantRecPtr until we
know that we need it, to avoid taking locks in this codepath unless needed.

I've attached a diff with the above suggestions which applies on op of your
patchset.

--
Daniel Gustafsson

Attachments:

review.txttext/plain; name=review.txt; x-unix-mode=0644Download
commit 908f3bb511450f05980ba01d42c909cc9ef8007a
Author: Daniel Gustafsson <dgustafsson@postgresql.org>
Date:   Thu Mar 14 14:32:15 2024 +0100

    Code review hackery

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 7d9ad9046f..115204f708 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -267,7 +267,7 @@ BackgroundWriterMain(void)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
-			XLogRecPtr	current_lsn = GetLastImportantRecPtr();
+			XLogRecPtr	current_lsn;
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
@@ -279,12 +279,15 @@ BackgroundWriterMain(void)
 			 * start of a record, whereas last_snapshot_lsn points just past
 			 * the end of the record.
 			 */
-			if (now >= timeout &&
-				last_snapshot_lsn <= current_lsn)
+			if (now >= timeout)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
-				last_snapshot_ts = now;
-				pgstat_wal_update_lsntimeline(now, current_lsn);
+				current_lsn = GetLastImportantRecPtr();
+				if (last_snapshot_lsn <= current_lsn)
+				{
+					last_snapshot_lsn = LogStandbySnapshot();
+					last_snapshot_ts = now;
+					pgstat_wal_update_lsntimeline(now, current_lsn);
+				}
 			}
 		}
 
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 0a7b545fd2..90afec580b 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -238,7 +238,7 @@ lsntime_merge_target(LSNTimeline *timeline)
 	}
 
 	/* Should not be reachable or we are out of space */
-	Assert(false);
+	return -1;
 }
 
 /*
@@ -280,7 +280,12 @@ lsntime_insert(LSNTimeline *timeline, TimestampTz time,
 	 * down by one and insert the passed-in LSNTime at array index 0.
 	 */
 	merge_target = lsntime_merge_target(timeline);
-	Assert(merge_target >= 0 && merge_target < timeline->length);
+	if (merge_target < 0)
+	{
+		elog(WARNING, "unable to insert LSN in LSN timeline, merge failed");
+		return;
+	}
+	Assert(merge_target < timeline->length);
 	lsntime_merge(&timeline->data[merge_target], &timeline->data[merge_target - 1]);
 	memmove(&timeline->data[1], &timeline->data[0], sizeof(LSNTime) * merge_target - 1);
 	timeline->data[0] = entrant;
#6Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Melanie Plageman (#4)
Re: Add LSN <-> time conversion functionality

On 2/22/24 03:45, Melanie Plageman wrote:

Thanks so much for reviewing!

On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

When I first read this, I immediately started wondering if this might
use the commit timestamp stuff we already have. Because for each commit
we already store the LSN and commit timestamp, right? But I'm not sure
that would be a good match - the commit_ts serves a very special purpose
of mapping XID => (LSN, timestamp), I don't see how to make it work for
(LSN=>timestmap) and (timestamp=>LSN) very easily.

I took a look at the code in commit_ts.c, and I really don't see a way
of reusing any of this commit<->timestamp infrastructure for
timestamp<->LSN mappings.

As for the inner workings of the patch, my understanding is this:

- "LSNTimeline" consists of "LSNTime" entries representing (LSN,ts)
points, but those points are really "buckets" that grow larger and
larger for older periods of time.

Yes, they are buckets in the sense that they represent multiple values
but each contains a single LSNTime value which is the minimum of all
the LSNTimes we "merged" into that single array element. In order to
represent a range of time, you need to use two array elements. The
linear interpolation from time <-> LSN is all done with two elements.

- AFAIK each entry represent an interval of time, and the next (older)
interval is twice as long, right? So the first interval is 1 second,
then 2 seconds, 4 seconds, 8 seconds, ...

- But I don't understand how the LSNTimeline entries are "aging" and get
less accurate, while the "current" bucket is short. lsntime_insert()
seems to simply move to the next entry, but doesn't that mean we insert
the entries into larger and larger buckets?

Because the earlier array elements can represent fewer logical members
than later ones and because elements are merged into the next element
when space runs out, later array elements will contain older data and
more of it, so those "ranges" will be larger. But, after thinking
about it and also reading your feedback, I realized my algorithm was
silly because it starts merging logical members before it has even
used the whole array.

The attached v3 has a new algorithm. Now, LSNTimes are added from the
end of the array forward until all array elements have at least one
logical member (array length == volume). Once array length == volume,
new LSNTimes will result in merging logical members in existing
elements. We want to merge older members because those can be less
precise. So, the number of logical members per array element will
always monotonically increase starting from the beginning of the array
(which contains the most recent data) and going to the end. We want to
use all the available space in the array. That means that each LSNTime
insertion will always result in a single merge. We want the timeline
to be inclusive of the oldest data, so merging means taking the
smaller value of two LSNTime values. I had to pick a rule for choosing
which elements to merge. So, I choose the merge target as the oldest
element whose logical membership is < 2x its predecessor. I merge the
merge target's predecessor into the merge target. Then I move all of
the intervening elements down 1. Then I insert the new LSNTime at
index 0.

I can't help but think about t-digest [1]https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf, which also merges data into
variable-sized buckets (called centroids, which is a pair of values,
just like LSNTime). But the merging is driven by something called "scale
function" which I found like a pretty nice approach to this, and it
yields some guarantees regarding accuracy. I wonder if we could do
something similar here ...

The t-digest is a way to approximate quantiles, and the default scale
function is optimized for best accuracy on the extremes (close to 0.0
and 1.0), but it's possible to use scale functions that optimize only
for accuracy close to 1.0.

This may be misguided, but I see similarity between quantiles and what
LSNTimeline does - timestamps are ordered, and quantiles close to 0.0
are "old timestamps" while quantiles close to 1.0 are "now".

And t-digest also defines a pretty efficient algorithm to merge data in
a way that gradually combines older "buckets" into larger ones.

- The comments never really spell what amount of time the entries cover
/ how granular it is. My understanding is it's simply measured in number
of entries added, which is assumed to be constant and drive by
bgwriter_delay, right? Which is 200ms by default. Which seems fine, but
isn't the hibernation (HIBERNATE_FACTOR) going to mess with it?

Is there some case where bgwriter would just loop without sleeping,
filling the timeline much faster? (I can't think of any, but ...)

bgwriter will wake up when there are buffers to flush, which is likely
correlated with there being new LSNs. So, actually it seems like it
might work well to rely on only filling the timeline when there are
things for bgwriter to do.

- The LSNTimeline comment claims an array of size 64 is large enough to
not need to care about filling it, but maybe it should briefly explain
why we can never fill it (I guess 2^64 is just too many).

The new structure fits a different number of members. I have yet to
calculate that number, but it should be explained in the comments once
I do.

For example, if we made an LSNTimeline with volume 4, once every
element had one LSNTime and we needed to start merging, the following
is how many logical members each element would have after each of four
merges:
1111
1112
1122
1114
1124
So, if we store the number of members as an unsigned 64-bit int and we
have an LSNTimeline with volume 64, what is the maximum number of
members can we store if we hold all of the invariants described in my
algorithm above (we only merge when required, every element holds < 2x
the number of logical members as its predecessor, we do exactly one
merge every insertion [when required], membership must monotonically
increase [choose the oldest element meeting the criteria when deciding
what to merge])?

I guess that should be enough for (2^64-1) logical members, because it's
a sequence 1, 2, 4, 8, ..., 2^63. Seems enough.

But now that I think about it, does it make sense to do the merging
based on the number of logical members? Shouldn't this really be driven
by the "length" of the time interval the member covers?

- I don't quite understand why 0005 adds the functions to pageinspect.
This has nothing to do with pages, right?

You're right. I just couldn't think of a good place to put the
functions. In version 3, I just put the SQL functions in pgstat_wal.c
and made them generally available (i.e. not in a contrib module). I
haven't added docs back yet. But perhaps a section near the docs
describing pg_xact_commit_timestamp() [1]? I wasn't sure if I should
put the SQL function source code in pgstatfuncs.c -- I kind of prefer
it in pgstat_wal.c but there are no other SQL functions there.

OK, pgstat_wal seems like a good place.

- Not sure why we need 0001. Just so that the "estimate" functions in
0002 have a convenient "start" point? Surely we could look at the
current LSNTimeline data and use the oldest value, or (if there's no
data) use the current timestamp/LSN?

When there are 0 or 1 entries in the timeline you'll get an answer
that could be very off if you just return the current timestamp or
LSN. I guess that is okay?

- I wonder what happens if we lose the data - we know that if people
reset statistics for whatever reason (or just lose them because of a
crash, or because they're on a replica), bad things happen to
autovacuum. What's the (expected) impact on pruning?

This is an important question. Because stats aren't crashsafe, we
could return very inaccurate numbers for some time/LSN values if we
crash. I don't actually know what we could do about that. When I use
the LSNTimeline for the freeze heuristic it is less of an issue
because the freeze heuristic has a fallback strategy when there aren't
enough stats to do its calculations. But other users of this
LSNTimeline will simply get highly inaccurate results (I think?). Is
there anything we could do about this? It seems bad.

Andres had brought up something at some point about, what if the
database is simply turned off for awhile and then turned back on. Even
if you cleanly shut down, will there be "gaps" in the timeline? I
think that could be okay, but it is something to think about.

- What about a SRF function that outputs the whole LSNTimeline? Would be
useful for debugging / development, I think. (Just a suggestion).

Good idea! I've added this. Though, maybe there was a simpler way to
implement than I did.

Thanks. I'll take a look.

Just a note, all of my comments could use a lot of work, but I want to
get consensus on the algorithm before I make sure and write about it
in a perfect way.

Makes sense, as long as the comments are sufficiently clear. It's hard
to reach consensus on something not explained clearly enough.

regards

[1]: https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf
https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Daniel Gustafsson (#5)
Re: Add LSN <-> time conversion functionality

On 3/18/24 15:02, Daniel Gustafsson wrote:

On 22 Feb 2024, at 03:45, Melanie Plageman <melanieplageman@gmail.com> wrote:
On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

- Not sure why we need 0001. Just so that the "estimate" functions in
0002 have a convenient "start" point? Surely we could look at the
current LSNTimeline data and use the oldest value, or (if there's no
data) use the current timestamp/LSN?

When there are 0 or 1 entries in the timeline you'll get an answer
that could be very off if you just return the current timestamp or
LSN. I guess that is okay?

I don't think that's a huge problem at such a young "lsn-age", but I might be
missing something.

- I wonder what happens if we lose the data - we know that if people
reset statistics for whatever reason (or just lose them because of a
crash, or because they're on a replica), bad things happen to
autovacuum. What's the (expected) impact on pruning?

This is an important question. Because stats aren't crashsafe, we
could return very inaccurate numbers for some time/LSN values if we
crash. I don't actually know what we could do about that. When I use
the LSNTimeline for the freeze heuristic it is less of an issue
because the freeze heuristic has a fallback strategy when there aren't
enough stats to do its calculations. But other users of this
LSNTimeline will simply get highly inaccurate results (I think?). Is
there anything we could do about this? It seems bad.

Do we have something to calculate a sufficiently good "average" to use
as a default, if we don't have a better value? For example, we know the
timestamp of the last checkpoint, and we know the LSN, right? Maybe if
we're sufficiently far from the checkpoint, we could use that.

Or maybe checkpoint_timeout / max_wal_size would be enough to calculate
some default value?

I wonder how long it takes until LSNTimeline gives us sufficiently good
data for all LSNs we need. That is, if we lose this, how long it takes
until we get enough data to do good decisions?

Why don't we simply WAL-log this in some trivial way? It's pretty small,
so if we WAL-log this once in a while (after a merge happens), that
should not be a problem.

Or a different idea - if we lost the data, but commit_ts is enabled,
can't we "simply" walk commit_ts and feed LSN/timestamp into the
timeline? I guess we don't want to walk 2B transactions, but even just
sampling some recent transactions might be enough, no?

A complication with this over stats is that we can't recompute this in case of
a crash/corruption issue. The simple solution is to consider this unlogged
data and start fresh at every unclean shutdown, but I have a feeling that won't
be good enough for basing heuristics on?

Andres had brought up something at some point about, what if the
database is simply turned off for awhile and then turned back on. Even
if you cleanly shut down, will there be "gaps" in the timeline? I
think that could be okay, but it is something to think about.

The gaps would represent reality, so there is nothing wrong per se with gaps,
but if they inflate the interval of a bucket which in turns impact the
precision of the results then that can be a problem.

Well, I think the gaps are a problem in the sense that they disappear
once we start merging the buckets. But maybe that's fine, if we're only
interested in approximate data.

Just a note, all of my comments could use a lot of work, but I want to
get consensus on the algorithm before I make sure and write about it
in a perfect way.

I'm not sure "a lot of work" is accurate, they seem pretty much there to me,
but I think that an illustration of running through the algorithm in an
ascii-art array would be helpful.

+1

Reading through this I think such a function has merits, not only for your
usecase but other heuristic based work and quite possibly systems debugging.

While the bucketing algorithm is a clever algorithm for degrading precision for
older entries without discarding them, I do fear that we'll risk ending up with
answers like "somewhere between in the past and even further in the past".
I've been playing around with various compression algorithms for packing the
buckets such that we can retain precision for longer. Since you were aiming to
work on other patches leading up to the freeze, let's pick this up again once
things calm down.

I guess this ambiguity is pretty inherent to a structure that does not
keep all the data, and gradually reduces the resolution for old stuff.
But my understanding was that's sufficient for the freezing patch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#7)
Re: Add LSN <-> time conversion functionality

Hi everyone!

Me, Bharath, and Ilya are on patch review session at the PGConf.dev :) Maybe we got everything wrong, please consider that we are just doing training on reviewing patches.

=== Purpose of the patch ===
Currently, we have checkpoint_timeout and max_wal size to know when we need a checkpoint. This patch brings a capability to freeze page not only by internal state of the system, but also by wall clock time.
To do so we need an infrastructure which will tell when page was modified.

The patch in this thread is doing exactly this: in-memory information to map LSNs with wall clock time. Mapping is maintained by bacgroundwriter.

=== Questions ===
1. The patch does not handle server restart. All pages will need freeze after any crash?
2. Some benchmarks to proof the patch does not have CPU footprint.

=== Nits ===
"Timeline" term is already taken.
The patch needs rebase due to some header changes.
Tests fail on Windows.
The patch lacks tests.
Some docs would be nice, but the feature is for developers.
Mapping is protected for multithreaded access by walstats LWlock and might have tuplestore_putvalues() under that lock. That might be a little dangerous, if tuplestore will be on-disk for some reason (should not happen).

Overall, the patch is a base for good feature which would help to do freeze right in time. Thanks!

Best regards, Bharath, Andrey, Ilya.

#9Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#6)
Re: Add LSN <-> time conversion functionality

On Mon, Mar 18, 2024 at 1:29 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 2/22/24 03:45, Melanie Plageman wrote:

The attached v3 has a new algorithm. Now, LSNTimes are added from the
end of the array forward until all array elements have at least one
logical member (array length == volume). Once array length == volume,
new LSNTimes will result in merging logical members in existing
elements. We want to merge older members because those can be less
precise. So, the number of logical members per array element will
always monotonically increase starting from the beginning of the array
(which contains the most recent data) and going to the end. We want to
use all the available space in the array. That means that each LSNTime
insertion will always result in a single merge. We want the timeline
to be inclusive of the oldest data, so merging means taking the
smaller value of two LSNTime values. I had to pick a rule for choosing
which elements to merge. So, I choose the merge target as the oldest
element whose logical membership is < 2x its predecessor. I merge the
merge target's predecessor into the merge target. Then I move all of
the intervening elements down 1. Then I insert the new LSNTime at
index 0.

I can't help but think about t-digest [1], which also merges data into
variable-sized buckets (called centroids, which is a pair of values,
just like LSNTime). But the merging is driven by something called "scale
function" which I found like a pretty nice approach to this, and it
yields some guarantees regarding accuracy. I wonder if we could do
something similar here ...

The t-digest is a way to approximate quantiles, and the default scale
function is optimized for best accuracy on the extremes (close to 0.0
and 1.0), but it's possible to use scale functions that optimize only
for accuracy close to 1.0.

This may be misguided, but I see similarity between quantiles and what
LSNTimeline does - timestamps are ordered, and quantiles close to 0.0
are "old timestamps" while quantiles close to 1.0 are "now".

And t-digest also defines a pretty efficient algorithm to merge data in
a way that gradually combines older "buckets" into larger ones.

I started taking a look at this paper and think the t-digest could be
applicable as a possible alternative data structure to the one I am
using to approximate page age for the actual opportunistic freeze
heuristic -- especially since the centroids are pairs of a mean and a
count. I couldn't quite understand how the t-digest is combining those
centroids. Since I am not computing quantiles over the LSNTimeStream,
though, I think I can probably do something simpler for this part of
the project.

- The LSNTimeline comment claims an array of size 64 is large enough to
not need to care about filling it, but maybe it should briefly explain
why we can never fill it (I guess 2^64 is just too many).

-- snip --

I guess that should be enough for (2^64-1) logical members, because it's
a sequence 1, 2, 4, 8, ..., 2^63. Seems enough.

But now that I think about it, does it make sense to do the merging
based on the number of logical members? Shouldn't this really be driven
by the "length" of the time interval the member covers?

After reading this, I decided to experiment with a few different
algorithms in python and plot the unabbreviated LSNTimeStream against
various ways of compressing it. You can see the results if you run the
python code here [1]https://gist.github.com/melanieplageman/95126993bcb43d4b4042099e9d0ccc11.

What I found is that attempting to calculate the error represented by
dropping a point and picking the point which would cause the least
additional error were it to be dropped produced more accurate results
than combining the oldest entries based on logical membership to fit
some series.

This is inspired by what you said about using the length of segments
to decide which points to merge. In my case, I want to merge segments
that have a similar slope -- those which have a point that is
essentially redundant. I loop through the LSNTimeStream and look for
the point that we can drop with the lowest impact on future
interpolation accuracy. To do this, I calculate the area of the
triangle made by each point on the stream and its adjacent points. The
idea being that if you drop that point, the triangle is the amount of
error you introduce for points being interpolated between the new pair
(previous adjacencies of the dropped point). This has some issues, but
it seems more logical than just always combining the oldest points.

If you run the python simulator code, you'll see that for the
LSNTimeStream I generated, using this method produces more accurate
results than either randomly dropping points or using the "combine
oldest members" method.

It would be nice if we could do something with the accumulated error
-- so we could use it to modify estimates when interpolating. I don't
really know how to keep it though. I thought I would just save the
calculated error in one or the other of the adjacent points after
dropping a point, but then what do we do with the error saved in a
point before it is dropped? Add it to the error value in one of the
adjacent points? If we did, what would that even mean? How would we
use it?

- Melanie

[1]: https://gist.github.com/melanieplageman/95126993bcb43d4b4042099e9d0ccc11

#10Melanie Plageman
melanieplageman@gmail.com
In reply to: Daniel Gustafsson (#5)
5 attachment(s)
Re: Add LSN <-> time conversion functionality

Thanks for the review!

Attached v4 implements the new algorithm/compression described in [1]/messages/by-id/CAAKRu_YbbZGz-X_pm2zXJA+6A22YYpaWhOjmytqFL1yF_FCv6w@mail.gmail.com.

We had discussed off-list possibly using error in some way. So, I'm
interested to know what you think about this method I suggested which
calculates error. It doesn't save the error so that we could use it
when interpolating for reasons I describe in that mail. If you have
any ideas on how to use the calculated error or just how to combine
error when dropping a point, that would be super helpful.

Note that in this version, I've changed the name from LSNTimeline to
LSNTimeStream to address some feedback from another reviewer about
Timeline being already in use in Postgres as a concept.

On Mon, Mar 18, 2024 at 10:02 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 22 Feb 2024, at 03:45, Melanie Plageman <melanieplageman@gmail.com> wrote:
On Fri, Feb 16, 2024 at 3:41 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

- I wonder what happens if we lose the data - we know that if people
reset statistics for whatever reason (or just lose them because of a
crash, or because they're on a replica), bad things happen to
autovacuum. What's the (expected) impact on pruning?

This is an important question. Because stats aren't crashsafe, we
could return very inaccurate numbers for some time/LSN values if we
crash. I don't actually know what we could do about that. When I use
the LSNTimeline for the freeze heuristic it is less of an issue
because the freeze heuristic has a fallback strategy when there aren't
enough stats to do its calculations. But other users of this
LSNTimeline will simply get highly inaccurate results (I think?). Is
there anything we could do about this? It seems bad.

A complication with this over stats is that we can't recompute this in case of
a crash/corruption issue. The simple solution is to consider this unlogged
data and start fresh at every unclean shutdown, but I have a feeling that won't
be good enough for basing heuristics on?

Yes, I still haven't dealt with this yet. Tomas had a list of
suggestions in an upthread email, so I will spend some time thinking
about those next.

It seems like we might be able to come up with some way of calculating
a valid "default" value or "best guess" which could be used whenever
there isn't enough data. Though, if we crash and lose some time stream
data, we won't know that that data was lost due to a crash so we
wouldn't know to use our "best guess" to make up for it. So, maybe I
should try and rebuild the stream using some combination of WAL, clog,
and commit timestamps? Or perhaps I should do some basic WAL logging
just for this data structure.

Andres had brought up something at some point about, what if the
database is simply turned off for awhile and then turned back on. Even
if you cleanly shut down, will there be "gaps" in the timeline? I
think that could be okay, but it is something to think about.

The gaps would represent reality, so there is nothing wrong per se with gaps,
but if they inflate the interval of a bucket which in turns impact the
precision of the results then that can be a problem.

Yes, actually I added some hacky code to the quick and dirty python
simulator I wrote [2]https://gist.github.com/melanieplageman/7400e81bbbd518fe08b4af55a9b632d4 to test out having a big gap with no updates (if
there is no db activity so nothing for bgwriter to do or the db is off
for a while). And it seemed to basically work fine.

While the bucketing algorithm is a clever algorithm for degrading precision for
older entries without discarding them, I do fear that we'll risk ending up with
answers like "somewhere between in the past and even further in the past".
I've been playing around with various compression algorithms for packing the
buckets such that we can retain precision for longer. Since you were aiming to
work on other patches leading up to the freeze, let's pick this up again once
things calm down.

Let me know what you think about the new algorithm. I wonder if for
points older than the second to oldest point, we have the function
return something like "older than a year ago" instead of guessing. The
new method doesn't focus on compressing old data though.

When compiling I got this warning for lsntime_merge_target:

pgstat_wal.c:242:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.

The issue seems to be that the can't-really-happen path is protected with an
Assertion, which is a no-op for production builds. I think we should handle
the error rather than rely on testing catching it (since if it does happen even
though it can't, it's going to be when it's for sure not tested). Returning an
invalid array subscript like -1 and testing for it in lsntime_insert, with an
elog(WARNING..), seems enough.

-    last_snapshot_lsn <= GetLastImportantRecPtr())
+    last_snapshot_lsn <= current_lsn)
I think we should delay extracting the LSN with GetLastImportantRecPtr until we
know that we need it, to avoid taking locks in this codepath unless needed.

I've attached a diff with the above suggestions which applies on op of your
patchset.

I've implemented these review points in the attached v4.

- Melanie

[1]: /messages/by-id/CAAKRu_YbbZGz-X_pm2zXJA+6A22YYpaWhOjmytqFL1yF_FCv6w@mail.gmail.com
[2]: https://gist.github.com/melanieplageman/7400e81bbbd518fe08b4af55a9b632d4

Attachments:

v4-0001-Record-LSN-at-postmaster-startup.patchtext/x-patch; charset=US-ASCII; name=v4-0001-Record-LSN-at-postmaster-startup.patchDownload
From 1b86460bb25aef99a39034e5bf6be581cdccfb88 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 5 Dec 2023 07:29:39 -0500
Subject: [PATCH v4 1/6] Record LSN at postmaster startup

The insert_lsn at postmaster startup can be used along with PgStartTime
as seed values for a timeline mapping LSNs to time. Future commits will
add such a structure for LSN <-> time conversions. A start LSN allows
for such conversions before even inserting a value into the timeline.
The current time and current insert LSN can be used along with
PgStartTime and PgStartLSN.

This is WIP, as I'm not sure if I did this in the right place.
---
 src/backend/access/transam/xlog.c   | 2 ++
 src/backend/postmaster/postmaster.c | 2 ++
 src/include/utils/builtins.h        | 3 +++
 3 files changed, 7 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f2..6fff1f52084 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -142,6 +142,8 @@ bool		XLOG_DEBUG = false;
 
 int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
 
+XLogRecPtr	PgStartLSN = InvalidXLogRecPtr;
+
 /*
  * Number of WAL insertion locks to use. A higher value allows more insertions
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bf0241aed0c..f1b60fe6cee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -117,6 +117,7 @@
 #include "storage/proc.h"
 #include "tcop/backend_startup.h"
 #include "tcop/tcopprot.h"
+#include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/memutils.h"
 #include "utils/pidfile.h"
@@ -1345,6 +1346,7 @@ PostmasterMain(int argc, char *argv[])
 	 * Remember postmaster startup time
 	 */
 	PgStartTime = GetCurrentTimestamp();
+	PgStartLSN = GetXLogInsertRecPtr();
 
 	/*
 	 * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 359c570f23e..16a7a058bc7 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -17,6 +17,7 @@
 #include "fmgr.h"
 #include "nodes/nodes.h"
 #include "utils/fmgrprotos.h"
+#include "access/xlogdefs.h"
 
 /* Sign + the most decimal digits an 8-byte number could have */
 #define MAXINT8LEN 20
@@ -85,6 +86,8 @@ extern void generate_operator_clause(fmStringInfo buf,
 									 Oid opoid,
 									 const char *rightop, Oid rightoptype);
 
+extern PGDLLIMPORT XLogRecPtr PgStartLSN;
+
 /* varchar.c */
 extern int	bpchartruelen(char *s, int len);
 
-- 
2.34.1

v4-0004-Bgwriter-maintains-global-LSNTimeStream.patchtext/x-patch; charset=US-ASCII; name=v4-0004-Bgwriter-maintains-global-LSNTimeStream.patchDownload
From 661d8f2db88a510efbbd7c19f0af13ee75416967 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:32:40 -0500
Subject: [PATCH v4 4/6] Bgwriter maintains global LSNTimeStream

Insert new LSN, time pairs to the global LSNTimeStream stored in
PgStat_WalStats in the background writer's main loop. This ensures that
new values are added to the stream in a regular manner.
---
 src/backend/postmaster/bgwriter.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..02b039cfacf 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -273,6 +273,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_lsn;
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
@@ -284,11 +285,15 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 			 * start of a record, whereas last_snapshot_lsn points just past
 			 * the end of the record.
 			 */
-			if (now >= timeout &&
-				last_snapshot_lsn <= GetLastImportantRecPtr())
+			if (now >= timeout)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
-				last_snapshot_ts = now;
+				current_lsn = GetLastImportantRecPtr();
+				if (last_snapshot_lsn <= current_lsn)
+				{
+					last_snapshot_lsn = LogStandbySnapshot();
+					last_snapshot_ts = now;
+					pgstat_wal_update_lsntime_stream(now, current_lsn);
+				}
 			}
 		}
 
-- 
2.34.1

v4-0005-Add-time-LSN-translation-functions.patchtext/x-patch; charset=US-ASCII; name=v4-0005-Add-time-LSN-translation-functions.patchDownload
From 23db440712a45f7c58eb57933df61a9b2e40c6a0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 21 Feb 2024 20:06:29 -0500
Subject: [PATCH v4 5/6] Add time <-> LSN translation functions

Previous commits added a global LSNTimeStream, maintained by background
writer, that allows approximate translations between time and LSNs.

Add SQL-callable functions to convert from LSN to time and back and a
SQL-callable function returning the entire LSNTimeStream.

This could be useful in combination with SQL-callable functions
accessing a page LSN to approximate the time of last modification of a
page or estimating the LSN consumption rate to moderate maintenance
processes and balance system resource utilization.
---
 doc/src/sgml/monitoring.sgml            | 66 +++++++++++++++++++++++++
 src/backend/utils/activity/pgstat_wal.c | 56 +++++++++++++++++++++
 src/include/catalog/pg_proc.dat         | 22 +++++++++
 3 files changed, 144 insertions(+)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b2ad9b446f3..1f7cd2f2f3b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3187,6 +3187,72 @@ description | Waiting for a newly initialized WAL file to reach durable storage
    </tgroup>
   </table>
 
+  <para>
+  In addition to these WAL stats, a stream of LSN <-> time pairs is accessible
+  via the functions shown in <xref linkend="functions-lsn-time-stream"/>.
+  </para>
+
+  <table id="functions-lsn-time-stream">
+   <title>LSN Time Stream Information Functions</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       Function
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_estimate_lsn_at_time</primary>
+       </indexterm>
+       <function>pg_estimate_lsn_at_time</function> ( <type>timestamp with time zone</type> )
+       <returnvalue>pg_lsn</returnvalue>
+      </para>
+      <para>
+       Returns the estimated lsn at the provided time.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_estimate_lsn_at_time</primary>
+       </indexterm>
+       <function>pg_estimate_lsn_at_time</function> ( <type>pg_lsn</type> )
+       <returnvalue>timestamp with time zone</returnvalue>
+      </para>
+      <para>
+        Returns the estimated time at the provided lsn.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_lsntime_stream</primary>
+       </indexterm>
+       <function>pg_lsntime_stream</function> ()
+       <returnvalue>record</returnvalue>
+       ( <parameter>time</parameter> <type>timestamp with time zone</type>,
+       <parameter>lsn</parameter> <type>pg_lsnwith time zone</type>)
+      </para>
+      <para>
+       Returns all of the LSN <-> time pairs in the current LSN time stream.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+
+
 </sect2>
 
  <sect2 id="monitoring-pg-stat-database-view">
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index eddd2ec03cb..5d5ab62d4ba 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -19,7 +19,9 @@
 
 #include "access/xlog.h"
 #include "executor/instrument.h"
+#include "funcapi.h"
 #include "utils/builtins.h"
+#include "utils/pg_lsn.h"
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
@@ -427,3 +429,57 @@ pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn)
 	lsntime_insert(&stats_shmem->stats.stream, time, lsn);
 	LWLockRelease(&stats_shmem->lock);
 }
+
+PG_FUNCTION_INFO_V1(pg_estimate_lsn_at_time);
+PG_FUNCTION_INFO_V1(pg_estimate_time_at_lsn);
+PG_FUNCTION_INFO_V1(pg_lsntime_stream);
+
+Datum
+pg_estimate_time_at_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn = PG_GETARG_LSN(0);
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+	TimestampTz result;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_time_at_lsn(&stats_shmem->stats.stream, lsn);
+	LWLockRelease(&stats_shmem->lock);
+
+	PG_RETURN_TIMESTAMPTZ(result);
+}
+
+Datum
+pg_estimate_lsn_at_time(PG_FUNCTION_ARGS)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+	TimestampTz time = PG_GETARG_TIMESTAMPTZ(0);
+	XLogRecPtr	result;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_lsn_at_time(&stats_shmem->stats.stream, time);
+	LWLockRelease(&stats_shmem->lock);
+
+	PG_RETURN_LSN(result);
+}
+
+Datum
+pg_lsntime_stream(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_WalStats *stats = pgstat_fetch_stat_wal();
+	LSNTimeStream *stream = &stats->stream;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	for (int i = LSNTIMESTREAM_VOLUME - stream->length; i < LSNTIMESTREAM_VOLUME; i++)
+	{
+		Datum		values[2] = {0};
+		bool		nulls[2] = {0};
+
+		values[0] = stream->data[i].time;
+		values[1] = stream->data[i].lsn;
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6a5476d3c4c..8ab14b49b2a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6342,6 +6342,28 @@
   prorettype => 'timestamptz', proargtypes => 'xid',
   prosrc => 'pg_xact_commit_timestamp' },
 
+{ oid => '9997',
+  descr => 'get approximate LSN at a particular point in time',
+  proname => 'pg_estimate_lsn_at_time', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => 'timestamptz',
+  prosrc => 'pg_estimate_lsn_at_time' },
+
+{ oid => '9996',
+  descr => 'get approximate time at a particular LSN',
+  proname => 'pg_estimate_time_at_lsn', provolatile => 'v',
+  prorettype => 'timestamptz', proargtypes => 'pg_lsn',
+  prosrc => 'pg_estimate_time_at_lsn' },
+
+{ oid => '9994',
+  descr => 'print the LSN Time Stream',
+  proname => 'pg_lsntime_stream', prorows => '64',
+  proretset => 't', provolatile => 'v', proparallel => 's',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,pg_lsn}',
+  proargmodes => '{o,o}',
+  proargnames => '{time, lsn}',
+  prosrc => 'pg_lsntime_stream' },
+
 { oid => '6168',
   descr => 'get commit timestamp and replication origin of a transaction',
   proname => 'pg_xact_commit_timestamp_origin', provolatile => 'v',
-- 
2.34.1

v4-0003-Add-LSNTimeStream-to-PgStat_WalStats.patchtext/x-patch; charset=US-ASCII; name=v4-0003-Add-LSNTimeStream-to-PgStat_WalStats.patchDownload
From 9b8091be4a9e327e39bc4b9a5f4e5a438897c0e1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 21 Feb 2024 20:28:27 -0500
Subject: [PATCH v4 3/6] Add LSNTimeStream to PgStat_WalStats

Add a globally maintained instance of an LSNTimeStream to
PgStat_WalStats and a utility function to insert new values.
---
 src/backend/utils/activity/pgstat_wal.c | 10 ++++++++++
 src/include/pgstat.h                    |  4 ++++
 2 files changed, 14 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index d76ace5cbfc..eddd2ec03cb 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -417,3 +417,13 @@ stop:
 	result = (double) (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
 	return Max(result, 0);
 }
+
+void
+pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	lsntime_insert(&stats_shmem->stats.stream, time, lsn);
+	LWLockRelease(&stats_shmem->lock);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index af348be839c..773e3cd5003 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -470,6 +470,7 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_sync;
 	PgStat_Counter wal_write_time;
 	PgStat_Counter wal_sync_time;
+	LSNTimeStream stream;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -752,6 +753,9 @@ extern void pgstat_execute_transactional_drops(int ndrops, struct xl_xact_stats_
 extern void pgstat_report_wal(bool force);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 
+/* Helpers for maintaining the LSNTimeStream */
+extern void pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn);
+
 
 /*
  * Variables in pgstat.c
-- 
2.34.1

v4-0002-Add-LSNTimeStream-for-converting-LSN-time.patchtext/x-patch; charset=US-ASCII; name=v4-0002-Add-LSNTimeStream-for-converting-LSN-time.patchDownload
From 0b2ab6bb6507cec869f6f55a78016d8c446a7b2f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:40:27 -0500
Subject: [PATCH v4 2/6] Add LSNTimeStream for converting LSN <-> time

Add a new structure, LSNTimeStream, consisting of LSNTimes -- each an
LSN, time pair. The LSNTimeStream is fixed size, so when a new LSNTime
is inserted to a full LSNTimeStream, an LSNTime is dropped and the new
LSNTime is inserted. We drop the LSNTime whose absence would cause the
least error when interpolating between its adjoining points.

LSN <-> time conversions can be done using linear interpolation with two
LSNTimes on the LSNTimeStream.

This commit does not add a global instance of LSNTimeStream. It adds the
structures and functions needed to maintain and access such a stream.
---
 src/backend/utils/activity/pgstat_wal.c | 233 ++++++++++++++++++++++++
 src/include/pgstat.h                    |  32 ++++
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 267 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 0e374f133a9..d76ace5cbfc 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,11 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
+#include "utils/builtins.h"
 #include "utils/pgstat_internal.h"
+#include "utils/timestamp.h"
 
 
 PgStat_PendingWalStats PendingWalStats = {0};
@@ -32,6 +35,11 @@ PgStat_PendingWalStats PendingWalStats = {0};
 static WalUsage prevWalUsage;
 
 
+static void lsntime_insert(LSNTimeStream *stream, TimestampTz time, XLogRecPtr lsn);
+
+XLogRecPtr	estimate_lsn_at_time(const LSNTimeStream *stream, TimestampTz time);
+TimestampTz estimate_time_at_lsn(const LSNTimeStream *stream, XLogRecPtr lsn);
+
 /*
  * Calculate how much WAL usage counters have increased and update
  * shared WAL and IO statistics.
@@ -184,3 +192,228 @@ pgstat_wal_snapshot_cb(void)
 		   sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Given three LSNTimes, calculate the area of the triangle they form were they
+ * plotted with time on the X axis and LSN on the Y axis.
+ */
+static int
+lsn_ts_calculate_error_area(LSNTime *left, LSNTime *mid, LSNTime *right)
+{
+	int			rectangle_all = (right->time - left->time) * (right->lsn - left->lsn);
+	int			triangle1 = rectangle_all / 2;
+	int			triangle2 = (mid->lsn - left->lsn) * (mid->time - left->time) / 2;
+	int			triangle3 = (right->lsn - mid->lsn) * (right->time - mid->time) / 2;
+	int			rectangle_part = (right->lsn - mid->lsn) * (mid->time - left->time);
+
+	return rectangle_all - triangle1 - triangle2 - triangle3 - rectangle_part;
+}
+
+/*
+ * Determine which LSNTime to drop from a full LSNTimeStream. Once the LSNTime
+ * is dropped, points between it and either of its adjacent LSNTimes will be
+ * interpolated between those two LSNTimes instead. To keep the LSNTimeStream
+ * as accurate as possible, drop the LSNTime whose absence would have the least
+ * impact on future interpolations.
+ *
+ * We determine the error that would be introduced by dropping a point on the
+ * stream by calculating the area of the triangle formed by the LSNTime and its
+ * adjacent LSNTimes. We do this for each LSNTime in the stream (except for the
+ * first and last LSNTimes) and choose the LSNTime with the smallest error
+ * (area). We avoid extrapolation by never dropping the first or last points.
+ */
+static int
+lsntime_to_drop(LSNTimeStream *stream)
+{
+	int			min_area = INT_MAX;
+	int			target_point = stream->length - 1;
+
+	/* Don't drop points if free space available */
+	Assert(stream->length == LSNTIMESTREAM_VOLUME);
+
+	for (int i = stream->length - 1; i-- > 0;)
+	{
+		LSNTime    *left = &stream->data[i - 1];
+		LSNTime    *mid = &stream->data[i];
+		LSNTime    *right = &stream->data[i + 1];
+		int			area = lsn_ts_calculate_error_area(left, mid, right);
+
+		if (abs(area) < abs(min_area))
+		{
+			min_area = area;
+			target_point = i;
+		}
+	}
+
+	return target_point;
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeStream in the first available element,
+ * or, if there are no empty elements, drop an LSNTime from the stream, move
+ * all LSNTimes down and insert the new LSNTime into the element at index 0.
+ */
+void
+lsntime_insert(LSNTimeStream *stream, TimestampTz time,
+			   XLogRecPtr lsn)
+{
+	int			drop;
+	LSNTime		entrant = {.lsn = lsn,.time = time};
+
+	if (stream->length < LSNTIMESTREAM_VOLUME)
+	{
+		/*
+		 * The new entry should exceed the most recent entry to ensure time
+		 * moves forward on the stream.
+		 */
+		Assert(stream->length == 0 ||
+			   (lsn >= stream->data[LSNTIMESTREAM_VOLUME - stream->length].lsn &&
+				time >= stream->data[LSNTIMESTREAM_VOLUME - stream->length].time));
+
+		/*
+		 * If there are unfilled elements in the stream, insert the passed-in
+		 * LSNTime into the tail of the array.
+		 */
+		stream->length++;
+		stream->data[LSNTIMESTREAM_VOLUME - stream->length] = entrant;
+		return;
+	}
+
+	drop = lsntime_to_drop(stream);
+	if (drop < 0 || drop >= stream->length)
+	{
+		elog(WARNING, "Unable to insert LSNTime to LSNTimeStream. Drop failed.");
+		return;
+	}
+
+	/*
+	 * Drop the LSNTime at index drop by copying the array from drop - 1 to
+	 * drop
+	 */
+	memmove(&stream->data[1], &stream->data[0], sizeof(LSNTime) * drop);
+	stream->data[0] = entrant;
+}
+
+/*
+ * Translate time to a LSN using the provided stream. The stream will not
+ * be modified.
+ */
+XLogRecPtr
+estimate_lsn_at_time(const LSNTimeStream *stream, TimestampTz time)
+{
+	XLogRecPtr	result;
+	int64		time_elapsed,
+				lsns_elapsed;
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the provided time is before DB startup, the best we can do is return
+	 * the start LSN.
+	 */
+	if (time < start.time)
+		return start.lsn;
+
+	/*
+	 * If the provided time is after now, the current LSN is our best
+	 * estimate.
+	 */
+	if (time >= end.time)
+		return end.lsn;
+
+	/*
+	 * Loop through the stream. Stop at the first LSNTime earlier than our
+	 * target time. This LSNTime will be our interpolation start point. If
+	 * there's an LSNTime later than that, then that will be our interpolation
+	 * end point.
+	 */
+	for (int i = LSNTIMESTREAM_VOLUME - stream->length; i < LSNTIMESTREAM_VOLUME; i++)
+	{
+		if (stream->data[i].time > time)
+			continue;
+
+		start = stream->data[i];
+		if (i > 0)
+			end = stream->data[i - 1];
+		goto stop;
+	}
+
+	/*
+	 * If we exhausted the stream, then use its earliest LSNTime as our
+	 * interpolation end point.
+	 */
+	if (stream->length > 0)
+		end = stream->data[stream->length - 1];
+
+stop:
+	Assert(end.time > start.time);
+	Assert(end.lsn > start.lsn);
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+	result = (double) (time - start.time) / time_elapsed * lsns_elapsed + start.lsn;
+	return Max(result, 0);
+}
+
+/*
+ * Translate lsn to a time using the provided stream. The stream will not
+ * be modified.
+ */
+TimestampTz
+estimate_time_at_lsn(const LSNTimeStream *stream, XLogRecPtr lsn)
+{
+	int64		time_elapsed,
+				lsns_elapsed;
+	TimestampTz result;
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the LSN is before DB startup, the best we can do is return that
+	 * time.
+	 */
+	if (lsn <= start.lsn)
+		return start.time;
+
+	/*
+	 * If the target LSN is after the current insert LSN, the current time is
+	 * our best estimate.
+	 */
+	if (lsn >= end.lsn)
+		return end.time;
+
+	/*
+	 * Loop through the stream. Stop at the first LSNTime earlier than our
+	 * target LSN. This LSNTime will be our interpolation start point. If
+	 * there's an LSNTime later than that, then that will be our interpolation
+	 * end point.
+	 */
+	for (int i = LSNTIMESTREAM_VOLUME - stream->length; i < LSNTIMESTREAM_VOLUME; i++)
+	{
+		if (stream->data[i].lsn > lsn)
+			continue;
+
+		start = stream->data[i];
+		if (i > 0)
+			end = stream->data[i - 1];
+		goto stop;
+	}
+
+	/*
+	 * If we exhausted the stream, then use its earliest LSNTime as our
+	 * interpolation end point.
+	 */
+	if (stream->length > 0)
+		end = stream->data[stream->length - 1];
+
+stop:
+	Assert(end.time > start.time);
+	Assert(end.lsn > start.lsn);
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+	result = (double) (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
+	return Max(result, 0);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710e..af348be839c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -428,6 +429,37 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter autoanalyze_count;
 } PgStat_StatTabEntry;
 
+/*
+ * The elements of an LSNTimeStream. Each LSNTime represents one or more time,
+ * LSN pairs. The LSN is typically the insert LSN recorded at the time.
+ */
+typedef struct LSNTime
+{
+	TimestampTz time;
+	XLogRecPtr	lsn;
+} LSNTime;
+
+#define LSNTIMESTREAM_VOLUME 64
+
+/*
+ * An LSN time stream is an array consisting of LSNTimes from most to least
+ * recent. The array is filled from end to start before the contents of any
+ * elements are merged. Once the LSNTimeStream length == volume (the array is
+ * full), an LSNTime is dropped, the new LSNTime is added at index 0, and the
+ * intervening LSNTimes are moved down by one.
+ *
+ * When dropping an LSNTime, we attempt to pick the member which would
+ * introduce the least error into the stream. See lsntime_to_drop() for more
+ * details.
+ *
+ * Use the stream for LSN <-> time conversion using linear interpolation.
+ */
+typedef struct LSNTimeStream
+{
+	int			length;
+	LSNTime		data[LSNTIMESTREAM_VOLUME];
+} LSNTimeStream;
+
 typedef struct PgStat_WalStats
 {
 	PgStat_Counter wal_records;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61ad417cde6..4c065e24ba7 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1584,6 +1584,8 @@ LogicalTapeSet
 LsnReadQueue
 LsnReadQueueNextFun
 LsnReadQueueNextStatus
+LSNTime
+LSNTimeStream
 LtreeGistOptions
 LtreeSignature
 MAGIC
-- 
2.34.1

#11Melanie Plageman
melanieplageman@gmail.com
In reply to: Andrey M. Borodin (#8)
Re: Add LSN <-> time conversion functionality

Thanks so much Bharath, Andrey, and Ilya for the review!

I've posted a new version here [1]/messages/by-id/CAAKRu_a6WSkWPtJCw=W+P+g-Fw9kfA_t8sMx99dWpMiGHCqJNA@mail.gmail.com which addresses some of your
concerns. I'll comment on those it does not address inline.

On Thu, May 30, 2024 at 1:26 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

=== Questions ===
1. The patch does not handle server restart. All pages will need freeze after any crash?

I haven't fixed this yet. See my email for some thoughts on what I
should do here.

2. Some benchmarks to proof the patch does not have CPU footprint.

This is still a todo. Typically when designing a benchmark like this,
I would want to pick a worst-case workload to see how bad it could be.
I wonder if just a write heavy workload like pgbench builtin tpcb-like
would be sufficient?

=== Nits ===
"Timeline" term is already taken.

I changed it to LSNTimeStream. What do you think?

The patch needs rebase due to some header changes.

I did this.

Tests fail on Windows.

I think this was because of the compiler warnings, but I need to
double-check now.

The patch lacks tests.

I thought about this a bit. I wonder what kind of tests make sense.

I could
1) Add tests with the existing stats tests
(src/test/regress/sql/stats.sql) and just test that bgwriter is in
fact adding to the time stream.

2) Or should I add some infrastructure to be able to create an
LSNTimeStream and then insert values to it and do some validations of
what is added? I did a version of this but it is just much more
annoying with C & SQL than with python (where I tried out my
algorithms) [2]https://gist.github.com/melanieplageman/95126993bcb43d4b4042099e9d0ccc11.

Some docs would be nice, but the feature is for developers.

I added some docs.

Mapping is protected for multithreaded access by walstats LWlock and might have tuplestore_putvalues() under that lock. That might be a little dangerous, if tuplestore will be on-disk for some reason (should not happen).

Ah, great point! I forgot about the *fetch_stat*() functions. I used
pgstat_fetch_stat_wal() in the latest version so I have a local copy
that I can stuff into the tuplestore without any locking. It won't be
as up-to-date, but I think that is 100% okay for this function.

- Melanie

[1]: /messages/by-id/CAAKRu_a6WSkWPtJCw=W+P+g-Fw9kfA_t8sMx99dWpMiGHCqJNA@mail.gmail.com
[2]: https://gist.github.com/melanieplageman/95126993bcb43d4b4042099e9d0ccc11

#12Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#10)
5 attachment(s)
Re: Add LSN <-> time conversion functionality

On Wed, Jun 26, 2024 at 10:04 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

I've implemented these review points in the attached v4.

I realized the docs had a compilation error. Attached v5 fixes that as
well as three bugs I found while using this patch set more intensely
today.

I see Michael has been working on some crash safety for stats here
[1]: /messages/by-id/ZnEiqAITL-VgZDoY@paquier.xyz
haven't examined his patch functionality yet, though.

I also had an off-list conversation with Robert where he suggested I
could perhaps change the user-facing functions for estimating an
LSN/time conversion to instead return a floor and a ceiling -- instead
of linearly interpolating a guess. This would be a way to keep users
from misunderstanding the accuracy of the functions to translate LSN
<-> time. I'm interested in what others think of this.

I like this idea a lot because it allows me to worry less about how I
decide to compress the data and whether or not it will be accurate for
use cases different than my own (the opportunistic freezing
heuristic). If I can provide a floor and a ceiling that are definitely
accurate, I don't have to worry about misleading people.

- Melanie

[1]: /messages/by-id/ZnEiqAITL-VgZDoY@paquier.xyz

Attachments:

v5-0002-Add-LSNTimeStream-for-converting-LSN-time.patchtext/x-patch; charset=US-ASCII; name=v5-0002-Add-LSNTimeStream-for-converting-LSN-time.patchDownload
From f492af31c1b9917aa27ba3ad76560e59f3fd5c9b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:40:27 -0500
Subject: [PATCH v5 2/6] Add LSNTimeStream for converting LSN <-> time

Add a new structure, LSNTimeStream, consisting of LSNTimes -- each an
LSN, time pair. The LSNTimeStream is fixed size, so when a new LSNTime
is inserted to a full LSNTimeStream, an LSNTime is dropped and the new
LSNTime is inserted. We drop the LSNTime whose absence would cause the
least error when interpolating between its adjoining points.

LSN <-> time conversions can be done using linear interpolation with two
LSNTimes on the LSNTimeStream.

This commit does not add a global instance of LSNTimeStream. It adds the
structures and functions needed to maintain and access such a stream.
---
 src/backend/utils/activity/pgstat_wal.c | 233 ++++++++++++++++++++++++
 src/include/pgstat.h                    |  32 ++++
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 267 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 0e374f133a9..cef9429994c 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,11 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
+#include "utils/builtins.h"
 #include "utils/pgstat_internal.h"
+#include "utils/timestamp.h"
 
 
 PgStat_PendingWalStats PendingWalStats = {0};
@@ -32,6 +35,11 @@ PgStat_PendingWalStats PendingWalStats = {0};
 static WalUsage prevWalUsage;
 
 
+static void lsntime_insert(LSNTimeStream *stream, TimestampTz time, XLogRecPtr lsn);
+
+XLogRecPtr	estimate_lsn_at_time(const LSNTimeStream *stream, TimestampTz time);
+TimestampTz estimate_time_at_lsn(const LSNTimeStream *stream, XLogRecPtr lsn);
+
 /*
  * Calculate how much WAL usage counters have increased and update
  * shared WAL and IO statistics.
@@ -184,3 +192,228 @@ pgstat_wal_snapshot_cb(void)
 		   sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Given three LSNTimes, calculate the area of the triangle they form were they
+ * plotted with time on the X axis and LSN on the Y axis.
+ */
+static int
+lsn_ts_calculate_error_area(LSNTime *left, LSNTime *mid, LSNTime *right)
+{
+	int			rectangle_all = (right->time - left->time) * (right->lsn - left->lsn);
+	int			triangle1 = rectangle_all / 2;
+	int			triangle2 = (mid->lsn - left->lsn) * (mid->time - left->time) / 2;
+	int			triangle3 = (right->lsn - mid->lsn) * (right->time - mid->time) / 2;
+	int			rectangle_part = (right->lsn - mid->lsn) * (mid->time - left->time);
+
+	return rectangle_all - triangle1 - triangle2 - triangle3 - rectangle_part;
+}
+
+/*
+ * Determine which LSNTime to drop from a full LSNTimeStream. Once the LSNTime
+ * is dropped, points between it and either of its adjacent LSNTimes will be
+ * interpolated between those two LSNTimes instead. To keep the LSNTimeStream
+ * as accurate as possible, drop the LSNTime whose absence would have the least
+ * impact on future interpolations.
+ *
+ * We determine the error that would be introduced by dropping a point on the
+ * stream by calculating the area of the triangle formed by the LSNTime and its
+ * adjacent LSNTimes. We do this for each LSNTime in the stream (except for the
+ * first and last LSNTimes) and choose the LSNTime with the smallest error
+ * (area). We avoid extrapolation by never dropping the first or last points.
+ */
+static int
+lsntime_to_drop(LSNTimeStream *stream)
+{
+	int			min_area = INT_MAX;
+	int			target_point = stream->length - 1;
+
+	/* Don't drop points if free space available */
+	Assert(stream->length == LSNTIMESTREAM_VOLUME);
+
+	for (int i = stream->length - 1; i-- > 0;)
+	{
+		LSNTime    *left = &stream->data[i - 1];
+		LSNTime    *mid = &stream->data[i];
+		LSNTime    *right = &stream->data[i + 1];
+		int			area = lsn_ts_calculate_error_area(left, mid, right);
+
+		if (abs(area) < abs(min_area))
+		{
+			min_area = area;
+			target_point = i;
+		}
+	}
+
+	return target_point;
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeStream in the first available element,
+ * or, if there are no empty elements, drop an LSNTime from the stream, move
+ * all LSNTimes down and insert the new LSNTime into the element at index 0.
+ */
+void
+lsntime_insert(LSNTimeStream *stream, TimestampTz time,
+			   XLogRecPtr lsn)
+{
+	int			drop;
+	LSNTime		entrant = {.lsn = lsn,.time = time};
+
+	if (stream->length < LSNTIMESTREAM_VOLUME)
+	{
+		/*
+		 * The new entry should exceed the most recent entry to ensure time
+		 * moves forward on the stream.
+		 */
+		Assert(stream->length == 0 ||
+			   (lsn >= stream->data[LSNTIMESTREAM_VOLUME - stream->length].lsn &&
+				time >= stream->data[LSNTIMESTREAM_VOLUME - stream->length].time));
+
+		/*
+		 * If there are unfilled elements in the stream, insert the passed-in
+		 * LSNTime into the tail of the array.
+		 */
+		stream->length++;
+		stream->data[LSNTIMESTREAM_VOLUME - stream->length] = entrant;
+		return;
+	}
+
+	drop = lsntime_to_drop(stream);
+	if (drop < 0 || drop >= stream->length)
+	{
+		elog(WARNING, "Unable to insert LSNTime to LSNTimeStream. Drop failed.");
+		return;
+	}
+
+	/*
+	 * Drop the LSNTime at index drop by copying the array from drop - 1 to
+	 * drop
+	 */
+	memmove(&stream->data[1], &stream->data[0], sizeof(LSNTime) * drop);
+	stream->data[0] = entrant;
+}
+
+/*
+ * Translate time to a LSN using the provided stream. The stream will not
+ * be modified.
+ */
+XLogRecPtr
+estimate_lsn_at_time(const LSNTimeStream *stream, TimestampTz time)
+{
+	XLogRecPtr	result;
+	int64		time_elapsed,
+				lsns_elapsed;
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the provided time is before DB startup, the best we can do is return
+	 * the start LSN.
+	 */
+	if (time < start.time)
+		return start.lsn;
+
+	/*
+	 * If the provided time is after now, the current LSN is our best
+	 * estimate.
+	 */
+	if (time >= end.time)
+		return end.lsn;
+
+	/*
+	 * Loop through the stream. Stop at the first LSNTime earlier than our
+	 * target time. This LSNTime will be our interpolation start point. If
+	 * there's an LSNTime later than that, then that will be our interpolation
+	 * end point.
+	 */
+	for (int i = LSNTIMESTREAM_VOLUME - stream->length; i < LSNTIMESTREAM_VOLUME; i++)
+	{
+		if (stream->data[i].time > time)
+			continue;
+
+		start = stream->data[i];
+		if (i > LSNTIMESTREAM_VOLUME)
+			end = stream->data[i - 1];
+		goto stop;
+	}
+
+	/*
+	 * If we exhausted the stream, then use its earliest LSNTime as our
+	 * interpolation end point.
+	 */
+	if (stream->length > 0)
+		end = stream->data[LSNTIMESTREAM_VOLUME - 1];
+
+stop:
+	Assert(end.time > start.time);
+	Assert(end.lsn > start.lsn);
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+	result = (double) (time - start.time) / time_elapsed * lsns_elapsed + start.lsn;
+	return Max(result, 0);
+}
+
+/*
+ * Translate lsn to a time using the provided stream. The stream will not
+ * be modified.
+ */
+TimestampTz
+estimate_time_at_lsn(const LSNTimeStream *stream, XLogRecPtr lsn)
+{
+	int64		time_elapsed,
+				lsns_elapsed;
+	TimestampTz result;
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the LSN is before DB startup, the best we can do is return that
+	 * time.
+	 */
+	if (lsn <= start.lsn)
+		return start.time;
+
+	/*
+	 * If the target LSN is after the current insert LSN, the current time is
+	 * our best estimate.
+	 */
+	if (lsn >= end.lsn)
+		return end.time;
+
+	/*
+	 * Loop through the stream. Stop at the first LSNTime earlier than our
+	 * target LSN. This LSNTime will be our interpolation start point. If
+	 * there's an LSNTime later than that, then that will be our interpolation
+	 * end point.
+	 */
+	for (int i = LSNTIMESTREAM_VOLUME - stream->length; i < LSNTIMESTREAM_VOLUME; i++)
+	{
+		if (stream->data[i].lsn > lsn)
+			continue;
+
+		start = stream->data[i];
+		if (i > LSNTIMESTREAM_VOLUME - stream->length)
+			end = stream->data[i - 1];
+		goto stop;
+	}
+
+	/*
+	 * If we exhausted the stream, then use its earliest LSNTime as our
+	 * interpolation end point.
+	 */
+	if (stream->length > 0)
+		end = stream->data[LSNTIMESTREAM_VOLUME - 1];
+
+stop:
+	Assert(end.time > start.time);
+	Assert(end.lsn > start.lsn);
+	time_elapsed = end.time - start.time;
+	Assert(time_elapsed != 0);
+	lsns_elapsed = end.lsn - start.lsn;
+	Assert(lsns_elapsed != 0);
+	result = (double) (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
+	return Max(result, 0);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710e..af348be839c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -428,6 +429,37 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter autoanalyze_count;
 } PgStat_StatTabEntry;
 
+/*
+ * The elements of an LSNTimeStream. Each LSNTime represents one or more time,
+ * LSN pairs. The LSN is typically the insert LSN recorded at the time.
+ */
+typedef struct LSNTime
+{
+	TimestampTz time;
+	XLogRecPtr	lsn;
+} LSNTime;
+
+#define LSNTIMESTREAM_VOLUME 64
+
+/*
+ * An LSN time stream is an array consisting of LSNTimes from most to least
+ * recent. The array is filled from end to start before the contents of any
+ * elements are merged. Once the LSNTimeStream length == volume (the array is
+ * full), an LSNTime is dropped, the new LSNTime is added at index 0, and the
+ * intervening LSNTimes are moved down by one.
+ *
+ * When dropping an LSNTime, we attempt to pick the member which would
+ * introduce the least error into the stream. See lsntime_to_drop() for more
+ * details.
+ *
+ * Use the stream for LSN <-> time conversion using linear interpolation.
+ */
+typedef struct LSNTimeStream
+{
+	int			length;
+	LSNTime		data[LSNTIMESTREAM_VOLUME];
+} LSNTimeStream;
+
 typedef struct PgStat_WalStats
 {
 	PgStat_Counter wal_records;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 82b3b411fb5..a5851d44b16 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1584,6 +1584,8 @@ LogicalTapeSet
 LsnReadQueue
 LsnReadQueueNextFun
 LsnReadQueueNextStatus
+LSNTime
+LSNTimeStream
 LtreeGistOptions
 LtreeSignature
 MAGIC
-- 
2.34.1

v5-0004-Bgwriter-maintains-global-LSNTimeStream.patchtext/x-patch; charset=US-ASCII; name=v5-0004-Bgwriter-maintains-global-LSNTimeStream.patchDownload
From af14cc7652649e1e651ab73c2041b5bf434a57c3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:32:40 -0500
Subject: [PATCH v5 4/6] Bgwriter maintains global LSNTimeStream

Insert new LSN, time pairs to the global LSNTimeStream stored in
PgStat_WalStats in the background writer's main loop. This ensures that
new values are added to the stream in a regular manner.
---
 src/backend/postmaster/bgwriter.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..02b039cfacf 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -273,6 +273,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_lsn;
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
@@ -284,11 +285,15 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 			 * start of a record, whereas last_snapshot_lsn points just past
 			 * the end of the record.
 			 */
-			if (now >= timeout &&
-				last_snapshot_lsn <= GetLastImportantRecPtr())
+			if (now >= timeout)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
-				last_snapshot_ts = now;
+				current_lsn = GetLastImportantRecPtr();
+				if (last_snapshot_lsn <= current_lsn)
+				{
+					last_snapshot_lsn = LogStandbySnapshot();
+					last_snapshot_ts = now;
+					pgstat_wal_update_lsntime_stream(now, current_lsn);
+				}
 			}
 		}
 
-- 
2.34.1

v5-0001-Record-LSN-at-postmaster-startup.patchtext/x-patch; charset=US-ASCII; name=v5-0001-Record-LSN-at-postmaster-startup.patchDownload
From 693cd6d15b0e60c16f8e7977e1631676f4ca7c5d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 5 Dec 2023 07:29:39 -0500
Subject: [PATCH v5 1/6] Record LSN at postmaster startup

The insert_lsn at postmaster startup can be used along with PgStartTime
as seed values for a timeline mapping LSNs to time. Future commits will
add such a structure for LSN <-> time conversions. A start LSN allows
for such conversions before even inserting a value into the timeline.
The current time and current insert LSN can be used along with
PgStartTime and PgStartLSN.

This is WIP, as I'm not sure if I did this in the right place.
---
 src/backend/access/transam/xlog.c   | 2 ++
 src/backend/postmaster/postmaster.c | 2 ++
 src/include/utils/builtins.h        | 3 +++
 3 files changed, 7 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d36272ab4ff..5be3361582e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -142,6 +142,8 @@ bool		XLOG_DEBUG = false;
 
 int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
 
+XLogRecPtr	PgStartLSN = InvalidXLogRecPtr;
+
 /*
  * Number of WAL insertion locks to use. A higher value allows more insertions
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bf0241aed0c..f1b60fe6cee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -117,6 +117,7 @@
 #include "storage/proc.h"
 #include "tcop/backend_startup.h"
 #include "tcop/tcopprot.h"
+#include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/memutils.h"
 #include "utils/pidfile.h"
@@ -1345,6 +1346,7 @@ PostmasterMain(int argc, char *argv[])
 	 * Remember postmaster startup time
 	 */
 	PgStartTime = GetCurrentTimestamp();
+	PgStartLSN = GetXLogInsertRecPtr();
 
 	/*
 	 * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 359c570f23e..16a7a058bc7 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -17,6 +17,7 @@
 #include "fmgr.h"
 #include "nodes/nodes.h"
 #include "utils/fmgrprotos.h"
+#include "access/xlogdefs.h"
 
 /* Sign + the most decimal digits an 8-byte number could have */
 #define MAXINT8LEN 20
@@ -85,6 +86,8 @@ extern void generate_operator_clause(fmStringInfo buf,
 									 Oid opoid,
 									 const char *rightop, Oid rightoptype);
 
+extern PGDLLIMPORT XLogRecPtr PgStartLSN;
+
 /* varchar.c */
 extern int	bpchartruelen(char *s, int len);
 
-- 
2.34.1

v5-0005-Add-time-LSN-translation-functions.patchtext/x-patch; charset=US-ASCII; name=v5-0005-Add-time-LSN-translation-functions.patchDownload
From f384a26ef0264b688b7be1d1e0a18b17313a3ffc Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 21 Feb 2024 20:06:29 -0500
Subject: [PATCH v5 5/6] Add time <-> LSN translation functions

Previous commits added a global LSNTimeStream, maintained by background
writer, that allows approximate translations between time and LSNs.

Add SQL-callable functions to convert from LSN to time and back and a
SQL-callable function returning the entire LSNTimeStream.

This could be useful in combination with SQL-callable functions
accessing a page LSN to approximate the time of last modification of a
page or estimating the LSN consumption rate to moderate maintenance
processes and balance system resource utilization.
---
 doc/src/sgml/monitoring.sgml            | 66 +++++++++++++++++++++++++
 src/backend/utils/activity/pgstat_wal.c | 56 +++++++++++++++++++++
 src/include/catalog/pg_proc.dat         | 22 +++++++++
 3 files changed, 144 insertions(+)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 991f6299075..979c193a721 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3195,6 +3195,72 @@ description | Waiting for a newly initialized WAL file to reach durable storage
    </tgroup>
   </table>
 
+  <para>
+  In addition to these WAL stats, a stream of LSN-time pairs is accessible
+  via the functions shown in <xref linkend="functions-lsn-time-stream"/>.
+  </para>
+
+  <table id="functions-lsn-time-stream">
+   <title>LSN Time Stream Information Functions</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       Function
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_estimate_lsn_at_time</primary>
+       </indexterm>
+       <function>pg_estimate_lsn_at_time</function> ( <type>timestamp with time zone</type> )
+       <returnvalue>pg_lsn</returnvalue>
+      </para>
+      <para>
+       Returns the estimated lsn at the provided time.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_estimate_lsn_at_time</primary>
+       </indexterm>
+       <function>pg_estimate_lsn_at_time</function> ( <type>pg_lsn</type> )
+       <returnvalue>timestamp with time zone</returnvalue>
+      </para>
+      <para>
+        Returns the estimated time at the provided lsn.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_lsntime_stream</primary>
+       </indexterm>
+       <function>pg_lsntime_stream</function> ()
+       <returnvalue>record</returnvalue>
+       ( <parameter>time</parameter> <type>timestamp with time zone</type>,
+       <parameter>lsn</parameter> <type>pg_lsnwith time zone</type>)
+      </para>
+      <para>
+       Returns all of the LSN-time pairs in the current LSN time stream.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+
+
 </sect2>
 
  <sect2 id="monitoring-pg-stat-database-view">
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 68bc5b4e9af..2e05eb1a4f3 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -19,7 +19,9 @@
 
 #include "access/xlog.h"
 #include "executor/instrument.h"
+#include "funcapi.h"
 #include "utils/builtins.h"
+#include "utils/pg_lsn.h"
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
@@ -427,3 +429,57 @@ pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn)
 	lsntime_insert(&stats_shmem->stats.stream, time, lsn);
 	LWLockRelease(&stats_shmem->lock);
 }
+
+PG_FUNCTION_INFO_V1(pg_estimate_lsn_at_time);
+PG_FUNCTION_INFO_V1(pg_estimate_time_at_lsn);
+PG_FUNCTION_INFO_V1(pg_lsntime_stream);
+
+Datum
+pg_estimate_time_at_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn = PG_GETARG_LSN(0);
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+	TimestampTz result;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_time_at_lsn(&stats_shmem->stats.stream, lsn);
+	LWLockRelease(&stats_shmem->lock);
+
+	PG_RETURN_TIMESTAMPTZ(result);
+}
+
+Datum
+pg_estimate_lsn_at_time(PG_FUNCTION_ARGS)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+	TimestampTz time = PG_GETARG_TIMESTAMPTZ(0);
+	XLogRecPtr	result;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_lsn_at_time(&stats_shmem->stats.stream, time);
+	LWLockRelease(&stats_shmem->lock);
+
+	PG_RETURN_LSN(result);
+}
+
+Datum
+pg_lsntime_stream(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_WalStats *stats = pgstat_fetch_stat_wal();
+	LSNTimeStream *stream = &stats->stream;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	for (int i = LSNTIMESTREAM_VOLUME - stream->length; i < LSNTIMESTREAM_VOLUME; i++)
+	{
+		Datum		values[2] = {0};
+		bool		nulls[2] = {0};
+
+		values[0] = stream->data[i].time;
+		values[1] = stream->data[i].lsn;
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6a5476d3c4c..8ab14b49b2a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6342,6 +6342,28 @@
   prorettype => 'timestamptz', proargtypes => 'xid',
   prosrc => 'pg_xact_commit_timestamp' },
 
+{ oid => '9997',
+  descr => 'get approximate LSN at a particular point in time',
+  proname => 'pg_estimate_lsn_at_time', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => 'timestamptz',
+  prosrc => 'pg_estimate_lsn_at_time' },
+
+{ oid => '9996',
+  descr => 'get approximate time at a particular LSN',
+  proname => 'pg_estimate_time_at_lsn', provolatile => 'v',
+  prorettype => 'timestamptz', proargtypes => 'pg_lsn',
+  prosrc => 'pg_estimate_time_at_lsn' },
+
+{ oid => '9994',
+  descr => 'print the LSN Time Stream',
+  proname => 'pg_lsntime_stream', prorows => '64',
+  proretset => 't', provolatile => 'v', proparallel => 's',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,pg_lsn}',
+  proargmodes => '{o,o}',
+  proargnames => '{time, lsn}',
+  prosrc => 'pg_lsntime_stream' },
+
 { oid => '6168',
   descr => 'get commit timestamp and replication origin of a transaction',
   proname => 'pg_xact_commit_timestamp_origin', provolatile => 'v',
-- 
2.34.1

v5-0003-Add-LSNTimeStream-to-PgStat_WalStats.patchtext/x-patch; charset=US-ASCII; name=v5-0003-Add-LSNTimeStream-to-PgStat_WalStats.patchDownload
From f78ea2aee794358a8ba1dabece1debd52d48a10c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 21 Feb 2024 20:28:27 -0500
Subject: [PATCH v5 3/6] Add LSNTimeStream to PgStat_WalStats

Add a globally maintained instance of an LSNTimeStream to
PgStat_WalStats and a utility function to insert new values.
---
 src/backend/utils/activity/pgstat_wal.c | 10 ++++++++++
 src/include/pgstat.h                    |  4 ++++
 2 files changed, 14 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index cef9429994c..68bc5b4e9af 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -417,3 +417,13 @@ stop:
 	result = (double) (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
 	return Max(result, 0);
 }
+
+void
+pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	lsntime_insert(&stats_shmem->stats.stream, time, lsn);
+	LWLockRelease(&stats_shmem->lock);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index af348be839c..773e3cd5003 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -470,6 +470,7 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_sync;
 	PgStat_Counter wal_write_time;
 	PgStat_Counter wal_sync_time;
+	LSNTimeStream stream;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -752,6 +753,9 @@ extern void pgstat_execute_transactional_drops(int ndrops, struct xl_xact_stats_
 extern void pgstat_report_wal(bool force);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 
+/* Helpers for maintaining the LSNTimeStream */
+extern void pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn);
+
 
 /*
  * Variables in pgstat.c
-- 
2.34.1

#13Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Melanie Plageman (#11)
Re: Add LSN <-> time conversion functionality

Hi!

I’m doing another iteration over the patchset.

PgStartLSN = GetXLogInsertRecPtr();
Should this be kind of RecoveryEndPtr? How is it expected to behave on Standby in HA cluster, which was doing a crash recovery of 1y WALs in a day, then is in startup for a year as a Hot Standby, and then is promoted?

lsn_ts_calculate_error_area() is prone to overflow. Even int64 does not seem capable to accommodate LSN*time. And the function may return negative result, despite claiming area as a result. It’s intended, but a little misleading.

i-- > 0
Is there a point to do a backward count in the loop?
Consider dropping not one by one, but half of a stream, LSNTimeStream is ~2Kb of cache and it’s loaded as a whole to the cache..

On 27 Jun 2024, at 07:18, Melanie Plageman <melanieplageman@gmail.com> wrote:

2. Some benchmarks to proof the patch does not have CPU footprint.

This is still a todo. Typically when designing a benchmark like this,
I would want to pick a worst-case workload to see how bad it could be.
I wonder if just a write heavy workload like pgbench builtin tpcb-like
would be sufficient?

Increasing background writer activity to maximum and not seeing LSNTimeStream function in `perf top` seems enough to me.

=== Nits ===
"Timeline" term is already taken.

I changed it to LSNTimeStream. What do you think?

Sounds good to me.

Tests fail on Windows.

I think this was because of the compiler warnings, but I need to
double-check now.

Nope, it really looks more serious.
[12:31:25.701] FAILED: src/backend/postgres_lib.a.p/utils_activity_pgstat_wal.c.obj
[12:31:25.701] "cl" "-Isrc\backend\postgres_lib.a.p" "-Isrc\include" "-I..\src\include" "-Ic:\openssl\1.1\include" "-I..\src\include\port\win32" "-I..\src\include\port\win32_msvc" "/MDd" "/FIpostgres_pch.h" "/Yupostgres_pch.h" "/Fpsrc\backend\postgres_lib.a.p\postgres_pch.pch" "/nologo" "/showIncludes" "/utf-8" "/W2" "/Od" "/Zi" "/DWIN32" "/DWINDOWS" "/D__WINDOWS__" "/D__WIN32__" "/D_CRT_SECURE_NO_DEPRECATE" "/D_CRT_NONSTDC_NO_DEPRECATE" "/wd4018" "/wd4244" "/wd4273" "/wd4101" "/wd4102" "/wd4090" "/wd4267" "-DBUILDING_DLL" "/FS" "/FdC:\cirrus\build\src\backend\postgres_lib.pdb" /Fosrc/backend/postgres_lib.a.p/utils_activity_pgstat_wal.c.obj "/c" ../src/backend/utils/activity/pgstat_wal.c
[12:31:25.701] ../src/backend/utils/activity/pgstat_wal.c(433): error C2375: 'pg_estimate_lsn_at_time': redefinition; different linkage
[12:31:25.701] c:\cirrus\build\src\include\utils/fmgrprotos.h(2906): note: see declaration of 'pg_estimate_lsn_at_time'
[12:31:25.701] ../src/backend/utils/activity/pgstat_wal.c(434): error C2375: 'pg_estimate_time_at_lsn': redefinition; different linkage
[12:31:25.701] c:\cirrus\build\src\include\utils/fmgrprotos.h(2905): note: see declaration of 'pg_estimate_time_at_lsn'
[12:31:25.701] ../src/backend/utils/activity/pgstat_wal.c(435): error C2375: 'pg_lsntime_stream': redefinition; different linkage
[12:31:25.858] c:\cirrus\build\src\include\utils/fmgrprotos.h(2904): note: see declaration of 'pg_lsntime_stream'

The patch lacks tests.

I thought about this a bit. I wonder what kind of tests make sense.

I could
1) Add tests with the existing stats tests
(src/test/regress/sql/stats.sql) and just test that bgwriter is in
fact adding to the time stream.

2) Or should I add some infrastructure to be able to create an
LSNTimeStream and then insert values to it and do some validations of
what is added? I did a version of this but it is just much more
annoying with C & SQL than with python (where I tried out my
algorithms) [2].

I think just a test which calls functions and discards the result would greatly increase coverage.

On 29 Jun 2024, at 03:09, Melanie Plageman <melanieplageman@gmail.com> wrote:
change the user-facing functions for estimating an
LSN/time conversion to instead return a floor and a ceiling -- instead
of linearly interpolating a guess. This would be a way to keep users
from misunderstanding the accuracy of the functions to translate LSN
<-> time.

I think this is a good idea. And it covers well “server restart problem”. If API just returns -inf as a boundary, caller can correctly interpret this situation.

Thanks! Looking forward to more timely freezing.

Best regards, Andrey Borodin.

#14Melanie Plageman
melanieplageman@gmail.com
In reply to: Andrey M. Borodin (#13)
5 attachment(s)
Re: Add LSN <-> time conversion functionality

Thanks for the review! v6 attached.

On Sat, Jul 6, 2024 at 1:36 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

PgStartLSN = GetXLogInsertRecPtr();
Should this be kind of RecoveryEndPtr? How is it expected to behave on Standby in HA cluster, which was doing a crash recovery of 1y WALs in a day, then is in startup for a year as a Hot Standby, and then is promoted?

So, I don't think we will allow use of the LSNTimeStream on a standby,
since it is unclear what it would mean on a standby. For example, do
you want to know the time the LSN was generated or the time it was
replayed? Note that bgwriter won't insert values to the time stream on
a standby (it explicitly checks).

But, you bring up an issue that I don't quite know what to do about.
If the standby doesn't have an LSNTimeStream, then when it is
promoted, LSN <-> time conversions of LSNs and times before the
promotion seem impossible. Maybe if the stats file is getting written
out at checkpoints, we could restore from that previous primary's file
after promotion?

This brings up a second issue, which is that, currently, bgwriter
won't insert into the time stream when wal_level is minimal. I'm not
sure exactly how to resolve it because I am relying on the "last
important rec pointer" and the LOG_SNAPSHOT_INTERVAL_MS to throttle
when the bgwriter actually inserts new records into the LSNTimeStream.
I could come up with a different timeout interval for updating the
time stream. Perhaps I should do that?

lsn_ts_calculate_error_area() is prone to overflow. Even int64 does not seem capable to accommodate LSN*time. And the function may return negative result, despite claiming area as a result. It’s intended, but a little misleading.

Ah, great point. I've fixed this.

i-- > 0
Is there a point to do a backward count in the loop?
Consider dropping not one by one, but half of a stream, LSNTimeStream is ~2Kb of cache and it’s loaded as a whole to the cache..

Yes, the backwards looping was super confusing. It was a relic of my
old design. Even without your point about cache locality, the code is
much harder to understand with the backwards looping. I've changed the
array to fill forwards and be accessed with forward loops.

On 27 Jun 2024, at 07:18, Melanie Plageman <melanieplageman@gmail.com> wrote:

2. Some benchmarks to proof the patch does not have CPU footprint.

This is still a todo. Typically when designing a benchmark like this,
I would want to pick a worst-case workload to see how bad it could be.
I wonder if just a write heavy workload like pgbench builtin tpcb-like
would be sufficient?

Increasing background writer activity to maximum and not seeing LSNTimeStream function in `perf top` seems enough to me.

I've got this on my TODO.

Tests fail on Windows.

I think this was because of the compiler warnings, but I need to
double-check now.

Nope, it really looks more serious.
[12:31:25.701] ../src/backend/utils/activity/pgstat_wal.c(433): error C2375: 'pg_estimate_lsn_at_time': redefinition; different linkage

Ah, yes. I mistakenly added the functions to pg_proc.dat and also
called PG_FUNCTION_INFO_V1 for the functions. I've fixed it.

The patch lacks tests.

I thought about this a bit. I wonder what kind of tests make sense.

I could
1) Add tests with the existing stats tests
(src/test/regress/sql/stats.sql) and just test that bgwriter is in
fact adding to the time stream.

2) Or should I add some infrastructure to be able to create an
LSNTimeStream and then insert values to it and do some validations of
what is added? I did a version of this but it is just much more
annoying with C & SQL than with python (where I tried out my
algorithms) [2].

I think just a test which calls functions and discards the result would greatly increase coverage.

I've added tests of the two main conversion functions. I didn't add a
test of the function which gets the whole stream (pg_lsntime_stream())
because I don't think I can guarantee it won't be empty -- so I'm not
sure what I could do with it in a test.

On 29 Jun 2024, at 03:09, Melanie Plageman <melanieplageman@gmail.com> wrote:
change the user-facing functions for estimating an
LSN/time conversion to instead return a floor and a ceiling -- instead
of linearly interpolating a guess. This would be a way to keep users
from misunderstanding the accuracy of the functions to translate LSN
<-> time.

I think this is a good idea. And it covers well “server restart problem”. If API just returns -inf as a boundary, caller can correctly interpret this situation.

Providing "ceiling" and "floor" user functions is still a TODO for me,
however, I think that the patch mostly does handle server restarts.

In the event of a restart, the cumulative stats system will have
persisted our time stream, so the LSNTimeStream will just be read back
in with the rest of the stats. I've added logic to ensure that if the
PgStartLSN is newer than our oldest LSNTimeStream entry, we use the
oldest entry instead of PgStartLSN when doing conversions LSN <->
time.

As for a crash, stats do not persist crashes, but I think Michael's
patch will go in to write out the stats file at checkpoints, and then
this should be good enough.

Is there anything else you think that is an issue with restarts?

Thanks! Looking forward to more timely freezing.

Thanks! I'll be posting a new version of the opportunistic freezing
patch that uses the time stream quite soon, so I hope you'll take a
look at that as well!

- Melanie

Attachments:

v6-0005-Add-time-LSN-translation-functions.patchtext/x-patch; charset=US-ASCII; name=v6-0005-Add-time-LSN-translation-functions.patchDownload
From 49f767bc859356df8b4a6ea03d490b6b1aa1d48d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 21 Feb 2024 20:06:29 -0500
Subject: [PATCH v6 5/6] Add time <-> LSN translation functions

Previous commits added a global LSNTimeStream, maintained by background
writer, that allows approximate translations between time and LSNs.

Add SQL-callable functions to convert from LSN to time and back and a
SQL-callable function returning the entire LSNTimeStream.

This could be useful in combination with SQL-callable functions
accessing a page LSN to approximate the time of last modification of a
page or estimating the LSN consumption rate to moderate maintenance
processes and balance system resource utilization.
---
 doc/src/sgml/monitoring.sgml            | 66 +++++++++++++++++++++++++
 src/backend/utils/activity/pgstat_wal.c | 52 +++++++++++++++++++
 src/include/catalog/pg_proc.dat         | 22 +++++++++
 src/test/regress/expected/stats.out     | 13 +++++
 src/test/regress/sql/stats.sql          |  5 ++
 5 files changed, 158 insertions(+)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 55417a6fa9d..f86e77955d6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3195,6 +3195,72 @@ description | Waiting for a newly initialized WAL file to reach durable storage
    </tgroup>
   </table>
 
+  <para>
+  In addition to these WAL stats, a stream of LSN-time pairs is accessible
+  via the functions shown in <xref linkend="functions-lsn-time-stream"/>.
+  </para>
+
+  <table id="functions-lsn-time-stream">
+   <title>LSN Time Stream Information Functions</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       Function
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_estimate_lsn_at_time</primary>
+       </indexterm>
+       <function>pg_estimate_lsn_at_time</function> ( <type>timestamp with time zone</type> )
+       <returnvalue>pg_lsn</returnvalue>
+      </para>
+      <para>
+       Returns the estimated lsn at the provided time.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_estimate_lsn_at_time</primary>
+       </indexterm>
+       <function>pg_estimate_lsn_at_time</function> ( <type>pg_lsn</type> )
+       <returnvalue>timestamp with time zone</returnvalue>
+      </para>
+      <para>
+        Returns the estimated time at the provided lsn.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_lsntime_stream</primary>
+       </indexterm>
+       <function>pg_lsntime_stream</function> ()
+       <returnvalue>record</returnvalue>
+       ( <parameter>time</parameter> <type>timestamp with time zone</type>,
+       <parameter>lsn</parameter> <type>pg_lsnwith time zone</type>)
+      </para>
+      <para>
+       Returns all of the LSN-time pairs in the current LSN time stream.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+
+
 </sect2>
 
  <sect2 id="monitoring-pg-stat-database-view">
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index c1c3da22b2f..7552a964b80 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -19,8 +19,10 @@
 
 #include "access/xlog.h"
 #include "executor/instrument.h"
+#include "funcapi.h"
 #include "math.h"
 #include "utils/builtins.h"
+#include "utils/pg_lsn.h"
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
@@ -525,3 +527,53 @@ pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn)
 	lsntime_insert(&stats_shmem->stats.stream, time, lsn);
 	LWLockRelease(&stats_shmem->lock);
 }
+
+Datum
+pg_estimate_time_at_lsn(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn = PG_GETARG_LSN(0);
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+	TimestampTz result;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_time_at_lsn(&stats_shmem->stats.stream, lsn);
+	LWLockRelease(&stats_shmem->lock);
+
+	PG_RETURN_TIMESTAMPTZ(result);
+}
+
+Datum
+pg_estimate_lsn_at_time(PG_FUNCTION_ARGS)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+	TimestampTz time = PG_GETARG_TIMESTAMPTZ(0);
+	XLogRecPtr	result;
+
+	LWLockAcquire(&stats_shmem->lock, LW_SHARED);
+	result = estimate_lsn_at_time(&stats_shmem->stats.stream, time);
+	LWLockRelease(&stats_shmem->lock);
+
+	PG_RETURN_LSN(result);
+}
+
+Datum
+pg_lsntime_stream(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_WalStats *stats = pgstat_fetch_stat_wal();
+	LSNTimeStream *stream = &stats->stream;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	for (int i = 0; i < stream->length; i++)
+	{
+		Datum		values[2] = {0};
+		bool		nulls[2] = {0};
+
+		values[0] = stream->data[i].time;
+		values[1] = stream->data[i].lsn;
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 54b50ee5d61..b5d8d0d3673 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6375,6 +6375,28 @@
   prorettype => 'timestamptz', proargtypes => 'xid',
   prosrc => 'pg_xact_commit_timestamp' },
 
+{ oid => '9997',
+  descr => 'get approximate LSN at a particular point in time',
+  proname => 'pg_estimate_lsn_at_time', provolatile => 'v',
+  prorettype => 'pg_lsn', proargtypes => 'timestamptz',
+  prosrc => 'pg_estimate_lsn_at_time' },
+
+{ oid => '9996',
+  descr => 'get approximate time at a particular LSN',
+  proname => 'pg_estimate_time_at_lsn', provolatile => 'v',
+  prorettype => 'timestamptz', proargtypes => 'pg_lsn',
+  prosrc => 'pg_estimate_time_at_lsn' },
+
+{ oid => '9994',
+  descr => 'print the LSN Time Stream',
+  proname => 'pg_lsntime_stream', prorows => '64',
+  proretset => 't', provolatile => 'v', proparallel => 's',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{timestamptz,pg_lsn}',
+  proargmodes => '{o,o}',
+  proargnames => '{time, lsn}',
+  prosrc => 'pg_lsntime_stream' },
+
 { oid => '6168',
   descr => 'get commit timestamp and replication origin of a transaction',
   proname => 'pg_xact_commit_timestamp_origin', provolatile => 'v',
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 6e08898b183..b02b74e5872 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -813,6 +813,19 @@ SELECT (n_tup_ins + n_tup_upd) > 0 AS has_data FROM pg_stat_all_tables
 -----
 -- Test that various stats views are being properly populated
 -----
+SELECT pg_estimate_time_at_lsn(pg_current_wal_insert_lsn()) >
+                              now() - make_interval(years=> 100);
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT pg_estimate_lsn_at_time(now()) - '0/0' > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Test that sessions is incremented when a new session is started in pg_stat_database
 SELECT sessions AS db_stat_sessions FROM pg_stat_database WHERE datname = (SELECT current_database()) \gset
 \c
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index d8ac0d06f48..8562bdb45e8 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -411,6 +411,11 @@ SELECT (n_tup_ins + n_tup_upd) > 0 AS has_data FROM pg_stat_all_tables
 -- Test that various stats views are being properly populated
 -----
 
+SELECT pg_estimate_time_at_lsn(pg_current_wal_insert_lsn()) >
+                              now() - make_interval(years=> 100);
+
+SELECT pg_estimate_lsn_at_time(now()) - '0/0' > 0;
+
 -- Test that sessions is incremented when a new session is started in pg_stat_database
 SELECT sessions AS db_stat_sessions FROM pg_stat_database WHERE datname = (SELECT current_database()) \gset
 \c
-- 
2.34.1

v6-0001-Record-LSN-at-postmaster-startup.patchtext/x-patch; charset=US-ASCII; name=v6-0001-Record-LSN-at-postmaster-startup.patchDownload
From 3df61a9f26f33a88409920587707d269f35eccdf Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 5 Dec 2023 07:29:39 -0500
Subject: [PATCH v6 1/6] Record LSN at postmaster startup

The insert_lsn at postmaster startup can be used along with PgStartTime
as seed values for a timeline mapping LSNs to time. Future commits will
add such a structure for LSN <-> time conversions. A start LSN allows
for such conversions before even inserting a value into the timeline.
The current time and current insert LSN can be used along with
PgStartTime and PgStartLSN.

This is WIP, as I'm not sure if I did this in the right place.
---
 src/backend/access/transam/xlog.c   | 2 ++
 src/backend/postmaster/postmaster.c | 2 ++
 src/include/utils/builtins.h        | 3 +++
 3 files changed, 7 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4a8a2f6098f..fed41b3f992 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -140,6 +140,8 @@ bool		XLOG_DEBUG = false;
 
 int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
 
+XLogRecPtr	PgStartLSN = InvalidXLogRecPtr;
+
 /*
  * Number of WAL insertion locks to use. A higher value allows more insertions
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 02442a4b85a..c637a4229fe 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -117,6 +117,7 @@
 #include "storage/proc.h"
 #include "tcop/backend_startup.h"
 #include "tcop/tcopprot.h"
+#include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/memutils.h"
 #include "utils/pidfile.h"
@@ -1333,6 +1334,7 @@ PostmasterMain(int argc, char *argv[])
 	 * Remember postmaster startup time
 	 */
 	PgStartTime = GetCurrentTimestamp();
+	PgStartLSN = GetXLogInsertRecPtr();
 
 	/*
 	 * Report postmaster status in the postmaster.pid file, to allow pg_ctl to
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 359c570f23e..16a7a058bc7 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -17,6 +17,7 @@
 #include "fmgr.h"
 #include "nodes/nodes.h"
 #include "utils/fmgrprotos.h"
+#include "access/xlogdefs.h"
 
 /* Sign + the most decimal digits an 8-byte number could have */
 #define MAXINT8LEN 20
@@ -85,6 +86,8 @@ extern void generate_operator_clause(fmStringInfo buf,
 									 Oid opoid,
 									 const char *rightop, Oid rightoptype);
 
+extern PGDLLIMPORT XLogRecPtr PgStartLSN;
+
 /* varchar.c */
 extern int	bpchartruelen(char *s, int len);
 
-- 
2.34.1

v6-0003-Add-LSNTimeStream-to-PgStat_WalStats.patchtext/x-patch; charset=US-ASCII; name=v6-0003-Add-LSNTimeStream-to-PgStat_WalStats.patchDownload
From 1f55402be2f1e4bd015432a11640cfe72e44957c Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 21 Feb 2024 20:28:27 -0500
Subject: [PATCH v6 3/6] Add LSNTimeStream to PgStat_WalStats

Add a globally maintained instance of an LSNTimeStream to
PgStat_WalStats and a utility function to insert new values.
---
 src/backend/utils/activity/pgstat_wal.c | 10 ++++++++++
 src/include/pgstat.h                    |  4 ++++
 2 files changed, 14 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index affab8437c8..c1c3da22b2f 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -515,3 +515,13 @@ stop:
 	result = (double) (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
 	return Max(result, 0);
 }
+
+void
+pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	lsntime_insert(&stats_shmem->stats.stream, time, lsn);
+	LWLockRelease(&stats_shmem->lock);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 825cdc8f73a..667f2b93cad 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -470,6 +470,7 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_sync;
 	PgStat_Counter wal_write_time;
 	PgStat_Counter wal_sync_time;
+	LSNTimeStream stream;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -752,6 +753,9 @@ extern void pgstat_execute_transactional_drops(int ndrops, struct xl_xact_stats_
 extern void pgstat_report_wal(bool force);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
 
+/* Helpers for maintaining the LSNTimeStream */
+extern void pgstat_wal_update_lsntime_stream(TimestampTz time, XLogRecPtr lsn);
+
 
 /*
  * Variables in pgstat.c
-- 
2.34.1

v6-0002-Add-LSNTimeStream-for-converting-LSN-time.patchtext/x-patch; charset=US-ASCII; name=v6-0002-Add-LSNTimeStream-for-converting-LSN-time.patchDownload
From d6dc1128f75d883332945ab27f98a8c70b83b607 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:40:27 -0500
Subject: [PATCH v6 2/6] Add LSNTimeStream for converting LSN <-> time

Add a new structure, LSNTimeStream, consisting of LSNTimes -- each an
LSN, time pair. The LSNTimeStream is fixed size, so when a new LSNTime
is inserted to a full LSNTimeStream, an LSNTime is dropped and the new
LSNTime is inserted. We drop the LSNTime whose absence would cause the
least error when interpolating between its adjoining points.

LSN <-> time conversions can be done using linear interpolation with two
LSNTimes on the LSNTimeStream.

This commit does not add a global instance of LSNTimeStream. It adds the
structures and functions needed to maintain and access such a stream.
---
 src/backend/utils/activity/pgstat_wal.c | 323 ++++++++++++++++++++++++
 src/include/pgstat.h                    |  32 +++
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 357 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e2a3f6b865c..affab8437c8 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,12 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
+#include "math.h"
+#include "utils/builtins.h"
 #include "utils/pgstat_internal.h"
+#include "utils/timestamp.h"
 
 
 PgStat_PendingWalStats PendingWalStats = {0};
@@ -32,6 +36,11 @@ PgStat_PendingWalStats PendingWalStats = {0};
 static WalUsage prevWalUsage;
 
 
+static void lsntime_insert(LSNTimeStream *stream, TimestampTz time, XLogRecPtr lsn);
+
+XLogRecPtr	estimate_lsn_at_time(const LSNTimeStream *stream, TimestampTz time);
+TimestampTz estimate_time_at_lsn(const LSNTimeStream *stream, XLogRecPtr lsn);
+
 /*
  * Calculate how much WAL usage counters have increased and update
  * shared WAL and IO statistics.
@@ -192,3 +201,317 @@ pgstat_wal_snapshot_cb(void)
 		   sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Given three LSNTimes, calculate the area of the triangle they form were they
+ * plotted with time on the X axis and LSN on the Y axis. An illustration:
+ *
+ *   LSN
+ *    |
+ *    |                                                         * right
+ *    |
+ *    |
+ *    |
+ *    |                                                * mid    * C
+ *    |
+ *    |
+ *    |
+ *    |  * left                                        * B      * A
+ *    |
+ *    +------------------------------------------------------------------
+ *
+ * The area of the triangle with vertices (left, mid, right) is the error
+ * incurred over the interval [left, right] were we to interpolate with just
+ * [left, right] rather than [left, mid) and [mid, right).
+ */
+static float
+lsn_ts_calculate_error_area(LSNTime *left, LSNTime *mid, LSNTime *right)
+{
+	float		left_time = left->time,
+				left_lsn = left->lsn;
+	float		mid_time = mid->time,
+				mid_lsn = mid->lsn;
+	float		right_time = right->time,
+				right_lsn = right->lsn;
+
+	/* Area of the rectangle with opposing corners left and right */
+	float		rectangle_all = (right_time - left_time) * (right_lsn - left_lsn);
+
+	/* Area of the right triangle with vertices left, right, and A */
+	float		triangle1 = rectangle_all / 2;
+
+	/* Area of the right triangle with vertices left, mid, and B */
+	float		triangle2 = (mid_lsn - left_lsn) * (mid_time - left_time) / 2;
+
+	/* Area of the right triangle with vertices mid, right, and C */
+	float		triangle3 = (right_lsn - mid_lsn) * (right_time - mid_time) / 2;
+
+	/* Area of the rectangle with vertices mid, A, B, and C */
+	float		rectangle_part = (right_lsn - mid_lsn) * (mid_time - left_time);
+
+	/* Area of the triangle with vertices left, mid, and right */
+	return triangle1 - triangle2 - triangle3 - rectangle_part;
+}
+
+/*
+ * Determine which LSNTime to drop from a full LSNTimeStream. Once the LSNTime
+ * is dropped, points between it and either of its adjacent LSNTimes will be
+ * interpolated between those two LSNTimes instead. To keep the LSNTimeStream
+ * as accurate as possible, drop the LSNTime whose absence would have the least
+ * impact on future interpolations.
+ *
+ * We determine the error that would be introduced by dropping a point on the
+ * stream by calculating the area of the triangle formed by the LSNTime and its
+ * adjacent LSNTimes. We do this for each LSNTime in the stream (except for the
+ * first and last LSNTimes) and choose the LSNTime with the smallest error
+ * (area). We avoid extrapolation by never dropping the first or last points.
+ */
+static unsigned int
+lsntime_to_drop(LSNTimeStream *stream)
+{
+	double		min_area;
+	unsigned int target_point;
+
+	/* Don't drop points if free space available */
+	Assert(stream->length == LSNTIMESTREAM_VOLUME);
+
+	min_area = lsn_ts_calculate_error_area(&stream->data[0],
+										   &stream->data[1],
+										   &stream->data[2]);
+
+	target_point = 1;
+
+	for (int i = 1; i < stream->length - 1; i++)
+	{
+		LSNTime    *left = &stream->data[i - 1];
+		LSNTime    *mid = &stream->data[i];
+		LSNTime    *right = &stream->data[i + 1];
+		float		area = lsn_ts_calculate_error_area(left, mid, right);
+
+		if (fabs(area) < fabs(min_area))
+		{
+			min_area = area;
+			target_point = i;
+		}
+	}
+
+	return target_point;
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeStream in the first available element,
+ * or, if there are no empty elements, drop an LSNTime from the stream, move
+ * all the subsequent LSNTimes down and insert the new LSNTime into the tail.
+ */
+void
+lsntime_insert(LSNTimeStream *stream, TimestampTz time,
+			   XLogRecPtr lsn)
+{
+	unsigned int drop;
+	LSNTime		entrant = {.lsn = lsn,.time = time};
+
+	if (stream->length < LSNTIMESTREAM_VOLUME)
+	{
+		/*
+		 * The new entry should exceed the most recent entry to ensure time
+		 * moves forward on the stream.
+		 */
+		Assert(stream->length == 0 ||
+			   (lsn >= stream->data[stream->length - 1].lsn &&
+				time >= stream->data[stream->length - 1].time));
+
+		/*
+		 * If there are unfilled elements in the stream, insert the passed-in
+		 * LSNTime into the current tail of the array.
+		 */
+		stream->data[stream->length++] = entrant;
+		return;
+	}
+
+	drop = lsntime_to_drop(stream);
+
+	/*
+	 * Drop the LSNTime at index drop by copying the array from drop - 1 to
+	 * drop
+	 */
+	memmove(&stream->data[drop],
+			&stream->data[drop + 1],
+			sizeof(LSNTime) * (stream->length - 1 - drop));
+
+	stream->data[stream->length - 1] = entrant;
+}
+
+/*
+ * Translate time to a LSN using the provided stream. The stream will not
+ * be modified.
+ */
+XLogRecPtr
+estimate_lsn_at_time(const LSNTimeStream *stream, TimestampTz time)
+{
+	XLogRecPtr	result;
+	int64		time_elapsed,
+				lsns_elapsed;
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the database has been restarted, PgStartLSN may be after our oldest
+	 * value. In that case, use the oldest value in the time stream as the
+	 * start.
+	 */
+	if (stream->length > 0 && start.time > stream->data[0].time)
+		start = stream->data[0];
+
+	/*
+	 * If the LSN is before our oldest known LSN, the best we can do is return
+	 * our oldest known time.
+	 */
+	if (time < start.time)
+		return start.lsn;
+
+	/*
+	 * If the provided time is after now, the current LSN is our best
+	 * estimate.
+	 */
+	if (time >= end.time)
+		return end.lsn;
+
+	/*
+	 * Loop through the stream. Stop at the first LSNTime later than our
+	 * target time. This LSNTime will be our interpolation end point. If
+	 * there's an LSNTime earlier than that, that will be our interpolation
+	 * start point.
+	 */
+	for (int i = 0; i < stream->length; i++)
+	{
+		if (stream->data[i].time < time)
+			continue;
+
+		end = stream->data[i];
+		if (i > 0)
+			start = stream->data[i - 1];
+		goto stop;
+	}
+
+	/*
+	 * If we exhausted the stream, then use its latest LSNTime as our
+	 * interpolation start point.
+	 */
+	if (stream->length > 0)
+		start = stream->data[stream->length - 1];
+
+stop:
+
+	/*
+	 * In rare cases, the start and end LSN could be the same. If, for
+	 * example, no new records have been inserted since the last one recorded
+	 * in the LSNTimeStream and we are looking for the LSN corresponding to
+	 * the current time.
+	 */
+	if (end.lsn == start.lsn)
+		return end.lsn;
+
+	Assert(end.lsn > start.lsn);
+
+	/*
+	 * It should be extremely rare (if not impossible) for the start time and
+	 * end time to be the same. In this case, just return an LSN halfway
+	 * between the two.
+	 */
+	if (end.time == start.time)
+		return start.lsn + ((end.lsn - start.lsn) / 2);
+
+	Assert(end.time > start.time);
+
+	time_elapsed = end.time - start.time;
+	lsns_elapsed = end.lsn - start.lsn;
+
+	result = (double) (time - start.time) / time_elapsed * lsns_elapsed + start.lsn;
+	return Max(result, 0);
+}
+
+/*
+ * Translate lsn to a time using the provided stream. The stream will not
+ * be modified.
+ */
+TimestampTz
+estimate_time_at_lsn(const LSNTimeStream *stream, XLogRecPtr lsn)
+{
+	int64		time_elapsed,
+				lsns_elapsed;
+	TimestampTz result;
+	LSNTime		start = {.time = PgStartTime,.lsn = PgStartLSN};
+	LSNTime		end = {.time = GetCurrentTimestamp(),.lsn = GetXLogInsertRecPtr()};
+
+	/*
+	 * If the database has been restarted, PgStartLSN may be after our oldest
+	 * value. In that case, use the oldest value in the time stream as the
+	 * start.
+	 */
+	if (stream->length > 0 && start.time > stream->data[0].time)
+		start = stream->data[0];
+
+	/*
+	 * If the LSN is before our oldest known LSN, the best we can do is return
+	 * our oldest known time.
+	 */
+	if (lsn < start.lsn)
+		return start.time;
+
+	/*
+	 * If the target LSN is after the current insert LSN, the current time is
+	 * our best estimate.
+	 */
+	if (lsn >= end.lsn)
+		return end.time;
+
+	/*
+	 * Loop through the stream. Stop at the first LSNTime later than our
+	 * target time. This LSNTime will be our interpolation end point. If
+	 * there's an LSNTime earlier than that, that will be our interpolation
+	 * start point.
+	 */
+	for (int i = 0; i < stream->length; i++)
+	{
+		if (stream->data[i].lsn < lsn)
+			continue;
+
+		end = stream->data[i];
+		if (i > 0)
+			start = stream->data[i - 1];
+		goto stop;
+	}
+
+	/*
+	 * If we exhausted the stream, then use its earliest LSNTime as our
+	 * interpolation end point.
+	 */
+	if (stream->length > 0)
+		start = stream->data[stream->length - 1];
+
+stop:
+
+	/* It should be nearly impossible to have the same start and end time. */
+	if (end.time == start.time)
+		return end.time;
+
+	Assert(end.time > start.time);
+
+	/*
+	 * In rare cases, the start and end LSN could be the same. If, for
+	 * example, no new records have been inserted since the last one recorded
+	 * in the LSNTimeStream and we are looking for the LSN corresponding to
+	 * the current time. In this case, just return a time halfway between
+	 * start and end.
+	 */
+	if (end.lsn == start.lsn)
+		return start.time + ((end.time - start.time) / 2);
+
+	Assert(end.lsn > start.lsn);
+
+	time_elapsed = end.time - start.time;
+	lsns_elapsed = end.lsn - start.lsn;
+
+	result = (double) (lsn - start.lsn) / lsns_elapsed * time_elapsed + start.time;
+	return Max(result, 0);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6b99bb8aadf..825cdc8f73a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -428,6 +429,37 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter autoanalyze_count;
 } PgStat_StatTabEntry;
 
+/*
+ * The elements of an LSNTimeStream. Each LSNTime represents one or more time,
+ * LSN pairs. The LSN is typically the insert LSN recorded at the time.
+ */
+typedef struct LSNTime
+{
+	TimestampTz time;
+	XLogRecPtr	lsn;
+} LSNTime;
+
+#define LSNTIMESTREAM_VOLUME 64
+
+/*
+ * An LSN time stream is an array consisting of LSNTimes from least to most
+ * recent. The array is filled before any element is dropped. Once the
+ * LSNTimeStream length == volume (the array is full), an LSNTime is dropped,
+ * the subsequent LSNTimes are moved down by 1, and the new LSNTime is inserted
+ * at the tail.
+ *
+ * When dropping an LSNTime, we attempt to pick the member which would
+ * introduce the least error into the stream. See lsntime_to_drop() for more
+ * details.
+ *
+ * Use the stream for LSN <-> time conversion using linear interpolation.
+ */
+typedef struct LSNTimeStream
+{
+	int			length;
+	LSNTime		data[LSNTIMESTREAM_VOLUME];
+} LSNTimeStream;
+
 typedef struct PgStat_WalStats
 {
 	PgStat_Counter wal_records;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8de9978ad8d..d924855069c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1587,6 +1587,8 @@ LogicalTapeSet
 LsnReadQueue
 LsnReadQueueNextFun
 LsnReadQueueNextStatus
+LSNTime
+LSNTimeStream
 LtreeGistOptions
 LtreeSignature
 MAGIC
-- 
2.34.1

v6-0004-Bgwriter-maintains-global-LSNTimeStream.patchtext/x-patch; charset=US-ASCII; name=v6-0004-Bgwriter-maintains-global-LSNTimeStream.patchDownload
From 0ab41bd5030caf33c82692c7a3a6618a3771166f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 27 Dec 2023 16:32:40 -0500
Subject: [PATCH v6 4/6] Bgwriter maintains global LSNTimeStream

Insert new LSN, time pairs to the global LSNTimeStream stored in
PgStat_WalStats in the background writer's main loop. This ensures that
new values are added to the stream in a regular manner.
---
 src/backend/postmaster/bgwriter.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..99c2e6eecc3 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -273,6 +273,7 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
+			XLogRecPtr	current_lsn;
 
 			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
 												  LOG_SNAPSHOT_INTERVAL_MS);
@@ -284,11 +285,23 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 			 * start of a record, whereas last_snapshot_lsn points just past
 			 * the end of the record.
 			 */
-			if (now >= timeout &&
-				last_snapshot_lsn <= GetLastImportantRecPtr())
+			if (now >= timeout)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
-				last_snapshot_ts = now;
+				current_lsn = GetLastImportantRecPtr();
+				if (last_snapshot_lsn <= current_lsn)
+				{
+					last_snapshot_lsn = LogStandbySnapshot();
+					last_snapshot_ts = now;
+
+					/*
+					 * After a restart GetXLogInsertRecPtr() may return 0. We
+					 * don't want the timeline to move backwards, though, so
+					 * get the insert LSN instead.
+					 */
+					if (current_lsn == 0)
+						current_lsn = GetXLogInsertRecPtr();
+					pgstat_wal_update_lsntime_stream(now, current_lsn);
+				}
 			}
 		}
 
-- 
2.34.1

#15Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Melanie Plageman (#1)
Re: Add LSN <-> time conversion functionality

This is a copy of my message for pgsql-hackers mailing list. Unfortunately original message was rejected due to one of recipients addresses is blocked.

Show quoted text

On 1 Aug 2024, at 10:54, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 1 Aug 2024, at 05:44, Melanie Plageman <melanieplageman@gmail.com> wrote:

Thanks for the review! v6 attached.

On Sat, Jul 6, 2024 at 1:36 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

PgStartLSN = GetXLogInsertRecPtr();
Should this be kind of RecoveryEndPtr? How is it expected to behave on Standby in HA cluster, which was doing a crash recovery of 1y WALs in a day, then is in startup for a year as a Hot Standby, and then is promoted?

So, I don't think we will allow use of the LSNTimeStream on a standby,
since it is unclear what it would mean on a standby. For example, do
you want to know the time the LSN was generated or the time it was
replayed? Note that bgwriter won't insert values to the time stream on
a standby (it explicitly checks).

Yes, I mentioned Standby because PgStartLSN is not what it says it is.

But, you bring up an issue that I don't quite know what to do about.
If the standby doesn't have an LSNTimeStream, then when it is
promoted, LSN <-> time conversions of LSNs and times before the
promotion seem impossible. Maybe if the stats file is getting written
out at checkpoints, we could restore from that previous primary's file
after promotion?

I’m afraid that clocks of a Primary from previous timeline might be not in sync with ours.
It’s OK if it causes error, we just need to be prepared when they indicate values from future. Perhaps, by shifting their last point to our “PgStartLSN”.

This brings up a second issue, which is that, currently, bgwriter
won't insert into the time stream when wal_level is minimal. I'm not
sure exactly how to resolve it because I am relying on the "last
important rec pointer" and the LOG_SNAPSHOT_INTERVAL_MS to throttle
when the bgwriter actually inserts new records into the LSNTimeStream.
I could come up with a different timeout interval for updating the
time stream. Perhaps I should do that?

IDK. My knowledge of bgwriter is not enough to give a meaningful advise here.

lsn_ts_calculate_error_area() is prone to overflow. Even int64 does not seem capable to accommodate LSN*time. And the function may return negative result, despite claiming area as a result. It’s intended, but a little misleading.

Ah, great point. I've fixed this.

Well, not exactly. Result of lsn_ts_calculate_error_area() is still fabs()’ed upon usage. I’d recommend to fabs in function.
BTW lsn_ts_calculate_error_area() have no prototype.

Also, I’m not a big fan of using IEEE 754 float in this function. This data type have 24 bits of significand bits.
Consider that current timestamp has 50 binary digits. Let’s assume realistic LSNs have same 50 bits.
Then our rounding error is 2^76 byte*microseconds.
Let’s assume we are interested to measure time on a scale of 1GB WAL records.
This gives us rounding error of 2^46 microseconds = 2^26 seconds = 64 million seconds = 2 years.
Seems like a gross error.

If we use IEEE 754 doubles we have 53 significand bytes. And rounding error will be on a scale of 128 microseconds per GB, which is acceptable.

So I think double is better than float here.

Nitpicking, but I’d prefer to sum up (triangle2 + triangle3 + rectangle_part) before subtracting. This can save a bit of precision (smaller figures can have lesser exponent).

On 29 Jun 2024, at 03:09, Melanie Plageman <melanieplageman@gmail.com> wrote:
change the user-facing functions for estimating an
LSN/time conversion to instead return a floor and a ceiling -- instead
of linearly interpolating a guess. This would be a way to keep users
from misunderstanding the accuracy of the functions to translate LSN
<-> time.

I think this is a good idea. And it covers well “server restart problem”. If API just returns -inf as a boundary, caller can correctly interpret this situation.

Providing "ceiling" and "floor" user functions is still a TODO for me,
however, I think that the patch mostly does handle server restarts.

In the event of a restart, the cumulative stats system will have
persisted our time stream, so the LSNTimeStream will just be read back
in with the rest of the stats. I've added logic to ensure that if the
PgStartLSN is newer than our oldest LSNTimeStream entry, we use the
oldest entry instead of PgStartLSN when doing conversions LSN <->
time.

As for a crash, stats do not persist crashes, but I think Michael's
patch will go in to write out the stats file at checkpoints, and then
this should be good enough.

Is there anything else you think that is an issue with restarts?

Nope, looks good to me.

Thanks! Looking forward to more timely freezing.

Thanks! I'll be posting a new version of the opportunistic freezing
patch that uses the time stream quite soon, so I hope you'll take a
look at that as well!

Great! Thank you!
Besides your TODOs and my nitpicking this patch series looks RfC to me.

I have to address some review comments on my patches, then I hope I’ll switch to reviewing opportunistic freezing.

Best regards, Andrey Borodin.

#16Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#1)
3 attachment(s)
Re: Add LSN <-> time conversion functionality

Attached v7 changes the SQL-callable functions to return ranges of
LSNs and times covering the target time or LSN instead of linearly
interpolating an approximate answer.

I also changed the frequency and conditions under which the background
writer updates the global LSNTimeStream. There is now a dedicated
interval at which the LSNTimeStream is updated (instead of reusing the
log standby snapshot interval).

I also found that it is incorrect to set PgStartLSN to the insert LSN
in PostmasterMain(). The XLog buffer cache is not guaranteed to be
initialized in time. Instead of trying to provide an LSN lower bound
for locating times before those recorded on the global LSNTimeStream,
I simply return a lower bound of InvalidXLogRecPtr. Similarly, I
provide a lower bound of -infinity when locating LSNs before those
recorded on the global LSNTimeStream.

On Thu, Aug 1, 2024 at 3:55 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

On 1 Aug 2024, at 05:44, Melanie Plageman <melanieplageman@gmail.com> wrote:

On Sat, Jul 6, 2024 at 1:36 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

PgStartLSN = GetXLogInsertRecPtr();
Should this be kind of RecoveryEndPtr? How is it expected to behave on Standby in HA cluster, which was doing a crash recovery of 1y WALs in a day, then is in startup for a year as a Hot Standby, and then is promoted?

So, I don't think we will allow use of the LSNTimeStream on a standby,
since it is unclear what it would mean on a standby. For example, do
you want to know the time the LSN was generated or the time it was
replayed? Note that bgwriter won't insert values to the time stream on
a standby (it explicitly checks).

Yes, I mentioned Standby because PgStartLSN is not what it says it is.

Right, I've found another way of dealing with this since PgStartLSN
was incorrect.

But, you bring up an issue that I don't quite know what to do about.
If the standby doesn't have an LSNTimeStream, then when it is
promoted, LSN <-> time conversions of LSNs and times before the
promotion seem impossible. Maybe if the stats file is getting written
out at checkpoints, we could restore from that previous primary's file
after promotion?

I’m afraid that clocks of a Primary from previous timeline might be not in sync with ours.
It’s OK if it causes error, we just need to be prepared when they indicate values from future. Perhaps, by shifting their last point to our “PgStartLSN”.

Regarding a standby being promoted. I plan to make a version of the
LSNTimeStream functions which works on a standby by using
getRecordTimestamp() and inserts an LSN from the last record replayed
and the associated timestamp. That should mean the LSNTimeStream on
the standby is roughly the same as the one on the primary since those
records were inserted on the primary.

As for time going backwards in general, I've now made it so that
values are only inserted if the times are monotonically increasing and
the LSN is the same or increasing. This should handle time going
backwards, either on the primary itself or after a standby is promoted
if the timeline wasn't a perfect match.

This brings up a second issue, which is that, currently, bgwriter
won't insert into the time stream when wal_level is minimal. I'm not
sure exactly how to resolve it because I am relying on the "last
important rec pointer" and the LOG_SNAPSHOT_INTERVAL_MS to throttle
when the bgwriter actually inserts new records into the LSNTimeStream.
I could come up with a different timeout interval for updating the
time stream. Perhaps I should do that?

IDK. My knowledge of bgwriter is not enough to give a meaningful advise here.

See my note at top of the email.

lsn_ts_calculate_error_area() is prone to overflow. Even int64 does not seem capable to accommodate LSN*time. And the function may return negative result, despite claiming area as a result. It’s intended, but a little misleading.

Ah, great point. I've fixed this.

Well, not exactly. Result of lsn_ts_calculate_error_area() is still fabs()’ed upon usage. I’d recommend to fabs in function.
BTW lsn_ts_calculate_error_area() have no prototype.

Also, I’m not a big fan of using IEEE 754 float in this function. This data type have 24 bits of significand bits.
Consider that current timestamp has 50 binary digits. Let’s assume realistic LSNs have same 50 bits.
Then our rounding error is 2^76 byte*microseconds.
Let’s assume we are interested to measure time on a scale of 1GB WAL records.
This gives us rounding error of 2^46 microseconds = 2^26 seconds = 64 million seconds = 2 years.
Seems like a gross error.

If we use IEEE 754 doubles we have 53 significand bytes. And rounding error will be on a scale of 128 microseconds per GB, which is acceptable.

So I think double is better than float here.

Nitpicking, but I’d prefer to sum up (triangle2 + triangle3 + rectangle_part) before subtracting. This can save a bit of precision (smaller figures can have lesser exponent).

Okay, thanks for the detail. See what you think about v7.

Some perf testing of bgwriter updates are still a todo. I was thinking
that it might be bad to take an exclusive lock on the WAL stats data
structure for the entire time I am inserting a new value to the
LSNTimeStream. I was thinking maybe I should take a share lock and
calculate which element to drop first and then take the exclusive
lock? Or maybe I should make a separate lock for just the stream
member of PgStat_WalStats. Maybe it isn't worth it? I'm not sure.

- Melanie

Attachments:

v7-0002-Add-global-LSNTimeStream-to-PgStat_WalStats.patchtext/x-patch; charset=US-ASCII; name=v7-0002-Add-global-LSNTimeStream-to-PgStat_WalStats.patchDownload
From 1e8bf7042e1c652c490f9ccd2940d200617cbfee Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 5 Aug 2024 20:33:15 -0400
Subject: [PATCH v7 2/4] Add global LSNTimeStream to PgStat_WalStats

Add a globally maintained instance of an LSNTimeStream to
PgStat_WalStats and a utility function to insert new values.
The WAL generation rate is meant to be used for statistical purposes, so
it makes sense for it to live in the WAL stats data structure.

Background writer is tasked with inserting new LSN, time pairs to the
global stream in its main loop at regular intervals. There is precedent
for background writer performing such tasks: bgwriter already
periodically logs snapshots into the WAL for the benefit of standbys.
---
 src/backend/postmaster/bgwriter.c       | 81 +++++++++++++++++++++----
 src/backend/utils/activity/pgstat_wal.c | 15 +++++
 src/include/pgstat.h                    |  5 ++
 3 files changed, 89 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 0f75548759a..bb5e2d8ec5d 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -76,6 +76,21 @@ int			BgWriterDelay = 200;
 static TimestampTz last_snapshot_ts;
 static XLogRecPtr last_snapshot_lsn = InvalidXLogRecPtr;
 
+/*
+ * Interval at which new LSN, time pairs are added into the global
+ * LSNTimeStream, in milliseconds.
+ */
+#define LOG_STREAM_INTERVAL_MS 30000
+
+/*
+ * The timestamp at which we last checked whether or not to update the global
+ * LSNTimeStream.
+ */
+static TimestampTz last_stream_check_ts;
+
+/* The LSN we last updated the LSNTimeStream with */
+static XLogRecPtr last_stream_update_lsn = InvalidXLogRecPtr;
+
 
 /*
  * Main entry point for bgwriter process
@@ -119,6 +134,12 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 	 */
 	last_snapshot_ts = GetCurrentTimestamp();
 
+	/* Insert an entry to the global LSNTimeStream as soon as we can. */
+	last_stream_check_ts = last_snapshot_ts;
+	last_stream_update_lsn = GetXLogInsertRecPtr();
+	pgstat_wal_update_lsntime_stream(last_stream_update_lsn,
+									 last_stream_check_ts);
+
 	/*
 	 * Create a memory context that we will do all our work in.  We do this so
 	 * that we can reset the context during error recovery and thereby avoid
@@ -269,26 +290,62 @@ BackgroundWriterMain(char *startup_data, size_t startup_data_len)
 		 * Checkpointer, when active, is barely ever in its mainloop and thus
 		 * makes it hard to log regularly.
 		 */
-		if (XLogStandbyInfoActive() && !RecoveryInProgress())
+
+		if (!RecoveryInProgress())
 		{
 			TimestampTz timeout = 0;
 			TimestampTz now = GetCurrentTimestamp();
 
-			timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
-												  LOG_SNAPSHOT_INTERVAL_MS);
+			if (XLogStandbyInfoActive())
+			{
+				timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
+													  LOG_SNAPSHOT_INTERVAL_MS);
+
+				/*
+				 * Only log if enough time has passed and interesting records
+				 * have been inserted since the last snapshot.  Have to
+				 * compare with <= instead of < because
+				 * GetLastImportantRecPtr() points at the start of a record,
+				 * whereas last_snapshot_lsn points just past the end of the
+				 * record.
+				 */
+				if (now >= timeout &&
+					last_snapshot_lsn <= GetLastImportantRecPtr())
+				{
+					last_snapshot_lsn = LogStandbySnapshot();
+					last_snapshot_ts = now;
+				}
+			}
+
+			timeout = TimestampTzPlusMilliseconds(last_stream_check_ts,
+												  LOG_STREAM_INTERVAL_MS);
 
 			/*
-			 * Only log if enough time has passed and interesting records have
-			 * been inserted since the last snapshot.  Have to compare with <=
-			 * instead of < because GetLastImportantRecPtr() points at the
-			 * start of a record, whereas last_snapshot_lsn points just past
-			 * the end of the record.
+			 * Periodically insert a new LSNTime into the global
+			 * LSNTimeStream. It makes sense for the background writer to
+			 * maintain the global LSNTimeStream because it runs regularly and
+			 * returns to its main loop frequently.
 			 */
-			if (now >= timeout &&
-				last_snapshot_lsn <= GetLastImportantRecPtr())
+			if (now >= timeout)
 			{
-				last_snapshot_lsn = LogStandbySnapshot();
-				last_snapshot_ts = now;
+				XLogRecPtr	insert_lsn = GetXLogInsertRecPtr();
+
+				Assert(insert_lsn != InvalidXLogRecPtr);
+
+				/*
+				 * We only insert an LSNTime if the LSN has changed since the
+				 * last update. This sacrifices accuracy on LSN -> time
+				 * conversions but saves space, which increases the accuracy
+				 * of time -> LSN conversions.
+				 */
+				if (insert_lsn > last_stream_update_lsn)
+				{
+					pgstat_wal_update_lsntime_stream(insert_lsn,
+													 now);
+					last_stream_update_lsn = insert_lsn;
+				}
+
+				last_stream_check_ts = now;
 			}
 		}
 
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 95ec65a51ff..1ce9060641c 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -369,6 +369,21 @@ lsntime_insert(LSNTimeStream *stream, TimestampTz time,
 }
 
 
+/*
+ * Utility function for inserting a new member into the LSNTimeStream member
+ * of WAL stats.
+ */
+void
+pgstat_wal_update_lsntime_stream(XLogRecPtr lsn, TimestampTz time)
+{
+	PgStatShared_Wal *stats_shmem = &pgStatLocal.shmem->wal;
+
+	LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE);
+	lsntime_insert(&stats_shmem->stats.stream, time, lsn);
+	LWLockRelease(&stats_shmem->lock);
+}
+
+
 /*
  * Returns a range of LSNTimes starting at lower and ending at upper and
  * covering the target_time. If target_time is before the stream, lower will
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 13856e2bef3..43df60ce24c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -507,6 +507,7 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_sync;
 	PgStat_Counter wal_write_time;
 	PgStat_Counter wal_sync_time;
+	LSNTimeStream stream;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -795,6 +796,10 @@ extern void time_bounds_for_lsn(const LSNTimeStream *stream,
 								XLogRecPtr target_lsn,
 								LSNTime *lower, LSNTime *upper);
 
+/* Helper for maintaining the global LSNTimeStream */
+extern void pgstat_wal_update_lsntime_stream(XLogRecPtr lsn,
+											 TimestampTz time);
+
 
 /*
  * Variables in pgstat.c
-- 
2.34.1

v7-0003-Add-time-LSN-translation-range-functions.patchtext/x-patch; charset=US-ASCII; name=v7-0003-Add-time-LSN-translation-range-functions.patchDownload
From eeb936816924b91a194ed03a4296b4f669e72071 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Wed, 7 Aug 2024 10:57:45 -0400
Subject: [PATCH v7 3/4] Add time <-> LSN translation range functions

Previous commits added a global LSNTimeStream, maintained by background
writer and functions to return a range of LSNs covering a time or time
covering an LSN.

Add SQL-callable functions to produce these ranges and a SQL-callable
function returning the entire LSNTimeStream.

This could be useful in combination with SQL-callable functions
accessing a page LSN to approximate the time of last modification of a
page or estimating the LSN consumption rate to moderate maintenance
processes and balance system resource utilization.
---
 doc/src/sgml/monitoring.sgml        |  75 +++++++++++++++++++
 src/backend/utils/adt/pgstatfuncs.c | 107 ++++++++++++++++++++++++++++
 src/include/catalog/pg_proc.dat     |  27 +++++++
 src/test/regress/expected/stats.out |  43 +++++++++++
 src/test/regress/sql/stats.sql      |  28 ++++++++
 5 files changed, 280 insertions(+)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 55417a6fa9d..9b63659900b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3195,6 +3195,81 @@ description | Waiting for a newly initialized WAL file to reach durable storage
    </tgroup>
   </table>
 
+  <para>
+  In addition to these WAL stats, a stream of LSN-time pairs is accessible
+  via the functions shown in <xref linkend="functions-lsn-time-stream"/>.
+  </para>
+
+  <table id="functions-lsn-time-stream">
+   <title>LSN Time Stream Information Functions</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       Function
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_stat_lsn_bounds_for_time</primary>
+       </indexterm>
+       <function>pg_stat_lsn_bounds_for_time</function>
+       ( <type>timestamp with time zone</type> )
+       <returnvalue>record</returnvalue>
+       (<parameter>lsn</parameter> <type>pg_lsn</type>,
+       <parameter>lsn</parameter> <type>pg_lsn</type>)
+      </para>
+      <para>
+       Returns the upper and lower bound of the LSN range on the global
+       LSNTimeLine in which the time falls.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_stat_time_bounds_for_lsn</primary>
+       </indexterm>
+       <function>pg_stat_time_bounds_for_lsn</function>
+       ( <type>pg_lsn</type> )
+       <returnvalue>record</returnvalue>
+       ( <parameter>time</parameter> <type>timestamp with time zone</type>,
+       <parameter>time</parameter> <type>timestamp with time zone</type>)
+      </para>
+      <para>
+       Returns the upper and lower bound of the time range on the global
+       LSNTimeLine in which the LSN falls.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="func_table_entry"><para role="func_signature">
+       <indexterm>
+        <primary>pg_stat_lsntime_stream</primary>
+       </indexterm>
+       <function>pg_stat_lsntime_stream</function> ()
+       <returnvalue>setof record</returnvalue>
+       ( <parameter>lsn</parameter> <type>pg_lsn</type>,
+       <parameter>time</parameter> <type>timestamp with time zone</type>)
+      </para>
+      <para>
+       Returns all of the LSN-time pairs currently in the global LSN time
+       stream.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+
+
 </sect2>
 
  <sect2 id="monitoring-pg-stat-database-view">
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 32211371237..ac862fb679a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -30,6 +30,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/pg_lsn.h"
 #include "utils/timestamp.h"
 
 #define UINT32_ACCESS_ONCE(var)		 ((uint32)(*((volatile uint32 *)&(var))))
@@ -1526,6 +1527,112 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
 
+/*
+ * Returns the LSN, time pairs making up the global LSNTimeStream maintained
+ * in WAL statistics.
+ */
+Datum
+pg_stat_lsntime_stream(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_WalStats *stats;
+	LSNTimeStream *stream;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	stats = pgstat_fetch_stat_wal();
+	stream = &stats->stream;
+
+	for (size_t i = 0; i < stream->length; i++)
+	{
+		Datum		values[2] = {0};
+		bool		nulls[2] = {0};
+
+		values[0] = stream->data[i].lsn;
+		values[1] = stream->data[i].time;
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+
+	return (Datum) 0;
+}
+
+/*
+ * Returns the upper and lower bounds of an LSN range covering the passed-in
+ * time. If the passed-in time is far enough in the past that we don't have
+ * data, the lower bound will be InvalidXLogRecPtr. If it is in the future,
+ * the upper bound will be FFFFFFFF/FFFFFFFF.
+ */
+Datum
+pg_stat_lsn_bounds_for_time(PG_FUNCTION_ARGS)
+{
+	PgStat_WalStats *wal_stats;
+	TimestampTz target_time;
+	LSNTime		lower,
+				upper;
+	TupleDesc	tupdesc;
+	Datum		values[2] = {0};
+	bool		nulls[2] = {0};
+
+	target_time = PG_GETARG_TIMESTAMPTZ(0);
+
+	tupdesc = CreateTemplateTupleDesc(2);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "lower",
+					   PG_LSNOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "upper",
+					   PG_LSNOID, -1, 0);
+	BlessTupleDesc(tupdesc);
+
+	wal_stats = pgstat_fetch_stat_wal();
+	lsn_bounds_for_time(&wal_stats->stream, target_time, &lower, &upper);
+
+	values[0] = lower.lsn;
+	values[1] = upper.lsn;
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc,
+													  values,
+													  nulls)));
+}
+
+
+/*
+ * Returns the upper and lower bounds of a TimestampTz range covering the
+ * passed-in LSN. If the passed-in LSN is far enough in the past that we don't
+ * have data, the lower bound will be -infinity. If the passed-in LSN is in
+ * the future, the upper bound will be infinity.
+ */
+Datum
+pg_stat_time_bounds_for_lsn(PG_FUNCTION_ARGS)
+{
+	PgStat_WalStats *wal_stats;
+	XLogRecPtr	target_lsn;
+	LSNTime		lower,
+				upper;
+	TupleDesc	tupdesc;
+	Datum		values[2] = {0};
+	bool		nulls[2] = {0};
+
+	target_lsn = PG_GETARG_LSN(0);
+
+	tupdesc = CreateTemplateTupleDesc(2);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "lower",
+					   TIMESTAMPTZOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "upper",
+					   TIMESTAMPTZOID, -1, 0);
+	BlessTupleDesc(tupdesc);
+
+	wal_stats = pgstat_fetch_stat_wal();
+	time_bounds_for_lsn(&wal_stats->stream, target_lsn, &lower, &upper);
+
+	values[0] = lower.time;
+	values[1] = upper.time;
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc,
+													  values,
+													  nulls)));
+}
+
 /*
  * Returns statistics of SLRU caches.
  */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d36f6001bb1..c59f42bc974 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6375,6 +6375,33 @@
   prorettype => 'timestamptz', proargtypes => 'xid',
   prosrc => 'pg_xact_commit_timestamp' },
 
+{ oid => '9997', descr => 'get upper and lower time bounds for LSN',
+  proname => 'pg_stat_time_bounds_for_lsn', provolatile => 'v',
+  proisstrict => 't', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'pg_lsn',
+  proallargtypes => '{pg_lsn,timestamptz,timestamptz}',
+  proargmodes => '{i,o,o}',
+  proargnames => '{target_lsn, lower, upper}',
+  prosrc => 'pg_stat_time_bounds_for_lsn' },
+
+{ oid => '9996', descr => 'get upper and lower LSN bounds for time',
+  proname => 'pg_stat_lsn_bounds_for_time', provolatile => 'v',
+  proisstrict => 't', proparallel => 'u',
+  prorettype => 'record', proargtypes => 'timestamptz',
+  proallargtypes => '{timestamptz,pg_lsn,pg_lsn}',
+  proargmodes => '{i,o,o}',
+  proargnames => '{target_time, lower, upper}',
+  prosrc => 'pg_stat_lsn_bounds_for_time' },
+
+{ oid => '9994',
+  descr => 'print the LSN Time Stream',
+  proname => 'pg_stat_lsntime_stream', prorows => '64',
+  provolatile => 'v', proparallel => 'u',
+  proretset => 't', prorettype => 'record',
+  proargtypes => '', proallargtypes => '{pg_lsn,timestamptz}',
+  proargmodes => '{o,o}', proargnames => '{lsn,time}',
+  prosrc => 'pg_stat_lsntime_stream' },
+
 { oid => '6168',
   descr => 'get commit timestamp and replication origin of a transaction',
   proname => 'pg_xact_commit_timestamp_origin', provolatile => 'v',
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 6e08898b183..5f32e3bd9e0 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -813,6 +813,49 @@ SELECT (n_tup_ins + n_tup_upd) > 0 AS has_data FROM pg_stat_all_tables
 -----
 -- Test that various stats views are being properly populated
 -----
+-- Test the functions querying the global LSNTimeStream stored in WAL stats.
+-- An LSN range covering a time 100 years in the past should be from 0 to a
+-- non-zero LSN (either the oldest LSN in the stream or the current insert
+-- LSN).
+SELECT lower = pg_lsn(0),
+       upper > pg_lsn(0)
+  FROM pg_stat_lsn_bounds_for_time(now() - make_interval(years=> 100));
+ ?column? | ?column? 
+----------+----------
+ t        | t
+(1 row)
+
+-- An LSN range covering a time 100 years in the future should be from roughly
+-- the current time to FFFFFFFF/FFFFFFFF (UINT64_MAX).
+SELECT lower > pg_lsn(0),
+       upper = pg_lsn('FFFFFFFF/FFFFFFFF')
+    FROM pg_stat_lsn_bounds_for_time(now() + make_interval(years=> 100));
+ ?column? | ?column? 
+----------+----------
+ t        | t
+(1 row)
+
+-- A TimestampTz range covering LSN 0 should be from -infinity to a positive
+-- time (either the oldest time in the stream or the current time).
+SELECT lower = timestamptz('-infinity'),
+       upper::time > 'allballs'::time
+    FROM pg_stat_time_bounds_for_lsn(pg_lsn(0));
+ ?column? | ?column? 
+----------+----------
+ t        | t
+(1 row)
+
+-- A TimestampTz range covering an LSN 1 GB in the future should be from
+-- roughly the current time to infinity.
+SELECT lower::time > 'allballs'::time,
+       upper = timestamptz('infinity')
+    FROM pg_stat_time_bounds_for_lsn(
+         pg_current_wal_insert_lsn() + 1000000000);
+ ?column? | ?column? 
+----------+----------
+ t        | t
+(1 row)
+
 -- Test that sessions is incremented when a new session is started in pg_stat_database
 SELECT sessions AS db_stat_sessions FROM pg_stat_database WHERE datname = (SELECT current_database()) \gset
 \c
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index d8ac0d06f48..0260779141c 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -411,6 +411,34 @@ SELECT (n_tup_ins + n_tup_upd) > 0 AS has_data FROM pg_stat_all_tables
 -- Test that various stats views are being properly populated
 -----
 
+-- Test the functions querying the global LSNTimeStream stored in WAL stats.
+
+-- An LSN range covering a time 100 years in the past should be from 0 to a
+-- non-zero LSN (either the oldest LSN in the stream or the current insert
+-- LSN).
+SELECT lower = pg_lsn(0),
+       upper > pg_lsn(0)
+  FROM pg_stat_lsn_bounds_for_time(now() - make_interval(years=> 100));
+
+-- An LSN range covering a time 100 years in the future should be from roughly
+-- the current time to FFFFFFFF/FFFFFFFF (UINT64_MAX).
+SELECT lower > pg_lsn(0),
+       upper = pg_lsn('FFFFFFFF/FFFFFFFF')
+    FROM pg_stat_lsn_bounds_for_time(now() + make_interval(years=> 100));
+
+-- A TimestampTz range covering LSN 0 should be from -infinity to a positive
+-- time (either the oldest time in the stream or the current time).
+SELECT lower = timestamptz('-infinity'),
+       upper::time > 'allballs'::time
+    FROM pg_stat_time_bounds_for_lsn(pg_lsn(0));
+
+-- A TimestampTz range covering an LSN 1 GB in the future should be from
+-- roughly the current time to infinity.
+SELECT lower::time > 'allballs'::time,
+       upper = timestamptz('infinity')
+    FROM pg_stat_time_bounds_for_lsn(
+         pg_current_wal_insert_lsn() + 1000000000);
+
 -- Test that sessions is incremented when a new session is started in pg_stat_database
 SELECT sessions AS db_stat_sessions FROM pg_stat_database WHERE datname = (SELECT current_database()) \gset
 \c
-- 
2.34.1

v7-0001-Add-LSNTimeStream-API-for-converting-LSN-time.patchtext/x-patch; charset=US-ASCII; name=v7-0001-Add-LSNTimeStream-API-for-converting-LSN-time.patchDownload
From 54d6baf71e0e73131bb03fe641fd9bdaddf18a93 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 5 Aug 2024 20:29:51 -0400
Subject: [PATCH v7 1/4] Add LSNTimeStream API for converting LSN <-> time

Add a new structure, LSNTimeStream, consisting of LSNTimes -- each an
LSN, time pair. This structure is intended to reflect the WAL generation
rate. It can be used to determine a time range in which an LSN was
inserted or an LSN range covering a particular time. These could be used
to interpolate a more specific point in the range.

It produces ranges and not specific time <-> LSN conversions because an
LSNTimeStream is lossy. An LSNTimeStream is fixed size, so when a new
LSNTime is inserted to a full LSNTimeStream, an LSNTime is dropped and
the new LSNTime is inserted. We drop the LSNTime whose absence would
cause the least error when interpolating between its adjoining points.

This commit does not add any instances of LSNTimeStream.
---
 src/backend/utils/activity/pgstat_wal.c | 414 ++++++++++++++++++++++++
 src/include/pgstat.h                    |  45 +++
 src/tools/pgindent/typedefs.list        |   2 +
 3 files changed, 461 insertions(+)

diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e2a3f6b865c..95ec65a51ff 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -17,8 +17,11 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
+#include "math.h"
 #include "utils/pgstat_internal.h"
+#include "utils/timestamp.h"
 
 
 PgStat_PendingWalStats PendingWalStats = {0};
@@ -31,6 +34,23 @@ PgStat_PendingWalStats PendingWalStats = {0};
  */
 static WalUsage prevWalUsage;
 
+static double lsn_ts_calculate_error_area(LSNTime *left,
+										  LSNTime *mid,
+										  LSNTime *right);
+static unsigned char lsntime_to_drop(LSNTimeStream *stream);
+static void lsntime_insert(LSNTimeStream *stream, TimestampTz time,
+						   XLogRecPtr lsn);
+
+static void stream_get_bounds_for_lsn(const LSNTimeStream *stream,
+									  XLogRecPtr target_lsn,
+									  LSNTime *lower,
+									  LSNTime *upper);
+
+static void stream_get_bounds_for_time(const LSNTimeStream *stream,
+									   TimestampTz target_time,
+									   LSNTime *lower,
+									   LSNTime *upper);
+
 
 /*
  * Calculate how much WAL usage counters have increased and update
@@ -192,3 +212,397 @@ pgstat_wal_snapshot_cb(void)
 		   sizeof(pgStatLocal.snapshot.wal));
 	LWLockRelease(&stats_shmem->lock);
 }
+
+/*
+ * Given three LSNTimes, calculate the area of the triangle they form were
+ * they plotted with time on the X axis and LSN on the Y axis. An
+ * illustration:
+ *
+ *   LSN
+ *    |
+ *    |                                                         * right
+ *    |
+ *    |
+ *    |
+ *    |                                                * mid    * C
+ *    |
+ *    |
+ *    |
+ *    |  * left                                        * B      * A
+ *    |
+ *    +------------------------------------------------------------------
+ *
+ * The area of the triangle with vertices (left, mid, right) is the error
+ * incurred over the interval [left, right] were we to interpolate with just
+ * [left, right] rather than [left, mid] and [mid, right].
+ */
+static double
+lsn_ts_calculate_error_area(LSNTime *left, LSNTime *mid, LSNTime *right)
+{
+	double		left_time = left->time,
+				left_lsn = left->lsn;
+	double		mid_time = mid->time,
+				mid_lsn = mid->lsn;
+	double		right_time = right->time,
+				right_lsn = right->lsn;
+	double		rectangle_all,
+				triangle1,
+				triangle2,
+				triangle3,
+				rectangle_part,
+				area_to_subtract;
+
+	/* Area of the rectangle with opposing corners left and right */
+	rectangle_all = (right_time - left_time) * (right_lsn - left_lsn);
+
+	/* Area of the right triangle with vertices left, right, and A */
+	triangle1 = rectangle_all / 2;
+
+	/* Area of the right triangle with vertices left, mid, and B */
+	triangle2 = (mid_lsn - left_lsn) * (mid_time - left_time) / 2;
+
+	/* Area of the right triangle with vertices mid, right, and C */
+	triangle3 = (right_lsn - mid_lsn) * (right_time - mid_time) / 2;
+
+	/* Area of the rectangle with vertices mid, A, B, and C */
+	rectangle_part = (right_lsn - mid_lsn) * (mid_time - left_time);
+
+	/* Sum up the area to subtract first to produce a more precise answer */
+	area_to_subtract = triangle2 + triangle3 + rectangle_part;
+
+	/* Area of the triangle with vertices left, mid, and right */
+	return fabs(triangle1 - area_to_subtract);
+}
+
+/*
+ * Determine which LSNTime to drop from a full LSNTimeStream.
+ * Drop the LSNTime whose absence would introduce the least error into future
+ * linear interpolation on the stream.
+ *
+ * We determine the error that would be introduced by dropping a point on the
+ * stream by calculating the area of the triangle formed by the LSNTime and
+ * its adjacent LSNTimes. We do this for each LSNTime in the stream (except
+ * for the first and last LSNTimes) and choose the LSNTime with the smallest
+ * error (area).
+ *
+ * We avoid extrapolation by never dropping the first or last points.
+ */
+static unsigned char
+lsntime_to_drop(LSNTimeStream *stream)
+{
+	double		min_area;
+	unsigned char target_point;
+
+	/* Don't drop points if free spots available are available */
+	Assert(stream->length == LSNTIMESTREAM_VOLUME);
+	StaticAssertStmt(LSNTIMESTREAM_VOLUME >= 3, "LSNTIMESTREAM_VOLUME < 3");
+
+	min_area = lsn_ts_calculate_error_area(&stream->data[0],
+										   &stream->data[1],
+										   &stream->data[2]);
+
+	target_point = 1;
+
+	for (size_t i = 2; i < stream->length - 1; i++)
+	{
+		LSNTime    *left = &stream->data[i - 1];
+		LSNTime    *mid = &stream->data[i];
+		LSNTime    *right = &stream->data[i + 1];
+		double		area = lsn_ts_calculate_error_area(left, mid, right);
+
+		if (area < min_area)
+		{
+			min_area = area;
+			target_point = i;
+		}
+	}
+
+	return target_point;
+}
+
+/*
+ * Insert a new LSNTime into the LSNTimeStream in the first available element.
+ * If there are no empty elements, drop an LSNTime from the stream to make
+ * room for the new LSNTime.
+ */
+static void
+lsntime_insert(LSNTimeStream *stream, TimestampTz time,
+			   XLogRecPtr lsn)
+{
+	unsigned char drop;
+	LSNTime		entrant = {.lsn = lsn,.time = time};
+
+	if (stream->length < LSNTIMESTREAM_VOLUME)
+	{
+		/*
+		 * Time must move forward on the stream. If the clock moves backwards,
+		 * for example in an NTP correction, we'll just skip inserting this
+		 * LSNTime.
+		 *
+		 * Translating LSN <-> time is most meaningful if the LSNTimeStream
+		 * entries are the position of a single location in the WAL over time.
+		 * Though time must monotonically increase, it is valid to insert
+		 * multiple LSNTimes with the same LSN. Imagine a period of time in
+		 * which no new WAL records are inserted.
+		 */
+		if (stream->length > 0 &&
+			(time <= stream->data[stream->length - 1].time ||
+			 lsn < stream->data[stream->length - 1].lsn))
+		{
+			ereport(WARNING,
+					errmsg("Won't insert non-monotonic \"%lu, %s\" to LSNTimeStream.",
+						   lsn, timestamptz_to_str(time)));
+			return;
+		}
+
+		stream->data[stream->length++] = entrant;
+		return;
+	}
+
+	drop = lsntime_to_drop(stream);
+
+	memmove(&stream->data[drop],
+			&stream->data[drop + 1],
+			sizeof(LSNTime) * (stream->length - 1 - drop));
+
+	stream->data[stream->length - 1] = entrant;
+}
+
+
+/*
+ * Returns a range of LSNTimes starting at lower and ending at upper and
+ * covering the target_time. If target_time is before the stream, lower will
+ * contain the minimum values for the datatypes. If target_time is newer than
+ * the stream, upper will contain the maximum values for the datatypes.
+ */
+static void
+stream_get_bounds_for_time(const LSNTimeStream *stream,
+						   TimestampTz target_time,
+						   LSNTime *lower,
+						   LSNTime *upper)
+{
+	Assert(lower && upper);
+
+	/*
+	 * If the target_time is "off the stream" -- either the stream has no
+	 * members or the target_time is older than all values in the stream or
+	 * newer than all values -- the lower and/or upper bounds may be the min
+	 * or max value for the datatypes, respectively.
+	 */
+	*lower = LSNTIME_INIT(InvalidXLogRecPtr, INT64_MIN);
+	*upper = LSNTIME_INIT(UINT64_MAX, INT64_MAX);
+
+	/*
+	 * If the LSNTimeStream has no members, it provides no information about
+	 * the range.
+	 */
+	if (stream->length == 0)
+	{
+		elog(DEBUG1,
+			 "Attempt to identify LSN bounds for time: \"%s\" using empty LSNTimeStream.",
+			 timestamptz_to_str(target_time));
+		return;
+	}
+
+	/*
+	 * If the target_time is older than the stream, the oldest member in the
+	 * stream is our upper bound.
+	 */
+	if (target_time <= stream->data[0].time)
+	{
+		*upper = stream->data[0];
+		if (target_time == stream->data[0].time)
+			*lower = stream->data[0];
+		return;
+	}
+
+	/*
+	 * Loop through the stream and stop at the first LSNTime newer than or
+	 * equal to our target time. Skip the first LSNTime, as we know it is
+	 * older than our target time.
+	 */
+	for (size_t i = 1; i < stream->length; i++)
+	{
+		if (target_time == stream->data[i].time)
+		{
+			*lower = stream->data[i];
+			*upper = stream->data[i];
+			return;
+		}
+
+		if (target_time < stream->data[i].time)
+		{
+			/* Time must increase monotonically on the stream. */
+			Assert(stream->data[i - 1].time <
+				   stream->data[i].time);
+			*lower = stream->data[i - 1];
+			*upper = stream->data[i];
+			return;
+		}
+	}
+
+	/*
+	 * target_time is newer than the stream, so the newest member in the
+	 * stream is our lower bound.
+	 */
+	*lower = stream->data[stream->length - 1];
+}
+
+/*
+ * Try to find an upper and lower bound for the possible LSN values at the
+ * provided target_time. If the target_time doesn't fall on the provided
+ * LSNTimeStream, we compare the target_time to the current time and see if we
+ * can fill in a missing boundary. Note that we do not consult the
+ * current time if the target_time fell on the stream -- even if doing so
+ * might provide a tighter range.
+ */
+void
+lsn_bounds_for_time(const LSNTimeStream *stream, TimestampTz target_time,
+					LSNTime *lower, LSNTime *upper)
+{
+	TimestampTz current_time;
+	XLogRecPtr	current_lsn;
+
+	stream_get_bounds_for_time(stream, target_time, lower, upper);
+
+	/*
+	 * We found valid upper and lower bounds for target_time, so we're done.
+	 */
+	if (lower->lsn != InvalidXLogRecPtr && upper->lsn != UINT64_MAX)
+		return;
+
+	/*
+	 * The target_time was either off the stream or the stream has no members.
+	 * In either case, see if we can use the current time and LSN to provide
+	 * one (or both) of the bounds.
+	 */
+	current_time = GetCurrentTimestamp();
+	current_lsn = GetXLogInsertRecPtr();
+
+	if (lower->lsn == InvalidXLogRecPtr && target_time >= current_time)
+		*lower = LSNTIME_INIT(current_lsn, current_time);
+
+	if (upper->lsn == UINT64_MAX && target_time <= current_time)
+		*upper = LSNTIME_INIT(current_lsn, current_time);
+
+	Assert(upper->lsn >= lower->lsn);
+}
+
+/*
+ * Returns a range of LSNTimes starting at lower and ending at upper and
+ * covering the target_lsn. If target_lsn is before the stream, lower will
+ * contain the minimum values for the datatypes. If target_time is newer than
+ * the stream, upper will contain the maximum values for the datatypes.
+ */
+static void
+stream_get_bounds_for_lsn(const LSNTimeStream *stream,
+						  XLogRecPtr target_lsn,
+						  LSNTime *lower,
+						  LSNTime *upper)
+{
+	Assert(lower && upper);
+
+	/*
+	 * If the target_time is "off the stream" -- either the stream has no
+	 * members or the target_time is older than all values in the stream or
+	 * newer than all values -- the lower and/or upper bounds may be the min
+	 * or max value for the datatypes, respectively.
+	 */
+	*lower = LSNTIME_INIT(InvalidXLogRecPtr, INT64_MIN);
+	*upper = LSNTIME_INIT(UINT64_MAX, INT64_MAX);
+
+	/*
+	 * If the LSNTimeStream has no members, it provides no information about
+	 * the range.
+	 */
+	if (stream->length == 0)
+	{
+		elog(DEBUG1,
+			 "Attempt to identify time bounds for LSN: \"%lu\" using empty LSNTimeStream.",
+			 target_lsn);
+		return;
+	}
+
+	/*
+	 * If the target_lsn is older than the stream, the oldest member in the
+	 * stream is our upper bound.
+	 */
+	if (target_lsn <= stream->data[0].lsn)
+	{
+		*upper = stream->data[0];
+		if (target_lsn == stream->data[0].lsn)
+			*lower = stream->data[0];
+		return;
+	}
+
+	/*
+	 * Loop through the stream and stop at the first LSNTime newer than or
+	 * equal to our target time. Skip the first LSNTime, as we know it is
+	 * older than our target time.
+	 */
+	for (size_t i = 1; i < stream->length; i++)
+	{
+		if (target_lsn == stream->data[i].lsn)
+		{
+			*lower = stream->data[i - 1];
+			*upper = stream->data[i];
+			return;
+		}
+
+		if (target_lsn < stream->data[i].lsn)
+		{
+			/* LSNs must not decrease on the stream. */
+			Assert(stream->data[i - 1].lsn <=
+				   stream->data[i].lsn);
+			*lower = stream->data[i - 1];
+			*upper = stream->data[i];
+			return;
+		}
+	}
+
+	/*
+	 * target_lsn is newer than the stream, so the newest member in the stream
+	 * is our lower bound.
+	 */
+	*lower = stream->data[stream->length - 1];
+}
+
+/*
+ * Try to find an upper and lower bound for the possible times covering the
+ * provided target_lsn. If the target_lsn doesn't fall on the provided
+ * LSNTimeStream, we compare the target_lsn to the current insert LSN and see
+ * if we can fill in a missing boundary. Note that we do not consult
+ * the current insert LSN if the target_lsn fell on the stream -- even if
+ * doing so might provide a tighter range.
+ */
+void
+time_bounds_for_lsn(const LSNTimeStream *stream, XLogRecPtr target_lsn,
+					LSNTime *lower, LSNTime *upper)
+{
+	TimestampTz current_time;
+	XLogRecPtr	current_lsn;
+
+	stream_get_bounds_for_lsn(stream, target_lsn, lower, upper);
+
+	/*
+	 * We found valid upper and lower bounds for target_lsn, so we're done.
+	 */
+	if (lower->time != INT64_MIN && upper->time != INT64_MAX)
+		return;
+
+	/*
+	 * The target_lsn was either off the stream or the stream has no members.
+	 * In either case, see if we can use the current time and LSN to provide
+	 * one (or both) of the bounds.
+	 */
+	current_time = GetCurrentTimestamp();
+	current_lsn = GetXLogInsertRecPtr();
+
+	if (lower->time == INT64_MIN && target_lsn >= current_lsn)
+		*lower = LSNTIME_INIT(current_lsn, current_time);
+
+	if (upper->time == INT64_MAX && target_lsn <= current_lsn)
+		*upper = LSNTIME_INIT(current_lsn, current_time);
+
+	Assert(upper->time >= lower->time);
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f63159c55ca..13856e2bef3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -458,6 +458,45 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter autoanalyze_count;
 } PgStat_StatTabEntry;
 
+/*
+ * The elements of an LSNTimeStream. For the LSNTimeStream to be meaningful,
+ * the lsn should be a consistent position in the WAL over time (e.g. the
+ * insert LSN at each time in the stream or the flush LSN at each time).
+ */
+typedef struct LSNTime
+{
+	TimestampTz time;
+	XLogRecPtr	lsn;
+} LSNTime;
+
+/*
+ * Convenience macro returning an LSNTime with the time and LSN set to the
+ * passed in values.
+ */
+#define LSNTIME_INIT(i_lsn, i_time) \
+	((LSNTime) { .lsn = (i_lsn), .time = (i_time) })
+
+#define LSNTIMESTREAM_VOLUME 64
+
+/*
+ * An LSN time stream is an array consisting of LSNTimes from least to most
+ * recent. The array is filled before any element is dropped. Once the
+ * LSNTimeStream length == volume (the array is full), an LSNTime is dropped,
+ * the subsequent LSNTimes are moved down by 1, and the new LSNTime is
+ * inserted at the tail.
+ *
+ * When dropping an LSNTime, we attempt to pick the member which would
+ * introduce the least error into the stream. See lsntime_to_drop() for more
+ * details.
+ *
+ * Use the stream for LSN <-> time conversions.
+ */
+typedef struct LSNTimeStream
+{
+	unsigned char length;
+	LSNTime		data[LSNTIMESTREAM_VOLUME];
+} LSNTimeStream;
+
 typedef struct PgStat_WalStats
 {
 	PgStat_Counter wal_records;
@@ -749,6 +788,12 @@ extern void pgstat_execute_transactional_drops(int ndrops, struct xl_xact_stats_
 
 extern void pgstat_report_wal(bool force);
 extern PgStat_WalStats *pgstat_fetch_stat_wal(void);
+extern void lsn_bounds_for_time(const LSNTimeStream *stream,
+								TimestampTz target_time,
+								LSNTime *lower, LSNTime *upper);
+extern void time_bounds_for_lsn(const LSNTimeStream *stream,
+								XLogRecPtr target_lsn,
+								LSNTime *lower, LSNTime *upper);
 
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 547d14b3e7c..c8d84122976 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1587,6 +1587,8 @@ LogicalTapeSet
 LsnReadQueue
 LsnReadQueueNextFun
 LsnReadQueueNextStatus
+LSNTime
+LSNTimeStream
 LtreeGistOptions
 LtreeSignature
 MAGIC
-- 
2.34.1

#17Robert Haas
robertmhaas@gmail.com
In reply to: Melanie Plageman (#16)
Re: Add LSN <-> time conversion functionality

Melanie,

As I mentioned to you off-list, I feel like this needs some sort of
recency bias. Certainly vacuum, and really almost any conceivable user
of this facility, is going to care more about accurate answers for new
data than for old data. If there's no recency bias, then I think that
eventually answers for more recent LSNs will start to become less
accurate, since they've got to share the data structure with more and
more time from long ago. I don't think you've done anything about this
in this version of the patch, but I might be wrong.

One way to make the standby more accurately mimic the primary would be
to base entries on the timestamp-LSN data that is already present in
the WAL, i.e. {COMMIT|ABORT} [PREPARED] records. If you only added or
updated entries on the primary when logging those records, the standby
could redo exactly what the primary did. A disadvantage of this
approach is that if there are no commits for a while then your mapping
gets out of date, but that might be something we could just tolerate.
Another possible solution is to log the changes you make on the
primary and have the standby replay those changes. Perhaps I'm wrong
to advocate for such solutions, but it feels error-prone to have one
algorithm for the primary and a different algorithm for the standby.
You now basically have two things that can break and you have to debug
what went wrong instead of just one.

In terms of testing this, I advocate not so much performance testing
as accuracy testing. So for example if you intentionally change the
LSN consumption rate during your test, e.g. high LSN consumption rate
for a while, then low for while, then high again for a while, and then
graph the contents of the final data structure, how well does the data
structure model what actually happened? Honestly, my whole concern
here is really around the lack of recency bias. If you simply took a
sample every N seconds until the buffer was full and then repeatedly
thinned the data by throwing away every other sample from the older
half of the buffer, then it would be self-evident that accuracy for
the older data was going to degrade over time, but also that accuracy
for new data wasn't going to degrade no matter how long you ran the
algorithm, simply because the newest half of the data never gets
thinned. But because you've chosen to throw away the point that leads
to the least additional error (on an imaginary request distribution
that is just as likely to care about very old things as it is to care
about new ones), there's nothing to keep the algorithm from getting
into a state where it systematically throws away new data points and
keeps old ones.

To be clear, I'm not saying the algorithm I just mooted is the right
one or that it has no weaknesses; for example, it needlessly throws
away precision that it doesn't have to lose when the rate of LSN
consumption is constant for a long time. I don't think that
necessarily matters because the algorithm doesn't need to be as
accurate as possible; it just needs to be accurate enough to get the
job done.

--
Robert Haas
EDB: http://www.enterprisedb.com

#18Melanie Plageman
melanieplageman@gmail.com
In reply to: Robert Haas (#17)
Re: Add LSN <-> time conversion functionality

On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmhaas@gmail.com> wrote:

As I mentioned to you off-list, I feel like this needs some sort of
recency bias. Certainly vacuum, and really almost any conceivable user
of this facility, is going to care more about accurate answers for new
data than for old data. If there's no recency bias, then I think that
eventually answers for more recent LSNs will start to become less
accurate, since they've got to share the data structure with more and
more time from long ago. I don't think you've done anything about this
in this version of the patch, but I might be wrong.

That makes sense. This version of the patch set doesn't have a recency
bias implementation. I plan to work on it but will need to do the
testing like you mentioned.

One way to make the standby more accurately mimic the primary would be
to base entries on the timestamp-LSN data that is already present in
the WAL, i.e. {COMMIT|ABORT} [PREPARED] records. If you only added or
updated entries on the primary when logging those records, the standby
could redo exactly what the primary did. A disadvantage of this
approach is that if there are no commits for a while then your mapping
gets out of date, but that might be something we could just tolerate.
Another possible solution is to log the changes you make on the
primary and have the standby replay those changes. Perhaps I'm wrong
to advocate for such solutions, but it feels error-prone to have one
algorithm for the primary and a different algorithm for the standby.
You now basically have two things that can break and you have to debug
what went wrong instead of just one.

Your point about maintaining two different systems for creating the
time stream being error prone makes sense. Honestly logging the
contents of the LSNTimeStream seems like it will be the simplest to
maintain and understand. I was a bit apprehensive to WAL log one part
of a single stats structure (since the other stats aren't logged), but
I think explaining why that's done is easier than explaining separate
LSNTimeStream creation code for replicas.

- Melanie

#19Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#18)
Re: Add LSN <-> time conversion functionality

On 8/7/24 21:39, Melanie Plageman wrote:

On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmhaas@gmail.com> wrote:

As I mentioned to you off-list, I feel like this needs some sort of
recency bias. Certainly vacuum, and really almost any conceivable user
of this facility, is going to care more about accurate answers for new
data than for old data. If there's no recency bias, then I think that
eventually answers for more recent LSNs will start to become less
accurate, since they've got to share the data structure with more and
more time from long ago. I don't think you've done anything about this
in this version of the patch, but I might be wrong.

That makes sense. This version of the patch set doesn't have a recency
bias implementation. I plan to work on it but will need to do the
testing like you mentioned.

I agree that it's likely we probably want more accurate results for
recent data, so some recency bias makes sense - for example for the
eager vacuuming that's definitely true.

But this was initially presented as a somewhat universal LSN/timestamp
mapping, and in that case it might make sense to minimize the average
error - which I think is what lsntime_to_drop() currently does, by
calculating the "area" etc.

Maybe it'd be good to approach this from the opposite direction, say
what "accuracy guarantees" we want to provide, and then design the
structure / algorithm to ensure that. Otherwise we may end up with an
infinite discussion about algorithms with unclear idea which one is the
best choice.

And I'm sure "users" of the LSN/Timestamp mapping may get confused about
what to expect, without reasonably clear guarantees.

For example, it seems to me a "good" accuracy guarantee would be:

Given a LSN, the age of the returned timestamp is less than 10% off
the actual timestamp. The timestamp precision is in seconds.

This means that if LSN was written 100 seconds ago, it would be OK to
get an answer in the 90-110 seconds range. For LSN from 1h ago, the
acceptable range would be 3600s +/- 360s. And so on. The 10% is just
arbitrary, maybe it should be lower - doesn't matter much.

How could we do this? We have 1s precision, so we start with buckets for
each seconds. And we want to allow merging stuff nicely. The smallest
merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
1s -> 10s -> 100s -> 1000s instead.

So we start with 100x one-second buckets

[A_0, A_1, ..., A_99] -> 100 x 1s buckets
[B_0, B_1, ..., B_99] -> 100 x 10s buckets
[C_0, C_1, ..., C_99] -> 100 x 100s buckets
[D_0, D_1, ..., D_99] -> 100 x 1000s buckets

We start by adding data into A_k buckets. After filling all 100 of them,
we grab the oldest 10 buckets, and combine/move them into B_k. And so
on, until B is gets full too. Then we grab the 10 oldest B_k entries,
and move them into C. and so on. For D the oldest entries would get
discarded, or we could add another layer with each bucket representing
10k seconds.

A-D is already enough to cover 30h, with A-E it'd be ~300h. Do we need
(or want) to keep a longer history?

These arrays are larger than what the current patch does, ofc. That has
64 x 16B entries, so 1kB. These arrays have ~6kB - but I'm pretty sure
it could be made more compact, by growing the buckets slower. With 10x
it's just simpler to think about, and also - 6kB seems pretty decent.

Note: I just realized the patch does LOG_STREAM_INTERVAL_MS = 30s, so
the 1s accuracy seems like an overkill, and it could be much smaller.

One way to make the standby more accurately mimic the primary would be
to base entries on the timestamp-LSN data that is already present in
the WAL, i.e. {COMMIT|ABORT} [PREPARED] records. If you only added or
updated entries on the primary when logging those records, the standby
could redo exactly what the primary did. A disadvantage of this
approach is that if there are no commits for a while then your mapping
gets out of date, but that might be something we could just tolerate.
Another possible solution is to log the changes you make on the
primary and have the standby replay those changes. Perhaps I'm wrong
to advocate for such solutions, but it feels error-prone to have one
algorithm for the primary and a different algorithm for the standby.
You now basically have two things that can break and you have to debug
what went wrong instead of just one.

Your point about maintaining two different systems for creating the
time stream being error prone makes sense. Honestly logging the
contents of the LSNTimeStream seems like it will be the simplest to
maintain and understand. I was a bit apprehensive to WAL log one part
of a single stats structure (since the other stats aren't logged), but
I think explaining why that's done is easier than explaining separate
LSNTimeStream creation code for replicas.

Isn't this a sign this does not quite fit into pgstats? Even if this
happens to deal with unsafe restarts, replica promotions and so on, what
if the user just does pg_stat_reset? That already causes trouble because
we simply forget deleted/updated/inserted tuples. If we also forget data
used for freezing heuristics, that does not seem great ...

Wouldn't it be better to write this into WAL as part of a checkpoint (or
something like that?), and make bgwriter to not only add LSN/timestamp
into the stream, but also write it into WAL. It's already waking up, on
idle systems ~32B written to WAL does not matter, and on busy system
it's just noise.

regards

--
Tomas Vondra

#20Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#19)
Re: Add LSN <-> time conversion functionality

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:

How could we do this? We have 1s precision, so we start with buckets for
each seconds. And we want to allow merging stuff nicely. The smallest
merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
1s -> 10s -> 100s -> 1000s instead.

So we start with 100x one-second buckets

[A_0, A_1, ..., A_99] -> 100 x 1s buckets
[B_0, B_1, ..., B_99] -> 100 x 10s buckets
[C_0, C_1, ..., C_99] -> 100 x 100s buckets
[D_0, D_1, ..., D_99] -> 100 x 1000s buckets

We start by adding data into A_k buckets. After filling all 100 of them,
we grab the oldest 10 buckets, and combine/move them into B_k. And so
on, until B is gets full too. Then we grab the 10 oldest B_k entries,
and move them into C. and so on. For D the oldest entries would get
discarded, or we could add another layer with each bucket representing
10k seconds.

Yeah, this kind of algorithm makes sense to me, although as you say
later, I don't think we need this amount of precision. I also think
you're right to point out that this provides certain guaranteed
behavior.

A-D is already enough to cover 30h, with A-E it'd be ~300h. Do we need
(or want) to keep a longer history?

I think there is a difference of opinion about this between Melanie
and I. I feel like we should be designing something that does the
exact job we need done for the freezing stuff, and if anyone else can
use it, that's a bonus. For that, I feel that 300h is more than
plenty. The goal of the freezing stuff, roughly speaking, is to answer
the question "will this be unfrozen real soon?". "Real soon" could
arguably mean a minute or an hour, but I don't think it makes sense
for it to be a week. If we're freezing data now that has a good chance
of being unfrozen again within 7 days, we should just freeze it
anyway. The cost of freezing isn't really all that high. If we keep
freezing pages that are going to be unfrozen again within seconds or
minutes, we pay those freezing costs enough times that they become
material, but I have difficulty imagining that it ever matters if we
re-freeze the same page every week. It's OK to be wrong as long as we
aren't wrong too often, and I think that being wrong once per page per
week isn't too often.

But I think Melanie was hoping to create something more general, which
on one level is understandable, but on the other hand it's unclear
what the goals are exactly. If we limit our scope to specifically
VACUUM, we can make reasonable guesses about how much precision we
need and for how long. But a hypothetical other client of this
infrastructure could need anything at all, which makes it very unclear
what the best design is, IMHO.

Isn't this a sign this does not quite fit into pgstats? Even if this
happens to deal with unsafe restarts, replica promotions and so on, what
if the user just does pg_stat_reset? That already causes trouble because
we simply forget deleted/updated/inserted tuples. If we also forget data
used for freezing heuristics, that does not seem great ...

+1.

Wouldn't it be better to write this into WAL as part of a checkpoint (or
something like that?), and make bgwriter to not only add LSN/timestamp
into the stream, but also write it into WAL. It's already waking up, on
idle systems ~32B written to WAL does not matter, and on busy system
it's just noise.

I am not really sure of the best place to put this data. I agree that
pgstat doesn't feel like quite the right place. But I'm not quite sure
that putting it into every checkpoint is the right idea either.

--
Robert Haas
EDB: http://www.enterprisedb.com

#21Tomas Vondra
tomas@vondra.me
In reply to: Robert Haas (#20)
Re: Add LSN <-> time conversion functionality

On 8/8/24 20:59, Robert Haas wrote:

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:

How could we do this? We have 1s precision, so we start with buckets for
each seconds. And we want to allow merging stuff nicely. The smallest
merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
1s -> 10s -> 100s -> 1000s instead.

So we start with 100x one-second buckets

[A_0, A_1, ..., A_99] -> 100 x 1s buckets
[B_0, B_1, ..., B_99] -> 100 x 10s buckets
[C_0, C_1, ..., C_99] -> 100 x 100s buckets
[D_0, D_1, ..., D_99] -> 100 x 1000s buckets

We start by adding data into A_k buckets. After filling all 100 of them,
we grab the oldest 10 buckets, and combine/move them into B_k. And so
on, until B is gets full too. Then we grab the 10 oldest B_k entries,
and move them into C. and so on. For D the oldest entries would get
discarded, or we could add another layer with each bucket representing
10k seconds.

Yeah, this kind of algorithm makes sense to me, although as you say
later, I don't think we need this amount of precision. I also think
you're right to point out that this provides certain guaranteed
behavior.

A-D is already enough to cover 30h, with A-E it'd be ~300h. Do we need
(or want) to keep a longer history?

I think there is a difference of opinion about this between Melanie
and I. I feel like we should be designing something that does the
exact job we need done for the freezing stuff, and if anyone else can
use it, that's a bonus. For that, I feel that 300h is more than
plenty. The goal of the freezing stuff, roughly speaking, is to answer
the question "will this be unfrozen real soon?". "Real soon" could
arguably mean a minute or an hour, but I don't think it makes sense
for it to be a week. If we're freezing data now that has a good chance
of being unfrozen again within 7 days, we should just freeze it
anyway. The cost of freezing isn't really all that high. If we keep
freezing pages that are going to be unfrozen again within seconds or
minutes, we pay those freezing costs enough times that they become
material, but I have difficulty imagining that it ever matters if we
re-freeze the same page every week. It's OK to be wrong as long as we
aren't wrong too often, and I think that being wrong once per page per
week isn't too often.

But I think Melanie was hoping to create something more general, which
on one level is understandable, but on the other hand it's unclear
what the goals are exactly. If we limit our scope to specifically
VACUUM, we can make reasonable guesses about how much precision we
need and for how long. But a hypothetical other client of this
infrastructure could need anything at all, which makes it very unclear
what the best design is, IMHO.

I don't have a strong opinion on this. I agree with you it's better to
have a good solution for the problem at hand than a poor solution for
hypothetical use cases. I don't have a clear idea what the other use
cases would be, which makes it hard to say what precision/history would
be necessary. But I also understand the wish to make it useful for a
wider set of use cases, when possible. I'd try to do the same thing.

But I think a clear description of the precision guarantees helps to
achieve that (even if the algorithm could be different).

If the only argument ends up being about how precise it needs to be and
how much history we need to cover, I think that's fine because that's
just a matter of setting a couple config parameters.

Isn't this a sign this does not quite fit into pgstats? Even if this
happens to deal with unsafe restarts, replica promotions and so on, what
if the user just does pg_stat_reset? That already causes trouble because
we simply forget deleted/updated/inserted tuples. If we also forget data
used for freezing heuristics, that does not seem great ...

+1.

Wouldn't it be better to write this into WAL as part of a checkpoint (or
something like that?), and make bgwriter to not only add LSN/timestamp
into the stream, but also write it into WAL. It's already waking up, on
idle systems ~32B written to WAL does not matter, and on busy system
it's just noise.

I am not really sure of the best place to put this data. I agree that
pgstat doesn't feel like quite the right place. But I'm not quite sure
that putting it into every checkpoint is the right idea either.

Is there a reason not to make this just another SLRU, just like we do
for commit_ts? I'm not saying it's perfect, but it's an approach we
already use to solve these issues.

regards

--
Tomas Vondra

#22Robert Haas
robertmhaas@gmail.com
In reply to: Tomas Vondra (#21)
Re: Add LSN <-> time conversion functionality

On Thu, Aug 8, 2024 at 8:39 PM Tomas Vondra <tomas@vondra.me> wrote:

Is there a reason not to make this just another SLRU, just like we do
for commit_ts? I'm not saying it's perfect, but it's an approach we
already use to solve these issues.

An SLRU is essentially an infinitely large array that grows at one end
and shrinks at the other -- but this is a fixed-size data structure.

--
Robert Haas
EDB: http://www.enterprisedb.com

#23Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#19)
Re: Add LSN <-> time conversion functionality

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:

On 8/7/24 21:39, Melanie Plageman wrote:

On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmhaas@gmail.com> wrote:

As I mentioned to you off-list, I feel like this needs some sort of
recency bias. Certainly vacuum, and really almost any conceivable user
of this facility, is going to care more about accurate answers for new
data than for old data. If there's no recency bias, then I think that
eventually answers for more recent LSNs will start to become less
accurate, since they've got to share the data structure with more and
more time from long ago. I don't think you've done anything about this
in this version of the patch, but I might be wrong.

That makes sense. This version of the patch set doesn't have a recency
bias implementation. I plan to work on it but will need to do the
testing like you mentioned.

I agree that it's likely we probably want more accurate results for
recent data, so some recency bias makes sense - for example for the
eager vacuuming that's definitely true.

But this was initially presented as a somewhat universal LSN/timestamp
mapping, and in that case it might make sense to minimize the average
error - which I think is what lsntime_to_drop() currently does, by
calculating the "area" etc.

Maybe it'd be good to approach this from the opposite direction, say
what "accuracy guarantees" we want to provide, and then design the
structure / algorithm to ensure that. Otherwise we may end up with an
infinite discussion about algorithms with unclear idea which one is the
best choice.

And I'm sure "users" of the LSN/Timestamp mapping may get confused about
what to expect, without reasonably clear guarantees.

For example, it seems to me a "good" accuracy guarantee would be:

Given a LSN, the age of the returned timestamp is less than 10% off
the actual timestamp. The timestamp precision is in seconds.

This means that if LSN was written 100 seconds ago, it would be OK to
get an answer in the 90-110 seconds range. For LSN from 1h ago, the
acceptable range would be 3600s +/- 360s. And so on. The 10% is just
arbitrary, maybe it should be lower - doesn't matter much.

I changed this patch a bit to only provide ranges with an upper and
lower bound from the SQL callable functions. While the size of the
range provided could be part of our "accuracy guarantee", I'm not sure
if we have to provide that.

How could we do this? We have 1s precision, so we start with buckets for
each seconds. And we want to allow merging stuff nicely. The smallest
merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
1s -> 10s -> 100s -> 1000s instead.

So we start with 100x one-second buckets

[A_0, A_1, ..., A_99] -> 100 x 1s buckets
[B_0, B_1, ..., B_99] -> 100 x 10s buckets
[C_0, C_1, ..., C_99] -> 100 x 100s buckets
[D_0, D_1, ..., D_99] -> 100 x 1000s buckets

We start by adding data into A_k buckets. After filling all 100 of them,
we grab the oldest 10 buckets, and combine/move them into B_k. And so
on, until B is gets full too. Then we grab the 10 oldest B_k entries,
and move them into C. and so on. For D the oldest entries would get
discarded, or we could add another layer with each bucket representing
10k seconds.

I originally had an algorithm that stored old values somewhat like
this (each element stored 2x logical members of the preceding
element). When I was testing algorithms, I abandoned this method
because it was less accurate than the method which calculates the
interpolation error "area". But, this would be expected -- it would be
less accurate for older values.

I'm currently considering an algorithm that uses a combination of the
interpolation error and the age of the point. I'm thinking of adding
to or dividing the error of each point by "now - that point's time (or
lsn)". This would lead me to be more likely to drop points that are
older.

This is a bit different than "combining" buckets, but it seems like it
might allow us to drop unneeded recent points when they are very
regular.

Isn't this a sign this does not quite fit into pgstats? Even if this
happens to deal with unsafe restarts, replica promotions and so on, what
if the user just does pg_stat_reset? That already causes trouble because
we simply forget deleted/updated/inserted tuples. If we also forget data
used for freezing heuristics, that does not seem great ...

Wouldn't it be better to write this into WAL as part of a checkpoint (or
something like that?), and make bgwriter to not only add LSN/timestamp
into the stream, but also write it into WAL. It's already waking up, on
idle systems ~32B written to WAL does not matter, and on busy system
it's just noise.

I was imagining adding a new type of WAL record that contains just the
LSN and time and writing it out in bgwriter. Is that not what you are
thinking?

- Melanie

#24Melanie Plageman
melanieplageman@gmail.com
In reply to: Robert Haas (#20)
Re: Add LSN <-> time conversion functionality

On Thu, Aug 8, 2024 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:

A-D is already enough to cover 30h, with A-E it'd be ~300h. Do we need
(or want) to keep a longer history?

I think there is a difference of opinion about this between Melanie
and I. I feel like we should be designing something that does the
exact job we need done for the freezing stuff, and if anyone else can
use it, that's a bonus. For that, I feel that 300h is more than
plenty. The goal of the freezing stuff, roughly speaking, is to answer
the question "will this be unfrozen real soon?". "Real soon" could
arguably mean a minute or an hour, but I don't think it makes sense
for it to be a week. If we're freezing data now that has a good chance
of being unfrozen again within 7 days, we should just freeze it
anyway. The cost of freezing isn't really all that high. If we keep
freezing pages that are going to be unfrozen again within seconds or
minutes, we pay those freezing costs enough times that they become
material, but I have difficulty imagining that it ever matters if we
re-freeze the same page every week. It's OK to be wrong as long as we
aren't wrong too often, and I think that being wrong once per page per
week isn't too often.

But I think Melanie was hoping to create something more general, which
on one level is understandable, but on the other hand it's unclear
what the goals are exactly. If we limit our scope to specifically
VACUUM, we can make reasonable guesses about how much precision we
need and for how long. But a hypothetical other client of this
infrastructure could need anything at all, which makes it very unclear
what the best design is, IMHO.

I'm fine with creating something that is optimized for use with
freezing. I proposed this LSNTimeStream patch as a separate project
because 1) Andres suggested it would be useful for other things 2) it
would make the adaptive freezing project smaller if this goes in
first. The adaptive freezing has two different fuzzy bits (this
LSNTimeStream and then the accumulator which is used to determine if a
page is older than most pages which were unfrozen too soon). I was
hoping to find an independent use for one of the fuzzy bits to move it
forward.

But, I do think we should optimize the data thinning strategy for
vacuum's adaptive freezing.

- Melanie

#25Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#24)
Re: Add LSN <-> time conversion functionality

On 8/9/24 03:29, Melanie Plageman wrote:

On Thu, Aug 8, 2024 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:

A-D is already enough to cover 30h, with A-E it'd be ~300h. Do we need
(or want) to keep a longer history?

I think there is a difference of opinion about this between Melanie
and I. I feel like we should be designing something that does the
exact job we need done for the freezing stuff, and if anyone else can
use it, that's a bonus. For that, I feel that 300h is more than
plenty. The goal of the freezing stuff, roughly speaking, is to answer
the question "will this be unfrozen real soon?". "Real soon" could
arguably mean a minute or an hour, but I don't think it makes sense
for it to be a week. If we're freezing data now that has a good chance
of being unfrozen again within 7 days, we should just freeze it
anyway. The cost of freezing isn't really all that high. If we keep
freezing pages that are going to be unfrozen again within seconds or
minutes, we pay those freezing costs enough times that they become
material, but I have difficulty imagining that it ever matters if we
re-freeze the same page every week. It's OK to be wrong as long as we
aren't wrong too often, and I think that being wrong once per page per
week isn't too often.

But I think Melanie was hoping to create something more general, which
on one level is understandable, but on the other hand it's unclear
what the goals are exactly. If we limit our scope to specifically
VACUUM, we can make reasonable guesses about how much precision we
need and for how long. But a hypothetical other client of this
infrastructure could need anything at all, which makes it very unclear
what the best design is, IMHO.

I'm fine with creating something that is optimized for use with
freezing. I proposed this LSNTimeStream patch as a separate project
because 1) Andres suggested it would be useful for other things 2) it
would make the adaptive freezing project smaller if this goes in
first. The adaptive freezing has two different fuzzy bits (this
LSNTimeStream and then the accumulator which is used to determine if a
page is older than most pages which were unfrozen too soon). I was
hoping to find an independent use for one of the fuzzy bits to move it
forward.

But, I do think we should optimize the data thinning strategy for
vacuum's adaptive freezing.

+1 to this

IMHO if Andres thinks this would be useful for something else, it'd be
nice if he could explain what the other use cases are. Otherwise it's
not clear how to make it work for them.

The one other use case I can think of is monitoring - being able to look
at WAL throughput over time. Which seems OK, but it also can accept very
low resolution in distant past.

FWIW it still makes sense to do this as a separate patch, before the
main "freezing" one.

regards

--
Tomas Vondra

#26Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#23)
Re: Add LSN <-> time conversion functionality

On 8/9/24 03:02, Melanie Plageman wrote:

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:

On 8/7/24 21:39, Melanie Plageman wrote:

On Wed, Aug 7, 2024 at 1:06 PM Robert Haas <robertmhaas@gmail.com> wrote:

As I mentioned to you off-list, I feel like this needs some sort of
recency bias. Certainly vacuum, and really almost any conceivable user
of this facility, is going to care more about accurate answers for new
data than for old data. If there's no recency bias, then I think that
eventually answers for more recent LSNs will start to become less
accurate, since they've got to share the data structure with more and
more time from long ago. I don't think you've done anything about this
in this version of the patch, but I might be wrong.

That makes sense. This version of the patch set doesn't have a recency
bias implementation. I plan to work on it but will need to do the
testing like you mentioned.

I agree that it's likely we probably want more accurate results for
recent data, so some recency bias makes sense - for example for the
eager vacuuming that's definitely true.

But this was initially presented as a somewhat universal LSN/timestamp
mapping, and in that case it might make sense to minimize the average
error - which I think is what lsntime_to_drop() currently does, by
calculating the "area" etc.

Maybe it'd be good to approach this from the opposite direction, say
what "accuracy guarantees" we want to provide, and then design the
structure / algorithm to ensure that. Otherwise we may end up with an
infinite discussion about algorithms with unclear idea which one is the
best choice.

And I'm sure "users" of the LSN/Timestamp mapping may get confused about
what to expect, without reasonably clear guarantees.

For example, it seems to me a "good" accuracy guarantee would be:

Given a LSN, the age of the returned timestamp is less than 10% off
the actual timestamp. The timestamp precision is in seconds.

This means that if LSN was written 100 seconds ago, it would be OK to
get an answer in the 90-110 seconds range. For LSN from 1h ago, the
acceptable range would be 3600s +/- 360s. And so on. The 10% is just
arbitrary, maybe it should be lower - doesn't matter much.

I changed this patch a bit to only provide ranges with an upper and
lower bound from the SQL callable functions. While the size of the
range provided could be part of our "accuracy guarantee", I'm not sure
if we have to provide that.

I wouldn't object to providing the timestamp range, along with the
estimate. That seems potentially quite useful for other use cases - it
provides a very clear guarantee.

The thing that concerns me a bit is that maybe it's an implementation
detail. I mean, we might choose to rework the structure in a way that
does not track the ranges like this ... Doesn't seem likely, though.

How could we do this? We have 1s precision, so we start with buckets for
each seconds. And we want to allow merging stuff nicely. The smallest
merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
1s -> 10s -> 100s -> 1000s instead.

So we start with 100x one-second buckets

[A_0, A_1, ..., A_99] -> 100 x 1s buckets
[B_0, B_1, ..., B_99] -> 100 x 10s buckets
[C_0, C_1, ..., C_99] -> 100 x 100s buckets
[D_0, D_1, ..., D_99] -> 100 x 1000s buckets

We start by adding data into A_k buckets. After filling all 100 of them,
we grab the oldest 10 buckets, and combine/move them into B_k. And so
on, until B is gets full too. Then we grab the 10 oldest B_k entries,
and move them into C. and so on. For D the oldest entries would get
discarded, or we could add another layer with each bucket representing
10k seconds.

I originally had an algorithm that stored old values somewhat like
this (each element stored 2x logical members of the preceding
element). When I was testing algorithms, I abandoned this method
because it was less accurate than the method which calculates the
interpolation error "area". But, this would be expected -- it would be
less accurate for older values.

I'm currently considering an algorithm that uses a combination of the
interpolation error and the age of the point. I'm thinking of adding
to or dividing the error of each point by "now - that point's time (or
lsn)". This would lead me to be more likely to drop points that are
older.

This is a bit different than "combining" buckets, but it seems like it
might allow us to drop unneeded recent points when they are very
regular.

TBH I'm a bit lost in how the various patch versions merge the data.
Maybe there is a perfect algorithm, keeping a perfectly accurate
approximation in the smallest space, but does that really matter? If we
needed to keep many instances / very long history, maybe it's matter.

But we need one instance, and we seem to agree it's enough to have a
couple days of history at most. And even the super wasteful struct I
described above would only need ~8kB for that.

I suggest we do the simplest and most obvious algorithm possible, at
least for now. Focusing on this part seems like a distraction from the
freezing thing you actually want to do.

Isn't this a sign this does not quite fit into pgstats? Even if this
happens to deal with unsafe restarts, replica promotions and so on, what
if the user just does pg_stat_reset? That already causes trouble because
we simply forget deleted/updated/inserted tuples. If we also forget data
used for freezing heuristics, that does not seem great ...

Wouldn't it be better to write this into WAL as part of a checkpoint (or
something like that?), and make bgwriter to not only add LSN/timestamp
into the stream, but also write it into WAL. It's already waking up, on
idle systems ~32B written to WAL does not matter, and on busy system
it's just noise.

I was imagining adding a new type of WAL record that contains just the
LSN and time and writing it out in bgwriter. Is that not what you are
thinking?

Now sure, I was thinking we would do two things:

1) bgwriter writes the (LSN,timestamp) into WAL, and also updates the
in-memory struct

2) during checkpoint we flush the in-memory struct to disk, so that we
have it after restart / crash

I haven't thought about this very much, but I think this would address
both the crash/recovery/restart on the primary, and on replicas.

regards

--
Tomas Vondra

#27Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#23)
1 attachment(s)
Re: Add LSN <-> time conversion functionality

On Thu, Aug 8, 2024 at 9:02 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:

Maybe it'd be good to approach this from the opposite direction, say
what "accuracy guarantees" we want to provide, and then design the
structure / algorithm to ensure that. Otherwise we may end up with an
infinite discussion about algorithms with unclear idea which one is the
best choice.

And I'm sure "users" of the LSN/Timestamp mapping may get confused about
what to expect, without reasonably clear guarantees.

For example, it seems to me a "good" accuracy guarantee would be:

Given a LSN, the age of the returned timestamp is less than 10% off
the actual timestamp. The timestamp precision is in seconds.

This means that if LSN was written 100 seconds ago, it would be OK to
get an answer in the 90-110 seconds range. For LSN from 1h ago, the
acceptable range would be 3600s +/- 360s. And so on. The 10% is just
arbitrary, maybe it should be lower - doesn't matter much.

I changed this patch a bit to only provide ranges with an upper and
lower bound from the SQL callable functions. While the size of the
range provided could be part of our "accuracy guarantee", I'm not sure
if we have to provide that.

Okay, so as I think about evaluating a few new algorithms, I realize
that we do need some sort of criteria. I started listing out what I
feel is "reasonable" accuracy and plotting it to see if the
relationship is linear/exponential/etc. I think it would help to get
input on what would be "reasonable" accuracy.

I thought that the following might be acceptable:
The first column is how old the value I am looking for actually is,
the second column is how off I am willing to have the algorithm tell
me it is (+/-):

1 second, 1 minute
1 minute, 10 minute
1 hour, 1 hour
1 day, 6 hours
1 week, 12 hours
1 month, 1 day
6 months, 1 week

Column 1 over column 2 produces a line like in the attached pic. I'd
be interested in others' opinions of error tolerance.

- Melanie

Attachments:

example.pngimage/png; name=example.pngDownload
�PNG


IHDR��5���9tEXtSoftwareMatplotlib version3.9.0, https://matplotlib.org/4H�	pHYsaa�?�i>;IDATx���yxT����{&�$�d ��,����&n��"�E�m�E[����E�+�ntQ�}�7*Xl�"J�]�`!a�d!�d����DY$y&s���:�37s���s�y�eY�`N���(�6C�
 ��Pl�`3@����f(�6C�
 ��Pl�`3@����f(�6C�
 ��Pl�`3@����f(�6C�
 ��Pl�`3@����f(�6C�
 ��Pl�`3@����f(�6C�
 ��Pl�`3@����f(�6C�
 ��Pl�`3@���L������WAA�����p8L��`Y�JJJ���,���ca�PPP���T�1�y���WJJ��FP/@tt����@111�����x<JMM���#
�8q�7&&�@3c����y���(�6C�
 ��Pl�`3@����f(�6C�
 ��Ph"����h��r�Q`s@���oi��/������R��,W>�e:l*h���3���������h��A�?~���<y���i��-[j��q***2��r�P������J����5v�g�R�1�v�0%%E�=����Y���W��+��u�]�M�6I����~��7O����rrrTPP��c�N
V�/�Z�t������{*��u�G��?/�����g8!��aY�m��ccc������nP\\�f���n�A��e�u��]��/�����z�Gn�[������i���f,�p��=��,KZ��K�1���<�����`S�$�C�ztl/
���p����w���������RYY�
�5k����J���}L�n����������u�^�<O�
����y�,K���:���$%�D��������*>��]�t��+4���*>Ve81�]P�
6�e��r�\����5g�]t�E*,,Txx�Z�jU��			*,,<��M�>]n��vKMMm�?���V����|I�m��u��=�h���%;M������T��o�'�C���kW�]�V+W��=����'j�����z��MSqqq������i�h��B.�TbL��w�?�c��az��^z����!���xu�_���QaqE'�uW�N���o_M�>]YYY�������DUVV����u_TT����������������������[��r�������/�������m.�UO���_���1h@A]��������o��
����k���u����4h� �	�dSA��|}D�N�n�_���"�B4uDW}p�%�8��J����������������.BMh,��M��Q���������5KK�,����v�5i�$M�:U�������}���A���
`���y���=qN�����w��+��l������?-��.��{.�(WhHcD�Mm�������i��}r��������u�UWI��~�i9�N�7N^�W#G����>k85 Xx*�4�����������q:4qp{]uQ�~5w�>��_3>�������e�oz�����<�
�y������zx�fuIh��S���p\��Y�����#�6�`i�����������k���������)X�U{����/��I������d}8�R���"��^]��F<�Tnf)S�
 
l��C�y�L-�C4��v
��������,��?�Jo�}���WWk��/���)cP?@���G���������V~2Lw_�Q!N���~��?��wV�3�4��@*�Th���C��Z��!E��������C���[��j���u��W*�`Y���h�(�4�7?���oi@�XuKl�z�sk����k�+"������K�������$�
 
���������w���������?���3�R]�����~�q�V}���j]��&���G��|��HE�����=N��hcJk�W���&d�uT�������~��}�Y���F2!�Ph '�~����C�}�:�����^�1�������j��K�����\@���%�l�!9���i��H���ti�M��;��]�H�9rL_�\���V��*M��A@��u����v�"
����������t��9��/���'�h��{�2��(�\���j�c�I5+��P�f�Ez�GC�-1ZG��t���4��U�?\n:������x���M��vjk:��Nm�y�
��FvUx�SK�������>��j���
 ��,������������u�S�/��?�D;��X�O���W;�3m.����&@�|�wT��y�
u���)����q-�����z)&"T��k�_������g:�pb���f%�UT��4���p���i������^I��-�\�SW�X��v4���y:T������$���+4���=sk��{������u�����g���r��	6@��;�����WV�[�)�L�iW]��ES�����r8j��������2&�P8>��7V����`�����w8H��[�`i���������*8z�t<4
 �!g�~�9rL��0��J6�Q�k��x������o��������w��g4�9�pNL�2�_�"�B�i<��M�E���%���Ze�>���M��������t<�'
 �(�P��l; I�5;���N��h���A��u=���/����?��m���)c�
 ������eI����}���4���������4�{��|���x����'Z����x8@�AE�O�������6�$w�����z��>��vi��2�n���A��*��P@���7����*�k�+����c����5������������7V����r�pS��t8
 ����+�����f��ocsG���q�z�������<^���5���5*�T�����PO�����
qhB�T�q���m4�'�h���th��B
*G�V�1eL�PO'�~�3Iq�.�iODX�~6����7TY)n�TT��s6����������$@����J���+�������'���
���s�"�B�y�a���'��G�UY�7�P/�����*��%F�_zk�q^���IC3������.q������l���,��yGL��=
 g��[z}���u8����Rc��������z+�E���h������M*�V��g[@������{�L-]�sq;�q�����z���S/��>�dY�K��������-�M��%
 g�������}���+�l�f,�E����[��5@����{���|y�~���:X�5�V(�����cZ��HR��_\�a]��p�0}��9�?�h�S9��f�,�)c��3xse���4�C��$D��4��C������CuQR���W��w���>����L�z@N����7k��mo6L��������/Fu�+��e;j����[�NU��2��P8��l.�����]�#�t������v��)�4�cUT�5}�]�������t��D�4N��q��4��������m�7�'[���)wd�6x��g�M�
J\��)l+*����
q:t���m*�C�������zj�6�|dW����S81��U����4��~��]�>���A��l���[���`�_/
 �0���*�V�C\
���t��Q8�e���o6��"8Q8����hKa�"���7�t�QP8���_��n'wd��4@��8}�t���_��������1c�u��:������p8�lw�}�����x5�>I�����0''G�'O��+�h�"UUUi��*+�������}����v���h(1��wV���g���V���m:�h�v������/+>>^k����a�j�GEE)11�����o�������\��~Sqq�Z����u����j���z���i����������z��x�l������*(�P��0]�+�t�Q����~��L��!C��g����o�����+99Y�����>��[�����;��L�>]�<�HS�4�����M������i����,�2���s�=�?��-[����_���G��+���;��c�o���z��zko{<������X111����v,�eO,��!�<p���D���F��x�v�m���#���{�>��-]����O����%����r��r5JN�9o�����K�����,�w�}�3g��,Y�����>g������$���������{$��/�#h����5k�,�������Vaa�$��v+22R;w���Y�t�5��M�6Z�~����~
6L��������u*>V������K��8@��8s�LI5�=����^�w����p}����1c��������q���W�����SN��{kv�B���{�x�k[RSS����Di�h]�Q��S���&�c�_��m���N��]���6-���A����J�s]�$����P�4{�y���()F}�Z��4)
 �v�~��������pp�����e;j��rE�Bu]�d�q�&G���u��MQTx�N��`+{�����$q�����7W��oI�;�Q�����FP�QY��[��$I�3��lc��}:XZ����_�`:``'�~�y@��B�
�}��`_��h��#
q:t��4�q�(�[81�7�G�b"����^IE��|�W�t���f��zs����J�:��������QA��,������Y��(�� �2����/UTx�����t  PA����c.n���0�i��@���
-�X(I�-��?�(������|U�-�Ko���cL���}~�Zy|��A��'����_�W��BmZ��������J'V����\�!����:;�j���r8�[�Y��&
  �������+��+�u��4@���Jye��]�/I�m �B�y�
TRQ���(
�g:�(���aY�^=���m��t��/p*@@�X�T�
<
uj|�T�q��E�����LV������
��*���}�X�8
  (��:_��~�j�VV��t �Q���o���5�o�.���?�3��������b"B5:+�t �Q�����~�/U�����
���.�G[�K�ne�_�^(��fm��y�,���m�!���8@�@4[�j��^���������
u��RI�]�-�t�����+�2 M�!|����@����Xk�>�P�C7`�_�\P���+�$IW�LT|t��4@�B4;��*��r����?�
 ��yo����KBK
��5hv��N�>]���Wtt����5f�m����c***4y�d�i�F-[���q�TTTd(1�>,�����u���0''G�'O��+�h�"UUUi��*++�}�����y����w�UNN�


4v�X��g�|�!�<P��!sq;�q�f)�t���`��:�_~�e���k��56l������/h��Y���+$I/����w��+Vh���&b�������i���0�i��)hG����X����5kTUU�����>�[�nJKK�����d�Yaq�n�9U���������<�����)S4d����S�TXX���p�j���cTXXx���z��z���=O�e|�������4 #V]�M��-[�N�<Y7n�[o�uA�3}�t����-5��G��T��z��������0A_���^}���������R�?11Q���:z�h��)11���5m�4�n����p�E�������-]������'h�eY���{5g�}��G����s��}�������u����4h��S����RLLL�
�4^[^s���R�__@��s'O��Y�f����Wttt�y}n�[���r���4i��N����X���������A�����%Z������i���^���3gJ�.���:�_z�%�q�����~ZN�S���������#����6qR���X�wx�%��4�h��eY�C4W�Gn�[����FR����G��[��&
�%��LGB3��w���-P��Zm[hH����A�X�e����%I�f���d�_�!P���#�RX��0���e�U��P���/��J�;�u��B���^�{C�^��4,
   ��:_�>��R[�W��t �P�����
��p�l���G��UT����d:t(��������?&�KUDX��4@��������������4<
  ��Z�'��.���6-L��0*�|z{u�$.�0��~���W�]�H]�-�t hQ����d�)�u�FC�
{��6���B��?��������W���tN7
 ����*��n�$.��`��/�����n������t �QF��V�����������QF}���r���+Tcz�3�
 ���V��$���N-\�f�6A����m.�$���@���yse���4�C�:'D���`De�_o�:��o{�a������B(�*.��=L�l�0���5S��< Ma!|M�q�&���D+s+����X�hj@@�;1��U����4��
 �I�z��������/�	@@����^�z��!��wlc:`K@@����������)@@�Y�����(2,Dc������d^;>�7��d�#����h�K*�`�>I���F4�wV���g�OZ+�Hv������}~�Z�'��_�@@4����WAq�b[�kT�$�q�����?&�KUDX��4(��F�{�L�l?(�C�5;�t���G�.����(�iH@@#:V���k�H�Y�@`����*>V���H
�g:��(��Fsb��[���d�_ PP�b]�Q��S��P�&�K5�I��.]�T�G�Vrr�����[��;��C���v��W�	A���/��������8Y����2eee��g�9�c���j����v{��7�0!�#e����@�t+'�t��2j�(�5���q�\JLLl�D`��������1�8���8�!hG�c��%���W��]u�=����C�#@���[z}e��������� ����\}��;v�222�s�N������Q��|�r���z�"��+��[{���4U\h6>�qP_*WtD���;�t�`�x�M7����W/eff�c��Z�d�����S>g���z��G�*"4K�-����o���m�54[>Y���m[�������6m����k����&L�o��r}��H�t+����g�:tHIII�}������j�T����y���4�Su�ki:���XZZZg4/77Wk��Ull�bcc��#�h��qJLL���;����\�:u���#
����[����j����/����^�Z�_~y���S�J�&N���3gj���z��Wt��Q%''k�����~���u��R	1.
��`:�3�x�e�������p��&L������HWh��������>�V�>�P�C7
`�_ �Q������J��0���P���Js��+��_���� s����J�:��������
 ��UT����\I��4'@�y{b�V�>T��h���ig:�z����]����5�����0���p�J�����u�,������[��H�p���W�?|L�ZE��k����Q�$g��Z�'Iz|<�~�������*=8{�$����5�c[���
 ����I��
e�m���f:��D�����z���r:�'�g)2<�t$��8�C�^=4g�$���vT�����@�Y����l���JuK�����MGp�(��3���-�T�P�CON��+�C�@sG�V��B���Q���+;�G��p"
�8%����?��SQ��������HpJo����������W,����������%I?�U���
'��(��:�~K?��Ne�>
h�;�d����Qu���n��uXQ�!z||�B���40
 �������-��_^�]�mZN�1P��j�_?}g���~]���n�N3	@#�$I[�Kk��*:"T�)��C�@����>�f|�M����Jni8��D����k�;�T��t�E	����Hl�/m�W�<j�G����_�(�`ck����%;%I�{}/�E�'�(�`SU>������-}7+Y��J2	@��M=�p�v(S\�K�����hB@������Os%I��VQ��hJ@��2o���N�%��?U�w�7	@���<����������C�v7�@���m���<I���3f8(�`��Uzp�zI���kp���0�6���M*�T(�m=xu7�qDX��P�}�WN����,E����� 
 �C�^=4g�$���vT����0�A��,=4g��U�kB���l:�@� ��uZ��P�N����%W(�~P hy*���%I?���z�sN PP Y����^��je��u�eMG@(���^��%[(<��'�g),��{��'�������%I?�U��
'h��.]�T�G�Vrr�����[�~������FIII��������}�v3a�����~6{��*}������a:�����LYYYz��gNy���G����s�=��+W�E�9r�***�8)4�W����]��'�g)��0	@
5���5J�F�:�}�ei�����~����N�����*!!As���M7���Q�A�<P���o�$����Jo��p"�*hG�$77W���>|x�>�����l-_�����z��x<u6�>�xw���~]���n�N3	@�e,,,�$%$$�����P{��L�>]n��vKMMm��P_[�K_�UtD��0.S�~��-���6m����k���|��@_��h���$I����V��t�,���������:����j�;������:�TY���w���g���4�O;��4�,�JLL����k�y<�\�R�
2���_>����y�:*L�^��C��%h�.--��;jo���j������UZZ��L����������222��_�Z���3f���p����KvJ����^��vN����z�j]~�����N�*I�8q�^~�e���?WYY�~������:t�,X���S���*�|��;k��[�nV����d:�f�aY�e:Ds��x�v�U\\������?���[���h��?L���MG�
��mz 4g+w���J��0���9�@3R������������n	g|��G����S�V���w������@3�����X�'Iz��LEG�N���@3P|�J�^/I�cp{
���p"���G�mR��Bm[���������@�[��P�}�WN����LE�������@;T��Cs6H�~0�����N P @Y��_���������R�_��t$A����
4c�B�=5��\���0(���<��������������&@0�e������Z�)n�����#2@0o�����������G5����
�������%I?�U��
'�(� �~K?��Ne�>�o�Zw
�0	@��@�xe�n��uX�a!zb|�B���)
 ��J���-��_^�]�mZN �Q��j�_��N�j�.��V�e��� �Q���?��/��*��?������_��m)���E�$I��n%��4��P���j����NU>K��'h\�v�#�	
 ����k�>�ZG����=9���P��u�G���������������	�XE�OS�Y+�����d]��d:��@{b�V�<P��h���w{����(���V�:�>��$=6��Z�7��Q���y����u�,iB�]�=�t$6E�&����R��cj�*R���E���1
 4��m���<I��7d*:"�p"vF�FV|�J�^/I�cp{
���p"vG�F���M*�T(�m=xu7�q��i��B���^9��3b:P��*���9$I?�Q}�c
'�@h�e�Ws7�`i��$���Wu6	jQ��s]��o,T����&��+�C�X��B�y�$��+:�g;��DP�eYz��U|�J������;���B����|-�z@��N=9!Ka!|�<|2@�?\��}�Y����.��m8���o�g������~��5ih����(��^Y�[+vVdX������t$8-
 \��J���-��_^�M���0�������������������t����(�p��d���;�hW��pC������>���r8u�n�������R�����I�~3�"�ki8�O������C~�a���P��%������o�S��������o��HPo�o;���JLL4@3����k�>�ZG������pp�@�a�C���}�v%''�C����[���w��z�^y<�:�Y�T�,�)I���^���0����`vv�^~�e-X�@3g�Tnn�.����������O�����RSS�81�*�|������[���k3�LG�s��,�2"P=zT���z���4i��o���z��zko{<������X111M�!������'���v�?S��u�p���#��#��m��o��x�V�Z�K�.��c�)�w�\r�\M�
@��<���oY�$����(�-[����R���SII�PW��Z��N�%M���+�'�������PNN�v����>�L�_�BBBt��7�� �<����w�\�ZE�����t� �>�g��|��:t�����4t�P�X�Bqqq�� 9����53���LEG�N������2@�+>V�g��$M��!��N�����l��I��
�o�G�T$��@��X��P�}�WN����,E����	� B�S8T��Cs6H��?�����N
����H����NK+�%�����t$hP����|}D/.��������P�CO��������AQ�Z�������\��?Z�H�6����+�m.4
 [*.�������g�UP\!I
q�����kh��'�s}P�@`+��������f��+}��6-�u��t�60]q���
 �Q=���|�!��,W����e�������d����9���P-o�O������r�y��v��]�4ih
��F��`B0� �*����yzm��:P��$E�9uC��9$C�ZNfQ�mE%zqY��|�W�j�$)1&B���[��UT���(��5��������\}��`����&
��5�����p2
 �f�X�O�}�G/}�[;��J��i�E��tI������>8
 �f��S�W�����y:R^%Ij�
��~��sH{��FN���Y���X/,���T����%�u���^7�OUtD����|P,����_��e��<�p��~��5ih�F�HT����p�(�N��Z�����������%I�N�����IC3����l@h�(���#�z���zkU�J*�%I��0�����
JW�;�pB@�����^\���
�������m�94C���ST8U���T`D�������\��?Z�H�6�44C�u������QP4���*��*O�~�[�����������f�{R����(��D��2��i�f����J�$�M�p�60]�
LW\��pB�
 �FcY���:���j����jN�S��h�5$C�������!��(�����y����e����S����q�4���tj�2m`@�9T��+�����u��+I�sj\��9$C��[N�(������,Ws��+o�_����
N�-��**�pB��(���eY��v@/,��'����Lqk��]�+Ia!N�	�CpN*�|z���z��\��_*Ir:�%j�%������ �Q��~O�^]���X����WI�Z�B5�_���^��Q�����6�-��r���U�j�qIi�;����Sf8!�\Q|��o�������\}�{�v����44C#z$*�e�����U��������g����rIR���kz%i��e��2� (���H�^�l��Z����jI�;2L7H����JrGNhH@���|}D/.���M���k���������q}�)*��F|�6S��k��B��,Wk�������&
��e]����>j@ �Y��#�U�;\�������T�W����B���u��u��uO�1��T(�@3���i��c�;\�=��^�vL���U����s���m�u��t�E���D�eY:P�U��r�>vR�+W��rz*dYg~�����F)%6R;��w���4@���X�O�Gj
���.�x�;V�;��[��(56Ji'�6Q���������M����������?(����N�����S�RZ��vl�p���h ����F�����=G������GG�*�MM�K��Rj����%��Tx����$�`G��������7
�o.�<��C��k���Qu���Q��h@�$����5z�|�s����g���uTX����.56JI���0�0����g����?���Beee�/��`:I��������������������RZG�)x���F*:�Q<@��u|���5u�T=��s�����34r�Hm��U������<|s������;T��#5���Y����Ki���(xQJ��P�d�9�e�m�������������$���+55U��w�~��_����Gn�[�����i��(X�%�Us���o�g���?���_�}��'��,���������>��~�I�9��'=����}}�����*�_���3N||2W������(��@p����a�o���J�Y�F��M���t:5|�p-_��`2i��}��������R�������o���2W������\�jK�7,N����[�E��2`k�-����SBBB��			��e�)���z���w�6���(���W���4�k_�C
q:�t8�t��t(��P��������o'?����||����sO������p(�)9�5Y��#jG�RZ3�1gb�x>�O��Gy���a]�Z��tr���K�I��[�t��9�t�[��T���C��$l[��m�����_TT����S>g��i�:uj�m��������7�����n���$�NJ��}�j������~�/^�A���9.�K111u6����#��4u�TM�8Q�����4c������;�4
�����x��:p��~������P�{����ua@0��<��y�h~����9�vE�
 ��Pl�`3@����f(�6c���.��ET<��$��N|o�y14
�())�$���N�UII��n��F�����*((Ptt�G�����Qjj����m�Na}�^��U��^��U��^�_c�W�e���D���r:�y6#���t*%%�Q���>$�����x������x������������	���6F�
`�r�\��o+��e:J����?������?������?����E 6� ��Pl�`3@���g�yF���WDD������������.]���G+99Y�Cs��5) M�>]���Wtt����5f�m���t��5s�Leff�N>;h� ��?�t����c���ph��)����~X�����[7�����{u�m��M�6���T�^��z�j���
0������:u�~�����/�PVV�F�������p�������g�y�t��������'k��Z�h����4b�������RRR��c�i��5Z�z����
]w�u��i��hk��U�������L�QZ�=�o���m��e�#�#G�h��!
�����y�f=���j����hA�i`Lvv����������j�NMM�}���_����.���9s�h��1���(>>^9996l��8�Bll��qM�4�t��SZZ�>}���g����{���[3f�0+�<����;w���]k:J���/~�O?�T�|���(A��RYY�5k�h������N������L�`R\\,�����|>��z�-���i��A�����'��k�����S��}������C�z�����3) ����T�~�4~�x������/����w���0�<xP>�O			u�'$$����P*���)S�h��!�����8k��
j���\.����n��3G]t��X�����_|�������������/k���9s�rssu�%�����t���k�.��9S�;w���u�=���?��^y����J��������q�F�=:��]�j���*..����5q�D���PO�������'Z�h�"""L�	x�F���933S���JOO�;����������_?=������/�X7n�s�=��'N< m��UHH������/**Rbb��T��{�>��}���JII1'�����S�N�����O����,��O2+��Y�F���W�>}���P�������BCC���LGh�Z�R�.]�c��QNRR������{w�70
`	W��}�x���}~�_�/��#�7��t���j��9�������a:R������zM�(W^y�6l���k��n���������k�*$$�t��VZZ��;w*))�t��3d��oMU�m�6���J�8`�N���'�_�~0`�f�����2�y��������:�{������k���4������5k�,��������=���v+22�p��3m�4�5Jiii*))��Y��d�-\��t�����H[�h�6m�p~�)<��=z����UPP�����
		��7�l:Z�����5x�`=����0a�>��s=���z���MG.�_��+--�
�`�X��t�����[���M�8�t��r��H���K/������.+==�
�����+������c:V�p���Z?��OL�H7�x����d���[����n��Fk���c�y��Y={��\.���[7����7)�0 ��p ��Pl�`3@����f(�6C�
 ��Pl�`3@����f(�6C�
 ��Pl�`3@����f(�6C�
 ��Pl�`3@����f(�6���f}�4���IEND�B`�
#28Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#26)
Re: Add LSN <-> time conversion functionality

On Fri, Aug 9, 2024 at 9:09 AM Tomas Vondra <tomas@vondra.me> wrote:

On 8/9/24 03:02, Melanie Plageman wrote:

On Thu, Aug 8, 2024 at 2:34 PM Tomas Vondra <tomas@vondra.me> wrote:

each seconds. And we want to allow merging stuff nicely. The smallest
merges we could do is 1s -> 2s -> 4s -> 8s -> ... but let's say we do
1s -> 10s -> 100s -> 1000s instead.

So we start with 100x one-second buckets

[A_0, A_1, ..., A_99] -> 100 x 1s buckets
[B_0, B_1, ..., B_99] -> 100 x 10s buckets
[C_0, C_1, ..., C_99] -> 100 x 100s buckets
[D_0, D_1, ..., D_99] -> 100 x 1000s buckets

We start by adding data into A_k buckets. After filling all 100 of them,
we grab the oldest 10 buckets, and combine/move them into B_k. And so
on, until B is gets full too. Then we grab the 10 oldest B_k entries,
and move them into C. and so on. For D the oldest entries would get
discarded, or we could add another layer with each bucket representing
10k seconds.

I originally had an algorithm that stored old values somewhat like
this (each element stored 2x logical members of the preceding
element). When I was testing algorithms, I abandoned this method
because it was less accurate than the method which calculates the
interpolation error "area". But, this would be expected -- it would be
less accurate for older values.

I'm currently considering an algorithm that uses a combination of the
interpolation error and the age of the point. I'm thinking of adding
to or dividing the error of each point by "now - that point's time (or
lsn)". This would lead me to be more likely to drop points that are
older.

This is a bit different than "combining" buckets, but it seems like it
might allow us to drop unneeded recent points when they are very
regular.

TBH I'm a bit lost in how the various patch versions merge the data.
Maybe there is a perfect algorithm, keeping a perfectly accurate
approximation in the smallest space, but does that really matter? If we
needed to keep many instances / very long history, maybe it's matter.

But we need one instance, and we seem to agree it's enough to have a
couple days of history at most. And even the super wasteful struct I
described above would only need ~8kB for that.

I suggest we do the simplest and most obvious algorithm possible, at
least for now. Focusing on this part seems like a distraction from the
freezing thing you actually want to do.

The simplest thing to do would be to pick an arbitrary point in the
past (say one week) and then throw out all the points (except the very
oldest to avoid extrapolation) from before that cliff. I would like to
spend time on getting a new version of the freezing patch on the list,
but I think Robert had strong feelings about having a complete design
first. I'll switch focus to that for a bit so that perhaps you all can
see how I am using the time -> LSN conversion and that could inform
the design of the data structure.

- Melanie

#29Melanie Plageman
melanieplageman@gmail.com
In reply to: Melanie Plageman (#28)
Re: Add LSN <-> time conversion functionality

On Fri, Aug 9, 2024 at 9:15 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Fri, Aug 9, 2024 at 9:09 AM Tomas Vondra <tomas@vondra.me> wrote:

I suggest we do the simplest and most obvious algorithm possible, at
least for now. Focusing on this part seems like a distraction from the
freezing thing you actually want to do.

The simplest thing to do would be to pick an arbitrary point in the
past (say one week) and then throw out all the points (except the very
oldest to avoid extrapolation) from before that cliff. I would like to
spend time on getting a new version of the freezing patch on the list,
but I think Robert had strong feelings about having a complete design
first. I'll switch focus to that for a bit so that perhaps you all can
see how I am using the time -> LSN conversion and that could inform
the design of the data structure.

I realize this thought didn't make much sense since it is a fixed size
data structure. We would have to use some other algorithm to get rid
of data if there are still too many points from within the last week.

In the adaptive freezing code, I use the time stream to answer a yes
or no question. I translate a time in the past (now -
target_freeze_duration) to an LSN so that I can determine if a page
that is being modified for the first time after having been frozen has
been modified sooner than target_freeze_duration (a GUC value). If it
is, that page was unfrozen too soon. So, my use case is to produce a
yes or no answer. It doesn't matter very much how accurate I am if I
am wrong. I count the page as having been unfrozen too soon or I
don't. So, it seems I care about the accuracy of data from now until
now - target_freeze_duration + margin of error a lot and data before
that not at all. While it is true that if I'm wrong about a page that
was older but near the cutoff, that might be better than being wrong
about a very recent page, it is still wrong.

- Melanie

#30Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#29)
Re: Add LSN <-> time conversion functionality

On 8/9/24 17:48, Melanie Plageman wrote:

On Fri, Aug 9, 2024 at 9:15 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Fri, Aug 9, 2024 at 9:09 AM Tomas Vondra <tomas@vondra.me> wrote:

I suggest we do the simplest and most obvious algorithm possible, at
least for now. Focusing on this part seems like a distraction from the
freezing thing you actually want to do.

The simplest thing to do would be to pick an arbitrary point in the
past (say one week) and then throw out all the points (except the very
oldest to avoid extrapolation) from before that cliff. I would like to
spend time on getting a new version of the freezing patch on the list,
but I think Robert had strong feelings about having a complete design
first. I'll switch focus to that for a bit so that perhaps you all can
see how I am using the time -> LSN conversion and that could inform
the design of the data structure.

I realize this thought didn't make much sense since it is a fixed size
data structure. We would have to use some other algorithm to get rid
of data if there are still too many points from within the last week.

Not sure I understand. Why would the fixed size of the struct mean we
can't discard too old data?

I'd imagine we simply reclaim some of the slots and mark them as unused,
"move" the data to make space for recent data, or something like that.
Or just use something like a cyclic buffer, that wraps around and
overwrites oldest data.

In the adaptive freezing code, I use the time stream to answer a yes
or no question. I translate a time in the past (now -
target_freeze_duration) to an LSN so that I can determine if a page
that is being modified for the first time after having been frozen has
been modified sooner than target_freeze_duration (a GUC value). If it
is, that page was unfrozen too soon. So, my use case is to produce a
yes or no answer. It doesn't matter very much how accurate I am if I
am wrong. I count the page as having been unfrozen too soon or I
don't. So, it seems I care about the accuracy of data from now until
now - target_freeze_duration + margin of error a lot and data before
that not at all. While it is true that if I'm wrong about a page that
was older but near the cutoff, that might be better than being wrong
about a very recent page, it is still wrong.

Yeah. But isn't that a bit backwards? The decision can be wrong because
the estimate was too off, or maybe it was spot on and we still made a
wrong decision. That's what happens with heuristics.

I think a natural expectation is that the quality of the answers
correlates with the accuracy of the data / estimates. With accurate
results (say we keep a perfect history, with no loss of precision for
older data) we should be doing the right decision most of the time. If
not, it's a lost cause, IMHO. And with lower accuracy it'd get worse,
otherwise why would we need the detailed data.

But now that I think about it, I'm not entirely sure I understand what
point are you making :-(

regards

--
Tomas Vondra

#31Tomas Vondra
tomas@vondra.me
In reply to: Melanie Plageman (#27)
Re: Add LSN <-> time conversion functionality

On 8/9/24 15:09, Melanie Plageman wrote:

...

Okay, so as I think about evaluating a few new algorithms, I realize
that we do need some sort of criteria. I started listing out what I
feel is "reasonable" accuracy and plotting it to see if the
relationship is linear/exponential/etc. I think it would help to get
input on what would be "reasonable" accuracy.

I thought that the following might be acceptable:
The first column is how old the value I am looking for actually is,
the second column is how off I am willing to have the algorithm tell
me it is (+/-):

1 second, 1 minute
1 minute, 10 minute
1 hour, 1 hour
1 day, 6 hours
1 week, 12 hours
1 month, 1 day
6 months, 1 week

I think the question is whether we want to make this useful for other
places and/or people, or if it's fine to tailor this specifically for
the freezing patch.

If the latter (specific to the freezing patch), I don't see why would it
matter what we think - either it works for the patch, or not.

But if we want to make it more widely useful, I find it a bit strange
the relative accuracy *increases* for older data. I mean, we start with
relative error 6000% (60s/1s) and then we get to relative error ~4%
(1w/24w). Isn't that a bit against the earlier discussion on needing
better accuracy for recent data? Sure, the absolute accuracy is still
better (1m <<< 1w). And if this is good enough for the freezing ...

Column 1 over column 2 produces a line like in the attached pic. I'd
be interested in others' opinions of error tolerance.

- Melanie

I don't understand what the axes on the chart are :-( Does "A over B"
mean A is x-axis or y-axis?

--
Tomas Vondra

#32Melanie Plageman
melanieplageman@gmail.com
In reply to: Tomas Vondra (#31)
1 attachment(s)
Re: Add LSN <-> time conversion functionality

On Fri, Aug 9, 2024 at 1:03 PM Tomas Vondra <tomas@vondra.me> wrote:

On 8/9/24 17:48, Melanie Plageman wrote:

On Fri, Aug 9, 2024 at 9:15 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

On Fri, Aug 9, 2024 at 9:09 AM Tomas Vondra <tomas@vondra.me> wrote:

I suggest we do the simplest and most obvious algorithm possible, at
least for now. Focusing on this part seems like a distraction from the
freezing thing you actually want to do.

The simplest thing to do would be to pick an arbitrary point in the
past (say one week) and then throw out all the points (except the very
oldest to avoid extrapolation) from before that cliff. I would like to
spend time on getting a new version of the freezing patch on the list,
but I think Robert had strong feelings about having a complete design
first. I'll switch focus to that for a bit so that perhaps you all can
see how I am using the time -> LSN conversion and that could inform
the design of the data structure.

I realize this thought didn't make much sense since it is a fixed size
data structure. We would have to use some other algorithm to get rid
of data if there are still too many points from within the last week.

Not sure I understand. Why would the fixed size of the struct mean we
can't discard too old data?

Oh, we can discard old data. I was just saying that all of the data
might be newer than the cutoff, in which case we can't only discard
old data if we want to make room for new data.

In the adaptive freezing code, I use the time stream to answer a yes
or no question. I translate a time in the past (now -
target_freeze_duration) to an LSN so that I can determine if a page
that is being modified for the first time after having been frozen has
been modified sooner than target_freeze_duration (a GUC value). If it
is, that page was unfrozen too soon. So, my use case is to produce a
yes or no answer. It doesn't matter very much how accurate I am if I
am wrong. I count the page as having been unfrozen too soon or I
don't. So, it seems I care about the accuracy of data from now until
now - target_freeze_duration + margin of error a lot and data before
that not at all. While it is true that if I'm wrong about a page that
was older but near the cutoff, that might be better than being wrong
about a very recent page, it is still wrong.

Yeah. But isn't that a bit backwards? The decision can be wrong because
the estimate was too off, or maybe it was spot on and we still made a
wrong decision. That's what happens with heuristics.

I think a natural expectation is that the quality of the answers
correlates with the accuracy of the data / estimates. With accurate
results (say we keep a perfect history, with no loss of precision for
older data) we should be doing the right decision most of the time. If
not, it's a lost cause, IMHO. And with lower accuracy it'd get worse,
otherwise why would we need the detailed data.

But now that I think about it, I'm not entirely sure I understand what
point are you making :-(

My only point was that we really don't need to produce *any* estimate
for a value from before the cutoff. We just need to estimate if it is
before or after. So, while we need to keep enough data to get that
answer right, we don't need very old data at all. Which is different
from how I was thinking about the LSNTimeStream feature before.

On Fri, Aug 9, 2024 at 1:24 PM Tomas Vondra <tomas@vondra.me> wrote:

On 8/9/24 15:09, Melanie Plageman wrote:

Okay, so as I think about evaluating a few new algorithms, I realize
that we do need some sort of criteria. I started listing out what I
feel is "reasonable" accuracy and plotting it to see if the
relationship is linear/exponential/etc. I think it would help to get
input on what would be "reasonable" accuracy.

I thought that the following might be acceptable:
The first column is how old the value I am looking for actually is,
the second column is how off I am willing to have the algorithm tell
me it is (+/-):

1 second, 1 minute
1 minute, 10 minute
1 hour, 1 hour
1 day, 6 hours
1 week, 12 hours
1 month, 1 day
6 months, 1 week

I think the question is whether we want to make this useful for other
places and/or people, or if it's fine to tailor this specifically for
the freezing patch.

If the latter (specific to the freezing patch), I don't see why would it
matter what we think - either it works for the patch, or not.

I think the best way forward is to make it useful for the freezing
patch and then, if it seems like exposing it makes sense, we can do
that and properly document what to expect.

But if we want to make it more widely useful, I find it a bit strange
the relative accuracy *increases* for older data. I mean, we start with
relative error 6000% (60s/1s) and then we get to relative error ~4%
(1w/24w). Isn't that a bit against the earlier discussion on needing
better accuracy for recent data? Sure, the absolute accuracy is still
better (1m <<< 1w). And if this is good enough for the freezing ...

I was just writing out what seemed intuitively like something I would
be willing to tolerate as a user. But you are right that doesn't make
sense -- for the accuracy to go up for older data. I just think being
months off for any estimate seems bad no matter how old the data is --
which is probably why I felt like 1 week of accuracy for data 6 months
old seemed like a reasonable tolerance. But, perhaps that isn't
useful.

Also, I realized that my "calculate the error area" method strongly
favors keeping older data. Once you drop a point, the area between the
two remaining points (on either side of it) will be larger because the
distance between them is greater with the dropped point. So there is a
strong bias toward dropping newer data. That seems bad.

Column 1 over column 2 produces a line like in the attached pic. I'd
be interested in others' opinions of error tolerance.

I don't understand what the axes on the chart are :-( Does "A over B"
mean A is x-axis or y-axis?

Yea, that was confusing. A over B is the y axis so you could see the
ratios and x is just their position in the array (so meaningless).
Attached is A on the x axis and B on the y axis.

- Melanie

Attachments:

example.pngimage/png; name=example.pngDownload
�PNG


IHDR��5���9tEXtSoftwareMatplotlib version3.9.0, https://matplotlib.org/4H�	pHYsaa�?�iUIDATx���yX�u����9�
������IT�TSYM�Z�k���X�e�T������4�L6�{S��i��V����n�D�s>�?�z�Q���
�����uy��}����p^����e�1����:j��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<��P<���Z�.s�\:r��5j$///���
0�(??_������(����#����:�T�V���a	
�Eh�����Y�T���Pdd��u�Q/���}���(��1���[���7��x
 ���x
 ���x
 ���x
 ���x
 ����Z�cu�z������u�]w�i��
P�=������c4a��h�BJHH���������6l��������O�O�.7�i�&���_�������K/���,,P�������=zh��%��W$�ys5���u�D��Q��*-��N����_.???}����m�^}�U5n��=��K/�����;�������A�JLLTQQ�{f��a��u��/_�E�i��5z��������zEGG+55U/����}�Y�������k�j��!���������j�����eK����c���R���w�jGV��|���H���B���3W\q���w�\&<<������m����n��9s�c���m��d���;��_|a��������1�L�>�4n���{�N�:��4�$%%�{���8��?���Y�'//�H2yyy���tQ�yhv����D�[d����M^aI�?���T�;�.T��}u�w(44T}���{�����o�>eee)!!��-88XqqqJII�$���($$D}��u�$$$���[���w�\y����l����D���S�N�r���y���}��d����b9�r7p�vg���i�j��#�����I]4����Y�^���w�^��1C:t��e�4b�=������%IYYY�����r�s����Rhhh�����j��I��s����k3?��,?7y�d�o�����-������y����sZaAv�}�R���������Vo�V��\.�����_|Q���Om��E�����^�Oe����k��1�����*.s��E���u$I��o�7���f
�'�����-Z�k����u��E�������pIRvvv����l����p�����_VV��'O��9�1~��6������sv�]AAA�n���<�;�Iq����i��1��WC��^~����sg�m�v�Rtt�$�M�6
��+�����_���xIR||�rss�����Y�r�\.�����3k��Qii�{f�������������r�sv���T$�z+wd��o�M������Dc��$o>��1UyF��
����y��������Y�L``������3S�L1!!!�?�����i����[L�6mLaa�{��n0}��1���7�|������2d�{nn�	3w�}���e��;w�	4����3�~�����5������}��8q����3�7o�T���YDT\��e^^��}���S�1�N����~S��c>��s��{wc��M�������[n���2�<��	3v��\{��f����fN�8a�b6lh�������k�����l���\q��n���-[�)S��"����M����f3��u3�/�t���_ *&�Qd����.>�l�J�,����1^�p���p8���<��_�a�I������b�|4�����W�eyx������2�������Kw��2��P3��Q��FVG�x@P��
K5v�F}�����1�w�^���mT���?P�����Yi:p��l>��pSW
���;�"@Pe�}��g��U%e.�	���b��U����3@p�
K����-Z�zH�tM�P�6��Bm'��P�E�w�@#>J���|y{I�]�I#�j'o~�s�E���G5��M:]\�f
mzkH]�����p@Pi�N��|�C�f�$�_�&z{h��[�A���W���iJ=pJ���+�jlb'��x[�E����zdn�N�����^�����[�D��rM�j�^��.#um�w�(�i���P�o:UP���2�z�1I��~��xS7���X��~Uz�)%�J���"��y���=���VV��E��_0���)���m*u�i�@����K� ���
P@9���4����|�I��=����{������PU(��mWv�F|������K�o��?^�Z^^\��>�I�g��5���*,u*<�_���Qlt�c�P�pE�N=�h�f���$���Lo���
�'Cu����<������p����������� o>���(�x���5f�F���q��^�[��ju,�
 ����k�wi��$I�#C4mX�Z�X�5����/��s��n�II������|c�|�-N��D�C��{B�IWN~��|��?���{FX�P�c����zi�N9]F�j��X�mhu4X�@=�WX��l��m���[����vW��
�������r8O#g�)���|�����4�_$W�����y����[UR�R���1,V=Z[
��z�����?����$]�9T�
���@?���6�PO�=vZ#g�iGV���������+����z�g(��K6�o���25kh��C�(�]S�c���P��:]������}��~m�h��>

��8j3
 u���B������$I^�N�_�Q�>\���@���czdn�N�����^�[�u
�:�
 u��e���=zc�.#uo��Cc�4��h�C(��'J4z^���:&I�/Ro�&?�����P�g�R��4�+������C�meu,�Q@j1c�>\�_/,��R�Q�f
4��u�:�0
 ����2=��MZ���$������=����z��P��ve����R��X�|��������[����z��Q�e>M?�?�E��N�����1��nlu,�#@j��R���h�f���$���Lo���
�'C}C�8x��F�J���y���������|���U�����-[c�g�QT���~z��>��cs�c��`�2�K�.���~�$��
���1�	�8��*�Z���>+//�r���;��)99YM�6U��
u���+;;��1233������@���j���*+++7�j�*����n��}���9s�/�L�6M�[���������a��r�+�����_�����]������=O�C���(I��u���G��o������G����h���:r��n��6�~�����$���h������5s�LM�0�=�o�>%%%����VFF�F�����_��-s���7Oc���������^�z)11Q999�@uY������F���T������7u����_��s3Uh����W�^�����k��������o�n$���c�1K�,1���&++�=3c�d����1�<���[�n��=x�`�������_?������t:MDD��<yr��TD^^��d���*���r�\f��=����&z�"s�k����|�cy^�����j���[j����
���OeOMMUii����;wVTT�RRR$I)))��������3���r8��u�{���8;s�%%%JMM-7�������LE�P��
K�?�L��/v��2��OK}�|��5ohu4x�*=	$..N3g�T�N�t��QM�4I�����-[���%������r�	SVV�$)++�\�;������q8*,,��S��t:�9�c��1���\���U\\���p8��;���p�F�J������x����iH�H���Ti0`���={�T\\����5�|��oj�<y�&M�du@a����j���*)s)�I���U�V�VG�����6
		Q���g�������D����f���.I
����g��o&((Hj���|||�9��c�/���?^yyy����+��8�%N=�`���Y%e.%t	��Q�)����>}Z?���Z�h���X���i����;w�Tff����%I�����ys��u�/_��� u���=��c��9{������r3.�K+V�p�T$����v���s{����i����������Y���W��~VG~T�g�<��cf��Uf��}��o�5			�Y�f&''�c��>h������+���o���M||���eee�{������7f����y��f������{����@3v�X�}�v3m�4���c�.]���;w����f���f��m��0!!!��.>_���,"��-�x�t���D�[db�[nR~8nu$����Ti<x�i�����l�e��f���f��=�����f����q��&00��z����������~3`�`�5kf{�1SZZZn����2�{�66���m��|�������o���(c��L�~���u����H���/���R�yv�=n����zg���+�:���oc��1��� �.�����`����q0x�#��J�����\I�����c�u��?��6���kpQ��:���2t��DA��zmPo%t
;�Q�N���+w���e���e���UT�@���E��N������z�qI���(M�}W���X��
 ���yJ���t4�H�~�z���-�����J�P��\�_/,��2�Q�f
4��Xu
odu4��(��G~Q����f-�|T������z{O5��2�������Y��(M{����KO��E�/k-///������$�����f������b���X�E��3E�NM�|��l��$���Lo��GM�,NT
 ?�y��F�J��#yyI����Q����7�������o��c�3�(*S�6�ygo�����X@��<^���W���wV� I��
���1�	�8P=(����(��9�����$������:���mq2��P+��zhN���.VC��^�CO�������jGx���ok���e;�2R��F�>,Fm�7�:P#(���w�T�-����H�n�i��P����d@��<��Cy1+U�N�������M�/����8@@�g���
���p�J�.E5	��a1��2��h�%(��z�LI���t�>I?,I��k�^�����,NX���~8vZ#>J�������������m�����m:�qoRA�S��5uH��mju,�V����2�^\�]3���$]�����G����
�"@@�q8�P����q0W�4�w�4��������OQ���]�4zn�N�)U���^�[�v	�:P+Qu��e����z{�n#�h���b�$��h@�E�Y'Nk��}���$iX\���}W��qU��PuR���J���,G��|��m�uk�VV��
 �N1������%�U�2j�����+V�Y
�3(��:#��T���IK6gI�~������S
������@��#���i����x�����'>��z���>N=��?���R�"��5mX��D5�:PgQ�VQ�S�.�����$]�����[��,N�m@@�t�D�F�J��#yyI�&t��������|��E�:_n��c6*��LM����}tE�fV��
 ��(s���������������:��ZX��_(��Z!�Q�Q���a�II��W�������mq2���,����zxN���.VC��^�CO
����X@�EX��2������N���9��f��6�X
��(�K��)����rG�$�����-�`��8P�Q5n��\����C�
e���s�t��K"��x
 ��c4k}����6�8]�n���b�-"��h�G�j���2�����,��$���az��^
��8�y(��j�'��F|���9�����'o�������W��@T�����o���B�5uh���ibu,��Q�����o��)$I�m���!}�����d����O�2E^^^=z�{[QQ������iS5l�P��~�����=.33SIII
Thh�������r3�V�RLL��v���o��3g����M����[���_qqq��aC����0�s5�o)���|u;���~�?������w��o��z��Yn���>��?�\,����u���v�m��N�SIII*))���k���j����0a�{f��}JJJ��W_����=Z����-[���7o�����'*--M�z�Rbb�rrr*�paV��Q�[_+�`�������Wc;��K����b����C�f�������2�<��1����\���g,X����}��dRRR�1�,Y��x{{���,���3LPP�)..6���O�n���{������D��~�����d�}��i"""����+��"����$���W��@}U�t�W��iZ?��D�[dnz�k�y���X�/��mL��w,99YIIIJHH(�=55U�����w��YQQQJII�$����G�
s�$&&��ph������;11�}�����������VBB�{�"Y����X���
 �8]������V��1�]�Fi����lhu4�P�'���;Wiii����~�/++K6�M!!!�����)++�=���wv��}�5�p8TXX�S�N��t�sf���r.�'O��I�~u?x��'�<+]Y�"��h�m=4�OK�c�
U������#�h��Y�������*���W^^��v��A�#�e�1z�����u�r�]�Z8�r�PT�;�������QLL�{�����5k4u�T-[�L%%%���-��[vv����%I����8[����?��������


R@@�|||���s����|Y��n��n�6�/*�o�[~����^�r[5������������k�y�fedd�o}����a��������+����s�233/I�����������|�r�k���������c�l6�����q�\Z�b�{&66��Y����C7O�V_l������rK7�ugo�P�T���F��{����5h�@M�6uo����4f�5i�DAAAz����K/�T�t����k��������K/)++KO?��������=����:u��x�	������+5�|-^����c��������o_���Oo���


t���J�������K�?��?���2�Z�h�����:�J������������������b%&&j�����>>>Z�h�F����x5h�@���_���L�6m�x�b=���z��7��U+����JLLt�<X����	������{k����N9_��)*u���[5�����w����A������d.��1�X��r8
V^^������������(M��:��%�I����������h����k~���Yz|�F���i����������X.��N�^^�S���+I��XS��(<�~��/��P�d;����tm�R�t�m4n@g�q-_������p\�I���%jd���w��
�[X@��r�X��^�r�\F��H3��U�f
���P����)����r��?����Vzn`w���X�@u���t(W#>J���B�}���-�5��H�c�f@�@�}�>S�}�M%N���j��u��:�@SP\��>���2�H�����;z)����dj
<���|��(M�sN���KO��Y��o#//��x
 x�����o���B�5mX�.i���X,@�z����o�?SH�.k�To��G��-N�*@���:�����x0W���5�5:��|����d@������G�e(�L���������s����@�g�.�7��Ko��#I��*X����U�@���-(�P�?]�G����='$I��G���.��rU������O*yv���
��h�m=tK��V�PQ��3�������/v��2j�P3���CX#����(�P�9�J5v�F-��-I��W�&��C
�|y���
u��#�����'�����gn����������@4���z��-*.s�eH���Q���c�#(�P��:5�?[4��C���;5�k�z�q����%@�#�/��Yi�~�!o/���;i�U���U=T���[�4v�F���YC�����.k���X�(
 �b�N�^Z�C�}�O�tI��z{H����-N�.�@-��W�Q�����S���l�������mq2uj�o��#s�u�t��}���tC�p�c�'(�P��\F�W��k�w�e�.-�4cX�Z7k`u4�j�S%zt~�V�<&I�7R�n�&?���o(�Pd�U��4�-���[�
��A}#��������1�h��e�6�:�Z7
��a��du4�,RP\���l���G$I7t�Kw�T������w@����|����=9�����'t�}W���W�P�(�P���qXO�{�
K�
�k���m���X<jHq�S�/���; I��}S�yg5kh�8OC�p��%�N��Cy����i�G:����|�<
 T��;������+,UH��^�[Ww
�:F�j�t��|��~�G��+2D����eH���x:
 T�c��zdn���pB�4<>ZN�"�/W�`=
 T�
�Nj��4��+���)�����"��n@�"����^�u�N9]FBj�]1j���hP�@^a��.��/�eK�������P��/�j�2�E�z$O#g����3��xk�M]5,.��z��(�p�}��g��U%e.�	���b��U����7yW��f����={*((HAAA����_|��_TT���d5m�T
6��������r����TRR���c��������U�#������k�����2m�4�n�Z��������
���H�5�%N�]�Q���Y%e.]�9T�����N����U+M�2E�������u�5���[n���[%I�>��>��s-X�@�W���#Gt�m���t:�������]�V~��f���	&�g������$]}���������u���k��e��y��i��1�8q������W/%&&*''�=s�,�k�/�������C����&v����UH���hP1��5n������&77�������m���H2)))�c�,Yb���MVV�{f��&((�c�y��'L�n��=����Mbb��~�~�Lrr�����4f����S�,���g$����
?@��d��m�R=n��}�K���cVGPI�~S���������sUPP���x������T			����;+**J)))�������Caaa����D9���)))��qv��1JJJ���Zn���[			���d��*u����m1+M�����u-~��.k���hPiU~�������"5l�P�~���v�����l6�����SVV�$)++�\�;������q8*,,��S��t:�9�c��1���\���U\\���p8����>��+R��4�8%I���m56��|}����P���v��I���������k���U�4��<y�&M�du5�������t�((Q#_�rG/%v�:\�*����fS������'�W�^z��7�����������Vx��_L���q&�����	

R@@��5k&�s�����r.���W^^��v�������s\.��V����X�%��"H�����^���/\.����+???�X���o������T||�$)>>^�7o.w�������]��g~z��3g�a����ri�����d9����7go��S%�w�wzm�.#
��OF^���
��U�J??~����(���k���Z�j��-[���`�w�}3f��4i��� =��C������^*I������kW�}��z���������~Z��������|PS�N�O<�?���Z�r��������s�3F��W��}��_?���*((����+I��3�g�R��4�+���[/��C�meu,�ZUyJ���Gml6�i������k��_~��_XXhF�i7nl����j�=Z����70�Y�f���3����f���+��woc��L��m�|��,o������26�������[�����d9N#���ef~�����b=n�����f��m����xc��%��r8
V^^u���2��d�>�xD�4�{�����
���8����7���ve�k�G���X�|��4��.�������eu4�6@������f�:��i��(6������Qx��R��[�M��gJ��h�Lo��[��-N5���<yF#g�i��<yyI]�A�\�A>�|��sPx��;������+,UH����[��ju,�q@�^������K���A��+2D����eH����@����b=<']){OH���e������[�B�Z�������F�NSN~��|4�����W����r@��1F�}�W]�SN�Q����>,V�CZ
j
 �z%��Tcl����%I{G���z(���;8�����-��4rV�2O����[o��������@�g�������lUI�K�h��X�hlu4��(��������}�zH�tm�P�:��Bm'������{��F�J���|y{I�'v��W��7W���DP'-�|TO|�I������]o���vM��u@�R�ti�;��o�I���i��C�(4���dPwP�G�
5jv�R��$����{}'��pU�
 �:��������,(Q#_�zG/]�-��XP'Q�j.���+����d��-"H3��*�i��������N�h����uL�4�_�&��M�~>'����VJ�<��Yi:�W$?o=?������XP/P�*�}�v�^X�]�N�6�h�]1�du4�7(�j���ez����h�QI��=����{�������~��ve����R��X�|��������[����z@U��������O��������:4F�������e�J����m��>S���C3�1���6�[��7
 K<yF#g�i��<yyI_�A_�A>�|������w[�������L�����}tU��V��APc��.��|�f��A��'*D���("$��d�Y(�jDN~����u{OJ�������"�������PT��{Oh��t�/V�������}��c����6��m�^��l��.��a
5��X�k���h��(��E^a���Q���-I��OK=kw���V�+1�*��p�F�J������x����iH�H���@�1�h�w5q�V����$@����G�`��~��J�8��g��I�aIRB�P�zGo�Y��s@m����Q�vf���Kz���z�[ysU��(�.��MG5���t��L��5uh]������������4����������6M���>

��6��(�*�Hn��g�)=3W�4�w���u���U=�.���5��i���,(Q���^�[	]����
 �
q���^�[o��-c��-�4}h���Z
PI@�u��D��M����K���Ei��������d�A���2O)yV������[/��C����:�"P��1F3�������e��Y��+V��Y
p�������'��K.Q�F���j����f��������M��a���������]n&33SIII
Thh�������r3�V�RLL��v���o��3g�"��i���uk���+..N6l�t���j��tM�|��\FI=[h�CWP������z�j%''k��uZ�|�JKKu�������=�������?���z�j9rD��v�{���TRR�JJJ�v�Z}����9s�&L�����o����t��W+##C�G������e���g����1c�h���JKKS�^��������
g<��,�n���o>*?/={SWM�G
�|`���F999F�Y�z�1����\���g,X����}��dRRR�1�,Y��x{{���,���3LPP�)..6���O�n���{������D��~�����d�}��i"""����+��|����$���W�y���w�A���%&z�"��M���VG�*���1��S[���$IM�4�$������T			����;+**J)))�������Caa��s��p8�u�V��O�qv��1JJJ���Zn���[			���d����b9�r7�>(*uj�'�5f�F����C3-z��b�[
P
���\.�=Z�_~��w�.I�����fSHHH����0eee�g~Z���?���f�
u��q9��s������s�'OVpp��Y��
���<qF��X�92��%=��Q3���&
lVGT�j+������e����[]OQ������<�����VG.��m����_k���4�����G:�����h�jT-��=j�(-Z�Hk��Q�V�������URR����r��egg+<<�=���u��������������� ���G>>>����1�����v��v{%~'������+_��;��$�D�h��E�X�P��@c�F��O?�T+W�T�6m���������V�X���s�Neff*>>^����7�;[w���


R��]�3?=�������l���-7�r��b�
�LE��Q��HC�_�.����>O�OR�g��1��U�V��G��og��q�<���&**��\��|���&>>������������������ddd��K����������g���k���c������i�����Y�t�{f����n���3g�m���x�����;��|Y����P���s��>��D�[d�MXjo:bu$�q�~S�P�9o|��{�����9�4n���[o��=z��q���o`L�f��c�=fJKK��|��W�w���f���m��{���~�mel6�����Y�n]����[�����t��_�1m�\d��-2���6?��[,���1^�c���u���Ppp����du����������������za`�|,N����k���Cy1+U�N�������M�/���g��'���1F�7dj��m*q��$P����{�`��j
 P��))���n�'��%I�u
�+w�Rp������G~8vZ#>J�������������m��P�'m:�qoRA�S��5uH��mju,@-D���2�^\�]3���$]�����G����
��(�@v8�P����q0W�4�w���u��Sm���@��Z���F�M��3�
���k�z+�k���u�c�.�7W���+w��{� ���&�VG�@�9q�X��e����%I�����������z*���N*yV��E
����v�m1�����(�@-g��?����K���e��y��N������(�@-�_T�q���%��$II=[����TC;�t�W�����������x��|��tRW��U=��B����mVQ�K���:,F1Q����'(�@-RT����j�w%IWvl�7�V�6���
 PK8Q�������C^^��	5������#_@�����[������/*S�6�ygo�����X�z�X������v�ok�J�b�B4mX�ZX�P�Q��8�4jN�6�;)I���6zr@g��x[�P�Q��pB�I����jh�����=ZX�!(�@
r���Y��^Y�S.#uo���b��yC��<�!ygJ5f~�V���$��J�����������M�r5rV��*���[���M��FrU�%(�@52�h��LMZ�M%N���j��uolu4����LI���t�>M?,I��k�^�����,N�t@���9���R�+��|��4��N���m��P+P�*���#z���TP�Th#���Q�6M����"%e.��d�f��/I�o�To���F���g(�@8�[��Yi�8�+IJ���M�(_����(��EZ�3G��e(�L������^��s����U@�9]Fo����W��1R�V��64F�M���o����b=27C��9.I���(=�����rU@�G*)��I%�JW��H~>�|[
����XT� c����>M�b��\F��7�;w��CX#��P)@���J�������,I�M�"4��j`������8��G9+M�����K������4��z�,
 �|PO�E�e.�	��a1�bu,.
8��R��]�Us�;(I�]��z}Po5n`�8�����Q��u��K��Q�W���7��
 ���f���_T��
lz��>��C3�cP�(���R�K//��w���$��n��Cc�oq2�/�Q��f�k�������h�q:�����dT
 <�����9�:~�D���z�����{�cP���-�5k����nRDD������g���o���	��E(!!A�w�.7s��I
6LAAA
		�}�����O����i����/EFF���^�E��s�����W�=�d��JgA��rM�j��z���.Q��FZ���?�G��XPP�^�zi��i����K/�����;�������A�JLLTQQ�{f��a��u��/_�E�i��5z��������zEGG+55U/����}�Y�������k�j��!���������j�����eK���~�=S�����^^�S.#��J�%_�6�X
��a��$�������\.n^~�e����\c����9s�1�l���H2�}��{��/�0^^^�����c�O�n7nl����3���3�:ur�4h�IJJ*�'..���O�p�����3�L^^^�km<x�\6y����t|j���!��H����15�]����SVV���������IRJJ�BBB��o_�LBB�����~�z���W^)���~ obb�v���S�N�g~�<gg�>OE��~1��_��3Rt8�P�M����4��H��P�j�$���,IRXXX��aaa�}YYY


-����WM�4)7��M�_�����+++���s�,�R\\���b�}���+FmQP\��>���2�H�����;z)����dX���+a����4i��1P	{r�5��4��9-o/=yCg���������s��G���������r������������SnYY�N�<Yn�\���s���O��/���?^yyy������jXi��#�y����sZ���������+�R���
`�6m�+V��9�_�^�������x���*55�=�r�J�\.����g��Y���R�������S'5n��=���9;s�y*��\�v������P��95�?[���t�)q��vM������u��P+T�G��O���={��������5i�DQQQ=z���yu��Am���3�<���
8P���K�p�
�������;*--��Q�t��w*""B�4t�PM�4I��w����-[���7������~�GyDW]u�^}�U%%%i��������?*�����YP7:uF�����`�$��k�ktBG�x��nUyJ�W_}e$��6|�pc��?~��g�1aaa�n��k��������1N�8a�b6lh�������k�����l���\q��n���-[�)S��"����M����f3��u3�/.��"Y����k��;�M�I�L��E��������VG�B�~�e�1��:��p(88Xyyy|l!��������+|��W�`M�V�-N��x��,`�qs���Z��	I�=��z*����>'�����),q��MG����t(O�h����z���--N@�GD���x�f�;����W��Y�6o��g%_�^��7�8!u���e�rG�������:����q���4Zw��R��vP�PQ+�/���j��L�-�$yyI���\w�G�����h.��1F�N�_�h���*u�x�z�@?
�$R��E+�)g�p�(��\Aq�>�8�����|����!���h%�l!?����Pa�=9��h]���zH��e�$?o������4Z=Z[�����U�ti��l�+��R��poo�4�O��Tp���	��(����"����92��(�$y{I�v	�=����]3ysR5��jc�Q���h�-��-����:�5���K�4$.J-C,N	�����9�J�i�a�k���9��~I���;��n�.����	�l@T��G����,����8%�x��[��xRG��y�mj
 .JI�K_l9����w�O���m�{��uk��j��I�&@\���������}wP�O�H�|����-\w]�K�6��'uPQQa.��7{��_�h��l��9
�kH�(
��� kC��������jA�A�Z��}���/k�Tw_���a�����

 ~��Cy����Z����J]��Fv_��Jw]����,N.��:�x�Q�s�m<����9����o�[zG����6�e��C��y��f�?�����3��$?/������4Z���9��z����.���r���Z��������2$@C��4��H5kh�6$�r@t�t��H����S���Wvl��/��5�C��uy��(�������(��m:���'u�iP�V���X��
`=w��L3��_�h��{{����;>Z7�������	@M��SE�N��|�fo�T~Q�$�����{E��K��+2����2�z�Hn����Tm>�'I�j��.����j��fq:`5
`=�����Q���.Q�@?M�����&oN���X��Z@�.��R�Q��Fz����lhu,P�P���2�&}�U��gJ��z�����@���hu���b��������������4�w��j�U�:l��\��_�:�W�Fv_�9����fu,P�Q����k��7��������{��]��V�u��q����t��]�W�tM�P�qgo��Y���:$�L�����w�$%_�Nc���u{@�P��]����~�'�(��G/��S��au,PQ��e[�4f^�
J�j�w��U��`�c�:�X��m����I��m�h��5mh�8��(������u�����]�ZO%u������@]G����R.#�m�go�fuPO�vR-������KZ7�8	�O(������$I��4�8	�O(�����2m=�'I��;���Pk���\���Z��U����z�XKm�������������:�Z���������U����i���uk���+..N6l�:���.�e�X/i�������p��y3f�&N����4���K�������4��#�)q*��WCY��?]_{�5��������^u��U���������\��:$I�������U�c`II�RSS����������������1���r8�n�!�L�$�YC[�x6�-������TXXX��aaa���:�c&O����`�-22�Z��:%I~>�r|���p%�?^c��q�w8�R���\��}��EP��c`�f��������r����~����v���j��X����P=<�#`�����X�X�����ri������0@���w%i��1>|�����~����7�PAA����^��T�.����c�4a�eee�w��Z�t�/N�O��1��u���Ppp�����	��~{��x*
 ���x
 ���x
 ���x
 ���x����:{��aqPQg_�=�bh�����/I����8	����|[�\�"�\.9rD�5���W���p(22R���)d��k�<a��g��5��1F���������g~7�^ooo�j��Z�#((���>�5���~��5J��N���<����<��x0
 ����Rv�]'N��n�:J�a��k�<a��g��5�"8	��� ���x
 ���x
�E�M����[���_qqq��a�o�/X�@�;w����z���%K��P��S�u���{����7n���+!!���/�Ae�,��;w����4p���
X*����\%''�E�������c��;[�5��������G}TEEE5�����Y��n�I����g�}v���Z�J111���j���f��Y�9/Fe���'�����S������x-[��f�^��s<��o�����z��]m��������XO=�����e����uk������u������1c4q�D����W�^JLLTNN�9���]�!C�����Szz���j��-5��r*��U�Vi��!����������H]��:|�p
'������������W���k(����KJJt�u�i��������s�N���{j��e
'����q���z��'5q�Dm��]���5o�<������WPP�^�zi��i���o����t��W+##C�G�����_�Re��f�]w�uZ�d�RSSu��W���nRzzz5'�p�]�Y������{t���VS��u!�4h�V�X�������s�����N�:Uc�:�������$''��;�Na&O�|��A�����r���������j�y�*���+++3�52~�auE�h����2s�e�����>��r�-5���Uv�3f�0m��5%%%5��Uv������k�)�m��1���/���UE����Os��'�0��u+�m���&11��U����\�v�j&M�T���Ae�8x�`���O��'�^�zUk��V�u~��&88��8q�fB��X�JJJ�������6ooo%$$(%%���III)7/I����:_\�:���3*--U�&M�+�E��5��/Qhh�������yQ.d�.T||����������_�������r!k����������x���Z�d�n����\������r����_k��\�>�@{�������Rm.\��}����^R��-��cG=���*,,�:Z��kuOs��q9�N������;v��1YYY���������B��s���SDD�/^�j�Y�7�|������������B��w�^�\�R��
��%K�g��9R������B�8t�P?~\W\q��1*++��>X�?��_���p8TXX�����U�W^yE�O���A���Rev���'�|R_��|}��K���{��7����_�~���?��#G��������W+� j�)S�h������O���ou�*�������[�����5kfu�j�r��w�}W���<x��z�)���;VG�2�V���/�����+--M�|��/^���{��h�@�g���I�4�|���Z�J8�N
:T�&MR����S�\.����4k�,���O7�x�^{�5}�����+��j�f�����G�����ggg+<<���	��|mp!�<��W^��)S����W={������k�����~�t�M�m.�K������;w�]�v���.���E��������{[�.]�������l�j�\Y��g�yFw�}����~IR�=TPP�x@O=����������NPPP�{�o��������`��Z�������������t�5J��_s�1�����_~�k�����U�E�j���������t�"c�:�:X��v��_�������X�X�����ri�����?�c�����K����u�6��uJ�K/����{NK�.U��}k"���;w����7+##�}�����gYFFF�d�
��?��/�\{��q�[I��k�Z�hQ���tak<s��/J���k��������1g��{���3g������S����~�5��T�N�������8�#V��/�\G�������v��%ooo�j���d�����x��s���nf��i�m�fx�b����1��}����'�t�����������+f���f�������l����%THe�9e�c����l�=�����[������.�\�5fff�F��Q�F��;w�E����P����[�����'N�h5jd���c���k���K��];3h� ��p^���&==����I���^3�������c�|�Is��w�����k���c������i�����Y�t�UK8���q��Y����L�6������\��p^�]������+����|��U+��?��l����^��t������V-���Z����6QQQ�f��~���u����]u�Uf��������o:v�hl6�����Y�xq
'�0�Ygtt�������k>x%T�����B4��k\�v����3v���m����������SWNe�XZZj�}�Y��];���o"##���#��S�j>x}��W���uv]��7W]u�/��woc��L��m�|P��+��k����~s�6��?���+�B��}�v���`L�V���1c��3gj>|�eL=����x
 ���x
 ���x
 ���x
 �U��Y��n�I����g�}V��?���������A�����Z���@�z���i�.���?���=Z���kW�q�U�����Ze��z���u����sqq��q�l�R
4P\\�V�Z����aC����o�����m�����ZA�Gu��Q������s�j��M���;t�
7h����������cG�������^@Pgdff��>�����k�N�?�����
}����/**��Y�x��g|�PQ�7o���T���m/..V��M1����*??_�����u��O�����RSS���Sn_��
1������������j*b�@uF�>}�t:���s�����o����+-\�����@P��>}Z{��q���o�222��Iu��Q��
�=���W_}U}����c��b�
���SIII�����P�-4`�+�Q�yc��!�Z�j�����_l>|�f�����R=�����?�����Y�f���K5i�$���C��r��{��G/��BM/���x~���x
 ���x
 ���x
 ���x
 ���x
 ���x
 ����lB���_{IEND�B`�
#33Robert Haas
robertmhaas@gmail.com
In reply to: Melanie Plageman (#29)
Re: Add LSN <-> time conversion functionality

On Fri, Aug 9, 2024 at 11:48 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:

In the adaptive freezing code, I use the time stream to answer a yes
or no question. I translate a time in the past (now -
target_freeze_duration) to an LSN so that I can determine if a page
that is being modified for the first time after having been frozen has
been modified sooner than target_freeze_duration (a GUC value). If it
is, that page was unfrozen too soon. So, my use case is to produce a
yes or no answer. It doesn't matter very much how accurate I am if I
am wrong. I count the page as having been unfrozen too soon or I
don't. So, it seems I care about the accuracy of data from now until
now - target_freeze_duration + margin of error a lot and data before
that not at all. While it is true that if I'm wrong about a page that
was older but near the cutoff, that might be better than being wrong
about a very recent page, it is still wrong.

I don't really think this is the right way to think about it.

First, you'd really like target_freeze_duration to be something that
can be changed at runtime, but the data structure that you use for the
LSN-time mapping has to be sized at startup time and can't change
thereafter. So I think you should try to design the LSN-time mapping
structure so that it is fixed size -- i.e. independent of the value of
target_freeze_duration -- but capable of producing sufficiently
correct answers for all reasonable values of target_freeze_duration.
Then the user can change the value to whatever they like without a
restart, and still get reasonable behavior. Meanwhile, you don't have
to deal with a variable-size data structure. Woohoo!

Second, I guess I'm a bit confused about the statement that "It
doesn't matter very much how accurate I am if I am wrong." What does
that really mean? We're going to look at the LSN of a page that we're
thinking about freezing and use that to estimate the time since the
page was last modified and use that to guess whether the page is
likely to be modified again soon and then use that to decide whether
to freeze. Even if we always estimated the time since last
modification perfectly, we could still be wrong about what that means
for the future. And we won't estimate the last modification time
perfectly in all cases, because even if we make perfect decisions
about which data points to throw away, we're still going to use linear
interpolation in between those points, and that can be wrong. And I
think it's pretty much impossible to make flawless decisions about
which points to throw away, too.

But the point is that we just need to be close enough. If
target_freeze_duration=10m and our page age estimates are off by an
average of 10s, we will still make the correct decision about whether
to freeze most of the time, but if they are off by an average of 1m,
we'll be wrong more often, and if they're off by an average of 10m,
we'll be wrong way more often. When target_freeze_duration=2h, it's
not nearly so bad to be off by 10m. The probability that a page will
be modified again soon when it hasn't been modified in the last 1h54m
is probably not that different from the probability when it hasn't
been modified in 2h4m, but the probability of a page being modified
again soon when it hasn't been modified in the last 4m could well be
quite different from when it hasn't been modified in the last 14m. So
it's completely reasonable, IMHO, to set things up so that you have
higher accuracy for newer LSNs.

I feel like you're making this a lot harder than it needs to be.
Actually, I think this is a hard problem in terms of where to actually
store the data -- as Tomas said, pgstat doesn't seem quite right, and
it's not clear to me what is actually right. But in terms of actually
what to do with the data structure, some kind of exponential thinning
of the data seems like the obvious thing to do. Tomas suggested a
version of that and I suggested a version of that and you could pick
either one or do something of your own, but I don't really feel like
we need or want an original algorithm here. It seems better to just do
stuff we know works, and whose characteristics we can easily predict.
The only area where I feel like we might want some algorithmic
innovation is in terms of eliding redundant measurements when things
aren't really changing.

But even that seems pretty optional. If you don't do that, and the
system sits there idle for a long time, you will have a needlessly
inaccurate idea of how old the pages are compared to what you could
have had. But also, they will all still look old so you'll still
freeze them so you win. The end.

--
Robert Haas
EDB: http://www.enterprisedb.com