wal stats questions

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Andres Freund (#1)

Re: wal stats questions

On 2021/03/25 8:22, Andres Freund wrote:

Hi,

I got a few questions about the wal stats while working on the shmem
stats patch:

Thanks for your reviews.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats? My understanding is as same as your assumption. For example,
pg_stat_statements.c use pgWalUsage and calculate the diff.

But, because the stats may be sent after the transaction is finished, it
doesn't seem to lead wrong stats if pgWalUsage = 0 is called. So, I agree your
suggestion.

If the extension wants to know the walusage diff across two transactions,
it may lead to wrong stats, but I think it won't happen.

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

To control the transmission interval for the wal writer because it may send
the stats too frequency, and to omit to calculate the generated wal stats
because it doesn't generate wal records. But, now I think it's better to merge
them.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

I understood that for backends, this may leads performance degradation and
this problem is not only for the WalStats but also SLRUStats.

I wondered the degradation is so big because pgstat_report_stat() checks if at
least PGSTAT_STAT_INTERVAL msec is passed since it last sent before check the
diff. If my understanding is correct, to get timestamp is more expensive.

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

(I'm not confidence my understanding of your comment is right.)
You mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() is already checked it.

Generally the various patches seems to to have a lot of the boilerplate
style comments (like "Prepare and send the message"), but very little in
the way of explaining the design.

Sorry for that. I'll be careful.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

andres@anarazel.de

almost 5 years ago

In reply to: Masahiro Ikeda (#2)

Re: wal stats questions

Hi,

On 2021-03-25 10:51:56 +0900, Masahiro Ikeda wrote:

On 2021/03/25 8:22, Andres Freund wrote:

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats?

Yes. At least there should be a comment explaining why it's done the way
it is.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

I understood that for backends, this may leads performance degradation and
this problem is not only for the WalStats but also SLRUStats.

I wondered the degradation is so big because pgstat_report_stat() checks if at
least PGSTAT_STAT_INTERVAL msec is passed since it last sent before check the
diff. If my understanding is correct, to get timestamp is more expensive.

Getting a timestamp is expensive, yes. But I think we need to check for
the no-pending-wal-stats *before* the clock check. See the below:

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

(I'm not confidence my understanding of your comment is right.)
You mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() is already checked it.

No, I mean that right now we might can erroneously return early
pgstat_report_wal() for normal backends in some workloads:

void
pgstat_report_stat(bool disconnect)
...
/* Don't expend a clock check if nothing to do */
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)
return;

might return if there only is pending WAL activity. This needs to check
whether there are pending WAL stats. Which in turn means that the check
should be fast.

Greetings,

Andres Freund

horikyota.ntt@gmail.com

almost 5 years ago

In reply to: Andres Freund (#3)

Re: wal stats questions

At Wed, 24 Mar 2021 21:07:26 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2021-03-25 10:51:56 +0900, Masahiro Ikeda wrote:

On 2021/03/25 8:22, Andres Freund wrote:

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats?

Yes. At least there should be a comment explaining why it's done the way
it is.

pgWalUsage was used without resetting and we (I) thought that that
behavior should be preserved. On second thought, as Andres suggested,
we can just reset pgWalUsage at sending since AFAICS no one takes
difference crossing pgstat_report_stat() calls.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

I understood that for backends, this may leads performance degradation and
this problem is not only for the WalStats but also SLRUStats.

I wondered the degradation is so big because pgstat_report_stat() checks if at
least PGSTAT_STAT_INTERVAL msec is passed since it last sent before check the
diff. If my understanding is correct, to get timestamp is more expensive.

Getting a timestamp is expensive, yes. But I think we need to check for
the no-pending-wal-stats *before* the clock check. See the below:

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

(I'm not confidence my understanding of your comment is right.)
You mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() is already checked it.

No, I mean that right now we might can erroneously return early
pgstat_report_wal() for normal backends in some workloads:

void
pgstat_report_stat(bool disconnect)
...
/* Don't expend a clock check if nothing to do */
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)
return;

might return if there only is pending WAL activity. This needs to check
whether there are pending WAL stats. Which in turn means that the check
should be fast.

Agreed that the condition is wrong. On the other hand, the counters
are incremented in XLogInsertRecord() and I think we don't want add
instructions there.

If any wal activity has been recorded, wal_records is always positive
thus we can check for wal activity just by "pgWalUsage.wal_records >
0, which should be fast enough.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Kyotaro Horiguchi (#4)

Re: wal stats questions

On 2021/03/25 16:37, Kyotaro Horiguchi wrote:

At Wed, 24 Mar 2021 21:07:26 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2021-03-25 10:51:56 +0900, Masahiro Ikeda wrote:

On 2021/03/25 8:22, Andres Freund wrote:

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats?

Yes. At least there should be a comment explaining why it's done the way
it is.

pgWalUsage was used without resetting and we (I) thought that that
behavior should be preserved. On second thought, as Andres suggested,
we can just reset pgWalUsage at sending since AFAICS no one takes
difference crossing pgstat_report_stat() calls.

Yes, I agree that we can do that since there seems no such code for now.
Also if we do that, we can check, for example "pgWalUsage.wal_records > 0"
as you suggested, to easily determine whether there is pending WAL stats or not.
Anyway I agree it's better to add comments about the design more.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

I understood that for backends, this may leads performance degradation and
this problem is not only for the WalStats but also SLRUStats.

I wondered the degradation is so big because pgstat_report_stat() checks if at
least PGSTAT_STAT_INTERVAL msec is passed since it last sent before check the
diff. If my understanding is correct, to get timestamp is more expensive.

Getting a timestamp is expensive, yes. But I think we need to check for
the no-pending-wal-stats *before* the clock check. See the below:

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

(I'm not confidence my understanding of your comment is right.)
You mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() is already checked it.

No, I mean that right now we might can erroneously return early
pgstat_report_wal() for normal backends in some workloads:

void
pgstat_report_stat(bool disconnect)
...
/* Don't expend a clock check if nothing to do */
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)
return;

might return if there only is pending WAL activity. This needs to check
whether there are pending WAL stats. Which in turn means that the check
should be fast.

Agreed that the condition is wrong. On the other hand, the counters
are incremented in XLogInsertRecord() and I think we don't want add
instructions there.

Basically yes. We should avoid that especially while WALInsertLock is being hold.
But it's not so harmful to do that after the lock is released?

If any wal activity has been recorded, wal_records is always positive
thus we can check for wal activity just by "pgWalUsage.wal_records >
0, which should be fast enough.

Maybe there is the case where a backend generates no WAL records,
but just writes them because it needs to do write-ahead-logging
when flush the table data? If yes, "pgWalUsage.wal_records > 0" is not enough.
Probably other fields also need to be checked.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

horikyota.ntt@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#5)

Re: wal stats questions

At Thu, 25 Mar 2021 19:01:23 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/03/25 16:37, Kyotaro Horiguchi wrote:

pgWalUsage was used without resetting and we (I) thought that that
behavior should be preserved. On second thought, as Andres suggested,
we can just reset pgWalUsage at sending since AFAICS no one takes
difference crossing pgstat_report_stat() calls.

Yes, I agree that we can do that since there seems no such code for
now.
Also if we do that, we can check, for example "pgWalUsage.wal_records

0"

as you suggested, to easily determine whether there is pending WAL
stats or not.
Anyway I agree it's better to add comments about the design more.

...

If any wal activity has been recorded, wal_records is always positive
thus we can check for wal activity just by "pgWalUsage.wal_records >
0, which should be fast enough.

Maybe there is the case where a backend generates no WAL records,
but just writes them because it needs to do write-ahead-logging
when flush the table data? If yes, "pgWalUsage.wal_records > 0" is not
enough.
Probably other fields also need to be checked.

(I noticed I made the discussion above unconsciously premising
pgWalUsage reset.)

I may be misunderstanding or missing something, but the only place
where pgWalUsage counters are increased is XLogInsertRecrod. That is,
pgWalUsage counts wal insertions, not writes nor flushes. AFAICS
pgWalUsage.wal_records is always incremented when other counters are
increased. Looking from another side, we should refrain from adding
counters that incrases at a different time than
pgWalUsage.wal_recrods. (That should be written as a comment there.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Kyotaro Horiguchi (#6)

Re: wal stats questions

On 2021/03/26 10:08, Kyotaro Horiguchi wrote:

At Thu, 25 Mar 2021 19:01:23 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/03/25 16:37, Kyotaro Horiguchi wrote:

pgWalUsage was used without resetting and we (I) thought that that
behavior should be preserved. On second thought, as Andres suggested,
we can just reset pgWalUsage at sending since AFAICS no one takes
difference crossing pgstat_report_stat() calls.

Yes, I agree that we can do that since there seems no such code for
now.
Also if we do that, we can check, for example "pgWalUsage.wal_records

0"

as you suggested, to easily determine whether there is pending WAL
stats or not.
Anyway I agree it's better to add comments about the design more.

...

If any wal activity has been recorded, wal_records is always positive
thus we can check for wal activity just by "pgWalUsage.wal_records >
0, which should be fast enough.

Maybe there is the case where a backend generates no WAL records,
but just writes them because it needs to do write-ahead-logging
when flush the table data? If yes, "pgWalUsage.wal_records > 0" is not
enough.
Probably other fields also need to be checked.

(I noticed I made the discussion above unconsciously premising
pgWalUsage reset.)

I may be misunderstanding or missing something, but the only place
where pgWalUsage counters are increased is XLogInsertRecrod. That is,
pgWalUsage counts wal insertions, not writes nor flushes. AFAICS

Yes. And WalStats instead of pgWalUsage includes the stats about
not only WAL insertions, but also writes and flushes.
pgstat_send_wal() checks WalStats to determine whether there are
pending WAL stats to send to the stats collector or not. That is,
the counters of not only WAL insertions but also writes and flushes
should be checked to determine whether there are pending stats or not, I think..

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

horikyota.ntt@gmail.com

almost 5 years ago

In reply to: Fujii Masao (#7)

Re: wal stats questions

At Fri, 26 Mar 2021 10:32:23 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

I may be misunderstanding or missing something, but the only place
where pgWalUsage counters are increased is XLogInsertRecrod. That is,
pgWalUsage counts wal insertions, not writes nor flushes. AFAICS

Yes. And WalStats instead of pgWalUsage includes the stats about
not only WAL insertions, but also writes and flushes.

Ugh! I was missing a very large blob.. Ok, we need additional check
for non-pgWalUsage part. Sorry.

pgstat_send_wal() checks WalStats to determine whether there are
pending WAL stats to send to the stats collector or not. That is,
the counters of not only WAL insertions but also writes and flushes
should be checked to determine whether there are pending stats or not,
I think..

I think we may have an additional flag to notify about io-stat part,
in constrast to wal-insertion part . Anyway we do additional
INSTR_TIME_SET_CURRENT when track_wal_io_timinge.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Kyotaro Horiguchi (#8)

1 attachment(s)

Re: wal stats questions

Thanks for many your suggestions!
I made the patch to handle the issues.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

I removed the unnecessary code copying pgWalUsage and just reset the
pgWalUsage after reporting the stats in pgstat_report_wal().

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

I added the comments why two functions are separated.
(But is it better to merge them?)

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

Although I added the condition which the write/sync counters are updated or
not, I haven't understood the following case yet...Sorry. I checked related
code and tested to insert large object, but I couldn't. I'll investigate more
deeply, but if you already know the function which leads the following case,
please let me know.

Maybe there is the case where a backend generates no WAL records,
but just writes them because it needs to do write-ahead-logging
when flush the table data?

Ugh! I was missing a very large blob.. Ok, we need additional check
for non-pgWalUsage part. Sorry.

Regards,

--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v1-0001-performance-improvements-of-reporting-wal-stats.patchtext/x-patch; charset=UTF-8; name=v1-0001-performance-improvements-of-reporting-wal-stats.patchDownload

diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..095d8d0970 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,6 +19,17 @@
 
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
+
+/*
+ * generated WAL usage counters.
+ *
+ * Be careful that the counters are cleared after reporting them to
+ * the stats collector although you can use WalUsageAccumDiff()
+ * to computing differences to previous values. For backends,
+ * the counters may be reset after a transaction is finished and
+ * pgstat_report_wal() is invoked, so you can compute the difference
+ * in the same transaction only.
+ */
 WalUsage	pgWalUsage;
 static WalUsage save_pgWalUsage;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 60f45ccc4e..e354a454a9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -144,14 +144,6 @@ char	   *pgstat_stat_tmpname = NULL;
 PgStat_MsgBgWriter BgWriterStats;
 PgStat_MsgWal WalStats;
 
-/*
- * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
- * the previous counters from the current ones.
- */
-static WalUsage prevWalUsage;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -879,9 +871,18 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: Regarding the wal data, it's not enough to check only the wal
+	 * records, because there is a case where a backend generates no wal
+	 * records, but just writes them because it needs to do
+	 * write-ahead-logging when flush the table data.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == 0 && WalStats.m_wal_write == 0 &&
+		WalStats.m_wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -3117,13 +3118,6 @@ pgstat_initialize(void)
 		MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
 	}
 
-	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
-	 * calculate how much pgWalUsage counters are increased by substracting
-	 * prevWalUsage from pgWalUsage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/* Set up a process-exit hook to clean up */
 	on_shmem_exit(pgstat_beshutdown_hook, 0);
 }
@@ -4674,28 +4668,40 @@ pgstat_send_bgwriter(void)
 /* ----------
  * pgstat_report_wal() -
  *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
+ * Send WAL statistics to the collector and clear the counters.
  *
  * Must be called by processes that generate WAL.
+ *
+ * If needed, the caller should check whether the statistic counters
+ * are updated, or time interval since we sent is enough before calling
+ * this function.
+ *
+ * Especially, for backends, if they execute read-only transaction,
+ * they rarely gererate and write the WAL data. So, to call this
+ * function is no meaning, but it may leads performance degradation.
+ *
+ * OTOH, for the checkpointer, it always generates the WAL record and
+ * doesn't call this function frequently. So, it doesn't need to check
+ * them.
+ *
+ * Note: if you want to know the WAL statistics counters are updated
+ * or not, to check WalStats.m_wal_records, walStats.wal_write and
+ * walStats.wal_sync. If WalStats.m_wal_records == 0, walStats.wal_write == 0
+ * and WalStats.wal_sync == 0, it means the counters are not updated.
+ * It's not enough to check only the wal records, because there is a
+ * case where a backend generates no wal records, but just writes them
+ * because it needs to do write-ahead-logging when flush the table data.
  * ----------
  */
 void
 pgstat_report_wal(void)
 {
-	WalUsage	walusage;
-
 	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
+	 * set the counters related to generated WAL data.
 	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
+	WalStats.m_wal_records = pgWalUsage.wal_records;
+	WalStats.m_wal_fpi = pgWalUsage.wal_fpi;
+	WalStats.m_wal_bytes = pgWalUsage.wal_bytes;
 
 	/*
 	 * Send WAL stats message to the collector.
@@ -4704,18 +4710,25 @@ pgstat_report_wal(void)
 		return;
 
 	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
+	 * Clear out the statistics buffer for generated WAL data, so it can be
+	 * re-used.
+	 *
+	 * It's ok because no one takes difference crossing pgstat_report_stat()
+	 * calls although these counters are used in another places, for example
+	 * in pg_stat_statements.c.
 	 */
-	prevWalUsage = pgWalUsage;
+	MemSet(&pgWalUsage, 0, sizeof(WalUsage));
 }
 
 /* ----------
  * pgstat_send_wal() -
  *
- *	Send WAL statistics to the collector.
+ * Send WAL statistics to the collector.
  *
- * If 'force' is not set, WAL stats message is only sent if enough time has
- * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ * If the processes that generate WAL data must call pgstat_report_wal() instead.
+ *
+ * If 'force' is not set, WAL stats message is only sent if the statistics counters
+ * are updated and enough time has passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
  *
  * Return true if the message is sent, and false otherwise.
  * ----------
@@ -4723,26 +4736,28 @@ pgstat_report_wal(void)
 bool
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
-	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
-	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
-
 	if (!force)
 	{
-		TimestampTz now = GetCurrentTimestamp();
+		TimestampTz now;
+
+		/*
+		 * First, to check whether the WAL statistics counters are updated. If
+		 * not updated, we can skip getting a timestamp which is expensive. We
+		 * don't need to check all counters of the WAL statistics counters
+		 * because other counters never be incremented if the counts of
+		 * written/synced are zero.
+		 */
+		if (walStats.wal_write == 0 && walStats.wal_sync == 0)
+			return false;
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
+		now = GetCurrentTimestamp();
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
 			return false;
 		sendTime = now;

#10

andres@anarazel.de

almost 5 years ago

In reply to: Kyotaro Horiguchi (#4)

Re: wal stats questions

Hi,

On 2021-03-25 16:37:10 +0900, Kyotaro Horiguchi wrote:

On the other hand, the counters are incremented in XLogInsertRecord()
and I think we don't want add instructions there.

I don't really buy this. Setting a boolean to true, in a cacheline
you're already touching, isn't that much compared to all the other stuff
in there. The branch to check if wal stats timing etc is enabled is much
more expensive. I think we should just set a boolean to true and leave
it at that.

Greetings,

Andres Freund

#11

horikyota.ntt@gmail.com

almost 5 years ago

In reply to: Andres Freund (#10)

Re: wal stats questions

At Fri, 26 Mar 2021 10:07:45 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2021-03-25 16:37:10 +0900, Kyotaro Horiguchi wrote:

On the other hand, the counters are incremented in XLogInsertRecord()
and I think we don't want add instructions there.

I don't really buy this. Setting a boolean to true, in a cacheline
you're already touching, isn't that much compared to all the other stuff
in there. The branch to check if wal stats timing etc is enabled is much
more expensive. I think we should just set a boolean to true and leave
it at that.

Hmm. Yes, I agree to you in that opinion. I (remember I) was told not
to add even a cycle to the hot path as far as we can avoid when I
tried something like that.

So I'm happy to +1 for that if it is the consensus here, since it is
cleaner.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12

horikyota.ntt@gmail.com

almost 5 years ago

In reply to: Kyotaro Horiguchi (#11)

Re: wal stats questions

At Mon, 29 Mar 2021 11:09:00 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 26 Mar 2021 10:07:45 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2021-03-25 16:37:10 +0900, Kyotaro Horiguchi wrote:

On the other hand, the counters are incremented in XLogInsertRecord()
and I think we don't want add instructions there.

I don't really buy this. Setting a boolean to true, in a cacheline
you're already touching, isn't that much compared to all the other stuff
in there. The branch to check if wal stats timing etc is enabled is much
more expensive. I think we should just set a boolean to true and leave
it at that.

Hmm. Yes, I agree to you in that opinion. I (remember I) was told not

It might sound differently.. To be precise, "I had the same opinion
with you".

to add even a cycle to the hot path as far as we can avoid when I
tried something like that.

So I'm happy to +1 for that if it is the consensus here, since it is
cleaner.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#9)

1 attachment(s)

Re: wal stats questions

I update the patch since there were my misunderstanding points.

On 2021/03/26 16:20, Masahiro Ikeda wrote:

Thanks for many your suggestions!
I made the patch to handle the issues.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

I removed the unnecessary code copying pgWalUsage and just reset the
pgWalUsage after reporting the stats in pgstat_report_wal().

I didn't change this.

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

I added the comments why two functions are separated.
(But is it better to merge them?)

I updated the comments for following reasons.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

I removed the checking code whether the wal stats counters were updated or not
in pgstat_report_stat() since I couldn't understand the valid case yet.
pgstat_report_stat() is called by backends when the transaction is finished.
This means that the condition seems always pass.

I thought to implement if the WAL stats counters were not updated, skip to
send all statistics including the counters for databases and so on. But I
think it's not good because it may take more time to be reflected the
generated stats by read-only transaction.

Although I added the condition which the write/sync counters are updated or
not, I haven't understood the following case yet...Sorry. I checked related
code and tested to insert large object, but I couldn't. I'll investigate more
deeply, but if you already know the function which leads the following case,
please let me know.

I understood the above case (Fujii-san, thanks for your advice in person).
When to flush buffers, for example, to select buffer replacement victim,
it requires a WAL flush if the buffer is dirty. So, to check the WAL stats
counters are updated or not, I check the number of generated wal record and
the counter of syncing in pgstat_report_wal().

The reason why not to check the counter of writing is that if to write is
happened, to sync is happened too in the above case. I added the comments in
the patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v2-0001-performance-improvements-of-reporting-wal-stats.patchtext/x-patch; charset=UTF-8; name=v2-0001-performance-improvements-of-reporting-wal-stats.patchDownload

diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..095d8d0970 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,6 +19,17 @@
 
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
+
+/*
+ * generated WAL usage counters.
+ *
+ * Be careful that the counters are cleared after reporting them to
+ * the stats collector although you can use WalUsageAccumDiff()
+ * to computing differences to previous values. For backends,
+ * the counters may be reset after a transaction is finished and
+ * pgstat_report_wal() is invoked, so you can compute the difference
+ * in the same transaction only.
+ */
 WalUsage	pgWalUsage;
 static WalUsage save_pgWalUsage;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 60f45ccc4e..1ffedb2a24 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -144,14 +144,6 @@ char	   *pgstat_stat_tmpname = NULL;
 PgStat_MsgBgWriter BgWriterStats;
 PgStat_MsgWal WalStats;
 
-/*
- * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
- * the previous counters from the current ones.
- */
-static WalUsage prevWalUsage;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -3117,13 +3109,6 @@ pgstat_initialize(void)
 		MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
 	}
 
-	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
-	 * calculate how much pgWalUsage counters are increased by substracting
-	 * prevWalUsage from pgWalUsage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/* Set up a process-exit hook to clean up */
 	on_shmem_exit(pgstat_beshutdown_hook, 0);
 }
@@ -4674,8 +4659,7 @@ pgstat_send_bgwriter(void)
 /* ----------
  * pgstat_report_wal() -
  *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
+ * Send WAL statistics to the collector and clear the counters.
  *
  * Must be called by processes that generate WAL.
  * ----------
@@ -4683,19 +4667,36 @@ pgstat_send_bgwriter(void)
 void
 pgstat_report_wal(void)
 {
-	WalUsage	walusage;
-
 	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
+	 * Skip if the WAL statistics counters were not updated.
+	 *
+	 * We can know whether the counters were updated to check only two
+	 * records. So, for performance, we don't allocate another memory spaces
+	 * and check the all stats like pgstat_send_slru().
+	 *
+	 * It's not enough to check only the generated wal records, because there
+	 * is a case that a backend generates no wal records, but just writes and
+	 * syncs the wal data. When to flush buffers, for example, to select
+	 * buffer replacement victim, it requires a WAL flush if the buffer is
+	 * dirty. This happens even if executing read-only transaction.
+	 *
+	 * In this case, if to write is happend, to sync is happend too. although
+	 * there is a case to write is not happend but to sync is happened. For
+	 * example, when to flush buffers just after another process only write
+	 * and don't sync because the wal buffer is no space.
+	 *
+	 * So, it's ok to check the generated wal records and the counter of
+	 * syncing only.
 	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+	if (pgWalUsage.wal_records == 0 && WalStats.m_wal_sync == 0)
+		return;
 
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
+	/*
+	 * set the counters related to generated WAL data.
+	 */
+	WalStats.m_wal_records = pgWalUsage.wal_records;
+	WalStats.m_wal_fpi = pgWalUsage.wal_fpi;
+	WalStats.m_wal_bytes = pgWalUsage.wal_bytes;
 
 	/*
 	 * Send WAL stats message to the collector.
@@ -4704,18 +4705,26 @@ pgstat_report_wal(void)
 		return;
 
 	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
+	 * Clear out the statistics buffer for generated WAL data, so it can be
+	 * re-used.
+	 *
+	 * It's ok to clear out because no one takes difference crossing
+	 * pgstat_report_wal() calls although these counters are used in another
+	 * places, for example in pg_stat_statements.c.
 	 */
-	prevWalUsage = pgWalUsage;
+	MemSet(&pgWalUsage, 0, sizeof(WalUsage));
 }
 
 /* ----------
  * pgstat_send_wal() -
  *
- *	Send WAL statistics to the collector.
+ * Send WAL statistics to the collector.
  *
- * If 'force' is not set, WAL stats message is only sent if enough time has
- * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
+ * If the processes that generate WAL data must call pgstat_report_wal() instead.
+ *
+ * If 'force' is not set, the WAL stats message is only sent if the statistics
+ * counters except for the counters related to generated WAL data were updated
+ * and enough time has passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
  *
  * Return true if the message is sent, and false otherwise.
  * ----------
@@ -4723,26 +4732,32 @@ pgstat_report_wal(void)
 bool
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
-	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
-	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
-
 	if (!force)
 	{
-		TimestampTz now = GetCurrentTimestamp();
+		TimestampTz now;
+
+		/*
+		 * Don't expend a clock check if the WAL statistics counters were not
+		 * updated.
+		 *
+		 * We don't consider the WAL stats related to the generated WAL data
+		 * because pgstat_report_wal() is called instead in the case.
+		 *
+		 * We can know whether the counters were updated to check only two
+		 * records. So, for performance, we don't allocate another memory
+		 * spaces and check the all stats like pgstat_send_slru().
+		 */
+		if (walStats.wal_write == 0 && walStats.wal_sync == 0)
+			return false;
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
+		now = GetCurrentTimestamp();
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
 			return false;
 		sendTime = now;

#14

horikyota.ntt@gmail.com

almost 5 years ago

In reply to: Masahiro Ikeda (#13)

Re: wal stats questions

At Tue, 30 Mar 2021 09:41:24 +0900, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote in

I update the patch since there were my misunderstanding points.

On 2021/03/26 16:20, Masahiro Ikeda wrote:

Thanks for many your suggestions!
I made the patch to handle the issues.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

I removed the unnecessary code copying pgWalUsage and just reset the
pgWalUsage after reporting the stats in pgstat_report_wal().

I didn't change this.

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

I added the comments why two functions are separated.
(But is it better to merge them?)

I updated the comments for following reasons.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

I removed the checking code whether the wal stats counters were updated or not
in pgstat_report_stat() since I couldn't understand the valid case yet.
pgstat_report_stat() is called by backends when the transaction is finished.
This means that the condition seems always pass.

Doesn't the same holds for all other counters? If you are saying that
"wal counters should be zero if all other stats counters are zero", we
need an assertion to check that and a comment to explain that.

I personally find it safer to add the WAL-stats condition to the
fast-return check, rather than addin such assertion.

pgstat_send_wal() is called mainly from pgstat_report_wal() which
consumes pgWalStats counters and WalWriterMain() which
doesn't. Checking on pgWalStats counters isn't so complex that we need
to avoid that in wal writer, thus *I* think pgstat_send_wal() and
pgstat_report_wal() can be consolidated. Even if you have a strong
opinion that wal writer should call a separate function, the name
should be something other than pgstat_send_wal() since it ignores
pgWalUsage counters, which are supposed to be included in "WAL stats".

I thought to implement if the WAL stats counters were not updated, skip to
send all statistics including the counters for databases and so on. But I
think it's not good because it may take more time to be reflected the
generated stats by read-only transaction.

Ur, yep.

Although I added the condition which the write/sync counters are updated or
not, I haven't understood the following case yet...Sorry. I checked related
code and tested to insert large object, but I couldn't. I'll investigate more
deeply, but if you already know the function which leads the following case,
please let me know.

I understood the above case (Fujii-san, thanks for your advice in person).
When to flush buffers, for example, to select buffer replacement victim,
it requires a WAL flush if the buffer is dirty. So, to check the WAL stats
counters are updated or not, I check the number of generated wal record and
the counter of syncing in pgstat_report_wal().

The reason why not to check the counter of writing is that if to write is
happened, to sync is happened too in the above case. I added the comments in
the patch.

Mmm.. Although I couldn't read you clearly, The fact that a flush may
happen without a write means the reverse at the same time, a write may
happen without a flush. For asynchronous commits, WAL-write happens
alone unaccompanied by a flush. And the corresponding flush would
happen later without writes.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15

ikedamsh@oss.nttdata.com

almost 5 years ago

In reply to: Kyotaro Horiguchi (#14)

1 attachment(s)

Re: wal stats questions

On 2021/03/30 17:28, Kyotaro Horiguchi wrote:

At Tue, 30 Mar 2021 09:41:24 +0900, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote in

I update the patch since there were my misunderstanding points.

On 2021/03/26 16:20, Masahiro Ikeda wrote:

Thanks for many your suggestions!
I made the patch to handle the issues.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

I removed the unnecessary code copying pgWalUsage and just reset the
pgWalUsage after reporting the stats in pgstat_report_wal().

I didn't change this.

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

I added the comments why two functions are separated.
(But is it better to merge them?)

I updated the comments for following reasons.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

I removed the checking code whether the wal stats counters were updated or not
in pgstat_report_stat() since I couldn't understand the valid case yet.
pgstat_report_stat() is called by backends when the transaction is finished.
This means that the condition seems always pass.

Doesn't the same holds for all other counters? If you are saying that
"wal counters should be zero if all other stats counters are zero", we
need an assertion to check that and a comment to explain that.

I personally find it safer to add the WAL-stats condition to the
fast-return check, rather than addin such assertion.

Thanks for your comments.

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check using WAL
stats counters. But, the purpose is to make the conditions stricter, right?

pgstat_send_wal() is called mainly from pgstat_report_wal() which
consumes pgWalStats counters and WalWriterMain() which
doesn't. Checking on pgWalStats counters isn't so complex that we need
to avoid that in wal writer, thus *I* think pgstat_send_wal() and
pgstat_report_wal() can be consolidated. Even if you have a strong
opinion that wal writer should call a separate function, the name
should be something other than pgstat_send_wal() since it ignores
pgWalUsage counters, which are supposed to be included in "WAL stats".

OK, I consolidated them.

I thought to implement if the WAL stats counters were not updated, skip to
send all statistics including the counters for databases and so on. But I
think it's not good because it may take more time to be reflected the
generated stats by read-only transaction.

Ur, yep.

Although I added the condition which the write/sync counters are updated or
not, I haven't understood the following case yet...Sorry. I checked related
code and tested to insert large object, but I couldn't. I'll investigate more
deeply, but if you already know the function which leads the following case,
please let me know.

I understood the above case (Fujii-san, thanks for your advice in person).
When to flush buffers, for example, to select buffer replacement victim,
it requires a WAL flush if the buffer is dirty. So, to check the WAL stats
counters are updated or not, I check the number of generated wal record and
the counter of syncing in pgstat_report_wal().

The reason why not to check the counter of writing is that if to write is
happened, to sync is happened too in the above case. I added the comments in
the patch.

Mmm.. Although I couldn't read you clearly, The fact that a flush may
happen without a write means the reverse at the same time, a write may
happen without a flush. For asynchronous commits, WAL-write happens
alone unaccompanied by a flush. And the corresponding flush would
happen later without writes.

Sorry, I didn't explain it enough.

For processes which may generate WAL records like backends, I thought it's
enough to check (1)the number of generated WAL records and (2)the counters of
syncing(flushing) the WAL. This is checked in pgstat_report_wal(). Sorry for
that I didn't mention (1) in the above thread.

If a backend execute a write transaction, some WAL records must be generated.
So, it's ok to check (1) only regardless of whether asynchronous commit is
enabled or not.

OHOT, if a backend execute a read-only transaction, WAL records won't be
generated (although HOT makes a wal records exactly...). But, WAL-write and
flush may happen when to flush buffers via XLogFlush(). In this case, if
WAL-write happened, flush must be happen later. But, if my understanding is
correct, there is no a case to flush doesn't happen, but to write happen.
So, I thought (2) is needed and it's enough to check the counter of
syncing(flushing).

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v3-0001-performance-improvements-of-reporting-wal-stats.patchtext/x-patch; charset=UTF-8; name=v3-0001-performance-improvements-of-reporting-wal-stats.patchDownload

diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..15eb3c8e64 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,6 +19,17 @@
 
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
+
+/*
+ * generated WAL usage counters.
+ *
+ * Be careful that the counters are cleared after reporting them to
+ * the stats collector although you can use WalUsageAccumDiff()
+ * to computing differences to previous values. For backends,
+ * the counters may be reset after a transaction is finished and
+ * pgstat_send_wal() is invoked, so you can compute the difference
+ * in the same transaction only.
+ */
 WalUsage	pgWalUsage;
 static WalUsage save_pgWalUsage;
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a0..75a95f3de7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 60f45ccc4e..7fd03b69b4 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -144,14 +144,6 @@ char	   *pgstat_stat_tmpname = NULL;
 PgStat_MsgBgWriter BgWriterStats;
 PgStat_MsgWal WalStats;
 
-/*
- * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
- * the previous counters from the current ones.
- */
-static WalUsage prevWalUsage;
-
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -879,9 +871,19 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the wal record is generated or not. If a read transaction
+	 * is executed, wal records aren't nomally generated (although HOT makes
+	 * wal records). But, just writes and syncs the wal data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == 0 && walStats.wal_write == 0 &&
+		walStats.wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -975,7 +977,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -3117,13 +3119,6 @@ pgstat_initialize(void)
 		MyBEEntry = &BackendStatusArray[MaxBackends + MyAuxProcType];
 	}
 
-	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
-	 * calculate how much pgWalUsage counters are increased by substracting
-	 * prevWalUsage from pgWalUsage.
-	 */
-	prevWalUsage = pgWalUsage;
-
 	/* Set up a process-exit hook to clean up */
 	on_shmem_exit(pgstat_beshutdown_hook, 0);
 }
@@ -4671,44 +4666,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -4720,20 +4677,28 @@ pgstat_report_wal(void)
  * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().
+	 *
+	 * 'wal_records' is zero means that wal records weren't generated. So,
+	 * 'wal_fpi', 'wal_bytes', 'm_wal_buffers_full' is zero too. 'wal_writes'
+	 * and 'wal_sync' are zero means the counter of time spent is zero too.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (pgWalUsage.wal_records == 0 && walStats.wal_write == 0 &&
+		walStats.wal_sync == 0)
+		return;
 
 	if (!force)
 	{
@@ -4741,13 +4706,21 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * set the counters related to generated WAL data.
+	 */
+	WalStats.m_wal_records = pgWalUsage.wal_records;
+	WalStats.m_wal_fpi = pgWalUsage.wal_fpi;
+	WalStats.m_wal_bytes = pgWalUsage.wal_bytes;
+
 	/*
 	 * Prepare and send the message
 	 */
@@ -4759,7 +4732,14 @@ pgstat_send_wal(bool force)
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
 
-	return true;
+	/*
+	 * Clear out the statistics for generated wal usage too.
+	 *
+	 * Although these counters are used in another places, for example in
+	 * pg_stat_statements.c, it's ok to clear out because no one takes
+	 * difference crossing pgstat_send_wal() calls.
+	 */
+	MemSet(&pgWalUsage, 0, sizeof(WalUsage));
 }
 
 /* ----------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 87672e6f30..2174a7439c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1601,8 +1601,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

#16

masao.fujii@oss.nttdata.com

almost 5 years ago

In reply to: Masahiro Ikeda (#15)

Re: wal stats questions

On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check using WAL
stats counters. But, the purpose is to make the conditions stricter, right?

Yes. Currently if the following condition is false even when the WAL counters
are updated, nothing is sent to the stats collector. But with your patch,
in this case the WAL stats are sent.

if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)

Thanks for the patch! It now fails to be applied to the master cleanly.
So could you rebase the patch?

pgstat_initialize() should initialize pgWalUsage with zero?

+	/*
+	 * set the counters related to generated WAL data.
+	 */
+	WalStats.m_wal_records = pgWalUsage.wal_records;
+	WalStats.m_wal_fpi = pgWalUsage.wal_fpi;
+	WalStats.m_wal_bytes = pgWalUsage.wal_bytes;

This should be skipped if pgWalUsage.wal_records is zero?

+ * Be careful that the counters are cleared after reporting them to
+ * the stats collector although you can use WalUsageAccumDiff()
+ * to computing differences to previous values. For backends,
+ * the counters may be reset after a transaction is finished and
+ * pgstat_send_wal() is invoked, so you can compute the difference
+ * in the same transaction only.

On the second thought, I'm afraid that this can be likely to be a foot-gun
in the future. So I'm now wondering if the current approach (i.e., calculate
the diff between the current and previous pgWalUsage and don't reset it
to zero) is better. Thought? Since the similar data structure pgBufferUsage
doesn't have this kind of restriction, I'm afraid that the developer can
easily miss that only pgWalUsage has the restriction.

But currently the diff is calculated (i.e., WalUsageAccumDiff() is called)
even when the WAL counters are not updated. Then if that calculated diff is
zero, we skip sending the WAL stats. This logic should be improved. That is,
probably we should be able to check whether the WAL counters are updated
or not, without calculating the diff, because the calculation is not free.
We can implement this by introducing new flag variable that we discussed
upthread. This flag is set to true whenever the WAL counters are incremented
and used to determine whether the WAL stats need to be sent.

If we do this, another issue is that the data types for wal_records and wal_fpi
in pgWalUsage are long. Which may lead to overflow? If yes, it's should be
replaced with uint64.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#17

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: Fujii Masao (#16)

2 attachment(s)

Re: wal stats questions

On 2021/04/13 9:33, Fujii Masao wrote:

On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check using WAL
stats counters. But, the purpose is to make the conditions stricter, right?

Yes. Currently if the following condition is false even when the WAL counters
are updated, nothing is sent to the stats collector. But with your patch,
in this case the WAL stats are sent.

    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
        !have_function_stats && !disconnect)

Thanks for the patch! It now fails to be applied to the master cleanly.
So could you rebase the patch?

Thanks for your comments!
I rebased it.

pgstat_initialize() should initialize pgWalUsage with zero?

Thanks. But, I didn't handle it because I undo the logic to calculate the diff
as you mentioned below.

+    /*
+     * set the counters related to generated WAL data.
+     */
+    WalStats.m_wal_records = pgWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi;
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes;

This should be skipped if pgWalUsage.wal_records is zero?

Yes, fixed it.

+ * Be careful that the counters are cleared after reporting them to
+ * the stats collector although you can use WalUsageAccumDiff()
+ * to computing differences to previous values. For backends,
+ * the counters may be reset after a transaction is finished and
+ * pgstat_send_wal() is invoked, so you can compute the difference
+ * in the same transaction only.
On the second thought, I'm afraid that this can be likely to be a foot-gun
in the future. So I'm now wondering if the current approach (i.e., calculate
the diff between the current and previous pgWalUsage and don't reset it
to zero) is better. Thought? Since the similar data structure pgBufferUsage
doesn't have this kind of restriction, I'm afraid that the developer can
easily miss that only pgWalUsage has the restriction.

But currently the diff is calculated (i.e., WalUsageAccumDiff() is called)
even when the WAL counters are not updated. Then if that calculated diff is
zero, we skip sending the WAL stats. This logic should be improved. That is,
probably we should be able to check whether the WAL counters are updated
or not, without calculating the diff, because the calculation is not free.
We can implement this by introducing new flag variable that we discussed
upthread. This flag is set to true whenever the WAL counters are incremented
and used to determine whether the WAL stats need to be sent.

Sound reasonable. I agreed that the restriction has a risk to lead mistakes.
I made the patch introducing a new flag.

- v4-0001-performance-improvements-of-reporting-wal-stats.patch

I think introducing a new flag is not necessary because we can know if the WAL
stats are updated or not using the counters of the generated wal record, wal
writes and wal syncs. It's fast compared to get timestamp. The attached patch
is to check if the counters are updated or not using them.

-
v4-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch

If we do this, another issue is that the data types for wal_records and wal_fpi
in pgWalUsage are long. Which may lead to overflow? If yes, it's should be
replaced with uint64.

Yes, I fixed. BufferUsage's counters have same issue, so I fixed them too.

BTW, is it better to change PgStat_Counter from int64 to uint64 because there
aren't any counters with negative number?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v4-0001-performance-improvements-of-reporting-wal-stats.patchtext/x-patch; charset=UTF-8; name=v4-0001-performance-improvements-of-reporting-wal-stats.patchDownload

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 14be850cb3..265bb16cf6 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -173,21 +173,21 @@ typedef struct Counters
 	double		sum_var_time[PGSS_NUMKIND]; /* sum of variances in
 											 * planning/execution time in msec */
 	int64		rows;			/* total # of retrieved or affected rows */
-	int64		shared_blks_hit;	/* # of shared buffer hits */
-	int64		shared_blks_read;	/* # of shared disk blocks read */
-	int64		shared_blks_dirtied;	/* # of shared disk blocks dirtied */
-	int64		shared_blks_written;	/* # of shared disk blocks written */
-	int64		local_blks_hit; /* # of local buffer hits */
-	int64		local_blks_read;	/* # of local disk blocks read */
-	int64		local_blks_dirtied; /* # of local disk blocks dirtied */
-	int64		local_blks_written; /* # of local disk blocks written */
-	int64		temp_blks_read; /* # of temp blocks read */
-	int64		temp_blks_written;	/* # of temp blocks written */
+	uint64		shared_blks_hit;	/* # of shared buffer hits */
+	uint64		shared_blks_read;	/* # of shared disk blocks read */
+	uint64		shared_blks_dirtied;	/* # of shared disk blocks dirtied */
+	uint64		shared_blks_written;	/* # of shared disk blocks written */
+	uint64		local_blks_hit; /* # of local buffer hits */
+	uint64		local_blks_read;	/* # of local disk blocks read */
+	uint64		local_blks_dirtied; /* # of local disk blocks dirtied */
+	uint64		local_blks_written; /* # of local disk blocks written */
+	uint64		temp_blks_read; /* # of temp blocks read */
+	uint64		temp_blks_written;	/* # of temp blocks written */
 	double		blk_read_time;	/* time spent reading, in msec */
 	double		blk_write_time; /* time spent writing, in msec */
 	double		usage;			/* usage factor */
-	int64		wal_records;	/* # of WAL records generated */
-	int64		wal_fpi;		/* # of WAL full page images generated */
+	uint64		wal_records;	/* # of WAL records generated */
+	uint64		wal_fpi;		/* # of WAL full page images generated */
 	uint64		wal_bytes;		/* total amount of WAL generated in bytes */
 } Counters;
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 45fdecbbd9..848eff2212 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1279,6 +1279,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		pgWalUsage.wal_bytes += rechdr->xl_tot_len;
 		pgWalUsage.wal_records++;
 		pgWalUsage.wal_fpi += num_fpi;
+		walUsageUpdated = true;
 	}
 
 	return EndPos;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..75ecd00c23 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,12 @@
 
 #include "executor/instrument.h"
 
+/*
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
+ */
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a0..75a95f3de7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 232aeeac29..2f1f6d8cc5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,12 +133,18 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
 
+/*
+ * WAL usage flag to save if the WAL counters are updated or not.
+ * This is used to decide sending the WAL statistics to the collector.
+ */
+bool walUsageUpdated;
+
 /*
  * List of SLRU names that we keep stats for.  There is no central registry of
  * SLRUs, so we use this fixed list instead.  The "other" entry is used for
@@ -855,10 +861,20 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the wal record is generated or not. If a read transaction
+	 * is executed, wal records aren't nomally generated (although HOT makes
+	 * wal records). But, just writes and syncs the wal data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-		!have_function_stats && !disconnect)
+		!walUsageUpdated && walStats.wal_write == 0 &&
+		walStats.wal_sync == 0 && !have_function_stats &&
+		!disconnect)
 		return;
 
 	/*
@@ -951,7 +967,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2933,12 +2949,15 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
 	prevWalUsage = pgWalUsage;
 
+	/* WAL usage counters are not updated yet. */
+	walUsageUpdated = false;
+
 	/* Set up a process-exit hook to clean up */
 	on_shmem_exit(pgstat_shutdown_hook, 0);
 }
@@ -3045,44 +3064,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -3094,20 +3075,29 @@ pgstat_report_wal(void)
  * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().
+	 *
+	 * It's not enough to check the WAL usage counters are updated or not.
+	 * For example the walwriter may write/sync the WAL although it doesn't
+	 * generate wal records. 'wal_writes' and 'wal_sync' are zero means the
+	 * counters of time spent are zero too.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (!walUsageUpdated && walStats.wal_write == 0 &&
+		walStats.wal_sync == 0)
+		return;
 
 	if (!force)
 	{
@@ -3115,25 +3105,52 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * set the counters related to generated WAL data if the counters are
+	 * updated.
+	 */
+	if (walUsageUpdated)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters are increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+
+		WalStats.m_wal_records = walusage.wal_records;
+		WalStats.m_wal_fpi = walusage.wal_fpi;
+		WalStats.m_wal_bytes = walusage.wal_bytes;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage,
+	 * and reset the updated flag.
+	 */
+	prevWalUsage = pgWalUsage;
+	walUsageUpdated = false;
+
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
-
-	return true;
 }
 
 /* ----------
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index aa8eceda5f..9de29e51d6 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,26 +16,41 @@
 #include "portability/instr_time.h"
 
 
+/*
+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is that there may be many
+ * callers to use them, so to avoid unexpected reset.
+ */
 typedef struct BufferUsage
 {
-	long		shared_blks_hit;	/* # of shared buffer hits */
-	long		shared_blks_read;	/* # of shared disk blocks read */
-	long		shared_blks_dirtied;	/* # of shared blocks dirtied */
-	long		shared_blks_written;	/* # of shared disk blocks written */
-	long		local_blks_hit; /* # of local buffer hits */
-	long		local_blks_read;	/* # of local disk blocks read */
-	long		local_blks_dirtied; /* # of local blocks dirtied */
-	long		local_blks_written; /* # of local disk blocks written */
-	long		temp_blks_read; /* # of temp blocks read */
-	long		temp_blks_written;	/* # of temp blocks written */
+	uint64		shared_blks_hit;	/* # of shared buffer hits */
+	uint64		shared_blks_read;	/* # of shared disk blocks read */
+	uint64		shared_blks_dirtied;	/* # of shared blocks dirtied */
+	uint64		shared_blks_written;	/* # of shared disk blocks written */
+	uint64		local_blks_hit; /* # of local buffer hits */
+	uint64		local_blks_read;	/* # of local disk blocks read */
+	uint64		local_blks_dirtied; /* # of local blocks dirtied */
+	uint64		local_blks_written; /* # of local disk blocks written */
+	uint64		temp_blks_read; /* # of temp blocks read */
+	uint64		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
 
+/*
+ * The accumulated counters for generated WAL usage.
+ *
+ * The reason the counters are accumulated values is the same as BufferUsage's one.
+ * And the reason to store only generated WAL usage and doesn't store WAL I/O
+ * activity, is that this is assumed for knowing the WAL usage in per query or
+ * transaction. So, common resources for the cluster like WAL I/O activity is
+ * not stored.
+ */
 typedef struct WalUsage
 {
-	long		wal_records;	/* # of WAL records produced */
-	long		wal_fpi;		/* # of WAL full page images produced */
+	uint64		wal_records;	/* # of WAL records produced */
+	uint64		wal_fpi;		/* # of WAL full page images produced */
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c2f70effbd..63414e7bc8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -503,8 +503,8 @@ typedef struct PgStat_MsgBgWriter
 typedef struct PgStat_MsgWal
 {
 	PgStat_MsgHdr m_hdr;
-	PgStat_Counter m_wal_records;
-	PgStat_Counter m_wal_fpi;
+	uint64		m_wal_records;
+	uint64		m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
 	PgStat_Counter m_wal_write;
@@ -984,6 +984,13 @@ extern PgStat_Counter pgStatTransactionIdleTime;
  */
 extern SessionEndType pgStatSessionEndCause;
 
+
+/*
+ * WAL usage flag to save if the WAL counters are updated or not.
+ * This is used to decide sending the WAL statistics to the collector.
+ */
+extern bool walUsageUpdated;
+
 /* ----------
  * Functions called from postmaster
  * ----------
@@ -1110,8 +1117,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

v4-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v4-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 14be850cb3..265bb16cf6 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -173,21 +173,21 @@ typedef struct Counters
 	double		sum_var_time[PGSS_NUMKIND]; /* sum of variances in
 											 * planning/execution time in msec */
 	int64		rows;			/* total # of retrieved or affected rows */
-	int64		shared_blks_hit;	/* # of shared buffer hits */
-	int64		shared_blks_read;	/* # of shared disk blocks read */
-	int64		shared_blks_dirtied;	/* # of shared disk blocks dirtied */
-	int64		shared_blks_written;	/* # of shared disk blocks written */
-	int64		local_blks_hit; /* # of local buffer hits */
-	int64		local_blks_read;	/* # of local disk blocks read */
-	int64		local_blks_dirtied; /* # of local disk blocks dirtied */
-	int64		local_blks_written; /* # of local disk blocks written */
-	int64		temp_blks_read; /* # of temp blocks read */
-	int64		temp_blks_written;	/* # of temp blocks written */
+	uint64		shared_blks_hit;	/* # of shared buffer hits */
+	uint64		shared_blks_read;	/* # of shared disk blocks read */
+	uint64		shared_blks_dirtied;	/* # of shared disk blocks dirtied */
+	uint64		shared_blks_written;	/* # of shared disk blocks written */
+	uint64		local_blks_hit; /* # of local buffer hits */
+	uint64		local_blks_read;	/* # of local disk blocks read */
+	uint64		local_blks_dirtied; /* # of local disk blocks dirtied */
+	uint64		local_blks_written; /* # of local disk blocks written */
+	uint64		temp_blks_read; /* # of temp blocks read */
+	uint64		temp_blks_written;	/* # of temp blocks written */
 	double		blk_read_time;	/* time spent reading, in msec */
 	double		blk_write_time; /* time spent writing, in msec */
 	double		usage;			/* usage factor */
-	int64		wal_records;	/* # of WAL records generated */
-	int64		wal_fpi;		/* # of WAL full page images generated */
+	uint64		wal_records;	/* # of WAL records generated */
+	uint64		wal_fpi;		/* # of WAL full page images generated */
 	uint64		wal_bytes;		/* total amount of WAL generated in bytes */
 } Counters;
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..75ecd00c23 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,12 @@
 
 #include "executor/instrument.h"
 
+/*
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
+ */
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a0..75a95f3de7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 232aeeac29..87b5088d68 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,8 +133,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -855,9 +855,19 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the wal record is generated or not. If a read transaction
+	 * is executed, wal records aren't nomally generated (although HOT makes
+	 * wal records). But, just writes and syncs the wal data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		walStats.wal_write == 0 && walStats.wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -951,7 +961,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2933,7 +2943,7 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -3045,44 +3055,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -3094,20 +3066,33 @@ pgstat_report_wal(void)
  * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().
+	 *
+	 * The current 'wal_records' is the same as the previous one means that
+	 * wal records weren't generated. So, the counters of 'wal_fpi',
+	 * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+	 *
+	 * It's not enough to check the number of generated wal records, for
+	 * example the walwriter may write/sync the WAL although it doesn't
+	 * generate wal records. 'wal_writes' and 'wal_sync' are zero means the
+	 * counters of time spent are zero too.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		walStats.wal_write == 0 && walStats.wal_sync == 0)
+		return;
 
 	if (!force)
 	{
@@ -3115,25 +3100,50 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * set the counters related to generated WAL data if the counters are
+	 * updated.
+	 */
+	if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters are increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+
+		WalStats.m_wal_records = walusage.wal_records;
+		WalStats.m_wal_fpi = walusage.wal_fpi;
+		WalStats.m_wal_bytes = walusage.wal_bytes;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
-
-	return true;
 }
 
 /* ----------
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index aa8eceda5f..9de29e51d6 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,26 +16,41 @@
 #include "portability/instr_time.h"
 
 
+/*
+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is that there may be many
+ * callers to use them, so to avoid unexpected reset.
+ */
 typedef struct BufferUsage
 {
-	long		shared_blks_hit;	/* # of shared buffer hits */
-	long		shared_blks_read;	/* # of shared disk blocks read */
-	long		shared_blks_dirtied;	/* # of shared blocks dirtied */
-	long		shared_blks_written;	/* # of shared disk blocks written */
-	long		local_blks_hit; /* # of local buffer hits */
-	long		local_blks_read;	/* # of local disk blocks read */
-	long		local_blks_dirtied; /* # of local blocks dirtied */
-	long		local_blks_written; /* # of local disk blocks written */
-	long		temp_blks_read; /* # of temp blocks read */
-	long		temp_blks_written;	/* # of temp blocks written */
+	uint64		shared_blks_hit;	/* # of shared buffer hits */
+	uint64		shared_blks_read;	/* # of shared disk blocks read */
+	uint64		shared_blks_dirtied;	/* # of shared blocks dirtied */
+	uint64		shared_blks_written;	/* # of shared disk blocks written */
+	uint64		local_blks_hit; /* # of local buffer hits */
+	uint64		local_blks_read;	/* # of local disk blocks read */
+	uint64		local_blks_dirtied; /* # of local blocks dirtied */
+	uint64		local_blks_written; /* # of local disk blocks written */
+	uint64		temp_blks_read; /* # of temp blocks read */
+	uint64		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
 
+/*
+ * The accumulated counters for generated WAL usage.
+ *
+ * The reason the counters are accumulated values is the same as BufferUsage's one.
+ * And the reason to store only generated WAL usage and doesn't store WAL I/O
+ * activity, is that this is assumed for knowing the WAL usage in per query or
+ * transaction. So, common resources for the cluster like WAL I/O activity is
+ * not stored.
+ */
 typedef struct WalUsage
 {
-	long		wal_records;	/* # of WAL records produced */
-	long		wal_fpi;		/* # of WAL full page images produced */
+	uint64		wal_records;	/* # of WAL records produced */
+	uint64		wal_fpi;		/* # of WAL full page images produced */
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c2f70effbd..d9c9ece9c6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -503,8 +503,8 @@ typedef struct PgStat_MsgBgWriter
 typedef struct PgStat_MsgWal
 {
 	PgStat_MsgHdr m_hdr;
-	PgStat_Counter m_wal_records;
-	PgStat_Counter m_wal_fpi;
+	uint64		m_wal_records;
+	uint64		m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
 	PgStat_Counter m_wal_write;
@@ -1110,8 +1110,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

#18

torikoshia

torikoshia@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#17)

Re: wal stats questions

On 2021-04-16 10:27, Masahiro Ikeda wrote:

On 2021/04/13 9:33, Fujii Masao wrote:

On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check
using WAL
stats counters. But, the purpose is to make the conditions stricter,
right?

Yes. Currently if the following condition is false even when the WAL
counters
are updated, nothing is sent to the stats collector. But with your
patch,
in this case the WAL stats are sent.

    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
        !have_function_stats && !disconnect)

Thanks for the patch! It now fails to be applied to the master
cleanly.
So could you rebase the patch?

Thanks for your comments!
I rebased it.

Thanks for working on this!

I have some minor comments on
performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch.

177 @@ -3094,20 +3066,33 @@ pgstat_report_wal(void)
178 * Return true if the message is sent, and false otherwise.

Since you changed the return value to void, it seems the description is
not necessary anymore.

208 + * generate wal records. 'wal_writes' and 'wal_sync' are
zero means the

It may be better to change 'wal_writes' to 'wal_write' since single
quotation seems to mean variable name.

234 + * set the counters related to generated WAL data if the
counters are

set -> Set?

Regards,

#19

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: torikoshia (#18)

1 attachment(s)

Re: wal stats questions

On 2021/04/21 15:08, torikoshia wrote:

On 2021-04-16 10:27, Masahiro Ikeda wrote:

On 2021/04/13 9:33, Fujii Masao wrote:

On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check
using WAL stats counters. But, the purpose is to make the conditions
stricter, right?

Yes. Currently if the following condition is false even when the WAL
counters are updated, nothing is sent to the stats collector. But with
your patch, in this case the WAL stats are sent.

if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)

Thanks for the patch! It now fails to be applied to the master
cleanly. So could you rebase the patch?

Thanks for your comments! I rebased it.

Thanks for working on this!

Hi, thanks for your comments!

I have some minor comments on
performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch.

177 @@ -3094,20 +3066,33 @@ pgstat_report_wal(void) 178 * Return true if
the message is sent, and false otherwise.

Since you changed the return value to void, it seems the description is not
necessary anymore.

Right, I fixed it.

208 + * generate wal records. 'wal_writes' and 'wal_sync' are zero
means the

It may be better to change 'wal_writes' to 'wal_write' since single
quotation seems to mean variable name.

Agreed.

234 + * set the counters related to generated WAL data if the
counters are

set -> Set?

Yes, I fixed.

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

Although this is not related to torikoshi-san's comment, the above my
understanding is not right. Some counters like delta_live_tuples,
delta_dead_tuples and changed_tuples can be negative.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v5-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v5-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index f42f07622e..1b925d31d2 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -173,21 +173,21 @@ typedef struct Counters
 	double		sum_var_time[PGSS_NUMKIND]; /* sum of variances in
 											 * planning/execution time in msec */
 	int64		rows;			/* total # of retrieved or affected rows */
-	int64		shared_blks_hit;	/* # of shared buffer hits */
-	int64		shared_blks_read;	/* # of shared disk blocks read */
-	int64		shared_blks_dirtied;	/* # of shared disk blocks dirtied */
-	int64		shared_blks_written;	/* # of shared disk blocks written */
-	int64		local_blks_hit; /* # of local buffer hits */
-	int64		local_blks_read;	/* # of local disk blocks read */
-	int64		local_blks_dirtied; /* # of local disk blocks dirtied */
-	int64		local_blks_written; /* # of local disk blocks written */
-	int64		temp_blks_read; /* # of temp blocks read */
-	int64		temp_blks_written;	/* # of temp blocks written */
+	uint64		shared_blks_hit;	/* # of shared buffer hits */
+	uint64		shared_blks_read;	/* # of shared disk blocks read */
+	uint64		shared_blks_dirtied;	/* # of shared disk blocks dirtied */
+	uint64		shared_blks_written;	/* # of shared disk blocks written */
+	uint64		local_blks_hit; /* # of local buffer hits */
+	uint64		local_blks_read;	/* # of local disk blocks read */
+	uint64		local_blks_dirtied; /* # of local disk blocks dirtied */
+	uint64		local_blks_written; /* # of local disk blocks written */
+	uint64		temp_blks_read; /* # of temp blocks read */
+	uint64		temp_blks_written;	/* # of temp blocks written */
 	double		blk_read_time;	/* time spent reading, in msec */
 	double		blk_write_time; /* time spent writing, in msec */
 	double		usage;			/* usage factor */
-	int64		wal_records;	/* # of WAL records generated */
-	int64		wal_fpi;		/* # of WAL full page images generated */
+	uint64		wal_records;	/* # of WAL records generated */
+	uint64		wal_fpi;		/* # of WAL full page images generated */
 	uint64		wal_bytes;		/* total amount of WAL generated in bytes */
 } Counters;
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..75ecd00c23 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,12 @@
 
 #include "executor/instrument.h"
 
+/*
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
+ */
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a459..1761694a5b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e1ec7d8b7d..940663de20 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,8 +133,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -855,9 +855,19 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the wal record is generated or not. If a read transaction
+	 * is executed, wal records aren't nomally generated (although HOT makes
+	 * wal records). But, just writes and syncs the wal data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		walStats.wal_write == 0 && walStats.wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -951,7 +961,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2934,7 +2944,7 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -3046,44 +3056,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -3091,24 +3063,35 @@ pgstat_report_wal(void)
  *
  * If 'force' is not set, WAL stats message is only sent if enough time has
  * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
- *
- * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().
+	 *
+	 * The current 'wal_records' is the same as the previous one means that
+	 * wal records weren't generated. So, the counters of 'wal_fpi',
+	 * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+	 *
+	 * It's not enough to check the number of generated wal records, for
+	 * example the walwriter may write/sync the WAL although it doesn't
+	 * generate wal records. 'wal_write' and 'wal_sync' are zero means the
+	 * counters of time spent are zero too.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		walStats.wal_write == 0 && walStats.wal_sync == 0)
+		return;
 
 	if (!force)
 	{
@@ -3116,25 +3099,50 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * Set the counters related to generated WAL data if the counters are
+	 * updated.
+	 */
+	if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters are increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+
+		WalStats.m_wal_records = walusage.wal_records;
+		WalStats.m_wal_fpi = walusage.wal_fpi;
+		WalStats.m_wal_bytes = walusage.wal_bytes;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
-
-	return true;
 }
 
 /* ----------
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index aa8eceda5f..9de29e51d6 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,26 +16,41 @@
 #include "portability/instr_time.h"
 
 
+/*
+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is that there may be many
+ * callers to use them, so to avoid unexpected reset.
+ */
 typedef struct BufferUsage
 {
-	long		shared_blks_hit;	/* # of shared buffer hits */
-	long		shared_blks_read;	/* # of shared disk blocks read */
-	long		shared_blks_dirtied;	/* # of shared blocks dirtied */
-	long		shared_blks_written;	/* # of shared disk blocks written */
-	long		local_blks_hit; /* # of local buffer hits */
-	long		local_blks_read;	/* # of local disk blocks read */
-	long		local_blks_dirtied; /* # of local blocks dirtied */
-	long		local_blks_written; /* # of local disk blocks written */
-	long		temp_blks_read; /* # of temp blocks read */
-	long		temp_blks_written;	/* # of temp blocks written */
+	uint64		shared_blks_hit;	/* # of shared buffer hits */
+	uint64		shared_blks_read;	/* # of shared disk blocks read */
+	uint64		shared_blks_dirtied;	/* # of shared blocks dirtied */
+	uint64		shared_blks_written;	/* # of shared disk blocks written */
+	uint64		local_blks_hit; /* # of local buffer hits */
+	uint64		local_blks_read;	/* # of local disk blocks read */
+	uint64		local_blks_dirtied; /* # of local blocks dirtied */
+	uint64		local_blks_written; /* # of local disk blocks written */
+	uint64		temp_blks_read; /* # of temp blocks read */
+	uint64		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
 
+/*
+ * The accumulated counters for generated WAL usage.
+ *
+ * The reason the counters are accumulated values is the same as BufferUsage's one.
+ * And the reason to store only generated WAL usage and doesn't store WAL I/O
+ * activity, is that this is assumed for knowing the WAL usage in per query or
+ * transaction. So, common resources for the cluster like WAL I/O activity is
+ * not stored.
+ */
 typedef struct WalUsage
 {
-	long		wal_records;	/* # of WAL records produced */
-	long		wal_fpi;		/* # of WAL full page images produced */
+	uint64		wal_records;	/* # of WAL records produced */
+	uint64		wal_fpi;		/* # of WAL full page images produced */
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5c5920b0b5..bd7ce9ad99 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -503,8 +503,8 @@ typedef struct PgStat_MsgBgWriter
 typedef struct PgStat_MsgWal
 {
 	PgStat_MsgHdr m_hdr;
-	PgStat_Counter m_wal_records;
-	PgStat_Counter m_wal_fpi;
+	uint64		m_wal_records;
+	uint64		m_wal_fpi;
 	uint64		m_wal_bytes;
 	PgStat_Counter m_wal_buffers_full;
 	PgStat_Counter m_wal_write;
@@ -1114,8 +1114,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

#20

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#19)

Re: wal stats questions

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#21

andres@anarazel.de

over 4 years ago

In reply to: Fujii Masao (#20)

Re: wal stats questions

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Why long? It's of a platform dependent size, which doesn't seem useful here?

Andres

#22

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: Andres Freund (#21)

Re: wal stats questions

On 2021/04/23 0:36, Andres Freund wrote:

Hi

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Why long? It's of a platform dependent size, which doesn't seem useful here?

I think it's ok to unify uint64. Although it's better to use small size for
cache, the idea works well for only some platform which use 4bytes for "long".

(Although I'm not familiar with the standardization...)
It seems that we need to adopt unsinged long if use "long", which may be 4bytes.

I though WalUsageAccumDiff() seems to return the right value too. But, I
researched deeply and found that ISO/IEC 9899:1999 defines unsinged integer
never overflow(2.6.5 Types 9th section) although it doesn't define overflow of
signed integer type.

If my understanding is right, the definition only guarantee
WalUsageAccumDiff() returns the right value only if the type is unsigned
integer. So, it's safe to change "long" to "unsigned long".

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Import Notes

Resolved by subject fallback

#23

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#22)

Re: wal stats questions

On 2021/04/23 9:26, Masahiro Ikeda wrote:

On 2021/04/23 0:36, Andres Freund wrote:

Hi

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Why long? It's of a platform dependent size, which doesn't seem useful here?

I'm not sure why "long" was chosen for the counters in BufferUsage.
And then I guess that maybe we didn't change that because using "long"
for them caused no actual issue in practice.

I think it's ok to unify uint64. Although it's better to use small size for
cache, the idea works well for only some platform which use 4bytes for "long".

(Although I'm not familiar with the standardization...)
It seems that we need to adopt unsinged long if use "long", which may be 4bytes.

I though WalUsageAccumDiff() seems to return the right value too. But, I
researched deeply and found that ISO/IEC 9899:1999 defines unsinged integer
never overflow(2.6.5 Types 9th section) although it doesn't define overflow of
signed integer type.

If my understanding is right, the definition only guarantee
WalUsageAccumDiff() returns the right value only if the type is unsigned
integer. So, it's safe to change "long" to "unsigned long".

Yes, we can change the counters so they use uint64. But if we do that,
ISTM that we need to change the code more than your patch does.
For example, even with the patch, pg_stat_statements uses Int64GetDatumFast()
to report the counter like shared_blks_hit, but this should be changed?
For example, "%ld" should be changed to "%llu" at the following code in
vacuumlazy.c? I think that we should check all codes that use the counters
whose types are changed to uint64.

appendStringInfo(&buf,
_("WAL usage: %ld records, %ld full page images, %llu bytes"),
walusage.wal_records,
walusage.wal_fpi,
(unsigned long long) walusage.wal_bytes);

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#24

andres@anarazel.de

over 4 years ago

In reply to: Masahiro Ikeda (#22)

Re: wal stats questions

Hi,

On 2021-04-23 09:26:17 +0900, Masahiro Ikeda wrote:

On 2021/04/23 0:36, Andres Freund wrote:

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Why long? It's of a platform dependent size, which doesn't seem useful here?

I think it's ok to unify uint64. Although it's better to use small size for
cache, the idea works well for only some platform which use 4bytes for "long".

(Although I'm not familiar with the standardization...)
It seems that we need to adopt unsinged long if use "long", which may be 4bytes.

I though WalUsageAccumDiff() seems to return the right value too. But, I
researched deeply and found that ISO/IEC 9899:1999 defines unsinged integer
never overflow(2.6.5 Types 9th section) although it doesn't define overflow of
signed integer type.

If my understanding is right, the definition only guarantee
WalUsageAccumDiff() returns the right value only if the type is unsigned
integer. So, it's safe to change "long" to "unsigned long".

I think this should just use 64bit counters. Emitting more than 4
billion records in one transaction isn't actually particularly rare. And
obviously WalUsageAccumDiff() can't do anything about that, once
overflowed it overflowed.

Greetings,

Andres Freund

#25

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Andres Freund (#24)

Re: wal stats questions

On 2021/04/23 10:25, Andres Freund wrote:

Hi,

On 2021-04-23 09:26:17 +0900, Masahiro Ikeda wrote:

On 2021/04/23 0:36, Andres Freund wrote:

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Why long? It's of a platform dependent size, which doesn't seem useful here?

I think it's ok to unify uint64. Although it's better to use small size for
cache, the idea works well for only some platform which use 4bytes for "long".

(Although I'm not familiar with the standardization...)
It seems that we need to adopt unsinged long if use "long", which may be 4bytes.

I though WalUsageAccumDiff() seems to return the right value too. But, I
researched deeply and found that ISO/IEC 9899:1999 defines unsinged integer
never overflow(2.6.5 Types 9th section) although it doesn't define overflow of
signed integer type.

If my understanding is right, the definition only guarantee
WalUsageAccumDiff() returns the right value only if the type is unsigned
integer. So, it's safe to change "long" to "unsigned long".

I think this should just use 64bit counters. Emitting more than 4
billion records in one transaction isn't actually particularly rare. And

Right. I agree to change the types of the counters to int64.

I think that it's better to make this change as a separate patch from
the changes for pg_stat_wal view.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#26

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: Fujii Masao (#25)

2 attachment(s)

Re: wal stats questions

On 2021/04/23 16:30, Fujii Masao wrote:

On 2021/04/23 10:25, Andres Freund wrote:

Hi,

On 2021-04-23 09:26:17 +0900, Masahiro Ikeda wrote:

On 2021/04/23 0:36, Andres Freund wrote:

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64
because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get
overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Why long? It's of a platform dependent size, which doesn't seem useful here?

I think it's ok to unify uint64. Although it's better to use small size for
cache, the idea works well for only some platform which use 4bytes for "long".

(Although I'm not familiar with the standardization...)
It seems that we need to adopt unsinged long if use "long", which may be
4bytes.

I though WalUsageAccumDiff() seems to return the right value too. But, I
researched deeply and found that ISO/IEC 9899:1999 defines unsinged integer
never overflow(2.6.5 Types 9th section) although it doesn't define overflow of
signed integer type.

If my understanding is right, the definition only guarantee
WalUsageAccumDiff() returns the right value only if the type is unsigned
integer. So, it's safe to change "long" to "unsigned long".

I think this should just use 64bit counters. Emitting more than 4
billion records in one transaction isn't actually particularly rare. And

Right. I agree to change the types of the counters to int64.

I think that it's better to make this change as a separate patch from
the changes for pg_stat_wal view.

Thanks for your comments.
OK, I separate two patches.

First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")

Second one has the changes for the type of the BufferUsage's and WalUsage's
members. I change the type from long to int64. Is it better to make new thread?
("v6-0002-change-the-data-type-of-XXXUsage-from-long-to-int64.patch")

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v6-0002-change-the-data-type-of-XXXUsage-from-long-to-int64.patchtext/x-patch; charset=UTF-8; name=v6-0002-change-the-data-type-of-XXXUsage-from-long-to-int64.patchDownload

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c3fc12d76c..ad25f72085 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -830,9 +830,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 			}
 			appendStringInfo(&buf, _("system usage: %s\n"), pg_rusage_show(&ru0));
 			appendStringInfo(&buf,
-							 _("WAL usage: %ld records, %ld full page images, %llu bytes"),
-							 walusage.wal_records,
-							 walusage.wal_fpi,
+							 _("WAL usage: %lld records, %lld full page images, %llu bytes"),
+							 (long long) walusage.wal_records,
+							 (long long) walusage.wal_fpi,
 							 (unsigned long long) walusage.wal_bytes);
 
 			ereport(LOG,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 3d5198e234..4135d88b1a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3519,17 +3519,17 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			{
 				appendStringInfoString(es->str, " shared");
 				if (usage->shared_blks_hit > 0)
-					appendStringInfo(es->str, " hit=%ld",
-									 usage->shared_blks_hit);
+					appendStringInfo(es->str, " hit=%lld",
+									 (long long) usage->shared_blks_hit);
 				if (usage->shared_blks_read > 0)
-					appendStringInfo(es->str, " read=%ld",
-									 usage->shared_blks_read);
+					appendStringInfo(es->str, " read=%lld",
+									 (long long) usage->shared_blks_read);
 				if (usage->shared_blks_dirtied > 0)
-					appendStringInfo(es->str, " dirtied=%ld",
-									 usage->shared_blks_dirtied);
+					appendStringInfo(es->str, " dirtied=%lld",
+									 (long long) usage->shared_blks_dirtied);
 				if (usage->shared_blks_written > 0)
-					appendStringInfo(es->str, " written=%ld",
-									 usage->shared_blks_written);
+					appendStringInfo(es->str, " written=%lld",
+									 (long long) usage->shared_blks_written);
 				if (has_local || has_temp)
 					appendStringInfoChar(es->str, ',');
 			}
@@ -3537,17 +3537,17 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			{
 				appendStringInfoString(es->str, " local");
 				if (usage->local_blks_hit > 0)
-					appendStringInfo(es->str, " hit=%ld",
-									 usage->local_blks_hit);
+					appendStringInfo(es->str, " hit=%lld",
+									 (long long) usage->local_blks_hit);
 				if (usage->local_blks_read > 0)
-					appendStringInfo(es->str, " read=%ld",
-									 usage->local_blks_read);
+					appendStringInfo(es->str, " read=%lld",
+									 (long long) usage->local_blks_read);
 				if (usage->local_blks_dirtied > 0)
-					appendStringInfo(es->str, " dirtied=%ld",
-									 usage->local_blks_dirtied);
+					appendStringInfo(es->str, " dirtied=%lld",
+									 (long long) usage->local_blks_dirtied);
 				if (usage->local_blks_written > 0)
-					appendStringInfo(es->str, " written=%ld",
-									 usage->local_blks_written);
+					appendStringInfo(es->str, " written=%lld",
+									 (long long) usage->local_blks_written);
 				if (has_temp)
 					appendStringInfoChar(es->str, ',');
 			}
@@ -3555,11 +3555,11 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
 			{
 				appendStringInfoString(es->str, " temp");
 				if (usage->temp_blks_read > 0)
-					appendStringInfo(es->str, " read=%ld",
-									 usage->temp_blks_read);
+					appendStringInfo(es->str, " read=%lld",
+									 (long long) usage->temp_blks_read);
 				if (usage->temp_blks_written > 0)
-					appendStringInfo(es->str, " written=%ld",
-									 usage->temp_blks_written);
+					appendStringInfo(es->str, " written=%lld",
+									 (long long) usage->temp_blks_written);
 			}
 			appendStringInfoChar(es->str, '\n');
 		}
@@ -3631,11 +3631,11 @@ show_wal_usage(ExplainState *es, const WalUsage *usage)
 			appendStringInfoString(es->str, "WAL:");
 
 			if (usage->wal_records > 0)
-				appendStringInfo(es->str, " records=%ld",
-								 usage->wal_records);
+				appendStringInfo(es->str, " records=%lld",
+								 (long long) usage->wal_records);
 			if (usage->wal_fpi > 0)
-				appendStringInfo(es->str, " fpi=%ld",
-								 usage->wal_fpi);
+				appendStringInfo(es->str, " fpi=%lld",
+								 (long long) usage->wal_fpi);
 			if (usage->wal_bytes > 0)
 				appendStringInfo(es->str, " bytes=" UINT64_FORMAT,
 								 usage->wal_bytes);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9a0d794a1c..2e11ab6d4b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -24,16 +24,16 @@
  */
 typedef struct BufferUsage
 {
-	long		shared_blks_hit;	/* # of shared buffer hits */
-	long		shared_blks_read;	/* # of shared disk blocks read */
-	long		shared_blks_dirtied;	/* # of shared blocks dirtied */
-	long		shared_blks_written;	/* # of shared disk blocks written */
-	long		local_blks_hit; /* # of local buffer hits */
-	long		local_blks_read;	/* # of local disk blocks read */
-	long		local_blks_dirtied; /* # of local blocks dirtied */
-	long		local_blks_written; /* # of local disk blocks written */
-	long		temp_blks_read; /* # of temp blocks read */
-	long		temp_blks_written;	/* # of temp blocks written */
+	int64		shared_blks_hit;	/* # of shared buffer hits */
+	int64		shared_blks_read;	/* # of shared disk blocks read */
+	int64		shared_blks_dirtied;	/* # of shared blocks dirtied */
+	int64		shared_blks_written;	/* # of shared disk blocks written */
+	int64		local_blks_hit; /* # of local buffer hits */
+	int64		local_blks_read;	/* # of local disk blocks read */
+	int64		local_blks_dirtied; /* # of local blocks dirtied */
+	int64		local_blks_written; /* # of local disk blocks written */
+	int64		temp_blks_read; /* # of temp blocks read */
+	int64		temp_blks_written;	/* # of temp blocks written */
 	instr_time	blk_read_time;	/* time spent reading */
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
@@ -49,8 +49,8 @@ typedef struct BufferUsage
  */
 typedef struct WalUsage
 {
-	long		wal_records;	/* # of WAL records produced */
-	long		wal_fpi;		/* # of WAL full page images produced */
+	int64		wal_records;	/* # of WAL records produced */
+	int64		wal_fpi;		/* # of WAL full page images produced */
 	uint64		wal_bytes;		/* size of WAL records produced */
 } WalUsage;

v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload

diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..75ecd00c23 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,12 @@
 
 #include "executor/instrument.h"
 
+/*
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
+ */
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a459..1761694a5b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 6e8dee9784..f7f28673f8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,8 +133,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -855,9 +855,19 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the wal record is generated or not. If a read transaction
+	 * is executed, wal records aren't nomally generated (although HOT makes
+	 * wal records). But, just writes and syncs the wal data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		walStats.wal_write == 0 && walStats.wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -951,7 +961,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2934,7 +2944,7 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -3046,44 +3056,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -3091,24 +3063,35 @@ pgstat_report_wal(void)
  *
  * If 'force' is not set, WAL stats message is only sent if enough time has
  * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
- *
- * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().
+	 *
+	 * The current 'wal_records' is the same as the previous one means that
+	 * wal records weren't generated. So, the counters of 'wal_fpi',
+	 * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+	 *
+	 * It's not enough to check the number of generated wal records, for
+	 * example the walwriter may write/sync the WAL although it doesn't
+	 * generate wal records. 'wal_write' and 'wal_sync' are zero means the
+	 * counters of time spent are zero too.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		walStats.wal_write == 0 && walStats.wal_sync == 0)
+		return;
 
 	if (!force)
 	{
@@ -3116,25 +3099,50 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * Set the counters related to generated WAL data if the counters are
+	 * updated.
+	 */
+	if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters are increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+
+		WalStats.m_wal_records = walusage.wal_records;
+		WalStats.m_wal_fpi = walusage.wal_fpi;
+		WalStats.m_wal_bytes = walusage.wal_bytes;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
-
-	return true;
 }
 
 /* ----------
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index aa8eceda5f..9a0d794a1c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,6 +16,12 @@
 #include "portability/instr_time.h"
 
 
+/*
+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.
+ */
 typedef struct BufferUsage
 {
 	long		shared_blks_hit;	/* # of shared buffer hits */
@@ -32,6 +38,15 @@ typedef struct BufferUsage
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
 
+/*
+ * The accumulated counters for generated WAL usage.
+ *
+ * The reason the counters are accumulated values is the same as BufferUsage's one.
+ * And the reason to store only generated WAL usage and doesn't store WAL I/O
+ * activity, is that this is assumed for knowing the WAL usage in per query or
+ * transaction. So, common resources for the cluster like WAL I/O activity is
+ * not stored.
+ */
 typedef struct WalUsage
 {
 	long		wal_records;	/* # of WAL records produced */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5c5920b0b5..8e048b9172 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1114,8 +1114,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
 extern void pgstat_send_recoveryprefetch(PgStat_RecoveryPrefetchStats *stats);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

#27

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#26)

Re: wal stats questions

On 2021/04/26 10:11, Masahiro Ikeda wrote:

On 2021/04/23 16:30, Fujii Masao wrote:

On 2021/04/23 10:25, Andres Freund wrote:

Hi,

On 2021-04-23 09:26:17 +0900, Masahiro Ikeda wrote:

On 2021/04/23 0:36, Andres Freund wrote:

On Thu, Apr 22, 2021, at 06:42, Fujii Masao wrote:

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64
because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get
overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Why long? It's of a platform dependent size, which doesn't seem useful here?

I think it's ok to unify uint64. Although it's better to use small size for
cache, the idea works well for only some platform which use 4bytes for "long".

(Although I'm not familiar with the standardization...)
It seems that we need to adopt unsinged long if use "long", which may be
4bytes.

I though WalUsageAccumDiff() seems to return the right value too. But, I
researched deeply and found that ISO/IEC 9899:1999 defines unsinged integer
never overflow(2.6.5 Types 9th section) although it doesn't define overflow of
signed integer type.

If my understanding is right, the definition only guarantee
WalUsageAccumDiff() returns the right value only if the type is unsigned
integer. So, it's safe to change "long" to "unsigned long".

I think this should just use 64bit counters. Emitting more than 4
billion records in one transaction isn't actually particularly rare. And

Right. I agree to change the types of the counters to int64.

I think that it's better to make this change as a separate patch from
the changes for pg_stat_wal view.

Thanks for your comments.
OK, I separate two patches.

Thanks!

First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")

+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		walStats.wal_write == 0 && walStats.wal_sync == 0 &&

WalStats.m_wal_write should be checked here instead of walStats.wal_write?

Is there really the case where the number of sync is larger than zero when
the number of writes is zero? If not, it's enough to check only the number
of writes?

+	 * wal records weren't generated. So, the counters of 'wal_fpi',
+	 * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.

It's better to add the assertion check that confirms
m_wal_buffers_full == 0 whenever wal_records is larger than zero?

Second one has the changes for the type of the BufferUsage's and WalUsage's
members. I change the type from long to int64. Is it better to make new thread?
("v6-0002-change-the-data-type-of-XXXUsage-from-long-to-int64.patch")

Will review the patch later. I'm ok to discuss that in this thread.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#28

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: Fujii Masao (#27)

Re: wal stats questions

On 2021/04/27 21:56, Fujii Masao wrote:

On 2021/04/26 10:11, Masahiro Ikeda wrote:

First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")
+        pgWalUsage.wal_records == prevWalUsage.wal_records &&
+        walStats.wal_write == 0 && walStats.wal_sync == 0 &&
WalStats.m_wal_write should be checked here instead of walStats.wal_write?

Thanks! Yes, I'll fix it.

Is there really the case where the number of sync is larger than zero when
the number of writes is zero? If not, it's enough to check only the number
of writes?

I thought that there is the case if "wal_sync_method" is fdatasync, fsync or
fsync_writethrough. The example case is following.

(1) backend-1 writes the wal data because wal buffer has no space. But, it
doesn't sync the wal data.
(2) backend-2 reads data pages. In the execution, it need to write and sync
the wal because dirty pages is selected as victim pages. backend-2 need to
only sync the wal data because the wal data were already written by backend-1,
but they weren't synced.

I'm ok to change it since it's rare case.

+     * wal records weren't generated. So, the counters of 'wal_fpi',
+     * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
It's better to add the assertion check that confirms
m_wal_buffers_full == 0 whenever wal_records is larger than zero?

Sorry, I couldn't understand yet. I thought that m_wal_buffers_full can be
larger than 0 if wal_records > 0.

Do you suggest that the following assertion is needed?

-       if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-               return false;
+       if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+               WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
+       {
+               Assert(pgWalUsage.wal_fpi == 0 && pgWalUsage.wal_bytes &&
+                               WalStats.m_wal_buffers_full == 0 &&
WalStats.m_wal_write_time == 0 &&
+                               WalStats.m_wal_sync_time == 0);
+               return;
+       }

Second one has the changes for the type of the BufferUsage's and WalUsage's
members. I change the type from long to int64. Is it better to make new thread?
("v6-0002-change-the-data-type-of-XXXUsage-from-long-to-int64.patch")

Will review the patch later. I'm ok to discuss that in this thread.

Thanks!

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#29

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#28)

Re: wal stats questions

On 2021/04/28 9:10, Masahiro Ikeda wrote:

On 2021/04/27 21:56, Fujii Masao wrote:
On 2021/04/26 10:11, Masahiro Ikeda wrote:

First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")
+        pgWalUsage.wal_records == prevWalUsage.wal_records &&
+        walStats.wal_write == 0 && walStats.wal_sync == 0 &&
WalStats.m_wal_write should be checked here instead of walStats.wal_write?
Thanks! Yes, I'll fix it.

Thanks!

Is there really the case where the number of sync is larger than zero when
the number of writes is zero? If not, it's enough to check only the number
of writes?

I thought that there is the case if "wal_sync_method" is fdatasync, fsync or
fsync_writethrough. The example case is following.

(1) backend-1 writes the wal data because wal buffer has no space. But, it
doesn't sync the wal data.
(2) backend-2 reads data pages. In the execution, it need to write and sync
the wal because dirty pages is selected as victim pages. backend-2 need to
only sync the wal data because the wal data were already written by backend-1,
but they weren't synced.

You're right. So let's leave the check of "m_wal_sync == 0".

I'm ok to change it since it's rare case.
+     * wal records weren't generated. So, the counters of 'wal_fpi',
+     * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
It's better to add the assertion check that confirms
m_wal_buffers_full == 0 whenever wal_records is larger than zero?
Sorry, I couldn't understand yet. I thought that m_wal_buffers_full can be
larger than 0 if wal_records > 0.

Do you suggest that the following assertion is needed?
-       if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-               return false;
+       if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+               WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
+       {
+               Assert(pgWalUsage.wal_fpi == 0 && pgWalUsage.wal_bytes &&
+                               WalStats.m_wal_buffers_full == 0 &&
WalStats.m_wal_write_time == 0 &&
+                               WalStats.m_wal_sync_time == 0);
+               return;
+       }

I was thinking to add the "Assert(WalStats.m_wal_buffers_full)" as a safe-guard
because only m_wal_buffers_full is incremented in different places where
wal_records, m_wal_write and m_wal_sync are incremented.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#30

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#28)

Re: wal stats questions

On 2021/04/28 9:10, Masahiro Ikeda wrote:

Second one has the changes for the type of the BufferUsage's and WalUsage's
members. I change the type from long to int64. Is it better to make new thread?
("v6-0002-change-the-data-type-of-XXXUsage-from-long-to-int64.patch")

Will review the patch later. I'm ok to discuss that in this thread.

Thanks!

0002 patch looks good to me.
I think we can commit this at first. Barring any objection, I will do that.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#31

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: Fujii Masao (#29)

1 attachment(s)

Re: wal stats questions

On 2021/05/11 16:44, Fujii Masao wrote:

On 2021/04/28 9:10, Masahiro Ikeda wrote:
On 2021/04/27 21:56, Fujii Masao wrote:
On 2021/04/26 10:11, Masahiro Ikeda wrote:

First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")
+        pgWalUsage.wal_records == prevWalUsage.wal_records &&
+        walStats.wal_write == 0 && walStats.wal_sync == 0 &&
WalStats.m_wal_write should be checked here instead of walStats.wal_write?
Thanks! Yes, I'll fix it.
Thanks!

Thanks for your comments!
I fixed them.

Is there really the case where the number of sync is larger than zero when
the number of writes is zero? If not, it's enough to check only the number
of writes?

I thought that there is the case if "wal_sync_method" is fdatasync, fsync or
fsync_writethrough. The example case is following.

(1) backend-1 writes the wal data because wal buffer has no space. But, it
doesn't sync the wal data.
(2) backend-2 reads data pages. In the execution, it need to write and sync
the wal because dirty pages is selected as victim pages. backend-2 need to
only sync the wal data because the wal data were already written by backend-1,
but they weren't synced.

You're right. So let's leave the check of "m_wal_sync == 0".

OK.

+     * wal records weren't generated. So, the counters of 'wal_fpi',
+     * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
It's better to add the assertion check that confirms
m_wal_buffers_full == 0 whenever wal_records is larger than zero?
Sorry, I couldn't understand yet. I thought that m_wal_buffers_full can be
larger than 0 if wal_records > 0.

Do you suggest that the following assertion is needed?
-       if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-               return false;
+       if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+               WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
+       {
+               Assert(pgWalUsage.wal_fpi == 0 && pgWalUsage.wal_bytes &&
+                               WalStats.m_wal_buffers_full == 0 &&
WalStats.m_wal_write_time == 0 &&
+                               WalStats.m_wal_sync_time == 0);
+               return;
+       }
I was thinking to add the "Assert(WalStats.m_wal_buffers_full)" as a safe-guard
because only m_wal_buffers_full is incremented in different places where
wal_records, m_wal_write and m_wal_sync are incremented.

Understood. I added the assertion for m_wal_buffers_full only.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v7-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v7-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload

diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 237e13361b..75ecd00c23 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,12 @@
 
 #include "executor/instrument.h"
 
+/*
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
+ */
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a459..1761694a5b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e94f5f55c7..1a1fcc55be 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,8 +133,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -852,9 +852,19 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the wal record is generated or not. If a read transaction
+	 * is executed, wal records aren't nomally generated (although HOT makes
+	 * wal records). But, just writes and syncs the wal data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -948,7 +958,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2918,7 +2928,7 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -3030,44 +3040,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -3075,24 +3047,38 @@ pgstat_report_wal(void)
  *
  * If 'force' is not set, WAL stats message is only sent if enough time has
  * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
- *
- * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().
+	 *
+	 * The current 'wal_records' is the same as the previous one means that
+	 * wal records weren't generated. So, the counters of 'wal_fpi',
+	 * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+	 *
+	 * It's not enough to check the number of generated wal records, for
+	 * example the walwriter may write/sync the WAL although it doesn't
+	 * generate wal records. 'm_wal_write' and 'm_wal_sync' are zero means the
+	 * counters of time spent are zero too.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
+	{
+		Assert(WalStats.m_wal_buffers_full == 0);
+		return;
+	}
 
 	if (!force)
 	{
@@ -3100,25 +3086,50 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * Set the counters related to generated WAL data if the counters were
+	 * updated.
+	 */
+	if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters were increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+
+		WalStats.m_wal_records = walusage.wal_records;
+		WalStats.m_wal_fpi = walusage.wal_fpi;
+		WalStats.m_wal_bytes = walusage.wal_bytes;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
 	pgstat_setheader(&WalStats.m_hdr, PGSTAT_MTYPE_WAL);
 	pgstat_send(&WalStats, sizeof(WalStats));
 
+	/*
+	 * Save the current counters for the subsequent calculation of WAL usage.
+	 */
+	prevWalUsage = pgWalUsage;
+
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
-
-	return true;
 }
 
 /* ----------
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index aa8eceda5f..9a0d794a1c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,6 +16,12 @@
 #include "portability/instr_time.h"
 
 
+/*
+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.
+ */
 typedef struct BufferUsage
 {
 	long		shared_blks_hit;	/* # of shared buffer hits */
@@ -32,6 +38,15 @@ typedef struct BufferUsage
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
 
+/*
+ * The accumulated counters for generated WAL usage.
+ *
+ * The reason the counters are accumulated values is the same as BufferUsage's one.
+ * And the reason to store only generated WAL usage and doesn't store WAL I/O
+ * activity, is that this is assumed for knowing the WAL usage in per query or
+ * transaction. So, common resources for the cluster like WAL I/O activity is
+ * not stored.
+ */
 typedef struct WalUsage
 {
 	long		wal_records;	/* # of WAL records produced */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 72ff4a06d6..7727e83455 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1091,8 +1091,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

#32

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#31)

Re: wal stats questions

On 2021/05/11 18:46, Masahiro Ikeda wrote:

On 2021/05/11 16:44, Fujii Masao wrote:
On 2021/04/28 9:10, Masahiro Ikeda wrote:
On 2021/04/27 21:56, Fujii Masao wrote:
On 2021/04/26 10:11, Masahiro Ikeda wrote:

First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")
+        pgWalUsage.wal_records == prevWalUsage.wal_records &&
+        walStats.wal_write == 0 && walStats.wal_sync == 0 &&
WalStats.m_wal_write should be checked here instead of walStats.wal_write?
Thanks! Yes, I'll fix it.
Thanks!
Thanks for your comments!
I fixed them.

Thanks for updating the patch!

  	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
  		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&

I'm just wondering if the above WAL activity counters need to be checked.
Maybe it's not necessary because "pgStatXactCommit == 0 && pgStatXactRollback == 0"
is checked? IOW, is there really the case where WAL activity counters are updated
even when both pgStatXactCommit and pgStatXactRollback are zero?

+	if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters were increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

Isn't it better to move the code "prevWalUsage = pgWalUsage" into here?
Because it's necessary only when pgWalUsage.wal_records != prevWalUsage.wal_records.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#33

/messages/by-id/20210330.172843.267174731834281075.horikyota.ntt@gmail.com

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: Fujii Masao (#32)

1 attachment(s)

Re: wal stats questions

On 2021/05/12 19:19, Fujii Masao wrote:

On 2021/05/11 18:46, Masahiro Ikeda wrote:
On 2021/05/11 16:44, Fujii Masao wrote:
On 2021/04/28 9:10, Masahiro Ikeda wrote:
On 2021/04/27 21:56, Fujii Masao wrote:
On 2021/04/26 10:11, Masahiro Ikeda wrote:

First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")
+        pgWalUsage.wal_records == prevWalUsage.wal_records &&
+        walStats.wal_write == 0 && walStats.wal_sync == 0 &&
WalStats.m_wal_write should be checked here instead of walStats.wal_write?
Thanks! Yes, I'll fix it.
Thanks!
Thanks for your comments!
I fixed them.
Thanks for updating the patch!
     if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+        pgWalUsage.wal_records == prevWalUsage.wal_records &&
+        WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
I'm just wondering if the above WAL activity counters need to be checked.
Maybe it's not necessary because "pgStatXactCommit == 0 && pgStatXactRollback
== 0"
is checked? IOW, is there really the case where WAL activity counters are updated
even when both pgStatXactCommit and pgStatXactRollback are zero?

Thanks for checking.

I haven't found the case yet. But, I added the condition because there is a
discussion that it's safer.

(It seems the mail thread chain is broken, Sorry...)
I copy the discussion at the time below.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

I removed the checking code whether the wal stats counters were updated or not
in pgstat_report_stat() since I couldn't understand the valid case yet.
pgstat_report_stat() is called by backends when the transaction is finished.
This means that the condition seems always pass.

Doesn't the same holds for all other counters? If you are saying that
"wal counters should be zero if all other stats counters are zero", we
need an assertion to check that and a comment to explain that.

I personally find it safer to add the WAL-stats condition to the
fast-return check, rather than addin such assertion.

+    if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+    {
+        WalUsage    walusage;
+
+        /*
+         * Calculate how much WAL usage counters were increased by substracting
+         * the previous counters from the current ones. Fill the results in
+         * WAL stats message.
+         */
+        MemSet(&walusage, 0, sizeof(WalUsage));
+        WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

Isn't it better to move the code "prevWalUsage = pgWalUsage" into here?
Because it's necessary only when pgWalUsage.wal_records !=
prevWalUsage.wal_records.

Yes, I fixed it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v8-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v8-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload

diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2b106d8473..355778028e 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,12 @@
 
 #include "executor/instrument.h"
 
+/*
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
+ */
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e7e6a2a459..1761694a5b 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e94f5f55c7..b65bcd5c9b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,8 +133,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -852,9 +852,19 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the wal record is generated or not. If a read transaction
+	 * is executed, wal records aren't nomally generated (although HOT makes
+	 * wal records). But, just writes and syncs the wal data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -948,7 +958,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2918,7 +2928,7 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -3030,44 +3040,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -3075,24 +3047,38 @@ pgstat_report_wal(void)
  *
  * If 'force' is not set, WAL stats message is only sent if enough time has
  * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
- *
- * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().
+	 *
+	 * The current 'wal_records' is the same as the previous one means that
+	 * wal records weren't generated. So, the counters of 'wal_fpi',
+	 * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+	 *
+	 * It's not enough to check the number of generated wal records, for
+	 * example the walwriter may write/sync the WAL although it doesn't
+	 * generate wal records. 'm_wal_write' and 'm_wal_sync' are zero means the
+	 * counters of time spent are zero too.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
+	{
+		Assert(WalStats.m_wal_buffers_full == 0);
+		return;
+	}
 
 	if (!force)
 	{
@@ -3100,13 +3086,40 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * Set the counters related to generated WAL data if the counters were
+	 * updated.
+	 */
+	if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters were increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+
+		WalStats.m_wal_records = walusage.wal_records;
+		WalStats.m_wal_fpi = walusage.wal_fpi;
+		WalStats.m_wal_bytes = walusage.wal_bytes;
+
+		/*
+		 * Save the current counters for the subsequent calculation of WAL usage.
+		 */
+		prevWalUsage = pgWalUsage;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
@@ -3117,8 +3130,6 @@ pgstat_send_wal(bool force)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
-
-	return true;
 }
 
 /* ----------
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index fc87eed4fb..31dbfc832a 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,6 +16,12 @@
 #include "portability/instr_time.h"
 
 
+/*
+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.
+ */
 typedef struct BufferUsage
 {
 	int64		shared_blks_hit;	/* # of shared buffer hits */
@@ -32,6 +38,15 @@ typedef struct BufferUsage
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
 
+/*
+ * The accumulated counters for generated WAL usage.
+ *
+ * The reason the counters are accumulated values is the same as BufferUsage's one.
+ * And the reason to store only generated WAL usage and doesn't store WAL I/O
+ * activity, is that this is assumed for knowing the WAL usage in per query or
+ * transaction. So, common resources for the cluster like WAL I/O activity is
+ * not stored.
+ */
 typedef struct WalUsage
 {
 	int64		wal_records;	/* # of WAL records produced */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 72ff4a06d6..7727e83455 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1091,8 +1091,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

#34

torikoshia

torikoshia@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#33)

Re: wal stats questions

On 2021-05-13 09:05, Masahiro Ikeda wrote:

On 2021/05/12 19:19, Fujii Masao wrote:
On 2021/05/11 18:46, Masahiro Ikeda wrote:
On 2021/05/11 16:44, Fujii Masao wrote:
On 2021/04/28 9:10, Masahiro Ikeda wrote:
On 2021/04/27 21:56, Fujii Masao wrote:
On 2021/04/26 10:11, Masahiro Ikeda wrote:

First patch has only the changes for pg_stat_wal view.
("v6-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch")
+        pgWalUsage.wal_records == prevWalUsage.wal_records &&
+        walStats.wal_write == 0 && walStats.wal_sync == 0 &&
WalStats.m_wal_write should be checked here instead of
walStats.wal_write?
Thanks! Yes, I'll fix it.
Thanks!
Thanks for your comments!
I fixed them.
Thanks for updating the patch!
     if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
         pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+        pgWalUsage.wal_records == prevWalUsage.wal_records &&
+        WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
I'm just wondering if the above WAL activity counters need to be
checked.
Maybe it's not necessary because "pgStatXactCommit == 0 &&
pgStatXactRollback
== 0"
is checked? IOW, is there really the case where WAL activity counters
are updated
even when both pgStatXactCommit and pgStatXactRollback are zero?
Thanks for checking.

I haven't found the case yet. But, I added the condition because there
is a
discussion that it's safer.

(It seems the mail thread chain is broken, Sorry...)
I copy the discussion at the time below.

/messages/by-id/20210330.172843.267174731834281075.horikyota.ntt@gmail.com

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal))
== 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too.
Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the
"Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one
of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not
in
pgstat_report_stat() using not only generated wal record but also
write/sync
counters, and it can skip to call reporting function.

I removed the checking code whether the wal stats counters were
updated or not
in pgstat_report_stat() since I couldn't understand the valid case
yet.
pgstat_report_stat() is called by backends when the transaction is
finished.
This means that the condition seems always pass.

Doesn't the same holds for all other counters? If you are saying that
"wal counters should be zero if all other stats counters are zero", we
need an assertion to check that and a comment to explain that.

I personally find it safer to add the WAL-stats condition to the
fast-return check, rather than addin such assertion.
+    if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+    {
+        WalUsage    walusage;
+
+        /*
+         * Calculate how much WAL usage counters were increased by 
substracting
+         * the previous counters from the current ones. Fill the 
results in
+         * WAL stats message.
+         */
+        MemSet(&walusage, 0, sizeof(WalUsage));
+        WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
Isn't it better to move the code "prevWalUsage = pgWalUsage" into
here?
Because it's necessary only when pgWalUsage.wal_records !=
prevWalUsage.wal_records.
Yes, I fixed it.

Regards,

Thanks for updating the patch!

+ * is executed, wal records aren't nomally generated (although HOT
makes

nomally -> normally?

+	 * It's not enough to check the number of generated wal records, for
+	 * example the walwriter may write/sync the WAL although it doesn't

You use both 'wal' and 'WAL' in the comments, but are there any
intension?

Regards,

#35

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: torikoshia (#34)

1 attachment(s)

Re: wal stats questions

Thanks for your comments!

+ * is executed, wal records aren't nomally generated (although HOT makes

nomally -> normally?

Yes, fixed.

+     * It's not enough to check the number of generated wal records, for
+     * example the walwriter may write/sync the WAL although it doesn't
You use both 'wal' and 'WAL' in the comments, but are there any intension?

No, I changed to use 'WAL' only. Although some other comments have 'wal' and
'WAL', it seems that 'WAL' is often used.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v9-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v9-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload

diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2b106d8473..355778028e 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -17,6 +17,12 @@
 
 #include "executor/instrument.h"
 
+/*
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
+ */
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a0..75a95f3de7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 249b17c92b..734a7cf802 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,8 +133,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -852,9 +852,19 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the WAL record is generated or not. If a read transaction
+	 * is executed, WAL records aren't normally generated (although HOT makes
+	 * WAL records). But, just writes and syncs the WAL data may happen when
+	 * to flush buffers.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -948,7 +958,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2918,7 +2928,7 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -3030,44 +3040,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -3075,24 +3047,38 @@ pgstat_report_wal(void)
  *
  * If 'force' is not set, WAL stats message is only sent if enough time has
  * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
- *
- * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
-	 * This function can be called even if nothing at all has happened. In
-	 * this case, avoid sending a completely empty message to the stats
-	 * collector.
+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().
+	 *
+	 * The current 'wal_records' is the same as the previous one means that
+	 * WAL records weren't generated. So, the counters of 'wal_fpi',
+	 * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+	 *
+	 * It's not enough to check the number of generated WAL records, for
+	 * example the walwriter may write/sync the WAL although it doesn't
+	 * generate WAL records. 'm_wal_write' and 'm_wal_sync' are zero means the
+	 * counters of time spent are zero too.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
+	{
+		Assert(WalStats.m_wal_buffers_full == 0);
+		return;
+	}
 
 	if (!force)
 	{
@@ -3100,13 +3086,40 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * Set the counters related to generated WAL data if the counters were
+	 * updated.
+	 */
+	if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters were increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+
+		WalStats.m_wal_records = walusage.wal_records;
+		WalStats.m_wal_fpi = walusage.wal_fpi;
+		WalStats.m_wal_bytes = walusage.wal_bytes;
+
+		/*
+		 * Save the current counters for the subsequent calculation of WAL usage.
+		 */
+		prevWalUsage = pgWalUsage;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
@@ -3117,8 +3130,6 @@ pgstat_send_wal(bool force)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
-
-	return true;
 }
 
 /* ----------
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index fc87eed4fb..31dbfc832a 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,6 +16,12 @@
 #include "portability/instr_time.h"
 
 
+/*
+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.
+ */
 typedef struct BufferUsage
 {
 	int64		shared_blks_hit;	/* # of shared buffer hits */
@@ -32,6 +38,15 @@ typedef struct BufferUsage
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
 
+/*
+ * The accumulated counters for generated WAL usage.
+ *
+ * The reason the counters are accumulated values is the same as BufferUsage's one.
+ * And the reason to store only generated WAL usage and doesn't store WAL I/O
+ * activity, is that this is assumed for knowing the WAL usage in per query or
+ * transaction. So, common resources for the cluster like WAL I/O activity is
+ * not stored.
+ */
 typedef struct WalUsage
 {
 	int64		wal_records;	/* # of WAL records produced */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5fbd3a05ba..9612c0a6c2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1091,8 +1091,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

#36

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#35)

Re: wal stats questions

On 2021/05/17 16:39, Masahiro Ikeda wrote:

Thanks for your comments!

+ * is executed, wal records aren't nomally generated (although HOT makes

nomally -> normally?

Yes, fixed.
+     * It's not enough to check the number of generated wal records, for
+     * example the walwriter may write/sync the WAL although it doesn't
You use both 'wal' and 'WAL' in the comments, but are there any intension?
No, I changed to use 'WAL' only. Although some other comments have 'wal' and
'WAL', it seems that 'WAL' is often used.

Thanks for updating the patch!

+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.

Is it really worth adding these comments here?

+	 * Note: regarding to WAL statistics counters, it's not enough to check
+	 * only whether the WAL record is generated or not. If a read transaction
+	 * is executed, WAL records aren't normally generated (although HOT makes
+	 * WAL records). But, just writes and syncs the WAL data may happen when
+	 * to flush buffers.

Aren't the following comments better?

------------------------------
To determine whether any WAL activity has occurred since last time, not only the number of generated WAL records but also the numbers of WAL writes and syncs need to be checked. Because even transaction that generates no WAL records can write or sync WAL data when flushing the data pages.
------------------------------

- * This function can be called even if nothing at all has happened. In
- * this case, avoid sending a completely empty message to the stats
- * collector.

I think that it's better to leave this comment as it is.

+	 * First, to check the WAL stats counters were updated.
+	 *
+	 * Even if the 'force' is true, we don't need to send the stats if the
+	 * counters were not updated.
+	 *
+	 * We can know whether the counters were updated or not to check only
+	 * three counters. So, for performance, we don't allocate another memory
+	 * spaces and check the all stats like pgstat_send_slru().

Is it really worth adding these comments here?

+	 * The current 'wal_records' is the same as the previous one means that
+	 * WAL records weren't generated. So, the counters of 'wal_fpi',
+	 * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+	 *
+	 * It's not enough to check the number of generated WAL records, for
+	 * example the walwriter may write/sync the WAL although it doesn't
+	 * generate WAL records. 'm_wal_write' and 'm_wal_sync' are zero means the
+	 * counters of time spent are zero too.

Aren't the following comments better?

------------------------
Check wal_records counter to determine whether any WAL activity has happened since last time. Note that other WalUsage counters don't need to be checked because they are incremented always together with wal_records counter.

m_wal_buffers_full also doesn't need to be checked because it's incremented only when at least one WAL record is generated (i.e., wal_records counter is incremented). But for safely, we assert that m_wal_buffers_full is always zero when no WAL record is generated

This function can be called by a process like walwriter that normally generates no WAL records. To determine whether any WAL activity has happened at that process since the last time, the numbers of WAL writes and syncs are also checked.
------------------------

+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.

Aren't the following comments better?

------------------------
These counters keep being incremented infinitely, i.e., must never be reset to zero, so that we can calculate how much the counters are incremented in an arbitrary period.
------------------------

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#37

ikedamsh@oss.nttdata.com

over 4 years ago

In reply to: Fujii Masao (#36)

1 attachment(s)

Re: wal stats questions

On 2021/05/17 22:34, Fujii Masao wrote:

On 2021/05/17 16:39, Masahiro Ikeda wrote:
Thanks for your comments!

+ * is executed, wal records aren't nomally generated (although HOT makes

nomally -> normally?

Yes, fixed.
+     * It's not enough to check the number of generated wal records, for
+     * example the walwriter may write/sync the WAL although it doesn't
You use both 'wal' and 'WAL' in the comments, but are there any intension?
No, I changed to use 'WAL' only. Although some other comments have 'wal' and
'WAL', it seems that 'WAL' is often used.
Thanks for updating the patch!

Thanks a lot of comments!

+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.

Is it really worth adding these comments here?

BufferUsage has almost same comments. So, I removed it.

+     * Note: regarding to WAL statistics counters, it's not enough to check
+     * only whether the WAL record is generated or not. If a read transaction
+     * is executed, WAL records aren't normally generated (although HOT makes
+     * WAL records). But, just writes and syncs the WAL data may happen when
+     * to flush buffers.
Aren't the following comments better?

------------------------------
To determine whether any WAL activity has occurred since last time, not only
the number of generated WAL records but also the numbers of WAL writes and
syncs need to be checked. Because even transaction that generates no WAL
records can write or sync WAL data when flushing the data pages.
------------------------------

Thanks. Yes, I fixed it.

-     * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.

I think that it's better to leave this comment as it is.

OK. I leave it.

+     * First, to check the WAL stats counters were updated.
+     *
+     * Even if the 'force' is true, we don't need to send the stats if the
+     * counters were not updated.
+     *
+     * We can know whether the counters were updated or not to check only
+     * three counters. So, for performance, we don't allocate another memory
+     * spaces and check the all stats like pgstat_send_slru().

Is it really worth adding these comments here?

I removed them because the following comments are enough.

* This function can be called even if nothing at all has happened. In
* this case, avoid sending a completely empty message to the stats
* collector.

+     * The current 'wal_records' is the same as the previous one means that
+     * WAL records weren't generated. So, the counters of 'wal_fpi',
+     * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+     *
+     * It's not enough to check the number of generated WAL records, for
+     * example the walwriter may write/sync the WAL although it doesn't
+     * generate WAL records. 'm_wal_write' and 'm_wal_sync' are zero means the
+     * counters of time spent are zero too.
Aren't the following comments better?

------------------------
Check wal_records counter to determine whether any WAL activity has happened
since last time. Note that other WalUsage counters don't need to be checked
because they are incremented always together with wal_records counter.

m_wal_buffers_full also doesn't need to be checked because it's incremented
only when at least one WAL record is generated (i.e., wal_records counter is
incremented). But for safely, we assert that m_wal_buffers_full is always zero
when no WAL record is generated

This function can be called by a process like walwriter that normally
generates no WAL records. To determine whether any WAL activity has happened
at that process since the last time, the numbers of WAL writes and syncs are
also checked.
------------------------

Yes, I modified them.

+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.
Aren't the following comments better?

------------------------
These counters keep being incremented infinitely, i.e., must never be reset to
zero, so that we can calculate how much the counters are incremented in an
arbitrary period.
------------------------

Yes, thanks.
I fixed it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v10-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v10-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cdd07770a0..75a95f3de7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -505,7 +505,7 @@ CheckpointerMain(void)
 		pgstat_send_bgwriter();
 
 		/* Send WAL statistics to the stats collector. */
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/*
 		 * If any checkpoint flags have been set, redo the loop to handle the
@@ -583,7 +583,7 @@ HandleCheckpointerInterrupts(void)
 		BgWriterStats.m_requested_checkpoints++;
 		ShutdownXLOG(0, 0);
 		pgstat_send_bgwriter();
-		pgstat_report_wal();
+		pgstat_send_wal(true);
 
 		/* Normal exit from the checkpointer is here */
 		proc_exit(0);			/* done */
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 249b17c92b..5dceccd2fa 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -133,8 +133,8 @@ PgStat_MsgWal WalStats;
 
 /*
  * WAL usage counters saved from pgWALUsage at the previous call to
- * pgstat_report_wal(). This is used to calculate how much WAL usage
- * happens between pgstat_report_wal() calls, by substracting
+ * pgstat_send_wal(). This is used to calculate how much WAL usage
+ * happens between pgstat_send_wal() calls, by substracting
  * the previous counters from the current ones.
  */
 static WalUsage prevWalUsage;
@@ -852,9 +852,18 @@ pgstat_report_stat(bool disconnect)
 	TabStatusArray *tsa;
 	int			i;
 
-	/* Don't expend a clock check if nothing to do */
+	/*
+	 * Don't expend a clock check if nothing to do.
+	 *
+	 * To determine whether any WAL activity has occurred since last time, not only
+	 * the number of generated WAL records but also the numbers of WAL writes and
+	 * syncs need to be checked. Because even transaction that generates no WAL
+	 * records can write or sync WAL data when flushing the data pages.
+	 */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
+		pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0 &&
 		!have_function_stats && !disconnect)
 		return;
 
@@ -948,7 +957,7 @@ pgstat_report_stat(bool disconnect)
 	pgstat_send_funcstats();
 
 	/* Send WAL statistics */
-	pgstat_report_wal();
+	pgstat_send_wal(true);
 
 	/* Finally send SLRU statistics */
 	pgstat_send_slru();
@@ -2918,7 +2927,7 @@ void
 pgstat_initialize(void)
 {
 	/*
-	 * Initialize prevWalUsage with pgWalUsage so that pgstat_report_wal() can
+	 * Initialize prevWalUsage with pgWalUsage so that pgstat_send_wal() can
 	 * calculate how much pgWalUsage counters are increased by substracting
 	 * prevWalUsage from pgWalUsage.
 	 */
@@ -3030,44 +3039,6 @@ pgstat_send_bgwriter(void)
 	MemSet(&BgWriterStats, 0, sizeof(BgWriterStats));
 }
 
-/* ----------
- * pgstat_report_wal() -
- *
- * Calculate how much WAL usage counters are increased and send
- * WAL statistics to the collector.
- *
- * Must be called by processes that generate WAL.
- * ----------
- */
-void
-pgstat_report_wal(void)
-{
-	WalUsage	walusage;
-
-	/*
-	 * Calculate how much WAL usage counters are increased by substracting the
-	 * previous counters from the current ones. Fill the results in WAL stats
-	 * message.
-	 */
-	MemSet(&walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
-
-	WalStats.m_wal_records = walusage.wal_records;
-	WalStats.m_wal_fpi = walusage.wal_fpi;
-	WalStats.m_wal_bytes = walusage.wal_bytes;
-
-	/*
-	 * Send WAL stats message to the collector.
-	 */
-	if (!pgstat_send_wal(true))
-		return;
-
-	/*
-	 * Save the current counters for the subsequent calculation of WAL usage.
-	 */
-	prevWalUsage = pgWalUsage;
-}
-
 /* ----------
  * pgstat_send_wal() -
  *
@@ -3075,24 +3046,38 @@ pgstat_report_wal(void)
  *
  * If 'force' is not set, WAL stats message is only sent if enough time has
  * passed since last one was sent to reach PGSTAT_STAT_INTERVAL.
- *
- * Return true if the message is sent, and false otherwise.
  * ----------
  */
-bool
+void
 pgstat_send_wal(bool force)
 {
-	/* We assume this initializes to zeroes */
-	static const PgStat_MsgWal all_zeroes;
 	static TimestampTz sendTime = 0;
 
 	/*
 	 * This function can be called even if nothing at all has happened. In
 	 * this case, avoid sending a completely empty message to the stats
 	 * collector.
+	 *
+	 * Check wal_records counter to determine whether any WAL activity has happened
+	 * since last time. Note that other WalUsage counters don't need to be checked
+	 * because they are incremented always together with wal_records counter.
+	 *
+	 * m_wal_buffers_full also doesn't need to be checked because it's incremented
+	 * only when at least one WAL record is generated (i.e., wal_records counter is
+	 * incremented). But for safely, we assert that m_wal_buffers_full is always zero
+	 * when no WAL record is generated
+	 *
+	 * This function can be called by a process like walwriter that normally
+	 * generates no WAL records. To determine whether any WAL activity has happened
+	 * at that process since the last time, the numbers of WAL writes and syncs are
+	 * also checked.
 	 */
-	if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
-		return false;
+	if (pgWalUsage.wal_records == prevWalUsage.wal_records &&
+		WalStats.m_wal_write == 0 && WalStats.m_wal_sync == 0)
+	{
+		Assert(WalStats.m_wal_buffers_full == 0);
+		return;
+	}
 
 	if (!force)
 	{
@@ -3100,13 +3085,40 @@ pgstat_send_wal(bool force)
 
 		/*
 		 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-		 * msec since we last sent one.
+		 * msec since we last sent one to avoid overloading the stats
+		 * collector.
 		 */
 		if (!TimestampDifferenceExceeds(sendTime, now, PGSTAT_STAT_INTERVAL))
-			return false;
+			return;
 		sendTime = now;
 	}
 
+	/*
+	 * Set the counters related to generated WAL data if the counters were
+	 * updated.
+	 */
+	if (pgWalUsage.wal_records != prevWalUsage.wal_records)
+	{
+		WalUsage	walusage;
+
+		/*
+		 * Calculate how much WAL usage counters were increased by substracting
+		 * the previous counters from the current ones. Fill the results in
+		 * WAL stats message.
+		 */
+		MemSet(&walusage, 0, sizeof(WalUsage));
+		WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);
+
+		WalStats.m_wal_records = walusage.wal_records;
+		WalStats.m_wal_fpi = walusage.wal_fpi;
+		WalStats.m_wal_bytes = walusage.wal_bytes;
+
+		/*
+		 * Save the current counters for the subsequent calculation of WAL usage.
+		 */
+		prevWalUsage = pgWalUsage;
+	}
+
 	/*
 	 * Prepare and send the message
 	 */
@@ -3117,8 +3129,6 @@ pgstat_send_wal(bool force)
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
 	MemSet(&WalStats, 0, sizeof(WalStats));
-
-	return true;
 }
 
 /* ----------
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index fc87eed4fb..2a7b3f1233 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -16,6 +16,11 @@
 #include "portability/instr_time.h"
 
 
+/*
+ * These counters keep being incremented infinitely, i.e., must never be reset to
+ * zero, so that we can calculate how much the counters are incremented in an
+ * arbitrary period.
+ */
 typedef struct BufferUsage
 {
 	int64		shared_blks_hit;	/* # of shared buffer hits */
@@ -32,6 +37,15 @@ typedef struct BufferUsage
 	instr_time	blk_write_time; /* time spent writing */
 } BufferUsage;
 
+/*
+ * The accumulated counters for generated WAL usage.
+ *
+ * The reason the counters are accumulated values is the same as BufferUsage's one.
+ * And the reason to store only generated WAL usage and doesn't store WAL I/O
+ * activity, is that this is assumed for knowing the WAL usage in per query or
+ * transaction. So, common resources for the cluster like WAL I/O activity is
+ * not stored.
+ */
 typedef struct WalUsage
 {
 	int64		wal_records;	/* # of WAL records produced */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5fbd3a05ba..9612c0a6c2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1091,8 +1091,7 @@ extern void pgstat_twophase_postabort(TransactionId xid, uint16 info,
 
 extern void pgstat_send_archiver(const char *xlog, bool failed);
 extern void pgstat_send_bgwriter(void);
-extern void pgstat_report_wal(void);
-extern bool pgstat_send_wal(bool force);
+extern void pgstat_send_wal(bool force);
 
 /* ----------
  * Support functions for the SQL-callable functions to

#38

masao.fujii@oss.nttdata.com

over 4 years ago

In reply to: Masahiro Ikeda (#37)

Re: wal stats questions

On 2021/05/18 9:57, Masahiro Ikeda wrote:

On 2021/05/17 22:34, Fujii Masao wrote:
On 2021/05/17 16:39, Masahiro Ikeda wrote:
Thanks for your comments!

+ * is executed, wal records aren't nomally generated (although HOT makes

nomally -> normally?

Yes, fixed.
+     * It's not enough to check the number of generated wal records, for
+     * example the walwriter may write/sync the WAL although it doesn't
You use both 'wal' and 'WAL' in the comments, but are there any intension?
No, I changed to use 'WAL' only. Although some other comments have 'wal' and
'WAL', it seems that 'WAL' is often used.
Thanks for updating the patch!
Thanks a lot of comments!
+ * Buffer and generated WAL usage counters.
+ *
+ * The counters are accumulated values. There are infrastructures
+ * to add the values and calculate the difference within a specific period.
Is it really worth adding these comments here?
BufferUsage has almost same comments. So, I removed it.
+     * Note: regarding to WAL statistics counters, it's not enough to check
+     * only whether the WAL record is generated or not. If a read transaction
+     * is executed, WAL records aren't normally generated (although HOT makes
+     * WAL records). But, just writes and syncs the WAL data may happen when
+     * to flush buffers.
Aren't the following comments better?

------------------------------
To determine whether any WAL activity has occurred since last time, not only
the number of generated WAL records but also the numbers of WAL writes and
syncs need to be checked. Because even transaction that generates no WAL
records can write or sync WAL data when flushing the data pages.
------------------------------
Thanks. Yes, I fixed it.

-     * This function can be called even if nothing at all has happened. In
-     * this case, avoid sending a completely empty message to the stats
-     * collector.

I think that it's better to leave this comment as it is.

OK. I leave it.
+     * First, to check the WAL stats counters were updated.
+     *
+     * Even if the 'force' is true, we don't need to send the stats if the
+     * counters were not updated.
+     *
+     * We can know whether the counters were updated or not to check only
+     * three counters. So, for performance, we don't allocate another memory
+     * spaces and check the all stats like pgstat_send_slru().
Is it really worth adding these comments here?
I removed them because the following comments are enough.

* This function can be called even if nothing at all has happened. In
* this case, avoid sending a completely empty message to the stats
* collector.
+     * The current 'wal_records' is the same as the previous one means that
+     * WAL records weren't generated. So, the counters of 'wal_fpi',
+     * 'wal_bytes', 'm_wal_buffers_full' are not updated neither.
+     *
+     * It's not enough to check the number of generated WAL records, for
+     * example the walwriter may write/sync the WAL although it doesn't
+     * generate WAL records. 'm_wal_write' and 'm_wal_sync' are zero means the
+     * counters of time spent are zero too.
Aren't the following comments better?

------------------------
Check wal_records counter to determine whether any WAL activity has happened
since last time. Note that other WalUsage counters don't need to be checked
because they are incremented always together with wal_records counter.

m_wal_buffers_full also doesn't need to be checked because it's incremented
only when at least one WAL record is generated (i.e., wal_records counter is
incremented). But for safely, we assert that m_wal_buffers_full is always zero
when no WAL record is generated

This function can be called by a process like walwriter that normally
generates no WAL records. To determine whether any WAL activity has happened
at that process since the last time, the numbers of WAL writes and syncs are
also checked.
------------------------
Yes, I modified them.
+ * The accumulated counters for buffer usage.
+ *
+ * The reason the counters are accumulated values is to avoid unexpected
+ * reset because many callers may use them.
Aren't the following comments better?

------------------------
These counters keep being incremented infinitely, i.e., must never be reset to
zero, so that we can calculate how much the counters are incremented in an
arbitrary period.
------------------------
Yes, thanks.
I fixed it.

Thanks for updating the patch! I modified some comments slightly and
pushed that version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#39