wal stats questions

Started by Andres Freundabout 5 years ago39 messageshackers
Jump to latest
#1Andres Freund
andres@anarazel.de

Hi,

I got a few questions about the wal stats while working on the shmem
stats patch:

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

Generally the various patches seems to to have a lot of the boilerplate
style comments (like "Prepare and send the message"), but very little in
the way of explaining the design.

Greetings,

Andres Freund

#2Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Andres Freund (#1)
Re: wal stats questions

On 2021/03/25 8:22, Andres Freund wrote:

Hi,

I got a few questions about the wal stats while working on the shmem
stats patch:

Thanks for your reviews.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats? My understanding is as same as your assumption. For example,
pg_stat_statements.c use pgWalUsage and calculate the diff.

But, because the stats may be sent after the transaction is finished, it
doesn't seem to lead wrong stats if pgWalUsage = 0 is called. So, I agree your
suggestion.

If the extension wants to know the walusage diff across two transactions,
it may lead to wrong stats, but I think it won't happen.

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

To control the transmission interval for the wal writer because it may send
the stats too frequency, and to omit to calculate the generated wal stats
because it doesn't generate wal records. But, now I think it's better to merge
them.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

I understood that for backends, this may leads performance degradation and
this problem is not only for the WalStats but also SLRUStats.

I wondered the degradation is so big because pgstat_report_stat() checks if at
least PGSTAT_STAT_INTERVAL msec is passed since it last sent before check the
diff. If my understanding is correct, to get timestamp is more expensive.

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

(I'm not confidence my understanding of your comment is right.)
You mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() is already checked it.

Generally the various patches seems to to have a lot of the boilerplate
style comments (like "Prepare and send the message"), but very little in
the way of explaining the design.

Sorry for that. I'll be careful.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

#3Andres Freund
andres@anarazel.de
In reply to: Masahiro Ikeda (#2)
Re: wal stats questions

Hi,

On 2021-03-25 10:51:56 +0900, Masahiro Ikeda wrote:

On 2021/03/25 8:22, Andres Freund wrote:

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats?

Yes. At least there should be a comment explaining why it's done the way
it is.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

I understood that for backends, this may leads performance degradation and
this problem is not only for the WalStats but also SLRUStats.

I wondered the degradation is so big because pgstat_report_stat() checks if at
least PGSTAT_STAT_INTERVAL msec is passed since it last sent before check the
diff. If my understanding is correct, to get timestamp is more expensive.

Getting a timestamp is expensive, yes. But I think we need to check for
the no-pending-wal-stats *before* the clock check. See the below:

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

(I'm not confidence my understanding of your comment is right.)
You mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() is already checked it.

No, I mean that right now we might can erroneously return early
pgstat_report_wal() for normal backends in some workloads:

void
pgstat_report_stat(bool disconnect)
...
/* Don't expend a clock check if nothing to do */
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)
return;

might return if there only is pending WAL activity. This needs to check
whether there are pending WAL stats. Which in turn means that the check
should be fast.

Greetings,

Andres Freund

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#3)
Re: wal stats questions

At Wed, 24 Mar 2021 21:07:26 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2021-03-25 10:51:56 +0900, Masahiro Ikeda wrote:

On 2021/03/25 8:22, Andres Freund wrote:

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats?

Yes. At least there should be a comment explaining why it's done the way
it is.

pgWalUsage was used without resetting and we (I) thought that that
behavior should be preserved. On second thought, as Andres suggested,
we can just reset pgWalUsage at sending since AFAICS no one takes
difference crossing pgstat_report_stat() calls.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

I understood that for backends, this may leads performance degradation and
this problem is not only for the WalStats but also SLRUStats.

I wondered the degradation is so big because pgstat_report_stat() checks if at
least PGSTAT_STAT_INTERVAL msec is passed since it last sent before check the
diff. If my understanding is correct, to get timestamp is more expensive.

Getting a timestamp is expensive, yes. But I think we need to check for
the no-pending-wal-stats *before* the clock check. See the below:

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

(I'm not confidence my understanding of your comment is right.)
You mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() is already checked it.

No, I mean that right now we might can erroneously return early
pgstat_report_wal() for normal backends in some workloads:

void
pgstat_report_stat(bool disconnect)
...
/* Don't expend a clock check if nothing to do */
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)
return;

might return if there only is pending WAL activity. This needs to check
whether there are pending WAL stats. Which in turn means that the check
should be fast.

Agreed that the condition is wrong. On the other hand, the counters
are incremented in XLogInsertRecord() and I think we don't want add
instructions there.

If any wal activity has been recorded, wal_records is always positive
thus we can check for wal activity just by "pgWalUsage.wal_records >
0, which should be fast enough.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#4)
Re: wal stats questions

On 2021/03/25 16:37, Kyotaro Horiguchi wrote:

At Wed, 24 Mar 2021 21:07:26 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2021-03-25 10:51:56 +0900, Masahiro Ikeda wrote:

On 2021/03/25 8:22, Andres Freund wrote:

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats?

Yes. At least there should be a comment explaining why it's done the way
it is.

pgWalUsage was used without resetting and we (I) thought that that
behavior should be preserved. On second thought, as Andres suggested,
we can just reset pgWalUsage at sending since AFAICS no one takes
difference crossing pgstat_report_stat() calls.

Yes, I agree that we can do that since there seems no such code for now.
Also if we do that, we can check, for example "pgWalUsage.wal_records > 0"
as you suggested, to easily determine whether there is pending WAL stats or not.
Anyway I agree it's better to add comments about the design more.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.

I understood that for backends, this may leads performance degradation and
this problem is not only for the WalStats but also SLRUStats.

I wondered the degradation is so big because pgstat_report_stat() checks if at
least PGSTAT_STAT_INTERVAL msec is passed since it last sent before check the
diff. If my understanding is correct, to get timestamp is more expensive.

Getting a timestamp is expensive, yes. But I think we need to check for
the no-pending-wal-stats *before* the clock check. See the below:

4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

(I'm not confidence my understanding of your comment is right.)
You mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() is already checked it.

No, I mean that right now we might can erroneously return early
pgstat_report_wal() for normal backends in some workloads:

void
pgstat_report_stat(bool disconnect)
...
/* Don't expend a clock check if nothing to do */
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)
return;

might return if there only is pending WAL activity. This needs to check
whether there are pending WAL stats. Which in turn means that the check
should be fast.

Agreed that the condition is wrong. On the other hand, the counters
are incremented in XLogInsertRecord() and I think we don't want add
instructions there.

Basically yes. We should avoid that especially while WALInsertLock is being hold.
But it's not so harmful to do that after the lock is released?

If any wal activity has been recorded, wal_records is always positive
thus we can check for wal activity just by "pgWalUsage.wal_records >
0, which should be fast enough.

Maybe there is the case where a backend generates no WAL records,
but just writes them because it needs to do write-ahead-logging
when flush the table data? If yes, "pgWalUsage.wal_records > 0" is not enough.
Probably other fields also need to be checked.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#5)
Re: wal stats questions

At Thu, 25 Mar 2021 19:01:23 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/03/25 16:37, Kyotaro Horiguchi wrote:

pgWalUsage was used without resetting and we (I) thought that that
behavior should be preserved. On second thought, as Andres suggested,
we can just reset pgWalUsage at sending since AFAICS no one takes
difference crossing pgstat_report_stat() calls.

Yes, I agree that we can do that since there seems no such code for
now.
Also if we do that, we can check, for example "pgWalUsage.wal_records

0"

as you suggested, to easily determine whether there is pending WAL
stats or not.
Anyway I agree it's better to add comments about the design more.

...

If any wal activity has been recorded, wal_records is always positive
thus we can check for wal activity just by "pgWalUsage.wal_records >
0, which should be fast enough.

Maybe there is the case where a backend generates no WAL records,
but just writes them because it needs to do write-ahead-logging
when flush the table data? If yes, "pgWalUsage.wal_records > 0" is not
enough.
Probably other fields also need to be checked.

(I noticed I made the discussion above unconsciously premising
pgWalUsage reset.)

I may be misunderstanding or missing something, but the only place
where pgWalUsage counters are increased is XLogInsertRecrod. That is,
pgWalUsage counts wal insertions, not writes nor flushes. AFAICS
pgWalUsage.wal_records is always incremented when other counters are
increased. Looking from another side, we should refrain from adding
counters that incrases at a different time than
pgWalUsage.wal_recrods. (That should be written as a comment there.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7Fujii Masao
masao.fujii@gmail.com
In reply to: Kyotaro Horiguchi (#6)
Re: wal stats questions

On 2021/03/26 10:08, Kyotaro Horiguchi wrote:

At Thu, 25 Mar 2021 19:01:23 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

On 2021/03/25 16:37, Kyotaro Horiguchi wrote:

pgWalUsage was used without resetting and we (I) thought that that
behavior should be preserved. On second thought, as Andres suggested,
we can just reset pgWalUsage at sending since AFAICS no one takes
difference crossing pgstat_report_stat() calls.

Yes, I agree that we can do that since there seems no such code for
now.
Also if we do that, we can check, for example "pgWalUsage.wal_records

0"

as you suggested, to easily determine whether there is pending WAL
stats or not.
Anyway I agree it's better to add comments about the design more.

...

If any wal activity has been recorded, wal_records is always positive
thus we can check for wal activity just by "pgWalUsage.wal_records >
0, which should be fast enough.

Maybe there is the case where a backend generates no WAL records,
but just writes them because it needs to do write-ahead-logging
when flush the table data? If yes, "pgWalUsage.wal_records > 0" is not
enough.
Probably other fields also need to be checked.

(I noticed I made the discussion above unconsciously premising
pgWalUsage reset.)

I may be misunderstanding or missing something, but the only place
where pgWalUsage counters are increased is XLogInsertRecrod. That is,
pgWalUsage counts wal insertions, not writes nor flushes. AFAICS

Yes. And WalStats instead of pgWalUsage includes the stats about
not only WAL insertions, but also writes and flushes.
pgstat_send_wal() checks WalStats to determine whether there are
pending WAL stats to send to the stats collector or not. That is,
the counters of not only WAL insertions but also writes and flushes
should be checked to determine whether there are pending stats or not, I think..

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#8Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Fujii Masao (#7)
Re: wal stats questions

At Fri, 26 Mar 2021 10:32:23 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in

I may be misunderstanding or missing something, but the only place
where pgWalUsage counters are increased is XLogInsertRecrod. That is,
pgWalUsage counts wal insertions, not writes nor flushes. AFAICS

Yes. And WalStats instead of pgWalUsage includes the stats about
not only WAL insertions, but also writes and flushes.

Ugh! I was missing a very large blob.. Ok, we need additional check
for non-pgWalUsage part. Sorry.

pgstat_send_wal() checks WalStats to determine whether there are
pending WAL stats to send to the stats collector or not. That is,
the counters of not only WAL insertions but also writes and flushes
should be checked to determine whether there are pending stats or not,
I think..

I think we may have an additional flag to notify about io-stat part,
in constrast to wal-insertion part . Anyway we do additional
INSTR_TIME_SET_CURRENT when track_wal_io_timinge.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#9Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#8)
Re: wal stats questions

Thanks for many your suggestions!
I made the patch to handle the issues.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

I removed the unnecessary code copying pgWalUsage and just reset the
pgWalUsage after reporting the stats in pgstat_report_wal().

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

I added the comments why two functions are separated.
(But is it better to merge them?)

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

Although I added the condition which the write/sync counters are updated or
not, I haven't understood the following case yet...Sorry. I checked related
code and tested to insert large object, but I couldn't. I'll investigate more
deeply, but if you already know the function which leads the following case,
please let me know.

Maybe there is the case where a backend generates no WAL records,
but just writes them because it needs to do write-ahead-logging
when flush the table data?

Ugh! I was missing a very large blob.. Ok, we need additional check
for non-pgWalUsage part. Sorry.

Regards,

--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v1-0001-performance-improvements-of-reporting-wal-stats.patchtext/x-patch; charset=UTF-8; name=v1-0001-performance-improvements-of-reporting-wal-stats.patchDownload+72-46
#10Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#4)
Re: wal stats questions

Hi,

On 2021-03-25 16:37:10 +0900, Kyotaro Horiguchi wrote:

On the other hand, the counters are incremented in XLogInsertRecord()
and I think we don't want add instructions there.

I don't really buy this. Setting a boolean to true, in a cacheline
you're already touching, isn't that much compared to all the other stuff
in there. The branch to check if wal stats timing etc is enabled is much
more expensive. I think we should just set a boolean to true and leave
it at that.

Greetings,

Andres Freund

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#10)
Re: wal stats questions

At Fri, 26 Mar 2021 10:07:45 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2021-03-25 16:37:10 +0900, Kyotaro Horiguchi wrote:

On the other hand, the counters are incremented in XLogInsertRecord()
and I think we don't want add instructions there.

I don't really buy this. Setting a boolean to true, in a cacheline
you're already touching, isn't that much compared to all the other stuff
in there. The branch to check if wal stats timing etc is enabled is much
more expensive. I think we should just set a boolean to true and leave
it at that.

Hmm. Yes, I agree to you in that opinion. I (remember I) was told not
to add even a cycle to the hot path as far as we can avoid when I
tried something like that.

So I'm happy to +1 for that if it is the consensus here, since it is
cleaner.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#11)
Re: wal stats questions

At Mon, 29 Mar 2021 11:09:00 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Fri, 26 Mar 2021 10:07:45 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2021-03-25 16:37:10 +0900, Kyotaro Horiguchi wrote:

On the other hand, the counters are incremented in XLogInsertRecord()
and I think we don't want add instructions there.

I don't really buy this. Setting a boolean to true, in a cacheline
you're already touching, isn't that much compared to all the other stuff
in there. The branch to check if wal stats timing etc is enabled is much
more expensive. I think we should just set a boolean to true and leave
it at that.

Hmm. Yes, I agree to you in that opinion. I (remember I) was told not

It might sound differently.. To be precise, "I had the same opinion
with you".

to add even a cycle to the hot path as far as we can avoid when I
tried something like that.

So I'm happy to +1 for that if it is the consensus here, since it is
cleaner.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Masahiro Ikeda (#9)
Re: wal stats questions

I update the patch since there were my misunderstanding points.

On 2021/03/26 16:20, Masahiro Ikeda wrote:

Thanks for many your suggestions!
I made the patch to handle the issues.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

I removed the unnecessary code copying pgWalUsage and just reset the
pgWalUsage after reporting the stats in pgstat_report_wal().

I didn't change this.

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

I added the comments why two functions are separated.
(But is it better to merge them?)

I updated the comments for following reasons.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

I removed the checking code whether the wal stats counters were updated or not
in pgstat_report_stat() since I couldn't understand the valid case yet.
pgstat_report_stat() is called by backends when the transaction is finished.
This means that the condition seems always pass.

I thought to implement if the WAL stats counters were not updated, skip to
send all statistics including the counters for databases and so on. But I
think it's not good because it may take more time to be reflected the
generated stats by read-only transaction.

Although I added the condition which the write/sync counters are updated or
not, I haven't understood the following case yet...Sorry. I checked related
code and tested to insert large object, but I couldn't. I'll investigate more
deeply, but if you already know the function which leads the following case,
please let me know.

I understood the above case (Fujii-san, thanks for your advice in person).
When to flush buffers, for example, to select buffer replacement victim,
it requires a WAL flush if the buffer is dirty. So, to check the WAL stats
counters are updated or not, I check the number of generated wal record and
the counter of syncing in pgstat_report_wal().

The reason why not to check the counter of writing is that if to write is
happened, to sync is happened too in the above case. I added the comments in
the patch.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v2-0001-performance-improvements-of-reporting-wal-stats.patchtext/x-patch; charset=UTF-8; name=v2-0001-performance-improvements-of-reporting-wal-stats.patchDownload+70-44
#14Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiro Ikeda (#13)
Re: wal stats questions

At Tue, 30 Mar 2021 09:41:24 +0900, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote in

I update the patch since there were my misunderstanding points.

On 2021/03/26 16:20, Masahiro Ikeda wrote:

Thanks for many your suggestions!
I made the patch to handle the issues.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

I removed the unnecessary code copying pgWalUsage and just reset the
pgWalUsage after reporting the stats in pgstat_report_wal().

I didn't change this.

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

I added the comments why two functions are separated.
(But is it better to merge them?)

I updated the comments for following reasons.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

I removed the checking code whether the wal stats counters were updated or not
in pgstat_report_stat() since I couldn't understand the valid case yet.
pgstat_report_stat() is called by backends when the transaction is finished.
This means that the condition seems always pass.

Doesn't the same holds for all other counters? If you are saying that
"wal counters should be zero if all other stats counters are zero", we
need an assertion to check that and a comment to explain that.

I personally find it safer to add the WAL-stats condition to the
fast-return check, rather than addin such assertion.

pgstat_send_wal() is called mainly from pgstat_report_wal() which
consumes pgWalStats counters and WalWriterMain() which
doesn't. Checking on pgWalStats counters isn't so complex that we need
to avoid that in wal writer, thus *I* think pgstat_send_wal() and
pgstat_report_wal() can be consolidated. Even if you have a strong
opinion that wal writer should call a separate function, the name
should be something other than pgstat_send_wal() since it ignores
pgWalUsage counters, which are supposed to be included in "WAL stats".

I thought to implement if the WAL stats counters were not updated, skip to
send all statistics including the counters for databases and so on. But I
think it's not good because it may take more time to be reflected the
generated stats by read-only transaction.

Ur, yep.

Although I added the condition which the write/sync counters are updated or
not, I haven't understood the following case yet...Sorry. I checked related
code and tested to insert large object, but I couldn't. I'll investigate more
deeply, but if you already know the function which leads the following case,
please let me know.

I understood the above case (Fujii-san, thanks for your advice in person).
When to flush buffers, for example, to select buffer replacement victim,
it requires a WAL flush if the buffer is dirty. So, to check the WAL stats
counters are updated or not, I check the number of generated wal record and
the counter of syncing in pgstat_report_wal().

The reason why not to check the counter of writing is that if to write is
happened, to sync is happened too in the above case. I added the comments in
the patch.

Mmm.. Although I couldn't read you clearly, The fact that a flush may
happen without a write means the reverse at the same time, a write may
happen without a flush. For asynchronous commits, WAL-write happens
alone unaccompanied by a flush. And the corresponding flush would
happen later without writes.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#15Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Kyotaro Horiguchi (#14)
Re: wal stats questions

On 2021/03/30 17:28, Kyotaro Horiguchi wrote:

At Tue, 30 Mar 2021 09:41:24 +0900, Masahiro Ikeda <ikedamsh@oss.nttdata.com> wrote in

I update the patch since there were my misunderstanding points.

On 2021/03/26 16:20, Masahiro Ikeda wrote:

Thanks for many your suggestions!
I made the patch to handle the issues.

1) What is the motivation to have both prevWalUsage and pgWalUsage,
instead of just accumulating the stats since the last submission?
There doesn't seem to be any comment explaining it? Computing
differences to previous values, and copying to prev*, isn't free. I
assume this is out of a fear that the stats could get reset before
they're used for instrument.c purposes - but who knows?

I removed the unnecessary code copying pgWalUsage and just reset the
pgWalUsage after reporting the stats in pgstat_report_wal().

I didn't change this.

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
former being used by wal writer, the latter by most other processes?
There again don't seem to be comments explaining this.

I added the comments why two functions are separated.
(But is it better to merge them?)

I updated the comments for following reasons.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
just to figure out if there's been any changes isn't all that
cheap. This is regularly exercised in read-only workloads too. Seems
adding a boolean WalStats.have_pending = true or such would be
better.
4) For plain backends pgstat_report_wal() is called by
pgstat_report_stat() - but it is not checked as part of the "Don't
expend a clock check if nothing to do" check at the top. It's
probably rare to have pending wal stats without also passing one of
the other conditions, but ...

I added the logic to check if the stats counters are updated or not in
pgstat_report_stat() using not only generated wal record but also write/sync
counters, and it can skip to call reporting function.

I removed the checking code whether the wal stats counters were updated or not
in pgstat_report_stat() since I couldn't understand the valid case yet.
pgstat_report_stat() is called by backends when the transaction is finished.
This means that the condition seems always pass.

Doesn't the same holds for all other counters? If you are saying that
"wal counters should be zero if all other stats counters are zero", we
need an assertion to check that and a comment to explain that.

I personally find it safer to add the WAL-stats condition to the
fast-return check, rather than addin such assertion.

Thanks for your comments.

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check using WAL
stats counters. But, the purpose is to make the conditions stricter, right?

pgstat_send_wal() is called mainly from pgstat_report_wal() which
consumes pgWalStats counters and WalWriterMain() which
doesn't. Checking on pgWalStats counters isn't so complex that we need
to avoid that in wal writer, thus *I* think pgstat_send_wal() and
pgstat_report_wal() can be consolidated. Even if you have a strong
opinion that wal writer should call a separate function, the name
should be something other than pgstat_send_wal() since it ignores
pgWalUsage counters, which are supposed to be included in "WAL stats".

OK, I consolidated them.

I thought to implement if the WAL stats counters were not updated, skip to
send all statistics including the counters for databases and so on. But I
think it's not good because it may take more time to be reflected the
generated stats by read-only transaction.

Ur, yep.

Although I added the condition which the write/sync counters are updated or
not, I haven't understood the following case yet...Sorry. I checked related
code and tested to insert large object, but I couldn't. I'll investigate more
deeply, but if you already know the function which leads the following case,
please let me know.

I understood the above case (Fujii-san, thanks for your advice in person).
When to flush buffers, for example, to select buffer replacement victim,
it requires a WAL flush if the buffer is dirty. So, to check the WAL stats
counters are updated or not, I check the number of generated wal record and
the counter of syncing in pgstat_report_wal().

The reason why not to check the counter of writing is that if to write is
happened, to sync is happened too in the above case. I added the comments in
the patch.

Mmm.. Although I couldn't read you clearly, The fact that a flush may
happen without a write means the reverse at the same time, a write may
happen without a flush. For asynchronous commits, WAL-write happens
alone unaccompanied by a flush. And the corresponding flush would
happen later without writes.

Sorry, I didn't explain it enough.

For processes which may generate WAL records like backends, I thought it's
enough to check (1)the number of generated WAL records and (2)the counters of
syncing(flushing) the WAL. This is checked in pgstat_report_wal(). Sorry for
that I didn't mention (1) in the above thread.

If a backend execute a write transaction, some WAL records must be generated.
So, it's ok to check (1) only regardless of whether asynchronous commit is
enabled or not.

OHOT, if a backend execute a read-only transaction, WAL records won't be
generated (although HOT makes a wal records exactly...). But, WAL-write and
flush may happen when to flush buffers via XLogFlush(). In this case, if
WAL-write happened, flush must be happen later. But, if my understanding is
correct, there is no a case to flush doesn't happen, but to write happen.
So, I thought (2) is needed and it's enough to check the counter of
syncing(flushing).

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v3-0001-performance-improvements-of-reporting-wal-stats.patchtext/x-patch; charset=UTF-8; name=v3-0001-performance-improvements-of-reporting-wal-stats.patchDownload+60-70
#16Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#15)
Re: wal stats questions

On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check using WAL
stats counters. But, the purpose is to make the conditions stricter, right?

Yes. Currently if the following condition is false even when the WAL counters
are updated, nothing is sent to the stats collector. But with your patch,
in this case the WAL stats are sent.

if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)

Thanks for the patch! It now fails to be applied to the master cleanly.
So could you rebase the patch?

pgstat_initialize() should initialize pgWalUsage with zero?

+	/*
+	 * set the counters related to generated WAL data.
+	 */
+	WalStats.m_wal_records = pgWalUsage.wal_records;
+	WalStats.m_wal_fpi = pgWalUsage.wal_fpi;
+	WalStats.m_wal_bytes = pgWalUsage.wal_bytes;

This should be skipped if pgWalUsage.wal_records is zero?

+ * Be careful that the counters are cleared after reporting them to
+ * the stats collector although you can use WalUsageAccumDiff()
+ * to computing differences to previous values. For backends,
+ * the counters may be reset after a transaction is finished and
+ * pgstat_send_wal() is invoked, so you can compute the difference
+ * in the same transaction only.

On the second thought, I'm afraid that this can be likely to be a foot-gun
in the future. So I'm now wondering if the current approach (i.e., calculate
the diff between the current and previous pgWalUsage and don't reset it
to zero) is better. Thought? Since the similar data structure pgBufferUsage
doesn't have this kind of restriction, I'm afraid that the developer can
easily miss that only pgWalUsage has the restriction.

But currently the diff is calculated (i.e., WalUsageAccumDiff() is called)
even when the WAL counters are not updated. Then if that calculated diff is
zero, we skip sending the WAL stats. This logic should be improved. That is,
probably we should be able to check whether the WAL counters are updated
or not, without calculating the diff, because the calculation is not free.
We can implement this by introducing new flag variable that we discussed
upthread. This flag is set to true whenever the WAL counters are incremented
and used to determine whether the WAL stats need to be sent.

If we do this, another issue is that the data types for wal_records and wal_fpi
in pgWalUsage are long. Which may lead to overflow? If yes, it's should be
replaced with uint64.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#17Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#16)
Re: wal stats questions

On 2021/04/13 9:33, Fujii Masao wrote:

On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check using WAL
stats counters. But, the purpose is to make the conditions stricter, right?

Yes. Currently if the following condition is false even when the WAL counters
are updated, nothing is sent to the stats collector. But with your patch,
in this case the WAL stats are sent.

    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
        !have_function_stats && !disconnect)

Thanks for the patch! It now fails to be applied to the master cleanly.
So could you rebase the patch?

Thanks for your comments!
I rebased it.

pgstat_initialize() should initialize pgWalUsage with zero?

Thanks. But, I didn't handle it because I undo the logic to calculate the diff
as you mentioned below.

+    /*
+     * set the counters related to generated WAL data.
+     */
+    WalStats.m_wal_records = pgWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi;
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes;

This should be skipped if pgWalUsage.wal_records is zero?

Yes, fixed it.

+ * Be careful that the counters are cleared after reporting them to
+ * the stats collector although you can use WalUsageAccumDiff()
+ * to computing differences to previous values. For backends,
+ * the counters may be reset after a transaction is finished and
+ * pgstat_send_wal() is invoked, so you can compute the difference
+ * in the same transaction only.

On the second thought, I'm afraid that this can be likely to be a foot-gun
in the future. So I'm now wondering if the current approach (i.e., calculate
the diff between the current and previous pgWalUsage and don't reset it
to zero) is better. Thought? Since the similar data structure pgBufferUsage
doesn't have this kind of restriction, I'm afraid that the developer can
easily miss that only pgWalUsage has the restriction.

But currently the diff is calculated (i.e., WalUsageAccumDiff() is called)
even when the WAL counters are not updated. Then if that calculated diff is
zero, we skip sending the WAL stats. This logic should be improved. That is,
probably we should be able to check whether the WAL counters are updated
or not, without calculating the diff, because the calculation is not free.
We can implement this by introducing new flag variable that we discussed
upthread. This flag is set to true whenever the WAL counters are incremented
and used to determine whether the WAL stats need to be sent.

Sound reasonable. I agreed that the restriction has a risk to lead mistakes.
I made the patch introducing a new flag.

- v4-0001-performance-improvements-of-reporting-wal-stats.patch

I think introducing a new flag is not necessary because we can know if the WAL
stats are updated or not using the counters of the generated wal record, wal
writes and wal syncs. It's fast compared to get timestamp. The attached patch
is to check if the counters are updated or not using them.

-
v4-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch

If we do this, another issue is that the data types for wal_records and wal_fpi
in pgWalUsage are long. Which may lead to overflow? If yes, it's should be
replaced with uint64.

Yes, I fixed. BufferUsage's counters have same issue, so I fixed them too.

BTW, is it better to change PgStat_Counter from int64 to uint64 because there
aren't any counters with negative number?

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v4-0001-performance-improvements-of-reporting-wal-stats.patchtext/x-patch; charset=UTF-8; name=v4-0001-performance-improvements-of-reporting-wal-stats.patchDownload+131-86
v4-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v4-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload+115-85
#18torikoshia
torikoshia@oss.nttdata.com
In reply to: Masahiro Ikeda (#17)
Re: wal stats questions

On 2021-04-16 10:27, Masahiro Ikeda wrote:

On 2021/04/13 9:33, Fujii Masao wrote:

On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check
using WAL
stats counters. But, the purpose is to make the conditions stricter,
right?

Yes. Currently if the following condition is false even when the WAL
counters
are updated, nothing is sent to the stats collector. But with your
patch,
in this case the WAL stats are sent.

    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
        !have_function_stats && !disconnect)

Thanks for the patch! It now fails to be applied to the master
cleanly.
So could you rebase the patch?

Thanks for your comments!
I rebased it.

Thanks for working on this!

I have some minor comments on
performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch.

177 @@ -3094,20 +3066,33 @@ pgstat_report_wal(void)
178 * Return true if the message is sent, and false otherwise.

Since you changed the return value to void, it seems the description is
not necessary anymore.

208 + * generate wal records. 'wal_writes' and 'wal_sync' are
zero means the

It may be better to change 'wal_writes' to 'wal_write' since single
quotation seems to mean variable name.

234 + * set the counters related to generated WAL data if the
counters are

set -> Set?

Regards,

#19Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: torikoshia (#18)
Re: wal stats questions

On 2021/04/21 15:08, torikoshia wrote:

On 2021-04-16 10:27, Masahiro Ikeda wrote:

On 2021/04/13 9:33, Fujii Masao wrote:

On 2021/03/30 20:37, Masahiro Ikeda wrote:

OK, I added the condition to the fast-return check. I noticed that I
misunderstood that the purpose is to avoid expanding a clock check
using WAL stats counters. But, the purpose is to make the conditions
stricter, right?

Yes. Currently if the following condition is false even when the WAL
counters are updated, nothing is sent to the stats collector. But with
your patch, in this case the WAL stats are sent.

if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)

Thanks for the patch! It now fails to be applied to the master
cleanly. So could you rebase the patch?

Thanks for your comments! I rebased it.

Thanks for working on this!

Hi, thanks for your comments!

I have some minor comments on
performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patch.

177 @@ -3094,20 +3066,33 @@ pgstat_report_wal(void) 178 * Return true if
the message is sent, and false otherwise.

Since you changed the return value to void, it seems the description is not
necessary anymore.

Right, I fixed it.

208 + * generate wal records. 'wal_writes' and 'wal_sync' are zero
means the

It may be better to change 'wal_writes' to 'wal_write' since single
quotation seems to mean variable name.

Agreed.

234 + * set the counters related to generated WAL data if the
counters are

set -> Set?

Yes, I fixed.

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

Although this is not related to torikoshi-san's comment, the above my
understanding is not right. Some counters like delta_live_tuples,
delta_dead_tuples and changed_tuples can be negative.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachments:

v5-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchtext/x-patch; charset=UTF-8; name=v5-0001-performance-improvements-of-reporting-wal-stats-without-introducing-a-new-variable.patchDownload+115-87
#20Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#19)
Re: wal stats questions

On 2021/04/21 18:31, Masahiro Ikeda wrote:

BTW, is it better to change PgStat_Counter from int64 to uint64 because> there aren't any counters with negative number?

On second thought, it's ok even if the counters like wal_records get overflowed.
Because they are always used to calculate the diff between the previous and
current counters. Even when the current counters are overflowed and
the previous ones are not, WalUsageAccumDiff() seems to return the right
diff of them. If this understanding is right, I'd withdraw my comment and
it's ok to use "long" type for those counters. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#21Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#20)
#22Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Andres Freund (#21)
#23Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#22)
#24Andres Freund
andres@anarazel.de
In reply to: Masahiro Ikeda (#22)
#25Fujii Masao
masao.fujii@gmail.com
In reply to: Andres Freund (#24)
#26Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#25)
#27Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#26)
#28Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#27)
#29Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#28)
#30Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#28)
#31Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#29)
#32Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#31)
#33Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#32)
#34torikoshia
torikoshia@oss.nttdata.com
In reply to: Masahiro Ikeda (#33)
#35Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: torikoshia (#34)
#36Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#35)
#37Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#36)
#38Fujii Masao
masao.fujii@gmail.com
In reply to: Masahiro Ikeda (#37)
#39Masahiro Ikeda
ikedamsh@oss.nttdata.com
In reply to: Fujii Masao (#38)