shared-memory based stats collector
Hello.
This is intended to provide more stats, as discussed in the following thread.
/messages/by-id/20171010.192616.108347483.horiguchi.kyotaro@lab.ntt.co.jp
The biggest obstacle to having more items in the statistics
collector views comes from the mechanism used to share the values
among backends. It currently uses a file: the stats collector
writes a file on triggers from backends, then the backends read
the written file. A larger file makes the latency longer, and we
don't have spare bandwidth for additional statistics items.
Nowadays PostgreSQL has dynamic shared hash (dshash) so we can
use this as the main storage of statistics. We can share data
without stress using this.
A PoC previously posted tried to use "locally copied" dshash but
it didn't look right so I steered in a different direction.
With this patch dshash can create a local copy backed by dynahash.
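For readers new to dshash, here is a minimal sketch of the pattern this
builds on, using the existing API from src/include/lib/dshash.h (the
StatsEntry layout and the tranche id are illustrative placeholders, not
the patch's actual definitions):

    #include "lib/dshash.h"

    /* illustrative stats entry; the hash key must be the first field */
    typedef struct StatsEntry
    {
        Oid         tableid;        /* hash key */
        int64       seq_scan;       /* example counter */
    } StatsEntry;

    static const dshash_parameters stats_params = {
        sizeof(Oid),                /* key size */
        sizeof(StatsEntry),         /* entry size */
        dshash_memcmp,
        dshash_memhash,
        LWTRANCHE_FIRST_USER_DEFINED    /* placeholder tranche id */
    };

    /*
     * One process creates the table with dshash_create() in a dsa_area
     * and publishes dshash_get_hash_table_handle(); other backends then
     * dshash_attach() and can update entries directly:
     */
    static void
    bump_seq_scan(dshash_table *ht, Oid tableid)
    {
        bool        found;
        StatsEntry *entry;

        /* returns the entry with its partition lock held exclusively */
        entry = dshash_find_or_insert(ht, &tableid, &found);
        if (!found)
            entry->seq_scan = 0;    /* fresh entry: initialize payload */
        entry->seq_scan++;
        dshash_release_lock(ht, entry);
    }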
This patch consists of three files.
v1-0001-Give-dshash-ability-to-make-a-local-snapshot.patch
adds the ability for dshash to make a local copy backed by dynahash.
v1-0002-Change-stats-collector-to-an-axiliary-process.patch
changes the stats collector to an auxiliary process so that it
can attach dynamic shared memory.
v1-0003-dshash-based-stats-collector.patch
implements shared-memory based stats collector.
I'll put more detailed explanation later.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Jun 29, 2018 at 4:34 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Nowadays PostgreSQL has dynamic shared hash (dshash) so we can
use this as the main storage of statistics. We can share data
without stress using this.

A PoC previously posted tried to use "locally copied" dshash but
it didn't look right so I steered in a different direction.

With this patch dshash can create a local copy backed by dynahash.
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello. Thanks for the comment.
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
On Fri, Jun 29, 2018 at 4:34 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Nowadays PostgreSQL has dynamic shared hash (dshash) so we can
use this as the main storage of statistics. We can share data
without stress using this.
A PoC previously posted tried to use "locally copied" dshash but
it didn't look right so I steered in a different direction.
With this patch dshash can create a local copy backed by dynahash.

Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.

In addition, vacuum doesn't require even repeatable-read
consistency, so we don't need to cache the entries at all.
backend_get_tab_entry now returns an isolated (that is,
not-stored-in-hash) palloc'ed copy without making a local copy in
that case.
As a result, this version behaves as follows.
- Stats collector stores the results in shared memory.
- In backends, a cache is created only for requested objects and
lasts for the transaction.
- Vacuum directly reads the shared stats and doesn't create a
local copy.
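To make the "cache only requested objects" behavior concrete, here is a
minimal sketch under stated assumptions: the StatsEntry layout from the
sketch upthread, the dynahash-based cache, and the reset-at-commit
policy are all illustrative, not the patch's actual code:

    #include "utils/hsearch.h"

    /* backend-local cache; assumed to be destroyed at transaction end */
    static HTAB *stats_cache = NULL;

    static StatsEntry *
    fetch_table_stats(dshash_table *shared_ht, Oid tableid)
    {
        bool        found;
        StatsEntry *cached;

        if (stats_cache == NULL)
        {
            HASHCTL     ctl;

            memset(&ctl, 0, sizeof(ctl));
            ctl.keysize = sizeof(Oid);
            ctl.entrysize = sizeof(StatsEntry);
            stats_cache = hash_create("stats snapshot cache", 128, &ctl,
                                      HASH_ELEM | HASH_BLOBS);
        }

        cached = hash_search(stats_cache, &tableid, HASH_ENTER, &found);
        if (!found)
        {
            /* first access in this transaction: copy out of shared memory */
            StatsEntry *shared = dshash_find(shared_ht, &tableid, false);

            if (shared != NULL)
            {
                memcpy(cached, shared, sizeof(StatsEntry));
                dshash_release_lock(shared_ht, shared);
            }
            else
                cached->seq_scan = 0;   /* no stats yet for this table */
        }

        /* later reads in the same transaction return this same copy,
         * which gives the per-entry "repeatable read" described above */
        return cached;
    }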
The non-behavioral difference from v1 is as follows.
- snapshot feature of dshash is removed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. This is a new version that fixes the Windows build.
At Tue, 03 Jul 2018 19:01:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180703.190144.222427588.horiguchi.kyotaro@lab.ntt.co.jp>
Hello. Thanks for the comment.
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.

The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.

In addition, vacuum doesn't require even repeatable-read
consistency, so we don't need to cache the entries at all.
backend_get_tab_entry now returns an isolated (that is,
not-stored-in-hash) palloc'ed copy without making a local copy in
that case.

As a result, this version behaves as follows.
- Stats collector stores the results in shared memory.
- In backends, a cache is created only for requested objects and
lasts for the transaction.
- Vacuum directly reads the shared stats and doesn't create a
local copy.

The non-behavioral difference from v1 is as follows.
- snapshot feature of dshash is removed.
This version includes some additional patches. 0003 removes
PG_STAT_TMP_DIR, which affects pg_stat_statements, pg_basebackup
and pg_rewind. Among them, pg_stat_statements gets a build failure
because it uses the directory to save query texts. 0005 is a new
patch and moves the file to the permanent stats directory. With
this change pg_basebackup and pg_rewind no longer ignore the
query text file.
I haven't explicitly mentioned it, but
dynamic_shared_memory_type = none prevents the server from
starting. This patch does not provide a fallback path for that
case. I'm expecting that 'none' will be removed in v12.
v3-0001-sequential-scan-for-dshash.patch
- Functionally the same as v2, with cosmetic changes.
v3-0002-Change-stats-collector-to-an-axiliary-process.patch
- Fixed for the Windows build.
v3-0003-dshash-based-stats-collector.patch
- Cosmetic changes from v2.
v3-0004-Documentation-update.patch
- New patch in v3 containing documentation edits.
v3-0005-Let-pg_stat_statements-not-to-use-PG_STAT_TMP_DIR.patch
- New patch with a tentative change to pg_stat_statements.
v3-0006-Remove-pg_stat_tmp-exclusion-from-pg_resetwal.patch
- New patch with a tentative change to pg_rewind.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.
Uh, what? That's basically destroying the long-standing semantics of
statistics snapshots. I do not think we can consider that acceptable.
As an example, it would mean that scan counts for indexes would not
match up with scan counts for their tables.
regards, tom lane
Hello.
At Wed, 04 Jul 2018 17:23:51 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <67470.1530739431@sss.pgh.pa.us>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.

The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.

Uh, what? That's basically destroying the long-standing semantics of
statistics snapshots. I do not think we can consider that acceptable.
As an example, it would mean that scan counts for indexes would not
match up with scan counts for their tables.
The current stats collector mechanism sends at most 8 table stats
in a single message. Split messages from multiple transactions
can reach the collector in shuffled order. The resulting snapshot
can be "inconsistent" if an INQUIRY message comes between such
split messages. Of course a single message would be enough for
common transactions, but not for all.
Even though the inconsistency would happen more frequently with
this patch, I don't think users expect such strict consistency of
table stats, especially on a busy system. And I believe it's a
good thing if users see more "useful" information under the relaxed
consistency. (The actual meaning of "useful" is out of the
current focus :p)
Meanwhile, if we should keep the practical consistency, a giant
lock is out of the question. So we need transactional stats of
some shape. It could be a whole-image snapshot or a regular MVCC
table, or maybe the current dshash with UNDO logs. Since there
are actually many states, storage is inevitably required to
reproduce each state.
I think the consensus is that the whole-image snapshot takes
too much memory. MVCC is apparently too much for the purpose.
UNDO logs seem a bit promising. If we look at stats in a long
transaction, the required memory for UNDO information easily
reaches the same amount as the whole-image snapshot, but I
expect that that is not so common.
I'll consider that apart from the current patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Thu, 05 Jul 2018 12:04:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180705.120423.49626073.horiguchi.kyotaro@lab.ntt.co.jp>
UNDO logs seem a bit promising. If we look at stats in a long
transaction, the required memory for UNDO information easily
reaches the same amount as the whole-image snapshot, but I
expect that that is not so common.

I'll consider that apart from the current patch.
Done as a PoC. (Sorry for the format, since filterdiff generates
garbage from the patch..)
The attached v3-0008 is that: a PoC of UNDO logging of server
stats. It records undo logs only for table stats, and only if some
transaction has started accessing stats data, so the logging is
rarely performed. The undo logs are used at the first access to
each relation's stats and then cached. autovacuum and vacuum don't
request undoing since they just don't need it.
# v3-0007 is a trivial fix for v3-0003, which will be merged.
I see several arguable points in this feature.
- The undo logs are stored in a ring buffer with a fixed size,
currently 1000 entries (see the sketch after this list). If it
fills up, the consistency will be broken. An undo log is recorded
just once after the latest undo-recording transaction comes. It
is likely to be read in rather short-lived transactions and it's
likely that there are no more than several such transactions
simultaneously. It's possible to provide a dynamic resizing
feature, but it doesn't seem worth the complexity.
- Undoing runs a linear search on the ring buffer. It is done the
first time the stats for each relation are accessed. It can
be (a bit) slow when many log entries reside there.
(Concurrent vacuums can generate many undo log entries.)
- Undo logs for other stats don't seem to me to be needed,
but..
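To illustrate the ring-buffer design from the first point above, here is
a sketch only: the sizes, names, and the generation scheme are
assumptions, and it reuses the illustrative StatsEntry from upthread.

    #define UNDO_RING_SIZE 1000     /* fixed size, as described above */

    typedef struct StatsUndoEntry
    {
        Oid         tableid;
        uint64      gen;            /* generation when the image was taken */
        StatsEntry  before;         /* before-image of the stats entry */
    } StatsUndoEntry;

    static StatsUndoEntry undo_ring[UNDO_RING_SIZE];
    static int  undo_next = 0;      /* next slot to write */
    static int  undo_used = 0;      /* number of valid entries */

    /* record a before-image; overwrites the oldest slot when full, at
     * which point consistency can no longer be guaranteed */
    static void
    record_undo(Oid tableid, uint64 gen, const StatsEntry *before)
    {
        StatsUndoEntry *u = &undo_ring[undo_next];

        u->tableid = tableid;
        u->gen = gen;
        u->before = *before;
        undo_next = (undo_next + 1) % UNDO_RING_SIZE;
        if (undo_used < UNDO_RING_SIZE)
            undo_used++;
    }

    /* the linear search from the second point, run once per relation
     * at its first access in a transaction */
    static bool
    lookup_undo(Oid tableid, uint64 reader_gen, StatsEntry *out)
    {
        int         i;

        for (i = 0; i < undo_used; i++)
        {
            StatsUndoEntry *u = &undo_ring[i];

            if (u->tableid == tableid && u->gen >= reader_gen)
            {
                *out = u->before;   /* value as of the reader's start */
                return true;
            }
        }
        return false;               /* no undo: read the current value */
    }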
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname | seq_scan
t1 | 0
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname | seq_scan
t2 | 0
A=> BEGIN;
-- These take effect because no stats access has been made yet
B=> select * from t1;
B=> select * from t2;
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname | seq_scan
t1 | 1
-- These have no effect on A's view because undo logging is now active
B=> select * from t1;
B=> select * from t2;
<repeat two times>
-- This is the second time in this xact to request for t1,
-- just returns cached result.
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname | seq_scan
t1 | 1
-- This is the first time in this xact to request for t2. The
-- result is the one reconstructed from the undo log.
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname | seq_scan
t2 | 1
A=> COMMIT;
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname | seq_scan
t1 | 4
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname | seq_scan
t2 | 4
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Jul 4, 2018 at 11:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com>
wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>

Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.

The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.

Uh, what? That's basically destroying the long-standing semantics of
statistics snapshots. I do not think we can consider that acceptable.
As an example, it would mean that scan counts for indexes would not
match up with scan counts for their tables.
I agree that this is definitely something that needs to be considered. I
took a look some time ago at the same thing, and ran up against exactly
that one (and at the time did not have time to fix it).
I have not yet had time to look at the downstream suggested handling
(UNDO). However, I had one other thing from my notes I wanted to mention :)
We should probably consider adding an API to fetch counters that *don't*
follow these rules, in case it's not needed. When going through files we're
still stuck at that bottleneck, but if going through shared memory it
should be possible to make it a lot cheaper by volunteering to "not need
that".
We should also consider the ability to fetch stats for a single object,
which would require no copying of the whole structure at all. I think
something like this could for example be used for autovacuum rechecks. On
top of the file based transfer that would help very little, but through
shared memory it could be a lot lighter weight.
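A hypothetical sketch of what such a narrow interface could look like
(the function below does not exist anywhere; its name and signature are
purely illustrative of the idea, while PgStat_StatTabEntry is the
existing per-table stats struct):

    /* hypothetical: read one table's counters straight out of the
     * shared hash into caller-provided storage; no snapshot, no
     * whole-structure copying */
    extern bool pgstat_fetch_stat_tabentry_direct(Oid relid,
                                                  PgStat_StatTabEntry *dst);

    /* e.g. an autovacuum recheck might then do: */
    PgStat_StatTabEntry tabstats;

    if (pgstat_fetch_stat_tabentry_direct(relid, &tabstats))
    {
        /* compare tabstats.n_dead_tuples against the vacuum threshold */
    }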
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On Fri, Jul 6, 2018 at 10:29 AM, Magnus Hagander <magnus@hagander.net> wrote:
I agree that this is definitely something that needs to be considered. I
took a look some time ago at the same thing, and ran up against exactly that
one (and at the time did not have time to fix it).

I have not yet had time to look at the downstream suggested handling (UNDO).
However, I had one other thing from my notes I wanted to mention :)

We should probably consider adding an API to fetch counters that *don't*
follow these rules, in case it's not needed. When going through files we're
still stuck at that bottleneck, but if going through shared memory it should
be possible to make it a lot cheaper by volunteering to "not need that".

We should also consider the ability to fetch stats for a single object,
which would require no copying of the whole structure at all. I think
something like this could for example be used for autovacuum rechecks. On
top of the file based transfer that would help very little, but through
shared memory it could be a lot lighter weight.
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.
Note that commit 3cba8999b34 relaxed the synchronization requirements
around GetLockStatusData(). In other words, since 2011, you can no
longer be certain that 'select * from pg_locks' is returning a
perfectly consistent view of the lock status. If this has caused
anybody a major problem, I'm unaware of it. Maybe the same would end
up being true here. The amount of memory we're consuming for this
data may be a bigger problem than minor inconsistencies in the view of
the data would be.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.
Indeed. I don't think major additional memory or code complexity
is worthwhile in this situation. The likelihood of benefitting from more /
better stats seems far higher than a more accurate view of the stats -
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.
Greetings,
Andres Freund
On 07/06/2018 11:57 AM, Andres Freund wrote:
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.

Indeed. I don't think major additional memory or code complexity
is worthwhile in this situation. The likelihood of benefitting from more /
better stats seems far higher than a more accurate view of the stats -
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.
Will stats, if we move toward the suggested changes, be "less" accurate
than they are now? We already know that stats are generally not accurate,
but they are close enough. If we move toward this change, will it still
be close enough?
JD
Greetings,
Andres Freund
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****
On Fri, Jul 6, 2018 at 3:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
Will stats, if we move toward the suggested changes, be "less" accurate than
they are now? We already know that stats are generally not accurate, but they
are close enough. If we move toward this change, will it still be close
enough?
The proposed change would have no impact at all on the long-term
accuracy of the statistics. It would just mean that there would be
race conditions when reading them, so that for example you would be
more likely to see a count of heap scans that doesn't match the count
of index scans, because an update arrives in between when you read the
first value and when you read the second one. I don't see that
mattering a whole lot, TBH, but maybe I'm missing something.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-07-06 12:02:39 -0700, Joshua D. Drake wrote:
On 07/06/2018 11:57 AM, Andres Freund wrote:
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.

Indeed. I don't think major additional memory or code complexity
is worthwhile in this situation. The likelihood of benefitting from more /
better stats seems far higher than a more accurate view of the stats -
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.

Will stats, if we move toward the suggested changes, be "less" accurate than
they are now? We already know that stats are generally not accurate, but they
are close enough. If we move toward this change, will it still be close
enough?
I don't think there's a meaningful difference from before. And at the same
time less duplication / hardcoded structure will allow us to increase
the amount of stats we keep.
Greetings,
Andres Freund
On 07/06/2018 12:34 PM, Robert Haas wrote:
On Fri, Jul 6, 2018 at 3:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
Will stats, if we move toward the suggested changes, be "less" accurate than
they are now? We already know that stats are generally not accurate, but they
are close enough. If we move toward this change, will it still be close
enough?

The proposed change would have no impact at all on the long-term
accuracy of the statistics. It would just mean that there would be
race conditions when reading them, so that for example you would be
more likely to see a count of heap scans that doesn't match the count
of index scans, because an update arrives in between when you read the
first value and when you read the second one. I don't see that
mattering a whole lot, TBH, but maybe I'm missing something.
I agree that it probably isn't a big deal. Generally speaking, when we
look at stats it is to get an "idea" of what is going on. We don't care
if we are missing an increase/decrease of 20 in any particular value
within the stats. Based on this and what Andres said, it seems like a net
win to me.
JD
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****
On Fri, Jul 6, 2018 at 8:57 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.

Indeed. I don't think major additional memory or code complexity
is worthwhile in this situation. The likelihood of benefitting from more /
better stats seems far higher than a more accurate view of the stats -
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.
Definitely agreed.
*If* we can provide the snapshot view of them without too much overhead, I
think it's worth looking into that while *also* providing a lower overhead
interface for those that don't care about it.
If it ends up that keeping the snapshots becomes too much overhead, either
in performance or code maintenance, then I agree we can probably drop that.
But we should at least properly investigate the cost.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hi,
On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
*If* we can provide the snapshot view of them without too much overhead, I
think it's worth looking into that while *also* providing a lower overhead
interface for those that don't care about it.
I don't see how that's possible without adding significant amounts of
complexity and probably memory / cpu overhead. The current stats already
are quite inconsistent (often outdated, partially updated, messages
dropped when busy) - I don't see what we really gain by building
something MVCC like in the "new" stats subsystem.
If it ends up that keeping the snapshots becomes too much overhead, either
in performance or code maintenance, then I agree we can probably drop that.
But we should at least properly investigate the cost.
I don't think it's worthwhile to do more than think a bit about it. There are
fairly obvious tradeoffs in complexity here. Trying to get there seems
like a good way to make the feature too big.
Greetings,
Andres Freund
Hello. Thanks for the opinions.
At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund <andres@anarazel.de> wrote in <20180706201036.awheoi6tk556x6aj@alap3.anarazel.de>
Hi,
On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
*If* we can provide the snapshot view of them without too much overhead, I
think it's worth looking into that while *also* providing a lower overhead
interface for those that don't care about it.

I don't see how that's possible without adding significant amounts of
complexity and probably memory / cpu overhead. The current stats already
are quite inconsistent (often outdated, partially updated, messages
dropped when busy) - I don't see what we really gain by building
something MVCC like in the "new" stats subsystem.

If it ends up that keeping the snapshots becomes too much overhead, either
in performance or code maintenance, then I agree we can probably drop that.
But we should at least properly investigate the cost.

I don't think it's worthwhile to do more than think a bit about it. There are
fairly obvious tradeoffs in complexity here. Trying to get there seems
like a good way to make the feature too big.
Agreed.
Well, if we allow losing consistency to some extent for improved
performance and a smaller footprint, relaxing the consistency of
database stats can reduce the footprint further, especially on a
cluster with many databases. Backends are interested only in
the database they reside in, and vacuum doesn't cache stats at all. A
possible problem is that vacuum and the stats collector can get into
a race condition. I'm not sure, but I suppose it is not worse than
being involved in an IO congestion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 07/10/2018 02:07 PM, Kyotaro HORIGUCHI wrote:
Hello. Thanks for the opinions.
At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund <andres@anarazel.de> wrote in <20180706201036.awheoi6tk556x6aj@alap3.anarazel.de>
Hi,
On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
*If* we can provide the snapshot view of them without too much overhead, I
think it's worth looking into that while *also* providing a lower overhead
interface for those that don't care about it.

I don't see how that's possible without adding significant amounts of
complexity and probably memory / cpu overhead. The current stats already
are quite inconsistent (often outdated, partially updated, messages
dropped when busy) - I don't see what we really gain by building
something MVCC like in the "new" stats subsystem.

If it ends up that keeping the snapshots becomes too much overhead, either
in performance or code maintenance, then I agree we can probably drop that.
But we should at least properly investigate the cost.

I don't think it's worthwhile to do more than think a bit about it. There are
fairly obvious tradeoffs in complexity here. Trying to get there seems
like a good way to make the feature too big.

Agreed.
Well, if we allow losing consistency to some extent for improved
performance and a smaller footprint, relaxing the consistency of
database stats can reduce the footprint further, especially on a
cluster with many databases. Backends are interested only in
the database they reside in, and vacuum doesn't cache stats at all. A
possible problem is that vacuum and the stats collector can get into
a race condition. I'm not sure, but I suppose it is not worse than
being involved in an IO congestion.
As someone who regularly analyzes stats collected from user systems, I
think there's certainly some value in keeping the snapshots reasonably
consistent. But I agree it doesn't need to be perfect, and some level of
inconsistency is acceptable (and the amount of complexity/overhead
needed to maintain perfect consistency seems rather excessive here).
There's one more reason why attempts to keep stats snapshots "perfectly"
consistent are likely doomed to fail - the messages are sent over UDP,
which does not guarantee delivery etc. So there's always some level of
possible inconsistency even with "perfectly consistent" snapshots.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-07-10 14:52:13 +0200, Tomas Vondra wrote:
There's one more reason why attempts to keep stats snapshots "perfectly"
consistent are likely doomed to fail - the messages are sent over UDP, which
does not guarantee delivery etc. So there's always some level of possible
inconsistency even with "perfectly consistent" snapshots.
FWIW, I don't see us continuing to do so if we go for a shared hashtable
for stats.
- Andres
I've spent some time reviewing this version.
Design
------
1. Even with your patch the stats collector still uses a UDP socket to
receive data. Now that the shared memory API is there, shouldn't the
messages be sent via a shared memory queue? [1] That would increase the
reliability of message delivery.
I can actually imagine backends inserting data into the shared hash tables
themselves, but that might make them wait if the same entries are accessed
by another backend. It should be much cheaper to just insert a message into
the queue and let the collector process it. In a future version the collector
can launch parallel workers so that writes by backends do not get blocked
due to a full queue.
2. I think the access to the shared hash tables introduces more contention
than necessary. For example, pgstat_recv_tabstat() retrieves "dbentry" and
leaves the containing hash table partition locked *exclusively* even when it
only changes the contained table entries, once the changes of the containing
dbentry are done.
It appears that the shared hash tables are only modified by the stats
collector. The unnecessary use of the exclusive lock might become a bigger
issue in the future if the stats collector uses parallel
workers. Monitoring functions and autovacuum are affected by the locking
now.
(I see that it's not trivial to get a just-created entry locked in shared
mode: it may need a loop in which we release the exclusive lock and acquire
the shared lock, unless the entry was already removed.)
3. Data in both shared_archiverStats and shared_globalStats is mostly accessed
w/o locking. Is that ok? I'd expect the StatsLock to be used for these.
Coding
------
* git apply v4-0003-dshash-based-stats-collector.patch needed manual
resolution of one conflict.
* pgstat_quickdie_handler() appears to be the only "quickdie handler" that
calls on_exit_reset(), although the comments are almost copy & pasted from
such a handler of other processes. Can you please explain what's specific
about pgstat.c?
* the variable name "area" would be sufficient if it was local to some
function, otherwise I think the name is too generic.
* likewise db_stats is too generic for a global variable. How about
"snapshot_db_stats_local"?
* backend_get_db_entry() passes 0 for handle to snapshot_statentry(). How
about DSM_HANDLE_INVALID ?
* I only see one call of snapshot_statentry_all() and it receives 0 for
handle. Thus the argument can be removed and the function does not have to
attach / detach to / from the shared hash table.
* backend_snapshot_global_stats() switches to TopMemoryContext before it calls
pgstat_attach_shared_stats(), but the latter takes care of the context
itself.
* pgstat_attach_shared_stats() - header comment should explain what the return
value means.
* reset_dbentry_counters() does more than just resetting the counters. Name
like initialize_dbentry() would be more descriptive.
* typos:
** backend_snapshot_global_stats(): "liftime" -> "lifetime"
** snapshot_statentry(): "entriy" -> "entry"
** backend_get_func_etnry(): "onshot" -> "oneshot"
** snapshot_statentry_all(): "Returns a local hash contains ..." -> "Returns a local hash containing ..."
[1]: /messages/by-id/20180711000605.sqjik3vqe5opqz33@alap3.anarazel.de
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
Hi,
On 2018-09-20 09:55:27 +0200, Antonin Houska wrote:
I've spent some time reviewing this version.
Design
------

1. Even with your patch the stats collector still uses a UDP socket to
receive data. Now that the shared memory API is there, shouldn't the
messages be sent via a shared memory queue? [1] That would increase the
reliability of message delivery.

I can actually imagine backends inserting data into the shared hash tables
themselves, but that might make them wait if the same entries are accessed
by another backend. It should be much cheaper to just insert a message into
the queue and let the collector process it. In a future version the collector
can launch parallel workers so that writes by backends do not get blocked
due to a full queue.
I don't think either of these is right. I think it's crucial to get rid
of the UDP socket, but I think using a shmem queue is the wrong
approach. Not just because postgres' shm_mq is single-reader/writer, but
also because it's plainly unnecessary. Backends should attempt to
update the shared hashtable, but acquire the necessary lock
conditionally, and leave the pending updates of the shared hashtable to
a later time if they cannot acquire the lock.
Greetings,
Andres Freund
Hello. Thank you for the comments.
At Thu, 20 Sep 2018 10:37:24 -0700, Andres Freund <andres@anarazel.de> wrote in <20180920173724.5w2n2nwkxtyi4azw@alap3.anarazel.de>
Hi,
On 2018-09-20 09:55:27 +0200, Antonin Houska wrote:
I've spent some time reviewing this version.
Design
------

1. Even with your patch the stats collector still uses a UDP socket to
receive data. Now that the shared memory API is there, shouldn't the
messages be sent via a shared memory queue? [1] That would increase the
reliability of message delivery.

I can actually imagine backends inserting data into the shared hash tables
themselves, but that might make them wait if the same entries are accessed
by another backend. It should be much cheaper to just insert a message into
the queue and let the collector process it. In a future version the collector
can launch parallel workers so that writes by backends do not get blocked
due to a full queue.

I don't think either of these is right. I think it's crucial to get rid
of the UDP socket, but I think using a shmem queue is the wrong
approach. Not just because postgres' shm_mq is single-reader/writer, but
also because it's plainly unnecessary. Backends should attempt to
update the shared hashtable, but acquire the necessary lock
conditionally, and leave the pending updates of the shared hashtable to
a later time if they cannot acquire the lock.
Ok, I just intended to avoid reading many bytes from a file and
thought that the writer side could be resolved later.

Currently, locks on the shared stats table are acquired by the dshash
mechanism in a partition-wise manner. The number of partitions
is currently fixed at 2^7 = 128, but writes to the
same table conflict with each other regardless of the number of
partitions. As the first step, I'm going to add
conditional-locking capability to dshash_find_or_insert, and each
backend will hold a queue of its pending updates.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. This is a super-PoC of no-UDP stats collector.
At Wed, 26 Sep 2018 09:55:09 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180926.095509.182252925.horiguchi.kyotaro@lab.ntt.co.jp>
I don't think either of these is right. I think it's crucial to get rid
of the UDP socket, but I think using a shmem queue is the wrong
approach. Not just because postgres' shm_mq is single-reader/writer, but
also because it's plainly unnecessary. Backends should attempt to
update the shared hashtable, but acquire the necessary lock
conditionally, and leave the pending updates of the shared hashtable to
a later time if they cannot acquire the lock.

Ok, I just intended to avoid reading many bytes from a file and
thought that the writer side could be resolved later.

Currently, locks on the shared stats table are acquired by the dshash
mechanism in a partition-wise manner. The number of partitions
is currently fixed at 2^7 = 128, but writes to the
same table conflict with each other regardless of the number of
partitions. As the first step, I'm going to add
conditional-locking capability to dshash_find_or_insert, and each
backend will hold a queue of its pending updates.
I don't have more time until next Monday, so this is just a PoC
(sorry..).
- 0001 to 0006 are rebased versions of v4.
- 0007 adds conditional locking to dshash
- 0008 is the no-UDP stats collector.
If the required lock is not acquired for some stats items, the reporting
functions immediately return after storing the values locally. The
stored values are merged with later calls. Explicitly calling
pgstat_cleanup_pending_stat() at a convenient time tries to
apply the pending values, but the function is not called anywhere
for now.
The stats collector process is used only to save and load saved stats
files and to create shared memory for stats. I'm going to remove the
stats collector.
I'll continue working this way.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
The previous patch doesn't work...
At Thu, 27 Sep 2018 22:00:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180927.220049.168546206.horiguchi.kyotaro@lab.ntt.co.jp>
- 0001 to 0006 are rebased versions of v4.
- 0007 adds conditional locking to dshash
- 0008 is the no-UDP stats collector.

If the required lock is not acquired for some stats items, the reporting
functions immediately return after storing the values locally. The
stored values are merged with later calls. Explicitly calling
pgstat_cleanup_pending_stat() at a convenient time tries to
apply the pending values, but the function is not called anywhere
for now.

The stats collector process is used only to save and load saved stats
files and to create shared memory for stats. I'm going to remove the
stats collector.

I'll continue working this way.
It doesn't work or even compile, since I failed to include some
changes. The attached v6-0008 at least compiles and works.
0001-0007 are not attached since they are still applicable on
master head with offsets.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Tue, 02 Oct 2018 16:06:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20181002.160651.117284090.horiguchi.kyotaro@lab.ntt.co.jp>
It doesn't work or even compile, since I failed to include some
changes. The attached v6-0008 at least compiles and works.

0001-0007 are not attached since they are still applicable on
master head with offsets.
In this patchset, 0001-0007 are still the same as in the previous
version. I'll reorganize the whole patchset in the next version.
This is a saner version of the previous v5-0008, which didn't pass
the regression test. v6-0008 to v6-0010 are attached and they
apply on top of v5-0001-0007.
- stats collector has been removed.
- modified dshash further so that deletion is allowed during
sequential scan.
- I'm not sure about the following existing comment at the
beginning of pgstat.c
* - Add a pgstat config column to pg_database, so this
* entire thing can be enabled/disabled on a per db basis.
Some points known to need consideration are:
1. Concurrency is controlled by the per-database entry in the
db_stats dshash. It has 128 lock partitions, but all backends on
the same database share just one lock, and only one backend takes
the right to update stats. (No backend updates stats at
intervals shorter than 500ms, like the current stats
collector.) Table stats can be removed by DROP DATABASE
simultaneously with stats updates, so that needs to be blocked
using the per-database lock. Some locking means other than dshash
might be needed.
2. Since dshash cannot allow multiple locks because of resizing,
pgstat_update_stat is forced to be a bit inefficient.
It loops over the stats list twice, for shared tables and regular
tables, since we can acquire a lock on only one database at once.
Maybe providing individual TabStatusArrays for the two will fix
it; I will do that in the next version.
3. This adds a new timeout, IDLE_STATS_UPDATE_TIMEOUT. It works
similarly to IDLE_IN_TRANSACTION_SESSION_TIMEOUT. It fires in
at most PGSTAT_STAT_MIN_INTERVAL (500) ms to clean up pending
statistics updates.
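(For reference, a sketch of how such a timeout is typically wired up with
the existing timeout.c API; the handler name is illustrative:)

    /* at backend startup */
    RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
                    idle_stats_update_timeout_handler);

    /* when going idle with pending stats: fire within 500ms */
    enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
                         PGSTAT_STAT_MIN_INTERVAL);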
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This is a saner version of the previous v5-0008, which didn't pass
the regression test. v6-0008 to v6-0010 are attached and they
apply on top of v5-0001-0007.

- stats collector has been removed.
- modified dshash further so that deletion is allowed during
sequential scan.
- I'm not sure about the following existing comment at the
beginning of pgstat.c

* - Add a pgstat config column to pg_database, so this
* entire thing can be enabled/disabled on a per db basis.
Following is the next handful of my comments:
* If you remove the stats collector, I think the remaining code in pgstat.c
does no longer fit into the backend/postmaster/ directory.
* I'm not sure it's o.k. to call pgstat_write_statsfiles() from
postmaster.c:reaper(): the function can raise ERROR (I see at least one code
path: pgstat_write_statsfile() -> get_dbstat_filename()) and, as reaper() is
a signal handler, it's hard to imagine the consequences. Maybe a reason to
leave some functionality in a separate worker, although the "stats
collector" would have to be changed.
* The question still remains whether all the statistics should be loaded into
shared memory; see the note on paging near the bottom of [1].
* if dshash_seq_init() is passed consistent=false, shouldn't we call
ensure_valid_bucket_pointers() also from dshash_seq_next()? If the scan
needs to access the next partition and the old partition lock got released,
the table can be resized before the next partition lock is acquired, and
thus the backend-local copy of buckets becomes obsolete.
* Neither snapshot_statentry_all() nor backend_snapshot_all_db_entries() seems
to be used in the current patch version.
* pgstat_initstats(): I think WaitLatch() should be used instead of sleep().
* pgstat_get_db_entry(): "return false" should probably be "return NULL".
* Is the PGSTAT_TABLE_WRITE flag actually used? Unlike PGSTAT_TABLE_CREATE, I
couldn't find a place where its value is tested.
* dshash_seq_init(): does it need to be called with consistent=true from
pgstat_vacuum_stat() when the entries returned by the scan are just
dropped?
dshash_seq_init(&dshstat, db_stats, true, true);
I suspect this is a thinko because another call from the same function looks
like
dshash_seq_init(&dshstat, dshtable, false, true);
* I'm not sure about the usefulness of dshash_get_num_entries(). It passes
consistent=false to dshash_seq_init(), so the number of entries can change
during the execution. And even if the function requested a "consistent
scan", entries can be added / removed as soon as the scan is over.
If the returned value is used to decide whether the hashtable should be
scanned or not, I think you don't break anything if you simply start the
scan unconditionally and see if you find some entries.
And if you really need to count the entries, I suggest that you use the
per-partition counts (dshash_partition.count) instead of scanning individual
entries.
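(Sketched, that suggestion would look roughly like the following;
dshash_partition.count and DSHASH_NUM_PARTITIONS are private to dshash.c,
so the function would have to live there:)

    size_t
    dshash_get_num_entries(dshash_table *hash_table)
    {
        size_t      count = 0;
        int         i;

        /* sum the per-partition counters instead of walking the entries;
         * the result is approximate unless all partition locks are held */
        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
            count += hash_table->control->partitions[i].count;

        return count;
    }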
[1]: /messages/by-id/CA+TgmobQVbz4K_+RSmiM9HeRKpy3vS5xnbkL95gSEnWijzprKQ@mail.gmail.com
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com
Hi,
I've started looking at the patch over the past few days. I don't have
any deep insights at this point, but there seems to be some sort of
issue in pgstat_update_stat. When building using gcc, I do get this warning:
pgstat.c: In function ‘pgstat_update_stat’:
pgstat.c:648:18: warning: ‘now’ may be used uninitialized in this
function [-Wmaybe-uninitialized]
oldest_pending = now;
~~~~~~~~~~~~~~~^~~~~
PostgreSQL installation complete.
which kinda makes sense, because 'now' is set only in the (!force)
branch. So if the very first call to pgstat_update_stat is with
force=true, it's not set, and the code executes this:
/* record oldest pending update time */
if (pgStatPendingTabHash == NULL)
oldest_pending = 0;
else if (oldest_pending == 0)
oldest_pending = now;
at which point we set "oldest_pending = now" with "now" containing some
random garbage.
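(One minimal fix, as a sketch: take the timestamp unconditionally before
anything can consume it, rather than only in the !force branch:)

    TimestampTz now = GetCurrentTimestamp();    /* taken unconditionally */

    /* ... */

    if (pgStatPendingTabHash == NULL)
        oldest_pending = 0;
    else if (oldest_pending == 0)
        oldest_pending = now;   /* no longer garbage when force=true */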
When running this under valgrind, I get a couple of warnings in this
area of code - see the attached log with a small sample. Judging by the
locations I assume those are related to the same issue, but I have not
looked into that.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 10/05/2018 10:30 AM, Kyotaro HORIGUCHI wrote:
Hello.
At Tue, 02 Oct 2018 16:06:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20181002.160651.117284090.horiguchi.kyotaro@lab.ntt.co.jp>
It doesn't work or even compile, since I failed to include some
changes. The attached v6-0008 at least compiles and works.

0001-0007 are not attached since they are still applicable on
master head with offsets.

In this patchset, 0001-0007 are still the same as in the previous
version. I'll reorganize the whole patchset in the next version.

This is a saner version of the previous v5-0008, which didn't pass
the regression test. v6-0008 to v6-0010 are attached and they
apply on top of v5-0001-0007.
BTW one more thing - I strongly recommend always attaching the whole
patch series, even if some of the parts did not change.
Firstly, it makes the reviewer's life much easier, because it's not
necessary to hunt through past messages for all the bits and resolve
potential conflicts (e.g. there are two 0008 in recent messages).
Secondly, it makes http://commitfest.cputube.org/ work - it tries to
apply patches from a single message, which fails when some of the parts
are omitted.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for the comments, Antonin, Tomas.
At Tue, 30 Oct 2018 13:35:23 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <b38591e0-54ca-7a27-813b-0bf91a204c5b@2ndquadrant.com>
On 10/05/2018 10:30 AM, Kyotaro HORIGUCHI wrote:
Hello.
At Tue, 02 Oct 2018 16:06:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20181002.160651.117284090.horiguchi.kyotaro@lab.ntt.co.jp>
It doesn't work or even compile, since I failed to include some
changes. The attached v6-0008 at least compiles and works.

0001-0007 are not attached since they are still applicable on
master head with offsets.

In this patchset, 0001-0007 are still the same as in the previous
version. I'll reorganize the whole patchset in the next version.

This is a saner version of the previous v5-0008, which didn't pass
the regression test. v6-0008 to v6-0010 are attached and they
apply on top of v5-0001-0007.

BTW one more thing - I strongly recommend always attaching the whole
patch series, even if some of the parts did not change.

Firstly, it makes the reviewer's life much easier, because it's not
necessary to hunt through past messages for all the bits and resolve
potential conflicts (e.g. there are two 0008 in recent messages).

Secondly, it makes http://commitfest.cputube.org/ work - it tries to
apply patches from a single message, which fails when some of the parts
are omitted.
Yeah, I know about the second point. About the first point, I'm
not sure which makes review easier: confirming the differences in all
files, or finding the not-modified portions upthread. But OK, I'll
follow your suggestion. I'll attach the whole patch set and add
a note about the differences.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Thank you for the comments.
This message contains the whole refactored patch set.
At Mon, 29 Oct 2018 15:10:10 +0100, Antonin Houska <ah@cybertec.at> wrote in <28855.1540822210@localhost>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This is a saner version of the previous v5-0008, which didn't pass
the regression test. v6-0008 to v6-0010 are attached and they
apply on top of v5-0001-0007.

- stats collector has been removed.
- modified dshash further so that deletion is allowed during
sequential scan.
- I'm not sure about the following existing comment at the
beginning of pgstat.c

* - Add a pgstat config column to pg_database, so this
* entire thing can be enabled/disabled on a per db basis.

Following is the next handful of my comments:
* If you remove the stats collector, I think the remaining code in pgstat.c
does no longer fit into the backend/postmaster/ directory.
I didn't consider that, but I still can't find a nice place among
the existing directories. backend/statistics may be it, but I feel
somewhat uneasy.. Finally I split pgstat.c into two files
(as suggested in the file comment) and put both of them into a new
directory, backend/statmon. One part is the backend status facility,
named bestatus (BackEnd Status), and the other is pgstat, the access
statistics part. The last 0008 patch does that. I tried to move
it earlier but it was a bit tough.
* I'm not sure it's o.k. to call pgstat_write_statsfiles() from
postmaster.c:reaper(): the function can raise ERROR (I see at least one code
path: pgstat_write_statsfile() -> get_dbstat_filename()) and, as reaper() is
a signal handler, it's hard to imagine the consequences. Maybe a reason to
leave some functionality in a separate worker, although the "stats
collector" would have to be changed.
I was careless about that. longjmp() in a signal handler is
inhibited, so we mustn't emit ERROR there. In the first place, they
were placed in the wrong places from several perspectives. I changed
the way stats are loaded and stored. In the attached patch (0003),
loading is performed while initializing shared memory on postmaster,
and storing is done in a shutdown hook on postmaster. Since the
shared memory area is inherited by all children, no process actually
does the initial attaching any longer. In addition, the archiver
process became an auxiliary process (0004) since it writes to the
shared stats.
* Question still remains whether all the statistics should be loaded into
shared memory, see the note on paging near the bottom of [1].
Even counting possible page-in latency on stats writing, I agree
with what Robert said in that message: we will win on average
for users who don't create so many databases. If some stats page
to be written were paged out, the related heap pages would also
have been evicted from shared buffers (or the buffer pages
themselves may have been paged out), and every resource that can
be stashed out may be stashed out. So I don't think it becomes a
serious problem. On reading stats, we are currently reading a file,
and sometimes waiting for an up-to-date file to be made. I think
that case needs no further explanation.
For a cluster with very many databases, a backend running on a
database will mainly see only the stats for the current database
(and for shared tables); we can split stats by that criterion in
the next step.
* if dshash_seq_init() is passed consistent=false, shouldn't we call
ensure_valid_bucket_pointers() also from dshash_seq_next()? If the scan
needs to access the next partition and the old partition lock got released,
the table can be resized before the next partition lock is acquired, and
thus the backend-local copy of buckets becomes obsolete.
Oops. You're right. In addition, resizing can happen while
dshash_seq_next moves the lock to the next partition, and resizing
on the way breaks sequential scan semantics. I added
ensure_valid_bucket_pointers() after the initial acquisition of the
partition lock, and the lock now moves seamlessly during the scan. (0001)
* Neither snapshot_statentry_all() nor backend_snapshot_all_db_entries() seems
to be used in the current patch version.
Thanks. This is not used since we concluded that we no longer
need strict consistency in stats numbers. Removed. (0003)
* pgstat_initstats(): I think WaitLatch() should be used instead of sleep().
The bgwriter and checkpointer waited for postmaster's loading of the
stats files, but I changed the startup sequence (as mentioned
above), so the wait became useless. Most of the waits are replaced
with Assert. (0003)
* pgstat_get_db_entry(): "return false" should probably be "return NULL".
I don't find that. (Wouldn't it be caught by the compiler?) Maybe it is
"found = false"? (it might be a bit tricky)
* Is the PGSTAT_TABLE_WRITE flag actually used? Unlike PGSTAT_TABLE_CREATE, I
couldn't find a place where its value is tested.
Thank you for finding that. As you pointed out, PGSTAT_TABLE_WRITE is
in fact not used, since WRITE is always accompanied by CREATE in
the patch. I think WRITE is more readable than CREATE there, so I
removed CREATE. I renamed all PGSTAT_TABLE_ symbols as
follows while fixing this.
PGSTAT_TABLE_READ -> PGSTAT_FETCH_SHARED
PGSTAT_TABLE_WRITE -> PGSTAT_FETCH_EXCLUSIVE
PGSTAT_TABLE_NOWAIT -> PGSTAT_FETCH_NOWAIT
PGSTAT_TABLE_NOT_FOUND -> PGSTAT_ENTRY_NOT_FOUND
PGSTAT_TABLE_FOUND -> PGSTAT_ENTRY_FOUND
PGSTAT_TABLE_LOCK_FAILED -> PGSTAT_LOCK_FAILED
* dshash_seq_init(): does it need to be called with consistent=true from
pgstat_vacuum_stat() when the entries returned by the scan are just
dropped?

dshash_seq_init(&dshstat, db_stats, true, true);

I suspect this is a thinko because another call from the same function looks
like

dshash_seq_init(&dshstat, dshtable, false, true);
It's a leftover, just migrated from the previous (v4) snapshot-based
code. Snapshots needed that consistency, but it no longer looks
useful. (0003)
As a result, consistent=false at all call sites, so I could remove the
parameter, but I'm leaving it alone for a while.
* I'm not sure about the usefulness of dshash_get_num_entries(). It passes
consistent=false to dshash_seq_init(), so the number of entries can change
during the execution. And even if the function requested a "consistent
scan", entries can be added / removed as soon as the scan is over.
If the returned value is used to decide whether the hashtable should be
scanned or not, I think you don't break anything if you simply start the
scan unconditionally and see if you find some entries.
And if you really need to count the entries, I suggest that you use the
per-partition counts (dshash_partition.count) instead of scanning individual
entries.
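For illustration, such a counting function could look roughly like this (sketch only; it relies on dshash.c's private dshash_partition layout, so it would have to live in dshash.c):
    /* Sketch: sum the per-partition counters instead of scanning
     * individual entries.  The result is still only a snapshot. */
    size_t
    dshash_get_num_entries(dshash_table *hash_table)
    {
        size_t      count = 0;
        int         i;

        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
        {
            LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
            count += hash_table->control->partitions[i].count;
            LWLockRelease(PARTITION_LOCK(hash_table, i));
        }
        return count;
    }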
It was mainly to avoid a useless call to pgstat_collect_oid(). The
shortcut is useful because function stats are usually not collected.
Instead, I removed the function and now create the function-stats
dshash on demand, and changed the condition "dshash_get_num_entries() >
0" to "dbentry->functions != DSM_HANDLE_INVALID". (0003)
[1] /messages/by-id/CA+TgmobQVbz4K_+RSmiM9HeRKpy3vS5xnbkL95gSEnWijzprKQ@mail.gmail.com
A new version of this patch is attached to a reply to Tomas's message.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. Thank you for looking this.
At Tue, 30 Oct 2018 01:49:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <5253d750-890b-069b-031f-2a9b73e47832@2ndquadrant.com>
Hi,
I've started looking at the patch over the past few days. I don't have
any deep insights at this point, but there seems to be some sort of
issue in pgstat_update_stat. When building using gcc, I do get this
warning:
pgstat.c: In function 'pgstat_update_stat':
pgstat.c:648:18: warning: ‘now’ may be used uninitialized in this
function [-Wmaybe-uninitialized]
oldest_pending = now;
~~~~~~~~~~~~~~~^~~~~
PostgreSQL installation complete.
Uggh! The reason is that the assignment "last_report = now" comes
later in the function than this code... Fixed.
When running this under valgrind, I get a couple of warnings in this
area of code - see the attached log with a small sample. Judging by
the locations I assume those are related to the same issue, but I have
not looked into that.
There were several typos/thinkos related to variables changed into
pointers. For example, there was code like the following in the
original:
memset(&shared_globalStats, 0, sizeof(shared_globalStats));
It was not adjusted even though this patch changes the type of the
variable from PgStat_GlobalStats to PgStat_GlobalStats *. As a result
the major part of the variable remained uninitialized.
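In other words (the variable name is the patch's; the rest only illustrates the thinko):
    PgStat_GlobalStats *shared_globalStats;     /* now a pointer */

    /* buggy: zeroes only the pointer variable itself */
    memset(&shared_globalStats, 0, sizeof(shared_globalStats));

    /* fixed: zeroes the struct the pointer points to */
    memset(shared_globalStats, 0, sizeof(*shared_globalStats));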
I re-ran this version under valgrind and didn't see any such
problem. Thank you for the testing.
I refactored the patch into a complete set consisting of 8 files.
v7-0001-sequential-scan-for-dshash.patch
- dshash sequential scan feature
v7-0002-Add-conditional-lock-feature-to-dshash.patch
- dshash conditional lock feature
v7-0003-Shared-memory-based-stats-collector.patch
- Shared-memory based stats collector.
- Remove stats collector process
- Change stats collector to be shared-memory based
(This needs 0004 to work, but it is currently separate from
this for readability)
v7-0004-Make-archiver-process-an-auxiliary-process.patch
- Archiver process needs to touch shared memory.
(I didn't check EXEC_BACKEND case)
v7-0005-Let-pg_stat_statements-not-to-use-PG_STAT_TMP_DIR.patch
- I removed the pg_stat_tmp directory, so the pg_stat_statements
file stored there moves to the pg_stat directory. (This would need a fix)
v7-0006-Remove-pg_stat_tmp-exclusion-from-pg_rewind.patch
- For the same reason as 0005.
v7-0007-Documentation-update.patch
- Removes description related to pg_stat_tmp.
v7-0008-Split-out-backend-status-monitor-part-from-pgstat.patch
- Just refactoring. Splits the current postmaster/pgstat.c into
two files statmon/pgstat.c and statmon/bestatus.c.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 11/8/18 12:46 PM, Kyotaro HORIGUCHI wrote:
Hello. Thank you for looking this.
At Tue, 30 Oct 2018 01:49:59 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <5253d750-890b-069b-031f-2a9b73e47832@2ndquadrant.com>
Hi,
I've started looking at the patch over the past few days. I don't have
any deep insights at this point, but there seems to be some sort of
issue in pgstat_update_stat. When building using gcc, I do get this
warning:
pgstat.c: In function 'pgstat_update_stat':
pgstat.c:648:18: warning: 'now' may be used uninitialized in this
function [-Wmaybe-uninitialized]
oldest_pending = now;
~~~~~~~~~~~~~~~^~~~~
PostgreSQL installation complete.
Uggh! The reason is that the assignment "last_report = now" comes
later in the function than this code... Fixed.
When running this under valgrind, I get a couple of warnings in this
area of code - see the attached log with a small sample. Judging by
the locations I assume those are related to the same issue, but I have
not looked into that.
There were several typos/thinkos related to variables changed into
pointers. For example, there was code like the following in the
original:
memset(&shared_globalStats, 0, sizeof(shared_globalStats));
It was not adjusted even though this patch changes the type of the
variable from PgStat_GlobalStats to PgStat_GlobalStats *. As a result
the major part of the variable remained uninitialized.
I re-ran this version under valgrind and didn't see any such
problem. Thank you for the testing.
OK, regression tests now seem to pass without any valgrind issues.
However, quite a few extensions in contrib seem to be broken now. It
seems fixing it is as simple as including the new bestatus.h next to
pgstat.h.
I'm not sure splitting the headers like this is needed, actually. It's
true we're replacing pgstat.c with something else, but it's still
related to stats, backing pg_stat_* system views etc. So I'd keep as
much of the definitions in pgstat.h, so that it's enough to include that
one header file. That would "unbreak" the extensions.
Renaming pgstat_report_* functions to bestatus_report_* seems
unnecessary to me too. The original names seem quite fine to me.
BTW the updated patches no longer apply cleanly. Apparently it got
broken since Tuesday, most likely by the pread/pwrite patch.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-Nov-08, Tomas Vondra wrote:
I'm not sure splitting the headers like this is needed, actually. It's true
we're replacing pgstat.c with something else, but it's still related to
stats, backing pg_stat_* system views etc. So I'd keep as much of the
definitions in pgstat.h, so that it's enough to include that one header
file. That would "unbreak" the extensions.
pgstat.h includes a lot of other stuff that presumably isn't needed if
all some .c wants is in bestatus.h, so my vote would be to make this
change *if it's actually possible to do it*: you want the affected
headers to compile standalone (use cpluspluscheck or similar to verify
this), for one thing.
Renaming pgstat_report_* functions to bestatus_report_* seems unnecessary to
me too. The original names seem quite fine to me.
Yeah, this probably keeps churn to a minimum.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello. This is rebased version.
At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <de249c3f-79c9-b75c-79a3-5e2d008548a8@2ndquadrant.com>
However, quite a few extensions in contrib seem to be broken now. It
seems fixing it is as simple as including the new bestatus.h next to
pgstat.h.
The additional 0009 does that.
At Thu, 8 Nov 2018 12:39:41 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20181108153941.txjb6rg3y7q26ldm@alvherre.pgsql>
On 2018-Nov-08, Tomas Vondra wrote:
I'm not sure splitting the headers like this is needed, actually. It's true
we're replacing pgstat.c with something else, but it's still related to
stats, backing pg_stat_* system views etc. So I'd keep as much of the
definitions in pgstat.h, so that it's enough to include that one header
file. That would "unbreak" the extensions.
pgstat.h includes a lot of other stuff that presumably isn't needed if
all some .c wants is in bestatus.h, so my vote would be to make this
change *if it's actually possible to do it*: you want the affected
headers to compile standalone (use cpluspluscheck or similar to verify
this), for one thing.
cpluspluscheck doesn't complain about this change. I'm afraid I may
not have read you correctly, but I counted the files that need only
pgstat.h / only bestatus.h / both under $(TOPDIR) (including contrib):
only (new) pgstat.h : 33
only bestatus.h : 47
both : 22
Renaming pgstat_report_* functions to bestatus_report_* seems unnecessary to
me too. The original names seem quite fine to me.
Yeah, this probably keeps churn to a minimum.
Reverted the names. pgstat_initialize() and
pgstat_clear_snapshot() were split into two functions each, for
pgstat.c and bestatus.c. (And pgstat_clear_snapshot() is calling
pgstat_bestatus_clear_snapshot()!, so the function names look
somewhat inconsistent. Will fix later.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 11/9/18 9:33 AM, Kyotaro HORIGUCHI wrote:
Hello. This is rebased version.
At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <de249c3f-79c9-b75c-79a3-5e2d008548a8@2ndquadrant.com>
However, quite a few extensions in contrib seem to be broken now. It
seems fixing it is as simple as including the new bestatus.h next to
pgstat.h.
The additional 0009 does that.
That does fix it, indeed. But the break happens in 0003, so that's where
the fixes should be moved - I've tried to simply apply 0009 right after
0003, but that does not seem to work because bestatus.h does not exist
at that point yet :-/
The current split into 8 parts seems quite sensible to me, i.e. that's
how it might get committed eventually. That however means each part
needs to be correct on its own (hence fixes in 0009 are a problem).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Fri, 9 Nov 2018 14:16:31 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <803f2d96-3b4b-f357-9a2e-45443212f13d@2ndquadrant.com>
On 11/9/18 9:33 AM, Kyotaro HORIGUCHI wrote:
Hello. This is rebased version.
At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote in
<de249c3f-79c9-b75c-79a3-5e2d008548a8@2ndquadrant.com>
However, quite a few extensions in contrib seem to be broken now. It
seems fixing it is as simple as including the new bestatus.h next to
pgstat.h.
The additional 0009 does that.
That does fix it, indeed. But the break happens in 0003, so that's
where the fixes should be moved - I've tried to simply apply 0009
right after 0003, but that does not seem to work because bestatus.h
does not exist at that point yet :-/
Sorry, I misunderstood you. The real reason 0003 broke, as you saw, is
that I just removed PG_STAT_TMP_DIR; 0005 fixes that later. I
(half-intentionally) didn't keep the source tree sound at v8-0003 and
v8-0008.
The current split into 8 parts seems quite sensible to me, i.e. that's
how it might get committed eventually. That however means each part
needs to be correct on it's own (hence fixes in 0009 are a problem).
Thanks. I tidied up the patchset so that each individual patch keeps
the source buildable and doesn't break program behavior.
v9-0001-sequential-scan-for-dshash.patch
v9-0002-Add-conditional-lock-feature-to-dshash.patch
same to v8
v9-0003-Make-archiver-process-an-auxiliary-process.patch
moved from v8-0004 since this is applicable independently from
v8-0003.
v9-0004-Shared-memory-based-stats-collector.patch
v8-0003 + some fixes to make contribs work and removed initdb
part.
v9-0005-Remove-statistics-temporary-directory.patch
v8-0005 + v8-0006 + initdb part of v8-0003. pg_stat_statements
may still need a fix since I changed only the directory for
temporary query files.
v9-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
v8-0008 plus v8-0009
v9-0007-Documentation-update.patch
this still leaves description about UDP-based stats collector
and will need further edit.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Mon, 2018-11-12 at 20:10 +0900, Kyotaro HORIGUCHI wrote:
At Fri, 9 Nov 2018 14:16:31 +0100, Tomas Vondra <
tomas.vondra@2ndquadrant.com> wrote in <
803f2d96-3b4b-f357-9a2e-45443212f13d@2ndquadrant.com>On 11/9/18 9:33 AM, Kyotaro HORIGUCHI wrote:
Hello. This is rebased version.
At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote in
<de249c3f-79c9-b75c-79a3-5e2d008548a8@2ndquadrant.com>
However, quite a few extensions in contrib seem to be broken now. It
seems fixing it is as simple as including the new bestatus.h next to
pgstat.h.
The additional 0009 does that.
That does fix it, indeed. But the break happens in 0003, so that's
where the fixes should be moved - I've tried to simply apply 0009
right after 0003, but that does not seem to work because bestatus.h
does not exist at that point yet :-/
Sorry, I misunderstood you. The real reason 0003 broke, as you saw, is
that I just removed PG_STAT_TMP_DIR; 0005 fixes that later. I
(half-intentionally) didn't keep the source tree sound at v8-0003 and
v8-0008.
The current split into 8 parts seems quite sensible to me, i.e. that's
how it might get committed eventually. That however means each part
needs to be correct on its own (hence fixes in 0009 are a problem).
Thanks. I tidied up the patchset so that each individual patch keeps
the source buildable and doesn't break program behavior.
OK, thanks. I'll take a look. I also plan to do much more testing, both
for correctness and performance - it's quite a piece of functionality.
If everything goes well I'd like to get this committed by the end of
January CF (with some of the initial parts in this CF, possibly).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
Unfortunately, the patch does not apply anymore - it seems it got broken
by the changes to signal handling and/or removal of magic OIDs :-(
I've done a review and testing when applied on top of 10074651e335.
Firstly, the testing - I was wondering if the patch has some performance
impact, so I've done some testing with a read-only workload on large
number of tables (1k, 10k and 100k) while concurrently selecting data
from pg_stat_* catalogs at the same time.
In one case both workloads were running against the same database, in
another there were two separate databases (and the selects from stat
catalogs were running against an "empty" database with no user tables).
In both cases there were 8 clients doing selects from the user tables,
and 4 clients accessing the pg_stat_* catalogs.
For the "single database" case the results look like this (this is just
patched / master throughput):
# of tables xact stats
------------------------------
1000 97.71% 98.76%
10000 100.38% 97.97%
100000 100.10% 98.50%
xact is throughput of the user workload (select from the large number of
tables) and stats is throughput for selects from system catalogs.
So pretty much no difference - 2% is within noise on this machine.
On two separate databases the results are a bit more interesting:
# of tables xact stats
-------------------------------
1000 100.49% 80.38%
10000 103.18% 80.28%
100000 100.85% 81.95%
For the main workload there's pretty much no difference, but for selects
from the stats catalogs there's ~20% drop in throughput. In absolute
numbers this means drop from ~670tps to ~550tps. I haven't investigated
this, but I suppose this is due to dshash seqscan being more expensive
than reading the data from file.
I don't think any of this is an issue in practice, though. The important
thing is that there's no measurable impact on the regular workload.
Now, a couple of comments regarding individual parts of the patch.
0001-0003
---------
I do think 0001 - 0003 are ready, with some minor cosmetic issues:
1) I'd rephrase the last part of dshash_seq_init comment more like this:
* If consistent is set for dshash_seq_init, all the hash table
* partitions are locked in the requested mode (as determined by the
* exclusive flag), and the locks are held until the end of the scan.
* Otherwise the partition locks are acquired and released as needed
* during the scan (up to two partitions may be locked at the same time).
Maybe it should briefly explain what the consistency guarantees are (and
aren't), but considering we're not materially changing the existing
behavior probably is not really necessary.
2) I think the dshash_find_extended() signature should be more like
dshash_find(), i.e. just appending parameters instead of moving them
around unnecessarily. Perhaps we should add
Assert(nowait || !lock_acquired);
Using nowait=false with lock_acquired!=NULL does not seem sensible.
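For illustration, the suggested shape might be the following (only a sketch; the exact parameter list is up to the patch, and the body is elided):
    /* same leading parameters as dshash_find(), new ones appended */
    void *
    dshash_find_extended(dshash_table *hash_table, const void *key,
                         bool exclusive, bool nowait, bool *lock_acquired)
    {
        /* Reporting lock acquisition only makes sense in nowait mode. */
        Assert(nowait || !lock_acquired);

        /*
         * ... the lookup itself would mirror dshash_find(), except that
         * the partition lock is taken conditionally when nowait is true,
         * reporting the outcome through *lock_acquired ...
         */
        return NULL;            /* body elided in this sketch */
    }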
3) I suppose this comment in postmaster.c is just copy-paste:
-#define BACKEND_TYPE_ARCHIVER 0x0010 /* bgworker process */
+#define BACKEND_TYPE_ARCHIVER 0x0010 /* archiver process */
I wonder why the archiver wasn't a regular auxiliary process already?
It seems like a fairly natural thing, so why not?
0004 (+0005 and 0007)
---------------------
This seems fine, but I have my doubts about two changes - removing of
stats_temp_directory and the IDLE_STATS_UPDATE_TIMEOUT thingy.
There's a couple of issues with the stats_temp_directory. Firstly, I
don't understand why it's spread over multiple parts of the patch. The
GUC is removed in 0004, the underlying variable is removed in 0005 and
then the docs are updated in 0007. If we really want to do this, it
should happen in a single patch.
But the main question is - do we really want to do that? I understand
this directory was meant for the stats data we're moving to shared
memory, so removing it seems natural. But clearly it's used by
pg_stat_statements - 0005 fixes that, of course, but I wonder if there
are other extensions using it to store files?
It's not just about how intensive I/O to those files is, but this also
means the files will now be included in backups / pg_rewind, and maybe
that's not really desirable?
Maybe it's fine but I'm not quite convinced about it ...
I'm not sure I understand what IDLE_STATS_UPDATE_TIMEOUT does. You've
described it as
This adds a new timeout IDLE_STATS_UPDATE_TIMEOUT. This works
similarly to IDLE_IN_TRANSACTION_SESSION_TIMEOUT. It fires in
at most PGSTAT_STAT_MIN_INTERVAL(500)ms to clean up pending
statistics updates.
but I'm not sure what pending updates you mean? Aren't we updating
the stats at the end of each transaction? At least that's what we've
been doing before, so maybe this patch changes that?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
Thank you very much for the testing.
At Mon, 26 Nov 2018 02:52:30 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <6c079a69-feba-e47c-7b85-8a9ff31adef3@2ndquadrant.com>
Hi,
Unfortunately, the patch does not apply anymore - it seems it got broken
by the changes to signal handling and/or removal of magic OIDs :-(
A big hit, but it was simple to fix.
I've done a review and testing when applied on top of 10074651e335.
Firstly, the testing - I was wondering if the patch has some performance
impact, so I've done some testing with a read-only workload on large
number of tables (1k, 10k and 100k) while concurrently selecting data
from pg_stat_* catalogs at the same time.
In one case both workloads were running against the same database, in
another there were two separate databases (and the selects from stat
catalogs were running against an "empty" database with no user tables).
In both cases there were 8 clients doing selects from the user tables,
and 4 clients accessing the pg_stat_* catalogs.
For the "single database" case the results look like this (this is just
patched / master throughput):
# of tables xact stats
------------------------------
1000 97.71% 98.76%
10000 100.38% 97.97%
100000 100.10% 98.50%
xact is throughput of the user workload (select from the large number of
tables) and stats is throughput for selects from system catalogs.
So pretty much no difference - 2% is within noise on this machine.
The fact that it doesn't grow with the number of tables seems to
suggest that.
On two separate databases the results are a bit more interesting:
# of tables xact stats
-------------------------------
1000 100.49% 80.38%
10000 103.18% 80.28%
100000 100.85% 81.95%
For the main workload there's pretty much no difference, but for selects
from the stats catalogs there's ~20% drop in throughput. In absolute
numbers this means drop from ~670tps to ~550tps. I haven't investigated
this, but I suppose this is due to dshash seqscan being more expensive
than reading the data from file.
Thanks for finding that. The three seqscan loops in
pgstat_vacuum_stat cannot take such a long time, I think. I'll
investigate it.
I don't think any of this is an issue in practice, though. The important
thing is that there's no measurable impact on the regular workload.
Now, a couple of comments regarding individual parts of the patch.
0001-0003
---------
I do think 0001 - 0003 are ready, with some minor cosmetic issues:
1) I'd rephrase the last part of dshash_seq_init comment more like this:
* If consistent is set for dshash_seq_init, all the hash table
* partitions are locked in the requested mode (as determined by the
* exclusive flag), and the locks are held until the end of the scan.
* Otherwise the partition locks are acquired and released as needed
* during the scan (up to two partitions may be locked at the same time).
Replaced with this.
Maybe it should briefly explain what the consistency guarantees are (and
aren't), but considering we're not materially changing the existing
behavior probably is not really necessary.
Mmm, actually the sequential scan is a new thing altogether, but..
2) I think the dshash_find_extended() signature should be more like
dshash_find(), i.e. just appending parameters instead of moving them
around unnecessarily. Perhaps we should add
Sure. That seems to have been done by my fingers acting on their own ;p Fixed.
Assert(nowait || !lock_acquired);
Using nowait=false with lock_acquired!=NULL does not seem sensible.
Agreed. Added.
3) I suppose this comment in postmaster.c is just copy-paste:
-#define BACKEND_TYPE_ARCHIVER 0x0010 /* bgworker process */
+#define BACKEND_TYPE_ARCHIVER 0x0010 /* archiver process */
Ugh! Fixed.
I wonder why the archiver wasn't a regular auxiliary process already?
It seems like a fairly natural thing, so why not?
Perhaps it's just because it didn't need access to shared memory.
0004 (+0005 and 0007)
---------------------
This seems fine, but I have my doubts about two changes - removing of
stats_temp_directory and the IDLE_STATS_UPDATE_TIMEOUT thingy.
There's a couple of issues with the stats_temp_directory. Firstly, I
don't understand why it's spread over multiple parts of the patch. The
GUC is removed in 0004, the underlying variable is removed in 0005 and
then the docs are updated in 0007. If we really want to do this, it
should happen in a single patch.
Sure.
But the main question is - do we really want to do that? I understand
this directory was meant for the stats data we're moving to shared
memory, so removing it seems natural. But clearly it's used by
pg_stat_statements - 0005 fixes that, of course, but I wonder if there
are other extensions using it to store files?
It's not just about how intensive I/O to those files is, but this also
means the files will now be included in backups / pg_rewind, and maybe
that's not really desirable?
Maybe it's fine but I'm not quite convinced about it ...
It was also on my mind. Anyway, sorry for the strange separation.
I was confused about pgstat_stat_directory (the names are actually
very confusing..). In addition to that, pg_stat_statements does *not*
use the variable stats_temp_directory; it uses PG_STAT_TMP_DIR.
pgstat_stat_directory was used only by basebackup.c.
The GUC base variable pgstat_temp_directory is not extern'ed, so we
can just remove it along with the GUC definition.
pgstat_stat_directory (it actually stores the *temporary* stats
directory) was extern'ed in pgstat.h, and PG_STAT_TMP_DIR is defined
in pgstat.h. They are not removed in the new version.
Finally, 0005 no longer breaks any other bins, contribs or external
extensions.
I'm not sure I understand what IDLE_STATS_UPDATE_TIMEOUT does. You've
described it as
This adds a new timeout IDLE_STATS_UPDATE_TIMEOUT. This works
similarly to IDLE_IN_TRANSACTION_SESSION_TIMEOUT. It fires in
at most PGSTAT_STAT_MIN_INTERVAL(500)ms to clean up pending
statistics updates.
but I'm not sure what pending updates you mean? Aren't we updating
the stats at the end of each transaction? At least that's what we've
been doing before, so maybe this patch changes that?
Without the timeout, updates to shared memory happen at the same rate
as the transaction traffic, and that easily causes congestion. So in
this patch the update frequency is limited by the timeout, and the
local statistics from transactions committed within the timeout
interval are merged into one shared stats update. That is the
"pending statistics".
With the socket-based stats collector, the temporary stats file is
likewise not updated at an interval shorter than the timeout, so from
a reader's point of view the update timeout behaves the same way as
the socket-based stats collector did.
If the local statistics are not fully processed at the end of the
last transaction, we don't get a chance to flush them before the next
transaction ends. So the timeout is armed if any "pending stats"
remain (around postgres.c:4175), and the pending stats are processed
forcibly in ProcessInterrupts().
postgres.c:4175
    /* flush what we can; a nonzero result is the delay (in ms) after
     * which the remaining pending stats should be retried */
    stats_timeout = pgstat_update_stat(false);
    if (stats_timeout > 0)
    {
        /* remember to disable the timeout when leaving the idle state */
        disable_idle_stats_update_timeout = true;
        enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT,
                             stats_timeout);
Attached is the new version, rebased and addressing all the comments
here (except the 20% regression).
v10-0001-sequential-scan-for-dshash.patch
v10-0002-Add-conditional-lock-feature-to-dshash.patch
fixed.
v10-0003-Make-archiver-process-an-auxiliary-process.patch
fixed.
v10-0004-Shared-memory-based-stats-collector.patch
updated not to touch guc.
v10-0005-Remove-the-GUC-stats_temp_directory.patch
collected all guc-related changes.
updated not to break other programs.
v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
basebackup.c requires both bestats.h and pgstat.h
v10-0007-Documentation-update.patch
small change related to 0005.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 11/27/18 9:59 AM, Kyotaro HORIGUCHI wrote:
...
For the main workload there's pretty much no difference, but for selects
from the stats catalogs there's ~20% drop in throughput. In absolute
numbers this means drop from ~670tps to ~550tps. I haven't investigated
this, but I suppose this is due to dshash seqscan being more expensive
than reading the data from file.
Thanks for finding that. The three seqscan loops in
pgstat_vacuum_stat cannot take such a long time, I think. I'll
investigate it.
OK. I'm not sure this is related to pgstat_vacuum_stat - the slowdown
happens while querying the catalogs, so why would that trigger vacuum of
the stats? I may be missing something, of course.
FWIW, the "query statistics" test simply does this:
SELECT * FROM pg_stat_all_tables;
SELECT * FROM pg_stat_all_indexes;
SELECT * FROM pg_stat_user_indexes;
SELECT * FROM pg_stat_user_tables;
SELECT * FROM pg_stat_sys_tables;
SELECT * FROM pg_stat_sys_indexes;
and the slowdown happened even when it was running on its own (nothing
else running on the instance). Which mostly rules out concurrency
issues with the hash table locking etc.
I don't think any of this is an issue in practice, though. The important
thing is that there's no measurable impact on the regular workload.
Now, a couple of comments regarding individual parts of the patch.
0001-0003
---------
I do think 0001 - 0003 are ready, with some minor cosmetic issues:
1) I'd rephrase the last part of dshash_seq_init comment more like this:
* If consistent is set for dshash_seq_init, all the hash table
* partitions are locked in the requested mode (as determined by the
* exclusive flag), and the locks are held until the end of the scan.
* Otherwise the partition locks are acquired and released as needed
* during the scan (up to two partitions may be locked at the same time).
Replaced with this.
Maybe it should briefly explain what the consistency guarantees are (and
aren't), but considering we're not materially changing the existing
behavior probably is not really necessary.
Mmm, actually the sequential scan is a new thing altogether, but..
Sure, there are new pieces. But does it significantly change consistency
guarantees when reading the stats? I don't think so - there was no
strict consistency guaranteed before (due to data interleaved with
inquiries, UDP throwing away packets under load, etc.). Based on the
discussion in this thread that seems to be the consensus.
2) I think the dshash_find_extended() signature should be more like
dshash_find(), i.e. just appending parameters instead of moving them
around unnecessarily. Perhaps we should add
Sure. That seems to have been done by my fingers acting on their own ;p Fixed.
;-)
0004 (+0005 and 0007)
---------------------
This seems fine, but I have my doubts about two changes - removing of
stats_temp_directory and the IDLE_STATS_UPDATE_TIMEOUT thingy.
There's a couple of issues with the stats_temp_directory. Firstly, I
don't understand why it's spread over multiple parts of the patch. The
GUC is removed in 0004, the underlying variable is removed in 0005 and
then the docs are updated in 0007. If we really want to do this, it
should happen in a single patch.Sure.
But the main question is - do we really want to do that? I understand
this directory was meant for the stats data we're moving to shared
memory, so removing it seems natural. But clearly it's used by
pg_stat_statements - 0005 fixes that, of course, but I wonder if there
are other extensions using it to store files?
It's not just about how intensive I/O to those files is, but this also
means the files will now be included in backups / pg_rewind, and maybe
that's not really desirable?Maybe it's fine but I'm not quite convinced about it ...
It was also in my mind. Anyway sorry for the strange separation.
I was confused about pgstat_stat_directory (the names are
actually very confusing..). Addition to that pg_stat_statements
does *not* use the variable stats_temp_directory, but using
PG_STAT_TMP_DIR. pgstat_stat_directory was used only by
basebackup.c.The GUC base variable pgstat_temp_directory is not extern'ed so
we can just remove it along with the GUC
definition. pgstat_stat_directory (it actually stores *temporary*
stats directory) was extern'ed in pgstat.h and PG_STAT_TMP_DIR is
defined in pgstat.h. They are not removed in the new version.
Finally 0005 no longer breaks any other bins, contribs and
external extensions.
Great. I'll take a look.
I'm not sure I understand what IDLE_STATS_UPDATE_TIMEOUT does. You've
described it as
This adds a new timeout IDLE_STATS_UPDATE_TIMEOUT. This works
similarly to IDLE_IN_TRANSACTION_SESSION_TIMEOUT. It fires in
at most PGSTAT_STAT_MIN_INTERVAL(500)ms to clean up pending
statistics updates.
but I'm not sure what pending updates you mean? Aren't we updating
the stats at the end of each transaction? At least that's what we've
been doing before, so maybe this patch changes that?
Without the timeout, updates to shared memory happen at the same rate
as the transaction traffic, and that easily causes congestion. So in
this patch the update frequency is limited by the timeout, and the
local statistics from transactions committed within the timeout
interval are merged into one shared stats update. That is the
"pending statistics".
With the socket-based stats collector, the temporary stats file is
likewise not updated at an interval shorter than the timeout, so from
a reader's point of view the update timeout behaves the same way as
the socket-based stats collector did.
If the local statistics are not fully processed at the end of the
last transaction, we don't get a chance to flush them before the next
transaction ends. So the timeout is armed if any "pending stats"
remain (around postgres.c:4175), and the pending stats are processed
forcibly in ProcessInterrupts().
OK, thanks for the explanation. So it's essentially a protection against
stats from short transactions not being reported for a long time, when
the next transaction is long. For example we might end up with a
sequence of short transactions
T1: short, does not trigger IDLE_STATS_UPDATE_TIMEOUT -> local
T2: short, does not trigger IDLE_STATS_UPDATE_TIMEOUT -> local
...
TN: short, does not trigger IDLE_STATS_UPDATE_TIMEOUT -> local
T(N+1): long (say, several hours)
in which case stats from short ones are not reported until the end of
the long one. That makes sense.
That however raises the question - won't that also report some of the
stats from the last transaction? That would be a change compared to
current behavior, although I'm not sure it's undesirable - it's often
quite annoying that we don't receive stats from a transaction until it
completes. But I wonder - doesn't this affect pg_stat_xact catalogs?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/27/18 9:59 AM, Kyotaro HORIGUCHI wrote:
...
v10-0001-sequential-scan-for-dshash.patch
v10-0002-Add-conditional-lock-feature-to-dshash.patch
fixed.
v10-0003-Make-archiver-process-an-auxiliary-process.patch
fixed.
v10-0004-Shared-memory-based-stats-collector.patch
updated not to touch guc.
v10-0005-Remove-the-GUC-stats_temp_directory.patch
collected all guc-related changes.
updated not to break other programs.
v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
basebackup.c requires both bestats.h and pgstat.h
v10-0007-Documentation-update.patch
small change related to 0005.
I need to do a more thorough review of part 0006, but these patches
seem quite fine to me. I'd however merge 0007 into the other relevant
parts (it seems like a mix of docs changes for 0004, 0005 and 0006).
Thinking about it a bit more, I'm wondering if we need to keep 0004 and
0005 separate. My understanding is that the stats_temp_directory is used
only from the stats collector, so it probably does not make much sense
to keep it after 0004. We may also keep it separate and then commit both
0004 and 0005 together, of course. What do you think?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-Nov-28, Tomas Vondra wrote:
v10-0004-Shared-memory-based-stats-collector.patch
updated not to touch guc.
v10-0005-Remove-the-GUC-stats_temp_directory.patch
collected all guc-related changes.
updated not to break other programs.
v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
basebackup.c requires both bestats.h and pgstat.h
v10-0007-Documentation-update.patch
small change related to 0005.
I need to do a more thorough review of part 0006, but these patches
seem quite fine to me. I'd however merge 0007 into the other relevant
parts (it seems like a mix of docs changes for 0004, 0005 and 0006).
Looking at 0001 - 0003 it seems OK to keep each as separate commits, but
I suggest to have 0004+0006 be a single commit, mostly because
introducing a bunch of "new" code in 0004 and then moving it over to
bestatus.c in 0006 makes "git blame" doubly painful. And I think
committing 0005 and not 0007 makes the documentation temporarily buggy,
so I see no reason to think of this as two commits, one being 0004+0006
and the other 0005+0007. And even those could conceivably be pushed
together instead of as a single patch. (But be sure to push very early
in your work day, to have plenty of time to deal with any resulting
buildfarm problems.)
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/29/18 1:18 PM, Alvaro Herrera wrote:
On 2018-Nov-28, Tomas Vondra wrote:
v10-0004-Shared-memory-based-stats-collector.patch
updated not to touch guc.
v10-0005-Remove-the-GUC-stats_temp_directory.patch
collected all guc-related changes.
updated not to break other programs.
v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
basebackup.c requires both bestats.h and pgstat.h
v10-0007-Documentation-update.patch
small change related to 0005.
I need to do a more thorough review of part 0006, but these patches
seem quite fine to me. I'd however merge 0007 into the other relevant
parts (it seems like a mix of docs changes for 0004, 0005 and 0006).
Looking at 0001 - 0003 it seems OK to keep each as separate commits, but
I suggest to have 0004+0006 be a single commit, mostly because
introducing a bunch of "new" code in 0004 and then moving it over to
bestatus.c in 0006 makes "git blame" doubly painful. And I think
committing 0005 and not 0007 makes the documentation temporarily buggy,
so I see no reason to think of this as two commits, one being 0004+0006
and the other 0005+0007. And even those could conceivably be pushed
together instead of as a single patch. (But be sure to push very early
in your work day, to have plenty of time to deal with any resulting
buildfarm problems.)
Kyotaro-san, do you agree with committing the patch the way Alvaro
proposed? That is, 0001-0003 as separate commits, and 0004+0006 and
0005+0007 together. The plan seems reasonable to me.
FWIW I see cputube reports some build failures on Windows:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.26736#L3135
If I understand it correctly, it complains about this line in postmaster.c:
extern pgsocket pgStatSock;
which seems to only affect EXEC_BACKEND (including Win32). ISTM we
should get rid of all pgStatSock references, per the attached fix.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
Hi,
On 2019-01-01 18:39:12 +0100, Tomas Vondra wrote:
On 11/29/18 1:18 PM, Alvaro Herrera wrote:
On 2018-Nov-28, Tomas Vondra wrote:
v10-0004-Shared-memory-based-stats-collector.patch
updated not to touch guc.
v10-0005-Remove-the-GUC-stats_temp_directory.patch
collected all guc-related changes.
updated not to break other programs.
v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
basebackup.c requires both bestats.h and pgstat.h
v10-0007-Documentation-update.patch
small change related to 0005.
I need to do a more thorough review of part 0006, but these patches
seem quite fine to me. I'd however merge 0007 into the other relevant
parts (it seems like a mix of docs changes for 0004, 0005 and 0006).
Looking at 0001 - 0003 it seems OK to keep each as separate commits, but
I suggest to have 0004+0006 be a single commit, mostly because
introducing a bunch of "new" code in 0004 and then moving it over to
bestatus.c in 0006 makes "git blame" doubly painful. And I think
committing 0005 and not 0007 makes the documentation temporarily buggy,
so I see no reason to think of this as two commits, one being 0004+0006
and the other 0005+0007. And even those could conceivably be pushed
together instead of as a single patch. (But be sure to push very early
in your work day, to have plenty of time to deal with any resulting
buildfarm problems.)
Kyotaro-san, do you agree with committing the patch the way Alvaro
proposed? That is, 0001-0003 as separate commits, and 0004+0006 and
0005+0007 together. The plan seems reasonable to me.
Do you guys think these patches are ready already? I'm a bit doubtful, and
failures here could have quite wide-ranging symptoms.
Greetings,
Andres Freund
On 1/1/19 7:03 PM, Andres Freund wrote:
Hi,
On 2019-01-01 18:39:12 +0100, Tomas Vondra wrote:
On 11/29/18 1:18 PM, Alvaro Herrera wrote:
On 2018-Nov-28, Tomas Vondra wrote:
v10-0004-Shared-memory-based-stats-collector.patch
updated not to touch guc.
v10-0005-Remove-the-GUC-stats_temp_directory.patch
collected all guc-related changes.
updated not to break other programs.
v10-0006-Split-out-backend-status-monitor-part-from-pgstat.patch
basebackup.c requires both bestats.h and pgstat.h
v10-0007-Documentation-update.patch
small change related to 0005.
I need to do a more thorough review of part 0006, but these patches
seem quite fine to me. I'd however merge 0007 into the other relevant
parts (it seems like a mix of docs changes for 0004, 0005 and 0006).
Looking at 0001 - 0003 it seems OK to keep each as separate commits, but
I suggest to have 0004+0006 be a single commit, mostly because
introducing a bunch of "new" code in 0004 and then moving it over to
bestatus.c in 0006 makes "git blame" doubly painful. And I think
committing 0005 and not 0007 makes the documentation temporarily buggy,
so I see no reason to think of this as two commits, one being 0004+0006
and the other 0005+0007. And even those could conceivably be pushed
together instead of as a single patch. (But be sure to push very early
in your work day, to have plenty of time to deal with any resulting
buildfarm problems.)
Kyotaro-san, do you agree with committing the patch the way Alvaro
proposed? That is, 0001-0003 as separate commits, and 0004+0006 and
0005+0007 together. The plan seems reasonable to me.
Do you guys think these patches are ready already? I'm a bit doubtful, and
failures here could have quite wide-ranging symptoms.
I agree it's a sensitive part of the code, so additional reviews would
be welcome of course. I've done as much review and testing as possible,
and overall it seems in a fairly good shape. Do you have any particular
concerns / ideas what to look for?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Jan-01, Tomas Vondra wrote:
I agree it's a sensitive part of the code, so additional reviews would
be welcome of course. I've done as much review and testing as possible,
and overall it seems in a fairly good shape. Do you have any particular
concerns / ideas what to look for?
I haven't reviewed this patch thoroughly.
Shall we do a triage run over the complete commitfest to determine the
highest priority items that we should put extra effort into reviewing?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
The patch needs rebasing, as it got broken by 285d8e1205, and there's
some other minor bitrot.
On 11/27/18 4:40 PM, Tomas Vondra wrote:
On 11/27/18 9:59 AM, Kyotaro HORIGUCHI wrote:
...
For the main workload there's pretty much no difference, but for
selects from the stats catalogs there's ~20% drop in throughput.
In absolute numbers this means drop from ~670tps to ~550tps. I
haven't investigated this, but I suppose this is due to dshash
seqscan being more expensive than reading the data from file.
Thanks for finding that. The three seqscan loops in
pgstat_vacuum_stat cannot take such a long time, I think. I'll
investigate it.
OK. I'm not sure this is related to pgstat_vacuum_stat - the
slowdown happens while querying the catalogs, so why would that
trigger vacuum of the stats? I may be missing something, of course.
FWIW, the "query statistics" test simply does this:
SELECT * FROM pg_stat_all_tables;
SELECT * FROM pg_stat_all_indexes;
SELECT * FROM pg_stat_user_indexes;
SELECT * FROM pg_stat_user_tables;
SELECT * FROM pg_stat_sys_tables;
SELECT * FROM pg_stat_sys_indexes;
and the slowdown happened even when it was running on its own (nothing
else running on the instance). Which mostly rules out concurrency
Did you have time to investigate the slowdown?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you very much for reviewing this, and sorry for the absence.
At Sun, 20 Jan 2019 18:13:04 +0100, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <b760035b-1941-38bb-5e84-c2fbc63fef6b@2ndquadrant.com>
Hi,
The patch needs rebasing, as it got broken by 285d8e1205, and there's
some other minor bitrot.
The most affected part was 0006 because of the file splitting, but
actually only the following four (really three) commits had an
effect.
42e2a58071 Fix typos in documentation and for one wait event
97c39498e5 Update copyright for 2019
578b229718 Remove WITH OIDS support, change oid catalog column visibility.
(125f551c8b Leave SIGTTIN/SIGTTOU signal handling alone in postmaster child processes.)
The last one is not relevant because the stats collector is no longer
a process.
This contains the fix for the EXEC_BACKEND-related bug pointed out in
/messages/by-id/854d6d91-f2f3-e391-f0fc-064db51b391e@2ndquadrant.com
On 11/27/18 4:40 PM, Tomas Vondra wrote:
On 11/27/18 9:59 AM, Kyotaro HORIGUCHI wrote:
...
For the main workload there's pretty much no difference, but for
selects from the stats catalogs there's ~20% drop in throughput.
In absolute numbers this means drop from ~670tps to ~550tps. I
haven't investigated this, but I suppose this is due to dshash
seqscan being more expensive than reading the data from file.
Thanks for finding that. The three seqscan loops in
pgstat_vacuum_stat cannot take such a long time, I think. I'll
investigate it.
OK. I'm not sure this is related to pgstat_vacuum_stat - the
slowdown happens while querying the catalogs, so why would that
trigger vacuum of the stats? I may be missing something, of course.
FWIW, the "query statistics" test simply does this:
SELECT * FROM pg_stat_all_tables;
SELECT * FROM pg_stat_all_indexes;
SELECT * FROM pg_stat_user_indexes;
SELECT * FROM pg_stat_user_tables;
SELECT * FROM pg_stat_sys_tables;
SELECT * FROM pg_stat_sys_indexes;
and the slowdown happened even when it was running on its own (nothing
else running on the instance). Which mostly rules out concurrency
issues with the hash table locking etc.
Did you have time to investigate the slowdown?
It seems to me that the slowdown comes from the local caching in
snapshot_statentry(), in several ways.
It searches the local hash (HTAB) first, then the shared hash (dshash)
if not found there, copying the found entry into the local hash
(action A). *If* a second reference comes within the same transaction,
the HTAB returns the result directly (action B). But under frequent
short transactions it mostly takes action A. The rate of action A can
be bounded by the update interval of the shared stats, but that
interval effectively shrinks when many backends run.
Another bottleneck was found in pgstat_fetch_stat_tabentry(): it calls
pgstat_fetch_stat_dbentry() too often, which can be largely reduced.
A quick (and dirty) fix for the above reduced the slowdown roughly by
half: 59tps (master) -> 48tps (current) -> 54tps (the fix).
I'll reconsider the reader side of the stats.
I didn't merge the suggested two pairs of commits. I'll do that
after addressing the slowdown issue.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hello.
At Mon, 21 Jan 2019 21:19:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190121.211907.59625409.horiguchi.kyotaro@lab.ntt.co.jp>
I'll reconsider the reader side of the stats.
The most significant cause of the slowdown is the repeated search for
non-existent entries, in both the local and the shared hash, every
time. A negative cache, combined with the cache expiration interval,
eliminates the slowdown.
1000 times repetition with -O2 binary:
master : 124.99 tps
patched: 125.48 tps (+0.4%)
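A sketch of the negative-cache idea (the entry layout and the 'negative' flag are illustrative, not the patch's actual definitions):
    typedef struct LocalStatsEntry
    {
        Oid     key;            /* hash key */
        bool    negative;       /* true => known absent from shared hash */
        /* ... cached counters would follow ... */
    } LocalStatsEntry;

    static LocalStatsEntry *
    cached_lookup_negcache(HTAB *local, dshash_table *shared, Oid key)
    {
        bool             found;
        LocalStatsEntry *lent = hash_search(local, &key, HASH_ENTER,
                                            &found);

        if (!found)
        {
            void       *sent = dshash_find(shared, &key, false);

            /* remember misses too, so repeated lookups of a
             * non-existent object never touch the dshash again until
             * the local cache expires */
            lent->negative = (sent == NULL);
            if (sent)
            {
                /* ... copy counters from the shared entry ... */
                dshash_release_lock(shared, sent);
            }
        }
        return lent->negative ? NULL : lent;
    }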
I didn't merge the suggested two pairs of commits. I'll do that
after addressing the slowdown issue.
I agree to committing 0001-0003 separately. In the attached patch
set, old 0004+0006 are merged as 0004 and old 0005+0007 are
merged as new 0005.
Changed the caching policy:
Expired at every xact end
-> Kept at least for PGSTAT_STAT_MIN_INTERVAL (500ms).
Added a negative cache feature (snapshot_statentry).
Improved the separation between pgstat and bestatus (separated the AtEOXact_* functions).
Fixed dubious memory context usage.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Tue, Jan 22, 2019 at 03:48:02PM +0900, Kyotaro HORIGUCHI wrote:
Fixed doubious memory context usage.
That's quite something that we have here for 0005:
84 files changed, 6588 insertions(+), 7501 deletions(-)
Moved to next CF for now.
--
Michael
Hi,
On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7eed5866d2..e52ae54821 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
 			&sync_secs, &sync_usecs);
 	/* Accumulate checkpoint timing summary data, in milliseconds. */
-	BgWriterStats.m_checkpoint_write_time +=
+	BgWriterStats.checkpoint_write_time +=
 		write_secs * 1000 + write_usecs / 1000;
-	BgWriterStats.m_checkpoint_sync_time +=
+	BgWriterStats.checkpoint_sync_time +=
 		sync_secs * 1000 + sync_usecs / 1000;
Why does this patch do renames like this in the same entry as actual
functional changes?
@@ -1273,16 +1276,22 @@ do_start_worker(void)
 				break;
 			}
 		}
-		if (skipit)
-			continue;
+		if (!skipit)
+		{
+			/* Remember the db with oldest autovac time. */
+			if (avdb == NULL ||
+				tmp->adw_entry->last_autovac_time <
+				avdb->adw_entry->last_autovac_time)
+			{
+				if (avdb)
+					pfree(avdb->adw_entry);
+				avdb = tmp;
+			}
+		}
-		/*
-		 * Remember the db with oldest autovac time. (If we are here, both
-		 * tmp->entry and db->entry must be non-null.)
-		 */
-		if (avdb == NULL ||
-			tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-			avdb = tmp;
+		/* Immediately free it if not used */
+		if(avdb != tmp)
+			pfree(tmp->adw_entry);
 	}
This looks like another unrelated refactoring. Don't do this.
 	/* Transfer stats counts into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+	BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+	BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
More unrelated renaming. I'll stop mentioning this in the rest of this
patch, but really don't do this - it makes such a large patch
unnecessarily harder to review.
pgstat.c needs a header comment explaining the architecture of the
approach.
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define PGSTAT_FETCH_SHARED	0
+#define PGSTAT_FETCH_EXCLUSIVE	1
+#define PGSTAT_FETCH_NOWAIT	2
+
+typedef enum
+{
Please don't create anonymous enums that are then typedef'd to a
name. The underlying name and the one not scoped to typedefs should be
the same.
+/*
+ * report withholding facility.
+ *
+ * some report items are withholded if required lock is not acquired
+ * immediately.
+ */
This comment needs polishing. The variables are named _pending_, but the
comment talks about withholding - which doesn't seem like an apt name.
/*
 * Structures in which backends store per-table info that's waiting to be
@@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
 * Hash table for O(1) t_id -> tsa_entry lookup
 */
static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;
/*
 * Backends store per-function info that's waiting to be sent to the collector
 * in this hash table (indexed by function OID).
 */
static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;
So this patch leaves us with a pgStatFunctions that has a comment
explaining it's about "waiting to be sent" stats, and then additionally
a pgStatPendingFunctions?
/* ------------------------------------------------------------
 * Public functions used by backends follow
@@ -802,41 +436,107 @@ allow_immediate_pgstat_restart(void)
 * pgstat_report_stat() -
 *
 * Must be called by processes that performs DML: tcop/postgres.c, logical
- * receiver processes, SPI worker, etc. to send the so far collected
- * per-table and function usage statistics to the collector. Note that this
- * is called only when not within a transaction, so it is fair to use
- * transaction stop time as an approximation of current time.
- * ----------
+ * receiver processes, SPI worker, etc. to apply the so far collected
+ * per-table and function usage statistics to the shared statistics hashes.
+ *
+ * This requires taking some locks on the shared statistics hashes and some
Weird mix of different indentation.
+ * of updates may be withholded on lock failure. Pending updates are
+ * retried in later call of this function and finally cleaned up by calling
+ * this function with force = true or PGSTAT_STAT_MAX_INTERVAL milliseconds
+ * was elapsed since last cleanup. On the other hand updates by regular
s/was/has/
+
+	/* Forecibly update other stats if any. */
s/Forecibly/Forcibly/
Typo aside, what does forcibly mean here?
 	/*
-	 * Scan through the TabStatusArray struct(s) to find tables that actually
-	 * have counts, and build messages to send. We have to separate shared
-	 * relations from regular ones because the databaseid field in the message
-	 * header has to depend on that.
+	 * XX: We cannot lock two dshash entries at once. Since we must keep lock
Typically we use three XXX (or alternatively NB:).
+	 * while tables stats are being updated we have no choice other than
+	 * separating jobs for shared table stats and that of egular tables.
s/egular/regular/
+	 * Looping over the array twice isapparently ineffcient and more efficient
+	 * way is expected.
	 */
s/isapparently/is apparently/
s/ineffcient/inefficient/
But I don't know what this sentence is trying to say precisely. Are you
saying this cannot be committed unless this is fixed?
Nor do I understand why it's actually relevant that we cannot lock two
dshash entries at once. The same table is never shared and unshared,
no?
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Appies table stats in table status array merging with pending stats if any.
s/Appies/Applies/
+ * If force is true waits until required locks to be acquired. Elsewise stats
s/elsewise/otherwise/
+ * merged stats as pending sats and it will be processed in the next chance.
s/sats/stats/
s/in the next chance/at the next chance/
+	/* if pending update exists, it should be applied along with */
+	if (pgStatPendingTabHash != NULL)
Why is any of this done if there's no pending data?
 	{
-		pgstat_send_tabstat(this_msg);
-		this_msg->m_nentries = 0;
+		pentry = hash_search(pgStatPendingTabHash,
+							 (void *) entry, HASH_FIND, NULL);
+
+		if (pentry)
+		{
+			/* merge new update into pending updates */
+			pgstat_merge_tabentry(pentry, entry, false);
+			entry = pentry;
+		}
+	}
+
+	/* try to apply the merged stats */
+	if (pgstat_apply_tabstat(cxt, entry, !force))
+	{
+		/* succeeded. remove it if it was pending stats */
+		if (pentry && entry != pentry)
+			hash_search(pgStatPendingTabHash,
+						(void *) pentry, HASH_REMOVE, NULL);
Huh, how can entry != pentry in the case of pending stats? They're
literally set to the same value above?
+		else if (!pentry)
+		{
+			/* failed and there was no pending entry, create new one. */
+			bool	found;
+
+			if (pgStatPendingTabHash == NULL)
+			{
+				HASHCTL		ctl;
+
+				memset(&ctl, 0, sizeof(ctl));
+				ctl.keysize = sizeof(Oid);
+				ctl.entrysize = sizeof(PgStat_TableStatus);
+				pgStatPendingTabHash =
+					hash_create("pgstat pending table stats hash",
+								TABSTAT_QUANTUM,
+								&ctl,
+								HASH_ELEM | HASH_BLOBS);
+			}
+
+			pentry = hash_search(pgStatPendingTabHash,
+								 (void *) entry, HASH_ENTER, &found);
+			Assert (!found);
+
+			*pentry = *entry;
 		}
 	}
-	/* zero out TableStatus structs after use */
-	MemSet(tsa->tsa_entries, 0,
-		   tsa->tsa_used * sizeof(PgStat_TableStatus));
-	tsa->tsa_used = 0;
+	}
I don't understand why we do this at all.
+ if (cxt->tabhash)
+ dshash_detach(cxt->tabhash);
Huh, why do we detach here?
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure. Dshashes for table
+ * and function stats are kept attached and stored in ctx. The caller must
+ * detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+					 PgStat_TableStatus *entry, bool nowait)
+{
+	Oid		dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+	int		table_mode = PGSTAT_FETCH_EXCLUSIVE;
+	bool	updated = false;
+
+	if (nowait)
+		table_mode |= PGSTAT_FETCH_NOWAIT;
+
+	/*
+	 * We need to keep lock on dbentries for regular tables to avoid race
+	 * condition with drop database. So we hold it in the context variable. We
+	 * don't need that for shared tables.
+	 */
+	if (!cxt->dbentry)
+		cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);
Oh, wait, what? *That's* the reason why we need to hold a lock on a
second entry?
Uhm, how can this actually be an issue? If we apply pending stats, we're
connected to the database, it therefore cannot be dropped while we're
applying stats, no?
+	/* attach shared stats table if not yet */
+	if (!cxt->tabhash)
+	{
+		/* apply database stats */
+		if (!entry->t_shared)
+		{
+			/* Update database-wide stats */
+			cxt->dbentry->n_xact_commit += pgStatXactCommit;
+			cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+			cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+			cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+			pgStatXactCommit = 0;
+			pgStatXactRollback = 0;
+			pgStatBlockReadTime = 0;
+			pgStatBlockWriteTime = 0;
+		}
Uh, this seems to have nothing to do with "attach shared stats table if
not yet".
+	/* create hash if not yet */
+	if (dbentry->functions == DSM_HANDLE_INVALID)
+	{
+		funchash = dshash_create(area, &dsh_funcparams, 0);
+		dbentry->functions = dshash_get_hash_table_handle(funchash);
+	}
+	else
+		funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
Why is this created on-demand?
+	/*
+	 * First, we empty the transaction stats. Just move numbers to pending
+	 * stats if any. Elsewise try to directly update the shared stats but
+	 * create a new pending entry on lock failure.
+	 */
+	if (pgStatFunctions)
I don't understand why we have both pgStatFunctions
and pgStatPendingFunctions (and the same for other such pairs). That
seems to make no sense to me. The comments for the former literally are:
/*
* Backends store per-function info that's waiting to be sent to the collector
* in this hash table (indexed by function OID).
*/
static HTAB *
pgstat_collect_oids(Oid catalogid)
@@ -1241,62 +1173,54 @@ pgstat_collect_oids(Oid catalogid)
 /* ----------
  * pgstat_drop_database() -
- *
- *	Tell the collector that we just dropped a database.
- *	(If the message gets lost, we will still clean the dead DB eventually
- *	via future invocations of pgstat_vacuum_stat().)
+ *	Remove entry for the database that we just dropped.
+ *
+ *	If some stats update happens after this, this entry will re-created but
+ *	we will still clean the dead DB eventually via future invocations of
+ *	pgstat_vacuum_stat().
  * ----------
  */
+
 void
 pgstat_drop_database(Oid databaseid)
 {
Mixed indentation, added newline.
+/*
+ * snapshot_statentry() - Find an entriy from source dshash.
+ *
s/entriy/entry/
Ok, getting too tired now. Two AM in an airport lounge is not the
easiest place and time to concentrate...
I don't think this is all that close to being committable :(
Greetings,
Andres Freund
Hi Kyotaro,
On 2019-02-07 13:10:08 -0800, Andres Freund wrote:
I don't think this is all that close to being committable :(
Are you planning to update this soon? I think this needs to be improved
pretty quickly to have any shot at getting into v12. I'm willing to put
in some resources towards that, but I definitely don't have the
resources to entirely polish it from my end.
Greetings,
Andres Freund
Hello. Thank you for the comment.
At Thu, 7 Feb 2019 13:10:08 -0800, Andres Freund <andres@anarazel.de> wrote in <20190207211008.nc3axviivmcoaluq@alap3.anarazel.de>
Hi,
On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7eed5866d2..e52ae54821 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
 										 &sync_secs, &sync_usecs);

 	/* Accumulate checkpoint timing summary data, in milliseconds. */
-	BgWriterStats.m_checkpoint_write_time +=
+	BgWriterStats.checkpoint_write_time +=
 		write_secs * 1000 + write_usecs / 1000;
-	BgWriterStats.m_checkpoint_sync_time +=
+	BgWriterStats.checkpoint_sync_time +=
 		sync_secs * 1000 + sync_usecs / 1000;

Why does this patch do renames like this in the same entry as actual
functional changes?
Just because it is no longer "messages". I'm OK with preserving them
as historical names. Reverted.
@@ -1273,16 +1276,22 @@ do_start_worker(void)
 				break;
 			}
 		}
-		if (skipit)
-			continue;
+		if (!skipit)
+		{
+			/* Remember the db with oldest autovac time. */
+			if (avdb == NULL ||
+				tmp->adw_entry->last_autovac_time <
+				avdb->adw_entry->last_autovac_time)
+			{
+				if (avdb)
+					pfree(avdb->adw_entry);
+				avdb = tmp;
+			}
+		}

-		/*
-		 * Remember the db with oldest autovac time. (If we are here, both
-		 * tmp->entry and db->entry must be non-null.)
-		 */
-		if (avdb == NULL ||
-			tmp->adw_entry->last_autovac_time < avdb->adw_entry->last_autovac_time)
-			avdb = tmp;
+		/* Immediately free it if not used */
+		if(avdb != tmp)
+			pfree(tmp->adw_entry);
 	}

This looks like another unrelated refactoring. Don't do this.
Rewrote it in a less invasive way.
 	/* Transfer stats counts into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
+	BgWriterStats.buf_written_backend += CheckpointerShmem->num_backend_writes;
+	BgWriterStats.buf_fsync_backend += CheckpointerShmem->num_backend_fsync;

More unrelated renaming. I'll stop mentioning this in the rest of this
patch, but really don't do this; it makes such a large patch unnecessarily
harder to review.
AFAICS it is done only to the struct. I reverted all of them.
pgstat.c needs a header comment explaining the architecture of the
approach.
I wrote some. Please check it.
+/*
+ * Operation mode of pgstat_get_db_entry.
+ */
+#define	PGSTAT_FETCH_SHARED		0
+#define	PGSTAT_FETCH_EXCLUSIVE	1
+#define	PGSTAT_FETCH_NOWAIT		2
+
+typedef enum
+{

Please don't create anonymous enums that are then typedef'd to a
name. The underlying name and the one not scoped to typedefs should be
the same.
Sorry. I named the struct PgStat_TableLookupState and typedef'ed
with the same name.
+/*
+ * report withholding facility.
+ *
+ * some report items are withholded if required lock is not acquired
+ * immediately.
+ */

This comment needs polishing. The variables are named _pending_, but the
comment talks about withholding - which doesn't seem like an apt name.
The comment was updated in v12, but it still doesn't read
well. I rewrote it as follows:
| * variables signal that the backend has some numbers that are waiting to be
| * written to shared stats.
 /*
  * Structures in which backends store per-table info that's waiting to be
@@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
  * Hash table for O(1) t_id -> tsa_entry lookup
  */
 static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;

 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;

So this patch leaves us with a pgStatFunctions that has a comment
explaining it's about "waiting to be sent" stats, and then additionally
a pgStatPendingFunctions?
Mmm. Thanks. I changed the comment and separated the pgStatPending*
stuff from there and merged it with pgstat_pending_*. And unified
the naming.
 /* ------------------------------------------------------------
  * Public functions used by backends follow
@@ -802,41 +436,107 @@ allow_immediate_pgstat_restart(void)
  * pgstat_report_stat() -
  *
  *	Must be called by processes that performs DML: tcop/postgres.c, logical
- *	receiver processes, SPI worker, etc. to send the so far collected
- *	per-table and function usage statistics to the collector. Note that this
- *	is called only when not within a transaction, so it is fair to use
- *	transaction stop time as an approximation of current time.
- *	----------
+ *	receiver processes, SPI worker, etc. to apply the so far collected
+ *	per-table and function usage statistics to the shared statistics hashes.
+ *
+ *	This requires taking some locks on the shared statistics hashes and some

Weird mix of different indentation.
Fixed. Unified to tabs.
+ *	of updates may be withholded on lock failure. Pending updates are
+ *	retried in later call of this function and finally cleaned up by calling
+ *	this function with force = true or PGSTAT_STAT_MAX_INTERVAL milliseconds
+ *	was elapsed since last cleanup. On the other hand updates by regular

s/was/has/
Ugh. Fixed.
+
+	/* Forecibly update other stats if any. */

s/Forecibly/Forcibly/
Typo aside, what does forcibly mean here?
I meant that it should wait for the lock to be acquired, but I don't
recall why. Changed it to follow the "force" flag.
 /*
- * Scan through the TabStatusArray struct(s) to find tables that actually
- * have counts, and build messages to send. We have to separate shared
- * relations from regular ones because the databaseid field in the message
- * header has to depend on that.
+ * XX: We cannot lock two dshash entries at once. Since we must keep lock

Typically we use three XXX (or alternatively NB:).
Fixed. I thought the number of 'X's represents how bad it is.
(Just kidding.)
+ * while tables stats are being updated we have no choice other than
+ * separating jobs for shared table stats and that of egular tables.

s/egular/regular/
Fixed
+ * Looping over the array twice isapparently ineffcient and more efficient
+ * way is expected.
  */

s/isapparently/is apparently/
s/ineffcient/inefficient/

But I don't know what this sentence is trying to say precisely. Are you
saying this cannot be committed unless this is fixed?
I just wanted to say that I'd be happy if we could do it all in one
loop. So I rewrote it as follows.
| * Flush pending stats separately for regular tables and shared tables
| * since we cannot hold locks on two dshash entries at once.
Nor do I understand why it's actually relevant that we cannot lock two
dshash entries at once. The same table is never shared and unshared,
no?
Ah, I understood. A backend accumulates statistics on shared
tables and on tables of the connected database. Shared table
entries are stored under the database entry with id = 0, so we need to
use the database entries with id = 0 and id = MyDatabaseId stored in
the db_stats dshash...
Ah, I now see that the database entry for shared tables doesn't
necessarily need to live in the db_stats dshash. Okay. I'll separate the
shared dbentry from the db_stats hash in the next version.
+/*
+ * Subroutine for pgstat_update_stat.
+ *
+ * Appies table stats in table status array merging with pending stats if any.

s/Appies/Applies/
+ * If force is true waits until required locks to be acquired. Elsewise stats
s/elsewise/otherwise/
+ * merged stats as pending sats and it will be processed in the next chance.
s/sats/stats/
s/in the next chance/at the next chance/
Sorry for the many silly (but, even for me, hard to spot) mistakes. All
fixed. I'll check everything with ispell later.
+	/* if pending update exists, it should be applied along with */
+	if (pgStatPendingTabHash != NULL)

Why is any of this done if there's no pending data?
Sorry, but I don't follow. We cannot do anything with what doesn't exist.
 	{
-		pgstat_send_tabstat(this_msg);
-		this_msg->m_nentries = 0;
+		pentry = hash_search(pgStatPendingTabHash,
+							 (void *) entry, HASH_FIND, NULL);
+
+		if (pentry)
+		{
+			/* merge new update into pending updates */
+			pgstat_merge_tabentry(pentry, entry, false);
+			entry = pentry;
+		}
+	}
+
+	/* try to apply the merged stats */
+	if (pgstat_apply_tabstat(cxt, entry, !force))
+	{
+		/* succeeded. remove it if it was pending stats */
+		if (pentry && entry != pentry)
+			hash_search(pgStatPendingTabHash,
+						(void *) pentry, HASH_REMOVE, NULL);

Huh, how can entry != pentry in the case of pending stats? They're
literally set to the same value above?
Seems right. Removed. I might have been doing something more complex before.
+	else if (!pentry)
+	{
+		/* failed and there was no pending entry, create new one. */
+		bool		found;
+
+		if (pgStatPendingTabHash == NULL)
+		{
+			HASHCTL		ctl;
+
+			memset(&ctl, 0, sizeof(ctl));
+			ctl.keysize = sizeof(Oid);
+			ctl.entrysize = sizeof(PgStat_TableStatus);
+			pgStatPendingTabHash =
+				hash_create("pgstat pending table stats hash",
+							TABSTAT_QUANTUM,
+							&ctl,
+							HASH_ELEM | HASH_BLOBS);
+		}
+
+		pentry = hash_search(pgStatPendingTabHash,
+							 (void *) entry, HASH_ENTER, &found);
+		Assert (!found);
+
+		*pentry = *entry;
 	}
 }
-	/* zero out TableStatus structs after use */
-	MemSet(tsa->tsa_entries, 0,
-		   tsa->tsa_used * sizeof(PgStat_TableStatus));
-	tsa->tsa_used = 0;
+	}

I don't understand why we do this at all.
If we have no pending stats for the table and fail to apply the
stats in the TSA, we should move them into the pending stats hash.
Though we could merge the numbers in the TSA into the pending hash and then
flush the pending hash, I prefer to avoid useless relocation of stats
numbers from the TSA to the pending stats hash. Does that make sense?
+ if (cxt->tabhash)
+ dshash_detach(cxt->tabhash);Huh, why do we detach here?
To release the lock on cxt->dbentry. It may be destroyed.
+/*
+ * pgstat_apply_tabstat: update shared stats entry using given entry
+ *
+ * If nowait is true, just returns false on lock failure. Dshashes for table
+ * and function stats are kept attached and stored in ctx. The caller must
+ * detach them after use.
+ */
+bool
+pgstat_apply_tabstat(pgstat_apply_tabstat_context *cxt,
+					 PgStat_TableStatus *entry, bool nowait)
+{
+	Oid			dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
+	int			table_mode = PGSTAT_FETCH_EXCLUSIVE;
+	bool		updated = false;
+
+	if (nowait)
+		table_mode |= PGSTAT_FETCH_NOWAIT;
+
+	/*
+	 * We need to keep lock on dbentries for regular tables to avoid race
+	 * condition with drop database. So we hold it in the context variable. We
+	 * don't need that for shared tables.
+	 */
+	if (!cxt->dbentry)
+		cxt->dbentry = pgstat_get_db_entry(dboid, table_mode, NULL);

Oh, wait, what? *That's* the reason why we need to hold a lock on a
second entry?
Yeah, one of the reasons.
Uhm, how can this actually be an issue? If we apply pending stats, we're
connected to the database, it therefore cannot be dropped while we're
applying stats, no?
Ah, ouch. You're right. I'll consider that in the next version
soon. Thanks for the insight.
+	/* attach shared stats table if not yet */
+	if (!cxt->tabhash)
+	{
+		/* apply database stats */
+		if (!entry->t_shared)
+		{
+			/* Update database-wide stats */
+			cxt->dbentry->n_xact_commit += pgStatXactCommit;
+			cxt->dbentry->n_xact_rollback += pgStatXactRollback;
+			cxt->dbentry->n_block_read_time += pgStatBlockReadTime;
+			cxt->dbentry->n_block_write_time += pgStatBlockWriteTime;
+			pgStatXactCommit = 0;
+			pgStatXactRollback = 0;
+			pgStatBlockReadTime = 0;
+			pgStatBlockWriteTime = 0;
+		}

Uh, this seems to have nothing to do with "attach shared stats table if
not yet".
It's because the database stats need to be applied once, and
attaching the tabhash happens once per database. But, actually, it
looks somewhat strange. In other words, I used cxt->tabhash as
the flag that indicates whether applying database stats is
required. I rewrote the comment there as follows. Is it
acceptable?
| * If we haven't attached the tabhash, we didn't apply database stats
| * yet. So apply it now..
+	/* create hash if not yet */
+	if (dbentry->functions == DSM_HANDLE_INVALID)
+	{
+		funchash = dshash_create(area, &dsh_funcparams, 0);
+		dbentry->functions = dshash_get_hash_table_handle(funchash);
+	}
+	else
+		funchash = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);

Why is this created on-demand?
The reason is that function stats are optional, but one dshash
takes 1MB memory at creation time.
+	/*
+	 * First, we empty the transaction stats. Just move numbers to pending
+	 * stats if any. Elsewise try to directly update the shared stats but
+	 * create a new pending entry on lock failure.
+	 */
+	if (pgStatFunctions)

I don't understand why we have both pgStatFunctions
and pgStatPendingFunctions (and the same for other such pairs). That
seems to make no sense to me. The comments for the former literally are:
/*
* Backends store per-function info that's waiting to be sent to the collector
* in this hash table (indexed by function OID).
*/
I must have done that naively, by analogy with pgStatTabHash. I'll
remove pgStatPendingFunctions in the next version.
static HTAB *
pgstat_collect_oids(Oid catalogid)
@@ -1241,62 +1173,54 @@ pgstat_collect_oids(Oid catalogid)
 /* ----------
  * pgstat_drop_database() -
- *
- *	Tell the collector that we just dropped a database.
- *	(If the message gets lost, we will still clean the dead DB eventually
- *	via future invocations of pgstat_vacuum_stat().)
+ *	Remove entry for the database that we just dropped.
+ *
+ *	If some stats update happens after this, this entry will re-created but
+ *	we will still clean the dead DB eventually via future invocations of
+ *	pgstat_vacuum_stat().
  * ----------
  */
+
 void
 pgstat_drop_database(Oid databaseid)
 {

Mixed indentation, added newline.
Fixed.
+/*
+ * snapshot_statentry() - Find an entriy from source dshash.
+ *

s/entriy/entry/
Ok, getting too tired now. Two AM in an airport lounge is not the
easiest place and time to concentrate...
Thank you very much for reviewing and sorry for the slow
response.
I don't think this is all that close to being committable :(
I'm going to work harder on this. The remaining tasks just now are
as follows:
- Separate shared database stats from the db_stats hash.
- Consider relaxing dbentry locking.
- Try removing pgStatPendingFunctions
- ispell on it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
On 2019-02-15 17:29:00 +0900, Kyotaro HORIGUCHI wrote:
At Thu, 7 Feb 2019 13:10:08 -0800, Andres Freund <andres@anarazel.de> wrote in <20190207211008.nc3axviivmcoaluq@alap3.anarazel.de>
Hi,
On 2018-11-12 20:10:42 +0900, Kyotaro HORIGUCHI wrote:
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7eed5866d2..e52ae54821 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8587,9 +8587,9 @@ LogCheckpointEnd(bool restartpoint)
 										 &sync_secs, &sync_usecs);

 	/* Accumulate checkpoint timing summary data, in milliseconds. */
-	BgWriterStats.m_checkpoint_write_time +=
+	BgWriterStats.checkpoint_write_time +=
 		write_secs * 1000 + write_usecs / 1000;
-	BgWriterStats.m_checkpoint_sync_time +=
+	BgWriterStats.checkpoint_sync_time +=
 		sync_secs * 1000 + sync_usecs / 1000;

Why does this patch do renames like this in the same entry as actual
functional changes?

Just because it is no longer "messages". I'm OK with preserving them
as historical names. Reverted.
It's fine to do such renames, just do them as separate patches. It's
hard enough to review changes this big...
 /*
  * Structures in which backends store per-table info that's waiting to be
@@ -189,18 +189,14 @@ typedef struct TabStatHashEntry
  * Hash table for O(1) t_id -> tsa_entry lookup
  */
 static HTAB *pgStatTabHash = NULL;
+static HTAB *pgStatPendingTabHash = NULL;

 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
 static HTAB *pgStatFunctions = NULL;
-
-/*
- * Indicates if backend has some function stats that it hasn't yet
- * sent to the collector.
- */
-static bool have_function_stats = false;
+static HTAB *pgStatPendingFunctions = NULL;

So this patch leaves us with a pgStatFunctions that has a comment
explaining it's about "waiting to be sent" stats, and then additionally
a pgStatPendingFunctions?

Mmm. Thanks. I changed the comment and separated the pgStatPending*
stuff from there and merged it with pgstat_pending_*. And unified
the naming.
I think my point is larger than that - I don't see why the pending
hashtables are needed at all. They seem purely superfluous.
+ if (cxt->tabhash)
+		dshash_detach(cxt->tabhash);

Huh, why do we detach here?
To release the lock on cxt->dbentry. It may be destroyed.
Uh, how?
- Separate shared database stats from the db_stats hash.
- Consider relaxing dbentry locking.
- Try removing pgStatPendingFunctions
- ispell on it.
Additionally:
- consider getting rid of all the pending stuff, not just for functions;
  as far as I can tell it's unnecessary
Thanks,
Andres
On 2019-Feb-15, Andres Freund wrote:
It's fine to do such renames, just do them as separate patches. It's
hard enough to review changes this big...
Talk about moving the whole file to another subdir ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Fri, 15 Feb 2019 17:29:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190215.172900.84235698.horiguchi.kyotaro@lab.ntt.co.jp>
I don't think this is all that close to being committable :(
I'm going to work harder on this. The remaining tasks just now are
as follows:

- Separate shared database stats from the db_stats hash.
In the end I didn't do that. It led to more complexity.
- Consider relaxing dbentry locking.
The lock dshash takes on a dbentry was useless for protecting it from DROP
DATABASE, so I relaxed the locking on the dbentry: the dshash lock is
now released immediately after fetching it. On the other hand, the table
and function counter hashes are simply destroyed at the time of a
counter reset, and that requires some kind of arbitration. I could
introduce dshash_reset(), but it would require many lwlocks, which
would be too much. Instead, I introduced two sets of hash handles
and a reference counter in PgStat_StatDBEntry to stash away a
to-be-removed-but-currently-accessed hash. pin_hashes() and
unpin_hashes(), and reset_dbentry_counters(), are for that.

After all this, dbentries are no longer isolated by the dshash partition
lock on updates, so every dbentry instead has an LWLock for
that. (tabentries/funcentries are still isolated by dshash).
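To illustrate the idea, a minimal sketch of the scheme; the field layout,
and the lock being an LWLock embedded in the entry, follow the description
above but are otherwise assumptions, not the actual patch:

typedef struct PgStat_StatDBEntry
{
	LWLock		lock;			/* protects the fields below */
	int			generation;		/* current hash generation, 0 or 1 */
	dshash_table_handle tables[2];		/* two sets of table stats hashes */
	dshash_table_handle functions[2];	/* ... and of function stats hashes */
	int			refcnt[2];		/* accessors pinning each generation */
	/* ... counters etc. omitted ... */
} PgStat_StatDBEntry;

/* Pin the current generation; all updates in this round go to it. */
static int
pin_hashes(PgStat_StatDBEntry *dbent)
{
	int			gen;

	LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
	gen = dbent->generation;
	dbent->refcnt[gen]++;
	LWLockRelease(&dbent->lock);

	return gen;
}

/* Unpin; the last accessor of a retired generation destroys its hashes. */
static void
unpin_hashes(PgStat_StatDBEntry *dbent, int gen)
{
	bool		destroy;

	LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
	destroy = (--dbent->refcnt[gen] == 0 && gen != dbent->generation);
	LWLockRelease(&dbent->lock);

	if (destroy)
	{
		/* attach and destroy the stashed dshashes of generation "gen" */
	}
}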
pgstat_apply_tabstats() now runs in a single pass. Previously it ran
two passes, one for the shared db and one for my database.

We could eliminate pgStatPendingTabHash, but manipulating the
TabStatusArray instead is tricky. I'm still trying to remove
pgStatPendingTabHash; it doesn't work yet. I'll include it in the next
version.
- Try removing pgStatPendingFunctions
Done. pgStatPendingDeadLocks and pgStatPendingTempfiles are also
removed.
- ispell on it.
I fixed many misspellings.
- Fixed several silly mistakes in the previous version.
I'll post the next version soon.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Mon, 18 Feb 2019 21:35:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190218.213531.89078771.horiguchi.kyotaro@lab.ntt.co.jp>
I'm still trying to remove pgStatPendingTabHash; it doesn't work yet. I'll
include it in the next version.
Done. It passed a test for the case of intermittent dshash lock
failure, which causes local stats info to be left pending.
- Removed pgStatPendingTabHash. "Pending" table entries are left
in pgStatTabList after pgstat_update_stat(). There's no longer
a "pending" stats store.
- Fixed several bugs in reading/writing the at-rest file.
- Transactional snapshot behaved wrongly. Fixed it.
- In this project SQL helper functions are renamed from
pgstat_fetch_stat_* to backend_fetch_stat_*, because
functions serving a similar purpose are implemented on the writer
side and those have the pgstat_fetch_stat_* names.
But some of the functions had very confusing names that don't
follow the convention.
- Split pgStatLocalContext into pgSharedStatsContext and
pgStatSnapshotContext. The former is for shared statistics and
the latter is for transactional snapshot.
- Cleaned up pgstat.[ch], removing stale lines and fixing bogus comments.
This version is heavily improved from the previous version.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Fri, 15 Feb 2019 15:53:28 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190215185328.GA29663@alvherre.pgsql>
On 2019-Feb-15, Andres Freund wrote:
It's fine to do such renames, just do them as separate patches. It's
hard enough to review changes this big...Talk about moving the whole file to another subdir ...
Sounds reasonable. It was a separate patch at first but is
currently melded in, bloating the patch. I'm going to revert the
separation/moving.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Wed, 20 Feb 2019 15:45:17 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190220.154517.24528798.horiguchi.kyotaro@lab.ntt.co.jp>
At Fri, 15 Feb 2019 15:53:28 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190215185328.GA29663@alvherre.pgsql>
Talk about moving the whole file to another subdir ...
Sounds reasonable. It was a separate patch at first but is
currently melded in, bloating the patch. I'm going to revert the
separation/moving.
Done. This version 16 looks as if the moving and splitting never
happened. Major changes are:

- Restored old pgstats_* names. This largely shrinks the patch
size to less than half the lines of v15. More than that, it
gets easier to examine differences. (checkpointer.c and
bgwriter.c have somewhat stale comments but that is an issue for
later.)

- Removed the "oneshot" feature altogether. This simplifies the pgstat API
and makes this patch far less complex.

- Moved StatsLock to LWTRANCHE_STATS, which does not need to
be in the main tranche.

- Fixed several bugs revealed by the shrunken size of the patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hello,
On 21.02.2019 10:05, Kyotaro HORIGUCHI wrote:
Done. This version 16 looks as if the moving and splitting never
happened. Major changes are:

- Restored old pgstats_* names. This largely shrinks the patch
size to less than half the lines of v15. More than that, it
gets easier to examine differences. (checkpointer.c and
bgwriter.c have somewhat stale comments but that is an issue for
later.)

- Removed the "oneshot" feature altogether. This simplifies the pgstat API
and makes this patch far less complex.

- Moved StatsLock to LWTRANCHE_STATS, which does not need to
be in the main tranche.

- Fixed several bugs revealed by the shrunken size of the patch.
I ran regression tests. Unfortunately they didn't pass; the failed test is
'rangetypes':

rangetypes ... FAILED (test process exited with exit code 2)

It seems to me that an autovacuum process terminates because of a segfault.

The segfault occurs within get_pgstat_tabentry_relid(). If I'm not mistaken,
'dbentry' somehow no longer holds a valid pointer.

'dbentry' is obtained in this line in do_autovacuum():
dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
'dbentry' becomes invalid after calling pgstat_vacuum_stat().
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Hello, Arthur.
At Thu, 21 Feb 2019 17:30:50 +0300, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote in <db346d14-4130-57a5-5f46-9a57e9982bec@postgrespro.ru>
Hello,
On 21.02.2019 10:05, Kyotaro HORIGUCHI wrote:
Done. This version 16 looks as if the moving and splitting never
happened. Major changes are:

- Restored old pgstats_* names. This largely shrinks the patch
size to less than half the lines of v15. More than that, it
gets easier to examine differences. (checkpointer.c and
bgwriter.c have somewhat stale comments but that is an issue for
later.)

- Removed the "oneshot" feature altogether. This simplifies the pgstat API
and makes this patch far less complex.

- Moved StatsLock to LWTRANCHE_STATS, which does not need to
be in the main tranche.

- Fixed several bugs revealed by the shrunken size of the patch.

I ran regression tests. Unfortunately they didn't pass; the failed test
is 'rangetypes':

rangetypes ... FAILED (test process exited with exit code 2)

It seems to me that an autovacuum process terminates because of a
segfault.

The segfault occurs within get_pgstat_tabentry_relid(). If I'm not
mistaken, 'dbentry' somehow no longer holds a valid pointer.

'dbentry' is obtained in this line in do_autovacuum():

dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);

'dbentry' becomes invalid after calling pgstat_vacuum_stat().
Thank you very much for the report. I haven't seen the error, but
I think you gave me enough information about the issue. I'll try to
reproduce it.
I found another problem: the commit_ts test reliably fails due to dshash
corruption in the startup process. I haven't found out why and will
investigate that, too.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Fri, 22 Feb 2019 17:19:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190222.171956.98584931.horiguchi.kyotaro@lab.ntt.co.jp>
It seems to me that an autovacuum process terminates because of a
segfault.

The segfault occurs within get_pgstat_tabentry_relid(). If I'm not
mistaken, 'dbentry' somehow no longer holds a valid pointer.
do_autovacuum does the following:
dbentry = pgstat_fetch_stat_dbentry() -- create cached dbentry
StartTransactionCommand() -- starts transaction
pgstat_vacuum_stat() -- blows away the cached dbentry.
shared = pgstat_fetch_stat_dbentry()
It was harmless previously, but the pgstat_* functions now blow away
the local cache at the first call after transaction start. As a
result dbentry becomes invalid. The reason I didn't see the same
crash is that the second pgstat_fetch_stat_dbentry() accidentally
zeroes out the invalidated dbentry.
It is fixed by moving StartTransactionCommand to before the first
pgstat_fetch_stat_dbentry(), which also looks better and avoids the problem.
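In other words, the fix is just a reordering; a minimal sketch
(do_autovacuum() of course does much more around these calls):

	StartTransactionCommand();	/* moved up */
	dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
	pgstat_vacuum_stat();		/* no longer invalidates dbentry */
	shared = pgstat_fetch_stat_dbentry(InvalidOid);

Since the local cache is reset only at the first pgstat_* call after
transaction start, a dbentry fetched after StartTransactionCommand()
stays valid through the rest of the sequence.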
me> I found another problem: the commit_ts test reliably fails due to dshash
me> corruption in the startup process. I haven't found out why and will
me> investigate that, too.
It was rather stupid: pgstat_reset_all() released an entry within
the sequential scan loop, which violates the protocol of
dshash_seq_next().
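A sketch of a pattern that respects the protocol: collect the keys during
the scan and delete them afterwards. (dshash_seq_* are the functions added
by patch 0001 here; their exact signatures, and the databaseid field
access, are assumptions in this sketch.)

	dshash_seq_status status;
	PgStat_StatDBEntry *dbent;
	List	   *doomed = NIL;
	ListCell   *lc;

	/* first pass: only remember which entries to remove */
	dshash_seq_init(&status, db_stats, false);
	while ((dbent = dshash_seq_next(&status)) != NULL)
		doomed = lappend_oid(doomed, dbent->databaseid);
	dshash_seq_term(&status);

	/* second pass: delete outside the sequential scan */
	foreach(lc, doomed)
	{
		Oid			dboid = lfirst_oid(lc);

		dbent = dshash_find(db_stats, &dboid, true);	/* exclusive lock */
		if (dbent)
			dshash_delete_entry(db_stats, dbent);
	}
	list_free(doomed);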
The two issues above are fixed in the attached v17.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 25.02.2019 07:52, Kyotaro HORIGUCHI wrote:
It is fixed by moving StartTransactionCommand to before the first
pgstat_fetch_stat_dbentry(), which also looks better and avoids the problem.
Thank you. Still, there are a couple of TAP tests which don't pass:
002_archiving.pl and 010_pg_basebackup.pl. I think the following simple patch
solves the issue:
diff --git a/src/backend/postmaster/pgstat.c
b/src/backend/postmaster/pgstat.c
index f9b22a4d71..d500f9d090 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3169,6 +3169,9 @@ pgstat_bestart(void)
case StartupProcess:
beentry->st_backendType = B_STARTUP;
break;
+ case ArchiverProcess:
+ beentry->st_backendType = B_ARCHIVER;
+ break;
case BgWriterProcess:
beentry->st_backendType = B_BG_WRITER;
break;
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 8939758c59..4f656c98a3 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -6,7 +6,7 @@ use File::Basename qw(basename dirname);
use File::Path qw(rmtree);
use PostgresNode;
use TestLib;
-use Test::More tests => 106;
+use Test::More tests => 105;
program_help_ok('pg_basebackup');
program_version_ok('pg_basebackup');
010_pg_basebackup.pl has 105 tests now because pg_stat_tmp dir was
removed from the `foreach my $dirname` loop.
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Sun, Feb 24, 2019 at 11:53 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
The two issues above are fixed in the attached v17.
Andres just drew my attention to patch 0004 in this series, which is
definitely not OK. That patch allows the postmaster to use dynamic
shared memory, claiming: "Shared memory baesd stats collector needs it
to work on postmaster and no problem found to do that. Just allow it."
But if you just look a little further down in the code from where that
assertion is located, you'll find this:
/* Lock the control segment so we can register the new segment. */
LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);
It is a well-established principle that the postmaster must not
acquire any locks, because if it did, a corrupted shared memory
segment could take down not only individual backends but the
postmaster as well. So this is entirely not OK in the postmaster. I
think there might be other reasons as well why this is not OK that
aren't occurring to me at the moment, but that one is enough by
itself.
But even if for some reason that were OK, I'm pretty sure that any
design that involves the postmaster interacting with the data stored
in shared memory by the stats collector is an extremely bad idea.
Again, the postmaster is supposed to have as little interaction with
shared memory as possible, precisely so that it is doesn't crash and
burn when some other process corrupts shared memory. Dynamic shared
memory is included in that. So, really, the LWLock here is just the
tip of the iceberg: the postmaster not only CAN'T safely run this
code, but it shouldn't WANT to do so.
And I'm kind of baffled that it does. I haven't looked at the other
patches, but it seems to me that, while a shared-memory stats
collector is a good idea in general to avoid the I/O and CPU costs of
reading and writing temporary files, I don't see why the
postmaster would need to be involved in any of that. Whatever the
reason, though, I'm pretty sure that's GOT to be changed for this
patch set to have any chance of being accepted.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
From 88740269660d00d548910c2f3aa631878c7cf0d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:42:07 +0900
Subject: [PATCH 4/6] Allow dsm to use on postmaster.

DSM is inhibited to be used on postmaster. Shared memory baesd stats
collector needs it to work on postmaster and no problem found to do
that. Just allow it.
Maybe I'm missing something, but why? postmaster doesn't actually need
to process stats messages in any way?
From 774b1495136db1ad6d174ab261487fdf6cb6a5ed Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 5/6] Shared-memory based stats collector

Previously activity statistics is shared via files on disk. Every
backend sends the numbers to the stats collector process via a socket.
It makes snapshots as a set of files on disk with a certain interval
then every backend reads them as necessary. It worked fine for
comparatively small set of statistics but the set is under the
pressure to growing up and the file size has reached the order of
megabytes. To deal with larger statistics set, this patch let backends
directly share the statistics via shared memory.
Btw, you can make the life of a committer easier by collecting the
reviewers and co-authors of a patch yourself...
This desperately needs an introductory comment in pgstat.c or such
explaining how the new scheme works.
+LWLock StatsMainLock;
+#define StatsLock (&StatsMainLock)
Wait, what? You can't just define a lock this way. That's process local
memory, locking that doesn't do anything useful.
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {
Please note that in PG's coding style the { goes on the next line.
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ */
+static int	n_deadlocks = 0;
+static size_t	n_tmpfiles = 0;
+static size_t	n_tmpfilesize = 0;
+
+/*
+ * have_recovery_conflicts represents the existence of any kind if conflict
+ */
+static bool	have_recovery_conflicts = false;
+static int	n_conflict_tablespace = 0;
+static int	n_conflict_lock = 0;
+static int	n_conflict_snapshot = 0;
+static int	n_conflict_bufferpin = 0;
+static int	n_conflict_startup_deadlock = 0;
Probably worthwhile to group those into a struct, even just to make
debugging easier.
-/* ----------
- * pgstat_init() -
- *
- *	Called from postmaster at startup. Create the resources required
- *	by the statistics collector process. If unable to do so, do not
- *	fail --- better to let the postmaster start with stats collection
- *	disabled.
- * ----------
- */
-void
-pgstat_init(void)
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)
You can't have a function like that without explaining why it's there.
+	/* trash the stats on crash */
+	if (code == 0)
+		pgstat_write_statsfiles();
 }
And especially not without documenting what that code is supposed to
mean.
 pgstat_report_stat(bool force)
 {
-	/* we assume this inits to all zeroes: */
-	static const PgStat_TableCounts all_zeroes;
-	static TimestampTz last_report = 0;
-
+	static TimestampTz last_flush = 0;
+	static TimestampTz pending_since = 0;
 	TimestampTz now;
-	PgStat_MsgTabstat regular_msg;
-	PgStat_MsgTabstat shared_msg;
-	TabStatusArray *tsa;
-	int			i;
+	pgstat_flush_stat_context cxt = {0};
+	bool		have_other_stats = false;
+	bool		pending_stats = false;
+	long		elapsed;
+	long		secs;
+	int			usecs;
+
+	/* Do we have anything to flush? */
+	if (have_recovery_conflicts || n_deadlocks != 0 || n_tmpfiles != 0)
+		have_other_stats = true;

 	/* Don't expend a clock check if nothing to do */
 	if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
 		pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
-		!have_function_stats)
-		return;
+		!have_other_stats && !have_function_stats)
+		return 0;
"other" seems like a pretty mysterious category. Seems better to either
name precisely, or just use the underlying variables for checks.
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
+ */
+
+/*
+ * snapshot_statentry() - Find an entry from source dshash with cache.
+ *
Is snapshot_statentry() really a subroutine for pgstat_flush_stat()?
+static void *
+snapshot_statentry(pgstat_snapshot_cxt *cxt, Oid key)
+{
+	char	   *lentry = NULL;
+	size_t		keysize = cxt->dsh_params->key_size;
+	size_t		dsh_entrysize = cxt->dsh_params->entry_size;
+	bool		found;
+	bool	   *negative;
+
+	/* caches the result entry */

 	/*
-	 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-	 * msec since we last sent one, or the caller wants to force stats out.
+	 * Create new hash with arbitrary initial entries since we don't know how
+	 * this hash will grow. The boolean put at the end of the entry is
+	 * negative flag.
 	 */
That, uh, seems pretty ugly and hard to understand.
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ * If nowait is true, returns with false if required lock was not acquired
s/with false/false/
+ * immediately. In the case, infos of some tables may be left alone in TSA to
TSA? I assume TabStatusArray, but I don't think that's a common or
useful abbreviation. It'd be ok to just refer to the variable name.
+static bool +pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait) +{
+ /* try to apply the tab stats */ + if (!pgstat_flush_tabstat(cxt, nowait, entry)) { - pgstat_send_tabstat(this_msg); - this_msg->m_nentries = 0; + /* + * Failed. Leave it alone filling at the beginning in TSA. + */ + TabStatHashEntry *hash_entry; + bool found; + + if (new_tsa_hash == NULL) + new_tsa_hash = create_tabstat_hash(); + + /* Create hash entry for this entry */ + hash_entry = hash_search(new_tsa_hash, &entry->t_id, + HASH_ENTER, &found); + Assert(!found); + + /* + * Move insertion pointer to the next segment. There must be + * enough space segments since we are just leaving some of the + * current elements. + */ + if (dest_elem >= TABSTAT_QUANTUM) + { + Assert(dest_tsa->tsa_next != NULL); + dest_tsa = dest_tsa->tsa_next; + dest_elem = 0; + } + + /* Move the entry if needed */ + if (tsa != dest_tsa || i != dest_elem) + { + PgStat_TableStatus *new_entry; + new_entry = &dest_tsa->tsa_entries[dest_elem]; + *new_entry = *entry; + entry = new_entry; + } + + hash_entry->tsa_entry = entry; + dest_elem++;
This seems a lot of work for just leaving an entry around to be
processed later. Shouldn't code for that already exist elsewhere?
void pgstat_vacuum_stat(void) { - HTAB *htab; - PgStat_MsgTabpurge msg; - PgStat_MsgFuncpurge f_msg; - HASH_SEQ_STATUS hstat; + HTAB *oidtab; + dshash_table *dshtable; + dshash_seq_status dshstat; PgStat_StatDBEntry *dbentry; PgStat_StatTabEntry *tabentry; PgStat_StatFuncEntry *funcentry; - int len;- if (pgStatSock == PGINVALID_SOCKET) + /* we don't collect statistics under standalone mode */ + if (!IsUnderPostmaster) return;- /* - * If not done for this transaction, read the statistics collector stats - * file into some hash tables. - */ - backend_read_statsfile(); + /* If not done for this transaction, take a snapshot of stats */ + pgstat_snapshot_global_stats();
Hm, why do we need a snapshot here?
/*
* Now repeat the above steps for functions. However, we needn't bother
* in the common case where no function stats are being collected.
*/
Can't we move the act of iterating through these hashes and probing
against another hash into a helper function and reuse? These
duplications aren't pretty.
Greetings,
Andres Freund
Ping? Unless there's a new version pretty soon, we're going to have to
move this to the next CF, I think.
On 3/22/19 10:33 PM, Andres Freund wrote:
Ping? Unless there's a new version pretty soon, we're going to have to
move this to the next CF, I think.
Agreed. I've marked it Waiting on Author for now and will move it to
PG13 on the 28th if no new patch appears.
Regards,
--
-David
david@pgmasters.net
Arthur, Andres and Robert, thank you all for the comments, and
sorry for the late reply.
At Wed, 6 Mar 2019 14:43:32 -0800, Andres Freund <andres@anarazel.de> wrote in <20190306224332.374sdf6hsjh27m7t@alap3.anarazel.de>
DSM is inhibited to be used on postmaster. Shared memory baesd stats
collector needs it to work on postmaster and no problem found to do
that. Just allow it.

Maybe I'm missing something, but why? postmaster doesn't actually need
to process stats messages in any way?
Agreed. I should have done the initial work in the startup process.
Thank you Robert for the detailed explanation.
Done in the attached patch. Each backend creates/destroys the shared
memory and loads/writes the stats file. The startup process first
creates the shared stats and destroys it at its end. The first backend
after that loads the file again and creates the shared memory. It is
saved and destroyed at server stop. This is because I didn't pin
the DSM underlying the DSA, which could be a source of trouble.
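(For the record, pinning would look roughly like this; whether it is the
right fix here is a separate question, and using LWTRANCHE_STATS as the
tranche id is an assumption on my side.)

	/* create the area once and keep it alive regardless of attachments */
	dsa_area   *area = dsa_create(LWTRANCHE_STATS);

	dsa_pin(area);				/* survive even with no attached process */
	dsa_pin_mapping(area);		/* stay mapped for this process's lifetime */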
From 774b1495136db1ad6d174ab261487fdf6cb6a5ed Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 5/6] Shared-memory based stats collector

Previously activity statistics is shared via files on disk. Every
backend sends the numbers to the stats collector process via a socket.
It makes snapshots as a set of files on disk with a certain interval
then every backend reads them as necessary. It worked fine for
comparatively small set of statistics but the set is under the
pressure to growing up and the file size has reached the order of
megabytes. To deal with larger statistics set, this patch let backends
directly share the statistics via shared memory.

Btw, you can make the life of a committer easier by collecting the
reviewers and co-authors of a patch yourself...

This desperately needs an introductory comment in pgstat.c or such
explaining how the new scheme works.
Mmm. I'm not sure exactly what you want, but this behaves in a somewhat
more complex way than the previous version, so I added a description that
explains it in the comment at the beginning of pgstat.c. I'd be
very happy if you could give me a little more information about what
you want.
+LWLock StatsMainLock;
+#define StatsLock (&StatsMainLock)

Wait, what? You can't just define a lock this way. That's process local
memory, locking that doesn't do anything useful.
Oops, I don't understand why I did that. Moved it to StatsShmemStruct.
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct {

Please note that in PG's coding style the { goes on the next line.
Fixed.
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ */
+static int	n_deadlocks = 0;
...
+static int	n_conflict_bufferpin = 0;
+static int	n_conflict_startup_deadlock = 0;

Probably worthwhile to group those into a struct, even just to make
debugging easier.
Done. Moved them into BackendDBStats. But I didn't do so for pgStatXact*
and pgStatBlock*, to keep the patch smaller.
-void
-pgstat_init(void)
+static void
+pgstat_postmaster_shutdown(int code, Datum arg)

You can't have a function like that without explaining why it's there.
+	/* trash the stats on crash */
+	if (code == 0)
+		pgstat_write_statsfiles();
 }

And especially not without documenting what that code is supposed to
mean.
Added comments to all functions.
pgstat_report_stat(bool force)
{
...
+ bool have_other_stats = false;
..
+	/* Do we have anything to flush? */
+	if (have_recovery_conflicts || n_deadlocks != 0 || n_tmpfiles != 0)
+		have_other_stats = true;
..
"other" seems like a pretty mysterious category. Seems better to either
name precisely, or just use the underlying variables for checks.
Agreed. It is mainly database-wide stats, so I named it
dbstats. To keep the expression near the definitions of the
variables it uses, I defined two macros, HAVE_PENDING_DBSTATS() and
HAVE_PENDING_CONFLICTS().
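Roughly like the following sketch, built from the condition in the quoted
hunk above; the actual macro bodies may differ:

#define HAVE_PENDING_CONFLICTS() \
	(have_recovery_conflicts)
#define HAVE_PENDING_DBSTATS() \
	(HAVE_PENDING_CONFLICTS() || n_deadlocks != 0 || n_tmpfiles != 0)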
+/* -------
+ * Subroutines for pgstat_flush_stat.
+ * -------
+ */
+
+/*
+ * snapshot_statentry() - Find an entry from source dshash with cache.
+ *

Is snapshot_statentry() really a subroutine for pgstat_flush_stat()?
No. Sorry, it is just misplaced. Fixed.
+static void *
+snapshot_statentry(pgstat_snapshot_cxt *cxt, Oid key)
..
+	 * Create new hash with arbitrary initial entries since we don't know how
+	 * this hash will grow. The boolean put at the end of the entry is
+	 * negative flag.
 	 */

That, uh, seems pretty ugly and hard to understand.
OK. I introduced a common struct, PgStat_snapshot_head, to hold the
negative flag.
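Roughly like this (a sketch; the actual definition differs in detail):

typedef struct PgStat_snapshot_head
{
	Oid			key;		/* hash key of the snapshotted entry */
	bool		negative;	/* true if the shared hash has no such entry */
	/* the body of the entry, if any, follows this header */
} PgStat_snapshot_head;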
+/*
+ * pgstat_flush_stat: Flushes table stats out to shared statistics.
+ *
+ * If nowait is true, returns with false if required lock was not acquired

s/with false/false/
Fixed. Thanks.
+ * immediately. In the case, infos of some tables may be left alone in TSA to
TSA? I assume TabStatusArray, but I don't think that's a common or
useful abbreviation. It'd be ok to just refer to the variable name.
Yeah, I think so too. I thought it wasn't my coinage, but it actually
is. Replaced it with the complete word.
+static bool
+pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
+{

+		/* try to apply the tab stats */
+		if (!pgstat_flush_tabstat(cxt, nowait, entry))
 		{
-			pgstat_send_tabstat(this_msg);
-			this_msg->m_nentries = 0;
+			/*
+			 * Failed. Leave it alone filling at the beginning in TSA.
+			 */
...
This seems a lot of work for just leaving an entry around to be
processed later. Shouldn't code for that already exist elsewhere?
No. The TabStatusArray was always emptied after sending stats data
to the collector, so the packing (entry moving) code is brand new.
void
pgstat_vacuum_stat(void)
...
-	backend_read_statsfile();
+	/* If not done for this transaction, take a snapshot of stats */
+	pgstat_snapshot_global_stats();

Hm, why do we need a snapshot here?
Thank you for pointing that out. It was needed during
development but is no longer. Removed.
/*
* Now repeat the above steps for functions. However, we needn't bother
* in the common case where no function stats are being collected.
 */

Can't we move the act of iterating through these hashes and probing
against another hash into a helper function and reuse? These
duplications aren't pretty.
Like snapshot_statentry? The first member of both
PgStat_StatTabEntry and PgStat_StatFuncEntry is an Oid, so just accessing
the hash entry as an Oid makes it possible. There's no point in providing a
common header struct as I did for the backend snapshot.
pgstat_remove_useless_entries() does that.
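A sketch of what that helper can look like (the dshash_seq_* functions are
from patch 0001 here, and dshash_delete_current() is an assumed companion
for deleting during an exclusive scan; both are assumptions of this sketch):

static void
pgstat_remove_useless_entries(dshash_table *dshtable, HTAB *oidtab)
{
	dshash_seq_status dshstat;
	void	   *ent;

	/* exclusive scan so that entries can be deleted while scanning */
	dshash_seq_init(&dshstat, dshtable, true);
	while ((ent = dshash_seq_next(&dshstat)) != NULL)
	{
		Oid			oid = *(Oid *) ent;	/* first member is the Oid key */

		if (hash_search(oidtab, &oid, HASH_FIND, NULL) == NULL)
			dshash_delete_current(&dshstat);	/* assumed API */
	}
	dshash_seq_term(&dshstat);
}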
At Mon, 25 Feb 2019 14:27:13 +0300, Arthur Zakirov <a.zakirov@postgrespro.ru> wrote in <fb4ca586-c3f0-7c2f-ca2e-2aa49b66d146@postgrespro.ru>
Thank you. Still, there are a couple of TAP tests which don't pass:
002_archiving.pl and 010_pg_basebackup.pl. I think the following simple
patch solves the issue:
I accidentally removed it in 0004. Fixed it. Thanks!
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
Unfortunately I don't think it's realistic to target this to v12. I
think it was unlikely to make at the beginning of the CF, but since then
development just wasn't quick enough to warrant aiming for it. It's a
large and somewhat complex patch, and has some significant risks
associated. Therefore I think we should mark this as targeting v13, and
move to the next CF?
Greetings,
Andres Freund
Hello.
At Wed, 3 Apr 2019 12:56:59 -0700, Andres Freund <andres@anarazel.de> wrote in <20190403195659.fcmk2i7ruxhtyqjl@alap3.anarazel.de>
Unfortunately I don't think it's realistic to target this to v12. I
think it was unlikely to make at the beginning of the CF, but since then
development just wasn't quick enough to warrant aiming for it. It's a
large and somewhat complex patch, and has some significant risks
associated. Therefore I think we should mark this as targeting v13, and
move to the next CF?
I'd like to get this into 12, but time is running out. If
everyone thinks that is not realistic, I'll do that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Apr 04, 2019 at 09:25:12AM +0900, Kyotaro HORIGUCHI wrote:
Hello.
At Wed, 3 Apr 2019 12:56:59 -0700, Andres Freund <andres@anarazel.de> wrote in <20190403195659.fcmk2i7ruxhtyqjl@alap3.anarazel.de>
Unfortunately I don't think it's realistic to target this to v12. I
think it was unlikely to make at the beginning of the CF, but since then
development just wasn't quick enough to warrant aiming for it. It's a
large and somewhat complex patch, and has some significant risks
associated. Therefore I think we should mark this as targeting v13, and
move to the next CF?

I'd like to get this into 12, but time is running out. If everyone
thinks that is not realistic, I'll do that.
Unfortunately, now that we're past code freeze it's clear this is a PG13
matter now :-(
I personally consider this to be a very worthwhile & beneficial improvement,
but I agree with Andres that the patch did not quite get to a committable
state in the last CF. Considering how sensitive a part it touches, I suggest
we try to get it committed early in the PG13 cycle. I'm willing to spend some
time on doing tests/benchmarks and reviewing the code, if needed.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 9 Apr 2019 17:03:33 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <20190409150333.5iashyjxm5jmraml@development>
Unfortunately, now that we're past code freeze it's clear this is a
PG13 matter now :-(

I personally consider this to be a very worthwhile & beneficial
improvement, but I agree with Andres that the patch did not quite get to
a committable state in the last CF. Considering how sensitive a part it
touches, I suggest we try to get it committed early in the PG13 cycle.
I'm willing to spend some time on doing tests/benchmarks and reviewing
the code, if needed.
I'm very happy to be told that. Actually, the code was rushed work
(mainly for reverting the refactoring) and had some stupid
mistakes left. I'm going through the patch again and polishing the code.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Apr 10, 2019 at 09:39:29AM +0900, Kyotaro HORIGUCHI wrote:
At Tue, 9 Apr 2019 17:03:33 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <20190409150333.5iashyjxm5jmraml@development>
Unfortunately, now that we're past code freeze it's clear this is a
PG13 matter now :-(

I personally consider this to be a very worthwhile & beneficial
improvement, but I agree with Andres that the patch did not quite get to
a committable state in the last CF. Considering how sensitive a part it
touches, I suggest we try to get it committed early in the PG13 cycle.
I'm willing to spend some time on doing tests/benchmarks and reviewing
the code, if needed.

I'm very happy to be told that. Actually, the code was rushed work
(mainly for reverting the refactoring) and had some stupid
mistakes left. I'm going through the patch again and polishing the code.
While reviewing the patch I've always had issues with evaluating how it
behaves for various scenarios / workloads. The reviews generally did one
specific benchmark, but I find that unsatisfactory. I wonder whether
we could develop a small set of more comprehensive workloads for this
patch (i.e. different numbers of objects, access patterns, ...).
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello.
At Wed, 10 Apr 2019 11:13:27 +0200, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote in <20190410091327.fpnvjbuu74dzxizl@development>
While reviewing the patch I've always had issues with evaluating how it
behaves for various scenarios / workloads. The reviews generally did one
specific benchmark, but I find that unsatisfactory. I wonder whether
we could develop a small set of more comprehensive workloads for this
patch (i.e. different numbers of objects, access patterns, ...).
Indeed. I'm also having difficulty with the catcache pruning
stuff, but I might have found a clue to that.
I took performance numbers after some amendments and polishing of
the patch.

I expected operf might work, but it doesn't show meaningful
information with an -O2'ed binary, and gprof slows the binary to about
one third speed. But just running pgbench gave me rather stable numbers
(differently from the catcache stuff...).
The numbers are tps for a 300-minute run and the ratio between master
and patched.
[A-D]1 are just running stats-updator clients.
        master-O2       patched         patched/master-O2
A1:  13431.603208   13457.968950          100.1963
B1:  72284.737474   72535.169437          100.3465
C1:     19.963315      20.037411          100.3712
D1:    193.027074     196.651603          101.8777
[A-D]2 tests introduce a stats-reader client.
        master-O2                 patched                   patched/master-O2
        updator / reader          updator / reader          updator / reader
A2:  12929.418503/512.784200   13066.150297/584.686889   101.0575 / 114.0220
B2:  71673.804812/ 20.102687   71916.823242/ 22.109251   100.3391 / 109.9816
C2:     16.066719/485.788495      16.487942/577.930340   102.6217 / 118.9675
D2:    189.563306/ 36.252532     193.817075/ 44.661707   102.2440 / 123.1961
Case A1 is the simplest case: 1 client repeatedly updates stats
of pgbench_accounts (of scale 1, but that doesn't matter).
Case B1 is A1 from 100 concurrent clients.
Case C1 is a massive(?) number of stats updates: concretely, select
sum() on a partitioned table with 1000 children, from 1 client.
Case D1 does C1 from 97 concurrent clients.
A2-D2 run a single stats-referencing client while A1-D1 are
running, respectively. (select sum(seq_scan) from pg_stat_user_tables)
Perhaps the numbers would get worse with many referencing clients,
but I don't think that is realistic.
I'll run tests with many databases (~100?) and expanded tabstat
entry cases.
The attached files are:
v19-0001-sequential-scan-for-dshash.patch:
v19-0002-Add-conditional-lock-feature-to-dshash.patch:
v19-0003-Make-archiver-process-an-auxiliary-process.patch:
v19-0005-Remove-the-GUC-stats_temp_directory.patch:
not changed since v18 except rebasing.
v19-0004-Shared-memory-based-stats-collector.patch:
Rebased. Fixed several bugs. Improved performance in some
cases. Made structs/code tidier. Added/rewrote comments.
run.sh : main test script
gencr.pl : partitioned table generator script generator
(perl gencr.pl | psql postgres to run)
tr.sql : stats-updator client script used by run.sh
ref.sql : stats-reader client script used by run.sh
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi.
At Fri, 17 May 2019 14:27:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190517.142722.139901807.horiguchi.kyotaro@lab.ntt.co.jp>
The attached files are:
It's broken, perhaps by a recent core change?

I'll fix it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
me> It's broken, perhaps by a recent core change?
me>
me> I'll fix it.

Not a core change, but a silly mistake of mine in memory context usage.
Fixed. Version 20.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Fri, May 17, 2019 at 6:48 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Fixed. Version 20.
Hello Horiguchi-san,
A new Commitfest is here. This doesn't apply (maybe just because of
the new improved pgindent). Could we please have a fresh rebase?
Thanks,
--
Thomas Munro
https://enterprisedb.com
Hello.
At Mon, 1 Jul 2019 23:19:31 +1200, Thomas Munro <thomas.munro@gmail.com> wrote in <CA+hUKGK5WNCEe9g4ie=-6Oym-WNqYBXX9A1qPgKv89KGkzW72g@mail.gmail.com>
Fixed. Version 20.
Hello Horiguchi-san,
A new Commitfest is here. This doesn't apply (maybe just because of
the new improved pgindent). Could we please have a fresh rebase?
Thank you for noticing, Thomas. I rebased and made some
improvements. On top of that, I wrote a new test script.
- Rebased.
- Reworded almost all comments; many of them turned out to be
broken. Added some new comments.
- Shortened some LWLock-protected code paths.
- Got rid of a useless palloc for the snapshot area of globalStats.
The attached files are:
gendb.pl:
script to generate databases.
statbehch.pl:
benchmarking script.
0001-sequential-scan-for-dshash.patch:
Adds sequential scan feature to dshash
0002-Add-conditional-lock-feature-to-dshash.patch:
Adds conditional lock feature to dshash
0003-Make-archiver-process-an-auxiliary-process.patch:
Changes the archiver process to an auxiliary process. This is
needed to let the archiver access shared memory.
0004-Shared-memory-based-stats-collector.patch:
The body of this patchset. Moves pgstat from a separate process
reached over a socket to shared memory shared among all backends.
0005-Remove-the-GUC-stats_temp_directory.patch:
Removes the GUC stats_temp_directory. Separated from 0004 to keep
that patch smaller.
====
Tomas said upthread:
While reviewing the patch I've always had issue with evaluating how it
behaves for various scenarios / workloads. The reviews generally did one
specific benchmark, but I find that unsatisfactory. I wonder whether
we could develop a small set of more comprehensive workloads for this
patch (i.e. different numbers of objects, access patterns, ...).
The structure of the shared stats area is as follows:
dshash for database stats
 + dshash entry for db1
   + dshash for table stats
     + dshash entry for tbl1
     + dshash entry for tbl2
     + dshash entry for tbl3
     ...
 + dshash entry for db2
   ...
 + dshash entry for db3
   ...
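To illustrate, here is a rough sketch of what a database entry carries (a sketch only; the field names here are illustrative, not the patch's exact definitions):

    #include "lib/dshash.h"

    /* Sketch: each entry of the top-level (per-database) dshash holds
     * handles to per-database dshash tables for table/function stats. */
    typedef struct PgStat_StatDBEntry_sketch
    {
        Oid         databaseid;         /* dshash key */
        dshash_table_handle tables;     /* table-stats dshash of this db */
        dshash_table_handle functions;  /* function-stats dshash of this db */
        /* ... database-wide counters follow ... */
    } PgStat_StatDBEntry_sketch;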
Dshash restricts an entry to being accessed by only a single
process at a time. This is quite inconvenient, because it makes a
database hash entry a bottleneck. On the other hand, a database's
dshash entry cannot be removed while a backend is still using it,
since it is removed only after all accessors to the database are
gone. So this patch releases the lock on the dbstats dshash entry
immediately, so that the entry doesn't become a bottleneck.
Another bottleneck would be lock conflicts on a
database/table/function stats entry. Like the existing stats
collector, this is avoided by enforcing an interval of no less than
500 ms (PGSTAT_STAT_MIN_INTERVAL) between two successive updates
from one process, and by skipping an update when its lock would
conflict.
Yet another bottleneck was the conflict between reset and
update. Since all processes work on the same data in shared memory,
counters cannot be reset until all referrers are gone. So I let a
dbentry have two sets of table/function stats dshashes, in order to
separate accessors arriving after a reset from the existing
accessors. A process "pins" the current table/function dshashes
before accessing them (pin_hashes()/unpin_hashes()), and all
updates in that round are performed on the pinned generation of
dshashes. If two or more reset requests come in very quick
succession, the requests other than the first one are simply
ignored (reset_dbentry_counters()). So clients can see some
non-zero numbers just after a reset if many processes reset stats
at the same time, but I don't think that is worth amending.
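In pseudo-C, the accessor side looks roughly like this (the two-generation layout is a sketch; pin_hashes(), unpin_hashes() and attach_table_hash() are the patch's functions):

    /* Sketch: a dbentry keeps two generations of table/function dshashes,
     * so a reset can switch to a fresh generation while existing accessors
     * keep working on the pinned old one. */
    int          generation;
    dshash_table *tabhash;

    generation = pin_hashes(dbentry);   /* take a reference on the current
                                         * generation under the dbentry lock */
    tabhash = attach_table_hash(dbentry, generation);

    /* ... apply all updates of this round to tabhash ... */

    unpin_hashes(dbentry, generation);  /* the last unpinner of a superseded
                                         * generation destroys its dshashes */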
With that, almost all potential bottlenecks are eliminated by this
patch. If n clients are running, the mean interval between updates
would be 500/n ms, so 1000 or more clients could still hit the
bottleneck, but I think the current stats collector also suffers
with that many clients. (Whatever would happen with such a massive
number of processes, I don't have an environment that can host that
many clients/backends...)
That being said, this patch is bound to update stats within a
reasonable period, 1000 ms in this patch
(PGSTAT_STAT_MAX_INTERVAL). Once that duration has elapsed, the
patch waits for all required locks to be acquired. So the remaining
possible bottleneck is on the database and table/function shared
hash entries; it is most effectively provoked when many
processes on the same database update the same table.
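Schematically, the flush timing policy is the following (a simplified sketch; the interval constants are the patch's, the helper names are not):

    #define PGSTAT_STAT_MIN_INTERVAL    500     /* ms: normal update interval */
    #define PGSTAT_STAT_RETRY_INTERVAL  100     /* ms: retry after lock failure */
    #define PGSTAT_STAT_MAX_INTERVAL    1000    /* ms: hard deadline */

    if (now - last_flush < PGSTAT_STAT_MIN_INTERVAL)
        return;                         /* too soon; keep accumulating locally */

    /* before the deadline, take entry locks only conditionally */
    nowait = (now - first_pending < PGSTAT_STAT_MAX_INTERVAL);

    if (!flush_pending_stats(nowait))   /* skips lock-conflicting entries
                                         * when nowait is true */
        retry_after(PGSTAT_STAT_RETRY_INTERVAL);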
I remade the benchmark script so that many parameters can be changed
easily. I took numbers for the following access patterns; every
number is the mean of 6 runs. I chose the configurations so that no
disk access happens while the benchmark runs.
#db : number of databases accessed
#tbl : number of tables per database
#clients : number of stats-updater clients
#iter : number of query iterations
#xactlen : number of queries per transaction
#referers: number of stats-referencing clients
#db #tbl #clients #iter #xactlen #referers
A1: 1 1 1 20000 10 0
A2: 1 1 1 20000 10 1
B1: 1 1 90 2000 10 0
B2: 1 1 90 2000 10 1
C1: 1 50 90 2000 10 0
C2: 1 50 90 2000 10 1
D1: 50 1 90 2000 10 0
D2: 50 1 90 2000 10 1
E1: 50 1 90 2000 10 10
F1: 50 1 10 2000 10 90
master patched
updater referrer updater referrer
time / stdev count / stdev time / stdev count / stdev
A1: 1769.13 / 71.87 1729.97 / 61.58
A2: 1903.94 / 75.65 2906.67 / 78.28 1849.41 / 43.00 2855.33 / 62.95
B1: 2967.84 / 9.88 2984.20 / 6.10
B2: 3005.38 / 5.32 253.00 / 33.09 3007.26 / 5.70 253.33 / 60.63
C1: 3066.14 / 13.80 3069.34 / 11.65
C2: 3353.66 / 8.14 282.92 / 20.65 3341.36 / 12.44 251.65 / 21.13
D1: 2977.12 / 5.12 2991.60 / 6.68
D2: 3005.80 / 6.44 252.50 / 38.34 3010.58 / 7.34 282.33 / 57.07
E1: 3255.47 / 8.91 244.02 / 17.03 3293.88 / 18.05 249.13 / 14.58
F1: 2620.85 / 9.17 202.46 / 3.35 2668.60 / 41.04 208.19 / 6.79
ratio (100: same, smaller value means patched version is faster)
updater referrer
patched/master(%) master/patched (%)
A1: 97.79 -
A2: 97.14 101.80
B1: 100.55
B2: 100.06 99.87
C1: 100.10
C2: 99.63 112.43
D1: 100.49
D2: 100.16 89.43
E1: 101.18 97.95
F1: 101.82 97.25
Mmm... I don't see a distinctive tendency. The referrer side shows
larger fluctuation, but I'm not sure that suggests anything
meaningful.
I'll rerun the benchmarks over a longer period (more iterations).
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Thu, 04 Jul 2019 19:27:54 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in <20190704.192754.27063464.horikyota.ntt@gmail.com>
[...]

I'll rerun the benchmarks over a longer period (more iterations).
I put more pressure on the system.
G1: 1 db, 400 clients, 1000 tables, 20000 loops/client, 1000 query/tr, 0 reader
G2: 1 db, 400 clients, 1000 tables, 20000 loops/client, 1000 query/tr, 1 reader
Result:
master patched
updater referrer updater referrer
time / stdev count / stdev time / stdev count / stdev
G1: 125946.22 / 796.83 125227.24 / 89.82
G2: 126463.47 / 81.87 1985.70 / 33.96 125427.95 / 82.35 1985.60 / 55.24
Ratio: (100: same, smaller value means patched version is faster)
updater referrer
patched/master(%) master/patched (%)
G1: 99.40 -
G2: 99.18 100.0
Slightly faster, and perhaps significantly so considering the
stdev. A more crucial difference shows up outside the numbers: the
non-patched version complained many times (incorrectly) that the
stats collector was not responding and that stale stats were being
used, which never happens with the patched version. That means the
reader could be reading numbers as much as 1 second old. (It might
be good for the writer to complain when updates are held off for
more than, say, 0.75 s.)
CF-bot warned that it doesn't work on Windows. I'm having a very
painful time waiting for TortoiseGit, which walks as slowly as its
name suggests. It should be fixed in the next version.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. This is v21 of the patch.
CF-bot warned that it doesn't work on Windows. I'm having a very
painful time waiting for TortoiseGit, which walks as slowly as its
name suggests. It should be fixed in the next version.
Found a bug in initialization: StatsShmemInit() was placed in the
wrong place, so stats code in child processes accessed an
uninitialized pointer. It was a leftover from the previous shape,
where dsm was activated in the postmaster.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2019-Jul-11, Kyotaro Horiguchi wrote:
Hello. This is v21 of the patch.
[...]

Found a bug in initialization: StatsShmemInit() was placed in the
wrong place, so stats code in child processes accessed an
uninitialized pointer. It was a leftover from the previous shape,
where dsm was activated in the postmaster.
This doesn't apply anymore. Can you please rebase?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Tue, 3 Sep 2019 18:28:05 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190903222805.GA13932@alvherre.pgsql>
[...]

This doesn't apply anymore. Can you please rebase?
Thanks! I forgot to post the rebased version after doing the rebase. Here it is.
- (Re)Rebased to the current master.
- Passed all tests for me.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2019-Sep-10, Kyotaro Horiguchi wrote:
At Tue, 3 Sep 2019 18:28:05 -0400, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190903222805.GA13932@alvherre.pgsql>
[...]

Thanks! I forgot to post the rebased version after doing the rebase. Here it is.
- (Re)Rebased to the current master.
- Passed all tests for me.
This seems to have very trivial conflicts -- please rebase again?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Wed, 25 Sep 2019 18:01:02 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20190925210102.GA26396@alvherre.pgsql>
[...]

This seems to have very trivial conflicts -- please rebase again?
Affected by the code movement in 9a86f03b4e. Just
rebased. Thanks.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Fri, Sep 27, 2019 at 09:46:47AM +0900, Kyotaro Horiguchi wrote:
Affected by the code movement in 9a86f03b4e. Just
rebased. Thanks.
This does not apply anymore. Could you provide a rebase? I have
moved the patch to next CF, waiting on author.
Thanks,
--
Michael
At Sun, 1 Dec 2019 11:12:32 +0900, Michael Paquier <michael@paquier.xyz> wrote in
[...]

This does not apply anymore. Could you provide a rebase? I have
moved the patch to next CF, waiting on author.
Thanks! Rebased.
# I should design and then run a performance test on this...
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Tue, 03 Dec 2019 17:27:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Thanks! Rebased.
CFbots seem unhappy with this. Rebased.
# I should design and then run a performance test on this...
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
CFbots seem unhappy with this. Rebased.
It's failing to apply again, so I rebased again. I haven't
read the code or done any testing beyond check-world.
regards, tom lane
Attachments:
Tom Lane wrote:
In patch 0003,
 		/*
-		 * Was it the archiver?  If so, just try to start a new one; no need
-		 * to force reset of the rest of the system.  (If fail, we'll try
-		 * again in future cycles of the main loop.).  Unless we were waiting
-		 * for it to shut down; don't restart it in that case, and
-		 * PostmasterStateMachine() will advance to the next shutdown step.
+		 * Was it the archiver?  Normal exit can be ignored; we'll start a new
+		 * one at the next iteration of the postmaster's main loop, if
+		 * necessary.  Any other exit condition is treated as a crash.
 		 */
 		if (pid == PgArchPID)
 		{
 			PgArchPID = 0;
 			if (!EXIT_STATUS_0(exitstatus))
-				LogChildExit(LOG, _("archiver process"),
-							 pid, exitstatus);
-			if (PgArchStartupAllowed())
-				PgArchPID = pgarch_start();
+				HandleChildCrash(pid, exitstatus,
+								 _("archiver process"));
 			continue;
 		}
I'm worried that we're causing all processes to terminate when an
archiver dies in some ugly way; but in the current coding, it's pretty
harmless and we'd just start a new one. I think this needs to be
reconsidered. As far as I know, pgarchiver remains unconnected to
shared memory so a crash-restart cycle is not necessary. We should
continue to just log the error message and move on.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,
On 2020-03-09 15:37:05 -0300, Alvaro Herrera wrote:
Tom Lane wrote:

In patch 0003, [...]
archiver dies in some ugly way; but in the current coding, it's pretty
harmless and we'd just start a new one. I think this needs to be
reconsidered. As far as I know, pgarchiver remains unconnected to
shared memory so a crash-restart cycle is not necessary. We should
continue to just log the error message and move on.
Why is it worth having the archiver be "robust" that way? Except that
random implementation details led to it not being connected to shared
memory, and thus allowing a restart for any exit code, I don't see a
need? It doesn't have exit paths that could validly trigger another exit
code, as far as I can see.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2020-03-09 15:37:05 -0300, Alvaro Herrera wrote:
I'm worried that we're causing all processes to terminate when an
archiver dies in some ugly way; but in the current coding, it's pretty
harmless and we'd just start a new one. I think this needs to be
reconsidered. As far as I know, pgarchiver remains unconnected to
shared memory so a crash-restart cycle is not necessary. We should
continue to just log the error message and move on.
Why is it worth having the archiver be "robust" that way?
I'd ask a different question: what the heck is this patchset doing
touching the archiver in the first place? I can see no plausible
reason for that doing anything related to stats collection. If we
now need some new background processing for stats, let's make a
new postmaster child process to do that, not overload the archiver
with unrelated responsibilities.
regards, tom lane
Hi,
On 2020-03-09 15:04:23 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2020-03-09 15:37:05 -0300, Alvaro Herrera wrote:
I'm worried that we're causing all processes to terminate when an
archiver dies in some ugly way; but in the current coding, it's pretty
harmless and we'd just start a new one. I think this needs to be
reconsidered. As far as I know, pgarchiver remains unconnected to
shared memory so a crash-restart cycle is not necessary. We should
continue to just log the error message and move on.Why is it worth having the archiver be "robust" that way?
I'd ask a different question: what the heck is this patchset doing
touching the archiver in the first place? I can see no plausible
reason for that doing anything related to stats collection.
As of a release or two back, it sends stats messages for archiving
events:
	if (pgarch_archiveXlog(xlog))
	{
		/* successful */
		pgarch_archiveDone(xlog);

		/*
		 * Tell the collector about the WAL file that we successfully
		 * archived
		 */
		pgstat_send_archiver(xlog, false);

		break;			/* out of inner retry loop */
	}
	else
	{
		/*
		 * Tell the collector about the WAL file that we failed to
		 * archive
		 */
		pgstat_send_archiver(xlog, true);
If we now need some new background processing for stats, let's make a
new postmaster child process to do that, not overload the archiver
with unrelated responsibilities.
I don't think that's what's going on. It's just changing the archiver so
it can report stats via shared memory - which subsequently means that it
needs to deal differently with errors than when it wasn't connected.
Greetings,
Andres Freund
Thank you all for the suggestions.
At Mon, 9 Mar 2020 12:25:39 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2020-03-09 15:04:23 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
On 2020-03-09 15:37:05 -0300, Alvaro Herrera wrote:
I'm worried that we're causing all processes to terminate when an
archiver dies in some ugly way; but in the current coding, it's pretty
harmless and we'd just start a new one. I think this needs to be
reconsidered. As far as I know, pgarchiver remains unconnected to
shared memory so a crash-restart cycle is not necessary. We should
continue to just log the error message and move on.Why is it worth having the archiver be "robust" that way?
I'd ask a different question: what the heck is this patchset doing
touching the archiver in the first place? I can see no plausible
reason for that doing anything related to stats collection.As of a release or two back, it sends stats messages for archiving
events:if (pgarch_archiveXlog(xlog))
{
/* successful */
pgarch_archiveDone(xlog);/*
* Tell the collector about the WAL file that we successfully
* archived
*/
pgstat_send_archiver(xlog, false);break; /* out of inner retry loop */
}
else
{
/*
* Tell the collector about the WAL file that we failed to
* archive
*/
pgstat_send_archiver(xlog, true);If we now need some new background processing for stats, let's make a
new postmaster child process to do that, not overload the archiver
with unrelated responsibilities.

I don't think that's what's going on. It's just changing the archiver so
it can report stats via shared memory - which subsequently means that it
needs to deal differently with errors than when it wasn't connected.
That's true, but I share the same concern as Tom: the archiver
becomes more tightly linked with the other processes than its actual
relationship to them warrants. We may need a second static shared
memory segment apart from the current one.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 10 Mar 2020 12:27:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
That's true, but I share the same concern as Tom: the archiver
becomes more tightly linked with the other processes than its actual
relationship to them warrants. We may need a second static shared
memory segment apart from the current one.
Anyway, I changed the target version of the entry to 14.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
That's true, but I share the same concern as Tom: the archiver
becomes more tightly linked with the other processes than its actual
relationship to them warrants.
What's the problem here? We have a number of helper processes
(checkpointer, bgwriter) that are attached to shared memory, and it's
not a problem.
We may need a second static shared memory segment apart from the
current one.
That seems absurd to me. Solving a non-problem by introducing complex
new infrastructure.
At Mon, 9 Mar 2020 20:34:20 -0700, Andres Freund <andres@anarazel.de> wrote in
On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
That's true, but I share the same concern as Tom: the archiver
becomes more tightly linked with the other processes than its actual
relationship to them warrants.

What's the problem here? We have a number of helper processes
(checkpointer, bgwriter) that are attached to shared memory, and it's
not a problem.
That theoretically raises the chance of a server crash by some small
probability. But, yes, it's absurd to premise that the archiver
process crashes.
We may need a second static shared memory segment apart from the
current one.

That seems absurd to me. Solving a non-problem by introducing complex
new infrastructure.
Ok. I think I must be worrying too much.
Thanks for the suggestion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020-Mar-10, Kyotaro Horiguchi wrote:
At Mon, 9 Mar 2020 20:34:20 -0700, Andres Freund <andres@anarazel.de> wrote in
On 2020-03-10 12:27:25 +0900, Kyotaro Horiguchi wrote:
That's true, but I share the same concern as Tom: the archiver
becomes more tightly linked with the other processes than its actual
relationship to them warrants.

What's the problem here? We have a number of helper processes
(checkpointer, bgwriter) that are attached to shared memory, and it's
not a problem.That theoretically raises the chance of server-crash by a small amount
of probability. But, yes, it's absurd to prmise that archiver process
crashes.
The case I'm worried about is a misconfigured archive_command that
causes the archiver to misbehave (exit with a code other than 0); if
that already doesn't happen, or we can make it not happen, then I'm okay
with the changes to the archiver.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Mar 10, 2020 at 1:48 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
[...]
The case I'm worried about is a misconfigured archive_command that
causes the archiver to misbehave (exit with a code other than 0); if
that already doesn't happen, or we can make it not happen, then I'm okay
with the changes to the archiver.
Any script that gets killed, or that exits with a return code > 127,
would do that.
On 2020-03-10 09:48:07 -0300, Alvaro Herrera wrote:
[...]
The case I'm worried about is a misconfigured archive_command that
causes the archiver to misbehave (exit with a code other than 0); if
that already doesn't happen, or we can make it not happen, then I'm okay
with the changes to the archiver.
Well, an exit(1) is also fine, afaict. No?
The archive command can just trigger either a FATAL or a LOG:
	rc = system(xlogarchcmd);
	if (rc != 0)
	{
		/*
		 * If either the shell itself, or a called command, died on a signal,
		 * abort the archiver.  We do this because system() ignores SIGINT and
		 * SIGQUIT while waiting; so a signal is very likely something that
		 * should have interrupted us too.  Also die if the shell got a hard
		 * "command not found" type of error.  If we overreact it's no big
		 * deal, the postmaster will just start the archiver again.
		 */
		int			lev = wait_result_is_any_signal(rc, true) ? FATAL : LOG;

		if (WIFEXITED(rc))
		{
			ereport(lev,
					(errmsg("archive command failed with exit code %d",
							WEXITSTATUS(rc)),
					 errdetail("The failed archive command was: %s",
							   xlogarchcmd)));
		}
		else if (WIFSIGNALED(rc))
		{
#if defined(WIN32)
			ereport(lev,
					(errmsg("archive command was terminated by exception 0x%X",
							WTERMSIG(rc)),
					 errhint("See C include file \"ntstatus.h\" for a description of the hexadecimal value."),
					 errdetail("The failed archive command was: %s",
							   xlogarchcmd)));
#else
			ereport(lev,
					(errmsg("archive command was terminated by signal %d: %s",
							WTERMSIG(rc), pg_strsignal(WTERMSIG(rc))),
					 errdetail("The failed archive command was: %s",
							   xlogarchcmd)));
#endif
		}
		else
		{
			ereport(lev,
					(errmsg("archive command exited with unrecognized status %d",
							rc),
					 errdetail("The failed archive command was: %s",
							   xlogarchcmd)));
		}

		snprintf(activitymsg, sizeof(activitymsg), "failed on %s", xlog);
		set_ps_display(activitymsg, false);

		return false;
	}
I.e. there are only normal ways to shut down the archiver due to a
failing archive command.
Greetings,
Andres Freund
Hi,
On 2020-03-10 19:52:22 +0100, Julien Rouhaud wrote:
On Tue, Mar 10, 2020 at 1:48 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
[...]
The case I'm worried about is a misconfigured archive_command that
causes the archiver to misbehave (exit with a code other than 0); if
that already doesn't happen, or we can make it not happen, then I'm okay
with the changes to the archiver.

Any script that gets killed, or that exits with a return code > 127,
would do that.
But just with a FATAL, not with something worse. And the default
handling for aux backends accepts exit code 1 (which elog uses for
FATAL) as a normal shutdown. Am I missing something here?
Greetings,
Andres Freund
Hi,
Thomas, could you look at the first two patches here, and my review
questions?
General comments about this series:
- A lot of the review comments feel like I've written them before, a
year or more ago. I feel this patch ought to be in a much better
state. There's a lot of IMO fairly obvious stuff here, and things that
have been mentioned multiple times previously.
- There are a *lot* of typos in here. I realize being an ESL speaker is
hard, but a lot of these can be found with the simplest spellchecker.
That's one thing for a patch that has just been hacked up as a POC, but
this is a multi-year thread?
- There's some odd formatting. Consider using pgindent more regularly.
More detailed comments below.
I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.
On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshash

Add sequential scan feature to dshash.
 		dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
 										LW_EXCLUSIVE));

 	delete_item(hash_table, item);
-	hash_table->find_locked = false;
-	hash_table->find_exclusively_locked = false;
-	LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+	/* We need to keep partition lock while sequential scan */
+	if (!hash_table->seqscan_running)
+	{
+		hash_table->find_locked = false;
+		hash_table->find_exclusively_locked = false;
+		LWLockRelease(PARTITION_LOCK(hash_table, partition));
+	}
 }
This seems like a failure-prone API.
 /*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
 	Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
 								hash_table->find_exclusively_locked
 								? LW_EXCLUSIVE : LW_SHARED));
+	/* lock is under control of sequential scan */
+	Assert(!hash_table->seqscan_running);

 	hash_table->find_locked = false;
 	hash_table->find_exclusively_locked = false;
@@ -592,6 +610,164 @@ dshash_memhash(const void *v, size_t size, void *arg)
 	return tag_hash(v, size);
 }

+/*
+ * dshash_seq_init/_next/_term
+ *           Sequentially scan trhough dshash table and return all the
+ *           elements one by one, return NULL when no more.
s/trhough/through/
This uses a different comment style than the other functions in this
file. Why?
+ * dshash_seq_term should be called for incomplete scans and otherwise
+ * shoudln't. Finished scans are cleaned up automatically.
s/shoudln't/shouldn't/
I find the "cleaned up automatically" API terrible. I know you copied it
from dynahash, but I find it to be really failure prone. dynahash isn't
an example of good postgres code, the opposite, I'd say. It's a lot
easier to unconditionally have a terminate call if we need that.
+ * Returned elements are locked as is the case with dshash_find. However, the
+ * caller must not release the lock.
+ *
+ * Same as dynanash, the caller may delete returned elements midst of a scan.
I think it's a bad idea to refer to dynahash here. That's just going to
get out of date. Also, code should be documented on its own.
+ * If consistent is set for dshash_seq_init, the all hash table partitions are
+ * locked in the requested mode (as determined by the exclusive flag) during
+ * the scan. Otherwise partitions are locked in one-at-a-time way during the
+ * scan.
Yet delete unconditionally retains locks?
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+				bool consistent, bool exclusive)
+{
Why does this patch add the consistent mode? There are no users currently?
Without it, it's not clear that we need a separate _term function, I think?
I think we also can get rid of the dshash_delete changes, by instead
adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
or such.
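Roughly like this, I think (a sketch of the proposed API; dshash_delete_current() does not exist yet, and MyEntry/entry_is_dead() are stand-ins for whatever the caller uses):

    dshash_seq_status status;
    MyEntry    *entry;

    /* exclusive scan, since we intend to delete entries */
    dshash_seq_init(&status, hash_table, false, true);
    while ((entry = (MyEntry *) dshash_seq_next(&status)) != NULL)
    {
        if (entry_is_dead(entry))
            dshash_delete_current(&status, entry);  /* proposed API */
    }
    dshash_seq_term(&status);   /* unconditional terminate call */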
@@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
extern void dshash_detach(dshash_table *hash_table);
extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
extern void dshash_destroy(dshash_table *hash_table);
-
/* Finding, creating, deleting entries. */
extern void *dshash_find(dshash_table *hash_table,
const void *key, bool
exclusive);
There's a number of spurious changes like this.
From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" taht commands not to wait
for lock.
s/taht/that/
There should be at least a sentence or two explaining why these are
useful.
+/*
+ * The version of dshash_find, which is allowed to return immediately on lock
+ * failure. Lock status is set to *lock_failed in that case.
+ */
Hm. Not sure I like the *lock_acquired API.
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+					 bool exclusive, bool nowait, bool *lock_acquired)
 {
 	dshash_hash hash;
 	size_t		partition;
 	dshash_table_item *item;

+	/*
+	 * No need to return lock resut when !nowait. Otherwise the caller may
+	 * omit the lock result when NULL is returned.
+	 */
+	Assert(nowait || !lock_acquired);
+
 	hash = hash_key(hash_table, key);
 	partition = PARTITION_FOR_HASH(hash);

 	Assert(hash_table->control->magic == DSHASH_MAGIC);
 	Assert(!hash_table->find_locked);

-	LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-				  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+	if (nowait)
+	{
+		if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partition),
+									  exclusive ? LW_EXCLUSIVE : LW_SHARED))
+		{
+			if (lock_acquired)
+				*lock_acquired = false;
Why is the test for lock_acquired needed here? I don't think it's
possible to use nowait correctly without passing in lock_acquired?
Think it'd make sense to document & assert that nowait = true implies
lock_acquired set, and nowait = false implies lock_acquired not being
set.
But, uh, why do we even need the lock_acquired parameter? If we couldn't
find an entry, then we should just release the lock, no?
I'm however inclined to think it's better to just have a separate
function for the nowait case, rather than an extended version supporting
both (with an internal helper doing most of the work).
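For instance, something of this shape (a sketch; dshash_find_nowait() and dshash_find_internal() are hypothetical names):

    /* internal helper doing the actual lookup */
    static void *dshash_find_internal(dshash_table *ht, const void *key,
                                      bool exclusive, bool nowait);

    void *
    dshash_find(dshash_table *ht, const void *key, bool exclusive)
    {
        return dshash_find_internal(ht, key, exclusive, false);
    }

    void *
    dshash_find_nowait(dshash_table *ht, const void *key, bool exclusive)
    {
        /* returns NULL, with no lock held, both when the entry does not
         * exist and when the partition lock was not immediately free */
        return dshash_find_internal(ht, key, exclusive, true);
    }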
+/*
+ * The version of dshash_find_or_insert, which is allowed to return immediately
+ * on lock failure.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.
They don't, there's no lock_acquired parameter.
+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+							   const void *key,
+							   bool *found,
+							   bool nowait)
I think it's absurd to have dshash_find, dshash_find_extended,
dshash_find_or_insert, dshash_find_or_insert_extended. If they're
extended they should also be able to specify whether the entry will get
created.
From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.
Archiver process must be a auxiliary process since it uses shared
memory after stats data wes moved onto shared-memory. Make the process
s/wes/was/ s/onto/into/
an auxiliary process in order to make it work.
@@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			StartupProcessMain();
 			proc_exit(1);		/* should never return */

+		case ArchiverProcess:
+			/* don't set signals, archiver has its own agenda */
+			PgArchiverMain();
+			proc_exit(1);		/* should never return */
+
 		case BgWriterProcess:
 			/* don't set signals, bgwriter has its own agenda */
 			BackgroundWriterMain();
I think I'd rather remove the two comments that are copied to 6 out of 8
cases - they don't add anything.
 /* ------------------------------------------------------------
  * Local functions called by archiver follow
  * ------------------------------------------------------------
@@ -219,8 +148,8 @@ pgarch_forkexec(void)
  * The argc/argv parameters are valid only in EXEC_BACKEND case. However,
  * since we don't use 'em, it hardly matters...
  */
-NON_EXEC_STATIC void
-PgArchiverMain(int argc, char *argv[])
+void
+PgArchiverMain(void)
 {
 	/*
 	 * Ignore all signals usually bound to some action in the postmaster,
@@ -252,8 +181,27 @@ PgArchiverMain(int argc, char *argv[])
 static void
 pgarch_exit(SIGNAL_ARGS)
 {
-	/* SIGQUIT means curl up and die ... */
-	exit(1);
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction. Just nail the windows shut and get out of town. Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0). This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * process. This is necessary precisely because we don't clean up our
+	 * shared memory state. (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
 }
This seems to be a copy of code & comments from other signal handlers that predates
commit 8e19a82640d3fa2350db146ec72916856dd02f0a
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2018-08-08 19:08:10 +0300
Don't run atexit callbacks in quickdie signal handlers.
I think this just should use SignalHandlerForCrashExit().
I think we can even commit that separately - there's not really a reason
to not do that today, as far as I can tell?
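I.e. something like this in the archiver's startup code (a sketch, assuming the generic handler from postmaster/interrupt.c):

    #include "postmaster/interrupt.h"

    /* Instead of the hand-rolled pgarch_exit(): SignalHandlerForCrashExit()
     * _exit(2)s without running atexit callbacks, so the postmaster sees
     * the death as a crash - all in one shared place. */
    pqsignal(SIGQUIT, SignalHandlerForCrashExit);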
/* SIGUSR1 signal handler for archiver process */
Hm - this currently doesn't set up a correct sigusr1 handler for a
shared memory backend - needs to invoke procsignal_sigusr1_handler
somewhere.
We can probably just convert to using normal latches here, and remove
the current 'wakened' logic? That'll remove the indirection via
postmaster too, which is nice.
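A rough sketch of what that main loop could look like (the wait-event name is made up here; the signal handlers would just SetLatch(MyLatch)):

    for (;;)
    {
        ResetLatch(MyLatch);

        /* ... archive all currently pending WAL segments ... */

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         PGARCH_AUTOWAKE_INTERVAL * 1000L,
                         WAIT_EVENT_ARCHIVER_MAIN);  /* hypothetical event */
    }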
@@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
 	switch (backendType)
 	{
+		case B_ARCHIVER:
+			backendDesc = "archiver";
+			break;
should imo include 'WAL' or such.
From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector

Previously activity statistics is shared via files on disk. Every
backend sends the numbers to the stats collector process via a socket.
It makes snapshots as a set of files on disk with a certain interval
then every backend reads them as necessary. It worked fine for
comparatively small set of statistics but the set is under the
pressure to growing up and the file size has reached the order of
megabytes. To deal with larger statistics set, this patch let backends
directly share the statistics via shared memory.
This spends a fair bit describing the old state, but very little
describing the new state.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0bfd6151c4..a6b0bdec12 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -53,7 +53,6 @@
 postgres 15554 0.0 0.0 57536 1184 ? Ss 18:02 0:00 postgres: back
 postgres 15555 0.0 0.0 57536  916 ? Ss 18:02 0:00 postgres: checkpointer
 postgres 15556 0.0 0.0 57536  916 ? Ss 18:02 0:00 postgres: walwriter
 postgres 15557 0.0 0.0 58504 2244 ? Ss 18:02 0:00 postgres: autovacuum launcher
-postgres 15558 0.0 0.0 17512 1068 ? Ss 18:02 0:00 postgres: stats collector
 postgres 15582 0.0 0.0 58772 3080 ? Ss 18:04 0:00 postgres: joe runbug 127.0.0.1 idle
 postgres 15606 0.0 0.0 58772 3052 ? Ss 18:07 0:00 postgres: tgl regression [local] SELECT waiting
 postgres 15610 0.0 0.0 58772 3056 ? Ss 18:07 0:00 postgres: tgl regression [local] idle in transaction
@@ -65,9 +64,8 @@
    master server process.  The command arguments shown for it are the same
    ones used when it was launched.  The next five processes are background
    worker processes automatically launched by the
-   master process.  (The <quote>stats collector</quote> process will not be present
-   if you have set the system not to start the statistics collector; likewise
-   the <quote>autovacuum launcher</quote> process can be disabled.)
+   master process.  (The <quote>autovacuum launcher</quote> process will not
+   be present if you have set the system not to start it.)
    Each of the remaining processes is a server process handling one client
    connection.  Each such process sets its command line display in the form
There's more references to the stats collector than this... E.g. in
catalogs.sgml
<xref linkend="view-table"/> lists the system views described here.
More detailed documentation of each view follows below.
There are some additional views that provide access to the results of
the statistics collector; they are described in <xref
linkend="monitoring-stats-views-table"/>.
</para>
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6d1f28c327..8dcb0fb7f7 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -1956,15 +1956,15 @@ do_autovacuum(void)
 										  ALLOCSET_DEFAULT_SIZES);
 	MemoryContextSwitchTo(AutovacMemCxt);

+	/* Start a transaction so our commands have one to play into. */
+	StartTransactionCommand();
+
 	/*
 	 * may be NULL if we couldn't find an entry (only happens if we are
 	 * forcing a vacuum for anti-wrap purposes).
 	 */
 	dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);

-	/* Start a transaction so our commands have one to play into. */
-	StartTransactionCommand();
-
 	/*
 	 * Clean up any dead statistics collector entries for this DB. We always
 	 * want to do this exactly once per DB-processing cycle, even if we find
@@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
 	if (isshared)
 	{
 		if (PointerIsValid(shared))
-			tabentry = hash_search(shared->tables, &relid,
-								   HASH_FIND, NULL);
+			tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
 	}
 	else if (PointerIsValid(dbentry))
-		tabentry = hash_search(dbentry->tables, &relid,
-							   HASH_FIND, NULL);
+		tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);

 	return tabentry;
 }
Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside
the stats subsystem there are exactly one caller for the non extended
version, as far as I can see. That's index_concurrently_swap() - and imo
that's code that should live in the stats subsystem, rather than open
coded in index.c.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ca5c6376e5..1ffe073a1f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1,15 +1,23 @@
 /* ----------
  * pgstat.c
  *
- *	All the statistics collector stuff hacked up in one big, ugly file.
+ *	Statistics collector facility.
  *
- *	TODO:	- Separate collector, postmaster and backend stuff
- *			  into different files.
+ *	Collects per-table and per-function usage statistics of all backends on
+ *	shared memory. pg_count_*() and friends are the interface to locally store
+ *	backend activities during a transaction. Then pgstat_flush_stat() is called
+ *	at the end of a transaction to pulish the local stats on shared memory.
  *
I'd rather not exhaustively list the different objects this handles -
it'll either be annoying to maintain, or just get out of date.
- *	  - Add some automatic call for pgstat vacuuming.
+ *	To avoid congestion on the shared memory, we update shared stats no more
+ *	often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
+ *	all the local numbers cannot be flushed immediately, we postpone updates
+ *	and try the next chance after the interval of
+ *	PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
+ *	PGSTAT_STAT_MAX_INTERVAL(1000ms).
I'm not convinced by this backoff logic. The basic interval seems quite
high for something going through shared memory, and the max retry seems
pretty low.
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define	PGSTAT_SHARED	0
This is unreferenced.
+#define PGSTAT_EXCLUSIVE 1
+#define PGSTAT_NOWAIT 2
And these should imo rather be parameters.
+typedef enum PgStat_TableLookupResult
+{
+	NOT_FOUND,
+	FOUND,
+	LOCK_FAILED
+} PgStat_TableLookupResult;
This seems like a seriously bad idea to me. These are very generic
names. There's also basically no references except setting them to the
first two?
+#define StatsLock (&StatsShmem->StatsMainLock)
-static time_t last_pgstat_start_time;
+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+	LWLock		StatsMainLock;		/* lock to protect this struct */
+	dsa_handle	stats_dsa_handle;	/* DSA handle for stats data */
+	dshash_table_handle db_hash_handle;
+	dsa_pointer	global_stats;
+	dsa_pointer	archiver_stats;
+	int			refcount;
+} StatsShmemStruct;
Why isn't this an lwlock in lwlocknames.h, rather than being
allocated here?
+/*
+ * BgWriter global statistics counters. The name cntains a remnant from the
+ * time when the stats collector was a dedicate process, which used sockets to
+ * send it.
+ */
+PgStat_MsgBgWriter BgWriterStats = {0};
I am strongly against keeping the 'Msg' prefix. That seems extremely
confusing going forward.
+/* common header of snapshot entry in reader snapshot hash */
+typedef struct PgStat_snapshot
+{
+	Oid		key;
+	bool	negative;
+	void   *body;		/* end of header part: to keep alignment */
+} PgStat_snapshot;
+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+	char	   *hash_name;			/* name of the snapshot hash */
+	int			hash_entsize;		/* element size of hash entry */
+	dshash_table_handle dsh_handle;	/* dsh handle to attach */
+	const dshash_parameters *dsh_params;	/* dshash params */
+	HTAB	  **hash;				/* points to variable to hold hash */
+	dshash_table **dshash;			/* ditto for dshash */
+} pgstat_snapshot_param;
Why does this exist? The struct contents are actually constant across
calls, yet you have declared them inside functions (as static - static
on function scope isn't really the same as global static).
If we want it, I think we should separate the naming more
meaningfully. The important difference between 'hash' and 'dshash' isn't
the hashing module, it's that one is a local copy, the other a shared
hashtable!
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception in that it is cluster-wide value.
+ */
+typedef struct BackendDBStats
+{
+	int		n_conflict_tablespace;
+	int		n_conflict_lock;
+	int		n_conflict_snapshot;
+	int		n_conflict_bufferpin;
+	int		n_conflict_startup_deadlock;
+	int		n_deadlocks;
+	size_t	n_tmpfiles;
+	size_t	tmpfilesize;
+	HTAB   *checksum_failures;
+} BackendDBStats;
Why is this a separate struct from PgStat_StatDBEntry? We shouldn't
have these fields in multiple places.
+	if (StatsShmem->refcount > 0)
+		StatsShmem->refcount++;
What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.
To me it looks like there's a lot of added complexity just because you
want to be able to reset stats via
void
pgstat_reset_all(void)
{
/*
* We could directly remove files and recreate the shared memory area. But
* detach then attach for simplicity.
*/
pgstat_detach_shared_stats(false); /* Don't write */
pgstat_attach_shared_stats();
Without that you'd not need the complexity of attaching, detaching to
the same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.
Nor would the dynamic re-creation of the db dshash table be needed.
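I.e. roughly (a sketch; stats_dsa_area and dsh_dbparams are assumed to be available after startup):

    /* attach once at backend start and keep the table attached for the
     * backend's lifetime, instead of detach/re-attach cycles */
    static dshash_table *pgStatSharedDBHash = NULL;

    static void
    attach_shared_stats_once(void)
    {
        if (pgStatSharedDBHash == NULL)
            pgStatSharedDBHash = dshash_attach(stats_dsa_area,
                                               &dsh_dbparams,
                                               StatsShmem->db_hash_handle,
                                               NULL);
    }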
+/* ----------
+ * pgstat_report_stat() -
+ *
+ *	Must be called by processes that performs DML: tcop/postgres.c, logical
+ *	receiver processes, SPI worker, etc. to apply the so far collected
+ *	per-table and function usage statistics to the shared statistics hashes.
+ *
+ *	Updates are applied not more frequent than the interval of
+ *	PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ *	failure if force is false and there's no pending updates longer than
+ *	PGSTAT_STAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ *	succeeding calls of this function.
+ *
+ *	Returns the time until the next timing when updates are applied in
+ *	milliseconds if there are no updates holded for more than
+ *	PGSTAT_STAT_MIN_INTERVAL milliseconds.
+ *
+ *	Note that this is called only out of a transaction, so it is fine to use
+ *	transaction stop time as an approximation of current time.
+ * ----------
+ */
Inconsistent indentation.
+long
+pgstat_report_stat(bool force)
 {
+	/* Flush out table stats */
+	if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+		pending_stats = true;
+
+	/* Flush out function stats */
+	if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+		pending_stats = true;
This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
being confusing while reading the code, it also made the diff much
harder to read.
-			snprintf(fname, sizeof(fname), "%s/%s", directory,
-					 entry->d_name);
-			unlink(fname);
+	/* Flush out database-wide stats */
+	if (HAVE_PENDING_DBSTATS())
+	{
+		if (!pgstat_flush_dbstats(&cxt, !force))
+			pending_stats = true;
 	}
Linearly checking a number of stats doesn't seem like the right way
going forward. Also seems fairly omission prone.
Why does this code check live in pgstat_report_stat(), rather than
pgstat_flush_dbstats()?
/*
* snapshot_statentry() - Common routine for functions
* pgstat_fetch_stat_*entry()
*
Why has this function been added between the closely linked
pgstat_report_stat() and pgstat_flush_stat() etc?
* Returns the pointer to a snapshot of a shared entry for the key or NULL if
* not found. Returned snapshots are stable during the current transaction or
* until pgstat_clear_snapshot() is called.
*
* The snapshots are stored in a hash, pointer to which is stored in the
* *HTAB variable pointed by cxt->hash. If not created yet, it is created
* using hash_name, hash_entsize in cxt.
*
* cxt->dshash points to dshash_table for dbstat entries. If not yet
* attached, it is attached using cxt->dsh_handle.
Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.
At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without
control from the outside:
	/*
	 * We don't want so frequent update of stats snapshot. Keep it at least
	 * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
	 */
	if (clear_snapshot)
	{
		clear_snapshot = false;

		if (pgStatSnapshotContext &&
			snapshot_globalStats.stats_timestamp <
			GetCurrentStatementStartTimestamp() -
			PGSTAT_STAT_MIN_INTERVAL * 1000)
		{
			MemoryContextReset(pgStatSnapshotContext);

			/* Reset variables */
			global_snapshot_is_valid = false;
			pgStatSnapshotContext = NULL;
			pgStatLocalHash = NULL;

			pgstat_setup_memcxt();
		}
	}
I think we should just remove this entire local caching snapshot layer
for lookups.
/*
* pgstat_flush_stat: Flushes table stats out to shared statistics.
*
Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats
or such? Given that the code for dealing with an individual table's
entry is named pgstat_flush_tabstat() that's very confusing.
* If nowait is true, returns false if required lock was not acquired
* immediately. In that case, unapplied table stats updates are left alone in
* TabStatusArray to wait for the next chance. cxt holds some dshash related
* values that we want to carry around while updating shared stats.
*
* Returns true if all stats info are flushed. Caller must detach dshashes
* stored in cxt after use.
*/
static bool
pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
{
	static const PgStat_TableCounts all_zeroes;
	TabStatusArray *tsa;
	HTAB	   *new_tsa_hash = NULL;
	TabStatusArray *dest_tsa = pgStatTabList;
	int			dest_elem = 0;
	int			i;

	/* nothing to do, just return */
	if (pgStatTabHash == NULL)
		return true;

	/*
	 * Destroy pgStatTabHash before we start invalidating PgStat_TableEntry
	 * entries it points to.
	 */
	hash_destroy(pgStatTabHash);
	pgStatTabHash = NULL;

	/*
	 * Scan through the TabStatusArray struct(s) to find tables that actually
	 * have counts, and try flushing it out to shared stats. We may fail on
	 * some entries in the array. Leaving the entries being packed at the
	 * beginning of the array.
	 */
	for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
	{
It seems odd that there's a tabstat specific code in pgstat_flush_stat
(also note singular while it's processing all stats, whereas you're
below treating pgstat_flush_tabstat as only affecting one table).
		for (i = 0; i < tsa->tsa_used; i++)
		{
			PgStat_TableStatus *entry = &tsa->tsa_entries[i];

			/* Shouldn't have any pending transaction-dependent counts */
			Assert(entry->trans == NULL);

			/*
			 * Ignore entries that didn't accumulate any actual counts, such
			 * as indexes that were opened by the planner but not used.
			 */
			if (memcmp(&entry->t_counts, &all_zeroes,
					   sizeof(PgStat_TableCounts)) == 0)
				continue;

			/* try to apply the tab stats */
			if (!pgstat_flush_tabstat(cxt, nowait, entry))
			{
				/*
				 * Failed. Move it to the beginning in TabStatusArray and
				 * leave it.
				 */
				TabStatHashEntry *hash_entry;
				bool		found;

				if (new_tsa_hash == NULL)
					new_tsa_hash = create_tabstat_hash();

				/* Create hash entry for this entry */
				hash_entry = hash_search(new_tsa_hash, &entry->t_id,
										 HASH_ENTER, &found);
				Assert(!found);

				/*
				 * Move insertion pointer to the next segment if the segment
				 * is filled up.
				 */
				if (dest_elem >= TABSTAT_QUANTUM)
				{
					Assert(dest_tsa->tsa_next != NULL);
					dest_tsa = dest_tsa->tsa_next;
					dest_elem = 0;
				}

				/*
				 * Pack the entry at the begining of the array. Do nothing if
				 * no need to be moved.
				 */
				if (tsa != dest_tsa || i != dest_elem)
				{
					PgStat_TableStatus *new_entry;

					new_entry = &dest_tsa->tsa_entries[dest_elem];
					*new_entry = *entry;

					/* use new_entry as entry hereafter */
					entry = new_entry;
				}

				hash_entry->tsa_entry = entry;
				dest_elem++;
			}
This seems like too much code. Why is this entirely different from the
way funcstats works? The difference was already too big before, but this
made it *way* worse.
One goal of this project, as I understand it, is to make it easier to
add additional stats. As is, this seems to make it harder from the code
level.
bool
pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
					 PgStat_TableStatus *entry)
{
	Oid			dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
	int			table_mode = PGSTAT_EXCLUSIVE;
	bool		updated = false;
	dshash_table *tabhash;
	PgStat_StatDBEntry *dbent;
	int			generation;

	if (nowait)
		table_mode |= PGSTAT_NOWAIT;

	/* Attach required table hash if not yet. */
	if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
	{
		/*
		 * Return if we don't have corresponding dbentry. It would've been
		 * removed.
		 */
		dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
		if (!dbent)
			return false;

		/*
		 * We don't hold lock on the dbentry since it cannot be dropped while
		 * we are working on it.
		 */
		generation = pin_hashes(dbent);
		tabhash = attach_table_hash(dbent, generation);
This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.
if (entry->t_shared)
{
cxt->shgeneration = generation;
cxt->shdbentry = dbent;
cxt->shdb_tabhash = tabhash;
}
else
{
cxt->mygeneration = generation;
cxt->mydbentry = dbent;
cxt->mydb_tabhash = tabhash;/*
* We come here once per database. Take the chance to update
* database-wide stats
*/
LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
dbent->n_xact_commit += pgStatXactCommit;
dbent->n_xact_rollback += pgStatXactRollback;
dbent->n_block_read_time += pgStatBlockReadTime;
dbent->n_block_write_time += pgStatBlockWriteTime;
LWLockRelease(&dbent->lock);
pgStatXactCommit = 0;
pgStatXactRollback = 0;
pgStatBlockReadTime = 0;
pgStatBlockWriteTime = 0;
}
}
else if (entry->t_shared)
{
dbent = cxt->shdbentry;
tabhash = cxt->shdb_tabhash;
}
else
{
dbent = cxt->mydbentry;
tabhash = cxt->mydb_tabhash;
}

/*
* Local table stats should be applied to both dbentry and tabentry at
* once. Update dbentry only if we could update tabentry.
*/
if (pgstat_update_tabentry(tabhash, entry, nowait))
{
pgstat_update_dbentry(dbent, entry);
updated = true;
}
At this point we're very deeply nested. pgstat_report_stat() ->
pgstat_flush_stat() -> pgstat_flush_tabstat() ->
pgstat_update_tabentry().
That's way over the top imo.
I don't think it makes much sense that pgstat_update_dbentry() is called
separately for each table. Why would we want to constantly lock that
entry? It seems to be much more sensible to instead have
pgstat_flush_stat() transfer the stats it reported to the pending
database wide counters, and then report that to shared memory *once* per
pgstat_report_stat() with pgstat_flush_dbstats()?
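A minimal sketch of the approach being suggested here, assuming a
hypothetical local aggregate (pendingDbCounts and
pgstat_accumulate_dbcounts are illustrative names, not from the patch):

/* local, unlocked aggregate filled while flushing each table entry */
static PgStat_TableCounts pendingDbCounts;

static void
pgstat_accumulate_dbcounts(const PgStat_TableStatus *entry)
{
	pendingDbCounts.t_tuples_returned += entry->t_counts.t_tuples_returned;
	pendingDbCounts.t_tuples_fetched += entry->t_counts.t_tuples_fetched;
	pendingDbCounts.t_tuples_inserted += entry->t_counts.t_tuples_inserted;
	pendingDbCounts.t_tuples_updated += entry->t_counts.t_tuples_updated;
	pendingDbCounts.t_tuples_deleted += entry->t_counts.t_tuples_deleted;
	pendingDbCounts.t_blocks_fetched += entry->t_counts.t_blocks_fetched;
	pendingDbCounts.t_blocks_hit += entry->t_counts.t_blocks_hit;
}

/* then, once per pgstat_report_stat(), a single lock acquisition: */
LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
dbentry->n_tuples_returned += pendingDbCounts.t_tuples_returned;
/* ... and so on for the remaining counters ... */
LWLockRelease(&dbentry->lock);
MemSet(&pendingDbCounts, 0, sizeof(pendingDbCounts));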
/*
* pgstat_flush_dbstats: Flushes out miscellaneous database stats.
*
* If nowait is true, returns with false on lock failure on dbentry.
*
* Returns true if all stats are flushed out.
*/
static bool
pgstat_flush_dbstats(pgstat_flush_stat_context *cxt, bool nowait)
{
/* get dbentry if not yet */
if (cxt->mydbentry == NULL)
{
int op = PGSTAT_EXCLUSIVE;
if (nowait)
op |= PGSTAT_NOWAIT;

cxt->mydbentry = pgstat_get_db_entry(MyDatabaseId, op, NULL);

/* return if lock failed. */
if (cxt->mydbentry == NULL)
	return false;

/* we use this generation of table/function stats in this turn */
cxt->mygeneration = pin_hashes(cxt->mydbentry);
}

LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
if (HAVE_PENDING_CONFLICTS())
pgstat_flush_recovery_conflict(cxt->mydbentry);
if (BeDBStats.n_deadlocks != 0)
pgstat_flush_deadlock(cxt->mydbentry);
if (BeDBStats.n_tmpfiles != 0)
pgstat_flush_tempfile(cxt->mydbentry);
if (BeDBStats.checksum_failures != NULL)
pgstat_flush_checksum_failure(cxt->mydbentry);
LWLockRelease(&cxt->mydbentry->lock);
What's the point of having all these sub-functions? I see that you, for
an undocumented reason, have pgstat_report_recovery_conflict() flush
conflict stats immediately:
dbentry = pgstat_get_db_entry(MyDatabaseId,
PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
&status);

if (status == LOCK_FAILED)
	return;

/* We had a chance to flush immediately */
pgstat_flush_recovery_conflict(dbentry);

dshash_release_lock(pgStatDBHash, dbentry);
But I don't understand why? Nor why we'd not just report all pending
database wide changes in that case?
The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.
/* ----------
* pgstat_vacuum_stat() -
*
* Remove objects we can get rid of.
* ----------
*/
void
pgstat_vacuum_stat(void)
{
HTAB *oidtab;
dshash_seq_status dshstat;
PgStat_StatDBEntry *dbentry;

/* we don't collect stats under standalone mode */
if (!IsUnderPostmaster)
	return;

/*
* Read pg_database and make a list of OIDs of all existing databases
*/
oidtab = pgstat_collect_oids(DatabaseRelationId, Anum_pg_database_oid);

/*
 * Search the database hash table for dead databases and drop them
 * from the hash.
 */
dshash_seq_init(&dshstat, pgStatDBHash, false, true);
while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
{
Oid dbid = dbentry->databaseid;

CHECK_FOR_INTERRUPTS();
/* the DB entry for shared tables (with InvalidOid) is never dropped */
if (OidIsValid(dbid) &&
hash_search(oidtab, (void *) &dbid, HASH_FIND, NULL) == NULL)
pgstat_drop_database(dbid);
}

/* Clean up */
hash_destroy(oidtab);
So, uh, pgstat_drop_database() again does a *separate* lookup in the
dshash, locking the entry. Which only works because you added this dirty
hack:
/* We need to keep partition lock while sequential scan */
if (!hash_table->seqscan_running)
{
hash_table->find_locked = false;
hash_table->find_exclusively_locked = false;
LWLockRelease(PARTITION_LOCK(hash_table, partition));
}
to dshash_delete_entry(). This seems insane to me. There's not even a
comment explaining this?
/*
* Similarly to above, make a list of all known relations in this DB.
*/
oidtab = pgstat_collect_oids(RelationRelationId, Anum_pg_class_oid);

/*
* Check for all tables listed in stats hashtable if they still exist.
* Stats cache is useless here so directly search the shared hash.
*/
pgstat_remove_useless_entries(dbentry->tables, &dsh_tblparams, oidtab);

/*
* Repeat the above but we needn't bother in the common case where no
* function stats are being collected.
*/
if (dbentry->functions != DSM_HANDLE_INVALID)
{
oidtab = pgstat_collect_oids(ProcedureRelationId, Anum_pg_proc_oid);

pgstat_remove_useless_entries(dbentry->functions, &dsh_funcparams,
							  oidtab);
}
dshash_release_lock(pgStatDBHash, dbentry);
Wait, why are we holding the database partition lock across all this?
Again without any comments explaining why?
+void
+pgstat_send_archiver(const char *xlog, bool failed)
Why do we still have functions named pgstat_send*?
Greetings,
Andres Freund
Thank you very much!!
At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
Thomas, could you look at the first two patches here, and my review
questions?

General comments about this series:
- A lot of the review comments feel like I've written them before, a
year or more ago. I feel this patch ought to be in a much better
state. There's a lot of IMO fairly obvious stuff here, and things that
have been mentioned multiple times previously.
I apologize for all the obvious stuff and the things that have been
mentioned before. I'll address them.
- There's a *lot* of typos in here. I realize being an ESL is hard, but
a lot of these can be found with the simplest spellchecker. That's
one thing for a patch that just has been hacked up as a POC, but this
is a multi year thread?
I'll review all the changed parts again. I used ispell, but I must have
failed to check many of the changes.
- There's some odd formatting. Consider using pgindent more regularly.
I'll do so.
More detailed comments below.
Thank you very much for the intensive review, I'm going to revise the
patch according to them.
I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2020-03-13 16:34:50 +0900, Kyotaro Horiguchi wrote:
Thank you very much!!
At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
Thomas, could you look at the first two patches here, and my review
questions?

General comments about this series:
- A lot of the review comments feel like I've written them before, a
year or more ago. I feel this patch ought to be in a much better
state. There's a lot of IMO fairly obvious stuff here, and things that
have been mentioned multiple times previously.

I apologize for all the obvious stuff and the things that have been
mentioned before. I'll address them.

- There's a *lot* of typos in here. I realize being an ESL is hard, but
a lot of these can be found with the simplest spellchecker. That's
one thing for a patch that just has been hacked up as a POC, but this
is a multi year thread?

I'll review all the changed parts again. I used ispell, but I must have
failed to check many of the changes.

- There's some odd formatting. Consider using pgindent more regularly.
I'll do so.
More detailed comments below.
Thank you very much for the intensive review, I'm going to revise the
patch according to them.

I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.
I take your response to mean that you'd prefer to evolve the patch
largely on your own? I'm mainly asking because I think there's some
chance that we could still get this into v13, but if so we'll have to go
for it now.
Greetings,
Andres Freund
Hi Horiguchi-san, Andres,
I tried to rebase this (see attached, no intentional changes beyond
rebasing). Some feedback:
On Fri, Mar 13, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
Thomas, could you look at the first two patches here, and my review
questions?
Ack.
dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
LW_EXCLUSIVE));

	delete_item(hash_table, item);
-	hash_table->find_locked = false;
-	hash_table->find_exclusively_locked = false;
-	LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+	/* We need to keep partition lock while sequential scan */
+	if (!hash_table->seqscan_running)
+	{
+		hash_table->find_locked = false;
+		hash_table->find_exclusively_locked = false;
+		LWLockRelease(PARTITION_LOCK(hash_table, partition));
+	}
 }

This seems like a failure prone API.
If I understand correctly, the only purpose of the seqscan_running
variable is to control that behaviour ^^^. That is, to make
dshash_delete_entry() keep the partition lock if you delete an entry
while doing a seq scan. Why not get rid of that, and provide a
separate interface for deleting while scanning?
dshash_seq_delete(dshash_seq_status *scan, void *entry). I suppose it
would be most common to want to delete the "current" item in the seq
scan, but it could allow you to delete anything in the same partition,
or any entry if using the "consistent" mode. Oh, I see that Andres
said the same thing later.
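To illustrate, a scan-and-delete loop under such an interface could look
like this (a sketch only; dshash_delete_current's exact signature was
still being discussed, and db_is_alive is a stand-in for the real
liveness check):

dshash_seq_status status;
PgStat_StatDBEntry *dbentry;

dshash_seq_init(&status, pgStatDBHash, false, true);	/* exclusive */
while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&status)) != NULL)
{
	if (!db_is_alive(dbentry->databaseid))	/* hypothetical check */
		dshash_delete_current(&status);		/* uses the partition lock
											 * the scan already holds */
}
dshash_seq_term(&status);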
[Andres complaining about comments and language stuff]
I would be happy to proof read and maybe extend the comments (writing
new comments will also help me understand and review the code!), and
maybe some code changes to move this forward. Horiguchi-san, are you
working on another version now? If so I'll wait for it before I do
that.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+				bool consistent, bool exclusive)
+{

Why does this patch add the consistent mode? There's no users currently?
Without it, it's not clear that we need a separate _term function, I think?
+1, let's not do that if we don't need it!
The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.
Right, I also noticed that:
/*
* Local table stats should be applied to both dbentry and tabentry at
* once. Update dbentry only if we could update tabentry.
*/
if (pgstat_update_tabentry(tabhash, entry, nowait))
{
pgstat_update_dbentry(dbent, entry);
updated = true;
}
So pgstat_update_tabentry() goes to great trouble to take locks
conditionally, but then pgstat_update_dbentry() immediately does:
LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
LWLockRelease(&dbentry->lock);
Why can't we be "lazy" with the dbentry stats too? Is it really
important for the table stats and DB stats to agree with each other?
Even if it were, your current coding doesn't achieve that: the table
stats are updated before the DB stat under different locks, so I'm not
sure why it can't wait longer.
Hmm. Even if you change the above code use a conditional lock, I am
wondering (admittedly entirely without data) if this approach is still
too clunky: even trying and failing to acquire the lock creates
contention, just a bit less. I wonder if it would make sense to make
readers do more work, so that writers can avoid contention. For
example, maybe PgStat_StatDBEntry could hold an array of N sets of
counters, and readers have to add them all up. An advanced version of
this idea would use a reasonably fresh copy of something like
sched_getcpu() and numa_node_of_cpu() to select a partition to
minimise contention and cross-node traffic, with a portable fallback
based on PID or something. CPU core/node awareness is something I
haven't looked into too seriously, but it's been on my mind to solve
some other problems.
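A sketch of that idea (entirely illustrative; the stripe count, type and
field names are made up):

#define PGSTAT_COUNTER_PARTS 16		/* made-up stripe count */

typedef struct PgStat_DBCounterPart
{
	LWLock		lock;
	PgStat_Counter n_tuples_returned;
	/* ... the other database-wide counters ... */
} PgStat_DBCounterPart;

/* inside PgStat_StatDBEntry: */
PgStat_DBCounterPart parts[PGSTAT_COUNTER_PARTS];

/* a writer picks a stripe, e.g. by PID as the portable fallback */
PgStat_DBCounterPart *part = &dbentry->parts[MyProcPid % PGSTAT_COUNTER_PARTS];

/* a reader sums over all stripes */
PgStat_Counter total = 0;
for (int i = 0; i < PGSTAT_COUNTER_PARTS; i++)
	total += dbentry->parts[i].n_tuples_returned;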
Attachments:
Thank you for the comment.
The new version is attached.
At Thu, 12 Mar 2020 20:13:24 -0700, Andres Freund <andres@anarazel.de> wrote in
General comments about this series:
- A lot of the review comments feel like I've written them before, a
year or more ago. I feel this patch ought to be in a much better
state. There's a lot of IMO fairly obvious stuff here, and things that
have been mentioned multiple times previously.
- There's a *lot* of typos in here. I realize being an ESL is hard, but
a lot of these can be found with the simplest spellchecker. That's
one thing for a patch that just has been hacked up as a POC, but this
is a multi year thread?
- There's some odd formatting. Consider using pgindent more regularly.More detailed comments below.
I'm considering rewriting the parts of the patchset that I don't like -
but it'll look quite different afterwards.On 2020-01-22 17:24:04 +0900, Kyotaro Horiguchi wrote:
From 5f7946522dc189429008e830af33ff2db435dd42 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/5] sequential scan for dshashAdd sequential scan feature to dshash.
dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
LW_EXCLUSIVE));

	delete_item(hash_table, item);
-	hash_table->find_locked = false;
-	hash_table->find_exclusively_locked = false;
-	LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+	/* We need to keep partition lock while sequential scan */
+	if (!hash_table->seqscan_running)
+	{
+		hash_table->find_locked = false;
+		hash_table->find_exclusively_locked = false;
+		LWLockRelease(PARTITION_LOCK(hash_table, partition));
+	}
 }

This seems like a failure prone API.
[001]: (Fixed) As a result of the fix in [044], it's gone now.
/*
@@ -568,6 +584,8 @@ dshash_release_lock(dshash_table *hash_table, void *entr

+ * dshash_seq_init/_next/_term
+ *			Sequentially scan trhough dshash table and return all the
+ *			elements one by one, return NULL when no more.

s/trhough/through/
[002]: (Fixed)
This uses a different comment style than the other functions in this
file. Why?
[003]: (Fixed)
It was following the equivalent in dynahash.c. I rewrote it a different
way.
+ * dshash_seq_term should be called for incomplete scans and otherwise
+ * shoudln't. Finished scans are cleaned up automatically.

s/shoudln't/shouldn't/
[004]: (Fixed)
I find the "cleaned up automatically" API terrible. I know you copied it
from dynahash, but I find it to be really failure prone. dynahash isn't
an example of good postgres code, the opposite, I'd say. It's a lot
easier to unconditionally have a terminate call if we need that.
[005]: (Fixed) OK, I remember I had a similar thought on this. Fixed this and
all the corresponding call sites.
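For example, with an unconditional terminate call the caller pattern
becomes the following (a sketch, using the interface as it stands in this
version of the patch; is_wanted() stands in for whatever the caller tests):

dshash_seq_status status;
void	   *entry;

dshash_seq_init(&status, hash_table, false, false);
while ((entry = dshash_seq_next(&status)) != NULL)
{
	if (is_wanted(entry))		/* hypothetical caller-side test */
		break;					/* early exit is now fine... */
}
dshash_seq_term(&status);		/* ...because termination is always explicit */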
+ * Returned elements are locked as is the case with dshash_find. However, the
+ * caller must not release the lock.
+ *
+ * Same as dynanash, the caller may delete returned elements midst of a scan.

I think it's a bad idea to refer to dynahash here. That's just going to
get out of date. Also, code should be documented on its own.
[006]: (Fixed) Understood; fixed as follows.
* Returned elements are locked and the caller must not explicitly release
* it.
+ * If consistent is set for dshash_seq_init, the all hash table partitions are
+ * locked in the requested mode (as determined by the exclusive flag) during
+ * the scan. Otherwise partitions are locked in one-at-a-time way during the
+ * scan.

Yet delete unconditionally retains locks?
[007]: (Not fixed) Yes. If we release the lock on the current partition, a
hash resize breaks concurrent scans.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+				bool consistent, bool exclusive)
+{

Why does this patch add the consistent mode? There's no users currently?
Without it, it's not clear that we need a separate _term function, I think?
[008]: (Fixed) I remember that it was used in an early stage of development. I
left it for API completeness, but it is not actually used. _term is another
matter: we need to release the lock and clean up some dshash state if we allow
a seq scan to exit before it reaches the end.
I removed the "consistent" from dshash_seq_init and reverted
dshash_seq_term.
I think we also can get rid of the dshash_delete changes, by instead
adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
or such.
[009]: (Fixed) I'm not sure about the point of having two interfaces that are
hard to distinguish. Maybe dshash_delete_current(dshash_seq_stat *status) is
enough. I also reverted dshash_delete().
@@ -70,7 +86,6 @@ extern dshash_table *dshash_attach(dsa_area *area,
extern void dshash_detach(dshash_table *hash_table);
extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
extern void dshash_destroy(dshash_table *hash_table);
-
/* Finding, creating, deleting entries. */
extern void *dshash_find(dshash_table *hash_table,
const void *key, bool
exclusive);

There's a number of spurious changes like this.
[010]: (Fixed) I found such isolated line insertions or removals: two in 0001,
eight in 0004.
From 60da67814fe40fd2a0c1870b15dcf6fcb21c989a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 27 Sep 2018 11:15:19 +0900
Subject: [PATCH 2/5] Add conditional lock feature to dshash

Dshash currently waits for lock unconditionally. This commit adds new
interfaces for dshash_find and dshash_find_or_insert. The new
interfaces have an extra parameter "nowait" taht commands not to wait
for lock.

s/taht/that/
[011]: (Fixed) Applied ispell on all commit messages.
There should be at least a sentence or two explaining why these are
useful.
Sounds reasonable. I rewrote it that way.
+/*
+ * The version of dshash_find, which is allowed to return immediately on lock
+ * failure. Lock status is set to *lock_failed in that case.
+ */

Hm. Not sure I like the *lock_acquired API.

+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+					 bool exclusive, bool nowait, bool *lock_acquired)
...
+ Assert(nowait || !lock_acquired);
...
+ if (lock_acquired)
+		*lock_acquired = false;

Why is the test for lock_acquired needed here? I don't think it's
possible to use nowait correctly without passing in lock_acquired?

Think it'd make sense to document & assert that nowait = true implies
lock_acquired set, and nowait = false implies lock_acquired not being
set.

But, uh, why do we even need the lock_acquired parameter? If we couldn't
find an entry, then we should just release the lock, no?
[012]: (Fixed) (related to [013], [014]) The name is confusing. In this
version the old dshash_find_extended and dshash_find_or_insert_extended are
merged into a new dshash_find_extended, which covers all the functionality of
dshash_find and dshash_find_or_insert; in addition, insertion under a shared
lock is now allowed.
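The merged interface then has roughly this shape (a sketch of what is
described above, not the final signature):

void *
dshash_find_extended(dshash_table *hash_table, const void *key,
					 bool exclusive, bool nowait, bool insert, bool *found);

/*
 * insert = false, nowait = true: a plain lookup that gives up instead of
 * sleeping on the partition lock.
 *
 * insert = true, exclusive = false: find-or-insert entered with a shared
 * lock; the exclusive lock is taken only at the moment of entry creation.
 */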
I'm however inclined to think it's better to just have a separate
function for the nowait case, rather than an extended version supporting
both (with an internal helper doing most of the work).
[013]: (Fixed) (related to [012], [014]) After some thought, nowait is no
longer a matter of complexity. In the end I did as described in [012].
+/*
+ * The version of dshash_find_or_insert, which is allowed to return immediately
+ * on lock failure.
+ *
+ * Notes above dshash_find_extended() regarding locking and error handling
+ * equally apply here.

They don't, there's no lock_acquired parameter.

+ */
+void *
+dshash_find_or_insert_extended(dshash_table *hash_table,
+							   const void *key,
+							   bool *found,
+							   bool nowait)

I think it's absurd to have dshash_find, dshash_find_extended,
dshash_find_or_insert, dshash_find_or_insert_extended. If they're
extended they should also be able to specify whether the entry will get
created.
[014]: (Fixed) (related to [012], [013]) As mentioned above, this version has
the original two functions plus one dshash_find_extended().
From d10c1117cec77a474dbb2cff001086d828b79624 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 7 Nov 2018 16:53:49 +0900
Subject: [PATCH 3/5] Make archiver process an auxiliary process

This is a preliminary patch for shared-memory based stats collector.
Archiver process must be a auxiliary process since it uses shared
memory after stats data wes moved onto shared-memory. Make the process
an auxiliary process in order to make it work.

s/wes/was/ s/onto/into/

[015]: (Fixed)
@@ -451,6 +454,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
StartupProcessMain();
proc_exit(1);			/* should never return */

+		case ArchiverProcess:
+			/* don't set signals, archiver has its own agenda */
+			PgArchiverMain();
+			proc_exit(1);		/* should never return */
+
 		case BgWriterProcess:
 			/* don't set signals, bgwriter has its own agenda */
 			BackgroundWriterMain();

I think I'd rather remove the two comments that are copied to 6 out of 8
cases - they don't add anything.
[016]: (Fixed) Agreed. I removed the comments from StartupProcess through
WalReceiverProcess.
pgarch_exit(SIGNAL_ARGS)
{
..
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
...
+	 * being doubly sure.)
+	 */
+	exit(2);
...
This seems to be a copy of code & comments from other signal handlers that predates
..
I think this just should use SignalHandlerForCrashExit().
I think we can even commit that separately - there's not really a reason
to not do that today, as far as I can tell?
[017]: (Fixed, separate patch 0001) Exactly. Although the on_*_exit_list is
empty in this process, SIGQUIT ought to prevent the process from calling the
callbacks even if there were any. This changes the archiver's exit status on
SIGQUIT from 1 to 2, but that doesn't change any behavior (other than the log
message).
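Concretely, the separate patch boils down to something like this (a
sketch, assuming the generic handler from src/backend/postmaster/interrupt.c):

#include "postmaster/interrupt.h"

/*
 * In the archiver's startup code: replace the bespoke pgarch_exit() with
 * the generic crash-exit handler, which exits with status 2 and runs no
 * proc_exit() callbacks.
 */
pqsignal(SIGQUIT, SignalHandlerForCrashExit);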
/* SIGUSR1 signal handler for archiver process */
Hm - this currently doesn't set up a correct sigusr1 handler for a
shared memory backend - needs to invoke procsignal_sigusr1_handler
somewhere.

We can probably just convert to using normal latches here, and remove
the current 'wakened' logic? That'll remove the indirection via
postmaster too, which is nice.
[018]: (Fixed, separate patch 0005) That seems better. I added it as a
separate patch just after the patch that turns the archiver into an auxiliary
process.
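A rough shape of the latch-based loop (a sketch; the wait-event name is
illustrative and error/shutdown handling is omitted):

/* archiver main loop driven by MyLatch instead of the 'wakened' flag */
for (;;)
{
	ResetLatch(MyLatch);

	/* archive whatever is ready */
	pgarch_ArchiverCopyLoop();

	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 PGARCH_AUTOWAKE_INTERVAL * 1000L,
					 WAIT_EVENT_ARCHIVER_MAIN);	/* illustrative name */
}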
@@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
switch (backendType)
 	{
+		case B_ARCHIVER:
+			backendDesc = "archiver";
+			break;

should imo include 'WAL' or such.
[019]: (Not Fixed) It was already named "archiver" by 8e8a0becb3. Should I
rename it in this patch set?
From 5079583c447c3172aa0b4f8c0f0a46f6e1512812 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 21 Feb 2019 12:44:56 +0900
Subject: [PATCH 4/5] Shared-memory based stats collector
..
megabytes. To deal with larger statistics set, this patch let backends
directly share the statistics via shared memory.

This spends a fair bit describing the old state, but very little
describing the new state.
[020]: (Fixed, Maybe) Ugh. I got the same comment in the last round. I rewrote
it this time.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 0bfd6151c4..a6b0bdec12 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
...
- master process. (The <quote>stats collector</quote> process will not be present
- if you have set the system not to start the statistics collector; likewise
+ master process. (The <quote>autovacuum launcher</quote> process will not
...
There's more references to the stats collector than this... E.g. in
catalogs.sgml
[021]: (Fixed, separate patch 0007) Although the "statistics collector
process" is gone, I'm not sure the "statistics collector" feature is gone as
well. Actually the word "collector" looks a bit odd in some contexts. I
replaced "the results of statistics collector" with "the activity
statistics". (I'm not sure "the activity statistics" is proper as a subsystem
name.) The word "collect" is replaced with "track". I didn't change the
section IDs corresponding to the renaming so that old links keep working. I
also fixed the tranche name for LWTRANCHE_STATS from "activity stats" to
"activity_statistics".
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 6d1f28c327..8dcb0fb7f7 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
...
@@ -2747,12 +2747,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
 	if (isshared)
 	{
 		if (PointerIsValid(shared))
-			tabentry = hash_search(shared->tables, &relid,
-								   HASH_FIND, NULL);
+			tabentry = pgstat_fetch_stat_tabentry_extended(shared, relid);
 	}
 	else if (PointerIsValid(dbentry))
-		tabentry = hash_search(dbentry->tables, &relid,
-							   HASH_FIND, NULL);
+		tabentry = pgstat_fetch_stat_tabentry_extended(dbentry, relid);

 	return tabentry;
 }

Why is pgstat_fetch_stat_tabentry_extended called "_extended"? Outside
the stats subsystem there is exactly one caller for the non extended
version, as far as I can see. That's index_concurrently_swap() - and imo
that's code that should live in the stats subsystem, rather than open
coded in index.c.

[022]: (Fixed) The _extended function is not an extended version of the
original function. I renamed pgstat_fetch_stat_tabentry_extended to
pgstat_fetch_stat_tabentry_snapshot. pgstat_fetch_funcentry_extended and
pgstat_fetch_dbentry() are renamed accordingly.
[023]: (Fixed) Agreed. I added a new function pgstat_copy_index_counters(),
and now pgstat_fetch_stat_tabentry() has no call sites outside the pgstat
subsystem.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ca5c6376e5..1ffe073a1f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c

+ * Collects per-table and per-function usage statistics of all backends on
+ * shared memory. pg_count_*() and friends are the interface to locally store
+ * backend activities during a transaction. Then pgstat_flush_stat() is called
+ * at the end of a transaction to pulish the local stats on shared memory.

I'd rather not exhaustively list the different objects this handles -
it'll either be annoying to maintain, or just get out of date.
[024]: (Fixed, Maybe) Although I'm not sure I get you correctly, I rewrote it
as follows.
* Collects per-table and per-function usage statistics of all backends on
* shared memory. The activity numbers are once stored locally, then written
* to shared memory at commit time or by idle-timeout.
- *	  - Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, we update shared stats no more
+ * often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
+ * all the local numbers cannot be flushed immediately, we postpone updates
+ * and try the next chance after the interval of
+ * PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait for no longer than
+ * PGSTAT_STAT_MAX_INTERVAL(1000ms).

I'm not convinced by this backoff logic. The basic interval seems quite
high for something going through shared memory, and the max retry seems
pretty low.
[025]: (Not Fixed) Is it a matter of the intervals? Would (MIN, RETRY, MAX) =
(1000, 500, 10000) be reasonable?
+/*
+ * Operation mode and return code of pgstat_get_db_entry.
+ */
+#define PGSTAT_SHARED		0

This is unreferenced.

+#define PGSTAT_EXCLUSIVE	1
+#define PGSTAT_NOWAIT		2

And these should imo rather be parameters.
[026]: (Fixed) Mmm, right. The two symbols convey just two distinct
parameters; two booleans suffice. But I found some confusion here. As a
result, pgstat_get_db_entry now has three boolean parameters: exclusive,
nowait and create.
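So the resulting shape is roughly the following (per the description
above; the comment text is mine):

static PgStat_StatDBEntry *
pgstat_get_db_entry(Oid databaseid,
					bool exclusive,		/* take the entry lock exclusively? */
					bool nowait,		/* return NULL instead of waiting */
					bool create);		/* create the entry if missing */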
+typedef enum PgStat_TableLookupResult
+{
+	NOT_FOUND,
+	FOUND,
+	LOCK_FAILED
+} PgStat_TableLookupResult;

This seems like a seriously bad idea to me. These are very generic
names. There's also basically no references except setting them to the
first two?
[027]: (Fixed) Considering some related comments above, I decided not to
return the lock status from pgstat_get_db_entry. That makes the enum useless
and the function simpler.
+#define StatsLock (&StatsShmem->StatsMainLock)
-static time_t last_pgstat_start_time;

+/* Shared stats bootstrap information */
+typedef struct StatsShmemStruct
+{
+	LWLock		StatsMainLock;	/* lock to protect this struct */
...
+} StatsShmemStruct;
Why isn't this an lwlock in lwlocknames.h, rather than
allocated here?
[028]: (Fixed) The activity stats system already used a dedicated tranche, so
I thought it natural for the lock to be in the same tranche. That's not a firm
reason, though. Moved the lock into the main tranche.
+/*
+ * BgWriter global statistics counters. The name cntains a remnant from the
+ * time when the stats collector was a dedicate process, which used sockets to
+ * send it.
+ */
+PgStat_MsgBgWriter BgWriterStats = {0};

I am strongly against keeping the 'Msg' prefix. That seems extremely
confusing going forward.
[029]: (Fixed) (Related to [046]) Mmm, it followed your old suggestion to
avoid unsubstantial diffs, but I'm happy to change it. The functions that have
"send" in their names were kept for the same reason. I removed the "m_" prefix
from the members of the struct. (The comment above, with a typo, explains
that.)
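For illustration, the rename amounts to something like this (the new
names are assumptions, not necessarily what the patch uses):

/* before: protocol-message naming left over from the socket era */
PgStat_MsgBgWriter BgWriterStats;
BgWriterStats.m_buf_written_checkpoints++;

/* after (assumed naming): a plain counter struct, no Msg/m_ prefixes */
PgStat_BgWriterStats BgWriterStats;
BgWriterStats.buf_written_checkpoints++;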
+/* common header of snapshot entry in reader snapshot hash */
+typedef struct PgStat_snapshot
+{
+	Oid			key;
+	bool		negative;
+	void	   *body;			/* end of header part: to keep alignment */
+} PgStat_snapshot;

+/* context struct for snapshot_statentry */
+typedef struct pgstat_snapshot_param
+{
+	char	   *hash_name;		/* name of the snapshot hash */
+	int			hash_entsize;	/* element size of hash entry */
+	dshash_table_handle dsh_handle;	/* dsh handle to attach */
+	const dshash_parameters *dsh_params;	/* dshash params */
+	HTAB	  **hash;			/* points to variable to hold hash */
+	dshash_table **dshash;		/* ditto for dshash */
+} pgstat_snapshot_param;

Why does this exist? The struct contents are actually constant across
calls, yet you have declared them inside functions (as static - static
on function scope isn't really the same as global static).
[030]: (Fixed) IIUC, I didn't want it initialized on every call, and it
doesn't need external linkage, so it was a static variable at function scope.
But, first, the name _param is bogus since it actually contains context
variables; second, the "context" variables have since moved elsewhere. I
removed the struct and moved its members into the parameters of
snapshot_statentry.
If we want it, I think we should separate the naming more
meaningfully. The important difference between 'hash' and 'dshash' isn't
the hashing module, it's that one is a local copy, the other a shared
hashtable!
[031]: (Fixed) Definitely. The parameters of snapshot_statentry now have more
meaningful names.
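The call now spells everything out, along these lines (a sketch of the
shape implied by the old struct members; the argument order is
illustrative):

static void *
snapshot_statentry(HTAB **dest,					/* local snapshot hash */
				   const char *hashname,		/* name for the local hash */
				   int hash_entsize,			/* local hash entry size */
				   dshash_table **dshash,		/* attached shared hash */
				   dshash_table_handle dsh_handle,	/* handle to attach */
				   const dshash_parameters *dsh_params,
				   Oid key);					/* object id to look up */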
+/*
+ * Backends store various database-wide info that's waiting to be flushed out
+ * to shared memory in these variables.
+ *
+ * checksum_failures is the exception in that it is cluster-wide value.
+ */
+typedef struct BackendDBStats
+{
+	int			n_conflict_tablespace;
+	int			n_conflict_lock;
+	int			n_conflict_snapshot;
+	int			n_conflict_bufferpin;
+	int			n_conflict_startup_deadlock;
+	int			n_deadlocks;
+	size_t		n_tmpfiles;
+	size_t		tmpfilesize;
+	HTAB	   *checksum_failures;
+} BackendDBStats;

Why is this a separate struct from PgStat_StatDBEntry? We shouldn't have
these fields in multiple places.
[032]: (Fixed, Maybe) (Related to [042]) It is almost a subset of
PgStat_StatDBEntry, with one exception: checksum_failures differs between the
two. Anyway, tracking of conflict events doesn't need to be fast, so they are
now counted on the shared hash entries directly. Checksum failure is handled
in a different way, so only it is left alone.
+ if (StatsShmem->refcount > 0)
+		StatsShmem->refcount++;

What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.
[033]: (Fixed) We don't attach the shared stats in the postmaster process, so
I want to know the first attacher process and the last detacher process of the
shared stats. It's not leaks that I'm considering here.

(continued below)
To me it looks like there's a lot of added complexity just because you
want to be able to reset stats via

void
pgstat_reset_all(void)
{
	/*
	 * We could directly remove files and recreate the shared memory area. But
	 * detach then attach for simplicity.
	 */
	pgstat_detach_shared_stats(false);	/* Don't write */
	pgstat_attach_shared_stats();

Without that you'd not need the complexity of attaching, detaching to
the same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.
Mmm, I don't get that (or I failed to read your meaning clearly). The
function is assumed to be called only from StartupXLOG().
(continued)
Nor would the dynamic re-creation of the db dshash table be needed.
Maybe you are referring to the complexity of reset_dbentry_counters? It
is actually complex. A shared stats dshash cannot be destroyed (nor can a
dshash entry be removed) while someone is working on it. It was simpler
to wait for another process to finish its work, but that could slow down
not only the clearing process but also other processes, through frequent
resetting of counters.
After some thought, I decided to rip all the "generation" stuff out, and
it gets far simpler. But counter reset may conflict with other backends
to a somewhat higher degree, because counter reset needs an exclusive
lock.
+/* ----------
+ * pgstat_report_stat() -
+ *
+ *	Must be called by processes that performs DML: tcop/postgres.c, logical
+ *	receiver processes, SPI worker, etc. to apply the so far collected
+ *	per-table and function usage statistics to the shared statistics hashes.
+ *
+ *	Updates are applied not more frequent than the interval of
+ *	PGSTAT_STAT_MIN_INTERVAL milliseconds. They are also postponed on lock
...

Inconsistent indentation.
[034]: (Fixed)
+long
+pgstat_report_stat(bool force)
 {

+	/* Flush out table stats */
+	if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+		pending_stats = true;
+
+	/* Flush out function stats */
+	if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+		pending_stats = true;

This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
being confusing while reading the code, it also made the diff much
harder to read.
[035]: (Maybe Fixed) Is the question whether there is any case where
pgstat_flush_stat/funcstats leaves some counters unflushed? It skips tables
someone else is working on (or another table in the same dshash partition).
Or is "!force == nowait" the cause of confusion? It is now written as
"nowait = !force". (Or should the parameter of pgstat_report_stat be changed
from "force" to "nowait"?)
-		snprintf(fname, sizeof(fname), "%s/%s", directory,
-				 entry->d_name);
-		unlink(fname);
+	/* Flush out database-wide stats */
+	if (HAVE_PENDING_DBSTATS())
+	{
+		if (!pgstat_flush_dbstats(&cxt, !force))
+			pending_stats = true;
+	}

Linearly checking a number of stats doesn't seem like the right way
going forward. Also seems fairly omission prone.

Why does this code check live in pgstat_report_stat(), rather than
pgstat_flush_dbstats()?
[036]: (Maybe Fixed) (Related to [041]) It was there to avoid useless calls,
but it no longer exists; the code disappeared with [041].
| /* Flush out individual stats tables */
| pending_stats |= pgstat_flush_stat(&cxt, nowait);
| pending_stats |= pgstat_flush_funcstats(&cxt, nowait);
| pending_stats |= pgstat_flush_checksum_failure(cxt.mydbentry, nowait);
/*
* snapshot_statentry() - Common routine for functions
* pgstat_fetch_stat_*entry()
 *

Why has this function been added between the closely linked
pgstat_report_stat() and pgstat_flush_stat() etc?
[037]: It seems to have been left there after some editing. Moved it to just
before the caller functions.
Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.

At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without
control from the outside:
[038]: I don't get the second paragraph. Where does the function re*create* a
snapshot without control from the outside? It keeps snapshots for the duration
of a transaction; if not, it is broken.
(continued)
/*
* We don't want so frequent update of stats snapshot. Keep it at least
* for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
*/
...
I think we should just remove this entire local caching snapshot layer
for lookups.
Currently the behavior is documented as follows, and it seems reasonable:
Another important point is that when a server process is asked to display
any of these statistics, it first fetches the most recent report emitted by
the collector process and then continues to use this snapshot for all
statistical views and functions until the end of its current transaction.
So the statistics will show static information as long as you continue the
current transaction. Similarly, information about the current queries of
all sessions is collected when any such information is first requested
within a transaction, and the same information will be displayed throughout
the transaction.
This is a feature, not a bug, because it allows you to perform several
queries on the statistics and correlate the results without worrying that
the numbers are changing underneath you. But if you want to see new
results with each query, be sure to do the queries outside any transaction
block. Alternatively, you can invoke
<function>pg_stat_clear_snapshot</function>(), which will discard the
current transaction's statistics snapshot (if any). The next use of
statistical information will cause a new snapshot to be fetched.
/*
* pgstat_flush_stat: Flushes table stats out to shared statistics.
 *

Why is this named pgstat_flush_stat, rather than pgstat_flush_tabstats
or such? Given that the code for dealing with an individual table's
entry is named pgstat_flush_tabstat() that's very confusing.
[039]: The names were changed while addressing [041].
static bool
pgstat_flush_stat(pgstat_flush_stat_context *cxt, bool nowait)
...
It seems odd that there's a tabstat specific code in pgstat_flush_stat
(also note singular while it's processing all stats, whereas you're
below treating pgstat_flush_tabstat as only affecting one table).
[039]: The names were changed while addressing [041].
for (i = 0; i < tsa->tsa_used; i++)
{
PgStat_TableStatus *entry = &tsa->tsa_entries[i];
<many TableStatsArray code>
hash_entry->tsa_entry = entry;
dest_elem++;
}

This seems like too much code. Why is this entirely different from the
way funcstats works? The difference was already too big before, but this
made it *way* worse.
[040]: We don't flush stats until transaction end, so isn't the description
about TabStatusArray stale?
* NOTE: once allocated, TabStatusArray structures are never moved or deleted
* for the life of the backend. Also, we zero out the t_id fields of the
* contained PgStat_TableStatus structs whenever they are not actively in use.
* This allows relcache pgstat_info pointers to be treated as long-lived data,
* avoiding repeated searches in pgstat_initstats() when a relation is
* repeatedly opened during a transaction.
(continued below)
One goal of this project, as I understand it, is to make it easier to
add additional stats. As is, this seems to make it harder from the code
level.
Indeed. I removed the TabStatsArray. Having said that it lives a long
life, its life actually lasts at most until transaction end. I used a
dynahash entry as the pgstat_info entry. One tricky part is that I had
to clear entry->t_id after removing the entry so that pgstat_initstats
can detect the removal. It is actually safe, but we could add another
table-id member to the struct for that use.
bool
pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
PgStat_TableStatus *entry)
{
Oid dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
int table_mode = PGSTAT_EXCLUSIVE;
bool updated = false;
dshash_table *tabhash;
PgStat_StatDBEntry *dbent;
int generation;

if (nowait)
	table_mode |= PGSTAT_NOWAIT;

/* Attach required table hash if not yet. */
if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
{
/*
* Return if we don't have corresponding dbentry. It would've been
* removed.
*/
dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
if (!dbent)
return false;

/*
* We don't hold lock on the dbentry since it cannot be dropped while
* we are working on it.
*/
generation = pin_hashes(dbent);
tabhash = attach_table_hash(dbent, generation);

This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.
[040]: Maybe you are insisting the reverse? The pin_hash complexity is left in
this version. -> [033]
/*
* Local table stats should be applied to both dbentry and tabentry at
* once. Update dbentry only if we could update tabentry.
*/
if (pgstat_update_tabentry(tabhash, entry, nowait))
{
pgstat_update_dbentry(dbent, entry);
updated = true;
}

At this point we're very deeply nested. pgstat_report_stat() ->
pgstat_flush_stat() -> pgstat_flush_tabstat() ->
pgstat_update_tabentry().

That's way over the top imo.
[041]: (Fixed) (Related to [036]) Completely agree. It was a result of my
wanting to avoid scanning pgStatTables twice.
(continued)
I don't think it makes much sense that pgstat_update_dbentry() is called
separately for each table. Why would we want to constantly lock that
entry? It seems to be much more sensible to instead have
pgstat_flush_stat() transfer the stats it reported to the pending
database wide counters, and then report that to shared memory *once* per
pgstat_report_stat() with pgstat_flush_dbstats()?
In the attached version it scans PgStat_StatDBEntry twice: once for the
tables of the current database and once for shared tables. That change
simplified the surrounding logic:
pgstat_report_stat()
pgstat_flush_tabstats(<tables of current database>)
pgstat_update_tabentry() (at bottom)
LWLockAcquire(&cxt->mydbentry->lock, LW_EXCLUSIVE);
if (HAVE_PENDING_CONFLICTS())
pgstat_flush_recovery_conflict(cxt->mydbentry);
if (BeDBStats.n_deadlocks != 0)
pgstat_flush_deadlock(cxt->mydbentry);
..
What's the point of having all these sub-functions? I see that you, for
an undocumented reason, have pgstat_report_recovery_conflict() flush
conflict stats immediately:
[042]: Fixed by [032].
dbentry = pgstat_get_db_entry(MyDatabaseId,
PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
&status);

if (status == LOCK_FAILED)
	return;

/* We had a chance to flush immediately */
pgstat_flush_recovery_conflict(dbentry);

dshash_release_lock(pgStatDBHash, dbentry);
But I don't understand why? Nor why we'd not just report all pending
database wide changes in that case?

The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.
[043]: (Maybe fixed) (Related to [045].) Vacuum, analyze, DROP DATABASE and
reset cannot be delayed, so the conditional lock is mainly used by
pgstat_report_stat(). dshash_find_or_insert didn't allow a shared lock, so I
changed dshash_find_extended to allow a shared lock even when it is told to
create a missing entry. Although it takes an exclusive lock at the moment of
entry creation, in most cases it doesn't need the exclusive lock. This allows
the use of a shared lock while processing vacuum or analyze stats.
Previously I thought that we could work on a shared database entry while
no lock is held, but actually there are cases where insertion of a new
database entry causes a rehash (resize). That operation moves entries, so
we need at least a shared lock on the database entry while we are working
on it. So in the attached version most operations work by the following
steps (a sketch follows the list):
- get shared database entry with shared lock
- attach table/function hash
- fetch an entry with exclusive lock
- update entry
- release the table/function entry
- detach table/function hash
if needed:
- take LW_EXCLUSIVE on database entry
- update database numbers
- release LWLock
- release shared database entry
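A sketch of that sequence for one table entry, stitched together from the
function names used in this thread (a rough illustration; argument details
are guesses and error paths are omitted):

PgStat_StatDBEntry *dbent;
PgStat_StatTabEntry *tabent;
dshash_table *tabhash;
bool		found;

/* get shared database entry with shared lock */
dbent = pgstat_get_db_entry(dboid, false, nowait, false);

/* attach table hash */
tabhash = attach_table_hash(dbent);

/* fetch an entry with exclusive lock, update it, release it */
tabent = dshash_find_extended(tabhash, &entry->t_id,
							  true, nowait, true, &found);
/* ... apply the local counters to tabent ... */
dshash_release_lock(tabhash, tabent);
dshash_detach(tabhash);

/* if needed, update database-wide numbers under the entry's LWLock */
LWLockAcquire(&dbent->lock, LW_EXCLUSIVE);
/* ... update database numbers ... */
LWLockRelease(&dbent->lock);

/* release shared database entry */
dshash_release_lock(pgStatDBHash, dbent);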
pgstat_vacuum_stat(void)
{
...
dshash_seq_init(&dshstat, pgStatDBHash, false, true);
while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&dshstat)) != NULL)
pgstat_drop_database(dbid);
..
So, uh, pgstat_drop_database() again does a *separate* lookup in the
dshash, locking the entry. Which only works because you added this dirty
hack:

/* We need to keep partition lock while sequential scan */
if (!hash_table->seqscan_running)
{
hash_table->find_locked = false;
hash_table->find_exclusively_locked = false;
LWLockRelease(PARTITION_LOCK(hash_table, partition));
}

to dshash_delete_entry(). This seems insane to me. There's not even a
comment explaining this?
[044]: Following [001] and [009], I added dshash_delete_current().
pgstat_vacuum_stat() uses it instead of dshash_delete_entry(). The hack is
gone.
(pgstat_vacuum_stat(void))
}
dshash_release_lock(pgStatDBHash, dbentry);

Wait, why are we holding the database partition lock across all this?
Again without any comments explaining why?
[045]: (I'm not sure it is fixed) The lock is a shared lock in the current
version. The database entry is needed only for attaching the table hash, and
now the hashes won't be removed. So, as you perhaps suggested, the lock can be
released earlier in:
pgstat_report_stat()
pgstat_flush_funcstats()
pgstat_vacuum_stat()
pgstat_reset_single_counter()
pgstat_report_vacuum()
pgstat_report_analyze()
The following functions work on the database entry, so the lock needs to be
retained until the end of their work:
pgstat_flush_dbstats()
pgstat_drop_database() /* needs exclusive lock */
pgstat_reset_counters()
pgstat_report_autovac()
pgstat_report_recovery_conflict()
pgstat_report_deadlock()
pgstat_report_tempfile()
pgstat_report_checksum_failures_in_db()
pgstat_flush_checksum_failure() /* repeats short-time lock on each dbs */
+void
+pgstat_send_archiver(const char *xlog, bool failed)

Why do we still have functions named pgstat_send*?
[046]: (Fixed) Same as [029]; I changed it to pgstat_report_archiver().
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
On 2020-03-19 20:30:04 +0900, Kyotaro Horiguchi wrote:
I think we also can get rid of the dshash_delete changes, by instead
adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
or such.

[009] (Fixed)
I'm not sure about the point of having two interfaces that are hard to
distinguish. Maybe dshash_delete_current(dshash_seq_stat *status) is
enough. I also reverted dshash_delete().
Well, dshash_delete() cannot generally safely be used together with
iteration. It has to be the current element etc. And I think the locking
changes make dshash less robust. By explicitly tying "delete the current
element" to the iterator, most of that can be avoided.
/* SIGUSR1 signal handler for archiver process */
Hm - this currently doesn't set up a correct sigusr1 handler for a
shared memory backend - needs to invoke procsignal_sigusr1_handler
somewhere.

We can probably just convert to using normal latches here, and remove
the current 'wakened' logic? That'll remove the indirection via
postmaster too, which is nice.

[018] (Fixed, separate patch 0005)
That seems better. I added it as a separate patch just after the patch
that turns the archiver into an auxiliary process.
I don't think it's correct to do it separately, but I can just merge
that on commit.
@@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
switch (backendType)
 	{
+		case B_ARCHIVER:
+			backendDesc = "archiver";
+			break;

should imo include 'WAL' or such.
[019] (Not Fixed)
It is already named "archiver" by 8e8a0becb3. Do I rename it in this
patch set?
Oh. No, don't rename it as part of this. Could you reply to the thread
in which Peter made that change, and reference this complaint?
[021] (Fixed, separate patch 0007)
However the "statistics collector process" is gone, I'm not sure
"statistics collector" feature also is gone. But actually the word
"collector" looks a bit odd in some context. I replaced "the results
of statistics collector" with "the activity statistics". (I'm not sure
"the activity statistics" is proper as a subsystem name.) The word
"collect" is replaced with "track". I didn't change section IDs
corresponding to the renaming so that old links can work. I also fixed
the tranche name for LWTRANCHE_STATS from "activity stats" to
"activity_statistics"
Without having gone through the changes, that sounds like the correct
direction to me. There's no "collector" anymore, so removing that seems
like the right thing.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ca5c6376e5..1ffe073a1f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
+ * Collects per-table and per-function usage statistics of all backends on
+ * shared memory. pg_count_*() and friends are the interface to locally store
+ * backend activities during a transaction. Then pgstat_flush_stat() is called
+ * at the end of a transaction to publish the local stats on shared memory.

I'd rather not exhaustively list the different objects this handles -
it'll either be annoying to maintain, or just get out of date.

[024] (Fixed, Maybe)
Although not sure I get you correctly, I rewrote it as follows.

* Collects per-table and per-function usage statistics of all backends on
* shared memory. The activity numbers are once stored locally, then written
* to shared memory at commit time or by idle-timeout.
s/backends on/backends in/
I was thinking of something like:
* Collects activity statistics, e.g. per-table access statistics, of
* all backends in shared memory. The activity numbers are first stored
* locally in each process, then flushed to shared memory at commit
* time or by idle-timeout.
- *	- Add some automatic call for pgstat vacuuming.
+ * To avoid congestion on the shared memory, we update shared stats no more
+ * often than intervals of PGSTAT_STAT_MIN_INTERVAL(500ms). In the case where
+ * all the local numbers cannot be flushed immediately, we postpone updates
+ * and try the next chance after the interval of
+ * PGSTAT_STAT_RETRY_INTERVAL(100ms), but we don't wait longer than
+ * PGSTAT_STAT_MAX_INTERVAL(1000ms).

I'm not convinced by this backoff logic. The basic interval seems quite
high for something going through shared memory, and the max retry seems
pretty low.

[025] (Not Fixed)
Is it the matter of intervals? Is (MIN, RETRY, MAX) = (1000, 500,
10000) reasonable?
Partially. I think for access to shared resources we want *increasing*
wait times, rather than shorter retry timeouts. The goal should be to
make it more likely for all processes to be able to flush their
stats, which can be achieved by flushing less often after hitting
contention.
+/*
+ * BgWriter global statistics counters. The name cntains a remnant from the
+ * time when the stats collector was a dedicate process, which used sockets to
+ * send it.
+ */
+PgStat_MsgBgWriter BgWriterStats = {0};

I am strongly against keeping the 'Msg' prefix. That seems extremely
confusing going forward.

[029] (Fixed) (Related to [046])
Mmm. It's following your old suggestion to avoid unsubstantial
diffs. I'm happy to change it. The functions that have "send" in their
names are there for the same reason. I removed the prefix "m_" from the
members of the struct. (The comment above (with a typo) explains that.)
I don't object to having the rename be a separate patch...
+	if (StatsShmem->refcount > 0)
+		StatsShmem->refcount++;

What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.

[033] (Fixed)
We don't attach shared stats in the postmaster process, so I want to know
the first attacher process and the last detacher process of the shared
stats. It's not leaks that I'm considering here.
(continued below)

To me it looks like there's a lot of added complexity just because you
want to be able to reset stats via

void
pgstat_reset_all(void)
{
	/*
	 * We could directly remove files and recreate the shared memory area. But
	 * detach then attach for simplicity.
	 */
	pgstat_detach_shared_stats(false);	/* Don't write */
	pgstat_attach_shared_stats();

Without that you'd not need the complexity of attaching, detaching to
the same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.

Mmm. I don't get that (or I failed to read the clear meaning). The
function is assumed to be called only from StartupXLOG().
(continued)
Oh? I didn't get that you're only using it for that purpose - there's
very little documentation about what it's trying to do.
I don't see why that means we don't need to accurately track the
refcount? Otherwise we'll forget to write out the stats.
Nor would the dynamic re-creation of the db dshash table be needed.
Maybe you are mentioning the complexity of reset_dbentry_counters? It
is actually complex. The shared stats dshash cannot be destroyed (and a
dshash entry cannot be removed) while someone is working on it. It
was simpler to wait for another process to end its work, but that could
slow not only the clearing process but also other processes through
frequent resetting of counters.
I was referring to the fact that the last version of the patch
attached/detached from hashtables regularly. pin_hashes, unpin_hashes,
attach_table_hash, attach_function_hash etc.
After some thought, I decided to rip all the "generation" stuff out,
and it gets far simpler. But counter reset may conflict with other
backends to a little higher degree, because counter reset needs an
exclusive lock.
That seems harmless to me - stats reset should never happen at a high
enough frequency to make the contention it causes problematic. There's also
an argument to be made that it makes sense for the reset to be atomic.
+	/* Flush out table stats */
+	if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+		pending_stats = true;
+
+	/* Flush out function stats */
+	if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+		pending_stats = true;

This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
being confusing while reading the code, it also made the diff much
harder to read.

[035] (Maybe Fixed)
Is the question that, is there any case where
pgstat_flush_stat/functions leaves some counters unflushed?
No, the point is that there's knowledge about
pgstat_flush_stat/pgstat_flush_funcstats outside of those functions,
namely the pgStatTabList, pgStatFunctions lists.
Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.

At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without
control from the outside:

[038]
I don't get the second paragraph. When does the function re*create* a
snapshot without control from the outside? It keeps snapshots during a
transaction. If not, it is broken.
(continued)
Maybe I just misunderstood the code flow - partially due to the global
clear_snapshot variable. I just had read the
+ * We don't want so frequent update of stats snapshot. Keep it at least
+ * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
comment, and took it to mean that you're unconditionally updating the
snapshot every PGSTAT_STAT_MIN_INTERVAL. Which'd mean we don't actually
have consistent snapshot across all fetches.
(partially this might have been due to the diff:
/*
- * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
- * msec since we last sent one, or the caller wants to force stats out.
+ * We don't want so frequent update of stats snapshot. Keep it at least
+ * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
*/
- now = GetCurrentTransactionStopTimestamp();
- if (!force &&
- !TimestampDifferenceExceeds(last_report, now, PGSTAT_STAT_INTERVAL))
- return;
- last_report = now;
+ if (clear_snapshot)
+ {
+ clear_snapshot = false;
+
+ if (pgStatSnapshotContext &&
)
But I think my question remains: Why do we need the whole snapshot thing
now? Previously we needed to avoid reading a potentially large file -
but that's not a concern anymore?
/*
* We don't want so frequent update of stats snapshot. Keep it at least
* for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
 */

...

I think we should just remove this entire local caching snapshot layer
for lookups.

Currently the behavior is documented as follows and it seems reasonable.
Another important point is that when a server process is asked to display
any of these statistics, it first fetches the most recent report emitted by
the collector process and then continues to use this snapshot for all
statistical views and functions until the end of its current transaction.
So the statistics will show static information as long as you continue the
current transaction. Similarly, information about the current queries of
all sessions is collected when any such information is first requested
within a transaction, and the same information will be displayed throughout
the transaction.
This is a feature, not a bug, because it allows you to perform several
queries on the statistics and correlate the results without worrying that
the numbers are changing underneath you. But if you want to see new
results with each query, be sure to do the queries outside any transaction
block. Alternatively, you can invoke
<function>pg_stat_clear_snapshot</function>(), which will discard the
current transaction's statistics snapshot (if any). The next use of
statistical information will cause a new snapshot to be fetched.
I am very unconvinced this is worth the cost. Especially because plenty
of other stats related parts of the system do *NOT* behave this way. How
is a user supposed to understand that pg_stat_database behaves one way,
pg_stat_activity, another, pg_stat_statements a third,
pg_stat_progress_* ...
Perhaps it's best to not touch the semantics here, but I'm also very
wary of introducing significant complications and overhead just to have
this "feature".
for (i = 0; i < tsa->tsa_used; i++)
{
	PgStat_TableStatus *entry = &tsa->tsa_entries[i];

<many TableStatsArray code>

	hash_entry->tsa_entry = entry;
	dest_elem++;
}

This seems like too much code. Why is this entirely different from the
way funcstats works? The difference was already too big before, but this
made it *way* worse.

[040]
We don't flush stats until the transaction ends. So is the description
about TabStatusArray stale?
How is your comment related to my comment above?
bool
pgstat_flush_tabstat(pgstat_flush_stat_context *cxt, bool nowait,
					 PgStat_TableStatus *entry)
{
	Oid			dboid = entry->t_shared ? InvalidOid : MyDatabaseId;
	int			table_mode = PGSTAT_EXCLUSIVE;
	bool		updated = false;
	dshash_table *tabhash;
	PgStat_StatDBEntry *dbent;
	int			generation;

	if (nowait)
		table_mode |= PGSTAT_NOWAIT;

	/* Attach required table hash if not yet. */
	if ((entry->t_shared ? cxt->shdb_tabhash : cxt->mydb_tabhash) == NULL)
	{
		/*
		 * Return if we don't have corresponding dbentry. It would've been
		 * removed.
		 */
		dbent = pgstat_get_db_entry(dboid, table_mode, NULL);
		if (!dbent)
			return false;

		/*
		 * We don't hold lock on the dbentry since it cannot be dropped while
		 * we are working on it.
		 */
		generation = pin_hashes(dbent);
		tabhash = attach_table_hash(dbent, generation);

This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.

[040]
Maybe you are insisting the reverse? The pin_hash complexity is left
in this version. -> [033]
What do you mean? What I'm saying is that we should never end up in a
situation where there's no pgstat entry for the current database. And
that that's trivial, as long as we don't drop the hashtable, but instead
reset counters to 0.
	dbentry = pgstat_get_db_entry(MyDatabaseId,
								  PGSTAT_EXCLUSIVE | PGSTAT_NOWAIT,
								  &status);

	if (status == LOCK_FAILED)
		return;

	/* We had a chance to flush immediately */
	pgstat_flush_recovery_conflict(dbentry);

	dshash_release_lock(pgStatDBHash, dbentry);

But I don't understand why? Nor why we'd not just report all pending
database wide changes in that case?

The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.

[043] (Maybe fixed) (Related to [045].)
Vacuum, analyze, DROP DB and reset cannot be delayed. So the
conditional lock is mainly used by pgstat_report_stat().
You're saying "cannot be delayed" - but you're not explaining *why* that
is.
Even if true, I don't see why that necessitates doing the flushing and
locking once for each of these functions?
dshash_find_or_insert didn't allow a shared lock. I changed
dshash_find_extended to allow a shared lock even if it is told to create
a missing entry. Although it takes an exclusive lock at the moment of
entry creation, in most cases it doesn't need the exclusive lock. This
allows using a shared lock while processing vacuum or analyze stats.
Huh?
Previously I thought that we could work on a shared database entry while
the lock is not held, but actually there are cases where insertion of a
new database entry causes a rehash (resize). The operation moves entries,
so we need at least a shared lock on the database entry while we are
working on it. So in the attached, basically most operations work by the
following steps:
- get shared database entry with shared lock
- attach table/function hash
- fetch an entry with exclusive lock
- update entry
- release the table/function entry
- detach table/function hash
if needed
- take LW_EXCLUSIVE on database entry
- update database numbers
- release LWLock
- release shared database entry
Just to be crystal clear: I am exceedingly unlikely to commit this with
any sort of short-term attach/detach operations. Both because the
runtime overhead/contention it causes is significant, and because of the
code complexity implied by it.
Leaving attach/detach aside: I think it's a complete no-go to acquire
database wide locks at this frequency, and then to hold them over other
operations that are a) not cheap b) can block. The contention due to
that would be *terrible* for scalability, even if it's just a shared
lock.
The way this *should* work is that:
1.1) At backend startup, attach to the database wide hashtable
1.2) At backend startup, attach to the various per-database hashtables
(including ones for shared tables)
2.1) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-table stats
2.2) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-function stats
2.3) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-database stats
2.4) When flushing stats that need to be flushed (e.g. vacuum): havestats && lock && flush
3.1) When shutting down backend, detach from all hashtables
That way we never need to hold onto the database-wide hashtables for
long, and we can do it with conditional locks (trylock above), unless we
need to force flushing.
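As a rough illustration of that flow (helper names such as
flush_table_stats() and the have_*_stats flags are placeholders, not the
patch's actual functions):

	/* 1.x) at backend startup: attach once and cache the handle */
	pgStatSharedHash = dshash_attach(pgStatDSA, &dsh_params,
									 StatsShmem->hash_handle, NULL);

	/* 2.x) at eoxact or idle timeout: conditional-lock flush */
	static void
	flush_pending_stats(bool force)
	{
		bool	nowait = !force;	/* trylock unless we must flush */

		if (have_table_stats)
			have_table_stats = !flush_table_stats(nowait);
		if (have_function_stats)
			have_function_stats = !flush_function_stats(nowait);
		if (have_database_stats)
			have_database_stats = !flush_database_stats(nowait);
	}

	/* 3.1) at backend shutdown: dshash_detach(pgStatSharedHash); */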
It might be worthwhile to merge per-table, per-function, per-database
hashes into a single hash. Where the key is either something like
{hashkind, objoid} (referenced from a per-database hashtable), or even
{hashkind, dboid, objoid} (one global hashtable).
I think the contents of the hashtable should likely just be a single
dsa_pointer (plus some bookkeeping). Several reasons for that:
1) Since one goal of this is to make the stats system more extensible,
it seems important that we can make the set of stats kept
runtime configurable. Otherwise everyone will continue to have to pay
the price for every potential stat that we have an option to track.
2) Having hashtable resizes move fairly large stat entries around is
expensive. Whereas just moving key + dsa_pointer around is pretty
cheap. I don't think the cost of a pointer dereference matters in
*this* case.
3) If the stats contents aren't moved around, there's no need to worry
about hashtable resizes. Therefore the stats can be referenced
without holding dshash partition locks.
4) If the stats entries aren't moved around by hashtable resizes, we can
use atomics, lwlocks, spinlocks etc as part of the stats entry. It's
not generally correct/safe to have dshash resize to move those
around.
All of that would be addressed if we instead allocate the stats data
separately from the dshash entry.
Greetings,
Andres Freund
Hi,
On 2020-03-19 16:51:59 +1300, Thomas Munro wrote:
On Fri, Mar 13, 2020 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
Thomas, could you look at the first two patches here, and my review
questions?Ack.
Thanks!
dsa_pointer item_pointer = hash_table->buckets[i];
@@ -549,9 +560,14 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
 								LW_EXCLUSIVE));
 
 	delete_item(hash_table, item);
-	hash_table->find_locked = false;
-	hash_table->find_exclusively_locked = false;
-	LWLockRelease(PARTITION_LOCK(hash_table, partition));
+
+	/* We need to keep partition lock while sequential scan */
+	if (!hash_table->seqscan_running)
+	{
+		hash_table->find_locked = false;
+		hash_table->find_exclusively_locked = false;
+		LWLockRelease(PARTITION_LOCK(hash_table, partition));
+	}
 }

This seems like a failure prone API.
If I understand correctly, the only purpose of the seqscan_running
variable is to control that behaviour ^^^. That is, to make
dshash_delete_entry() keep the partition lock if you delete an entry
while doing a seq scan. Why not get rid of that, and provide a
separate interface for deleting while scanning?
dshash_seq_delete(dshash_seq_status *scan, void *entry). I suppose it
would be most common to want to delete the "current" item in the seq
scan, but it could allow you to delete anything in the same partition,
or any entry if using the "consistent" mode. Oh, I see that Andres
said the same thing later.
[Andres complaining about comments and language stuff]
I would be happy to proof read and maybe extend the comments (writing
new comments will also help me understand and review the code!), and
maybe some code changes to move this forward. Horiguchi-san, are you
working on another version now? If so I'll wait for it before I do
that.
Cool! Being ESL myself and mildly dyslexic to boot, that'd be
helpful. But I'd hold off for a moment, because I think there'll need to
be some open heart surgery on this patch (see bottom of my last email in
this thread, for minutes ago (don't yet have a message id, sorry)).
The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.

Right, I also noticed that:
/*
* Local table stats should be applied to both dbentry and tabentry at
* once. Update dbentry only if we could update tabentry.
*/
if (pgstat_update_tabentry(tabhash, entry, nowait))
{
pgstat_update_dbentry(dbent, entry);
updated = true;
	}

So pgstat_update_tabentry() goes to great trouble to take locks
conditionally, but then pgstat_update_dbentry() immediately does:

	LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
	LWLockRelease(&dbentry->lock);

Why can't we be "lazy" with the dbentry stats too? Is it really
important for the table stats and DB stats to agree with each other?
We *need* to be lazy here, I think.
Hmm. Even if you change the above code use a conditional lock, I am
wondering (admittedly entirely without data) if this approach is still
too clunky: even trying and failing to acquire the lock creates
contention, just a bit less. I wonder if it would make sense to make
readers do more work, so that writers can avoid contention. For
example, maybe PgStat_StatDBEntry could hold an array of N sets of
counters, and readers have to add them all up. An advanced version of
this idea would use a reasonably fresh copy of something like
sched_getcpu() and numa_node_of_cpu() to select a partition to
minimise contention and cross-node traffic, with a portable fallback
based on PID or something. CPU core/node awareness is something I
haven't looked into too seriously, but it's been on my mind to solve
some other problems.
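A hedged sketch of that striped-counter idea (all names here are
invented for illustration): writers bump only "their" slot with an
atomic add, and readers sum all slots.

	#define PGSTAT_DB_NSLOTS 16		/* assumed small power of two */

	typedef struct PgStat_StatDBCounters
	{
		pg_atomic_uint64 n_tuples_returned;
		/* ... the other per-database counters ... */
	} PgStat_StatDBCounters;

	typedef struct PgStat_StatDBEntry
	{
		PgStat_StatDBCounters slots[PGSTAT_DB_NSLOTS];
	} PgStat_StatDBEntry;

	/* writer: no lock needed, just an atomic add on our slot */
	static inline void
	db_count_tuples_returned(PgStat_StatDBEntry *db, uint64 n)
	{
		int		slot = MyProcPid % PGSTAT_DB_NSLOTS;

		pg_atomic_fetch_add_u64(&db->slots[slot].n_tuples_returned, n);
	}

	/* reader: sum all slots to get the current value */
	static uint64
	db_read_tuples_returned(PgStat_StatDBEntry *db)
	{
		uint64	sum = 0;

		for (int i = 0; i < PGSTAT_DB_NSLOTS; i++)
			sum += pg_atomic_read_u64(&db->slots[i].n_tuples_returned);
		return sum;
	}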
I don't think we really need that for the per-object stats. The easier
way to address that is to instead reduce the rate of flushing to the
shared table. There's not really a problem with the shared state of the
stats lagging by a few hundred ms or so.
The amount of code complexity a scheme like you describe adds doesn't seem
worth it to me without very clear evidence it's needed. If we didn't need
to handle the case where the "static" slots are insufficient to handle
all the stats, it'd be different. But given the number of tables etc.
that can exist in systems, I don't think that's achievable.
I think we should go for per-backend counters for other parts of the
system though. I think it should basically be the default for cluster
wide stats like IO (even if we additionally flush it to per table
stats). Currently we have more complicated schemes for those. But that's
imo a separate patch.
Thanks!
Andres
Thank you for looking this.
At Thu, 19 Mar 2020 16:51:59 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
This seems like a failure prone API.
If I understand correctly, the only purpose of the seqscan_running
variable is to control that behaviour ^^^. That is, to make
dshash_delete_entry() keep the partition lock if you delete an entry
while doing a seq scan. Why not get rid of that, and provide a
separate interface for deleting while scanning?
dshash_seq_delete(dshash_seq_status *scan, void *entry). I suppose it
would be most common to want to delete the "current" item in the seq
scan, but it could allow you to delete anything in the same partition,
or any entry if using the "consistent" mode. Oh, I see that Andres
said the same thing later.
The attached v25 in [1] is the new version.
Why does this patch add the consistent mode? There are no users currently?
Without it, it's not clear that we need a separate _term function, I think?

+1, let's not do that if we don't need it!
Yes, it is removed.
The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.

Right, I also noticed that:
I think I fixed all cases, except drop and the like, which need an
exclusive lock.
So pgstat_update_tabentry() goes to great trouble to take locks
conditionally, but then pgstat_update_dbentry() immediately does:

	LWLockAcquire(&dbentry->lock, LW_EXCLUSIVE);
dbentry->n_tuples_returned += stat->t_counts.t_tuples_returned;
dbentry->n_tuples_fetched += stat->t_counts.t_tuples_fetched;
dbentry->n_tuples_inserted += stat->t_counts.t_tuples_inserted;
dbentry->n_tuples_updated += stat->t_counts.t_tuples_updated;
dbentry->n_tuples_deleted += stat->t_counts.t_tuples_deleted;
dbentry->n_blocks_fetched += stat->t_counts.t_blocks_fetched;
dbentry->n_blocks_hit += stat->t_counts.t_blocks_hit;
	LWLockRelease(&dbentry->lock);

Why can't we be "lazy" with the dbentry stats too? Is it really
important for the table stats and DB stats to agree with each other?
Even if it were, your current coding doesn't achieve that: the table
stats are updated before the DB stat under different locks, so I'm not
sure why it can't wait longer.
It is done the lazy way now.
Hmm. Even if you change the above code use a conditional lock, I am
wondering (admittedly entirely without data) if this approach is still
too clunky: even trying and failing to acquire the lock creates
contention, just a bit less. I wonder if it would make sense to make
readers do more work, so that writers can avoid contention. For
example, maybe PgStat_StatDBEntry could hold an array of N sets of
counters, and readers have to add them all up. An advanced version of
I thought about that kind of solution, but it needs more memory,
multiplied by the number of backends. If the contention is not
negligible, we can go back to a stats collector process connected via
sockets that then shares the result in shared memory. The original
motive was the file I/O on reading stats in backends.
this idea would use a reasonably fresh copy of something like
sched_getcpu() and numa_node_of_cpu() to select a partition to
minimise contention and cross-node traffic, with a portable fallback
based on PID or something. CPU core/node awareness is something I
haven't looked into too seriously, but it's been on my mind to solve
some other problems.
I have been asked about CPU core/node awareness several times. There
might be a certain degree of need for it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Thu, 19 Mar 2020 12:54:10 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2020-03-19 20:30:04 +0900, Kyotaro Horiguchi wrote:
I think we also can get rid of the dshash_delete changes, by instead
adding a dshash_delete_current(dshash_seq_stat *status, void *entry) API
or such.

[009] (Fixed)
I'm not sure about the point of having two interfaces that are hard to
distinguish. Maybe dshash_delete_current(dshash_seq_stat *status) is
enough. I also reverted the dshash_delete().

Well, dshash_delete() cannot generally safely be used together with
iteration. It has to be the current element etc. And I think the locking
changes make dshash less robust. By explicitly tying "delete the current
element" to the iterator, most of that can be avoided.
Sure. By the way, I forgot to remove the seqscan_running stuff. Removed.
/* SIGUSR1 signal handler for archiver process */

Hm - this currently doesn't set up a correct sigusr1 handler for a
shared memory backend - needs to invoke procsignal_sigusr1_handler
somewhere.

We can probably just convert to using normal latches here, and remove
the current 'wakened' logic? That'll remove the indirection via
postmaster too, which is nice.

[018] (Fixed, separate patch 0005)
It seems better. I added it as a separate patch just after the patch
that turns archiver into an auxiliary process.

I don't think it's correct to do it separately, but I can just merge
that on commit.

Yes, it's just for the convenience of reviewing. Merged.
@@ -4273,6 +4276,9 @@ pgstat_get_backend_desc(BackendType backendType)
 	switch (backendType)
 	{
+		case B_ARCHIVER:
+			backendDesc = "archiver";
+			break;

should imo include 'WAL' or such.

[019] (Not Fixed)
It is already named "archiver" by 8e8a0becb3. Do I rename it in this
patch set?

Oh. No, don't rename it as part of this. Could you reply to the thread
in which Peter made that change, and reference this complaint?
I sent a mail like that.
/messages/by-id/20200327.163007.128069746774242774.horikyota.ntt@gmail.com
[021] (Fixed, separate patch 0007)
Even though the "statistics collector process" is gone, I'm not sure
the "statistics collector" feature is also gone. But actually the word
"collector" looks a bit odd in some contexts. I replaced "the results
of the statistics collector" with "the activity statistics". (I'm not sure
"the activity statistics" is proper as a subsystem name.) The word
"collect" is replaced with "track". I didn't change the section IDs
corresponding to the renaming so that old links keep working. I also fixed
the tranche name for LWTRANCHE_STATS from "activity stats" to
"activity_statistics".

Without having gone through the changes, that sounds like the correct
direction to me. There's no "collector" anymore, so removing that seems
like the right thing.
Thanks.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ca5c6376e5..1ffe073a1f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
...
[024] (Fixed, Maybe)
Although not sure I get you correctly, I rewrote it as follows.
..
I was thinking of something like:
* Collects activity statistics, e.g. per-table access statistics, of
* all backends in shared memory. The activity numbers are first stored
* locally in each process, then flushed to shared memory at commit
* time or by idle-timeout.
Looks fine. Replaced it with the above.
[025] (Not Fixed)
Is it the matter of intervals? Is (MIN, RETRY, MAX) = (1000, 500,
10000) reasonable?

Partially. I think for access to shared resources we want *increasing*
wait times, rather than shorter retry timeouts. The goal should be to
make it more likely for all processes to be able to flush their
stats, which can be achieved by flushing less often after hitting
contention.
Ah! Indeed. The attached works the following way.
 * To avoid congestion on the shared memory, shared stats are updated no more
 * often than once per PGSTAT_MIN_INTERVAL (1000ms). If some local numbers
 * remain unflushed due to lock failure, we retry with intervals that are
 * initially PGSTAT_RETRY_MIN_INTERVAL (250ms) and then doubled at every
 * retry. Finally we force an update after PGSTAT_MAX_INTERVAL (10000ms)
 * from the first trial.
Concretely the interval changes as:
	elapsed		interval
	-----------+---------------
	0ms			(1000ms)
	1000ms		250ms
	1250ms		500ms
	1750ms		1000ms
	2750ms		2000ms
	4750ms		5250ms (not 4000ms)
	10000ms		(forced flush)
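A minimal sketch of that schedule (the exact clamp rule - jumping
straight to PGSTAT_MAX_INTERVAL when one more doubling would no longer
fit - is inferred from the table above and is an assumption):

	static long
	next_flush_interval(long elapsed, long last_interval, bool *force)
	{
		long	interval;

		if (last_interval == 0)
			interval = PGSTAT_RETRY_MIN_INTERVAL;	/* first lock failure */
		else
			interval = last_interval * 2;			/* double per retry */

		/* if even one more doubling could not fit, jump to the deadline */
		if (elapsed + interval * 2 > PGSTAT_MAX_INTERVAL)
		{
			interval = PGSTAT_MAX_INTERVAL - elapsed;
			*force = true;	/* the next attempt flushes unconditionally */
		}
		return interval;
	}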
On the way to fixing it, I fixed several silly bugs:
- pgstat_report_stat accessed dbent even if it is NULL.
- pgstat_flush_tabstats set have_(sh|my)database_stats wrongly.
[029] (Fixed) (Related to [046])
Mmm. It's following your old suggestion to avoid unsubstantial
diffs. I'm happy to change it. The functions that have "send" in their
names are there for the same reason. I removed the prefix "m_" from the
members of the struct. (The comment above (with a typo) explains that.)

I don't object to having the rename be a separate patch...

Nope. I don't want to make it a separate patch.
+	if (StatsShmem->refcount > 0)
+		StatsShmem->refcount++;

What prevents us from leaking the refcount here? We could e.g. error out
while attaching, no? Which'd mean we'd leak the refcount.

[033] (Fixed)
We don't attach shared stats in the postmaster process, so I want to know
the first attacher process and the last detacher process of the shared
stats. It's not leaks that I'm considering here.
(continued below)

To me it looks like there's a lot of added complexity just because you
want to be able to reset stats via
...

Without that you'd not need the complexity of attaching, detaching to
the same degree - every backend could just cache lookup data during
initialization, instead of having to constantly re-compute that.

Mmm. I don't get that (or I failed to read the clear meaning). The
function is assumed to be called only from StartupXLOG().
(continued)

Oh? I didn't get that you're only using it for that purpose - there's
very little documentation about what it's trying to do.
Ugg..
I don't see why that means we don't need to accurately track the
refcount? Otherwise we'll forget to write out the stats.
Exactly, and I added comments for that.
| * refcount is used to know whether a process going to detach shared stats is
| * the last process or not. The last process writes out the stats files.
| */
| typedef struct StatsShmemStruct
| if (--StatsShmem->refcount < 1)
| {
| /*
| * The process is the last one that is attaching the shared stats
| * memory. Write out the stats files if requested.
Nor would the dynamic re-creation of the db dshash table be needed.
..
I was referring to the fact that the last version of the patch
attached/detached from hashtables regularly. pin_hashes, unpin_hashes,
attach_table_hash, attach_function_hash etc.
pin/unpin is gone. Now there is only one dshash and it is attached for
the lifetime of the process.
After some thought, I decided to rip all the "generation" stuff out,
and it gets far simpler. But counter reset may conflict with other
backends to a little higher degree, because counter reset needs an
exclusive lock.

That seems harmless to me - stats reset should never happen at a high
enough frequency to make the contention it causes problematic. There's also
an argument to be made that it makes sense for the reset to be atomic.
Agreed.
+	/* Flush out table stats */
+	if (pgStatTabList != NULL && !pgstat_flush_stat(&cxt, !force))
+		pending_stats = true;
+
+	/* Flush out function stats */
+	if (pgStatFunctions != NULL && !pgstat_flush_funcstats(&cxt, !force))
+		pending_stats = true;

This seems weird. pgstat_flush_stat(), pgstat_flush_funcstats() operate
on pgStatTabList/pgStatFunctions, but don't actually reset it? Besides
being confusing while reading the code, it also made the diff much
harder to read.

[035] (Maybe Fixed)
Is the question that, is there any case where
pgstat_flush_stat/functions leaves some counters unflushed?

No, the point is that there's knowledge about
pgstat_flush_stat/pgstat_flush_funcstats outside of those functions,
namely the pgStatTabList, pgStatFunctions lists.
Mmm. Anyway the stuff has been largely changed in this version.
Why do we still have this? A hashtable lookup is cheap, compared to
fetching a file - so it's not to save time. Given how infrequent the
pgstat_fetch_* calls are, it's not to avoid contention either.

At first one could think it's for consistency - but no, that's not it
either, because snapshot_statentry() refetches the snapshot without
control from the outside:

[038]
I don't get the second paragraph. When does the function re*create* a
snapshot without control from the outside? It keeps snapshots during a
transaction. If not, it is broken.
(continued)

Maybe I just misunderstood the code flow - partially due to the global
clear_snapshot variable. I just had read the

+	 * We don't want so frequent update of stats snapshot. Keep it at least
+	 * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.

comment, and took it to mean that you're unconditionally updating the
snapshot every PGSTAT_STAT_MIN_INTERVAL. Which'd mean we don't actually
have consistent snapshots across all fetches.

(partially this might have been due to the diff:

	/*
-	 * Don't send a message unless it's been at least PGSTAT_STAT_INTERVAL
-	 * msec since we last sent one, or the caller wants to force stats out.
+	 * We don't want so frequent update of stats snapshot. Keep it at least
+	 * for PGSTAT_STAT_MIN_INTERVAL ms. Not postpone but just ignore the cue.
	 */
)
Wow.. I tried "git config --global diff.algorithm patience" and it
seems to work well.
But I think my question remains: Why do we need the whole snapshot thing
now? Previously we needed to avoid reading a potentially large file -
but that's not a concern anymore?
...
Currently the behavior is documented as follows and it seems reasonable.
...
I am very unconvinced this is worth the cost. Especially because plenty
of other stats related parts of the system do *NOT* behave this way. How
is a user supposed to understand that pg_stat_database behaves one way,
pg_stat_activity another, pg_stat_statements a third,
pg_stat_progress_* ...

Perhaps it's best to not touch the semantics here, but I'm also very
wary of introducing significant complications and overhead just to have
this "feature".
As a compromise, I removed the "clear_snapshot" stuff. Snapshots still
work, but now clear_snapshot() immediately clears them. It works the
same way as pg_stat_activity.
for (i = 0; i < tsa->tsa_used; i++)
{
	PgStat_TableStatus *entry = &tsa->tsa_entries[i];

<many TableStatsArray code>

	hash_entry->tsa_entry = entry;
	dest_elem++;
}

This seems like too much code. Why is this entirely different from the
way funcstats works? The difference was already too big before, but this
made it *way* worse.

[040]
We don't flush stats until the transaction ends. So is the description
about TabStatusArray stale?

How is your comment related to my comment above?

Hmm. It looks like it was truncated. The TableStatsArray is removed; all
kinds of local stats (except global stats) are now stored directly in
pgStatLocalHashEntry. The code gets far simpler.
	generation = pin_hashes(dbent);
	tabhash = attach_table_hash(dbent, generation);

This again is just cost incurred by insisting on destroying hashtables
instead of keeping them around as long as necessary.

[040]
Maybe you are insisting the reverse? The pin_hash complexity is left
in this version. -> [033]

What do you mean? What I'm saying is that we should never end up in a
situation where there's no pgstat entry for the current database. And
that that's trivial, as long as we don't drop the hashtable, but instead
reset counters to 0.

In a previous version that was not sent to the ML, attach/detach happened
only at process start/end. But in this version the table/function
dshashes are gone.
The fact that you're locking the per-database entry unconditionally once
for each table almost guarantees contention - and you're not using the
'conditional lock' approach for that. I don't understand.

[043] (Maybe fixed) (Related to [045].)
Vacuum, analyze, DROP DB and reset cannot be delayed. So the
conditional lock is mainly used by pgstat_report_stat().

You're saying "cannot be delayed" - but you're not explaining *why* that
is.

Even if true, I don't see why that necessitates doing the flushing and
locking once for each of these functions?
Sorry, that was wrong. We can just skip the removal on lock failure
during pgstat_vacuum_stat(). It will be retried the next time. The other
database stats - deadlocks, checksum failures, tmpfiles and conflicts -
are now collected locally, then flushed.
dshash_find_or_insert didn't allow a shared lock. I changed
dshash_find_extended to allow a shared lock even if it is told to create
a missing entry. Although it takes an exclusive lock at the moment of
entry creation, in most cases it doesn't need the exclusive lock. This
allows using a shared lock while processing vacuum or analyze stats.

Huh?

Well, anyway, the shared-insert mode of dshash_find_extended is no
longer needed, so I removed the mode in this version.
Just to be crystal clear: I am exceedingly unlikely to commit this with
any sort of short-term attach/detach operations. Both because the
runtime overhead/contention it causes is significant, and because of the
code complexity implied by it.

I think it is addressed in this version.
Leaving attach/detach aside: I think it's a complete no-go to acquire
database wide locks at this frequency, and then to hold them over other
operations that are a) not cheap b) can block. The contention due to
that would be *terrible* for scalability, even if it's just a shared
lock.
The way this *should* work is that:
1.1) At backend startup, attach to the database wide hashtable
1.2) At backend startup, attach to the various per-database hashtables
(including ones for shared tables)
2.1) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-table stats
2.2) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-function stats
2.3) When flushing stats (e.g. at eoxact): havestats && trylock && flush per-database stats
2.4) When flushing stats that need to be flushed (e.g. vacuum): havestats && lock && flush
3.1) When shutting down backend, detach from all hashtables

That way we never need to hold onto the database-wide hashtables for
long, and we can do it with conditional locks (trylock above), unless we
need to force flushing.

I think the attached works in a similar way. Table/function stats are
processed together, then the database stats are processed.
It might be worthwhile to merge per-table, per-function, per-database
hashes into a single hash. Where the key is either something like
{hashkind, objoid} (referenced from a per-database hashtable), or even
{hashkind, dboid, objoid} (one global hashtable).

I think the contents of the hashtable should likely just be a single
dsa_pointer (plus some bookkeeping). Several reasons for that:

1) Since one goal of this is to make the stats system more extensible,
it seems important that we can make the set of stats kept
runtime configurable. Otherwise everyone will continue to have to pay
the price for every potential stat that we have an option to track.

2) Having hashtable resizes move fairly large stat entries around is
expensive. Whereas just moving key + dsa_pointer around is pretty
cheap. I don't think the cost of a pointer dereference matters in
*this* case.

3) If the stats contents aren't moved around, there's no need to worry
about hashtable resizes. Therefore the stats can be referenced
without holding dshash partition locks.

4) If the stats entries aren't moved around by hashtable resizes, we can
use atomics, lwlocks, spinlocks etc as part of the stats entry. It's
not generally correct/safe to have dshash resize to move those
around.

All of that would be addressed if we instead allocate the stats data
separately from the dshash entry.
OK, I'm convinced by that (and I like it). The attached v27 is largely
changed from the previous version following the suggestion.

1) DB, table and function stats are stored in one hash keyed by (type,
dbid, objectid) and handled in a unified way. Now pgstat_report_stat()
flushes stats the following way.
while (hash_seq_search on local stats hash)
{
switch (ent->stats->type)
{
case PGSTAT_TYPE_DB: ...
case PGSTAT_TYPE_TABLE: ...
case PGSTAT_TYPE_FUNCTION: ...
}
}
2, 3) There's only one dshash table, pgStatSharedHash. Its entry is
defined as follows.
+typedef struct PgStatHashEntry
+{
+ PgStatHashEntryKey key; /* hash key */
+ dsa_pointer stats; /* pointer to shared stats entry in DSA */
+} PgStatHashEntry;
key is (type, databaseid, objectid)
To handle entries of different types in a common way, the hash entry
points to the following struct stored in DSA memory.
+typedef struct PgStatEntry
+{
+ PgStatTypes type; /* statistics entry type */
+ size_t len; /* length of body, fixed per type. */
+ LWLock lock; /* lightweight lock to protect body */
+ char body[FLEXIBLE_ARRAY_MEMBER]; /* statistics body */
+} PgStatEntry;
The body stores the existing PgStat_Stat*Entry structs.
To match the shared stats, locally-stored stats entries are changed in a
similar way.
4) As shown above, I'm using LWLock in this version.
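For illustration, fetching and updating one shared entry under this
layout might look like the following sketch (names follow the structs
above; creation of a new stats body on first insert and error handling
are elided):

	PgStatHashEntry *hashent;
	PgStatEntry *shent;
	PgStatHashEntryKey key;
	bool		found;

	key.type = PGSTAT_TYPE_TABLE;
	key.databaseid = MyDatabaseId;
	key.objectid = relid;

	/* the partition lock is needed only while touching the hash entry */
	hashent = (PgStatHashEntry *)
		dshash_find_or_insert(pgStatSharedHash, &key, &found);
	shent = (PgStatEntry *) dsa_get_address(area, hashent->stats);
	dshash_release_lock(pgStatSharedHash, hashent);

	/* the stats body never moves, so it can be locked independently */
	LWLockAcquire(&shent->lock, LW_EXCLUSIVE);
	/* ... accumulate local counters into shent->body ... */
	LWLockRelease(&shent->lock);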
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2020-Mar-27, Kyotaro Horiguchi wrote:
+/*
+ * XLogArchiveWakeupEnd - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+	Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	old_latch = XLogCtl->archiverWakeupLatch;
+	XLogCtl->archiverWakeupLatch = MyLatch;
+	SpinLockRelease(&XLogCtl->info_lck);
+	Assert(old_latch == NULL);
+}
The comment is wrong about the function name; OTOH I don't think the
old_latch assignment in the fourth line will work well in non-assert
builds. But why do you need those shenanigans? Surely
"Assert(XLogCtl->archiverWakeupLatch == NULL)" in the locked region
before assigning MyLatch should be sufficient and acceptable?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thank you for looking this.
At Fri, 27 Mar 2020 12:34:02 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
On 2020-Mar-27, Kyotaro Horiguchi wrote:
+/*
+ * XLogArchiveWakeupEnd - Set up archiver wakeup stuff
+ */
+void
+XLogArchiveWakeupStart(void)
+{
+	Latch *old_latch PG_USED_FOR_ASSERTS_ONLY;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	old_latch = XLogCtl->archiverWakeupLatch;
+	XLogCtl->archiverWakeupLatch = MyLatch;
+	SpinLockRelease(&XLogCtl->info_lck);
+	Assert(old_latch == NULL);
+}

The comment is wrong about the function name;

Oops! I found a similar mistake in another place. (pgstat_flush_funcentry)

OTOH I don't think the old_latch assignment in the fourth line will work
well in non-assert builds. But why do you need those shenanigans? Surely
"Assert(XLogCtl->archiverWakeupLatch == NULL)" in the locked region
before assigning MyLatch should be sufficient and acceptable?
Right. Maybe I wanted to move the assertion out of the lock section,
but that's actually useless.
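With that suggestion applied, the function reduces to something like
this sketch:

	/*
	 * XLogArchiveWakeupStart - Set up archiver wakeup stuff
	 */
	void
	XLogArchiveWakeupStart(void)
	{
		SpinLockAcquire(&XLogCtl->info_lck);
		Assert(XLogCtl->archiverWakeupLatch == NULL);
		XLogCtl->archiverWakeupLatch = MyLatch;
		SpinLockRelease(&XLogCtl->info_lck);
	}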
Fixed them and rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Conflicted with 616ae3d2b0, so rebased.
Fixed a broken comment style.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Conflicted with 616ae3d2b0, so rebased.
I made some cleanup. (v30)
- Added comments for members of dshash_seq_scans.
- Some style fixes and comment fixes of dshash.
- Cleaned up more usage of the word "stat(istics) collector" in comments.
- Changed the GUC attribute STATS_COLLECTOR to STATS_ACTIVITY
- Removed duplicate setup of MyBackendType and ps display in PgArchiverMain
- Removed B_STATS_COLLECTOR from BackendType and removed related code.
- Corrected the comment of PgArchiverMain, which mentioned "argv/argc".
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Conflicted with the commit 28cac71bd3 - SLRU stats.
Rebased and fixed some issues.
At Wed, 01 Apr 2020 17:37:23 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Conflicted with 616ae3d2b0, so rebased.
I made some cleanup. (v30)
- Added comments for members of dshash_seq_scans.
- Some style fixes and comment fixes of dshash.
- Cleaned up more usage of the word "stat(istics) collector" in comments.
- Changed the GUC attribute STATS_COLLECTOR to STATS_ACTIVITY
- Removed duplicate setup of MyBackendType and ps display in PgArchiverMain
- Removed B_STATS_COLLECTOR from BackendType and removed related code.
- Corrected the comment of PgArchiverMain, which mentioned "argv/argc".
I made further cleanups.
- Removed wrongly added BACKEND_TYPE_ARCHIVER.
- Moved archiver latch from XLogCtlData to ProcGlobal.
- Removed XLogArchiverStart/End/Wakeup from xlog.h, which anyway was
not the proper place.
- pgarch_MainLoop starts the loop with wakened = true whether it was
notified or timed out. Otherwise time_to_stop is set and it exits from
the loop immediately. So the variable wakened is actually
useless. Removed it.
- A PoC (or a rush job) of refactoring of the SLRU stats.
I tried to combine it into the global stats hash, but the SLRU
report functions are called within critical sections, so memory
allocation fails. The current pgstat module removes the local entry
on successfully flushing out to shared stats, so allocation at the
first report is inevitable. In the attached version it is handled
the same way as the global stats. I continue seeking a way to
combine it into the global stats hash.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Fri, 03 Apr 2020 17:31:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Conflicted with the commit 28cac71bd3 - SLRU stats.
Rebased and fixed some issues.
- A PoC (or a rush job) of refactoring of the SLRU stats.
I tried to combine it into the global stats hash, but the SLRU
report functions are called within critical sections, so memory
allocation fails. The current pgstat module removes the local entry
on successfully flushing out to shared stats, so allocation at the
first report is inevitable. In the attached version it is handled
the same way as the global stats. I continue seeking a way to
combine it into the global stats hash.
I didn't find a way to consolidate it into the general local stats
hash. The hash size could be large, and the chance of allocation
failure is larger than in other places where in-critical-section memory
allocation is allowed. Instead, in the attached, I separated the
shared-SLRU lock from StatsLock and added logic to avoid useless
scans of the array.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
This conflicts with several recent commits. Rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Tue, 07 Apr 2020 16:38:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 03 Apr 2020 17:31:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Conflicted with the commit 28cac71bd3 - SLRU stats.
Rebased and fixed some issues.
- A PoC (or a rush job) of refactoring of the SLRU stats.
I tried to combine it into the global stats hash, but the SLRU
report functions are called within critical sections, so memory
allocation fails. The current pgstat module removes the local entry
on successfully flushing out to shared stats, so allocation at the
first report is inevitable. In the attached version it is handled
the same way as the global stats. I continue seeking a way to
combine it into the global stats hash.

I didn't find a way to consolidate it into the general local stats
hash. The hash size could be large, and the chance of allocation
failure is larger than in other places where in-critical-section memory
allocation is allowed. Instead, in the attached, I separated the
shared-SLRU lock from StatsLock and added logic to avoid useless
scans of the array.
Maybe 29c3e2dd5a and a recent doc change hit this. Rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi.
Rebased on the current HEAD. 36ac359d36 conflicts with this. The tranche
name is changed from underscore-connected style into camel case
according to that change. While fixing that, I rewrote the
descriptions for vacuum_cleanup_index_scale_factor.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Mon, 01 Jun 2020 18:00:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Rebased on the current HEAD. 36ac359d36 conflicts with this. Tranche
Hmm. This conflicts with 0fd2a79a63. Rebased on it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
At Mon, 01 Jun 2020 18:00:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Rebased on the current HEAD. 36ac359d36 conflicts with this. Tranche
Hmm. This conflicts with 0fd2a79a63. Rebased on it.
Thanks for working on this and keeping it updated!
I've started taking a look and at least right off...
From 4926e50e7635548f86dcd0d36cbf56d168a5d242 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v35 1/7] Use standard crash handler in archiver.

The commit 8e19a82640 changed the SIGQUIT handler of almost all processes
not to run atexit callbacks, for safety. The archiver process should behave
the same way for the same reason. The exit status changes from 1 to 2, but
that doesn't make any behavioral change.
Shouldn't this:
a) be back-patched, as the other change was
b) also include a change to have the stats collector (which I realize is
removed later on in this patch set, but we're talking about fixing
existing things..) for the same reason, and because there isn't much
point in trying to write out the stats after we get a SIGQUIT, since
we're just going to blow them away again since we're going to go
through crash recovery..?
Might be good to have a separate thread to address these changes.
I've looked through (some of) this thread and through the patches also
and hope to provide a review of the bits that should be targetting v14
(unlike the above) soon.
Thanks,
Stephen
On Thu, Sep 03, 2020 at 01:16:59PM -0400, Stephen Frost wrote:
Shouldn't this:
a) be back-patched, as the other change was
0001 is just a piece of refactoring, so I see no strong argument in
favor of a backpatch, IMHO.
b) also include a change to have the stats collector (which I realize is
removed later on in this patch set, but we're talking about fixing
existing things..) for the same reason, and because there isn't much
point in trying to write out the stats after we get a SIGQUIT, since
we're just going to blow them away again since we're going to go
through crash recovery..?

Might be good to have a separate thread to address these changes.
I've looked through (some of) this thread and through the patches also
and hope to provide a review of the bits that should be targetting v14
(unlike the above) soon.
The latest patch set fails to apply and the CF bot is complaining, so
a rebase is needed. I have switched the patch as waiting on author.
--
Michael
Greetings,
* Stephen Frost (sfrost@snowman.net) wrote:
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
The commit 8e19a82640 changed the SIGQUIT handler of almost all processes
not to run atexit callbacks, for safety. The archiver process should behave
the same way for the same reason. The exit status changes from 1 to 2, but
that doesn't make any behavioral change.

Shouldn't this:
a) be back-patched, as the other change was
b) also include a change to have the stats collector (which I realize is
removed later on in this patch set, but we're talking about fixing
existing things..) for the same reason, and because there isn't much
point in trying to write out the stats after we get a SIGQUIT, since
we're just going to blow them away again since we're going to go
through crash recovery..?
* Michael Paquier (michael@paquier.xyz) wrote:
0001 is just a piece of refactoring, so I see no strong argument in
favor of a backpatch, IMHO.
No, 0001 changes the SIGQUIT handler for the archiver process, which is
what 8e19a82640 was about changing for a bunch of the other subprocesses
and which was back-patched all the way, so I'm a bit confused about why
you're saying it's just refactoring..?
Note that exit() and _exit() aren't the same, and the latter is what's
now being used in SignalHandlerForCrashExit.
Thanks,
Stephen
At Thu, 3 Sep 2020 13:16:59 -0400, Stephen Frost <sfrost@snowman.net> wrote in
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
At Mon, 01 Jun 2020 18:00:01 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Rebased on the current HEAD. 36ac359d36 conflicts with this. Tranche
Hmm. This conflicts with 0fd2a79a63. Rebased on it.
Thanks for working on this and keeping it updated!
I've started taking a look and at least right off...
From 4926e50e7635548f86dcd0d36cbf56d168a5d242 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 16 Mar 2020 17:15:35 +0900
Subject: [PATCH v35 1/7] Use standard crash handler in archiver.

The commit 8e19a82640 changed the SIGQUIT handler of almost all processes
not to run atexit callbacks, for safety. The archiver process should behave
the same way for the same reason. The exit status changes from 1 to 2, but
that doesn't make any behavioral change.

Shouldn't this:
a) be back-patched, as the other change was
b) also include a change to have the stats collector (which I realize is
removed later on in this patch set, but we're talking about fixing
existing things..) for the same reason, and because there isn't much
point in trying to write out the stats after we get a SIGQUIT, since
we're just going to blow them away again since we're going to go
through crash recovery..?

Might be good to have a separate thread to address these changes.
+1
I've looked through (some of) this thread and through the patches also
and hope to provide a review of the bits that should be targetting v14
(unlike the above) soon.
Thanks. Currently the patch is found to lead to heavier write contention
than the current stats collector when nearly a thousand backends
heavily write to the same table. I need to address that.

- Collect stats via sockets (in the same way as the current
implementation) and write/read to/from shared memory.
- Use our own lock-free (maybe) ring-buffer before stats writers enter
lock-waiting mode, then a new stats collector(!) process consumes the
queue.
- Some other measures..
Anyway, I'll post a rebased version soon.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 08 Sep 2020 17:01:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Thu, 3 Sep 2020 13:16:59 -0400, Stephen Frost <sfrost@snowman.net> wrote in
I've looked through (some of) this thread and through the patches also
and hope to provide a review of the bits that should be targetting v14
(unlike the above) soon.

Thanks. Currently the patch is found to lead to heavier write contention
than the current stats collector when nearly a thousand backends
heavily write to the same table. I need to address that.

- Collect stats via sockets (in the same way as the current
implementation) and write/read to/from shared memory.
- Use our own lock-free (maybe) ring-buffer before stats writers enter
lock-waiting mode, then a new stats collector(!) process consumes the
queue.
- Some other measures..
Anyway, I'll post a rebased version soon.
This is that. I'll continue seeking a way to solve the contention.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Tue, 08 Sep 2020 17:55:57 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Tue, 08 Sep 2020 17:01:47 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Thu, 3 Sep 2020 13:16:59 -0400, Stephen Frost <sfrost@snowman.net> wrote in
I've looked through (some of) this thread and through the patches also
and hope to provide a review of the bits that should be targeting v14
(unlike the above) soon.
Thanks. The patch has been found to lead to heavier write contention
than the current stats collector when nearly a thousand backends
heavily write to the same table. I need to address that.
- Collect stats via sockets (in the same way as the current
implementation) and write/read to/from shared memory.
- Use our own lock-free (maybe) ring buffer before a stats writer enters
lock-waiting mode, then a new stats collector(!) process consumes the
queue.
- Some other measures..
- Make dshash searches less frequent. To find the actual source of the
contention, we're going to measure performance with the attached applied
on top of the latest patch; it lets sessions cache the results of dshash
searches for the session lifetime. (A table-dropping vacuum clears the
local hash.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Greetings,
* Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
Shouldn't this:
a) be back-patched, as the other change was
b) also include a change to have the stats collector do the same (which
I realize is removed later on in this patch set, but we're talking
about fixing existing things..), for the same reason, and because
there isn't much
point in trying to write out the stats after we get a SIGQUIT, since
we're just going to blow them away again when we go through crash
recovery..?
Might be good to have a separate thread to address these changes.
+1
Just FYI, Tom's started a thread which includes this over here-
/messages/by-id/1850884.1599601164@sss.pgh.pa.us
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
Just FYI, Tom's started a thread which includes this over here-
/messages/by-id/1850884.1599601164@sss.pgh.pa.us
Per that discussion, I'm about to go and commit the 0001 patch from
this thread, which will cause the cfbot to not be able to apply the
patchset anymore till you repost it without 0001. However, before
reposting, you might want to fix the compile errors the cfbot is
showing currently.
On the Windows side:
src/backend/postmaster/postmaster.c(6410): error C2065: 'pgStatSock' : undeclared identifier [C:\projects\postgresql\postgres.vcxproj]
On the Linux side:
pgstat.c: In function ‘pgstat_vacuum_stat’:
pgstat.c:1411:7: error: ‘key’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
  ^
pgstat.c:1373:8: note: ‘key’ was declared here
  Oid *key;
  ^
pgstat.c:1411:7: error: ‘oidtab’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  if (hash_search(oidtab, key, HASH_FIND, NULL) != NULL)
  ^
pgstat.c:1372:9: note: ‘oidtab’ was declared here
  HTAB *oidtab;
  ^
pgstat.c: In function ‘pgstat_reset_single_counter’:
pgstat.c:1625:6: error: ‘stattype’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  env = get_stat_entry(stattype, MyDatabaseId, objoid, false, NULL, NULL);
  ^
pgstat.c:1601:14: note: ‘stattype’ was declared here
  PgStatTypes stattype;
  ^
cc1: all warnings being treated as errors
regards, tom lane
Hi,
On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
Locks on the shared statistics is acquired by the units of such like
tables, functions so the expected chance of collision are not so high.
I can't really parse that...
Furthermore, until 1 second has elapsed since the last flushing to
shared stats, lock failure postpones stats flushing so that lock
contention doesn't slow down transactions.
I think I commented on that before, but to me 1s seems way too low to
switch to blocking lock acquisition. What's the reason for such a low
limit?
 /*
- * Clean up any dead statistics collector entries for this DB. We always
+ * Clean up any dead activity statistics entries for this DB. We always
  * want to do this exactly once per DB-processing cycle, even if we find
  * nothing worth vacuuming in the database.
  */
What is "activity statistics"?
@@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
 	}

 	/* fetch the pgstat table entry */
-	tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared,
-										 shared, dbentry);
+	tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared,
+												   relid);
Why do all of these places deal with a snapshot? For most it seems to
make much more sense to just look up the entry and then copy that into
local memory? There may be some places that need some sort of snapshot
behaviour that's stable until commit / pgstat_clear_snapshot(). But I
can't really see many?
+#define PGSTAT_MIN_INTERVAL			1000	/* Minimum interval of stats data
+												 * updates */
+#define PGSTAT_MAX_INTERVAL			10000	/* Longest interval of stats data
+												 * updates */
These don't really seem to be in line with the commit message...
 /*
- * Structures in which backends store per-table info that's waiting to be
- * sent to the collector.
+ * Enums and types to define shared statistics structure.
+ *
+ * Statistics entries for each object is stored in individual DSA-allocated
+ * memory. Every entry is pointed from the dshash pgStatSharedHash via
+ * dsa_pointer. The structure makes object-stats entries not moved by dshash
+ * resizing, and allows the dshash can release lock sooner on stats
+ * updates. Also it reduces interfering among write-locks on each stat entry by
+ * not relying on partition lock of dshash. PgStatLocalHashEntry is the
+ * local-stats equivalent of PgStatHashEntry for shared stat entries.
+ *
+ * Each stat entry is enveloped by the type PgStatEnvelope, which stores common
+ * attribute of all kind of statistics and a LWLock lock object.
+ *
+ * Shared stats are stored as:
+ *
+ * dshash pgStatSharedHash
+ *    -> PgStatHashEntry				(dshash entry)
+ *       (dsa_pointer)-> PgStatEnvelope	(dsa memory block)
I don't like 'Envelope' that much. If I understand you correctly that's
a common prefix that's used for all types of stat objects, correct? If
so, how about just naming it PgStatEntryBase or such? I think it'd also
be useful to indicate in the "are stored as" part that PgStatEnvelope is
just the common prefix for an allocation.
 /*
- * pgStatTabHash entry: map from relation OID to PgStat_TableStatus pointer
+ * entry size lookup table of shared statistics entries corresponding to
+ * PgStatTypes
  */
-typedef struct TabStatHashEntry
+static size_t pgstat_entsize[] =

+/* Ditto for local statistics entries */
+static size_t pgstat_localentsize[] =
+{
+	0,									/* PGSTAT_TYPE_ALL: not an entry */
+	sizeof(PgStat_StatDBEntry),			/* PGSTAT_TYPE_DB */
+	sizeof(PgStat_TableStatus),			/* PGSTAT_TYPE_TABLE */
+	sizeof(PgStat_BackendFunctionEntry)	/* PGSTAT_TYPE_FUNCTION */
+};
These probably should be const as well.
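I.e., presumably just (a sketch of the same table, const-qualified):

static const size_t pgstat_localentsize[] =
{
	0,									/* PGSTAT_TYPE_ALL: not an entry */
	sizeof(PgStat_StatDBEntry),			/* PGSTAT_TYPE_DB */
	sizeof(PgStat_TableStatus),			/* PGSTAT_TYPE_TABLE */
	sizeof(PgStat_BackendFunctionEntry)	/* PGSTAT_TYPE_FUNCTION */
};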
 /*
- * Backends store per-function info that's waiting to be sent to the collector
- * in this hash table (indexed by function OID).
+ * Stats numbers that are waiting for flushing out to shared stats are held in
+ * pgStatLocalHash,
  */
-static HTAB *pgStatFunctions = NULL;
+typedef struct PgStatHashEntry
+{
+	PgStatHashEntryKey key;		/* hash key */
+	dsa_pointer env;			/* pointer to shared stats envelope in DSA */
+} PgStatHashEntry;
+
+/* struct for shared statistics entry pointed from shared hash entry. */
+typedef struct PgStatEnvelope
+{
+	PgStatTypes type;			/* statistics entry type */
+	Oid			databaseid;		/* databaseid */
+	Oid			objectid;		/* objectid */
Do we need this information both here and in PgStatHashEntry? It's
possible that it's worthwhile, but I am not sure it is.
+ size_t len; /* length of body, fixed per type. */
Why do we need this? Isn't that something that can easily be looked up
using the type?
+	LWLock		lock;			/* lightweight lock to protect body */
+	int			body[FLEXIBLE_ARRAY_MEMBER];	/* statistics body */
+} PgStatEnvelope;
What you're doing here with 'body' doesn't provide enough guarantees
about proper alignment. E.g. if one of the entry types wants to store a
double, this won't portably work, because there's platforms that have 4
byte alignment for ints, but 8 byte alignment for doubles.
Wouldn't it be better to instead embed PgStatEnvelope into the struct
that's actually stored? E.g. something like
struct PgStat_TableStatus
{
PgStatEnvelope header; /* I'd rename the type */
TimestampTz vacuum_timestamp; /* user initiated vacuum */
...
}
or if you don't want to do that because it'd require declaring
PgStatEnvelope in the header (not sure that'd really be worth avoiding),
you could just get rid of the body field and just do the calculation
using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))
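For instance (a sketch of that calculation; PGSTAT_ENTRY_BODY is a
hypothetical name):

#define PGSTAT_ENTRY_BODY(env) \
	((void *) MAXALIGN((char *) (env) + sizeof(PgStatEnvelope)))

That keeps the single-allocation layout while making the alignment
explicit for any body type.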
+ * Snapshot is stats entry that is locally copied to offset stable values for a
+ * transaction.
  */
-static bool have_function_stats = false;
+typedef struct PgStatSnapshot
+{
+	PgStatHashEntryKey key;
+	bool		negative;
+	int			body[FLEXIBLE_ARRAY_MEMBER];	/* statistics body */
+} PgStatSnapshot;
+
+#define PgStatSnapshotSize(bodylen) \
+	(offsetof(PgStatSnapshot, body) + (bodylen))

-/*
- * Info about current "snapshot" of stats file
- */
+/* Variables for backend status snapshot */
 static MemoryContext pgStatLocalContext = NULL;
-static HTAB *pgStatDBHash = NULL;
+static MemoryContext pgStatSnapshotContext = NULL;
+static HTAB *pgStatSnapshotHash = NULL;
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ *
+ * Contains statistics that are collected not per database nor per table
+ * basis. shared_* points to shared memory and snapshot_* are backend
+ * snapshots.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
-static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];
-
-/*
- * List of OIDs of databases we need to write out. If an entry is InvalidOid,
- * it means to write only the shared-catalog stats ("DB 0"); otherwise, we
- * will write both that DB's data and the shared stats.
- */
-static List *pending_write_requests = NIL;
+static bool global_snapshot_is_valid = false;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats snapshot_globalStats;
+static PgStatSharedSLRUStats *shared_SLRUStats;
+static PgStat_StatSLRUEntry snapshot_SLRUStats[SLRU_NUM_ELEMENTS];
The amount of code needed for this snapshot stuff seems unreasonable to
me, especially because I don't see why we really need it. Is this just
so that there's no skew between all the columns of pg_stat_all_tables()
etc?
I think this needs a lot more comments explaining what it's trying to
achieve.
+/*
+ * Newly created shared stats entries needs to be initialized before the other
+ * processes get access it. get_stat_entry() calls it for the purpose.
+ */
+typedef void (*entry_initializer) (PgStatEnvelope * env);
I think we should try to not need it, instead declaring that all fields
are zero initialized. That fits well together with my suggestion to
avoid duplicating the database / object ids.
+static void
+attach_shared_stats(void)
+{
...
+	/* We're the first process to attach the shared stats memory */
+	Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+	/* Initialize shared memory area */
+	area = dsa_create(LWTRANCHE_STATS);
+	pgStatSharedHash = dshash_create(area, &dsh_rootparams, 0);
+
+	StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+	StatsShmem->global_stats =
+		dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+	StatsShmem->archiver_stats =
+		dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+	StatsShmem->slru_stats =
+		dsa_allocate0(area, sizeof(PgStatSharedSLRUStats));
+	StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+
+	shared_globalStats = (PgStat_GlobalStats *)
+		dsa_get_address(area, StatsShmem->global_stats);
+	shared_archiverStats = (PgStat_ArchiverStats *)
+		dsa_get_address(area, StatsShmem->archiver_stats);
+
+	shared_SLRUStats = (PgStatSharedSLRUStats *)
+		dsa_get_address(area, StatsShmem->slru_stats);
+	LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);
I don't think it makes sense to use dsa allocations for any of the fixed
size stats (global_stats, archiver_stats, ...). They should just be
direct members of StatsShmem? Then we also don't need the shared_*
helper variables
+ /* Load saved data if any. */
+ pgstat_read_statsfiles();
Hm. Is it a good idea to do this as part of the shmem init function?
That's a lot more work than we normally do in these.
+/* ----------
+ * detach_shared_stats() -
+ *
+ * Detach shared stats. Write out to file if we're the last process and told
+ * to do so.
+ * ----------
  */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)
I think it'd be better to have an explicit call in the shutdown sequence
somewhere to write out the data, instead of munging detach and writing
stats out together.
 /* ----------
  * pgstat_report_stat() -
  *
  * Must be called by processes that performs DML: tcop/postgres.c, logical
- * receiver processes, SPI worker, etc. to send the so far collected
- * per-table and function usage statistics to the collector. Note that this
- * is called only when not within a transaction, so it is fair to use
+ * receiver processes, SPI worker, etc. to apply the so far collected
+ * per-table and function usage statistics to the shared statistics hashes.
+ *
+ * Updates are applied not more frequent than the interval of
+ * PGSTAT_MIN_INTERVAL milliseconds. They are also postponed on lock
+ * failure if force is false and there's no pending updates longer than
+ * PGSTAT_MAX_INTERVAL milliseconds. Postponed updates are retried in
+ * succeeding calls of this function.
+ *
+ * Returns the time until the next timing when updates are applied in
+ * milliseconds if there are no updates held for more than
+ * PGSTAT_MIN_INTERVAL milliseconds.
+ *
+ * Note that this is called only out of a transaction, so it is fine to use
  * transaction stop time as an approximation of current time.
- * ----------
+ * ----------
  */
-void
+long
 pgstat_report_stat(bool force)
 {
-	/* we assume this inits to all zeroes: */
-	static const PgStat_TableCounts all_zeroes;
-	static TimestampTz last_report = 0;
-
+	static TimestampTz next_flush = 0;
+	static TimestampTz pending_since = 0;
+	static long retry_interval = 0;
 	TimestampTz now;
-	PgStat_MsgTabstat regular_msg;
-	PgStat_MsgTabstat shared_msg;
-	TabStatusArray *tsa;
+	bool		nowait = !force;	/* Don't use force ever after */
+	if (nowait)
+	{
+		/*
+		 * Don't flush stats too frequently. Return the time to the next
+		 * flush.
+		 */
I think it's confusing to use nowait in the if when you actually mean
!force.
-	for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+	if (pgStatLocalHash)
 	{
-		for (i = 0; i < tsa->tsa_used; i++)
+		/* Step 1: flush out other than database stats */
+		hash_seq_init(&scan, pgStatLocalHash);
+		while ((lent = (PgStatLocalHashEntry *) hash_seq_search(&scan)) != NULL)
 		{
-			PgStat_TableStatus *entry = &tsa->tsa_entries[i];
-			PgStat_MsgTabstat *this_msg;
-			PgStat_TableEntry *this_ent;
+			bool		remove = false;

-			/* Shouldn't have any pending transaction-dependent counts */
-			Assert(entry->trans == NULL);
+			switch (lent->env->type)
+			{
+				case PGSTAT_TYPE_DB:
+					if (ndbentries >= dbentlistlen)
+					{
+						dbentlistlen *= 2;
+						dbentlist = repalloc(dbentlist,
+											 sizeof(PgStatLocalHashEntry *) *
+											 dbentlistlen);
+					}
+					dbentlist[ndbentries++] = lent;
+					break;
Why do we need this special behaviour for database statistics?
If we need it, it'd be better to just use List here rather than open
coding a replacement (List these days basically has the same complexity
as what you do here).
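I.e. roughly (a sketch; dbentlist would become a List *, initialized to
NIL):

			case PGSTAT_TYPE_DB:
				dbentlist = lappend(dbentlist, lent);
				break;

and the later walk over the collected entries becomes:

	ListCell   *lc;

	foreach(lc, dbentlist)
	{
		PgStatLocalHashEntry *lent = (PgStatLocalHashEntry *) lfirst(lc);

		/* flush the database entry, as in the current "Step 2" */
	}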
+				case PGSTAT_TYPE_TABLE:
+					if (flush_tabstat(lent->env, nowait))
+						remove = true;
+					break;
+				case PGSTAT_TYPE_FUNCTION:
+					if (flush_funcstat(lent->env, nowait))
+						remove = true;
+					break;
+				default:
+					Assert(false);
Adding a default here prevents the compiler from issuing a warning when
new types of stats are added...
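That is, spelling out every value so -Wswitch can flag unhandled ones
(a sketch):

			switch (lent->env->type)
			{
				case PGSTAT_TYPE_DB:
					dbentlist = lappend(dbentlist, lent);
					break;
				case PGSTAT_TYPE_TABLE:
					if (flush_tabstat(lent->env, nowait))
						remove = true;
					break;
				case PGSTAT_TYPE_FUNCTION:
					if (flush_funcstat(lent->env, nowait))
						remove = true;
					break;
				case PGSTAT_TYPE_ALL:
					Assert(false);	/* cannot appear as an entry */
					break;
			}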
+			/* Remove the successfully flushed entry */
+			pfree(lent->env);
Probably worth zeroing the pointer here, to make debugging a little
easier.
+	/* Publish the last flush time */
+	LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+	if (shared_globalStats->stats_timestamp < now)
+		shared_globalStats->stats_timestamp = now;
+	LWLockRelease(StatsLock);
Ugh, that seems like a fairly unnecessary global lock acquisition. What
do we need this timestamp for? Not clear to me that it's still
needed. If it is needed, it'd probably be worth making this an atomic
and doing a compare-exchange loop instead.
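Something along these lines, perhaps (a sketch; assumes stats_timestamp
were changed to a pg_atomic_uint64, and relies on TimestampTz being an
int64 microsecond count that round-trips through uint64 for
non-negative values):

	uint64		cur = pg_atomic_read_u64(&shared_globalStats->stats_timestamp);

	/* advance the published flush time, never moving it backwards */
	while ((TimestampTz) cur < now)
	{
		if (pg_atomic_compare_exchange_u64(&shared_globalStats->stats_timestamp,
										   &cur, (uint64) now))
			break;
		/* cur was updated with the concurrently-written value; retry */
	}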
 	/*
-	 * Send partial messages. Make sure that any pending xact commit/abort
-	 * gets counted, even if there are no table stats to send.
+	 * If we have pending local stats, let the caller know the retry interval.
 	 */
-	if (regular_msg.m_nentries > 0 ||
-		pgStatXactCommit > 0 || pgStatXactRollback > 0)
-		pgstat_send_tabstat(&regular_msg);
-	if (shared_msg.m_nentries > 0)
-		pgstat_send_tabstat(&shared_msg);
+	if (HAVE_ANY_PENDING_STATS())
I think this needs a comment explaining why we still may have pending
stats.
+ * flush_tabstat - flush out a local table stats entry
+ *
+ * Some of the stats numbers are copied to local database stats entry after
+ * successful flush-out.
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_tabstat(PgStatEnvelope * lenv, bool nowait)
+{
+	static const PgStat_TableCounts all_zeroes;
+	Oid			dboid;			/* database OID of the table */
+	PgStat_TableStatus *lstats;	/* local stats entry */
+	PgStatEnvelope *shenv;		/* shared stats envelope */
+	PgStat_StatTabEntry *shtabstats;	/* table entry of shared stats */
+	PgStat_StatDBEntry *ldbstats;	/* local database entry */
+	bool		found;
+
+	Assert(lenv->type == PGSTAT_TYPE_TABLE);
+
+	lstats = (PgStat_TableStatus *) &lenv->body;
+	dboid = lstats->t_shared ? InvalidOid : MyDatabaseId;
+
+	/*
+	 * Ignore entries that didn't accumulate any actual counts, such as
+	 * indexes that were opened by the planner but not used.
+	 */
+	if (memcmp(&lstats->t_counts, &all_zeroes,
+			   sizeof(PgStat_TableCounts)) == 0)
+		return true;
+
+	/* find shared table stats entry corresponding to the local entry */
+	shenv = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
+						   nowait, init_tabentry, &found);
+
+	/* skip if dshash failed to acquire lock */
+	if (shenv == NULL)
+		return false;
Could we cache the address of the shared entry in the local entry for a
while? It seems we have a bunch of contention (that I think you're
trying to address in a prototype patch posted since) just because we
will over and over look up the same address in the shared hash table.
If we instead kept the local hashtable alive for longer and stored a
pointer to the shared entry in it, we could make this a lot
cheaper. There would be some somewhat nasty edge cases probably. Imagine
a table being dropped for which another backend still has pending
stats. But that could e.g. be addressed with a refcount.
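E.g. (a sketch; the 'shared' field and the refcounting are hypothetical,
not something the posted patch has):

typedef struct PgStatLocalHashEntry
{
	PgStatHashEntryKey key;		/* hash key */
	PgStatEnvelope *env;		/* pending local counts */
	PgStatEnvelope *shared;		/* cached address of the shared entry */
} PgStatLocalHashEntry;

	/* in flush_tabstat(), the dshash lookup then runs only on a miss: */
	if (lent->shared == NULL)
		lent->shared = get_stat_entry(PGSTAT_TYPE_TABLE, dboid, lstats->t_id,
									  nowait, init_tabentry, &found);
	shenv = lent->shared;

Dropping an entry would then have to wait for (or invalidate) cached
references, e.g. via a refcount in the shared entry as suggested above.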
+	/* retrieve the shared table stats entry from the envelope */
+	shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+	/* lock the shared entry to protect the content, skip if failed */
+	if (!nowait)
+		LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+		return false;
+
+	/* add the values to the shared entry. */
+	shtabstats->numscans += lstats->t_counts.t_numscans;
+	shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+	shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+	shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+	shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+	shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+	shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+	/*
+	 * If table was truncated or vacuum/analyze has ran, first reset the
+	 * live/dead counters.
+	 */
+	if (lstats->t_counts.t_truncated ||
+		lstats->t_counts.vacuum_count > 0 ||
+		lstats->t_counts.analyze_count > 0 ||
+		lstats->t_counts.autovac_vacuum_count > 0 ||
+		lstats->t_counts.autovac_analyze_count > 0)
+	{
+		shtabstats->n_live_tuples = 0;
+		shtabstats->n_dead_tuples = 0;
+	}

+	/* clear the change counter if requested */
+	if (lstats->t_counts.reset_changed_tuples)
+		shtabstats->changes_since_analyze = 0;
I know this is largely old code, but it's not obvious to me that there's
no race conditions here / that the race condition didn't get worse. What
prevents other backends from having since done a lot of inserts into
this table? Especially in case the flushes were delayed due to lock
contention.
+	/*
+	 * Update vacuum/analyze timestamp and counters, so that the values won't
+	 * goes back.
+	 */
+	if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+		shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;
It seems to me that if these branches are indeed a necessary branches,
my concerns above are well founded...
+init_tabentry(PgStatEnvelope * env)
 {
-	int			n;
-	int			len;
+	PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+	/*
+	 * If it's a new table entry, initialize counters to the values we just
+	 * got.
+	 */
+	Assert(env->type == PGSTAT_TYPE_TABLE);
+	tabent->tableid = env->objectid;
It seems over the top to me to have the object id stored in yet another
place. It's now in the hash entry, in the envelope, and the type
specific part.
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+	/* we assume this inits to all zeroes: */
+	static const PgStat_FunctionCounts all_zeroes;
+	PgStat_BackendFunctionEntry *localent;	/* local stats entry */
+	PgStatEnvelope *shenv;		/* shared stats envelope */
+	PgStat_StatFuncEntry *sharedent = NULL;	/* shared stats entry */
+	bool		found;
+
+	Assert(env->type == PGSTAT_TYPE_FUNCTION);
+	localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+	/* Skip it if no counts accumulated for it so far */
+	if (memcmp(&localent->f_counts, &all_zeroes,
+			   sizeof(PgStat_FunctionCounts)) == 0)
+		return true;
Why would we have an entry in this case?
+	/* find shared table stats entry corresponding to the local entry */
+	shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+						   nowait, init_funcentry, &found);
+	/* skip if dshash failed to acquire lock */
+	if (shenv == NULL)
+		return false;			/* failed to acquire lock, skip */
+
+	/* retrieve the shared table stats entry from the envelope */
+	sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+	/* lock the shared entry to protect the content, skip if failed */
+	if (!nowait)
+		LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+		return false;			/* failed to acquire lock, skip */
It doesn't seem great that we have a separate copy of all of this logic
again. It seems to me that most of the code here is (or should be)
exactly the same as in the table case. I think only the below should be
in here, rather than in common code.
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_dbstat(PgStatEnvelope * env, bool nowait)
+{
+	PgStat_StatDBEntry *localent;
+	PgStatEnvelope *shenv;
+	PgStat_StatDBEntry *sharedent;
+
+	Assert(env->type == PGSTAT_TYPE_DB);
+
+	localent = (PgStat_StatDBEntry *) &env->body;
+
+	/* find shared database stats entry corresponding to the local entry */
+	shenv = get_stat_entry(PGSTAT_TYPE_DB, localent->databaseid, InvalidOid,
+						   nowait, init_dbentry, NULL);
+
+	/* skip if dshash failed to acquire lock */
+	if (!shenv)
+		return false;
+
+	/* retrieve the shared stats entry from the envelope */
+	sharedent = (PgStat_StatDBEntry *) &shenv->body;
+
+	/* lock the shared entry to protect the content, skip if failed */
+	if (!nowait)
+		LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+		return false;
Dito re duplicating all of this.
+/*
+ * Create the filename for a DB stat file; filename is output parameter points
+ * to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+	int			printed;
+
+	/* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+	printed = snprintf(filename, len, "%s/db_%u.%s",
+					   PGSTAT_STAT_PERMANENT_DIRECTORY,
+					   databaseid,
+					   tempname ? "tmp" : "stat");
+	if (printed >= len)
+		elog(ERROR, "overlength pgstat path");
 }
Do we really want database specific storage after all of these changes?
Seems like there's no point anymore?
+	dshash_seq_init(&hstat, pgStatSharedHash, false);
+	while ((p = dshash_seq_next(&hstat)) != NULL)
 	{
-		Oid			tabid = tabentry->tableid;
-
-		CHECK_FOR_INTERRUPTS();
-
Given that this could take a while on a database with a lot of objects
it might be worth keeping the CHECK_FOR_INTERRUPTS().
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- * Will tell the collector about objects he can get rid of.
+ * Collect the shared statistics entries specified by type and dbid. Returns a
+ * list of pointer to shared statistics in palloc'ed memory. If type is
+ * PGSTAT_TYPE_ALL, all types of statistics of the database is collected. If
+ * type is PGSTAT_TYPE_DB, the parameter dbid is ignored and collect all
+ * PGSTAT_TYPE_DB entries.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope **
+collect_stat_entries(PgStatTypes type, Oid dbid)
 {
-		if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+		if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+			(type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
 			continue;
I don't like this interface much. Particularly not that it requires
adding a PGSTAT_TYPE_ALL that's otherwise not needed. And the thing
where PGSTAT_TYPE_DB doesn't actually work as one would expect isn't
nice either.
-		/*
-		 * Not there, so add this table's Oid to the message
-		 */
-		msg.m_tableid[msg.m_nentries++] = tabid;
-
-		/*
-		 * If the message is full, send it out and reinitialize to empty
-		 */
-		if (msg.m_nentries >= PGSTAT_NUM_TABPURGE)
+		if (n >= listlen - 1)
 		{
-			len = offsetof(PgStat_MsgTabpurge, m_tableid[0])
-				+ msg.m_nentries * sizeof(Oid);
-
-			pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_TABPURGE);
-			msg.m_databaseid = MyDatabaseId;
-			pgstat_send(&msg, len);
-
-			msg.m_nentries = 0;
+			listlen *= 2;
+			envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope **));
 		}
+		envlist[n++] = dsa_get_address(area, p->env);
 	}
I'd use List here as well.
+ dshash_seq_term(&hstat);
Hm, I didn't immediately see which locking makes this safe? Is it just
that nobody should be attached at this point?
+void
+pgstat_vacuum_stat(void)
+{
+	HTAB	   *dbids;			/* database ids */
+	HTAB	   *relids;			/* relation ids in the current database */
+	HTAB	   *funcids;		/* function ids in the current database */
+	PgStatEnvelope **victims;	/* victim entry list */
+	int			arraylen = 0;	/* storage size of the above */
+	int			nvictims = 0;	/* # of entries of the above */
+	dshash_seq_status dshstat;
+	PgStatHashEntry *ent;
+	int			i;
+
+	/* we don't collect stats under standalone mode */
+	if (!IsUnderPostmaster)
+		return;
+
+	/* collect oids of existent objects */
+	dbids = collect_oids(DatabaseRelationId, Anum_pg_database_oid);
+	relids = collect_oids(RelationRelationId, Anum_pg_class_oid);
+	funcids = collect_oids(ProcedureRelationId, Anum_pg_proc_oid);
+
+	/* collect victims from shared stats */
+	arraylen = 16;
+	victims = palloc(sizeof(PgStatEnvelope **) * arraylen);
+	nvictims = 0;
Same List comment as before.
 void
 pgstat_reset_counters(void)
 {
-	PgStat_MsgResetcounter msg;
+	PgStatEnvelope **envlist;
+	PgStatEnvelope **p;

-	if (pgStatSock == PGINVALID_SOCKET)
-		return;
+	/* Lookup the entries of the current database in the stats hash. */
+	envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+	for (p = envlist; *p != NULL; p++)
+	{
+		PgStatEnvelope *env = *p;
+		PgStat_StatDBEntry *dbstat;

-	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-	msg.m_databaseid = MyDatabaseId;
-	pgstat_send(&msg, sizeof(msg));
+		LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+
What locking prevents this entry from being freed between the
collect_stat_entries() and this LWLockAcquire?
 /* ----------
@@ -1440,48 +1684,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-	PgStat_MsgAutovacStart msg;
+	PgStat_StatDBEntry *dbentry;
+	TimestampTz ts;

-	if (pgStatSock == PGINVALID_SOCKET)
+	/* return if activity stats is not active */
+	if (!area)
 		return;

-	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-	msg.m_databaseid = dboid;
-	msg.m_start_time = GetCurrentTimestamp();
+	ts = GetCurrentTimestamp();

-	pgstat_send(&msg, sizeof(msg));
+	/*
+	 * Store the last autovacuum time in the database's hash table entry.
+	 */
+	dbentry = get_local_dbstat_entry(dboid);
+	dbentry->last_autovac_time = ts;
 }
Why did you introduce the local ts variable here?
 /* --------
  * pgstat_report_analyze() -
  *
- * Tell the collector about the table we just analyzed.
+ * Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1751,10 @@
 pgstat_report_analyze(Relation rel,
 					  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 					  bool resetcounter)
 {
 }
It seems to me that the analyze / vacuum cases would be much better
dealt with by synchronously operating on the shared entry, instead of
going through the local hash table. ISTM that that'd make it a lot
easier to avoid most of the ordering issues.
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+	PgStatEnvelope *env;
+	PgStat_TableStatus *tabentry;
+	bool		found;

-	/*
-	 * Now we can fill the entry in pgStatTabHash.
-	 */
-	hash_entry->tsa_entry = entry;
+	env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+							   isshared ? InvalidOid : MyDatabaseId,
+							   rel_id, true, &found);

-	return entry;
+	tabentry = (PgStat_TableStatus *) &env->body;
+
+	if (!found)
+	{
+		tabentry->t_id = rel_id;
+		tabentry->t_shared = isshared;
+		tabentry->trans = NULL;
+		MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+		tabentry->vacuum_timestamp = 0;
+		tabentry->autovac_vacuum_timestamp = 0;
+		tabentry->analyze_timestamp = 0;
+		tabentry->autovac_analyze_timestamp = 0;
+	}
+
As with shared entries, I think this should just be zero initialized
(and we should try to get rid of the duplication of t_id/t_shared).
+	return tabentry;
 }

 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ * Find any existing PgStat_TableStatus entry for rel from the current
+ * database then from shared tables.
What do you mean with "from the current database then from shared
tables"?
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
-	PgStat_MsgArchiver msg;
+	TimestampTz now = GetCurrentTimestamp();

-	/*
-	 * Prepare and send the message
-	 */
-	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
-	msg.m_failed = failed;
-	strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
-	msg.m_timestamp = GetCurrentTimestamp();
-	pgstat_send(&msg, sizeof(msg));
+	if (failed)
+	{
+		/* Failed archival attempt */
+		LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+		++shared_archiverStats->failed_count;
+		memcpy(shared_archiverStats->last_failed_wal, xlog,
+			   sizeof(shared_archiverStats->last_failed_wal));
+		shared_archiverStats->last_failed_timestamp = now;
+		LWLockRelease(StatsLock);
+	}
+	else
+	{
+		/* Successful archival operation */
+		LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+		++shared_archiverStats->archived_count;
+		memcpy(shared_archiverStats->last_archived_wal, xlog,
+			   sizeof(shared_archiverStats->last_archived_wal));
+		shared_archiverStats->last_archived_timestamp = now;
+		LWLockRelease(StatsLock);
+	}
 }
Huh, why is this duplicating near-equivalent code?
 /* ----------
  * pgstat_write_statsfiles() -
- *		Write the global statistics file, as well as requested DB files.
- *
- * 'permanent' specifies writing to the permanent files not temporary ones.
- * When true (happens only when the collector is shutting down), also remove
- * the temporary files so that backends starting up under a new postmaster
- * can't read old data before the new collector is ready.
- *
- * When 'allDbs' is false, only the requested databases (listed in
- * pending_write_requests) will be written; otherwise, all databases
- * will be written.
+ *		Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {
What's the locking around this?
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-	HASH_SEQ_STATUS tstat;
-	HASH_SEQ_STATUS fstat;
-	PgStat_StatTabEntry *tabentry;
-	PgStat_StatFuncEntry *funcentry;
+	PgStatEnvelope **envlist;
+	PgStatEnvelope **penv;
 	FILE	   *fpout;
 	int32		format_id;
 	Oid			dbid = dbentry->databaseid;
@@ -5048,8 +4974,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
 	char		tmpfile[MAXPGPATH];
 	char		statfile[MAXPGPATH];

-	get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-	get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+	get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+	get_dbstat_filename(false, dbid, statfile, MAXPGPATH);

 	elog(DEBUG2, "writing stats file \"%s\"", statfile);

@@ -5076,24 +5002,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
 	/*
 	 * Walk through the database's access stats per table.
 	 */
-	hash_seq_init(&tstat, dbentry->tables);
-	while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+	envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+	for (penv = envlist; *penv != NULL; penv++)
In several of these collect_stat_entries() callers it really bothers me
that we basically allocate an array as large as the number of objects
in the database (That's fine for databases, but for tables...). Without
much need as far as I can see.
 	{
+		PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
 		fputc('T', fpout);
 		rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
 		(void) rc;				/* we'll check for error with ferror */
 	}
+	pfree(envlist);

 	/*
 	 * Walk through the database's function stats table.
 	 */
-	hash_seq_init(&fstat, dbentry->functions);
-	while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+	envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+	for (penv = envlist; *penv != NULL; penv++)
 	{
+		PgStat_StatFuncEntry *funcentry =
+			(PgStat_StatFuncEntry *) &(*penv)->body;
+
 		fputc('F', fpout);
 		rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
 		(void) rc;				/* we'll check for error with ferror */
 	}
+	pfree(envlist);
Why do we need separate loops for every type of object here?
+/* ----------
+ * create_missing_dbentries() -
+ *
+ * There may be the case where database entry is missing for the database
+ * where object stats are recorded. This function creates such missing
+ * dbentries so that so that all stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{
In which situation is this necessary?
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+			   bool nowait, entry_initializer initfunc, bool *found)
+{
+	bool		create = (initfunc != NULL);
+	PgStatHashEntry *shent;
+	PgStatEnvelope *shenv = NULL;
+	PgStatHashEntryKey key;
+	bool		myfound;
+
+	Assert(type != PGSTAT_TYPE_ALL);
+
+	key.type = type;
+	key.databaseid = dbid;
+	key.objectid = objid;
+	shent = dshash_find_extended(pgStatSharedHash, &key,
+								 create, nowait, create, &myfound);
+	if (shent)
 	{
-		get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+		if (create && !myfound)
+		{
+			/* Create new stats envelope. */
+			size_t		envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+			dsa_pointer chunk = dsa_allocate0(area, envsize);

+			/*
+			 * The lock on dshsh is released just after. Call initializer
+			 * callback before it is exposed to other process.
+			 */
+			if (initfunc)
+				initfunc(shenv);
+
+			/* Link the new entry from the hash entry. */
+			shent->env = chunk;
+		}
+		else
+			shenv = dsa_get_address(area, shent->env);
+
+		dshash_release_lock(pgStatSharedHash, shent);
Doesn't this mean that by this time the entry could already have been
removed by a concurrent backend, and the dsa allocation freed?
Subject: [PATCH v36 7/7] Remove the GUC stats_temp_directory
The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.
I don't see what this achieves? Which use of those variables / macros
would be safe? I think it'd be better to just remove them.
Greetings,
Andres Freund
Thanks for reviewing!
At Mon, 21 Sep 2020 19:47:04 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
Locks on the shared statistics is acquired by the units of such like
tables, functions so the expected chance of collision are not so high.I can't really parse that...
Mmm... Is the following readable?
Shared statistics locks are acquired by units such as tables,
functions, etc., so the chances of an expected collision are not so
high.
Anyway, this is found to be wrong, so I removed it.
Furthermore, until 1 second has elapsed since the last flushing to
shared stats, lock failure postpones stats flushing so that lock
contention doesn't slow down transactions.I think I commented on that before, but to me 1s seems way too low to
switch to blocking lock acquisition. What's the reason for such a low
limit?
It was 0.5 seconds previously. I don't have a clear idea of a
reasonable value for it. One possible rationale might be to give each
of 1000 clients a 10 ms writing slot, which works out to 10 s
(1000 x 10 ms) as the minimum interval. I set the maximum interval to
60 s, and the retry interval to 1 s. (Fixed?)
/* - * Clean up any dead statistics collector entries for this DB. We always + * Clean up any dead activity statistics entries for this DB. We always * want to do this exactly once per DB-processing cycle, even if we find * nothing worth vacuuming in the database. */What is "activity statistics"?
I don't get your point. It is formally the replacement term for
"statistics collector". The "statistics collector (process)" no longer
exists, so I had to invent a name for the successor mechanism that is
distinguishable from data/column statistics. If it is not the proper
wording, I'd appreciate it if you could suggest a better one.
@@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
}/* fetch the pgstat table entry */ - tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared, - shared, dbentry); + tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared, + relid);Why do all of these places deal with a snapshot? For most it seems to
make much more sense to just look up the entry and then copy that into
local memory? There may be some place that need some sort of snapshot
behaviour that's stable until commit / pgstat_clear_snapshot(). But I
can't reallly see many?
Ok, I reread this thread and agree that there's a (vague) consensus to
remove the snapshot stuff. Backend statistics (bestats) still remain
stable during a transaction.
+#define PGSTAT_MIN_INTERVAL 1000 /* Minimum interval of stats data
+#define PGSTAT_MAX_INTERVAL 10000 /* Longest interval of stats data + * updates */These don't really seem to be in line with the commit message...
Oops! Sorry. Fixed both the values and the commit message (and the
file comment).
+ * dshash pgStatSharedHash + * -> PgStatHashEntry (dshash entry) + * (dsa_pointer)-> PgStatEnvelope (dsa memory block)I don't like 'Envelope' that much. If I understand you correctly that's
a common prefix that's used for all types of stat objects, correct? If
so, how about just naming it PgStatEntryBase or such? I think it'd also
be useful to indicate in the "are stored as" part that PgStatEnvelope is
just the common prefix for an allocation.
The name makes sense. Thanks! (But the struct is now gone..)
-typedef struct TabStatHashEntry +static size_t pgstat_entsize[] =+/* Ditto for local statistics entries */ +static size_t pgstat_localentsize[] = +{ + 0, /* PGSTAT_TYPE_ALL: not an entry */ + sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */ + sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */ + sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */ +};These probably should be const as well.
Right. Fixed.
/* - * Backends store per-function info that's waiting to be sent to the collector - * in this hash table (indexed by function OID). + * Stats numbers that are waiting for flushing out to shared stats are held in + * pgStatLocalHash, */ -static HTAB *pgStatFunctions = NULL; +typedef struct PgStatHashEntry +{ + PgStatHashEntryKey key; /* hash key */ + dsa_pointer env; /* pointer to shared stats envelope in DSA */ +} PgStatHashEntry; + +/* struct for shared statistics entry pointed from shared hash entry. */ +typedef struct PgStatEnvelope +{ + PgStatTypes type; /* statistics entry type */ + Oid databaseid; /* databaseid */ + Oid objectid; /* objectid */Do we need this information both here and in PgStatHashEntry? It's
possible that it's worthwhile, but I am not sure it is.
The same key values were stored in PgStatEnvelope,
PgStat(Local)HashEntry, and PgStat_Stats*Entry, and I had the same
thought while developing. After some thought, I managed to remove the
duplicated values everywhere except PgStat(Local)HashEntry. Fixed.
+ size_t len; /* length of body, fixed per type. */
Why do we need this? Isn't that something that can easily be looked up
using the type?
Not only are they virtually fixed values, they also turned out to be
write-only variables. Removed.
+ LWLock lock; /* lightweight lock to protect body */ + int body[FLEXIBLE_ARRAY_MEMBER]; /* statistics body */ +} PgStatEnvelope;What you're doing here with 'body' doesn't provide enough guarantees
about proper alignment. E.g. if one of the entry types wants to store a
double, this won't portably work, because there's platforms that have 4
byte alignment for ints, but 8 byte alignment for doubles.Wouldn't it be better to instead embed PgStatEnvelope into the struct
that's actually stored? E.g. something likestruct PgStat_TableStatus
{
PgStatEnvelope header; /* I'd rename the type */
TimestampTz vacuum_timestamp; /* user initiated vacuum */
...
}or if you don't want to do that because it'd require declaring
PgStatEnvelope in the header (not sure that'd really be worth avoiding),
you could just get rid of the body field and just do the calculation
using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))
As a result of the modifications so far, only one member, lock, is left
in the PgStatEnvelope (or PgStatEntryBase) struct. I chose to embed it
into each PgStat_Stat*Entry struct as PgStat_StatEntryHeader.
+ * Snapshot is stats entry that is locally copied to offset stable values for a + * transaction.
...
The amount of code needed for this snapshot stuff seems unreasonable to
me, especially because I don't see why we really need it. Is this just
so that there's no skew between all the columns of pg_stat_all_tables()
etc?I think this needs a lot more comments explaining what it's trying to
achieve.
I don't insist on keeping the behavior. Removed the snapshot stuff for
pgstat only. (The beentry snapshot is left alone.)
+/* + * Newly created shared stats entries needs to be initialized before the other + * processes get access it. get_stat_entry() calls it for the purpose. + */ +typedef void (*entry_initializer) (PgStatEnvelope * env);I think we should try to not need it, instead declaring that all fields
are zero initialized. That fits well together with my suggestion to
avoid duplicating the database / object ids.
Now that entries don't have type-specific fields that need special
care, I removed that stuff altogether.
+static void +attach_shared_stats(void) +{
...
+ shared_globalStats = (PgStat_GlobalStats *) + dsa_get_address(area, StatsShmem->global_stats); + shared_archiverStats = (PgStat_ArchiverStats *) + dsa_get_address(area, StatsShmem->archiver_stats); + + shared_SLRUStats = (PgStatSharedSLRUStats *) + dsa_get_address(area, StatsShmem->slru_stats); + LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);I don't think it makes sense to use dsa allocations for any of the fixed
size stats (global_stats, archiver_stats, ...). They should just be
direct members of StatsShmem? Then we also don't need the shared_*
helper variables
I intended to reduce the amount of fixed-size shared memory, or to make
maximum use of DSA. However, you're right. Now they are members of
StatsShmem.
+ /* Load saved data if any. */
+ pgstat_read_statsfiles();Hm. Is it a good idea to do this as part of the shmem init function?
That's a lot more work than we normally do in these.+/* ---------- + * detach_shared_stats() - + * + * Detach shared stats. Write out to file if we're the last process and told + * to do so. + * ---------- */ static void -pgstat_reset_remove_files(const char *directory) +detach_shared_stats(bool write_stats)I think it'd be better to have an explicit call in the shutdown sequence
somewhere to write out the data, instead of munging detach and writing
stats out together.
It is actually strange that attach_shared_stats reads the file inside a
StatsLock section while it deliberately attaches an existing shared
memory area outside that section. So, as a first step, I moved the
calls to pgstat_read/write_statsfiles() out of the StatsLock section.
But I couldn't move pgstat_write_statsfiles() out of (before or after)
detach_shared_stats(), because I didn't find a way to reliably check
whether the exiting process is the last detacher from a function
separate from detach_shared_stats().
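For what it's worth, one hypothetical shape for such a check (the
refcount field is an assumption, not something in the patch):

/* Returns true for exactly one process: the last detacher. The caller
 * would write out the stats files before actually detaching. */
static bool
stats_shmem_release(void)
{
	bool		am_last;

	LWLockAcquire(StatsLock, LW_EXCLUSIVE);
	Assert(StatsShmem->refcount > 0);
	am_last = (--StatsShmem->refcount == 0);
	LWLockRelease(StatsLock);

	return am_last;
}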
(continued)
=====
The attached is the updated version incorporating the comments above.
I'll continue to address the rest of the comments. Only 0004 is
revised.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Fri, 25 Sep 2020 09:27:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Thanks for reviewing!
At Mon, 21 Sep 2020 19:47:04 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2020-09-08 17:55:57 +0900, Kyotaro Horiguchi wrote:
Locks on the shared statistics is acquired by the units of such like
tables, functions so the expected chance of collision are not so high.I can't really parse that...
Mmm... Is the following readable?
Shared statistics locks are acquired by units such as tables,
functions, etc., so the chances of an expected collision are not so
high.
Anyway, this is found to be wrong, so I removed it.
01: (Fixed?)
Furthermore, until 1 second has elapsed since the last flushing to
shared stats, lock failure postpones stats flushing so that lock
contention doesn't slow down transactions.I think I commented on that before, but to me 1s seems way too low to
switch to blocking lock acquisition. What's the reason for such a low
limit?
It was 0.5 seconds previously. I don't have a clear idea of a
reasonable value for it. One possible rationale might be to give each
of 1000 clients a 10 ms writing slot, which works out to 10 s
(1000 x 10 ms) as the minimum interval. I set the maximum interval to
60 s, and the retry interval to 1 s. (Fixed?)
02: (I'd appreciate it if you could suggest the appropriate one.)
/* - * Clean up any dead statistics collector entries for this DB. We always + * Clean up any dead activity statistics entries for this DB. We always * want to do this exactly once per DB-processing cycle, even if we find * nothing worth vacuuming in the database. */What is "activity statistics"?
I don't get your point. It is formally the replacement term for
"statistics collector". The "statistics collector (process)" no longer
exists, so I had to invent a name for the successor mechanism that is
distinguishable from data/column statistics. If it is not the proper
wording, I'd appreciate it if you could suggest a better one.
03: (Fixed. Replaced with far simpler cache implement.)
@@ -2816,8 +2774,8 @@ table_recheck_autovac(Oid relid, HTAB *table_toast_map,
}/* fetch the pgstat table entry */ - tabentry = get_pgstat_tabentry_relid(relid, classForm->relisshared, - shared, dbentry); + tabentry = pgstat_fetch_stat_tabentry_snapshot(classForm->relisshared, + relid);Why do all of these places deal with a snapshot? For most it seems to
make much more sense to just look up the entry and then copy that into
local memory? There may be some place that need some sort of snapshot
behaviour that's stable until commit / pgstat_clear_snapshot(). But I
can't really see many?
Ok, I reread this thread and agree that there's a (vague) consensus to
remove the snapshot stuff. Backend statistics (bestats) still remain
stable during a transaction.
If we nuked the snapshot stuff completely, pgstatfuncs.c would need
many additional pfree()s, since it calls pgstat_fetch* many times for
the same object. I chose to make the pgstat_fetch_stat_*() functions
return a result stored in static variables. It no longer works in a
transactional way as before, but it keeps the last result for a while
and is invalidated at transaction end at the latest.
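Roughly this shape (a simplified sketch using names from the earlier
patch version; the real function copies the entry under its lock):

PgStat_StatTabEntry *
pgstat_fetch_stat_tabentry(Oid relid)
{
	static PgStat_StatTabEntry cached;	/* overwritten by the next call */
	PgStatEnvelope *env;

	env = get_stat_entry(PGSTAT_TYPE_TABLE, MyDatabaseId, relid,
						 false, NULL, NULL);
	if (env == NULL)
		return NULL;

	/* copy out so the caller never sees a torn shared entry */
	cached = *(PgStat_StatTabEntry *) &env->body;
	return &cached;
}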
04: (Fixed.)
+#define PGSTAT_MIN_INTERVAL 1000 /* Minimum interval of stats data
+#define PGSTAT_MAX_INTERVAL 10000 /* Longest interval of stats data + * updates */These don't really seem to be in line with the commit message...
Oops! Sorry. Fixed both the values and the commit message (and the
file comment).
05: (The struct is gone.)
+ * dshash pgStatSharedHash + * -> PgStatHashEntry (dshash entry) + * (dsa_pointer)-> PgStatEnvelope (dsa memory block)I don't like 'Envelope' that much. If I understand you correctly that's
a common prefix that's used for all types of stat objects, correct? If
so, how about just naming it PgStatEntryBase or such? I think it'd also
be useful to indicate in the "are stored as" part that PgStatEnvelope is
just the common prefix for an allocation.The name makes sense. Thanks! (But the struct is now gone..)
06: (Fixed.)
-typedef struct TabStatHashEntry +static size_t pgstat_entsize[] =+/* Ditto for local statistics entries */ +static size_t pgstat_localentsize[] = +{ + 0, /* PGSTAT_TYPE_ALL: not an entry */ + sizeof(PgStat_StatDBEntry), /* PGSTAT_TYPE_DB */ + sizeof(PgStat_TableStatus), /* PGSTAT_TYPE_TABLE */ + sizeof(PgStat_BackendFunctionEntry) /* PGSTAT_TYPE_FUNCTION */ +};These probably should be const as well.
Right. Fixed.
07: (Fixed.)
/* - * Backends store per-function info that's waiting to be sent to the collector - * in this hash table (indexed by function OID). + * Stats numbers that are waiting for flushing out to shared stats are held in + * pgStatLocalHash, */ -static HTAB *pgStatFunctions = NULL; +typedef struct PgStatHashEntry +{ + PgStatHashEntryKey key; /* hash key */ + dsa_pointer env; /* pointer to shared stats envelope in DSA */ +} PgStatHashEntry; + +/* struct for shared statistics entry pointed from shared hash entry. */ +typedef struct PgStatEnvelope +{ + PgStatTypes type; /* statistics entry type */ + Oid databaseid; /* databaseid */ + Oid objectid; /* objectid */Do we need this information both here and in PgStatHashEntry? It's
possible that it's worthwhile, but I am not sure it is.
The same key values were stored in PgStatEnvelope,
PgStat(Local)HashEntry, and PgStat_Stats*Entry, and I had the same
thought while developing. After some thought, I managed to remove the
duplicated values everywhere except PgStat(Local)HashEntry. Fixed.
08: (Fixed.)
+ size_t len; /* length of body, fixed per type. */
Why do we need this? Isn't that something that can easily be looked up
using the type?
Not only are they virtually fixed values, they also turned out to be
write-only variables. Removed.
09: (Fixed. "Envelope" is embeded in stats entry structs.)
+ LWLock lock; /* lightweight lock to protect body */ + int body[FLEXIBLE_ARRAY_MEMBER]; /* statistics body */ +} PgStatEnvelope;What you're doing here with 'body' doesn't provide enough guarantees
about proper alignment. E.g. if one of the entry types wants to store a
double, this won't portably work, because there's platforms that have 4
byte alignment for ints, but 8 byte alignment for doubles.Wouldn't it be better to instead embed PgStatEnvelope into the struct
that's actually stored? E.g. something likestruct PgStat_TableStatus
{
PgStatEnvelope header; /* I'd rename the type */
TimestampTz vacuum_timestamp; /* user initiated vacuum */
...
}or if you don't want to do that because it'd require declaring
PgStatEnvelope in the header (not sure that'd really be worth avoiding),
you could just get rid of the body field and just do the calculation
using something like MAXALIGN((char *) envelope + sizeof(PgStatEnvelope))
As a result of the modifications so far, only one member, lock, is left
in the PgStatEnvelope (or PgStatEntryBase) struct. I chose to embed it
into each PgStat_Stat*Entry struct as PgStat_StatEntryHeader.
10: (Fixed. Same as #03)
+ * Snapshot is stats entry that is locally copied to offset stable values for a + * transaction....
The amount of code needed for this snapshot stuff seems unreasonable to
me, especially because I don't see why we really need it. Is this just
so that there's no skew between all the columns of pg_stat_all_tables()
etc?I think this needs a lot more comments explaining what it's trying to
achieve.
I don't insist on keeping the behavior. Removed the snapshot stuff for
pgstat only. (The beentry snapshot is left alone.)
11: (Fixed. Per-entry-type initialize is gone.)
+/* + * Newly created shared stats entries needs to be initialized before the other + * processes get access it. get_stat_entry() calls it for the purpose. + */ +typedef void (*entry_initializer) (PgStatEnvelope * env);I think we should try to not need it, instead declaring that all fields
are zero initialized. That fits well together with my suggestion to
avoid duplicating the database / object ids.
Now that entries don't have type-specific fields that need special
care, I removed that stuff altogether.
12: (Fixed. Global stats memories are merged.)
+static void +attach_shared_stats(void) +{...
+ shared_globalStats = (PgStat_GlobalStats *) + dsa_get_address(area, StatsShmem->global_stats); + shared_archiverStats = (PgStat_ArchiverStats *) + dsa_get_address(area, StatsShmem->archiver_stats); + + shared_SLRUStats = (PgStatSharedSLRUStats *) + dsa_get_address(area, StatsShmem->slru_stats); + LWLockInitialize(&shared_SLRUStats->lock, LWTRANCHE_STATS);I don't think it makes sense to use dsa allocations for any of the fixed
size stats (global_stats, archiver_stats, ...). They should just be
direct members of StatsShmem? Then we also don't need the shared_*
helper variables
I intended to reduce the amount of fixed-size shared memory, or to make
maximum use of DSA. However, you're right. Now they are members of
StatsShmem.
13: (I couldn't address this fully..)
+ /* Load saved data if any. */
+	/* Load saved data if any. */
+	pgstat_read_statsfiles();

Hm. Is it a good idea to do this as part of the shmem init function?
That's a lot more work than we normally do in these.

+/* ----------
+ * detach_shared_stats() -
+ *
+ * Detach shared stats. Write out to file if we're the last process and told
+ * to do so.
+ * ----------
+ */
 static void
-pgstat_reset_remove_files(const char *directory)
+detach_shared_stats(bool write_stats)

I think it'd be better to have an explicit call in the shutdown sequence
somewhere to write out the data, instead of munging detach and writing
stats out together.

It is actually strange that attach_shared_stats() reads the file in a
StatsLock section while it attaches the existing shared memory area
deliberately outside the same lock section. So as a first step I moved
the calls to pgstat_read/write_statsfiles() out of the StatsLock
section. But I couldn't move pgstat_write_statsfiles() out of (or
before or after) detach_shared_stats(), because I didn't find a way to
reliably check, from a function separate from detach_shared_stats(),
whether the exiting process is the last detacher.

(continued)
=====
14: (I believe it is addressed.)
+	if (nowait)
+	{
+		/*
+		 * Don't flush stats too frequently. Return the time to the next
+		 * flush.
+		 */

I think it's confusing to use nowait in the if when you actually mean
!force.

Agreed. I was hovering between passing !force to the "nowait" parameter
of flush_tabstat() and using the relabeled variable nowait. I chose to
use nowait in the attached.
15: (Not addressed.)
-	for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
+	if (pgStatLocalHash)
 	{
-		for (i = 0; i < tsa->tsa_used; i++)
+		/* Step 1: flush out other than database stats */
...
+			case PGSTAT_TYPE_DB:
+				if (ndbentries >= dbentlistlen)
+				{
+					dbentlistlen *= 2;
+					dbentlist = repalloc(dbentlist,
+										 sizeof(PgStatLocalHashEntry *) *
+										 dbentlistlen);
+				}
+				dbentlist[ndbentries++] = lent;
+				break;

Why do we need this special behaviour for database statistics?
Some of the table stats numbers are also counted as database stats
numbers. That addition currently happens at stats-sending time (in
pgstat_recv_tabstat()), and this follows that design, as sketched
below. If we added such table stats numbers to database stats before
flushing out the table stats, we would need to remember whether those
numbers had already been added to the database stats or not.
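The flush order that follows from this looks roughly like the sketch
below; the dynahash-style loop shape and the flush_entry() helper are
assumptions for illustration, not the patch's exact code:

	HASH_SEQ_STATUS status;
	PgStatLocalHashEntry *ent;

	/* Step 1: flush everything except database entries. Flushing a
	 * table or function entry also accumulates its per-database
	 * portion into the corresponding local DB entry. */
	hash_seq_init(&status, pgStatLocalHash);
	while ((ent = hash_seq_search(&status)) != NULL)
	{
		if (ent->key.type == PGSTAT_TYPE_DB)
			continue;			/* deferred to step 2 */
		flush_entry(ent, nowait);	/* hypothetical helper */
	}

	/* Step 2: the local DB entries now hold complete aggregates and
	 * can be flushed without double counting. */
	hash_seq_init(&status, pgStatLocalHash);
	while ((ent = hash_seq_search(&status)) != NULL)
	{
		if (ent->key.type == PGSTAT_TYPE_DB)
			flush_entry(ent, nowait);
	}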
16: (Fixed. Used List.)
If we need it, it'd be better to just use List here rather than open
coding a replacement (List these days basically has the same complexity
as what you do here).
Agreed. (I noticed that lappend is faster than lcons now.) Fixed.
17: (Fixed. case-default is removed, and PGSTAT_TYPE_ALL is removed by #28)
+			case PGSTAT_TYPE_TABLE:
+				if (flush_tabstat(lent->env, nowait))
+					remove = true;
+				break;
+			case PGSTAT_TYPE_FUNCTION:
+				if (flush_funcstat(lent->env, nowait))
+					remove = true;
+				break;
+			default:
+				Assert(false);

Adding a default here prevents the compiler from issuing a warning when
new types of stats are added...
Agreed. Another instance of switch on the same enum doesn't have
default:. (Fixed.)
18: (Fixed.)
+	/* Remove the successfully flushed entry */
+	pfree(lent->env);

Probably worth zeroing the pointer here, to make debugging a little
easier.

Agreed. I did the same to another instance that frees a memory chunk
pointed to by a non-block-local pointer, as sketched below.
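For instance, a sketch of the resulting hunk (not the exact patch):

+	/* Remove the successfully flushed entry */
+	pfree(lent->env);
+	lent->env = NULL;		/* poison the dangling pointer for debugging */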
19: (Fixed. The LWLock is replaced with an atomic update.)
+	/* Publish the last flush time */
+	LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+	if (shared_globalStats->stats_timestamp < now)
+		shared_globalStats->stats_timestamp = now;
+	LWLockRelease(StatsLock);

Ugh, that seems like a fairly unnecessary global lock acquisition. What
do we need this timestamp for? Not clear to me that it's still
needed. If it is needed, it'd probably worth making this an atomic and
doing a compare-exchange loop instead.
The value is exposed via a system view. I used pg_atomic but I didn't
find a clean way to store TimestampTz into pg_atomic_u64 (see the
sketch below).
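For reference, the compare-exchange idea could look like the following
self-contained illustration using C11 atomics; pg_atomic_compare_exchange_u64
would play the role of atomic_compare_exchange_weak here, and treating
TimestampTz as its int64 microsecond representation is my assumption:

#include <stdatomic.h>
#include <stdint.h>

static _Atomic int64_t stats_timestamp;	/* stand-in for the shared field */

static void
publish_flush_time(int64_t now)
{
	int64_t		old = atomic_load(&stats_timestamp);

	/* Only move the timestamp forward; retry if another backend races.
	 * On CAS failure 'old' is reloaded with the current value, so the
	 * loop also exits once someone else published a newer time. */
	while (old < now &&
		   !atomic_compare_exchange_weak(&stats_timestamp, &old, now))
		;
}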
20: (Wrote a comment to explain the reason.)
 /*
- * Send partial messages. Make sure that any pending xact commit/abort
- * gets counted, even if there are no table stats to send.
+ * If we have pending local stats, let the caller know the retry interval.
  */
-	if (regular_msg.m_nentries > 0 ||
-		pgStatXactCommit > 0 || pgStatXactRollback > 0)
-		pgstat_send_tabstat(&regular_msg);
-	if (shared_msg.m_nentries > 0)
-		pgstat_send_tabstat(&shared_msg);
+	if (HAVE_ANY_PENDING_STATS())

I think this needs a comment explaining why we still may have pending
stats.
Does the following work?
| * Some of the local stats may have not been flushed due to lock
| * contention. If we have such pending local stats here, let the caller
| * know the retry interval.
21: (Fixed. Local cache of shared stats entry is added.)
+ * flush_tabstat - flush out a local table stats entry
+ *
...
Could we cache the address of the shared entry in the local entry for a
while? It seems we have a bunch of contention (that I think you're
trying to address in a prototype patch posted since) just because we
will over and over look up the same address in the shared hash table.

If we instead kept the local hashtable alive for longer and stored a
pointer to the shared entry in it, we could make this a lot
cheaper. There would be some somewhat nasty edge cases probably. Imagine
a table being dropped for which another backend still has pending
stats. But that could e.g. be addressed with a refcount.
Yeah, I noticed that and did it in the previous version (with a silly
bug..). The cache is based on simplehash. In the previous version, all
the entries were dropped after a vacuum removed at least one shared
stats entry. However, this version uses a refcount and drops only the
entries that actually need to be dropped.
22: (vacuum/analyze immediately writes to shared stats according to #34)
+	/* retrieve the shared table stats entry from the envelope */
+	shtabstats = (PgStat_StatTabEntry *) &shenv->body;
+
+	/* lock the shared entry to protect the content, skip if failed */
+	if (!nowait)
+		LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+		return false;
+
+	/* add the values to the shared entry. */
+	shtabstats->numscans += lstats->t_counts.t_numscans;
+	shtabstats->tuples_returned += lstats->t_counts.t_tuples_returned;
+	shtabstats->tuples_fetched += lstats->t_counts.t_tuples_fetched;
+	shtabstats->tuples_inserted += lstats->t_counts.t_tuples_inserted;
+	shtabstats->tuples_updated += lstats->t_counts.t_tuples_updated;
+	shtabstats->tuples_deleted += lstats->t_counts.t_tuples_deleted;
+	shtabstats->tuples_hot_updated += lstats->t_counts.t_tuples_hot_updated;
+
+	/*
+	 * If table was truncated or vacuum/analyze has ran, first reset the
+	 * live/dead counters.
+	 */
+	if (lstats->t_counts.t_truncated ||
+		lstats->t_counts.vacuum_count > 0 ||
+		lstats->t_counts.analyze_count > 0 ||
+		lstats->t_counts.autovac_vacuum_count > 0 ||
+		lstats->t_counts.autovac_analyze_count > 0)
+	{
+		shtabstats->n_live_tuples = 0;
+		shtabstats->n_dead_tuples = 0;
+	}
+
+	/* clear the change counter if requested */
+	if (lstats->t_counts.reset_changed_tuples)
+		shtabstats->changes_since_analyze = 0;

I know this is largely old code, but it's not obvious to me that there's
no race conditions here / that the race condition didn't get worse. What
prevents other backends to since have done a lot of inserts into this
table? Especially in case the flushes were delayed due to lock
contention.
# I noticed that I carelessly dropped the inserts_since_vacuum code.

Well, if a vacuum report is delayed until after a massive insert
commits, the massive insert would be omitted. It seems to me that your
suggestion in #34 below gets the point.
+	/*
+	 * Update vacuum/analyze timestamp and counters, so that the values won't
+	 * goes back.
+	 */
+	if (shtabstats->vacuum_timestamp < lstats->vacuum_timestamp)
+		shtabstats->vacuum_timestamp = lstats->vacuum_timestamp;

It seems to me that if these branches are indeed necessary branches,
my concerns above are well founded...
I'm not sure whether it is simply a talisman against evil or based on
actual trouble, but I don't believe it's possible for a vacuum to end
after a vacuum that started later has ended...
23: (ids are no longer stored in duplicate.)
+init_tabentry(PgStatEnvelope * env)
 {
-	int			n;
-	int			len;
+	PgStat_StatTabEntry *tabent = (PgStat_StatTabEntry *) &env->body;
+
+	/*
+	 * If it's a new table entry, initialize counters to the values we just
+	 * got.
+	 */
+	Assert(env->type == PGSTAT_TYPE_TABLE);
+	tabent->tableid = env->objectid;

It seems over the top to me to have the object id stored in yet another
place. It's now in the hash entry, in the envelope, and the type
specific part.
Agreed, and fixed. (See #11 above)
24: (Fixed. Don't check for all-zero of a function stats entry at flush.)
+/*
+ * flush_funcstat - flush out a local function stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
+ * this function always returns true.
+ *
+ * Returns true if the entry is successfully flushed out.
+ */
+static bool
+flush_funcstat(PgStatEnvelope * env, bool nowait)
+{
+	/* we assume this inits to all zeroes: */
+	static const PgStat_FunctionCounts all_zeroes;
+	PgStat_BackendFunctionEntry *localent;	/* local stats entry */
+	PgStatEnvelope *shenv;		/* shared stats envelope */
+	PgStat_StatFuncEntry *sharedent = NULL; /* shared stats entry */
+	bool		found;
+
+	Assert(env->type == PGSTAT_TYPE_FUNCTION);
+	localent = (PgStat_BackendFunctionEntry *) &env->body;
+
+	/* Skip it if no counts accumulated for it so far */
+	if (memcmp(&localent->f_counts, &all_zeroes,
+			   sizeof(PgStat_FunctionCounts)) == 0)
+		return true;

Why would we have an entry in this case?
Right. A function entry was zeroed out in master but the entry is not
created in that case with this patch. Removed it. (Fixed)
25: (Perhaps fixed. I'm not confident, though.)
+	/* find shared table stats entry corresponding to the local entry */
+	shenv = get_stat_entry(PGSTAT_TYPE_FUNCTION, MyDatabaseId, localent->f_id,
+						   nowait, init_funcentry, &found);
+	/* skip if dshash failed to acquire lock */
+	if (shenv == NULL)
+		return false;			/* failed to acquire lock, skip */
+
+	/* retrieve the shared table stats entry from the envelope */
+	sharedent = (PgStat_StatFuncEntry *) &shenv->body;
+
+	/* lock the shared entry to protect the content, skip if failed */
+	if (!nowait)
+		LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+		return false;			/* failed to acquire lock, skip */

It doesn't seem great that we have a separate copy of all of this logic
again. It seems to me that most of the code here is (or should be)
exactly the same as in the table case. I think only the below should be
in here, rather than in common code.
I failed to get the last phrase, but I guess you suggested that I
should factor out the common code.
+/*
+ * flush_dbstat - flush out a local database stats entry
+ *
+ * If nowait is true, this function returns false on lock failure. Otherwise
...
+	/* lock the shared entry to protect the content, skip if failed */
+	if (!nowait)
+		LWLockAcquire(&shenv->lock, LW_EXCLUSIVE);
+	else if (!LWLockConditionalAcquire(&shenv->lock, LW_EXCLUSIVE))
+		return false;

Ditto re duplicating all of this.
26: (Fixed. Now all stats are saved in one file.)
+/*
+ * Create the filename for a DB stat file; filename is output parameter points
+ * to a character buffer of length len.
+ */
+static void
+get_dbstat_filename(bool tempname, Oid databaseid, char *filename, int len)
+{
+	int			printed;
+
+	/* NB -- pgstat_reset_remove_files knows about the pattern this uses */
+	printed = snprintf(filename, len, "%s/db_%u.%s",
+					   PGSTAT_STAT_PERMANENT_DIRECTORY,
+					   databaseid,
+					   tempname ? "tmp" : "stat");
+	if (printed >= len)
+		elog(ERROR, "overlength pgstat path");
 }

Do we really want database specific storage after all of these changes?
Seems like there's no point anymore?

Sounds reasonable. Since we no longer keep the old per-database file
format, pgstat_read/write_statsfiles() gets far simpler. (Fixed)
27: (Fixed. Added CFI to the same kind of loops.)
+	dshash_seq_init(&hstat, pgStatSharedHash, false);
+	while ((p = dshash_seq_next(&hstat)) != NULL)
 	{
-		Oid			tabid = tabentry->tableid;
-
-		CHECK_FOR_INTERRUPTS();
-

Given that this could take a while on a database with a lot of objects
it might be worth keeping the CHECK_FOR_INTERRUPTS().

Agreed. It seems like a mistake. (Fixed pgstat_read/write_statsfile().)
28: (Fixed. collect_stat_entries is removed along with PGSTAT_TYPE_ALL.)
 /* ----------
- * pgstat_vacuum_stat() -
+ * collect_stat_entries() -
  *
- * Will tell the collector about objects he can get rid of.
+ * Collect the shared statistics entries specified by type and dbid. Returns a
+ * list of pointer to shared statistics in palloc'ed memory. If type is
+ * PGSTAT_TYPE_ALL, all types of statistics of the database is collected. If
+ * type is PGSTAT_TYPE_DB, the parameter dbid is ignored and collect all
+ * PGSTAT_TYPE_DB entries.
  * ----------
  */
-void
-pgstat_vacuum_stat(void)
+static PgStatEnvelope * *collect_stat_entries(PgStatTypes type, Oid dbid)
 {

-		if (hash_search(htab, (void *) &tabid, HASH_FIND, NULL) != NULL)
+		if ((type != PGSTAT_TYPE_ALL && p->key.type != type) ||
+			(type != PGSTAT_TYPE_DB && p->key.databaseid != dbid))
 			continue;

I don't like this interface much. Particularly not that it requires
adding a PGSTAT_TYPE_ALL that's otherwise not needed. And the thing
where PGSTAT_TYPE_DB doesn't actually work as one would expect isn't
nice either.

Sounds reasonable. It was annoying that dbid=InvalidOid is a valid
value for this interface. But now the function is called from only two
places, and it is simpler to use a dshash seqscan directly. The
function and the enum item PGSTAT_TYPE_ALL are gone. (Fixed)
29: (Fixed. collect_stat_entries is gone.)
+		if (n >= listlen - 1)
+			listlen *= 2;
+		envlist = repalloc(envlist, listlen * sizeof(PgStatEnvelope * *));
+		envlist[n++] = dsa_get_address(area, p->env);
 	}

I'd use List here as well.

So the function no longer exists. (Fixed)
30: (...)
+ dshash_seq_term(&hstat);
Hm, I didn't immediately see which locking makes this safe? Is it just
that nobody should be attached at this point?
I'm not sure I get your point, but I'll try to elaborate.
All the callers of collect_stat_entries have been replaced with a bare
loop of dshash_seq_next.
There are two levels of lock here. One is the dshash partition lock,
which is needed to continue an in-partition scan safely. The other is
the lock on the stats entry body that is pointed to from a dshash
entry.
---
((PgStatHashEntry) shent).body -(dsa_get_address)-+-> PgStat_StatEntryHeader
                                                  |
((PgStatLocalHashEntry) lent).body ---------------^
---
Dshash scans are used for dropping and resetting stats entries. Entry
dropping is performed in the following steps (see the sketch after this
list).

(delete_current_stats_entry())

- Drop the dshash entry (needs an exclusive lock on the dshash
  partition).
- If the refcount of the stats entry body is already zero, free the
  memory immediately.
- If not, set the "dropped" flag of the body. No lock is required
  because the "dropped" flag won't even be referred to by other
  backends until the next step is done.
- Increment the deletion count of the shared hash. (This is used as the
  "age" of the local pointer cache hash (pgstat_cache).)

(get_stat_entry())

- If the dshash deletion count differs from the local cache age, scan
  over the local cache hash to find "dropped" entries.
- Decrement the refcount of each dropped entry and free the shared
  entry if it is no longer referenced. Apparently no lock is required.
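A sketch of the two functions implied by these steps; the names,
fields, and the hypothetical free_shared_entry() helper are drawn from
the description above rather than from the patch:

#include <stdbool.h>
#include <stdint.h>

typedef struct PgStat_StatEntryBody
{
	uint32_t	refcount;		/* local caches referencing this entry */
	bool		dropped;		/* dshash entry already removed */
	/* ... statistics counters ... */
} PgStat_StatEntryBody;

static void free_shared_entry(PgStat_StatEntryBody *body); /* hypothetical */

/* Caller holds the dshash partition lock exclusively, so nobody can
 * newly reach this body through the shared hash. */
static void
delete_current_stats_entry(PgStat_StatEntryBody *body)
{
	if (body->refcount == 0)
		free_shared_entry(body);	/* nobody references it; free now */
	else
		body->dropped = true;		/* last unreferencing backend frees it */
}

/* Called while sweeping the local cache after the shared deletion
 * count moved past the local cache "age". */
static void
release_dropped_entry(PgStat_StatEntryBody *body)
{
	if (--body->refcount == 0 && body->dropped)
		free_shared_entry(body);
}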
pgstat_drop_database() and pgstat_vacuum_stat() can run concurrently
with other backends, so the locks above are required there.
pgstat_write_statsfile() is guaranteed to run alone, so it doesn't
matter whether it takes the locks or not.

pgstat_reset_counters() doesn't drop or modify dshash entries, so its
dshash scan requires only a shared lock. The stats entry body is
updated, so that needs an exclusive lock.
31: (Fixed. Use List instead of the open coding.)
+void
+pgstat_vacuum_stat(void)
+{
...
+	/* collect victims from shared stats */
+	arraylen = 16;
+	victims = palloc(sizeof(PgStatEnvelope * *) * arraylen);
+	nvictims = 0;

Same List comment as before.
The function uses a list now. (Fixed)
32: (Fixed.)
 void
 pgstat_reset_counters(void)
 {
-	PgStat_MsgResetcounter msg;
+	PgStatEnvelope **envlist;
+	PgStatEnvelope **p;

-	if (pgStatSock == PGINVALID_SOCKET)
-		return;
+	/* Lookup the entries of the current database in the stats hash. */
+	envlist = collect_stat_entries(PGSTAT_TYPE_ALL, MyDatabaseId);
+	for (p = envlist; *p != NULL; p++)
+	{
+		PgStatEnvelope *env = *p;
+		PgStat_StatDBEntry *dbstat;

-	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_RESETCOUNTER);
-	msg.m_databaseid = MyDatabaseId;
-	pgstat_send(&msg, sizeof(msg));
+		LWLockAcquire(&env->lock, LW_EXCLUSIVE);
+

What locking prevents this entry from being freed between the
collect_stat_entries() and this LWLockAcquire?
Mmm. They're not protected. The attached version no longer uses the
intermediate list, and the fetched dshash entry is protected by the
dshash partition lock. (Fixed)
33: (Will keep the current code.)
 /* ----------
@@ -1440,48 +1684,63 @@ pgstat_reset_slru_counter(const char *name)
 void
 pgstat_report_autovac(Oid dboid)
 {
-	PgStat_MsgAutovacStart msg;
+	PgStat_StatDBEntry *dbentry;
+	TimestampTz ts;

-	if (pgStatSock == PGINVALID_SOCKET)
+	/* return if activity stats is not active */
+	if (!area)
 		return;

-	pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_AUTOVAC_START);
-	msg.m_databaseid = dboid;
-	msg.m_start_time = GetCurrentTimestamp();
+	ts = GetCurrentTimestamp();

-	pgstat_send(&msg, sizeof(msg));
+	/*
+	 * Store the last autovacuum time in the database's hash table entry.
+	 */
+	dbentry = get_local_dbstat_entry(dboid);
+	dbentry->last_autovac_time = ts;
 }

Why did you introduce the local ts variable here?

The function used to assign the timestamp within an LWLock section. In
the last version it wrote to the local entry, so the lock was useless,
but the amendment following comment #34 just below introduces LWLocks
again.
34: (Fixed. Vacuum/analyze write shared stats instantly.)
 /* --------
  * pgstat_report_analyze() -
  *
- *	Tell the collector about the table we just analyzed.
+ *	Report about the table we just analyzed.
  *
  * Caller must provide new live- and dead-tuples estimates, as well as a
  * flag indicating whether to reset the changes_since_analyze counter.
@@ -1492,9 +1751,10 @@ pgstat_report_analyze(Relation rel,
 					  PgStat_Counter livetuples, PgStat_Counter deadtuples,
 					  bool resetcounter)
 {
 }

It seems to me that the analyze / vacuum cases would be much better
dealt with by synchronously operating on the shared entry, instead of
going through the local hash table. ISTM that that'd make it a lot
easier to avoid most of the ordering issues.

Sounds reasonable; blocking at the beginning and end of such operations
doesn't matter. Agreed, and that avoids at least the case of the
delayed vacuum report (#22).
35: (Fixed, needing a change of how relcache uses local stats.)
+static PgStat_TableStatus *
+get_local_tabstat_entry(Oid rel_id, bool isshared)
+{
+	PgStatEnvelope *env;
+	PgStat_TableStatus *tabentry;
+	bool		found;

-	/*
-	 * Now we can fill the entry in pgStatTabHash.
-	 */
-	hash_entry->tsa_entry = entry;
+	env = get_local_stat_entry(PGSTAT_TYPE_TABLE,
+							   isshared ? InvalidOid : MyDatabaseId,
+							   rel_id, true, &found);

-	return entry;
+	tabentry = (PgStat_TableStatus *) &env->body;
+
+	if (!found)
+	{
+		tabentry->t_id = rel_id;
+		tabentry->t_shared = isshared;
+		tabentry->trans = NULL;
+		MemSet(&tabentry->t_counts, 0, sizeof(PgStat_TableCounts));
+		tabentry->vacuum_timestamp = 0;
+		tabentry->autovac_vacuum_timestamp = 0;
+		tabentry->analyze_timestamp = 0;
+		tabentry->autovac_analyze_timestamp = 0;
+	}
+

As with shared entries, I think this should just be zero initialized
(and we should try to get rid of the duplication of t_id/t_shared).

Ah! Yeah, they are removable since we already converted them into the
key of the hash entry. Removed the OIDs and the initialization code
from all types of local stats entries.

One annoyance in doing that was pgstat_initstats, which assumes that
the pgstat_info linked from the relation won't be freed. Finally I
tightened up the management of the pgstat_info link. The link between
the relcache and the table stats entry is now bidirectional and is
explicitly de-linked by a new function pgstat_delinkstats(), as
sketched below.
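A sketch of the de-linking, where the field names are assumptions based
on this description (each side holds a pointer to the other, and both
are cleared together):

/* Break the mutual link between a relation and its stats entry;
 * the field names here are assumed, not the patch's exact code. */
static void
pgstat_delinkstats(Relation rel)
{
	if (rel->pgstat_info != NULL)
	{
		rel->pgstat_info->relation = NULL;	/* stats -> relcache side */
		rel->pgstat_info = NULL;			/* relcache -> stats side */
	}
}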
36: (Perhaps fixed. I'm not confident, though.)
+ return tabentry;
 }

+
 /*
  * find_tabstat_entry - find any existing PgStat_TableStatus entry for rel
  *
- * If no entry, return NULL, don't create a new one
+ * Find any existing PgStat_TableStatus entry for rel from the current
+ * database then from shared tables.

What do you mean with "from the current database then from shared
tables"?
It is rewritten as follows; is this readable?
| * Find any existing PgStat_TableStatus entry for rel_id in the current
| * database. If not found, try finding from shared tables.
37: (Maybe fixed.)
 void
-pgstat_send_archiver(const char *xlog, bool failed)
+pgstat_report_archiver(const char *xlog, bool failed)
 {
..
+	if (failed)
+	{
+		/* Failed archival attempt */
+		LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+		++shared_archiverStats->failed_count;
+		memcpy(shared_archiverStats->last_failed_wal, xlog,
+			   sizeof(shared_archiverStats->last_failed_wal));
+		shared_archiverStats->last_failed_timestamp = now;
+		LWLockRelease(StatsLock);
+	}
+	else
+	{
+		/* Successful archival operation */
+		LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+		++shared_archiverStats->archived_count;
+		memcpy(shared_archiverStats->last_archived_wal, xlog,
+			   sizeof(shared_archiverStats->last_archived_wal));
+		shared_archiverStats->last_archived_timestamp = now;
+		LWLockRelease(StatsLock);
+	}
 }

Huh, why is this duplicating near equivalent code?
To avoid branches within a lock section, and because it was simply
expanded from the master code. The counters can be reset by backends,
so I couldn't change it to use the changecount protocol. It therefore
still uses an LWLock, but the common code is factored out in the
attached version, along the lines of the sketch below.
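A sketch of what the factored-out version could look like, assuming the
field names from the quoted code; the branch outside the lock selects
the fields to update, so the locked section itself is shared and
branch-free:

static void
pgstat_report_archiver(const char *xlog, bool failed)
{
	TimestampTz now = GetCurrentTimestamp();
	/* Pick the failed or successful triple of fields outside the lock. */
	PgStat_Counter *counter = failed
		? &shared_archiverStats->failed_count
		: &shared_archiverStats->archived_count;
	char	   *lastwal = failed
		? shared_archiverStats->last_failed_wal
		: shared_archiverStats->last_archived_wal;
	TimestampTz *lastts = failed
		? &shared_archiverStats->last_failed_timestamp
		: &shared_archiverStats->last_archived_timestamp;

	LWLockAcquire(StatsLock, LW_EXCLUSIVE);
	++(*counter);
	/* both WAL-name buffers are assumed to have the same size */
	memcpy(lastwal, xlog, sizeof(shared_archiverStats->last_failed_wal));
	*lastts = now;
	LWLockRelease(StatsLock);
}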
In connection with this, while I was looking at the bgwriter and
checkpointer to see if their statistics could be split, I found the
following comment in checkpointer.c.
| * Send off activity statistics to the activity stats facility. (The
| * reason why we re-use bgwriter-related code for this is that the
| * bgwriter and checkpointer used to be just one process. It's
| * probably not worth the trouble to split the stats support into two
| * independent stats message types.)
So I split the two to try to get rid of the LWLock for the global
stats, but counter resets prevented me from doing that. In the attached
version I left the split in place, since it was already done.
38: (Haven't addressed.)
 /* ----------
  * pgstat_write_statsfiles() -
- *		Write the global statistics file, as well as requested DB files.
- *
- *	'permanent' specifies writing to the permanent files not temporary ones.
- *	When true (happens only when the collector is shutting down), also remove
- *	the temporary files so that backends starting up under a new postmaster
- *	can't read old data before the new collector is ready.
- *
- *	When 'allDbs' is false, only the requested databases (listed in
- *	pending_write_requests) will be written; otherwise, all databases
- *	will be written.
+ *		Write the global statistics file, as well as DB files.
  * ----------
  */
-static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+void
+pgstat_write_statsfiles(void)
 {

What's the locking around this?

No locking is used there. The calling process is (currently) guaranteed
to be the only one accessing the files. Added a comment and an
assertion. I did the same to pgstat_read_statsfile().
39: (Fixed.)
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_database_stats(PgStat_StatDBEntry *dbentry)
 {
-	HASH_SEQ_STATUS tstat;
-	HASH_SEQ_STATUS fstat;
-	PgStat_StatTabEntry *tabentry;
-	PgStat_StatFuncEntry *funcentry;
+	PgStatEnvelope **envlist;
+	PgStatEnvelope **penv;
 	FILE	   *fpout;
 	int32		format_id;
 	Oid			dbid = dbentry->databaseid;
@@ -5048,8 +4974,8 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
 	char		tmpfile[MAXPGPATH];
 	char		statfile[MAXPGPATH];

-	get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-	get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+	get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+	get_dbstat_filename(false, dbid, statfile, MAXPGPATH);

 	elog(DEBUG2, "writing stats file \"%s\"", statfile);

@@ -5076,24 +5002,31 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
 	/*
 	 * Walk through the database's access stats per table.
 	 */
-	hash_seq_init(&tstat, dbentry->tables);
-	while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+	envlist = collect_stat_entries(PGSTAT_TYPE_TABLE, dbentry->databaseid);
+	for (penv = envlist; *penv != NULL; penv++)

In several of these collect_stat_entries() callers it really bothers me
that we basically allocate an array as large as the number of objects
in the database (That's fine for databases, but for tables...). Without
much need as far as I can see.
collect_stat_entries() is removed (#28) and the callers now handle
entries directly in the dshash_seq_next loop.
40: (Fixed.)
 	{
+		PgStat_StatTabEntry *tabentry = (PgStat_StatTabEntry *) &(*penv)->body;
+
 		fputc('T', fpout);
 		rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
 		(void) rc;				/* we'll check for error with ferror */
 	}
+	pfree(envlist);

 	/*
 	 * Walk through the database's function stats table.
 	 */
-	hash_seq_init(&fstat, dbentry->functions);
-	while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+	envlist = collect_stat_entries(PGSTAT_TYPE_FUNCTION, dbentry->databaseid);
+	for (penv = envlist; *penv != NULL; penv++)
 	{
+		PgStat_StatFuncEntry *funcentry =
+			(PgStat_StatFuncEntry *) &(*penv)->body;
+
 		fputc('F', fpout);
 		rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
 		(void) rc;				/* we'll check for error with ferror */
 	}
+	pfree(envlist);

Why do we need separate loops for every type of object here?
Just to keep the old file format. But we decided to change it (#26),
and the file is now a jumble of all kinds of stats entries.
pgstat_write/read_statsfile() became far simpler.
41: (Fixed.)
+/* ----------
+ * create_missing_dbentries() -
+ *
+ * There may be the case where database entry is missing for the database
+ * where object stats are recorded. This function creates such missing
+ * dbentries so that so that all stats entries can be written out to files.
+ * ----------
+ */
+static void
+create_missing_dbentries(void)
+{

In which situation is this necessary?
It is because the old file format required those entries. It is no
longer needed and was removed in #26.
42: (Sorry, but I didn't get your point..)
+static PgStatEnvelope *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid,
+			   bool nowait, entry_initializer initfunc, bool *found)
+{
+	bool		create = (initfunc != NULL);
+	PgStatHashEntry *shent;
+	PgStatEnvelope *shenv = NULL;
+	PgStatHashEntryKey key;
+	bool		myfound;
+
+	Assert(type != PGSTAT_TYPE_ALL);
+
+	key.type = type;
+	key.databaseid = dbid;
+	key.objectid = objid;
+	shent = dshash_find_extended(pgStatSharedHash, &key,
+								 create, nowait, create, &myfound);
+	if (shent)
 	{
-		get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
+		if (create && !myfound)
+		{
+			/* Create new stats envelope. */
+			size_t		envsize = PgStatEnvelopeSize(pgstat_entsize[type]);
+			dsa_pointer chunk = dsa_allocate0(area, envsize);
+
+			/*
+			 * The lock on dshsh is released just after. Call initializer
+			 * callback before it is exposed to other process.
+			 */
+			if (initfunc)
+				initfunc(shenv);
+
+			/* Link the new entry from the hash entry. */
+			shent->env = chunk;
+		}
+		else
+			shenv = dsa_get_address(area, shent->env);
+
+		dshash_release_lock(pgStatSharedHash, shent);

Doesn't this mean that by this time the entry could already have been
removed by a concurrent backend, and the dsa allocation freed?
Does "by this time" mean before the dshash_find_extended, or after it
and until dshash_release_lock?
We can create an entry for a just droppted object but it should be
removed again by the next vacuum.
The newly created entry (or its partition) is exclusively locked so no
concurrent backend does not find it until the dshash_release_lock.
The shenv could be removed until the caller accesses it. But since the
function is requested for an existing object, that cannot be removed
until the first vacuum after the transaction end. I added a comment
just before the dshash_release_lock in get_stat_entry().
43: (Fixed. But has a side effect.)
Subject: [PATCH v36 7/7] Remove the GUC stats_temp_directory
The GUC used to specify the directory to store temporary statistics
files. It is no longer needed by the stats collector but still used by
the programs in bin and contrib, and maybe other extensions. Thus this
patch removes the GUC but some backing variables and macro definitions
are left alone for backward compatibility.

I don't see what this achieves? Which use of those variables / macros
would be safe? I think it'd be better to just remove them.
pg_stat_statements used the PG_STAT_TMP directory to store a temporary
file. I just replaced it with PGSTAT_STAT_PERMANENT_DIRECTORY. As a
result, basebackup copies the temporary file of pg_stat_statements.

By the way, basebackup excludes the pg_stat_tmp directory but sends the
pg_stat directory. On the other hand, when we start a server from a
base backup, it starts crash recovery first and removes the stats files
anyway. Why does basebackup send the pg_stat directory then? (Added as
0007.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Rebased on a previously committed WAL-stats patch.
I found a bug where the maximum interval was wrongly set to 600s
instead of 60s.
The previous version failed to flush local database stats under a
certain condition. That behavior caused useless retries and finally a
forced flush that led to contention. I fixed that and will measure
performance with this version.
Now that the global stats are split into bgwriter stats and
checkpointer stats, each of those is updated by only one process.
However, they can be reset by client backends, so an LWLock was still
needed to protect them. To get rid of the LWLocks,
pgstat_reset_shared_counters() is changed so as to avoid scribbling on
the shared structs, roughly as sketched below.
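One way to do that, and this is my assumption about the approach rather
than the patch's exact mechanism, is to keep a reset-offset copy that
readers subtract, so the writer-owned counters are never written by
anyone else:

#include <stdint.h>

/* Hypothetical counter struct; the real stats structs differ. */
typedef struct BgWriterCounters
{
	uint64_t	buf_written_clean;
	uint64_t	maxwritten_clean;
} BgWriterCounters;

static BgWriterCounters bgwriter_stats;		/* written only by bgwriter */
static BgWriterCounters bgwriter_reset_offset;	/* written at reset time */

/* Reader: report the counter relative to the last reset. */
static uint64_t
read_buf_written_clean(void)
{
	return bgwriter_stats.buf_written_clean -
		bgwriter_reset_offset.buf_written_clean;
}

/* Reset: snapshot the current values into the offset struct; the
 * writer's own struct is never scribbled on. */
static void
reset_bgwriter_stats(void)
{
	bgwriter_reset_offset = bgwriter_stats;
}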
Finally, the archiver, bgwriter and checkpointer stats no longer need
an LWLock to update, read or reset. Reader-reader conflicts on
StatsLock still occur, but they don't affect the writer processes.

WAL stats are written from many backends, so they still require an
LWLock to reset, update and read.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Tue, 06 Oct 2020 10:06:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The previous version failed to flush local database stats under a
certain condition. That behavior caused useless retries and finally a
forced flush that led to contention. I fixed that and will measure
performance with this version.
I (we) got some performance numbers.
- Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients.
- Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients.
Those showed speeds of over 400,000 TPS at maximum, and no significant
difference is seen between patched and unpatched over the whole range
of the test. I tried 5 seconds as PGSTAT_MIN_INTERVAL (10s in the
patch) but that made no difference.
- Fetching 1 tuple from 1 table from 800 clients.
No graph is attached for this, but the test shows a speed of over 42
TPS with or without the v39 patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
It occurred to me that I forgot to mention the most significant outcome
of this patch.
At Thu, 08 Oct 2020 16:03:26 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Tue, 06 Oct 2020 10:06:44 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The previous version failed to flush local database stats under a
certain condition. That behavior caused useless retries and finally a
forced flush that led to contention. I fixed that and will measure
performance with this version.

I (we) got some performance numbers.
- Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients.
- Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients.Those showed speed of over 400,000 TPS at maximum, and no siginificant
difference is seen between patched and unpatched at the all range of
the test. I tried 5 seconds as PGSTAT_MIN_INTERVAL (10s in the patch)
but that made no difference.- Fetching 1 tuple from 1 table from 800 clients.
No graph for this is not attached but this test shows speed of over 42
TPS with or without the v39 patch.
Under a heavy load and with many tables, the *reader* side takes
seconds or more to read the stats. With this patch, it takes almost no
time (maybe on the order of milliseconds?) for the same operation.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
I noticed that according to the cfbot this patch no longer applies.
As it is registered in the upcoming commitfest, it would be appreciated
if you could rebase it.
Cheers,
//Georgios
At Fri, 30 Oct 2020 15:00:55 +0000, Georgios Kokolatos <gkokolatos@protonmail.com> wrote in
Hi,
I noticed that according to the cfbot this patch no longer applies.
As it is registered in the upcoming commitfest, it would be appreciated
if you could rebase it.
Thanks! The replication slot stats patch (9868167500) hit this.
- Fixed a bug in the original code.
get_stat_entry() returned a wrong result in "found" when the shared
entry exists but is not locally cached.
- Moved replication slot stats into shared memory stats.
Differently from wal_stats and slru_stats, it can be implemented as
part of the unified stats entry. I was tempted to remove the entry for
a dropped slot immediately, but I didn't do that, since the number of
slots should be under 10 or so and dropping an entry requires an
exclusive lock on the dshash. Instead, dropped entries are removed at
file-write time, which happens only at the end of a process.
I had to clean up replication slots in pgstat_beshutdown_hook(). Even
though we have exactly the same code in several other places, the
function must be called before disabling DSA, because we cannot update
statistics after detaching the shared-memory stats. Perhaps we can
remove some of the existing calls to ReplicationSlotCleanup(), but I
haven't done that in this version.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Wed, 04 Nov 2020 17:39:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 30 Oct 2020 15:00:55 +0000, Georgios Kokolatos <gkokolatos@protonmail.com> wrote in
Hi,
I noticed that according to the cfbot this patch no longer applies.
As it is registered in the upcoming commitfest, it would be appreciated
if you could rebase it.Thanks! The replication slot stats patch (9868167500) hit this.
- Fixed a bug in the original code.
get_stat_entry() returned a wrong result in "found" when the shared
entry exists but is not locally cached.

- Moved replication slot stats into shared memory stats.
Differently from wal_stats and slru_stats, it can be implemented as
part of the unified stats entry. I was tempted to remove the entry for
a dropped slot immediately, but I didn't do that, since the number of
slots should be under 10 or so and dropping an entry requires an
exclusive lock on the dshash. Instead, dropped entries are removed at
file-write time, which happens only at the end of a process.

I had to clean up replication slots in pgstat_beshutdown_hook(). Even
though we have exactly the same code in several other places, the
function must be called before disabling DSA, because we cannot update
statistics after detaching the shared-memory stats. Perhaps we can
remove some of the existing calls to ReplicationSlotCleanup(), but I
haven't done that in this version.
Fixed a bug where pgstat_report_replslot failed to reuse entries that
are marked as "dropped".

Fixed comments at the call sites of pgstat_report_replslot(_drop) in
ReplicationSlotCreate() and ReplicationSlotDropPtr().
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
4f841ce3f7 hit this. Rebased.
At Fri, 06 Nov 2020 09:27:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Fixed a bug where pgstat_report_replslot failed to reuse entries that
are marked as "dropped".

Fixed comments at the call sites of pgstat_report_replslot(_drop) in
ReplicationSlotCreate() and ReplicationSlotDropPtr().
The following changes were made along with the rebasing.
- Removed a useless struct PgStat_HashMountInfo.
- Removed a duplicate member "dropped" from PgStat_ReplSlot.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Wed, 11 Nov 2020 10:07:22 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
4f841ce3f7 hit this. Rebased.
- 01469241b2 and e2ac3fed3b (maybe that's all) have hit this. Rebased.
- Fixed some silly bugs of WAL statistics.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Fri, 11 Dec 2020 16:50:03 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
- Fixed some silly bugs of WAL statistics.
- Conflicted with b3817f5f77. Rebased.
- Make sure to clean up local reference hash before detaching shared
stats memory. Forgetting this caused an assertion failure.
- Reduced the planned number of tests of pg_basebackup according to
the previous reduction made in the directory list in the script.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Mon, 21 Dec 2020 17:16:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
- Conflicted with b3817f5f77. Rebased.
Conflicted with 9877374bef. Rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Fri, 08 Jan 2021 10:24:34 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 21 Dec 2020 17:16:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
- Conflicted with b3817f5f77. Rebased.
Conflicted with 9877374bef. Rebased.
bea449c635 conflicted with this (on a comment change). Rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Thu, 14 Jan 2021 15:14:25 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 08 Jan 2021 10:24:34 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 21 Dec 2020 17:16:20 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
- Conflicted with b3817f5f77. Rebased.
Conflicted with 9877374bef. Rebased.
bea449c635 conflicted with this (on a comment change). Rebased.
Commit 960869da08 (database statistics) conflicted with this. Rebased.
I'm concerned about the behavior that pgstat_update_connstats calls
GetCurrentTimestamp() every time stats update happens (with intervals
of 10s-60s in this patch). But I didn't change that design since that
happens with about 0.5s intervals in master and the rate is largely
reduced in this patch, to make this patch simpler.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Thu, 21 Jan 2021 12:03:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Commit 960869da08 (database statistics) conflicted with this. Rebased.
I'm concerned about the behavior that pgstat_update_connstats calls
GetCurrentTimestamp() every time stats update happens (with intervals
of 10s-60s in this patch). But I didn't change that design since that
happens with about 0.5s intervals in master and the rate is largely
reduced in this patch, to make this patch simpler.
I stepped on my own foot, and another commit conflicted. Just rebased.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2021/03/05 17:18, Kyotaro Horiguchi wrote:
At Thu, 21 Jan 2021 12:03:48 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Commit 960869da08 (database statistics) conflicted with this. Rebased.
I'm concerned about the behavior that pgstat_update_connstats calls
GetCurrentTimestamp() every time stats update happens (with intervals
of 10s-60s in this patch). But I didn't change that design since that
happens with about 0.5s intervals in master and the rate is largely
reduced in this patch, to make this patch simpler.

I stepped on my own foot, and another commit conflicted. Just rebased.
Thanks for rebasing the patches!
I think that the 0003 patch is self-contained and useful; for example,
it enables us to monitor the archiver process in pg_stat_activity. So
IMO it's worth pushing the 0003 patch first.
Here are the review comments for 0003 patch.
+ /* Archiver process's latch */
+ Latch *archiverLatch;
+ /* Current shared estimate of appropriate spins_per_delay value */
The last line in the above seems not necessary.
In proc.h, NUM_AUXILIARY_PROCS needs to be incremented.
/* ----------
* Functions called from postmaster
* ----------
*/
extern int pgarch_start(void);
In pgarch.h, the above is not necessary.
+extern void XLogArchiveWakeup(void);
This seems no longer necessary.
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);
These seem also no longer necessary.
PgArchPID = 0;
if (!EXIT_STATUS_0(exitstatus))
- LogChildExit(LOG, _("archiver process"),
- pid, exitstatus);
- if (PgArchStartupAllowed())
- PgArchPID = pgarch_start();
+ HandleChildCrash(pid, exitstatus,
+ _("archiver process"));
I don't think that we should treat a non-zero exit condition as a
crash, as before. Otherwise, when archive_command fails on a signal,
the archiver emits a FATAL error, which leads to a server restart.
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, archiver or background worker.
*
* The objectives here are to clean up our local state about the child
* process, and to signal all other remaining children to quickdie.
@@ -3609,6 +3606,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
}
+ /* Take care of the archiver too */
+ if (pid == PgArchPID)
+ PgArchPID = 0;
+ else if (PgArchPID != 0 && take_action)
+ {
+ ereport(DEBUG2,
+ (errmsg_internal("sending %s to process %d",
+ (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ (int) PgArchPID)));
+ signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+ }
+
Same as above.
In xlogarchive.c, "#include "storage/pmsignal.h"" is no longer necessary.
pgarch_forkexec() should be removed from pgarch.c because it's no longer used.
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
*/
The definition of PgArchiverMain() should be placed just
after the above comment.
exit(0) in PgArchiverMain() should be proc_exit(0)?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Fri, Mar 5, 2021 at 8:32 PM Fujii Masao <masao.fujii@oss.nttdata.com>
wrote:
On 2021/03/05 17:18, Kyotaro Horiguchi wrote:
At Thu, 21 Jan 2021 12:03:48 +0900 (JST), Kyotaro Horiguchi <
horikyota.ntt@gmail.com> wrote in
Commit 960869da08 (database statistics) conflicted with this. Rebased.
I'm concerned about the behavior that pgstat_update_connstats calls
GetCurrentTimestamp() every time stats update happens (with intervals
of 10s-60s in this patch). But I didn't change that design since that
happens with about 0.5s intervals in master and the rate is largely
reduced in this patch, to make this patch simpler.I stepped on my foot, and another commit coflicted. Just rebased.
Thanks for rebasing the patches!
I think that the 0003 patch is self-contained and useful; for example,
it enables us to monitor the archiver process in pg_stat_activity. So
IMO it's worth pushing the 0003 patch first.

Here are the review comments for 0003 patch.

+	/* Archiver process's latch */
+	Latch	   *archiverLatch;
+	/* Current shared estimate of appropriate spins_per_delay value */

The last line in the above seems not necessary.
In proc.h, NUM_AUXILIARY_PROCS needs to be incremented.
/* ----------
* Functions called from postmaster
* ----------
*/
extern int	pgarch_start(void);

In pgarch.h, the above is not necessary.
+extern void XLogArchiveWakeup(void);
This seems no longer necessary.
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);

These seem also no longer necessary.
 				PgArchPID = 0;
 				if (!EXIT_STATUS_0(exitstatus))
-					LogChildExit(LOG, _("archiver process"),
-								 pid, exitstatus);
-				if (PgArchStartupAllowed())
-					PgArchPID = pgarch_start();
+					HandleChildCrash(pid, exitstatus,
+									 _("archiver process"));

I don't think that we should treat a non-zero exit condition as a
crash, as before. Otherwise, when archive_command fails on a signal,
the archiver emits a FATAL error, which leads to a server restart.

-	 * walwriter, autovacuum, or background worker.
+	 * walwriter, autovacuum, archiver or background worker.
 	 *
 	 * The objectives here are to clean up our local state about the child
 	 * process, and to signal all other remaining children to quickdie.
@@ -3609,6 +3606,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}

+	/* Take care of the archiver too */
+	if (pid == PgArchPID)
+		PgArchPID = 0;
+	else if (PgArchPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) PgArchPID)));
+		signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+

Same as above.
In xlogarchive.c, "#include "storage/pmsignal.h"" is no longer necessary.
pgarch_forkexec() should be removed from pgarch.c because it's no longer
used.

/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
 */

The definition of PgArchiverMain() should be placed just
after the above comment.

exit(0) in PgArchiverMain() should be proc_exit(0)?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
The code does not compile and has compilation warnings and errors.
------
pgstat.c:446:25: note: ‘cached_slrustats’ declared here
static PgStat_SLRUStats cached_slrustats;
^~~~~~~~~~~~~~~~
guc.c:4372:4: error: use of undeclared identifier 'pgstat_temp_directory';
did you mean 'pgstat_stat_directory'?
&pgstat_temp_directory,
^~~~~~~~~~~~~~~~~~~~~
pgstat_stat_directory
../../../../src/include/pgstat.h:922:14: note: 'pgstat_stat_directory'
declared here
extern char *pgstat_stat_directory;
^
guc.c:4373:3: error: use of undeclared identifier 'PG_STAT_TMP_DIR'
PG_STAT_TMP_DIR,
^
guc.c:4374:25: error: use of undeclared identifier
'assign_pgstat_temp_directory'
check_canonical_path, assign_pgstat_temp_directory, NULL
-------
Can we get an updated patch?
I am marking the patch "Waiting on Author"
--
Ibrar Ahmed
At Sat, 6 Mar 2021 00:32:07 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/05 17:18, Kyotaro Horiguchi wrote:
At Thu, 21 Jan 2021 12:03:48 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote inCommit 960869da08 (database statistics) conflicted with this. Rebased.
I'm concerned about the behavior that pgstat_update_connstats calls
GetCurrentTimestamp() every time stats update happens (with intervals
of 10s-60s in this patch). But I didn't change that design since that
happens with about 0.5s intervals in master and the rate is largely
reduced in this patch, to make this patch simpler.I stepped on my foot, and another commit coflicted. Just rebased.
Thanks for rebasing the patches!
I think that 0003 patch is self-contained and useful, for example
which
enables us to monitor archiver process in pg_stat_activity. So IMO
it's worth pusing 0003 patch firstly.
I'm not sure the archiver process is worth realtime monitoring, but I
agree that the patch makes it possible. Anyway, it is required in this
patchset and I'm happy to see it committed beforehand.
Thanks for the review.
Here are the review comments for 0003 patch.
+	/* Archiver process's latch */
+	Latch	   *archiverLatch;
+	/* Current shared estimate of appropriate spins_per_delay value */

The last line in the above seems not necessary.

Oops. It seems like garbage from a past rebasing. Removed.
In proc.h, NUM_AUXILIARY_PROCS needs to be incremented.
Right. Increased to 5 and rewrote the comment.
/* ----------
* Functions called from postmaster
* ----------
*/
extern int pgarch_start(void);In pgarch.h, the above is not necessary.
Removed.
+extern void XLogArchiveWakeup(void);
This seems no longer necessary.
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);

These seem also no longer necessary.

Sorry for the many leftovers. Removed all of them.
 				PgArchPID = 0;
 				if (!EXIT_STATUS_0(exitstatus))
-					LogChildExit(LOG, _("archiver process"),
-								 pid, exitstatus);
-				if (PgArchStartupAllowed())
-					PgArchPID = pgarch_start();
+					HandleChildCrash(pid, exitstatus,
+									 _("archiver process"));

I don't think that we should treat a non-zero exit condition as a
crash, as before. Otherwise, when archive_command fails on a signal,
the archiver emits a FATAL error, which leads to a server restart.
Sounds reasonable. Now the archiver is treated the same way as the wal
receiver. Specifically, exit(1) doesn't cause a server restart.
-	 * walwriter, autovacuum, or background worker.
+	 * walwriter, autovacuum, archiver or background worker.
 	 *
 	 * The objectives here are to clean up our local state about the child
 	 * process, and to signal all other remaining children to quickdie.
@@ -3609,6 +3606,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}

+	/* Take care of the archiver too */
+	if (pid == PgArchPID)
+		PgArchPID = 0;
+	else if (PgArchPID != 0 && take_action)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) PgArchPID)));
+		signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+

Same as above.
Mmm. In the first place, I found that I had forgotten to remove the
existing code that handles the archiver... Removed it instead of the
above, which was added by this patch. Since the process becomes an
auxiliary process, there is no reason to differentiate it from other
auxiliary processes in handling?
In xlogarchive.c, "#include "storage/pmsignal.h"" is no longer
necessary.
Removed.
pgarch_forkexec() should be removed from pgarch.c because it's no
longer used.
Right. Removed. EXEC_BACKEND still fails for another reason, a
prototype mismatch of PgArchiverMain. Fixed it together.
/* ------------------------------------------------------------
* Public functions called from postmaster follow
* ------------------------------------------------------------
 */

The definition of PgArchiverMain() should be placed just
after the above comment.
The module no longer has a function called from the postmaster. Now
PgArchiverMain() is placed just below "/* Main entry point...".
exit(0) in PgArchiverMain() should be proc_exit(0)?
Yeah, the comment on proc_exit() says it should be the only function
that calls exit() directly.
By the way, the patch (0003) removes the flag "wakened". The flag was
originally added to prevent spurious wakeups (66ec2db7284). At the time
the pg_usleep in pgarch_MainLoop was replaced with WaitLatch by
89fd72cbf26, the flag survived but lost its effect, since WaitLatch
doesn't get spurious wakeups (AFAICS). So if the change (removal of
"wakened") is correct, it might be worth another patch. A sketch of the
resulting loop is below.
I'll send new patchset soon.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 8 Mar 2021 21:55:31 +0500, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote in
The code does not compile and has compilation warnings and errors.
------
pgstat.c:446:25: note: ‘cached_slrustats’ declared here
static PgStat_SLRUStats cached_slrustats;
^~~~~~~~~~~~~~~~
guc.c:4372:4: error: use of undeclared identifier 'pgstat_temp_directory';
did you mean 'pgstat_stat_directory'?
&pgstat_temp_directory,
^~~~~~~~~~~~~~~~~~~~~
pgstat_stat_directory
../../../../src/include/pgstat.h:922:14: note: 'pgstat_stat_directory'
declared here
extern char *pgstat_stat_directory;
^
guc.c:4373:3: error: use of undeclared identifier 'PG_STAT_TMP_DIR'
PG_STAT_TMP_DIR,
^
guc.c:4374:25: error: use of undeclared identifier
'assign_pgstat_temp_directory'
check_canonical_path, assign_pgstat_temp_directory, NULL
Thanks! That's a stupid bug that sneaked in during past rebasing and
somehow lurked outside my sight.
The attached is a new version of this patchset.
- Amendment of Fujii-san's comments (in 0003-Make-archiver..)
- Fix removal of pgstat_temp_directory.
- Fixed a bug in EXEC_BACKEND build.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Tue, 09 Mar 2021 16:53:11 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The attached is a new version of this patchset.
Hmm. That's too bad. A just-committed change killed this. Will do a
further rebase.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 09 Mar 2021 17:11:07 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Tue, 09 Mar 2021 16:53:11 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
The attached is a new version of this patchset.
Hmm. That's too bad. A just commited change killed this. Will do a
further rebase.
Done.
+ {"track_activities", PGC_SUSET, STATS_ACTIVITY,
# The name STATS_ACTIVITY is getting to look worse to me..
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2021/03/09 16:51, Kyotaro Horiguchi wrote:
At Sat, 6 Mar 2021 00:32:07 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/05 17:18, Kyotaro Horiguchi wrote:
At Thu, 21 Jan 2021 12:03:48 +0900 (JST), Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote inCommit 960869da08 (database statistics) conflicted with this. Rebased.
I'm concerned about the behavior that pgstat_update_connstats calls
GetCurrentTimestamp() every time stats update happens (with intervals
of 10s-60s in this patch). But I didn't change that design since that
happens with about 0.5s intervals in master and the rate is largely
reduced in this patch, to make this patch simpler.I stepped on my foot, and another commit coflicted. Just rebased.
Thanks for rebasing the patches!
I think that the 0003 patch is self-contained and useful; for example,
it enables us to monitor the archiver process in pg_stat_activity. So
IMO it's worth pushing the 0003 patch first.

I'm not sure the archiver process is worth realtime monitoring, but I
agree that the patch makes it possible. Anyway, it is required in this
patchset and I'm happy to see it committed beforehand.

Thanks for the review.
Here are the review comments for 0003 patch.
+	/* Archiver process's latch */
+	Latch	   *archiverLatch;
+	/* Current shared estimate of appropriate spins_per_delay value */

The last line in the above seems not necessary.

Oops. It seems like garbage from a past rebasing. Removed.
In proc.h, NUM_AUXILIARY_PROCS needs to be incremented.
Right. Increased to 5 and rewrote the comment.
/* ----------
* Functions called from postmaster
* ----------
*/
extern int	pgarch_start(void);

In pgarch.h, the above is not necessary.
Removed.
+extern void XLogArchiveWakeup(void);
This seems no longer necessary.
+extern void XLogArchiveWakeupStart(void);
+extern void XLogArchiveWakeupEnd(void);
+extern void XLogArchiveWakeup(void);

These seem also no longer necessary.

Sorry for the many leftovers. Removed all of them.
 				PgArchPID = 0;
 				if (!EXIT_STATUS_0(exitstatus))
-					LogChildExit(LOG, _("archiver process"),
-								 pid, exitstatus);
-				if (PgArchStartupAllowed())
-					PgArchPID = pgarch_start();
+					HandleChildCrash(pid, exitstatus,
+									 _("archiver process"));

I don't think that we should treat a non-zero exit condition as a
crash, as before. Otherwise, when archive_command fails on a signal,
the archiver emits a FATAL error, which leads to a server restart.

Sounds reasonable. Now the archiver is treated the same way as the wal
receiver. Specifically, exit(1) doesn't cause a server restart.
Thanks!
- if (PgArchStartupAllowed())
- PgArchPID = pgarch_start();
In the latest patch, why did you remove the code to restart a new
archiver in reaper()? When the archiver dies, I think a new archiver
should be restarted like the current reaper() does. Otherwise, the
restart of the archiver can be delayed until the next cycle of
ServerLoop, which may take time.
- * walwriter, autovacuum, or background worker. + * walwriter, autovacuum, archiver or background worker. * * The objectives here are to clean up our local state about the child * process, and to signal all other remaining children to quickdie. @@ -3609,6 +3606,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname) signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT)); } + /* Take care of the archiver too */ + if (pid == PgArchPID) + PgArchPID = 0; + else if (PgArchPID != 0 && take_action) + { + ereport(DEBUG2, + (errmsg_internal("sending %s to process %d", + (SendStop ? "SIGSTOP" : "SIGQUIT"), + (int) PgArchPID))); + signal_child(PgArchPID, (SendStop ? SIGSTOP : SIGQUIT)); + } +Same as above.
Mmm. In the first place, I found that I had forgotten to remove the
existing code that handles the archiver... I removed that instead of the
above, which was added by this patch. Since the process becomes an
auxiliary process, there's no reason to treat it differently from the
other auxiliary processes in this handling, is there?
Yes, ok.
In xlogarchive.c, "#include "storage/pmsignal.h"" is no longer
necessary.

Removed.
pgarch_forkexec() should be removed from pgarch.c because it's no
longer used.

Right. Removed. EXEC_BACKEND still failed for another reason, a
prototype mismatch of PgArchiverMain. Fixed it together.

/* ------------------------------------------------------------
 * Public functions called from postmaster follow
 * ------------------------------------------------------------
 */

The definition of PgArchiverMain() should be placed just
after the above comment.

The module no longer has a function called from postmaster. Now
PgArchiverMain() is placed just below "/* Main entry point...".

exit(0) in PgArchiverMain() should be proc_exit(0)?

Yeah, the comment on proc_exit() says it should be the only function to
call exit() directly.

By the way, the patch (0003) removes the flag "wakened". The flag was
originally added to prevent spurious wakeups (66ec2db7284). At the
time the pg_usleep in pgarch_MainLoop was replaced with WaitLatch by
89fd72cbf26, the flag survived but lost its effect, since
WaitLatch doesn't get spurious wakeups (AFAICS). So if the change
(removal of "wakened") is correct, it might be worth another patch.

I'll send a new patchset soon.
I read v50_003 patch.
When archiver dies, ProcGlobal->archiverLatch should be reset to NULL,
like walreceiver does in WalRcvDie()?
In pgarch.c, #include "postmaster/fork_process.h" seems no longer necessary.
+ if (strcmp(argv[1], "--forkarch") == 0)
+ {
+ /* Restore basic shared memory pointers */
+ InitShmemAccess(UsedShmemSegAddr);
+
+ /* Need a PGPROC to run CreateSharedMemoryAndSemaphores */
+ InitAuxiliaryProcess();
+
+ /* Attach process to shared data structures */
+ CreateSharedMemoryAndSemaphores();
+
+ PgArchiverMain(); /* does not return */
+ }
Why is this necessary? I was thinking that "--forkboot" handles archiver
in SubPostmasterMain().
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
At Tue, 9 Mar 2021 23:24:10 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/09 16:51, Kyotaro Horiguchi wrote:
At Sat, 6 Mar 2021 00:32:07 +0900, Fujii Masao <masao.fujii@oss.nttdata.com>
wrote in

I don't think that we should treat a non-zero exit condition as a crash,
as before. Otherwise, when archive_command fails on a signal, the
archiver emits a FATAL error, which leads to a server restart.

Sounds reasonable. Now the archiver is treated the same way as the wal
receiver. Specifically, exit(1) doesn't cause a server restart.

Thanks!
- if (PgArchStartupAllowed())
-				PgArchPID = pgarch_start();

In the latest patch, why did you remove the code to restart a new archiver
in reaper()? When the archiver dies, I think a new archiver should be restarted like
the current reaper() does. Otherwise, the restart of archiver can be
delayed until the next cycle of ServerLoop, which may take time.
Agreed. I moved the code back to its original place and added the crash
handling code. And I added a phrase to the comment.
+ * Was it the archiver? If exit status is zero (normal) or one (FATAL
+ * exit), we assume everything is all right just like normal backends
+ * and just try to restart a new one so that we immediately retry
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ * archiving of remaining files. (If fail, we'll try again in future
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I read v50_003 patch.
When archiver dies, ProcGlobal->archiverLatch should be reset to NULL,
like walreceiver does in WalRcvDie()?
Unlike walwriter and checkpointer, the archiver, like walreceiver, may
die while the server is running. Leaving the latch pointer
alone may lead to nudging a wrong process that has taken over the same
procarray slot. Added pgarch_die() to do that.
(I moved the archiverLatch to just after checkpointerLatch in this version.)
In pgarch.c, #include "postmaster/fork_process.h" seems no longer necessary.
Right. That's not due to this patch; postmaster.h, dsm.h and pg_shmem.h
are also unused. (fd.h is not necessary, but pgarch.c uses AllocateDir().)
+	if (strcmp(argv[1], "--forkarch") == 0)
+	{

Why is this necessary? I was thinking that "--forkboot" handles archiver
in SubPostmasterMain().
Yeah, the corresponding code is removed in the same patch at the same
time.
Attached is the v51 patchset.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2021/03/10 12:10, Kyotaro Horiguchi wrote:
At Tue, 9 Mar 2021 23:24:10 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/09 16:51, Kyotaro Horiguchi wrote:
At Sat, 6 Mar 2021 00:32:07 +0900, Fujii Masao <masao.fujii@oss.nttdata.com>
wrote in

I don't think that we should treat a non-zero exit condition as a crash,
as before. Otherwise, when archive_command fails on a signal, the
archiver emits a FATAL error, which leads to a server restart.

Sounds reasonable. Now the archiver is treated the same way as the wal
receiver. Specifically, exit(1) doesn't cause a server restart.

Thanks!

-			if (PgArchStartupAllowed())
-				PgArchPID = pgarch_start();

In the latest patch, why did you remove the code to restart a new archiver
in reaper()? When the archiver dies, I think a new archiver should be restarted like
the current reaper() does. Otherwise, the restart of archiver can be
delayed until the next cycle of ServerLoop, which may take time.

Agreed. I moved the code back to its original place and added the crash
handling code. And I added a phrase to the comment.

+	 * Was it the archiver?  If exit status is zero (normal) or one (FATAL
+	 * exit), we assume everything is all right just like normal backends
+	 * and just try to restart a new one so that we immediately retry
	   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+	 * archiving of remaining files. (If fail, we'll try again in future
	   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"of" of "archiving of remaining" should be replaced with "the", or removed?
Just for the record: previously, LogChildExit() was called and the following LOG
message was output when the archiver reported a FATAL error. OTOH the patch
prevents that, so this LOG message is no longer output at a FATAL exit of
the archiver. But I don't think that message is required in that case,
because a FATAL message indicating much the same thing is already output.
Therefore, I'm OK with the patch.
LOG: archiver process (PID 46418) exited with exit code 1
I read v50_003 patch.
When archiver dies, ProcGlobal->archiverLatch should be reset to NULL,
like walreceiver does in WalRcvDie()?

Unlike walwriter and checkpointer, the archiver, like walreceiver, may
die while the server is running. Leaving the latch pointer
alone may lead to nudging a wrong process that has taken over the same
procarray slot. Added pgarch_die() to do that.
Thanks!
+ if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+ SetLatch(ProcGlobal->archiverLatch);
The latch can be reset to NULL in pgarch_die() between the if-condition and
SetLatch(), which would be problematic. Probably we should protect
the access to the latch with a spinlock, like we do for walreceiver's latch?
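(For illustration, a minimal sketch of that spinlock-guarded wakeup,
modeled on how WalRcvForceReply() guards walreceiver's latch; the
PgArch/PgArchData naming is taken from the later patch in this thread,
the rest is assumed:)

	/*
	 * Copy the latch pointer under the mutex, so pgarch_die() cannot
	 * clear it between the test and the SetLatch() call.  The Latch
	 * itself lives in shared memory that is never freed, so calling
	 * SetLatch() on the copied pointer is safe even if the archiver
	 * has just exited.
	 */
	{
		Latch	   *latch;

		SpinLockAcquire(&PgArch->mutex);
		latch = PgArch->latch;
		SpinLockRelease(&PgArch->mutex);

		if (latch)
			SetLatch(latch);
	}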
(I moved the archiverLatch to just after checkpointerLatch in this version.)
In pgarch.c, #include "postmaster/fork_process.h" seems no longer necessary.
Right. That's not due to this patch; postmaster.h, dsm.h and pg_shmem.h
are also unused. (fd.h is not necessary, but pgarch.c uses AllocateDir().)

+	if (strcmp(argv[1], "--forkarch") == 0)
+	{

Why is this necessary? I was thinking that "--forkboot" handles archiver
in SubPostmasterMain().

Yeah, the corresponding code is removed in the same patch at the same
time.

Attached is the v51 patchset.
Thanks a lot!
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
At Wed, 10 Mar 2021 15:20:43 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/10 12:10, Kyotaro Horiguchi wrote:
Agreed. I moved the code back to its original place and added the crash
handling code. And I added a phrase to the comment.

+	 * Was it the archiver?  If exit status is zero (normal) or one (FATAL
+	 * exit), we assume everything is all right just like normal backends
+	 * and just try to restart a new one so that we immediately retry
	   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+	 * archiving of remaining files. (If fail, we'll try again in future
	   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"of" of "archiving of remaining" should be replaced with "the", or removed?
Either will do. I don't mind turning the gerund (archiving) into a
gerund phrase (archiving remaining files).
Just for the record: previously, LogChildExit() was called and the following LOG
message was output when the archiver reported a FATAL error. OTOH the patch
prevents that, so this LOG message is no longer output at a FATAL exit of
the archiver. But I don't think that message is required in that case,
because a FATAL message indicating much the same thing is already output.
Therefore, I'm OK with the patch.

LOG: archiver process (PID 46418) exited with exit code 1

Yeah, that's the same behavior as the wal receiver.
I read v50_003 patch.
When archiver dies, ProcGlobal->archiverLatch should be reset to NULL,
like walreceiver does in WalRcvDie()?

Unlike walwriter and checkpointer, the archiver, like walreceiver, may
die while the server is running. Leaving the latch pointer
alone may lead to nudging a wrong process that has taken over the same
procarray slot. Added pgarch_die() to do that.

Thanks!
+	if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+		SetLatch(ProcGlobal->archiverLatch);

The latch can be reset to NULL in pgarch_die() between the if-condition and
SetLatch(), which would be problematic. Probably we should protect
the access to the latch with a spinlock, like we do for walreceiver's latch?
Ugh. Right. I remember that bug. I moved the archiverLatch out
of ProcGlobal to a dedicated local struct PgArch and placed a spinlock
alongside it.
Thanks for the review! v52 is attached.
Other than the archiver fix above, a bug in 0004's handling of
replication slot stats that led to a hang is fixed. (It was the cause
of the CF-bot failure.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2021/03/10 17:51, Kyotaro Horiguchi wrote:
At Wed, 10 Mar 2021 15:20:43 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/10 12:10, Kyotaro Horiguchi wrote:
Agreed. I moved the code back to its original place and added the crash
handling code. And I added a phrase to the comment.

+	 * Was it the archiver?  If exit status is zero (normal) or one (FATAL
+	 * exit), we assume everything is all right just like normal backends
+	 * and just try to restart a new one so that we immediately retry
	   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+	 * archiving of remaining files. (If fail, we'll try again in future
	   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"of" of "archiving of remaining" should be replaced with "the", or removed?

Either will do. I don't mind turning the gerund (archiving) into a
gerund phrase (archiving remaining files).

Just for the record: previously, LogChildExit() was called and the following LOG
message was output when the archiver reported a FATAL error. OTOH the patch
prevents that, so this LOG message is no longer output at a FATAL exit of
the archiver. But I don't think that message is required in that case,
because a FATAL message indicating much the same thing is already output.
Therefore, I'm OK with the patch.

LOG: archiver process (PID 46418) exited with exit code 1

Yeah, that's the same behavior as the wal receiver.
I read v50_003 patch.
When archiver dies, ProcGlobal->archiverLatch should be reset to NULL,
like walreceiver does in WalRcvDie()?

Unlike walwriter and checkpointer, the archiver, like walreceiver, may
die while the server is running. Leaving the latch pointer
alone may lead to nudging a wrong process that has taken over the same
procarray slot. Added pgarch_die() to do that.

Thanks!

+	if (IsUnderPostmaster && ProcGlobal->archiverLatch)
+		SetLatch(ProcGlobal->archiverLatch);

The latch can be reset to NULL in pgarch_die() between the if-condition and
SetLatch(), which would be problematic. Probably we should protect
the access to the latch with a spinlock, like we do for walreceiver's latch?

Ugh. Right. I remember that bug. I moved the archiverLatch out
of ProcGlobal to a dedicated local struct PgArch and placed a spinlock
alongside it.

Thanks for the review! v52 is attached.
Thanks! I applied minor and cosmetic changes to the 0003 patch as follows.
Attached is the updated version of the 0003 patch. Barring any objection,
I will commit this patch.
-#include "storage/latch.h"
-#include "storage/proc.h"
I removed these because they are no longer necessary.
<literal>logical replication worker</literal>,
<literal>parallel worker</literal>, <literal>background writer</literal>,
<literal>client backend</literal>, <literal>checkpointer</literal>,
+ <literal>archiver</literal>,
<literal>startup</literal>, <literal>walreceiver</literal>,
<literal>walsender</literal> and <literal>walwriter</literal>.
In the document about pg_stat_activity, possible values in backend_type
column are all listed. I added "archiver" into the list.
BTW, those values were originally listed in alphabetical order,
but ISTM that they got out of order at some point.
So they should be listed in alphabetical order again. This should
be implemented as a separate patch.
-PgArchData *PgArch = NULL;
+static PgArchData *PgArch = NULL;
I marked PgArchData as static because it's used only in pgarch.c.
- ShmemInitStruct("Archiver ", PgArchShmemSize(), &found);
+ ShmemInitStruct("Archiver Data", PgArchShmemSize(), &found);
I found that the original shmem name ended with unnecessary space character.
I replaced it with "Archiver Data".
In reaper(), I moved the code block for archiver to the original location.
I ran pgindent for the files that the patch modifies.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
At Wed, 10 Mar 2021 21:47:51 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
Attached is the updated version of the 0003 patch. Barring any
objection,
I will commit this patch.

-#include "storage/latch.h"
-#include "storage/proc.h"

I removed these because they are no longer necessary.
Mmm. Sorry for the garbage.
<literal>logical replication worker</literal>,
<literal>parallel worker</literal>, <literal>background
writer</literal>,
<literal>client backend</literal>, <literal>checkpointer</literal>,
+ <literal>archiver</literal>,
<literal>startup</literal>, <literal>walreceiver</literal>,
<literal>walsender</literal> and <literal>walwriter</literal>.

In the document about pg_stat_activity, possible values in
backend_type
column are all listed. I added "archiver" into the list.

BTW, those values were originally listed in alphabetical order,
but ISTM that they got out of order at some point.
So they should be listed in alphabetical order again. This should
be implemented as a separate patch.
Thanks for adding it.
They are also loosely sorted by function or characteristics. I'm not
sure which is better, but in any case they should be ordered based on
clear criteria.
-PgArchData *PgArch = NULL;
+static PgArchData *PgArch = NULL;

I marked PgArchData as static because it's used only in pgarch.c.
Right.
- ShmemInitStruct("Archiver ", PgArchShmemSize(), &found); + ShmemInitStruct("Archiver Data", PgArchShmemSize(), &found);I found that the original shmem name ended with unnecessary space
character.
I replaced it with "Archiver Data".
Oops. The trailing space is where I stopped writing the string to look
for a better word; in the meanwhile, my mind got attracted to something
else and I left it. I don't object to "Archiver Data". Thanks for
completing it.
In reaper(), I moved the code block for archiver to the original
location.
Agreed.
I ran pgindent for the files that the patch modifies.
Yeah, I forgot to add the new struct to typedefs.list. I
intentionally omitted clearing the newly-acquired shared memory, but
doing so doesn't hurt.
So, I'm fine with it. Thanks for taking this.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
Two minor nits:
On 2021-03-10 21:47:51 +0900, Fujii Masao wrote:
+/* Shared memory area for archiver process */
+typedef struct PgArchData
+{
+	Latch	   *latch;		/* latch to wake the archiver up */
+	slock_t		mutex;		/* locks this struct */
+} PgArchData;
+
It doesn't really matter, but it'd be pretty trivial to avoid needing a
spinlock for this kind of thing. Just store the pgprocno of the archiver
in PgArchData.
While getting rid of the spinlock doesn't seem like a huge win, it does
seem nicer that we'd automatically have a way to find data about the
archiver (e.g. pid).
* checkpointer to exit as well, otherwise not. The archiver, stats,
* and syslogger processes are disregarded since they are not
* connected to shared memory; we also disregard dead_end children
* here. Walsenders are also disregarded, they will be terminated
* later after writing the checkpoint record, like the archiver
* process.
*/
This comment in PostmasterStateMachine() is outdated now.
Greetings,
Andres Freund
Hi,
On 2021-03-10 17:51:37 +0900, Kyotaro Horiguchi wrote:
From ed2fb2fca47fccbf9af1538688aab8334cf6470c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Fri, 13 Mar 2020 16:58:03 +0900
Subject: [PATCH v52 1/7] sequential scan for dshash

Dshash did not allow scanning all entries sequentially. This adds that
functionality. The interface is similar to, but a bit different from,
both dynahash and the simple dshash search functions. One of the most
significant differences is that the sequential scan interface of dshash
always needs a call to dshash_seq_term when the scan ends. Another is
locking. Dshash holds a partition lock when returning an entry;
dshash_seq_next() also holds a lock when returning an entry, but callers
shouldn't release it, since the lock is essential to continue a
scan. The seqscan interface allows entry deletion while a scan. The
in-scan deletion should be performed by dshash_delete_current().
*while a scan is in progress
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+	dsa_pointer next_item_pointer;
+
+	if (status->curitem == NULL)
+	{
+		int			partition;
+
+		Assert(status->curbucket == 0);
+		Assert(!status->hash_table->find_locked);
+
+		/* first shot. grab the first item. */
+		partition =
+			PARTITION_FOR_BUCKET_INDEX(status->curbucket,
+									   status->hash_table->size_log2);
+		LWLockAcquire(PARTITION_LOCK(status->hash_table, partition),
+					  status->exclusive ? LW_EXCLUSIVE : LW_SHARED);
+		status->curpartition = partition;
What does "first shot" mean here?
+/*
+ * Terminates the seqscan and release all locks.
+ *
+ * Should be always called when finishing or exiting a seqscan.
+ */
+void
+dshash_seq_term(dshash_seq_status *status)
+{
+	status->hash_table->find_locked = false;
+	status->hash_table->find_exclusively_locked = false;
+
+	if (status->curpartition >= 0)
+		LWLockRelease(PARTITION_LOCK(status->hash_table, status->curpartition));
+}
I think it'd be good if you added an assertion preventing this from
being called twice.
status->curpartition should definitely be reset to something < 0 after
releasing the lock, to avoid that happening again in case of such a bug.
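(For illustration, a sketch combining both suggestions; the "terminated"
flag would be a hypothetical new member of dshash_seq_status, the other
names come from the quoted patch:)

void
dshash_seq_term(dshash_seq_status *status)
{
	/* A second call on the same scan now trips this assertion. */
	Assert(!status->terminated);
	status->terminated = true;

	status->hash_table->find_locked = false;
	status->hash_table->find_exclusively_locked = false;

	if (status->curpartition >= 0)
		LWLockRelease(PARTITION_LOCK(status->hash_table,
									 status->curpartition));

	/* Make sure a buggy repeated call cannot release the lock twice. */
	status->curpartition = -1;
}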
+/* Get the current entry while a seq scan. */
+void *
+dshash_get_current(dshash_seq_status *status)
+{
+	return ENTRY_FROM_ITEM(status->curitem);
+}
What is this used for? It'd probably be good to assert that there's a
scan in progress too, and that locks are held - otherwise this isn't
safe, right?
+void *
+dshash_find_extended(dshash_table *hash_table, const void *key,
+					 bool exclusive, bool nowait, bool insert, bool *found)
+{
+	dshash_hash hash = hash_key(hash_table, key);
+	size_t		partidx = PARTITION_FOR_HASH(hash);
+	dshash_partition *partition = &hash_table->control->partitions[partidx];
+	LWLockMode	lockmode = exclusive ? LW_EXCLUSIVE : LW_SHARED;
 	dshash_table_item *item;

-	hash = hash_key(hash_table, key);
-	partition_index = PARTITION_FOR_HASH(hash);
-	partition = &hash_table->control->partitions[partition_index];
-
-	Assert(hash_table->control->magic == DSHASH_MAGIC);
-	Assert(!hash_table->find_locked);
+	/* must be exclusive when insert allowed */
+	Assert(!insert || (exclusive && found != NULL));

 restart:
-	LWLockAcquire(PARTITION_LOCK(hash_table, partition_index),
-				  LW_EXCLUSIVE);
+	if (!nowait)
+		LWLockAcquire(PARTITION_LOCK(hash_table, partidx), lockmode);
+	else if (!LWLockConditionalAcquire(PARTITION_LOCK(hash_table, partidx),
+									   lockmode))
+		return NULL;
I think this code (and probably also some in the previous patch) would
be more readable if you introduced a local variable to store
PARTITION_LOCK(hash_table, partidx) in. There's four repetitions of
this...
ensure_valid_bucket_pointers(hash_table);
/* Search the active bucket. */
	item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));

 	if (item)
-		*found = true;
+	{
+		if (found)
+			*found = true;
+	}
 	else
 	{
-		*found = false;
+		if (found)
+			*found = false;
+
+		if (!insert)
+		{
+			/* The caller didn't told to add a new entry. */
s/told/tell us/
These nested ifs make me think that the interface is too
complicated...
+typedef struct StatsShmemStruct
+{
+	dsa_handle	stats_dsa_handle;	/* handle for stats data area */
+	dshash_table_handle hash_handle;	/* shared dbstat hash */
+	int			refcount;		/* # of processes that is attaching the shared
+								 * stats memory */
+	/* Global stats structs */
+	PgStat_Archiver archiver_stats;
+	pg_atomic_uint32 archiver_changecount;
+	PgStat_BgWriter bgwriter_stats;
+	pg_atomic_uint32 bgwriter_changecount;
+	PgStat_CheckPointer checkpointer_stats;
+	pg_atomic_uint32 checkpointer_changecount;
+	PgStat_Wal	wal_stats;
+	LWLock		wal_stats_lock;
+	PgStatSharedSLRUStats slru_stats;
+	pg_atomic_uint32 slru_changecount;
+	pg_atomic_uint64 stats_timestamp;
Looks like you're intending to use something similar to
PGSTAT_BEGIN_WRITE_ACTIVITY etc. But you're not using the same set of
macros, and I didn't see any comments explaining the exact locking
protocol.
I did see that you do atomic fetch_add below. That should never be
necessary though - there's only ever exactly one writer for archiver,
bgwriter, checkpointer etc.
+
+/* BgWriter global statistics counters */
+PgStat_BgWriter BgWriterStats = {0};
+
+/* CheckPointer global statistics counters */
+PgStat_CheckPointer CheckPointerStats = {0};
+
+/* WAL global statistics counters */
+PgStat_Wal	WalStats = {0};
+
This makes it sound like these are actually global counters, visible to
everyone - but they're not, right?
 /*
- * SLRU statistics counts waiting to be sent to the collector. These are
- * stored directly in stats message format so they can be sent without needing
- * to copy things around. We assume this variable inits to zeroes. Entries
- * are one-to-one with slru_names[].
+ * XXXX: always try to flush WAL stats. We don't want to manipulate another
+ * counter during XLogInsert so we don't have an efficient short cut to know
+ * whether any counter gets incremented.
  */
-static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
+static inline bool
+walstats_pending(void)
+{
+	static const PgStat_Wal all_zeroes;
+
+	return memcmp(&WalStats, &all_zeroes,
+				  offsetof(PgStat_Wal, stat_reset_timestamp)) != 0;
+}
That's a pretty expensive way to figure out whether there've been
changes, given how PgStat_Wal has grown. You say "don't want to
manipulate another during XLogInsert" - but a simple increment wouldn't
cost that much in addition to what the wal stats stuff already does.
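(For illustration, a sketch of such a short cut; the have_wal_stats flag
and count_wal_record() are hypothetical, and wal_records just stands in
for whichever PgStat_Wal counter is being bumped:)

/* Set wherever a WAL stats counter is incremented. */
static bool have_wal_stats = false;

static inline void
count_wal_record(void)
{
	WalStats.wal_records++;		/* example counter */
	have_wal_stats = true;		/* one plain store, instead of a memcmp
								 * over the whole struct at flush time */
}

static inline bool
walstats_pending(void)
{
	return have_wal_stats;
}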
+ * Per-object statistics are stored in a "shared stats", corresponding struct
+ * that has a header part common among all object types in DSA-allocated
+ * memory.
The part of the sentence doesn't quite seem to make sense grammatically.
+static void
+attach_shared_stats(void)
+{
+	MemoryContext oldcontext;
+	/*
+	 * The first attacher backend may still reading the stats file, or the
+	 * last detacher may writing it. Wait for the work to finish.
+	 */
I still believe this kind of approach is too complicated, and we should
simply not do any of this "first attacher" business. Instead read it in
the startup process or such.
+		StatsShmem->refcount++;
+	else
 	{
-		ereport(LOG,
-				(errcode_for_socket_access(),
-				 errmsg("could not set statistics collector socket to nonblocking mode: %m")));
-		goto startup_failed;
+		/* We're the first process to attach the shared stats memory */
+		Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+		/* Initialize shared memory area */
+		area = dsa_create(LWTRANCHE_STATS);
+		pgStatSharedHash = dshash_create(area, &dsh_params, 0);
+
+		StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+		StatsShmem->hash_handle = dshash_get_hash_table_handle(pgStatSharedHash);
+		LWLockInitialize(&StatsShmem->slru_stats.lock, LWTRANCHE_STATS);
+		pg_atomic_init_u32(&StatsShmem->slru_stats.changecount, 0);
+
+		/* Block the next attacher for a while, see the comment above. */
+		StatsShmem->attach_holdoff = true;
+
+		StatsShmem->refcount = 1;
 	}
At least the lwlock etc initialization should definitely just happen as
part of shared memory initialization, not at some point later (like most
(all?) other places using *statically sized* shared memory). The story
for dsa_create/dshash_create is different, but then just do that, not
everything else.
+/* ----------
+ * get_stat_entry() -
  *
- * Called from postmaster at startup or after an existing collector
- * died.  Attempt to fire up a fresh statistics collector.
+ * get shared stats entry for specified type, dbid and objid.
+ * If nowait is true, returns NULL on lock failure.
  *
- * Returns PID of child process, or 0 if fail.
+ * If initfunc is not NULL, new entry is created if not yet and the function
+ * is called with the new base entry. If found is not NULL, it is set to true
+ * if existing entry is found or false if not.
+ * ----------
+ */
+static PgStat_StatEntryHeader *
+get_stat_entry(PgStatTypes type, Oid dbid, Oid objid, bool nowait, bool create,
+			   bool *found)
+{
+	PgStatHashEntry *shhashent;
+	PgStatLocalHashEntry *lohashent;
+	PgStat_StatEntryHeader *shheader = NULL;
+	PgStatHashKey key;
+	bool		shfound;
+
+	key.type = type;
+	key.databaseid = dbid;
+	key.objectid = objid;
+
+	if (pgStatEntHash)
+	{
+		uint64		currage;
+
+		/*
+		 * pgStatEntHashAge increments quite slowly than the time the
+		 * following loop takes so this is expected to iterate no more than
+		 * twice.
+		 */
+		while (unlikely
+			   (pgStatEntHashAge !=
+				(currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
+		{
+			pgstat_localhash_iterator i;
+
+			/*
+			 * Some entries have been dropped. Invalidate cache pointer to
+			 * them.
+			 */
+			pgstat_localhash_start_iterate(pgStatEntHash, &i);
+			while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i))
+				   != NULL)
+			{
+				PgStat_StatEntryHeader *header = lohashent->body;
+
+				if (header->dropped)
+				{
+					pgstat_localhash_delete(pgStatEntHash, key);
+
+					if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
+					{
+						/*
+						 * We're the last referrer to this entry, drop the
+						 * shared entry.
+						 */
+						dsa_free(area, lohashent->dsapointer);
+					}
+				}
+			}
+
+			pgStatEntHashAge = currage;
+		}
This should be its own function.
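(For illustration, a sketch of that extraction, reusing the names from
the quoted patch; whether the local hash entry exposes its key as
lohashent->key is an assumption:)

/* The cache-invalidation loop pulled out of get_stat_entry(). */
static void
invalidate_dropped_local_entries(void)
{
	uint64		currage;

	while (unlikely(pgStatEntHashAge !=
					(currage = pg_atomic_read_u64(&StatsShmem->gc_count))))
	{
		pgstat_localhash_iterator i;
		PgStatLocalHashEntry *lohashent;

		/* Some entries have been dropped; invalidate cached pointers. */
		pgstat_localhash_start_iterate(pgStatEntHash, &i);
		while ((lohashent = pgstat_localhash_iterate(pgStatEntHash, &i)) != NULL)
		{
			PgStat_StatEntryHeader *header = lohashent->body;

			if (header->dropped)
			{
				pgstat_localhash_delete(pgStatEntHash, lohashent->key);

				/* The last referrer frees the shared entry. */
				if (pg_atomic_sub_fetch_u32(&header->refcount, 1) < 1)
					dsa_free(area, lohashent->dsapointer);
			}
		}

		pgStatEntHashAge = currage;
	}
}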
+	shhashent = dshash_find_extended(pgStatSharedHash, &key,
+									 create, nowait, create, &shfound);
+	if (shhashent)
+	{
+		if (create && !shfound)
+		{
+			/* Create new stats entry. */
+			dsa_pointer chunk = dsa_allocate0(area,
+											  pgstat_sharedentsize[type]);
+
+			shheader = dsa_get_address(area, chunk);
+			LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+			pg_atomic_init_u32(&shheader->refcount, 0);
+
+			/* Link the new entry from the hash entry. */
+			shhashent->body = chunk;
+		}
+		else
+			shheader = dsa_get_address(area, shhashent->body);
+
+		/*
+		 * We expose this shared entry now. You might think that the entry
+		 * can be removed by a concurrent backend, but since we are creating
+		 * an stats entry, the object actually exists and used in the upper
+		 * layer. Such an object cannot be dropped until the first vacuum
+		 * after the current transaction ends.
+		 */
+		dshash_release_lock(pgStatSharedHash, shhashent);
I don't think you can safely release the lock before you incremented the
refcount? What if, once the lock is released, somebody looks up that
entry, increments the refcount, and decrements it again? It'll see a
refcount of 0 at the end and decide to free the memory. Then the code
below will access already freed / reused memory, no?
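(In other words, the entry would have to be pinned before the lock is
dropped; a sketch with the names from the quoted patch:)

		else
			shheader = dsa_get_address(area, shhashent->body);

		/*
		 * Pin the entry while the partition lock is still held, so a
		 * concurrent lookup-and-release cycle cannot observe refcount 0
		 * and free the memory we are about to return.
		 */
		pg_atomic_add_fetch_u32(&shheader->refcount, 1);

		dshash_release_lock(pgStatSharedHash, shhashent);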
@@ -2649,85 +3212,138 @@ pgstat_twophase_postabort(TransactionId xid, uint16 info,

 /* ----------
  * pgstat_fetch_stat_dbentry() -
- *
- *	Support function for the SQL-callable pgstat* functions. Returns
- *	the collected statistics for one database or NULL. NULL doesn't mean
- *	that the database doesn't exist, it is just not yet known by the
- *	collector, so the caller is better off to report ZERO instead.
+ *	Find database stats entry on backends in a palloc'ed memory.
+ *
+ *	The returned entry is stored in static memory so the content is valid until
+ *	the next call of the same function for the different database.
  * ----------
  */
 PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
-	/*
-	 * If not done for this transaction, read the statistics collector stats
-	 * file into some hash tables.
-	 */
-	backend_read_statsfile();
-
-	/*
-	 * Lookup the requested database; return NULL if not found
-	 */
-	return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-											  (void *) &dbid,
-											  HASH_FIND, NULL);
+	PgStat_StatDBEntry *shent;
+
+	/* should be called from backends */
+	Assert(IsUnderPostmaster);
Can't these be called from single user mode?
Greetings,
Andres Freund
At Wed, 10 Mar 2021 19:21:00 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
Two minor nits:
Thanks for the comments!
On 2021-03-10 21:47:51 +0900, Fujii Masao wrote:
+/* Shared memory area for archiver process */
+typedef struct PgArchData
+{
+	Latch	   *latch;		/* latch to wake the archiver up */
+	slock_t		mutex;		/* locks this struct */
+} PgArchData;
+

It doesn't really matter, but it'd be pretty trivial to avoid needing a
spinlock for this kind of thing. Just store the pgprocno of the archiver
in PgArchData.
Looks promising.
While getting rid of the spinlock doesn't seem like a huge win, it does
seem nicer that we'd automatically have a way to find data about the
archiver (e.g. pid).
PGPROC GetAuxProcessInfo(AuxProcType type)?
* checkpointer to exit as well, otherwise not. The archiver, stats,
* and syslogger processes are disregarded since they are not
* connected to shared memory; we also disregard dead_end children
* here. Walsenders are also disregarded, they will be terminated
* later after writing the checkpoint record, like the archiver
* process.
*/This comment in PostmasterStateMachine() is outdated now.
Right. Will fix a bit later.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2021/03/11 13:42, Kyotaro Horiguchi wrote:
At Wed, 10 Mar 2021 19:21:00 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
Two minor nits:
Thanks for the comments!
On 2021-03-10 21:47:51 +0900, Fujii Masao wrote:
+/* Shared memory area for archiver process */
+typedef struct PgArchData
+{
+	Latch	   *latch;		/* latch to wake the archiver up */
+	slock_t		mutex;		/* locks this struct */
+} PgArchData;
+

It doesn't really matter, but it'd be pretty trivial to avoid needing a
spinlock for this kind of thing. Just store the pgprocno of the archiver
in PgArchData.

Looks promising.

You mean that the spinlock is not necessary by doing the following?
- Save pgprocno of archiver in PgArchData, instead of latch and mutex.
- Set PgArch->pgprocno at the startup of archiver
- Reset PgArch->pgprocno to INVALID_PGPROCNO at pgarch_die()
- XLogArchiveNotify() sets the latch (i.e., &ProcGlobal->allProcs[PgArch->pgprocno].procLatch) if PgArch->pgprocno is not INVALID_PGPROCNO
Right?
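(For illustration, a sketch of the resulting lock-free wakeup; the
function name here is made up:)

static void
WakeupArchiver(void)
{
	int			arch_pgprocno = PgArch->pgprocno;

	/*
	 * No lock needed: procLatch is part of PGPROC, which is never freed,
	 * so the worst case is setting the latch of an unrelated process that
	 * recycled the slot, and latches tolerate spurious wakeups.
	 */
	if (arch_pgprocno != INVALID_PGPROCNO)
		SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);
}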
While getting rid of the spinlock doesn't seem like a huge win, it does
seem nicer that we'd automatically have a way to find data about the
archiver (e.g. pid).

PGPROC GetAuxProcessInfo(AuxProcType type)?
I don't think this new function is necessary.
ISTM that Andres said that it's worth adding pgprocno into PgArch
because it enables us to get the information about the archiver more
easily by using that pgprocno. For example, we can get the pid of the
archiver via ProcGlobal->allProcs[PgArch->pgprocno].pid. That is, he
thinks that adding pgprocno has several merits. I agree with that.
Maybe I'm misunderstanding his comment, though...
* checkpointer to exit as well, otherwise not. The archiver, stats,
* and syslogger processes are disregarded since they are not
* connected to shared memory; we also disregard dead_end children
* here. Walsenders are also disregarded, they will be terminated
* later after writing the checkpoint record, like the archiver
* process.
 */

This comment in PostmasterStateMachine() is outdated now.
Right. Will fix a bit later.
Thanks!
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
At Thu, 11 Mar 2021 15:33:52 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/11 13:42, Kyotaro Horiguchi wrote:
At Wed, 10 Mar 2021 19:21:00 -0800, Andres Freund <andres@anarazel.de>
wrote inHi,
Two minor nits:
Thanks for the comments!
On 2021-03-10 21:47:51 +0900, Fujii Masao wrote:
+/* Shared memory area for archiver process */
+typedef struct PgArchData
+{
+	Latch	   *latch;		/* latch to wake the archiver up */
+	slock_t		mutex;		/* locks this struct */
+} PgArchData;
+

It doesn't really matter, but it'd be pretty trivial to avoid needing a
spinlock for this kind of thing. Just store the pgprocno of the
archiver in PgArchData.

Looks promising.

You mean that the spinlock is not necessary by doing the following?
- Save pgprocno of archiver in PgArchData, instead of latch and mutex.
- Set PgArch->pgprocno at the startup of archiver
- Reset PgArch->pgprocno to INVALID_PGPROCNO at pgarch_die()
- XLogArchiveNotify() sets the latch (i.e.,
  &ProcGlobal->allProcs[PgArch->pgprocno].procLatch) if PgArch->pgprocno
  is not INVALID_PGPROCNO

Right?
I think it is right as a rough sketch.
While getting rid of the spinlock doesn't seem like a huge win, it
does
seem nicer that we'd automatically have a way to find data about the
archiver (e.g. pid).

PGPROC GetAuxProcessInfo(AuxProcType type)?
I don't think this new function is necessary.
ISTM that Andres said that it's worth adding pgprocno into PgArch
because it enables us to get the information about the archiver more
easily by using that pgprocno. For example, we can get the pid of the
archiver via ProcGlobal->allProcs[PgArch->pgprocno].pid. That is, he
thinks that adding pgprocno has several merits. I agree with that.
Maybe I'm misunderstanding his comment, though...
I meant some operation that converts an AuxProcType to a PGPROC * for
any type of auxiliary process, but perhaps that's wrong. It should be
saying about the same thing as the previous paragraph, and it seems
that you're right.
* checkpointer to exit as well, otherwise not. The archiver,
* stats,
* and syslogger processes are disregarded since they are not
* connected to shared memory; we also disregard dead_end children
* here. Walsenders are also disregarded, they will be terminated
* later after writing the checkpoint record, like the archiver
* process.
 */

This comment in PostmasterStateMachine() is outdated now.
Right. Will fix a bit later.
I moved the archiver from its current location in the comment to next to
"walsenders", which are to be terminated along with the archiver.
Attached is only 0003 of the new version, based on the last one
from Fujii-san.
regareds.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2021/03/12 9:23, Kyotaro Horiguchi wrote:
At Thu, 11 Mar 2021 15:33:52 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/11 13:42, Kyotaro Horiguchi wrote:
At Wed, 10 Mar 2021 19:21:00 -0800, Andres Freund <andres@anarazel.de>
wrote in

Hi,
Two minor nits:
Thanks for the comments!
On 2021-03-10 21:47:51 +0900, Fujii Masao wrote:
+/* Shared memory area for archiver process */
+typedef struct PgArchData
+{
+	Latch	   *latch;		/* latch to wake the archiver up */
+	slock_t		mutex;		/* locks this struct */
+} PgArchData;
+

It doesn't really matter, but it'd be pretty trivial to avoid needing a
spinlock for this kind of thing. Just store the pgprocno of the
archiver in PgArchData.

Looks promising.

You mean that the spinlock is not necessary by doing the following?
- Save pgprocno of archiver in PgArchData, instead of latch and mutex.
- Set PgArch->pgprocno at the startup of archiver
- Reset PgArch->pgprocno to INVALID_PGPROCNO at pgarch_die()
- XLogArchiveNotify() sets the latch (i.e.,
  &ProcGlobal->allProcs[PgArch->pgprocno].procLatch) if PgArch->pgprocno
  is not INVALID_PGPROCNO

Right?

I think it is right as a rough sketch.
While getting rid of the spinlock doesn't seem like a huge win, it
does
seem nicer that we'd automatically have a way to find data about the
archiver (e.g. pid).

PGPROC GetAuxProcessInfo(AuxProcType type)?
I don't think this new function is necessary.
ISTM that Andres said that it's worth adding pgprocno into PgArch
because it enables us to get the information about the archiver more
easily by using that pgprocno. For example, we can get the pid of the
archiver via ProcGlobal->allProcs[PgArch->pgprocno].pid. That is, he
thinks that adding pgprocno has several merits. I agree with that.
Maybe I'm misunderstanding his comment, though...

I meant some operation that converts an AuxProcType to a PGPROC * for
any type of auxiliary process, but perhaps that's wrong. It should be
saying about the same thing as the previous paragraph, and it seems
that you're right.

 * checkpointer to exit as well, otherwise not. The archiver,
* stats,
* and syslogger processes are disregarded since they are not
* connected to shared memory; we also disregard dead_end children
* here. Walsenders are also disregarded, they will be terminated
* later after writing the checkpoint record, like the archiver
* process.
 */

This comment in PostmasterStateMachine() is outdated now.

Right. Will fix a bit later.

I moved the archiver from its current location in the comment to next to
"walsenders", which are to be terminated along with the archiver.

Attached is only 0003 of the new version, based on the last one
from Fujii-san.
Thanks for updating the patch! But you forgot to add the changes
related to pgprocno into the patch?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
At Fri, 12 Mar 2021 09:23:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Attached is only 0003 of the new version, based on the last one
from Fujii-san.
Please wait a moment. Something might be wrong.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 12 Mar 2021 10:07:10 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 12 Mar 2021 09:23:12 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Attached is only 0003 of the new version, based on the last one
from Fujii-san.

Please wait a moment. Something might be wrong.
It was not in 0003 but in another part. Sorry for the noise.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 12 Mar 2021 10:03:31 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
I moved the archiver from its current location in the comment to next to
"walsenders", which are to be terminated along with the archiver.

Attached is only 0003 of the new version, based on the last one
from Fujii-san.

Thanks for updating the patch! But you forgot to add the changes
related to pgprocno into the patch?
Although I intentionally didn't do that because "it really doesn't
matter", please wait a while for the new version. I'm going to
make PgArchData have only one member, procno, instead of replacing the
struct with a single variable.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2021-03-10 20:26:56 -0800, Andres Freund wrote:
+static void
+attach_shared_stats(void)
+{
+	MemoryContext oldcontext;

+	/*
+	 * The first attacher backend may still reading the stats file, or the
+	 * last detacher may writing it. Wait for the work to finish.
+	 */

I still believe this kind of approach is too complicated, and we should
simply not do any of this "first attacher" business. Instead read it in
the startup process or such.
I started changing the patch to address my complaints. I'll try to do
it as an incremental patch on top of your 0004, but it might become too
unwieldy. Not planning to touch other patches for now (and would be
happy if the first few were committed). I do think we'll have to
split 0004 a bit - it's just too large to commit as is, I think.
Greetings,
Andres Freund
At Fri, 12 Mar 2021 10:38:00 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 12 Mar 2021 10:03:31 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
I moved the archiver from its current location in the comment to next to
"walsenders", which are to be terminated along with the archiver.

Attached is only 0003 of the new version, based on the last one
from Fujii-san.

Thanks for updating the patch! But you forgot to add the changes
related to pgprocno into the patch?

Although I intentionally didn't do that because "it really doesn't
matter", please wait a while for the new version. I'm going to
make PgArchData have only one member, procno, instead of replacing the
struct with a single variable.
I noticed that I accidentally removed the launch-suppression feature
that avoids frequent relaunching. That mechanism is needed on
the postmaster side. I added PgArchIsSuppressed() to do the same check
as the old pgarch_start() and made it called as part of
PgArchStartupAllowed().
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Thu, 11 Mar 2021 19:22:57 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2021-03-10 20:26:56 -0800, Andres Freund wrote:
+static void
+attach_shared_stats(void)
+{
+	MemoryContext oldcontext;

+	/*
+	 * The first attacher backend may still reading the stats file, or the
+	 * last detacher may writing it. Wait for the work to finish.
+	 */

I still believe this kind of approach is too complicated, and we should
simply not do any of this "first attacher" business. Instead read it in
the startup process or such.

I started changing the patch to address my complaints. I'll try to do
it as an incremental patch on top of your 0004, but it might become too
unwieldy. Not planning to touch other patches for now (and would be
Sorry for bothering you with that, but thank you very much for the
labor. Actually, the startup process is always the first (and only)
attacher, but the last detacher may be any of checkpointer, archiver or
walsender. And I was reluctant to add a new stats-mechanism initializer
other than attach_shared_stats. Those are the reasons (to me) for the
*first* attacher mechanism.

However, surely we can separate the "first attacher business" into a
separate function. And maybe the "fear" behind the complexity is
groundless and such an overlap might not actually happen.

The current mechanism repeats the read->write process twice while
starting up the server, which is an annoyance to me.
happy if the first few were committed). I do think we'll have to
split 0004 a bit - it's just too large to commit as is, I think.
Deeply agreed. 0004 is too large and quite confusing to diff. It is
always tough to resolve a conflict :(
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2021/03/12 13:49, Kyotaro Horiguchi wrote:
At Fri, 12 Mar 2021 10:38:00 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 12 Mar 2021 10:03:31 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
I moved the archiver from its current location in the comment to next to
"walsenders", which are to be terminated along with the archiver.

Attached is only 0003 of the new version, based on the last one
from Fujii-san.

Thanks for updating the patch! But you forgot to add the changes
related to pgprocno into the patch?

Although I intentionally didn't do that because "it really doesn't
matter", please wait a while for the new version. I'm going to
make PgArchData have only one member, procno, instead of replacing the
struct with a single variable.

I noticed that I accidentally removed the launch-suppression feature
that avoids frequent relaunching. That mechanism is needed on
the postmaster side. I added PgArchIsSuppressed() to do the same check
as the old pgarch_start() and made it called as part of
PgArchStartupAllowed().
You're right! But the function name PgArchIsSuppressed() doesn't sound good
to me. What about something like PgArchCanRestart()?
This is not the fault of this patch, but shouldn't last_pgarch_start_time
be initialized to zero?
+	if ((curtime - last_pgarch_start_time) < PGARCH_RESTART_INTERVAL)
+		return true;
Why did you remove the cast to unsigned int there?
+ /*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ PgArch->pgprocno = MyProc->pgprocno;
The comment should be updated?
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
At Fri, 12 Mar 2021 15:13:15 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/12 13:49, Kyotaro Horiguchi wrote:
I noticed that I accidentally removed the launch-suppression feature
that avoids frequent relaunching. That mechanism is needed on
the postmaster side. I added PgArchIsSuppressed() to do the same check
as the old pgarch_start() and made it called as part of
PgArchStartupAllowed().

You're right! But the function name PgArchIsSuppressed() doesn't sound good
to me. What about something like PgArchCanRestart()?
The reason for the name was that the positive-meaning names that came to
my mind are confusing with PgArchStartupAllowed(). The name
PgArchCanRestart suggests that it's usable only when
restarting. However, the function needs to be called at the first launch
as well, since last_pgarch_start_time needs to be updated every
time the archiver is launched.

Anyway, in the attached the name is changed without changing its usage.

# I don't like that it uses "can" to mean "allowed" so much. The archiver
# actually can restart but is just inhibited from restarting.
This is not the fault of this patch, but shouldn't last_pgarch_start_time
be initialized to zero?
Right. I noticed that but forgot to fix it.
+	if ((curtime - last_pgarch_start_time) < PGARCH_RESTART_INTERVAL)
+		return true;

Why did you remove the cast to unsigned int there?
The cast converts small negative values to large numbers, so the code
looks like it intends to allow the archiver to be launched when curtime
goes behind last_pgarch_start_time. That is the case where the on-memory
data is corrupt. I'm not sure it's worth worrying about, and in the first
place, if we want to care about that case we should explicitly compare
the operands of the subtraction. I did that in the attached.
And last_pgarch_start_time is accessed only in that function, so I
moved it inside the function.
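(For illustration, the resulting check could look roughly like this; a
sketch rather than the patch text, with the name following the
suggestion above:)

static bool
PgArchCanRestart(void)
{
	static time_t last_pgarch_start_time = 0;
	time_t		curtime = time(NULL);

	/*
	 * Compare the operands explicitly instead of relying on an unsigned
	 * cast, so a clock that has gone backwards doesn't suppress the
	 * launch.
	 */
	if (curtime >= last_pgarch_start_time &&
		curtime - last_pgarch_start_time < PGARCH_RESTART_INTERVAL)
		return false;

	last_pgarch_start_time = curtime;
	return true;
}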
+	/*
+	 * Advertise our latch that backends can use to wake us up while we're
+	 * sleeping.
+	 */
+	PgArch->pgprocno = MyProc->pgprocno;

The comment should be updated?
Hmm. What is advertised is our pgprocno. Fixed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On 2021/03/12 17:24, Kyotaro Horiguchi wrote:
At Fri, 12 Mar 2021 15:13:15 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/12 13:49, Kyotaro Horiguchi wrote:
I noticed that I accidentally removed the launch-suppression feature
that avoids frequent relaunching. That mechanism is needed on
the postmaster side. I added PgArchIsSuppressed() to do the same check
as the old pgarch_start() and made it called as part of
PgArchStartupAllowed().

You're right! But the function name PgArchIsSuppressed() doesn't sound good
to me. What about something like PgArchCanRestart()?

The reason for the name was that the positive-meaning names that came to
my mind are confusing with PgArchStartupAllowed(). The name
PgArchCanRestart suggests that it's usable only when
restarting. However, the function needs to be called at the first launch
as well, since last_pgarch_start_time needs to be updated every
time the archiver is launched.

Anyway, in the attached the name is changed without changing its usage.
Thanks! If we come up with a better name, let's rename the function later.
# I don't like that it uses "can" to mean "allowed" so much. The archiver
# actually can restart but is just inhibited from restarting.

This is not the fault of this patch, but shouldn't last_pgarch_start_time
be initialized to zero?

Right. I noticed that but forgot to fix it.
+	if ((curtime - last_pgarch_start_time) < PGARCH_RESTART_INTERVAL)
+		return true;

Why did you remove the cast to unsigned int there?
The cast converts small negative values to large numbers, so the code
looks like it intends to allow the archiver to be launched when curtime
goes behind last_pgarch_start_time. That is the case where the on-memory
data is corrupt. I'm not sure it's worth worrying about, and in the first
place, if we want to care about that case we should explicitly compare
the operands of the subtraction. I did that in the attached.
That's an idea. But a similar calculation using that cast is used in
other places (e.g., in pgarch_MainLoop()), so I'm thinking that it's
better not to change that...
And last_pgarch_start_time is accessed only in that function, so I
moved it inside the function.
OK.
+	/*
+	 * Advertise our latch that backends can use to wake us up while we're
+	 * sleeping.
+	 */
+	PgArch->pgprocno = MyProc->pgprocno;

The comment should be updated?

Hmm. What is advertised is our pgprocno. Fixed.
Hmm. What is advertised is our pgprocno.. Fixed.
OK.
Thanks for updating the patch! I applied some minor changes to your patch.
Attached is the updated version of the patch. I'm thinking of committing this version.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Attachments:
Hi,
On 2021-03-11 19:22:57 -0800, Andres Freund wrote:
I started changing the patch to address my complaints. I'll try to do
it as an incremental patch on top of your 0004, but it might become too
unwieldy. Not planning to touch other patches for now (and would be
happy if the first few were committed). I do think we'll have to
split 0004 a bit - it's just too large to commit as is, I think.
Far from being done (and the individual commits are just me working, not
something that is intended to survive), but I thought it might be
helpful to post my WIP git tree:
https://github.com/anarazel/postgres/commits/shmstat
I do think that the removal of the "attach" logic, as well as the
more explicit naming differentiating between the three different hash
tables is a pretty clear improvement. It's way too easy to get turned
around otherwise.
I suspect that to make all of this workable we'll have to add a few
preliminary cleanup patches. I'm currently thinking:
1) There's a bunch of functions in places that don't make sense.
/* ------------------------------------------------------------
* Local support functions follow
* ------------------------------------------------------------
*/
[internal stuff]
...
void
pgstat_send_archiver(const char *xlog, bool failed)
...
void
pgstat_send_bgwriter(void)
...
void
pgstat_report_wal(void)
...
bool
pgstat_send_wal(bool force)
...
[internal stuff]
...
void
pgstat_count_slru_page_zeroed(int slru_idx)
...
I think it'd make sense to separately clean that up.
2) src/backend/postmaster/pgstat.c currently contains at least two,
effectively independent, subsystems. Arguably more:
a) cumulative stats collection infrastructure: sending data to the
persistent stats file, and reading from it.
b) "content aware" cumulative statistics function
c) "current" activity infrastructure, around PgBackendStatus (which
basically boils down to pg_stat_activity et al.)
d) wait events
e) command progress stuff
They don't actually have much to do with each other, except being
related to stats. Even without the shared memory stats, having these
be in one file makes pretty much no sense. Having them all under one
common pgstat_* prefix is endlessly confusing too.
I think before making things differently complicated with this patch,
we need to clean this up, unfortunately. I think we should initially have
- src/backend/postmaster/pgstat.c, for a), b) above
- src/backend/utils/activity/backend_status.c for c)
- src/backend/utils/activity/wait_events.c for d)
- src/backend/utils/activity/progress.c for e)
Not at all sure about the names, but something roughly like this
would imo make sense.
The next thing to note is that after this whole patchseries, having the
remaining functionality in src/backend/postmaster/pgstat.c doesn't make
sense. The things in postmaster/ are related to postmaster
sub-processes, not random pieces of backend infrastructure. Therefore I
am thinking that patch 0004 should be changed so it basically adds all
the changed code to two new files:
- src/backend/utils/activity/stats.c - for the stats keeping
infrastructure (shared memory hash table, pending table, etc)
- src/backend/utils/activity/stats_report.c - for pgstat_report_*,
pgstat_count_*, pgstat_update_*, flush_*, i.e. everything that knows
about specific kinds of stats.
The reason for that split is that imo the two pieces of code are largely
independent. One shouldn't need to understand the way stats are stored
in any sort of detail to add a new stats field and vice versa.
Horiguchi-san, is there a chance you could add a few tests (on master)
that test/document the way stats are kept across "normal" restarts, and
thrown away after crash restarts/immediate restarts, and also thrown away
after graceful streaming rep shutdowns?
Greetings,
Andres Freund
On Sat, Mar 13, 2021 at 7:20 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-03-11 19:22:57 -0800, Andres Freund wrote:
I started changing the patch to address my complaints. I'll try to do
it as an incremental patch on top of your 0004, but it might become too
unwieldy. Not planning to touch other patches for now (and would be
happy if the first few were committed). I do think we'll have to
split 0004 a bit - it's just too large to commit as is, I think.

Far from being done (and the individual commits are just me working, not
something that is intended to survive), but I thought it might be
helpful to post my WIP git tree:
https://github.com/anarazel/postgres/commits/shmstat

I do think that the removal of the "attach" logic, as well as the
more explicit naming differentiating between the three different hash
tables is a pretty clear improvement. It's way too easy to get turned
around otherwise.I suspect to make all of this workable we'll have to add a few
preliminary cleanup patches. I'm currently thinking:1) There's a bunch of functions in places that don't make sense.
/* ------------------------------------------------------------
* Local support functions follow
* ------------------------------------------------------------
*/[internal stuff]
...
void
pgstat_send_archiver(const char *xlog, bool failed)
...
void
pgstat_send_bgwriter(void)
...
void
pgstat_report_wal(void)
...
bool
pgstat_send_wal(bool force)
...
[internal stuff]
...
void
pgstat_count_slru_page_zeroed(int slru_idx)...
I think it'd make sense to separately clean that up.
2) src/backend/postmaster/pgstat.c currently contains at least two,
effectively independent, subsystems. Arguably more:
a) cumulative stats collection infrastructure: sending data to the
persistent stats file, and reading from it.
b) "content aware" cumulative statistics functionc) "current" activity infrastructure, around PgBackendStatus (which
basically boils down to pg_stat_activity et al.)
d) wait events
e) command progress stuffThey don't actually have much to do with each other, except being
related to stats. Even without the shared memory stats, having these
be in one file makes pretty much no sense. Having them all under one
common pgstat_* prefix is endlessly confusing too.I think before making things differently complicated with this patch,
we need to clean this up, unfortunately. I think we should initially have
- src/backend/postmaster/pgstat.c, for a), b) above
- src/backend/utils/activity/backend_status.c for c)
- src/backend/utils/activity/wait_events.c for d)
- src/backend/utils/activity/progress.c for e)Not at all sure about the names, but something roughly like this
would imo make sense.
+1, definitely.
I've been thinking along these lines more than once when poking at
different patches around that area, but they've never been big enough
to justify the restructuring on their own. Which then of course just
helps accumulate the problem...
If anything, I'd even consider splitting (a) and (b) above out into
separate ones as well. But hey, I see you got to that in your next
paragraph :)
This does seem like a good time to do it, given the size of this
suggested change.
The next thing to note is that after this whole patchseries, having the
remaining functionality in src/backend/postmaster/pgstat.c doesn't make
sense. The things in postmaster/ are related to postmaster
sub-processes, not random pieces of backend infrastructure. Therefore I
am thinking that patch 0004 should be changed so it basically adds all
the changed code to two new files:
- src/backend/utils/activity/stats.c - for the stats keeping
infrastructure (shared memory hash table, pending table, etc)
- src/backend/utils/activity/stats_report.c - for pgstat_report_*,
pgstat_count_*, pgstat_update_*, flush_*, i.e. everything that knows
about specific kinds of stats.

The reason for that split is that imo the two pieces of code are largely
independent. One shouldn't need to understand the way stats are stored
in any sort of detail to add a new stats field and vice versa.
Agreed as well.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hi,
On 2021-03-13 12:53:30 +0100, Magnus Hagander wrote:
On Sat, Mar 13, 2021 at 7:20 AM Andres Freund <andres@anarazel.de> wrote:
I think before making things differently complicated with this patch,
we need to clean this up, unfortunately. I think we should initially have
- src/backend/postmaster/pgstat.c, for a), b) above
- src/backend/utils/activity/backend_status.c for c)
- src/backend/utils/activity/wait_events.c for d)
- src/backend/utils/activity/progress.c for e)

Not at all sure about the names, but something roughly like this
would imo make sense.

+1, definitely.
Cool. I think I can introduce them without causing too much breakage in
the shmstats patch.
Without yet having done it, I think I'd not touch the function name
prefixes during the move and leave the prototypes in the pgstat.h
header? Then we could split the header off in a second step (with
backward compat includes in pgstat.h), potentially combined with a
rename? But I'm happy to consider different approaches.
I've been thinking along these lines more than once when poking at
different patches around that area, but they've never been big enough
to justify the restructuring on their own. Which then of course just
helps accumulate the problem...
Yea...
If anything, I'd even consider splitting (a) and (b) above out into
separate ones as well. But hey, I see you got to that in your next
paragraph :)
I wondered about doing that as a preceding step as well, but it seems a
bit pointless if all that code needs to be moved elsewhere, and little
of it survives unchanged...
The next thing to note is that after this whole patchseries, having the
remaining functionality in src/backend/postmaster/pgstat.c doesn't make
sense. The things in postmaster/ are related to postmaster
sub-processes, not random pieces of backend infrastructure. Therefore I
am thinking that patch 0004 should be changed so it basically adds all
the changed code to two new files:
- src/backend/utils/activity/stats.c - for the stats keeping
infrastructure (shared memory hash table, pending table, etc)
- src/backend/utils/activity/stats_report.c - for pgstat_report_*,
pgstat_count_*, pgstat_update_*, flush_*, i.e. everything that knows
about specific kinds of stats.

The reason for that split is that imo the two pieces of code are largely
independent. One shouldn't need to understand the way stats are stored
in any sort of detail to add a new stats field and vice versa.

Agreed as well.
Cool. I'll give it a try.
Greetings,
Andres Freund
At Sat, 13 Mar 2021 10:05:21 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2021-03-13 12:53:30 +0100, Magnus Hagander wrote:
On Sat, Mar 13, 2021 at 7:20 AM Andres Freund <andres@anarazel.de> wrote:
I think before making things differently complicated with this patch,
we need to clean this up, unfortunately. I think we should initially have
- src/backend/postmaster/pgstat.c, for a), b) above
- src/backend/utils/activity/backend_status.c for c)
- src/backend/utils/activity/wait_events.c for d)
- src/backend/utils/activity/progress.c for e)

Not at all sure about the names, but something roughly like this
would imo make sense.

+1, definitely.
Cool. I think I can introduce them without causing too much breakage in
the shmstats patch.
Previously the patch split pgstat.c into pgstat.c and besomething.c,
but that was rejected for its large footprint. Of course I happily agree
to splitting (a,b) and c. And I also agree to splitting out d and e.
Without yet having done it, I think I'd not touch the function name
prefixes during the move and leave the prototypes in the pgstat.h
header? Then we could split the header off in a second step (with
backward compat includes in pgstat.h), potentially combined with a
rename? But I'm happy to consider different approaches.
That sounds reasonable. (I did split the header files at the time.)
I've been thinking along these lines more than once when poking at
different patches around that area, but they've never been big enough
to justify the restructuring on their own. Which then of course just
helps accumulate the problem...

Yea...
^^;
If anything, I'd even consider splitting (a) and (b) above out into
separate ones as well. But hey, I see you got to that in your next
paragraph :)

I wondered about doing that as a preceding step as well, but it seems a
bit pointless if all that code needs to be moved elsewhere, and little
of it survives unchanged...

The next thing to note is that after this whole patchseries, having the
remaining functionality in src/backend/postmaster/pgstat.c doesn't make
sense. The things in postmaster/ are related to postmaster
sub-processes, not random pieces of backend infrastructure. Therefore I
am thinking that patch 0004 should be changed so it basically adds all
the changed code to two new files:
- src/backend/utils/activity/stats.c - for the stats keeping
infrastructure (shared memory hash table, pending table, etc)
- src/backend/utils/activity/stats_report.c - for pgstat_report_*,
pgstat_count_*, pgstat_update_*, flush_*, i.e. everything that knows
about specific kinds of stats.

The reason for that split is that imo the two pieces of code are largely
independent. One shouldn't need to understand the way stats are stored
in any sort of detail to add a new stats field and vice versa.

Agreed as well.
FWIW, agreed, too.
Cool. I'll give it a try.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 12 Mar 2021 23:33:05 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
On 2021/03/12 17:24, Kyotaro Horiguchi wrote:
At Fri, 12 Mar 2021 15:13:15 +0900, Fujii Masao
<masao.fujii@oss.nttdata.com> wrote in

On 2021/03/12 13:49, Kyotaro Horiguchi wrote:
I noticed that I accidentally removed the launch-suppression feature
that is to avoid frequent relaunching. That mechanism is needed on
the postmaster side. I added PgArchIsSuppressed() to do the same check
with the old pgarch_start() and make it called as a part of
PgArchStartupAllowed().

You're right! At least to me, the function name PgArchIsSuppressed()
doesn't sound good. What about something like PgArchCanRestart()?

The reason for the name was that the positive-meaning names I came up
with are confusing with PgArchStartupAllowed(). The name
PgArchCanRestart suggests that it's usable only when
restarting. However, the function also needs to be called on the
first launch, since last_pgarch_start_time needs to be updated every
time the archiver is launched.
Anyway, in the attached the name is changed without changing its usage.

Thanks! If we come up with a better name, let's rename the function
later.
Ok.
# I don't like that it uses "can" to mean "allowed" so much. The archiver
# actually can restart but is just inhibited from restarting.

This is not the fault of this patch, but shouldn't last_pgarch_start_time be
initialized with zero?

Right. I noticed that but forgot to fix it.
+	if ((curtime - last_pgarch_start_time) < PGARCH_RESTART_INTERVAL)
+		return true;

Why did you remove the cast to unsigned int there?
The cast converts small negative values into large numbers, so the code
looks like it intends to allow the archiver to be launched when curtime
falls behind last_pgarch_start_time. That is the case where the on-memory
data is corrupt. I'm not sure it's worth worrying about, and in the first
place, if we want to handle that case we should explicitly compare the
operands of the subtraction. I did that in the attached.

That's an idea. But a similar calculation using that cast is used in
other places (e.g., in pgarch_MainLoop()), so I'm thinking that it's
better not to change that...
Mmm. I'm fine with it:( (:p)
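To make the difference concrete, here is a minimal sketch of the two
styles compared above; the operands are pg_time_t, and the
return-true-to-suppress convention follows the quoted snippet:

	/*
	 * Style A: the unsigned cast turns a negative difference into a huge
	 * value, so a backwards clock step does NOT suppress the relaunch
	 * (the style used elsewhere, e.g. in pgarch_MainLoop()).
	 */
	if ((unsigned int) (curtime - last_pgarch_start_time) < PGARCH_RESTART_INTERVAL)
		return true;			/* suppress the relaunch */

	/*
	 * Style B: spell the clock-went-backwards case out explicitly, with
	 * the same effect, instead of relying on unsigned wraparound.
	 */
	if (curtime >= last_pgarch_start_time &&
		curtime - last_pgarch_start_time < PGARCH_RESTART_INTERVAL)
		return true;			/* suppress the relaunch */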
And the last_pgarch_start_time is accessed only in the function. I
moved it to inside the function.

OK.
+	/*
+	 * Advertise our latch that backends can use to wake us up while we're
+	 * sleeping.
+	 */
+	PgArch->pgprocno = MyProc->pgprocno;

The comment should be updated?
Hmm. What is advertised is our pgprocno.. Fixed.
OK.
Thanks for updating the patch! I applied some minor changes into your
patch.
Attached is the updated version of the patch. I'm thinking to commit
this version.
Thanks for committing this! I'm very happy to see this reduces the
size of this patchset.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 12 Mar 2021 22:20:40 -0800, Andres Freund <andres@anarazel.de> wrote in
Horiguchi-san, is there a chance you could add a few tests (on master)
that test/document the way stats are kept across "normal" restarts, and
thrown away after crash restarts/immediate restarts, and also thrown away
after graceful streaming replication shutdowns?
Roger. I'll give it a try.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 15 Mar 2021 17:49:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Thanks for committing this! I'm very happy to see this reduces the
size of this patchset.
Now that 0003 is committed as d75288fb27, and 33394ee6f2 conflicts
with old 0004, I'd like to post a rebased version for future work.
The commit 33394ee6f2 adds an on-exit forced write of WAL stats in the
walwriter, and in this patch that part appears to have been
removed. However, this patchset already does that by calling
pgstat_report_stat from pgstat_beshutdown_hook.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
On 2021-03-13 10:05:21 -0800, Andres Freund wrote:
Cool. I'll give it a try.
I have a few questions about the patch:
- Why was collect_oids() changed to a different hashtable as part of
this change? Seems fairly independent?
- What's the point of all those cached_* stuff? There's not a single
comment explaining it as far as I can tell...
Several of them are never used as a cache! E.g. cached_archiverstats,
cached_bgwriterstats, ...
- What is the idea behind pgstat_reset_shared_counters() using
pgstat_copy_global_stats() to reset, using StatsShmem->*_reset_offset?
But then still taking a lock in pgstat_fetch_stat_*? Again, no
comments explaining what the goal is.
It kinda looks like you tried to make both read and write paths not
use the lock, but then ended up using a lock?
Do you have some benchmarks that you used to verify performance?
I think I'm going to try to split the storage of fixed-size stats in
StatsShmemStruct into a separate patch. That's already a pretty large
change, and it's pretty much unrelated to the rest.
Greetings,
Andres Freund
Hi,
On 2021-03-15 19:04:29 -0700, Andres Freund wrote:
On 2021-03-13 10:05:21 -0800, Andres Freund wrote:
Cool. I'll give it a try.
Ooops, I was intending to write a bit about my attempt at that:
I did roughly the first steps of the split as I had outlined. I moved:
1) wait event related functions into utils/activity/wait_event.c /
wait_event.h
2) "backend status" functionality (PgBackendStatus stuff) into
utils/activity/backend_status.c
3) "progress" related functionality into
utils/activity/backend_progress.c
I think 1 and 2 are good (albeit in need of further polish). I'm a bit
less sure about 3:
- There's a dependency from backend_status.h to backend_progress.h,
because it needs PGSTAT_NUM_PROGRESS_PARAM etc.
- it's a fairly small amount of code
- there's potential for confusion, because there's also
include/commands/progress.h
On balance I think 3) is probably worth it, but I'm far from confident.
Happy to bikeshed about the names for a moment...
Questions / Points:
- I'm inclined to leave pgstat_report_{activity, tmpfile, appname,
timestamp, ..} alone naming-wise, but to rename pgstat_bestart() to
something like pgbestat_start()?
- I've not gone through all the files that could now remove pgstat.h,
replacing it with wait_event.h - I'm thinking it might be worth
waiting till just after code freeze with that (there'll new additions,
and it's likely to cause conflicts)?
- backend_status.h needs miscadmin.h, due to BackendType. Imo that's a
header we should try to avoid exposing in headers if possible. But
right now there's no good place to move BackendType to. So I'd let
this slide for now.
On 2021-03-15 19:04:29 -0700, Andres Freund wrote:
I have a few questions about the patch:
- Why was collect_oids() changed to a different hashtable as part of
this change? Seems fairly independent?

- What's the point of all those cached_* stuff? There's not a single
comment explaining it as far as I can tell...

Several of them are never used as a cache! E.g. cached_archiverstats,
cached_bgwriterstats, ...

- What is the idea behind pgstat_reset_shared_counters() using
pgstat_copy_global_stats() to reset, using StatsShmem->*_reset_offset?
But then still taking a lock in pgstat_fetch_stat_*? Again, no
comments explaining what the goal is.

It kinda looks like you tried to make both read and write paths not
use the lock, but then ended up using a lock?

Do you have some benchmarks that you used to verify performance?
I think I'm going to try to split the storage of fixed-size stats in
StatsShmemStruct into a separate patch. That's already a pretty large
change, and it's pretty much unrelated to the rest.
Some more:
- pgstat_vacuum_stat() talks about using three phases, one with only a
shared lock. That doesn't seem to exist (anymore)?
- WAIT_EVENT_PGSTAT_MAIN was renamed to WAIT_EVENT_READING_STATS_FILE. I
assume that was intended to be used in pgstat_read_statsfile(),
pgstat_write_statsfile()? Not sure there's much point in adding it,
because at that stage nobody can observe wait events anyway...
- There are a lot of diffs like
@@ -1267,7 +1264,7 @@ pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
if ((dbentry = pgstat_fetch_stat_dbentry(dbid)) == NULL)
result = 0;
else
- result = (int64) (dbentry->n_xact_commit);
+ result = (int64) (dbentry->counts.n_xact_commit);
PG_RETURN_INT64(result);
}
I wonder if we should instead just return *entry->counts from
pgstat_fetch_stat_* (adjusting the names). There's as far as I can
tell never a reason for pgstatfuncs.c (and others if there are) to
need the remaining members? That'd cut the patch down a lot I
think?
What's especially nice is that afterwards PgStat_StatEntryHeader
wouldn't need to be exposed anymore, and in turn, atomic.h wouldn't
need to be included anymore.
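For illustration, a minimal sketch of that suggested shape;
PgStat_StatDBCounts is a hypothetical name for the nested counts struct:

	static inline PgStat_StatDBCounts *
	pgstat_fetch_stat_dbentry_counts(Oid dbid)
	{
		PgStat_StatDBEntry *dbentry = pgstat_fetch_stat_dbentry(dbid);

		/* callers only ever see the counters, never the entry header */
		return dbentry ? &dbentry->counts : NULL;
	}

	Datum
	pg_stat_get_db_xact_commit(PG_FUNCTION_ARGS)
	{
		PgStat_StatDBCounts *counts =
			pgstat_fetch_stat_dbentry_counts(PG_GETARG_OID(0));

		PG_RETURN_INT64(counts ? (int64) counts->n_xact_commit : 0);
	}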
- pgstat.h had dshash_table_handle members in PgStat_StatDBEntry, I
assume those were leftovers from an older version, not something you
foresee needing?
- There's a pretty weird change in relcache.c:
+ /* break mutual link with stats entry */
+ pgstat_delinkstats(relation);
+
+ if (relation->rd_rel)
did you add that if intentionally?
- Is there any reason to treat PgStat_ReplSlot as something of a dynamic
number, instead of just having a fixed number of slots in
StatsShmemStruct? The GUC can't change without a restart...
Greetings,
Andres Freund
Attachments:
At Mon, 15 Mar 2021 19:04:29 -0700, Andres Freund <andres@anarazel.de> wrote in
I have a few questions about the patch:
- Why was collect_oids() changed to a different hashtable as part of
this change? Seems fairly independent?
Right. It was changed at the time I used simplehash for some other
stuff.
- What's the point of all those cached_* stuff? There's not a single
comment explaining it as far as I can tell...

Several of them are never used as a cache! E.g. cached_archiverstats,
cached_bgwriterstats, ...
They're just local copies of the data in shared memory. That works to
allow early release of locks on shared objects. And they're there to avoid
repeated copying of the same stats entry and to offer consistent
values. For example, pgstat_fetch_stat_tabentry() is called repeatedly
in a single view.
About cached_archiverstats... cached_archiverstats_is_valid is utterly
useless. Since we don't offer in-transaction consistency for the
values, I think the *_is_valid flags are useless.
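For reference, a minimal sketch of the caching pattern described,
assuming the StatsShmem->archiver_stats member name:

	static PgStat_ArchiverStats cached_archiverstats;

	PgStat_ArchiverStats *
	pgstat_fetch_stat_archiver(void)
	{
		/* copy out under the lock, releasing it as early as possible ... */
		LWLockAcquire(StatsLock, LW_SHARED);
		memcpy(&cached_archiverstats, &StatsShmem->archiver_stats,
			   sizeof(PgStat_ArchiverStats));
		LWLockRelease(StatsLock);

		/* ... so repeated fetches within one view read the local copy */
		return &cached_archiverstats;
	}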
- What is the idea behind pgstat_reset_shared_counters() using
pgstat_copy_global_stats() to reset, using StatsShmem->*_reset_offset?
But then still taking a lock in pgstat_fetch_stat_*? Again, no
comments explaining what the goal is.

It kinda looks like you tried to make both read and write paths not
use the lock, but then ended up using a lock?
Mmm. That is a kind of inconsistency between the view from the upper side
and from the lower side. As you say, I tried to get rid of *the*
StatsLock in pgstat_reset_shared_counters and eventually forgot to
remove the locking. I think the lock is no longer necessary.
Do you have some benchmarks that you used to verify performance?
/messages/by-id/20201008.160326.2246946707652981235.horikyota.ntt@gmail.com
It is using pgbench, with 800 clients with 20 threads.
Graphs of performance gain/loss from the master for the following
benchmarks are attached.
- Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients.
- Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients.
v36 showed about 60% degradation (TPS reduced to 1/3 of master) at
600 clients, but it has disappeared as of v39. The graphs are of
v39. I'm asking for the script that was used for the benchmark and will
send it later.
I think I'm going to try to split the storage of fixed-size stats in
StatsShmemStruct into a separate patch. That's already a pretty large
change, and it's pretty much unrelated to the rest.
I'm not sure of the ratio between the "Global stats structs" (aka
fixed-size stats?) and the other stuff, but agreed that they are almost
unrelated to each other.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Tue, 16 Mar 2021 16:44:38 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 15 Mar 2021 19:04:29 -0700, Andres Freund <andres@anarazel.de> wrote in
Do you have some benchmarks that you used to verify performance?
/messages/by-id/20201008.160326.2246946707652981235.horikyota.ntt@gmail.com
It is using pgbench, with 800 clients with 20 threads.
Graphs of performance gain/loss from the master for the following
benchmarks are attached.

- Fetching 1 tuple from 1 of 100 tables from 100 to 800 clients.
- Fetching 1 tuple from 1 of 10 tables from 100 to 800 clients.

v36 showed about 60% degradation (TPS reduced to 1/3 of master) at
600 clients, but it has disappeared as of v39. The graphs are of
v39. I'm asking for the script that was used for the benchmark and will
send it later.
This is that.
create_tables.sh: creates tables for benchmarking.
simple_use_file_ac_1000.sql: randomly selects 1 tuple from a table.
$ createdb benchdb
$ ./create_tables.sh benchdb 100000 # the case of 100000 tables
$ pgbench benchdb -n -c100 -j20 -T300 -r -P 10 -f simple_use_file_ac_1000.sql
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
On Mon, Mar 15, 2021 at 10:56 PM Andres Freund <andres@anarazel.de> wrote:
I did roughly the first steps of the split as I had outlined. I moved:
1) wait event related functions into utils/activity/wait_event.c /
wait_event.h

2) "backend status" functionality (PgBackendStatus stuff) into
utils/activity/backend_status.c

3) "progress" related functionality into
utils/activity/backend_progress.c

In general, I like this. I'm not too sure about the names. I realize
In general, I like this. I'm not too sure about the names. I realize
you don't want to have functions called status.c and progress.c,
because that's awful generic, but now you have backend_progress.c,
backend_status.c, and wait_event.c, which makes the last one look a
little strange. Maybe command_progress.c instead of
backend_progress.c?
I think 1 and 2 are good (albeit in need of further polish). I'm a bit
less sure about 3:
- There's a dependency from backend_status.h to backend_progress.h,
because it needs PGSTAT_NUM_PROGRESS_PARAM etc.
That doesn't seem like a catastrophe.
- it's a fairly small amount of code
But it's not bad to have it separate.
- there's potential for confusion, because there's also
include/commands/progress.h
That could be merged, perhaps. I think I only created that because I
didn't want to jam too much stuff into pgstat.h. But if it has its own
header then jamming some more stuff in there seems more OK.
- I'm inclined to leave pgstat_report_{activity, tmpfile, appname,
timestamp, ..} alone naming-wise, but to rename pgstat_bestart() to
something like pgbestat_start()?
I'd probably rename them e.g. command_progress_start(),
command_progress_update_param(), etc.
- I've not gone through all the files that could now remove pgstat.h,
replacing it with wait_event.h - I'm thinking it might be worth
waiting till just after code freeze with that (there'll new additions,
and it's likely to cause conflicts)?
Don't care.
- backend_status.h needs miscadmin.h, due to BackendType. Imo that's a
header we should try to avoid exposing in headers if possible. But
right now there's no good place to move BackendType to. So I'd let
this slide for now.
Why the concern? miscadmin.h is extremely widely-included already.
Maybe it should be broken up into pieces so that we're not including
so MUCH stuff in a zillion places, but the header that contains the
definition of CHECK_FOR_INTERRUPTS() is always going to be needed in a
ton of spots. Honestly, I wonder why we don't just put that part in
postgres.h. If you're writing any significant amount of code and you
don't have at least one CHECK_FOR_INTERRUPTS() in there, you're
probably doing it wrong.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2021-03-16 15:08:39 -0400, Robert Haas wrote:
On Mon, Mar 15, 2021 at 10:56 PM Andres Freund <andres@anarazel.de> wrote:
I did roughly the first steps of the split as I had outlined. I moved:
1) wait event related functions into utils/activity/wait_event.c /
wait_event.h

2) "backend status" functionality (PgBackendStatus stuff) into
utils/activity/backend_status.c

3) "progress" related functionality into
utils/activity/backend_progress.c

In general, I like this. I'm not too sure about the names. I realize
you don't want to have files called status.c and progress.c,
because that's awful generic, but now you have backend_progress.c,
backend_status.c, and wait_event.c, which makes the last one look a
little strange. Maybe command_progress.c instead of
backend_progress.c?
I'm not thrilled about the names I ended up with either, so happy to get
some ideas.
I did consider command_progress.c too - but that seems confusing because
there's src/include/commands/progress.h, which is imo a different layer
than what pgstat/backend_progress provide. So I thought splitting things
up so that backend_progress.[ch] provide the place to store the progress
values, and commands/progress.h defining the meaning of the values as
used for in-core postgres commands would make sense. I could see us
using the general progress infrastructure for things that'd not fit
super well into commands/* at some point...
But I'd also be ok with folding in commands/progress.h.
I think 1 and 2 are good (albeit in need of further polish). I'm a bit
less sure about 3:
- There's a dependency from backend_status.h to backend_progress.h,
because it needs PGSTAT_NUM_PROGRESS_PARAM etc.

That doesn't seem like a catastrophe.
- it's a fairly small amount of code
But it's not bad to have it separate.
Agreed.
- I'm inclined to leave pgstat_report_{activity, tmpfile, appname,
timestamp, ..} alone naming-wise, but to rename pgstat_bestart() to
something like pgbestat_start()?

I'd probably rename them e.g. command_progress_start(),
command_progress_update_param(), etc.
Hm. There's ~250 calls to pgstat_report_*. Which are you proposing to
rename? In my mind there's at least the following groups with
"inaccurately" overlapping names:
1) "stats reporting" functions, like pgstat_report_{bgwriter,
archiver,...}(), pgstat_count_*(), pgstat_{init,
end}_function_usage(),
2) "backend activity" functions, like pgstat_report_activity()
2.1) "wait event" functions, like pgstat_report_wait_{start,end}()
3) "stats control" functions, like pgstat_report_stat()
4) "stats reporting" fetch functions like pgstat_fetch_stat_dbentry()
5) "backend activity" fetch functions like
pgstat_fetch_stat_numbackends(), pgstat_fetch_stat_beentry()
I'd not quite group the progress functions as part of that, because they
do already have a distinct namespace, even though perhaps pgstat_* isn't
a great prefix.
- backend_status.h needs miscadmin.h, due to BackendType. Imo that's a
header we should try to avoid exposing in headers if possible. But
right now there's no good place to move BackendType to. So I'd let
this slide for now.

Why the concern? miscadmin.h is extremely widely-included already.
In .c files, but not from a lot of headers... There's only two places,
pgstat.h and regex/regcustom.h.
For me it's a weird mix of things that should be included from very few
places, and super widely used stuff. I already feel bad including it
from .c files, but indirect includes in .h files imo should be as
narrow as possible.
Maybe it should be broken up into pieces so that we're not including
so MUCH stuff in a zillion places, but the header that contains the
definition of CHECK_FOR_INTERRUPTS() is always going to be needed in a
ton of spots. Honestly, I wonder why we don't just put that part in
postgres.h.
I'd not be against that. I'd personally put CFI() in a separate header,
but include it from postgres.h. I don't think there's much else in
there that should be as widely used. The closest is INTERRUPTS and
CRIT_SECTION stuff, but that should be less frequent.
Greetings,
Andres Freund
At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 15 Mar 2021 17:49:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Thanks for committing this! I'm very happy to see this reduces the
size of this patchset.

Now that 0003 is committed as d75288fb27, and 33394ee6f2 conflicts
with old 0004, I'd like to post a rebased version for future work.

The commit 33394ee6f2 adds an on-exit forced write of WAL stats in the
walwriter, and in this patch that part appears to have been
removed. However, this patchset already does that by calling
pgstat_report_stat from pgstat_beshutdown_hook.
Rebased and fixed two bugs. Not addressed received comments in this
version.
5f79580ad6 conflicts with this. However, the modified code is removed
by the patch. (And 081876d75e made a small conflict.)
I fixed two silly bugs along with the rebasing.
1) An assertion failure happens while accessing pg_stat_database,
since pgstat_fetch_stat_dbentry() rejected 0 as database oid.
2) pgstat_report_stat() failed to flush out global database stats (oid
= 0). I stopped collecting database entries in the existing loop on
pgStatLocalhash, then modified the existing second loop to scan
pgStatLocalhash directly for database stats entries. The second loop
might get slow when the first loop has left many table/function local
stats entries due to heavy load.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
On 2021-03-18 16:56:02 +0900, Kyotaro Horiguchi wrote:
At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 15 Mar 2021 17:49:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Thanks for committing this! I'm very happy to see this reduces the
size of this patchset.

Now that 0003 is committed as d75288fb27, and 33394ee6f2 conflicts
with old 0004, I'd like to post a rebased version for future work.

The commit 33394ee6f2 adds an on-exit forced write of WAL stats in the
walwriter, and in this patch that part appears to have been
removed. However, this patchset already does that by calling
pgstat_report_stat from pgstat_beshutdown_hook.

Rebased and fixed two bugs. The received comments are not addressed in this
version.
Since I am heavily editing the code, could you submit "functional"
changes (as opposed to fixing rebase issues) as incremental patches?
Greetings,
Andres Freund
Hi,
On 2021-03-10 20:26:56 -0800, Andres Freund wrote:
+	shhashent = dshash_find_extended(pgStatSharedHash, &key,
+									 create, nowait, create, &shfound);
+	if (shhashent)
+	{
+		if (create && !shfound)
+		{
+			/* Create new stats entry. */
+			dsa_pointer chunk = dsa_allocate0(area,
+											  pgstat_sharedentsize[type]);
+
+			shheader = dsa_get_address(area, chunk);
+			LWLockInitialize(&shheader->lock, LWTRANCHE_STATS);
+			pg_atomic_init_u32(&shheader->refcount, 0);
+
+			/* Link the new entry from the hash entry. */
+			shhashent->body = chunk;
+		}
+		else
+			shheader = dsa_get_address(area, shhashent->body);
+
+		/*
+		 * We expose this shared entry now. You might think that the entry
+		 * can be removed by a concurrent backend, but since we are creating
+		 * an stats entry, the object actually exists and used in the upper
+		 * layer. Such an object cannot be dropped until the first vacuum
+		 * after the current transaction ends.
+		 */
+		dshash_release_lock(pgStatSharedHash, shhashent);

I don't think you can safely release the lock before you incremented the
refcount? What if, once the lock is released, somebody looks up that
entry, increments the refcount, and decrements it again? It'll see a
refcount of 0 at the end and decide to free the memory. Then the code
below will access already freed / reused memory, no?
Yep, it's not even particularly hard to hit:
S0: CREATE TABLE a_table();
S0: INSERT INTO a_table();
S0: disconnect
S1: set a breakpoint to just after the dshash_release_lock(), with an
if objid == a_table_oid
S1: SELECT pg_stat_get_live_tuples('a_table'::regclass);
(will break at above breakpoint, without having incremented the
refcount yet)
S2: DROP TABLE a_table;
S2: VACUUM pg_class;
At that point S2's call to pgstat_vacuum_stat() will find the shared
stats entry for a_table, delete the entry from the shared hash table,
see that the stats data has a zero refcount, and free it. Once S1 wakes
up it'll use already freed (and potentially since reused) memory.
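For clarity, a minimal sketch of the reordering that closes this window,
using the names from the quoted patch: the refcount has to be bumped
while the dshash partition lock is still held.

	shheader = dsa_get_address(area, shhashent->body);

	/*
	 * Pin the entry before releasing the partition lock, so a concurrent
	 * pgstat_vacuum_stat() cannot observe refcount == 0 and free the body.
	 */
	pg_atomic_add_fetch_u32(&shheader->refcount, 1);

	dshash_release_lock(pgStatSharedHash, shhashent);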
Greetings,
Andres Freund
At Thu, 18 Mar 2021 01:47:20 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2021-03-18 16:56:02 +0900, Kyotaro Horiguchi wrote:
At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 15 Mar 2021 17:49:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Thanks for committing this! I'm very happy to see this reduces the
size of this patchset.Now that 0003 is committed as d75288fb27, and 33394ee6f2 conflicts
with old 0004, I'd like to post a rebased version for future work.The commit 33394ee6f2 adds on-exit forced write of WAL stats on
walwriter and in this patch that part would appear to have been
removed. However, this patchset already does that by calling to
pgstat_report_stat from pgstat_beshutdown_hook.Rebased and fixed two bugs. Not addressed received comments in this
version.Since I am heavily editing the code, could you submit "functional"
changes (as opposed to fixing rebase issues) as incremental patches?
Oh.. please wait for.. a moment, maybe.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 22 Mar 2021 09:55:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Thu, 18 Mar 2021 01:47:20 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2021-03-18 16:56:02 +0900, Kyotaro Horiguchi wrote:
At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Rebased and fixed two bugs. The received comments are not addressed in this
version.

Since I am heavily editing the code, could you submit "functional"
changes (as opposed to fixing rebase issues) as incremental patches?

Oh.. please wait for.. a moment, maybe.
This is that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
On 2021-03-19 14:27:38 -0700, Andres Freund wrote:
Yep, it's not even particularly hard to hit:
S0: CREATE TABLE a_table();
S0: INSERT INTO a_table();
S0: disconnect
S1: set a breakpoint to just after the dshash_release_lock(), with an
if objid == a_table_oid
S1: SELECT pg_stat_get_live_tuples('a_table'::regclass);
(will break at above breakpoint, without having incremented the
refcount yet)
S2: DROP TABLE a_table;
S2: VACUUM pg_class;

At that point S2's call to pgstat_vacuum_stat() will find the shared
stats entry for a_table, delete the entry from the shared hash table,
see that the stats data has a zero refcount, and free it. Once S1 wakes
up it'll use already freed (and potentially since reused) memory.
I fixed this by initializing / incrementing the refcount while holding the
dshash partition lock. To avoid the potential refcount leak in case the
lookup cache insertion fails due to OOM, I changed things so that the
lookup cache entry is inserted, not just looked up, earlier. That also avoids
needing two hashtable ops in the cache miss case. The price of an empty
hashtable entry in the !create case doesn't seem high.
Related issue: delete_current_stats_entry() there's the following
comment:
/*
* Let the referrers drop the entry if any. Refcount won't be decremented
* until "dropped" is set true and StatsShmem->gc_count is incremented
* later. So we can check refcount to set dropped without holding a lock.
* If no one is referring this entry, free it immediately.
*/
I don't think this explanation is correct. gc_count might have been
incremented by another backend, or cleanup_dropped_stats_entries() might
run. So the whole bit about refcounts seems wrong.
I don't see what prevents a double-free here. Consider what happens if
S1: cleanup_dropped_stats_entries() does pg_atomic_sub_fetch_u32(&ent->shared->refcount, 1)
S2: delete_current_stats_entry() pg_atomic_read_u32(&header->refcount), reading 0
S1: dsa_free(area, ent->dsapointer);
S2: dsa_free(area, pdsa);
World: boom
I think the appropriate fix might be to not have ->dropped (or rather have it
just as a crosscheck), and have every non-dropped entry have an extra
refcount. When dropping the entry the refcount is dropped, and we can safely
free the entry. That way normal paths don't need to check ->dropped at all.
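Roughly sketched, with the patch's names: the hash table itself owns one
reference for every live entry, so only one code path can ever see the
count reach zero.

	/* creation: refcount starts at 1, the hash table's own reference */
	pg_atomic_init_u32(&shheader->refcount, 1);

	/* dropping: remove the hash entry (partition lock held, as dshash
	 * requires), then release the table's reference */
	dshash_delete_entry(pgStatSharedHash, shhashent);
	if (pg_atomic_sub_fetch_u32(&shheader->refcount, 1) == 0)
		dsa_free(area, pdsa);

	/* a backend dropping its own reference uses the identical pattern */
	if (pg_atomic_sub_fetch_u32(&ent->shared->refcount, 1) == 0)
		dsa_free(area, ent->dsapointer);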
Greetings,
Andres Freund
Hi,
On 2021-03-22 12:02:39 +0900, Kyotaro Horiguchi wrote:
At Mon, 22 Mar 2021 09:55:59 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Thu, 18 Mar 2021 01:47:20 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2021-03-18 16:56:02 +0900, Kyotaro Horiguchi wrote:
At Tue, 16 Mar 2021 10:27:55 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Rebased and fixed two bugs. Not addressed received comments in this
version.Since I am heavily editing the code, could you submit "functional"
changes (as opposed to fixing rebase issues) as incremental patches?Oh.. please wait for.. a moment, maybe.
This is that.
Thanks! That change shouldn't be necessary on my branch - I did
something to fix this kind of problem too. I decided that there's no
point in doing hash table lookups for the database: It's not going to
change in the life of a backend. So there's now two static "pending"
entries: One for the current DB, one for the shared DB. There's only
one place that needed to change,
pgstat_report_checksum_failures_in_db(), which now reports the changes
directly instead of going via pending.
I suspect we should actually do that with a number of other DB specific
functions. Things like recovery conflicts, deadlocks, checksum failures
imo should really not be delayed till later. And you should never have
enough of them to make contention a concern.
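As an illustration only, direct reporting could look roughly like the
following; pgstat_get_shared_dbentry() and the member names are
hypothetical stand-ins for the branch's actual helpers:

	void
	pgstat_report_checksum_failures_in_db(Oid dboid, int failurecount)
	{
		PgStatShm_StatDBEntry *dbentry;

		/* hypothetical helper returning the shared entry for this database */
		dbentry = pgstat_get_shared_dbentry(dboid);

		/* update the shared counters immediately, no pending entry involved */
		LWLockAcquire(&dbentry->header.lock, LW_EXCLUSIVE);
		dbentry->stats.n_checksum_failures += failurecount;
		LWLockRelease(&dbentry->header.lock);
	}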
You can see a somewhat sensible list of changes from your v52 at
https://github.com/anarazel/postgres/compare/master...shmstat-before-split-2021-03-22
(I did fix some of the damage from rebase in a non-incremental way, of course)
My branch: https://github.com/anarazel/postgres/tree/shmstat
It would be cool if you could check if there any relevant things between
v52 and v56 that I should include.
I think a lot of the concerns I had with the patch are addressed at the
end of my series of changes. Please let me know what you think.
My next step is going to be to squash all my changes into the base
patch, and try to extract all the things that I think can be
independently committed, and to reduce unnecessary diff noise. Once
that's done I plan to post that series to the list.
TODO:
- explain the design at the top of pgstat.c
- figure out a way to deal with the different demands on stats
consistency / efficiency
- see how hard it'd be to not need collect_oids()
- split pgstat.c
- consider removing PgStatTypes and replacing it with the oid of the
table the type of stats reside in. So PGSTAT_TYPE_DB would be
DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...
I think that'd make the system more cleanly extensible going forward?
- I'm not yet happy with the naming schemes in use in pgstat.c. I feel
like I improved it a bunch, but it's not yet there.
- the replication slot stuff isn't quite right in my branch
- I still don't quite like the reset_offset stuff - I wonder if we can
find something better there. And if not, whether we can deduplicate
the code between functions like pgstat_fetch_stat_checkpointer() and
pgstat_report_checkpointer().
At the very least it'll need a lot better comments.
- bunch of FIXMEs / XXXs
Greetings,
Andres Freund
Hi,
On 2021-03-22 16:17:37 -0700, Andres Freund wrote:
and to reduce unnecessary diff noise
This patch has just tons of stuff like:
-/*
- * Calculate function call usage and update stat counters.
- * Called by the executor after invoking a function.
+/* ----------
+ * pgstat_end_function_usage() -
*
- * In the case of a set-returning function that runs in value-per-call mode,
- * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
- * calls for what the user considers a single call of the function. The
- * finalize flag should be TRUE on the last call.
+ * Calculate function call usage and update stat counters.
+ * Called by the executor after invoking a function.
+ *
+ * In the case of a set-returning function that runs in value-per-call mode,
+ * we will see multiple pgstat_init_function_usage/pgstat_end_function_usage
+ * calls for what the user considers a single call of the function. The
+ * finalize flag should be TRUE on the last call.
+ * ----------
and
typedef struct PgStat_StatTabEntry
{
- Oid tableid;
+ /* Persistent data follow */
+ TimestampTz vacuum_timestamp; /* user initiated vacuum */
+ TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */
+ TimestampTz analyze_timestamp; /* user initiated */
+ TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
PgStat_Counter numscans;
@@ -773,103 +352,31 @@ typedef struct PgStat_StatTabEntry
PgStat_Counter blocks_fetched;
PgStat_Counter blocks_hit;
- TimestampTz vacuum_timestamp; /* user initiated vacuum */
PgStat_Counter vacuum_count;
- TimestampTz autovac_vacuum_timestamp; /* autovacuum initiated */
PgStat_Counter autovac_vacuum_count;
- TimestampTz analyze_timestamp; /* user initiated */
PgStat_Counter analyze_count;
- TimestampTz autovac_analyze_timestamp; /* autovacuum initiated */
PgStat_Counter autovac_analyze_count;
} PgStat_StatTabEntry;
and changes like s/PgStat_WalStats/PgStat_Wal/.
I can't really recognize the pattern of what was changed and what
was not.

Mixing this into a 300kb patch really adds a bunch of work.
Greetings,
Andres Freund
At Fri, 19 Mar 2021 14:27:38 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2021-03-10 20:26:56 -0800, Andres Freund wrote:
+	 * We expose this shared entry now. You might think that the entry
+	 * can be removed by a concurrent backend, but since we are creating
+	 * an stats entry, the object actually exists and used in the upper
+	 * layer. Such an object cannot be dropped until the first vacuum
+	 * after the current transaction ends.
+	 */
+	dshash_release_lock(pgStatSharedHash, shhashent);

I don't think you can safely release the lock before you incremented the
refcount? What if, once the lock is released, somebody looks up that
entry, increments the refcount, and decrements it again? It'll see a
refcount of 0 at the end and decide to free the memory. Then the code
below will access already freed / reused memory, no?

Yep, it's not even particularly hard to hit:
S0: CREATE TABLE a_table();
S0: INSERT INTO a_table();
S0: disconnect
S1: set a breakpoint to just after the dshash_release_lock(), with an
if objid == a_table_oid
S1: SELECT pg_stat_get_live_tuples('a_table'::regclass);
(will break at above breakpoint, without having incremented the
refcount yet)
S2: DROP TABLE a_table;
S2: VACUUM pg_class;

At that point S2's call to pgstat_vacuum_stat() will find the shared
stats entry for a_table, delete the entry from the shared hash table,
see that the stats data has a zero refcount, and free it. Once S1 wakes
up it'll use already freed (and potentially since reused) memory.
Sorry for the delay. You're right. I actually see a permanent block when
continuing to run S1 after the vacuum. That happens at LWLockRelease
on the freed block.

Moving the refcount bump before the dshash_release_lock call fixes
that. One issue with doing that *was* that get_stat_entry() had a path for
the case where pgStatCacheContext is not available, which is already
dead. After the early lock release is removed, the comment is no
longer needed, too.
While working on this, I noticed that the previous diff
v56-57-func-diff.txt was slightly stale (missing a later bug fix). So
the attached contains a fix on the amendment patch.
Please find the attached.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Thank you for all the help!
At Mon, 22 Mar 2021 16:17:37 -0700, Andres Freund <andres@anarazel.de> wrote in
Thanks! That change shouldn't be necessary on my branch - I did
something to fix this kind of problem too. I decided that there's no
point in doing hash table lookups for the database: It's not going to
change in the life of a backend. So there's now two static "pending"
Right.
entries: One for the current DB, one for the shared DB. There's only
one place that needed to change,
pgstat_report_checksum_failures_in_db(), which now reports the changes
directly instead of going via pending.
I suspect we should actually do that with a number of other DB specific
functions. Things like recovery conflicts, deadlocks, checksum failures
imo should really not be delayed till later. And you should never have
enough of them to make contention a concern.
Sounds reasonable.
You can see a somewhat sensible list of changes from your v52 at
https://github.com/anarazel/postgres/compare/master...shmstat-before-split-2021-03-22
(I did fix some of the damage from rebase in a non-incremental way, of course)

My branch: https://github.com/anarazel/postgres/tree/shmstat

It would be cool if you could check if there are any relevant things between
v52 and v56 that I should include.

I think a lot of the concerns I had with the patch are addressed at the
end of my series of changes. Please let me know what you think.
I like the name "stats subsystem".
https://github.com/anarazel/postgres/commit/f28463601e93c68f4dd50fe930d29a54509cffc7
I'm impressed by the way you resolved "who should load stats". Using a
static shared memory area to hold the pointer to existing DSA memory
resolves the "first attacher problem". And while I'm somewhat doubtful
about the "who should write the stats file" part, I think it is reasonable
in general.

But the current place of calling pgstat_write_stats() is a bit too
early. The checkpointer reports some stats *after* calling
ShutdownXLOG(). Perhaps we need to move it after the pg_stat_report_*()
calls in HandleCheckpointerInterrupts().
Separating pgbestat_backend_initialize() from pgstat_initialize()
allows us to initialize the stats subsystem earlier in autovacuum workers,
which looks nice.
https://github.com/anarazel/postgres/commit/3304ee1344f348e079b5eb208d76a2f1553e721c
* Whenever the for a dropped stats entry could not be freed (because
* backends still have references), this is incremented, causing backends
* to run pgstat_lookup_cache_gc(), allowing that memory to be reclaimed.
"Whenever the <what?> for a "
gc_count is incremented whenever *some stats hash entries are
removed*. Some of the delinked shared stats areas might not be freed
due to references.

When a backend finds that gc_count has been incremented, it removes
the local hash entries corresponding to the delinked shared entries. If
the backend was the last referrer, it frees the shared area.
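In other words, roughly the following handshake; my_gc_count is an
illustrative name for the backend-local generation counter:

	/* per-backend generation counter */
	static uint64 my_gc_count = 0;

	/* dropper: after delinking dropped entries from the shared hash */
	pg_atomic_add_fetch_u64(&StatsShmem->gc_count, 1);

	/* every backend, on its next stats access */
	uint64		cur = pg_atomic_read_u64(&StatsShmem->gc_count);

	if (cur != my_gc_count)
	{
		my_gc_count = cur;

		/*
		 * Drop local references to delinked shared entries; whoever
		 * releases the last reference frees the shared area.
		 */
		pgstat_lookup_cache_gc();
	}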
https://github.com/anarazel/postgres/commit/88ffb289860c7011e729cd0a1a01cda1899e6209
Ah, it sounds nice that refcount == 1 means it is to be dropped and no
one is referring to it. Thanks!
https://github.com/anarazel/postgres/commit/03824a236597c87c99d07aa14b9af9d6fe04dd37
+ * XXX: Why is this a good place to do this?
Agreed. We don't need to be in such haste to clean up stats entries. We could
run that in pgstat_report_stat()?
flush_walstat()
I found a mistake in an existing comment:
- * If nowait is true, this function returns false on lock failure. Otherwise
- * this function always returns true.
+ * If nowait is true, this function returns true on lock failure. Otherwise
+ * this function always returns false.
https://github.com/anarazel/postgres/commit/7bde068d8a512d918f76cfc88c1c10f1db8fe553
(pgstat_reset_replslot_counter())
+ * AFIXME: pgstats has business no looking into slot.c structures at
+ * this level of detail.
Does just moving the name resolution part to pgstatfuncs.c resolve it?
pgstat_report_replslot_drop() has been fixed in a similar way.
https://github.com/anarazel/postgres/commit/ded2198d93ce5944fc9d68031d86dd84944053f8
Yeah, I forcefully consolidated replslot stats into the stats hash, but
I agree that it would be more natural for replslot stats to be
fixed-size stats.
https://github.com/anarazel/postgres/commit/e2ef1931fb51da56a6ba483c960e034e52f90430
Agreed that it's better to move database stat entries to fixed pointers.
My next step is going to be to squash all my changes into the base
patch, and try to extract all the things that I think can be
independently committed, and to reduce unnecessary diff noise. Once
that's done I plan to post that series to the list.

TODO:
- explain the design at the top of pgstat.c
- figure out a way to deal with the different demands on stats
consistency / efficiency
- see how hard it'd be to not need collect_oids()
- split pgstat.c
- consider removing PgStatTypes and replacing it with the oid of the
table the type of stats reside in. So PGSTAT_TYPE_DB would be
DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...

I think that'd make the system more cleanly extensible going forward?
I'm not sure that works as expected. We already separated replication
stats from the unified stats hash, and pgstat_read/write_statsfile()
needs to have the corresponding specific code path.
- I'm not yet happy with the naming schemes in use in pgstat.c. I feel
like I improved it a bunch, but it's not yet there.
I feel the same about the namings.
- the replication slot stuff isn't quite right in my branch
Ah, yeah. As I mentioned above, I think it should be in the unified
stats and should have a special means of shortcut. And the global
stats should also be the same.
- I still don't quite like the reset_offset stuff - I wonder if we can
find something better there. And if not, whether we can deduplicate
the code between functions like pgstat_fetch_stat_checkpointer() and
pgstat_report_checkpointer().
Yeah, I find it annoying. If we had the reset offset as negatives (or 2's
complements), the two arithmetics would have the same shape. It might be
somewhat tricky, but we can deduplicate the code. (In exchange, we
would have additional code to convert the reset offset.)
At the very least it'll need a lot better comments.
- bunch of FIXMEs / XXXs
I'll look more closely at the patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 15 Mar 2021 17:51:31 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Fri, 12 Mar 2021 22:20:40 -0800, Andres Freund <andres@anarazel.de> wrote in
Horiguchi-san, is there a chance you could add a few tests (on master)
that test/document the way stats are kept across "normal" restarts, and
thrown away after crash restarts/immediate restarts, and also thrown away
after graceful streaming replication shutdowns?

Roger. I'll give it a try.
Sorry, I forgot this. This is it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
Hi,
I spent quite a bit more time working on the patch. These are the large
changes:
- postmaster/pgstats.c (which is an incorrect location now that it's not
a subprocess anymore) is split into utils/activity/pgstat.c and
utils/activity/pgstat_kind.c. I don't love the _kind name, but I
couldn't come up with anything better.
- Implemented a new GUC, stats_fetch_consistency = {none, cache,
snapshot}. I think the code overhead of it is pretty ok - most of the
handling is entirely generalized.
- Most of the "per stats kind" handling is done in pgstat_kind.c. Nearly
all the rest is done through an array with per-stats-kind information
(extending what was already done with pgstat_sharedentsize etc).
- There is no separate "pending stats" hash anymore. If there are
pending stats, they are referenced from 'pgStatSharedRefHash' (which
used to be the "lookup cache" hash). All the entries with pending
stats are in the doubly linked list pgStatPending (see the sketch after
this list).
- A stat's entry's lwlock, refcount, .. are moved into the dshash
entry. There is no need for them to be separate anymore. Also allows
to avoid doing some dsa lookups while holding dshash locks.
- The dshash entries are not deleted until the refcount has reached
0. That's an important building block to avoid constantly re-creating
stats when flushing pending stats for a dropped object.
- The reference to the shared entry is established the first time stats
for an object are reported. Together with the previous entry that
avoids nearly all the avenues for re-creating already dropped stats
(see below for the hole).
- I added a bunch of pg_regress style tests, and a larger amount of
isolationtester tests. The latter are possible due to a new
pg_stat_force_next_flush() function, avoiding the need to wait until
stats are submitted.
- 2PC support for "precise" dropping of stats has been added, the
collect_oids() based approach removed.
- lots of bugfixes, comments, etc...
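To make the pending-list item above concrete, a rough sketch;
pgStatSharedRefHash and pgStatPending follow the names mentioned, while
PgStat_SharedRef's exact layout is illustrative:

	#include "lib/ilist.h"

	typedef struct PgStat_SharedRef
	{
		PgStat_HashKey key;					/* kind + dboid + objoid */
		PgStatShm_StatEntryHeader *shared;	/* entry body in DSA memory */
		void	   *pending;				/* pending counters, or NULL */
		dlist_node	pending_node;			/* link in pgStatPending */
	} PgStat_SharedRef;

	/* all refs with pending stats; walked by pgstat_report_stat() to flush */
	static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);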
I know of one nontrivial issue that can lead to dropped stats being
revived:
Within a transaction, a function can be called even when another
transaction that dropped that function has already committed. I added a
spec test reproducing the issue:
# FIXME: this shows the bug that stats will be revived, because the
# shared stats in s2 is only referenced *after* the DROP FUNCTION
# committed. That's only possible because there is no locking (and
# thus no stats invalidation) around function calls.
permutation
"s1_track_funcs_all" "s2_track_funcs_none"
"s1_func_call" "s2_begin" "s2_func_call" "s1_func_drop" "s2_track_funcs_all" "s2_func_call" "s2_commit" "s2_ff" "s1_func_stats" "s2_func_stats"
I think the best fix here would be to associate an xid with the dropped
stats object, and only delete the dshash entry once there's nothing
alive with a horizon from before that xid...
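Very roughly, the idea might look like this; the horizon test is
simplified, and a real implementation would need a cheaper and more
careful horizon computation:

	/* when dropping the object's stats */
	shhashent->dropped = true;
	shhashent->drop_xid = GetCurrentTransactionId();

	/* during a later GC pass: only physically remove the dshash entry
	 * once no transaction that might still call the dropped object can
	 * be running */
	if (shhashent->dropped &&
		TransactionIdPrecedes(shhashent->drop_xid,
							  GetOldestActiveTransactionId()))
		dshash_delete_entry(pgStatSharedHash, shhashent);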
There's also a second issue (stats for newly created objects surviving
the transaction), but that's pretty simple to resolve.
Here's all the gory details of my changes happening incrementally:
https://github.com/anarazel/postgres/compare/master...shmstat
I'll squash and split tomorrow. Too tired for today.
I think this is starting to look a lot better than what we have now. But
I'm getting less confident that it's realistic to get any of this into
PG14, given the state of the release cycle.
I'm impressed that the way you resolved "who should load stats". Using
static shared memory area to hold the point to existing DSA memory
resolves the "first attacher problem". However somewhat doubtful
about the "who should write the stats file", I think it is reasonable
in general.

But the current place of calling pgstat_write_stats() is a bit too
early. Checkpointer reports some stats *after* calling
ShutdownXLOG(). Perhaps we need to move it after pg_stat_report_*()
calls in HandleCheckpointerInterrupts().
I now moved it into a before_shmem_exit(). I think that should avoid
that problem?
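A minimal sketch of that arrangement; the callback name is illustrative
and the exact signatures in the branch may differ:

	static void
	pgstat_write_stats_on_exit(int code, Datum arg)
	{
		/* final flush of this process's pending stats ... */
		pgstat_report_stat(true);
		/* ... then persist everything to the stats file */
		pgstat_write_statsfile();
	}

	/* registered during stats initialization; before_shmem_exit()
	 * callbacks run before shared memory is detached */
	before_shmem_exit(pgstat_write_stats_on_exit, (Datum) 0);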
https://github.com/anarazel/postgres/commit/03824a236597c87c99d07aa14b9af9d6fe04dd37
+ * XXX: Why is this a good place to do this?
Agreed. We don't need to be in such haste to clean up stats entries. We could
run that in pgstat_report_stat()?
I've not changed that yet, but I had the same thought.
Agreed that it's better to move database stat entries to fixed
pointers.
I actually ended up reverting that. My main motivation for it was that
it was problematic that new pending database stats entries could be
created at some random place in the hashtable. But with the linked list
of pending entries that's not a problem anymore. And I found it
nontrivial to manage the refcounts to the shared entry accurately this
way.
We could still add a cache for the two stats entries though...
- consider removing PgStatTypes and replacing it with the oid of the
table the type of stats reside in. So PGSTAT_TYPE_DB would be
DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...

I think that'd make the system more cleanly extensible going forward?
I'm not sure that works as expected. We already separated replication
stats from the unified stats hash, and pgstat_read/write_statsfile()
needs to have the corresponding specific code path.
I didn't quite go towards my proposal, but I think I got a lot closer
towards not needing much extra code for additional types of stats. I
even added an XXX to pgstat_read/write_statsfile() that shows how they
could now be made generic.
- the replication slot stuff isn't quite right in my branch
Ah, yeah. As I mentioned above, I think it should be in the unified
stats and should have a special means of shortcut. And the global
stats should also be the same.
The problem is that I use indexes for addressing, but that they can
change between restarts. I think we can fix that fairly easily, by
mapping names to indices once, in pgstat_restore_stats(). At the point we
call pgstat_restore_stats(), StartupReplicationSlots() has already been
executed, so we can just inquire at that point...
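A sketch of that fixup; pgstat_map_replslot_index() is a hypothetical
helper recording the name-to-index mapping, and locking around the slot
array is elided:

	/* run from pgstat_restore_stats(), after StartupReplicationSlots() */
	for (int i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *slot = &ReplicationSlotCtl->replication_slots[i];

		if (slot->in_use)
			pgstat_map_replslot_index(NameStr(slot->data.name), i);
	}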
Greetings,
Andres Freund
At Thu, 1 Apr 2021 19:44:25 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
I spent quite a bit more time working on the patch. These are the large
changes:

- postmaster/pgstats.c (which is an incorrect location now that it's not
a subprocess anymore) is split into utils/activity/pgstat.c and
utils/activity/pgstat_kind.c. I don't love the _kind name, but I
couldn't come up with anything better.
The place was not changed in order to keep the footprint smaller. I agree
that the old place is not appropriate. pgstat_kind... How about changing
pgstat.c to pgstat_core.c and pgstat_kind.c to pgstat.c?
- Implemented a new GUC, stats_fetch_consistency = {none, cache,
snapshot}. I think the code overhead of it is pretty ok - most of the
handling is entirely generalized.
Sounds good.
- Most of the "per stats kind" handling is done in pgstat_kind.c. Nearly
all the rest is done through an array with per-stats-kind information
(extending what was already done with pgstat_sharedentsize etc).

- There is no separate "pending stats" hash anymore. If there are
pending stats, they are referenced from 'pgStatSharedRefHash' (which
used to be the "lookup cache" hash). All the entries with pending
stats are in the doubly linked list pgStatPending.
Sounds reasonable. A bit similar to TabStatusArray.. Pending stats and
shared stats share the same key, so they are naturally consolidatable.
- A stat's entry's lwlock, refcount, .. are moved into the dshash
entry. There is no need for them to be separate anymore. Also allows
to avoid doing some dsa lookups while holding dshash locks.

- The dshash entries are not deleted until the refcount has reached
0. That's an important building block to avoid constantly re-creating
stats when flushing pending stats for a dropped object.
Does that mean the entries for a dropped object are actually dropped by
the backend that has flushed stats of the dropped object, at exit?
Sounds nice.
- The reference to the shared entry is established the first time stats
for an object are reported. Together with the previous entry that
avoids nearly all the avenues for re-creating already dropped stats
(see below for the hole).
- I added a bunch of pg_regress style tests, and a larger amount of
isolationtester tests. The latter are possibly due to a new
pg_stat_force_next_flush() function, avoiding the need to wait until
stats are submitted.
- 2PC support for "precise" dropping of stats has been added, the
collect_oids() based approach removed.
Cool!
- lots of bugfixes, comments, etc...
Thanks for all of them.
I know of one nontrivial issue that can lead to dropped stats being
revived: Within a transaction, a function can be called even when another
transaction that dropped that function has already committed. I added a
spec test reproducing the issue:
# FIXME: this shows the bug that stats will be revived, because the
# shared stats in s2 is only referenced *after* the DROP FUNCTION
# committed. That's only possible because there is no locking (and
# thus no stats invalidation) around function calls.
permutation
"s1_track_funcs_all" "s2_track_funcs_none"
"s1_func_call" "s2_begin" "s2_func_call" "s1_func_drop" "s2_track_funcs_all" "s2_func_call" "s2_commit" "s2_ff" "s1_func_stats" "s2_func_stats"I think the best fix here would be to associate an xid with the dropped
stats object, and only delete the dshash entry once there's no alive
with a horizon from before that xid...
I'm not sure how we do that avoiding a full scan on dshash..
There's also a second issue (stats for newly created objects surviving
the transaction), but that's pretty simple to resolve.
Here are all the gory details of my changes, happening incrementally:
https://github.com/anarazel/postgres/compare/master...shmstat
I'll squash and split tomorrow. Too tired for today.
Thank you very much for all of your immense effort.
I think this is starting to look a lot better than what we have now. But
I'm getting less confident that it's realistic to get any of this into
PG14, given the state of the release cycle.
I'm impressed by the way you resolved "who should load stats". Using a
static shared memory area to hold the pointer to the existing DSA memory
resolves the "first attacher problem". Though I'm somewhat doubtful
about the "who should write the stats file" part, I think it is reasonable
in general.
But the current place of calling pgstat_write_stats() is a bit too
early. Checkpointer reports some stats *after* calling
ShutdownXLOG(). Perhaps we need to move it after the pg_stat_report_*()
calls in HandleCheckpointerInterrupts().
I now moved it into a before_shmem_exit(). I think that should avoid
that problem?
I think so.
https://github.com/anarazel/postgres/commit/03824a236597c87c99d07aa14b9af9d6fe04dd37
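As a sketch, using the real before_shmem_exit() API (the hook name and pgstat_write_stats() follow this thread's terminology, not necessarily the final code):

static void
pgstat_shutdown_hook(int code, Datum arg)
{
    /* runs during shutdown, after the checkpointer's final stats
     * reports but while shared memory is still attached */
    pgstat_write_stats();
}

/* registered once during process initialization */
before_shmem_exit(pgstat_shutdown_hook, 0);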
+ * XXX: Why is this a good place to do this?
Agreed. We don't need to be so hasty to clean up stats entries. We could
run that in pgstat_report_stat()?
I've not changed that yet, but I had the same thought.
Agreed that it's better to move database stat entries to fixed
pointers.
I actually ended up reverting that. My main motivation for it was that
it was problematic that new pending database stats entries could be
created at some random place in the hashtable. But with the linked list
of pending entries that's not a problem anymore. And I found it
nontrivial to manage the refcounts to the shared entry accurately this
way.
We could still add a cache for the two stats entries though...
Yeah.
- consider removing PgStatTypes and replacing it with the oid of the
table the type of stats reside in. So PGSTAT_TYPE_DB would be
DatabaseRelationId, PGSTAT_TYPE_TABLE would be RelationRelationId, ...
I think that'd make the system more cleanly extensible going forward?
I'm not sure that works as expected. We already separated replication
stats from the unified stats hash, and pgstat_read/write_statsfile()
needs to have the corresponding specific code path.
I didn't quite go towards my proposal, but I think I got a lot closer
towards not needing much extra code for additional types of stats. I
even added an XXX to pgstat_read/write_statsfile() that shows how they
now could be made generic.
I'll check it.
- the replication slot stuff isn't quite right in my branch
Ah, yeah. As I mentioned above I think it should be in the unified
stats and should have a special shortcut. And the global
stats also should be the same.
The problem is that I use indexes for addressing, but that they can
change between restarts. I think we can fix that fairly easily, by
mapping names to indices once, pgstat_restore_stats(). At the point we
call pgstat_restore_stats() StartupReplicationSlots() already was
executed, so we can just inquire at that point...
Does that mean the saved replslot stats are keyed by their names?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2021-04-02 15:34:54 +0900, Kyotaro Horiguchi wrote:
At Thu, 1 Apr 2021 19:44:25 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
I spent quite a bit more time working on the patch. There are large
changes:
- postmaster/pgstats.c (which is an incorrect location now that it's not
a subprocess anymore) is split into utils/activity/pgstat.c and
utils/activity/pgstat_kind.c. I don't love the _kind name, but I
couldn't come up with anything better.
The place was not changed to keep the footprint smaller. I agree that the
old place is not appropriate. pgstat_kind... How about changing
pgstat.c to pgstat_core.c and pgstat_kind.c to pgstat.c?
I don't really like that split over what I chose.
- A stat's entry's lwlock, refcount, .. are moved into the dshash
entry. There is no need for them to be separate anymore. Also allows
to avoid doing some dsa lookups while holding dshash locks.
- The dshash entries are not deleted until the refcount has reached
0. That's an important building block to avoid constantly re-creating
stats when flushing pending stats for a dropped object.
Does that mean the entries for a dropped object are actually dropped by
the backend that has flushed stats of the dropped object, at exit?
Sounds nice.
It's marked as dropped after the commit of the transaction that dropped
the object. The memory is freed when subsequently the last
reference goes away.
I know of one nontrivial issue that can lead to dropped stats being
revived: Within a transaction, a function can be called even when another
transaction that dropped that function has already committed. I added a
spec test reproducing the issue:
# FIXME: this shows the bug that stats will be revived, because the
# shared stats in s2 is only referenced *after* the DROP FUNCTION
# committed. That's only possible because there is no locking (and
# thus no stats invalidation) around function calls.
permutation
"s1_track_funcs_all" "s2_track_funcs_none"
"s1_func_call" "s2_begin" "s2_func_call" "s1_func_drop" "s2_track_funcs_all" "s2_func_call" "s2_commit" "s2_ff" "s1_func_stats" "s2_func_stats"I think the best fix here would be to associate an xid with the dropped
stats object, and only delete the dshash entry once there's no alive
with a horizon from before that xid...I'm not sure how we do that avoiding a full scan on dshash..
I think it's quite possible. Have a linked list of "to be dropped stats"
or such. A bit annoying because of needing to deal with dsa pointers,
but not too hard either.
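A standalone sketch of such a list (all names hypothetical; dsa pointers and xid wraparound are ignored for brevity):

#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct PendingDrop
{
    TransactionId       drop_xid;   /* xid of the dropping transaction */
    struct PendingDrop *next;
} PendingDrop;

static PendingDrop *pending_drops;

/* free entries that no running transaction can still reference */
static void
gc_dropped_stats(TransactionId oldest_active_xid)
{
    PendingDrop **link = &pending_drops;

    while (*link != NULL)
    {
        if ((*link)->drop_xid < oldest_active_xid)
        {
            PendingDrop *dead = *link;

            *link = dead->next;
            /* ... delete the dshash entry and free its dsa memory here ... */
        }
        else
            link = &(*link)->next;
    }
}

Only this list needs walking, never the whole dshash table.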
- the replication slot stuff isn't quite right in my branch
Ah, yeah. As I mentioned above I think it should be in the unified
stats and should have a special shortcut. And the global
stats also should be the same.
The problem is that I use indexes for addressing, but that they can
change between restarts. I think we can fix that fairly easily, by
mapping names to indices once, pgstat_restore_stats(). At the point we
call pgstat_restore_stats() StartupReplicationSlots() already was
executed, so we can just inquire at that point...
Does that mean the saved replslot stats are keyed by their names?
I was thinking we'd key them by name only at startup, where their
indices are not known.
Greetings,
Andres Freund
Hi,
On 2021-03-16 12:54:40 -0700, Andres Freund wrote:
I did consider command_progress.c too - but that seems confusing because
there's src/include/commands/progress.h, which is imo a different layer
than what pgstat/backend_progress provide. So I thought splitting things
up so that backend_progress.[ch] provide the place to store the progress
values, and commands/progress.h defining the meaning of the values as
used for in-core postgres commands would make sense. I could see us
using the general progress infrastructure for things that'd not fit
super well into commands/* at some point...
Thinking about it some more, having the split between backend_status.h
and commands/progress.h actually makes a fair bit of sense from another
angle: Commands utilizing workers. backend_status.h provides
infrastructure to store progress counters for a single backend, but
multiple backends can participate in a command...
I added some comments to the header to that end.
Greetings,
Andres Freund
Hi,
Please find v60 of my version of Horiguchi-san's patch attached. It's now
pretty reasonably split, I think.
Major changes:
- several precursor patches committed
- Split pgstat.c into separate files (see below)
- split into reasonable-ish sized individual commits, in particular no code is
moved around in the same commit with function changes
- stats for newly created objects are now dropped on abort
- a good bit of comment, correctness, ... cleanup
- quite a few AFIXME additions denoting places that would need to be fixed
before commit
I've spent most of the last 2 1/2 weeks on this now. Unfortunately I think
that, while it has gotten a lot closer, it's still about a week's worth of
work away from being committable.
My main concerns are:
- Test Coverage:
I've added a fair bit of tests, but it's still pretty bad. There were a lot
of easy-to-hit bugs in earlier versions that nevertheless passed the test
just fine.
Due to the addition of pg_stat_force_next_flush(), and that there's no need
to wait for the stats collector to write out files, it's now a lot more
realistic to have proper testing of a lot of the pgstat.c code.
- Architectural Review
I rejiggered the patchset pretty significantly, and I think it needs more
review than I see as realistic in the next two days. In particular I don't
think
- Performance Testing
I did a small amount, but given that this touches just about every query
etc, I think that's not enough. My changes unfortunately are substantial
enough to invalidate Horiguchi-san's earlier tests.
- Currently there's a corner case in which function (but not table!) stats
for a dropped function may not be removed. That possibly is not too bad,
- Too many FIXMEs still open
It is quite disappointing to not have the patch go into v14 :(. But I just
don't quite see the path right now. But maybe I am just too tired right now,
and it'll look better tomorrow (err today, in a few hours).
On 2021-04-02 10:27:23 -0700, Andres Freund wrote:
On 2021-04-02 15:34:54 +0900, Kyotaro Horiguchi wrote:
At Thu, 1 Apr 2021 19:44:25 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
I spent quite a bit more time working on the patch. There's are large
changes:- postmaster/pgstats.c (which is an incorrect location now that it's not
a subprocess anymore) is split into utils/activity/pgstat.c and
utils/activity/pgstat_kind.c. I don't love the _kind name, but I
couldn't come up with anything better.The place was not changed to keep footprint smaller. I agree that the
old place is not appropriate. pgstat_kind... How about changin
pgstat.c to pgstat_core.c and pgstat_kind.c to pgstat.c?
I don't really like that split over what I chose.
I now changed the split so that there is
utils/activity/pgstat_{database,functions,global,relation}.c
For me that makes the code a lot more readable. Before this change I found it
really hard to know where code should best be put etc, or where to find
code. I found it to be pretty nice to work with after the new split.
The only sad thing about the split is that pgstat_kind_infos is now defined in
pgstat.c, necessitating exposing the callback functions. Seems worth it.
Greetings,
Andres Freund
Attachments:
Hi,
On 2021-04-05 02:29:14 -0700, Andres Freund wrote:
I've spent most of the last 2 1/2 weeks on this now. Unfortunately I think
that, while it has gotten a lot closer, it's still about a week's worth of
work away from being committable.
My main concerns are:
- Test Coverage:
I've added a fair bit of tests, but it's still pretty bad. There were a lot
of easy-to-hit bugs in earlier versions that nevertheless passed the test
just fine.
Due to the addition of pg_stat_force_next_flush(), and that there's no need
to wait for the stats collector to write out files, it's now a lot more
realistic to have proper testing of a lot of the pgstat.c code.
- Architectural Review
I rejiggered the patchset pretty significantly, and I think it needs more
review than I see as realistic in the next two days. In particular I don't
think
- Performance Testing
I did a small amount, but given that this touches just about every query
etc, I think that's not enough. My changes unfortunately are substantial
enough to invalidate Horiguchi-san's earlier tests.
- Currently there's a corner case in which function (but not table!) stats
for a dropped function may not be removed. That possibly is not too bad,
- Too many FIXMEs still open
It is quite disappointing to not have the patch go into v14 :(. But I just
don't quite see the path right now. But maybe I am just too tired right now,
and it'll look better tomorrow (err today, in a few hours).
[...]
I now changed the split so that there is
utils/activity/pgstat_{database,functions,global,relation}.c
For me that makes the code a lot more readable. Before this change I found it
really hard to know where code should best be put etc, or where to find
code. I found it to be pretty nice to work with after the new split.
I'm inclined to push patches
[PATCH v60 05/17] pgstat: split bgwriter and checkpointer messages.
[PATCH v60 06/17] pgstat: Split out relation stats handling from AtEO[Sub]Xact_PgStat() etc.
[PATCH v60 09/17] pgstat: xact level cleanups / consolidation.
[PATCH v60 10/17] pgstat: Split different types of stats into separate files.
[PATCH v60 12/17] pgstat: reorder file pgstat.c / pgstat.h contents.
to v14. They're just moving things around, so are fairly low risk. But
they're going to be a pain to maintain. And I think 10 and 12 make
pgstat.c a lot easier to understand.
Greetings,
Andres Freund
At Tue, 6 Apr 2021 09:32:16 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2021-04-05 02:29:14 -0700, Andres Freund wrote:
..
I'm inclined to push patches
[PATCH v60 05/17] pgstat: split bgwriter and checkpointer messages.
[PATCH v60 06/17] pgstat: Split out relation stats handling from AtEO[Sub]Xact_PgStat() etc.
[PATCH v60 09/17] pgstat: xact level cleanups / consolidation.
[PATCH v60 10/17] pgstat: Split different types of stats into separate files.
[PATCH v60 12/17] pgstat: reorder file pgstat.c / pgstat.h contents.
FWIW..
05 is a straight forward code-rearrange and reasonable to apply.
06 is same as above and it seems to make things cleaner.
09 mainly adds ensure_tabstat_xact_level() to remove repeated code
blocks in a straightforward way. I wonder if
pgstat_xact_stack_level_get() might better be
pgstat_get_xact_stack_level(), but I'm fine with the name in the
patch.
10 I found that the kind in "pgstat_kind" meant the placeholder for
specific types. It looks good to separate them into smaller pieces.
It is also a simple rearrangement of code.
pgstat.c is very long, and it's hard to find an order that makes sense
and is likely to be maintained over time. Splitting the different
I deeply agree with "hard to find an order that makes sense".
12 I'm not sure how it looks after this patch (I failed to apply 09
on my end), but it is also a simple rearrangement of code blocks.
to v14. They're just moving things around, so are fairly low risk. But
they're going to be a pain to maintain. And I think 10 and 12 make
pgstat.c a lot easier to understand.
I think that pgstat.c doesn't get frequent back-patching. It seems to
me that at least 10 looks good.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Apr 7, 2021 at 8:05 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
At Tue, 6 Apr 2021 09:32:16 -0700, Andres Freund <andres@anarazel.de>
wrote inHi,
On 2021-04-05 02:29:14 -0700, Andres Freund wrote:
..
I'm inclined to push patches
[PATCH v60 05/17] pgstat: split bgwriter and checkpointer messages.
[PATCH v60 06/17] pgstat: Split out relation stats handling from AtEO[Sub]Xact_PgStat() etc.
[PATCH v60 09/17] pgstat: xact level cleanups / consolidation.
[PATCH v60 10/17] pgstat: Split different types of stats into separate files.
[PATCH v60 12/17] pgstat: reorder file pgstat.c / pgstat.h contents.
FWIW..
05 is a straight forward code-rearrange and reasonable to apply.
06 is same as above and it seems to make things cleaner.
09 mainly adds ensure_tabstat_xact_level() to remove repeated code
blocks in a straightforward way. I wonder if
pgstat_xact_stack_level_get() might better be
pgstat_get_xact_stack_level(), but I'm fine with the name in the
patch.
10 I found that the kind in "pgstat_kind" meant the placeholder for
specific types. It looks good to separate them into smaller pieces.
It is also a simple rearrangement of code.pgstat.c is very long, and it's hard to find an order that makes sense
and is likely to be maintained over time. Splitting the different
I deeply agree with "hard to find an order that makes sense".
12 I'm not sure how it looks after this patch (I failed to apply 09
on my end), but it is also a simple rearrangement of code blocks.
to v14. They're just moving things around, so are fairly low risk. But
they're going to be a pain to maintain. And I think 10 and 12 make
pgstat.c a lot easier to understand.
I think that pgstat.c doesn't get frequent back-patching. It seems to
me that at least 10 looks good.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
The patch does not apply and requires a rebase,
1 out of 8 hunks FAILED -- saving rejects to file src/include/pgstat.h.rej
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 8758 (offset 34 lines).
patching file src/backend/postmaster/checkpointer.c
Hunk #3 succeeded at 496 with fuzz 1.
Hunk #4 FAILED at 576.
1 out of 6 hunks FAILED -- saving rejects to file
src/backend/postmaster/checkpointer.c.rej
patching file src/backend/postmaster/pgstat.c
I am changing the status to "Waiting on Author".
--
Ibrar Ahmed
At Mon, 19 Jul 2021 15:34:56 +0500, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote in
The patch does not apply and requires a rebase,
Yeah, thank you very much for checking that. However, this patch is
now developed in Andres' GitHub repository. So I'm at a loss as to what
to do about the failure.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2021-07-21 17:09:49 +0900, Kyotaro Horiguchi wrote:
At Mon, 19 Jul 2021 15:34:56 +0500, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote in
The patch does not apply and requires a rebase,
Yeah, thank you very much for checking that. However, this patch is
now developed in Andres' GitHub repository. So I'm at a loss as to what
to do about the failure.
I'll post a rebased version soon.
Greetings,
Andres Freund
Yeah, thank you very much for checking that. However, this patch is
now developed in Andres' GitHub repository. So I'm at a loss as to what
to do about the failure.
I'll post a rebased version soon.
(Sorry if you felt hurried; I didn't mean that.)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2021-07-26 17:52:01 +0900, Kyotaro Horiguchi wrote:
Yeah, thank you very much for checking that. However, this patch is
now developed in Andres' GitHub repository. So I'm at a loss as to what
to do about the failure.
I'll post a rebased version soon.
(Sorry if you felt hurried; I didn't mean that.)
No worries!
I had intended to post a rebase by now. But while I did mostly finish
that (see [1]) I unfortunately encountered a new issue around
partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
that thread about the best way to address the issues.
Greetings,
Andres Freund
[1]: https://github.com/anarazel/postgres/tree/shmstat
[2]: /messages/by-id/20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de
On Mon, Jul 26, 2021 at 06:27:54PM -0700, Andres Freund wrote:
Hi,
On 2021-07-26 17:52:01 +0900, Kyotaro Horiguchi wrote:
Yeah, thank you very much for checking that. However, this patch is
now developed in Andres' GitHub repository. So I'm at a loss as to what
to do about the failure.
I'll post a rebased version soon.
(Sorry if you felt hurried; I didn't mean that.)
No worries!
I had intended to post a rebase by now. But while I did mostly finish
that (see [1]) I unfortunately encountered a new issue around
partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
that thread about the best way to address the issues.
Greetings,
Andres Freund
[1] https://github.com/anarazel/postgres/tree/shmstat
[2] /messages/by-id/20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de
Hi Andres,
Are you planning to post a rebase soon?
--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
On Thu, Sep 02, 2021 at 10:20:50AM -0500, Jaime Casanova wrote:
On Mon, Jul 26, 2021 at 06:27:54PM -0700, Andres Freund wrote:
Hi,
On 2021-07-26 17:52:01 +0900, Kyotaro Horiguchi wrote:
Yeah, thank you very much for checking that. However, this patch is
now developed in Andres' GitHub repository. So I'm at a loss as to what
to do about the failure.
I'll post a rebased version soon.
(Sorry if you felt hurried; I didn't mean that.)
No worries!
I had intended to post a rebase by now. But while I did mostly finish
that (see [1]) I unfortunately encountered a new issue around
partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
that thread about the best way to address the issues.
Greetings,
Andres Freund
[1] https://github.com/anarazel/postgres/tree/shmstat
[2] /messages/by-id/20210722205458.f2bug3z6qzxzpx2s@alap3.anarazel.de
Hi Andres,
Are you planning to post a rebase soon?
Hi,
We haven't heard about this since July, so I will mark this as RwF.
--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
Hi,
On 2021-07-26 18:27:54 -0700, Andres Freund wrote:
I had intended to post a rebase by now. But while I did mostly finish
that (see [1]) I unfortunately encountered a new issue around
partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
that thread about the best way to address the issues.
Now that /messages/by-id/20220125063131.4cmvsxbz2tdg6g65@alap3.anarazel.de
is resolved, here's a rebased version. With a good bit of further cleanup.
One "big" thing that I'd like to figure out is a naming policy for the
different types prefixed with PgStat. We have different groups of types:
- "pending statistics", that are accumulated but not yet submitted to the
shared stats system, like PgStat_TableStatus, PgStat_BackendFunctionEntry
etc
- accumulated statistics like PgStat_StatDBEntry, PgStat_SLRUStats. About half are
prefixed with PgStat_Stat, the other just with PgStat_
- random other types like PgStat_Single_Reset_Type, ...
To me it's very confusing to have these all in an essentially undistinguishing
namespace, particularly the top two items.
I think we should at least do s/PgStat_Stat/PgStat_/. Perhaps we should use a
distinct PgStatPending_* for the pending item? I can't quite come up with a
good name for the "accumulated" ones.
I'd like to get that resolved first because I think that'd allow committing the
preparatory split and reordering patches.
Greetings,
Andres Freund
Attachments:
At Wed, 2 Mar 2022 18:16:00 -0800, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2021-07-26 18:27:54 -0700, Andres Freund wrote:
I had intended to post a rebase by now. But while I did mostly finish
that (see [1]) I unfortunately encountered a new issue around
partitioned tables, see [2]. Currently I'm hoping for a few thoughts on
that thread about the best way to address the issues.Now that /messages/by-id/20220125063131.4cmvsxbz2tdg6g65@alap3.anarazel.de
is resolved, here's a rebased version. With a good bit of further cleanup.
One "big" thing that I'd like to figure out is a naming policy for the
different types prefixed with PgStat. We have different groups of types:
- "pending statistics", that are accumulated but not yet submitted to the
shared stats system, like PgStat_TableStatus, PgStat_BackendFunctionEntry
etc
- accumulated statistics like PgStat_StatDBEntry, PgStat_SLRUStats. About half are
prefixed with PgStat_Stat, the other just with PgStat_
- random other types like PgStat_Single_Reset_Type, ...
To me it's very confusing to have these all in an essentially undistinguishing
namespace, particularly the top two items.
Profoundly agreed. It was always a pain in the neck.
I think we should at least do s/PgStat_Stat/PgStat_/. Perhaps we should use a
distinct PgStatPending_* for the pending item? I can't quite come up with a
good name for the "accumulated" ones.
How about naming "pending stats" as just "Stats" and the "accumulated
stats" as "counts" or "counters"? "Counter" doesn't reflect the
characteristics exactly, but I think the distinguishability of the two
is more significant. Specifically:
- PgStat_TableStatus
+ PgStat_TableStats
- PgStat_BackendFunctionEntry
+ PgStat_FunctionStats
- PgStat_GlobalStats
+ PgStat_GlobalCounts
- PgStat_ArchiverStats
+ PgStat_ArchiverCounts
- PgStat_BgWriterStats
+ PgStat_BgWriterCounts
Moving to the shared stats collector lets them be attributed as "Local"
and "Shared". (I don't consider the details at this stage.)
PgStatLocal_TableStats
PgStatLocal_FunctionStats
PgStatLocal_GlobalCounts
PgStatLocal_ArchiverCounts
PgStatLocal_BgWriterCounts
PgStatShared_TableStats
PgStatShared_FunctionStats
PgStatShared_GlobalCounts
PgStatShared_ArchiverCounts
PgStatShared_BgWriterCounts
PgStatLocal_GlobalCounts looks somewhat odd, but maybe that doesn't matter much.
I'd like to get that resolved first because I think that'd allow committing the
preparatory split and reordering patches.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
I've attached a substantially improved version of the shared memory stats
patch.
The biggest changes are:
- chopped the bigger patches into smaller chunks. Most importantly the
"transactional stats creation / drop" part is now its own commit. Also the
tests I added.
- put in a defense against the danger of losing function stats entries due to
the lack of cache invalidation during function calls. See long comment in
pgstat_init_function_usage()
- split up pgstat_global.c into pgstat_{archiver, bgwriter, checkpoint,
replslot, slru, wal}. While each individually not large, there were enough
of them to make the file confusing. Feels a lot better to work with.
- replication slot stats used the slot "index" in a dangerous way, fixed
- implemented a few omitted features like resetting all subscriptions and
setting reset timestamps (Melanie)
- loads of code and comment polishing
I think the first few patches are basically ready to be applied and are
independently worthwhile:
- 0001-pgstat-run-pgindent-on-pgstat.c-h.patch
- 0002-pgstat-split-relation-database-stats-handling-ou.patch
- 0003-pgstat-split-out-WAL-handling-from-pgstat_-initi.patch
- 0004-pgstat-introduce-pgstat_relation_should_count.patch
- 0005-pgstat-xact-level-cleanups-consolidation.patch
Might not be worth having separately, should probably just be part of
0014:
- 0006-pgstat-wip-pgstat-relation-init-assoc.patch
A pain to maintain, needs mostly a bit of polishing of file headers. Perhaps I
should rename pgstat_checkpoint.c to pgstat_checkpointer.c, fits better with
function names:
- 0007-pgstat-split-different-types-of-stats-into-separ.patch
This is also painful to maintain. Mostly kept separate from 0007 for easier
reviewing:
- 0009-pgstat-reorder-file-pgstat.c-pgstat.h-contents.patch
Everything after isn't yet quite there / depends on patches that aren't yet
there:
- 0010-pgstat-add-pgstat_copy_relation_stats.patch
- 0011-pgstat-remove-superflous-comments-from-pgstat.h.patch
- 0012-pgstat-stats-collector-references-in-comments.patch
- 0013-pgstat-scaffolding-for-transactional-stats-creat.patch
- 0014-pgstat-store-statistics-in-shared-memory.patch
- 0015-pgstat-add-pg_stat_force_next_flush.patch
Notably this patch makes stat.sql a test that can safely be run concurrently
with other tests:
- 0016-pgstat-utilize-pg_stat_force_next_flush-to-simpl.patch
Needs a tap test as well, but already covers a lot of things that aren't
covered today. Unfortunately it can't really be applied before because it's
too hard to write / slow to run without 0015
- 0017-pgstat-extend-pgstat-test-coverage.patch
I don't yet know what we should do with other users of
PG_STAT_TMP_DIR. There's no need for it for pgstat.c et al anymore. Not sure
that pg_stat_statement is enough of a reason to keep the stats_temp_directory
GUC around?
- 0019-pgstat-wip-remove-stats_temp_directory.patch
Right now we reset stats for replicas, even if we start from a shutdown
checkpoint. That seems pretty unnecessary with this patch:
- 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patch
Starting to feel more optimistic about this! There's loads more to do, but now
the TODOs just seem to require elbow grease, rather than deep thinking.
The biggest todos are:
- Address all the remaining AFIXMEs and XXXs
- add longer explanation of architecture to pgstat.c (or a README)
- Further improve our stats test coverage - there's a crapton not covered,
despite 0017:
- test WAL replay with stats (stats for dropped tables are removed etc)
- test crash recovery and "invalid stats file" paths
- a lot of the pg_stat_ views like bgwriter, pg_stat_database have zero coverage today
- make naming not "a pain in the neck": [1]
- lots of polishing
- revise docs
- run benchmarks - I've done so in the past, but not recently
- perhaps 0014 can be further broken down - it's still uncomfortably large
It's worth noting that the patchset, leaving new tests aside, has a
substantially negative diffstat, even if one includes all the new file
headers... Once a bunch more cleanup is done, I bet it'll improve further.
with new tests:
80 files changed, 9759 insertions(+), 8051 deletions(-)
without tests (mildly inaccurate):
71 files changed, 6991 insertions(+), 7814 deletions(-)
just shared memory stats patch, not all the code movement, new file headers:
49 files changed, 4079 insertions(+), 5472 deletions(-)
Comments and reviews welcome!
I regularly push to https://github.com/anarazel/postgres/tree/shmstat fwiw -
the series is way too big to spam the list all the time.
Greetings,
Andres Freund
[1]: /messages/by-id/20220303.170412.1542007127371857370.horikyota.ntt@gmail.com
Attachments:
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
Starting to feel more optimistic about this! There's loads more to do, but now
the TODOs just seem to require elbow grease, rather than deep thinking.
The biggest todos are:
- Address all the remaining AFIXMEs and XXXs
- add longer explanation of architecture to pgstat.c (or a README)
- Further improve our stats test coverage - there's a crapton not covered,
despite 0017:
- test WAL replay with stats (stats for dropped tables are removed etc)
Attached is a TAP test to check that stats are cleaned up on a physical
replica after the objects they concern are dropped on the primary.
I'm not sure that the extra force next flush on standby is needed after
drop on primary since drop should report stats and I wait for catchup.
Also, I don't think the tests with DROP SCHEMA actually exercise another
code path, so it might be worth cutting those.
- Melanie
Attachments:
Hi,
On 2022-03-20 12:32:39 -0400, Melanie Plageman wrote:
Attached is a TAP test to check that stats are cleaned up on a physical
replica after the objects they concern are dropped on the primary.
Thanks!
I'm not sure that the extra force next flush on standby is needed after
drop on primary since drop should report stats and I wait for catchup.
A drop doesn't force stats in other sessions to be flushed immediately, so
unless I misunderstand, yes, it's needed.
Also, I don't think the tests with DROP SCHEMA actually exercise another
code path, so it might be worth cutting those.
+/*
+ * Checks for presence of stats for object with provided object oid of kind
+ * specified in the type string in database of provided database oid.
+ *
+ * For subscription stats, only the objoid will be used. For database stats,
+ * only the dboid will be used. The value passed in for the unused parameter is
+ * discarded.
+ * TODO: should it be 'pg_stat_stats_present' instead of 'pg_stat_stats_exist'?
+ */
+Datum
+pg_stat_stats_exist(PG_FUNCTION_ARGS)
Should we revoke stats for this one from PUBLIC (similar to the reset functions)?
+# Set track_functions to all on standby
+$node_standby->append_conf('postgresql.conf', "track_functions = 'all'");
That should already be set, cloning from the primary includes the
configuration from that point in time.
+$node_standby->restart;
FWIW, it'd also only require a reload....
Greetings,
Andres Freund
On Sun, Mar 20, 2022 at 12:58 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-03-20 12:32:39 -0400, Melanie Plageman wrote:
Attached is a TAP test to check that stats are cleaned up on a physical
replica after the objects they concern are dropped on the primary.
Thanks!
I'm not sure that the extra force next flush on standby is needed after
drop on primary since drop should report stats and I wait for catchup.
A drop doesn't force stats in other sessions to be flushed immediately, so
unless I misunderstand, yes, it's needed.
Also, I don't think the tests with DROP SCHEMA actually exercise another
code path, so it might be worth cutting those.
+/*
+ * Checks for presence of stats for object with provided object oid of kind
+ * specified in the type string in database of provided database oid.
+ *
+ * For subscription stats, only the objoid will be used. For database stats,
+ * only the dboid will be used. The value passed in for the unused parameter is
+ * discarded.
+ * TODO: should it be 'pg_stat_stats_present' instead of 'pg_stat_stats_exist'?
+ */
+Datum
+pg_stat_stats_exist(PG_FUNCTION_ARGS)
Should we revoke stats for this one from PUBLIC (similar to the reset functions)?
+# Set track_functions to all on standby
+$node_standby->append_conf('postgresql.conf', "track_functions = 'all'");
That should already be set, cloning from the primary includes the
configuration from that point in time.
+$node_standby->restart;
FWIW, it'd also only require a reload....
Addressed all of these points in
v2-0001-add-replica-cleanup-tests.patch
also added a new test file in
v2-0002-Add-TAP-test-for-discarding-stats-after-crash.patch
testing correct behavior after a crash and when stats file is invalid
- Melanie
Attachments:
Hi,
Attached is v67 of the patch. Changes:
- I've committed a number of the earlier patches after polishing them some more
- lots of small cleanups, particularly around reducing unnecessary diff noise
- included Melanie's tests
On 2022-03-17 00:36:52 -0700, Andres Freund wrote:
I think the first few patches are basically ready to be applied and are
independently worthwhile:
- 0001-pgstat-run-pgindent-on-pgstat.c-h.patch
- 0002-pgstat-split-relation-database-stats-handling-ou.patch
- 0003-pgstat-split-out-WAL-handling-from-pgstat_-initi.patch
- 0004-pgstat-introduce-pgstat_relation_should_count.patch
- 0005-pgstat-xact-level-cleanups-consolidation.patch
Committed.
Might not be worth having separately, should probably just be part of
0014:
- 0006-pgstat-wip-pgstat-relation-init-assoc.patch
Committed parts, the "assoc" stuff was moved into the main shared memory stats
patch.
A pain to maintain, needs mostly a bit of polishing of file headers. Perhaps I
should rename pgstat_checkpoint.c to pgstat_checkpointer.c, fits better with
function names:
- 0007-pgstat-split-different-types-of-stats-into-separ.patch
Committed.
This is also painful to maintain. Mostly kept separate from 0007 for easier
reviewing:
- 0009-pgstat-reorder-file-pgstat.c-pgstat.h-contents.patch
Planning to commit this soon (it's now 0001). Doing a last few passes of
readthrough / polishing.
I don't yet know what we should do with other users of
PG_STAT_TMP_DIR. There's no need for it for pgstat.c et al anymore. Not sure
that pg_stat_statement is enough of a reason to keep the stats_temp_directory
GUC around?
- 0019-pgstat-wip-remove-stats_temp_directory.patch
Still unclear. Might raise this separately for higher visibility.
Right now we reset stats for replicas, even if we start from a shutdown
checkpoint. That seems pretty unnecessary with this patch:
- 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patch
Might raise this in another thread for higher visibility.
The biggest todos are:
- Address all the remaining AFIXMEs and XXXs
- add longer explanation of architecture to pgstat.c (or a README)
- make naming not "a pain in the neck": [1]
- lots of polishing
- run benchmarks - I've done so in the past, but not recently
Still TBD
- revise docs
Kyotaro-san, maybe you could do a first pass?
- Further improve our stats test coverage - there's a crapton not covered,
despite 0017:
- test WAL replay with stats (stats for dropped tables are removed etc)
- test crash recovery and "invalid stats file" paths
- a lot of the pg_stat_ views like bgwriter, pg_stat_database have zero coverage today
That's gotten a lot better with Melanie's tests, still a bit further to go. I
think she's found at least one more small bug that's not yet fixed here.
- perhaps 0014 can be further broken down - it's still uncomfortably large
Things that I think can be split out:
- Encapsulate "if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)"
style tests in a helper function. Then just the body needs to be changed,
rather than a lot of places needing such checks.
Yep, that's it. I don't really see anything else that wouldn't be too
awkward. Would welcome suggestions!
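For instance, a minimal sketch of such a helper (the name is hypothetical; pgStatSock and pgstat_track_counts are the existing globals from the quoted condition):

static inline bool
pgstat_should_count(void)
{
    return pgStatSock != PGINVALID_SOCKET && pgstat_track_counts;
}

Call sites then reduce to "if (!pgstat_should_count()) return;", and the shared-memory patch only has to change the helper's body.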
Greetings,
Andres Freund
Attachments:
On Sun, Mar 20, 2022 at 4:56 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
Addressed all of these points in
v2-0001-add-replica-cleanup-tests.patch
also added a new test file in
v2-0002-Add-TAP-test-for-discarding-stats-after-crash.patch
testing correct behavior after a crash and when stats file is invalid
Attached is the last of the tests confirming clean up for stats in the
shared stats hashtable (these are for the subscription stats).
I thought that maybe these tests could now use
pg_stat_force_next_flush() instead of poll_query_until() but I wasn't
sure how to ensure that the error has happened and the pending entry has
been added before setting force_next_flush.
I also added in tests that resetting subscription stats works as
expected.
- Melanie
Attachments:
At Mon, 21 Mar 2022 14:30:17 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
Attached is v67 of the patch. Changes:
Thanks for all the work on this.
This is also painful to maintain. Mostly kept separate from 0007 for easier
reviewing:
- 0009-pgstat-reorder-file-pgstat.c-pgstat.h-contents.patch
Planning to commit this soon (it's now 0001). Doing a last few passes of
readthrough / polishing.
This looks to have been committed.
I don't yet know what we should do with other users of
PG_STAT_TMP_DIR. There's no need for it for pgstat.c et al anymore. Not sure
that pg_stat_statement is enough of a reason to keep the stats_temp_directory
GUC around?
- 0019-pgstat-wip-remove-stats_temp_directory.patch
Still unclear. Might raise this separately for higher visibility.
Right now we reset stats for replicas, even if we start from a shutdown
checkpoint. That seems pretty unnecessary with this patch:
- 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patch
Might raise this in another thread for higher visibility.
The biggest todos are:
- Address all the remaining AFIXMEs and XXXs
- add longer explanation of architecture to pgstat.c (or a README)
- make naming not "a pain in the neck": [1]
- lots of polishing
- run benchmarks - I've done so in the past, but not recently
Still TBD
- revise docs
Kyotaro-san, maybe you could do a first pass?
Docs.. Yeah I'll try it.
- Further improve our stats test coverage - there's a crapton not covered,
despite 0017:
- test WAL replay with stats (stats for dropped tables are removed etc)
- test crash recovery and "invalid stats file" paths
- a lot of the pg_stat_ views like bgwriter, pg_stat_database have zero coverage today
That's gotten a lot better with Melanie's tests, still a bit further to go. I
think she's found at least one more small bug that's not yet fixed here.
- perhaps 0014 can be further broken down - it's still uncomfortably large
Things that I think can be split out:
- Encapsulate "if (pgStatSock == PGINVALID_SOCKET || !pgstat_track_counts)"
style tests in a helper function. Then just the body needs to be changed,
rather than a lot of places needing such checks.
Yep, that's it. I don't really see anything else that wouldn't be too
awkward. Would welcome suggestions!
I'm overwhelmed by the amount, but I'm going to look into them.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
I've attached a substantially improved version of the shared memory stats
patch.
...
- a lot of the pg_stat_ views like bgwriter, pg_stat_database have zero coverage today
Attached are some tests, including tests that resetting stats works for
all views having a reset timestamp as well as a basic test for at least
one column in all of the following stats views:
pg_stat_archiver, pg_stat_bgwriter, pg_stat_wal, pg_stat_slru,
pg_stat_replication_slots, pg_stat_database
It might be nice to have a test for one of the columns fetched from the
PgStatBgwriter data structure since those and the Checkpointer stats are
stored separately despite being displayed in the same view currently.
But, alas...
- Melanie
Attachments:
At Tue, 22 Mar 2022 11:56:40 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Docs.. Yeah I'll try it.
This is the first cut, based on the earlier patchset.
monitoring.sgml:
When using the statistics to monitor collected data, it is important
I failed to read this clearly. I modified the part, assuming that
"the statistics" means "the statistics views and functions".
I didn't mention pgstat_force_next_flush() since I think it is a
developer-only feature.
In the attached diff, I refrained from reindenting, for easy review.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
At Mon, 21 Mar 2022 14:30:17 -0700, Andres Freund <andres@anarazel.de> wrote in
Right now we reset stats for replicas, even if we start from a shutdown
checkpoint. That seems pretty unnecessary with this patch:
- 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patchMight raise this in another thread for higher visibility.
+ /*
+ * When starting with crash recovery, reset pgstat data - it might not be
+ * valid. Otherwise restore pgstat data. It's safe to do this here,
+ * because postmaster will not yet have started any other processes
+ *
+ * TODO: With a bit of extra work we could just start with a pgstat file
+ * associated with the checkpoint redo location we're starting from.
+ */
+ if (ControlFile->state == DB_SHUTDOWNED ||
+ ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY)
+ pgstat_restore_stats();
+ else
+ pgstat_discard_stats();
+
Before there, InitWalRecovery changes the state to
DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
DB_IN_PRODUCTION. So the stats seem to always be discarded on a standby.
In the first place, I'm not sure it is valid that a standby from a
cold backup takes over the stats from the primary.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-03-23 17:27:50 +0900, Kyotaro Horiguchi wrote:
At Mon, 21 Mar 2022 14:30:17 -0700, Andres Freund <andres@anarazel.de> wrote in
Right now we reset stats for replicas, even if we start from a shutdown
checkpoint. That seems pretty unnecessary with this patch:
- 0021-pgstat-wip-only-reset-pgstat-data-after-crash-re.patchMight raise this in another thread for higher visibility.
+ /*
+ * When starting with crash recovery, reset pgstat data - it might not be
+ * valid. Otherwise restore pgstat data. It's safe to do this here,
+ * because postmaster will not yet have started any other processes
+ *
+ * TODO: With a bit of extra work we could just start with a pgstat file
+ * associated with the checkpoint redo location we're starting from.
+ */
+ if (ControlFile->state == DB_SHUTDOWNED ||
+ ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY)
+ pgstat_restore_stats();
+ else
+ pgstat_discard_stats();
+
Before there, InitWalRecovery changes the state to
DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
DB_IN_PRODUCTION. So the stats seem to always be discarded on a standby.
Hm. I thought it worked at some point. I guess there's a reason this commit is
a separate commit marked WIP ;)
In the first place, I'm not sure it is valid that a standby from a
cold backup takes over the stats from the primary.
I don't really see a reason not to use the stats in that case - we have a
correct stats file after all. But it doesn't seem too important. What I
actually find worth addressing is the case of standbys starting in
DB_SHUTDOWNED_IN_RECOVERY. Right now we always throw stats away after a
*graceful* restart of a standby, which doesn't seem great.
Greetings,
Andres Freund
(nice work about ubsan)
At Wed, 23 Mar 2022 10:42:03 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-03-23 17:27:50 +0900, Kyotaro Horiguchi wrote:
At Mon, 21 Mar 2022 14:30:17 -0700, Andres Freund <andres@anarazel.de> wrote in
Before there, InitWalRecovery changes the state to
DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
DB_IN_PRODUCTION. So the stats seem to always be discarded on a standby.
Hm. I thought it worked at some point. I guess there's a reason this commit is
a separate commit marked WIP ;)
Yeah, I know:p
In the first place, I'm not sure it is valid that a standby from a
cold backup takes over the stats from the primary.
I don't really see a reason not to use the stats in that case - we have a
correct stats file after all. But it doesn't seem too important. What I
actually find worth addressing is the case of standbys starting in
DB_SHUTDOWNED_IN_RECOVERY. Right now we always throw stats away after a
*graceful* restart of a standby, which doesn't seem great.
It is undoubtedly an improvement if stats are preserved after a graceful
restart on standbys. I just wonder if there's a way to detect the
first start of a standby from a cold backup. But even if there isn't,
it's an improvement.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
The biggest todos are:
- Address all the remaining AFIXMEs and XXXs
Attached is a patch that addresses three of the existing AFIXMEs.
Attachments:
At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
The biggest todos are:
- Address all the remaining AFIXMEs and XXXsAttached is a patch that addresses three of the existing AFIXMEs.
Thanks!
+ .reset_timestamp_cb = pgstat_shared_reset_timestamp_noop,
(I once misunderstood that the "shared" means shared memory area..)
The reset function is type-specific and must be set. So shouldn't we
provide all the required reset functions?
+ if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+ {
+ Oid msg_oid = (kind == PGSTAT_KIND_DB) ? dboid : objoid;
Explicitly using PGSTAT_KIND_DB here is a kind of annoyance. Since we
always give InvalidOid correctly as the parameters, and objoid alone
is not specific enough, could we warn using both dboid and objoid without
special treatment?
Concretely, I propose to do the following instead.
+ if (pgstat_shared_ref_get(kind, dboid, objoid, false, NULL))
+ {
+ ereport(WARNING,
+ errmsg("resetting existing stats for type %s, db=%d, oid=%d",
+ pgstat_kind_info_for(kind)->name, dboid, objoid);
+pgstat_pending_delete(PgStatSharedRef *shared_ref)
+{
+ void *pending_data = shared_ref->pending;
+ PgStatKind kind = shared_ref->shared_entry->key.kind;
+
+ Assert(pending_data != NULL);
+ Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+
+ /* PGSTAT_KIND_TABLE has its own callback */
+ Assert(kind != PGSTAT_KIND_TABLE);
+
"kind" is used only in assertion, which requires PG_USED_FOR_ASSERTS_ONLY.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Fri, 25 Mar 2022 14:22:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Thu, 24 Mar 2022 13:21:33 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in
On Thu, Mar 17, 2022 at 3:36 AM Andres Freund <andres@anarazel.de> wrote:
The biggest todos are:
- Address all the remaining AFIXMEs and XXXsAttached is a patch that addresses three of the existing AFIXMEs.
I'd like to dump out my humble thoughts about other AFIXMEs..
AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
for increasing it?
It is 1000ms in the comment just above but actually 10000ms. The
number came from a discussion that if we have 1000 clients and each
backend writes stats once per 0.5 seconds, in total we flush pending
data to the shared area 2000 times per second, which is too frequent. I
raised it to 5000ms, then 10000ms. So the expected maximum flush
frequency is reduced to 100 times per second. Of course this
assumes the worst case, and 10000ms is apparently too long for the
average case.
The current implementation of pgstat postpones flushing if a lock
collision happens, postponing by at most 60s. This is a kind of
auto-averaging mechanism. It might be enough, and we could reduce
PGSTAT_MIN_INTERVAL to 500ms or so.
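A standalone sketch of that auto-averaging (the constants and function are illustrative, not the patch's actual code):

#define PGSTAT_MIN_INTERVAL   1000  /* ms, normal flush pause */
#define PGSTAT_MAX_INTERVAL  60000  /* ms, cap under lock contention */

/* pause to use after a flush attempt that hit lock contention */
static int
next_flush_pause(int last_pause_ms)
{
    int pause = (last_pause_ms == 0) ? PGSTAT_MIN_INTERVAL : last_pause_ms * 2;

    return (pause > PGSTAT_MAX_INTERVAL) ? PGSTAT_MAX_INTERVAL : pause;
}

With a 1000ms minimum, 1000 backends flush at most about 1000 times per second in total when uncontended, backing off toward the 60s cap when contended.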
AFIXME: architecture explanation.
Mmm. next, please:p
( [PGSTAT_KIND_REPLSLOT] = {)
* AFIXME: With a bit of extra work this could now be a !fixed_amount
* stats kind.
Yeah. The most bothersome point is the slot index is not persistent
at all and the relationship between the index and name (or identity)
is not stable even within a process's life. It could be resolved by
allocating an object id to every replication slot. I faintly remember
a discussion like that, but I don't have a clear memory of the
discussion.
static Size
pgstat_dsa_init_size(void)
{
/*
* AFIXME: What should we choose as an initial size? Should we make this
* configurable? Maybe tune based on NBuffers?
StatsShmemInit(void)
* AFIXME: we need to guarantee this can be allocated in plain shared
* memory, rather than allocating dsm segments.
I'm not sure that NBuffers is the ideal base for deciding the required
size, since it doesn't seem to be generally proportional to the
number of database objects. If we made it manually tunable, we would
be able to emit a log message when DSM segment allocation happens for
this use, as a tuning aid:
WARNING: dsa allocation happened for activity statistics
HINT: You might want to increase stat_dsa_initial_size if you see slow
down blah..
* AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
Or pgstat_xact.c?
* AFIXME: comment
* AFIXME: see notes about race conditions for functions in
* pgstat_drop_function().
*/
void
pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)
pgstat_drop_function() doesn't seem to have such a note.
I suppose the "race condition" means the case a stats entry for an
object is created just after the same object is dropped on another
backend. It seems to me such a race condition is eliminated by the
transactional drop mechanism. Are you intending to write an
explanation of that?
/*
* pgStatSharedRefAge increments quite slowly compared to the time the following
* loop takes, so this is expected to iterate no more than twice.
*
* AFIXME: Why is this a good place to do this?
*/
while (pgstat_shared_refs_need_gc())
pgstat_shared_refs_gc();
Is the reason for the AFIXME that you think the GC check happens too
frequently?
pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
{
...
* AFIXME: this probably is racy. Another backend could look up the
* stat, bump the refcount, as we free it.
if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
{
...
/* only dropped entries can reach a 0 refcount */
Assert(shared_ref->shared_entry->dropped);
I didn't examine it deeply, but is that race condition avoidable by
preventing pgstat_shared_ref_get() from incrementing the refcount of
dropped entries?
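A standalone sketch of that idea, using C11 atomics as a stand-in for pg_atomic (locking interactions with dshash are ignored here):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct SharedEntry
{
    atomic_uint refcount;
    atomic_bool dropped;
} SharedEntry;

/* Pin the entry, refusing to resurrect one that is dropped or whose
 * refcount already reached 0 (i.e. is concurrently being freed). */
static bool
pin_entry(SharedEntry *e)
{
    unsigned int old = atomic_load(&e->refcount);

    while (old > 0 && !atomic_load(&e->dropped))
    {
        /* on failure, old is reloaded and the checks rerun */
        if (atomic_compare_exchange_weak(&e->refcount, &old, old + 1))
            return true;
    }
    return false;
}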
* AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
* it's not entirely trivial, because we can't use plain dshash_delete_entry()
* (but have to use dshash_delete_current()).
*/
static bool
pgstat_drop_stats_entry(dshash_seq_status *hstat)
...
* AFIXME: don't do this while holding the dshash lock.
Do the AFIXMEs mean that we should move the call to
pgstat_shared_ref_release() out of the dshash loop (in
pgstat_drop_database_and_contents) that calls this function? Would it be
sensible to store the (key, ref) pairs for to-be-released
shared_refs and then clean them up after exiting the loop?
* Database stats contain other stats. Drop those as well when
* dropping the database. AFIXME: Perhaps this should be done in a
* slightly more principled way?
*/
if (key.kind == PGSTAT_KIND_DB)
pgstat_drop_database_and_contents(key.dboid);
I tend to agree with that, and it is possible to have a
PgStatKindInfo.drop_cascade_cb(PgStatShm_StatEntryHeader *header). But
it is really needed only by PGSTAT_KIND_DB..
* AFIXME: consistent naming
* AFIXME: deduplicate some of this code with pgstat_fetch_snapshot_build().
*
* AFIXME: it'd be nicer if we passed .snapshot_cb() the target memory
* location, instead of putting PgStatSnapshot into pgstat_internal.h
*/
void
pgstat_snapshot_global(PgStatKind kind)
Does having PGSTAT_KIND_NONE in PgStatKind or InvalidPgStatKind work
for deduplication? But I'm afraid that would hurt in some way.
For the memory location, it seems like a matter of taste, but if we
don't need multiple copies of a global snapshot, I think
.snapshot_cb() doesn't need to take the target memory location.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
I'd like to dump out my humble thoughts about other AFIXMEs..
Thanks! Please have another look at the code in
https://github.com/anarazel/postgres/tree/shmstat. I just pushed a revised
version with a lot of [a]fixmes removed.
Most importantly I did move replication slot stats into the hash table, and
just generally revised the replication slot stats code substantially. I
think it does look better now.
But also there's a new commit allowing dsm use in single user mode. To be able
to rely on stats drops we need to perform them even in single user mode. The
only reason this didn't previously fail was that we allocated enough "static"
shared memory for single user mode to never need DSMs.
Thanks to Melanie's tests, and a few more additions by myself, the code is now
reasonably well covered. The big exception to that is recovery conflict stats,
and as Melanie noticed, that was broken (somehow pgstat_database_flush_cb()
didn't sum them up). I think she has some WIP tests...
Re the added tests: I did fix a few timing issues there. There's probably a
few more hiding somewhere.
I also found that unfortunately dshash_seq_next() as is isn't correct. I
included a workaround commit, but it's not correct. What we need to do is to
just always lock partition 0 in the initialization branch. Before we call
ensure_valid_bucket_pointers() status->hash_table->size_log2 isn't valid. And
ensure_valid_bucket_pointers can only be called with a lock...
Horiguchi-san, if you have time to look at the "XXX: The following could now be
generalized" in pgstat_read_statsfile(), pgstat_write_statsfile()... I think
that'd be nice to clean up.
AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
for increasing it?
It is 1000ms in the comment just above but actually 10000ms. The
number came from a discussion that if we have 1000 clients and each
backend writes stats once per 0.5 seconds, in total we flush pending
data to the shared area 2000 times per second, which is too frequent.
Have you measured this (recently)? I tried to cause contention with a workload
targeted towards that, but couldn't see a problem with 1000ms. Of course
there's a problem with 1ms...
I think it's confusing to not report stats for 10s without a need.
The current implementation of pgstat postpones flushing when a lock
collision happens, by at most 60s. This is a kind of auto-averaging
mechanism. It might be enough, and we could reduce the
PGSTAT_MIN_INTERVAL to 500ms or so.
Yea, I think the 60s part under contention is fine. I'd expect that to be
rarely reached.
AFIXME: architecture explanation.
Mmm. next, please:p
Working on it. There's one more AFIXME that I want to resolve before, so I
don't end up with old type names strewn around (the one in pgstat_internal.h).
( [PGSTAT_KIND_REPLSLOT] = {)
* AFIXME: With a bit of extra work this could now be a !fixed_amount
* stats kind.
Yeah. The most bothersome point is that the slot index is not persistent
at all, and the relationship between the index and the name (or identity)
is not stable even within a process's lifetime. It could be resolved by
allocating an object id to every replication slot. I faintly remember
a discussion like that, but I don't have a clear memory of it.
I think it's resolved now. pgstat_report_replslot* all get the ReplicationSlot
as a parameter. They use the new ReplicationSlotIndex() to get an index from
that. pgstat_report_replslot_(create|acquire) ensure that the relevant index
doesn't somehow contain old stats.
To deal with indexes changing / slots getting removed during restart, there's
now a new callback made during pgstat_read_statsfile() to build the key from
the serialized NameStr. That can return false if a slot of that name is not
known, or use ReplicationSlotIndex() to get the index to store in-memory stats.
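In case it helps review, the callback boils down to roughly the following
(a sketch; the signature in the actual patch may differ):

static bool
pgstat_replslot_from_serialized_name_cb(const NameData *name,
										PgStatHashKey *key)
{
	ReplicationSlot *slot = SearchNamedReplicationSlot(NameStr(*name), true);

	/* slot was dropped (or renamed) across the restart - skip its stats */
	if (slot == NULL)
		return false;

	key->kind = PGSTAT_KIND_REPLSLOT;
	key->dboid = InvalidOid;
	key->objoid = ReplicationSlotIndex(slot);
	return true;
}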
static Size
pgstat_dsa_init_size(void)
{
/*
* AFIXME: What should we choose as an initial size? Should we make this
* configurable? Maybe tune based on NBuffers?
StatsShmemInit(void)
* AFIXME: we need to guarantee this can be allocated in plain shared
* memory, rather than allocating dsm segments.
I'm not sure that NBuffers is the ideal base for deciding the required
size, since it doesn't seem to be generally in proportion with the
number of database objects. If we made it manually tunable, we could
emit a log message when DSM segment allocation happens for this use,
as a tuning aid:
WARNING: dsa allocation happened for activity statistics
HINT: You might want to increase stat_dsa_initial_size if you see slow
down blah..
FWIW, I couldn't find any performance impact from using DSM. Because of the
"PgStatSharedRef" layer, there's not actually that much interaction with the
dsm code...
I reduced the initial allocation to 256kB. Unfortunately that's currently the
minimum that allows dshash_create() to succeed (due to dsa.c pre-allocating 16
of each allocation). I was a bit worried about that for a while, but memory
usage is still lower with the patch than before in the scenarios I tested. We
can probably improve upon that fairly easily in the future (move
dshash_table_control into static shared memory, call dsa_trim() when resizing
dshash table).
* AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
Or pgstat_xact.c?
Maybe. Somehow it doesn't seem *great* either.
* AFIXME: comment
* AFIXME: see notes about race conditions for functions in
* pgstat_drop_function().
*/
void
pgstat_schedule_stat_drop(PgStatKind kind, Oid dboid, Oid objoid)
pgstat_drop_function() doesn't seem to have such a note.
Yea, I fixed it in pgstat_init_function_usage(), forgetting about the note in
pgstat_schedule_stat_drop(). There's a decently long comment in
pgstat_init_function_usage() explaining the problem.
I suppose the "race condition" means the case a stats entry for an
object is created just after the same object is dropped on another
backend. It seems to me such a race condition is eliminated by the
transactional drop mechanism. Are you intending to write an
explanation of that?
Yes, I definitely plan to write a bit more about that.
/*
* pgStatSharedRefAge increments quite slowly compared to the time the following
* loop takes, so this is expected to iterate no more than twice.
*
* AFIXME: Why is this a good place to do this?
*/
while (pgstat_shared_refs_need_gc())
pgstat_shared_refs_gc();
Is the reason for the AFIXME that you think the GC check happens too
frequently?
Well, the while () loop makes me "suspicious" when looking at the code. I've
now made it an if (), I can't see a reason why we'd need a while()?
I just moved a bunch of that code around, there's probably a bit more polish
needed.
pgstat_shared_ref_release(PgStatHashKey key, PgStatSharedRef *shared_ref)
{...
* AFIXME: this probably is racy. Another backend could look up the
* stat, bump the refcount, as we free it.
if (pg_atomic_fetch_sub_u32(&shared_ref->shared_entry->refcount, 1) == 1)
{...
/* only dropped entries can reach a 0 refcount */
Assert(shared_ref->shared_entry->dropped);
I haven't examined it deeply, but is that race condition avoidable by
preventing pgstat_shared_ref_get from incrementing the refcount of
dropped entries?
I don't think the race exists anymore. I've now revised the relevant code.
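For the archives, the safe shape of the "get" side is roughly the
following (a sketch, not necessarily the code as revised; note the
->dropped check additionally needs the partition lock to be fully
reliable):

/* in pgstat_shared_ref_get(), when pinning an existing shared entry */
uint32		refcount = pg_atomic_read_u32(&shared_entry->refcount);

for (;;)
{
	/* never resurrect an entry that a releaser may be about to free */
	if (refcount == 0 || shared_entry->dropped)
		return false;			/* caller must redo the hash lookup */

	/* on failure the CAS refreshes refcount with the current value */
	if (pg_atomic_compare_exchange_u32(&shared_entry->refcount,
									   &refcount, refcount + 1))
		return true;			/* safely pinned */
}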
* AFIXME: This needs to be deduplicated with pgstat_shared_ref_release(). But
* it's not entirely trivial, because we can't use plain dshash_delete_entry()
* (but have to use dshash_delete_current()).
*/
static bool
pgstat_drop_stats_entry(dshash_seq_status *hstat)...
* AFIXME: don't do this while holding the dshash lock.
Do the AFIXMEs mean that we should move the call to
pgstat_shared_ref_release() out of the dshash loop (in
pgstat_drop_database_and_contents) that calls this function? Would it be
sensible to store the (key, ref) pairs for the to-be-released
shared_refs, then clean them up after exiting the loop?
I think this is now resolved. The release now happens separately, without
nested locks. See pgstat_shared_refs_release_db() call in
pgstat_drop_database_and_contents().
* Database stats contain other stats. Drop those as well when
* dropping the database. AFIXME: Perhaps this should be done in a
* slightly more principled way?
*/
if (key.kind == PGSTAT_KIND_DB)
pgstat_drop_database_and_contents(key.dboid);
I tend to agree with that, and it would be possible to have
PgStatKindInfo.drop_cascade_cb(PgStatShm_StatEntryHeader *header). But
it is really needed only by PGSTAT_KIND_DB..
Yea, I came to the same conclusion, namely that we don't need something better
for now.
* AFIXME: consistent naming
* AFIXME: deduplicate some of this code with pgstat_fetch_snapshot_build().
*
* AFIXME: it'd be nicer if we passed .snapshot_cb() the target memory
* location, instead of putting PgStatSnapshot into pgstat_internal.h
*/
void
pgstat_snapshot_global(PgStatKind kind)
Does having PGSTAT_KIND_NONE in PgStatKind or InvalidPgStatKind work
for deduplication? But I'm afraid that might do harm in some way.
I think I made it a bit nicer now, without needing either of those. I'd like
to remove "global" from those functions, it's not actually that obvious what
it means.
For the memory location, it seems like a matter of taste, but if we
don't need multiple copies of a global snapshot, I think
.snapshot_cb() doesn't need to take the target memory location.
I think it's ok for now. It'd be a bit nicer if we didn't need PgStatSnapshot
/ stats_snapshot in pgstat_internal.h, but it's ok that way I think.
Greetings,
Andres Freund
Hi,
On 2022-03-23 16:38:33 +0900, Kyotaro Horiguchi wrote:
At Tue, 22 Mar 2022 11:56:40 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
Docs.. Yeah I'll try it.
This is the first cut, based on the earlier patchset.
Thanks!
I didn't mention pgstat_force_next_flush() since I think it is a
developer-only feature.
Yes, that makes sense.
Sorry for not yet getting back to looking at this.
One thing we definitely need to add documentation for is the
stats_fetch_consistency GUC. I think we should change its default to 'cache',
because that still gives the ability to "self-join", without the cost of the
current method.
Greetings,
Andres Freund
Hi,
On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
AFIXME: Isn't PGSTAT_MIN_INTERVAL way too long? What is the justification
for increasing it?
It is 1000ms in the comment just above but actually 10000ms. The
number came from a discussion that if we have 1000 clients and each
backend writes stats once per 0.5 seconds, in total we flush pending
data to the shared area 2000 times per second, which is too frequent. I
raised it to 5000ms, then 10000ms, so the expected maximum flush
frequency is reduced to 100 times per second. Of course that is
assuming the worst case, and 10000ms is apparently too long for the
average case.
The current implementation of pgstat postpones flushing when a lock
collision happens, by at most 60s. This is a kind of auto-averaging
mechanism. It might be enough, and we could reduce the
PGSTAT_MIN_INTERVAL to 500ms or so.
I just noticed that the code doesn't appear to actually work like that right
now. Whenever the timeout is reached, pgstat_report_stat() is called with
force = true.
And even if the backend is busy running queries, once there's contention, the
next invocation of pgstat_report_stat() will return the timeout relative to
pending_since, which then will trigger a force report via a very short timeout
soon.
It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
(with a slightly different name) from pgstat_report_stat() when blocked
(limiting the max reporting delay for an idle connection) and to continue
calling pgstat_report_stat(force = true). But to only trigger force
"internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.
I think that'd mean we'd report after max PGSTAT_RETRY_MIN_INTERVAL in an idle
connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing up
to PGSTAT_MAX_INTERVAL when blocked) on busy connections.
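In pseudo-C the proposal is roughly this (a sketch only; names and the
pending_since handling are approximations, not the current code):

static TimestampTz pending_since = 0;

long
pgstat_report_stat(bool force)
{
	TimestampTz now = GetCurrentTimestamp();

	/* blocked for too long: insist on flushing this time */
	if (pending_since != 0 &&
		TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
		force = true;

	if (pgstat_flush_pending_entries(!force /* nowait */))
	{
		/* partial flush: remember when the backlog started ... */
		if (pending_since == 0)
			pending_since = now;
		/* ... but always report the same short retry interval */
		return PGSTAT_RETRY_MIN_INTERVAL;
	}

	pending_since = 0;
	return 0;
}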
Makes sense?
I think we need to do something with the pgstat_report_stat() calls outside of
postgres.c. Otherwise there's nothing limiting their reporting delay, because
they don't have the timeout logic postgres.c has. None of them is ever hot
enough to be problematic, so I think we should just make them pass force=true?
Greetings,
Andres Freund
Hi,
On 2022-03-25 17:24:18 +0900, Kyotaro Horiguchi wrote:
* AFIXME: Should all the stats drop code be moved into pgstat_drop.c?
Or pgstat_xact.c?
I wasn't initially happy with that suggestion, but after running with it, it
looks pretty good.
I also moved a fair bit of code into pgstat_shm.c, which to me improved code
navigation a lot. I'm wondering about splitting it further even, into
pgstat_shm.c and pgstat_entry.c.
What do you think?
Greetings,
Andres Freund
Hi,
New version of the shared memory stats patchset. Most important changes:
- It's now "cumulative statistics system", as discussed at [1]/messages/by-id/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de. This basically
is now the term that all the references to the "stats collector" are
replaced with. Looks much better than "activity statistics" imo. The
STATS_COLLECTOR is now named STATS_CUMULATIVE. I tried to find all
references to either collector or "activity statistics", but in all
likelihood I didn't get them all.
- updated docs (significantly edited version of the version Kyotaro posted a
few days ago)
- significantly improved test coverage - pgstat*.c are nearly completely
covered. While pgstatfuncs.c coverage has increased, it is not great - but
there's already so much more coverage, that I think it's good enough for
now. Thanks to Melanie for help with this!
- largely cleaned up inconsistent function / type naming. Everything now /
again is under the PgStats_ prefix, except for statistics in shared memory,
which is prefixed with PgStatsShared_. I think we should go further and
add at least a PgStatsPending_ namespace, but that requires touching plenty
code that didn't need to be touched so far, so it'll have to be task for
another release.
- As discussed in [2] I added a patch at the start of the queue to clean up
the inconsistent function header comments conventions.
- pgstat.c is further split. Two new files: pgstat_xact.c and pgstat_shmem.c
(wrote an email about this a few days ago, without sending the patches)
- Split out as much as I could into separate commits.
- Cleaned up autovacuum.c changes - mostly removing more obsolete code
- code, comment polishing
Still todo:
- docs need review
- finish writing architectural comment atop pgstat.c
- fix the bug around pgstat_report_stat() I wrote about at [3]
- collect who reviewed earlier revisions
- choose what conditions for stats file reset we want
- I'm wondering if the solution for replication slot names on disk is too
narrow, and instead we should have a more general "serialize" /
"deserialize" callback. But we can easily do that later as well...
There's a bit more inconsistency around function naming. Right now all
callbacks are pgstat_$kind_$action_cb, but most of the rest of pgstat is
pgstat_$action_$kind. But somehow it "feels" wrong for the callbacks -
there's also a bunch of functions already named similarly, but that's
partially my fault in commits in the past.
There are a lot of copies of "Permission checking for this function is managed
through the normal GRANT system." in the pre-existing code. Aren't they
completely bogus? None of the functions commented upon like that is actually
exposed to SQL!
Please take a look!
Greetings,
Andres Freund
[1]: /messages/by-id/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de
[2]: /messages/by-id/20220329191727.mzzwbl7udhpq7pmf@alap3.anarazel.de
[3]: /messages/by-id/20220402081648.kbapqdxi2rr3ha3w@alap3.anarazel.de
Attachments:
On Mon, Apr 4, 2022 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
Please take a look!
A few superficial comments:
[PATCH v68 01/31] pgstat: consistent function header formatting.
[PATCH v68 02/31] pgstat: remove some superflous comments from pgstat.h.
+1
[PATCH v68 03/31] dshash: revise sequential scan support.
Logic looks good. That is,
lock-0-and-ensure_valid_bucket_pointers()-only-once makes sense. Just
some comment trivia:
+ * dshash_seq_term needs to be called when a scan finished. The caller may
+ * delete returned elements midst of a scan by using dshash_delete_current()
+ * if exclusive = true.
s/scan finished/scan is finished/
s/midst of/during/ (or /in the middle of/, ...)
[PATCH v68 04/31] dsm: allow use in single user mode.
LGTM.
+ Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
(Not this patch's fault, but I wish we had a more explicit way to say "am
single user".)
[PATCH v68 05/31] pgstat: stats collector references in comments
LGTM. I could think of some alternative suggested names for this subsystem,
but don't think it would be helpful at this juncture so I will refrain :-)
[PATCH v68 06/31] pgstat: add pgstat_copy_relation_stats().
[PATCH v68 07/31] pgstat: move transactional code into pgstat_xact.c.
LGTM.
[PATCH v68 08/31] pgstat: introduce PgStat_Kind enum.
+#define PGSTAT_KIND_FIRST PGSTAT_KIND_DATABASE
+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)
It's a little confusing that PGSTAT_NUM_KINDS isn't really the number of kinds,
because there is no kind 0. For the two users of it... maybe just use
pgstat_kind_infos[] = {...}, and
global_valid[PGSTAT_KIND_LAST + 1]?
[PATCH v68 10/31] pgstat: scaffolding for transactional stats creation / drop.
+ /*
+ * Dropping the statistics for objects that dropped transactionally itself
+ * needs to be transactional. ...
Hard to parse. How about: "Objects are dropped transactionally, so
related statistics need to be dropped transactionally too."
[PATCH v68 13/31] pgstat: store statistics in shared memory.
+ * Single-writer stats use the changecount mechanism to achieve low-overhead
+ * writes - they're obviously performance critical than reads. Check the
+ * definition of struct PgBackendStatus for some explanation of the
+ * changecount mechanism.
Missing word "more" after obviously?
+ /*
+ * Whenever the for a dropped stats entry could not be freed (because
+ * backends still have references), this is incremented, causing backends
+ * to run pgstat_gc_entry_refs(), allowing that memory to be reclaimed.
+ */
+ pg_atomic_uint64 gc_count;
Whenever the ...?
Would it be better to call this variable gc_request_count?
+ * Initialize refcount to 1, marking it as valid / not tdroped. The entry
s/tdroped/dropped/
+ * further if a longer lived references is needed.
s/references/reference/
+ /*
+ * There are legitimate cases where the old stats entry might not
+ * yet have been dropped by the time it's reused. The easiest case
+ * are replication slot stats. But oid wraparound can lead to
+ * other cases as well. We just reset the stats to their plain
+ * state.
+ */
+ shheader = pgstat_reinit_entry(kind, shhashent);
This whole comment is repeated in pgstat_reinit_entry and its caller.
+ /*
+ * XXX: Might be worth adding some frobbing of the allocation before
+ * freeing, to make it easier to detect use-after-free style bugs.
+ */
+ dsa_free(pgStatLocal.dsa, pdsa);
FWIW dsa_free() clobbers memory in assert builds, just like pfree().
+static Size
+pgstat_dsa_init_size(void)
+{
+ Size sz;
+
+ /*
+ * The dshash header / initial buckets array needs to fit into "plain"
+ * shared memory, but it's beneficial to not need dsm segments
+ * immediately. A size of 256kB seems works well and is not
+ * disproportional compared to other constant sized shared memory
+ * allocations. NB: To avoid DSMs further, the user can configure
+ * min_dynamic_shared_memory.
+ */
+ sz = 256 * 1024;
It kinda bothers me that the memory reserved by
min_dynamic_shared_memory might eventually fill up with stats, and not
be available for temporary use by parallel queries (which can benefit
more from fast acquire/release on DSMs, and probably also huge pages,
or maybe not...), and that's hard to diagnose.
+ * (4) turn off the idle-in-transaction, idle-session and
+ * idle-state-update timeouts if active. We do this before step (5) so
s/idle-state-/idle-stats-/
+ /*
+ * Some of the pending stats may have not been flushed due to lock
+ * contention. If we have such pending stats here, let the caller know
+ * the retry interval.
+ */
+ if (partial_flush)
+ {
I think it's better for a comment that is outside the block to say "If
some of the pending...". Or the comment should be inside the blocks.
+static void
+pgstat_build_snapshot(void)
+{
...
+ dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
+ while ((p = dshash_seq_next(&hstat)) != NULL)
+ {
...
+ entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
...
+ }
+ dshash_seq_term(&hstat);
Doesn't allocation failure leave the shared hash table locked?
[PATCH v68 16/31] pgstat: add pg_stat_exists_stat() for easier testing.
pg_stat_exists_stat() is a weird name, ... would it be better as
pg_stat_object_exists()?
[PATCH v68 28/31] pgstat: update docs.
+ Determines the behaviour when cumulative statistics are accessed
AFAIK our manual is written in en_US, so s/behaviour/behavior/.
+ memory. When set to <literal>cache</literal>, the first access to
+ statistics for an object caches those statistics until the end of the
+ transaction / until <function>pg_stat_clear_snapshot()</function> is
s|/|unless|
+ <literal>none</literal> is most suitable for monitoring solutions. If
I'd change "solutions" to "tools" or maybe "systems".
+ When using the accumulated statistics views and functions to
monitor collected data, it is important
Did you intend to write "accumulated" instead of "cumulative" here?
+ You can invoke <function>pg_stat_clear_snapshot</function>() to discard the
+ current transaction's statistics snapshot / cache (if any). The next use
I'd change s|/ cache|or cached values|. I think "/" like this is an informal
thing.
Hi,
On 2022-04-05 01:16:04 +1200, Thomas Munro wrote:
On Mon, Apr 4, 2022 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
Please take a look!
A few superficial comments:
[PATCH v68 01/31] pgstat: consistent function header formatting.
[PATCH v68 02/31] pgstat: remove some superflous comments from pgstat.h.
+1
Planning to commit these after making another coffee and proof reading them
some more.
[PATCH v68 03/31] dshash: revise sequential scan support.
Logic looks good. That is,
lock-0-and-ensure_valid_bucket_pointers()-only-once makes sense. Just
some comment trivia:
+ * dshash_seq_term needs to be called when a scan finished. The caller may
+ * delete returned elements midst of a scan by using dshash_delete_current()
+ * if exclusive = true.
s/scan finished/scan is finished/
s/midst of/during/ (or /in the middle of/, ...)
[PATCH v68 04/31] dsm: allow use in single user mode.
LGTM.
+ Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
(Not this patch's fault, but I wish we had a more explicit way to say "am
single user".)
Agreed.
[PATCH v68 05/31] pgstat: stats collector references in comments
LGTM. I could think of some alternative suggested names for this subsystem,
but don't think it would be helpful at this juncture so I will refrain :-)
Heh. I did start a thread about it a while ago :)
[PATCH v68 08/31] pgstat: introduce PgStat_Kind enum.
+#define PGSTAT_KIND_FIRST PGSTAT_KIND_DATABASE
+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)
It's a little confusing that PGSTAT_NUM_KINDS isn't really the number of kinds,
because there is no kind 0. For the two users of it... maybe just use
pgstat_kind_infos[] = {...}, and
global_valid[PGSTAT_KIND_LAST + 1]?
Maybe the whole justification for not defining an invalid kind is moot
now. There's not a single switch covering all kinds of stats left, and I hope
that we don't introduce one again...
[PATCH v68 10/31] pgstat: scaffolding for transactional stats creation / drop.
+ /*
+ * Dropping the statistics for objects that dropped transactionally itself
+ * needs to be transactional. ...
Hard to parse. How about: "Objects are dropped transactionally, so
related statistics need to be dropped transactionally too."
Not all objects are dropped transactionally. But I agree it reads awkwardly.
Incorporating feedback from Justin as well, I've now rephrased it to:
/*
* Statistics for transactionally dropped objects need to be
* transactionally dropped as well. Collect the stats dropped in the
* current (sub-)transaction and only execute the stats drop when we know
* if the transaction commits/aborts. To handle replicas and crashes,
* stats drops are included in commit / abort records.
*/
A few too many "drop"s in there, but maybe that's unavoidable.
+ /*
+ * Whenever the for a dropped stats entry could not be freed (because
+ * backends still have references), this is incremented, causing backends
+ * to run pgstat_gc_entry_refs(), allowing that memory to be reclaimed.
+ */
+ pg_atomic_uint64 gc_count;
Whenever the ...?
* Whenever statistics for dropped objects could not be freed - because
* backends still have references - the dropping backend calls
* pgstat_request_entry_refs_gc() incrementing this counter. Eventually
* that causes backends to run pgstat_gc_entry_refs(), allowing memory to
* be reclaimed.
Would it be better to call this variable gc_request_count?
Agreed.
+ * Initialize refcount to 1, marking it as valid / not tdroped. The entry
s/tdroped/dropped/
+ * further if a longer lived references is needed.
s/references/reference/
Fixed.
+ /*
+ * There are legitimate cases where the old stats entry might not
+ * yet have been dropped by the time it's reused. The easiest case
+ * are replication slot stats. But oid wraparound can lead to
+ * other cases as well. We just reset the stats to their plain
+ * state.
+ */
+ shheader = pgstat_reinit_entry(kind, shhashent);
This whole comment is repeated in pgstat_reinit_entry and its caller.
I guess I felt as indecisive about where to place it between the two locations
when I wrote it as I do now. Left it at the callsite for now.
+ /*
+ * XXX: Might be worth adding some frobbing of the allocation before
+ * freeing, to make it easier to detect use-after-free style bugs.
+ */
+ dsa_free(pgStatLocal.dsa, pdsa);
FWIW dsa_free() clobbers memory in assert builds, just like pfree().
Oh. I could swear I saw that not being the case a while ago. But clearly it is
the case. Removed.
+static Size
+pgstat_dsa_init_size(void)
+{
+ Size sz;
+
+ /*
+ * The dshash header / initial buckets array needs to fit into "plain"
+ * shared memory, but it's beneficial to not need dsm segments
+ * immediately. A size of 256kB seems works well and is not
+ * disproportional compared to other constant sized shared memory
+ * allocations. NB: To avoid DSMs further, the user can configure
+ * min_dynamic_shared_memory.
+ */
+ sz = 256 * 1024;
It kinda bothers me that the memory reserved by
min_dynamic_shared_memory might eventually fill up with stats, and not
be available for temporary use by parallel queries (which can benefit
more from fast acquire/release on DSMs, and probably also huge pages,
or maybe not...), and that's hard to diagnose.
It's not great, but I don't really see an alternative? The saving grace is
that it's hard to imagine "real" usages of min_dynamic_shared_memory being
used up by stats.
+ * (4) turn off the idle-in-transaction, idle-session and
+ * idle-state-update timeouts if active. We do this before step (5) so
s/idle-state-/idle-stats-/
+ /*
+ * Some of the pending stats may have not been flushed due to lock
+ * contention. If we have such pending stats here, let the caller know
+ * the retry interval.
+ */
+ if (partial_flush)
+ {
I think it's better for a comment that is outside the block to say "If
some of the pending...". Or the comment should be inside the blocks.
The comment says "if" in the second sentence? But it's a bit awkward anyway,
rephrased to:
* If some of the pending stats could not be flushed due to lock
* contention, let the caller know when to retry.
+static void
+pgstat_build_snapshot(void)
+{
...
+ dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
+ while ((p = dshash_seq_next(&hstat)) != NULL)
+ {
...
+ entry->data = MemoryContextAlloc(pgStatLocal.snapshot.context,
...
+ }
+ dshash_seq_term(&hstat);
Doesn't allocation failure leave the shared hash table locked?
The shared table itself isn't - the error path does LWLockReleaseAll(). The
problem is the backend local dshash_table, specifically
find_[exclusively_]locked will stay set, and then cause assertion failures
when used next.
I think we need to fix that in dshash.c. We have code in released branches
that's vulnerable to this problem. E.g.
ensure_record_cache_typmod_slot_exists() in lookup_rowtype_tupdesc_internal().
See also
/messages/by-id/20220311012712.botrpsikaufzteyt@alap3.anarazel.de
Afaics the only real choice is to remove find_[exclusively_]locked and rely on
LWLockHeldByMeInMode() instead.
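I.e. something like this (dshash_assert_held is a hypothetical helper;
PARTITION_LOCK / PARTITION_FOR_HASH are dshash.c's existing internal
macros):

/* replaces the hash_table->find_[exclusively_]locked bookkeeping */
static inline void
dshash_assert_held(dshash_table *hash_table, dshash_hash hash,
				   bool exclusive)
{
	LWLock	   *lock = PARTITION_LOCK(hash_table, PARTITION_FOR_HASH(hash));

	Assert(LWLockHeldByMeInMode(lock,
								exclusive ? LW_EXCLUSIVE : LW_SHARED));
}

Since lwlock.c's own bookkeeping is cleared by LWLockReleaseAll() in the
error path, the stale-flag problem disappears with it.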
[PATCH v68 16/31] pgstat: add pg_stat_exists_stat() for easier testing.
pg_stat_exists_stat() is a weird name, ... would it be better as
pg_stat_object_exists()?
I was fighting with this one a bunch :). Earlier it was called
pg_stat_stats_exist() I think. "object" makes it sound a bit too much like
it's the database object?
Maybe pg_stat_have_stat()?
[PATCH v68 28/31] pgstat: update docs.
+ Determines the behaviour when cumulative statistics are accessed
AFAIK our manual is written in en_US, so s/behaviour/behavior/.
Fixed like 10 instances of this in the patchset. Not sure why I just can't
make myself type behavior.
+ memory. When set to <literal>cache</literal>, the first access to
+ statistics for an object caches those statistics until the end of the
+ transaction / until <function>pg_stat_clear_snapshot()</function> is
s|/|unless|
+ <literal>none</literal> is most suitable for monitoring solutions. If
I'd change "solutions" to "tools" or maybe "systems".
Done.
+ When using the accumulated statistics views and functions to
monitor collected data, it is important
Did you intend to write "accumulated" instead of "cumulative" here?
Not sure. I think I got bored of the word at some point :P
+ You can invoke <function>pg_stat_clear_snapshot</function>() to discard the
+ current transaction's statistics snapshot / cache (if any). The next use
I'd change s|/ cache|or cached values|. I think "/" like this is an informal
thing.
I think we have a few other uses of it. But anyway, changed.
Thanks!
Andres
Hi,
On 2022-04-03 21:15:16 -0700, Andres Freund wrote:
- collect who reviewed earlier revisions
I found reviews by
- Tomas Vondra <tomas.vondra@2ndquadrant.com>
- Arthur Zakirov <a.zakirov@postgrespro.ru>
- Antonin Houska <ah@cybertec.at>
There's also reviews by Fujii and Alvaro, but afaics just for parts that were
separately committed.
Greetings,
Andres Freund
On Sun, Apr 3, 2022 at 9:16 PM Andres Freund <andres@anarazel.de> wrote:
Please take a look!
I didn't take the time to fixup all the various odd typos in the general
code comments; none of them reduced comprehension appreciably. I may do so
when/if I do another pass.
I did skim over the entire patch set and, FWIW, found it to be quite
understandable. I don't have the experience to comment on the lower-level
details like locking and such but the medium picture stuff makes sense to
me both as a user and a developer. I did leave a couple of comments about
parts that at least piqued my interest (reset single stats) or seemed like
an undesirable restriction that was under addressed (before server shutdown
called exactly once).
I agree with Thomas's observation regarding PGSTAT_KIND_LAST. I also think
that leaving it starting at 1 makes sense - maybe just fix the name and
comment to better reflect its actual usage in core.
I concur also with changing usages of " / " to ", or"
My first encounter with pg_stat_exists_stat() didn't draw my attention as
being problematic so I'd say we just stick with it. As a SQL user reading:
WHERE exists (...) is somewhat natural; using "have" or back-to-back
stat_stat is less appealing.
I would suggest we do away with stats_fetch_consistency "snapshot" mode and
instead add a function that can be called that would accomplish the same
thing but in "cache" mode. Future iterations of that function could accept
patterns, allowing for something between "one" and "everything".
I'm also not an immediate fan of "fetch_consistency"; with the function
suggestion it is basically "cache" and "no-cache" so maybe:
stats_use_transaction_cache ? (haven't thought hard or long on this one...)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 22d0a1e491..e889c11d9e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2123,7 +2123,7 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss
11:34 0:00 postgres: ser
</row>
<row>
<entry><literal>PgStatsData</literal></entry>
- <entry>Waiting fo shared memory stats data access</entry>
+ <entry>Waiting for shared memory stats data access</entry>
</row>
<row>
<entry><literal>SerializableXactHash</literal></entry>
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2689d0962c..bc7bdf8064 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4469,7 +4469,7 @@ PostgresMain(const char *dbname, const char *username)
/*
* (4) turn off the idle-in-transaction, idle-session and
- * idle-state-update timeouts if active. We do this before step (5) so
+ * idle-stats-update timeouts if active. We do this before step (5) so
* that any last-moment timeout is certain to be detected in step (5).
*
* At most one of these timeouts will be active, so there's no need to
diff --git a/src/backend/utils/activity/pgstat.c
b/src/backend/utils/activity/pgstat.c
index dbd55a065d..370638b33b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -5,7 +5,7 @@
* Provides the infrastructure to collect and access cumulative statistics,
* e.g. per-table access statistics, of all backends in shared memory.
*
- * Most statistics updates are first first accumulated locally in each process
+ * Most statistics updates are first accumulated locally in each process
* as pending entries, then later flushed to shared memory (just after commit,
* or by idle-timeout).
*
@@ -371,7 +371,9 @@ pgstat_discard_stats(void)
/*
* pgstat_before_server_shutdown() needs to be called by exactly one process
* during regular server shutdowns. Otherwise all stats will be lost.
- *
+ * XXX: What bad things happen if this is invoked by more than one process?
+ * I'd presume stats are not actually lost in that case. Can we just 'no-op'
+ * subsequent calls and say "at least once at shutdown, as late as possible"
* We currently only write out stats for proc_exit(0). We might want to change
* that at some point... But right now pgstat_discard_stats() would be called
* during the start after a disorderly shutdown, anyway.
@@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid objoid)
Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+ /*
+ * More of a conceptual observation here - the fact that something is
+ * fixed does not imply that it is not fixed at a value greater than zero
+ * and thus could have single subentries that could be addressed.
+ * I also am unsure, off the top of my head, whether both replication
+ * slots and subscriptions, which are fixed, can be reset singly (today,
+ * and/or whether this patch enables that capability)
+ */
+
/* Set the reset timestamp for the whole database */
pgstat_reset_database_timestamp(MyDatabaseId, ts);
David J.
Hi,
On 2022-04-04 13:45:40 -0700, David G. Johnston wrote:
I didn't take the time to fixup all the various odd typos in the general
code comments; none of them reduced comprehension appreciably. I may do so
when/if I do another pass.
Cool.
My first encounter with pg_stat_exists_stat() didn't draw my attention as
being problematic so I'd say we just stick with it. As a SQL user reading:
WHERE exists (...) is somewhat natural; using "have" or back-to-back
stat_stat is less appealing.
There are a number of other *_exists functions, albeit not within
pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()?
I would suggest we do away with stats_fetch_consistency "snapshot" mode and
instead add a function that can be called that would accomplish the same
thing but in "cache" mode. Future iterations of that function could accept
patterns, allowing for something between "one" and "everything".
I don't want to do that. We had a lot of discussion around what consistency
model we want, and Tom was adamant that there needs to be a mode that behaves
like the current consistency model (which is what snapshot behaves like, with
very minor differences). A way to get back to the old behaviour seems good,
and the function idea doesn't provide that.
(merged the typos that I hadn't already fixed based on Justin / Thomas'
feedback)
@@ -371,7 +371,9 @@ pgstat_discard_stats(void)
/*
* pgstat_before_server_shutdown() needs to be called by exactly one process
* during regular server shutdowns. Otherwise all stats will be lost.
- *
+ * XXX: What bad things happen if this is invoked by more than one process?
+ * I'd presume stats are not actually lost in that case. Can we just 'no-op'
+ * subsequent calls and say "at least once at shutdown, as late as possible"
What's the reason behind this question? There really shouldn't be a second
call (and there's only a single callsite). As is you'd get an assertion
failure about things already having been shutdown.
I don't think we want to relax that, because in all the realistic scenarios
that I can think of that'd open us up to losing stats that were generated
after the first writeout of the stats data.
You mentioned this as a restriction above - I'm not seeing it as such? I'd
like to write out stats more often in the future (e.g. in the checkpointer),
but then it'd not be written out with this function...
@@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid objoid)
Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+ /*
+ * More of a conceptual observation here - the fact that something is
+ * fixed does not imply that it is not fixed at a value greater than zero
+ * and thus could have single subentries that could be addressed.
pgstat_reset_single_counter() is a pre-existing function (with a pre-existing
name, but adapted signature in the patch), it's currently only used for
functions and relation stats.
+ * I also am unsure, off the top of my head, whether both replication
+ * slots and subscriptions, which are fixed, can be reset singly (today,
+ * and/or whether this patch enables that capability)
+ */
FWIW, neither are implemented as fixed amount stats. There's afaics no limit
at all for the number of existing subscriptions (although some would either
need to be disabled or you'd get errors). While there is a limit on the number
of slots, that's a configurable limit. So replication slot stats are also
implemented as variable amount stats (that used to be different, wasn't nice).
There's one example of fixed amount stats that can be reset more granularly,
namely slru. That can be done via pg_stat_reset_slru().
Thanks,
Andres
On Mon, Apr 4, 2022 at 2:06 PM Andres Freund <andres@anarazel.de> wrote:
My first encounter with pg_stat_exists_stat() didn't draw my attention as
being problematic so I'd say we just stick with it. As a SQL user reading:
WHERE exists (...) is somewhat natural; using "have" or back-to-back
stat_stat is less appealing.
There are a number of other *_exists functions, albeit not within
pg_stat_*. Like jsonb_exists. Perhaps just pg_stat_exists()?
Works for me.
A way to get back to the old behaviour seems good,
and the function idea doesn't provide that.
Makes sense.
(merged the typos that I hadn't already fixed based on Justin / Thomas'
feedback)
@@ -371,7 +371,9 @@ pgstat_discard_stats(void)
/*
* pgstat_before_server_shutdown() needs to be called by exactly one process
* during regular server shutdowns. Otherwise all stats will be lost.
- *
+ * XXX: What bad things happen if this is invoked by more than one process?
+ * I'd presume stats are not actually lost in that case. Can we just 'no-op'
+ * subsequent calls and say "at least once at shutdown, as late as possible"
What's the reason behind this question? There really shouldn't be a second
call (and there's only a single callsite). As is you'd get an assertion
failure about things already having been shutdown.
Mostly OCD I guess, "exactly one" has two failure modes - zero, and > 1;
and the "Otherwise" only covers the zero mode.
I don't think we want to relax that, because in all the realistic scenarios
that I can think of that'd open us up to losing stats that were generated
after the first writeout of the stats data.
You mentioned this as a restriction above - I'm not seeing it as such? I'd
like to write out stats more often in the future (e.g. in the
checkpointer),
but then it'd not be written out with this function...
Yeah, the idea only really works if you can implement "last one out, shut
off the lights". I think I was subconsciously wanting this to work that
way, but the existing process is good.
@@ -654,6 +656,14 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid objoid)
Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+ /*
+ * More of a conceptual observation here - the fact that something is
+ * fixed does not imply that it is not fixed at a value greater than zero
+ * and thus could have single subentries that could be addressed.
pgstat_reset_single_counter() is a pre-existing function (with a pre-existing
name, but adapted signature in the patch), it's currently only used for
functions and relation stats.
+ * I also am unsure, off the top of my head, whether both replication
+ * slots and subscriptions, which are fixed, can be reset singly (today,
+ * and/or whether this patch enables that capability)
+ */
FWIW, neither are implemented as fixed amount stats.
That was a typo, I meant to write variable. My point was that of these 5
kinds that will pass the assertion test only 2 of them are actually handled
by the function today.
+ PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+ PGSTAT_KIND_REPLSLOT, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
There's one example of fixed amount stats that can be reset more
granularly,
namely slru. That can be done via pg_stat_reset_slru().
Right, hence the conceptual disconnect. It doesn't affect the
implementation, everything is working just fine, but is something to ponder
for future maintainers getting up to speed here.
As the existing function only handles functions and relations why not just
perform a specific Kind check for them? Generalizing to assert on whether
or not the function works on fixed or variable Kinds seems beyond its
present state. Or could it be used, as-is, for databases, replication
slots, and subscriptions today, and we just haven't migrated those areas to
use the now generalized function? Even then, unless we do expand the
definition of the this publicly facing function is seems better to
precisely define what it requires as an input Kind by checking for RELATION
or FUNCTION specifically.
David J.
Hi,
On 2022-04-04 14:25:57 -0700, David G. Johnston wrote:
You mentioned this as a restriction above - I'm not seeing it as such? I'd
like to write out stats more often in the future (e.g. in the checkpointer),
but then it'd not be written out with this function...
Yeah, the idea only really works if you can implement "last one out, shut
off the lights". I think I was subconsciously wanting this to work that
way, but the existing process is good.
Preserving stats more than we do today (the patch doesn't really affect that)
will require a good chunk more work. My idea for it is that we'd write the
file out as part of a checkpoint / restartpoint, with a name including the
redo-lsn. Then when recovery starts, it can use the stats file associated with
that to start from. Then we'd lose at most 1 checkpoint's worth of stats
during a crash, not more.
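E.g. (the naming scheme here is entirely made up):

/* at checkpoint time, serialize under a name tied to the redo pointer */
static void
pgstat_write_statsfile_at_redo(XLogRecPtr redo)		/* hypothetical */
{
	char		path[MAXPGPATH];

	snprintf(path, MAXPGPATH, "%s/pgstat.%X-%X",
			 PGSTAT_STAT_PERMANENT_DIRECTORY,
			 LSN_FORMAT_ARGS(redo));
	/* ... write the same serialized format we use at shutdown ... */
}

Recovery would then load the file matching ControlFile->checkPointCopy.redo,
discarding stats only if no such file exists.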
There's a few non-trivial corner cases to solve, around stats objects getting
dropped concurrently with creating that serialized snapshot. Solvable, but not
trivial.
+ * I also am unsure, off the top of my head, whether both replication
+ * slots and subscriptions, which are fixed, can be reset singly (today,
+ * and/or whether this patch enables that capability)
+ */
FWIW, neither are implemented as fixed amount stats.
That was a typo, I meant to write variable. My point was that of these 5
kinds that will pass the assertion test only 2 of them are actually handled
by the function today.
+ PGSTAT_KIND_DATABASE = 1, /* database-wide statistics */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+ PGSTAT_KIND_REPLSLOT, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
As the existing function only handles functions and relations why not just
perform a specific Kind check for them? Generalizing to assert on whether
or not the function works on fixed or variable Kinds seems beyond its
present state. Or could it be used, as-is, for databases, replication
slots, and subscriptions today, and we just haven't migrated those areas to
use the now generalized function?
It couldn't quite be used for those, because it really only makes sense for
objects "within a database", because it wants to reset the timestamp of the
pg_stat_database row too (I don't like that behaviour as-is, but that's the
topic of another thread as you know...).
It will work for other per-database stats though, once we have them.
Even then, unless we do expand the
definition of this publicly facing function it seems better to
precisely define what it requires as an input Kind by checking for RELATION
or FUNCTION specifically.
I don't see a benefit in adding a restriction on it that we'd just have to
lift again?
How about adding a
Assert(!pgstat_kind_info_for(kind)->accessed_across_databases)
and extending the function comment to say that it's used for per-database
stats and that it resets both the passed-in stats object as well as
pg_stat_database?
Greetings,
Andres Freund
On Mon, Apr 4, 2022 at 2:54 PM Andres Freund <andres@anarazel.de> wrote:
As the existing function only handles functions and relations why not just
perform a specific Kind check for them? Generalizing to assert on whether
or not the function works on fixed or variable Kinds seems beyond its
present state. Or could it be used, as-is, for databases, replication
slots, and subscriptions today, and we just haven't migrated those areas to
use the now generalized function?
It couldn't quite be used for those, because it really only makes sense for
objects "within a database", because it wants to reset the timestamp of the
pg_stat_database row too (I don't like that behaviour as-is, but that's the
topic of another thread as you know...).
It will work for other per-database stats though, once we have them.
Even then, unless we do expand the
definition of this publicly facing function it seems better to
precisely define what it requires as an input Kind by checking for RELATION
or FUNCTION specifically.
I don't see a benefit in adding a restriction on it that we'd just have to
lift again?
How about adding a
Assert(!pgstat_kind_info_for(kind)->accessed_across_databases)
and extending the function comment to say that it's used for per-database
stats and that it resets both the passed-in stats object as well as
pg_stat_database?
I could live with adding that, but...
Replacing the existing assert(!kind->fixed_amount) with
assert(!kind->accessed_across_databases) produces the same result, as the
latter presently implies the former.
Now I start to dislike the behavioral aspect of the attribute and would
rather just name it: kind->is_cluster_scoped (or something else that is
descriptive of the stat category itself, not how it is used)
Then reorganize the Kind documentation to note and emphasize these two
primary descriptors:
variable, which can be cluster or database scoped
fixed, which are cluster scoped by definition (if this is true...but given
this is an optimization category I'm thinking maybe it doesn't actually
matter...)
+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these
should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,
David J.
Hi,
On 2022-04-04 15:24:24 -0700, David G. Johnston wrote:
Replacing the existing assert(!kind->fixed_amount) with
assert(!kind->accessed_across_databases) produces the same result, as the
latter presently implies the former.
I wasn't proposing to replace, but to add...
Now I start to dislike the behavioral aspect of the attribute and would
rather just name it: kind->is_cluster_scoped (or something else that is
descriptive of the stat category itself, not how it is used)
I'm not in love with the name either. But cluster is just a badly overloaded
word :(.
system_wide? Or invert it and say: database_scoped? I think I like the latter.
Then reorganize the Kind documentation to note and emphasize these two
primary descriptors:
variable, which can be cluster or database scoped
fixed, which are cluster scoped by definition
Hm. There's not actually that much difference between cluster/non-cluster wide
scope for most of the system. I'm not strongly against, but I'm also not
really seeing the benefit.
(if this is true...but given this is an optimization category I'm thinking
maybe it doesn't actually matter...)
It is true. Not sure what you mean with "optimization category"?
Greetings,
Andres Freund
Hi,
On 2022-03-23 10:42:03 -0700, Andres Freund wrote:
On 2022-03-23 17:27:50 +0900, Kyotaro Horiguchi wrote:
+ /*
+ * When starting with crash recovery, reset pgstat data - it might not be
+ * valid. Otherwise restore pgstat data. It's safe to do this here,
+ * because postmaster will not yet have started any other processes
+ *
+ * TODO: With a bit of extra work we could just start with a pgstat file
+ * associated with the checkpoint redo location we're starting from.
+ */
+ if (ControlFile->state == DB_SHUTDOWNED ||
+ ControlFile->state == DB_SHUTDOWNED_IN_RECOVERY)
+ pgstat_restore_stats();
+ else
+ pgstat_discard_stats();
+
Before there, InitWalRecovery changes the state to
DB_IN_ARCHIVE_RECOVERY if it was either DB_SHUTDOWNED or
DB_IN_PRODUCTION. So the stats seem to always be discarded on a standby.
Hm. I thought it worked at some point. I guess there's a reason this commit is
a separate commit marked WIP ;)
FWIW, it had gotten broken by
commit be1c00ab13a7c2c9299d60cb5a9d285c40e2506c
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: 2022-02-16 09:22:44 +0200
Move code around in StartupXLOG().
because that moved the spot where
ControlFile->state = DB_IN_CRASH_RECOVERY
is set to an earlier location.
Greetings,
Andres Freund
On Mon, Apr 4, 2022 at 3:44 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-04-04 15:24:24 -0700, David G. Johnston wrote:
Replacing the existing assert(!kind->fixed_amount) with
assert(!kind->accessed_across_databases) produces the same result, as the
latter presently implies the former.
I wasn't proposing to replace, but to add...
Right, but it seems redundant to have both when one implies the other. But
I'm not hard set against it either, though my idea below makes them both
obsolete.
Now I start to dislike the behavioral aspect of the attribute and would
rather just name it: kind->is_cluster_scoped (or something else that is
descriptive of the stat category itself, not how it is used)
I'm not in love with the name either. But cluster is just a badly overloaded
word :(.
system_wide? Or invert it and say: database_scoped? I think I like the latter.
I like database_scoped as well...but see my idea below that makes this
obsolete.
Then reorganize the Kind documentation to note and emphasize these two
primary descriptors:
variable, which can be cluster or database scoped
fixed, which are cluster scoped by definition
Hm. There's not actually that much difference between cluster/non-cluster wide
scope for most of the system. I'm not strongly against, but I'm also not
really seeing the benefit.
Not married to it myself, something to come back to when the dust settles.
(if this is true...but given this is an optimization category I'm thinking
maybe it doesn't actually matter...)
It is true. Not sure what you mean with "optimization category"?
I mean that distinguishing between stats that are fixed and those that are
variable implies that fixed kinds have a better performance (speed, memory)
characteristic than variable kinds (at least in part due to the presence of
changecount). If fixed kinds did not have a performance benefit then
having the variable kind implementation simply handle fixed kinds as well
(using the common struct header and storage in a hash table) would make the
implementation simpler since all statistics would report through the same
API. In that world, variability is simply a possibility that not every
actual reporter has to use. That improved performance characteristic is
what I meant by "optimization category". I question whether we should be
publishing "fixed" and "variable" as concrete properties. I'm not
presently against the current choice to do so, but as you say above, I'm
also not really seeing the benefit.
(goes and looks at all the places that use the fixed_amount
field...sparking an idea)
Coming back to this:
"""
+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these
should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,
"""
I see three "KIND_GROUP" categories here:
PGSTAT_KIND_CLUSTER (open to a different word here though...)
PGSTAT_KIND_DATABASE (we seem to agree on this above)
PGSTAT_KIND_GLOBAL (already used in the code)
This single enum can replace the two booleans that, in combination, would
define 4 unique groups (of which only three are interesting -
database+fixed doesn't seem interesting and so is not given a name/value
here).
While the succinctness of the booleans has appeal, the need for half of the
booleans to end up being negated quickly tarnishes it. With the three
groups, every assertion is positive in nature indicating which of the three
groups are handled by the function. While that is probably a few more
characters it seems like an easier read and is less complicated as it has
fewer independent parts. At most you OR two kinds together which is
succinct enough I would think. There are no gaps relative to the existing
implementation that defines fixed_amount and accessed_across_databases -
every call site using either of them can be transformed mechanically.
David J.
Hi,
On 2022-04-04 19:03:13 -0700, David G. Johnston wrote:
(if this is true...but given this is an optimization category I'm thinking
maybe it doesn't actually matter...)
It is true. Not sure what you mean with "optimization category"?
I mean that distinguishing between stats that are fixed and those that are
variable implies that fixed kinds have a better performance (speed, memory)
characteristic than variable kinds (at least in part due to the presence of
changecount). If fixed kinds did not have a performance benefit then
having the variable kind implementation simply handle fixed kinds as well
(using the common struct header and storage in a hash table) would make the
implementation simpler since all statistics would report through the same
API.
Yes, fixed-numbered stats are faster.
Coming back to this:
"""
+ /* cluster-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_REPLSLOT = 1, /* per-slot statistics */
+ PGSTAT_KIND_SUBSCRIPTION, /* per-subscription statistics */
+ PGSTAT_KIND_DATABASE, /* database-wide statistics */ (I moved this to 3rd
spot to be closer to the database-scoped options)
+
+ /* database-scoped object stats having a variable number of entries */
+ PGSTAT_KIND_RELATION, /* per-table statistics */
+ PGSTAT_KIND_FUNCTION, /* per-function statistics */
+
+ /* cluster-scoped stats having a fixed number of entries */ (maybe these
should go first, the variable following?)
+ PGSTAT_KIND_ARCHIVER,
+ PGSTAT_KIND_BGWRITER,
+ PGSTAT_KIND_CHECKPOINTER,
+ PGSTAT_KIND_SLRU,
+ PGSTAT_KIND_WAL,
"""
I see three "KIND_GROUP" categories here:
PGSTAT_KIND_CLUSTER (open to a different word here though...)
PGSTAT_KIND_DATABASE (we seem to agree on this above)
PGSTAT_KIND_GLOBAL (already used in the code)
This single enum can replace the two booleans that, in combination, would
define 4 unique groups (of which only three are interesting -
database+fixed doesn't seem interesting and so is not given a name/value
here).
The more I think about it, the less I think a split like that makes sense. The
difference between PGSTAT_KIND_CLUSTER / PGSTAT_KIND_DATABASE is tiny. Nearly
all code just deals with both together.
I think all this is going to achieve is making the code more complicated. There
is a *single* non-assert use of accessed_across_databases and now a single
assertion involving it.
What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
Greetings,
Andres Freund
Hi,
Thanks for the reviews Justin, Thomas, David. I tried to incorporate the
feedback, with the exception of the ongoing discussion around
accessed_across_databases. I've also not renamed pg_stat_exists_stat() yet,
not clear who likes what :)
Changes in v69:
- merged feedback
- committed the first few commits, mostly pretty boring stuff
- added an architecture overview comment to the top of pgstat.c - not sure if
it makes sense to anybody but me (and perhaps Horiguchi-san)?
- merged "only reset pgstat data after crash recovery." into the main commit,
added tests verifying the behaviour of not resetting stats on a standby when
in SHUTDOWNED_IN_RECOVERY.
- drop variable-amount stats when loading on-disk file fails partway through,
I'd raised this earlier in [1]
- made most pgstat_report_stat() calls pass force = true. In worker.c, the
only possibly frequent caller, I instead added a pgstat_report_stat(true) to
the idle path.
- added a handful more tests, but mostly out of "test coverage vanity" ;)
- made the test output of 030_stats_cleanup_replica a bit more informative,
plus other minor cleanups
The one definite TODO I know of is
- fix the bug around pgstat_report_stat() I wrote about at [3]
[3] /messages/by-id/20220402081648.kbapqdxi2rr3ha3w@alap3.anarazel.de
I'd hoped Horiguchi-san would chime in on that discussion...
Regards,
Andres
[1]: /messages/by-id/20220329191727.mzzwbl7udhpq7pmf@alap3.anarazel.de
Attachments:
On Mon, Apr 4, 2022 at 7:36 PM Andres Freund <andres@anarazel.de> wrote:
I think all this is going to achieve is making the code more complicated.
There is a *single* non-assert use of accessed_across_databases and now a
single assertion involving it.

What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
So, I decided to see what this would look like; the results are attached,
portions of it also inlined below.
I'll admit this does introduce a terminology problem - but IMO these words
are much more meaningful to the reader and code than the existing
booleans. I'm hopeful we can bikeshed something agreeable as I'm strongly
in favor of making this change.
The ability to create defines for subsets nicely resolves the problem that
CLUSTER and DATABASE (now OBJECT to avoid DATABASE conflict in PgStat_Kind)
are generally related together - they are now grouped under the DYNAMIC
label (variable, if you want) while all of the fixed entries get associated
with GLOBAL. Thus the majority of usages, since accessed_across_databases
is rare, end up being either DYNAMIC or GLOBAL. The presence of any other
category should give one pause. We could add an ALL define if we ever
decide to consolidate the API - but for now it's largely used to ensure
that stats of one type don't get processed by the other. The boolean fixed
does that well enough but this just seems much cleaner and more
understandable to me. Though having made up the terms and model myself,
that isn't too surprising.
The only existing usage of accessed_across_databases is in the negative
form, which translates to excluding objects, but only those from other
actual databases.
@@ -909,7 +904,7 @@ pgstat_build_snapshot(void)
*/
if (p->key.dboid != MyDatabaseId &&
p->key.dboid != InvalidOid &&
- !kind_info->accessed_across_databases)
+ kind_info->kind_group == PGSTAT_OBJECT)
continue;
The only other usage of something other than GLOBAL or DYNAMIC is the
restriction on the behavior of reset_single_counter, which also has to be
an object in the current database (the latter condition being enforced by
the presence of a valid object oid, I presume). The replacement for this
below is not behavior-preserving, though I believe we agree the proposed
behavior is correct.
@@ -652,7 +647,7 @@ pgstat_reset_single_counter(PgStat_Kind kind, Oid
objoid)
- Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+ Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_OBJECT);
Everything else is a straight conversion of !fixed_amount to CLUSTER+OBJECT (DYNAMIC)
@@ -728,7 +723,7 @@ pgstat_fetch_entry(PgStat_Kind kind, Oid dboid, Oid
objoid)
- AssertArg(!kind_info->fixed_amount);
+ AssertArg(kind_info->kind_group == PGSTAT_DYNAMIC);
and fixed_amount to GLOBAL
@@ -825,7 +820,7 @@ pgstat_get_stat_snapshot_timestamp(bool *have_snapshot)
bool
pgstat_exists_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
{
- if (pgstat_kind_info_for(kind)->fixed_amount)
+ if (pgstat_kind_info_for(kind)->kind_group == PGSTAT_GLOBAL)
return true;
return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;
David J.
Attachments:
Hi,
On 2022-04-05 08:49:36 -0700, David G. Johnston wrote:
On Mon, Apr 4, 2022 at 7:36 PM Andres Freund <andres@anarazel.de> wrote:
I think all this is going to achieve is to making code more complicated.
There
is a *single* non-assert use of accessed_across_databases and now a single
assertion involving it.What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
So, I decided to see what this would look like; the results are attached,
portions of it also inlined below.
I'll admit this does introduce a terminology problem - but IMO these words
are much more meaningful to the reader and code than the existing
booleans. I'm hopeful we can bikeshed something agreeable as I'm strongly
in favor of making this change.
Sorry, I just don't agree. I'm happy to try to make it look better, but this
isn't it.
Do you think it should be your way strongly enough that you'd not want to get
it in the current way?
The ability to create defines for subsets nicely resolves the problem that
CLUSTER and DATABASE (now OBJECT to avoid DATABASE conflict in PgStat_Kind)
are generally related together - they are now grouped under the DYNAMIC
label (variable, if you want) while all of the fixed entries get associated
with GLOBAL. Thus the majority of usages, since accessed_across_databases
is rare, end up being either DYNAMIC or GLOBAL.
FWIW, as-is DYNAMIC isn't correct:
+typedef enum PgStat_KindGroup
+{
+	PGSTAT_GLOBAL = 1,
+	PGSTAT_CLUSTER,
+	PGSTAT_OBJECT
+} PgStat_KindGroup;
+
+#define PGSTAT_DYNAMIC (PGSTAT_CLUSTER | PGSTAT_OBJECT)
Oring PGSTAT_CLUSTER = 2 with PGSTAT_OBJECT = 3 yields 3 again. To do this
kind of thing the different values need to have power-of-two values, and then
the tests need to be done with &.
Nicely demonstrated by the fact that with the patch applied initdb doesn't
pass...
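For illustration, a corrected sketch along those lines, with power-of-two
values so the OR'd subset define works (a sketch only, not committed code):

typedef enum PgStat_KindGroup
{
	PGSTAT_GLOBAL = 1 << 0,
	PGSTAT_CLUSTER = 1 << 1,
	PGSTAT_OBJECT = 1 << 2
} PgStat_KindGroup;

#define PGSTAT_DYNAMIC (PGSTAT_CLUSTER | PGSTAT_OBJECT)

/* membership tests then need &, e.g.: */
Assert(kind_info->kind_group & PGSTAT_DYNAMIC);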
@@ -909,7 +904,7 @@ pgstat_build_snapshot(void)
 		 */
 		if (p->key.dboid != MyDatabaseId &&
 			p->key.dboid != InvalidOid &&
-			!kind_info->accessed_across_databases)
+			kind_info->kind_group == PGSTAT_OBJECT)
 			continue;

 		if (p->dropped)
Imo this is far harder to interpret - !kind_info->accessed_across_databases
tells you why we're skipping in clear code. Your alternative doesn't.
@@ -938,7 +933,7 @@ pgstat_build_snapshot(void)
 	{
 		const PgStat_KindInfo *kind_info = pgstat_kind_info_for(kind);

-		if (!kind_info->fixed_amount)
+		if (kind_info->kind_group == PGSTAT_DYNAMIC)
These all would have to be kind_info->kind_group & PGSTAT_DYNAMIC, or even
(kind_group & PGSTAT_DYNAMIC) != 0, depending on the case.
@@ -1047,8 +1042,8 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
 	void	   *pending_data = entry_ref->pending;

 	Assert(pending_data != NULL);
-	/* !fixed_amount stats should be handled explicitly */
-	Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+	/* global stats should be handled explicitly : why? */
+	Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_DYNAMIC);
The pending data infrastructure doesn't provide a way of dealing with fixed
amount stats, and there's no PgStat_EntryRef for them (since they're not in
the hashtable).
Greetings,
Andres Freund
On Tuesday, April 5, 2022, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-04-05 08:49:36 -0700, David G. Johnston wrote:
On Mon, Apr 4, 2022 at 7:36 PM Andres Freund <andres@anarazel.de> wrote:
I think all this is going to achieve is making the code more complicated.
There is a *single* non-assert use of accessed_across_databases and now a
single assertion involving it.

What would having PGSTAT_KIND_CLUSTER and PGSTAT_KIND_DATABASE achieve?
So, I decided to see what this would look like; the results are attached,
portions of it also inlined below.

I'll admit this does introduce a terminology problem - but IMO these words
are much more meaningful to the reader and code than the existing
booleans. I'm hopeful we can bikeshed something agreeable as I'm strongly
in favor of making this change.
Sorry, I just don't agree. I'm happy to try to make it look better, but
this isn't it.

Do you think it should be your way strongly enough that you'd not want to
get it in the current way?
Not that strongly; I’m good with the code as-is. It’s not pervasive enough
to be hard to understand (I may ponder some code comments though), and the
system it is modeling has some legacy aspects that are more the root
problem, and I don’t want to touch those here for sure.
Oring PGSTAT_CLUSTER = 2 with PGSTAT_OBJECT = 3 yields 3 again. To do this
kind of thing the different values need to have power-of-two values, and
then
the tests need to be done with &.
Thanks.
Nicely demonstrated by the fact that with the patch applied initdb doesn't
pass...
Yeah, I compiled but tried to run the tests and learned I still need to
figure out my setup for make check; then I forgot to make install…
It served its purpose at least.
@@ -1047,8 +1042,8 @@ pgstat_delete_pending_entry(PgStat_EntryRef *entry_ref)
 	void	   *pending_data = entry_ref->pending;

 	Assert(pending_data != NULL);
-	/* !fixed_amount stats should be handled explicitly */
-	Assert(!pgstat_kind_info_for(kind)->fixed_amount);
+	/* global stats should be handled explicitly : why? */
+	Assert(pgstat_kind_info_for(kind)->kind_group == PGSTAT_DYNAMIC);

The pending data infrastructure doesn't provide a way of dealing with fixed
amount stats, and there's no PgStat_EntryRef for them (since they're not in
the hashtable).
Thanks.
David J.
Hi,
On 2022-04-02 01:16:48 -0700, Andres Freund wrote:
I just noticed that the code doesn't appear to actually work like that right
now. Whenever the timeout is reached, pgstat_report_stat() is called with
force = true.

And even if the backend is busy running queries, once there's contention, the
next invocation of pgstat_report_stat() will return the timeout relative to
pending_since, which then will trigger a force report via a very short timeout
soon.

It might actually make sense to only ever return PGSTAT_RETRY_MIN_INTERVAL
(with a slightly different name) from pgstat_report_stat() when blocked
(limiting the max reporting delay for an idle connection) and to continue
calling pgstat_report_stat(force = true). But to only trigger force
"internally" in pgstat_report_stat() when PGSTAT_MAX_INTERVAL is reached.

I think that'd mean we'd report after at most PGSTAT_RETRY_MIN_INTERVAL in an
idle connection, and try reporting every PGSTAT_RETRY_MIN_INTERVAL (increasing
up to PGSTAT_MAX_INTERVAL when blocked) on busy connections.

Makes sense?
I tried to come up with a workload producing a *lot* of stats (multiple
function calls within a transaction, multiple transactions pipelined) and ran
it with 1000 clients (on a machine with 2 x (10 cores / 20 threads)). To
reduce overhead I set
default_transaction_isolation=repeatable read
track_activities=false
MVCC Snapshot acquisition is the clear bottleneck otherwise, followed by
pgstat_report_activity() (which, as confusing as it may sound, is independent
of this patch).
I do see a *small* amount of contention if I lower PGSTAT_MIN_INTERVAL to
1ms. Too small to ever be captured in pg_stat_activity.wait_event, but just
about visible in a profiler.
Which leads me to conclude we can simplify the logic significantly. Here's my
current comment explaining the logic:
* Unless called with 'force', pending stats updates are flushed at most once
* per PGSTAT_MIN_INTERVAL (1000ms). When not forced, stats flushes do not
* block on lock acquisition, except if stats updates have been pending for
* longer than PGSTAT_MAX_INTERVAL (60000ms).
*
* Whenever pending stats updates remain at the end of pgstat_report_stat() a
* suggested idle timeout is returned. Currently this is always
* PGSTAT_IDLE_INTERVAL (10000ms). Callers can use the returned time to set up
* a timeout after which to call pgstat_report_stat(true), but are not
* required to do so.
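To illustrate the intended calling convention, a hypothetical caller
sketched from the comment above (not code from the patch):

for (;;)
{
	long	idle_ms;

	/* ... do work that accumulates pending stats ... */

	/* opportunistic flush; doesn't block while unforced */
	idle_ms = pgstat_report_stat(false);

	if (idle_ms > 0)
	{
		/*
		 * Pending stats remain: arm a timer so an idle backend
		 * eventually calls pgstat_report_stat(true) to force a flush.
		 */
	}
}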
Comments?
Greetings,
Andres Freund
On Mon, Apr 4, 2022 at 8:05 PM Andres Freund <andres@anarazel.de> wrote:
- added an architecture overview comment to the top of pgstat.c - not sure
if
it makes sense to anybody but me (and perhaps Horiguchi-san)?
I took a look at this, diff attached. Some typos and minor style stuff,
plus trying to bring a bit more detail to the caching mechanism. I may
have gotten it wrong in adding more detail though.
+ * read-only, backend-local, transaction-scoped, hashtable (pgStatEntryRefHash)
+ * in front of the shared hashtable, containing references (PgStat_EntryRef)
+ * to shared hashtable entries. The shared hashtable thus only needs to be
+ * accessed when the PgStat_HashKey is not present in the backend-local
+ * hashtable, or if stats_fetch_consistency = 'none'.
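A sketch of the lookup order that comment describes; pgStatEntryRefHash and
PgStat_EntryRef are from the patch, while the helper functions here are made
up for illustration:

static PgStat_EntryRef *
get_entry_ref_sketch(PgStat_HashKey key)
{
	PgStat_EntryRef *ref;

	/* 1. consult the backend-local cache first */
	ref = local_cache_lookup(pgStatEntryRefHash, key);	/* hypothetical */
	if (ref != NULL)
		return ref;

	/* 2. cache miss: probe the shared dshash and remember the reference */
	ref = shared_hash_lookup(key);						/* hypothetical */
	local_cache_insert(pgStatEntryRefHash, key, ref);	/* hypothetical */
	return ref;
}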
I'm under the impression, but didn't try to confirm, that the pending
updates don't use the caching mechanism, but rather add to the shared
queue, and so the cache is effectively read-only. It is also
transaction-scoped based upon the GUC and the nature of stats vis-a-vis
transactions.
Even before I added the read-only and transaction-scoped I got a bit hung
up on reading:
"The shared hashtable only needs to be accessed when no prior reference to
the shared hashtable exists."
Thinking in terms of key seems to make more sense than value in this
sentence - even if there is a one-to-one correspondence.
The code comment about having per-kind definitions in pgstat.c being
annoying is probably sufficient but it does seem like a valid comment to
leave in the architecture as well. Having them in both places seems OK.
I am wondering why there are no mentions of the header files in this
architecture, only the .c files.
David J.
Attachments:
Hi,
On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
On Mon, Apr 4, 2022 at 8:05 PM Andres Freund <andres@anarazel.de> wrote:
- added an architecture overview comment to the top of pgstat.c - not sure if
it makes sense to anybody but me (and perhaps Horiguchi-san)?

I took a look at this, diff attached.
Thanks!
Some typos and minor style stuff, plus trying to bring a bit more detail to
the caching mechanism. I may have gotten it wrong in adding more detail
though.

+ * read-only, backend-local, transaction-scoped, hashtable (pgStatEntryRefHash)
+ * in front of the shared hashtable, containing references (PgStat_EntryRef)
+ * to shared hashtable entries. The shared hashtable thus only needs to be
+ * accessed when the PgStat_HashKey is not present in the backend-local
+ * hashtable, or if stats_fetch_consistency = 'none'.

I'm under the impression, but didn't try to confirm, that the pending
updates don't use the caching mechanism
They do.
, but rather add to the shared queue
Queue? Maybe you mean the hashtable?
, and so the cache is effectively read-only. It is also transaction-scoped
based upon the GUC and the nature of stats vis-a-vis transactions.
No, that's not right. I think you might be thinking of
pgStatLocal.snapshot.stats?
I guess I should add a paragraph about snapshots / fetch consistency.
Even before I added the read-only and transaction-scoped I got a bit hung
up on reading:
"The shared hashtable only needs to be accessed when no prior reference to
the shared hashtable exists."
Thinking in terms of key seems to make more sense than value in this
sentence - even if there is a one-to-one correspondence.
Maybe "prior reference to the shared hashtable exists for the key"?
I am wondering why there are no mentions to the header files in this
architecture, only the .c files.
Hm, I guess, but I'm not sure it'd add a lot? It's really just intended to
give a starting point (and it can't be worse than the explanation of the current
system).
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index bfbfe53deb..504f952c0e 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -4,9 +4,9 @@
 *
 *
 * PgStat_KindInfo describes the different types of statistics handled. Some
- * kinds of statistics are collected for fixed number of objects
- * (e.g. checkpointer statistics). Other kinds are statistics are collected
- * for variable-numbered objects (e.g. relations).
+ * kinds of statistics are collected for a fixed number of objects
+ * (e.g., checkpointer statistics). Other kinds of statistics are collected
Was that comma after e.g. intentional?
Applied the rest.
+ * for a varying number of objects (e.g., relations).

 * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
 *
@@ -19,19 +19,21 @@
 *
 * All variable-numbered stats are addressed by PgStat_HashKey while running.
 * It is not possible to have statistics for an object that cannot be
- * addressed that way at runtime. A wider identifier can be used when
+ * addressed that way at runtime. An alternate identifier can be used when
 * serializing to disk (used for replication slot stats).
Not sure this improves things.
 * The names for structs stored in shared memory are prefixed with
 * PgStatShared instead of PgStat.
@@ -53,15 +55,16 @@
 * entry in pgstat_kind_infos, see PgStat_KindInfo for details.
 *
 *
- * To keep things manageable stats handling is split across several
+ * To keep things manageable, stats handling is split across several
Done.
 * files. Infrastructure pieces are in:
- * - pgstat.c - this file, to tie it all together
+ * - pgstat.c - this file, which ties everything together
I liked that :)
- * Each statistics kind is handled in a dedicated file:
+ * Each statistics kind is handled in a dedicated file, though their structs
+ * are defined here for lack of better ideas.
-0.5
Greetings,
Andres Freund
On Tue, Apr 5, 2022 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
, but rather add to the shared queue
Queue? Maybe you mean the hashtable?
Queue implemented by a list...? Anyway, I think I mean this:
/*
* List of PgStat_EntryRefs with unflushed pending stats.
*
* Newly pending entries should only ever be added to the end of the list,
* otherwise pgstat_flush_pending_entries() might not see them immediately.
*/
static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);
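For context, a flush pass over that list might look roughly like this; the
dlist macros are PostgreSQL's ilist.h API, while the pending_node member
name is an assumption for illustration:

dlist_mutable_iter iter;

dlist_foreach_modify(iter, &pgStatPending)
{
	/* pending_node is an assumed member name */
	PgStat_EntryRef *entry_ref =
		dlist_container(PgStat_EntryRef, pending_node, iter.cur);

	/*
	 * The per-kind flush callback would run here; on success the entry
	 * is removed from the pending list.
	 */
}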
, and so the cache is effectively read-only. It is also
transaction-scoped
based upon the GUC and the nature of stats vis-a-vis transactions.
No, that's not right. I think you might be thinking of
pgStatLocal.snapshot.stats?
Probably...
I guess I should add a paragraph about snapshots / fetch consistency.
I apparently confused/combined the two concepts just now so that would help.
Even before I added the read-only and transaction-scoped I got a bit hung
up on reading:
"The shared hashtable only needs to be accessed when no prior referenceto
the shared hashtable exists."
Thinking in terms of key seems to make more sense than value in this
sentence - even if there is a one-to-one correspondence.Maybe "prior reference to the shared hashtable exists for the key"?
I specifically dislike having two mentions of the "shared hashtable" in the
same sentence, so I tried to phrase the second half in terms of the local
hashtable.
I am wondering why there are no mentions of the header files in this
architecture, only the .c files.

Hm, I guess, but I'm not sure it'd add a lot? It's really just intended to
give a starting point (and it can't be worse than the explanation of the
current system).
No need to try to come up with something. More curious if there was a
general reason to avoid it before I looked to see if I felt anything in
them seemed worth including from my perspective.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index bfbfe53deb..504f952c0e 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -4,9 +4,9 @@
 *
 *
 * PgStat_KindInfo describes the different types of statistics handled. Some
- * kinds of statistics are collected for fixed number of objects
- * (e.g. checkpointer statistics). Other kinds are statistics are collected
- * for variable-numbered objects (e.g. relations).
+ * kinds of statistics are collected for a fixed number of objects
+ * (e.g., checkpointer statistics). Other kinds of statistics are collected
Was that comma after e.g. intentional?
It is. That is the style I was taught, and that we seem to adhere to in
user-facing documentation. Source code is a mixed bag with no enforcement,
but while we are here...
+ * for a varying number of objects (e.g., relations).
* Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
Status-quo works for me too, and matches up with the desired labelling we
are using here.
 *
@@ -19,19 +19,21 @@
 *
 * All variable-numbered stats are addressed by PgStat_HashKey while running.
 * It is not possible to have statistics for an object that cannot be
- * addressed that way at runtime. A wider identifier can be used when
+ * addressed that way at runtime. An alternate identifier can be used when
 * serializing to disk (used for replication slot stats).
Not sure this improves things.
It just seems odd that width is being mentioned when the actual struct is a
combination of three subcomponents. I do feel I'd need to understand
exactly what replication slot stats are doing uniquely here, though, to
make any point beyond that.
- * Each statistics kind is handled in a dedicated file:
+ * Each statistics kind is handled in a dedicated file, though their structs
+ * are defined here for lack of better ideas.
-0.5
Status-quo works for me. Food for thought for other reviewers though.
David J.
On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
On Tue, Apr 5, 2022 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-04-05 13:51:12 -0700, David G. Johnston wrote:
, but rather add to the shared queue
Queue? Maybe you mean the hashtable?
Queue implemented by a list...? Anyway, I think I mean this:
/*
* List of PgStat_EntryRefs with unflushed pending stats.
*
* Newly pending entries should only ever be added to the end of the list,
* otherwise pgstat_flush_pending_entries() might not see them immediately.
*/
static dlist_head pgStatPending = DLIST_STATIC_INIT(pgStatPending);
That's not in shared memory, but backend local...
, and so the cache is effectively read-only. It is also
transaction-scoped
based upon the GUC and the nature of stats vis-a-vis transactions.
No, that's not right. I think you might be thinking of
pgStatLocal.snapshot.stats?Probably...
I guess I should add a paragraph about snapshots / fetch consistency.
I apparently confused/combined the two concepts just now so that would help.
Will add.
Even before I added the read-only and transaction-scoped I got a bit hung
up on reading:
"The shared hashtable only needs to be accessed when no prior referenceto
the shared hashtable exists."
Thinking in terms of key seems to make more sense than value in this
sentence - even if there is a one-to-one correspondence.Maybe "prior reference to the shared hashtable exists for the key"?
I specifically dislike having two mentions of the "shared hashtable" in the
same sentence, so I tried to phrase the second half in terms of the local
hashtable.
You left two mentions of "shared hashtable" in the sentence prior though
:). I'll try to rephrase. But it's not the end if this isn't the most elegant
prose...
Was that comma after e.g. intentional?
It is. That is the style I was taught, and that we seem to adhere to in
user-facing documentation. Source code is a mixed bag with no enforcement,
but while we are here...
Looks a bit odd to me. But I guess I'll add it then...
 *
@@ -19,19 +19,21 @@
 *
 * All variable-numbered stats are addressed by PgStat_HashKey while running.
 * It is not possible to have statistics for an object that cannot be
- * addressed that way at runtime. A wider identifier can be used when
+ * addressed that way at runtime. An alternate identifier can be used when
 * serializing to disk (used for replication slot stats).
Not sure this improves things.
It just seems odd that width is being mentioned when the actual struct is a
combination of three subcomponents. I do feel I'd need to understand
exactly what replication slot stats are doing uniquely here, though, to
make any point beyond that.
There's no real numeric identifier for replication slot stats. So I'm using
the "index" used in slot.c while running. But that can change during
start/stop.
Greetings,
Andres Freund
Hi,
Here comes v70:
- extended / polished the architecture comment based on feedback from Melanie
and David
- other polishing as suggested by David
- addressed the open issue around pgstat_report_stat(), as described in
/messages/by-id/20220405204019.6yj7ocmpw352c2u5@alap3.anarazel.de
- while working on the above point, I noticed that hash_bytes() showed up
noticeably in profiles, so I replaced it with a fixed-width function
- found a few potential regression test instabilities by either *always*
flushing in pgstat_report_stat(), or only flushing when force = true.
- random minor improvements
- reordered commits some
I still haven't renamed pg_stat_exists_stat() yet - I'm leaning towards
pg_stat_have_stats() or pg_stat_exists() right now. But it's an SQL function
for testing, so it doesn't really matter.
I think this is basically ready, minus a few comment adjustments here and
there. Unless somebody protests I'm planning to start pushing things tomorrow
morning.
It'll be a few hours to get to the main commit - but except for 0001 it
doesn't make sense to push without intending to push later changes too. I
might squash a few commits together.
There's lots that can be done once all this is in place, both simplifying
pre-existing code and easy new features, but that's for a later release.
Greetings,
Andres Freund
Attachments:
On Tue, Apr 5, 2022 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
On Tue, Apr 5, 2022 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
I guess I should add a paragraph about snapshots / fetch consistency.
I apparently confused/combined the two concepts just now so that would
help.
Will add.
Thank you.
On a slightly different track, I took the time to write up a "Purpose"
section for pgstat.c:
It may possibly be duplicating some things written elsewhere as I didn't go
looking for similar prior art yet, I just wanted to get thoughts down.
This is the kind of preliminary framing I've been constructing in my own
mind as I try to absorb this patch. I haven't formed an opinion whether
the actual user-facing documentation should cover some or all of this
instead of the preamble to pgstat.c (which could just point to the docs for
prerequisite reading).
David J.
* Purpose:
* The PgStat namespace defines an API that facilitates concurrent access
* to a shared memory region where cumulative statistical data is saved.
* At shutdown, one of the running system workers will initiate the writing
* of the data to file. Then, during startup (following a clean shutdown)
* the Postmaster process will early on ensure that the file is loaded into
* memory.
*
* Each cumulative statistic producing system must construct a PgStat_Kind
* datum in this file. The details are described elsewhere, but of
* particular importance is that each kind is classified as having either a
* fixed number of objects that it tracks, or a variable number.
*
* During normal operations, the different consumers of the API will have
* their access managed by the API; the protocol used is determined based
* upon whether the statistical kind is fixed-numbered or variable-numbered.
* Readers of variable-numbered statistics will have the option to locally
* cache the data, while writers may have their updates locally queued
* and applied in a batch, thus favoring speed over freshness.
* The fixed-numbered statistics are faster to process and thus forgo
* these mechanisms in favor of a light-weight lock.
*
* Cumulative in this context means that processes must, for numeric data,
* send a delta (or change) value via the API which will then be added to
* the stored value in memory. The system does not track individual changes,
* only their net effect. Additionally, both due to unclean shutdown or user
* request, statistics can be reset - meaning that their stored numeric
* values are returned to zero, and any non-numeric data that may be tracked
* (say a timestamp) is cleared.
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
Here comes v70:
I think this is basically ready, minus a few comment adjustments here and
there. Unless somebody protests I'm planning to start pushing things
tomorrow
morning.
Nothing I've come across, given my area of experience, gives me pause. I'm
mostly going to focus on docs and comments at this point - to try and help
the next person in my position (and end-users) have an easier go at
on-boarding. Toward that end, I did just add a "Purpose" section writeup
to the v69 thread.
David J.
I've never tried to review a 24-patch series before. It's kind of
intimidating.... Is there a good place to start to get a good idea of
the most important changes?
Hi,
On 2022-04-05 20:00:50 -0700, David G. Johnston wrote:
On Tue, Apr 5, 2022 at 4:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-04-05 14:43:49 -0700, David G. Johnston wrote:
On Tue, Apr 5, 2022 at 2:23 PM Andres Freund <andres@anarazel.de> wrote:
I guess I should add a paragraph about snapshots / fetch consistency.
I apparently confused/combined the two concepts just now so that would
help.
Will add.
I at least tried...
On a slightly different track, I took the time to write-up a "Purpose"
section for pgstat.c :It may possibly be duplicating some things written elsewhere as I didn't go
looking for similar prior art yet, I just wanted to get thoughts down.
There's very very little prior documentation in this area.
This is the kind of preliminary framing I've been constructing in my own
mind as I try to absorb this patch. I haven't formed an opinion whether
the actual user-facing documentation should cover some or all of this
instead of the preamble to pgstat.c (which could just point to the docs for
prerequisite reading).
* The PgStat namespace defines an API that facilitates concurrent access
* to a shared memory region where cumulative statistical data is saved.
* At shutdown, one of the running system workers will initiate the writing
* of the data to file. Then, during startup (following a clean shutdown)
* the Postmaster process will early on ensure that the file is loaded into
* memory.
I added something roughly along those lines in the version I just sent, based
on a suggestion by Melanie over IM:
* Statistics are loaded from the filesystem during startup (by the startup
* process), unless preceded by a crash, in which case all stats are
* discarded. They are written out by the checkpointer process just before
* shutting down, except when shutting down in immediate mode.
* Each cumulative statistic producing system must construct a PgStat_Kind
* datum in this file. The details are described elsewhere, but of
* particular importance is that each kind is classified as having either a
* fixed number of objects that it tracks, or a variable number.
*
* During normal operations, the different consumers of the API will have
* their access managed by the API; the protocol used is determined based
* upon whether the statistical kind is fixed-numbered or variable-numbered.
* Readers of variable-numbered statistics will have the option to locally
* cache the data, while writers may have their updates locally queued
* and applied in a batch, thus favoring speed over freshness.
* The fixed-numbered statistics are faster to process and thus forgo
* these mechanisms in favor of a light-weight lock.
This feels a bit jumbled. Of course something using an API will be managed by
the API. I don't know what protocol really means?
* Additionally, both due to unclean shutdown or user request,
* statistics can be reset - meaning that their stored numeric values are
* returned to zero, and any non-numeric data that may be tracked
* (say a timestamp) is cleared.
I think this is basically covered in the above?
Greetings,
Andres Freund
On Tue, Apr 5, 2022 at 8:11 PM Greg Stark <stark@mit.edu> wrote:
I've never tried to review a 24-patch series before. It's kind of
intimidating.... Is there a good place to start to get a good idea of
the most important changes?
It isn't as bad as the number makes it sound - I just used "git am" to
apply the patches to a branch and skimmed each commit separately. Most of
them are tests or other minor pieces. The remaining few cover different
aspects of the major commit and you can choose them based upon your
experience and time.
David J.
Hi,
On 2022-04-05 23:11:07 -0400, Greg Stark wrote:
I've never tried to review a 24-patch series before. It's kind of
intimidating.... Is there a good place to start to get a good idea of
the most important changes?
It was more at some point :). And believe me, I find this whole project
intimidating and exhausting. The stats collector is entangled in a lot of
places, and there was a lot of preparatory work to get to this point.
Most of the commits aren't really interesting, I broke them out to make the
"main commit" a bit smaller, because it's exhausting to look at a *huge*
single commit. I wish I could have broken it down more, but I didn't find a
good way.
The interesting commit is
v70-0010-pgstat-store-statistics-in-shared-memory.patch
which actually replaces the stats collector by storing stats in shared
memory. It contains a, now hopefully decent, overview of how things work at
the top of pgstat.c.
Greetings,
Andres Freund
On Tue, Apr 5, 2022 at 8:14 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-04-05 20:00:50 -0700, David G. Johnston wrote:
* Statistics are loaded from the filesystem during startup (by the startup
* process), unless preceded by a crash, in which case all stats are
* discarded. They are written out by the checkpointer process just before
* shutting down, except when shutting down in immediate mode.
Cool. I was on the fence about the level of detail here, but mostly
excluded mentioning the checkpointer 'cause I didn't want to research the
correct answer tonight.
* Each cumulative statistic producing system must construct a PgStat_Kind
* datum in this file. The details are described elsewhere, but of
* particular importance is that each kind is classified as having either a
* fixed number of objects that it tracks, or a variable number.
*
* During normal operations, the different consumers of the API will have
* their access managed by the API; the protocol used is determined based
* upon whether the statistical kind is fixed-numbered or variable-numbered.
* Readers of variable-numbered statistics will have the option to locally
* cache the data, while writers may have their updates locally queued
* and applied in a batch, thus favoring speed over freshness.
* The fixed-numbered statistics are faster to process and thus forgo
* these mechanisms in favor of a light-weight lock.

This feels a bit jumbled.
I had that inkling as well. First draft and I needed to stop at some
point. It didn't seem bad or wrong at least.
Of course something using an API will be managed by
the API. I don't know what protocol reallly means?
Procedure, process, and algorithm are synonyms. Procedure probably makes more
sense here since it is a procedural language we are using. I thought of
algorithm while writing this, but it carried too much technical baggage for
me (compression, encryption, etc.) that this didn't seem to fit in with.
* Additionally, both due to unclean shutdown or user request,
* statistics can be reset - meaning that their stored numeric values are
* returned to zero, and any non-numeric data that may be tracked
* (say a timestamp) is cleared.
I think this is basically covered in the above?
Yes and no. The first paragraph says they are forced to reset due to
system error. This paragraph basically says that resetting this kind of
statistic is an acceptable, and even expected, thing to do. And in fact
can also be done intentionally and not only due to system error. I am
pondering whether to mention this dynamic first and/or better blend it in -
but the minor repetition in the different contexts seems ok.
David J.
At Tue, 5 Apr 2022 20:00:08 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
Here comes v70:
- extended / polished the architecture comment based on feedback from Melanie
and David
- other polishing as suggested by David
- addressed the open issue around pgstat_report_stat(), as described in
/messages/by-id/20220405204019.6yj7ocmpw352c2u5@alap3.anarazel.de
- while working on the above point, I noticed that hash_bytes() showed up
noticeably in profiles, so I replaced it with a fixed-width function
- found a few potential regression test instabilities by either *always*
flushing in pgstat_report_stat(), or only flushing when force = true.
- random minor improvements
- reordered commits some

I still haven't renamed pg_stat_exists_stat() yet - I'm leaning towards
pg_stat_have_stats() or pg_stat_exists() right now. But it's an SQL function
for testing, so it doesn't really matter.

I think this is basically ready, minus a few comment adjustments here and
there. Unless somebody protests I'm planning to start pushing things tomorrow
morning.

It'll be a few hours to get to the main commit - but except for 0001 it
doesn't make sense to push without intending to push later changes too. I
might squash a few commits together.

There's lots that can be done once all this is in place, both simplifying
pre-existing code and easy new features, but that's for a later release.
I'm not sure it's in time but..
(Sorry in advance for possible duplicate or pointless comments.)
0001: Looks fine.
0002:
All references to "stats collector" or alike looks like eliminated
after all of the 24 patches are applied. So this seems fine.
0003:
This is just moving around functions and variables. Looks fine.
0004:
I can see padding_pgstat_send and fun:pgstat_send in valgrind.supp
0005:
The function is changed later patch, and it looks fine.
0006:
I'm fine with the categorize for now.
+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)
The number of kinds is 10. And PGSTAT_NUM_KINDS is 11?
+ * Don't define an INVALID value so switch() statements can warn if some
+ * cases aren't covered. But define the first member to 1 so that
+ * uninitialized values can be detected more easily.
FWIW, I like this.
0007:
(mmm no comments)
0008:
+ xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
+ xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
+ xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);
I'm not sure I like this, but I don't object to this..
0009: (skipped)
0010:
(I didn't look this closer. The comments arised while looking other
patches.)
+pgstat_kind_from_str(char *kind_str)
I don't think I like "str" so much. Don't we spell it as
"pgstat_kind_from_name"?
0011:
Looks fine.
0012:
Looks like covering all related parts.
0013:
Just fine.
0014:
I believe it :p
0015:
Function attributes seems correct. Looks fine.
0016:
(skipped, but looks fine by a quick look.)
0017:
I don't find a problem with this.
0018: (skipped)
0019:
+my $og_stats = $datadir . '/' . "pg_stat" . '/' . "pgstat.stat";
It can be "$datadir/pg_stat/pgstat.stat" or $datadir . '/pg_stat/pgstat.stat'.
Isn't it simpler?
0020-24: (I believe them :p)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Apr 6, 2022 at 10:00 AM Andres Freund <andres@anarazel.de> wrote:
- while working on the above point, I noticed that hash_bytes() showed up
noticeably in profiles, so I replaced it with a fixed-width function
I'm curious about this -- could you direct me to which patch introduces this?
--
John Naylor
EDB: http://www.enterprisedb.com
Just skimming a bit here ...
On 2022-Apr-05, Andres Freund wrote:
From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 4 Apr 2022 16:53:16 -0700
Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory.
+       <entry><literal>PgStatsData</literal></entry>
+       <entry>Waiting fo shared memory stats data access</entry>
+      </row>
Typo "fo" -> "for"
@@ -5302,7 +5317,9 @@ StartupXLOG(void)
performedWalRecovery = true;
}
else
+ {
performedWalRecovery = false;
+ }
Why? :-)
Why give pgstat_get_entry_ref the responsibility of initializing
created_entry to false? The vast majority of callers don't care about
that flag; it seems easier/cleaner to set it to false in
pgstat_init_function_usage (the only caller that cares that I could
find) before calling pgstat_prep_pending_entry.
(I suggest pgstat_prep_pending_entry should have a comment line stating
"*created_entry, if not NULL, is set true if the entry required to be
created.", same as pgstat_get_entry_ref.)
--
Álvaro Herrera  PostgreSQL Developer - https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep." (Robert Davidson)
http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
Hi,
On 2022-04-06 18:11:04 +0900, Kyotaro Horiguchi wrote:
0004:
I can see padding_pgstat_send and fun:pgstat_send in valgrind.supp
Those shouldn't be affected by the patch, I think? But I did indeed forget to
remove those in 0010.
0006:
I'm fine with the categorize for now.
+#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL
+#define PGSTAT_NUM_KINDS (PGSTAT_KIND_LAST + 1)

The number of kinds is 10. And PGSTAT_NUM_KINDS is 11?
Yea, it's not great. I think I'll introduce INVALID and rename
PGSTAT_KIND_FIRST to FIRST_VALID.
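That direction would look roughly like this (a sketch of the naming just
described, not the committed definition):

typedef enum PgStat_Kind
{
	/* use 0 so zero-initialized values are detectably invalid */
	PGSTAT_KIND_INVALID = 0,

	PGSTAT_KIND_DATABASE = 1,
	/* ... remaining kinds ... */
	PGSTAT_KIND_WAL,
} PgStat_Kind;

#define PGSTAT_KIND_FIRST_VALID PGSTAT_KIND_DATABASE
#define PGSTAT_KIND_LAST PGSTAT_KIND_WAL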
+ * Don't define an INVALID value so switch() statements can warn if some
+ * cases aren't covered. But define the first member to 1 so that
+ * uninitialized values can be detected more easily.

FWIW, I like this.
I think there are no switches left now, so it's not actually providing too much.
0008:
+ xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
+ xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
+ xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);

I'm not sure I like this, but I don't object to this..
The string prefixes? Or the entire patch?
0010:
(I didn't look at this closely. The comments arose while looking at other
patches.)

+pgstat_kind_from_str(char *kind_str)
I don't think I like "str" so much. Don't we spell it as
"pgstat_kind_from_name"?
name makes me think of NameData. What do you dislike about str? We seem to use
str in plenty of places?
0019:
+my $og_stats = $datadir . '/' . "pg_stat" . '/' . "pgstat.stat";
It can be "$datadir/pg_stat/pgstat.stat" or $datadir . '/pg_stat/pgstat.stat'.
Isn't it simpler?
Yes, will change.
Greetings,
Andres Freund
Hi,
On 2022-04-06 13:31:31 +0200, Alvaro Herrera wrote:
Just skimming a bit here ...
Thanks!
On 2022-Apr-05, Andres Freund wrote:
From 0532b869033595202d5797b148f22c61e4eb4969 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 4 Apr 2022 16:53:16 -0700
Subject: [PATCH v70 10/27] pgstat: store statistics in shared memory.+ <entry><literal>PgStatsData</literal></entry> + <entry>Waiting fo shared memory stats data access</entry> + </row>Typo "fo" -> "for"
Oh, oops. I had fixed that in the wrong patch.
@@ -5302,7 +5317,9 @@ StartupXLOG(void)
performedWalRecovery = true;
}
else
+ {
performedWalRecovery = false;
+ }

Why? :-)
Damage from merging two commits yesterday. I'd left open where exactly we'd
reset stats, with the "main commit" implementing the current behaviour more
closely, and then a followup commit implementing something a bit
better. Nobody seemed to argue for keeping the behaviour 1:1, so I merged
them. Without removing the parens again :)
Why give pgstat_get_entry_ref the responsibility of initializing
created_entry to false? The vast majority of callers don't care about
that flag; it seems easier/cleaner to set it to false in
pgstat_init_function_usage (the only caller that cares that I could
find) before calling pgstat_prep_pending_entry.
It's annoying to have to initialize it, I agree. But I think it's bug-prone
for the caller to have to know that it has to be pre-initialized to false.
(I suggest pgstat_prep_pending_entry should have a comment line stating
"*created_entry, if not NULL, is set true if the entry required to be
created.", same as pgstat_get_entry_ref.)
Added something along those lines.
Greetings,
Andres Freund
Hi,
On 2022-04-06 16:24:28 +0700, John Naylor wrote:
On Wed, Apr 6, 2022 at 10:00 AM Andres Freund <andres@anarazel.de> wrote:
- while working on the above point, I noticed that hash_bytes() showed up
noticeably in profiles, so I replaced it with a fixed-width functionI'm curious about this -- could you direct me to which patch introduces this?
Commit 0010, search for pgstat_hash_key_hash. For simplicity I'm including it
here inline:
/* helpers for dshash / simplehash hashtables */
static inline int
pgstat_hash_key_cmp(const void *a, const void *b, size_t size, void *arg)
{
AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);
return memcmp(a, b, sizeof(PgStat_HashKey));
}
static inline uint32
pgstat_hash_key_hash(const void *d, size_t size, void *arg)
{
const PgStat_HashKey *key = (PgStat_HashKey *)d;
uint32 hash;
AssertArg(size == sizeof(PgStat_HashKey) && arg == NULL);
hash = murmurhash32(key->kind);
hash = hash_combine(hash, murmurhash32(key->dboid));
hash = hash_combine(hash, murmurhash32(key->objoid));
return hash;
}
Greetings,
Andres Freund
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
Here comes v70:
Some small nitpicks on the docs:
From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 2 Apr 2022 19:38:01 -0700
Subject: [PATCH v70 14/27] pgstat: update docs.
...
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
-        These parameters control server-wide statistics collection features.
-        When statistics collection is enabled, the data that is produced can be
+        These parameters control server-wide cumulative statistics system.
+        When enabled, the data that is collected can be
Missing "the" ("These parameters control the server-wide cumulative
statistics system").
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
+        any of the accumulated statistics, acessed values are cached until the end
"acessed" => "accessed"
+        <varname>stats_fetch_consistency</varname> can be set
+        <literal>snapshot</literal>, at the price of increased memory usage for
Missing "to" ("can be set to <literal>snapshot</literal>")
+        caching not-needed statistics data.  Conversely, if it's known that statistics
Double space between "data." and "Conversely" (not sure if that matters)
+        current transaction's statistics snapshot or cached values (if any).  The
Double space between "(if any)." and "The" (not sure if that matters)
+        next use of statistical information will cause a new snapshot to be built
+        or accessed statistics to be cached.
I believe this should be an "and", not an "or". (next access builds both a
new snapshot and caches accessed statistics)
Thanks,
Lukas
--
Lukas Fittl
Hi,
On 2022-04-06 12:14:35 -0700, Lukas Fittl wrote:
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
Here comes v70:
Some small nitpicks on the docs:
Thanks!
From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 2 Apr 2022 19:38:01 -0700
Subject: [PATCH v70 14/27] pgstat: update docs.
...
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
-        These parameters control server-wide statistics collection features.
-        When statistics collection is enabled, the data that is produced can be
+        These parameters control server-wide cumulative statistics system.
+        When enabled, the data that is collected can be

Missing "the" ("These parameters control the server-wide cumulative
statistics system").

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
+        any of the accumulated statistics, acessed values are cached until the end

"acessed" => "accessed"

+        <varname>stats_fetch_consistency</varname> can be set
+        <literal>snapshot</literal>, at the price of increased memory usage for

Missing "to" ("can be set to <literal>snapshot</literal>")
Fixed.
+        caching not-needed statistics data.  Conversely, if it's known that statistics

Double space between "data." and "Conversely" (not sure if that matters)

+        current transaction's statistics snapshot or cached values (if any).  The

Double space between "(if any)." and "The" (not sure if that matters)
That's done pretty widely in the docs and comments.
+        next use of statistical information will cause a new snapshot to be built
+        or accessed statistics to be cached.
I believe this should be an "and", not an "or". (next access builds both a
new snapshot and caches accessed statistics)
I *think* or is correct? The new snapshot is when stats_fetch_consistency =
snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
make that clearer without making it a lot longer. Suggestions?
Greetings,
Andres Freund
On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
+        next use of statistical information will cause a new snapshot to be built
+        or accessed statistics to be cached.

I believe this should be an "and", not an "or". (next access builds both a
new snapshot and caches accessed statistics)

I *think* or is correct? The new snapshot is when stats_fetch_consistency =
snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
make that clearer without making it a lot longer. Suggestions?
I think it's correct. Maybe it's clearer to say:
+ next use of statistical information will (when in snapshot mode) cause a new snapshot to be built
+ or (when in cache mode) accessed statistics to be cached.
On Wed, Apr 6, 2022 at 12:34 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
+        next use of statistical information will cause a new snapshot to be built
+        or accessed statistics to be cached.

I believe this should be an "and", not an "or". (next access builds both a
new snapshot and caches accessed statistics)

I *think* or is correct? The new snapshot is when stats_fetch_consistency =
snapshot, the cached is when stats_fetch_consistency = cache. Not sure how to
make that clearer without making it a lot longer. Suggestions?

I think it's correct. Maybe it's clearer to say:

+        next use of statistical information will (when in snapshot mode) cause a new snapshot to be built
+        or (when in cache mode) accessed statistics to be cached.
Ah, yes, that does clarify what was meant.
+1 to Justin's edit, or something like it.
Thanks,
Lukas
--
Lukas Fittl
Hi,
On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
It'll be a few hours to get to the main commit - but except for 0001 it
doesn't make sense to push without intending to push later changes too. I
might squash a few commits together.
I just noticed an existing incoherency that I'm wondering about fixing as part
of 0007 "pgstat: prepare APIs used by pgstatfuncs for shared memory stats."
The SQL functions to reset function and relation stats
pg_stat_reset_single_table_counters() and
pg_stat_reset_single_function_counters() respectively both make use of
pgstat_reset_single_counter().
Note that the SQL function uses plural "counters" (which makes sense, it
resets all counters for that object), whereas the C function they call to
perform the reset uses the singular.
Similarly, the pg_stat_reset_slru(), pg_stat_reset_replication_slot() and
pg_stat_reset_subscription_stats() SQL functions use
pgstat_reset_slru_counter(), pgstat_reset_replslot_counter() and
pgstat_reset_subscription_counter() to reset either the stats for one or all
SLRUs/slots/subscriptions.
This is relevant for the commit mentioned above because it separates the
functions to reset the stats for one slru / slot / sub from the function to
reset all slrus / slots / subs. Going with the existing naming I'd just named
them pgstat_reset_*_counters(). But that doesn't really make sense.
If it were just existing code I'd just not touch this for now. But because the
patch introduces further functions, I'd rather not introduce more weird
function names.
I'd go for
pgstat_reset_slru_counter() -> pgstat_reset_slru()
pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()
We could leave out the _all_ and just use plural too, but I think it's a bit
nicer with _all_ in there.
Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?
Greetings,
Andres Freund
On Wednesday, April 6, 2022, Andres Freund <andres@anarazel.de> wrote:
I'd go for
pgstat_reset_slru_counter() -> pgstat_reset_slru()
pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()
I like having the SQL function paired with a matching implementation in
this scheme.
We could leave out the _all_ and just use plural too, but I think it's a
bit
nicer with _all_ in there.
+1 to _all_
Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?
Why not add both pgstat_reset_function() and pgstat_reset_table() (to keep
the pairing) and have them call the renamed pgstat_reset_function_or_table()
internally (since the function indeed handles both paths and we've yet to
come up with a label to use instead of "function and table stats")?
These are private functions right?
David J.
Hi,
On 2022-04-06 15:32:39 -0700, David G. Johnston wrote:
On Wednesday, April 6, 2022, Andres Freund <andres@anarazel.de> wrote:
I'd go for
pgstat_reset_slru_counter() -> pgstat_reset_slru()
pgstat_reset_subscription_counter() -> pgstat_reset_subscription()
pgstat_reset_subscription_counters() -> pgstat_reset_all_subscriptions()
pgstat_reset_replslot_counter() -> pgstat_reset_replslot()
pgstat_reset_replslot_counters() -> pgstat_reset_all_replslots()I like having the SQL function paired with a matching implementation in
this scheme.
It would have gotten things closer than it was before imo. We can't just
rename the SQL functions, they're obviously exposed API.
I'd like to remove the NULL -> all behaviour, but that should be discussed
separately.
I've hacked up the above, but after doing so I think I found a cleaner
approach:
I've introduced:
pgstat_reset_of_kind(PgStat_Kind kind), which works for both fixed- and
variable-numbered stats. That allows removing
pgstat_reset_subscription_counters(), pgstat_reset_replslot_counters(),
pgstat_reset_shared_counters().

pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the need
for pgstat_reset_subscription_counter() and pgstat_reset_single_counter().
pgstat_reset_replslot() is still needed, to do the name -> index lookup.

That imo makes a lot more sense than requiring each variable-amount kind to
have wrapper functions.
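Put together, the two entry points as sketched from the description above
(parameter names assumed):

/* reset all entries of one kind, fixed- or variable-numbered */
void pgstat_reset_of_kind(PgStat_Kind kind);

/* reset a single variable-numbered entry */
void pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid);

/* e.g. resetting one table's stats in the current database might be: */
pgstat_reset(PGSTAT_KIND_RELATION, MyDatabaseId, relid);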
Not quite sure what to do with pgstat_reset_single_counter(). I'd either go
for the minimal pgstat_reset_single_counters() or pgstat_reset_one()?

Why not add both pgstat_reset_function() and pgstat_reset_table() (to keep
the pairing) and have them call the renamed pgstat_reset_function_or_table()
internally (since the function indeed handles both paths and we've yet to
come up with a label to use instead of "function and table stats")?

These are private functions right?
What does "private" mean for you? They're exposed via pgstat.h not
pgstat_internal.h. But not to SQL.
Greetings,
Andres Freund
On Wed, Apr 6, 2022 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-04-06 15:32:39 -0700, David G. Johnston wrote:
On Wednesday, April 6, 2022, Andres Freund <andres@anarazel.de> wrote:
I like having the SQL function paired with a matching implementation in
this scheme.It would have gotten things closer than it was before imo. We can't just
rename the SQL functions, they're obviously exposed API.
Right, I meant the naming scheme proposed was acceptable. Not that I
wanted to change the SQL functions too.
I've hacked up the above, but after doing so I think I found a cleaner
approach:
I've introduced:
pgstat_reset_of_kind(PgStat_Kind kind) which works for both fixed and
variable
numbered stats. That allows removing pgstat_reset_subscription_counters(),
pgstat_reset_replslot_counters(), pgstat_reset_shared_counters().
pgstat_reset(PgStat_Kind kind, Oid dboid, Oid objoid), which removes the
need
for pgstat_reset_subscription_counter() and
pgstat_reset_single_counter().
pgstat_reset_replslot() is still needed, to do
the name -> index lookup.
That imo makes a lot more sense than requiring each variable-amount kind to
have wrapper functions.
I can see benefits of both, or even possibly combining them. Absent being
able to point to some other part of the system and saying "it is done this
way there, let's do the same here" I think the details will inform the
decision. The fact there is just the one outlier here suggests that this
is indeed the better option.
What does "private" mean for you? They're exposed via pgstat.h not
pgstat_internal.h. But not to SQL.
I was thinking specifically of the freedom to rename and not break
extensions. Namely, are these truly implementation details or something
that, while unlikely to be used by extensions, still constitute an exposed
API? It was mainly a passing thought, I'm not looking for a crash-course
in how all that works right now.
David J.
Hi,
On 2022-04-06 17:01:17 -0700, David G. Johnston wrote:
On Wed, Apr 6, 2022 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:
The fact there is just the one outlier here suggests that this is indeed the
better option.
FWIW, the outlier also uses pgstat_reset(), just with a small wrapper doing
the translation from slot name to slot index.
What does "private" mean for you? They're exposed via pgstat.h not
pgstat_internal.h. But not to SQL.I was thinking specifically of the freedom to rename and not break
extensions. Namely, are these truly implementation details or something
that, while unlikely to be used by extensions, still constitute an exposed
API? It was mainly a passing thought, I'm not looking for a crash-course
in how all that works right now.
I doubt there are extensions using these functions - and they'd have been
broken the way things were in v70, because the signature already had changed.
Generally, between major releases, we don't worry too much about changing C
APIs. Of course we try to avoid unnecessarily breaking things, particularly
when it's going to cause widespread breakage.
Greetings,
Andres Freund
At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund <andres@anarazel.de> wrote in
+ * Don't define an INVALID value so switch() statements can warn if some
+ * cases aren't covered. But define the first member to 1 so that
+ * uninitialized values can be detected more easily.
FWIW, I like this.
I think there are no switches left now, so it's not actually providing much anymore.
(Ouch!)
0008:
+ xact_desc_stats(buf, "", parsed.nstats, parsed.stats);
+ xact_desc_stats(buf, "commit ", parsed.nstats, parsed.stats);
+ xact_desc_stats(buf, "abort ", parsed.nabortstats, parsed.abortstats);
I'm not sure I like this, but I don't object to it.
The string prefixes? Or the entire patch?
The string prefixes, since they are a limited set of fixed
strings. That being said, I don't think it would be better to use an enum
instead, either. So I don't object to passing the strings here.
0010:
(I didn't look at this closely. The comments arose while looking at other
patches.)
+pgstat_kind_from_str(char *kind_str)
I don't think I like "str" so much. Don't we spell it as
"pgstat_kind_from_name"?name makes me think of NameData. What do you dislike about str? We seem to use
str in plenty places?
To be clear, I don't dislike it so much. So, I'm fine with the
current name.
I realized that by "str" you meant the type. I had thought of it as an
instance (I'm not sure I can express my feeling correctly here..) and had
the following functions in mind.
char *get_namespace/rel/collation/func_name(Oid someoid)
char *pgstat_slru_name(int slru_idx)
Another instance of the same direction is
ForkNumber forkname_to_number(const char *forkName)
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-04-07 10:36:30 +0900, Kyotaro Horiguchi wrote:
At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund <andres@anarazel.de> wrote in
+ * Don't define an INVALID value so switch() statements can warn if some
+ * cases aren't covered. But define the first member to 1 so that
+ * uninitialized values can be detected more easily.
FWIW, I like this.
I think there are no switches left now, so it's not actually providing much anymore.
(Ouch!)
I think it's great that there are no switches left - means we're pretty close to
pgstat being runtime extensible...
0010:
(I didn't look at this closely. The comments arose while looking at other
patches.)
+pgstat_kind_from_str(char *kind_str)
I don't think I like "str" so much. Don't we spell it as
"pgstat_kind_from_name"?name makes me think of NameData. What do you dislike about str? We seem to use
str in plenty places?For clarity, I don't dislike it so much. So, I'm fine with the
current name.I found that you meant a type by the "str". I thought it as an
instance (I'm not sure I can express my feeling correctly here..) and
the following functions were in my mind.char *get_namespace/rel/collation/func_name(Oid someoid)
char *pgstat_slru_name(int slru_idx)Another instance of the same direction is
ForkNumber forkname_to_number(const char *forkName)
It's now pgstat_get_kind_from_str().
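For reference, a sketch of what that lookup can look like, assuming the
per-kind metadata array (pgstat_kind_infos[] in pgstat.c) carries a
human-readable name; the committed function may differ in details such as
error wording:

/*
 * Sketch only (context: src/backend/utils/activity/pgstat.c): translate a
 * stats kind name into its PgStat_Kind tag by scanning the kind-info table.
 */
static PgStat_Kind
pgstat_get_kind_from_str(char *kind_str)
{
	for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
	{
		if (pg_strcasecmp(kind_str, pgstat_kind_infos[kind].name) == 0)
			return kind;
	}

	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("invalid statistics kind: \"%s\"", kind_str)));
	return PGSTAT_KIND_DATABASE;	/* not reached; keep compilers quiet */
}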
It was harder to see earlier (I certainly didn't really see it) - because
there were so many "violations" - but most of pgstat is
pgstat_<verb>_<subject>() or just <verb>_<subject>. I'd already moved most of
the patch series over to that (maybe in v68 or so). Now I also did that with
the internal functions.
There's a few functions breaking that pattern, partially because I added them
:(, but since they're not touched in these patches I've not renamed them. But
it's probably worth doing so tomorrow.
Greetings,
Andres Freund
Hi,
On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
It'll be a few hours to get to the main commit - but except for 0001 it
doesn't make sense to push without intending to push later changes too. I
might squash a few commits together.
I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs tomorrow.
Greetings,
Andres Freund
At Wed, 6 Apr 2022 18:58:52 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-07 10:36:30 +0900, Kyotaro Horiguchi wrote:
At Wed, 6 Apr 2022 09:04:09 -0700, Andres Freund <andres@anarazel.de> wrote in
I think there are no switches left now, so it's not actually providing much anymore.
(Ouch!)
I think it's great that there are no switches left - means we're pretty close to
pgstat being runtime extensible...
Yes. I agree.
It's now pgstat_get_kind_from_str().
I'm fine with it.
It was harder to see earlier (I certainly didn't really see it) - because
there were so many "violations" - but most of pgstat is
pgstat_<verb>_<subject>() or just <verb>_<subject>. I'd already moved most of
the patch series over to that (maybe in v68 or so). Now I also did that with
the internal functions.There's a few functions breaking that pattern, partially because I added them
:(, but since they're not touched in these patches I've not renamed them. But
it's probably worth doing so tomorrow.
That being said, it gets far cleaner. Thanks!
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Thu, 7 Apr 2022 00:28:45 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-05 20:00:08 -0700, Andres Freund wrote:
It'll be a few hours to get to the main commit - but except for 0001 it
doesn't make sense to push without intending to push later changes too. I
might squash a few commits together.
I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs tomorrow.
Thank you very much for the great effort on this to make it get in!
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs tomorrow.
I've gotten through the tests now. There's one known, not yet addressed, issue
with the stats isolation test, see [1].
Working on the docs. Found a few things worth raising:
1)
Existing text:
When the server shuts down cleanly, a permanent copy of the statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so that
statistics can be retained across server restarts. When recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset.
The existing docs patch hadn't been updated yet. My current edit is
When the server shuts down cleanly, a permanent copy of the statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so that
statistics can be retained across server restarts. When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery, but not when starting a standby that was shut
down normally), all statistics counters are reset.
but I'm not sure the parenthetical is easy enough to understand?
2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2222,17 +2222,17 @@ HINT: You can then restart the server after making the necessary configuration
...
<para>
- The statistics collector is active during recovery. All scans, reads, blocks,
+ The cumulative statistics system is active during recovery. All scans, reads, blocks,
index usage, etc., will be recorded normally on the standby. Replayed
actions will not duplicate their effects on primary, so replaying an
insert will not increment the Inserts column of pg_stat_user_tables.
The stats file is deleted at the start of recovery, so stats from primary
and standby will differ; this is considered a feature, not a bug.
</para>
<para>
I'll just commit the necessary bit, but we really ought to rephrase this.
Greetings,
Andres Freund
[1]: /messages/by-id/20220407165709.jgdkrzqlkcwue6ko@alap3.anarazel.de
At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs tomorrow.
I've gotten through the tests now. There's one known, not yet addressed, issue
with the stats isolation test, see [1].
Working on the docs. Found a few things worth raising:
1)
Existing text:
When the server shuts down cleanly, a permanent copy of the statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so that
statistics can be retained across server restarts. When recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset.
The existing docs patch hadn't been updated yet. My current edit is
When the server shuts down cleanly, a permanent copy of the statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so that
statistics can be retained across server restarts. When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery, but not when starting a standby that was shut
down normally), all statistics counters are reset.
but I'm not sure the parenthetical is easy enough to understand?
I can read it. But I'm not sure that the difference is obvious for
average users between "starting a standby from a basebackup" and
"starting a standby after a normal shutdown"..
Other than that, it might be easier to read if the additional part
were moved out to the end of the paragraph, prefixing with "Note:
". For example,
...
statistics can be retained across server restarts. When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset. Note that
crash recovery is not performed when starting a standby that was shut
down normally, in which case all counters are retained.
2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2222,17 +2222,17 @@ HINT: You can then restart the server after making the necessary configuration
...
 <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
+    The cumulative statistics system is active during recovery. All scans, reads, blocks,
     index usage, etc., will be recorded normally on the standby. Replayed
     actions will not duplicate their effects on primary, so replaying an
     insert will not increment the Inserts column of pg_stat_user_tables.
     The stats file is deleted at the start of recovery, so stats from primary
     and standby will differ; this is considered a feature, not a bug.
 </para>
 <para>
Agreed partially. It's too detailed. It might not need to mention WAL
replay.
I'll just commit the necessary bit, but we really ought to rephrase this.
Greetings,
Andres Freund
[1] /messages/by-id/20220407165709.jgdkrzqlkcwue6ko@alap3.anarazel.de
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund <andres@anarazel.de>
wrote inHi,
On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs
tomorrow.
I've gotten through the tests now. There's one known, not yet addressed,
issue
with the stats isolation test, see [1].
Working on the docs. Found a few things worth raising:
1)
Existing text:
When the server shuts down cleanly, a permanent copy of the statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so that
statistics can be retained across server restarts. When recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset.
The existing docs patch hadn't updated yet. My current edit is
When the server shuts down cleanly, a permanent copy of the statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so that
statistics can be retained across server restarts. When crash
recovery is
performed at server start (e.g., after immediate shutdown, server
crash,
and point-in-time recovery, but not when starting a standby that was
shut
down normally), all statistics counters are reset.
but I'm not sure the parenthetical is easy enough to understand?
I can read it. But I'm not sure that the difference is obvious for
average users between "starting a standby from a basebackup" and
"starting a standby after a normal shutdown"..Other than that, it might be easier to read if the additional part
were moved out to the end of the paragraph, prefixing with "Note:
". For example,...
statistics can be retained across server restarts. When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset. Note that
crash recovery is not performed when starting a standby that was shut
down normally, in which case all counters are retained.
Maybe:
When the server shuts down cleanly a permanent copy of the statistics data
is stored in the <filename>pg_stat</filename> subdirectory so that
statistics can be retained across server restarts. However, if crash
recovery is performed (i.e., after immediate shutdown, server crash, or
point-in-time recovery), all statistics counters are reset. For any
standby server, the initial startup to get the cluster initialized is a
point-in-time crash recovery startup. For all subsequent startups it
behaves like any other server. For a hot standby server, statistics are
retained during a failover promotion.
I'm pretty sure i.e., is correct since those three situations are not
examples but rather the complete set.
Is crash recovery ever performed other than at server start? If not, I
chose to remove the redundancy.
I feel like some of this detail about standby servers is/should be covered
elsewhere and we are at least missing a cross-reference chance even if we
leave the material coverage as-is.
2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2222,17 +2222,17 @@ HINT: You can then restart the server after making the necessary configuration
...
 <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
+    The cumulative statistics system is active during recovery. All scans, reads, blocks,
     index usage, etc., will be recorded normally on the standby. Replayed
     actions will not duplicate their effects on primary, so replaying an
     insert will not increment the Inserts column of pg_stat_user_tables.
     The stats file is deleted at the start of recovery, so stats from primary
     and standby will differ; this is considered a feature, not a bug.
 </para>
 <para>
Agreed partially. It's too detailed. It might not need to mention WAL
replay.
The insert example seems like a poor one...IIUC cumulative statistics are
not WAL logged and while in recovery INSERT is prohibited, so how would
replaying the insert in the WAL result in a duplicated effect on
pg_stat_user_tables.inserts?
I also have no idea what "on primary" is supposed to mean in the fragment
"Replayed actions will not duplicate their effects on primary...".
I would like to write the following but I don't believe it is sufficiently
true:
"The cumulative statistics system records only the locally generated
activity of the cluster, including while in recovery. Activity happening
only due to the replay of WAL is not considered local."
But to apply the WAL we have to fetch blocks from the local filesystem and
write them back out. That is local activity happening due to the replay of
WAL which sounds like it is and should be reported ("All...blocks...", and
the example given being logical DDL oriented).
I cannot think of a better paragraph at the moment, the minimal change is
good, and the detail it contains presently seems like the right amount, if
indeed my interpretation of it is correct (i.e., the standby records
physical stats, not logical ones). It still has wording issues around "on
primary" and maybe a better example choice than a
disallowed-in-recovery-anyway insert.
David J.
Hi,
On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote:
At Thu, 7 Apr 2022 16:37:51 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs tomorrow.
I've gotten through the tests now. There's one known, not yet addressed, issue
with the stats isolation test, see [1].
Working on the docs. Found a few things worth raising:
1)
Existing text:
When the server shuts down cleanly, a permanent copy of the statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so that
statistics can be retained across server restarts. When recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset.
The existing docs patch hadn't been updated yet. My current edit is
When the server shuts down cleanly, a permanent copy of the statistics
data is stored in the <filename>pg_stat</filename> subdirectory, so that
statistics can be retained across server restarts. When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery, but not when starting a standby that was shut
down normally), all statistics counters are reset.
but I'm not sure the parenthetical is easy enough to understand?
I can read it. But I'm not sure that the difference is obvious for
average users between "starting a standby from a basebackup" and
"starting a standby after a normal shutdown"..
Yea, that's what I was concerned about. How about:
<para>
Cumulative statistics are collected in shared memory. Every
<productname>PostgreSQL</productname> process collects statistics locally
then updates the shared data at appropriate intervals. When a server,
including a physical replica, shuts down cleanly, a permanent copy of the
statistics data is stored in the <filename>pg_stat</filename> subdirectory,
so that statistics can be retained across server restarts. In contrast,
when starting from an unclean shutdown (e.g., after an immediate shutdown,
a server crash, starting from a base backup, and point-in-time recovery),
all statistics counters are reset.
</para>
Other than that, it might be easier to read if the additional part
were moved out to the end of the paragraph, prefixing with "Note:
". For example,...
statistics can be retained across server restarts. When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset. Note that
crash recovery is not performed when starting a standby that was shut
down normally, in which case all counters are retained.
I think I like my version above a bit better?
2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2222,17 +2222,17 @@ HINT: You can then restart the server after making the necessary configuration
...
 <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
+    The cumulative statistics system is active during recovery. All scans, reads, blocks,
     index usage, etc., will be recorded normally on the standby. Replayed
     actions will not duplicate their effects on primary, so replaying an
     insert will not increment the Inserts column of pg_stat_user_tables.
     The stats file is deleted at the start of recovery, so stats from primary
     and standby will differ; this is considered a feature, not a bug.
 </para>
 <para>
Agreed partially. It's too detailed. It might not need to mention WAL
replay.
My concern is more that it seems halfway nonsensical. "Replayed actions will
not duplicate their effects on primary" - I can guess what that means, but not
more. There's no "Inserts" column of pg_stat_user_tables.
<para>
The cumulative statistics system is active during recovery. All scans,
reads, blocks, index usage, etc., will be recorded normally on the
standby. However, WAL replay will not increment relation and database
specific counters. I.e. replay will not increment pg_stat_all_tables
columns (like n_tup_ins), nor will reads or writes performed by the
startup process be tracked in the pg_statio views, nor will associated
pg_stat_database columns be incremented.
</para>
Greetings,
Andres Freund
Hi,
On 2022-04-07 20:51:10 -0700, David G. Johnston wrote:
On Thu, Apr 7, 2022 at 7:10 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:I can read it. But I'm not sure that the difference is obvious for
average users between "starting a standby from a basebackup" and
"starting a standby after a normal shutdown"..Other than that, it might be easier to read if the additional part
were moved out to the end of the paragraph, prefixing with "Note:
". For example,...
statistics can be retained across server restarts. When crash recovery is
performed at server start (e.g., after immediate shutdown, server crash,
and point-in-time recovery), all statistics counters are reset. Note that
crash recovery is not performed when starting a standby that was shut
down normally, in which case all counters are retained.
Maybe:
When the server shuts down cleanly a permanent copy of the statistics data
is stored in the <filename>pg_stat</filename> subdirectory so that
statistics can be retained across server restarts. However, if crash
recovery is performed (i.e., after immediate shutdown, server crash, or
point-in-time recovery), all statistics counters are reset. For any
standby server, the initial startup to get the cluster initialized is a
point-in-time crash recovery startup. For all subsequent startups it
behaves like any other server. For a hot standby server, statistics are
retained during a failover promotion.
I'm pretty sure i.e., is correct since those three situations are not
examples but rather the complete set.
I don't think the "initial startup ..." bit is quite correct. A standby can be
created for a shut down server, and IIRC there's some differences in how PITR
is handled too.
Is crash recovery ever performed other than at server start? If so I
choose to remove the redundancy.
No.
I feel like some of this detail about standby servers is/should be covered
elsewhere and we are at least missing a cross-reference chance even if we
leave the material coverage as-is.
I didn't find anything good to reference...
2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2222,17 +2222,17 @@ HINT: You can then restart the server after making the necessary configuration
...
 <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
+    The cumulative statistics system is active during recovery. All scans, reads, blocks,
     index usage, etc., will be recorded normally on the standby. Replayed
     actions will not duplicate their effects on primary, so replaying an
     insert will not increment the Inserts column of pg_stat_user_tables.
     The stats file is deleted at the start of recovery, so stats from primary
     and standby will differ; this is considered a feature, not a bug.
 </para>
 <para>
Agreed partially. It's too detailed. It might not need to mention WAL
replay.
The insert example seems like a poor one...IIUC cumulative statistics are
not WAL logged and while in recovery INSERT is prohibited, so how would
replaying the insert in the WAL result in a duplicated effect on
pg_stat_user_tables.inserts?
I agree, the sentence doesn't make much sense.
It doesn't really matter that stats aren't WAL logged, one could infer them
from the WAL at a decent level of accuracy. However, we can't actually
associate those actions to relations, we just know the relfilenode... And the
startup process can't read the catalog to figure the mapping out.
I also have no idea what "on primary" is supposed to mean in the fragment
"Replayed actions will not duplicate their effects on primary...".
I think it's trying to say "will not duplicate the effect on
pg_stat_user_tables they had on the primary".
I would like to write the following but I don't believe it is sufficiently
true:"The cumulative statistics system records only the locally generated
activity of the cluster, including while in recovery. Activity happening
only due to the replay of WAL is not considered local."But to apply the WAL we have to fetch blocks from the local filesystem and
write them back out. That is local activity happening due to the replay of
WAL which sounds like it is and should be reported ("All...blocks...", and
the example given being logical DDL oriented).
That's not true today - the startup process's reads / writes aren't reflected
in pg_statio, pg_stat_database or whatnot.
I cannot think of a better paragraph at the moment, the minimal change is
good, and the detail it contains presently seems like the right amount, if
indeed my interpretation of it is correct (i.e., the standby records
physical stats, not logical ones). It still has wording issues around "on
primary" and maybe a better example choice than a
disallowed-in-recovery-anyway insert.
What do you think about my suggested paragraphs in
/messages/by-id/20220408035921.xlmjrv7wdmk3xm7k@alap3.anarazel.de ?
Greetings,
Andres Freund
Hi,
On 2022-04-07 20:59:21 -0700, Andres Freund wrote:
<para>
Cumulative statistics are collected in shared memory. Every
<productname>PostgreSQL</productname> process collects statistics locally
then updates the shared data at appropriate intervals. When a server,
including a physical replica, shuts down cleanly, a permanent copy of the
statistics data is stored in the <filename>pg_stat</filename> subdirectory,
so that statistics can be retained across server restarts. In contrast,
when starting from an unclean shutdown (e.g., after an immediate shutdown,
a server crash, starting from a base backup, and point-in-time recovery),
all statistics counters are reset.
</para>
...
<para>
The cumulative statistics system is active during recovery. All scans,
reads, blocks, index usage, etc., will be recorded normally on the
standby. However, WAL replay will not increment relation and database
specific counters. I.e. replay will not increment pg_stat_all_tables
columns (like n_tup_ins), nor will reads or writes performed by the
startup process be tracked in the pg_statio views, nor will associated
pg_stat_database columns be incremented.
</para>
I went with these for now. My guess is that there are further improvements to be made in
them, and in surrounding areas...
With that, I'll close this CF entry. It's been a while.
Greetings,
Andres Freund
On 2022-04-07 16:37:51 -0700, Andres Freund wrote:
On 2022-04-07 00:28:45 -0700, Andres Freund wrote:
I've gotten through the main commits (and then a fix for the apparently
inevitable bug that's immediately highlighted by the buildfarm), and the first
test. I'll call it a night now, and work on the other tests & docs tomorrow.I've gotten through the tests now. There's one known, not yet addressed, issue
with the stats isolation test, see [1].
That has since been fixed, in d6c0db14836cd843d589372d909c73aab68c7a24
On Thu, Apr 7, 2022 at 8:59 PM Andres Freund <andres@anarazel.de> wrote:
<para>
Cumulative statistics are collected in shared memory. Every
<productname>PostgreSQL</productname> process collects statistics
locally
then updates the shared data at appropriate intervals. When a server,
including a physical replica, shuts down cleanly, a permanent copy of
the
statistics data is stored in the <filename>pg_stat</filename>
subdirectory,
so that statistics can be retained across server restarts. In contrast,
when starting from an unclean shutdown (e.g., after an immediate
shutdown,
a server crash, starting from a base backup, and point-in-time
recovery),
all statistics counters are reset.
</para>
I like this. My comment regarding using "i.e.," here stands though.
<para>
The cumulative statistics system is active during recovery. All scans,
reads, blocks, index usage, etc., will be recorded normally on the
standby. However, WAL replay will not increment relation and database
specific counters. I.e. replay will not increment pg_stat_all_tables
columns (like n_tup_ins), nor will reads or writes performed by the
startup process be tracked in the pg_statio views, nor will associated
pg_stat_database columns be incremented.
</para>
I like this too. The second part with three nors is a bit rough. Maybe:
... specific counters. In particular, replay will not increment
pg_stat_database or pg_stat_all_tables columns, and the startup process
will not report reads and writes for the pg_statio views.
It would be helpful to give at least one specific example of what is being
recorded normally, especially since we give three of what is not.
David J.
At Thu, 7 Apr 2022 20:59:21 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
On 2022-04-08 11:10:14 +0900, Kyotaro Horiguchi wrote:
I can read it. But I'm not sure that the difference is obvious for
average users between "starting a standby from a basebackup" and
"starting a standby after a normal shutdown"..Yea, that's what I was concerned about. How about:
<para>
Cumulative statistics are collected in shared memory. Every
<productname>PostgreSQL</productname> process collects statistics locally
then updates the shared data at appropriate intervals. When a server,
including a physical replica, shuts down cleanly, a permanent copy of the
statistics data is stored in the <filename>pg_stat</filename> subdirectory,
so that statistics can be retained across server restarts. In contrast,
when starting from an unclean shutdown (e.g., after an immediate shutdown,
a server crash, starting from a base backup, and point-in-time recovery),
all statistics counters are reset.
</para>
Looks perfect generally, and especially in regard to the concern.
I think I like my version above a bit better?
Quite a bit. Mine didn't address the concern.
2)
The edit is not a problem, but it's hard to understand what the existing
paragraph actually means?
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 3247e056663..8bfb584b752 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -2222,17 +2222,17 @@ HINT: You can then restart the server after making the necessary configuration
...
 <para>
-    The statistics collector is active during recovery. All scans, reads, blocks,
+    The cumulative statistics system is active during recovery. All scans, reads, blocks,
     index usage, etc., will be recorded normally on the standby. Replayed
     actions will not duplicate their effects on primary, so replaying an
     insert will not increment the Inserts column of pg_stat_user_tables.
     The stats file is deleted at the start of recovery, so stats from primary
     and standby will differ; this is considered a feature, not a bug.
 </para>
 <para>
Agreed partially. It's too detailed. It might not need to mention WAL
replay.
My concern is more that it seems halfway nonsensical. "Replayed actions will
not duplicate their effects on primary" - I can guess what that means, but not
more. There's no "Inserts" column of pg_stat_user_tables.<para>
The cumulative statistics system is active during recovery. All scans,
reads, blocks, index usage, etc., will be recorded normally on the
standby. However, WAL replay will not increment relation and database
specific counters. I.e. replay will not increment pg_stat_all_tables
columns (like n_tup_ins), nor will reads or writes performed by the
startup process be tracked in the pg_statio views, nor will associated
pg_stat_database columns be incremented.
</para>
Looks clearer since it mentions user-facing interfaces with concrete
example columns.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2022-04-07 21:39:55 -0700, David G. Johnston wrote:
On Thu, Apr 7, 2022 at 8:59 PM Andres Freund <andres@anarazel.de> wrote:
<para>
Cumulative statistics are collected in shared memory. Every
<productname>PostgreSQL</productname> process collects statistics
locally
then updates the shared data at appropriate intervals. When a server,
including a physical replica, shuts down cleanly, a permanent copy of
the
statistics data is stored in the <filename>pg_stat</filename>
subdirectory,
so that statistics can be retained across server restarts. In contrast,
when starting from an unclean shutdown (e.g., after an immediate
shutdown,
a server crash, starting from a base backup, and point-in-time
recovery),
all statistics counters are reset.
</para>
I like this. My comment regarding using "i.e.," here stands though.
Argh. I'd used e.g., but not i.e..
<para>
The cumulative statistics system is active during recovery. All scans,
reads, blocks, index usage, etc., will be recorded normally on the
standby. However, WAL replay will not increment relation and database
specific counters. I.e. replay will not increment pg_stat_all_tables
columns (like n_tup_ins), nor will reads or writes performed by the
startup process be tracked in the pg_statio views, nor will associated
pg_stat_database columns be incremented.
</para>
I like this too. The second part with three nors is a bit rough. Maybe:
Agreed. I tried to come up with a smoother formulation, but didn't (perhaps
because I was a tad tired).
... specific counters. In particular, replay will not increment
pg_stat_database or pg_stat_all_tables columns, and the startup process
will not report reads and writes for the pg_statio views.
It would be helpful to give at least one specific example of what is being
recorded normally, especially since we give three of what is not.
The second sentence is a set of examples - or do you mean examples for what
actions by the startup process are counted?
Greetings,
Andres Freund
On Sat, Apr 9, 2022 at 12:07 PM Andres Freund <andres@anarazel.de> wrote:
... specific counters. In particular, replay will not increment
pg_stat_database or pg_stat_all_tables columns, and the startup process
will not report reads and writes for the pg_statio views.It would helpful to give at least one specific example of what is being
recorded normally, especially since we give three of what is not.The second sentence is a set of examples - or do you mean examples for what
actions by the startup process are counted?
Specific views that these statistics will be updating; like
pg_stat_database being the example of a view that is not updating.
David J.
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
Here comes v70:
One thing I just noticed while peeking at pg_stat_slru:
The stats_reset column for my newly initdb'd cluster is showing me
"2000-01-01 00:00:00" (v15). I was expecting null, though a non-null value
restriction does make sense. Neither choice is documented though.
Based upon my expectation I checked to see if v14 reported null, and thus
this was a behavior change. v14 reports the initdb timestamp (e.g.,
2022-04-13 23:26:48.349115+00)
Can we document the non-null aspect of this value (pg_stat_database is
happy being null, this seems to be a "fixed" type behavior) but have it
continue to report initdb as its initial value?
David J.
On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund <andres@anarazel.de> wrote:
Here comes v70:
One thing I just noticed while peeking at pg_stat_slru:
The stats_reset column for my newly initdb'd cluster is showing me
"2000-01-01 00:00:00" (v15). I was expecting null, though a non-null value
restriction does make sense. Neither choice is documented though.
Based upon my expectation I checked to see if v14 reported null, and thus
this was a behavior change. v14 reports the initdb timestamp (e.g.,
2022-04-13 23:26:48.349115+00)
Can we document the non-null aspect of this value (pg_stat_database is
happy being null, this seems to be a "fixed" type behavior) but have it
continue to report initdb as its initial value?
Sorry, apparently this "2000-01-01" behavior only manifests after crash
recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
same initdb timestamp.
Feels like we should still report the "end of crash recovery timestamp" for
these instead of 2000-01-01 (which I guess is derived from 0) if we are not
willing to produce null (and it seems other parts of the system using these
stats assumes non-null).
David J.
Hi,
On 2022-04-13 16:56:45 -0700, David G. Johnston wrote:
On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:
Sorry, apparently this "2000-01-01" behavior only manifests after crash
recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
same initdb timestamp.
Feels like we should still report the "end of crash recovery timestamp" for
these instead of 2000-01-01 (which I guess is derived from 0) if we are not
willing to produce null (and it seems other parts of the system using these
stats assume non-null).
Yes, that's definitely not correct. I see the bug (need to call
pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but
easy to fix - too fried to write a test tonight, but will commit the fix
tomorrow.
Thanks for catching!
Greetings,
Andres Freund
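For readers following along, the shape of the fix is roughly the following
(a sketch of pgstat_discard_stats() in src/backend/utils/activity/pgstat.c,
with error handling trimmed; the committed version may differ in detail):

/*
 * Sketch only: discard the on-disk stats snapshot during crash recovery.
 */
void
pgstat_discard_stats(void)
{
	/* throw away the permanent stats file, if any */
	if (unlink(PGSTAT_STAT_PERMANENT_FILENAME) != 0 && errno != ENOENT)
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not unlink permanent statistics file \"%s\": %m",
						PGSTAT_STAT_PERMANENT_FILENAME)));

	/*
	 * The missing call: initialize the reset timestamps of fixed-numbered
	 * stats to the current time instead of leaving them zero - timestamp 0
	 * is PostgreSQL's epoch, which reads back as "2000-01-01".
	 */
	pgstat_reset_after_failure();
}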
On Wed, Apr 13, 2022 at 04:56:45PM -0700, David G. Johnston wrote:
Sorry, apparently this "2000-01-01" behavior only manifests after crash
recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
same initdb timestamp.
Feels like we should still report the "end of crash recovery timestamp" for
these instead of 2000-01-01 (which I guess is derived from 0) if we are not
willing to produce null (and it seems other parts of the system using these
stats assume non-null).
I can see this timestamp as well after crash recovery. This seems
rather misleading to me. I have added an open item.
--
Michael
Hi,
On 2022-04-13 17:55:18 -0700, Andres Freund wrote:
On 2022-04-13 16:56:45 -0700, David G. Johnston wrote:
On Wed, Apr 13, 2022 at 4:34 PM David G. Johnston <
david.g.johnston@gmail.com> wrote:
Sorry, apparently this "2000-01-01" behavior only manifests after crash
recovery on v15 (didn't check v14); after a clean initdb on v15 I got the
same initdb timestamp.
Feels like we should still report the "end of crash recovery timestamp" for
these instead of 2000-01-01 (which I guess is derived from 0) if we are not
willing to produce null (and it seems other parts of the system using these
stats assume non-null).
Yes, that's definitely not correct. I see the bug (need to call
pgstat_reset_after_failure(); in pgstat_discard_stats()). Stupid, but
easy to fix - too fried to write a test tonight, but will commit the fix
tomorrow.
Pushed the fix (including a test that previously failed). Thanks again!
Greetings,
Andres Freund
So I'm finally wrapping my head around this new code. There is
something I'm surprised by that perhaps I'm misreading or perhaps I
shouldn't be surprised by, not sure.
Is it true that the shared memory allocation contains the hash table
entry and body of every object in every database? I guess I was
assuming I would find some kind of LRU cache which loaded data from
disk on demand. But afaict it loads everything on startup and then
never loads from disk later. The disk is purely for recovering state
after a restart.
On the one hand the rest of Postgres seems to be designed on the
assumption that the number of tables and database objects is limited
only by disk space. The catalogs are stored in relational storage
which is read through the buffer cache. On the other hand it's true
that syscaches don't expire entries (though I think the assumption
is that no one backend touches very much).
It seems like if we really think the total number of database objects
is reasonably limited to scales that fit in RAM there would be a much
simpler database design that would just store the catalog tables in
simple in-memory data structures and map them all on startup without
doing all the work Postgres does to make relational storage scale.
On Wed, Jul 20, 2022 at 11:35 AM Greg Stark <stark@mit.edu> wrote:
On the one hand the rest of Postgres seems to be designed on the
assumption that the number of tables and database objects is limited
only by disk space. The catalogs are stored in relational storage
which is read through the buffer cache. On the other hand it's true
that syscaches don't expire entries (though I think the assumption
is that no one backend touches very much).It seems like if we really think the total number of database objects
is reasonably limited to scales that fit in RAM there would be a much
simpler database design that would just store the catalog tables in
simple in-memory data structures and map them all on startup without
doing all the work Postgres does to make relational storage scale.
I think efforts to do such a thing have gotten caught up in solving
issues around visibility and managing the relationship between local and
global caches [1]. It doesn't seem like the primary technical concern
was memory usage.
[1]: /messages/by-id/4E72940DA2BF16479384A86D54D0988A567B9245@G01JPEXMBKW04
Hi,
On 2022-07-20 11:35:13 -0400, Greg Stark wrote:
Is it true that the shared memory allocation contains the hash table
entry and body of every object in every database?
Yes. However, note that that was already the case with the old stats
collector - it also kept everything in memory. In addition every read
access to stats loaded a copy of the stats (well of the global stats and
the relevant per-database stats).
It might be worth doing something fancier at some point - the shared
memory stats was already a huge effort, cramming yet another change in
there would pretty much have guaranteed that it'd fail.
Greetings,
Andres Freund
Melanie Plageman <melanieplageman@gmail.com> writes:
On Wed, Jul 20, 2022 at 11:35 AM Greg Stark <stark@mit.edu> wrote:
It seems like if we really think the total number of database objects
is reasonably limited to scales that fit in RAM there would be a much
simpler database design that would just store the catalog tables in
simple in-memory data structures and map them all on startup without
doing all the work Postgres does to make relational storage scale.
I think efforts to do such a thing have gotten caught up in solving
issues around visibility and managing the relationship between local and
global caches [1]. It doesn't seem like the primary technical concern
was memory usage.
AFAIR, the previous stats collector implementation had no such provision
either: it'd just keep adding hashtable entries as it received info about
new objects. The only thing that's changed is that now those entries are
in shared memory instead of process-local memory. We'd be well advised to
be sure that memory can be swapped out under pressure, but otherwise I'm
not seeing that things have gotten worse.
regards, tom lane
Hi,
On 2022-07-20 12:08:35 -0400, Tom Lane wrote:
AFAIR, the previous stats collector implementation had no such provision
either: it'd just keep adding hashtable entries as it received info about
new objects.
Yep.
The only thing that's changed is that now those entries are in shared
memory instead of process-local memory. We'd be well advised to be
sure that memory can be swapped out under pressure, but otherwise I'm
not seeing that things have gotten worse.
FWIW, I ran a few memory usage benchmarks. Without stats accesses the
memory usage with shared memory stats was sometimes below, sometimes
above the "old" memory usage, depending on the number of objects. As
soon as there's stats access, it's well below (that includes things like
autovac workers).
I think there's quite a bit of memory usage reduction potential around
dsa.c - we occasionally end up with [nearly] unused dsm segments.
Greetings,
Andres Freund
On Wed, 20 Jul 2022 at 12:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
AFAIR, the previous stats collector implementation had no such provision
either: it'd just keep adding hashtable entries as it received info about
new objects. The only thing that's changed is that now those entries are
in shared memory instead of process-local memory. We'd be well advised to
be sure that memory can be swapped out under pressure, but otherwise I'm
not seeing that things have gotten worse.
Just to be clear I'm not looking for ways things have gotten worse.
Just trying to understand what I'm reading and I guess I came in with
assumptions that led me astray.
But... adding entries as it received info about new objects isn't the
same as having info on everything. I didn't really understand how the
old system worked but if you had a very large schema but each session
only worked with a small subset did the local stats data ever absorb
info on the objects it never touched?
All that said -- having all objects loaded in shared memory makes my
work way easier. It actually seems feasible to dump out all the
objects from shared memory and including objects from other databases
and if I don't need a consistent snapshot it even seems like it would
be possible to do that without having a copy of more than one stats
entry at a time in local memory. I hope that doesn't cause huge
contention on the shared hash table to be doing that regularly.
--
greg
Hi,
On July 20, 2022 8:41:53 PM GMT+02:00, Greg Stark <stark@mit.edu> wrote:
On Wed, 20 Jul 2022 at 12:08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
AFAIR, the previous stats collector implementation had no such provision
either: it'd just keep adding hashtable entries as it received info about
new objects. The only thing that's changed is that now those entries are
in shared memory instead of process-local memory. We'd be well advised to
be sure that memory can be swapped out under pressure, but otherwise I'm
not seeing that things have gotten worse.
Just to be clear I'm not looking for ways things have gotten worse.
Just trying to understand what I'm reading and I guess I came in with
assumptions that led me astray.
But... adding entries as it received info about new objects isn't the
same as having info on everything. I didn't really understand how the
old system worked but if you had a very large schema but each session
only worked with a small subset did the local stats data ever absorb
info on the objects it never touched?
Each backend only had stats for things it touched. But the stats collector read all files at startup into hash tables and absorbed all generated stats into those as well.
All that said -- having all objects loaded in shared memory makes my
work way easier.
What are you trying to do?
It actually seems feasible to dump out all the
objects from shared memory and including objects from other databases
and if I don't need a consistent snapshot it even seems like it would
be possible to do that without having a copy of more than one stats
entry at a time in local memory. I hope that doesn't cause huge
contention on the shared hash table to be doing that regularly.
The stats accessors now default to not creating a full snapshot of stats data at first access (but that's configurable). So yes, that behavior is possible. E.g. autovac now uses a single object access like you describe.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
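As an illustration of that single-object access path, backend code can fetch
one entry without materializing a snapshot of everything. A minimal sketch:
pgstat_fetch_stat_tabentry() is the in-tree accessor, while the wrapper
around it is hypothetical. With the stats_fetch_consistency GUC set to
'none' no full snapshot is built; 'snapshot' restores the old
consistent-copy behavior.

#include "postgres.h"

#include "pgstat.h"

/*
 * Sketch: read one relation's insert counter directly.  Under
 * stats_fetch_consistency = 'none' this does not copy all stats; it just
 * looks up the single entry backed by shared memory.
 */
static int64
tuples_inserted_sketch(Oid relid)
{
	PgStat_StatTabEntry *tabentry;

	/* returns NULL if no stats exist for the relation yet */
	tabentry = pgstat_fetch_stat_tabentry(relid);

	return tabentry ? tabentry->tuples_inserted : 0;
}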
On Wed, 20 Jul 2022 at 15:09, Andres Freund <andres@anarazel.de> wrote:
Each backend only had stats for things it touched. But the stats collector read all files at startup into hash tables and absorbed all generated stats into those as well.
Fascinating. I'm surprised this didn't raise issues previously for
people with millions of tables. I wonder if it wasn't causing issues
and we just didn't hear about them because there were other bigger
issues :)
All that said -- having all objects loaded in shared memory makes my
work way easier.
What are you trying to do?
I'm trying to implement an exporter for prometheus/openmetrics/etc
that dumps directly from shared memory without going through the SQL
backend layer. I believe this will be much more reliable, lower
overhead, safer, and consistent than writing SQL queries.
Ideally I would want to dump out the stats without connecting to each
database. I suspect that would run into problems where the schema
really adds a lot of information (such as which table each index is on
or which table a toast relation is for). There are also some things
people think of as stats that are maintained in the catalog such as
reltuples and relpages. So I'm imagining this won't strictly stay true
in the end.
It seems like just having an interface to iterate over the shared hash
table and return entries one by one without filtering by database
would be fairly straightforward and I would be able to do most of what
I want just with that. There's actually enough meta information in the
stats entries to be able to handle them as they come instead of trying
to look up specific stats one by one.
--
greg
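A minimal sketch of the scan Greg describes, written against the pg15-era
internals: it requires including pgstat_internal.h, and handle_entry is a
hypothetical callback. The dshash sequential scan takes only shared
per-partition locks when its third argument is false.

#include "postgres.h"

#include "utils/pgstat_internal.h"

typedef void (*stats_entry_cb) (const PgStat_HashKey *key);

/*
 * Sketch only: walk every stats entry in the shared hash table, across all
 * databases, without any catalog access.
 */
static void
scan_all_stats_sketch(stats_entry_cb handle_entry)
{
	dshash_seq_status hstat;
	PgStatShared_HashEntry *p;

	/* 'false' requests non-exclusive (shared) partition locking */
	dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
	while ((p = dshash_seq_next(&hstat)) != NULL)
	{
		if (p->dropped)
			continue;			/* entry is pending removal */

		/* the key carries (kind, dboid, objoid) for dispatching */
		handle_entry(&p->key);
	}
	dshash_seq_term(&hstat);
}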
Hi,
On 7/21/22 5:07 PM, Greg Stark wrote:
On Wed, 20 Jul 2022 at 15:09, Andres Freund <andres@anarazel.de> wrote:
What are you trying to do?
Ideally I would want to dump out the stats without connecting to each
database.
I can see the use case too (especially for monitoring tools) of being
able to collect the stats without connecting to each database.
It seems like just having an interface to iterate over the shared hash
table and return entries one by one without filtering by database
would be fairly straightforward and I would be able to do most of what
I want just with that.
What do you think about adding a function in core PG to provide such
functionality? (means being able to retrieve all the stats (+ eventually
add some filtering) without the need to connect to each database).
If there is some interest, I'd be happy to work on it and propose a patch.
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
What do you think about adding a function in core PG to provide such
functionality? (means being able to retrieve all the stats (+ eventually
add some filtering) without the need to connect to each database).
I'm working on it myself too. I'll post a patch for discussion in a bit.
I was more aiming at a C function that extensions could use directly
rather than an SQL function -- though I suppose having the former it
would be simple enough to implement the latter using it. (though it
would have to be one for each stat type I guess)
The reason I want a C function is I'm trying to get as far as I can
without a connection to a database, without a transaction, without
accessing the catalog, and as much as possible without taking locks. I
think this is important for making monitoring highly reliable and low
impact on production. It's also kind of fundamental to accessing stats
for objects from other databases since we won't have easy access to
the catalogs for the other databases.
The main problem with my current code is that I'm accessing the shared
memory hash table directly. This means the I'm possibly introducing
locking contention on the shared memory hash table. I'm thinking of
separating the shared memory hash scan from the metric scan so the
list can be quickly built minimizing the time the lock is held. We
could possibly also only rebuild that list at a lower frequency than
the metrics gathering so new objects might not show up instantly.
I have a few things I would like to suggest for future improvements to
this infrastructure. I haven't polished the details of it yet but the
main thing I think I'm missing is the catalog name for the object. I
don't want to have to fetch it from the catalog and in any case I
think it would generally be useful and might regularize the
replication slot handling too.
I also think it would be nice to have a change counter for every stat
object, or perhaps a change time. Prometheus wouldn't be able to make
use of it but other monitoring software might be able to receive only
metrics that have changed since the last update which would really
help on databases with large numbers of mostly static objects. Even on
typical databases there are tons of builtin objects (especially
functions) that are probably never getting updates.
--
greg
Hi,
On 2022-08-09 12:18:47 +0200, Drouvot, Bertrand wrote:
What do you think about adding a function in core PG to provide such
functionality? (means being able to retrieve all the stats (+ eventually add
some filtering) without the need to connect to each database).
I'm not that convinced by the use case, but I think it's also low cost to add
and maintain, so if somebody cares enough to write something...
The only thing I would "request" is that such a function requires more
permissions than the default accessors do. I think it's a minor problem that
we allow so much access within a database right now, regardless of object
permissions, but it'd not be a great idea to expand that to other databases,
in bulk?
Greetings,
Andres Freund
Hi,
On 2022-08-09 12:00:46 -0400, Greg Stark wrote:
I was more aiming at a C function that extensions could use directly
rather than an SQL function -- though I suppose having the former it
would be simple enough to implement the latter using it. (though it
would have to be one for each stat type I guess)
I think such a C extension could exist today, without patching core code? It'd
be a bit ugly to include pgstat_internal.h, I guess, but other than that...
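To illustrate, a minimal skeleton of such an out-of-core extension might be (a sketch under that assumption; the function name is mine, and it merely counts live entries across all databases):

#include "postgres.h"
#include "fmgr.h"
#include "utils/pgstat_internal.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_stats_scan);

/* count live entries in the shared stats hash, across all databases */
Datum
my_stats_scan(PG_FUNCTION_ARGS)
{
	dshash_seq_status hstat;
	PgStatShared_HashEntry *p;
	int64		nentries = 0;

	dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
	while ((p = dshash_seq_next(&hstat)) != NULL)
	{
		if (!p->dropped)
			nentries++;
	}
	dshash_seq_term(&hstat);

	PG_RETURN_INT64(nentries);
}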
The reason I want a C function is I'm trying to get as far as I can
without a connection to a database, without a transaction, without
accessing the catalog, and as much as possible without taking locks.
I assume you don't include lwlocks under locks?
I think this is important for making monitoring highly reliable and low
impact on production.
I'm doubtful about that, but whatever.
The main problem with my current code is that I'm accessing the shared
memory hash table directly. This means that I'm possibly introducing
locking contention on the shared memory hash table.
I don't think that's a large enough issue to worry about unless you're
polling at a very high rate, which'd be a bad idea in itself. If a backend
can't get the lock for some stats change it'll defer flushing the stats a bit,
so it'll not cause a lot of other problems.
I'm thinking of separating the shared memory hash scan from the metric scan
so the list can be quickly built minimizing the time the lock is held.
I'd really really want to see some evidence that any sort of complexity here
is worth it.
I have a few things I would like to suggest for future improvements to
this infrastructure. I haven't polished the details of it yet but the
main thing I think I'm missing is the catalog name for the object. I
don't want to have to fetch it from the catalog and in any case I
think it would generally be useful and might regularize the
replication slot handling too.
I'm *dead* set against including catalog names in shared memory stats. That'll
add a good amount of memory usage and complexity, without any sort of
commensurate gain.
I also think it would be nice to have a change counter for every stat
object, or perhaps a change time. Prometheus wouldn't be able to make
use of it but other monitoring software might be able to receive only
metrics that have changed since the last update which would really
help on databases with large numbers of mostly static objects.
I think you're proposing adding overhead that doesn't even have a real user.
Greetings,
Andres Freund
Hi,
On 8/9/22 6:40 PM, Andres Freund wrote:
Hi,
On 2022-08-09 12:18:47 +0200, Drouvot, Bertrand wrote:
What do you think about adding a function in core PG to provide such
functionality? (means being able to retrieve all the stats (+ eventually add
some filtering) without the need to connect to each database).

I'm not that convinced by the use case, but I think it's also low cost to add
and maintain, so if somebody cares enough to write something...
Ack.
The only thing I would "request" is that such a function requires more
permissions than the default accessors do. I think it's a minor problem that
we allow so much access within a database right now, regardless of object
permissions, but it'd not be a great idea to expand that to other databases,
in bulk?
Agree that special attention would need to be paid to permissions.
Something like allowing its usage if a member of pg_read_all_stats?
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
Hi,
On 8/9/22 6:00 PM, Greg Stark wrote:
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
What do you think about adding a function in core PG to provide such
functionality? (means being able to retrieve all the stats (+ eventually
add some filtering) without the need to connect to each database).

I'm working on it myself too. I'll post a patch for discussion in a bit.
Great! Thank you!
Out of curiosity, would you also be interested in such a feature for
previous versions (that will not get the patch)?
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
Hi,
On 8/9/22 6:47 PM, Andres Freund wrote:
Hi,
On 2022-08-09 12:00:46 -0400, Greg Stark wrote:
I was more aiming at a C function that extensions could use directly
rather than an SQL function -- though I suppose having the former it
would be simple enough to implement the latter using it. (though it
would have to be one for each stat type I guess)

I think such a C extension could exist today, without patching core code? It'd
be a bit ugly to include pgstat_internal.h, I guess, but other than that...
Yeah, agree that writing such an extension is doable today.
The main problem with my current code is that I'm accessing the shared
memory hash table directly. This means that I'm possibly introducing
locking contention on the shared memory hash table.

I don't think that's a large enough issue to worry about unless you're
polling at a very high rate, which'd be a bad idea in itself. If a backend
can't get the lock for some stats change it'll defer flushing the stats a bit,
so it'll not cause a lot of other problems.
+1
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
On Tue, 9 Aug 2022 at 12:48, Andres Freund <andres@anarazel.de> wrote:
The reason I want a C function is I'm trying to get as far as I can
without a connection to a database, without a transaction, without
accessing the catalog, and as much as possible without taking locks.

I assume you don't include lwlocks under locks?
I guess it depends on which lwlock :) I would be leery of a monitoring
system taking an lwlock that could interfere with regular transactions
doing work. Or taking a lock that is itself the cause of the problem
elsewhere that you really need stats to debug would be a deal breaker.
I don't think that's a large enough issue to worry about unless you're
polling at a very high rate, which'd be a bad idea in itself. If a backend
can't get the lock for some stats change it'll defer flushing the stats a bit,
so it'll not cause a lot of other problems.
Hm. I wonder if we're on the same page about what constitutes a "high rate".
I've seen people try to push Prometheus or other similar systems to 5s
poll intervals. That would be challenging for Postgres due to the
volume of statistics. The default is 30s and people often struggle to
even have that function for large fleets. But if you had a small
fleet, perhaps an iot style system with a "one large table" type of
schema you might well want stats every 5s or even every 1s.
I'm *dead* set against including catalog names in shared memory stats. That'll
add a good amount of memory usage and complexity, without any sort of
commensurate gain.
Well it's pushing the complexity there from elsewhere. If the labels
aren't in the stats structures then the exporter needs to connect to
each database, gather all the names into some local cache and then it
needs to worry about keeping it up to date. And if there are any
database problems such as disk errors or catalog objects being locked
then your monitoring breaks though perhaps it can be limited to just
missing some object names or having out of date names.
I also think it would be nice to have a change counter for every stat
object, or perhaps a change time. Prometheus wouldn't be able to make
use of it but other monitoring software might be able to receive only
metrics that have changed since the last update which would really
help on databases with large numbers of mostly static objects.

I think you're proposing adding overhead that doesn't even have a real user.

I guess I'm just brainstorming here. I don't need it currently, no. It
doesn't seem like significant overhead compared to the locking
and copying though?
--
greg
One thing that's bugging me is that the names we use for these stats
are *all* over the place.
The names go through three different stages
pgstat structs -> pgstatfunc tupledescs -> pg_stat_* views
(Followed by a fourth stage where pg_exporter or whatever renames them for
the monitoring software)
And for some reason both transitions (plus the exporter) felt the need
to fiddle with the names or values. And not in any sort of even
vaguely consistent way. So there are three (or four) different sets of
names for the same metrics :(
e.g.
* Some of the struct elements have abbreviated words which are
expanded in the tupledesc names or the view columns -- some have long
names which get abbreviated.
* Some struct members have n_ prefixes (presumably to avoid C keywords
or other namespace issues?) and then lose them at one of the other
stages. But then the relation stats do not have n_ prefixes and then
the pg_stat view *adds* n_ prefixes in the SQL view!
* Some columns are added together in the SQL view which seems like
gratuitously hiding information from the user. The pg_stat_*_tables
view actually looks up information from the index stats and combines
them to get idx_scan and idx_tup_fetch.
* The pg_stat_bgwriter view returns data from two different fixed
entries, the checkpointer and the bgwriter; is there a reason those
are kept separately but then reported as if they're one thing?
Some of the simpler renaming could be transparently fixed by making
the internal stats match the public facing names. But for many of them
I think the internal names are better. And the cases where the views
aggregate data in a way that loses information are not something I
want to reproduce.
I had intended to use the internal names directly, reasoning that
transparency and consistency are the direction to be headed. But in
some cases I think the current public names are the better choice -- I
certainly don't want to remove n_* prefixes from some names but then
add them to different names! And some of the cases where the data is
combined or modified do seem like they would be missed.
--
greg
On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
Hi,
On 8/9/22 6:00 PM, Greg Stark wrote:
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
What do you think about adding a function in core PG to provide such
functionality? (means being able to retrieve all the stats (+ eventually
add some filtering) without the need to connect to each database).

I'm working on it myself too. I'll post a patch for discussion in a bit.
Great! Thank you!
So I was adding the code to pgstat.c because I had thought there were
some data types I needed and/or static functions I needed. However you
and Andres encouraged me to check again now. And indeed I was able,
after fixing a couple things, to make the code work entirely
externally.
This is definitely not polished and there's a couple obvious things
missing. But at the risk of embarrassment I've attached my WIP. Please
be gentle :) I'll post the github link in a bit when I've fixed up
some meta info.
I'm definitely not wedded to the idea of using callbacks, it was just
the most convenient way to get started, especially when I was putting
the main loop in pgstat.c. Ideally I do want to keep open the
possibility of streaming the results out without buffering the whole
set in memory.
Out of curiosity, would you be also interested by such a feature for
previous versions (that will not get the patch in) ?
I always had trouble understanding the existing stats code so I was
hoping the new code would make it easier. It seems to have worked but
it's possible I'm wrong and it was always possible and the problem was
always just me :)
--
greg
Attachments:
Hi,
On 2022-08-10 14:18:25 -0400, Greg Stark wrote:
I don't think that's a large enough issue to worry about unless you're
polling at a very high rate, which'd be a bad idea in itself. If a backend
can't get the lock for some stats change it'll defer flushing the stats a bit,
so it'll not cause a lot of other problems.

Hm. I wonder if we're on the same page about what constitutes a "high rate".
I've seen people try to push Prometheus or other similar systems to 5s
poll intervals. That would be challenging for Postgres due to the
volume of statistics. The default is 30s and people often struggle to
even have that function for large fleets. But if you had a small
fleet, perhaps an iot style system with a "one large table" type of
schema you might well want stats every 5s or even every 1s.
That's probably fine. Although I think you might run into trouble not from the
stats subsystem side, but from the "amount of data" side. On a system with a
lot of objects that can be a fair amount. If you really want to do very low
latency stats reporting, I suspect you'd have to build an incremental system.
I'm *dead* set against including catalog names in shared memory stats. That'll
add a good amount of memory usage and complexity, without any sort of
commensurate gain.

Well it's pushing the complexity there from elsewhere. If the labels
aren't in the stats structures then the exporter needs to connect to
each database, gather all the names into some local cache and then it
needs to worry about keeping it up to date. And if there are any
database problems such as disk errors or catalog objects being locked
then your monitoring breaks though perhaps it can be limited to just
missing some object names or having out of date names.
Shrug. If the stats system state desynchronizes from an alter table rename
you'll also have a problem in monitoring.
And even if you can benefit from having all that information, it'd still be an
overhead born by everybody for a very small share of users.
I also think it would be nice to have a change counter for every stat
object, or perhaps a change time. Prometheus wouldn't be able to make
use of it but other monitoring software might be able to receive only
metrics that have changed since the last update which would really
help on databases with large numbers of mostly static objects.

I think you're proposing adding overhead that doesn't even have a real user.

I guess I'm just brainstorming here. I don't need it currently, no. It
doesn't seem like significant overhead compared to the locking
and copying though?
Yes, timestamps aren't cheap to determine (nor free to store, but that's a
lesser issue).
Greetings,
Andres Freund
Hi,
On 2022-08-10 15:48:15 -0400, Greg Stark wrote:
One thing that's bugging me is that the names we use for these stats
are *all* over the place.
Yes. I had a huge issue with this when polishing the patch. And Horiguchi-san
did as well. I had to limit the amount of cleanup done to make it feasible to
get anything committed. I think it's a bit less bad than before, but by no
means good.
* The pg_stat_bgwriter view returns data from two different fixed
entries, the checkpointer and the bgwriter, is there a reason those
are kept separately but then reported as if they're one thing?
Historical raisins. Checkpointer and bgwriter used to be one thing, but they
aren't anymore.
Greetings,
Andres Freund
Hi,
On 8/10/22 11:25 PM, Greg Stark wrote:
On Wed, 10 Aug 2022 at 04:05, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
Hi,
On 8/9/22 6:00 PM, Greg Stark wrote:
On Tue, 9 Aug 2022 at 06:19, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
What do you think about adding a function in core PG to provide such
functionality? (means being able to retrieve all the stats (+ eventually
add some filtering) without the need to connect to each database).

I'm working on it myself too. I'll post a patch for discussion in a bit.
Great! Thank you!
So I was adding the code to pgstat.c because I had thought there were
some data types I needed and/or static functions I needed. However you
and Andres encouraged me to check again now. And indeed I was able,
after fixing a couple things, to make the code work entirely
externally.
Nice!
Though I still think having an SQL API in core could be useful too.
As Andres was not -1 about that idea (as it should be low cost to add
and maintain) as long as somebody cares enough to write something: then
I'll give it a try and submit a patch for it.
This is definitely not polished and there's a couple obvious things
missing. But at the risk of embarrassment I've attached my WIP. Please
be gentle :) I'll post the github link in a bit when I've fixed up
some meta info.
Thanks! I will have a look at it on github (once you share the link).
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
On Thu, 11 Aug 2022 at 02:11, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
As Andres was not -1 about that idea (as it should be low cost to add
and maintain) as long as somebody cares enough to write something: then
I'll give it a try and submit a patch for it.
I agree it would be a useful feature. I think there may be things to
talk about here though.
1) Are you planning to go through the local hash table and
LocalSnapshot and obey the consistency mode? I was thinking a flag
passed to build_snapshot to request global mode might be sufficient
instead of a completely separate function.
2) When I did the function attached above I tried to avoid returning
the whole set and make it possible to process them as they arrive. I
actually was hoping to get to the point where I could start shipping
out network data as they arrive and not even buffer up the response,
but I think I need to be careful about hash table locking then.
3) The key difference here is that we're returning whatever stats are
in the hash table rather than using the catalog to drive a list of id
numbers to look up. I guess the API should make it clear this is what
is being returned -- on that note I wonder if I've done something
wrong because I noted a few records with InvalidOid where I didn't
expect it.
4) I'm currently looping over the hash table returning the records all
intermixed. Some users will probably want to do things like "return
all Relation records for all databases" or "return all Index records
for database id xxx". So some form of filtering may be best or perhaps
a way to retrieve just the keys so they can then be looked up one by
one (through the local cache?).
5) On that note I'm not clear how the local cache will interact with
these cross-database lookups. That should probably be documented...
--
greg
Hi,
On 8/15/22 4:46 PM, Greg Stark wrote:
On Thu, 11 Aug 2022 at 02:11, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
As Andres was not -1 about that idea (as it should be low cost to add
and maintain) as long as somebody cares enough to write something: then
I'll give it a try and submit a patch for it.

I agree it would be a useful feature. I think there may be things to
talk about here though.

1) Are you planning to go through the local hash table and
LocalSnapshot and obey the consistency mode? I was thinking a flag
passed to build_snapshot to request global mode might be sufficient
instead of a completely separate function.
I think the new API should behave as PGSTAT_FETCH_CONSISTENCY_NONE (as
querying from all the databases increases the risk of having to deal
with a "large" number of objects).
I have in mind to do something along those lines (still need to add some
filtering, extra checks on the permissions, ...):
+       dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
+       while ((p = dshash_seq_next(&hstat)) != NULL)
+       {
+               Datum           values[PG_STAT_GET_ALL_TABLES_STATS_COLS];
+               bool            nulls[PG_STAT_GET_ALL_TABLES_STATS_COLS];
+               PgStat_StatTabEntry *tabentry = NULL;
+
+               MemSet(values, 0, sizeof(values));
+               MemSet(nulls, false, sizeof(nulls));
+
+               if (p->key.kind != PGSTAT_KIND_RELATION)
+                       continue;
+               if (p->dropped)
+                       continue;
+
+               stats_data = dsa_get_address(pgStatLocal.dsa, p->body);
+               LWLockAcquire(&stats_data->lock, LW_SHARED);
+               tabentry = pgstat_get_entry_data(PGSTAT_KIND_RELATION, stats_data);
+
+               values[0] = ObjectIdGetDatum(p->key.dboid);
+               values[1] = ObjectIdGetDatum(p->key.objoid);
+               values[2] = Int64GetDatum(tabentry->tuples_inserted);
.
.
+               tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
+               LWLockRelease(&stats_data->lock);
+       }
+       dshash_seq_term(&hstat);
What do you think?
2) When I did the function attached above I tried to avoid returning
the whole set and make it possible to process them as they arrive.
Is that the way it has been done? (I did not look at your function yet)
I
actually was hoping to get to the point where I could start shipping
out network data as they arrive and not even buffer up the response,
but I think I need to be careful about hash table locking then.
If using dshash_seq_next() the already returned elements are locked.
But I guess you would like to unlock them (if you are able to process
them as they arrive)?
3) The key difference here is that we're returning whatever stats are
in the hash table rather than using the catalog to drive a list of id
numbers to look up.
Right.
I guess the API should make it clear this is what
is being returned
Right. I think we'll end up with a set of relation ids (not their names)
and their associated stats.
-- on that note I wonder if I've done something
wrong because I noted a few records with InvalidOid where I didn't
expect it.
It looks like InvalidOid for the dbid means that the entry is for a
shared relation.
Where did you see them (while not expecting them)?
4) I'm currently looping over the hash table returning the records all
intermixed. Some users will probably want to do things like "return
all Relation records for all databases" or "return all Index records
for database id xxx". So some form of filtering may be best or perhaps
a way to retrieve just the keys so they can then be looked up one by
one (through the local cache?).
I have in mind to add some filtering on the dbid (I think it could be
useful for a monitoring tool with a persistent connection to one database
that wants to pull the stats database per database).
I don't think a lookup through the local cache will work if the
entry/key is related to a database other than the one the API is launched from.
5) On that note I'm not clear how the local cache will interact with
these cross-database lookups. That should probably be documented...
Yeah, I don't think that would work (if by local cache you mean what is
in the relcache).
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
On Tue, 16 Aug 2022 at 08:49, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
+ if (p->key.kind != PGSTAT_KIND_RELATION)
+     continue;
Hm. So presumably this needs to be extended. Either to let the caller
decide which types of stats to return or to somehow return all the
stats intermixed. In my monitoring code I did the latter because I
didn't think going through the hash table repeatedly would be very
efficient. But it's definitely a pretty awkward API since I need a
switch statement that explicitly lists each case and casts the result.
2) When I did the function attached above I tried to avoid returning
the whole set and make it possible to process them as they arrive.

Is that the way it has been done? (I did not look at your function yet)
I did it with callbacks. It was quick and easy and convenient for my
use case. But in general I don't really like callbacks and would think
some kind of iterator style api would be nicer.
I am handling the stats entries as they turn up. I'm constructing the
text output for each in a callback and buffering up the whole http
response in a string buffer.
I think that's ok but if I wanted to avoid buffering it up and do
network i/o then I would think the thing to do would be to build the
list of entry keys and then loop over that list doing a hash lookup
for each one and generating the response for each out and writing it
to the network. That way there wouldn't be anything locked, not even
the hash table, while doing network i/o. It would mean a lot of
traffic on the hash table though.
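Concretely, the per-key loop might look like this (a sketch only, assuming the pgstat_internal.h API and relation entries for simplicity; the keys list and send_metric_over_network are hypothetical):

	ListCell   *lc;

	foreach(lc, keys)
	{
		PgStat_HashKey *key = (PgStat_HashKey *) lfirst(lc);
		PgStat_EntryRef *ref;
		PgStat_StatTabEntry tabcopy;

		/* pin the entry via the backend-local reference cache */
		ref = pgstat_get_entry_ref(key->kind, key->dboid, key->objoid,
								   false /* don't create */ , NULL);
		if (ref == NULL)
			continue;			/* entry vanished since the key scan */

		/* copy the stats out under the entry lock, then release it */
		LWLockAcquire(&ref->shared_stats->lock, LW_SHARED);
		memcpy(&tabcopy,
			   pgstat_get_entry_data(key->kind, ref->shared_stats),
			   sizeof(tabcopy));
		LWLockRelease(&ref->shared_stats->lock);

		/* nothing is locked while doing network i/o on the copy */
		send_metric_over_network(key, &tabcopy);
	}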
-- on that note I wonder if I've done something
wrong because I noted a few records with InvalidOid where I didn't
expect it.

It looks like InvalidOid for the dbid means that the entry is for a
shared relation.
Ah yes. I had actually found that but forgotten it.
There's also a database entry with dboid=InvalidOid which is
apparently where background workers with no database attached report
stats.
I have in mind to add some filtering on the dbid (I think it could be
useful for a monitoring tool with a persistent connection to one database
that wants to pull the stats database per database).

I don't think a lookup through the local cache will work if the
entry/key is related to a database other than the one the API is launched from.
Isn't there also a local hash table used to find the entries to reduce
traffic on the shared hash table? Even if you don't take a snapshot
does it get entered there? There are definitely still parts of this
I'm working on a pretty vague understanding of :/
--
greg
Hi,
On 2022-08-17 15:46:42 -0400, Greg Stark wrote:
Isn't there also a local hash table used to find the entries to reduce
traffic on the shared hash table? Even if you don't take a snapshot
does it get entered there? There are definitely still parts of this
I'm working on a pretty vague understanding of :/
Yes, there is. But it's more about code that generates stats, rather than
reporting functions. While there's backend local pending stats we need to have
a refcount on the shared stats item so that the stats item can't be dropped
and then revived when those local stats are flushed.
Relevant comments from pgstat.c:
* To avoid contention on the shared hashtable, each backend has a
* backend-local hashtable (pgStatEntryRefHash) in front of the shared
* hashtable, containing references (PgStat_EntryRef) to shared hashtable
* entries. The shared hashtable only needs to be accessed when no prior
* reference is found in the local hashtable. Besides pointing to the
* shared hashtable entry (PgStatShared_HashEntry) PgStat_EntryRef also
* contains a pointer to the shared statistics data, as a process-local
* address, to reduce access costs.
*
* The names for structs stored in shared memory are prefixed with
* PgStatShared instead of PgStat. Each stats entry in shared memory is
* protected by a dedicated lwlock.
*
* Most stats updates are first accumulated locally in each process as pending
* entries, then later flushed to shared memory (just after commit, or by
* idle-timeout). This practically eliminates contention on individual stats
* entries. For most kinds of variable-numbered pending stats data is stored
* in PgStat_EntryRef->pending. All entries with pending data are in the
* pgStatPending list. Pending statistics updates are flushed out by
* pgstat_report_stat().
*
pgstat_internal.h has more details about the refcount aspect:
* Per-object statistics are stored in the "shared stats" hashtable. That
* table's entries (PgStatShared_HashEntry) contain a pointer to the actual stats
* data for the object (the size of the stats data varies depending on the
* kind of stats). The table is keyed by PgStat_HashKey.
*
* Once a backend has a reference to a shared stats entry, it increments the
* entry's refcount. Even after stats data is dropped (e.g., due to a DROP
* TABLE), the entry itself can only be deleted once all references have been
* released.
Greetings,
Andres Freund
Hi,
On 8/17/22 9:46 PM, Greg Stark wrote:
On Tue, 16 Aug 2022 at 08:49, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
+ if (p->key.kind != PGSTAT_KIND_RELATION)
+     continue;

Hm. So presumably this needs to be extended. Either to let the caller
decide which types of stats to return or to somehow return all the
stats intermixed. In my monitoring code I did the latter because I
didn't think going through the hash table repeatedly would be very
efficient. But it's definitely a pretty awkward API since I need a
switch statement that explicitly lists each case and casts the result.
What I had in mind is to provide an API to retrieve stats for those that
would need to connect to each database individually otherwise.
That's why I focused on PGSTAT_KIND_RELATION that has
PgStat_KindInfo.accessed_across_databases set to false.
I think that another candidate could also be PGSTAT_KIND_FUNCTION.
I think that's the 2 cases where a monitoring tool connected to a single
database is currently missing stats related to databases it is not
connected to.
So what about 2 functions? one to get the stats for the relations and
one to get the stats for the functions? (And maybe a view on top of each
of them?)
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
Hi,
On 8/18/22 1:30 AM, Andres Freund wrote:
Hi,
On 2022-08-17 15:46:42 -0400, Greg Stark wrote:
Isn't there also a local hash table used to find the entries to reduce
traffic on the shared hash table? Even if you don't take a snapshot
does it get entered there? There are definitely still parts of this
I'm working on a pretty vague understanding of :/

Yes, there is. But it's more about code that generates stats, rather than
reporting functions. While there's backend local pending stats we need to have
a refcount on the shared stats item so that the stats item can't be dropped
and then revived when those local stats are flushed.
What do you think about something along those lines for the reporting
part only?
Datum
pgstat_fetch_all_tables_stats(PG_FUNCTION_ARGS)
{
    int         dbid = PG_ARGISNULL(0) ? -1 : (int) PG_GETARG_OID(0);
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    dshash_seq_status hstat;
    PgStatShared_HashEntry *p;
    PgStatShared_Common *stats_data;

    /* Only members of pg_read_all_stats can use this function */
    if (!has_privs_of_role(GetUserId(), ROLE_PG_READ_ALL_STATS))
    {
        aclcheck_error(ACLCHECK_NO_PRIV, OBJECT_FUNCTION,
                       "pgstat_fetch_all_tables_stats");
    }

    pgstat_assert_is_up();
    SetSingleFuncCall(fcinfo, 0);

    dshash_seq_init(&hstat, pgStatLocal.shared_hash, false);
    while ((p = dshash_seq_next(&hstat)) != NULL)
    {
        Datum       values[PG_STAT_GET_ALL_TABLES_STATS_COLS];
        bool        nulls[PG_STAT_GET_ALL_TABLES_STATS_COLS];
        PgStat_StatTabEntry *tabentry = NULL;

        MemSet(values, 0, sizeof(values));
        MemSet(nulls, false, sizeof(nulls));

        /* If looking for a specific dbid, ignore all the others */
        if (dbid != -1 && p->key.dboid != (Oid) dbid)
            continue;

        /* If the entry is not of kind relation then ignore it */
        if (p->key.kind != PGSTAT_KIND_RELATION)
            continue;

        /* If the entry has been dropped then ignore it */
        if (p->dropped)
            continue;

        stats_data = dsa_get_address(pgStatLocal.dsa, p->body);
        LWLockAcquire(&stats_data->lock, LW_SHARED);
        tabentry = pgstat_get_entry_data(p->key.kind, stats_data);

        values[0] = ObjectIdGetDatum(p->key.dboid);
        values[1] = ObjectIdGetDatum(p->key.objoid);
        values[2] = Int64GetDatum(tabentry->tuples_inserted);
        values[3] = Int64GetDatum(tabentry->tuples_updated);
        values[4] = Int64GetDatum(tabentry->tuples_deleted);
        .
        .
        .
        tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
        LWLockRelease(&stats_data->lock);
    }
    dshash_seq_term(&hstat);

    return (Datum) 0;
}
I also tried to make use of pgstat_get_entry_ref() but went into a
failed assertion: pgstat_get_entry_ref -> dshash_find ->
ASSERT_NO_PARTITION_LOCKS_HELD_BY_ME(hash_table), due to the lock acquired by
dshash_seq_next().
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
On Thu, 18 Aug 2022 at 02:27, Drouvot, Bertrand <bdrouvot@amazon.com> wrote:
What I had in mind is to provide an API to retrieve stats for those that
would need to connect to each database individually otherwise.

That's why I focused on PGSTAT_KIND_RELATION that has
PgStat_KindInfo.accessed_across_databases set to false.

I think that another candidate could also be PGSTAT_KIND_FUNCTION.
And indexes of course. It's a bit frustrating since without the
catalog you won't know what table the index actually is for... But
they're pretty important stats.
On that note though... What do you think about having the capability
to add other stats kinds to the stats infrastructure? It would make a
lot of sense for pg_stat_statements to add its entries here instead of
having to reimplement a lot of the same magic. And I have in mind an
extension that allows adding other stats and it would be nice to avoid
having to reimplement any of this.
To do that I guess more of the code needs to be moved to be table
driven from the kind structs either with callbacks or with other meta
data. So the kind record could contain tupledesc and the code to
construct the returned tuple so that these functions could return any
custom entry as well as the standard entries.
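As a purely hypothetical sketch of that shape -- none of these reporting fields exist in today's PgStat_KindInfo -- the per-kind struct could grow a tupledesc callback plus a tuple-construction callback:

typedef struct PgStat_KindInfo
{
	const char *const name;
	/* ... existing fields: sizes, offsets, flush callbacks, etc ... */

	/* hypothetical additions for generic SQL-level reporting: */
	TupleDesc	(*report_tupledesc_cb) (void);
	void		(*report_tuple_cb) (const PgStat_HashKey *key,
									const PgStatShared_Common *stats,
									Datum *values, bool *nulls);
} PgStat_KindInfo;

A single set-returning function could then walk the shared hash and dispatch on each entry's kind, custom kinds included.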
--
greg
Hi,
On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
And indexes of course. It's a bit frustrating since without the
catalog you won't know what table the index actually is for... But
they're pretty important stats.
FWIW, I think we should split relation stats into table and index
stats. Historically it'd have added a lot of complexity to separate the two,
but I don't think that's the case anymore. And we waste space for index stats
by having lots of table specific fields.
On that note though... What do you think about having the capability
to add other stats kinds to the stats infrastructure?
Getting closer to that was one of my goals working on the shared memory stats
stuff.
It would make a lot of sense for pg_stat_statements to add its entries here
instead of having to reimplement a lot of the same magic.
Yes, we should move pg_stat_statements over.
It's pretty easy to get massive contention on stats entries with
pg_stat_statements, because it doesn't have support for "batching" updates to
shared stats. And reimplementing the same logic in pg_stat_statements.c
doesn't make sense.
And the set of normalized queries could probably be stored in DSA as well - the
file based thing we have right now is problematic.
To do that I guess more of the code needs to be moved to be table
driven from the kind structs either with callbacks or with other meta
data.
Pretty much all of it already is. The only substantial missing bit is
reading/writing of stats files, but that should be pretty easy. And of course
making the callback array extensible.
So the kind record could contain tupledesc and the code to construct the
returned tuple so that these functions could return any custom entry as well
as the standard entries.
I don't see how this would work well - we don't have functions returning
variable kinds of tuples. And what would convert a struct to a tuple?
Nor do I think it's needed - if you have an extension providing a new stats
kind it can also provide accessors.
Greetings,
Andres Freund
Hi,
On 8/18/22 9:51 PM, Andres Freund wrote:
Hi,
On 2022-08-18 15:26:31 -0400, Greg Stark wrote:
And indexes of course. It's a bit frustrating since without the
catalog you won't know what table the index actually is for... But
they're pretty important stats.

FWIW, I think we should split relation stats into table and index
stats. Historically it'd have added a lot of complexity to separate the two,
but I don't think that's the case anymore. And we waste space for index stats
by having lots of table specific fields.
It seems to me that we should work on that first then, what do you
think? (If so I can try to have a look at it).
And once done then resume the work to provide the APIs to get all
tables/indexes from all the databases.
That way we'll be able to provide one API for the tables and one for the
indexes (instead of one API for both like my current POC is doing).
On that note though... What do you think about having the capability
to add other stats kinds to the stats infrastructure?
I think that's a good idea and that would be great to have.
Getting closer to that was one of my goals working on the shared memory stats
stuff.

It would make a lot of sense for pg_stat_statements to add its entries here
instead of having to reimplement a lot of the same magic.

Yes, we should move pg_stat_statements over.
It's pretty easy to get massive contention on stats entries with
pg_stat_statements, because it doesn't have support for "batching" updates to
shared stats. And reimplementing the same logic in pg_stat_statements.c
doesn't make sense.

And the set of normalized queries could probably be stored in DSA as well - the
file based thing we have right now is problematic.

To do that I guess more of the code needs to be moved to be table
driven from the kind structs either with callbacks or with other meta
data.

Pretty much all of it already is. The only substantial missing bit is
reading/writing of stats files, but that should be pretty easy. And of course
making the callback array extensible.

So the kind record could contain tupledesc and the code to construct the
returned tuple so that these functions could return any custom entry as well
as the standard entries.

I don't see how this would work well - we don't have functions returning
variable kinds of tuples. And what would convert a struct to a tuple?

Nor do I think it's needed - if you have an extension providing a new stats
kind it can also provide accessors.
I think the same (the extension should be able to do that).
I really like the idea of being able to provide new stats kinds.
Regards,
--
Bertrand Drouvot
Amazon Web Services: https://aws.amazon.com
Hi!
Found a place in the code of this patch that is unclear to me:
https://github.com/postgres/postgres/blob/1acf10549e64c6a52ced570d712fcba1a2f5d1ec/src/backend/utils/activity/pgstat.c#L1658
Owing to the assert(), the next if() should never be executed, but the comment above says the opposite.
Is this assert really needed here? And if so, for what?
I would be glad for clarification.
With the best regards,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On 2024-12-03 13:37:48 +0300, Anton A. Melnikov wrote:
Found a place in the code of this patch that is unclear to me:
https://github.com/postgres/postgres/blob/1acf10549e64c6a52ced570d712fcba1a2f5d1ec/src/backend/utils/activity/pgstat.c#L1658

Owing to the assert(), the next if() should never be executed, but the comment above says the opposite.
Is this assert really needed here? And if so, for what?
It's code that should be unreachable. But in case it is encountered in a
production scenario, it's not worth taking down the server for it.
Greetings,
Andres Freund
On 03.12.2024 18:07, Andres Freund wrote:
Hi,
On 2024-12-03 13:37:48 +0300, Anton A. Melnikov wrote:
Found a place in the code of this patch that is unclear to me:
https://github.com/postgres/postgres/blob/1acf10549e64c6a52ced570d712fcba1a2f5d1ec/src/backend/utils/activity/pgstat.c#L1658

Owing to the assert(), the next if() should never be executed, but the comment above says the opposite.
Is this assert really needed here? And if so, for what?

It's code that should be unreachable. But in case it is encountered in a
production scenario, it's not worth taking down the server for it.
Thanks! It's clear.
Although there is a test case that leads to this assertion being triggered.
But I doubt that anything needs to be fixed.
I described this case in what seems to me a suitable thread:
/messages/by-id/56bf8ff9-dd8c-47b2-872a-748ede82af99@postgrespro.ru
I would appreciate it if you could take a look at it.
With the best wishes,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
On Wed, Dec 04, 2024 at 04:00:53AM +0300, Anton A. Melnikov wrote:
On 03.12.2024 18:07, Andres Freund wrote:
Hi,
On 2024-12-03 13:37:48 +0300, Anton A. Melnikov wrote:
Found a place in the code of this patch that is unclear to me:
https://github.com/postgres/postgres/blob/1acf10549e64c6a52ced570d712fcba1a2f5d1ec/src/backend/utils/activity/pgstat.c#L1658

Owing to the assert(), the next if() should never be executed, but the comment above says the opposite.
Is this assert really needed here? And if so, for what?

It's code that should be unreachable. But in case it is encountered in a
production scenario, it's not worth taking down the server for it.

Thanks! It's clear.
Although there is a test case that leads to this assertion being triggered.
But I doubt that anything needs to be fixed.
I described this case in what seems to me a suitable thread:
/messages/by-id/56bf8ff9-dd8c-47b2-872a-748ede82af99@postgrespro.ru
Thanks! I've the feeling that something has to be fixed, see my comments in
[1]. It might be that the failed assertion does not handle a "valid" scenario.
[1]: /messages/by-id/Z1BzI/eMTCOKA+j6@ip-10-97-1-34.eu-west-3.compute.internal
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 04, 2024 at 03:24:55PM +0000, Bertrand Drouvot wrote:
Thanks! I've the feeling that something has to be fixed, see my comments in
[1]. It might be that the failed assertion does not handle a "valid" scenario.

[1]: /messages/by-id/Z1BzI/eMTCOKA+j6@ip-10-97-1-34.eu-west-3.compute.internal
It's really a case that should never be reached because it points to
an inconsistency in the interactions between the local entry cache in
a process and the central dshash it attempts to point to, so I don't
think that there is anything to change here. As Andres has mentioned,
it has a lot of value by acting as a safety guard in assert builds
without being annoying for production deployments.
--
Michael
Hi,
On Thu, Dec 05, 2024 at 02:43:43PM +0900, Michael Paquier wrote:
On Wed, Dec 04, 2024 at 03:24:55PM +0000, Bertrand Drouvot wrote:
Thanks! I've the feeling that something has to be fixed, see my comments in
[1]. It might be that the failed assertion does not handle a "valid" scenario.

[1]: /messages/by-id/Z1BzI/eMTCOKA+j6@ip-10-97-1-34.eu-west-3.compute.internal
It's really a case that should never be reached because it points to
an inconsistency in the interactions between the local entry cache in
a process and the central dshash it attempts to point to, so I don't
think that there is anything to change here.
Agree.
As Andres has mentioned,
it has a lot of value by acting as a safety guard in assert builds
without being annoying for production deployments.
Fully agree.
That said, I think it's worth updating the comment a bit (like in the
attached?) as I think that answers a legitimate question someone could have while
reading this code.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
On Thu, Dec 05, 2024 at 07:37:27AM +0000, Bertrand Drouvot wrote:
That said, I think it's worth updating the comment a bit (like in the
attached?) as I think that answers a legitimate question someone could have while
reading this code.

- /* we may have some "dropped" entries not yet removed, skip them */
+ /*
+  * We may have some "dropped" entries not yet removed, skip them as
+  * it's not worth taking down the server for this.
+  */
Perhaps this should provide some details, like the fact that we don't
expect the server to still have references to entries that are dropped
at shutdown when writing the stats file as all the backends and/or
auxiliary processes should have done this cleanup before they are
gone.
--
Michael
Hi,
On Thu, Dec 05, 2024 at 05:13:13PM +0900, Michael Paquier wrote:
On Thu, Dec 05, 2024 at 07:37:27AM +0000, Bertrand Drouvot wrote:
That said, I think it's worth updating the comment a bit (like in the
attached?) as I think that answers a legitimate question someone could have while
reading this code.

- /* we may have some "dropped" entries not yet removed, skip them */
+ /*
+  * We may have some "dropped" entries not yet removed, skip them as
+  * it's not worth taking down the server for this.
+  */

Perhaps this should provide some details, like the fact that we don't
expect the server to still have references to entries that are dropped
at shutdown when writing the stats file as all the backends and/or
auxiliary processes should have done this cleanup before they are
gone.
Okay, attached is a more elaborate comment.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
Hi!
On 04.12.2024 18:24, Bertrand Drouvot wrote:
Thanks! I've the feeling that something has to be fixed, see my comments in
[1]. It might be that the failed assertion does not handle a "valid" scenario.

[1]: /messages/by-id/Z1BzI/eMTCOKA+j6@ip-10-97-1-34.eu-west-3.compute.internal
On 05.12.2024 08:43, Michael Paquier wrote:
It's really a case that should never be reached because it points to
an inconsistency in the interactions between the local entry cache in
a process and the central dshash it attempts to point to, so I don't
think that there is anything to change here. As Andres has mentioned,
it has a lot of value by acting as a safety guard in assert builds
without being annoying for production deployments.
Thanks a lot for the detailed clarification!
Everything here became clear to me.
On 05.12.2024 11:13, Michael Paquier wrote:
On Thu, Dec 05, 2024 at 07:37:27AM +0000, Bertrand Drouvot wrote:
That said, I think it's worth updating the comment a bit (like in the
attached?) as I think that answers a legitimate question someone could have while
reading this code.

Perhaps this should provide some details, like the fact that we don't
expect the server to still have references to entries that are dropped
at shutdown when writing the stats file as all the backends and/or
auxiliary processes should have done this cleanup before they are
gone.
Completely agree that the original comment needs to be revised,
since it implies that it is normal for deleted entries to be here,
but it is not the case.
On 05.12.2024 17:13, Bertrand Drouvot wrote:
Okay, attached is a more elaborate comment.
Looks good to me. Detailed and clear.
It will help avoid unnecessary questions when reading this code.
Maybe it's worth adding a warning as well,
similar to the one a few lines below in the code?
Like in the patch attached?
With the best regards,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments:
On Sat, Dec 07, 2024 at 12:31:46PM +0300, Anton A. Melnikov wrote:
Completely agree that the original comment needs to be revised,
since it implies that it is normal for deleted entries to be here,
but it is not the case.
Yep, so applied v2-0001 to document that, and backpatched it as it is
kind of important to know about.
Maybe it's worth adding a warning as well,
similar to the one a few lines below in the code?
Assert(!ps->dropped);
if (ps->dropped)
+ {
+ PgStat_HashKey key = ps->key;
+ elog(WARNING, "found non-deleted stats entry %u/%u/%llu"
+ "at server shutdown",
+ key.kind, key.dboid,
+ (unsigned long long) key.objid);
continue;
+ }
/*
* This discards data related to custom stats kinds that are unknown
Not sure how to feel about this suggestion, though. This would
produce a warning when building without assertions, but the assertion
would likely give us more information with a trace during development,
so..
--
Michael
Hi,
On Mon, Dec 09, 2024 at 02:39:58PM +0900, Michael Paquier wrote:
On Sat, Dec 07, 2024 at 12:31:46PM +0300, Anton A. Melnikov wrote:
Completely agree that the original comment needs to be revised,
since it implies that it is normal for deleted entries to be here,
but it is not the case.

Yep, so applied v2-0001 to document that, and backpatched it as it is
kind of important to know about.

Maybe it's worth adding a warning as well,
similar to the one a few lines below in the code?

Assert(!ps->dropped);
if (ps->dropped)
+ {
+ PgStat_HashKey key = ps->key;
+ elog(WARNING, "found non-deleted stats entry %u/%u/%llu"
+ "at server shutdown",
There is a missing space. I think that should be " at server..." or "...%llu ".
+ key.kind, key.dboid,
+ (unsigned long long) key.objid);
continue;
+ }

/*
 * This discards data related to custom stats kinds that are unknown

Not sure how to feel about this suggestion, though. This would
produce a warning when building without assertions, but the assertion
would likely give us more information with a trace during development,
so..
Right. OTOH I think that could help the tap test added in da99fedf8c to not
rely on an assert-enabled build (the tap test could "simply" check for the
WARNING in the logfile instead).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Dec 09, 2024 at 08:03:54AM +0000, Bertrand Drouvot wrote:
Right. OTOH I think that could help the tap test added in da99fedf8c to not
rely on an assert-enabled build (the tap test could "simply" check for the
WARNING in the logfile instead).
That's true. Still, the coverage that we have is also enough for
assert builds, which is what the test is going to run with most of the
time anyway.
--
Michael
Hi,
On Tue, Dec 10, 2024 at 09:54:36AM +0900, Michael Paquier wrote:
On Mon, Dec 09, 2024 at 08:03:54AM +0000, Bertrand Drouvot wrote:
Right. OTOH I think that could help the tap test added in da99fedf8c to not
rely on an assert-enabled build (the tap test could "simply" check for the
WARNING in the logfile instead).

That's true. Still, the coverage that we have is also enough for
assert builds, which is what the test is going to run with most of the
time anyway.
Yeah, that's fine by me and I don't see the added value of the WARNING then.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi!
On 09.12.2024 11:03, Bertrand Drouvot wrote:
There is a missing space. I think that should be " at server..." or "...%llu ".
Thanks for pointing this out. Elsewhere in the code the elog messages are all on one line,
regardless of their length. I did the same in v3.
On 10.12.2024 09:42, Bertrand Drouvot wrote:
On Tue, Dec 10, 2024 at 09:54:36AM +0900, Michael Paquier wrote:
On Mon, Dec 09, 2024 at 08:03:54AM +0000, Bertrand Drouvot wrote:
Right. OTOH I think that could help the tap test added in da99fedf8c to not
rely on an assert-enabled build (the tap test could "simply" check for the
WARNING in the logfile instead).

That's true. Still, the coverage that we have is also enough for
assert builds, which is what the test is going to run with most of the
time anyway.

Yeah, that's fine by me and I don't see the added value of the WARNING then.
Agreed that this WARNING has no additional value for testing purposes
at pgfarm or CI. Assert is better.
My logic was different.
It's clear that during normal server operation this code should be unreachable.
But we admit that in production deployments it can be executed because of some bug
that is still unknown to us. As it is now, the server simply skips it,
and no one will ever know that this happened.
But if there is a warning here, the information will remain in the server logs;
we can find out about it, try to reproduce similar behavior
in a testing environment, and probably detect a hidden bug like in [1].
Thanks a lot for fixing this!
With the best regards,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[1]: /messages/by-id/56bf8ff9-dd8c-47b2-872a-748ede82af99@postgrespro.ru