shared-memory based stats collector
Hello.
This is intended to provide more stats, as discussed in the following thread:
/messages/by-id/20171010.192616.108347483.horiguchi.kyotaro@lab.ntt.co.jp
The major obstacle to having more items in the statistics
collector views comes from the mechanism used to share the values
among backends. It currently uses a file: the stats collector
writes the file when triggered by backends, then backends read the
written file. A larger file makes the latency longer, and we don't
have spare bandwidth for additional statistics items.
Nowadays PostgreSQL has a dynamic shared hash table (dshash), so we
can use it as the main storage of statistics and share the data
without such stress.
A previously posted PoC tried to use a "locally copied" dshash, but
it didn't look good, so I steered in a different direction.
With this patch, dshash can create a local copy backed by dynahash.
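The local-copy idea might be pictured roughly like this. This is a deliberately simplified, hypothetical sketch, not the real dshash/dynahash code: the hash layout, the function names (stats_insert, stats_lookup, stats_local_snapshot) and the single seq_scan counter are all invented for illustration.

```c
/* Simplified sketch: entries in a shared hash table are copied once into
 * backend-local memory, so later reads see a stable view regardless of
 * concurrent updates on the shared side.  All names are invented; this is
 * NOT the actual dshash/dynahash API. */
#include <stdlib.h>

#define NBUCKETS 16

typedef struct StatsEntry {
    int key;                  /* e.g. a table OID */
    long seq_scan;            /* one counter, for brevity */
    struct StatsEntry *next;
} StatsEntry;

typedef struct StatsHash {
    StatsEntry *buckets[NBUCKETS];
} StatsHash;

static unsigned bucket_of(int key) { return (unsigned) key % NBUCKETS; }

void stats_insert(StatsHash *h, int key, long seq_scan)
{
    StatsEntry *e = malloc(sizeof(StatsEntry));
    e->key = key;
    e->seq_scan = seq_scan;
    e->next = h->buckets[bucket_of(key)];
    h->buckets[bucket_of(key)] = e;
}

StatsEntry *stats_lookup(StatsHash *h, int key)
{
    for (StatsEntry *e = h->buckets[bucket_of(key)]; e; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}

/* Make a backend-local deep copy of the shared table.  In the real patch the
 * source would be a dshash read under partition locks and the destination a
 * dynahash in backend-local memory. */
StatsHash *stats_local_snapshot(const StatsHash *shared)
{
    StatsHash *local = calloc(1, sizeof(StatsHash));
    for (int i = 0; i < NBUCKETS; i++)
        for (StatsEntry *e = shared->buckets[i]; e; e = e->next)
            stats_insert(local, e->key, e->seq_scan);
    return local;
}
```

The point of the copy is isolation: once the snapshot is taken, updates to the shared side no longer affect what the backend sees.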
This patchset consists of three files.
v1-0001-Give-dshash-ability-to-make-a-local-snapshot.patch
adds the ability for dshash to make a local copy backed by dynahash.
v1-0002-Change-stats-collector-to-an-axiliary-process.patch
changes the stats collector into an auxiliary process so that it
can attach to dynamic shared memory.
v1-0003-dshash-based-stats-collector.patch
implements shared-memory based stats collector.
I'll put more detailed explanation later.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v1-0001-Give-dshash-ability-to-make-a-local-snapshot.patch (text/x-patch; +184 -2)
v1-0002-Change-stats-collector-to-an-axiliary-process.patch (text/x-patch; +70 -30)
v1-0003-dshash-based-stats-collector.patch (text/x-patch; +547 -860)
On Fri, Jun 29, 2018 at 4:34 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Nowadays PostgreSQL has dynamic shared hash (dshash) so we can
use this as the main storage of statistics. We can share data
without a stress using this.A PoC previously posted tried to use "locally copied" dshash but
it doesn't looks fine so I steered to different direction.With this patch dshash can create a local copy based on dynhash.
Copying the whole hash table kinds of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello. Thanks for the comment.
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
On Fri, Jun 29, 2018 at 4:34 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Nowadays PostgreSQL has a dynamic shared hash table (dshash), so we
can use it as the main storage of statistics and share the data
without such stress.
A previously posted PoC tried to use a "locally copied" dshash, but
it didn't look good, so I steered in a different direction.
With this patch, dshash can create a local copy backed by dynahash.
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.
In addition, vacuum doesn't require even repeatable-read
consistency, so we don't need to cache the entries at all in that
case. backend_get_tab_entry now returns an isolated (that is,
not-stored-in-hash) palloc'ed copy without making a local copy.
As a result, this version behaves as follows.
- The stats collector stores the results in shared memory.
- In backends, a cache is created only for requested objects and
lasts for the transaction.
- Vacuum reads the shared stats directly and doesn't create a
local copy.
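The per-transaction caching behaviour just described could be sketched roughly like this. It is a hypothetical illustration only: backend_get_tab_entry does exist in the patch, but its real signature, the arrays standing in for shared memory, and the field names are all invented here.

```c
/* Sketch of per-transaction repeatable read of stats entries: the first read
 * of an object within a transaction copies the entry out of "shared memory";
 * repeated reads return the cached copy, and the cache is dropped at end of
 * transaction.  Names and layout are invented; not the actual patch code. */
#include <stddef.h>

#define MAX_CACHED 64

typedef struct TabStats { int key; long seq_scan; } TabStats;

TabStats shared_stats[MAX_CACHED];   /* stand-in for the shared dshash */
int nshared;

TabStats tx_cache[MAX_CACHED];       /* transaction-lifetime cache */
int ncached;

/* First access in a transaction copies from shared memory; later accesses
 * return the cached copy even if the shared value has moved on. */
const TabStats *backend_get_tab_entry(int key)
{
    for (int i = 0; i < ncached; i++)
        if (tx_cache[i].key == key)
            return &tx_cache[i];        /* repeatable read within the tx */
    for (int i = 0; i < nshared; i++)
        if (shared_stats[i].key == key) {
            tx_cache[ncached] = shared_stats[i];
            return &tx_cache[ncached++];
        }
    return NULL;
}

/* The cache lasts exactly one transaction. */
void at_transaction_end(void) { ncached = 0; }
```

Note that this gives repeatable read per entry, not a snapshot covering all objects, which is exactly the relaxation being discussed.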
The non-behavioral difference from v1 is as follows.
- The snapshot feature of dshash is removed.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello. This is a new version that fixes the Windows build.
At Tue, 03 Jul 2018 19:01:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180703.190144.222427588.horiguchi.kyotaro@lab.ntt.co.jp>
Hello. Thanks for the comment.
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.
In addition, vacuum doesn't require even repeatable-read
consistency, so we don't need to cache the entries at all in that
case. backend_get_tab_entry now returns an isolated (that is,
not-stored-in-hash) palloc'ed copy without making a local copy.
As a result, this version behaves as follows.
- The stats collector stores the results in shared memory.
- In backends, a cache is created only for requested objects and
lasts for the transaction.
- Vacuum reads the shared stats directly and doesn't create a
local copy.
The non-behavioral difference from v1 is as follows.
- The snapshot feature of dshash is removed.
This version includes some additional patches. 0003 removes
PG_STAT_TMP_DIR, which affects pg_stat_statements, pg_basebackup
and pg_rewind. Among them, pg_stat_statements gets a build failure
because it uses the directory to save query texts. 0005 is a new
patch that moves the file to the permanent stats directory. With
this change, pg_basebackup and pg_rewind no longer ignore the
query text file.
I haven't explicitly mentioned it, but dynamic_shared_memory_type =
none now prevents the server from starting; this patch does not
provide a fallback path for that case. I'm expecting that 'none'
will be removed in v12.
v3-0001-sequential-scan-for-dshash.patch
- Functionally the same as v2, with cosmetic changes.
v3-0002-Change-stats-collector-to-an-axiliary-process.patch
- Fixed for the Windows build.
v3-0003-dshash-based-stats-collector.patch
- Cosmetic changes from v2.
v3-0004-Documentation-update.patch
- New patch in v3 containing documentation edits.
v3-0005-Let-pg_stat_statements-not-to-use-PG_STAT_TMP_DIR.patch
- New patch with a tentative change to pg_stat_statements.
v3-0006-Remove-pg_stat_tmp-exclusion-from-pg_rewind.patch
- New patch with a tentative change to pg_rewind.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v3-0001-sequential-scan-for-dshash.patch (text/x-patch; +160 -2)
v3-0002-Change-stats-collector-to-an-axiliary-process.patch (text/x-patch; +77 -134)
v3-0003-dshash-based-stats-collector.patch (text/x-patch; +756 -893)
v3-0004-Documentation-update.patch (text/x-patch; +5 -40)
v3-0005-Let-pg_stat_statements-not-to-use-PG_STAT_TMP_DIR.patch (text/x-patch; +4 -8)
v3-0006-Remove-pg_stat_tmp-exclusion-from-pg_rewind.patch (text/x-patch; +0 -8)
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once following this definition. The attached
version makes a cache entry only for requested objects.
Uh, what? That's basically destroying the long-standing semantics of
statistics snapshots. I do not think we can consider that acceptable.
As an example, it would mean that scan counts for indexes would not
match up with scan counts for their tables.
regards, tom lane
Hello.
At Wed, 04 Jul 2018 17:23:51 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in <67470.1530739431@sss.pgh.pa.us>
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.
Uh, what? That's basically destroying the long-standing semantics of
statistics snapshots. I do not think we can consider that acceptable.
As an example, it would mean that scan counts for indexes would not
match up with scan counts for their tables.
The current stats collector mechanism sends at most 8 table stats
in a single message. Split messages from multiple transactions
can reach the collector in shuffled order, and the resulting snapshot
can be "inconsistent" if an INQUIRY message arrives between such split
messages. Of course a single message is enough for common
transactions, but not for all.
Even though the inconsistency happens more frequently with this
patch, I don't think users expect such strict consistency of
table stats, especially on a busy system. And I believe it's a
good trade if users see more "useful" information in exchange for
the relaxed consistency. (The actual meaning of "useful" is out of
the current focus :p)
Meanwhile, if we are to keep the practical consistency, a giant
lock is out of the question, so we need transactional stats of
some shape. That could be a whole-image snapshot, a regular MVCC
table, or maybe the current dshash with UNDO logs. Since there
are actually many states, storage is inevitably required to
reproduce each state.
I think the consensus is that the whole-image snapshot takes
too much memory, and MVCC is apparently too much for the purpose.
UNDO logs seem a bit promising. If we look at stats in a long
transaction, the required memory for the UNDO information easily
reaches the same amount as the whole-image snapshot, but I
expect that to be uncommon.
I'll consider that apart from the current patch.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello.
At Thu, 05 Jul 2018 12:04:23 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180705.120423.49626073.horiguchi.kyotaro@lab.ntt.co.jp>
UNDO logs seem a bit promising. If we look at stats in a long
transaction, the required memory for the UNDO information easily
reaches the same amount as the whole-image snapshot, but I
expect that to be uncommon.
I'll consider that apart from the current patch.
Done as a PoC. (Sorry for the format; filterdiff generates
garbage from the patch..)
The attached v3-0008 is that: a PoC of UNDO logging of server
stats. It records undo logs for table stats only while some
transaction has started accessing the stats data, so the logging
is rarely performed. The undo logs are applied at the first access
to each relation's stats, and the result is then cached. autovacuum
and vacuum don't request undoing since they just don't need it.
# v3-0007 is a trivial fix for v3-0003, which will be merged.
I see several arguable points in this feature.
- The undo logs are stored in a ring buffer with a fixed size,
currently 1000 entries. If it fills up, the consistency
will be broken. An undo log is recorded just once per entry
after the latest undo-recording transaction arrives. The logs
are likely to be read by rather short-lived transactions, and
it's likely that no more than several such transactions run
simultaneously. It would be possible to provide a dynamic
resizing feature, but it doesn't seem worth the complexity.
- Undoing runs a linear search on the ring buffer. It is done
the first time the stats for each relation are accessed. It can
be (a bit) slow when many log entries reside there. (Concurrent
vacuum can generate many undo log entries.)
- Undo logs for other stats don't seem to me to be needed,
but..
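A minimal sketch of the fixed-size ring-buffer undo scheme described in the first point, with invented names and a single counter per entry; the real patch's record layout, locking and transaction tracking are not shown. The reader reconstructs the value it should see by a linear search for the oldest undo record of its key.

```c
/* Hypothetical sketch of a fixed-size undo ring: before an update while a
 * reading transaction is active, the old value is recorded; a reader takes
 * the OLDEST record for its key as the snapshot value.  All names and the
 * size are invented for illustration. */
#define UNDO_RING_SIZE 1000

typedef struct UndoRec { int key; long old_seq_scan; } UndoRec;

UndoRec undo_ring[UNDO_RING_SIZE];
int undo_head;               /* next slot to write */
int undo_count;              /* number of valid records (<= ring size) */

/* Record the old value before an update; overwriting the oldest record when
 * full is the point where consistency can no longer be guaranteed. */
void undo_record(int key, long old_value)
{
    undo_ring[undo_head].key = key;
    undo_ring[undo_head].old_seq_scan = old_value;
    undo_head = (undo_head + 1) % UNDO_RING_SIZE;
    if (undo_count < UNDO_RING_SIZE)
        undo_count++;
}

/* Linear search from the oldest record forward: the first hit for this key
 * is the value the entry had when the reading transaction began. */
long undo_lookup(int key, long current_value)
{
    int idx = (undo_head - undo_count + UNDO_RING_SIZE) % UNDO_RING_SIZE;
    for (int i = 0; i < undo_count; i++) {
        UndoRec *r = &undo_ring[(idx + i) % UNDO_RING_SIZE];
        if (r->key == key)
            return r->old_seq_scan;   /* oldest record wins */
    }
    return current_value;             /* never updated since tx start */
}
```

The linear search here is exactly the cost flagged in the second point above; it runs once per relation per transaction, after which the result would be cached.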
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname | seq_scan
t1 | 0
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname | seq_scan
t2 | 0
A=> BEGIN;
-- This gives effect because no stats access has been made
B=> select * from t1;
B=> select * from t2;
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname | seq_scan
t1 | 1
-- This has no effect because undo logging is now working
B=> select * from t1;
B=> select * from t2;
<repeat two times>
-- This is the second time in this xact to request for t1,
-- just returns cached result.
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname | seq_scan
t1 | 1
-- This is the first time in this xact to request for t2. The
-- result is undoed one.
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname | seq_scan
t2 | 1
A=> COMMIT;
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't1';
relname | seq_scan
t1 | 4
A=> select relname, seq_scan from pg_stat_user_tables where relname = 't2';
relname | seq_scan
t2 | 4
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Jul 4, 2018 at 11:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> writes:
At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas <robertmhaas@gmail.com>
wrote in <CA+TgmoYQhr30eAcgJCi1v0FhA+3RP1FZVnXqSTLe=6fHy9e5oA@mail.gmail.com>
Copying the whole hash table kind of sucks, partly because of the
time it will take to copy it, but also because it means that memory
usage is still O(nbackends * ntables). Without looking at the patch,
I'm guessing that you're doing that because we need a way to show each
transaction a consistent snapshot of the data, and I admit that I
don't see another obvious way to tackle that problem. Still, it would
be nice if we had a better idea.
The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once under this definition. The attached
version makes a cache entry only for requested objects.
Uh, what? That's basically destroying the long-standing semantics of
statistics snapshots. I do not think we can consider that acceptable.
As an example, it would mean that scan counts for indexes would not
match up with scan counts for their tables.
I agree that this is definitely something that needs to be considered. I
took a look some time ago at the same thing, and ran up against exactly
that one (and at the time did not have time to fix it).
I have not yet had time to look at the downstream suggested handling
(UNDO). However, I had one other thing from my notes I wanted to mention :)
We should probably consider adding an API to fetch counters that *don't*
follow these rules, in case it's not needed. When going through files we're
still stuck at that bottleneck, but if going through shared memory it
should be possible to make it a lot cheaper by volunteering to "not need
that".
We should also consider the ability to fetch stats for a single object,
which would require no copying of the whole structure at all. I think
something like this could for example be used for autovacuum rechecks. On
top of the file based transfer that would help very little, but through
shared memory it could be a lot lighter weight.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
On Fri, Jul 6, 2018 at 10:29 AM, Magnus Hagander <magnus@hagander.net> wrote:
I agree that this is definitely something that needs to be considered. I
took a look some time ago at the same thing, and ran up against exactly that
one (and at the time did not have time to fix it).
I have not yet had time to look at the downstream suggested handling (UNDO).
However, I had one other thing from my notes I wanted to mention :)
We should probably consider adding an API to fetch counters that *don't*
follow these rules, in case it's not needed. When going through files we're
still stuck at that bottleneck, but if going through shared memory it should
be possible to make it a lot cheaper by volunteering to "not need that".
We should also consider the ability to fetch stats for a single object,
which would require no copying of the whole structure at all. I think
something like this could for example be used for autovacuum rechecks. On
top of the file based transfer that would help very little, but through
shared memory it could be a lot lighter weight.
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.
Note that commit 3cba8999b34 relaxed the synchronization requirements
around GetLockStatusData(). In other words, since 2011, you can no
longer be certain that 'select * from pg_locks' is returning a
perfectly consistent view of the lock status. If this has caused
anybody a major problem, I'm unaware of it. Maybe the same would end
up being true here. The amount of memory we're consuming for this
data may be a bigger problem than minor inconsistencies in the view of
the data would be.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.
Indeed. I don't think major additional memory or code complexity is
worthwhile in this situation. The likelihood of benefitting from more /
better stats seems far higher than from a more accurate view of the stats,
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.
Greetings,
Andres Freund
On 07/06/2018 11:57 AM, Andres Freund wrote:
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.
Indeed. I don't think major additional memory or code complexity is
worthwhile in this situation. The likelihood of benefitting from more /
better stats seems far higher than from a more accurate view of the stats,
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.
Will stats, if we move toward the suggested changes, be "less" accurate
than they are now? We already know that stats are generally not accurate,
but they are close enough. If we move toward this change, will they still
be close enough?
JD
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****
On Fri, Jul 6, 2018 at 3:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
Will stats, if we move toward the suggested changes, be "less" accurate
than they are now? We already know that stats are generally not accurate,
but they are close enough. If we move toward this change, will they still
be close enough?
The proposed change would have no impact at all on the long-term
accuracy of the statistics. It would just mean that there would be
race conditions when reading them, so that for example you would be
more likely to see a count of heap scans that doesn't match the count
of index scans, because an update arrives in between when you read the
first value and when you read the second one. I don't see that
mattering a whole lot, TBH, but maybe I'm missing something.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-07-06 12:02:39 -0700, Joshua D. Drake wrote:
On 07/06/2018 11:57 AM, Andres Freund wrote:
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.
Indeed. I don't think major additional memory or code complexity is
worthwhile in this situation. The likelihood of benefitting from more /
better stats seems far higher than from a more accurate view of the stats,
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.
Will stats, if we move toward the suggested changes, be "less" accurate
than they are now? We already know that stats are generally not accurate,
but they are close enough. If we move toward this change, will they still
be close enough?
I don't think there's a meaningful difference from before. And at the same
time, less duplication / hardcoded structure will allow us to increase
the amount of stats we keep.
Greetings,
Andres Freund
On 07/06/2018 12:34 PM, Robert Haas wrote:
On Fri, Jul 6, 2018 at 3:02 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
Will stats, if we move toward the suggested changes, be "less" accurate
than they are now? We already know that stats are generally not accurate,
but they are close enough. If we move toward this change, will they still
be close enough?
The proposed change would have no impact at all on the long-term
accuracy of the statistics. It would just mean that there would be
race conditions when reading them, so that for example you would be
more likely to see a count of heap scans that doesn't match the count
of index scans, because an update arrives in between when you read the
first value and when you read the second one. I don't see that
mattering a whole lot, TBH, but maybe I'm missing something.
I agree that it probably isn't a big deal. Generally speaking when we
look at stats it is to get an "idea" of what is going on. We don't care
if we are missing an increase/decrease of 20 of any particular value
within stats. Based on this and what Andres said, it seems like a net
win to me.
JD
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****
On Fri, Jul 6, 2018 at 8:57 PM, Andres Freund <andres@anarazel.de> wrote:
On 2018-07-06 14:49:53 -0400, Robert Haas wrote:
I think we also have to ask ourselves in general whether snapshots of
this data are worth what they cost. I don't think anyone would doubt
that a consistent snapshot of the data is better than an inconsistent
view of the data if the costs were equal. However, if we can avoid a
huge amount of memory usage and complexity on large systems with
hundreds of backends by ditching the snapshot requirement, then we
should ask ourselves how important we think the snapshot behavior
really is.
Indeed. I don't think major additional memory or code complexity is
worthwhile in this situation. The likelihood of benefitting from more /
better stats seems far higher than from a more accurate view of the stats,
which aren't particularly accurate themselves. They don't even survive
crashes right now, so I don't think the current accuracy is very high.
Definitely agreed.
*If* we can provide the snapshot view of them without too much overhead, I
think it's worth looking into that while *also* providing a lower-overhead
interface for those that don't care about it.
If it ends up that keeping the snapshots becomes too much overhead, either
in performance or in code maintenance, then I agree we can probably drop
that. But we should at least properly investigate the cost.
--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Hi,
On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
*If* we can provide the snapshot view of them without too much overhead, I
think it's worth looking into that while *also* providing a lower-overhead
interface for those that don't care about it.
I don't see how that's possible without adding significant amounts of
complexity and probably memory / cpu overhead. The current stats already
are quite inconsistent (often outdated, partially updated, messages
dropped when busy) - I don't see what we really gain by building
something MVCC-like in the "new" stats subsystem.
If it ends up that keeping the snapshots becomes too much overhead, either
in performance or in code maintenance, then I agree we can probably drop
that. But we should at least properly investigate the cost.
I don't think it's worthwhile to do more than think a bit about it. There
are fairly obvious tradeoffs in complexity here. Trying to get there seems
like a good way to make the feature too big.
Greetings,
Andres Freund
Hello. Thanks for the opinions.
At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund <andres@anarazel.de> wrote in <20180706201036.awheoi6tk556x6aj@alap3.anarazel.de>
Hi,
On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
*If* we can provide the snapshot view of them without too much overhead, I
think it's worth looking into that while *also* providing a lower-overhead
interface for those that don't care about it.
I don't see how that's possible without adding significant amounts of
complexity and probably memory / cpu overhead. The current stats already
are quite inconsistent (often outdated, partially updated, messages
dropped when busy) - I don't see what we really gain by building
something MVCC-like in the "new" stats subsystem.
If it ends up that keeping the snapshots becomes too much overhead, either
in performance or in code maintenance, then I agree we can probably drop
that. But we should at least properly investigate the cost.
I don't think it's worthwhile to do more than think a bit about it. There
are fairly obvious tradeoffs in complexity here. Trying to get there seems
like a good way to make the feature too big.
Agreed.
Well, if we allow consistency to be lost to some extent for improved
performance and a smaller footprint, relaxing the consistency of
database stats can reduce the footprint further, especially on a
cluster with many databases. Backends are interested only in
their own database, and vacuum doesn't cache stats at all. A
possible problem is that vacuum and the stats collector can get
into a race condition. I'm not sure, but I suppose it is not worse
than being involved in I/O congestion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Attachments:
v4-0001-sequential-scan-for-dshash.patch (text/x-patch; +160 -2)
v4-0002-Change-stats-collector-to-an-axiliary-process.patch (text/x-patch; +77 -134)
v4-0003-dshash-based-stats-collector.patch (text/x-patch; +836 -930)
v4-0004-Documentation-update.patch (text/x-patch; +5 -40)
v4-0005-Let-pg_stat_statements-not-to-use-PG_STAT_TMP_DIR.patch (text/x-patch; +4 -8)
v4-0006-Remove-pg_stat_tmp-exclusion-from-pg_rewind.patch (text/x-patch; +0 -8)
On 07/10/2018 02:07 PM, Kyotaro HORIGUCHI wrote:
Hello. Thanks for the opinions.
At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund <andres@anarazel.de> wrote in <20180706201036.awheoi6tk556x6aj@alap3.anarazel.de>
Hi,
On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
*If* we can provide the snapshot view of them without too much overhead, I
think it's worth looking into that while *also* providing a lower-overhead
interface for those that don't care about it.
I don't see how that's possible without adding significant amounts of
complexity and probably memory / cpu overhead. The current stats already
are quite inconsistent (often outdated, partially updated, messages
dropped when busy) - I don't see what we really gain by building
something MVCC-like in the "new" stats subsystem.
If it ends up that keeping the snapshots becomes too much overhead, either
in performance or in code maintenance, then I agree we can probably drop
that. But we should at least properly investigate the cost.
I don't think it's worthwhile to do more than think a bit about it. There
are fairly obvious tradeoffs in complexity here. Trying to get there seems
like a good way to make the feature too big.
Agreed.
Well, if we allow consistency to be lost to some extent for improved
performance and a smaller footprint, relaxing the consistency of
database stats can reduce the footprint further, especially on a
cluster with many databases. Backends are interested only in
their own database, and vacuum doesn't cache stats at all. A
possible problem is that vacuum and the stats collector can get
into a race condition. I'm not sure, but I suppose it is not worse
than being involved in I/O congestion.
As someone who regularly analyzes stats collected from user systems, I
think there's certainly some value with keeping the snapshots reasonably
consistent. But I agree it doesn't need to be perfect, and some level of
inconsistency is acceptable (and the amount of complexity/overhead
needed to maintain perfect consistency seems rather excessive here).
There's one more reason why attempts to keep stats snapshots "perfectly"
consistent are likely doomed to fail - the messages are sent over UDP,
which does not guarantee delivery etc. So there's always some level of
possible inconsistency even with "perfectly consistent" snapshots.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-07-10 14:52:13 +0200, Tomas Vondra wrote:
There's one more reason why attempts to keep stats snapshots "perfectly"
consistent are likely doomed to fail - the messages are sent over UDP, which
does not guarantee delivery etc. So there's always some level of possible
inconsistency even with "perfectly consistent" snapshots.
FWIW, I don't see us continuing to do so if we go for a shared hashtable
for stats.
- Andres
I've spent some time reviewing this version.
Design
------
1. Even with your patch the stats collector still uses a UDP socket to
receive data. Now that the shared memory API is there, shouldn't the
messages be sent via a shared memory queue? [1] That would increase the
reliability of message delivery.
I can actually imagine backends inserting data into the shared hash tables
themselves, but that might make them wait if the same entries are accessed
by another backend. It should be much cheaper to just insert a message into
the queue and let the collector process it. In a future version the collector
could launch parallel workers so that writes by backends do not get blocked
by a full queue.
2. I think the access to the shared hash tables introduces more contention
than necessary. For example, pgstat_recv_tabstat() retrieves "dbentry" and
leaves the containing hash table partition locked *exclusively* even though,
once the changes of the dbentry itself are done, it only modifies the table
entries contained in it.
It appears that the shared hash tables are only modified by the stats
collector. The unnecessary use of the exclusive lock might be a bigger
issue in the future if the stats collector will use parallel
workers. Monitoring functions and autovacuum are affected by the locking
now.
(I see that it's not trivial to get a just-created entry locked in shared
mode: it may need a loop in which we release the exclusive lock and acquire
the shared lock, unless the entry was already removed.)
3. Data in both shared_archiverStats and shared_globalStats is mostly accessed
w/o locking. Is that ok? I'd expect the StatsLock to be used for these.
Coding
------
* git apply v4-0003-dshash-based-stats-collector.patch needed manual
resolution of one conflict.
* pgstat_quickdie_handler() appears to be the only "quickdie handler" that
calls on_exit_reset(), although the comments are almost copy & pasted from
such a handler of other processes. Can you please explain what's specific
about pgstat.c?
* the variable name "area" would be sufficient if it was local to some
function, otherwise I think the name is too generic.
* likewise db_stats is too generic for a global variable. How about
"snapshot_db_stats_local"?
* backend_get_db_entry() passes 0 for handle to snapshot_statentry(). How
about DSM_HANDLE_INVALID ?
* I only see one call of snapshot_statentry_all() and it receives 0 for
handle. Thus the argument can be removed and the function does not have to
attach / detach to / from the shared hash table.
* backend_snapshot_global_stats() switches to TopMemoryContext before it calls
pgstat_attach_shared_stats(), but the latter takes care of the context
itself.
* pgstat_attach_shared_stats() - header comment should explain what the return
value means.
* reset_dbentry_counters() does more than just resetting the counters. Name
like initialize_dbentry() would be more descriptive.
* typos:
** backend_snapshot_global_stats(): "liftime" -> "lifetime"
** snapshot_statentry(): "entriy" -> "entry"
** backend_get_func_etnry(): "onshot" -> "oneshot"
** snapshot_statentry_all(): "Returns a local hash contains ..." -> "Returns a local hash containing ..."
[1]: /messages/by-id/20180711000605.sqjik3vqe5opqz33@alap3.anarazel.de
--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at