RFE: Make statistics robust for unplanned events

Started by Patrik Novotny over 4 years ago, 12 messages
#1 Patrik Novotny
panovotn@redhat.com

Hello PostgreSQL Hackers,

is it possible to preserve the PostgreSQL statistics on a server crash?

Steps to reproduce the behaviour:
1) Observe the statistics counters, take note
2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will
already suffice
3) After recovery, observe the statistics counters again. Have they been
reset to zero (Bad) or are they preserved (Good)?

Resetting the counters to zero harms execution planning and autovacuum
operations. That can cause growth of the database, as dead tuples are not
removed at the right time. In the end the database can go offline if
autovacuum never runs.
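
(For steps 1 and 3, the counters in question can be read with something like
the queries below; the table name "t" is just a placeholder:)

    -- Per-table activity counters kept by the statistics subsystem
    SELECT n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup,
           last_autovacuum, last_autoanalyze
      FROM pg_stat_user_tables
     WHERE relname = 't';

    -- Database-wide counters
    SELECT xact_commit, xact_rollback, tup_inserted, tup_updated, tup_deleted
      FROM pg_stat_database
     WHERE datname = current_database();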

As far as I've checked, this would have to be implemented.

My question would be whether there is something that would make this
impossible to implement, and if there isn't, I'd like this to be considered
a feature request.

Regards

--
Patrik Novotný
Associate Software Engineer
Red Hat
panovotn@redhat.com

#2 Peter Geoghegan
pg@bowt.ie
In reply to: Patrik Novotny (#1)
Re: RFE: Make statistics robust for unplanned events

On Tue, Apr 20, 2021 at 5:00 AM Patrik Novotny <panovotn@redhat.com> wrote:

As far as I've checked, this would have to be implemented.

My question would be whether there is something that would make this impossible to implement, and if there isn't, I'd like this to be considered a feature request.

I agree with you.

Maybe crash safety would require some care in cases where autovacuum
runs very frequently, so that the overhead isn't too high. But
overall, non-crash-safe information that drives autovacuum is penny
wise, pound foolish.

I'm sure that it doesn't matter that much most of the time, but there
are probably workloads and use cases where it causes significant and
persistent problems. That's not the right trade-off IMV.

--
Peter Geoghegan

#3 Magnus Hagander
magnus@hagander.net
In reply to: Patrik Novotny (#1)
Re: RFE: Make statistics robust for unplanned events

On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:

Hello PostgreSQL Hackers,

is it possible to preserve the PostgreSQL statistics on a server crash?

Steps to reproduce the behaviour:
1) Observe the statistics counters, take note
2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
3) After recovery, observe the statistics counters again. Have they been reset to zero (Bad) or are they preserved (Good)?

Resetting the counters to zero harms execution planning and autovacuum
operations. That can cause growth of the database, as dead tuples are not
removed at the right time. In the end the database can go offline if autovacuum never runs.

The stats for the planner are stored persistently in pg_stats though,
but autovacuum definitely takes a hit from it, and several other
things can too.
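
(As a rough illustration, with "t" as a placeholder table: the planner
statistics live in the catalog and survive a crash, while the activity
counters that drive autovacuum do not:)

    -- Planner statistics: stored in pg_statistic, exposed via pg_stats, crash-safe
    SELECT attname, null_frac, n_distinct
      FROM pg_stats
     WHERE tablename = 't';

    -- Activity counters: kept by the stats subsystem, reset to zero after a crash
    SELECT n_dead_tup, n_mod_since_analyze
      FROM pg_stat_user_tables
     WHERE relname = 't';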

As far as I've checked, this would have to be implemented.

My question would be whether there is something that would make this impossible to implement, and if there isn't, I'd like this to be considered a feature request.

I'm pretty sure everybody would *want* this. At least nobody would be
against it. The problem is the potential performance cost of it.

Andres mentioned at least once over in the thread about shared memory
stats collection that being able to have persistent stats could come
out of that one in the future. Whatever is done on the topic should
probably be done based on that work, as it provides a better starting
point and also one that will stay around.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#4 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Magnus Hagander (#3)
Re: RFE: Make statistics robust for unplanned events

On 4/21/21 2:38 PM, Magnus Hagander wrote:

On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:

Hello PostgreSQL Hackers,

is it possible to preserve the PostgreSQL statistics on a server crash?

Steps to reproduce the behaviour:
1) Observe the statistics counters, take note
2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
3) After recovery, observe the statistics counters again. Have they been reset to zero (Bad) or are they preserved (Good)?

Resetting the counters to zero harms execution planning and autovacuum
operations. That can cause growth of the database, as dead tuples are not
removed at the right time. In the end the database can go offline if autovacuum never runs.

The stats for the planner are stored persistently in pg_stats though,
but autovacuum definitely takes a hit from it, and several other
things can too.

As far as I've checked, this would have to be implemented.

I think the problem with planner stats is that after reset of the
runtime stats we lose info about which tables may need analyze etc. and
then fail to run ANALYZE in time. Which may have negative impact on
performance, of course.
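
(Roughly the information that is lost, with "t" as a placeholder table; this
is a simplified version of the autoanalyze condition and ignores per-table
reloptions:)

    SELECT s.n_mod_since_analyze,   -- reset to zero after a crash
           current_setting('autovacuum_analyze_threshold')::float8
             + current_setting('autovacuum_analyze_scale_factor')::float8
               * c.reltuples AS analyze_threshold
      FROM pg_stat_user_tables s
      JOIN pg_class c ON c.oid = s.relid
     WHERE s.relname = 't';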

My question would be whether there is something that would make
this impossible to implement, and if there isn't, I'd like this to
be considered a feature request.

I'm pretty sure everybody would *want* this. At least nobody would be
against it. The problem is the potential performance cost of it.

Andres mentioned at least once over in the thread about shared memory
stats collection that being able to have persistent stats could come
out of that one in the future. Whatever is done on the topic should
probably be done based on that work, as it provides a better starting
point and also one that will stay around.

Right. I think the other question is how often does this happen in
practice - if your instance crashes often enough to make this an issue,
then there are probably bigger issues.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5 Magnus Hagander
magnus@hagander.net
In reply to: Tomas Vondra (#4)
Re: RFE: Make statistics robust for unplanned events

On Wed, Apr 21, 2021 at 5:02 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 4/21/21 2:38 PM, Magnus Hagander wrote:

On Tue, Apr 20, 2021 at 2:00 PM Patrik Novotny <panovotn@redhat.com> wrote:

Hello PostgreSQL Hackers,

is it possible to preserve the PostgreSQL statistics on a server crash?

Steps to reproduce the behaviour:
1) Observe the statistics counters, take note
2) Crash the machine, e.g. with sysrq; perhaps kill -9 on postgresql will already suffice
3) After recovery, observe the statistics counters again. Have they been reset to zero (Bad) or are they preserved (Good)?

Resetting the counters to zero harms execution planning and autovacuum
operations. That can cause growth of the database, as dead tuples are not
removed at the right time. In the end the database can go offline if autovacuum never runs.

The stats for the planner are stored persistently in pg_stats though,
but autovacuum definitely takes a hit from it, and several other
things can too.

As far as I've checked, this would have to be implemented.

I think the problem with planner stats is that after reset of the
runtime stats we lose info about which tables may need analyze etc. and
then fail to run ANALYZE in time. Which may have negative impact on
performance, of course.

My question would be whether there is something that would make
this impossible to implement, and if there isn't, I'd like this to
be considered a feature request.

I'm pretty sure everybody would *want* this. At least nobody would be
against it. The problem is the potential performance cost of it.

Andres mentioned at least once over in the thread about shared memory
stats collection that being able to have persistent stats could come
out of that one in the future. Whatever is done on the topic should
probably be done based on that work, as it provides a better starting
point and also one that will stay around.

Right. I think the other question is how often does this happen in
practice - if your instance crashes often enough to make this an issue,
then there are probably bigger issues.

Agreed.

I think the bigger problem there is replication failover, but that's
also a different issue (keeping the statistics from the *standby*
wouldn't help you much there, you'd need to replicate it from the
primary).

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#6 Patrik Novotny
panovotn@redhat.com
In reply to: Magnus Hagander (#5)
Re: RFE: Make statistics robust for unplanned events

On Wed, Apr 21, 2021 at 5:05 PM Magnus Hagander <magnus@hagander.net> wrote:

Right. I think the other question is how often does this happen in
practice - if your instance crashes often enough to make this an issue,
then there are probably bigger issues.

Agreed.

I think the bigger problem there is replication failover, but that's
also a different issue (keeping the statistics from the *standby*
wouldn't help you much there, you'd need to replicate it from the
primary).

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

The report that I've received regarding this RFE was triggered by issues
experienced with long-term deployments in a large-scale industrial
environment. The point of this RFE is to be protected against those issues
in the future. While this doesn't seem to be a very frequent occurrence, I
wouldn't consider it a corner case that isn't worth attention.

If there is an expectation for the performance loss to be less of a problem
in the future, would it make sense to make this an opt-in feature until
then?

--
Patrik Novotný
Associate Software Engineer
Red Hat
panovotn@redhat.com

#7 Peter Geoghegan
pg@bowt.ie
In reply to: Magnus Hagander (#3)
Re: RFE: Make statistics robust for unplanned events

On Wed, Apr 21, 2021 at 5:39 AM Magnus Hagander <magnus@hagander.net> wrote:

I'm pretty sure everybody would *want* this. At least nobody would be
against it. The problem is the potential performance cost of it.

VACUUM remembers vacrel->new_live_tuples as the pg_class.reltuples for
the heap relation being vacuumed. It also remembers new_rel_pages in
pg_class (see vac_update_relstats()). However, it does not remember
vacrel->new_dead_tuples in pg_class or in any other durable location
(the information gets remembered via a call to pgstat_report_vacuum()
instead).

We already *almost* pay the full cost of durably storing the
information used by autovacuum.c's relation_needs_vacanalyze() to
determine if a VACUUM is required -- we're only missing
new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
further to fix this issue?

Admittedly, storing new_dead_tuples durably is not sufficient to allow
ANALYZE to be launched on schedule when there is a hard crash. It is
also insufficient to make sure that insert-driven autovacuums get
launched on schedule. Even still, I'm pretty sure that just making
sure that we store it durably (alongside pg_class.reltuples?) will
impose only a modest additional cost, while fixing Patrik's problem.
That seems likely to be worth it.
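
(To illustrate the asymmetry, with "t" as a placeholder table: the first two
values already survive a crash, the third does not:)

    SELECT c.reltuples,   -- durable, written by vac_update_relstats()
           c.relpages,    -- durable, same place
           s.n_dead_tup   -- in-memory stats only, reads 0 after a crash
      FROM pg_class c
      JOIN pg_stat_user_tables s ON s.relid = c.oid
     WHERE c.relname = 't';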

--
Peter Geoghegan

#8 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Peter Geoghegan (#7)
Re: RFE: Make statistics robust for unplanned events

Peter Geoghegan <pg@bowt.ie> writes:

We already *almost* pay the full cost of durably storing the
information used by autovacuum.c's relation_needs_vacanalyze() to
determine if a VACUUM is required -- we're only missing
new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
further to fix this issue?

Definitely worth thinking about, but I'm a little confused about how
you see this working. Those pg_class fields are updated by vacuum
(or analyze) itself. How could they usefully serve as input to
autovacuum's decisions?

Admittedly, storing new_dead_tuples durably is not sufficient to allow
ANALYZE to be launched on schedule when there is a hard crash. It is
also insufficient to make sure that insert-driven autovacuums get
launched on schedule.

I'm not that worried about the former case, but the latter seems
like kind of a problem.

regards, tom lane

#9 Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#3)
Re: RFE: Make statistics robust for unplanned events

Hi,

On 2021-04-21 14:38:44 +0200, Magnus Hagander wrote:

Andres mentioned at least once over in the thread about shared memory
stats collection that being able to have persistent stats could come
out of that one in the future. Whatever is done on the topic should
probably be done based on that work, as it provides a better starting
point and also one that will stay around.

Yea. I think the main benefit from the shared memory stats patch that
would make this easier is that it tracks (with one small hole that can
probably be addressed) dropped objects accurately, even across crashes /
replication. Having old stats around runs the risk of stats for an old
dropped object being combined with stats for a new object.

I don't think making pgstat.c fully durable by continually storing the
data in a table or something like that is an option. For one, the stats
for a replica and primary are independent. For another, the overhead
would be prohibitive.

But after we gain the accurate dropping of stats we can store a stats
snapshot corresponding to certain WAL records (by serializing to
something like pg_stats_%redo_lsn%) without ending up with dropped stats
surviving.

A big question around this is how often we'd want to write out the
stats. Obviously, the more often we do, the higher the overhead. And the
less frequently, the more stats updates might be lost.

Patrik, for your use cases, would losing at most the stats since the
start of the last checkpoint be an issue?

Greetings,

Andres Freund

#10 Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#9)
Re: RFE: Make statistics robust for unplanned events

On Fri, Apr 23, 2021 at 12:41 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-04-21 14:38:44 +0200, Magnus Hagander wrote:

Andres mentioned at least once over in the thread about shared memory
stats collection that being able to have persistent stats could come
out of that one in the future. Whatever is done on the topic should
probably be done based on that work, as it provides a better starting
point and also one that will stay around.

Yea. I think the main benefit from the shared memory stats patch that
would make this easier is that it tracks (with one small hole that can
probably be addressed) dropped objects accurately, even across crashes /
replication. Having old stats around runs the risk of stats for an old
dropped object being combined with stats for a new object.

I don't think making pgstat.c fully durable by continually storing the
data in a table or something like that is an option. For one, the stats
for a replica and primary are independent. For another, the overhead
would be prohibitive.

But after we gain the accurate dropping of stats we can store a stats
snapshot corresponding to certain WAL records (by serializing to
something like pg_stats_%redo_lsn%) without ending up with dropped stats
surviving.

A big question around this is how often we'd want to write out the
stats. Obviously, the more often we do, the higher the overhead. And the
less frequently, the more stats updates might be lost.

Yeah, that's what I was thinking as well -- dumping snapshots at
regular intervals, so that on crash recovery we lose a "controlled
amount" of recent stats instead of losing *everything*.

I think in most situations a fairly long interval is OK -- if you have
tables that take so many hits that you need a really quick reaction
from autovacuum it will probably pick that up quickly enough even
after a reset. And if it's more the long-term tracking that's
important, then a longer interval is probably OK.

But perhaps make it configurable with a timeout and a "reasonable default"?

Patrik, for your use cases, would losing at most the stats since the
start of the last checkpoint be an issue?

Unless there's a particular benefit to tying it specifically to the
checkpoint occurring, I'd rather keep it as a separate setting. They
might both come with the same default of course, but I can certainly
envision cases where you want to increase the checkpoint distance
while keeping the stats interval lower, for example. Many people
increase the checkpoint timeout quite a lot...
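
(For context, these are the existing checkpoint knobs people already tune; a
separate snapshot interval for stats is purely hypothetical at this point:)

    -- Checkpoint spacing, often raised well beyond the 5min default in practice
    SHOW checkpoint_timeout;
    SHOW max_wal_size;
    -- A dedicated stats-snapshot interval (hypothetical, no such GUC exists yet)
    -- could then stay at a few minutes even with half-hour checkpoints.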

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#11 Patrik Novotny
panovotn@redhat.com
In reply to: Magnus Hagander (#10)
Re: RFE: Make statistics robust for unplanned events

Yeah, that's what I was thinking as well -- dumping snapshots at
regular intervals, so that on crash recovery we lose a "controlled
amount" of recent stats instead of losing *everything*.

I think in most situations a fairly long interval is OK -- if you have
tables that take so many hits that you need a really quick reaction
from autovacuum it will probably pick that up quickly enough even
after a reset. And if it's more the long-term tracking that's
important, then a longer interval is probably OK.

But perhaps make it configurable with a timeout and a "reasonable default"?

Patrik, for your use cases, would losing at most the stats since the
start of the last checkpoint be an issue?

Unless there's a particular benefit to tying it specifically to the
checkpoint occurring, I'd rather keep it as a separate setting. They
might both come with the same default of course, but I can certainly
envision cases where you want to increase the checkpoint distance
while keeping the stats interval lower, for example. Many people
increase the checkpoint timeout quite a lot...

From what I understand, I think it depends on the interval of those
checkpoints. If the interval was configurable with the mentioned reasonable
default, then it shouldn't be an issue.

If we were to choose a hard coded interval of those checkpoints based on my
case, I would have to consult the original reporter, but then it might not
suit others anyway. Therefore, making it configurable makes more sense to
me personally.

--
Patrik Novotný
Associate Software Engineer
Red Hat
panovotn@redhat.com

#12 Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#8)
Re: RFE: Make statistics robust for unplanned events

On Thu, Apr 22, 2021 at 3:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Peter Geoghegan <pg@bowt.ie> writes:

We already *almost* pay the full cost of durably storing the
information used by autovacuum.c's relation_needs_vacanalyze() to
determine if a VACUUM is required -- we're only missing
new_dead_tuples/tabentry->n_dead_tuples. Why not go one tiny baby step
further to fix this issue?

Definitely worth thinking about, but I'm a little confused about how
you see this working. Those pg_class fields are updated by vacuum
(or analyze) itself. How could they usefully serve as input to
autovacuum's decisions?

Honestly, the details weren't very well thought out. My basic point
still stands, which is that it shouldn't be *that* expensive to make
the relevant information crash-safe, which would protect the system
from certain pathological cases. Maybe it doesn't even have to be
crash-safe in the way that pg_class.reltuples is -- something
approximate might work quite well. Assuming that there are no dead
tuples after a crash seems rather naive.
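
(As a concrete sketch of the failure mode, with "t" as a placeholder table
and ignoring per-table reloptions: autovacuum compares n_dead_tup against a
threshold, so a counter that reads 0 after a crash can postpone vacuuming for
a long time:)

    SELECT s.n_dead_tup,
           current_setting('autovacuum_vacuum_threshold')::float8
             + current_setting('autovacuum_vacuum_scale_factor')::float8
               * c.reltuples AS vacuum_threshold,  -- roughly relation_needs_vacanalyze()
           s.n_dead_tup > current_setting('autovacuum_vacuum_threshold')::float8
             + current_setting('autovacuum_vacuum_scale_factor')::float8
               * c.reltuples AS would_trigger_vacuum
      FROM pg_stat_user_tables s
      JOIN pg_class c ON c.oid = s.relid
     WHERE s.relname = 't';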

I seem to recall that certain init scripts I saw years ago used
Immediate Shutdown mode very frequently. Stuff like that is bound to
happen in some installations, and so we should protect users from
hard-to-foresee extreme consequences. Sure, using immediate shutdown
like that isn't recommended practice, but that's no reason to allow a
nasty failure mode.

--
Peter Geoghegan