Checksum errors in pg_stat_database

Started by Magnus Haganderover 7 years ago62 messageshackers
Jump to latest
#1Magnus Hagander
magnus@hagander.net

Would it make sense to add a column to pg_stat_database showing the total
number of checksum errors that have occurred in a database?

It's really a ">1 means it's bad", but it's a lot easier to monitor that in
the statistics views, and given how much a lot of people set their systems
out to log, it's far too easy to miss individual checksum matches in the
logs.

If we track it at the database level, I don't think the overhead of adding
one more counter would be very high either.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#2Robert Haas
robertmhaas@gmail.com
In reply to: Magnus Hagander (#1)
Re: Checksum errors in pg_stat_database

On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <magnus@hagander.net> wrote:

Would it make sense to add a column to pg_stat_database showing the total number of checksum errors that have occurred in a database?

It's really a ">1 means it's bad", but it's a lot easier to monitor that in the statistics views, and given how much a lot of people set their systems out to log, it's far too easy to miss individual checksum matches in the logs.

If we track it at the database level, I don't think the overhead of adding one more counter would be very high either.

It's probably not the idea way to track it. If you have a terabyte or
fifty of data, and you see that you have some checksum failures, good
luck finding the offending blocks.

But I'm tentatively in favor of your proposal anyway, because it's
pretty simple and cheap and might help people, and doing something
noticeably better is probably annoyingly complicated.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Robert Haas (#2)
Re: Checksum errors in pg_stat_database

On 1/11/19 7:40 PM, Robert Haas wrote:

On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <magnus@hagander.net> wrote:

Would it make sense to add a column to pg_stat_database showing
the total number of checksum errors that have occurred in a database?

It's really a ">1 means it's bad", but it's a lot easier to monitor
that in the statistics views, and given how much a lot of people
set their systems out to log, it's far too easy to miss individual
checksum matches in the logs.

If we track it at the database level, I don't think the overhead
of adding one more counter would be very high either.

It's probably not the idea way to track it. If you have a terabyte or
fifty of data, and you see that you have some checksum failures, good
luck finding the offending blocks.

Isn't that somewhat similar to deadlocks, which we also track in
pg_stat_database? The number of deadlocks is rather useless on it's own,
you need to dive into the server log to find the details. Same for
checksum errors.

But I'm tentatively in favor of your proposal anyway, because it's
pretty simple and cheap and might help people, and doing something
noticeably better is probably annoyingly complicated.

+1

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Magnus Hagander
magnus@hagander.net
In reply to: Tomas Vondra (#3)
Re: Checksum errors in pg_stat_database

On Fri, Jan 11, 2019 at 9:20 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:

On 1/11/19 7:40 PM, Robert Haas wrote:

On Fri, Jan 11, 2019 at 5:21 AM Magnus Hagander <magnus@hagander.net>

wrote:

Would it make sense to add a column to pg_stat_database showing
the total number of checksum errors that have occurred in a database?

It's really a ">1 means it's bad", but it's a lot easier to monitor
that in the statistics views, and given how much a lot of people
set their systems out to log, it's far too easy to miss individual
checksum matches in the logs.

If we track it at the database level, I don't think the overhead
of adding one more counter would be very high either.

It's probably not the idea way to track it. If you have a terabyte or
fifty of data, and you see that you have some checksum failures, good
luck finding the offending blocks.

Isn't that somewhat similar to deadlocks, which we also track in
pg_stat_database? The number of deadlocks is rather useless on it's own,
you need to dive into the server log to find the details. Same for
checksum errors.

It is a bit similar yeah. Though a checksum counter is really a "you need
to look at fixing this right away" in a bit more sense than deadlocks. But
yes, the fact that we already tracks deadlocks there is a good example. (Of
course, I believe I added that one at some point as well, so I'm clearly
biased there)

But I'm tentatively in favor of your proposal anyway, because it's

pretty simple and cheap and might help people, and doing something
noticeably better is probably annoyingly complicated.

+1

Yeah, that's the idea behind it -- it's cheap, and an
early-warning-indicator.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#5David Steele
david@pgmasters.net
In reply to: Magnus Hagander (#4)
Re: Checksum errors in pg_stat_database

On 1/11/19 10:25 PM, Magnus Hagander wrote:

On Fri, Jan 11, 2019 at 9:20 PM Tomas Vondra
On 1/11/19 7:40 PM, Robert Haas wrote:

But I'm tentatively in favor of your proposal anyway, because it's
pretty simple and cheap and might help people, and doing something
noticeably better is probably annoyingly complicated.

+1

Yeah, that's the idea behind it -- it's cheap, and an
early-warning-indicator.

+1

--
-David
david@pgmasters.net

#6Magnus Hagander
magnus@hagander.net
In reply to: David Steele (#5)
Re: Checksum errors in pg_stat_database

On Sat, Jan 12, 2019 at 5:16 AM David Steele <david@pgmasters.net> wrote:

On 1/11/19 10:25 PM, Magnus Hagander wrote:

On Fri, Jan 11, 2019 at 9:20 PM Tomas Vondra
On 1/11/19 7:40 PM, Robert Haas wrote:

But I'm tentatively in favor of your proposal anyway, because it's
pretty simple and cheap and might help people, and doing something
noticeably better is probably annoyingly complicated.

+1

Yeah, that's the idea behind it -- it's cheap, and an
early-warning-indicator.

+1

PFA is a patch to do this.

It tracks things that happen in the general backends. Possibly we should
also consider counting the errors actually found when running base backups?
OTOH, that part of the code doesn't really track things like databases (as
it operates just on the raw data directory underneath), so that
implementation would definitely not be as clean...

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

Attachments:

stat_checksums.patchtext/x-patch; charset=US-ASCII; name=stat_checksums.patchDownload+86-1
#7Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#6)
Re: Checksum errors in pg_stat_database

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net> wrote:

PFA is a patch to do this.

+void
+pgstat_report_checksum_failure(void)
+{
+   PgStat_MsgDeadlock msg;

I think that you meant PgStat_MsgChecksumFailure :)

+/* ----------
+ * pgstat_recv_checksum_failure() -
+ *
+ * Process a DEADLOCK message.
+ * ----------

same here

Otherwise LGTM.

#8Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#7)
Re: Checksum errors in pg_stat_database

On Fri, Feb 22, 2019 at 3:16 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net>
wrote:

PFA is a patch to do this.

+void
+pgstat_report_checksum_failure(void)
+{
+   PgStat_MsgDeadlock msg;

I think that you meant PgStat_MsgChecksumFailure :)

+/* ----------
+ * pgstat_recv_checksum_failure() -
+ *
+ * Process a DEADLOCK message.
+ * ----------

same here

Otherwise LGTM.

Haha, damit, that's embarassing. You can probably guess where I copy/pasted
from :)

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#9Magnus Hagander
magnus@hagander.net
In reply to: Magnus Hagander (#8)
Re: Checksum errors in pg_stat_database

On Fri, Feb 22, 2019 at 3:23 PM Magnus Hagander <magnus@hagander.net> wrote:

On Fri, Feb 22, 2019 at 3:16 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net>
wrote:

PFA is a patch to do this.

+void
+pgstat_report_checksum_failure(void)
+{
+   PgStat_MsgDeadlock msg;

I think that you meant PgStat_MsgChecksumFailure :)

+/* ----------
+ * pgstat_recv_checksum_failure() -
+ *
+ * Process a DEADLOCK message.
+ * ----------

same here

Otherwise LGTM.

Haha, damit, that's embarassing. You can probably guess where I
copy/pasted from :)

And of course, then I forgot to attach the new file.

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

Attachments:

stat_checksums2.patchtext/x-patch; charset=US-ASCII; name=stat_checksums2.patchDownload+86-1
#10Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#9)
Re: Checksum errors in pg_stat_database

On Fri, Feb 22, 2019 at 3:25 PM Magnus Hagander <magnus@hagander.net> wrote:

On Fri, Feb 22, 2019 at 3:23 PM Magnus Hagander <magnus@hagander.net> wrote:

On Fri, Feb 22, 2019 at 3:16 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net> wrote:

PFA is a patch to do this.

+void
+pgstat_report_checksum_failure(void)
+{
+   PgStat_MsgDeadlock msg;

I think that you meant PgStat_MsgChecksumFailure :)

+/* ----------
+ * pgstat_recv_checksum_failure() -
+ *
+ * Process a DEADLOCK message.
+ * ----------

same here

Otherwise LGTM.

Haha, damit, that's embarassing. You can probably guess where I copy/pasted from :)

heh :)

And of course, then I forgot to attach the new file.

It all looks fine. One minor nitpicking issue I just noticed, there's
an extra space there:

+ dbentry->n_checksum_failures ++;

I'm marking it as ready for committer!

#11Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#6)
Re: Checksum errors in pg_stat_database

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net> wrote:

It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database. Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

#12Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#11)
Re: Checksum errors in pg_stat_database

On Mon, Mar 4, 2019 at 8:31 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net> wrote:

It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database. Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

So this seem to work just fine without adding much code. PFA v3 of
Magnus' patch including error reporting for BASE_BACKUP.

Attachments:

stat_checksums_v3.diffapplication/octet-stream; name=stat_checksums_v3.diffDownload+101-7
#13Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#11)
Re: Checksum errors in pg_stat_database

On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net>
wrote:

It tracks things that happen in the general backends. Possibly we should

also consider counting the errors actually found when running base backups?
OTOH, that part of the code doesn't really track things like databases (as
it operates just on the raw data directory underneath), so that
implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database. Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

That seems it was easy enough. PFA an updated patch that does this, and
also rebased so it doesn't conflict on oid.

(yes, while moving this from draft to publish after lunch, I realized that
you put a patch togerher for about the same. So let's merge it)

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

Attachments:

stat_checksums_2.patchtext/x-patch; charset=US-ASCII; name=stat_checksums_2.patchDownload+113-6
#14Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#13)
Re: Checksum errors in pg_stat_database

On Sat, Mar 9, 2019 at 12:35 AM Magnus Hagander <magnus@hagander.net> wrote:

On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net> wrote:

It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database. Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

That seems it was easy enough. PFA an updated patch that does this, and also rebased so it doesn't conflict on oid.

(yes, while moving this from draft to publish after lunch, I realized that you put a patch togerher for about the same. So let's merge it)

Thanks! Our implementations are quite similar, so I'm fine with most
of the changes :) I'm just not sure about having two distinct
functions for reporting failures, given that there's only one caller
for each. On the other hand it avoids to include miscadmin.h in
bufpage.c.

That's just a detail, so I'm marking it (again) as ready for committer!

#15Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#14)
Re: Checksum errors in pg_stat_database

On Sat, Mar 9, 2019 at 9:34 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sat, Mar 9, 2019 at 12:35 AM Magnus Hagander <magnus@hagander.net> wrote:

On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net> wrote:

It tracks things that happen in the general backends. Possibly we should also consider counting the errors actually found when running base backups? OTOH, that part of the code doesn't really track things like databases (as it operates just on the raw data directory underneath), so that implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database. Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

That seems it was easy enough. PFA an updated patch that does this, and also rebased so it doesn't conflict on oid.

Sorry, I have again new comments after a little bit more thinking.
I'm wondering if we can do something about shared objects while we're
at it. They don't belong to any database, so it's a little bit
orthogonal to this proposal, but it seems quite important to track
error on those too!

What about adding a new field in PgStat_GlobalStats for that? We can
use the same lastDir to easily detect such objects and slightly adapt
sendFile again, which seems quite straightforward.

#16Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#14)
Re: Checksum errors in pg_stat_database

On Sat, Mar 9, 2019 at 12:33 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sat, Mar 9, 2019 at 12:35 AM Magnus Hagander <magnus@hagander.net>
wrote:

On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <rjuju123@gmail.com>

wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net>

wrote:

It tracks things that happen in the general backends. Possibly we

should also consider counting the errors actually found when running base
backups? OTOH, that part of the code doesn't really track things like
databases (as it operates just on the raw data directory underneath), so
that implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database. Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

That seems it was easy enough. PFA an updated patch that does this, and

also rebased so it doesn't conflict on oid.

(yes, while moving this from draft to publish after lunch, I realized

that you put a patch togerher for about the same. So let's merge it)

Thanks! Our implementations are quite similar, so I'm fine with most
of the changes :) I'm just not sure about having two distinct
functions for reporting failures, given that there's only one caller
for each. On the other hand it avoids to include miscadmin.h in
bufpage.c.

Yeah, and it brings "cosistence" to at least the calling point(s) within
regular backends.

That's just a detail, so I'm marking it (again) as ready for committer!

Thanks, and pushed :)

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#17Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#15)
Re: Checksum errors in pg_stat_database

On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sat, Mar 9, 2019 at 9:34 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sat, Mar 9, 2019 at 12:35 AM Magnus Hagander <magnus@hagander.net>

wrote:

On Mon, Mar 4, 2019 at 11:31 AM Julien Rouhaud <rjuju123@gmail.com>

wrote:

On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander <magnus@hagander.net>

wrote:

It tracks things that happen in the general backends. Possibly we

should also consider counting the errors actually found when running base
backups? OTOH, that part of the code doesn't really track things like
databases (as it operates just on the raw data directory underneath), so
that implementation would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on raw directory, I see that sendDir()
already setup a isDbDir var, and if this is true lastDir should
contain the oid of the underlying database. Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?

That seems it was easy enough. PFA an updated patch that does this,

and also rebased so it doesn't conflict on oid.

Sorry, I have again new comments after a little bit more thinking.
I'm wondering if we can do something about shared objects while we're
at it. They don't belong to any database, so it's a little bit
orthogonal to this proposal, but it seems quite important to track
error on those too!

What about adding a new field in PgStat_GlobalStats for that? We can
use the same lastDir to easily detect such objects and slightly adapt
sendFile again, which seems quite straightforward.

Ah, didn't spot that one until after I pushed :/ Sorry about that.

Hmm. That's an interesting thought. And then add a column to
pg_stat_bgwriter, I assume? (Which is an ever increasingly bad name for the
view, but that's unrelated to this)

Question is then what number that should show -- only the checksum counter
in non-database-fields, or the total number across the cluster?

--
Magnus Hagander
Me: https://www.hagander.net/ <http://www.hagander.net/&gt;
Work: https://www.redpill-linpro.com/ <http://www.redpill-linpro.com/&gt;

#18Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#16)
Re: Checksum errors in pg_stat_database

On Sat, Mar 9, 2019 at 7:48 PM Magnus Hagander <magnus@hagander.net> wrote:

On Sat, Mar 9, 2019 at 12:33 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

Thanks! Our implementations are quite similar, so I'm fine with most
of the changes :) I'm just not sure about having two distinct
functions for reporting failures, given that there's only one caller
for each. On the other hand it avoids to include miscadmin.h in
bufpage.c.

Yeah, and it brings "cosistence" to at least the calling point(s) within regular backends.

That's just a detail, so I'm marking it (again) as ready for committer!

Thanks, and pushed :)

Thanks :)

#19Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#17)
Re: Checksum errors in pg_stat_database

On Sat, Mar 9, 2019 at 7:50 PM Magnus Hagander <magnus@hagander.net> wrote:

On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

Sorry, I have again new comments after a little bit more thinking.
I'm wondering if we can do something about shared objects while we're
at it. They don't belong to any database, so it's a little bit
orthogonal to this proposal, but it seems quite important to track
error on those too!

What about adding a new field in PgStat_GlobalStats for that? We can
use the same lastDir to easily detect such objects and slightly adapt
sendFile again, which seems quite straightforward.

Ah, didn't spot that one until after I pushed :/ Sorry about that.

No problem, I should have thought about it sooner anyway.

Hmm. That's an interesting thought. And then add a column to pg_stat_bgwriter, I assume?

Yes, and a new entry for PgStat_Shared_Reset_Target I guess.

(Which is an ever increasingly bad name for the view, but that's
unrelated to this)

yeah :/

Question is then what number that should show -- only the checksum counter in non-database-fields, or the total number across the cluster?

I'd say only for non-database-fields errors, especially if we can
reset each counters separately. If necessary, we can add a new view
to give a global overview of checksum errors for DBA convenience.

#20Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#19)
Re: Checksum errors in pg_stat_database

On Sat, Mar 9, 2019 at 7:58 PM Julien Rouhaud <rjuju123@gmail.com> wrote:

On Sat, Mar 9, 2019 at 7:50 PM Magnus Hagander <magnus@hagander.net> wrote:

On Sat, Mar 9, 2019 at 10:41 AM Julien Rouhaud <rjuju123@gmail.com> wrote:

Sorry, I have again new comments after a little bit more thinking.
I'm wondering if we can do something about shared objects while we're
at it. They don't belong to any database, so it's a little bit
orthogonal to this proposal, but it seems quite important to track
error on those too!

What about adding a new field in PgStat_GlobalStats for that? We can
use the same lastDir to easily detect such objects and slightly adapt
sendFile again, which seems quite straightforward.

Question is then what number that should show -- only the checksum counter in non-database-fields, or the total number across the cluster?

I'd say only for non-database-fields errors, especially if we can
reset each counters separately. If necessary, we can add a new view
to give a global overview of checksum errors for DBA convenience.

So, after reading current implementation, I don't think that
PgStat_GlobalStats is the right place. It's already enough of a mess,
and clearly pg_stat_reset_shared('bgwriter') would not make any sense
if it did reset the shared relations checksum errors (though arguably
the fact that's it's resetting checkpointer stats right now is hardly
better), and handling a different target to reset part of
PgStat_GlobalStats counters would be an ugly kludge.

I'm considering adding a new PgStat_ChecksumStats for that purpose
instead, but I don't know if that's acceptable to do so in the last
commitfest. It seems worthwhile to add it eventually, since we'll
probably end up having more things to report to users related to
checksum. Online enabling of checksum could be the most immediate
potential target.

If that's acceptable, I was thinking this new stat could have those
fields with the first drop:

- number of non-db-related checksum checks done
- number of non-db-related checksum checks failed
- last stats reset

(and adding the number of checks for db-related blocks done with the
current checksum errors counter). Maybe also adding a
pg_checksum_stats view that would summarise all the counters in one
place.

It'll obviously add some traffic to the stats collector, but I'd hope
not too much, since BufferAlloc shouldn't be that frequent, and
BASE_BACKUP reports stats only once per file.

#21Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#20)
#22Julien Rouhaud
rjuju123@gmail.com
In reply to: Julien Rouhaud (#21)
#23Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#22)
#24Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#23)
#25Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#24)
#26Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#25)
#27Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#26)
#28Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#27)
#29Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#28)
#30Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#30)
#32Julien Rouhaud
rjuju123@gmail.com
In reply to: Michael Paquier (#31)
#33Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#32)
#34Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#33)
#35Michael Paquier
michael@paquier.xyz
In reply to: Julien Rouhaud (#34)
#36Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#35)
#37Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#36)
#38Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#37)
#39Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#38)
#40Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#39)
#41Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#40)
#42Magnus Hagander
magnus@hagander.net
In reply to: Julien Rouhaud (#41)
#43Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#42)
#44Robert Treat
xzilla@users.sourceforge.net
In reply to: Magnus Hagander (#42)
#45Magnus Hagander
magnus@hagander.net
In reply to: Robert Treat (#44)
#46Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#45)
#47Robert Treat
xzilla@users.sourceforge.net
In reply to: Julien Rouhaud (#46)
#48Magnus Hagander
magnus@hagander.net
In reply to: Robert Treat (#47)
#49Julien Rouhaud
rjuju123@gmail.com
In reply to: Magnus Hagander (#48)
#50Robert Treat
xzilla@users.sourceforge.net
In reply to: Julien Rouhaud (#49)
#51Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Magnus Hagander (#30)
#52Magnus Hagander
magnus@hagander.net
In reply to: Bertrand Drouvot (#51)
#53Michael Paquier
michael@paquier.xyz
In reply to: Magnus Hagander (#52)
#54Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#53)
#55Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#54)
#56Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#55)
#57Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#56)
#58Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#53)
#59Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#57)
#60Magnus Hagander
magnus@hagander.net
In reply to: Michael Paquier (#53)
#61Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#59)
#62Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#56)